Tool to scan files for common byte sequences

dila · #1 01-22-2018, 06:22

Yes, I made a prototype. But it turns out such a tool is of no use to anyone, so I will not continue to develop it.

I think the idea is quite simple, but it seems not many people understood it. Binary (or text) difference tools that compare a pair of files are not really the same thing at all.

The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.
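The prototype itself was never posted, so the following is only a minimal sketch of the idea as described, not dila's code. It assumes the files fit in memory and that "up to some desired length" means a fixed window size `k`; the function names `kgrams` and `common_sequences` are made up for illustration.

```python
from functools import reduce


def kgrams(data: bytes, k: int) -> set:
    """Return the set of all k-byte windows found in data."""
    return {data[i:i + k] for i in range(len(data) - k + 1)}


def common_sequences(blobs: list, k: int) -> set:
    """Return every k-byte sequence that occurs in all of the blobs."""
    if not blobs or k <= 0:
        return set()
    # Intersect the per-file window sets; works for any number of files.
    return reduce(set.intersection, (kgrams(b, k) for b in blobs))


if __name__ == "__main__":
    samples = [b"xxMAGICyy", b"MAGIC123", b"__MAGIC"]
    print(common_sequences(samples, 5))
```

Memory grows with file size times `k`, which is the kind of cost the replies below are worried about.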

chants · #2 01-22-2018, 07:09

The problem with this task, i.e. the common substrings problem, is that it has high complexity, so it requires a lot of difficult heuristic tricks to get it below O(n^2); otherwise it is too slow or uses too much memory. I have not seen any tools that do this. It would work with pictures, video or audio as well, to find matching image sections, video subclips, etc. So really, it would be quite useful. I am quite certain we are talking about an NP-hard problem; please see:

Quote:

https://en.wikipedia.org/wiki/Longest_common_substring_problem

Although this problem is less complex, you did not specify only the longest common substring but all common substrings. Yes the suffix tree and other tricks can solve this one faster.
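For the "longest" variant across several inputs, one simple approach (not a suffix tree, and not something anyone in the thread posted) is to binary-search the length: if a common substring of length L exists, one of every shorter length exists too. The sketch below assumes in-memory byte strings; `common_of_length` and `longest_common_substring` are hypothetical names. A generalized suffix tree would be asymptotically faster, but is much more code.

```python
def common_of_length(blobs, k):
    """Return some k-byte sequence present in every blob, or None."""
    common = None
    for b in blobs:
        grams = {b[i:i + k] for i in range(len(b) - k + 1)}
        common = grams if common is None else common & grams
        if not common:
            return None
    return next(iter(common))


def longest_common_substring(blobs):
    """Binary search on the length of the longest shared substring."""
    if not blobs:
        return b""
    lo, hi = 0, min(len(b) for b in blobs)
    best = b""
    while lo < hi:
        mid = (lo + hi + 1) // 2  # always >= 1 inside the loop
        witness = common_of_length(blobs, mid)
        if witness is not None:
            best, lo = witness, mid   # length mid works, try longer
        else:
            hi = mid - 1              # length mid fails, try shorter
    return best
```

If several substrings share the maximum length, this returns an arbitrary one of them.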

And there are proofs, I believe, that shortest common substring is NP-hard.
See for example:

Quote:

https://docs.lib.purdue.edu/cgi/viewcontent.cgi?httpsredir=1&article=2225&context=cstech

Stingered · #3 01-22-2018, 13:03

Quote:

Originally Posted by chants

The problem with this task, i.e. the common substrings problem, is that it has high complexity, so it requires a lot of difficult heuristic tricks to get it below O(n^2); otherwise it is too slow or uses too much memory. I have not seen any tools that do this. It would work with pictures, video or audio as well, to find matching image sections, video subclips, etc. So really, it would be quite useful. I am quite certain we are talking about an NP-hard problem; please see:

Although this problem is less complex, you did not specify only the longest common substring but all common substrings. Yes the suffix tree and other tricks can solve this one faster.

And there are proofs, I believe, that shortest common substring is NP-hard.
See for example:

I mean, yes, the same for grep (in general).

Stingered · #4 01-22-2018, 13:07

Quote:

Originally Posted by dila

Yes, I made a prototype. But it turns out such a tool is of no use to anyone, so I will not continue to develop it.

I think the idea is quite simple, but it seems not many people understood it. Binary (or text) difference tools that compare a pair of files are not really the same thing at all.

The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.

Actually, I think it could be useful, depending on what you are searching for. Obviously it is not a simple search, but a very specific one (over a large spectrum). That does not mean it isn't useful. I was intrigued by the thought.

chants · #5 01-22-2018, 14:08

Grep is just looking for regexes, so its complexity is that of regex pattern matching. Now you are asking about a very general and arbitrary common substring problem. They are not really the same issue at all.

This would be very useful, but it has a really problematic size vs speed tradeoff and would need some kind of limiting parameters, like you are getting at. The NP-hard issue can be sidestepped through heuristics and domain-specific approaches. Nonetheless, I doubt you will find such a tool for general cases.
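One way to picture the size vs speed tradeoff: instead of storing every window as a byte slice, store one rolling hash per window (Rabin-Karp style) and only slice when verifying a candidate. This is a sketch under assumptions, not a known tool; `rolling_hashes`, `common_by_hash`, and the `base` and `mod` values are all illustrative choices.

```python
def rolling_hashes(data, k, base=257, mod=(1 << 61) - 1):
    """Map the hash of each k-byte window to the first offset where it occurs."""
    if k <= 0 or len(data) < k:
        return {}
    h = 0
    for byte in data[:k]:
        h = (h * base + byte) % mod
    top = pow(base, k - 1, mod)  # weight of the byte that leaves the window
    seen = {h: 0}
    for i in range(1, len(data) - k + 1):
        h = ((h - data[i - 1] * top) * base + data[i + k - 1]) % mod
        seen.setdefault(h, i)
    return seen


def common_by_hash(blobs, k):
    """Return k-byte sequences present in all blobs, using hashes as a filter."""
    if not blobs:
        return set()
    tables = [rolling_hashes(b, k) for b in blobs]
    shared = set(tables[0]).intersection(*tables[1:])
    result = set()
    for h in shared:
        off = tables[0][h]
        seq = blobs[0][off:off + k]
        # Verify with a real search, since different windows can collide.
        # A collision inside the first blob could hide a true match.
        if all(seq in b for b in blobs[1:]):
            result.add(seq)
    return result
```

The hash table holds one integer per distinct window rather than `k` bytes, at the price of a verification pass and a small chance of missing a match on a collision.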

Stingered · #6 01-23-2018, 02:33

Quote:

Originally Posted by chants

Grep is just looking for regexes, so its complexity is that of regex pattern matching. Now you are asking about a very general and arbitrary common substring problem. They are not really the same issue at all.

This would be very useful, but it has a really problematic size vs speed tradeoff and would need some kind of limiting parameters, like you are getting at. The NP-hard issue can be sidestepped through heuristics and domain-specific approaches. Nonetheless, I doubt you will find such a tool for general cases.

Good explanation (both posts). There was an old (privately written back in the early 90's) tool that I used for function and string searches (waaay before this particular code base was being indexed). And even that tool could take all day to search through (the code base was very large). Not exactly the same example, but I see your point regarding the size vs speed you described.

I believe I read that the Metasploit framework included some heuristics for this kind of search, but I could find no specific tool.

I too agree that a tool like this would be very useful.

ontryit · #7 01-24-2018, 12:06

Quote:

Originally Posted by dila

Yes, I made a prototype. But it turns out such a tool is of no use to anyone, so I will not continue to develop it.

I think the idea is quite simple, but it seems not many people understood it. Binary (or text) difference tools that compare a pair of files are not really the same thing at all.

The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.

You could share your prototype tool with its source code here; maybe it will be useful to someone. Thanks.

dosprog · #8 02-14-2018, 03:29

Archivers solve a similar problem.
What if you tried to take the algorithm from some open-source archiver?

dila · #9 02-16-2018, 16:46

Yes, I figured the problem was similar to building a dictionary of common sequences, which you'd then substitute with shorter codes corresponding to the dictionary entries.
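A rough illustration of that dictionary idea, as a sketch only: it assumes fixed-length sequences of `k` bytes, counts each sequence at most once per file, and hands out small integer codes to the most widely shared ones. `build_dictionary` and its parameters are hypothetical, not taken from any archiver.

```python
from collections import Counter


def build_dictionary(blobs, k, size):
    """Assign codes 0..size-1 to the k-byte sequences shared by the most files."""
    counts = Counter()
    for b in blobs:
        # Use a set so a sequence counts once per file, not once per occurrence.
        counts.update({b[i:i + k] for i in range(len(b) - k + 1)})
    # Most widely shared first; ties broken by byte order for stable output.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Keep only sequences that appear in more than one file.
    return {seq: code for code, (seq, n) in enumerate(ranked[:size]) if n > 1}


if __name__ == "__main__":
    print(build_dictionary([b"abcd", b"bcde", b"xbcx"], k=2, size=2))
```

A real compressor would also handle variable-length matches and overlapping entries, which this sketch ignores.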

As we discussed, it doesn't sound like a perfect solution is possible, but some heuristics would work. You mentioned compression which would do exactly this kind of operation - pick your favourite algorithm.

(I won't make the code for my tool available, since it was a rushed prototype and I don't think there is any chance of anyone getting all the necessary libs to compile it.)

