Exetools - Tool to scan files for common byte sequences

Exetools (https://forum.exetools.com/index.php)

- General Discussion (https://forum.exetools.com/forumdisplay.php?f=2)

- - Tool to scan files for common byte sequences (https://forum.exetools.com/showthread.php?t=18615)

Tool to scan files for common byte sequences

I am looking for a tool that loads a set of files and will find common byte sequences between them. Does such a tool exist?

For example, if each file contains the sequence 0x01 0x02 0x03 0x04 0x05, then the tool will find this common string and print it.

Quote:

Originally Posted by dila (Post 111980)

A quick search came across these:

http://gnuwin32.sourceforge.net/packages/gsar.htm

https://wingrep.codeplex.com/

https://www.fileseek.ca/

Or you can just use Notepad++ and use the "Find in Files" menu option.

This is for searching for a given string.

I paste a screenshot of the prototype here: https://i.imgur.com/8IxxjE6.png. It shows that the string 0x00 0x04 0x00 0xE8 0x02 0x00 is common to 8 files out of the sample set.

And here it is, viewed in a hex editor: https://i.imgur.com/I06WEu7.png.

Quote:

Originally Posted by dila (Post 111982)

Hmmm... So you're looking for something like the OD command in unix (except for the addition of multiple file search)?

Would something like this work?

Code:

#include <stdio.h>

#include <string.h>

#include <ctype.h>

#include <stdlib.h>



#define BUF_SIZE 65536



int getnibble(char c)

{

        c = toupper(c);

        return (c > '9' ? c - 'A' + 10 : c - '0');

}



void main(int argc, char** argv)

{

        if (argc != 3)

        {

                printf(

                        "Usage:\n"

                        "%s <filename> <hex>\n",

                        argv[0]

                );

                return;

        }



        char* filename = argv[1];

        char* hexchars = argv[2];

        

        int len = strlen(hexchars);

        if (len % 2)

        {

                printf("Error: Odd number of hex chars\n");

                return;

        }

        len /= 2; // len = number of bytes in pattern

        

        // parse hexchars to real bytes

        char* pattern = (char*)malloc(len);

        char* p = pattern;

        while (*hexchars)

        {

                int h = getnibble(*hexchars++);

                int l = getnibble(*hexchars++);

                

                if (h > 16 || l > 16)

                {

                        printf("Error: invalid hex\n");

                        free(pattern);

                        return;

                }

                

                *p++ = (h << 4) + l;

        }

        

        // Open the file

        FILE* f = fopen(filename, "rb");

        if (f)

        {

                char* buf = (char*)malloc(BUF_SIZE);

                

                // we want to read in less than the whole buffer each time to avoid 

                // missing the needle when it's halfway across a boundary

                int readsize = BUF_SIZE - len; 

                

                int amtread;

                int offset = 0;

                char* p; // search result

                int bytessearched; // how many bytes we've already searched in this block

                

                // read in the first block in full

                amtread = fread(buf, 1, BUF_SIZE, f);

                while (amtread != 0)

                {

                        // search for the start byte

                        bytessearched = 0;

                        while ((p = (char*)memchr(buf + bytessearched, *pattern, amtread - len - bytessearched)) != NULL)

                        {

                                if (memcmp(p, pattern, len) == 0)

                                {

                                        printf("Found at %x\n", offset + p - buf);

                                }

                                bytessearched = p - buf + 1;

                        }

                        

                        // copy the tail of the buffer over the head

                        memmove(buf, buf + BUF_SIZE - len, len);

                        

                        // read in the next block

                        amtread = fread(buf + len, 1, BUF_SIZE - len, f);

                        offset += BUF_SIZE - len;

                }

                

                free(buf);

        }

        fclose(f);

        

        free(pattern);

}

And then just create a batch file for the multiple file search?

Quote:

Originally Posted by dila (Post 111982)

Ah... so you've already written something, sry (I see in the first image what you mean). Can't find anything that does this. :(

Last ditch effort:

http://www.vxsearch.com/search_files_by_binary_patterns.html

Windows app, 30-day trial download:

http://www.vxsearch.com/downloads.html

Yes, i made a prototype. But it turns out such a tool is of no use to anyone, so i will not continue to develop it.

I think the idea is quite simple, but it seems not many people understood. Binary (or text) difference tools that compare a pair of files is not really the same at all.

The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.

Quote:

https://en.wikipedia.org/wiki/Longest_common_substring_problem

Although this problem is less complex, you did not specify only the longest common substring but all common substrings. Yes the suffix tree and other tricks can solve this one faster.

And there are proofs I believe that shortest common substring is NP-hard.
See for example

Quote:

https://docs.lib.purdue.edu/cgi/viewcontent.cgi?httpsredir=1&article=2225&context=cstech

Quote:

Originally Posted by chants (Post 111987)

The problem with this task - e.g. common substrings problem, is that is a high complexity so that it requires a lot of difficult heuristic tricks to get it below O(n^2) otherwise it is too slow or uses too much memory. I have not seen any tools to do this. It would work with pictures or videos or audio as well - to find matching image sections, video subclips, etc. But really, it would be quite useful. I am quite certain we are talking an NP-hard problem please see:

Although this problem is less complex, you did not specify only the longest common substring but all common substrings. Yes the suffix tree and other tricks can solve this one faster.

And there are proofs I believe that shortest common substring is NP-hard.
See for example

I mean, yes, the same for grep (in general).

Quote:

Originally Posted by dila (Post 111986)

Actually, I think it could be depending one what you are searching for. obviously it is not a simple search, but very specific (over a large spectrum). Does not mean it isn't useful. I was intrigued by the thought.

Grep is just looking for regex's so its complexity is that of pattern matching of regex's. Now you are asking a very general and arbitrary common substring problem. They are not the same issue really at all.

This would be very useful, but it has a really problematic size vs speed tradeoff and would need some kind of limiting parameters like you are getting at. The NP-hard issue can be side stepped through heuristics and domain specific approach. Nonetheless, I doubt you will find such a tool for general cases.

Quote:

Originally Posted by chants (Post 111993)

Good explanation (both posts). There was an old (privately written back in the early 90's) tool that I used for function and string searches (waaay before this particular code base was being indexed). And even that tool could take all day to search through (the code base was very large). Not exactly the same example, but I see your point regarding the size vs speed you described.

I believe I read that the metasploit framework included some heuristics for this kind of search, but I could find no specific tool.

I too agree that a tool like this would be very useful.

Quote:

Originally Posted by dila (Post 111986)

You can share you prototype tool with its source code here, may be it will be useful for someone. Thx

Tool to scan files for common byte sequences

Similar problem solve the archivers.
That if try to take out the algorithm from some open-source archiver?

Yes, I figured the problem was similar to building a dictionary of common sequences, which you'd then substitute with shorter codes corresponding to the dictionary entries.

As we discussed, it doesn't sound like a perfect solution is possible, but some heuristics would work. You mentioned compression which would do exactly this kind of operation - pick your favourite algorithm.

(I won't make the code available for my tool, since it was a rushed prototype and I don't think there any chance of anyone getting all the necessary libs to compile it.)