Exetools

Exetools (https://forum.exetools.com/index.php)
-   General Discussion (https://forum.exetools.com/forumdisplay.php?f=2)
-   -   Tool to scan files for common byte sequences (https://forum.exetools.com/showthread.php?t=18615)

dila 01-22-2018 01:24

Tool to scan files for common byte sequences
 
I am looking for a tool that loads a set of files and will find common byte sequences between them. Does such a tool exist?

For example, if each file contains the sequence 0x01 0x02 0x03 0x04 0x05, then the tool will find this common string and print it.

Stingered 01-22-2018 02:30

Quote:

Originally Posted by dila (Post 111980)
I am looking for a tool that loads a set of files and will find common byte sequences between them. Does such a tool exist?

For example, if each file contains the sequence 0x01 0x02 0x03 0x04 0x05, then the tool will find this common string and print it.

A quick search came across these:

http://gnuwin32.sourceforge.net/packages/gsar.htm

https://wingrep.codeplex.com/

https://www.fileseek.ca/

Or you can just use Notepad++ and use the "Find in Files" menu option.

dila 01-22-2018 03:46

This is for searching for a given string.

I paste a screenshot of the prototype here: https://i.imgur.com/8IxxjE6.png. It shows that the string 0x00 0x04 0x00 0xE8 0x02 0x00 is common to 8 files out of the sample set.

And here it is, viewed in a hex editor: https://i.imgur.com/I06WEu7.png.

Stingered 01-22-2018 04:16

Quote:

Originally Posted by dila (Post 111982)
This is for searching for a given string.

I paste a screenshot of the prototype here: https://i.imgur.com/8IxxjE6.png. It shows that the string 0x00 0x04 0x00 0xE8 0x02 0x00 is common to 8 files out of the sample set.

And here it is, viewed in a hex editor: https://i.imgur.com/I06WEu7.png.

Hmmm... So you're looking for something like the OD command in unix (except for the addition of multiple file search)?

Would something like this work?

Code:

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <stdlib.h>

#define BUF_SIZE 65536

int getnibble(char c)
{
        c = toupper(c);
        return (c > '9' ? c - 'A' + 10 : c - '0');
}

void main(int argc, char** argv)
{
        if (argc != 3)
        {
                printf(
                        "Usage:\n"
                        "%s <filename> <hex>\n",
                        argv[0]
                );
                return;
        }

        char* filename = argv[1];
        char* hexchars = argv[2];
       
        int len = strlen(hexchars);
        if (len % 2)
        {
                printf("Error: Odd number of hex chars\n");
                return;
        }
        len /= 2; // len = number of bytes in pattern
       
        // parse hexchars to real bytes
        char* pattern = (char*)malloc(len);
        char* p = pattern;
        while (*hexchars)
        {
                int h = getnibble(*hexchars++);
                int l = getnibble(*hexchars++);
               
                if (h > 16 || l > 16)
                {
                        printf("Error: invalid hex\n");
                        free(pattern);
                        return;
                }
               
                *p++ = (h << 4) + l;
        }
       
        // Open the file
        FILE* f = fopen(filename, "rb");
        if (f)
        {
                char* buf = (char*)malloc(BUF_SIZE);
               
                // we want to read in less than the whole buffer each time to avoid
                // missing the needle when it's halfway across a boundary
                int readsize = BUF_SIZE - len;
               
                int amtread;
                int offset = 0;
                char* p; // search result
                int bytessearched; // how many bytes we've already searched in this block
               
                // read in the first block in full
                amtread = fread(buf, 1, BUF_SIZE, f);
                while (amtread != 0)
                {
                        // search for the start byte
                        bytessearched = 0;
                        while ((p = (char*)memchr(buf + bytessearched, *pattern, amtread - len - bytessearched)) != NULL)
                        {
                                if (memcmp(p, pattern, len) == 0)
                                {
                                        printf("Found at %x\n", offset + p - buf);
                                }
                                bytessearched = p - buf + 1;
                        }
                       
                        // copy the tail of the buffer over the head
                        memmove(buf, buf + BUF_SIZE - len, len);
                       
                        // read in the next block
                        amtread = fread(buf + len, 1, BUF_SIZE - len, f);
                        offset += BUF_SIZE - len;
                }
               
                free(buf);
        }
        fclose(f);
       
        free(pattern);
}

And then just create a batch file for the multiple file search?

Stingered 01-22-2018 04:32

Quote:

Originally Posted by dila (Post 111982)
This is for searching for a given string.

I paste a screenshot of the prototype here: https://i.imgur.com/8IxxjE6.png. It shows that the string 0x00 0x04 0x00 0xE8 0x02 0x00 is common to 8 files out of the sample set.

And here it is, viewed in a hex editor: https://i.imgur.com/I06WEu7.png.

Ah... so you've already written something, sry (I see in the first image what you mean). Can't find anything that does this. :(

Stingered 01-22-2018 05:15

Last ditch effort:

http://www.vxsearch.com/search_files_by_binary_patterns.html

Windows app, 30-day trial download:

http://www.vxsearch.com/downloads.html

dila 01-22-2018 06:22

Yes, i made a prototype. But it turns out such a tool is of no use to anyone, so i will not continue to develop it.

I think the idea is quite simple, but it seems not many people understood. Binary (or text) difference tools that compare a pair of files is not really the same at all.

The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.

chants 01-22-2018 07:09

The problem with this task - e.g. common substrings problem, is that is a high complexity so that it requires a lot of difficult heuristic tricks to get it below O(n^2) otherwise it is too slow or uses too much memory. I have not seen any tools to do this. It would work with pictures or videos or audio as well - to find matching image sections, video subclips, etc. But really, it would be quite useful. I am quite certain we are talking an NP-hard problem please see:
Quote:

https://en.wikipedia.org/wiki/Longest_common_substring_problem
Although this problem is less complex, you did not specify only the longest common substring but all common substrings. Yes the suffix tree and other tricks can solve this one faster.

And there are proofs I believe that shortest common substring is NP-hard.
See for example
Quote:

https://docs.lib.purdue.edu/cgi/viewcontent.cgi?httpsredir=1&article=2225&context=cstech

Stingered 01-22-2018 13:03

Quote:

Originally Posted by chants (Post 111987)
The problem with this task - e.g. common substrings problem, is that is a high complexity so that it requires a lot of difficult heuristic tricks to get it below O(n^2) otherwise it is too slow or uses too much memory. I have not seen any tools to do this. It would work with pictures or videos or audio as well - to find matching image sections, video subclips, etc. But really, it would be quite useful. I am quite certain we are talking an NP-hard problem please see:

Although this problem is less complex, you did not specify only the longest common substring but all common substrings. Yes the suffix tree and other tricks can solve this one faster.

And there are proofs I believe that shortest common substring is NP-hard.
See for example

I mean, yes, the same for grep (in general).

Stingered 01-22-2018 13:07

Quote:

Originally Posted by dila (Post 111986)
Yes, i made a prototype. But it turns out such a tool is of no use to anyone, so i will not continue to develop it.

I think the idea is quite simple, but it seems not many people understood. Binary (or text) difference tools that compare a pair of files is not really the same at all.

The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.

Actually, I think it could be depending one what you are searching for. obviously it is not a simple search, but very specific (over a large spectrum). Does not mean it isn't useful. I was intrigued by the thought.

chants 01-22-2018 14:08

Grep is just looking for regex's so its complexity is that of pattern matching of regex's. Now you are asking a very general and arbitrary common substring problem. They are not the same issue really at all.

This would be very useful, but it has a really problematic size vs speed tradeoff and would need some kind of limiting parameters like you are getting at. The NP-hard issue can be side stepped through heuristics and domain specific approach. Nonetheless, I doubt you will find such a tool for general cases.

Stingered 01-23-2018 02:33

Quote:

Originally Posted by chants (Post 111993)
Grep is just looking for regex's so its complexity is that of pattern matching of regex's. Now you are asking a very general and arbitrary common substring problem. They are not the same issue really at all.

This would be very useful, but it has a really problematic size vs speed tradeoff and would need some kind of limiting parameters like you are getting at. The NP-hard issue can be side stepped through heuristics and domain specific approach. Nonetheless, I doubt you will find such a tool for general cases.

Good explanation (both posts). There was an old (privately written back in the early 90's) tool that I used for function and string searches (waaay before this particular code base was being indexed). And even that tool could take all day to search through (the code base was very large). Not exactly the same example, but I see your point regarding the size vs speed you described.

I believe I read that the metasploit framework included some heuristics for this kind of search, but I could find no specific tool.

I too agree that a tool like this would be very useful.

ontryit 01-24-2018 12:06

Quote:

Originally Posted by dila (Post 111986)
Yes, i made a prototype. But it turns out such a tool is of no use to anyone, so i will not continue to develop it.

I think the idea is quite simple, but it seems not many people understood. Binary (or text) difference tools that compare a pair of files is not really the same at all.

The prototype I created can compare any number of files and find a string that is present in all of them, up to some desired length.

You can share you prototype tool with its source code here, may be it will be useful for someone. Thx

dosprog 02-14-2018 03:29

Tool to scan files for common byte sequences
 
Similar problem solve the archivers.
That if try to take out the algorithm from some open-source archiver?

dila 02-16-2018 16:46

Yes, I figured the problem was similar to building a dictionary of common sequences, which you'd then substitute with shorter codes corresponding to the dictionary entries.

As we discussed, it doesn't sound like a perfect solution is possible, but some heuristics would work. You mentioned compression which would do exactly this kind of operation - pick your favourite algorithm.

(I won't make the code available for my tool, since it was a rushed prototype and I don't think there any chance of anyone getting all the necessary libs to compile it.)

dosprog 02-18-2018 19:11

IMHO,
Such tool, if maked, It will not be very useful because of too many matches in files.
Redundancy of results..
However, the very formulation of the problem is interesting.

See old tool for similar purpose, but it searches only for 'string' duplicates inside one specially prepared [text or bin] file:

-> Dup_Str <-

- This tool does not prints line numbers of duplicated strings, - it prints only its duplicated strings.
It maked Q&D quite a long time ago, but used sometimes without changes.
Originally this tool was made to find duplicates of lexemes in the dictionary..



All times are GMT +8. The time now is 16:43.

Powered by vBulletin® Version 3.8.8
Copyright ©2000 - 2026, vBulletin Solutions, Inc.
Always Your Best Friend: Aaron, JMI, ahmadmansoor, ZeNiX