Exetools


pp2 03-02-2010 05:56

[Solution] How to check whether data is compressed or ciphered
 
Some time ago I asked myself whether compressed data can be distinguished from enciphered data. This is useful in data and file analysis, among other cases. Maybe you will find this information useful too.

First, how can we answer, in theory, whether data is packed or encrypted? Strictly speaking, packed data blocks do not use all possible code combinations (otherwise the data could not be unpacked), but the best compressors leave fewer than 0.01% of all possible code combinations unused. So only statistics over large amounts of data will help us. Example:
compressed data
entropy for 8-bit elements: 0.999833729
ciphered data
entropy for 8-bit elements: 0.999999867

As you can see, the difference in entropy is almost zero, so entropy is not a reliable criterion for distinguishing blocks of data. For some blocks, the entropy of compressed data will be almost the same as that of ciphered data.
A better approach is to calculate the chi-squared statistic for the block of data.
Compare the results for the same blocks of data:
compressed data
chi-squared 0.001830034
ciphered data
chi-squared 0.000001432

Yes, here the values differ by a factor of about 1300! But why? Because data enciphered with a good cipher will contain all possible code combinations, while compressed data will not. The chi-squared calculation exposes these unused codes, which produces the large difference.
OK, how do we calculate 8-bit entropy and chi-squared?
Assume the elements array holds the count of each byte value in the data block (for 8-bit entropy), i.e. element 0 holds the number of 0x00 bytes in the block, element 1 the number of 0x01 bytes, and so on, and quantity is the total number of elements counted. Here is pseudocode for calculating the entropy value:
Code:

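#include <math.h>

/* Assumed to be defined elsewhere (the types are a guess):
   elements[i] = number of times the bits-wide value i occurs in the block,
   quantity    = total number of values counted. */
extern unsigned long elements[];
extern unsigned long quantity;

/* Shannon entropy of the block, normalised to the range [0, 1]. */
long double GetEntropy(unsigned int bits)
{
    unsigned long i;
    long double result = 0.0L, p;
    for (i = 0; i < (1UL << bits); i++)
    {
        if (elements[i] == 0)
            continue;                  /* unused values contribute nothing */
        p = (long double)elements[i] / quantity;
        result += p * log2l(p);        /* long double log2, keeps full precision */
    }
    return -result / (long double)bits;
}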
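
The post describes the elements array but does not show how it is filled. A minimal sketch for the 8-bit case might look like the following; CountBytes and its parameters are hypothetical names chosen for illustration and are not part of the original code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the original post): fill counts[0..255]
   with the number of occurrences of each byte value in data[0..len-1]. */
void CountBytes(const unsigned char *data, size_t len, unsigned long counts[256])
{
    size_t i;
    assert(data != NULL || len == 0);
    memset(counts, 0, 256 * sizeof counts[0]);
    for (i = 0; i < len; i++)
        counts[data[i]]++;
}
```

With this layout, counts plays the role of elements and len the role of quantity.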

And now, using the same elements array, we can calculate the chi-squared statistic:
Code:

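/* Same histogram as described above, assumed to be defined elsewhere. */
extern unsigned long elements[];
extern unsigned long quantity;

/* Chi-squared statistic against a uniform distribution, divided by the
   number of values counted. */
long double GetChiSquared(unsigned int bits)
{
    unsigned long i, n = 1UL << bits;
    long double result = 0.0L, diff;
    /* Expected count per value if all values were equally likely. */
    long double expected = (long double)quantity / (long double)n;
    for (i = 0; i < n; i++)
    {
        diff = (long double)elements[i] - expected;
        result += diff * diff / expected;
    }
    return result / quantity;
}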

Drawbacks: we cannot prove with certainty that a file is compressed or ciphered if the data size is too small. As the data size grows, the evidence gets stronger. My own investigations show that random blocks larger than 50 MB (compressed by modern archivers or ciphered with popular block ciphers) can be distinguished with 99% confidence (yes, you may need some additional magic, which is left as a home exercise for the most curious; hint: do not use only 8-bit values).
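
One possible reading of the hint is to count wider elements, for example 16-bit values, and feed that histogram to the same entropy and chi-squared routines with bits = 16. The sketch below is only an assumption about what is meant: CountWords16 is a hypothetical name, and pairing consecutive, non-overlapping bytes with the first byte as the high half is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count 16-bit elements formed from consecutive,
   non-overlapping byte pairs (first byte is the high half). A trailing
   odd byte is ignored. counts must have room for 65536 entries.
   Returns the number of 16-bit elements counted. */
size_t CountWords16(const unsigned char *data, size_t len, unsigned long *counts)
{
    size_t i, n = len / 2;
    assert(counts != NULL);
    memset(counts, 0, 65536 * sizeof counts[0]);
    for (i = 0; i < n; i++) {
        unsigned int v = ((unsigned int)data[2 * i] << 8) | data[2 * i + 1];
        counts[v]++;
    }
    return n;
}
```

The return value would serve as quantity. With 65536 bins instead of 256, each bin receives far fewer samples for the same input, so this variant probably needs considerably more data before the statistic becomes meaningful.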

Happy coding :)

redbull 03-11-2010 15:24

Here is another article about entropy. This guy's approach is different:

http://gynvael.coldwind.pl/?id=162

Perhaps you guys could collaborate to improve his tool so that it can tell compressed data from encrypted data.

