Exetools


Rhodium 07-09-2004 06:34

decompiling back to C++?
 
Say you spent billions of dollars and hired all the best programmers in the world. Would they be able to come up with a program that decompiles applications back to their true C++ code?

Say you hired the world's 100 best programmers and offered them 10 million dollars each if they did it.

Hypothetical.

JMI 07-09-2004 07:08

And how long did you give them to accomplish the task? :D

Regards,

Rhodium 07-09-2004 07:57

Give them a year.

fantast_xue 07-09-2004 20:58

They would fail. :eek:
But I think maybe scientists could do this job, with ten or more years. :D

Sarge 07-09-2004 21:43

Yeh, but what if those programmers were hired away from MS? What if they were the same programmers that wrote the C++ compiler in the first place? Maybe that would give them an edge; maybe they could do it in a year or less?

Sarge

Lunar_Dust 07-09-2004 22:04

I don't know, there are numerous optimizations which can result in totally throwing away the original source. Of course, this would also have the effect of optimizing the source, wouldn't it? But it would probably be much harder to read, and you wouldn't have comments anyway.

The problem is that converting back to C++ code doesn't really help you all that much, because you won't have comments, and you won't have variable names which make sense.

You will have constructs, and code flow.

But those you can still get from ASM disassemblers anyway (like IDA). Knowing the original high-level intent of the programmer (why something was done a certain way, variable names, how variables connect to each other) well enough to reconstruct the source is pretty much impossible. C++ really isn't a round-trip language (unlike .NET languages).

-Lunar

SHaG 07-10-2004 04:17

Check:
hxxp://boomerang.sourceforge.net

tAz 07-10-2004 08:44

true c++ code?
i wouldn't say impossible, but improbable.

decompilers will have to deal with the loss of code due to optimizations, and of course, the user-defined tokens (e.g. variables, function names).

reusable code should be the target of decompilers, and until someone creates a program to analyze algorithms, and properly name all of the variables and functions, not to mention profiling the programmer on his/her preferences in the use of variables, we still have a long way to go before seeing the original code from a compiled sample.

_Servil_ 07-10-2004 15:17

It's impossible. $10,000,000 is quite little.

JMI 07-10-2004 16:44

Ah, but he said $10 Million to EACH of the world's 100 best programmers. ;)

Regards,

_Servil_ 07-10-2004 17:40

nevertheless :D

Sarge 07-11-2004 21:53

I like this comment:
>reusable code should be the target of decompilers<

but I don't necessarily see the need for EXACT/original source code re-creation, especially where variables are concerned. As long as the decompiler proggie keeps them straight, I'd think it's ok for the decompiler to spit out a variable named "Var1", even though the source code was "MyVar", as long as "Var1" was consistently named whenever that specific variable was actually used in the target proggie.
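
A minimal sketch of the consistent-naming idea above, assuming the decompiler identifies each variable by its storage location (here a stack offset). The class `PlaceholderNamer` and the `Var<N>` scheme are hypothetical and not taken from any real tool:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: hands out "Var1", "Var2", ... and always returns
// the same name for the same storage location (here: a stack offset).
class PlaceholderNamer {
public:
    std::string nameFor(int stackOffset) {
        auto it = names_.find(stackOffset);
        if (it != names_.end()) return it->second;  // seen before: reuse the name
        std::string fresh = "Var" + std::to_string(names_.size() + 1);
        names_.emplace(stackOffset, fresh);
        return fresh;
    }

private:
    std::map<int, std::string> names_;  // storage location -> placeholder name
};
```

Asking for the same offset twice yields the same placeholder, which is all that is needed for the recovered code to stay readable even though "MyVar" itself is gone.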

Do we want this decompiler to give us, for example, a structure definition too, or just give us the operations on the structure's elements, and let the compiler (when operating on our recovered code) generate error messages telling us what (syntactically) is wrong that we poor humans would have to clean up (in this case, by defining that structure ourselves)? This would certainly result in usable, runnable code, but obviously not the EXACT/original source code. How close to the EXACT/original source code are we talking?

Further, if you actually reproduced runnable code, but it was only 80% (or 70%? or maybe only 50%) of the EXACT/original code, and therefore needed some additional user input, would people buy it?

Sarge

hmora 07-12-2004 09:21

Disassembler
 
Sorry, I don't know much about this topic, but I have used IDA a few times. My question is: is it always possible to disassemble a program???
What are those exe protectors for?

Thanks in advance.

santa_kewl 07-12-2004 09:44

[what are those exe protectors for?]

To protect the exe from being disassembled

To protect the exe from being debugged

Satyric0n 07-12-2004 09:55

Quote:

Originally Posted by hmora
Sorry, I don't know much about this topic, but I have used IDA a few times. My question is: is it always possible to disassemble a program???
What are those exe protectors for?

Thanks in advance.

Yes, it is always possible to disassemble a program, but if the program is protected/packed, the code you see disassembled is the unpacking/protector code; the actual program gets unpacked at runtime. So, you either need to unpack the program to see its code in the disassembler, or you just look at the code at runtime using a debugger.

Regards

WARM3CH 07-14-2004 18:34

With C you can achieve only partial decompilation, due to the complexities caused by the compiler's optimizations. The source code can have many statements that are simply optimized away when it is compiled.
With C++, well, sorry, it is impossible. How on earth can you get back to the source code of an STL vector or a Boost smart pointer by looking at the machine code? They are already lost in the first compilation phases and don't even make it to the backend....

jsteed 07-15-2004 04:56

Actually, I can remember true decompilers for FORTRAN created during the 70s and early 80s. Grad students would build such things during the wee hours. Each different machine had to have its very own handcrafted version. The binaries for a DEC and a CDC were very different. As I recall, aside from the lost variable names (no one commented their FORTRAN code), these programs did quite well in reproducing the original code. Of course, by comparison, FORTRAN is a relatively simple language: no classes, simple data structures, etc.

I would be surprised if such custom-made decompilers don't exist for C++. I can't imagine that some kid from M$ with plenty of time at night hasn't coded one up for VC.

cheers, jsteed

tricky 07-16-2004 09:39

I think it will depend on how big and how complex the program is. If the program is very big like M$ office, I would say it's impossible.

Shub-Nigurrath 07-16-2004 23:25

To be exact, C++ gets decompiled back to C: all the object-oriented things, such as the class hierarchy and virtual functions, are effectively implemented with C structures, vtables and so on.

I read a study which described how to do object-oriented programming in plain C. It's not a hypothesis but a necessity on some platforms where no C++ compiler is available..
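
As a rough illustration of the point above, here is a hand-written "vtable" using only structs and function pointers. It is a sketch of the general technique, not the layout of any particular compiler; all names (`ShapeVTable`, `Rect`, `Square`, `callArea`) are made up for the example:

```cpp
#include <cassert>

// A hand-written "vtable": one function pointer per virtual function.
struct ShapeVTable {
    int (*area)(const void* self);
};

// Every "object" starts with a pointer to its class's table, which is
// roughly where a compiler would place the hidden vptr.
struct Rect {
    const ShapeVTable* vptr;
    int w, h;
};

struct Square {
    const ShapeVTable* vptr;
    int side;
};

int rectArea(const void* self) {
    const Rect* r = static_cast<const Rect*>(self);
    return r->w * r->h;
}

int squareArea(const void* self) {
    const Square* s = static_cast<const Square*>(self);
    return s->side * s->side;
}

const ShapeVTable kRectVTable = { rectArea };
const ShapeVTable kSquareVTable = { squareArea };

// A "virtual call": load the vptr stored at the start of the object,
// then call through the table without knowing the concrete type.
int callArea(const void* obj) {
    const ShapeVTable* vt = *static_cast<const ShapeVTable* const*>(obj);
    return vt->area(obj);
}
```

A `Rect` initialised as `{ &kRectVTable, 3, 4 }` and passed to `callArea` dispatches to `rectArea`. A decompiler that emits C sees essentially this shape: a table of code pointers plus structs whose first field points at it.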

hmora 07-17-2004 03:50

why?
 
Certainly, it would be impossible to get the exact code, just as the programmers had written it. The code optimizations made by the compiler make this impossible.

You may get some C/C++ code, but it would be impossible to read. Have you tried to read a simple program written by a bad programmer? I'm a student and I have had to check code from other students on their first programming course. Even if you know what the code should do, it's really hard to understand everything.

I think this is what would happen if you went from ASM to C++. It would be a big mess. Maybe everything made sense when thinking in terms of classes, with functions performing specific tasks. Now consider that the compiler makes a "few" changes and then turns it into assembler. Taking it back to C would complicate things even more.

Why would you want to get SOME (and not THE) C code from a program? I still don't see what the idea behind this is.

Polaris 07-17-2004 17:42

Hmm... I think that there is some confusion about this... ;)

Decompilation to C++ is impossible. A decompiler can rebuild only information contained within its target: since its target is ASSEMBLER, which lacks anything related to an HLL, it surely cannot rebuild things that are not there, like objects.

Also, consider that some C++ concepts are completely discarded after the code-checking phase and are never used later within the compiler: for example, the PRIVATE/PUBLIC/PROTECTED specifiers are used only for access checking.
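
To illustrate the access-specifier point, a small sketch (the names `Guarded`, `Open` and `peek` are invented for the example). It assumes a mainstream compiler, where the class with a private member and the plain struct end up with the same size and layout, so nothing in the object itself records that the field was private:

```cpp
#include <cassert>
#include <cstring>
#include <type_traits>

class Guarded {
public:
    explicit Guarded(int v) : value_(v) {}
    int get() const { return value_; }
private:
    int value_;  // "private" exists only for the compiler's access checks
};

struct Open {
    int value;   // same data, no access restriction
};

// Compile-time checks: both types have the same size, and Guarded is a
// plain, byte-copyable object with a predictable layout.
static_assert(sizeof(Guarded) == sizeof(Open), "same size expected");
static_assert(std::is_standard_layout<Guarded>::value, "standard layout expected");
static_assert(std::is_trivially_copyable<Guarded>::value, "trivially copyable expected");

// Copies the raw bytes of a Guarded into an Open and reads the field,
// bypassing the access check entirely.
int peek(const Guarded& g) {
    Open raw;
    std::memcpy(&raw, &g, sizeof raw);
    return raw.value;
}
```

`peek(Guarded(42))` reads back the stored value without ever calling `get()`. A decompiler looking at the binary is in the same position: it sees an `int` at offset 0 and has no way to tell which access specifier guarded it.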

fsheron 07-21-2004 08:25

It may be rewritten in C++, not decompiled.

mmx133 07-23-2004 09:48

Recently I have disassembled some programs in RISA;
it is very easy to rewrite a program in C++ from the disassembled
code. If you want to decompile, maybe you will have to
do it in different ways on different platforms (i386, hpoa and so
on).

Viasek 07-23-2004 13:35

If you are actually interested in learning about what the structures from c++ to assembly look like, Kris Kaspersky presents very clear and useful information in his book Hacker Disassembling Uncovered.

He presents lots of information about how compilers optimize code and why it would be impossible to write a program to decompile back to C++. IDA's FLIRT signatures are a big step, recognizing the patterns of known APIs and displaying them; however, even those aren't entirely correct. IDA often misrecognizes calling conventions, so I can't imagine relying on a program to transform anything more complex than that unless it does so with a reasonable amount of accuracy.

mihaliczaj 08-13-2004 00:06

Exe2C
5-10 years ago I found a program called Exe2C. You can find some references to it on the Internet.

It produced a C program that theoretically results in the same .exe. Of course the exact result depends on the compiler and the optimizations, but one can see the functions, the global data areas that are referenced by some functions etc.

Theoretical thoughts
If you fix some parameters: compiler, optimization flags, platform, endianness etc., exact language rules, then in my opinion it is possible to write a program that recompiles an .exe to such a source code that produces the same output compiling with the fixed parameters.

Unfortunately not all these parameters can be retrieved from the .exe.

Even if we somehow had this information, it would still be impossible to get the same source code (regardless of the names, of course). A lot of info is not preserved even in a non-optimized compilation.
Just an example:
class C
{
int m_iX;
public:
static int GetX_static( C *pThis ) { return pThis->m_iX; }
int GetX() { return m_iX; }
friend int GetX_global( C *pThis );
};
int GetX_global( C *pThis ) { return pThis->m_iX; }

There is no difference in the resulting code of these three (member) functions.
Most C++ language elements have their equivalent in C, and it is impossible to differentiate the resulting assembly code (assuming there is no debug info in the .exe, but this is the usual case).
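
A compilable variant of the class above, as a sketch: a constructor is added purely for the demo, and `GetX_global` is declared to return `int` so that it compiles. It shows that, even at the type level, a static member function and a global function are indistinguishable to a caller that only holds a pointer. The helper `callThrough` is hypothetical:

```cpp
#include <cassert>
#include <type_traits>

class C {
    int m_iX;
public:
    explicit C(int x) : m_iX(x) {}  // added for the demo only
    static int GetX_static(C* pThis) { return pThis->m_iX; }
    int GetX() { return m_iX; }
    friend int GetX_global(C* pThis);
};
int GetX_global(C* pThis) { return pThis->m_iX; }

// A static member function and a global function have the very same
// pointer type: an ordinary pointer to function taking C*.
static_assert(std::is_same<decltype(&C::GetX_static), int (*)(C*)>::value,
              "static member is an ordinary function");
static_assert(std::is_same<decltype(&GetX_global), int (*)(C*)>::value,
              "friend/global is an ordinary function");

// A caller holding only the pointer cannot tell which one it was given.
int callThrough(int (*getter)(C*), C* obj) { return getter(obj); }
```

The non-static `GetX` has a different source-level type (`int (C::*)()`), but on common calling conventions it too receives the object address as a hidden argument, which is why the generated code typically looks the same.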

There are some language elements (exceptions, virtual base classes) that cannot be directly translated to their C equivalent, so they can be recognized and rebuilt.

For a long time a C++ compiler (cfront, originally written by B.Stroustrup) was just a C++ to C compiler. When new language elements have been added (exceptions, templates etc.) this became impossible.
A very good description of the implementation details of the different C++ language elements can be found in "Inside the C++ Object Model" by Stanley B. Lippman. It describes the internal structures for virtual inheritance and the structures used to handle member function pointers to virtual functions of virtual base classes, among other things.

Conclusions
I think it is a reasonable target to write an .exe to C decompiler, but it is almost impossible to get back some really useful C++ extra. Knowing the compiler and having debug info can help a lot.
Virtual tables and virtual functions can be recognized, but there is no cue for templates and inline functions.

The optimization is a general problem that occurs in the case of all languages, because there is optimization at the language level, but there is also optimization at assembly level, that can hide the originally visible constructs.

LoveExeZ 08-13-2004 09:52

Decompiling is not an easy thing...
It needs a lot of additional experience-based knowledge (KB),
and many symbols and much debug info are lost during compilation,
so the decompiler has to endeavour to recover these things.
Such as..
source code:
void SwapTwoNumber(int* a,int* b)
{.................
}

After decompiling it may come out in this form:
sub_0121(DWORD* a1,DWORD* a2)
{......
}

Yep, the name SwapTwoNumber is information: you can often quickly understand a function just from its name.
So the decompiler should try to recover such names; this could be attained by AI.
The above is one easy instance...
If we have time, we can discuss these techniques in detail.. :p

McS2oo4 08-14-2004 03:34

Inquisition IDA asm > C plugin
 
There are actually 2 asm>C plugins for the IDA disassembler; sometimes I combine the two of them to get a clearer view of the code. These are not serious decompilers, just one more look from another perspective. Decompile to C has better output than the Inquisition plugin, but it sometimes skips parts of the code that it cannot understand. So you are back at asm and IDA's representation of the code :D

mihaliczaj 08-14-2004 04:39

extra info in source code
 
It is worth seeing the home page of The International Obfuscated C Code Contest. (hxxp://www.ioccc.org)
I would be surprised if there would ever be such an AI that could retrieve those sources.
Just an example to taste it:
Code:

#include <stdio.h>
int l;int main(int o,char **O,
int I){char c,*D=O[1];if(o>0){
for(l=0;D[l              ];D[l
++]-=10){D  [l++]-=120;D[l]-=
110;while  (!main(0,O,l))D[l]
+=  20;  putchar((D[l]+1032)
/20  )  ;}putchar(10);}else{
c=o+    (D[I]+82)%10-(I>l/2)*
(D[I-l+I]+72)/10-9;D[I]+=I<0?0
:!(o=main(c/10,O,I-1))*((c+999
)%10-(D[I]+92)%10);}return o;}

This is a square root calculator, note the form of the whitespaces ;)

Ok, this (and the others on the IOCCC page) are not real-life examples, but as LoveExeZ pointed out, there is substantial information in the source code that is simply impossible to get back.

On the other hand if we just get back only a small subset of this extra info, it can help a lot. If one gets back a part of the inheritance hierarchy, then it can be very useful.
Polymorphic classes and virtual function calls can be recognized because they use the vptr (exact implementation details differ from compiler to compiler). The hierarchy can be reproduced from the constructors and the destructors, as they again have a certain structure (calling the ctor of the base's base, the ctor of the base etc.)
Finding constructors and destructors is easy from the virtual table, and having these functions identified, lots of info can be given.
Just imagine the following:

Originally:
Code:

function1()
{
  int i1, i2, i3, i4, i5;
  function2( &i1 );
  function3( &i4 );
  function4( &i1 );
  function5( &i4 );
  function6( &i4 );
  function7( &i1 );
}

Having ctor/dtor pairs identified:
Code:

function1()
{
  Class1 Object1;
  Class2 Object2;
  Object1.Member1();
  Object2.Member2();
}
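
The call pattern that makes this recognition possible can be sketched as follows. The classes and the `g_calls` log are invented for the illustration; the log simply records the order in which calls are made on the same stack object, which is the kind of bracketing a tool could look for in the disassembly:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records the order of calls, standing in for what a disassembly shows.
std::vector<std::string> g_calls;

struct Base {
    Base()  { g_calls.push_back("Base ctor"); }
    ~Base() { g_calls.push_back("Base dtor"); }
};

struct Derived : Base {
    Derived()  { g_calls.push_back("Derived ctor"); }
    ~Derived() { g_calls.push_back("Derived dtor"); }
    void Member1() { g_calls.push_back("Member1"); }
};

void function1() {
    Derived Object1;    // base ctor runs first, then the derived ctor
    Object1.Member1();  // ordinary member call on the same object
}                       // dtors run in reverse order at scope exit
```

After `function1()` returns, the log reads: Base ctor, Derived ctor, Member1, Derived dtor, Base dtor. The first and last calls on an object bracket its lifetime, and the nesting of the ctor calls hints at the inheritance chain.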


sumeru 08-14-2004 04:43

decompiling code is not readable
 
Since there is optimization when compiling, the compiler changes the code too much.

I have tried some decompiling tools before, but the output is very difficult to read and understand. It is very badly organized.

br00t_4_c 08-17-2004 04:29

I think by its very nature compilation is a one-way process. You can reconstruct source code from a disassembled binary executable that may well closely resemble the original source code, but as Sarge very astutely mentioned, variable and function names will be mangled, comments will be lost, etc. Maybe if there was a decompiler that incorporated some kick-ass artificial intelligence that could magically analyze and emulate the personality and proclivities of the developer who wrote the code we'd see a decompiler of the nature discussed in this thread. Barring that, you can send me the money and I'll use it to buy crack.

mihaliczaj 08-17-2004 06:01

Quote:

Originally Posted by br00t_4_c
You can reconstruct source code from a disassembled binary executable that may well closely resemble the original source code but as Sarge very astutely mentioned variable and function names will be mangled, comments will be lost, etc.

Yes, that is the realistic view. This information is simply lost during compilation. Assuming there is no debug info, just the compiled, stripped .exe, we can't do anything about this.
I am sure, however, that even such a source with names like variable1, variable2 etc. can be a great help for anyone who wants to understand the original ideas behind the code.
Don't forget that the other alternative is facing a huge, unorganized list of assembly functions.
Some information that I am sure can be (at least partly) recognized when the optimization doesn't hide it:
C++ specific:
- virtual tables
- ctors, dtors
- inheritance relationships
- dynamic_casts
- class sizes
- stack objects
- global objects
- member functions
- member pointers, member function pointers
- heap allocations
C specific:
- switch statements
- loops
- function calls
Assuming we have a tool that collects all this information and is built into a debugger (OllyDbg for example), just imagine what help it could be.
OllyDbg supports writing comments next to the code. If this tool also supported naming of the recognized structures, complete parts of the original code could be reconstructed.
Quote:

Originally Posted by br00t_4_c
Maybe if there was a decompiler that incorporated some kick-ass artificial intelligence that could magically analyze and emulate the personality and proclivities of the developer who wrote the code

Creating utopias
If we had such AI the programs probably wouldn't be written by humans. Humans would just assist defining the target conditions.
Then the abstraction level would be even further from the assembly level, and that AI would still not be enough. But there would be recognizable patterns in the created code and a tool could be created to display them. A lot of info would be lost, but with some patience complete parts of the original code (or target conditions) could be reconstructed manually.
Back to the ground
As the coder is (probably) a person, only another person is smart enough to recognize his/her thoughts. The automatically recognizable patterns should be shown, these are the language elements (loops, function calls etc.), but the rest should be left to the user. I know that there are some coding patterns that could be easily recognized, but the reverse engineer is the one who should recognize and mark them. The best tool doesn't do everything, but what it does, it does in a reliable way you can build on.

If anyone is interested in writing the OllyDbg plugin outlined above, I would gladly give further details on how the mentioned structures could be recognized.

phoenixodin2 08-17-2004 17:11

i think it is possible but ...
 
For me,
it's possible.
But what would you convert it for?
Without understanding the algorithm, a program is like a dead body.

I don't know any program that does that, but anyone can convert it manually.

Each C++ compiler has its own way to generate code from C++ to C and then from C to asm.

So if you want to convert code from a program to C++ you must:
+ convert machine code to asm - IDA Pro can handle almost all of it
+ convert this asm to c - the hardest step @@!!!@@
+ depending on which compiler generated the code, convert this c code to c++ code - the easiest step.


hope this can help you.
regards

br00t_4_c 08-18-2004 01:06

@mihaliczaj:

On the subject of decompiler development, it would be interesting to develop an application that could perhaps transform a disassembly into source code by making statistical comparisons of a given disassembly (having unknown source) with disassembly from known source code. I'm thinking of something along the lines of a "genetic" decompiler, that is to say one that over time would be capable of some form of "learning" and would generate a more and more accurate reproduction of the original source code. If a sufficiently large database of disassembly-to-source mappings could be generated for a given set of languages, and a suitable set of pattern recognition algorithms could be developed, and these pattern recognition algorithms were "evolutionary" (i.e. we have a set of pattern recognition algorithms that can each be selected or discarded based on how accurate an approximation of known source code it generates from the corresponding disassembly), we might be able to arrive at a decompiler that reconstructs source code in a manner that is true to the original. I do however believe that such an application goes way, way beyond the scope of what can be accomplished with the olly plugin api. I think it would maybe be more appropriate to consider developing it as a stand-alone app. But that's just my opinion. Holler back if you want to discuss this further.
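
A very small sketch of the "statistical comparison" part of this idea, under heavy assumptions: functions are reduced to opcode mnemonics, similarity is the Jaccard index over opcode bigrams, and the database is a plain list. This is a simple nearest-match lookup, not the evolutionary scheme described above, and every name in it (`KnownSample`, `similarity`, `bestMatch`) is hypothetical:

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Opcodes = std::vector<std::string>;
using Bigram = std::pair<std::string, std::string>;

// Collects the set of adjacent opcode pairs in a sequence.
std::set<Bigram> bigrams(const Opcodes& ops) {
    std::set<Bigram> out;
    for (std::size_t i = 0; i + 1 < ops.size(); ++i)
        out.insert({ops[i], ops[i + 1]});
    return out;
}

// Jaccard similarity: size of the intersection divided by size of the union.
double similarity(const Opcodes& a, const Opcodes& b) {
    std::set<Bigram> sa = bigrams(a), sb = bigrams(b);
    if (sa.empty() && sb.empty()) return 1.0;
    std::size_t common = 0;
    for (const Bigram& g : sa) common += sb.count(g);
    std::size_t unionSize = sa.size() + sb.size() - common;
    return static_cast<double>(common) / static_cast<double>(unionSize);
}

// One database entry: disassembly whose source construct is known.
struct KnownSample {
    std::string source;  // e.g. a label or the original source text
    Opcodes ops;
};

// Returns the most similar known sample, or nullptr for an empty database.
const KnownSample* bestMatch(const std::vector<KnownSample>& db,
                             const Opcodes& unknown) {
    const KnownSample* best = nullptr;
    double bestScore = -1.0;
    for (const KnownSample& s : db) {
        double score = similarity(s.ops, unknown);
        if (score > bestScore) { bestScore = score; best = &s; }
    }
    return best;
}
```

Given a database with one loop-shaped sample and one if/else-shaped sample, an unknown sequence such as `mov, cmp, jge, add, inc, jmp` would be matched to whichever sample shares the most bigrams with it. A real system would need far richer features than raw mnemonics.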

Shub-Nigurrath 08-18-2004 02:15

Hi,
have you ever heard of Desquirr? An IDA plugin, free, sourceforge, which tries to do exactly what you are talking about..

hxxp://desquirr.sourceforge.net/desquirr/

Quote:

Desquirr is a decompiler plugin for IDA Pro. It is currently capable of simple data flow analysis of binaries with Intel x86 machine code.

It might give some nice ideas, besides Boomerang, another free decompiling tool already mentioned in this forum..

mihaliczaj 08-18-2004 06:12

less automation
 
Quote:

Originally Posted by br00t_4_c
...by making statistical comparisons of a given disassembly...
...a "genetic" decompiler...
...some form of "learning"...

It can be a good approach, but there are some uncertain points. It is hard to measure the result, and it is hard to code the heuristics that recreate the original code.

What I imagine is a tool that first determines the compiler as precisely as possible, and then, knowing the compilation 'habits' of that compiler configuration, looks for information that can be retrieved in a reliable way. Otherwise we would just see false images of our software; I mean the tool would very easily mislead itself.

Getting back the original source code is simply a false goal.
What we can reach is a level between C and C++ from where we can see relationships that were otherwise invisible. Having this information, we can reproduce additional C++ parts, moving that level more and more towards C++, but I think it cannot be automated.
The best would be a disassembled C++ source code explorer tool that would enable the user to reorganize the source code.
The problem is that a simple function can be either:
* a global function
* a static function
* a class level function (static member function)
* a member function
In the asm code there is no difference.
Exploring the already visible C++ parts an experienced coder who knows the software very well may be able to make such decisions and may be able to give reasonable names for the entities.

mihaliczaj 08-18-2004 06:27

C to C++ what is hard
 
Quote:

Originally Posted by phoenixodin2
Without understanding the algorithm, a program is like a dead body.
I don't know any program that does that, but anyone can convert it manually.

Yes, that is the point. This is what I mean. You should do this manually, but a program can help in this a lot. I think it is impossible to give an always-working automatic solution, but instead it is possible to give one that assists you in recreating code that has an equivalent structure.
Quote:

Originally Posted by phoenixodin2
Each C++ compiler has its own way to generate code from C++ to C and then from C to asm.

This was true for a long time, but when new language features were added this didn't work any more.
I referred to an excellent book about the details in a previous post.

Quote:

Originally Posted by phoenixodin2
+ convert this asm to c - the hardest step @@!!!@@

This is not that hard, there are existing tools for this.
C is very near to asm.
Quote:

Originally Posted by phoenixodin2
+ depending on which compiler generated the code, convert this c code to c++ code - the easiest step.

This is the hardest.
The problem is that in C++ you don't just write what the computer should do (as you do in C), but at the same time you organize the code to mirror your thoughts. There is hidden information.
Just think about public, protected and private. It is some kind of documentation and error prevention, you cannot find anything about these in the generated code. This is why the assistance of a coder is needed to recognize such things.

mihaliczaj 08-18-2004 06:43

still only C, not C++
 
Quote:

Originally Posted by Shub-Nigurrath
...have you ever heard of Desquirr?...
...Boomerang, another free decompiling tool...

There are a lot of tools that create pseudo C code from asm, because less information is lost there.
Ok, usually C code is also C++ code, but this is not what we want.
We know that the software was written in C++, so we would like to get back the extra information that is in the source code.

There are some key points (vptr, dynamic_cast, static initialization...) that make it possible to discover some of the original C++ structures.

I haven't heard about a tool that tries this.

/*
Just an example for a code that works differently in an old version of C and in C++ :
Code:

int a = 2 //**/ 2
;

*/
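
For readers puzzling over the snippet in the comment above, a short worked version. In C++ (and in C from C99 onward) `//` starts a line comment, so everything after the first `2` on that line is ignored. A C89 compiler has no `//` comments and reads the same text as a division sign followed by the empty comment `/**/`, that is `2 / 2`:

```cpp
#include <cassert>

// C++ reading: "//" begins a comment, so the initializer is just 2.
// C89 reading: "2 / /* empty comment */ 2", which evaluates to 1.
int a = 2 //**/ 2
;
```

Compiled as C++, `a` is 2; only the C89 interpretation would give 1.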

kp_ 09-22-2004 05:39

Some thoughts..
 
Hi all,

I found this thread by accident, maybe it's not that old..

My first question:
Why are you afraid of optimizations? I think that a decompiler shouldn't look for compiler-specific structures or patterns in code. Instead, it should read the _semantics_ of the code. The meaning of it. Everything that the program does is written down in the code. It will do the same whether it is optimized or not (or if not, then the difference is not important from the result's point of view). Imagine a simple intermediate language that is close to asm, but with simpler constructs: mov, cmp, jumps, simple arithmetical and logical functions. The transformation from asm to this language is trivial. Then, since you know the meaning of C constructs, you can automatically find them in this language, you just have to map them there. Of course, there will be different constructs that mean the same, but why should we care.. like whether you write a loop with a for or with a while and some init code.
I think, that this approach could be used for this purpose.
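
A toy version of this idea, as a sketch only: a hypothetical miniature intermediate language with a handful of operations, and a pass that flags backward jumps as loop candidates. Real loop recovery works on a control-flow graph (for example with dominator analysis); the linear scan here is a deliberate simplification, and `Op`, `Instr`, `Loop` and `findLoops` are all made-up names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature IR: only a few of the operations named above.
enum class Op { Mov, Add, Cmp, JumpIfLess, Jump, Ret };

struct Instr {
    Op op;
    int target;  // index of the jump target; ignored (-1) for non-jumps
};

struct Loop {
    std::size_t head;  // first instruction of the loop body
    std::size_t tail;  // the jump that closes the loop
};

// A jump whose target lies at or before the jump itself is a back edge,
// the usual signature of a loop, whether the source said for or while.
std::vector<Loop> findLoops(const std::vector<Instr>& code) {
    std::vector<Loop> loops;
    for (std::size_t i = 0; i < code.size(); ++i) {
        const Instr& in = code[i];
        bool isJump = in.op == Op::Jump || in.op == Op::JumpIfLess;
        if (isJump && in.target >= 0 &&
            static_cast<std::size_t>(in.target) <= i) {
            loops.push_back({static_cast<std::size_t>(in.target), i});
        }
    }
    return loops;
}
```

For a six-instruction sequence whose conditional jump at index 4 targets index 1, `findLoops` reports a single loop with head 1 and tail 4. Whether to print it as `for` or `while` is then a presentation choice, as the post says.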

The second... not a question:
Someone wrote about a C++ vector or Boost template, and how you could reverse them. Well, as you may know, all the template stuff is like #defines in a more advanced way. You can even instruct the compiler to give you the source after substituting everything but before compiling it. If you have the source of the templates that are possibly used in the code you want to decompile, then you can parse the decompiled source, look for the constructs that could have been created with a template, and simply transform them back.

Ok, it's not that easy, I know. I have just thought a lot about this, but was too lazy to code anything... I just want to argue with you, maybe we'll come up with useful ideas in this discussion.

kp

thebobbby 09-23-2004 03:33

IMHO, you can certainly get something.... It is possible to extract some high-level constructs from the binary... And this information can certainly be presented as C++.... Now, you will certainly not get any variable names, no classes/methods, just plain functions (it may be possible to get some information for virtual functions, but whenever the method is totally resolved at compile time, the information will not be available)... I don't think information on structures can be automatically extracted...

So what you will most likely get is a set of functions (which is not likely to be the same set of functions/methods used in the original code), accessing local and global variables, with imprecise types (some types can be inferred when the variable is used with known functions, but a variable is mostly a memory location.. can be anything).
The interesting information that can be extracted is part of the high-level constructs used in the function bodies: loops, tests... Some research projects are already capable of extracting such information, which is then used to perform some optimization directly on the binary.

Anyway, to answer the original question, i would answer yes: it is IMHO possible to decompile to C++... For the most part, you can write assembly and C in C++, so it's not a big deal. However, you will get something which is more C than C++, and which may bear only small resemblance to the original code. And doing that would already need quite some time....

kp_ 09-23-2004 18:44

I'm glad you mentioned optimizing. I have already thought about a method.. If you make an abstract representation of the semantics of the code (as I wrote in the first part of my previous post), then you can even transform it back to asm. Since you can describe asm with this language, you can create many representations back and choose the one that is most optimal for you. You just have to be sure that you transformed everything.
I'm not sure I can describe my idea well... so the algorithm could be like this:
Describe the asm instructions with these abstract statements (there could be more than one representation, of course). Then look for these sequences in the abstract representation (it doesn't have to be strictly a sequence; there can be holes that will be represented by other instructions). Then you will have many, many asm representations that do exactly the same thing. So you can write a function that computes an optimality value, and choose the best.
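
The selection step of this algorithm could be sketched as below. Everything here is an assumption made for illustration: candidates are modelled as C++ callables standing in for instruction sequences, the costs are invented numbers, and "same behaviour" is only spot-checked on sample inputs, which is a test and not a proof of equivalence:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using Semantics = std::function<std::uint32_t(std::uint32_t)>;

// One candidate asm representation of the same abstract statement.
struct Candidate {
    std::string asmText;  // illustrative text only, never executed
    Semantics run;        // what the sequence computes
    int cost;             // hypothetical "optimality value", lower is better
};

// Keeps the candidates that agree with the reference on every sample
// input and returns the cheapest of them (nullptr if none agrees).
const Candidate* cheapestEquivalent(const std::vector<Candidate>& candidates,
                                    const Semantics& reference,
                                    const std::vector<std::uint32_t>& samples) {
    const Candidate* best = nullptr;
    for (const Candidate& c : candidates) {
        bool agrees = true;
        for (std::uint32_t x : samples) {
            if (c.run(x) != reference(x)) { agrees = false; break; }
        }
        if (agrees && (best == nullptr || c.cost < best->cost)) best = &c;
    }
    return best;
}
```

With a reference of "multiply by two" and candidates modelling `imul eax, 2`, `add eax, eax` and `shl eax, 2`, the last one is rejected because it multiplies by four, and the cheaper of the remaining two is returned.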

I don't know whether there is research in this direction; maybe I haven't come up with anything new :(

kp


All times are GMT +8.
