Wednesday, September 14, 2011

ESORICS 2011 - Day 2

One talk I found particularly interesting was in yesterday's session on forensics, biometrics and software protection. It was entitled "Who Wrote This Code? Identifying the Authors of Program Binaries" and was delivered by Nathan Rosenblum from the University of Wisconsin-Madison.

The talk considered how to recover provenance information from within program binaries and other forms of executable code. In this case, the goal is authorship attribution: identifying the author of a program from that programmer's particular style. Achieving this through an analysis of source code is perhaps easier, since certain source-level features are strong indicators of a particular author. For instance, a programmer's indentation style, variable naming and use of standard library functions could quite heavily distinguish him or her from a different programmer.

However, source code is sent through many transformations (through a compiler, perhaps a code obfuscator) before being built into an executable program. The question to answer is how to identify which stylistic features best survive this transformation process. The solution is a nice interplay between concepts in both security and machine learning. Author attribution is transformed into a machine learning problem, with control flow graphs and the underlying machine code forming the building blocks of the feature representation of a binary. A particularly interesting feature is the use of 'graphlets': three-node subgraphs of the main control flow graph that reflect the local structure of a program.
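To make the graphlet idea a little more concrete, here is a minimal sketch in Python. It is not the authors' implementation: it assumes the control flow graph is already available as a plain dictionary mapping each basic block to its successors, it looks only at edge structure (ignoring any information about the instructions inside each block), and the function names `canonical_form` and `count_graphlets` are made up for illustration. It counts every connected three-node induced subgraph, grouping together those with the same shape.

```python
from collections import Counter
from itertools import combinations, permutations


def canonical_form(nodes, edges):
    """Return a key that is identical for all isomorphic three-node subgraphs.

    Tries every relabelling of the three nodes as 0, 1, 2 and keeps the
    smallest sorted edge list.
    """
    best = None
    for perm in permutations(range(3)):
        relabel = dict(zip(nodes, perm))
        key = tuple(sorted((relabel[u], relabel[v]) for u, v in edges))
        if best is None or key < best:
            best = key
    return best


def count_graphlets(cfg):
    """Count connected three-node induced subgraphs of a directed graph.

    cfg: dict mapping a basic block to an iterable of successor blocks
    (an assumed, simplified representation of a control flow graph).
    """
    edge_set = {(u, v) for u, succs in cfg.items() for v in succs}
    all_nodes = set(cfg) | {v for _, v in edge_set}
    counts = Counter()
    for trio in combinations(list(all_nodes), 3):
        induced = [(u, v) for u in trio for v in trio if (u, v) in edge_set]
        # Three nodes are (weakly) connected exactly when at least two
        # distinct pairs of them are joined by an edge in either direction.
        linked_pairs = {frozenset(e) for e in induced if e[0] != e[1]}
        if len(linked_pairs) < 2:
            continue
        counts[canonical_form(trio, induced)] += 1
    return counts


if __name__ == "__main__":
    # A toy "diamond" CFG, as produced by a simple if/else statement.
    diamond = {"entry": ["a", "b"], "a": ["exit"], "b": ["exit"], "exit": []}
    for shape, n in count_graphlets(diamond).items():
        print(shape, n)
```

The resulting counts form a fixed-length vector of shape frequencies, which is the kind of input a standard classifier or clustering algorithm can consume. Enumerating all triples is cubic in the number of basic blocks, so this brute-force approach only suits small graphs.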

The method was applied to two slightly different problems:
1) Identifying a particular author of a program;
2) Identifying similarities between programs of unknown origin.
It was tested on binaries from the Google Code Jam competition and on submissions for a particular university course. The results, examined thoroughly in the paper, seem to indicate that programmer style is indeed preserved through the transformation process from source code into binary. Some nice applications of the technique could be the identification of known malware authors and the detection of stolen software.
