
Machine Learning Can Now Identify Who Wrote a Piece of Code

Theo Priestley

Programmers' fingerprints are all over the code

Algorithms can now be trained to recognise a programmer’s coding structure based on examples of their work, meaning they can be identified as the author even if they don’t leave traces.

How you write, from word choice and spelling to punctuation, sentence structure and syntax, is a dead giveaway and as much an identifier as your fingerprints. The analysis of these traits, known as stylometry, has previously been used to detect plagiarism in written text and to de-anonymise authors using sophisticated algorithms. The same technique is now being applied to software code to identify who wrote it.

According to WIRED, Rachel Greenstadt, an associate professor of computer science at Drexel University, and Aylin Caliskan, Greenstadt’s former PhD student and now an assistant professor at George Washington University, have developed a machine learning algorithm that can identify programmers by how they’ve written their code.

Whether it’s through the raw source code or a set of compiled binaries, the approach trains an algorithm to recognise a programmer’s coding structure based on examples of their work, and uses those to examine common traits found in the programmer’s style.
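To make the idea concrete, here is a minimal Python sketch of code stylometry. It is not the researchers' algorithm: the features (line length, indentation, naming habits, keyword frequency), the nearest-centroid classifier, the function names and the two toy authors "alice" and "bob" are all assumptions chosen for illustration.

```python
import math
import re

KEYWORDS = ("for", "while", "if", "else", "return")


def style_features(source: str) -> list[float]:
    """Turn one source file into a small vector of layout and naming habits."""
    lines = [ln for ln in source.splitlines() if ln.strip()]
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    idents = [t for t in tokens if t not in KEYWORDS]
    n_lines = max(len(lines), 1)
    n_tokens = max(len(tokens), 1)
    n_idents = max(len(idents), 1)
    features = [
        sum(len(ln) for ln in lines) / n_lines,              # mean line length
        sum(ln.startswith("\t") for ln in lines) / n_lines,  # share of tab-indented lines
        sum(ln.startswith(" ") for ln in lines) / n_lines,   # share of space-indented lines
        sum("_" in t for t in idents) / n_idents,            # share of names with underscores
        sum(len(t) for t in idents) / n_idents,              # mean identifier length
    ]
    # Relative frequency of a few control-flow keywords.
    features += [tokens.count(k) / n_tokens for k in KEYWORDS]
    return features


def train(samples: dict[str, list[str]]):
    """Learn one average (z-scored) style vector per author."""
    vectors = {a: [style_features(s) for s in srcs] for a, srcs in samples.items()}
    all_vecs = [v for vs in vectors.values() for v in vs]
    dims = len(all_vecs[0])
    means = [sum(v[i] for v in all_vecs) / len(all_vecs) for i in range(dims)]
    # Fall back to 1.0 when a feature never varies, to avoid dividing by zero.
    stds = [
        math.sqrt(sum((v[i] - means[i]) ** 2 for v in all_vecs) / len(all_vecs)) or 1.0
        for i in range(dims)
    ]
    centroids = {}
    for author, vs in vectors.items():
        scaled = [[(x - m) / s for x, m, s in zip(v, means, stds)] for v in vs]
        centroids[author] = [sum(col) / len(scaled) for col in zip(*scaled)]
    return means, stds, centroids


def predict(model, source: str) -> str:
    """Attribute a new file to the author with the closest style centroid."""
    means, stds, centroids = model
    v = [(x - m) / s for x, m, s in zip(style_features(source), means, stds)]
    return min(centroids, key=lambda a: math.dist(v, centroids[a]))


# Toy training data: two hypothetical authors with visibly different habits.
samples = {
    "alice": [
        "def total_sum(value_list):\n    running_total = 0\n    for current_value in value_list:\n        running_total += current_value\n    return running_total\n",
        "def find_max(value_list):\n    best_value = value_list[0]\n    for current_value in value_list:\n        if current_value > best_value:\n            best_value = current_value\n    return best_value\n",
    ],
    "bob": [
        "def sm(a):\n\tt=0\n\tfor x in a:\n\t\tt+=x\n\treturn t\n",
        "def mx(a):\n\tm=a[0]\n\tfor x in a:\n\t\tif x>m:\n\t\t\tm=x\n\treturn m\n",
    ],
}
model = train(samples)

unknown = "def ct(a):\n\tn=0\n\tfor x in a:\n\t\tn+=1\n\treturn n\n"
print(predict(model, unknown))
```

The unseen snippet shares the terse names and tab indentation of the second toy author, so the sketch attributes it to "bob". A real system would need far richer features and many more samples per author than this toy setup.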

Programmers Can No Longer Hide Behind The Code

Caliskan, Greenstadt, and two other researchers have previously demonstrated that even small examples of code on the repository site GitHub can be enough to differentiate one coder from another with a high degree of accuracy.

In another example, Caliskan and a team of other researchers showed in a separate paper that it’s possible to completely de-anonymize a programmer using only their compiled binary code.

In a test using results from Google’s Code Jam, the algorithm was set loose on 600 programmers with 8 samples each, and the system identified the authors with 83% accuracy.
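To put the 83% figure in context, the short calculation below compares it with blind guessing. It assumes that "accuracy" means picking the single correct author out of all 600 candidates and that a random guess is uniform across them; the article does not spell out the evaluation protocol, so treat this as a rough illustration.

```python
n_programmers = 600       # candidate authors in the Code Jam test
samples_each = 8          # code samples per programmer
reported_accuracy = 0.83  # figure quoted in the article

total_samples = n_programmers * samples_each
chance = 1 / n_programmers          # accuracy of a uniform random guess (assumption)
lift = reported_accuracy / chance   # how many times better than chance

print(f"{total_samples} samples, chance {chance:.2%}, lift about {lift:.0f}x")
```

Under those assumptions the test covers 4,800 samples, a random guess would be right about 0.17% of the time, and the reported accuracy is roughly 498 times better than chance.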

Cybersecurity Applications

Both researchers say their work could be used to tell whether a computer science student plagiarised work, or whether a software developer violated a non-compete clause in their employment contract. Security researchers could potentially use it to help determine who might have created a specific type of malware simply by examining the source code.

“Style is preserved,” said Caliskan. “There is a very strong stylistic fingerprint that remains when things are based on learning on an individual basis.”

However, this is a double-edged sword: while it’s still feasible to mask the origins of code and authorship, this new technique might make it more difficult to contribute code with true anonymity.
