Wed, May 27, 2020
There are not a lot of very strong empirical results in the field of programming languages. This is probably because there are a huge number of variables to control for, and most of the subjects available to researchers are CS undergraduates. However, I have recently found a result replicated across numerous codebases, which as far as I can tell makes it one of the most robust findings in the field:
If you have a very large (millions of lines of code) codebase, written in a memory-unsafe programming language (such as C or C++), you can expect at least 65% of your security vulnerabilities to be caused by memory unsafety.
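To make that concrete, here is a minimal, hypothetical C++ sketch (not taken from any of the codebases discussed; the names and structure are invented) of the two bug patterns that dominate these counts: an out-of-bounds write driven by untrusted input, and a use-after-free.

```cpp
// Hypothetical parser code, purely illustrative: it sketches the two bug
// classes (out-of-bounds write, use-after-free) behind the numbers above.
#include <cstring>
#include <iostream>
#include <string>

struct Session {
    char name[16];
    bool is_admin = false;
};

// Copies an attacker-controlled "name" field into a fixed-size buffer.
void set_name(Session* session, const std::string& untrusted_input) {
    // BUG: no bounds check; memcpy trusts the attacker-supplied length, so
    // anything longer than 16 bytes writes past the end of `name`.
    std::memcpy(session->name, untrusted_input.data(), untrusted_input.size());
}

int main() {
    Session* session = new Session;
    // 32 bytes of input overflow the 16-byte buffer and clobber adjacent
    // memory (here, plausibly `is_admin`).
    set_name(session, std::string(32, 'A'));

    delete session;
    // BUG: use-after-free -- the object was freed above, but we still read it.
    std::cout << session->is_admin << "\n";
    return 0;
}
```

Both bugs compile without complaint; this family (out-of-bounds reads and writes, use-after-frees, and their relatives) is what the 65%+ figures are counting, and it is exactly what a memory-safe language rules out by construction.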
And these numbers are in line with what we’ve seen in 0days that have been discovered being exploited.
This observation has been reproduced across numerous very large code bases, built by different companies, started at different points in time, and using different development methodologies. I’m not aware of any counter-examples. The one thing they have in common is being written in a memory-unsafe programming language: C or C++.
Based on this evidence, I’m prepared to conclude that using memory-unsafe programming languages is bad for security. This would be an exciting result! Empirically demonstrated technical interventions to improve software are rare. And memory-unsafety vulnerabilities are one of the only kinds of vulnerability that we know how to completely eliminate, by choosing memory-safe languages. However, it’s critical that we approach this question as rational empiricists, and see if the evidence really merits the conclusion that memory-unsafe programming languages are bad for security.
Let’s consider the Venn diagram of vulnerabilities in these codebases. It has three sets: vulnerabilities caused by memory unsafety; vulnerabilities that would exist in any language (for example, SQL injection); and vulnerabilities that are specific to memory-safe languages (for example, use of eval on untrusted inputs; eval tends to only exist in very high-level languages, which are all memory-safe).

So the first set contains at least 65% of the vulnerabilities in these types of codebases, and logically the second set must contain at most 35% of the vulnerabilities. So if we change programming language to something memory-safe, we get rid of at least 65% of our vulnerabilities. But does the magnitude of the other sets change?
I posit that the second set stays the same size: there’s no reason or evidence to think that porting C++ to a memory-safe language results in additional SQL injection.
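As a sketch of why, assuming a hypothetical users table and using SQLite purely for illustration: SQL injection comes from how the query string is assembled, and both the mistake and its fix look the same in C++ as in any memory-safe language.

```cpp
// Sketch only: the SQLite C API is used here to show that SQL injection is a
// property of how queries are built, not of the implementation language.
#include <sqlite3.h>
#include <string>

// Vulnerable: attacker-controlled `username` is spliced directly into SQL.
// Input like "x' OR '1'='1" changes the meaning of the query.
void lookup_unsafe(sqlite3* db, const std::string& username) {
    std::string sql =
        "SELECT id FROM users WHERE name = '" + username + "';";
    sqlite3_exec(db, sql.c_str(), nullptr, nullptr, nullptr);
}

// The fix is also language-independent: bind parameters instead of
// concatenating strings.
void lookup_safe(sqlite3* db, const std::string& username) {
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT id FROM users WHERE name = ?;", -1,
                       &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, username.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}

int main() {
    sqlite3* db = nullptr;
    sqlite3_open(":memory:", &db);
    sqlite3_exec(db, "CREATE TABLE users (id INTEGER, name TEXT);",
                 nullptr, nullptr, nullptr);

    lookup_unsafe(db, "x' OR '1'='1");  // query meaning hijacked
    lookup_safe(db, "x' OR '1'='1");    // treated as a literal string

    sqlite3_close(db);
    return 0;
}
```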
Our third set is vulnerabilities that are specific to memory-safe languages. Actual use of eval in production code is incredibly rare in my experience; however, its cousin “unsafe deserialization” does occur in the real world. To investigate its frequency, I looked into Java’s unsafe deserialization on Android. Based on the research I reviewed, Android as a whole appears to have had maybe a dozen of these. Basically every month it has more memory-unsafety issues than it has had vulnerabilities of this class in its entire history. So I believe this class to be orders of magnitude smaller than our first set.
In conclusion, the empirical research supports the proposition that using memory-safe programming languages for these projects would result in a game-changing reduction in the total number of vulnerabilities.
Like all empirical claims, this is subject to revision as we obtain more data. You could prove me wrong by either a) finding very large codebases written in memory-unsafe languages which, after being subjected to substantial first- and third-party security research, had a much lower ratio of memory-unsafety-induced vulnerabilities, or b) finding codebases which have memory-safe-specific vulnerabilities at a comparable scale (dozens fixed per release). Until you have the evidence, don’t bother with hypothetical notions that someone can write 10 million lines of C without ubiquitous memory-unsafety vulnerabilities – it’s just Flat Earth Theory for software engineers.