Code Encodings

Apr 15, 2019

These are my slides from a recent talk about machine learning research we’ve been doing at GitHub. Under each slide I included the contents of my talk — it might not be a verbatim transcription, but it’s pretty close.

The goal of our research is to lower the barriers to coding by bridging the gap between natural language and code.

Today, learning to code can take years of specialized training. Coding is a powerful tool available only to a small subset of the population. People who can’t code are at an inherent disadvantage: They’re limited to using computers in ways that others have envisioned for them.

But the ability to tell a computer what to do could be useful for pretty much any profession. Making it easier to program would widen and diversify the community of coders, and could unleash the creativity of experts from every domain.

The problem is that millions of years of evolution have shaped the human mind to think in natural language, not in code. Programming languages are very unintuitive to most people.

Education can and should change that to some extent: Every child could someday learn to code in the way that every child, at least in our society, now learns how to read, write, and do basic math.

But machine learning could help us even out the playing field much further and much faster. It could bridge the gap between natural language and code, allowing us to produce and understand code without having to think in code.

So how do we do this?

Well, unfortunately, full machine translation between natural language and code will be very hard to achieve. The biggest obstacle is the mismatch between the built-in ambiguity and nuance of natural language and the rigid, highly structured nature of code.

I see a hierarchy of useful steps along this path, though.

The first and easiest step in the hierarchy is semantic search of code: We want to learn to search for code using natural language queries. To do this, of course, we need to figure out how to connect these natural language queries to the relevant code.

Today, when you run across some problem you’re trying to solve, you probably search on Google or Stack Overflow, where you can find manually-curated answers to some of the most common problems many people have tried to solve before you. But the set of all GitHub public repositories holds way more information about far more uncommon problems you might need to solve. We want to unlock the ability to search through all that code to find examples and solutions.

This would make it far easier to quickly produce good code and to learn and level up our skills.

This is a relatively easy problem because it’s a matching problem rather than a generation problem: In a matching problem, we just need to match a query to a preexisting answer in some database. In a generation problem, on the other hand, we need to produce new content to answer a question.

The next step in the hierarchy would be machine translation of code from one programming language to another. For example, say I found this Java code that does exactly what I want, but I need this solution in Python.

The code translation problem can be addressed in two different ways: an easier way, and a harder way.

The easier way would be to use the Java code as a query to find an existing solution written in Python. That again is a matching problem, and so it’s relatively easy, but of course it limits us to finding code that someone has already written.

The harder way — the one that gives the most general solution to the problem — would be to automatically translate code from Java to Python. That certainly isn’t easy, because it requires the ability to generate working source code, but there’s reason to hope that we can do it: Machine translation from one natural language to another has gotten immeasurably better over the past few years, with the advent of deep learning, so it’s not unreasonable to think that machine translation from one programming language to another could also be possible in the foreseeable future.

The third step in the hierarchy would be automatic documentation of code: Given a function or a method or a class written in source code, summarize it in natural language.

This would be useful in and of itself, because a lot of the source code out there is poorly documented, which makes it much harder for a new user to understand it and to start using it.

It’s also a harder problem, though, because it requires several different abilities:

First of all, it requires generation of new content rather than just matching.
Secondly, it requires translation from highly-structured code into human-readable natural language. We know from machine translation between natural languages that the more those languages differ (in their structure, their grammar, and their logic), the harder the task becomes, and code is very different from any natural language.
And thirdly, it requires summarization rather than just word-for-word translation: We want documentation that describes what the code does, not how it does it step-by-step. That adds another layer of abstraction, and requires identifying the key elements and sifting out all the irrelevant details.

The final step in the hierarchy would be translating human-generated pseudo-code — or natural language instructions — into working code in any programming language.

That would be the dream, because it would completely take away the need to code.

Not surprisingly, though, that’s also a particularly hard problem, because it involves not only generation of new content, and not only translation between extremely different languages, but also expanding more abstract, high-level human instructions into detailed, unambiguous code.

We’re not really there yet.

So what are we working on right now?

Right now we’re very much in the early stages of this challenge. It’s an active research field at the cutting edge of deep learning and NLP.

At GitHub we’re particularly well-positioned to work on this, because we have access to huge amounts of code in every language imaginable. But, no less importantly, we also have large quantities of natural language used in connection with that code: Documentation, commit descriptions, READMEs, pull requests, issues, and so on.

Right now, we’re leveraging this data to tackle the first step I mentioned: semantic search of code.

It turns out there aren’t great datasets out there with pairs of natural-language queries and the relevant code results that go with them, especially not datasets that are big enough to train on. The best proxy we’ve found is using pairs of functions or methods and their associated docstrings or header comments, where the comments are used as a stand-in for a natural language query. This is the dataset we’re using for now.
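To make the idea of function and comment pairs concrete, here is a minimal sketch of how such pairs could be collected from Python source using the standard `ast` module. The helper name `extract_pairs` and the sample source are made up for illustration, and this is not the pipeline used at GitHub; a real pipeline would presumably also strip the docstring out of the code side and handle other languages.

```python
import ast

def extract_pairs(source: str):
    """Return (docstring, code) pairs for every documented function in `source`."""
    pairs = []
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)  # None when the function has no docstring
            if doc:
                # The code text of the whole function (Python 3.8+).
                code = ast.get_source_segment(source, node)
                pairs.append((doc, code))
    return pairs

# Hypothetical sample input: one documented and one undocumented function.
EXAMPLE = '''
def add(a, b):
    """Return the sum of two numbers."""
    return a + b

def undocumented(x):
    return x * 2
'''

pairs = extract_pairs(EXAMPLE)
for doc, code in pairs:
    print("QUERY PROXY:", doc)
    print("CODE:", code, sep="\n")
```

Only the documented function yields a pair; undocumented functions give the model nothing to stand in for a query, so they are skipped.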

We’re tackling this problem using deep neural networks. We’ve experimented with various network architectures: From simple models such as Neural Bag-of-Words, to standard NLP models such as RNNs and 1D CNNs, to more state-of-the-art methods such as self-attention (BERT) and its derivatives. Whichever architecture we choose, though, the underlying framework remains the same.

We train these networks to translate text snippets into vectors.

Here, for visualization purposes, I’m showing just a two-dimensional vector space, but in reality the networks produce vectors in hundreds of dimensions.
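As a rough illustration of what "text in, vector out" means, here is a sketch of the simplest architecture mentioned above, a bag-of-words encoder: look up a vector for each token and average them. Everything here is an assumption for the sake of the example: the class name, the whitespace tokenizer, the 128-dimensional size, and the random (untrained) embedding table. In a real model the table would be learned.

```python
import numpy as np

EMBED_DIM = 128  # hypothetical size; the talk only says "hundreds of dimensions"

class BagOfWordsEncoder:
    def __init__(self, vocab, dim=EMBED_DIM, seed=0):
        rng = np.random.default_rng(seed)
        self.index = {tok: i for i, tok in enumerate(vocab)}
        # One row per token. Random here; learned during training in practice.
        self.table = rng.normal(size=(len(vocab), dim))

    def encode(self, text):
        # Naive whitespace tokenization; unknown tokens are ignored.
        ids = [self.index[t] for t in text.lower().split() if t in self.index]
        if not ids:
            return np.zeros(self.table.shape[1])
        vec = self.table[ids].mean(axis=0)  # average the token vectors
        return vec / np.linalg.norm(vec)    # unit length, handy for cosine similarity

encoder = BagOfWordsEncoder(["read", "file", "sort", "list"])
vector = encoder.encode("read file")
print(vector.shape)
```

Because the token vectors are averaged, word order is thrown away, which is exactly what makes this the simplest baseline compared to RNNs, CNNs, or self-attention.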

The key is that we train the networks to map code snippets and their related documentation snippets into the same vector space. The ultimate goal is for the resulting vectors to fully capture the meaning of each snippet. Of course pairs of code and its documentation have similar meanings.

So if we do this well, our networks will learn to translate matching pairs of code and language into similar vectors…

…while at the same time translating non-matching pairs into dissimilar vectors.

The end result is that the networks generate similar vectors for the code and the documentation of function A, and they also generate similar vectors for the code and documentation of function B, but each of the functions is clearly distinct from the other.
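One common way to express "matching pairs close, non-matching pairs far" as a training objective is a softmax over the similarities within a batch, where each documentation vector should pick out its own code vector. The talk does not say which loss is actually used, so treat the sketch below as an assumption; `contrastive_loss` and the random toy vectors are purely illustrative.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def contrastive_loss(code_vecs, doc_vecs):
    """Mean cross-entropy where row i of doc_vecs should match row i of code_vecs."""
    sims = normalize(doc_vecs) @ normalize(code_vecs).T  # (n, n) cosine similarities
    sims = sims - sims.max(axis=1, keepdims=True)        # for numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability of each correct (doc, code) pairing.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
code = rng.normal(size=(4, 64))
docs_matched = code + 0.01 * rng.normal(size=(4, 64))  # each doc near its own code
docs_shuffled = docs_matched[::-1]                     # every doc paired with the wrong code

loss_matched = contrastive_loss(code, docs_matched)
loss_shuffled = contrastive_loss(code, docs_shuffled)
print(loss_matched < loss_shuffled)
```

Training would adjust the encoders to push this loss down, which pulls the vectors for function A's code and documentation together while pushing them away from function B's.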

If you’ve heard of word2vec or any other word embedding technique, you might be thinking that this sounds kind of similar.

Basically word2vec is a method of mapping words into a vector space.

In word2vec, words that are related end up in a similar location in vector space,…

…and words that are unrelated end up in very different parts of the space.
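"Similar location" is usually measured with cosine similarity, the cosine of the angle between two vectors. The two-dimensional vectors below are invented by hand for illustration and do not come from a trained word2vec model.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors: near 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Made-up toy vectors, not real word2vec output.
toy = {
    "cat": (0.9, 0.8),
    "dog": (0.8, 0.9),
    "carburetor": (-0.7, 0.2),
}

related = cosine(toy["cat"], toy["dog"])
unrelated = cosine(toy["cat"], toy["carburetor"])
print(related > unrelated)
```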

Conceptually our approach is similar to word2vec (although the inner workings are very different), but there are two major differences:

First of all, our vectors encapsulate the meaning of entire chunks of code or natural language, rather than just individual words.
Secondly, we map the code and the natural language to the same vector space, so that we can say not only that two code snippets are similar to one another, but that this code snippet is similar to that natural language snippet.

If our vectors successfully capture meaning, then we’re well on our way to solving the problem of semantic search.

In theory at least, we can store vector encodings for all the code on GitHub in a database.

When a user inputs a natural-language search-query, we use our trained networks to translate it into a vector.

We then just do a nearest-neighbor lookup on all the code vectors…

…and we return the code snippet whose vector is most similar to the query’s.
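The lookup step can be sketched as a brute-force scan using cosine similarity. The function name, the three-dimensional vectors, and the snippet labels are all hypothetical; in a real system the vectors would come from the trained networks, and a scan over every vector would likely be replaced by an approximate nearest-neighbor index.

```python
import numpy as np

def nearest_snippets(query_vec, code_vecs, snippets, k=1):
    """Return the k snippets whose vectors are most similar to the query vector."""
    code_unit = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    scores = code_unit @ query_unit   # cosine similarity to every stored vector
    top = np.argsort(-scores)[:k]     # indices of the highest scores first
    return [(snippets[i], float(scores[i])) for i in top]

# Hand-made stand-ins for stored code vectors and an encoded query.
snippets = ["read_file()", "sort_list()", "send_email()"]
code_vecs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
query_vec = np.array([0.1, 0.9, 0.2])

results = nearest_snippets(query_vec, code_vecs, snippets, k=2)
print(results[0][0])
```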

So where does this get us?

We’ve reached a point in our research where we’re starting to get good results for these kinds of queries, which means that our vectors are starting to capture the meaning of both natural language and code, and to know how to relate the two.

The framework I’ve shown so far begins to solve the problem of semantic search: If we can get the vectors good enough, search becomes a problem of nearest-neighbor lookup at scale.

What about the other steps in our hierarchy?

Looking up similar code in another language is very similar to semantic search, and in some ways even easier. We’re using a great public dataset called Rosetta, which provides code snippets that carry out the same task in a variety of different programming languages. We use Rosetta to test the ability of our networks to match code across languages, and we’ve started working on that problem in parallel with semantic search.
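One plausible way to score cross-language matching on paired data like this is top-1 accuracy: for each snippet in one language, check whether its nearest neighbor among the other language's snippets is the one that solves the same task. The talk does not name the metric actually used, so both the metric and the `top1_accuracy` helper below are assumptions.

```python
import numpy as np

def top1_accuracy(query_vecs, target_vecs):
    """Fraction of rows i whose most similar target row (by cosine) is also row i."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    best = (q @ t.T).argmax(axis=1)  # best-matching target index for each query
    return float(np.mean(best == np.arange(len(q))))

# Toy stand-ins: row i of each array represents the same task in two languages.
java_vecs = np.eye(3)
python_vecs_aligned = np.eye(3)
python_vecs_misaligned = np.roll(np.eye(3), 1, axis=0)

print(top1_accuracy(java_vecs, python_vecs_aligned))
print(top1_accuracy(java_vecs, python_vecs_misaligned))
```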

All the other problems, though, are generation problems, which are much harder for two reasons: First of all, the vectors need to fully capture the meaning of each snippet, not just be closer to the correct snippet than to any other snippet in our database. Secondly, we need a method to translate back from vectors into language.

Think about all the vectors in our database as stars out in space. There are lots of stars, and yet space is mostly empty, because it’s so huge.

The same is true of our vector space: It’s hundreds of dimensions of mostly empty space. Even if our database of code is huge, when it’s mapped into a space that big the results will still be very sparse.

To solve a matching problem, our networks don’t have to translate perfectly; we only need them to land in the neighborhood of the correct point. Basically, we need them to land closer to the correct point than to any other point. Of course that’s not so hard to do, because the space is very sparsely filled.
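A small experiment illustrates why landing "in the neighborhood" is enough. Random unit vectors stand in for the database (an assumption; real encodings are not random), and the query is the correct vector plus noise half as long as the vector itself. The sizes and noise level are arbitrary choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N = 256, 5_000  # arbitrary illustrative sizes

# A "database" of random unit vectors in a high-dimensional space.
points = rng.normal(size=(N, DIM))
points /= np.linalg.norm(points, axis=1, keepdims=True)

target = 123
noise = rng.normal(size=DIM)
noise /= np.linalg.norm(noise)
query = points[target] + 0.5 * noise  # a deliberately imprecise encoding

# Dot-product scores against every stored vector; the best score wins.
scores = points @ query
recovered = int(scores.argmax())
print(recovered == target)
```

Even with that much noise, the correct point still wins, because every other point is almost orthogonal to the query. A generation problem gets no such slack: the noisy vector itself would have to decode into correct output.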

On the other hand, to solve a generation problem, we need to land right smack on the perfect spot in space. Otherwise, even if we’re just a little bit off, when we try to translate back from the vector into language, our results will be just a little bit off too. If you’ve ever played around with Google Translate, you know that sometimes the results do come out just a little bit off, but our human minds are flexible enough to understand them anyway. Unfortunately, unlike human minds, computers are not too smart that way: If the source code we generate isn’t 100% correct, they just won’t do the right thing. This makes even our simplest generation problem — translation from one programming language to another — much harder than anything NLP has solved so far.

So what’s the bottom line? Can we do it?

The truth is that this is ongoing, cutting-edge research. It’s trying to push the boundaries of what machine learning and artificial intelligence can do. Sadly, I can’t tell you how we’re going to solve all these problems, or how fast we’ll achieve our ultimate goal of freely translating back and forth between machine and human languages.

Right now we’re in the process of open sourcing all our data as well as our codebase, to make it easy for people outside of GitHub to join in these efforts.

Our hope is that together, as a scientific community, we can make it possible for everyone to code.

If you’re interested in learning more, keep an eye on the GitHub blog; we’ll post something there as soon as the repo goes public. And of course feel free to reach out and get in touch!

Thank you!

Written by Tiferet Gazit
