Good First Issues
Iterating on Machine Learning Products
My first machine learning product at GitHub shipped recently. Its goal is to help people contribute to open source by recommending good first issues for them to get their feet wet. For example, the machine learning Topic suggests some beginner-friendly issues in the Keras project, among others.
I’ve since gotten some questions along the lines of “So what’s the algorithm?”, or “What kind of model did you use?”
The truth is that there’s no machine learning involved.
OK, what?
This is the first iteration. It’s the simplest solution I could come up with to start surfacing recommendations. It’s a data product, but not an ML product. It uses the data in our warehouse — together with some logic, some statistics, and some outright assumptions — to detect a set of labels with meanings similar to “good first issue”. It then ranks and surfaces open issues bearing these labels from large open source repos.
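To make that concrete, here's a minimal sketch of what this kind of heuristic could look like. The label synonyms, issue fields, and ranking rule below are illustrative assumptions, not GitHub's actual logic.

```python
# Illustrative sketch only: label names, data shape, and ranking are
# assumptions for demonstration, not GitHub's actual implementation.
BEGINNER_LABELS = {"good first issue", "beginner", "first-timers-only", "easy"}

def is_beginner_label(label):
    """Does a label's name look like a 'good first issue' synonym?"""
    return label.lower().strip() in BEGINNER_LABELS

def recommend(issues, limit=5):
    """Return open, beginner-labeled issues, most recently updated first."""
    candidates = [
        issue for issue in issues
        if issue["state"] == "open"
        and any(is_beginner_label(l) for l in issue["labels"])
    ]
    return sorted(candidates, key=lambda i: i["updated_at"], reverse=True)[:limit]

# Made-up sample data: only issue 1 is both open and beginner-labeled.
issues = [
    {"id": 1, "state": "open", "labels": ["Good First Issue"], "updated_at": "2019-05-01"},
    {"id": 2, "state": "closed", "labels": ["beginner"], "updated_at": "2019-05-02"},
    {"id": 3, "state": "open", "labels": ["bug"], "updated_at": "2019-05-03"},
]
print([i["id"] for i in recommend(issues)])
```

The real system works over warehouse-scale data with statistical label detection rather than a hand-written synonym set, but the shape of the logic is the same: detect beginner-friendly labels, filter to open issues, rank, surface.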
There aren’t any deep learning network architectures here — not yet. Even classical machine learning algorithms such as random forests or SVMs aren’t being employed. These are very expensive tools, usually requiring lots of research time and iteration before yielding their first fruits. They require a data acquisition pipeline, data preprocessing, feature extraction, model selection, and hyper-parameter tuning. They’re incredibly powerful, but a simpler first iteration lets us validate our assumptions and ensure we have a solid product and data foundation.
How to build a useful machine learning product in 4,295 easy steps
This project is still very much in its early phases, but it’s worth sharing how we designed and iterated on this machine learning product.
Some machine learning products address asks from customers, engineers, or product managers. Often, though, as in this case, they start with a data scientist saying, “Hey, I have an idea: What if we do X?” And that’s fine — often non-data-scientists don’t have a good enough sense of what’s possible with machine learning to come up with ideas.
Once an idea comes up, there’s a series of questions we should ask before jumping in.
How worthwhile is the project?
In the case of Good First Issues, I would break up my thinking on this as follows:
What can it do for GitHub?
- Increase the number of users that interact with our platform and contribute to open source projects.
What can it do for our users?
- Help open source projects attract more contributors. Having enough contributors to share the load of maintainership is critical to project health and prevention of open-source burnout.
- Help new contributors get involved and build their reputation.
Do I have access to the data I need to solve the problem?
Luckily, working at GitHub I have access to a wealth of data from open source repos, including issues, pull requests, labels, conversations, and various user contributions. Some preliminary data munging and exploratory data analysis convinced me that I have the data I need to tackle this problem.
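That kind of exploratory check can be as simple as counting how often beginner-style labels actually show up. A tiny sketch with made-up sample data (the real analysis runs over the data warehouse):

```python
# Hypothetical exploratory analysis: how frequent are beginner-style labels?
# The sample data below is invented for illustration.
from collections import Counter

sample_issue_labels = [
    ["good first issue", "bug"],
    ["enhancement"],
    ["Good First Issue"],      # same meaning, different casing
    ["easy", "help wanted"],
]

# Normalize casing so variants of the same label are counted together.
label_counts = Counter(
    label.lower() for labels in sample_issue_labels for label in labels
)
print(label_counts.most_common(3))
```

Even a quick frequency count like this tells you whether the signal you plan to build on exists in enough volume to be useful.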
Can I bring this to my intended audience?
Will it scale enough to productionize? Do I know where in my platform / product it can be surfaced? Do I have the infrastructure and engineering resources needed to build and ship it? These questions are occasionally forgotten in our excitement to build something cool. They require reaching out to product managers, designers, and engineers in other parts of the company. This often involves mutual education, to learn about their considerations while explaining machine learning capabilities and constraints. For exactly that reason, I love this stage and find I always learn a ton in the process.
Can I iterate on this problem?
Even if our assessment of the previous questions was very positive, there are always unforeseen pitfalls. Maybe we don’t have enough data, or the problem is too difficult for even state-of-the-art machine learning methods. Maybe priorities for engineering and product teams will change. Maybe users just won’t like this feature.
The only way to verify our assumptions is to ship a product and see our users interact with it. We want to reach that point as quickly as possible, before investing too much time in all the experimentation needed to create a really great ML model. It’s best to ship a quick first iteration, start measuring impact, and get feedback before investing more efforts. If we can use this first version — and our users’ interactions with it — to collect training data for the future ML algorithm, even better!
That’s what this MVP of Good First Issues is about. It relies heavily on the labels maintainers have given their issues, and can’t detect issues that were mislabeled or unlabeled. For some projects, it surfaces no recommended issues at all.
The user experience is equally preliminary. Right now we’re only showing Good First Issues on Topic pages, although eventually we want to show them everywhere repo recommendations are surfaced. We also have other applications in mind for the issue classifier — stay tuned! We chose the Topic pages as the first venue for this feature because they see enough traffic that we’ll get lots of user engagement, feedback, and data, but are not so critical to anyone’s workflow as to be disruptive while we iterate.
Even the infrastructure we use for serving recommendations in an API is still a work in progress. Machine learning products at GitHub are in their infancy, and our machine learning infrastructure is developing quickly.
What’s next?
The engineering team is polishing this first version while preparing to surface Good First Issues in more places.
The ML-infrastructure team is working to streamline the process of serving APIs in a general and customizable manner.
The machine learning team is working on building a dataset and a deep learning model to detect good first issues without relying on issue labels provided by maintainers. We’ll be able to use this model not only to recommend issues to users, but also to help projects triage, label, and sort issues, hopefully reducing the load on maintainers.
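As a rough illustration of the label-free direction, here's a sketch of a text classifier over issue titles. This is a classical stand-in (TF-IDF features plus a linear model) rather than the deep learning model described above, and the training data and labels are invented for the example.

```python
# A hedged sketch of a label-free issue classifier: TF-IDF over issue
# titles plus logistic regression. This is a simplified classical stand-in
# for the deep learning model; the data and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = [
    "Fix typo in README",
    "Add docstring to helper function",
    "Rewrite the distributed scheduler core",
    "Refactor internal event loop",
]
labels = [1, 1, 0, 0]  # 1 = plausible good first issue (invented labels)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(titles, labels)
preds = clf.predict(["Fix typo in docs", "Redesign storage engine"])
```

A model like this, trained on a much larger dataset, could score issues directly from their text, independent of whatever labels maintainers did or didn't apply.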
Each of these directions can now proceed in a parallel, data-driven fashion, informed by the user feedback from our initial release, ensuring our products delight our users and achieve our goals.
So go ahead and check out Good First Issues! Choose your favorite Topic, look at the listed projects, and consider tackling one of the beginner-friendly issues we recommend. We hope you’ll love it, and the more you interact with it the better we can make it!
A huge thanks to my collaborators: Justin Kenyon, Brandon Rosage, Lorena Mesa, and Ben Thompson. It’s a true pleasure working with all of you!