What difference do development aid projects make? Can we improve our ability to tell strong from weak projects, even before the project happens? How has the language of development changed over time? And how can machine learning (ML) help with all of these questions, and make those answers easy to access?

For the last few months, as part of my exploration of AI for democracy, I’ve been tackling those questions. To do so, I’ve been combining several publicly available datasets, each of which sheds light on a different facet of development projects. One dataset is from AidData. The second, from Professor Dan Honig and his collaborators, is the “project performance dataset”.

The third is from the World Bank itself, but consolidated, cleaned and made easily available for the first time (to our knowledge). We built a scraper that gathered every project information, approval, and review document from more than 20,000 projects, covering the evolution of development projects over more than 70 years. We’ve made the cleaned dataset available in multiple formats as a public good, and hope others explore it extensively. The dataset is available via Hugging Face’s dataset repository and the notebooks on GitHub (the scraper script is also available here).
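To give a sense of how the dataset can be used, loading it might look something like the sketch below. The dataset identifier and field layout here are placeholders for illustration; see the Hugging Face repository linked above for the real ones.

```python
# A minimal sketch of loading the scraped corpus from the Hugging Face Hub.
# The dataset ID below is a placeholder, not the real identifier; check the
# repository linked above for the actual name and column schema.
from datasets import load_dataset

docs = load_dataset("your-org/world-bank-project-documents")  # hypothetical ID

# Inspect one record: each row might hold a project ID, document type, and text.
print(docs["train"][0])
```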

(We also have a recorded webinar on these results: watch here.)

Development aid helps, a little

Coming to our results: we find, first, that aid does affect outcomes in health, education, and water and sanitation. But the effects are small (typically, a doubling of aid moves the needle by a percentage point or two) and fragile, sometimes shrinking or disappearing under alternative ways of measuring them. The amount of aid explains only one fifth to one tenth as much of the variation in outcomes as the strength of host country institutions does. So in aggregate aid can help, but weakly.

Those results were derived from fairly traditional econometric methods, with a little help from some modern machine learning techniques. We used more state-of-the-art methods to understand differences between aid projects, which is where the practical use comes in. We first created what are called “embeddings” of the project documents, generated with the latest generation of transformer models. Embeddings are quite technical but, grossly simplified, they turn texts into numerical representations that capture their similarity and difference. We then trained machine learning models to predict outcomes in the projects’ sectors (i.e. health, education, etc.) five years after each project ended, controlling for other effects. In doing so, we were also able to measure projects’ similarity to one another, using techniques similar to those behind the latest generation of language translation and plagiarism detection.
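As a rough sketch of the idea (the encoder model and example texts below are illustrative stand-ins, not the exact pipeline we used), turning documents into embeddings and comparing them can look like this:

```python
# A rough sketch of turning project documents into embeddings and comparing them.
# The model name and example texts are illustrative, not the ones used in this work.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any transformer encoder works

texts = [
    "Health systems strengthening project focused on rural clinics...",
    "Education sector loan supporting teacher training and curricula...",
]
embeddings = model.encode(texts, convert_to_tensor=True)

# Cosine similarity gives a simple measure of how alike two documents are.
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity.item())
```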

World Bank project documents, as seen by an AI/ML model (Luke Jordan).

Models can predict aid outcomes using degree of focus and contextualization

We found that the models could predict those lagged outcomes well: better than a 70% chance of correctly predicting a positive or negative outcome for an unseen project. The models were also able to explain more than 80% of the variation in the outcomes they saw. These results were robust under K-fold cross-validation, and the predictions did not correlate with simple word counts or other potential sources of “label leaking”.
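For readers curious what a K-fold check of this kind looks like in practice, here is a minimal sketch, assuming a feature matrix built from document embeddings and binary outcome labels. The classifier, fold count, and placeholder data are illustrative rather than our exact setup.

```python
# A minimal sketch of K-fold cross-validation on embedding features,
# assuming X (n_projects x embedding_dim) and y (binary outcomes).
# The random placeholder data below stands in for real embeddings and labels.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 384))   # placeholder document embeddings
y = rng.integers(0, 2, size=500)  # placeholder outcome labels

clf = GradientBoostingClassifier()
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```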

The most interesting result came from probing which features of projects the model paid most attention to. These boiled down to two things: sector focus and contextualization (how much a project differs from the average and is tailored to its context). The first was measured by the percentage of project funds in the project’s primary sector. The second we measured in a wholly new way, using the embeddings to quantify the distance between a project’s language and that of the average project in its sector globally. That distance, together with the structure of the language in the key project documents (which reflects the care taken in their preparation), was what the model learned to use to make its predictions.
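To make the contextualization measure concrete, here is one simple way such a distance could be computed: the cosine distance between a project’s embedding and the average embedding of projects in the same sector. This is a sketch under assumed variable names, not the exact construction in the forthcoming paper.

```python
# A sketch of one way to quantify contextualization: the cosine distance between
# a project's embedding and the average embedding of its sector.
# Variable names and data are illustrative, not the paper's exact construction.
import numpy as np

def contextualization(project_emb: np.ndarray, sector_embs: np.ndarray) -> float:
    """Cosine distance from the sector centroid; higher = more distinctive."""
    centroid = sector_embs.mean(axis=0)
    cos_sim = np.dot(project_emb, centroid) / (
        np.linalg.norm(project_emb) * np.linalg.norm(centroid)
    )
    return 1.0 - float(cos_sim)

# Example with random placeholder embeddings.
rng = np.random.default_rng(1)
sector = rng.normal(size=(200, 384))   # embeddings of all projects in one sector
project = rng.normal(size=384)         # embedding of one project's documents
print(contextualization(project, sector))
```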

The features of projects that the model detected as most important for predictions (full details in forthcoming paper) (Luke Jordan).

Veterans of development might be unsurprised that focused and well contextualized projects do best. But we’ve been able to generate new evidence for that tacit knowledge, and in some ways to quantify it. We can also make the models themselves publicly available; look out for shouldmycountrytakethisloan.com, perhaps coming soon.

Next steps and future exploration

We also now have a base of data and methods to explore further. We’ll be writing up a paper describing the methods, and our code is already open. We’re already extending the work to look at how contextualization changes over time; an early result is that it appears to decline in the last decade, perhaps as IT systems made it easier to copy and paste from other projects, or as development agencies became more centralized. We can also train models to detect differences between project evaluation scores and outcomes, i.e. to predict how much a project’s review will be gamed. Another early result is that this is possible, and that the models learn to pay attention to big loans (perhaps unsurprisingly) in certain regions. Some questions others might want to explore include:

  • Does development, as a system, learn from past successes and mistakes? It certainly changes, but do the contextual embeddings of project documents and reviews respond to prior experience (prior proposals, reviews, and outcomes), or do they evolve on some internal logic? Concretely: does when a project is conducted predict its embedding better or worse than the prior projects, reviews, and outcomes related to it? If the second is the better predictor, development may be learning; if the first, it is evolving in response to external factors. (A rough sketch of this comparison follows this list.)
  • How does development change as countries develop (or suffer shocks and declines)? Can contextualization and focus be predicted from countries’ income, size, past experience, or other factors?
  • Most of all, what actions can host countries and good-faith development practitioners take to detect contextualization, focus, and good projects in general? What can help make projects better?
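As flagged in the first bullet, here is a rough sketch of how the “learning versus evolving” comparison might be set up, using placeholder data and a simple regression; a real test would use actual embeddings and project histories rather than the synthetic values below.

```python
# A rough sketch of the "learning vs. evolving" test from the first bullet above:
# compare how well a project's embedding is predicted by (a) its approval year
# versus (b) features of prior related projects. All data here are placeholders.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n = 400
year = rng.integers(1970, 2020, size=(n, 1)).astype(float)  # approval year
prior = rng.normal(size=(n, 10))   # summary of prior projects, reviews, outcomes
target = rng.normal(size=n)        # one-dimensional summary of the embedding

r2_year = cross_val_score(Ridge(), year, target, cv=5, scoring="r2").mean()
r2_prior = cross_val_score(Ridge(), prior, target, cv=5, scoring="r2").mean()

# If prior-experience features explain more variance than the date alone,
# that is (weak) evidence the system is learning rather than just drifting.
print(f"year R2: {r2_year:.3f}, prior-experience R2: {r2_prior:.3f}")
```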

In all, it’s a rich vein, and one that, building on previous work on municipal bonds, confirms the potential of using new, advanced machine learning models to make the text in technical financial documents legible enough to predict outcomes that matter. That’s a mouthful, but it comes down to using one black box (advanced ML) to open up other black boxes (technical financial documents) in a way that allows greater access to highly consequential public action.

World Bank Headquarters, 2013: World Bank/IMF Spring Meetings (Flickr).