Graphing the World’s Largest Legal Database

How can AI help us uncover relationships between different parts of the legal system?

April 2021 / Christina Warren

Earlier this year, the Development Data Lab published data on a comprehensive survey of India’s court system, one of the largest sets of legal data in the world. We set out to explore this database as part of exploratory work on AI for democracy led by MIT GOV/LAB practitioner-in-residence Luke Jordan, which aims to better understand how artificial intelligence (AI) can augment democratic institutions and create new ones. We began by looking for areas where we could focus our attention on the accessibility of the law.

The Development Data Lab’s database consists of more than 80 million cases from over nine years. The data include details of the cases, information on their defendants and outcomes, the judges, the laws involved, and the ways in which each case used those laws, among other information.

Our sandbox: graphing the connections between judges, acts, and cases

A dataset of this size can be analyzed in many ways, and the Development Data Lab has already used it to look for bias in the legal system. We decided to turn it into a graph. Graphs focus not just on the content of the data but on the connections between elements of the data, revealing relationships we might not otherwise be able to see. Cases, judges, and acts (specific sections of the legal code), as well as districts and states, are represented as data points, or nodes in graph parlance. We can then show relationships, or edges, between these nodes: a judge judges a case, a case uses an act.
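As a rough illustration of this structure, the sketch below builds a tiny graph in plain Python. The node identifiers, the relationship labels, and the `add_edge` helper are all invented for the example; they are assumptions, not the project's actual schema or tooling.

```python
from collections import defaultdict

# Hypothetical nodes: identifier -> node type.
nodes = {
    "judge:1": "Judge",
    "case:101": "Case",
    "case:102": "Case",
    "act:IPC": "Act",
    "state:Karnataka": "State",
}

# Edges are stored as (source, relationship, target) triples.
edges = []
# Undirected neighbour sets, for looking up what a node is connected to.
neighbours = defaultdict(set)


def add_edge(source, relationship, target):
    """Record a labelled edge and update both endpoints' neighbour sets."""
    if source not in nodes or target not in nodes:
        raise KeyError("both endpoints must be known nodes")
    edges.append((source, relationship, target))
    neighbours[source].add(target)
    neighbours[target].add(source)


# Illustrative relationships (labels are assumptions).
add_edge("judge:1", "JUDGES", "case:101")
add_edge("judge:1", "JUDGES", "case:102")
add_edge("case:101", "USES", "act:IPC")
add_edge("case:102", "USES", "act:IPC")
add_edge("case:101", "FILED_IN", "state:Karnataka")

# Degree = number of distinct neighbours of each node.
degree = {node: len(neighbours[node]) for node in nodes}
```

Here `degree` is just a raw connection count; the algorithms discussed later look beyond such counts at how the connections are arranged.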

Graphs allow us to look at not only how often a case or law appears in the database, but also how central and connected different elements of the legal system are. Identifying an act or judge that is more central within a connected part of the system than its raw case count would suggest may help prioritize the energy and resources of legal activists, or guide further research. Knowing a judge or case’s centrality could help us understand how different legal acts affect different regions, or, for example, how gender bias might affect legal access and outcomes.

An example of a graph algorithm: PageRank

To explore these relationships, we looked at a couple of common graph algorithms. One is PageRank, which estimates the relative importance of each data point in a graph by considering whether it is connected to other important data points. The centrality of a node is determined both by its number of connections and by how important the nodes it connects to are. A highly central case might have been presided over by highly central judges, or have used highly central legal codes, while the centrality of those judges and codes might in turn come from being heavily used.
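The idea can be sketched with a short power-iteration version of PageRank. This is a minimal illustration under stated assumptions, not the implementation used for the project: the graph is treated as undirected, the damping factor of 0.85 is the conventional default rather than a value from the article, and the toy nodes are made up.

```python
def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected graph.

    adjacency: dict mapping each node to the set of its neighbours.
    Every node is assumed to have at least one neighbour.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {}
        for node in nodes:
            # Each neighbour shares its current score equally among its links.
            incoming = sum(rank[nb] / len(adjacency[nb]) for nb in adjacency[node])
            new_rank[node] = (1.0 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical toy graph: one act, three cases, two judges.
toy_graph = {
    "act:IPC": {"case:1", "case:2", "case:3"},
    "case:1": {"act:IPC", "judge:A"},
    "case:2": {"act:IPC", "judge:A"},
    "case:3": {"act:IPC", "judge:B"},
    "judge:A": {"case:1", "case:2"},
    "judge:B": {"case:3"},
}

scores = pagerank(toy_graph)
most_central = max(scores, key=scores.get)
```

In this toy graph the act comes out as the most central node, and `case:3` scores above `case:1` even though both have two connections, because `judge:B` passes its whole score to a single case. That is the sense in which a node's score depends on its neighbours and not only on its own count.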

After running PageRank, we found many of the most central nodes to be sections of the legal code, with scores several orders of magnitude higher than those of most nodes. The highest-scored acts are the Indian Penal Code, the Code of Criminal Procedure, the Motor Vehicles Act, and the Code of Civil Procedure. This gives us a sense of the relative importance of various legal codes.

More surprising are places where an act’s PageRank score diverges from the number of cases that cite it. For instance, while the Excise Act is used in considerably more cases than the Hindu Marriage Act (approximately 230,000 versus 190,000), the Hindu Marriage Act receives a higher PageRank score.

We also looked at the most central states and districts. Many states’ centrality rankings correspond to their population size, but, interestingly, several do not. For instance, Karnataka ranks as the eighth-largest state by population, but has the third-highest centrality in the graph. There is plenty of opportunity here to examine what causes these differences in centrality and how centrality relates to accessibility of the law.

Another algorithm: Louvain

Another widely used algorithm, Louvain, finds clusters of closely related data points. With this method we are able to find communities within the graph, such as sets of judges and legal acts that cluster together.

Louvain scores a graph by how well it can be separated into groups, a measure known as modularity. Lower scores indicate a more uniform graph, with linkages spread evenly across the nodes. Higher scores indicate a highly modularized graph, with many groups of nodes linked heavily to each other but only lightly to nodes outside the group. The legal graph is moderately modularized, which makes sense: the Indian legal system is to some extent unified by national laws, but justice is administered within states.
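The separability score described here is usually called modularity, and Louvain searches for the grouping that makes it as high as possible. The sketch below computes only the score for a given grouping, not the Louvain search itself, on a made-up graph of two triangles joined by one edge. The function name and the toy data are assumptions for illustration.

```python
def modularity(edge_list, community_of):
    """Modularity of an undirected, unweighted graph for a given grouping.

    edge_list: list of (u, v) pairs, each undirected edge listed once.
    community_of: dict mapping every node to a community label.
    """
    m = len(edge_list)
    internal = {}    # edges with both endpoints in the same community
    degree_sum = {}  # total degree of the nodes in each community
    for u, v in edge_list:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    # Fraction of edges inside each community, minus the fraction
    # expected if edges were placed at random with the same degrees.
    return sum(
        internal.get(c, 0) / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Hypothetical graph: two triangles connected by a single edge.
toy_edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]

split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
lumped = {node: 0 for node in "abcdef"}

score_split = modularity(toy_edges, split)    # two natural groups
score_lumped = modularity(toy_edges, lumped)  # everything in one group
```

Grouping the two triangles separately gives a clearly positive score, while lumping every node into one community gives zero, which matches the intuition that a higher score means a more cleanly separable graph.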

To further understand how the graph was modularized, we explored the makeup of communities. Most communities contain a single state node, meaning the cases are centered in that state. Interestingly, some communities contain multiple states, such as one that houses Andhra Pradesh (and Telangana, part of Andhra Pradesh until 2014), Assam, and Kerala.

We can also see which acts were associated with which communities. Looking at some of the most central acts found with the PageRank algorithm, the Indian Penal Code shared a community with the state of Bihar, while the Code of Criminal Procedure shared a community with Himachal Pradesh.
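One simple way to inspect the makeup of communities like these is to group nodes by community and by type. In the sketch below, the pairings of acts and states follow the examples mentioned in the text, but the community numbers and the `type:name` identifier format are hypothetical.

```python
from collections import defaultdict

# Hypothetical community assignments: node -> community id.
community_of = {
    "state:Andhra Pradesh": 0,
    "state:Telangana": 0,
    "state:Bihar": 1,
    "act:Indian Penal Code": 1,
    "state:Himachal Pradesh": 2,
    "act:Code of Criminal Procedure": 2,
}

# members[community][node_type] -> list of node names.
members = defaultdict(lambda: defaultdict(list))
for node, community in community_of.items():
    node_type, name = node.split(":", 1)
    members[community][node_type].append(name)

# Communities that contain more than one state node.
multi_state = sorted(c for c, m in members.items() if len(m["state"]) > 1)
```

With these assignments, `multi_state` picks out the community that holds both Andhra Pradesh and Telangana, and `members[1]["act"]` lists the acts that share a community with Bihar.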

Going forward

These are only a few examples of the insights we can gather from graph analysis. We can continue looking at the detailed interconnections within this graph, or change the relationships used and see whether different patterns emerge. We can also use the relationships as inputs to neural networks and see whether that improves accuracy in predicting case outcomes. These insights will guide our work as we integrate our findings and use these methods to develop our AI tools.

Like the Development Data Lab’s data, our graph is publicly accessible. If you have questions you think would be interesting to test, please get in touch.

—

Christina Warren (MIT ’21) is majoring in Computer Science and Writing. This semester she is a research intern with MIT GOV/LAB, supporting Luke Jordan on exploratory work looking at AI/ML and democracy.

Screenshot of one piece of the graph exploring India’s legal database (Christina Warren).