Information Management

Data Mining for Meaning: The Law and Corpus Linguistics Project

Corpus linguistics is a method for investigating the meanings of words by analyzing naturally occurring language in systematic collections of texts called “corpora.” Linguists have long understood that corpora are a better guide to interpretation than native speaker intuition or even dictionaries. With advances in computer technology, the use of corpus linguistics for research has expanded dramatically, but legal scholars and judges have only recently begun to tap the potential of this method. In late summer/early fall 2017, the BYU Law Library will launch a Law & Corpus Linguistics interface to make it easier for scholars, judges, and practitioners to apply linguistic methods to better understand the meanings of words. Three primary corpora will be released initially: (1) the text of United States Supreme Court opinions, (2) a collection of texts from the founding era of the American Republic (1760-1799), and (3) a Corpus of Early Modern English. All text will be processed grammatically (e.g., parts of speech), by position/word order (e.g., where words fall in proximity to other words), and statistically (e.g., frequency). Additionally, the library is working on expanding the metadata associated with the text to aid in both linguistic and conceptual research (e.g., date of use or genre descriptors). The initial release will deliver three basic corpus linguistics tools: Concordance Lines (keywords displayed in context so researchers can see how words are used in any given set of texts); Frequency (counts of specific words, phrases, or sets of phrases to help researchers track the development of key legal concepts through time); and Collocates (words used around or in conjunction with other words, which give researchers a deeper understanding of how words are used). Hear about the development of the field of study, as well as the project and its successes (or failures) from launch through early July 2018.
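
For readers new to the three tools named above, the following is a minimal Python sketch of what concordance lines, frequency counts, and collocates compute. It is an illustration only, not the BYU interface or its implementation: the function names, the simple regular-expression tokenizer, the window sizes, and the three-sentence sample text are all assumptions made up for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberate simplification)."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines with up to `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def frequency(tokens, keyword):
    """Count how many times `keyword` occurs in the token list."""
    return Counter(tokens)[keyword]


def collocates(tokens, keyword, window=2):
    """Count the words that appear within `window` tokens of each keyword hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            neighbors = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbors)
    return counts


# Hypothetical sample text, invented for illustration (not drawn from any project corpus).
sample = (
    "The right to bear arms shall not be infringed. "
    "Soldiers bear arms in defense of the state. "
    "Citizens may bear witness in court."
)
tokens = tokenize(sample)

for line in concordance(tokens, "bear", width=2):
    print(line)
print("frequency:", frequency(tokens, "bear"))
print("collocates:", collocates(tokens, "bear", window=1).most_common(3))
```

A production corpus tool would also need part-of-speech tagging, metadata filters such as date or genre, and statistical association measures for collocates; this sketch uses raw counts only.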



Idea No. 164