
The Building Legal Literacies for Text Data Mining Institute (“Building LLTDM”) was a project funded by the National Endowment for the Humanities in 2019. UC Berkeley Library directed Building LLTDM in June 2020, bringing together expert faculty from across the country to train 32 digital humanities researchers on how to navigate law, policy, ethics, and risk within text data mining projects.
What is text data mining?
If you were to crack open some popular English-language novels written in the 1850s (say, ones from Brontë, Hawthorne, Dickens, and Melville), you would find they describe men and women in very different terms. While a male character might be said to “get” something, a female character is more likely to have “felt” it. Whereas the word “mind” might be used when describing a man, the word “heart” is more likely to be used about a woman. Yet, as the 19th century became the 20th, these descriptive differences between genders actually diminished. How do we know all this? We can make this assertion because researchers (including David Bamman, of UC Berkeley’s School of Information) used automated techniques to extract information from the novels, and analyzed these word usage trends at scale. They crafted algorithms to turn the language of those novels into data about the novels.
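To make “turning language into data” concrete, here is a minimal sketch in Python. It is a toy illustration only, not the method the researchers used: the sample sentences and the function name `words_after_pronouns` are hypothetical, and counting the word that follows “he” or “she” is a deliberately crude stand-in for real linguistic analysis.

```python
import re
from collections import Counter

# Hypothetical toy sentences; a real study would process full novels.
SAMPLE_TEXT = (
    "He got the letter and he thought of home. "
    "She felt the cold and she felt afraid. "
    "He got up at once. She thought of him."
)

def words_after_pronouns(text):
    """Count the word that immediately follows each 'he' or 'she'."""
    # Lowercase the text and split it into simple word tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = {"he": Counter(), "she": Counter()}
    # Walk over adjacent token pairs and tally what follows each pronoun.
    for current, following in zip(tokens, tokens[1:]):
        if current in counts:
            counts[current][following] += 1
    return counts

counts = words_after_pronouns(SAMPLE_TEXT)
for pronoun in ("he", "she"):
    print(pronoun, counts[pronoun].most_common(2))
```

Run over thousands of novels rather than three sentences, tallies like these become a data set that can be compared across decades.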
In fields of inquiry like the digital humanities, the application of such automated techniques and methods for identifying, extracting, and analyzing patterns, trends, and relationships across large volumes of unstructured or thinly structured digital content is called “text data mining.” (You may also see it referred to as “text and data mining” or “computational text analysis.”) Text data mining provides humanists and social scientists with invaluable frameworks for sifting, organizing, and analyzing vast amounts of material. For instance, these methods make it possible to:
- Detect racial disparity by evaluating language from police body camera footage;
- Develop new tools to enable large-scale analysis of television series and photographs; and
- Capture and design new physical representations of naturally occurring laughter.
Why is help needed?
Until now, humanities researchers conducting text data mining have had to navigate a thicket of legal issues without much guidance or assistance. For instance, imagine that researchers need to scrape content about Egyptian artifacts from online sites or databases, or download videos about Egyptian tomb excavations, in order to conduct their automated analysis. Then imagine that the researchers also want to share these content-rich data sets with others to encourage research reproducibility or enable other researchers to query the data sets with new questions. This kind of work can raise issues of copyright, contract, and privacy law, not to mention ethics if, say, indigenous knowledge or cultural heritage materials are plausibly at risk. Indeed, in a recent study of humanities scholars’ text analysis needs, participants noted that access to and use of copyright-protected texts was a “frequent obstacle” in their ability to select appropriate texts for text data mining.
Potential legal hurdles do not just deter text data mining research; they also bias it toward particular topics and sources of data. In response to confusion over copyright, website terms of use, and other perceived legal roadblocks, some digital humanities researchers have gravitated to low-friction research questions and texts to avoid decision-making about rights-protected data. They use texts that have entered the public domain or materials that have been flexibly licensed through initiatives such as Creative Commons or Open Data Commons. When researchers limit themselves to such sources, their work is inevitably skewed, leaving important questions unanswered and rendering the resulting findings less broadly applicable. A growing body of research also demonstrates how race, gender, and other biases found in openly available texts have contributed to and exacerbated bias in developing artificial intelligence tools.
What was the Institute?
From June 23 to 26, 2020, we welcomed 32 digital humanities (DH) researchers and professionals to Building LLTDM. Our goal was to empower DH researchers, librarians, and professional staff to confidently navigate law, policy, ethics, and risk within digital humanities text data mining projects, so that they could more easily engage in this type of research and contribute to the further advancement of knowledge.
Participants learned about how the following law and policy matters pertain to text data mining research:
- Copyright
- Contracts & licensing
- Privacy
- Ethics
- Special use cases (e.g., international collaborations)
- Risk
Due to the global health crisis, we had to transform our planned in-person, intensive workshop into an interactive and relevant remote experience, which of course presented challenges for a design-thinking framework. How did we do this? The substantive content was pre-recorded and delivered in a flipped classroom model. Faculty created a series of short videos and shared readings relevant to the legal literacies. We also provided the video transcripts and slides to participants to promote accessibility and accommodate multiple learning styles.
We used Zoom to meet synchronously for discussion in groups of various sizes. We used Slack for asynchronous communication, and interactive tools such as Mural for design thinking exercises like journey mapping so that everyone could live edit and collaborate. We capped each day with a “happy half hour” on Zoom as an informal way to get to know each other a little better, even from afar.
We also relied on an institute moderator and daily writing exercises to reinforce the design-thinking stages and learning outcomes. Each night, we reviewed the participants’ free-writes and began the next morning by reflecting back to the participants the themes from what they had shared.
Project Team and Participants
Project Team
The Building LLTDM project team serves as faculty for the Institute. As a testament to the collaborative nature of this project and digital humanities research, we hail from more than a dozen North American universities and institutions, and are a collection of legal experts, librarians, faculty, and scholars immersed in digital humanities and research literacies.
Project Director: The Project Director will oversee curricular design and execution, as well as the administrative and operational aspects of Building LLTDM. The Project Director also serves as a Project Team member (below) helping to create and deliver educational materials and instruction. Rachael G. Samberg, Scholarly Communication Officer at UC Berkeley Library. A Duke Law graduate, Rachael practiced intellectual property litigation, and was a Lecturer in Law and Head of Reference & Instructional Services at Stanford’s law library. Rachael leads UC Berkeley Library’s Office of Scholarly Communication Services. She teaches throughout the country about copyright and information policy, and is a national presenter for ACRL’s Scholarly Communication Roadshow. Her chapter, Law & Literacy in Non-Consumptive Text Mining, appears in Copyright Conversations (ACRL, 2019).
Project Manager: In addition to serving as a Project Team member, the Project Manager will coordinate design and execution of the Project, and streamline administrative and operational aspects of Building LLTDM. Timothy Vollmer, Scholarly Communication and Copyright Librarian at UC Berkeley Library. Tim held various policy positions at Creative Commons over many years (public policy manager, 2015-2018; senior public policy manager, 2018-2019). He blogs on matters related to copyright policy, intellectual property, and advocacy. Previously he was at the American Library Association in various capacities, including as Assistant Director to the Program on Public Access to Information. He has a BA from the University of Wisconsin, Madison, and a master of science in information from the School of Information at the University of Michigan.
Project Team: Members contribute to institute administration and curricular design, and serve as instructors during the institute. Members have been designated as: humanities researchers (“HR”), librarians (“L”), or legal experts (“LE”). Their real-world roles straddle these boundaries (e.g. some legal experts are also librarians); yet, the divisions ensure that institute sessions are led by a set of experts who collectively offer a full range of relevant digital humanities expertise.
Scott Althaus (HR), Professor of Political Science & Communication, and Director of the Cline Center for Advanced Social Research at University of Illinois. Scott explores communication processes that support political accountability in democratic societies and that empower political discontent in non-democratic societies. He has a particular focus on data science methods for extreme-scale analysis of news coverage and cross-national comparative research on political communication.
David Bamman (HR), Assistant Professor at UC Berkeley’s School of Information. David applies natural language processing (NLP) and machine learning to empirical questions in the humanities and social sciences (research for which he has received NEH funding). He adds linguistic structure to statistical models of text, and develops core NLP techniques for languages and domains. Previously, he was a senior researcher at Tufts University’s Perseus Project.
Brandon Butler (LE), Director of Information Policy at the University of Virginia (UVA) Library. A UVA School of Law graduate, Brandon provides national guidance, education, and advocacy on intellectual property and related issues. He helped develop HathiTrust Research Center’s (HTRC) non-consumptive use policy and the Code of Best Practices in Fair Use for Academic and Research Libraries. Previously, Brandon taught copyright and supervised student attorneys in American University’s IP Law Clinic.
Beth Cate (LE), Associate Professor at Indiana University Bloomington’s School of Public and Environmental Affairs (SPEA). Previously, she was Indiana University’s Associate General Counsel, focusing on intellectual property law and policy and advising. Her scholarly interests include the role and efficacy of law in promoting innovation and shaping an intellectual property commons, and law and policy surrounding personal information.
Kyle K. Courtney (LE), Copyright Advisor for Harvard University, within the Office for Scholarly Communication. In 2014, he founded “Fair Use Week,” and his “Copyright First Responders” initiative is now deployed across the U.S. He also teaches research sessions at Harvard Law School. Kyle holds a J.D. with distinction in Intellectual Property Law and an MSLIS, and has received a Knight Foundation Grant to develop technology for crowdsourcing copyright and fair use decisions.
Sean Flynn (LE), Associate Director of the Program on Information Justice and Intellectual Property (PIJIP) and Professorial Lecturer in Residence. Professor Flynn designs and manages a wide variety of research and advocacy projects that promote public interests in intellectual property and information law. He holds a J.D. from Harvard Law.
Maria Gould (L), Research Data Specialist/Product Manager, California Digital Library. Maria supports services surrounding persistent identifiers for scholarly literature. Previously, she was Scholarly Communication & Copyright Librarian at UC Berkeley, where she provided guidance and instruction on copyright and information policy aspects of scholarly publishing.
Cody Hennesy (L), Journalism and Digital Media Librarian at University of Minnesota. At both Minnesota and UC Berkeley, where he was formerly the E-Learning and Information Studies Librarian, Cody’s work has focused on developing library services and support for TDM, and has addressed emerging literacies in DH and computational social sciences. He has spoken nationally on TDM and other DH topics, and co-authored the chapter Law & Literacy in Non-Consumptive Text Mining.
Eleanor Dickson Koehl (L), HathiTrust Digital Scholarship Librarian at the University of Michigan Libraries, and Associate Director for Outreach and Education, HTRC. Eleanor leads outreach and training for HTRC, and provides reference and support for scholars engaged in TDM. She chaired the working group that drafted HTRC’s 2016 Non-Consumptive Use Research Policy. She worked on an IMLS national forum to set a research agenda addressing TDM with use-limited data, and an IMLS curriculum development project to build a TDM “train-the-trainer” program.
Thomas Padilla (L), Visiting Digital Research Services Librarian at University of Nevada Las Vegas. Thomas is PI of the “Always Already Computational: Collections as Data” project and the “Collections as Data: Part to Whole” project. He is a member of various advisory boards and councils, including those of the Association for Computers and the Humanities, WhatEvery1Says, and Integrating digital humanities into the web of scholarship with SHARE.
Stacy Reardon (L), Literatures and Digital Humanities Librarian at UC Berkeley. Stacy provides guidance and instruction on digital humanities projects and methods. She is co-chair of UC Berkeley’s Digital Humanities Working Group and serves on the Scholarly Communication Expertise Group. She is also a doctoral candidate in Ethnic American literature at the University of Massachusetts, Amherst and has several years of experience in academic technology.
Matthew Sag (LE), Professor of Law at Loyola University Chicago School of Law, HathiTrust Research Center Advisory Board Member. Matthew teaches copyright law and intellectual property courses. He co-authored the Digital Humanities amicus briefs in the HathiTrust and Google Books cases. He was a legal advisor to the Code of Best Practices in Fair Use of Copyrighted Materials for the Visual Arts and the Code of Best Practices in Fair Use in Software Preservation. He co-founded ScotusOA.com, a site devoted to empirical analysis of Supreme Court oral argument.
Brianna L. Schofield (LE), Executive Director of Authors Alliance, a nonprofit dedicated to supporting authors and the public good. She has co-authored guides to open access, fair use, rights reversion, and publication contracts. Previously, she was a Teaching Fellow in the Samuelson Law, Technology & Public Policy Clinic at UC Berkeley School of Law. She holds a JD from UC Berkeley.
Megan Senseney (L), Head of the Office of Digital Innovation and Stewardship at University of Arizona Libraries. Megan was previously a research scientist for University of Illinois’ Center for Informatics Research in Science and Scholarship. Her scholarship focuses on social dimensions of data-intensive digital humanities initiatives, and digital training for humanities scholars. She was Co-PI on the IMLS-funded national forum on TDM with use-limited data.
Glen Worthey (L), Digital Humanities Librarian at Stanford University Libraries. Glen founded the Libraries’ Center for Interdisciplinary Digital Research (CIDR), and served as Program Committee co-chair for “Digital Humanities 2018” in Mexico City. Glen has served on executive boards of the Association for Computers and the Humanities, Text Encoding Initiative, and Alliance of Digital Humanities Organizations (for which he co-convenes the “DH in Libraries” Special Interest Group).
Consultant: During the institute, a legal expert will be on call via e-mail to field any legal questions that instructors are unable to answer in real time. Sara Benson (LE), Copyright Librarian at University of Illinois. Sara holds a JD from University of Houston Law Center and was a Lecturer at University of Illinois College of Law. She hosts the podcast ©hat (“Copyright Chat”).
Participants
- Ilya Akdemir, University of California, Berkeley
- Tara Baillargeon, Marquette University
- Trevor Burrows, Purdue University
- Matthew Cannon, University of California, Berkeley
- Nathan Carpenter, Illinois State University
- Ashleigh Cassemere-Stanfield, University of Chicago
- James Clawson, Grambling State University
- Mark Clemente, Case Western Reserve University
- Quinn Dombrowski, Stanford University
- Alyssa Fahringer, George Mason University
- Heather Froehlich, Penn State University
- Nicole Garlic, Temple University
- Casey Hampsey, New York University
- Devin Higgins, Michigan State University
- Christian Howard, Bucknell University
- Daniel Johnson, University of Notre Dame
- Spencer Keralis, University of Illinois
- Sarah Ketchley, University of Washington
- Melanie Kowalski, Emory University
- Barbara Levergood, Bowdoin College
- Jes Lopez, Michigan State University
- Rochelle Lundy, Seattle University
- Jon Marshall, University of California, Berkeley
- Jens Pohlmann, Stanford University
- Caitlin Pollock, University of Michigan
- Sarah Potvin, Texas A&M University
- Andrea Roberts, Texas A&M University
- Daniel Royles, Florida International University
- Hadassah St. Hubert, Florida International University
- Todd Suomela, Bucknell University
- Nicholas Wolf, New York University
- Madiha Zahrah Choksi, Columbia University
Project Outputs
To maximize the reach and impact of Building LLTDM, we published a comprehensive, openly licensed ebook titled Building Legal Literacies for Text Data Mining: What to Know & How to Teach It. This open educational resource covers copyright (both U.S. and international law), technological protection measures, privacy, and ethical considerations. It also helps other digital humanities professionals and researchers run their own similar institutes by describing in detail how we developed and delivered programming (including our pedagogical reflections and take-aways), and includes ideas for hosting shorter literacy teaching sessions. The resource (available as a web-book or in downloadable formats including PDF and EPUB) is in the public domain under the CC0 Public Domain Dedication, meaning it can be accessed, reused, and repurposed without restriction.
In addition to the OER, we’ve published a white paper that describes the institute’s origins and goals, project overview and activities, and reflections and possible follow-on actions.
Notice
Any views, findings, conclusions, or recommendations expressed in Building LLTDM do not necessarily represent those of the National Endowment for the Humanities.