News

Wrapping up our NEH-funded project to help text and data mining researchers navigate cross-border legal and ethical issues

Black and white photograph with grass and concrete with the word "finish" painted on the concrete in large capitalized letters. — Image via rawpixel, public domain

In August 2022, the UC Berkeley Library and Internet Archive were awarded a grant from the National Endowment for the Humanities (NEH) to study legal and ethical issues in cross-border text and data mining (TDM).

The project, entitled Legal Literacies for Text Data Mining – Cross-Border (“LLTDM-X”), supported research and analysis to address law and policy issues faced by U.S. digital humanities practitioners whose text data mining research and practice intersects with foreign-held or -licensed content, or involves international research collaborations.

LLTDM-X is now complete, resulting in the publication of an instructive case study for researchers and a white paper. Both resources are explained in greater detail below.

Project Origins

LLTDM-X built upon the previous NEH-sponsored institute, Building Legal Literacies for Text Data Mining. That institute provided training, guidance, and strategies to digital humanities TDM researchers on navigating legal literacies for text data mining (including copyright, contracts, privacy, and ethics) within a U.S. context.

A common challenge highlighted during the institute was the fact that TDM practitioners encounter expanding and increasingly complex cross-border legal problems. These include situations in which: (i) the materials they want to mine are housed in a foreign jurisdiction, or are otherwise subject to foreign database licensing or laws; (ii) the human subjects they are studying or who created the underlying content reside in another country; or, (iii) the colleagues with whom they are collaborating reside abroad, yielding uncertainty about which country’s laws, agreements, and policies apply.

Project Design

We designed LLTDM-X to identify and better understand the cross-border issues that digital humanities TDM practitioners face, with the aim of using these issues to inform prospective research and education. Secondarily, we hoped that LLTDM-X would also suggest preliminary guidance to include in future educational materials. In early 2023, we hosted a series of three online round tables with U.S.-based cross-border TDM practitioners and law and ethics experts from six countries.

The round table conversations were structured both to illustrate the empirical issues that researchers face and to give practitioners preliminary advice on legal and ethical challenges. Upon the completion of the round tables, the LLTDM-X project team created a hypothetical case study that (i) reflects the observed cross-border LLTDM issues and (ii) contains preliminary analysis to facilitate the development of future instructional materials.

We also charged the experts with providing responsive and tailored written feedback to the practitioners about how they might address specific cross-border issues relevant to each of their projects.

Guidance & Analysis

Case Study

Extrapolating from the issues analyzed in the round tables, the practitioners’ statements, and the experts’ written analyses, the Project Team developed a hypothetical case study reflective of “typical” cross-border LLTDM issues that U.S.-based practitioners encounter. The case study provides basic guidance to support U.S. researchers in navigating cross-border TDM issues, while also highlighting questions that would benefit from further research.

The case study examines cross-border copyright, contracts, and privacy & ethics variables across two distinct paradigms: first, a situation where U.S.-based researchers perform all TDM acts in the U.S., and second, a situation where U.S.-based researchers engage with collaborators abroad, or otherwise perform TDM acts both in the U.S. and abroad.

White Paper

The LLTDM-X white paper provides a comprehensive description of the project, including origins and goals, contributors, activities, and outcomes. Of particular note are several project takeaways and recommendations, which we hope will help inform future research and action to support cross-border text data mining. Our project takeaways touched on seven key themes:

Uncertainty about cross-border LLTDM issues indeed hinders U.S. TDM researchers, confirming the need for education about cross-border legal issues;
The expansion of education regarding U.S. LLTDM literacies remains essential, and should continue in parallel to cross-border education;
Disparities in national copyright, contracts, and privacy laws may incentivize TDM researcher “forum shopping” and exacerbate research bias;
License agreements (and the concept of “contractual override”) often dominate the overall analysis of cross-border TDM permissibility;
Emerging lawsuits about generative artificial intelligence may impact future understanding of fair use and other research exceptions;
Research is needed into issues of foreign jurisdiction, likelihood of lawsuits in foreign countries, and likelihood of enforcement of foreign judgments in the U.S. However, the overall “risk” of proceeding with cross-border TDM research may remain difficult to quantify; and
Institutional review boards (IRBs) have an opportunity to explore a new role or build partnerships to support researchers engaged in cross-border TDM.

Gratitude & Next Steps

Thank you to the practitioners, experts, and project team for making this project a success, and to the National Endowment for the Humanities for its generous funding.

We aim to broadly share our project outputs to continue helping U.S.-based TDM researchers navigate cross-border LLTDM hurdles. We will continue to speak publicly to educate researchers and the TDM community regarding project takeaways, and to advocate for legal and ethical experts to undertake the essential research questions and begin developing much-needed educational materials. And, we will continue to encourage the integration of LLTDM literacies into digital humanities curricula, to facilitate both domestic and cross-border TDM research.

[Note: this content is cross-posted on the UC Berkeley Library Update blog.]

Seeking U.S.-based text data mining researchers for participation in paid project

Are you a U.S.-based researcher who has done or wanted to do a computational text analysis (or “text data mining” / TDM) project on materials held in countries outside the U.S.? Have you ever collaborated with a colleague outside of the U.S. on a TDM project? Or conducted TDM on content created by people living outside the U.S.?

You may be eligible for up to $800 to tell us about your research as part of our NEH-funded Advancement Grant project: Legal Literacies for Text Data Mining, Cross-Border (LLTDM-X).

About LLTDM-X

Our project team has previously created guidance around copyright, licensing, privacy, and ethical issues for U.S. TDM researchers working with data in the U.S. But these legal and ethical issues necessarily become more complex when:

the materials you want to mine are housed in a foreign jurisdiction / are subject to foreign licensing or law,
the human subjects you are studying or who created the content you are studying reside in another country, or
the colleagues with whom you’re collaborating are abroad, and you are not sure whose law applies or what’s allowed.

We now want to help you text data mine corpora that are held or created beyond the U.S. border or that you access via foreign license agreements. We also want to help you collaborate with colleagues around the world on cross-border TDM projects.

You can help us help you, by sharing your experiences in a virtual roundtable discussion if you’ve ever done, or tried to do, any of the above. What law, policy, privacy, or ethics problems popped up, and what questions did you face or do you anticipate facing?

Eligibility & Application

The LLTDM-X team seeks to compensate 10 additional U.S.-based (living or working in U.S.) humanities and social sciences researchers with up to $800 stipends for discussing the legal and ethical issues they face or will face when conducting their cross-border TDM research.

Not sure if your TDM research counts as “cross-border”? We created this brief explanatory video to help you.

If, after watching the video, you think we’re describing your research and you want to participate in the LLTDM-X roundtable, please submit an application no later than 5 p.m. PST on November 4, 2022.

We will evaluate your application using the criteria described below. We will notify applicants in December 2022 about the results of the selection process.

Selection Criteria

The project team believes that the project will work best when it reflects the race and gender demographics of the broader population, and not just those of higher education. We will strive to achieve equity by reflecting these more representative demographics.

Additionally, we will work to develop a researcher participation group that is representative of different institution types, research advising and support experience, professional roles, levels of experience with digital humanities text data mining research, career stages, and disciplinary perspectives.

Our selection process will prioritize the following criteria:

Digital humanities researcher or professional
Experience working with at least one cross-border digital humanities text data mining project
Articulated interest in the relationship between text data mining and the law
Articulated reason for participating in the roundtable
Demonstrated commitment to diversity and equity

Participation

If we accept your application, you will be expected to complete approximately 6-8 hours of work, consisting of:

Preparation for Roundtable (~3 hours): Researchers will each write up a 2-page description of their TDM research, and cross-border law and policy challenges they have faced or that they expect will affect or impede them. The description will be due in February 2023.
Participation in Roundtable (~3 hours): Researchers will share and explain their experiences in the first half of the virtual Roundtable. In the second half, legal and ethical experts will interact with researchers and ask them questions in order to inform the experts’ law and policy analysis. The roundtable will be held in February or March 2023.

Questions

If you have any questions not answered above or in our brief explanatory video, contact schol-comm@berkeley.edu.

We look forward to receiving your application by 5 p.m. PST on November 4, 2022.

UC Berkeley Library and Internet Archive co-directing project to help text data mining researchers navigate cross-border legal and ethical issues

We are excited to announce that the National Endowment for the Humanities (NEH) has awarded nearly $50,000 to UC Berkeley Library and Internet Archive to study legal and ethical issues in cross-border text data mining. The funding was made possible through NEH’s Digital Humanities Advancement Grant program.

NEH funding for the project, entitled Legal Literacies for Text Data Mining – Cross-Border (“LLTDM-X”), will support research and analysis to address law and policy issues faced by U.S. digital humanities practitioners whose text data mining research and practice intersects with foreign-held or -licensed content, or involves international research collaborations.

LLTDM-X builds upon the highly successful Building Legal Literacies for Text Data Mining Institute (Building LLTDM), previously funded by the NEH in 2019. UC Berkeley Library directed Building LLTDM in June 2020, bringing together expert faculty from across the country to train 32 digital humanities researchers on how to navigate law, policy, ethics, and risk within text data mining projects. (All of the results and impacts are summarized in the white paper here.)

In Building LLTDM’s instructional sessions and post-workshop evaluations, participants identified cross-border research collaborations as an ongoing and critical legal and policy problem, and they also noted that foreign law and ethics issues pervaded their research. UC Berkeley Library’s Office of Scholarly Communication Services partnered with Internet Archive to begin to address these essential needs, and LLTDM-X sprang to life.

Why is LLTDM-X needed?

Text data mining, or TDM, is an increasingly essential and widespread research approach. TDM relies on automated techniques and algorithms to extract revelatory information from large sets of unstructured or thinly-structured digital content. These methodologies allow scholars to identify and analyze critical social, scientific, and literary patterns, trends, and relationships across volumes of data that would otherwise be impossible to sift through.

While TDM methodologies offer great potential, they also present scholars with nettlesome law and policy challenges that can prevent them from understanding how to move forward with their research. Building LLTDM trained TDM researchers and professionals on essential principles of copyright, licensing, and privacy law, as well as ethics—thereby helping them move forward with impactful digital humanities research.

As Building LLTDM revealed, United States digital humanities scholars do not conduct text data mining research only in or about the U.S. Further, digital humanities research in particular is marked by collaboration across institutions and geographical boundaries. Yet, U.S. practitioners encounter expanding and increasingly complex cross-border problems.

For example, U.S. contract law may supersede rights under copyright, such that a U.S. database license agreement may prohibit text data mining and other fair uses, whereas UK licenses cannot. Therefore U.S. TDM practitioners collaborating with UK-based colleagues face impactful choices about which agreements to apply, as this may determine whether text data mining is permitted. In the U.S., “breaking” technological protection measures to conduct text data mining is now authorized within certain parameters, yet other jurisdictions prohibit such work or apply different conditions. U.S. text data mining researchers must accordingly consider how they work with internationally-held or -licensed materials or collaborators.

There are at least three “cross-border” TDM scenarios that scholars must parse: (i) if the materials they want to mine are housed in a foreign jurisdiction, or are otherwise subject to foreign database licensing or laws; (ii) if the human subjects they are studying or who created the underlying content reside in another country; or (iii) if the colleagues with whom they are collaborating reside abroad, yielding uncertainty about which country’s laws, agreements, and policies apply.

U.S. researchers are uncertain about how to navigate each of these scenarios. As evidenced in an informal survey that we conducted with digital humanities scholars, 70% of respondents reported cross-border copyright questions, 72% reported uncertainty about cross-border licensing terms, 52% noted privacy issues, and 48% identified ethical concerns. This confusion greatly impacted their TDM research. Twenty-eight percent (28%) of respondents confirmed that these cross-border copyright, licensing, privacy, or ethical issues impeded or prevented their project entirely. Of equal concern is that 40% of responding practitioners reported hesitation to share their workflows, methodology, or sources because of possible cross-border LLTDM issues. Without transparency, findings are deemed unreliable and scholarship may be rejected for publication. These problems will only mount given the increasing collaborativeness of research and the substantial amount of cross-border research occurring.

How will LLTDM-X help the world?

Our long-term goal is to design instructional materials and institutes to support digital humanities TDM scholars facing cross-border issues, but our first step with LLTDM-X is getting a better handle on the specific law and policy challenges they face.

Through a series of virtual roundtable discussions, and accompanying legal research and analysis, LLTDM-X will surface these cross-border issues and begin to distill preliminary guidance to help scholars in navigating them.

The first roundtable will engage U.S. digital humanities text data mining practitioners in sharing their cross-border TDM experiences. U.S. and global law and ethics experts will help guide the roundtable discussion to elicit the contours of practitioner experiences. During two subsequent roundtables—one focusing on cross-border copyright and licensing, and another on cross-border privacy and ethics—the experts will discuss practitioners’ hurdles in depth, and begin to develop customized guidance.

After the roundtables, we will work with the law and ethics experts to create instructive case studies that reflect the types of cross-border TDM issues practitioners encountered. These case studies will incorporate recommendations to help a broad audience of U.S. digital humanities text data mining practitioners navigate LLTDM-X concerns. Case studies, guidance, and recommendations will be widely disseminated via an open access report to be published at the completion of the project. Most importantly, they will be used to inform our future educational offerings.

An experienced team

The team for LLTDM-X (introduced below) is eager to get started. The project is co-directed by Thomas Padilla, Deputy Director, Archiving and Data Services at Internet Archive.

“LLTDM-X responds strategically to a pervasive challenge that needlessly complicates, inhibits, and weakens the fullest potential of research. This work paves a critical path toward building future training institutes that address cross-border legal issues in TDM. At Internet Archive we’re committed to supporting universal access to all knowledge—LLTDM-X couldn’t be more clearly aligned with what we hope to achieve. We look forward to working with our partners at UC Berkeley Library and the wider community to advance this work.”

Rachael Samberg, who leads UC Berkeley Library’s Office of Scholarly Communication Services and oversaw Building LLTDM, joins Thomas as co-director and explains that:

“We are ready to begin analyzing and sorting out the complex legal challenges for digital humanities TDM researchers. We’ve already secured an incredible group of international legal and ethics experts to conduct the analyses, and will share more on that soon. In the meantime, we are gearing up to build out an even larger group of participating scholars whose experiences will help us create case studies.”

On behalf of the entire project team, we would like to thank NEH’s Office of Digital Humanities again for funding this important work. We invite you to contact us with any questions you may have.

Thomas Padilla (Project Director): Thomas is Deputy Director, Archiving and Data Services at Internet Archive, and has deep experience cultivating library, archive, and museum ability to support TDM research. He has previously served as Principal Investigator of the Andrew W. Mellon Foundation-supported Collections as Data: Part to Whole and the Institute of Museum and Library Services-supported Always Already Computational: Collections as Data, and as author of the library community research agenda Responsible Operations: Data Science, Machine Learning, and AI in Libraries. In addition, Padilla was an expert faculty member for Building LLTDM, the precursor to LLTDM-X.

Rachael Samberg (Project Co-Director): Rachael is Scholarly Communication Officer & Program Director of the University of California, Berkeley Library’s Office of Scholarly Communication Services. She served as Project Director and legal expert for Building LLTDM. A Duke Law graduate, Rachael practiced intellectual property litigation at Fenwick & West LLP for seven years before spending six years at Stanford Law School’s library, where she was Head of Reference & Instructional Services and a Lecturer in Law. Rachael speaks throughout the country about copyright and TDM issues, about which she is widely published. Her chapter, Law & Literacy in Non-Consumptive Text Mining, was published in Copyright Conversations (ALA, 2019).

Stacy Reardon (Project Team Member): Stacy Reardon is Literatures and Digital Humanities Librarian at the University of California, Berkeley Library, where she provides guidance and instruction on digital humanities projects and methods. Stacy served as a library expert on the Project Team for the NEH-funded Building Legal Literacies for Text Data Mining. She is co-chair of UC Berkeley’s Digital Humanities Working Group, and received her Ph.D. in literature from the University of Massachusetts, Amherst.

Timothy Vollmer (Project Manager): Timothy Vollmer is Scholarly Communication and Copyright Librarian at UC Berkeley Library. He served as Project Manager for the NEH-funded Building Legal Literacies for Text Data Mining. Tim worked as a senior public policy manager for Creative Commons, and contributed to writing and advocacy on the text data mining exceptions in the EU’s Directive on Copyright in the Digital Single Market. He formerly was the Assistant Director to the Program on Public Access to Information at the American Library Association.

Now available: Open educational resource of Building Legal Literacies for Text Data Mining

Last summer we hosted the Building Legal Literacies for Text Data Mining institute. We welcomed 32 digital humanities researchers and professionals to the weeklong virtual training, with the goal of empowering them to confidently navigate law, policy, ethics, and risk within digital humanities text data mining (TDM) projects. Building Legal Literacies for Text Data Mining (Building LLTDM) was made possible through a grant from the National Endowment for the Humanities.

Following the remote institute in June 2020, the participants and project team reconvened in February 2021 to discuss how participants had been thinking about, performing, or supporting TDM in their home institutions and projects with the law and policy literacies in mind.

To maximize the reach and impact of Building LLTDM, we have now published a comprehensive open educational resource (OER) of the contents of the institute. The OER covers copyright (both U.S. and international law), technological protection measures, privacy, and ethical considerations. It also helps other digital humanities professionals and researchers run their own similar institutes by describing in detail how we developed and delivered programming (including our pedagogical reflections and take-aways), and includes ideas for hosting shorter literacy teaching sessions. The resource (available as a web-book or in downloadable formats such as PDF, EPUB, and MOBI) is in the public domain under the CC0 Public Domain Dedication, meaning it can be accessed, reused, and repurposed without restriction.

In addition to the OER, we’ve also published a white paper that describes the institute’s origins and goals, project overview and activities, and reflections and possible follow-on actions.

Thank you to the National Endowment for the Humanities, the project team, institute participants, and staff at the UC Berkeley Library for making Building LLTDM a success.

[Note: this content is cross-posted on the UC Berkeley Library Update blog.]

What happened at the Building LLTDM Institute

On June 23-26, we welcomed 32 digital humanities (DH) researchers and professionals to the Building Legal Literacies for Text Data Mining (Building LLTDM) Institute. Our goal was to empower DH researchers, librarians, and professional staff to confidently navigate law, policy, ethics, and risk within digital humanities text data mining (TDM) projects—so they can more easily engage in this type of research and contribute to the further advancement of knowledge. We were joined by a stellar group of faculty to teach and mentor participants. Building LLTDM is supported by a grant from the National Endowment for the Humanities.

Why was the Institute needed?

Until now, humanities researchers conducting text data mining in the U.S. have had to maneuver through a thicket of legal issues without much guidance or assistance. As an example, take a researcher scraping content about Egyptian artifacts from online sites or databases, or downloading videos about Egyptian tomb excavations, in order to conduct automated analysis about religion or philosophy. The researcher then shares these content-rich data sets with others to encourage research reproducibility or enable other researchers to query the data sets with new questions. This kind of work can raise issues of copyright, contract, and privacy law. It can also raise concerns around ethics, for example, if there are plausible risks of exploitation of people, natural or cultural resources, or indigenous knowledge.

Potential law and policy hurdles do not just deter text data mining research: They also bias it toward particular topics and sources of data. In response to confusion over copyright, website terms of use, and other perceived legal roadblocks, some digital humanities researchers have gravitated to low-friction research questions and texts to avoid making decisions about rights-protected data. When researchers limit their research to such sources, it is inevitably skewed, leaving important questions unanswered, and rendering resulting findings less broadly applicable.

Moving an interactive, design-thinking Institute online

After months of preparation, we had been looking forward to working and learning together at UC Berkeley, but the world had other plans for our Institute. Due to the global health crisis, we had to transform our planned in-person, intensive workshop into an interactive and relevant remote experience.

How did we do this? The pandemic meant we had to transition everything online, which of course presents challenges for a design-thinking framework. We are thrilled that our approach to interactive remote pedagogy was successful! (You can check out the schedule and framework in our Participant Packet.) The substantive content was pre-recorded and delivered in a flipped classroom model. Faculty created a series of short videos, and shared readings relevant to the legal literacies. We also provided the video transcripts and slides to participants to promote accessibility and accommodate multiple learning styles.

We used Zoom to meet synchronously for discussion in groups of various sizes. We used Slack for asynchronous communication, and interactive tools such as Mural for design thinking exercises like journey mapping so that everyone could live edit and collaborate. We capped each day with a “happy half hour” on Zoom as an informal way to get to know each other a little better, even from afar.

We also relied on an institute moderator and daily writing exercises to reinforce the design-thinking stages and learning outcomes. Each night, we reviewed the participants’ free-writes and began the next morning by reflecting back to the participants the themes from what they had shared.

A collection of themes from our morning plenary reflections.

Reflections on goals: social justice & effective empowerment

One of our priorities for the Institute was to invite a diverse pool of participants, including those involved in social justice research, in order to maximize the public value impact of Building LLTDM. We looked for demonstrated commitments to diversity and equity but could hardly have imagined the breadth and depth of experiences that applicants were willing to share. The selected participants research everything from understanding “place” data from community histories of historic African American settlements, to the development of AIDS activist networks in communities of color, to portrayals of autism in literature, and more. Others demonstrated a commitment to bringing back the skills they learn to expand TDM opportunities for students and communities who have traditionally been marginalized or under-resourced. They also represented a variety of institution types, research advising and support experience, professional roles, levels of experience with TDM, career stages, and disciplinary perspectives.

We are also moved by the participants’ own reflections on the experience. One of the last interactive exercises we hosted during the online Institute was a collective week-in-review discussion and gratitude wall. We asked the participants to share what they were thankful for, highlighting other participants where possible. So many of the participants wrote about how valuable the learning experience was and how thoughtfully it was put together and delivered.

Digital stickies from our week-in-review and gratitude wall.

We can’t express the transformational impact of the week better than the participants themselves. In Institute evaluation forms, they shared feelings like:

“This is by far the best organized event that I have ever attended. The content was by far the most substantive. The faculty were by far the most engaged. A+ across the board.”
“I am so grateful to have had the opportunity to engage with a diverse group of scholars (researchers and professionals)… The deliberately thought through breakdown and mix fostered incredibly valuable discussions and I would hope this kind of framework is used as a best practice for future DH institutes of all kinds going forward. Also, thank you for such an amazing virtual experience which I can only imagine took a tremendous amount of work to coordinate and plan with limited time to shift to an entirely different format–I was overjoyed to critically engage with complex subjects…”
“This has been phenomenal. I don’t want to qualify it (by adding something like “…for having to be moved online”), because it’s been so, so good: well organized, thoughtful, and human throughout.”
“There was clearly so much thought, care, and planning that went into the preparation of this institute, and it was an amazing opportunity to learn from a group of people — organizers, faculty, and participants — who all have such deep expertise. The video and readings lists alone are a huge resource, but to be able to process and reflect on that material together with a diverse group of people was really wonderful.”

Next steps, and our own gratitude

What’s next for Building LLTDM? The “Institute” is not over yet; only the 1-week training is complete. The cohort will be meeting again virtually in February 2021 to discuss how implementation of the literacies into our local communities and practices has gone. In the meantime, as the participants bring back the law and policy literacies they’ve learned to their home institutions, we are excited to see several cohort members already organizing their own post-Institute research subgroups, such as those whose TDM work relies heavily on social media content, and others who are exploring how to disseminate the Building LLTDM literacies within other instructional formats and frameworks.

As part of the grant, the project team will also be aggregating the resources from the Institute and developing supplementary material for an Open Educational Resource (OER). We know there is a large community of TDM researchers and professionals who may be interested in or who can benefit from these materials, and the OER will be made available for broad reuse in the public domain.

Thank you to all the participants for their insights and contributions, willingness to share, and flexibility in transitioning to a fully-remote Institute. Thank you to all the faculty for their unmatched legal and policy expertise, ongoing commitment to mentorship, and adaptability in content creation and delivery. And thank you again to the NEH for making such a meaningful experience possible.

Welcoming the Building LLTDM Cohort!

Our project team is thrilled to announce the cohort of participants for this summer’s “Building Legal Literacies for Text Data Mining” (Building LLTDM) Institute. Building LLTDM, supported by the National Endowment for the Humanities, will bring together 32 digital humanities (DH) researchers and professionals for an intensive training course on the UC Berkeley campus from 23-26 June 2020. The goal is for participants to be able to build, mine, and publish corpora with a solid approach for navigating the legal and ethical choices they will make along the way.

Announcing our participants

The application process was very competitive, and we received incredibly strong applications from DH researchers and professionals around the country. Please join us in warmly welcoming the following 15 DH researchers and 17 DH professionals (librarians and other staff):

Ilya Akdemir, University of California, Berkeley
Tara Baillargeon, Marquette University
Trevor Burrows, Purdue University
Matthew Cannon, University of California, Berkeley
Nathan Carpenter, Illinois State University
Ashleigh Cassemere-Stanfield, University of Chicago
James Clawson, Grambling State University
Mark Clemente, Case Western Reserve University
Quinn Dombrowski, Stanford University
Alyssa Fahringer, George Mason University
Heather Froehlich, Penn State University
Nicole Garlic, Temple University
Casey Hampsey, New York University
Devin Higgins, Michigan State University
Christian Howard, Bucknell University
Daniel Johnson, University of Notre Dame
Spencer Keralis, University of Illinois
Sarah Ketchley, University of Washington
Melanie Kowalski, Emory University
Barbara Levergood, Bowdoin College
Jes Lopez, Michigan State University
Rochelle Lundy, Seattle University
Jon Marshall, UC Berkeley
Jens Pohlmann, Stanford University
Caitlin Pollock, University of Michigan
Sarah Potvin, Texas A & M University
Andrea Roberts, Texas A & M University
Daniel Royles, Florida International University
Hadassah St. Hubert, Florida International University
Todd Suomela, Bucknell University
Nicholas Wolf, New York University
Madiha Zahrah Choksi, Columbia University

Each participant will receive a stipend intended to cover all costs of attendance. The stipends will also be distributed in advance of the institute, to further promote equity and support social justice by minimizing the need for any personal financial investments.

This group will be traveling from 15 states, demonstrating the widespread interest in and need for TDM legal literacies training across the country.

When the project team set out to build this cohort, we looked for demonstrated commitments to diversity and equity but could hardly have imagined the breadth and depth of experiences that applicants were willing to share. The 32 participants have worked on wide-ranging and impactful DH TDM projects, such as understanding: “place” data from community histories of historic African American settlements; the development of AIDS activist networks in communities of color; portrayals of autism in literature; and more. Others expressed intentions to bring back the skills they learn to expand TDM opportunities for students and communities who have traditionally been marginalized or under-resourced. The participants are also representative of different institution types, research advising and support experience, professional roles, levels of experience with TDM, career stages, and disciplinary perspectives.

Curriculum preview

The participants all share at least one thing in common: They have wrestled with navigating law and policy issues related to copyright, contracts and licenses, privacy, or ethics inherent in DH TDM research.

Many reported having shied away from using in-copyright texts as the raw materials for their data mining for fear of potential repercussions, especially if they hoped to publish a portion of their corpus for others to be able to replicate results, or from which others could make new queries.
Often, applicants said they were unsure about the permissibility of conducting TDM on social media content from platforms like Twitter or Reddit due to unclear website terms of service, or confusion around what the platforms’ application programming interfaces (APIs) permit or prohibit. They also raised questions of ethics and authorial intent when mining this content: For instance, Twitter users may have intended their Tweets to be “public,” but what unintended harms might result from extracting trends across users, or aggregating public content and making certain messages more discoverable?
Some reported research roadblocks related to privacy considerations, including how to manage potentially sensitive information within special collections on communities and individuals who are still living.

The four-day Institute will address all of these issues and more. Our robust faculty of legal experts, DH researchers, and librarians will rely on a design thinking structure incorporating experiential methodologies—including dialogue, case studies, and real-world skill-building exercises. We’ll also allocate time for the cohort to design implementation plans to maximize knowledge dissemination post-Institute. An institute moderator will be encouraging personal and group reflection to reinforce learning outcomes.

We will publish a more detailed agenda for everyone, but here is a sneak peek:

Day 1 (23 June 2020):

After group introductions, the focus of Day 1 is on understanding how DH TDM researchers and professionals encounter laws and policies in their work, and the struggles participants have faced. Participants will have meaningful opportunities to share stories to build an understanding of the questions and problems that have arisen. We will begin to identify and discuss themes and shared terminologies to facilitate communication.

Day 2 (24 June 2020):

On Day 2, we will master the worlds of copyright and contracts through risk-informed skill building. We’ll cover the rights and limits of fair use, and how to navigate copyright in corpus creation, mining, and publishing. We’ll then transition into understanding how website terms of use, APIs, and license agreements can factor into TDM decision-making. Through engaging discussions and exercises, participants will practice applying these skills to their own experiences and other real-world research contexts.

Day 3 (25 June 2020):

Utilizing the same discourse-based and risk-informed approaches, on Day 3 we’ll turn to a nuanced exploration of privacy, ethics, and free speech. We’ll also cover special use cases related to DH TDM research, like international laws and researcher collaborations, digital rights management protections, and more.

Day 4 (26 June 2020):

The focus of Day 4 is on prototyping plans for integrating Institute skills and literacies into participants’ own practices and institutions. The cohort will develop personal and community “Implementation Mapping” plans to identify actionable next steps. We will also debrief on learning outcomes and discuss opportunities to build communities of practice.

Public engagement

We know there is a large community of DH TDM researchers and professionals (and beyond) who may be interested in the Institute curriculum and related educational materials. As a reminder, following the Institute we will be publishing an open educational resource that includes all of our instructional materials, slides, lecture notes, discussion prompts, and guided activities. We will also include additional modular content and best practices so that these resources can serve readers in a variety of contexts.

We’re excited to work and learn together with this terrific cohort of Building LLTDM participants. And we sincerely thank everyone who took the time and effort to submit an application for the Institute. If you’d like to follow along with our public announcements, please check in with #BuildingLLTDM on Twitter.

Call for Participants

Join us June 23-26, 2020 to gain the skills you need for navigating law, policy, ethics, and risk in digital humanities text and data mining projects. Apply to attend #BuildingLLTDM.

Building Legal Literacies for Text Data Mining (“Building LLTDM”) is an Institute for Advanced Topics in the Digital Humanities, and has been made possible by a grant from the National Endowment for the Humanities.

What is the purpose of the Building LLTDM Institute?

Our project team wants to empower digital humanities researchers and professionals (librarians, consultants, and other institutional staff) to confidently navigate United States law, policy, ethics, and risk within digital humanities text data mining projects — so that you can more easily engage in this type of research and contribute to the advancement of knowledge.

Why is help needed?

Until now, humanities researchers conducting text data mining in the U.S. have had to maneuver through a thicket of legal issues without much guidance or assistance. As an example, take a researcher scraping content about Egyptian artifacts from online sites or databases, or downloading videos about Egyptian tomb excavations, in order to conduct automated analysis about religion or philosophy. The researcher then shares these content-rich data sets with others to encourage research reproducibility or enable other researchers to query the data sets with new questions. This kind of work can raise issues of copyright, contract, and privacy law. Indeed, in a recent study of humanities scholars’ text analysis needs, participants noted that access to and use of copyright-protected texts was a “frequent obstacle” in their ability to select appropriate texts for mining. It can also raise concerns around ethics, for example, if there are plausible risks of exploitation of people, natural or cultural resources, or indigenous knowledge.

Potential legal hurdles do not just deter text data mining research: They also bias it toward particular topics and sources of data. In response to confusion over copyright, website terms of use, and other perceived legal roadblocks, some digital humanities researchers have gravitated to low-friction research questions and texts to avoid making decisions about rights-protected data. When researchers limit their research to such sources, it is inevitably skewed, leaving important questions unanswered, and rendering resulting findings less broadly applicable. A growing body of research also demonstrates how race, gender, and other biases found in openly available texts have contributed to and exacerbated bias in developing artificial intelligence tools.

When & where is the Building LLTDM Institute?

Building LLTDM will be hosted on the UC Berkeley campus, in Berkeley, California, from June 23-26, 2020.

Who is eligible to participate?

The Institute supports 32 participants — 16 digital humanities researchers and 16 digital humanities professionals. Digital humanities professionals are people like librarians, consultants, and other institutional staff who conduct digital humanities text data mining or aid researchers in their text data mining research.

Due to various restrictions on funding and terms, participants must be based in the United States.

Where possible, we encourage participation from pairs of participants (e.g. one digital humanities researcher and one professional affiliated with that same institution, organization, or digital humanities project).

Who is teaching the Institute?

The Institute will be taught by a combination of experienced legal scholars, digital humanities professionals, librarians, faculty, and researchers — all of whom are immersed in the Institute’s subject literacies and workflows. For a list of instructors, please see our Project Team page.

What will the Institute cover?

You will learn how the following law and policy matters pertain to text data mining research:

Copyright
Contracts & licensing
Privacy
Ethics
Special use cases (e.g. international collaborations)
Risk management

The Institute will teach you foundational skills to:

Navigate law, policy, ethics, and risk within digital humanities text data mining projects
Integrate workflows for these law and policy issues into your text data mining research and professional support
Practice sharing these new tools through authentic consultation exercises
Prototype plans for broadly disseminating your new knowledge
Develop communities of practice to promote cross-institutional outreach about the digital humanities text data mining legal landscape

How will the Institute be structured?

To help you build skills tailored for your own digital humanities research agendas, the Institute incorporates a design thinking structure reliant upon experiential methodologies. The program will model four stages in design thinking: empathize, define, ideate, and prototype. The “testing” phase of design thinking will occur post-institute when participants implement the knowledge and solutions they developed, and report back.

The institute also offers an ample instructor-to-attendee ratio to accommodate the highly immersive and discursive aspects of a design thinking approach. To that end, a librarian, legal expert, and researcher instructor will co-teach each session. To reinforce deliberations about ideas and practice, participants will have periodic opportunities to conduct free writing reflections on institute experiences. An Institute Moderator will support this by gathering and affirming observations to bolster learning outcomes.

How much does the Institute cost to attend?

Our aim is for participants to have zero out-of-pocket costs to attend the Institute. Please read more on our Stipends page.

How do I apply for the Institute?

Check out our Attend section! You will need to submit the following two documents by e-mail to contact-building-lltdm@googlegroups.com no later than 5 p.m. PST on December 20, 2019:

Current CV
2-page (maximum) letter of interest addressing: your experience with or interest in the intersection of text data mining in digital humanities research and the law; your goals for applying knowledge and skills to be acquired at the Institute to your own activities; your goals for sharing knowledge and skills with others at your home institutions/affiliations; and how you might support the Institute’s commitment to diversity and equity.

If you are applying with a colleague from your institution (e.g. researcher/librarian pairs), please indicate the name of your colleague in your Letter of Interest. You must each submit separate applications, however.

What’s the timeline for application and notification of acceptance?

October 2019: Call for applications
December 20, 2019: Applications due
January 2020: Application review
February 2020: Selection notifications

What if I have more questions?

Please check out the Building LLTDM website, or contact contact-building-lltdm@googlegroups.com.

Team Awarded Grant to Help Digital Humanities Scholars Navigate Legal Issues of Text Data Mining

(Originally posted on the UC Berkeley Library Scholarly Communication blog.)

We are thrilled to share that the National Endowment for the Humanities (NEH) has awarded a $165,000 grant to a UC Berkeley-led team of legal experts, librarians, and scholars who will help humanities researchers and staff navigate complex legal questions in cutting-edge digital research.

What is this grant all about?

If you were to crack open some popular English-language novels written in the 1850s (say, ones from Brontë, Hawthorne, Dickens, and Melville), you would find they describe men and women in very different terms. While a male character might be said to “get” something, a female character is more likely to have “felt” it. Whereas the word “mind” might be used when describing a man, the word “heart” is more likely to be used about a woman. Yet, as the 19th Century became the 20th, these descriptive differences between genders actually diminished. How do we know all this? We confess we have not actually read every novel ever written between the 19th and 21st Centuries (though we’d love to envision a world in which we could). Instead, we can make this assertion because researchers (including David Bamman, of UC Berkeley’s School of Information) used automated techniques to extract information from the novels, and analyzed these word usage trends at scale. They crafted algorithms to turn the language of those novels into data about the novels.

In fields of inquiry like the digital humanities, the application of such automated techniques and methods for identifying, extracting, and analyzing patterns, trends, and relationships across large volumes of unstructured or thinly-structured digital content is called “text data mining.” (You may also see it referred to as “text and data mining” or “computational text analysis.”) Text data mining provides humanists and social scientists with invaluable frameworks for sifting, organizing, and analyzing vast amounts of material. For instance, these methods make it possible to:

Detect racial disparity by evaluating language from police body camera footage;
Develop new tools to enable large-scale analysis of television series and photographs; and
Capture and design new physical representations of naturally occurring laughter.

The Problem

Until now, humanities researchers conducting text data mining have had to navigate a thicket of legal issues without much guidance or assistance. For instance, imagine the researchers needed to scrape content about Egyptian artifacts from online sites or databases, or download videos about Egyptian tomb excavations, in order to conduct their automated analysis. And then imagine the researchers also want to share these content-rich data sets with others to encourage research reproducibility or enable other researchers to query the data sets with new questions. This kind of work can raise issues of copyright, contract, and privacy law, not to mention ethics if there are issues of, say, indigenous knowledge or cultural heritage materials plausibly at risk. Indeed, in a recent study of humanities scholars’ text analysis needs, participants noted that access to and use of copyright-protected texts was a “frequent obstacle” in their ability to select appropriate texts for text data mining.

Potential legal hurdles do not just deter text data mining research; they also bias it toward particular topics and sources of data. In response to confusion over copyright, website terms of use, and other perceived legal roadblocks, some digital humanities researchers have gravitated to low-friction research questions and texts to avoid decision-making about rights-protected data. They use texts that have entered into the public domain or use materials that have been flexibly licensed through initiatives such as Creative Commons or Open Data Commons. When researchers limit their research to such sources, it is inevitably skewed, leaving important questions unanswered, and rendering resulting findings less broadly applicable. A growing body of research also demonstrates how race, gender, and other biases found in openly available texts have contributed to and exacerbated bias in developing artificial intelligence tools.

The Solution

The good news is that the NEH has agreed to support an Institute for Advanced Topics in the Digital Humanities to help key stakeholders to learn to better navigate legal issues in text data mining. Thanks to the NEH’s $165,000 grant, Rachael Samberg of UC Berkeley Library’s Office of Scholarly Communication Services will be leading a national team (identified below) from more than a dozen institutions and organizations to teach humanities researchers, librarians, and research staff how to confidently navigate the major legal issues that arise in text data mining research.

Our institute is aptly called Building Legal Literacies for Text Data Mining (Building LLTDM), and will run from June 23-26, 2020 in Berkeley, California. Institute instructors are legal experts, humanities scholars, and librarians immersed in text data mining research services, who will co-lead experiential meeting sessions empowering participants to put the curriculum’s concepts into action.

In October, we will issue a call for participants, who will receive stipends to support their attendance. We will also be publishing all of our training materials in an openly-available online book for researchers and librarians around the globe to help build academic communities that extend these skills.

Building LLTDM team member Matthew Sag, a law professor at Loyola University Chicago School of Law and leading expert on copyright issues in the digital humanities, said he is “excited to have the chance to help the next generation of text data mining researchers open up new horizons in knowledge discovery. We have learned so much in the past ten years working on HathiTrust [a text-minable digital library] and related issues. I’m looking forward to sharing that knowledge and learning from others in the text data mining community.”

Team member Brandon Butler, a copyright lawyer and library policy expert at the University of Virginia, said, “In my experience there’s a lot of interest in these research methods among graduate students and early-career scholars, a population that may not feel empowered to engage in ‘risky’ research. I’ve also seen that digital humanities practitioners have a strong commitment to equity, and they are working to build technical literacies outside the walls of elite institutions. Building legal literacies helps ease the burden of uncertainty and smooth the way toward wider, more equitable engagement with these research methods.”

Kyle K. Courtney of Harvard University serves as Copyright Advisor at Harvard Library’s Office for Scholarly Communication, and is also a Building LLTDM team member. Courtney added, “We are seeing more and more questions from scholars of all disciplines around these text data mining issues. The wealth of full-text online materials and new research tools provide scholars the opportunity to analyze large sets of data, but they also bring new challenges having to do with the use and sharing not only of the data but also of the technological tools researchers develop to study them. I am excited to join the Building LLTDM team and help clarify these issues and empower humanities scholars and librarians working in this field.”

Megan Senseney, Head of the Office of Digital Innovation and Stewardship at the University of Arizona Libraries, reflected on the opportunities for ongoing library engagement that extend beyond the initial institute. Senseney said, “Establishing a shared understanding of the legal landscape for TDM is vital to supporting research in the digital humanities and developing a new suite of library services in digital scholarship. I’m honored to work and learn alongside a team of legal experts, librarians, and researchers to create this institute, and I look forward to integrating these materials into instruction and outreach initiatives at our respective universities.”

Next Steps

The Building LLTDM team is excited to begin supporting humanities researchers, staff, and librarians en route to important knowledge creation. Stay tuned if you are interested in participating in the institute.

In the meantime, please join us in congratulating all the members of the project team:

Rachael G. Samberg (University of California, Berkeley) (Project Director)
Scott Althaus (University of Illinois, Urbana-Champaign)
David Bamman (University of California, Berkeley)
Sara Benson (University of Illinois, Urbana-Champaign)
Brandon Butler (University of Virginia)
Beth Cate (Indiana University, Bloomington)
Kyle K. Courtney (Harvard University)
Maria Gould (California Digital Library)
Cody Hennesy (University of Minnesota, Twin Cities)
Eleanor Koehl (University of Michigan)
Thomas Padilla (University of Nevada, Las Vegas; OCLC Research)
Stacy Reardon (University of California, Berkeley)
Matthew Sag (Loyola University Chicago)
Brianna Schofield (Authors Alliance)
Megan Senseney (University of Arizona)
Glen Worthey (Stanford University)