[Conversations] What Does Consent Mean in the Age of Large Language Models (LLMs)?

Transcript

Monika Sengul-Jones

As a computational linguist, why are you reluctant to have the audio recording of our conversation available or streamed on the Internet?

Angelina McMillan-Major

I’m concerned about my data being out there on the [open] internet, available to crawlers. Large language models (LLMs), as well as other generative or machine learning models, are trained using data scraped from the internet. Oftentimes, it’s collected using automated systems that crawl domains such as Wikipedia[’s corpus] going from link to link.

My data, my voice data, is called PII, personally identifiable information. It’s [among] the high-risk types of data because it’s uniquely identifying. 

I’m concerned about having my PII out in the wild, where automated systems can gather my PII and throw it into a model and use it as they will.

It’s also that personal data is pervasively undervalued. From the industry perspective, ‘data goes in’ and the product is the model, the output. So I’m concerned about our individual data rights and what can be learned about us, as people, through [our] personal data.

Monika Sengul-Jones

It’s funny that the word “data” can be used to describe something so personally unique—the sound of your voice.

Angelina McMillan-Major

Yeah, your voice is conceptualized as a pattern; [as data] it becomes frequencies. What’s important, or desirable, isn’t just the content of what’s spoken; it’s your voice frequencies and what sort of words you use.

Monika Sengul-Jones

Is it accurate to say, from a privacy perspective, you’re concerned about your sensory—vocal, in this case—fingerprints? That we need protection for something that is unintentionally created and possessed and therefore is given away without realizing or consenting?

Angelina McMillan-Major

Yes.

Monika Sengul-Jones

Let’s talk more about your work as a computational linguist. You’ve presented research on the history of computation and language, and how the same word—artificial intelligence—is used to describe different technologies. For instance, we have the ELIZA chatbot (an early natural language processing computer program developed from 1964 to 1967 at MIT) in the mid-century, which was cutting-edge AI. Today, ELIZA is pretty basic. Tell us more about why this history is important to know.

Angelina McMillan-Major

It’s a good question. Well, chatbots like ELIZA used shallow processing, [and early statistical systems used] N-gram language modeling.

Monika Sengul-Jones

Can you explain an N-gram?

Angelina McMillan-Major

They work by making a statistical prediction of what text will come next—sort of like an ‘auto-complete’ that isn’t very good.

“N” refers to the number of grams, of consecutive words, or tokens. So “the cat” is a bi-gram. “The cat meows” is a tri-gram. The more words you add, the higher the n, and the less frequent that exact sequence is. The phrase “the cat meows in the tree,” that’s not going to happen often [in some given text data].

You look at the probability of what word might come next—that was the state-of-the-art AI. But at a certain point, there’s a limit to how natural an N-gram will sound. 

Then neural networks became popular; they sounded more natural, and the probability space was more fluid.
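
To make the N-gram idea concrete, here is a minimal, illustrative sketch of bigram-style next-word prediction in Python. The toy corpus and the helper function are invented for this example; real N-gram models are trained on far larger corpora and add smoothing for unseen sequences.

```python
from collections import Counter, defaultdict

# Toy corpus; real N-gram models are trained on vastly more text.
corpus = "the cat meows . the cat sleeps . the dog barks .".split()

# Count bigrams: for each word, how often each possible next word follows it.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequent next word and its estimated probability."""
    counts = bigram_counts[word]
    nxt, freq = counts.most_common(1)[0]
    return nxt, freq / sum(counts.values())

print(predict_next("the"))  # ('cat', 0.666...): "cat" follows "the" twice, "dog" once
```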

Monika Sengul-Jones

How are neural networks different from N-grams?

Angelina McMillan-Major

A neural network is fundamentally based on an algorithm called the perceptron. This is a specific mathematical formula, grounded in linear algebra, that models language as a network [of nodes]. So [with neural networks] you’re going from the probability-statistics space to linear algebra. It shifts what sort of things you can do to smooth low probabilities [in language prediction], as well as create randomization to allow for more fluid, unique patterns that aren’t necessarily directly in the training data.
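
As a rough illustration of that shift, below is a minimal perceptron in Python with NumPy: a weighted sum of inputs (linear algebra) passed through a threshold. The numbers are made up for the example; this is the generic textbook unit, not any particular language model.

```python
import numpy as np

def perceptron(inputs, weights, bias):
    """A single perceptron: weighted sum of inputs, thresholded at zero."""
    return 1 if np.dot(inputs, weights) + bias > 0 else 0

# Toy example with two input features and hand-picked weights.
x = np.array([0.7, 0.2])
w = np.array([0.5, -1.0])
print(perceptron(x, w, bias=0.1))  # 1, since 0.7*0.5 + 0.2*(-1.0) + 0.1 = 0.25 > 0
```

Stacking many such units in layers, and replacing the hard threshold with smooth functions so the weights can be learned from data, is what turns this into the neural networks discussed below.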

Angelina McMillan-Major, PhD, is a computational linguist in the UW’s Language Learning Center where she focuses on methodologies for language documentation and reclamation, specifically endangered languages. Photo credit: Russell Hugo, 2024

Monika Sengul-Jones

I have to mention, just the word ‘perceptron’ sounds cool. Were these developed around the same time as the N-gram? Or did one follow the other?

Angelina McMillan-Major

An early perceptron version of a neural network was also developed back in the 1940s.

Monika Sengul-Jones

Before ELIZA.

Angelina McMillan-Major

Yes. However, in the ’40s, computational linguistics had multiple theories, and it wasn’t until we had the personal computer, and then the internet, with enough data and hardware, that we could actually implement these theories. So there were versions of neural networks in the ’90s, but they didn’t take off until the 2010s.

Monika Sengul-Jones

That was our “big data” moment. So, in this brief history of artificial intelligence as it pertains to language, where do large language models (LLMs) come in?

Angelina McMillan-Major

At the end of the neural network period (in 2017). Most people are familiar with LLMs that use a particular type of architecture, the transformer model. This is what ChatGPT is based on. Compared to other neural networks, LLMs using a transformer [architecture] are extremely data-intensive, using billions of tokens.
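
For readers wondering what the transformer architecture refers to, below is a bare-bones sketch of scaled dot-product attention, the core operation introduced in the 2017 transformer paper. It is a simplified illustration, with no learned projection matrices and no multiple heads, and is not the implementation behind ChatGPT or any specific LLM.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each position mixes the value vectors V,
    weighted by how well its query matches every key."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V

# Toy example: 3 tokens, each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
print(attention(Q, K, V).shape)  # (3, 4): one context-mixed vector per token
```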

Monika Sengul-Jones

Let’s go back to my first question: what is at stake for people when we call all these different technologies “artificial intelligence”?

Angelina McMillan-Major

We’re seeing models used for decision-making, like determining credit scores, and we know these outputs are biased, but it’s not transparent within the model itself. We don’t have the opportunity to see, “Oh, my credit score was decided because this model output a .6 or something,” or what that means internally.

Monika Sengul-Jones

I know this black boxing causes real harm to people. We deserve transparency on how decisions are made. But also, if people use these models for decision-making, if people are relieved of decision fatigue, are you worried people are going to get stupider?

Angelina McMillan-Major

I hope not.

Monika Sengul-Jones

That’s a relief!

Angelina McMillan-Major

I’m less concerned about the loss of critical thinking skills and more about people willingly giving up rights to their personally identifiable information (PII) in exchange for ease.

Monika Sengul-Jones

In exchange for ease, yeah. And then your PII could be used against you, I suppose.

Angelina McMillan-Major

I worry about the normalization of this exchange in society. I want society to be aware that the exchange is the centralization of power into a small number of big companies.

Monika Sengul-Jones

Big in reach, small in number.

Angelina McMillan-Major

It doesn’t necessarily have to be that way.

Monika Sengul-Jones

Let’s talk about how else it could be. In your research, you’ve been developing best practices for research with communities, such as those who speak endangered languages. In North America, Indigenous communities, for instance. For anyone concerned about privacy, about the integrity of their personally identifiable data, who wants to document their language and to protect their data, what’s your approach?

Angelina McMillan-Major advocates for a consent-based model of technology, drawing from the bodily consent literature. She recommends checking out the Consentful Tech Project to learn more. Image: Screengrab from Consentful Tech Project, 2024.

Angelina McMillan-Major

Collection, maintenance, and controlling access—these are huge priorities.

Most people are familiar with participation in data-gathering as something you can opt in or out of. When the opt-out model is used [as the default], it’s not consent, since people may not be aware that removing themselves is an option.

When you’re working with a community, the process is [and should be] different. There are archives that will hold this data. And usually, there are intimate processes. You go to a specific family, for example, whose ancestor has recorded something. You get permission from that family, you specifically ask to use their recording in research. You explain the forms you’ll be using it in, what will be shared, what the outcomes will be, and how you’ll be giving back and reciprocating with the community.

Monika Sengul-Jones

So you’re thinking about computational linguistics, in this process, as co-created partnerships of reciprocity.

Angelina McMillan-Major

Yes. Additionally, the person asking for consent carries the burden of providing as much information as possible. They need to ensure there’s some sort of understanding on the other end. This is distinct from the way that most of us just go through the terms agreement and click accept.

Monika Sengul-Jones

I just do what I need to do to move on. Those modal interruptions are the worst.

Angelina McMillan-Major

Yeah. So that’s not informed consent. That’s as-quickly-as-possible consent.

Monika Sengul-Jones

You have an acronym you use to understand consent in your work. Freely given, reversible, informed, enthusiastic, and specific; FRIES consent. That’s really nice.

Angelina McMillan-Major

Yeah, that’s drawing from the bodily consent literature.

Monika Sengul-Jones

Right, and it brings us back to the beginning of our conversation, thinking about our personally identifiable information (PII) as intimate data, as an important part of us and deserving of protection. Our PII body.

Angelina McMillan-Major

Yeah. However, one of the concepts that we don’t have a technical analogy for yet is “reversible.” Once you give your agreement, you can’t take back your data. That’s not necessarily the case in Europe, with the General Data Protection Regulation (GDPR). But that’s a problem with our current LLMs. It’s hard to take out data because it’s built into the model.

Monika Sengul-Jones

Right. I like to think about how reversal might work with, for example, the Authors Guild class action lawsuit against OpenAI. Let’s say the authors win. How could the books be removed from OpenAI’s GPT models, to, for instance, prevent works from being generated that closely resemble the copyrighted works that should be withdrawn? The litigation raises an important question for copyright law, because the books are not copied or saved on the servers or directly used to generate responses to queries; rather, there are cases of overfitting. We’ll see how the courts rule, but in the event the authors win, how will whatever those books helped create be removed?

Angelina McMillan-Major

Well [the books as] data, sort of, are the weights. The actual numbers that are calculated from them form the body of the model. How do you tie a specific data instance to the weights that are spread across a giant billion-parameter model? That’s hard to do.

Monika Sengul-Jones

When I hear things like this, it reminds me of people saying, ‘You can’t put the genie back in the bottle.’ But is it impossible? It seems more of a political and labor question.

Angelina McMillan-Major

I think people are trying. I’ll say that. I’m not convinced.

Monika Sengul-Jones

You’re not convinced?

Angelina McMillan-Major

I mean, I just don’t know how you would do it, from a theoretical perspective.

Monika Sengul-Jones

But if people didn’t give consent to have their data used, and yet it was, and it became the foundation of the model, then won’t we need to figure out how to remove parts?

Angelina McMillan-Major

Well, there’s the remove-the-whole-thing option. It’s the remove-parts option that people are trying their best to work on.

Monika Sengul-Jones

Before we end, I want to ask you about another intervention you’ve made in your work with the Tech Policy Lab: the concept of “data statements,” metadata attached to language datasets. Tell us about data statements. What do you want people concerned about data and privacy to know?

Cover of A Guide for Creating and Documenting Language Datasets with Data Statements, Schema Version 3 (2024). Data Statements Guide by Angelina McMillan-Major & Emily M. Bender; report design by Elias Greendorfer.

Angelina McMillan-Major

[Data Statements] was started by Batya Friedman and Emily Bender, who were asking, ‘How can we help people make more informed decisions about selecting data for the models they are going to use, and for the systems those models are embedded in?’ Data Statements help to make sure [the data are] appropriate for the use case. The behavior of a model is so tied to the data it’s trained on that you don’t want to use, for example, a model trained only on English data for some other language, something as simple as that. Data Statements are guides.

Monika Sengul-Jones

I started to think of our conversation about crawlers on the internet just going, eat, eat, eat, like a little Pac-Man. Then they run into something like a data statement and it’s like, “nope!” can’t pass, it’s not right for what I need! I don’t know [laughter] I just…I liked that visual for my understanding of data statements. Is that an accurate description?

Angelina McMillan-Major

I hope so someday! [Laughter] The existing versions of data statements are designed for human decision making, but maybe further research will result in machine-readable versions.
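
Purely as a thought experiment, a machine-readable data statement might look something like the sketch below: a small structured record that a crawler or model builder could check before ingesting a dataset. The field names are invented here for illustration, loosely echoing themes from the published Data Statements schema; no such machine-readable format currently exists in the project.

```python
# Hypothetical, invented example of a machine-checkable data statement.
# Field names are illustrative only, not an official Data Statements schema.
data_statement = {
    "dataset": "example-community-speech-corpus",
    "language_variety": "en-US",
    "contains_pii": True,  # e.g., voice recordings
    "consent": {"model": "opt-in", "reversible": True},
    "permitted_uses": ["language documentation", "community education"],
}

def may_use(statement, intended_use):
    """Refuse any use the statement does not explicitly permit."""
    return intended_use in statement["permitted_uses"]

print(may_use(data_statement, "commercial LLM training"))  # False: not permitted
```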

Transcription by Mollie Chehab
Editing by Monika Sengul-Jones
Graphic of Data Statements Guide by Elias Greendorfer
Image Credit: Portrait of Angelina McMillan-Major (2024) by Russell Hugo of the Language Learning Center

Related Links

Consentful Tech Project

Tech Policy Lab’s Data Statements Project

McMillan-Major, A., Bender, E. M., & Friedman, B. (2024). Data Statements: From Technical Concept to Community Practice. ACM Journal on Responsible Computing, 1(1), 1–17. https://doi.org/10.1145/3594737

McMillan-Major, A., et al. (2024). Documenting Geographically and Contextually Diverse Language Data Sources. Northern European Journal of Language Technology, 10(1). https://doi.org/10.3384/nejlt.2000-1533.2024.5217