On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
Emily M. Bender∗
University of Washington
Seattle, WA, USA

Timnit Gebru∗
Black in AI
Palo Alto, CA, USA

Angelina McMillan-Major
University of Washington
Seattle, WA, USA

Shmargaret Shmitchell
The Aether

∗Joint first authors
ABSTRACT
The past 3 years of work in NLP have been characterized by the
development and deployment of ever larger language models, es-
pecially for English. BERT, its variants, GPT-2/3, and others, most
recently Switch-C, have pushed the boundaries of the possible both
through architectural innovations and through sheer size. Using
these pretrained models and the methodology of fine-tuning them
for specific tasks, researchers have extended the state of the art
on a wide array of tasks as measured by leaderboards on specific
benchmarks for English. In this paper, we take a step back and ask:
How big is too big? What are the possible risks associated with this
technology and what paths are available for mitigating those risks?
We provide recommendations including weighing the environmen-
tal and financial costs first, investing resources into curating and
carefully documenting datasets rather than ingesting everything on
the web, carrying out pre-development exercises evaluating how
the planned approach fits into research and development goals and
supports stakeholder values, and encouraging research directions
beyond ever larger language models.
CCS CONCEPTS
• Computing methodologies → Natural language processing.
ACM Reference Format:
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmar-
garet Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language
Models Be Too Big? 🦜. In Conference on Fairness, Accountability, and Trans-
parency (FAccT '21), March 3–10, 2021, Virtual Event, Canada. ACM, New
York, NY, USA, 14 pages. https://doi.org/10.1145/XXXXXXXXXX
1 INTRODUCTION
One of the biggest trends in natural language processing (NLP) has
been the increasing size of language models (LMs) as measured
by the number of parameters and size of training data. Since 2018
alone, we have seen the emergence of BERT and its variants [39,
70, 74, 113, 146], GPT-2 [106], T-NLG [112], GPT-3 [25], and most
recently Switch-C [43], with institutions seemingly competing to
produce ever larger LMs. While investigating properties of LMs and
how they change with size holds scientific interest, and large LMs
have shown improvements on various tasks (§2), we ask whether
enough thought has been put into the potential risks associated
with developing them and strategies to mitigate these risks.
We first consider environmental risks. Echoing a line of recent
work outlining the environmental and financial costs of deep learn-
ing systems [129], we encourage the research community to priori-
tize these impacts. One way this can be done is by reporting costs
and evaluating works based on the amount of resources they con-
sume [57]. As we outline in §3, increasing the environmental and
financial costs of these models doubly punishes marginalized com-
munities that are least likely to benefit from the progress achieved
by large LMs and most likely to be harmed by negative environ-
mental consequences of their resource consumption. At the scale we
are discussing (outlined in §2), the first consideration should be the
environmental cost.
Just as environmental impact scales with model size, so does
the difficulty of understanding what is in the training data. In §4,
we discuss how large datasets based on texts from the Internet
overrepresent hegemonic viewpoints and encode biases potentially
damaging to marginalized populations. In collecting ever larger
datasets we risk incurring documentation debt. We recommend
mitigating these risks by budgeting for curation and documentation
at the start of a project and only creating datasets as large as can
be sufficiently documented.
As argued by Bender and Koller [14], it is important to under-
stand the limitations of LMs and put their success in context. This
not only helps reduce hype which can mislead the public and re-
searchers themselves regarding the capabilities of these LMs, but
might encourage new research directions that do not necessarily
depend on having larger LMs. As we discuss in §5, LMs are not
performing natural language understanding (NLU), and only have
success in tasks that can be approached by manipulating linguis-
tic form [14]. Focusing on state-of-the-art results on leaderboards
without encouraging deeper understanding of the mechanism by
which they are achieved can cause misleading results as shown
in [21, 93] and direct resources away from efforts that would facili-
tate long-term progress towards natural language understanding,
without using unfathomable training data.
Furthermore, the tendency of human interlocutors to impute
meaning where there is none can mislead both NLP researchers
and the general public into taking synthetic text as meaningful.
Combined with the ability of LMs to pick up on both subtle biases
and overtly abusive language patterns in training data, this leads
to risks of harms, including encountering derogatory language and
experiencing discrimination at the hands of others who reproduce
racist, sexist, ableist, extremist or other harmful ideologies rein-
forced through interactions with synthetic language. We explore
these potential harms in §6 and potential paths forward in §7.
We hope that a critical overview of the risks of relying on ever-
increasing size of LMs as the primary driver of increased perfor-
mance of language technology can facilitate a reallocation of efforts
towards approaches that avoid some of these risks while still reap-
ing the benefits of improvements to language technology.
2 BACKGROUND
Similar to [14], we understand the term language model (LM) to
refer to systems which are trained on string prediction tasks: that is,
predicting the likelihood of a token (character, word or string) given
either its preceding context or (in bidirectional and masked LMs)
its surrounding context. Such systems are unsupervised and when
deployed, take a text as input, commonly outputting scores or string
predictions. Initially proposed by Shannon in 1949 [117], some of
the earliest implemented LMs date to the early 1980s and were used
as components in systems for automatic speech recognition (ASR),
machine translation (MT), document classification, and more [111].
In this section, we provide a brief overview of the general trend of
language modeling in recent years. For a more in-depth survey of
pretrained LMs, see [105].
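As a concrete illustration of string prediction (not an example from the paper), the sketch below scores next-token likelihoods with a pretrained causal LM; it assumes the Hugging Face transformers library and the publicly released GPT-2 checkpoint, neither of which the paper prescribes.

```python
# Minimal sketch of string prediction with a pretrained causal LM. Assumes the
# Hugging Face `transformers` library and the public GPT-2 weights; purely
# illustrative of the definition above, not code from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The capital of France is"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch=1, sequence_length, vocab_size)

# Probability distribution over the next token given only the preceding context.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([token_id.item()]):>12}  {prob.item():.3f}")
```

A masked LM such as BERT is scored analogously, except that the distribution is computed for a masked position conditioned on tokens on both sides.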
Before neural models, n-gram models also used large amounts
of data [20, 87]. In addition to ASR, these large n-gram models of
English were developed in the context of machine translation from
another source language with far fewer direct translation examples.
For example, [20] developed an n-gram model for English with
a total of 1.8T n-grams and noted steady improvements in BLEU
score on the test set of 1797 Arabic translations as the training data
was increased from 13M tokens.
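For intuition, a count-based n-gram model can be sketched in a few lines; the toy bigram example below is purely illustrative and many orders of magnitude smaller than the web-scale, smoothed models discussed above.

```python
# Toy count-based bigram model, purely for intuition. Real n-gram LMs such as
# [20] used web-scale counts plus smoothing and backoff, all omitted here.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    """Maximum-likelihood estimate of P(next token | previous token)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(bigram_prob("the", "cat"))  # 0.25: "the" is followed by cat/mat/dog/rug equally often
print(bigram_prob("sat", "on"))   # 1.0: "sat" is always followed by "on" in this corpus
```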
The next big step was the move towards using pretrained rep-
resentations of the distribution of words (called word embeddings)
in other (supervised) NLP tasks. These word vectors came from
systems such as word2vec [85] and GloVe [98] and later LSTM
models such as context2vec [82] and ELMo [99] and supported
state of the art performance on question answering, textual entail-
ment, semantic role labeling (SRL), coreference resolution, named
entity recognition (NER), and sentiment analysis, at first in Eng-
lish and later for other languages as well. While training the word
embeddings required a (relatively) large amount of data, it reduced
the amount of labeled data necessary for training on the various
supervised tasks. For example, [99] showed that a model trained
with ELMo reduced the necessary amount of training data needed
to achieve similar results on SRL compared to models without, as
shown in one instance where a model trained with ELMo reached
the maximum development F1 score in 10 epochs as opposed to
486 without ELMo. This model furthermore achieved the same F1
score with 1% of the data as the baseline model achieved with 10%
of the training data. Increasing the number of model parameters,
however, did not yield noticeable increases for LSTMs [e.g. 82].

Table 1: Overview of recent large language models

Year  Model                    # of Parameters  Dataset Size
2019  BERT [39]                3.4E+08          16GB
2019  DistilBERT [113]         6.60E+07         16GB
2019  ALBERT [70]              2.23E+08         16GB
2019  XLNet (Large) [150]      3.40E+08         126GB
2020  ERNIE-Gen (Large) [145]  3.40E+08         16GB
2019  RoBERTa (Large) [74]     3.55E+08         161GB
2019  MegatronLM [122]         8.30E+09         174GB
2020  T5-11B [107]             1.10E+10         745GB
2020  T-NLG [112]              1.70E+10         174GB
2020  GPT-3 [25]               1.75E+11         570GB
2020  GShard [73]              6.00E+11         –
2021  Switch-C [43]            1.57E+12         745GB
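The feature-based transfer described in the preceding paragraph can be sketched as follows: pretrained word vectors are reused unchanged as input features for a small supervised classifier. The random vectors and toy labels below are hypothetical stand-ins for word2vec or GloVe embeddings and real task data; this is an illustration, not code from the paper.

```python
# Hedged sketch: pretrained word vectors reused as fixed features for a
# downstream supervised task. The random vectors below stand in for
# embeddings that would normally be loaded from word2vec [85] or GloVe [98].
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = ["good", "great", "bad", "awful", "movie", "plot"]
embeddings = {w: rng.normal(size=50) for w in vocab}  # stand-in for pretrained vectors

def featurize(sentence):
    """Represent a sentence as the average of its word vectors (unknown words skipped)."""
    vecs = [embeddings[w] for w in sentence.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

# Tiny labeled set for a supervised task (here, sentiment): far less labeled
# data is needed because the word representations are already pretrained.
train_sentences = ["good movie", "great plot", "bad movie", "awful plot"]
train_labels = [1, 1, 0, 0]

clf = LogisticRegression().fit([featurize(s) for s in train_sentences], train_labels)
print(clf.predict([featurize("great movie")]))  # predicted label for an unseen sentence
```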
Transformer models, on the other hand, have been able to con-
tinuously benefit from larger architectures and larger quantities of
data. Devlin et al. [39] in particular noted that training on a large
dataset and fine-tuning for specific tasks leads to strictly increasing
results on the GLUE tasks [138] for English as the hyperparameters
of the model were increased. Initially developed as Chinese LMs, the
ERNIE family [130, 131, 145] produced ERNIE-Gen, which was also
trained on the original (English) BERT dataset, joining the ranks
of very large LMs. NVIDIA released the MegatronLM which has
8.3B parameters and was trained on 174GB of text from the English
Wikipedia, OpenWebText, RealNews and CC-Stories datasets [122].
Trained on the same dataset, Microsoft released T-NLG,1 an LM
with 17B parameters. OpenAI’s GPT-3 [25] and Google’s GShard
[73] and Switch-C [43] have increased the definition of large LM by
orders of magnitude in terms of parameters at 175B, 600B, and 1.6T
parameters, respectively. Table 1 summarizes a selection of these
LMs in terms of training data size and parameters. As increasingly
large amounts of text are collected from the web in datasets such
as the Colossal Clean Crawled Corpus [107] and the Pile [51], this
trend of increasingly large LMs can be expected to continue as long
as they correlate with an increase in performance.
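A hedged sketch of the pretrain-then-fine-tune recipe discussed above is given below; it assumes the Hugging Face transformers and datasets libraries and the released bert-base-uncased checkpoint, none of which the paper itself ties the recipe to, and uses SST-2, one of the GLUE tasks [138].

```python
# Sketch of fine-tuning a pretrained LM for a specific task (here SST-2 from
# GLUE), assuming the Hugging Face `transformers` and `datasets` libraries.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True,
                         padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sst2-bert", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()  # updates all weights of the pretrained model on the labeled task data
```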
A number of these models also have multilingual variants such
as mBERT [39] and mT5 [148] or are trained with some amount of
multilingual data such as GPT-3 where 7% of the training data was
not in English [25]. The performance of these multilingual mod-
els across languages is an active area of research. Wu and Dredze
[144] found that while mBERT does not perform equally well across
all 104 languages in its training data, it performed better at NER,
POS tagging, and dependency parsing than monolingual models
trained with comparable amounts of data for four low-resource
languages. Conversely, [95] surveyed monolingual BERT models
developed with more specific architecture considerations or addi-
tional monolingual data and found that they generally outperform
1 https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft

Solution

Deblina answered on Apr 22 2022
Title: Ethics Report
Contents
Reason Why Timnit Gebru & Margaret Mitchell Were Fired
Works Cited
Reason Why Timnit Gebru & Margaret Mitchell Were Fired
Margaret Mitchell and Timnit Gebru were fired over issues of academic freedom and diversity. The two ethics researchers were said to have violated the company's code of conduct and security policies by moving electronic files outside the company. Google's work on ethics in artificial intelligence came under close scrutiny after the firing of Gebru, who had gained a prominent position for exposing bias in facial analysis systems. The researchers had also addressed diversity and inclusion among employees, expressing concern that the company was...