On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
Emily M. Bender∗
University of Washington
Seattle, WA, USA

Timnit Gebru∗
Black in AI
Palo Alto, CA, USA

Angelina McMillan-Major
University of Washington
Seattle, WA, USA

Shmargaret Shmitchell
The Aether

∗Joint first authors
ABSTRACT
The past 3 years of work in NLP have been characterized by the
development and deployment of ever larger language models, es-
pecially for English. BERT, its variants, GPT-2/3, and others, most
recently Switch-C, have pushed the boundaries of the possible both
through architectural innovations and through sheer size. Using
these pretrained models and the methodology of fine-tuning them
for specific tasks, researchers have extended the state of the art
on a wide array of tasks as measured by leaderboards on specific
benchmarks for English. In this paper, we take a step back and ask:
How big is too big? What are the possible risks associated with this
technology and what paths are available for mitigating those risks?
We provide recommendations including weighing the environmen-
tal and financial costs first, investing resources into curating and
carefully documenting datasets rather than ingesting everything on
the web, carrying out pre-development exercises evaluating how
the planned approach fits into research and development goals and
supports stakeholder values, and encouraging research directions
beyond ever larger language models.
CCS CONCEPTS
• Computing methodologies → Natural language processing.
ACM Reference Format:
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmar-
garet Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language
Models Be Too Big? 🦜. In Conference on Fairness, Accountability, and Trans-
parency (FAccT '21), March 3–10, 2021, Virtual Event, Canada. ACM, New
York, NY, USA, 14 pages. https://doi.org/10.1145/XXXXXXXXXX
1 INTRODUCTION
One of the biggest trends in natural language processing (NLP) has
been the increasing size of language models (LMs) as measured
by the number of parameters and size of training data. Since 2018
alone, we have seen the emergence of BERT and its variants [39,
70, 74, 113, 146], GPT-2 [106], T-NLG [112], GPT-3 [25], and most
recently Switch-C [43], with institutions seemingly competing to
produce ever larger LMs. While investigating properties of LMs and
how they change with size holds scientific interest, and large LMs
have shown improvements on various tasks (§2), we ask whether
enough thought has been put into the potential risks associated
with developing them and strategies to mitigate these risks.
We first consider environmental risks. Echoing a line of recent
work outlining the environmental and financial costs of deep learn-
ing systems [129], we encourage the research community to priori-
tize these impacts. One way this can be done is by reporting costs
and evaluating works based on the amount of resources they con-
sume [57]. As we outline in §3, increasing the environmental and
financial costs of these models doubly punishes marginalized com-
munities that are least likely to benefit from the progress achieved
by large LMs and most likely to be harmed by negative environ-
mental consequences of their resource consumption. At the scale we
are discussing (outlined in §2), the first consideration should be the
environmental cost.
Just as environmental impact scales with model size, so does
the difficulty of understanding what is in the training data. In §4,
we discuss how large datasets based on texts from the Internet
overrepresent hegemonic viewpoints and encode biases potentially
damaging to marginalized populations. In collecting ever larger
datasets we risk incurring documentation debt. We recommend
mitigating these risks by budgeting for curation and documentation
at the start of a project and only creating datasets as large as can
be sufficiently documented.
As argued by Bender and Koller [14], it is important to under-
stand the limitations of LMs and put their success in context. This
not only helps reduce hype which can mislead the public and re-
searchers themselves regarding the capabilities of these LMs, but
might encourage new research directions that do not necessarily
depend on having larger LMs. As we discuss in §5, LMs are not
performing natural language understanding (NLU), and only have
success in tasks that can be approached by manipulating linguis-
tic form [14]. Focusing on state-of-the-art results on leaderboards
without encouraging deeper understanding of the mechanism by
which they are achieved can cause misleading results as shown
in [21, 93] and direct resources away from efforts that would facili-
tate long-term progress towards natural language understanding,
without using unfathomable training data.
Furthermore, the tendency of human interlocutors to impute
meaning where there is none can mislead both NLP researchers
and the general public into taking synthetic text as meaningful.
Combined with the ability of LMs to pick up on both subtle biases
and overtly abusive language patterns in training data, this leads
to risks of harms, including encountering derogatory language and
experiencing discrimination at the hands of others who reproduce
racist, sexist, ableist, extremist or other harmful ideologies rein-
forced through interactions with synthetic language. We explore
these potential harms in §6 and potential paths forward in §7.
We hope that a critical overview of the risks of relying on ever-
increasing size of LMs as the primary driver of increased perfor-
mance of language technology can facilitate a reallocation of efforts
towards approaches that avoid some of these risks while still reap-
ing the benefits of improvements to language technology.
2 BACKGROUND
Similar to [14], we understand the term language model (LM) to
refer to systems which are trained on string prediction tasks: that is,
predicting the likelihood of a token (character, word or string) given
either its preceding context or (in bidirectional and masked LMs)
its surrounding context. Such systems are unsupervised and when
deployed, take a text as input, commonly outputting scores or string
predictions. Initially proposed by Shannon in 1949 [117], some of
the earliest implemented LMs date to the early 1980s and were used
as components in systems for automatic speech recognition (ASR),
machine translation (MT), document classification, and more [111].
In this section, we provide a brief overview of the general trend of
language modeling in recent years. For a more in-depth survey of
pretrained LMs, see [105].
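As a concrete illustration of string prediction (not an example from the paper), the sketch below scores next-token likelihoods with a pretrained causal LM; it assumes the Hugging Face transformers library and the publicly released GPT-2 checkpoint, neither of which the paper prescribes.

```python
# Minimal sketch of string prediction with a pretrained causal LM. Assumes the
# Hugging Face `transformers` library and the public GPT-2 weights; purely
# illustrative of the definition above, not code from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "The capital of France is"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch=1, sequence_length, vocab_size)

# Probability distribution over the next token given only the preceding context.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode([token_id.item()]):>12}  {prob.item():.3f}")
```

A masked LM such as BERT is scored analogously, except that the distribution is computed for a masked position conditioned on tokens on both sides.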
Before neural models, n-gram models also used large amounts
of data [20, 87]. In addition to ASR, these large n-gram models of
English were developed in the context of machine translation from
another source language with far fewer direct translation examples.
For example, [20] developed an n-gram model for English with
a total of 1.8T n-grams and noted steady improvements in BLEU
score on the test set of 1797 Arabic translations as the training data
was increased from 13M tokens.
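For intuition, a count-based n-gram model can be sketched in a few lines; the toy bigram example below is purely illustrative and many orders of magnitude smaller than the web-scale, smoothed models discussed above.

```python
# Toy count-based bigram model, purely for intuition. Real n-gram LMs such as
# [20] used web-scale counts plus smoothing and backoff, all omitted here.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    """Maximum-likelihood estimate of P(next token | previous token)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(bigram_prob("the", "cat"))  # 0.25: "the" is followed by cat/mat/dog/rug equally often
print(bigram_prob("sat", "on"))   # 1.0: "sat" is always followed by "on" in this corpus
```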
The next big step was the move towards using pretrained rep-
resentations of the distribution of words (called word embeddings)
in other (supervised) NLP tasks. These word vectors came from
systems such as word2vec [85] and GloVe [98] and later LSTM
models such as context2vec [82] and ELMo [99] and supported
state of the art performance on question answering, textual entail-
ment, semantic role labeling (SRL), coreference resolution, named
entity recognition (NER), and sentiment analysis, at first in Eng-
lish and later for other languages as well. While training the word
embeddings required a (relatively) large amount of data, it reduced
the amount of labeled data necessary for training on the various
supervised tasks. For example, [99] showed that a model trained
with ELMo reduced the necessary amount of training data needed
to achieve similar results on SRL compared to models without, as
shown in one instance where a model trained with ELMo reached
the maximum development F1 score in 10 epochs as opposed to
486 without ELMo. This model furthermore achieved the same F1
score with 1% of the data as the baseline model achieved with 10%
of the training data. Increasing the number of model parameters,
however, did not yield noticeable increases for LSTMs [e.g. 82].

Table 1: Overview of recent large language models

Year  Model                    # of Parameters  Dataset Size
2019  BERT [39]                3.4E+08          16GB
2019  DistilBERT [113]         6.60E+07         16GB
2019  ALBERT [70]              2.23E+08         16GB
2019  XLNet (Large) [150]      3.40E+08         126GB
2020  ERNIE-Gen (Large) [145]  3.40E+08         16GB
2019  RoBERTa (Large) [74]     3.55E+08         161GB
2019  MegatronLM [122]         8.30E+09         174GB
2020  T5-11B [107]             1.10E+10         745GB
2020  T-NLG [112]              1.70E+10         174GB
2020  GPT-3 [25]               1.75E+11         570GB
2020  GShard [73]              6.00E+11         –
2021  Switch-C [43]            1.57E+12         745GB
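The feature-based transfer described in the preceding paragraph can be sketched as follows: pretrained word vectors are reused unchanged as input features for a small supervised classifier. The random vectors and toy labels below are hypothetical stand-ins for word2vec or GloVe embeddings and real task data; this is an illustration, not code from the paper.

```python
# Hedged sketch: pretrained word vectors reused as fixed features for a
# downstream supervised task. The random vectors below stand in for
# embeddings that would normally be loaded from word2vec [85] or GloVe [98].
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
vocab = ["good", "great", "bad", "awful", "movie", "plot"]
embeddings = {w: rng.normal(size=50) for w in vocab}  # stand-in for pretrained vectors

def featurize(sentence):
    """Represent a sentence as the average of its word vectors (unknown words skipped)."""
    vecs = [embeddings[w] for w in sentence.split() if w in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

# Tiny labeled set for a supervised task (here, sentiment): far less labeled
# data is needed because the word representations are already pretrained.
train_sentences = ["good movie", "great plot", "bad movie", "awful plot"]
train_labels = [1, 1, 0, 0]

clf = LogisticRegression().fit([featurize(s) for s in train_sentences], train_labels)
print(clf.predict([featurize("great movie")]))  # predicted label for an unseen sentence
```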
Transformer models, on the other hand, have been able to con-
tinuously benefit from larger architectures and larger quantities of
data. Devlin et al. [39] in particular noted that training on a large
dataset and fine-tuning for specific tasks leads to strictly increasing
results on the GLUE tasks [138] for English as the hyperparameters
of the model were increased. Initially developed as Chinese LMs, the
ERNIE family [130, 131, 145] produced ERNIE-Gen, which was also
trained on the original (English) BERT dataset, joining the ranks
of very large LMs. NVIDIA released the MegatronLM which has
8.3B parameters and was trained on 174GB of text from the English
Wikipedia, OpenWebText, RealNews and CC-Stories datasets [122].
Trained on the same dataset, Microsoft released T-NLG,1 an LM
with 17B parameters. OpenAI’s GPT-3 [25] and Google’s GShard
[73] and Switch-C [43] have increased the definition of large LM by
orders of magnitude in terms of parameters at 175B, 600B, and 1.6T
parameters, respectively. Table 1 summarizes a selection of these
LMs in terms of training data size and parameters. As increasingly
large amounts of text are collected from the web in datasets such
as the Colossal Clean Crawled Corpus [107] and the Pile [51], this
trend of increasingly large LMs can be expected to continue as long
as they correlate with an increase in performance.
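A hedged sketch of the pretrain-then-fine-tune recipe discussed above is given below; it assumes the Hugging Face transformers and datasets libraries and the released bert-base-uncased checkpoint, none of which the paper itself ties the recipe to, and uses SST-2, one of the GLUE tasks [138].

```python
# Sketch of fine-tuning a pretrained LM for a specific task (here SST-2 from
# GLUE), assuming the Hugging Face `transformers` and `datasets` libraries.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

dataset = load_dataset("glue", "sst2")
encoded = dataset.map(
    lambda ex: tokenizer(ex["sentence"], truncation=True,
                         padding="max_length", max_length=128),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sst2-bert", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)
trainer.train()  # updates all weights of the pretrained model on the labeled task data
```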
A number of these models also have multilingual variants such
as mBERT [39] and mT5 [148] or are trained with some amount of
multilingual data such as GPT-3 where 7% of the training data was
not in English [25]. The performance of these multilingual mod-
els across languages is an active area of research. Wu and Dredze
[144] found that while mBERT does not perform equally well across
all 104 languages in its training data, it performed better at NER,
POS tagging, and dependency parsing than monolingual models
trained with comparable amounts of data for four low-resource
languages. Conversely, [95] surveyed monolingual BERT models
developed with more specific architecture considerations or addi-
tional monolingual data and found that they generally outperform
1 https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft

Solution

Deblina answered on Apr 22 2022
Title: Ethics Report
Contents
Reason Why Timnit Gebru & Margaret Mitchell Were Fired
Works Cited
Reason Why Timnit Gebru & Margaret Mitchell Were Fired
Margaret Mitchell and Timnit Gebru were fired over issues of academic freedom and diversity. The two ethics researchers were said to have violated the company's code of conduct and security policies by moving electronic files outside the company. Google's work on ethics in artificial intelligence came under close scrutiny after the firing of Gebru, who had gained a prominent position for exposing bias in facial analysis systems. The researchers had also addressed diversity and inclusion among employees, expressing concern that the company was...