AI & SOCIETY
https://doi.org/10.1007/s XXXXXXXXXX
ORIGINAL ARTICLE
Excavating AI: the politics of images in machine learning training sets
Kate Crawford1 · Trevor Paglen2
Received: 1 May 2020 / Accepted: 14 October 2020
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021
Abstract
By looking at the politics of classification within machine learning systems, this article demonstrates why the automated
interpretation of images is an inherently social and political project. We begin by asking what work images do in computer
vision systems, and what is meant by the claim that computers can “recognize” an image. Next, we look at the method for
introducing images into computer systems and look at how taxonomies order the foundational concepts that will determine
how a system interprets the world. Then we turn to the question of labeling: how humans tell computers which words will
relate to a given image. What is at stake in the way AI systems use these labels to classify humans, including by race, gender,
emotions, ability, sexuality, and personality? Finally, we turn to the purposes that computer vision is meant to serve in our
society—the judgments, choices, and consequences of providing computers with these capacities. Methodologically, we
call this an archeology of datasets: studying the material layers of training images and labels, cataloguing the principles and
values by which taxonomies are constructed, and analyzing how these taxonomies create the parameters of intelligibility for
an AI system. By doing this, we can critically engage with the underlying politics and values of a system, and analyze which
normative patterns of life are assumed, supported, and reproduced.
Keywords Computer vision · Machine learning · Training data · Epistemology · Politics of classification · Artificial
intelligence
1 Introduction
You open up a database of pictures used to train artificial
intelligence systems. At first, things seem straightforward.
You’re met with thousands of images: apples and oranges,
birds, dogs, horses, mountains, clouds, houses, and street
signs. But as you probe further into the dataset, people begin
to appear: cheerleaders, scuba divers, welders, Boy Scouts,
fire walkers, and flower girls. Things get strange: a photo-
graph of a woman smiling in a bikini is labeled a “slattern,
slut, slovenly woman, trollop.” A young man drinking beer
is categorized as an “alcoholic, alky, dipsomaniac, boozer,
lush, soaker, souse.” A child wearing sunglasses is classi-
fied as a “failure, loser, non-starter, unsuccessful person.”
You’re looking at the “person” category in a dataset called
ImageNet, one of the most widely used training sets for
machine learning.
Something is wrong with this picture.
Where did these images come from? Why were the peo-
ple in the photos labeled this way? What sorts of politics
are at work when pictures are paired with labels, and what
are the implications when they are used to train technical
systems?
In short, how did we get here?
There’s an urban legend about the early days of machine vision, the subfield of artificial intelligence (AI) concerned
with teaching machines to detect and interpret images. In
1966, Marvin Minsky was a young professor at MIT, making
a name for himself in the emerging field of artificial intel-
ligence.1 Deciding that the ability to interpret images was
* Kate Crawford
XXXXXXXXXX
Trevor Paglen
XXXXXXXXXX
1 University of Southern California, Annenberg School,
Microsoft Research New York, New York, USA
2 The University of Georgia, Athens, Greece
1 Minsky currently faces serious allegations related to convicted pedophile and rapist Jeffrey Epstein. Minsky was one of several scientists who met with Epstein and visited his island retreat where underage girls were forced to have sex with members of Epstein’s coterie. As scholar Meredith Broussard observed, there is a broader culture of exclusion and hostility that became endemic in AI: “as wonderfully creative as Minsky and his cohort were, they also solidified the culture of tech as a billionaire boys’ club. Math, physics, and the other ‘hard’ sciences have never been hospitable to women and people of color; tech followed this lead.” See Broussard (2018).
a core feature of intelligence, Minsky turned to an under-
graduate student, Gerald Sussman, and asked him to “spend
the summer linking a camera to a computer and getting the
computer to describe what it saw.”2 This became the Sum-
mer Vision Project.3 Needless to say, the project of getting
computers to “see” was much harder than anyone expected,
and would take a lot longer than a single summer.
The story we’ve been told goes like this:
brilliant men
worked for decades on the problem of computer vision,
proceeding in fits and starts, until the turn to probabilistic
modeling and learning techniques in the 1990s accelerated
progress. This led to the current moment, in which chal-
lenges such as object detection and facial recognition have
been largely solved.4 This arc of inevitability recurs in many
AI narratives, where it is assumed that ongoing technical
improvements will resolve all problems and limitations.
But what if the opposite is true? What if the challenge of
getting computers to “describe what they see” will always be
a problem? In this essay, we will explore why the automated
interpretation of images is an inherently social and political
project, rather than a purely technical one. Understanding
the politics within AI systems matters more than ever, as
they are quickly moving into the architecture of social insti-
tutions: deciding whom to interview for a job, which stu-
dents are paying attention in class, which suspects to arrest,
and much else.
Over two years, from 2017 to 2019, we studied the under-
lying logic of how images are used to train AI systems to
“see” the world. We have looked at hundreds of collec-
tions of images used in artificial intelligence, from the first
experiments with facial recognition in the early 1960s to
contemporary training sets containing millions of images.
Methodologically, we call this project an archeology of data-
sets: we have been digging through the material layers, cata-
loguing the principles and values by which something was
constructed, and analyzing what normative patterns of life
were assumed, supported, and reproduced. These are collec-
tions of images that are very rarely critically examined. By
excavating the construction of these training sets and their
underlying structures, many unquestioned assumptions are
revealed. These assumptions inform the way AI systems
work—and fail—to this day.
This essay begins with some deceptively simple ques-
tions: what work do images do in AI systems? What does it
mean that computers can “recognize” an image, and what is
misrecognized or even completely invisible? Next, we look
at the method for introducing images into computer systems
and look at how taxonomies order the foundational concepts
that will become intelligible to a computer system. Then we
turn to the question of labeling: how do humans tell comput-
ers which words will relate to a given image? And what is
at stake in the way AI systems use these labels to classify
humans, including by race, gender, emotions, ability, sexu-
ality, and personality? Finally, we turn to the purposes that
computer vision is meant to serve in our society—the judg-
ments, choices, and consequences of providing computers
with these capacities.
2 Training AI
Building AI systems requires data. Supervised machine-
learning systems designed for object or facial recognition
are trained on vast amounts of data contained within data-
sets made up of many discrete images. To build a computer
vision system that can, for example, recognize the difference
between pictures of apples and oranges, a developer has to
collect, label, and train a neural network on thousands of
labeled images of apples and oranges. On the software side,
the algorithms conduct a statistical survey of the images, and
develop a model to recognize the difference between the two
“classes.” If all goes according to plan, the trained model
will be able to distinguish the difference between images
of apples and oranges that it has never encountered before.
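To make this workflow concrete, the following is a minimal sketch of how such an apples-versus-oranges classifier is typically assembled with an off-the-shelf library (here PyTorch and torchvision). The directory layout ("fruit/train/apple", "fruit/train/orange"), the choice of network, and the training settings are illustrative assumptions, not a description of the systems discussed in this article.

    # Minimal sketch of the collect-label-train workflow described above (assumes a recent torchvision).
    import torch
    from torch import nn, optim
    from torchvision import datasets, models, transforms

    # 1. Collect and label: each image is "labeled" simply by the folder it sits in.
    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_set = datasets.ImageFolder("fruit/train", transform=preprocess)  # hypothetical path
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

    # 2. Train: a convolutional network fits a statistical model of the
    #    difference between the classes fixed by the training set.
    model = models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # e.g. 2 classes
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    for epoch in range(5):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

    # 3. Classify unseen images: whatever a new photograph "really" shows,
    #    the output can only ever be one of the classes defined in advance.
    model.eval()

The point of the sketch is structural rather than practical: the categories the system can ever "see" are fixed before training begins, by whoever collected and labeled the images.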
Training sets, then, are the foundation on which contem-
porary machine-learning systems are built.5 They are central
to how AI systems recognize and interpret the world. These
datasets shape the epistemic boundaries governing how AI
systems operate, and thus are an essential part of understand-
ing socially significant questions about AI.
But when we look at the training images widely used in
computer-vision systems, we find a bedrock composed of
shaky and skewed assumptions. For reasons that are rarely
2 See Crevier D (1993).
3 Minsky gets the credit for this idea, but clearly Papert, Sussman, and teams of “summer workers” were all part of this early effort to get computers to describe objects in the world. See Papert SA (1966). As he wrote: “The summer vision project is an attempt to use our summer workers effectively in the construction of a significant part of a visual system. The particular task was chosen partly because it can be segmented into sub-problems which allow individuals to work independently and yet participate in the construction of a system complex enough to be a real landmark in the development of ‘pattern recognition’.”
4 Russell SJ (2010).
5 In the late 1970s, Ryszard Michalski wrote an algorithm based on “symbolic variables” and logical rules. This language was very popular in the 1980s and 1990s, but, as the rules of decision-making and qualification became more complex, the language became less usable. At the same moment, the potential of using large training sets triggered a shift from this conceptual clustering to contemporary machine-learning approaches. See Michalski R (1980).
discussed within the field of computer vision, and despite
all that institutions like MIT and companies like Google and
Facebook have done, the project of interpreting images is
a profoundly complex and relational endeavor. Images are
remarkably slippery things, laden with multiple potential meanings, irresolvable questions, and contradictions. Entire
subfields of philosophy, art history, and media theory are
dedicated to teasing