
Modify the skeleton .py file to translate words from English to Pig Latin. Please do not import anything other than what is already in the skeleton .py file. I've also attached the lecture slides...
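Since the skeleton .py file itself is not shown here, the following is only a minimal sketch of the core translation logic, written with no imports at all. The function name to_pig_latin and the exact rule set (vowel-initial words get "way", a leading consonant cluster moves to the end followed by "ay") are assumptions and may differ from the assignment's specification.

```python
# Minimal sketch of Pig Latin translation; to_pig_latin and the rules below
# are assumptions, since the real skeleton file is not shown here.

VOWELS = "aeiou"

def to_pig_latin(word):
    """Translate a single lowercase English word into Pig Latin."""
    if word[0] in VOWELS:
        # Vowel-initial words: just append "way".
        return word + "way"
    # Consonant-initial words: move the leading consonant cluster to the end.
    for i, ch in enumerate(word):
        if ch in VOWELS:
            return word[i:] + word[:i] + "ay"
    # No vowels at all (e.g., "rhythm" under these simple rules): append "ay".
    return word + "ay"

if __name__ == "__main__":
    print(to_pig_latin("string"))  # -> "ingstray"
    print(to_pig_latin("apple"))   # -> "appleway"
```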

Today’s Topics
• Motivation: machine neural translation for long sentences
• Decoder: attention
• Transformer overview
• Self-attention
Slides Thanks to Dana Gurari
Converting Text to Vectors
1. Tokenize training data; convert data into a sequence of tokens (e.g., data -> “This is tokenizing”)
2. Learn vocabulary
3. Encode data as vectors
Two common approaches:
https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa
Converting Text to Vectors
1. Tokenize training data
2. Learn vocabulary by identifying all unique tokens in the training data
3. Encode data as vectors
Two common approaches:
https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa
Character-level vocabulary:
Token: a, b, c, …, 0, 1, …, !, @, …
Index: 1, 2, 3, …, 27, 28, …, XXXXXXXXXX, …
Word-level vocabulary:
Token: a, an, at, …, bat, ball, …, zipper, zoo, …
Index: 1, 2, 3, …, XXXXXXXXXX, …, 9,842, 9,843, …
1. Tokenize training data
2. Learn vocabulary by identifying all unique tokens in the training data
3. Encode data as one-hot vectors
https://github.com/DipLernin/Text_Generation
One-hot encodings
Input sequence of 40 tokens representing characters or words
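As a rough illustration of these three steps at the character level, the sketch below tokenizes a string, builds a vocabulary of unique tokens, and encodes a token as a one-hot vector. The helper names build_vocab and one_hot are made up for this example.

```python
# Sketch of the three steps at the character level: tokenize, learn a
# vocabulary of unique tokens, then encode each token as a one-hot vector.
# Function names (build_vocab, one_hot) are illustrative.

def build_vocab(text):
    """Map each unique character token to an integer index."""
    tokens = list(text)                                            # 1. tokenize (character-level)
    vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}  # 2. learn vocabulary
    return tokens, vocab

def one_hot(token, vocab):
    """3. Encode a token as a one-hot vector of length |vocabulary|."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

tokens, vocab = build_vocab("this is tokenizing")
print(len(vocab))              # number of unique characters
print(one_hot("t", vocab))     # one-hot vector with a single 1
```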
Converting Text to Vectors
What are the pros and cons for using word tokens instead of character tokens?
- Pros: length of input/output sequences is shorter, which simplifies learning semantics
- Cons: “UNK” word token needed for out-of-vocabulary words; vocabulary can be large
https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa
Character-level vocabulary:
Token: a, b, c, …, 0, 1, …, !, @, …
Index: 1, 2, 3, …, 27, 28, …, XXXXXXXXXX, …
Word-level vocabulary:
Token: a, an, at, …, bat, ball, …, zipper, zoo, …
Index: 1, 2, 3, …, XXXXXXXXXX, …, 9,842, 9,843, …
Converting Text to Vectors
Word level representations are more commonly used
https://nlpiation.medium.com/how-to-use-huggingfaces-transformers-pre-trained-tokenizers-e029e8d6d1fa
Character-level vocabulary:
Token: a, b, c, …, 0, 1, …, !, @, …
Index: 1, 2, 3, …, 27, 28, …, XXXXXXXXXX, …
Word-level vocabulary:
Token: a, an, at, …, bat, ball, …, zipper, zoo, …
Index: 1, 2, 3, …, XXXXXXXXXX, …, 9,842, 9,843, …
Problems with One-Hot Encoding Words?
Kamath, Liu, and Whitaker. Deep Learning for NLP and Speech Recognition. 2019.
• Huge memory burden
• Computationally expensive
Dimensionality = vocabulary size
e.g., English has ~170,000 words
with ~10,000 commonly used words
Limitation of One-Hot Encoding Words
• No notion of which words are similar, yet such understanding can improve generalization
• e.g., “walking”, “running”, and “skipping” are all suitable for “He was ____ to school.”
Walking, Soap, Fire, Skipping: the distance between all words is equal!
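A quick sanity check of this limitation: any two distinct one-hot vectors are exactly the same distance apart, so the encoding says nothing about which words are similar. The tiny example below uses made-up indices purely for illustration.

```python
# Every pair of distinct one-hot vectors is equally far apart, so one-hot
# encodings carry no notion of word similarity ("walking" is as far from
# "running" as it is from "soap"). Vocabulary size and indices are made up.
import math

def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

walking, running, soap = one_hot(0, 5), one_hot(1, 5), one_hot(2, 5)
print(euclidean(walking, running))  # sqrt(2)
print(euclidean(walking, soap))     # sqrt(2) -- identical distance
```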
Today’s Topics
• Introduction to natural language processing
• Text representation
• Neural word embeddings
• Programming tutorial
Idea: Represent Each Word Compactly in a Space
Where Vector Distance Indicates Word Similarity
Kamath, Liu, and Whitaker. Deep Learning for NLP and Speech Recognition. 2019.
Inspiration: Distributional Semantics
“The distributional hypothesis says that the meaning of a
word is derived from the context in which it is used, and
words with similar meaning are used in similar contexts.”
- Origins: Harris in 1954 and Firth in 1957
Kamath, Liu, and Whitaker. Deep Learning for NLP and Speech Recognition. 2019.
Inspiration: Distributional Semantics
• What is the meaning of berimbau based on context?
• Idea: context makes it easier to understand a word’s meaning
Background music from a berimbau offers a beautiful escape.
Many people danced around the berimbau player.
I practiced for many years to learn how to play the berimbau.
https://capoeirasongbook.wordpress.com/instruments/berimbau/
[Adapted from slides by Lena Voita]
Inspiration: Distributional Semantics
“The distributional hypothesis says that the meaning of a
word is derived from the context in which it is used, and
words with similar meaning are used in similar contexts.”
Kamath, Liu, and Whitaker. Deep Learning for NLP and Speech Recognition. 2019.
• What other words could fit into these contexts?
Inspiration: Distributional Semantics
[Adapted from slides by Lena Voita]
1. Background music from a _______ offers a beautiful escape.
2. Many people danced around the _______ player.
3. I practiced for many years to learn how to play the _______.
Contexts 1-3 (1 if a word can appear in the context, 0 otherwise):
Berimbau  XXXXXXXXXX
Soap      XXXXXXXXXX
Fire      XXXXXXXXXX
Guitar    XXXXXXXXXX
Hypothesis is that words with similar row values have similar meanings
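The sketch below mirrors this word-by-context idea: each word gets a row of 1s and 0s over the three contexts, and words with matching rows are hypothesized to be similar. The 0/1 values are illustrative guesses, not values taken from the slide.

```python
# Sketch of the word-by-context table: rows are candidate words, columns are
# the three fill-in-the-blank contexts, and each cell is 1 if the word can
# plausibly appear in that context, 0 otherwise. The 0/1 values below are
# illustrative guesses, not values from the slide.
contexts = [
    "Background music from a _______ offers a beautiful escape.",
    "Many people danced around the _______ player.",
    "I practiced for many years to learn how to play the _______.",
]

context_vectors = {
    "berimbau": [1, 1, 1],
    "guitar":   [1, 1, 1],
    "soap":     [0, 0, 0],
    "fire":     [0, 0, 0],
}

# Words whose rows match are hypothesized to have similar meanings.
print(context_vectors["berimbau"] == context_vectors["guitar"])  # True
```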
Approach
• Learn a dense (lower-dimensional) vector for each word by characterizing its
context, which inherently will reflect similarity/differences to other words
Berimbau, Soap, Fire, Guitar: berimbau and guitar are the closest word pair
The distance between each pair of words differs!
Note: many ways to measure distance (e.g., cosine distance)
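To make the "distances now differ" point concrete, the toy sketch below compares 5-dimensional word vectors with cosine distance. The vectors are invented for illustration only; they are not learned embeddings.

```python
# Toy comparison of dense word vectors with cosine distance. The 5-d vectors
# are invented placeholders, not learned embeddings; the point is that
# distances now differ across word pairs.
import math

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

embeddings = {
    "berimbau": [0.9, 0.1, 0.0, 0.2, 0.8],
    "guitar":   [0.8, 0.2, 0.1, 0.1, 0.9],
    "soap":     [0.0, 0.9, 0.8, 0.1, 0.0],
}

print(cosine_distance(embeddings["berimbau"], embeddings["guitar"]))  # small
print(cosine_distance(embeddings["berimbau"], embeddings["soap"]))    # large
```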
Approach
• Learn a dense (lower-dimensional) vector for each word by characterizing its
context, which inherently will reflect similarity/differences to other words
We embed words in a shared space so they can be compared with a few features
What features would discriminate these words?
Berimbau, Soap, Fire, Guitar
Approach
• Learn a dense (lower-dimensional) vector for each word by characterizing its
context, which inherently will reflect similarity/differences to other words
Berimbau, Soap, Fire, Guitar
Potential, interpretable features: Wooden, Commodity, Cleaner, Food, Temperature, Noisy, Weapon
Approach: Learn Word Embedding Space
• An embedding space represents a finite number of words, decided in training
• A word embedding is represented as a vector indicating its context
• The dimensionality of all word embeddings in an embedding space match
• What is the dimensionality for the shown example? (7: one dimension per interpretable feature)
Approach: Learn Word Embedding Space
• An embedding space represents a finite number of words, defined in training
• A word embedding is represented as a vector indicating its context
• The dimensionality of all word embeddings in an embedding space match
? ? ? ? ? ? ?
In practice, the learned discriminating features are hard to interpret
Embedding Matrix
• The embedding matrix converts an input word into a dense vector
Kamath, Liu, and Whitaker. Deep Learning for NLP and Speech Recognition. 2019.
Size of vocabulary (Berimbau, Soap, Fire, Guitar, …) by target dimensionality (e.g., 5)
One-hot encoding dictates the word embedding to use
Embedding Matrix
• It converts an input word into a dense vector
Kamath, Liu, and Whitaker. Deep Learning for NLP and Speech Recognition. 2019.
Size of vocabulary (Berimbau, Soap, Fire, Guitar, …) by target dimensionality (e.g., 5)
A word’s embedding can efficiently be extracted when we know the word’s index
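A small sketch of this lookup, assuming the embedding matrix is stored with one row per vocabulary word (orientation and values here are arbitrary): multiplying by a one-hot vector selects a single row, and indexing by the word's position returns the same vector without the multiplication.

```python
# Sketch of the embedding-matrix lookup: multiplying a one-hot vector by the
# embedding matrix selects one row, and indexing by the word's position gives
# the same vector without the multiply. Matrix values are random placeholders.
import numpy as np

vocab = ["berimbau", "soap", "fire", "guitar"]    # size-4 toy vocabulary
embedding_dim = 5                                 # target dimensionality (e.g., 5)
E = np.random.randn(len(vocab), embedding_dim)    # embedding matrix: |V| x 5

word_index = vocab.index("berimbau")
one_hot = np.zeros(len(vocab))
one_hot[word_index] = 1.0

via_matmul = one_hot @ E          # one-hot multiply: picks out one row
via_lookup = E[word_index]        # direct row lookup by index (efficient)
print(np.allclose(via_matmul, via_lookup))  # True
```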
Popular Word Embeddings
• Bengio method
• Word2vec (skip-gram model)
• And more…
Idea: Learn Word Embeddings That Help
Predict Viable Next Words
e.g.,
1. Background music from a _______
2. Many people danced around the _______
3. I practiced for many years to learn how to play the _______
Bengio et al. A Neural Probabilistic Language Model. JMLR 2003.
Task: Predict Next Word
Given Previous Ones
e.g.,
1. Background music from a _______
2. Many people danced around the _______
3. I practiced for many years to learn how to play the _______
Task: Predict Next Word
Given Previous Ones
Bengio et al. A Neural Probabilistic Language Model. JMLR 2003.
e.g., a vocabulary size of 17,000
was used in experiments
What is the dimensionality of the output layer? 17,000 (one output unit per word in the vocabulary)
Architecture
Bengio et al. A Neural Probabilistic Language Model. JMLR 2003.
Embedding matrix:
Word embeddings:
Architecture
Bengio et al. A Neural Probabilistic Language Model. JMLR 2003.
e.g., a vocabulary size of 17,000
was used with embedding sizes of
30, 60, and 100 in experiments
Assume a 30-d word embedding
- what are the dimensions of the
embedding matrix C?
30 x 17,000 (i.e., 510,000 weights)
Architecture
Bengio et al. A Neural Probabilistic Language Model. JMLR 2003.
e.g., a vocabulary size of 17,000
was used with embedding sizes of
30, 60, and 100 in experiments
Assume a 30-d word embedding
- what are the dimensions of each
word embedding?
1 x 30
Architecture
Bengio et al. A Neural Probabilistic Language Model. JMLR 2003.
Projection layer followed by a
hidden layer with non-linearity
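The sketch below is a minimal forward pass of a Bengio-style model at toy sizes (not the paper's 17,000-word vocabulary): look up the embeddings of the previous words in C, concatenate them as the projection layer, apply a tanh hidden layer, and finish with a softmax over the vocabulary. All weights are random placeholders, and C is stored as vocabulary-by-embedding-size for convenience.

```python
# Minimal forward pass of a Bengio-style neural language model. Sizes are toy
# values and all weights are random placeholders.
import numpy as np

vocab_size, embed_dim, context_size, hidden_dim = 1000, 30, 3, 50

C = np.random.randn(vocab_size, embed_dim)                  # embedding matrix
W_h = np.random.randn(context_size * embed_dim, hidden_dim) # projection -> hidden
b_h = np.zeros(hidden_dim)
W_o = np.random.randn(hidden_dim, vocab_size)               # hidden -> output
b_o = np.zeros(vocab_size)

def predict_next(word_indices):
    """Return a probability distribution over the next word."""
    x = np.concatenate([C[i] for i in word_indices])  # projection layer
    h = np.tanh(x @ W_h + b_h)                        # hidden layer with non-linearity
    logits = h @ W_o + b_o
    exp = np.exp(logits - logits.max())               # softmax over the vocabulary
    return exp / exp.sum()

probs = predict_next([12, 47, 3])   # indices of the 3 previous words
print(probs.shape, probs.sum())     # (1000,) and probabilities sum to 1
```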
Training
Bengio et al. A Neural Probabilistic Language Model. JMLR 2003.
Input: tried 1, 3, 5, and 8 input words
and used 2 datasets with ~1 million and
~34 million words respectively
Use sliding window on input data; e.g., 3 words:
Background music from a berimbau offers a beautiful escape…
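A minimal sketch of the sliding window over this example sentence: every 3 consecutive words form the input context and the following word is the prediction target.

```python
# Sketch of the sliding window: each group of 3 consecutive words is the
# input context and the next word is the prediction target.
sentence = "Background music from a berimbau offers a beautiful escape".split()

window = 3
pairs = [(sentence[i:i + window], sentence[i + window])
         for i in range(len(sentence) - window)]

for context, target in pairs[:3]:
    print(context, "->", target)
# ['Background', 'music', 'from'] -> a
# ['music', 'from', 'a'] -> berimbau
# ['from', 'a', 'berimbau'] -> offers
```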
Training
Bengio et al. A Neural Probabilistic Language Model. JMLR 2003.
Input: tried 1, 3, 5, and 8 input words
and used 2 datasets with ~1 million and
~34 million words respectively
Cost function: minimize cross-entropy loss plus regularization (L2 weight decay)
Word embeddings are iteratively updated
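A rough sketch of this objective, with an arbitrary illustrative decay coefficient: the cross-entropy of the predicted distribution at the true next word, plus an L2 penalty on the weights.

```python
# Sketch of the training objective: cross-entropy at the true next word plus
# an L2 weight-decay penalty. The decay coefficient is an arbitrary value.
import numpy as np

def loss(pred_probs, target_index, weights, decay=1e-4):
    cross_entropy = -np.log(pred_probs[target_index])
    l2_penalty = decay * sum(np.sum(W ** 2) for W in weights)
    return cross_entropy + l2_penalty

pred = np.array([0.7, 0.2, 0.1])           # toy distribution over a 3-word vocab
W = [np.array([[0.5, -0.5], [1.0, 0.0]])]  # toy weight matrices to regularize
print(loss(pred, target_index=0, weights=W))  # small loss: correct word is likely
```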
Summary: Word Embeddings Are Learned that
Support Predicting Viable Next Words
e.g.,
1. Background music from a _______
2. Many people danced around the _______
3. I practiced for many years to learn how to play the _______