
Record Linkage
You must work alone on this assignment.
In this assignment, you will identify or link records from two datasets that refer to the same restaurant.
The datasets, which come from two restaurant review companies, contain the names and addresses of a
group of restaurants. The task of linking this information is non-trivial when the restaurants’ names and
addresses are not guaranteed to be identical across the two data sets.
This assignment will give you experience in probabilistic record linkage, a technique that you will find useful when working with real-world data in the future. We will train an algorithm to classify pairs of entries as matches, unmatches (that is, not a match), and possible matches. The training algorithm has several steps:
1. Convert each restaurant name and address into three fields: a name, city, and street address.
2. Compute the probability of a match using hand-labeled match data.
3. Compute the probability of an unmatch using randomly-chosen unmatch data.
4. Use the computed probabilities to train the classifier.
Once you have trained the algorithm, we will apply it to the full datasets.
Getting started

The directory pa3 includes the following files:
record_linkage.py: skeleton file for the assignment (you will add code to this file only),
test_traffic_stops.py: test code,
util.py: utility functions,
get_files.sh: shell script for getting the data.
Data
To get the data for this assignment, you need to run the following command from the linux command-line:
$ ./get_files.sh
Running this script will download two directories: data and output. The first contains the input data for the assignment. The second contains the expected output.
We will be using data provided by Zagat and Fodor’s. The companies provide the information for a given restaurant as a single string. We have split these strings into name, city, and address fields using regular expressions and constructed two CSV files (data/zagat.csv and data/fodors.csv). This automated extraction process generates a less-than-perfect split for some restaurants. While the results are not perfect, the process used is realistic, and the data is in sufficiently good shape to use with the probabilistic algorithm that we will employ.
The files data/zagat.csv and data/fodors.csv each contain four columns: index, restaurant name, city,
and street address. Here, for example, are the first few rows from data/zagat.csv:
index,restaurant name,city,street address
0,Apple Pan The,West LA,10801 W. Pico Blvd.
1,Arnie Morton's of Chicago,Los Angeles,435 S. La Cienega Blvd.
2,Art's Deli,Studio City,12224 Ventura Blvd.
3,Asahi Ramen,West LA,2027 Sawtelle Blvd.
4,Baja Fresh,Westlake Village,3345 Kimber Dr.

Due: February 13th at 5pm

We have seeded your repository with a directory for this assignment. To pick it up, change to your ucid-win-20-username directory (where the string username should be replaced with your username) and then run the command: git pull upstream master. You should also run git pull to make sure your local copy of your repository is in sync with the server. As in previous assignments, you will need to run this command on your VM and in Lab, if you plan to use both.
You must read the review data into Pandas dataframes. When you load the data, we recommend (1) using
the index column, which contains a unique identifier (an integer) for each row, as the row index and (2)
setting the type for the remaining columns to str.
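For example, a minimal loading sketch along these lines (the variable names zagat and fodors are illustrative, not required by the assignment):

import pandas as pd

# Use the "index" column as the row index and treat the remaining
# columns as strings.
col_types = {"restaurant name": str, "city": str, "street address": str}
zagat = pd.read_csv("data/zagat.csv", index_col="index", dtype=col_types)
fodors = pd.read_csv("data/fodors.csv", index_col="index", dtype=col_types)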
In addition to the review data, we have also provided a training dataset of known links in a file named data/known_links.csv. This file contains 50 manually-linked pairs of restaurants. In other words, a human chose some of the rows in one dataset and determined the corresponding rows in the other dataset. The first restaurant in the pair is from the Zagat dataset and the second from the Fodor’s dataset. Rows in the known-links file correspond to these pairs, and the two columns give the indexes for a restaurant within the corresponding files. For example, the first non-header line of the known-links file is “269,386”, so the row with index 269 in the Zagat dataset is deemed to be a “match” with the row with index 386 in the Fodor’s dataset.
We’ve also provided a file, data/unmatch_pairs.csv, that contains 1000 pairs of randomly-chosen indexes, one from each review dataset. We will deem these pairs to be unmatches. Why are we using 1000 random pairs for unmatches, when we only have 50 for matches? The more pairs we have, the better the accuracy of our record linkage. Of course, having more matches would also have been helpful, but that requires expensive manual work.
Your first task is to load the data.
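A sketch of loading the pair files might look as follows. Here we assume both files have a header row, as described above, and we keep each pair as a (zagat_index, fodors_index) tuple; the variable names are illustrative.

import pandas as pd

# Each row pairs a Zagat index with a Fodor's index.
known_links = pd.read_csv("data/known_links.csv")
unmatch_pairs = pd.read_csv("data/unmatch_pairs.csv")

match_pairs = list(known_links.itertuples(index=False, name=None))
random_unmatch_pairs = list(unmatch_pairs.itertuples(index=False, name=None))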
Computing the probability of a match or an unmatch
We will explain the process of converting dataframes along with the matched and unmatched pairs into
something more abstract that can be used to help classify other restaurants as matches or unmatches in
two steps. First, we will explain how to compute similarity tuples and then how to use the resulting tuples
to estimate the probability of a match or an unmatch.
Constructing similarity tuples
In this assignment, we will compute the similarity of the name, city, and address fields separately. A natural measure of similarity for strings is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number of single-character insertions, deletions, and substitutions required to convert one string to another. The Levenshtein distance is computationally intensive, so in practice a related measure, called the Jaro-Winkler distance, is used instead.
The exact definition of the Jaro-Winkler distance is somewhat technical (see https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match (“Medici” and “Medici”) has a score of 1.0, and a complete mismatch (“Medici” and “Subway”) has a score of 0.0.
You will use the Python library jellyfish (https://github.com/jamesturk/jellyfish) to compute the Jaro-Winkler distance between two strings. It is already installed on the CS machines, but you will need to install it on your VM using the following command:
$ sudo -H pip3 install jellyfish
Specifically, we’ll use the function jellyfish.jaro_winkler to compute the Jaro-Winkler distance. To
make this discussion more concrete, here are the results of computing the Jaro-Winkler score for the
name, city, and address fields from the first matched pair (index 269 from the Zagat dataset and index 386
from the Fodor’s dataset):
In [1]: import jellyfish

In [2]: jellyfish.jaro_winkler('Ritz-Carlton Restaurant',
...: 'Restaurant Ritz-Carlton Atlanta')
Out[2]: XXXXXXXXXX

In [3]: jellyfish.jaro_winkler('Atlanta', 'Atlanta')
Out[3]: 1.0

In [4]: jellyfish.jaro_winkler('181 Peachtree St.',
...: '181 Peachtree St.')
Out[4]: 1.0
As we will see in class, the Jaro-Winkler similarity scores do not increase monotonically, and so a “cut-off” on the similarity for a single field or many fields will not serve our purpose. Instead, we break up the range of similarity values into discrete chunks, which we will call "low", "medium", and "high", and determine the probability of a match for each combination of field-wise chunks separately.
We have provided a simple utility function get_jw_category() in util.py that breaks up the range of Jaro-Winkler similarity scores into these three blocks. It takes a Jaro-Winkler distance and returns the string "low", "medium", or "high". We will apply this function to all three fields: names, cities, and addresses. Thus, for any pair of rows from the review datasets, we can create a similarity tuple (name_category, city_category, address_category) that represents the similarity of the fields for the corresponding pair of rows. Applying this process to the rows that correspond to the indexes 269 and 386 in the Zagat and Fodor’s datasets, respectively, yields the similarity tuple: ('low', 'high', 'high').
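Putting these pieces together, one possible helper for computing a similarity tuple might look like the sketch below. The name similarity_tuple is ours, not part of the assignment skeleton; it assumes the dataframes loaded earlier and the get_jw_category() function described above.

import jellyfish
import util   # provides get_jw_category(), as described above

def similarity_tuple(zagat_row, fodors_row):
    # Categorize the Jaro-Winkler score for each of the three fields.
    fields = ("restaurant name", "city", "street address")
    return tuple(util.get_jw_category(
                     jellyfish.jaro_winkler(zagat_row[f], fodors_row[f]))
                 for f in fields)

# For example, similarity_tuple(zagat.loc[269], fodors.loc[386]) should
# yield ('low', 'high', 'high'), matching the example above.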
Estimating the probabilities
Whether, during record linkage, a pair of rows should be classified as a match, unmatch, or something in between will depend on whether its (name_category, city_category, address_category) tuple was most closely associated with matches or unmatches when we trained the algorithm, and on our tolerance for error.
Specifically, we will determine whether a tuple should be classified as a match, possible match, or unmatch based on estimates of the probability that a matched pair results in that tuple as well as the probability that an unmatched pair results in that tuple. Formally, assume that r is a potential pair, t(r) is the tuple formed from its field similarities, and T is the set of all possible tuples. For every tuple w in T we need estimates for two quantities:

    m_w = P(t(r) = w | r is a match)
    u_w = P(t(r) = w | r is an unmatch)
Your task is to compute estimates for the former by iterating through all the pairs of rows corresponding to the known matches, determining their similarity tuples, and counting the frequency of each of the 27 possible similarity tuples (combinations of the three similarity levels in the three different positions) during this process. We’ll use the relative frequency of a tuple as our estimate of its probability given a matching pair. Similarly, find an estimate of the probability given an unmatch by iterating through the unmatch pairs and computing the frequency of the corresponding similarity tuples.
To make this more concrete: your second task is to construct two dictionaries: one that maps similarity tuples to match probabilities and the other that maps similarity tuples to unmatch probabilities.
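One way to build such a dictionary, reusing the similarity_tuple helper sketched earlier (again, the function and variable names are illustrative):

def tuple_probabilities(pairs, zagat, fodors):
    # Count how often each similarity tuple occurs among the given
    # (zagat_index, fodors_index) pairs, then convert counts to relative
    # frequencies. Tuples that never occur are simply absent from the
    # result (i.e., they implicitly have probability zero).
    counts = {}
    for z_idx, f_idx in pairs:
        t = similarity_tuple(zagat.loc[z_idx], fodors.loc[f_idx])
        counts[t] = counts.get(t, 0) + 1
    return {t: c / len(pairs) for t, c in counts.items()}

# match_probs = tuple_probabilities(match_pairs, zagat, fodors)
# unmatch_probs = tuple_probabilities(random_unmatch_pairs, zagat, fodors)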
Partition tuples into match, possible match, and unmatch sets
This step is tricky. It is important that you read this section extremely carefully.
The goal of this step is to compute a dictionary that maps each similarity tuple to one of "match",
"possible match", or "unmatch".
How should we decide which tuples belong in which of the three different categories: match, possible match, and unmatch? It depends on our tolerance for two kinds of errors:
false positive rate, the probability that we classify an actual unmatch as a match, and
false negative rate, the probability that we classify an actual match as an unmatch.
Assume that we are given the maximum false positive rate and the maximum false negative rate we are willing to accept; given that these constraints are satisfied, we want to maximize the number of our matches.
In order to classify which tuples should be associated
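One common Fellegi-Sunter-style approach (shown here only as a rough illustration, not necessarily the rule this assignment specifies) ranks tuples from most to least match-like and then spends the allowed false positive and false negative budgets from either end of that ranking. In the sketch below, max_fp and max_fn stand for the two maximum rates, and match_probs and unmatch_probs are the dictionaries built in the previous step.

def partition_tuples(match_probs, unmatch_probs, max_fp, max_fn):
    # Label every similarity tuple "match", "possible match", or "unmatch".
    levels = ("low", "medium", "high")
    all_tuples = [(n, c, a) for n in levels for c in levels for a in levels]
    labels = {t: "possible match" for t in all_tuples}

    def m(t):
        return match_probs.get(t, 0.0)

    def u(t):
        return unmatch_probs.get(t, 0.0)

    # Tuples never seen in either training set stay "possible match".
    seen = [t for t in all_tuples if m(t) > 0 or u(t) > 0]
    # Most match-like first: tuples seen only among matches, then by m/u ratio.
    ordered = sorted(seen,
                     key=lambda t: (u(t) == 0, m(t) / u(t) if u(t) else 0.0),
                     reverse=True)

    fp = 0.0   # unmatch probability spent on "match" labels
    i = 0
    while i < len(ordered) and fp + u(ordered[i]) <= max_fp:
        labels[ordered[i]] = "match"
        fp += u(ordered[i])
        i += 1

    fn = 0.0   # match probability spent on "unmatch" labels
    j = len(ordered) - 1
    while j >= i and fn + m(ordered[j]) <= max_fn:
        labels[ordered[j]] = "unmatch"
        fn += m(ordered[j])
        j -= 1

    return labels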