
Record Linkage
You must work alone on this assignment.
In this assignment, you will identify or link records from two datasets that refer to the same restaurant.
The datasets, which come from two restaurant review companies, contain the names and addresses of a
group of restaurants. The task of linking this information is non-trivial when the restaurants’ names and
addresses are not guaranteed to be identical across the two data sets.
This assignment will give you experience in probabilistic record linkage, a technique that you will find useful when working with real-world data in the future. We will train an algorithm to classify pairs of entries as matches, unmatches (that is, not a match), and possible matches. The training algorithm has several steps:
1. Convert each restaurant name and address into three fields: a name, city, and street address.
2. Compute the probability of a match using hand-labeled match data.
3. Compute the probability of an unmatch using randomly-chosen unmatch data.
4. Use the computed probabilities to train the classifier.
Once you have trained the algorithm, we will apply it to the full datasets.
Getting started

The directory pa3 includes the following files:
record_linkage.py: skeleton file for the assignment (you will add code to this file only),
test_traffic_stops.py: test code,
util.py: utility functions,
get_files.sh: shell script for getting the data.
Data
To get the data for this assignment, you need to run the following command from the linux command-line:
$ ./get_files.sh
Running this script will download two directories: data and output. The first contains the input data for the assignment. The second contains the expected output.
We will be using data provided by Zagat and Fodor’s. The companies provide the information for a given restaurant as a single string. We have split these strings into name, city, and address fields using regular expressions and constructed two CSV files (data/zagat.csv and data/fodors.csv). This automated extraction process generates a less-than-perfect split for some restaurants. While the results are not perfect, the process used is realistic, and the data is in sufficiently good shape to use with the probabilistic algorithm that we will employ.
The files data/zagat.csv and data/fodors.csv each contain four columns: index, restaurant name, city,
and street address. Here, for example, are the first few rows from data/zagat.csv:
index,restaurant name,city,street address
0,Apple Pan The,West LA,10801 W. Pico Blvd.
1,Arnie Morton's of Chicago,Los Angeles,435 S. La Cienega Blvd.
2,Art's Deli,Studio City,12224 Ventura Blvd.
3,Asahi Ramen,West LA,2027 Sawtelle Blvd.
4,Baja Fresh,Westlake Village,3345 Kimber Dr.

Due: February 13th at 5pm

We have seeded your repository with a directory for this assignment. To pick it up, change to your ucid-win-20-username directory (where the string username should be replaced with your username) and then run the command: git pull upstream master. You should also run git pull to make sure your local copy of your repository is in sync with the server. As in previous assignments, you will need to run this command on your VM and in Lab, if you plan to use both.
You must read the review data into Pandas dataframes. When you load the data, we recommend (1) using
the index column, which contains a unique identifier (an integer) for each row, as the row index and (2)
setting the type for the remaining columns to str.
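For example, a minimal loading sketch along these lines (the variable names zagat and fodors are illustrative, not required by the assignment):

import pandas as pd

# Use the "index" column as the row index and treat the remaining
# columns as strings.
col_types = {"restaurant name": str, "city": str, "street address": str}
zagat = pd.read_csv("data/zagat.csv", index_col="index", dtype=col_types)
fodors = pd.read_csv("data/fodors.csv", index_col="index", dtype=col_types)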
In addition to the review data, we have also provided a training dataset of known links in a file named data/known_links.csv. This file contains 50 manually-linked pairs of restaurants. In other words, a human chose some of the rows in one dataset and determined the corresponding rows in the other dataset. The first restaurant in the pair is from the Zagat dataset and the second from the Fodor’s dataset. Rows in the known-links file correspond to these pairs, and the two columns give the indexes for a restaurant within the corresponding files. For example, the first non-header line of the known-links file is “269,386”, so the row with index 269 in the Zagat dataset is deemed to be a “match” with the row with index 386 in the Fodor’s dataset.
We’ve also provided a file, data/unmatch_pairs.csv, that contains 1000 pairs of randomly-chosen indexes, one from each review dataset. We will deem these pairs to be unmatches. Why are we using 1000 random pairs for unmatches, when we only have 50 for matches? The more pairs we have, the better the accuracy of our record linkage. Of course, having more matches would also have been helpful, but that requires expensive manual work.
Your first task is to load the data.
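A sketch of loading the pair files might look as follows. Here we assume both files have a header row, as described above, and we keep each pair as a (zagat_index, fodors_index) tuple; the variable names are illustrative.

import pandas as pd

# Each row pairs a Zagat index with a Fodor's index.
known_links = pd.read_csv("data/known_links.csv")
unmatch_pairs = pd.read_csv("data/unmatch_pairs.csv")

match_pairs = list(known_links.itertuples(index=False, name=None))
random_unmatch_pairs = list(unmatch_pairs.itertuples(index=False, name=None))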
Computing the probability of a match or an unmatch
We will explain the process of converting dataframes along with the matched and unmatched pairs into
something more abstract that can be used to help classify other restaurants as matches or unmatches in
two steps. First, we will explain how to compute similarity tuples and then how to use the resulting tuples
to estimate the probability of a match or an unmatch.
Constructing similarity tuples
In this assignment, we will compute the similarity of the name, city, and address fields separately. A natural measure of similarity for strings is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number of single-character insertions, deletions, and substitutions required to convert one string to another. The Levenshtein distance is computationally intensive, so in practice a related measure, called the Jaro-Winkler distance, is used instead.
The exact definition of the Jaro-Winkler distance is somewhat technical (see https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match (“Medici” and “Medici”) has a score of 1.0, and a complete mismatch (“Medici” and “Subway”) has a score of 0.0.
You will use the Python library jellyfish (https://github.com/jamesturk/jellyfish) to compute the Jaro-Winkler distance between two strings. It is already installed on the CS machines, but you will need to install it on your VM using the following command:
$ sudo -H pip3 install jellyfish
Specifically, we’ll use the function jellyfish.jaro_winkler to compute the Jaro-Winkler distance. To
make this discussion more concrete, here are the results of computing the Jaro-Winkler score for the
name, city, and address fields from the first matched pair (index 269 from the Zagat dataset and index 386
from the Fodor’s dataset):
In [1]: import jellyfish

In [2]: jellyfish.jaro_winkler('Ritz-Carlton Restaurant',
...: 'Restaurant Ritz-Carlton Atlanta')
Out[2]: XXXXXXXXXX

In [3]: jellyfish.jaro_winkler('Atlanta', 'Atlanta')
Out[3]: 1.0

In [4]: jellyfish.jaro_winkler('181 Peachtree St.',
...: '181 Peachtree St.')
Out[4]: 1.0
As we will see in class, the Jaro-Winkler similarity scores do not increase monotonically, and so a “cut-off” on the similarity for a single field or many fields will not serve our purpose. Instead, we break up the range of similarity values into discrete chunks, which we will call "low", "medium", and "high", and determine the probability of a match for each combination of field-wise chunks separately.
We have provided a simple utility function get_jw_category() in util.py that breaks up the range of Jaro-Winkler similarity scores into these three blocks. It takes a Jaro-Winkler distance and returns the string "low", "medium", or "high". We will apply this function to all three fields: names, cities, and addresses. Thus, for any pair of rows from the review datasets, we can create a similarity tuple (name_category, city_category, address_category) that represents the similarity of the fields for the corresponding pair of rows. Applying this process to the rows that correspond to the indexes 269 and 386 in the Zagat and Fodor’s datasets, respectively, yields the similarity tuple: ('low', 'high', 'high').
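Putting these pieces together, one possible helper for computing a similarity tuple might look like the sketch below. The name similarity_tuple is ours, not part of the assignment skeleton; it assumes the dataframes loaded earlier and the get_jw_category() function described above.

import jellyfish
import util   # provides get_jw_category(), as described above

def similarity_tuple(zagat_row, fodors_row):
    # Categorize the Jaro-Winkler score for each of the three fields.
    fields = ("restaurant name", "city", "street address")
    return tuple(util.get_jw_category(
                     jellyfish.jaro_winkler(zagat_row[f], fodors_row[f]))
                 for f in fields)

# For example, similarity_tuple(zagat.loc[269], fodors.loc[386]) should
# yield ('low', 'high', 'high'), matching the example above.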
Estimating the probabilities
Whether, during record linkage, a pair of rows should be classified as a match, unmatch, or something in between will depend on whether its (name_category, city_category, address_category) tuple was most closely associated with matches or unmatches when we trained the algorithm, and on our tolerance for error.
Specifically, we will determine whether a tuple should be classified as a match, possible match, or unmatch based on estimates of the probability that a matched pair results in that tuple as well as the probability that an unmatched pair results in that tuple. Formally, assume that r is a potential pair, t(r) is the tuple formed from its field similarities, and T is the set of all possible tuples. For every tuple w in T we need estimates for two quantities:

    m_w = P(t(r) = w | r is a match)
    u_w = P(t(r) = w | r is an unmatch)
Your task is to compute estimates for the former by iterating through all the pairs of rows corresponding to the known matches, determining their similarity tuples, and counting the frequency of each of the 27 possible similarity tuples (combinations of the three similarity levels in the three different positions) during this process. We’ll use the relative frequency of a tuple as our estimate of its probability given a matching pair. Similarly, find an estimate of the probability given an unmatch by iterating through the unmatch pairs and computing the frequency of the corresponding similarity tuples.
To make this more concrete: your second task is to construct two dictionaries: one that maps similarity tuples to match probabilities and the other that maps similarity tuples to unmatch probabilities.
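One way to build such a dictionary, reusing the similarity_tuple helper sketched earlier (again, the function and variable names are illustrative):

def tuple_probabilities(pairs, zagat, fodors):
    # Count how often each similarity tuple occurs among the given
    # (zagat_index, fodors_index) pairs, then convert counts to relative
    # frequencies. Tuples that never occur are simply absent from the
    # result (i.e., they implicitly have probability zero).
    counts = {}
    for z_idx, f_idx in pairs:
        t = similarity_tuple(zagat.loc[z_idx], fodors.loc[f_idx])
        counts[t] = counts.get(t, 0) + 1
    return {t: c / len(pairs) for t, c in counts.items()}

# match_probs = tuple_probabilities(match_pairs, zagat, fodors)
# unmatch_probs = tuple_probabilities(random_unmatch_pairs, zagat, fodors)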
Partition tuples into match, possible match, and unmatch sets
This step is tricky. It is important that you read this section extremely carefully.
The goal of this step is to compute a dictionary that maps each similarity tuple to one of "match",
"possible match", or "unmatch".
How should we decide which tuples belong in which of the three different categories: match, possible match, and unmatch? It depends on our tolerance for two kinds of errors:
false positive rate, the probability that we classify an actual unmatch as a match, and
false negative rate, the probability that we classify an actual match as an unmatch.
Assume that we are given the maximum false positive rate and the maximum false negative rate we are willing to accept; given that these constraints are satisfied, we want to maximize the number of our matches.
In order to classify which tuples should be associated
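One common Fellegi-Sunter-style approach (shown here only as a rough illustration, not necessarily the rule this assignment specifies) ranks tuples from most to least match-like and then spends the allowed false positive and false negative budgets from either end of that ranking. In the sketch below, max_fp and max_fn stand for the two maximum rates, and match_probs and unmatch_probs are the dictionaries built in the previous step.

def partition_tuples(match_probs, unmatch_probs, max_fp, max_fn):
    # Label every similarity tuple "match", "possible match", or "unmatch".
    levels = ("low", "medium", "high")
    all_tuples = [(n, c, a) for n in levels for c in levels for a in levels]
    labels = {t: "possible match" for t in all_tuples}

    def m(t):
        return match_probs.get(t, 0.0)

    def u(t):
        return unmatch_probs.get(t, 0.0)

    # Tuples never seen in either training set stay "possible match".
    seen = [t for t in all_tuples if m(t) > 0 or u(t) > 0]
    # Most match-like first: tuples seen only among matches, then by m/u ratio.
    ordered = sorted(seen,
                     key=lambda t: (u(t) == 0, m(t) / u(t) if u(t) else 0.0),
                     reverse=True)

    fp = 0.0   # unmatch probability spent on "match" labels
    i = 0
    while i < len(ordered) and fp + u(ordered[i]) <= max_fp:
        labels[ordered[i]] = "match"
        fp += u(ordered[i])
        i += 1

    fn = 0.0   # match probability spent on "unmatch" labels
    j = len(ordered) - 1
    while j >= i and fn + m(ordered[j]) <= max_fn:
        labels[ordered[j]] = "unmatch"
        fn += m(ordered[j])
        j -= 1

    return labels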