1/13/2020 Course Search Engine: Crawling the catalog

Question

Course Search Engine: Crawling the catalog

Due: January 16th at 6pm

You must work alone on this assignment.

Over the course of the next two assignments, you will build parts of a course search engine. We have put together a simple web application that will use your code to do the work of finding courses that match the user’s criteria.

[Screenshot of the web application: https://www.classes.cs.uchicago.edu/archive/2020/winter/30122-1/_images/screenshot.png]

In this part of the assignment, you will build a web crawler that crawls a shadow copy of the college catalog to construct a simple index. The purpose of this assignment is to give you more Python programming experience and to have you work with HTML documents extracted from the web. You will use the index in PA #2.

Getting started

We have seeded your repository with a directory for this assignment. To pick it up, change to your cs10122-win-20-username directory (where the string username should be replaced with your username), run git pull to make sure that your local copy of the repository is in sync with the server, and then run git pull upstream master to pick up the distribution.

Before describing the specifics of your task, we will briefly explain how to work with URLs and grab pages from the web.

Working with URLs

URL stands for uniform resource locator. A URL, for example:

    http://www.classes.cs.uchicago.edu/archive/2005/winter/10122/pa/index.html

has the following format:

    protocol://site address/path/filename

The protocol field (http, in our example) specifies which protocol should be used to interact with the resource. Common protocols include http, https, and ftp. We will be working with http and https, the hypertext transport protocols. The site address specifies the host computer name (www), the domain name (classes.cs.uchicago), and the top level domain (edu). The path specifies part of the hierarchical location (archive/2005/winter/10122) of the desired file (index.html) in the file system on the host computer. The path to the archive directory is supplied by the configuration of the web server program running on the host machine. The full path in the department’s file system for the example above is: /stage/classes/archive/2005/winter/10122/pa/index.html. URLs can also include a port number, which we will not be using.

Within a web page, a link can refer to an absolute URL (that is, it starts with http:// or https://) or a relative URL, which can be converted into an absolute URL. A relative URL refers to a page with a location that is relative to the current page (much like relative path names in Linux). For example, here is a URL taken from the index page for assignments mentioned above:

    pa1/index.html

The absolute URL for this page is:

    http://www.classes.cs.uchicago.edu/archive/2005/winter/10122/pa/pa1/index.html

URLs can use a fragment (denoted by a #) to refer to a specific location (known as an anchor) on a page. For our purposes, we only need the part of the URL that comes before the #.

We have provided some useful functions for working with URLs:

util.is_absolute_url(url):
    Takes a URL and returns true if it is an absolute URL and false otherwise.

util.convert_if_relative_url(url1, url2):
    Takes the URL for a page (url1) and a URL found on that page (url2). If url2 is an absolute URL, then this function will just return it. If url2 is a relative URL, the function will return an equivalent absolute URL. The function will return None if it is unable to convert the URL.

    The following code, for example:

        url1 = "http://www.classes.cs.uchicago.edu/archive/2005/winter/10122/pa/index.html"
        url2 = "pa1/index.html"
        util.convert_if_relative_url(url1, url2)

    would yield:

        http://www.classes.cs.uchicago.edu/archive/2005/winter/10122/pa/pa1/index.html

    as expected.

util.remove_fragment(url):
    Takes a URL and removes the # and any text following it in the URL. For example, given the relative URL index.html#minorprogramincomputerscience, the function would return index.html. If the URL does not contain a fragment, the function just returns the original URL.

Grabbing a web page

We will be using requests, a Python package, for making http connections and retrieving documents. We have supplied the following wrapper functions:

util.get_request(url):
    Takes a URL and returns a request object. This function will return None if the request fails.

util.read_request(request):
    Takes a request object and returns a string containing the HTML document retrieved by the request. This function may generate a warning of the form:

        WARNING:root:Some characters could not be decoded, and were replaced with REPLACE

    You may ignore this warning.

util.get_request_url(request):
    Takes a request object and returns the associated URL. Note that the returned URL may be different than the URL provided to the original call to get_request. This seeming anomaly occurs when the original URL redirects to another URL.

Note that your crawler should check for failures (that is, check whether the output of any of these functions is None before proceeding further).

Note also that even though the recent lab used a different library to download web pages, for this assignment you should use the functions above.

Catalog pages

For this assignment, we are interested in three types of HTML tags: a, div, and p. We will use the first to find links to other pages, the second to find sections of the page pertinent to a particular course, and the third to extract actual course titles and descriptions from within the div tags.

For example, here are three links from a page in the course catalog:

    <a href="http://college.uchicago.edu">The University of Chicago</a>
    <a href="../../azindex/index.html">AZ Index</a>
    <a href="index.html#minorprogramincomputerscience">...</a>

The first specifies a link using an absolute URL. The second specifies a link using a relative URL. And the third specifies a link using a relative URL with a fragment.

Course descriptions are enclosed in a div tag of the form <div ...>. For example, here is the entry for Software Construction:

    <div ...>
      <p ...><strong>CMSC  XXXXXXXXXX Software Construction.  100 Units.</strong></p>
      <p ...>
        Large software systems are difficult to build. The course discusses
        ...
        will be expected to actively participate in team projects in this
        course.
      </p>
      <p ...>
        Instructor(s): S. Lu
        Terms Offered: Autumn
        Prerequisite(s): CMSC 15400
      </p>
    </div>

Course titles and descriptions can be found in p tags nested within the div tag.

You will construct an index that maps words to lists of course identifiers. A course identifier is an integer that uniquely identifies a specific course code (“CMSC 12200”, for example). We will provide a dictionary that maps course codes to course identifiers. Course codes can be found at the beginning of each course title.

The string &#160; represents a single HTML character entity (non-breaking space). Certain reserved characters in HTML must be replaced with character entities. We saw in lecture that in order to include a < as text, we needed to use the syntax &lt; (i.e., < is part of defining a tag). In this assignment, you only need to worry about the &#160; entity.

However, the beautifulsoup4 .text attribute will convert the &#160; into a Unicode non-breaking space character for you. You will only need to replace it with a regular space when you construct a course code.

You might wonder why we need both course codes and course identifiers. In the next assignment, you will be using a relational database that will contain an index like the one you are constructing along with information that we have scraped from the UChicago time schedules. The course identifiers will be used in this database to link different types of information about a course, such as the title, section dates, times, and locations for a given quarter.

Sequences

If a group of courses forms a sequence, the UChicago course catalog lists the course title and description for both the sequence (<div ...>) and the individual courses in the sequence (<div ...>). See below for an example:

    <div ...>
      <p ...><strong>CMSC  XXXXXXXXXX.  Computer Science with Applications I-II-III.</strong></p>
      <p ...>
        This three-quarter sequence teaches computational thinking and
        skills to students who are majoring in the sciences, mathematics,
        and economics. ... Students learn Java, Python, R and C++.
      </p>
    </div>
    <div ...>
      <p ...><strong>CMSC 12100.  Computer Science with Applications I.  100 Units.</strong></p>
      <p ...></p>
      <p ...>
        Instructor(s): A. Rogers
        Terms Offered: Autumn
        Prerequisite(s): Placement into MATH 15200 or higher, or consent of instructor
        Note(s): This course meets the general education requirement in the mathematical
      </p>
    </div>
    <div ...>
      <p ...><strong>CMSC 12200.  Computer Science with Applications II.  100 Units.</strong></p>
      <p ...></p>
      <p ...>
        Instructor(s): A. Rogers
        Terms Offered: Winter
        Prerequisite(s): CMSC 12100
        Note(s): This course meets the general education requirement in the
        mathematical sciences.
      </p>
    </div>
    <div ...>
      <p ...><strong>CMSC 12300.  Computer Science with Applications III.  100 Units.</strong>
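The URL handling described under "Working with URLs" can be approximated with Python's standard library alone. The sketch below is not the course's util module: is_absolute_url and normalize_link are hypothetical stand-ins written for illustration, and the example.edu base URL is a placeholder. It resolves a link against the page it was found on and drops any fragment, which is roughly what a crawler would do before queueing a link.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def is_absolute_url(url):
    """Return True if url has both a scheme and a host.

    Hypothetical stand-in for the assignment's util.is_absolute_url.
    """
    parts = urlparse(url)
    return bool(parts.scheme) and bool(parts.netloc)


def normalize_link(page_url, link):
    """Resolve link against page_url and strip any #fragment.

    Returns None when the result is not an http or https URL,
    mirroring the "return None on failure" convention in the text.
    """
    absolute = link if is_absolute_url(link) else urljoin(page_url, link)
    absolute, _fragment = urldefrag(absolute)
    if urlparse(absolute).scheme not in ("http", "https"):
        return None
    return absolute


# Placeholder page URL (an assumption, not taken from the assignment).
page = "http://example.edu/archive/pa/index.html"

# Relative links like the ones quoted in the assignment text.
print(normalize_link(page, "pa1/index.html"))
print(normalize_link(page, "../../azindex/index.html"))
print(normalize_link(page, "index.html#minorprogramincomputerscience"))
print(normalize_link(page, "mailto:someone@example.edu"))
```

Returning None for anything that is not http or https keeps the caller's failure check uniform: every result is tested against None before the crawler proceeds.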
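The index described under "Catalog pages" (words mapped to lists of course identifiers) can be sketched without any HTML parsing, starting from title and description strings as beautifulsoup4's .text would return them. Everything named here is an assumption made for illustration: COURSE_IDS holds made-up identifiers in place of the dictionary the assignment provides, and the word rule in WORD_RE (a letter followed by letters or digits) is a guess, since this excerpt does not define what counts as a word.

```python
import re

# Made-up identifiers; the assignment supplies the real code-to-identifier dictionary.
COURSE_IDS = {"CMSC 12100": 1, "CMSC 12200": 2}

# Assumed word rule: starts with a letter, then letters or digits.
WORD_RE = re.compile(r"[a-z][a-z0-9]*")


def course_code(title):
    """Extract the course code from the start of a course title.

    Titles read through .text contain non-breaking spaces (U+00A0),
    so those are replaced with regular spaces first.
    """
    cleaned = title.replace("\xa0", " ")
    return cleaned.split(".")[0].strip()


def add_course(index, title, description, course_ids):
    """Record each word of the title and description under the course's identifier."""
    code = course_code(title)
    if code not in course_ids:
        return  # unknown course code: skip rather than guess an identifier
    ident = course_ids[code]
    text = (title + " " + description).replace("\xa0", " ").lower()
    for word in set(WORD_RE.findall(text)):
        ids = index.setdefault(word, [])
        if ident not in ids:
            ids.append(ident)


index = {}
add_course(
    index,
    "CMSC\xa012100.\xa0\xa0Computer Science with Applications I.\xa0\xa0100 Units.",
    "Instructor(s): A. Rogers Terms Offered: Autumn",
    COURSE_IDS,
)
add_course(
    index,
    "CMSC\xa012200.\xa0\xa0Computer Science with Applications II.\xa0\xa0100 Units.",
    "Prerequisite(s): CMSC 12100",
    COURSE_IDS,
)
print(index["applications"])
print(index["autumn"])
```

The membership check before appending means a course indexed twice, for instance once from its own text and once from a sequence description, still appears only once per word.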

Prasun Kumar · Accepted Answer

Download the solution from https://bit.ly/EdSolution. Unzip the file and follow the instructions given in the report.pdf file.

