CIS 3120 FINAL REVIEW SHEET
Format:
50 multiple choice questions (Randomized)
2 hours to complete it
Blackboard
See the GitHub repository for code snippets:
https://github.com/avinashjairam/avinashjairam.github.io
25 Questions Will Be From the following:
Web Scraping + Data Frames - 60%
Data Analytics (Up to the Midterm) + Numpy - 40%
Data Analytics
- Difference between data and information
- Why is there an increased interest in data analytics over the last 5-10 years?
- The demand for data analysts will increase tremendously over the next decade
- Define Big Data
- Explain why data is referred to as the “new oil”
- What are some required skills and knowledge necessary to get started in data analytics?
- Why is Python so popular, especially in data analytics?
- What are the complex data structures in Python?
- What is data analysis? (See lecture 5)
- 5 significant steps for data analysis (See lecture 6)
Web Scraping
- Define web scraping
- Why is web scraping necessary? For example, in which situations would you use web
scraping?
- Difference between modern day web scraping and that in the 60s-80s?
- Challenges to modern data gathering
- Solutions to the modern data gathering problem
- List some popular APIs
https://github.com/avinashjairam/avinashjairam.github.io
- What to do before web scraping?
- Dangers of web scraping
- Ethics of web scraping
- Challenges to web scraping
- What are some popular web scraping applications?
- 2 major entities on WWW
- What is HTTP and how does it work? E.g. the client sends a request to the server, and the server responds
accordingly
- HTML - need to know how to build a basic html page
- Know how IDs and Classes are used
- What is the webbrowser module used for?
- Be familiar with the requests module: how does it work, and what is it used for?
- Know how to make a request to a webpage and print the HTML returned by the server.
- What is a get request?
- What is beautiful soup?
- How to scrape a website with beautiful soup? The methods, steps involved?
- How to install beautiful soup?
- How to create a beautiful soup object?
- What does prettify() do?
- Be familiar with the server response codes: 200 and 404, and codes starting with a 4 or 5
- All Beautiful Soup methods and attributes, e.g. .children, find(), find_all(), get_text()
- Be able to do basic scraping from a web page
- Exporting data from a webpage to a csv file (also covered in dataframes)
- Understand how HTML can be organized into a “tree”. E.g. parent tag, child tag, sibling
tag, etc.
- Know how to scrape an HTML table. Be familiar with the example done in class
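The Beautiful Soup items above can be practiced with a minimal sketch like the one below. It is not the example done in class: the HTML string, the ids, and the class names are made up for illustration, and the page is parsed from a string so that no network request is needed. In practice the HTML would usually come from something like requests.get(url).text.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="main-title">Course Grades</h1>
    <p class="note">First note</p>
    <p class="note">Second note</p>
    <table id="grades">
      <tr><th>Name</th><th>Score</th></tr>
      <tr><td>Ana</td><td>90</td></tr>
      <tr><td>Ben</td><td>85</td></tr>
    </table>
  </body>
</html>
"""

# Create the Beautiful Soup object with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match; here we look a tag up by its id.
title_text = soup.find(id="main-title").get_text()

# find_all() returns every match; class_ filters on the CSS class.
notes = [p.get_text() for p in soup.find_all("p", class_="note")]

# Scrape the table: one list of cell texts per <tr> row.
rows = []
for tr in soup.find(id="grades").find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# Export the rows as CSV (written to an in-memory buffer instead of a file).
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
```

Calling print(soup.prettify()) on the same object shows the parsed HTML as an indented tree, which helps when working out parent, child, and sibling tags.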
Data Frames
- What is a data frame?
- What is a series?
- How to create a series?
- What is pandas?
- How to install pandas?
- Why use pandas?
- How to add columns to a dataframe using lists?
- How to create a pandas dataframe from a series?
- Exporting data to csv file from a pandas data frame
- Be familiar with all the essential pandas methods. All pandas methods listed in lecture #5
- Know how to do all the operations here:
https://github.com/avinashjairam/avinashjairam.github.io/blob/master/Basic_Pandas.ipynb
- Differences between pandas and numpy operations: performance. In which situations would it
be ideal to use numpy, pandas, etc.?
https://github.com/avinashjairam/avinashjairam.github.io/blob/master/Basic_Pandas.ipynb
- How to set the max number of rows/columns in a jupyter notebook
- Merge two data frames vertically
- info(), iloc[], loc[], describe() (note that iloc and loc are indexers used with square brackets)
- Given a dataframe, select rows based on indexes, row positions.
- List all columns in a dataframe
- Sort a dataframe based on a column
- Select particular columns
- What is an index?
- How to set the index of a dataframe? (temporarily vs permanently)
- List all indexes of a dataframe
- How to look up a record by its index?
- Sort a dataframe by index
- Select elements from a dataframe by filtering, e.g. by a filter mask
- Using a filter mask with loc
- Filtering on multiple conditions
- Negating a filter
- Modifying data with a dataframe: how to rename a column; editing data in a row
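Several of the data frame operations listed above can be sketched in a few lines. This is an illustrative example, not the course notebook: the column names and values are invented, and the CSV is written to an in-memory buffer rather than to a file.

```python
import io

import pandas as pd

# A Series is a one-dimensional labeled array.
scores = pd.Series([90, 85, 70], name="score")

# Build a data frame, then add columns from a Series and from a list.
df = pd.DataFrame({"name": ["Ana", "Ben", "Cam"]})
df["score"] = scores
df["major"] = ["CIS", "FIN", "CIS"]

# set_index returns a new frame, so df keeps its original index
# (reassigning the result, or inplace=True, makes the change permanent).
by_name = df.set_index("name")
ben_score = by_name.loc["Ben", "score"]  # loc: lookup by index label
first_row = df.iloc[0]                   # iloc: lookup by row position

# Filter mask: a boolean Series used to select rows.
mask = df["score"] >= 80
high = df.loc[mask]

# Multiple conditions use & (and) or | (or), each wrapped in parentheses.
cis_high = df[(df["major"] == "CIS") & (df["score"] >= 80)]

# ~ negates a filter.
not_cis = df[~(df["major"] == "CIS")]

# Sort by a column and rename a column (both return new frames).
ranked = df.sort_values("score", ascending=False)
renamed = df.rename(columns={"score": "final_score"})

# Export to CSV without writing the index column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

list(df.columns) lists the columns, df.index lists the indexes, and df.info() and df.describe() summarize the frame.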
Numpy
- What is numpy?
- What is numpy used for?
- How to install numpy?
- Difference between a Python list and a numpy array?
- What is an array?
- Characteristics of an array: arrays have rank and shape. Define both terms
- Know how to create a numpy array from a list of lists
- Know how to print elements from a numpy array
- Know how to create a numpy array filled with 0s or filled with 1s
- Create a 2D array (think of it as a matrix) of ‘x’ and ‘y’ dimensions filled with random
values
- Common numpy methods, e.g. sum, sort, dot, multiply
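The numpy items above can be reviewed with a short sketch. The array values are arbitrary and chosen only so the results are easy to check by hand.

```python
import numpy as np

# Create a 2D array from a list of lists.
a = np.array([[1, 2, 3], [4, 5, 6]])

rank = a.ndim      # number of dimensions (the "rank"): 2
shape = a.shape    # size along each dimension: (2, 3)
element = a[1, 2]  # row 1, column 2 (zero-based): 6

# Arrays filled with 0s, 1s, or random values in [0, 1).
zeros = np.zeros((2, 3))
ones = np.ones((3, 2))
random_values = np.random.rand(3, 4)

# Common operations.
total = a.sum()                               # sum of every element
column_sums = a.sum(axis=0)                   # sum down each column
sorted_values = np.sort(np.array([3, 1, 2]))  # returns a sorted copy
elementwise = np.multiply(a, a)               # element-by-element product
matrix_product = np.dot(a, ones)              # (2, 3) dot (3, 2) gives (2, 2)
```

Note the difference between multiply, which works element by element and needs compatible shapes, and dot, which performs matrix multiplication.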
25 Questions Will Be From the following:
Relational Databases (5 Questions)
- What is a relational database?
- What is a table?
- Characteristics of a Relational Table
- What is a Null Value? When does a Null value occur?
- List Keys (All of them)
- Define the keys
- Know when to use which key. E.g. when would you use a primary key?
- Given a table, identify which keys should be the primary key, foreign key, composite
keys, superkeys, etc.
- Identify the alternate key
- Why are foreign keys helpful?
- Define Referential Integrity
- Why is Referential Integrity important?
- What can happen if a DB doesn’t enforce referential integrity?
- Know what a schema diagram is
- Label a schema diagram
- Define SQL (Structured Query Language)
- Know what a Select command in SQL does
- Use a select command to select rows and columns from a table
- Know all the types of joins and what each one returns
- Match Venn Diagrams with the appropriate joins
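To practice the keys, SELECT, and join items above, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The table and column names (department, student, dept_id) are hypothetical and not taken from the lectures. SQLite only enforces foreign keys after the PRAGMA shown below is turned on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enable referential integrity checks

# dept_id is the primary key of department and a foreign key in student.
conn.execute(
    "CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        dept_id INTEGER REFERENCES department(dept_id)
    )
    """
)
conn.executemany("INSERT INTO department VALUES (?, ?)", [(1, "CIS"), (2, "Finance")])
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?)",
    [(10, "Ana", 1), (11, "Ben", 2), (12, "Cam", None)],  # Cam's dept_id is NULL
)

# SELECT chooses columns; WHERE chooses rows.
cis_names = conn.execute("SELECT name FROM student WHERE dept_id = 1").fetchall()

# INNER JOIN keeps only rows that match in both tables.
inner = conn.execute(
    "SELECT s.name, d.dept_name FROM student s "
    "INNER JOIN department d ON s.dept_id = d.dept_id ORDER BY s.student_id"
).fetchall()

# LEFT JOIN keeps every student, with NULL where no department matches.
left = conn.execute(
    "SELECT s.name, d.dept_name FROM student s "
    "LEFT JOIN department d ON s.dept_id = d.dept_id ORDER BY s.student_id"
).fetchall()

# Referential integrity: a dept_id that does not exist in department is rejected.
try:
    conn.execute("INSERT INTO student VALUES (13, 'Dee', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

In Venn diagram terms, the inner join is the overlap of the two circles, and the left join is the whole left circle.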
Data Visualization (5 questions)
- Define data visualization and state its goals
- Why is data visualization important?
- Explain the quantitative and qualitative questions asked in data visualization.
- Explain spatial and non-spatial data
- Identify how data is collected
- What factors should be considered when building a visualization model?
- Relationship between humans and visualizations in the decision making process
- When is a visualization not needed?
- What are the possibilities/benefits of having a good visualization model?
- Why use an external representation?
- Why represent all of the data?
- Why focus on tasks and effectiveness?
- Why are there resource limitations?
- Define what, why, and how?
- What are the 3 themes of visual analytics?
- Challenges to data visualization
- What are examples of data visualization tasks?
- What are some libraries used for data visualization?
- Using matplotlib - plot line, bar, scatter, histogram plots
- How to add titles and axis labels to a plot (x-axis and y-axis labels)
- Using pandas- plot line, bar, scatter, pie plots/charts
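The matplotlib plot types listed above can be sketched on one figure. The data values are invented, and the Agg backend is selected only so the script runs without a display; in a Jupyter notebook that line is usually not needed.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 15, 30]

# One figure with a 2 x 2 grid of axes, one plot type per cell.
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

axes[0, 0].plot(x, y)      # line plot
axes[0, 0].set_title("Line")

axes[0, 1].bar(x, y)       # bar plot
axes[0, 1].set_title("Bar")

axes[1, 0].scatter(x, y)   # scatter plot
axes[1, 0].set_title("Scatter")

# Axis labels for the three x-versus-y plots.
for ax in (axes[0, 0], axes[0, 1], axes[1, 0]):
    ax.set_xlabel("x value")
    ax.set_ylabel("y value")

axes[1, 1].hist(y, bins=3)  # histogram of the y values
axes[1, 1].set_title("Histogram")
axes[1, 1].set_xlabel("y value")
axes[1, 1].set_ylabel("count")

fig.tight_layout()
```

With the plt interface instead of axes objects, the equivalent calls are plt.title(), plt.xlabel(), and plt.ylabel(). A pandas data frame offers similar plots through df.plot(kind="line"), and likewise for "bar", "scatter", and "pie".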
APIs (10 questions)
- Define what an API is
- What is an interface?
- How does an API work?
- How are web based APIs accessed?
- What formats are API data returned in?
- Method/Types of API authorization/ API security
- Label API requests diagram
- How do you learn how to send the proper request to a particular API?
- Know the difference in how web servers and APIs respond to requests
- Why use an API instead of web scraping?
- Restrictions when using an api
- Add parameters to a URL to make an API call
- Parse JSON
- Label diagram - interacting with an API
- Know different status codes
- What is JSON and what is it used for?
- What is the purpose of json.dumps(), json.loads()?
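Adding parameters to a URL and parsing JSON can be sketched with the standard library alone. The endpoint, parameter names, and response text below are hypothetical; a real API's documentation specifies the actual URL, parameters, and authorization. No request is sent here.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, for illustration only.
base_url = "https://api.example.com/v1/weather"
params = {"city": "New York", "units": "metric"}

# Parameters go after "?" as key=value pairs joined by "&".
request_url = base_url + "?" + urlencode(params)

# A made-up response body, standing in for what an API might return.
response_text = '{"city": "New York", "temps": [21.5, 23.0], "ok": true}'

# json.loads: JSON string -> Python objects (dict, list, float, bool, ...).
data = json.loads(response_text)
first_temp = data["temps"][0]

# json.dumps: Python objects -> JSON string (indent makes it readable).
round_trip = json.dumps(data, indent=2)
```

With the requests module, the same parameters can be passed as requests.get(base_url, params=params), and response.json() parses the body in one step.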
Pandas and Numpy (5 questions)
- Why is data analytics important in marketing?
- Join two data frames vertically using pd.concat(), ignoring indexes
- Joining two data frames horizontally
- Why is it useful to keep the indexes when joining horizontally?
- Why is merging data frames useful
- What are the types of merges?
- Know how to perform all types of merges between two data frames
- How to merge two data frames?
- Dealing with missing data in a data frame
- Know how to use dropna, fillna
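The joining, merging, and missing-data items above can be sketched as follows. The frames and column names are invented for illustration, with a marketing-style customer and spend layout chosen only as an assumption.

```python
import numpy as np
import pandas as pd

jan = pd.DataFrame({"customer": ["Ana", "Ben"], "spend": [100.0, 50.0]})
feb = pd.DataFrame({"customer": ["Cam", "Dee"], "spend": [75.0, np.nan]})

# Vertical join: stack rows; ignore_index renumbers the result 0..n-1.
stacked = pd.concat([jan, feb], ignore_index=True)

# Horizontal join: axis=1 places frames side by side, aligned on the index.
side_by_side = pd.concat([jan, feb], axis=1)

# Merging combines frames on a shared key column, like a SQL join.
regions = pd.DataFrame(
    {"customer": ["Ana", "Cam", "Eve"], "region": ["NY", "NJ", "CT"]}
)
inner = pd.merge(stacked, regions, on="customer", how="inner")  # keys in both
left = pd.merge(stacked, regions, on="customer", how="left")    # all of stacked
outer = pd.merge(stacked, regions, on="customer", how="outer")  # keys in either

# Missing data: drop rows containing NaN, or fill NaN with a chosen value.
dropped = stacked.dropna()
filled = stacked.fillna({"spend": 0.0})
```

how="right" is the mirror image of how="left" and keeps every row of the second frame.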