IIMC_21.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"# Text Mining\n",
"\n",
"#### Automated Understanding of Text\n",
"\n",
" XXXXXXXXXXn",
"_Authors: Carleton Smith_\n",
"\n",
"## Project Guide\n",
"\n",
"- [Introducing the Amazon Review Dataset](#Introducing-the-Amazon-Review-Dataset)\n",
"- [Counting Positive/Negative Words](#Counting-Positive/Negative-Words)\n",
"- [Sentiment Intensity](#Sentiment-Intensity)\n",
"- [LDA Topics](#LDA-Topics)\n",
"- [Review Scores](#Review-Scores)\n",
"\n",
"\n",
"## Project Overview\n",
"\n",
"--------------- XXXXXXXXXXn",
"#### EXPECTED TIME: 1.5 HRS\n",
"\n",
"The lectures this week covered a large amount of material. As should be apparent, text mining offers many avenues for investigation. This assignment will focus on how to create a couple different features from a text document. In particular, activities will include: \n",
"\n",
"- Picking out positive and negative words\n",
"- Calculating sentiment scores\n",
"- Creating \"topics\" with LDA\n",
"\n",
"# VERY IMPORTANT: READ BELOW\n",
"\n",
"**If you recieve an e
or when trying to run the `imports` cell, go to the top of the screen; select `Kernel` on the tool bar, go down to `Change kernel`, and select `Python 3.5`**\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"### Introducing the Amazon Review Dataset\n",
"\n",
"**DATA CITATION**\n",
"\n",
" Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering\n",
" R. He, J. McAuley\n",
" WWW, 2016\n",
" \n",
" http:
jmcauley.ucsd.edu/data/amazon/\n",
" \n",
"The data today is a collection of reviews of outdoor products from `Amazon.com`. The full data-set includes many features: \n",
"**DATA DICTIONARY**\n",
"\n",
"1. `reviewerID` - ID of the reviewer, e.g. A2SUAM1J3GNN3B\n",
"2. `asin` - ID of the product, e.g XXXXXXXXXXn",
"3. `reviewerName` - name of the reviewer\n",
"4. `helpful` - helpfulness rating of the review, e.g. 2/3\n",
"5. `reviewText` - text of the review\n",
"6. `overall` - rating of the product\n",
"7. `summary` - summary of the review\n",
"8. `unixReviewTime` - time of the review (unix time)\n",
"9. `reviewTime` - time of the review (raw)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import nltk\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"**READ IN THE DATA**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [],
"source": [
"data_path = '..
esource/asnli
publicdata
eviews_Sports_and_Outdoors_5.json.gz.voc'\n",
"reviews = pd.read_json(data_path, lines=True, compression='gzip')\n",
"print(\"Shape: \", reviews.shape)\n",
"reviews.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"**PREPARE DATASET** \n",
"\n",
"However, we will only be using a portion of this data; much of the provided data is auxilliary to our text-mining purposes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [],
"source": [
"# Drop unnecessary columns\n",
"cols_to_keep = ['overall', 'reviewText']\n",
"reviews = reviews.loc[:,cols_to_keep]\n",
"\n",
"# Take a sample of 20,000 5-star reviews (since they are majority)\n",
"five_star_sample = reviews.loc[reviews['overall'] == 5,:].sample(20000, random_state=24)\n",
"\n",
"# Grab the ~19,000+ reviews of 1 and 2 stars\n",
"one_and_two_stars = reviews.loc[reviews['overall'].isin([1,2]),:]\n",
"\n",
"# Display first 5 entries 5-star and low-star corpora \n",
"five_star_corpus = list(five_star_sample['reviewText'])\n",
"low_star_corpus = list(one_and_two_stars['reviewText'])\n",
"print(five_star_corpus[:5], \"\\n\\n\")\n",
"print(low_star_corpus[:5])"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"At This point there are two \"corpora\" one, a list of review text from 5-star reviews, and the other a list of review text from 1/2 star reviews. \n",
"\n",
"Of course we would expect significant difference betweeen the text of 1/2-star reviews and 5-star reviews. This is by design -- we want to see exactly how the reviews look different. "
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"### Counting Positive/Negative Words\n",
"\n",
"Below is the `\"get_words()\"` function used in lecture, along with the calls that will collect the lists of positive and negative words. \n",
"\n",
"Below that is the function `\"count_pos_and_neg()\"`. \n",
"\n",
"`count_pos_and_neg()` functionalizes the counting of positive and negative words demonstrated in lecture for the restaurants \"Community\" and \"Le Monde\". \n",
"\n",
"Finally, `\"count_pos_and_neg()\"` is used, on our positive/negative word lists (to see the cross-over)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [],
"source": [
"def get_words(file):\n",
" import requests\n",
Answered 3 days After Jun 29, 2021

Solution

Atal Behari answered on Jul 02 2021
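The notebook loads the reviews with `pd.read_json(data_path, lines=True, compression='gzip')`, which means the file is gzip-compressed and holds one JSON object per line. As a rough illustration of that layout only, the sketch below reads such a file with the standard library and keeps just the `overall` and `reviewText` fields. The helper name `read_json_lines_gz` and the two sample rows are made up for this example; neither comes from the assignment.

```python
import gzip
import json
import os
import tempfile


def read_json_lines_gz(path, keep=("overall", "reviewText")):
    """Read a gzip-compressed JSON-lines file, keeping only the named fields.

    Hypothetical helper for illustration; the notebook itself uses pd.read_json.
    """
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            row = json.loads(line)  # one JSON object per line
            records.append({key: row.get(key) for key in keep})
    return records


# Tiny made-up file standing in for the real review data.
sample_rows = [
    {"reviewerID": "A1", "overall": 5, "reviewText": "Great tent, very sturdy."},
    {"reviewerID": "A2", "overall": 1, "reviewText": "Broke on the first trip."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_reviews.json.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        for row in sample_rows:
            handle.write(json.dumps(row) + "\n")
    demo_records = read_json_lines_gz(demo_path)

print(demo_records)
```

For the full dataset the pandas call in the notebook is the more convenient route, since it returns a DataFrame directly.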
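The "Counting Positive/Negative Words" section describes `count_pos_and_neg()`, but its body is not shown in the question. The sketch below is one plausible way to do that kind of counting, not the notebook's actual implementation. The function name `count_pos_neg_sketch`, its signature, the tokenization rule, and the tiny word lists are all assumptions made for illustration.

```python
import re
from collections import Counter


def count_pos_neg_sketch(text, positive_words, negative_words):
    """Return (positive_count, negative_count) for one piece of text.

    Hypothetical stand-in for the notebook's count_pos_and_neg(); the real
    function may tokenize or report its results differently.
    """
    # Lowercase, then keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for missing keys, so absent words add nothing.
    pos_total = sum(counts[word] for word in positive_words)
    neg_total = sum(counts[word] for word in negative_words)
    return pos_total, neg_total


# Made-up word lists; the notebook obtains its real lists via get_words().
demo_positive = {"great", "sturdy", "love"}
demo_negative = {"broke", "cheap", "poor"}

demo_review = "Great tent, great price, but the zipper broke."
demo_counts = count_pos_neg_sketch(demo_review, demo_positive, demo_negative)
print(demo_counts)
```

Applied to each review in `five_star_corpus` and `low_star_corpus`, a function like this yields two numeric features per review that can then be compared across the two groups.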
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"# Text Mining\n",
"\n",
"#### Automated Understanding of Text\n",
"\n",
"-----------\n",
"_Authors: Carleton Smith_\n",
"\n",
"## Project Guide\n",
"\n",
"- [Introducing the Amazon Review Dataset](#Introducing-the-Amazon-Review-Dataset)\n",
"- [Counting Positive/Negative Words](#Counting-Positive/Negative-Words)\n",
"- [Sentiment Intensity](#Sentiment-Intensity)\n",
"- [LDA Topics](#LDA-Topics)\n",
"- [Review Scores](#Review-Scores)\n",
"\n",
"\n",
"## Project Overview\n",
"\n",
"----------------------------------\n",
"#### EXPECTED TIME: 1.5 HRS\n",
"\n",
"The lectures this week covered a large amount of material. As should be apparent, text mining offers\n",
"many avenues for investigation. This assignment will focus on how to create a couple different\n",
"features from a text document. In particular, activities will include:\n",
"\n",
"- Picking out positive and negative words\n",
"- Calculating sentiment scores\n",
"- Creating \"topics\" with LDA\n",
"\n",
"# VERY IMPORTANT: READ BELOW\n",
"\n",
"**If you recieve an e
or when trying to run the `imports` cell, go to the top of the screen;\n",
"select `Kernel` on the tool bar, go down to `Change kernel`, and select `Python 3.5`**\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"### Introducing the Amazon Review Dataset\n",
"\n",
"**DATA CITATION**\n",
"\n",
" Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering\n",
" R. He, J. McAuley\n",
" WWW, 2016\n",
" \n",
" http:
jmcauley.ucsd.edu/data/amazon/\n",
" \n",
"The data today is a collection of reviews of outdoor products from `Amazon.com`. The full data-set\n",
"includes many features:\n",
"**DATA DICTIONARY**\n",
"\n",
"1. `reviewerID` - ID of the reviewer, e.g. A2SUAM1J3GNN3B\n",
"2. `asin` - ID of the product, e.g. 0000013714\n",
"3. `reviewerName` - name of the reviewer\n",
"4. `helpful` - helpfulness rating of the review, e.g. 2/3\n",
"5. `reviewText` - text of the review\n",
"6. `overall` - rating of the product\n",
"7. `summary` - summary of the review\n",
"8. `unixReviewTime` - time of the review (unix time)\n",
"9. `reviewTime` - time of the review (raw)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [],
"source": [
"import nbconvert\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import nltk\n",
"%matplotlib inline"
]
},
{
"cell_type": "markdown",
"metadata": {
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"source": [
"**READ IN THE DATA**"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true,
"deletable": false,
"editable": false,
"nbgrader": {
"grade": false,
"locked": true,
"solution": false
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Shape: (296337, 9)\n"
]
},
{
"data": {
"text/html": [
"
\n",
"