MIST.3050 Programming Assignment: Basis Text Analysis

Question

MIST.3050
Programming Assignment: Basis Text Analysis
This assignment will give you the opportunity to develop a program that analyzes text by
recognizing and counting the words of different sentiments. The program can be useful in several
application scenarios. For example, in your cover letter for job applications, you may want to
avoid using negative words – you can use the program to identify the negative words, then revise
your letter to avoid them. You may also use the program to count positive and negative words in
online reviews, then use these counts as an alternative to star ratings in a predictive model that
you need to build.
1. Background
Words such as “able” and “joy” often have a positive sentiment. And other words such as
“novice” and “revoke” often have a negative sentiment. Linguists and text analysis researchers
have categorized words and documented their results as dictionaries. Several dictionaries are
used in practice. The following are two popular ones:
(1) General Inquirer Harvard IV-4 Dictionary (IV-4). Originally developed by Professor Philip
Stone at Harvard University for applications in psychology and sociology, the dictionary has
been widely used in many areas. The dictionary has 11,888 entries and more than 180 categories,
including different categories developed by different researchers. Additional information about
this dictionary can be found at http://www.wjh.harvard.edu/~inquire/. Spreadsheet versions can
be found at http://www.wjh.harvard.edu/~inquire/spreadsheet_guide.htm. A plain text version
can be found at wjh.harvard.edu/~inquire/inqdict.txt.
(2) Master Dictionary for Business (MD). Developed by professors Tim Loughran and Bill
McDonald at the University of Notre Dame, this dictionary is specific to financial reports and
can be used for business related texts. One of the motivations of developing this business specific
dictionary is that frequently used words in business documents such as “liability” and “vice”
have no negative implications, but they are categorized negative in the Harvard IV-4 dictionary.
The master dictionary has more than 80,000 entries. Different forms of the same base word are
included as separate entries. These different forms, also known as inflections in natural language
processing (NLP), can be singular vs plural for nouns and different tenses for verbs. Information
about this Master Dictionary can be found at https://sraf.nd.edu/textual-analysis/resources/.
With either dictionary, you can look up a word to see if it has been categorized as positive or
negative. Words are in uppercase in both dictionaries. The main difference between the two
dictionaries is different classifications of word sentiment. Recall the example given earlier,
“liability” is negative according to IV-4, but it is not negative according to MD. Likewise, “joy”
is positive in IV-4 but not in MD.
Another notable difference is: IV-4 captures different word meanings, while MD captures
different word forms. A word can have different meanings, also known as word senses in NLP.
In this case, the word has multiple entries in IV-4, each is listed as the word followed by “#” and
a number, e.g., “ABOVE#1”, “ABOVE#2”, etc. In contrast, MD has only one entry for the word
“above”. In terms of word forms, IV-4 is quite limited, but MD includes all forms of a word. For
example, IV-4 has two entries for the word “book”: book and booking. In contrast, MD has
book, books, booking, bookings, booked, and many book-prefixed words such as bookends and
bookkeeper – a total of 53 entries. As a result, MD ends up having more entries.
2. Tasks
Your main task in this assignment is to identify and count the occurrences of positive words and
negative words in any given text. This is conceptually simple: for each word in the text, look it
up in a dictionary (or in both dictionaries) to see if it is positive or negative. For this assignment,
you may choose either dictionary or use both.
For example, given the text (which is from an online review):
I thought they were pretty nice at first. Sound quality is good. They charged quick in the
case. But transparency mode amplified sounds like crazy. At the gym someone was moving
weights, the cling of them hitting was so loud it hurt my ears. When I took the airpods out it
was much quieter in real life. Even in noise cancelling mode they pick up voices and then
they end up sounding robotic. There's a white noise in the background too. It's like a little
static, annoying as hell. Just now I started hearing a popping noise in the right bud. It kept
happening even when the music was off. When I turned to transparency mode that popping
became a buzz.
using MD, your program should be able to output something like below:
Positive words: {'transparency': 2, 'good': 1}
Negative words: {'hurt': 1, 'cancelling': 1, 'annoying': 1}
Positive occurrences: 3; Negative occurrences: 3
In addition, your program should produce a decorated version of the text where positive words are
bolded and negative words are underlined:
I thought they were pretty nice at first. Sound quality is good. They charged quick in the
case. But transparency mode amplified sounds like crazy. At the gym someone was moving
weights, the cling of them hitting was so loud it hurt my ears. When I took the airpods out it
was much quieter in real life. Even in noise cancelling mode they pick up voices and then
they end up sounding robotic. There's a white noise in the background too. It's like a little
static, annoying as hell. Just now I started hearing a popping noise in the right bud. It kept
happening even when the music was off. When I turned to transparency mode that popping
became a buzz.
The decoration can be done through HTML, additional information about which will be
provided.
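As a rough illustration of the decoration step, the sketch below wraps positive words in a bold span and negative words in an underlined span. The function name decorate, the in-line style values, and the tiny sample_lookup dict are assumptions made for this example; they are not taken from the assignment or from either dictionary file.

```python
import html
import re


def decorate(text, lookup):
    """Return an HTML fragment: positive words bold, negative words underlined.

    `lookup` is assumed to map uppercase words to 'positive' or 'negative'.
    """
    # Assumed in-line styles; the course material may prescribe different ones.
    styles = {
        "positive": "font-weight: bold",
        "negative": "text-decoration: underline",
    }
    pieces = []
    for token in text.split():
        # Clean the token only for lookup; the original token is what gets shown.
        word = re.sub(r"[^a-zA-Z]", "", token).upper()
        safe = html.escape(token)  # protect characters such as < and &
        style = styles.get(lookup.get(word))
        if style:
            safe = f'<span style="{style}">{safe}</span>'
        pieces.append(safe)
    return " ".join(pieces)


# Hand-made lookup for illustration only (not real dictionary data).
sample_lookup = {"GOOD": "positive", "HURT": "negative"}
print(decorate("Sound quality is good. It hurt my ears.", sample_lookup))
```

Because the text is split on whitespace, any punctuation attached to a word ends up inside the span, and line breaks in the input collapse to single spaces. Writing the returned string to a file with an .html extension is one way to view the result in a browser.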
3. Implementation
As you know, there is usually more than one way to accomplish a certain task. The following
suggestions intend to give you some ideas about implementation, and you may develop your own
implementation different from the suggestions.
You may use either dictionary or use both dictionaries (you can earn up to 5% bonus points if
you successfully use both dictionaries). In the following description, I assume you use only one
dictionary. Think about how you want to represent the dictionary so that you can easily check if
a word is categorized as positive or negative. You may want to implement functions or use OOP
to support word lookup.
The process has two main steps:
1. Represent the dictionary to support word lookup.
2. For each token in the text, obtain the word and determine whether it is positive or negative.
3.1 Processing Dictionary File
For step 1, you need to process the dictionary file, find positive and negative words, and
represent these words to support lookup in the second step. In IV-4, a positive word is marked by
Positiv in the column named Positiv; in MD, a positive word is marked by a non-zero value in
the column named Positive. Negative words are marked using the Negativ and Negative
columns, respectively. See the Appendix for example entries in the two dictionary files.
The following example code gives you some idea about how to read the IV-4 dictionary file, find
positive words, and remove characters that are not in the alphabet plus the space character (e.g.,
converting “ABOVE#1” to “ABOVE”):
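import re, csv

print("Positive words in IV-4")
with open('iv4.csv', 'r') as f:
    # DictReader uses the first row as column names
    csvreader = csv.DictReader(f)
    for row in csvreader:
        # A non-empty Positiv cell marks a positive word
        if row['Positiv']:
            # Drop sense markers such as "#1"
            print(re.sub(r'[^a-zA-Z\s]', '', row['Entry']))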
Note that the example uses two modules in the Python standard library (meaning that you need
not install additional packages; and of course you could use the pandas package, if you prefer). It
uses the csv module’s DictReader class, which treats the first row as the headers (that
define column names) and converts each row of data into a dictionary (with the column name as the
key). The sub function in the regular expression (re) module replaces anything that is not in the
alphabet or a space with an empty string.
Similarly, the following code processes the MD file, finds positive words, cleans each word (not
necessary because MD word entries use only characters in the alphabet), and converts the word
to upper case (also unnecessary because MD entries are in uppercase; included here to show how
it is done because you will need case conversion for words in the text):
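import re, csv

print("Positive words in MD")
with open('md2020.csv', 'r') as f:
    # DictReader uses the first row as column names
    csvreader = csv.DictReader(f)
    for row in csvreader:
        # A value above zero in Positive marks a positive word
        if int(row['Positive']) > 0:
            # Clean the word and convert it to uppercase
            print(re.sub(r'[^a-zA-Z\s]', '', row['Word']).upper())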
Note that the above examples only find the words. You need to decide how to represent the words to
support lookup. I recommend a Python dictionary (dict); you may also use a list or set.
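One possible dict-based representation is sketched below. It is only an illustration: the function name load_sentiment_words and its parameters are hypothetical, and it assumes the file layout and column names described above. A cell is treated as marked when it holds a number above zero (MD) or any non-numeric, non-empty text (IV-4).

```python
import csv
import re


def is_marked(cell):
    """Return True if a Positiv/Positive or Negativ/Negative cell marks the word."""
    cell = (cell or "").strip()
    try:
        return int(cell) > 0   # MD style: a number above zero
    except ValueError:
        return bool(cell)      # IV-4 style: non-empty text such as "Positiv"


def load_sentiment_words(path, word_col, pos_col, neg_col):
    """Return a dict mapping each uppercase word to 'positive' or 'negative'."""
    lookup = {}
    with open(path, "r", newline="") as f:
        for row in csv.DictReader(f):
            # Strip sense markers such as "#1" and normalize the case.
            word = re.sub(r"[^a-zA-Z\s]", "", row[word_col]).strip().upper()
            if not word:
                continue
            # setdefault keeps the first label seen, so when an IV-4 word has
            # several senses, the first marked sense decides (a design choice).
            if is_marked(row.get(pos_col)):
                lookup.setdefault(word, "positive")
            elif is_marked(row.get(neg_col)):
                lookup.setdefault(word, "negative")
    return lookup
```

Under these assumptions, load_sentiment_words('iv4.csv', 'Entry', 'Positiv', 'Negativ') would build the IV-4 lookup and load_sentiment_words('md2020.csv', 'Word', 'Positive', 'Negative') the MD one. Words that are neither positive nor negative are left out, so lookup.get(word) returns None for them.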
3.2 Processing Text
The text can be stored in a text file and read in by your program. If this becomes a challenge, you
may hard code it (i.e., storing it in a variable in your code).
As noted earlier, in addition to space and the alphabet characters, there are often other characters
in the text. For example, the following text includes quotation marks and a question mark:
It is a “good” practice?
You need to remove these punctuation marks for sentiment lookup. The same regular expression
shown in the example code earlier can help you accomplish this task:
re.sub(r'[^a-zA-Z\s]', '', 'It is a "good" practice?').upper()
'IT IS A GOOD PRACTICE'
The same regular expression also works on a single token.
You can use the split method of a string object to convert the text into a list, with each token as a
list element. This can be useful when you need to preserve the order of the tokens.
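The splitting, cleaning, and lookup steps can be combined roughly as follows. This is a minimal sketch: the function name count_sentiment_words is hypothetical, and sample_lookup is a hand-made stand-in for a real dictionary, assumed to map uppercase words to 'positive' or 'negative'.

```python
import re
from collections import Counter


def count_sentiment_words(text, lookup):
    """Count positive and negative tokens in text, keyed by lowercase word."""
    positive, negative = Counter(), Counter()
    for token in text.split():
        # Remove punctuation from the single token, then match the
        # dictionary's uppercase convention.
        word = re.sub(r"[^a-zA-Z\s]", "", token).upper()
        label = lookup.get(word)
        if label == "positive":
            positive[word.lower()] += 1
        elif label == "negative":
            negative[word.lower()] += 1
    return positive, negative


# Hand-made lookup for illustration only (not real dictionary data).
sample_lookup = {"GOOD": "positive", "HURT": "negative"}
pos, neg = count_sentiment_words('It is a "good" practice? It hurt.', sample_lookup)
print("Positive words:", dict(pos))
print("Negative words:", dict(neg))
print(f"Positive occurrences: {sum(pos.values())}; "
      f"Negative occurrences: {sum(neg.values())}")
```

One caveat of this cleaning approach is that apostrophes are dropped too, so a token like "There's" becomes "THERES" and will not match an entry for "THERE".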
3.3 Using HTML to Highlight Positive and Negative Words
HTML is used for webpages. It tells the browser how to render text and other content. In this
assignment, we will use in-line style

pa-2oaimi1b.pdf md2020-ppng3un3.csv iv4-35rdvhl4.csv

Answered 4 days After Oct 05, 2021

Solution

Karthi answered on Oct 08 2021


1. First, read the given dataset. The important point is that we do not read all the columns; we read only the specific columns we need to compare.
2. Read only two columns, Positive and Negative, from both datasets.
3. Append all the words, both negative and positive, into a list...
