SET11121 School of Computing, Napier University Assessment Brief 1. Module number SET11121 /...

Question

SET11121 School of Computing, Napier University
Assessment Brief

1. Module number: SET11121 / SET11521
2. Module title: Data Wrangling
3. Module leader: Dimitra Gkatzia
4. Tutor with responsibility for this Assessment: Dimitra Gkatzia ( XXXXXXXXXX)
5. Assessment: Coursework
6. Weighting: 100% of module assessment
7. Size and/or time limits for assessment: up to 1700 words plus figures or tables with results and developed code for all questions.
8. Deadline of submission: Your attention is drawn to the penalties for late submission. Part A: 08/03/18 at 1500 UK time. Part B: 12/04/18 at 1600 UK time.
9. Arrangements for submission: Your Coursework must be submitted via Moodle. Further submission instructions are included in the attached specification, and on Moodle.
10. Assessment Regulations: All assessments are subject to the University Regulations.
11. The requirements for the assessment: See Attached
12. Special instructions: See Attached
13. Return of work: Feedback and marks will be provided within three weeks of submission.
14. Assessment criteria: Your coursework will be marked using the marking sheet attached as Appendix A. This specifies the criteria that will be used to mark your work. Further discussion of criteria is also included in the coursework specification attached.

SET11121 / SET11521 / SET XXXXXXXXXX Data Wrangling
Assessment Brief
The assignment aims to cover the learning outcomes specified for the module:
LO1: Critically evaluate the tools and techniques of the data storage, interfacing, aggregation and processing
LO2: Select and apply a range of specialised data types, tools and techniques for data storage, interfacing, aggregation and processing
LO3: Employ specialised techniques for dealing with complex data sets
LO4: Design, develop and critically evaluate data driven applications in Python
The goal of this assignment is to develop a prediction model for Abusive Language Detection.

Data
For this assignment you must use the datasets provided on moodle.
Part A - 30%. Deadline: Friday 8 March at 3pm (UK time).
Deliverable 1: You will need to perform a literature review on recent approaches to abusive language detection. You will need to pick 3 new approaches published after 2016. For each approach, you will need to describe the dataset they used, the approach (including the feature selection), a brief description of their result as well as your critical review (are there any issues with the study, how would you improve it? etc.). Your report must include an “Introduction” (intro to the topic and described methods), “Background” (description of methods as described previously), a “Discussion” (critical analysis of the described methods), and a “Summary of results” from Deliverable 2.
Deliverable 2: Using the provided datasets, you will need to:
Load (in Python) and store the training dataset using one of the approaches you learnt. In the comments explain why you chose to store the data in a particular way.
Perform some analysis, e.g. find most frequent/infrequent words, number of unique words.
Your references should come from international venues (such as conferences and journals). You can look for papers at Google Scholar or at the university library (online). Your report must adhere to citation guidelines - any citation style is acceptable. An example guide can be found here: https://drhazelhall.files.wordpress.com/2013/01/2005_hall_referencing.pdf
You will submit:
Part A consists of two deliverables:
Deliverable 1: One .pdf file of 1200 words. The document should include your name, matriculation number and contact details, as well as tables and a short description of your text analysis.
Deliverable 2: Your code with appropriate comments.
Everything must be submitted on moodle only!
Marking: You will be marked on the content (10%), the structure of the report (5%), the criticality (10%) and the quality of code (5%). See the end of the document for a detailed description of the marking scheme.

Part B - 70%. Deadline: Friday 12 April at 3pm (UK time).
For the second part of the assignment you will need to develop and evaluate abusive language detection models for the given datasets. You should choose two ML models: one of the ML approaches you were taught in class and one you identified from the literature. You should produce two models and an evaluation metric (metric taken from literature - you need to justify which metric you chose and why). The goal of this exercise is not to produce a state-of-the-art sentiment analysis model. If your chosen model performs poorly by your selected metric, do not worry: this is not what we are testing. Which model you use, and how you evaluate, is up to you. The choice of model is not important (although we will assume that when you choose a model, you understand what it is and how it works) as well as that the evaluation metric is appropriate. Your solution should be sensible - you should be able to explain why it tests something of impact to the problem.

Tips and Clarifications
We are not looking for models that perform well: we are looking to see that you can build sensible models, i.e. choose meaningful features and perform a sensible evaluation. If you are struggling to make something work with the volume of data present, you can subsample (for instance, randomly pick a proportion of the dataset). You must use Python and its libraries to tackle this task. You are strongly encouraged to make use of third-party libraries for model building and evaluation, rather than writing your own, unless you specifically need to do something with no library support.
You will submit:
1.
The code of your solution, and an up to 500 words .pdf document explaining the data pre-processing, model features and evaluation as well as a discussion of your results, your critical evaluation and suggestions for future improvement. If you do any pre-processing to the data, please also include the script you use to do this (or a list of the commands run).
Marking: 40% for method/model, 15% for evaluation, 15% for report and reflection. See Appendix A for more explanations.

Appendix A: Marking Scheme
Grade bands: No Submission, Very poor, Inadequate, Adequate, Good, Very good, Excellent, Outstanding

A1 Content (10%)
No Submission: No work submitted
Very poor: Literature not described adequately, i.e. described only the topic or the data, or sources are not relevant
Inadequate: Literature not described adequately, leaving most work unexplained
Adequate: Literature described partially: half of its elements covered
Good: Literature described partially
Very good: Literature described almost fully
Excellent: Literature fully described, covering everything
Outstanding: Literature fully described and additional investigation was performed

A2 Structure (5%)
No Submission: No work submitted
Very poor: Report does not follow the guidelines or word limit
Inadequate: The structure of the report requires more work
Adequate: The structure of the report is ok, but some part is missing
Good: The structure of the report is overall good but there is room for improvement
Very good: The structure of the report is very good, naming of titles could improve
Excellent: The structure of the report is excellent
Outstanding: The structure of the report is outstanding and professional

A3 Criticality (10%)
No Submission: No work submitted
Very poor: The lit has not been criticised
Inadequate: The lit review has not been criticised adequately, e.g. no mentioning of specific drawbacks
Adequate: Not all sources have been criticised.
Good: The lit review has been criticised but not thoroughly enough
Very good: The lit review has been criticised thoroughly and good insights have been provided
Excellent: The lit review has been criticised thoroughly and valuable insights have been provided
Outstanding: The lit has been criticised thoroughly with excellent suggestions for improvement

A4 Code and explanation (5%)
No Submission: No work submitted
Very poor: Code with bugs
Inadequate: Code with bugs but good explanations or questions answered partly
Adequate: Code without bugs but inadequate explanation
Good: Code without bugs and good but not thorough explanation
Very good: Code without bugs and explanations almost complete
Excellent: Excellent code and thorough explanations
Outstanding: Outstanding code and thorough and thoughtful explanations.

B1 Methods/Models (40%)
No Submission: No work submitted
Very poor: Code with bugs and algorithm model not well described
Inadequate: Code with bugs but algorithm model well described
Adequate: Code with a minor bug but algorithm model not well described and justified
Good: Code with a minor bug but algorithm model well described and justified
Very good: Code without bugs but algorithm model not described or justified
Excellent: Code without bugs but algorithm model not described and justified in great detail
Outstanding: Code without bugs and algorithm model described and justified in detail

Late submission policy
Coursework submitted after the agreed deadline will be marked at a maximum of 40% (undergraduate) or P1 (postgraduate). Coursework submitted over five working days after the agreed deadline will be given 0% (although formative feedback will be offered where requested).

Extensions
If you require an extension, please contact the module leader before the deadline. Extensions are only provided for exceptional circumstances and evidence may be required. See the Fit to Sit regulations for more details.
Plagiarism
Plagiarised work will be dealt with according to the university’s guidelines: http://www2.napier.ac.uk/ed/plagiarism/

B2 Evaluation (15%)
No Submission: No work submitted
Very poor: Not appropriate evaluation metric chosen
Inadequate: Neither the evaluation setup nor the results are described appropriately
Adequate: Evaluation setup is not justified but almost correctly executed and results are mentioned
Good: Evaluation setup is not justified but correctly executed and results are mentioned
Very good: Evaluation setup is somewhat

Ximi · Accepted Answer

eda_1.pdf
INTRODUCTION 
Detecting and combating abusive language in online sources such as social media, news
and articles has become a growing problem. Researchers have increasingly used AI-based
methods to detect such content and remove it from these sources.
Here we discuss three approaches published after 2016 to show how the problem of
abusive language detection is being tackled in the literature.
The 1st approach tackles the problem with Convolutional Neural Network (CNN) models
that differ in the input features they use.
The 2nd approach studies cross-domain identification of abusive language. The authors
propose validating a trained model on data from a different domain and training on a
mixture of datasets from several domains.
The 3rd approach focuses on what the abusive language written on social media or in
news articles is aimed at. It tries to identify the topic or target the abuse concerns, and
analyses texts according to whether the language is directed at a specific person or
entity or is generalised.
BACKGROUND 
We now look at the methods used in the approaches described in the introduction.
1st Approach -
The authors propose three CNN-based models to classify sexist and racist
abusive language: CharCNN, WordCNN, and HybridCNN. The major difference among
these models is whether the input features are characters, words, or both.
A CNN applies a set of filters, each of which covers a “window” of consecutive words,
characters, or both, depending on the model, and the classification is made from the
filter responses.
This reduces the need to extract and select n-gram features explicitly, as is usually
done in text analytics.
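To make the "window" idea concrete, here is a minimal sketch in plain Python of one filter sliding over a sequence of token vectors, followed by max-over-time pooling. The function name `conv_max_pool`, the toy 2-dimensional vectors and the filter weights are invented for illustration; the models in the paper learn many filters over much larger embeddings.

```python
def conv_max_pool(vectors, filt, bias=0.0):
    """Slide one filter over consecutive vectors and keep the strongest response.

    vectors: list of token vectors (each a list of floats).
    filt:    list of `width` weight vectors; the filter spans `width` consecutive
             tokens, so it reacts to n-gram-like patterns without listing n-grams.
    """
    width = len(filt)
    responses = []
    for start in range(len(vectors) - width + 1):
        window = vectors[start:start + width]
        # Dot product between the filter and the current window of tokens.
        score = bias + sum(
            w * x
            for f_row, v_row in zip(filt, window)
            for w, x in zip(f_row, v_row)
        )
        responses.append(max(0.0, score))  # ReLU non-linearity
    # Max-over-time pooling: one number per filter, whatever the sentence length.
    return max(responses) if responses else 0.0


# Toy 2-dimensional "embeddings" for a four-token sentence (hypothetical values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# A width-2 filter, i.e. it looks at two neighbouring tokens at a time.
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
feature = conv_max_pool(sentence, bigram_filter)
```

A real model stacks many such filters of different widths and feeds the pooled values to a classifier layer.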
CharCNN follows the character-level convolutional network of Zhang et al. (2015). Each
character from the input sentence is transformed into a one-hot encoding over 70
characters: 26 English letters, 10 digits, 33 other characters (punctuation and special
characters), and a newline character. WordCNN is a CNN where the input
sentence is first segmented into words, and each word is converted into a 300-dimensional
word2vec embedding trained on 100 billion words from Google News (Mikolov et al., 2013).
Incorporating pre-trained vectors is a widely used method to improve performance,
especially when using a relatively small dataset. The authors set the embedding to be
non-trainable since their dataset is small.
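The character-level input can be sketched in a few lines of plain Python. The alphabet below is an assumption built from Python's `string` constants; it has 69 symbols and is not the exact 70-character set used in the paper. The function name `one_hot_encode` and the `max_len` value are also illustrative.

```python
import string

# Illustrative alphabet: 26 letters + 10 digits + 32 punctuation marks + newline.
# That is 69 symbols, close to but NOT the exact 70-character set of the paper.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + "\n"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_encode(text, max_len=16):
    """Return a max_len x len(ALPHABET) matrix of 0/1 values for `text`.

    Characters outside the alphabet (spaces, for example) become all-zero rows,
    and the text is truncated or zero-padded to `max_len` characters.
    """
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(text.lower()[:max_len]):
        idx = CHAR_TO_INDEX.get(ch)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


encoded = one_hot_encode("#feminazi")
```

Because every character gets its own row, a misspelled or made-up token still produces a usable input, which a word-level vocabulary lookup would not.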
HybridCNN is a variation of WordCNN, motivated by WordCNN's limitation of taking only
word features as input. Abusive language often contains purposely or mistakenly
misspelled words and made-up vocabulary such as #feminazi. Since neither of the two
models above uses character and word inputs at the same time, the authors designed
HybridCNN to test whether a model can capture features from both kinds of
input.
2nd Approach - 
The aim of the paper was to assess how well models trained on a particular dataset of
abusive language perform on a different test dataset. The differences in performance can
be traced back to the following factors:
(1) the differences in the types of abusive language that the datasets were labelled with and
(2) the differences in dataset sizes. The paper observes the joint effect of both
factors.
The authors used a linear Support Vector Machine (SVM), which has already been successfully
applied to abusive text classification and detection in the literature.
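A cross-domain check of this kind can be sketched with scikit-learn as follows. The two tiny datasets are made up, and the TF-IDF features and the F1 metric are assumptions chosen for the example; the text above only states that a linear SVM was used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for two differently labelled datasets.
domain_a_texts = ["you are an idiot", "have a lovely day",
                  "shut up you fool", "thanks for the help"]
domain_a_labels = [1, 0, 1, 0]
domain_b_texts = ["what an idiot", "lovely weather today",
                  "you fool", "thanks a lot"]
domain_b_labels = [1, 0, 1, 0]

# Train on domain A only ...
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(domain_a_texts, domain_a_labels)

# ... then score on A (in-domain, optimistic) and on B (cross-domain).
in_domain_f1 = f1_score(domain_a_labels, model.predict(domain_a_texts))
cross_domain_f1 = f1_score(domain_b_labels, model.predict(domain_b_texts))
```

The gap between the two scores is the quantity of interest: a large drop on domain B suggests the model has learnt dataset-specific cues rather than abusive language in general.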
3rd Approach - 
Much of the work on abusive language subtasks can be synthesized in a two-fold 
typology that considers whether (i) the abuse is directed at a specific target, and (ii) the 
degree to which it is explicit.
DISCUSSION 
The 1st and 2nd approaches treat the problem technically. The 3rd is a conceptual
proposal with no technical implementation. In my view, however, the 3rd approach, if
implemented, could give better insight into the targets and specifics of the abusive
language spread over the internet.
The 1st approach uses CNNs, one of the more advanced techniques used in industry. Its
comparison of the three model variants can help in judging model quality and choosing
hyperparameters. The 2nd approach deals with a typical machine learning generalisation
problem. Even so, this kind of cross-domain evaluation should be encouraged and used in
practice when deploying abusive language detection.
I did some quick analytics on the dataset and loaded the JSON files with pandas.
The JSON files are line-separated (one record per line) and hence easy to load with
pandas. Pandas is a library for record-style (tabular) data that scales to medium-sized
datasets.
Pandas provides a DataFrame format for working with data. The analysis gave me the set
of unique tokens, the total number of tokens, and the frequency distribution, which
showed which words were used most often in a particular context and which were used
least. This gives a quick overview of the data. There are numerous other ways to explore
and visualise the data.
This approach is known as Exploratory Data Analysis (EDA).
eda_2.pdf
Data Loading 
Pandas provides a DataFrame format for working with data. The analysis gave me the set
of unique tokens, the total number of tokens, and the frequency distribution, which
showed which words were used most often in a particular context and which were used
least. This gives a quick overview of the data. There are numerous other ways to explore
and visualise the data.
This approach is known as Exploratory Data Analysis (EDA).
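A minimal sketch of this kind of analysis is shown below. It reads two made-up JSON Lines records from memory instead of the real files; the `text` column name is an assumption (the notebook output only shows columns such as `Annotation`), and whitespace splitting stands in for a proper tokenizer.

```python
import io
from collections import Counter

import pandas as pd

# Two made-up records in JSON Lines form (one JSON object per line).
# The column name "text" is an assumption about the real files.
raw = io.StringIO(
    '{"text": "this is a test tweet", "Annotation": "none"}\n'
    '{"text": "this is another tweet", "Annotation": "none"}\n'
)
data = pd.read_json(raw, lines=True)

# Whitespace tokenisation keeps the example simple; the notebook downloads
# NLTK's punkt tokenizer, which could be used here instead.
tokens = [tok for text in data["text"] for tok in text.lower().split()]
freq = Counter(tokens)

total_tokens = len(tokens)                 # all tokens, with repeats
unique_tokens = len(freq)                  # vocabulary size
most_common = freq.most_common(3)          # most frequent words
least_common = freq.most_common()[-3:]     # least frequent words
```

With the real data, `pd.read_json("abusive_data/racism.json", lines=True)` from the notebook would replace the in-memory buffer.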
Data Preprocessing  
Basic preprocessing, removing stop words and applying regular expressions, cleared out
some of the vocabulary, which reduces the number of features.
The text data was converted into tokens to form a corpus.
The data frames were concatenated and a label column was added so that each row
records the class it came from.
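The steps above could look roughly like the sketch below. The stop-word set, the regular expressions, the `clean` helper and the class name `neither` are all illustrative assumptions, not the exact choices made in the submitted notebook.

```python
import re

import pandas as pd

# A tiny illustrative stop-word list; NLTK or scikit-learn ship fuller ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "you"}


def clean(text):
    """Lower-case, strip URLs, @mentions and non-letters, then drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+", " ", text)  # URLs and user mentions
    text = re.sub(r"[^a-z\s]", " ", text)            # digits, punctuation, emoji
    return [tok for tok in text.split() if tok not in STOP_WORDS]


# Hypothetical frames standing in for the per-class JSON files.
racism = pd.DataFrame({"text": ["@user You are the WORST!! http://t.co/x"]})
neither = pd.DataFrame({"text": ["Have a nice day :)"]})

# Add a label column to each frame, then stack them into one corpus.
corpus = pd.concat(
    [racism.assign(label="racism"), neither.assign(label="neither")],
    ignore_index=True,
)
corpus["tokens"] = corpus["text"].apply(clean)
```

Keeping the raw `text` column next to the cleaned `tokens` makes it easy to check what the regular expressions removed.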
Feature Engineering  
Feature extraction was done in two ways, depending on the model used.
The first used scikit-learn’s CountVectorizer, which converts each text into a vector of
word counts.
In the second, a CNN was used, so word embeddings were created, that is, word-level
vectors learnt by the model itself.
Word vectors are powerful features in their own right for any machine learning or deep
learning model. The data was separated into training and testing sets.
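As a rough illustration of the count-based features and the train/test split, the sketch below uses made-up documents and labels. It splits first and fits the vectorizer on the training texts only, which is a design choice of this example (so that test vocabulary does not leak into the features) and not necessarily the order used in the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up documents and labels (1 = abusive, 0 = not abusive).
texts = ["you are awful", "have a nice day", "you are nice", "awful awful person"]
labels = [1, 0, 0, 1]

# Hold out half of the toy data; stratify keeps the class balance in both halves.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

vectorizer = CountVectorizer()                    # lower-cases, keeps tokens of 2+ chars
X_train = vectorizer.fit_transform(train_texts)   # learn vocabulary on training data
X_test = vectorizer.transform(test_texts)         # reuse the same columns for test data

# Each row holds raw word counts, e.g. "awful" appears twice here.
demo = CountVectorizer().fit_transform(["awful awful person"]).toarray().tolist()
```

Words that occur only in the test texts are simply ignored by `transform`, which mirrors what happens when the model meets unseen vocabulary.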
Model Training and Evaluation 
Two models were trained. The first was sklearn’s logistic regression classifier. The
metric used for the evaluation was the accuracy score.
The second was a single-layer CNN model with word embeddings as its
feature layer.
This model was trained for 10 epochs so that it could be trained and evaluated quickly.
The accuracy metric was used here as well. During training, the cross-validation set was
used to monitor the trade-off between bias and variance in the model.
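The first model can be sketched as below with made-up training and test examples. The F1 score is reported next to accuracy as an addition of this example, since accuracy alone can look high when one class dominates; the report itself used accuracy only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Made-up training and test examples (1 = abusive, 0 = not abusive).
train_texts = ["you are awful", "awful stupid person",
               "have a nice day", "what a lovely day"]
y_train = [1, 1, 0, 0]
test_texts = ["stupid awful comment", "nice lovely weather"]
y_test = [1, 0]

# Bag-of-words counts, with the vocabulary learnt on the training texts only.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# F1 on the abusive class, added here for comparison with accuracy.
f1 = f1_score(y_test, y_pred)
```

On the real data the same calls apply unchanged; only the lists of texts and labels are replaced by the columns of the concatenated DataFrame.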
abusive_language_detection_1.ipynb
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import nltk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package punkt to /Users/Ximi-\n",
      "[nltk_data]     Hoque/nltk_data...\n",
      "[nltk_data]   Unzipping tokenizers/punkt.zip.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nltk.download('punkt')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Reading data in pandas dataframe as it allows applying operations in a proper way\n",
    "data = pd.read_json(\"abusive_data/racism.json\", lines=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['Annotation', 'contributors', 'coordinates', 'created_at',
