no need reference, and no word count, only coding

Question

Ximi · Accepted Answer

{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "cfile_c_636871579706929484_37137_1.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "cells": [
    {
      "metadata": {
        "colab_type": "text",
        "id": "x3aJEGS5keqH"
      },
      "cell_type": "markdown",
      "source": [
        "# Coursework Part 1: Detecting Spam with Spark\n",
        "\n",
        "\n",
        "This coursework is about classification of e-mail messages as spam or non-spam in Spark. We will go through the whole process from loading and preprocessing to training and testing classifiers in a distributed way in Spark. We wil use the techniques shown in the lextures and labs. I will also introduce here a few additional elements, such as the NLTK and some of the preprocessing and machine learning functions that come with Spark. You are not expected to need anything beyond the material handed out so far and in some cases the Spark documentation, to which I have put links in this document.  \n",
        "\n",
        "The structure is similar to the lab sheets. I provide a code structure with gaps that you are supposed to file. In addition you should run 2 small experiments and comment on the results. The lines where you are supposed to add code or take another action are marked with ">>>" \n",
        "please leave the ">>>" in the text, comment out that line, and write your own code in the next line using a copy of that line as a starting point.\n",
        "\n",
        "I have added numerous comments in text cells and the code cells to guid you through the program. Please read them carefully and ask if anything is unclear. \n",
        "\n",
        "Once you have completed the tasks, don't delete the outpus, but downlaod the notebook (outputs will be included)."
      ]
    },
    {
      "metadata": {
        "colab_type": "text",
        "id": "CdFlCqCFkeqL"
      },
      "cell_type": "markdown",
      "source": [
        "## Load and prepare the data\n",
        "\n",
        "We will use the lingspam dataset in this coursework (see [http://csmining.org/index.php/ling-spam-datasets.html](http://csmining.org/index.php/ling-spam-datasets.html) for more information).\n",
        "\n",
        "The next cells only prepare the machine, as usual."
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "id": "dHGQ78mTkeqO",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 122
        },
        "outputId": "7a692fe0-08ef-4b93-96d9-737b797fd386"
      },
      "cell_type": "code",
      "source": [
        "# Load the Drive helper and mount\n",
        "from google.colab import drive\n",
        "\n",
        "# This will prompt for authorization.\n",
        "drive.mount('/content/drive')"
      ],
      "execution_count": 1,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3Aietf%3Awg%3Aoauth%3A2.0%3Aoob&scope=email%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdocs.test%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.photos.readonly%20https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fpeopleapi.readonly&response_type=code\n",
            "\n",
            "Enter your authorization code:\n",
            "··········\n",
            "Mounted at /content/drive\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "id": "0-hhNOS0keqW",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 173
        },
        "outputId": "cb2b4e5c-9519-4e28-8d0e-45e0ccb93bf3"
      },
      "cell_type": "code",
      "source": [
        "!pip install pyspark\n",
        "\n",
        "import pyspark\n",
        "sc = pyspark.SparkContext.getOrCreate()"
      ],
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Collecting pyspark\n",
            "\u001b[?25l  Downloading https://files.pythonhosted.org/packages/88/01/a37e827c2d80c6a754e40e99b9826d978b55254cc6c6672b5b08f2e18a7f/pyspark-2.4.0.tar.gz (213.4MB)\n",
            "\u001b[K    100% |████████████████████████████████| 213.4MB 118kB/s \n",
            "  Building wheel for pyspark (setup.py) ... \u001b[?25ldone\n",
            "\u001b[?25h  Stored in directory: /root/.cache/pip/wheels/cd/54/c2/abfcc942eddeaa7101228ebd6127a30dbdf903c72db4235b23\n",
            "Successfully built pyspark\n",
            "Installing collected packages: py4j, pyspark\n",
            "Successfully installed py4j-0.10.7 pyspark-2.4.0\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "id": "QnQExFvULH4k",
        "colab_type": "code",
        "colab": {}
      },
      "cell_type": "code",
      "source": [
        "!tar -xzf /content/drive/My\ Drive/BigData/data/lingspam_public.tar.gz"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "metadata": {
        "colab_type": "code",
        "id": "nncrHFdwqUmE",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 629
        },
        "outputId": "66e75ae9-4057-400a-d7d8-c81e2db255ac"
      },
      "cell_type": "code",
      "source": [
        "# We have a new dataset in directory BigData/data/lingspam_public .\n",
        "%cd /content/lingspam_public/\n",
        "#drive/My Drive/BigData/data/lingspam_public \n",
        "# the line above should output should show "bare  lemm  lemm_stop  readme.txt  stop"\n",
        "!cat readme.txt\n",
        "# the line above shows the content of the readme file, which explains the structrue of the dataset\n",
        "# Lemmatisation is a process similar to stemming"
      ],
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "/content/lingspam_public\n",
            "This directory contains the Ling-Spam corpus, as described in the \n",
            "paper:\n",
            "\n",
            "I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, George Paliouras, \n",
            "and C.D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam \n",
            "Filtering". In Potamias, G., Moustakis, V. and van Someren, M. (Eds.), \n",
            "Proceedings of the Workshop on Machine Learning in the New Information \n",
            "Age, 11th European Conference on Machine Learning (ECML 2000), \n",
            "Barcelona, Spain, pp. 9-17, 2000.\n",
            "\n",
            "There are four subdirectories, corresponding to four versions of \n",
            "the corpus:\n",
            "\n",
            "bare: Lemmatiser disabled, stop-list disabled.\n",
            "lemm: Lemmatiser enabled, stop-list disabled.\n",
            "lemm_stop: Lemmatiser enabled, stop-list enabled.\n",
            "stop: Lemmatiser disabled, stop-list enabled.\n",
            "\n",
            "Each one of these 4 directories contains 10 subdirectories (part1, \n",
            "..., part10). These correspond to the 10 partitions of the corpus \n",
            "that were used in the 10-fold experiments. In each repetition, one \n",
            "part was reserved for testing and the other 9 were used for training. \n",
            "\n",
            "Each one of the 10 subdirectories contains both spam and legitimate \n",
            "messages, one message in each file. Files whose names have the form\n",
            "spmsg*.txt are spam messages. All other files are legitimate messages.\n",
            "\n",
            "By obtaining a copy of this corpus you agree to acknowledge the use \n",
            "and origin of the corpus in any published work of yours that makes \n",
            "use of the corpus, and to notify the person below about this work.\n",
            "\n",
            "Ion Androutsopoulos \n",
            "http://www.aueb.gr/users/ion/\n",
            "Ling-Spam corpus last updated: July 17, 2000\n",
            "This file (readme.txt) last updated: July 30, 2003.\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "colab_type": "text",
        "id": "nb02XcMOkeqq"
      },
      "cell_type": "markdown",
      "source": [
        "## Task 1) Read the dataset and create RDDs \n",
        "a) Start by reading the directory with text files from the file system (`/content/drive/My Drive/BigData/data/lingspam_public`). Load all text files per dirctory (part1,part2, ... ,part10) using `wholeTextFiles()`, which creates one RDD per part, containing tuples (filename,text). This is a good choice as the text files are small. (5%)\n",
        "\n",
        "b) We will use one of the RDDs as test set, the rest as training set. For the training set you need to create the union of the remaining RDDs. (5%)\n",
        "\n",
        "b) Remove the path and extension from the filename using the regular expression provided (5%).\n",
        "\n",
        "If the filename starts with 'spmsg' it is spam, otherwise it is not. We'll use that later to train a classifier. \n",
        "\n",
        "We will put the code in each cell into a function that we can reuse later. In this way we can develop the whole preprocessing with the smaller test set and apply it to the training set once we know that everything works. "
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "id": "H7iF1lZukeqt",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 360
        },
        "outputId": "fda164f6-94ac-4316-b397-d9dc8f491084"
      },
      "cell_type": "code",
      "source": [
        "from pathlib import Path\n",
        "import re\n",
        "\n",
        "def makeTestTrainRDDs(pathString):\n",
        "    """ Takes one of the four subdirectories of the lingspam dataset and returns two RDDs one each for testing and training. """\n",
        "    # We should see10 parts that we can use for creating train and test sets.\n",
        "    p = Path(pathString) # gets a path object representing the current directory path.\n",
        "    dirs = list(p.iterdir()) # get the directories part1 ... part10. \n",
        "    print(dirs) # Print to check that you have the right directory. You can comment this out when checked. \n",
        "    rddList = [] # create a list for the RDDs\n",
        "    # now create an RDD for each 'part' directory and add them to rddList\n",
        "    dirRoot = '/content/lingspam_public/'#drive/My Drive/BigData/data/lingspam_public/'\n",
        "    for d in dirs: # iterate through the directories\n",
        "#>>>     rdd = ... #>>> # read the files in the directory \n",
        "         rdd = sc.wholeTextFiles(dirRoot+str(d))\n",
        "#>>>     ... #>>> append the RDD to the rddList\n",
        "         rddList.append(rdd)\n",
        "    print('len(rddList)',len(rddList))  # we should now have 10 RDDs in the list # just for testing\n",
        "    print(rddList[1].take(1)) # just for testing, comment out when it works.\n",
        "\n",
        "    testRDD1 = rddList[9] # set the test set\n",
        "    trainRDD1 = rddList[0] # start the training set from 0 and \n",
        "    # now loop over the range from 1 to 9(exclusive) to create a union of the remaining RDDs\n",
        "    for i in range(1,9):\n",
        "        trainRDD1 = trainRDD1.union(rddList[i]) #>>> create a union of the current and the next \n",
        "        print(i)\n",
        "            # RDD in the list, so that in the end we have a union of all parts 0-8. (9 ist used as test set)\n",
        "    # both RDDs should remove the paths and extensions from the filename. \n",
        "    #>>> This regular expression will do it: re.split('[/\.]', fn_txt[0])[-2]\n",
        "    #>>> apply it to the filenames in train and test RDD with a lambda\n",
        "#>>>    testRDD2 = testRDD1.map(lambda ...) \n",
        "        testRDD2 = testRDD1.map(lambda ft: (re.split('[/\.]',ft[0])[-2],ft[1]))\n",
        "#>>>    trainRDD2 = trainRDD1.map(lambda ...) \n",
        "        trainRDD2 = trainRDD1.map(lambda ft: (re.split('[/\.]',ft[0])[-2],ft[1]))\n",
        "    return (trainRDD2,testRDD2)\n",
        "\n",
        "# this makes sure we are in the right directory\n",
        "%cd /content/lingspam_public/\n",
        "#drive/My Drive/BigData/data/lingspam_public \n",
        "# this should show "bare  lemm  lemm_stop  readme.txt  stop"\n",
        "!ls\n",
        "# the code below is for testing the function makeTestTrainRDDs\n",
        "trainRDD_testRDD = makeTestTrainRDDs('bare') # read from the 'bare' directory - this takes a bit of time\n",
        "(trainRDD,testRDD) = trainRDD_testRDD # unpack the returned tuple\n",
        "print('created the RDDs') # notify the user, so that we can figure out where things went wrong if they do.\n",
        "print('testRDD.count(): ',testRDD.count()) # should be ~291 \n",
        "print('trainRDD.count(): ',trainRDD.count()) # should be ~2602 - commented out to save time as it takes some time to create RDD from all the files\n",
        "print('testRDD.getNumPartitions()',testRDD.getNumPartitions()) # normally 2 on Colab (single machine)\n",
        "print('testRDD.getStorageLevel()',testRDD.getStorageLevel()) # Serialized, 1x Replicated \n",
        "print('testRDD.take(1): ',testRDD.take(1)) # should be (filename,[tokens]) \n",
        "rdd1 = testRDD # use this for developemnt in the next tasks "
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "/content/lingspam_public\n",
            "bare  lemm  lemm_stop  readme.txt  stop\n",
            "[PosixPath('bare/part5'), PosixPath('bare/part9'), PosixPath('bare/part10'), PosixPath('bare/part8'), PosixPath('bare/part2'), PosixPath('bare/part1'), PosixPath('bare/part3'), PosixPath('bare/part4'), PosixPath('bare/part6'), PosixPath('bare/part7')]\n",
            "len(rddList) 10\n",
            "[('file:/content/lingspam_public/bare/part9/8-922msg2.txt', 'Subject: summer school in behavioral and cognitive neurosciences\n\nthe groningen graduate school for behavioral and cognitive neurosciences ( bcn ) announces its second summer school in behavioral and cognitive neurosciences 30 june - 11 july 1997 groningen , the netherlands scope brain , behavior and cognition traditionally are studied by various disciplines , ranging from linguistics and experimental psychology through behavioral biology , biophysics and biochemistry to the preclinical and clinical neurosciences . within the groningen graduate school for behavioral and cognitive neurosciences ( bcn ) , established in 1991 at the university of groningen , researchers join efforts to study these different areas of brain research . the summer school brings together international expertise in this multidisciplinary field , with a focus on the interaction between the disciplines . program the summer school program consists of 12 master classes and 4 general lectures . they are taught each morning ( advanced classes ) and each afternoon ( introductory classes ) in four parallel sessions , and provide an excellent opportunity for in-depth discussions . the general lectures will be held in the afternoon , after the afternoon sessions . week 1 ( 30 june - 4 july ) parallel morning sessions ( advanced courses ) : - neural networks as models for neuronal phenomena invited speakers : j . p . draye , mons , belgium w . gerstner , university of lausanne , switserland d . bullock , boston university , usa p . g . morasso , university of genova - neurobiology of cns damage invited speakers : a . arutjunyan , lab . perinatal biochemistry , st . petersburg , russia r . i . hogenesch , norway e . a . j . joosten , departent of neurology , university hospital utrecht , the netherlands m . de ryck , janssen research foundation , beerse , begium r . a . i . de vos , laboratorium pathologie oost - nederland , enschede , the netherlands - topics in constraint - based natural language processing invited speakers : suresh manandhar , department of computer science , university of york , united kingdom from the university of tuebingen , germany : dale gerdemann , thilo goetz , gerald penn , detmar meurers , guido minnen and shuly wintner parallel afternoon sessions ( introductory courses ) : - clinical neuropsychology invited speakers : e . de haan , utrecht , the netherlands p . w . halligan , oxford , united kingdom p . de kort , tilburg , the netherlands d . t . stuss , ontario , canada - color vision invited speakers : k . arikawa , cuy , yokohama , japan t . w . cronin , umbc , baltimore , usa m . kamermans , uva , amsterdam , the netherlands d . g . stavenga , rug , groningen , the netherlands j . walraven , tno , soesterberg , the netherlands c . m . m . de weert , nici , nijmegen , the netherlands - foundations of cognitive science invited speakers : b . von eckardt , university of nebraska , usa m . r . ter hark , groningen , the netherlands e . i . stiekema , groningen , the netherlands - multidisciplinary microdialysis invited speakers : dr . a . m . j . young , dr . m . h . joseph , institute of psychiatry , london , uk dr . t . obrenovitch , insititute of neurology , london , uk week 2 ( 7 - 11 july ) morning session ( advanced course ) : - methodology for neuroimaging invited speakers : c . aine , los alamos , usa h . duifhuis , department of biohysics , groningen , the netherlands n . leenders , paul scherrer institut , villigen , switzerland parallel afternoon sessions ( introductory courses ) : - basics to developmental neurology invited speakers : j . - r . cazalets , cnrs , laboratoire de neurobiologie et mouvement , marseille , france m . van gelder - hasker , department of obstetry , hospital of the free university amsterdam , the netherlands e . a . j . joosten , department of neurology , utrecht university , the netherlands r . w . oppenheim , the bowman gray school of medicine , wake forest university , winston - salem , usa h . b . m . uylings , netherlands institute for brain research , amsterdam , the netherlands l . de vries , department of paediatrics , utrecht university hosital , the netherlands - developmental dyslexia in multidisciplinary perspective invited speakers : h . lyytinen , niilo maki institute , department of psychology , university of jyvaskyla , finland r . nicolson , department of psychology , university of sheffield , united kingdom f . j . koopmans - van beinum , institute for phonetic sciences , university of amsterdam , the netherlands - flexible syntax invited speaker : ad neeleman , department of linguistics , university of utrecht , the netherlands special hands-on course , each day both in the morning and afternoon : - cognitive modeling with act - r invited speakers : john r . anderson , department of psychology , carnegie mellon university , usa christian lebiere , department of psychology , carnegie mellon university , usa fees * graduate and undergraduate students dfl 200 , - * bcn staff and postdocs dfl 300 , - * non - bcn staff and postdocs dfl 400 , - * industrial participants dfl 1 . 000 , - registration * as soon as possible . ask for the program booklet with regsitration form or use the electronic registration form at our web - site inquiries further information regarding the summer school or bcn can be obtained by contacting : bcn office nijenborgh 4 9747 ag groningen the netherlands tel : + 31-50 - 363 . 47 . 34 fax : + 31-50 - 363 . 47 . 40 e-mail : bureau @ bcn . rug . nl see for more details our web - site : http : / / www . bcn . rug . nl / bcn / events / index . html\n')]\n",
            "1\n",
            "2\n",
            "3\n",
            "4\n",
            "5\n",
            "6\n",
            "7\n",
            "8\n",
            "created the RDDs\n",
            "testRDD.count():  289\n",
            "trainRDD.count():  2604\n",
            "testRDD.getNumPartitions() 2\n",
            "testRDD.getStorageLevel() Serialized 1x Replicated\n",
            "testRDD.take(1):  [('9-271msg1', 'Subject: l2 acquisition of rom\n\ncall for papers special session : second language acquisition and the romance languages 1998 convention of the modern language association ( mla ) december 27-30 , 1998 san francisco , ca abstracts of no more than 500 words on any aspect of second language acquisition relating to the learning or teaching of romance languages ( preference given to language acquisition in instructed settings ) . abstracts must be received by march 16 , 1998 . participants must be mla members by april 1 , 1998 . abstracts or inquiries to : jeffrey reeder dept . of modern foreign languages baylor university , box 97393 waco , tx 78798 fax : ( 254 ) 710-3799 e-mail : jeffrey _ reeder @ baylor . edu\n')]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "colab_type": "text",
        "id": "oUKltxq5keqw"
      },
      "cell_type": "markdown",
      "source": [
        "## Task 2) Tokenize and remove punctuation\n",
        "\n",
        "Now we need to split the words, a process called *tokenization* by linguists, and remove punctuation. \n",
        "\n",
        "We will use the Python [Natural Language Toolkit](http://www.nltk.org) *NLTK* to do the tokenization (rather than splitting ourselves, as these specialist tools usually do that better than we can ourselves). We use the NLTK function word_tokenize, see here for a code example: [http://www.nltk.org/book/ch03.html](http://www.nltk.org/book/ch03.html). 5%\n",
        "\n",
        "Then we will remove punctuation. There is no specific funtion for this, so we use a regular expression (see here for info [https://docs.python.org/3/library/re.html?highlight=re#module-re](https://docs.python.org/3/library/re.html?highlight=re#module-re)) in a list comprehension (here's a nice visual explanation: [http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/](http://treyhunner.com/2015/12/python-list-comprehensions-now-in-color/)). 5% \n",
        "\n",
        "We use a new technique here: we separate keys and values of the RDD, using the RDD functions `keys()` and `values()`, which yield each a new RDD. Then we process the values and *zip* them together with the keys again. See here for documentation: [http://spark.apache.org/docs/2.4.0/api/python/pyspark.html#pyspark.RDD.zip](http://spark.apache.org/docs/2.4.0/api/python/pyspark.html#pyspark.RDD.zip).  We wrap the whole sequence into one function `prepareTokenRDD` for later use. 5%"
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "id": "AB_nfmhYkeqx",
        "scrolled": true,
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 2227
        },
        "outputId": "b2230496-d78b-413f-8cbf-7291417e0f1f"
      },
      "cell_type": "code",
      "source": [
        "import nltk\n",
        "import re\n",
        "from nltk.corpus import stopwords\n",
        "\n",
        "def tokenize(text):\n",
        "    """ Apply the nltk.word_tokenize() method to our text, return the token list. """\n",
        "    nltk.download('punkt') # this loads the standard NLTK tokenizer model \n",
        "    # it is important that this is done here in the function, as it needs to be done on every worker.\n",
        "    # If we do the download outside a this function, it would only be executed on the driver     \n",
        "#>>>    return ... # use the nltk function word_tokenize\n",
        "    return nltk.word_tokenize(text)\n",
        "    \n",
        "def removePunctuation(tokens):\n",
        "    """ Remove punctuation characters from all tokens in a provided list. """\n",
        "    # this will remove all punctiation from string s: re.sub('[(),.?!";_]','',s)\n",
        "#>>>    tokens2 =  [...] # use a list comprehension to remove punctuaton\n",
        "    tokens2 = [re.sub('[(),.?!"@:>>> now remove any empty strings (i.e. length 0) that we may have \n",
        "    # created by removing punctuation, and resulting entries without words left.\n",
        "    rdd_kvr = rdd_kv.filter(lambda x: len(x) > 0) # remove empty strings using RDD.map and a lambda. TIP len(s) gives you the lenght of string. \n",
        "    rdd_kvrf = rdd_kvr.filter(lambda x: x[1] is not None) # remove items without tokens using RDD.filter and a lambda. \n",
        "    # >>> Question: why should this be filtering done after zipping the keys and values together?\n",
        "    # Keep the order of the tokens in the RDD.\n",
        "    #list_of_lists = rdd_kv.map(lambda r: [r[1]]).collect() # converted from an RDD object to a python List\n",
        "    #final_list =  list_of_lists[1][0] # get the second element of the previous list and\n",
        "    #rdd_kv = filter(None, final_list) # filters any single empty value from the python  list\n",
        "    return rdd_kvrf # returns the new version of a cleaned list\n",
        "\n",
        "rdd2 = prepareTokenRDD(rdd1) # Use a small RDD for testing.\n",
        "rdd2.take(1) # For checking result of task 2. "
      ],
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[('9-271msg1',\n",
              "  ['Subject',\n",
              "   '',\n",
              "   'l2',\n",
              "   'acquisition',\n",
              "   'of',\n",
              "   'rom',\n",
              "   'call',\n",
              "   'for',\n",
              "   'papers',\n",
              "   'special',\n",
              "   'session',\n",
              "   '',\n",
              "   'second',\n",
              "   'language',\n",
              "   'acquisition',\n",
              "   'and',\n",
              "   'the',\n",
              "   'romance',\n",
              "   'languages',\n",
              "   '1998',\n",
              "   'convention',\n",
              "   'of',\n",
              "   'the',\n",
              "   'modern',\n",
              "   'language',\n",
              "   'association',\n",
              "   '',\n",
              "   'mla',\n",
              "   '',\n",
              "   'december',\n",
              "   '2730',\n",
              "   '',\n",
              "   '1998',\n",
              "   'san',\n",
              "   'francisco',\n",
              "   '',\n",
              "   'ca',\n",
              "   'abstracts',\n",
              "   'of',\n",
              "   'no',\n",
              "   'more',\n",
              "   'than',\n",
              "   '500',\n",
              "   'words',\n",
              "   'on',\n",
              "   'any',\n",
              "   'aspect',\n",
              "   'of',\n",
              "   'second',\n",
              "   'language',\n",
              "   'acquisition',\n",
              "   'relating',\n",
              "   'to',\n",
              "   'the',\n",
              "   'learning',\n",
              "   'or',\n",
              "   'teaching',\n",
              "   'of',\n",
              "   'romance',\n",
              "   'languages',\n",
              "   '',\n",
              "   'preference',\n",
              "   'given',\n",
              "   'to',\n",
              "   'language',\n",
              "   'acquisition',\n",
              "   'in',\n",
              "   'instructed',\n",
              "   'settings',\n",
              "   '',\n",
              "   '',\n",
              "   'abstracts',\n",
              "   'must',\n",
              "   'be',\n",
              "   'received',\n",
              "   'by',\n",
              "   'march',\n",
              "   '16',\n",
              "   '',\n",
              "   '1998',\n",
              "   '',\n",
              "   'participants',\n",
              "   'must',\n",
              "   'be',\n",
              "   'mla',\n",
              "   'members',\n",
              "   'by',\n",
              "   'april',\n",
              "   '1',\n",
              "   '',\n",
              "   '1998',\n",
              "   '',\n",
              "   'abstracts',\n",
              "   'or',\n",
              "   'inquiries',\n",
              "   'to',\n",
              "   '',\n",
              "   'jeffrey',\n",
              "   'reeder',\n",
              "   'dept',\n",
              "   '',\n",
              "   'of',\n",
              "   'modern',\n",
              "   'foreign',\n",
              "   'languages',\n",
              "   'baylor',\n",
              "   'university',\n",
              "   '',\n",
              "   'box',\n",
              "   '97393',\n",
              "   'waco',\n",
              "   '',\n",
              "   'tx',\n",
              "   '78798',\n",
              "   'fax',\n",
              "   '',\n",
              "   '',\n",
              "   '254',\n",
              "   '',\n",
              "   '7103799',\n",
              "   'email',\n",
              "   '',\n",
              "   'jeffrey',\n",
              "   '',\n",
              "   'reeder',\n",
              "   '',\n",
              "   'baylor',\n",
              "   '',\n",
              "   'edu'])]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 8
        }
      ]
    },
    {
      "metadata": {
        "colab_type": "text",
        "id": "q9HgRAEGkeq0"
      },
      "cell_type": "markdown",
      "source": [
        "## Task 3) Creating normalised TF.IDF vectors of defined dimensionality, measure the effect of caching.\n",
        "\n",
        "We use the hashing trick to create fixed size TF vectors directly from the word list now (slightly different from the previous lab, where we used *(word,count)* pairs.). Write a bit of code as needed. (5%)\n",
        "\n",
        "Then we'll use the IDF and Normalizer functions provided by Spark. They use a slightly different pattern than RDD.map and reduce, have a look at the examples here in the documentation for Normalizer  and IDF:\n",
        "[http://spark.apache.org/docs/2.1.0/api/python/pyspark.mllib.html#pyspark.mllib.feature.Normalizer](http://spark.apache.org/docs/2.1.0/api/python/pyspark.mllib.html#pyspark.mllib.feature.Normalizer), [http://spark.apache.org/docs/2.1.0/api/python/pyspark.mllib.html#pyspark.mllib.feature.IDF](http://spark.apache.org/docs/2.1.0/api/python/pyspark.mllib.html#pyspark.mllib.feature.IDF) (5%)\n",
        "\n",
        "We want control of the dimensionality in the `normTFIDF` function, so we introduce an argument into our functions that enables us to vary dimensionalty later. Here is also an opportunity to benefit from caching, i.e. persisting the RDD after use, so that it will not be recomputed.  (5%)"
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "id": "Qo_YQKAokeq0",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        },
        "outputId": "2b3b9c89-1c79-479f-e60a-ffe3f09c1866"
      },
      "cell_type": "code",
      "source": [
        "from pyspark import StorageLevel\n",
        "\n",
        "# use the hashing trick to create a fixed-size vector from a word list\n",
        "def hashing_vectorize(text,N): # arguments: the list and the size of the output vector\n",
        "    v = [0] * N  # create vector of 0s\n",
        "    for word in text: # iterate through the words \n",
        "#>>>    hash_value = hash(word)%N        # get the hash value \n",
        "        hash_value = hash(word)%N\n",
        "#>>>    v = v[hash_value]+1          # add 1 at the hashed address\n",
        "        v[hash_value] = v[hash_value]+1\n",
        "    return v # return hashed word vector\n",
        "\n",
        "from pyspark.mllib.feature import IDF, Normalizer\n",
        "\n",
        "def normTFIDF(fn_tokens_RDD, vecDim, caching=True):\n",
        "    keysRDD = fn_tokens_RDD.keys()\n",
        "    tokensRDD = fn_tokens_RDD.values()\n",
        "    tfVecRDD = tokensRDD.map(lambda tokens: hashing_vectorize(tokens,vecDim)) #>>> passing the vecDim value. TIP: you need a lambda. \n",
        "    if caching:\n",
        "        tfVecRDD.persist(StorageLevel.MEMORY_ONLY) # since we will read more than once, caching in Memory will make things quicker.\n",
        "    idf = IDF() # create IDF object\n",
        "    idfModel = idf.fit(tfVecRDD) # calculate IDF values\n",
        "    tfIdfRDD = idfModel.transform(tfVecRDD) # 2nd pass needed (see lecture slides), transforms RDD\n",
        "#>>>    norm =  # create a Normalizer object like in the example linked above\n",
        "    norm = Normalizer()\n",
        "#>>>    normTfIdfRDD = norm. ... # and apply it to the tfIdfRDD \n",
        "    normTfIdfRDD = norm.transform(tfIdfRDD)\n",
        "#>>>    zippedRDD = ... # zip the keys and values together\n",
        "    zippedRDD = keysRDD.zip(normTfIdfRDD)\n",
        "    return zippedRDD\n",
        "\n",
        "testDim = 10 # too small for good accuracy, but OK for testing\n",
        "rdd3 = normTFIDF(rdd2, testDim, True) # test our\n",
        "print(rdd3.take(1)) # we should now have tuples with ('filename',[N-dim vector])\n",
        "# e.g. [('9-1142msg1', DenseVector([0.0, 0.0, 0.0, 0.0, 0.4097, 0.0, 0.0, 0.0, 0.9122, 0.0]))]"
      ],
      "execution_count": 13,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[('9-271msg1', DenseVector([0.0, 0.3061, 0.0, 0.334, 0.0, 0.0, 0.167, 0.0, 0.754, 0.4453]))]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "colab_type": "text",
        "id": "datJXbSNkeq2"
      },
      "cell_type": "markdown",
      "source": [
        "### Task 3a) Caching experiment\n",
        "\n",
        "The normTFIDF lets us switch caching on or off. Write a bit of code that measures the effect of caching by takes the time for both options. Use the time function as shown in lecture 3, slide 47. Remember that you need to call an action on an RDD to trigger full execution. \n",
        "\n",
        "Add a short comment on the result (why is there an effect, why of the size that it is?). Remember that this is wall clock time, i.e. you may get noisy results and they may change depending on the system state (e.g. how often this test has been run). (10%)"
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "id": "tdzLL4A_keq3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        },
        "outputId": "83728d81-2a64-4c0e-9525-5f63a16dda42"
      },
      "cell_type": "code",
      "source": [
        "#run a small experiment with caching set to True or False, 3 times each\n",
        "\n",
        "from time import time\n",
        "\n",
        "resCaching = [] # for storing results\n",
        "resNoCache = [] # for storing results\n",
        "for i in range(3): # 3 samples\n",
        "#>>>  # start timer\n",
        "    startTime = time()\n",
        "    testRDD1 = normTFIDF(rdd2, testDim, True) # \n",
        "#>>>     # call an action on the RDD to trigger execution\n",
        "    testRDD1.take(5)\n",
        "#>>>  # end timer\n",
        "    endTime = time()\n",
        "    resCaching.append( endTime - startTime ) # calculate the time spent\n",
        "    \n",
        "for i in range(3): # 3 samples\n",
        "#>>>  # start timer\n",
        "    startTime = time()\n",
        "    testRDD2 = normTFIDF(rdd2, testDim, False) \n",
        "#>>>  # call an action on the RDD to trigger execution\n",
        "    testRDD2.take(5)\n",
        "#>>>  # end timer\n",
        "    endTime = time()\n",
        "    resNoCache.append( endTime - startTime ) # calculate the time spent\n",
        "    \n",
        "#>>> meanTimeCaching = # calculate average times\n",
        "    meanTimeCaching = sum(resCaching) / float(len(resCaching))\n",
        "#>>> meanTimeNoCache = # calculate average times\n",
        "    meanTimeNoCache = sum(resNoCache) / float(len(resNoCache))\n",
        "\n",
        "print('Creating TF.IDF vectors, 3 trials - mean time with caching: ', meanTimeCaching, ', mean time without caching: ', meanTimeNoCache)\n",
        "#>>> # add your results and comments here "
      ],
      "execution_count": 17,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Creating TF.IDF vectors, 3 trials - mean time with caching:  10.951113859812418 , mean time without caching:  15.147806485493978\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "metadata": {
        "colab_type": "text",
        "id": "f-xzB6hrkeq5"
      },
      "cell_type": "markdown",
      "source": [
        "## Task 4) Create LabeledPoints \n",
        "\n",
        "Determine whether the file is spam (i.e. the filename contains ’spmsg’) and replace the filename by a 1 (spam) or 0 (non-spam) accordingly. Use `RDD.map()` to create an RDD of LabeledPoint objects. See here [http://spark.apache.org/docs/2.1.0/mllib-linear-methods.html#logistic-regression](http://spark.apache.org/docs/2.1.0/mllib-linear-methods.html#logistic-regression) for an example, and here [http://spark.apache.org/docs/2.1.0/api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint](http://spark.apache.org/docs/2.1.0/api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) for the `LabeledPoint` documentation. (10%)\n",
        "\n",
        "There is a handy function of Python strings called startswith: e.g. 'abc'.startswith('ab) will return true. The relevant Python syntax here is a conditional expression: **`` if  else ``**, i.e. 1 if the filename starts with 'spmsg' and otherwise 0."
      ]
    },
    {
      "metadata": {
        "colab_type": "code",
        "id": "wF9BDmnEkeq6",
        "pixiedust": {
          "displayParams": {
            "handlerId": "tableView"
          }
        },
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        },
        "outputId": "81a7b761-79bc-42a7-d851-01debb9a08b8"
      },
      "cell_type": "code",
      "source": [
        "from pyspark.mllib.regression import LabeledPoint\n",
        "\n",
        "# creatate labelled points of vector size N out of an RDD with normalised (filename [(word,count), ...]) items\n",
        "def makeLabeledPoints(fn_vec_RDD): # RDD and N needed \n",
        "    # we determine the true class as encoded in the filename and represent as 1 (samp) or 0 (good)\n",
        "#>>>    cls_vec_RDD = fn_vec_RDD.map(lambda x : True if fn_vec_RDD.startswith('spmsg') else False  ) # use a conditional expression to get the class label (True or False)\n",
        "    cls_vec_RDD = fn_vec_RDD.map(lambda x : (1,

no need reference, and no word count, only coding

Solution

Answer To This Question Is Available To Download

Related Questions & Answers

Submit New Assignment