
Apache Spark: Fit a Binary Logistic Regression Model to a Dataset

Dataset: Dropbox link for baby-names

Output: Jupyter Notebook (please display the output)

Requirements for exercise:

1. Build a Classification Model: In this exercise, you will fit a binary logistic regression model to the baby-names dataset you used in the previous exercise. The model will predict a person's sex based on their age, name, and the state they were born in. To train the model, use the data found in baby-names/names-classifier.

2. Prepare the Input Features: First, you will need to prepare each of the input features. While age is a numeric feature, state and name are not; these must be converted into numeric vectors before you can train the model. Use a StringIndexer along with the OneHotEncoderEstimator to convert the name, state, and sex columns into numeric vectors, then use the VectorAssembler to combine the name, state, and age vectors into a single features vector. Your final dataset should contain a column called features holding the prepared vector and a column called label holding the sex of the person.

3. Fit and Evaluate the Model: Fit a logistic regression model with the following parameters: LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8). Report the area under the ROC curve for the model.

Provide insights and a summary of the model output (500 words).

Answered Same Day Oct 31, 2021

Solution

Ximi answered on Nov 03 2021
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "spark-grey.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "code",
"metadata": {
"id": "YdjxUFOKgbPX",
"colab_type": "code",
"colab": {}
},
"source": [
"import pandas\n",
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"!wget -q http://www-eu.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz\n",
"!tar xvf spark-2.4.4-bin-hadoop2.7.tgz\n",
"!pip install -q findspark"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "TDbXbmKege-K",
"colab_type": "code",
"colab": {}
},
"source": [
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
"os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.4-bin-hadoop2.7\""
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "g9YMz0J4g0fS",
"colab_type": "code",
"colab": {}
},
"source": [
"import findspark\n",
"findspark.init()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "SXFqUHmxoK0e",
"colab_type": "code",
"colab": {}
},
"source": [
"from pyspark.sql import SparkSession\n",
"spark = SparkSession.builder.master(\"local[*]\").getOrCreate()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Ed_uGjoWoZ58",
"colab_type": "code",
"colab": {}
},
"source": [
"sc = spark.sparkContext\n",
"from pyspark.sql import SQLContext\n",
"sqlContext = SQLContext(sc)\n"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "qMQhqdhHobWW",
"colab_type": "code",
"colab": {}
},
"source": [
"import glob\n",
"# List all *.parquet files\n",
"files = glob.glob('*.parquet')"
],
"execution_count": 0,
"outputs": []
},
{
...