
Apache Spark: Fit a Binary Logistic Regression Model to a Dataset

Dataset: Dropbox link for baby-names

Output: Jupyter Notebook (please display the output)

Requirements for exercise:

1. Build a Classification Model: In this exercise, you will fit a binary logistic regression model to the baby-names dataset you used in the previous exercise. The model will predict a person's sex based on their age, name, and the state they were born in. To train the model, use the data found in baby-names/names-classifier.

2. Prepare the Input Features: First, you will need to prepare each of the input features. While age is a numeric feature, state and name are not; these must be converted into numeric vectors before you can train the model. Use a StringIndexer along with the OneHotEncoderEstimator to convert the name, state, and sex columns into numeric vectors, then use the VectorAssembler to combine the name, state, and age vectors into a single features vector. Your final dataset should contain a column called features holding the prepared vector and a column called label holding the sex of the person.

3. Fit and Evaluate the Model: Fit a logistic regression model with the following parameters: LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8). Report the area under the ROC curve for the model.

Provide insights and a summary of the model output (500 words).

Answered Same Day Oct 31, 2021

Solution

Ximi answered on Nov 03 2021
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "spark-grey.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "code",
"metadata": {
"id": "YdjxUFOKgbPX",
"colab_type": "code",
"colab": {}
},
"source": [
"import pandas\n",
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"!wget -q http://www-eu.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz\n",
"!tar xvf spark-2.4.4-bin-hadoop2.7.tgz\n",
"!pip install -q findspark"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "TDbXbmKege-K",
"colab_type": "code",
"colab": {}
},
"source": [
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
"os.environ[\"SPARK_HOME\"] = \"/content/spark-2.4.4-bin-hadoop2.7\""
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "g9YMz0J4g0fS",
"colab_type": "code",
"colab": {}
},
"source": [
"import findspark\n",
"findspark.init()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "SXFqUHmxoK0e",
"colab_type": "code",
"colab": {}
},
"source": [
"from pyspark.sql import SparkSession\n",
"spark = SparkSession.builder.master(\"local[*]\").getOrCreate()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Ed_uGjoWoZ58",
"colab_type": "code",
"colab": {}
},
"source": [
"sc = spark.sparkContext\n",
"from pyspark.sql import SQLContext\n",
"sqlContext = SQLContext(sc)\n"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "qMQhqdhHobWW",
"colab_type": "code",
"colab": {}
},
"source": [
"import glob\n",
"# List all *.parquet files\n",
"files = glob.glob('*.parquet')"
],
"execution_count": 0,
"outputs": []
},
{
...