# Apache Spark: Fit a Binary Logistic Regression Model to a Dataset

Dataset: Dropbox link for baby-names

Output: Jupyter Notebook (please display the output)

Requirements for exercise:

1. Build a Classification Model: In this exercise, you will fit a binary logistic regression model to the baby name dataset you used in the previous exercise. This model will predict the sex of a person based on their age, name, and the state they were born in. To train the model, you will use the data found in baby-names/names-classifier.

2. Prepare the Input Features: First, you will need to prepare each of the input features. While age is a numeric feature, state and name are not. These need to be converted into numeric vectors before you can train the model. Use a StringIndexer along with the OneHotEncoderEstimator to convert the name, state, and sex columns into numeric vectors. Use the VectorAssembler to combine the name, state, and age vectors into a single features vector. Your final dataset should contain a column called features containing the prepared vector and a column called label containing the sex of the person.

3. Fit and Evaluate the Model: Fit a logistic regression model with the following parameters: LogisticRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8). Provide the area under the ROC curve for the model.
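
For reference when interpreting the fitted model, the two regularization parameters in step 3 combine into Spark ML's elastic-net penalty. The notation is illustrative and not part of the assignment text: $w$ is the coefficient vector, $L(w)$ the logistic loss averaged over the training rows, $\lambda$ is `regParam`, and $\alpha$ is `elasticNetParam`.

$$
\min_{w} \; L(w) + \lambda \left( \alpha \lVert w \rVert_1 + \frac{1-\alpha}{2} \lVert w \rVert_2^2 \right)
$$

With $\lambda = 0.3$ and $\alpha = 0.8$ the penalty becomes

$$
0.24 \, \lVert w \rVert_1 + 0.03 \, \lVert w \rVert_2^2 ,
$$

so the L1 term dominates. A penalty of this shape tends to push many one-hot coefficients to exactly zero, which is worth checking before reading much into the reported area under the ROC curve.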

Provide insights and a summary of the model output (500 words).

## Solution

Ximi answered on Nov 03 2021
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "spark-grey.ipynb",
"provenance": []
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "code",
"metadata": {
"id": "YdjxUFOKgbPX",
"colab_type": "code",
"colab": {}
},
"source": [
"import pandas\n",
"!apt-get install openjdk-8-jdk-headless -qq > /dev/null\n",
"!wget -q http:
"!pip install -q findspark"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "TDbXbmKege-K",
"colab_type": "code",
"colab": {}
},
"source": [
"import os\n",
"os.environ[\"JAVA_HOME\"] = \"/usr/lib/jvm/java-8-openjdk-amd64\"\n",
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "g9YMz0J4g0fS",
"colab_type": "code",
"colab": {}
},
"source": [
"import findspark\n",
"findspark.init()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "SXFqUHmxoK0e",
"colab_type": "code",
"colab": {}
},
"source": [
"from pyspark.sql import SparkSession\n",
"spark = SparkSession.builder.master(\"local[*]\").getOrCreate()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Ed_uGjoWoZ58",
"colab_type": "code",
"colab": {}
},
"source": [
"sc = spark.sparkContext\n",
"from pyspark.sql import SQLContext\n",
"sqlContext = SQLContext(sc)\n"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "qMQhqdhHobWW",
"colab_type": "code",
"colab": {}
},
"source": [
"import glob\n",
"# List all *.parquet files\n",
"files = glob.glob('*.parquet')"
],
"execution_count": 0,
"outputs": []
},
{
...