
Assignment 2 - Clustering
Learning Outcomes
In this assignment, you will do the following:
· Explore a dataset and carry out clustering using the k-means algorithm
· Identify the optimum number of clusters for a given dataset
Data: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
Problem Description¶
In this assignment, you will study the electricity demand from clients in Portugal, during 2013 and 2014. You have been provided with the data file, which you should download when you download this assignment file.
The data [1] available contains 370 time series, corresponding to the electric demand [2] for 370 clients, between 2011 and 2014.
In this guided exercise, you will use clustering techniques to understand the typical usage behaviour during 2013-2014.
Both these datasets are publicly available, and can be used to carry out experiments. Their source is below:
1. Data: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014#
2. Electric Demand: http://www.think-energy.net/KWvsKWH.htm
We will start by exploring the data set and continue on to the assignment. Consider this a working notebook; you will add your work to the same notebook.
In this assignment we will use the sklearn package for k-means. Please refer to the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
The sklearn implementation of k-means is one of the many clustering algorithms found in the module "sklearn.cluster". These come with a variety of functions that you can call after importing them from the package.
For example
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import KMeans
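As a minimal sketch of how the imported `KMeans` class is typically used (the toy data and the parameter values below are assumptions chosen for illustration, not part of the assignment):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (assumption): two well-separated groups of 2-D points.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

# Fit k-means with k=2. n_init and random_state are set explicitly
# so that repeated runs give the same result.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X_toy)

print(model.cluster_centers_)  # one centroid per cluster, shape (2, 2)
print(np.bincount(labels))     # number of points assigned to each cluster
```

`fit_predict` returns one integer label per row of the input, and `cluster_centers_` holds the centroids in the same feature space as the input.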
Complete the work in the provided workbook, assignment2.ipynb. The questions are in the second half of the workbook.
Questions (15 marks total)
Q1: (7 marks)
a. Determine a convenient number of clusters and justify your choice. Use sklearn's k-means implementation for this. You may refer to the module to work out how to arrive at the optimal number of clusters.
b. Make a plot for each cluster that includes:
- The number of clients in the cluster (you can put this in the title of the plot)
- All the curves in the cluster
- The curve corresponding to the center of the cluster (make this curve thicker to distinguish it from the individual curves). The center is also sometimes referred to as the "centroid".
You may use 2 separate plots for each cluster if you prefer (one for the individual curves, one for the centroid).
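One common way to approach Q1 is the "elbow" heuristic: fit k-means for several values of k and look for the point where the within-cluster sum of squares (sklearn's `inertia_`) stops dropping sharply. The sketch below is illustrative only. The helper names `inertia_by_k` and `cluster_summary` are hypothetical, and `X_demo` is synthetic data standing in for the normalised average curves from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_by_k(X, k_values, seed=0):
    """Fit k-means for each k and return its inertia (within-cluster sum of squares)."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }

def cluster_summary(X, k, seed=0):
    """Collect, per cluster, what the requested plots need: size, member curves, centroid."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    summary = {}
    for c in range(k):
        members = X[model.labels_ == c]
        summary[c] = {
            "n_clients": len(members),
            "curves": members,
            "centroid": model.cluster_centers_[c],
        }
    return summary

# Synthetic stand-in (assumption): 3 groups of 30 "clients", 24 points per curve.
rng = np.random.default_rng(42)
t = np.linspace(0, 2 * np.pi, 24)
shapes = [np.sin(t), np.cos(t), np.ones_like(t)]
X_demo = np.vstack([s + rng.normal(scale=0.05, size=(30, 24)) for s in shapes])

inertias = inertia_by_k(X_demo, range(1, 7))
summary = cluster_summary(X_demo, k=3)

for k, value in inertias.items():
    print(k, round(value, 2))
```

To draw the plots, each row of `curves` can be passed to matplotlib's `plt.plot` with a thin line and `centroid` with a larger `linewidth`, with `n_clients` placed in the title.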
Q2: (8 marks)
In this exercise you work with the daily curves of a single client. First, create a list of arrays, each array containing the curve for one day. You may use X from the cells above:
X = average_curves_norm.copy()
The list contains 730 arrays, one for each of the days of 2013 and 2014.
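A minimal sketch of building such a list, assuming one reading every 15 minutes (96 points per day) and a single client's readings for 2013-2014 stored in time order. The name `client_series` and its random contents are hypothetical placeholders, not data from the notebook.

```python
import numpy as np

POINTS_PER_DAY = 96   # assumption: one reading every 15 minutes
N_DAYS = 730          # 365 days in 2013 + 365 days in 2014

# Hypothetical stand-in for one client's 2013-2014 readings, in time order.
rng = np.random.default_rng(1)
client_series = rng.random(N_DAYS * POINTS_PER_DAY)

# Split the long series into consecutive day-length chunks.
daily_curves = [
    client_series[d * POINTS_PER_DAY:(d + 1) * POINTS_PER_DAY]
    for d in range(N_DAYS)
]

# Stack into a (730, 96) matrix: one row per day, the layout KMeans expects.
X_days = np.vstack(daily_curves)
print(len(daily_curves), X_days.shape)
```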
a. Determine the optimal value of k (number of clusters). This time you may also perform silhouette analysis as stated in the module. Carrying out silhouette analysis is left as an exercise. What do you understand about the clusters?
b. Based on the results of your analyses with both methods, what do you understand? Interpret them, perhaps from different timeline perspectives such as weeks or months.
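Silhouette analysis can be sketched as follows: for each candidate k, fit k-means and compute the mean silhouette coefficient with `sklearn.metrics.silhouette_score`, then prefer the k with the highest score. The helper `silhouette_by_k` is a hypothetical name, and the two synthetic day shapes ("weekday" and "weekend") are assumptions made only to give the example some structure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def silhouette_by_k(X, k_values, seed=0):
    """Return the mean silhouette coefficient for each k (k must be at least 2)."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores

# Synthetic daily curves (assumption): two distinct day shapes plus small noise.
rng = np.random.default_rng(7)
t = np.linspace(0, 1, 96)
weekday = np.exp(-((t - 0.75) ** 2) / 0.01)  # narrow evening peak
weekend = np.exp(-((t - 0.5) ** 2) / 0.05)   # broad midday peak
X_syn = np.vstack([
    weekday + rng.normal(scale=0.05, size=(50, 96)),
    weekend + rng.normal(scale=0.05, size=(20, 96)),
])

scores = silhouette_by_k(X_syn, range(2, 6))
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On real data the scores are rarely this clear-cut, so it helps to compare the silhouette result with the inertia curve and then check which calendar days (weekdays, weekends, particular months) fall into each cluster.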
Answered Same Day Jun 22, 2021

Solution

Sandeep Kumar answered on Jun 26 2021
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "D9UboDIvnAKI"
},
"source": [
"## Assignment 2 - Clustering"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "5AibtmcInAKK"
},
"source": [
"## Learning Outcomes\n",
"\n",
"In this assignment, you will do the following:\n",
"\n",
"* Explore a dataset and carry out clustering using k-means algorithm\n",
"\n",
"* Identify the optimum number of clusters for a given dataset\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "TJBMBFfAnAKK"
},
"source": [
"## Problem Description\n",
"\n",
"In this assignment, you will study the electricity demand from clients in Portugal, during 2013 and 2014. You have been provided with the data file, which you should download when you download this assignment file.\n",
"\n",
"The data$^1$ available contains 370 time series, corresponding to the electric demand$^2$ for 370 clients, between 2011 and 2014. \n",
"\n",
"In this guided exercise, you will use clustering techniques to understand the typical usage behaviour during 2013-2014.\n",
"\n",
"Both these datasets are publicly available, and can be used to carry out experiments. Their source is below:\n",
"\n",
" 1. Data:\n",
"https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014#\n",
"\n",
" 2. Electric Demand:\n",
"http://www.think-energy.net/KWvsKWH.htm\n",
"\n",
"We will start by exploring the data set and continue on to the assignment. Consider this as a working notebook, you will add your work to the same notebook.\n",
"\n",
"In this assignment we will use the sklearn package for k-means. Please refer here for the documentation https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html\n",
"(https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).\n",
"\n",
"The sklearn package for k-means is one of the many clustering algorithms found in the module \"sklearn.cluster\". These come with a variety of functions that you can call by importing the package.\n",
"\n",
"For example \n",
" \n",
"    from sklearn.cluster import AgglomerativeClustering\n",
"    from sklearn.cluster import KMeans\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "u0fHlteBnAKL"
},
"source": [
"## Data Preparation\n",
"\n",
"Start by downloading the data to a local directory and modify the \"pathToFile\" and \"fileName\" variables, if needed. The data file has been provided with this assignment. It is also available at the links provided above."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "0DJsRL9_nAKM"
},
"outputs": [],
"source": [
"pathToFile = r\"\"\n",
"#pathToFile = r\"C:\\\\Users\\\\\\\\Downloads\\\\\"\n",
"\n",
"fileName = 'LD2011_2014.txt'"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "yLxHF5B-nAKP"
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.cluster import KMeans\n",
"import matplotlib.pyplot as plt\n",
"import random\n",
"from sklearn.metrics import silhouette_score\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"random.seed(42)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "6c6CmGbYnAKR"
},
"outputs": [],
"source": [
"# Replace \",\" by \".\", otherwise the numbers will be in the form 2,3445 instead of 2.3445\n",
"import fileinput\n",
"\n",
"with fileinput.FileInput(pathToFile+fileName, inplace=True, backup='.bak') as file:\n",
"    for line in file:\n",
"        print(line.replace(\",\", \".\"), end='')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "ACTUfls8nAKU"
},
"outputs": [],
"source": [
"# Create dataframe\n",
"import pandas as pd\n",
"data = pd.read_csv(pathToFile+fileName, sep=\";\", index_col=0)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "CfULOBctnAKW"
},
"source": [
"### Quick data inspection"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "x3OHI8vRnAKX",
"outputId": "c821694f-6e8f-48ad-ff42-336915cb4da4"
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"