Assignment 2 - Clustering

Learning Outcomes

In this assignment, you will do the following:

· Explore a dataset and carry out clustering using the k-means algorithm

· Identify the optimum number of clusters for a given dataset

Data:

https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams XXXXXXXXXX

Problem Description

In this assignment, you will study the electricity demand from clients in Portugal, during 2013 and 2014. You have been provided with the data file, which you should download when you download this assignment file.

The data [1] available contains 370 time series, corresponding to the electric demand [2] for 370 clients, between 2011 and 2014.

In this guided exercise, you will use clustering techniques to understand the typical usage behaviour during XXXXXXXXXX.

Both these datasets are publicly available, and can be used to carry out experiments. Their source is below:

1. Data: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams XXXXXXXXXX#

2. Electric Demand: http://www.think-energy.net/KWvsKWH.htm

We will start by exploring the data set and continue on to the assignment. Consider this a working notebook; you will add your work to the same notebook.

In this assignment we will use the sklearn package for k-means. Please refer to the documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

K-means is one of the many clustering algorithms found in the module "sklearn.cluster". Each algorithm is provided as a class that you can import from that module.

For example

from sklearn.cluster import AgglomerativeClustering

from sklearn.cluster import KMeans
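As a rough illustration of how the imported class is used, the sketch below fits KMeans to synthetic curves. The data, the variable names (X_demo, base_shapes) and the choice of three clusters are assumptions made only for this example; they are not the assignment data or its answer.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption, not the assignment data):
# 60 curves of 24 points each, scattered around three base shapes.
rng = np.random.default_rng(0)
hours = np.arange(24)
base_shapes = np.stack([
    np.sin(hours / 24 * 2 * np.pi),
    np.cos(hours / 24 * 2 * np.pi),
    np.linspace(-1.0, 1.0, 24),
])
X_demo = np.vstack(
    [shape + 0.1 * rng.standard_normal((20, 24)) for shape in base_shapes]
)

# n_clusters=3 is used only because the toy data was built from three shapes.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X_demo)

print(labels.shape)                  # one cluster label per curve
print(model.cluster_centers_.shape)  # one centroid curve per cluster
print(model.inertia_)                # within-cluster sum of squared distances
```

The fitted attributes labels_, cluster_centers_ and inertia_ are the ones most likely to be useful in the questions that follow.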

Work is to be completed in the workbook provided (assignment2.ipynb). The questions are in the second half of the workbook.

Questions (15 marks total)

Q1: (7 marks)

a. Determine a convenient number of clusters. Justify your choice. Use sklearn's k-means implementation for this. You may refer to the module to figure out how to come up with the optimal number of clusters.

b. Make a plot for each cluster that includes:

- The number of clients in the cluster (you can put this in the title of the plot)

- All the curves in the cluster

- The curve corresponding to the center of the cluster (make this curve thicker to distinguish it from the individual curves). The center is also sometimes referred to as the "centroid".

You may have 2 separate plots for each cluster if you prefer (one for the individual curves, one for the centroid).
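One possible layout for the plots described above is sketched here, using matplotlib on synthetic curves. The data (X_demo), the value k = 2 and the axis label are placeholders assumed for illustration; the notebook's own variables and your chosen number of clusters would replace them.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in curves (an assumption): 15 of one shape, 25 of another.
rng = np.random.default_rng(1)
hours = np.arange(24)
X_demo = np.vstack([
    np.sin(hours / 24 * 2 * np.pi) + 0.1 * rng.standard_normal((15, 24)),
    np.cos(hours / 24 * 2 * np.pi) + 0.1 * rng.standard_normal((25, 24)),
])

k = 2  # placeholder value for this toy data only
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)

# One panel per cluster; squeeze=False keeps axes two-dimensional for any k.
fig, axes = plt.subplots(1, k, figsize=(5 * k, 3), squeeze=False)
cluster_sizes = []
for cluster_id, ax in enumerate(axes[0]):
    members = X_demo[model.labels_ == cluster_id]
    cluster_sizes.append(len(members))
    # Thin grey lines: every curve assigned to this cluster.
    ax.plot(hours, members.T, color="grey", linewidth=0.5, alpha=0.6)
    # Thick black line: the centroid of the cluster.
    ax.plot(hours, model.cluster_centers_[cluster_id], color="black", linewidth=3)
    ax.set_title(f"Cluster {cluster_id}: {len(members)} clients")
    ax.set_xlabel("Hour of day")
fig.tight_layout()
```

Drawing the centroid last keeps it on top of the individual curves, which makes the thicker line easier to see.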

Q2: (8 marks)

In this exercise you will work with the daily curves of a single client. First, create a list of arrays, each array containing the curve for one day. You may use X from the cells above: X = average_curves_norm.copy(). The list contains 730 arrays, one for each day of 2013 and 2014.
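The sketch below shows one way a single client's series could be split into a list of daily arrays. It assumes a hypothetical 1-D series (client_series) with a fixed number of readings per day; the value samples_per_day = 24 is an assumption and must be matched to the actual sampling interval of the data in the notebook.

```python
import numpy as np

n_days = 730          # days in 2013 and 2014 (365 + 365)
samples_per_day = 24  # assumption: one reading per hour; adjust to the real data

# Hypothetical single-client series covering both years, in time order.
rng = np.random.default_rng(2)
client_series = rng.random(n_days * samples_per_day)

# Reshape to (days, samples per day), then turn the rows into a list of arrays.
daily_curves = list(client_series.reshape(n_days, samples_per_day))

print(len(daily_curves))      # 730
print(daily_curves[0].shape)  # (24,)
```

The reshape only works when the series length is an exact multiple of samples_per_day, so missing or extra readings would need to be handled first.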

a. Determine the optimal value of k (number of clusters). This time you may also perform silhouette analysis as stated in the module. Carrying out silhouette analysis is left as an exercise. What do you understand about the clusters?

b. Based on the results of your analyses with both methods, what do you understand? Interpret the clusters from different time perspectives, such as weeks or months.
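A minimal sketch of comparing candidate values of k is given below. It assumes that the two methods are the inertia ("elbow") curve and the mean silhouette score, and it runs on synthetic 2-D blobs (X_demo) rather than the daily curves; the range of k is likewise only an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption): three well-separated 2-D blobs.
rng = np.random.default_rng(3)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_demo = np.vstack([c + 0.3 * rng.standard_normal((40, 2)) for c in centers])

inertias = {}
silhouettes = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = model.inertia_  # lower is tighter; look for an elbow
    silhouettes[k] = silhouette_score(X_demo, model.labels_)  # higher is better

best_k = max(silhouettes, key=silhouettes.get)
for k in inertias:
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")
print("k with the highest mean silhouette score:", best_k)
```

On real demand curves the two criteria may disagree, so the printed values are a starting point for the justification rather than a final answer.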