3. Data Engineering
https://www.oreilly.com/content/data-engineering-a-quick-and-simple-definition
Data engineers are responsible for finding trends in data sets and developing algorithms to help make raw data more useful to the enterprise.
"initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data"
"These characteristics can include size or amount of data, completeness of the data, correctness of the data, possible relationships amongst data elements or files/tables in the data"
Data Exploration
Data Exploration or Exploratory Data Analysis (EDA) is used to:
Answer questions related to the data, test data assumptions, and generate hypotheses for further analysis.
Prepare the data for modeling.
Gain a deep understanding of your data so you can answer questions about it.
Build insights into your data sets.
Help interpret the results of later modeling.
Data exploration is an approach similar to initial data analysis, whereby a data analyst uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems.
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, and handling missing values and making transformations of variables as needed. EDA encompasses IDA.
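A minimal sketch of a first exploratory pass with pandas; the file name data.csv and its columns are assumptions for illustration:

import pandas as pd

df = pd.read_csv("data.csv")

# First look at the data: size, column types, missing values and summary statistics
print(df.shape)
df.info()
print(df.isna().sum())
print(df.describe())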
Data Exploration Steps
3.1 Data Process
Decide the approaches and steps for deriving the raw, training, validation and test datasets so that the models can meet the project requirements.
https://en.wikipedia.org/wiki/Data_modeling
3.2 Data Collection
Define the sources, parameters and quantity of raw datasets; collect necessary and sufficient raw datasets; present samples from raw datasets.
Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. It is important that the data sources available are trustworthy and well-built so the data collected (and later used as information) is of the highest possible quality.
Data Exploration Steps
3.3 Data Pre-processing
Pre-process collected raw data with cleaning and validation tools; present samples from pre-processed datasets.
removal of noise and outliers
collecting necessary information to model or account for noise
handling of missing data
https://serokell.io/blog/data-preprocessing
The binning method is used to smooth data or to handle noisy data. In this method, the data is first sorted and then the sorted values are distributed into a number of buckets or bins. As binning methods consult the neighborhood of values, they perform local smoothing.
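A minimal sketch of smoothing by bin means with pandas; the example values and the number of bins are invented for illustration:

import pandas as pd

prices = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], name="price")

# Distribute the sorted values into equal-frequency bins, then replace each value
# with the mean of its bin (local smoothing that consults neighboring values)
bins = pd.qcut(prices, q=3, labels=False)
smoothed = prices.groupby(bins).transform("mean")
print(pd.DataFrame({"original": prices, "bin": bins, "smoothed": smoothed}))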
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables.
Data Exploration Steps
3.4 Data Transformation
Transform pre-processed datasets to desired formats with tools and scripts; present samples from transformed datasets.
Normalization helps you to scale the data within a range
Feature selection is the selection of variables in data that are the best predictors for the variable we want to predict.
Discretization: transforms the data into sets of small intervals.
Hierarchy generation: generates a hierarchy between the attributes where one was not specified.
https://serokell.io/blog/data-preprocessing
Data discretization is defined as a process of converting continuous data attribute values into a finite set of intervals and associating with each interval some specific data value. ... If discretization leads to an unreasonably small number of data intervals, then it may result in significant information loss.
Concept hierarchy generation based on the number of distinct values per attribute. Suppose a user selects a set of location-oriented attributes (street, country, province_or_state, and city) from the AllElectronics database, but does not specify the hierarchical ordering among the attributes.
Normalization helps you to scale the data within a range to avoid building incorrect ML models while training and/or executing data analysis. If the data range is very wide, it will be hard to compare the figures. With various normalization techniques, you can transform the original data linearly, perform decimal scaling or Z-score normalization.
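A minimal sketch of discretization and min-max normalization with pandas; the column name age, the interval edges and the labels are assumptions for illustration:

import pandas as pd

df = pd.DataFrame({"age": [18, 22, 35, 47, 52, 68]})

# Discretization: convert the continuous attribute into a finite set of intervals
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 100], labels=["young", "middle", "senior"])

# Min-max normalization: linearly rescale the original values into the [0, 1] range
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
print(df)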
Data Exploration Steps
3.5 Data Preparation
Prepare training, validation and test datasets from transformed datasets; present samples from training, validation and test datasets.
https://algotrading101.com/learn/train-test-split
Data which we use to design our models (Training set)
Data which we use to refine our models (Validation set)
Data which we use to test our models (Testing set)
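A minimal sketch of deriving the three datasets with scikit-learn's train_test_split; the toy arrays and the 60/20/20 proportions are assumptions for illustration:

import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix and labels, invented for illustration
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# First hold out a test set, then split the remainder into training and validation sets
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
# Result: 60% training, 20% validation, 20% test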
Data Exploration Steps
3.6 Data Statistics
Summarize the progressive results of deriving the raw, pre-processed, transformed and prepared datasets; present the statistics in visualization formats.
https://www.tableau.com/learn/articles/data-visualization
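A minimal sketch of summarizing and visualizing a prepared dataset, assuming pandas and matplotlib are available; the file name data.csv and the column age are hypothetical:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")

# Descriptive statistics (count, mean, std, min, quartiles, max) for every numeric column
print(df.describe())

# Visualize the distribution of one column as a histogram
df["age"].hist(bins=20)
plt.xlabel("age")
plt.ylabel("count")
plt.show()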
Knowledge Discovery Process
https://link.springer.com/chapter/10.1007/XXXXXXXXXX5_2
Data preprocessing
Applying domain knowledge of the data to create new features that allow ML algorithms to work better.
Adapted from Andrew Ferlitsch's slides
Steps:
Import the data
Clean the data (Data Wrangling)
Replace Missing Values
Categorical Value Conversion
Feature Scaling
Importing the Dataset - Python
import pandas as pd
dataset = pd.read_csv( "data.csv" )  # load the CSV file into a pandas DataFrame
Cleaning the Data
It is not uncommon for datasets to have some dirty data entries (i.e., samples, rows in CSV file, ...)
Common Problems
Bad Character Encodings (Funny Characters)
Misaligned Data (e.g., row has too few/many columns)
Data in wrong format.
Data Wrangling is an expertise/occupation all its own.
Common Practices in Data Wrangling
Know the character encoding of the data file and intended character encoding of the data.
Convert the data encoding format of the file if necessary. e.g., Notepad++ -> Encodings
Know the data format of the source and expected data format.
Convert the data format using a batch preprocessing file. e.g., XXXXXXXXXX -> 1,000,000
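A minimal sketch of handling character encoding and number formats while loading data with pandas; the file name, the latin-1 encoding and the thousands separator are assumptions for illustration:

import pandas as pd

# Tell pandas the file's character encoding and the thousands separator,
# so a value written as "1,000,000" is parsed as the number 1000000
dataset = pd.read_csv("data.csv", encoding="latin-1", thousands=",")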
Replace Missing Values
Not unusual for samples (rows) to contain missing (blank) entries, or not a number (NaN).
Blank/NaN entries do not work for Machine Learning
Need to replace the blank/NaN entry with something meaningful.
Delete the rows (generally not desirable)
Replace with a Single Value
Mean Average
Multivariate Imputation by Chained Equations (MICE)
Missing Values - Mean Value
scikit-learn class for handling missing data
from sklearn.preprocessing import Imputer  # scikit-learn module (replaced by SimpleImputer in newer versions)

# Create imputer object to replace NaN values with the mean value of the column
imputer = Imputer( missing_values="NaN", strategy="mean" )

# Fit the imputer to column 2 of the original dataset (index starts at 0; select all rows)
imputer = imputer.fit( dataset[ :, 2:3 ] )

# Do the replacement and update the dataset (must be the same columns that were fitted)
dataset[ :, 2:3 ] = imputer.transform( dataset[ :, 2:3 ] )
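In scikit-learn 0.22 and later the Imputer class has been removed; a minimal equivalent sketch with SimpleImputer, using a small invented array:

import numpy as np
from sklearn.impute import SimpleImputer

# Toy dataset with a missing Income value in column 2
dataset = np.array([[25.0, 0.0, 25000.0],
                    [26.0, 1.0, np.nan],
                    [30.0, 0.0, 45000.0]])

# Replace NaN entries in column 2 with the mean of that column
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
dataset[:, 2:3] = imputer.fit_transform(dataset[:, 2:3])
print(dataset)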
Age   Gender   Income
25    Male     25000
26    Female   22000
30    Male     45000
24    Female   26000

Age and Gender are the independent variables (features); Income is the dependent variable (the label, the value to predict). Age and Income hold real values, while Gender holds categorical values.
Categorical Variables
Known in Python as OneHotEncoder.
For each categorical feature:
Scan the dataset and determine all the unique instances.
Create a new feature (i.e., dummy variable) in dataset, one per unique instance.
Remove the categorical feature from the dataset.
For each sample (row), set a 1 in the feature (dummy variable) that corresponds to that categorical value instance, and set a 0 in the remaining features (dummy variables) for that categorical field.
Remove one dummy variable field.
Dummy Variable Conversion
Dummy Variable Trap
Gender      x2 (Male)   x3 (Female)
Male        1           0
Female      0           1
Male        1           0
Female      0           1

Need to drop one dummy variable!
Multicollinearity occurs when one variable predicts another, i.e., x2 = ( 1 - x3 ).
As a result, a regression analysis cannot distinguish between the contributions of x2 and x3.
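A minimal sketch of dummy-variable conversion with pandas, dropping one dummy column to avoid the trap; the Gender column follows the example above:

import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Male", "Female"]})

# One dummy variable per unique category; drop the first one to avoid multicollinearity
dummies = pd.get_dummies(df["Gender"], drop_first=True)
print(dummies)  # a single 0/1 column is enough to encode the two categories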
Categorical Variable Conversion
scikit-learn class for categorical variable conversion
from sklearn.preprocessing import LabelEncoder, OneHotEncoder  # scikit-learn modules

# Create an encoder object to numerically (enumeration) encode categorical variables
labelEncoder = LabelEncoder()

# Fit and transform the categorical values in column 1 of the original dataset
# (index starts at 0; select all rows)
dataset[ :, 1 ] = labelEncoder.fit_transform( dataset[ :, 1 ] )

# Create an encoder to convert the numerical encodings to 1-encoded dummy variables;
# the categorical variables to convert are in column 1
onehotencoder = OneHotEncoder( categorical_features = [ 1 ] )

# Replace the encoded categorical values with the dummy variables
dataset = onehotencoder.fit_transform( dataset )  # dataset with converted categorical variables
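Newer scikit-learn versions no longer support the categorical_features argument; a minimal equivalent sketch with ColumnTransformer, assuming the categorical feature sits in column 1 of a small example array:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy dataset: Age, Gender, Income (values invented for illustration)
dataset = np.array([[25, "Male", 25000],
                    [26, "Female", 22000],
                    [30, "Male", 45000]], dtype=object)

# One-hot encode column 1 and pass the remaining columns through unchanged
ct = ColumnTransformer([("gender", OneHotEncoder(), [1])], remainder="passthrough")
dataset = ct.fit_transform(dataset)
print(dataset)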
If features do not have the same numerical scale of values, it will cause issues in training a model.
If the scale of one independent variable (feature) is greater than another independent variable, the model will give more importance (skew) to the independent variable with the larger range.
To eliminate this problem, one converts all the independent variables to use the same scale.
Feature Scaling:
Normalization ( 0 to 1 )
Standardization ( -1 to 1 )
Decision trees and random forests do not need feature scaling.
Scaling Issue - Euclidean Distance

Most machine learning models use the Euclidean distance between two points in 2D Cartesian space:

d = sqrt( (x2 - x1)^2 + (y2 - y1)^2 )

Given two independent variables (x1 = Age, x2 = Income) and a dependent variable (y = spending), the distance between two samples (rows) i and j becomes:

d(i, j) = sqrt( (x1_i - x1_j)^2 + (x2_i - x2_j)^2 )

If x1 or x2 is on a substantially greater scale than the other, the corresponding independent variable will dominate the result and will contribute more to the model. This is especially true for gradient descent algorithms.
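A quick worked example of the scaling issue, reusing values from the earlier sample table:

Sample i: Age = 25, Income = 25000; sample j: Age = 30, Income = 45000
d(i, j) = sqrt( (25 - 30)^2 + (25000 - 45000)^2 ) = sqrt( 25 + 400000000 ) ≈ 20000

The Age difference contributes almost nothing to the distance; after scaling both features to a comparable range, both differences influence the result.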
Normalization or Standardization

Feature Scaling means scaling features to the same scale.
Normalization scales features between 0 and 1, retaining their proportional range to each other (min-max scaling). With X the original value and X' the new value:

X' = ( X - min(X) ) / ( max(X) - min(X) )

Standardization scales features to have a mean (μ) of 0 and a standard deviation (σ) of 1:

X' = ( X - μ ) / σ
Feature Scaling in Python
from sklearn.preprocessing import StandardScaler  # scikit-learn class for Feature Scaling

# Create a scaling object to scale the features
scale = StandardScaler()

# Fit the data to the scaling object and transform the data:
# feature scale all the variables except the last column (y or label)
dataset[ :, :-1 ] = scale.fit_transform( dataset[ :, :-1 ] )
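A minimal sketch of the normalization alternative with MinMaxScaler, using a small invented feature array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[25.0, 25000.0],
              [26.0, 22000.0],
              [30.0, 45000.0]])

# Rescale every feature to the [0, 1] range (min-max scaling)
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled)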
Correlation Heatmap
Correlation states how the features are related to each other or to the target variable.
Correlation can be positive (an increase in one feature value increases the value of the target variable) or negative (an increase in one feature value decreases the value of the target variable).
A heatmap makes it easy to identify which features are most related to the target variable; we will plot a heatmap of correlated features using the seaborn library: sns.heatmap(df_new.corr())
https://heartbeat.fritz.ai/seaborn-heatmaps-13-ways-to-customize-correlation-matrix-visualizations-f1c49c816f07
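A minimal runnable sketch of the correlation heatmap, assuming pandas, seaborn and matplotlib are installed; the DataFrame df_new and its columns are invented to stand in for the report's data:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical numeric dataset standing in for df_new
df_new = pd.DataFrame({"age": [25, 26, 30, 24],
                       "income": [25000, 22000, 45000, 26000],
                       "spending": [2000, 1800, 3500, 2100]})

# Plot the correlation matrix of all numeric columns as an annotated heatmap
sns.heatmap(df_new.corr(), annot=True, cmap="coolwarm")
plt.show()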
What to Submit?
Complete DataExp for your research report
Submit your own Data Exploration on your research paper.