CS 221 OOP & Data Structures
Fall 2020
File name: AS1_98S1.DOC
Last revised: Thursday, 22 July 2021 at 12:33 PM
This assignment requires you to solve four programming problems and to implement your solutions in C++. You will be assessed on your final delivery.
Problems: Write your code in Visual Studio 2019.
P1. Modify the String class (N:\Class\Assignment2\String) in the following ways:
The code that goes with this question is: String.cpp, String.h, StringTest.cpp
a) Our String class always deletes the old character buffer and allocates a new character buffer on assignment or in the copy constructor. This need not be done if the new value is shorter than the current value and hence would fit the existing buffer. Rewrite the String class to only allocate a new buffer when necessary.
b) Overload the compound += operator to perform concatenation of two String objects.
c) Overload the + operator using the += operator.
d) Add a member function compare (String) that returns -1 or 0 or 1 depending on whether the string is lexicographically less than, equal to, or greater than the argument.
Provide a test program to test your class.
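A hint on parts (b)–(d): the standard idiom for part (c) is to implement operator+ in terms of operator+=, taking the left operand by value so the copy can be modified and returned. The sketch below illustrates the idiom only; it uses a stand-in class `Str` backed by std::string, not the assignment's actual String code, which manages a raw char* buffer itself.

```cpp
#include <string>

// "Str" is a stand-in for the assignment's String class, used only to
// illustrate the operator idioms; names here are not from the real code.
class Str {
    std::string data;  // stand-in buffer; the real class manages a char*
public:
    Str(const char* s = "") : data(s) {}

    // part (b): compound concatenation; return *this so calls can chain
    Str& operator+=(const Str& other) {
        data += other.data;
        return *this;
    }

    // part (d): three-way lexicographic comparison, returning -1 / 0 / 1
    int compare(const Str& other) const {
        if (data < other.data) return -1;
        if (data > other.data) return 1;
        return 0;
    }

    const std::string& str() const { return data; }
};

// part (c): take the left operand by value, reuse +=, return the copy
Str operator+(Str lhs, const Str& rhs) {
    lhs += rhs;
    return lhs;
}
```

Taking `lhs` by value means the compiler makes the copy for you, and the function body reduces to a single call to the already-tested += operator.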
P2. Write a class template for a map with a fixed type for the keys (string) and a template type parameter for the values. Use open hashing (chaining) with an array or vector of pointers to STL linked lists. Provide the following public member functions:
· Constructor
· Destructor
· Copy constructor
· Assignment operator
· Operator ==
· Function size that returns the number of key-value pairs in the map
· Function count that returns the number of elements with a specific key
· Function insert that inserts a key-value pair into the map
· Function at that returns a reference to the mapped value of the element identified with a specific key k. If k does not match the key of any element in the map, the function should throw an out_of_range exception.
· Function erase that removes a key-value pair identified with a specific key k. If k does not match the key of any element in the map, the function should throw an out_of_range exception.
· Function key_comp that returns a key comparison object: a function object comparing keys and returning true if the first argument (key) is less than the second, and false otherwise.
Provide a test program to test your template with at least two different types of values.
Hint: First implement a class for values of some fixed data type, for example integer or string, and then convert it to a class template. It makes sense to submit both the non-template and the template versions.
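To make the chaining layout concrete, here is a minimal sketch of the data structure: a vector of buckets, each bucket an STL linked list of key-value pairs. Everything here is illustrative, not a solution: it stores lists directly rather than pointers to lists as the assignment asks, uses an arbitrary bucket count, and implements only insert, size, count, and at.

```cpp
#include <functional>
#include <list>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Illustrative open-hashing map: string keys, template value type.
template <typename V>
class HashMap {
    static constexpr std::size_t NBUCKETS = 101;  // arbitrary bucket count
    std::vector<std::list<std::pair<std::string, V>>> buckets;

    std::size_t bucket_of(const std::string& key) const {
        return std::hash<std::string>{}(key) % NBUCKETS;
    }

public:
    HashMap() : buckets(NBUCKETS) {}

    // Append the pair to the list in the key's bucket.
    void insert(const std::string& key, const V& value) {
        buckets[bucket_of(key)].push_back({key, value});
    }

    // Total number of key-value pairs across all buckets.
    std::size_t size() const {
        std::size_t n = 0;
        for (const auto& b : buckets) n += b.size();
        return n;
    }

    // Number of pairs with this exact key (only its bucket is scanned).
    std::size_t count(const std::string& key) const {
        std::size_t n = 0;
        for (const auto& kv : buckets[bucket_of(key)])
            if (kv.first == key) ++n;
        return n;
    }

    // Reference to the mapped value; throws if the key is absent.
    V& at(const std::string& key) {
        for (auto& kv : buckets[bucket_of(key)])
            if (kv.first == key) return kv.second;
        throw std::out_of_range("key not found: " + key);
    }
};
```

Note how every operation first hashes the key to pick one bucket and then walks only that bucket's list; that is the whole point of chaining.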
P3. Change the database.cpp program (N:\Class\Assignment2\database.cpp) to throw exceptions when there is an unexpected problem. When appropriate, offer the user an option to fix the problem.
The code that goes with this question is: ccc_empl.cpp, ccc_empl.h, database.cpp, employee.dat
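The pattern to aim for looks something like the sketch below. The helper name and validation rule are hypothetical, not database.cpp's real code: a parsing routine throws when a field read from the data file is malformed, and the caller catches the exception at a point where it can prompt the user to re-enter the value instead of crashing.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper, not the real database.cpp code: validate a salary
// field (e.g. read from employee.dat) and throw on an unexpected value.
double parse_salary(const std::string& field) {
    std::size_t pos = 0;
    // std::stod itself throws std::invalid_argument on non-numeric input
    double value = std::stod(field, &pos);
    if (pos != field.size() || value < 0)
        throw std::runtime_error("bad salary field: " + field);
    return value;
}

// A caller would wrap the read in a loop like:
//   try { salary = parse_salary(line); break; }
//   catch (const std::exception& e) { /* report e.what(), re-prompt user */ }
```

The key design point is separation of concerns: the low-level routine detects the problem and throws; only the caller, which knows it is talking to a user, decides whether to re-prompt.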
Math 1342 – Calc 2 – Homework Chapter XXXXXXXXXX NAME:________________
§4.1 Approximating Polynomials #1-3, 7-11, 15, 16, 20
§4.3 Error in Approximation (1st day) #1, 2, 5, 13, 21
hw11
August 2, 2021
[1]: # Initialize Otter
import otter
grader = otter.Notebook("hw11.ipynb")
1 Homework 11: Spam/Ham Classification - Build Your Own Model
1.1 Feature Engineering, Logistic Regression, Cross Validation
1.2 Due Date: Thursday 8/5, 11:59 PM PDT
Collaboration Policy
Data science is a collaborative activity. While you may talk with others about the project, we
ask that you write your solutions individually. If you do discuss the assignments with others,
please include their names at the top of your notebook.
Collaborators: list collaborators here
1.3 This Assignment
In this homework, you will be building and improving on the concepts and functions that you
implemented in Homework 10 to create your own classifier to distinguish spam emails from ham
(non-spam) emails. We will evaluate your work based on your model's accuracy and your written
responses in this notebook.
After this assignment, you should feel comfortable with the following:
• Using sklearn libraries to process data and fit models
• Validating the performance of your model and minimizing overfitting
• Generating and analyzing precision-recall curves
1.4 Warning
This is a real-world dataset: the emails you are trying to classify are actual spam and legitimate
emails. As a result, some of the spam emails may be in poor taste or be considered inappropriate.
We think the benefit of working with realistic data outweighs these inappropriate emails, and
wanted to give a warning at the beginning of the project so that you are made aware.
[2]: # Run this cell to suppress all FutureWarnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
1.5 Score Breakdown
Question  Points
1         6
2a        4
2b        2
3         3
4         15
Total     30
1.6 Setup and Recap
Here we will provide a summary of Homework 10 to remind you of how we cleaned the data,
explored it, and implemented methods that are going to be useful for building your own model.
[3]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(style = "whitegrid",
        color_codes = True,
        font_scale = 1.5)
1.6.1 Loading and Cleaning Data
Remember that in email classification, our goal is to classify emails as spam or not spam (referred
to as "ham") using features generated from the text in the email.
The dataset consists of email messages and their labels (0 for ham, 1 for spam). Your labeled
training dataset contains 8348 labeled examples, and the unlabeled test set contains 1000 unlabeled
examples.
Run the following cell to load in the data into DataFrames.
The train DataFrame contains labeled data that you will use to train your model. It contains four
columns:
1. id: An identifier for the training example
2. subject: The subject of the email
3. email: The text of the email
4. spam: 1 if the email is spam, 0 if the email is ham (not spam)
The test DataFrame contains 1000 unlabeled emails. You will predict labels for these emails and
submit your predictions to the autograder for evaluation.
[4]: import zipfile
with zipfile.ZipFile('spam_ham_data.zip') as item:
    item.extractall()
[5]: original_training_data = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Convert the emails to lower case as a first step to processing the text
original_training_data['email'] = original_training_data['email'].str.lower()
test['email'] = test['email'].str.lower()
original_training_data.head()
[5]:    id                                            subject  \
     0   0  Subject: A&L Daily to be auctioned in bankrupt…
     1   1  Subject: Wired: "Stronger ties between ISPs an…
     2   2  Subject: It's just too small …
     3   3  Subject: liberal defnitions\n
     4   4  Subject: RE: [ILUG] Newbie seeks advice - Suse…

                                                        email  spam
     0  url: http://boingboing.net/# XXXXXXXXXX\n date: n…      0
     1  url: http://scriptingnews.userland.com/backiss…         0
     2  \n \n <head>\n <body>\n …
     3  depends on how much over spending vs. how much…         0
     4  hehe sorry but if you hit caps lock twice the …         0
Feel free to explore the dataset above along with any specific spam and ham emails that interest
you. Keep in mind that our data may contain missing values, which are handled in the following
cell.
[6]: # Fill any missing or NAN values
print('Before imputation:')
print(original_training_data.isnull().sum())
original_training_data = original_training_data.fillna('')
print('------------')
print('After imputation:')
print(original_training_data.isnull().sum())
Before imputation:
id 0
subject 6
email 0
spam 0
dtype: int64
------------
After imputation:
id 0
subject 0
email 0
spam 0
dtype: int64
1.6.2 Training/Validation Split
Recall that the training data we downloaded is all the data we have available for both training
models and validating the models that we train. We therefore split the training data into separate
training and validation datasets. You will need this validation data to assess the performance of
your classifier once you are finished training.
As in Homework 10, we set the seed (random_state) to 42. Do not modify this in the following
questions, as our tests depend on this random seed.
[7]: # This creates a 90/10 train-validation split on our labeled data
from sklearn.model_selection import train_test_split
train, val = train_test_split(original_training_data, test_size = 0.1, random_state = 42)

# We must do this in order to preserve the ordering of emails to labels for words_in_texts
train = train.reset_index(drop = True)
1.6.3 Feature Engineering
In order to train a logistic regression model, we need a numeric feature matrix X and a vector
of corresponding binary labels y. To address this, in Homework 10, we implemented the function
words_in_texts, which creates numeric features derived from the email text and uses those features
for logistic regression.
For this homework, we have provided you with an implemented version of words_in_texts. Remember
that the function outputs a 2-dimensional NumPy array containing one row for each email
text. The row should contain either a 0 or a 1 for each word in the list: 0 if the word doesn't
appear in the text and 1 if the word does.
[8]: def words_in_texts(words, texts):
    '''
    Args:
        words (list): words to find
        texts (Series): strings to search in
    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
    import numpy as np
    indicator_array = 1 * np.array([texts.str.contains(word) for word in words]).T
    return indicator_array
Run the following cell to see how the function works on some dummy text.
[9]: words_in_texts(['hello', 'bye', 'world'], pd.Series(['hello', 'hello worldhello']))

[9]: array([[1, 0, 0],
            [1, 0, 1]])
1.6.4 EDA and Basic Classification
In Homework 10, we proceeded to visualize the frequency of different words for both spam and
ham emails, and used words_in_texts(words, train['email']) directly to train a classifier