COMP XXXXXXXXXX Project Description XXXXXXXXXX Spring 2022
This project has two parts: a project presentation (10 PowerPoint slides) plus the code, due Monday (5/2/22), and a project paper report (4 pages) plus the code, due Wednesday (5/4/22). Please answer all the questions briefly. Carefully read the attached proposal document; your work should be based on it.
Project Presentation: 10 slides
Create PowerPoint slides to present your data mining techniques, evaluation, and results to the class.
· Cover data mining, data preparation, the confusion matrix, and evaluation.
· Make slides
· What is your project?
· Explain the technique you use. Do not give the definition of the technique; instead, explain what you did and how you did it.
· How much data do you have?
· What algorithms do you use, and how do you apply them?
· What evaluation technique do you use? Run the algorithms and report the evaluation numbers.
· Conclusion
· What accuracy do you get?
· Compare the two algorithms to each other, state which one is better and which one was more difficult, describe the problems you encountered, and give your suggestions for the future of each.
· Code: write the code in R.
· Based on your code, include all data preparation graphs and all results in the PowerPoint slides. Explain each result and each graph, and describe what you did to produce the graph.
· What was the problem, how did you solve it, and what is the conclusion?
· How good is the model?
· Build the models and make predictions; measure their performance and compare them to each other, using not only accuracy but also F1, F2, recall, or other measures.
· Work with the confusion matrix.
· In general, show the results: the model, accuracy or recall, and the confusion matrix (a minimal R sketch of these metrics follows this list).
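As an illustration of the confusion matrix and metrics mentioned above, here is a minimal R sketch; the actual and predicted vectors are placeholder labels standing in for whatever your test data and trained model produce, not the project's real code.

# Minimal sketch: confusion matrix and basic metrics from predicted vs. actual labels.
# The label vectors below are placeholders; in the project they would come from the
# test set and from predict() on the trained model.
actual    <- factor(c("Yes", "No", "Yes", "Yes", "No", "No"), levels = c("Yes", "No"))
predicted <- factor(c("Yes", "No", "No", "Yes", "No", "Yes"), levels = c("Yes", "No"))

cm <- table(Predicted = predicted, Actual = actual)   # 2x2 confusion matrix
print(cm)

TP <- cm["Yes", "Yes"]; FP <- cm["Yes", "No"]
FN <- cm["No", "Yes"];  TN <- cm["No", "No"]

accuracy  <- (TP + TN) / sum(cm)
precision <- TP / (TP + FP)
recall    <- TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)
f2        <- 5 * precision * recall / (4 * precision + recall)   # F-beta with beta = 2

cat("Accuracy:", accuracy, "Precision:", precision,
    "Recall:", recall, "F1:", f1, "F2:", f2, "\n")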
Project Report
It should include:
· Introduction: Describe the problem, data, data preparation, …
· Related work: Go to the school library databases, find resources/magazines, and pick 3 articles that address the same problem. Explain whether there are relevant papers that worked on the same problem. 2-3 paragraphs.
· Methods: Describe the methods and algorithms you used to solve the problem.
· Evaluation: Compare the methods you used
· Conclusion: Write a conclusion and future work for your project
· Code: R code. All the code you wrote with comments that explain your implementation.
Resources:
Use 3 resources from the school library.
Link: library.csun.edu
COMP 541: Project Proposal
XXXXXXXXXX Chronic Homelessness
Summary of the proposed project:
This work summarizes the challenges posed by homelessness service provision tasks, as well as the problems and the opportunities that exist for advancing both data science and human services. The problem is to understand the social characteristics of homeless people in order to help them as effectively as possible, and to determine whether homelessness can be recognized from the other attributes presented by the dataset.
Research technique:
We will use R to analyze data about the homeless. We chose two different techniques to address the prediction problem posed by the dataset: a Random Forest model and Logistic Regression. We will then use both techniques to see whether we can predict that a person is homeless. Random Forest is an ensemble method: multiple models are trained on the training dataset (the same model with different hyperparameters, or different models), and the outputs of multiple decision trees are combined to produce a single prediction. Random Forest works on the same principle as Decision Trees; however, it does not use all the data points and variables in each of the trees. It randomly samples data points and variables for each tree it creates and then combines the outputs at the end. This removes the bias that a single decision tree model might introduce and significantly improves predictive power. We will see this in the next section when we take a sample dataset and compare the accuracy of Random Forest and a Decision Tree. The steps are as follows (a code sketch follows the list):
· Load and summarize the data.
· Delete rows with missing values; this incidentally removes the outliers as well, so no further treatment of the null values is required.
· Make a plot to compare Veteran and non-Veteran counts.
· Make frequency plots for quantitative variables, split by Veteran.
· Make frequency plots for qualitative variables, split by Veteran.
· Create the training set and fit the final model to the training data. Generate predictions and bind them together with the true values from the test data.
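A minimal R sketch of these preparation steps is shown below; the file name homeless.csv and the column name Veteran are assumptions standing in for the project's actual dataset and split variable.

# Minimal sketch of the preparation steps above (file and column names are placeholders).
homeless <- read.csv("homeless.csv", stringsAsFactors = TRUE)
summary(homeless)                       # load and summarize the data

homeless <- na.omit(homeless)           # drop rows with missing values

barplot(table(homeless$Veteran),        # compare Veteran vs. non-Veteran counts
        main = "Veteran vs. Non-Veteran")

set.seed(42)                            # reproducible 70/30 train/test split
train_idx <- sample(nrow(homeless), size = floor(0.7 * nrow(homeless)))
train <- homeless[train_idx, ]
test  <- homeless[-train_idx, ]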
Now, we will create a Random Forest model with default parameters and then fine-tune it by changing mtry. We can tune the random forest model by changing the number of trees (ntree) and the number of variables randomly sampled at each split (mtry). According to the randomForest package description:
ntree: Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
mtry: Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification and regression.
· Use a loop to identify the right mtry for the Random Forest model (see the sketch after this list).
· We will use an mtry of 5 to obtain a more accurate model, and we will test the Random Forest model on our training set.
· The model reports an error rate on the training set but makes no errors in its predictions on it. We will see whether the same holds for our test set.
· We will plot variable importance to show the most important variables for the Random Forest model, and then compare the Random Forest model with the Logistic Regression model.
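Below is a minimal R sketch of the mtry tuning loop, the variable importance plot, and the test-set check, assuming the train/test split from the earlier sketch; the target column Veteran is again a placeholder for whatever outcome the project actually predicts.

# Minimal sketch: tune mtry by out-of-bag error, then inspect variable importance.
library(randomForest)

set.seed(42)
max_mtry <- min(10, ncol(train) - 1)    # do not exceed the number of predictors
oob_err  <- numeric(max_mtry)
for (m in 1:max_mtry) {
  fit <- randomForest(Veteran ~ ., data = train, ntree = 500, mtry = m)
  oob_err[m] <- fit$err.rate[fit$ntree, "OOB"]   # out-of-bag error with this mtry
}
best_mtry <- which.min(oob_err)

rf_final <- randomForest(Veteran ~ ., data = train,
                         ntree = 500, mtry = best_mtry, importance = TRUE)
varImpPlot(rf_final)                    # plot of the most important variables

rf_pred <- predict(rf_final, newdata = test)
table(Predicted = rf_pred, Actual = test$Veteran)   # confusion matrix on the test set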
We implement the two algorithms to build classification models. We will use the ROC curve for both models to check how well they fit the data; the model with the highest ROC score (AUC) is considered the best model. Here, we create the data sets using the bootstrap method, generating data sets of the same size as the original data set.
Use the models to predict homelessness in the testing data. We will use the best model as an example, but we should compare the performance of all the models.
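As a sketch of how the two models could be compared on the test data with ROC curves, the code below uses the pROC package; the glm formula, the rf_final model, and the Veteran target follow the earlier placeholder sketches rather than the project's final code.

# Minimal sketch: fit Logistic Regression, score both models, and compare ROC/AUC.
library(pROC)

glm_fit  <- glm(Veteran ~ ., data = train, family = binomial)
glm_prob <- predict(glm_fit, newdata = test, type = "response")   # probability of the second factor level
rf_prob  <- predict(rf_final, newdata = test, type = "prob")[, 2] # same class from the forest

roc_glm <- roc(test$Veteran, glm_prob)
roc_rf  <- roc(test$Veteran, rf_prob)

plot(roc_glm, col = "blue")
plot(roc_rf,  col = "red", add = TRUE)
legend("bottomright",
       legend = c(paste("Logistic Regression, AUC =", round(auc(roc_glm), 3)),
                  paste("Random Forest, AUC =", round(auc(roc_rf), 3))),
       col = c("blue", "red"), lwd = 2)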
Conclusion:
This shows the power of ensembling and the importance of using Random Forest over Decision Trees. Though Random Forest has its own inherent limitations (in terms of the number of factor levels a categorical variable can have), it is still one of the best models that can be used for classification. It is easy to use and tune compared to some of the other, more complex models, and it still provides a satisfactory level of accuracy in this scenario.
COMP 542 Handout
Melissa Cobe
Computer Science Librarian
XXXXXXXXXX
COMP 542 | Library Research Workshop
Resources
● COMP 542 Library Research Guide: libguides.csun.edu/comp542
● University Library Website: library.csun.edu
o Type keywords into OneSearch to search the Library catalog
o Good for finding books (including eBooks), journal articles, datasets, etc.
● Library Databases: libguides.csun.edu/az.php?s=66108
o Databases → Choose Subject → Computer Science
o Good for finding articles, conference papers and proceedings, and datasets
o Helpful library databases for this assignment:
▪ ACM Digital Library
▪ IEEE Xplore
▪ O’Reilly Online Learning E-books
(previously called Safari Tech Books Online)
● Google Scholar
o scholar.google.com
o Good for finding journal articles
https://libguides.csun.edu/comp542
https://library.csun.edu
https://libguides.csun.edu/az.php?s=66108
https://dl-acm-org.libproxy.csun.edu
https://ieeexplore-ieee-org.libproxy.csun.edu/Xplore/dynhome.jsp?tag=1
https://www.oreilly.com/library/view/temporary-access
https://scholar.google.com
IEEE Citations
● IEEE format is used in engineering, computer science, and information technology
● Guide to citing in IEEE format: libguides.csun.edu/comp282/citeyoursources
o IEEE Examples: libguides.murdoch.edu.au/IEEE/all
o Citing images, figures, & tables: guides.lib.monash.edu/c.php?g=219786&p=6610144
The Basics
● In-Text Citations:
Figure 1: In-text citation example. Source: [1]
● References List:
o A list of numerically-sorted full citations including complete and accurate information for each source
o Goes at the end of your paper or presentation
o Sample full citation:
Figure 2: Full IEEE citation example. Source: [1]
Since somebody else created the two images above, we give credit by using in-text
citations to indicate the source, and then include a full citation in our References List,
as shown below:
References
[1] Murdoch University Library, “IEEE - Referencing Guide,” Murdoch University Library. [Online]. Available: https://libguides.murdoch.edu.au/IEEE. [Accessed: Aug. 25, 2021].
Converting from Chicago to IEEE
IEEE is based on Chicago style – when in doubt, export citations in Chicago and modify from
there. See examples below; changes are underlined and highlighted.
● Key differences:
o Add reference number in square brackets, e.g. [3]
▪ References List is sorted numerically, not alphabetically
o Abbreviate names and reverse order
▪ Ex. Smith, Jane → J. Smith
o Change format of volume/issue for journals
o For e-resources, add:
[Format]. Available: Database name, internet address. [Accessed: Date of access].
Example: Journal Article
● Chicago:
Shrestha, Yash Raj, and Yongjie Yang. “Fairness in Algorithmic Decision-Making: Applications in Multi-Winner Voting, Machine Learning, and Recommender Systems.” Algorithms 12, no. 9 (September 2019): 199. https://doi.org/10.3390/a XXXXXXXXXX.
● IEEE:
[2] Y. R. Shrestha and Y. Yang, “Fairness in Algorithmic Decision-Making: Applications in Multi-Winner Voting, Machine Learning, and Recommender Systems,” Algorithms, vol. 12, no. 9, p. 199, Sep. 2019. [Online]. Available: mdpi.com. [Accessed August 10, 2021].
Example: Conference Paper
● Chicago:
Hajian, Sara, Francesco Bonchi, and Carlos Castillo. “Algorithmic Bias: From Discrimination Discovery to Fairness-Aware Data Mining.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2125–26. ACM, 2016. https://doi.org/10.1145/ XXXXXXXXXX.
● IEEE:
[3] S. Hajian, F. Bonchi, and C. Castillo, “Algorithmic Bias: From Discrimination Discovery to
Fairness-Aware Data