Data Mining Assignment
Data mining using machine learning algorithms for automatic
classification, clustering, and pattern recognition has a wide variety of
applications. WEKA is a collection of machine learning algorithms for data
mining tasks. In this data mining project, you will use WEKA to explore the
student retention data set, available under the course document section of
the course in the Biola Blackboard environment.
Examine and experiment with the student retention data set
- Software:
Download and install WEKA.
- Data:
Log into the Biola Blackboard environment
to download the retention data sets from the Content area.
Carefully read the confidentiality agreements before you use the data sets,
and acknowledge the agreements in your study report. Remember to delete the
data sets from your computer after you finish the work; this is our agreement
with Biola for using the data.
- Reference:
Look into Part 3 of the textbook Data Mining: Practical
Machine Learning Tools and Techniques for the technical details
about WEKA in order to conduct the experiments. You can also find additional
documentation
on the WEKA website when needed.
- Classification experiments:
- Classification algorithms to use
(under WEKA Explorer → Classify): Pick at least four classifiers
from each of the following four categories of classifiers implemented in
WEKA: bayes, functions, lazy, and trees. Your picks must include J48 in trees
and IBk in lazy. That will give you a collection of at least 16 classifiers.
For example, you may pick NaiveBayes, BayesNet, NaiveBayesSimple, and
NaiveBayesUpdateable in the bayes category; VotedPerceptron, SimpleLogistic,
RBFNetwork, and SMO in the functions category; and so forth.
- Classification experiments A:
Apply the classifiers you pick to conduct classification experiments (as
you did in Homework #6), using the training data set in Master Numeric
Training List.arff in the numerical version folder
to learn to classify the freshman list in Numeric FreshmenList.F09.arff in the numerical version
folder. Try at least 4 different parameter settings to fine-tune the
classifiers and improve their performance. For each classifier, record the
confusion matrix and the estimated precision and recall based on
10-fold cross-validation.
- Classification experiments B: Repeat
the experiments using the training data set in Balanced x2 Numeric Training List.arff
(which artificially duplicates all the “lost” cases to increase the
percentage of lost cases among all the cases in the training data set) in
the numerical version folder, again learning to classify the
freshman list in Numeric FreshmenList.F09.arff in the numerical version folder.
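The effect of duplicating the “lost” cases on the class balance can be checked with a little arithmetic. The following is only a minimal sketch in Python, not part of the assignment materials: the class counts and the helper name `lost_fraction_after_duplication` are hypothetical, and it assumes that "x2" means each lost case appears twice in the balanced file.

```python
def lost_fraction_after_duplication(n_retained, n_lost, copies=2):
    """Fraction of "lost" cases when every lost case appears `copies` times.

    Assumption: the balanced file simply repeats each lost case; the
    retained cases are left untouched.
    """
    if n_retained < 0 or n_lost < 0 or copies < 1:
        raise ValueError("counts must be non-negative and copies >= 1")
    total = n_retained + n_lost * copies
    if total == 0:
        raise ValueError("the data set is empty")
    return (n_lost * copies) / total


# Hypothetical counts for illustration only (NOT from the retention data).
before = lost_fraction_after_duplication(900, 100, copies=1)
after = lost_fraction_after_duplication(900, 100, copies=2)
print(f"lost fraction before: {before:.3f}, after: {after:.3f}")
```

Because the duplicated cases are exact copies, cross-validation on the balanced file may place copies of the same case in both the training and the test folds, which is worth keeping in mind when comparing the results of experiments A and B.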
- Clustering experiments:
- Clustering algorithms to use (under WEKA Explorer → Cluster):
Pick at least two
clustering algorithms from WEKA. For example, you may pick EM and FarthestFirst, and so forth.
- Clustering experiments A: Apply
the clustering algorithms you pick to conduct clustering experiments
using the training dataset in Master Numeric Training List.arff in the numerical
version folder. Try at least two different parameter settings to
fine tune the parameters. Use WEKA to visually explore the resulting
clusters.
- Association experiments:
- Association algorithms to use (under WEKA Explorer → Associate):
Pick at least two
association algorithms from WEKA. For example, you may pick Apriori and Tertius, and so forth.
- Association experiments A: Apply
the association algorithms you pick to conduct association experiments
using the training data set in Master Numeric Training List.arff in the numerical
version folder. Try at least two different parameter settings to
fine-tune the parameters and examine the resulting association rules.
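When judging the rules an association algorithm reports, two commonly used measures are support (how often the whole rule holds in the data) and confidence (how often the consequent holds when the antecedent does). The sketch below, in Python, shows how these could be computed by hand for one rule. The records, attribute names, and the helper `support_and_confidence` are all hypothetical and do not come from the retention data set.

```python
def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    Each record is a set of "attribute=value" strings.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    # Records that satisfy the left-hand side of the rule.
    n_antecedent = sum(1 for r in records if antecedent <= set(r))
    # Records that satisfy both sides of the rule.
    n_both = sum(1 for r in records if (antecedent | consequent) <= set(r))
    support = n_both / len(records) if records else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy records for illustration only.
records = [
    {"aid=yes", "oncampus=yes", "status=retained"},
    {"aid=yes", "oncampus=no", "status=retained"},
    {"aid=no", "oncampus=no", "status=lost"},
    {"aid=no", "oncampus=yes", "status=retained"},
    {"aid=no", "oncampus=no", "status=lost"},
]
support, confidence = support_and_confidence(
    records, {"aid=no", "oncampus=no"}, {"status=lost"}
)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```

A rule with high confidence but very low support covers only a handful of cases, so both numbers are worth reporting when explaining why a rule is interesting.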
What to include in your report for this data mining assignment:
- Provide an estimate of the amount of time you spent in the work.
- For the classification experiments A and B:
- For each individual experiment, report the confusion matrix and the
estimated precision and recall of the classifier based on
10-fold cross-validation.
- Describe the main differences you observe between the
results from classification
experiments A and B and provide your explanations of the differences
observed.
- If you were to provide a list of likely-to-be-lost
students to the retention staff, what would the list be, based on your
findings in the experiments? What are its estimated precision and recall?
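Precision and recall for the “lost” class follow directly from a confusion matrix. The sketch below, in Python, assumes the layout WEKA prints, in which rows are the actual classes and columns are the predicted classes. The counts and the helper name `precision_recall` are hypothetical and are not results from the retention data.

```python
def precision_recall(confusion, positive):
    """Precision and recall for class index `positive`.

    confusion[i][j] = number of cases of actual class i predicted as class j.
    """
    true_positive = confusion[positive][positive]
    # Column sum: every case the classifier labelled as the positive class.
    predicted_positive = sum(row[positive] for row in confusion)
    # Row sum: every case that really belongs to the positive class.
    actual_positive = sum(confusion[positive])
    precision = true_positive / predicted_positive if predicted_positive else 0.0
    recall = true_positive / actual_positive if actual_positive else 0.0
    return precision, recall


# Hypothetical counts for illustration only; class 0 = retained, class 1 = lost.
confusion = [
    [850, 50],  # actual retained: 850 predicted retained, 50 predicted lost
    [60, 40],   # actual lost: 60 predicted retained, 40 predicted lost
]
precision, recall = precision_recall(confusion, positive=1)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

For a list handed to the retention staff, precision indicates what share of the listed students would actually be lost, and recall indicates what share of the students who would be lost appear on the list.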
- For the clustering experiments, describe in general terms
the resulting clusters and any insights you gained when visually
inspecting them.
- For the association experiments, report three or more
interesting (for example, intuitively sensible) association rules you
discovered in the experiments and explain why they are interesting.
- Write a short reflection of at least 250 words
on Artificial Intelligence and data mining in the context of this
assignment.