Data Mining Assignment

Data mining with machine learning algorithms for automatic classification, clustering, and pattern recognition has a wide variety of applications. WEKA is a collection of machine learning algorithms for data mining tasks. In this data mining project, you should use WEKA to explore the student retention data set available under the course document section of the course in the Biola Blackboard environment.

Examine and experiment with the student retention data set

Software: Download and install WEKA.
Data: Log into the Biola Blackboard environment to download the retention data sets from the Content area. Carefully read the confidentiality agreements before you use the data sets, and acknowledge the agreements in your study report. Remember to delete the data sets from your computer after you finish the work. This is our agreement with Biola for using the data sets.
Reference: Look into Part 3 of the textbook Data Mining: Practical Machine Learning Tools and Techniques for the technical details about WEKA in order to conduct the experiments. You can also find additional documentation on the WEKA website when needed.
Classification experiments:

Classification algorithms to use (under WEKA Explorer → Classify): Pick at least four classifiers from each of the following four categories of classifiers implemented in WEKA: bayes, functions, lazy, and trees. Your picks must include J48 in trees and IBk in lazy. That will give you a collection of at least 16 classifiers. For example, you may pick NaiveBayes, BayesNet, NaiveBayesSimple, and NaiveBayesUpdateable in the bayes category, and VotedPerceptron, SimpleLogistic, RBFNetwork, and SMO in the functions category, and so forth.
Classification experiments A: Apply the classifiers you pick to conduct classification experiments (as you did in Homework #6), using the training data set in Master Numeric Training List.arff in the numerical version folder to learn to classify the freshman list in Numeric FreshmenList.F09.arff in the numerical version folder. Try at least four different parameter settings to fine-tune the classifiers and improve their performance. For each classifier, record the confusion matrix and the estimated precision and recall based on 10-fold cross-validation.
Classification experiments B: Repeat the experiments using the training data set in Balanced x2 Numeric Training List.arff (which artificially duplicates all the “lost” cases to increase the percentage of lost cases among all the cases in the training data set) in the numerical version folder to learn to classify the freshman list in Numeric FreshmenList.F09.arff in the numerical version folder.
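The balancing step described for experiments B can be illustrated with a small Python sketch. It is not how the Balanced x2 file was actually produced. The toy rows, the "retained" label, and the helper name duplicate_minority are assumptions made for illustration, and none of the values come from the retention data.

```python
from collections import Counter


def duplicate_minority(rows, label_index, minority_label):
    """Return a new list in which every minority-class row appears twice."""
    balanced = []
    for row in rows:
        balanced.append(row)
        if row[label_index] == minority_label:
            balanced.append(row)  # exact duplicate of a "lost" case
    return balanced


# Hypothetical toy data: (gpa, label). Not taken from the retention data set.
rows = [
    (3.5, "retained"), (2.1, "lost"), (3.8, "retained"),
    (3.0, "retained"), (1.9, "lost"), (3.2, "retained"),
]

balanced = duplicate_minority(rows, label_index=1, minority_label="lost")
before = Counter(row[1] for row in rows)
after = Counter(row[1] for row in balanced)

print("share of lost cases before:", before["lost"] / len(rows))
print("share of lost cases after: ", after["lost"] / len(balanced))
```

Keep in mind that when a data set contains exact duplicates, cross-validation can place copies of the same case in both the training folds and the test fold, which may make the estimates look better than they are. This is worth considering when you compare experiments A and B.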

Clustering experiments:

Clustering algorithms to use (under WEKA Explorer → Cluster): Pick at least two clustering algorithms from WEKA. For example, you may pick EM and FarthestFirst.
Clustering experiments A: Apply the clustering algorithms you pick to conduct clustering experiments using the training dataset in Master Numeric Training List.arff in the numerical version folder. Try at least two different parameter settings to fine tune the parameters. Use WEKA to visually explore the resulting clusters.

Association experiments:

Association algorithms to use (under WEKA Explorer → Associate): Pick at least two association algorithms from WEKA. For example, you may pick Apriori and Tertius.
Association experiments A: Apply the association algorithms you pick to conduct association experiments using the training data set in Master Numeric Training List.arff in the numerical version folder. Try at least two different parameter settings to fine-tune the algorithms, and examine the resulting association rules.
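When judging the rules an association algorithm reports, it may help to recall how the support and confidence of a rule are computed. The Python sketch below does this by hand for a single rule. The attribute names and records are hypothetical and are not drawn from the retention data. The helper functions illustrate the standard definitions and are not WEKA's implementation.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) as support(both) / support(antecedent)."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


# Hypothetical records, each a set of attribute=value items.
transactions = [
    {"housing=on_campus", "aid=yes", "status=retained"},
    {"housing=on_campus", "aid=yes", "status=retained"},
    {"housing=off_campus", "aid=no", "status=lost"},
    {"housing=off_campus", "aid=yes", "status=retained"},
    {"housing=off_campus", "aid=no", "status=lost"},
]

# Rule under inspection: aid=no  ==>  status=lost
rule_support = support(transactions, {"aid=no", "status=lost"})
rule_confidence = confidence(transactions, {"aid=no"}, {"status=lost"})
print("support:", rule_support)
print("confidence:", rule_confidence)
```

A rule with high confidence but very low support covers only a few cases, so both numbers are worth checking before calling a rule interesting.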

What to include in your report for this data mining assignment:

Provide an estimate of the amount of time you spent in the work.
For the classification experiments A and B,

For each individual experiment, report the confusion matrix and the estimated precision and recall of the classifier based on 10-fold cross-validation.
Describe the main differences you observe between the results from classification experiments A and B and provide your explanations of the differences observed.
If you were to provide a list of likely-to-be-lost students to the retention staff, which students would be on the list based on your findings in the experiments? What are the estimated precision and recall of that list?

For the clustering experiments, describe the resulting clusters in general terms and any insights you gained when visually inspecting them.
For the association experiments, report three or more interesting association rules you discovered in the experiments (for example, rules that make sense intuitively) and explain why they are interesting.
Write a short reflection of at least 250 words on artificial intelligence and data mining in the context of this assignment.
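The precision and recall figures requested for the classification experiments can be computed from a confusion matrix. The Python sketch below assumes a two-class problem with the matrix laid out as in WEKA's output (rows are actual classes, columns are predicted classes) and treats "lost" as the class of interest. The counts and the class order are invented for illustration and do not come from the retention data.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the class of interest (here: "lost")."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical confusion matrix; rows = actual class, columns = predicted class.
#              predicted lost   predicted retained
matrix = [
    [30, 20],    # actual lost
    [10, 140],   # actual retained
]

tp = matrix[0][0]  # lost students correctly flagged as lost
fn = matrix[0][1]  # lost students the classifier missed
fp = matrix[1][0]  # retained students wrongly flagged as lost

precision, recall = precision_recall(tp, fp, fn)
print("precision:", precision)  # share of flagged students who were actually lost
print("recall:", recall)        # share of lost students who were flagged
```

For a list handed to the retention staff, precision indicates how many of the listed students are expected to be truly at risk, and recall indicates how many at-risk students the list is expected to capture.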