[Clug-l] [BULK] Re: New to the "Ensemble" node
Barbee, Steve
sbarbee at spss.com
Fri Apr 18 11:49:55 EDT 2008
Mark,
After re-reading my second bullet, I see that I forgot to add that all of the minority class records should be included in each partition of the majority class. I think that was implicitly understood, but I wanted to dispel any doubt.
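For illustration only, here is a minimal Python sketch of that partitioning scheme (it is not Clementine code). It assumes the records fit in a list; `make_partitions` and the `is_minority` predicate are hypothetical names, and the toy data simply mimics a 0.2% minority class.

```python
def make_partitions(records, is_minority, n_partitions):
    """Split the majority class into sequential segments and
    append every minority record to each segment."""
    minority = [r for r in records if is_minority(r)]
    majority = [r for r in records if not is_minority(r)]
    size, extra = divmod(len(majority), n_partitions)
    partitions = []
    start = 0
    for i in range(n_partitions):
        # Spread any remainder over the first `extra` segments.
        end = start + size + (1 if i < extra else 0)
        partitions.append(majority[start:end] + minority)
        start = end
    return partitions


# Toy data (assumption): 5000 records as (id, target), 10 of them minority.
records = [(i, 1 if i % 500 == 0 else 0) for i in range(5000)]
partitions = make_partitions(records, lambda r: r[1] == 1, n_partitions=10)
```

Each partition would then get its own model, and no majority record is discarded across the set of partitions.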
Steve
________________________________
From: Mark Palmberg [mailto:palmberg at gmail.com]
Sent: Friday, April 18, 2008 11:01 AM
To: Barbee, Steve
Cc: CLUG-L
Subject: [BULK] Re: [Clug-l] New to the "Ensemble" node
Importance: Low
Thanks a lot for these thoughts, Steve (and Ron!). The key, clearly, seems to involve modeling multiple samples. While it may be tedious, at least it'll provide more opportunity for practice. I like the idea of modeling the models and am looking forward to trying it.
Have a good weekend. Again, my thanks.
Mark
On Fri, Apr 18, 2008 at 9:55 AM, Barbee, Steve <sbarbee at spss.com> wrote:
Mark,
Here are some things to try when mining imbalanced datasets:
* The normal approach is to downsample your majority class by using "reduce" in the "balance" node generated from a distribution graph. As you mentioned in your previous query, this can also be done with select and append nodes. I would not reduce your majority class until it equals your small minority class size (which I'd sample at 100%); keeping perhaps 2 to 5 times the minority class sample size might work. When partitioning your data into train and test datasets, I'd experiment with values from 50% to 80%. I would use the C5.0 tree, which allows setting the "boosting" option to "True" in the simple tab when specifying modeling parameters in the binary classifier. C5.0 boosting generally provides more accurate results by building a sequence of trees, each of which gives more weight to the records the previous trees misclassified. Then evaluate your results by area under the ROC curve, instead of accuracy, along with a coincidence matrix (analysis node) to check your false negatives and false positives. The C5.0 node also allows you to apply a cost to misclassified results.
* I have not tried this brute force method, but since your dataset is so imbalanced (0.2% minority class) you can create many (10-100) partitions of your dataset by selecting sequential segments. Then create models for each segment and then use the ensemble node to combine the results. Obviously, this is tedious but it addresses the concern with downsampling - that you are throwing away information.
* Meta-modeling: Try clustering your data before modeling with C5.0 where cluster membership is an input field.
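As a rough illustration of the first bullet, the Python sketch below (not Clementine code) downsamples the majority class to a multiple of the minority class and shows why area under the ROC curve is a better yardstick than accuracy on data this imbalanced. The names `downsample` and `roc_auc`, the ratio of 3, and the toy data are all assumptions made for the example.

```python
import random


def downsample(records, label_of, ratio=3, seed=0):
    """Keep every minority record (label 1) and a random sample of the
    majority class (label 0) that is `ratio` times the minority size."""
    minority = [r for r in records if label_of(r) == 1]
    majority = [r for r in records if label_of(r) == 0]
    keep = min(len(majority), ratio * len(minority))
    return minority + random.Random(seed).sample(majority, keep)


def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative;
    ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (assumption): 5000 records as (id, target), 0.2% minority.
records = [(i, 1 if i % 500 == 0 else 0) for i in range(5000)]
balanced = downsample(records, lambda r: r[1], ratio=3)

# A "model" that always predicts the majority class looks accurate...
labels = [r[1] for r in records]
always_zero = [0.0] * len(labels)
accuracy = sum(1 for y in labels if y == 0) / len(labels)
# ...but it cannot rank a positive above a negative.
auc = roc_auc(labels, always_zero)
```

Here `accuracy` comes out at 0.998 while `auc` is 0.5, which is the kind of gap the coincidence matrix would also expose as a full column of false negatives.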
Steve
________________________________
From: clug-l-bounces On Behalf Of Mark Palmberg
Sent: Thursday, April 17, 2008 1:20 PM
To: CLUG-L
Subject: [Clug-l] New to the "Ensemble" node
I attached a binary classifier node to feature selection generate model node in order to discover which models might best work in predicting my 1/0 target variable. The top three models in terms of "profit" and overall accuracy were C&R Tree, CHAID, and C5. So I generated those, attached them downstream from my generated feature selection node, added on an Ensemble node and then Analysis output node to that, and let 'er rip. The analysis says that the individual models correctly matched the predicted variable with the target variable 95.95% of the time (performance evaluation was .689 for 0 value and .618 for 1). The confidence values, on the other hand, were 100% wrong, and the agreement between the three models was 100% wrong. So, in layman's terms, where does this leave me? Better to go back to the individual model with the best overall performance? Re-sample? This was a small sample, BTW (518 total records). Perhaps that's too small?
Thanks for any thoughts you have!
Mark