Filtering
Troy Shields edited this page 2016-02-29 23:06:32 -08:00
GoLearn supports discretisation of FloatAttributes to CategoricalAttributes through fixed-width histogram binning and variable-width Chi-Merge binning.
Histogram binning
Code excerpt: Histogram binning
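// Load the iris dataset; the second argument indicates a header row
inst, err := base.ParseCSVToInstances("../examples/datasets/iris_headers.csv", true)
if err != nil {
	panic(err)
}
// Discretise the first Attribute into 10 fixed-width bins
binAttr := inst.AllAttributes()[0]
filt := NewBinningFilter(inst, 10)
filt.AddAttribute(binAttr)
filt.Train()
// Wrap the Instances so that the trained filter is applied to them
instf := base.NewLazilyFilteredInstances(inst, filt)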
- NewBinningFilter creates the filter, using the structure and data from the given training Instances. The second parameter controls the number of bins used for each Attribute.
- Each FloatAttribute must be added using AddAttribute. All of the FloatAttributes can be added at once using the AddAllNumericAttributes function.
- Build (invoked as Train() in the excerpt above) searches for the maximum and minimum values of each FloatAttribute and determines the bin width.
- Run discretises the Instances in place (IMPORTANT: this means that some data may be lost).
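The arithmetic behind fixed-width binning can be sketched in plain Go without GoLearn. This is a minimal illustration under stated assumptions, not GoLearn's implementation: the helper names binEdges and binIndex are hypothetical, the sample values are made up, and clamping the maximum value into the last bin is a choice made for this sketch.

```go
package main

import "fmt"

// binEdges scans the training values for their minimum and maximum and
// returns them together with the width of each of the equal-width bins.
// It assumes values is non-empty and bins is positive.
func binEdges(values []float64, bins int) (min, max, width float64) {
	min, max = values[0], values[0]
	for _, v := range values[1:] {
		if v < min {
			min = v
		}
		if v > max {
			max = v
		}
	}
	return min, max, (max - min) / float64(bins)
}

// binIndex maps a value to a bin number in [0, bins-1]. Values outside the
// training range are clamped to the first or last bin (sketch assumption).
func binIndex(v, min, width float64, bins int) int {
	if width == 0 {
		return 0 // all training values were identical
	}
	i := int((v - min) / width)
	if i < 0 {
		return 0
	}
	if i >= bins {
		return bins - 1
	}
	return i
}

func main() {
	// Hypothetical attribute values, for illustration only.
	values := []float64{4.3, 5.1, 5.8, 6.4, 7.9}
	min, max, width := binEdges(values, 10)
	fmt.Printf("min=%.2f max=%.2f width=%.2f\n", min, max, width)
	for _, v := range values {
		fmt.Printf("%.1f falls in bin %d\n", v, binIndex(v, min, width, 10))
	}
}
```

Because every value inside a bin is replaced by the same bin label, the distinction between, say, 5.1 and 5.2 disappears once both fall into one bin, which is the data loss the warning above refers to.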
Chi-Merge binning
Chi-Merge binning is a supervised technique which iteratively merges initial bins so long as the merge doesn't affect the class distribution in the combined bin in a manner that's statistically significant. Bramer gives a comprehensive overview of the algorithm.
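The merge test at the heart of the technique can be illustrated with the chi-squared statistic for a pair of adjacent bins. The following is a sketch under stated assumptions rather than GoLearn's code: chiSquared is a hypothetical helper, both slices are assumed to have the same length, classes absent from both bins are skipped, and the class counts are made up. The critical value 2.706 is the standard chi-squared value for one degree of freedom at the 90% level.

```go
package main

import "fmt"

// chiSquared returns the chi-squared statistic for two adjacent bins.
// a[k] and b[k] hold the number of instances of class k in each bin.
func chiSquared(a, b []float64) float64 {
	var totalA, totalB float64
	for k := range a {
		totalA += a[k]
		totalB += b[k]
	}
	if totalA == 0 || totalB == 0 {
		return 0 // an empty bin carries no evidence against merging
	}
	n := totalA + totalB
	chi := 0.0
	for k := range a {
		classTotal := a[k] + b[k]
		if classTotal == 0 {
			continue // sketch assumption: skip classes absent from both bins
		}
		// Expected counts if the class were independent of the bin.
		expA := totalA * classTotal / n
		expB := totalB * classTotal / n
		chi += (a[k]-expA)*(a[k]-expA)/expA + (b[k]-expB)*(b[k]-expB)/expB
	}
	return chi
}

func main() {
	// Two classes give one degree of freedom; 2.706 is the 90% critical value.
	const critical = 2.706
	similar := chiSquared([]float64{5, 5}, []float64{4, 6})
	different := chiSquared([]float64{10, 0}, []float64{0, 10})
	fmt.Println("merge bins with similar class mix:", similar < critical)
	fmt.Println("merge bins with different class mix:", different < critical)
}
```

A pair of bins is a merge candidate only while its statistic stays below the critical value; bins whose class distributions clearly differ are kept apart.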
Code excerpt: Chi-Merge binning
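// Load the iris dataset; the second argument indicates a header row
inst, err := base.ParseCSVToInstances("../examples/datasets/iris_headers.csv", true)
if err != nil {
	panic(err)
}
// Sort by the Attribute to be discretised (index 0) before training
attrs := []int{0}
inst.Sort(base.Ascending, attrs)
// Build the filter with a significance level of 90%
filt := NewChiMergeFilter(inst, 0.90)
filt.AddAttribute(inst.GetAttr(0))
filt.Train()
instf := base.NewLazilyFilteredInstances(inst, filt)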
- Instances must be sorted by the Attribute used for ChiMerge before Build() (invoked as Train() in the excerpt above) is called. This effectively limits ChiMerge to operating on one Attribute at a time. This is a bug. Use multiple ChiMerge filters if further discretisation is needed.
- NewChiMergeFilter is used to generate the filter. The second argument is the significance level (in this case 90%). Chi-Merge won't merge adjacent bins if their chi-squared statistic exceeds the critical value implied by this significance level.
- Build computes the Chi-Merge bins.
- Run discretises a set of Instances in place. A value lower than the lowest training value seen is assigned to the lowest bin, and a value higher than the highest training value seen is assigned to the highest bin.
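The out-of-range rule can be sketched as a lookup over variable-width bins. This is an illustration only, not GoLearn's code: binFor is a hypothetical helper, the bins are assumed to be described by their sorted inclusive lower bounds, and the bounds shown are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// binFor returns the index of the bin containing v. lower[i] is the inclusive
// lower bound of bin i, and lower must be sorted in ascending order.
func binFor(lower []float64, v float64) int {
	// Find the first bound strictly greater than v.
	i := sort.Search(len(lower), func(j int) bool { return lower[j] > v })
	if i == 0 {
		return 0 // below the lowest training value: lowest bin
	}
	// Values above every bound give i == len(lower), i.e. the highest bin.
	return i - 1
}

func main() {
	// Hypothetical lower bounds of three variable-width bins.
	lower := []float64{4.3, 5.5, 6.1}
	for _, v := range []float64{4.0, 4.3, 5.7, 9.9} {
		fmt.Printf("%.1f falls in bin %d\n", v, binFor(lower, v))
	}
}
```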
Support status
Operating Systems | Mac OS X 10.8, Ubuntu 14.04 |
Go version | 1.2 |
GoLearn version | 0.1 |
Support status | Current |
Next revision | On version upgrade |