3 Filtering
Troy Shields edited this page 2016-02-29 23:06:32 -08:00

GoLearn supports discretisation of FloatAttributes into CategoricalAttributes through fixed-width histogram binning and variable-width Chi-Merge binning.

Histogram binning

Code excerpt: Histogram binning

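// Load the iris dataset; the second argument says the CSV has a header row.
inst, err := base.ParseCSVToInstances("../examples/datasets/iris_headers.csv", true)
if err != nil {
	panic(err)
}
// Discretise the first Attribute into 10 fixed-width bins.
binAttr := inst.AllAttributes()[0]
filt := NewBinningFilter(inst, 10)
filt.AddAttribute(binAttr)
filt.Train()
instf := base.NewLazilyFilteredInstances(inst, filt)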
  • NewBinningFilter creates the filter, using the structure and data from the given training Instances. The second parameter controls the number of bins used for each Attribute.
  • Each FloatAttribute must be added using AddAttribute. All of the FloatAttributes can be added at once using the AddAllNumericAttributes function.
  • Train searches for the maximum and minimum values of each added FloatAttribute and determines the bin width.
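  • base.NewLazilyFilteredInstances applies the trained filter to the Instances and returns the discretised result (IMPORTANT: the float values are replaced by bin labels there, so some data may be lost).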
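
The fixed-width scheme described above can be sketched in plain Go. This is a minimal illustration, not GoLearn's implementation: equalWidthBins is a hypothetical helper, and putting the maximum value into the last bin is an assumption about boundary handling.

```go
package main

import "fmt"

// equalWidthBins returns, for each value, the index of the fixed-width bin
// it falls into when [min(values), max(values)] is split into `bins` bins.
// Hypothetical helper for illustration; it is not part of GoLearn.
func equalWidthBins(values []float64, bins int) []int {
	if len(values) == 0 || bins <= 0 {
		return nil
	}
	// Find the minimum and maximum of the training values.
	lo, hi := values[0], values[0]
	for _, v := range values {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	width := (hi - lo) / float64(bins)
	out := make([]int, len(values))
	for i, v := range values {
		if width == 0 {
			continue // all values are identical, so everything lands in bin 0
		}
		idx := int((v - lo) / width)
		if idx >= bins {
			idx = bins - 1 // assumption: the maximum value belongs to the last bin
		}
		out[i] = idx
	}
	return out
}

func main() {
	fmt.Println(equalWidthBins([]float64{0, 2.5, 5, 7.5, 10}, 4))
}
```

Each bin covers the same range of values, so a skewed Attribute can leave some bins nearly empty; that is the usual trade-off of fixed-width binning.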

Chi-Merge binning

Chi-Merge binning is a supervised technique that starts from fine-grained initial bins and iteratively merges adjacent bins for as long as the class distributions of the two bins do not differ in a statistically significant way. Bramer gives a comprehensive overview of the algorithm.
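
The merge decision rests on a chi-squared statistic computed over the class counts of two adjacent bins. The sketch below shows one way to compute it; chiSquared is a hypothetical helper, and skipping classes that appear in neither bin is an assumption about how zero expected counts are handled, not a description of GoLearn's code.

```go
package main

import "fmt"

// chiSquared returns the chi-squared statistic for two adjacent bins.
// a[k] and b[k] hold the number of rows of class k in each bin; both
// slices must have the same length.
// Hypothetical helper for illustration; it is not part of GoLearn.
func chiSquared(a, b []float64) float64 {
	var rowA, rowB float64
	for k := range a {
		rowA += a[k]
		rowB += b[k]
	}
	total := rowA + rowB
	if total == 0 {
		return 0
	}
	var chi float64
	for k := range a {
		col := a[k] + b[k]
		if col == 0 {
			continue // assumption: a class absent from both bins contributes nothing
		}
		// Expected count = row total * column total / grand total.
		expA := rowA * col / total
		expB := rowB * col / total
		if expA > 0 {
			chi += (a[k] - expA) * (a[k] - expA) / expA
		}
		if expB > 0 {
			chi += (b[k] - expB) * (b[k] - expB) / expB
		}
	}
	return chi
}

func main() {
	// Identical class distributions: a good candidate for merging.
	fmt.Println(chiSquared([]float64{5, 5}, []float64{5, 5}))
	// Completely different class distributions: should stay separate.
	fmt.Println(chiSquared([]float64{10, 0}, []float64{0, 10}))
}
```

A low statistic means the two bins look alike with respect to the class, so merging them loses little information. With three classes (two degrees of freedom), the critical value at the 90% level is roughly 4.61.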

Code excerpt: Chi-Merge binning

inst, err := base.ParseCSVToInstances("../examples/datasets/iris_headers.csv", true)
if err != nil {
	panic(err)
}
// Sort by the Attribute at index 0, the one ChiMerge will discretise.
attrs := []int{0}
inst.Sort(base.Ascending, attrs)
// 0.90 is the significance level.
filt := NewChiMergeFilter(inst, 0.90)
filt.AddAttribute(inst.GetAttr(0))
filt.Train()
instf := base.NewLazilyFilteredInstances(inst, filt)
  • Instances must be sorted by the Attribute used for ChiMerge before Train() is called. This effectively limits ChiMerge to operating on one Attribute at a time. This is a bug. Use multiple ChiMerge filters if further discretisation is needed.
  • NewChiMergeFilter creates the filter. The second argument is the significance level (in this case 90%). Chi-Merge won't merge two adjacent bins if their chi-squared statistic exceeds the critical value for this significance level.
  • Train computes the Chi-Merge bins.
  • Applying the filter discretises a set of Instances. A value lower than the lowest training value seen is assigned to the lowest bin, and a value higher than the highest training value seen is assigned to the highest bin.
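
The handling of out-of-range values can be illustrated with a small lookup over bin boundaries. This is a sketch under assumptions: binFor is a hypothetical helper, and representing the trained bins as a sorted slice of lower bounds is an illustration choice rather than GoLearn's internal layout.

```go
package main

import (
	"fmt"
	"sort"
)

// binFor returns the index of the bin containing v, given the sorted lower
// bounds of the bins learned during training. Values below the first bound
// are assigned to bin 0 and values above the last bound to the last bin.
// Hypothetical helper for illustration; it is not part of GoLearn.
func binFor(lowerBounds []float64, v float64) int {
	// SearchFloat64s returns the smallest index i with lowerBounds[i] >= v.
	i := sort.SearchFloat64s(lowerBounds, v)
	if i < len(lowerBounds) && lowerBounds[i] == v {
		return i // v sits exactly on a lower bound
	}
	if i == 0 {
		return 0 // v is below every training value: use the lowest bin
	}
	return i - 1 // includes v above every bound: use the highest bin
}

func main() {
	bounds := []float64{4.3, 5.5, 6.2}
	for _, v := range []float64{4.0, 5.0, 5.5, 9.0} {
		fmt.Println(v, "->", binFor(bounds, v))
	}
}
```

Clamping keeps the set of output categories fixed, so unseen data never produces a bin label that was absent at training time.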

Support status

Operating Systems: Mac OS X 10.8, Ubuntu 14.04
Go version: 1.2
GoLearn version: 0.1
Support status: Current
Next revision: On version upgrade