From 1784b6b54686f180b74328cde16835de6d61ac34 Mon Sep 17 00:00:00 2001
From: Sentimentron
Date: Tue, 30 Sep 2014 14:07:04 -0700
Subject: [PATCH] Created Custom DataGrids (markdown)

---
 Custom-DataGrids.md | 68 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 68 insertions(+)
 create mode 100644 Custom-DataGrids.md

diff --git a/Custom-DataGrids.md b/Custom-DataGrids.md
new file mode 100644
index 0000000..b5e272f
--- /dev/null
+++ b/Custom-DataGrids.md
@@ -0,0 +1,68 @@
+The default `DenseInstances` type may not meet your application's needs. Fortunately, `DenseInstances` is just one implementation of the `FixedDataGrid` interface, so you can write your own implementation to suit your needs.
+
+## Interfaces
+Here's the `DataGrid` interface as of GoLearn 0.1:
+```go
+type DataGrid interface {
+    // Retrieves a given Attribute's specification
+    GetAttribute(Attribute) (AttributeSpec, error)
+    // Retrieves details of every Attribute
+    AllAttributes() []Attribute
+    // Marks an Attribute as a class Attribute
+    AddClassAttribute(Attribute) error
+    // Unmarks an Attribute as a class Attribute
+    RemoveClassAttribute(Attribute) error
+    // Returns details of all class Attributes
+    AllClassAttributes() []Attribute
+    // Gets the bytes at a given position or nil
+    Get(AttributeSpec, int) []byte
+    // Convenience function for iteration.
+    MapOverRows([]AttributeSpec, func([][]byte, int) (bool, error)) error
+}
+```
+
+`FixedDataGrid` adds a few extra methods.
+```go
+type FixedDataGrid interface {
+    DataGrid
+    // Returns a string representation of a given row
+    RowString(int) string
+    // Returns the number of Attributes and rows currently allocated
+    Size() (int, int)
+}
+```
+
+[For more recent versions of GoLearn, refer to the automatically generated documentation.](https://godoc.org/github.com/sjwhitworth/golearn/base#DataGrid)
+
+## Functional description
+
+### `GetAttribute`
+`Attribute` implementations in GoLearn describe features of the machine learning problem. As of GoLearn 0.1, the implementations in `base` are `CategoricalAttribute` and `FloatAttribute` (both 64-bit), as well as `BinaryAttribute`. An `AttributeSpec` structure links an `Attribute` to an implementation-specific notion of where that `Attribute`'s underlying data is located in memory; `DenseInstances`, for example, uses it to store the column offset. `DataGrid` implementations outside of `base` can't add fields to `AttributeSpec` itself, but they can:
+* Maintain local `map[AttributeSpec]int` structures to offer fast resolution.
+* Extend `AttributeSpec` to add additional fields (untested).
+
+When deciding which `AttributeSpec` to return, implementations should use strict equality (via `Attribute.Equals`); otherwise subtle problems, such as a `CategoricalAttribute` with a corrupted ordering, might cause odd behaviour.
+
+### `AllAttributes`
+Returns a copy of all the available `Attribute`s. This is used to determine compatibility with other `DataGrid` implementations, and is usually a precursor to `GetAttribute` calls. The `Attribute`s should be returned in a fixed order.
+
+### `AddClassAttribute`
+Each `DataGrid` implementation keeps track of which `Attribute`s are designated as class variables. Normally, this is done using a `map[Attribute]bool` structure.
+
+### `RemoveClassAttribute`
+After a call to this method, the argument should no longer appear in the results of `AllClassAttributes`.
+
+### `AllClassAttributes`
+This method returns every `Attribute` designated as a class `Attribute` via previous calls to `AddClassAttribute`.
+
+### `Get`
+This method takes an `AttributeSpec` and a row number and returns a slice of bytes (which can be converted to another value using `Attribute`-specific methods). At least one byte should be returned.
+
+### `MapOverRows`
+This allows algorithms to iterate over all the rows in the `DataGrid` in whichever order is convenient for the underlying implementation. The first argument is a slice of `AttributeSpec` structures describing which fields are needed. The second argument is a callback function which takes two arguments: a slice of byte slices containing the binary data for the requested fields on a given row, and the row number. The callback returns a boolean saying whether the inner algorithm has terminated, and an optional error if the inner algorithm terminated with an error.
+
+### `RowString`
+`FixedDataGrid` adds a `RowString` method for easier inspection. The argument is the number of the row to be printed.
+
+### `Size`
+`Size` returns the current dimensions of the `FixedDataGrid`. The first value returned is the number of `Attribute`s; the second is the number of rows.
\ No newline at end of file
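
To make the relationship between `GetAttribute`, `Get` and `MapOverRows` more concrete, here is a minimal, self-contained sketch of a custom grid. It does not import GoLearn: `attr`, `attrSpec`, `sliceGrid` and `newDemoGrid` are hypothetical stand-ins invented for this illustration, and a real implementation would use `base.Attribute` and `base.AttributeSpec` instead. The sketch also assumes that a callback returning `false` stops the iteration early; check the boolean convention of the implementation you are targeting.

```go
package main

import (
	"errors"
	"fmt"
)

// attr and attrSpec are hypothetical stand-ins for GoLearn's Attribute and
// AttributeSpec types; they are not the real base package definitions.
type attr struct{ name string }

type attrSpec struct {
	attr   attr
	column int // where this implementation keeps the attribute's data
}

// sliceGrid is a hypothetical in-memory grid: rows[r][c] holds the raw bytes
// stored for column c on row r.
type sliceGrid struct {
	attrs []attr
	rows  [][][]byte
}

// GetAttribute resolves an attribute to its spec using strict equality.
func (g *sliceGrid) GetAttribute(a attr) (attrSpec, error) {
	for i, candidate := range g.attrs {
		if candidate == a {
			return attrSpec{attr: a, column: i}, nil
		}
	}
	return attrSpec{}, errors.New("attribute not found: " + a.name)
}

// Get returns the bytes at the given position, or nil if out of range.
func (g *sliceGrid) Get(s attrSpec, row int) []byte {
	if row < 0 || row >= len(g.rows) || s.column < 0 || s.column >= len(g.rows[row]) {
		return nil
	}
	return g.rows[row][s.column]
}

// MapOverRows calls fn once per row with the requested columns.
// Assumption: fn returning false stops the iteration early.
func (g *sliceGrid) MapOverRows(specs []attrSpec, fn func([][]byte, int) (bool, error)) error {
	buf := make([][]byte, len(specs))
	for r := range g.rows {
		for i, s := range specs {
			buf[i] = g.Get(s, r)
		}
		ok, err := fn(buf, r)
		if err != nil {
			return err
		}
		if !ok {
			break
		}
	}
	return nil
}

// newDemoGrid builds a three-row grid with a "width" and a "label" column.
func newDemoGrid() *sliceGrid {
	return &sliceGrid{
		attrs: []attr{{name: "width"}, {name: "label"}},
		rows:  [][][]byte{{{1}, {0}}, {{2}, {1}}, {{3}, {1}}},
	}
}

func main() {
	g := newDemoGrid()
	spec, err := g.GetAttribute(attr{name: "label"})
	if err != nil {
		panic(err)
	}
	visited := 0
	err = g.MapOverRows([]attrSpec{spec}, func(row [][]byte, _ int) (bool, error) {
		visited++
		return row[0][0] == 0, nil // stop after the first non-zero label
	})
	fmt.Println(visited, err) // prints "2 <nil>"
}
```

Resolving the spec once and reusing it for every row mirrors the intended call pattern: `GetAttribute` does the (possibly slow) lookup, while `Get` and `MapOverRows` only follow the stored column position.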