Created Custom DataGrids (markdown)

Sentimentron 2014-09-30 14:07:04 -07:00
parent b09b02548a
commit 1784b6b546
1 changed files with 68 additions and 0 deletions

Custom-DataGrids.md Normal file

@ -0,0 +1,68 @@
The default `DenseInstances` type may not meet your application's needs. Fortunately, `DenseInstances` is just one implementation of the `FixedDataGrid` interface, so you can write an implementation of your own.
## Interfaces
Here's the `DataGrid` interface as of GoLearn 0.1:
```go
type DataGrid interface {
	// Retrieves a given Attribute's specification
	GetAttribute(Attribute) (AttributeSpec, error)
	// Retrieves details of every Attribute
	AllAttributes() []Attribute
	// Marks an Attribute as a class Attribute
	AddClassAttribute(Attribute) error
	// Unmarks an Attribute as a class Attribute
	RemoveClassAttribute(Attribute) error
	// Returns details of all class Attributes
	AllClassAttributes() []Attribute
	// Gets the bytes at a given position or nil
	Get(AttributeSpec, int) []byte
	// Convenience function for iteration.
	MapOverRows([]AttributeSpec, func([][]byte, int) (bool, error)) error
}
```
`FixedDataGrid` embeds `DataGrid` and adds two extra methods.
```go
type FixedDataGrid interface {
	DataGrid
	// Returns a string representation of a given row
	RowString(int) string
	// Returns the number of Attributes and rows currently allocated
	Size() (int, int)
}
```
[Refer to the automatically generated documentation for more recent versions of GoLearn.](https://godoc.org/github.com/sjwhitworth/golearn/base#DataGrid)
## Functional description
### `GetAttribute`
`Attribute` implementations in GoLearn describe features of the machine learning problem. As of GoLearn 0.1, the implementations in `base` are `CategoricalAttribute` and `FloatAttribute` (both 64-bit), as well as `BinaryAttribute`. An `AttributeSpec` structure links an `Attribute` to an implementation-specific idea of where the data underlying that `Attribute` is located in memory. `DenseInstances`, for example, uses it to store the column offset. `DataGrid` implementations outside of `base` won't be able to add additional fields to an `AttributeSpec`, but they can:
* Maintain local `map[AttributeSpec]int` structures to offer fast resolution.
* Extend `AttributeSpec` to add additional fields (untested).
When deciding which `AttributeSpec` to return, implementations should use strict equality (via `Attribute.Equals`); otherwise problems such as `CategoricalAttribute`s with corrupted orderings might cause odd behaviour.
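As an illustration of the first option, here is a minimal sketch of a lookup that resolves an `Attribute` with `Equals` and keeps a local `map[AttributeSpec]int`. The `Attribute`, `AttributeSpec`, `floatAttr` and `grid` types, and the `newGrid` helper, are simplified stand-ins written for this example, not GoLearn's own definitions.

```go
package main

import (
	"fmt"
)

// Attribute is a simplified stand-in for GoLearn's Attribute interface.
type Attribute interface {
	GetName() string
	Equals(Attribute) bool
}

// floatAttr is a hypothetical Attribute used only in this sketch.
type floatAttr struct{ name string }

func (f floatAttr) GetName() string { return f.name }

func (f floatAttr) Equals(other Attribute) bool {
	o, ok := other.(floatAttr)
	return ok && o.name == f.name
}

// AttributeSpec is a stand-in: here it only wraps the Attribute.
type AttributeSpec struct {
	attr Attribute
}

// grid is a hypothetical DataGrid implementation.
type grid struct {
	specs   []AttributeSpec       // kept in a fixed order
	columns map[AttributeSpec]int // local lookup from spec to column index
}

// newGrid registers each Attribute and records its column index.
func newGrid(attrs ...Attribute) *grid {
	g := &grid{columns: make(map[AttributeSpec]int)}
	for i, a := range attrs {
		spec := AttributeSpec{attr: a}
		g.specs = append(g.specs, spec)
		g.columns[spec] = i
	}
	return g
}

// GetAttribute compares Attributes with Equals instead of ==.
func (g *grid) GetAttribute(a Attribute) (AttributeSpec, error) {
	for _, spec := range g.specs {
		if spec.attr.Equals(a) {
			return spec, nil
		}
	}
	return AttributeSpec{}, fmt.Errorf("no such attribute: %s", a.GetName())
}

func main() {
	g := newGrid(floatAttr{"width"}, floatAttr{"height"})
	spec, err := g.GetAttribute(floatAttr{"height"})
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(g.columns[spec]) // column index of "height"
}
```

Using the spec as a map key only works while its fields are comparable with `==`, so this approach suits specs that carry simple values.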
### `AllAttributes`
Simply returns a copy of all of the available `Attribute`s. This is used for determining compatibility with other `DataGrid` implementations, and is usually a precursor to `GetAttribute` calls. The `Attribute`s should be returned in a fixed order.
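One way to satisfy both requirements is to keep the `Attribute`s in a slice and hand out a copy of it. The sketch below assumes a hypothetical `grid` type and a cut-down `Attribute` interface; neither is GoLearn's own.

```go
package main

import "fmt"

// Attribute is a simplified stand-in for GoLearn's Attribute interface.
type Attribute interface {
	GetName() string
}

// namedAttr is a hypothetical Attribute used only in this sketch.
type namedAttr string

func (n namedAttr) GetName() string { return string(n) }

// grid is a hypothetical DataGrid implementation.
type grid struct {
	attrs []Attribute // insertion order doubles as the fixed order
}

// AllAttributes returns a copy, so callers cannot reorder or
// overwrite the grid's own slice.
func (g *grid) AllAttributes() []Attribute {
	out := make([]Attribute, len(g.attrs))
	copy(out, g.attrs)
	return out
}

func main() {
	g := &grid{attrs: []Attribute{namedAttr("width"), namedAttr("height")}}
	attrs := g.AllAttributes()
	attrs[0] = namedAttr("overwritten") // only changes the caller's copy
	fmt.Println(g.AllAttributes()[0].GetName())
}
```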
### `AddClassAttribute`
Each `DataGrid` implementation keeps track of which `Attribute`s are designated class variables. Normally, this is done using a `map[Attribute]bool` structure.
### `RemoveClassAttribute`
A call to this method means that the argument should no longer appear in calls to `AllClassAttributes`.
### `AllClassAttributes`
This method returns every `Attribute` designated as a class `Attribute` via previous calls to `AddClassAttribute`.
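The three class `Attribute` methods can be sketched around a single `map[Attribute]bool`. The `grid` and `namedAttr` types and the `newGrid` helper are hypothetical, and rejecting unknown `Attribute`s with an error is a design choice of this example rather than documented GoLearn behaviour.

```go
package main

import (
	"fmt"
)

// Attribute is a simplified stand-in for GoLearn's Attribute interface.
type Attribute interface {
	GetName() string
}

// namedAttr is a hypothetical Attribute used only in this sketch.
type namedAttr string

func (n namedAttr) GetName() string { return string(n) }

// grid is a hypothetical DataGrid implementation.
type grid struct {
	attrs      []Attribute        // all Attributes, in a fixed order
	classAttrs map[Attribute]bool // which of them are class Attributes
}

func newGrid(attrs ...Attribute) *grid {
	return &grid{attrs: attrs, classAttrs: make(map[Attribute]bool)}
}

// has reports whether the grid knows about the Attribute.
// Map keys and this check both rely on ==, which is fine for namedAttr.
func (g *grid) has(a Attribute) bool {
	for _, x := range g.attrs {
		if x == a {
			return true
		}
	}
	return false
}

func (g *grid) AddClassAttribute(a Attribute) error {
	if !g.has(a) {
		return fmt.Errorf("unknown attribute: %s", a.GetName())
	}
	g.classAttrs[a] = true
	return nil
}

func (g *grid) RemoveClassAttribute(a Attribute) error {
	if !g.has(a) {
		return fmt.Errorf("unknown attribute: %s", a.GetName())
	}
	delete(g.classAttrs, a)
	return nil
}

// AllClassAttributes walks attrs rather than the map so that the
// result comes back in a stable order.
func (g *grid) AllClassAttributes() []Attribute {
	var out []Attribute
	for _, a := range g.attrs {
		if g.classAttrs[a] {
			out = append(out, a)
		}
	}
	return out
}

func main() {
	g := newGrid(namedAttr("width"), namedAttr("species"))
	_ = g.AddClassAttribute(namedAttr("species"))
	fmt.Println(len(g.AllClassAttributes()))
	_ = g.RemoveClassAttribute(namedAttr("species"))
	fmt.Println(len(g.AllClassAttributes()))
}
```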
### `Get`
This method takes an `AttributeSpec` and a row number and returns a slice of bytes (which can be converted to another value using `Attribute`-specific methods). At least one byte should be returned.
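To make the byte-slice contract concrete, the sketch below stores 64-bit floats column by column and returns the 8 bytes for one cell. The little-endian IEEE 754 layout, the `packFloat` and `unpackFloat` helpers, and the `grid` and `AttributeSpec` types are all assumptions made for this example; GoLearn's `Attribute` types define their own conversions.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

const cellSize = 8 // bytes per 64-bit value (an assumption of this sketch)

// packFloat encodes a float64 as 8 little-endian bytes.
func packFloat(v float64) []byte {
	buf := make([]byte, cellSize)
	binary.LittleEndian.PutUint64(buf, math.Float64bits(v))
	return buf
}

// unpackFloat reverses packFloat.
func unpackFloat(b []byte) float64 {
	return math.Float64frombits(binary.LittleEndian.Uint64(b))
}

// AttributeSpec is a stand-in that only records a column index.
type AttributeSpec struct{ column int }

// grid is a hypothetical DataGrid; cols[c] holds the cells of
// column c back to back, cellSize bytes each.
type grid struct {
	cols [][]byte
}

// Get returns the bytes for one cell, or nil if the position is invalid.
func (g *grid) Get(spec AttributeSpec, row int) []byte {
	if spec.column < 0 || spec.column >= len(g.cols) {
		return nil
	}
	start := row * cellSize
	if row < 0 || start+cellSize > len(g.cols[spec.column]) {
		return nil
	}
	return g.cols[spec.column][start : start+cellSize]
}

func main() {
	col := append(packFloat(1.5), packFloat(2.5)...)
	g := &grid{cols: [][]byte{col}}
	fmt.Println(unpackFloat(g.Get(AttributeSpec{column: 0}, 1)))
}
```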
### `MapOverRows`
This allows algorithms to iterate over all the rows in the `DataGrid` in whichever order is convenient for the underlying implementation. The first argument is a slice of `AttributeSpec` structures describing which fields are needed. The second argument is a function taking two arguments: a slice of byte slices holding the binary values of the requested fields on a given row, and the row number. It returns a boolean that tells the `DataGrid` whether to keep iterating (`false` means the inner algorithm has terminated), and an optional error if the inner algorithm terminated with an error.
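A row-oriented implementation of `MapOverRows` might look like the sketch below. It assumes that a `true` return from the callback means "continue with the next row", so check that against the implementation you are targeting. The `grid` and `AttributeSpec` types are hypothetical stand-ins.

```go
package main

import (
	"fmt"
)

// AttributeSpec is a stand-in that only records a column index.
type AttributeSpec struct{ column int }

// grid is a hypothetical DataGrid; rows[r][c] is the cell at row r, column c.
type grid struct {
	rows [][][]byte
}

// MapOverRows calls fn once per row with the requested cells.
// It stops early when fn returns false or a non-nil error.
func (g *grid) MapOverRows(specs []AttributeSpec, fn func([][]byte, int) (bool, error)) error {
	buf := make([][]byte, len(specs)) // reused between rows
	for r, row := range g.rows {
		for i, s := range specs {
			if s.column < 0 || s.column >= len(row) {
				return fmt.Errorf("row %d has no column %d", r, s.column)
			}
			buf[i] = row[s.column]
		}
		ok, err := fn(buf, r)
		if err != nil {
			return err
		}
		if !ok {
			break
		}
	}
	return nil
}

func main() {
	g := &grid{rows: [][][]byte{
		{{1}, {10}},
		{{2}, {20}},
		{{3}, {30}},
	}}
	sum := 0
	err := g.MapOverRows([]AttributeSpec{{column: 1}}, func(vals [][]byte, rowNo int) (bool, error) {
		sum += int(vals[0][0])
		return rowNo < 1, nil // ask to stop after the second row
	})
	fmt.Println(sum, err)
}
```

Because the buffer is reused between rows, a callback that needs to keep values beyond its own invocation should copy them.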
### `RowString`
`FixedDataGrid` adds a `RowString` method for easier inspection. The argument is the row number to be printed.
### `Size`
`Size` returns the current dimensions of the `FixedDataGrid`. The first value returned is the number of Attributes, the second value is the number of rows.
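The two `FixedDataGrid` additions are small enough to sketch together. The `grid` type below is hypothetical and stores plain `float64` values to keep the example short; the output format of `RowString` is this example's choice, not GoLearn's.

```go
package main

import (
	"fmt"
	"strings"
)

// grid is a hypothetical FixedDataGrid holding float64 cells.
type grid struct {
	names []string    // one name per Attribute
	rows  [][]float64 // rows[r][c] is the value at row r, column c
}

// Size returns the number of Attributes, then the number of rows.
func (g *grid) Size() (int, int) {
	return len(g.names), len(g.rows)
}

// RowString formats one row, or returns "" for an invalid row number.
func (g *grid) RowString(row int) string {
	if row < 0 || row >= len(g.rows) {
		return ""
	}
	parts := make([]string, len(g.rows[row]))
	for i, v := range g.rows[row] {
		parts[i] = fmt.Sprintf("%.2f", v)
	}
	return strings.Join(parts, " ")
}

func main() {
	g := &grid{
		names: []string{"width", "height"},
		rows:  [][]float64{{1.5, 2}, {3.5, 4}},
	}
	attrs, rows := g.Size()
	fmt.Println(attrs, rows)
	fmt.Println(g.RowString(1))
}
```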