Splitting data into training and test datasets
Sentimentron edited this page 2014-10-29 06:39:13 -07:00

The InstancesTrainTestSplit function divides a given FixedDataGrid into training and test portions. The first return value holds approximately the specified proportion of the rows, and the second holds the remainder. This is useful for quickly evaluating a given algorithm.

Internally, the function generates a random number between 0 and 1 for each row. If that number is less than the specified proportion, the row number is added to the training set; otherwise it is added to the test set. Because each row is assigned independently, the sizes of the two sets only approximate the requested proportion. The training and test FixedDataGrid return values are provided by InstancesView, which reorganises the underlying data in a memory-efficient way.

Code excerpt: loading a dataset and splitting it into training and test sets

// Load in the iris dataset (true indicates that the CSV file has a header row)
iris, err := base.ParseCSVToInstances("../datasets/iris_headers.csv", true)
if err != nil {
	panic(err)
}
// Create an approximately 60-40 training-test split
trainData, testData := base.InstancesTrainTestSplit(iris, 0.60)

This code snippet asks for approximately 60% of the data to be returned as training data, leaving the remaining rows (roughly 40%) for testing.