Workbench Classification Example

非同寻常 2014-12-03


An Iris flower

This example shows how to use the Encog Workbench, together with the Encog Analyst, to perform classification. A Command Line Classification Example is also available. Classification is the process by which a Machine Learning Method learns to assign data to classes. With Supervised Learning, the Machine Learning Method is provided with a Training Set and learns to classify each Training Set element into the appropriate class. Once trained, it should be able to classify new data into the appropriate classes, based on what it learned from the Training Set.

This example will use the Encog Analyst to learn to classify the species of Iris presented to it. This example makes use of the Iris Data Set. This is a classic training set that presents four attributes and one species label for 150 irises.


Steps for Running the Example

To walk through this example, follow these steps. This example has been updated to Encog 3.0.

Step 1: Generate the Iris Data

First, start up the Encog Workbench and create a new project. Name it anything you like, such as Iris example. This will create an empty folder to hold your data. You now need to obtain your data. The Encog Workbench contains a number of built-in data sets, and the Iris Data Set is one of them. Choose Tools:Generate Training Data from the Encog Workbench menu bar. Choose to generate the Iris Data Set, and name it something such as iris.csv. This should create a CSV File. You can see a small sample of this data here.

Fisher's Iris Data
Sepal Length	Sepal Width	Petal Length	Petal Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
7.0	3.2	4.7	1.4	versicolor
6.4	3.2	4.5	1.5	versicolor
6.9	3.1	4.9	1.5	versicolor
6.3	3.3	6.0	2.5	virginica
5.8	2.7	5.1	1.9	virginica
7.1	3.0	5.9	2.1	virginica

As you can see, there are three species of iris, and four measurements are provided for each flower. We would like to create a Machine Learning Method that will learn to predict the species of an iris from these four measurements alone. We will divide this data into a Training Data set and an Evaluation Data set. The larger training data set will be used for the Machine Learning Method to learn from. The evaluation data set will be used to test the Machine Learning Method on data that it was not trained with. It is also possible to use cross validation with a single data set.

Step 2: Use the Analyst Wizard

Now that you have input data in the workbench, you should use the Encog Analyst Wizard to create an Encog Analyst File. The Encog Analyst File (*.ega) is a script file that tells Encog Analyst how to process your data file. To generate an EGA File, right-click iris.csv and choose Analyst Wizard. This will show a screen similar to the following.

You must change the value shown above for CSV File Headers. Place a check in the CSV File Header box. Optionally, you could specify that the target field is species, which is the name of the column heading in the CSV File. However, since this is a classification problem and there is only one class field, the Encog Analyst is smart enough to figure out that this one class field is what you are trying to classify. If there were multiple class fields, you would have to enter a target field. For this example we will use the default Feedforward Neural Network. Another popular choice for this data set is a Support Vector Machine. Encog Analyst will now generate an EGA File with the same base name as your data file. You should now see two files in the workbench project area: iris.csv and iris.ega. Double-click the iris.ega file and you will see the following.

This shows you the EGA File that was generated by analyzing the Iris data. You can see the complete file here.

[HEADER]
[HEADER:DATASOURCE]
rawFile=FILE_RAW
sourceFile=
sourceFormat=decpnt|comma
sourceHeaders=t
[SETUP]
[SETUP:CONFIG]
allowedClasses=integer,string
csvFormat=decpnt|comma
inputHeaders=t
maxClassCount=50
[SETUP:FILENAMES]
FILE_RANDOMIZE=iris_random.csv
FILE_EVAL_NORM=iris_eval_norm.csv
FILE_BALANCE=iris_balance.csv
FILE_EVAL=iris_eval.csv
FILE_RAW=iris.csv
FILE_ML=iris_train.eg
FILE_OUTPUT=iris_output.csv
FILE_CLUSTER=iris_cluster.csv
FILE_NORMALIZE=iris_norm.csv
FILE_TRAINSET=iris_train.egb
FILE_TRAIN=iris_train.csv
[DATA]
[DATA:CONFIG]
goal=classification
[DATA:STATS]
"name","isclass","iscomplete","isint","isreal","amax","amin","mean","sdev"
"sepal_l",0,1,0,1,7.9,4.3,5.8433333333,0.8253012918
"sepal_w",0,1,0,1,4.4,2,3.0573333333,0.4344109677
"petal_l",0,1,0,1,6.9,1,3.758,1.7594040658
"petal_w",0,1,0,1,2.5,0.1,1.1993333333,0.7596926279
"species",1,1,0,0,0,0,0,0
[DATA:CLASSES]
"field","code","name"
"species","Iris-setosa","Iris-setosa",50
"species","Iris-versicolor","Iris-versicolor",50
"species","Iris-virginica","Iris-virginica",50
[NORMALIZE]
[NORMALIZE:CONFIG]
sourceFile=FILE_TRAIN
targetFile=FILE_NORMALIZE
[NORMALIZE:RANGE]
"name","io","timeSlice","action","high","low"
"sepal_l","input",0,"range",1,-1
"sepal_w","input",0,"range",1,-1
"petal_l","input",0,"range",1,-1
"petal_w","input",0,"range",1,-1
"species","output",0,"equilateral",1,-1
[RANDOMIZE]
[RANDOMIZE:CONFIG]
sourceFile=FILE_RAW
targetFile=FILE_RANDOMIZE
[CLUSTER]
[CLUSTER:CONFIG]
clusters=3
sourceFile=FILE_EVAL
targetFile=FILE_CLUSTER
type=kmeans
[BALANCE]
[BALANCE:CONFIG]
balanceField=
countPer=
sourceFile=
targetFile=
[SEGREGATE]
[SEGREGATE:CONFIG]
sourceFile=FILE_RANDOMIZE
[SEGREGATE:FILES]
"file","percent"
"FILE_TRAIN",75
"FILE_EVAL",25
[GENERATE]
[GENERATE:CONFIG]
sourceFile=FILE_NORMALIZE
targetFile=FILE_TRAINSET
[ML]
[ML:CONFIG]
architecture=?:B->TANH->6:B->TANH->?
evalFile=FILE_EVAL
machineLearningFile=FILE_ML
outputFile=FILE_OUTPUT
trainingFile=FILE_TRAINSET
type=feedforward
[ML:TRAIN]
arguments=
cross=
targetError=0.01
type=rprop
[TASKS]
[TASKS:task-cluster]
cluster
[TASKS:task-create]
create
[TASKS:task-evaluate]
evaluate
[TASKS:task-evaluate-raw]
set ML.CONFIG.evalFile="FILE_EVAL_NORM"
set NORMALIZE.CONFIG.sourceFile="FILE_EVAL"
set NORMALIZE.CONFIG.targetFile="FILE_EVAL_NORM"
normalize
evaluate-raw
[TASKS:task-full]
randomize
segregate
normalize
generate
create
train
evaluate
[TASKS:task-generate]
randomize
segregate
normalize
generate
[TASKS:task-train]
train

Now that the EGA File has been generated, the wizard will make no further changes to it. You can make additional customizations to the EGA File by directly editing its text. For more information on the format of this file, see the article on EGA Files and Encog Analyst.
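Because the EGA File is plain text, it can also be inspected with a few lines of code. The following Python sketch is not part of Encog; the function names parse_ega and properties are hypothetical, and it assumes only what the listing above shows: section headers in square brackets, followed by lines that belong to that section.

```python
def parse_ega(text):
    """Group the lines of an EGA script under their [SECTION:SUBSECTION] header."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]          # e.g. "ML:TRAIN"
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return sections


def properties(lines):
    """Turn name=value lines into a dict (intended for the CONFIG-style sections)."""
    return dict(line.split("=", 1) for line in lines if "=" in line)


# A short excerpt of the iris.ega listing shown above.
sample = """\
[SEGREGATE:CONFIG]
sourceFile=FILE_RANDOMIZE
[ML:TRAIN]
arguments=
cross=
targetError=0.01
type=rprop
[TASKS:task-train]
train
"""

ega = parse_ega(sample)
train_cfg = properties(ega["ML:TRAIN"])
print(train_cfg["type"], train_cfg["targetError"])   # rprop 0.01
print(ega["TASKS:task-train"])                       # ['train']
```

A script like this is only useful for reading values out of the file. Changes are still made by editing the text in the EGA File Editor.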

Step 3: Visualizing Your Data

You will notice that the EGA File Editor, which was opened in the last step, has several buttons on the top. The Visualize button provides several ways to visualize your data. Clicking the Visualize button will provide you with a list of data visualizations provided by Encog Analyst.

Ranges Report

The first is the range report. The range report tells you what ranges your data columns were in. Encog Analyst determined this while analyzing the data file. All of these ranges are saved inside of the EGA File. You can see some of the data produced in a range report here:

There is additional information if you scroll down. This information is necessary for the analyst to normalize your data. Most Machine Learning Methods require some form of normalization to work with data.
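To see why the ranges matter, consider the "range" action listed for the four input fields in the EGA File, with a high of 1 and a low of -1. The sketch below shows the usual min-max mapping for such a normalization, using the amin and amax values recorded for sepal_l. It is an illustration in Python under that assumption, not Encog's own code, and the function names are hypothetical.

```python
def normalize_range(x, data_min, data_max, low=-1.0, high=1.0):
    """Map x from [data_min, data_max] onto [low, high]."""
    return (x - data_min) / (data_max - data_min) * (high - low) + low


def denormalize_range(y, data_min, data_max, low=-1.0, high=1.0):
    """Inverse mapping: recover the raw value from a normalized one."""
    return (y - low) / (high - low) * (data_max - data_min) + data_min


# amin and amax for sepal_l, taken from the [DATA:STATS] section.
SEPAL_L_MIN, SEPAL_L_MAX = 4.3, 7.9

for value in (4.3, 5.1, 7.9):
    scaled = normalize_range(value, SEPAL_L_MIN, SEPAL_L_MAX)
    print(value, round(scaled, 4))
```

The smallest observed value maps to -1 and the largest to 1, so every input column ends up on the same scale regardless of its original units.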

Scatter Plot

The Iris data set scatter plot

A scatter plot can also show some interesting information about your data. You can easily see if your data forms clusters. Choose Scatter Plot from the dialog that appears when you click the Visualize button in the EGA File Editor. You will be prompted for the attributes you wish to plot. Place a check in all check boxes. This will produce a multivariate scatter plot, as seen here. This multivariate scatter plot shows pairings of each of the attributes, which allows you to see how the pairs relate to each other. Ideally you will see large clusters of similarly colored dots. If you do not, your data is either very noisy, or is simply not expressed in a way that is going to be easy for a Machine Learning Method to learn. The iris data set does have well defined clusters.

Another interesting feature of the Iris data set is that not all three clusters are linearly separable. Iris Setosa is linearly separable from the other two species, but Iris Versicolor and Iris Virginica are not linearly separable from each other, at least not on all pairings of attributes.
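The idea can be checked by hand on the nine sample rows from Step 1. The Python sketch below tests only the simplest kind of linear boundary, a single threshold on one attribute, and it uses only those nine rows, so it illustrates the concept rather than proving anything about the full 150-row data set. The function name threshold_separates is hypothetical.

```python
# The nine sample rows shown in Step 1:
# (sepal length, sepal width, petal length, petal width, species)
SAMPLE = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
]

SEPAL_WIDTH, PETAL_LENGTH = 1, 2


def threshold_separates(rows, column, group_a, group_b):
    """True if one threshold on `column` puts group_a on one side and group_b on the other."""
    a = [row[column] for row in rows if row[4] in group_a]
    b = [row[column] for row in rows if row[4] in group_b]
    return max(a) < min(b) or max(b) < min(a)


# Setosa against the other two species, using petal length.
print(threshold_separates(SAMPLE, PETAL_LENGTH, {"setosa"}, {"versicolor", "virginica"}))  # True

# Versicolor against virginica, using sepal width: the value ranges overlap.
print(threshold_separates(SAMPLE, SEPAL_WIDTH, {"versicolor"}, {"virginica"}))  # False
```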

Additionally, there are only two visible clusters if you do not have species information. Imagine all dots were black. You would only see two clusters. Because of this, a simple unsupervised clustering Machine Learning Method would not be able to learn the difference between all three species. This also illustrates the difference between clustering and classification. Clustering is unsupervised and simply places data into natural clusters. Classification is generally supervised, and learns to classify new data that it has not yet seen.

Step 4: Execute the Analyst Script

Execute the Encog Analyst Script

Now that the EGA File has been created, you can execute it. Click the Execute button in the EGA File Editor that was opened in Step 2. This takes the data through seven steps. There may be more or fewer steps for other Encog Analyst projects, depending on what options are chosen. The entire execution should take under a minute on most computers.

Step 1: Randomize - Shuffle the file into a random order.
Step 2: Segregate - Create a Training Data Set and an Evaluation Data Set.
Step 3: Normalize - Normalize the data into a form usable by the selected Machine Learning Method.
Step 4: Generate - Generate the training data into an EGB File that can be used to train.
Step 5: Create - Generate the selected Machine Learning Method.
Step 6: Train - Train the selected Machine Learning Method.
Step 7: Evaluate - Evaluate the Machine Learning Method.
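The first two steps are easy to picture in code. The following Python sketch shuffles a list of rows and then splits it using the 75 and 25 percent figures from the [SEGREGATE:FILES] section of the EGA File. The function names are hypothetical, and the way the split point is rounded is an assumption; Encog may divide 150 rows slightly differently.

```python
import random


def randomize(rows, seed=None):
    """Return a shuffled copy of rows, leaving the original list untouched."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def segregate(rows, train_percent=75):
    """Split rows into (training, evaluation) by percentage, rounding down."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


# Stand-in for the 150 data rows of iris.csv (row numbers instead of measurements).
rows = list(range(150))

train, evaluation = segregate(randomize(rows, seed=42))
print(len(train), len(evaluation))   # 112 38
```

Shuffling before splitting matters here because the raw file lists the irises grouped by species. Without it, the evaluation set would contain only the species at the end of the file.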

This process will also create a number of files. The complete list of files in this project is:

iris.csv - The raw data.
iris.ega - The EGA File. This is the Encog Analyst script.
iris_eval.csv - The evaluation data.
iris_norm.csv - The normalized version of iris_train.csv.
iris_output.csv - The output from running iris_eval.csv.
iris_random.csv - The randomized (shuffled) version of iris.csv.
iris_train.csv - The training data.
iris_train.eg - The Machine Learning Method that was trained.
iris_train.egb - The binary training data, created from iris_norm.csv.

Step 5: Examine the Output

To see how well the newly trained Machine Learning Method performed, examine iris_output.csv. You can see part of this file here.

"sepal_l","sepal_w","petal_l","petal_w","species","Output:species"
4.6,3.1,1.5,0.2,Iris-setosa,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa,Iris-setosa
5.8,2.6,4.0,1.2,Iris-versicolor,Iris-versicolor
5.5,2.6,4.4,1.2,Iris-versicolor,Iris-versicolor
5.2,3.5,1.5,0.2,Iris-setosa,Iris-setosa
4.9,2.4,3.3,1.0,Iris-versicolor,Iris-versicolor
6.2,2.8,4.8,1.8,Iris-virginica,Iris-virginica
5.5,2.4,3.8,1.1,Iris-versicolor,Iris-versicolor
5.8,2.8,5.1,2.4,Iris-virginica,Iris-virginica
5.6,2.5,3.9,1.1,Iris-versicolor,Iris-versicolor

As you can see, the learning method's output (the far-right column) matches the expected output (the second-to-last column) well.
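Rather than comparing the two columns by eye, you can count the matches. The Python sketch below does this for the first few rows of the excerpt above, using the column names species and Output:species from its header line. The accuracy function is hypothetical, and the result describes only these excerpt rows, not the whole evaluation file.

```python
import csv
import io

# First rows of the iris_output.csv excerpt shown above.
OUTPUT_SAMPLE = '''"sepal_l","sepal_w","petal_l","petal_w","species","Output:species"
4.6,3.1,1.5,0.2,Iris-setosa,Iris-setosa
5.4,3.4,1.5,0.4,Iris-setosa,Iris-setosa
5.8,2.6,4.0,1.2,Iris-versicolor,Iris-versicolor
6.2,2.8,4.8,1.8,Iris-virginica,Iris-virginica
'''


def accuracy(csv_text, expected="species", predicted="Output:species"):
    """Fraction of rows whose predicted class equals the expected class."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    correct = sum(1 for row in rows if row[expected] == row[predicted])
    return correct / len(rows)


print(accuracy(OUTPUT_SAMPLE))   # 1.0 for these four rows
```

To check a complete run, read the text of iris_output.csv from your project folder and pass it to the same function.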

Step 6: Analyze the Network

A feedforward neural network for the Iris data set

This example assumes that you used a Feedforward Neural Network as the Machine Learning Method. However, a Support Vector Machine or another compatible Machine Learning Method could have been chosen in Step 2. Encog makes Machine Learning Methods very interchangeable.

You can examine the neural network created with this example. Double-click the iris_train.eg file and click the Visualize button. Choose Network Structure to see the structure of the network.

Understanding the Example

The Encog Analyst actually shielded you from a fair amount of complexity. All normalization decisions were made automatically and encoded into the EGA File. If you like, you can change any of the options in the EGA File and rerun the example. The wizard is really meant to just give you a starting point with an EGA File. Additionally, equilateral class normalization was chosen for the species field. Equilateral normalization is often a good choice when there are more than two classes. It requires one fewer output neuron than the total number of classes, so the three iris species are encoded with two output neurons.
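To make the two output neurons concrete, the Python sketch below builds an equilateral code: each of n classes gets a vector of n - 1 values, and every pair of class vectors is the same distance apart. This follows the commonly described construction for equilateral encoding and is meant as an illustration. Encog's own implementation may differ in detail, and the function names here are hypothetical.

```python
import math


def equilateral(n, low=-1.0, high=1.0):
    """Return n vectors of length n - 1 that are all equally far apart."""
    if n < 2:
        raise ValueError("equilateral encoding needs at least two classes")
    m = [[0.0] * (n - 1) for _ in range(n)]
    m[0][0], m[1][0] = -1.0, 1.0          # two classes: the ends of a line
    for k in range(2, n):
        # Shrink the existing points, then push them away from the new axis.
        f = math.sqrt(k * k - 1.0) / k
        for i in range(k):
            for j in range(k - 1):
                m[i][j] *= f
            m[i][k - 1] = -1.0 / k
        m[k][k - 1] = 1.0                  # the new class sits on the new axis
    # Rescale from [-1, 1] to the requested [low, high] range.
    return [[(v + 1.0) / 2.0 * (high - low) + low for v in row] for row in m]


def decode(output, codes):
    """Index of the class whose code vector is closest to the network output."""
    return min(range(len(codes)), key=lambda i: math.dist(output, codes[i]))


codes = equilateral(3)
for species, code in zip(("Iris-setosa", "Iris-versicolor", "Iris-virginica"), codes):
    print(species, [round(v, 3) for v in code])

print(decode([0.1, 0.9], codes))   # 2, the closest code vector
```

The pairing of species with code vectors in the loop is arbitrary and only for display. When the network produces its two output values, the predicted class is the one whose code vector lies nearest to them, which is what decode does.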

The analyst wizard also made a quick estimate of how many hidden neurons might be needed. You may get better results by varying this number.