What is a Measure
A crucial aspect of generating models is estimating their quality. Some metric is needed to decide whether a model is poor, good, very good, etc. In the SUMO Toolbox this role is played by the Measure component. A measure is an object that, given a model, returns an estimate of its quality. This can be something as simple as the error of the model fit on the training data, or it may involve a complex calculation to see if a model satisfies some physical constraint.
There are two aspects to a Measure:
- The quality estimation algorithm
- The error function
The first is the algorithm used to estimate the model quality, for example the in-sample error or the 5-fold cross-validation score. The error function determines what kind of error you want to use. For instance, you can calculate the in-sample error using the average absolute error, the root mean square error, a maximum relative error, etc. Note that the error function may not be relevant for every type of measure (e.g., AIC).
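To make the distinction concrete, here is a minimal Python sketch of the three error functions mentioned above. The function names are hypothetical and are not the toolbox's own identifiers; the toolbox's actual definitions may differ in detail (for example, in how relative errors are normalized).

```python
import math

def mean_absolute_error(y_true, y_pred):
    """Average absolute error over all samples."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_square_error(y_true, y_pred):
    """Square root of the mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def max_relative_error(y_true, y_pred):
    """Largest |t - p| / |t|; assumes no true value is zero."""
    return max(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))

# The same predictions scored with three different error functions.
y_true = [1.0, 2.0, 4.0]
y_pred = [1.5, 2.0, 3.0]
print(mean_absolute_error(y_true, y_pred))
print(root_mean_square_error(y_true, y_pred))
print(max_relative_error(y_true, y_pred))
```

The same quality estimation algorithm (here, a plain comparison of predictions against known values) can therefore rank models differently depending on which error function is plugged in.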
It cannot be stressed enough that the choice of measure and error function depends on your problem characteristics, the data distribution, and the model type you will use to fit the data. It is extremely important to think this through ("what do you want?") before starting any modeling.
A recommended read is the technical report available here.
In general, the default measure, 5-fold CrossValidation, is an acceptable choice. However, it is also very expensive, as it requires that a model be re-trained for each fold. This may slow things down if a model is expensive to train (e.g., neural nets). Also, CrossValidation can give biased results if the data is clustered or scarce. Increasing the number of folds may help here. A cheaper alternative is ValidationSet (see below) or AIC. For a full list of available measures see the
Note that multiple measures may also be combined. For more information see Multi-Objective Modeling.
For how to change the error function see this FAQ entry.
Below is a list of some available measures and the configuration options available for each of them. Each measure also has a target accuracy attribute, which can be omitted and which defaults to 0.001. In certain cases, such as the binary MinMax measure, the target accuracy is irrelevant.
We are well aware that the documentation is not always complete and may even be out of date in some cases. We try to document everything as best we can, but much is limited by available time and manpower. We are a university research group after all. The most up-to-date documentation can always be found (if not here) in the default.xml configuration file and, of course, in the source files. If something is unclear, please don't hesitate to ask.
The CrossValidation measure is the default choice. It performs an n-fold cross-validation on the model to estimate its accuracy. Several options are available to customize this measure.
|The number of folds used for the measure. A higher number means that more models will be built, but that a better accuracy estimate is achieved.|
|If the number of samples is greater than this number, a random partitioning is used.|
|This option defines whether the test sets for the folds are chosen randomly, or are chosen in such a way as to maximize the domain coverage. Random is generally much faster, but might result in pessimistic scoring, as an unlucky choice of test set can result in an inaccurate error. This can partly be fixed by enabling the resetFolds option.|
|Folds are generated from scratch for each model that is evaluated using this measure. If the same model is evaluated twice (for example, after a rebuild), new folds are used. Enabling this feature can be very costly for large sample sizes. As a rule of thumb, enable this in combination with the random partition method, or disable this when using the uniform method.|
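The following Python sketch illustrates the general idea of n-fold cross-validation with random partitioning. It is not the toolbox's implementation: the helper names, the straight-line model, and the choice to average the per-fold errors are all assumptions made for illustration.

```python
import random

def k_fold_indices(n_samples, n_folds, seed=0):
    """Randomly partition sample indices into n_folds disjoint test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def mean_abs_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_line(xs, ys):
    """Least-squares straight line; stands in for any trainable model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def cross_validation_score(xs, ys, fit, error, n_folds=5, seed=0):
    """Retrain once per fold and average the held-out errors (an assumption)."""
    scores = []
    for test_idx in k_fold_indices(len(xs), n_folds, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        preds = [model(xs[i]) for i in test_idx]
        scores.append(error([ys[i] for i in test_idx], preds))
    return sum(scores) / len(scores)

xs = [float(i) for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
score = cross_validation_score(xs, ys, fit_line, mean_abs_error)
print(score)
```

Note that the model is refitted once per fold, which is why a higher number of folds costs more training time. Passing a different seed for each evaluation would mimic regenerating the folds from scratch.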
The ValidationSet measure has two different methods of operation.
- In the first method, the list of samples that have been evaluated is split into a validation set and a training set. A model is then built using the training set, and evaluated using the validation set (which is by default 20% of the total sample pool).
- Alternatively, an external data file containing a validation set can be specified. In this case, all the evaluated samples are used for training, and the external set is used for validation only.
Which of these two methods is used depends on the configuration options below. By default, no external validation set is loaded.
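As a rough illustration of the first method, the Python sketch below splits a sample pool into a training set and a validation set of 20%. The function name is hypothetical, and the validation samples are picked purely at random, which is only one possible selection strategy and not necessarily what the toolbox does by default.

```python
import random

def split_validation(samples, validation_fraction=0.2, seed=0):
    """Return (training, validation) with roughly the requested fraction held out."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_val = max(1, round(len(samples) * validation_fraction))
    val_idx = set(indices[:n_val])
    training = [s for i, s in enumerate(samples) if i not in val_idx]
    validation = [s for i, s in enumerate(samples) if i in val_idx]
    return training, validation

samples = list(range(50))
training, validation = split_validation(samples)
print(len(training), len(validation))
```

In the second method no such split takes place: the whole pool would be used for training and the validation samples would come from the external file instead.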
If you want to use an external validation set, you will have to provide a SampleEvaluator configuration so that the validation set can be loaded from an external source. Here is a ValidationSet configuration example which loads the validation set from the scattered data file provided in the simulator file:
<Measure type="ValidationSet" target=".001">
  <Option key="type" value="file"/>
  <SampleEvaluator type="ibbt.sumo.SampleEvaluators.ScatteredDatasetSampleEvaluator"/>
</Measure>
|type||[distance, random, file]||distance|
|Method used to acquire samples for the validation set. The default method, 'distance', tries to select a validation set that covers the entire domain as well as possible, ensuring that not all validation samples are chosen in the same part of the domain. This is achieved using a distance heuristic, which gives no guarantees on optimal coverage but performs very well in almost all situations. The 'random' method just picks a random set of samples from the entire pool to be used as the validation set. Finally, the 'file' method does not take any samples from the pool, but loads a validation set from an external dataset.|
|Percent of samples used for the validation set. By default 20% of all samples are used for validation, while the remaining 80% are used for training. This option is irrelevant if the 'type' option is set to 'file'.|
|When the sample pool is very large, the distance heuristic used by default becomes too slow, and the toolbox switches to random sample selection automatically. This is done when the number of samples is larger than this value, which defaults to 1000. This option should not be changed unless the performance is unacceptable even for sample sets smaller than this value.|
The MinMax measure is used to eliminate models whose response falls below a given minimum or above a given maximum. This measure can be used to detect models that have poles in the model domain and to guide the modeling process in the right direction. If the output is known to lie within certain value bounds, these can be added to the simulator file as follows:
<OutputParameters>
  <Parameter name="out" type="real" minimum="-1" maximum="1"/>
</OutputParameters>
When the MinMax measure is defined, these values will be used to ensure that all models stay within these bounds. If only the minimum or only the maximum is defined, only that bound is enforced. There are no further configuration options for this measure. In the case of complex outputs, the modulus is used.
Remember, though, that no guarantee can be given that the poles will really disappear. Using this measure only combats the symptoms and not the cause of the problem. Also, this measure can be fairly slow, because it evaluates the model on a dense grid to decide whether the model is crossing the boundaries. If the model is slow to evaluate, this can take a considerable amount of time.
Tip: even if you don't know the exact bounds on your output, you can still use this measure by specifying very broad bounds (e.g., [-10000 10000]). This can still allow you to catch poles since, by definition, the response tends to infinity near a pole.
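The dense-grid check can be sketched in Python as follows. This is a one-dimensional illustration under stated assumptions, not the toolbox's code: the function name, the evenly spaced grid, and the grid size are all hypothetical.

```python
def stays_within_bounds(model, lower, upper, minimum=None, maximum=None, n_points=1001):
    """Return True if model(x) respects the given bounds on an evenly spaced grid.

    Only the bounds that are not None are enforced; complex outputs are
    compared by their modulus.
    """
    for i in range(n_points):
        x = lower + (upper - lower) * i / (n_points - 1)
        y = model(x)
        if isinstance(y, complex):
            y = abs(y)
        if minimum is not None and y < minimum:
            return False
        if maximum is not None and y > maximum:
            return False
    return True

smooth = lambda x: 0.5 * x              # well-behaved response
pole = lambda x: 1.0 / (x - 1.0 / 3.0)  # pole inside the domain [0, 1]

print(stays_within_bounds(smooth, 0.0, 1.0, minimum=-1.0, maximum=1.0))
print(stays_within_bounds(pole, 0.0, 1.0, minimum=-100.0, maximum=100.0))
```

A grid can only detect a pole if some grid point falls close enough to it for the response to exceed the bounds. Very broad bounds therefore need a denser grid, which adds to the cost mentioned above.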
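This measure simply calculates the error on the training data. Note that this measure is useless for interpolating methods like Kriging and RBF, since such models pass through every training sample and the error is therefore always (near) zero.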
The toolbox keeps track of the n best models found so far. The ModelDifference measure uses the disagreement between those models as a heuristic for ranking them. A model that differs considerably from the other models is assumed to be of poor quality. Remember that this is just a heuristic!
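One way to picture such a disagreement heuristic is the Python sketch below. It is an assumption for illustration only, since the toolbox's actual formula is not given here: each model is scored by its mean absolute deviation from the average prediction of the other models on a set of evaluation points.

```python
def disagreement_scores(models, points):
    """Score each model by how far it deviates from the mean of the others."""
    predictions = [[m(x) for x in points] for m in models]
    scores = []
    for i, own in enumerate(predictions):
        others = [p for j, p in enumerate(predictions) if j != i]
        total = 0.0
        for k, value in enumerate(own):
            mean_other = sum(p[k] for p in others) / len(others)
            total += abs(value - mean_other)
        scores.append(total / len(points))
    return scores

# Two models that nearly agree and one outlier.
models = [lambda x: x, lambda x: x + 0.01, lambda x: 5.0 * x * x]
points = [i / 10.0 for i in range(11)]
scores = disagreement_scores(models, points)
print(scores)
```

In this example the third model receives the highest score and would be ranked worst. This also shows the weakness of the heuristic: if the majority of the models are wrong, a good model is penalized.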
This is a very useful measure that can be used to complement other measures (see Multi-Objective Modeling), often with good results. For example, it can serve as a cheaper alternative to cross-validation with neural networks. It penalizes models that show unwanted 'bumps' or 'ripples' in the response.
Implements Akaike's Information Criterion. Note that this requires a proper implementation of the freeParams of a Model.
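To show why the number of free parameters matters, here is a small Python sketch of one common form of AIC for least-squares fits with Gaussian errors, n * ln(RSS / n) + 2k, where constant terms are dropped. Whether the toolbox uses exactly this variant is an assumption, and the function name is hypothetical.

```python
import math

def aic_least_squares(y_true, y_pred, n_free_params):
    """AIC (up to an additive constant) for a least-squares fit.

    Assumes a nonzero residual sum of squares; lower values are better.
    """
    n = len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return n * math.log(rss / n) + 2 * n_free_params

y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.1, 3.9]

# Same fit quality, different model complexity.
aic_simple = aic_least_squares(y_true, y_pred, n_free_params=2)
aic_complex = aic_least_squares(y_true, y_pred, n_free_params=5)
print(aic_simple, aic_complex)
```

With identical residuals, the model with more free parameters gets the higher (worse) score. An incorrect parameter count would therefore directly distort the ranking.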