General guidelines

From SUMOwiki
Revision as of 22:50, 28 January 2009 by Dgorissen (talk | contribs)

The default.xml file serves as a starting point for configuring the SUMO Toolbox. If you are a new user, you should initially leave most options at their default values. The default settings were chosen because they produce good results on average.

However, the best choice of components usually depends on the problem itself, so the default settings are not necessarily optimal. This page gives general guidelines for deciding which component to use in each situation you may encounter. You are of course free to ignore these rules and experiment with other settings.

Measures

The default Measure is CrossValidation. Even though this is a very good, accurate, overall measure, there are some considerations to make in the following cases:

  • Expensive modelers (ann): If it is relatively expensive to train a model (for example, with neural networks), CrossValidation is also very slow, because it has to train one model per fold (5 folds by default). If modeling takes too long, you might want to use a faster alternative, such as ValidationSet.
  • ErrorSampleSelector: CrossValidation might give a biased result when combined with the ErrorSampleSelector. This is because the ErrorSampleSelector tends to cluster samples around one point, which results in very accurate surrogate models for all the points in this cluster (and thus good results with CrossValidation). So when using CrossValidation and ErrorSampleSelector together, keep in mind that the real accuracy might be slightly lower than the estimated one.
  • Rational modeler: When using the Rational modeler, you might want to manually add a MinMax measure (if you have a rough estimate of the minimum and maximum values for your outputs) and use it together with CrossValidation. By adding the MinMax measure, you eliminate models that have poles in the design space, because these poles always break the minimum and maximum bounds. This usually results in better models and quicker convergence.
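
The idea behind the MinMax measure can be illustrated with a minimal Python sketch. This is not the SUMO Toolbox implementation: the function `violates_min_max`, the two toy models, the grid, and the bounds are all hypothetical and only show why a rational model with a pole inside the design space breaks user-supplied output bounds.

```python
def violates_min_max(model, grid, lower, upper):
    """Return True if any prediction on the grid falls outside [lower, upper]."""
    for x in grid:
        try:
            y = model(x)
        except ZeroDivisionError:
            # Evaluating exactly on a pole counts as a violation.
            return True
        if not (lower <= y <= upper):
            return True
    return False


def smooth_model(x):
    # Rational function without real poles: 1 / (1 + x^2).
    return 1.0 / (1.0 + x * x)


def pole_model(x):
    # Rational function with a pole at x = 0.5, inside the design space [0, 1].
    return 1.0 / (x - 0.5)


# Dense grid over a one-dimensional design space [0, 1] (assumed for illustration).
grid = [i / 1000.0 for i in range(1001)]

# Rough output bounds the user is assumed to know in advance.
lower, upper = -10.0, 10.0

smooth_rejected = violates_min_max(smooth_model, grid, lower, upper)
pole_rejected = violates_min_max(pole_model, grid, lower, upper)
print(smooth_rejected, pole_rejected)
```

In this sketch the model with the pole is flagged because its predictions grow without bound near x = 0.5, while the smooth model stays within the bounds. Even rough bounds are enough, since a pole exceeds any finite limit.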

Sample Selectors

The default SampleSelector is the GradientSampleSelector. This is a very robust sample selector, capable of dealing with most situations. There are, however, some cases in which it is advisable to choose a different one:

  • Large-scale problems (1000+ samples): The GradientSampleSelector's time complexity is O(n^2) in the number of samples n, so for large-scale experiments in which many samples are taken, the GradientSampleSelector becomes quite slow. Depending on the time it takes to perform one simulation, this may or may not be a problem. If one simulation takes a long time, the cost of selecting new samples with the GradientSampleSelector might still be negligible.
  • Rational modeler: Benchmarks have shown that the gain from using the GradientSampleSelector instead of the ErrorSampleSelector with global approximation methods (mainly rational/polynomial) is negligible. It is therefore advisable to use the (much faster) ErrorSampleSelector when using the Rational modeler.

When using the ErrorSampleSelector instead of the GradientSampleSelector, it is always a good idea to combine it with the DensitySampleSelector, to counter the stability/robustness issues the ErrorSampleSelector often causes. A reasonable split is to select about 60% of the samples with the ErrorSampleSelector and 40% with the DensitySampleSelector. This ensures that the entire design space is covered at least to a certain degree. This additional sample selector is NOT necessary when using the GradientSampleSelector.
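
The 60/40 split can be sketched in Python as follows. This is a simplified illustration, not the toolbox's own selectors: `select_mixed` is a hypothetical helper, the error-based part simply ranks candidates by an assumed error estimate, and the density-based part uses a greedy maximin distance rule as a stand-in for space filling.

```python
import math


def select_mixed(candidates, errors, existing, n_new, error_share=0.6):
    """Pick n_new candidates: error_share by estimated error, the rest by density."""
    n_error = round(n_new * error_share)
    n_density = n_new - n_error

    # Error-based part: candidates with the largest estimated model error.
    by_error = sorted(range(len(candidates)), key=lambda i: errors[i], reverse=True)
    chosen = by_error[:n_error]

    # Density-based part: greedily pick the candidate farthest from every
    # point sampled or selected so far (a simple maximin rule).
    taken = list(existing) + [candidates[i] for i in chosen]
    remaining = [i for i in range(len(candidates)) if i not in chosen]
    for _ in range(n_density):
        best = max(remaining,
                   key=lambda i: min(math.dist(candidates[i], p) for p in taken))
        chosen.append(best)
        taken.append(candidates[best])
        remaining.remove(best)

    return [candidates[i] for i in chosen]


# Toy one-dimensional design space with an error estimate peaked near x = 0.9.
candidates = [(i / 20,) for i in range(21)]
errors = [math.exp(-((c[0] - 0.9) ** 2) / 0.005) for c in candidates]
existing = [(0.5,)]  # one sample already evaluated

picked = select_mixed(candidates, errors, existing, n_new=5)
print(picked)
```

With five new samples, three go to the high-error region around x = 0.9 and the remaining two fill the largest gaps elsewhere, so the design space is not left uncovered while the error-based part focuses on one cluster.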

Adaptive Model Builders

The question that always gets asked is "Which model type should I use for my data?" Unfortunately there is no straightforward answer, since it all depends on your problem: how many dimensions, how many points, whether the function is rugged, smooth, or both, whether there is noise, etc. Based on this knowledge it is possible to say which model types are more likely to do well, but it remains a heuristic. It is best to try a few and see what happens, or to use the heterogenetic model builder, which tries multiple model types in parallel and attempts to determine the best type automatically.
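
The "try a few and see what happens" advice can be sketched in Python as a comparison of several model types by k-fold cross-validation. Everything here is an assumption made for illustration: the three toy model types, the helper names `k_fold_error` and `choose_model`, and the two toy data sets. It does not reproduce how the heterogenetic model builder works internally.

```python
def k_fold_error(fit, xs, ys, k=5):
    """Mean squared error of `fit`, estimated with k-fold cross-validation."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th point forms one fold
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        model = fit(train_x, train_y)  # one model is trained per fold
        total += sum((model(xs[i]) - ys[i]) ** 2 for i in test_idx)
    return total / n


def fit_constant(xs, ys):
    mean = sum(ys) / len(ys)
    return lambda x: mean


def fit_linear(xs, ys):
    # Ordinary least squares for a straight line in one dimension.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    # Nearest-neighbour lookup: a purely local model.
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def choose_model(xs, ys, candidates, k=5):
    """Score every candidate model type and return the name of the best one."""
    scores = {name: k_fold_error(fit, xs, ys, k) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


candidates = {"constant": fit_constant, "linear": fit_linear, "nearest": fit_nearest}
xs = [i / 100 for i in range(101)]

line_ys = [2 * x + 1 for x in xs]                # smooth, global trend
step_ys = [0.0 if x < 0.5 else 1.0 for x in xs]  # discontinuous response

best_for_line, _ = choose_model(xs, line_ys, candidates)
best_for_step, _ = choose_model(xs, step_ys, candidates)
print(best_for_line, best_for_step)
```

On the straight-line data the global linear model wins, while on the step data the local nearest-neighbour model wins, which illustrates why no single model type can be recommended for every problem.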