Difference between revisions of "FAQ"

Revision as of 16:01, 9 February 2009

General

What is a global surrogate model?

A global surrogate model is a mathematical model that mimics the behavior of a computationally expensive simulation code over the complete parameter space as accurately as possible, using as few data points as possible. Note that optimization is not the primary goal, although it can be done as a post-processing step. Global surrogate models are useful for:

design space exploration, to get a feel for how the different parameters behave
sensitivity analysis
what-if analysis
prototyping
...

In addition, they are a cheap way to model large-scale systems: multiple global surrogate models can be chained together in a model cascade.

What about surrogate driven optimization?

When they hear the term surrogate driven optimization, most people associate it with trust-region strategies and simple polynomial models. These frameworks first construct a local surrogate, which is optimized to find an optimum. Afterwards, a move limit strategy decides how the local surrogate is scaled and/or moved through the input space. Subsequently, the surrogate is rebuilt and optimized again. In other words, the surrogate zooms in on the global optimum. For instance, the DAKOTA Toolbox implements such strategies, where the surrogate construction is separated from optimization.

Such a framework was earlier implemented in the SUMO Toolbox but was deprecated as it didn't fit the philosophy and design of the toolbox.

Instead another, equally powerful, approach was taken. The current optimization framework is in fact a sampling selection strategy that balances local and global search. In other words, it balances between exploring the input space and exploiting the information the surrogate gives us.

A configuration example can be found here. For more information see the InfillSamplingCriterion.
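As a rough illustration of balancing exploration against exploitation, the sketch below computes expected improvement, a widely used infill criterion for minimization. It is not taken from the toolbox: the function name, the assumption of a Gaussian prediction, and the example numbers are all hypothetical. The criteria actually available in SUMO are the ones documented under InfillSamplingCriterion.

```python
import math


def expected_improvement(mean, std, best_so_far):
    """Expected improvement (minimization) of a Gaussian prediction.

    mean, std   -- surrogate prediction and its uncertainty at a candidate point
    best_so_far -- lowest response value observed so far
    (All names are illustrative assumptions, not SUMO Toolbox identifiers.)
    """
    if std <= 0.0:
        # No uncertainty: the improvement is known exactly (pure exploitation).
        return max(best_so_far - mean, 0.0)
    z = (best_so_far - mean) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    # First term rewards a low predicted value (exploitation),
    # second term rewards high uncertainty (exploration).
    return (best_so_far - mean) * cdf + std * pdf


# Two hypothetical candidates with the same predicted value:
well_sampled_region = expected_improvement(mean=1.0, std=0.1, best_so_far=1.0)
unexplored_region = expected_improvement(mean=1.0, std=1.0, best_so_far=1.0)
```

With equal predicted values, the candidate in the less explored region (larger uncertainty) receives the higher score, which is what drives the global part of the search.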

Can the toolbox drive my simulation code directly?

Yes it can. See the Interfacing with the toolbox page.

What is the difference between the M3-Toolbox and the SUMO-Toolbox?

The SUMO toolbox is a complete, full-featured framework for automatically generating approximation models and performing adaptive sampling. In contrast, the M3-Toolbox was more of a proof-of-principle.

What happened to the M3-Toolbox?

The M3 Toolbox project has been discontinued (Fall 2007) and superseded by the SUMO Toolbox. Please contact tom.dhaene@ua.ac.be for any inquiries and requests about the M3 Toolbox.

How can I stay up to date with the latest news?

To stay up to date with the latest news and releases, we recommend subscribing to our newsletter here. Traffic will be kept to a minimum (1 message every 2-3 months) and you can unsubscribe at any time.

What is the roadmap for the future?

There is no explicit roadmap since much depends on where our research leads us, what feedback we get, which problems we are working on, etc. However, to get an idea of features to come you can always check the Whats new page.

Will there be an R/Scilab/Octave/Sage/.. version?

At the start of the project we considered moving to one of the available open source alternatives to Matlab. However, after much discussion we decided against this for several reasons(*), including:

The quality and amount of available Matlab documentation
The quality and number of Matlab toolboxes
Many well documented interfacing options (esp. Java)
Existing experience and know-how

Matlab certainly has its problems and deficiencies, but the number of advanced algorithms and toolboxes makes it a very attractive platform. Equally important is the fact that every function is properly documented and includes examples, tutorials, and in some cases GUI tools. Many things would have been a lot harder and/or more time consuming to implement on one of the other platforms. The other platforms remain on our radar, however, and we do look into them from time to time. In principle it would even be possible to write a bridge between Matlab and them.

(*) We are not saying those projects are poor or useless, quite the contrary. It's just that given our situation, goals, and resources at the time, Matlab was the best choice for us.

Installation and Configuration

What is the relationship between Matlab and Java?

Many people do not know this, but your Matlab installation automatically includes a Java virtual machine. By default, Matlab seamlessly integrates with Java, allowing you to create Java objects from the command line (e.g., 's = java.lang.String'). It is possible to disable Java support, but the SUMO Toolbox requires it to be enabled. To check if Java is enabled you can use the 'usejava' command.

What is Java, why do I need it, do I have to install it, etc. ?

The short answer is: no, don't worry about it. The long answer is: some of the code of the SUMO Toolbox is written in Java, since it makes a lot more sense in many situations and is a proper programming language instead of a scripting language like Matlab. Since Matlab automatically includes a JVM to run Java code, there is nothing you need to do or worry about (see the previous FAQ entry). Unless it's not working of course; in that case see FAQ#When_running_the_toolbox_you_get_something_like_.27.3F.3F.3F_Undefined_variable_.22ibbt.22_or_class_.22ibbt.sumo.config.ContextConfig.setRootDirectory.22.27.

What is XML?

XML stands for eXtensible Markup Language and is related to HTML (= the stuff web pages are written in). The first thing you have to understand is that XML does not do anything. Honest. Many engineers are not used to it and think it is some complicated computer programming language-stuff-thingy. This is of course not the case (we ignore some of the fancy stuff you can do with it for now). XML is a markup language, meaning it provides some rules for how you can annotate or structure existing text.

The way SUMO uses XML is really simple and there is not much to understand. First some simple terminology. Take the following example:

  <Foo attr="bar">bla bla bla</Foo>

Here we have a tag called Foo containing the text bla bla bla. The tag Foo also has an attribute attr with value bar. '<Foo>' is what we call the opening tag, and '</Foo>' is the closing tag. Each time you open a tag you must close it again. How you name the tags or attributes is totally up to you, you choose :)
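To see this terminology in action, here is a minimal sketch that reads the Foo example with Python's standard XML parser. Python is used purely for illustration (the toolbox itself reads its configuration through Matlab and Java), and the variable names are made up for this example.

```python
import xml.etree.ElementTree as ET

element = ET.fromstring('<Foo attr="bar">bla bla bla</Foo>')

tag_name = element.tag                 # name of the tag
attribute_value = element.get("attr")  # value of the attribute "attr"
text = element.text                    # text between opening and closing tag

# A tag that is opened but never closed is rejected by the parser.
try:
    ET.fromstring('<Foo attr="bar">bla bla bla')
    unclosed_tag_accepted = True
except ET.ParseError:
    unclosed_tag_accepted = False
```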

Let's take a more interesting example. Here we have used XML to represent information about a recipe for pancakes:

 <recipe category="dessert">
   <title>Pancakes</title>
   <author>sumo@intec.ugent.be</author>
   <date>Wed, 14 Jun 95</date>
   <description>
     Good old fashioned pancakes.
   </description>
   <ingredients>
     <item>
       <amount>3</amount>
       <type>eggs</type>
     </item>
     <item>
       <amount>0.5 tablespoon</amount>
       <type>salt</type>
     </item>
     ...
   </ingredients>
   <preparation>
     ...
   </preparation>
 </recipe>

So basically, you see that XML is just a way to structure, order, and group information. That's it! SUMO uses it to store and structure configuration options, and this works well due to the nice hierarchical nature of XML.
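The hierarchical nature is what makes such a file easy to process by program. The sketch below walks a shortened copy of the recipe (the parts elided with "..." are simply left out) using Python's standard library. It only illustrates the idea and is not how the toolbox itself reads its configuration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the recipe example; elided parts are omitted.
recipe_xml = """
<recipe category="dessert">
  <title>Pancakes</title>
  <ingredients>
    <item><amount>3</amount><type>eggs</type></item>
    <item><amount>0.5 tablespoon</amount><type>salt</type></item>
  </ingredients>
</recipe>
"""

recipe = ET.fromstring(recipe_xml)
category = recipe.get("category")   # attribute of the top-level tag
title = recipe.findtext("title")    # text of a direct child tag

# Nested tags are reached by following the hierarchy: ingredients -> item.
ingredients = [
    (item.findtext("amount"), item.findtext("type"))
    for item in recipe.findall("ingredients/item")
]
```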

If you understand this, you know all you need to read the SUMO configuration files. If you need more information see the tutorial here: http://www.w3schools.com/XML/xml_whatis.asp. You can also have a look at the Wikipedia page here: http://en.wikipedia.org/wiki/XML

Why does SUMO use XML?

XML is the de facto standard way of structuring information. This ranges from spreadsheet files (Microsoft Excel for example), to configuration data, to scientific data, ... There are even whole database systems based solely on XML. So basically, it's an intuitive way to structure data and it is used everywhere. As a result, a very large number of libraries and programming languages can parse and handle XML easily. That means less work for the programmer. Then of course there is stuff like XSLT, XQuery, etc. that makes life even easier. So basically, it would not make sense for SUMO to use any other format :)

Upgrading

How do I upgrade to a newer version?

Delete your old <SUMO-Toolbox-directory> completely and replace it by the new one. Install the new activation file / extension pack as before (see Installation), start Matlab and make sure the default run works. To port your old configuration files to the new version: make a copy of default.xml (from the new version) and copy over your custom changes (from the old version) one by one. This should prevent any weirdness if the XML structure has changed between releases.
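One way to find your custom changes before copying them over is to compare your old configuration file with the default.xml of the old version. The sketch below does this with Python's standard difflib module. The helper name is hypothetical and nothing here is part of the toolbox, so treat it as an optional aid only.

```python
import difflib


def custom_changes(old_default_text, old_custom_text):
    """Return the lines that differ between the old default.xml and a custom copy.

    Lines prefixed with "- " exist only in the default file,
    lines prefixed with "+ " exist only in the custom file.
    """
    diff = difflib.ndiff(
        old_default_text.splitlines(),
        old_custom_text.splitlines(),
    )
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in diff if line.startswith(("+ ", "- "))]
```

Each reported line can then be reviewed and, where still applicable, carried over by hand into a fresh copy of the new default.xml.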

Using

Why are the Neural Networks so slow?

You are probably using the CrossValidation measure. CrossValidation is used by default if you have not defined a measure yourself. Since you need to train them, neural nets will always be slower than the other models. Using CrossValidation will slow things down much more (5 times slower by default). Therefore, when using one of the neural network model types, please use a different measure, such as ValidationSet or SampleError. See the comments in default.xml for examples.
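The slowdown follows directly from how cross validation works: every candidate model has to be trained once per fold. The sketch below counts those training runs. The fold count of 5 is inferred from the "5 times slower" figure above, and the way the samples are divided over the folds is an assumption made for illustration, not the toolbox's actual partitioning.

```python
def k_fold_splits(n_samples, n_folds=5):
    """Yield (training_indices, validation_indices) for each fold."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        validation = indices[fold::n_folds]  # every n_folds-th sample
        held_out = set(validation)
        training = [i for i in indices if i not in held_out]
        yield training, validation


# Scoring ONE candidate model with cross validation requires one training
# run per fold, whereas a fixed validation set requires a single run.
trainings_with_cross_validation = sum(1 for _ in k_fold_splits(20, n_folds=5))
trainings_with_validation_set = 1
```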

Note: Starting from version 5.0, two new neural network backends are available (based on FANN and NNSYSID). These are a lot faster than the default backend based on the Matlab Neural Network Toolbox. However, the accuracy is not as good.

How can I make the toolbox run faster?

There are a number of things you can do to speed things up:

Disable some, or even all of the profilers or disable the output handlers that draw charts
Turn off the plotting of models in ContextConfig; you can always generate plots from the saved mat files.
Decrease the logging granularity; a log level of FINE (the default is FINEST or ALL) is more than granular enough. Setting it to FINE, INFO, or even WARNING should speed things up.
If you have a multi-core/multi-cpu machine, set the threadCount variable in LocalSampleEvaluator equal to the number of cores/CPUs (only do this if it is ok to start multiple instances of your simulation script in parallel!)
Avoid cross validation and use a validation set
Don't use the Min-Max measure; it can slow things down.
Instead of anngenetic use fanngenetic or nanngenetic. These are faster, but the quality of the models is lower. However, it may be good enough for your problem.
If using ANNModels, try setting the training goal (= the SSE to reach during training) to a small positive number (e.g., 1e-5) instead of 0. Take care that the range of your response is not too small, though.
For most model builders there is an option "maxFunEvals", "maxIterations", or equivalent. Change this value to change the maximum number of models built between 2 sampling iterations. The higher this number, the slower, but the better the models can be.

If you are having problems with very slow or seemingly hanging runs:

Do a run inside the Matlab profiler and see where most time is spent.
Monitor CPU and physical/virtual memory usage while the SUMO toolbox is running and see if you notice anything strange.

Also note that by default Matlab only allocates about 117 MB memory space for the Java Virtual Machine. If you would like to increase this limit (which you should) please follow the instructions here. See also the general memory instructions here.

To check if your SUMO run has hung, monitor your log file (with the level set at least to FINE). If you see no changes for about 30 minutes, the toolbox has probably stalled. Report the problem here.

Such problems are hard to identify and fix so it is best to work towards a reproducible test case if you think you found a performance or scalability issue.

How do I build models with more than one output

See Running#Models_with_multiple_outputs

How do I turn off adaptive sampling (run the toolbox for a fixed set of samples)?

See Adaptive Modeling Mode.

How do I change the error function (relative error, RMS, ...)?

The <Measure> tag specifies the algorithm to use to assign models a score, e.g., CrossValidation. Within the measure it is also possible to specify which error function to use. The default error function is 'rootRelativeSquareError'.

Say you want to use CrossValidation with the maximum absolute error, then you would put:

<Measure type="CrossValidation" target="0.001" errorFcn="maxAbsoluteError"/>

On the other hand, if you wanted to use the ValidationSet measure with a relative root-mean-square error you would put:

<Measure type="ValidationSet" target="0.001" errorFcn="relativeRms"/>

These error functions can be found in the src/matlab/tools/errorFunctions directory. You are free to modify them and add your own.
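To give an idea of what such an error function computes, the sketch below implements the usual textbook definitions of a root relative square error and a maximum absolute error. These are assumptions based on the function names only. The Matlab files in src/matlab/tools/errorFunctions are authoritative and may differ in detail.

```python
import math


def root_relative_square_error(true_values, predicted_values):
    """Squared error of the model relative to that of always predicting the mean.

    0 means a perfect fit, 1 means no better than the mean.
    Undefined (division by zero) if all true values are identical.
    """
    mean = sum(true_values) / len(true_values)
    model_error = sum((t - p) ** 2 for t, p in zip(true_values, predicted_values))
    mean_error = sum((t - mean) ** 2 for t in true_values)
    return math.sqrt(model_error / mean_error)


def max_absolute_error(true_values, predicted_values):
    """Largest absolute deviation between true and predicted values."""
    return max(abs(t - p) for t, p in zip(true_values, predicted_values))
```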

How do I enable more profilers?

Go to the <Profiling> tag and put "*" as the regular expression. See also the next question.

What regular expressions can I use to filter profilers?

See the syntax here.

How can I ensure deterministic results?

See Random state.

How do I get a simple closed-form model (symbolic expression)?

See Using a model.

How do I enable the Heterogenous evolution to automatically select the best model type?

Simply use the heterogenetic modelbuilder as you would any other.

What is the combineOutputs option?

See Running#Models_with_multiple_outputs

What error function should I use?

The default error function is the Bayesian Estimation Error Quotient (BEEQ). This error function is described here. For example, if you get a BEEQ of 0.16 it (very intuitively) means you are doing 16% better than the simplest model (= the mean). On the other hand, meanRelativeError may be more intuitive, but in that case you have to be careful if you have function values close to zero, since the relative error then explodes or even gives infinity. You could also use one of the combined relative error functions (these contain a +1 in the denominator to account for small values), but then you get something between a relative and an absolute error (=> hard to interpret).

So, to be safe, an absolute error (like the RMSE) seems the best bet. However, in that case you have to come up with sensible accuracy targets and realize that you will build models that try to fit the regions of high absolute value better than the low ones.
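The near-zero problem of relative errors is easy to reproduce. The sketch below uses plain textbook definitions with made-up data. The function names are illustrative and are not the toolbox's own implementations.

```python
def mean_relative_error(true_values, predicted_values):
    """Mean of |true - predicted| / |true| (fails if a true value is exactly 0)."""
    errors = [abs(t - p) / abs(t) for t, p in zip(true_values, predicted_values)]
    return sum(errors) / len(errors)


def mean_combined_relative_error(true_values, predicted_values):
    """Same, but with +1 in the denominator to damp small true values."""
    errors = [abs(t - p) / (1.0 + abs(t)) for t, p in zip(true_values, predicted_values)]
    return sum(errors) / len(errors)


# Hypothetical data: one large response and one response very close to zero.
true_values = [100.0, 1e-6]
predicted_values = [101.0, 0.01]

plain = mean_relative_error(true_values, predicted_values)
combined = mean_combined_relative_error(true_values, predicted_values)
```

Both predictions are off by roughly 1% of the overall response range, yet the plain relative error is dominated by the single near-zero sample, while the combined variant stays small but is no longer a pure relative error.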

Picking an error function is a very tricky business and many people do not realize this. Which one is best for you and what targets you use ultimately depends on your application and on what kind of model you want. There is no general answer.

A recommended read is the technical report available here. See also the page on Multi-Objective Modeling.

I just want to generate an initial design (no sampling, no modeling)

Do a regular SUMO run, except set the 'maxModelingIterations' in the SUMO tag to 0. The resulting run will only generate (and evaluate) the initial design and save it to samples.txt in the output directory.

How do I start a run with the samples of a previous run, or with a custom initial design?

Use a Dataset design component, for example:

<InitialDesign type="DatasetDesign">
   <Option key="file" value="/path/to/the/file/containing/the/points.txt"/>
</InitialDesign>

What is a level plot?

A level plot is a plot that shows how the error histogram changes as the best model improves.

Level plots only work if you have a separate dataset (test set) that the model can be checked against. See the comments in default.xml for how to enable level plots.

I am getting a java out of memory error, what happened?

Datasets are loaded through Java. This means that the Java heap space is used for storing the data. If you try to load a huge dataset (> 50MB), you might experience problems with the maximum heap size. You can solve this by raising the heap size as described on the following webpage: http://www.mathworks.com/support/solutions/data/1-18I2C.html

How do I force the output of the model to lie in a certain range

See Measures#MinMax.

Troubleshooting

I have a problem and I want to report it

See Reporting problems.

I sometimes get flat models when using rational functions

First make sure the model is indeed flat, and does not just appear so on the plot. You can verify this by looking at the output axis range and making sure it is within reasonable bounds. When there are poles in the model, the axis range is sometimes stretched to make it possible to plot the high values around the pole, causing the rest of the model to appear flat. If the model contains poles, refer to the next question for the solution.

The RationalModel tries to do a least squares fit, based on which monomials are allowed in numerator and denominator. We have observed that sometimes a flat model is simply found as the best least squares fit. There are several possible causes for this:

The number of sample points is small, and the model parameters (as explained here and here) force the model to use only a very small set of degrees of freedom. The solution in this case is to increase the minimum percentage bound in the RationalXYZInterface (RationalSequentialInterface or RationalGeneticInterface) section of your configuration file: change the "percentBounds" option to "60,100", "80,100", or even "100,100". A setting of "100,100" will force the polynomial models to always interpolate exactly. However, note that this does not scale very well with the number of samples (to counter this you can set "maxDegrees"). If, after increasing the "percentBounds", you still get weird, spiky models, you simply need more samples or you should switch to a different model type.
Another possibility is that given a set of monomial degrees, the flat function is just the best possible least squares fit. In that case you simply need to wait for more samples.
The measure you are using is not accurately estimating the true error; try a different measure or error function. Note that a maximum relative error is dangerous to use since the 0-function (= a flat model) has a lower maximum relative error than a function which overshoots the true behavior in some places but is otherwise correct.

When using rational functions I sometimes get 'spikes' (poles) in my model

When the denominator polynomial of a rational model has zeros inside the domain, the model will tend to infinity near these points. In most cases these models will only be recognized as being 'the best' for a short period of time. As more samples get selected these models get replaced by better ones and the spikes should disappear.
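Whether a given rational model has a pole inside the domain can be checked by looking for sign changes of its denominator. The sketch below does this on a regular grid for a one-dimensional model. The function name, the grid resolution, and the example denominator are all hypothetical, and the toolbox does not necessarily detect poles this way.

```python
def find_pole_intervals(denominator, lower, upper, n_points=1001):
    """Return the grid intervals of [lower, upper] in which the denominator
    changes sign (or is exactly zero at an end point).

    Zeros of even multiplicity (no sign change) are missed, and a zero that
    falls exactly on a grid point is reported in both neighbouring intervals.
    """
    step = (upper - lower) / (n_points - 1)
    grid = [lower + i * step for i in range(n_points)]
    intervals = []
    for left, right in zip(grid, grid[1:]):
        if denominator(left) * denominator(right) <= 0.0:
            intervals.append((left, right))
    return intervals


# Hypothetical rational model r(x) = 1 / (x^2 - 0.3) on the domain [0, 1]:
pole_intervals = find_pole_intervals(lambda x: x * x - 0.3, 0.0, 1.0)
```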

So, it is possible that a rational model with 'spikes' (caused by poles inside the domain) will be selected as best model. This may or may not be an issue, depending on what you want to use the model for. If it doesn't matter that the model is very inaccurate at one particular, small spot (near the pole), you can use the model with the pole and it should perform properly.

However, if the model should have a reasonable error on the entire domain, several methods are available to reduce the chance of getting poles or remove the possibility altogether. The possible solutions are:

Simply wait for more data, usually spikes disappear (but not always).
Lower the maximum of the "percentBounds" option in the RationalXYZInterface (RationalSequentialInterface or RationalGeneticInterface) section of your configuration file. For example, say you have 500 data points: if the maximum of the "percentBounds" option is set to 100 percent, the degrees of the polynomials in the rational function can go up to 500. If you set the maximum of the "percentBounds" option to 10, on the other hand, the maximum degree is set at 50 (= 10 percent of 500). You can also use the "maxDegrees" option to set an absolute bound.
If you roughly know the output range your data should have, an easy way to eliminate poles is to use the MinMax Measure together with your current measure (CrossValidation by default). This will cause models whose response falls outside the min-max bounds to be penalized extra, so spikes should disappear. See Combining measures.
Use a different model type (RBF, ANN, SVM,...), as spikes are a typical problem of rational functions.
Increase the population size if using the genetic version
Try using the RationalPoleSuppressionSampleSelector; it was designed to get rid of this problem more quickly, but it only selects one sample at a time.
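The "percentBounds" arithmetic from the list above can be written out as a one-line calculation. The helper name and the use of integer division are assumptions made for this sketch. How the toolbox rounds, and how it combines the result with "maxDegrees", is not specified here.

```python
def max_degree_from_percent(n_samples, max_percent):
    """Degree bound implied by the upper "percentBounds" value.

    Integer division is an assumption; the toolbox may round differently.
    """
    return (n_samples * max_percent) // 100


bound_at_100_percent = max_degree_from_percent(500, 100)
bound_at_10_percent = max_degree_from_percent(500, 10)
```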

However, these solutions may still not suffice in some cases. The underlying reason is that the order selection algorithm contains quite a lot of randomness, making it prone to over-fitting. This issue is being worked on but will take some time. Automatic order selection is not an easy problem.

There is no noise in my data yet the rational functions don't interpolate

See this question.

When loading a model from disk I get "Warning: Class ':all:' is an unknown object class. Object 'model' of this class has been converted to a structure."

You are trying to load a model file without the SUMO Toolbox in your Matlab path. Make sure the toolbox is in your Matlab path.

In short: Start Matlab, run <SUMO-Toolbox-directory>/startup.m (to ensure the toolbox is in your path) and then try to load your model.

When running the SUMO Toolbox you get an error like "No component with id 'ann' of type 'adaptive model builder' found in config file."

This means you have specified to use a component with a certain id (in this case an AdaptiveModelBuilder component with id 'ann') but a component with that id does not exist further down in the configuration file (in this particular case 'ann' does not exist but 'anngenetic' does, as a quick search through the configuration file will show). So make sure you only declare components which have a definition lower down. To see which components are available, simply scroll down the configuration file and see which ids are specified. Please also refer to the Declarations and Definitions page.

When using RBF neural network models I sometimes get a crash in "newrb"

This is an error in the Matlab Neural Network Toolbox implementation and not something we can do anything about (a workaround is available on request). This should be fixed by Matlab 7.5.

When using NANN models I sometimes get "Runtime error in matrix library, Choldc failed. Matrix not positive definite"

This is a problem in the mex implementation of the NNSYSID toolbox. Simply delete the mex files; the Matlab implementation will then be used and this will not cause any problems.

When using FANN models I sometimes get "Invalid MEX-file createFann.mexa64, libfann.so.2: cannot open shared object file: No such file or directory."

This means Matlab cannot find the FANN library itself to link to dynamically. Make sure it is in your library path, i.e., on Unix systems, make sure it is included in LD_LIBRARY_PATH.

When trying to use SVM models I get 'Error during fitness evaluation: Error using ==> svmtrain at 170, Group must be a vector'

You forgot to build the SVM mex files for your platform. For Windows they are pre-compiled for you; on other systems you have to compile them yourself with the makefile.

When running the toolbox you get something like '??? Undefined variable "ibbt" or class "ibbt.sumo.config.ContextConfig.setRootDirectory"'

First see this FAQ entry.

This means Matlab cannot find the needed Java classes. This typically means that you forgot to run 'startup' (to set the path correctly) before running the toolbox (using 'go'). So make sure you always run 'startup' before running 'go' and that both commands are always executed in the toolbox root directory.

If you did run 'startup' correctly and you are still getting an error, check that Java is properly enabled:

1. typing 'usejava jvm' should return 1
2. typing 's = java.lang.String' should not give an error
3. typing 'version('-java')' should return at least version 1.5.0

If (1) returns 0, then the JVM of your Matlab installation is not enabled. Check your Matlab installation or startup parameters (did you start Matlab with -nojvm?). If (2) fails but (1) is ok, there is a very weird problem; check the Matlab documentation. If (3) returns a version before 1.5.0, you will have to upgrade Matlab to a newer version or force Matlab to use a custom, newer JVM (see the Matlab docs for how to do this).

You get errors related to gaoptimset,psoptimset,saoptimset,newff not being found or unknown

You are trying to use a component of the SUMO toolbox that requires a Matlab toolbox that you do not have. See the System requirements for more information.

After upgrading I get all kinds of weird errors or warnings when I run my XML files

See FAQ#How_do_I_upgrade_to_a_newer_version.3F

I get a warning about duplicate samples being selected, why is this?

When using the Infill Sample Selector or the Error based Sample Selector it is possible (though unlikely) that a point is selected twice. This is due to the algorithm itself and can be ignored (unless it is a real problem, in which case report it). This issue will be tackled in a future version.

I sometimes see the error of the best model go up, shouldn't it decrease monotonically?

There is no short answer here, it depends on the situation. Below 'single objective' refers to the case where during the hyperparameter optimization (= the modeling iteration) combineOutputs=false, and there is only a single measure set to 'on'. The other cases are classified as 'multi objective'.

Sampling off
1. Single objective: the error should always decrease monotonically, you should never see it rise. If it does, report it as a bug.
2. Multi objective: There is a very small chance the error can temporarily increase, but it should be safe to ignore. In this case it is best to use a multi objective enabled modeling algorithm (available in version 6.1).
Sampling on
1. Single objective: inside each modeling iteration the error should always decrease monotonically. At each sampling iteration the best models are updated (to reflect the new data), so the best model score may increase at that point; this is normal behavior(*). It is possible that the error increases for a short while, but as more samples come in it should decrease again. If this does not happen you are using a poor measure or poor hyperparameter optimization algorithm, or there is a problem with the modeling technique itself (e.g., clustering in the datapoints is causing numerical problems).
2. Multi objective: Combination of 1.2 and 2.1.

(*) This is normal if you are using a measure like cross validation that is less reliable on little data than on more data. However, in some cases you may wish to override this behavior if you are using a measure that is independent of the number of samples the model is trained with (e.g., a dense, external validation set). In this case you can force a monotonic decrease by setting the 'keepOldModels' option in the SUMO tag to true. Use with caution!

At the end of a run I get Undefined variable "ibbt" or class "ibbt.sumo.util.JpegImagesToMovie.createMovie"

This is normal, the warning printed out before the error explains why:

[WARNING] jmf.jar not found in the java classpath, movie creation may not work! Did you install the SUMO extension pack? Alternatively you can install the java media framwork from java.sun.com

By default, at the end of a run, the toolbox will try to generate a movie of all the intermediate model plots. To do this it requires the extension pack to be installed (you can download it from the SUMO lab website). So install the extension pack and you will no longer get the error. Alternatively you can simply set the "createMovie" option in the <SUMO> tag to "false". So note that there is nothing to worry about, everything has run correctly, it is just the movie creation that is failing.

On startup I get the error "java.io.IOException: Couldn't get lock for output/SUMO-Toolbox.%g.%u.log"

This error means that SUMO is unable to create the log file. Check that the output directory exists and has the correct permissions. If your output directory is on a shared (network) drive this could also cause problems. Also make sure you are running the toolbox (calling 'go') from the toolbox root directory, and not in some toolbox sub directory! This is very important.

If you still have problems you can override the default logfile name and location as follows:

In the <FileHandler> tag inside the <Logging> tag add the following option:

<Option key="Pattern" value="My_SUMO_Log_file.log"/>

This means that from now on the sumo log file will be saved as the file "My_SUMO_Log_file.log" in the SUMO root directory. You can use any path you like. For more information about this option see the FileHandler Javadoc.
