Difference between revisions of "FAQ"

Revision as of 10:46, 5 March 2014

General

What is a global surrogate model?

A global surrogate model is a mathematical model that mimics the behavior of a computationally expensive simulation code over the complete parameter space as accurately as possible, using as few data points as possible. So note that optimization is not the primary goal, although it can be done as a post-processing step. Global surrogate models are useful for:

design space exploration, to get a feel of how the different parameters behave
sensitivity analysis
what-if analysis
prototyping
visualization
...

In addition, they are a cheap way to model large-scale systems: multiple global surrogate models can be chained together in a model cascade.

What about surrogate driven optimization?

When hearing the term surrogate driven optimization, most people associate it with trust-region strategies and simple polynomial models. These frameworks first construct a local surrogate which is optimized to find an optimum. Afterwards, a move limit strategy decides how the local surrogate is scaled and/or moved through the input space. Subsequently the surrogate is rebuilt and optimized. In other words, the surrogate zooms in on the global optimum. For instance the DAKOTA Toolbox implements such strategies where the surrogate construction is separated from optimization.

Such a framework was earlier implemented in the SUMO Toolbox but was deprecated as it didn't fit the philosophy and design of the toolbox.

Instead another, equally powerful, approach was taken. The current optimization framework is in fact a sampling selection strategy that balances local and global search. In other words, it balances between exploring the input space and exploiting the information the surrogate gives us.

A configuration example can be found here.

What is (adaptive) sampling? Why is it used?

In classical Design of Experiments you need to specify the design of your experiment up-front. Or in other words, you have to say up-front how many data points you need and how they should be distributed. Two examples are Central Composite Designs and Latin Hypercube designs. However, if your data is expensive to generate (e.g., an expensive simulation code) it is not clear how many points are needed up-front. Instead data points are selected adaptively, only a couple at a time. This process of incrementally selecting new data points in regions that are the most interesting is called adaptive sampling, sequential design, or active learning. Of course the sampling process needs to start from somewhere so the very first set of points is selected based on a fixed, classic experimental design. See also Running#Understanding_the_control_flow. SUMO provides a number of different sampling algorithms: SampleSelector

Of course sometimes you don't want to do sampling. For example, if you have a fixed dataset you just want to load all the data in one go and model that. For how to do this see FAQ#How_do_I_turn_off_adaptive_sampling_.28run_the_toolbox_for_a_fixed_set_of_samples.29.3F.
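The incremental selection of data points can be illustrated with a minimal sketch. This is not the SUMO implementation: it starts from a small fixed initial design and then adds one point at a time using a simple maximin-distance (space-filling) rule as a stand-in for a real sample selector. The function names and the toy `expensive_simulation` are hypothetical.

```python
import random

def expensive_simulation(x):
    # Stand-in for a costly simulator (hypothetical); here just a cheap 1-D function.
    return x * x

def select_next_point(samples, candidates):
    # Pick the candidate that is farthest from all existing samples (maximin distance).
    # A real sample selector would also use information from the surrogate model.
    return max(candidates, key=lambda c: min(abs(c - s) for s in samples))

def adaptive_sampling(initial_design, budget, n_candidates=200, seed=0):
    rng = random.Random(seed)
    xs = list(initial_design)                     # fixed, classic initial design
    ys = [expensive_simulation(x) for x in xs]
    while len(xs) < budget:
        candidates = [rng.uniform(-1.0, 1.0) for _ in range(n_candidates)]
        x_new = select_next_point(xs, candidates)
        xs.append(x_new)
        ys.append(expensive_simulation(x_new))
        # In a real run the surrogate would be rebuilt here, and the loop would
        # stop as soon as the model is accurate enough.
    return xs, ys

xs, ys = adaptive_sampling(initial_design=[-1.0, 0.0, 1.0], budget=10)
```

The point of the sketch is only the control flow: a fixed design first, then a few points at a time until a budget (or, in practice, an accuracy target) is reached.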

What about dynamical, time dependent data?

The original design and purpose was to tackle static input-output systems, where there is no memory. Just a complex mapping that must be learnt and approximated. Of course you can take a fixed time interval and apply the toolbox but that typically is not a desired solution. Usually you are interested in time series prediction, e.g., given a set of output values from time t=0 to t=k, predict what happens at time t=k+1,k+2,...

The toolbox was originally not intended for this purpose. However, it is quite easy to add support for recurrent models. Automatic generation of dynamical models would involve adding a new model type (just like you would add a new regression technique) or require adapting an existing one. For example it would not be too much work to adapt the ANN or SVM models to support dynamic problems. The only extra work besides that would be to add a new Measure that can evaluate the fidelity of the models' prediction.

Naturally though, you would be unable to use sample selection (since it makes no sense in those problems), unless of course there is a specialized need for it. In that case you would add a new SampleSelector.

For more information on this topic Contact us.

What about classification problems?

The main focus of the SUMO Toolbox is on regression/function approximation. However, the framework for hyperparameter optimization, model selection, etc. can also be used for classification. Starting from version 6.3 a demo file is included in the distribution that shows how this works on the well-known two-spiral test problem. It is possible to specify a run as a classification problem by setting the 'classificationMode' and 'numberOfClasses' options in ContextConfig in the configuration file. Classification models from WEKA are also available in SUMO. Please refer to the default configuration file for an explanation of how to use the WEKA model types available through SUMO. The LOLA-Voronoi sample selection scheme also supports classification, and its usage is documented in the default configuration file as well.

Does SUMO support discrete inputs/outputs?

Not if you mean in a smart way. There is a way to flag an input/output as discrete but it is not used anywhere. It is on the wishlist but we have not been able to get to it yet. Discrete inputs are just handled as if they were continuous. Depending on how many levels there are and whether there is an ordering, this may work ok or not work at all. You could of course add your own model type that can handle these :) As for discrete outputs see FAQ#What_about_classification_problems.3F.

Can the toolbox drive my simulation code directly?

Yes it can. See the Interfacing with the toolbox page.

What is the difference between the M3-Toolbox and the SUMO-Toolbox?

The SUMO toolbox is a complete, feature-full framework for automatically generating approximation models and performing adaptive sampling. In contrast, the M3-Toolbox was more of a proof-of-principle.

What happened to the M3-Toolbox?

The M3 Toolbox project has been discontinued (Fall 2007) and superseded by the SUMO Toolbox. Please contact tom.dhaene@ua.ac.be for any inquiries and requests about the M3 Toolbox.

How can I stay up to date with the latest news?

To stay up to date with the latest news and releases, we recommend subscribing to our newsletter here. Traffic will be kept to a minimum (1 message every 2-3 months) and you can unsubscribe at any time.

You can also follow our blog: http://sumolab.blogspot.com/.

What is the roadmap for the future?

There is no explicit roadmap since much depends on where our research leads us, what feedback we get, which problems we are working on, etc. However, to get an idea of features to come you can always check the Whats new page.

You can also follow our blog: http://sumolab.blogspot.com/.

Will there be an R/Scilab/Octave/Sage/.. version?

At the start of the project we considered moving from Matlab to one of the available open source alternatives. However, after much discussion we decided against this for several reasons, including:

Existing experience and know-how of the development team
The widespread use of the Matlab platform in the target application domains
The quality and amount of available Matlab documentation
The quality and number of Matlab toolboxes
Support for object orientation (inheritance, polymorphism, etc.)
Many well documented interfacing options (especially the seamless integration with Java)

Matlab, as a proprietary platform, definitely has its problems and deficiencies but the number of advanced algorithms and available toolboxes make it a very attractive platform. Equally important is the fact that every function is properly documented, tested, and includes examples, tutorials, and in some cases GUI tools. A lot of things would have been a lot harder and/or time consuming to implement on one of the other platforms. Add to that the fact that many engineers (particularly in aerospace) already use Matlab quite heavily. Thus given our situation, goals, and resources at the time, Matlab was the best choice for us.

The other platforms remain on our radar however, and we do look into them from time to time. Though, with our limited resources porting to one of those platforms is not (yet) cost effective.

What are collaboration options?

We will gladly help out with any SUMO-Toolbox related questions or problems. However, since we are a university research group the most interesting goal for us is to work towards some joint publication (e.g., we can help with the modeling of your problem). Alternatively, it is always nice if we could use your data/problem (fully referenced and/or anonymized if necessary of course) as an example application during a conference presentation or in a PhD thesis.

The most interesting case is if your problem involves sample selection and modeling. This means you have some simulation code or script to drive and you want an accurate model while minimizing the number of data points. In this case, in order for us to optimally help you it would be easiest if we could run your simulation code (or script) locally or access it remotely. Otherwise it's difficult to give good recommendations about what settings to use.

If this is not possible (e.g., expensive, proprietary or secret modeling code) or if your problem does not involve sample selection, you can send us a fixed data set that is representative of your problem. Again, this may be fully anonymized and will be kept confidential of course.

In either case (code or dataset) remember:

the data file should be an ASCII file in column format (each row containing one data point) (see also Interfacing_with_the_toolbox)
include a short description of your data:
- number of inputs and number of outputs
- the range of each input (or scaled to [-1 1] if you do not wish to disclose this)
- if the outputs are real or complex valued
- how noisy the data is or if it is completely deterministic (computer simulation) (please also see: FAQ#My_data_contains_noise_can_the_SUMO-Toolbox_help_me.3F).
- if possible the expected range of each output (or scaled if you do not wish to disclose this)
- if possible the names of each input/output + a short description of what they mean
- any further insight you have about the data, expected behavior, expected importance of each input, etc.
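As a rough illustration of the checklist above, the following sketch scales each input column to [-1, 1] and writes one data point per row in plain ASCII. The column order (inputs first, then outputs), the whitespace separator, and the function names are assumptions made for the example; the exact format the toolbox expects is described on Interfacing_with_the_toolbox.

```python
def scale_to_unit_range(values):
    # Linearly map a list of numbers onto [-1, 1]; a constant column maps to 0.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

def format_dataset(inputs, outputs):
    # inputs / outputs: lists of rows, one row per data point.
    columns = [scale_to_unit_range(list(col)) for col in zip(*inputs)]
    scaled_rows = list(zip(*columns))
    lines = []
    for x_row, y_row in zip(scaled_rows, outputs):
        # One data point per row: scaled inputs followed by the raw outputs.
        lines.append(" ".join(f"{v:.6g}" for v in list(x_row) + list(y_row)))
    return "\n".join(lines) + "\n"

inputs = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
outputs = [[1.5], [2.5], [3.5]]
text = format_dataset(inputs, outputs)
```

Writing `text` to a file gives a three-row, three-column ASCII data set in which the original input ranges are no longer visible.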

If you have any further questions or comments related to this please Contact us.

Can you help me model my problem?

Please see the previous question: FAQ#What_are_collaboration_options.3F

Installation and Configuration

What is the relationship between Matlab and Java?

Many people do not know this, but your Matlab installation automatically includes a Java virtual machine. By default, Matlab seamlessly integrates with Java, allowing you to create Java objects from the command line (e.g., 's = java.lang.String'). It is possible to disable Java support, but the SUMO Toolbox needs it to be enabled. To check if Java is enabled you can use the 'usejava' command (e.g., usejava('jvm')).

What is Java, why do I need it, do I have to install it, etc. ?

The short answer is: no, don't worry about it. The long answer is: Some of the code of the SUMO Toolbox is written in Java, since it makes a lot more sense in many situations and is a proper programming language instead of a scripting language like Matlab. Since Matlab automatically includes a JVM to run Java code there is nothing you need to do or worry about (see the previous FAQ entry). Unless it's not working of course, in that case see FAQ#When_running_the_toolbox_you_get_something_like_.27.3F.3F.3F_Undefined_variable_.22ibbt.22_or_class_.22ibbt.sumo.config.ContextConfig.setRootDirectory.22.27.

What is XML?

XML stands for eXtensible Markup Language and is related to HTML (= the stuff web pages are written in). The first thing you have to understand is that it does not do anything. Honest. Many engineers are not used to it and think it is some complicated computer programming language-stuff-thingy. This is of course not the case (we ignore some of the fancy stuff you can do with it for now). XML is a markup language, meaning it provides rules for how you can annotate or structure existing text.

The way SUMO uses XML is really simple and there is not much to understand (for more information on how SUMO uses XML go to this page).

First some simple terminology. Take the following example:

   <Foo attr="bar">bla bla bla</Foo>

Here we have a tag called Foo containing the text bla bla bla. The tag Foo also has an attribute attr with value bar. '<Foo>' is what we call the opening tag, and '</Foo>' is the closing tag. Each time you open a tag you must close it again. How you name the tags or attributes is totally up to you, you choose :)

Let's take a more interesting example. Here we have used XML to represent information about a recipe for pancakes:

  <recipe category="dessert">
    <title>Pancakes</title>
    <author>sumo@intec.ugent.be</author>
    <date>Wed, 14 Jun 95</date>
    <description>
      Good old fashioned pancakes.
    </description>
    <ingredients>
      <item>
        <amount>3</amount>
        <type>eggs</type>
      </item>
    
      <item>
        <amount>0.5 tablespoon</amount>
        <type>salt</type>
      </item>
      ...
    </ingredients>
    <preparation>
      ...
    </preparation>
  </recipe>

So basically, you see that XML is just a way to structure, order, and group information. That's it! SUMO uses it to store and structure configuration options, and this works well due to the nice hierarchical nature of XML.

If you understand this there is nothing else to it in order to be able to understand the SUMO configuration files. If you need more information see the tutorial here: http://www.w3schools.com/XML/xml_whatis.asp. You can also have a look at the wikipedia page here: http://en.wikipedia.org/wiki/XML

Why does SUMO use XML?

XML is the de facto standard way of structuring information. This ranges from spreadsheet files (Microsoft Excel for example), to configuration data, to scientific data, ... There are even whole database systems based solely on XML. So basically, it's an intuitive way to structure data and it is used everywhere. As a result, a very large number of libraries and programming languages can parse and handle XML easily. That means less work for the programmer. Then of course there is stuff like XSLT, XQuery, etc. that makes life even easier. So basically, it would not make sense for SUMO to use any other format :). For more information on how SUMO uses XML go to this page.

I get an error that SUMO is not yet activated

Make sure you installed the activation file that was mailed to you, as explained in the Installation instructions. Also double check that your system meets the System requirements and that Java is enabled. To fully verify that the activation file installation is correct, ensure that the file ContextConfig.class is present in the directory <SUMO installation directory>/bin/java/ibbt/sumo/config.

Please note that more flexible research licenses are available if it is possible to collaborate in any way.

Upgrading

How do I upgrade to a newer version?

Delete your old <SUMO-Toolbox-directory> completely and replace it by the new one. Install the new activation file / extension pack as before (see Installation), start Matlab and make sure the default run works. To port your old configuration files to the new version: make a copy of default.xml (from the new version) and copy over your custom changes (from the old version) one by one. This should prevent any weirdness if the XML structure has changed between releases.

If you had a valid activation file for the previous version, just Contact us (giving your SUMOlab website username) and we will send you a new activation file. Note that to update an activation file you must first unzip a copy of the toolbox to a new directory and install the activation file as if it was the very first time. Upgrading of an activation file without performing a new toolbox install is (unfortunately) not (yet) supported.

Using

What configuration options (model type, sample selection algorithm, ...) should I use for my problem?

See General_guidelines.

Ok, I generated a model, what can I do with it?

See: Using a model.

How can I share a model created by the SUMO Toolbox?

See : Model portability.

I don't like the final model generated by SUMO, how do I improve it?

Before you start the modeling you should really ask yourself this question: What properties do I want to see in the final model? You have to think about what for you constitutes a good model and what constitutes a poor model. Then you should rank those properties depending on how important you find them. Examples are:

accuracy in the training data
- is it important that the error in the training data is exactly 0, or do you prefer some smoothing
accuracy outside the training data
- this is the validation or test error, how important is proper generalization (usually this is very important)
what does accuracy mean to you? a low maximum error, a low average error, both, ...
smoothness
- should your model be perfectly smooth or is it acceptable that you have a few small ripples here and there for example
are some regions of the response more important than others?
- for example you may want to be certain that the minima/maxima are captured very accurately but everything in between is less important
are there particular special features that your model should have
- for example, capture underlying poles or discontinuities correctly
extrapolation capability
...

It is important to note that often these criteria may be conflicting. The classical example is fitting noisy data: the lower your training error the higher your testing error. A natural approach is to combine multiple criteria, see Multi-Objective Modeling.

Once you have decided on a set of requirements the question is then, can the SUMO-Toolbox produce a model that meets them? In SUMO model generation is driven by one or more Measures. So you should choose the combination of Measures that most closely matches your requirements. Of course we can not provide a Measure for every single property, but it is very straightforward to add your own Measure.

Now, let's say you have chosen what you think are the best Measures but you are still not happy with the final model. Reasons could be:

you need more modeling iterations or you need to build more models per iteration (see Running#Understanding_the_control_flow). This will result in a more extensive search of the model parameter space, but will take longer to run.
you should switch to a different model parameter optimization algorithm (e.g., instead of the Pattern Search variant, try the Genetic Algorithm variant of your AdaptiveModelBuilder.)
the model type you are using is not ideally suited to your data
there simply is not enough data, use a larger initial design or perform more sampling iterations to get more information per dimension
maybe the sample distribution is causing troubles for your model (e.g., Kriging can have problems with clustered data). In that case it could be worthwhile to choose a different sample selection algorithm.
the range of your response variable is not ideal (for example, neural networks have trouble modeling data if the range of the outputs is very very small)

You may also refer to the following General_guidelines. Finally, of course it may be that your problem is simply a very difficult one and does not approximate well. But, still you should at least get something satisfactory.

If you are having these kinds of problems, please let us know and we will gladly help out.

My data contains noise can the SUMO-Toolbox help me?

The original purpose of the SUMO-Toolbox was for it to be used in conjunction with computer simulations. Since these are fully deterministic you do not have to worry about noise in the data and all the problems it causes. However, the methods in the toolbox are general fitting methods that work on noisy data as well. So yes, the toolbox can be used with noisy data, but you will just have to be more careful about how you apply the methods and how you perform model selection. It's only when you use the toolbox with a noisy simulation engine that a few special options may need to be set. In that case Contact us for more information.

Note though that the toolbox is not a statistical package. If you have noisy data and you need noise estimation algorithms, kernel smoothing algorithms, etc., you should look towards other tools.

What is the difference between a ModelBuilder and a ModelFactory?

See Add Model Type.

Why are the Neural Networks so slow?

The ANN models are an extremely powerful model type that gives very good results in many problems. However, they are quite slow to use. There are some things you can do:

use trainlm or trainscg instead of the default training function trainbr. trainbr gives very good, smooth results but is slower to use. If results with trainlm are not good enough, try using msereg as a performance function.
try setting the training goal (= the SSE to reach during training) to a small positive number (e.g., 1e-5) instead of 0.
check that the output range of your problem is not very small. If your response data lies between 10e-5 and 10e-9 for example it will be very hard for the neural net to learn it. In that case rescale your data to a more sane range.
switch from ANN to one of the other neural network modelers: fanngenetic or nanngenetic. These are a lot faster than the default backend based on the Matlab Neural Network Toolbox. However, the accuracy is usually not as good.
If you are using CrossValidation try to switch to a different measure since CrossValidation is very expensive to use. CrossValidation is used by default if you have not defined a measure yourself. When using one of the neural network model types, try to use a different measure if you can. For example, our tests have shown that minimizing the sum of SampleError and LRMMeasure can give equal or even better results than CrossValidation, while being much cheaper (see Multi-Objective Modeling for how to combine multiple measures). See also the comments in default.xml for examples.
Finally, as with any model type things will slow down if you have many dimensions or very large amounts of data. If that is the case, try some dimensionality reduction or subsampling techniques.

How can I make the toolbox run faster?

There are a number of things you can do to speed things up. These are listed below. Remember though that the main reason the toolbox may seem to be slow is due to the many models being built as part of the hyperparameter optimization. Please make sure you fully understand the control flow described here before trying more advanced options.

First of all check that your virus scanner is not interfering with Matlab. If McAfee or any other program wants to scan every file SUMO generates this really slows things down and your computer becomes unusable.

Turn off the plotting of models in ContextConfig, you can always generate plots from the saved mat files

This is an important one. For most model builders there is an option "maxFunEvals", "maxIterations", or equivalent. Change this value to change the maximum number of models built between 2 sampling iterations. The higher this number, the slower the run, but the better the models may be. Equivalently, for the Genetic model builders reduce the population size and the number of generations.

If you are using Measures#CrossValidation see if you can avoid it and use one of the other measures or a combination of measures (see Multi-Objective Modeling).

If you are using a very dense Measures#ValidationSet as your Measure, this means that every single model will be evaluated on that data set. For some models like RBF, Kriging, SVM, this can slow things down.

Disable some, or even all of the profilers or disable the output handlers that draw charts. For example, you might use the following configuration for the profilers:

<Profiling>
	<Profiler name=".*share.*|.*ensemble.*|.*Level.*" enabled="true">
		<Output type="toImage"/>
		<Output type="toFile"/>
	</Profiler>
			
	<Profiler name=".*" enabled="true">
		<Output type="toFile"/>
	</Profiler>
</Profiling>

The ".*" means match any sequence of zero or more characters (see here for the full list of supported wildcards). Thus in this example all the profilers that have "share", "ensemble", or "Level" in their name should be enabled and should be saved as a text file (toFile) AND as an image file (toImage). All the other profilers should be saved just to file. The idea is to only save to image what you want as an image since image generation is expensive. If you do this or switch off image generation completely you will see everything run much faster.
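To see which profilers the two patterns in the example would catch, the matching can be tried out in isolation. The sketch below uses Python's `re` module purely for illustration; the profiler names are made up, and the assumptions that a pattern must match the whole name and that the first matching <Profiler> entry wins are not taken from the toolbox documentation.

```python
import re

# Patterns copied from the <Profiling> example; the profiler names below are made up.
IMAGE_AND_FILE = re.compile(r".*share.*|.*ensemble.*|.*Level.*")
FILE_ONLY = re.compile(r".*")

def outputs_for(profiler_name):
    # Assumption: the pattern must match the whole name, and the first
    # matching <Profiler> entry decides which outputs are produced.
    if IMAGE_AND_FILE.fullmatch(profiler_name):
        return ["toImage", "toFile"]
    if FILE_ONLY.fullmatch(profiler_name):
        return ["toFile"]
    return []

names = ["ensembleWeights", "LevelPlot", "sampleCount", ""]
result = {name: outputs_for(name) for name in names}
# ".*" on its own matches every name, including the empty string,
# and the patterns are case-sensitive ("level" is not "Level").
```

Trying a pattern on a handful of names like this is a quick way to check that only the intended profilers end up producing images.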

Decrease the logging granularity: a log level of FINE (the default is FINEST or ALL) is more than granular enough. Setting it to FINE, INFO, or even WARNING should speed things up.

If you have a multi-core/multi-cpu machine:
- if you have the Matlab Parallel Computing Toolbox, try setting the parallelMode option to true in Config:ContextConfig. Now all model training occurs in parallel. This may give unexpected errors in some cases so beware when using.
- if you are using a native executable or script as the sample evaluator set the threadCount variable in LocalSampleEvaluator equal to the number of cores/CPUs (only do this if it is ok to start multiple instances of your simulation script in parallel!)

Don't use the Min-Max measure, it can slow things down. See also FAQ#How_do_I_force_the_output_of_the_model_to_lie_in_a_certain_range

If you are using neural networks see FAQ#Why_are_the_Neural_Networks_so_slow.3F

If you are having problems with very slow or seemingly hanging runs:
- Do a run inside the Matlab profiler and see where most time is spent.

- Monitor CPU and physical/virtual memory usage while the SUMO toolbox is running and see if you notice anything strange.

Also note that by default Matlab only allocates about 117 MB memory space for the Java Virtual Machine. If you would like to increase this limit (which you should) please follow the instructions here. See also the general memory instructions here.

To check if your SUMO run has hung, monitor your log file (with the level set at least to FINE). If you see no changes for about 30 minutes the toolbox has probably stalled. Report the problems here.

Such problems are hard to identify and fix so it is best to work towards a reproducible test case if you think you found a performance or scalability issue.

How do I build models with more than one output?

Sometimes you have multiple responses that you want to model at once. See Running#Models_with_multiple_outputs

How do I turn off adaptive sampling (run the toolbox for a fixed set of samples)?

See : Adaptive Modeling Mode.

How do I change the error function (relative error, RMSE, ...)?

The <Measure> tag specifies the algorithm to use to assign models a score, e.g., CrossValidation. It is also possible to specify which error function to use, in the measure. The default error function is 'rootRelativeSquareError'.

Say you want to use CrossValidation with the maximum absolute error, then you would put:

<Measure type="CrossValidation" target="0.001" errorFcn="maxAbsoluteError"/>

On the other hand, if you wanted to use the ValidationSet measure with a relative root-mean-square error you would put:

<Measure type="ValidationSet" target="0.001" errorFcn="relativeRms"/>

These error functions can be found in the src/matlab/tools/errorFunctions directory. You are free to modify them and add your own. Remember that the choice of error function is very important! Make sure you think well about it. Also see Multi-Objective Modeling.

How do I enable more profilers?

Go to the <Profiling> tag and put ".*" as the regular expression. See also the next question.

What regular expressions can I use to filter profilers?

See the syntax here.

How can I ensure deterministic results?

See : Random state.

How do I get a simple closed-form model (symbolic expression)?

See : Using a model.

How do I enable the heterogeneous evolution to automatically select the best model type?

Simply use the heterogenetic modelbuilder as you would any other.

What is the combineOutputs option?

See Running#Models_with_multiple_outputs

What error function should I use?

The default error function is the Root Relative Square Error (RRSE). On the other hand meanRelativeError may be more intuitive, but in that case you have to be careful if you have function values close to zero, since there the relative error explodes or even gives infinity. You could also use one of the combined relative error functions (these contain a +1 in the denominator to account for small values) but then you get something between a relative and an absolute error (=> hard to interpret).

So to be sure an absolute error seems the safest bet (like the RMSE), however in that case you have to come up with sensible accuracy targets and realize that you will build models that try to fit the regions of high absolute value better than the low ones.
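The trade-offs between these error functions are easy to see on a few numbers. The sketch below uses textbook definitions with hypothetical function names; the toolbox's own implementations in src/matlab/tools/errorFunctions may differ in detail.

```python
import math

def rmse(y_true, y_pred):
    # Absolute error: root of the mean squared difference.
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def root_relative_square_error(y_true, y_pred):
    # Squared error relative to that of always predicting the mean of y_true.
    mean = sum(y_true) / len(y_true)
    num = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    den = sum((t - mean) ** 2 for t in y_true)
    return math.sqrt(num / den)

def mean_relative_error(y_true, y_pred):
    # Explodes for true values near zero; in Python an exact zero
    # raises ZeroDivisionError.
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_combined_relative_error(y_true, y_pred):
    # The +1 in the denominator keeps the error finite near zero.
    return sum(abs(t - p) / (1.0 + abs(t)) for t, p in zip(y_true, y_pred)) / len(y_true)

# Every prediction is off by the same absolute amount (0.1),
# but the true values span five orders of magnitude.
y_true = [0.001, 1.0, 100.0]
y_pred = [0.101, 1.1, 100.1]
```

On this data the RMSE is about 0.1 regardless of scale, the mean relative error is dominated by the point near zero, and the combined variant lies in between, which is exactly why it is harder to interpret.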

Picking an error function is a very tricky business and many people do not realize this. Which one is best for you and what targets you use ultimately depends on your application and on what kind of model you want. There is no general answer.

A recommended read is this paper. See also the page on Multi-Objective Modeling.

I just want to generate an initial design (no sampling, no modeling)

Do a regular SUMO run, except set the 'maxModelingIterations' in the SUMO tag to 0. The resulting run will only generate (and evaluate) the initial design and save it to samples.txt in the output directory.

How do I start a run with the samples of a previous run, or with a custom initial design?

Use a Dataset design component, for example:

 <InitialDesign type="DatasetDesign">
    <Option key="file" value="/path/to/the/file/containing/the/points.txt"/>
 </InitialDesign>

The points of a previous run can be found in the samples.txt file in the output directory of the run you want to continue.

As a side note, you can start the toolbox with the *data points* of a previous run, but not with the *models* of a previous run.

What is a level plot?

A level plot is a plot that shows how the error histogram changes as the best model improves.

Level plots only work if you have a separate dataset (test set) that the model can be checked against. See the comments in default.xml for how to enable level plots.

How do I force the output of the model to lie in a certain range

See Measures#MinMax.

My problem is high dimensional and has a lot of input parameters (more than 10). Can I use SUMO?

That depends. Remember that the main focus of SUMO is to generate accurate 'global' models. If you want to do sampling, the practical dimensionality is limited to around 6-8 (though it depends on the problem and how cheap the simulations are!), since the more dimensions there are, the more space you need to fill. At that point you need to see if you can extend the models with domain specific knowledge (to improve performance) or apply a dimensionality reduction method (see the next question). On the other hand, if you don't need to do sample selection but have a fixed dataset which you want to model, then the performance on high dimensional data just depends on the model type. For example, SVM type models are independent of the dimension and thus can always be applied. Though things like feature selection are always recommended.

Can the toolbox tell me which are the most important inputs (= variable selection)?

When tackling high-dimensional problems a crucial question is "Are all my input parameters relevant?". Normally domain knowledge would answer this question, but this is not always straightforward. For those cases a whole set of algorithms exists for doing dimensionality reduction (= feature selection). Support for some of these algorithms may eventually make it into the toolbox, but none are currently implemented; that is a whole PhD thesis on its own. However, if a model type provides functions for input relevance determination, the toolbox can leverage this. For example, the LS-SVM model available in the toolbox supports Automatic Relevance Determination (ARD). This means that if you use the SUMO Toolbox to generate an LS-SVM model, you can call the function ARD() on the model and it will give you a list of the inputs it thinks are most important.

Should I use a Matlab script or a shell script for interfacing with my simulation code?

When you want to link SUMO with an external simulation engine (ADS Momentum, SPECTRE, FEBIO, SWAT, ...) you need a shell script (or executable) that can take the requested points from SUMO, set up the simulation engine (e.g., set the necessary input files), call the simulator for all the requested points, read the output (e.g., one or more output files), and return the results to SUMO (see Interfacing with the toolbox).

Which one you choose (Matlab script with the Matlab Sample Evaluator, or shell script/executable with the Local Sample Evaluator) is basically a matter of preference; take whatever is easiest for you.

HOWEVER, there is one important consideration: Matlab does not support threads, so if you use a Matlab script to interface with the simulation engine, simulations and modeling will happen sequentially, NOT in parallel. The modeling code will sit around waiting, doing nothing, until the simulation(s) have finished. If your simulation code takes a long time to run, this is not very efficient.

On the other hand, using a shell script/executable does allow the modeling and simulation to occur in parallel (at least if you wrote your interface script in such a way that it can be run multiple times in parallel, i.e., no shared global directories or variables that can cause race conditions).
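One way to meet the "no shared global directories" requirement is to give every invocation of the interface script its own scratch directory. The sketch below assumes a POSIX shell with the mktemp utility available; the "sumo_sim" prefix is made up for illustration, and the actual simulator call and the exchange of points with SUMO are left as placeholders (see Interfacing with the toolbox).

```shell
#!/bin/sh
# Sketch: keep parallel invocations from clobbering each other's files.
# The "sumo_sim" prefix is a placeholder, not something SUMO requires.

# Create a fresh, uniquely named working directory and print its path.
make_workdir() {
    mktemp -d "${TMPDIR:-/tmp}/sumo_sim.XXXXXX"
}

workdir=$(make_workdir) || exit 1

# Remove the private directory again when the script exits.
trap 'rm -rf "$workdir"' EXIT

echo "working in $workdir"
# Placeholder: write the simulator input files into "$workdir",
# run your simulator there, and read its output files from there.
```

Because each run gets a different directory, two copies of the script started at the same time never read or write the same intermediate files.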

As a side note, if you already put work into a Matlab script, it is still possible to use a shell script: write a shell script that starts Matlab (using the -nodisplay or -nojvm options), executes your script (using the -r option), and exits Matlab again. Of course this is not very elegant and adds some overhead, but depending on your situation it may be worth it.
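A minimal sketch of such a wrapper is shown below. The script name "mySimulatorInterface" is a placeholder for your own Matlab script, and wrapping the call in try/catch is an assumption added here so that Matlab exits even when the script fails, instead of waiting at its prompt.

```shell
#!/bin/sh
# Sketch: run an existing Matlab interface script from a shell script.
# "mySimulatorInterface" is a placeholder for your own script (without .m).

# Build the command string for -r: run the script, print the last error
# message if it fails, and exit Matlab in either case.
matlab_command() {
    printf 'try, %s; catch, disp(lasterr); end; exit;' "$1"
}

# Start Matlab without a display, execute the command and quit again.
run_matlab_script() {
    matlab -nodisplay -r "$(matlab_command "$1")"
}

# Example call (uncomment once the placeholder name is replaced):
# run_matlab_script mySimulatorInterface
```

Every call pays the full Matlab startup time, which is the overhead mentioned above.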

How can I look at the internal structure of a SUMO model

See Using_a_model#Available_methods.

Is there any design documentation available?

An in depth overview of the rationale and philosophy, including a treatment of the software architecture underlying the SUMO Toolbox is available in the form of a PhD dissertation. A copy of this dissertation is available here.

Troubleshooting

I have a problem and I want to report it

See Reporting problems.

I am getting a java out of memory error, what happened?

Datasets are loaded through java. This means that the java heap space is used for storing the data. If you try to load a huge dataset (> 50MB), you might experience problems with the maximum heap size. You can solve this by raising the heap size as described on the following webpage: [1]

I sometimes get flat models when using rational functions

First make sure the model is indeed flat, and does not just appear so on the plot. You can verify this by looking at the output axis range and making sure it is within reasonable bounds. When there are poles in the model, the axis range is sometimes stretched to make it possible to plot the high values around the pole, causing the rest of the model to appear flat. If the model contains poles, refer to the next question for the solution.

The RationalModel tries to do a least squares fit, based on which monomials are allowed in the numerator and denominator. We have experienced that some models just find a flat model as the best least squares fit. There are several possible causes for this:

There are few sample points, and the model parameters (as explained here) force the model to use only a very small set of degrees of freedom. The solution in this case is to increase the minimum percentage bound in the RationalFactory section of your configuration file: change the "percentBounds" option to "60,100", "80,100", or even "100,100". A setting of "100,100" will force the polynomial models to always interpolate exactly. However, note that this does not scale very well with the number of samples (to counter this you can set "maxDegrees"). If, after increasing "percentBounds", you still get weird, spiky models, you simply need more samples or you should switch to a different model type.
Another possibility is that, given a set of monomial degrees, the flat function is just the best possible least squares fit. In that case you simply need to wait for more samples.
The measure you are using is not accurately estimating the true error; try a different measure or error function. Note that a maximum relative error is dangerous to use, since the 0-function (= a flat model) has a lower maximum relative error than a function which overshoots the true behavior in some places but is otherwise correct.

When using rational functions I sometimes get 'spikes' (poles) in my model

When the denominator polynomial of a rational model has zeros inside the domain, the model will tend to infinity near these points. In most cases these models will only be recognized as being 'the best' for a short period of time. As more samples get selected these models get replaced by better ones and the spikes should disappear.

So, it is possible that a rational model with 'spikes' (caused by poles inside the domain) will be selected as best model. This may or may not be an issue, depending on what you want to use the model for. If it doesn't matter that the model is very inaccurate at one particular, small spot (near the pole), you can use the model with the pole and it should perform properly.

However, if the model should have a reasonable error on the entire domain, several methods are available to reduce the chance of getting poles or remove the possibility altogether. The possible solutions are:

Simply wait for more data, usually spikes disappear (but not always).
Lower the maximum of the "percentBounds" option in the RationalFactory section of your configuration file. For example, say you have 500 data points. If the maximum of the "percentBounds" option is set to 100 percent, the degrees of the polynomials in the rational function can go up to 500. If you set the maximum of the "percentBounds" option to 10, on the other hand, the maximum degree is set at 50 (= 10 percent of 500). You can also use the "maxDegrees" option to set an absolute bound.
If you roughly know the output range your data should have, an easy way to eliminate poles is to use the MinMax Measure together with your current measure (CrossValidation by default). This will cause models whose response falls outside the min-max bounds to be penalized extra, so spikes should disappear.
Use a different model type (RBF, ANN, SVM,...), as spikes are a typical problem of rational functions.
Increase the population size if you are using the genetic version.
Try using the RationalPoleSuppressionSampleSelector. It was designed to get rid of this problem more quickly, but it only selects one sample at a time.

However, these solutions may still not suffice in some cases. The underlying reason is that the order selection algorithm contains quite a lot of randomness, making it prone to over-fitting. This issue is being worked on but will take some time. Automatic order selection is not an easy problem.
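The "percentBounds" arithmetic from the list of solutions above can be checked with a few lines of shell. This is only a sketch of the example numbers (500 data points, upper bounds of 100 and 10 percent); it assumes plain integer division, and the toolbox itself may round differently.

```shell
#!/bin/sh
# Sketch of the arithmetic in the percentBounds example.
# Assumption: plain integer division; the toolbox may round differently.

# max_degree <number of samples> <upper percentBounds value>
max_degree() {
    echo $(( $1 * $2 / 100 ))
}

max_degree 500 100   # prints 500
max_degree 500 10    # prints 50
```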

There is no noise in my data yet the rational functions don't interpolate

See this question.

When loading a model from disk I get "Warning: Class ':all:' is an unknown object class. Object 'model' of this class has been converted to a structure."

You are trying to load a model file without the SUMO Toolbox in your Matlab path. Make sure the toolbox is in your Matlab path.

In short: Start Matlab, run <SUMO-Toolbox-directory>/startup.m (to ensure the toolbox is in your path) and then try to load your model.

When running the SUMO Toolbox you get an error like "No component with id 'annpso' of type 'adaptive model builder' found in config file."

This means you have specified a component with a certain id (in this case an AdaptiveModelBuilder component with id 'annpso'), but a component with that id does not exist further down in the configuration file (in this particular case 'annpso' does not exist but 'anngenetic' or 'ann' does, as a quick search through the configuration file will show). So make sure you only declare components which have a definition lower down. To see which components are available, simply scroll down the configuration file and see which ids are specified. Please also refer to the Declarations and Definitions page.

When using NANN models I sometimes get "Runtime error in matrix library, Choldc failed. Matrix not positive definite"

This is a problem in the mex implementation of the NNSYSID toolbox. Simply delete the mex files, the Matlab implementation will be used and this will not cause any problems.

When using FANN models I sometimes get "Invalid MEX-file createFann.mexa64, libfann.so.2: cannot open shared object file: No such file or directory."

This means Matlab cannot find the FANN library itself to link to dynamically. Make sure the FANN libraries (stored in src/matlab/contrib/fann/src/.libs/) are in your library path, e.g., on unix systems, make sure they are included in LD_LIBRARY_PATH.
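On a Unix system this could look like the sketch below. SUMO_ROOT and its default value are placeholders for wherever the toolbox is installed on your machine; only the relative path to the FANN libraries comes from the text above.

```shell
#!/bin/sh
# Sketch: make the FANN shared libraries visible to the dynamic linker.
# SUMO_ROOT is a placeholder; point it at your own toolbox directory.
SUMO_ROOT="${SUMO_ROOT:-$HOME/SUMO-Toolbox}"
FANN_LIB_DIR="$SUMO_ROOT/src/matlab/contrib/fann/src/.libs"

# Prepend the FANN directory, keeping any entries that are already set.
LD_LIBRARY_PATH="$FANN_LIB_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
export LD_LIBRARY_PATH

echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
```

Run these lines in (or source them into) the same shell from which you start Matlab, since the variable is only inherited by processes started afterwards from that shell.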

Undefined function or method 'createFann' for input arguments of type 'double'.

See FAQ#When_using_FANN_models_I_sometimes_get_.22Invalid_MEX-file_createFann.mexa64.2C_libfann.so.2:_cannot_open_shared_object_file:_No_such_file_or_directory..22

When trying to use SVM models I get 'Error during fitness evaluation: Error using ==> svmtrain at 170, Group must be a vector'

You forgot to build the SVM mex files for your platform. For Windows they are pre-compiled for you; on other systems you have to compile them yourself with the makefile.

When running the toolbox you get something like '??? Undefined variable "ibbt" or class "ibbt.sumo.config.ContextConfig.setRootDirectory"'

First see this FAQ entry.

This means Matlab cannot find the needed Java classes. This typically means that you forgot to run 'startup' (to set the path correctly) before running the toolbox (using 'go'). So make sure you always run 'startup' before running 'go' and that both commands are always executed in the toolbox root directory.

If you did run 'startup' correctly and you are still getting an error, check that Java is properly enabled:

1. typing 'usejava jvm' should return 1
2. typing 's = java.lang.String' should not give an error
3. typing 'version('-java')' should return at least version 1.5.0

If (1) returns 0, then the JVM of your Matlab installation is not enabled. Check your Matlab installation or startup parameters (did you start Matlab with -nojvm?). If (2) fails but (1) is OK, there is a very weird problem; check the Matlab documentation. If (3) returns a version before 1.5.0, you will have to upgrade Matlab to a newer version or force Matlab to use a custom, newer JVM (see the Matlab docs for how to do this).

You get errors related to gaoptimset,psoptimset,saoptimset,newff not being found or unknown

You are trying to use a component of the SUMO toolbox that requires a Matlab toolbox that you do not have. See the System requirements for more information.

After upgrading I get all kinds of weird errors or warnings when I run my XML files

See FAQ#How_do_I_upgrade_to_a_newer_version.3F

I get a warning about duplicate samples being selected, why is this?

Sometimes, in special circumstances, multiple sample selectors may select the same sample at the same time. Even though in most cases this is detected and avoided, it can still happen when multiple outputs are modelled in one run, and each output is sampled by a different sample selector. These sample selectors may then accidentally choose the same new sample location.

I sometimes see the error of the best model go up, shouldn't it decrease monotonically?

There is no short answer here, it depends on the situation. Below 'single objective' refers to the case where during the hyperparameter optimization (= the modeling iteration) combineOutputs=false, and there is only a single measure set to 'on'. The other cases are classified as 'multi objective'. See also Multi-Objective Modeling.

Sampling off
1. Single objective: the error should always decrease monotonically; you should never see it rise. If it does, report it as a bug.
2. Multi objective: There is a very small chance the error can temporarily increase, but it should be safe to ignore. In this case it is best to use a multi-objective enabled modeling algorithm.
Sampling on
1. Single objective: inside each modeling iteration the error should always decrease monotonically. At each sampling iteration the best models are updated (to reflect the new data), so the best model score may increase at that point; this is normal behavior(*). It is possible that the error increases for a short while, but as more samples come in it should decrease again. If this does not happen you are using a poor measure or a poor hyperparameter optimization algorithm, or there is a problem with the modeling technique itself (e.g., clustering in the data points is causing numerical problems).
2. Multi objective: Combination of 1.2 and 2.1.

(*) This is normal if you are using a measure like cross validation that is less reliable on little data than on more data. However, in some cases you may wish to override this behavior if you are using a measure that is independent of the number of samples the model is trained with (e.g., a dense, external validation set). In this case you can force a monotonic decrease by setting the 'keepOldModels' option in the SUMO tag to true. Use with caution!

At the end of a run I get Undefined variable "ibbt" or class "ibbt.sumo.util.JpegImagesToMovie.createMovie"

This is normal, the warning printed out before the error explains why:

[WARNING] jmf.jar not found in the java classpath, movie creation may not work! Did you install the SUMO extension pack? Alternatively you can install the java media framwork from java.sun.com

By default, at the end of a run, the toolbox will try to generate a movie of all the intermediate model plots. To do this it requires the extension pack to be installed (you can download it from the SUMO lab website). So install the extension pack and you will no longer get the error. Alternatively you can simply set the "createMovie" option in the <SUMO> tag to "false". So note that there is nothing to worry about, everything has run correctly, it is just the movie creation that is failing.

On startup I get the error "java.io.IOException: Couldn't get lock for output/SUMO-Toolbox.%g.%u.log"

This error means that SUMO is unable to create the log file. Check that the output directory exists and has the correct permissions. If your output directory is on a shared (network) drive this could also cause problems. Also make sure you are running the toolbox (calling 'go') from the toolbox root directory, and not from some toolbox subdirectory! This is very important.
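On Linux or another Unix system, the existence and permission checks can be done from a terminal. The helper below is a hypothetical sketch (check_output_dir is not part of the toolbox); the directory name "output" is taken from the path in the error message.

```shell
#!/bin/sh
# Sketch: check that a directory exists and is writable by the current user.
# check_output_dir is a made-up helper name, not a SUMO command.
check_output_dir() {
    if [ ! -d "$1" ]; then
        echo "missing: $1"
        return 1
    fi
    if [ ! -w "$1" ]; then
        echo "not writable: $1"
        return 1
    fi
    echo "ok: $1"
}

# Example call from the toolbox root directory (uncomment to use):
# check_output_dir output
```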

If you still have problems you can override the default logfile name and location as follows:

In the <FileHandler> tag inside the <Logging> tag add the following option:

<Option key="Pattern" value="My_SUMO_Log_file.log"/>

This means that from now on the sumo log file will be saved as the file "My_SUMO_Log_file.log" in the SUMO root directory. You can use any path you like. For more information about this option see the FileHandler Javadoc.

The Toolbox crashes with "Too many open files" what should I do?

This is a known bug, see Known_bugs#Version_6.1.

If this does not fix your problem then do the following:

On Windows, try increasing the limit as dictated by the error message. Also, when you get the error, use the fopen('all') command to see which files are open and send us the list of filenames, so we can help you debug the problem further. Even better would be to use the Process Explorer utility available here. When you get the error, don't shut down Matlab but start Process Explorer and see which SUMO-Toolbox related files are open. If you then let us know, we can debug the problem further.

On Linux again don't shut down Matlab but:

open a new terminal window
type:

lsof > openFiles.txt

Then send us the following information:
- the file openFiles.txt
- the exact Linux distribution you are using (Red Hat 10, CentOS 5, SUSE 11, etc).
- the output of

uname -a ; df -T ; mount

As a temporary workaround you can try increasing the maximum number of open files (see for example here). We are currently debugging this issue.
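On Linux the limit can be inspected with the shell's ulimit builtin, as in the sketch below. The value 4096 in the commented-out line is an arbitrary example, not a recommendation from the SUMO developers.

```shell
#!/bin/sh
# Sketch: inspect the per-process limit on open files for this shell.
soft_limit=$(ulimit -S -n)
hard_limit=$(ulimit -H -n)
echo "open files: soft limit $soft_limit, hard limit $hard_limit"

# To raise the soft limit for this shell session only (4096 is an arbitrary
# example value and must not exceed the hard limit), uncomment:
# ulimit -S -n 4096
```

The new limit only applies to programs started from that shell afterwards, so start Matlab from the same terminal.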

In general: to be safe it is always best to do a SUMO run from a clean Matlab startup, especially if the run is important or may take a long time.

When using the LS-SVM models I get lots of warnings: "make sure lssvmFILE.x (lssvmFILE.exe) is in the current directory, change now to MATLAB implementation..."

The LS-SVMs have a C implementation and a Matlab implementation. If you don't have the compiled mex files, the toolbox will use the Matlab implementation and give a warning, but everything will work properly. To get rid of the warnings, compile the mex files as described here; this can be done very easily. Alternatively, simply comment out the lines that produce the output in the lssvmlab directory in src/matlab/contrib.

I get an error "Undefined function or method 'trainlssvm' for input arguments of type 'cell'"

You most likely forgot to install the extension pack.

When running the SUMO-Toolbox under Linux, the X server suddenly restarts and I am logged out of my session

Note that in Linux there is an explicit difference between the kernel and the X display server. If the kernel crashes or panics, your system completely freezes (you have to reset manually) or your computer does a full reboot. Luckily this is very rare. However, if your display server (X) crashes or restarts, your operating system is still running fine; it's just that you have to log in again since your graphical session has terminated. This FAQ entry is only about the latter. If you find your kernel is panicking or freezing, that is a more fundamental problem and you should contact your system admin.

So what happens is that after a few seconds, when the toolbox wants to plot the first model, X crashes and you are suddenly presented with a login screen. The problem is not due to SUMO but rather to the interaction between Matlab and the display server.

What you should first do is set plotModels to false in the Config:ContextConfig tag, run again and see if the problem occurs again. If it does please report it. If the problem does not occur you can then try the following:

Log in as root (or use sudo)
Edit the following configuration file using a text editor (pico, nano, vi, kwrite, gedit,...)

/etc/X11/xorg.conf

Note: the exact location of the xorg.conf file may vary on your system.

Look for the following line:

  Load         "glx"

Comment it out by replacing it by:

#  Load         "glx"

Then save the file, restart your X server (if you do not know how to do this simply reboot your computer)
Log in again, and try running the toolbox (making sure plotModels is set to true again). It should now work. If it still does not please report it.

Note:

this is just an empirical workaround, if you have a better idea please let us know
if you wish to debug further yourself please check the Xorg log files and those in /var/log
another possible workaround is to start matlab with the "-nodisplay" option. That could work as well.

I get the error "Failed to close Matlab pool cleanly, error is Too many output arguments"

This happens if you run the toolbox on Matlab version 2008a and you have the parallel computing toolbox installed. You can simply ignore this error message, it does not cause any problems. If you want to use SUMO with the parallel computing toolbox you will need Matlab 2008b.

The toolbox seems to keep on running forever, when or how will it stop?

The toolbox will keep on generating models and selecting data until one of the termination criteria has been reached. It is up to you to choose these targets carefully, so how long the toolbox runs simply depends on what targets you choose. Please see Running#Understanding_the_control_flow.

Of course, choosing targets up front is not always easy and there is no real solution for this, except thinking well about what type of model you want (see FAQ#I_dont_like_the_final_model_generated_by_SUMO_how_do_I_improve_it.3F). When in doubt you can always use a small value (or 0) and then simply quit the running toolbox using Ctrl-C when you think it's been enough.

While one could implement fancy, automatic stopping algorithms, their actual benefit is questionable.

@@ Line 45: / Line 45: @@
 === What about classification problems? ===
-The main focus of the SUMO Toolbox is on regression/function approximation.  However, the framework for hyperparameter optimization, model selection, etc.  can also be used for classification.  Starting from version 6.3 a demo file is included in the distribution that shows how this works on the well known two spiral test problem.
+The main focus of the SUMO Toolbox is on regression/function approximation.  However, the framework for hyperparameter optimization, model selection, etc.  can also be used for classification.  Starting from version 6.3 a demo file is included in the distribution that shows how this works on the well known two spiral test problem. It is possible to specify a run as a classification problem by setting the 'classificationMode' and 'numberOfClasses' option in ContextConfig in the configuration file. Classification models from WEKA are also available in SUMO. Please refer to the default configuration file for the explanation on usage of WEKA model types available through SUMO. The LOLA-Voronoi sample selection scheme also supports classification, and its usage is documented in the default configuration file as well.
 === Does SUMO support discrete inputs/outputs ===