Compact model specification with user-defined types
A proposal for supporting models of synapses and other user-definable component types in NeuroML
Current version: 0.5.0, 17 September 2010
This proposal consists of:
- An overview of how types and models are declared
- A set of XML elements for defining and using user-defined types
- A proof-of-concept interpreter for processing and running models built with these elements
- Some examples illustrating how elements can be defined and used, including example4, which uses some additional structures not currently supported by the interpreter.
- A summary of the canonical form for a model that corresponds to the elements presented here
- Miscellaneous discussion (this page) of the objectives, design issues, benefits and weaknesses of this approach
- A list of problems with the current version and probable requirements to make it useful for modeling synapses.
The best place to start is probably the first example. Anyone familiar with modeling and model specification should be able to read the XML and make out what is going on.
After that, feel free to download the interpreter, run the examples, and try constructing your own.
What it does so far
You can define Types of component (eg a "HH channel" or "a bi-exponential synapse") which express the general properties of a particular type of thing that goes in a model. This includes saying what parameters they have, what child elements they are allowed, and how they behave (the equations).
You can then define Components based on these types by supplying values for the parameters and adding any child elements that are required, so, for example, a bi-exponential synapse model with rise time 1ms and decay 5ms would be a component.
Types can extend other Types to add extra parameters, fix certain values, and otherwise modify their behavior. Components can extend other Components to reuse specified parameter values. There is also a loose notion of abstract types, so a component can accept children with a particular lineage without needing to know exactly what type they are. This can be used, for example, to define cells that accept synaptic connections provided they have a particular signature.
Each type can have a Behavior element that specifies how it behaves: what the state variables are, the equations that govern them, and what happens when events are sent or received. The interpreter takes a model consisting of type and component elements referenced from a network, builds an instance from them and runs it.
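As an illustrative sketch of how a type, its behavior, and a component based on it fit together, the fragment below defines a decaying-conductance synapse type and a component that supplies parameter values. The element and attribute names follow the style used in the examples but are not quoted verbatim from them, so treat the details as hypothetical:

```xml
<!-- Illustrative sketch only: element and attribute names follow the
     style of the proposal's examples, but are not taken verbatim from them. -->
<Type name="BiExpSynapse">
    <Parameter name="tauRise" dimension="time"/>
    <Parameter name="tauDecay" dimension="time"/>
    <Behavior>
        <!-- The state variable and the equation governing it -->
        <StateVariable name="g" dimension="conductance"/>
        <TimeDerivative variable="g" value="-g / tauDecay"/>
    </Behavior>
</Type>

<!-- A concrete component: the type plus specific parameter values -->
<Component id="fastSynapse" type="BiExpSynapse"
           tauRise="1ms" tauDecay="5ms"/>
```

Note that parameter values are written as dimensional quantities ("1ms", "5ms") rather than bare numbers, in line with the physical/biological orientation discussed below.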
For those familiar with object-oriented languages, the Type/Component distinction is close to the usual Class/Instance distinction. When the model is run, the same pattern applies again, with the Components acting as class definitions and their "instances" actually containing the state variables in the running model.
The March 2010 NeuroML meeting identified a need to extend the capability within NeuroML for expressing a range of models of synapses. It was decided that the hitherto adopted approach of defining parameterized building blocks, so that models are constructed by combining blocks and setting parameters, was unlikely to be flexible enough to cope with the needs of synapse models. This is not obvious a priori since, for example, the NeuroML 2.0 ion channel building blocks are fully adequate to describe the dynamics of a wide range of existing channel models. But there appears to be no such commonality in models used for synapses, where the mechanisms range from highly detailed biochemical models to much more abstract ones.
This work also has antecedents in Catacomb 3, which was essentially a GUI for a component definition system and model builder using a type system similar to the one proposed here. Much of the XML processing code used in the interpreter was taken from PSICS, which itself currently uses the "building block" approach to model specification. The need for user-defined types has been considered with respect to future PSICS development, and this proposal also reflects potential requirements for PSICS.
Model description languages differ markedly in where their focus lies and how they value (or disregard) particular features. Such features include how important it is for model specifications to be:
- Minimally redundant
- Low entropy (see below)
- Machine readable
- Human readable
- Writable from existing simulators
- Writable by hand
- Mappable onto existing simulators
- Language independent
- Mathematically oriented (dimensionless variables and equations)
- Physically/biologically oriented (everything dimensional)
The term "entropy" is used as a loose analogy. The idea is that a model as conceived by a modeler, or as described in a paper, is highly structured. The quantities occurring in it are physical quantities (voltages and times rather than just numbers) and the structures are concise, hierarchical and minimally redundant. This is a low entropy representation. As the model gets converted into something that can be run on a computer, most of the structure is removed. Dimensional quantities get divided by units to provide dimensionless numbers, and mechanistic concepts get converted to equations. It goes through a state of being a bunch of state variables and equations and eventually ends up as numerical code implementing state update rules. This is the high entropy end. Models that are only available as compiled executables are the extreme high entropy end. Those that are only available as C code are a close second, and can only be converted to low entropy forms by extensive manual curation.
You can automate the process of turning a low-entropy representation into a runnable model, but in general you can't automatically get back to a low entropy representation from a higher entropy one. Simulators vary in how well they represent and preserve low entropy models, but older simulators in particular tend to increase the entropy from the start, and the only internal model representation used is often of rather higher entropy than the representation created by the modeler. For example, a modeler might be forced to render their model dimensionless before getting it into the simulator. The units they used to do this would probably be recorded in comments in the source files, but they are not part of the internal state of the simulator, so it is unable to write out a low entropy model.
This proposal is all about expressing and protecting low entropy representations. These are the most valuable representation of a model because they can readily be turned into a variety of higher entropy representations as used by different simulators. Note that this low entropy focus may not be suitable to a model exchange language such as NineML which is intended to be writable by existing simulators. For that a medium entropy representation is probably required.
> For those familiar with software engineering, the entropy discussion is essentially a variant of the DRY (Don't Repeat Yourself, or 'DIE': Duplication Is Evil) principle in software design.
Mathematical v. Physical/biological
There are two issues here. One is whether a bundle of state variables and equations is enough to make a model. Mathematically, of course, this is all there is to many models, but scientifically such a representation is of relatively high entropy, since the structure, hierarchy and relations have been lost. The second question is about the quantities that go into a model. Are they numbers, or are they physical quantities? The distinction is that for a mathematical model one might say (as, e.g., CellML does) "v is a number which represents a voltage measured in millivolts". I.e., it's what you get when you take a voltage and divide it by another voltage which is 1mV. For a physical model, you'd just say "v is a voltage" and leave it at that. Then v is a rich quantity with magnitude and dimensions.
If your modeling system takes the first approach, it can force the user to render quantities dimensionless themselves or can provide some support for unit conversions. But in either case, the quantities in the equations that the modeler enters are just numbers.
In the second approach, the quantities occurring in equations are dimensional quantities. This is the norm in written model descriptions, but it is generally not the norm in model description software. Being focused on turning things into executable code (involving bare numbers), the latter tends to dispense with dimensionality as soon as possible. This seems unfortunate, because premature non-dimensionalisation opens up all sorts of cans of worms for the modeler, and indeed for the software developer, that simply don't need to be opened. Sticking with dimensional quantities throughout the model description phase makes most of these problems vanish.
Incidentally, this is related to the XML construct often seen in model specifications where a quantity has both a value and a unit, as in '<mass value="3" unit="kg"/>'. This suggests that somehow the mass of the item (the 'value' of its mass) was '3' and that the mass itself has a unit, rather than its mass (or the 'value' thereof) being '3kg', as in normal usage. In fact, of course, neither the '3' nor the 'kg' is an attribute of the mass. Exactly the same quantity could be expressed with '3000' and 'g', or '3E-3' and 'Ton'. Neither is meaningful on its own, so ideally this calls for a dedicated XML datatype.
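To make the contrast concrete, here is a sketch of the two styles side by side. The element names are hypothetical illustrations, not taken from any particular schema:

```xml
<!-- Value/unit split: '3' and 'kg' masquerade as separate
     attributes of the mass, though neither means anything alone. -->
<particle>
    <mass value="3" unit="kg"/>
</particle>

<!-- Single dimensional quantity: the magnitude and unit stay
     together as one string, as in normal written usage. -->
<particle mass="3kg"/>
<particle mass="3000g"/>   <!-- exactly the same physical quantity -->
```

The second style is the one adopted throughout this proposal: the string "3kg" is parsed as a single dimensional quantity, so different unit choices for the same quantity yield the same model.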
In the light of the above discussion, this proposal prioritizes:
- Low entropy models
- Human readability
- Human writability
- Physical/biological (as opposed to mathematical or computational) model specification
- Conciseness and minimal redundancy
A consequence of these priorities is that it is probably going to be very hard to automatically export models in this format from existing simulators unless they already have a low-entropy internal representation. In general, it will involve rewriting them by hand.
This format is not intended as a canonical form for a model, although there is a clear need for such a model specification format. Rather, it is better to think of it as a lightweight XML user interface to a nascent canonical form. Canonical forms are inevitably hard to work with directly (e.g., a canonical form for model specification should probably use MathML, which requires an intermediate tool to read or write), and the present form makes a much simpler structure within which to develop and explore model specification capabilities. Once suitable structures have been arrived at, the corresponding canonical form can be specified.
It is intended that a relatively simple XML mapping should map losslessly between this format and the canonical form with a couple of constraints. The present format allows multiple ways to express the same model. For example, the same quantity can be expressed with different magnitude units and elements can be written in a number of different ways. Such variants of the same model should all map to the same canonical form. For the mapping to be invertible, the additional information (such as that the original value of a current was given in nA for example) will have to be stored in metadata in the canonical form of the model.
Comparison with other systems
There are strong parallels in VHDL. The hierarchical components proposal for SBML Level 3 looks to be heading towards the same end point from a different direction. There are also comparisons to be made with the facilities for modular model representation in CellML 1.1. Like SBML, CellML is arriving from the other end, with a substantial body of models expressed in a standalone, medium entropy form, and a curation process to abstract out modules that can then be referenced from several models.
There are also close parallels with NineML, which may, ideally, provide a standardized format for losslessly writing and re-reading models expressed in LEMS.
Comparison with 'building-block' languages
The great thing about a building-block language is that a model can be reliably and relatively easily mapped onto efficient code to execute it. This is not the case with the present proposal, or with other general systems where the user can define their own equations. One way round this is to develop smarter symbolic algebra capabilities so that efficient numerical implementations can be generated from the equations. This could work in some cases; it is hard to see it working in all cases. Another way round it is for an implementation to spot structures it recognizes and map them to efficient hard-coded implementations, but to keep the capability to run new, unrecognized, structures (albeit more slowly). However, I'm not aware of any implementations that actually do this yet, so it remains to be seen whether it is a workable approach.
If a system in which modelers develop new models and simulators gradually add hard-coded support for the most popular ones is to be made to work, then there must be a strong incentive for reusing an existing type (that can then be recognized by the simulator) rather than re-expressing the whole model from scratch. This is probably more of an issue for simulators and model sharing infrastructure than for the language itself, although, of course, the language itself must provide capabilities to do this cleanly.
The format is designed to be read and written by hand (as well as by machines). Nevertheless, the parsers used for the XML and the expressions are among the more straightforward parts of the implementation. The most common alternative is to pre-parse the expressions and make the XML store the parse tree. This would make it easier to process in a simulator (no need for an expression parser). Similarly, dimensional quantities could be split into a number attribute and a unit attribute, which would also make them slightly easier to process. However, these benefits are slight, and would only accrue to the small number of developers writing simulators. In the meantime, every modeler would have to write quantities out the long way and find a way to generate XML parse trees for expressions. At the very least, an interpreter should provide tools for the user to do this, so the user can write input as a normal expression. But if the interpreter can do this, why standardize on the unreadable post-parsed form rather than the concise form written by the user? It has been suggested that this is necessary to avoid ambiguities in what a model means, but I suspect that is more of a hypothetical danger than a real problem, given all the other things that can go wrong with a model specification.
Finally, for those who want it, it is straightforward to convert models in this format to pre-digested element-only XML with parse trees instead of expressions. Indeed, the interpreter will generate this format if called with the '-p' qualifier in 'java -jar lems-x.x.x.jar -p model.xml'.
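For illustration, the same equation in the concise and pre-digested forms might look something like the following. The element names in the parse-tree version are hypothetical, chosen only to show the shape of the transformation:

```xml
<!-- Concise form, as written by a modeler: -->
<TimeDerivative variable="g" value="-g / tauDecay"/>

<!-- A pre-digested, element-only rendering of the same expression.
     Element names here are hypothetical illustrations. -->
<TimeDerivative variable="g">
    <Divide>
        <Negate><Var name="g"/></Negate>
        <Var name="tauDecay"/>
    </Divide>
</TimeDerivative>
```

The second form is trivial for a simulator to consume but tedious for a human to write or review, which is the trade-off discussed above.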