In this work we study systems of statistical models using algebraic methods.  We begin by converting statistical models to polynomial form and, more completely, to algebraic objects called “algebraic varieties”.  These can include so-called “differential polynomials” for handling differential equations, ways to represent “if-then” statements, and so on.  The algebraic forms also accommodate statistical distributions (as “semi-algebraic sets”).

Secondly, we pay considerable attention to the well-behaved connections that occur among statistical datasets and models.  Three of the most important kinds are those that transform datasets from one kind to another, those that translate statistical models from one form to another, and those that involve model parameterizations.  We flesh out the use of these and other well-behaved connections.

Finally, when using algebraic methods, the question is usually not whether the methods are correct; the more important question is typically whether they are useful.  In this context, the job of the software is to see whether that is the case in practice.  About one-third of the software is for applications to quantitative pharmacology; the rest is for proteomics.  We are releasing the software (written from 2004 to 2006, and from late 2007 to 2008) here in the spring and summer of 2008.












What is a Statistical Model?

To store statistical models and to define services that use them, we first need to ask “What is a statistical model?” [1].  Our answer is that statistical models are essentially “algebraic” by nature.  See, for example:

[1] Peter McCullagh, "What is a statistical model?", The Annals of Statistics, 2002, Vol. 30, No. 5, 1225–1310

[2] Seth Sullivant, "Statistical Models are Algebraic Varieties", 2006


Starting point #1:  It is useful to transform models into systems of polynomials (more completely, into “semi-algebraic sets”)

In this work, we take models in proteomics (specifically in R and SAS) and models in pharmacology (specifically in NONMEM, WinBUGS, and Monolix) and convert them to algebraic forms.  In a practical sense, that mostly involves taking all of the data, statistical models, and metadata, and turning them into a list of equations.  (The equations mostly just involve vectors and polynomials.)  We then create databases that show the algebraic representations together with the language-specific interpretations of those models.  On the pharmacology side, the database design begins with a fairly standard design (e.g., as used in the “Obiwan” system of GSK, written in 2003 to handle NONMEM runs), augmented with tools suggested by the algebraic perspectives mentioned here.
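As a small illustration of what “a list of equations” can look like, the sketch below encodes a textbook example, the independence model for two binary variables, as a semi-algebraic set: polynomial equalities plus inequalities on the four cell probabilities.  The example and all names in it (EQUALITIES, INEQUALITIES, in_model, product_distribution) are assumptions made for illustration; they are not taken from the project's software.

```python
from fractions import Fraction as F

# Hypothetical encoding: each constraint is a plain function of the cell
# probabilities p = (p00, p01, p10, p11) of a 2x2 contingency table.
EQUALITIES = [
    lambda p: p[0] + p[1] + p[2] + p[3] - 1,  # probabilities sum to one
    lambda p: p[0] * p[3] - p[1] * p[2],      # independence: 2x2 determinant is zero
]
INEQUALITIES = [lambda p, i=i: p[i] for i in range(4)]  # each p_ij >= 0


def in_model(p):
    """True if p satisfies every polynomial equality and inequality."""
    return (all(g(p) == 0 for g in EQUALITIES)
            and all(h(p) >= 0 for h in INEQUALITIES))


def product_distribution(a, b):
    """Parameterization: P(X=1)=a and P(Y=1)=b give the four cell probabilities."""
    return ((1 - a) * (1 - b), (1 - a) * b, a * (1 - b), a * b)


# Exact rational arithmetic avoids floating-point noise in the equality checks.
independent = product_distribution(F(1, 3), F(1, 4))
dependent = (F(1, 2), F(0), F(0), F(1, 2))

print(in_model(independent))  # point produced by the parameterization
print(in_model(dependent))    # valid distribution, but not independent
```

The same table of probabilities is thus described twice: implicitly by the constraint list, and explicitly by the parameterization that generates points satisfying it.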


Starting point #2:  It is useful to think not about individual statistical models, but rather systems of models.

This is the potato-chip model of statistical models: you cannot have just “one”, you always want “many”.  Moreover, there are connections among the various datasets and models, and these connections need to be “well-behaved”.  In a formal sense, this means that we don’t just want to talk about individual datasets; we want to talk about systems of datasets.  For us, the systems are algebraic structures (“algebras” for short).  We look at systems of models in the same way.

The main practical result with respect to datasets is that we focus considerable effort on well-behaved data transformations.  With respect to statistical models, the need is to spend a lot of time translating models from one form to another (e.g., in pharmacology, translating NONMEM models to WinBUGS, and WinBUGS models to NONMEM).

In the algebraic perspective, “well-behaved” connections, such as those for data transformations and model translations, are called “functors”.  In this context, another important functor (of a different type from those associated with data transformations and model translations) is essentially a model’s parameterization.  We flesh out practical versions of about seven or eight of these functors overall.
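One concrete reading of “well-behaved” is that composition is preserved: transforming the data in two steps and carrying the model along each step should agree with doing both steps at once.  The sketch below checks this for a deliberately simple case, a change of units on the predictor of a straight-line model.  It illustrates only the composition-preserving property, not the full categorical notion, and the names (Rescale, induced, predict) are hypothetical.

```python
from fractions import Fraction as F


class Rescale:
    """A data transformation x -> c * x (e.g. a change of units), with c != 0."""

    def __init__(self, c):
        self.c = F(c)

    def __call__(self, x):
        return self.c * x

    def then(self, other):
        # Composition: apply self first, then other.
        return Rescale(self.c * other.c)


def induced(t):
    """Map a data transformation to the matching map on the parameters (a, b)
    of the line y = a + b*x, chosen so that predictions are unchanged."""
    return lambda params: (params[0], params[1] / t.c)


def predict(params, x):
    a, b = params
    return a + b * x


mg_to_ug = Rescale(1000)
ug_to_ng = Rescale(1000)
params = (F(2), F(3))
x = F(5)

# Carry the parameters along each step, or along the composite in one go.
step_by_step = induced(ug_to_ng)(induced(mg_to_ug)(params))
all_at_once = induced(mg_to_ug.then(ug_to_ng))(params)

print(step_by_step == all_at_once)
print(predict(all_at_once, mg_to_ug.then(ug_to_ng)(x)) == predict(params, x))
```

Both checks hold here because the induced parameter map was chosen to undo the rescaling; a translation between modeling languages would need the analogous check for each kind of model it handles.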


Starting point #3:  Algebraic perspectives can complement biological and other perspectives in useful ways.

In this work, we simply represent models as lists of equations, defined using the algebraic perspectives above.  As implied in the previous point, algebraic approaches are typically “aligned” with biological approaches.  In some cases, the formal perspectives simply confirm the value of approaches that have been in use for a long time.  In other cases, though, we believe that algebraic perspectives can also provide additional slogans, extra useful “nitpicking” rules, or additional theorems to make use of.

One problem is that algebraic representations are quite terse (even cryptic) by nature.  When communicating what a model or procedure is, it is often useful to draw on other approaches as well.  Sometimes, for example, a “good old FOR loop” (as in the case of WinBUGS) can be a very good way to write a model.
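To make the contrast concrete, the sketch below writes the same quantity two ways for a straight-line model y = a + b*x: once as a loop over observations, and once as a single polynomial in (a, b) whose coefficients are sums computed from the data.  The data values and function names are invented for illustration.

```python
# Hypothetical toy data.
xs = [1, 2, 3]
ys = [2, 4, 7]


def rss_loop(a, b):
    """Loop form: one residual per observation, easy to read."""
    total = 0
    for x, y in zip(xs, ys):
        mu = a + b * x
        total += (y - mu) ** 2
    return total


def rss_polynomial(a, b):
    """Algebraic form: the same residual sum of squares expanded into one
    polynomial in (a, b); the data enter only through six summary numbers."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    syy = sum(y * y for y in ys)
    return (syy + n * a * a + b * b * sxx
            - 2 * a * sy - 2 * b * sxy + 2 * a * b * sx)


print(rss_loop(1, 2), rss_polynomial(1, 2))
```

The polynomial form is compact and convenient to manipulate algebraically, while the loop form makes the per-observation structure of the model obvious, which is the trade-off described above.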

Other types of “concrete” representations are also apt to go hand in hand with algebraic approaches.  For example, in proteomics, the algebraic forms are apt to complement “concrete” messaging formats such as mzXML; in pharmacology, they are apt to complement an XML definition of models, such as that defined by the Nonlinear Mixed Effects Consortium (NLMEc), or a practical language, such as Mike Dunlavey’s PML language for modeling and simulation [1].

In a practical sense, this project began in the fall of 2002, with the code written from 2004 to 2006 and from late 2007 to the present.  We are completing and uploading this software during the spring and summer of 2008.


[1] Mike Dunlavey, “Next Generation Modeling Language”, PAGE 16 (2007), Abstr 1076

(This site is maintained by Cellular Statistics, LLC.)



Open Statistical Services