Data fitting - with minerr

GuyB · ‎Sep 01, 2009

In response to

http://collab.mathsoft.com/read?127698,10

I have decided to post the following worksheet, in which I hope to show clearly how to fit data in Mathcad using functions based on Minerr().

The sheet was developed in MCad v11.2a, tested in v14 M020, and saved as a v2001i file.(*)

Hope this helps,

- Guy

(*) It uses a data table, which I _think_ causes problems for versions < 2001i ... but if requested I could probably come up with an earlier version.

(**) Note: both a greatly-simplified, and an improved, more-extensive version of this sheet are available in the thread, below.

ptc-1368288 · ‎Sep 01, 2009

Quite complete, though a bit hairy for a novice user. The attached sheet is meant more for the everyday tool box. I didn't show how to constrain Minerr to pass through a "user calibration point". An advanced example is the fit done for Walter, months ago.
Saved in Mathcad 8. It should work in all later versions.

jmG

ptc-1368288 · ‎Sep 01, 2009

... ugly typos !

err(,,) is a short version of Paul W.'s module.
It is most interesting as is when it works, but it will fail eventually. The long Paul W. Minerr is the best in all cases. It should at least be presented, though not if it was not requested. Whichever method is used, the fitting session starts with the two key points in the work sheet. No matter how much science goes into curve fitting, curve fitting is an art [F.B. Hildebrand], an art that can't be exposed in full because not all possible data have been collected. Data are terrifyingly misleading, like the one this morning or yesterday's "genfit".

jmG

RichardJ · ‎Sep 02, 2009

It's very comprehensive, and useful to a novice as a tool. I think I would have to agree with Jean, though, that a novice looking for an explanation of how to do it, rather than a ready-to-use tool, might be a little overwhelmed!

Richard

TheodoreM.Bones · ‎Sep 02, 2009

It needed a little annotating, changing Y to ln(Y), after inspecting the original fit on a log-log scale graph.

GuyB · ‎Sep 06, 2009

Given the constructive feedback, I've written a much shorter & simpler fitting sheet.

- Guy

ptc-1368288 · ‎Sep 06, 2009

On 9/6/2009 12:37:18 AM, GuyBeadie wrote:
>Given the constructive
>feedback, I've written a much
>shorter & simpler fitting
>sheet.
>
> - Guy
______________________________

Nothing wrong, but still hardly educational for the novice fitter.

1. choice of model:
The fitter must be aware that without a good library of models there will be no fit. Even with good models, of which there may be dozens, choosing one over another can be quite a decision.

2. The short version err(param):=f(param,X)-Y
generally works, but at some point it will fail. On that count, I would zap MyF(X,vguess)-Y = 0 as assigned to a function and replace it with the long Paul W. Minerr module. That module lets you manipulate the range of X,Y to find what eventually fits best, or what fits at all where err(param) fails.

3. initialise:
Initialising is the most torturous part of the fit. The initial curve must cross the data set at least once; twice is much better. Some fits don't care whether the initial curve crosses the data or not. Some fits are so difficult that they can only be done manually.

4. advanced techniques:
These are particular to the fitter, mostly "data reduction" by fitting a transformed data set.

5. Supplementary fitting tools & considerations:
Sometimes smoothing is necessary. At other times the project dictates some weighting (especially in the low range), which depends on prior knowledge that the primary element of the collection is not linear with respect to the physical quantity. The point spacing is often critical. Often, too, scaling the data is necessary: scale conversion, etc.
............................
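For readers following along without the worksheets, the residual form in point 2, err(param) := f(param,X) - Y, can be sketched outside Mathcad. What follows is a minimal Python sketch under stated assumptions: it is neither Paul W.'s module nor Mathcad's Minerr, the names `model`, `residuals` and `fit` are hypothetical, the model is an illustrative single exponential, and a small damped least-squares loop (Levenberg-Marquardt style, written out for two parameters only) stands in for the solver.

```python
import math

def model(p, x):
    # Hypothetical two-parameter model: single exponential decay.
    return p[0] * math.exp(-p[1] * x)

def residuals(p, xs, ys):
    # The "short version": err(param) := f(param, X) - Y, one entry per point.
    return [model(p, x) - y for x, y in zip(xs, ys)]

def sum_sq(r):
    return sum(v * v for v in r)

def jacobian(p, xs, h=1e-6):
    # Forward-difference derivatives of the model with respect to each parameter.
    rows = []
    for x in xs:
        f0 = model(p, x)
        rows.append([(model([p[0] + h, p[1]], x) - f0) / h,
                     (model([p[0], p[1] + h], x) - f0) / h])
    return rows

def fit(p, xs, ys, iters=200):
    lam = 1e-3
    r = residuals(p, xs, ys)
    for _ in range(iters):
        J = jacobian(p, xs)
        a = sum(j[0] * j[0] for j in J)
        b = sum(j[0] * j[1] for j in J)
        d = sum(j[1] * j[1] for j in J)
        g0 = sum(j[0] * v for j, v in zip(J, r))
        g1 = sum(j[1] * v for j, v in zip(J, r))
        # Damped normal equations (J^T J + lam * diag(J^T J)) step = -J^T r,
        # solved with Cramer's rule for the 2x2 case.
        a2, d2 = a * (1 + lam), d * (1 + lam)
        det = a2 * d2 - b * b
        step = [(-g0 * d2 + g1 * b) / det, (-g1 * a2 + g0 * b) / det]
        trial = [p[0] + step[0], p[1] + step[1]]
        r_trial = residuals(trial, xs, ys)
        if sum_sq(r_trial) < sum_sq(r):
            p, r, lam = trial, r_trial, lam / 10   # accept, trust the step more
        else:
            lam *= 10                              # reject, damp harder
            if lam > 1e10:
                break                              # no further improvement found
    return p

xs = [0.5 * i for i in range(11)]
ys = [2.0 * math.exp(-0.5 * x) for x in xs]        # noise-free illustrative data
print(fit([1.0, 1.0], xs, ys))
```

The initial guess (1, 1) gives a curve that starts out in the neighbourhood of the data, which is the situation point 3 asks for; a poor guess can leave a loop like this one stuck or wandering.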

On all these counts, a tutorial must be a tool, and as direct and plug-and-play as possible. Here is a total of 60 pages of "tools". It was reported that Mathcad 14 had turned red on some modules. At least it demonstrates that "fitting is so much" and is not automatable in many cases ... not deceptively simple but simply deceptive.

Years ago (4, 5 ?) a collection of models was posted in the collab. It is now obsolete as new models appear regularly, and this collection is not on the schedule and would run to some 100 pages !

Fitting is a never-ending project.
On that, Mathcad is "Le Grand Champion".
Honestly, no other can even qualify, not being a "modular CAS".

Thanks Guy for reviving this much-demanded knowledge base.

jmG

TomGutman · ‎Sep 06, 2009

Overall a very good sheet.

One technical point. Fitting relative errors is not quite the same as fitting the logs. They are both solutions to the problem with errors proportional to the value, and are usually very similar (the better the fit, the less the difference) but they are not quite the same. While I would provide both as options, one can do with just one of them -- but the nomenclature should match what is actually done.
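A small numerical illustration of that point, as a Python sketch rather than anything taken from the worksheet. The function names are hypothetical, and the relative error is taken with respect to the data value, which is one possible convention. Since ln(1 + r) = r - r^2/2 + ..., the two residuals agree to first order and drift apart as the misfit grows.

```python
import math

def relative_residual(fit, y):
    # Relative error of the fit with respect to the data value (assumed convention).
    return (fit - y) / y

def log_residual(fit, y):
    # Difference of logarithms; requires fit > 0 and y > 0.
    return math.log(fit) - math.log(y)

# Compare the two residuals for a 1%, a 10% and a 100% misfit against y = 1.
for fit in (1.01, 1.1, 2.0):
    print(fit, relative_residual(fit, 1.0), log_residual(fit, 1.0))
```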

You might consider offering this to Mona for inclusion in the PTC newsletter. It could use some decent Mathcad articles.
__________________
Tom Gutman

TheodoreM.Bones · ‎Sep 06, 2009

Is this a revolt against the monarchy of MathCad?
Like Barebone's Parliament of the English Commonwealth of 1653.

ptc-1368288 · ‎Sep 06, 2009

>You might consider offering this to Mona for inclusion in the PTC newsletter. It could use some decent Mathcad articles.
______________________________

Last year I started splitting my 50-page collection for the newsletter, but it never arrived. As for the sheet you are talking about: it is not of practical use as a tool, but it can be adapted. The tutorial is, however, excellent once adapted.

jmg

TheodoreM.Bones · ‎Sep 06, 2009

I just wanted to explain that a mixed-up Thiele fraction is just as good, if not slightly better. The Thiele is at the bottom of this worksheet.

ptc-1368288 · ‎Sep 06, 2009

>It looks like the Thiele is up a little closer to the data at the far end than the double exp fit.
_____________________

Good point.
If the data set were real, and the model too, as in Guy's fit, and knowing the data set may be biased (hysteresis and so on), it would then be a case for applying some weighting function. An example is shown in the master work sheet posted last night.

jmG

GuyB · ‎Sep 06, 2009

On 9/6/2009 10:20:26 AM, jmG wrote:
>>It looks like the Thiele is up a little closer to the data at the far end than the double exp fit.
>_____________________
>
>Good point.

Thanks for taking the time to look through and offer comments on my sheet.

Perhaps I should point out that the data set was actually generated from a double exponential function, with some added noise which will re-compute every time you run the sheet. (You can convince yourself of that by expanding the collapsed region in which I generate the data.)

Your comments are educational for others wishing to fit data - taking too much care to fit noise can lead you astray when trying to pick a fit function.

A Thiele fit makes sense when you expect your data to fit a Thiele function. A Thiele fit may also make sense when you simply want a curve that goes through data points.

I would argue, however, that it doesn't make for a "better" fit when the underlying data actually arise from a different model. You might re-examine the quality of the fit for a couple of different data runs (hit CTRL-F9 a few times), particularly with regard to how much (or little) you gain by fitting with 8 free parameters rather than just the 4 free parameters of the double-exponential fit.

- Guy

GuyB · ‎Sep 06, 2009

On 9/6/2009 1:47:46 AM, Tom_Gutman wrote:
>One technical point. Fitting
>relative errors is not quite
>the same as fitting the logs.
>... the nomenclature should
>match what is actually done.

Good point, though I'm not quite sure how to rename the Get_Log_Fit() function in a pithy way.

Perhaps Get_Proportional_Fit().

Thanks,

- Guy

TomGutman · ‎Sep 06, 2009

I would use the term relative error (or some contraction thereof).
__________________
Tom Gutman

TomGutman · ‎Sep 06, 2009

A bit of criticism.

While a very good sheet, it's not something I'm likely to use much. The reason is the starting point, a fitting function where the parameters have been collected into a single vector. I tend to avoid such a construct, as I have trouble keeping track of what vector position corresponds (logically) with what parameter. Works OK with a few parameters, gets unwieldy by the time one has a half dozen or more parameters.
__________________
Tom Gutman

GuyB · ‎Sep 06, 2009

On 9/6/2009 4:01:14 PM, Tom_Gutman wrote:
>A bit of criticism.
>
>While a very good sheet, it's
>not something I'm likely to
>use much. The reason is the
>starting point, a fitting
>function where the parameters
>have been collected into a
>single vector.

I can respect that - it's a matter of style.

I prefer the current version because I don't have to redefine a different Minerr solve block function each time I change my model. The same framework applies for any fit function I choose to define.

I frequently play with different models to fit the same data set, so it is much more convenient (for me) to use the same Get_Fit function each time, keeping track of each parameter vector separately.

- Guy

TomGutman · ‎Sep 06, 2009

As you say, a good deal of it is a matter of style.

When I do need to make fitting functions with a single vector parameter, I would usually use the approach in the first part of the "Easy Genfit Setup" to allow me to write the function in human readable terms and then automatically create the vector form needed for fitting. But in most cases I find it as easy to just write a new solve block as to adjust the function to fit the general framework.

Another point of style -- I would do the required vectorization explicitly in the fitting routines (even if it puts in an explicit loop) rather than require that the provided function do implicit vectorization. I've seen too many failures with implicit vectorization, with the overloaded operators (mainly multiplication), to consider it safe. Especially for those with limited experience.
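The two style points above can be mimicked in a few lines of Python. This is only an analogy under stated assumptions, not the "Easy Genfit Setup" sheet: `double_exp`, `as_vector_form` and `residual_vector` are hypothetical names. The model is written with named scalar parameters, an adapter produces the single-vector form a generic fitter wants, and the loop over data points is explicit, so the model never has to cope with a vector argument.

```python
import math

def double_exp(x, a1, k1, a2, k2):
    # Readable scalar form: named parameters, one x value at a time.
    return a1 * math.exp(-k1 * x) + a2 * math.exp(-k2 * x)

def as_vector_form(f):
    # Adapter: turns f(x, a, b, ...) into g(p, x) with all parameters in one list p.
    def g(p, x):
        return f(x, *p)
    return g

def residual_vector(g, p, xs, ys):
    # Explicit loop over the data; g only ever sees a scalar x.
    return [g(p, x) - y for x, y in zip(xs, ys)]

g = as_vector_form(double_exp)
xs = [0.5 * i for i in range(10)]
ys = [double_exp(x, 3.0, 1.0, 1.0, 0.1) for x in xs]   # illustrative data
print(residual_vector(g, [3.0, 1.0, 1.0, 0.1], xs, ys))
```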

I've taken the liberty of modifying your sheet to include both log and relative error fits, and to eliminate the need for the fitting function to handle a vector input. Other minor changes because I couldn't resist.
__________________
Tom Gutman

GuyB · ‎Sep 07, 2009

On 9/6/2009 8:13:19 PM, Tom_Gutman wrote:
>...
>I've taken the liberty of
>modifying your sheet to
>include both log and relative
>error fits, and to eliminate
>the need for the fitting
>function to handle a vector
>input. Other minor changes
>because I couldn't resist.

I appreciate the effort, and have retained several of your suggestions in this new version.

I am now sticking with the v11 sheet, because saving to earlier versions mucked with my font & grid colors.

Tom, I found that the IsNaN(v) function did not work with a vector input v, at least in v11. I had to modify it by taking the dot product:

IsNaN(v*v)

Did it work in your original version when you sent a vector?

I also added a new example for the Get_Log_Fit - highlighting the confusion that ensues when Excel users do linear fits of ln-scaled data and compare the results to direct power-law fits of unscaled data.
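The source of that confusion can be reproduced outside Mathcad. The Python sketch below is not taken from the sheet; the function names, the grid search and the noise factors are all made up for illustration. It fits y = a*x^b in two ways: by a straight-line fit to (ln x, ln y), and by least squares on the unscaled data (scanning b on a grid and solving for a in closed form).

```python
import math

def loglog_linear_fit(xs, ys):
    # Ordinary least squares on (ln x, ln y); returns (a, b) for y = a * x**b.
    u = [math.log(x) for x in xs]
    v = [math.log(y) for y in ys]
    n = len(u)
    um, vm = sum(u) / n, sum(v) / n
    b = (sum((ui - um) * (vi - vm) for ui, vi in zip(u, v))
         / sum((ui - um) ** 2 for ui in u))
    return math.exp(vm - b * um), b

def direct_power_fit(xs, ys, b_lo=0.0, b_hi=3.0, steps=3000):
    # Least squares on the unscaled data: scan b on a grid, solve a in closed form.
    best = None
    for i in range(steps + 1):
        b = b_lo + (b_hi - b_lo) * i / steps
        xb = [x ** b for x in xs]
        a = sum(y * t for y, t in zip(ys, xb)) / sum(t * t for t in xb)
        sse = sum((a * t - y) ** 2 for y, t in zip(ys, xb))
        if best is None or sse < best[0]:
            best = (sse, a, b)
    return best[1], best[2]

xs = [float(i) for i in range(1, 9)]
# Made-up multiplicative scatter, fixed so the example is repeatable.
factors = [1.3, 0.8, 1.1, 0.9, 1.2, 0.85, 1.15, 0.95]
ys = [3.0 * x ** 1.5 * f for x, f in zip(xs, factors)]
print(loglog_linear_fit(xs, ys))
print(direct_power_fit(xs, ys))
```

On noise-free data the two routes return the same (a, b). With scatter they generally do not, because the first minimises squared differences of logarithms and the second minimises squared differences of the raw values, which weights the large-amplitude points much more heavily.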

- Guy

TomGutman · ‎Sep 07, 2009

IsNaN works with a vector in MC14 but not in MC11. I did my sheet in MC14 and saved it as MC11 (by and large there are more restrictions in MC14 than in MC11, but there are exceptions). Rather than use IsNaN on the dot product, I suggest IsArray. Then any scalar, including NaN, will cause all parameters to be fitted.

I do wonder why you use a fixed input table for X2/Y2, rather than generate the data as you do for the other cases. I realize I made a mistake in my sheet, starting with the double exponential decay rather than the single exponential decay from your data. But still, generating the exact values and then multiplying by either (1+rnorm(0,σ)) or exp(rnorm(0,σ)) should generate data that meets your criteria. One is actually a relative error, the other a lognormal error, but for small errors the two are very similar -- and both have the property of errors approximately proportional to the value.
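As a rough illustration of the two noise models just mentioned, here is a Python sketch with made-up names and numbers (`exact`, `noisy` and the decay constants are assumptions, and `random.gauss` merely plays the role of rnorm(0,σ)). Using the same seed for both kinds makes the underlying normal draws identical, so the two data sets can be compared point by point.

```python
import math
import random

def exact(x):
    # Hypothetical single exponential decay used only for this illustration.
    return 5.0 * math.exp(-0.7 * x)

def noisy(xs, sigma, kind, seed):
    rng = random.Random(seed)
    ys = []
    for x in xs:
        e = rng.gauss(0.0, sigma)      # stands in for rnorm(0, sigma)
        # "relative" multiplies by (1 + e); anything else is treated as lognormal.
        factor = 1.0 + e if kind == "relative" else math.exp(e)
        ys.append(exact(x) * factor)
    return ys

xs = [0.5 * i for i in range(20)]
rel = noisy(xs, 0.01, "relative", seed=1)
logn = noisy(xs, 0.01, "lognormal", seed=1)
for x, r, l in zip(xs, rel, logn):
    print(x, r, l, (l - r) / exact(x))
```

The last column is exp(e) - (1 + e), which is about e^2/2 and therefore negligible for small σ, in line with the remark that the two are very similar for small errors.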

Another useful tool in this set would be the calculation of the variance-covariance matrix for the parameters, using the procedure in PaulW's sheet. It involves approximating the derivatives of the function with respect to the parameters, but that should be no worse than the multiple fits used by the bootstrap process. It would normally be presented as a vector of standard deviations and a matrix of correlations.
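A minimal Python sketch of such a calculation, assuming the standard linearised estimate cov = s^2 (J^T J)^-1 with J the matrix of derivatives of the model with respect to the parameters; PaulW's sheet may differ in detail. The names `model`, `covariance` and `std_and_corr`, the two-parameter exponential and the data are illustrative assumptions, and the parameters passed in are assumed to be already fitted (the example simply uses the generating values as a stand-in).

```python
import math

def model(p, x):
    # Hypothetical two-parameter model: p[0] * exp(-p[1] * x).
    return p[0] * math.exp(-p[1] * x)

def covariance(p, xs, ys, h=1e-6):
    # Finite-difference Jacobian of the model with respect to the two parameters.
    J = []
    for x in xs:
        f0 = model(p, x)
        J.append([(model([p[0] + h, p[1]], x) - f0) / h,
                  (model([p[0], p[1] + h], x) - f0) / h])
    a = sum(j[0] * j[0] for j in J)
    b = sum(j[0] * j[1] for j in J)
    d = sum(j[1] * j[1] for j in J)
    det = a * d - b * b
    # Residual variance with n - 2 degrees of freedom.
    s2 = sum((model(p, x) - y) ** 2 for x, y in zip(xs, ys)) / (len(xs) - 2)
    # cov = s2 * inverse(J^T J), written out for the 2x2 case.
    return [[s2 * d / det, -s2 * b / det],
            [-s2 * b / det, s2 * a / det]]

def std_and_corr(cov):
    # Present the result as standard deviations plus a correlation matrix.
    sd = [math.sqrt(cov[0][0]), math.sqrt(cov[1][1])]
    corr = [[cov[i][j] / (sd[i] * sd[j]) for j in range(2)] for i in range(2)]
    return sd, corr

xs = [0.5 * i for i in range(11)]
# Illustrative data: exact values plus a small alternating offset as fake scatter.
ys = [model([2.0, 0.5], x) + (0.01 if i % 2 else -0.01) for i, x in enumerate(xs)]
sd, corr = std_and_corr(covariance([2.0, 0.5], xs, ys))
print(sd)
print(corr)
```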
__________________
Tom Gutman

TheodoreM.Bones · ‎Sep 07, 2009

A welter of statistics does not necessarily indicate a goodness of fit opinion obtained by visual inspection of the curves of data vs. fit. Everything might look fine with stats but overlooks a serious loop off the chart at one end, for example.

ptc-1368288 · ‎Sep 07, 2009

On 9/7/2009 7:10:39 AM, study wrote:
>A welter of statistics does
>not necessarily indicate a
>goodness of fit opinion
>obtained by visual inspection
>of the curves of data vs. fit.
>Everything might look fine
>with stats but overlooks a
>serious loop off the chart at
>one end, for example.
______________________________

100% correct: "goodness of fit" does not exist. It is a conjecture for the amateur, who should read it as "goodness of lies". For "reflexive functions", there will be as many calculated parameters as there are initial guesses. For the two-component decay example, the statistical fit from ORIGINLAB, which takes several seconds, outputs the same as the PWMinerr. The other points are that it is not possible to imitate real noise, even less possible to evaluate the nonlinearities of the capturing device(s) or to compensate for the digitization errors, and less possible still to improve the collection chain.

A "floating fit" and a "calibrated fit" aren't the same either. The fitter may have to calibrate the fit as in the 10-page 2nd sheet. For a "calibrated fit" of more general data sets, the residuals are a good tool: constrain the fit at some of the probable calibration points of the measuring chain.

I doubt anything valid will result from this thread and the example, except confusion ... especially for the exponential decay, which has several models. The idea of scanning through a set of models by class is not bad, but it eventually becomes very cumbersome; it was done as a demo but not retained as a tool. Before getting more confused with hypothetical ideas, the interested fitter must go by the essential rules:

1. the model
2. the initialisation
3. data reduction
4. some data sets can only be fitted by hand.

Sparsely populated data sets are sometimes insufficient; sometimes they are the only route. Many real-life data sets just can't be described by a single model function ... Oh ! yes they can, with a discontinuous function, but that is not my point. My point is a resulting "clean data set" that can be exported for user interpolation in more standard software, even something as primitive as a user control system. Some data sets (looking simple to the naive) took me 100's of hrs: no big-bang techniques out of the blue, just a good tool box.

Interesting:
this thread will surely last till the end of this collaboratory.

jmG

GuyB · ‎Sep 07, 2009

On 9/7/2009 3:54:59 AM, Tom_Gutman wrote:
>.... Rather than use
>IsNaN on the dot product, I
>suggest IsArray. Then any
>scalar, including NaN, will
>cause all parameters to be
>fitted.

A good thought.

>...I do wonder why you use a
>fixed input table for X2/Y2,
>rather than generate the data
>as you do for the other cases.
>... But still, generating
>the exact values and then
>multiplying by either
>(1+rnorm(0,σ)) or
>exp(rnorm(0,σ)) should
>generate data that meets your
>criteria.

I had to fix the data table because randomly-generated data will sometimes produce data fits that wander off the small-amplitude data and sometimes not.

In fact, as you might expect when you think about it, you'll find that most of the time fits to randomly-generated data look acceptable.

So, I fixed the data set to make the point clear each time someone runs the sheet.

>Another useful tool in this
>set would be the calculation
>of the variance-covariance
>matrix for the parameters,
>using the procedure in PaulW's
>sheet.

I did consider this, and I'm sure those used to seeing & using the covariance matrix would like to have such tools. I've never really used the covariance matrix myself, though, as I pointed out in:

http://collab.mathsoft.com/read?78851,11

I state the reasons for my reluctance there, which Paul acknowledges in the post immediately following it.

The Bootstrap method fails to reveal dependency relationships between fit coefficients, but it does tend to provide a more-accurate assessment of their overall uncertainty values.

- Guy

TomGutman · ‎Sep 08, 2009

While I agree that the variance-covariance matrix can be misleading, I don't see that the bootstrap process is obviously better.

The variance-covariance matrix is indeed a point measurement, strictly valid only for an infinitesimal region around the minimum. But the range of validity is quite variable, and for reasonable fits will usually encompass the range of interest. If the fit is really poor it might not be a very accurate assessment of how poor. Note that while it is based on a linear approximation, that linear approximation is what is used as the basis for the Levenberg-Marquardt algorithm.

The bootstrap process has its own assumptions and limitations. While it is a monte-carlo method in the sense that it uses (pseudo)random numbers, it is not a classic monte-carlo method in that it does not represent an accurate modelling of the desired system. What is wanted is the result of repeated sampling from the same underlying population. Here the underlying population is known, as the data was generated by a particular model, and one can simply generate multiple data sets using the same parameter. In a real life situation one can make some additional assumptions and calculate the distribution that would hold if the model were true.

But the bootstrap process keeps using the same sample. It is effectively just playing with the weights. Each point is given a weight (in addition to any weighting done in the model) which is an integer and can be zero. The distribution of those weights is not quite clear. As a first order approximation, each point gets n chances with a probability of 1/n, resulting in a binomial distribution (for large n, approximately a Poisson distribution). But the weights are not quite independent, the construction forcing their sum to be n. This will distort the distribution, probably not by much, but I don't know the form.
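The weights described here are easy to inspect directly. The Python sketch below is only an illustration with hypothetical names (`bootstrap_weights`, the seed and the sample size are arbitrary): it draws n indices with replacement and counts how often each point was picked. Drawn this way, the weight vector follows a multinomial distribution with n trials and equal probabilities 1/n, so each individual weight is binomial(n, 1/n) and the constraint that the weights sum to n is built in.

```python
import random
from collections import Counter

def bootstrap_weights(n, rng):
    # Draw n indices with replacement; the weight of point i is how often it was drawn.
    counts = Counter(rng.choices(range(n), k=n))
    return [counts.get(i, 0) for i in range(n)]

rng = random.Random(0)
n = 50
w = bootstrap_weights(n, rng)
print(w, sum(w))   # the weights always add up to n

# Fraction of points left out (weight zero), averaged over many resamples,
# compared with the binomial probability of zero successes, (1 - 1/n)**n.
reps = 2000
zero_fraction = sum(bootstrap_weights(n, rng).count(0) for _ in range(reps)) / (reps * n)
print(zero_fraction, (1 - 1 / n) ** n)
```

For large n the probability (1 - 1/n)^n approaches 1/e, so roughly a third of the points are missing from any one resample.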

In any case the result is not values based on random sampling from the population (what one is looking for) but rather the variations that can be gotten by using different weights with the one given sample. I consider the relationship between this weight driven variation and the "true" variation to be just as unknown, and subject to empirical verification, as the relationship between the point estimate variance-covariance matrix and the "true" variation.
__________________
Tom Gutman

TheodoreM.Bones · ‎Sep 08, 2009

Bootstrapping is the same as shuffle and deal.

I was thinking of jmG's remarks about the impossibility of any random noise command to give a real life weighing representation on raw or target data collected from natural observations. In real life Brownian movements control the noise levels and frequencies, but using the runif command, the selected limits are artificial and subject to uncertainty themselves. Observational errors are not taken into consideration here.

How is the "goodness of fit" evaluated? Recall that the parameters mean, MLE, minimum chisquare, etc. all tend to be the same IF the stdeviations are zero. Accordingly, throwing away outliers tends to emphasize that they are excessive "Brownian movements" in the data and ought to be disregarded. Then trimming more and more such data tends to collect a final cluster of data with zero stdeviation, that is with the mean, MLE, or minimum chisquare value.

Then the "goodness of fit" is data trimmed to the mean, etc. It is not a simple parameter.

ptc-1368288 · ‎Sep 09, 2009

Yes Theodore,

Our comments and practice always corroborate. Data fitting is an act of humility, i.e. the wise vs the pretentious. Fitting data is simply a conjecture, and more maths makes it less usable. Like squaring the circle, the adventure starts with understanding the project "sur le fond" (in substance).
Here is an "initiatic" usable tool.

jmG

TheodoreM.Bones · ‎Sep 09, 2009

Getting the best initial guesses for the Minerr work is an art in itself. Here, the user is freed of such effort with a white noise ensemble of guesses that can run towards a solution faster than the chaotic dependent values in the noisy signal can upset the work. Repeated use of the Ctrl+F9 command can give better correlation. The probability of the white noise made guesses for the Thiele fraction to succeed is higher than the chances of failure caused by the noisy signal. There may be rare chances of a freeze.

TheodoreM.Bones · ‎Sep 09, 2009

For data unadorned by user-made chaotics, the chances of an ensemble of white noise initial guesses for a Thiele model Minerr process to start are much higher than a single user-made guess and run (hit & run).

RichardJ · ‎Sep 08, 2009

A minor modification to the Residual_Vector function. Your log fit can sometimes fail. As written, if the data is real it can generate the negative of the desired function, so that the log then has a constant imaginary part (i.e. there are two possible minima). You can see it in your sheet if you look at the plot showing the logarithmic fit, and keep hitting Ctrl F9. Sometimes the first coefficient is negative, and the plot will vanish.
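The mechanism can be seen in a couple of lines. This Python sketch is not the modified Residual_Vector function, just an illustration with a hypothetical `log_residual`: Python's `cmath.log` stands in for a logarithm that returns a complex value for a negative argument.

```python
import cmath
import math

def log_residual(fit_value, y):
    # Complex logarithm, so a negative fit value does not raise an error.
    return cmath.log(fit_value) - math.log(y)

good = log_residual(2.0, 2.0)       # fit has the right sign
flipped = log_residual(-2.0, 2.0)   # fit is the negative of the data
print(good, flipped)
```

The flipped case differs from the good one only by a constant imaginary part of π, the same for every data point. The residual therefore has the same shape as a function of the remaining parameters, and a minimiser can settle just as happily on the sign-flipped function: the second minimum described above.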

Richard