Do regression fitting on Org Mode tables

Regression fitting predicts numerical values of a column based on other columns. The closer the predicted values to the observed values, the better the fit is.

New

The output is now displayed with the same accuracy as the input (see Accuracy of the output). Output used to display all available digits.

Example

We have a table of observed data. Buses with passengers consume fuel (around 30 liters / 100km) given in the 3rd column. We also have the year of the travel, and the number of passengers.

| year | passengers | consumption |
|------+------------+-------------|
| 2012 |         33 |        33.2 |
| 2013 |         31 |        32.8 |
| 2016 |         26 |        31.8 |
| 2020 |         28 |        30.5 |
| 2013 |         47 |        33.1 |
| 2012 |         24 |        33.1 |
| 2019 |         23 |        30.8 |
| 2018 |         41 |        30.9 |
| 2020 |         29 |        30.5 |
| 2020 |         21 |        30.4 |
| 2014 |         27 |        32.4 |
| 2014 |         16 |        32.3 |
| 2014 |         24 |        32.4 |
| 2020 |         31 |        30.5 |
| 2012 |         19 |        33.0 |

Can we predict the consumption using year and passengers data? Let us try with a simple linear-with-constant model.

To do so, let us put the cursor in the consumption column and type M-x orgtbl-fit

Two additional columns appear:

the predicted values by the linear-with-constant model,
the differences between observations and predictions.

| year | passengers | consumption | Best Fit | Fit Diff |
|------+------------+-------------+----------+----------|
| 2012 |         33 |        33.2 |    33.16 |    -0.04 |
| 2013 |         31 |        32.8 |    32.81 |     0.01 |
| 2016 |         26 |        31.8 |    31.76 |    -0.04 |
| 2020 |         28 |        30.5 |    30.44 |    -0.06 |
| 2013 |         47 |        33.1 |    32.94 |    -0.16 |
| 2012 |         24 |        33.1 |    33.09 |    -0.01 |
| 2019 |         23 |        30.8 |    30.73 |    -0.07 |
| 2018 |         41 |        30.9 |    31.21 |     0.31 |
| 2020 |         29 |        30.5 |    30.45 |    -0.05 |
| 2020 |         21 |        30.4 |    30.38 |    -0.02 |
| 2014 |         27 |        32.4 |    32.44 |     0.04 |
| 2014 |         16 |        32.3 |    32.35 |     0.05 |
| 2014 |         24 |        32.4 |    32.42 |     0.02 |
| 2020 |         31 |        30.5 |    30.46 |    -0.04 |
| 2012 |         19 |        33.0 |    33.05 |     0.05 |
#+TBLFM: $4=707.281913694 - 0.335181456447*$1 + 7.98331706947e-3*$2;%.2f::$5=$4-$3;%.2f

We see that the predictions are quite close to the observations.

The formula found by orgtbl-fit is given at the bottom of the table:

$4
= 707.2819
- 0.3352*$1
+ 0.00798*$2

Which can be interpreted as:

In the year 0, buses used to consume 707.2819 liters/100km (of course this value has no real meaning, as the validity of the model is around 2012-2020)
Every year, buses succeed in reducing their consumption by 0.3352 liters/100km. Evidence is the negative sign for this coefficient.
Each additional passenger generates an overhead of 0.00798 liters/100km.

As usual, we should be wary of jumping to conclusions based on data only. Coming back to the field and understanding it is mandatory in data analysis.

Usage

Put the cursor in the column you want to fit (not necessarily the last one).
Type M-x orgtbl-fit.
Give a model (the linear-with-constant model is provided as a default).
Look at the additional two columns at the end of the table.
Look at the formula for the before-last column: it is the given model with placeholders replaced by actual numerical values.
Remove the last two columns (M-S-left) and play again if you are not satisfied by the results.
Add rows to the table with new data to compute predictions.

Model

orgtbl-fit needs a model. It is an equation assumed to compute the target column. But the parameters are not specified. Instead, question marks ? denote placeholders.

A default model is offered. It is a linear-with-constant model with all relevant columns. In the example above, it is:

? +?*year +?*passengers

Only relevant columns appear in the default model. A relevant column is an entirely numerical one. Of course, the target column do not appear.

If the table has no header, then columns must be refered to by their $ names. The previous example would be:

? +?*$1 +?*$2

Even though the default model is linear-with-constant, any model can be used, be it polynomial, logarithmic, exponential, whatever.

For example, the following model is acceptable:

? +?*log(year-1900) +?*passengers

Actually it fits quite well the data.

Another model could be:

? +?*year +?*passengers +?*(year-2016)*passengers

It would be a valid hypothesis if we suspect that there is a cross relationship between year and passengers.

Let us try it:

| year | passengers | consumption | Best Fit | Fit Diff |
|------+------------+-------------+----------+----------|
| 2012 |         33 |        33.2 |    33.21 |     0.01 |
| 2013 |         31 |        32.8 |    32.83 |     0.03 |
| 2016 |         26 |        31.8 |    31.77 |    -0.03 |
| 2020 |         28 |        30.5 |    30.43 |    -0.07 |
| 2013 |         47 |        33.1 |    33.05 |    -0.05 |
| 2012 |         24 |        33.1 |    33.05 |    -0.05 |
| 2019 |         23 |        30.8 |    30.81 |     0.01 |
| 2018 |         41 |        30.9 |    31.05 |     0.15 |
| 2020 |         29 |        30.5 |    30.42 |    -0.08 |
| 2020 |         21 |        30.4 |    30.52 |     0.12 |
| 2014 |         27 |        32.4 |    32.44 |     0.04 |
| 2014 |         16 |        32.3 |    32.33 |     0.03 |
| 2014 |         24 |        32.4 |    32.41 |     0.01 |
| 2020 |         31 |        30.5 |    30.40 |    -0.10 |
| 2012 |         19 |        33.0 |    32.97 |    -0.03 |
#+TBLFM: $4=501.454507662 - 0.233013277919*$1 + 2.837561942e-3*$2 - 3.68345292814e-3*$2*($1 - 2016);%.2f::$5=$4-$3;%.2f

The fit is slightly better. After all there might be a cross relationship between years and passengers. The coefficient is negative. The interpretation could go something like that: “newer buses are designed in a more efficient way, such that additional passengers do not increase consumption as it used to”.

Or maybe we are over-interpreting data. This new quadratic coefficient, -0.00368, is quite small. And consumption data is quite coarse: just 3 digits accuracy.

Returning to the field and understanding it is mandatory.

Note that in the model we have chosen:

? +?*year +?*passengers +?*(year-2016)*passengers

there is this arbitrary 2016 constant. But if we change it to 2000 or whatever, the results are exactly the same. This is because changes in this constant will result in changes in the third question mark ?, that make the overall equations identical.

Note also that in this new model, we added one question mark ?. We should be careful not to over-fit. If our model has as many question marks ? as rows in the table, the fit will be perfect (zero discrepancies between observations and predictions). But our model will no longer summarize the data. Use as few coefficients (question marks ?) as possible to fit data, but not less.

Forecasting new data

We came up with a quite good predictive model for data observations. But could it be applied to new data? Answer is Yes! And it is easy: just add new rows to the table.

Example. Coming back to the buses example, we add 3 rows to the table.

| year | passengers | consumption | Best Fit | Fit Diff |
|------+------------+-------------+----------+----------|
| 2012 |         33 |        33.2 |    33.16 |    -0.04 |
| 2013 |         31 |        32.8 |    32.81 |     0.01 |
| …
| 2020 |         31 |        30.5 |    30.46 |    -0.04 |
| 2012 |         19 |        33.0 |    33.05 |     0.05 |
|------+------------+-------------+----------+----------|
| 2017 |         45 |             |    31.58 |    31.58 |
| 2022 |         40 |             |    29.86 |    29.86 |
| 2027 |         35 |             |    28.15 |    28.15 |
#+TBLFM: $4=707.281913694 - 0.335181456447*$1 + 7.98331706947e-3*$2;%.2f::$5=$4-$3;%.2f

We ask Org to recompute the table by typing C-u C-c *. And we get our 3 predictions. Only the Best Fit column is worth looking at.

Note that two of the rows are for the 2022 and 2027 years, outside the 2012-2020 range of observations. There is no problem as far as the model is concerned. However we should be careful, as reality may invalidate our predictions with unforeseen events.

“Prediction is a difficult art, especially when it concerns the future” – Grucho Max? Mark Twain? Pierre Dac? Niels Bohr?

How it works?

A classical regression fitting is performed. The process attempts to change the placeholders (question marks ?) in the model equation by actual numerical values. Those numerical values are chosen so that they minimize the sum of the squares of errors (the so called ”least squares method”). They also ensure that the sum of errors is zero.

orgtbl-fit is an interface between Org Tables and Calc. Calc is the 30 years old symbolic calculator of Emacs. More specifically, the calcFunc-fit Lisp function is called as the back-end. You can call it directly in Calc.

Of course, regression with the least squares method is implemented in many systems: R and GnuPlot to name just two. Use orgtbl-fit if you want a pure Emacs system without external dependencies, and integration with Org Tables.

Accuracy of the output

Let us fit the observ column in this example:

| d |  q | observ | Best Fit | Fit Diff |
|---+----+--------+----------+----------|
| 1 | 12 |  98.55 |   99.163 |    0.613 |
| 2 | 51 | 148.86 |  147.778 |   -1.082 |
| 3 | 67 | 172.91 |  173.664 |    0.754 |
| 4 |  8 | 126.15 |  125.434 |   -0.716 |
| 5 |  7 | 134.09 |  134.520 |    0.430 |
#+TBLFM: $4=77.2301611968 + 10.0745377652*$1 + 0.988214672678*$2;%.3f::$5=$4-$3;%.3f

The output in the Best Fit and Fit Diff columns display 3 digits after dot, as in 99.163. Actually, more digits are computed, but only 3 are displayed. It is on purpose. The results are formatted with one more digit than the observ input column. The input column contains values like 98.55 with 2 digits after dot.

This “1 more digits in output than in input” policy is a good compromise:

non significant digits are not displayed to avoid distraction,
there cannot be more accuracy in the output that there is in the input,
one more digit is useful for not having the feeling that something is hidden.

This default formatting is easy to change with the Org Mode tools:

with cursor in the Best Fit column, type C-c =,
this brings an editor for the column formula,
at the end, change ;%.3f into another valid Org Mode format, like ;f3 or ;g or whatever, or even remove it,
Type C-u C-c * to refresh the table.

For very large or very small values like 2.36e20 or 5.89e-20, there is no suitable formatter to display digits up to a specified common precision for the whole column. Instead, a ;%#.5g formatter is used. Here 5 means that values will be displayed with 5 significant digits. This is in case the target column appears to have only 4 significant digits.

Here is the previous example with large values:

| d |  q |    observ |    Best Fit |     Fit Diff |
|---+----+-----------+-------------+--------------|
| 1 | 12 |  .9855e20 | 9.91633e+19 |  6.13300e+17 |
| 2 | 51 | 1.4886e20 | 1.47778e+20 | -1.08200e+18 |
| 3 | 67 | 1.7291e20 | 1.73664e+20 |  7.54000e+17 |
| 4 |  8 | 1.2615e20 | 1.25434e+20 | -7.16000e+17 |
| 5 |  7 | 1.3409e20 | 1.34520e+20 |  4.30000e+17 |
#+TBLFM: $4=7.72301611968e19 + 1.00745377652e19*$1 + 9.88214672678e17*$2;%#.6g::$5=$4-$3;%#.6g

We see that in the Best Fit column, all values are similar to the values in the target column, with one more digit. Except the first one, 9.91633e+19, which have 2 additional digits compared to .9855e20. This is unavoidable with the formatters offered by Org Mode. Thought the result looks good.

Fruits baskets

We need to estimate the average weight of an apple, a banana, and a blueberry. But we don’t have any device to weight fruits. We can just read a label on each basket giving its overall weight. All baskets are the same, but each contain a different mix of fruits.

Let us store data in an Org Mode table, one row per observation. Each row counts the number of fruits in a basket, plus the overall weight including the basket itself (in grams).

| apples | bananas | blueberries | total weight |
|--------+---------+-------------+--------------|
|      6 |       5 |          50 |         2816 |
|      1 |       5 |         232 |         2094 |
|      1 |       0 |         143 |          952 |
|      7 |       0 |          57 |         2087 |
|      0 |       2 |         263 |         1363 |
|      4 |       2 |         246 |         2193 |
|      7 |       0 |         184 |         2341 |
|      5 |       5 |         149 |         2800 |
|      1 |       3 |         272 |         1784 |
|      3 |       0 |          32 |         1170 |
|      0 |       5 |          55 |         1525 |
|      7 |       5 |         276 |         3486 |
|      3 |       2 |         227 |         1943 |
|      7 |       5 |          95 |         3131 |
|      1 |       4 |          59 |         1559 |

Let us put the cursor on the total weight column, and type M-x orgtbl-fit. The default linear-plus-constant model is the one to use. Result is:

| apples | bananas | blueberries | total weight | Best Fit | Fit Diff |
|--------+---------+-------------+--------------+----------+----------|
|      6 |       5 |          50 |         2816 |   2820.4 |      4.4 |
|      1 |       5 |         232 |         2094 |   2093.8 |     -0.2 |
|      1 |       0 |         143 |          952 |    953.7 |      1.7 |
|      7 |       0 |          57 |         2087 |   2088.1 |      1.1 |
|      0 |       2 |         263 |         1363 |   1359.6 |     -3.4 |
|      4 |       2 |         246 |         2193 |   2195.9 |      2.9 |
|      7 |       0 |         184 |         2341 |   2340.0 |     -1.0 |
|      5 |       5 |         149 |         2800 |   2799.2 |     -0.8 |
|      1 |       3 |         272 |         1784 |   1787.7 |      3.7 |
|      3 |       0 |          32 |         1170 |   1168.6 |     -1.4 |
|      0 |       5 |          55 |         1525 |   1525.3 |      0.3 |
|      7 |       5 |         276 |         3486 |   3486.0 |      0.0 |
|      3 |       2 |         227 |         1943 |   1940.7 |     -2.3 |
|      7 |       5 |          95 |         3131 |   3127.1 |     -3.9 |
|      1 |       4 |          59 |         1559 |   1558.0 |     -1.0 |
#+TBLFM: $5=452.653490802 + 217.494062646*$1 + 192.718341278*$2 + 1.9829588217*$3;%.1f::$6=$5-$4;%.1f

Note that the two columns computed by orgtbl-fit are formatted with one decimal place more than the target column.

We see that the predicted values are quite close to the actual ones. Evidences are the very tiny differences shown in the last column. Good!

The formula is easy to interpret:

total weight
= 452.653490802         // basket
+ 217.494062646 * $1    // apples
+ 192.718341278 * $2    // bananas
+ 1.9829588217 * $3     // blueberries

The empty basket weights 452.7 g.
An average apple weights 217.5 g.
An average banana weights 192.7 g.
A single blueberry weights 1.98 g.

The best teacher?

You are a teacher who gives a 3 days course every week. The course is always the same, but you give it to new students each time.

At the end of the course, you require your students to complete a quiz. Based on the results, you know that an average student retains anything between 5% to 65% of the course. Yes, unfortunately, it never reaches 100%.

You want to optimize your course. You believe that there are 2 main factors:

theo: the proportion between theoretical and practical content. 100% theory is as bad as 100% practice. The best is somewhere in between, but you don’t know where.
```
theo ∈ [0..1]
    
```
duration: theoretical lectures alternate with practical exercises, at this rate. Periods of 3 minutes are too short, and periods of 3 hours are too long. The best duration is in between.
```
duration ∈ [3..180]
    
```

You choose a quadratic model in the variables, to account for bad performance at both end of some interval:

quizzresult
= ?
+ ? * theo + ? * theo^2
+ ? * duration + ? * duration^2

You suspect that those variables are not completely independent, so you add a coupling term to your model:

quizzresult
= ?
+ ? * theo + ? * theo^2
+ ? * duration + ? * duration^2
+ ? * theo * duration

Your model have 6 question marks ? and therefore 6 numerical values to find.

Now, you decide to experiment with different settings every week. You accumulate samples, one per week. Here they are:

|  theo | duration | quizzresult |
|-------+----------+-------------|
| 0.781 |     18.0 |        30.3 |
| 0.615 |     38.3 |         5.2 |
| 0.601 |      5.2 |        52.5 |
| 0.176 |      4.4 |        60.4 |
| 0.428 |     40.9 |        10.5 |
| 0.255 |     12.2 |        66.2 |
| 0.784 |     12.6 |        34.2 |
| 0.468 |      5.4 |        58.1 |
| 0.475 |     37.6 |        19.8 |
| 0.605 |     22.5 |        47.5 |
| 0.319 |     24.8 |        58.7 |
| 0.713 |     30.1 |        19.2 |
| 0.596 |     36.0 |        15.6 |
| 0.659 |     21.8 |        42.5 |
| 0.731 |     19.6 |        36.2 |
| 0.309 |     44.0 |         4.9 |

You want to know what are the optimal theo & duration, with just observations over 16 weeks.

You ask for a regression fit.

You put the cursor on the quizzresult column
You type M-x orgtbl-fit.
You enter your double quadratic model.

|  theo | duration | quizzresult | Best Fit | Fit Diff |
|-------+----------+-------------+----------+----------|
| 0.781 |     18.0 |        30.3 |    31.42 |     1.12 |
| 0.615 |     38.3 |         5.2 |     5.40 |     0.20 |
| 0.601 |      5.2 |        52.5 |    51.29 |    -1.21 |
| 0.176 |      4.4 |        60.4 |    59.52 |    -0.88 |
| 0.428 |     40.9 |        10.5 |    11.33 |     0.83 |
| 0.255 |     12.2 |        66.2 |    67.27 |     1.07 |
| 0.784 |     12.6 |        34.2 |    34.28 |     0.08 |
| 0.468 |      5.4 |        58.1 |    59.60 |     1.50 |
| 0.475 |     37.6 |        19.8 |    20.20 |     0.40 |
| 0.605 |     22.5 |        47.5 |    46.21 |    -1.29 |
| 0.319 |     24.8 |        58.7 |    58.57 |    -0.13 |
| 0.713 |     30.1 |        19.2 |    19.09 |    -0.11 |
| 0.596 |     36.0 |        15.6 |    15.26 |    -0.34 |
| 0.659 |     21.8 |        42.5 |    41.81 |    -0.69 |
| 0.731 |     19.6 |        36.2 |    36.27 |     0.07 |
| 0.309 |     44.0 |         4.9 |     4.27 |    -0.63 |
#+TBLFM: $4=41.310831661 + 86.6227413961*$1 - 134.496123584*$1^2 + 2.04949509715*$2 - 0.067672358144*$2^2 - 0.744700713305*$2*$1;%.2f::$5=$4-$3;%.2f

The fit is quite good. Your optimal model instantiation is therefore (rounded):

quizzresult
= 41.31
+ 86.62*theo - 134.50*theo^2
+ 2.049*duration - 0.068*duration^2
- 0.7447*duration*theo

From there you can find the values of theo and duration which maximize quizzresult. Easy. Just compute partial derivatives, and make them zero.

0 = 2.049-.136*duration-.7447*theo    // derivative along duration
0 = -269.00*theo+86.62-.7447*duration // derivative along theo

Now solve this simple 2 equations, 2 variables system. Calc can do that.

M-x calc to launch Calc
[ to open a vector
’ 2.049-.136*duration-.7447*theo enter the first equation
’ -269.00*theo+86.62-.7447*duration enter the second
] close the vector
a S duration theo to solve the system

Calc answers:

[duration = 13.5077128691, theo = 0.284612662552]

The best is to alternate 13 minutes and 30 seconds runs. The best is also to do 28.46% theoretical lectures (and 71.54% practical sessions).

You can inject those values into the table. Just put the cursor in the last row and the first column. Then type Ret and fill in the cells. Compute this new row typing C-c *:

| …
| 0.309 |     44.0 |         4.9 |     4.27 |    -0.63 |
| .2846 |     13.5 |             |    67.54 |    67.54 |
#+TBLFM: $4=41.310831661 + 86.6227413961*$1 - 134.496123584*$1^2 + 2.04949509715*$2 - 0.067672358144*$2^2 - 0.744700713305*$2*$1;%.2f::$5=$4-$3;%.2f

The best possible result to the quiz is therefore 67.54%.

By the way, you are not the best teacher just by following 2 optimal numbers, of course. Teaching is much more than this simplistic regression fitting example.

Dates

A Date in a table is converted to a number of days since January 1st year 1. Therefore columns of dates can be used as numeric columns.

However, the spreadsheet formula added by orgtbl-fit may need to be adjusted manually. The date() function is handy. It can convert a date to a number of days and back again. As an example, we want to explain the num column by the event column. orgtbl-fit gives:

| event            | num | Best Fit | Fit Diff |
|------------------+-----+----------+----------|
| [2023-01-08 Sun] | 1.1 |     0.04 |    -1.06 |
| [2023-02-08 Wed] | 2.6 |     0.04 |    -2.56 |
| [2023-03-03 Fri] | 3.2 |     0.04 |    -3.16 |
| [2023-05-17 Wed] | 6.0 |     0.04 |    -5.96 |
#+TBLFM: $3=0.0371869477971*$1 - 27462.3666844;%.2f::$4=$3-$2;%.2f

The 0.04 value we see is erroneous. It is actually 0.0371869477971 formatted with ;%.2f.

Let us adjust manually $1 to date($1):

| event            | num |  Best Fit |   Fit Diff |
|------------------+-----+-----------+------------|
| [2023-01-08 Sun] | 1.1 |      1.24 |       0.14 |
| [2023-02-08 Wed] | 2.6 |      2.39 |      -0.21 |
| [2023-03-03 Fri] | 3.2 |      3.24 |       0.04 |
| [2023-05-17 Wed] | 6.0 |      6.03 |       0.03 |
#+TBLFM: $3=0.0371869477971*date($1) - 27462.3666844;%.2f::$4=$3-$2;%.2f

Better.

Installation

The easiest and most standard installation is from Melpa. The process is as follow:

Tell Emacs about the Melpa repository, if you have not already:

M-x customize-variable package-archives

Add the Melpa URL: http://melpa.org/packages/

Install orgtbl-fit:

M-x package-install orgtbl-fit

The package will be auto-loaded when you call orgtbl-fit (otherwise it will stay out of the way).

Alternately, you may download the orgtbl-fit.el file, and load it in your .emacs configuration file:

(load-file "somewhere/in/you/computer/orgtbl-fit.el")

You may want to add an entry in the Table menu, Column sub-menu. You may also want to call orgtbl-fit with C-c f. One way to do so is to use use-package in your .emacs init file:

(use-package orgtbl-fit
  :after (org)
  :bind ("C-c f" . orgtbl-fit)
  :init
  (easy-menu-add-item
   org-tbl-menu '("Column")
   ["Regression fit this column" orgtbl-fit (org-at-table-p)]))

Authors, contributors

Thierry Banel, tbanelwebmin at free dot fr

Look also at packages from the same author:

orgtbl-aggregate
orgtbl-join
uniline

GPL 3 License

orgtbl-fit is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

orgtbl-fit is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.org		README.org
orgtbl-fit.el		orgtbl-fit.el
orgtbl-fit.info		orgtbl-fit.info
unittests.org		unittests.org

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Do regression fitting on Org Mode tables

New

Table of Contents

Example

Usage

Model

Forecasting new data

How it works?

Accuracy of the output

Fruits baskets

The best teacher?

Dates

Installation

Authors, contributors

GPL 3 License

About

Uh oh!

Releases

Packages

Languages

License

tbanel/orgtblfit

Folders and files

Latest commit

History

Repository files navigation

Do regression fitting on Org Mode tables

New

Table of Contents

Example

Usage

Model

Forecasting new data

How it works?

Accuracy of the output

Fruits baskets

The best teacher?

Dates

Installation

Authors, contributors

GPL 3 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages