Regression fitting predicts numerical values of a column based on other columns. The closer the predicted values to the observed values, the better the fit is.
The output is now displayed with the same accuracy as the input (see Accuracy of the output). Output used to display all available digits.
- Example
- Usage
- Model
- Forecasting new data
- How it works?
- Accuracy of the output
- Fruits baskets
- The best teacher?
- Dates
- Installation
- Authors, contributors
- GPL 3 License
We have a table of observed data. Buses with passengers consume fuel (around 30 liters / 100km) given in the 3rd column. We also have the year of the travel, and the number of passengers.
| year | passengers | consumption | |------+------------+-------------| | 2012 | 33 | 33.2 | | 2013 | 31 | 32.8 | | 2016 | 26 | 31.8 | | 2020 | 28 | 30.5 | | 2013 | 47 | 33.1 | | 2012 | 24 | 33.1 | | 2019 | 23 | 30.8 | | 2018 | 41 | 30.9 | | 2020 | 29 | 30.5 | | 2020 | 21 | 30.4 | | 2014 | 27 | 32.4 | | 2014 | 16 | 32.3 | | 2014 | 24 | 32.4 | | 2020 | 31 | 30.5 | | 2012 | 19 | 33.0 |
Can we predict the consumption using year and passengers data? Let us
try with a simple linear-with-constant model.
To do so, let us put the cursor in the consumption column and type M-x orgtbl-fit
Two additional columns appear:
- the predicted values by the linear-with-constant model,
- the differences between observations and predictions.
| year | passengers | consumption | Best Fit | Fit Diff | |------+------------+-------------+----------+----------| | 2012 | 33 | 33.2 | 33.16 | -0.04 | | 2013 | 31 | 32.8 | 32.81 | 0.01 | | 2016 | 26 | 31.8 | 31.76 | -0.04 | | 2020 | 28 | 30.5 | 30.44 | -0.06 | | 2013 | 47 | 33.1 | 32.94 | -0.16 | | 2012 | 24 | 33.1 | 33.09 | -0.01 | | 2019 | 23 | 30.8 | 30.73 | -0.07 | | 2018 | 41 | 30.9 | 31.21 | 0.31 | | 2020 | 29 | 30.5 | 30.45 | -0.05 | | 2020 | 21 | 30.4 | 30.38 | -0.02 | | 2014 | 27 | 32.4 | 32.44 | 0.04 | | 2014 | 16 | 32.3 | 32.35 | 0.05 | | 2014 | 24 | 32.4 | 32.42 | 0.02 | | 2020 | 31 | 30.5 | 30.46 | -0.04 | | 2012 | 19 | 33.0 | 33.05 | 0.05 | #+TBLFM: $4=707.281913694 - 0.335181456447*$1 + 7.98331706947e-3*$2;%.2f::$5=$4-$3;%.2f
We see that the predictions are quite close to the observations.
The formula found by orgtbl-fit is given at the bottom of the table:
$4 = 707.2819 - 0.3352*$1 + 0.00798*$2
Which can be interpreted as:
- In the year 0, buses used to consume
707.2819 liters/100km(of course this value has no real meaning, as the validity of the model is around 2012-2020) - Every year, buses succeed in reducing their consumption by
0.3352 liters/100km. Evidence is the negative sign for this coefficient. - Each additional passenger generates an overhead of
0.00798 liters/100km.
As usual, we should be wary of jumping to conclusions based on data only. Coming back to the field and understanding it is mandatory in data analysis.
- Put the cursor in the column you want to fit (not necessarily the last one).
- Type
M-x orgtbl-fit. - Give a model (the linear-with-constant model is provided as a default).
- Look at the additional two columns at the end of the table.
- Look at the formula for the before-last column: it is the given model with placeholders replaced by actual numerical values.
- Remove the last two columns (
M-S-left) and play again if you are not satisfied by the results. - Add rows to the table with new data to compute predictions.
orgtbl-fit needs a model. It is an equation assumed to compute the
target column. But the parameters are not specified. Instead, question
marks ? denote placeholders.
A default model is offered. It is a linear-with-constant model with all relevant columns. In the example above, it is:
? +?*year +?*passengers
Only relevant columns appear in the default model. A relevant column is an entirely numerical one. Of course, the target column do not appear.
If the table has no header, then columns must be refered to by their $ names. The previous example would be:
? +?*$1 +?*$2
Even though the default model is linear-with-constant, any model can be used, be it polynomial, logarithmic, exponential, whatever.
For example, the following model is acceptable:
? +?*log(year-1900) +?*passengers
Actually it fits quite well the data.
Another model could be:
? +?*year +?*passengers +?*(year-2016)*passengers
It would be a valid hypothesis if we suspect that there is a cross
relationship between year and passengers.
Let us try it:
| year | passengers | consumption | Best Fit | Fit Diff | |------+------------+-------------+----------+----------| | 2012 | 33 | 33.2 | 33.21 | 0.01 | | 2013 | 31 | 32.8 | 32.83 | 0.03 | | 2016 | 26 | 31.8 | 31.77 | -0.03 | | 2020 | 28 | 30.5 | 30.43 | -0.07 | | 2013 | 47 | 33.1 | 33.05 | -0.05 | | 2012 | 24 | 33.1 | 33.05 | -0.05 | | 2019 | 23 | 30.8 | 30.81 | 0.01 | | 2018 | 41 | 30.9 | 31.05 | 0.15 | | 2020 | 29 | 30.5 | 30.42 | -0.08 | | 2020 | 21 | 30.4 | 30.52 | 0.12 | | 2014 | 27 | 32.4 | 32.44 | 0.04 | | 2014 | 16 | 32.3 | 32.33 | 0.03 | | 2014 | 24 | 32.4 | 32.41 | 0.01 | | 2020 | 31 | 30.5 | 30.40 | -0.10 | | 2012 | 19 | 33.0 | 32.97 | -0.03 | #+TBLFM: $4=501.454507662 - 0.233013277919*$1 + 2.837561942e-3*$2 - 3.68345292814e-3*$2*($1 - 2016);%.2f::$5=$4-$3;%.2f
The fit is slightly better. After all there might be a cross
relationship between years and passengers. The coefficient is
negative. The interpretation could go something like that: “newer
buses are designed in a more efficient way, such that additional
passengers do not increase consumption as it used to”.
Or maybe we are over-interpreting data. This new quadratic
coefficient, -0.00368, is quite small. And consumption data is quite
coarse: just 3 digits accuracy.
Returning to the field and understanding it is mandatory.
Note that in the model we have chosen:
? +?*year +?*passengers +?*(year-2016)*passengers
there is this arbitrary 2016 constant. But if we change it to 2000 or
whatever, the results are exactly the same. This is because changes in
this constant will result in changes in the third question mark ?,
that make the overall equations identical.
Note also that in this new model, we added one question mark ?. We
should be careful not to over-fit. If our model has as many question
marks ? as rows in the table, the fit will be perfect (zero
discrepancies between observations and predictions). But our model
will no longer summarize the data. Use as few coefficients (question
marks ?) as possible to fit data, but not less.
We came up with a quite good predictive model for data observations. But could it be applied to new data? Answer is Yes! And it is easy: just add new rows to the table.
Example. Coming back to the buses example, we add 3 rows to the table.
| year | passengers | consumption | Best Fit | Fit Diff | |------+------------+-------------+----------+----------| | 2012 | 33 | 33.2 | 33.16 | -0.04 | | 2013 | 31 | 32.8 | 32.81 | 0.01 | | … | 2020 | 31 | 30.5 | 30.46 | -0.04 | | 2012 | 19 | 33.0 | 33.05 | 0.05 | |------+------------+-------------+----------+----------| | 2017 | 45 | | 31.58 | 31.58 | | 2022 | 40 | | 29.86 | 29.86 | | 2027 | 35 | | 28.15 | 28.15 | #+TBLFM: $4=707.281913694 - 0.335181456447*$1 + 7.98331706947e-3*$2;%.2f::$5=$4-$3;%.2f
We ask Org to recompute the table by typing C-u C-c *. And we get our
3 predictions. Only the Best Fit column is worth looking at.
Note that two of the rows are for the 2022 and 2027 years, outside the
2012-2020 range of observations. There is no problem as far as the
model is concerned. However we should be careful, as reality may
invalidate our predictions with unforeseen events.
“Prediction is a difficult art, especially when it concerns the future” – Grucho Max? Mark Twain? Pierre Dac? Niels Bohr?
A classical regression fitting is performed. The process attempts to change the placeholders (question marks?) in the model equation by
actual numerical values. Those numerical values are chosen so that
they minimize the sum of the squares of errors (the so called ”least
squares method”). They also ensure that the sum of errors is zero.
orgtbl-fit is an interface between Org Tables and Calc. Calc is
the 30 years old symbolic calculator of Emacs. More specifically, the
calcFunc-fit Lisp function is called as the back-end. You can call it
directly in Calc.
Of course, regression with the least squares method is implemented in
many systems: R and GnuPlot to name just two. Use orgtbl-fit if you
want a pure Emacs system without external dependencies, and
integration with Org Tables.
Let us fit the observ column in this example:
| d | q | observ | Best Fit | Fit Diff | |---+----+--------+----------+----------| | 1 | 12 | 98.55 | 99.163 | 0.613 | | 2 | 51 | 148.86 | 147.778 | -1.082 | | 3 | 67 | 172.91 | 173.664 | 0.754 | | 4 | 8 | 126.15 | 125.434 | -0.716 | | 5 | 7 | 134.09 | 134.520 | 0.430 | #+TBLFM: $4=77.2301611968 + 10.0745377652*$1 + 0.988214672678*$2;%.3f::$5=$4-$3;%.3f
The output in the Best Fit and Fit Diff columns display 3 digits after
dot, as in 99.163. Actually, more digits are computed, but only 3 are
displayed. It is on purpose. The results are formatted with one more
digit than the observ input column. The input column contains values
like 98.55 with 2 digits after dot.
This “1 more digits in output than in input” policy is a good compromise:
- non significant digits are not displayed to avoid distraction,
- there cannot be more accuracy in the output that there is in the input,
- one more digit is useful for not having the feeling that something is hidden.
This default formatting is easy to change with the Org Mode tools:
- with cursor in the
Best Fitcolumn, typeC-c =, - this brings an editor for the column formula,
- at the end, change
;%.3finto another valid Org Mode format, like;f3or;gor whatever, or even remove it, - Type
C-u C-c *to refresh the table.
For very large or very small values like 2.36e20 or 5.89e-20, there is
no suitable formatter to display digits up to a specified common
precision for the whole column. Instead, a ;%#.5g formatter is
used. Here 5 means that values will be displayed with 5 significant
digits. This is in case the target column appears to have only 4
significant digits.
Here is the previous example with large values:
| d | q | observ | Best Fit | Fit Diff | |---+----+-----------+-------------+--------------| | 1 | 12 | .9855e20 | 9.91633e+19 | 6.13300e+17 | | 2 | 51 | 1.4886e20 | 1.47778e+20 | -1.08200e+18 | | 3 | 67 | 1.7291e20 | 1.73664e+20 | 7.54000e+17 | | 4 | 8 | 1.2615e20 | 1.25434e+20 | -7.16000e+17 | | 5 | 7 | 1.3409e20 | 1.34520e+20 | 4.30000e+17 | #+TBLFM: $4=7.72301611968e19 + 1.00745377652e19*$1 + 9.88214672678e17*$2;%#.6g::$5=$4-$3;%#.6g
We see that in the Best Fit column, all values are similar to the
values in the target column, with one more digit. Except the first
one, 9.91633e+19, which have 2 additional digits compared to
.9855e20. This is unavoidable with the formatters offered by Org
Mode. Thought the result looks good.
Let us store data in an Org Mode table, one row per observation. Each row counts the number of fruits in a basket, plus the overall weight including the basket itself (in grams).
| apples | bananas | blueberries | total weight | |--------+---------+-------------+--------------| | 6 | 5 | 50 | 2816 | | 1 | 5 | 232 | 2094 | | 1 | 0 | 143 | 952 | | 7 | 0 | 57 | 2087 | | 0 | 2 | 263 | 1363 | | 4 | 2 | 246 | 2193 | | 7 | 0 | 184 | 2341 | | 5 | 5 | 149 | 2800 | | 1 | 3 | 272 | 1784 | | 3 | 0 | 32 | 1170 | | 0 | 5 | 55 | 1525 | | 7 | 5 | 276 | 3486 | | 3 | 2 | 227 | 1943 | | 7 | 5 | 95 | 3131 | | 1 | 4 | 59 | 1559 |
Let us put the cursor on the total weight column, and type
M-x orgtbl-fit. The default linear-plus-constant model is
the one to use. Result is:
| apples | bananas | blueberries | total weight | Best Fit | Fit Diff | |--------+---------+-------------+--------------+----------+----------| | 6 | 5 | 50 | 2816 | 2820.4 | 4.4 | | 1 | 5 | 232 | 2094 | 2093.8 | -0.2 | | 1 | 0 | 143 | 952 | 953.7 | 1.7 | | 7 | 0 | 57 | 2087 | 2088.1 | 1.1 | | 0 | 2 | 263 | 1363 | 1359.6 | -3.4 | | 4 | 2 | 246 | 2193 | 2195.9 | 2.9 | | 7 | 0 | 184 | 2341 | 2340.0 | -1.0 | | 5 | 5 | 149 | 2800 | 2799.2 | -0.8 | | 1 | 3 | 272 | 1784 | 1787.7 | 3.7 | | 3 | 0 | 32 | 1170 | 1168.6 | -1.4 | | 0 | 5 | 55 | 1525 | 1525.3 | 0.3 | | 7 | 5 | 276 | 3486 | 3486.0 | 0.0 | | 3 | 2 | 227 | 1943 | 1940.7 | -2.3 | | 7 | 5 | 95 | 3131 | 3127.1 | -3.9 | | 1 | 4 | 59 | 1559 | 1558.0 | -1.0 | #+TBLFM: $5=452.653490802 + 217.494062646*$1 + 192.718341278*$2 + 1.9829588217*$3;%.1f::$6=$5-$4;%.1f
Note that the two columns computed by orgtbl-fit are formatted with
one decimal place more than the target column.
We see that the predicted values are quite close to the actual ones. Evidences are the very tiny differences shown in the last column. Good!
The formula is easy to interpret:
total weight = 452.653490802 // basket + 217.494062646 * $1 // apples + 192.718341278 * $2 // bananas + 1.9829588217 * $3 // blueberries
- The empty basket weights 452.7 g.
- An average apple weights 217.5 g.
- An average banana weights 192.7 g.
- A single blueberry weights 1.98 g.
At the end of the course, you require your students to complete a quiz. Based on the results, you know that an average student retains anything between 5% to 65% of the course. Yes, unfortunately, it never reaches 100%.
You want to optimize your course. You believe that there are 2 main factors:
theo: the proportion between theoretical and practical content. 100% theory is as bad as 100% practice. The best is somewhere in between, but you don’t know where.theo ∈ [0..1]duration: theoretical lectures alternate with practical exercises, at this rate. Periods of 3 minutes are too short, and periods of 3 hours are too long. The best duration is in between.duration ∈ [3..180]
You choose a quadratic model in the variables, to account for bad performance at both end of some interval:
quizzresult = ? + ? * theo + ? * theo^2 + ? * duration + ? * duration^2
You suspect that those variables are not completely independent, so you add a coupling term to your model:
quizzresult = ? + ? * theo + ? * theo^2 + ? * duration + ? * duration^2 + ? * theo * duration
Your model have 6 question marks ? and therefore 6 numerical values to
find.
Now, you decide to experiment with different settings every week. You accumulate samples, one per week. Here they are:
| theo | duration | quizzresult | |-------+----------+-------------| | 0.781 | 18.0 | 30.3 | | 0.615 | 38.3 | 5.2 | | 0.601 | 5.2 | 52.5 | | 0.176 | 4.4 | 60.4 | | 0.428 | 40.9 | 10.5 | | 0.255 | 12.2 | 66.2 | | 0.784 | 12.6 | 34.2 | | 0.468 | 5.4 | 58.1 | | 0.475 | 37.6 | 19.8 | | 0.605 | 22.5 | 47.5 | | 0.319 | 24.8 | 58.7 | | 0.713 | 30.1 | 19.2 | | 0.596 | 36.0 | 15.6 | | 0.659 | 21.8 | 42.5 | | 0.731 | 19.6 | 36.2 | | 0.309 | 44.0 | 4.9 |
You want to know what are the optimal theo & duration, with just
observations over 16 weeks.
You ask for a regression fit.
- You put the cursor on the
quizzresultcolumn - You type
M-x orgtbl-fit. - You enter your double quadratic model.
| theo | duration | quizzresult | Best Fit | Fit Diff | |-------+----------+-------------+----------+----------| | 0.781 | 18.0 | 30.3 | 31.42 | 1.12 | | 0.615 | 38.3 | 5.2 | 5.40 | 0.20 | | 0.601 | 5.2 | 52.5 | 51.29 | -1.21 | | 0.176 | 4.4 | 60.4 | 59.52 | -0.88 | | 0.428 | 40.9 | 10.5 | 11.33 | 0.83 | | 0.255 | 12.2 | 66.2 | 67.27 | 1.07 | | 0.784 | 12.6 | 34.2 | 34.28 | 0.08 | | 0.468 | 5.4 | 58.1 | 59.60 | 1.50 | | 0.475 | 37.6 | 19.8 | 20.20 | 0.40 | | 0.605 | 22.5 | 47.5 | 46.21 | -1.29 | | 0.319 | 24.8 | 58.7 | 58.57 | -0.13 | | 0.713 | 30.1 | 19.2 | 19.09 | -0.11 | | 0.596 | 36.0 | 15.6 | 15.26 | -0.34 | | 0.659 | 21.8 | 42.5 | 41.81 | -0.69 | | 0.731 | 19.6 | 36.2 | 36.27 | 0.07 | | 0.309 | 44.0 | 4.9 | 4.27 | -0.63 | #+TBLFM: $4=41.310831661 + 86.6227413961*$1 - 134.496123584*$1^2 + 2.04949509715*$2 - 0.067672358144*$2^2 - 0.744700713305*$2*$1;%.2f::$5=$4-$3;%.2f
The fit is quite good. Your optimal model instantiation is therefore (rounded):
quizzresult = 41.31 + 86.62*theo - 134.50*theo^2 + 2.049*duration - 0.068*duration^2 - 0.7447*duration*theo
From there you can find the values of theo and duration which maximize
quizzresult. Easy. Just compute partial derivatives, and make them
zero.
0 = 2.049-.136*duration-.7447*theo // derivative along duration 0 = -269.00*theo+86.62-.7447*duration // derivative along theo
Now solve this simple 2 equations, 2 variables system. Calc can do that.
M-x calcto launch Calc[to open a vector- ’
2.049-.136*duration-.7447*theoenter the first equation - ’
-269.00*theo+86.62-.7447*durationenter the second ]close the vectora S duration theoto solve the system
Calc answers:
[duration = 13.5077128691, theo = 0.284612662552]
The best is to alternate 13 minutes and 30 seconds runs. The best is also to do 28.46% theoretical lectures (and 71.54% practical sessions).
You can inject those values into the table. Just put the cursor in
the last row and the first column. Then type Ret and fill in the
cells. Compute this new row typing C-c *:
| … | 0.309 | 44.0 | 4.9 | 4.27 | -0.63 | | .2846 | 13.5 | | 67.54 | 67.54 | #+TBLFM: $4=41.310831661 + 86.6227413961*$1 - 134.496123584*$1^2 + 2.04949509715*$2 - 0.067672358144*$2^2 - 0.744700713305*$2*$1;%.2f::$5=$4-$3;%.2f
The best possible result to the quiz is therefore 67.54%.
By the way, you are not the best teacher just by following 2 optimal numbers, of course. Teaching is much more than this simplistic regression fitting example.
A Date in a table is converted to a number of days since January 1st year 1. Therefore columns of dates can be used as numeric columns.However, the spreadsheet formula added by orgtbl-fit may need to be
adjusted manually. The date() function is handy. It can convert a date
to a number of days and back again. As an example, we want to explain
the num column by the event column. orgtbl-fit gives:
| event | num | Best Fit | Fit Diff | |------------------+-----+----------+----------| | [2023-01-08 Sun] | 1.1 | 0.04 | -1.06 | | [2023-02-08 Wed] | 2.6 | 0.04 | -2.56 | | [2023-03-03 Fri] | 3.2 | 0.04 | -3.16 | | [2023-05-17 Wed] | 6.0 | 0.04 | -5.96 | #+TBLFM: $3=0.0371869477971*$1 - 27462.3666844;%.2f::$4=$3-$2;%.2f
The 0.04 value we see is erroneous. It is actually 0.0371869477971
formatted with ;%.2f.
Let us adjust manually $1 to date($1):
| event | num | Best Fit | Fit Diff | |------------------+-----+-----------+------------| | [2023-01-08 Sun] | 1.1 | 1.24 | 0.14 | | [2023-02-08 Wed] | 2.6 | 2.39 | -0.21 | | [2023-03-03 Fri] | 3.2 | 3.24 | 0.04 | | [2023-05-17 Wed] | 6.0 | 6.03 | 0.03 | #+TBLFM: $3=0.0371869477971*date($1) - 27462.3666844;%.2f::$4=$3-$2;%.2f
Better.
The easiest and most standard installation is from Melpa. The process is as follow:Tell Emacs about the Melpa repository, if you have not already:
M-x customize-variable package-archives
Add the Melpa URL: http://melpa.org/packages/
Install orgtbl-fit:
M-x package-install orgtbl-fit
The package will be auto-loaded when you call orgtbl-fit (otherwise it
will stay out of the way).
Alternately, you may download the orgtbl-fit.el file, and load it in
your .emacs configuration file:
(load-file "somewhere/in/you/computer/orgtbl-fit.el")
You may want to add an entry in the Table menu, Column sub-menu. You
may also want to call orgtbl-fit with C-c f. One way to do so is to
use use-package in your .emacs init file:
(use-package orgtbl-fit
:after (org)
:bind ("C-c f" . orgtbl-fit)
:init
(easy-menu-add-item
org-tbl-menu '("Column")
["Regression fit this column" orgtbl-fit (org-at-table-p)]))Author
- Thierry Banel, tbanelwebmin at free dot fr
Look also at packages from the same author:
orgtbl-aggregateorgtbl-joinuniline
orgtbl-fit is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
orgtbl-fit is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.