Background

The true relationship between a response (Y variable) and one or several X variables can be complicated, simple, or even non-existent. Using data and suitable software we can try to find the relationship. (If the relationship were known, we would not need data…)
If we try to apply a complicated non-linear model to the data, we need to handle complicated mathematical statistics and perhaps make things more difficult than necessary. Fortunately, there are mathematical tools that let us turn a complicated function into a series of less complicated terms.

There are some arguments that justify doing this. One is that we usually need a model over only a limited region of its working conditions, e.g. a certain span of temperatures, pressures, etc. Another is that we usually do not have perfect data, but data that include some randomness.

We use sin(x) as an example. Below, sin(x) is expressed as the beginning of an infinite series of terms (cos(x) looks similar). The value of sin(x) is easily found using an ordinary calculator or any programming language, but it can also be calculated by hand using the first terms.

Statistical theory would call sin(x) written this way ‘… a linear combination of variables’:

\[ \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \dots \]
\[ \cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \dots \]
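
As a quick check, the first three terms already approximate sin(x) well over the interval used below. A minimal sketch (the helper name approx3 is introduced here only for illustration):

approx3 <- function(x) x - x^3/factorial(3) + x^5/factorial(5)   # First three terms of the series.
approx3(pi/4)                                          # Gives about 0.70714 ...
sin(pi/4)                                              # ... versus the exact value, about 0.70711.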


Analysis

Below we apply two different models: one with seven variables (X to X7) and one with three variables (X, X3, and X5). The code consists of the following steps:

X <- seq(0, pi/2, 0.1)         # 16 x-values from 0 to pi/2 in steps of 0.1.
Y <- sin(X)                    # The response: sin(x) at these points.

X2 <- X*X                      # Powers of X used as predictor variables.
X3 <- X2*X
X4 <- X3*X
X5 <- X4*X
X6 <- X5*X
X7 <- X6*X
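
(A side note: the helper variables are not strictly necessary. The same powers can be written directly in the model formula with I(); the line below is an equivalent alternative to model 2 further down, not the form used in this text. The name modelAlt is introduced here only for illustration.)

modelAlt <- lm(Y ~ X + I(X^3) + I(X^5))                # Equivalent fit without helper variables.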


Model 1. ‘lm’ is the R function for fitting a linear model. The code is followed by the printout, the summary.

model1 <- lm(Y ~ X + X2 + X3 + X4 + X5 + X6 + X7)      # Analysis: 7 parameters. 
summary(model1)                                        # Summary of model 1.    
## 
## Call:
## lm(formula = Y ~ X + X2 + X3 + X4 + X5 + X6 + X7)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.312e-08 -8.140e-09  1.198e-09  8.039e-09  1.198e-08 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -3.283e-09  1.224e-08 -2.680e-01   0.7954    
## X            1.000e+00  3.798e-07  2.633e+06  < 2e-16 ***
## X2          -1.471e-05  3.462e-06 -4.250e+00   0.0028 ** 
## X3          -1.666e-01  1.282e-05 -1.300e+04  < 2e-16 ***
## X4          -1.904e-04  2.342e-05 -8.130e+00 3.89e-05 ***
## X5           8.587e-03  2.242e-05  3.831e+02  < 2e-16 ***
## X6          -1.789e-04  1.076e-05 -1.662e+01 1.74e-07 ***
## X7          -1.427e-04  2.046e-06 -6.977e+01 1.98e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.232e-08 on 8 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.565e+15 on 7 and 8 DF,  p-value: < 2.2e-16
coeff1      <- round(model1$coefficients, 4)           # Getting the coefficients, rounded to 4 decimal places.
pred1       <- round(predict(model1), 5)               # Getting 16 predicted values, rounded to 5 decimals.
deviations1 <- pred1-Y                                 # Calculating the deviations.
coeff1                                                 # Printing coefficients model 1.
## (Intercept)           X          X2          X3          X4          X5          X6          X7 
##      0.0000      1.0000      0.0000     -0.1666     -0.0002      0.0086     -0.0002     -0.0001

The coefficients from model 1 above show that only the variables X, X3 and X5 seem to be of any importance. (The estimated coefficient for e.g. X3 is the series value -1/3! ≈ -0.1667.) Note that the estimated intercept is essentially zero in both models, consistent with the series for sin(x) having no constant term. (The series for cos(x), in contrast, does have a constant term, 1.)
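
For reference, the coefficients prescribed by the series itself can be computed directly (a small sketch; theory1 is a name introduced here). The estimates above are close but not identical, since the fitted polynomial also compensates for the truncated tail of the series:

theory1 <- c(0, 1, 0, -1/factorial(3), 0, 1/factorial(5), 0, -1/factorial(7))   # Intercept, X, X2, ..., X7.
round(theory1, 4)                                      # Compare with coeff1 above.
## [1]  0.0000  1.0000  0.0000 -0.1667  0.0000  0.0083  0.0000 -0.0002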


Model 2. R code for fitting model 2, using only the variables X, X3, and X5.

model2 <- lm(Y ~ X + X3 + X5)                          # Analysis: 3 parameters. 
summary(model2)                                        # Summary of model 2.    
## 
## Call:
## lm(formula = Y ~ X + X3 + X5)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.892e-05 -3.129e-05  7.220e-07  2.865e-05  4.883e-05 
## 
## Coefficients:
##               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)  4.041e-05  2.575e-05     1.569    0.143    
## X            9.997e-01  8.245e-05 12124.470   <2e-16 ***
## X3          -1.658e-01  9.552e-05 -1735.848   <2e-16 ***
## X5           7.579e-03  3.231e-05   234.568   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.684e-05 on 12 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.085e+08 on 3 and 12 DF,  p-value: < 2.2e-16
coeff2      <- round(model2$coefficients, 4)           # Getting the coefficients, rounded to 4 decimal places.
pred2       <- round(predict(model2), 5)               # Getting 16 predicted values, rounded to 5 decimals.
deviations2 <- pred2-Y                                 # Calculating the deviations.
coeff2                                                 # Printing coefficients model 2.
## (Intercept)           X          X3          X5 
##      0.0000      0.9997     -0.1658      0.0076

The coefficients from model 2 above correspond to the series values 1, -1/6 (= -1/3!), and 1/120 (= 1/5!) in the expression for sin(x) above.
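
The same comparison can be made directly in R (a small sketch). The small discrepancies (e.g. 0.9997 versus 1) arise because the three-term fit also absorbs the omitted higher-order terms of the series:

round(c(1, -1/factorial(3), 1/factorial(5)), 4)        # Series values for X, X3, and X5.
## [1]  1.0000 -0.1667  0.0083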

Deviations

The graph shows the deviations from model 1 and model 2. The deviations are calculated as the difference between the values predicted by the two models and the true values of sin(x).
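
The size of the deviations can also be inspected numerically before plotting (a minimal sketch using the objects created above):

summary(deviations1)                                   # Spread of deviations, 7-variable model.
summary(deviations2)                                   # Spread of deviations, 3-variable model.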

library(ggplot2)                                                       # Loads the graphical package

allDev <- data.frame(deviations1, deviations2, x = c(1:16))            # All values in one data frame.

avvDiagram <- ggplot(allDev, aes(x = x)) + 
  geom_line(aes(y = deviations1), color = "red")  + geom_point(aes(y = deviations1), color = "red") + 
  geom_line(aes(y = deviations2), color = "blue") + geom_point(aes(y = deviations2), color = "blue") +
  xlab("X (arguments in sin(x))") + ylab("Difference estimate and true value of sin(x)") + 
  annotate("text", x=4, y=Inf, label="Blue: 3 parameters,   Red: 7 parameters", vjust=2, size=4, hjust=0.1) + 
  annotate("text", x=4, y=Inf, label="Difference between the estimates and true values of sin(x)", vjust=5, size=3.5, hjust=0.1)
avvDiagram


Comments. The graph shows that the ‘7 parameter’ model deviates less from the true values than the ‘3 parameter’ model. This is a common experience in regression analysis: the more variables that are entered into a model, relevant or irrelevant, the smaller the residual variance becomes, and thus the smaller the deviations from the data used for fitting. However, unnecessary variables should still be removed from the model. (Many predictors in a model will increase the variance of predictions for new observations.)
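
A standard way to judge whether the extra terms in model 1 earn their place is an F-test between the two nested models (a sketch; with an almost noise-free Y, as here, even negligible terms can come out as ‘significant’, so the test should be read together with the size of the coefficients):

anova(model2, model1)                                  # F-test: 3-variable model against 7-variable model.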

Final remarks

This text does not aim to explain the details of regression analysis, nor to treat the rather complicated mathematics that leads to the linearization of complicated functions. The well-known sin(x) function was used as a simple example.
(Another reason for the text: in an investigation of an audio feature in the manufacturing of mobile telephones, a linear regression analysis produced a small, well-behaved model. An engineer thought he delivered a devastating comment by saying ‘… audio signals are definitely not linear …’, but without the knowledge expressed in this text. As an analyst one must have enough knowledge to explain and defend the results.)

More R-codes, graphs, etc:    https://ovn.ing-stat.se/Rgraphs/Rgraphs2.php