The true relationship between a response (Y-variable) and one
or several X-variables can be complicated, simple, or even
non-existent. Using data and suitable software we can try to find
the relationship. (If the relationship were known, we would not need
data…)
If we try to apply a complicated non-linear model to the data,
we need to handle complicated mathematical statistics and perhaps make
the task more difficult than necessary. Fortunately there are mathematical
tools that let us turn a complicated function into a series of less
complicated terms.
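One such tool is the Taylor series: around x = 0, a smooth function can be written as
\[ f(x) = f(0) + f'(0)\,x + \frac{f''(0)}{2!}x^2 + \frac{f'''(0)}{3!}x^3 + \dots \]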
There are some arguments that justify this approach. One is that we
usually need a model over only a limited part of its
working conditions, e.g. a certain span of temperatures, pressures, etc.
Another argument is that we usually do not have perfect data, but data
that include some randomness.
We use sin(x) as an
example. Below, sin(x) is expressed as the beginning of an
infinite series of terms (cos(x) looks similar). The value of
sin(x) can easily be found using an ordinary calculator or any
programming language, but it can also be calculated by hand using the
first terms.
In statistical terms, the truncated series is ‘a
linear combination of variables’:
\[ \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \dots \]
\[ \cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \dots \]
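As a quick check of the hand calculation, the first three terms of the sin(x) series already come close on the interval used below (the argument 1.2 is an arbitrary choice for illustration):
x <- 1.2                                  # An arbitrary argument in [0, pi/2].
x - x^3/factorial(3) + x^5/factorial(5)   # First three terms: 0.932736.
sin(x)                                    # Exact value: 0.932039.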
Below we apply two different models, one with seven variables
(X, X2, …, X7) and one with three variables
(X, X3, and X5). The code consists of the following steps:
X <- seq(0, pi/2, 0.1)  # 16 arguments from 0 to pi/2 in steps of 0.1.
Y <- sin(X)             # The response: the true values of sin(x).
X2 <- X*X               # Successive powers of X, used as predictors.
X3 <- X2*X
X4 <- X3*X
X5 <- X4*X
X6 <- X5*X
X7 <- X6*X
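(As an aside, the same powers can be generated in one step with R’s poly() function using raw, i.e. non-orthogonal, terms; the object names XP and modelP are just examples:)
XP <- poly(X, degree = 7, raw = TRUE)  # Columns are X, X^2, ..., X^7.
modelP <- lm(Y ~ XP)                   # Fits the same model as model 1 below.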
Model 1. ‘lm’ is the R function for fitting a linear
model. The code is followed by the printout, the summary.
model1 <- lm(Y ~ X + X2 + X3 + X4 + X5 + X6 + X7) # Analysis: 7 parameters.
summary(model1) # Summary of model 1.
##
## Call:
## lm(formula = Y ~ X + X2 + X3 + X4 + X5 + X6 + X7)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.312e-08 -8.140e-09 1.198e-09 8.039e-09 1.198e-08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.283e-09 1.224e-08 -2.680e-01 0.7954
## X 1.000e+00 3.798e-07 2.633e+06 < 2e-16 ***
## X2 -1.471e-05 3.462e-06 -4.250e+00 0.0028 **
## X3 -1.666e-01 1.282e-05 -1.300e+04 < 2e-16 ***
## X4 -1.904e-04 2.342e-05 -8.130e+00 3.89e-05 ***
## X5 8.587e-03 2.242e-05 3.831e+02 < 2e-16 ***
## X6 -1.789e-04 1.076e-05 -1.662e+01 1.74e-07 ***
## X7 -1.427e-04 2.046e-06 -6.977e+01 1.98e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.232e-08 on 8 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.565e+15 on 7 and 8 DF, p-value: < 2.2e-16
coeff1 <- round(model1$coefficients, 4) # Getting the coefficients, rounded to 4 decimal places.
pred1 <- round(predict(model1), 5) # Getting the 16 predicted values, rounded to 5 decimal places.
deviations1 <- pred1-Y # Calculating the deviations.
coeff1 # Printing coefficients model 1.
## (Intercept) X X2 X3 X4 X5 X6 X7
## 0.0000 1.0000 0.0000 -0.1666 -0.0002 0.0086 -0.0002 -0.0001
The coefficients from model 1 above show that only the variables X,
X3, and X5 seem to be of any importance. (The coefficient of X3, for
example, is close to the series value -1/3! ≈ -0.1667.) Note that the
series for sin(x) has no constant term, and accordingly the estimated
intercept is essentially zero. (The series for cos(x), by contrast,
does have a constant term, 1.)
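For comparison, the theoretical series coefficients can be computed directly; they lie close to the estimates in the printout above:
round(c(X3 = -1/factorial(3), X5 = 1/factorial(5), X7 = -1/factorial(7)), 4) # -0.1667  0.0083 -0.0002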
Model 2. R code for the analysis of model 2, using only
the variables X, X3, and X5.
model2 <- lm(Y ~ X + X3 + X5) # Analysis: 3 parameters.
summary(model2) # Summary of model 2.
##
## Call:
## lm(formula = Y ~ X + X3 + X5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.892e-05 -3.129e-05 7.220e-07 2.865e-05 4.883e-05
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.041e-05 2.575e-05 1.569 0.143
## X 9.997e-01 8.245e-05 12124.470 <2e-16 ***
## X3 -1.658e-01 9.552e-05 -1735.848 <2e-16 ***
## X5 7.579e-03 3.231e-05 234.568 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.684e-05 on 12 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.085e+08 on 3 and 12 DF, p-value: < 2.2e-16
coeff2 <- round(model2$coefficients, 4) # Getting the coefficients, rounded to 4 decimal places.
pred2 <- round(predict(model2), 5) # Getting the 16 predicted values, rounded to 5 decimal places.
deviations2 <- pred2-Y # Calculating the deviations.
coeff2 # Printing coefficients model 2.
## (Intercept) X X3 X5
## 0.0000 0.9997 -0.1658 0.0076
The coefficients from model 2 above correspond to 1, -1/6, and 1/120,
the coefficients in the series expression for sin(x) above.
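Note that the fitted values (0.9997, -0.1658, 0.0076) differ slightly from the theoretical ones. Least squares tunes the coefficients to the whole interval, which makes the fit better there than the raw truncated series; a quick check at the right endpoint:
x <- pi/2                                          # The largest argument.
x - x^3/factorial(3) + x^5/factorial(5) - sin(x)   # Truncation error approx. 0.0045, versus approx. 5e-05 for model 2.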
The graph shows the deviations from model 1 and model 2. The
deviations are calculated as the difference between the values predicted
by the two models and the true values of sin(x).
library(ggplot2) # Loads the graphical package.
allDev <- data.frame(deviations1, deviations2, x = 1:16) # All values in one data frame.
avvDiagram <- ggplot(allDev, aes(x = x)) +
  geom_line(aes(y = deviations1), color = "red") + geom_point(aes(y = deviations1), color = "red") +
  geom_line(aes(y = deviations2), color = "blue") + geom_point(aes(y = deviations2), color = "blue") +
  xlab("Observation number (arguments 0 to pi/2 in sin(x))") + ylab("Difference between estimate and true value of sin(x)") +
  annotate("text", x=4, y=Inf, label="Blue: 3 parameters, Red: 7 parameters", vjust=2, size=4, hjust=0.1) +
  annotate("text", x=4, y=Inf, label="Difference between the estimates and true values of sin(x)", vjust=5, size=3.5, hjust=0.1)
avvDiagram
Comments. The graph shows that the ‘7 parameter’ model
has smaller deviations than the ‘3 parameter’ model. This is the
common experience from regression analysis: the more variables, relevant
or irrelevant, that are entered into a model, the smaller the
residual variance becomes, and thus the smaller the deviations. Nevertheless,
unnecessary variables should be removed from the model, because superfluous
predictors increase the variance when the model is used for prediction on new data.
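This can be illustrated with a small simulation: refit both models after adding some noise to the response and compare predictions at new arguments between the original ones. (The noise level, the seed, and the new points are arbitrary choices for illustration.)
set.seed(1)                                      # Arbitrary seed, for reproducibility.
Yn <- sin(X) + rnorm(16, sd = 0.01)              # Response with a little noise.
m7 <- lm(Yn ~ X + X2 + X3 + X4 + X5 + X6 + X7)   # Refitting the 7-parameter model.
m3 <- lm(Yn ~ X + X3 + X5)                       # Refitting the 3-parameter model.
xn <- seq(0.05, 1.55, 0.1)                       # New arguments between the old ones.
nd <- data.frame(X = xn, X2 = xn^2, X3 = xn^3, X4 = xn^4, X5 = xn^5, X6 = xn^6, X7 = xn^7)
sd(predict(m7, nd) - sin(xn))                    # Prediction error with 7 parameters ...
sd(predict(m3, nd) - sin(xn))                    # ... is typically larger than with 3.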
This text is not intended to explain the details of regression analysis,
nor to treat the rather complicated mathematics behind the
linearization of complicated functions. The well-known sin(x) function
was used merely as a simple example.
(Another reason for the text:
in an investigation of an audio feature in the manufacturing of
mobile telephones, linear regression analysis found a small,
well-behaved model. An engineer thought he delivered a
devastating comment by saying ‘…audio signals are definitely not linear…’,
but without having the knowledge expressed in this text. As an analyst
one must have enough knowledge to explain and defend the
results.)
More R code,
graphs, etc.: https://ovn.ing-stat.se/Rgraphs/Rgraphs2.php