[Histogram panel. Summary fields: Average, Std dev, Min, Max, n. Toggles: Show/hide histogram, Show/hide distribution, Show/hide probability plots. Parameters: μ, σ. Info, references, etc.]

If any endpoint of the X-axis is more than 3 sigmas from the mean, the histogram is not shown.

(Suggestion: Increase the range of the X-axis by clicking and changing the min- and/or max-values.)

18. Mixture of normal distributions – an animation

General.   When studying data from a practical situation, it is natural to suspect that the data consist of two or more normal distributions (%mix*), especially if a single normal distribution was expected.
Perhaps the data come from items produced by a number of sources, machines, spindles, etc. Even if the items are produced from the same drawing, there is variation. The process might also be 'drifting' and thus produce data with a constantly changing mean or variation.
There are a number of practical statistical tools, such as the histogram, probability plot (%Hist*), CUSUM plot (%Diagn*, %SimDiagn*), etc., for seeing graphically whether the data are non-normal (non-symmetrical). Note that the data may come from a non-symmetrical distribution (typically time measurements), in which case the idea of normal distributions is wrong.

Formal analysis.   A formal analysis requires extra data columns showing which machine, spindle, etc. was used. A common approach is then to perform a so-called t-test (%t-test*).
(It is actually possible, at least in theory, to use the measurements only. By a complicated mathematical method the five parameters, equivalent to the five sliders to the left ([Change parameters]), can be estimated. However, this demands a large number of data values and will most likely produce results with a large uncertainty.)
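As an illustration of estimating the five parameters from the measurements only, the sketch below uses the expectation-maximization (EM) algorithm. The text does not name the method, so EM is an assumption here, and the function name `fit_two_normals`, the starting values, and the simulated data are hypothetical choices made for the example.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_normals(data, n_iter=100):
    """EM estimates (p, mu1, sigma1, mu2, sigma2) for a mixture of two normals."""
    n = len(data)
    ordered = sorted(data)
    lower, upper = ordered[: n // 2], ordered[n // 2:]
    # Crude starting values (an assumption): split the sorted data at the median.
    mu1, mu2 = sum(lower) / len(lower), sum(upper) / len(upper)
    mean_all = sum(data) / n
    sigma1 = sigma2 = math.sqrt(sum((x - mean_all) ** 2 for x in data) / n)
    p = 0.5
    for _ in range(n_iter):
        # E-step: probability that each value belongs to distribution 1.
        resp = []
        for x in data:
            a = p * normal_pdf(x, mu1, sigma1)
            b = (1.0 - p) * normal_pdf(x, mu2, sigma2)
            resp.append(a / (a + b))
        # M-step: weighted re-estimates of the five parameters.
        n1 = sum(resp)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        sigma1 = math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1)
        sigma2 = math.sqrt(sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2)
        p = n1 / n
    return p, mu1, sigma1, mu2, sigma2


# Simulated data: each of the 2000 values comes from N(20, 4) with
# probability 0.5, otherwise from N(40, 4). The seed is fixed for repeatability.
rng = random.Random(1)
data = [rng.gauss(20.0, 4.0) if rng.random() < 0.5 else rng.gauss(40.0, 4.0)
        for _ in range(2000)]

p_hat, mu1_hat, sigma1_hat, mu2_hat, sigma2_hat = fit_two_normals(data)
print(p_hat, mu1_hat, sigma1_hat, mu2_hat, sigma2_hat)
```

The example works well because the two simulated distributions are far apart and there are many values. When the distributions overlap heavily, the estimates become much less stable, which is the uncertainty the text warns about.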

(*A number of macros for Minitab can be obtained on request via www.ing-stat.se)

••••

[Graph: The two normal distributions. Y-axis: Percent. X-axis: X-values. Toggles: Show/hide data (two normals), Show/hide true models, Show/hide estimated models, Show graph of all data.]

[Graph: All data from the mixed distributions. Y-axis: Percent. X-axis: X-values.]

[Sliders: μ distr 1, μ distr 2, σ distr 1, σ distr 2, Proportion distr 1, Total number of values.]

Exercise 1 – change the parameters
Change the parameters via the sliders and note that the distribution changes accordingly. If necessary, change the min or max values of the X-axis.
Make sure that the two μ-parameters have the same value, and set the two σ-parameters to equal values as well. Change the proportion slider and notice that the two total parameters (μ and σ of the mixture) do not change.
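A short derivation shows why the totals do not move in this exercise. It uses the standard expressions for the expected value and variance of a two-component mixture; the symbols μ and σ for the common values are introduced here for illustration only.

```latex
% Assume \mu_1 = \mu_2 = \mu and \sigma_1 = \sigma_2 = \sigma.
\mu_{tot} = p\,\mu + (1-p)\,\mu = \mu
% With \mu_{tot} = \mu, both squared distances (\mu_{tot}-\mu_i)^2 are zero:
\sigma_{tot}^2 = p\,[\sigma^2 + 0] + (1-p)\,[\sigma^2 + 0] = \sigma^2
```

Neither result contains p, so the proportion slider has no effect on the totals when the two distributions are identical.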

Exercise 2 – one sigma difference
Change the two μ-parameters to 40 and 45, respectively. Change the two σ-parameters to 5. Set the proportion to 0.5. The difference in mean is thus 45 − 40 = 5, i.e. one standard deviation. Notice that the resulting distribution does not visually reveal this rather large difference. (To find this difference, other data and methods are needed.) Decrease the first μ-parameter and note that a difference of nearly two sigma is needed before it becomes visible.
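The effect in this exercise can be checked numerically. The sketch below counts the peaks of the mixed density on a grid; the function name `count_peaks`, the grid limits, and the step size are assumptions made for the example. The second case uses a three-sigma difference for comparison.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def count_peaks(mu1, mu2, sigma, p=0.5, x_min=0.0, x_max=90.0, step=0.05):
    """Count local maxima of the mixed density on an evenly spaced grid."""
    n = int(round((x_max - x_min) / step))
    ys = []
    for i in range(n + 1):
        x = x_min + i * step
        ys.append(p * normal_pdf(x, mu1, sigma) + (1.0 - p) * normal_pdf(x, mu2, sigma))
    # A grid point is a peak if it is higher than both of its neighbours.
    return sum(1 for i in range(1, n) if ys[i - 1] < ys[i] > ys[i + 1])


peaks_one_sigma = count_peaks(40.0, 45.0, 5.0)    # means one sigma apart
peaks_three_sigma = count_peaks(30.0, 45.0, 5.0)  # means three sigma apart
print(peaks_one_sigma, peaks_three_sigma)
```

With equal sigmas and a proportion of 0.5, a second peak appears only once the means are more than two sigma apart, although the top of the curve flattens visibly somewhat earlier.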

Exercise 3 – plus/minus three sigma
Change the two μ-parameters to 20 and 40, respectively. Change the two σ-parameters to 4. Set the proportion to 0.5. This produces a distribution with two distinct peaks where the mean is 30.00 and sigma is 10.77. Notice that the rule of thumb of plus/minus three sigma embraces practically all of the distribution. (If necessary, extend the X-axis to show the histogram.)
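The numbers in this exercise can be reproduced with a few lines. The sketch below computes the total mean and sigma and then the share of the mixed distribution that falls within plus/minus three sigma; the variable names and the helper `mixture_cdf` are assumptions made for the example.

```python
import math
from statistics import NormalDist

p, mu1, mu2, sigma = 0.5, 20.0, 40.0, 4.0

# Total mean and standard deviation of the mixture.
mu_tot = p * mu1 + (1.0 - p) * mu2
var_tot = (p * (sigma ** 2 + (mu_tot - mu1) ** 2)
           + (1.0 - p) * (sigma ** 2 + (mu_tot - mu2) ** 2))
sigma_tot = math.sqrt(var_tot)

dist1 = NormalDist(mu1, sigma)
dist2 = NormalDist(mu2, sigma)


def mixture_cdf(x):
    """Probability that a value from the mixture is at most x."""
    return p * dist1.cdf(x) + (1.0 - p) * dist2.cdf(x)


# Share of the mixture inside mu_tot plus/minus three sigma_tot.
low, high = mu_tot - 3.0 * sigma_tot, mu_tot + 3.0 * sigma_tot
coverage = mixture_cdf(high) - mixture_cdf(low)

print(round(mu_tot, 2), round(sigma_tot, 2), coverage)
```

The limits come out at roughly -2.3 and 62.3, which explains why the X-axis may need to be extended.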

••••

After reading the 'info'-fields and performing the exercises, it is obvious that a mixture of distributions can be difficult to find.
Usually there is a need for other variables that indicate e.g. machine or similar.
If the data contain a sudden change in mean, this can sometimes be found by e.g. SQC methods or other types of time series analysis.

••••

The expected value, where p is the proportion of the first normal distribution (0 < p < 1):

   $\mu_{tot} = p\,\mu_1 + (1-p)\,\mu_2$

The standard deviation:

   $\sigma_{tot} = \sqrt{p\,[\sigma_1^2 + (\mu_{tot}-\mu_1)^2] + (1-p)\,[\sigma_2^2 + (\mu_{tot}-\mu_2)^2]}$

The $pdf_{tot}$ is the 'height' of the mixed distribution at every X-value:

   $pdf_{tot} = p\,pdf_1 + (1-p)\,pdf_2$
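The three formulas above can be sketched as small functions. The function names are assumptions made for the example, and the standard deviation is obtained by taking the square root of the weighted variance expression. The values are those of Exercise 2.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_mean(p, mu1, mu2):
    """Expected value of the mixture; p is the proportion of distribution 1."""
    return p * mu1 + (1.0 - p) * mu2


def mixture_sd(p, mu1, sigma1, mu2, sigma2):
    """Standard deviation of the mixture (square root of the total variance)."""
    mu_tot = mixture_mean(p, mu1, mu2)
    var_tot = (p * (sigma1 ** 2 + (mu_tot - mu1) ** 2)
               + (1.0 - p) * (sigma2 ** 2 + (mu_tot - mu2) ** 2))
    return math.sqrt(var_tot)


def mixture_pdf(x, p, mu1, sigma1, mu2, sigma2):
    """Height of the mixed distribution at x."""
    return p * normal_pdf(x, mu1, sigma1) + (1.0 - p) * normal_pdf(x, mu2, sigma2)


# Exercise 2 values: means 40 and 45, both sigmas 5, proportion 0.5.
mean_ex2 = mixture_mean(0.5, 40.0, 45.0)
sd_ex2 = mixture_sd(0.5, 40.0, 5.0, 45.0, 5.0)
height_ex2 = mixture_pdf(mean_ex2, 0.5, 40.0, 5.0, 45.0, 5.0)
print(mean_ex2, sd_ex2, height_ex2)
```

For these values the total sigma is only slightly larger than 5, which is one way to see why a one-sigma difference in means is hard to spot.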

••••

The blue line is the resulting mixed distribution and the area under the curve is the probability. The total area is 1.

The expected value is indicated on the X-axis as one red vertical bar with the value attached to it. The small red lines indicate 1, 2, and 3 sigma from the expected value.

The X-axis can be changed by clicking and changing the min or max values for a better fit.

Use the button [Ordinary normal] to learn more about the normal distribution.

••••


Exercises.  A number of exercises to further illuminate certain features of mixture of variables.

Some conclusions.  A summary of the main ideas and problems with mixture of variables.

Formulas.  There are three main formulas that are used for the mixed result: the expected value, the standard deviation and the probability distribution. These formulas are valid for all distributions.

Change parameters.  It is possible to change the parameters for the mixed distribution. This is done using five sliders.

Mixed Poisson.  The button leads to a page showing a mixture of Poisson distributions.

Mixed normal.  The button leads to a page showing a mixture of normal distributions.

Ordinary Poisson.  The button leads to a page showing all basic features of a Poisson distribution.

Ordinary normal.  The button leads to a page showing all basic features of a normal distribution.

μ.  The theoretical mean of the mixture of distributions.

σ.  The theoretical standard deviation of the mixture of distributions.

••••

The range of the sliders cannot be changed. The four top sliders move 0.1 every time a right or left arrow is pressed. The proportion slider moves 0.01 every time a right or left arrow is pressed. The bottom slider moves 2 units every time a right or left arrow is pressed.

μ distr 1:  The theoretical mean of the first normal distribution.

σ distr 1:  The theoretical standard deviation of the first normal distribution.

μ distr 2:  The theoretical mean of the second normal distribution.

σ distr 2:  The theoretical standard deviation of the second normal distribution.

Proportion (p):  The proportion of the first distribution and thus (1-p) is the proportion of the second distribution.

Total number of values:  The total number of simulated values.

••••

The graph shows the two normal distributions depicted by their cumulative distribution functions, shown here as straight lines. (With a linear Y-axis these are S-shaped curves, but here the Y-axis has been rescaled.) The Y-axis always covers the range [0, 1] (some computer programs show it as 0 - 100 %).

If the two lines are more or less parallel, the two normals have approximately the same sigma. The horizontal distance between the lines corresponds to the difference in means. Both these features can be visualized using the parameter sliders.
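The straight lines can be reproduced by passing each cumulative probability through the inverse of the standard normal distribution. The sketch below assumes that this is the rescaling used for the Y-axis; the function name `plot_line` and the chosen X-values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def plot_line(mu, sigma, xs):
    """Rescaled Y-values: inverse standard normal of the cumulative probability."""
    dist = NormalDist(mu, sigma)
    return [STD_NORMAL.inv_cdf(dist.cdf(x)) for x in xs]


xs = [35.0, 40.0, 45.0, 50.0]
z1 = plot_line(40.0, 5.0, xs)  # first normal: mean 40, sigma 5
z2 = plot_line(45.0, 5.0, xs)  # second normal: mean 45, sigma 5

# The slope of each line is 1/sigma, so equal sigmas give parallel lines.
slope1 = (z1[-1] - z1[0]) / (xs[-1] - xs[0])
slope2 = (z2[-1] - z2[0]) / (xs[-1] - xs[0])

# The horizontal distance between the lines equals the difference in means.
horizontal_gap = (z1[0] - z2[0]) / slope1

print(slope1, slope2, horizontal_gap)
```

Both slopes come out at about 0.2 (that is, 1/5) and the horizontal gap at about 5, matching the two features described above.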

Show/hide data (the two normals):  By toggling, the data can be shown or hidden.

Show/hide true models:  The input parameters define two true models. These are shown as dashed lines.

Show/hide estimated models:  From data two normal models are estimated and shown as solid lines.

Show/hide all data:  All data is shown in a separate window.

••••

The graph shows all the simulated data. If the graph deviates too much from a straight line one can suspect that the data consist of a mixture of distributions. To be able to investigate this, one needs extra info such as a grouping variable.
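How far a mixture departs from a straight line can be illustrated without simulation. The sketch below uses theoretical cumulative probabilities instead of simulated data, which is a simplification, and measures the largest distance from a fitted straight line on the rescaled Y-axis. The function name `max_deviation_from_line` and the range of X-values are assumptions made for the example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def max_deviation_from_line(cdf, xs):
    """Largest vertical distance between the rescaled curve and its best-fit line."""
    zs = [STD_NORMAL.inv_cdf(cdf(x)) for x in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_z = sum(zs) / n
    # Ordinary least-squares fit of z on x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(xs, zs))
    slope = sxz / sxx
    intercept = mean_z - slope * mean_x
    return max(abs(z - (intercept + slope * x)) for x, z in zip(xs, zs))


dist1 = NormalDist(20.0, 4.0)
dist2 = NormalDist(40.0, 4.0)


def mixture_cdf(x):
    """Cumulative probability of the 50/50 mixture from Exercise 3."""
    return 0.5 * dist1.cdf(x) + 0.5 * dist2.cdf(x)


# A single normal with the same mean and sigma as the mixture, for comparison.
single = NormalDist(30.0, 10.77)

xs = [float(x) for x in range(12, 49)]
dev_single = max_deviation_from_line(single.cdf, xs)
dev_mixture = max_deviation_from_line(mixture_cdf, xs)
print(dev_single, dev_mixture)
```

The single normal lies on the line up to rounding error, while the mixture deviates clearly. With real data, random scatter is added on top of this, so small deviations are harder to judge.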

Sometimes the data can give the impression that the true model is skewed in some direction. It is then a mistake to start using popular tools to force the result to be more symmetrical and thus more 'normal-looking'.
In doing this perhaps a good understanding of the process is lost. Any data needs to be investigated in several ways in order to reveal its secrets.

••••