Module R’Stat1: Multiple linear regression

November 2019; IRD-Montpellier-France

CC BY-NC-ND 3.0

Multiple linear regression by example, after C. Prieur

The state data

##                Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama              3615   3624        2.1    69.05   15.1    41.3    20
## Alaska                365   6315        1.5    69.31   11.3    66.7   152
## Arizona              2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas             2110   3378        1.9    70.66   10.1    39.9    65
## California          21198   5114        1.1    71.71   10.3    62.6    20
## Colorado             2541   4884        0.7    72.06    6.8    63.9   166
## Connecticut          3100   5348        1.1    72.48    3.1    56.0   139
## Delaware              579   4809        0.9    70.06    6.2    54.6   103
## Florida              8277   4815        1.3    70.66   10.7    52.6    11
## Georgia              4931   4091        2.0    68.54   13.9    40.6    60
## Hawaii                868   4963        1.9    73.60    6.2    61.9     0
## Idaho                 813   4119        0.6    71.87    5.3    59.5   126
## Illinois            11197   5107        0.9    70.14   10.3    52.6   127
## Indiana              5313   4458        0.7    70.88    7.1    52.9   122
## Iowa                 2861   4628        0.5    72.56    2.3    59.0   140
## Kansas               2280   4669        0.6    72.58    4.5    59.9   114
## Kentucky             3387   3712        1.6    70.10   10.6    38.5    95
## Louisiana            3806   3545        2.8    68.76   13.2    42.2    12
## Maine                1058   3694        0.7    70.39    2.7    54.7   161
## Maryland             4122   5299        0.9    70.22    8.5    52.3   101
## Massachusetts        5814   4755        1.1    71.83    3.3    58.5   103
## Michigan             9111   4751        0.9    70.63   11.1    52.8   125
## Minnesota            3921   4675        0.6    72.96    2.3    57.6   160
## Mississippi          2341   3098        2.4    68.09   12.5    41.0    50
## Missouri             4767   4254        0.8    70.69    9.3    48.8   108
## Montana               746   4347        0.6    70.56    5.0    59.2   155
## Nebraska             1544   4508        0.6    72.60    2.9    59.3   139
## Nevada                590   5149        0.5    69.03   11.5    65.2   188
## New Hampshire         812   4281        0.7    71.23    3.3    57.6   174
## New Jersey           7333   5237        1.1    70.93    5.2    52.5   115
## New Mexico           1144   3601        2.2    70.32    9.7    55.2   120
## New York            18076   4903        1.4    70.55   10.9    52.7    82
## North Carolina       5441   3875        1.8    69.21   11.1    38.5    80
## North Dakota          637   5087        0.8    72.78    1.4    50.3   186
## Ohio                10735   4561        0.8    70.82    7.4    53.2   124
## Oklahoma             2715   3983        1.1    71.42    6.4    51.6    82
## Oregon               2284   4660        0.6    72.13    4.2    60.0    44
## Pennsylvania        11860   4449        1.0    70.43    6.1    50.2   126
## Rhode Island          931   4558        1.3    71.90    2.4    46.4   127
## South Carolina       2816   3635        2.3    67.96   11.6    37.8    65
## South Dakota          681   4167        0.5    72.08    1.7    53.3   172
## Tennessee            4173   3821        1.7    70.11   11.0    41.8    70
## Texas               12237   4188        2.2    70.90   12.2    47.4    35
## Utah                 1203   4022        0.6    72.90    4.5    67.3   137
## Vermont               472   3907        0.6    71.64    5.5    57.1   168
## Virginia             4981   4701        1.4    70.08    9.5    47.8    85
## Washington           3559   4864        0.6    71.72    4.3    63.5    32
## West Virginia        1799   3617        1.4    69.48    6.7    41.6   100
## Wisconsin            4589   4468        0.7    72.48    3.0    54.5   149
## Wyoming               376   4566        0.6    70.29    6.9    62.9   173
##                  Area
## Alabama         50708
## Alaska         566432
## Arizona        113417
## Arkansas        51945
## California     156361
## Colorado       103766
## Connecticut      4862
## Delaware         1982
## Florida         54090
## Georgia         58073
## Hawaii           6425
## Idaho           82677
## Illinois        55748
## Indiana         36097
## Iowa            55941
## Kansas          81787
## Kentucky        39650
## Louisiana       44930
## Maine           30920
## Maryland         9891
## Massachusetts    7826
## Michigan        56817
## Minnesota       79289
## Mississippi     47296
## Missouri        68995
## Montana        145587
## Nebraska        76483
## Nevada         109889
## New Hampshire    9027
## New Jersey       7521
## New Mexico     121412
## New York        47831
## North Carolina  48798
## North Dakota    69273
## Ohio            40975
## Oklahoma        68782
## Oregon          96184
## Pennsylvania    44966
## Rhode Island     1049
## South Carolina  30225
## South Dakota    75955
## Tennessee       41328
## Texas          262134
## Utah            82096
## Vermont          9267
## Virginia        39780
## Washington      66570
## West Virginia   24070
## Wisconsin       54464
## Wyoming         97203

The state data

state.x77: matrix with 50 rows and 8 columns giving the following statistics in the respective columns.

Population: population estimate as of July 1, 1975

Income: per capita income (1974)

Illiteracy: illiteracy (1970, percent of population)

Life Exp: life expectancy in years (1969–71)

Murder: murder and non-negligent manslaughter rate per 100,000 population (1976)

HS Grad: percent high-school graduates (1970)

Frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city

Area: land area in square miles

Descriptive statistics

##    Population        Income       Illiteracy       Life Exp    
##  Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
##  1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
##  Median : 2838   Median :4519   Median :0.950   Median :70.67  
##  Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
##  3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
##  Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
##      Murder          HS Grad          Frost             Area       
##  Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
##  1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
##  Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
##  Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
##  3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81163  
##  Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432

## [1] "matrix"
## [1] "data.frame"
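
The two class values above are consistent with `state.x77` being a matrix that is then converted to a data frame. A minimal sketch of that step, assuming the data frame is called `usa` (as in the model calls that follow) and assuming the conversion was done with `data.frame()`, which would explain the syntactic column names `Life.Exp` and `HS.Grad`:

```r
# state.x77 ships with base R (package 'datasets') as a numeric matrix.
class(state.x77)

# data.frame() applies make.names() to the column names, so
# "Life Exp" becomes "Life.Exp" and "HS Grad" becomes "HS.Grad".
usa <- data.frame(state.x77)
class(usa)
names(usa)
```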

Explaining life expectancy (Life.Exp)

## 
## Call:
## lm(formula = usa$Life.Exp ~ usa$Population + usa$Income + usa$Illiteracy + 
##     usa$Murder + usa$HS.Grad + usa$Frost + usa$Area)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48895 -0.51232 -0.02747  0.57002  1.49447 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.094e+01  1.748e+00  40.586  < 2e-16 ***
## usa$Population  5.180e-05  2.919e-05   1.775   0.0832 .  
## usa$Income     -2.180e-05  2.444e-04  -0.089   0.9293    
## usa$Illiteracy  3.382e-02  3.663e-01   0.092   0.9269    
## usa$Murder     -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
## usa$HS.Grad     4.893e-02  2.332e-02   2.098   0.0420 *  
## usa$Frost      -5.735e-03  3.143e-03  -1.825   0.0752 .  
## usa$Area       -7.383e-08  1.668e-06  -0.044   0.9649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared:  0.7362, Adjusted R-squared:  0.6922 
## F-statistic: 16.74 on 7 and 42 DF,  p-value: 2.534e-10
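
The call above refers to each column as `usa$...`. An equivalent sketch using the `data` argument, which keeps the coefficient names short and lets `predict()` work on new data later on. The object name `fit_full` is an assumption, not taken from the slides:

```r
usa <- data.frame(state.x77)

# Same model as above: all seven other columns are used as predictors.
fit_full <- lm(Life.Exp ~ Population + Income + Illiteracy + Murder +
                 HS.Grad + Frost + Area, data = usa)
summary(fit_full)
```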

AIC

"The Akaike information criterion (AIC) is a measure of the quality of a statistical model proposed by Hirotugu Akaike in 1973.

When a statistical model is estimated, its likelihood can be increased by adding a parameter. The Akaike information criterion, like the Bayesian information criterion, penalizes models according to their number of parameters in order to satisfy the parsimony criterion. The model with the lowest Akaike information criterion is then chosen." (Wikipedia)

## [1] 121.7092
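
As a sketch of where this number comes from: for a Gaussian linear model, `AIC()` returns \(-2\log L + 2k\), where \(k\) counts the regression coefficients plus one for the residual variance. The names `fit_full`, `k` and `aic_manual` are illustrative:

```r
usa <- data.frame(state.x77)
fit_full <- lm(Life.Exp ~ ., data = usa)

n   <- nobs(fit_full)
rss <- sum(residuals(fit_full)^2)

# Maximised Gaussian log-likelihood, with sigma^2 estimated by RSS / n.
loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)

# k = number of coefficients + 1 for the residual variance.
k <- length(coef(fit_full)) + 1
aic_manual <- -2 * loglik + 2 * k

c(manual = aic_manual, builtin = AIC(fit_full))
```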

## 
## Call:
## lm(formula = usa$Life.Exp ~ usa$Population + usa$Murder + usa$HS.Grad + 
##     usa$Frost)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.103e+01  9.529e-01  74.542  < 2e-16 ***
## usa$Population  5.014e-05  2.512e-05   1.996  0.05201 .  
## usa$Murder     -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## usa$HS.Grad     4.658e-02  1.483e-02   3.142  0.00297 ** 
## usa$Frost      -5.943e-03  2.421e-03  -2.455  0.01802 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12
## [1] 115.7326

## 
## Call:
## lm(formula = usa$Life.Exp ~ usa$Murder + usa$HS.Grad + usa$Frost)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5015 -0.5391  0.1014  0.5921  1.2268 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 71.036379   0.983262  72.246  < 2e-16 ***
## usa$Murder  -0.283065   0.036731  -7.706 8.04e-10 ***
## usa$HS.Grad  0.049949   0.015201   3.286  0.00195 ** 
## usa$Frost   -0.006912   0.002447  -2.824  0.00699 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7427 on 46 degrees of freedom
## Multiple R-squared:  0.7127, Adjusted R-squared:  0.6939 
## F-statistic: 38.03 on 3 and 46 DF,  p-value: 1.634e-12
## [1] 117.9743
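
The three AIC values reported above (121.71, 115.73 and 117.97) can be put side by side with a single call. A sketch, with illustrative object names:

```r
usa <- data.frame(state.x77)

fit_full <- lm(Life.Exp ~ ., data = usa)
fit_4    <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = usa)
fit_3    <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = usa)

# AIC() accepts several fitted models and returns a small table
# with the number of parameters (df) and the AIC of each.
aic_table <- AIC(fit_full, fit_4, fit_3)
aic_table

# The lowest AIC identifies the preferred model among those compared.
rownames(aic_table)[which.min(aic_table$AIC)]
```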

Stepwise procedure

## Start:  AIC=-22.18
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
##     Frost + Area
## 
##              Df Sum of Sq    RSS     AIC
## - Area        1    0.0011 23.298 -24.182
## - Income      1    0.0044 23.302 -24.175
## - Illiteracy  1    0.0047 23.302 -24.174
## <none>                    23.297 -22.185
## - Population  1    1.7472 25.044 -20.569
## - Frost       1    1.8466 25.144 -20.371
## - HS.Grad     1    2.4413 25.738 -19.202
## - Murder      1   23.1411 46.438  10.305
## 
## Step:  AIC=-24.18
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
##     Frost
## 
##              Df Sum of Sq    RSS     AIC
## - Illiteracy  1    0.0038 23.302 -26.174
## - Income      1    0.0059 23.304 -26.170
## <none>                    23.298 -24.182
## - Population  1    1.7599 25.058 -22.541
## - Frost       1    2.0488 25.347 -21.968
## - HS.Grad     1    2.9804 26.279 -20.163
## - Murder      1   26.2721 49.570  11.569
## 
## Step:  AIC=-26.17
## Life.Exp ~ Population + Income + Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## - Income      1     0.006 23.308 -28.161
## <none>                    23.302 -26.174
## - Population  1     1.887 25.189 -24.280
## - Frost       1     3.037 26.339 -22.048
## - HS.Grad     1     3.495 26.797 -21.187
## - Murder      1    34.739 58.041  17.456
## 
## Step:  AIC=-28.16
## Life.Exp ~ Population + Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## - Population  1     2.064 25.372 -25.920
## - Frost       1     3.122 26.430 -23.877
## - HS.Grad     1     5.112 28.420 -20.246
## - Murder      1    34.816 58.124  15.528

## 
## Call:
## lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, 
##     data = usa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12
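
The trace above has the shape of a backward elimination: each step drops the term whose removal lowers the AIC most, and the procedure stops when no removal helps. A sketch that should reproduce it with `step()`, assuming the full model was fitted with `data = usa`. Note that `step()` relies on `extractAIC()`, which omits an additive constant; this is why the trace shows negative values while `AIC()` gave 121.71 for the same model.

```r
usa <- data.frame(state.x77)
fit_full <- lm(Life.Exp ~ ., data = usa)

# With no 'scope' argument, step() performs backward elimination.
# trace = 0 suppresses the step-by-step printout.
fit_step <- step(fit_full, trace = 0)
formula(fit_step)

# extractAIC() and AIC() differ by a constant that depends only on n,
# so both rank models fitted to the same data identically.
n <- nobs(fit_full)
AIC(fit_full) - extractAIC(fit_full)[2]
n * (log(2 * pi) + 1) + 2
```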

Predictions (within the range of observed values)

##     Population Murder HS.Grad Frost
## min        365    1.4    37.8     0
## max      21198   15.1    67.3   188

##        fit      lwr      upr
## 1 70.92559 69.45497 72.39621
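
A sketch of how such a prediction can be obtained. The predictor values for the new state below are hypothetical (the slides do not show the ones actually used); they are only chosen to lie inside the observed min-max ranges. Using `interval = "prediction"` rather than `"confidence"` is also an assumption.

```r
usa <- data.frame(state.x77)
fit <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = usa)

# Observed range of each predictor: stay inside it when predicting.
sapply(usa[, c("Population", "Murder", "HS.Grad", "Frost")], range)

# Hypothetical new observation (illustrative values, not from the slides).
new_state <- data.frame(Population = 4000, Murder = 7,
                        HS.Grad = 53, Frost = 100)

# Interval for a single new observation.
pred <- predict(fit, newdata = new_state, interval = "prediction")
# Interval for the mean response at the same predictor values.
conf <- predict(fit, newdata = new_state, interval = "confidence")
pred
conf
```

The prediction interval is always wider than the confidence interval, because it also accounts for the residual variability of an individual observation.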

Residual analysis

Dependence between explanatory variables (collinearity)

## Warning: package 'car' was built under R version 3.6.1
## Loading required package: carData
## Population     Murder    HS.Grad      Frost 
##   1.189835   1.727844   1.356791   1.498077

Variance inflation factors (VIF) < 10: OK!
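
The VIF of a predictor can be recomputed without the `car` package: regress that predictor on the others and take \(1 / (1 - R^2)\). A base-R sketch; the helper vector `vif_manual` is illustrative and not part of any package:

```r
usa <- data.frame(state.x77)
predictors <- c("Population", "Murder", "HS.Grad", "Frost")

# VIF_j = 1 / (1 - R2_j), where R2_j is the R-squared of the regression
# of predictor j on all the other predictors.
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  r2 <- summary(lm(reformulate(others, response = v), data = usa))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 6)
```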

Multiple linear regression

Model

\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p + \epsilon\)

\(y = \beta_0 + \sum_{j=1}^{p}\beta_jx_j+\epsilon\)

\(y\): continuous quantitative response variable.

\(x_j\): continuous quantitative explanatory variables.

\(\epsilon\): random error, normally distributed with zero mean and standard deviation \(\sigma\).
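
Under this model, the least-squares estimate of the coefficients solves the normal equations \(X^\top X \hat\beta = X^\top y\), where \(X\) is the design matrix with a leading column of ones. A sketch on the state data checking that this matches `lm()`. Solving the normal equations directly is for illustration only; `lm()` itself relies on a QR decomposition, which is numerically more stable.

```r
usa <- data.frame(state.x77)
fit <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = usa)

# Design matrix: a column of ones (intercept) followed by the predictors.
X <- model.matrix(fit)
y <- usa$Life.Exp

# Solve (X'X) beta = X'y.
beta_hat <- solve(crossprod(X), crossprod(X, y))

cbind(normal_equations = drop(beta_hat), lm = coef(fit))
```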

With R

The lm() function

Results with summary()

Diagnostic plots to check the assumptions with plot(lm())

Statistical tests such as shapiro.test() for normality of the residuals
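
Put together on the state data, these steps might look like the following sketch. The model is the four-predictor one selected earlier; the object names `fit` and `sw` are illustrative.

```r
usa <- data.frame(state.x77)

# 1. Fit the model.
fit <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = usa)

# 2. Coefficients, standard errors, t-tests, R-squared.
summary(fit)

# 3. Four standard diagnostic plots on one page: residuals vs fitted,
#    normal Q-Q, scale-location, residuals vs leverage.
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# 4. Shapiro-Wilk test; the null hypothesis is that residuals are normal.
sw <- shapiro.test(residuals(fit))
sw
```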

Multiple linear regression and variable selection

Collinearity

If one (or more) explanatory variable is a linear combination of one (or more) other variables, we speak of collinearity. In that case, the individual coefficients associated with each variable cannot be interpreted reliably…
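
A small simulation can make this concrete. The data below are made up for illustration (they are not the `df` data set used in the following slides): `x3` is nearly a copy of `x1`, and the standard error of the `x1` coefficient grows markedly once `x3` enters the model.

```r
set.seed(1)
n  <- 100
x1 <- runif(n, 0, 100)
x2 <- rnorm(n, mean = 0, sd = 100)
x3 <- x1 + rnorm(n, sd = 10)          # almost collinear with x1
y  <- 3 * x1 + x2 + 2 * x3 + rnorm(n, sd = 50)

fit_without <- lm(y ~ x1 + x2)        # x3 left out
fit_with    <- lm(y ~ x1 + x2 + x3)   # x1 and x3 compete

cor(x1, x3)

# Standard error of the x1 coefficient in each model.
se_x1 <- c(
  without_x3 = coef(summary(fit_without))["x1", "Std. Error"],
  with_x3    = coef(summary(fit_with))["x1", "Std. Error"]
)
se_x1
```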

Detecting collinearity (data)

##         x1         x2       x3        x4        x5       x6        x7
## 1 83.72785  144.56185 96.55340 98.715658 106.62437 74.93125 250.73830
## 2 73.51153  -72.77192 59.94101 56.605437  56.29103 73.70063  34.74389
## 3 14.58394  132.80571 46.19169  1.696772  25.08175 23.46682 -16.02037
## 4 19.90037 -172.43941 17.39241  4.257799  15.54725 26.53419 -61.08011
## 5 80.21417   61.82470 81.74531 73.002784  93.81761 75.59769  42.15923
## 6 98.18177   82.38916 88.62505 99.777754 102.91077 90.94779 223.50261
##           x8       x9       x10         y
## 1  89.470036 96.53135 110.98130 516.62259
## 2  71.042496 55.74935  71.26404 573.29268
## 3  29.774042 29.20769  23.43577  51.15482
## 4   2.489837 10.24249  37.63406  62.85970
## 5  66.895511 59.13655  65.04702 517.36852
## 6 109.079371 89.16625  81.85138 348.68400

Detecting collinearity (data)

One would expect a significant effect of x1, x2, x3 and x4.

Detecting collinearity (data)

## 
## Call:
## lm(formula = y ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -272.74  -79.81    8.44   91.27  322.87 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.7007    31.2962  -0.246    0.806    
## x1            6.1036     1.3631   4.478 2.23e-05 ***
## x2            0.8654     0.1534   5.641 1.97e-07 ***
## x3           -0.5174     1.4008  -0.369    0.713    
## x4           -0.2623     1.3167  -0.199    0.843    
## x5            1.2176     1.4404   0.845    0.400    
## x6            0.3952     1.2640   0.313    0.755    
## x7           -0.2172     0.1586  -1.369    0.174    
## x8           -0.8464     1.2975  -0.652    0.516    
## x9           -1.2123     1.3064  -0.928    0.356    
## x10           1.1036     1.3043   0.846    0.400    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 138.3 on 89 degrees of freedom
## Multiple R-squared:  0.7085, Adjusted R-squared:  0.6757 
## F-statistic: 21.63 on 10 and 89 DF,  p-value: < 2.2e-16

Detecting collinearity (1)

Using the correlation between explanatory variables:

## corrplot 0.84 loaded
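
As a sketch of this idea with base R only, the correlation matrix of the explanatory variables can be computed with `cor()` and scanned for strongly correlated pairs. The state data stand in for `df` here, and the 0.6 cutoff is an arbitrary illustrative threshold.

```r
usa <- data.frame(state.x77)

# Correlations between explanatory variables only (response removed).
X <- usa[, setdiff(names(usa), "Life.Exp")]
M <- cor(X)
round(M, 2)

# List each pair once (upper triangle) with |r| above the threshold.
threshold <- 0.6
idx <- which(abs(M) > threshold & upper.tri(M), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(M)[idx[, "row"]],
  var2 = colnames(M)[idx[, "col"]],
  r    = M[idx]
)
strong_pairs

# With the corrplot package installed, corrplot::corrplot(M) draws the matrix.
```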

Detecting collinearity (1)