Module R’Stat1: Multiple linear regression

francois.rebaudo@ird.fr

November 2019; IRD-Montpellier-France
CC BY-NC-ND 3.0

Multiple linear regression by example, after C. Prieur

The `state` data

data(state)
print(state.x77)

##                Population Income Illiteracy Life Exp Murder HS Grad Frost
## Alabama              3615   3624        2.1    69.05   15.1    41.3    20
## Alaska                365   6315        1.5    69.31   11.3    66.7   152
## Arizona              2212   4530        1.8    70.55    7.8    58.1    15
## Arkansas             2110   3378        1.9    70.66   10.1    39.9    65
## California          21198   5114        1.1    71.71   10.3    62.6    20
## Colorado             2541   4884        0.7    72.06    6.8    63.9   166
## Connecticut          3100   5348        1.1    72.48    3.1    56.0   139
## Delaware              579   4809        0.9    70.06    6.2    54.6   103
## Florida              8277   4815        1.3    70.66   10.7    52.6    11
## Georgia              4931   4091        2.0    68.54   13.9    40.6    60
## Hawaii                868   4963        1.9    73.60    6.2    61.9     0
## Idaho                 813   4119        0.6    71.87    5.3    59.5   126
## Illinois            11197   5107        0.9    70.14   10.3    52.6   127
## Indiana              5313   4458        0.7    70.88    7.1    52.9   122
## Iowa                 2861   4628        0.5    72.56    2.3    59.0   140
## Kansas               2280   4669        0.6    72.58    4.5    59.9   114
## Kentucky             3387   3712        1.6    70.10   10.6    38.5    95
## Louisiana            3806   3545        2.8    68.76   13.2    42.2    12
## Maine                1058   3694        0.7    70.39    2.7    54.7   161
## Maryland             4122   5299        0.9    70.22    8.5    52.3   101
## Massachusetts        5814   4755        1.1    71.83    3.3    58.5   103
## Michigan             9111   4751        0.9    70.63   11.1    52.8   125
## Minnesota            3921   4675        0.6    72.96    2.3    57.6   160
## Mississippi          2341   3098        2.4    68.09   12.5    41.0    50
## Missouri             4767   4254        0.8    70.69    9.3    48.8   108
## Montana               746   4347        0.6    70.56    5.0    59.2   155
## Nebraska             1544   4508        0.6    72.60    2.9    59.3   139
## Nevada                590   5149        0.5    69.03   11.5    65.2   188
## New Hampshire         812   4281        0.7    71.23    3.3    57.6   174
## New Jersey           7333   5237        1.1    70.93    5.2    52.5   115
## New Mexico           1144   3601        2.2    70.32    9.7    55.2   120
## New York            18076   4903        1.4    70.55   10.9    52.7    82
## North Carolina       5441   3875        1.8    69.21   11.1    38.5    80
## North Dakota          637   5087        0.8    72.78    1.4    50.3   186
## Ohio                10735   4561        0.8    70.82    7.4    53.2   124
## Oklahoma             2715   3983        1.1    71.42    6.4    51.6    82
## Oregon               2284   4660        0.6    72.13    4.2    60.0    44
## Pennsylvania        11860   4449        1.0    70.43    6.1    50.2   126
## Rhode Island          931   4558        1.3    71.90    2.4    46.4   127
## South Carolina       2816   3635        2.3    67.96   11.6    37.8    65
## South Dakota          681   4167        0.5    72.08    1.7    53.3   172
## Tennessee            4173   3821        1.7    70.11   11.0    41.8    70
## Texas               12237   4188        2.2    70.90   12.2    47.4    35
## Utah                 1203   4022        0.6    72.90    4.5    67.3   137
## Vermont               472   3907        0.6    71.64    5.5    57.1   168
## Virginia             4981   4701        1.4    70.08    9.5    47.8    85
## Washington           3559   4864        0.6    71.72    4.3    63.5    32
## West Virginia        1799   3617        1.4    69.48    6.7    41.6   100
## Wisconsin            4589   4468        0.7    72.48    3.0    54.5   149
## Wyoming               376   4566        0.6    70.29    6.9    62.9   173
##                  Area
## Alabama         50708
## Alaska         566432
## Arizona        113417
## Arkansas        51945
## California     156361
## Colorado       103766
## Connecticut      4862
## Delaware         1982
## Florida         54090
## Georgia         58073
## Hawaii           6425
## Idaho           82677
## Illinois        55748
## Indiana         36097
## Iowa            55941
## Kansas          81787
## Kentucky        39650
## Louisiana       44930
## Maine           30920
## Maryland         9891
## Massachusetts    7826
## Michigan        56817
## Minnesota       79289
## Mississippi     47296
## Missouri        68995
## Montana        145587
## Nebraska        76483
## Nevada         109889
## New Hampshire    9027
## New Jersey       7521
## New Mexico     121412
## New York        47831
## North Carolina  48798
## North Dakota    69273
## Ohio            40975
## Oklahoma        68782
## Oregon          96184
## Pennsylvania    44966
## Rhode Island     1049
## South Carolina  30225
## South Dakota    75955
## Tennessee       41328
## Texas          262134
## Utah            82096
## Vermont          9267
## Virginia        39780
## Washington      66570
## West Virginia   24070
## Wisconsin       54464
## Wyoming         97203

The `state` data

state.x77: matrix with 50 rows and 8 columns giving the following statistics in the respective columns.

Population: population estimate as of July 1, 1975

Income: per capita income (1974)

Illiteracy: illiteracy (1970, percent of population)

Life Exp: life expectancy in years (1969–71)

Murder: murder and non-negligent manslaughter rate per 100,000 population (1976)

HS Grad: percent high-school graduates (1970)

Frost: mean number of days with minimum temperature below freezing (1931–1960) in capital or large city

Area: land area in square miles

Descriptive statistics

summary(state.x77)

##    Population        Income       Illiteracy       Life Exp    
##  Min.   :  365   Min.   :3098   Min.   :0.500   Min.   :67.96  
##  1st Qu.: 1080   1st Qu.:3993   1st Qu.:0.625   1st Qu.:70.12  
##  Median : 2838   Median :4519   Median :0.950   Median :70.67  
##  Mean   : 4246   Mean   :4436   Mean   :1.170   Mean   :70.88  
##  3rd Qu.: 4968   3rd Qu.:4814   3rd Qu.:1.575   3rd Qu.:71.89  
##  Max.   :21198   Max.   :6315   Max.   :2.800   Max.   :73.60  
##      Murder          HS Grad          Frost             Area       
##  Min.   : 1.400   Min.   :37.80   Min.   :  0.00   Min.   :  1049  
##  1st Qu.: 4.350   1st Qu.:48.05   1st Qu.: 66.25   1st Qu.: 36985  
##  Median : 6.850   Median :53.25   Median :114.50   Median : 54277  
##  Mean   : 7.378   Mean   :53.11   Mean   :104.46   Mean   : 70736  
##  3rd Qu.:10.675   3rd Qu.:59.15   3rd Qu.:139.75   3rd Qu.: 81163  
##  Max.   :15.100   Max.   :67.30   Max.   :188.00   Max.   :566432

class(state.x77)

## [1] "matrix"

usa <- data.frame(state.x77)
class(usa)

## [1] "data.frame"

Explaining life expectancy `Life.Exp`

mod01 <- lm(usa$Life.Exp ~ 
  usa$Population + usa$Income + 
  usa$Illiteracy + usa$Murder + 
  usa$HS.Grad + usa$Frost + 
  usa$Area)
# shorter equivalent: lm(Life.Exp ~ ., data = usa)

summary(mod01)

## 
## Call:
## lm(formula = usa$Life.Exp ~ usa$Population + usa$Income + usa$Illiteracy + 
##     usa$Murder + usa$HS.Grad + usa$Frost + usa$Area)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.48895 -0.51232 -0.02747  0.57002  1.49447 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.094e+01  1.748e+00  40.586  < 2e-16 ***
## usa$Population  5.180e-05  2.919e-05   1.775   0.0832 .  
## usa$Income     -2.180e-05  2.444e-04  -0.089   0.9293    
## usa$Illiteracy  3.382e-02  3.663e-01   0.092   0.9269    
## usa$Murder     -3.011e-01  4.662e-02  -6.459 8.68e-08 ***
## usa$HS.Grad     4.893e-02  2.332e-02   2.098   0.0420 *  
## usa$Frost      -5.735e-03  3.143e-03  -1.825   0.0752 .  
## usa$Area       -7.383e-08  1.668e-06  -0.044   0.9649    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7448 on 42 degrees of freedom
## Multiple R-squared:  0.7362, Adjusted R-squared:  0.6922 
## F-statistic: 16.74 on 7 and 42 DF,  p-value: 2.534e-10

AIC

"The Akaike information criterion (AIC) is a measure of the quality of a statistical model, proposed by Hirotugu Akaike in 1973.

When estimating a statistical model, it is possible to increase the likelihood of the model by adding a parameter. The Akaike information criterion, like the Bayesian information criterion, penalizes models according to their number of parameters in order to satisfy the parsimony criterion. One then chooses the model with the lowest Akaike information criterion." (translated from WIKIPEDIA)

AIC(mod01)

## [1] 121.7092
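As a sanity check, the value returned by `AIC()` can be reproduced from the log-likelihood with the usual definition AIC = -2 log L + 2k. The following is a minimal sketch, assuming the full model is refitted with the `data = usa` form; the name `fit_full` is introduced here for illustration only. For `lm` fits, k counts the regression coefficients plus the residual variance.

```r
data(state)
usa <- data.frame(state.x77)

# Full model, written with the data argument (fit_full is an illustrative name)
fit_full <- lm(Life.Exp ~ ., data = usa)

ll <- logLik(fit_full)       # maximized log-likelihood
k  <- attr(ll, "df")         # estimated parameters: coefficients + residual variance
aic_manual <- -2 * as.numeric(ll) + 2 * k

print(c(manual = aic_manual, builtin = AIC(fit_full)))
```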

mod02 <- lm(usa$Life.Exp ~ 
  usa$Population + usa$Murder + 
  usa$HS.Grad + usa$Frost)
# if mod01 had been fitted as lm(Life.Exp ~ ., data = usa), equivalently:
# update(mod01, . ~ . - Area - Illiteracy - Income)

summary(mod02)

## 
## Call:
## lm(formula = usa$Life.Exp ~ usa$Population + usa$Murder + usa$HS.Grad + 
##     usa$Frost)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.103e+01  9.529e-01  74.542  < 2e-16 ***
## usa$Population  5.014e-05  2.512e-05   1.996  0.05201 .  
## usa$Murder     -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## usa$HS.Grad     4.658e-02  1.483e-02   3.142  0.00297 ** 
## usa$Frost      -5.943e-03  2.421e-03  -2.455  0.01802 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

AIC(mod02)

## [1] 115.7326

mod03 <- lm(usa$Life.Exp ~ 
  usa$Murder + 
  usa$HS.Grad + usa$Frost)

summary(mod03)

## 
## Call:
## lm(formula = usa$Life.Exp ~ usa$Murder + usa$HS.Grad + usa$Frost)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5015 -0.5391  0.1014  0.5921  1.2268 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 71.036379   0.983262  72.246  < 2e-16 ***
## usa$Murder  -0.283065   0.036731  -7.706 8.04e-10 ***
## usa$HS.Grad  0.049949   0.015201   3.286  0.00195 ** 
## usa$Frost   -0.006912   0.002447  -2.824  0.00699 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7427 on 46 degrees of freedom
## Multiple R-squared:  0.7127, Adjusted R-squared:  0.6939 
## F-statistic: 38.03 on 3 and 46 DF,  p-value: 1.634e-12

AIC(mod03)

## [1] 117.9743
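Of the three values above (121.71, 115.73 and 117.97), the second is the lowest, so the four-variable model is preferred under this criterion. `AIC()` also accepts several fitted models at once, which makes the comparison easier to read. A small sketch, assuming the models are refitted with the `data = usa` form; the names `m_full`, `m_four` and `m_three` are illustrative.

```r
data(state)
usa <- data.frame(state.x77)

m_full  <- lm(Life.Exp ~ ., data = usa)
m_four  <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = usa)
m_three <- lm(Life.Exp ~ Murder + HS.Grad + Frost, data = usa)

aic_table <- AIC(m_full, m_four, m_three)        # data frame with columns df and AIC
aic_table <- aic_table[order(aic_table$AIC), ]   # lowest AIC first
print(aic_table)
```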

The `stepwise` procedure

mod0X <- step(
  lm(Life.Exp ~ ., data = usa), 
  direction = "backward")

## Start:  AIC=-22.18
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
##     Frost + Area
## 
##              Df Sum of Sq    RSS     AIC
## - Area        1    0.0011 23.298 -24.182
## - Income      1    0.0044 23.302 -24.175
## - Illiteracy  1    0.0047 23.302 -24.174
## <none>                    23.297 -22.185
## - Population  1    1.7472 25.044 -20.569
## - Frost       1    1.8466 25.144 -20.371
## - HS.Grad     1    2.4413 25.738 -19.202
## - Murder      1   23.1411 46.438  10.305
## 
## Step:  AIC=-24.18
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad + 
##     Frost
## 
##              Df Sum of Sq    RSS     AIC
## - Illiteracy  1    0.0038 23.302 -26.174
## - Income      1    0.0059 23.304 -26.170
## <none>                    23.298 -24.182
## - Population  1    1.7599 25.058 -22.541
## - Frost       1    2.0488 25.347 -21.968
## - HS.Grad     1    2.9804 26.279 -20.163
## - Murder      1   26.2721 49.570  11.569
## 
## Step:  AIC=-26.17
## Life.Exp ~ Population + Income + Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## - Income      1     0.006 23.308 -28.161
## <none>                    23.302 -26.174
## - Population  1     1.887 25.189 -24.280
## - Frost       1     3.037 26.339 -22.048
## - HS.Grad     1     3.495 26.797 -21.187
## - Murder      1    34.739 58.041  17.456
## 
## Step:  AIC=-28.16
## Life.Exp ~ Population + Murder + HS.Grad + Frost
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    23.308 -28.161
## - Population  1     2.064 25.372 -25.920
## - Frost       1     3.122 26.430 -23.877
## - HS.Grad     1     5.112 28.420 -20.246
## - Murder      1    34.816 58.124  15.528
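The AIC values printed by `step()` (for example -28.16 for the final model) do not match the value returned by `AIC()` for the same model (115.73). This is expected: `step()` relies on `extractAIC()`, which for linear models computes n log(RSS/n) + 2 edf and drops an additive constant. The constant is identical for all models fitted to the same data, so the ranking of models is unchanged. The sketch below checks this relationship; the object names are illustrative.

```r
data(state)
usa <- data.frame(state.x77)
fit <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = usa)

n <- nrow(usa)
aic_step <- extractAIC(fit)[2]   # value reported by step()
aic_full <- AIC(fit)             # value reported by AIC()

# Constant dropped by extractAIC() for lm fits: n * (1 + log(2 * pi)),
# plus 2 because AIC() also counts the residual variance as a parameter
offset <- n * (1 + log(2 * pi)) + 2
print(c(step = aic_step, full = aic_full, difference = aic_full - aic_step))
```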

summary(mod0X)

## 
## Call:
## lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost, 
##     data = usa)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.47095 -0.53464 -0.03701  0.57621  1.50683 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.103e+01  9.529e-01  74.542  < 2e-16 ***
## Population   5.014e-05  2.512e-05   1.996  0.05201 .  
## Murder      -3.001e-01  3.661e-02  -8.199 1.77e-10 ***
## HS.Grad      4.658e-02  1.483e-02   3.142  0.00297 ** 
## Frost       -5.943e-03  2.421e-03  -2.455  0.01802 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7197 on 45 degrees of freedom
## Multiple R-squared:  0.736,  Adjusted R-squared:  0.7126 
## F-statistic: 31.37 on 4 and 45 DF,  p-value: 1.696e-12

Predictions (within the range of observed values)

rbind(
  min = sapply(usa[c('Population', 'Murder', 'HS.Grad', 'Frost')], min),
  max = sapply(usa[c('Population', 'Murder', 'HS.Grad', 'Frost')], max))

##     Population Murder HS.Grad Frost
## min        365    1.4    37.8     0
## max      21198   15.1    67.3   188

predict(mod0X, 
  data.frame(
    Murder = 8,  
    HS.Grad = 55, 
    Frost = 80, 
    Population = 4250), 
  interval = "prediction",
  level = 0.95)

##        fit      lwr      upr
## 1 70.92559 69.45497 72.39621
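Because the model is only supported by the data inside the observed ranges, it can help to check new values before calling `predict()`. A minimal sketch of such a check follows; `new_state` and `inside` are names introduced here for illustration.

```r
data(state)
usa <- data.frame(state.x77)
predictors <- c("Population", "Murder", "HS.Grad", "Frost")

# Same new observation as in the predict() call
new_state <- data.frame(Murder = 8, HS.Grad = 55, Frost = 80, Population = 4250)

# TRUE for each predictor whose new value lies within the observed range
inside <- sapply(predictors, function(v) {
  new_state[[v]] >= min(usa[[v]]) & new_state[[v]] <= max(usa[[v]])
})
print(inside)
```

If any entry is `FALSE`, the prediction is an extrapolation and should be treated with caution.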

Residual analysis

par(mfrow = c(2, 2))
plot(mod0X)

Dependence between explanatory variables (collinearity)

library(car)

## Warning: package 'car' was built under R version 3.6.1

## Loading required package: carData

vif(mod0X)

## Population     Murder    HS.Grad      Frost 
##   1.189835   1.727844   1.356791   1.498077

All variance inflation factors (VIF) are below 10, a commonly used rule-of-thumb threshold: OK!
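The VIF of an explanatory variable can be understood as 1 / (1 - R²), where R² comes from regressing that variable on all the other explanatory variables. The sketch below computes it with base R only; for a model with numeric predictors it should agree with `car::vif()`, although that equivalence is stated here as an assumption rather than checked against the package. `vif_manual` is an illustrative name.

```r
data(state)
usa <- data.frame(state.x77)
predictors <- c("Population", "Murder", "HS.Grad", "Frost")

vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  # Regress predictor v on the remaining predictors
  r2 <- summary(lm(reformulate(others, response = v), data = usa))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 6))
```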

Multiple linear regression

Model

\(y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p + \epsilon\)

\(y = \beta_0 + \sum_{j=1}^{p}\beta_jx_j+\epsilon\)

\(y\): continuous quantitative response variable (the variable to be explained).

\(x_j\): continuous quantitative explanatory variables, \(j = 1, \dots, p\).

\(\epsilon\): random error, normally distributed with mean zero and standard deviation \(\sigma\).
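To make the model concrete, one can simulate data from it and check that `lm()` recovers the coefficients. This is a small sketch under stated assumptions: the "true" coefficients, the sample size and the seed are arbitrary values chosen for illustration.

```r
set.seed(2019)                            # arbitrary seed, for reproducibility
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
beta  <- c(b0 = 2, b1 = 0.5, b2 = -1.5)   # hypothetical true coefficients
sigma <- 1                                # standard deviation of the error

# y = beta0 + beta1 * x1 + beta2 * x2 + epsilon, with epsilon ~ N(0, sigma^2)
y <- beta[["b0"]] + beta[["b1"]] * x1 + beta[["b2"]] * x2 +
  rnorm(n, mean = 0, sd = sigma)

fit_sim <- lm(y ~ x1 + x2)
print(cbind(true = beta, estimate = coef(fit_sim)))
```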

With R

the lm() function

results with summary()

diagnostic plots to check the assumptions with plot(lm())

statistical tests such as shapiro.test() for normality of the residuals
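The first three items are illustrated in the `state` example; the normality test is not, so here is a minimal sketch on the four-variable model. The object names are illustrative.

```r
data(state)
usa <- data.frame(state.x77)
fit <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = usa)

# Shapiro-Wilk test; the null hypothesis is that the residuals are normally distributed
sw <- shapiro.test(residuals(fit))
print(sw)
```

A small p-value would cast doubt on the normality assumption; a large one does not prove normality, so the diagnostic plots remain useful.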

Multiple linear regression and variable selection

Collinearity

If one (or more) explanatory variable is a linear combination, exact or approximate, of one (or more) other explanatory variables, we speak of collinearity. In that case, the individual coefficients associated with each variable cannot be interpreted reliably…
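The extreme case is easy to reproduce: when one variable is an exact linear combination of others, `lm()` cannot estimate its coefficient and reports `NA`. A minimal sketch with simulated data; all names and values are hypothetical.

```r
set.seed(1)
x1 <- rnorm(50)
x2 <- rnorm(50)
x3 <- 2 * x1 - x2                 # exact linear combination of x1 and x2
y  <- 1 + x1 + x2 + rnorm(50)

fit_col <- lm(y ~ x1 + x2 + x3)
print(coef(fit_col))              # the coefficient of x3 is NA (aliased)
```

With approximate rather than exact collinearity, all coefficients are estimated but their standard errors are inflated, which is the situation explored in the following example.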

Detecting collinearity (data)

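set.seed(12345678)
# common underlying signal shared by all explanatory variables
xx <- sample(1:100, size = 100, replace = TRUE) 
# x1 to x10: xx plus Gaussian noise, with sd 10 (closely tied to xx)
# or sd 100 (loosely tied to xx), chosen at random for each column
df <- data.frame(
  sapply(
    1:10, 
    function(i){
      if(sample(c(TRUE, FALSE), size = 1)){
        xx + rnorm(100, sd = 10)
      }else{
        xx + rnorm(100, sd = 100)
      }
    })
)
colnames(df) <- paste0("x", 1:10)
# y depends on x1 to x4 only (the rnorm(mean = df$x1) terms add
# further dependence on x1), plus a large noise term
df$y <- 0.5 + 
  0.5*df$x1 + rnorm(100, mean = df$x1, sd = 10) + 
  0.8*df$x2 + rnorm(100, mean = df$x1, sd = 10) + 
  0.3*df$x3 + rnorm(100, mean = df$x1, sd = 10) + 
  0.5*df$x4 + rnorm(100, mean = df$x1, sd = 10) + 
  rnorm(100, mean = 0, sd = 150)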

Detecting collinearity (data)

head(df)

##         x1         x2       x3        x4        x5       x6        x7
## 1 83.72785  144.56185 96.55340 98.715658 106.62437 74.93125 250.73830
## 2 73.51153  -72.77192 59.94101 56.605437  56.29103 73.70063  34.74389
## 3 14.58394  132.80571 46.19169  1.696772  25.08175 23.46682 -16.02037
## 4 19.90037 -172.43941 17.39241  4.257799  15.54725 26.53419 -61.08011
## 5 80.21417   61.82470 81.74531 73.002784  93.81761 75.59769  42.15923
## 6 98.18177   82.38916 88.62505 99.777754 102.91077 90.94779 223.50261
##           x8       x9       x10         y
## 1  89.470036 96.53135 110.98130 516.62259
## 2  71.042496 55.74935  71.26404 573.29268
## 3  29.774042 29.20769  23.43577  51.15482
## 4   2.489837 10.24249  37.63406  62.85970
## 5  66.895511 59.13655  65.04702 517.36852
## 6 109.079371 89.16625  81.85138 348.68400

Detecting collinearity (data)

We would expect a significant effect of x1, x2, x3 and x4.

Detecting collinearity (data)

modL <- lm(y~., data = df)
summary(modL)

## 
## Call:
## lm(formula = y ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -272.74  -79.81    8.44   91.27  322.87 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.7007    31.2962  -0.246    0.806    
## x1            6.1036     1.3631   4.478 2.23e-05 ***
## x2            0.8654     0.1534   5.641 1.97e-07 ***
## x3           -0.5174     1.4008  -0.369    0.713    
## x4           -0.2623     1.3167  -0.199    0.843    
## x5            1.2176     1.4404   0.845    0.400    
## x6            0.3952     1.2640   0.313    0.755    
## x7           -0.2172     0.1586  -1.369    0.174    
## x8           -0.8464     1.2975  -0.652    0.516    
## x9           -1.2123     1.3064  -0.928    0.356    
## x10           1.1036     1.3043   0.846    0.400    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 138.3 on 89 degrees of freedom
## Multiple R-squared:  0.7085, Adjusted R-squared:  0.6757 
## F-statistic: 21.63 on 10 and 89 DF,  p-value: < 2.2e-16

Detecting collinearity (1)

Using the correlation between explanatory variables:

library("corrplot")

## corrplot 0.84 loaded

library("RColorBrewer")

Detecting collinearity (1)

corrplot.mixed(cor(df[,1:10]), upper.col = rev(brewer.pal(10, "Spectral")))

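Detecting collinearity (2)

By comparing the squared pairwise correlations with the model R² (Klein's rule): a squared correlation between two explanatory variables that exceeds the R² of the model suggests problematic collinearity (the diagonal is trivially TRUE).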

cor(df[,1:10])^2 > summary(modL)$r.squared

##        x1    x2    x3    x4    x5    x6    x7    x8    x9   x10
## x1   TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## x2  FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## x3   TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## x4   TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## x5   TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## x6   TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## x7  FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## x8   TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## x9   TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## x10  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
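The logical matrix is hard to read because it is symmetric and its diagonal is always `TRUE`. The sketch below lists each flagged pair once. It uses a small simulated data set rather than the one above; `sim`, `base` and the coefficients are hypothetical and chosen only for illustration.

```r
set.seed(1)
base <- runif(100, min = 1, max = 100)   # shared signal
sim <- data.frame(
  a = base + rnorm(100, sd = 10),        # a and b both track base
  b = base + rnorm(100, sd = 10),
  c = rnorm(100, mean = 50, sd = 30)     # independent of base
)
sim$y <- 2 * sim$a + sim$c + rnorm(100, sd = 150)

r2_model <- summary(lm(y ~ ., data = sim))$r.squared
r2_pairs <- cor(sim[, c("a", "b", "c")])^2

# Keep each pair once and drop the diagonal, which is always 1
flagged <- r2_pairs > r2_model & upper.tri(r2_pairs)
idx <- which(flagged, arr.ind = TRUE)
print(data.frame(
  var1 = rownames(r2_pairs)[idx[, "row"]],
  var2 = colnames(r2_pairs)[idx[, "col"]]
))
```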

Detecting collinearity (3)

The sign of the correlation with y and the sign of the estimated coefficient should be the same:

cbind(coef = modL$coef[2:11], corr = cor(df)[1:10,"y"])

##           coef      corr
## x1   6.1035976 0.7563715
## x2   0.8653900 0.4587661
## x3  -0.5173635 0.6405377
## x4  -0.2622878 0.6915929
## x5   1.2175900 0.7170081
## x6   0.3951513 0.6545490
## x7  -0.2171809 0.1973219
## x8  -0.8463590 0.6470706
## x9  -1.2122646 0.6401045
## x10  1.1036109 0.6688523
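In the table above, x3, x4, x7, x8 and x9 have a negative coefficient but a positive correlation with y. This comparison can be automated; the following is a sketch in which `sign_mismatch()` is a hypothetical helper (not part of any package), demonstrated on a small simulated data set with made-up values. It assumes all predictors are numeric.

```r
# sign_mismatch() is a hypothetical helper written for this example
sign_mismatch <- function(data, response) {
  predictors <- setdiff(names(data), response)
  fit <- lm(as.formula(paste(response, "~ .")), data = data)
  b <- coef(fit)[predictors]     # estimated slopes
  r <- sapply(predictors, function(v) cor(data[[v]], data[[response]]))
  predictors[which(sign(b) != sign(r))]   # predictors whose signs disagree
}

set.seed(3)
base <- runif(100, min = 1, max = 100)
sim <- data.frame(
  a = base + rnorm(100, sd = 5),
  b = base + rnorm(100, sd = 5),
  c = base + rnorm(100, sd = 5)
)
sim$y <- 3 * sim$a + rnorm(100, sd = 100)

mismatched <- sign_mismatch(sim, "y")
print(mismatched)
```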

Detecting collinearity (4)

The recommended approach: use variance inflation factors (VIF):

library("car")
vif(modL)

##        x1        x2        x3        x4        x5        x6        x7 
##  8.199499  1.141505  8.941736  7.834481 10.060477  7.069245  1.268323 
##        x8        x9       x10 
##  7.982893  7.522818  8.680919

Detecting collinearity (4)

excludeVar <- names(vif(modL)[vif(modL) > 10])
print(excludeVar)

## [1] "x5"

Detecting collinearity (4)

myForm <- paste0("y ~ . -", paste0(excludeVar, collapse = "-"), "")
modL2 <- lm(eval(parse(text = myForm)), data = df)
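Building the formula as a string and passing it through `eval(parse())` works here, but it fails when `excludeVar` is empty, because `"y ~ . -"` is not valid R. One alternative, sketched below, is `reformulate()`, which builds the formula from the names of the variables to keep. The vectors are stand-ins for the objects used above, and `myForm2` is an illustrative name.

```r
# Stand-ins for the objects used in the example above
predictors <- paste0("x", 1:10)
excludeVar <- c("x5")

kept <- setdiff(predictors, excludeVar)        # variables that stay in the model
myForm2 <- reformulate(kept, response = "y")   # y ~ x1 + x2 + ... without x5
print(myForm2)
```

The resulting formula can be passed directly to `lm(myForm2, data = df)`.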

Detecting collinearity (4)

summary(modL)

## 
## Call:
## lm(formula = y ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -272.74  -79.81    8.44   91.27  322.87 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.7007    31.2962  -0.246    0.806    
## x1            6.1036     1.3631   4.478 2.23e-05 ***
## x2            0.8654     0.1534   5.641 1.97e-07 ***
## x3           -0.5174     1.4008  -0.369    0.713    
## x4           -0.2623     1.3167  -0.199    0.843    
## x5            1.2176     1.4404   0.845    0.400    
## x6            0.3952     1.2640   0.313    0.755    
## x7           -0.2172     0.1586  -1.369    0.174    
## x8           -0.8464     1.2975  -0.652    0.516    
## x9           -1.2123     1.3064  -0.928    0.356    
## x10           1.1036     1.3043   0.846    0.400    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 138.3 on 89 degrees of freedom
## Multiple R-squared:  0.7085, Adjusted R-squared:  0.6757 
## F-statistic: 21.63 on 10 and 89 DF,  p-value: < 2.2e-16

Detecting collinearity (4)

summary(modL2)

## 
## Call:
## lm(formula = eval(parse(text = myForm)), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -282.66  -85.24    2.58   96.18  328.93 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.2173    31.1952  -0.295    0.768    
## x1            6.4607     1.2939   4.993 2.89e-06 ***
## x2            0.8863     0.1511   5.864 7.36e-08 ***
## x3           -0.3094     1.3768  -0.225    0.823    
## x4           -0.2101     1.3132  -0.160    0.873    
## x6            0.4844     1.2575   0.385    0.701    
## x7           -0.2230     0.1582  -1.410    0.162    
## x8           -0.6990     1.2837  -0.545    0.587    
## x9           -1.1133     1.2990  -0.857    0.394    
## x10           1.3888     1.2579   1.104    0.273    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 138.1 on 90 degrees of freedom
## Multiple R-squared:  0.7061, Adjusted R-squared:  0.6767 
## F-statistic: 24.03 on 9 and 90 DF,  p-value: < 2.2e-16

modL3 <- step(
  lm(y ~ ., data = df), 
  direction = "backward"
)

## Start:  AIC=996.19
## y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + x10
## 
##        Df Sum of Sq     RSS     AIC
## - x4    1       759 1702332  994.23
## - x6    1      1869 1703442  994.30
## - x3    1      2608 1704181  994.34
## - x8    1      8135 1709708  994.67
## - x5    1     13662 1715235  994.99
## - x10   1     13687 1715261  994.99
## - x9    1     16464 1718037  995.15
## <none>              1701573  996.19
## - x7    1     35848 1737421  996.27
## - x1    1    383342 2084915 1014.51
## - x2    1    608462 2310035 1024.76
## 
## Step:  AIC=994.23
## y ~ x1 + x2 + x3 + x5 + x6 + x7 + x8 + x9 + x10
## 
##        Df Sum of Sq     RSS     AIC
## - x6    1      1503 1703835  992.32
## - x3    1      2229 1704561  992.36
## - x8    1     10108 1712440  992.83
## - x10   1     12940 1715271  992.99
## - x5    1     13391 1715722  993.02
## - x9    1     18322 1720654  993.30
## <none>              1702332  994.23
## - x7    1     35486 1737818  994.30
## - x1    1    391910 2094242 1012.95
## - x2    1    632154 2334486 1023.81
## 
## Step:  AIC=992.32
## y ~ x1 + x2 + x3 + x5 + x7 + x8 + x9 + x10
## 
##        Df Sum of Sq     RSS     AIC
## - x3    1      1466 1705301  990.41
## - x8    1      9534 1713369  990.88
## - x5    1     14375 1718210  991.16
## - x10   1     14718 1718553  991.18
## - x9    1     16902 1720737  991.31
## <none>              1703835  992.32
## - x7    1     37133 1740967  992.48
## - x1    1    398083 2101917 1011.32
## - x2    1    633046 2336881 1021.92
## 
## Step:  AIC=990.41
## y ~ x1 + x2 + x5 + x7 + x8 + x9 + x10
## 
##        Df Sum of Sq     RSS     AIC
## - x5    1     13110 1718411  989.17
## - x8    1     13244 1718544  989.18
## - x10   1     13446 1718746  989.19
## - x9    1     18990 1724290  989.52
## <none>              1705301  990.41
## - x7    1     38246 1743547  990.63
## - x1    1    397467 2102768 1009.36
## - x2    1    648828 2354128 1020.65
## 
## Step:  AIC=989.17
## y ~ x1 + x2 + x7 + x8 + x9 + x10
## 
##        Df Sum of Sq     RSS     AIC
## - x8    1      8144 1726554  987.65
## - x9    1     14615 1733025  988.02
## - x10   1     28196 1746607  988.80
## <none>              1718411  989.17
## - x7    1     40252 1758663  989.49
## - x1    1    522716 2241126 1013.73
## - x2    1    696575 2414985 1021.20
## 
## Step:  AIC=987.65
## y ~ x1 + x2 + x7 + x9 + x10
## 
##        Df Sum of Sq     RSS     AIC
## - x10   1     21977 1748531  986.91
## - x9    1     26034 1752588  987.14
## <none>              1726554  987.65
## - x7    1     39721 1766276  987.92
## - x1    1    533407 2259962 1012.57
## - x2    1    692286 2418841 1019.36
## 
## Step:  AIC=986.91
## y ~ x1 + x2 + x7 + x9
## 
##        Df Sum of Sq     RSS     AIC
## - x9    1     11623 1760154  985.57
## <none>              1748531  986.91
## - x7    1     35382 1783913  986.91
## - x2    1    675351 2423882 1017.57
## - x1    1    870633 2619164 1025.32
## 
## Step:  AIC=985.57
## y ~ x1 + x2 + x7
## 
##        Df Sum of Sq     RSS     AIC
## <none>              1760154  985.57
## - x7    1     46802 1806956  986.20
## - x2    1    671087 2431241 1015.87
## - x1    1   2637633 4397787 1075.14

summary(modL3)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x7, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -300.23  -80.57   -1.75   94.07  323.94 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -14.6462    28.4221  -0.515    0.608    
## x1            6.1569     0.5133  11.994  < 2e-16 ***
## x2            0.8624     0.1426   6.050 2.79e-08 ***
## x7           -0.2396     0.1500  -1.598    0.113    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 135.4 on 96 degrees of freedom
## Multiple R-squared:  0.6984, Adjusted R-squared:  0.689 
## F-statistic: 74.11 on 3 and 96 DF,  p-value: < 2.2e-16

When variables are correlated, you need to think about a method for selecting your data (cf. the example of spatial data in ecology).

Tutorial (TD) 7: predicting temperature
