Regression with Categorical Dependent Variables

Montserrat Guillén

This page presents regression models where the dependent variable is categorical, whereas covariates can either be categorical or continuous, using data from the book Predictive Modeling Applications in Actuarial Science. A methodological overview can be found in:

• Frees, E.W. (2010). Regression modeling with actuarial and financial applications. Cambridge University Press. New York.
• Greene, W.H. (2011). Econometric analysis. 7th edition. Prentice Hall. New York.

DATA DESCRIPTION

• FullCoverage.csv: 4000 policyholders of motor insurance
• VehOwned.csv: 2067 customers of an insurance firm who are offered unordered options
• VehChoicePrice.csv: a similar situation to the previous example

LOGISTIC REGRESSION MODEL

In the logistic regression model the dependent variable is binary. This is the most popular model for binary dependent variables, and it is highly recommended as the starting point before more sophisticated categorical modeling is carried out. The dependent variable yi can take only two possible outcomes. We assume yi follows a Bernoulli distribution with probability πi. The probability of the 'event' response, πi, depends on a set of individual characteristics xi, i = 1, ..., n, where n is the number of observations.

- Specification


The logistic regression model specifies that:
$$Pr(y_i=1|\mathbf{x}_i)=\pi_i=\frac{1}{1+exp(-\mathbf{x}_i^\prime \beta)}=\frac{exp(\mathbf{x}_i^\prime \beta)}{1+exp(\mathbf{x}_i^\prime \beta)}$$
and the inverse of this relationship, called the link function in generalized linear models, expresses x'i β as a function of πi as:
$$\mathbf{x}_i^\prime \beta=\ln \left(\frac{\pi_i}{1-\pi_i}\right)= logit(\pi_i).$$
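The two expressions above can be checked numerically. A minimal NumPy sketch, using a hypothetical design matrix X and coefficient vector beta, computes the event probabilities and verifies that the logit link recovers the linear predictor:

```python
import numpy as np

def logistic_prob(X, beta):
    """Pr(y_i = 1 | x_i) = 1 / (1 + exp(-x_i' beta))."""
    return 1.0 / (1.0 + np.exp(-X @ beta))

def logit(p):
    """Link function: logit(pi) = ln(pi / (1 - pi))."""
    return np.log(p / (1.0 - p))

# hypothetical data: an intercept plus one continuous covariate, n = 3
X = np.array([[1.0, -2.0],
              [1.0,  0.0],
              [1.0,  2.0]])
beta = np.array([0.5, 1.0])

pi = logistic_prob(X, beta)  # event probabilities, one per observation
eta = logit(pi)              # the link recovers the linear predictor x_i' beta
```

Applying the link function to the fitted probabilities returns exactly x'i β, which is the inverse relationship stated above.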

- Parameter estimation


- Script

- Results

MODELS FOR ORDINAL CATEGORICAL DEPENDENT VARIABLES

In ordinal categorical dependent variable models the responses have a natural ordering. This is quite common in insurance; an example is modeling possible claiming outcomes as ordered categorical responses.

- Specification


Let us assume that an ordinal categorical variable has J possible choices. The most straightforward model in this case is the cumulative logit model, also known as the ordered logit. Let us denote by yi the choice of individual i for a categorical ordered response variable. Let us assume that πij is the probability that i chooses j, j=1,...,J. So, πi1 + πi2 + ... + πiJ = 1. Response probabilities depend on the individual predictors; again, we assume they depend on x'i β. It is important to bear in mind that the ordered logit model concentrates on the cumulative probabilities Pr(yi ≤ j | xi ). Then,

$$logit(Pr(y_i\le j|\mathbf{x}_i ))=\alpha_j+\mathbf{x}_i^\prime \beta.$$ Note that, $$logit(Pr(y_i\le j| \mathbf{x}_i))=\ln \left(\frac{Pr(y_i\le j|\mathbf{x}_i )} {1-Pr(y_i\le j|\mathbf{x}_i )}\right).$$
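A minimal NumPy sketch of how the cumulative logit model turns the cut points αj and the linear predictor x'i β into category probabilities; the values of x, alphas, and beta below are hypothetical:

```python
import numpy as np

def ordered_logit_probs(x, alphas, beta):
    """Category probabilities for one observation under the cumulative logit
    model logit(Pr(y <= j | x)) = alpha_j + x' beta, j = 1, ..., J-1."""
    eta = np.asarray(alphas) + x @ beta        # one linear predictor per cut point
    cum = 1.0 / (1.0 + np.exp(-eta))           # cumulative probabilities Pr(y <= j)
    cum = np.concatenate(([0.0], cum, [1.0]))  # pad with Pr(y <= 0)=0, Pr(y <= J)=1
    return np.diff(cum)                        # Pr(y = j) as successive differences

# hypothetical example: J = 3 ordered outcomes, two covariates
x = np.array([1.0, 0.5])
alphas = np.array([-1.0, 1.0])  # cut points, must be increasing
beta = np.array([0.3, -0.2])

p = ordered_logit_probs(x, alphas, beta)  # one probability per category
```

Because the cut points are increasing, the cumulative probabilities are increasing in j, so the successive differences are valid category probabilities that sum to one.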

- Script

- Results

MODELS FOR NOMINAL CATEGORICAL DEPENDENT VARIABLES

Let us start with the generalized logit model. It is often called the multinomial logit model, although the multinomial logit model, presented later, is slightly more general; the generalized logit model is so widely used that the two names are frequently interchanged. It is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables that measure individual risk factors.

- Specification


Let us denote by yi the choice of individual i for a nominal categorical response variable. Let us assume that πij is the probability that i chooses j, j=1,...,J and i=1,...,n. So, πi1 + πi2 + ... + πiJ = 1. The probabilities depend on the individual predictors, and we assume these choice probabilities depend on x'i βj.
We assume that the J-th alternative is the baseline choice. Then, the generalized logit regression model is specified as:

\begin{eqnarray} Pr(y_i=j|\mathbf{x}_i)&=&\frac{exp(\mathbf{x}_i^\prime \beta_j)}{1+\sum_{k=1}^{J-1}exp(\mathbf{x}_i^\prime \beta_k)}, \,\, j=1,...,J-1 \nonumber\\ Pr(y_i=J|\mathbf{x}_i)&=&\frac{1}{1+\sum_{k=1}^{J-1}exp(\mathbf{x}_i^\prime \beta_k)}.\nonumber \end{eqnarray}

So there are J-1 vectors of parameters to be estimated, namely β1, β2, ... , βJ-1. We set vector βJ to zero for identification purposes.
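The specification above can be sketched directly in NumPy: with βJ fixed at zero, the two-branch formula is just a softmax over the J linear predictors. The values of x and the βj below are hypothetical:

```python
import numpy as np

def generalized_logit_probs(x, betas):
    """Pr(y = j | x) for j = 1, ..., J under the generalized logit model,
    with beta_J fixed at zero as the baseline choice."""
    eta = np.array([x @ b for b in betas] + [0.0])  # J-1 predictors plus baseline 0
    expd = np.exp(eta - eta.max())                  # numerically stable softmax
    return expd / expd.sum()

# hypothetical example: J = 3 choices, two covariates; beta_3 = 0 is implicit
x = np.array([1.0, 2.0])
betas = [np.array([0.2, -0.1]), np.array([-0.5, 0.4])]

p = generalized_logit_probs(x, betas)  # choice probabilities summing to one
```

Subtracting the maximum linear predictor before exponentiating leaves the probabilities unchanged but avoids overflow, a standard softmax trick.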

- Script

- Results

MULTINOMIAL LOGISTIC REGRESSION MODEL

In the multinomial logistic regression model the individual characteristics can differ across choices. This model is also known as the conditional logit model, because the characteristics xij depend on the alternative j.

- Specification


The multinomial logistic regression model specification is,
\begin{eqnarray} Pr(y_i=j|\mathbf{x}_{ij})=\frac{exp(\mathbf{x}_{ij}^\prime \beta)}{\sum_{k=1}^{J}exp(\mathbf{x}_{ik}^\prime \beta)}, \,\, j=1,...,J. \end{eqnarray}
There is only one vector of unknown parameters β, but we have J vectors of known characteristics xi1, xi2, ..., xiJ.
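The key difference from the generalized logit is visible in a short NumPy sketch: here a single β is shared across alternatives, while the characteristics vary by alternative. The attribute matrix X_alt and coefficient vector beta below are hypothetical:

```python
import numpy as np

def conditional_logit_probs(X_alt, beta):
    """Pr(y = j | x_i1, ..., x_iJ) = exp(x_ij' beta) / sum_k exp(x_ik' beta).
    X_alt holds one row of alternative-specific characteristics per choice."""
    eta = X_alt @ beta              # one linear predictor per alternative
    expd = np.exp(eta - eta.max())  # stabilize before exponentiating
    return expd / expd.sum()

# hypothetical example: J = 3 alternatives described by two attributes each;
# a single shared beta weights those attributes
X_alt = np.array([[1.0,  0.2],
                  [0.5, -0.3],
                  [0.0,  1.0]])
beta = np.array([0.8, -0.4])

p = conditional_logit_probs(X_alt, beta)  # one probability per alternative
```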

- Script

- Results

REFERENCES

[1] Allison, P. D. (1999). Logistic regression using the SAS system: theory and application. Cary, NC: SAS Institute.

[2] Cameron, A. C. and Trivedi, P. K. (2005) Microeconometrics: methods and applications. Cambridge University Press. New York.

[3] Frees, E. W. (2010). Regression modeling with actuarial and financial applications. Cambridge University Press. New York.

[4] Greene, W. H. (2011). Econometric analysis. 7th edition. Prentice Hall. New York.

[5] Hilbe, J. M. (2009). Logistic regression models. CRC Press, Chapman & Hall. Boca Raton, FL.

[6] Hosmer, D. W. and Lemeshow, S. (2000). Applied logistic regression. John Wiley & Sons, New York, 2nd edition.

[7] Long, J. S. (1997). Regression models of categorical and limited dependent variables. Sage, Thousand Oaks, CA.

• Universitat de Barcelona - Last Updated: 05-23-2014