Regression with Categorical Dependent Variables

Montserrat Guillén

This page presents regression models where the dependent variable is categorical, whereas covariates can either be categorical or continuous, using data from the book Predictive Modeling Applications in Actuarial Science. A methodological overview can be found in:

Frees, E.W. (2010). Regression modeling with actuarial and financial applications. Cambridge University Press. New York.
Greene, W.H. (2011). Econometric analysis. 7th edition. Prentice Hall. New York.

DATA DESCRIPTION

Name	Content description
FullCoverage.csv	4000 motor insurance policyholders
VehOwned.csv	2067 customers of an insurance firm who are offered unordered options
VehChoicePrice.csv	A setting similar to the previous example (VehOwned.csv)

Download FullCoverage.csv data set here.	Download VehOwned.csv data set here.
Download TypeofCoverage.csv data set here.	Download VehChoicePrice.csv data set here.

Download all files in EXCEL format here (.zip).

Download all files in CSV format, ready for R scripts, here (.zip).

LOGISTIC REGRESSION MODEL

In the logistic regression model the dependent variable is binary. This is the most popular model for binary dependent variables, and it is highly recommended to start from this setting before carrying out more sophisticated categorical modeling. The dependent variable y_i can take only two possible values. We assume y_i follows a Bernoulli distribution with probability π_i. The probability of the 'event' response, π_i, depends on a set of individual characteristics x_i, i = 1, ..., n, where n is the number of observations.

- Specification

The logistic regression model specifies that:
\begin{equation} Pr(y_i=1|\mathbf{x}_i)=\pi_i=\frac{1}{1+exp(-\mathbf{x}_i^\prime \beta)}=\frac{exp(\mathbf{x}_i^\prime \beta)}{1+exp(\mathbf{x}_i^\prime \beta)} \end{equation}
and the inverse of this relationship, called the link function in generalized linear models, expresses x'_i β as a function of π_i as:
\begin{equation} \mathbf{x}_i^\prime \beta=\ln \left(\frac{\pi_i}{1-\pi_i}\right)= logit(\pi_i). \end{equation}
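The two equations above are inverses of each other, which can be checked numerically. The following is a minimal Python sketch, not the R script distributed with this page; the covariate vector and the coefficients are hypothetical values chosen only for illustration.

```python
import math


def logistic(eta):
    """Map a linear predictor eta = x'beta to a probability pi in (0, 1)."""
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def logit(p):
    """Link function: log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def linear_predictor(x, beta):
    """Inner product x'beta."""
    return sum(xk * bk for xk, bk in zip(x, beta))


# Hypothetical individual: the leading 1.0 stands for the intercept term.
x_i = [1.0, 0.5, 2.0]
beta = [-1.0, 0.8, 0.3]   # assumed coefficients, not estimates from the data

eta_i = linear_predictor(x_i, beta)
pi_i = logistic(eta_i)

print("x'beta      =", eta_i)
print("pi_i        =", pi_i)
print("logit(pi_i) =", logit(pi_i))  # recovers x'beta up to rounding
```

Applying logit to the fitted probability returns the linear predictor, which is why the coefficients are read on the log-odds scale.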

- Parameter estimation

- Script

Download the R script here.

- Results

Download the results here.

MODELS FOR ORDINAL CATEGORICAL DEPENDENT VARIABLES

In ordinal categorical dependent variable models the responses have a natural ordering. This is quite common in insurance; for example, possible claiming outcomes can be modeled as ordered categorical responses.

- Specification

Let us assume that an ordinal categorical variable has J possible choices. The most straightforward model in this case is the cumulative logit model, also known as the ordered logit model. Let us denote by y_i the choice of individual i for a categorical ordered response variable, and let π_ij be the probability that i chooses j, j=1,...,J, so that π_i1 + π_i2 + ... + π_iJ = 1. As before, the response probabilities depend on the individual predictors through x'_i β. It is important to bear in mind that the ordered logit model is specified in terms of the cumulative probabilities Pr(y_i ≤ j | x_i ). Then,

\begin{equation} logit(Pr(y_i\le j|\mathbf{x}_i ))=\alpha_j+\mathbf{x}_i^\prime \beta, \,\, j=1,...,J-1. \end{equation}
Note that
\begin{equation} logit(Pr(y_i\le j| \mathbf{x}_i))=\ln \left(\frac{Pr(y_i\le j|\mathbf{x}_i )} {1-Pr(y_i\le j|\mathbf{x}_i )}\right). \end{equation}
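Because the model is written for cumulative probabilities, the individual category probabilities π_ij are obtained by differencing. The sketch below illustrates this in Python under stated assumptions: the intercepts α_1 < ... < α_{J-1} are increasing, Pr(y_i ≤ J | x_i) = 1, and all numerical values are hypothetical. Note that some software parameterizes the linear predictor with the opposite sign, so coefficient signs should be checked against the convention in use.

```python
import math


def logistic(eta):
    """Inverse of the logit link, written to avoid overflow in exp()."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)


def ordered_logit_probs(alphas, xb):
    """Category probabilities pi_i1..pi_iJ for the cumulative logit model.

    alphas: increasing intercepts alpha_1..alpha_{J-1} (assumption)
    xb:     the linear predictor x_i'beta (a single number)
    """
    # Cumulative probabilities Pr(y_i <= j), j = 1..J-1, then Pr(y_i <= J) = 1.
    cumulative = [logistic(a + xb) for a in alphas] + [1.0]
    # pi_i1 = Pr(y_i <= 1); pi_ij = Pr(y_i <= j) - Pr(y_i <= j-1) for j > 1.
    probs = [cumulative[0]]
    for j in range(1, len(cumulative)):
        probs.append(cumulative[j] - cumulative[j - 1])
    return probs


# Hypothetical example with J = 4 ordered categories.
alphas = [-1.0, 0.5, 2.0]   # assumed intercepts
xb = 0.3                    # assumed value of x_i'beta

probs = ordered_logit_probs(alphas, xb)
for j, p in enumerate(probs, start=1):
    print(f"pi_i{j} = {p:.4f}")
print("sum =", sum(probs))
```

The same β shifts every cumulative logit by the same amount, so a single coefficient vector describes the effect of the covariates across all J-1 cut points.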

- Script

Download the R script here.

- Results

Download the results here.

MODELS FOR NOMINAL CATEGORICAL DEPENDENT VARIABLES

Let us start with the generalized logit model. It is often called the multinomial logit model, although that name strictly belongs to the slightly more general model presented later; the generalized logit model is so widely used that the name has carried over to it. It is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables that measure individual risk factors.

- Specification

Let us denote by y_i the choice of individual i for a nominal categorical response variable. Let us assume that π_ij is the probability that i chooses j, j=1,...,J and i=1,...,n, so that π_i1 + π_i2 + ... + π_iJ = 1. These choice probabilities depend on the individual predictors through x'_i β_j.
We assume that the J-th alternative is the baseline choice. Then, the generalized logit regression model is specified as:

\begin{eqnarray} Pr(y_i=j|\mathbf{x}_i)&=&\frac{exp(\mathbf{x}_i^\prime \beta_j)}{1+\sum_{k=1}^{J-1}exp(\mathbf{x}_i^\prime \beta_k)}, \,\, j=1,...,J-1 \nonumber\\ Pr(y_i=J|\mathbf{x}_i)&=&\frac{1}{1+\sum_{k=1}^{J-1}exp(\mathbf{x}_i^\prime \beta_k)}.\nonumber \end{eqnarray}

So there are J-1 vectors of parameters to be estimated, namely β_1, β_2, ..., β_{J-1}. We set vector β_J to zero for identification purposes.
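Setting β_J to zero means the baseline alternative contributes exp(0) = 1 to the denominator, which is where the leading 1 in the formulas comes from. A minimal Python sketch of the probability calculation follows; the covariates and the coefficient vectors are hypothetical, and the maximum linear predictor is subtracted before exponentiating purely for numerical stability (it cancels in the ratio).

```python
import math


def dot(x, b):
    """Inner product x'b."""
    return sum(xk * bk for xk, bk in zip(x, b))


def generalized_logit_probs(x, betas):
    """Choice probabilities for alternatives 1..J with J as the baseline.

    x:     covariate vector of one individual (same for every alternative)
    betas: list of J-1 coefficient vectors beta_1..beta_{J-1};
           beta_J is fixed at zero for identification.
    """
    etas = [dot(x, b) for b in betas] + [0.0]   # x'beta_J = 0 for the baseline
    shift = max(etas)                           # stability only; cancels below
    weights = [math.exp(e - shift) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example with J = 3 alternatives and two covariates plus intercept.
x_i = [1.0, 0.2, 1.5]
betas = [
    [0.4, -1.0, 0.3],    # assumed beta_1
    [-0.2, 0.5, 0.1],    # assumed beta_2
]

probs = generalized_logit_probs(x_i, betas)
for j, p in enumerate(probs, start=1):
    print(f"Pr(y_i = {j}) = {p:.4f}")
```

Each β_j is interpreted relative to the baseline: x'_i β_j is the log of the ratio Pr(y_i = j) / Pr(y_i = J).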

- Script

Download the R script here.

- Results

Download the results here.

MULTINOMIAL LOGISTIC REGRESSION MODEL

In the multinomial logistic regression model the characteristics can differ across choices. This model is also known as the conditional logit model because the characteristics entering the model depend on the alternative being considered.

- Specification

The multinomial logistic regression model is specified as:
\begin{eqnarray} Pr(y_i=j|\mathbf{x}_{ij})=\frac{exp(\mathbf{x}_{ij}^\prime \beta)}{\sum_{k=1}^{J}exp(\mathbf{x}_{ik}^\prime \beta)}, \,\, j=1,...,J. \end{eqnarray}
There is only one vector of unknown parameters β, but we have J vectors of known characteristics x_i1, x_i2, ..., x_iJ.
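To make the contrast with the generalized logit model concrete, here is a minimal Python sketch in which a single β is shared by all alternatives and only the characteristics x_ij change. The two attributes (a price and a quality score) and all numbers are hypothetical and are not taken from VehChoicePrice.csv.

```python
import math


def dot(x, b):
    """Inner product x'b."""
    return sum(xk * bk for xk, bk in zip(x, b))


def conditional_logit_probs(x_by_alternative, beta):
    """Choice probabilities when characteristics vary by alternative.

    x_by_alternative: list of J vectors x_i1..x_iJ for one individual
    beta:             the single coefficient vector shared by all alternatives
    """
    etas = [dot(x_ij, beta) for x_ij in x_by_alternative]
    shift = max(etas)                      # stability only; cancels in the ratio
    weights = [math.exp(e - shift) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical individual facing J = 3 alternatives, each described by
# (price in thousands, quality score). Values are illustrative only.
x_i = [
    [12.0, 3.0],
    [15.0, 4.0],
    [20.0, 5.0],
]
beta = [-0.3, 0.9]   # assumed: higher price lowers, higher quality raises appeal

probs = conditional_logit_probs(x_i, beta)
for j, p in enumerate(probs, start=1):
    print(f"Pr(y_i = {j}) = {p:.4f}")
```

Because β does not vary with j, a characteristic that takes the same value for every alternative cancels out of the ratio and its coefficient cannot be identified in this specification.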

- Script

Download the R script here.

- Results

Download the results here.

REFERENCES

[1] Allison, P. D. (1999). Logistic regression using the SAS system: theory and application. Cary, NC: SAS Institute.

[2] Cameron, A. C. and Trivedi, P. K. (2005). Microeconometrics: methods and applications. Cambridge University Press. New York.

[3] Frees, E. W. (2010). Regression modeling with actuarial and financial applications. Cambridge University Press. New York.

[4] Greene, W. H. (2011). Econometric analysis. 7th edition. Prentice Hall. New York.

[5] Hilbe, J. M. (2009). Logistic regression models. CRC Press, Chapman & Hall. Boca Raton, FL.

[6] Hosmer, D. W. and Lemeshow, S. (2000). Applied logistic regression. 2nd edition. John Wiley & Sons. New York.

[7] Long, J. S. (1997). Regression models of categorical and limited dependent variables. Sage, Thousand Oaks, CA.

Universitat de Barcelona - Last Updated: 05-23-2014