# Regression with Categorical Dependent Variables

## Montserrat Guillén

This section accompanies *Predictive Modeling Applications in Actuarial Science*. A methodological overview can be found in:

- Frees, E.W. (2010). Regression modeling with actuarial and financial applications. Cambridge University Press. New York.
- Greene, W.H. (2011). Econometric analysis. 7th edition. Prentice Hall. New York.

**DATA DESCRIPTION**

| Name | Content description |
|---|---|
| FullCoverage.csv | 4000 policyholders of motor insurance |
| VehOwned.csv | 2067 customers of an insurance firm who are offered unordered options |
| VehChoicePrice.csv | A situation similar to the previous example |

**LOGISTIC REGRESSION MODEL**

In the logistic regression model the dependent variable is binary. This model is the most popular for binary dependent variables. It is highly recommended to start from this model setting before more sophisticated categorical modeling is carried out. The dependent variable $y_i$ can only take two possible outcomes. We assume $y_i$ follows a Bernoulli distribution with probability $\pi_i$. The probability of the 'event' response $\pi_i$ depends on a set of individual characteristics $\mathbf{x}_i$, $i = 1, \dots, n$, where $n$ is the number of observations.

**- Specification**

The *logistic regression model* specifies that:

\begin{equation} Pr(y_i=1|\mathbf{x}_i)=\pi_i=\frac{1}{1+exp(-\mathbf{x}_i^\prime \beta)}=\frac{exp(\mathbf{x}_i^\prime \beta)}{1+exp(\mathbf{x}_i^\prime \beta)} \end{equation}

and the inverse of this relationship, called the *link function* in generalized linear models, expresses $\mathbf{x}_i^\prime \beta$ as a function of $\pi_i$ as:

\begin{equation} \mathbf{x}_i^\prime \beta=\ln \left(\frac{\pi_i}{1-\pi_i}\right)= logit(\pi_i). \end{equation}
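As a minimal sketch in Python (the function names and the value of $\mathbf{x}_i^\prime \beta$ below are our own, for illustration only), the specification and its link function can be checked numerically:

```python
import math

def logistic_prob(xb):
    """pi_i = 1 / (1 + exp(-x_i' beta)): probability of the 'event'."""
    return 1.0 / (1.0 + math.exp(-xb))

def logit(p):
    """Link function: ln(p / (1 - p)), the inverse of logistic_prob."""
    return math.log(p / (1.0 - p))

# The two expressions in the specification agree, and logit inverts the model:
xb = 0.7  # illustrative value of x_i' beta
pi = logistic_prob(xb)
assert abs(pi - math.exp(xb) / (1.0 + math.exp(xb))) < 1e-12
assert abs(logit(pi) - xb) < 1e-12
```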

**- Parameter estimation**


The maximum likelihood method is the procedure used for parameter estimation and standard error estimation in logistic regression models. It relies on the responses from the observed units being independent, so that the likelihood function can be obtained as the product of the likelihoods of the single observations.

The likelihood of a single observation in a logistic regression model is simply the probability of the event that is observed, so it can be expressed as,

\begin{equation} Pr(y_i=1|\mathbf{x}_i)^{y_i}(1-Pr(y_i=1|\mathbf{x}_i))^{(1-{y_i})}=\pi_i^{y_i}(1-\pi_i)^{(1-y_i)}. \end{equation}

The log-likelihood function for a data set is a function of the vector of unknown parameters $\beta$. The observed values of $y_i$ and $\mathbf{x}_i$ are given by the information in the data set on $n$ individuals. Then, when observations are independent, we can write the log-likelihood function as, \begin{eqnarray*} L(\beta)&=&\ln \left[\prod_{i=1}^n Pr(y_i=1|\mathbf{x}_i)^{y_i}(1-Pr(y_i=1|\mathbf{x}_i))^{(1-{y_i})} \right]= \\ & =&\sum_{i=1}^n \left[ {y_i}\ln Pr(y_i=1|\mathbf{x}_i) + (1-y_i)\ln(1-Pr(y_i=1|\mathbf{x}_i)) \right]. \end{eqnarray*}

Conventional statistical software will maximize the log-likelihood function and provide the parameter estimates and their standard errors. Unless covariates are perfectly correlated, the parameter estimates exist and are unique.
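The maximization can be sketched in plain Python by gradient ascent on $L(\beta)$. The toy data and step size below are illustrative assumptions; in practice one would rely on standard software rather than hand-rolled optimization.

```python
import math

def logistic_prob(xb):
    return 1.0 / (1.0 + math.exp(-xb))

def log_likelihood(beta, X, y):
    """L(beta) = sum_i [ y_i ln pi_i + (1 - y_i) ln(1 - pi_i) ]."""
    ll = 0.0
    for xi, yi in zip(X, y):
        pi = logistic_prob(sum(b * x for b, x in zip(beta, xi)))
        ll += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return ll

def fit_logistic(X, y, lr=0.1, iters=2000):
    """Maximize L(beta) by gradient ascent; the gradient is sum_i (y_i - pi_i) x_i."""
    beta = [0.0] * len(X[0])
    for _ in range(iters):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            pi = logistic_prob(sum(b * x for b, x in zip(beta, xi)))
            for k, x in enumerate(xi):
                grad[k] += (yi - pi) * x
        beta = [b + lr * g for b, g in zip(beta, grad)]
    return beta

# Toy data (hypothetical): intercept plus one covariate, overlapping classes.
X = [[1.0, 0.5], [1.0, 2.0], [1.0, 1.5], [1.0, 1.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]
beta_hat = fit_logistic(X, y)
assert log_likelihood(beta_hat, X, y) > log_likelihood([0.0, 0.0], X, y)
```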

**- Script**

- Results

**MODELS FOR ORDINAL CATEGORICAL DEPENDENT
VARIABLES**

In ordinal categorical dependent variable models the responses have a natural ordering. This is quite common in insurance; an example is to model possible claiming outcomes as ordered categorical responses.

**- Specification**

Let us assume that an ordinal categorical variable has $J$ possible choices. The most straightforward model in this case is the *cumulative logit model*, also known as the *ordered logit*. Let us denote by $y_i$ the choice of individual $i$ for a categorical ordered response variable. Let us assume that $\pi_{ij}$ is the probability that $i$ chooses $j$, $j=1,\dots,J$. So, $\pi_{i1}+\pi_{i2}+\cdots+\pi_{iJ}=1$. Response probabilities depend on the individual predictors; again, we assume they depend on $\mathbf{x}_i^\prime \beta$. It is important to bear in mind that the ordered logit model concentrates on the cumulative probabilities $Pr(y_i \le j|\mathbf{x}_i)$. Then,

\begin{equation} logit(Pr(y_i\le j|\mathbf{x}_i ))=\alpha_j+\mathbf{x}_i^\prime \beta. \end{equation}

Note that,

\begin{equation} logit(Pr(y_i\le j| \mathbf{x}_i))=\ln \left(\frac{Pr(y_i\le j|\mathbf{x}_i )} {1-Pr(y_i\le j|\mathbf{x}_i )}\right). \end{equation}
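A sketch of how the category probabilities follow from the cumulative specification, assuming illustrative increasing thresholds $\alpha_1 < \alpha_2$ and a given value of $\mathbf{x}_i^\prime \beta$ (the function names are our own):

```python
import math

def cum_prob(alpha_j, xb):
    """Pr(y_i <= j | x_i) = 1 / (1 + exp(-(alpha_j + x_i' beta)))."""
    return 1.0 / (1.0 + math.exp(-(alpha_j + xb)))

def ordered_logit_probs(alphas, xb):
    """Category probabilities as successive differences of cumulative probabilities.
    alphas must be increasing; the last cumulative probability is 1 by construction."""
    cums = [cum_prob(a, xb) for a in alphas] + [1.0]
    probs, prev = [], 0.0
    for c in cums:
        probs.append(c - prev)
        prev = c
    return probs

# J = 3 categories need J - 1 = 2 thresholds (illustrative values).
probs = ordered_logit_probs([-1.0, 1.0], xb=0.3)
assert abs(sum(probs) - 1.0) < 1e-12
assert all(p > 0 for p in probs)
```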

**- Script**

- Results

**MODELS FOR NOMINAL CATEGORICAL DEPENDENT
VARIABLES**

Let us start with the *generalized logit model*. This model is often called the *multinomial logit model*, although the multinomial logit model, which we present later, is a bit more general; the generalized logit model is so widely used that the name is often applied to it. It is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables that measure individual risk factors.

**- Specification**

Let us denote by $y_i$ the choice of individual $i$ for a nominal categorical response variable. Let us assume that $\pi_{ij}$ is the probability that $i$ chooses $j$, $j=1,\dots,J$ and $i=1,\dots,n$. So, $\pi_{i1}+\pi_{i2}+\cdots+\pi_{iJ}=1$. The probabilities depend on the individual predictors, and we assume these choice probabilities depend on $\mathbf{x}_i^\prime \beta_j$.

We assume that the $J$-th alternative is the *baseline* choice. Then, the generalized logit regression model is specified as:

\begin{eqnarray} Pr(y_i=j|\mathbf{x}_i)&=&\frac{exp(\mathbf{x}_i^\prime \beta_j)}{1+\sum_{k=1}^{J-1}exp(\mathbf{x}_i^\prime \beta_k)}, \,\, j=1,...,J-1 \nonumber\\ Pr(y_i=J|\mathbf{x}_i)&=&\frac{1}{1+\sum_{k=1}^{J-1}exp(\mathbf{x}_i^\prime \beta_k)}.\nonumber \end{eqnarray}

So there are $J-1$ vectors of parameters to be estimated, namely $\beta_1, \beta_2, \dots, \beta_{J-1}$. We set vector $\beta_J$ to zero for identification purposes.

**- Script**

- Results
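The two formulas of the generalized logit specification can be sketched as follows; the parameter values are hypothetical, and $\beta_J$ is implicitly zero as the baseline:

```python
import math

def generalized_logit_probs(x, betas):
    """Pr(y_i = j | x_i) for j = 1..J, with beta_J fixed at zero (baseline).
    betas holds the J - 1 free parameter vectors beta_1, ..., beta_{J-1}."""
    scores = [sum(b * xk for b, xk in zip(beta_j, x)) for beta_j in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]  # alternatives 1..J-1
    probs.append(1.0 / denom)                      # baseline alternative J
    return probs

# Illustrative setup: J = 3 alternatives, two covariates per individual.
x_i = [1.0, 0.4]                    # intercept and one risk factor (hypothetical)
betas = [[0.2, -0.5], [-0.3, 0.8]]  # beta_1 and beta_2 (hypothetical)
probs = generalized_logit_probs(x_i, betas)
assert abs(sum(probs) - 1.0) < 1e-12
```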

**MULTINOMIAL LOGISTIC REGRESSION MODEL**

In the *multinomial logistic regression model* the individual characteristics can differ across choices. This model is also known as the *conditional logit model*, because the characteristics are specific to each alternative.

**- Specification**

The multinomial logistic regression model specification is,

\begin{eqnarray} Pr(y_i=j|\mathbf{x}_{ij})=\frac{exp(\mathbf{x}_{ij}^\prime \beta)}{\sum_{k=1}^{J}exp(\mathbf{x}_{ik}^\prime \beta)}, \,\, j=1,...,J. \end{eqnarray}

There is only one vector of unknown parameters $\beta$, but we have $J$ vectors of known characteristics $\mathbf{x}_{i1}, \mathbf{x}_{i2}, \dots, \mathbf{x}_{iJ}$.
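A sketch of the conditional logit probabilities, with hypothetical alternative-specific characteristics (say, price and quality of each option) and a single shared taste vector $\beta$:

```python
import math

def conditional_logit_probs(x_alts, beta):
    """Pr(y_i = j) = exp(x_ij' beta) / sum_k exp(x_ik' beta): one shared beta,
    but alternative-specific characteristics x_i1, ..., x_iJ."""
    scores = [sum(b * xk for b, xk in zip(beta, x_j)) for x_j in x_alts]
    denom = sum(math.exp(s) for s in scores)
    return [math.exp(s) / denom for s in scores]

# Hypothetical example: three alternatives described by (price, quality).
x_alts = [[10.0, 0.8], [12.0, 0.9], [8.0, 0.5]]
beta = [-0.2, 3.0]  # illustrative taste parameters
probs = conditional_logit_probs(x_alts, beta)
assert abs(sum(probs) - 1.0) < 1e-12
```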

**- Script**

- Results

**REFERENCES**

[1] Allison, P. D. (1999). Logistic regression using
the SAS system: theory and application. Cary, NC: SAS
Institute.

[2] Cameron, A. C. and Trivedi, P. K. (2005)
Microeconometrics: methods and applications. Cambridge
University Press. New York.

[3] Frees, E. W. (2010). Regression modeling with
actuarial and financial applications. Cambridge
University Press. New York.

[4] Greene, W. H. (2011). Econometric analysis. 7th
edition. Prentice Hall. New York.

[5] Hilbe, J. M. (2009). Logistic regression models.
CRC Press, Chapman & Hall. Boca Raton, FL.

[6] Hosmer, D. W. and Lemeshow, S. (2000). Applied
logistic regression. John Wiley & Sons, New York,
2nd edition.

[7] Long, J. S. (1997). Regression models for
categorical and limited dependent variables. Sage,
Thousand Oaks, CA.