Regression models for survival data with applications in

insurance - R programming

Montserrat Guillén, Jens P. Nielsen & Anna M. Pérez-Marín

In this page, we describe how to use alternative regression models for survival data with applications in insurance. Firstly, we consider a standard statistical method for surival analysis, namely, the proportional hazards regression model (Cox, 1972), with strong statistical assumptions. Alternatively, we also consider a modified version of the classical proportional hazards regression model where covariates are allowed to have a time-varying effect on the survival time (Martinussen and Scheike, 2006). These models can be used in the insurance context in many practical situations, including insurance marketing. In the following paper, they are applied for estimating customer lifetime duration in insurance:


DATA DESCRIPTION

Name Content description
datat.csv                                                                                 
400 observations. Time: time-to-event variable. Status: censoring indicator (0 censored, 1 event). x1, x2, x3, x4: covariates




PROPORTIONAL HAZARDS REGRESSION MODEL

We define our dependent variable, T as the time-to-event variable. For example, if we are interested in analyzing customer lifetime duration, then T is the time elapsed since the policy is underwritten by a policy holder until it is canceled. This survival time may be right-censored because the event of interest (policy cancellation) may not have occurred before the end of the observation period.


The classical proportional hazards regression model, also called Cox model (Cox, 1972) specifies the intensity λi(t) of the random variable T for individual i as:

\begin{equation*} \lambda _{i}(t)=Y_{i}(t)\alpha _{0}(t)\exp (Z_{i}^{\prime }\beta ) \end{equation*}

where Yi(t) is an indicator equal to 1 if the individual i is at risk at t and zero otherwise, α0(t) is the so-called baseline hazard function which only depends on time, Zi=(Zi1, . . ., Zip)' is the p-dimensional vector of covariates and β is the p-dimensional vector of unknown regression parameters. The regression parameters β are estimated by maximizing the Cox's partiallikelihood function (Cox, 1972, 1975):

\begin{equation*} L(\beta )=\prod_{t}\prod_{i}\left( \frac{\exp \left( Z_{i}^{\prime }\beta \right) }{S_{0}(t,\beta )}\right) ^{\Delta N_{i}(t)} \end{equation*}

where \begin{equation*} S_{0}(t,\beta )=\sum_{i=1}^{n}Y_{i}(t)\exp (Z_{i}^{\prime }\beta ) \end{equation*} and Ni(t) is a counting process that is equal to zero until the moment when the event occurs for individual i. More details about regression models for survival data can be found in Martinussen and Scheike (2006).


The proportional hazards regression model can be easily estimated in R by using the coxph function of the survival R package. An example is shown in the following script.




COX-TYPE REGRESSION MODEL WITH TIME-VARYING COEFFICIENTS

The following extension of the Cox model (Murphy and Sen, 1991 and Grambsch and Therneau, 1994) introduce time-dependent parameters and also considers potential time-dependent covariates: \begin{equation} \lambda _{i}(t)=Y_{i}(t)\exp \left[ Z_{i}^{\prime }(t)\beta (t) \right] \hspace{0.80cm} (1) \end{equation}

where Zi(t)={Zi1(t), . . ., Zip(t}' is the p-dimensional vector of covariates that may evolve with time t and β(t) denotes the associated regression time-dependent coefficients. When the first covariate is constant and equal to one Zi1(t) = 1 then (1) contains a baseline α0(t) that is parametrized as exp[β1(t)]. It is important to remark the difference between time-varying covariates and time-varying parameters. Time-varying covariates are some individual characteristic which may vary over time (such as age) while time-varying parameters reflect the fact that some covariate (which can be either constant or not) may have a changing effect on the response over time. For instance, the effect of filing a claim on the risk of cancellation can be higher/lower shortly after the accident occurs than later on. This more general model therefore includes both formulation improvements (more details can be found in Martinussen and Scheike, 2006).


Statistical tests based on the asymptotic analysis of the cumulative regression functions of model (1) can be used to decide whether these effects are in fact time-varying or not (see Martinussen and Scheike, 2006). The model estimation requires maximizing of the corresponding log-likelihood function (see more details in Martinussen, Scheike and Skovgaard, 2002 and Scheike and Martinussen, 2004), that is given by \begin{equation*} \log L(\beta )=\sum_{i}\left\{ \int_{0}^{\infty }Z_{i}^{\prime }(t)\beta (t)dN_{i}(t)-\int_{0}^{\infty }Y_{i}(t)\exp \left[ Z_{i}^{\prime }(t)\beta (t)\right] dt\right\}. \end{equation*}


The estimation process is complex and starts with an initial estimate of the parameters and then the Newton-Raphson algorithm is applied, but smoothing is needed to stabilize the solution that finally is given in terms of cumulative regression coefficients. The model can be estimated by using the function timecox of the timereg R package (Scheike, 2014). An example is showed in the following script.




REFERENCES

[1] Cox, D.R. (1972) Regression models and life tables, Journal of the Royal Statistical Society B, 34, 187-220.

[2] Cox, D.R. (1975) Partial likelihood, Biometrika, 62, 269-276.

[3] Grambsch, P. M. and Therneu, T.M. (1994) Proportional hazards tests and diagnostics based on weighted residuals, Biometrika, 81, 515-526.

[4] Martinussen, T. and Scheike, T. H. (2006) Dynamic regression models for survival data, Springer Verlag, New-York.

[5] Murphy, S. A. and Sen, P. K. (1991) Time-dependent coefficients in a Cox-type regression model, Stochastic Processes and Their Applications, 39, 153-180.

[6] Scheike, T. H. (2014) Timereg R package. Available at http://cran.r-project.org/web/packages/timereg/index.html.



excel HUBc BKC
  • Universitat de Barcelona - Last Updated: 05-29-2014