Estimation of semiparametric regression model with right-censored high-dimensional data
Abstract
In this paper, we consider the estimation problem for the semiparametric regression model with censored data in which the number of explanatory variables p in the linear part is much larger than the sample size n, often denoted as p ≫ n. The purpose of this paper is to study the effects of covariates on a response variable that is right-censored by a random censoring variable with an unknown probability distribution. High variance and over-fitting are major concerns in such problems. Ordinary estimation methods cannot be applied directly to censored, high-dimensional data, so a transformation is required. Here, a synthetic data transformation is used to handle the censoring. We then apply LASSO-type double-penalized least squares (DPLS) to achieve sparsity in the parametric component and use smoothing splines to estimate the nonparametric component. A Monte Carlo simulation study is performed to assess the performance of the estimators and to analyse the effects of different censoring levels. A real high-dimensional censored data example is used to illustrate the ideas discussed herein.
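As a rough illustration of what a synthetic data transformation can look like, the sketch below rescales each uncensored observation by the estimated probability of not being censored before that time and sets each censored observation to zero. The abstract does not say which transformation is used. This sketch assumes an inverse-probability-weighting form, with the censoring distribution estimated by a Kaplan-Meier estimator that treats censored observations as the events. The function names `km_censoring_survival` and `synthetic_response` are hypothetical and chosen only for this example.

```python
import numpy as np


def km_censoring_survival(z, delta):
    """Kaplan-Meier estimate of the censoring survival function 1 - G(t),
    evaluated at each observed time z_i.

    z     : observed times, z_i = min(y_i, c_i)
    delta : 1 if y_i was observed (uncensored), 0 if censored
    Assumption: ties are broken by sort order, which is a simplification.
    """
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(z)
    order = np.argsort(z, kind="stable")
    at_risk = n - np.arange(n)  # subjects still at risk at each ordered time
    # For the censoring distribution, the "events" are the censored points.
    factors = 1.0 - (1 - delta[order]) / at_risk
    surv_sorted = np.cumprod(factors)
    surv = np.empty(n)
    surv[order] = surv_sorted  # map back to the original ordering
    return surv


def synthetic_response(z, delta):
    """Synthetic response: delta_i * z_i / (1 - G_hat(z_i)).

    Censored observations (delta_i = 0) are mapped to zero.
    """
    z = np.asarray(z, dtype=float)
    delta = np.asarray(delta, dtype=int)
    surv = km_censoring_survival(z, delta)
    out = np.zeros_like(z)
    mask = delta == 1
    out[mask] = z[mask] / surv[mask]
    return out


# Toy data: the second observation is censored.
z_obs = np.array([1.0, 2.0, 3.0, 4.0])
d_obs = np.array([1, 0, 1, 1])
y_syn = synthetic_response(z_obs, d_obs)
```

The transformed values `y_syn` can then stand in for the censored response in a penalized least-squares fit. With no censoring, the transformation returns the observed times unchanged.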