
Journal of Machine Learning Research 8 (2007) 2701-2726

Submitted 6/07; Revised 7/07; Published 12/07

Stagewise Lasso

Peng Zhao                                          pengzhao@stat.berkeley.edu
Bin Yu                                             binyu@stat.berkeley.edu
Department of Statistics
University of California, Berkeley
367 Evans Hall
Berkeley, CA 94720-3860, USA

Editor: Saharon Rosset

Abstract

Many statistical machine learning algorithms minimize either an empirical loss function, as in AdaBoost, or a penalized empirical loss, as in Lasso or SVM. A single regularization tuning parameter controls the trade-off between fidelity to the data and generalizability, or equivalently between bias and variance. When this tuning parameter changes, a regularization "path" of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter that optimizes the prediction or interpretation performance. Algorithms such as homotopy-Lasso or LARS-Lasso and Forward Stagewise Fitting (FSF) (aka e-Boosting) are of great interest because the sparse models they produce are useful for interpretation in addition to prediction.

In this paper, we propose the BLasso algorithm that ties the FSF (e-Boosting) algorithm with the Lasso method, which minimizes the L1 penalized L2 loss. BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function (L1 penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. In the case of a finite number of base learners and a bounded Hessian of the loss function, the BLasso path is shown to converge to the Lasso path when the stepsize goes to zero. For cases with more base learners than the sample size and a sparse true model, our simulations indicate that the BLasso model estimates are sparser than those from FSF with comparable or slightly better prediction performance, and that the discrete stepsize of BLasso and FSF has an additional regularization effect in terms of prediction and sparsity. Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. Since the (Generalized) BLasso relies only on differences, not derivatives, we conclude that it provides a class of simple and easy-to-implement algorithms for tracing the regularization or solution paths of penalized minimization problems.

Keywords: backward step, boosting, convexity, Lasso, regularization path

1. Introduction

Many statistical machine learning algorithms minimize either an empirical loss function or a penalized empirical loss, in a regression or a classification setting where covariate or predictor variables are used to predict a response variable and i.i.d. samples of training data are available. For example, in classification, both AdaBoost (Schapire, 1990; Freund, 1995; Freund and Schapire, 1996) and SVM (Vapnik, 1995; Cristianini and Shawe-Taylor, 2002; Schölkopf and Smola, 2002) build linear classification functions of basis functions of the covariates (predictors). AdaBoost minimizes an


exponential loss function of the margin, while SVM minimizes a penalized hinge loss function of the margin. A single regularization tuning parameter, the number of iterations in AdaBoost and the smoothing parameter in SVM, controls the bias-and-variance trade-off, or whether the algorithm overfits or underfits the data. When this tuning parameter changes, a regularization "path" of solutions to the minimization problem is generated. These algorithms have been shown to achieve state-of-the-art prediction performance. A tuning parameter value, or equivalently a point on the path of solutions, is chosen to minimize the estimated prediction error over a proper test set or through cross-validation. Hence it is necessary to have algorithms that can generate the path of solutions in an efficient manner.

Path following algorithms have also been devised for other statistical penalized minimization problems, such as the information bottleneck (Tishby et al., 1999) and information distortion (Gedeon et al., 2002), where the loss and penalty functions are different from those discussed in this paper. In fact, following the solution paths of numerical problems is the focus of homotopy, a sub-area of numerical analysis. Interested readers are referred to the book by Allgower and Georg (1980) (http://www.math.colostate.edu/emeriti/georg/AllgGeorgHNA.pdf) and the references at http://www.math.colostate.edu/emeriti/georg/georg.publications.html.

Among all the machine learning methods, those that produce sparse models are of great interest because sparse models lend themselves more easily to interpretation and are therefore preferred in the sciences and social sciences. Lasso (Tibshirani, 1996) is such a method. It minimizes the L2 loss function with an L1 penalty on the parameters in a linear regression model. In signal processing it is called Basis Pursuit (Chen and Donoho, 1994). The L1 penalty leads to sparse solutions, that is, there are few predictors or basis functions with nonzero weights (among all possible choices). This statement has been proved asymptotically under various conditions by different authors (see, e.g., Knight and Fu, 2000; Osborne et al., 2000a,b; Donoho et al., 2006; Donoho, 2006; Rosset et al., 2004; Tropp, 2006; Zhao and Yu, 2006; Zou, 2006; Meinshausen and Yu, 2006; Zhang and Huang, 2006).

Sparsity has also been observed in the models generated by Forward Stagewise Fitting (FSF) or e-Boosting. FSF is a gradient descent procedure with more cautious steps than L2Boosting (i.e., the usual coordinatewise gradient descent method applied to the L2 loss) (Rosset et al., 2004; Hastie et al., 2001; Efron et al., 2004). Moreover, these papers also study the similarities between FSF (e-Boosting) and Lasso. This link between Lasso and e-Boosting or FSF is described more formally for the linear regression case through the LARS algorithm (Least Angle Regression; Efron et al., 2004). It is also shown in Efron et al. (2004) that in special cases (such as orthogonal designs) e-Boosting or FSF can approximate the Lasso path arbitrarily closely, but in general it is unclear what regularization criterion e-Boosting or FSF optimizes. As can be seen in our experiments (Figure 1 in Section 6.1), e-Boosting or FSF solutions can differ significantly from the Lasso solutions in the case of strongly correlated predictors, which are common in high-dimensional data problems. However, FSF is still used as an approximation to Lasso because it is often computationally prohibitive to solve Lasso with general loss functions for many regularization parameters through quadratic programming.

In this paper, we propose a new algorithm, BLasso, that connects Lasso with FSF or e-Boosting (the B in the name stands for this connection to boosting). BLasso generates approximately the Lasso path in general situations, for both regression and classification, for L1 penalized convex loss functions. The motivation for BLasso is the critical observation that FSF or e-Boosting works only in a forward fashion: it takes the steps that reduce the empirical loss the most, regardless of the impact on model complexity (or the L1 penalty). Hence it is not able to adjust earlier steps. Taking a coordinate (difference) descent viewpoint of the Lasso minimization with a fixed stepsize, we introduce an


innovative "backward" step. This step uses the same minimization rule as the forward step to define each fitting stage, but with an extra rule that forces the model complexity, or L1 penalty, to decrease. By combining backward and forward steps, BLasso is able to go back and forth to approximate the Lasso path correctly.

BLasso can be seen as a marriage between two families of successful methods. Computationally, BLasso works similarly to e-Boosting and FSF. It isolates the sub-optimization problem at each step from the whole process; that is, in the language of the Boosting literature, each base learner is learned separately. This way BLasso can deal with different loss functions and large classes of base learners, like trees, wavelets and splines, by fitting a base learner at each step and aggregating the base learners as the algorithm progresses. Moreover, the solution path of BLasso can be shown to converge to that of the Lasso, which uses explicit global L1 regularization, for cases with a finite number of base learners. In contrast, e-Boosting or FSF can be seen as local regularization in the sense that, at any iteration, FSF with a fixed small stepsize searches only over those models which are one small step away from the current one in all possible directions corresponding to the base learners or predictors (cf. Hastie et al., 2006, for a recent interpretation of the ε → 0 case).

In particular, we make three contributions in this paper via BLasso. First, by introducing the backward step, we modify e-Boosting or FSF to follow the Lasso path and consequently generate models that are sparser with equivalent or slightly better prediction performance in our simulations (with true sparse models and more predictors than the sample size). Secondly, by showing convergence of BLasso to the Lasso path, we further tighten the conceptual ties between e-Boosting or FSF and Lasso that have been considered in previous works. Finally, since BLasso can be generalized to deal with other convex penalties and does not use any derivatives of the loss function or penalty, we provide the Generalized BLasso algorithm as a simple and easy-to-implement off-the-shelf method for approximating the regularization path for a general loss function and a general convex penalty.

We would like to note that, for the original Lasso problem, that is, the least squares problem (L2 loss) with an L1 penalty, algorithms that give the entire Lasso path have been established, namely the homotopy method by Osborne et al. (2000b) and the LARS algorithm by Efron et al. (2004). For parametric least squares problems where the number of predictors is not large, these methods are very efficient, as their computational complexity is on the same order as a single Ordinary Least Squares regression. For other problems, such as classification, and for nonparametric setups like model fitting with trees, FSF or e-Boosting has been used as a tool for approximating the Lasso path (Rosset et al., 2004). For such problems, BLasso operates in a similar fashion to FSF or e-Boosting but, unlike FSF, BLasso can be shown to converge to the Lasso path quite generally when the stepsize goes to zero.

The rest of the paper is organized as follows. In Section 2.1, the gradient view of Boosting is provided and FSF (e-Boosting) is reviewed as a coordinatewise descent method with a fixed stepsize on the L2 loss. In Section 2.2, the Lasso empirical minimization problem is reviewed. Section 3 introduces BLasso, a coordinate descent algorithm with a fixed stepsize applied to the Lasso minimization problem. Section 4 discusses the backward step, gives the intuition behind BLasso, and explains why FSF is unable to give the Lasso path. Section 5 introduces the Generalized BLasso algorithm, which deals with general convex penalties. In Section 6, results of experiments with both simulated and real data are reported to demonstrate the attractiveness of BLasso. BLasso is shown to be a learning algorithm that gives sparse models and good prediction, and a simple plug-in method for approximating the regularization path for different convex loss functions and penalties. Moreover, we compare different choices of the stepsize and give evidence for the regularization


effect of using moderate stepsizes. Finally, Section 7 is a discussion and a summary. In particular, it comments on the computational complexity of BLasso, compares it with the algorithm in Rosset (2004), explores the possibility of BLasso for nonparametric learning problems, summarizes the paper, and points to future directions.

2. Boosting, Forward Stagewise Fitting and the Lasso

Boosting was originally proposed as an iterative fitting procedure that builds up a model sequentially using a weak or base learner and then carries out a weighted averaging (Schapire, 1990; Freund, 1995; Freund and Schapire, 1996). More recently, boosting has been interpreted as a gradient descent algorithm on an empirical loss function. FSF or e-Boosting can be viewed as a gradient descent with a fixed small stepsize at each stage, and it produces solutions that are often close to the Lasso solutions (path). We now give a brief gradient descent view of Boosting and of FSF (e-Boosting), followed by a review of the Lasso minimization problem.

2.1 Boosting and Forward Stagewise Fitting

Given data Z_i = (Y_i, X_i) (i = 1, ..., n), where the univariate Y can be continuous (regression problem) or discrete (classification problem), our task is to estimate the function F : R^d → R that minimizes an expected loss E[C(Y, F(X))], where C(·, ·) : R × R → R⁺. The most prominent examples of the loss function C(·, ·) include the exponential loss (AdaBoost), the logit loss and the L2 loss. The family of F(·) being considered is the set of ensembles of "base learners"

    D = { F : F(x) = ∑_j β_j h_j(x), x ∈ R^d, β_j ∈ R },

where the family of base learners can be very large or contain infinitely many members, for example, trees, wavelets and splines. Letting β = (β_1, ..., β_j, ...)^T, we can re-parameterize the problem using L(Z, β) := C(Y, F(X)), where the specification of F is hidden by L to simplify our notation. To find an estimate for β, we set up an empirical minimization problem:

    β̂ = arg min_β ∑_{i=1}^n L(Z_i; β).

Despite the fact that the empirical loss function is often convex in β, exact minimization is usually a formidable task for a moderately rich function family of base learners, and with such function families the exact minimization leads to overfitted models. Because the family of base learners is usually large, Boosting can be viewed as finding approximate solutions by applying functional gradient descent. This gradient descent view has been recognized and studied by various authors including Breiman (1998), Mason et al. (1999), Friedman et al. (2000), Friedman (2001) and Bühlmann and Yu (2003). Precisely, boosting is a progressive procedure that iteratively builds up the solution (and it is often stopped early to avoid overfitting):


    ( ĵ, ĝ ) = arg min_{j, g} ∑_{i=1}^n L(Z_i; β̂^t + g 1_j),    (1)

    β̂^{t+1} = β̂^t + ĝ 1_ĵ,    (2)

where 1_j is the jth standard basis vector with all 0's except for a 1 in the jth coordinate, and g ∈ R is the stepsize. In other words, Boosting favors the direction ĵ that reduces the empirical loss the most, and ĝ is found through a line search. The well-known AdaBoost, LogitBoost and L2Boosting can all be viewed as implementations of this strategy for different loss functions.

Forward Stagewise Fitting (FSF) (Efron et al., 2004) is a similar method for approximating the minimization problem described by (1) with some additional regularization. FSF has also been called e-Boosting, short for ε-Boosting, as in Rosset et al. (2004). Instead of optimizing the stepsize as in (2), FSF updates β̂^t by a fixed stepsize ε as in Friedman (2001). For general loss functions, FSF can be defined by removing the minimization over g in (1):

    ( ĵ, ŝ ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Z_i; β̂^t + s 1_j),    (3)

    β̂^{t+1} = β̂^t + ŝ 1_ĵ.    (4)
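As an illustration, the fixed-stepsize updates (3)-(4) are straightforward to implement for the squared loss. The sketch below is our own minimal rendering (function and variable names are ours, not the paper's):

```python
import numpy as np

def fsf(X, y, stepsize=0.01, n_steps=200):
    """Forward Stagewise Fitting (e-Boosting) for squared loss, per (3)-(4).

    At each iteration, try moving every coordinate by +/- stepsize and
    keep the move that most reduces the empirical loss.
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_steps):
        best_loss, best_j, best_s = np.inf, None, None
        for j in range(p):
            for s in (stepsize, -stepsize):
                trial = beta.copy()
                trial[j] += s
                loss = np.sum((y - X @ trial) ** 2)  # empirical L2 loss
                if loss < best_loss:
                    best_loss, best_j, best_s = loss, j, s
        beta[best_j] += best_s  # the forward step (4)
    return beta
```

Because ε is fixed rather than optimized, each iteration makes only a cautious move; this is the source of the local shrinkage discussed in the text.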

The description in (3) and (4) looks different from the FSF described in Efron et al. (2004), but the underlying mechanics of the algorithm remain unchanged (see Section 5). Initially all coefficients are zero. At each successive step, a basis function or predictor or coordinate is selected that reduces the empirical loss the most. Its corresponding coefficient β_ĵ is then incremented or decremented by a fixed amount ε, while all other coefficients β_j, j ≠ ĵ, are left unchanged. By taking small steps, FSF imposes some local regularization or shrinkage. A related approach can be found in Zhang (2003), where a relaxed gradient descent method is used.

After T < ∞ iterations, many of the coefficients estimated by FSF will be zero, namely those that have yet to be incremented. The others will tend to have absolute values smaller than the unregularized solutions. This shrinkage/sparsity property is reflected in the similarity between the solutions given by FSF and Lasso, which is reviewed next.

2.2 General Lasso

Let T(β) denote the L1 penalty of β = (β_1, ..., β_j, ...)^T, that is, T(β) = ‖β‖_1 = ∑_j |β_j|, and let Γ(β; λ) denote the Lasso (least absolute shrinkage and selection operator) loss function

    Γ(β; λ) = ∑_{i=1}^n L(Z_i; β) + λ T(β).

The general Lasso estimate β̂ = (β̂_1, ..., β̂_j, ...)^T is defined by

    β̂_λ = arg min_β Γ(β; λ).
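For the squared loss, the Lasso loss Γ(β; λ) can be written directly in code; the following sketch (names are ours) is included only to make the objective concrete:

```python
import numpy as np

def lasso_loss(X, y, beta, lam):
    """Gamma(beta; lambda): empirical squared-error loss plus
    lambda times the L1 penalty T(beta) = ||beta||_1."""
    return np.sum((y - X @ beta) ** 2) + lam * np.sum(np.abs(beta))
```

With lam = 0 this is the unregularized empirical loss; as lam grows, the minimizer is pulled toward the null model β = 0.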

The parameter λ ≥ 0 controls the amount of regularization applied to the estimate. Setting λ = 0 reduces the Lasso problem to minimizing the unregularized empirical loss. On the other hand, a


very large λ will completely shrink β̂ to 0, thus leading to the empty or null model. In general, moderate values of λ cause shrinkage of the solutions towards 0, and some coefficients may end up being exactly 0. This sparsity of Lasso solutions has been researched extensively in recent years (e.g., Osborne et al., 2000a,b; Efron et al., 2004; Donoho et al., 2006; Donoho, 2006; Tropp, 2006; Rosset et al., 2004; Meinshausen and Bühlmann, 2005; Candes and Tao, 2007; Zhao and Yu, 2006; Zou, 2006; Wainwright, 2006; Meinshausen and Yu, 2006; Zhang and Huang, 2006). Sparsity can also result from other penalties, as in, for example, Fan and Li (2001).

Computation of the solution to the Lasso problem for a fixed λ has been studied for special cases. Specifically, for least squares regression, it is a quadratic programming problem with linear inequality constraints; for the 1-norm SVM, it can be transformed into a linear programming problem. But to get a model that performs well on future data, we need to select an appropriate value for the tuning parameter λ. Very efficient algorithms have been proposed to give the entire regularization path for the squared loss function (the homotopy method by Osborne et al., 2000b, and similarly LARS by Efron et al., 2004) and for the SVM (the 1-norm SVM by Zhu et al., 2003). However, it remains open how to give the entire regularization path of the Lasso problem for a general convex loss function.

FSF exists as a compromise since, like Boosting, it is a nonparametric learning algorithm that works with different loss functions and large numbers of base learners (predictors), but it imposes only local regularization and does not converge to the Lasso path in general. As can be seen in Sec. 6.2, FSF also gives less sparse solutions compared to Lasso in our simulations. Next we propose the BLasso algorithm, which works in the same computationally efficient fashion as FSF. In contrast to FSF, BLasso converges to the Lasso path for general convex loss functions when the stepsize goes to 0. This relationship between Lasso and BLasso leads to sparser solutions for BLasso compared to FSF, with similar or slightly better prediction performance in our simulation set-up for different choices of the stepsize.

3. The BLasso Algorithm

We first describe the BLasso algorithm (Algorithm 1). The algorithm has two related input parameters, a stepsize ε and a tolerance level ξ. The tolerance level is needed only to avoid numerical instability when assessing changes of the empirical loss function, and should be set as small as possible while accommodating the numerical accuracy of the implementation. (ξ is set to 10^{-6} in the implementation used in this paper.) We will discuss the forward and backward steps in depth in the next section. Immediately, the following properties can be proved for BLasso (see the Appendix for the proof).

Lemma 1.
1. For any λ ≥ 0, if there exist j and s with |s| = ε such that Γ(s 1_j; λ) ≤ Γ(0; λ), then λ^0 ≥ λ.
2. For any t, we have Γ(β̂^{t+1}; λ^t) ≤ Γ(β̂^t; λ^t) − ξ.
3. For ξ ≥ 0 and any t such that λ^{t+1} < λ^t, we have Γ(β̂^t ± ε 1_j; λ^t) > Γ(β̂^t; λ^t) − ξ for every j, and ‖β̂^{t+1}‖_1 = ‖β̂^t‖_1 + ε.

Lemma 1 (1) guarantees that it is safe for BLasso to start with the initial λ^0, which is the largest λ that would allow an ε step away from 0 (i.e., larger λ's correspond to β̂_λ = 0). Lemma 1 (2) says


Algorithm 1 BLasso

Step 1 (initialization). Given data Z_i = (Y_i, X_i), i = 1, ..., n, a small stepsize constant ε > 0 and a small tolerance parameter ξ > 0, take an initial forward step

    ( ĵ, ŝ_ĵ ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Z_i; s 1_j),

    β̂^0 = ŝ_ĵ 1_ĵ.

Then calculate the initial regularization parameter

    λ^0 = (1/ε) ( ∑_{i=1}^n L(Z_i; 0) − ∑_{i=1}^n L(Z_i; β̂^0) ).

Set the active index set I_A^0 = {ĵ}. Set t = 0.

Step 2 (Backward and Forward steps). Find the "backward" step that leads to the minimal empirical loss:

    ĵ = arg min_{j ∈ I_A^t} ∑_{i=1}^n L(Z_i; β̂^t + s_j 1_j), where s_j = −sign(β̂_j^t) ε.    (5)

Take the step if it leads to a decrease of moderate size ξ in the Lasso loss, otherwise force a forward step (as in (3), (4) of FSF) and relax λ if necessary:

If Γ(β̂^t + ŝ_ĵ 1_ĵ; λ^t) − Γ(β̂^t; λ^t) ≤ −ξ, then

    β̂^{t+1} = β̂^t + ŝ_ĵ 1_ĵ,    λ^{t+1} = λ^t.

Otherwise,

    ( ĵ, ŝ ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Z_i; β̂^t + s 1_j),    (6)

    β̂^{t+1} = β̂^t + ŝ 1_ĵ,

    λ^{t+1} = min[ λ^t, (1/ε) ( ∑_{i=1}^n L(Z_i; β̂^t) − ∑_{i=1}^n L(Z_i; β̂^{t+1}) − ξ ) ],

    I_A^{t+1} = I_A^t ∪ {ĵ}.

Step 3 (iteration). Increase t by one and repeat Steps 2 and 3. Stop when λ^t ≤ 0.
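To make Algorithm 1 concrete, here is a minimal sketch for the squared loss. This is our own illustrative implementation, not the authors' code; names like `best_forward` are ours, and efficiency (e.g., the inner-product updates of Section 4) is deliberately ignored:

```python
import numpy as np

def blasso(X, y, eps=0.05, xi=1e-6, max_iter=500):
    """Sketch of BLasso (Algorithm 1) for the squared loss."""
    n, p = X.shape
    loss = lambda b: np.sum((y - X @ b) ** 2)

    def best_forward(beta):
        # (j, s) minimizing the empirical loss after a +/- eps step, as in (6)
        cands = []
        for j in range(p):
            for s in (eps, -eps):
                trial = beta.copy()
                trial[j] += s
                cands.append((loss(trial), j, s))
        return min(cands)

    # Step 1 (initialization): initial forward step and lambda^0.
    beta = np.zeros(p)
    new_loss, j, s = best_forward(beta)
    beta[j] += s
    lam = (loss(np.zeros(p)) - new_loss) / eps
    active = {j}

    path = [(lam, beta.copy())]
    for _ in range(max_iter):
        if lam <= 0:
            break  # Step 3 stopping rule
        # Step 2: backward step (5), restricted to the active set.
        back = []
        for j in sorted(active):
            if beta[j] == 0:
                continue
            s = -np.sign(beta[j]) * eps
            trial = beta.copy()
            trial[j] += s
            back.append((loss(trial), j, s))
        took_backward = False
        if back:
            b_loss, j, s = min(back)
            # Take it only if the Lasso loss decreases by at least xi.
            d_gamma = (b_loss - loss(beta)
                       + lam * (np.abs(beta[j] + s) - np.abs(beta[j])))
            if d_gamma <= -xi:
                beta[j] += s
                took_backward = True
        if not took_backward:
            # Forced forward step (6); relax lambda if necessary.
            new_loss, j, s = best_forward(beta)
            old_loss = loss(beta)
            beta[j] += s
            active.add(j)
            lam = min(lam, (old_loss - new_loss - xi) / eps)
        path.append((lam, beta.copy()))
    return path
```

Note that the backward step only ever searches the active set, and λ is non-increasing along the path, mirroring Lemma 1.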

that for each value of λ, BLasso performs coordinate descent until there is no descent step. Then, by Lemma 1 (3), the value of λ is reduced and a forward step is forced. The stepsize ε controls the fineness of the grid BLasso runs on. The tolerance ξ controls how large a descent needs to be for a backward step to be taken. It is needed to accommodate numerical error and should be set much smaller than ε to obtain a good approximation (see the proof of Theorem 1). In fact, we have a convergence result for BLasso (a detailed proof is included in the Appendix):


Theorem 1. For a finite number of base learners and ξ = o(ε), if ∑_i L(Z_i; β) is strongly convex with bounded second derivatives in β, then as ε → 0, the BLasso path converges to the Lasso path uniformly.

Note that Conjecture 2 of Rosset et al. (2004) follows from Theorem 1. This is because if all the optimal coefficient paths are monotone, then BLasso will never take a backward step, so it will be equivalent to e-Boosting. Many popular loss functions, for example, the squared loss, the logistic loss, and negative log-likelihood functions of exponential families, are convex and twice differentiable, and they satisfy the conditions of Theorem 1. Moreover, from the proof of this theorem in the appendix, it is easy to see that it suffices to have the conditions of the theorem satisfied over a bounded set of β. For the exponential loss, Lemma 1 implies that there is a finite λ^0 < ∞ for every data set (Z_i). Thus we can restrict the proof of Theorem 1 to this bounded set of β to show the result for the exponential loss.

Other loss functions, like the hinge loss (SVM), are continuous and convex but not differentiable. The differentiability, however, is needed only for the proof of Theorem 1. BLasso does not use any gradient or higher-order derivatives, only differences of the loss function, and therefore remains applicable to loss functions that are not differentiable or whose differentiation is too complex or computationally expensive. It is theoretically possible that BLasso's coordinate descent strategy gets stuck at nondifferentiable points for functions like the hinge loss. However, as illustrated in our third experiment, BLasso may still work empirically for cases like the 1-norm SVM. Theorem 1 also does not cover nonparametric learning problems with an infinite number of base learners. In fact, for problems with a large or infinite number of base learners, the minimization in (6) is usually done approximately by functional gradient descent, and a tolerance ξ > 0 needs to be chosen to avoid oscillation between forward and backward steps caused by slow descent. We discuss this topic further in the discussion (Sec. 7).

4. The Backward Step

We now explain the motivation and working mechanism of BLasso. Observe that FSF uses only "forward" steps, that is, it only takes steps that lead to a direct reduction of the empirical loss. Compared to classical model selection methods like Forward Selection and Backward Elimination, or Growing and Pruning of a classification tree, a "backward" counterpart is missing. Without the backward step, when FSF picks up more irrelevant variables than the Lasso path in some cases (cf. Figure 1 in Section 6.2), it has no mechanism to remove them. As seen below, this backward step arises naturally in BLasso because of our coordinate descent view of the minimization of the Lasso loss. (Since ξ exists for numerical purposes only, it is assumed to be 0 and thus excluded in the following theoretical discussion.)

For a given β ≠ 0 and λ > 0, consider the impact of a small ε > 0 change of β_j on the Lasso loss Γ(β; λ). For an |s| = ε,

    ∆_j Γ(Z; β) = ( ∑_{i=1}^n L(Z_i; β + s 1_j) − ∑_{i=1}^n L(Z_i; β) ) + λ ( T(β + s 1_j) − T(β) )

                := ∆_j ( ∑_{i=1}^n L(Z_i; β) ) + λ ∆_j T(β).

Since T(β) is simply the L1 norm of β, ∆_j T(β) reduces to a simple form:


    ∆_j T(β) = ‖β + s 1_j‖_1 − ‖β‖_1 = |β_j + s| − |β_j|

             = ε · sign⁺(β_j, s),  where  sign⁺(β_j, s) = { 1 if sβ_j > 0 or β_j = 0;  −1 if sβ_j < 0 }.    (7)

Equation (7) shows that an ε step changes the penalty by a fixed amount ε in absolute value for any j; only the sign of the penalty change may vary. In the beginning of BLasso, all j directions are leaving zero and hence change the L1 penalty by the same positive amount λ · ε. Therefore the first step of BLasso is a forward step, because minimizing the Lasso loss is then equivalent to minimizing the empirical loss. As the algorithm proceeds, some of the penalty changes may become negative, and minimizing the empirical loss is no longer equivalent to minimizing the Lasso loss. In fact, except for special cases like orthogonal covariates (predictors), the FSF steps may result in negative changes of the L1 penalty. In some of these situations, a step that goes "backward" reduces the penalty with a small sacrifice in the empirical loss. In general, to minimize the Lasso loss, one needs to go "back and forth" to trade off the penalty against the empirical loss for different regularization parameters.

To be precise, for a given β̂, a backward step is such that:

    ∆β̂ = s_j 1_j, subject to β̂_j ≠ 0, sign(s_j) = −sign(β̂_j) and |s_j| = ε.

Making such a step reduces the penalty by a fixed amount λ · ε, but its impact on the empirical loss can differ; therefore, as in (5), we want:

    ĵ = arg min_j ∑_{i=1}^n L(Z_i; β̂ + s_j 1_j) subject to β̂_j ≠ 0 and s_j = −sign(β̂_j) ε,

that is, ĵ is selected such that the empirical loss after making the step is as small as possible. While forward steps try to reduce the Lasso loss by minimizing the empirical loss, backward steps try to reduce the Lasso loss by minimizing the Lasso penalty.

In summary, by allowing backward steps, we are able to work with the Lasso loss directly and take backward steps to correct earlier forward steps that might have picked up irrelevant variables. Since much of the discussion on the similarity and difference between FSF and Lasso has focused on least squares problems (e.g., Efron et al., 2004; Hastie et al., 2001), we next examine the BLasso algorithm in this case. It is straightforward to see that in LS problems both forward and backward steps in BLasso are based only on the correlations between the fitted residuals and the covariates (predictors). It follows that BLasso in this case reduces to finding the best direction in both forward and backward steps by examining the inner products, and then deciding whether to go forward or backward based on the regularization parameter. This not only simplifies the minimization procedure but also significantly reduces the computational cost for a large number of observations, since the inner product between η^t and X_j can be updated by

    (η^{t+1})′ X_j = (η^t − s X_ĵt)′ X_j = (η^t)′ X_j − s X_ĵt′ X_j,    (8)

which takes only one operation if X_ĵt′ X_j is precalculated. Therefore, when the number of base learners is small, based on precalculated X′X and Y′X, BLasso can use (8) to make its computation


complexity independent of the number of observations. This nice property is not surprising, as it is also observed in established algorithms like LARS and Osborne's homotopy method, which are specialized for LS problems.

In nonparametric situations, the number of base learners is large, and the aforementioned strategy therefore becomes inefficient. BLasso has a natural extension to this case as follows: similar to boosting, the forward step is carried out by a sub-optimization procedure such as fitting trees, smoothing splines or stumps. For the backward step, only inner products between base learners that have entered the model need to be calculated. The inner products between these base learners and the residuals can be updated by (8). This makes the backward steps' computational complexity proportional to the number of base learners already chosen instead of the number of all possible base learners. Therefore BLasso works not only for cases with large sample sizes but also for cases where a large or infinite class of possible base learners is given.

As mentioned earlier, there are already established efficient algorithms for solving the least squares (L2) Lasso problem, for example, the homotopy method by Osborne et al. (2000b) and LARS (Efron et al., 2004). These algorithms are very efficient at giving the exact Lasso paths in parametric settings. For nonparametric learning problems with a large or an infinite number of base learners, we believe BLasso is an attractive strategy for approximating the path of the Lasso, as it shares the same computational strategy as Boosting, which has proven itself successful in applications. Also, in cases where the Ordinary Least Squares (OLS) method performs well, BLasso can be modified to start from the OLS estimate, go backward and stop in a few iterations.
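In code, the update (8) amounts to subtracting one precomputed row of X′X from the running vector of residual inner products; a minimal sketch (names are ours):

```python
import numpy as np

def update_inner_products(eta_X, j_hat, s, XtX):
    """Apply (8): after the step beta[j_hat] += s, the residual becomes
    eta - s * X[:, j_hat], so every inner product (eta)' X_j is updated
    using one precomputed row of X'X."""
    return eta_X - s * XtX[j_hat]
```

Starting from η^0 = Y, the p inner products can thus be maintained in O(p) per step, independent of the number of observations.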

5. Generalized BLasso

As stated earlier, BLasso not only works for general convex loss functions, but also extends to convex penalties other than the L1 penalty. For the Lasso problem, BLasso does fixed-stepsize coordinate descent to minimize the penalized loss. Since the penalty is the special L1 norm and (7) holds, a step's impact on the penalty has a fixed size ε with either a positive or a negative sign, and the coordinate descent takes the form of "backward" and "forward" steps. This reduces the minimization of the penalized loss function to unregularized minimizations of the loss function, as in (6) and (5). For general convex penalties, since a step on different coordinates does not necessarily have the same impact on the penalty, one is forced to work with the penalized function directly. Assume T(β) : R^m → R is a convex penalty function. We next describe the Generalized BLasso algorithm (Algorithm 2).

In the Generalized BLasso algorithm, explicit "forward" and "backward" steps are no longer seen. However, the mechanism remains the same: minimize the penalized loss function for each λ, and relax the regularization by reducing λ through a "forward" step when the minimum of the penalized loss for the current λ is reached.

6. Experiments

In this section, three experiments are carried out to illustrate the attractiveness of BLasso. The first experiment runs BLasso under the classical Lasso setting on the diabetes data set (cf. Efron et al., 2004), often used in studies of the Lasso, with an added artificial covariate to highlight the difference between BLasso and FSF. This added covariate is strongly correlated with a couple of the original covariates (predictors). In this case, BLasso is seen to produce a path almost exactly the


Algorithm 2 Generalized BLasso

Step 1 (initialization). Given data Zi = (Yi, Xi), i = 1, ..., n, a fixed small stepsize ε > 0 and a small tolerance parameter ξ ≥ 0, take an initial forward step

  (ĵ, ŝ_ĵ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Zi; s·1_j),    β̂^0 = ŝ_ĵ·1_ĵ.

Then calculate the corresponding regularization parameter

  λ^0 = [∑_{i=1}^n L(Zi; 0) − ∑_{i=1}^n L(Zi; β̂^0)] / [T(β̂^0) − T(0)].

Set t = 0.

Step 2 (steepest descent on the penalized loss). Find the steepest coordinate descent direction on the penalized loss:

  (ĵ, ŝ_ĵ) = arg min_{j, s=±ε} Γ(β̂^t + s·1_j; λ^t).

Update β̂ if this step reduces the penalized loss by at least ξ; otherwise force β̂ to minimize L and recalculate the regularization parameter:
If Γ(β̂^t + ŝ_ĵ·1_ĵ; λ^t) − Γ(β̂^t; λ^t) < −ξ, then

  β̂^{t+1} = β̂^t + ŝ_ĵ·1_ĵ,    λ^{t+1} = λ^t.

Otherwise,

  (ĵ, ŝ_ĵ) = arg min_{j, |s|=ε} ∑_{i=1}^n L(Zi; β̂^t + s·1_j),
  β̂^{t+1} = β̂^t + ŝ_ĵ·1_ĵ,
  λ^{t+1} = min[ λ^t, (∑_{i=1}^n L(Zi; β̂^t) − ∑_{i=1}^n L(Zi; β̂^{t+1})) / (T(β̂^{t+1}) − T(β̂^t)) ].

Step 3 (iteration). Increase t by one and repeat Steps 2 and 3. Stop when λ^t ≤ 0.
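Under the assumption of a squared-error loss (so ∑L can be evaluated directly) and a penalty supplied as a callable T, Algorithm 2 can be sketched as below. The function and variable names are ours, not the paper's, and the exhaustive coordinate search is only meant for small p:

```python
import numpy as np

def generalized_blasso(X, y, penalty, eps=0.1, xi=1e-10, max_iter=2000):
    """Sketch of Generalized BLasso (Algorithm 2) for the squared-error
    loss and a convex penalty T passed in as `penalty`. Returns the list
    of (lambda, beta) pairs visited along the path."""
    n, p = X.shape

    def loss(b):
        return float(np.sum((y - X @ b) ** 2))

    def pen_loss(b, lam):
        return loss(b) + lam * penalty(b)

    def best_step(b, obj):
        # exhaustive search over coordinates j and signs s = +/- eps
        best_val, best_b = np.inf, None
        for j in range(p):
            for s in (eps, -eps):
                cand = b.copy()
                cand[j] += s
                v = obj(cand)
                if v < best_val:
                    best_val, best_b = v, cand
        return best_b, best_val

    # Step 1: initial forward step on the unpenalized loss
    beta = np.zeros(p)
    new, _ = best_step(beta, loss)
    lam = (loss(beta) - loss(new)) / (penalty(new) - penalty(beta))
    beta = new
    path = [(lam, beta.copy())]

    # Steps 2-3: descend on the penalized loss, relaxing lambda when stuck
    for _ in range(max_iter):
        if lam <= 0:
            break
        new, val = best_step(beta, lambda b: pen_loss(b, lam))
        if val - pen_loss(beta, lam) < -xi:
            beta = new                      # strict descent, keep lambda
        else:
            # forced forward step: minimize the loss, then relax lambda
            new, _ = best_step(beta, loss)
            denom = penalty(new) - penalty(beta)
            if denom == 0:
                break                       # degenerate case, not handled here
            lam = min(lam, (loss(beta) - loss(new)) / denom)
            beta = new
        path.append((lam, beta.copy()))
    return path
```

With `penalty = lambda b: float(np.abs(b).sum())` this specializes back to BLasso for the L1 penalty; λ stays constant during descent steps and decreases only at forced forward steps, so the returned λ sequence is non-increasing.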

same as the Lasso path, which shrinks the added irrelevant variable back to zero, while FSF's path departs drastically from Lasso's due to the added strongly correlated covariate and does not move it back to zero. In the second experiment, we compare the prediction and variable selection performance of FSF and BLasso in a least squares regression simulation using a large number (p = 500 >> n = 50) of randomly correlated base learners, to emulate the nonparametric learning scenario when the true model is sparse. The results show that, overall, BLasso gives sparser solutions than FSF with similar or slightly better predictions, and this holds for various stepsizes. Moreover, we find that when the stepsize increases, there is a regularization effect in terms of both prediction and sparsity, for both BLasso and FSF.


The last experiment illustrates BLasso as an off-the-shelf method for computing the regularization path for general convex loss functions and general convex penalties. Two cases are presented. The first is bridge regression (Frank and Friedman, 1993) on the diabetes data using different Lγ (γ ≥ 1) norms as penalties. The other is a simulated classification problem using the 1-norm SVM (Zhu et al., 2003) with the hinge loss.

6.1 L2 Regression with L1 Penalty (Classical Lasso)

The data set used in this experiment is the diabetes data set, where n = 442 diabetes patients were measured on 10 baseline predictor variables X^1, ..., X^10. A prediction model was desired for the response variable Y, a quantitative measure of disease progression one year after baseline. We add one additional predictor variable to make the difference between the FSF and Lasso solutions more visible. This added variable is X^11 = −X^7 + X^8 + 5X^9 + e, where e is i.i.d. Gaussian noise (mean zero and variance 1/442). The following vector gives the correlations of X^11 with X^1, X^2, ..., X^10:

  (0.25, 0.24, 0.47, 0.39, 0.48, 0.38, −0.58, 0.76, 0.94, 0.47).

The classical Lasso (L2 regression with the L1 penalty) is applied to this data set with the added covariate. Location and scale transformations are made so that all the covariates (predictors) are standardized to have mean 0 and unit length, and the response has mean zero. The penalized loss function has the form

  Γ(β; λ) = ∑_{i=1}^n (Yi − Xiβ)² + λ‖β‖₁.
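The construction of the added covariate and the standardization can be sketched as follows. Since the diabetes data are not bundled here, a synthetic stand-in of the same shape is used (so the correlations with X^11 will not match the quoted vector); only the recipe itself is from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 442
X = rng.standard_normal((n, 10))          # stand-in for the 10 baseline predictors
y = X @ rng.standard_normal(10) + rng.standard_normal(n)

# added covariate: X11 = -X7 + X8 + 5*X9 + e, with e ~ N(0, 1/442)
e = rng.normal(0.0, np.sqrt(1.0 / n), size=n)
x11 = -X[:, 6] + X[:, 7] + 5 * X[:, 8] + e
X = np.column_stack([X, x11])

# standardize: each covariate to mean 0 and unit length, response to mean 0
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
y = y - y.mean()
```

Unit-length (rather than unit-variance) columns make the penalized loss above directly comparable across covariates.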

The middle panel of Figure 1 shows the coefficient path plot for BLasso applied to the modified diabetes data. The left (Lasso) and middle (BLasso) panels are indistinguishable from each other. Both FSF and BLasso pick up the added artificial and strongly correlated X^11 (the solid line) in the earlier stages, but due to its greedy nature FSF is not able to remove X^11 in the later stages; thus every parameter estimate is affected, leading to solutions significantly different from Lasso's. The BLasso solutions were built up in 8700 steps (making the stepsize ε = 0.5 small so that the coefficient paths are smooth), 840 of which were backward steps. In comparison, FSF took 7300 pure forward steps. BLasso's backward steps concentrate mainly around the steps where FSF and BLasso tend to differ.

6.2 Comparison of BLasso and Forward Stagewise Fitting by Simulation

In this experiment, we compare the model estimates generated by FSF and BLasso in a large p (= 500) and small n (= 50) setting to mimic a nonparametric learning scenario where FSF and BLasso are computationally attractive. In this least squares regression simulation, the design is randomly generated as described below to guarantee a fair amount of correlation among the covariates (predictors). Otherwise, if the design is close to orthogonal, the FSF and BLasso paths will be too similar for this simulation to yield interesting results.


[Figure 1: three coefficient-path panels (Lasso, BLasso, FSF); y-axis: coefficient estimates from −500 to 500; x-axis: t = ∑|β̂_j| from 0 to 3000.]

Figure 1: Regularization path plots, for the diabetes data set, of Lasso, BLasso and FSF: the curves (paths) of the estimates β̂_j for the 10 original and 1 added covariates (predictors), as the regularization is relaxed, that is, as t tends to infinity. The thick solid curves correspond to the 11th (added) covariate. Left panel: Lasso solution paths (produced using a simplex search on the penalized empirical loss function for each λ) as a function of t = ‖β‖₁. Middle panel: BLasso solution paths, which are indistinguishable from the Lasso solutions. Right panel: FSF solution paths, which differ from Lasso and BLasso.

We first draw 5 covariance matrices Ci, i = 1, ..., 5, as .95 × Di + .05 × I_{p×p}, where Di is sampled from Wishart(20, p) and then normalized to have 1's on the diagonal. The Wishart distribution creates a fair amount of correlation between the covariates in Ci (the average absolute correlation is about 0.18), and the added identity matrix guarantees that Ci is full rank. For each covariance matrix Ci, the design X is drawn independently from N(0, Ci) with n = 50. The target variable Y is then computed as Y = Xβ + e, where β1 to βq with q = 7 are drawn independently from N(0, 1) and β8 to β500 are set to zero to create a sparse model; e is a Gaussian noise vector with mean zero and variance 1. For each of the 5 cases with different Ci, both BLasso and FSF are run using stepsizes ε = 1/5, 1/10, 1/20, 1/40 and 1/80. We also run Lasso, which is listed as BLasso with ε = 0.

To compare the performances, we examine the solutions on the regularization paths that give the smallest mean squared error ‖Xβ − Xβ̂‖². The mean squared errors (on the log scale) of these solutions are tabulated together with the number of nonzero estimates in each solution. All cases are run 50 times and the average results are reported in Table 1. As can be seen from Table 1, since the true model is sparse, in almost all cases the BLasso solutions are sparser and have similar prediction performance compared to the FSF solutions with the same stepsize. It is also interesting to note that smaller stepsizes require more computation but often give worse predictions and much less sparsity. We conjecture that there is also a regularization effect caused by the discretization of the solution paths (more discussion in Section 8); this effect has also been observed by Gao et al. (2006) in a language ranking problem.
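The simulation design above can be sketched as follows; the singular Wishart draw is formed as G·Gᵀ with a p × 20 Gaussian matrix G, which is one standard way to realize Wishart(20, p) (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, q = 500, 50, 7

# D ~ Wishart(20, p) via G @ G.T with G a p x 20 Gaussian matrix,
# then normalized to have 1's on the diagonal
G = rng.standard_normal((p, 20))
D = G @ G.T
d = np.sqrt(np.diag(D))
D = D / np.outer(d, d)

# the added identity makes C full rank (min eigenvalue >= 0.05)
C = 0.95 * D + 0.05 * np.eye(p)

# design drawn from N(0, C), and a sparse true model with q nonzeros
X = rng.standard_normal((n, p)) @ np.linalg.cholesky(C).T
beta = np.zeros(p)
beta[:q] = rng.standard_normal(q)
y = X @ beta + rng.standard_normal(n)
```

Because the normalized D is positive semidefinite with unit diagonal, C is a valid full-rank covariance matrix and the Cholesky factorization always succeeds.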


Design            ε=1/5   ε=1/10  ε=1/20  ε=1/40  ε=1/80  Lasso (ε=0)
C1  MSE  BLasso   18.60   18.27   18.33   18.60   19.42   19.98
         FSF      19.77   19.40   19.60   19.82   19.96
    q̂    BLasso   15.38   20.08   21.76   21.44   20.50   21.86
         FSF      18.32   24.00   27.28   30.48   32.14
C2  MSE  BLasso   19.58   19.28   19.65   19.94   20.76   21.12
         FSF      20.67   20.29   20.63   20.94   21.11
    q̂    BLasso   14.80   18.92   20.18   21.22   20.52   21.82
         FSF      18.34   21.90   25.70   28.80   29.38
C3  MSE  BLasso   18.83   18.14   18.55   18.90   19.32   20.15
         FSF      19.35   19.11   19.52   19.78   19.93
    q̂    BLasso   15.22   19.10   19.92   20.02   19.52   21.08
         FSF      15.38   19.72   23.30   25.88   27.30
C4  MSE  BLasso   20.09   19.88   19.85   20.20   21.84   21.70
         FSF      21.53   21.09   21.13   21.35   21.57
    q̂    BLasso   15.76   20.82   22.20   22.42   21.12   22.24
         FSF      18.90   24.64   30.38   32.02   34.16
C5  MSE  BLasso   18.79   18.62   18.70   19.09   19.47   20.12
         FSF      19.99   19.92   19.84   20.19   20.36
    q̂    BLasso   15.58   19.16   21.26   21.92   22.18   22.76
         FSF      17.10   23.24   28.24   30.94   32.84

Table 1: Comparison of FSF and BLasso in a simulated nonparametric regression setting. The log of the MSE and q̂ (the number of nonzero estimates) are reported for the oracle solutions on the regularization paths. All results are averaged over 50 runs.

Design                       ε=1/5          ε=1/10         ε=1/20         ε=1/40         ε=1/80
C1  MSE  BLasso−Lasso   -1.38 (0.37)   -1.71 (0.23)   -1.65 (0.21)   -1.38 (0.21)   -0.56 (0.35)
         BLasso−FSF     -1.17 (0.27)   -1.13 (0.28)   -1.27 (0.26)   -1.22 (0.26)   -0.54 (0.24)
    q̂    BLasso−Lasso   -6.48 (0.64)   -1.78 (0.70)   -0.10 (0.67)   -0.42 (0.63)   -1.36 (0.65)
         BLasso−FSF     -2.94 (0.89)   -3.92 (1.22)   -5.52 (1.26)   -9.04 (1.43)  -11.64 (1.64)
C2  MSE  BLasso−Lasso   -1.54 (0.37)   -1.84 (0.29)   -1.47 (0.26)   -1.18 (0.25)   -0.36 (0.45)
         BLasso−FSF     -1.09 (0.32)   -1.01 (0.27)   -0.98 (0.23)   -1.00 (0.23)   -0.35 (0.38)
    q̂    BLasso−Lasso   -7.02 (0.58)   -2.90 (0.65)   -1.64 (0.52)   -0.60 (0.50)   -1.30 (0.48)
         BLasso−FSF     -3.54 (0.99)   -2.98 (0.88)   -5.52 (1.09)   -7.58 (1.31)   -8.86 (1.41)
C3  MSE  BLasso−Lasso   -1.32 (0.35)   -2.01 (0.36)   -1.60 (0.33)   -1.25 (0.32)   -0.83 (0.32)
         BLasso−FSF     -0.53 (0.28)   -0.97 (0.22)   -0.97 (0.23)   -0.88 (0.23)   -0.62 (0.24)
    q̂    BLasso−Lasso   -5.86 (0.81)   -1.98 (0.72)   -1.16 (0.54)   -1.06 (0.55)   -1.56 (0.56)
         BLasso−FSF     -0.16 (0.78)   -0.62 (0.87)   -3.38 (1.05)   -5.86 (0.97)   -7.78 (1.08)
C4  MSE  BLasso−Lasso   -1.61 (0.45)   -1.82 (0.33)   -1.85 (0.33)   -1.50 (0.33)    0.14 (0.66)
         BLasso−FSF     -1.44 (0.30)   -1.20 (0.28)   -1.28 (0.24)   -1.15 (0.29)    0.27 (0.67)
    q̂    BLasso−Lasso   -6.48 (0.71)   -1.42 (0.85)   -0.04 (0.73)    0.18 (0.52)   -1.12 (0.67)
         BLasso−FSF     -3.14 (0.92)   -3.82 (1.16)   -8.18 (1.12)   -9.60 (1.35)  -13.04 (1.68)
C5  MSE  BLasso−Lasso   -1.33 (0.38)   -1.50 (0.26)   -1.41 (0.26)   -1.03 (0.22)   -0.65 (0.22)
         BLasso−FSF     -1.20 (0.25)   -1.30 (0.23)   -1.14 (0.28)   -1.10 (0.29)   -0.89 (0.28)
    q̂    BLasso−Lasso   -7.18 (0.84)   -3.60 (0.64)   -1.50 (0.58)   -0.84 (0.52)   -0.58 (0.55)
         BLasso−FSF     -1.52 (0.88)   -4.08 (1.10)   -6.98 (1.08)   -9.02 (1.21)  -10.66 (1.50)

Table 2: Means and standard errors of the differences in MSE and q̂ between BLasso and Lasso, and between BLasso and FSF, for the settings in Table 1.


[Figure 2: two panels of in-sample MSE (0 to 100, y-axis) versus ‖β‖₁ (2 to 14, x-axis), with curves for BLasso, FSF and Lasso.]

Figure 2: Plots of in-sample mean squared error (y-axis) versus ‖β‖₁ (x-axis) for a typical realization of the experiment (one run under C2 from Table 1). The stepsize is set to ε = 1/80 in the left plot and ε = 1/5 in the right.

Table 2 gives a further analysis of the results in Table 1. It contains means and standard errors of the differences in MSE and q̂ between BLasso and Lasso, and between BLasso and FSF, for the stepsizes given in Table 1. First of all, all the mean differences are negative and, compared with their standard errors, significant, except for a few cells at the small stepsizes 1/40 and 1/80 (the last two columns). This overwhelming pattern of significant negative differences suggests that, for this simulation, BLasso is better than Lasso and FSF in terms of both prediction and sparsity unless the stepsize is very small, as in the last two columns. Moreover, for MSE the stepsize ε = 1/10 seems to bring the best improvement of BLasso over Lasso, and the improvement is fairly robust to the choice of stepsize. On the other hand, the improvements of BLasso over FSF in MSE are smaller than those of BLasso over Lasso, because FSF uses the same discrete stepsizes; these improvements thus reflect the gains from the backward steps alone, since FSF also takes forward steps. In terms of q̂, the number of covariates selected, the larger the stepsize, the sparser the BLasso model relative to the Lasso or FSF model, as expected. The sparsity improvements over Lasso are significant for all cells except the last column with ε = 1/80. When compared with FSF, the sparsity improvements are smaller (but still significant). In terms of gains in both MSE and sparsity, relative to both Lasso and FSF, the stepsizes 1/10 and 1/20, that is, 0.1 and 0.05, seem good overall choices for this simulation study.


As suggested by one referee, we compare the Lasso empirical loss functions induced by BLasso, FSF and Lasso (through LARS). Figure 2 shows plots of the in-sample mean squared error versus the L1 norm of the coefficients, taken from one typical run of the simulation conducted in this section. As shown by the plots, the in-sample MSE from BLasso approximates the in-sample MSE from the Lasso better than FSF does, under both large and small stepsizes. In particular, when the stepsize is small, the BLasso path is almost indiscernible from the Lasso path. A final comment on Figure 2 is in order. Although the in-sample MSE curve for BLasso in the right panel of Figure 2 does seem to go up at the end of the plot, we cannot extend the x-axis to higher ‖β‖₁ values because, at the stepsize ε = 1/5, the BLasso solution has reached its maximum L1 norm of around 14 to 15, the maximum of the x-axis in the right panel of Figure 2.

6.3 Generalized BLasso for Other Penalties and Nondifferentiable Loss Functions

First, to demonstrate Generalized BLasso with different penalties, we use the bridge regression setting with the diabetes data set (without the covariate added in the first experiment). Bridge regression (first proposed by Frank and Friedman 1993 and later more carefully discussed and implemented by Fu 2001) is a generalization of ridge regression (L2 penalty) and the Lasso (L1 penalty). It considers a linear (L2) regression problem with an Lγ penalty for γ ≥ 1 (to maintain the convexity of the penalty function). The penalized loss function has the form

  Γ(β; λ) = ∑_{i=1}^n (Yi − Xiβ)² + λ‖β‖_γ,

where γ is the bridge parameter. The data used in this experiment are centered and rescaled as in the first experiment. Generalized BLasso successfully produced the paths for all 5 cases, which we verified by pointwise minimization using the simplex method (γ = 1, γ = 1.1, γ = 4 and γ = ∞) or the closed-form solution (γ = 2). It is interesting to notice the phase transition from the near-Lasso to the Lasso: the solution paths are similar, but only the Lasso path is sparse. Also, as γ grows larger, the estimates for different βj tend to have more similar sizes, and in the extreme γ = ∞ there is a "branching" phenomenon: the estimates stay together in the beginning and branch out in different directions as the path progresses.

To demonstrate the Generalized BLasso algorithm for classification with a nondifferentiable loss function and an L1-type penalty, we look at binary classification with the hinge loss. As in Zhu et al. (2003), we generate n = 50 training data points in each of two classes. The first class has two standard normal independent inputs X^1 and X^2 and class label Y = −1. The second class also has two standard normal independent inputs, but conditioned on 4.5 ≤ (X^1)² + (X^2)² ≤ 8, and has class label Y = 1. We wish to find a classification rule from the training data, so that given a new input we can assign it a label from {1, −1}. The 1-norm SVM (Zhu et al., 2003) is used to estimate β:

  (β̂₀, β̂) = arg min_{β₀,β} ∑_{i=1}^n ( 1 − Yi(β₀ + ∑_{j=1}^m βj hj(Xi)) )₊ + λ ∑_{j=1}^5 |βj|,

where the hj ∈ D are basis functions and λ is the regularization parameter. The dictionary of basis functions is D = {√2 X^1, √2 X^2, √2 X^1X^2, (X^1)², (X^2)²}. Notice that β₀ is left unregularized, so the penalty function is not exactly the L1 penalty.
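The training objective can be written down directly. This sketch (with helper names of our own) evaluates the penalized hinge loss over the five-term dictionary, leaving β₀ unpenalized as in the text, and includes the classification rule sign(f̂(x)):

```python
import numpy as np

def basis(X):
    """The five-term dictionary D evaluated at inputs (x1, x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    r2 = np.sqrt(2.0)
    return np.column_stack([r2 * x1, r2 * x2, r2 * x1 * x2, x1**2, x2**2])

def svm_objective(beta0, beta, X, y, lam):
    """Penalized hinge loss of the 1-norm SVM; beta0 is left unpenalized."""
    margins = 1.0 - y * (beta0 + basis(X) @ beta)
    return float(np.sum(np.maximum(margins, 0.0)) + lam * np.abs(beta).sum())

def classify(beta0, beta, X):
    """Classification rule sign(f_hat(x)) for the fitted model."""
    return np.sign(beta0 + basis(X) @ beta)
```

Because the hinge loss only enters through function-value differences, this objective can be handed to Generalized BLasso unchanged, even though it is not differentiable.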

[Figure 3: five solution-path panels (top, y-axis range [−800, 800], x-axis ticks at 600/1800/3000) over five penalty-contour panels (bottom, axes from −1 to 1).]

Figure 3: Upper panels: solution paths produced by BLasso for different bridge parameters on the diabetes data set. From left to right: Lasso (γ = 1), near-Lasso (γ = 1.1), ridge (γ = 2), over-ridge (γ = 4), max (γ = ∞). The y-axis is the parameter estimate and has the range [−800, 800]. The x-axis for each of the left 4 plots is ∑_i |β_i|; the one for the 5th plot is max(|β_i|), because ∑_i |β_i| is unsuitable there. Lower panels: the corresponding contours of equal penalty, |β₁|^γ + |β₂|^γ = 1.
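As a plug-in penalty for a path algorithm, the bridge family is one line. This sketch (our own helper, not from the paper) uses the ∑_j |β_j|^γ form matching the contour caption above, and reads γ = ∞ as the max-norm limit of the fifth panel:

```python
import numpy as np

def bridge_penalty(gamma):
    """Return T(beta) = sum_j |beta_j|**gamma (convex for gamma >= 1);
    gamma = np.inf gives the max(|beta_j|) limit of the fifth panel."""
    if np.isinf(gamma):
        return lambda b: float(np.max(np.abs(b)))
    return lambda b: float(np.sum(np.abs(b) ** gamma))
```

Any path algorithm that accepts the penalty as a callable T(β) can sweep γ by constructing one such function per bridge parameter.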



[Figure 4: left panel ("Regularization Path") shows coefficient paths (y-axis from −0.2 to 0.8) versus t = ∑_{j=1}^5 |β̂_j| (0 to 1.5); right panel ("Data") shows the data scatter on [−3, 3]².]

Figure 4: Estimates of the 1-norm SVM coefficients β̂_j, j = 1, 2, ..., 5, for the simulated two-class classification data. Left panel: BLasso solutions as a function of t = ∑_{j=1}^5 |β̂_j|. Right panel: scatter plot of the data points with labels: '+' for y = −1; 'o' for y = 1.

The fitted model is

  f̂(x) = β̂₀ + ∑_{j=1}^m β̂_j h_j(x),

and the classification rule is given by sign(f̂(x)). Since the loss function is not differentiable, we do not have a theoretical guarantee that BLasso works. Nonetheless, the solution path produced by Generalized BLasso has the same sparsity and piecewise linearity as the 1-norm SVM solutions shown in Zhu et al. (2003). It takes Generalized BLasso 490 iterations to generate the solutions. The covariates enter the regression equation sequentially as t increases, in the following order: the two quadratic terms first, followed by the interaction term and then the two linear terms. Like the 1-norm SVM in Zhu et al. (2003), BLasso correctly picks up the quadratic terms early; the interaction term and the linear terms, which are not in the true model, come up much later. In other words, the BLasso results are in good agreement with Zhu et al.'s 1-norm SVM results, and we regard this as a confirmation of BLasso's effectiveness in this nondifferentiable example.

7. Discussion and Concluding Remarks

As seen from our simulations under sparse true models, BLasso generates sparser solutions with similar or slightly better predictions relative to Lasso and FSF. The behavior relative to Lasso is due to the discrete stepsize of BLasso, while the behavior relative to FSF is partially explained by BLasso's convergence to the Lasso path as the stepsize goes to 0. We believe that the generalized version

[Figure 5: coefficient estimate (y-axis, 0 to 400) versus λ (x-axis, from 2000 down to 0).]

Figure 5: Estimates of the regression coefficient β̂₃ for the diabetes data set, plotted as functions of λ. Dotted line: estimates using stepsize ε = 0.05. Solid line: estimates using stepsize ε = 10. Dash-dot line: estimates using stepsize ε = 50.

is also effective as an off-the-shelf algorithm for general convex penalized loss minimization problems. Computationally, BLasso takes roughly O(1/ε) steps to produce the whole path. Depending on the actual loss function, base learners and minimization method used in each step, the actual computational complexity varies. As shown in the simulations, choosing a smaller stepsize gives a smoother solution path, but it does not guarantee a better prediction. In fact, for the particular simulation set-up in Section 6.2, moderate stepsizes gave better results in terms of both MSE and sparsity. It is worth noting that the BLasso coefficient estimates are quite close to the Lasso solutions even for relatively large stepsizes. For the diabetes data, using a moderate stepsize ε = 0.05, the solution path cannot be distinguished from the exact regularization path. Moreover, even when the stepsize is as large as ε = 10 or ε = 50, the solutions are still good approximations. BLasso has only one stepsize parameter (with the exception of the numerical tolerance ξ, which is implementation specific but not necessarily a user parameter). This parameter controls both how closely BLasso approximates the minimizing coefficients for each λ and how closely two adjacent values of λ on the regularization path are placed. As can be seen from Figure 5, a smaller stepsize leads to a closer approximation to the solutions and also a finer grid for λ. We argue that if λ is sampled on a coarse grid, we should not spend computational power on finding a much more accurate approximation of the coefficients for each λ; instead, the available computational power should be balanced between these two coupled tasks. BLasso's one-parameter setup automatically balances these two aspects of the approximation, which is graphically expressed by the staircase shape of the solution paths. Another algorithm similar to Generalized BLasso was developed independently by Rosset (2004).
There, starting from λ = 0, a solution is generated by taking a small Newton-Raphson step for each λ; then λ is increased by a fixed amount. The algorithm assumes twice-differentiability of both


loss function and penalty function, and involves calculation of the Hessian matrix, which can be computationally heavy when the number p of covariates is not small. In comparison, BLasso uses only differences of the loss function, involves only basic operations, and does not require advanced mathematical knowledge of the loss function or penalty. It can also be used as a simple plug-in method for dealing with other convex penalties. Hence BLasso is easy to program and allows testing of different loss and penalty functions. Admittedly, this ease of implementation can cost computation time in large-p situations. BLasso's stepsize is defined in the original parameter space, which makes the solutions evenly spread in β's space rather than in λ. In general, since λ is approximately the reciprocal of the size of the penalty, as a fitted model grows larger and λ becomes smaller, changing λ by a fixed amount makes the algorithm in Rosset (2004) move too fast in the β space. On the other hand, when the model is close to empty and the penalty function is very small, λ is very large, but the algorithm still uses the same small steps; thus computation is spent on generating solutions that are too close to each other. As we discussed for the least squares problem, BLasso may also be computationally attractive for dealing with nonparametric learning problems with a large or an infinite number of base learners. This is mainly due to two facts. First, the forward step, as in Boosting, is a sub-optimization problem by itself, and Boosting's functional gradient descent strategy applies. For example, in the case of classification with trees, one can use the classification margin or the logistic loss function as the loss function and use a reweighting procedure to find the appropriate tree at each step (for details see, e.g., Breiman, 1998; Friedman et al., 2000).
In the case of regression with the L2 loss function, the minimization as in (6) is equivalent to refitting the residuals, as we described in the last section. The second fact is that, when using an iterative procedure like BLasso, we usually stop early to avoid overfitting and to get a sparse model. And even if the algorithm is kept running, it usually reaches a close-to-perfect fit without too many iterations. Therefore, the backward step's computational complexity is limited, because it involves only base learners that are already included from previous steps. There is, however, a difference in the BLasso algorithm between the case with a small number of base learners and that with a large or an infinite number of base learners. In the finite case, BLasso avoids oscillation by requiring each backward step to be strictly descending and relaxes λ whenever no descending step is available. Hence BLasso never reaches the same solution more than once, and the tolerance constant ξ can be set to 0 or a very small number to accommodate the program's numerical accuracy. In the nonparametric learning case, a different kind of oscillation can occur, where BLasso keeps going back and forth in different directions while improving the penalized loss function only by a diminishing amount; therefore a positive tolerance ξ is mandatory. In light of the proof of Theorem 1, we suggest choosing ξ = o(ε) to warrant a good approximation to the Lasso path. One direction for future research is to apply BLasso in an online or time series setting. Since BLasso has both forward and backward steps, we believe that an adaptive online learning algorithm can be devised based on BLasso, so that it goes back and forth to track the best regularization parameter and the corresponding model. We end with a summary of our main contributions:

1. By combining both forward and backward steps, the BLasso algorithm is constructed to minimize an L1 penalized convex loss function.
While it maintains the simplicity and flexibility of e-Boosting (or Forward Stagewise Fitting), BLasso efficiently approximates the Lasso solutions for general loss functions and large classes of base learners. This can be proven rigorously for a finite number of base learners under some assumptions.

2. The backward steps introduced in this paper are critical for producing the Lasso path. Without them, the FSF algorithm in general does not produce Lasso solutions, especially when the base learners are strongly correlated, as in cases where the number of base learners is larger than the number of observations. As a result, FSF loses some of the sparsity provided by Lasso and may also suffer in prediction performance, as suggested by our simulations.

3. We generalized BLasso into a simple, easy-to-implement, plug-in method for approximating the regularization path for other convex penalties.

4. We discuss, based on intuition and simulation results, the regularization effect of using stepsizes that are not very small.

Last but not least, Matlab code by Guilherme V. Rocha for BLasso in the case of the L2 loss and the L1 penalty can be downloaded at http://www.stat.berkeley.edu/twiki/Research/YuGroup/Software.

Acknowledgments

Yu would like to gratefully acknowledge partial support from NSF grants FD01-12731 and CCR-0106656 and ARO grant DAAD19-01-1-0643, and the Miller Research Professorship in Spring 2004 from the Miller Institute at the University of California at Berkeley. We thank Dr. Chris Holmes and Mr. Guilherme V. Rocha for their very helpful comments and discussions on the paper. Finally, we would like to thank three referees and the action editor for their thoughtful and detailed comments on an earlier version of the paper.

Appendix A. Proofs

Proof (Lemma 1)
1. Suppose there exist λ and j with |s| = ε such that Γ(s·1_j; λ) ≤ Γ(0; λ). Then we have

  ∑_{i=1}^n L(Z_i; 0) − ∑_{i=1}^n L(Z_i; s·1_j) ≥ λT(s·1_j) − λT(0) = λε.

Therefore

  λ ≤ (1/ε){ ∑_{i=1}^n L(Z_i; 0) − ∑_{i=1}^n L(Z_i; s·1_j) }
    ≤ (1/ε){ ∑_{i=1}^n L(Z_i; 0) − min_{j′,|s|=ε} ∑_{i=1}^n L(Z_i; s·1_{j′}) }
    = (1/ε){ ∑_{i=1}^n L(Z_i; 0) − ∑_{i=1}^n L(Z_i; β̂^0) }
    = λ^0.


2. Since a backward step is taken only when Γ(β̂^{t+1}; λ^t) < Γ(β̂^t; λ^t) − ξ and λ^{t+1} = λ^t, we only need to consider forward steps. When a forward step is forced, if Γ(β̂^{t+1}; λ^{t+1}) > Γ(β̂^t; λ^{t+1}) − ξ, then

  ∑_{i=1}^n L(Z_i; β̂^t) − ∑_{i=1}^n L(Z_i; β̂^{t+1}) − ξ < λ^{t+1} T(β̂^{t+1}) − λ^{t+1} T(β̂^t).

Hence

  (1/ε){ ∑_{i=1}^n L(Z_i; β̂^t) − ∑_{i=1}^n L(Z_i; β̂^{t+1}) − ξ } < λ^{t+1},

which contradicts the algorithm.

3. Since λ^{t+1} < λ^t and λ cannot be relaxed by a backward step, we immediately have ‖β̂^{t+1}‖₁ = ‖β̂^t‖₁ + ε. Then from

  λ^{t+1} = (1/ε){ ∑_{i=1}^n L(Z_i; β̂^t) − ∑_{i=1}^n L(Z_i; β̂^{t+1}) − ξ },

we get Γ(β̂^t; λ^{t+1}) − ξ = Γ(β̂^{t+1}; λ^{t+1}). Adding (λ^t − λ^{t+1})‖β̂^t‖₁ to both sides, and recalling T(β̂^{t+1}) = ‖β̂^{t+1}‖₁ > ‖β̂^t‖₁ = T(β̂^t), we get

  Γ(β̂^t; λ^t) − ξ < Γ(β̂^{t+1}; λ^t) = min_{j′,|s|=ε} Γ(β̂^t + s·1_{j′}; λ^t) ≤ Γ(β̂^t ± ε·1_j; λ^t)  for all j.

Proof (Theorem 1) Theorem 1 claims that "the BLasso path converges to the Lasso path uniformly" for ∑ L(Z; β) that is strongly convex with bounded second derivatives in β. Strong convexity and bounded second derivatives imply that the Hessian with respect to β satisfies mI ⪯ ∇² ∑ L ⪯ MI for positive constants M ≥ m > 0. Using this notation, we will show that for any t such that λ^{t+1} < λ^t, we have

  ‖β̂^t − β*(λ^t)‖₂ ≤ ( (M/m)ε + 2ξ/(εm) ) √p,   (9)

where β*(λ^t) ∈ R^p is the Lasso estimate with regularization parameter λ^t. The proof of (9) relies on the following inequalities for strongly convex functions, some of which can be found in Boyd and Vandenberghe (2004). First, because of the strong convexity, we have

  ∑ L(Z; β*(λ^t)) ≥ ∑ L(Z; β̂^t) + ∇∑L(Z; β̂^t)ᵀ (β*(λ^t) − β̂^t) + (m/2)‖β*(λ^t) − β̂^t‖₂².

The L1 penalty function is also convex, although neither strictly convex nor differentiable at 0, and we have

  ‖β*(λ^t)‖₁ ≥ ‖β̂^t‖₁ + δᵀ (β*(λ^t) − β̂^t)

for any p-dimensional vector δ with δ_i = sign(β̂^t_i) for the nonzero entries of β̂^t and |δ_i| ≤ 1 otherwise. Putting both inequalities together, we have

  Γ(β*(λ^t); λ^t) ≥ Γ(β̂^t; λ^t) + (∇∑L(Z; β̂^t) + λ^t δ)ᵀ (β*(λ^t) − β̂^t) + (m/2)‖β*(λ^t) − β̂^t‖₂².   (10)

Using (10), we can bound the L2 distance between β*(λ^t) and β̂^t by applying Cauchy-Schwarz to get

  Γ(β*(λ^t); λ^t) ≥ Γ(β̂^t; λ^t) − ‖∇∑L(Z; β̂^t) + λ^t δ‖₂ ‖β*(λ^t) − β̂^t‖₂ + (m/2)‖β*(λ^t) − β̂^t‖₂².

Since Γ(β*(λ^t); λ^t) ≤ Γ(β̂^t; λ^t), we have

  ‖β*(λ^t) − β̂^t‖₂ ≤ (2/m) ‖∇∑L(Z; β̂^t) + λ^t δ‖₂.   (11)

By statement (3) of Lemma 1, for β̂^t_j ≠ 0 we have

  ∑ L(Z; β̂^t ± ε sign(β̂^t_j)·1_j) ± λ^t ε ≥ ∑ L(Z; β̂^t) − ξ.   (12)

At the same time, by the bounded-Hessian assumption, we have

  ∑ L(Z; β̂^t ± ε sign(β̂^t_j)·1_j) ≤ ∑ L(Z; β̂^t) ± ε ∇∑L(Z; β̂^t)ᵀ sign(β̂^t_j)·1_j + (M/2)ε².   (13)

Connecting these two inequalities, we have

  ∓ε ( ∇∑L(Z; β̂^t)ᵀ 1_j sign(β̂^t_j) + λ^t ) ≤ (M/2)ε² + ξ,

and therefore

  | ∇∑L(Z; β̂^t)ᵀ 1_j sign(β̂^t_j) + λ^t | ≤ (M/2)ε + ξ/ε.   (14)

Similarly, for β̂^t_j = 0, instead of (12) we have

  ∑ L(Z; β̂^t ± ε·1_j) + λ^t ε ≥ ∑ L(Z; β̂^t) − ξ.

Combining with (13), we have

  | ∇∑L(Z; β̂^t)ᵀ 1_j | − λ^t ≤ (M/2)ε + ξ/ε.

For j such that β̂^t_j = 0, we choose δ_j appropriately and combine with (14), so that the right-hand side of (11) is controlled by √p × (2/m) × ((M/2)ε + ξ/ε). This yields (9).



References

E.L. Allgower and K. Georg. Homotopy methods for approximating several solutions to nonlinear systems of equations. In W. Forster, editor, Numerical Solution of Highly Nonlinear Problems, pages 253–270. North-Holland, 1980.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
L. Breiman. Arcing classifiers. The Annals of Statistics, 26:801–824, 1998.
P. Bühlmann and B. Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98, 2003.
E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics (to appear), 2007.
S. Chen and D. Donoho. Basis pursuit. Technical report, Department of Statistics, Stanford University, 1994.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2002.
D. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm near-solution approximates the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797–829, 2006.
D. Donoho, M. Elad, and V. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
I. Frank and J. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35:109–148, 1993.
Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121:256–285, 1995.
Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148–156. Morgan Kaufmann, San Francisco, 1996.
J.H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29:1189–1232, 2001.
J.H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:337–407, 2000.


W.J. Fu. Penalized regression: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
J. Gao, H. Suzuki, and B. Yu. Approximate lasso methods for language modeling. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, Sydney, pages 225–232, 2006.
T. Gedeon, A.E. Parker, and A.G. Dimitrov. Information distortion and neural coding. Canadian Applied Mathematics Quarterly, 2002.
T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer Verlag, 2001.
T. Hastie, J. Taylor, R. Tibshirani, and G. Walther. Forward stagewise regression and the monotone lasso. Technical report, Department of Statistics, Stanford University, 2006.
K. Knight and W.J. Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28:1356–1378, 2000.
L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers, 1999.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436–1462, 2005.
N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics (to appear), 2006.
M.R. Osborne, B. Presnell, and B.A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389–403, 2000a.
M.R. Osborne, B. Presnell, and B.A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319–337, 2000b.
S. Rosset. Tracking curved regularized optimization solution paths. In Advances in Neural Information Processing Systems, 2004.
S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941–973, 2004.
R.E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267–288, 1996.
N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In The 37th Annual Allerton Conference on Communication, Control and Computing, 1999.


J.A. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.
V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
M.J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using l1-constrained quadratic programming. Technical Report 709, Statistics Department, UC Berkeley, 2006.
C.-H. Zhang and J. Huang. The sparsity and bias of the lasso selection in high dimensional linear regression. Annals of Statistics (to appear), 2006.
T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682–691, 2003.
P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541–2563, 2006.
J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In Advances in Neural Information Processing Systems, 16, 2003.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.


Since the (Generalized) BLasso relies only on differences, not derivatives, we conclude that it provides a class of simple and easy-to-implement algorithms for tracing the regularization or solution paths of penalized minimization problems.

Keywords: backward step, boosting, convexity, Lasso, regularization path

1. Introduction

Many statistical machine learning algorithms minimize either an empirical loss function or a penalized empirical loss, in a regression or a classification setting where covariate or predictor variables are used to predict a response variable and i.i.d. samples of training data are available. For example, in classification, both AdaBoost (Schapire, 1990; Freund, 1995; Freund and Schapire, 1996) and SVM (Vapnik, 1995; Cristianini and Shawe-Taylor, 2002; Schölkopf and Smola, 2002) build linear classification functions of basis functions of the covariates (predictors). AdaBoost minimizes an

© 2007 Peng Zhao and Bin Yu.


exponential loss function of the margin, while SVM minimizes a penalized hinge loss function of the margin. A single regularization tuning parameter, the number of iterations in AdaBoost and the smoothing parameter in SVM, controls the bias-and-variance trade-off, or whether the algorithm overfits or underfits the data. When this tuning parameter changes, a regularization “path” of solutions to the minimization problem is generated. These algorithms have been shown to achieve state-of-the-art prediction performance. A tuning parameter value, or equivalently a point on the path of solutions, is chosen to minimize the estimated prediction error over a proper test set or through cross-validation. Hence it is necessary to have algorithms that can generate the path of solutions in an efficient manner. Path following algorithms have also been devised for other statistical penalized minimization problems, such as the information bottleneck (Tishby et al., 1999) and information distortion (Gedeon et al., 2002) problems, where the loss and penalty functions are different from those discussed in this paper. In fact, following the solution paths of numerical problems is the focus of homotopy, a sub-area of numerical analysis. Interested readers are referred to the book Allgower and Georg (1980) (http://www.math.colostate.edu/emeriti/georg/AllgGeorgHNA.pdf) and the references at http://www.math.colostate.edu/emeriti/georg/georg.publications.html. Among machine learning methods, those that produce sparse models are of great interest because sparse models lend themselves more easily to interpretation and are therefore preferred in the sciences and social sciences. Lasso (Tibshirani, 1996) is such a method. It minimizes the L2 loss function with an L1 penalty on the parameters in a linear regression model. In signal processing it is called Basis Pursuit (Chen and Donoho, 1994).
The L1 penalty leads to sparse solutions, that is, few predictors or basis functions have nonzero weights (among all possible choices). This statement has been proved asymptotically under various conditions by different authors (see, e.g., Knight and Fu, 2000; Osborne et al., 2000a,b; Donoho et al., 2006; Donoho, 2006; Rosset et al., 2004; Tropp, 2006; Zhao and Yu, 2006; Zou, 2006; Meinshausen and Yu, 2006; Zhang and Huang, 2006). Sparsity has also been observed in the models generated by Forward Stagewise Fitting (FSF) or e-Boosting. FSF is a gradient descent procedure with more cautious steps than L2 Boosting (i.e., the usual coordinatewise gradient descent method applied to the L2 loss) (Rosset et al., 2004; Hastie et al., 2001; Efron et al., 2004). Moreover, these papers also study the similarities between FSF (e-Boosting) and Lasso. The link between Lasso and e-Boosting or FSF is described more formally for the linear regression case through the LARS algorithm (Least Angle Regression; Efron et al., 2004). It is also shown in Efron et al. (2004) that in special cases (such as orthogonal designs) e-Boosting or FSF can approximate the Lasso path arbitrarily closely, but in general it is unclear what regularization criterion e-Boosting or FSF optimizes. As can be seen in our experiments (Figure 1 in Section 6.1), e-Boosting or FSF solutions can differ significantly from the Lasso solutions in the case of strongly correlated predictors, which are common in high-dimensional data problems. However, FSF is still used as an approximation to Lasso because it is often computationally prohibitive to solve the Lasso with general loss functions for many regularization parameters through quadratic programming. In this paper, we propose a new algorithm, BLasso, that connects Lasso with FSF or e-Boosting (the B in the name stands for this connection to boosting).
BLasso approximately generates the Lasso path in general situations, for both regression and classification, for L1-penalized convex loss functions. The motivation for BLasso is the critical observation that FSF or e-Boosting works only in a forward fashion: it takes the steps that reduce the empirical loss the most, regardless of the impact on model complexity (or the L1 penalty). Hence it is not able to adjust earlier steps. Taking a coordinate (difference) descent viewpoint of the Lasso minimization with a fixed stepsize, we introduce an

S TAGEWISE L ASSO

innovative “backward” step. This step uses the same minimization rule as the forward step to define each fitting stage, but with an extra rule that forces the model complexity, or L1 penalty, to decrease. By combining backward and forward steps, BLasso is able to go back and forth to approximate the Lasso path correctly. BLasso can be seen as a marriage between two families of successful methods. Computationally, BLasso works similarly to e-Boosting and FSF. It isolates the sub-optimization problem at each step from the whole process; that is, in the language of the boosting literature, each base learner is learned separately. This way BLasso can deal with different loss functions and large classes of base learners, like trees, wavelets and splines, by fitting a base learner at each step and aggregating the base learners as the algorithm progresses. Moreover, the solution path of BLasso can be shown to converge to that of the Lasso, which uses explicit global L1 regularization, for cases with a finite number of base learners. In contrast, e-Boosting or FSF can be seen as local regularization in the sense that, at any iteration, FSF with a fixed small stepsize only searches over those models which are one small step away from the current one in all possible directions corresponding to the base learners or predictors (cf. Hastie et al., 2006, for a recent interpretation of the ε → 0 case). In particular, we make three contributions in this paper via BLasso. First, by introducing the backward step we modify e-Boosting or FSF to follow the Lasso path and consequently generate models that are sparser, with equivalent or slightly better prediction performance in our simulations (with true sparse models and more predictors than the sample size). Second, by showing convergence of BLasso to the Lasso path, we further tighten the conceptual ties between e-Boosting or FSF and Lasso that have been considered in previous works.
Finally, since BLasso can be generalized to deal with other convex penalties and does not use any derivatives of the loss function or penalty, we provide the Generalized BLasso algorithm as a simple and easy-to-implement off-the-shelf method for approximating the regularization path for a general loss function and a general convex penalty. We would like to note that, for the original Lasso problem, that is, the least squares problem (L2 loss) with an L1 penalty, algorithms that give the entire Lasso path have been established, namely, the homotopy method by Osborne et al. (2000b) and the LARS algorithm by Efron et al. (2004). For parametric least squares problems where the number of predictors is not large, these methods are very efficient, as their computational complexity is on the same order as a single Ordinary Least Squares regression. For other problems such as classification, and for nonparametric setups like model fitting with trees, FSF or e-Boosting has been used as a tool for approximating the Lasso path (Rosset et al., 2004). For such problems, BLasso operates in a similar fashion to FSF or e-Boosting but, unlike FSF, BLasso can be shown to converge to the Lasso path quite generally when the stepsize goes to zero. The rest of the paper is organized as follows. In Section 2.1, the gradient view of Boosting is provided and FSF (e-Boosting) is reviewed as a coordinatewise descent method with a fixed stepsize on the L2 loss. In Section 2.2, the Lasso empirical minimization problem is reviewed. Section 3 introduces BLasso, which is a coordinate descent algorithm with a fixed stepsize applied to the Lasso minimization problem. Section 4 discusses the backward step, gives the intuition behind BLasso, and explains why FSF is unable to give the Lasso path. Section 5 introduces the Generalized BLasso algorithm, which deals with general convex penalties.
In Section 6, results of experiments with both simulated and real data are reported to demonstrate the attractiveness of BLasso. BLasso is shown to be a learning algorithm that gives sparse models and good prediction, and a simple plug-in method for approximating the regularization path for different convex loss functions and penalties. Moreover, we compare different choices of the stepsize and give evidence for the regularization


effect of using moderate stepsizes. Finally, Section 7 is a discussion and a summary. In particular, it comments on the computational complexity of BLasso, compares it with the algorithm in Rosset (2004), explores the possibility of using BLasso for nonparametric learning problems, summarizes the paper, and points to future directions.

2. Boosting, Forward Stagewise Fitting and the Lasso

Boosting was originally proposed as an iterative fitting procedure that builds up a model sequentially using a weak or base learner and then carries out a weighted averaging (Schapire, 1990; Freund, 1995; Freund and Schapire, 1996). More recently, boosting has been interpreted as a gradient descent algorithm on an empirical loss function. FSF or e-Boosting can be viewed as a gradient descent with a fixed small stepsize at each stage, and it produces solutions that are often close to the Lasso solutions (path). We now give a brief gradient descent view of Boosting and of FSF (e-Boosting), followed by a review of the Lasso minimization problem.

2.1 Boosting and Forward Stagewise Fitting

Given data Z_i = (Y_i, X_i), i = 1, ..., n, where the univariate Y can be continuous (regression problem) or discrete (classification problem), our task is to estimate the function F : ℝ^d → ℝ that minimizes an expected loss E[C(Y, F(X))], C(·, ·) : ℝ × ℝ → ℝ⁺. The most prominent examples of the loss function C(·, ·) include the exponential loss (AdaBoost), the logit loss and the L2 loss. The family of F(·) being considered is the set of ensembles of “base learners”

D = {F : F(x) = ∑_j β_j h_j(x), x ∈ ℝ^d, β_j ∈ ℝ},

where the family of base learners can be very large or contain infinitely many members, for example, trees, wavelets and splines. Letting β = (β_1, ..., β_j, ...)^T, we can re-parametrize the problem using L(Z, β) := C(Y, F(X)), where the specification of F is hidden by L to simplify our notation. To find an estimate for β, we set up an empirical minimization problem:

β̂ = arg min_β ∑_{i=1}^n L(Z_i; β).

Despite the fact that the empirical loss function is often convex in β, exact minimization is usually a formidable task for a moderately rich family of base learners, and with such function families exact minimization leads to overfitted models. Because the family of base learners is usually large, Boosting can be viewed as finding approximate solutions by applying functional gradient descent. This gradient descent view has been recognized and studied by various authors, including Breiman (1998), Mason et al. (1999), Friedman et al. (2000), Friedman (2001) and Bühlmann and Yu (2003). Precisely, boosting is a progressive procedure that iteratively builds up the solution (and it is often stopped early to avoid overfitting):


( ĵ, ĝ ) = arg min_{j,g} ∑_{i=1}^n L(Z_i; β̂_t + g 1_j),   (1)

β̂_{t+1} = β̂_t + ĝ 1_ĵ,   (2)

where 1_j is the jth standard basis vector, with all 0’s except for a 1 in the jth coordinate, and g ∈ ℝ is the stepsize. In other words, Boosting favors the direction ĵ that most reduces the empirical loss, and ĝ is found through a line search. The well-known AdaBoost, LogitBoost and L2 Boosting can all be viewed as implementations of this strategy for different loss functions. Forward Stagewise Fitting (FSF) (Efron et al., 2004) is a similar method for approximating the minimization problem described by (1) with some additional regularization. FSF has also been called e-Boosting, for ε-Boosting, as in Rosset et al. (2004). Instead of optimizing the stepsize as in (2), FSF updates β̂_t by a fixed stepsize ε, as in Friedman (2001). For general loss functions, FSF can be defined by removing the minimization over g in (1):

( ĵ, ŝ ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Z_i; β̂_t + s 1_j),   (3)

β̂_{t+1} = β̂_t + ŝ 1_ĵ.   (4)
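As a concrete illustration, the FSF update (3)–(4) can be sketched for the squared-error loss. This sketch, including the name fsf_path and its default parameters, is ours and not from the paper:

```python
import numpy as np

def fsf_path(X, y, eps=0.01, n_steps=200):
    """Forward Stagewise Fitting (e-Boosting) sketch for squared-error loss.

    At each iteration, try moving each coordinate by +/- eps and keep the
    move that most reduces the empirical L2 loss, as in (3)-(4).
    """
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(n_steps):
        resid = y - X @ beta
        best = None
        for j in range(p):
            for s in (eps, -eps):
                # residual after the candidate step beta_j -> beta_j + s
                loss = np.sum((resid - s * X[:, j]) ** 2)
                if best is None or loss < best[0]:
                    best = (loss, j, s)
        _, j_hat, s_hat = best
        beta[j_hat] += s_hat
        path.append(beta.copy())
    return np.array(path)
```

Note that the step is always taken, even near the optimum where it may slightly increase the loss; in practice the path is truncated by a stopping rule or cross-validation.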

This description looks different from the FSF described in Efron et al. (2004), but the underlying mechanics of the algorithm remain unchanged (see Section 5). Initially all coefficients are zero. At each successive step, a basis function or predictor or coordinate is selected that most reduces the empirical loss. Its corresponding coefficient β_ĵ is then incremented or decremented by a fixed amount ε, while all other coefficients β_j, j ≠ ĵ, are left unchanged. By taking small steps, FSF imposes some local regularization or shrinkage. A related approach can be found in Zhang (2003), where a relaxed gradient descent method is used. After T < ∞ iterations, many of the coefficients estimated by FSF will be zero, namely those that have yet to be incremented. The others will tend to have absolute values smaller than the unregularized solutions. This shrinkage/sparsity property is reflected in the similarity between the solutions given by FSF and Lasso, which is reviewed next.

2.2 General Lasso

Let T(β) denote the L1 penalty of β = (β_1, ..., β_j, ...)^T, that is, T(β) = ‖β‖₁ = ∑_j |β_j|, and let Γ(β; λ) denote the Lasso (least absolute shrinkage and selection operator) loss function

Γ(β; λ) = ∑_{i=1}^n L(Z_i; β) + λ T(β).

The general Lasso estimate β̂ = (β̂_1, ..., β̂_j, ...)^T is defined by

β̂_λ = arg min_β Γ(β; λ).

The parameter λ ≥ 0 controls the amount of regularization applied to the estimate. Setting λ = 0 reduces the Lasso problem to minimizing the unregularized empirical loss. On the other hand, a


very large λ will completely shrink β̂ to 0, thus leading to the empty or null model. In general, moderate values of λ will cause shrinkage of the solutions towards 0, and some coefficients may end up being exactly 0. This sparsity of Lasso solutions has been researched extensively in recent years (e.g., Osborne et al., 2000a,b; Efron et al., 2004; Donoho et al., 2006; Donoho, 2006; Tropp, 2006; Rosset et al., 2004; Meinshausen and Bühlmann, 2005; Candes and Tao, 2007; Zhao and Yu, 2006; Zou, 2006; Wainwright, 2006; Meinshausen and Yu, 2006; Zhang and Huang, 2006). Sparsity can also result from other penalties, as in, for example, Fan and Li (2001). Computation of the solution to the Lasso problem for a fixed λ has been studied for special cases. Specifically, for least squares regression, it is a quadratic programming problem with linear inequality constraints; for the 1-norm SVM, it can be transformed into a linear programming problem. But to get a model that performs well on future data, we need to select an appropriate value for the tuning parameter λ. Very efficient algorithms have been proposed to give the entire regularization path for the squared loss function (the homotopy method by Osborne et al. 2000b and similarly LARS by Efron et al. 2004) and for the SVM (1-norm SVM by Zhu et al., 2003). However, it remains open how to give the entire regularization path of the Lasso problem for general convex loss functions. FSF exists as a compromise since, like Boosting, it is a nonparametric learning algorithm that works with different loss functions and large numbers of base learners (predictors), but it provides only local regularization and does not converge to the Lasso path in general. As can be seen in Section 6.2, FSF also gives less sparse solutions than Lasso in our simulations. Next we propose the BLasso algorithm, which works in a computationally efficient fashion, like FSF.
In contrast to FSF, BLasso converges to the Lasso path for general convex loss functions when the stepsize goes to 0. This relationship between Lasso and BLasso leads to sparser solutions for BLasso than for FSF, with similar or slightly better prediction performance in our simulation set-up, for different choices of the stepsize.
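For concreteness, the Lasso loss Γ(β; λ) of Section 2.2 can be written down for the squared-error loss; the helper below is an illustrative sketch, not code from the paper:

```python
import numpy as np

def lasso_loss(beta, X, y, lam):
    """General Lasso loss Gamma(beta; lambda) with squared-error empirical
    loss: the sum of squared residuals plus lambda times the L1 norm of beta."""
    empirical = np.sum((y - X @ beta) ** 2)
    penalty = lam * np.sum(np.abs(beta))
    return empirical + penalty
```

Setting lam=0 recovers the unregularized empirical loss, while a very large lam is dominated by the L1 penalty, matching the discussion of λ above.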

3. The BLasso Algorithm

We first describe the BLasso algorithm (Algorithm 1). It has two related input parameters, a stepsize ε and a tolerance level ξ. The tolerance level is needed only to avoid numerical instability when assessing changes of the empirical loss function, and should be set as small as possible while accommodating the numerical accuracy of the implementation. (ξ is set to 10^{-6} in the implementation of the algorithm used in this paper.) We discuss the forward and backward steps in depth in the next section. Immediately, the following properties can be proved for BLasso (see the Appendix for the proof).

Lemma 1.
1. For any λ ≥ 0, if there exist j and s with |s| = ε such that Γ(s 1_j; λ) ≤ Γ(0; λ), then λ_0 ≥ λ.
2. For any t, we have Γ(β̂_{t+1}; λ_t) ≤ Γ(β̂_t; λ_t) − ξ.
3. For ξ ≥ 0 and any t such that λ_{t+1} < λ_t, we have Γ(β̂_t ± ε 1_j; λ_t) > Γ(β̂_t; λ_t) − ξ for every j, and ‖β̂_{t+1}‖₁ = ‖β̂_t‖₁ + ε.

Lemma 1 (1) guarantees that it is safe for BLasso to start with the initial λ_0, which is the largest λ that would allow an ε step away from 0 (i.e., larger λ’s correspond to β̂_λ = 0). Lemma 1 (2) says


Algorithm 1 BLasso

Step 1 (initialization). Given data Z_i = (Y_i, X_i), i = 1, ..., n, a small stepsize constant ε > 0, and a small tolerance parameter ξ > 0, take an initial forward step

( ĵ, ŝ_ĵ ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Z_i; s 1_j),
β̂_0 = ŝ_ĵ 1_ĵ.

Then calculate the initial regularization parameter

λ_0 = (1/ε) ( ∑_{i=1}^n L(Z_i; 0) − ∑_{i=1}^n L(Z_i; β̂_0) ).

Set the active index set I_A^0 = {ĵ}. Set t = 0.

Step 2 (backward and forward steps). Find the “backward” step that leads to the minimal empirical loss:

ĵ = arg min_{j ∈ I_A^t} ∑_{i=1}^n L(Z_i; β̂_t + s_j 1_j), where s_j = −sign(β̂_{t,j}) ε.   (5)

Take the step if it leads to a decrease of at least ξ in the Lasso loss; otherwise force a forward step (as in (3), (4) of FSF) and relax λ if necessary:

If Γ(β̂_t + ŝ_ĵ 1_ĵ; λ_t) − Γ(β̂_t; λ_t) ≤ −ξ, then

β̂_{t+1} = β̂_t + ŝ_ĵ 1_ĵ,   λ_{t+1} = λ_t.

Otherwise,

( ĵ, ŝ ) = arg min_{j, s=±ε} ∑_{i=1}^n L(Z_i; β̂_t + s 1_j),   (6)
β̂_{t+1} = β̂_t + ŝ 1_ĵ,
λ_{t+1} = min[ λ_t, (1/ε) ( ∑_{i=1}^n L(Z_i; β̂_t) − ∑_{i=1}^n L(Z_i; β̂_{t+1}) − ξ ) ],
I_A^{t+1} = I_A^t ∪ {ĵ}.

Step 3 (iteration). Increase t by one and repeat Steps 2 and 3. Stop when λ_t ≤ 0.

that for each value of λ, BLasso performs coordinate descent until there is no descent step. Then, by Lemma 1 (3), the value of λ is reduced and a forward step is forced. The stepsize ε controls the fineness of the grid BLasso runs on. The tolerance ξ controls how large a descent needs to be for a backward step to be taken. It is needed to accommodate numerical error and should be set much smaller than ε to obtain a good approximation (see the proof of Theorem 1). In fact, we have a convergence result for BLasso (a detailed proof is included in the Appendix):


Theorem 1. For a finite number of base learners and ξ = o(ε), if ∑ L(Z_i; β) is strongly convex with bounded second derivatives in β, then as ε → 0 the BLasso path converges to the Lasso path uniformly.

Note that Conjecture 2 of Rosset et al. (2004) follows from Theorem 1. This is because if all the optimal coefficient paths are monotone, then BLasso will never take a backward step, so it will be equivalent to e-Boosting. Many popular loss functions, for example the squared loss, the logistic loss, and negative log-likelihood functions of exponential families, are convex and twice differentiable, and they satisfy the conditions of Theorem 1. Moreover, from the proof of this theorem in the appendix, it is easy to see that it suffices to have the conditions of the theorem satisfied over a bounded set of β. For the exponential loss, Lemma 1 implies that there is a finite λ_0 < ∞ for every data set (Z_i). Thus we can restrict the proof of Theorem 1 to this bounded set of β to obtain the result for the exponential loss. Other functions, like the hinge loss (SVM), are continuous and convex but not differentiable. The differentiability, however, is needed only for the proof of Theorem 1. BLasso does not use any gradient or higher-order derivatives, only differences of the loss function, and therefore remains applicable to loss functions that are not differentiable or whose differentiation is too complex or computationally expensive. It is theoretically possible that BLasso’s coordinate descent strategy gets stuck at nondifferentiable points for functions like the hinge loss. However, as illustrated in our third experiment, BLasso may still work empirically for cases like the 1-norm SVM. Theorem 1 also does not cover nonparametric learning problems with an infinite number of base learners.
In fact, for problems with a large or infinite number of base learners, the minimization in (6) is usually done approximately by functional gradient descent, and a tolerance ξ > 0 needs to be chosen to avoid oscillation between forward and backward steps caused by slow descent. We discuss this topic further in the discussion (Section 7).
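To make Algorithm 1 concrete, here is a minimal sketch of BLasso for the squared-error loss. All names (blasso_path, emp_loss) and defaults are illustrative assumptions, and the brute-force coordinate search is kept deliberately simple rather than efficient:

```python
import numpy as np

def blasso_path(X, y, eps=0.01, xi=1e-6, max_steps=500):
    """Sketch of BLasso (Algorithm 1) for squared-error loss.

    Forward steps mimic FSF; backward steps, restricted to active
    coordinates and moving toward zero, are taken whenever they decrease
    the Lasso loss Gamma(beta; lambda) by at least xi.
    """
    n, p = X.shape

    def emp_loss(beta):
        return np.sum((y - X @ beta) ** 2)

    def forward(beta):
        # best coordinate move of size +/- eps for the empirical loss
        best = None
        for j in range(p):
            for s in (eps, -eps):
                cand = beta.copy()
                cand[j] += s
                l = emp_loss(cand)
                if best is None or l < best[0]:
                    best = (l, j, s)
        return best

    # Step 1: initial forward step and initial lambda
    _, j0, s0 = forward(np.zeros(p))
    beta = np.zeros(p)
    beta[j0] = s0
    lam = (emp_loss(np.zeros(p)) - emp_loss(beta)) / eps
    active = {j0}
    path = [(lam, beta.copy())]

    for _ in range(max_steps):
        if lam <= 0:
            break
        # Step 2: candidate backward step over active, nonzero coordinates
        back = None
        for j in active:
            if beta[j] == 0:
                continue
            cand = beta.copy()
            cand[j] -= np.sign(beta[j]) * eps
            l = emp_loss(cand)
            if back is None or l < back[0]:
                back = (l, j)
        took_backward = False
        if back is not None:
            l_back, j_hat = back
            # Lasso loss change: empirical change minus lam*eps penalty drop
            if (l_back - emp_loss(beta)) - lam * eps <= -xi:
                beta[j_hat] -= np.sign(beta[j_hat]) * eps
                took_backward = True
        if not took_backward:
            # forced forward step; relax lambda if necessary
            l_fwd, j_hat, s_hat = forward(beta)
            drop = emp_loss(beta) - l_fwd
            beta[j_hat] += s_hat
            lam = min(lam, (drop - xi) / eps)
            active.add(j_hat)
        path.append((lam, beta.copy()))
    return path
```

Each entry of the returned path pairs the current λ_t with β̂_t; since λ is only ever updated through the min in the forward step, it is non-increasing over the run.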

4. The Backward Step

We now explain the motivation and working mechanics of BLasso. Observe that FSF uses only “forward” steps, that is, it only takes steps that lead to a direct reduction of the empirical loss. Compared with classical model selection methods like Forward Selection and Backward Elimination, or the growing and pruning of a classification tree, a “backward” counterpart is missing. Without the backward step, when FSF picks up more irrelevant variables than the Lasso path does in some cases (cf. Figure 1 in Section 6.2), it has no mechanism to remove them. As seen below, this backward step arises naturally in BLasso because of our coordinate descent view of the minimization of the Lasso loss. (Since ξ exists for numerical purposes only, it is assumed to be 0 and thus excluded from the following theoretical discussion.) For a given β ≠ 0 and λ > 0, consider the impact of a small ε > 0 change of β_j on the Lasso loss Γ(β; λ). For |s| = ε,

Δ_j Γ(Z; β) = ( ∑_{i=1}^n L(Z_i; β + s 1_j) − ∑_{i=1}^n L(Z_i; β) ) + λ ( T(β + s 1_j) − T(β) )
            := Δ_j ( ∑_{i=1}^n L(Z_i; β) ) + λ Δ_j T(β).

Since T(β) is simply the L1 norm of β, Δ_j T(β) reduces to a simple form:


Δ_j T(β) = ‖β + s 1_j‖₁ − ‖β‖₁ = |β_j + s| − |β_j|
         = ε · sign⁺(β_j, s)
         = ε · { +1 if sβ_j > 0 or β_j = 0;  −1 if sβ_j < 0 }.   (7)
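The sign pattern in (7) can be checked directly; the helper below is an illustrative sketch (assuming, as in the algorithm, that nonzero coordinates are multiples of ε, so the change is exactly ±ε):

```python
def penalty_change(beta_j, s):
    """Change in the L1 penalty, |beta_j + s| - |beta_j|, for a step s with
    |s| = eps, as in (7): +eps when the step moves beta_j away from zero
    (or starts from zero), -eps when it moves toward zero."""
    return abs(beta_j + s) - abs(beta_j)
```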

Equation (7) shows that an ε step changes the penalty by a fixed amount ε in absolute value for any j; that is, only the sign of the penalty change may vary. In the beginning of BLasso, all j directions are leaving zero, and hence each changes the L1 penalty by the same positive amount λ · ε. Therefore the first step of BLasso is a forward step, because minimizing the Lasso loss is then equivalent to minimizing the empirical loss, the change of the L1 penalty being the same positive amount for every direction. As the algorithm proceeds, some of the penalty changes may become negative, and minimizing the empirical loss is no longer equivalent to minimizing the Lasso loss. In fact, except for special cases like orthogonal covariates (predictors), the FSF steps might result in negative changes of the L1 penalty. In some of these situations, a step that goes “backward” reduces the penalty with a small sacrifice in the empirical loss. In general, to minimize the Lasso loss, one needs to go “back and forth” to trade off the penalty against the empirical loss for different regularization parameters. To be precise, for a given β̂, a backward step is one such that

Δβ̂ = s_j 1_j, subject to β̂_j ≠ 0, sign(s) = −sign(β̂_j) and |s| = ε.

Making such a step will reduce the penalty by the fixed amount λ · ε, but its impact on the empirical loss can differ; therefore, as in (5), we want:

ĵ = arg min_j ∑_{i=1}^n L(Z_i; β̂ + s_j 1_j) subject to β̂_j ≠ 0 and s_j = −sign(β̂_j) ε,

that is, ĵ is selected so that the empirical loss after making the step is as small as possible. While forward steps try to reduce the Lasso loss by minimizing the empirical loss, backward steps try to reduce the Lasso loss by reducing the Lasso penalty. In summary, by allowing backward steps, we are able to work with the Lasso loss directly and to correct earlier forward steps that might have picked up irrelevant variables. Since much of the discussion of the similarity and difference between FSF and Lasso has focused on least squares problems (e.g., Efron et al., 2004; Hastie et al., 2001), we next examine the BLasso algorithm in this case. It is straightforward to see that in LS problems both forward and backward steps in BLasso are based only on the correlations between the fitted residuals and the covariates (predictors). It follows that BLasso in this case reduces to finding the best direction in both forward and backward steps by examining the inner products, and then deciding whether to go forward or backward based on the regularization parameter. This not only simplifies the minimization procedure but also significantly reduces the computational complexity for a large number of observations, since the inner product between the residual η_t and X_j can be updated by

(η_{t+1})′ X_j = (η_t − s X_{ĵ_t})′ X_j = (η_t)′ X_j − s X_{ĵ_t}′ X_j,   (8)

which takes only one operation if X j0ˆt X j is precalculated. Therefore, when the number of base learners is small, based on precalculated X 0 X and Y 0 X, BLasso could use (8) to make its computation 2709
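The one-operation update (8) can be checked numerically; the variable names below are ours and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 4
X = rng.standard_normal((n, p))
eta = rng.standard_normal(n)          # current residual vector
j_hat, s = 2, 0.1                     # step of size s on coordinate j_hat

G = X.T @ X                           # precomputed Gram matrix X'X
ip_old = X.T @ eta                    # inner products before the step
eta_new = eta - s * X[:, j_hat]       # residual after the step
ip_direct = X.T @ eta_new             # O(np) recomputation
ip_update = ip_old - s * G[:, j_hat]  # O(p) update via (8)

assert np.allclose(ip_direct, ip_update)
```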

ZHAO AND YU

complexity independent from the number of observations. This nice property is not surprising as it is also observed in established algorithms like LARS and Osborne’s homotopy method which are specialized for LS problems. In nonparametric situations, the number of base learners is large therefore the aforementioned strategy becomes inefficient. BLasso has a natural extention to this case as follows: similar to boosting, the forward step is carried out by a sub-optimization procedure such as fitting trees, smoothing splines or stumps. For the backward step, only inner-products between base learners that have entered the model need to be calculated. The inner products between these base learners and residuals can be updated by (8). This makes the backward steps’ computation complexity proportional to the number of base learners that are already chosen instead of the number of all possible base learners. Therefore BLasso works not only for cases with large sample size but also for cases where a class of large or infinite number of possible base learners is given. As mentioned earlier, there are already established efficient algorithms for solving the least square (L2 ) Lasso problem, for example, the homotopy method by Osborne et al. (2000b) and LARS (Efron et al., 2004). These algorithms are very efficient for giving the exact Lasso paths for parametric settings. For nonparametric learning problems with a large or an infinite number of base learners, we believe BLasso is an attractive strategy for approximating the path of the Lasso, as it shares the same computational strategy as Boosting which has proven itself successful in applications. Also, in cases where the Ordinary Least Square (OLS) method performs well, BLasso can be modified to start from the OLS estimate, go backward and stop in a few iterations.

5. Generalized BLasso

As stated earlier, BLasso not only works for general convex loss functions, but also extends to convex penalties other than the L1 penalty. For the Lasso problem, BLasso performs a fixed-stepsize coordinate descent to minimize the penalized loss. Since the penalty is the special L1 norm and (7) holds, a step's impact on the penalty has a fixed size ε with either a positive or a negative sign, and the coordinate descent takes the form of "backward" and "forward" steps. This reduces the minimization of the penalized loss function to unregularized minimizations of the loss function as in (6) and (5). For general convex penalties, since a step on different coordinates does not necessarily have the same impact on the penalty, one is forced to work with the penalized function directly. Assume T(β): R^m → R is a convex penalty function. We next describe the Generalized BLasso algorithm (Algorithm 2). In the Generalized BLasso algorithm, explicit "forward" or "backward" steps are no longer seen. However, the mechanism remains the same: minimize the penalized loss function for each λ, and relax the regularization by reducing λ through a "forward" step when the minimum of the penalized loss function for the current λ is reached.

6. Experiments

In this section, three experiments are carried out to illustrate the attractiveness of BLasso. The first experiment runs BLasso in the classical Lasso setting on the diabetes data set (cf. Efron et al., 2004), often used in studies of the Lasso, with an added artificial covariate to highlight the difference between BLasso and FSF. The added covariate is strongly correlated with a couple of the original covariates (predictors). In this case, BLasso is seen to produce a path almost exactly the

STAGEWISE LASSO

Algorithm 2 Generalized BLasso

Step 1 (initialization). Given data Z_i = (Y_i, X_i), i = 1, ..., n, a fixed small stepsize ε > 0 and a small tolerance parameter ξ ≥ 0, take an initial forward step

(ĵ, ŝ_ĵ) = arg min_{j, s=±ε} Σ_{i=1}^n L(Z_i; s 1_j),   β̂⁰ = ŝ_ĵ 1_ĵ.

Then calculate the corresponding regularization parameter

λ₀ = [Σ_{i=1}^n L(Z_i; 0) − Σ_{i=1}^n L(Z_i; β̂⁰)] / [T(β̂⁰) − T(0)].

Set t = 0.

Step 2 (steepest descent on the Lasso loss). Find the steepest coordinate descent direction on the penalized loss:

(ĵ, ŝ_ĵ) = arg min_{j, s=±ε} Γ(β̂^t + s 1_j; λ_t).

Update β̂ if it reduces the Lasso loss by at least ξ; otherwise force β̂ to minimize L and recalculate the regularization parameter:

If Γ(β̂^t + ŝ_ĵ 1_ĵ; λ_t) − Γ(β̂^t; λ_t) < −ξ, then

β̂^{t+1} = β̂^t + ŝ_ĵ 1_ĵ,   λ_{t+1} = λ_t.

Otherwise,

(ĵ, ŝ_ĵ) = arg min_{j, |s|=ε} Σ_{i=1}^n L(Z_i; β̂^t + s 1_j),
β̂^{t+1} = β̂^t + ŝ_ĵ 1_ĵ,
λ_{t+1} = min[ λ_t, (Σ_{i=1}^n L(Z_i; β̂^t) − Σ_{i=1}^n L(Z_i; β̂^{t+1})) / (T(β̂^{t+1}) − T(β̂^t)) ].

Step 3 (iteration). Increase t by one and repeat Steps 2 and 3. Stop when λ_t ≤ 0.
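For concreteness, here is a minimal sketch of Algorithm 2 for the squared loss with the L1 penalty T(β) = ‖β‖₁ (where Generalized BLasso coincides with BLasso). The function and variable names are ours, and the exhaustive candidate search favors clarity over speed.

```python
import numpy as np

def generalized_blasso(X, y, eps=0.05, xi=1e-10, max_steps=2000):
    """Sketch of Algorithm 2 with squared loss and T(beta) = ||beta||_1.
    Returns the list of (lambda_t, beta_t) visited along the path."""
    p = X.shape[1]
    loss = lambda b: float(np.sum((y - X @ b) ** 2))
    T = lambda b: float(np.abs(b).sum())

    def unit(j):
        e = np.zeros(p)
        e[j] = 1.0
        return e

    def best_loss_step(b):
        # Fixed-size coordinate step minimizing the empirical loss.
        return min((loss(b + s * unit(j)), j, s)
                   for j in range(p) for s in (eps, -eps))

    # Step 1: initial forward step and lambda_0.
    _, j, s = best_loss_step(np.zeros(p))
    beta = s * unit(j)
    lam = (loss(np.zeros(p)) - loss(beta)) / (T(beta) - 0.0)
    path = [(lam, beta.copy())]

    for _ in range(max_steps):
        if lam <= 0:
            break
        # Step 2: steepest coordinate step on the penalized loss.
        gamma = lambda b: loss(b) + lam * T(b)
        g_new, j, s = min((gamma(beta + s * unit(j)), j, s)
                          for j in range(p) for s in (eps, -eps))
        if g_new - gamma(beta) < -xi:
            beta = beta + s * unit(j)          # descend, keep lambda
        else:
            # Force a loss-minimizing step and relax lambda.
            _, j, s = best_loss_step(beta)
            beta_new = beta + s * unit(j)
            lam = min(lam, (loss(beta) - loss(beta_new))
                           / (T(beta_new) - T(beta)))
            beta = beta_new
        path.append((lam, beta.copy()))
    return path
```

With a general convex penalty only `T` changes; the same loop applies.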

same as the Lasso path, which shrinks the added irrelevant variable back to zero, while FSF's path departs drastically from the Lasso's due to the added strongly correlated covariate and does not move it back to zero. In the second experiment, we compare the prediction and variable-selection performance of FSF and BLasso in a least squares regression simulation with a large number (p = 500 >> n = 50) of randomly correlated base learners, to emulate the nonparametric learning scenario when the true model is sparse. The results show that, overall, BLasso gives sparser solutions than FSF with similar or slightly better predictions, and this holds for various stepsizes. Moreover, we find that as the stepsize increases, there is a regularization effect in terms of both prediction and sparsity, for both BLasso and FSF.


The last experiment illustrates BLasso as an off-the-shelf method for computing the regularization path for general convex loss functions and general convex penalties. Two cases are presented. The first is bridge regression (Frank and Friedman, 1993) on the diabetes data using different Lγ (γ ≥ 1) norms as penalties. The other is a simulated classification problem using the 1-norm SVM (Zhu et al., 2003) with the hinge loss.

6.1 L2 Regression with L1 Penalty (Classical Lasso)

The data set used in this experiment is the diabetes data set, where n = 442 diabetes patients were measured on 10 baseline predictor variables X^1, ..., X^10. A prediction model was desired for the response variable Y, a quantitative measure of disease progression one year after baseline. We add one additional predictor variable to make the difference between the FSF and Lasso solutions more visible. This added variable is X^11 = −X^7 + X^8 + 5X^9 + e, where e is i.i.d. Gaussian noise (mean zero and variance 1/442). The following vector gives the correlations of X^11 with X^1, X^2, ..., X^10:

(0.25, 0.24, 0.47, 0.39, 0.48, 0.38, −0.58, 0.76, 0.94, 0.47).

The classical Lasso (L2 regression with L1 penalty) is applied to this data set with the added covariate. Location and scale transformations are made so that all the covariates (predictors) are standardized to have mean 0 and unit length, and the response has mean zero. The penalized loss function has the form

Γ(β; λ) = Σ_{i=1}^n (Y_i − X_i β)² + λ‖β‖₁.

The middle panel of Figure 1 shows the coefficient path plot for BLasso applied to the modified diabetes data. The left (Lasso) and middle (BLasso) panels are indistinguishable from each other. Both FSF and BLasso pick up the added artificial, strongly correlated X^11 (the solid line) in the early stages, but due to its greedy nature FSF is not able to remove X^11 in the later stages; thus every parameter estimate is affected, leading to solutions significantly different from the Lasso's. The BLasso solutions were built up in 8700 steps (with the stepsize ε = 0.5 small enough that the coefficient paths are smooth), 840 of which were backward steps. In comparison, FSF took 7300 pure forward steps. BLasso's backward steps concentrate mainly around the steps where FSF and BLasso tend to differ.

6.2 Comparison of BLasso and Forward Stagewise Fitting by Simulation

In this experiment, we compare the model estimates generated by FSF and BLasso in a large p (= 500) and small n (= 50) setting to mimic a nonparametric learning scenario in which FSF and BLasso are computationally attractive. In this least squares regression simulation, the design is randomly generated as described below to guarantee a fair amount of correlation among the covariates (predictors). Otherwise, if the design were close to orthogonal, the FSF and BLasso paths would be too similar for this simulation to yield interesting results.


Figure 1: Regularization path plots for the diabetes data set for Lasso, BLasso and FSF: the curves (paths) of the estimates β̂_j for the 10 original and 1 added covariates (predictors), as the regularization is relaxed, that is, as t = Σ_j |β̂_j| increases. The thick solid curves correspond to the 11th, added covariate. Left panel: Lasso solution paths (produced using a simplex search method on the penalized empirical loss function for each λ) as a function of t = ‖β‖₁. Middle panel: BLasso solution paths, which are indistinguishable from the Lasso solutions. Right panel: FSF solution paths, which are different from Lasso and BLasso.

We first draw 5 covariance matrices C_i, i = 1, ..., 5, as 0.95 × D_i + 0.05 I_{p×p}, where D_i is sampled from Wishart(20, p) and then normalized to have 1's on the diagonal. The Wishart distribution creates a fair amount of correlation between the covariates in C_i (the average absolute value is about 0.18), and the added identity matrix guarantees that C_i is full rank. For each covariance matrix C_i, the design X is then drawn independently from N(0, C_i) with n = 50. The target variable Y is computed as Y = Xβ + e, where β₁ to β_q with q = 7 are drawn independently from N(0, 1) and β₈ to β₅₀₀ are set to zero to create a sparse model; e is a Gaussian noise vector with mean zero and variance 1. For each of the 5 cases with different C_i, both BLasso and FSF are run using stepsizes ε = 1/5, 1/10, 1/20, 1/40 and 1/80. We also run Lasso, which is listed as BLasso with ε = 0.

To compare the performances, we examine the solutions on the regularization paths that give the smallest mean squared error ‖Xβ − Xβ̂‖². The mean squared errors (on the log scale) of these solutions are tabulated together with the number of nonzero estimates in each solution. All cases are run 50 times and the average results are reported in Table 1. As can be seen from Table 1, since the true model is sparse, in almost all cases the BLasso solutions are sparser and have similar prediction performance compared with the FSF solutions with the same stepsize. It is also interesting to note that smaller stepsizes require more computation but often give worse predictions and much less sparsity. We conjecture that there is a regularization effect caused by the discretization of the solution paths (more discussion in Section 7); this effect has also been observed by Gao et al. (2006) in a language ranking problem.
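The design generation can be sketched as follows. The function name is ours, and we form the Wishart draw directly as GᵀG rather than calling a library sampler.

```python
import numpy as np

def make_design(rng, p=500, n=50, q=7, df=20):
    """Sketch of the simulation design: Wishart-based correlated
    covariance, Gaussian design, sparse true coefficients."""
    G = rng.standard_normal((df, p))
    D = G.T @ G                             # D ~ Wishart(df, I_p)
    d = np.sqrt(np.diag(D))
    D = D / np.outer(d, d)                  # normalize: 1's on diagonal
    C = 0.95 * D + 0.05 * np.eye(p)         # full-rank mixture
    L = np.linalg.cholesky(C)
    X = rng.standard_normal((n, p)) @ L.T   # rows ~ N(0, C)
    beta = np.zeros(p)
    beta[:q] = rng.standard_normal(q)       # sparse true model
    y = X @ beta + rng.standard_normal(n)   # unit-variance noise
    return X, y, beta
```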


Design        ε = 1/5        ε = 1/10       ε = 1/20       ε = 1/40       ε = 1/80       Lasso (ε = 0)
              BLasso  FSF    BLasso  FSF    BLasso  FSF    BLasso  FSF    BLasso  FSF
C1  MSE       18.60  19.77   18.27  19.40   18.33  19.60   18.60  19.82   19.42  19.96   19.98
    q̂         15.38  18.32   20.08  24.00   21.76  27.28   21.44  30.48   20.50  32.14   21.86
C2  MSE       19.58  20.67   19.28  20.29   19.65  20.63   19.94  20.94   20.76  21.11   21.12
    q̂         14.80  18.34   18.92  21.90   20.18  25.70   21.22  28.80   20.52  29.38   21.82
C3  MSE       18.83  19.35   18.14  19.11   18.55  19.52   18.90  19.78   19.32  19.93   20.15
    q̂         15.22  15.38   19.10  19.72   19.92  23.30   20.02  25.88   19.52  27.30   21.08
C4  MSE       20.09  21.53   19.88  21.09   19.85  21.13   20.20  21.35   21.84  21.57   21.70
    q̂         15.76  18.90   20.82  24.64   22.20  30.38   22.42  32.02   21.12  34.16   22.24
C5  MSE       18.79  19.99   18.62  19.92   18.70  19.84   19.09  20.19   19.47  20.36   20.12
    q̂         15.58  17.10   19.16  23.24   21.26  28.24   21.92  30.94   22.18  32.84   22.76

Table 1: Comparison of FSF and BLasso in a simulated nonparametric regression setting. The log of the MSE and q̂ = number of nonzeros are reported for the oracle solutions on the regularization paths. All results are averaged over 50 runs.

Design                ε = 1/5        ε = 1/10       ε = 1/20       ε = 1/40       ε = 1/80
C1  MSE BLasso−Lasso  -1.38 (0.37)   -1.71 (0.23)   -1.65 (0.21)   -1.38 (0.21)   -0.56 (0.35)
        BLasso−FSF    -1.17 (0.27)   -1.13 (0.28)   -1.27 (0.26)   -1.22 (0.26)   -0.54 (0.24)
    q̂  BLasso−Lasso  -6.48 (0.64)   -1.78 (0.70)   -0.10 (0.67)   -0.42 (0.63)   -1.36 (0.65)
        BLasso−FSF    -2.94 (0.89)   -3.92 (1.22)   -5.52 (1.26)   -9.04 (1.43)   -11.64 (1.64)
C2  MSE BLasso−Lasso  -1.54 (0.37)   -1.84 (0.29)   -1.47 (0.26)   -1.18 (0.25)   -0.36 (0.45)
        BLasso−FSF    -1.09 (0.32)   -1.01 (0.27)   -0.98 (0.23)   -1.00 (0.23)   -0.35 (0.38)
    q̂  BLasso−Lasso  -7.02 (0.58)   -2.90 (0.65)   -1.64 (0.52)   -0.60 (0.50)   -1.30 (0.48)
        BLasso−FSF    -3.54 (0.99)   -2.98 (0.88)   -5.52 (1.09)   -7.58 (1.31)   -8.86 (1.41)
C3  MSE BLasso−Lasso  -1.32 (0.35)   -2.01 (0.36)   -1.60 (0.33)   -1.25 (0.32)   -0.83 (0.32)
        BLasso−FSF    -0.53 (0.28)   -0.97 (0.22)   -0.97 (0.23)   -0.88 (0.23)   -0.62 (0.24)
    q̂  BLasso−Lasso  -5.86 (0.81)   -1.98 (0.72)   -1.16 (0.54)   -1.06 (0.55)   -1.56 (0.56)
        BLasso−FSF    -0.16 (0.78)   -0.62 (0.87)   -3.38 (1.05)   -5.86 (0.97)   -7.78 (1.08)
C4  MSE BLasso−Lasso  -1.61 (0.45)   -1.82 (0.33)   -1.85 (0.33)   -1.50 (0.33)    0.14 (0.66)
        BLasso−FSF    -1.44 (0.30)   -1.20 (0.28)   -1.28 (0.24)   -1.15 (0.29)    0.27 (0.67)
    q̂  BLasso−Lasso  -6.48 (0.71)   -1.42 (0.85)   -0.04 (0.73)    0.18 (0.52)   -1.12 (0.67)
        BLasso−FSF    -3.14 (0.92)   -3.82 (1.16)   -8.18 (1.12)   -9.60 (1.35)   -13.04 (1.68)
C5  MSE BLasso−Lasso  -1.33 (0.38)   -1.50 (0.26)   -1.41 (0.26)   -1.03 (0.22)   -0.65 (0.22)
        BLasso−FSF    -1.20 (0.25)   -1.30 (0.23)   -1.14 (0.28)   -1.10 (0.29)   -0.89 (0.28)
    q̂  BLasso−Lasso  -7.18 (0.84)   -3.60 (0.64)   -1.50 (0.58)   -0.84 (0.52)   -0.58 (0.55)
        BLasso−FSF    -1.52 (0.88)   -4.08 (1.10)   -6.98 (1.08)   -9.02 (1.21)   -10.66 (1.50)

Table 2: Means and standard errors of the differences in MSE and q̂ between BLasso and Lasso, and between BLasso and FSF, from Table 1.


Figure 2: Plots of the in-sample mean squared error (y-axis) versus ‖β‖₁ (x-axis) for a typical realization of the experiment (one run under C2 from Table 1). Each panel shows the BLasso, FSF and Lasso curves. The stepsize is set to ε = 1/80 in the left plot and ε = 1/5 in the right.

Table 2 gives a further analysis of the results in Table 1. It contains the means and standard errors of the differences in MSE and q̂ between BLasso and Lasso and between BLasso and FSF, for the stepsizes given in Table 1. First of all, all the mean differences are negative, and when compared with their standard errors the differences are also significant, except for a few cells at the small stepsizes 1/40 and 1/80 (in the last two columns). This overwhelming pattern of significant negative differences suggests that, for this simulation, BLasso is better than Lasso and FSF in terms of both prediction and sparsity unless the stepsize is very small, as in the last two columns. Moreover, for MSE the stepsize ε = 1/10 seems to bring the best improvement of BLasso over Lasso, and the improvement is fairly robust to the choice of stepsize. On the other hand, the improvements of BLasso over FSF in MSE are smaller than those of BLasso over Lasso, because FSF has the same discrete stepsizes; these improvements therefore reflect the gains from the backward steps alone, since FSF also takes forward steps. In terms of q̂, the number of covariates selected, as expected, the larger the stepsize, the sparser the BLasso model is relative to the Lasso or FSF model. The sparsity improvements over Lasso are significant for all cells except the last column with ε = 1/80. When compared with FSF, the sparsity improvements are smaller (though still significant). In terms of gains in both MSE and sparsity relative to both Lasso and FSF, the stepsizes 1/10 and 1/20, that is, 0.1 and 0.05, seem good overall choices for this simulation study.


As suggested by one referee, we compare the Lasso empirical loss functions induced by BLasso, FSF and Lasso (through LARS). Figure 2 shows plots of the in-sample mean squared error versus the L1 norm of the coefficients, taken from one typical run of the simulation conducted in this section. As shown by the plots, the in-sample MSE from BLasso approximates the in-sample MSE from the Lasso better than that from FSF under both large and small stepsizes. In particular, when the stepsize is small, the BLasso path is almost indiscernible from the Lasso path. A final comment on Figure 2 is in order. Although the in-sample MSE curve for BLasso in the right panel of Figure 2 does seem to go up at the end of the plot, we cannot extend the x-axis to higher ‖β‖₁ values because, at stepsize ε = 1/5, the BLasso solution has reached its maximum L1 norm of around 14–15, the maximum of the x-axis in the right panel of Figure 2.

6.3 Generalized BLasso for Other Penalties and Nondifferentiable Loss Functions

First, to demonstrate Generalized BLasso with different penalties, we use the bridge regression setting with the diabetes data set (without the covariate added in the first experiment). Bridge regression (first proposed by Frank and Friedman, 1993, and later more carefully discussed and implemented by Fu, 2001) is a generalization of ridge regression (L2 penalty) and the Lasso (L1 penalty). It considers a linear (L2) regression problem with an Lγ penalty for γ ≥ 1 (to maintain the convexity of the penalty function). The penalized loss function has the form

Γ(β; λ) = Σ_{i=1}^n (Y_i − X_i β)² + λ‖β‖_γ^γ,

where γ is the bridge parameter. The data used in this experiment are centered and rescaled as in the first experiment. Generalized BLasso successfully produced the paths for all 5 cases, which are verified by pointwise minimization using the simplex method (γ = 1, γ = 1.1, γ = 4 and γ = ∞) or closed-form solutions (γ = 2). It is interesting to note the phase transition from the near-Lasso to the Lasso: the solution paths are similar, but only the Lasso has sparsity. Also, as γ grows larger, the estimates for different β_j tend to have more similar magnitudes, and in the extreme γ = ∞ there is a "branching" phenomenon: the estimates stay together in the beginning and branch out in different directions as the path progresses.

To demonstrate the Generalized BLasso algorithm for classification with a nondifferentiable loss function and an L1 penalty, we consider binary classification with the hinge loss. As in Zhu et al. (2003), we generate n = 50 training data points in each of two classes. The first class has two standard normal independent inputs X^1 and X^2 and class label Y = −1. The second class also has two standard normal independent inputs, but conditioned on 4.5 ≤ (X^1)² + (X^2)² ≤ 8, and has class label Y = 1. We wish to find a classification rule from the training data, so that given a new input we can assign a label from {1, −1} to it. The 1-norm SVM (Zhu et al., 2003) is used to estimate β:

(β̂₀, β̂) = arg min_{β₀, β} Σ_{i=1}^n (1 − Y_i(β₀ + Σ_{j=1}^m β_j h_j(X_i)))_+ + λ Σ_{j=1}^5 |β_j|,

where the h_j ∈ D are basis functions and λ is the regularization parameter. The dictionary of basis functions is D = {√2 X^1, √2 X^2, √2 X^1 X^2, (X^1)², (X^2)²}. Notice that β₀ is left unregularized, so the penalty function is not the full L1 penalty.
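A sketch of evaluating this 1-norm SVM objective for the five-element dictionary; the function names are ours.

```python
import numpy as np
from math import sqrt

def dictionary(x1, x2):
    # The five basis functions h_j from the dictionary D above.
    return np.array([sqrt(2) * x1, sqrt(2) * x2,
                     sqrt(2) * x1 * x2, x1 ** 2, x2 ** 2])

def svm_objective(beta0, beta, X, y, lam):
    # 1-norm SVM objective: hinge loss plus L1 penalty on beta
    # (beta0 is left unpenalized, as in the text).
    H = np.array([dictionary(x1, x2) for x1, x2 in X])
    margins = y * (beta0 + H @ beta)
    return np.maximum(0.0, 1.0 - margins).sum() + lam * np.abs(beta).sum()
```

Generalized BLasso then descends this objective with fixed-size coordinate steps on (β₀, β), penalizing only the β coordinates.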


Figure 3: Upper panel: solution paths produced by BLasso for different bridge parameters on the diabetes data set. From left to right: Lasso (γ = 1), near-Lasso (γ = 1.1), ridge (γ = 2), over-ridge (γ = 4), max (γ = ∞). The y-axis is the parameter estimate and has the range [−800, 800]. The x-axis for each of the left 4 plots is Σ_i |β_i|; for the 5th plot it is max_i |β_i|, because Σ_i |β_i| is unsuitable there. Lower panel: the corresponding penalty equal-contours |β₁|^γ + |β₂|^γ = 1.


Figure 4: Estimates of the 1-norm SVM coefficients β̂_j, j = 1, 2, ..., 5, for the simulated two-class classification data. Left panel: BLasso solutions as a function of t = Σ_{j=1}^5 |β̂_j|. Right panel: scatter plot of the data points with labels: '+' for y = −1; 'o' for y = 1.

The fitted model is

f̂(x) = β̂₀ + Σ_{j=1}^m β̂_j h_j(x),

and the classification rule is given by sign(f̂(x)). Since the loss function is not differentiable, we do not have a theoretical guarantee that BLasso works. Nonetheless, the solution path produced by Generalized BLasso has the same sparsity and piecewise linearity as the 1-norm SVM solutions shown in Zhu et al. (2003). It takes Generalized BLasso 490 iterations to generate the solutions. The covariates enter the regression equation sequentially as t increases, in the following order: the two quadratic terms first, followed by the interaction term and then the two linear terms. Like the 1-norm SVM in Zhu et al. (2003), BLasso correctly picked up the quadratic terms early; the interaction term and the linear terms, which are not in the true model, come up much later. In other words, the BLasso results are in good agreement with Zhu et al.'s 1-norm SVM results, and we regard this as a confirmation of BLasso's effectiveness in this nondifferentiable example.

7. Discussion and Concluding Remarks

As seen from our simulations under sparse true models, BLasso generates sparser solutions with similar or slightly better predictions relative to Lasso and FSF. The behavior relative to Lasso is due to the discrete stepsize of BLasso, while the behavior relative to FSF is partially explained by BLasso's convergence to the Lasso path as the stepsize goes to 0. We believe that the generalized version


Figure 5: Estimates of the regression coefficient β̂₃ for the diabetes data set, plotted as functions of λ. Dotted line: estimates using stepsize ε = 0.05. Solid line: estimates using stepsize ε = 10. Dash-dot line: estimates using stepsize ε = 50.

is also effective as an off-the-shelf algorithm for general convex penalized loss minimization problems. Computationally, BLasso takes roughly O(1/ε) steps to produce the whole path. Depending on the actual loss function, the base learners and the minimization method used in each step, the actual computational complexity varies. As shown in the simulations, choosing a smaller stepsize gives a smoother solution path but does not guarantee a better prediction. In fact, for the particular simulation set-up in Section 6.2, moderate stepsizes gave better results in terms of both MSE and sparsity. It is worth noting that the BLasso coefficient estimates are quite close to the Lasso solutions even for relatively large stepsizes. For the diabetes data, with a moderate stepsize ε = 0.05, the solution path cannot be distinguished from the exact regularization path. Moreover, even when the stepsize is as large as ε = 10 or ε = 50, the solutions are still good approximations. BLasso has only one stepsize parameter (with the exception of the numerical tolerance ξ, which is implementation specific but not necessarily a user parameter). This parameter controls both how closely BLasso approximates the minimizing coefficients for each λ and how closely two adjacent λ's on the regularization path are placed. As can be seen from Figure 5, a smaller stepsize leads to a closer approximation to the solutions and also to a finer grid for λ. We argue that if λ is sampled on a coarse grid, we should not spend computational power on finding a much more accurate approximation of the coefficients for each λ; instead, the computational power spent on these two coupled tasks should be balanced. BLasso's one-parameter setup automatically balances these two aspects of the approximation, which is expressed graphically by the staircase shape of the solution paths.

Another algorithm similar to Generalized BLasso was developed independently by Rosset (2004). There, starting from λ = 0, a solution is generated by taking a small Newton-Raphson step for each λ; then λ is increased by a fixed amount. The algorithm assumes twice-differentiability of both


the loss function and the penalty function, and involves calculating the Hessian matrix, which can be computationally heavy when the number p of covariates is not small. In comparison, BLasso uses only differences of the loss function, involves only basic operations, and does not require advanced mathematical knowledge of the loss function or penalty. It can also be used as a simple plug-in method for dealing with other convex penalties. Hence BLasso is easy to program and allows testing of different loss and penalty functions. Admittedly, this ease of implementation can cost computation time in large-p situations. BLasso's stepsize is defined in the original parameter space, which makes the solutions evenly spread in β's space rather than in λ. In general, since λ is approximately the reciprocal of the size of the penalty, as the fitted model grows larger and λ becomes smaller, changing λ by a fixed amount makes the algorithm in Rosset (2004) move too fast in β space. On the other hand, when the model is close to empty and the penalty function is very small, λ is very large, but the algorithm still uses the same small steps; computation is thus spent generating solutions that are too close to each other.

As we discussed for the least squares problem, BLasso may also be computationally attractive for nonparametric learning problems with a large or an infinite number of base learners. This is mainly due to two facts. First, the forward step, as in boosting, is a sub-optimization problem by itself, and boosting's functional gradient descent strategy applies. For example, in the case of classification with trees, one can use the classification margin or the logistic loss function as the loss function and use a reweighting procedure to find the appropriate tree at each step (for details see, e.g., Breiman, 1998; Friedman et al., 2000).
In the case of regression with the L2 loss function, the minimization in (6) is equivalent to refitting the residuals, as we described in the last section. The second fact is that, when using an iterative procedure like BLasso, we usually stop early to avoid overfitting and to get a sparse model; and even if the algorithm is kept running, it usually reaches a close-to-perfect fit without too many iterations. Therefore the computational complexity of the backward steps is limited, because they involve only the base learners already included in previous steps.

There is, however, a difference in the BLasso algorithm between the case with a small number of base learners and that with a large or infinite number of base learners. In the finite case, BLasso avoids oscillation by requiring a backward step to be strictly descending and relaxes λ whenever no descending step is available. Hence BLasso never reaches the same solution more than once, and the tolerance constant ξ can be set to 0 or to a very small number accommodating the program's numerical accuracy. In the nonparametric learning case, a different kind of oscillation can occur, in which BLasso keeps going back and forth in different directions while improving the penalized loss function only by a diminishing amount; a positive tolerance ξ is therefore mandatory. As suggested by the proof of Theorem 1, we recommend choosing ξ = o(ε) to guarantee a good approximation to the Lasso path.

One direction for future research is to apply BLasso in an online or time series setting. Since BLasso has both forward and backward steps, we believe that an adaptive online learning algorithm can be devised based on BLasso, so that it goes back and forth to track the best regularization parameter and the corresponding model. We end with a summary of our main contributions:

1. By combining both forward and backward steps, the BLasso algorithm is constructed to minimize an L1 penalized convex loss function.
While it maintains the simplicity and flexibility of e-Boosting (or Forward Stagewise Fitting), BLasso efficiently approximates the Lasso solutions for general loss functions and large classes of base learners. This can be proven rigorously for a finite number of base learners under some assumptions.

2. The backward steps introduced in this paper are critical for producing the Lasso path. Without them, the FSF algorithm in general does not produce Lasso solutions, especially when the base learners are strongly correlated, as in cases where the number of base learners is larger than the number of observations. As a result, FSF loses some of the sparsity provided by Lasso and might also suffer in prediction performance, as suggested by our simulations.

3. We generalized BLasso as a simple, easy-to-implement, plug-in method for approximating the regularization path for other convex penalties.

4. Discussions based on intuition and simulation results are given on the regularization effect of using stepsizes that are not very small.

Last but not least, Matlab code by Guilherme V. Rocha for BLasso in the case of L2 loss and L1 penalty can be downloaded at http://www.stat.berkeley.edu/twiki/Research/YuGroup/Software.

Acknowledgments

Yu gratefully acknowledges partial support from NSF grants FD01-12731 and CCR-0106656, ARO grant DAAD19-01-1-0643, and a Miller Research Professorship in Spring 2004 from the Miller Institute at the University of California at Berkeley. We thank Dr. Chris Holmes and Mr. Guilherme V. Rocha for their very helpful comments and discussions on the paper. Finally, we would like to thank the three referees and the action editor for their thoughtful and detailed comments on an earlier version of the paper.

Appendix A. Proofs

Proof (Lemma 1)

1. Assume there exist λ and j with |s| = ε such that Γ(s1_j; λ) ≤ Γ(0; λ). Then we have

Σ_{i=1}^n L(Z_i; 0) − Σ_{i=1}^n L(Z_i; s1_j) ≥ λT(s1_j) − λT(0).

Therefore

λ ≤ (1/ε) {Σ_{i=1}^n L(Z_i; 0) − Σ_{i=1}^n L(Z_i; s1_j)}
  ≤ (1/ε) {Σ_{i=1}^n L(Z_i; 0) − min_{j', |s|=ε} Σ_{i=1}^n L(Z_i; s1_{j'})}
  = (1/ε) {Σ_{i=1}^n L(Z_i; 0) − Σ_{i=1}^n L(Z_i; β̂⁰)}
  = λ₀.


2. Since a backward step is taken only when Γ(β̂^{t+1}; λ_t) < Γ(β̂^t; λ_t) − ξ and λ_{t+1} = λ_t, we only need to consider forward steps. When a forward step is forced, if Γ(β̂^{t+1}; λ_{t+1}) > Γ(β̂^t; λ_{t+1}) − ξ, then

Σ_{i=1}^n L(Z_i; β̂^t) − Σ_{i=1}^n L(Z_i; β̂^{t+1}) − ξ < λ_{t+1} T(β̂^{t+1}) − λ_{t+1} T(β̂^t).

Hence

(1/ε) {Σ_{i=1}^n L(Z_i; β̂^t) − Σ_{i=1}^n L(Z_i; β̂^{t+1}) − ξ} < λ_{t+1},

which contradicts the algorithm.

3. Since λ_{t+1} < λ_t and λ cannot be relaxed by a backward step, we immediately have ‖β̂^{t+1}‖₁ = ‖β̂^t‖₁ + ε. Then from

λ_{t+1} = (1/ε) {Σ_{i=1}^n L(Z_i; β̂^t) − Σ_{i=1}^n L(Z_i; β̂^{t+1}) − ξ},

we get Γ(β̂^t; λ_{t+1}) − ξ = Γ(β̂^{t+1}; λ_{t+1}). Adding (λ_t − λ_{t+1})‖β̂^t‖₁ to both sides, and recalling T(β̂^{t+1}) = ‖β̂^{t+1}‖₁ > ‖β̂^t‖₁ = T(β̂^t), we get

Γ(β̂^t; λ_t) − ξ < Γ(β̂^{t+1}; λ_t) = min_{j', |s|=ε} Γ(β̂^t + s1_{j'}; λ_t) ≤ Γ(β̂^t ± ε1_j; λ_t)   for all j.

Proof (Theorem 1) Theorem 1 claims that "the BLasso path converges to the Lasso path uniformly" for Σ L(Z; β) that is strongly convex with bounded second derivatives in β. Strong convexity and bounded second derivatives imply that the Hessian with respect to β satisfies mI ⪯ ∇² Σ L ⪯ MI for positive constants M ≥ m > 0. Using this notation, we will show that for any t such that λ_{t+1} < λ_t, we have

‖β̂^t − β*(λ_t)‖₂ ≤ ((M/m) ε + 2ξ/(εm)) √p,    (9)

where β*(λ_t) ∈ R^p is the Lasso estimate with regularization parameter λ_t. The proof of (9) relies on the following inequalities for strongly convex functions, some of which can be found in Boyd and Vandenberghe (2004). First, by strong convexity,

Σ L(Z; β*(λ_t)) ≥ Σ L(Z; β̂^t) + ∇Σ L(Z; β̂^t)^T (β*(λ_t) − β̂^t) + (m/2) ‖β*(λ_t) − β̂^t‖₂².


The $L_1$ penalty function is also convex, although neither strictly convex nor differentiable at $0$, but we have
$$\|\beta^*(\lambda^t)\|_1 \ge \|\hat\beta^t\|_1 + \delta^T \big(\beta^*(\lambda^t) - \hat\beta^t\big)$$
for any $p$-dimensional vector $\delta$ with $\delta_i = \operatorname{sign}(\hat\beta^t_i)$ for the nonzero entries and $|\delta_i| \le 1$ otherwise. Putting both inequalities together, we have
$$\Gamma(\beta^*(\lambda^t); \lambda^t) \ge \Gamma(\hat\beta^t; \lambda^t) + \big(\nabla \sum L(Z; \hat\beta^t) + \lambda^t \delta\big)^T \big(\beta^*(\lambda^t) - \hat\beta^t\big) + \frac{m}{2}\|\beta^*(\lambda^t) - \hat\beta^t\|_2^2. \qquad (10)$$
Using Equation (10), we can bound the $L_2$ distance between $\beta^*(\lambda^t)$ and $\hat\beta^t$ by applying Cauchy-Schwarz:
$$\Gamma(\beta^*(\lambda^t); \lambda^t) \ge \Gamma(\hat\beta^t; \lambda^t) - \big\|\nabla \sum L(Z; \hat\beta^t) + \lambda^t \delta\big\|_2 \|\beta^*(\lambda^t) - \hat\beta^t\|_2 + \frac{m}{2}\|\beta^*(\lambda^t) - \hat\beta^t\|_2^2.$$
Since $\Gamma(\beta^*(\lambda^t); \lambda^t) \le \Gamma(\hat\beta^t; \lambda^t)$, we have
$$\|\beta^*(\lambda^t) - \hat\beta^t\|_2 \le \frac{2}{m}\big\|\nabla \sum L(Z; \hat\beta^t) + \lambda^t \delta\big\|_2. \qquad (11)$$
By statement (3) of Lemma 1, for $\hat\beta^t_j \ne 0$ we have
$$\sum L(Z; \hat\beta^t \pm \varepsilon\,\operatorname{sign}(\hat\beta^t_j) 1_j) \pm \lambda^t \varepsilon \ge \sum L(Z; \hat\beta^t) - \xi. \qquad (12)$$
At the same time, by the bounded Hessian assumption, we have
$$\sum L(Z; \hat\beta^t \pm \varepsilon\,\operatorname{sign}(\hat\beta^t_j) 1_j) \le \sum L(Z; \hat\beta^t) \pm \varepsilon \nabla \sum L(Z; \hat\beta^t)^T \operatorname{sign}(\hat\beta^t_j) 1_j + \frac{M}{2}\varepsilon^2. \qquad (13)$$
Connecting these two inequalities, we have
$$\mp \varepsilon \Big(\nabla \sum L(Z; \hat\beta^t)^T 1_j \operatorname{sign}(\hat\beta^t_j) + \lambda^t\Big) \le \frac{M}{2}\varepsilon^2 + \xi,$$
therefore
$$\Big|\nabla \sum L(Z; \hat\beta^t)^T 1_j \operatorname{sign}(\hat\beta^t_j) + \lambda^t\Big| \le \frac{M}{2}\varepsilon + \frac{\xi}{\varepsilon}. \qquad (14)$$
Similarly, for $\hat\beta^t_j = 0$, instead of (12) we have
$$\sum L(Z; \hat\beta^t \pm \varepsilon 1_j) + \lambda^t \varepsilon \ge \sum L(Z; \hat\beta^t) - \xi.$$
Combined with (13), this gives
$$\Big|\nabla \sum L(Z; \hat\beta^t)^T 1_j\Big| - \lambda^t \le \frac{M}{2}\varepsilon + \frac{\xi}{\varepsilon}.$$
For $j$ such that $\hat\beta^t_j = 0$, we choose $\delta_j$ appropriately and combine with (14) so that the right-hand side of (11) is controlled by $\sqrt{p} \times \frac{2}{m} \times \big(\frac{M}{2}\varepsilon + \frac{\xi}{\varepsilon}\big)$. This yields (9).
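The key inequality (11) can be verified numerically for a quadratic loss, where $m$ is the smallest eigenvalue of the Hessian. The data below are hypothetical; the Lasso minimizer $\beta^*(\lambda)$ is computed by plain coordinate descent (a standard solver, not the paper's algorithm), and the trial point standing in for $\hat\beta^t$ is an arbitrary perturbation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 30, 3, 5.0
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
m = np.linalg.eigvalsh(X.T @ X).min()   # strong-convexity constant of the loss

def grad(b):
    """Gradient of the quadratic loss  sum_i (y_i - x_i' b)^2 / 2."""
    return X.T @ (X @ b - y)

def soft(z, t):
    """Soft-thresholding operator used by coordinate descent."""
    return np.sign(z) * max(abs(z) - t, 0.0)

# beta*(lambda): Lasso minimizer via coordinate descent on the objective
b_star = np.zeros(p)
for _ in range(2000):
    for j in range(p):
        r = y - X @ b_star + X[:, j] * b_star[j]   # partial residual
        b_star[j] = soft(X[:, j] @ r, lam) / (X[:, j] @ X[:, j])

# Arbitrary trial point standing in for the BLasso iterate beta_hat^t.
b_hat = b_star + 0.05 * rng.standard_normal(p)

# delta equals sign(b_hat_j) wherever b_hat_j != 0, as (11) requires.
delta = np.sign(b_hat)
dist = np.linalg.norm(b_hat - b_star)
bound = (2.0 / m) * np.linalg.norm(grad(b_hat) + lam * delta)
print(dist <= bound)
```

Since the derivation of (11) uses only convexity of the penalty and $m$-strong convexity of the loss, the check holds for any trial point, not just BLasso iterates.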



References

E.L. Allgower and K. Georg. Homotopy methods for approximating several solutions to nonlinear systems of equations. In W. Forster, editor, Numerical Solution of Highly Nonlinear Problems, pages 253-270. North-Holland, 1980.
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
L. Breiman. Arcing classifiers. The Annals of Statistics, 26:801-824, 1998.
P. Bühlmann and B. Yu. Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98, 2003.
E. Candès and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. Annals of Statistics (to appear), 2007.
S. Chen and D. Donoho. Basis pursuit. Technical report, Department of Statistics, Stanford University, 1994.
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2002.
D. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm near-solution approximates the sparsest solution. Communications on Pure and Applied Mathematics, 59(6):797-829, 2006.
D. Donoho, M. Elad, and V. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6-18, 2006.
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32:407-499, 2004.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-1360, 2001.
I. Frank and J. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35:109-148, 1993.
Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121:256-285, 1995.
Y. Freund and R.E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pages 148-156. Morgan Kaufmann, San Francisco, 1996.
J.H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29:1189-1232, 2001.
J.H. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28:337-407, 2000.
W.J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397-416, 1998.
J. Gao, H. Suzuki, and B. Yu. Approximate lasso methods for language modeling. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, Sydney, pages 225-232, 2006.
T. Gedeon, A.E. Parker, and A.G. Dimitrov. Information distortion and neural coding. Canadian Applied Mathematics Quarterly, 2002.
T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, 2001.
T. Hastie, J. Taylor, R. Tibshirani, and G. Walther. Forward stagewise regression and the monotone lasso. Technical report, Department of Statistics, Stanford University, 2006.
K. Knight and W.J. Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28:1356-1378, 2000.
L. Mason, J. Baxter, P. Bartlett, and M. Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers, 1999.
N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436-1462, 2006.
N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Annals of Statistics (to appear), 2006.
M.R. Osborne, B. Presnell, and B.A. Turlach. A new approach to variable selection in least squares problems. IMA Journal of Numerical Analysis, 20(3):389-403, 2000a.
M.R. Osborne, B. Presnell, and B.A. Turlach. On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2):319-337, 2000b.
S. Rosset. Tracking curved regularized optimization solution paths. In Advances in Neural Information Processing Systems, 2004.
S. Rosset, J. Zhu, and T. Hastie. Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, 5:941-973, 2004.
R.E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.
B. Schölkopf and A.J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.
N. Tishby, F.C. Pereira, and W. Bialek. The information bottleneck method. In The 37th Annual Allerton Conference on Communication, Control and Computing, 1999.
J.A. Tropp. Just relax: convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030-1051, 2006.
V.N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
M.J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using l1-constrained quadratic programming. Technical Report 709, Statistics Department, UC Berkeley, 2006.
C.-H. Zhang and J. Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics (to appear), 2006.
T. Zhang. Sequential greedy approximation for certain convex optimization problems. IEEE Transactions on Information Theory, 49(3):682-691, 2003.
P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541-2563, 2006.
J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In Advances in Neural Information Processing Systems, 16, 2003.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418-1429, 2006.
