Efficient Algorithms for Ranking with SVMs
Introduction
Methods:
- Pointwise (predict a score for each document close to its relevance label)
- Pairwise (loss defined on pairs of documents, based on which of the two is more relevant)
- Listwise (training loss function is based on the list of all participating documents and their scores)
Pairwise:
- RankBoost
- RankNet
- LambdaRank
- GBRank
- RankSVM
Main differences
- Type of loss function
- Training methods
RankSVM
- Minimizing objective function: \(\frac12||w||^2 + C\sum_{(i,j)\in P} l(w^T x_i - w^T x_j)\)
- l is a loss function such as the squared hinge loss \(l(t)=\max(0,1-t)^2\) (see the objective sketch after this list)
- SVMLight forms all pairwise differences and solves the problem in the dual
- Limiting the number of iterations reduces training time, but incomplete training gives poor results
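To make the objective concrete, here is a minimal NumPy sketch that only evaluates it for a given weight vector, assuming the squared hinge loss and a `pairs` list in which (i, j) means document i is preferred to document j; the function name and conventions are illustrative, not the paper's code.

```python
import numpy as np

def ranksvm_objective(w, X, pairs, C):
    """Evaluate (1/2)||w||^2 + C * sum_{(i,j) in P} l(w^T x_i - w^T x_j)
    with the squared hinge loss l(t) = max(0, 1 - t)^2.
    `pairs` is a list of (i, j) index pairs, document i preferred to document j."""
    scores = X @ w                                    # one score per document
    diffs = np.array([scores[i] - scores[j] for i, j in pairs])
    return 0.5 * (w @ w) + C * np.sum(np.maximum(0.0, 1.0 - diffs) ** 2)
```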
This paper:
- RankSVM trained in the primal with a Newton method
- Two ways:
- efficient computation of the Newton step in \(O(nd+p)\) (n documents, d features, p preference pairs)
- exploiting the structure of P when it arises from a small number of relevance levels, reducing the cost to \(O(nd+n\log n)\)
Newton optimization for linear SVM
Objective function, \(y_i \in \{-1,1\}\): \[\min_{w,b} \frac12 \|w\|^2 + C \sum^{n}_{i=1}\max(0,1-y_i (w^T x_i +b))^2\]
No need to solve in the dual; instead one can use:
- Non-linear conjugate gradient
- uses the gradient of the objective function and a line search procedure (secant method / cubic polynomial interpolation)
- each objective evaluation in the line search is O(n) once Xs is precomputed (X: data matrix, s: search direction); see the sketch after this list
- Truncated Newton
- \(w \leftarrow w - H^{-1}g\), H Hessian, g gradient of the objective function
- in truncated Newton the Hessian is never formed explicitly; the linear system is solved with linear conjugate gradient
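A small sketch of why the line search is cheap: once \(Xw\) and \(Xs\) are cached, each trial step length only touches length-n vectors. The function name is illustrative and the bias term is omitted for brevity.

```python
import numpy as np

def objective_along_direction(Xw, Xs, y, w, s, C):
    """Return f(t) = 0.5*||w + t*s||^2 + C*sum_i max(0, 1 - y_i*(Xw + t*Xs)_i)^2.
    Xw = X @ w and Xs = X @ s are computed once (O(nd)); after that every
    evaluation of f inside the 1-D line search costs only O(n)."""
    ww, ws, ss = w @ w, w @ s, s @ s                  # scalars, reused for all t
    def f(t):
        margins = y * (Xw + t * Xs)                   # O(n)
        hinge = np.maximum(0.0, 1.0 - margins)
        return 0.5 * (ww + 2 * t * ws + t * t * ss) + C * np.sum(hinge ** 2)
    return f
```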
Both approaches are roughly equivalent (Chapelle, 2007a); we use truncated Newton
- Support vectors: \(sv = \{i : y_i(w^T x_i+b) < 1\}\)
- Gradient \(g:=w+2C\sum_{i \in sv}(w^T x_i +b -y_i)x_i\)
- Hessian \(H:=I+2C\sum_{i \in sv}x_i x^T_i\)
- Newton step \(H^{-1}g\) computed with linear conjugate gradient; only Hessian-vector products are needed: for any vector s, \(Hs=s+2CX^TD(Xs)\)
- D: diagonal matrix with \(D_{ii}=1\) if \(i \in sv\), 0 otherwise (see the sketch below)
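Putting the pieces above together, a minimal sketch of one truncated-Newton step for the primal squared-hinge SVM, using SciPy's conjugate-gradient solver through a `LinearOperator` so the Hessian is only ever applied to vectors. The helper name and keeping the bias fixed are simplifications for the example, not the paper's implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def truncated_newton_step(X, y, w, b, C):
    """Return the Newton direction H^{-1} g for the primal objective above,
    using only Hessian-vector products Hs = s + 2C X^T D (X s).
    The bias b is treated as fixed here to keep the sketch short."""
    n, d = X.shape
    sv = (y * (X @ w + b)) < 1                        # active set: y_i(w^T x_i + b) < 1
    # gradient: g = w + 2C sum_{i in sv} (w^T x_i + b - y_i) x_i
    g = w + 2 * C * (X.T @ (sv * (X @ w + b - y)))
    def hess_vec(s):
        s = np.ravel(s)                               # ensure 1-D
        # Hessian applied to a vector, never formed explicitly (D_ii = 1 iff i in sv)
        return s + 2 * C * (X.T @ (sv * (X @ s)))
    H = LinearOperator((d, d), matvec=hess_vec)
    step, _ = cg(H, g)                                # linear conjugate gradient
    return step                                       # update: w <- w - step
```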
First implementation
- Linear SVM can be used for RankSVM
- form one example per pair: \(x_i - x_j\) for each \((i,j)\in P\); in matrix form \(X \leftarrow AX\), where each row of A has one entry equal to \(+1\) and one equal to \(-1\) (sketch below)
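A sketch of this expansion with a sparse pair matrix, so that A itself stays cheap to store; the `pairs` list, its (preferred, less relevant) ordering, and the function name are assumptions for illustration.

```python
import numpy as np
from scipy import sparse

def pair_difference_matrix(pairs, n):
    """Build the p x n matrix A with A[k, i] = +1 and A[k, j] = -1 for the
    k-th preference pair (i, j); the expanded training set is then A @ X,
    fed to a plain linear SVM with all labels equal to +1 and no bias."""
    p = len(pairs)
    rows = np.repeat(np.arange(p), 2)
    cols = np.array(pairs).ravel()
    vals = np.tile([1.0, -1.0], p)
    return sparse.csr_matrix((vals, (rows, cols)), shape=(p, n))

# X_pairs = pair_difference_matrix(pairs, X.shape[0]) @ X   # p x d matrix of x_i - x_j rows
```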
Better algorithm
- \(g=w+C\,g_L\)
- \(Hs=s+C\,H_L s\), where \(g_L\) and \(H_L\) are the gradient and Hessian of the pairwise loss term L (sketch below)
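The point is that \(g_L\) and \(H_L s\) can be computed without ever forming the \(p \times d\) matrix \(AX\): everything reduces to matrix-vector products with X and the sparse pair matrix A, costing \(O(nd+p)\). A hedged sketch under that notation, assuming the squared hinge loss; the function names are illustrative.

```python
import numpy as np

def pairwise_grad_and_hvp(X, A, w, C):
    """g = w + C*g_L and a function s -> Hs = s + C*H_L s for
    L(w) = sum_k l((A X w)_k) with l(t) = max(0, 1-t)^2.
    A is the sparse p x n pair matrix; nothing of size p x d is ever built."""
    z = A @ (X @ w)                                   # score differences, one per pair: O(nd + p)
    active = (z < 1).astype(float)                    # pairs with non-zero loss
    g_L = X.T @ (A.T @ (-2.0 * np.maximum(0.0, 1.0 - z)))   # l'(t) = -2*max(0, 1-t)
    g = w + C * g_L
    def hvp(s):
        # H_L s = X^T A^T D A X s with D_kk = l''(z_k) = 2 on active pairs, 0 otherwise
        return s + C * (X.T @ (A.T @ (2.0 * active * (A @ (X @ s)))))
    return g, hvp
```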
Two relevance levels
- \(L=\sum_{i \in A, j \in B} \ell (y_i-y_j)\), where \(y_i = w^T x_i\) is the predicted score of document i
- A: documents with high relevance, B: documents with low relevance
- let \(\tilde{y_i} = \begin{cases} y_i - \frac12, & i \in A \\ y_i + \frac12, & i \in B \end{cases}\)
- => \(\ell (y_i-y_j) = \max (0,\tilde{y}_j - \tilde{y}_i)^2\), since \(1-(y_i-y_j)=\tilde{y}_j-\tilde{y}_i\) (sketch below)
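With the shifted scores the double sum over A×B collapses: sort the \(\tilde{y}\) once and keep suffix sums of \(\tilde{y}_j\) and \(\tilde{y}_j^2\) over B, giving the loss in \(O(n\log n)\) instead of \(O(|A|\,|B|)\). The sketch below only computes the loss value (the paper applies the same idea to the gradient and Hessian-vector products); the names and the naive reference are illustrative.

```python
import numpy as np

def two_level_loss_naive(y_tilde, in_A):
    """Reference O(|A|*|B|): L = sum_{i in A, j in B} max(0, y~_j - y~_i)^2.
    `in_A` is a boolean array marking the high-relevance documents."""
    a, b = y_tilde[in_A], y_tilde[~in_A]
    return np.sum(np.maximum(0.0, b[None, :] - a[:, None]) ** 2)

def two_level_loss_sorted(y_tilde, in_A):
    """Same value in O(n log n): after sorting, every B score above an A score
    contributes y~_j^2 - 2*y~_i*y~_j + y~_i^2, so suffix sums of y~_j and y~_j^2
    over B entries are enough."""
    order = np.argsort(y_tilde)
    s = y_tilde[order]
    is_b = (~in_A[order]).astype(float)
    def suffix_after(v):                      # suffix_after[k] = sum(v[k+1:])
        c = np.cumsum(v[::-1])[::-1]
        return np.append(c[1:], 0.0)
    cnt = suffix_after(is_b)                  # number of B entries strictly after position k
    s1 = suffix_after(is_b * s)               # their summed scores
    s2 = suffix_after(is_b * s ** 2)          # their summed squared scores
    contrib = s2 - 2 * s * s1 + cnt * s ** 2  # per position; only A positions count
    return np.sum(contrib[in_A[order]])

# rng = np.random.default_rng(0)
# y_tilde = rng.normal(size=1000); in_A = rng.random(1000) < 0.5
# assert np.isclose(two_level_loss_naive(y_tilde, in_A), two_level_loss_sorted(y_tilde, in_A))
```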