Efficient Algorithms for Ranking with SVMs
Introduction
Methods:
- Pointwise (predict a score for each document close to its relevance label)
- Pairwise (loss defined on pairs of documents, based on which of the two is more relevant)
- Listwise (training loss function is based on the list of all participating documents and their scores)
Pairwise:
- RankBoost
- RankNet
- LambdaRank
- GBRank
- RankSVM
Main differences
- Type of loss function
- Training methods
RankSVM
- Minimizing objective function: \(\frac12||w||^2 + C\sum_{(i,j)\in P} l(w^T x_i - w^T x_j)\)
- l is a loss function such as the squared hinge loss \(l(t)=\max(0,1-t)^2\) (see the objective sketch after this list)
- SVMLight forms all pairwise differences and solves the problem in the dual
- Limiting the number of iterations reduces training time, but incomplete training gives poor results
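To make the objective concrete, here is a minimal NumPy sketch that only evaluates it for a given weight vector, assuming the squared hinge loss and a `pairs` list in which (i, j) means document i is preferred to document j; the function name and conventions are illustrative, not the paper's code.

```python
import numpy as np

def ranksvm_objective(w, X, pairs, C):
    """Evaluate (1/2)||w||^2 + C * sum_{(i,j) in P} l(w^T x_i - w^T x_j)
    with the squared hinge loss l(t) = max(0, 1 - t)^2.
    `pairs` is a list of (i, j) index pairs, document i preferred to document j."""
    scores = X @ w                                    # one score per document
    diffs = np.array([scores[i] - scores[j] for i, j in pairs])
    return 0.5 * (w @ w) + C * np.sum(np.maximum(0.0, 1.0 - diffs) ** 2)
```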
This paper:
- RankSVM trained in the primal with a Newton method
- Two ways:
- efficient computation of the Newton step in \(O(nd+p)\) (n documents, d features, p preference pairs)
- exploiting the structure of P when it arises from a small number of relevance levels, reducing the cost to \(O(nd+n\log n)\)
Newton optimization for linear SVM
Objective function, \(y_i \in \{-1,1\}\): \[\min_{w,b} \frac12 \|w\|^2 + C \sum^{n}_{i=1}\max(0,1-y_i (w^T x_i +b))^2\]
No need to solve in the dual; instead one can use:
- Non-linear conjugate gradient
- uses the gradient of the objective function and a line search procedure (secant method / cubic polynomial interpolation)
- each objective evaluation in the line search is O(n) once Xs is precomputed (X: data matrix, s: search direction); see the sketch after this list
- Truncated Newton
- \(w \leftarrow w - H^{-1}g\), H Hessian, g gradient of the objective function
- in truncated Newton the Hessian is never formed explicitly; the linear system is solved with linear conjugate gradient
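A small sketch of why the line search is cheap: once \(Xw\) and \(Xs\) are cached, each trial step length only touches length-n vectors. The function name is illustrative and the bias term is omitted for brevity.

```python
import numpy as np

def objective_along_direction(Xw, Xs, y, w, s, C):
    """Return f(t) = 0.5*||w + t*s||^2 + C*sum_i max(0, 1 - y_i*(Xw + t*Xs)_i)^2.
    Xw = X @ w and Xs = X @ s are computed once (O(nd)); after that every
    evaluation of f inside the 1-D line search costs only O(n)."""
    ww, ws, ss = w @ w, w @ s, s @ s                  # scalars, reused for all t
    def f(t):
        margins = y * (Xw + t * Xs)                   # O(n)
        hinge = np.maximum(0.0, 1.0 - margins)
        return 0.5 * (ww + 2 * t * ws + t * t * ss) + C * np.sum(hinge ** 2)
    return f
```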
Both approaches are roughly equivalent (Chapelle, 2007a); we use truncated Newton
- Support vectors: \(sv = \{i : y_i(w^T x_i+b) < 1\}\)
- Gradient \(g:=w+2C\sum_{i \in sv}(w^T x_i +b -y_i)x_i\)
- Hessian \(H:=I+2C\sum_{i \in sv}x_i x^T_i\)
- Newton step \(H^{-1}g\) computed with linear conjugate gradient; only Hessian-vector products are needed: for any vector s, \(Hs=s+2CX^TD(Xs)\)
- D: diagonal matrix with \(D_{ii}=1\) if \(i \in sv\), 0 otherwise (see the sketch below)
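Putting the pieces above together, a minimal sketch of one truncated-Newton step for the primal squared-hinge SVM, using SciPy's conjugate-gradient solver through a `LinearOperator` so the Hessian is only ever applied to vectors. The helper name and keeping the bias fixed are simplifications for the example, not the paper's implementation.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

def truncated_newton_step(X, y, w, b, C):
    """Return the Newton direction H^{-1} g for the primal objective above,
    using only Hessian-vector products Hs = s + 2C X^T D (X s).
    The bias b is treated as fixed here to keep the sketch short."""
    n, d = X.shape
    sv = (y * (X @ w + b)) < 1                        # active set: y_i(w^T x_i + b) < 1
    # gradient: g = w + 2C sum_{i in sv} (w^T x_i + b - y_i) x_i
    g = w + 2 * C * (X.T @ (sv * (X @ w + b - y)))
    def hess_vec(s):
        s = np.ravel(s)                               # ensure 1-D
        # Hessian applied to a vector, never formed explicitly (D_ii = 1 iff i in sv)
        return s + 2 * C * (X.T @ (sv * (X @ s)))
    H = LinearOperator((d, d), matvec=hess_vec)
    step, _ = cg(H, g)                                # linear conjugate gradient
    return step                                       # update: w <- w - step
```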
First implementation
- Linear SVM can be used for RankSVM
- form one example per pair: \(x_i - x_j\) for each \((i,j)\in P\); in matrix form \(X \leftarrow AX\), where each row of A has one entry equal to \(+1\) and one equal to \(-1\) (sketch below)
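A sketch of this expansion with a sparse pair matrix, so that A itself stays cheap to store; the `pairs` list, its (preferred, less relevant) ordering, and the function name are assumptions for illustration.

```python
import numpy as np
from scipy import sparse

def pair_difference_matrix(pairs, n):
    """Build the p x n matrix A with A[k, i] = +1 and A[k, j] = -1 for the
    k-th preference pair (i, j); the expanded training set is then A @ X,
    fed to a plain linear SVM with all labels equal to +1 and no bias."""
    p = len(pairs)
    rows = np.repeat(np.arange(p), 2)
    cols = np.array(pairs).ravel()
    vals = np.tile([1.0, -1.0], p)
    return sparse.csr_matrix((vals, (rows, cols)), shape=(p, n))

# X_pairs = pair_difference_matrix(pairs, X.shape[0]) @ X   # p x d matrix of x_i - x_j rows
```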
Better algorithm
- \(g=w+C\,g_L\)
- \(Hs=s+C\,H_L s\), where \(g_L\) and \(H_L\) are the gradient and Hessian of the pairwise loss term L (sketch below)
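The point is that \(g_L\) and \(H_L s\) can be computed without ever forming the \(p \times d\) matrix \(AX\): everything reduces to matrix-vector products with X and the sparse pair matrix A, costing \(O(nd+p)\). A hedged sketch under that notation, assuming the squared hinge loss; the function names are illustrative.

```python
import numpy as np

def pairwise_grad_and_hvp(X, A, w, C):
    """g = w + C*g_L and a function s -> Hs = s + C*H_L s for
    L(w) = sum_k l((A X w)_k) with l(t) = max(0, 1-t)^2.
    A is the sparse p x n pair matrix; nothing of size p x d is ever built."""
    z = A @ (X @ w)                                   # score differences, one per pair: O(nd + p)
    active = (z < 1).astype(float)                    # pairs with non-zero loss
    g_L = X.T @ (A.T @ (-2.0 * np.maximum(0.0, 1.0 - z)))   # l'(t) = -2*max(0, 1-t)
    g = w + C * g_L
    def hvp(s):
        # H_L s = X^T A^T D A X s with D_kk = l''(z_k) = 2 on active pairs, 0 otherwise
        return s + C * (X.T @ (A.T @ (2.0 * active * (A @ (X @ s)))))
    return g, hvp
```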
Two relevance levels
- \(L=\sum_{i \in A, j \in B} \ell (y_i-y_j)\), where \(y_i = w^T x_i\) is the predicted score of document i
- A: documents with high relevance, B: documents with low relevance
- let \(\tilde{y_i} = \begin{cases} y_i - \frac12, & i \in A \\ y_i + \frac12, & i \in B \end{cases}\)
- => \(\ell (y_i-y_j) = \max (0,\tilde{y}_j - \tilde{y}_i)^2\), since \(1-(y_i-y_j)=\tilde{y}_j-\tilde{y}_i\) (sketch below)
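With the shifted scores the double sum over A×B collapses: sort the \(\tilde{y}\) once and keep suffix sums of \(\tilde{y}_j\) and \(\tilde{y}_j^2\) over B, giving the loss in \(O(n\log n)\) instead of \(O(|A|\,|B|)\). The sketch below only computes the loss value (the paper applies the same idea to the gradient and Hessian-vector products); the names and the naive reference are illustrative.

```python
import numpy as np

def two_level_loss_naive(y_tilde, in_A):
    """Reference O(|A|*|B|): L = sum_{i in A, j in B} max(0, y~_j - y~_i)^2.
    `in_A` is a boolean array marking the high-relevance documents."""
    a, b = y_tilde[in_A], y_tilde[~in_A]
    return np.sum(np.maximum(0.0, b[None, :] - a[:, None]) ** 2)

def two_level_loss_sorted(y_tilde, in_A):
    """Same value in O(n log n): after sorting, every B score above an A score
    contributes y~_j^2 - 2*y~_i*y~_j + y~_i^2, so suffix sums of y~_j and y~_j^2
    over B entries are enough."""
    order = np.argsort(y_tilde)
    s = y_tilde[order]
    is_b = (~in_A[order]).astype(float)
    def suffix_after(v):                      # suffix_after[k] = sum(v[k+1:])
        c = np.cumsum(v[::-1])[::-1]
        return np.append(c[1:], 0.0)
    cnt = suffix_after(is_b)                  # number of B entries strictly after position k
    s1 = suffix_after(is_b * s)               # their summed scores
    s2 = suffix_after(is_b * s ** 2)          # their summed squared scores
    contrib = s2 - 2 * s * s1 + cnt * s ** 2  # per position; only A positions count
    return np.sum(contrib[in_A[order]])

# rng = np.random.default_rng(0)
# y_tilde = rng.normal(size=1000); in_A = rng.random(1000) < 0.5
# assert np.isclose(two_level_loss_naive(y_tilde, in_A), two_level_loss_sorted(y_tilde, in_A))
```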