Loss functions for machine learning (classification, regression)

Loss functions are fundamental components of machine learning (ML) problems, and they are closely tied to the goal of the ML model. I also recall that this was one of the interview questions for an ML position at IBM.

Here, let's try to summarize a set of loss functions and when they can be used for different types of problems, such as regression- and classification-related ones.


Regression-related 

------------------------------------------------

MSE (Mean Squared Error): 

  • $L(Q, \hat{Q})=\frac{1}{N}\sum_{i=1}^{N}(Q_i-\hat{Q}_i)^2$
  • where $Q_i$ denotes the actual value and $\hat{Q}_i$ denotes the value predicted by our model for example $i$, and $N$ is the total number of examples.

MSLE (Mean Squared Log Error):
  • $L(Q, \hat{Q})=\frac{1}{N}\sum_{i=1}^{N}(\log(Q_i+1)-\log(\hat{Q}_i+1))^2$
  • Useful when we do not want to place much emphasis on errors for large values in the prediction

MAE (Mean Absolute Error):
  • $L(Q, \hat{Q})=\frac{1}{N}\sum_{i=1}^{N}|Q_i-\hat{Q}_i|$
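
As a quick illustration, here is a minimal NumPy sketch of the three pointwise losses above (the function names and toy arrays are my own, for illustration only):

```python
import numpy as np

def mse(q, q_hat):
    """Mean squared error."""
    return np.mean((q - q_hat) ** 2)

def msle(q, q_hat):
    """Mean squared log error; assumes q, q_hat >= 0 so the logs are defined."""
    return np.mean((np.log1p(q) - np.log1p(q_hat)) ** 2)

def mae(q, q_hat):
    """Mean absolute error."""
    return np.mean(np.abs(q - q_hat))

q     = np.array([3.0, 100.0, 0.5])   # actual values (toy data)
q_hat = np.array([2.5, 120.0, 0.4])   # predicted values (toy data)
print(mse(q, q_hat), msle(q, q_hat), mae(q, q_hat))
```

With these toy numbers, the 100-vs-120 miss accounts for almost all of the MSE but only part of the MSLE, which matches the note above.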

Huber Loss (considers both the absolute-error and squared-error worlds):

\begin{equation} L_{\delta}(a)=\begin{cases} \frac{1}{2}a^2, & \text{for } |a|<\delta \\ \delta(|a|-\frac{1}{2}\delta), & \text{otherwise}\end{cases} \end{equation}
  • $a$ refers to the error $Q-\hat{Q}$, and $\delta$ (e.g., $\delta=1$) is a threshold
  • Huber loss is less sensitive to outliers in data than mean squared error.
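
A minimal sketch of the Huber loss, with a toy outlier to illustrate the reduced sensitivity compared to MSE (all numbers are made up):

```python
import numpy as np

def huber(q, q_hat, delta=1.0):
    """Huber loss averaged over examples; quadratic for small errors,
    linear beyond the threshold delta."""
    a = q - q_hat                                   # per-example error
    quadratic = 0.5 * a ** 2
    linear = delta * (np.abs(a) - 0.5 * delta)
    return np.mean(np.where(np.abs(a) < delta, quadratic, linear))

q     = np.array([1.0, 2.0, 3.0])
q_hat = np.array([1.1, 2.1, 13.0])   # the last prediction is a gross outlier
print(huber(q, q_hat))               # ~3.17: the outlier enters linearly
print(np.mean((q - q_hat) ** 2))     # ~33.3: the outlier enters quadratically
```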

Asymmetric Squared Error:

\begin{equation} L(Q, \hat{Q})=\begin{cases} (Q-\hat{Q})^2, & \text{if } \hat{Q} < Q \\ c\times(Q-\hat{Q})^2, & \text{otherwise}, 0<c<1 \end{cases} \end{equation}
  • over-predicting is preferable to estimating a value below the true value, since over-estimates are scaled down by $c$
  • example use case: traffic prediction of next month for a website
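
A minimal sketch, using the website-traffic example with a hypothetical $c=0.5$ (the numbers are illustrative):

```python
import numpy as np

def asymmetric_squared_error(q, q_hat, c=0.5):
    """Squared error that scales down over-predictions by c (0 < c < 1)."""
    se = (q - q_hat) ** 2
    return np.mean(np.where(q_hat < q, se, c * se))

true_traffic = np.array([100.0])
print(asymmetric_squared_error(true_traffic, np.array([90.0])))   # 100.0: under-prediction
print(asymmetric_squared_error(true_traffic, np.array([110.0])))  # 50.0: over-prediction costs less
```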

Loss that emphasizes errors when the true value is close to 0 or 1:
  • $L(Q, \hat{Q})=\frac{|Q-\hat{Q}|}{Q(1-Q)}, Q, \hat{Q} \in [0, 1]$
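
A minimal sketch; the small eps guard against division by zero at the endpoints is my own addition:

```python
import numpy as np

def endpoint_weighted_loss(q, q_hat, eps=1e-8):
    """Absolute error scaled by 1 / (q * (1 - q)); the same miss costs more
    when the true value is near 0 or 1."""
    return np.mean(np.abs(q - q_hat) / (q * (1 - q) + eps))

# Both predictions miss by 0.05, but the miss near 1 is penalized more.
print(endpoint_weighted_loss(np.array([0.5]),  np.array([0.45])))   # ~0.2
print(endpoint_weighted_loss(np.array([0.95]), np.array([0.90])))   # ~1.05
```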



Classification-related

----------------------------------------------

Hamming Loss: 
  • $HL=\frac{1}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L}\operatorname{xor}(y_{i,j},z_{i,j})$
  • where $N$ is the number of examples, $L$ is the number of labels, and $y_{i,j}$, $z_{i,j}$ are the true and predicted values of label $j$ for example $i$
  • related problem: (multi-label) classification
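
A minimal sketch for the multi-label case, with a toy $2 \times 3$ label matrix:

```python
import numpy as np

def hamming_loss(y_true, y_pred):
    """Fraction of label slots that disagree; y_true and y_pred are
    N x L binary matrices (N examples, L labels)."""
    return np.mean(y_true != y_pred)

y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 1, 1],
                   [0, 1, 0]])
print(hamming_loss(y_true, y_pred))  # 1 wrong slot out of 6 -> ~0.167
```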

Cross-entropy (CE) / Log-loss (LL): 
  • $CE/LL=-\left(Q\log(\hat{Q})+(1-Q)\log(1-\hat{Q})\right),\ Q\in\{0, 1\},\ \hat{Q}\in[0,1]$
  • Visualization about cross-entropy/log-loss
  • related problem: (multi-label) classification
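
A minimal sketch of the binary case; the clipping to avoid log(0) is my own addition:

```python
import numpy as np

def binary_cross_entropy(q, q_hat, eps=1e-12):
    """Log loss; q holds 0/1 targets, q_hat holds predicted probabilities."""
    q_hat = np.clip(q_hat, eps, 1 - eps)   # avoid log(0)
    return np.mean(-(q * np.log(q_hat) + (1 - q) * np.log(1 - q_hat)))

q     = np.array([1, 0, 1])
q_hat = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(q, q_hat))  # confident correct predictions give a low loss
```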

Hinge Loss: 
  • The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs)
  • For an intended output $Q=\pm 1$ and a classifier score $\hat{Q}$, $L(Q, \hat{Q})=\max(0, 1-Q \cdot \hat{Q})$
  • related problem: (binary, maximum-margin) classification
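
A minimal sketch; $\hat{Q}$ here is the raw score of a hypothetical linear classifier:

```python
import numpy as np

def hinge_loss(q, q_hat):
    """Hinge loss; q holds +1/-1 labels, q_hat holds raw (unthresholded) scores."""
    return np.mean(np.maximum(0.0, 1.0 - q * q_hat))

q     = np.array([+1, -1, +1])
q_hat = np.array([2.0, -0.5, -0.3])   # toy classifier scores
print(hinge_loss(q, q_hat))           # (0 + 0.5 + 1.3) / 3 = 0.6
```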


Ranking Loss

----------------------------------------------

Pairwise Ranking Loss:

\begin{equation} L=\begin{cases} d(r_a, r_p), & \text{if Positive Pair} \\ \max(0, m-d(r_a, r_n)), & \text{if Negative Pair} \end{cases} \end{equation}

where $m$ is the margin, $d(\cdot,\cdot)$ is a distance between learned representations, $r_a$ is the anchor, and $r_p$ / $r_n$ are the positive / negative examples.
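A minimal sketch of the pairwise case, assuming Euclidean distance as $d$ (any other distance between representations could be substituted):

```python
import numpy as np

def pairwise_ranking_loss(r_a, r_other, is_positive, m=1.0):
    """Pull positive pairs together (loss = distance) and push negative
    pairs at least m apart (hinge on the margin)."""
    d = np.linalg.norm(r_a - r_other)          # Euclidean distance
    return d if is_positive else max(0.0, m - d)

r_a = np.array([0.0, 0.0])                     # anchor representation (toy)
r_p = np.array([0.1, 0.2])                     # positive example
r_n = np.array([0.3, 0.4])                     # negative example
print(pairwise_ranking_loss(r_a, r_p, True))   # ~0.22
print(pairwise_ranking_loss(r_a, r_n, False))  # 0.5: negative is still too close
```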

Triplet Ranking Loss:
  • $L(r_a, r_p, r_n)=\max(0, m+d(r_a,r_p)-d(r_a,r_n))$
  • $r_a$ is anchor, $r_p$ and $r_n$ are positive and negative examples
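
And a corresponding sketch of the triplet loss, again assuming Euclidean distance:

```python
import numpy as np

def triplet_ranking_loss(r_a, r_p, r_n, m=1.0):
    """Zero loss once the negative is at least m farther from the anchor
    than the positive; otherwise penalize the shortfall."""
    d_ap = np.linalg.norm(r_a - r_p)
    d_an = np.linalg.norm(r_a - r_n)
    return max(0.0, m + d_ap - d_an)

r_a, r_p, r_n = np.array([0.0, 0.0]), np.array([0.1, 0.0]), np.array([2.0, 0.0])
print(triplet_ranking_loss(r_a, r_p, r_n))  # 0.0: negative is already far enough
```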

Others

--------------------------------------

Context [1]: Suppose the future return of a stock price is very small, say 0.01 (or 1%). We have a model that predicts the stock's future return, and our profit and loss is directly tied to acting on the prediction. How should we measure the loss associated with the model's predictions, and with subsequent predictions? A squared-error loss is agnostic to the sign and would penalize a prediction of -0.01 as badly as a prediction of 0.03.

Sample loss function from [1]:
\begin{equation} L(Q, \hat{Q})=\begin{cases} \alpha\cdot\hat{Q}^2-\operatorname{sign}(Q)\cdot\hat{Q}+|Q|, & \text{if } Q\cdot\hat{Q} < 0 \\ |Q-\hat{Q}|, & \text{otherwise} \end{cases} \end{equation}

  • $\alpha$ is a penalty term for predictions with the wrong sign


The loss for predicting -0.01 and 0.03 with respect to the true value 0.01 will be 0.03 and 0.02, respectively (when $\alpha=100$).
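
As a quick sanity check of those numbers, a minimal sketch of this loss:

```python
import numpy as np

def stock_loss(q, q_hat, alpha=100.0):
    """Loss from [1]: a prediction on the wrong side of zero (q * q_hat < 0)
    is penalized far more heavily than a same-sign miss."""
    if q * q_hat < 0:
        return alpha * q_hat ** 2 - np.sign(q) * q_hat + abs(q)
    return abs(q - q_hat)

print(stock_loss(0.01, -0.01))  # ~0.03: wrong sign
print(stock_loss(0.01,  0.03))  # ~0.02: right sign, larger absolute error
```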

[1] Cameron Davidson-Pilon, Probabilistic Programming and Bayesian Methods for Hackers.