This note summarizes a set of common loss functions and when each is suitable for different types of problems, such as regression- and classification-related ones.
Regression-related
------------------------------------------------
MSE (Mean Squared Error):
- $L(Q, \hat{Q})=\frac{1}{N}\sum_{i=1}^{N}(Q_i-\hat{Q}_i)^2$
- where $Q_i$ denotes the actual value, $\hat{Q}_i$ denotes the value predicted by our model, and $N$ is the total number of examples.
MSLE (Mean Squared Log Error):
- $L(Q, \hat{Q})=\frac{1}{N}\sum_{i=1}^{N}(\log(Q_i+1)-\log(\hat{Q}_i+1))^2$
- Useful when we do not want to heavily penalize errors on large values, since taking logs makes the penalty roughly relative rather than absolute
MAE (Mean Absolute Error):
- $L(Q, \hat{Q})=\frac{1}{N}\sum_{i=1}^{N}|Q_i-\hat{Q}_i|$
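A minimal numpy sketch of the three regression errors above (the function names, sample values, and the use of `np.log1p` for $\log(1+x)$ are my own choices):

```python
import numpy as np

def mse(q, q_hat):
    # Mean squared error: average of squared differences.
    return np.mean((q - q_hat) ** 2)

def msle(q, q_hat):
    # Mean squared log error: compares log(1 + value), so large
    # magnitudes are compressed; assumes q, q_hat > -1.
    return np.mean((np.log1p(q) - np.log1p(q_hat)) ** 2)

def mae(q, q_hat):
    # Mean absolute error: average of absolute differences.
    return np.mean(np.abs(q - q_hat))

# Illustrative values only.
q = np.array([1.0, 10.0, 100.0])
q_hat = np.array([1.5, 12.0, 90.0])
print(mse(q, q_hat), msle(q, q_hat), mae(q, q_hat))
```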
Huber Loss (combines the squared-error and absolute-error regimes):
\begin{equation} L_{\delta}(a)=\begin{cases} \frac{1}{2}a^2 & \text{for } |a|<\delta \\ \delta(|a|-\frac{1}{2}\delta) & \text{otherwise}\end{cases} \end{equation}
- $a$ refers to the error $Q-\hat{Q}$, and $\delta$ (e.g., $\delta=1$) is a threshold
- Huber loss is less sensitive to outliers in data than mean squared error.
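A sketch of Huber loss in the same style, assuming $\delta=1$ as the default threshold:

```python
import numpy as np

def huber(q, q_hat, delta=1.0):
    # a is the error; the loss is quadratic inside the threshold and
    # linear outside it, which is what tames the effect of outliers.
    a = q - q_hat
    quadratic = 0.5 * a ** 2
    linear = delta * (np.abs(a) - 0.5 * delta)
    return np.mean(np.where(np.abs(a) < delta, quadratic, linear))

# The outlier (error of 10) contributes linearly, not quadratically.
print(huber(np.array([0.0, 0.0]), np.array([0.5, 10.0])))
```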
Asymmetric Squared Error:
\begin{equation} L(Q, \hat{Q})=\begin{cases} (Q-\hat{Q})^2, & \text{if } \hat{Q} < Q \\ c\times(Q-\hat{Q})^2, & \text{otherwise, } 0<c<1 \end{cases} \end{equation}
- a larger prediction is preferable to underestimating the true value, since over-predictions are discounted by the factor $c$
- example use case: predicting a website's traffic for the next month
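A sketch of this asymmetric variant; the discount factor $c=0.5$ and the sample values are illustrative assumptions:

```python
import numpy as np

def asymmetric_squared(q, q_hat, c=0.5):
    # Full squared penalty for under-prediction (q_hat < q),
    # discounted by 0 < c < 1 for over-prediction.
    sq = (q - q_hat) ** 2
    return np.mean(np.where(q_hat < q, sq, c * sq))

# Same absolute error, but under-predicting costs twice as much here.
print(asymmetric_squared(100.0, 90.0))   # 100.0
print(asymmetric_squared(100.0, 110.0))  # 50.0
```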
Loss that emphasizes errors when the true value is close to 0 or 1:
- $L(Q, \hat{Q})=\frac{|Q-\hat{Q}|}{Q(1-Q)}, \quad Q, \hat{Q} \in [0, 1]$
- the denominator $Q(1-Q)$ shrinks as $Q$ approaches 0 or 1, so errors near the endpoints are amplified
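A sketch showing how the denominator amplifies errors near the endpoints (the function name and sample values are mine):

```python
import numpy as np

def endpoint_emphasized_loss(q, q_hat):
    # The q * (1 - q) denominator blows up errors near 0 and 1;
    # note it is undefined at exactly q = 0 or q = 1.
    return np.abs(q - q_hat) / (q * (1 - q))

# The same absolute error of 0.05 costs far more near the endpoint.
print(endpoint_emphasized_loss(0.5, 0.55))   # ~0.2
print(endpoint_emphasized_loss(0.99, 0.94))  # ~5.05
```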
Classification-related
----------------------------------------------
Hamming Loss:
- $HL=\frac{1}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L}\mathrm{xor}(y_{i,j},z_{i,j})$
- where $L$ is the number of labels, $y_{i,j}$ is the true value of the $j$-th label for example $i$, and $z_{i,j}$ is the corresponding prediction
- related problem: (multi-label) classification
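For 0/1 label matrices, the xor reduces to an inequality test, so Hamming loss is a single `np.mean` call; the small matrices below are illustrative:

```python
import numpy as np

# y: true binary label matrix (N examples x L labels); z: predictions.
y = np.array([[1, 0, 1],
              [0, 1, 0]])
z = np.array([[1, 1, 1],
              [0, 1, 1]])

# (y != z) is the element-wise xor for 0/1 arrays; the mean divides by N*L.
hamming_loss = np.mean(y != z)
print(hamming_loss)  # 2 mismatches out of 6 label slots -> ~0.333
```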
Cross-entropy (CE) / Log-loss (LL):
- $CE/LL=-(Q\log(\hat{Q})+(1-Q)\log(1-\hat{Q})), \quad Q\in\{0, 1\}, \hat{Q}\in[0,1]$
- (figure: visualization of cross-entropy/log-loss, omitted)
- related problem: binary classification (and, applied per label, multi-label classification)
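A sketch of binary cross-entropy; the `eps` clipping is my addition to keep the logs finite, not part of the formula:

```python
import numpy as np

def log_loss(q, q_hat, eps=1e-12):
    # Clip q_hat away from exactly 0 and 1 so np.log stays finite.
    q_hat = np.clip(q_hat, eps, 1 - eps)
    return np.mean(-(q * np.log(q_hat) + (1 - q) * np.log(1 - q_hat)))

q = np.array([1, 0, 1])
q_hat = np.array([0.9, 0.2, 0.6])
print(log_loss(q, q_hat))  # confident correct predictions -> low loss
```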
Hinge Loss:
- The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs)
- For an intended output $Q=\pm 1$ and a raw classifier score $\hat{Q}$, $L(Q, \hat{Q})=\max(0, 1-Q \cdot \hat{Q})$
- related problem: binary (maximum-margin) classification
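A sketch of hinge loss for labels in $\{-1, +1\}$; the sample scores are illustrative:

```python
import numpy as np

def hinge(q, q_hat):
    # q in {-1, +1}; q_hat is the raw classifier score. Loss is zero
    # only when the score is on the correct side with margin >= 1.
    return np.mean(np.maximum(0.0, 1.0 - q * q_hat))

q = np.array([1, -1, 1])
q_hat = np.array([2.0, -0.5, -0.3])
print(hinge(q, q_hat))  # (0 + 0.5 + 1.3) / 3 = 0.6
```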
Ranking Loss
----------------------------------------------
Pairwise Ranking Loss:
\begin{equation} L=\begin{cases} d(r_a, r_p) & \text{if positive pair} \\ \max(0, m-d(r_a, r_n)) & \text{if negative pair} \end{cases} \end{equation}
where $d(\cdot,\cdot)$ is a distance metric, $m$ is the margin, $r_a$ is the anchor representation, and $r_p$/$r_n$ are representations of positive/negative examples.
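A sketch of the pairwise form, assuming Euclidean distance for $d$ (one common choice) and margin $m=1$:

```python
import numpy as np

def pairwise_ranking_loss(r_1, r_2, is_positive_pair, m=1.0):
    # d is Euclidean distance here (an assumption, not the only choice).
    d = np.linalg.norm(r_1 - r_2)
    # Pull positive pairs together; push negative pairs at least m apart.
    return d if is_positive_pair else max(0.0, m - d)

r_a = np.array([0.0, 0.0])
print(pairwise_ranking_loss(r_a, np.array([0.3, 0.4]), True))   # ~0.5
print(pairwise_ranking_loss(r_a, np.array([0.3, 0.4]), False))  # ~0.5
```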
Triplet Ranking Loss:
- $L(r_a, r_p, r_n)=\max(0, m+d(r_a,r_p)-d(r_a,r_n))$
- $r_a$ is the anchor, and $r_p$ and $r_n$ are the positive and negative examples, respectively
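The triplet form in the same style, again assuming Euclidean distance and $m=1$:

```python
import numpy as np

def triplet_ranking_loss(r_a, r_p, r_n, m=1.0):
    # Zero loss once the negative is at least m farther from the
    # anchor than the positive is.
    d_ap = np.linalg.norm(r_a - r_p)
    d_an = np.linalg.norm(r_a - r_n)
    return max(0.0, m + d_ap - d_an)

r_a = np.array([0.0, 0.0])
print(triplet_ranking_loss(r_a, np.array([0.1, 0.0]), np.array([2.0, 0.0])))  # 0.0
print(triplet_ranking_loss(r_a, np.array([0.1, 0.0]), np.array([0.5, 0.0])))  # ~0.6
```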
Others
--------------------------------------
Context [1]: Suppose the future return of a stock price is very small, say 0.01 (or 1%). We have a model that predicts the stock's future return, and our profit and loss are directly tied to acting on the prediction. How should we measure the loss associated with the model's predictions? A squared-error loss is agnostic to the sign and would penalize a prediction of -0.01 just as badly as a prediction of 0.03, even though acting on the former loses money while acting on the latter does not.
Sample loss function from [1]:
\begin{equation} L(Q, \hat{Q})=\begin{cases} \alpha\cdot\hat{Q}^2-\mathrm{sign}(Q)\cdot\hat{Q}+|Q|, & \text{if } Q\cdot\hat{Q} < 0 \\ |Q-\hat{Q}|, & \text{otherwise} \end{cases} \end{equation}
- $\alpha$ is a penalty coefficient for predicting the wrong sign
The losses for predicting -0.01 and 0.03 against the true value 0.01 are 0.03 and 0.02, respectively (with $\alpha=100$): the smaller prediction with the wrong sign costs more than the larger prediction with the correct sign.
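A sketch of this loss that reproduces the two numbers above (the scalar implementation and the `alpha` default are my own):

```python
import numpy as np

def stock_loss(q, q_hat, alpha=100.0):
    # Wrong-sign predictions (q * q_hat < 0) incur an extra quadratic
    # penalty scaled by alpha; same-sign predictions use absolute error.
    if q * q_hat < 0:
        return alpha * q_hat ** 2 - np.sign(q) * q_hat + abs(q)
    return abs(q - q_hat)

print(stock_loss(0.01, -0.01))  # ~0.03
print(stock_loss(0.01, 0.03))   # ~0.02
```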
[1] Cameron Davidson-Pilon, *Probabilistic Programming and Bayesian Methods for Hackers*.