Focal Loss was introduced by Lin et al.

In this case, the activation function does not depend on the scores of classes in \(C\) other than \(C_1 = C_i\). So the gradient with respect to each score \(s_i\) in \(s\) will only depend on the loss given by its binary problem.
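
A minimal sketch of this independence, assuming Sigmoid activations and the standard Binary Cross-Entropy gradient \(f(s_i) - t_i\) with respect to the raw score. The function names are illustrative and do not belong to any framework's API.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_grads(scores, targets):
    """Gradient of the summed Binary Cross-Entropy w.r.t. each raw score.

    Each entry only uses its own score and target (assumed form: f(s_i) - t_i).
    """
    return [sigmoid(s) - t for s, t in zip(scores, targets)]

scores = [1.2, -0.7, 0.3]
targets = [1, 0, 1]
grads = bce_grads(scores, targets)

# Changing only the score of class 1 leaves the gradients of classes 0 and 2 untouched.
perturbed = bce_grads([1.2, 5.0, 0.3], targets)
```

With a Softmax activation this would not hold, since every output depends on all the scores through the normalization term.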

Caffe: Sigmoid Cross-Entropy Loss Layer
PyTorch: BCEWithLogitsLoss
TensorFlow: sigmoid_cross_entropy

Focal Loss

Lin et al., from Facebook, claim in this paper to improve one-stage object detectors using Focal Loss to train a detector they name RetinaNet. Focal Loss is a Cross-Entropy Loss that weighs the contribution of each sample to the loss based on the classification error. The idea is that, if a sample is already classified correctly by the CNN, its contribution to the loss decreases. With this strategy, they claim to solve the problem of class imbalance by making the loss implicitly focus on the problematic classes. Moreover, they also weight the contribution of each class to the loss for a more explicit class balancing. They use Sigmoid activations, so Focal Loss can also be considered a Binary Cross-Entropy Loss. We define it for each binary problem as:
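
In the notation of this section, where \(s_i\) is the Sigmoid output and \(t_i\) the groundtruth label of each of the two classes of the binary problem, the definition can be written as follows. This is a reconstruction from the terms described here, not a quotation from the paper:

\[FL = -\sum_{i=1}^{C=2} (1 - s_i)^{\gamma} \, t_i \log(s_i)\]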

Where \((1 - s_i)^{\gamma}\), with the focusing parameter \(\gamma \geq 0\), is a modulating factor to reduce the influence of correctly classified samples in the loss. With \(\gamma = 0\), Focal Loss is equivalent to Binary Cross-Entropy Loss.
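
Assuming the summed form \(FL = -\sum_{i=1}^{C=2} (1 - s_i)^{\gamma} t_i \log(s_i)\), the loss can also be written by cases on \(t_1\), since only the term of the positive class survives:

\[FL = \begin{cases} -(1 - s_1)^{\gamma} \log(s_1) & \text{if } t_1 = 1 \\ -(1 - s_2)^{\gamma} \log(s_2) = -s_1^{\gamma} \log(1 - s_1) & \text{if } t_1 = 0 \end{cases}\]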

Where we have separated the formulation for when the class \(C_i = C_1\) is positive or negative (and therefore, the class \(C_2\) is positive). As before, we have \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\).

The gradient gets a bit more complex due to the inclusion of the modulating factor \((1 - s_i)^{\gamma}\) in the loss formulation, but it can be deduced using the Binary Cross-Entropy gradient expression.
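
For a positive class (\(t_i = 1\)), a derivation sketch: taking \(s_i\) as the raw score and \(f(s_i)\) as its Sigmoid output, and applying the chain rule with \(f'(s_i) = f(s_i)(1 - f(s_i))\) to \(-(1 - f(s_i))^{\gamma} \log(f(s_i))\), gives:

\[\frac{\partial FL}{\partial s_i} = (1 - f(s_i))^{\gamma} \left( \gamma f(s_i) \log(f(s_i)) + f(s_i) - 1 \right)\]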

Where \(f()\) is the sigmoid function. To get the gradient expression for a negative \(C_i\) (\(t_i = 0\)), we just need to replace \(f(s_i)\) with \((1 - f(s_i))\) in the expression above and flip its sign, because \(1 - f(s_i) = f(-s_i)\) is a function of \(-s_i\).

Notice that, if the focusing parameter \(\gamma = 0\), the loss is equivalent to the CE Loss, and we end up with the same gradient expression.
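
A small sketch to check this numerically, assuming the loss is evaluated on the Sigmoid of a raw score. The function names are illustrative; the negative-class gradient is the positive-class expression with \(f(s_i)\) replaced by \(1 - f(s_i)\) and the sign flipped.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss(score, target, gamma=2.0):
    """Focal Loss of one binary problem; `score` is the raw (pre-sigmoid) value."""
    p = sigmoid(score)
    if target == 1:
        return -((1.0 - p) ** gamma) * math.log(p)
    return -(p ** gamma) * math.log(1.0 - p)

def focal_loss_grad(score, target, gamma=2.0):
    """Derivative of focal_loss with respect to the raw score."""
    p = sigmoid(score)
    if target == 1:
        return ((1.0 - p) ** gamma) * (gamma * p * math.log(p) + p - 1.0)
    # Negative class: swap p for (1 - p) in the line above and flip the sign.
    return (p ** gamma) * (p - gamma * (1.0 - p) * math.log(1.0 - p))

def bce_grad(score, target):
    """Binary Cross-Entropy gradient with respect to the raw score."""
    return sigmoid(score) - target

# With gamma = 0 both gradients agree, up to floating point rounding.
difference = abs(focal_loss_grad(0.3, 1, gamma=0.0) - bce_grad(0.3, 1))
```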

Forward pass: Loss computation

Where logprobs[r] stores, for each element of the batch, the sum of the binary cross-entropy over the classes. The focusing_parameter is \(\gamma\), which by default is 2 and should be defined as a layer parameter in the net prototxt. The class_balances can be used to introduce different loss contributions per class, as they do in the Facebook paper.
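
A framework-free sketch of such a forward pass, reusing the names logprobs, focusing_parameter and class_balances from the description. The function name, the list-based format of class_balances and the averaging over the batch are assumptions, and this is not the layer's actual code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss_forward(scores, labels, focusing_parameter=2.0, class_balances=None):
    """Return (batch-averaged loss, per-element losses).

    scores: one list of C raw class scores per batch element.
    labels: 0/1 labels with the same shape as scores.
    class_balances: optional per-class weights (hypothetical format: list of length C).
    """
    num_classes = len(scores[0])
    if class_balances is None:
        class_balances = [1.0] * num_classes
    logprobs = [0.0] * len(scores)
    for r, (row_scores, row_labels) in enumerate(zip(scores, labels)):
        for c in range(num_classes):
            p = sigmoid(row_scores[c])
            if row_labels[c] == 1:
                term = -((1.0 - p) ** focusing_parameter) * math.log(p)
            else:
                term = -(p ** focusing_parameter) * math.log(1.0 - p)
            # Sum the weighted binary terms of every class for this element.
            logprobs[r] += class_balances[c] * term
    # Assumption: the reported loss is the mean over the batch.
    return sum(logprobs) / len(logprobs), logprobs

loss, per_element = focal_loss_forward([[0.0, 0.0]], [[1, 0]], focusing_parameter=0.0)
```

With focusing_parameter set to 0 and no class balancing, the result is the plain sum of Binary Cross-Entropy terms per element.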

Backward pass: Gradients computation
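
A minimal sketch of what this step could look like, assuming the loss is averaged over the batch and differentiating the Focal Loss with respect to each raw score. The function name, the argument layout and the list-based class_balances are hypothetical, and this is not the layer's actual code.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def focal_loss_backward(scores, labels, focusing_parameter=2.0, class_balances=None):
    """Return d(loss)/d(score) with the same shape as scores.

    Assumes the forward loss is the batch mean of the per-element sums.
    """
    num_classes = len(scores[0])
    batch_size = len(scores)
    if class_balances is None:
        class_balances = [1.0] * num_classes
    gamma = focusing_parameter
    delta = []
    for row_scores, row_labels in zip(scores, labels):
        row = []
        for c in range(num_classes):
            p = sigmoid(row_scores[c])
            if row_labels[c] == 1:
                grad = ((1.0 - p) ** gamma) * (gamma * p * math.log(p) + p - 1.0)
            else:
                # Positive-class expression with p -> (1 - p), sign flipped.
                grad = (p ** gamma) * (p - gamma * (1.0 - p) * math.log(1.0 - p))
            row.append(class_balances[c] * grad / batch_size)
        delta.append(row)
    return delta

delta = focal_loss_backward([[0.0, 0.0]], [[1, 0]], focusing_parameter=0.0)
```

Each entry only involves its own score and label, so the gradients of the different binary problems can be computed independently.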

In the specific (and usual) case of Multi-Class classification the labels are one-hot, so only the positive class \(C_p\) keeps its term in the loss. There is only one element of the target vector \(t\) which is not zero, \(t_i = t_p\). So discarding the elements of the summation which are zero due to the target labels, we can write:
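
Assuming the general Cross-Entropy form \(CE = -\sum_{i}^{C} t_i \log(f(s)_i)\) with \(t_p = 1\) and every other \(t_i = 0\), the summation reduces to its single positive term. The Softmax form on the right is an assumption about the activation \(f\), which this passage does not name:

\[CE = -\sum_{i}^{C} t_i \log(f(s)_i) = -\log(f(s)_p) = -\log\left(\frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}}\right)\]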

This would be the pipeline for each one of the \(C\) classes. We set up \(C\) independent binary classification problems \((C' = 2)\). Then we sum up the loss over the different binary problems: we sum up the gradients of every binary problem to backpropagate, and the losses to monitor the global loss. \(s_1\) and \(t_1\) are the score and the groundtruth label for the class \(C_1\), which is also the class \(C_i\) in \(C\). \(s_2 = 1 - s_1\) and \(t_2 = 1 - t_1\) are the score and the groundtruth label of the class \(C_2\), which is not a “class” in our original problem with \(C\) classes, but a class we create to set up the binary problem with \(C_1 = C_i\). We can understand it as a background class.
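
The setup described here can be sketched as follows, assuming \(s_1\) is the Sigmoid of the raw score of class \(C_i\). The function names are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_problem_loss(s_1, t_1):
    """Cross-Entropy of one two-class problem: C_1 against the created class C_2."""
    s_2 = 1.0 - s_1  # score of the background class
    t_2 = 1 - t_1    # label of the background class
    return -(t_1 * math.log(s_1) + t_2 * math.log(s_2))

def multilabel_loss(raw_scores, targets):
    """Sum of the C independent binary losses, used to monitor the global loss."""
    return sum(binary_problem_loss(sigmoid(x), t) for x, t in zip(raw_scores, targets))

total = multilabel_loss([0.0, 0.0, 0.0], [1, 0, 1])
```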