Common Evaluation Metrics in CV

CV Object Detection Metrics

type: Post status: Published date: 2025/09/16

1. Common Evaluation Metrics

Classification tasks identify which class an input image belongs to. Common metrics include:

  • Accuracy

    $$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$

    The proportion of correctly predicted samples among all samples. Suitable when class distribution is balanced.

  • Precision

$$ Precision = \frac{TP}{TP + FP} $$

Among all predictions labeled as positive, the proportion that are truly positive. Focuses on the correctness of positive predictions.

The main idea: negatives should not be predicted as positives. When the cost of a false positive is high, precision is the metric to watch. For example, in spam filtering (spam = positive), letting a spam email through is far less costly than sending a legitimate email to the spam folder.

  • Recall

$$ Recall = \frac{TP}{TP + FN} $$

Among all actual positives, the proportion that are successfully retrieved. Focuses on completeness.

The main idea: positives should not be predicted as negatives. For example, in disease screening, misdiagnosing a sick patient as healthy may delay timely treatment.

  • F1-score

$$ F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} $$

The harmonic mean of precision and recall; used when the two need to be balanced.
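
A minimal sketch of these four metrics using scikit-learn (assumed to be installed); the labels and predictions below are made-up toy data:

```python
# Toy binary-classification example illustrating accuracy, precision, recall and F1.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # ground-truth labels (made up)
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])   # hard predicted labels (made up)

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of the two
```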

  • ROC Curve & AUC

    The ROC curve plots the true-positive rate (TPR) against the false-positive rate (FPR) as the decision threshold varies. AUC is the area under this curve and measures overall separability.

  • PR Curve

If classes are roughly balanced, ROC and AUC are suitable for model comparison. When data are imbalanced, the Precision-Recall curve (PRC) and area under it better visualize performance.

  • AP (Average Precision) and mAP (Mean Average Precision)

AP summarizes the Precision-Recall curve by averaging precision over recall values, i.e., the area under the PR curve. In practice the curve is usually interpolated first: at each recall level, precision is replaced by the maximum precision at any equal-or-higher recall (the "right-side maximum"), and the area is taken under this step-shaped envelope (a short sketch of this interpolation appears below).

In multi-class object detection, each class has its own PR curve and AP; mAP is the mean of the per-class APs.

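As a concrete version of the "right-side maximum" interpolation described above, here is a minimal NumPy sketch (the helper below is illustrative, not taken from any particular library); scikit-learn's average_precision_score provides a ready-made, slightly different (non-interpolated) estimate:

```python
import numpy as np

def average_precision(recall, precision):
    """AP as the area under an interpolated PR curve.

    recall / precision: arrays of operating points sorted by increasing recall.
    """
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # "right-side maximum": replace each precision value with the maximum
    # precision at any equal-or-higher recall (suffix maximum).
    p = np.maximum.accumulate(p[::-1])[::-1]
    # integrate the resulting step function where recall changes
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```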

2. Common Metrics for Object Detection

Object detection predicts both location (bounding boxes) and class.

  • IoU (Intersection over Union)

    $$ IoU = \frac{\text{area}(\text{Predicted box} \cap \text{Ground-truth box})}{\text{area}(\text{Predicted box} \cup \text{Ground-truth box})} $$

    Measures the overlap between the predicted and ground-truth boxes. A detection is typically counted as correct when IoU ≥ 0.5 (see the sketch at the end of this section).

  • mAP (mean Average Precision)

    With an IoU threshold of 0.5 the metric is AP@50 (a prediction counts as correct if it overlaps the ground truth by at least 50%); with a threshold of 0.75 it is AP@75, and so on.

    In COCO, AP@[.50:.05:.95] means AP is computed at IoU thresholds from 0.50 to 0.95 in steps of 0.05 (10 values) and then averaged (a simplified matching sketch appears at the end of this section).

    Therefore, common comparisons include:

    • $AP@50$: whether the detector “finds the target”
    • $AP@75$: whether the detector “boxes the target precisely”
    • $AP@[.50:.95]$: performance across varying strictness
    • $AP_S$: AP for objects with area < $32^2$ pixels
    • $AP_M$: AP for objects with area in $[32^2, 96^2]$ pixels
    • $AP_L$: AP for objects with area > $96^2$ pixels
  • FPS (Frames per Second)

    Measures inference speed; used to balance accuracy and real-time performance.
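
As a sketch of box IoU and of how detections are matched against ground truth at a fixed IoU threshold, the snippet below assumes axis-aligned boxes in (x1, y1, x2, y2) format; the matching is a greedy, highest-score-first simplification of the actual VOC/COCO evaluation protocols:

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # intersection rectangle
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # union = area_a + area_b - intersection
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections(detections, gt_boxes, iou_thr=0.5):
    """detections: list of (score, box); gt_boxes: list of boxes.

    Returns (score, is_true_positive) pairs; accumulating TP/FP over the
    score-sorted detections traces out the PR curve for this IoU threshold.
    """
    results, matched = [], set()
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # best still-unmatched ground-truth box for this detection
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            if j not in matched:
                iou = box_iou(box, gt)
                if iou > best_iou:
                    best_iou, best_j = iou, j
        if best_iou >= iou_thr:
            matched.add(best_j)
            results.append((score, True))    # true positive
        else:
            results.append((score, False))   # false positive (miss or duplicate)
    return results

# Example: box_iou((0, 0, 10, 10), (5, 5, 15, 15)) = 25 / 175 ≈ 0.14,
# below the usual 0.5 threshold, so such a detection would count as a false positive.
```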

3. Common Metrics for Image Segmentation

Segmentation performs per-pixel classification.

  • Pixel Accuracy

    Correctly classified pixels / Total pixels.

  • IoU (Jaccard Index)

    Measures overlap between predicted and ground-truth regions.

    Used in semantic and instance segmentation.

  • mIoU (mean IoU)

    IoU averaged over all classes; the most common segmentation metric.

  • Dice Coefficient (F1 for segmentation)

$$ Dice = \frac{2 \cdot |\text{Prediction} \cap \text{Ground truth}|}{|\text{Prediction}| + |\text{Ground truth}|} $$

More sensitive to small targets; common in medical image segmentation.
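
A minimal sketch of per-mask IoU and Dice on binary masks (NumPy 0/1 arrays); the small epsilon guards against empty masks, and the exact handling of empty or background classes varies by benchmark:

```python
import numpy as np

def mask_iou(pred, gt, eps=1e-7):
    """IoU (Jaccard index) between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return (inter + eps) / (union + eps)

def dice(pred, gt, eps=1e-7):
    """Dice coefficient = 2 * |A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

# mIoU averages mask_iou over all classes (whether background is included
# depends on the benchmark's convention).
```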

4. Common Metrics for Retrieval and Ranking

Used in image retrieval, face recognition, etc., where results are ranked.

  • Top-k Accuracy

    Correct if the true label appears in the top k predictions (common in ImageNet).

  • Recall@k

    Whether the target appears in the top k retrieved results.

  • mAP@k

    Average Precision computed over the top k retrieved results for each query, then averaged across queries.

  • NDCG (Normalized Discounted Cumulative Gain)

    Measures how close the ranking is to the ideal ordering.
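
Minimal sketches of Top-k accuracy, Recall@k, and NDCG@k (array shapes and names here are illustrative assumptions):

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """scores: (N, C) class scores; labels: (N,) true class indices."""
    topk = np.argsort(scores, axis=1)[:, ::-1][:, :k]     # k highest-scoring classes
    return (topk == labels[:, None]).any(axis=1).mean()

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / max(len(relevant_ids), 1)

def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of the retrieved results, in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float((rel * discounts).sum())
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]  # best possible ordering
    idcg = float((ideal * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0
```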

5. Additional Metrics

Focal Loss

One-stage detectors (e.g., RetinaNet) face two prominent issues:

  1. Class imbalance: most anchors are background (negatives); true objects (positives) are few.
  2. With standard cross-entropy, the many easy negatives dominate the loss, so the model under-learns hard examples (e.g., small or blurred objects).

Focal Loss adds a modulating factor on top of cross-entropy to down-weight easy examples and focus on hard ones.

$$ FL(p_t) = -\alpha_t (1 - p_t)^\gamma \log(p_t) $$

  • $p_t$: the predicted probability of the true class.
  • $\alpha \in [0,1]$: class-balancing weight ($\alpha_t = \alpha$ for positives, $1 - \alpha$ for negatives).
  • $\gamma \ge 0$: focusing parameter.

Explanation of the focusing parameter:

  • When $\gamma = 0$: reduces to standard cross-entropy.
  • When $\gamma > 0$:
    • If a sample is easy ($p_t$ large), $(1 - p_t)^\gamma$ is small → loss reduced.
    • If a sample is hard ($p_t$ small), $(1 - p_t)^\gamma$ is larger → loss amplified.

Thus, $\gamma$ controls the focus on hard samples.

In practice:

  • RetinaNet recommends $\alpha = 0.25, \gamma = 2$.
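
A minimal PyTorch-style sketch of binary (sigmoid) focal loss following the formula above; inputs are assumed to be raw logits and float 0/1 targets, and the defaults follow the RetinaNet recommendation:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    logits:  raw predictions (any shape)
    targets: float tensor of 0/1 labels with the same shape
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")   # -log(p_t)
    p_t = p * targets + (1 - p) * (1 - targets)              # probability of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class-balancing weight
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```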

References:

https://zhuanlan.zhihu.com/p/479674794

https://zhuanlan.zhihu.com/p/88896868

https://blog.csdn.net/qq_42722197/article/details/128963093

https://hackmd.io/@wayne0509/HkZ6rEZMP#mAP

https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall?hl=zh-cn

https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc?hl=zh-cn

Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2980-2988).

https://amaarora.github.io/posts/2020-06-29-FocalLoss.html