A realistic fish-habitat dataset to evaluate algorithms for underwater visual analysis


Based on the labels of DeepFish, we consider four computer vision tasks: classification, counting, localization, and segmentation. Deep learning methods have consistently achieved state-of-the-art results on these tasks because they can leverage the enormous size of the datasets they are trained on. These datasets include ImageNet [6], Pascal [7], CityScapes [5], and COCO [24]. DeepFish aims to join these large-scale datasets with the unique goal of understanding complex fish habitats and inspiring further research in this area.

We present standard deep learning methods for each of these tasks. Shown as the blue module in Fig. 4, these methods share a ResNet-50 [13] backbone, one of the most popular feature extractors for image understanding and visual recognition. Such backbones enable models to learn from large datasets and transfer the acquired knowledge to train efficiently on another dataset. This process is known as transfer learning and is used in most current deep learning methods [22]. Pretrained models can even recognize object classes that they have never been trained on [29], which illustrates how powerful the features extracted by a pretrained ResNet-50 are.

Therefore, we initialize the weights of our ResNet-50 backbone by pre-training it on ImageNet, following the procedure discussed in [6]. ImageNet consists of over 14 million images categorized into 1,000 classes. By training on such a dataset, the backbone learns to extract strong, general features for unseen images. These features are then used by a designated module to perform its respective computer vision task, such as classification or segmentation. We describe these modules in the sections below.
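As a rough illustration of this setup, the sketch below loads an ImageNet-pretrained ResNet-50 and removes its final classification layer so it acts as a feature extractor. The choice of PyTorch/torchvision is our assumption for illustration; the paper does not prescribe an implementation.

```python
import torch
from torchvision import models

# Load a ResNet-50 pretrained on ImageNet (torchvision's published weights).
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Drop the final 1,000-class layer so the network acts as a feature extractor.
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])

image = torch.randn(1, 3, 224, 224)   # a dummy RGB image batch
features = feature_extractor(image)   # shape: (1, 2048, 1, 1)
```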

To put the results into perspective, we also include baseline results obtained by training the same methods without ImageNet pretraining (Table 3). In this case, we randomly initialize the weights of the ResNet-50 backbone with Xavier's method [11]. These results also illustrate the efficacy of pretrained models over randomly initialized ones.

Table 3 Comparison between randomly initialized and ImageNet pretrained models.

Classification results

The goal of the classification task is to identify whether images are foreground (contain fish) or background (contain no fish). We evaluate models on this task using accuracy, a standard metric for binary classification problems [3,8,9,15,27]. The metric is computed as

$$\begin{aligned} ACC = (TP + TN)/N, \end{aligned}$$

where \(TP\) and \(TN\) are the numbers of true positives and true negatives, respectively, and \(N\) is the total number of images. A true positive is an image with at least one fish that is predicted as foreground, whereas a true negative is an image with no fish that is predicted as background. For this task we use the FishClf dataset, which contains 39,766 labeled images.

The classification architecture consists of a ResNet-50 backbone and a feed-forward network (FFN) (classification branch of Fig. 4). The FFN takes as input the features extracted by ResNet-50 and outputs a probability corresponding to how likely the image is to contain a fish. If the probability is higher than 0.5, the predicted label is foreground. For the FFN, we use the three-layer network presented in the ImageNet work [6]. However, instead of the original 1,000-class output layer, we use a 2-class output layer to represent the foreground and background classes.
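A minimal sketch of this architecture follows, again assuming PyTorch. The hidden-layer sizes of the FFN (512 and 128) are illustrative guesses, not values reported in the paper.

```python
import torch.nn as nn
from torchvision import models

class FishClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights="IMAGENET1K_V1")
        # Everything up to (and including) global average pooling.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.ffn = nn.Sequential(          # 3-layer feed-forward head
            nn.Linear(2048, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 2),             # foreground / background scores
        )

    def forward(self, x):                  # x: (B, 3, 224, 224)
        feats = self.backbone(x).flatten(1)  # (B, 2048) feature vectors
        return self.ffn(feats)               # (B, 2) class scores
```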

During training, the classifier learns to minimize the binary cross-entropy objective function [28] using the Adam [16] optimizer. The learning rate is set to \(10^{-3}\) and the batch size to 16. Since the FFN requires a fixed feature resolution, the input images are resized to \(224\times 224\). At test time, the model outputs a score for each of the two classes for a given unseen image, and the predicted class is the one with the higher score.
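The training step might look as follows, using the FishClassifier sketch above. Here `loader` is an assumed DataLoader yielding batches of 16 resized images with labels (0 = background, 1 = fish).

```python
import torch
import torch.nn.functional as F

model = FishClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for images, labels in loader:
    logits = model(images)                  # (16, 2) class scores
    # With a 2-class output layer, cross-entropy is equivalent to the
    # binary cross-entropy objective described in the text.
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```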

In Table 3 we compare a classifier whose backbone is pretrained on ImageNet against one whose backbone is randomly initialized. Note that both classifiers have their FFN initialized at random. The pretrained model achieves near-perfect classification results, significantly outperforming the baseline. This result suggests that transfer learning is important and that deep learning has strong potential for analyzing fish habitats.

Figure 4

Deep learning methods. The architecture used for the four computer vision tasks of classification, counting, localization, and segmentation consists of two components. The first component is the ResNet-50 backbone which is used to extract features from the input image. The second component is either a feed-forward network that outputs a scalar value for the input image or an upsampling path that outputs a value for each pixel in the image.

Counting results

The goal of the counting task is to predict the number of fish present in an image. We evaluate the models on the FishLoc dataset, which consists of 3,200 images labeled with point-level annotations. We measure a model's efficacy in predicting the fish count using the mean absolute error, defined as

$$\begin{aligned} MAE = \frac{1}{N}\sum_{i=1}^{N} |\hat{C}_i - C_i|, \end{aligned}$$

where \(C_i\) is the true fish count for image \(i\) and \(\hat{C}_i\) is the model's predicted fish count for image \(i\). This metric is standard for object counting [12,23]; it measures the average number of miscounts the model makes across the test images.
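The metric is a one-liner in practice; a small sketch:

```python
import numpy as np

def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and true per-image fish counts."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    return np.abs(pred - true).mean()

print(mae([3, 0, 8], [2, 0, 8]))  # 0.333...: one miscount over three images
```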

The counting branch in Fig. 4 shows the architecture used for the counting task, which, like the classifier, consists of a ResNet-50 backbone and a feed-forward network (FFN). Given the features extracted from the backbone for an input image, the FFN outputs a number that corresponds to the count of the fish in the image. Thus, instead of the classifier's 2-class output layer, the counting model has a single-node output layer.

We train the models by minimizing the squared error loss [28], a common objective function for the counting task. At test time, the model's scalar output for an image is taken as the predicted object count.
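The counting objective then reduces to a regression loss; a short sketch, assuming `model` is the single-output counting variant and `counts` holds the true per-image fish counts:

```python
import torch.nn.functional as F

pred_counts = model(images).squeeze(1)          # (B,) predicted counts
loss = F.mse_loss(pred_counts, counts.float())  # squared error loss
```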

The counting model with the backbone pretrained on ImageNet achieved an MAE of 0.38 (Table 3). This corresponds to an average of 0.38 fish miscounts per image, which is satisfactory given that the average number of fish per image is 7. In comparison, the randomly initialized counting model achieved an MAE of 1.30. This result further confirms that transfer learning and deep learning can successfully address the counting task, despite the counting dataset (FishLoc) being much smaller than the classification dataset (FishClf).

Localization results

Localization is the task of identifying the locations of the fish in the image. It is more difficult than classification and counting, as the fish can extensively overlap. As with the counting task, we evaluate the models on the FishLoc dataset. However, MAE scores do not indicate how well a model performs at localization, since the model can count the wrong objects and still achieve a perfect score. To address this limitation, we follow [12] and use a more accurate evaluation for localization that considers both the object count and the estimated object locations. This metric is called Grid Average Mean absolute Error (GAME). It is computed as

$$\begin{aligned} GAME = \sum_{L=1}^{4} GAME(L), \quad GAME(L) = \frac{1}{N}\sum_{i=1}^{N}\left( \sum_{l=1}^{4^L} \left|D_i^l - \hat{D}_i^l\right| \right), \end{aligned}$$

where \(D_i^l\) is the number of point-level annotations in region \(l\) of image \(i\), and \(\hat{D}_i^l\) is the model's predicted count for that region. \(GAME(L)\) first divides the image into a grid of \(4^L\) non-overlapping regions and then computes the sum of the MAE scores across these regions. The higher \(L\), the more restrictive the GAME metric. Note that \(GAME(0)\) is equivalent to MAE.
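A minimal per-image sketch of the \(GAME(L)\) term follows; the reported score averages this over the \(N\) test images and, for the total GAME, sums over \(L = 1, \dots, 4\).

```python
import numpy as np

def game_L(points, pred, L):
    """Per-image GAME(L) term: split the image into 4**L non-overlapping
    regions (a 2**L x 2**L grid) and sum the absolute count errors.

    points -- (H, W) map with a 1 at each ground-truth point annotation
    pred   -- (H, W) map of the model's predicted fish locations
    """
    H, W = points.shape
    n = 2 ** L
    error = 0.0
    for i in range(n):
        for j in range(n):
            true_count = points[i*H//n:(i+1)*H//n, j*W//n:(j+1)*W//n].sum()
            pred_count = pred[i*H//n:(i+1)*H//n, j*W//n:(j+1)*W//n].sum()
            error += abs(true_count - pred_count)
    return error  # average over all test images to obtain GAME(L)
```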

The localization branch in Fig. 4 shows the architecture used for the localization task, which consists of a ResNet-50 backbone and an upsampling path. The upsampling path is based on FCN8 [26], a standard fully convolutional network for localization and segmentation whose upsampling path consists of three upsampling layers.

FCN8 processes images as follows. The features extracted by the backbone have a lower resolution than the input image. They are therefore upsampled by the upsampling path to match the input resolution. The final output is a per-pixel probability map, where each value represents the likelihood that the corresponding pixel belongs to the fish class.

The model is trained using a state-of-the-art localization-based loss function called LCFCN [21]. LCFCN combines four objective functions: an image-level loss, a point-level loss, a split-level loss, and a false-positive loss. The image-level loss encourages the model to predict all pixels as background for background images. The point-level loss encourages the model to predict the centroids of the fish. These two terms alone, however, do not prevent the model from predicting every pixel as fish for foreground images. Thus, LCFCN also minimizes the split loss and the false-positive loss. The split loss splits the predicted regions so that no region contains more than one point annotation, resulting in one blob per point annotation. The false-positive loss prevents the model from predicting blobs in regions with no point annotations. Note that training LCFCN requires only point-level annotations, which specify the spatial location of each object in the image.

At test time, the predicted probability map is thresholded at 0.5, yielding a binary mask in which each blob is a single connected component; the blobs can be obtained with the standard connected-components algorithm. The number of connected components is the object count, and each blob represents the location of an object instance (see Fig. 5 for example predictions with FCN8 trained with LCFCN).
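This inference step is straightforward to sketch with a standard connected-components routine (here scikit-image's `measure.label`; any equivalent labeling function would do):

```python
import numpy as np
from skimage import measure

def localize(prob_map, threshold=0.5):
    """Threshold a per-pixel probability map and extract fish blobs."""
    mask = (prob_map > threshold).astype(np.uint8)       # binary mask
    blobs, count = measure.label(mask, return_num=True)  # label each blob
    return blobs, count   # count = predicted number of fish
```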

Models trained on this dataset are optimized using Adam [16] with a learning rate of \(10^{-3}\) and a weight decay of 0.0005, and are run for 1,000 epochs on the training set. In all cases the batch size is 1, which makes the method applicable on machines with limited memory.

Table 3 shows the MAE and GAME results of training FCN8 with and without a pretrained ResNet-50 backbone using the LCFCN loss function. Pretraining leads to a significant improvement in MAE and a slight improvement in GAME. The efficacy of the pretrained model is further confirmed by the qualitative results in Fig. 5a, where the predicted blobs are well placed on top of the fish in the images.

Figure 5

Qualitative results on counting, localization, and segmentation. (a) Prediction results of the model trained with the LCFCN loss [21]. (b) Annotations representing the \((x, y)\) coordinates of each fish within the images. (c) Prediction results of the model trained with the focal loss [25]. (d) Annotations representing the full segmentation masks of the corresponding fish.

Segmentation results

The task of segmentation is to label every pixel in the image as either fish or not fish (Fig. 5c,d). When combined with depth information, a segmented image allows us to measure the size and weight of the fish in a location, which can vastly improve our understanding of fish communities. We evaluate the models on the FishSeg dataset, for which we acquired per-pixel labels for 620 images. We use the standard Jaccard index [5,7], defined as the number of correctly labeled pixels of a class divided by the number of pixels labeled with that class in either the ground-truth mask or the predicted mask. It is commonly known as the intersection-over-union (IoU) metric and is computed as \(\frac{TP}{TP + FP + FN}\), where \(TP\), \(FP\), and \(FN\) are the numbers of true positive, false positive, and false negative pixels, respectively, determined over the whole test set. In segmentation tasks, the IoU is preferred over accuracy because it is less affected by the class imbalance inherent in foreground-background segmentation masks such as those in DeepFish.
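A per-image sketch of the metric is below; for the test-set-wide score described above, accumulate TP, FP, and FN across all images before dividing.

```python
import numpy as np

def iou(pred_mask, true_mask):
    """Foreground IoU = TP / (TP + FP + FN) over binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    tp = np.logical_and(pred, true).sum()    # true positive pixels
    fp = np.logical_and(pred, ~true).sum()   # false positive pixels
    fn = np.logical_and(~pred, true).sum()   # false negative pixels
    return tp / (tp + fp + fn)
```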

During training, instead of minimizing the standard per-pixel cross-entropy loss [26], we use the focal loss [25], which is more suitable when the number of background pixels is much larger than the number of foreground pixels, as in our dataset. The rest of the training procedure is the same as for the localization methods.
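For reference, a minimal per-pixel binary focal loss might look as follows. The `alpha` and `gamma` defaults follow the original focal loss paper [25], not necessarily the settings used here.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Per-pixel binary focal loss, averaged over all pixels.

    logits  -- raw per-pixel scores from the network
    targets -- binary mask (1 = fish, 0 = background), same shape as logits
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)          # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)**gamma down-weights easy, well-classified pixels so the
    # abundant background does not dominate the loss.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```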

At test time, the model outputs a probability for each pixel in the image. If a pixel's foreground probability is higher than 0.5, the pixel is labeled as fish, resulting in a segmentation mask for the input image.

The results in Table 3 compare the pretrained and randomly initialized segmentation models. As with the other tasks, the pretrained model achieves superior results both quantitatively and qualitatively (Fig. 5).

Ethical approval

This work was conducted with the approval of the JCU Animal Ethics Committee (protocol A2258), and conducted in accordance with DAFF general fisheries permit #168652 and GBRMP permit #CMES63.


