Generation of Image Dataset and Preprocessing

In Summer 2019, we partnered with the Hillsborough County Mosquito Control District in Florida to lay outdoor mosquito traps over multiple days. Each morning after laying traps, we collected all captured mosquitoes, froze them in a portable container and took them to the county lab, where taxonomists identified them for us. For this study, we utilized 23 specimens each of Aedes aegypti and Aedes infirmatus, and 22 specimens each of Aedes taeniorhynchus, Anopheles crucians, Anopheles quadrimaculatus, Anopheles stephensi, Culex coronator, Culex nigripalpus and Culex salinarius, for a total of 200 specimens. Specimens of eight of these species were trapped in the wild; only the An. stephensi specimens were lab-raised, from ancestors originally trapped in India.

Each specimen was then placed on a plain flat surface and imaged using one smartphone (among iPhone 8, 8 Plus, and Samsung Galaxy S8, S10) in normal indoor light conditions. To take images, the smartphone was attached to a movable platform 4 to 5 inches above the mosquito specimen, and three photos were taken at different angles: one directly above, and two at \(45^{\circ}\) angles to the specimen, opposite from each other. As a result of these procedures, we generated a total of 600 images. Then, 500 of these images were preprocessed to generate the training dataset, and the remaining 100 images were separated out for validation. For preprocessing, the images were scaled down to \(1024 \times 1024\) pixels for faster training (which did not lower accuracy). The images were augmented by adding Gaussian blur and randomly flipping them from left to right. These methods are standard in image processing and help account for variances during run-time execution. After this procedure, our training dataset increased to 1500 images.
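
The dataset arithmetic and the two augmentations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: images are modeled as small grayscale lists of lists, the blur parameters (`sigma`, `radius`) are hypothetical, rescaling to \(1024 \times 1024\) is omitted, and we assume one blurred copy and one flipped copy per training image, which is one scheme consistent with growing 500 images to 1500.

```python
import math

def gaussian_kernel(sigma=1.0, radius=2):
    """Return a normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    blurred = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xi = min(max(x + k - radius, 0), width - 1)
                acc += w * row[xi]
            new_row.append(acc)
        blurred.append(new_row)
    return blurred

def transpose(image):
    return [list(column) for column in zip(*image)]

def gaussian_blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur: filter the rows, then the columns."""
    kernel = gaussian_kernel(sigma, radius)
    row_pass = blur_rows(image, kernel)
    return transpose(blur_rows(transpose(row_pass), kernel))

def flip_left_right(image):
    return [row[::-1] for row in image]

def augment(images, sigma=1.0):
    """Originals plus one blurred and one flipped copy each (assumed scheme)."""
    augmented = list(images)
    augmented += [gaussian_blur(img, sigma) for img in images]
    augmented += [flip_left_right(img) for img in images]
    return augmented

# Bookkeeping taken from the text: two species with 23 specimens,
# seven species with 22 specimens, three photos per specimen.
specimens = 2 * 23 + 7 * 22      # 200
total_images = specimens * 3     # 600

# Toy stand-in for the 500 training images.
toy_image = [[float(x + y) for x in range(4)] for y in range(4)]
training_set = augment([toy_image] * 500)
print(len(training_set))         # 1500
```

Note that a real pipeline would also have to mirror the annotation polygons whenever an image is flipped, so that the ground truth stays aligned with the pixels.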

Note here that all mosquitoes used in this study are vectors. Among these, Aedes aegypti is particularly dangerous, since it spreads Zika fever, dengue, chikungunya and yellow fever. This mosquito is also globally distributed now.

Our Deep Neural Network Framework based on Mask R-CNN

To address our goal of extracting anatomical components from a mosquito image, a straightforward approach is to try a mixture of Gaussian models to remove the background from the image [1,15]. But this will only remove the background, without being able to extract anatomical components in the foreground separately. There are other recent approaches in this realm as well. One technique is U-Net [16], wherein semantic segmentation based on deep neural networks is proposed. However, this technique does not lend itself to instance segmentation (i.e., segmenting and labeling of pixels across multiple classes). Multi-task Network Cascade [17] (MNC) is an instance segmentation technique, but it is prone to information loss, and is not suitable for images as complex as mosquitoes with multiple anatomical components. Fully Convolutional Instance-aware Semantic Segmentation [18] (FCIS) is another instance segmentation technique, but it is prone to systematic errors on overlapping instances and creates spurious edges, which are not desirable. DeepMask [19], developed by Facebook, extracts masks (i.e., pixels) and then uses the Fast R-CNN [20] technique to classify the pixels within the mask. This technique, though, is slow, as it does not enable segmentation and classification in parallel. Furthermore, it uses selective search to find regions of interest, which further adds to delays in training and inference.

In our problem, we have leveraged the Mask R-CNN [11] neural network architecture for extracting masks (i.e., pixels) comprising objects of interest within an image. It eliminates selective search and instead uses a Region Proposal Network (RPN) [21] to learn correct regions of interest. This approach is best suited for quicker training and inference. Apart from that, it uses superior alignment techniques for feature maps, which help prevent information loss. The basic architecture is shown in Fig. 1. Adapting it for our problem requires a series of steps presented below.

  • Annotation for Ground-truth. First, we manually annotate our training and validation images using the VGG Image Annotator (VIA) tool [22]. To do so, we manually (and carefully) emplace bounding polygons around each anatomical component in our training and validation images. The pixels within the polygons and associated labels (i.e., thorax, abdomen, wing or leg) serve as ground truth. One sample annotated image is shown in Fig. 4.

  • Generate Feature Maps using CNN. Then, we learn semantically rich features in the training image dataset to recognize the complex anatomical components of the mosquito. To do so, our neural network architecture is a combination of the popular ResNet-101 architecture with Feature Pyramid Networks (FPN) [12]. Very briefly, ResNet-101 [23] is a CNN with residual connections, and was specifically designed to mitigate vanishing gradients at later layers during training. It is relatively simple with 345 layers. Addition of a feature pyramid network to ResNet was attempted in another study, where the motivation was to leverage the naturally pyramidal shape of CNNs, and to also create a subsequent feature pyramid network that combines low-resolution, semantically strong features with high-resolution, semantically weak features using a top-down pathway and lateral connections [12]. The resulting architecture is well suited to learn from images at different scales from only minimal input image scales. Ensuring scale-invariant learning is especially important for our problem, since mosquito images can be generated at different scales during run-time, due to diversity in camera hardware and human-induced variations.

  • Emplacing anchors on anatomical components in the image. In this step, we leverage the notion of a Region Proposal Network (RPN) [21] and the results from the previous two steps to design a simpler CNN that will learn feature maps corresponding to ground-truthed anatomical components in the training images. The end goal is to emplace anchors (rectangular boxes) that enclose the detected anatomical components of interest in the image.

  • Classification and pixel-level extraction. Finally, we align the feature maps of the anchors (i.e., regions of interest) learned from the above step into fixed-size feature maps, which serve as input to three branches that: (a) label the anchors with the anatomical component; (b) extract only the pixels within the anchors that represent an anatomical component; and (c) tighten the anchors for improved accuracy. All three steps are done in parallel.
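
To make the annotation step concrete, the sketch below shows one way a bounding polygon could be turned into a ground-truth pixel mask and an enclosing rectangle. It is only an illustration under assumptions: the polygon is a hypothetical list of (x, y) vertices rather than the VIA export format, pixels are sampled at their centers, and the even-odd ray casting test is a generic rasterization method, not necessarily the one used in our implementation.

```python
def point_in_polygon(px, py, polygon):
    """Even-odd ray casting test of a point against a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through py can be crossed.
        if (y1 > py) != (y2 > py):
            # x coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def polygon_to_mask(polygon, height, width):
    """Rasterize a polygon into a binary mask, sampling pixel centers."""
    return [[1 if point_in_polygon(x + 0.5, y + 0.5, polygon) else 0
             for x in range(width)]
            for y in range(height)]

def mask_to_box(mask):
    """Tight box (x_min, y_min, x_max, y_max) around the foreground pixels."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    return (min(xs), min(ys), max(xs), max(ys))

# Hypothetical annotation: a square polygon labeled "thorax" on a 6 x 6 image.
annotation = {"label": "thorax", "polygon": [(1, 1), (4, 1), (4, 4), (1, 4)]}
mask = polygon_to_mask(annotation["polygon"], height=6, width=6)
box = mask_to_box(mask)
foreground_pixels = sum(sum(row) for row in mask)
print(annotation["label"], box, foreground_pixels)
```

The mask plays the role of the pixel-level ground truth for the masking branch, while the rectangle is the kind of target against which anchors are tightened.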

Figure 4

Manual annotation of each anatomy (thorax, abdomen, wings, and legs) using VGG Image Annotator (VIA) tool.

Loss functions

For our problem, recall that there are three specific sub-problems: labeling the anchors as thorax, abdomen, wing or leg; masking the corresponding pixels within each anchor; and regressing to tighten the anchors. We elaborate now on the loss functions used for these three sub-problems, because loss functions are a critical component during training and validation of deep neural networks to improve learning accuracy and avoid overfitting.

Labeling (or classification) loss. For classifying the anchors, we utilized the Categorical Cross Entropy loss function, and it worked well. For a single anchor j, the loss is given by,

$$\begin{aligned} CCE_j=-\log(p), \end{aligned}$$


where p is the model estimated probability for the ground truth class of the anchor.
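
As a small worked example of this loss, the sketch below converts raw class scores for one anchor into probabilities with a softmax and evaluates \(-\log(p)\) for the ground-truth class. The scores are made-up numbers, and the class list holds only the four anatomical labels named in the text; whether an explicit background class is also scored is not modeled here.

```python
import math

CLASSES = ["thorax", "abdomen", "wing", "leg"]

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(probabilities, true_index):
    """CCE for one anchor: -log of the probability of the true class."""
    return -math.log(probabilities[true_index])

# Hypothetical scores for an anchor whose ground-truth label is "wing".
scores = [0.5, 0.2, 2.0, 0.1]
probabilities = softmax(scores)
loss = categorical_cross_entropy(probabilities, CLASSES.index("wing"))
print(round(loss, 4))
```

A confident correct prediction (p close to 1) gives a loss near 0, while a uniform guess over the four classes gives \(\log 4 \approx 1.386\).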

Masking loss. Masking is the most challenging sub-problem, considering the complexity of learning to detect only the pixels comprising anatomical components in an anchor. Initially, we experimented with the simple Binary Cross Entropy loss function. With this loss function, we noticed good accuracy for pixels corresponding to the thorax, wings and abdomen. However, many pixels corresponding to legs were mis-classified as background. This is because of the class imbalance highlighted in Fig. 5, wherein we see a significantly larger number of background pixels than foreground pixels for anchors (colored blue) emplaced around legs. This imbalance leads to poor learning for legs, because the binary cross entropy loss function is biased towards the (much more numerous, and easier to classify) background pixels.

Figure 5

After emplacement of anchors, we see significantly more background pixels than foreground pixels for anchors encompassing legs.

To fix this shortcoming, we investigated a more recent loss function called focal loss [24], which lowers the effect of well-classified samples on the loss and instead places more emphasis on the harder samples. This loss function hence prevents the more commonly occurring background pixels from overwhelming the less commonly occurring foreground pixels during learning, thereby overcoming class imbalance problems. The focal loss for a pixel i is represented as,

$$\begin{aligned} FL(i)=-(1-p)^{\gamma }\log (p), \end{aligned}$$


where p is the model estimated probability for the ground truth class, and \(\gamma\) is a tunable parameter, which was set to 2 in our model. With these definitions, it is easy to see that when a pixel is mis-classified and \(p \rightarrow 0\), the modulating factor \((1-p)^{\gamma}\) tends to 1 and the loss \(-\log(p)\) is not affected. However, when a pixel is classified correctly and \(p \rightarrow 1\), the loss is down-weighted. In this manner, training places more emphasis on the hard, mis-classified samples, hence yielding superior classification performance in the case of unbalanced datasets. Utilizing the focal loss gave us superior classification results for all anatomical components.
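
The effect of the modulating factor can be checked numerically. The sketch below compares the per-pixel cross entropy \(-\log(p)\) with the focal loss for an imbalanced anchor. The pixel counts and probabilities are hypothetical, chosen only to mimic many easy background pixels and a few hard leg pixels; only \(\gamma = 2\) comes from the text.

```python
import math

def cross_entropy(p):
    """Per-pixel cross entropy, with p the probability of the true class."""
    return -math.log(p)

def focal_loss(p, gamma=2.0):
    """Per-pixel focal loss: cross entropy scaled by (1 - p) ** gamma."""
    return -((1.0 - p) ** gamma) * math.log(p)

def foreground_share(loss_fn, background, foreground):
    """Fraction of the total loss contributed by the foreground pixels."""
    bg_total = sum(loss_fn(p) for p in background)
    fg_total = sum(loss_fn(p) for p in foreground)
    return fg_total / (bg_total + fg_total)

# Hypothetical anchor around a leg: 95 easy background pixels (p = 0.9)
# and 5 hard foreground pixels (p = 0.2).
background = [0.9] * 95
foreground = [0.2] * 5

ce_share = foreground_share(cross_entropy, background, foreground)
fl_share = foreground_share(focal_loss, background, foreground)
print(f"foreground share of loss: CE={ce_share:.2f}, focal={fl_share:.2f}")
```

Under these assumed numbers the leg pixels account for less than half of the cross entropy loss but nearly all of the focal loss, which is the rebalancing described above. Setting `gamma` to 0 recovers the plain cross entropy.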

Regressor loss. To tighten the anchors and hence improve masking accuracy, the loss function we utilized is based on the summation of Smooth L1 functions computed across the RPN-generated, ground-truth and predicted anchors. Let \((x, y)\) denote the top-left coordinate of a predicted anchor. Let \(x_a\) and \(x^*\) denote the same for anchors generated by the RPN and for the manually generated ground truth, respectively. The notations are the same for the y coordinate, width w and height h of an anchor. We define several terms first, following which the loss function \(L_{reg}\) used in our architecture is presented.

$$\begin{aligned} \begin{array}{l} t_x^*=\frac{(x^*-x_a)}{w_a},\quad \quad t_y^*=\frac{(y^*-y_a)}{h_a},\quad \quad t_w^*=\log (\frac{w^*}{w_a}),\quad \quad t_h^*=\log (\frac{h^*}{h_a}),\\ \\ t_x=\frac{(x-x_a)}{w_a},\quad \quad t_y=\frac{(y-y_a)}{h_a},\quad \quad t_w=\log (\frac{w}{w_a}),~\quad \quad t_h=\log (\frac{h}{h_a}),\\ \\ smooth_{L_1}= {\left\{ \begin{array}{ll} 0.5x^2 ,&{} \text {if } |x|< 1\\ |x| -0.5, &{} \text {otherwise} \end{array}\right. } ~~~\text {and} \\ \\ L_{reg}(t_i,t_{i}^{*})=\sum _{i\in \{x,y,w,h\}}smooth_{L_1}(t_{i}^{*}-t_i).\\ \\ \end{array} \end{aligned}$$
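
The definitions above translate directly into code. The following sketch, a minimal illustration rather than our training code, encodes a box relative to an RPN anchor and sums the Smooth L1 terms over x, y, w and h. Boxes are represented as hypothetical `(x, y, w, h)` tuples and the coordinates are made-up values.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def encode(box, anchor):
    """Offsets (t_x, t_y, t_w, t_h) of a box relative to an anchor."""
    x, y, w, h = box
    x_a, y_a, w_a, h_a = anchor
    return ((x - x_a) / w_a,
            (y - y_a) / h_a,
            math.log(w / w_a),
            math.log(h / h_a))

def regression_loss(predicted, ground_truth, anchor):
    """L_reg: sum of Smooth L1 over the four encoded offsets."""
    t = encode(predicted, anchor)
    t_star = encode(ground_truth, anchor)
    return sum(smooth_l1(ts - tp) for tp, ts in zip(t, t_star))

# Hypothetical boxes given as (x, y, w, h).
anchor = (10.0, 10.0, 20.0, 20.0)
ground_truth = (12.0, 14.0, 20.0, 20.0)
predicted = (10.0, 10.0, 20.0, 20.0)

loss = regression_loss(predicted, ground_truth, anchor)
print(round(loss, 6))  # 0.025
```

Here the offsets differ by 0.1 in x and 0.2 in y, so both terms fall in the quadratic regime and the loss is \(0.5 \cdot 0.1^2 + 0.5 \cdot 0.2^2 = 0.025\). A prediction that matches the ground truth exactly gives a loss of 0.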



For convenience, Table 5 lists values of critical hyperparameters in our finalized architecture.

Table 5 Values of critical hyperparameters in our architecture.
