CN113160108B - Sequential query counting method for few-sample multi-class baits - Google Patents

Sequential query counting method for few-sample multi-class baits

Info

Publication number
CN113160108B
CN113160108B (application CN202011382864.6A)
Authority
CN
China
Prior art keywords
bait
image
baits
counting
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011382864.6A
Other languages
Chinese (zh)
Other versions
CN113160108A (en)
Inventor
赵德安
曹硕
孙月平
秦云
盛亮
戚浩
石子坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu University
Original Assignee
Jiangsu University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu University filed Critical Jiangsu University
Priority to CN202011382864.6A priority Critical patent/CN113160108B/en
Publication of CN113160108A publication Critical patent/CN113160108A/en
Application granted granted Critical
Publication of CN113160108B publication Critical patent/CN113160108B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — Physics
    • G06 — Computing; Calculating or Counting
    • G06T — Image Data Processing or Generation, in General
    • G06T7/00 — Image analysis
    • G06T7/0002 — Inspection of images, e.g. flaw detection
    • G06F — Electric Digital Data Processing
    • G06F16/00 — Information retrieval; database structures therefor; file system structures therefor
    • G06F16/50 — Information retrieval of still image data
    • G06F16/51 — Indexing; data structures therefor; storage structures
    • G06F16/58 — Retrieval characterised by using metadata
    • G06F16/583 — Retrieval using metadata automatically derived from the content
    • G06T2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T2207/20081 — Training; learning
    • G06T2207/20084 — Artificial neural networks [ANN]
    • G06T2207/30188 — Vegetation; agriculture
    • G06T2207/30242 — Counting objects in image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a sequential query counting method for few-sample multi-class baits. Bait images that reflect actual feeding practice are flexibly collected with a camera and labelled only with point annotations to construct support and query image sets. The extracted feature maps are then processed with a recurrent attention mechanism: each bait in the query image is attended to in the dictionary order of its label coordinates, the computed attention map weights the features into a query feature vector, and the query feature vectors are passed through a linear layer together with prototype feature vectors extracted from the support images, converting the small-sample bait classification problem into a linear classification model. Finally, the distance between the high-dimensional feature vector of each queried bait and each class prototype is computed to predict its class, enabling counting and tracking of the bait remainder. The method is suited to counting few-sample multi-class baits, and is key to scientifically measuring bait remainders and the ingestion capacity coefficient of river crabs in order to establish modern cultivation feeding decisions.

Description

Sequential query counting method for few-sample multi-class baits
Technical Field
The invention relates to the field of underwater image processing and bait counting, in particular to a sequential query counting method for few-sample multi-class baits.
Background
The feeding rhythm is the result of crabs actively adapting, over long-term growth and evolution, to periodically changing environmental conditions such as illumination, temperature, dissolved oxygen and bait availability. Because the feeding rhythm directly governs decisions on feeding time, feeding mode, feeding frequency and feeding quantity, it influences both the utilization rate of the bait and the pollution load of the culture water body. At present, most feeding in the river crab culture process is based on the crab ingestion rhythm, so research on the feeding rhythm of cultured crabs is important to the feeding strategy. Modern culture feeding strategies already account for factors such as the variety, size, quantity distribution and behavior of the cultured animals and the quality of the culture water body through underwater machine vision and machine learning, rather than relying on artificial experience and simple timing control. Even so, tracking the types, scientific proportions and residual amounts of bait fed under different environments remains an important basis for judging crab ingestion capacity and making feeding decisions.
At present, bait remainders are mainly estimated from underwater or surface video/image sequences of fish/crab foraging captured with dedicated equipment such as a culture net cage fitted with a camera or a fixed observation platform. In one line of work, image segmentation methods based on thresholds, regions or contours indirectly extract the shape features, area features or cluster-foraging shadow features of the cultured animals from the real-time sequences, and the feeding allowance is estimated from changes in these feature areas. In another, bait is directly separated from non-bait and an erosion operation is applied, after which the connected components of the eroded images are counted and summed; alternatively, the contours of the bait targets are separated directly, their areas computed and sorted, and the area of a single bait determined so as to estimate the number of baits contained in each contour. It is readily seen that these methods focus on counting a single class of object (granule bait) and do not address multi-class counting.
In fact, multi-class counting better matches the actual feeding of river crabs: when counting bait on an observation platform, the images contain many categories of bait (plant baits such as corn, soybean meal and wheat bran; animal baits such as fish, shrimp, snail and clam; and various pellet baits). For target counting, however, when each acquired bait image contains multiple labels and the types and quantities of bait are uncontrollable, a large amount of labelled data for the various baits is needed. Training a model that can identify and count new classes of bait given only a small number of labelled bait images is therefore a great challenge.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a sequential query counting method for few-sample multi-class baits. It solves the problem of accurately counting multiple classes of bait in a bait image when labelled images are scarce, so that the bait remainder can be measured scientifically, the feeding capacity of river crabs estimated, and, together with other feeding rhythms, a modern river crab culture feeding strategy established.
The technical scheme of the invention is a sequential query counting method for few-sample multi-class baits, which comprises the following steps:
step 1, performing nodding shooting through a CMOS camera arranged below an automatic bait casting boat with a GPS positioning function, and continuously collecting an underwater two-dimensional RGB image sequence capable of reflecting the residual quantity of the bait in a pond along with boat movement on one hand; on the other hand, the device is intermittently static along with the ship, and an underwater two-dimensional RGB image sequence capable of reflecting continuous change of the bait remaining amount is shot. The two acquisition modes complement each other and are flexible and changeable, the limitation that an observation station is required to be set up to observe bait change is overcome, and the average density and the category of the bait in each image are various, so that a bait image data set with category and target distribution which accords with the actual feeding situation is constructed.
Step 2: multi-class baits in the acquired images are annotated at pixel-point level with an annotation tool, i.e. exactly one pixel of each bait in the image is marked. This setting depends on neither image-level labels (per-class counts) nor bounding-box labels, which reduces the number of training samples required and the labelling cost per sample, eases the challenge of acquiring large numbers of bait image labels, and makes it easy to construct the support and query image sets needed to train the model.
Step 3: to suit few-sample multi-class bait counting, the weakly supervised sequential query counting model first uses the provided pixel-point annotations to guide attention to the bait instances in the image, learning to output a segmentation blob for each instance and thereby distinguishing and locating bait against the background. A class-agnostic recurrent attention mechanism then attends to the baits in the image one by one in a fixed preset order and extracts their features, so that each bait can be assigned to one of the classes in the support set. Specifically, the sequential query counting model for few-sample multi-class baits is realised by the following steps and structures:
step 3.1, inputting the support set images and the query image into the shared feature extractor IVoVNet-19 to extract feature maps;
step 3.2, for the query image, a recurrent neural network and a sequential attention mechanism interact with the extracted feature map in a recurrent loop, generating a spatially weighted map for each bait in the image and, from it, a query feature vector for each bait;
step 3.3, for the support images, generating a weight map for the bait targets of interest and for the background class using the recurrent neural network and Gaussian maps generated from the annotated pixel points, and then creating a prototype feature vector for each class of bait;
step 3.4, cross-correlating the query feature vectors with the set of prototype features extracted from the support images to output a class index for each bait present in the image, thereby realising the locating, classification and counting of complex multi-class baits.
Step 4: the objective function that fits the training data is composed of the localisation loss of the attention module and the classification loss of the located bait. The localisation loss is the sum, over time steps, of the KL divergence between the attention map generated for a bait in the query image and the corresponding Gaussian centred on its annotation point; the classification loss is the cross-entropy between the predicted class score and the corresponding label.
Step 5: the training labels of each image are a sequence of (pixel point, bait class) pairs organised in a specific order, analogous to the order produced by flattening the image: the bait labels are sorted from top-left to bottom-right, i.e. in the dictionary order of their (y, x) coordinates.
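The ordering rule of step 5 can be sketched in a few lines. This is a minimal illustration, not the patented implementation; the point format and class names are illustrative assumptions.

```python
# A minimal sketch of the label-ordering rule above: (pixel point, bait
# class) pairs are sorted top-left to bottom-right, i.e. lexicographically
# by (y, x). The point format and class names are illustrative assumptions.
def order_labels(points):
    """points: list of ((y, x), class_name) pairs -> ordered sequence."""
    return sorted(points, key=lambda p: (p[0][0], p[0][1]))

labels = [((40, 10), "corn"), ((12, 55), "shrimp"), ((12, 3), "pellet")]
ordered = order_labels(labels)
```

A bait at (12, 3) thus precedes one at (12, 55), which precedes one at (40, 10), regardless of class.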
Step 6: the network parameters affecting model training and model output are updated end to end with an Adam optimizer, with a learning rate that changes gradually over training. In addition, for effective training an adaptive weighting strategy is adopted: as the model is trained iteratively, the weight of the localisation loss in the objective function is gradually reduced so as to balance the localisation and classification losses.
Further, the counting model of step 3.1 uses IVoVNet-19 as the basis of its feature extractor, where IVoVNet-19 is composed of 4 stages of improved one-shot aggregation units that add identity mappings and channel attention. To further improve recognition of baits at different scales, the output-layer features of the 4 stages of aggregation units in the IVoVNet-19 network are upsampled to one quarter of the original image size so as to fuse the network's shallow features (more detail-oriented, suited to localisation) with its deep features (more semantic, suited to classification), providing suitable resolution and strong semantics for multi-stage prediction of targets at different scales. At the same time, so that the counting model is position-aware, the encoded position of each pixel in the feature map is concatenated with the features extracted from the image by one-hot encoding the x and y coordinates.
Further, in step 3.4, to construct the class-prediction logic for a query target (bait), all query feature vectors and prototype feature vectors are passed through a multilayer perceptron of linear layers, converting the small-sample bait classification problem into a linear classification model. The distance between the high-dimensional feature vector of each queried bait and each class prototype is computed and converted into a probability with softmax, predicting the class of the bait and realising the counting of complex multi-class baits.
The technical effect of the invention: for crab feeding scenes where bait classes are numerous but labelled data are scarce, a novel recurrent attention mechanism extracts the features of the baits in the query image sequentially in a specific order and compares them with class prototypes extracted from the support set. Combined with point-level annotation, this converts the small-sample bait classification problem into a linear classification model, and the counting model can handle scenes with many classes in each image. The method enables counting and tracking of the remainders of multiple baits in complex scenes, so that the ingestion capacity coefficient of river crabs can be measured scientifically to help establish a scientific feeding strategy and improve survival rate, energy conversion efficiency and bait utilization during cultivation.
Drawings
FIG. 1 is a diagram showing an implementation structure of a few-sample multi-class bait sequential query counting model;
FIG. 2 is a schematic diagram of the basic structural element of the feature extractor (an improved one-shot aggregation unit that adds identity mapping and channel attention);
FIG. 3 is a schematic diagram of an implementation of a query feature vector;
FIG. 4 is a schematic representation of an implementation of prototype feature vectors.
Detailed Description
The following describes the embodiment of the present invention further with reference to the accompanying drawings, and a specific flow is shown in fig. 1.
1. Acquisition of bait image data sets with class and target distribution
According to the invention, a CMOS camera mounted below an automatic bait-casting boat with GPS positioning shoots straight downward. On one hand, as the boat moves, it continuously collects underwater two-dimensional RGB image sequences reflecting the bait remainder across the pond; on the other hand, while the boat rests intermittently, it shoots sequences reflecting the continuous change of the bait remainder at one spot. The two acquisition modes complement each other and are flexible, overcoming the limitation of having to build an observation platform to watch bait change; the average density and the categories of bait vary across images, so the constructed bait image data set has a class and target distribution that matches actual feeding conditions.
2. Construction of a support image set and a query image set using pixel-level annotations
Multi-class baits in the acquired images are annotated at pixel-point level with an annotation tool, i.e. exactly one pixel of each bait in the image is marked. This setting depends on neither image-level labels (per-class counts) nor bounding-box labels, which reduces the number of training samples required and the labelling cost per sample, eases the challenge of acquiring large numbers of bait image labels, and makes it easy to construct the support and query image sets needed to train the model.
3. Few-sample multi-class bait sequential inquiry counting model
To suit few-sample multi-class bait counting, the weakly supervised sequential query counting model of the invention first uses the provided pixel-point annotations to guide attention to the bait instances in the image, learning to output a segmentation blob for each instance and thereby distinguishing and locating bait against the background. A class-agnostic recurrent attention mechanism then attends to the baits in the image one by one in a fixed preset order and extracts their features, so that each bait can be assigned to one of the classes in the support set. Specifically, the sequential query counting model for few-sample multi-class baits comprises the following steps and structures (as shown in fig. 1):
Step (a): the support set images and the query image are input to the shared feature extractor IVoVNet-19 to extract feature maps.
The counting model uses IVoVNet-19 as the basis of its feature extractor, where IVoVNet-19 is composed of 4 stages of improved one-shot aggregation units (FIG. 2) that add identity mapping and channel attention. Each basic one-shot aggregation unit contains 3 convolutional transformation layers; feature maps with progressively larger receptive fields are aggregated step by step through the 3 successive convolutions into the final output layer, producing features that cover a good range of receptive fields. In addition, to further improve recognition of baits at different scales, the output-layer features of the 4 stages of aggregation units in the IVoVNet-19 network are upsampled to one quarter of the original image size so as to fuse the network's shallow features (more detail-oriented, suited to localisation) with its deep features (more semantic, suited to classification), providing suitable resolution and strong semantics for multi-stage prediction of targets at different scales. At the same time, so that the counting model is position-aware, the encoded position of each pixel in the feature map is concatenated with the features extracted from the image by one-hot encoding the x and y coordinates. Finally, the extracted feature map is output to the decoding processes of step (b) and step (c).
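The coordinate-encoding idea in step (a) can be illustrated as follows. This is a hedged NumPy sketch, not the patented implementation: the feature-map shapes are illustrative assumptions, and only the one-hot position channels are shown.

```python
import numpy as np

# Sketch of the position-encoding step described above: one-hot encodings of
# each pixel's y and x coordinates are appended as extra channels of the
# feature map so that the attention module is position-aware. Shapes are
# illustrative assumptions; this is not the patented implementation itself.
def add_coord_onehot(feat):
    """feat: (C, H, W) feature map -> (C + H + W, H, W)."""
    C, H, W = feat.shape
    ys = np.repeat(np.eye(H)[:, :, None], W, axis=2)  # channel i is 1 on row i
    xs = np.repeat(np.eye(W)[:, None, :], H, axis=1)  # channel j is 1 on column j
    return np.concatenate([feat, ys, xs], axis=0)

feat = np.random.rand(8, 16, 16)
encoded = add_coord_onehot(feat)
```

Each pixel thus carries an exact encoding of its own row and column alongside its appearance features.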
Step (b): for the query image, a recurrent neural network and a sequential attention mechanism interact with the extracted feature map in a recurrent loop, generating a spatially weighted map for each bait in the image and, from it, a query feature vector for each bait.
As shown in fig. 3, at each time step t the attention module linearly combines the extracted image feature map f with the previous LSTM hidden state h_{t-1} and produces an attention map through a nonlinearity: a_t = softmax(q^T tanh(W_f f + W_h h_{t-1})), where W_f, W_h and q are trained weights. The generated attention map a_t then spatially weights the extracted feature map f, reducing it to a feature vector v_t = Σ_{i,j} a_t(i, j) f(i, j), where i and j are the abscissa and ordinate of the feature map. Next, the feature vector v_t is linearly combined with the class score predicted at the previous time step, s_{t-1} = softmax(o_{t-1}), to form the next LSTM input x_t = W_v v_t + W_s s_{t-1}, where W_v and W_s are also trained weights. Finally, the LSTM output o_t = LSTM(x_t, h_{t-1}) serves as the class score of the current time step t; in this way a query feature vector is generated for each bait.
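One attention step from the formulas above can be sketched numerically. This is a hedged illustration: the LSTM cell itself is omitted, and all weight shapes are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One time step of the recurrent attention decoder sketched from the formulas
# above: a_t = softmax(q^T tanh(W_f f + W_h h_{t-1})) and
# v_t = sum_{i,j} a_t(i,j) f(i,j). The LSTM itself is omitted and all weight
# shapes are illustrative assumptions.
def attend_step(f, h_prev, W_f, W_h, q):
    """f: (C, H, W) feature map; h_prev: (D,) previous hidden state."""
    C, H, W = f.shape
    flat = f.reshape(C, H * W)                        # one column per pixel
    scores = q @ np.tanh(W_f @ flat + (W_h @ h_prev)[:, None])
    a_t = softmax(scores)                             # spatial attention weights
    v_t = flat @ a_t                                  # attention-weighted feature vector
    return a_t.reshape(H, W), v_t

rng = np.random.default_rng(0)
C, H, W, D, K = 4, 5, 5, 6, 7
a_t, v_t = attend_step(rng.standard_normal((C, H, W)), rng.standard_normal(D),
                       rng.standard_normal((K, C)), rng.standard_normal((K, D)),
                       rng.standard_normal(K))
```

The softmax guarantees the attention map is a distribution over pixels, so v_t is a convex combination of per-pixel features.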
Step (c): for the support images, a weight map is generated for the bait targets of interest and for the background class using the recurrent neural network and Gaussian maps generated from the annotated pixel points, and a prototype feature vector is then created for each class of bait.
As shown in FIG. 4, instead of the attention module, a Gaussian kernel G_t^s generated from the annotation point of the current target and an estimate of the standard deviation is used as the weight map. The Gaussian kernel, centred on the target of interest, weights the features extracted from the support set; a weight map is likewise generated for the background class. The weight map reduces the feature map to a high-dimensional feature vector, and the mean of the vectors produced by all baits of a given class in the support set is taken as the prototype of that bait class, enabling the classification of the baits extracted from the query image.
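The Gaussian weighting and prototype averaging can be sketched as below. This is a hedged illustration under stated assumptions: the kernel is normalised here for convenience, and the feature-map shapes are illustrative.

```python
import numpy as np

# Sketch of the support-side weighting above: a normalised Gaussian kernel
# centred on each annotation point reduces the feature map to one vector, and
# the per-class mean of these vectors is the class prototype. The standard
# deviation of 0.8 matches the training setup; shapes are illustrative.
def gaussian_map(H, W, cy, cx, sigma=0.8):
    ys, xs = np.mgrid[0:H, 0:W]
    g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
    return g / g.sum()                                # weights sum to 1

def class_prototype(feat, points, sigma=0.8):
    """feat: (C, H, W); points: (y, x) annotations of one bait class."""
    vecs = [(feat * gaussian_map(feat.shape[1], feat.shape[2], y, x, sigma)).sum(axis=(1, 2))
            for y, x in points]
    return np.mean(vecs, axis=0)                      # (C,) prototype vector

feat = np.random.rand(4, 9, 9)
proto = class_prototype(feat, [(2, 2), (6, 7)])
```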
Step (d): cross-correlating the query feature vectors with the set of prototype features extracted from the support images outputs a class index for each bait present in the image, realising the locating, classification and counting of complex multi-class baits.
Specifically, to construct the class-prediction logic for a query target (bait), all query feature vectors and prototype feature vectors are passed through a multilayer perceptron of linear layers, converting the small-sample bait classification problem into a linear classification model. The distance between the high-dimensional feature vector of each queried bait and each class prototype is computed and converted into a probability with softmax to predict the class of the bait, thereby realising the counting of complex multi-class baits.
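The distance-to-probability step can be sketched as follows. This is a hedged illustration: the MLP is omitted, Euclidean distance is assumed, and the 2-D vectors are toy values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Sketch of the final prediction step above: distances between a query
# feature vector and each class prototype are negated and passed through
# softmax, so the nearest prototype gets the highest probability. The
# 2-D vectors below are illustrative assumptions.
def classify(query_vec, prototypes):
    """prototypes: (num_classes, C) array -> (predicted index, probabilities)."""
    d = np.linalg.norm(prototypes - query_vec, axis=1)
    p = softmax(-d)                                   # closer prototype -> higher probability
    return int(p.argmax()), p

prototypes = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
pred, probs = classify(np.array([9.0, 9.5]), prototypes)
```

Counting then reduces to tallying the predicted indices over all baits queried from an image.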
4. Composition of counting model target loss function
The objective function fitting the training data is composed of the localisation loss of the attention module and the classification loss of the located bait: L = Σ_t (λ1_t · L_loc,t + λ2_t · L_cls,t). The localisation loss L_loc,t is the KL divergence between the attention map generated for a bait in the query image and the corresponding Gaussian centred on its annotation point; the classification loss L_cls,t is the cross-entropy between the predicted class score and the corresponding label. Here a_t and G_t denote the attention mask generated from the query image and the Gaussian kernel generated from the annotation point, y_t is the one-hot label, s_t is the predicted class score at time step t, and λ1_t and λ2_t are the time-varying weights of the localisation and classification losses, respectively.
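The per-time-step objective above can be sketched numerically. This is a hedged illustration: the direction of the KL divergence and the eps guard against log(0) are assumptions, and all inputs are toy normalised distributions.

```python
import numpy as np

# Sketch of the per-time-step objective above: a localisation term (KL
# divergence between the annotation-centred Gaussian G_t and the attention
# map a_t) plus a classification term (cross-entropy between the one-hot
# label y_t and class scores s_t), mixed with the time-varying weights. The
# KL direction and the eps guard against log(0) are illustrative assumptions.
def step_loss(a_t, G_t, y_t, s_t, lam_loc, lam_cls, eps=1e-12):
    loc = np.sum(G_t * np.log((G_t + eps) / (a_t + eps)))  # KL(G_t || a_t)
    cls = -np.sum(y_t * np.log(s_t + eps))                 # cross-entropy
    return lam_loc * loc + lam_cls * cls

a_t = np.full(4, 0.25)
G_t = np.array([0.7, 0.1, 0.1, 0.1])
y_t = np.array([0.0, 1.0, 0.0])
s_t = np.array([0.2, 0.7, 0.1])
loss = step_loss(a_t, G_t, y_t, s_t, lam_loc=1.0, lam_cls=1.0)
```

When the attention map matches the Gaussian exactly, the localisation term vanishes and only the classification term remains.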
5. Processing mode of pixel point level annotation tag
The training labels of each image are a sequence of (pixel point, bait class) pairs organised in a specific order: the bait labels are sorted from top-left to bottom-right, i.e. in the dictionary order of their (y, x) coordinates (as shown in FIGS. 3 and 4), analogous to the order produced by flattening the image. In addition, the few-sample learning adopts episodic, i.e. multi-task, training: each unique subset of the bait classes defines a task, class indices (between 0 and C-1) are randomly assigned to the baits of each task, the number of query images per task is set to 1, and the number of support images is chosen at random between 3 and 5.
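The episodic sampling above can be sketched as follows. This is a hedged illustration: the class names, file names, and the choice to draw 3 to 5 support images per class (the patent leaves the granularity unspecified) are assumptions.

```python
import random

# Sketch of the episodic sampling above: each task draws a subset of bait
# classes, randomly assigns them indices 0..C-1, and pairs one query image
# with 3-5 support images per class. Class names, file names and the
# per-class support count are illustrative assumptions.
def sample_episode(class_pool, images_by_class, num_classes=3, rng=random):
    classes = rng.sample(class_pool, num_classes)
    shuffled = rng.sample(classes, num_classes)
    class_index = {c: i for i, c in enumerate(shuffled)}   # random 0..C-1 indices
    n_support = rng.randint(3, 5)                          # 3 to 5 support images
    support = [rng.choice(images_by_class[c]) for c in classes for _ in range(n_support)]
    query = rng.choice(images_by_class[rng.choice(classes)])
    return class_index, support, query

pool = ["corn", "soybean_meal", "wheat_bran", "fish", "shrimp", "snail"]
images = {c: [f"{c}_{i}.png" for i in range(10)] for c in pool}
class_index, support, query = sample_episode(pool, images)
```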
6. Efficient training setup for counting models
The network parameters affecting model training and model output are updated end to end with an Adam optimizer; the learning rate is halved whenever the validation error has not changed after 40 training periods (each training period containing 100 episodes), and the Gaussian kernel used for weighting features uses the default standard deviation (0.8). In addition, for effective training an adaptive weighting strategy is adopted: as the model is trained iteratively, the weight of the localisation loss in the objective function is gradually reduced so as to balance the localisation and classification losses. That is, early in training the emphasis is on teaching the model to distinguish bait from background; then, as the model becomes better at sequentially attending to and locating targets, the emphasis shifts to classifying the baits.
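The schedule described above can be sketched in plain Python. This is a hedged illustration: the multiplicative decay factor for the localisation weight and the plateau bookkeeping are assumptions, since the patent states only that the weight is "gradually reduced".

```python
# Sketch of the schedule above: halve the learning rate when the validation
# error has not improved for 40 training periods, and gradually decay the
# localisation-loss weight so the objective shifts toward classification.
# The decay factor and plateau bookkeeping are illustrative assumptions.
class TrainingSchedule:
    def __init__(self, lr=1e-3, patience=40, loc_weight=1.0, decay=0.99):
        self.lr = lr
        self.patience = patience
        self.loc_weight = loc_weight
        self.decay = decay
        self.best = float("inf")
        self.stale = 0

    def step(self, val_error):
        if val_error < self.best:                  # validation improved
            self.best, self.stale = val_error, 0
        else:
            self.stale += 1
            if self.stale >= self.patience:        # plateau: halve the lr
                self.lr *= 0.5
                self.stale = 0
        self.loc_weight *= self.decay              # adaptive loss weighting
        return self.lr, self.loc_weight

sched = TrainingSchedule(patience=2)
for err in [1.0, 1.0, 1.0]:
    lr, loc_w = sched.step(err)
```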
In summary, the sequential query counting method for few-sample multi-class baits provided by the invention flexibly collects, with a camera mounted on an automatic bait-casting boat, bait images whose class and density distributions match actual feeding practice, and labels them with point annotations alone to build the support and query image sets, reducing dependence on labelled data. A recurrent neural network with an attention mechanism then attends to each bait in the query image in the dictionary order of the label (y, x) coordinates; the computed attention map weights the extracted feature map into a query feature vector, and the query feature vectors are passed through a linear layer together with the prototype feature vectors extracted from the support images, converting the small-sample bait classification problem into a linear classification model. Finally, the distance between the high-dimensional feature vector of each queried bait and each class prototype is computed and converted into a probability to predict the class of each bait, realising counting and tracking of the remainders of each bait class in complex scenes. The method eases the challenge of counting many classes of bait from few samples, and is key to scientifically measuring bait remainders and the ingestion capacity coefficient of river crabs, enriching the crab ingestion rhythm and establishing modern cultivation feeding decisions.
In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "illustrative embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: many changes, modifications, substitutions and variations may be made to the embodiments without departing from the spirit and principles of the invention, the scope of which is defined by the claims and their equivalents.

Claims (8)

1. A sequential query counting method for few-sample multi-class baits, characterised by comprising the following steps:
step 1, acquiring a bait image data set with class and target distribution;
step 2, constructing a supporting image set and a query image set by adopting pixel point level annotation;
step 3, to suit few-sample multi-class bait counting, the weakly supervised sequential query counting model for multi-class baits first uses the provided pixel-point annotations to guide attention to the bait instances in the image, learning to output a segmentation mask for each bait instance and thereby distinguishing and locating bait against the background; a class-agnostic recurrent attention mechanism then attends to the baits in the image one by one in a fixed preset order and extracts their features, assigning each bait to one of the classes in the support set; specifically, the sequential query counting model for few-sample multi-class baits is realised by the following steps and structures:
step 3.1, inputting the support set images and the query image into a shared feature extractor IVoVNet-19 to extract feature maps;
step 3.2, for the query image, the extracted feature map of each image interacts in a recurrent system through a recurrent neural network and a sequential attention mechanism, generating a spatially weighted map for each bait in the image and then a query feature vector for each bait;
step 3.3, for the support images, generating a weight map for the bait targets of interest and the background class in each image using the recurrent neural network and Gaussian maps generated from the annotated pixel points, and then creating a prototype feature vector for each bait class;
step 3.4, outputting a class index for each bait present in the image by cross-correlating the query feature vectors with the set of prototype features extracted from the support images, thereby realizing localization, classification, and counting of complex multi-class baits;
step 4, constructing a counting model target loss function;
step 5, processing pixel point level annotation labels;
and 6, effectively training a counting model.
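The match-and-tally of step 3.4 can be illustrated with a minimal NumPy sketch. The prototypes and query vectors here are hypothetical toy values (the actual features come from the IVoVNet-19 extractor and the recurrent attention mechanism); only the cross-correlation, argmax, and per-class counting follow the claim.

```python
import numpy as np

# Hypothetical support-set prototypes (step 3.3): one unit vector per bait
# class in a 4-dim feature space, here simply orthogonal axes for clarity.
prototypes = np.eye(3, 4)            # classes 0..2, feature dim 4

# Hypothetical query feature vectors (step 3.2), one per bait attended to in
# sequence; perturbations of the class-0, class-0, and class-2 prototypes.
query_vecs = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [1.0, 0.0, 0.2, 0.0],
    [0.1, 0.0, 0.9, 0.1],
])

# Step 3.4: cross-correlate queries with prototypes, take the best-matching
# class index per bait, then tally per-class counts.
scores = query_vecs @ prototypes.T   # (num_baits, num_classes)
class_idx = scores.argmax(axis=1)
counts = np.bincount(class_idx, minlength=len(prototypes))
```

With these toy values, the first two baits match class 0 and the third matches class 2, giving per-class counts of [2, 0, 1].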
2. The sequential query counting method for few-sample multi-class baits according to claim 1, wherein the specific acquisition process in step 1 is as follows: a CMOS camera mounted below an automatic bait-casting boat with a GPS positioning function shoots vertically downward; on the one hand, underwater two-dimensional RGB image sequences reflecting the remaining amount of bait in the pond are collected continuously as the boat moves; on the other hand, underwater two-dimensional RGB image sequences reflecting the continuous change of the remaining bait amount are captured while the boat rests intermittently; the two acquisition modes complement each other and are flexible, overcoming the limitation of having to set up an observation station to observe bait changes; the average density and the classes of the baits in each image vary, so that a bait image data set with class and target distributions matching actual feeding conditions is constructed.
3. The sequential query counting method for few-sample multi-class baits according to claim 1, wherein the specific process of step 2 is as follows: pixel-point-level annotation is performed on the multi-class baits in the acquired bait images with an annotation tool, i.e., only one pixel of each bait in an image is marked; this setting is independent of image-level labels and bounding-box labels, and the support image set and the query image set required for training the model can be easily constructed.
4. The method according to claim 1, wherein in step 3.1 the counting model uses IVoVNet-19 as the basis of the feature extractor, wherein IVoVNet-19 is composed of four stages of improved one-shot aggregation (OSA) units augmented with identity mapping and channel attention; in addition, to further improve the recognition of baits at different scales, the output-layer features of the four stages of aggregation units in the IVoVNet-19 network are upsampled to one quarter of the original image size, fusing the shallow and deep features of the network and providing appropriate resolution and strong semantic features for multi-level prediction of targets at different scales; meanwhile, to make the counting model position-aware, the encoded position of each pixel, as one-hot encodings of its x and y coordinates, is concatenated with the features extracted from the image.
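The positional encoding in claim 4 can be sketched as follows: each pixel of a feature map receives one-hot encodings of its row (y) and column (x) indices, concatenated along the channel axis. The function name and the (H, W, C) layout are illustrative assumptions; the claim specifies only one-hot x/y coordinate encoding.

```python
import numpy as np

def add_onehot_coords(feat):
    """Concatenate one-hot x/y coordinate encodings to a feature map.

    feat: (H, W, C) array. Returns (H, W, C + H + W): each pixel gains a
    one-hot row index (H channels) and a one-hot column index (W channels),
    letting an otherwise position-agnostic model perceive pixel locations.
    """
    H, W, C = feat.shape
    y_onehot = np.eye(H)[:, None, :].repeat(W, axis=1)   # (H, W, H)
    x_onehot = np.eye(W)[None, :, :].repeat(H, axis=0)   # (H, W, W)
    return np.concatenate([feat, y_onehot, x_onehot], axis=-1)

f = np.zeros((4, 6, 8))          # toy 4x6 feature map with 8 channels
g = add_onehot_coords(f)         # (4, 6, 8 + 4 + 6) = (4, 6, 18)
```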
5. The method according to claim 1, wherein in step 3.4 the counting model converts the few-sample bait classification problem into a linear classification model by passing all query feature vectors and prototype feature vectors through multi-layer perceptron linear layers to construct the class prediction logits of the query targets; the distance between the high-dimensional feature vector of the queried bait and each class prototype is calculated and converted into a probability value with softmax, thereby predicting the class of the bait and realizing the counting of complex multi-class baits.
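The distance-to-probability conversion in claim 5 can be sketched with prototypes in a toy 2-dim feature space. Negative squared Euclidean distance is used as the logit here, which is an assumption; the claim specifies only "distance" followed by softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify_by_prototypes(queries, prototypes):
    """Turn distances to class prototypes into class probabilities.

    Uses negative squared Euclidean distance as the logit, so the nearest
    prototype receives the highest probability (illustrative choice).
    """
    d2 = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (Q, K)
    return softmax(-d2)

protos = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])   # 3 class prototypes
queries = np.array([[0.2, 0.1], [3.9, 0.3]])              # 2 queried baits
probs = classify_by_prototypes(queries, protos)
pred = probs.argmax(axis=1)   # predicted class per bait
```

Each row of `probs` sums to 1, and the first query is assigned to class 0, the second to class 1.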
6. The sequential query counting method for few-sample multi-class baits according to claim 1, wherein the objective function fitting the real data during the training of the counting model in step 4 is composed of the localization loss of the attention module and the classification loss of the localized baits; the localization loss is the sum of Kullback-Leibler (KL) divergences between the attention maps generated for the baits in the query image and the corresponding Gaussians centered on the annotated points, and the classification loss is the cross-entropy loss between the predicted class scores and the corresponding labels.
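The two-term objective in claim 6 can be sketched as below: a KL divergence between each attention map and a normalized Gaussian centered on the annotated point, plus cross-entropy on the class logits. The Gaussian width `sigma` and the function signatures are illustrative assumptions.

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """KL divergence between two discrete distributions over pixels."""
    p, q = p + eps, q + eps
    return float((p * np.log(p / q)).sum())

def gaussian_map(H, W, cy, cx, sigma=1.5):
    """Normalized 2-D Gaussian centered on the annotated point (cy, cx)."""
    yy, xx = np.mgrid[0:H, 0:W]
    g = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
    return g / g.sum()

def total_loss(attn_maps, points, class_logits, labels, w_loc=1.0):
    """Localization (KL) loss summed over baits + classification cross-entropy."""
    H, W = attn_maps[0].shape
    loc = sum(kl_div(a, gaussian_map(H, W, y, x))
              for a, (y, x) in zip(attn_maps, points))
    z = class_logits - class_logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    cls = -logp[np.arange(len(labels)), labels].mean()
    return w_loc * loc + cls

# A perfect attention map (equal to the target Gaussian) plus a confident,
# correct class prediction should give a near-zero total loss.
g = gaussian_map(8, 8, 3, 4)
loss = total_loss([g], [(3, 4)], np.array([[10.0, 0.0, 0.0]]), np.array([0]))
```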
7. The sequential query counting method for few-sample multi-class baits according to claim 1, wherein in step 5 the training label of each image in the training of the counting model is a set of pixel points organized in a specific order, the order of the bait points being the same as the order in which the baits appear when the image is flattened: the bait labels are ordered from top left to bottom right, i.e., in lexicographic order of (y, x) coordinates.
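The label ordering of claim 7 is simply a lexicographic sort on (y, x), which matches the visiting order of a row-major (flattened) image. The point values and the image width below are hypothetical.

```python
# Hypothetical point labels for the baits in one image, as (y, x) pixel coords.
points = [(12, 40), (3, 7), (3, 2), (25, 1)]

# Claim 7: order labels the way a flattened (row-major) image visits pixels,
# i.e. top-left to bottom-right = lexicographic order of (y, x).
ordered = sorted(points)    # Python tuples sort lexicographically: y first, then x

# Sanity check: for an assumed image width of 64, the row-major flat index
# y * W + x is monotonically increasing under this ordering.
flat_idx = [y * 64 + x for y, x in ordered]
```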
8. The sequential query counting method for few-sample multi-class baits according to claim 1, wherein the training process of the counting model in step 4 uses an Adam optimizer to update, end to end, the network parameters affecting model training and model output, with the learning rate changing gradually over the course of training; in addition, for effective training, an adaptive weighting strategy is adopted: as the model is trained iteratively, the weight of the localization loss in the objective function is gradually reduced to balance the localization and classification losses.
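The adaptive weighting in claim 8 can be sketched as a schedule that decays the localization-loss weight over training. The linear form, the endpoint values, and the function name are illustrative assumptions; the claim only states that the weight gradually decreases with iterative training.

```python
def localization_weight(step, total_steps, w_start=1.0, w_end=0.1):
    """Linearly anneal the localization-loss weight during training.

    Hypothetical schedule: starts at w_start, ends at w_end, so that the
    classification term gradually dominates the objective as training
    progresses (the claimed 'adaptive weighting strategy').
    """
    t = min(step, total_steps) / total_steps
    return w_start + (w_end - w_start) * t

w0 = localization_weight(0, 100)      # 1.0  at the start of training
w50 = localization_weight(50, 100)    # 0.55 at the midpoint
w100 = localization_weight(100, 100)  # 0.1  at the end
```

The per-step weight would multiply the localization term of the objective in claim 6 before adding the classification loss.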
CN202011382864.6A 2020-12-01 2020-12-01 Sequential query counting method for few-sample multi-class baits Active CN113160108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011382864.6A CN113160108B (en) 2020-12-01 2020-12-01 Sequential query counting method for few-sample multi-class baits

Publications (2)

Publication Number Publication Date
CN113160108A CN113160108A (en) 2021-07-23
CN113160108B true CN113160108B (en) 2024-03-19

Family

ID=76882463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011382864.6A Active CN113160108B (en) 2020-12-01 2020-12-01 Sequential query counting method for few-sample multi-class baits

Country Status (1)

Country Link
CN (1) CN113160108B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114359283B (en) * 2022-03-18 2022-07-05 华东交通大学 Defect detection method based on Transformer and electronic equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110597876A (en) * 2019-08-30 2019-12-20 南开大学 Approximate query method for predicting future query based on offline learning historical query
CN111539448A (en) * 2020-03-17 2020-08-14 广东省智能制造研究所 Meta learning-based less-sample image classification method

Also Published As

Publication number Publication date
CN113160108A (en) 2021-07-23

Similar Documents

Publication Publication Date Title
Deng et al. Unbiased mean teacher for cross-domain object detection
Veit et al. Learning from noisy large-scale datasets with minimal supervision
Oliveira et al. A review of deep learning algorithms for computer vision systems in livestock
Li et al. Recent advances of deep learning algorithms for aquacultural machine vision systems with emphasis on fish
CN111178197B (en) Mass R-CNN and Soft-NMS fusion based group-fed adherent pig example segmentation method
Bueno et al. Hierarchical object detection with deep reinforcement learning
CN113536922A (en) Video behavior identification method for weighting fusion of multiple image tasks
CN113221787A (en) Pedestrian multi-target tracking method based on multivariate difference fusion
Rahman et al. Recognition of local birds of Bangladesh using MobileNet and Inception-v3
Tang et al. Pest-YOLO: Deep image mining and multi-feature fusion for real-time agriculture pest detection
Schneider et al. Counting fish and dolphins in sonar images using deep learning
CN114548256A (en) Small sample rare bird identification method based on comparative learning
CN113160108B (en) Sequential query counting method for few-sample multi-class baits
Tan et al. Rapid fine-grained classification of butterflies based on FCM-KM and mask R-CNN fusion
Ukwuoma et al. Animal species detection and classification framework based on modified multi-scale attention mechanism and feature pyramid network
Yu et al. TasselLFANet: a novel lightweight multi-branch feature aggregation neural network for high-throughput image-based maize tassels detection and counting
CN115690570B (en) Fish shoal feeding intensity prediction method based on ST-GCN
CN110047088B (en) HT-29 image segmentation method based on improved teaching and learning optimization algorithm
CN106447691A (en) Weighted extreme learning machine video target tracking method based on weighted multi-example learning
CN116524255A (en) Wheat scab spore identification method based on Yolov5-ECA-ASFF
CN113192108B (en) Man-in-loop training method and related device for vision tracking model
Guo et al. An improved YOLO v4 used for grape detection in unstructured environment
CN115049692A (en) Intelligent regulation method and system for marine aquaculture illumination simulating natural ecology
Chen et al. Marine fish object detection based on YOLOv5 and attention mechanism
CN110851633B (en) Fine-grained image retrieval method capable of realizing simultaneous positioning and Hash

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant