CN115601562A - Fancy carp detection and identification method using multi-scale feature extraction - Google Patents

Fancy carp detection and identification method using multi-scale feature extraction

Info

Publication number
CN115601562A
CN115601562A (application CN202211368354.2A)
Authority
CN
China
Prior art keywords
feature
scale
feature extraction
module
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211368354.2A
Other languages
Chinese (zh)
Inventor
汤永华
石非凡
林森
张志鹏
孟妍君
刘兴通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang University of Technology
Original Assignee
Shenyang University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang University of Technology filed Critical Shenyang University of Technology
Priority to CN202211368354.2A priority Critical patent/CN115601562A/en
Publication of CN115601562A publication Critical patent/CN115601562A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for detecting and identifying koi (fancy carp) using multi-scale feature extraction, belongs to the technical field of deep-learning image detection, and particularly relates to a method for detecting and identifying the quality of koi. The method comprises a data collection process, a picture processing and feature extraction process, a multi-scale feature extraction process, a training process and the like, and aims to improve the identification accuracy of koi. When applied to underwater koi detection, the method achieves a high detection rate and high accuracy and overcomes a series of problems of the traditional manual screening of koi, such as the detection rate of human inspectors falling as working time increases. The format of the data set is strictly unified throughout, which helps relieve the current shortage of koi data sets. The method can effectively fuse feature-map information of different scales, further increasing the detection rate of underwater koi, and can detect underwater koi targets that are heavily occluded.

Description

Fancy carp detection and identification method using multi-scale feature extraction
Technical Field
The invention belongs to the technical field of deep learning image detection, and particularly relates to a method for detecting and identifying the quality of a fancy carp.
Background
In recent years, deep learning has become a popular research direction in the field of machine learning. Deep learning learns the intrinsic rules and representation levels of sample data, and the information obtained during learning greatly helps in interpreting data such as text, images and sound. Its ultimate goal is to give machines the ability to analyze and learn like humans and to recognize data such as text, images and sound. Deep learning is a complex machine-learning approach whose results in speech and image recognition far exceed those of earlier related techniques. Deep-learning-based target detection is defined as having a computer judge whether a given image or video contains a target object and, if so, give the position and category information of the target in the image or video. Target detection has many application fields, such as security and surveillance, enterprise business, and automatic driving.
China is poised to become the world's largest koi market, and the koi market has huge development potential: koi come in many varieties, are attractive in appearance, have high ornamental value, and sell steadily. Koi have a high reproductive capacity, adapt well to the environment, survive easily and give high yields, and judging from current market quotations and economic returns, breeding profits are good. According to current market price estimates, the profit from cultivating one mu (about 667 square meters) of koi is roughly 5,000-10,000 yuan, depending in particular on the cultivated variety; good varieties command high prices and profits. Market demand for koi keeps growing, but large-scale koi farming is still limited, so koi breeding is a sunrise industry with good prospects.
Target recognition technology for marine fish is now relatively mature worldwide, so public marine-fish data sets are abundant. However, target detection and identification for koi, and the corresponding data sets, are scarce; there is little experience with identification methods that detect and recognize koi of different sizes and varieties, including occluded and small targets, so detection results are poor, and the accuracy of batch detection of koi varieties is also problematic. Manual inspection likewise introduces many errors into the identification process, which greatly hinders efforts to improve the overall efficiency of the related industries. Therefore, in order to better distinguish the quality of koi, an efficient koi detection algorithm is very important.
Disclosure of Invention
The invention aims to provide a koi detection and identification method using multi-scale feature extraction, so as to improve the accuracy of koi identification.
The technical scheme is as follows:
a method for detecting and identifying fancy carps by using multi-scale feature extraction is characterized by comprising the following steps: the detection and identification method comprises the following steps:
the data collection process comprises the following steps: acquiring and labeling a plurality of pictures of koi, and dividing the pictures into a training set and a verification set in a certain proportion;
picture processing and feature extraction processes: the marked picture is transmitted and preprocessed through mosaic data enhancement and self-adaptive scaling, then the picture is transmitted into a main network of a main model, the feature extraction of the picture is carried out in the main network, and then the multi-scale feature extraction of the koi is carried out through a multi-scale feature extraction module (Nonlocal-pro), so that the method can detect the koi with different sizes;
the processing process of the main model comprises the following steps: the feature map processed by the backbone network is processed by a multi-scale feature extraction module (Nonlocal-pro) and a codec module (Transformer), and the multi-scale feature extraction module is used for extracting multi-scale features of koi, so that the method can detect koi with different scales; the codec module improves the performance of small target detection; the feature map processed by the backbone network is transmitted into a neck module, the feature map in the neck module is subjected to enhanced feature extraction in a bidirectional weighted feature pyramid connection mode, and meanwhile, a CA attention module is added at a detection head, so that the position information of the koi carp on the picture can be embedded into a channel, and the detection of the main model is finished;
the training process is entered next: the output gives bounding boxes and confidence levels through the 4 valid feature layers that are finally output from the CA attention module:
screening repeated boundary frames by adopting a non-maximum value inhibition method to obtain a prediction frame, comparing the prediction frame with a frame generated by a marking tool, calculating loss by adopting a GIoU loss function, and performing back propagation by utilizing the loss function so as to adjust the weight, wherein the GIoU loss formula is as follows:
$$GIoU = IoU - \frac{\left|C\right| - \left|A \cup B\right|}{\left|C\right|}, \qquad IoU = \frac{\left|A \cap B\right|}{\left|A \cup B\right|}, \qquad L_{GIoU} = 1 - GIoU$$
A: the labeled (ground-truth) box rectangle;
B: the predicted box rectangle;
C: the minimum bounding rectangle of the two boxes, i.e., the area of the smallest box that simultaneously contains the prediction box and the real box;
The above process is repeated so that the model gradually converges, and the parameters are continuously adjusted through tests on the verification set to improve generalization ability and precision;
Preferably, in the small-scale feature extraction step of the multi-scale feature extraction module (Nonlocal-pro), the original feature map is input into the small-scale feature extraction process; channel compression of the picture and feature extraction with the small-scale convolution kernel are first performed separately, and each of the three results is then flattened over all dimensions except the channel dimension, i.e., the width and height dimensions, so that each result has dimension THW × C, where T represents the number of pictures in the input batch, H the height, W the width and C the number of channels;
Next, the transposed channel-compression result is matrix-multiplied with the result of the first small-scale convolution-kernel feature extraction to obtain the relationship of every pixel in each frame to all pixels in the other frames; the dot product of (THW × C) and (C × THW) yields a result of dimension THW × THW, on which a softmax operation is performed so that the autocorrelation features fall in the range [0, 1];
This matrix is then dot-multiplied with the result of the second small-scale convolution-kernel feature extraction, i.e., (THW × THW) times (THW × C), giving a result of dimension THW × C; at this point the small-scale feature extraction process in the multi-scale feature extraction module (Nonlocal-pro) is complete. The medium-scale and large-scale processes are exactly the same as the small-scale process, differing only in the convolution kernel size used, so the module can extract features of different scales. After the three scale features are extracted, three feature maps of different scales are produced; they are stacked along the channel dimension to obtain a new feature map, which is then adjusted to the same number of channels as the original input feature map and finally added to the original feature map by matrix addition to produce the output. This completes the description of the whole multi-scale feature extraction module (Nonlocal-pro);
preferably, the incoming picture of the method is firstly subjected to Mosaic data enhancement to expand the data set so as to facilitate the subsequent model training process; the image enhanced by the Mosaic data is then transmitted to the above-mentioned multi-scale feature extraction module (non-local-pro) through a series of feature extraction processes, where the main model extracts the multi-scale features.
Preferably, the codec module (transform) of the method is a residual structure, which is divided into a stem part and a residual edge part, and the residual edge part is directly connected end to end without any processing or with a small amount of processing.
Preferably, the neck module of the method adopts a connection mode of a bidirectional weighted feature pyramid to perform feature fusion on four effective feature layers of a backbone network again to generate new four effective feature layers, and the new four effective feature layers are respectively sent to the detection head through a CA attention module; the connection mode of the bidirectional weighted feature pyramid is that two paths are provided, wherein one path is a bottom-up path and transmits position information of low-level features; the other path is a top-down path and transmits semantic information of high-level features; besides two paths, the structure also adds an edge on the input and output nodes of the same layer, and deletes the nodes with only one input edge; the specific input-output relationship is shown as follows:
$$P_3^{td} = \mathrm{Conv}\left(\frac{\omega_1 \cdot P_3^{in} + \omega_2 \cdot \mathrm{Resize}\left(P_4^{in}\right)}{\omega_1 + \omega_2 + \varepsilon}\right)$$
$$P_3^{out} = \mathrm{Conv}\left(\frac{\omega_1' \cdot P_3^{in} + \omega_2' \cdot P_3^{td} + \omega_3' \cdot \mathrm{Resize}\left(P_2^{out}\right)}{\omega_1' + \omega_2' + \omega_3' + \varepsilon}\right)$$
As shown in fig. 5, the above two formulas take feature layer 3 as an example: $P_3^{out}$ represents the output of the layer, $P_3^{td}$ the intermediate output of the layer, $P_3^{in}$ the input of the layer, $P_2^{out}$ the output of feature layer 2, and $P_4^{in}$ the input of feature layer 4; Resize represents upsampling or downsampling, $\omega$ represents the learned parameters used to distinguish the importance of different features during feature fusion, $\varepsilon$ is a small constant that keeps the normalization stable, and Conv represents the feature convolution.
Preferably, the CA attention module of the method adds the position information to the channel attention, and the attention mechanism decomposes the channel attention into two parallel one-dimensional feature encoding processes, aggregating the features in two directions, respectively; one direction obtains a remote dependence relationship, and the other direction obtains accurate position information; encoding the generated feature map to form a pair of direction-sensing and position-sensitive features; the CA attention module essentially comprises two steps: embedding coordinate information and generating coordinate attention; for coordinate information embedding, global pooling is divided into a pair of one-dimensional feature coding operations; for an input feature map X with dimension C × H × W, each channel is encoded by using pooling kernels with sizes of (H, 1) and (1, W) along horizontal direction coordinates and vertical direction coordinates, respectively, that is, the output of the C-th channel with height H and the C-th channel with width W, and the output formula is shown as the following two formulas:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$
The above formulas aggregate features along the two directions and return a pair of direction-aware attention features $z^h$ and $z^w$, which helps the network locate the target to be detected more accurately; for coordinate attention generation, the module concatenates the two feature maps and transforms them with a shared 1 × 1 convolution $F_1$:
$$f=\delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$
where $f \in \mathbb{R}^{C/r \times (H+W)}$ is an intermediate feature map of the spatial information in the horizontal and vertical directions, r represents the down-sampling ratio used to control the module size (set to 16, which best balances overall performance and computation), $\delta$ denotes a nonlinear activation function, and $[\cdot,\cdot]$ denotes a concatenation operation along the spatial dimension; f is then split along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two 1 × 1 convolutions $F_h$ and $F_w$ convert the feature maps $f^h$ and $f^w$ to the same number of channels as the input X, yielding the following equations:
$$g^h=\sigma\left(F_h\left(f^h\right)\right)$$
$$g^w=\sigma\left(F_w\left(f^w\right)\right)$$
Finally, $g^h$ and $g^w$ are expanded and used as the attention weights, and the final output of the CA attention module can be expressed as follows:
$$y_c(i, j)=x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
So far, the entire CA attention module has finished executing.
Beneficial effects: Mosaic image enhancement can expand the data set through splicing, rotation, mirroring and other operations, which facilitates the subsequent training process. Adding the multi-scale feature extraction module (Nonlocal-pro) makes it possible to detect underwater koi of different scales effectively, and adding the codec module (Transformer) makes it possible to detect koi that are small targets. Using the bidirectional weighted feature pyramid connection in the neck module effectively fuses feature information of different scales and at the same time greatly improves the detection of heavily occluded koi targets under water. The method of the invention detects koi with a deep-learning approach and solves a series of problems of the traditional manual screening of koi, such as the low detection rate of manual inspection.
The final effect is that when a camera acquires real-time koi images under water, the quality of each koi is shown on a display in real time: each detected koi is framed with a rectangular box, the koi quality grade is marked on each box, and the confidence is displayed as well.
Drawings
FIG. 1 is a data set acquisition flow diagram of the present invention;
FIG. 2 is a flow chart of the primary method of the present invention;
FIG. 3 is a schematic diagram of the structure of the main model components of the present invention;
FIG. 4 is a schematic diagram of the connection relationship between the multi-scale feature extraction modules (non-local-pro) in the main model of the present invention;
FIG. 5 is a schematic diagram of the connection relationship of the neck (BIFPN + CA attention) in the main model of the present invention;
FIG. 6 is a schematic diagram of the CA attention module of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings by way of specific embodiments.
A method for detecting and identifying koi using multi-scale feature extraction, characterized in that the detection and identification method comprises the following steps:
(1) And (3) data collection process: acquiring and labeling a plurality of pictures of the fancy carp, and dividing the pictures into a training set and a verification set in a certain proportion;
(2) Picture processing and feature extraction: the marked pictures are transmitted into a backbone network after mosaic data enhancement and adaptive scaling pretreatment, the features of the pictures are extracted in the backbone network, and then multi-scale features of the koi are extracted through a multi-scale feature extraction module (non-local-pro), so that the method can detect the koi with different sizes;
the main model processing process comprises the following steps: the feature map processed by the backbone network is processed by a multi-scale feature extraction module (Nonlocal-pro) and a codec module (Transformer), and the multi-scale feature extraction module is used for extracting multi-scale features of koi, so that the method can detect koi with different scales; the codec module improves the performance of small target detection; the feature map processed by the backbone network is transmitted into a neck module, the feature map in the neck module is subjected to enhanced feature extraction in a bidirectional weighted feature pyramid connection mode, and meanwhile, a CA attention module is added at a detection head, so that the position information of the koi on the picture can be embedded into a channel, and the detection of the main model is finished;
(3) Entering a training process: the output gives bounding boxes and confidence levels through the last 4 valid feature layers output from the CA attention module:
screening repeated boundary frames by adopting a non-maximum value inhibition method to obtain a prediction frame, comparing the prediction frame with a frame generated by a marking tool, calculating loss by adopting a GIoU loss function, and performing back propagation by utilizing the loss function so as to adjust the weight, wherein the GIoU loss formula is as follows:
$$GIoU = IoU - \frac{\left|C\right| - \left|A \cup B\right|}{\left|C\right|}, \qquad IoU = \frac{\left|A \cap B\right|}{\left|A \cup B\right|}, \qquad L_{GIoU} = 1 - GIoU$$
A: the labeled (ground-truth) box rectangle;
B: the predicted box rectangle;
C: the minimum bounding rectangle of the two boxes, i.e., the area of the smallest box that simultaneously contains the prediction box and the real box;
The above process is repeated so that the model gradually converges, and the parameters are continuously adjusted through tests on the verification set to improve generalization ability and precision, finally yielding the koi recognition model.
The method requires several preparation steps before use, namely the data set preparation process shown in figure 1. Because research on koi is still limited worldwide, no publicly available data set suitable for training could be found, so early picture acquisition, picture preprocessing and data set production are particularly important.
After a number of koi pictures have been collected, the pictures must first be preprocessed; because each collected picture contains several koi, cropping and similar operations are needed to improve the training effect and model accuracy.
After preprocessing is complete, each processed picture is annotated with a labeling tool: every koi is framed with a rectangular box and its variety name is marked. For reasons of storage format and compatibility, the data set is temporarily stored in VOC format.
Since the data set required by the invention is in TXT format, the preprocessed data set is converted into TXT format with a dedicated script, and at the same time the data set is divided into a training set and a validation set in a specific ratio, as sketched below. At this point the preliminary preparation work is finished.
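The invention does not give the conversion script itself; the following is a minimal illustrative sketch (an assumption, not the actual script of the invention) of a VOC-to-TXT conversion plus train/validation split, assuming Pascal VOC XML annotations, a YOLO-style TXT row format, an 8:2 split ratio, and hypothetical paths and class names.

```python
# Illustrative VOC-to-TXT conversion and train/validation split.
# Paths, class names and the 8:2 ratio are assumptions, not values from the patent.
import os, glob, random
import xml.etree.ElementTree as ET

CLASSES = ["koi_high_quality", "koi_low_quality"]   # hypothetical class names

def voc_to_txt(xml_path, txt_dir):
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    rows = []
    for obj in root.iter("object"):
        cls_id = CLASSES.index(obj.find("name").text)
        b = obj.find("bndbox")
        xmin, ymin = float(b.find("xmin").text), float(b.find("ymin").text)
        xmax, ymax = float(b.find("xmax").text), float(b.find("ymax").text)
        # normalized center-x, center-y, width, height (one TXT row per koi box)
        cx, cy = (xmin + xmax) / 2 / w, (ymin + ymax) / 2 / h
        bw, bh = (xmax - xmin) / w, (ymax - ymin) / h
        rows.append(f"{cls_id} {cx:.6f} {cy:.6f} {bw:.6f} {bh:.6f}")
    name = os.path.splitext(os.path.basename(xml_path))[0]
    with open(os.path.join(txt_dir, name + ".txt"), "w") as f:
        f.write("\n".join(rows))

xml_files = sorted(glob.glob("annotations/*.xml"))
os.makedirs("labels", exist_ok=True)
for path in xml_files:
    voc_to_txt(path, "labels")

random.seed(0)
random.shuffle(xml_files)
split = int(0.8 * len(xml_files))                    # assumed 8:2 train/validation split
train_set, val_set = xml_files[:split], xml_files[split:]
```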
Once the data set is ready, the pictures can be input:
as shown in fig. 2, at this time, the processed data set is sent to the input end of the method, because the sizes of all the pictures are different, and the method needs to generate the feature layer only when all the pictures have the same size, at this time, the pictures need to be scaled adaptively, that is, the input size required by the method is reduced, then the black stripe added to the shorter side becomes a square to meet the input specification of 640 pixels × 640 pixels, if the data is insufficient, a data enhancement method is needed, and a Mosaic (Mosaic) method is used to splice four pictures in a manner of random scaling, random cropping, and random arrangement to form one picture. Therefore, when the method receives one picture, the method means that the target of the original 4 pictures is received at the same time, and the data set is enriched. So far, the previous picture and processing work of the method is finished.
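For illustration only, the adaptive scaling (padding the shorter side so that a picture meets the 640 × 640 pixel specification) and the Mosaic splicing of four pictures described above could be sketched as follows; the 640-pixel size comes from the text, while the gray padding value and the random ranges are assumptions.

```python
# Sketch of letterbox-style adaptive scaling and Mosaic splicing (OpenCV/NumPy assumed).
import random
import cv2
import numpy as np

def letterbox(img, size=640, pad_value=114):
    h, w = img.shape[:2]
    r = size / max(h, w)                              # scale the longer side to `size`
    nh, nw = int(round(h * r)), int(round(w * r))
    resized = cv2.resize(img, (nw, nh))
    canvas = np.full((size, size, 3), pad_value, dtype=img.dtype)
    top, left = (size - nh) // 2, (size - nw) // 2
    canvas[top:top + nh, left:left + nw] = resized    # bars fill the shorter side
    return canvas

def mosaic(imgs, size=640):
    """Splice four pictures into one around a random center point."""
    assert len(imgs) == 4
    cx = random.randint(size // 4, 3 * size // 4)
    cy = random.randint(size // 4, 3 * size // 4)
    out = np.full((size, size, 3), 114, dtype=np.uint8)
    quads = [(0, 0, cx, cy), (cx, 0, size, cy),
             (0, cy, cx, size), (cx, cy, size, size)]
    for img, (x1, y1, x2, y2) in zip(imgs, quads):
        out[y1:y2, x1:x2] = cv2.resize(img, (x2 - x1, y2 - y1))  # scale each picture into its quadrant
        # (the box labels of each picture would be remapped to the new coordinates here)
    return out
```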
The processed picture then enters the main model part shown in fig. 2, which is also the core of the invention. The main model is divided into a Backbone network and a Neck, as shown in fig. 3.
The backbone network of the main model mainly comprises a multi-scale feature extraction module (Nonlocal-pro) and a codec module (Transformer). The Transformer is mainly used to improve the small-target detection performance of the method; since codec modules of this kind are already widely used, they are not described in detail here (a generic sketch is given below for orientation). The emphasis here is on the multi-scale feature extraction module of the invention, shown in fig. 4.
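Since the invention characterizes the codec module only as a residual structure with a stem and a residual edge, the following generic Transformer-encoder-style block is offered purely as an assumed illustration of such a structure, not as the actual module of the invention.

```python
# Generic residual Transformer-style block (an assumed illustration).
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(nn.Linear(channels, channels * 4),
                                 nn.GELU(),
                                 nn.Linear(channels * 4, channels))

    def forward(self, x):                              # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)               # (B, H*W, C) token sequence
        q = self.norm1(t)
        # stem + residual edge: the shortcut is added back with no extra processing
        t = t + self.attn(q, q, q)[0]
        t = t + self.mlp(self.norm2(t))
        return t.transpose(1, 2).reshape(b, c, h, w)
```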
As shown in fig. 4, when the original feature map enters the small-scale feature extraction process, channel compression of the picture and feature extraction with the small-scale convolution kernel are performed separately, and each of the three results is then flattened over all dimensions except the channel dimension, i.e., the width and height dimensions, so that the dimension becomes THW × C (T is the number of pictures in the input batch, H the height, W the width and C the number of channels); at this point the dimensions of all three results are THW × C.
Since the pixels are essentially matrices, one of the results is then transposed (here the channel-compression result is chosen), so that the dimension of that matrix becomes C × THW while the other two results remain THW × C.
Next, the transposed channel-compression result is matrix-multiplied with the result of the first small-scale convolution-kernel feature extraction, which computes the autocorrelation within the features, i.e., the relationship of every pixel in each frame to all pixels in the other frames; the dot product of (THW × C) and (C × THW) yields a result of dimension THW × THW, on which a softmax operation is performed so that the autocorrelation features fall in the range [0, 1].
This matrix is then dot-multiplied with the result of the second small-scale convolution-kernel feature extraction, i.e., (THW × THW) times (THW × C), giving a result of dimension THW × C.
So far, the small-scale feature extraction process in the multi-scale feature extraction module (non-local-pro) is completely completed.
The process of the medium-scale and large-scale feature extraction modules is completely the same as that of the small-scale feature extraction module, and the only difference is that the scale feature extraction convolution kernels used by the medium-scale and large-scale feature extraction modules are different.
The dimensions of the three results obtained by the small-, medium- and large-scale feature extraction processes are all THW × C; the three results are then merged along the channel dimension, fusing information of different scales.
After the scale information has been fused, the number of channels is readjusted to match that of the original input feature map, and the fully processed feature map is added to the original feature map by matrix addition; the result is the output of the whole multi-scale feature extraction module (Nonlocal-pro).
This completes the description of the working principle of the multi-scale feature extraction module (Nonlocal-pro) within the Backbone network of the invention.
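For orientation, a minimal PyTorch sketch of a module built along the lines described above (per scale: one channel-compression convolution, two convolutions with that scale's kernel size, a THW × THW attention map, channel-wise stacking of the scale outputs, channel restoration and a residual addition) might look as follows; the kernel sizes (1, 3, 5) and the channel-reduction factor are assumptions rather than values taken from the invention.

```python
# Assumed sketch of a multi-scale non-local (Nonlocal-pro style) module.
import torch
import torch.nn as nn

class MultiScaleNonlocal(nn.Module):
    def __init__(self, channels, kernel_sizes=(1, 3, 5), reduction=2):
        super().__init__()
        c_inter = channels // reduction
        self.branches = nn.ModuleList()
        for k in kernel_sizes:                         # small / medium / large scale
            self.branches.append(nn.ModuleDict({
                "compress": nn.Conv2d(channels, c_inter, 1),               # channel compression
                "conv1": nn.Conv2d(channels, c_inter, k, padding=k // 2),  # 1st scale convolution
                "conv2": nn.Conv2d(channels, c_inter, k, padding=k // 2),  # 2nd scale convolution
            }))
        self.fuse = nn.Conv2d(c_inter * len(kernel_sizes), channels, 1)    # restore channel count

    def forward(self, x):                              # x: (T, C, H, W)
        t, c, h, w = x.shape
        flat = lambda y: y.permute(0, 2, 3, 1).reshape(t * h * w, -1)      # -> (THW, C')
        outs = []
        for br in self.branches:
            comp, q, v = flat(br["compress"](x)), flat(br["conv1"](x)), flat(br["conv2"](x))
            attn = torch.softmax(q @ comp.t(), dim=-1)                     # (THW, THW) pixel relations
            out = (attn @ v).reshape(t, h, w, -1).permute(0, 3, 1, 2)      # back to (T, C', H, W)
            outs.append(out)
        fused = self.fuse(torch.cat(outs, dim=1))      # stack the three scales on the channel dimension
        return fused + x                               # matrix addition with the original feature map
```

Note that the THW × THW attention matrix grows quadratically with the feature-map size, so a sketch like this is only practical on the downsampled feature maps of the backbone.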
After the picture has passed through the Backbone network of the main model, four effective feature layers are generated and passed into the Neck for further enhanced feature extraction.
The neck module adopts a connection mode of a bidirectional weighted feature pyramid to perform feature fusion on the four effective feature layers of the main network again to generate new four effective feature layers, and the new four effective feature layers are respectively sent to the detection head through the CA attention module; the connection mode of the bidirectional weighted feature pyramid is that two paths are provided, wherein one path is a bottom-up path and transmits position information of low-level features; the other path is a top-down path and transmits semantic information of high-level features; besides two paths, the structure also adds an edge on the input and output nodes of the same layer, and deletes the nodes with only one input edge; the specific input-output relationship is shown as follows:
$$P_3^{td} = \mathrm{Conv}\left(\frac{\omega_1 \cdot P_3^{in} + \omega_2 \cdot \mathrm{Resize}\left(P_4^{in}\right)}{\omega_1 + \omega_2 + \varepsilon}\right)$$
$$P_3^{out} = \mathrm{Conv}\left(\frac{\omega_1' \cdot P_3^{in} + \omega_2' \cdot P_3^{td} + \omega_3' \cdot \mathrm{Resize}\left(P_2^{out}\right)}{\omega_1' + \omega_2' + \omega_3' + \varepsilon}\right)$$
The above two formulas take feature layer 3 as an example: $P_3^{out}$ represents the output of the layer, $P_3^{td}$ the intermediate output of the layer, $P_3^{in}$ the input of the layer, $P_2^{out}$ the output of feature layer 2, and $P_4^{in}$ the input of feature layer 4; Resize represents upsampling or downsampling, $\omega$ represents the learned parameters used to distinguish the importance of different features during feature fusion, $\varepsilon$ is a small constant that keeps the normalization stable, and Conv represents the feature convolution.
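A minimal sketch of one weighted-fusion node of this connection, following the feature-layer-3 formulas above, is given below; the fast-normalization constant ε and the plain 3 × 3 convolution are assumptions.

```python
# Sketch of a bidirectional weighted feature fusion node (BiFPN-style).
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fuses several same-shaped feature maps with learned non-negative weights."""
    def __init__(self, num_inputs, channels, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feats):
        w = F.relu(self.w)                       # keep the learned importance weights non-negative
        w = w / (w.sum() + self.eps)             # normalize the weights
        fused = sum(wi * fi for wi, fi in zip(w, feats))
        return self.conv(fused)

# Usage corresponding to the two feature-layer-3 formulas (resize = up/downsampling):
# p3_td  = WeightedFusion(2, C)([p3_in, resize(p4_in)])          # intermediate output
# p3_out = WeightedFusion(3, C)([p3_in, p3_td, resize(p2_out)])  # final output of the layer
```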
In the neck part, the bidirectional weighted feature pyramid connection shown in fig. 5 is adopted, so that the method can complete the task of identifying and detecting koi in various complex or occluded environments, and coordinate attention (CA) is added at the detection head. As shown in fig. 6, CA can encode channel relationships and long-range dependencies through precise position information, and consists of the two steps of coordinate information embedding and coordinate attention generation shown in fig. 6: the global pooling operation is converted into a pair of one-dimensional feature encoding operations by two separate global pooling operations. Specifically, given an input X, each channel is encoded along the horizontal and vertical coordinates using a pooling kernel of size (H, 1) or (1, W); the resulting features are concatenated (the concat operation shown in fig. 6), transformed by a 1 × 1 convolution followed by a normalization function and a nonlinear activation function, then converted by two 1 × 1 convolution kernels into tensors with the same number of channels as the input, and finally activated with a sigmoid function.
The CA attention module of the method adds position information to the channel attention; the attention mechanism decomposes the channel attention into two parallel one-dimensional feature encoding processes that aggregate features along two directions: one direction captures long-range dependencies while the other retains precise position information; the generated feature maps are encoded into a pair of direction-aware and position-sensitive features. The CA attention module comprises two steps: coordinate information embedding and coordinate attention generation. For coordinate information embedding, global pooling is decomposed into a pair of one-dimensional feature encoding operations; for the input feature map X with dimensions C × H × W, each channel is first encoded along the horizontal and vertical coordinates using pooling kernels of size (H, 1) and (1, W) respectively, i.e., the output of the c-th channel at height h and at width w is given by the following two formulas:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$
The above formulas aggregate features along the two directions and return a pair of direction-aware attention features $z^h$ and $z^w$, which helps the network locate the target to be detected more accurately; for coordinate attention generation, the module concatenates the two feature maps and transforms them with a shared 1 × 1 convolution $F_1$:
$$f=\delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$
where $f \in \mathbb{R}^{C/r \times (H+W)}$ is an intermediate feature map of the spatial information in the horizontal and vertical directions, r represents the down-sampling ratio used to control the module size (set to 16, which best balances overall performance and computation), $\delta$ denotes a nonlinear activation function, and $[\cdot,\cdot]$ denotes a concatenation operation along the spatial dimension; f is then split along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two 1 × 1 convolutions $F_h$ and $F_w$ convert the feature maps $f^h$ and $f^w$ to the same number of channels as the input X, yielding the following equations:
$$g^h=\sigma\left(F_h\left(f^h\right)\right)$$
$$g^w=\sigma\left(F_w\left(f^w\right)\right)$$
Finally, $g^h$ and $g^w$ are expanded and used as the attention weights, and the final output of the CA attention module can be expressed as follows:
$$y_c(i, j)=x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
The execution of the entire CA attention module is then complete.
The multi-scale extraction process of the present invention has been completed.
Next, the picture will enter the final output stage:
in the feature extraction process, new four effective feature layers are generated, and the output end provides a boundary box (initial prediction of the model, multiple boundary boxes exist in one type) and confidence (confidence degree indicating that an object really exists in the boundary box and confidence degree indicating whether the boundary box includes all features of the whole object) according to the newly generated four feature layers. Then, as shown in fig. 2, a non-maximum suppression method is adopted to screen out repeated bounding boxes, and the non-maximum suppression method includes the steps of firstly sorting according to confidence scores, selecting the bounding box with the highest confidence to add to the final output list, deleting the bounding box from the bounding box list, calculating the areas of all the bounding boxes, and calculating the intersection-to-union ratio IoU (which is the ratio of the intersection area of two boxes to the union area of two boxes and indicates the intersection degree of the two boxes) of the bounding box with the highest confidence and other candidate boxes. And deleting the bounding boxes with the IoU larger than a certain value, and repeating the process until the bounding box list is empty. The remaining bounding box is the predicted box, which is compared with the previously manually labeled box, and the loss is calculated using the GIoU loss function (the loss function reflects the difference between the predicted box and the actual box, and the weight can be continuously adjusted by the loss function to reduce the difference, thereby improving the accuracy) as shown in fig. 2. And then the loss function is used for back propagation, so that the weight of the method is adjusted.
And finally, judging whether the model is converged or not according to the rule of weight adjustment, if so, outputting the final model, and if not, repeating all the processes.
Thus, the overall method of the invention is completed.
The method can directly give the classification probability and position of each koi, recognizes quickly, and can process large batches of koi pictures, which promotes the development of related industries and contributes to koi screening; it greatly reduces the manpower and material resources consumed by traditional manual screening. With a sufficiently large data set, the precision after training can be very high, and the model can be deployed on suitable hardware with a camera for real-time detection.

Claims (6)

1. A method for detecting and identifying fancy carps by using multi-scale feature extraction is characterized by comprising the following steps: the detection and identification method comprises the following steps:
(1) The data collection process comprises the following steps: acquiring and labeling a plurality of pictures of the fancy carp, and dividing the pictures into a training set and a verification set in a certain proportion;
(2) Picture processing and feature extraction: the labeled pictures are preprocessed by mosaic data enhancement and adaptive scaling and then fed into the backbone network of the main model, where the picture features are extracted; multi-scale features of the koi are then extracted by the multi-scale feature extraction module, so that the method can detect koi of different sizes;
the main model processing process comprises the following steps: the feature map processed by the backbone network is processed by a multi-scale feature extraction module and a codec module, and the multi-scale feature extraction module is used for extracting the multi-scale features of the koi, so that the method can detect the koi with different scales; the codec module improves the performance of small target detection; the feature map processed by the backbone network is transmitted into a neck module, the feature map in the neck module is subjected to enhanced feature extraction in a bidirectional weighted feature pyramid connection mode, and meanwhile, a CA attention module is added at a detection head, so that the position information of the koi carp on the picture can be embedded into a channel, and the detection of the main model is finished;
(3) Entering a training process: the output gives bounding boxes and confidence levels through the last 4 valid feature layers output from the CA attention module:
screening repeated boundary frames by adopting a non-maximum value inhibition method to obtain a prediction frame, comparing the prediction frame with a frame generated by a marking tool, calculating loss by adopting a GIoU loss function, and performing back propagation by utilizing the loss function so as to adjust the weight, wherein the GIoU loss formula is as follows:
$$GIoU = IoU - \frac{\left|C\right| - \left|A \cup B\right|}{\left|C\right|}, \qquad IoU = \frac{\left|A \cap B\right|}{\left|A \cup B\right|}, \qquad L_{GIoU} = 1 - GIoU$$
A: the labeled (ground-truth) box rectangle;
B: the predicted box rectangle;
C: the minimum bounding rectangle of the two boxes, i.e., the area of the smallest box that simultaneously contains the prediction box and the real box;
and the above process is repeated so that the model gradually converges, and the parameters are continuously adjusted through tests on the verification set to improve generalization ability and precision, finally obtaining the koi recognition result.
2. The method for detecting and identifying koi by using multi-scale feature extraction as claimed in claim 1, wherein: in the small-scale feature extraction step of the multi-scale feature extraction module, the original feature map is input into the small-scale feature extraction process; channel compression of the picture is performed once and feature extraction with the small-scale convolution kernel is performed twice, separately, and each of the three results is then flattened over all dimensions except the channel dimension, i.e., the width and height dimensions, so that each result has dimension THW × C, where T represents the number of pictures in the input batch, H the height, W the width and C the number of channels;
then the transposed channel-compression result is matrix-multiplied with the result of the first small-scale convolution-kernel feature extraction to obtain the relationship of every pixel in each frame to all pixels in the other frames; the dot product of (THW × C) and (C × THW) yields a result of dimension THW × THW, on which a softmax operation is performed so that the autocorrelation features fall in the range [0, 1];
this matrix is then dot-multiplied with the result of the second small-scale convolution-kernel feature extraction, i.e., (THW × THW) times (THW × C), giving a result of dimension THW × C; at this point the small-scale feature extraction process in the multi-scale feature extraction module is complete; the medium-scale and large-scale processes are exactly the same as the small-scale process, differing only in the convolution kernel size used, so the module can extract features of different scales; after the three scale features are extracted, three feature maps of different scales are produced, which are stacked along the channel dimension to obtain a new feature map; this feature map is then adjusted to the same number of channels as the original input feature map and finally added to the original feature map by matrix addition to produce the output, which completes the whole multi-scale feature extraction module.
3. The method for detecting and identifying koi carp by using multi-scale feature extraction as claimed in claim 1, wherein: the method comprises the steps that firstly, the incoming pictures are subjected to Mosaic data enhancement, picture self-adaptive scaling, rotation, splicing and the like to expand a data set, so that the subsequent model training process is facilitated; the image enhanced by the Mosaic data is transmitted to the multi-scale feature extraction module through a series of feature extraction processes, wherein the multi-scale features are extracted by the main model.
4. The method for detecting and identifying koi carp by using multi-scale feature extraction as claimed in claim 1, wherein: the codec module of the method is a residual structure which is divided into a main part and a residual side part, and the residual side part is directly connected end to end without any treatment or with a small amount of treatment.
5. The method for detecting and identifying koi carp by using multi-scale feature extraction as claimed in claim 1, wherein: the neck module of the method adopts a connection mode of a bidirectional weighted feature pyramid to perform feature fusion on four effective feature layers of a backbone network again to generate new four effective feature layers, and the four effective feature layers are respectively sent to a detection head through a CA attention module; the connection mode of the bidirectional weighted feature pyramid is that two paths are provided, wherein one path is a bottom-up path and transmits position information of low-level features; the other path is a top-down path and transmits semantic information of high-level features; besides two paths, the structure also adds an edge on the input and output nodes of the same layer, and deletes the nodes with only one input edge; the specific input-output relationship is shown as follows:
$$P_3^{td} = \mathrm{Conv}\left(\frac{\omega_1 \cdot P_3^{in} + \omega_2 \cdot \mathrm{Resize}\left(P_4^{in}\right)}{\omega_1 + \omega_2 + \varepsilon}\right)$$
$$P_3^{out} = \mathrm{Conv}\left(\frac{\omega_1' \cdot P_3^{in} + \omega_2' \cdot P_3^{td} + \omega_3' \cdot \mathrm{Resize}\left(P_2^{out}\right)}{\omega_1' + \omega_2' + \omega_3' + \varepsilon}\right)$$
The above two formulas take feature layer 3 as an example: $P_3^{out}$ represents the output of the layer, $P_3^{td}$ the intermediate output of the layer, $P_3^{in}$ the input of the layer, $P_2^{out}$ the output of feature layer 2, and $P_4^{in}$ the input of feature layer 4; Resize represents upsampling or downsampling, $\omega$ represents the learned parameters used to distinguish the importance of different features during feature fusion, $\varepsilon$ is a small constant that keeps the normalization stable, and Conv represents the feature convolution.
6. The method for detecting and identifying koi by using multi-scale feature extraction as claimed in claim 1, wherein: the CA attention module of the method adds position information to the channel attention; the attention mechanism decomposes the channel attention into two parallel one-dimensional feature encoding processes that aggregate features along two directions: one direction captures long-range dependencies while the other retains precise position information; the generated feature maps are encoded into a pair of direction-aware and position-sensitive features; the CA attention module comprises two steps: coordinate information embedding and coordinate attention generation; for coordinate information embedding, global pooling is decomposed into a pair of one-dimensional feature encoding operations; for an input feature map X with dimensions C × H × W, each channel is encoded along the horizontal and vertical coordinates using pooling kernels of size (H, 1) and (1, W) respectively, i.e., the output of the c-th channel at height h and at width w is given by the following two formulas:
$$z_c^h(h) = \frac{1}{W}\sum_{0 \le i < W} x_c(h, i)$$
$$z_c^w(w) = \frac{1}{H}\sum_{0 \le j < H} x_c(j, w)$$
the above formulas aggregate features along the two directions and return a pair of direction-aware attention features $z^h$ and $z^w$, which helps the network locate the target to be detected more accurately; for coordinate attention generation, the module concatenates the two feature maps and transforms them with a shared 1 × 1 convolution $F_1$:
$$f=\delta\left(F_1\left(\left[z^h, z^w\right]\right)\right)$$
where $f \in \mathbb{R}^{C/r \times (H+W)}$ is an intermediate feature map of the spatial information in the horizontal and vertical directions, r represents the down-sampling ratio used to control the module size (set to 16, which best balances overall performance and computation), $\delta$ denotes a nonlinear activation function, and $[\cdot,\cdot]$ denotes a concatenation operation along the spatial dimension; f is then split along the spatial dimension into two separate tensors $f^h \in \mathbb{R}^{C/r \times H}$ and $f^w \in \mathbb{R}^{C/r \times W}$, and two 1 × 1 convolutions $F_h$ and $F_w$ convert the feature maps $f^h$ and $f^w$ to the same number of channels as the input X, yielding the following equations:
$$g^h=\sigma\left(F_h\left(f^h\right)\right)$$
$$g^w=\sigma\left(F_w\left(f^w\right)\right)$$
finally, $g^h$ and $g^w$ are expanded and used as the attention weights, and the final output of the CA attention module can be expressed as follows:
$$y_c(i, j)=x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$
so far, the entire CA attention module has finished executing.
CN202211368354.2A 2022-11-03 2022-11-03 Fancy carp detection and identification method using multi-scale feature extraction Pending CN115601562A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211368354.2A CN115601562A (en) 2022-11-03 2022-11-03 Fancy carp detection and identification method using multi-scale feature extraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211368354.2A CN115601562A (en) 2022-11-03 2022-11-03 Fancy carp detection and identification method using multi-scale feature extraction

Publications (1)

Publication Number Publication Date
CN115601562A true CN115601562A (en) 2023-01-13

Family

ID=84851081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211368354.2A Pending CN115601562A (en) 2022-11-03 2022-11-03 Fancy carp detection and identification method using multi-scale feature extraction

Country Status (1)

Country Link
CN (1) CN115601562A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880574A (en) * 2023-03-02 2023-03-31 吉林大学 Underwater optical image lightweight target identification method, equipment and medium
CN116052064A (en) * 2023-04-03 2023-05-02 北京市农林科学院智能装备技术研究中心 Method and device for identifying feeding strength of fish shoal, electronic equipment and bait casting machine


Similar Documents

Publication Publication Date Title
CN107358257B (en) Under a kind of big data scene can incremental learning image classification training method
CN108805070A (en) A kind of deep learning pedestrian detection method based on built-in terminal
CN115601562A (en) Fancy carp detection and identification method using multi-scale feature extraction
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN113420643B (en) Lightweight underwater target detection method based on depth separable cavity convolution
CN112529090B (en) Small target detection method based on improved YOLOv3
CN113609896A (en) Object-level remote sensing change detection method and system based on dual-correlation attention
CN111652273B (en) Deep learning-based RGB-D image classification method
CN112784756B (en) Human body identification tracking method
CN113343937A (en) Lip language identification method based on deep convolution and attention mechanism
CN113112416B (en) Semantic-guided face image restoration method
CN113449691A (en) Human shape recognition system and method based on non-local attention mechanism
CN113837366A (en) Multi-style font generation method
CN115240119A (en) Pedestrian small target detection method in video monitoring based on deep learning
CN116912708A (en) Remote sensing image building extraction method based on deep learning
CN110851627B (en) Method for describing sun black subgroup in full-sun image
CN116524189A (en) High-resolution remote sensing image semantic segmentation method based on coding and decoding indexing edge characterization
CN115410078A (en) Low-quality underwater image fish target detection method
CN112037239B (en) Text guidance image segmentation method based on multi-level explicit relation selection
CN110807369B (en) Short video content intelligent classification method based on deep learning and attention mechanism
CN117079125A (en) Kiwi fruit pollination flower identification method based on improved YOLOv5
CN114743023B (en) Wheat spider image detection method based on RetinaNet model
CN116403133A (en) Improved vehicle detection algorithm based on YOLO v7
CN114926691A (en) Insect pest intelligent identification method and system based on convolutional neural network
CN115272846A (en) Improved Orientdrcnn-based rotating target detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination