CN116246116A - Target detection method for enhanced multi-scale feature extraction, multiplexing and fusion

Target detection method for enhanced multi-scale feature extraction, multiplexing and fusion

Info

Publication number
CN116246116A
Authority
CN
China
Prior art keywords
feature
target
image
fusion
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310286881.7A
Other languages
Chinese (zh)
Inventor
张伟
袁甲
张浩
任柯宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202310286881.7A priority Critical patent/CN116246116A/en
Publication of CN116246116A publication Critical patent/CN116246116A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00: Computing arrangements based on biological models
                    • G06N 3/02: Neural networks
                        • G06N 3/08: Learning methods
                            • G06N 3/084: Backpropagation, e.g. using gradient descent
            • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 10/00: Arrangements for image or video recognition or understanding
                    • G06V 10/20: Image preprocessing
                        • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
                    • G06V 10/40: Extraction of image or video features
                        • G06V 10/42: Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
                    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V 10/764: using classification, e.g. of video objects
                        • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                            • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
                            • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
                            • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                                • G06V 10/806: Fusion of extracted features
                        • G06V 10/82: using neural networks
                • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
                    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method with enhanced multi-scale feature extraction, multiplexing and fusion, and relates to the technical field of artificial intelligence. The method comprises the following steps: in response to a request for identifying a target image, inputting the target image into a target feature extraction network of a target detection model and outputting N first feature images, wherein the target detection model at least comprises a plurality of transmission channels, the target feature extraction network, a feature fusion network and a target prediction network, and the transmission channels are used for transmitting image features of the target image; inputting the N first feature images into the feature fusion network and outputting M second feature images; and inputting the M second feature images into the target prediction network and outputting the recognition result of the target image. The method solves the technical problems of low target recognition efficiency in the target detection process caused by complex model structures, high calculation cost and a lack of focus on salient features in existing target detection algorithms.

Description

Target detection method for enhanced multi-scale feature extraction, multiplexing and fusion
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a target detection method for enhancing multi-scale feature extraction, multiplexing and fusion.
Background
At present, target detection technology is mainly divided into traditional target detection algorithms, two-stage target detection algorithms, single-stage target detection algorithms, and keypoint-based target detection algorithms. Before 2012, research focused mainly on traditional target detection algorithms; this stage is represented by Viola-Jones (a target detection method that runs in real time and reports a target detection rate), the HOG (histogram of oriented gradients) detector, and the DPM (deformable part model) detector. These algorithms first traverse the sample data with a sliding-window technique to generate candidate boxes; then extract features from the sample data inside the candidate boxes with a feature extraction component; and finally classify them with a classifier. However, because such algorithms are not targeted at the important feature information of the object, they easily suffer from problems such as window redundancy and high algorithmic complexity.
Two-stage target detection algorithms are based on candidate regions; they offer high detection accuracy at the cost of lower real-time performance. Such an algorithm first generates candidate regions for the sample data, then extracts features from the candidate regions with a convolutional neural network, and finally performs classification. At present this category mainly includes RCNN (the algorithm that first applied deep learning to target detection), SPPNet (spatial pyramid pooling network), Fast RCNN, HyperNet, FPN, Mask RCNN, TridentNet and the like, which are different types of two-stage target detection algorithms in the related art.
Single-stage target detection algorithms integrate target feature extraction, target classification and candidate-box regression into a single process, thereby realizing an end-to-end target detection task. At present such algorithms mainly comprise the YOLO series, the SSD series, the MobileNet series, the ShuffleNet series, RetinaNet, the EfficientDet series, Swin Transformer and the like, which are different single-stage target detection algorithms in the related art. Single-stage target detection algorithms are generally superior to two-stage algorithms in real-time performance while retaining high detection accuracy, so they are widely applied in many settings.
Keypoint-based target detection algorithms essentially replace the candidate-box generation process with the detection and matching of target keypoints, thereby eliminating problems such as sample imbalance caused by candidate boxes. Such an algorithm first treats the search for the target center point as target keypoint estimation, and then uses a keypoint regression strategy to adjust attributes of the target such as position, angle and pose. At present, keypoint-based target detection algorithms mainly include the CornerNet series, CenterNet and ExtremeNet. However, these models are complex, and it is difficult for them to meet requirements for light weight and real-time performance.
At present, the main difficulties preventing further improvement of deep-learning-based target detection algorithms are complex model structures, a growing number of hyperparameters, complicated optimization processes, imbalanced sample data distributions, and the like.
Backbone feature extraction networks at the present stage are of many kinds and have complex structures. Expanding the depth and width of a backbone feature extraction network easily increases the computational load of the algorithm, makes the model difficult to deploy, and lowers the efficiency of identifying and classifying targets in images. For example, after an image is acquired by an image acquisition device in a financial institution, the targets in it need to be identified; however, the complex structure of existing target detection algorithms leads to low image recognition efficiency, serious waste of system resources, and difficulty for the financial institution in quickly carrying out its other work.
In addition, the multi-scale feature fusion used in the related art easily allows non-salient features to circulate through the neck feature fusion network, so that the model repeatedly reuses gradient information and consumes computing resources unnecessarily.
In addition, existing multi-scale feature fusion strategies stack features with large differences, or weight them with simple averages, so that salient and non-salient features cannot be effectively distinguished and large differences remain between the input features and the output features.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the invention provides a target detection method for enhancing multi-scale feature extraction, multiplexing and fusion, which aims to solve the technical problems of low target recognition efficiency in the target detection process caused by complex model structure, high calculation cost and no pertinence to salient features in a target detection algorithm.
According to an aspect of the embodiment of the present invention, there is provided a target detection method for enhancing multi-scale feature extraction, multiplexing and fusion, including: in response to a request for identifying a target image, inputting the target image into a target feature extraction network of a target detection model, and outputting N first feature images, wherein the target detection model at least comprises: the target detection model comprises a plurality of transmission channels, a target feature extraction network, a feature fusion network and a target prediction network, wherein the transmission channels are used for transmitting image features of the target image, and N is an integer greater than 1; inputting the N first feature images into the feature fusion network, and outputting M second feature images, wherein the feature fusion network is used for carrying out multi-scale feature fusion on the N first feature images, and M is an integer greater than 1; and inputting the M second characteristic images into the target prediction network, and outputting the identification result of the target image, wherein the identification result comprises a classification result of the target in the target image.
Further, the target feature extraction network at least includes a plurality of first feature extraction modules and a plurality of dimension adjustment modules, wherein each dimension adjustment module at least includes a convolution layer, a batch normalization layer and an activation function layer. Inputting the target image into the target feature extraction network of the target detection model and outputting the N first feature images includes: performing dimension adjustment on the target image through each dimension adjustment module, and performing feature extraction on the dimension-adjusted target image through each first feature extraction module to obtain M second feature images.
Further, each first feature extraction module at least includes a plurality of dimension adjustment sub-modules and a residual sub-module, the residual sub-module includes a plurality of residual units, and each dimension adjustment sub-module has the same structure as the dimension adjustment module. Performing feature extraction on the dimension-adjusted target image through each first feature extraction module to obtain M second feature images includes: receiving a third feature image through the first feature extraction module, inputting the third feature image into a first dimension adjustment sub-module through a first transmission channel, inputting a fourth feature image output by the first dimension adjustment sub-module into the residual sub-module, and outputting a first feature layer, wherein the first feature extraction module is one of the plurality of first feature extraction modules, the third feature image is the target image processed by one or more of the plurality of first feature extraction modules and the plurality of dimension adjustment modules, and the first dimension adjustment sub-module is one of the plurality of dimension adjustment sub-modules; in the first transmission channel, inputting the first feature layer into a stacking unit after convolution processing; in a second transmission channel, inputting the third feature image into the stacking unit after convolution processing, and performing stacking processing on the feature data received by the stacking unit to obtain a second feature layer; performing target processing on the second feature layer to obtain a target feature image, wherein the target processing at least includes one of normalization, activation function and feature dimension adjustment, and the target feature image is one of the M second feature images; and determining the M second feature images based on the plurality of target feature images.
Further, each residual unit at least includes a plurality of dimension adjustment subunits and a weighting subunit. Inputting the fourth feature image output by the first dimension adjustment sub-module into the residual sub-module and outputting the first feature layer includes: performing channel adjustment on the fourth feature image based on a first transmission sub-channel and the plurality of dimension adjustment subunits to obtain a third feature layer, and inputting the third feature layer into the weighting subunit; inputting the fourth feature image into the weighting subunit through a second transmission sub-channel; and performing weighting processing on the third feature layer and the fourth feature image through the weighting subunit, and outputting the first feature layer.
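For illustration, the two transmission channels, the residual sub-module and the weighting subunit described in the two preceding paragraphs could be arranged as in the following PyTorch sketch. The class names, channel counts, kernel sizes and the learnable fusion weight are assumptions made for this example only and are not taken from the patent.

```python
# Illustrative sketch only (assumed layer choices, not the patented design):
# a Conv+BN+SiLU dimension-adjustment module, a residual unit with a weighting
# subunit, and a two-channel extraction block that stacks both channels.
import torch
import torch.nn as nn

class CBS(nn.Module):
    """Convolution + batch normalization + SiLU activation."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class WeightedResidualUnit(nn.Module):
    """Residual unit: channel adjustment on one sub-channel, identity on the
    other, combined by a weighting subunit (here a learnable scalar weight)."""
    def __init__(self, channels):
        super().__init__()
        self.adjust = nn.Sequential(CBS(channels, channels, k=1),
                                    CBS(channels, channels, k=3))
        self.alpha = nn.Parameter(torch.tensor(0.5))  # weighting subunit

    def forward(self, x):
        return self.alpha * self.adjust(x) + (1.0 - self.alpha) * x

class TwoChannelExtractionBlock(nn.Module):
    """First-feature-extraction-style block: channel 1 = dimension adjustment
    sub-module, residual sub-module and convolution; channel 2 = convolution on
    the raw input; both channels are stacked, normalized, activated, adjusted."""
    def __init__(self, c_in, c_out, num_units=4):
        super().__init__()
        c_mid = c_out // 2
        self.adjust_in = CBS(c_in, c_mid, k=1)
        self.residual = nn.Sequential(*[WeightedResidualUnit(c_mid)
                                        for _ in range(num_units)])
        self.conv_ch1 = nn.Conv2d(c_mid, c_mid, 1, bias=False)
        self.conv_ch2 = nn.Conv2d(c_in, c_mid, 1, bias=False)
        self.bn = nn.BatchNorm2d(2 * c_mid)
        self.act = nn.SiLU()
        self.adjust_out = CBS(2 * c_mid, c_out, k=1)

    def forward(self, x):
        y1 = self.conv_ch1(self.residual(self.adjust_in(x)))  # first transmission channel
        y2 = self.conv_ch2(x)                                  # second transmission channel
        stacked = torch.cat([y1, y2], dim=1)                   # stacking unit
        return self.adjust_out(self.act(self.bn(stacked)))
```

Splitting the features over two channels in this way keeps a second gradient path that bypasses the residual stack, which relates to the concern raised above about repeated reuse of gradient information during fusion.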
Further, the feature fusion network at least includes a plurality of second feature extraction modules, a plurality of feature fusion modules and a plurality of feature fusion layers, wherein a second feature extraction module is obtained by replacing the plurality of residual units in a first feature extraction module with a plurality of dimension adjustment units, and the N first feature images differ in image size. Inputting the N first feature images into the feature fusion network and outputting M second feature images includes: in the first feature fusion layer, inputting the first feature image with a first size into a first feature fusion module, inputting the result output by the first feature fusion module into a first extraction module, and outputting a first image, wherein the first feature fusion module is used for fusing the image features transmitted by a first feature flow path with the first feature image of the first size, the transmission direction of the image features on the first feature flow path is from the last feature fusion layer to the first feature fusion layer, the first extraction module is one of the plurality of second feature extraction modules and is used for transmitting its output result to the last feature fusion layer through a second feature flow path, and the image feature transmission direction of the second feature flow path is from the first feature fusion layer to the last feature fusion layer; in the second feature fusion layer, based on a third transmission channel and a fourth transmission channel, performing feature fusion on the first feature image with a second size and the image features transmitted by the first feature flow path, fusing the result obtained after feature fusion with the image features transmitted by the second feature flow path, and determining a second image; in the last feature fusion layer, based on a fifth transmission channel and a sixth transmission channel, performing feature fusion on the first feature image with a third size, after a pyramid pooling operation, with the features transmitted by the second feature flow path, and determining a third image; and determining the M second feature images based on the first image, the second image and the third image.
Further, in the second feature fusion layer, performing feature fusion on the first feature image with the second size and the image features transmitted by the first feature flow path based on the third transmission channel and the fourth transmission channel, and fusing the feature fusion result with the image features transmitted by the second feature flow path to determine the second image, includes: in the second feature fusion layer, transmitting the first feature image with the second size to a second feature fusion module through the third transmission channel, and performing multi-scale fusion with the image features transmitted by the first feature flow path to obtain a first fusion feature; inputting the first fusion feature into a second extraction module to obtain a second fusion feature, wherein the second extraction module is one of the plurality of second feature extraction modules; transmitting the first feature image with the second size to a third feature fusion module through the fourth transmission channel, and performing multi-scale feature fusion on it with the features transmitted by the second feature flow path and with the second fusion feature to determine a third fusion feature; and inputting the third fusion feature into a third extraction module and outputting the second image, wherein the third extraction module is one of the plurality of second feature extraction modules.
Further, after the pyramid pooling operation on the first feature image of the third size, the method further includes: transmitting the result obtained from the pyramid pooling operation on the first feature image of the third size to the first feature fusion layer through the first feature flow path.
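The top-down first feature flow path, the bottom-up second feature flow path and the pyramid pooling on the deepest scale described above can be illustrated schematically as follows. In this sketch the fusion and extraction modules are reduced to simple convolutions and concatenations, pyramid pooling is left as a placeholder, and all channel counts and sizes are assumptions for illustration only.

```python
# Schematic three-layer neck sketch (assumed sizes and layers, not the patented modules).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNeck(nn.Module):
    def __init__(self, c3=256, c4=512, c5=1024):
        super().__init__()
        self.spp = nn.Identity()                                 # stand-in for pyramid pooling
        self.reduce5 = nn.Conv2d(c5, c4, 1)
        self.fuse4_td = nn.Conv2d(c4 + c4, c4, 3, padding=1)     # top-down fusion, layer 2
        self.reduce4 = nn.Conv2d(c4, c3, 1)
        self.fuse3 = nn.Conv2d(c3 + c3, c3, 3, padding=1)        # first feature fusion layer
        self.down3 = nn.Conv2d(c3, c4, 3, stride=2, padding=1)   # second (bottom-up) flow path
        self.fuse4_bu = nn.Conv2d(c4 + c4, c4, 3, padding=1)     # bottom-up fusion, layer 2
        self.down4 = nn.Conv2d(c4, c5, 3, stride=2, padding=1)
        self.fuse5 = nn.Conv2d(c5 + c5, c5, 3, padding=1)        # last feature fusion layer

    def forward(self, p3, p4, p5):
        p5 = self.spp(p5)                                        # pyramid pooling on the third-size features
        td5 = self.reduce5(p5)                                   # start of the first (top-down) flow path
        t4 = self.fuse4_td(torch.cat([p4, F.interpolate(td5, scale_factor=2)], 1))
        t3 = self.fuse3(torch.cat([p3, F.interpolate(self.reduce4(t4), scale_factor=2)], 1))
        b4 = self.fuse4_bu(torch.cat([t4, self.down3(t3)], 1))   # fuse with the bottom-up path
        b5 = self.fuse5(torch.cat([p5, self.down4(b4)], 1))
        return t3, b4, b5                                        # fused multi-scale feature maps

neck = SimpleNeck()
outs = neck(torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40),
            torch.randn(1, 1024, 20, 20))
```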
Further, the object detection model is determined by: obtaining pre-training weights and training samples, and dividing the training samples into a training set, a verification set and a test set, wherein the training samples comprise: a plurality of images, wherein each image is marked with a marking frame of the target; training an initial detection model based on the pre-training weight and the training set, and verifying whether the initial detection model is converged or not through the verification set in a cross verification mode in the training process, wherein the initial detection model is an untrained model; and under the condition that the initial detection model is converged, determining the target detection model, and testing the detection precision of the initial detection model based on a weight file and the test set, wherein the weight file is used for storing a plurality of weights of the target detection model.
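A minimal training-loop skeleton matching the procedure just described might look as follows. The split ratios, the optimizer, the convergence check and the assumption that the model returns its total loss when given labeled targets are all illustrative choices, not details taken from the patent.

```python
# Minimal training sketch (assumed interfaces and hyperparameters).
import torch
from torch.utils.data import random_split, DataLoader

def train_detector(model, dataset, pretrained="pretrained.pt", epochs=100):
    model.load_state_dict(torch.load(pretrained), strict=False)      # pre-training weights
    n = len(dataset)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    train_set, val_set, test_set = random_split(
        dataset, [n_train, n_val, n - n_train - n_val])               # training/validation/test sets
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_val = float("inf")
    for epoch in range(epochs):
        model.train()
        for images, targets in DataLoader(train_set, batch_size=16, shuffle=True):
            loss = model(images, targets)        # assumed: model returns its total loss in train mode
            opt.zero_grad(); loss.backward(); opt.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(model(img, tgt).item()
                           for img, tgt in DataLoader(val_set, batch_size=16))
        if val_loss < best_val:                  # crude convergence check on the validation set
            best_val = val_loss
            torch.save(model.state_dict(), "best_weights.pt")         # weight file for the later test
    return test_set                              # reserved for the final detection-precision test
```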
Further, the target detection method for enhancing multi-scale feature extraction, multiplexing and fusion further comprises the following steps: in the training process of the initial detection model, a fusion feature map output by the feature fusion network is received through the target prediction network; dividing the fusion feature map into a plurality of grids, wherein the labeling frames of the targets are marked on the fusion feature map, and each grid comprises a plurality of anchor frames; searching a target grid where the annotation frame in the fusion feature map is located in the multiple grids; determining a prediction frame based on an intersection ratio of each anchor frame in the target grid and the annotation frame of the target; and determining a detection result of the initial detection model through the prediction frame.
Further, the step of determining a prediction box based on the intersection ratio of each anchor box in the target grid and the annotation box of the target comprises: calculating the intersection ratio of each anchor frame in the target grid and the labeling frame of the target to obtain a plurality of ratios; taking an anchor frame associated with the largest ratio of the plurality of ratios as a target anchor frame, wherein the target anchor frame is used for detecting the target on the fusion characteristic diagram; and updating the position parameters of the target anchor frame to obtain the prediction frame.
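The anchor-selection step described above can be illustrated by the following sketch, which computes the intersection ratio (IoU) between each anchor box in the target grid cell and the ground-truth annotation box and keeps the anchor with the largest ratio. The (x1, y1, x2, y2) box convention and the example values are assumptions for illustration.

```python
# Sketch of IoU-based selection of the target anchor box.
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def select_target_anchor(anchors_in_cell, gt_box):
    """Return the anchor with the largest IoU against the annotation box."""
    ratios = [iou(a, gt_box) for a in anchors_in_cell]
    best = max(range(len(ratios)), key=ratios.__getitem__)
    return anchors_in_cell[best], ratios[best]

# Example: three anchors in the grid cell that contains the annotation box.
anchors = [(10, 10, 50, 50), (5, 5, 80, 80), (20, 20, 40, 40)]
target_anchor, best_iou = select_target_anchor(anchors, (12, 12, 60, 60))
```

The selected target anchor is then regressed (its position parameters updated) to obtain the prediction frame, as stated above.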
Further, the loss function adopted when training the initial detection model at least comprises: classification loss function, confidence loss function, and location loss function.
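Since the text only names the three loss components, the following is a hedged sketch of how they could be combined; the specific loss functions (binary cross-entropy for classification and confidence, a smooth-L1 box loss for location) and the weights are assumptions, not the patented choice.

```python
# Hedged sketch of a combined training loss (assumed component losses and weights).
import torch.nn.functional as F

def total_loss(cls_pred, cls_true, conf_pred, conf_true, box_pred, box_true,
               w_cls=1.0, w_conf=1.0, w_loc=1.0):
    cls_loss = F.binary_cross_entropy_with_logits(cls_pred, cls_true)     # classification loss
    conf_loss = F.binary_cross_entropy_with_logits(conf_pred, conf_true)  # confidence loss
    loc_loss = F.smooth_l1_loss(box_pred, box_true)                       # location (box) loss
    return w_cls * cls_loss + w_conf * conf_loss + w_loc * loc_loss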
According to another aspect of the embodiments of the present invention, there is further provided a target detection apparatus for enhancing multi-scale feature extraction, multiplexing, and fusion, including: the first processing unit is used for responding to a recognition request of a target image, inputting the target image into a target feature extraction network of a target detection model, and outputting N first feature images, wherein the target detection model at least comprises: the target detection model comprises a plurality of transmission channels, a target feature extraction network, a feature fusion network and a target prediction network, wherein the transmission channels are used for transmitting image features of the target image, and N is an integer greater than 1; the second processing unit is used for inputting the N first characteristic images into the characteristic fusion network and outputting M second characteristic images, wherein the characteristic fusion network is used for carrying out multi-scale characteristic fusion on the N first characteristic images, and M is an integer larger than 1; and the third processing unit is used for inputting the M second characteristic images into the target prediction network and outputting the identification result of the target image, wherein the identification result comprises a classification result of the target in the target image.
Further, the target feature extraction network at least includes: the device comprises a plurality of first feature extraction modules and a plurality of dimension adjustment modules, wherein each dimension adjustment module at least comprises: a convolution layer, a batch normalization layer and an activation function layer, wherein the first processing unit comprises: the first processing subunit is used for carrying out dimension adjustment on the target image through each dimension adjustment module, and carrying out feature extraction on the target image subjected to dimension adjustment through each first feature extraction module to obtain M second feature images.
Further, each of the first feature extraction modules includes at least: the device comprises a plurality of dimension adjustment sub-modules and a residual sub-module, wherein the residual sub-module comprises a plurality of residual units, each dimension adjustment sub-module has the same structure as the dimension adjustment module, and the processing sub-unit comprises: the first processing module is used for receiving a third characteristic image through the first characteristic extraction module, inputting the third characteristic image into the first dimension adjustment sub-module through a first transmission channel, inputting a fourth characteristic image output by the first dimension adjustment sub-module into the residual sub-module and outputting a first characteristic layer, wherein the first characteristic extraction module is one of a plurality of first characteristic extraction modules, the third characteristic image is the target image processed by one or more modules of the plurality of first characteristic extraction modules and the plurality of dimension adjustment modules, and the first dimension adjustment sub-module is one of a plurality of dimension adjustment sub-modules; the second processing module is used for inputting the first characteristic layer into the stacking unit after convolution processing is carried out on the first characteristic layer in the first transmission channel; the third processing module is used for inputting the third characteristic image into the stacking unit after convolution processing in the second transmission channel, and carrying out stacking processing on the characteristic data received by the stacking unit through the stacking unit to obtain a second characteristic layer; the fourth processing module is configured to perform target processing on the second feature layer to obtain a target feature image, where the target processing at least includes one of the following: normalizing, activating a function, and adjusting a characteristic dimension, wherein the target characteristic image is one of M second characteristic images; and the determining module is used for determining M second characteristic images based on the target characteristic images.
Further, each of the residual units includes at least: a plurality of dimension adjustment subunits, a weighting subunit, and a first processing module comprising: the adjustment sub-module is used for carrying out channel adjustment on the fourth characteristic image based on a first transmission sub-channel and a plurality of dimension adjustment sub-units to obtain a third characteristic layer, and inputting the third characteristic layer into the weighting sub-unit; a first input module for inputting the fourth feature image into the weighting subunit through a second transmission subchannel; and the weighting module is used for carrying out weighting processing on the third characteristic layer and the fourth characteristic image through the weighting subunit and outputting the first characteristic layer.
Further, the feature fusion network at least includes: the second processing unit includes: the second processing subunit is configured to input the first feature image with the first size into a first feature fusion module at a first layer of the feature fusion layer, input a result output by the first feature fusion module into a first extraction module, and output a first image, where the first feature fusion module is configured to fuse an image feature transmitted by a first feature flow path with the first feature image with the first size, a transmission direction of the image feature transmitted by the first feature flow path is a direction from a last layer of the feature fusion layer to the first layer of the feature fusion layer, the first extraction module is one of a plurality of second feature extraction modules, and the first extraction module is configured to transmit an output result of the first feature extraction module to the last layer of the feature fusion layer through a second feature flow path, where a transmission direction of the image feature of the second feature flow path is a direction from the first layer of the feature fusion layer to the last layer of the feature fusion layer; the third processing subunit is configured to perform feature fusion on the first feature image with a second size and the image features transmitted by the first feature flow path based on a third transmission channel and a fourth transmission channel in the second feature fusion layer, and fuse a result after feature fusion with the image features transmitted by the second feature flow path to determine a second image; the fourth processing subunit is used for carrying out feature fusion on the first feature image with the third size and the features transmitted by the second feature flow path after pyramid pooling operation on the basis of a fifth transmission channel and a sixth transmission channel in the last feature fusion layer, so as to determine a third image; a determining subunit configured to determine M second feature images based on the first image, the second image, and the third image.
Further, the third processing subunit includes: the first fusion module is used for transmitting the first characteristic image with the second size to the second characteristic fusion module through the third transmission channel at the second characteristic fusion layer, and carrying out multi-scale fusion on the first characteristic image and the image characteristics transmitted by the first characteristic flow path to obtain a first fusion characteristic; the input module is used for inputting the first fusion feature into a second extraction module to obtain a second fusion feature, wherein the second extraction module is one of a plurality of second feature extraction modules; the second fusion module is used for transmitting the first characteristic image with the second size to the third characteristic fusion module through the fourth transmission channel, carrying out multi-scale characteristic fusion on the characteristics transmitted by the second characteristic flow path and the second fusion characteristics, and determining the third fusion characteristics; and the input and output module is used for inputting the third fusion feature into a third extraction module and outputting the second image, wherein the third extraction module is one of the plurality of second feature extraction modules.
Further, the second processing unit further includes: and the transmission subunit is used for transmitting the result obtained by the pyramid pooling operation of the first characteristic image with the third size to the first layer of characteristic fusion layer through the first characteristic flow path after the pyramid pooling operation of the first characteristic image with the third size.
Further, the object detection model is determined by: the fourth processing unit is configured to obtain a pre-training weight and a training sample, and divide the training sample into a training set, a verification set and a test set, where the training sample includes: a plurality of images, wherein each image is marked with a marking frame of the target; the training unit is used for training an initial detection model based on the pre-training weight and the training set, and verifying whether the initial detection model is converged or not through the verification set in a cross verification mode in the training process, wherein the initial detection model is an untrained model; and the testing unit is used for determining the target detection model under the condition that the initial detection model is converged, and testing the detection precision of the initial detection model based on a weight file and the test set, wherein the weight file is used for storing a plurality of weights of the target detection model.
Further, the target detection device for enhanced multi-scale feature extraction, multiplexing and fusion further includes: a receiving subunit, configured to receive, through the target prediction network, the fusion feature map output by the feature fusion network in the process of training the initial detection model; a division subunit, configured to divide the fusion feature map into a plurality of grids, wherein the labeling frames of the targets are marked on the fusion feature map and each grid includes a plurality of anchor frames; a searching subunit, configured to search, among the plurality of grids, for the target grid where the labeling frame in the fusion feature map is located; a first determining subunit, configured to determine a prediction frame based on the intersection ratio of each anchor frame in the target grid and the labeling frame of the target; and a second determining subunit, configured to determine the detection result of the initial detection model through the prediction frame.
Further, the first determining subunit includes: the calculation module is used for calculating the intersection ratio of each anchor frame in the target grid and the marking frame of the target to obtain a plurality of ratios; the anchor frame processing module is used for taking an anchor frame associated with the largest ratio of the plurality of ratios as a target anchor frame, wherein the target anchor frame is used for detecting the target on the fusion characteristic diagram; and the updating module is used for updating the position parameters of the target anchor frame to obtain the prediction frame.
Further, the loss function adopted when training the initial detection model at least comprises: classification loss function, confidence loss function, and location loss function.
According to another aspect of the embodiment of the present invention, there is also provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the enhanced multi-scale feature extraction, multiplexing and fusion object detection method of any of the above via execution of the executable instructions.
According to another aspect of the embodiments of the present invention, there is further provided a computer readable storage medium storing a computer program, where the computer program when executed controls a device in which the computer readable storage medium is located to perform the method for object detection of enhanced multi-scale feature extraction, multiplexing and fusion of any one of the above.
In the invention, in response to a request for identifying a target image, the target image is input into a target feature extraction network of a target detection model, and N first feature images are output, wherein the target detection model at least comprises: the target detection model comprises a plurality of transmission channels, a target feature extraction network, a feature fusion network and a target prediction network, wherein the transmission channels are used for transmitting image features of the target image, and N is an integer greater than 1; inputting the N first feature images into the feature fusion network, and outputting M second feature images, wherein the feature fusion network is used for carrying out multi-scale feature fusion on the N first feature images, and M is an integer greater than 1; and inputting the M second characteristic images into the target prediction network, and outputting the identification result of the target image, wherein the identification result comprises a classification result of the target in the target image. The method further solves the technical problems of low target recognition efficiency in the target detection process caused by complex model structure, high calculation cost and no pertinence to the salient features in the target detection algorithm. According to the invention, the targets in the target image are classified and identified through the target detection model comprising a plurality of transmission channels, so that the conditions of low identification efficiency caused by complex structure and low feature circulation speed of the detection model in the related technology are avoided, and the technical effect of classifying and identifying the targets in the image by the target detection model is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:
FIG. 1 is a flow chart of an alternative enhanced multi-scale feature extraction, multiplexing and fusion target detection method according to an embodiment of the invention;
FIG. 2 is a model block diagram of an alternative object detection model in accordance with an embodiment of the present invention;
FIG. 3 is a block diagram of an alternative ECSPDarkNet1-X module in accordance with an embodiment of the present invention;
FIG. 4 is a block diagram of an alternative ECSPDarkNet2-X module in accordance with an embodiment of the present invention;
FIG. 5 is a flow chart of an alternative neck feature fusion network multi-scale feature fusion process according to an embodiment of the present invention;
FIG. 6 is a flow chart of an alternative model training according to an embodiment of the present invention;
FIG. 7 is a flow chart of an alternative model test according to an embodiment of the invention;
FIG. 8 shows variation curves of three loss functions under an alternative cross-validation training strategy according to an embodiment of the present invention;
FIG. 9 is a flow chart of an alternative anchor frame to prediction frame adjustment process according to an embodiment of the present invention;
FIG. 10 is a flow chart of an alternative prediction block regression process according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of an alternative enhanced multi-scale feature extraction, multiplexing and fusion target detection device according to an embodiment of the invention;
fig. 12 is a schematic diagram of an electronic device according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, displayed data, image feature data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide a corresponding operation entry for the user to select authorization or rejection.
The invention can be applied to various software products, control systems and client (including but not limited to mobile clients, PCs and the like) control systems of various financial institutions for visual detection, and is schematically illustrated by taking the software products as an example, and related business contents (including but not limited to business functions of transfer, financing, fund, payment, account checking, advertisement, recommendation and the like) of the financial institutions can be subjected to real-time monitoring of the flow of people at the banking sites, analysis of the visitors at the banking sites, anti-deceptive anti-counterfeit identification of paper money, abnormal behavior detection of the visitors at the banking sites, monitoring alarm systems at the banking sites, identification of the bank cards, face recognition of the visitors at the banking sites and the like by the visual detection system installed on the mobile clients.
Example 1
In accordance with an embodiment of the present invention, an alternative method embodiment of an enhanced multi-scale feature extraction, multiplexing, and fusion target detection method is provided, it being noted that the steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer executable instructions, and, although a logical order is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
FIG. 1 is a flow chart of an alternative method of enhanced multi-scale feature extraction, multiplexing and fusion target detection according to an embodiment of the invention, as shown in FIG. 1, the method comprising the steps of:
step S101, in response to a request for identifying a target image, inputting the target image into a target feature extraction network of a target detection model, and outputting N first feature images, wherein the target detection model at least comprises: the target detection model comprises a plurality of transmission channels, wherein the transmission channels are used for transmitting image features of target images, and N is an integer greater than 1.
Fig. 2 is a model structure diagram of an alternative object detection model according to an embodiment of the present invention. As shown in fig. 2, the object detection model may be composed of four parts: an input image portion (for receiving the target image to be identified), a backbone feature extraction network (corresponding to the target feature extraction network described above), a neck feature fusion network (corresponding to the feature fusion network described above), and a target prediction network. The input target image size may be within a preset range, for example in the range of (0.5-1.5) × (640×640); in an alternative approach, the size of the input target image may be 416×416. The backbone feature extraction network described above may be responsible for multi-scale shallow feature extraction.
In this embodiment, the target image is input into the target feature extraction network of the target detection model, and N first feature images may be output by the target feature extraction network. To avoid the situation in the related art where a complex model structure leads to inefficient feature flow inside the model, the target detection model may include multiple transmission channels for transmitting image features, so as to increase the speed at which image features flow through the target detection model and improve the efficiency with which the model identifies and classifies targets in the image.
Step S102, inputting the N first feature images into a feature fusion network to output M second feature images, wherein the feature fusion network is used for carrying out multi-scale feature fusion on the N first feature images, and M is an integer larger than 1.
The feature fusion network can perform multi-scale fusion on the feature images output by the target feature extraction network to obtain a plurality of second feature images (corresponding to the M second feature images described above). For example, the neck feature fusion network (corresponding to the feature fusion network described above) may perform multi-scale fusion of the multi-scale features output by the backbone feature extraction network.
Step S103, inputting the M second characteristic images into a target prediction network, and outputting the identification result of the target image, wherein the identification result comprises the classification result of the target in the target image.
In this embodiment, M second feature images output by the feature fusion network may be input into the target prediction network, and the attribute of the target in the target image is predicted and judged by the target prediction network, so as to identify whether the target image includes the target and the category of the target.
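Putting steps S101 to S103 together, the inference path can be summarized by the following sketch, in which the three networks are treated as opaque callables; their internal structure is described in the remainder of this document, and the function signature is an assumption made for illustration.

```python
# Shape-level sketch of steps S101-S103; backbone, neck and head stand for the
# target feature extraction network, the feature fusion network and the target
# prediction network respectively, and are assumed to be callables that return
# lists of feature maps / detections.
def detect(target_image, backbone, neck, head):
    first_features = backbone(target_image)    # S101: N first feature images
    second_features = neck(first_features)     # S102: M multi-scale fused second feature images
    recognition = head(second_features)        # S103: classification result for targets in the image
    return recognition
```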
The method for detecting the target by extracting, multiplexing and fusing the enhanced multi-scale features in the target image can be used in the image acquisition equipment of the financial institution, and can be used for identifying and classifying the target in the image acquired by the financial institution so as to improve the processing efficiency of the image data in the financial institution.
It should be noted that the object detection model of the present embodiment may be applied to a visual detection system of a financial institution, such as a bank visual detection system, to improve the online real-time monitoring and security capability of banking outlets. Specifically, the embodiment can be applied to scenarios such as real-time monitoring of the flow of people at banking outlets, analysis of bank visitors, anti-deception and anti-counterfeiting recognition of paper money, detection of abnormal behavior of bank visitors, bank outlet monitoring and alarm systems, bank card recognition, and face recognition of bank visitors.
In the target detection model of the embodiment, feature information can be integrated in a target geometric feature statistical mode, so that the effect of detecting a target is achieved. The primary task of the object detection model of this embodiment may be to locate an object in sample data, then process feature information by using a feature extraction and fusion module, and finally classify the object information. The embodiment relies on the artificial intelligence technology, and can be widely applied to the fields of intelligent finance, intelligent traffic, intelligent medical treatment, intelligent safety, target tracking, industrial detection and the like.
Through the steps, in the embodiment, the targets in the target image are classified and identified through the target detection model comprising a plurality of transmission channels, so that the conditions of low identification efficiency caused by complex structure and low feature circulation speed of the detection model in the related technology are avoided, and the technical effect of classifying and identifying the targets in the image by the target detection model is improved. The method further solves the technical problems of low target recognition efficiency in the target detection process caused by complex model structure, high calculation cost and no pertinence to the salient features in the target detection algorithm.
The basic operation procedure in the target detection algorithm in this embodiment is described below:
in the present embodiment, the feature extraction operation may be performed using a general convolution operation, as shown in formula (1). Where x is the input feature, w is the convolution kernel, and s is the output feature.
s(t)=(x*w)(t) (1)
In addition, feature dimensions need to be determined between the input features and the output features, and in this embodiment, the feature size may be calculated by using the formula (2) and the formula (3).
OH = (H + 2P - FH) / S + 1    (2)
OW = (W + 2P - FW) / S + 1    (3)
Where P is the padding, S is the stride, (H, W) are the height and width of the input feature, (FH, FW) are the height and width of the convolution kernel, and (OH, OW) are the height and width of the output feature. In this embodiment, the size of the input feature may be in the range of (0.5-1.5) × (640×640); it is sufficient that the size is a multiple of 32, the maximum downsampling factor of the network. In general, the convolution kernel size may be 1×1 or 3×3.
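As a quick check of equations (2) and (3), the output size can be computed as follows; the function name is illustrative.

```python
# Sketch of the output-size relation in equations (2)-(3).
def conv_output_size(h, w, fh, fw, p=0, s=1):
    """Return (OH, OW) for input (H, W), kernel (FH, FW), padding P, stride S."""
    oh = (h + 2 * p - fh) // s + 1
    ow = (w + 2 * p - fw) // s + 1
    return oh, ow

# Example: a 3x3 convolution with padding 1 and stride 1 preserves a 640x640 input.
assert conv_output_size(640, 640, 3, 3, p=1, s=1) == (640, 640)
```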
The forward propagation and backward propagation processes in the target detection algorithm in this embodiment are described below:
the forward propagation essence is that the characteristics are subjected to convolution operation in the network to extract the characteristics, and the characteristics are subjected to forward circulation by adopting an optimization strategy. The back propagation is to learn and update the gradient of each weight and bias of the convolution kernel.
The forward propagation process of the convolutional layer in the target detection algorithm of this embodiment is described below. Denote a^(l-1) as the output feature map of layer l-1, i.e. the input feature of the current layer, and a^l as the output feature of the current layer. The forward propagation process of the convolutional layer is shown in equations (4) and (5), where w is the convolution kernel, b is the bias, and f() is the activation function.

u^l = w^l * a^(l-1) + b^l    (4)
a^l = f(u^l)    (5)
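For illustration, the forward propagation of equations (4) and (5) can be written compactly as follows; the tensor shapes and the choice of SiLU as the activation f() are assumptions for this sketch only.

```python
# Minimal sketch of equations (4)-(5) for one convolutional layer.
import torch
import torch.nn.functional as F

a_prev = torch.randn(1, 16, 52, 52)        # a^(l-1): output feature map of layer l-1
w = torch.randn(32, 16, 3, 3)              # w^l: convolution kernels of layer l
b = torch.zeros(32)                        # b^l: bias of layer l

u = F.conv2d(a_prev, w, bias=b, padding=1) # u^l = w^l * a^(l-1) + b^l   (4)
a = F.silu(u)                              # a^l = f(u^l)                 (5)
```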
The following describes the back propagation process of the convolution layer in the target detection algorithm in this embodiment:
Denote the gradient of layer l as δ^l; the gradient of layer l-1, denoted δ^(l-1), can be obtained by back propagation, as shown in equation (6). If no UpSampling operation is used during back propagation, UpSampling() may be ignored.

δ^(l-1) = UpSampling(δ^l) * rot180(w^l) ⊙ f'(u^(l-1))    (6)

The gradient of the current convolutional layer with respect to w is obtained by convolving the layer input with the output error gradient, as shown in equation (7); the remaining convolution kernels are handled similarly. Since the bias b is a vector, the error gradient terms of each convolution kernel are summed to obtain the error vector with respect to b, as shown in equation (8). Finally, a gradient optimization algorithm is selected to update the parameters; the update of w is shown in equation (9), and the update of b is similar. Here J is the target loss function and η is the learning rate.

∂J/∂w^l = δ^l * a^(l-1)    (7)
∂J/∂b^l = Σ(u,v) (δ^l)(u,v)    (8)
w^l ← w^l - η · ∂J/∂w^l    (9)
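The updates in equations (7) to (9) can be illustrated with automatic differentiation as follows; the stand-in loss J, the tensor shapes and the learning rate used here are assumptions for illustration.

```python
# Sketch of the weight/bias updates of equations (7)-(9) using autograd.
import torch
import torch.nn.functional as F

w = torch.randn(8, 4, 3, 3, requires_grad=True)   # convolution kernels w^l
b = torch.zeros(8, requires_grad=True)            # bias b^l
x = torch.randn(1, 4, 32, 32)                     # layer input a^(l-1)
target = torch.randn(1, 8, 32, 32)

u = F.conv2d(x, w, bias=b, padding=1)
J = F.mse_loss(F.silu(u), target)                 # stand-in target loss function J
J.backward()                                      # fills w.grad (eq. 7) and b.grad (eq. 8)

eta = 0.01                                        # learning rate η
with torch.no_grad():
    w -= eta * w.grad                             # w ← w - η ∂J/∂w   (eq. 9)
    b -= eta * b.grad                             # analogous update for b
```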
The forward propagation process of the pooling layer in the target detection algorithm in this embodiment is described below:
like the convolutional layer, the first layer is denoted as a pooling layer, and the forward propagation process is shown in formulas (10) to (11).
Wherein DowmSampling () is a downsampling operation.
Figure BDA0004140125250000135
Figure BDA0004140125250000136
The following describes the back propagation process of the pooling layer in the target detection algorithm in this embodiment:
recording the gradient of the first-1 layer of the pooling layer as delta l-1 The back propagation process to obtain the pooling layer is shown in equation (12).
δ l-1 =UpSampling(δ l )f'(u l-1 ) (12)
The following describes a model optimization algorithm procedure in the target detection algorithm in the present embodiment:
in this embodiment, an Adam gradient optimization algorithm may be selected to update the parameters, and the gradient optimization algorithm has an important role in performing moving average and deviation correction on the gradient. The pseudo code execution process is as follows:
Adam gradient optimization algorithm pseudo code of this embodiment:

Initialize the learning rate η, the learnable parameters θ_0 and ε;
Set the smoothing constants β_1 and β_2 for m and v;
Initialize m_0 = 0, v_0 = 0, t = 0;
while θ_t has not converged do
    t ← t + 1;
    δ_t ← ∇_θ J(θ_{t-1});
    m_t ← β_1 · m_{t-1} + (1 − β_1) · δ_t;
    v_t ← β_2 · v_{t-1} + (1 − β_2) · δ_t²;
    m̂_t ← m_t / (1 − β_1^t);
    v̂_t ← v_t / (1 − β_2^t);
    θ_t ← θ_{t-1} − η · m̂_t / (√v̂_t + ε);
end while
The algorithm first calculates the gradient δ_t at time t, then computes the first moment estimate m_t and the second moment estimate v_t of the gradient, performs bias correction on the first and second moment estimates to obtain m̂_t and v̂_t respectively, and finally updates the learnable parameters θ.
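For illustration, a minimal NumPy sketch of one Adam update step as described above; the default hyper-parameter values are illustrative assumptions and are not taken from this embodiment:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First and second moment estimates (moving averages of the gradient)
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction of the moment estimates
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```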
Optionally, the target feature extraction network includes at least: the device comprises a plurality of first feature extraction modules and a plurality of dimension adjustment modules, wherein each dimension adjustment module at least comprises: the convolution layer, the batch normalization layer and the activation function layer are used for inputting the target image into a target feature extraction network of a target detection model and outputting N first feature images, and the method comprises the following steps: and performing dimension adjustment on the target image through each dimension adjustment module, and performing feature extraction on the target image subjected to dimension adjustment through each first feature extraction module to obtain M second feature images.
The backbone feature extraction network (corresponding to the target feature extraction network described above) is described in an alternative manner. As shown in fig. 2, it may include 4 ECSPDarkNet1-X modules (corresponding to the first feature extraction modules described above), namely ECSPDarkNet1-4, ECSPDarkNet1-9, ECSPDarkNet1-12 and ECSPDarkNet1-4, and may further include 5 CBS modules (corresponding to the dimension adjustment modules described above). The ECSPDarkNet1-X module can be used to enhance multi-scale feature extraction and reduce model parameters, and the CBS module can be used to adjust the dimension information between the input features and the output features of each layer. As shown in fig. 2, the CBS module may include an ordinary convolution Conv (corresponding to the convolution layer described above), a batch normalization BN (corresponding to the batch normalization layer described above), and a SiLU activation function (corresponding to the activation function layer described above). The ordinary convolution Conv can be used for channel integration, the batch normalization BN can be used to balance the nonlinear feature flow and accelerate model training, and the SiLU activation function can be used to introduce nonlinear factors and improve the nonlinear feature expression capability of the deep neural network.
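A minimal PyTorch sketch of a CBS block (Conv + BN + SiLU) as described above; the class name, argument defaults and padding choice are illustrative assumptions:

```python
import torch.nn as nn

class CBS(nn.Module):
    """Ordinary convolution + batch normalization + SiLU activation."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))
```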
As shown in FIG. 2, the target feature map may first be passed sequentially through two 3×3 CBS modules and then processed by one ECSPDarkNet1-4 module. After processing by one 3×3 CBS module, the result is input into the ECSPDarkNet1-9 module, and the result output by the ECSPDarkNet1-9 module (one of the M second feature images) is input into the neck feature fusion network. Alternatively, the result output by the ECSPDarkNet1-9 module, after processing by one 3×3 CBS module, is input into the ECSPDarkNet1-12 module, and the result output by the ECSPDarkNet1-12 module (one of the M second feature images) is input into the neck feature fusion network. Alternatively, the result output by the ECSPDarkNet1-12 module, after processing by one 3×3 CBS module, is input into another ECSPDarkNet1-4 module, and the result output by this ECSPDarkNet1-4 module (one of the M second feature images) is likewise passed to the neck feature fusion network for multi-scale feature fusion.
Optionally, each first feature extraction module includes at least: the device comprises a plurality of dimension adjustment sub-modules and a residual sub-module, wherein the residual sub-module comprises a plurality of residual units, each dimension adjustment sub-module has the same structure as the dimension adjustment module, and each first feature extraction module is used for extracting features of a target image subjected to dimension adjustment to obtain M second feature images, and the device comprises: receiving a third characteristic image through a first characteristic extraction module, inputting the third characteristic image into a first dimension adjustment sub-module through a first transmission channel, inputting a fourth characteristic image output by the first dimension adjustment sub-module into a residual sub-module, and outputting a first characteristic layer, wherein the first characteristic extraction module is one of a plurality of first characteristic extraction modules, the third characteristic image is a target image processed by one or more of the plurality of first characteristic extraction modules and the plurality of dimension adjustment modules, and the first dimension adjustment sub-module is one of a plurality of dimension adjustment sub-modules; in the first transmission channel, the first characteristic layer is input into the stacking unit after convolution processing; in the second transmission channel, the third characteristic image is input into a stacking unit after convolution processing, and characteristic data received by the stacking unit is stacked through the stacking unit to obtain a second characteristic layer; performing target processing on the second feature layer to obtain a target feature image, wherein the target processing at least comprises one of the following steps: normalizing, activating a function, adjusting a characteristic dimension, wherein the target characteristic image is one of M second characteristic images; m second feature images are determined based on the plurality of target feature images.
FIG. 3 is a block diagram of an optional ECSPDarkNet1-X module according to an embodiment of the present invention, as shown in FIG. 3, the feature extraction module (e.g., ECSPDarkNet1-X in FIG. 3) may include a plurality of dimension adjustment sub-modules (e.g., CBS in FIG. 3), residual sub-modules (e.g., X residual units in FIG. 3), a convolution module (e.g., conv in FIG. 3), a stacking module (e.g., concat in FIG. 3), a normalization module (e.g., BN in FIG. 3), and an activation function module (e.g., leakyReLU in FIG. 3), where each CBS may further include: conv (convolution), BN (normalization), siLU (activation function).
First, the input feature of each ECSPDarkNet1-X module (corresponding to the first feature extraction module described above, with the input corresponding to the third feature image described above) is divided into two parts by channel. The part x''_0 enters the backbone branch of the ECSPDarkNet1-X module (corresponding to the first transmission channel), where it is first processed by a CBS module (corresponding to the first dimension adjustment sub-module) and then passed through the residual module (corresponding to the residual sub-module), which outputs a feature layer (corresponding to the first feature layer). The part x'_0 is denoted as the cross connection edge (corresponding to the second transmission channel). The feature layer output by the backbone branch is processed by an ordinary convolution Conv (i.e. the feature layer obtained by convolving the first feature layer), and x'_0, i.e. the transmitted third feature image, is likewise processed by an ordinary convolution Conv; the two are then stacked by channel through the Concat module (corresponding to the stacking unit) to enrich multi-scale feature information fusion. A batch normalization BN operation (corresponding to the normalization processing) is then adopted to ensure feature consistency, the LeakyReLU activation function (corresponding to the activation function processing) is adopted to introduce nonlinearity into the features, and finally a CBS module is used to adjust the feature scale (the feature dimension adjustment processing), whose output serves as the input of the next module.
The convolution process described above may be an ordinary convolution process. Equations (13) to (15) show the forward propagation process of the ECSPDarkNet1-X module, and equations (16) to (18) show the backward propagation process. Where w is a weight parameter, x is the input feature, "∗" is the convolution operation, x_C is the output feature of the feature layer after the ordinary convolution Conv processing, (x_1, x_2, x_3, ..., x_X) are the output features weighted by the X residual modules, Concat[x_C, x_{X+1}] is the feature stacking, J is the objective function, and g is the gradient.

x_{X+1} = w_X ∗ (x_1, x_2, x_3, ..., x_X)   (13)

x_C = w_C ∗ x'_0   (14)

x_Concat = w_Concat ∗ Concat[x_C, x_{X+1}]   (15)

w'_X = J(w_X, g_1, ..., g_{X-1}, g_X)   (16)

w'_C = J(w_C, g'_0)   (17)

w'_Concat = J(w_Concat, g_C, g_{X+1})   (18)
As can be seen from the back propagation process, the gradients of the two branches x'_0 (the second transmission channel) and x''_0 (the first transmission channel) do not contain the gradient of the other branch, i.e. gradients are not reused. In addition, the ECSPDarkNet1-X module divides the features into different network branches for transmission, which relieves repeated gradient injection into the network while reducing the model training parameters, allows the model to extract more effective features, and achieves the effect of improving the feature extraction efficiency of the target detection model.
Optionally, each residual unit comprises at least: the device comprises a plurality of dimension adjustment subunits, a weighting subunit, a residual error subunit and a first feature layer, wherein the fourth feature image output by the first dimension adjustment subunit is input into the residual error subunit, and the first feature layer is output, and the device comprises: based on the first transmission sub-channel and the plurality of dimension adjustment sub-units, channel adjustment is carried out on the fourth characteristic image to obtain a third characteristic layer, and the third characteristic layer is input into the weighting sub-unit; inputting the fourth characteristic image into the weighting subunit through the second transmission subchannel; and weighting the third characteristic layer and the fourth characteristic image through a weighting subunit, and outputting the first characteristic layer.
As shown in fig. 3, each residual unit in the residual sub-module described above may include: a plurality of dimension adjustment subunits (e.g., two CBSs in fig. 3), and a weighting subunit (e.g., add in fig. 3).
In this embodiment, the first branch (i.e., the first transmission subchannel) in the residual module (i.e., the residual unit) performs channel adjustment with a CBS module (dimension adjustment subunit) whose convolution kernel size is 1×1, and then enhances feature extraction with a CBS module (dimension adjustment subunit) whose convolution kernel size is 3×3. The other side in the residual module serves as the residual edge (i.e., the second transmission subchannel). Finally, in order to increase the feature information without changing the dimension of the feature layer, the Add module (weighting subunit) may be used to perform a weighting operation and output the first feature layer. The first transmission subchannel and the second transmission subchannel are used to transmit the image features within the residual unit; circulating the features through these two subchannels improves the feature circulation efficiency in the target detection model and achieves the technical effect of improving the processing efficiency of the target detection model when recognizing images.
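A minimal PyTorch sketch of the residual unit and the ECSPDarkNet1-X structure described above, reusing the CBS block sketched earlier; the channel split, layer widths and class names are illustrative assumptions rather than the exact configuration of FIG. 3:

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """1x1 CBS -> 3x3 CBS, added to the residual edge (Add)."""
    def __init__(self, channels):
        super().__init__()
        self.cbs1 = CBS(channels, channels, kernel_size=1)  # CBS is the Conv+BN+SiLU block sketched earlier
        self.cbs2 = CBS(channels, channels, kernel_size=3)

    def forward(self, x):
        return x + self.cbs2(self.cbs1(x))  # weighting (Add) keeps the feature dimension unchanged

class ECSPDarkNet1(nn.Module):
    """Backbone branch (CBS + X residual units) plus a cross connection edge."""
    def __init__(self, channels, num_residuals):
        super().__init__()
        half = channels // 2
        self.trunk_cbs = CBS(half, half, kernel_size=1)
        self.residuals = nn.Sequential(*[ResidualUnit(half) for _ in range(num_residuals)])
        self.trunk_conv = nn.Conv2d(half, half, kernel_size=1, bias=False)  # Conv before Concat
        self.cross_conv = nn.Conv2d(half, half, kernel_size=1, bias=False)  # Conv on the cross edge
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)
        self.out_cbs = CBS(channels, channels, kernel_size=1)  # adjusts the feature scale

    def forward(self, x):
        x_cross, x_trunk = torch.chunk(x, 2, dim=1)       # split the input by channel
        trunk = self.residuals(self.trunk_cbs(x_trunk))    # first feature layer
        fused = torch.cat([self.cross_conv(x_cross), self.trunk_conv(trunk)], dim=1)
        return self.out_cbs(self.act(self.bn(fused)))
```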
Optionally, the feature fusion network includes at least: the system comprises a plurality of second feature extraction modules, a plurality of feature fusion modules and a plurality of feature fusion layers, wherein a plurality of residual units in the first feature extraction modules are replaced by a plurality of dimension adjustment units to serve as the second feature extraction modules, the image sizes of N first feature images are different, N first feature images are input into a feature fusion network, M second feature images are output, and the system comprises: inputting a first characteristic image with a first size into a first characteristic fusion module at a first layer characteristic fusion layer, inputting the output result of the first characteristic fusion module into a first extraction module, and outputting a first image, wherein the first characteristic fusion module is used for fusing the image characteristic transmitted by a first characteristic flow path with the first characteristic image with the first size, the transmission direction of the image characteristic transmitted by the first characteristic flow path is from the last layer of characteristic fusion layer to the first layer of characteristic fusion layer, the first extraction module is one of a plurality of second characteristic extraction modules, the first extraction module is used for transmitting the output result of the first characteristic extraction module to the last layer of characteristic fusion layer through a second characteristic flow path, and the image characteristic transmission direction of the second characteristic flow path is from the first layer of characteristic fusion layer to the last layer of characteristic fusion layer; at a second layer of feature fusion layer, based on a third transmission channel and a fourth transmission channel, carrying out feature fusion on a first feature image with a second size and image features transmitted by a first feature circulation path, and fusing a result obtained after feature fusion with the image features transmitted by the second feature circulation path to determine a second image; at the last layer of feature fusion layer, based on a fifth transmission channel and a sixth transmission channel, carrying out feature fusion on the first feature image with the third size and the features transmitted by the second feature circulation path after pyramid pooling operation, and determining a third image; m second feature images are determined based on the first image, the second image, and the third image.
As shown in fig. 2, the ECSPDarkNet2-X module (corresponding to the second feature extraction module described above) may be applied in the neck feature fusion network (corresponding to the feature fusion network described above). In the related art, gradient multiplexing is one of the main reasons for increased model inference cost. Therefore, in this embodiment, the ECSPDarkNet2-X module is provided for use in the neck feature fusion network, so as to increase the multi-scale feature fusion efficiency while accelerating the circulation of multi-scale features in the network. FIG. 4 is a block diagram of an alternative ECSPDarkNet2-X module according to an embodiment of the present invention. As shown in FIG. 4, unlike ECSPDarkNet1-X, ECSPDarkNet2-X replaces the residual module (i.e., the residual units) with X CBS modules with convolution kernel sizes of 1×1 and 3×3 respectively, for the purpose of reducing feature computation and speeding up feature flow.
Similar to the ECSPDarkNet1-X module, this embodiment uses cross connection edge operations (multiple transmission channels) in the ECSPDarkNet2-X module to reduce the model's utilization of repeated gradients, so as to obtain faster feature flow and fewer model parameters. The forward propagation process of the ECSPDarkNet2-X module is shown in equations (19) to (21), and the backward propagation process is shown in equations (22) to (24). Denote x'_0 and x''_0 as the cross connection edge branch and the backbone branch respectively, w is a weight parameter, x is the input feature, "∗" is the convolution operation, x_X is the output feature of the X stacked CBS modules, x_C is the output feature of the feature layer after the ordinary convolution Conv processing, Concat[x_C, x_{X+1}] is the feature stacking, J is the objective function, and g is the gradient.

x_{X+1} = w_X ∗ x_X   (19)

x_C = w_C ∗ x'_0   (20)

x_Concat = w_Concat ∗ Concat[x_C, x_{X+1}]   (21)

w'_X = J(w_X, g_X)   (22)

w'_C = J(w_C, g'_0)   (23)

w'_Concat = J(w_Concat, g_C, g_{X+1})   (24)
In the back propagation process, the gradients of the backbone branch and the cross connection edge are integrated independently, and only the gradients belonging to the backbone branch or to the cross connection edge are updated when the gradients are processed. That is, using the ECSPDarkNet2-X module can reduce the blocking of the network by repeated gradients; the enhanced features undergo bottom-up, top-down and lateral multi-scale fusion in the neck feature fusion network, and the circulation of features in the network can be accelerated while the computational power consumption is relieved.
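A minimal PyTorch sketch of the ECSPDarkNet2-X idea described above (the residual units replaced by a stack of 1×1 and 3×3 CBS blocks on the backbone branch, plus a cross connection edge); it reuses the CBS block sketched earlier, and the exact layer arrangement of FIG. 4 is an assumption:

```python
import torch
import torch.nn as nn

class ECSPDarkNet2(nn.Module):
    """Cross connection edge + a stack of X CBS blocks (1x1 and 3x3) in place of residual units."""
    def __init__(self, channels, num_cbs):
        super().__init__()
        half = channels // 2
        blocks = [CBS(half, half, kernel_size=1 if i % 2 == 0 else 3) for i in range(num_cbs)]
        self.trunk = nn.Sequential(*blocks)                      # backbone branch
        self.trunk_conv = nn.Conv2d(half, half, 1, bias=False)   # Conv before Concat
        self.cross_conv = nn.Conv2d(half, half, 1, bias=False)   # Conv on the cross connection edge
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.LeakyReLU(0.1)
        self.out_cbs = CBS(channels, channels, kernel_size=1)

    def forward(self, x):
        x_cross, x_trunk = torch.chunk(x, 2, dim=1)
        trunk = self.trunk(x_trunk)
        fused = torch.cat([self.cross_conv(x_cross), self.trunk_conv(trunk)], dim=1)
        return self.out_cbs(self.act(self.bn(fused)))
```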
As shown in fig. 5, a multi-scale feature fusion process is provided for the neck feature fusion network. The neck feature fusion network uses 4 NFPN modules (corresponding to the feature fusion modules described above) and 4 ECSPDarkNet2-X modules (corresponding to the second feature extraction modules described above), and may further include an SPPF module (i.e., a pyramid pooling operation module). As shown in fig. 2, the SPPF module may include two CBSs, a Concat (stacking module), one 5×5 Maxpool (max pooling module), one 9×9 Maxpool and one 13×13 Maxpool, where each CBS may consist of one Conv (ordinary convolution), BN (normalization) and SiLU (activation function). Because intermediate operations with a single input and no feature fusion contribute little to feature fusion, in this embodiment these operations can first be removed from the neck feature fusion network, and a cross connection edge (transmission channel) is then added between the input feature and the output feature to reduce model computational power consumption while still fusing more effective features. In addition, the fusion of shallow features and deep features is one of the main reasons for increased model scale; therefore, in the neck feature fusion network of this embodiment, an NFPN module can first be adopted to set a learnable weight parameter for each input feature, so that the model learns to distinguish the importance of the features, and an ECSPDarkNet2-X module is used after the NFPN module to accelerate the propagation of more effective multi-scale features in the network.
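A minimal PyTorch sketch of an SPPF-style pyramid pooling module with the 5×5, 9×9 and 13×13 max-pooling kernels mentioned above, reusing the CBS block sketched earlier; the channel halving and class name are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Spatial pyramid pooling: parallel max-pools of different kernel sizes, then Concat + CBS."""
    def __init__(self, channels):
        super().__init__()
        half = channels // 2
        self.cbs_in = CBS(channels, half, kernel_size=1)
        self.pools = nn.ModuleList([
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13)
        ])
        self.cbs_out = CBS(half * 4, channels, kernel_size=1)

    def forward(self, x):
        x = self.cbs_in(x)
        pooled = [x] + [p(x) for p in self.pools]
        return self.cbs_out(torch.cat(pooled, dim=1))
```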
As shown in FIG. 5, the three scale feature maps from the backbone feature extraction network can be denoted as the first feature image of the first size, the first feature image of the second size and the first feature image of the third size respectively, and their corresponding output features are denoted as the first image, the second image and the third image, which are, in order, the small target feature map, the medium target feature map and the large target feature map. In this embodiment, the ECSPDarkNet2-X module may be denoted as E_2, and UpSampling denotes the upsampling operation.
Formulas (25) to (31) generate the small target feature map, the medium target feature map and the large target feature map. Where w and w' are learnable weight parameters, and ε takes 0.0001 to prevent division by zero. Analysis shows that NFPN sets a learnable parameter for each multi-scale feature, then performs multi-scale fusion, and finally applies a convolution operation whose result serves as the input of the next layer.
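A minimal PyTorch sketch of the learnable-weight fusion idea behind the NFPN module as described above (a normalized weighted sum of same-sized input features followed by a convolution), reusing the CBS block sketched earlier; since formulas (25) to (31) are not reproduced here, this is an assumption-level illustration rather than the exact NFPN definition:

```python
import torch
import torch.nn as nn

class NFPN(nn.Module):
    """Weighted multi-scale fusion with learnable, normalized weights."""
    def __init__(self, channels, num_inputs=2, eps=1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))  # one learnable weight per input feature
        self.eps = eps
        self.conv = CBS(channels, channels, kernel_size=3)

    def forward(self, features):
        # features: list of tensors with identical shape (already resized/upsampled)
        w = torch.relu(self.weights)
        w = w / (w.sum() + self.eps)          # eps prevents division by zero
        fused = sum(wi * fi for wi, fi in zip(w, features))
        return self.conv(fused)
```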
As can be seen from the analysis of FIG. 5, the first feature image of the first size is the shallow feature extracted by the backbone feature extraction network; it is delivered both as a feature on the top-down path and as the small-target input feature of the neck feature fusion network. As shown in formula (25) to formula (26), during the formation of the first image, the NFPN module performs multi-scale feature fusion between this shallow feature and the features transferred on the bottom-up path (corresponding to the first feature flow path) in the neck feature fusion network, and the fused feature is finally processed by an ECSPDarkNet2-X module (corresponding to the first extraction module) to form the small target detection feature map (i.e., the first image).
As can be seen from the analysis of formula (27) to formula (29), the first feature image of the second size is delivered both into the top-down path and into the neck feature fusion network. The branch processed on the top-down path and the branch transferred into the neck feature fusion network flow through two lateral channels across the connection edge (the third transmission channel and the fourth transmission channel). In the formation of the second image, the first feature image of the second size is first fused, through two NFPN modules in turn, with the multi-scale features transferred on the bottom-up path (corresponding to the first feature flow path) and on the top-down path (corresponding to the second feature flow path) in the neck feature fusion network, forming two intermediate fusion features in sequence. These intermediate features then continue to propagate on the bottom-up path and the top-down path of the neck feature fusion network respectively, and the second of them is finally processed by an ECSPDarkNet2-X module to obtain the medium target detection feature map (i.e., the second image), as expressed in formulas (27) to (29).

The large target detection feature map (corresponding to the third image) is formed as expressed in formulas (30) to (31). The first feature image of the third size is first processed by the SPPF spatial pyramid pooling module, and the result then transfers features in turn through the bottom-up path and the cross connection edge channels (the fifth transmission channel and the sixth transmission channel) in the neck feature fusion network. The feature transferred into the neck feature fusion network through the lateral connection channel is fused by an NFPN module with the multi-scale feature transferred on the top-down path (corresponding to the second feature flow path), and the fused feature is finally processed by an ECSPDarkNet2-X module (the third extraction module) to obtain the large target detection feature map (corresponding to the third image).
Optionally, at the second layer of feature fusion layer, based on the third transmission channel and the fourth transmission channel, feature fusion is performed on the first feature image with the image feature transmitted by the first feature flow path, and the result after feature fusion is fused with the image feature transmitted by the second feature flow path, so as to determine the second image, including: transmitting the first characteristic image with the second size to a second characteristic fusion module through a third transmission channel in a second characteristic fusion layer, and carrying out multi-scale fusion on the first characteristic image and the image characteristic transmitted by the first characteristic flow path to obtain a first fusion characteristic; inputting the first fusion feature into a second extraction module to obtain a second fusion feature, wherein the second extraction module is one of a plurality of second feature extraction modules; transmitting the first characteristic image with the second size to a third characteristic fusion module through a fourth transmission channel, and carrying out multi-scale characteristic fusion on the characteristic transmitted by the second characteristic flow path and the second fusion characteristic to determine a third fusion characteristic; and inputting the third fusion feature into a third extraction module and outputting a second image, wherein the third extraction module is one of a plurality of second feature extraction modules.
As can be seen from the analysis of formula (27) to formula (29), the first feature image of the second size is delivered both into the top-down path and into the neck feature fusion network; the branch transferred into the neck feature fusion network flows through two lateral channels across the connection edge (the third transmission channel and the fourth transmission channel). In the formation of the second image, the first feature image of the second size is first fused, through two NFPN modules in turn, with the multi-scale features transferred on the bottom-up path (corresponding to the first feature flow path) and on the top-down path (corresponding to the second feature flow path) in the neck feature fusion network. The resulting features continue to propagate on the bottom-up path (the first feature flow path) and the top-down path (the second feature flow path) respectively, and the final fused feature is processed by an ECSPDarkNet2-X module to obtain the medium target detection feature map, thereby achieving the technical effect of improving the efficiency of multi-scale feature fusion.
Optionally, after the pyramid pooling operation of the first feature image of the third size, further comprising: and transmitting a result obtained after pyramid pooling operation of the first characteristic image with the third size to the first layer of characteristic fusion layer through the first characteristic flow path.
In this embodiment, the third image is formed as follows: the first feature image of the third size is first processed by the SPPF spatial pyramid pooling module, and the result then transfers features in turn through the bottom-up path (corresponding to the first feature flow path) and the cross connection edge channels (the fifth transmission channel and the sixth transmission channel) in the neck feature fusion network. The feature transferred into the neck feature fusion network through the lateral connection channel is fused by an NFPN module with the multi-scale feature transferred on the top-down path (corresponding to the second feature flow path), and the fused feature is finally processed by an ECSPDarkNet2-X module (the third extraction module) to obtain the large target detection feature map (corresponding to the third image). That is, the result of the SPPF module is transmitted to the first layer feature fusion layer through the first feature flow path; besides the lateral connection channel, the SPPF output can also be transmitted along the bottom-up path, thereby achieving the technical effect of improving the feature circulation speed in the neck feature fusion network.
As shown in fig. 5, the bottom-up path in the neck feature fusion network may be SPPF -> 1×1 CBS -> UpSampling -> NFPN -> ECSPDarkNet2-4 -> 1×1 CBS -> UpSampling -> NFPN; the top-down path in the neck feature fusion network may be ECSPDarkNet2-4 -> 3×3 CBS -> NFPN -> ECSPDarkNet2-4.
Optionally, the object detection model is determined by: the method comprises the steps of obtaining pre-training weights and training samples, and dividing the training samples into a training set, a verification set and a test set, wherein the training samples comprise: a plurality of images, each of which is marked with a target marking frame; training an initial detection model based on pre-training weight and a training set, and verifying whether the initial detection model is converged by a verification set in a cross verification mode in the training process, wherein the initial detection model is an untrained model; and under the condition that the initial detection model is converged, determining a target detection model, and testing the detection precision of the initial detection model based on a weight file and a test set, wherein the weight file is used for storing a plurality of weights of the target detection model.
The model training process in this embodiment is described below in conjunction with an alternative approach:
(1) The target detection algorithm (corresponding to the target detection model described above) is built up in the environment.
In this embodiment, PyCharm and Anaconda may be used as the script editing tools, Python 3.7 may be used as the script design language, the GPU parallel computing device may be an RTX 2070 Super, the GPU accelerator may be CUDA 10.1, and a Logitech Brio 500 may be used as the sample collector.
(2) An image sample dataset is created.
In this embodiment, the data set may be created in VOC format for the target detection algorithm. First, a VOC folder can be created as the data set storage source; then the LabelImg tool is adopted to label the real frame (corresponding to the labeling frame described above) of each target in each image of the data set, generating a corresponding XML file that stores attributes such as the source file path, the center point, and the width and height of each target; finally, an Annotation folder and an Image folder are created under the VOC folder to store the XML files and the labeled images respectively.
After the data set (corresponding to the training samples described above) is created, it can be divided into a training set, a validation set and a test set. The XML files of the training, validation and test sets are stored under the Annotation folder by creating Train-Annotation, Val-Annotation and Test-Annotation folders respectively, and the image samples of the training, validation and test sets are stored under the Image folder by creating Train-Image, Val-Image and Test-Image folders respectively. Generally, the random division ratio of the training set, validation set and test set may be a preset ratio, for example 0.6:0.2:0.2.
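A minimal Python sketch of the random 0.6:0.2:0.2 split described above; the folder names follow the embodiment, while the function name and seed handling are illustrative assumptions:

```python
import random
from pathlib import Path

def split_voc_dataset(voc_root, ratios=(0.6, 0.2, 0.2), seed=0):
    """Randomly split VOC-style samples into train / val / test ID lists."""
    xml_files = sorted(Path(voc_root, "Annotation").glob("*.xml"))
    ids = [f.stem for f in xml_files]
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * ratios[0])
    n_val = int(len(ids) * ratios[1])
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# train_ids, val_ids, test_ids = split_voc_dataset("VOC")
```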
(3) Model training.
The model training flow is shown in fig. 6; the model is trained after the environment is built and the data set is created. First, the pre-training weight PT file is loaded, which shortens the training time and improves the model detection precision; the weight parameters finally trained by the model are also stored in a PT file. Then, the model configuration file is modified, the number of target categories is determined, and the training size of the input image is specified. Finally, the hyper-parameters required for model training are adjusted to determine the model training strategy and optimization strategy, and the validation set is used for cross-validation during training.
(4) Model testing and evaluation.
The model test flow is shown in fig. 7. When model training is finished, a final weight PT file is generated. First, the weight PT file is loaded and the model test configuration file is modified. Then, the batch size is set and the test image input size is adjusted. Finally, a confidence threshold of 0.5 and an IOU threshold of 0.5 can be set for screening the prediction frames, and the prediction frame with the highest confidence is output as the final detection result.
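A minimal PyTorch sketch of the confidence filtering and IOU-based screening step described above, assuming boxes in (x1, y1, x2, y2) format and class-agnostic NMS from torchvision:

```python
import torch
from torchvision.ops import nms

def filter_predictions(boxes, scores, conf_thres=0.5, iou_thres=0.5):
    """boxes: (N, 4) tensor in xyxy format, scores: (N,) confidence per prediction frame."""
    keep = scores >= conf_thres            # confidence threshold
    boxes, scores = boxes[keep], scores[keep]
    idx = nms(boxes, scores, iou_thres)    # non-maximum suppression with the IOU threshold
    return boxes[idx], scores[idx]
```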
In order to verify the effectiveness of the model, this embodiment evaluates the model on the test set, analyzing it with the precision P, recall R, average precision AP and mean average precision mAP evaluation indexes, and finally adopts visual analysis and ablation experiments to verify the effectiveness of the model.
Equation (32) defines the precision P. Where TP indicates that the sample is actually a positive example and the detection result is also positive; FP indicates that the sample is actually a negative example but the detection result is positive; TP+FP represents all positive examples detected by the model. Thus, the precision characterizes the ability of the model to correctly judge true positives.

P = TP / (TP + FP)   (32)
Equation (33) defines the recall R. Where FN indicates that the sample is actually a positive example but the detection result is negative; TP+FN represents all positive examples in the test set. Thus, the recall characterizes the ability of the model to detect actual positive examples as positive.

R = TP / (TP + FN)   (33)
Equation (34) defines the average precision AP. It characterizes the mean value of the precision as the recall varies from 0 to 1, i.e. the detection accuracy of the model for a single category.

AP = ∫₀¹ P(R) dR   (34)
Equation (35) defines the mean average precision mAP, where N is the total number of categories. mAP characterizes the detection ability of the model over all categories.

mAP = (1/N) Σ_{i=1}^{N} AP_i   (35)
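A minimal NumPy sketch of the evaluation indexes in equations (32) to (35); the helper names and the numerical integration of the PR curve are illustrative assumptions:

```python
import numpy as np

def precision_recall(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp > 0 else 0.0
    r = tp / (tp + fn) if tp + fn > 0 else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (AP) by numerical integration."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))

def mean_average_precision(ap_per_class):
    return float(np.mean(ap_per_class))
```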
As shown in fig. 8, three loss function curves are obtained for the model of this embodiment using the cross-validation training strategy, with 200 training epochs. Analysis shows that the model of this embodiment trains stably on the PASCAL VOC 2012 data set, and the losses finally converge to stable values.
In addition, PR curves of this embodiment can be plotted on the PASCAL VOC 2012 test set. It can be determined that the optimal mAP value reaches 0.841 when the IOU threshold is 0.5, that is, the target detection model of this embodiment has high detection accuracy on the whole PASCAL VOC 2012 (a target detection data set) data set.
Optionally, the target detection method for enhancing multi-scale feature extraction, multiplexing and fusion further comprises: in the training process of the initial detection model, a fusion feature map output by a feature fusion network is received through a target prediction network; dividing the fusion feature map into a plurality of grids, wherein the fusion feature map is marked with target marking frames, and each grid comprises a plurality of anchor frames; searching a target grid where a label frame in the fusion feature map is located in the multiple grids; determining a prediction frame based on the intersection ratio of each anchor frame in the target grid and the labeling frame of the target; and determining the detection result of the initial detection model through a prediction frame.
In this embodiment, three feature maps of sizes 52×52×75, 26×26×75 and 13×13×75 can be generated in the target prediction network. Each feature map is divided into K×K grids, and each grid generates B anchor frames. When the center point of the real frame of a certain target falls in a certain grid point, the anchor frame that has the largest IOU with the real frame (corresponding to the labeling frame) among the B anchor frames generated by that grid is responsible for detecting the target. The network obtains a prediction frame by updating the anchor frame position parameters, and finally the prediction frame closest to the real frame is screened out by the confidence score and the NMS algorithm as the detection result.
Optionally, the step of determining the prediction frame based on the intersection ratio of each anchor frame in the target grid and the labeling frame of the target includes: calculating the intersection ratio of each anchor frame in the target grid and the labeling frame of the target to obtain a plurality of ratios; taking an anchor frame associated with the largest ratio among the plurality of ratios as a target anchor frame, wherein the target anchor frame is used for detecting a target on the fusion feature map; and updating the position parameters of the target anchor frame to obtain a prediction frame.
When training the initial model, this embodiment can likewise generate the three feature maps of 52×52×75, 26×26×75 and 13×13×75 in the target prediction network. Each feature map is divided into K×K grids, and each grid generates B anchor frames. When the center point of the real frame of a certain target falls in a certain grid point, the anchor frame with the largest IOU (intersection over union) with the real frame (corresponding to the labeling frame) among the B anchor frames generated by that grid is responsible for detecting the target; the network obtains a prediction frame by updating the anchor frame position parameters, and finally the prediction frame closest to the real frame is screened out by the confidence score and the NMS algorithm as the detection result. In addition, in this embodiment, the prediction frame having the largest IOU with the real frame among all prediction frames may be taken as a positive sample, while the image background and the anchor frames whose IOU with the real frame is less than 0.5 may be taken as negative samples.
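A minimal NumPy sketch of the IOU computation and best-anchor selection described above, assuming boxes in (x1, y1, x2, y2) format; the function names are illustrative assumptions:

```python
import numpy as np

def iou(box_a, box_b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def best_anchor(anchors, gt_box):
    """Index and value of the anchor frame with the largest IOU against the real frame."""
    ious = np.array([iou(a, gt_box) for a in anchors])
    return int(ious.argmax()), float(ious.max())
```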
Where 75 is the product of 3 and 25: 3 is the number of anchor frames generated for each grid, and 25 is the sum of 20, 4 and 1, where 20 is the number of PASCAL VOC data set categories, 4 corresponds to the four position parameters t_x, t_y, t_w and t_h of the prediction frame, and 1 is the confidence score C, representing the confidence that the target within the current prediction frame belongs to a certain class, as shown in equation (36). Where C_i^j denotes the confidence of the jth prediction frame in the ith grid point, P(object) is 1 when the grid point contains a target and 0 otherwise, P(class_i | Object) indicates the probability that the object within the grid point belongs to a certain class, and IOU_pred^truth is the overlap ratio between the real frame and the prediction frame.

C_i^j = P(object) · P(class_i | Object) · IOU_pred^truth   (36)
As shown in fig. 9, the adjustment process from anchor frame to prediction frame is embodied in formulas (37) to (40). In fig. 9, each grid cell is 1 in width and height, the dashed frame is the anchor frame, and the solid frame is the prediction frame. The width and height of the anchor frame are p_w and p_h; the center point, width and height of the normalized prediction frame are (b_x, b_y), w and h; σ(x) = 1/(1+e^{-x}) is the sigmoid logistic constraint function used to constrain t_x and t_y between 0 and 1, so σ(t_x) and σ(t_y) are the offsets of the prediction frame center point relative to the upper-left corner of the grid cell; (C_x, C_y) is (1, 1); t_x, t_y, t_w and t_h are the parameters that need to be continuously learned and updated in the target prediction network, namely the center point, width and height of the prediction frame relative to the feature map. In addition, the coefficients 2 and 0.5 mainly serve to eliminate grid sensitivity and fix the relative offset within the range (-0.5, 1.5).
b_x = (2σ(t_x) − 0.5) + c_x   (37)

b_y = (2σ(t_y) − 0.5) + c_y   (38)

w = p_w · (2σ(t_w))²   (39)

h = p_h · (2σ(t_h))²   (40)
By adjusting the position parameters of the anchor frame, the prediction frame closest to the real frame is determined, and the detection result is then determined from it, thereby achieving the technical effect of improving the accuracy with which the target detection model identifies images.
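A minimal Python sketch of the decoding in formulas (37) to (40), mapping the learned offsets of an anchor frame to a prediction frame; the function signature is an illustrative assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_prediction(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Formulas (37)-(40): anchor frame -> prediction frame."""
    b_x = (2 * sigmoid(t_x) - 0.5) + c_x
    b_y = (2 * sigmoid(t_y) - 0.5) + c_y
    w = p_w * (2 * sigmoid(t_w)) ** 2
    h = p_h * (2 * sigmoid(t_h)) ** 2
    return b_x, b_y, w, h
```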
Optionally, the loss function used in training the initial detection model includes at least: classification loss function, confidence loss function, and location loss function.
In this embodiment, the loss function J_loss of the corresponding target detection framework (or target detection model) is composed of the classification loss function L_class, the target confidence loss function L_object (corresponding to the confidence loss function described above) and the positioning loss function L_location, as shown in equation (41).
As shown in equation (42), the classification error may adopt a binary cross entropy function as the classification loss function; that is, when the jth anchor frame of the ith grid is responsible for detecting a certain target, the prediction frame generated by this anchor frame is responsible for the classification loss calculation. Where the feature map is divided into K×K grids and the number of target categories is c; I_ij^obj equals 1 when the IOU between the jth anchor frame generated by the ith grid and the real frame is the largest, so that this anchor frame is responsible for detecting the target, and equals 0 otherwise; p̂_i^j(c) represents the predicted probability that the object in the jth anchor frame corresponding to the ith grid belongs to class c, and p_i^j(c) indicates whether the object in the jth anchor frame corresponding to the ith grid is a class-c target, being 1 if so and 0 otherwise.

J_loss = L_class + L_object + L_location   (41)

L_class = − Σ_{i=0}^{K×K} Σ_{j=0}^{B} I_ij^obj Σ_{c∈classes} [ p_i^j(c) log(p̂_i^j(c)) + (1 − p_i^j(c)) log(1 − p̂_i^j(c)) ]   (42)
As shown in equation (43), the target confidence loss function may be composed of two groups of binary cross entropy functions: the first group is the positive-sample confidence loss and the second group is the negative-sample confidence loss. The closer the confidence of a positive-sample prediction frame is to 1, the closer the positive-sample confidence loss value is to 0, i.e. the greater the confidence that a certain class of target is inside the prediction frame. Where λ_noobj is used to balance the distribution of positive and negative samples; I_ij^noobj indicates whether the target of the jth anchor frame corresponding to the ith grid is a negative sample, being 1 if so and 0 otherwise; I_ij^obj indicates whether the jth anchor frame corresponding to the ith grid is responsible for target detection, being 1 if so and 0 otherwise; and Ĉ_i^j represents the confidence predicted after the jth anchor frame corresponding to the ith grid finally forms the prediction frame.

L_object = − Σ_{i=0}^{K×K} Σ_{j=0}^{B} I_ij^obj [ C_i^j log(Ĉ_i^j) + (1 − C_i^j) log(1 − Ĉ_i^j) ] − λ_noobj Σ_{i=0}^{K×K} Σ_{j=0}^{B} I_ij^noobj [ C_i^j log(Ĉ_i^j) + (1 − C_i^j) log(1 − Ĉ_i^j) ]   (43)
As shown in equation (44), the positioning loss function is composed of the overlap loss function L_IOU, the center distance loss function L_distance and the width-height loss function L_aspect. The overlap loss function and the center distance loss function are used to eliminate problems such as slow convergence caused by ignoring the overlap area and the center point distance, and drive the loss function to optimize in the direction of increasing the overlap area, as shown in equations (45) and (46) respectively. Where IOU is the intersection over union of the prediction frame and the real frame, ρ(b, b^gt) is the Euclidean distance between the center points of the prediction frame and the real frame, and c is the diagonal length of the smallest frame enclosing both the prediction frame and the real frame.

L_location = L_IOU + L_distance + L_aspect   (44)

L_IOU = 1 − IOU   (45)

L_distance = ρ²(b, b^gt) / c²   (46)
FIG. 10 is a schematic diagram of an alternative prediction frame regression process according to an embodiment of the present invention. In FIG. 10, Pred is the prediction frame, Truth is the real frame, the dashed frame is the smallest enclosing frame, (b_x, b_y) is the center point of the prediction frame, (b_x^gt, b_y^gt) is the center point of the real frame, the shaded part is the overlap area of the prediction frame and the real frame, w and h are the width and height of the prediction frame, w^gt and h^gt are the width and height of the real frame, and w^c and h^c are the width and height of the smallest enclosing frame.
To prevent the width-height ratio between the prediction frame and the real frame from exhibiting nonlinearity, i.e. to allow the width and height of the prediction frame to increase or decrease simultaneously, this embodiment adopts the width-height loss function L_aspect to minimize the difference between the width and height of the prediction frame and of the real frame, thereby accelerating model convergence and achieving more accurate positioning, as shown in equation (47). Where ρ(w, w^gt) is the width difference between the prediction frame and the real frame, and ρ(h, h^gt) is the height difference between them.

L_aspect = ρ²(w, w^gt) / (w^c)² + ρ²(h, h^gt) / (h^c)²   (47)
In addition, since samples with poor image quality outnumber high-quality samples during training, and low-quality samples generate larger errors that affect model optimization, there is a training-sample imbalance problem. This embodiment therefore re-weights the positioning loss function, as shown in equation (48), where γ takes 0.5; the final positioning loss function is shown in equation (49).

L_location ← IOU^γ · L_location   (48)

L_location = IOU^γ · (L_IOU + L_distance + L_aspect)   (49)
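A minimal Python sketch of the positioning loss in formulas (44) to (49) for a single prediction frame and real frame, assuming boxes in (x1, y1, x2, y2) format; this is an assumption-level illustration of the composition described above, not the embodiment's exact implementation:

```python
def location_loss(pred, gt, gamma=0.5):
    """pred, gt: boxes as (x1, y1, x2, y2). Returns IOU^gamma * (L_IOU + L_distance + L_aspect)."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    iou = inter / (area_p + area_g - inter + 1e-9)

    # smallest enclosing frame of the prediction frame and the real frame
    cx1, cy1 = min(pred[0], gt[0]), min(pred[1], gt[1])
    cx2, cy2 = max(pred[2], gt[2]), max(pred[3], gt[3])
    wc, hc = cx2 - cx1, cy2 - cy1
    c2 = wc ** 2 + hc ** 2 + 1e-9                      # squared diagonal of the enclosing frame

    # center distance term (formula (46))
    pcx, pcy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    gcx, gcy = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    l_distance = ((pcx - gcx) ** 2 + (pcy - gcy) ** 2) / c2

    # width-height term (formula (47))
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    l_aspect = (wp - wg) ** 2 / (wc ** 2 + 1e-9) + (hp - hg) ** 2 / (hc ** 2 + 1e-9)

    return (iou ** gamma) * ((1 - iou) + l_distance + l_aspect)
```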
Through the loss function in the embodiment, the technical effect of improving the detection accuracy of the target detection model is achieved.
As shown in Table 1, the target detection model of this embodiment is compared with currently mainstream target detection algorithms in terms of test time, frame rate FPS and mAP@0.5 (i.e., the mean of the AP values over all categories when the IOU threshold is set to 0.5) on the PASCAL VOC 2012 (target detection data set) test set. Analysis shows that the present model reaches the minimum single-image test time and a frame rate FPS of 98, which indicates that this embodiment meets real-time requirements. In addition, mAP@0.5 also reaches the optimal value, which demonstrates the effectiveness of the target detection model of this embodiment. YOLOV4, YOLOV3, SSD, Fast R-CNN, TridentNet, RetinaNet, CornerNet, ExtremeNet and CenterNet in Table 1 represent different types of deep-learning-based detection models in the related art.
TABLE 1
(Table 1 lists the single-image test time, frame rate FPS and mAP@0.5 of the present model and of YOLOV4, YOLOV3, SSD, Fast R-CNN, TridentNet, RetinaNet, CornerNet, ExtremeNet and CenterNet on the PASCAL VOC 2012 test set.)
As shown in Table 2, this example also employs ablation experiments to verify the effectiveness of each proposed strategy. Analysis shows that the average precision AP, the average recall AR and mAP@0.5 reach their lowest values in the basic model without introducing the three strategies, while the experiments of group 2, group 4, group 7 and group 8 demonstrate the feasibility and effectiveness of the three strategies of the invention.
TABLE 2
ECSPDarkNet1-X ECSPDarkNet2-X NFPN AP AR mAP@0.5
× × × 0.843 0.842 0.799
× × 0.844 0.840 0.812
× 0.867 0.864 0.819
× × 0.853 0.851 0.808
× 0.872 0.870 0.829
× × 0.861 0.854 0.822
× 0.878 0.869 0.825
0.889 0.848 0.841
Example two
In a second embodiment of the present application, an optional target detection device for enhancing multi-scale feature extraction, multiplexing and fusion is provided, where each implementation unit in the identification device corresponds to each implementation step in the first embodiment.
FIG. 11 is a schematic diagram of an optionally enhanced multi-scale feature extraction, multiplexing and fusion object detection device according to an embodiment of the invention, as shown in FIG. 11, the recognition device includes: a first processing unit 111, a second processing unit 112, and a third processing unit 113.
Specifically, the first processing unit 111 is configured to input the target image into the target feature extraction network of the target detection model in response to a request for identifying the target image, and output N first feature images, where the target detection model at least includes: the target detection model comprises a plurality of transmission channels, a target feature extraction network, a feature fusion network and a target prediction network, wherein the transmission channels are used for transmitting image features of a target image, and N is an integer greater than 1;
The second processing unit 112 is configured to input the N first feature images into a feature fusion network, and output M second feature images, where the feature fusion network is configured to perform multi-scale feature fusion on the N first feature images, and M is an integer greater than 1;
the third processing unit 113 is configured to input the M second feature images into the target prediction network, and output a recognition result of the target image, where the recognition result includes a classification result of the target in the target image.
In the target detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application, in response to an identification request for a target image, the target image may be input into the target feature extraction network of the target detection model by the first processing unit 111, which outputs N first feature images, where the target detection model at least includes: a plurality of transmission channels, the target feature extraction network, a feature fusion network and a target prediction network, the transmission channels being used to transmit image features of the target image, and N being an integer greater than 1. The N first feature images are input into the feature fusion network through the second processing unit 112, which outputs M second feature images, where the feature fusion network is used to perform multi-scale feature fusion on the N first feature images and M is an integer greater than 1. The M second feature images are input into the target prediction network through the third processing unit 113, which outputs the recognition result of the target image, where the recognition result includes the classification result of the target in the target image. This further solves the technical problems of low target recognition efficiency in target detection caused by the complex model structure, the high calculation cost and the lack of attention to salient features in existing target detection algorithms. In this embodiment, the target in the target image is classified and identified by a target detection model containing multiple transmission channels, which avoids the low recognition efficiency caused in the related art by the complex structure and slow feature circulation of detection models, thereby achieving the technical effect of improving the efficiency with which the target detection model classifies and identifies targets in images.
Optionally, in the object detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application, the object feature extraction network at least includes: the device comprises a plurality of first feature extraction modules and a plurality of dimension adjustment modules, wherein each dimension adjustment module at least comprises: a convolution layer, a batch normalization layer and an activation function layer, wherein the first processing unit comprises: the first processing subunit is used for carrying out dimension adjustment on the target image through each dimension adjustment module, and carrying out feature extraction on the target image subjected to dimension adjustment through each first feature extraction module to obtain M second feature images.
Optionally, in the object detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application, each first feature extraction module at least includes: the device comprises a plurality of dimension adjustment sub-modules and a residual sub-module, wherein the residual sub-module comprises a plurality of residual units, each dimension adjustment sub-module has the same structure with the dimension adjustment module, and the processing sub-unit comprises: the first processing module is used for receiving a third characteristic image through the first characteristic extraction module, inputting the third characteristic image into the first dimension adjustment sub-module through the first transmission channel, inputting a fourth characteristic image output by the first dimension adjustment sub-module into the residual sub-module and outputting a first characteristic layer, wherein the first characteristic extraction module is one of a plurality of first characteristic extraction modules, the third characteristic image is a target image processed by one or more of the plurality of first characteristic extraction modules and the plurality of dimension adjustment modules, and the first dimension adjustment sub-module is one of the plurality of dimension adjustment sub-modules; the second processing module is used for inputting the first characteristic layer into the stacking unit after convolution processing in the first transmission channel; the third processing module is used for inputting the third characteristic image into the stacking unit after convolution processing in the second transmission channel, and carrying out stacking processing on the characteristic data received by the stacking unit through the stacking unit to obtain a second characteristic layer; the fourth processing module is configured to perform target processing on the second feature layer to obtain a target feature image, where the target processing at least includes one of the following: normalizing, activating a function, adjusting a characteristic dimension, wherein the target characteristic image is one of M second characteristic images; and the determining module is used for determining M second characteristic images based on the target characteristic images.
Optionally, in the object detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application, each residual unit at least includes: a plurality of dimension adjustment subunits, a weighting subunit, and a first processing module comprising: the adjusting sub-module is used for carrying out channel adjustment on the fourth characteristic image based on the first transmission sub-channel and the plurality of dimension adjusting sub-units to obtain a third characteristic layer, and inputting the third characteristic layer into the weighting sub-unit; the first input module is used for inputting the fourth characteristic image into the weighting subunit through the second transmission subchannel; and the weighting module is used for carrying out weighting processing on the third characteristic layer and the fourth characteristic image through the weighting subunit and outputting the first characteristic layer.
Optionally, in the target detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application, the feature fusion network at least comprises: a plurality of second feature extraction modules, a plurality of feature fusion modules and a plurality of feature fusion layers, wherein the plurality of residual units in the first feature extraction module are replaced with a plurality of dimension adjustment units to serve as a second feature extraction module, the N first feature images differ in image size, and the second processing unit comprises: a second processing subunit, configured to, at the first feature fusion layer, input the first feature image of a first size into the first feature fusion module, input the result output by the first feature fusion module into the first extraction module, and output a first image, wherein the first feature fusion module is configured to fuse the image features transmitted by the first feature flow path with the first feature image of the first size, the transmission direction of the image features on the first feature flow path is from the last feature fusion layer to the first feature fusion layer, the first extraction module is one of the plurality of second feature extraction modules, the first extraction module is configured to transmit its output result to the last feature fusion layer through the second feature flow path, and the transmission direction of the image features on the second feature flow path is from the first feature fusion layer to the last feature fusion layer; a third processing subunit, configured to, at the second feature fusion layer, perform feature fusion on the first feature image of a second size and the image features transmitted by the first feature flow path based on the third transmission channel and the fourth transmission channel, fuse the result of the feature fusion with the image features transmitted by the second feature flow path, and determine a second image; a fourth processing subunit, configured to, at the last feature fusion layer, perform feature fusion on the pyramid-pooled first feature image of a third size and the features transmitted by the second feature flow path based on the fifth transmission channel and the sixth transmission channel, and determine a third image; and a determining subunit, configured to determine the M second feature images based on the first image, the second image and the third image.
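The three fusion layers and the two feature flow paths described above resemble a top-down/bottom-up neck. A deliberately simplified sketch is given below; it replaces the feature fusion modules and second feature extraction modules with plain convolutions, uses nearest-neighbour interpolation for the first feature flow path and strided convolutions for the second feature flow path, and omits the pyramid pooling step, all of which are assumptions of the sketch rather than details of the embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionNeck(nn.Module):
    """Fusion sketch: the first feature flow path carries features from the last
    (deepest) fusion layer towards the first one, the second feature flow path
    carries them back; each fusion layer concatenates the incoming path with the
    first feature image of its own scale."""
    def __init__(self, c3, c4, c5):
        super().__init__()
        self.reduce5 = nn.Conv2d(c5, c4, 1)                      # channel adjustment before the first flow path
        self.reduce4 = nn.Conv2d(c4, c3, 1)
        self.fuse3 = nn.Conv2d(2 * c3, c3, 3, padding=1)         # first fusion layer -> first image
        self.fuse4_td = nn.Conv2d(2 * c4, c4, 3, padding=1)      # second fusion layer, top-down part
        self.down3 = nn.Conv2d(c3, c4, 3, stride=2, padding=1)   # second feature flow path (downsampling)
        self.fuse4_bu = nn.Conv2d(2 * c4, c4, 3, padding=1)      # second fusion layer -> second image
        self.down4 = nn.Conv2d(c4, c5, 3, stride=2, padding=1)
        self.fuse5 = nn.Conv2d(2 * c5, c5, 3, padding=1)         # last fusion layer -> third image

    def forward(self, p3, p4, p5):
        # p3: first-size (shallowest) first feature image, p4: second size, p5: third size (deepest)
        td4 = self.fuse4_td(torch.cat(
            [p4, F.interpolate(self.reduce5(p5), size=p4.shape[-2:])], dim=1))
        first_image = self.fuse3(torch.cat(
            [p3, F.interpolate(self.reduce4(td4), size=p3.shape[-2:])], dim=1))
        second_image = self.fuse4_bu(torch.cat([td4, self.down3(first_image)], dim=1))
        third_image = self.fuse5(torch.cat([p5, self.down4(second_image)], dim=1))
        return first_image, second_image, third_image
```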
Optionally, in the target detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application, the third processing subunit comprises: a first fusion module, configured to, at the second feature fusion layer, transmit the first feature image of the second size to the second feature fusion module through the third transmission channel, and perform multi-scale fusion with the image features transmitted by the first feature flow path to obtain a first fusion feature; an input module, configured to input the first fusion feature into the second extraction module to obtain a second fusion feature, wherein the second extraction module is one of the plurality of second feature extraction modules; a second fusion module, configured to transmit the first feature image of the second size to the third feature fusion module through the fourth transmission channel, perform multi-scale feature fusion on the features transmitted by the second feature flow path and the second fusion feature, and determine a third fusion feature; and an input/output module, configured to input the third fusion feature into the third extraction module and output the second image, wherein the third extraction module is one of the plurality of second feature extraction modules.
Optionally, in the target detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application, the second processing unit further comprises: a transmission subunit, configured to, after the pyramid pooling operation is performed on the first feature image of the third size, transmit the pooled result to the first feature fusion layer through the first feature flow path.
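The pyramid pooling operation is commonly realised as spatial pyramid pooling over the deepest feature image; a minimal sketch under that assumption (the 5/9/13 pooling kernel sizes are illustrative and not taken from the embodiment) is:

```python
import torch
import torch.nn as nn

class PyramidPooling(nn.Module):
    """Spatial pyramid pooling sketch: max-pool the input at several kernel sizes
    with stride 1, concatenate the results with the input, and restore the channel
    count with a 1x1 convolution."""
    def __init__(self, channels, pool_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in pool_sizes])
        self.project = nn.Conv2d(channels * (len(pool_sizes) + 1), channels, 1)

    def forward(self, x):
        pooled = [x] + [pool(x) for pool in self.pools]
        return self.project(torch.cat(pooled, dim=1))
```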
Optionally, in the target detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application, the target detection model is determined by the following units: a fourth processing unit, configured to obtain pre-training weights and training samples, and divide the training samples into a training set, a verification set and a test set, wherein the training samples comprise a plurality of images and each image is marked with a labeling frame of the target; a training unit, configured to train an initial detection model based on the pre-training weights and the training set, and verify, in a cross-verification manner, whether the initial detection model has converged through the verification set during training, wherein the initial detection model is an untrained model; and a test unit, configured to determine the target detection model when the initial detection model has converged, and test the detection precision of the initial detection model based on a weight file and the test set, wherein the weight file is used for storing a plurality of weights of the target detection model.
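Assuming a PyTorch model, data loaders for the training and verification sets, and a file holding the pre-training weights, the training procedure described above might be sketched as follows; the file names, the patience-based convergence check (a simple stand-in for the cross-verification mentioned in the embodiment) and the optimizer choice are all assumptions.

```python
import torch

def train_detection_model(model, train_loader, val_loader, loss_fn,
                          pretrained_path="pretrained.pt", epochs=100, patience=10):
    """Training sketch: load pre-training weights, train on the training set, and use
    the verification set to decide whether the model has converged; the converged
    weights are written to a weight file used later for testing."""
    model.load_state_dict(torch.load(pretrained_path), strict=False)        # pre-training weights
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)  # assumed optimizer
    best_val, stale = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for images, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)   # classification + confidence + location losses
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            val_loss = sum(loss_fn(model(imgs), tgts).item() for imgs, tgts in val_loader)
        if val_loss < best_val:
            best_val, stale = val_loss, 0
            torch.save(model.state_dict(), "best_weights.pt")   # weight file storing the model weights
        else:
            stale += 1
            if stale >= patience:                               # treated here as convergence
                break
    return "best_weights.pt"
```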
Optionally, the target detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application further comprises: a receiving subunit, configured to receive, through the target prediction network, the fusion feature map output by the feature fusion network in the process of training the initial detection model; a dividing subunit, configured to divide the fusion feature map into a plurality of grids, wherein the labeling frame of the target is marked on the fusion feature map and each grid comprises a plurality of anchor frames; a searching subunit, configured to search, among the plurality of grids, for the target grid in which the labeling frame in the fusion feature map is located; a first determining subunit, configured to determine a prediction frame based on the intersection-over-union ratio of each anchor frame in the target grid and the labeling frame of the target; and a second determining subunit, configured to determine the detection result of the initial detection model through the prediction frame.
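As a small illustration of the grid-division step, assuming the labeling frame is given as a normalized (center-x, center-y, width, height) tuple and the fusion feature map is divided into grid_size x grid_size grids, the target grid could be located as follows:

```python
def find_target_grid(box_cxcywh, grid_size):
    """Return the (row, col) of the grid cell that contains the center of the
    labeling frame; coordinates are assumed to be normalized to [0, 1]."""
    cx, cy, _, _ = box_cxcywh
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    return row, col

# Example: a labeling frame centered at (0.62, 0.31) on a 20 x 20 fusion feature map
# falls into grid cell (6, 12).
print(find_target_grid((0.62, 0.31, 0.1, 0.2), 20))
```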
Optionally, in the target detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application, the first determining subunit comprises: a calculation module, configured to calculate the intersection-over-union ratio of each anchor frame in the target grid and the labeling frame of the target to obtain a plurality of ratios; an anchor frame processing module, configured to take the anchor frame associated with the largest of the plurality of ratios as a target anchor frame, wherein the target anchor frame is used for detecting the target on the fusion feature map; and an updating module, configured to update the position parameters of the target anchor frame to obtain the prediction frame.
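A hedged sketch of the anchor selection and position update follows. Matching anchors to the labeling frame by width and height only (treating both boxes as concentric) and decoding the position parameters YOLO-style are assumptions of this sketch; the embodiment only states that the anchor with the largest intersection-over-union ratio is selected and that its position parameters are updated.

```python
import math

def wh_iou(anchor_wh, box_wh):
    """Intersection-over-union ratio of two boxes that share the same center,
    computed from their widths and heights only."""
    inter = min(anchor_wh[0], box_wh[0]) * min(anchor_wh[1], box_wh[1])
    union = anchor_wh[0] * anchor_wh[1] + box_wh[0] * box_wh[1] - inter
    return inter / union

def select_target_anchor(anchors_wh, box_wh):
    """Compute a ratio for every anchor frame in the target grid and return the
    index of the target anchor, i.e. the anchor with the largest ratio."""
    ratios = [wh_iou(a, box_wh) for a in anchors_wh]
    return max(range(len(ratios)), key=ratios.__getitem__)

def decode_prediction(anchor_wh, grid_rc, offsets, grid_size):
    """Update the position parameters of the target anchor with predicted offsets
    (tx, ty, tw, th) to obtain the prediction frame, normalized to [0, 1]."""
    tx, ty, tw, th = offsets
    row, col = grid_rc
    cx = (col + 1.0 / (1.0 + math.exp(-tx))) / grid_size  # sigmoid keeps the center inside its grid cell
    cy = (row + 1.0 / (1.0 + math.exp(-ty))) / grid_size
    w = anchor_wh[0] * math.exp(tw)
    h = anchor_wh[1] * math.exp(th)
    return cx, cy, w, h
```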
Optionally, in the target detection device for enhanced multi-scale feature extraction, multiplexing and fusion provided in the second embodiment of the present application, the loss functions used when training the initial detection model at least include: a classification loss function, a confidence loss function and a location loss function.
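The embodiment only names the three loss components; one common way of combining them, given here as an assumption rather than the patented formulation, uses binary cross-entropy for the classification and confidence losses and an IoU-based term for the location loss:

```python
import torch.nn.functional as F

def detection_loss(pred_cls, pred_conf, pred_iou, target_cls, target_conf,
                   w_cls=0.5, w_conf=1.0, w_loc=0.05):
    """Composite loss sketch: classification loss + confidence loss + location loss.
    pred_iou is assumed to hold the IoU of each prediction frame with its labeling frame."""
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls, target_cls)     # classification loss
    conf_loss = F.binary_cross_entropy_with_logits(pred_conf, target_conf)  # confidence loss
    loc_loss = (1.0 - pred_iou).mean()                                      # location loss (IoU based)
    return w_cls * cls_loss + w_conf * conf_loss + w_loc * loc_loss
```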
The target detection device for enhanced multi-scale feature extraction, multiplexing and fusion may further include a processor and a memory, wherein the first processing unit 111, the second processing unit 112, the third processing unit 113 and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor includes a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided. By adjusting the kernel parameters, the targets in the target image are classified and identified through the target detection model comprising a plurality of transmission channels, which avoids the low target identification efficiency caused by the complex structure and slow feature flow of detection models in the related art, and improves the efficiency with which the target detection model classifies and identifies targets.
The memory may include a volatile memory in a computer-readable medium, such as a random access memory (RAM), and/or a non-volatile memory, such as a read-only memory (ROM) or a flash memory (flash RAM), and the memory includes at least one memory chip.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the target detection method for enhanced multi-scale feature extraction, multiplexing and fusion according to any one of the above via execution of the executable instructions.
According to another aspect of the embodiments of the present invention, there is further provided a computer-readable storage medium storing a computer program, wherein, when the computer program is executed, a device in which the computer-readable storage medium is located is controlled to perform the target detection method for enhanced multi-scale feature extraction, multiplexing and fusion according to any one of the above.
Fig. 12 is a schematic diagram of an electronic device according to an embodiment of the present invention. As shown in Fig. 12, an embodiment of the present invention provides an electronic device 120, which includes a processor, a memory, and a program stored on the memory and runnable on the processor, and the processor implements the target detection method for enhanced multi-scale feature extraction, multiplexing and fusion according to any one of the above when executing the program.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present invention, each embodiment is described with its own emphasis; for portions that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units may be only a logical function division, and there may be other division manners in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media capable of storing program code.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make various improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also fall within the protection scope of the present invention.

Claims (14)

1. A target detection method for enhanced multi-scale feature extraction, multiplexing and fusion, characterized by comprising the following steps:
in response to a request for identifying a target image, inputting the target image into a target feature extraction network of a target detection model, and outputting N first feature images, wherein the target detection model at least comprises: the target detection model comprises a plurality of transmission channels, a target feature extraction network, a feature fusion network and a target prediction network, wherein the transmission channels are used for transmitting image features of the target image, and N is an integer greater than 1;
inputting the N first feature images into the feature fusion network, and outputting M second feature images, wherein the feature fusion network is used for carrying out multi-scale feature fusion on the N first feature images, and M is an integer greater than 1;
and inputting the M second characteristic images into the target prediction network, and outputting the identification result of the target image, wherein the identification result comprises a classification result of the target in the target image.
2. The method according to claim 1, wherein the target feature extraction network at least comprises: a plurality of first feature extraction modules and a plurality of dimension adjustment modules, each dimension adjustment module at least comprising: a convolution layer, a batch normalization layer and an activation function layer, and wherein inputting the target image into the target feature extraction network of the target detection model and outputting the N first feature images comprises:
performing dimension adjustment on the target image through each dimension adjustment module, and performing feature extraction on the dimension-adjusted target image through each first feature extraction module to obtain the N first feature images.
3. The detection method according to claim 2, wherein each first feature extraction module at least comprises: a plurality of dimension adjustment sub-modules and a residual sub-module, the residual sub-module comprising a plurality of residual units, each dimension adjustment sub-module having the same structure as the dimension adjustment module, and wherein performing feature extraction on the dimension-adjusted target image through each first feature extraction module to obtain the N first feature images comprises:
Receiving a third feature image through the first feature extraction module, inputting the third feature image into a first dimension adjustment sub-module through a first transmission channel, inputting a fourth feature image output by the first dimension adjustment sub-module into the residual sub-module, and outputting a first feature layer, wherein the first feature extraction module is one of a plurality of first feature extraction modules, the third feature image is the target image processed by one or more of a plurality of first feature extraction modules and a plurality of dimension adjustment modules, and the first dimension adjustment sub-module is one of a plurality of dimension adjustment sub-modules;
in the first transmission channel, the first characteristic layer is input into a stacking unit after convolution processing;
in a second transmission channel, the third characteristic image is input into the stacking unit after being subjected to convolution processing, and characteristic data received by the stacking unit are subjected to stacking processing through the stacking unit to obtain a second characteristic layer;
performing target processing on the second feature layer to obtain a target feature image, wherein the target processing at least comprises one of the following: normalization, an activation function, and feature dimension adjustment, and the target feature image is one of the N first feature images;
determining the N first feature images based on a plurality of the target feature images.
4. The detection method according to claim 3, wherein each residual unit at least comprises: a plurality of dimension adjustment subunits and a weighting subunit, and wherein inputting the fourth feature image output by the first dimension adjustment sub-module into the residual sub-module and outputting the first feature layer comprises:
based on a first transmission sub-channel and a plurality of dimension adjustment sub-units, channel adjustment is carried out on the fourth characteristic image to obtain a third characteristic layer, and the third characteristic layer is input into the weighting sub-unit;
inputting the fourth feature image into the weighting subunit through a second transmission subchannel;
and carrying out weighting processing on the third characteristic layer and the fourth characteristic image through the weighting subunit, and outputting the first characteristic layer.
5. The method according to claim 2, wherein the feature fusion network at least comprises: a plurality of second feature extraction modules, a plurality of feature fusion modules and a plurality of feature fusion layers, the plurality of residual units in the first feature extraction module being replaced with a plurality of dimension adjustment units to serve as a second feature extraction module, the N first feature images differing in image size, and wherein inputting the N first feature images into the feature fusion network and outputting the M second feature images comprises:
in a first feature fusion layer, inputting a first feature image of a first size into a first feature fusion module, inputting a result output by the first feature fusion module into a first extraction module, and outputting a first image, wherein the first feature fusion module is used for fusing image features transmitted by a first feature circulation path with the first feature image of the first size, the transmission direction of the image features transmitted by the first feature circulation path is from the last feature fusion layer to the first feature fusion layer, the first extraction module is one of the plurality of second feature extraction modules, the first extraction module is used for transmitting an output result of the first extraction module to the last feature fusion layer through a second feature circulation path, and the image feature transmission direction of the second feature circulation path is from the first feature fusion layer to the last feature fusion layer;
in the second feature fusion layer, based on a third transmission channel and a fourth transmission channel, carrying out feature fusion on the first feature image with a second size and the image features transmitted by the first feature circulation path, and carrying out fusion on the result obtained after feature fusion and the image features transmitted by the second feature circulation path to determine a second image;
in the last feature fusion layer, performing, based on a fifth transmission channel and a sixth transmission channel, feature fusion on the pyramid-pooled first feature image of a third size and the features transmitted by the second feature circulation path, and determining a third image;
determining the M second feature images based on the first image, the second image and the third image.
6. The method according to claim 5, wherein, at the second feature fusion layer, performing feature fusion on the first feature image of the second size and the image features transmitted by the first feature circulation path based on the third transmission channel and the fourth transmission channel, and fusing the feature fusion result with the image features transmitted by the second feature circulation path to determine the second image, comprises:
transmitting the first characteristic image with the second size to a second characteristic fusion module through the third transmission channel in the second characteristic fusion layer, and performing multi-scale fusion with the image characteristics transmitted by the first characteristic circulation path to obtain a first fusion characteristic;
inputting the first fusion feature into a second extraction module to obtain a second fusion feature, wherein the second extraction module is one of a plurality of second feature extraction modules;
Transmitting the first characteristic image with the second size to a third characteristic fusion module through a fourth transmission channel, and carrying out multi-scale characteristic fusion on the characteristic transmitted by the second characteristic flow path and the second fusion characteristic to determine a third fusion characteristic;
and inputting the third fusion feature into a third extraction module and outputting the second image, wherein the third extraction module is one of a plurality of second feature extraction modules.
7. The detection method according to claim 5, further comprising, after the pyramid pooling operation is performed on the first feature image of the third size:
transmitting the result obtained after the pyramid pooling operation of the first feature image of the third size to the first feature fusion layer through the first feature circulation path.
8. The detection method according to claim 1, wherein the target detection model is determined by:
obtaining pre-training weights and training samples, and dividing the training samples into a training set, a verification set and a test set, wherein the training samples comprise: a plurality of images, and each image is marked with a labeling frame of the target;
Training an initial detection model based on the pre-training weight and the training set, and verifying whether the initial detection model is converged or not through the verification set in a cross verification mode in the training process, wherein the initial detection model is an untrained model;
and under the condition that the initial detection model is converged, determining the target detection model, and testing the detection precision of the initial detection model based on a weight file and the test set, wherein the weight file is used for storing a plurality of weights of the target detection model.
9. The detection method according to claim 8, further comprising:
in the training process of the initial detection model, a fusion feature map output by the feature fusion network is received through the target prediction network;
dividing the fusion feature map into a plurality of grids, wherein the labeling frames of the targets are marked on the fusion feature map, and each grid comprises a plurality of anchor frames;
searching, among the plurality of grids, for a target grid in which the labeling frame in the fusion feature map is located;
determining a prediction frame based on the intersection-over-union ratio of each anchor frame in the target grid and the labeling frame of the target;
And determining a detection result of the initial detection model through the prediction frame.
10. The detection method according to claim 9, wherein the step of determining the prediction frame based on the intersection-over-union ratio of each anchor frame in the target grid and the labeling frame of the target comprises:
calculating the intersection-over-union ratio of each anchor frame in the target grid and the labeling frame of the target to obtain a plurality of ratios;
taking an anchor frame associated with the largest ratio of the plurality of ratios as a target anchor frame, wherein the target anchor frame is used for detecting the target on the fusion characteristic diagram;
and updating the position parameters of the target anchor frame to obtain the prediction frame.
11. The method of claim 8, wherein the loss functions employed in training the initial detection model at least comprise: a classification loss function, a confidence loss function and a location loss function.
12. A target detection device for enhanced multi-scale feature extraction, multiplexing and fusion, characterized by comprising:
the first processing unit is used for responding to a recognition request of a target image, inputting the target image into a target feature extraction network of a target detection model, and outputting N first feature images, wherein the target detection model at least comprises: the target detection model comprises a plurality of transmission channels, a target feature extraction network, a feature fusion network and a target prediction network, wherein the transmission channels are used for transmitting image features of the target image, and N is an integer greater than 1;
The second processing unit is used for inputting the N first characteristic images into the characteristic fusion network and outputting M second characteristic images, wherein the characteristic fusion network is used for carrying out multi-scale characteristic fusion on the N first characteristic images, and M is an integer larger than 1;
and the third processing unit is used for inputting the M second characteristic images into the target prediction network and outputting the identification result of the target image, wherein the identification result comprises a classification result of the target in the target image.
13. A computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and wherein, when the computer program is executed, a device in which the computer-readable storage medium is located is controlled to perform the target detection method for enhanced multi-scale feature extraction, multiplexing and fusion according to any one of claims 1 to 11.
14. An electronic device comprising one or more processors and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the enhanced multi-scale feature extraction, multiplexing, and fusion target detection method of any of claims 1-11.
CN202310286881.7A 2023-03-22 2023-03-22 Target detection method for enhanced multi-scale feature extraction, multiplexing and fusion Pending CN116246116A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310286881.7A CN116246116A (en) 2023-03-22 2023-03-22 Target detection method for enhanced multi-scale feature extraction, multiplexing and fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310286881.7A CN116246116A (en) 2023-03-22 2023-03-22 Target detection method for enhanced multi-scale feature extraction, multiplexing and fusion

Publications (1)

Publication Number Publication Date
CN116246116A true CN116246116A (en) 2023-06-09

Family

ID=86624197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310286881.7A Pending CN116246116A (en) 2023-03-22 2023-03-22 Target detection method for enhanced multi-scale feature extraction, multiplexing and fusion

Country Status (1)

Country Link
CN (1) CN116246116A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination