CN117132761A - Target detection method and device, storage medium and electronic equipment - Google Patents
Target detection method and device, storage medium and electronic equipment
- Publication number
- CN117132761A (Application number CN202311088379.1A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- level
- feature
- layer
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/52—Scale-space analysis, e.g. wavelet analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/766—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The disclosure provides a target detection method and device, a storage medium and electronic equipment, and relates to the technical field of machine learning. The method comprises the following steps: inputting an image to be detected into a trained target detection network model, the model comprising: convolutional neural network, region proposal network RPN, region of interest ROI layer; the convolutional layers of a plurality of layers of the convolutional neural network are cascaded with receptive field modules RFB; the RPN is configured with a feature pyramid network FPN; acquiring a first feature map of an image to be detected by using a convolutional neural network; acquiring a second feature map of the first feature map by utilizing the RFB, and determining a fusion feature map according to the first feature map and the second feature map; determining corresponding feature vectors of each level in the fusion feature map by using the FPN, and constructing a feature vector set; determining candidate frames corresponding to each feature vector in the feature vector set by utilizing the RPN; and mapping the fusion feature map and the candidate frame through the ROI layer to determine a target detection result.
Description
Technical Field
The disclosure relates to the technical field of machine learning, and in particular relates to a target detection method and device, a storage medium and electronic equipment.
Background
In the current passenger flow statistics field, various target detection technologies continue to emerge. Most of them are based on deep learning algorithms and perform well on images with a single scale and little occlusion.
When passenger flow is counted in a natural scene, however, factors such as the varying distances and angles between target samples and the camera, as well as occlusion, mean that the acquired images suffer from scale variation, severe deformation and occlusion, and conventional target detection algorithms are prone to failure.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides a target detection method and apparatus, a storage medium, and an electronic device, which overcome, at least to some extent, the problem of target detection failure in the related art.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to one aspect of the present disclosure, there is provided a target detection method including:
Inputting an image to be detected into a trained target detection network model; the trained object detection network model comprises: convolutional neural network, region proposal network RPN, region of interest ROI layer; a plurality of layers of convolution layers of the convolution neural network are cascaded with receptive field modules RFB; the RPN is configured with a feature pyramid network FPN;
acquiring a first feature map of the image to be detected by using the convolutional neural network;
acquiring a second feature map of the first feature map by utilizing the RFB, and determining a fusion feature map according to the first feature map and the second feature map, wherein the fusion feature map has a plurality of levels;
determining feature vectors corresponding to each level in the fusion feature map by using the FPN, and constructing a feature vector set;
determining a candidate frame corresponding to each feature vector in the feature vector set by utilizing the RPN;
and mapping the fusion feature map and the candidate frame through the ROI layer to determine a target detection result, wherein the target detection result comprises a detection target and the candidate frame surrounding the detection target.
In some embodiments, further comprising:
performing maximum pooling downsampling on the feature vector with the highest level in the feature vector set, and determining an updated feature vector; the feature vectors in the feature vector set are ranked in a hierarchical order according to resolution, and the resolution of the feature vector with the highest hierarchy is the smallest;
The updated feature vector is configured at a level higher than that of the feature vector with the highest level, so as to update the feature vector set.
In some embodiments, the convolutional neural network comprises first-level to Nth-level convolution layers, whose resolutions decrease level by level;
the receptive field module RFB block comprises: a base receptive field module and an optimized receptive field module;
the convolutional neural network has a receptive field module RFB in cascade with a plurality of hierarchical convolutional layers, comprising:
the second level convolution layer cascades the optimized receptive field modules, and each of the third level convolution layer to the N-1 level convolution layer cascades the basic receptive field modules respectively.
In some embodiments, the optimized receptive field module is structurally configured with more convolution kernels than the basic receptive field module, and these additional convolution kernels are smaller in size than the kernels of the basic receptive field module.
In some embodiments, acquiring a first feature map of the image to be detected using the convolutional neural network includes:
respectively convolving the image to be detected by using a first level convolution layer to an Nth level convolution layer of the convolutional neural network, obtaining a feature map level corresponding to each level convolution layer, and determining a first feature map; the hierarchy of the first feature map corresponds to a hierarchy of convolutional layers of the convolutional neural network.
In some embodiments, obtaining a second feature map of the first feature map using the RFB, determining a fused feature map from the first feature map and the second feature map, comprising:
acquiring second feature graphs corresponding to second to Nth levels in the first feature graphs by utilizing the RFB;
and fusing the first feature map containing the first level to the Nth level and the second feature map containing the second level to the Nth level, and determining a fused feature map, wherein the number of the levels of the fused feature map is equal to that of the second feature map.
In some embodiments, fusing a first feature map having a first level through an nth level and a second feature map having a second level through an nth level, determining a fused feature map includes:
convolving a first-level first feature map obtained by convolving the image to be detected by using a first-level convolution layer through a second-level convolution layer to determine a second-level first feature map;
extracting a first-level second feature map of the second-level first feature map using the optimized receptive field module cascaded with the second-level convolutional layer; fusing the second-level first feature map and the first-level second feature map, and then convolving the fused second-level first feature map and the first-level second feature map by using a third-level convolution layer to determine a third-level first feature map;
Extracting a second-level second feature map of the third-level first feature map by using the basic receptive field module cascaded with the third-level convolution layer; fusing the third-level first feature map and the second-level second feature map, and then convolving the fused third-level first feature map and the second-level second feature map by using a fourth-level convolution layer to determine a fourth-level first feature map;
cyclically executing the extraction process of the fourth-level first feature map for the remaining convolution-layer levels and the basic receptive field modules corresponding to those remaining levels, and determining the fifth-level to Nth-level first feature maps;
and determining the second-level first feature map, the third-level first feature map, the fourth-level first feature map and the fifth-level first feature map to the N-level first feature map as a plurality of levels of fusion feature maps.
In some embodiments, the second-level convolutional layers through the nth level of the convolutional neural network are connected bottom-up;
the feature pyramid network FPN includes: and third feature maps of a plurality of levels connected from top to bottom, wherein the third feature map of each level is laterally connected with a convolution layer of a corresponding level in the second-level to N-level convolution layers of the convolutional neural network.
In some embodiments, determining feature vectors corresponding to each level in the fused feature map using the FPN, constructing a feature vector set includes:
determining a third feature map of a next level according to a corresponding level in a fusion feature map corresponding to a convolution layer which is laterally connected with a third feature map of a topmost layer in the FPN; the third feature map at the topmost layer is obtained by convolving an nth-level first feature map corresponding to the nth-level convolution layer;
circularly executing the steps until reaching the third feature map of the bottommost layer to obtain a plurality of layers of third feature maps;
respectively convolving each of the plurality of hierarchical third feature maps to determine a plurality of feature vectors;
and constructing a feature vector set according to the plurality of feature vectors.
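For illustration only, the top-down construction described above can be sketched in PyTorch roughly as follows; the 1x1 lateral convolutions, nearest-neighbour upsampling, fusion by element-wise addition and the channel width of 256 are assumptions commonly used with feature pyramids, not details fixed by this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNSketch(nn.Module):
    """Illustrative top-down pyramid: 1x1 lateral convolutions on each fused
    feature map, upsample-and-add from the topmost level downward, then a 3x3
    output convolution per level to produce the per-level feature vectors."""
    def __init__(self, in_channels, width=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, width, 1) for c in in_channels])
        self.output = nn.ModuleList([nn.Conv2d(width, width, 3, padding=1) for _ in in_channels])

    def forward(self, fused_maps):
        # fused_maps: low level (high resolution) ... high level (low resolution)
        laterals = [lat(m) for lat, m in zip(self.lateral, fused_maps)]
        tops = [laterals[-1]]                                # topmost third feature map
        for lateral in reversed(laterals[:-1]):              # walk down the pyramid
            upsampled = F.interpolate(tops[-1], size=lateral.shape[-2:], mode="nearest")
            tops.append(lateral + upsampled)
        tops = tops[::-1]                                    # reorder low -> high level
        return [out(t) for out, t in zip(self.output, tops)]  # per-level feature vectors
```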
In some embodiments, determining, with the RPN, a candidate box for each feature vector in the set of feature vectors includes:
and sliding, by the RPN, a sliding window of a set size along a set track over each feature vector in the feature vector set, identifying a detection target on each feature vector, and configuring a candidate frame surrounding the detection target.
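As a hedged sketch of such a sliding-window proposal head: the 3x3 sliding convolution, the per-anchor objectness and box-offset outputs, and the anchor count below are assumptions borrowed from common RPN designs, not specifics of this embodiment.

```python
import torch
import torch.nn as nn

class RPNHeadSketch(nn.Module):
    """Illustrative RPN head: a 3x3 convolution slides over each level's
    feature map; at every spatial position it scores `num_anchors` candidate
    boxes and predicts their offsets. Widths and anchor count are assumptions."""
    def __init__(self, in_ch=256, num_anchors=3):
        super().__init__()
        self.slide = nn.Conv2d(in_ch, in_ch, 3, padding=1)       # the sliding window
        self.objectness = nn.Conv2d(in_ch, num_anchors, 1)       # target / background score
        self.box_deltas = nn.Conv2d(in_ch, num_anchors * 4, 1)   # candidate-box offsets
        self.relu = nn.ReLU(inplace=True)

    def forward(self, feature_levels):
        scores, deltas = [], []
        for feat in feature_levels:            # one pass per pyramid level
            h = self.relu(self.slide(feat))
            scores.append(self.objectness(h))
            deltas.append(self.box_deltas(h))
        return scores, deltas
```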
In some embodiments, the trained object detection network model further comprises: a full connection layer;
mapping the fusion feature map and the candidate frame through the ROI layer to determine a target detection result, wherein the method comprises the following steps:
mapping the candidate frame to a corresponding position in the fusion feature map of a corresponding level by using the ROI layer, and determining a low-dimensional vector;
and inputting the low-dimensional vector into the full-connection layer to classify and regress, and determining a target detection result.
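A minimal sketch of this mapping and classification/regression step is shown below, using torchvision's roi_align as one possible way to map candidate frames onto the fusion feature map; the pooled size, hidden width and class count are illustrative assumptions, not values taken from this embodiment.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class ROIHeadSketch(nn.Module):
    """Illustrative ROI stage: candidate boxes are mapped onto the fused
    feature map, pooled into a fixed-size (low-dimensional) vector, then
    classified and regressed by fully connected layers."""
    def __init__(self, in_ch=256, pool=7, num_classes=2):
        super().__init__()
        self.pool = pool
        self.fc = nn.Sequential(nn.Flatten(), nn.Linear(in_ch * pool * pool, 1024), nn.ReLU())
        self.cls_head = nn.Linear(1024, num_classes)      # detection-target class
        self.reg_head = nn.Linear(1024, num_classes * 4)  # refined box coordinates

    def forward(self, fused_map, boxes, spatial_scale):
        # boxes: list with one (K, 4) tensor of candidate boxes per image
        regions = roi_align(fused_map, boxes, (self.pool, self.pool), spatial_scale)
        h = self.fc(regions)
        return self.cls_head(h), self.reg_head(h)
```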
In some embodiments, the training process of the object detection network model includes:
acquiring a first feature map of the marked image by using the convolutional neural network; the marked image comprises marked targets marked by a marking frame;
acquiring a second feature map of the first feature map by utilizing the RFB, and determining a fusion feature map according to the first feature map and the second feature map;
determining feature vectors corresponding to each level in the fusion feature map by using the FPN, and constructing a feature vector set;
determining a candidate frame corresponding to each feature vector in the feature vector set by utilizing the RPN;
mapping the fusion feature map and the candidate frame through the ROI layer to determine a prediction result; the prediction result comprises a detection target and a candidate frame surrounding the detection target;
And training the target detection network model according to the candidate frame surrounding the detection target in the prediction result and the mark frame until the matching degree of the candidate frame in the prediction result and the mark frame reaches a set threshold value and the detection target in the prediction result is matched with the marked target marked by the mark frame, and determining the trained target detection network model.
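The stopping condition described above can be illustrated with a hedged helper that checks whether every marked box is matched by a predicted candidate frame above a set IoU threshold; the 0.5 threshold, the use of torchvision's box_iou and the commented outer loop are illustrative assumptions, not the patent's prescribed training procedure.

```python
import torch
from torchvision.ops import box_iou

def boxes_match(pred_boxes, gt_boxes, iou_threshold=0.5):
    """Illustrative criterion: every annotated (marked) box is matched by at
    least one predicted candidate box with IoU above the set threshold."""
    if pred_boxes.numel() == 0:
        return False
    iou = box_iou(gt_boxes, pred_boxes)            # shape (num_gt, num_pred)
    return bool((iou.max(dim=1).values >= iou_threshold).all())

# Hypothetical outer loop (model, loss and optimizer are placeholders):
# for images, gt in loader:
#     preds = model(images)
#     loss = detection_loss(preds, gt)             # classification + box regression terms
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```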
According to another aspect of the present disclosure, there is also provided an object detection apparatus including:
the image to be detected input module is used for inputting the image to be detected into the trained target detection network model; the trained object detection network model comprises: convolutional neural network, region proposal network RPN, region of interest ROI layer; a plurality of layers of convolution layers of the convolution neural network are cascaded with receptive field modules RFB; the RPN is configured with a feature pyramid network FPN;
the first feature map determining module is used for acquiring a first feature map of the image to be detected by using the convolutional neural network;
the fusion feature map determining module is used for acquiring a second feature map of the first feature map by utilizing the RFB, and determining a fusion feature map according to the first feature map and the second feature map, wherein the fusion feature map has a plurality of levels;
The feature vector set construction module is used for determining feature vectors corresponding to each level in the fusion feature map by using the FPN to construct a feature vector set;
the candidate frame determining module is used for determining a candidate frame corresponding to each feature vector in the feature vector set by utilizing the RPN;
and the target detection result determining module is used for mapping the fusion feature map and the candidate frame through the ROI layer to determine a target detection result, wherein the target detection result comprises a detection target and the candidate frame surrounding the detection target.
According to another aspect of the present disclosure, there is also provided an electronic device including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform an object detection method according to any one of the preceding claims via execution of the executable instructions.
According to another aspect of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a target detection method of any one of the above.
According to another aspect of the present disclosure, there is also provided a computer program product comprising a computer program which, when executed by a processor, implements a target detection method of any one of the above.
According to the target detection method and apparatus, the storage medium and the electronic device of the present disclosure, receptive field modules RFB are cascaded with the convolution layers of multiple levels of the convolutional neural network. Structurally, the RFB uses convolution kernels of different sizes in the branches of different levels, each kernel adopts a different dilation rate, and larger weights are assigned to the smaller kernels near the center, so that the branches corresponding to each level of the fusion feature map obtained by fusing the first feature map and the second feature map form a structure in which different receptive field sizes complement one another. Compared with using convolution operations alone, cascading RFB modules after the convolution layers allows feature maps with different receptive field sizes to be extracted, so the feature information is richer, the problem of target recognition failure caused by occlusion is effectively avoided, and the feature extraction capability of the network is improved without increasing the network scale. The first feature map of the image to be detected is obtained with the convolution layers of the multiple levels of the convolutional neural network, and the second feature map of the first feature map is obtained with the RFB, making full use of the convolution layers of every level and of the detail information at each level, which improves the multi-scale target detection capability. The feature pyramid network FPN is configured in the region proposal network RPN, making full use of the top-down branch and the lateral connections to compute the feature vector corresponding to each level of the fusion feature map; richer detail information in the low-level fusion feature maps is fused with the high-level semantic information in the high-level fusion feature maps, so targets of multiple different scales can be detected and the model's adaptability to multi-scale targets is improved. A feature vector set is constructed using the FPN, the candidate frame corresponding to each feature vector is determined using the RPN, and the candidate frames are mapped to the corresponding positions in the fusion feature map through the ROI layer, obtaining the detection targets and the candidate frames surrounding them. The method and apparatus can directly identify detection targets in an image when target scales differ and feature information is sparse, improve the scale range and accuracy of target detection, and achieve better detection of small targets and partially occluded targets.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort.
Fig. 1 is a schematic diagram showing a system configuration of a target detection method in an embodiment of the present disclosure.
Fig. 2 shows a schematic diagram of a passenger flow statistics scenario.
Fig. 3 shows a schematic diagram of the result of an existing target detection scheme in a passenger flow statistics scenario.
Fig. 4 is a schematic diagram of a target detection method according to an embodiment of the disclosure.
Fig. 5 shows a schematic diagram of an object detection result to which an object detection method according to an embodiment of the present disclosure is applied.
Fig. 6 is a schematic diagram illustrating a feature vector set updating process of a target detection method according to an embodiment of the disclosure.
Fig. 7 illustrates a schematic diagram of an RFB-based model structure of a target detection method in an embodiment of the disclosure.
Fig. 8 illustrates RFB and RFB-s block diagrams of a target detection method in an embodiment of the disclosure.
Fig. 9 shows an RFB expanded receptive field effect plot of a target detection method in an embodiment of the disclosure.
Fig. 10 is a schematic diagram illustrating a process of determining a fusion profile of a target detection method according to an embodiment of the disclosure.
Fig. 11 is a schematic diagram illustrating a specific process of determining a fusion feature map according to an object detection method in an embodiment of the disclosure.
Fig. 12 is a schematic diagram of an FPN structure of a target detection method according to an embodiment of the disclosure.
Fig. 13 is a schematic diagram illustrating a process of constructing a feature vector set in a target detection method according to an embodiment of the disclosure.
Fig. 14 shows a schematic diagram of FPN applied to RPN in an object detection method according to an embodiment of the disclosure.
Fig. 15 is a schematic diagram of an RPN network structure of a target detection method according to an embodiment of the disclosure.
Fig. 16 is a schematic diagram illustrating a training process of an object detection network model of an object detection method according to an embodiment of the disclosure.
Fig. 17 shows a flowchart of a target detection method in an embodiment of the present disclosure.
Fig. 18 shows a structure diagram of an object detection network model of an object detection method in an embodiment of the present disclosure.
Fig. 19 shows a schematic diagram of an object detection device in an embodiment of the disclosure.
Fig. 20 is a block diagram showing a structure of a computer device of an object detection method in an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus a repetitive description thereof will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in software or in one or more hardware modules or integrated circuits or in different networks and/or processor devices and/or microcontroller devices.
The following detailed description of embodiments of the present disclosure refers to the accompanying drawings.
FIG. 1 illustrates a schematic diagram of an exemplary application system architecture to which the object detection method of embodiments of the present disclosure may be applied. As shown in fig. 1, the system architecture may include a terminal device 101, a network 102, and a server 103.
The medium used by the network 102 to provide a communication link between the terminal device 101 and the server 103 may be a wired network or a wireless network.
Alternatively, the wireless network or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network including, but not limited to, a local area network (Local Area Network, LAN), metropolitan area network (Metropolitan Area Network, MAN), wide area network (Wide Area Network, WAN), mobile, wired or wireless network, private network, or any combination of virtual private networks. In some embodiments, data exchanged over a network is represented using techniques and/or formats including HyperText Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec), etc. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of or in addition to the data communication techniques described above.
The terminal device 101 may be a variety of electronic devices including, but not limited to, smart phones, tablet computers, laptop portable computers, desktop computers, wearable devices, augmented reality devices, virtual reality devices, and the like.
Alternatively, the clients of the applications installed in different terminal devices 101 are the same or clients of the same type of application based on different operating systems. The specific form of the application client may also be different based on the different terminal platforms, for example, the application client may be a mobile phone client, a PC client, etc.
The server 103 may be a server providing various services, such as a background management server providing support for devices operated by the user with the terminal apparatus 101. The background management server can analyze and process the received data such as the request and the like, and feed back the processing result to the terminal equipment.
Optionally, the server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Network, content delivery networks), basic cloud computing services such as big data and artificial intelligence platforms, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the present application is not limited herein.
Those skilled in the art will appreciate that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative, and that any number of terminal devices, networks, and servers may be provided as desired. The embodiments of the present disclosure are not limited in this regard.
Under the system architecture described above, embodiments of the present disclosure provide a target detection method that may be performed by any electronic device with computing processing capabilities.
In some embodiments, the target detection method provided in the embodiments of the present disclosure may be performed by a terminal device of the above system architecture; in other embodiments, the object detection method provided in the embodiments of the present disclosure may be performed by a server in the system architecture described above; in other embodiments, the target detection method provided in the embodiments of the present disclosure may be implemented by the terminal device and the server in the system architecture in an interactive manner.
As shown in fig. 2, in the passenger flow statistics scene, image acquisition is performed in a natural scene: the camera faces a road that slopes from the upper left corner to the lower right corner, occluding objects such as houses and trees stand on both sides of the road, a clear large-scale target person is located close to the image acquisition device (such as the camera), some target persons are occluded behind the trees, and relatively blurred small-scale target persons are located in the distance, near the houses and at the far end of the camera's acquisition range. When an existing passenger flow statistics algorithm (for example, the Faster R-CNN algorithm) is adopted, as shown in fig. 3, factors such as the varying distances and angles of the target persons from the camera and occlusion cause the acquired image to suffer from scale variation, severe deformation and occlusion, and the traditional target detection algorithm easily fails: only the clear, unoccluded large-scale target persons close to the camera are identified and surrounded by candidate frames, while the occluded target persons and the blurred small-scale target persons in the distance are missed, so that the passenger flow statistics result is wrong.
In order to solve the problem of passenger flow statistics error caused by the above failure of target identification, fig. 4 is a schematic diagram illustrating a target detection method in an embodiment of the disclosure, and as shown in fig. 4, the target detection method provided in the embodiment of the disclosure includes the following steps:
s402: inputting an image to be detected into a trained target detection network model; the trained target detection network model comprises: a convolutional neural network, a region proposal network RPN (Region Proposal Networks), and a region of interest ROI (Region Of Interest) layer; receptive field modules RFB (Receptive Field Block) are cascaded with the convolution layers of a plurality of levels of the convolutional neural network; the RPN is configured with a feature pyramid network FPN (Feature Pyramid Networks);
s404: acquiring a first feature map of the image to be detected by using the convolutional neural network;
s406: acquiring a second feature map of the first feature map by utilizing the RFB, and determining a fusion feature map according to the first feature map and the second feature map, wherein the fusion feature map has a plurality of levels;
s408: determining feature vectors corresponding to each level in the fusion feature map by using the FPN, and constructing a feature vector set;
S4010: determining a candidate frame corresponding to each feature vector in the feature vector set by utilizing the RPN;
s4012: and mapping the fusion feature map and the candidate frame through the ROI layer to determine a target detection result, wherein the target detection result comprises a detection target and the candidate frame surrounding the detection target.
The present disclosure provides a trained target detection network model comprising: a convolutional neural network, a region proposal network RPN (Region Proposal Networks), and a region of interest ROI (Region Of Interest) layer. The convolutional neural network may be a basic convolutional neural network built from convolution, activation and pooling operations (conv+relu+pooling). The role of the ROI layer is to determine regions of interest in an image: after the input image passes through the convolutional network to obtain a feature map, a plurality of target candidate frames are generated by the RPN, and the regions onto which these candidate frames (defined in the input-image coordinate system) map on the feature map are the regions of interest for target detection. By mapping the candidate frames to the corresponding positions in the fusion feature map, a detection target and a candidate frame surrounding the detection target can be obtained. The main role of the RPN is to generate candidate frames containing detection targets; it does so mainly by sliding windows over the feature vectors.
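Purely as a structural illustration of how these components relate (not the patent's implementation), the pipeline can be sketched as follows, with every submodule left as an injected placeholder whose internals are assumptions:

```python
import torch
import torch.nn as nn

class DetectionPipelineSketch(nn.Module):
    """Illustrative wiring only: backbone conv levels with cascaded RFBs,
    an FPN-configured RPN branch, and an ROI stage (all simplified)."""
    def __init__(self, backbone, rfb_blocks, fpn, rpn, roi_head):
        super().__init__()
        self.backbone = backbone      # multi-level convolutional neural network
        self.rfb_blocks = rfb_blocks  # receptive field modules cascaded per level
        self.fpn = fpn                # feature pyramid network configured in the RPN
        self.rpn = rpn                # region proposal network
        self.roi_head = roi_head      # ROI layer + classification/regression

    def forward(self, image):
        first_maps = self.backbone(image)            # first feature maps per level
        fused_maps = self.rfb_blocks(first_maps)     # second maps fused with first maps
        feature_vectors = self.fpn(fused_maps)       # one feature vector per level
        proposals = self.rpn(feature_vectors)        # candidate frames
        return self.roi_head(fused_maps, proposals)  # detection targets + frames
```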
In the present disclosure, receptive field modules RFB are cascaded with the convolution layers of multiple levels of the convolutional neural network. Structurally, the RFB uses convolution kernels of different sizes in different branches, each kernel adopts a different dilation rate, and larger weights are assigned to the smaller kernels near the center. As a result, the branches corresponding to each level of the fusion feature map obtained by fusing the first feature map and the second feature map form a structure in which different receptive field sizes complement one another. Compared with using convolution operations alone, cascading RFB modules after the convolution layers allows feature maps with different receptive field sizes to be extracted, so the feature information is richer, the problem of target recognition failure caused by occlusion is effectively avoided, and the feature extraction capability of the network is improved without increasing the network scale.
Further, by configuring the feature pyramid network FPN in the regional proposal network RPN, the capability of the top-down branch and the transverse connection is fully utilized, the feature vector corresponding to each level fusion feature map is calculated, more detail information in the low-level fusion feature map and high semantic information in the high-level fusion feature map are fused, multiple different-scale target detection is realized, and the adaptability of the model to the multi-scale targets is improved.
As shown in fig. 5, the first feature map of the image to be detected is obtained by using the convolution layers of the multiple levels of the convolutional neural network, the second feature map of the first feature map is obtained by using the RFB, and the two are fused into a fusion feature map, so that the convolution layers of every level of the convolutional neural network are fully utilized and the detail information of each level is exploited, which helps improve multi-scale target detection capability. A feature vector set is then constructed using the FPN, the candidate frame corresponding to each feature vector is determined using the RPN, and the candidate frames are mapped through the ROI layer to the corresponding positions in the fusion feature map, obtaining the detection targets and the candidate frames surrounding them.
Compared with the existing target detection algorithm used in fig. 3, when the target detection method provided by the disclosure is used for passenger flow statistics in fig. 5, the large-scale target persons close to the image acquisition device are identified and marked with solid-line frames. For the blurred small-scale target persons far from the image acquisition device, accurate small-scale target identification is achieved because RFB modules are cascaded after the convolution layers and the feature pyramid network FPN is configured in the region proposal network RPN, and these targets are marked with dotted-line frames. In addition, partially occluded target persons can also be detected and are likewise marked with dotted-line frames. The accuracy of passenger flow statistics in natural scenes is thus improved, and target detection and recognition is covered for various categories, various scales and occlusion conditions.
As shown in fig. 6, the method for detecting an object provided in the present disclosure further includes, in one embodiment:
step 602: performing maximum pooling downsampling on the feature vector with the highest level in the feature vector set, and determining an updated feature vector; the feature vectors in the feature vector set are ranked in a hierarchical order according to resolution, and the resolution of the feature vector with the highest hierarchy is the smallest;
step 604: the update vector is configured at a higher level than the feature vector with the highest level to update the feature vector set.
The levels of the feature vectors in the feature vector set correspond to the fusion feature map, and the feature map levels are tied to the levels of the convolutional neural network, so the scale information carried by the original feature vectors in the set is not rich enough. To increase the richness of scale information, a feature vector of a new level needs to be added to the original feature vector set, so that detection performance is better for both very small and very large targets. First, the feature vectors in the original set are sorted by resolution: the feature vector with the smallest resolution serves as the highest-level feature vector, the vectors are arranged in order of gradually increasing resolution, and the feature vector with the largest resolution serves as the lowest-level feature vector. Then, maximum-pooling downsampling is performed on the highest-level feature vector in the original set to determine an updated feature vector, whose resolution is smaller still than that of the highest-level feature vector in the original set. Finally, the updated feature vector is configured at a level above the previously highest level to update the feature vector set.
A low-level feature vector corresponding to a low-level feature map has a smaller stride, so its size is larger and its receptive field is smaller, which favours detecting small targets; a high-level feature vector corresponding to a high-level feature map has a larger stride, a smaller size and a larger receptive field, which favours detecting large targets. By performing maximum-pooling downsampling on the highest-level feature vector, an updated feature vector with further reduced resolution is obtained; the size of its corresponding feature map is reduced, making large targets easier to detect. Increasing the richness of scale information in this way gives the target detection network model better detection performance on both very small and very large targets.
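A minimal sketch of this extra-level construction is given below, assuming PyTorch tensors and a stride-2 max pooling; the kernel size, stride and channel width are assumptions, since the embodiment only requires maximum-pooling downsampling of the highest-level feature vector.

```python
import torch
import torch.nn.functional as F

def extend_feature_levels(feature_maps):
    """feature_maps: list ordered from lowest level (largest resolution) to
    highest level (smallest resolution). Appends one extra, coarser level
    obtained by max-pooling the highest-level map."""
    highest = feature_maps[-1]
    extra = F.max_pool2d(highest, kernel_size=1, stride=2)  # stride-2 downsampling
    return feature_maps + [extra]

# Example with dummy pyramid maps (channel count 256 is an assumption)
maps = [torch.randn(1, 256, s, s) for s in (200, 100, 50, 25)]
extended = extend_feature_levels(maps)
print([m.shape[-1] for m in extended])  # [200, 100, 50, 25, 13]
```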
In an embodiment, the convolutional neural network comprises first-level to Nth-level convolution layers, whose resolutions decrease level by level;
the receptive field module RFB block comprises: a base receptive field module and an optimized receptive field module;
the convolutional neural network has a receptive field module RFB in cascade with a plurality of hierarchical convolutional layers, comprising:
the second level convolution layer cascades the optimized receptive field modules, and each of the third level convolution layer to the N-1 level convolution layer cascades the basic receptive field modules respectively.
The RFB-based model structure shown in fig. 7 takes a convolutional neural network with a 5-layer structure as an example, which includes a first-level convolution layer C1, a second-level convolution layer C2, a third-level convolution layer C3, a fourth-level convolution layer C4 and a fifth-level convolution layer C5. The resolution is highest at the first-level convolution layer C1, decreases level by level from C1 to C5, and is lowest at the fifth-level convolution layer C5.
As the number of layers of a convolutional neural network increases, its feature extraction capability improves, but excessive parameters, a large computation cost and vanishing gradients follow. The convolutional neural network therefore needs to be designed so that feature extraction capability is improved while the network model stays lightweight, and the RFB structure enhances the network's feature extraction capability through a receptive field mechanism that simulates human vision. Specifically, the receptive field module RFB block comprises a basic receptive field module RFB and an optimized receptive field module RFB-s. As shown in the RFB and RFB-s structure diagram of fig. 8, the optimized receptive field module RFB-s is structurally configured with more convolution kernels than the basic receptive field module RFB, and these additional kernels are smaller than the kernels of the basic receptive field module. The difference in kernel sizes between RFB and RFB-s overcomes the limited feature extraction capability caused, in the prior art, by using kernels of the same size in every convolution layer, thereby improving the feature extraction capability of the network model.
Most existing target detection algorithms are based on convolutional neural networks; as the depth of the model network increases, the models gain strong learning ability but incur a larger computation cost. The present design introduces the RFB structure (Receptive Field Block), which borrows the idea of the Inception structure and improves the network's feature extraction capability by simulating the receptive field of human vision. The RFB adds dilated convolution on top of the Inception structure, effectively enlarging the receptive field, so the RFB improves target detection performance without increasing the network scale.
As shown in the enlarged-receptive-field effect diagram of the RFB in fig. 9, the RFB structure uses convolution kernels of different sizes in different branches, each kernel adopts a different dilation rate, and larger weights are allocated to the smaller kernels near the center, so that after the features of the branches are finally fused, a structure in which different receptive field sizes complement one another is obtained. Compared with the original convolution operation, using the RFB allows feature maps with different receptive fields to be extracted, so the feature information is more abundant.
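For illustration, an RFB-like block with parallel branches of different kernel sizes and dilation rates might look like the sketch below; the specific kernel sizes, dilation rates (1, 3, 5) and channel split are assumptions in the spirit of the published RFB design, since this passage does not fix exact values.

```python
import torch
import torch.nn as nn

class RFBSketch(nn.Module):
    """Illustrative receptive-field block: parallel branches use different
    kernel sizes, each followed by a dilated 3x3 convolution; the branch
    outputs are concatenated and fused with a shortcut connection."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = out_ch // 4
        self.branch1 = nn.Sequential(                       # small kernel, dilation 1
            nn.Conv2d(in_ch, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=1, dilation=1))
        self.branch2 = nn.Sequential(                       # 3x3 kernel, dilation 3
            nn.Conv2d(in_ch, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=1),
            nn.Conv2d(mid, mid, 3, padding=3, dilation=3))
        self.branch3 = nn.Sequential(                       # 5x5 kernel, dilation 5
            nn.Conv2d(in_ch, mid, 1),
            nn.Conv2d(mid, mid, 5, padding=2),
            nn.Conv2d(mid, mid, 3, padding=5, dilation=5))
        self.fuse = nn.Conv2d(3 * mid, out_ch, 1)
        self.shortcut = nn.Conv2d(in_ch, out_ch, 1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        branches = torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)
        return self.relu(self.fuse(branches) + self.shortcut(x))

# Usage: rfb = RFBSketch(256, 256); y = rfb(torch.randn(1, 256, 50, 50))
```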
Since the first-level convolution layer C1 has the highest resolution, it is not suitable to cascade it with a receptive field module RFB, so the receptive field modules start from the second-level convolution layer C2. Specifically, the optimized receptive field module RFB-s has more convolution kernels of smaller size, which reduces the number of parameters, so it suits the low-level feature maps with higher resolution. Therefore, the optimized receptive field module RFB-s is cascaded at the second-level convolution layer C2, while the third-level convolution layer C3 and the fourth-level convolution layer C4 are each cascaded with a basic receptive field module RFB. Because a receptive field module is added in each branch, the feature map obtained by the branch of the second-level convolution layer C2 and the optimized receptive field module RFB-s is fused with the original feature map, and the resulting new feature map is passed into the third-level convolution layer C3; the feature map obtained by the branch of the third-level convolution layer C3 and its receptive field module RFB is fused with the original feature map, and the new feature map is passed into the fourth-level convolution layer C4; the feature map obtained by the branch of the fourth-level convolution layer C4 and its receptive field module RFB is fused with the original feature map, and the new feature map is passed into the fifth-level convolution layer C5. Since the convolutional neural network has five layers in total, no receptive field module RFB is cascaded at the fifth-level convolution layer C5, which avoids the problem of fusing C5 with an RFB.
In an embodiment, the acquiring, by using the convolutional neural network, the first feature map of the image to be detected in step S404 may include:
respectively convolving the image to be detected by using a first level convolution layer to an Nth level convolution layer of the convolutional neural network, obtaining a feature map level corresponding to each level convolution layer, and determining a first feature map; the hierarchy of the first feature map corresponds to a hierarchy of convolutional layers of the convolutional neural network.
Taking the convolutional neural network with the 5-layer structure as an example, the image to be detected is convolved with the first-level convolution layer C1 to determine the feature map level corresponding to C1, with the second-level convolution layer C2 to determine the feature map level corresponding to C2, with the third-level convolution layer C3 to determine the feature map level corresponding to C3, with the fourth-level convolution layer C4 to determine the feature map level corresponding to C4, and with the fifth-level convolution layer C5 to determine the feature map level corresponding to C5. The feature map levels corresponding to C1 through C5 are then arranged according to the hierarchical structure of the convolutional neural network, yielding the first feature map.
In another embodiment, the image to be detected is convolved with the first-level convolution layer C1 to determine the feature map level corresponding to C1; that feature map level is convolved with the second-level convolution layer C2 to determine the feature map level corresponding to C2; the result is convolved with the third-level convolution layer C3 to determine the feature map level corresponding to C3; then with the fourth-level convolution layer C4 to determine the feature map level corresponding to C4; and finally with the fifth-level convolution layer C5 to determine the feature map level corresponding to C5. The feature map levels corresponding to C1 through C5 are arranged according to the hierarchical structure of the convolutional neural network, yielding the first feature map.
As shown in fig. 10, in an embodiment, the obtaining, by using the RFB, the second feature map of the first feature map in step S406, and determining a fusion feature map according to the first feature map and the second feature map may include:
step S1002: acquiring second feature graphs corresponding to second to Nth levels in the first feature graphs by utilizing the RFB;
step S1004: and fusing the first feature map containing the first level to the Nth level and the second feature map containing the second level to the Nth level, and determining a fused feature map, wherein the number of the levels of the fused feature map is equal to that of the second feature map.
Because the receptive field modules RFB are cascaded after the convolution layers of the convolutional neural network, the feature maps obtained by the cascaded convolution layers can be further subjected to feature extraction through the receptive field modules RFB. A receptive field module RFB is cascaded after each of the second-level to the (N-1)-th-level convolution layers in the convolutional neural network, and the second feature maps corresponding to the second to the Nth levels in the first feature map are obtained through the receptive field modules RFB. The first feature map and the second feature map are then fused to obtain the fusion feature map: the first feature map has the first to the Nth levels, the second feature map has the second to the Nth levels, and the fusion feature map obtained by fusion has the second to the Nth levels, that is, the number of levels of the fusion feature map equals that of the second feature map; if the counting starts from one, the fusion feature map has the first to the (N-1)-th levels.
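As a minimal sketch of this level bookkeeping (steps S1002 and S1004), assume one receptive field module per processed level and element-wise addition as the fusion operation; the disclosure only states that the two maps are fused, so the addition and the placeholder `rfb_modules` are assumptions.

```python
def build_fusion_feature_map(first_maps, rfb_modules):
    """first_maps: [level 1 .. level N] of the first feature map;
    rfb_modules: one RFB per level 2 .. N (channel-preserving, so addition is valid)."""
    second_maps = [rfb(fm) for rfb, fm in zip(rfb_modules, first_maps[1:])]
    # Fuse level i of the first feature map (i = 2..N) with the corresponding second feature map;
    # the fusion feature map therefore has as many levels as the second feature map (N - 1).
    return [fm + sm for fm, sm in zip(first_maps[1:], second_maps)]
```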
As shown in fig. 11, in the embodiment, in the step S1004, the fusing the first feature map including the first level to the nth level and the second feature map including the second level to the nth level to determine the fused feature map may include:
step S1102: convolving a first-level first feature map obtained by convolving the image to be detected by using a first-level convolution layer through a second-level convolution layer to determine a second-level first feature map;
step S1104: extracting a first-level second feature map of the second-level first feature map using the optimized receptive field module cascaded with the second-level convolutional layer; fusing the second-level first feature map and the first-level second feature map, and then convolving the fused second-level first feature map and the first-level second feature map by using a third-level convolution layer to determine a third-level first feature map;
step S1106: extracting a second-level second feature map of the third-level first feature map by using the basic receptive field module cascaded with the third-level convolution layer; fusing the third-level first feature map and the second-level second feature map, and then convolving the fused third-level first feature map and the second-level second feature map by using a fourth-level convolution layer to determine a fourth-level first feature map;
step S1108: circularly executing the extraction process of the fourth-level first feature map according to the residual level of the convolution layer and the basic receptive field module corresponding to the residual level, and determining fifth-level first feature map to Nth-level first feature map;
step S11010: and determining the second-level first feature map, the third-level first feature map, the fourth-level first feature map and the fifth-level first feature map to the N-level first feature map as a plurality of levels of fusion feature maps.
Taking the convolutional neural network with the 5-layer structure as an example, the image to be detected is first convolved through the first-level convolution layer C1 to obtain a first-level first feature map, which is transmitted into the second-level convolution layer C2; the first-level first feature map is convolved through the second-level convolution layer C2 to determine a second-level first feature map; a first-level second feature map of the second-level first feature map is extracted by using the optimized receptive field module RFB-s cascaded with the second-level convolution layer C2; after the second-level first feature map and the first-level second feature map are fused, convolution is carried out by using the third-level convolution layer C3 to determine a third-level first feature map; a second-level second feature map of the third-level first feature map is extracted by using the basic receptive field module RFB cascaded with the third-level convolution layer C3; after the third-level first feature map and the second-level second feature map are fused, convolution is carried out by using the fourth-level convolution layer C4 to determine a fourth-level first feature map; a third-level second feature map of the fourth-level first feature map is extracted by using the basic receptive field module RFB cascaded with the fourth-level convolution layer C4; after the fourth-level first feature map and the third-level second feature map are fused, convolution is carried out by using the fifth-level convolution layer C5 to determine a fifth-level first feature map; the second-level first feature map, the third-level first feature map, the fourth-level first feature map, and the fifth-level first feature map are determined as the plurality of levels of the fusion feature map.
If the counting starts from one, the first level of the fusion feature map corresponds to the second-level first feature map, the second level of the fusion feature map corresponds to the third-level first feature map, the third level of the fusion feature map corresponds to the fourth-level first feature map, and the fourth level of the fusion feature map corresponds to the fifth-level first feature map; the fusion feature map has four levels in total. If the counting starts from two, the second level of the fusion feature map corresponds to the second-level first feature map, the third level corresponds to the third-level first feature map, the fourth level corresponds to the fourth-level first feature map, and the fifth level corresponds to the fifth-level first feature map; the fusion feature map still has four levels.
Specifically, if the structure of the convolutional neural network exceeds 5 layers, after the above fourth-level first feature map is obtained, the extraction process of the fourth-level first feature map can be circularly executed according to the remaining levels of the convolutional layers and the basic receptive field modules corresponding to the remaining levels, so as to determine fifth-level first feature maps to Nth-level first feature maps; and determining the second-level first feature map, the third-level first feature map, the fourth-level first feature map and the fifth-level first feature map to the N-level first feature map as a plurality of levels of the fusion feature map.
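A sketch of this interleaved extraction for the 5-layer example, again assuming PyTorch and element-wise addition as the fusion: the output of C2 is refined by the optimized module RFB-s, the fused result feeds C3, and basic RFB modules are cascaded after C3 and C4. The module internals are placeholders supplied by the caller.

```python
import torch.nn as nn

class FusedBackbone(nn.Module):
    def __init__(self, convs, rfb_s, rfb_c3, rfb_c4):
        super().__init__()
        self.c1, self.c2, self.c3, self.c4, self.c5 = convs
        self.rfb_s = rfb_s                         # optimized RFB cascaded with C2
        self.rfb_c3, self.rfb_c4 = rfb_c3, rfb_c4  # basic RFBs cascaded with C3, C4

    def forward(self, x):
        f1 = self.c1(x)                            # first-level first feature map
        f2 = self.c2(f1)                           # second-level first feature map
        s1 = self.rfb_s(f2)                        # first-level second feature map
        f3 = self.c3(f2 + s1)                      # third-level first feature map
        s2 = self.rfb_c3(f3)                       # second-level second feature map
        f4 = self.c4(f3 + s2)                      # fourth-level first feature map
        s3 = self.rfb_c4(f4)                       # third-level second feature map
        f5 = self.c5(f4 + s3)                      # fifth-level first feature map
        return [f2, f3, f4, f5]                    # the four levels of the fusion feature map
```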
Most target detection algorithms generate candidate regions only on the last convolution layer of the convolutional neural network and do not make use of the high-resolution low-level feature maps, so the detailed information of the low-level feature maps is lost; moreover, every convolution layer performs its calculation with convolution kernels of the same size, which limits the feature extraction capability of the network and is therefore unfavorable for the detection of multi-scale targets. When applied to natural scenes, the feature pyramid network (FPN), with its top-down branch and lateral connections, has a strong advantage in handling multi-scale target problems, and the RFB structure strengthens the feature extraction capability of the network by simulating the receptive field mechanism of human vision.
As shown in the FPN structure schematic of fig. 12, in an embodiment, the second-level to the Nth-level convolution layers of the convolutional neural network are connected from bottom to top;
the feature pyramid network FPN includes: and third feature maps of a plurality of levels connected from top to bottom, wherein the third feature map of each level is laterally connected with a convolution layer of a corresponding level in the second-level to N-level convolution layers of the convolutional neural network.
The FPN aims to construct a feature pyramid by utilizing feature maps of different levels of the convolutional neural network, and the FPN introduced into the RPN mainly comprises two parts: the first part is a bottom-up process built from the levels of the convolutional neural network, and the second part is a fusion process of a top-down pathway and lateral connections. In the existing Faster RCNN structure, a sliding window is applied to the feature map of the last convolution layer to generate feature vectors, so the detailed information of the low-level feature maps is lost, which is unfavorable for multi-scale detection performance. Therefore, the feature pyramid network FPN is applied to the RPN network, and the multi-scale detection capability of the target detection network model is improved by fusing the low-level feature maps, which carry more detail information, with the high-level feature maps, which carry rich semantic information.
Taking the convolutional neural network with the 5-layer structure as an example, the first-level convolution layer C1 does not participate in the fusion and therefore is not connected to the FPN; starting from the second-level convolution layer C2 as the bottommost layer, the third-level convolution layer C3, the fourth-level convolution layer C4 and the fifth-level convolution layer C5 are connected layer by layer upwards. The feature pyramid network is connected from top to bottom, from the topmost third feature map down to the bottommost third feature map layer by layer, and the third feature map of each layer is laterally connected with the convolution layer of the corresponding level.
As shown in fig. 13, in the embodiment, determining, by using the FPN, a feature vector corresponding to each level in the fused feature map in step S408, and constructing a feature vector set may include:
step S1302: determining a third feature map of a next level according to a corresponding level in a fusion feature map corresponding to a convolution layer which is laterally connected with a third feature map of a topmost layer in the FPN; the third feature map at the topmost layer is obtained by convolving an nth-level first feature map corresponding to the nth-level convolution layer;
step S1304: circularly executing the steps until reaching the third feature map of the bottommost layer to obtain a plurality of layers of third feature maps;
step S1306: respectively convolving each of the plurality of hierarchical third feature maps to determine a plurality of feature vectors;
step S1308: and constructing a feature vector set according to the plurality of feature vectors.
Based on the convolutional neural network with a 5-layer structure, the feature pyramid network FPN comprises: a first-level third feature map M2, a second-level third feature map M3, a third-level third feature map M4, and a fourth-level third feature map M5; the fourth-level third feature map M5 to the first-level third feature map M2 are connected top-down by upsampling; the first-level convolution layer C1 to the fifth-level convolution layer C5 are connected from bottom to top; the second-level convolution layer C2 is laterally connected with the first-level third feature map M2 through convolution, the second-level convolution layer C2 corresponds to the first level of the fusion feature map, and the first level of the fusion feature map corresponds to the second-level first feature map; the third-level convolution layer C3 is laterally connected with a third feature map M3 of a second level through convolution, the third-level convolution layer C3 corresponds to the second level of the fusion feature map, and the second level of the fusion feature map corresponds to the first feature map of the third level; the fourth-level convolution layer C4 is laterally connected with a third-level third feature map M4 through convolution, the fourth-level convolution layer C4 corresponds to the third level of the fusion feature map, and the third level of the fusion feature map corresponds to the fourth-level first feature map; the fifth-level convolution layer C5 is laterally connected with the fourth-level third feature map M5 through convolution, the fifth-level convolution layer C5 corresponds to the fourth level of the fusion feature map, and the fourth level of the fusion feature map corresponds to the fifth-level first feature map.
When the feature vector set is constructed, the fourth level of the fusion feature map is first convolved to obtain the fourth-level third feature map M5; because the fourth level of the fusion feature map corresponds to the fifth-level first feature map, a 1×1 convolution is actually applied to the fifth-level first feature map to obtain the fourth-level third feature map M5. The purpose of this 1×1 convolution is to adjust the number of channels. The fourth-level third feature map M5 is the topmost third feature map in the FPN.
Determining a third-level third feature map M4 according to the third level of the fusion feature map corresponding to the fourth-level third feature map M5 and the laterally connected fourth-level convolution layer C4; determining a third characteristic diagram M3 of a second level according to the third characteristic diagram M4 of the third level and the second level of the fusion characteristic diagram corresponding to the laterally connected third level convolution layer C3; determining a first-level third feature map M2 according to the second-level third feature map M3 and a first level of the fusion feature map corresponding to the laterally connected second-level convolution layer C2; the first-level third feature map M2 is the third feature map of the bottom layer in the FPN.
When determining the third-level third feature map M4, the main process includes: upsampling the fourth-level third feature map M5 by a factor of 2, and adding it to the third level of the fusion feature map, which corresponds to the fourth-level convolution layer C4, after that level has been passed through a 1×1 convolution, to obtain the third-level third feature map M4. When determining the second-level third feature map M3, the main process includes: upsampling the third-level third feature map M4 by a factor of 2, and adding it to the second level of the fusion feature map, which corresponds to the third-level convolution layer C3, after a 1×1 convolution, to obtain the second-level third feature map M3. When determining the first-level third feature map M2, the main process includes: upsampling the second-level third feature map M3 by a factor of 2, and adding it to the first level of the fusion feature map, which corresponds to the second-level convolution layer C2, after a 1×1 convolution, to obtain the first-level third feature map M2.
After the first-level third feature map M2, the second-level third feature map M3, the third-level third feature map M4 and the fourth-level third feature map M5 are obtained, a 3×3 convolution is applied to each of them to obtain a plurality of feature vectors P2 to P5: the first-level third feature map M2 corresponds to the feature vector P2, the second-level third feature map M3 corresponds to P3, the third-level third feature map M4 corresponds to P4, and the fourth-level third feature map M5 corresponds to P5. The purpose of the 3×3 convolution is to mitigate the aliasing effect of the nearest-neighbour interpolation used for upsampling. The feature vector set is constructed from the feature vectors P2, P3, P4 and P5.
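A minimal PyTorch-style sketch of this top-down construction: a 1×1 lateral convolution adjusts channels, nearest-neighbour upsampling by a factor of 2 aligns sizes, and a 3×3 convolution on each of M2 to M5 yields P2 to P5. The channel values follow the illustrative numbers given in the next paragraph and are assumptions, not claimed values.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(4, 8, 16, 32), out_channels=16):  # illustrative values
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, fused_maps):                     # [level 1 .. level 4] of the fusion feature map
        m = [None] * len(fused_maps)
        m[-1] = self.lateral[-1](fused_maps[-1])       # M5: 1x1 convolution on the top level
        for i in range(len(fused_maps) - 2, -1, -1):   # M4, M3, M2: upsample by 2 and add the lateral
            up = F.interpolate(m[i + 1], scale_factor=2, mode="nearest")
            m[i] = self.lateral[i](fused_maps[i]) + up
        return [s(x) for s, x in zip(self.smooth, m)]  # [P2, P3, P4, P5]
```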
The size and the number of channels of each feature map are expressed as follows: the first two numbers represent the size of the feature map and the third number represents the number of channels, for example, C2 is 256×256×4, C3 is 128×128×8, C4 is 64×64×16, and C5 is 32×32×32. Taking C2 as an example, the first number 256 and the second number 256 represent the size of the feature map, and the third number 4 represents the number of channels. At the C5 layer, the 1×1 convolution leaves the feature map size unchanged; convolution with 16 convolution kernels of size 1×1 gives a feature map of 32×32×16. The feature map is then upsampled by a factor of 2, which only changes its size, yielding 64×64×16; the feature map now has the same size and the same number of channels as the C4 layer, which makes it convenient to fuse with the C4 layer.
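As a quick check of the channel and size arithmetic above (PyTorch assumed; the channel count of C5 before the 1×1 convolution is taken as 32 by the doubling pattern, which is an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c5 = torch.randn(1, 32, 32, 32)                         # N x C x H x W: a 32x32 map, 32 channels assumed
m5 = nn.Conv2d(32, 16, kernel_size=1)(c5)               # 16 kernels of 1x1 -> 32x32 with 16 channels
up = F.interpolate(m5, scale_factor=2, mode="nearest")  # only the size changes -> 64x64 with 16 channels
print(m5.shape, up.shape)                               # [1, 16, 32, 32] and [1, 16, 64, 64], matching C4
```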
The schematic of fig. 14 shows the FPN applied to the RPN. Based on the steps S602-S604, the feature vector P5 is subjected to maximum-pooling downsampling to determine an updated feature vector P6, P6 is configured above P5, and the feature vector set is updated. In order to achieve the multi-scale detection effect, the fused features {P2, P3, P4, P5} are all used as inputs of the RPN. To make the scale information richer, P6 is generated from P5 by maximum-pooling downsampling, which further reduces the size of the feature map of the previous level so that large targets can also be detected. Compared with the original FPN network, inputting the feature vectors {P2, P3, P4, P5, P6} into the RPN increases the richness of the scale information, so the network has better detection performance on both small and large targets.
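A one-function sketch of how P6 may be produced, assuming stride-2 max pooling of P5 (the disclosure specifies maximum-pooling downsampling but not the exact pooling parameters):

```python
import torch.nn.functional as F

def extend_with_p6(feature_vectors):
    """feature_vectors: [P2, P3, P4, P5]; returns [P2, P3, P4, P5, P6]."""
    p6 = F.max_pool2d(feature_vectors[-1], kernel_size=1, stride=2)  # halve the size of P5
    return feature_vectors + [p6]
```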
In an embodiment, the determining, by using the RPN, a candidate box corresponding to each feature vector in the feature vector set in step S4010 may include:
and respectively sliding each feature vector in the feature vector set by using the RPN according to a set running track by using a sliding window with a set size, identifying a detection target on each feature vector, and configuring a candidate frame surrounded by the detection target.
As shown in the schematic diagram of the RPN network structure in fig. 15, the main role of the RPN is to generate a candidate frame containing a foreground detection target, which is still a sliding window in nature, and the RPN in the existing fast RCNN generates a candidate region by performing a sliding window on the final layer of feature map C5 of the convolutional neural network. Details of the low-level feature map are not utilized, which is disadvantageous for detection of small objects. In this embodiment, by configuring the FPN in the RPN, the feature vectors { P2, P3, P4, P5, P6} may be input into the RPN by using the feature information of the C2, C3, C4, and C5 layers at the same time, so as to obtain the candidate frame. The RPN flow with FPN is to use a 3×3 sliding window to slide from top to bottom and from left to right on each feature vector in { P2, P3, P4, P5, P6}, generate 9 candidate frames with different sizes at the center of each sliding window, identify the detection target, adjust the candidate frames to surround the detection target, obtain a 512-dimensional vector, and perform subsequent classification and regression.
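A minimal RPN-head sketch consistent with this description: a 3×3 convolution realises the sliding window, a 512-dimensional intermediate vector is produced at every position, and 9 anchors per position are scored and regressed; the same head is applied to each of P2 to P6. Anchor generation and box decoding are omitted, and the input channel count is an illustrative assumption.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    def __init__(self, in_channels=16, mid_channels=512, num_anchors=9):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)  # 3x3 sliding window
        self.relu = nn.ReLU(inplace=True)
        self.cls = nn.Conv2d(mid_channels, num_anchors * 2, 1)  # foreground/background score per anchor
        self.reg = nn.Conv2d(mid_channels, num_anchors * 4, 1)  # candidate-frame adjustments per anchor

    def forward(self, feature_vectors):                 # [P2, P3, P4, P5, P6]
        outputs = []
        for p in feature_vectors:
            h = self.relu(self.conv(p))                 # 512-dimensional vector at each position
            outputs.append((self.cls(h), self.reg(h)))
        return outputs
```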
In an embodiment, the trained object detection network model further comprises: a full connection layer;
the mapping the fusion feature map and the candidate frame through the ROI layer in step S4012 may include:
Mapping the candidate frame to a corresponding position in the fusion feature map of a corresponding level by using the ROI layer, and determining a low-dimensional vector;
and inputting the low-dimensional vector into the full-connection layer to classify and regress, and determining a target detection result.
The candidate frames are mapped to the corresponding positions in the fusion feature map of the corresponding level through the ROI layer to determine a low-dimensional vector, the low-dimensional vector is input into the fully connected layer, and the candidate frames are classified by using a softmax function; for example, when the detection target is a person, the candidate frames may be classified according to gender. Bbox regression then returns the final accurate position of each candidate frame and corrects the candidate frame so that it better matches the actual situation.
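A sketch of this mapping step, using torchvision's roi_align for illustration (the disclosure does not mandate a particular pooling operator): each candidate frame is mapped onto the fusion feature map of the corresponding level, pooled to a fixed size, flattened to the low-dimensional vector, and passed to the fully connected classification and regression heads. Channel and class counts are assumptions.

```python
import torch.nn as nn
from torchvision.ops import roi_align

class ROIHead(nn.Module):
    def __init__(self, in_channels=16, pooled=7, num_classes=2):
        super().__init__()
        self.pooled = pooled
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(in_channels * pooled * pooled, 512),
                                nn.ReLU(inplace=True))
        self.cls = nn.Linear(512, num_classes)        # softmax classification (e.g. person / not person)
        self.reg = nn.Linear(512, num_classes * 4)    # Bbox regression refines each candidate frame

    def forward(self, fused_map, candidate_boxes, spatial_scale):
        # candidate_boxes: list with one (K, 4) tensor of frames per image, in image coordinates
        pooled = roi_align(fused_map, candidate_boxes, (self.pooled, self.pooled),
                           spatial_scale=spatial_scale)
        vec = self.fc(pooled)                         # the low-dimensional vector
        return self.cls(vec), self.reg(vec)
```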
As shown in fig. 16, in an embodiment, the training process of the object detection network model includes:
step S1602: acquiring a first feature map of the marked image by using the convolutional neural network; the marked image comprises marked targets marked by a marking frame;
step S1604: acquiring a second feature map of the first feature map by utilizing the RFB, and determining a fusion feature map according to the first feature map and the second feature map;
step S1606: determining feature vectors corresponding to each level in the fusion feature map by using the FPN, and constructing a feature vector set;
Step S1608: determining a candidate frame corresponding to each feature vector in the feature vector set by utilizing the RPN;
step S16010: mapping the fusion feature map and the candidate frame through the ROI layer to determine a prediction result; the prediction result comprises a detection target and a candidate frame surrounding the detection target;
step S16012: and training the target detection network model according to the candidate frame surrounding the detection target in the prediction result and the mark frame until the matching degree of the candidate frame in the prediction result and the mark frame reaches a set threshold value and the detection target in the prediction result is matched with the marked target marked by the mark frame, and determining the trained target detection network model.
Specifically, when the target detection network model is trained, the marked target in an image is marked with a marking frame in advance, so that the marked image serves as training data. The convolutional neural network is used to obtain a first feature map of the marked image, the RFB is used to obtain a second feature map of the first feature map, and the first feature map and the second feature map are fused to determine a fusion feature map; the FPN is used to determine the feature vector corresponding to each level in the fusion feature map and to construct a feature vector set; the RPN is used to determine the candidate frame corresponding to each feature vector in the feature vector set; and the fusion feature map and the candidate frames are mapped through the ROI layer to determine a prediction result, where the prediction result comprises a detection target and a candidate frame surrounding the detection target. The target detection network model is trained according to the candidate frame surrounding the detection target in the prediction result and the marking frame, until the matching degree between the candidate frame in the prediction result and the marking frame reaches a set threshold and the detection target in the prediction result matches the marked target marked by the marking frame, at which point the trained target detection network model is determined. Specifically, the overlapping area between the candidate frame and the marking frame is evaluated through the softmax function, and a judgement threshold is set: when the overlap is higher than the judgement threshold, the candidate frame is a positive sample and accords with the manually marked result; if the overlap is lower than the judgement threshold, the candidate frame is a negative sample and needs further training and correction. When all candidate frames are positive samples, the candidate frames are refined through Bbox regression, the model at that moment is output, and the trained target detection network model is obtained.
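A sketch of the training-time sample judgement described above, using torchvision's box_iou for the overlap between candidate frames and marking frames; the threshold value is an assumption rather than a value fixed by the disclosure.

```python
from torchvision.ops import box_iou

def assign_samples(candidate_boxes, marking_boxes, threshold=0.7):
    """candidate_boxes: (K, 4), marking_boxes: (M, 4), both in (x1, y1, x2, y2) form."""
    iou = box_iou(candidate_boxes, marking_boxes)     # (K, M) pairwise overlaps
    best_iou, best_match = iou.max(dim=1)             # best-matching marking frame per candidate
    positive = best_iou >= threshold                  # positive samples meet the set threshold
    return positive, best_match                       # negatives call for further training and correction
```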
As shown in the flowchart of fig. 17, in one example the target detection method of the present disclosure may include the following steps: image features are extracted by the convolutional neural network to obtain the fusion feature map, candidate frames are generated by the RPN, candidate frames of different sizes are mapped by the ROI layer to regions of the same size on the fusion feature map, and joint training of the classification and regression tasks is finally performed with softmax and bbox regression. In a passenger flow statistics scene, because of factors such as different distances and angles between the target sample and the camera and occlusion, the acquired images suffer from problems such as varying target scales and severe deformation, and traditional target detection algorithms easily fail. Therefore, the present disclosure provides a target detection method oriented to passenger flow statistics, which can directly identify target information in the image even when target scales differ and feature information is scarce. Most target detection algorithms generate candidate regions only on the last convolution layer of the convolutional neural network, do not make use of the high-resolution low-level feature maps, and perform the convolution of every layer with kernels of the same size, which limits the feature extraction capability of the network. The feature pyramid network FPN, with its top-down branch and lateral connections, has a strong advantage in handling multi-scale target problems, and the RFB structure strengthens the feature extraction capability of the network through a receptive field mechanism that simulates human vision.
The target detection network model structure shown in fig. 18 is mainly composed of three parts: the convolutional neural network, the region proposal network RPN and the ROI layer; RFB modules are cascaded in the convolutional neural network, and the FPN is configured in the RPN. The convolutional neural network extracts the feature maps of the image using a set of basic conv+relu+pooling layers; the feature maps are used by the subsequent RPN layer and fully connected layer. RFB modules are added at this stage, and the feature maps of the C2, C3 and C4 layers are fused with the feature maps produced by the RFB structure to obtain new feature maps C2, C3, C4 and C5. The RPN is used to generate the candidate regions: this layer judges positive and negative samples of the candidate frames through softmax and then corrects the candidate frames with Bbox regression. The present disclosure adds the FPN network at this stage, and candidate region generation is performed on all of P2, P3, P4, P5 and P6. The ROI layer collects the feature maps and candidate frames generated in the first two stages, maps the candidate frames to the corresponding positions of the feature maps, and then sends them to the subsequent fully connected layer for target category judgement. Classification and regression classify the candidate frames with softmax and obtain the final accurate positions of the candidate frames with Bbox regression.
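For orientation only, the following sketch wires the three parts together in the same spirit (PyTorch assumed); proposal selection, NMS and loss computation are omitted, and candidate frames are passed in directly to keep the sketch short, whereas in the full pipeline they would be decoded from the RPN output.

```python
import torch.nn as nn
import torch.nn.functional as F

class TargetDetector(nn.Module):
    def __init__(self, backbone, fpn, rpn_head, roi_head):
        super().__init__()
        self.backbone, self.fpn = backbone, fpn            # convolution layers with cascaded RFBs
        self.rpn_head, self.roi_head = rpn_head, roi_head  # RPN configured with FPN; ROI layer + heads

    def forward(self, image, candidate_boxes, spatial_scale):
        fused = self.backbone(image)                       # fusion feature map, four levels
        pyramid = self.fpn(fused)                          # P2 .. P5
        pyramid.append(F.max_pool2d(pyramid[-1], 1, stride=2))  # P6 is fed to the RPN as well
        rpn_out = self.rpn_head(pyramid)                   # per-level candidate-frame scores and offsets
        cls, reg = self.roi_head(fused[0], candidate_boxes, spatial_scale)
        return rpn_out, cls, reg
```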
The existing target detection algorithm is generated in a candidate region of the last convolution layer of the convolution neural network, and high-resolution feature images of other levels are not utilized, so that detailed information can be lost, and the detection of a multi-scale target is not facilitated; and each layer of convolution is calculated by using convolution kernels with the same size, and the characteristic extraction capability of the network is limited.
According to the method and the device, the FPN network is optimized, and the richness of the network scale information is improved through downsampling of the feature map of the last hierarchy. Further, since the low-level feature map has a large resolution and a large number of parameters, RFB-s modules are cascaded at the C2 level, and RFB modules are cascaded at the C3 and C4 levels. The passenger flow detection model with higher precision can be provided under the conditions of changeable target scale information, deformation, shielding and the like; the false detection condition of the original FasterRCNN algorithm in passenger flow statistics can be solved.
Aiming at the characteristic information loss caused by different distances and angles between a sample target and a camera, shielding and the like in passenger flow statistics, the target detection algorithm of the multi-scale sample is provided, FPN is applied to the RPN stage of target detection, and the adaptability of the original model to the multi-scale target is improved through fusion of high-layer and low-layer characteristics. The RFB is added into the convolutional neural network, so that the characteristic extraction capability of the network is improved under the condition that the original convolutional neural network model is not replaced.
It should be noted that, in the technical solution of the present disclosure, the acquiring, storing, using, processing, etc. of data all conform to relevant regulations of national laws and regulations, and various types of data such as personal identity data, operation data, behavior data, etc. relevant to individuals, clients, crowds, etc. acquired in the embodiments of the present disclosure have been authorized.
Based on the same inventive concept, an object detection device is also provided in the embodiments of the present disclosure, as described in the following embodiments. Since the principle of solving the problem of the embodiment of the device is similar to that of the embodiment of the method, the implementation of the embodiment of the device can be referred to the implementation of the embodiment of the method, and the repetition is omitted.
Fig. 19 shows a schematic diagram of an object detection device according to an embodiment of the disclosure, as shown in fig. 19, the device includes:
the image to be detected input module 1901 is used for inputting an image to be detected into the trained target detection network model; the trained object detection network model comprises: convolutional neural network, region proposal network RPN, region of interest ROI layer; a plurality of layers of convolution layers of the convolution neural network are cascaded with receptive field modules RFB; the RPN is configured with a feature pyramid network FPN;
a first feature map determining module 1902, configured to acquire a first feature map of the image to be detected using the convolutional neural network;
a fused feature map determining module 1903, configured to obtain a second feature map of the first feature map using the RFB, determine a fused feature map according to the first feature map and the second feature map, where the fused feature map has multiple levels;
A feature vector set construction module 1904, configured to determine a feature vector corresponding to each level in the fused feature map by using the FPN, and construct a feature vector set;
a candidate frame determining module 1905, configured to determine a candidate frame corresponding to each feature vector in the feature vector set using the RPN;
and a target detection result determining module 1906, configured to map the fusion feature map and the candidate frame through the ROI layer, and determine a target detection result, where the target detection result includes a detection target and the candidate frame surrounding the detection target.
It should be noted that the image-to-be-detected input module 1901, the first feature map determining module 1902, the fusion feature map determining module 1903, the feature vector set construction module 1904, the candidate frame determining module 1905 and the target detection result determining module 1906 correspond to steps S402 to S4012 in the method embodiment; the above modules share the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the above method embodiment. It should also be noted that the above modules may be implemented as part of an apparatus in a computer system, such as a set of computer-executable instructions.
In an embodiment, the method further includes a feature vector set updating module, configured to:
performing maximum pooling downsampling on the feature vector with the highest level in the feature vector set, and determining an updated feature vector; the feature vectors in the feature vector set are ranked in a hierarchical order according to resolution, and the resolution of the feature vector with the highest hierarchy is the smallest;
the update vector is configured at a higher level than the feature vector with the highest level to update the feature vector set.
In an embodiment, the convolutional neural network comprises: the resolution ratio of the first level convolution layer to the N level convolution layer is sequentially reduced;
the receptive field module RFB block comprises: a base receptive field module and an optimized receptive field module;
the convolutional neural network has a receptive field module RFB in cascade with a plurality of hierarchical convolutional layers, comprising:
the second level convolution layer cascades the optimized receptive field modules, and each of the third level convolution layer to the N-1 level convolution layer cascades the basic receptive field modules respectively.
In an embodiment, the optimized receptive field module is structurally configured with more convolution kernels than the base receptive field module, and the additionally configured convolution kernels have a smaller size than the convolution kernels of the base receptive field module.
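The exact branch layouts of the two modules are not restated in this passage, so the following contrast is purely illustrative of the stated difference (more, smaller kernels in the optimized module); the branch compositions, dilation and channel handling are assumptions, not the claimed structures.

```python
import torch.nn as nn

def branch(ch, k):
    pad = tuple(s // 2 for s in k) if isinstance(k, tuple) else k // 2
    return nn.Sequential(nn.Conv2d(ch, ch, k, padding=pad), nn.ReLU(inplace=True))

class BaseRFB(nn.Module):
    """Fewer branches with larger kernels (illustrative)."""
    def __init__(self, ch):
        super().__init__()
        self.branches = nn.ModuleList([branch(ch, 1), branch(ch, 3), branch(ch, 5)])

    def forward(self, x):
        return x + sum(b(x) for b in self.branches)

class OptimizedRFB(nn.Module):
    """More branches with smaller kernels, e.g. 1x3 and 3x1 (illustrative)."""
    def __init__(self, ch):
        super().__init__()
        self.branches = nn.ModuleList([branch(ch, 1), branch(ch, 3),
                                       branch(ch, (1, 3)), branch(ch, (3, 1))])

    def forward(self, x):
        return x + sum(b(x) for b in self.branches)
```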
In an embodiment, the first feature map determining module is specifically configured to:
respectively convolving the image to be detected by using a first level convolution layer to an Nth level convolution layer of the convolutional neural network, obtaining a feature map level corresponding to each level convolution layer, and determining a first feature map; the hierarchy of the first feature map corresponds to a hierarchy of convolutional layers of the convolutional neural network.
In an embodiment, the fusion feature map determining module is specifically configured to:
acquiring second feature graphs corresponding to second to Nth levels in the first feature graphs by utilizing the RFB;
and fusing the first feature map containing the first level to the Nth level and the second feature map containing the second level to the Nth level, and determining a fused feature map, wherein the number of the levels of the fused feature map is equal to that of the second feature map.
In an embodiment, the fusion feature map determining module is further configured to:
convolving a first-level first feature map obtained by convolving the image to be detected by using a first-level convolution layer through a second-level convolution layer to determine a second-level first feature map;
extracting a first-level second feature map of the second-level first feature map using the optimized receptive field module cascaded with the second-level convolutional layer; fusing the second-level first feature map and the first-level second feature map, and then convolving the fused second-level first feature map and the first-level second feature map by using a third-level convolution layer to determine a third-level first feature map;
Extracting a second-level second feature map of the third-level first feature map by using the basic receptive field module cascaded with the third-level convolution layer; fusing the third-level first feature map and the second-level second feature map, and then convolving the fused third-level first feature map and the second-level second feature map by using a fourth-level convolution layer to determine a fourth-level first feature map;
circularly executing the extraction process of the fourth-level first feature map according to the residual level of the convolution layer and the basic receptive field module corresponding to the residual level, and determining fifth-level first feature map to Nth-level first feature map;
and determining the second-level first feature map, the third-level first feature map, the fourth-level first feature map and the fifth-level first feature map to the N-level first feature map as a plurality of levels of fusion feature maps.
In an embodiment, the second level to the nth level of the convolutional neural network are connected from bottom to top;
the feature pyramid network FPN includes: and third feature maps of a plurality of levels connected from top to bottom, wherein the third feature map of each level is laterally connected with a convolution layer of a corresponding level in the second-level to N-level convolution layers of the convolutional neural network.
In an embodiment, the feature vector set construction module is specifically configured to:
determining a third feature map of a next level according to a corresponding level in a fusion feature map corresponding to a convolution layer which is laterally connected with a third feature map of a topmost layer in the FPN; the third feature map at the topmost layer is obtained by convolving an nth-level first feature map corresponding to the nth-level convolution layer;
circularly executing the steps until reaching the third feature map of the bottommost layer to obtain a plurality of layers of third feature maps;
respectively convolving each of the plurality of hierarchical third feature maps to determine a plurality of feature vectors;
and constructing a feature vector set according to the plurality of feature vectors.
In an embodiment, the candidate box determining module is specifically configured to:
and respectively sliding each feature vector in the feature vector set by using the RPN according to a set running track by using a sliding window with a set size, identifying a detection target on each feature vector, and configuring a candidate frame surrounded by the detection target.
In an embodiment, the trained object detection network model further comprises: a full connection layer;
the target detection result determining module is specifically configured to:
Mapping the candidate frame to a corresponding position in the fusion feature map of a corresponding level by using the ROI layer, and determining a low-dimensional vector;
and inputting the low-dimensional vector into the full-connection layer to classify and regress, and determining a target detection result.
In an embodiment, the system further includes a target detection network model training module, configured to:
acquiring a first feature map of the marked image by using the convolutional neural network; the marked image comprises marked targets marked by a marking frame;
acquiring a second feature map of the first feature map by utilizing the RFB, and determining a fusion feature map according to the first feature map and the second feature map;
determining feature vectors corresponding to each level in the fusion feature map by using the FPN, and constructing a feature vector set;
determining a candidate frame corresponding to each feature vector in the feature vector set by utilizing the RPN;
mapping the fusion feature map and the candidate frame through the ROI layer to determine a prediction result; the prediction result comprises a detection target and a candidate frame surrounding the detection target;
and training the target detection network model according to the candidate frame surrounding the detection target in the prediction result and the mark frame until the matching degree of the candidate frame in the prediction result and the mark frame reaches a set threshold value and the detection target in the prediction result is matched with the marked target marked by the mark frame, and determining the trained target detection network model.
Those skilled in the art will appreciate that the various aspects of the present disclosure may be implemented as a system, method, or program product. Accordingly, various aspects of the disclosure may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein generally as a "circuit," "module," or "system."
An electronic device 2000 according to such an embodiment of the present disclosure is described below with reference to fig. 20. The electronic device 2000 illustrated in fig. 20 is merely an example, and should not be construed to limit the functionality and scope of use of embodiments of the present disclosure in any way.
As shown in fig. 20, the electronic device 2000 is embodied in the form of a general purpose computing device. Components of the electronic device 2000 may include, but are not limited to: the at least one processing unit 2010, the at least one memory unit 2020, and a bus 2030 connecting the different system components (including the memory unit 2020 and the processing unit 2010).
Wherein the storage unit stores program code that is executable by the processing unit 2010 such that the processing unit 2010 performs steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section of the present specification. For example, the processing unit 2010 may perform the following steps of the method embodiments described above: inputting an image to be detected into a trained target detection network model; the trained object detection network model comprises: convolutional neural network, region proposal network RPN, region of interest ROI layer; a plurality of layers of convolution layers of the convolution neural network are cascaded with receptive field modules RFB; the RPN is configured with a feature pyramid network FPN; acquiring a first feature map of the image to be detected by using the convolutional neural network; acquiring a second feature map of the first feature map by utilizing the RFB, and determining a fusion feature map according to the first feature map and the second feature map, wherein the fusion feature map has a plurality of levels; determining feature vectors corresponding to each level in the fusion feature map by using the FPN, and constructing a feature vector set; determining a candidate frame corresponding to each feature vector in the feature vector set by utilizing the RPN; and mapping the fusion feature map and the candidate frame through the ROI layer to determine a target detection result, wherein the target detection result comprises a detection target and the candidate frame surrounding the detection target.
The storage unit 2020 may include readable media in the form of volatile storage units such as random access memory unit (RAM) 20201 and/or cache memory unit 20202, and may further include read only memory unit (ROM) 20203.
The storage unit 2020 may also include a program/utility 20204 having a set (at least one) of program modules 20205, such program modules 20205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The bus 2030 may be one or more of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, a graphics accelerator port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 2000 may also be in communication with one or more external devices 2040 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 2000, and/or any device (e.g., router, modem, etc.) that enables the electronic device 2000 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 2050. Also, the electronic device 2000 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 2060. As shown, the network adapter 2060 communicates with other modules of the electronic device 2000 via the bus 2030. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with the electronic device 2000, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a terminal device, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowcharts may be implemented as a computer program product comprising: a computer program which, when executed by a processor, implements the above-described object detection method.
In an exemplary embodiment of the present disclosure, a computer-readable storage medium, which may be a readable signal medium or a readable storage medium, is also provided. On which a program product is stored which enables the implementation of the method described above of the present disclosure. In some possible implementations, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the disclosure as described in the "exemplary methods" section of this specification, when the program product is run on the terminal device.
More specific examples of the computer readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In this disclosure, a computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Alternatively, the program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In particular implementations, the program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
Furthermore, although the steps of the methods in the present disclosure are depicted in a particular order in the drawings, this does not require or imply that the steps must be performed in that particular order or that all illustrated steps be performed in order to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
From the description of the above embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, including several instructions to cause a computing device (may be a personal computer, a server, a mobile terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
Claims (15)
1. A method of detecting an object, comprising:
inputting an image to be detected into a trained target detection network model; the trained object detection network model comprises: convolutional neural network, region proposal network RPN, region of interest ROI layer; a plurality of layers of convolution layers of the convolution neural network are cascaded with receptive field modules RFB; the RPN is configured with a feature pyramid network FPN;
acquiring a first feature map of the image to be detected by using the convolutional neural network;
acquiring a second feature map of the first feature map by utilizing the RFB, and determining a fusion feature map according to the first feature map and the second feature map, wherein the fusion feature map has a plurality of levels;
determining feature vectors corresponding to each level in the fusion feature map by using the FPN, and constructing a feature vector set;
determining a candidate frame corresponding to each feature vector in the feature vector set by utilizing the RPN;
and mapping the fusion feature map and the candidate frame through the ROI layer to determine a target detection result, wherein the target detection result comprises a detection target and the candidate frame surrounding the detection target.
2. The target detection method according to claim 1, further comprising:
Performing maximum pooling downsampling on the feature vector with the highest level in the feature vector set, and determining an updated feature vector; the feature vectors in the feature vector set are ranked in a hierarchical order according to resolution, and the resolution of the feature vector with the highest hierarchy is the smallest;
the update vector is configured at a higher level than the feature vector with the highest level to update the feature vector set.
3. The target detection method according to claim 1, wherein the convolutional neural network comprises: the resolution ratio of the first level convolution layer to the N level convolution layer is sequentially reduced;
the receptive field module RFB block comprises: a base receptive field module and an optimized receptive field module;
the convolutional neural network has a receptive field module RFB in cascade with a plurality of hierarchical convolutional layers, comprising:
the second level convolution layer cascades the optimized receptive field modules, and each of the third level convolution layer to the N-1 level convolution layer cascades the basic receptive field modules respectively.
4. The method of claim 3, wherein the optimized receptive field module is structurally configured with more convolution kernels than the base receptive field module, the configured more convolution kernels having a smaller size than the convolution kernels of the base receptive field module.
5. The method of claim 3, wherein acquiring the first feature map of the image to be detected using the convolutional neural network comprises:
respectively convolving the image to be detected by using a first level convolution layer to an Nth level convolution layer of the convolutional neural network, obtaining a feature map level corresponding to each level convolution layer, and determining a first feature map; the hierarchy of the first feature map corresponds to a hierarchy of convolutional layers of the convolutional neural network.
6. The object detection method according to claim 3, wherein acquiring a second feature map of the first feature map using the RFB, determining a fusion feature map from the first feature map and the second feature map, comprises:
acquiring second feature graphs corresponding to second to Nth levels in the first feature graphs by utilizing the RFB;
and fusing the first feature map containing the first level to the Nth level and the second feature map containing the second level to the Nth level, and determining a fused feature map, wherein the number of the levels of the fused feature map is equal to that of the second feature map.
7. The target detection method according to claim 6, wherein fusing the first feature map comprising the first to Nth levels with the second feature map comprising the second to Nth levels to determine the fusion feature map comprises:
convolving, by the second-level convolution layer, the first-level first feature map obtained by convolving the image to be detected with the first-level convolution layer, to determine a second-level first feature map;
extracting a first-level second feature map from the second-level first feature map by using the optimized receptive field module cascaded with the second-level convolution layer; fusing the second-level first feature map with the first-level second feature map, and then convolving the fusion result with the third-level convolution layer to determine a third-level first feature map;
extracting a second-level second feature map from the third-level first feature map by using the basic receptive field module cascaded with the third-level convolution layer; fusing the third-level first feature map with the second-level second feature map, and then convolving the fusion result with the fourth-level convolution layer to determine a fourth-level first feature map;
cyclically executing the extraction process of the fourth-level first feature map for the remaining levels of convolution layers and the basic receptive field modules corresponding to those levels, and determining the fifth-level to Nth-level first feature maps;
and determining the second-level, third-level and fourth-level first feature maps together with the fifth-level to Nth-level first feature maps as the fusion feature maps of a plurality of levels.
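The PyTorch sketch below traces the dataflow of claim 7 with N = 5 and a constant width: each level's convolution layer consumes the previous level's first feature map fused (here by element-wise addition, an assumption) with the second feature map extracted by the RFB cascaded at that previous level. Plain 3x3 convolutions stand in for the RFB modules; the real modules would follow the structure sketched after claim 4.

```python
import torch
import torch.nn as nn

def conv_stage(cin, cout):
    """One level's convolution layer: a stride-2 conv that halves resolution."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1), nn.ReLU(inplace=True))

class FusionBackbone(nn.Module):
    """Claim-7 loop, N = 5: first_i = stage_i(first_{i-1} fused with second),
    second = rfb_i(first_i) for levels 2..N-1; the first feature maps of
    levels 2..N are returned as the fusion feature maps."""
    def __init__(self, C=64, N=5):
        super().__init__()
        self.N = N
        self.stage1 = conv_stage(3, C)
        self.stages = nn.ModuleList(conv_stage(C, C) for _ in range(2, N + 1))
        # level 2 would carry the optimized RFB, levels 3..N-1 the basic RFB (claim 3);
        # plain 3x3 convs stand in for both here
        self.rfbs = nn.ModuleList(nn.Conv2d(C, C, 3, padding=1) for _ in range(2, N))

    def forward(self, image):
        fused_input = self.stage1(image)            # first-level first feature map
        fusion_maps = {}
        for i, stage in enumerate(self.stages, start=2):
            first_i = stage(fused_input)            # i-th level first feature map
            fusion_maps[i] = first_i
            if i < self.N:
                second = self.rfbs[i - 2](first_i)  # (i-1)-th level second feature map
                fused_input = first_i + second      # fuse before the next level's convolution
        return fusion_maps

out = FusionBackbone()(torch.randn(1, 3, 256, 256))
print({level: tuple(t.shape[-2:]) for level, t in out.items()})
# {2: (64, 64), 3: (32, 32), 4: (16, 16), 5: (8, 8)}
```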
8. The target detection method according to claim 6, wherein the second-level to Nth-level convolution layers of the convolutional neural network are connected from bottom to top;
and the feature pyramid network FPN comprises third feature maps of a plurality of levels connected from top to bottom, wherein the third feature map of each level is laterally connected with the convolution layer of the corresponding level among the second-level to Nth-level convolution layers of the convolutional neural network.
9. The target detection method according to claim 8, wherein determining the feature vector corresponding to each level in the fusion feature map by using the FPN and constructing the feature vector set comprises:
determining the third feature map of the next level according to the corresponding level of the fusion feature map corresponding to the convolution layer laterally connected with the topmost third feature map in the FPN, wherein the topmost third feature map is obtained by convolving the Nth-level first feature map corresponding to the Nth-level convolution layer;
cyclically executing this step until the bottommost third feature map is reached, obtaining third feature maps of a plurality of levels;
convolving each of the plurality of levels of third feature maps respectively to determine a plurality of feature vectors;
and constructing the feature vector set from the plurality of feature vectors.
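The following PyTorch sketch shows one conventional realization of claims 8-9 for fusion feature maps at levels 2-5: a lateral 1x1 convolution on each level, a top-down pass that upsamples the higher-level third feature map and adds it to the lateral output, and a final 3x3 convolution per level that yields the feature vectors of the set. The common width, nearest-neighbour upsampling and level range are assumptions following the usual FPN construction rather than the patent's exact layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNSketch(nn.Module):
    """Top-down third feature maps with lateral connections (claims 8-9)."""
    def __init__(self, in_channels=64, out_channels=256, levels=(2, 3, 4, 5)):
        super().__init__()
        self.levels = levels
        self.lateral = nn.ModuleDict({str(l): nn.Conv2d(in_channels, out_channels, 1) for l in levels})
        self.smooth = nn.ModuleDict({str(l): nn.Conv2d(out_channels, out_channels, 3, padding=1) for l in levels})

    def forward(self, fusion_maps):
        top = max(self.levels)
        third = {top: self.lateral[str(top)](fusion_maps[top])}   # topmost third feature map
        for l in sorted(self.levels, reverse=True)[1:]:           # remaining levels, top to bottom
            upsampled = F.interpolate(third[l + 1], size=fusion_maps[l].shape[-2:], mode="nearest")
            third[l] = self.lateral[str(l)](fusion_maps[l]) + upsampled
        # convolve each third feature map to obtain the feature vector set
        return {l: self.smooth[str(l)](third[l]) for l in self.levels}

fusion_maps = {l: torch.randn(1, 64, 2 ** (9 - l), 2 ** (9 - l)) for l in (2, 3, 4, 5)}
vectors = FPNSketch()(fusion_maps)
print({l: tuple(v.shape[-2:]) for l, v in vectors.items()})
# {2: (128, 128), 3: (64, 64), 4: (32, 32), 5: (16, 16)}
```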
10. The target detection method according to claim 1, wherein determining the candidate frame corresponding to each feature vector in the feature vector set by using the RPN comprises:
sliding a window of a set size over each feature vector in the feature vector set along a set trajectory by using the RPN, identifying a detection target on each feature vector, and configuring a candidate frame surrounding the detection target.
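As a hedged illustration of claim 10 in PyTorch, a 3x3 convolution stands in for the sliding window with a set size and trajectory: at every window position it produces, per anchor, an objectness score (is a detection target present?) and offsets for a candidate frame. The anchor count and head widths are assumptions matching the common region proposal network layout, not the patent's exact values.

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sliding-window proposal head applied to one feature vector (claim 10)."""
    def __init__(self, channels=256, num_anchors=3):
        super().__init__()
        self.window = nn.Conv2d(channels, channels, 3, padding=1)     # sliding window
        self.objectness = nn.Conv2d(channels, num_anchors, 1)         # target / background score
        self.frame_deltas = nn.Conv2d(channels, num_anchors * 4, 1)   # candidate frame offsets

    def forward(self, feature_vector):
        h = torch.relu(self.window(feature_vector))
        return self.objectness(h), self.frame_deltas(h)

# applied independently to every level of the feature vector set
head = RPNHead()
scores, deltas = head(torch.randn(1, 256, 32, 32))
print(scores.shape, deltas.shape)  # [1, 3, 32, 32] [1, 12, 32, 32]
```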
11. The target detection method according to claim 1, wherein the trained target detection network model further comprises a fully connected layer;
and mapping the fusion feature map and the candidate frame through the ROI layer to determine the target detection result comprises:
mapping the candidate frame to the corresponding position in the fusion feature map of the corresponding level by using the ROI layer, and determining a low-dimensional vector;
and inputting the low-dimensional vector into the fully connected layer for classification and regression, and determining the target detection result.
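The sketch below uses torchvision's roi_align as a stand-in for the ROI layer of claim 11: each candidate frame, given in image coordinates, is mapped onto the fusion feature map of its level, pooled to a fixed 7x7 grid, flattened into a low-dimensional vector, and passed through fully connected layers for classification and frame regression. The stride, pooled size, hidden width and class count are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

num_classes = 5
fusion_map = torch.randn(1, 64, 32, 32)                  # one level, assumed stride 8 w.r.t. the image
frames = torch.tensor([[0, 16.0, 16.0, 112.0, 112.0],    # [batch_index, x1, y1, x2, y2]
                       [0, 40.0, 8.0, 200.0, 120.0]])

# map each candidate frame to its position in the fusion feature map and pool to 7x7
pooled = roi_align(fusion_map, frames, output_size=(7, 7), spatial_scale=1 / 8)
low_dim = pooled.flatten(start_dim=1)                    # one low-dimensional vector per frame

fc = nn.Sequential(nn.Linear(64 * 7 * 7, 256), nn.ReLU())
cls_head = nn.Linear(256, num_classes)                   # classification
reg_head = nn.Linear(256, 4)                             # frame regression
h = fc(low_dim)
print(cls_head(h).shape, reg_head(h).shape)              # [2, 5] [2, 4]
```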
12. The target detection method according to claim 1, wherein the training process of the target detection network model comprises:
acquiring a first feature map of a marked image by using the convolutional neural network, wherein the marked image comprises a marked target annotated by a marking frame;
acquiring a second feature map of the first feature map by utilizing the RFB, and determining a fusion feature map according to the first feature map and the second feature map;
determining feature vectors corresponding to each level in the fusion feature map by using the FPN, and constructing a feature vector set;
determining a candidate frame corresponding to each feature vector in the feature vector set by utilizing the RPN;
mapping the fusion feature map and the candidate frame through the ROI layer to determine a prediction result; the prediction result comprises a detection target and a candidate frame surrounding the detection target;
and training the target detection network model according to the candidate frame surrounding the detection target in the prediction result and the marking frame, until the matching degree between the candidate frame in the prediction result and the marking frame reaches a set threshold and the detection target in the prediction result matches the marked target annotated by the marking frame, thereby determining the trained target detection network model.
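Claim 12 does not fix how the matching degree between a candidate frame and a marking frame is measured; intersection-over-union is the usual choice, and the small self-contained sketch below computes it for two illustrative frames against an assumed threshold of 0.5.

```python
def matching_degree(candidate, marking):
    """Intersection-over-union between a candidate frame and a marking frame,
    both given as (x1, y1, x2, y2). A hedged stand-in for the 'matching degree'
    of claim 12; the patent does not specify the exact metric."""
    ix1, iy1 = max(candidate[0], marking[0]), max(candidate[1], marking[1])
    ix2, iy2 = min(candidate[2], marking[2]), min(candidate[3], marking[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_c = (candidate[2] - candidate[0]) * (candidate[3] - candidate[1])
    area_m = (marking[2] - marking[0]) * (marking[3] - marking[1])
    return inter / (area_c + area_m - inter + 1e-9)

iou = matching_degree((20, 20, 120, 120), (30, 25, 130, 115))
threshold = 0.5   # illustrative set threshold
print(round(iou, 3), iou >= threshold)  # 0.743 True
```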
13. A target detection device, comprising:
an image input module, used for inputting the image to be detected into the trained target detection network model, wherein the trained target detection network model comprises a convolutional neural network, a region proposal network RPN, and a region of interest ROI layer; a plurality of levels of convolution layers of the convolutional neural network are cascaded with receptive field modules RFB; and the RPN is configured with a feature pyramid network FPN;
a first feature map determining module, used for acquiring a first feature map of the image to be detected by using the convolutional neural network;
a fusion feature map determining module, used for acquiring a second feature map of the first feature map by using the RFB, and determining a fusion feature map according to the first feature map and the second feature map, wherein the fusion feature map has a plurality of levels;
a feature vector set construction module, used for determining feature vectors corresponding to each level in the fusion feature map by using the FPN to construct a feature vector set;
a candidate frame determining module, used for determining a candidate frame corresponding to each feature vector in the feature vector set by using the RPN;
and a target detection result determining module, used for mapping the fusion feature map and the candidate frame through the ROI layer to determine a target detection result, wherein the target detection result comprises a detection target and the candidate frame surrounding the detection target.
14. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the target detection method of any one of claims 1 to 12 via execution of the executable instructions.
15. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the target detection method according to any one of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311088379.1A CN117132761A (en) | 2023-08-25 | 2023-08-25 | Target detection method and device, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117132761A (en) | 2023-11-28 |
Family
ID=88859436
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311088379.1A Pending CN117132761A (en) | 2023-08-25 | 2023-08-25 | Target detection method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117132761A (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110111313A (en) * | 2019-04-22 | 2019-08-09 | 腾讯科技(深圳)有限公司 | Medical image detection method and relevant device based on deep learning |
CN110659664A (en) * | 2019-08-02 | 2020-01-07 | 杭州电子科技大学 | SSD-based method for high-precision identification of small objects |
CN110852330A (en) * | 2019-10-23 | 2020-02-28 | 天津大学 | Behavior identification method based on single stage |
EP4109343A1 (en) * | 2020-02-21 | 2022-12-28 | Huawei Technologies Co., Ltd. | Perception network architecture search method and device |
AU2021101300A4 (en) * | 2021-03-12 | 2021-05-06 | Pabitha, C. MS | A hybrid system for skin burn image classification and severity grading and its method thereof |
CN113392960A (en) * | 2021-06-10 | 2021-09-14 | 电子科技大学 | Target detection network and method based on mixed hole convolution pyramid |
CN113850284A (en) * | 2021-07-04 | 2021-12-28 | 天津大学 | Multi-operation detection method based on multi-scale feature fusion and multi-branch prediction |
CN114743094A (en) * | 2022-03-17 | 2022-07-12 | 中山大学 | Image target detection method and device based on multi-receptive-field characteristic pyramid |
CN114898372A (en) * | 2022-06-06 | 2022-08-12 | 桂林电子科技大学 | Vietnamese scene character detection method based on edge attention guidance |
CN115018934A (en) * | 2022-07-05 | 2022-09-06 | 浙江大学 | Three-dimensional image depth detection method combining cross skeleton window and image pyramid |
CN116363105A (en) * | 2023-04-04 | 2023-06-30 | 大连大学 | Method for identifying and positioning high-speed rail contact net parts based on Faster R-CNN |
Non-Patent Citations (4)
Title |
---|
SHUQIN TU et al.: "Instance Segmentation Based on Mask Scoring R-CNN for Group-housed Pigs", 2020 International Conference on Computer Engineering and Application (ICCEA), 29 May 2020 (2020-05-29) * |
努力努力再努力TQ (blog): "Deep learning networks | Details of how FPN is combined with Fast R-CNN and the RPN network" [in Chinese], pages 1 - 8, Retrieved from the Internet <URL:https://blog.csdn.net/u012426298/article/details/81516213> * |
胖虎记录学习 (blog): "Object detection study: FPN (Feature Pyramid Network) for multi-scale detection" [in Chinese], pages 1 - 8, Retrieved from the Internet <URL:https://blog.csdn.net/panghuzhenbang/article/details/125182721> * |
陈景明: "Research on Target Detection Methods Based on an Improved FPN Algorithm" [in Chinese], China Masters' Theses Full-text Database (Information Science and Technology), no. 1, 15 January 2022 (2022-01-15), pages 1 - 50 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||