WO2021208726A1 - Attention mechanism-based target detection method and apparatus, and computer device - Google Patents

Attention mechanism-based target detection method and apparatus, and computer device

Info

Publication number
WO2021208726A1
WO2021208726A1 PCT/CN2021/083935 CN2021083935W WO2021208726A1 WO 2021208726 A1 WO2021208726 A1 WO 2021208726A1 CN 2021083935 W CN2021083935 W CN 2021083935W WO 2021208726 A1 WO2021208726 A1 WO 2021208726A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
feature
layer
image
detected
Prior art date
Application number
PCT/CN2021/083935
Other languages
English (en)
French (fr)
Inventor
张国辉
杨国青
宋晨
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021208726A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Definitions

  • This application relates to the technical field of target detection, and in particular to a target detection method, device and computer equipment based on an attention mechanism.
  • in existing target detection technology, the feature pyramid is used to upsample the high-level features and concatenate them with the adjacent low-level features to perform feature fusion.
  • when a small-target detection task is to be performed, the large-size feature maps in the feature pyramid are used for target detection; when a large-target detection task is to be performed, the small-size feature maps in the feature pyramid are used for target detection.
  • although feature-pyramid-based target detection achieves good accuracy, the inventor realized that existing target detection technology still cannot reach the desired detection accuracy. How to improve, on the basis of the feature pyramid, the detection accuracy when performing different target detection tasks is therefore the problem this application sets out to solve.
  • the embodiments of the present application provide an attention mechanism-based target detection method, device, and computer equipment, aiming to solve the problem that the detection accuracy of different target detection tasks based on feature pyramids cannot meet the detection requirements in the prior art.
  • an embodiment of the present application provides a target detection method based on an attention mechanism, which includes:
  • an embodiment of the present application provides a target detection device based on an attention mechanism, which includes:
  • the receiving unit is used to receive the image to be detected input by the user
  • the first generating unit is configured to input the image to be detected into a preset convolutional neural network model, and extract a multi-layer feature map of the image to be detected;
  • a second generating unit configured to weight the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map
  • a third generating unit configured to generate a feature pyramid of the image to be detected according to the multi-layer feature map
  • a fusion unit configured to fuse the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid
  • An acquiring unit configured to acquire a feature map matching the target image in the image to be detected from the fused feature pyramid
  • the target detection unit is configured to perform target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
  • an embodiment of the present application further provides a computer device, including a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, performs the following steps:
  • the embodiments of the present application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
  • the embodiments of the present application provide a target detection method, device and computer equipment based on an attention mechanism.
  • with this method, the weights of different feature layers can be adjusted adaptively when performing a target detection task, the final fused features are better suited to the target detection task, and detection accuracy can be greatly improved at a small additional time cost.
  • FIG. 1 is a schematic flowchart of a target detection method based on an attention mechanism provided by an embodiment of the application
  • FIG. 2 is a schematic diagram of a sub-flow of a target detection method based on an attention mechanism provided by an embodiment of the application;
  • FIG. 3 is a schematic diagram of another sub-flow of the target detection method based on the attention mechanism provided by an embodiment of the application;
  • FIG. 4 is a schematic diagram of another sub-flow of the target detection method based on the attention mechanism provided by an embodiment of the application;
  • FIG. 5 is a schematic diagram of another sub-flow of a target detection method based on an attention mechanism provided by an embodiment of the application;
  • FIG. 6 is a schematic block diagram of a target detection device based on an attention mechanism provided by an embodiment of the application.
  • FIG. 7 is a schematic block diagram of subunits of an attention mechanism-based target detection device provided by an embodiment of this application.
  • FIG. 8 is a schematic block diagram of another subunit of an attention mechanism-based target detection apparatus provided by an embodiment of the application.
  • FIG. 9 is a schematic block diagram of another subunit of the target detection device based on the attention mechanism provided by an embodiment of the application.
  • FIG. 10 is a schematic block diagram of another subunit of the target detection device based on the attention mechanism provided by an embodiment of the application.
  • FIG. 11 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • FIG. 1 is a schematic flowchart of a target detection method based on an attention mechanism provided by an embodiment of the application.
  • the target detection method based on the attention mechanism is deployed and runs on a server.
  • after the server receives an image to be detected sent by a smart terminal device such as a laptop or tablet computer, it performs feature extraction on the image to obtain the multi-layer feature maps of the image to be detected.
  • the multi-layer feature maps are then weighted according to a preset attention mechanism to obtain a weighted feature map corresponding to each layer of the multi-layer feature maps; each layer of the multi-layer feature maps is then convolved again to obtain the feature pyramid of the image to be detected, and finally the weighted feature map is fused with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid.
  • the fused feature pyramid is better suited to detecting the target image and can greatly improve detection accuracy at a small additional time cost.
  • the target detection method based on the attention mechanism will be described in detail below. As shown in Figure 1, the method includes the following steps S110 to S170.
  • S110 Receive an image to be detected input by a user.
  • the image to be detected contains feature information of the target image
  • the user sends the image to be detected to the server through a terminal such as a laptop, a tablet, a smart phone, etc.
  • the server receives the image to be detected.
  • the target detection method based on the attention mechanism can be executed to obtain the fused feature pyramid of the image to be detected, so as to adapt to different target detection tasks.
  • the image to be detected is input into a preset convolutional neural network model, and a multi-layer feature map of the image to be detected is extracted.
  • the convolutional neural network model is pre-trained and used to perform feature extraction on the input image to be detected to obtain its multi-layer feature maps; after the image to be detected is input into the model, it passes through several convolutional layers, pooling layers and activation-function layers in sequence, and from bottom to top the channel count of each layer of the multi-layer feature maps gradually increases while the spatial size decreases, the features extracted at each layer being fed into the next layer as input.
  • the multi-layer feature maps are composed of the feature maps of the different convolution stages the image passes through, and from bottom to top their semantic information becomes gradually richer while their resolution gradually decreases.
  • the feature map at the bottom of the multi-layer feature maps has the least semantic information and the highest resolution and is not suitable for detecting small targets; the feature map at the top has the richest semantics and the lowest resolution and is not suitable for detecting large targets.
  • the convolutional neural network may be a deep convolutional neural network such as a VGG (Visual Geometry Group) network or a deep ResNet (Residual Network).
  • S130 Weight the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map.
  • the multi-layer feature map is weighted according to a preset attention mechanism to obtain a weighted feature map.
  • the attention mechanism is essentially similar to the human selective visual attention mechanism. The core idea is to select information that is more critical to the current task goal from a large number of information.
  • the attention mechanism is used to obtain the weight of each layer of the multi-layer feature maps; after these weights are obtained, the feature values of each layer are multiplied by their corresponding weights and then added, completing the weighting of the multi-layer feature maps and yielding the weighted feature map.
  • step S130 includes: sub-step S131 and sub-step S132.
  • the weight of each layer of the feature map in the multi-layer feature map is obtained from the convolutional neural network model according to the attention mechanism.
  • the attention mechanism in the embodiment of this application is a spatial attention mechanism.
  • after the image to be detected is input into the convolutional neural network model and the multi-layer feature maps are obtained, each layer of the multi-layer feature maps has a corresponding weight value; since the output of each layer is a real number while the weights of all layers must sum to 1, the weight values obtained according to the attention mechanism are normalized to give the weight of each layer, where normalization maps each weight value into (0, 1).
  • in the embodiment of this application, the Sigmoid function is used to normalize the weight values of each layer, yielding the weight of each layer of the multi-layer feature maps.
  • the multi-layer feature maps are weighted according to the weight of each layer to obtain the weighted feature map. Specifically, after the per-layer weights are obtained through the attention mechanism, the feature values of each layer of the multi-layer feature maps are multiplied by their corresponding weights and then added, giving a feature map of moderate size and semantic richness, i.e. the weighted feature map.
  • the feature value of the weighted feature map is computed as F = f_1×w_1 + f_2×w_2 + … + f_i×w_i, where f_i is the feature value of one layer of the multi-layer feature maps and w_i is the weight of that layer.
  • S140 Generate a feature pyramid of the image to be detected according to the multi-layer feature map.
  • the feature pyramid of the image to be detected is generated according to the multi-layer feature map. Specifically, the feature pyramid is constructed from top to bottom through the multi-layer feature map.
  • the feature pyramid can be used for target detection for different tasks. When a small target in the image to be detected needs to be detected, only the large-size feature map in the feature pyramid is used for target recognition to obtain rich semantic information; when a large target in the image to be detected needs to be detected At this time, only a small-sized feature map in the feature pyramid is used for identification to obtain rich semantic information.
  • step S140 includes sub-steps S141 and S142.
  • S141 Perform convolution on each layer of the feature map in the multi-layer feature map according to a preset convolution kernel to obtain a multi-layer feature map after convolution.
  • after each layer of the multi-layer feature maps is convolved with the convolution kernel, all layers have an equal number of channels, which facilitates the subsequent construction of a feature pyramid from the multi-layer feature maps.
  • the size of the convolution kernel can be set according to actual conditions, and there is no limitation here.
  • for example, if the layers of the multi-layer feature maps from top to bottom are C1, C2, C3, C4 and C5, each of C1–C5 is convolved with a 1×1 convolution kernel so that they all have the same number of channels after convolution.
  • S142 Generate a feature pyramid of the image to be detected according to the multi-layer feature map after the convolution.
  • the feature pyramid of the image to be detected is generated from the convolved multi-layer feature maps. Specifically, all layers of the convolved multi-layer feature maps have an equal channel count, and the convolved multi-layer feature maps have the same number of layers as the feature pyramid, with corresponding layers equal in size.
  • step S142 includes sub-steps S1421 and S1422.
  • S1421 construct a feature map of the top layer of the feature pyramid according to the feature map of the top layer of the multi-layer feature map after the convolution.
  • the top-layer feature map of the feature pyramid is constructed from the top-layer feature map of the convolved multi-layer feature maps. Specifically, the top layer of the convolved multi-layer feature maps has the smallest size and the richest semantics among the convolved maps, so it can be used directly as the top-layer feature map of the feature pyramid.
  • the feature maps below the top layer of the feature pyramid are constructed from the top-layer feature map of the feature pyramid.
  • the specific process is: the top layer of the feature pyramid is sampled and added to the feature map adjacent to the topmost layer in the convolved multi-layer feature maps, giving the feature map adjacent to the top layer in the feature pyramid; during the addition, the adjacent feature map from the convolved multi-layer feature maps must first be scaled to twice its original size so that the two maps match before they can be added, and proceeding downward layer by layer in this way constructs the feature pyramid.
  • for example: the convolved C1 is taken as the top-layer feature map P1 of the feature pyramid; P1 is sampled while the convolved C2 is scaled to twice its original size, and adding the sampled P1 to the rescaled C2 gives P2, the layer adjacent to P1 in the feature pyramid; by analogy, the feature maps of the feature pyramid from top to bottom are P1, P2, P3, P4 and P5.
  • the weighted feature maps are respectively fused with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid.
  • the object of the convolution operation of the convolutional neural network model is a set of multi-dimensional matrices.
  • when the image to be detected is input into the convolutional neural network model, each layer of the resulting multi-layer feature maps is a set of multi-dimensional matrices; so is each layer of the feature pyramid constructed from the multi-layer feature maps, and so is the weighted feature map obtained by multiplying each layer's feature values by the corresponding weights and adding them.
  • in the process of fusing the weighted feature map with each layer of the feature pyramid, the corresponding matrices are therefore combined, i.e. the weighted feature map is concatenated end-to-end with each layer of the feature pyramid, and the resulting new set of multi-dimensional matrices is the fused feature pyramid.
  • each layer of the fused feature pyramid contains richer semantic information than the corresponding layer of the original feature pyramid.
  • S160 Acquire a feature map matching the target image in the image to be detected from the merged feature pyramid.
  • a feature map matching the target image in the image to be detected is obtained from the fused feature pyramid. Specifically, according to the target size of the target image in the image to be detected, a feature map matching the target image in the image to be detected is obtained from the fused feature pyramid.
  • when sending the image to be detected, the user also sends instruction information of the detection request for target detection on the image; from this instruction information the target size of the target image in the image to be detected can be obtained.
  • according to the target size, a matching feature map is selected from the fused feature pyramid for target detection, and that feature map is then input into a pre-trained target detection model to obtain the target image.
  • S170 Perform target detection on a feature map matching the target image according to a preset target detection model, to obtain a target image in the image to be detected.
  • the target detection model is a model for extracting a plurality of rectangular bounding boxes, i.e. the multiple candidate frames, from the feature map matching the target image in the image to be detected.
  • after the feature map matching the target image in the image to be detected is input into the target detection model, the model outputs the multiple candidate frames, which include the target detection frame; the multiple candidate frames are candidate frames related to the target image in the image to be detected and each includes feature information of part or all of the target image, from which the target image in the image to be detected is obtained.
  • step S170 includes sub-steps S171 and S172.
  • the feature map matching the target image in the image to be detected is input into a preset region generation network model to obtain multiple candidate frames.
  • the region generation network model is pre-trained and used to extract, from the feature map matching the target image in the image to be detected, multiple candidate frames containing the target detection frame.
  • after that feature map is input into the region generation network model, multiple target detection candidate frames are generated by size transformations centred on the anchor point of a sliding window of a preset size; in the embodiment of the present application, the size of the sliding window is 3×3.
  • the target detection frame is screened out from the multiple candidate frames according to a preset non-maximum value suppression algorithm to obtain the target image.
  • the non-maximum value suppression algorithm is referred to as the NMS algorithm for short, and is often used in edge detection, face detection, target detection, etc. in computer vision.
  • the non-maximum value suppression algorithm is used here to perform target detection on the image to be detected: since a large number of candidate frames are generated at the same target position during detection and these frames may overlap one another, the algorithm is needed to find the target detection frame among the multiple candidate frames.
  • the non-maximum value suppression algorithm performs screening according to the confidence of each candidate frame in the multiple candidate frames to obtain the target detection frame.
  • the specific process of the non-maximum suppression algorithm is as follows: first, the candidate frames are sorted from high to low confidence and those whose confidence is below the preset first threshold are eliminated; the area of each remaining frame is computed, the IoU of the highest-confidence remaining frame with each of the other remaining frames is calculated, and any frame whose IoU exceeds the preset second threshold is eliminated; the frame finally retained is the target detection frame, through which the target image is obtained.
  • IoU (Intersection over Union) is a concept used in target detection: it expresses the overlap between a candidate frame and the original labelled frame, i.e. the ratio of the area of their intersection to the area of their union.
  • the preset first threshold is set to 0.3
  • the preset second threshold is set to 0.5
  • in the target detection method based on the attention mechanism provided by the embodiment of the present application, an image to be detected input by the user is received; the image to be detected is input into a preset convolutional neural network model and its multi-layer feature maps are extracted; the multi-layer feature maps are weighted according to a preset attention mechanism to obtain a weighted feature map; a feature pyramid of the image to be detected is generated from the multi-layer feature maps; the weighted feature map is fused with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid; a feature map matching the target image in the image to be detected is obtained from the fused feature pyramid; and target detection is performed on that feature map according to a preset target detection model to obtain the target image in the image to be detected.
  • the embodiment of the present application also provides an attention mechanism-based target detection device 100, which is used to execute any embodiment of the aforementioned attention mechanism-based target detection method.
  • FIG. 6 is a schematic block diagram of a target detection apparatus 100 based on an attention mechanism provided by an embodiment of the present application.
  • the target detection device 100 based on the attention mechanism includes a receiving unit 110, a first generating unit 120, a second generating unit 130, a third generating unit 140, a fusion unit 150, an acquiring unit 160 and a target detection unit 170.
  • the receiving unit 110 is configured to receive the image to be detected input by the user.
  • the first generating unit 120 is configured to input the image to be detected into a preset convolutional neural network model, and extract a multi-layer feature map of the image to be detected.
  • the second generating unit 130 is configured to weight the multi-layer feature map according to a preset attention mechanism to obtain a weighted feature map.
  • the second generating unit 130 includes a weight obtaining unit 131 and a fourth generating unit 132.
  • the weight obtaining unit 131 is configured to obtain the weight of each layer of the feature map in the multi-layer feature map from the convolutional neural network model according to the attention mechanism.
  • the fourth generating unit 132 is configured to weight the multi-layer feature map according to the weight of each layer of the multi-layer feature map to obtain the weighted feature map.
  • the third generating unit 140 is configured to generate a feature pyramid of the image to be detected according to the multi-layer feature map.
  • the third generating unit 140 includes: a convolution unit 141 and a fifth generating unit 142.
  • the convolution unit 141 is configured to convolve each layer of the feature map in the multi-layer feature map according to a preset convolution kernel to obtain a multi-layer feature map after convolution.
  • the fifth generating unit 142 is configured to generate a feature pyramid of the image to be detected according to the multi-layer feature map after the convolution.
  • the fifth generation unit 142 includes: a first construction unit 1421 and a second construction unit 1422.
  • the first construction unit 1421 is configured to construct the feature map of the top layer of the feature pyramid according to the feature map of the top layer of the multi-layer feature map after the convolution.
  • the second construction unit 1422 is configured to construct a feature map below the top level of the feature pyramid according to the feature map at the top level of the feature pyramid.
  • the fusion unit 150 is configured to fuse the weighted feature maps with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid.
  • the acquiring unit 160 is configured to acquire a feature map matching the target image in the image to be detected from the fused feature pyramid;
  • the target detection unit 170 is configured to perform target detection on the feature map matching the target image according to a preset target detection model, to obtain the target image in the image to be detected.
  • the target detection unit 170 includes: a sixth generating unit 171 and a screening unit 172.
  • the sixth generating unit 171 is configured to input the feature map matching the target image in the image to be detected into a preset region generation network model to obtain multiple candidate frames.
  • the screening unit 172 is configured to screen out the target detection frame from the multiple candidate frames according to a preset non-maximum value suppression algorithm to obtain the target image.
  • the target detection device 100 based on the attention mechanism provided by the embodiment of the present application is used to execute the method described above: receiving the image to be detected input by the user; inputting the image into a preset convolutional neural network model and extracting its multi-layer feature maps; weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map; generating a feature pyramid of the image from the multi-layer feature maps; fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid; obtaining, from the fused feature pyramid, a feature map matching the target image in the image to be detected; and performing target detection on that feature map according to a preset target detection model to obtain the target image in the image to be detected.
  • FIG. 11 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • when the computer program 5032 is executed, it can cause the processor 502 to execute the target detection method based on the attention mechanism.
  • the processor 502 is used to provide calculation and control capabilities, and support the operation of the entire device 500.
  • the internal memory 504 provides an environment for the operation of the computer program 5032 in the non-volatile storage medium 503.
  • when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the target detection method based on the attention mechanism.
  • the network interface 505 is used for network communication, such as providing data information transmission.
  • those skilled in the art will understand that the structure shown in FIG. 11 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the device 500 to which the solution is applied; the specific device 500 may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • the processor 502 is configured to run a computer program 5032 stored in a memory, so as to implement any embodiment of the above-mentioned target detection method based on the attention mechanism.
  • the computer program may be stored in a storage medium, and the storage medium may be a computer-readable storage medium.
  • the computer program is executed by at least one processor in the computer system to implement the process steps of the foregoing method embodiment.
  • the computer-readable storage medium may be non-volatile or volatile.
  • the storage medium stores a computer program that, when executed by a processor, implements any embodiment of the above-mentioned target detection method based on the attention mechanism.
  • the computer-readable storage medium may be a U disk, a mobile hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, or an optical disk, and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

An attention mechanism-based target detection method and apparatus, and a computer device. The method comprises: receiving an image to be detected input by a user; inputting the image to be detected into a convolutional neural network model and extracting multi-layer feature maps of the image; weighting the multi-layer feature maps according to an attention mechanism to obtain a weighted feature map; generating a feature pyramid of the image to be detected from the multi-layer feature maps; fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid; obtaining, from the fused feature pyramid, a feature map matching the target image; and performing target detection on the feature map matching the target image according to a target detection model to obtain the target image. Based on neural network technology in artificial intelligence, the method fuses the features of the convolutional output layers by introducing an attention mechanism, greatly improving accuracy across different target detection tasks.

Description

Attention mechanism-based target detection method and apparatus, and computer device
This application claims priority to the Chinese patent application No. 202011322670.7, filed with the Chinese Patent Office on November 23, 2020 and entitled "Attention mechanism-based target detection method, apparatus and computer device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the technical field of target detection, and in particular to an attention mechanism-based target detection method and apparatus and a computer device.
Background
In existing target detection technology, both the multi-layer feature fusion of the two-stage Faster RCNN and the multi-layer feature fusion of the single-stage YOLO use a feature pyramid in which the high-level features are upsampled and concatenated with the adjacent low-level features for feature fusion. When a small-target detection task is to be performed, the large-size feature maps in the feature pyramid are used for target detection; when a large-target detection task is to be performed, the small-size feature maps in the feature pyramid are used. Although feature-pyramid-based target detection achieves good accuracy, the inventor realized that existing target detection technology still cannot reach the desired detection accuracy. How to improve, on the basis of the feature pyramid, the detection accuracy when performing different target detection tasks is therefore the problem this application sets out to solve.
Summary
Embodiments of this application provide an attention mechanism-based target detection method and apparatus and a computer device, aiming to solve the prior-art problem that the detection accuracy of feature-pyramid-based detection for different target detection tasks cannot meet the detection requirements.
In a first aspect, an embodiment of this application provides an attention mechanism-based target detection method, comprising:
receiving an image to be detected input by a user;
inputting the image to be detected into a preset convolutional neural network model and extracting multi-layer feature maps of the image to be detected;
weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map;
generating a feature pyramid of the image to be detected from the multi-layer feature maps;
fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid;
obtaining, from the fused feature pyramid, a feature map matching a target image in the image to be detected;
performing target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
In a second aspect, an embodiment of this application provides an attention mechanism-based target detection apparatus, comprising:
a receiving unit configured to receive an image to be detected input by a user;
a first generating unit configured to input the image to be detected into a preset convolutional neural network model and extract multi-layer feature maps of the image to be detected;
a second generating unit configured to weight the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map;
a third generating unit configured to generate a feature pyramid of the image to be detected from the multi-layer feature maps;
a fusion unit configured to fuse the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid;
an acquiring unit configured to obtain, from the fused feature pyramid, a feature map matching a target image in the image to be detected;
a target detection unit configured to perform target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
In a third aspect, an embodiment of this application further provides a computer device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, performs the following steps:
receiving an image to be detected input by a user;
inputting the image to be detected into a preset convolutional neural network model and extracting multi-layer feature maps of the image to be detected;
weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map;
generating a feature pyramid of the image to be detected from the multi-layer feature maps;
fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid;
obtaining, from the fused feature pyramid, a feature map matching a target image in the image to be detected;
performing target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
receiving an image to be detected input by a user;
inputting the image to be detected into a preset convolutional neural network model and extracting multi-layer feature maps of the image to be detected;
weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map;
generating a feature pyramid of the image to be detected from the multi-layer feature maps;
fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid;
obtaining, from the fused feature pyramid, a feature map matching a target image in the image to be detected;
performing target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
Embodiments of this application provide an attention mechanism-based target detection method and apparatus and a computer device. With the above method, the weights of different feature layers can be adjusted adaptively when performing a target detection task, the final fused features are better suited to the target detection task, and detection accuracy can be greatly improved at a small additional time cost.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the attention mechanism-based target detection method provided by an embodiment of this application;
FIG. 2 is a schematic sub-flowchart of the attention mechanism-based target detection method provided by an embodiment of this application;
FIG. 3 is another schematic sub-flowchart of the attention mechanism-based target detection method provided by an embodiment of this application;
FIG. 4 is another schematic sub-flowchart of the attention mechanism-based target detection method provided by an embodiment of this application;
FIG. 5 is another schematic sub-flowchart of the attention mechanism-based target detection method provided by an embodiment of this application;
FIG. 6 is a schematic block diagram of the attention mechanism-based target detection apparatus provided by an embodiment of this application;
FIG. 7 is a schematic block diagram of subunits of the attention mechanism-based target detection apparatus provided by an embodiment of this application;
FIG. 8 is another schematic block diagram of subunits of the attention mechanism-based target detection apparatus provided by an embodiment of this application;
FIG. 9 is another schematic block diagram of subunits of the attention mechanism-based target detection apparatus provided by an embodiment of this application;
FIG. 10 is another schematic block diagram of subunits of the attention mechanism-based target detection apparatus provided by an embodiment of this application;
FIG. 11 is a schematic block diagram of the computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
It should be understood that when used in this specification and the appended claims, the terms "comprising" and "including" indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should further be understood that the term "and/or" used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Please refer to FIG. 1, which is a schematic flowchart of the attention mechanism-based target detection method provided by an embodiment of this application. The method is deployed and runs on a server. After the server receives an image to be detected sent by a smart terminal device such as a laptop or tablet computer, it performs feature extraction on the image to obtain the multi-layer feature maps of the image to be detected; the multi-layer feature maps are then weighted according to a preset attention mechanism to obtain a weighted feature map corresponding to each layer of the multi-layer feature maps; each layer of the multi-layer feature maps is then convolved again to obtain the feature pyramid of the image to be detected; finally, the weighted feature map is fused with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid, which is better suited to detecting the target image and can greatly improve detection accuracy at a small additional time cost.
The attention mechanism-based target detection method is described in detail below. As shown in FIG. 1, the method includes the following steps S110 to S170.
S110: Receive an image to be detected input by a user.
An image to be detected input by a user is received. Specifically, the image to be detected contains the feature information of the target image. The user sends the image to be detected to a server through a terminal device such as a laptop, tablet or smartphone; once the server receives the image, it can execute the attention mechanism-based target detection method to obtain the fused feature pyramid of the image, adapting to different target detection tasks.
S120: Input the image to be detected into a preset convolutional neural network model and extract the multi-layer feature maps of the image to be detected.
The image to be detected is input into a preset convolutional neural network model, and the multi-layer feature maps of the image to be detected are extracted. Specifically, the convolutional neural network model is pre-trained and used to perform feature extraction on the input image to obtain its multi-layer feature maps: after the image is input into the model, it passes through several convolutional layers, pooling layers and activation-function layers in sequence; from bottom to top, the channel count of each layer of the multi-layer feature maps gradually increases while the spatial size decreases, and the features extracted at each layer are fed into the next layer as input. That is, the multi-layer feature maps are composed of the feature maps of the different convolution stages the image passes through in the model; from bottom to top, their semantic information becomes gradually richer while their resolution gradually decreases. The bottom layer of the multi-layer feature maps has the least semantic information and the highest resolution and is not suitable for detecting small targets; the top layer has the richest semantics and the lowest resolution and is not suitable for detecting large targets. The convolutional neural network may be a deep convolutional neural network such as a VGG (Visual Geometry Group) network or a deep ResNet (Residual Network). For example, when the convolution process of the network consists of the four stages conv1, conv2, conv3 and conv4, extracting the feature map of the last layer of each of these four stages yields the multi-layer feature maps of the image to be detected.
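As a minimal sketch of this step (assuming a toy backbone rather than the patent's preset model), the following PyTorch code keeps the last output of each convolution stage as one layer of the multi-layer feature maps:

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for the preset CNN: each stage halves the spatial size and
    increases the channel count, and the last output of each stage is kept
    as one layer of the multi-layer feature maps (cf. conv1..conv4)."""
    def __init__(self, chans=(3, 16, 32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], 3, stride=2, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in range(len(chans) - 1)
        )

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # bottom-up: resolution decreases, channels grow
        return feats

image = torch.randn(1, 3, 224, 224)   # the image to be detected
feature_maps = TinyBackbone()(image)
print([tuple(f.shape) for f in feature_maps])
```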
S130: Weight the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map.
The multi-layer feature maps are weighted according to a preset attention mechanism to obtain a weighted feature map. Specifically, the attention mechanism is essentially similar to the human mechanism of selective visual attention; its core idea is to select, from a large amount of information, the information most critical to the current task. The attention mechanism is used to obtain the weight of each layer of the multi-layer feature maps; once these weights are obtained, the feature values of each layer are multiplied by their corresponding weights and then added, which completes the weighting of the multi-layer feature maps and yields the weighted feature map.
In another embodiment, as shown in FIG. 2, step S130 includes sub-steps S131 and S132.
S131: Obtain, from the convolutional neural network model according to the attention mechanism, the weight of each layer of the multi-layer feature maps.
The weight of each layer of the multi-layer feature maps is obtained from the convolutional neural network model according to the attention mechanism. The attention mechanism in the embodiment of this application is a spatial attention mechanism: after the image to be detected is input into the convolutional neural network model and the multi-layer feature maps are obtained, each layer of the multi-layer feature maps has a corresponding weight value. Since the output of each layer is a real number while the weights of all layers must sum to 1, the weight values obtained according to the attention mechanism are normalized to give the weight of each layer, where normalization maps each weight value into (0, 1). In the embodiment of this application, the Sigmoid function is used for this normalization.
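A minimal sketch of this sub-step is given below; the patent only specifies that each layer receives a real-valued weight normalized into (0, 1) with Sigmoid, so the scoring function used here (a learned scalar per layer added to the layer's mean activation) is an assumption:

```python
import torch
import torch.nn as nn

class LayerWeights(nn.Module):
    """One weight per layer of the multi-layer feature maps, squashed into
    (0, 1) with Sigmoid as described in S131."""
    def __init__(self, num_layers):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(num_layers))  # learned scores

    def forward(self, feats):
        # one raw real-valued score per layer, then Sigmoid normalization
        raw = torch.stack([f.mean() for f in feats]) + self.bias
        return torch.sigmoid(raw)      # normalized weights w_1 .. w_i
```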
S132: Weight the multi-layer feature maps according to the weight of each layer of the multi-layer feature maps to obtain the weighted feature map.
The multi-layer feature maps are weighted according to the weight of each layer to obtain the weighted feature map. Specifically, after the per-layer weights are obtained through the attention mechanism, the feature values of each layer of the multi-layer feature maps are multiplied by their corresponding weights and then added, giving a feature map of moderate size and semantic richness, i.e. the weighted feature map. The feature value of the weighted feature map is computed as F = f_1×w_1 + f_2×w_2 + … + f_i×w_i, where f_i is the feature value of one layer of the multi-layer feature maps and w_i is the weight of that layer.
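The weighted sum can be sketched as follows; since layers of different resolutions cannot be added directly, resizing them to a common size (and assuming a shared channel count) is an illustrative choice the patent leaves implicit:

```python
import torch
import torch.nn.functional as F

def weighted_feature_map(feats, weights, out_size):
    """S132: multiply each layer's feature values by its weight and add,
    i.e. F = sum_i w_i * f_i.  All layers are assumed to share a channel
    count and are resized to out_size so the sum is well defined."""
    fused = torch.zeros(1)
    for f, w in zip(feats, weights):
        f = F.interpolate(f, size=out_size, mode="nearest")
        fused = fused + w * f          # add the weighted layer
    return fused
```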
S140: Generate a feature pyramid of the image to be detected from the multi-layer feature maps.
A feature pyramid of the image to be detected is generated from the multi-layer feature maps. Specifically, the feature pyramid is constructed from the multi-layer feature maps from top to bottom. The feature pyramid can be used for target detection in different tasks: when a small target in the image to be detected needs to be detected, rich semantic information is obtained by using only the large-size feature maps of the feature pyramid for target recognition; when a large target needs to be detected, rich semantic information is obtained by using only the small-size feature maps of the feature pyramid.
In another embodiment, as shown in FIG. 3, step S140 includes sub-steps S141 and S142.
S141: Convolve each layer of the multi-layer feature maps with a preset convolution kernel to obtain convolved multi-layer feature maps.
Each layer of the multi-layer feature maps is convolved with a preset convolution kernel to obtain the convolved multi-layer feature maps. Specifically, after each layer is convolved with the kernel, all layers of the multi-layer feature maps have an equal number of channels, which facilitates the subsequent construction of the feature pyramid from the multi-layer feature maps. The size of the convolution kernel can be set according to the actual situation and is not limited here. For example, if the layers of the multi-layer feature maps from top to bottom are C1, C2, C3, C4 and C5, each of C1–C5 is convolved with a 1×1 convolution kernel so that they all have the same number of channels after convolution.
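This channel equalization might be sketched with one 1×1 convolution per layer; the channel counts below are assumed values for illustration:

```python
import torch.nn as nn

# One 1x1 convolution per layer so that C1..C5 share a channel count
# after convolution (S141); input channel counts are illustrative.
in_chans = [128, 64, 32, 16, 8]                  # C1..C5, top to bottom
laterals = nn.ModuleList(nn.Conv2d(c, 256, kernel_size=1) for c in in_chans)
```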
S142: Generate the feature pyramid of the image to be detected from the convolved multi-layer feature maps.
The feature pyramid of the image to be detected is generated from the convolved multi-layer feature maps. Specifically, all layers of the convolved multi-layer feature maps have an equal channel count, the number of layers of the convolved multi-layer feature maps equals the number of layers of the feature pyramid, and the corresponding layers are equal in size.
In another embodiment, as shown in FIG. 4, step S142 includes sub-steps S1421 and S1422.
S1421: Construct the top-layer feature map of the feature pyramid from the top-layer feature map of the convolved multi-layer feature maps.
The top-layer feature map of the feature pyramid is constructed from the top-layer feature map of the convolved multi-layer feature maps. Specifically, the top layer of the convolved multi-layer feature maps has the smallest size and the richest semantics among the convolved maps, so it can be used directly as the top-layer feature map of the feature pyramid.
S1422: Construct the feature maps below the top layer of the feature pyramid from the top-layer feature map of the feature pyramid.
The feature maps below the top layer of the feature pyramid are constructed from the top-layer feature map of the feature pyramid. The specific process is: the top layer of the feature pyramid is sampled and added to the feature map adjacent to the topmost layer in the convolved multi-layer feature maps, giving the feature map adjacent to the top layer in the feature pyramid; during the addition, the adjacent feature map from the convolved multi-layer feature maps must first be scaled to twice its original size so that the two maps match before they can be added, and proceeding downward layer by layer in this way constructs the feature pyramid. For example: the convolved C1 is taken as the top-layer feature map P1 of the feature pyramid; P1 is sampled while the convolved C2 is scaled to twice its original size, and adding the sampled P1 to the rescaled C2 gives P2, the layer adjacent to P1 in the feature pyramid; by analogy, the feature maps of the feature pyramid from top to bottom are P1, P2, P3, P4 and P5.
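A sketch of this top-down construction follows; nearest-neighbour interpolation is used here as an assumed way of bringing the upper level to the size of the adjacent lower-level map before the addition:

```python
import torch.nn.functional as F

def build_feature_pyramid(convolved):
    """convolved holds the 1x1-convolved maps C1..C5, top to bottom.
    Returns P1..P5 per S1421/S1422: P1 = C1; each lower level is the
    level above, resized to the matching size, plus the map of that level."""
    pyramid = [convolved[0]]                       # P1: smallest, richest
    for c in convolved[1:]:
        up = F.interpolate(pyramid[-1], size=c.shape[-2:], mode="nearest")
        pyramid.append(up + c)                     # P2 .. P5
    return pyramid
```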
S150: Fuse the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid.
The weighted feature map is fused with each layer of feature maps in the feature pyramid to obtain the fused feature pyramid. Specifically, when the image to be detected is convolved in the convolutional neural network model, the object of the model's convolution operation is a set of multi-dimensional matrices; likewise, each layer of the multi-layer feature maps obtained by inputting the image into the model is a set of multi-dimensional matrices, as is each layer of the feature pyramid constructed from the multi-layer feature maps, and so is the weighted feature map obtained by multiplying each layer's feature values by the corresponding weights and adding them. In the process of fusing the weighted feature map with each layer of the feature pyramid, the corresponding matrices are therefore combined, i.e. the weighted feature map is concatenated end-to-end with each layer of the feature pyramid, and the resulting new set of multi-dimensional matrices is the fused feature pyramid. Each layer of the fused feature pyramid contains richer semantic information than the corresponding layer of the feature pyramid, which can greatly improve the accuracy of target detection across different detection tasks.
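The fusion can be sketched as combining the weighted feature map with every pyramid level; the description speaks both of adding corresponding matrices and of end-to-end concatenation, and the element-wise-addition reading (with resizing, an assumption) is used below:

```python
import torch.nn.functional as F

def fuse_pyramid(weighted, pyramid):
    """S150: fuse the weighted feature map with each layer P1..P5,
    yielding the fused feature pyramid."""
    return [p + F.interpolate(weighted, size=p.shape[-2:], mode="nearest")
            for p in pyramid]
```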
S160: Obtain, from the fused feature pyramid, a feature map matching the target image in the image to be detected.
A feature map matching the target image in the image to be detected is obtained from the fused feature pyramid. Specifically, the matching feature map is obtained from the fused feature pyramid according to the target size of the target image in the image to be detected. Usually, when sending the image to be detected, the user also sends instruction information of the detection request for target detection on the image; from this instruction information the target size of the target image can be obtained, a feature map suitable for detection at that target size is selected from the fused feature pyramid, and that feature map is then input into the pre-trained target detection model to obtain the target image.
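Level selection might look like the sketch below; the size thresholds are illustrative assumptions, since the patent only states that the level is chosen according to the target size carried in the request:

```python
def pick_matching_level(fused_pyramid, target_size,
                        thresholds=(32, 64, 128, 256)):
    """fused_pyramid is ordered top to bottom (small maps first); small
    targets are routed to the large, high-resolution maps at the end."""
    for k, t in enumerate(thresholds):
        if target_size <= t:
            return fused_pyramid[len(fused_pyramid) - 1 - k]
    return fused_pyramid[0]           # largest targets -> smallest map
```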
S170: Perform target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
Target detection is performed on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected. Specifically, the target detection model is a model for extracting a plurality of rectangular bounding boxes, i.e. the multiple candidate frames, from the feature map matching the target image in the image to be detected. After that feature map is input into the target detection model, the model outputs the multiple candidate frames, which include the target detection frame; the multiple candidate frames are candidate frames related to the target image in the image to be detected, each of them includes feature information of part or all of the target image, and from them the target image in the image to be detected is obtained.
In another embodiment, as shown in FIG. 5, step S170 includes sub-steps S171 and S172.
S171: Input the feature map matching the target image in the image to be detected into a preset region generation network model to obtain multiple candidate frames.
The feature map matching the target image in the image to be detected is input into a preset region generation network model to obtain multiple candidate frames. Specifically, the region generation network model is pre-trained and used to extract, from the feature map matching the target image, multiple candidate frames containing the target detection frame: after the feature map is input into the model, multiple candidate frames containing the target detection frame are first generated by size transformations centred on the anchor point of a sliding window of a preset size. In the embodiment of this application, the size of the sliding window is 3×3.
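The candidate-frame generation can be sketched as enumerating boxes of several sizes centred on each sliding-window anchor point; the scales and aspect ratios below are illustrative assumptions, the 3×3 window size being the one value the embodiment fixes:

```python
import itertools

def candidate_frames(feat_h, feat_w, stride,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """S171: boxes generated by size transformations around the anchor
    point of a 3x3 sliding window at every feature-map position."""
    frames = []
    for y, x in itertools.product(range(feat_h), range(feat_w)):
        cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # anchor point
        for s, r in itertools.product(scales, ratios):
            w, h = s * r ** 0.5, s / r ** 0.5             # size transform
            frames.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return frames
```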
S172: Screen out the target detection frame from the multiple candidate frames according to a preset non-maximum suppression algorithm to obtain the target image.
The target detection frame is screened out from the multiple candidate frames according to a preset non-maximum suppression algorithm, and the target image is obtained. Specifically, the non-maximum suppression algorithm, NMS for short, is commonly used in computer vision for edge detection, face detection, target detection and the like. In this embodiment, it is used to perform target detection on the image to be detected: since a large number of candidate frames are generated at the same target position during detection and these frames may overlap one another, the NMS algorithm is needed to find the target detection frame among the multiple candidate frames. When the region generation network model outputs the multiple candidate frames, it also outputs the confidence of each candidate frame, i.e. the probability that the target image lies within that frame; the NMS algorithm screens the frames according to these confidences to obtain the target detection frame. The specific procedure is as follows: first, the candidate frames are sorted from high to low confidence, and candidates whose confidence is below the preset first threshold are eliminated; the area of each remaining candidate frame is computed; the IoU of the highest-confidence remaining frame with each of the other remaining frames is then computed, and whenever an IoU exceeds the preset second threshold, the corresponding frame is eliminated; the frame finally obtained is the target detection frame, through which the target image is obtained. Here IoU, the intersection-over-union ratio, is a concept used in target detection: it expresses the overlap between a candidate frame and the original labelled frame, i.e. the ratio of the area of their intersection to the area of their union. In this embodiment, the preset first threshold is set to 0.3 and the preset second threshold to 0.5.
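The NMS procedure of S172, with the first threshold 0.3 and second threshold 0.5 named in the embodiment, can be implemented directly as follows:

```python
def nms(frames, scores, first_thr=0.3, second_thr=0.5):
    """S172: drop frames with confidence below the first threshold, then
    repeatedly keep the most confident frame and eliminate remaining
    frames whose IoU with it exceeds the second threshold."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    order = sorted((i for i, s in enumerate(scores) if s >= first_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best, order = order[0], order[1:]
        keep.append(best)
        order = [i for i in order
                 if iou(frames[best], frames[i]) <= second_thr]
    return keep
```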
In the attention mechanism-based target detection method provided by the embodiment of this application, an image to be detected input by a user is received; the image to be detected is input into a preset convolutional neural network model and its multi-layer feature maps are extracted; the multi-layer feature maps are weighted according to a preset attention mechanism to obtain a weighted feature map; a feature pyramid of the image to be detected is generated from the multi-layer feature maps; the weighted feature map is fused with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid; a feature map matching the target image in the image to be detected is obtained from the fused feature pyramid; and target detection is performed on that feature map according to a preset target detection model to obtain the target image in the image to be detected. With this method, the weights of different feature layers can be adjusted adaptively during a detection task, making the final fused features better suited to the task and greatly improving detection accuracy at a small additional time cost.
An embodiment of this application also provides an attention mechanism-based target detection apparatus 100 for executing any embodiment of the aforementioned attention mechanism-based target detection method. Specifically, please refer to FIG. 6, which is a schematic block diagram of the attention mechanism-based target detection apparatus 100 provided by an embodiment of this application.
As shown in FIG. 6, the attention mechanism-based target detection apparatus 100 includes a receiving unit 110, a first generating unit 120, a second generating unit 130, a third generating unit 140, a fusion unit 150, an acquiring unit 160 and a target detection unit 170.
The receiving unit 110 is configured to receive an image to be detected input by a user.
The first generating unit 120 is configured to input the image to be detected into a preset convolutional neural network model and extract multi-layer feature maps of the image to be detected.
The second generating unit 130 is configured to weight the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map.
In other embodiments, as shown in FIG. 7, the second generating unit 130 includes a weight obtaining unit 131 and a fourth generating unit 132.
The weight obtaining unit 131 is configured to obtain, from the convolutional neural network model according to the attention mechanism, the weight of each layer of the multi-layer feature maps.
The fourth generating unit 132 is configured to weight the multi-layer feature maps according to the weight of each layer of the multi-layer feature maps to obtain the weighted feature map.
The third generating unit 140 is configured to generate the feature pyramid of the image to be detected from the multi-layer feature maps.
In other embodiments, as shown in FIG. 8, the third generating unit 140 includes a convolution unit 141 and a fifth generating unit 142.
The convolution unit 141 is configured to convolve each layer of the multi-layer feature maps with a preset convolution kernel to obtain the convolved multi-layer feature maps.
The fifth generating unit 142 is configured to generate the feature pyramid of the image to be detected from the convolved multi-layer feature maps.
In other embodiments, as shown in FIG. 9, the fifth generating unit 142 includes a first construction unit 1421 and a second construction unit 1422.
The first construction unit 1421 is configured to construct the top-layer feature map of the feature pyramid from the top-layer feature map of the convolved multi-layer feature maps.
The second construction unit 1422 is configured to construct the feature maps below the top layer of the feature pyramid from the top-layer feature map of the feature pyramid.
The fusion unit 150 is configured to fuse the weighted feature map with each layer of feature maps in the feature pyramid to obtain the fused feature pyramid.
The acquiring unit 160 is configured to obtain, from the fused feature pyramid, a feature map matching the target image in the image to be detected.
The target detection unit 170 is configured to perform target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
In other embodiments, as shown in FIG. 10, the target detection unit 170 includes a sixth generating unit 171 and a screening unit 172.
The sixth generating unit 171 is configured to input the feature map matching the target image in the image to be detected into a preset region generation network model to obtain multiple candidate frames.
The screening unit 172 is configured to screen out the target detection frame from the multiple candidate frames according to a preset non-maximum suppression algorithm to obtain the target image.
The attention mechanism-based target detection apparatus 100 provided by the embodiment of this application is used to execute the method described above: receiving an image to be detected input by a user; inputting the image into a preset convolutional neural network model and extracting its multi-layer feature maps; weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map; generating a feature pyramid of the image from the multi-layer feature maps; fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid; obtaining, from the fused feature pyramid, a feature map matching the target image in the image to be detected; and performing target detection on that feature map according to a preset target detection model to obtain the target image in the image to be detected.
Please refer to FIG. 11, which is a schematic block diagram of a computer device provided by an embodiment of this application.
Referring to FIG. 11, the device 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032. When the computer program 5032 is executed, it can cause the processor 502 to execute the attention mechanism-based target detection method. The processor 502 is used to provide computing and control capabilities to support the operation of the entire device 500. The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the attention mechanism-based target detection method. The network interface 505 is used for network communication, such as transmitting data information. Those skilled in the art will understand that the structure shown in FIG. 11 is only a block diagram of part of the structure related to the solution of this application and does not constitute a limitation on the device 500 to which the solution is applied; the specific device 500 may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement any embodiment of the attention mechanism-based target detection method described above.
Those of ordinary skill in the art will understand that all or part of the flows in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a storage medium, which may be a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the above method embodiments.
Therefore, this application also provides a computer-readable storage medium. The computer-readable storage medium may be non-volatile or volatile. The storage medium stores a computer program which, when executed by a processor, implements any embodiment of the attention mechanism-based target detection method described above.
The computer-readable storage medium may be any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk or an optical disc.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus, device and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in actual implementation. Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus, device and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. The above are only specific embodiments of this application, but the scope of protection of this application is not limited thereto; any equivalent modification or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall be covered by the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.

Claims (20)

  1. An attention mechanism-based target detection method, comprising the following steps:
    receiving an image to be detected input by a user;
    inputting the image to be detected into a preset convolutional neural network model and extracting multi-layer feature maps of the image to be detected;
    weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map;
    generating a feature pyramid of the image to be detected from the multi-layer feature maps;
    fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid;
    obtaining, from the fused feature pyramid, a feature map matching a target image in the image to be detected;
    performing target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
  2. The attention mechanism-based target detection method according to claim 1, wherein the weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map comprises:
    obtaining, from the convolutional neural network model according to the attention mechanism, the weight of each layer of the multi-layer feature maps;
    weighting the multi-layer feature maps according to the weight of each layer of the multi-layer feature maps to obtain the weighted feature map.
  3. The attention mechanism-based target detection method according to claim 1, wherein the generating a feature pyramid of the image to be detected from the multi-layer feature maps comprises:
    convolving each layer of the multi-layer feature maps with a preset convolution kernel to obtain convolved multi-layer feature maps;
    generating the feature pyramid of the image to be detected from the convolved multi-layer feature maps.
  4. The attention mechanism-based target detection method according to claim 3, wherein the generating the feature pyramid of the image to be detected from the convolved multi-layer feature maps comprises:
    constructing the top-layer feature map of the feature pyramid from the top-layer feature map of the convolved multi-layer feature maps;
    constructing the feature maps below the top layer of the feature pyramid from the top-layer feature map of the feature pyramid.
  5. The attention mechanism-based target detection method according to claim 1, wherein the fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid comprises:
    concatenating the weighted feature map end-to-end with each layer of feature maps in the feature pyramid to obtain the fused feature pyramid.
  6. The attention mechanism-based target detection method according to claim 1, wherein the obtaining, from the fused feature pyramid, a feature map matching a target image in the image to be detected comprises:
    obtaining, from the fused feature pyramid according to the target size of the target image in the image to be detected, the feature map matching the target image in the image to be detected.
  7. The attention mechanism-based target detection method according to claim 1, wherein the performing target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected comprises:
    inputting the feature map matching the target image into a preset region generation network model to obtain multiple candidate frames;
    screening out a target detection frame from the multiple candidate frames according to a preset non-maximum suppression algorithm to obtain the target image.
  8. An attention mechanism-based target detection apparatus, comprising:
    a receiving unit configured to receive an image to be detected input by a user;
    a first generating unit configured to input the image to be detected into a preset convolutional neural network model and extract multi-layer feature maps of the image to be detected;
    a second generating unit configured to weight the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map;
    a third generating unit configured to generate a feature pyramid of the image to be detected from the multi-layer feature maps;
    a fusion unit configured to fuse the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid;
    an acquiring unit configured to obtain, from the fused feature pyramid, a feature map matching a target image in the image to be detected;
    a target detection unit configured to perform target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
  9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, performs the following steps:
    receiving an image to be detected input by a user;
    inputting the image to be detected into a preset convolutional neural network model and extracting multi-layer feature maps of the image to be detected;
    weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map;
    generating a feature pyramid of the image to be detected from the multi-layer feature maps;
    fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid;
    obtaining, from the fused feature pyramid, a feature map matching a target image in the image to be detected;
    performing target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
  10. The computer device according to claim 9, wherein the weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map comprises:
    obtaining, from the convolutional neural network model according to the attention mechanism, the weight of each layer of the multi-layer feature maps;
    weighting the multi-layer feature maps according to the weight of each layer of the multi-layer feature maps to obtain the weighted feature map.
  11. The computer device according to claim 9, wherein the generating a feature pyramid of the image to be detected from the multi-layer feature maps comprises:
    convolving each layer of the multi-layer feature maps with a preset convolution kernel to obtain convolved multi-layer feature maps;
    generating the feature pyramid of the image to be detected from the convolved multi-layer feature maps.
  12. The computer device according to claim 11, wherein the generating the feature pyramid of the image to be detected from the convolved multi-layer feature maps comprises:
    constructing the top-layer feature map of the feature pyramid from the top-layer feature map of the convolved multi-layer feature maps;
    constructing the feature maps below the top layer of the feature pyramid from the top-layer feature map of the feature pyramid.
  13. The computer device according to claim 9, wherein the fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid comprises:
    concatenating the weighted feature map end-to-end with each layer of feature maps in the feature pyramid to obtain the fused feature pyramid.
  14. The computer device according to claim 9, wherein the obtaining, from the fused feature pyramid, a feature map matching a target image in the image to be detected comprises:
    obtaining, from the fused feature pyramid according to the target size of the target image in the image to be detected, the feature map matching the target image in the image to be detected.
  15. The computer device according to claim 9, wherein the performing target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected comprises:
    inputting the feature map matching the target image into a preset region generation network model to obtain multiple candidate frames;
    screening out a target detection frame from the multiple candidate frames according to a preset non-maximum suppression algorithm to obtain the target image.
  16. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
    receiving an image to be detected input by a user;
    inputting the image to be detected into a preset convolutional neural network model and extracting multi-layer feature maps of the image to be detected;
    weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map;
    generating a feature pyramid of the image to be detected from the multi-layer feature maps;
    fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid;
    obtaining, from the fused feature pyramid, a feature map matching a target image in the image to be detected;
    performing target detection on the feature map matching the target image according to a preset target detection model to obtain the target image in the image to be detected.
  17. The computer-readable storage medium according to claim 16, wherein the weighting the multi-layer feature maps according to a preset attention mechanism to obtain a weighted feature map comprises:
    obtaining, from the convolutional neural network model according to the attention mechanism, the weight of each layer of the multi-layer feature maps;
    weighting the multi-layer feature maps according to the weight of each layer of the multi-layer feature maps to obtain the weighted feature map.
  18. The computer-readable storage medium according to claim 16, wherein the generating a feature pyramid of the image to be detected from the multi-layer feature maps comprises:
    convolving each layer of the multi-layer feature maps with a preset convolution kernel to obtain convolved multi-layer feature maps;
    generating the feature pyramid of the image to be detected from the convolved multi-layer feature maps.
  19. The computer-readable storage medium according to claim 18, wherein the generating the feature pyramid of the image to be detected from the convolved multi-layer feature maps comprises:
    constructing the top-layer feature map of the feature pyramid from the top-layer feature map of the convolved multi-layer feature maps;
    constructing the feature maps below the top layer of the feature pyramid from the top-layer feature map of the feature pyramid.
  20. The computer-readable storage medium according to claim 16, wherein the fusing the weighted feature map with each layer of feature maps in the feature pyramid to obtain a fused feature pyramid comprises:
    concatenating the weighted feature map end-to-end with each layer of feature maps in the feature pyramid to obtain the fused feature pyramid.
PCT/CN2021/083935 2020-11-23 2021-03-30 Attention mechanism-based target detection method and apparatus, and computer device WO2021208726A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011322670.7 2020-11-23
CN202011322670.7A CN112396115B (zh) 2020-11-23 2020-11-23 Attention mechanism-based target detection method and apparatus, and computer device

Publications (1)

Publication Number Publication Date
WO2021208726A1 true WO2021208726A1 (zh) 2021-10-21

Family

ID=74606965

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083935 WO2021208726A1 (zh) 2020-11-23 2021-03-30 Attention mechanism-based target detection method and apparatus, and computer device

Country Status (2)

Country Link
CN (1) CN112396115B (zh)
WO (1) WO2021208726A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821121A (zh) * 2022-05-09 2022-07-29 盐城工学院 一种基于rgb三分量分组注意力加权融合的图像分类方法
CN114972860A (zh) * 2022-05-23 2022-08-30 郑州轻工业大学 一种基于注意增强的双向特征金字塔网络的目标检测方法
CN115546032A (zh) * 2022-12-01 2022-12-30 泉州市蓝领物联科技有限公司 一种基于特征融合与注意力机制的单帧图像超分辨率方法
CN115564789A (zh) * 2022-12-01 2023-01-03 北京矩视智能科技有限公司 一种跨级融合的工件缺陷区域分割方法、装置及存储介质
CN116228685A (zh) * 2023-02-07 2023-06-06 重庆大学 一种基于深度学习的肺结节检测与剔除方法
CN116778346A (zh) * 2023-08-23 2023-09-19 济南大学 一种基于改进自注意力机制的管线识别方法及系统
CN116787022A (zh) * 2023-08-29 2023-09-22 深圳市鑫典金光电科技有限公司 基于多源数据的散热铜底板焊接质量检测方法及系统
CN117237746A (zh) * 2023-11-13 2023-12-15 光宇锦业(武汉)智能科技有限公司 基于多交叉边缘融合小目标检测方法、系统及存储介质

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112396115B (zh) * 2020-11-23 2023-12-22 平安科技(深圳)有限公司 基于注意力机制的目标检测方法、装置及计算机设备
CN113177133B (zh) * 2021-04-23 2024-03-29 深圳依时货拉拉科技有限公司 一种图像检索方法、装置、设备及存储介质
CN113327226B (zh) * 2021-05-07 2024-06-21 北京工业大学 目标检测方法、装置、电子设备及存储介质
CN113361502B (zh) * 2021-08-10 2021-11-02 江苏久智环境科技服务有限公司 一种基于边缘群体计算的园林周界智能预警方法
CN113822871A (zh) * 2021-09-29 2021-12-21 平安医疗健康管理股份有限公司 基于动态检测头的目标检测方法、装置、存储介质及设备
CN114022682A (zh) * 2021-11-05 2022-02-08 天津大学 一种基于注意力的二次特征融合机制的弱小目标检测方法
CN113868542B (zh) * 2021-11-25 2022-03-11 平安科技(深圳)有限公司 基于注意力模型的推送数据获取方法、装置、设备及介质


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160104058A1 (en) * 2014-10-09 2016-04-14 Microsoft Technology Licensing, Llc Generic object detection in images
CN110782420A (zh) * 2019-09-19 2020-02-11 杭州电子科技大学 一种基于深度学习的小目标特征表示增强方法
CN111738110A (zh) * 2020-06-10 2020-10-02 杭州电子科技大学 基于多尺度注意力机制的遥感图像车辆目标检测方法
CN111915613A (zh) * 2020-08-11 2020-11-10 华侨大学 一种图像实例分割方法、装置、设备及存储介质
CN112396115A (zh) * 2020-11-23 2021-02-23 平安科技(深圳)有限公司 基于注意力机制的目标检测方法、装置及计算机设备

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114821121A (zh) * 2022-05-09 2022-07-29 盐城工学院 一种基于rgb三分量分组注意力加权融合的图像分类方法
CN114821121B (zh) * 2022-05-09 2023-02-03 盐城工学院 一种基于rgb三分量分组注意力加权融合的图像分类方法
CN114972860A (zh) * 2022-05-23 2022-08-30 郑州轻工业大学 一种基于注意增强的双向特征金字塔网络的目标检测方法
CN115546032A (zh) * 2022-12-01 2022-12-30 泉州市蓝领物联科技有限公司 一种基于特征融合与注意力机制的单帧图像超分辨率方法
CN115564789A (zh) * 2022-12-01 2023-01-03 北京矩视智能科技有限公司 一种跨级融合的工件缺陷区域分割方法、装置及存储介质
CN116228685B (zh) * 2023-02-07 2023-08-22 重庆大学 一种基于深度学习的肺结节检测与剔除方法
CN116228685A (zh) * 2023-02-07 2023-06-06 重庆大学 一种基于深度学习的肺结节检测与剔除方法
CN116778346A (zh) * 2023-08-23 2023-09-19 济南大学 一种基于改进自注意力机制的管线识别方法及系统
CN116778346B (zh) * 2023-08-23 2023-12-08 蓝茵建筑数据科技(上海)有限公司 一种基于改进自注意力机制的管线识别方法及系统
CN116787022A (zh) * 2023-08-29 2023-09-22 深圳市鑫典金光电科技有限公司 基于多源数据的散热铜底板焊接质量检测方法及系统
CN116787022B (zh) * 2023-08-29 2023-10-24 深圳市鑫典金光电科技有限公司 基于多源数据的散热铜底板焊接质量检测方法及系统
CN117237746A (zh) * 2023-11-13 2023-12-15 光宇锦业(武汉)智能科技有限公司 基于多交叉边缘融合小目标检测方法、系统及存储介质
CN117237746B (zh) * 2023-11-13 2024-03-15 光宇锦业(武汉)智能科技有限公司 基于多交叉边缘融合小目标检测方法、系统及存储介质

Also Published As

Publication number Publication date
CN112396115B (zh) 2023-12-22
CN112396115A (zh) 2021-02-23

Similar Documents

Publication Publication Date Title
WO2021208726A1 (zh) Attention mechanism-based target detection method and apparatus, and computer device
US10242289B2 (en) Method for analysing media content
CN113255694B (zh) 训练图像特征提取模型和提取图像特征的方法、装置
US8463025B2 (en) Distributed artificial intelligence services on a cell phone
US20230401833A1 (en) Method, computer device, and storage medium, for feature fusion model training and sample retrieval
CN110837811A (zh) 语义分割网络结构的生成方法、装置、设备及存储介质
EP4050570A2 (en) Method for generating image classification model, roadside device and cloud control platform
CN109977832B (zh) 一种图像处理方法、装置及存储介质
CN110796199A (zh) 一种图像处理方法、装置以及电子医疗设备
CN111814744A (zh) 一种人脸检测方法、装置、电子设备和计算机存储介质
US11763086B1 (en) Anomaly detection in text
CN112749726B (zh) 目标检测模型的训练方法、装置、计算机设备和存储介质
CN112446322A (zh) 眼球特征检测方法、装置、设备及计算机可读存储介质
US20070016576A1 (en) Method and apparatus for blocking objectionable multimedia information
CN110135428A (zh) 图像分割处理方法和装置
CN110991305B (zh) 一种遥感图像下的飞机检测方法及存储介质
CN115577106B (zh) 基于人工智能的文本分类方法、装置、设备和介质
CN114842482B (zh) 一种图像分类方法、装置、设备和存储介质
CN113222016B (zh) 一种基于高层和低层特征交叉增强的变化检测方法及装置
CN113139483B (zh) 人体行为识别方法、装置、设备、存储介质以及程序产品
CN112949777B (zh) 相似图像确定方法及装置、电子设备和存储介质
CN116030290A (zh) 在设备上检测数字对象并且生成对象掩膜
CN113971830A (zh) 一种人脸识别方法、装置、存储介质及电子设备
WO2023092296A1 (zh) 文本识别方法和装置、存储介质及电子设备
US20240135698A1 (en) Image classification method, model training method, device, storage medium, and computer program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21788676

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21788676

Country of ref document: EP

Kind code of ref document: A1