WO2022241803A1 - Attention mechanism-based system and method for detecting feature in target, and storage medium - Google Patents


Info

Publication number
WO2022241803A1
Authority
WO
WIPO (PCT)
Prior art keywords
detection
attention
module
target
image
Prior art date
Application number
PCT/CN2021/095956
Other languages
French (fr)
Chinese (zh)
Inventor
黄宇恒
魏东
岳许要
金晓峰
徐天适
Original Assignee
广州广电运通金融电子股份有限公司
Priority date
Filing date
Publication date
Application filed by 广州广电运通金融电子股份有限公司 filed Critical 广州广电运通金融电子股份有限公司
Publication of WO2022241803A1 publication Critical patent/WO2022241803A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F 18/24133 Distances to prototypes
    • G06F 18/24137 Distances to cluster centroïds
    • G06F 18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Definitions

  • The invention relates to the field of intelligent security, and in particular to an attention mechanism-based system, method, and storage medium for detecting features within a target.
  • Object detection is one of the research hotspots of machine vision.
  • In-target feature detection refers to techniques for locating the components within a target so as to further analyze the structured information of targets in videos and images.
  • In-target feature detection is one of the key technologies for video/image structured analysis. For example, in vehicle structuring tasks, features of a vehicle target such as the vehicle face, windows, lights, logo, and luggage rack must be located for further analysis.
  • At present, in-target feature detection mainly relies on general-purpose scene detection methods such as SSD and YOLO, whose shortcomings are as follows:
  • In-target feature detection usually locates multiple feature positions within a target of a fixed type.
  • Existing general-purpose detection frameworks do not exploit target attributes to guide in-target feature detection.
  • Existing detection frameworks use the same feature map for both bounding-box regression and classification of the target, which hinders performance improvement.
  • The anchor-box selection scheme of existing detection frameworks generates a large number of negative samples, causing sample imbalance during the training phase and making classification difficult.
  • The object of the present invention is to provide an attention mechanism-based system, method, and storage medium for in-target feature detection that can solve the above problems.
  • An attention mechanism-based feature detection system includes a semantic extraction module, an attention map module, and a detection module. The semantic extraction module includes a multi-layer deep convolutional network responsible for extracting high-level semantic information from the input image and sharing it with the attention map module and the detection module. The attention map module includes a classification sub-module and an attention sub-module: each attribute branch of the classification sub-module includes multiple convolutional layers, a global pooling layer, a fully connected layer, and a softmax layer, and is responsible for classifying the target's global attributes and supervising the training of the attention sub-module; the attention sub-module includes multiple convolutional and deconvolutional layers and is responsible for constructing the attention map. The detection module includes an anchor-box filtering layer, a target detection layer, and a parsing layer; the anchor-box filtering layer filters the received output of the attention map module and sends it to the target detection layer and the parsing layer for detection and analysis, outputting the detection result.
  • The present invention also provides an attention mechanism-based method for detecting features within a target, the method comprising the following steps:
  • Step S1, sample preparation: obtain training images and annotate each image's global attribute labels, feature positions, and the corresponding classification labels;
  • Step S2, attention map training: the attention map module uses the training images and feature position information to generate attention map label information, and uses the attention map label information together with the image global attribute labels to supervise training of the attention sub-module;
  • Step S3, detection network training: fix the parameters of the attention module and the semantic extraction module, supervise training of the detection network with the image feature positions and image feature labels, and use the attention map to generate anchor boxes, thereby introducing the attention module into the detection framework;
  • Step S4, global network tuning: according to the training, obtain an optimized network framework;
  • Step S5, for a new detection target, acquire an image or video through an optical system and feed it into the optimized network framework to perform target localization, analysis, and detection.
  • The present invention also provides a computer-readable storage medium storing computer instructions that, when run, execute the above method.
  • The beneficial effects of the present invention are as follows: the invention adopts a multi-task learning method based on a deep convolutional network and introduces attention learning and a single-scale detection mechanism, detecting and localizing features within a target while classifying and recognizing the target's global attributes. This resolves the unbalanced sample distribution of traditional schemes during training and the high computing-power demand caused by multiple anchor boxes and multiple scales, improving detection efficiency and accuracy.
  • Fig. 1 is a schematic diagram of the attention mechanism-based in-target feature detection system of the present invention.
  • Fig. 2 is a flow diagram of the attention mechanism-based in-target feature detection method.
  • Fig. 3 is a schematic diagram of candidate anchor box generation.
  • The terms "system", "device", "unit", and/or "module" are used to distinguish components, elements, parts, or assemblies at different levels; these words may be replaced by other expressions that achieve the same purpose.
  • An attention mechanism-based in-target feature detection system, see Fig. 1, includes a semantic extraction module, an attention map module, and a detection module.
  • The semantic extraction module includes a multi-layer deep convolutional network responsible for extracting high-level semantic information from the input image and sharing it with the attention map module and the detection module.
  • The attention map module includes a classification sub-module and an attention sub-module.
  • Each attribute branch of the classification sub-module includes multiple convolutional layers, a global pooling layer, a fully connected layer, and a softmax layer, and is responsible for classifying the target's global attributes and supervising the training of the attention sub-module.
  • The attention sub-module includes multiple convolutional and deconvolutional layers and is responsible for constructing the attention map.
  • The detection module includes an anchor-box filtering layer, a target detection layer, and a parsing layer; the anchor-box filtering layer filters the received output of the attention map module and sends it to the target detection layer and the parsing layer for detection and analysis, outputting the detection result.
  • For a motor vehicle, the global attributes include the vehicle's orientation, model, and body color, and the vehicle features include the windows, logo, lights, luggage rack, and sunroof.
  • The training procedure is divided into sample preparation, attention map multi-task training, detection network training, and global network tuning.
  • An attention mechanism-based in-target feature detection method is implemented by the system of the first embodiment; referring to Fig. 2, the method comprises the following steps.
  • Step S1, sample preparation: obtain training images and annotate each image's global attribute labels, feature positions, and the corresponding classification labels.
  • Step S2, attention map training: the attention map module uses the training images and feature position information to generate attention map label information, and uses the attention map label information together with the image global attribute labels to supervise training of the attention sub-module. The attention map label information is generated by the steps described below.
  • x and y represent the coordinates of a pixel in the attention map.
  • x_s, x_e, y_s, and y_e represent the start and end positions of the target feature along the horizontal and vertical axes of the image, respectively.
  • Step S3, detection network training: fix the parameters of the attention module and the semantic extraction module, supervise training of the detection network with the image feature positions and image feature labels, and use the attention map to generate anchor boxes, thereby introducing the attention module into the detection framework. The anchor boxes are generated as described below.
  • l_(i,j,k), t_(i,j,k), w_(i,j,k), and h_(i,j,k) are respectively the abscissa of the anchor box's upper-left corner, the ordinate of its upper-left corner, the anchor box width, and the anchor box height.
  • f is the value of each corresponding point of the attention map within the anchor box region.
  • T is the confidence filtering threshold.
  • Step S4, global network tuning: according to the training, obtain an optimized network framework.
  • Step S5, for a new detection target, acquire an image or video through an optical system and feed it into the optimized network framework to perform target localization, analysis, and detection.
  • The present invention also provides a computer-readable storage medium storing computer instructions that, when run, execute the steps of the aforementioned method.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device.
  • As defined herein, computer-readable media exclude transitory media such as modulated data signals and carrier waves.
  • The computer program code required for the operation of each part of this application can be written in any one or more programming languages, including object-oriented languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python; conventional procedural languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP; dynamic languages such as Python, Ruby, and Groovy; or other programming languages.
  • The program code may run entirely on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or processing device.
  • The remote computer may be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or to an external computer (for example, through the Internet), or used in a cloud computing environment, or offered as software as a service (SaaS).
  • The possible beneficial effects may be any one or a combination of the above, or any other beneficial effect that may be obtained.

Abstract

The present invention provides an attention mechanism-based system and method for detecting a feature in a target, and a storage medium, and relates to the field of intelligent security. The system comprises a semantic extraction module, an attention map module and a detection module, and a classification sub-module is responsible for carrying out global attribute classification on a target and for supervising the training of an attention sub-module; the attention sub-module is responsible for constructing an attention map; and the detection module comprises an anchor frame filter layer, a target detection layer and a parsing layer, the anchor frame filter layer performing data filtering on a received result of the attention map module, sending the result to the target detection layer and the parsing layer for detection and analysis, and outputting a detection result. In the present invention, a multi-task learning method based on a deep convolutional network is employed, and attention learning and a single-scale detection mechanism are introduced, a feature in a target is detected and positioned, and global attributes of the target are classified and recognized, which solves the problem that in training stages of conventional solutions, sample distribution is unbalanced and the requirement for computing power is high due to multiple anchor frames and multiple scales, thereby improving detection efficiency and precision.

Description

In-target feature detection system, method and storage medium based on an attention mechanism
Technical Field
The invention relates to the field of intelligent security, and in particular to an attention mechanism-based system, method, and storage medium for detecting features within a target.
Background Art
Object detection is one of the research hotspots of machine vision. In-target feature detection refers to techniques for locating the components within a target so as to further analyze the structured information of targets in videos and images. It is one of the key technologies for video/image structured analysis; for example, in vehicle structuring tasks, features of a vehicle target such as the vehicle face, windows, lights, logo, and luggage rack must be located for further analysis.
At present, in-target feature detection mainly relies on general-purpose scene detection methods such as SSD and YOLO, whose shortcomings are as follows:
1. In-target feature detection usually locates multiple feature positions within a target of a fixed type, yet existing general-purpose detection frameworks do not exploit target attributes to guide in-target feature detection.
2. Existing general-purpose detection frameworks require multi-scale image pyramids or feature pyramids, which are time-consuming to build and unfavorable for deployment on edge devices.
3. Existing detection frameworks, relying on the strong fitting capacity of deep convolutional networks, use the same feature map for both bounding-box regression and classification of the target, which hinders performance improvement.
4. The anchor-box selection scheme of existing detection frameworks generates a large number of negative samples, causing sample imbalance during the training phase and making classification difficult.
Summary of the Invention
To overcome the deficiencies of the prior art, the object of the present invention is to provide an attention mechanism-based system, method, and storage medium for in-target feature detection that can solve the above problems.
An attention mechanism-based in-target feature detection system includes a semantic extraction module, an attention map module, and a detection module. The semantic extraction module includes a multi-layer deep convolutional network responsible for extracting high-level semantic information from the input image and sharing it with the attention map module and the detection module. The attention map module includes a classification sub-module and an attention sub-module: each attribute branch of the classification sub-module includes multiple convolutional layers, a global pooling layer, a fully connected layer, and a softmax layer, and is responsible for classifying the target's global attributes and supervising the training of the attention sub-module; the attention sub-module includes multiple convolutional and deconvolutional layers and is responsible for constructing the attention map. The detection module includes an anchor-box filtering layer, a target detection layer, and a parsing layer; the anchor-box filtering layer filters the received output of the attention map module and sends it to the target detection layer and the parsing layer for detection and analysis, outputting the detection result.
The present invention also provides an attention mechanism-based method for detecting features within a target, the method comprising the following steps:
Step S1, sample preparation: obtain training images and annotate each image's global attribute labels, feature positions, and the corresponding classification labels;
Step S2, attention map training: the attention map module uses the training images and feature position information to generate attention map label information, and uses the attention map label information together with the image global attribute labels to supervise training of the attention sub-module;
Step S3, detection network training: fix the parameters of the attention module and the semantic extraction module, supervise training of the detection network with the image feature positions and image feature labels, and use the attention map to generate anchor boxes, thereby introducing the attention module into the detection framework;
Step S4, global network tuning: according to the training, obtain an optimized network framework;
Step S5, for a new detection target, acquire an image or video through an optical system and feed it into the optimized network framework to perform target localization, analysis, and detection.
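Steps S1 to S5 amount to a staged training schedule: attention supervision first, then detection training with the attention and semantic-extraction parameters held fixed, then end-to-end tuning. The sketch below is a hypothetical outline of that ordering only; all function names, dictionary keys, and data structures are assumptions for illustration and are not taken from the patent.

```python
def prepare_samples():
    # S1: images annotated with global-attribute labels, feature-box
    # positions and the corresponding classification labels.
    return [{"image": None, "attrs": {}, "boxes": [], "classes": []}]

def train_attention(samples):
    # S2: build attention-map labels from the feature positions, then
    # supervise the attention sub-module with those labels plus the
    # image-level global attribute labels.
    return {"attention": "trained", "semantics": "trained"}

def train_detector(samples, frozen):
    # S3: the attention and semantic-extraction parameters stay fixed;
    # anchor boxes are generated from the attention map.
    assert frozen["attention"] == "trained"
    return {"detector": "trained", **frozen}

def tune_globally(network):
    # S4: end-to-end fine-tuning of the whole framework.
    network["tuned"] = True
    return network

def run_pipeline():
    samples = prepare_samples()                # S1
    frozen = train_attention(samples)          # S2
    network = train_detector(samples, frozen)  # S3
    network = tune_globally(network)           # S4
    return network                             # S5 runs inference with this
```

The point of the stub is the dependency order: S3 consumes the frozen result of S2, and S4 only runs once both earlier stages have produced a full network.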
The present invention also provides a computer-readable storage medium storing computer instructions that, when run, execute the aforementioned method.
Compared with the prior art, the beneficial effects of the present invention are as follows: the invention adopts a multi-task learning method based on a deep convolutional network and introduces attention learning and a single-scale detection mechanism, detecting and localizing features within a target while classifying and recognizing the target's global attributes. This resolves the unbalanced sample distribution of traditional schemes during training and the high computing-power demand caused by multiple anchor boxes and multiple scales, improving detection efficiency and accuracy.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the attention mechanism-based in-target feature detection system of the present invention;
Fig. 2 is a flow diagram of the attention mechanism-based in-target feature detection method;
Fig. 3 is a schematic diagram of candidate anchor box generation.
Detailed Description of the Embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the protection scope of the present invention.
It should be understood that "system", "device", "unit", and/or "module" as used in this specification are means of distinguishing components, elements, parts, or assemblies at different levels; these words may be replaced by other expressions that achieve the same purpose.
As used in this specification and the claims, unless the context clearly indicates otherwise, the words "a", "an", and/or "the" are not limited to the singular and may include the plural. In general, the terms "comprise" and "include" indicate only that the explicitly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and a method or device may also contain other steps or elements.
Flowcharts are used in this specification to illustrate operations performed by systems according to its embodiments. It should be understood that the preceding or following operations are not necessarily performed exactly in order; the steps may instead be processed in reverse order or simultaneously. Other operations may also be added to these processes, or one or more steps may be removed from them.
First Embodiment
An attention mechanism-based in-target feature detection system, referring to Fig. 1, includes a semantic extraction module, an attention map module, and a detection module.
The semantic extraction module includes a multi-layer deep convolutional network responsible for extracting high-level semantic information from the input image and sharing it with the attention map module and the detection module.
The attention map module includes a classification sub-module and an attention sub-module.
Each attribute branch of the classification sub-module includes multiple convolutional layers, a global pooling layer, a fully connected layer, and a softmax layer, and is responsible for classifying the target's global attributes and supervising the training of the attention sub-module. The attention sub-module includes multiple convolutional and deconvolutional layers and is responsible for constructing the attention map.
The detection module includes an anchor-box filtering layer, a target detection layer, and a parsing layer; the anchor-box filtering layer filters the received output of the attention map module and sends it to the target detection layer and the parsing layer for detection and analysis, outputting the detection result.
Taking a motor vehicle as an example, in Fig. 1 the global attributes of the vehicle include its orientation, model, and body color, and the vehicle features include the windows, logo, lights, luggage rack, and sunroof. The training procedure is divided into sample preparation, attention map multi-task training, detection network training, and global network tuning.
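The data flow among the three modules of the first embodiment can be sketched as follows. This is a minimal pure-Python outline: every layer is stubbed out, and the attribute names, return structures, and the placeholder anchor-filtering rule are illustrative assumptions, not the patent's implementation.

```python
# Minimal sketch of the three-module data flow of the first embodiment.
# All layer internals are stubs; shapes, names and values are assumptions.

class SemanticExtractor:
    """Multi-layer deep convolutional network (stub)."""
    def extract(self, image):
        # A real implementation would return a high-level feature map.
        return {"features": image}

class AttentionMapModule:
    """Classification sub-module (conv + global pooling + FC + softmax per
    attribute branch) and attention sub-module (conv + deconv), stubbed."""
    def forward(self, semantics):
        attributes = {"orientation": "front", "model": "sedan", "color": "red"}
        attention_map = [[0.0, 0.0], [0.0, 1.0]]  # built by conv/deconv layers
        return attributes, attention_map

class DetectionModule:
    """Anchor-box filtering, target detection and parsing layers (stub)."""
    def forward(self, semantics, attention_map):
        anchors = self.filter_anchors(attention_map)
        return {"boxes": anchors, "labels": ["window"] * len(anchors)}

    def filter_anchors(self, attention_map):
        # Placeholder rule: keep only anchor points with non-zero attention.
        return [(i, j) for i, row in enumerate(attention_map)
                for j, v in enumerate(row) if v > 0]

def detect(image):
    """End-to-end flow: high-level semantics are shared with both modules."""
    semantics = SemanticExtractor().extract(image)
    attributes, attention_map = AttentionMapModule().forward(semantics)
    detections = DetectionModule().forward(semantics, attention_map)
    return attributes, detections
```

The structural point is the sharing described in the text: the semantic extractor runs once, and both the attention map module and the detection module consume its output, with the attention map additionally gating which anchors the detection module considers.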
Second Embodiment
An attention mechanism-based in-target feature detection method is implemented by the system of the first embodiment; referring to Fig. 2, the method comprises the following steps.
Step S1, sample preparation: obtain training images and annotate each image's global attribute labels, feature positions, and the corresponding classification labels.
Step S2, attention map training: the attention map module uses the training images and feature position information to generate attention map label information, and uses the attention map label information together with the image global attribute labels to supervise training of the attention sub-module. The attention map label information is generated by the following steps.
S21. Compute the mean image of each feature class of the labels, expressed as: [Formula 1; rendered as an image in the original and not reproduced here]
S22. From the sample mean image, compute the center of gravity (x_c, y_c) of the sample difference map: [Formula 2; image not reproduced] In the formula, p̄_(i,j) is the pixel value of the sample mean image at coordinate (i, j), and p_(i,j) is the pixel value of the feature image at (i, j).
S23. Generate the attention map G(x, y) from the difference-map center of gravity (x_c, y_c): [Formula; image not reproduced] In the formula, x and y represent the coordinates of a pixel in the attention map, and x_s, x_e, y_s, and y_e represent the start and end positions of the target feature along the horizontal and vertical axes of the image, respectively.
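The formulas for S21 to S23 survive only as unreproduced images, so the sketch below is a reconstruction under common conventions, not the patent's exact definitions: the mean image as a per-pixel average over one feature class's samples, the center of gravity as an intensity-weighted mean of pixel coordinates, and the attention map as a Gaussian centered at (x_c, y_c) whose spread is set by the feature extent [x_s, x_e] x [y_s, y_e]. The Gaussian form in particular is an assumption.

```python
import math

def mean_image(samples):
    """S21: per-pixel mean over all sample images of one feature class."""
    h, w = len(samples[0]), len(samples[0][0])
    return [[sum(s[i][j] for s in samples) / len(samples)
             for j in range(w)] for i in range(h)]

def center_of_gravity(p_bar):
    """S22: intensity-weighted centroid (x_c, y_c); x is the column index j,
    y is the row index i (assumed convention)."""
    total = sum(sum(row) for row in p_bar)
    x_c = sum(j * v for row in p_bar for j, v in enumerate(row)) / total
    y_c = sum(i * v for i, row in enumerate(p_bar) for v in row) / total
    return x_c, y_c

def attention_map(shape, center, extent):
    """S23: Gaussian attention map centered at (x_c, y_c); the feature extent
    (x_s, x_e, y_s, y_e) sets the standard deviations (assumed form)."""
    h, w = shape
    x_c, y_c = center
    x_s, x_e, y_s, y_e = extent
    sx = max((x_e - x_s) / 2.0, 1.0)   # half the feature width/height as sigma
    sy = max((y_e - y_s) / 2.0, 1.0)
    return [[math.exp(-(((x - x_c) ** 2) / (2 * sx ** 2)
                        + ((y - y_c) ** 2) / (2 * sy ** 2)))
             for x in range(w)] for y in range(h)]
```

Under this reconstruction the map peaks at 1.0 over the feature center and decays toward the feature boundary, which matches the role the map plays in step S3: anchor boxes covering high-attention regions survive the confidence filter while background anchors do not.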
步骤S3、检测网络训练:固定注意力模块及语义提取模块参数,使用图像特征位置、图像特征标签监督检测网络训练,并使用注意力图生成锚框,将注意力模块引入检测框架;其中,锚框的生成方法为:Step S3, detection network training: fix the parameters of the attention module and semantic extraction module, use the image feature position and image feature label to supervise the detection network training, and use the attention map to generate anchor boxes, and introduce the attention module into the detection framework; among them, the anchor box The generation method of is:
S31、生成候选锚框,参见图3,以注意力图中的每个位置为锚点,并以锚点为中心生成不同尺度的矩形框,作为候选框;每个坐标为(i,j)处的锚点对应的多个候选框中的第k个候选框Bbox i,j,k为: S31. Generate candidate anchor boxes, see Figure 3, use each position in the attention map as an anchor point, and generate rectangular boxes of different scales centered on the anchor point as candidate boxes; each coordinate is (i, j) The k-th candidate box Bbox i, j, k among multiple candidate boxes corresponding to the anchor point of is:
Bbox i,j,k={l i,j,k,t i,j,k,w i,j,k,h i,j,k}…………式3; Bbox i, j, k = {l i, j, k , t i, j, k , w i, j, k , h i, j, k }…… Formula 3;
where l_{i,j,k}, t_{i,j,k}, w_{i,j,k}, h_{i,j,k} are the top-left x-coordinate, the top-left y-coordinate, the width, and the height of the anchor box, respectively.
S32. Calculate the confidence C_{i,j,k} of each candidate box:
Figure PCTCN2021095956-appb-000007
where f is the value of each corresponding point of the attention map within the anchor-box region.
S33. Filter the candidate boxes by confidence to obtain the final candidate box set Bboxes:
Bboxes = {Bbox_{i,j,k} : C_{i,j,k} ≥ T}    (Equation 5)
where T is the confidence filtering threshold.
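Steps S31–S33 can be sketched end to end as follows. The confidence aggregation in Equation 4 is only given as an image, so taking the mean of the attention values f inside each box is an assumption, and the scale set used here is illustrative:

```python
import numpy as np

def generate_filtered_boxes(att_map, scales=((8, 8), (16, 16), (32, 16)), T=0.5):
    """S31-S33 sketch: generate anchor boxes at every attention-map
    position (S31), score each box by the mean attention value inside
    it (S32, assumed aggregation of Equation 4), and keep boxes whose
    confidence reaches the threshold T (S33, Equation 5).
    The scale set and the mean aggregation are illustrative assumptions."""
    H, W = att_map.shape
    boxes = []
    for i in range(W):          # anchor point x-coordinate
        for j in range(H):      # anchor point y-coordinate
            for (w, h) in scales:
                l = i - w // 2  # top-left corner: box centered on the anchor
                t = j - h // 2
                # clip to the map so the confidence window is valid
                x0, y0 = max(l, 0), max(t, 0)
                x1, y1 = min(l + w, W), min(t + h, H)
                if x1 <= x0 or y1 <= y0:
                    continue
                conf = att_map[y0:y1, x0:x1].mean()  # assumed mean of f
                if conf >= T:                        # Equation 5 filter
                    boxes.append((l, t, w, h, conf))
    return boxes
```

With a high threshold, only boxes lying over strongly attended regions survive, which is what lets the attention map prune the anchor set before detection.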
Step S4, global network tuning: based on the training, an optimized network framework is obtained.
Step S5. For a new detection target, acquire an image or video through the optical system and feed it into the optimized network framework to achieve target localization, analysis, and detection.
Third Embodiment
The present invention further provides a computer-readable storage medium storing computer instructions which, when run, execute the steps of the foregoing method. For the method, refer to the detailed description in the preceding sections; it is not repeated here.
Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can store information accessible by a computing device. As defined herein, computer-readable media exclude transitory media, such as modulated data signals and carrier waves.
The computer program code required for the operation of each part of this application can be written in any one or more programming languages, including object-oriented languages such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, and Python; conventional procedural languages such as C, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, and ABAP; dynamic languages such as Python, Ruby, and Groovy; or other programming languages. The program code may run entirely on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or processing device. In the latter case, the remote computer can be connected to the user's computer through any form of network, such as a local area network (LAN) or a wide area network (WAN), or to an external computer (for example, through the Internet), or used in a cloud computing environment, or offered as a service, such as software as a service (SaaS).
It should be noted that different embodiments may produce different beneficial effects; in different embodiments, the beneficial effects may be any one or a combination of those above, or any other beneficial effect that may be obtained.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (7)

  1. An attention-mechanism-based system for detecting features in a target, characterized in that the system comprises a semantic extraction module, an attention map module, and a detection module, wherein:
    the semantic extraction module comprises a multi-layer deep convolutional network responsible for extracting high-level semantic information from an input image and sharing the extracted high-level semantic information with the attention map module and the detection module;
    the attention map module comprises a classification sub-module and an attention sub-module, wherein each attribute branch of the classification sub-module comprises a plurality of convolutional layers, a global pooling layer, a fully connected layer, and a softmax layer and is responsible for globally classifying attributes of the target and supervising the training of the attention sub-module, and the attention sub-module comprises a plurality of convolutional layers and deconvolution layers and is responsible for constructing the attention map; and
    the detection module comprises an anchor-box filter layer, a target detection layer, and an analysis layer; the anchor-box filter layer filters the data received from the attention map module and sends the result to the target detection layer and the analysis layer for detection and analysis, outputting the detection result.
  2. An attention-mechanism-based method for detecting features in a target, characterized in that the method comprises the following steps:
    step S1, sample preparation: obtaining training images and annotating image global attribute labels, image feature positions, and the corresponding classification labels;
    step S2, attention map training: the attention map module uses the training images and the feature position information to generate attention map label information, and uses the attention map label information and the image global attribute labels to supervise training of the attention module;
    step S3, detection network training: fixing the parameters of the attention module and the semantic extraction module, supervising detection network training with the image feature positions and image feature labels, and using the attention map to generate anchor boxes, thereby introducing the attention module into the detection framework;
    step S4, global network tuning: based on the training, obtaining an optimized network framework; and
    step S5, for a new detection target, acquiring an image or video through an optical system and feeding it into the optimized network framework to achieve target localization, analysis, and detection.
  3. The detection method according to claim 2, characterized in that the generation of the attention map label information in step S2 comprises the following steps:
    S21. Calculate the mean image of each feature class of the labels, expressed as:
    Figure PCTCN2021095956-appb-100001
    S22. From the sample mean image
    Figure PCTCN2021095956-appb-100002
    calculate the center of gravity (x_c, y_c) of the sample difference map;
    S23. Generate the attention map G(x, y) from the center of gravity (x_c, y_c) of the difference map.
  4. The detection method according to claim 3, characterized in that the center of gravity (x_c, y_c) of the sample difference map is calculated as:
    Figure PCTCN2021095956-appb-100003
    where
    Figure PCTCN2021095956-appb-100004
    is the pixel value of the sample mean image
    Figure PCTCN2021095956-appb-100005
    at coordinate (i, j), and p_{i,j} is the pixel value of the feature image at coordinate (i, j).
  5. The detection method according to claim 4, characterized in that the attention map G(x, y) is calculated as:
    Figure PCTCN2021095956-appb-100006
    where x and y are the pixel coordinates in the attention map, and x_s, x_e, y_s, y_e are the start and end positions of the target feature along the horizontal and vertical axes of the image, respectively.
  6. The detection method according to claim 5, characterized in that the anchor boxes in step S3 are generated as follows:
    S31. Generate candidate anchor boxes: take each position in the attention map as an anchor point and generate rectangular boxes of different scales centered on it as candidate boxes; the k-th candidate box Bbox_{i,j,k} among the candidate boxes for the anchor point at coordinate (i, j) is:
    Bbox_{i,j,k} = {l_{i,j,k}, t_{i,j,k}, w_{i,j,k}, h_{i,j,k}}    (Equation 3);
    where l_{i,j,k}, t_{i,j,k}, w_{i,j,k}, h_{i,j,k} are the top-left x-coordinate, the top-left y-coordinate, the width, and the height of the anchor box, respectively;
    S32. Calculate the confidence C_{i,j,k} of each candidate box:
    Figure PCTCN2021095956-appb-100007
    where f is the value of each corresponding point of the attention map within the anchor-box region;
    S33. Filter the candidate boxes by confidence to obtain the final candidate box set Bboxes:
    Bboxes = {Bbox_{i,j,k} : C_{i,j,k} ≥ T}    (Equation 5);
    where T is the confidence filtering threshold.
  7. A computer-readable storage medium storing computer instructions, characterized in that the method according to any one of claims 2 to 6 is executed when the computer instructions are run.
PCT/CN2021/095956 2021-05-20 2021-05-26 Attention mechanism-based system and method for detecting feature in target, and storage medium WO2022241803A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110554342.8A CN113255759B (en) 2021-05-20 2021-05-20 In-target feature detection system, method and storage medium based on attention mechanism
CN202110554342.8 2021-05-20

Publications (1)

Publication Number Publication Date
WO2022241803A1 true WO2022241803A1 (en) 2022-11-24

Family

ID=77183214

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095956 WO2022241803A1 (en) 2021-05-20 2021-05-26 Attention mechanism-based system and method for detecting feature in target, and storage medium

Country Status (2)

Country Link
CN (1) CN113255759B (en)
WO (1) WO2022241803A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792632A (en) * 2021-09-02 2021-12-14 广州广电运通金融电子股份有限公司 Finger vein identification method, system and storage medium based on multi-party cooperation
CN114973064A (en) * 2022-04-29 2022-08-30 华为技术有限公司 Method and device for generating pseudo label frame and electronic equipment

Citations (5)

Publication number Priority date Publication date Assignee Title
CN109993101A (en) * 2019-03-28 2019-07-09 华南理工大学 The vehicle checking method returned based on branch intensive loop from attention network and circulation frame
US20200410235A1 (en) * 2018-09-21 2020-12-31 Ancestry.Com Operations Inc. Ventral-dorsal neural networks: object detection via selective attention
CN112183414A (en) * 2020-09-29 2021-01-05 南京信息工程大学 Weak supervision remote sensing target detection method based on mixed hole convolution
CN112200089A (en) * 2020-10-12 2021-01-08 西南交通大学 Dense vehicle detection method based on vehicle counting perception attention
US20210012146A1 (en) * 2019-07-12 2021-01-14 Wuyi University Method and apparatus for multi-scale sar image recognition based on attention mechanism

Family Cites Families (12)

Publication number Priority date Publication date Assignee Title
US9858496B2 (en) * 2016-01-20 2018-01-02 Microsoft Technology Licensing, Llc Object detection and classification in images
CN111985284A (en) * 2019-05-21 2020-11-24 天津科技大学 Single-stage target detection device without anchor box based on attention mechanism and semantic weak supervision
CN111178213B (en) * 2019-12-23 2022-11-18 大连理工大学 Aerial photography vehicle detection method based on deep learning
CN111242071B (en) * 2020-01-17 2023-04-07 陕西师范大学 Attention remote sensing image target detection method based on anchor frame
CN111275688B (en) * 2020-01-19 2023-12-12 合肥工业大学 Small target detection method based on context feature fusion screening of attention mechanism
CN111401201B (en) * 2020-03-10 2023-06-20 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN111539469B (en) * 2020-04-20 2022-04-08 东南大学 Weak supervision fine-grained image identification method based on vision self-attention mechanism
CN111666836B (en) * 2020-05-22 2023-05-02 北京工业大学 High-resolution remote sensing image target detection method of M-F-Y type light convolutional neural network
CN112232231B (en) * 2020-10-20 2024-02-02 城云科技(中国)有限公司 Pedestrian attribute identification method, system, computer equipment and storage medium
CN112308019B (en) * 2020-11-19 2021-08-17 中国人民解放军国防科技大学 SAR ship target detection method based on network pruning and knowledge distillation
CN112200161B (en) * 2020-12-03 2021-03-02 北京电信易通信息技术股份有限公司 Face recognition detection method based on mixed attention mechanism
CN112580664A (en) * 2020-12-15 2021-03-30 哈尔滨理工大学 Small target detection method based on SSD (solid State disk) network

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20200410235A1 (en) * 2018-09-21 2020-12-31 Ancestry.Com Operations Inc. Ventral-dorsal neural networks: object detection via selective attention
CN109993101A (en) * 2019-03-28 2019-07-09 华南理工大学 The vehicle checking method returned based on branch intensive loop from attention network and circulation frame
US20210012146A1 (en) * 2019-07-12 2021-01-14 Wuyi University Method and apparatus for multi-scale sar image recognition based on attention mechanism
CN112183414A (en) * 2020-09-29 2021-01-05 南京信息工程大学 Weak supervision remote sensing target detection method based on mixed hole convolution
CN112200089A (en) * 2020-10-12 2021-01-08 西南交通大学 Dense vehicle detection method based on vehicle counting perception attention

Non-Patent Citations (1)

Title
WANG, LI ET AL.: "S-AT GCN: Spatial-Attention Graph Convolution Network based Feature Enhancement for 3D Object Detection", HTTPS://ARXIV.ORG/PDF/2103.08439, 15 March 2021 (2021-03-15), XP093007671 *

Also Published As

Publication number Publication date
CN113255759B (en) 2023-08-22
CN113255759A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
Wiley et al. Computer vision and image processing: a paper review
Chen et al. Learning context flexible attention model for long-term visual place recognition
US10963632B2 (en) Method, apparatus, device for table extraction based on a richly formatted document and medium
CN111460968B (en) Unmanned aerial vehicle identification and tracking method and device based on video
CN111080693A (en) Robot autonomous classification grabbing method based on YOLOv3
Chen et al. Object-level motion detection from moving cameras
WO2021238019A1 (en) Real-time traffic flow detection system and method based on ghost convolutional feature fusion neural network
CN110929593B (en) Real-time significance pedestrian detection method based on detail discrimination
CN113076871B (en) Fish shoal automatic detection method based on target shielding compensation
WO2022241803A1 (en) Attention mechanism-based system and method for detecting feature in target, and storage medium
CN109447979B (en) Target detection method based on deep learning and image processing algorithm
CN111460927B (en) Method for extracting structured information of house property evidence image
US20230267735A1 (en) Method for structuring pedestrian information, device, apparatus and storage medium
Shen et al. Vehicle detection in aerial images based on lightweight deep convolutional network and generative adversarial network
CN110751232A (en) Chinese complex scene text detection and identification method
Li et al. ComNet: Combinational neural network for object detection in UAV-borne thermal images
CN111813997A (en) Intrusion analysis method, device, equipment and storage medium
CN112488229A (en) Domain self-adaptive unsupervised target detection method based on feature separation and alignment
CN113869264A (en) Commodity identification method, commodity identification system, storage medium and server
CN113591746A (en) Document table structure detection method and device
CN115471773B (en) Intelligent classroom-oriented student tracking method and system
CN116152696A (en) Intelligent security image identification method and system for industrial control system
CN113139540B (en) Backboard detection method and equipment
CN114898290A (en) Real-time detection method and system for marine ship
CN110427920B (en) Real-time pedestrian analysis method oriented to monitoring environment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21940256

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE