CN113936158A - Label matching method and device - Google Patents

Label matching method and device

Info

Publication number
CN113936158A
CN113936158A (application CN202111194593.6A)
Authority
CN
China
Prior art keywords
anchor
anchor point
sample
feature
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111194593.6A
Other languages
Chinese (zh)
Inventor
陈子亮 (Chen Ziliang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111194593.6A
Publication of CN113936158A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a label matching method and apparatus, relating to the technical field of artificial intelligence, in particular to computer vision and deep learning, and applicable to scenarios such as image processing and image recognition. The specific implementation scheme is as follows: perform feature extraction on the sample image to obtain a feature map; map the region features in the feature map corresponding to the ground-truth label of the sample image into a first feature vector; obtain the feature at the center point of each anchor box in the feature map as a second feature vector; and calculate the similarity between the first feature vector corresponding to the ground-truth label and each second feature vector, dividing positive-sample and negative-sample anchor boxes based on the similarity. Because the features extracted by the network learn information such as the size, shape and occlusion of the target, calculating the similarity between the feature vector corresponding to the ground-truth label and the feature vectors corresponding to the anchor boxes adaptively matches the region of interest of the real target, yielding a more robust division of positive-sample and negative-sample anchor boxes.

Description

Label matching method and device
Technical Field
The present disclosure relates to the field of artificial intelligence technology, specifically to the fields of computer vision and deep learning, and can be applied to scenarios such as image processing and image recognition.
Background
The task of target detection is to locate the objects in an image and determine their categories and positions; it is one of the core problems in the field of computer vision.
With the development of computer vision technology, research on target detection has become increasingly active, and it is widely applied in fields such as intelligent monitoring systems and automatic driving.
Disclosure of Invention
The disclosure provides a label matching method, a label matching apparatus, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a label matching method, including:
performing feature extraction on a sample image to obtain a feature map, wherein the sample image carries at least one ground-truth label;
mapping the region features in the feature map corresponding to the ground-truth label of the sample image into a first feature vector based on a preset region feature aggregation algorithm;
obtaining the feature at the center point of each anchor box in the feature map as a second feature vector;
and, for each ground-truth label, calculating the similarity between its first feature vector and each second feature vector, and dividing the anchor boxes into positive-sample anchor boxes and negative-sample anchor boxes based on the similarity.
According to another aspect of the present disclosure, there is provided a label matching apparatus including:
a feature extraction module, configured to perform feature extraction on a sample image to obtain a feature map, wherein the sample image carries at least one ground-truth label;
a mapping module, configured to map the region features in the feature map corresponding to the ground-truth label of the sample image into a first feature vector based on a preset region feature aggregation algorithm;
an obtaining module, configured to obtain the feature at the center point of each anchor box in the feature map as a second feature vector;
and a dividing module, configured to calculate, for each ground-truth label, the similarity between its first feature vector and each second feature vector, and to divide the anchor boxes into positive-sample anchor boxes and negative-sample anchor boxes based on the similarity.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the label matching method.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the label matching method.
According to yet another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the label matching method.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of label matching in the prior art;
FIG. 2 is another schematic diagram of label matching in the prior art;
FIG. 3 is a schematic flow chart of a label matching method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of mapping a ground-truth label to a region in a feature map;
FIG. 5 is a block diagram of an apparatus for implementing a label matching method according to an embodiment of the present disclosure;
FIG. 6 is a block diagram of an electronic device for implementing a label matching method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The task of target detection is to locate the objects in an image and determine their categories and positions; it is one of the core problems in the field of computer vision.
Currently, target detection is mostly based on convolutional neural networks. In the training process, label matching must first be performed to obtain positive-sample and negative-sample anchor boxes (anchors), and position loss and/or category loss are then calculated based on these anchor boxes.
In the field of target detection, label matching is also called label assignment. Its essence is: for each ground-truth label (ground truth), divide the anchor boxes into positive and negative samples. A ground-truth label is typically a rectangular region in the sample image that frames an object in the image; an anchor box is a rectangular box automatically generated according to the size of the feature map.
Existing label matching methods fall roughly into two types:
the first mode is as follows: and calculating the intersection ratio between the real label and the anchor point frame, and distinguishing the anchor point frames of the positive and negative samples by using a fixed intersection ratio threshold, for example, in a RetinaNet algorithm, dividing the anchor point frame with the intersection ratio larger than 0.5 into a positive sample anchor point frame, and dividing the anchor point frame with the intersection ratio smaller than 0.4 into a negative sample anchor point frame. For example, referring to fig. 1, fig. 1 is a schematic diagram of tag matching in the prior art. FIG. 1 shows the intersection ratio of 9 anchor boxes to the real tag, with the intersection ratio of only the center anchor box to the real tag being greater than 0.5, which is taken as the positive sample anchor box; and the intersection ratio of other anchor point frames and the real label is less than 0.4, and the anchor point frames are used as negative sample anchor point frames.
The second approach: in anchor-free target detection training, each feature point is treated as an anchor box; feature points within the ground-truth region are taken as positive-sample anchor boxes, and feature points outside the ground-truth region are taken as negative-sample anchor boxes. Referring to FIG. 2, another schematic diagram of label matching in the prior art: the four feature points contained in the ground-truth box serve as positive-sample anchors, and the other feature points serve as negative-sample anchors.
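The anchor-free rule of FIG. 2 reduces to a point-in-box test; a minimal NumPy sketch (function name and box layout are illustrative, not from the patent):

```python
import numpy as np

def center_in_box_assign(points, gt_box):
    """Anchor-free rule: a feature point (x, y) is a positive sample (1)
    iff it falls inside the ground-truth box [x1, y1, x2, y2], else 0."""
    inside_x = (points[:, 0] >= gt_box[0]) & (points[:, 0] <= gt_box[2])
    inside_y = (points[:, 1] >= gt_box[1]) & (points[:, 1] <= gt_box[3])
    return (inside_x & inside_y).astype(int)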
Both label matching approaches have certain defects, as follows:
in the first mode, anchor point frames of positive and negative samples are selected based on a fixed intersection ratio threshold, but in practical application, suitable intersection ratio thresholds may be different for different types of targets, and the suitable intersection ratio may also dynamically change as training progresses, so that the anchor point frames of positive and negative samples divided based on the fixed intersection ratio threshold are often suboptimal.
The second approach ignores the uncertainty in size and shape caused by, for example, possible occlusion of the target, and cannot dynamically select suitable positive-sample and negative-sample anchor boxes according to the image data.
To solve the above problems of the existing label matching process, the present disclosure provides a label matching method, apparatus, electronic device and storage medium.
In one embodiment of the present disclosure, a label matching method is provided, and the method includes:
performing feature extraction on the sample image to obtain a feature map, wherein the sample image carries at least one ground-truth label;
mapping the region features in the feature map corresponding to the ground-truth label of the sample image into a first feature vector based on a preset region feature aggregation algorithm;
obtaining the feature at the center point of each anchor box in the feature map as a second feature vector;
and, for each ground-truth label, calculating the similarity between its first feature vector and each second feature vector, and dividing the anchor boxes into positive-sample anchor boxes and negative-sample anchor boxes based on the similarity.
Therefore, in the label matching process, the positive-sample and negative-sample anchor boxes are divided according to feature similarity. This amounts to a dynamic threshold: since the feature map is itself learned and adjusted as training progresses, the computed feature similarity also evolves dynamically.
In addition, because the features extracted by the network learn information such as the size, shape and occlusion of the target, calculating the similarity between the feature vector corresponding to the ground-truth label and the feature vectors corresponding to the anchor boxes adaptively matches the region of interest of the real target. This yields a more robust division of positive-sample anchor boxes, so the target detection algorithm obtains more accurate detection results without adding any time cost to forward inference.
The label matching method, apparatus, electronic device, and storage medium provided in the embodiments of the present disclosure are each described in detail below.
Referring to FIG. 3, FIG. 3 is a schematic flow chart of a label matching method provided in an embodiment of the present disclosure. As shown in FIG. 3, the method may include the following steps:
s301: and performing feature extraction on the sample image to obtain a feature map, wherein the sample image is provided with at least one real label.
The label matching method provided by the embodiments of the present disclosure can be applied to the training process of a target detection network.
In the training process, the sample image carries at least one ground-truth label; a ground-truth label is usually a rectangular region that frames the detected object in the sample image.
In this step, feature extraction may be performed on the sample image by a convolutional neural network to obtain a feature map, which is usually multi-channel. That is, the sample image is input into the convolutional neural network, and the convolutional neural network extracts the features of the sample image.
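For the later steps, only the geometry of this output matters: an (H/stride, W/stride, C) feature map aligned with the image. The stand-in below fakes a backbone with strided mean pooling purely to produce data of the right shape; a real detector would use a learned CNN such as a ResNet, and the stride and channel count here are arbitrary assumptions.

```python
import numpy as np

def mock_backbone(image, stride=4, channels=8):
    """Stand-in for a CNN backbone: downsample a (H, W) grayscale image
    by `stride` via mean pooling and tile it into `channels` channels,
    giving an (H//stride, W//stride, C) feature map."""
    h, w = image.shape
    fh, fw = h // stride, w // stride
    pooled = image[:fh * stride, :fw * stride].reshape(fh, stride, fw, stride).mean(axis=(1, 3))
    return np.stack([pooled] * channels, axis=-1)
```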
S302: map the region features in the feature map corresponding to the ground-truth label of the sample image into a first feature vector based on a preset region feature aggregation algorithm.
In the embodiment of the present disclosure, after feature extraction is performed on the sample image, the ground-truth label of the sample image is mapped to a region in the feature map. As an example, referring to FIG. 4, FIG. 4 is a schematic diagram of mapping a ground-truth label to a region in the feature map: a ground-truth label in the sample image is mapped to a region in the feature map obtained by feature extraction.
The region feature aggregation algorithm pools the corresponding region of the feature map into a feature of a specific size for subsequent classification regression and position regression; the feature size can be set as required.
In this step, the region features in the feature map corresponding to the ground-truth label of the sample image may be mapped into the form of a feature vector, as the first feature vector, based on the region feature aggregation algorithm.
The region feature aggregation algorithm may be ROI Pooling (RoIPool) or ROI Align (RoIAlign).
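A crude stand-in for this step, assuming the simplest possible aggregation (mean-pooling the whole mapped region straight to one vector rather than RoIPool/RoIAlign's fixed output grid); the stride and box layout are assumptions:

```python
import numpy as np

def roi_to_vector(feature_map, gt_box, stride=4):
    """Map a ground-truth box [x1, y1, x2, y2] (image coordinates) onto
    the (H, W, C) feature map and average-pool that region into a single
    C-dimensional first feature vector."""
    x1, y1, x2, y2 = (int(round(v / stride)) for v in gt_box)
    region = feature_map[y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
    return region.mean(axis=(0, 1))
```

RoIAlign would instead bilinearly sample the region to, say, a 7x7xC grid; only the "region -> fixed-size feature" idea is needed here.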
S303: obtain the feature at the center point of each anchor box in the feature map as a second feature vector.
Anchor boxes are automatically generated according to the size of the feature map and are usually rectangular; their size, aspect ratio and generated center positions can be set as required.
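A common generation scheme, sketched under the assumption of one anchor per (scale, ratio) at the center of every feature-map cell (the patent does not fix a particular scheme):

```python
import numpy as np

def generate_anchors(fh, fw, stride=4, scales=(32,), ratios=(1.0,)):
    """One anchor per (scale, aspect ratio) centred on every cell of an
    fh x fw feature map, returned as (N, 4) [x1, y1, x2, y2] boxes in
    image coordinates."""
    anchors = []
    for y in range(fh):
        for x in range(fw):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)
```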
Since the feature map is multi-channel, the feature at the center point of each anchor box is itself a multidimensional feature vector and can be used directly as the second feature vector.
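Reading that center-point feature is a single indexed lookup (the stride and box layout are the same illustrative assumptions as above):

```python
import numpy as np

def anchor_center_vector(feature_map, anchor, stride=4):
    """Second feature vector: the C-dimensional feature at the feature-map
    cell under the anchor's centre. anchor is [x1, y1, x2, y2] in image
    coordinates; feature_map is (H, W, C)."""
    cx = int((anchor[0] + anchor[2]) / 2 // stride)
    cy = int((anchor[1] + anchor[3]) / 2 // stride)
    return feature_map[cy, cx]
```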
S304: for each ground-truth label, calculate the similarity between its first feature vector and each second feature vector, and divide the anchor boxes into positive-sample anchor boxes and negative-sample anchor boxes based on the similarity.
In the embodiment of the present disclosure, positive-sample and negative-sample anchor boxes need to be divided for each ground-truth label.
The ground-truth label corresponds to a first feature vector and each anchor box corresponds to a second feature vector; the similarity between the first feature vector and each second feature vector is calculated respectively, and the anchor boxes are divided into positive-sample and negative-sample anchor boxes according to the similarity.
In the embodiment of the present disclosure, a preset number of anchor boxes may be selected as positive-sample anchor boxes in descending order of similarity, and the unselected anchor boxes are determined to be negative-sample anchor boxes.
As an example, the region feature in the feature map corresponding to ground-truth label A is mapped to a first feature vector a, and the center points of the anchor boxes B1 to Bm in the feature map have second feature vectors b1 to bm, respectively. The similarities between a and each of b1, b2, ..., bm are then calculated; the k second feature vectors with the highest similarity are selected, the anchor boxes corresponding to these k second feature vectors are determined to be positive-sample anchor boxes, and the other anchor boxes are determined to be negative-sample anchor boxes.
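The top-k selection above can be sketched as follows. The patent does not specify a similarity measure, so cosine similarity is used here as an assumption; the function name is illustrative.

```python
import numpy as np

def topk_similarity_assign(first_vec, second_vecs, k):
    """Divide anchors by feature similarity: cosine similarity between the
    ground-truth vector a and each anchor-centre vector b_i; the k most
    similar anchors become positive samples (1), the rest negative (0)."""
    a = first_vec / np.linalg.norm(first_vec)
    b = second_vecs / np.linalg.norm(second_vecs, axis=1, keepdims=True)
    sims = b @ a
    labels = np.zeros(len(second_vecs), dtype=int)
    labels[np.argsort(-sims)[:k]] = 1
    return labels
```

Because the feature map (and hence a and every b_i) keeps changing during training, the effective decision boundary is dynamic, unlike a fixed IoU threshold.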
Therefore, in the label matching process, the positive-sample and negative-sample anchor boxes are divided according to feature similarity. This amounts to a dynamic threshold: since the feature map is itself learned and adjusted as training progresses, the computed feature similarity also evolves dynamically.
In addition, because the features extracted by the network learn information such as the size, shape and occlusion of the target, calculating the similarity between the feature vector corresponding to the ground-truth label and the feature vectors corresponding to the anchor boxes adaptively matches the region of interest of the real target. This yields a more robust division of positive-sample anchor boxes, so the target detection algorithm obtains more accurate detection results without adding any time cost to forward inference.
In the embodiment of the present disclosure, the feature of a positive-sample anchor box can be understood as a foreground feature of the detected target, and the feature of a negative-sample anchor box as a background feature. The positive-sample and negative-sample anchor boxes therefore play different roles in subsequent training.
In the embodiment of the present disclosure, for each ground-truth label, a position loss is calculated based on the positive-sample anchor boxes, and a category confidence loss is calculated based on both the positive-sample and negative-sample anchor boxes.
Specifically, a positive-sample anchor box corresponds to the foreground region of the detected target and reflects its position information, so it can be used to calculate both the position loss and the category confidence loss; a negative-sample anchor box corresponds to the background region and can only be used to calculate the category confidence loss. Suitable loss functions can be found in the related art.
After the position loss and the category confidence loss are calculated, the parameters of the target detection network can be adjusted according to the loss values, realizing the training for target detection.
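The split between the two losses can be sketched as below. The patent leaves the loss functions to the related art, so plain L1 for position and binary cross-entropy for category confidence are illustrative assumptions.

```python
import numpy as np

def detection_losses(labels, loc_err, cls_prob):
    """labels: 1 = positive anchor, 0 = negative anchor.
    Position loss (L1 here) is averaged over positive anchors only;
    category confidence loss (binary cross-entropy here) uses every
    anchor, positive and negative."""
    pos = labels == 1
    loc_loss = float(np.abs(loc_err[pos]).mean()) if pos.any() else 0.0
    eps = 1e-7
    cls_loss = float(-np.mean(labels * np.log(cls_prob + eps)
                              + (1 - labels) * np.log(1 - cls_prob + eps)))
    return loc_loss, cls_loss
```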
Referring to FIG. 5, FIG. 5 is a block diagram of an apparatus for implementing the label matching method according to an embodiment of the present disclosure. As shown in FIG. 5, the apparatus may include:
a feature extraction module 501, configured to perform feature extraction on a sample image to obtain a feature map, wherein the sample image carries at least one ground-truth label;
a mapping module 502, configured to map, based on a preset region feature aggregation algorithm, the region features in the feature map corresponding to the ground-truth label of the sample image into a first feature vector;
an obtaining module 503, configured to obtain the feature at the center point of each anchor box in the feature map as a second feature vector;
a dividing module 504, configured to calculate, for each ground-truth label, the similarity between its first feature vector and each second feature vector, and to divide the anchor boxes into positive-sample anchor boxes and negative-sample anchor boxes based on the similarity.
In an embodiment of the present disclosure, the feature extraction module is specifically configured to:
input the sample image into a convolutional neural network, so that the convolutional neural network extracts the features of the sample image.
In an embodiment of the present disclosure, the dividing module is specifically configured to:
select a preset number of anchor boxes as positive-sample anchor boxes in descending order of similarity;
and determine the unselected anchor boxes as negative-sample anchor boxes.
In one embodiment of the present disclosure, the apparatus further includes a loss calculation module configured to:
calculate, for each ground-truth label, a position loss based on the positive-sample anchor boxes, and a category confidence loss based on both the positive-sample and negative-sample anchor boxes.
Therefore, in the label matching process, the positive-sample and negative-sample anchor boxes are divided according to feature similarity. This amounts to a dynamic threshold: since the feature map is itself learned and adjusted as training progresses, the computed feature similarity also evolves dynamically.
In addition, because the features extracted by the network learn information such as the size, shape and occlusion of the target, calculating the similarity between the feature vector corresponding to the ground-truth label and the feature vectors corresponding to the anchor boxes adaptively matches the region of interest of the real target. This yields a more robust division of positive-sample anchor boxes, so the target detection algorithm obtains more accurate detection results without adding any time cost to forward inference.
The present disclosure also provides, according to embodiments of the present disclosure, an electronic device, a readable storage medium, and a computer program product.
The present disclosure provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the label matching method.
The present disclosure provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the label matching method.
The present disclosure provides a computer program product comprising a computer program which, when executed by a processor, implements the label matching method.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 6, the device 600 comprises a computing unit 601, which may perform various suitable actions and processes in accordance with a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the respective methods and processes described above, such as the label matching method. For example, in some embodiments, the label matching method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the label matching method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the label matching method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, and the order is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (11)

1. A tag matching method, the method comprising:
performing feature extraction on a sample image to obtain a feature map, wherein the sample image carries at least one ground-truth label;
mapping, based on a preset region feature aggregation algorithm, the region features in the feature map corresponding to each ground-truth label of the sample image into a first feature vector;
acquiring the feature at the center point of each anchor box in the feature map as a second feature vector;
and calculating, for each ground-truth label, the similarity between the corresponding first feature vector and each second feature vector, and dividing the anchor boxes into positive-sample anchor boxes and negative-sample anchor boxes based on the similarity.
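As a rough illustration only, the division step of claim 1 and the top-k selection of claim 3 might be sketched as below; the cosine metric, the plain-list vectors, and all function names are assumptions made for the sketch, not part of the claimed method:

```python
import math

def cosine_similarity(u, v):
    # Similarity between a first feature vector and a second feature vector.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def split_anchors(first_vector, second_vectors, k):
    # Rank anchor boxes by similarity to the label's first feature vector,
    # take the top k as positive samples and the rest as negative samples.
    sims = [cosine_similarity(first_vector, v) for v in second_vectors]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return set(order[:k]), set(order[k:])
```

For example, `split_anchors([1.0, 0.0], [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2]], 2)` marks anchors 0 and 2 as positive samples and anchor 1 as negative.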
2. The method of claim 1, wherein performing feature extraction on the sample image comprises:
inputting the sample image into a convolutional neural network, so that the convolutional neural network performs feature extraction on the sample image.
3. The method of claim 1, wherein dividing the anchor boxes into positive-sample anchor boxes and negative-sample anchor boxes based on the similarity comprises:
selecting a preset number of anchor boxes as positive-sample anchor boxes in descending order of similarity;
and determining the unselected anchor boxes as negative-sample anchor boxes.
4. The method of claim 1, further comprising, after dividing the anchor boxes into positive-sample anchor boxes and negative-sample anchor boxes based on the similarity:
calculating, for each ground-truth label, a position loss based on the positive-sample anchor boxes, and a category confidence loss based on the positive-sample anchor boxes and the negative-sample anchor boxes.
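A minimal numeric sketch of the per-label losses in claim 4, assuming smooth-L1 for the position loss and binary cross-entropy for the category confidence loss (common choices in anchor-based detectors, but not stated in the claim), with hypothetical box dictionaries:

```python
import math

def smooth_l1(pred, target, beta=1.0):
    # Smooth-L1 (Huber) term for one box coordinate.
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

def bce(score, label, eps=1e-7):
    # Binary cross-entropy on a class confidence score in (0, 1).
    score = min(max(score, eps), 1.0 - eps)
    return -(label * math.log(score) + (1.0 - label) * math.log(1.0 - score))

def label_losses(positive_boxes, negative_boxes, gt_coords):
    # Position loss uses the positive-sample anchor boxes only.
    position = sum(
        smooth_l1(c, g)
        for box in positive_boxes
        for c, g in zip(box["coords"], gt_coords)
    )
    # Confidence loss uses positives (target 1) and negatives (target 0).
    confidence = sum(bce(box["score"], 1.0) for box in positive_boxes)
    confidence += sum(bce(box["score"], 0.0) for box in negative_boxes)
    return position, confidence
```

The asymmetry mirrors the claim: only positive samples contribute to the regression target, while both sample sets drive the classification signal.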
5. A tag matching apparatus, the apparatus comprising:
a feature extraction module configured to perform feature extraction on a sample image to obtain a feature map, wherein the sample image carries at least one ground-truth label;
a mapping module configured to map, based on a preset region feature aggregation algorithm, the region features in the feature map corresponding to each ground-truth label of the sample image into a first feature vector;
an acquisition module configured to acquire the feature at the center point of each anchor box in the feature map as a second feature vector;
and a dividing module configured to calculate, for each ground-truth label, the similarity between the corresponding first feature vector and each second feature vector, and to divide the anchor boxes into positive-sample anchor boxes and negative-sample anchor boxes based on the similarity.
6. The apparatus of claim 5, wherein the feature extraction module is specifically configured to:
input the sample image into a convolutional neural network, so that the convolutional neural network performs feature extraction on the sample image.
7. The apparatus of claim 5, wherein the dividing module is specifically configured to:
select a preset number of anchor boxes as positive-sample anchor boxes in descending order of similarity;
and determine the unselected anchor boxes as negative-sample anchor boxes.
8. The apparatus of claim 5, further comprising a loss calculation module configured to:
calculate, for each ground-truth label, a position loss based on the positive-sample anchor boxes, and a category confidence loss based on the positive-sample anchor boxes and the negative-sample anchor boxes.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-4.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-4.
CN202111194593.6A 2021-10-13 2021-10-13 Label matching method and device Pending CN113936158A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111194593.6A CN113936158A (en) 2021-10-13 2021-10-13 Label matching method and device

Publications (1)

Publication Number Publication Date
CN113936158A true CN113936158A (en) 2022-01-14

Family

ID=79279207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111194593.6A Pending CN113936158A (en) 2021-10-13 2021-10-13 Label matching method and device

Country Status (1)

Country Link
CN (1) CN113936158A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114926447A (en) * 2022-06-01 2022-08-19 Beijing Baidu Netcom Science and Technology Co Ltd Method for training a model, method and device for detecting a target
CN114926447B (en) * 2022-06-01 2023-08-29 Beijing Baidu Netcom Science and Technology Co Ltd Method for training a model, method and device for detecting a target

Similar Documents

Publication Publication Date Title
EP3852008A2 (en) Image detection method and apparatus, device, storage medium and computer program product
CN113378712B (en) Training method of object detection model, image detection method and device thereof
CN113869449A (en) Model training method, image processing method, device, equipment and storage medium
CN112528858A (en) Training method, device, equipment, medium and product of human body posture estimation model
CN113947188A (en) Training method of target detection network and vehicle detection method
CN113205041A (en) Structured information extraction method, device, equipment and storage medium
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN113780098A (en) Character recognition method, character recognition device, electronic equipment and storage medium
CN115358392A (en) Deep learning network training method, text detection method and text detection device
CN114881129A (en) Model training method and device, electronic equipment and storage medium
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN114511743B (en) Detection model training, target detection method, device, equipment, medium and product
US20230245429A1 (en) Method and apparatus for training lane line detection model, electronic device and storage medium
CN114581732A (en) Image processing and model training method, device, equipment and storage medium
CN113537192B (en) Image detection method, device, electronic equipment and storage medium
CN113936158A (en) Label matching method and device
CN116385789B (en) Image processing method, training device, electronic equipment and storage medium
CN115861809A (en) Rod detection and training method and device for model thereof, electronic equipment and medium
CN115761698A (en) Target detection method, device, equipment and storage medium
CN114612971A (en) Face detection method, model training method, electronic device, and program product
CN114677566A (en) Deep learning model training method, object recognition method and device
CN114764874A (en) Deep learning model training method, object recognition method and device
CN114093006A (en) Training method, device and equipment of living human face detection model and storage medium
CN113989300A (en) Lane line segmentation method and device, electronic equipment and storage medium
CN113706705A (en) Image processing method, device and equipment for high-precision map and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination