CN116740398A - Target detection and matching method, device and readable storage medium


Info

Publication number: CN116740398A
Application number: CN202310678789.5A
Authority: CN (China)
Prior art keywords: target, frame, anchor, image, detection
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 王驰宇, 蔡科
Current and original assignee: Glodon Co Ltd
Application filed by Glodon Co Ltd
Priority to CN202310678789.5A
Publication of CN116740398A


Classifications

    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; coarse-fine approaches, e.g. multi-scale approaches; using context analysis; selection of dictionaries
    • G06N3/042 Knowledge-based neural networks; logical representations of neural networks
    • G06N3/08 Learning methods
    • G06N5/04 Inference or reasoning models
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V2201/07 Target detection
    • G06V2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection and matching method, a device, and a readable storage medium. The method comprises the following steps: acquiring an image to be processed and inputting it into a trained target detection model; calculating anchor frame information for each anchor frame through the target detection model; screening target anchor frames from all anchor frames according to the object detection information of each anchor frame; drawing, on the image to be processed, an object prediction frame for marking the target object according to the object detection information of the target anchor frame; and drawing, on the image to be processed, an element prediction frame which marks the target element and has a matching relationship with the object prediction frame, according to the element detection information of the target anchor frame. Through a trained target detection model with dual inference heads, the invention can compute the prediction frames of each category and the matching relationships between them from an image in a single pass, achieving both high detection accuracy and high detection efficiency.

Description

Target detection and matching method, device and readable storage medium
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a method and apparatus for detecting and matching a target, and a readable storage medium.
Background
In image-based target detection scenarios, when two target objects must be detected in one image, two target detection models typically have to be trained even if a matching relationship exists between the two objects: each model detects one object in the image, and the matching relationship between the detected objects is then computed afterwards. For example, when the two target objects are a human body and a human head, a commonly used detection method is to detect the bounding boxes of the human bodies and of the human heads in the image simultaneously, and then determine the matching relationship between bodies and heads by computing the degree of overlap between the bounding boxes. However, when the human-body bounding boxes overlap heavily, simply computing bounding-box overlap between multiple human bodies and multiple human heads may wrongly match one person's head to another person's body. How to detect two target objects with a matching relationship from an image more accurately is therefore a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to provide a target detection and matching method, a target detection and matching device and a readable storage medium, which can calculate target detection frames of all classes and matching relations among the target detection frames from images at one time through a trained target detection model with double inference heads, and have high detection accuracy and high detection efficiency.
According to one aspect of the present invention, there is provided a target detection and matching method, the method comprising:
acquiring an image to be processed, and inputting the image to be processed into a trained target detection model;
calculating anchor frame information of each anchor frame through the target detection model; wherein the anchor frame information includes: object detection information for describing a target object and element detection information for describing a target element having a matching relationship;
screening target anchor frames from all anchor frames according to the object detection information of each anchor frame;
drawing an object prediction frame for marking the target object on the image to be processed according to the object detection information of the target anchor frame;
and drawing an element prediction frame which has a matching relation with the object prediction frame and is used for marking the target element on the image to be processed according to the element detection information of the target anchor frame.
Optionally, before the acquiring the image to be processed and inputting the image to be processed into the trained target detection model, the method further includes:
acquiring an original image on which an object real frame for marking the target object has been drawn;
generating N sub-original images based on the original image according to the number N of object real frames in the original image; wherein each sub-original image retains, without repetition across sub-original images, the pixels inside one object real frame, and the pixels outside the retained object real frame are set to black;
drawing an element real frame for marking the target element in each sub-original image in sequence, and establishing a matching relation between the object real frame and the element real frame in each sub-original image to obtain a sample image for model training;
training a preset model according to the sample image to obtain a target detection model for detecting the target object and the target element with the matching relation in the image.
Optionally, the training the preset model according to the sample image to obtain a target detection model for detecting the target object and the target element with a matching relationship in the image includes:
forming sample data according to the position information of the object real frame, the position information of the element real frame and the matching relation between the object real frame and the element real frame in the sample image;
training a YOLOv5 model with double inference heads according to the sample data to obtain the target detection model;
wherein the sample data comprises: object marking information describing the target object and element marking information describing the target element having a matching relationship;
the object marking information includes: the target object category ID, the abscissa of the center point of the object real frame, the ordinate of the center point of the object real frame, the width of the object real frame and the height of the object real frame;
the element marking information includes: the target element category ID, the abscissa of the element real frame center point, the ordinate of the element real frame center point, the width of the element real frame, and the height of the element real frame.
Optionally, the object detection information includes: object location information and object confidence for characterizing the probability of the target object appearing within an anchor frame; wherein the object position information includes: a deviation value of an abscissa of the center point of the object prediction frame and an abscissa of the center point of the anchor frame, a deviation value of an ordinate of the center point of the object prediction frame and an ordinate of the center point of the anchor frame, a deviation value of a width of the object prediction frame and a width of the anchor frame, and a deviation value of a height of the object prediction frame and a height of the anchor frame;
the element detection information includes: element position information and element confidence for characterizing the probability of the target element occurring within an anchor frame; wherein the element position information includes: the deviation value of the abscissa of the element prediction frame center point and the abscissa of the anchor frame center point, the deviation value of the ordinate of the element prediction frame center point and the ordinate of the anchor frame center point, the deviation value of the width of the element prediction frame and the width of the anchor frame, and the deviation value of the height of the element prediction frame and the height of the anchor frame.
Optionally, the screening the target anchor frame from all anchor frames according to the object detection information of each anchor frame includes:
screening, from all anchor frames and according to the object confidence of each anchor frame, one or more mutually adjacent candidate anchor frames whose object confidence is greater than a first preset threshold by a non-maximum suppression (NMS) algorithm, and merging all screened candidate anchor frames into the target anchor frame;
and forming the object detection information and the element detection information of the target anchor frame according to the object detection information and the element detection information of all the candidate anchor frames.
Optionally, the drawing, on the image to be processed, an object detection frame for marking the target object according to the object detection information of the target anchor frame includes:
and drawing the object prediction frame on the image to be processed according to the object position information in the object detection information of the target anchor frame and the center point position information of the target anchor frame.
Optionally, the drawing, on the image to be processed, an element prediction frame for marking the target element, where the element prediction frame has a matching relationship with the object prediction frame, according to the element detection information of the target anchor frame includes:
judging whether the element confidence in the element detection information of the target anchor frame is greater than a second preset threshold;
if so, drawing the element prediction frame on the image to be processed according to the element position information in the element detection information of the target anchor frame and the center point position information of the target anchor frame;
if not, the element prediction frame is not drawn on the image to be processed.
Optionally, the target object is a human body and the target element is a human face; or the target object is a human body and the target element is a safety helmet; or the target object is a vehicle body and the target element is a vehicle hopper.
In order to achieve the above purpose, the present invention further provides a target detection and matching device, which specifically includes the following components:
the acquisition module is used for acquiring an image to be processed and inputting the image to be processed into a trained target detection model;
the calculation module is used for calculating anchor frame information of each anchor frame through the target detection model; wherein the anchor frame information includes: object detection information for describing a target object and element detection information for describing a target element having a matching relationship;
the screening module is used for screening target anchor frames from all anchor frames according to the object detection information of each anchor frame;
the marking module is used for drawing an object prediction frame for marking the target object on the image to be processed according to the object detection information of the target anchor frame;
and the matching module is used for drawing an element prediction frame which has a matching relation with the object prediction frame and is used for marking the target element on the image to be processed according to the element detection information of the target anchor frame.
In order to achieve the above object, the present invention further provides a computer device, which specifically includes: the system comprises a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the target detection and matching method when executing the computer program.
In order to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described object detection and matching method.
According to the target detection and matching method, device and readable storage medium, a target object and a target element having a matching relationship can be identified from an image in a single pass through the trained target detection model, achieving high detection accuracy and high detection efficiency. Compared with a conventional target detection model, which outputs only one inference head, the target detection model here outputs two inference heads, namely the object detection information and the element detection information, through which an object prediction frame and an element prediction frame having a matching relationship can be marked on the image to be processed; the matching relationship between the detected objects is thus established at inference time.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 is a schematic flow chart of an alternative method for detecting and matching targets according to the first embodiment;
FIG. 2 is a schematic diagram of an original image for model training provided in accordance with the first embodiment;
FIGS. 3 (a), 3 (b), and 3 (c) are schematic diagrams of the three sub-original images provided in embodiment one;
FIG. 4 is a schematic diagram of a model structure of improved YOLOv5 with two inference heads according to the first embodiment;
FIG. 5 is a schematic diagram of an alternative structure of a target detection and matching device according to the second embodiment;
fig. 6 is a schematic diagram of an alternative hardware architecture of a computer device according to the third embodiment.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiment 1
The embodiment of the invention provides a target detection and matching method, as shown in fig. 1, which specifically comprises the following steps:
step S101: and acquiring an image to be processed, and inputting the image to be processed into a trained target detection model.
Preferably, the target detection model is trained based on a YOLOv5 (You Only Look Once Version 5) model with a double inference head.
Specifically, before the step S101, the method further includes:
Step A1: acquiring an original image on which an object real frame for marking the target object has been drawn;
Step A2: generating N sub-original images based on the original image according to the number N of object real frames in the original image; wherein each sub-original image retains, without repetition across sub-original images, the pixels inside one object real frame, and the pixels outside the retained object real frame are set to black;
for example, in the scene where the target object is a human body, if three human bodies are included in the original image as shown in fig. 2, three sub-original images need to be generated according to the original image, as shown in fig. 3 (a), 3 (b), and 3 (c), schematic diagrams of the three sub-original images are shown, each sub-original image includes a human body BBOX (Bounding Box), and in addition, in each sub-original image, except for the pixels in the BBOX, the remaining pixels are set to be black;
step A3: drawing an element real frame for marking a target element in each sub-original image in sequence, and establishing a matching relation between the object real frame and the element real frame in each sub-original image to obtain a sample image for model training;
in this embodiment, the marking information of the target object (i.e., the information of the object real frame) and the marking information of the target element (i.e., the information of the element real frame) in the sample image are combined (i.e., a matching relationship is established) to be used as training marks for model training;
step A4: training a preset model according to the sample image to obtain a target detection model for detecting the target object and the target element with the matching relation in the image.
It should be noted that a conventional YOLO labelling tool can only mark independent BBOXes in an image and cannot record that two BBOXes match each other; by means of the above steps A1 to A4, two matched BBOXes (i.e., the object real frame and the element real frame) can be marked for one original image.
Further, the step A4 specifically includes:
step A41: forming sample data according to the position information of the object real frame, the position information of the element real frame and the matching relation between the object real frame and the element real frame in the sample image;
wherein the sample data comprises: object marking information describing the target object and element marking information describing the target element having a matching relationship;
the object marking information includes: the target object category ID, the abscissa of the center point of the object real frame, the ordinate of the center point of the object real frame, the width of the object real frame and the height of the object real frame;
the element marking information includes: the target element category ID, the abscissa of the element real frame center point, the ordinate of the element real frame center point, the width of the element real frame, and the height of the element real frame.
One training label (i.e., one sample datum) in a conventional YOLOv5 training dataset contains only five fields: the class ID, the abscissa x of the BBOX centre point, the ordinate y of the BBOX centre point, the width w of the BBOX, and the height h of the BBOX. In this embodiment, one sample datum may contain ten fields, for example: class_person, x, y, w, h, class_head, x_head, y_head, w_head, h_head. Here, class_person is the target object category ID, which may be set to 0; x, y, w and h are the abscissa of the object real frame centre point, the ordinate of the object real frame centre point, the width of the object real frame, and the height of the object real frame; class_head is the target element category ID, which may be set to 1; and x_head, y_head, w_head and h_head are the abscissa of the element real frame centre point, the ordinate of the element real frame centre point, the width of the element real frame, and the height of the element real frame. Also, if a sample image contains only the target object and no target element, x_head, y_head, w_head and h_head may all be set to 0.
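The ten-field label described above can be serialised like a standard YOLO label line with five extra fields. A sketch under stated assumptions: the field order follows the example in the text, and when no element is present the element class field is also written as zero (the patent only specifies zeros for the four element position fields).

```python
def format_label(obj_cls, obj_box, elem_cls=None, elem_box=None):
    """Serialise one training label in the ten-field format:
    class_person x y w h class_head x_head y_head w_head h_head.
    obj_box and elem_box are (x_center, y_center, width, height)."""
    fields = [obj_cls, *obj_box]
    if elem_box is None:
        # Sample contains a target object but no matching target element.
        fields += [0, 0.0, 0.0, 0.0, 0.0]
    else:
        fields += [elem_cls, *elem_box]
    return " ".join(str(f) for f in fields)
```

For example, format_label(0, (0.5, 0.5, 0.2, 0.6), 1, (0.5, 0.3, 0.1, 0.1)) produces a ten-field line pairing a human-body box with its head box.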
Step A42: and training a YOLOv5 model with double inference heads according to the sample data to obtain the target detection model.
In this embodiment, an existing YOLOv5 model is modified to detect a target object and a target element that have a matching relationship. In the prior art, a trained YOLOv5 model has only one inference head, through which only a target object (e.g., a human body) or a target element (e.g., a human head) can be identified from an image; the matching relationship between the identified target object and target element cannot be obtained. In contrast, this embodiment uses an improved YOLOv5 model with two inference heads, shown schematically in fig. 4, in which one inference head is used to detect the target object and the other is used to detect the target element. In addition, since sample images centred on the target object are used to train the inference head of the target element in this embodiment, GIoU is selected as the loss function for the BBOX of the human-head target.
Further, the step S101 specifically includes:
Step B1: performing a data enhancement operation on the image to be processed; wherein the data enhancement operation includes at least one of: mosaic stitching (Mosaic), scale transformation (scale), flipping (flip), and HSV (Hue, Saturation, Value) enhancement;
Step B2: inputting the image subjected to the enhancement operation into the target detection model.
Preprocessing the image to be processed in this way improves the robustness and generality of the target detection model.
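One of the enhancement operations in step B1, horizontal flipping, must also adjust the label coordinates. A minimal sketch, assuming boxes are (x_center, y_center, w, h) normalised to [0, 1]; the mosaic, scale and HSV operations are omitted here, and the function name is illustrative.

```python
import numpy as np

def flip_horizontal(image, boxes):
    """Mirror the image left-right and mirror the x coordinate of each box
    centre; widths, heights and y coordinates are unchanged."""
    flipped = image[:, ::-1].copy()
    out = boxes.copy()
    out[:, 0] = 1.0 - out[:, 0]
    return flipped, out
```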
Step S102: calculating anchor frame information of each anchor frame through the target detection model; wherein the anchor frame information includes: object detection information for describing a target object and element detection information for describing a target element having a matching relationship.
It should be noted that the object detection information and the element detection information in this embodiment correspond to the two inference heads; that is, after the image to be processed is input into the target detection model, the outputs of two inference heads having a matching relationship are produced.
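Consistent with the ten-field training label, each anchor's raw prediction can be read as ten channels, five per inference head. The 10-channel layout and the names below are assumptions for illustration; the patent does not fix the channel order.

```python
import numpy as np

def split_dual_head_output(pred):
    """Split one anchor's raw 10-channel prediction into object detection
    information (four offsets plus object confidence) and element detection
    information (four offsets plus element confidence)."""
    obj = {"offsets": pred[0:4], "conf": float(pred[4])}
    elem = {"offsets": pred[5:9], "conf": float(pred[9])}
    return obj, elem
```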
Specifically, the object detection information includes: object location information and object confidence for characterizing the probability of the target object appearing within an anchor frame; wherein the object position information includes: a deviation value of an abscissa of the center point of the object prediction frame and an abscissa of the center point of the anchor frame, a deviation value of an ordinate of the center point of the object prediction frame and an ordinate of the center point of the anchor frame, a deviation value of a width of the object prediction frame and a width of the anchor frame, and a deviation value of a height of the object prediction frame and a height of the anchor frame;
the element detection information includes: element position information and element confidence for characterizing the probability of the target element occurring within an anchor frame; wherein the element position information includes: the deviation value of the abscissa of the element prediction frame center point and the abscissa of the anchor frame center point, the deviation value of the ordinate of the element prediction frame center point and the ordinate of the anchor frame center point, the deviation value of the width of the element prediction frame and the width of the anchor frame, and the deviation value of the height of the element prediction frame and the height of the anchor frame.
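Turning the deviation values above into an absolute prediction frame can be sketched as follows. A simple additive decoding is assumed for clarity; YOLOv5's actual decoding applies sigmoid activations and anchor scaling, which the patent does not specify.

```python
def decode_box(anchor, offsets):
    """anchor is (cx, cy, w, h) of the anchor frame; offsets is the
    (dx, dy, dw, dh) deviation values predicted by one head. Returns the
    absolute (cx, cy, w, h) of the prediction frame."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    return (cx + dx, cy + dy, w + dw, h + dh)
```

The same decoding applies to both heads, which is what ties the element prediction frame to the same anchor as the object prediction frame.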
Step S103: and screening target anchor frames from all anchor frames according to the object detection information of each anchor frame.
Specifically, the step S103 includes:
Step C1: screening, from all anchor frames and according to the object confidence of each anchor frame, one or more mutually adjacent candidate anchor frames whose object confidence is greater than a first preset threshold by a non-maximum suppression (NMS) algorithm; merging all screened candidate anchor frames into the target anchor frame; and forming the object detection information and element detection information of the target anchor frame from the object detection information and element detection information of all screened candidate anchor frames;
It should be noted that, since the target object may appear in multiple anchor frames, the multiple anchor frames in which the target object appears need to be combined by the NMS algorithm to obtain the target anchor frame.
If only one candidate anchor frame is screened out, its object detection information and element detection information are used directly as the object detection information and element detection information of the target anchor frame. If multiple candidate anchor frames are screened out, their object detection information is merged into the object detection information of the target anchor frame according to a preset rule, and their element detection information is likewise merged into the element detection information of the target anchor frame. The object detection information of the target anchor frame also comprises object position information and an object confidence, and its element detection information also comprises element position information and an element confidence.
Step S104: and drawing an object detection frame for marking the target object on the image to be processed according to the object detection information of the target anchor frame.
Specifically, the step S104 includes:
and drawing the object prediction frame on the image to be processed according to the object position information in the object detection information of the target anchor frame and the center point position information of the target anchor frame.
Step S105: and drawing an element prediction frame which has a matching relation with the object prediction frame and is used for marking the target element on the image to be processed according to the element detection information of the target anchor frame.
Specifically, step S105 includes:
judging whether the element confidence coefficient in the element detection information of the target anchor frame is larger than a second preset threshold value or not;
if so, drawing the element prediction frame on the image to be processed according to the element position information in the element detection information of the target anchor frame and the center point position information of the target anchor frame;
if not, the element prediction frame is not drawn on the image to be processed.
It should be noted that, if the element confidence is not greater than the second preset threshold, this indicates that only the target object exists on the image to be processed and no target element exists; in this case only the object prediction frame is drawn, and the element prediction frame is not drawn.
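Steps S104 and S105 together can be sketched as follows. The dict keys (`obj_box`, `elem_box`, `elem_conf`) and the returned list of frames to draw are illustrative assumptions, not the patent's API; any drawing routine could consume the result.

```python
def prediction_frames_to_draw(target_anchor, second_threshold=0.5):
    """S104/S105 sketch: the object prediction frame is always drawn;
    the matched element prediction frame is drawn only when the element
    confidence exceeds the second preset threshold."""
    frames = [("object", target_anchor["obj_box"])]
    if target_anchor["elem_conf"] > second_threshold:
        frames.append(("element", target_anchor["elem_box"]))
    return frames
```

When the element confidence fails the threshold, only the object frame survives, matching the note above that an image may contain the target object without the target element.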
In the target detection mode provided by this embodiment, because the training data pairs each target object with its target element, the model learns the matching relationship between them, and the target detection model can therefore output two inference heads: an object prediction frame is obtained from the output of the object inference head, and the matched element prediction frame from the corresponding element inference head, so that object prediction frames and element prediction frames matched one to one are obtained.
Further, the target object is a human body and the target element is a human face; or the target object is a human body and the target element is a safety helmet; or the target object is a vehicle body and the target element is a vehicle hopper.
The target detection mode provided by this embodiment is applicable to scenes combining early-warning event detection with identity recognition; for example, after the target detection model detects the head of a person not wearing a safety helmet, the worker identity information corresponding to that head can be matched.
This embodiment provides a target detection model that can output multiple inference heads and can establish the matching relationship between detected objects during inference. The method is applicable to matching detection between the head and the body, and can be extended to matching detection between the body and any body part. It can also be applied to matching detection between other objects, such as the relationship between a muck truck and its truck head.
In this embodiment, the target object and the target element having a matching relationship can be identified from an image in one pass by the trained target detection model, so both detection accuracy and detection efficiency are high. Compared with a conventional target detection model that outputs only one inference head, the target detection model in this embodiment outputs two inference heads, namely object detection information and element detection information, which allows an object prediction frame and an element prediction frame having a matching relationship to be marked on the image to be processed, so the matching relationship between detected objects is established during inference.
Embodiment Two
The embodiment of the invention provides a target detection and matching device, as shown in fig. 5, which specifically comprises the following components:
the acquisition module 501 is configured to acquire an image to be processed, and input the image to be processed into a trained target detection model;
the calculating module 502 is configured to calculate anchor frame information of each anchor frame according to the target detection model; wherein the anchor frame information includes: object detection information for describing a target object and element detection information for describing a target element having a matching relationship;
a screening module 503, configured to screen out target anchor frames from all anchor frames according to the object detection information of each anchor frame;
a marking module 504, configured to draw an object detection frame for marking the target object on the image to be processed according to the object detection information of the target anchor frame;
and the matching module 505 is configured to draw an element prediction frame for marking the target element, which has a matching relationship with the object prediction frame, on the image to be processed according to the element detection information of the target anchor frame.
Specifically, the device further comprises a training module configured to:
acquiring an original image on which an object real frame for marking the target object is drawn;
generating N sub-original images from the original image according to the number N of object real frames in the original image; wherein each sub-original image retains, without repetition, the pixels within one object real frame, and the pixels outside the retained object real frame are set to black;
drawing an element real frame for marking the target element in each sub-original image in sequence, and establishing a matching relation between the object real frame and the element real frame in each sub-original image to obtain a sample image for model training;
training a preset model according to the sample image to obtain a target detection model for detecting the target object and the target element with the matching relation in the image.
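The sub-original-image generation described above can be sketched as follows. The (x1, y1, x2, y2) pixel-coordinate box format is an assumption for illustration.

```python
import numpy as np

def make_sub_originals(image, object_boxes):
    """Training-data sketch: an original image with N object real frames
    yields N sub-original images; each keeps the pixels inside exactly
    one object real frame and sets every other pixel to black."""
    subs = []
    for x1, y1, x2, y2 in object_boxes:
        sub = np.zeros_like(image)               # all pixels black
        sub[y1:y2, x1:x2] = image[y1:y2, x1:x2]  # restore one real frame
        subs.append(sub)
    return subs
```

Each returned sub-image would then be annotated with the element real frame matched to its single visible object real frame, yielding one unambiguous object/element pair per sample image.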
Further, the training module is specifically configured to:
forming sample data according to the position information of the object real frame, the position information of the element real frame and the matching relation between the object real frame and the element real frame in the sample image;
training a YOLOv5 model with double inference heads according to the sample data to obtain the target detection model;
wherein the sample data comprises: object tag information describing the target object and element tag information describing the target element having a matching relationship;
the object marking information includes: the target object category ID, the abscissa of the center point of the object real frame, the ordinate of the center point of the object real frame, the width of the object real frame and the height of the object real frame;
the element marking information includes: the target element category ID, the abscissa of the element real frame center point, the ordinate of the element real frame center point, the width of the element real frame, and the height of the element real frame.
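One sample-data record combining the object marking information with its matched element marking information might look like the sketch below. The field order follows the description above; the normalized [0, 1] coordinate convention and the single-line flattening are assumptions borrowed from the YOLO label-file format, not mandated by the patent.

```python
# A hypothetical matched pair: class IDs are examples only.
sample = {
    "object":  {"class_id": 0, "cx": 0.48, "cy": 0.55, "w": 0.20, "h": 0.60},
    "element": {"class_id": 1, "cx": 0.48, "cy": 0.28, "w": 0.08, "h": 0.10},
}

def to_annotation_line(record):
    """Flatten one matched object/element pair into a single label line:
    obj_class cx cy w h  elem_class cx cy w h."""
    fields = []
    for part in ("object", "element"):
        p = record[part]
        fields += [p["class_id"], p["cx"], p["cy"], p["w"], p["h"]]
    return " ".join(str(v) for v in fields)
```

Keeping both halves on one line preserves the matching relationship explicitly, which is what lets the dual-head YOLOv5 model learn paired predictions.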
Specifically, the object detection information includes: object location information and object confidence for characterizing the probability of the target object appearing within an anchor frame; wherein the object position information includes: a deviation value of an abscissa of the center point of the object prediction frame and an abscissa of the center point of the anchor frame, a deviation value of an ordinate of the center point of the object prediction frame and an ordinate of the center point of the anchor frame, a deviation value of a width of the object prediction frame and a width of the anchor frame, and a deviation value of a height of the object prediction frame and a height of the anchor frame;
the element detection information includes: element position information and element confidence for characterizing the probability of the target element occurring within an anchor frame; wherein the element position information includes: the deviation value of the abscissa of the element prediction frame center point and the abscissa of the anchor frame center point, the deviation value of the ordinate of the element prediction frame center point and the ordinate of the anchor frame center point, the deviation value of the width of the element prediction frame and the width of the anchor frame, and the deviation value of the height of the element prediction frame and the height of the anchor frame.
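Recovering a prediction frame from an anchor frame plus the four deviation values listed above can be sketched as follows. A purely additive offset scheme is assumed here for clarity; YOLOv5's actual decoding applies sigmoid and scaling to the raw head outputs.

```python
def decode_prediction_frame(anchor, deviations):
    """Combine an anchor frame (cx, cy, w, h) with the four deviation
    values (d_cx, d_cy, d_w, d_h) to obtain the prediction frame, again
    as (cx, cy, w, h).  The same routine serves both the object and the
    element prediction frames, since their position information has the
    same structure."""
    acx, acy, aw, ah = anchor
    dcx, dcy, dw, dh = deviations
    return (acx + dcx, acy + dcy, aw + dw, ah + dh)
```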
Further, the screening module 503 is specifically configured to:
screening one or more candidate anchor frames which have the object confidence coefficient larger than a first preset threshold value and are mutually adjacent from all anchor frames according to the object confidence coefficient of each anchor frame by a non-maximum suppression algorithm NMS, and merging all screened candidate anchor frames into the target anchor frame;
and forming the object detection information and the element detection information of the target anchor frame according to the object detection information and the element detection information of all the candidate anchor frames.
Further, the marking module 504 is specifically configured to:
and drawing the object prediction frame on the image to be processed according to the object position information in the object detection information of the target anchor frame and the center point position information of the target anchor frame.
Further, the matching module 505 is specifically configured to:
judging whether the element confidence coefficient in the element detection information of the target anchor frame is larger than a second preset threshold value or not;
if so, drawing the element prediction frame on the image to be processed according to the element position information in the element detection information of the target anchor frame and the center point position information of the target anchor frame;
if not, the element prediction frame is not drawn on the image to be processed.
Further, the target object is a human body and the target element is a human face; or the target object is a human body and the target element is a safety helmet; or the target object is a vehicle body and the target element is a vehicle hopper.
In this embodiment, the target object and the target element having a matching relationship can be identified from an image in one pass by the trained target detection model, so both detection accuracy and detection efficiency are high. Compared with a conventional target detection model that outputs only one inference head, the target detection model in this embodiment outputs two inference heads, namely object detection information and element detection information, which allows an object prediction frame and an element prediction frame having a matching relationship to be marked on the image to be processed, so the matching relationship between detected objects is established during inference.
Embodiment Three
This embodiment also provides a computer device, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster formed by a plurality of servers) that can execute a program. As shown in Fig. 6, the computer device 60 of this embodiment includes at least, but is not limited to: a memory 601 and a processor 602, which are communicably connected to each other via a system bus. It should be noted that Fig. 6 only shows the computer device 60 with components 601-602, but it should be understood that not all of the illustrated components are required to be implemented, and that more or fewer components may be implemented instead.
In this embodiment, the memory 601 (i.e., a readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a random access memory (RAM), a static random access memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 601 may be an internal storage unit of the computer device 60, such as a hard disk or memory of the computer device 60. In other embodiments, the memory 601 may also be an external storage device of the computer device 60, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 60. Of course, the memory 601 may also include both an internal storage unit and an external storage device of the computer device 60. In this embodiment, the memory 601 is typically used to store the operating system and various types of application software installed on the computer device 60. In addition, the memory 601 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 602 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 602 is generally used to control the overall operation of the computer device 60.
Specifically, in the present embodiment, the processor 602 is configured to execute a program of an object detection and matching method stored in the memory 601, where the program of the object detection and matching method is executed to implement the following steps:
acquiring an image to be processed, and inputting the image to be processed into a trained target detection model;
calculating anchor frame information of each anchor frame through the target detection model; wherein the anchor frame information includes: object detection information for describing a target object and element detection information for describing a target element having a matching relationship;
screening target anchor frames from all anchor frames according to the object detection information of each anchor frame;
drawing an object detection frame for marking the target object on the image to be processed according to the object detection information of the target anchor frame;
and drawing an element prediction frame which has a matching relation with the object prediction frame and is used for marking the target element on the image to be processed according to the element detection information of the target anchor frame.
For the specific implementation of the above method steps, reference may be made to Embodiment One; the description is not repeated here.
Embodiment Four
The present embodiment also provides a computer readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., having stored thereon a computer program that when executed by a processor performs the following method steps:
acquiring an image to be processed, and inputting the image to be processed into a trained target detection model;
calculating anchor frame information of each anchor frame through the target detection model; wherein the anchor frame information includes: object detection information for describing a target object and element detection information for describing a target element having a matching relationship;
screening target anchor frames from all anchor frames according to the object detection information of each anchor frame;
drawing an object detection frame for marking the target object on the image to be processed according to the object detection information of the target anchor frame;
and drawing an element prediction frame which has a matching relation with the object prediction frame and is used for marking the target element on the image to be processed according to the element detection information of the target anchor frame.
For the specific implementation of the above method steps, reference may be made to Embodiment One; the description is not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the method of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, and of course may also be implemented by means of hardware; in many cases, however, the former is the preferred implementation.
The foregoing description covers only preferred embodiments of the present invention and is not intended to limit the scope of the invention; any equivalent structural or process transformation made using the contents of this description, applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims (10)

1. A method for target detection and matching, the method comprising:
acquiring an image to be processed, and inputting the image to be processed into a trained target detection model;
calculating anchor frame information of each anchor frame through the target detection model; wherein the anchor frame information includes: object detection information for describing a target object and element detection information for describing a target element having a matching relationship;
screening target anchor frames from all anchor frames according to the object detection information of each anchor frame;
drawing an object detection frame for marking the target object on the image to be processed according to the object detection information of the target anchor frame;
and drawing an element prediction frame which has a matching relation with the object prediction frame and is used for marking the target element on the image to be processed according to the element detection information of the target anchor frame.
2. The target detection and matching method according to claim 1, wherein before the capturing the image to be processed and inputting the image to be processed into the trained target detection model, the method further comprises:
acquiring an original image on which an object real frame for marking the target object is drawn;
generating N sub-original images from the original image according to the number N of object real frames in the original image; wherein each sub-original image retains, without repetition, the pixels within one object real frame, and the pixels outside the retained object real frame are set to black;
drawing an element real frame for marking the target element in each sub-original image in sequence, and establishing a matching relation between the object real frame and the element real frame in each sub-original image to obtain a sample image for model training;
training a preset model according to the sample image to obtain a target detection model for detecting the target object and the target element with the matching relation in the image.
3. The method according to claim 2, wherein training a preset model according to the sample image to obtain a target detection model for detecting the target object and the target element having a matching relationship in an image comprises:
forming sample data according to the position information of the object real frame, the position information of the element real frame and the matching relation between the object real frame and the element real frame in the sample image;
training a YOLOv5 model with double inference heads according to the sample data to obtain the target detection model;
wherein the sample data comprises: object tag information describing the target object and element tag information describing the target element having a matching relationship;
the object marking information includes: the target object category ID, the abscissa of the center point of the object real frame, the ordinate of the center point of the object real frame, the width of the object real frame and the height of the object real frame;
the element marking information includes: the target element category ID, the abscissa of the element real frame center point, the ordinate of the element real frame center point, the width of the element real frame, and the height of the element real frame.
4. The target detection and matching method according to claim 1, wherein the object detection information includes: object location information and object confidence for characterizing the probability of the target object appearing within an anchor frame; wherein the object position information includes: a deviation value of an abscissa of the center point of the object prediction frame and an abscissa of the center point of the anchor frame, a deviation value of an ordinate of the center point of the object prediction frame and an ordinate of the center point of the anchor frame, a deviation value of a width of the object prediction frame and a width of the anchor frame, and a deviation value of a height of the object prediction frame and a height of the anchor frame;
the element detection information includes: element position information and element confidence for characterizing the probability of the target element occurring within an anchor frame; wherein the element position information includes: the deviation value of the abscissa of the element prediction frame center point and the abscissa of the anchor frame center point, the deviation value of the ordinate of the element prediction frame center point and the ordinate of the anchor frame center point, the deviation value of the width of the element prediction frame and the width of the anchor frame, and the deviation value of the height of the element prediction frame and the height of the anchor frame.
5. The method for detecting and matching objects as claimed in claim 4, wherein the step of screening out the object anchor frames from all anchor frames according to the object detection information of each anchor frame comprises:
screening one or more candidate anchor frames which have the object confidence coefficient larger than a first preset threshold value and are mutually adjacent from all anchor frames according to the object confidence coefficient of each anchor frame by a non-maximum suppression algorithm NMS, and merging all screened candidate anchor frames into the target anchor frame;
and forming the object detection information and the element detection information of the target anchor frame according to the object detection information and the element detection information of all the candidate anchor frames.
6. The method for detecting and matching a target according to claim 5, wherein drawing an object detection frame for marking the target object on the image to be processed according to the object detection information of the target anchor frame, comprises:
and drawing the object prediction frame on the image to be processed according to the object position information in the object detection information of the target anchor frame and the center point position information of the target anchor frame.
7. The method according to claim 5, wherein drawing an element prediction frame for marking the target element on the image to be processed, which has a matching relationship with the object prediction frame, based on the element detection information of the target anchor frame, comprises:
judging whether the element confidence coefficient in the element detection information of the target anchor frame is larger than a second preset threshold value or not;
if so, drawing the element prediction frame on the image to be processed according to the element position information in the element detection information of the target anchor frame and the center point position information of the target anchor frame;
if not, the element prediction frame is not drawn on the image to be processed.
8. The target detection and matching method according to any one of claims 1 to 7, wherein the target object is a human body and the target element is a human face; or the target object is a human body and the target element is a safety helmet; or the target object is a vehicle body and the target element is a vehicle hopper.
9. A target detection and matching device, the device comprising:
the acquisition module is used for acquiring an image to be processed and inputting the image to be processed into a trained target detection model;
the calculation module is used for calculating anchor frame information of each anchor frame through the target detection model; wherein the anchor frame information includes: object detection information for describing a target object and element detection information for describing a target element having a matching relationship;
the screening module is used for screening target anchor frames from all anchor frames according to the object detection information of each anchor frame;
the marking module is used for drawing an object detection frame for marking the target object on the image to be processed according to the object detection information of the target anchor frame;
and the matching module is used for drawing an element prediction frame which has a matching relation with the object prediction frame and is used for marking the target element on the image to be processed according to the element detection information of the target anchor frame.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 8.
CN202310678789.5A 2023-06-08 2023-06-08 Target detection and matching method, device and readable storage medium Pending CN116740398A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310678789.5A CN116740398A (en) 2023-06-08 2023-06-08 Target detection and matching method, device and readable storage medium


Publications (1)

Publication Number Publication Date
CN116740398A true CN116740398A (en) 2023-09-12

Family

ID=87909093

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310678789.5A Pending CN116740398A (en) 2023-06-08 2023-06-08 Target detection and matching method, device and readable storage medium

Country Status (1)

Country Link
CN (1) CN116740398A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination