CN114187439A - Anchor frame setting method and device based on image area and storage medium - Google Patents

Anchor frame setting method and device based on image area and storage medium

Info

Publication number
CN114187439A
Authority
CN
China
Prior art keywords
image
detected
size
target
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111429475.9A
Other languages
Chinese (zh)
Inventor
孟健
郁淑聪
郝斌
王镭
鹿宁宁
王馨
朱观宏
高少杰
李亚楠
张渤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sinotruk Data Co ltd
China Automotive Technology and Research Center Co Ltd
Automotive Data of China Tianjin Co Ltd
Original Assignee
Sinotruk Data Co ltd
China Automotive Technology and Research Center Co Ltd
Automotive Data of China Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sinotruk Data Co ltd, China Automotive Technology and Research Center Co Ltd, Automotive Data of China Tianjin Co Ltd filed Critical Sinotruk Data Co ltd
Priority to CN202111429475.9A priority Critical patent/CN114187439A/en
Publication of CN114187439A publication Critical patent/CN114187439A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses an anchor frame setting method, device and storage medium based on image areas, relating to the technical field of image detection. The method comprises the following steps: counting, over a data set, the bounding-box sizes of the targets to be detected in each image area, and clustering the bounding-box sizes of each image area separately to obtain the anchor frame sizes of each image area; acquiring a feature map obtained by feature extraction from an image to be detected; dividing the feature map into a plurality of sub-feature maps corresponding to different image areas and convolving each sub-feature map to obtain a direct prediction result for each image area; and obtaining the prediction-frame sizes of each image area from the anchor frame sizes and the size scaling ratios of that area. The embodiment sets anchor frames of different sizes for different image areas, which enhances edge recognition and reduces the number of invalid anchor frames.

Description

Anchor frame setting method and device based on image area and storage medium
Technical Field
The embodiment of the invention relates to image detection technology, and in particular to an anchor frame setting method, device and storage medium based on image areas.
Background
In object detection tasks, identifying objects of different areas and sizes in an image is a common difficulty. It is usually addressed with anchor frames: a number of typical rectangular frames are obtained, by manual selection or by clustering, from the actual sizes of the targets to be detected, and these frames are placed at a number of fixed points in the image; during detection, the object detection model applies small corrections to the preset rectangular frames, so that a more accurate detection result can be obtained.
In practical application scenes, many targets to be detected appear at the edge of the image. With the existing anchor frame setting method, a large number of anchor frames extend beyond the image boundary, which increases the number of invalid anchor frames; moreover, target recognition at the image edge is relatively poor.
Disclosure of Invention
The embodiment of the invention provides an anchor frame setting method, device and storage medium based on image areas, aiming to set anchor frames of different sizes for different image areas, enhance edge recognition and reduce the number of invalid anchor frames.
In a first aspect, an embodiment of the present invention provides an anchor frame setting method based on an image area, including:
counting, over a data set, the bounding-box size of each target to be detected in each image area, and clustering the bounding-box sizes of each image area separately to obtain the anchor frame sizes of each image area; the data set comprises a plurality of images;
acquiring a feature map obtained by extracting features of an image to be detected; the image to be detected comprises a plurality of non-overlapping image areas;
dividing the feature map into a plurality of sub-feature maps corresponding to different image areas, and convolving each sub-feature map to obtain a direct prediction result for each image area, wherein the direct prediction result at least comprises, for each image cell unit, the size scaling ratio of the offset prediction frame relative to the anchor frame;
and obtaining the size of a prediction frame of each image area according to the size of the anchor frame and the size scaling ratio of each image area, wherein the size of the prediction frame is used for predicting the real size of the target to be detected.
In a second aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
a memory for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the image area-based anchor frame setting method of any embodiment.
In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the image area-based anchor frame setting method according to any embodiment.
According to the embodiment of the invention, a target to be detected presents different shape characteristics in different image areas because it may be incompletely displayed; clustering therefore yields different anchor frame sizes for different image areas. Because each anchor frame size conforms to the shape characteristics of the targets in its area, few anchor frames extend beyond the image. Convolving the sub-feature maps corresponding to the different image areas and then applying the anchor frame sizes of the corresponding areas yields prediction-frame sizes that match how targets appear in each area, so the position of the target in each image area can be predicted accurately.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of an anchor frame setting method based on image areas according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a bounding box of an object to be detected according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a bounding box of another object to be detected according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a feature map provided by an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc., indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, and are only for convenience of description and simplicity of description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "connected" are to be construed broadly, e.g., as meaning either a fixed connection, a removable connection, or an integral connection; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meanings of the above terms in the present invention can be understood in specific cases to those skilled in the art.
An embodiment of the present invention provides an anchor frame setting method based on an image region, and a flowchart thereof is shown in fig. 1, which is applicable to a case where an anchor frame and a prediction frame are set in an image. The present embodiment is performed by an electronic device. With reference to fig. 1, the method provided in this embodiment specifically includes:
s110, counting the size of the boundary frame of the target to be detected in each image area in a data centralization mode, and clustering the size of the boundary frame of the target to be detected in each image area respectively to obtain the size of the anchor frame in each image area.
The data set comprises a plurality of images, the plurality of images are the same in size and consistent with the image to be detected, and the target to be detected is displayed in each of the plurality of images. The present embodiment does not limit the kind of the target to be detected, and may be a person, a vehicle, an object, or the like.
Optionally, a bounding box of the target to be detected is drawn on each image of the data set by a manual labeling method or a target recognition method. Fig. 2 is a schematic diagram of a bounding box of an object to be detected according to an embodiment of the present invention. The bounding box is rectangular, with one bounding box uniquely represented by its center position (indicated by a dark dot), width and height.
The images in the data set and the later image to be detected are divided into a plurality of image areas in advance by the same method; in other words, the image areas are consistent across the data-set images and the image to be detected. The image areas do not overlap each other. Note that the image areas are only a division by size and position; the image itself is not cropped. Referring to fig. 2, the image areas include an edge area and a middle area, the middle area being outlined with a dashed line. A target appearing in the middle area is displayed completely, and the width and height of its bounding box are similar; a target appearing in the edge area is displayed incompletely, and the width and height of its bounding box differ greatly. It is therefore necessary to count bounding-box sizes separately for each image area.
In an alternative embodiment, the first step is to acquire the center position of the bounding box of each target to be detected in the plurality of images, and to group the targets by the image area in which their center positions fall. Optionally, referring to fig. 2, a target (the cat) whose center position lies in the edge area is counted as appearing in the edge area, and targets (the cat and the dog) whose center positions lie anywhere in the image are counted as appearing in the middle area. Preferably, on the basis of fig. 2, fig. 3 is a schematic diagram of bounding boxes of another object to be detected according to an embodiment of the present invention. The edge areas include left and right edge areas, upper and lower edge areas, and corner areas. Any image in the data set, or the image to be detected, is uniformly divided into 9 grids numbered 1 to 9 in sequence. Grids 1, 3, 7 and 9 are corner areas, grids 4 and 6 are the left and right edge areas, grids 2 and 8 are the upper and lower edge areas, and grid 5 is the middle area.
For the data set, count which grid the bounding-box center position of each target to be detected falls in, and divide the targets into the following 4 groups by grid number.
A first group: targets whose center positions fall in the left or right edge areas. These targets appear at the left or right edge of the image and are considered incompletely displayed. Characteristic: the bounding box is narrower than it is tall, and its overall size is small.
Second group: targets whose center positions fall in the upper or lower edge areas. These targets appear at the upper or lower edge of the image and are considered incompletely displayed. Characteristic: the bounding box is wider than it is tall, since its height is truncated, and its overall size is small.
Third group: targets whose center positions fall in a corner area. These targets appear at the four corners of the image and are considered incompletely displayed. Characteristic: the width and height of the bounding box are close to each other, and both are small.
Fourth group: targets whose center positions fall anywhere in the image are counted as appearing in the middle area. Targets in the middle of the image are considered completely displayed. Since this embodiment does not limit the kind of target, and to make the anchor frame setting suitable for all kinds of targets, a target of any shape may appear in the middle area; therefore, the targets in all areas of the image are counted as targets appearing in the middle area. The bounding boxes of this group may be large or small, tall or short, wide or narrow.
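The grid division and grouping above can be sketched in code. This is an illustrative reading of the description, not the patent's own implementation; all function names are hypothetical.

```python
# Hypothetical sketch of the 3x3 grid division and the four region groups
# described above; names and signatures are illustrative only.

def grid_cell(cx, cy, img_w, img_h):
    """Return the grid number (1-9, row by row) containing point (cx, cy)."""
    col = min(int(cx / img_w * 3), 2)  # clamp so cx == img_w stays in column 2
    row = min(int(cy / img_h * 3), 2)
    return row * 3 + col + 1

def region_group(cell):
    """Map a grid number to its region group (the fourth group is handled
    separately: it collects every target regardless of grid)."""
    if cell in (1, 3, 7, 9):
        return "corner"
    if cell in (4, 6):
        return "left_right_edge"
    if cell in (2, 8):
        return "top_bottom_edge"
    return "middle"
```

Per the fourth group above, the middle area additionally collects the bounding boxes of all targets in the image, not only those whose centers fall in grid 5.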
The second step is to count the size of the bounding box of each target to be detected appearing in each image area; the size comprises the pixel-level width and height of the bounding box.
The third step is to cluster the bounding-box sizes of each image area separately, obtaining the anchor frame sizes of each image area.
Optionally, the bounding-box sizes in each group are clustered with the K-means clustering algorithm to obtain a plurality of clusters, and the center point of each cluster is taken as an anchor frame size.
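As a sketch of this clustering step, the snippet below runs a plain k-means over the (width, height) pairs of one group. The K-means variant, distance metric and initialization used by the patent are not specified, so this is only an assumed illustration (Euclidean distance, mean update).

```python
import random

def kmeans_wh(sizes, k, iters=50, seed=0):
    """Plain k-means on (width, height) pairs; the returned cluster centres
    serve as the anchor frame sizes for one image region."""
    rng = random.Random(seed)
    centres = rng.sample(sizes, k)          # initial centres: k data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in sizes:                  # assign each box to nearest centre
            d = [(w - cw) ** 2 + (h - ch) ** 2 for cw, ch in centres]
            clusters[d.index(min(d))].append((w, h))
        for i, c in enumerate(clusters):    # update each centre to its mean
            if c:
                centres[i] = (sum(w for w, _ in c) / len(c),
                              sum(h for _, h in c) / len(c))
    return sorted(centres)
```

For example, two well-separated groups of box sizes yield two centres near the group means, which would then be used as that region's anchor frame sizes.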
It should be noted that clustering each group actually clusters the bounding box size in each image region. When the size of the boundary frame in the middle area is clustered, the size of the boundary frame of the target to be detected with the center position appearing in the image is clustered.
For ease of description and distinction, an anchor frame size comprises the width and height of the anchor frame; one anchor frame size is one (width, height) pair. The number of anchor frame sizes may be the same or different between image areas. For example, the left/right edge areas and the upper/lower edge areas each have 3 anchor frame sizes, the corner areas have 2, and the middle area has 3.
S120, acquiring a feature map obtained by feature extraction from the image to be detected.
The present embodiment does not limit the method for feature extraction, and at least one layer of convolution may be used for feature extraction.
S130, dividing the feature map into a plurality of sub-feature maps corresponding to different image areas, and performing convolution processing on the sub-feature maps to obtain a direct prediction result of each image area.
Fig. 4 is a schematic diagram of a feature map provided by an embodiment of the invention. The feature values in the feature map have a one-to-one correspondence with pixel blocks in the image, and the feature map can be divided in the same manner as the image. For example, dividing the feature map into nine squares yields nine sub-feature maps, each corresponding to an image region at a corresponding position in fig. 3.
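A minimal sketch of the nine-square division of the feature map, assuming a square feature map whose side is divisible by 3; the actual division in the patent follows the image-area layout of fig. 3, and the list-of-lists representation here is illustrative only.

```python
def split_nine(fmap):
    """Split a square feature map (list of rows) into nine equal sub-maps,
    returned row-major as a list indexed 0-8 (regions 1-9 in fig. 3)."""
    n = len(fmap)
    s = n // 3  # side length of each sub-map
    subs = []
    for r in range(3):
        for c in range(3):
            subs.append([row[c * s:(c + 1) * s] for row in fmap[r * s:(r + 1) * s]])
    return subs
```

Each sub-map then goes through the convolution layer of its own image area, as described next.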
Continuing with fig. 4, a corresponding convolution layer is set for each sub-feature map, and the sub-feature maps pass through their corresponding convolution layers in parallel; the convolution operations yield the direct prediction result of each image area. The direct prediction result at least comprises, for each image cell unit (cell), the size scaling ratio (t_h, t_w) of the offset prediction frame relative to the anchor frame, the offset (t_x, t_y) of the prediction frame relative to the anchor frame within the cell, the confidence, and the category. This embodiment does not change the offset and confidence calculations, and concerns only the sizes of the anchor frame and the prediction frame.
S140, obtaining the size of the prediction frame of each image area according to the size of the anchor frame and the size scaling ratio of each image area, wherein the size of the prediction frame is used for predicting the real size of the target to be detected.
In this step, the anchor frame size is combined with the size scaling ratio from the direct prediction result of the same image area. Referring to the following formulas, the prediction-frame size is expressed in terms of the anchor frame size.
b_w = p_w · e^(t_w)

b_h = p_h · e^(t_h)

wherein t_h is the scaling ratio of the height of the prediction frame relative to the height of the anchor frame, and t_w is the scaling ratio of the width of the prediction frame relative to the width of the anchor frame; p_w and p_h are the width and height of the anchor frame, and b_w and b_h are the width and height of the prediction frame, respectively.
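This decoding step, in which the predicted ratios scale the anchor size exponentially, follows the convention of common anchor-based detection heads (YOLO-style); a sketch with the same symbols, assuming that is the intended reading:

```python
import math

def decode_box_size(p_w, p_h, t_w, t_h):
    """Apply b_w = p_w * e^(t_w) and b_h = p_h * e^(t_h): scale this image
    area's anchor frame by the predicted size scaling ratios."""
    return p_w * math.exp(t_w), p_h * math.exp(t_h)
```

With t_w = t_h = 0 the prediction frame equals the anchor frame, which is why anchors that already match the region's box shapes need only small corrections.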
Optionally, after the prediction frames of each image area are obtained, frame regression and the IoU algorithm are used to screen the prediction frames, and the real size of the target to be detected is obtained by prediction.
Preferably, after obtaining the prediction-frame sizes of each image area from the anchor frame sizes and size scaling ratios, the method further includes: merging the prediction-frame sizes of the plurality of image areas according to the positions of the image areas; the merged prediction-frame sizes are used to predict the real size of the target to be detected.
Taking the image areas shown in fig. 3 as an example, 4 groups of output results, i.e. prediction-frame sizes, are obtained for the 4 image areas; merging the 4 groups according to the positions of the image areas gives all prediction-frame sizes of the whole image to be detected, so the subsequent prediction of the real target size can be done in one pass without per-area prediction.
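The merging step can be sketched as a simple concatenation keyed by region. The region ordering below is illustrative, since the patent only requires that the per-area results be combined according to image-area position.

```python
def merge_region_boxes(boxes_by_region):
    """Concatenate the prediction-frame size lists of all image areas into
    one list for the whole image, so later screening runs in one pass."""
    merged = []
    for region in sorted(boxes_by_region):  # deterministic area order
        merged.extend(boxes_by_region[region])
    return merged
```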
On the basis of the above embodiment, the number of anchor frame sizes may differ between image areas. To improve the clustering effect, different cluster counts (2, 3, 4, ...) are traversed, the inter-cluster distances of the clusters obtained under each count are evaluated, and the cluster count with the largest inter-cluster distance is selected as the number of anchor frame sizes. A number chosen this way fully reflects the distribution of bounding-box sizes and characterizes them accurately; the numbers obtained for different image areas are generally different.
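One way to read the cluster-count selection above: score each candidate clustering by its minimum pairwise centre distance and keep the count that maximizes it. The exact inter-cluster distance used by the patent is not specified, so this sketch assumes minimum Euclidean centre-to-centre distance.

```python
def best_cluster_count(centres_by_k):
    """Given {k: list of (w, h) cluster centres}, return the k whose
    clustering has the largest minimum inter-cluster distance."""
    def min_pair_dist(cs):
        return min(((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
                   for i, a in enumerate(cs) for b in cs[i + 1:])
    return max(centres_by_k, key=lambda k: min_pair_dist(centres_by_k[k]))
```

Packing more clusters into the same size range shrinks the gaps between centres, so this criterion favours the smallest count that still separates the box-size modes.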
Within one image area, the anchor frame sizes and the offset prediction frames in each image cell unit of the direct prediction result correspond one-to-one, and the prediction-frame sizes are obtained by multiplying them pairwise. The number of size scaling ratios in the direct prediction result of each image area must therefore be controlled; optionally, it is controlled through the number of convolution kernels. Specifically, convolving the sub-feature maps to obtain the direct prediction result of each image area includes: determining the number of convolution kernels from the number of anchor frame sizes of the image area corresponding to each sub-feature map; convolving each sub-feature map with the corresponding number of convolution kernels to obtain the direct prediction result, which at least comprises, for each image cell unit, the size scaling ratio of the offset prediction frame relative to the anchor frame; the number of offset prediction frames within each image cell unit equals the number of anchor frame sizes.
Illustratively, the middle area has 3 anchor frame sizes, the corresponding sub-feature map has size 3 × 3, and the number of categories is 80. The sub-feature map is passed through 3 × (80 + 5) = 255 convolution kernels, where 3 is the number of anchor frame sizes, yielding a direct prediction result that is a feature map containing 3 × 3 cell units (cells). Each cell includes 3 size scaling ratios of the prediction frame relative to the anchor frame. Within the same image area (here, the middle area), each size scaling ratio in each cell is multiplied by the corresponding anchor frame size to obtain the prediction-frame sizes of the area. Finally, 3 × 3 × 3 = 27 prediction-frame sizes are obtained for the middle area.
Illustratively, the upper and lower edge areas have 3 anchor frame sizes, the number of categories is 80, and the upper and lower edge sub-feature maps each have size 3 × 3. The two sub-feature maps are merged into a 3 × 6 feature map and passed through 3 × (80 + 5) = 255 convolution kernels, where 3 is the number of anchor frame sizes, yielding a direct prediction result that is a feature map containing 3 × 6 cell units (cells). Each cell includes 3 size scaling ratios of the prediction frame relative to the anchor frame. Within the same image area (here, the upper and lower edge areas), each size scaling ratio in each cell is multiplied by the corresponding anchor frame size to obtain the prediction-frame sizes of the area. Finally, 3 × 6 × 3 = 54 prediction-frame sizes are obtained for the upper and lower edge areas.
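The two worked examples reduce to the counting rules below (per-anchor channels = 4 box terms + 1 confidence + class scores), shown as a sketch:

```python
def conv_kernel_count(num_anchor_sizes, num_classes):
    """Convolution kernels needed by one region's prediction head: each
    anchor size predicts (t_x, t_y, t_w, t_h, confidence) plus class scores."""
    return num_anchor_sizes * (num_classes + 5)

def prediction_frame_count(cells_h, cells_w, num_anchor_sizes):
    """Prediction frames produced over one region's grid of cell units."""
    return cells_h * cells_w * num_anchor_sizes
```

These reproduce the middle-area figures (255 kernels, 27 prediction frames) and the upper/lower-edge figure (54 prediction frames) given above.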
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 5, the electronic device includes a processor 40, a memory 41, an input device 42, and an output device 43; the number of processors 40 in the device may be one or more, and one processor 40 is taken as an example in fig. 5; the processor 40, the memory 41, the input device 42 and the output device 43 in the apparatus may be connected by a bus or other means, which is exemplified in fig. 5.
The memory 41 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and modules, such as program instructions/modules corresponding to the image area-based anchor frame setting method in the embodiment of the present invention. The processor 40 executes various functional applications of the device and data processing, i.e., implements the image area-based anchor frame setting method described above, by running software programs, instructions, and modules stored in the memory 41.
The memory 41 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 41 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 41 may further include memory located remotely from processor 40, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 42 is operable to receive input numeric or character information and to generate key signal inputs relating to user settings and function controls of the apparatus. The output device 43 may include a display device such as a display screen.
The embodiment of the invention also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the image area-based anchor frame setting method of any embodiment.
Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. An anchor frame setting method based on an image area is characterized by comprising the following steps:
counting, in a data set, the bounding box sizes of the targets to be detected in each image area, and clustering the bounding box sizes of the targets to be detected in each image area respectively to obtain the anchor frame sizes in each image area; wherein the data set comprises a plurality of images;
acquiring a feature map obtained by extracting features of an image to be detected; the image to be detected comprises a plurality of non-overlapping image areas;
dividing the feature map into a plurality of sub-feature maps corresponding to different image areas, and respectively performing convolution processing on the plurality of sub-feature maps to obtain a direct prediction result of each image area, wherein the direct prediction result at least comprises the size scaling ratio, relative to the anchor frame, of the offset prediction frame in each image cell unit;
and obtaining the size of a prediction frame of each image area according to the size of the anchor frame and the size scaling ratio of each image area, wherein the size of the prediction frame is used for predicting the real size of the target to be detected.
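As an illustrative sketch of the first step of claim 1 (not part of the patent), the per-region anchor frame sizes can be derived by clustering the (width, height) pairs of the bounding boxes observed in each image area. Everything below is hypothetical: the patent does not fix the clustering algorithm, and practical pipelines often cluster with an IoU-based distance rather than the Euclidean distance used here to keep the sketch dependency-free.

```python
import random

def kmeans_anchor_sizes(box_sizes, k, iters=50, seed=0):
    """Cluster (width, height) box sizes into k anchor sizes with plain
    k-means on Euclidean distance (an IoU-based distance is common in
    practice; Euclidean keeps this sketch self-contained)."""
    rng = random.Random(seed)
    centers = rng.sample(box_sizes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for w, h in box_sizes:
            nearest = min(range(k),
                          key=lambda c: (w - centers[c][0]) ** 2
                                        + (h - centers[c][1]) ** 2)
            clusters[nearest].append((w, h))
        new_centers = []
        for i, members in enumerate(clusters):
            if members:
                new_centers.append((sum(w for w, _ in members) / len(members),
                                    sum(h for _, h in members) / len(members)))
            else:  # keep an empty cluster's previous center
                new_centers.append(centers[i])
        centers = new_centers
    return sorted(centers)

# Hypothetical per-region statistics gathered from a data set:
# smaller boxes dominate the edges, larger boxes the middle.
region_boxes = {
    "edge":   [(10, 20), (12, 22), (11, 19), (30, 40), (32, 42)],
    "middle": [(80, 90), (82, 92), (120, 130), (118, 132)],
}
anchors = {region: kmeans_anchor_sizes(boxes, k=2)
           for region, boxes in region_boxes.items()}
```

Each image area thus receives its own anchor set, sized to the objects that actually appear there, instead of one global anchor set shared by the whole image.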
2. The method of claim 1, wherein the counting, in the data set, of the bounding box sizes of the targets to be detected in each image area comprises:
acquiring the center positions of the bounding boxes of the targets to be detected in the plurality of images, and summarizing the targets to be detected appearing in each image area based on the center positions of their bounding boxes;
and counting the bounding box sizes of the targets to be detected appearing in each image area.
3. The method of claim 1, wherein the plurality of image regions comprises an edge region and a middle region;
the summarizing of the targets to be detected appearing in each image area based on the central position of the targets to be detected comprises the following steps:
taking the target to be detected with the center position appearing in the edge area as the target to be detected appearing in the edge area;
and taking the target to be detected with the center position appearing in the middle area as the target to be detected appearing in the middle area.
4. The method of claim 3, wherein the edge regions comprise left and right edge regions, upper and lower edge regions, and corner regions;
the step of taking the target to be detected with the center position appearing in the edge area as the target to be detected appearing in the edge area comprises the following steps:
taking the target to be detected with the center position appearing in the left edge area and the right edge area as the target to be detected appearing in the left edge area and the right edge area;
taking the target to be detected with the center position appearing in the upper and lower edge regions as the target to be detected appearing in the upper and lower edge regions;
and taking the target to be detected with the center position appearing in the corner region as the target to be detected appearing in the corner region.
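The region assignment of claims 3 and 4 can be sketched as a small helper function. This is illustrative only: the patent does not specify how wide the edge bands are, so the 10% margin below is an assumed parameter, and the region labels are hypothetical names.

```python
def region_of_center(cx, cy, img_w, img_h, margin=0.1):
    """Assign a bounding-box center (cx, cy) to an image area.
    The margin defining the edge bands (10% of each dimension here)
    is an illustrative choice, not fixed by the patent."""
    mx, my = img_w * margin, img_h * margin
    at_lr = cx < mx or cx > img_w - mx   # within the left/right edge band
    at_tb = cy < my or cy > img_h - my   # within the top/bottom edge band
    if at_lr and at_tb:
        return "corner"
    if at_lr:
        return "left_right_edge"
    if at_tb:
        return "top_bottom_edge"
    return "middle"

# For a 640x480 image, a center near the top-left lands in a corner:
label = region_of_center(5, 5, 640, 480)  # → "corner"
```

An object is counted toward exactly one region, so the per-region box statistics used for clustering are disjoint.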
5. The method according to claim 1, wherein the convolving the sub-feature maps to obtain the direct prediction result of each image region comprises:
determining the number of convolution kernels according to the number of the sizes of the anchor frames in the image area corresponding to each sub-feature map;
performing convolution on the plurality of sub-feature maps with the corresponding number of convolution kernels respectively to obtain the direct prediction results, wherein the direct prediction results at least comprise the size scaling ratio of the offset prediction frame in each image cell unit relative to the anchor frame; the number of offset prediction frames within each image cell unit is consistent with the number of anchor frame sizes.
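Claim 5 ties the width of each region's convolution head to the number of anchor sizes clustered for that region. Assuming a YOLO-style layout of four box attributes plus an objectness score and per-class scores for each anchor (an assumption; the patent only requires the size scaling ratios to be among the outputs), the per-region output channel count would be:

```python
def head_out_channels(num_anchors, num_classes):
    # Per anchor: tx, ty (center offsets), tw, th (size scaling ratios),
    # an objectness score, and one score per class -- a YOLO-style
    # layout assumed here purely for illustration.
    return num_anchors * (4 + 1 + num_classes)

# A region whose clustering yielded 3 anchor sizes, with 80 classes:
channels = head_out_channels(3, 80)  # → 255
```

A region with fewer clustered anchor sizes therefore gets a narrower head, which is how the per-region anchor count propagates into the network architecture.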
6. The method of claim 5, wherein the offset prediction frame in each image cell unit has a one-to-one correspondence with the anchor frame size in the same image region;
the obtaining of the size of the prediction frame of each image region according to the anchor frame size and the size scaling ratio of each image region comprises:
in the same image area, multiplying the size scaling ratio of each offset prediction frame in each image cell unit by the corresponding anchor frame size to obtain the prediction frame sizes of each image area.
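A minimal sketch of the claim 6 decoding step (illustrative; the tuple layouts and function name are assumed, not taken from the patent):

```python
def decode_sizes(anchor_sizes, scale_ratios):
    """Claim 6: each offset prediction frame in a cell corresponds
    one-to-one to an anchor frame size of the same image area; its
    predicted size is the anchor size times the predicted scaling ratio."""
    return [(aw * sw, ah * sh)
            for (aw, ah), (sw, sh) in zip(anchor_sizes, scale_ratios)]

# Two anchors in one region, with per-anchor predicted scaling ratios:
sizes = decode_sizes([(30, 40), (120, 130)], [(1.5, 0.5), (0.5, 2.0)])
# → [(45.0, 20.0), (60.0, 260.0)]
```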
7. The method of claim 1, further comprising, after said deriving the size of the prediction frame for each image region from the anchor frame size and size scaling ratio for each image region:
merging the sizes of the prediction frames of the plurality of image areas according to the positions of the image areas;
and the sizes of the combined prediction frames are used for predicting the real size of the target to be detected.
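Claim 7's merging step could be sketched as follows. Treating "merging according to the positions of the image areas" as translating each region's boxes by that region's top-left offset into image coordinates is one plausible reading, not the patent's mandated implementation; the tuple layouts are assumed.

```python
def merge_region_predictions(region_preds):
    """Gather per-region prediction frames into one list in image
    coordinates.  Each entry pairs a region's top-left (x, y) offset
    with its local boxes given as (cx, cy, w, h)."""
    merged = []
    for (ox, oy), boxes in region_preds:
        for cx, cy, w, h in boxes:
            merged.append((cx + ox, cy + oy, w, h))
    return merged

# Two regions: one at the image origin, one starting at x = 100.
merged = merge_region_predictions([
    ((0, 0),   [(10, 10, 20, 20)]),
    ((100, 0), [(5, 5, 8, 8)]),
])
# → [(10, 10, 20, 20), (105, 5, 8, 8)]
```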
8. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the image area-based anchor frame setting method of any one of claims 1-7.
9. A computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the image area-based anchor frame setting method according to any one of claims 1 to 7.
CN202111429475.9A 2021-11-29 2021-11-29 Anchor frame setting method and device based on image area and storage medium Pending CN114187439A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111429475.9A CN114187439A (en) 2021-11-29 2021-11-29 Anchor frame setting method and device based on image area and storage medium


Publications (1)

Publication Number Publication Date
CN114187439A true CN114187439A (en) 2022-03-15

Family

ID=80541644

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111429475.9A Pending CN114187439A (en) 2021-11-29 2021-11-29 Anchor frame setting method and device based on image area and storage medium

Country Status (1)

Country Link
CN (1) CN114187439A (en)

Similar Documents

Publication Publication Date Title
EP3806064B1 (en) Method and apparatus for detecting parking space usage condition, electronic device, and storage medium
CN108520229B (en) Image detection method, image detection device, electronic equipment and computer readable medium
CN111914834B (en) Image recognition method, device, computer equipment and storage medium
CN110348294B (en) Method and device for positioning chart in PDF document and computer equipment
CN109117760B (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN112734641A (en) Training method and device of target detection model, computer equipment and medium
CN108564102A (en) Image clustering evaluation of result method and apparatus
CN115909059A (en) Natural resource sample library establishing method and device
CN110910445B (en) Object size detection method, device, detection equipment and storage medium
CN112419202B (en) Automatic wild animal image recognition system based on big data and deep learning
CN112307853A (en) Detection method of aerial image, storage medium and electronic device
CN113033516A (en) Object identification statistical method and device, electronic equipment and storage medium
CN111597845A (en) Two-dimensional code detection method, device and equipment and readable storage medium
CN110598758A (en) Training modeling method, vehicle charging method, management system, and storage medium
CN106651803B (en) Method and device for identifying house type data
CN111462098A (en) Method, device, equipment and medium for detecting overlapping of shadow areas of object to be detected
CN114359352A (en) Image processing method, apparatus, device, storage medium, and computer program product
CN112614074B (en) Robust vanishing point detection method and device based on response graph and clustering
CN117541594A (en) Double-non-maximum-suppression transverse wind ridging small target detection method and system
CN117095417A (en) Screen shot form image text recognition method, device, equipment and storage medium
CN111985471A (en) License plate positioning method and device and storage medium
CN114187439A (en) Anchor frame setting method and device based on image area and storage medium
CN114882490B (en) Unlimited scene license plate detection and classification method based on point-guided positioning
CN114445716B (en) Key point detection method, key point detection device, computer device, medium, and program product
CN113362227B (en) Image processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination