CN111523403B

CN111523403B - Method and device for acquiring target area in picture and computer readable storage medium

Info

Publication number: CN111523403B
Application number: CN202010258207.4A
Authority: CN
Inventors: 徐嵚嵛
Original assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; MIGU Culture Technology Co Ltd
Priority date: 2020-04-03
Filing date: 2020-04-03
Publication date: 2023-10-20
Anticipated expiration: 2040-04-03
Also published as: CN111523403A

Abstract

The application relates to the field of image processing, and discloses a method and a device for acquiring a target area in a picture and a computer readable storage medium. The method for acquiring the target area in the picture comprises the following steps: inputting the picture to be detected into a detection model, and acquiring a target detection point in the picture to be detected; the detection model comprises a backbone module and a multi-stage detection module connected with the backbone module; the backbone module is used for acquiring feature graphs of the pictures to be detected in each preset dimension; the multistage detection module is used for carrying out position detection and category detection on the feature map and determining a target detection point according to the detection result of the position detection and the detection result of the category detection; and acquiring a target area from the picture to be detected according to the target detection point. Compared with the prior art, the method and the device for acquiring the target area in the picture and the computer readable storage medium provided by the embodiment of the application have the advantages that the target area where the target to be identified is accurately detected, and the picture is convenient to accurately cut.

Description

Method and device for acquiring target area in picture and computer readable storage medium

Technical Field

The present application relates to the field of image processing, and in particular, to a method and apparatus for acquiring a target area in a picture, and a computer readable storage medium.

Background

With the development of modern information technology to intelligent and humanized, various man-machine interaction, virtual reality and intelligent monitoring systems are sequentially developed. Human body posture estimation, motion recognition, behavior understanding, etc. techniques based on computer vision play an important role therein.

However, the inventor of the present application found that, in the prior art, a deep learning model is generally required to be used, and when the deep learning model performs motion recognition on a picture, if the motion image of the picture is small, for example, when performing motion recognition on a football match picture, if the football match picture is a long-range picture, the motion image of a player is small, and in this case, in the prior art, the picture is generally required to be scaled, so that part of motion details are lost, and the motion recognition effect is poor. In order to improve the action recognition effect and reduce the loss of action details, the prior art also has the technical proposal of cutting pictures, cutting larger pictures into a plurality of small pictures and carrying out action recognition on the small pictures. However, since the region where the target to be identified is located cannot be accurately cut in the prior art, all the small pictures need to be respectively identified in an action mode, which results in a multiple increase in the calculation amount.

Disclosure of Invention

The embodiment of the application aims to provide a method and a device for acquiring a target area in a picture, and a computer readable storage medium, which can accurately detect the target area where a target to be identified is located, and is convenient for accurately cutting the picture.

In order to solve the above technical problems, an embodiment of the present application provides a method for acquiring a target area in a picture, where the method includes: inputting the picture to be detected into a detection model, and acquiring a target detection point in the picture to be detected; the detection model comprises a backbone module and a multi-stage detection module connected with the backbone module; the backbone module is used for acquiring feature graphs of the picture to be detected in each preset dimension; the multi-stage detection module is used for carrying out position detection and category detection on the feature map, and determining the target detection point according to the detection result of the position detection and the detection result of the category detection; and acquiring a target area from the picture to be detected according to the target detection point.

The embodiment of the application also provides a target detection device, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method for acquiring a target area in a picture as described above.

The embodiment of the application also provides a computer readable storage medium which stores a computer program, and the computer program realizes the method for acquiring the target area in the picture when being executed by a processor.

Compared with the prior art, the built detection model comprises a backbone module and a multi-stage detection module, wherein the backbone module acquires the characteristic diagrams of the pictures to be detected in each preset dimension, the characteristic diagrams in a plurality of preset dimensions are respectively input into the multi-stage detection module, the multi-stage detection module respectively performs position detection and category detection on the input characteristic diagrams in the plurality of preset dimensions, and then the obtained detection results of the position detection and the detection results of the category detection are combined to determine the positions corresponding to the target detection points. After the picture to be detected is input into the detection model after the training is completed, the detection model can output the target detection point in the picture to be detected, and after the position of the target detection point is determined, the range of the target area in the picture to be detected can be determined according to the target detection point. Through the accurate identification to the target area, can more be convenient for wait to detect the picture and carry out accurate tailorring, can carry out action recognition through tailorring the picture later, reduce action detail and lose the while, reduce action recognition's calculated quantity.

In addition, the detection result of the position detection comprises a central point coordinate parameter of the target area and a size parameter of the target area; and constructing a loss function of the detection model according to the central point coordinate parameter and the size parameter, wherein the weight of the central point coordinate parameter in the loss function is larger than the weight of the size parameter, and the weight of the central point coordinate parameter and the weight of the size parameter are preset constants. Because the validity of the central point coordinate parameter of the target detection point is greater than that of the size parameter, the weight of the central point coordinate parameter in the loss function is set to be greater than that of the size parameter, and the accuracy of the target detection point output by the detection model can be effectively improved.

In addition, the backbone module comprises a first sub-module, a second sub-module, a third sub-module and a fourth sub-module which are sequentially connected; the first sub-module, the second sub-module, the third sub-module and the fourth sub-module comprise a plurality of convolution modules with the same convolution kernel, and the number of the convolution kernels of the first sub-module, the second sub-module, the third sub-module and the fourth sub-module is increased in sequence; the first sub-module, the second sub-module, the third sub-module and the fourth sub-module are respectively used for outputting the feature graphs of the preset dimensions corresponding to the first sub-module, the second sub-module, the third sub-module and the fourth sub-module.

In addition, the convolution modules with the same convolution kernel comprise a feature mapping sub-module, a convolution computing sub-module, a batch norm computing sub-module, a linear rectification computing sub-module and a feature mapping sub-module which are connected in sequence.

In addition, the detection model further comprises a feature pyramid network connecting the fourth sub-module and the multi-stage detection module; and the characteristic pyramid network is used for combining output results output by the fourth submodule and inputting the combined output results into the multi-stage detection module.

In addition, the determining the target detection point according to the detection result of the position detection and the detection result of the category detection specifically includes: and combining the detection result of the position detection and the detection result of the category detection according to a preset dimension to obtain the target detection point.

In addition, the obtaining the target area from the picture to be detected according to the target detection point specifically includes: and cutting the picture to be detected by taking the target detection point as a center point and the preset size as a side length to obtain the target area.

In addition, before inputting the picture to be detected into the detection model, the method further comprises the following steps: acquiring a plurality of football match pictures; inputting a plurality of football match images into a far-middle and near-view judging model to obtain near-view football match pictures, middle-view football match pictures and far-view football match pictures; and taking the middle-view football match picture and the distant-view football match picture as the pictures to be detected. The middle-view football match picture and the distant-view football match picture are used as pictures to be detected to be input into the detection model, so that the accuracy of the detection results of football actions in the middle-view football match picture and the distant-view football match picture can be effectively improved. In addition, the close-range football match pictures are directly subjected to action recognition, and the pictures are not required to be cut, so that the calculated amount can be effectively reduced.

Drawings

Fig. 1 is a program flow chart of a method for acquiring a target area in a picture according to a first embodiment of the present application;

fig. 2 is a schematic structural diagram of a detection model in the method for acquiring a target area in a picture according to the first embodiment of the present application;

fig. 3 is a schematic structural diagram of a convolution module in the method for obtaining a target area in a picture according to the first embodiment of the present application;

fig. 4 is a schematic structural diagram of a pyramid module in the method for obtaining a target area in a picture according to the first embodiment of the present application;

fig. 5 is a flowchart of a method for acquiring a target area in a picture according to a second embodiment of the present application;

fig. 6 is a schematic structural diagram of an object detection device according to a third embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, embodiments of the present application will be described in detail below with reference to the accompanying drawings. However, those of ordinary skill in the art will understand that in various embodiments of the present application, numerous technical details have been set forth in order to provide a better understanding of the present application. However, the claimed application may be practiced without these specific details and with various changes and modifications based on the following embodiments.

The first embodiment of the application relates to a method for acquiring a target area in a picture. The specific flow is shown in fig. 1, and comprises the following steps:

step S101: and inputting the picture to be detected into the detection model.

Specifically, in the present embodiment, as shown in fig. 2, the inspection model includes a backbone module 201 and a multi-stage inspection module 202 connected to the backbone module 201. The backbone module is used for acquiring feature images of the pictures to be detected in each preset dimension, and the multi-stage detection module is used for carrying out position detection and category detection on the feature images and determining target detection points in the pictures to be detected by combining detection results of the position detection and detection results of the category detection.

Further, in the present embodiment, the backbone module 201 includes a first sub-module 2011, a second sub-module 2012, a third sub-module 2013, and a fourth sub-module 2014 that are sequentially connected; the first sub-module 2011, the second sub-module 2012, the third sub-module 2013 and the fourth sub-module 2014 each comprise a plurality of convolution modules 203 with identical convolution kernels. And the number of convolution kernels within the first 2011, second 2012, third 2013 and fourth 2014 sub-modules increases in sequence.

The convolution kernel of the convolution module 203 is a 3x3 convolution kernel, and it is understood that the convolution kernel of the convolution module 203 is a 3x3 convolution kernel only as a specific application example in this embodiment, and in other embodiments of the present application, the convolution kernel of the convolution module 203 may also be other values, such as 4x4, 5x5, etc., which are not listed here, and in particular, may be flexibly set according to actual needs.

In the following, the operation process of the backbone module in this embodiment is specifically illustrated, and it will be understood that the number of convolution kernels and the size of the output feature map described below are only one specific illustration in this embodiment, and are not limited thereto. In this embodiment, the picture to be detected with the size of 1280x720 is input into the backbone module, the number of convolution kernels of the convolution module 203 included in the first submodule 2011 is 128, the feature map output size is 1280x720, the number of convolution kernels of the convolution module 203 included in the second submodule 2012 is 256, the feature map output size is 640x360, the number of convolution kernels of the convolution module 203 included in the third submodule 2013 is 512, the feature map output size is 320x180, and the number of convolution kernels of the convolution module 203 included in the fourth submodule 2014 is 1024, and the feature map output size is 160x90.

Preferably, in this embodiment, as shown in fig. 3, the convolution module 203 includes a feature mapping submodule 2031, a convolution calculation submodule 2032, a batch norm operator submodule 2033, a linear rectification calculation submodule 2034, and a feature mapping submodule 2035, which are connected in sequence.

Note that, in fig. 2, the horizontally extending arrow indicates the same 3×3 convolution calculation process with a step size of 1 as the convolution module 203; the top-down arrow indicates the downsampling process and the bottom-up arrow indicates the upsampling process. The up-sampling process and the down-sampling process are convolution calculation processes, and the step sizes of the up-sampling process and the down-sampling process are different from those of the convolution module 203, for example, the up-sampling process and the down-sampling process may be convolution calculation processes with a step size of 2 or convolution calculation processes with a step size of 4.

Further, in the present embodiment, as shown in fig. 2, the multi-level detection module 202 includes a position detection sub-module 2021 and a category detection sub-module 2022. The position detection sub-module 2021 is configured to perform position detection on an input feature map, and the category detection sub-module 2022 is configured to perform category detection on the input feature map.

In this embodiment, the feature pyramid module 204 is further included. The feature pyramid module 204 is connected to the fourth sub-module 2014, the position detection sub-module 2021, and the category detection sub-module 2022, respectively. The specific structure of the feature pyramid module 204 is shown in fig. 4, and is used for combining the feature information output by the fourth submodule 2014 to form a feature map, and transmitting the feature map to the position detection submodule 2021 and the category detection submodule 2022.

Specifically, in the present embodiment, the multi-stage detection module 202 further includes a combination module connected to the position detection sub-module 2021 and the category detection sub-module 2022. The combination module is configured to combine the detection results of the position detection sub-module 2021 and the detection results of the category detection sub-module 2022. The location of the target detection point is determined. In the present embodiment, the combination module performs combination calculation of a preset dimension on the detection result of the position detection sub-module 2021 and the detection result of the category detection sub-module 2022 to obtain a combination result, and uses the combination result as the target detection point. For example, the dimension of the detection result of the position detection sub-module 2021 before combination is 5x5x24, and for position detection, 4-dimensional position coordinates are unified, so 5x5x24 is converted to 150x4. Similarly, the dimension of the detection result of the class detection sub-module 2022 before combination is 5x5x18, and for the class detection, the 3-dimensional class number is uniform, so the 5x5x18 is transformed to 150x3, so that combination is performed according to the common dimension 150. It should be understood that the foregoing is merely a specific example of the present embodiment, and is not limited thereto, and other combinations may be adopted in other embodiments of the present application, and the present application is not limited to the specific examples, and may be flexibly set according to actual needs.

Specifically, the detection model is a learning model with training completed. The training process comprises the following steps: training samples are first obtained. For example, in the present embodiment, a soccer game picture is exemplified.

First, a plurality of football match pictures are acquired, and for example, a plurality of frames of images may be extracted from a piece of football match video. Then, the target areas in the football match pictures are marked by anchor frame marks, for example, football and ball holders can be marked by manual marking and other methods, and a plurality of football match pictures with marks are formed as training samples. And finally, inputting the training sample into a detection model, firstly, acquiring feature images of each preset dimension of the training sample through a backbone module, then, carrying out position detection on the feature images through a position detection sub-module, and carrying out category detection on the feature images through a category detection sub-module. And solving a loss function according to the position detection result and the category detection result and combining the labels of the training samples. Specifically, the position detection sub-module presets an anchor frame on the feature map, and each coordinate point on the feature map generates a corresponding anchor frame. And comparing each anchor frame with the anchor frame label, and calculating a loss function.

The loss function calculation formula of the position detection result is as follows:

；

wherein L is _loc For loss of position detection, n _posotives And posives is an anchor frame with IoU greater than 0.5 as compared to the anchor frame label; the SizeLoss input is the last two dimensions of the four-dimensional coordinates after combination transformation in the detection module, namely the width and the height of the target area, and the Sizecenter is the four-dimensional coordinates after combination transformation in the detection module, namely the coordinates of the target detection point; smoothL1 represents a smooth L1 loss function, specifically as follows:

。

IoU is IoU = (A.u.B)/(A.u.B), and A, B is respectively an anchor frame and an anchor frame mark.

In the present embodiment, since the validity of the target detection point center point coordinate parameter is greater than the validity of the size parameter of the target area, different weights are assigned and α2> α1. I.e. the weight of the position offset is larger than the weight of the predicted size offset.

The loss function calculation formula of the category detection result is as follows:

wherein L is _conf For loss of class detection results, CELoss is the cross entropy loss function and hard negotives is a fixed multiple of posotives. hard negotives are obtained by respective negative matches (IoU<0.5 The largest of the cross entropy losses of anchor frame prediction).

The loss function is L:. Wherein, beta is a preset constant.

After the loss function is obtained, training the model through a back propagation algorithm, and continuously reducing the value of the loss function until the value of the loss function reaches a preset threshold.

Step S102: and acquiring a target area from the picture to be detected according to the target detection point.

Specifically, in this embodiment, after the target detection point is obtained, the side length of the preset size can be set according to the actual requirement, and the picture to be detected is cut by using the target detection point as the center point, so as to obtain the target area.

Compared with the prior art, in the method for acquiring the target area in the picture provided by the first embodiment of the application, the established detection model comprises two parts, namely a backbone module and a multi-stage detection module, after the backbone module acquires the feature images of the picture to be detected in a plurality of preset dimensions, the feature images in the plurality of preset dimensions are respectively input into the multi-stage detection module, and the multi-stage detection module respectively performs position detection and category detection on the input feature images in the plurality of preset dimensions, and then combines the obtained detection results of the position detection and the detection results of the category detection to determine the position corresponding to the target detection point. By training the data of the detection model, the accuracy of the target detection point output by the detection model is improved, after the picture to be detected is input into the detection model after the training is completed, the detection model can output the target detection point in the picture to be detected, and after the position of the target detection point is determined, the range of the target area in the picture to be detected can be determined according to the target detection point. Through the accurate identification to the target area, can more be convenient for wait to detect the picture and carry out accurate tailorring, can carry out action recognition through tailorring the picture later, reduce action detail and lose the while, reduce action recognition's calculated quantity.

The second embodiment of the application relates to a method for acquiring a target area in a picture. The second embodiment is substantially the same as the first embodiment, and differs mainly in that: the second embodiment is a cropping process applied to a soccer game picture. The specific steps are shown in fig. 5, including:

step S201: a plurality of football match pictures are acquired.

Specifically, in this embodiment, a football match video may be obtained; and extracting a plurality of picture frames from the football match video to obtain a plurality of football match pictures. It should be understood that the foregoing is merely a specific illustration of capturing a plurality of football match pictures in the present embodiment, and is not meant to be limiting.

Step S202: and inputting a plurality of football match images into a far-middle and near-view judging model to obtain near-view football match pictures, middle-view football match pictures and far-view football match pictures.

Specifically, in this embodiment, the far, middle and near view judgment model is a res net34 network, and the specific structure is as follows:

it should be understood that the foregoing table is merely a specific structural example of the network of the res net34, and is not limited thereto, and in other embodiments of the present application, other structures may be used, which are not listed herein, and may be specifically set flexibly according to actual needs.

Step S203: cutting the close-range football match pictures according to a preset size to obtain a target area.

Step S204: inputting the middle-view football match pictures and the distant-view football match pictures into a detection model after training is completed, and obtaining target detection points output by the detection model.

Step S205: and acquiring a target area from the middle-view football match picture and the distant-view football match picture according to the target detection point.

It should be understood that the steps S204 to S205 are substantially the same as the steps S101 to S102 in the first embodiment, and specific reference may be made to the specific description of the first embodiment, which is not repeated herein.

Compared with the prior art, the method for acquiring the target area in the picture provided by the second embodiment of the application acquires the close-range football match picture, the middle-range football match picture and the far-range football match picture in the far-middle close-range judgment model while retaining the technical effect of the first embodiment; for the close-range football match picture, as the action details are clear, the target area is obtained by directly cutting according to the preset size or the target area can be cut without cutting, so that the operation amount is effectively reduced.

The above steps of the methods are divided, for clarity of description, and may be combined into one step or split into multiple steps when implemented, so long as they contain the same logic relationship, and they are all within the protection scope of this patent; it is within the scope of this patent to add insignificant modifications to the algorithm or flow or introduce insignificant designs, but not to alter the core design of its algorithm and flow.

A third embodiment of the present application relates to an object detection apparatus, as shown in fig. 6, including: at least one processor 601; and a memory 602 communicatively coupled to the at least one processor 601; the memory 602 stores instructions executable by the at least one processor 601, where the instructions are executed by the at least one processor 601, so that the at least one processor 601 can perform a method for acquiring a target area in a picture as described above.

Where the memory 602 and the processor 601 are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors 601 and the memory 602. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 601 is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor 601.

The processor 601 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory 602 may be used to store data used by processor 601 in performing operations.

A fourth embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.

That is, it will be understood by those skilled in the art that all or part of the steps in implementing the methods of the embodiments described above may be implemented by a program stored in a storage medium, where the program includes several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps in the methods of the embodiments of the application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-only memory (ROM), a random access memory (RAM, randomAccessMemory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples of carrying out the application and that various changes in form and details may be made therein without departing from the spirit and scope of the application.

Claims

1. The method for acquiring the target area in the picture is characterized by comprising the following steps:

inputting the picture to be detected into a detection model, and obtaining a target detection point in the picture to be detected; the detection model comprises a backbone module and a multi-stage detection module connected with the backbone module; the backbone module is used for acquiring feature graphs of the picture to be detected in each preset dimension; the multi-stage detection module is used for carrying out position detection and category detection on the feature map, and determining the target detection point according to the detection result of the position detection and the detection result of the category detection;

acquiring a target area from the picture to be detected according to the target detection point;

the backbone module comprises a first sub-module, a second sub-module, a third sub-module and a fourth sub-module which are sequentially connected; the first sub-module, the second sub-module, the third sub-module and the fourth sub-module comprise a plurality of convolution modules with the same convolution kernel, and the number of the convolution kernels of the first sub-module, the second sub-module, the third sub-module and the fourth sub-module is increased in sequence; the first sub-module, the second sub-module, the third sub-module and the fourth sub-module are respectively used for outputting the feature graphs of the corresponding preset dimensions; the convolution modules with the same convolution kernel comprise a feature mapping sub-module, a convolution computing sub-module, a batch norm computing sub-module, a linear rectification computing sub-module and a feature mapping sub-module which are connected in sequence; the detection model further comprises a characteristic pyramid network connecting the fourth sub-module and the multi-level detection module; and the characteristic pyramid network is used for combining output results output by the fourth submodule and inputting the combined output results into the multi-stage detection module.

2. The method for obtaining a target area in a picture according to claim 1, wherein the detection result of the position detection includes a center point coordinate parameter of the target area and a size parameter of the target area;

and constructing a loss function of the detection model according to the central point coordinate parameter and the size parameter, wherein the weight of the central point coordinate parameter in the loss function is larger than the weight of the size parameter, and the weight of the central point coordinate parameter and the weight of the size parameter are preset constants.

3. The method for acquiring the target area in the picture according to claim 1, wherein the determining the target detection point according to the detection result of the position detection and the detection result of the category detection specifically includes:

and combining the detection result of the position detection and the detection result of the category detection according to a preset dimension to obtain the target detection point.

4. The method for obtaining a target area in a picture according to claim 1, wherein the obtaining the target area from the picture to be detected according to the target detection point specifically includes:

and cutting the picture to be detected by taking the target detection point as a center point and the preset size as a side length to obtain the target area.

5. The method for obtaining a target area in a picture according to claim 1, wherein before inputting the picture to be detected into the detection model, the method further comprises:

acquiring a plurality of football match pictures;

inputting a plurality of football match pictures into a far-middle and near-view judging model to obtain near-view football match pictures, middle-view football match pictures and far-view football match pictures;

and taking the middle-view football match picture and the distant-view football match picture as the pictures to be detected.

6. An object detection apparatus, comprising:

at least one processor; the method comprises the steps of,

a memory communicatively coupled to at least one of the processors; wherein,,

the memory stores instructions executable by at least one of the processors to enable the at least one of the processors to perform the method of acquiring a target region in a picture according to any one of claims 1 to 5.

7. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of acquiring a target area in a picture as claimed in any one of claims 1 to 5.