CN114882436A - Target detection method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114882436A
CN114882436A
Authority
CN
China
Prior art keywords
detection
image
detected
channel
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210557427.6A
Other languages
Chinese (zh)
Inventor
吴晨
刘文韬
钱晨
周杨
黄诗尧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen TetrasAI Technology Co Ltd
Original Assignee
Shenzhen TetrasAI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen TetrasAI Technology Co Ltd filed Critical Shenzhen TetrasAI Technology Co Ltd
Priority to CN202210557427.6A
Publication of CN114882436A
Legal status: Pending


Classifications

    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06N3/045 Combinations of networks
    • G06T3/4076 Super resolution, i.e. output image resolution higher than sensor resolution, by iteratively correcting the provisional high-resolution image using the original low-resolution image
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/56 Extraction of image or video features relating to colour
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands

Abstract

Embodiments of the present disclosure provide a target detection method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a first channel image of an image to be detected; adjusting the resolution of the first channel image to a first resolution that is higher than the resolution of the image to be detected; and performing detection on the adjusted first channel image to obtain a target detection result. By extracting the first channel image from the image to be detected, raising its resolution, and performing target detection on the resolution-enhanced first channel image, the method improves the target detection effect while reducing redundant information in the image to be detected and reducing the amount of data to be processed.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The embodiments of the present disclosure relate to the field of target detection, and in particular, to a target detection method and apparatus, an electronic device, and a storage medium.
Background
Object detection networks are currently used in many scenarios, such as pedestrian detection and article detection. In the related art, whole images are generally input directly into a target detection network for target detection; for example, an RGB image collected by a monitoring device is input to the network as-is.
In practical application scenarios, the target detection network generally needs to be small, yet its accuracy also needs to be improved, and these two requirements constrain each other. Existing target detection networks optimize the network structure to improve detection performance, but this cannot solve the problem that detection results on small-scale networks remain unsatisfactory, especially for small-target detection.
Disclosure of Invention
In view of this, the embodiment of the present disclosure discloses at least a target detection method, which includes:
acquiring a first channel image of an image to be detected;
adjusting the resolution of the first channel image to a first resolution, wherein the first resolution is higher than the resolution of the image to be detected;
and detecting the adjusted first channel image to obtain a target detection result.
In some embodiments shown, the number of channels included in the first channel image is less than the number of channels included in the image to be detected.
In some embodiments shown, the first channel image comprises a luminance channel image, and the image to be detected comprises an RGB image or an HSV image; acquiring the first channel image of the image to be detected comprises:
converting the image to be detected into a YUV image;
and acquiring a brightness channel image in the YUV image.
In some illustrated embodiments, the first channel image comprises a luminance channel image and the image to be detected comprises an RGB image; acquiring the first channel image of the image to be detected comprises:
carrying out channel separation on the image to be detected to obtain a channel separation result, wherein the channel separation result comprises pixel values of an R channel, a G channel and a B channel;
and performing color gamut conversion on the channel separation result to obtain the brightness channel image.
In some embodiments shown, the detecting the adjusted first channel image to obtain a target detection result includes:
processing the adjusted first channel image through at least two detection branches of a target detection network to obtain detection information which is output by each detection branch of the at least two detection branches and corresponds to the branch task;
and determining a target detection result according to the detection information output by the at least two detection branches.
In some embodiments shown, before processing the adjusted first channel image through at least two detection branches of the object detection network, the method further includes:
acquiring first characteristics of the adjusted first channel image in multiple scales;
obtaining a second feature of a first set scale according to the first features of the multiple scales;
the processing the adjusted first channel image through at least two detection branches of the target detection network includes:
and processing the second feature with the first set scale through at least two detection branches of the target detection network.
In some embodiments, the obtaining a second feature of a first set scale according to the first features of the multiple scales includes:
respectively carrying out interpolation processing on the first characteristic of each scale in the multiple scales to obtain an interpolation result of the same scale;
and performing feature extraction on the sum of the interpolation results corresponding to all scales to obtain second features of the first set scale.
In some embodiments shown, the detection information includes at least two of:
first detection information for indicating a probability that each point in the second feature is a first center point of a target to be detected;
second detection information for indicating a first offset between a first central point of the target to be detected and a real central point;
and third detection information used for indicating the size of the detection frame of the target to be detected.
In some embodiments shown, the determining a target detection result according to the detection information output by the at least two detection branches includes:
determining a first central point of the target to be detected according to the first detection information;
determining a target center point of a detection frame of the target to be detected according to a first offset indicated by a corresponding point of the first center point in the second detection information;
determining the size of the detection frame of the target to be detected according to the size of the detection frame indicated by the corresponding point of the first central point in the third detection information;
and determining a detection frame of the target to be detected in the image to be detected according to the determined target central point and the size of the detection frame.
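The determination steps above can be sketched as follows. This is an illustrative decode only, not the patent's own implementation: it assumes the branches output a center-probability heatmap, an offset map, and a size map on a common grid, and the helper name `decode_detection`, the shapes, and the threshold are all hypothetical.

```python
import numpy as np

def decode_detection(heatmap, offset, size, threshold=0.5):
    """Decode one detection frame from three branch outputs:
    heatmap (H, W):    probability that each point is a first center point
    offset  (H, W, 2): first offset from the grid point to the real center
    size    (H, W, 2): width and height of the detection frame
    """
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if heatmap[cy, cx] < threshold:
        return None  # no target to be detected
    # target center point = first center point + indicated first offset
    center = (cx + offset[cy, cx, 0], cy + offset[cy, cx, 1])
    w, h = size[cy, cx]
    return {"center": center, "width": float(w), "height": float(h)}

# Hypothetical branch outputs with a single peak at grid point (10, 12).
H, W = 32, 32
heat = np.zeros((H, W)); heat[10, 12] = 0.9
off = np.zeros((H, W, 2)); off[10, 12] = (0.3, -0.2)
sz = np.zeros((H, W, 2)); sz[10, 12] = (8.0, 16.0)
box = decode_detection(heat, off, sz)
```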
In some embodiments shown, in the case that the object to be detected is a human body, the detection information further includes: fourth detection information indicating a second offset between the first center point of the human body and a human body key point;
the determining a target detection result according to the detection information output by the at least two detection branches includes:
and determining the position information of the human body key point corresponding to the human body detection frame according to the second offset indicated by the corresponding point of the first central point in the fourth detection information.
In some embodiments shown, in the case that the object to be detected is a human body, the detection information further includes: fifth detection information indicating human body attribute information of the human body;
the determining a target detection result according to the detection information output by the at least two detection branches includes:
and determining the human body attribute characteristics corresponding to the detection frame of the human body according to the human body attribute information indicated by the corresponding point of the first central point in the fifth detection information.
The embodiment of the present disclosure further discloses a target detection device, which includes:
the acquisition module is used for acquiring a first channel image of an image to be detected;
the adjusting module is used for adjusting the resolution of the first channel image to a first resolution, wherein the first resolution is higher than the resolution of the image to be detected;
and the detection module is used for detecting the adjusted first channel image to obtain a target detection result.
The embodiment of the present disclosure also discloses an electronic device, which includes:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to invoke executable instructions stored in the memory to implement the object detection method as previously described.
The disclosed embodiment also discloses a computer readable storage medium, which stores a computer program for executing the object detection method.
In the embodiment of the disclosure, the first channel image in the image to be detected can be extracted, the resolution of the first channel image is then improved, the target detection is performed on the first channel image with the improved resolution, and the target detection effect can be improved under the conditions of reducing redundant information in the image to be detected and reducing data processing amount.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of embodiments of the disclosure.
Drawings
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in their description are briefly introduced below. The drawings in the following description show only some of the embodiments described in the present disclosure; those skilled in the art can derive other drawings from them.
Fig. 1 is a schematic flow chart of a target detection method provided in an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of an object detection network provided in an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an object detection apparatus provided in an embodiment of the present disclosure;
fig. 4 is a schematic diagram of an apparatus for configuring a method according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of apparatus and methods consistent with certain aspects of the specification, as detailed in the appended claims.
The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
In order to make the technical solutions in the embodiments of the present disclosure better understood and make the above objects, features and advantages of the embodiments of the present disclosure more comprehensible, the technical solutions in the embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings.
Object detection networks are currently used in many scenarios, such as pedestrian detection and article detection. In the related art, whole images are generally input directly into a target detection network for target detection; for example, an RGB image collected by a monitoring device is input to the network as-is.
In practical application scenarios, the target detection network generally needs to be small, yet its accuracy also needs to be improved, and these two requirements constrain each other. Existing target detection networks optimize the network structure to improve detection performance, but this cannot solve the problem that detection metrics on small-scale networks remain unsatisfactory, especially for small-target detection.
In view of this, the embodiment of the present disclosure at least discloses a target detection method.
The method may be executed by an electronic device, on which a software system corresponding to the target detection method is installed. The target detection method may be performed by a terminal device, a server, or another processing device. The terminal device may be a User Equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the target detection method may be implemented by a processor calling computer-readable instructions stored in a memory.
Fig. 1 is a schematic flow chart of a target detection method according to an embodiment of the present disclosure.
The method may include the following steps.
S101: and acquiring a first channel image of an image to be detected.
S102: and adjusting the resolution of the first channel image to be a first resolution, wherein the first resolution is higher than the resolution of the image to be detected.
S103: and detecting the adjusted first channel image to obtain a target detection result.
According to the method, the first channel image in the image to be detected can be extracted, the resolution of the first channel image is improved, the target detection is carried out on the first channel image with the improved resolution, and the target detection effect can be improved under the conditions of reducing redundant information in the image to be detected and reducing data processing quantity.
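Steps S101 to S103 can be illustrated with a minimal sketch. The helper names `extract_y_channel`, `upscale_nearest`, and the placeholder `detect` are hypothetical, and luminance extraction plus nearest-neighbour upscaling is only one possible realization of the steps:

```python
import numpy as np

def extract_y_channel(rgb):
    # S101: take the luminance (Y) channel as the first channel image.
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def upscale_nearest(img, factor):
    # S102: raise the resolution above that of the image to be detected.
    # Nearest-neighbour repetition is used for brevity; the patent only
    # requires some interpolation strategy.
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def detect(y_highres):
    # S103: placeholder for the target detection network.
    return {"boxes": []}

rgb = np.random.rand(120, 160, 3)   # image to be detected
y = extract_y_channel(rgb)          # one channel instead of three
y_up = upscale_nearest(y, 2)        # first resolution = 2x the input
result = detect(y_up)
```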
For S101, the method does not limit the source of the image to be detected; any image on which target detection is to be performed may be used.
In the monitoring scene, an image shot by the monitoring device in real time can be determined as an image to be detected. Specifically, pedestrian detection can be performed for the surveillance video, and pedestrian detection can be performed for each video frame of the surveillance video.
In the shooting scene, the image currently displayed in real time by the shooting device can be determined as the image to be detected. The human body detection can be specifically carried out on the image shot in real time in the camera device view frame.
The image to be detected can be a multi-channel image which needs to be subjected to target detection, and a first channel image used for target detection can be obtained from the multi-channel image.
The present embodiment does not limit the specific channel-image form of the image to be detected. The image to be detected may be an RGB (Red, Green, Blue) three-channel image, a YUV three-channel image (where Y represents luminance and U, V represent chrominance), or an HSV (Hue, Saturation, Value) three-channel image.
The image on which object detection is performed is typically a multi-channel image, e.g., an RGB three-channel image. In practice, however, target detection does not require all of the channel images in the multi-channel image. For example, human vision needs only simple texture and gray-scale information to clearly distinguish a human body in a scene. This embodiment therefore proposes performing target detection using only part of the channels of the multi-channel image.
Regarding the first channel image, the number of channels included in the first channel image may be less than the number of channels included in the image to be detected, so that redundant information in the image to be detected may be reduced. The flow of the method does not limit the channel image specifically contained in the first channel image, as long as the first channel image contains a part of the channel image in the image to be detected, and the number of channels contained in the first channel image is less than the number of channels contained in the image to be detected.
In the case that the image to be detected is a three-channel image, the first channel image may include a two-channel image from the image to be detected, for example an RG, GB, or RB two-channel image, or any single-channel image from the image to be detected, for example the R channel image, the G channel image, or the B channel image.
In the embodiment, the first channel image is limited to include part of the channel images in the image to be detected, and the number of channels included in the first channel image is less than that of the channels included in the image to be detected, so that the effect of reducing redundant information in the image to be detected can be achieved, and the data processing amount is reduced.
Different forms of the channel images in the image to be detected can be converted, and the channel images in different forms can be conveniently obtained. For example, the RGB three-channel image and the YUV three-channel image may be converted to each other, the RGB three-channel image and the HSV three-channel image may be converted to each other, and the HSV three-channel image and the YUV three-channel image may be converted to each other.
In some embodiments, the first channel image may be acquired directly from the channel images included in the image to be detected. For example, in the case that the image to be detected includes a YUV image and the first channel map includes a Y channel image, that is, a luminance image, the Y channel image may be directly extracted from the image to be detected.
In some embodiments, a conversion may be performed on a channel image included in an image to be detected, and a first channel image may be obtained from a conversion result. For example, when the image to be detected includes an RGB three-channel image and the first channel image includes a Y-channel image, the RGB three-channel image may be converted into a YUV three-channel image, so that the Y-channel image may be extracted. Similarly, under the condition that the image to be detected comprises an HSV three-channel image, the HSV three-channel image can be converted into a YUV three-channel image, so that a Y-channel image can be extracted.
In some embodiments, the channel image of the image to be detected may be subjected to channel separation, and the channel separation result may then be subjected to color gamut conversion to obtain the required channel image, that is, the first channel image. For example, when the image to be detected comprises an RGB three-channel image and the first channel image comprises a Y channel image, channel separation may be performed on the image to be detected to obtain a channel separation result comprising the pixel values of the R channel, the G channel, and the B channel; the channel separation result is then converted according to a color gamut conversion formula to obtain the Y channel image. The color gamut conversion formula is shown as formula (1):

Y = a11·R + a21·G + a31·B + b1
U = a12·R + a22·G + a32·B + b2    (1)
V = a13·R + a23·G + a33·B + b3

In formula (1), aij and bk are the element values of the transformation matrix, where i, j, k indicate element positions. a11, a21, a31 and the constant term b1 are the parameters for computing the Y-channel pixel value from the RGB three-channel pixel values; a12, a22, a32 and the constant term b2 compute the U-channel pixel value; a13, a23, a33 and the constant term b3 compute the V-channel pixel value.
In some embodiments, the first channel image may include a luminance channel image, and the image to be detected may include an RGB image or an HSV image; acquiring a first channel image of an image to be detected may include: converting an image to be detected into a YUV image; and acquiring a Y-channel image in the YUV image.
In some embodiments, the first channel image comprises a Y channel image, and the image to be detected comprises an RGB image; acquiring a first channel image of an image to be detected may include: carrying out channel separation on an image to be detected to obtain a channel separation result, wherein the channel separation result comprises pixel values of an R channel, a G channel and a B channel; and performing color gamut conversion on the channel separation result to obtain a Y-channel image.
By acquiring the Y channel image of the image to be detected and improving the resolution of the Y channel image, the target detection is carried out on the Y channel image with improved resolution, so that redundant information in the image to be detected is reduced, and the data processing amount is reduced while the detection effect is ensured. In the embodiment of the disclosure, any form of channel image can be acquired for the image to be detected.
The flow of the method is not limited to a specific method for obtaining the first channel image of the image to be detected, and the method can be specifically obtained according to the channel image form contained in the specific image to be detected and the channel image form required by the first channel image.
For step S102, on the basis of reducing redundant information of the image to be detected, resolution enhancement processing may be performed on the first channel image, so as to obtain a clearer first channel image.
The method flow does not limit the resolution adjustment mode of the first channel image, as long as the resolution of the first channel image can be improved. For example, a pixel point may be newly added between adjacent pixel points of the first channel image through an interpolation strategy.
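One such interpolation strategy can be sketched as follows. Bilinear interpolation is an assumption (the patent mandates no particular strategy), and `bilinear_upscale` is a hypothetical helper:

```python
import numpy as np

def bilinear_upscale(img, out_h, out_w):
    """Insert new pixels between adjacent pixels of a single-channel
    image by bilinear interpolation."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)     # sample rows in input space
    xs = np.linspace(0, in_w - 1, out_w)     # sample cols in input space
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

y = np.arange(16, dtype=float).reshape(4, 4)  # first channel image
y_up = bilinear_upscale(y, 8, 8)              # first resolution > input
```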
After the resolution of the first channel image is raised, its definition is improved as well, which in turn improves the effect of target detection, in particular its accuracy. A blurred target in the first channel image becomes clearer once the resolution is raised, so the accuracy of target detection can be improved.
For example, the image to be detected may include a blurred human body, and if the target detection is directly performed on the image to be detected, it is often difficult to detect the human body. After the resolution of the first channel image is improved, the definition of a fuzzy human body can be improved, so that the human body can be detected, and the accuracy of target detection is improved.
For example, an image to be detected may include an object with a small size, and the object is often blurred in the image to be detected due to the small size of the object, and it is often difficult to detect the object. After the resolution of the first channel image is improved, the definition of the object can be improved, so that the object can be detected, and the accuracy of target detection is improved.
In the related art, directly inputting the whole image to be detected into the target detection network consumes more computing resources and yields low detection efficiency.
When the data volume of the adjusted first channel image is less than or equal to that of the image to be detected, inputting it to the target detection network improves the target detection effect without increasing the computational load. When its data volume is strictly smaller than that of the image to be detected, the consumption of computing resources is also reduced and the efficiency of target detection is improved.
Fig. 2 shows a schematic structural diagram of an object detection network provided in an embodiment of the present disclosure. The target detection network comprises a backbone neural network, a pyramid neural network, and a plurality of detection branches. In some embodiments, the first channel images of a number of labeled images may be acquired to train the target detection network; the trained network can then perform target detection on a first channel image.
The first channel images with labels may be obtained in either of two ways: a number of labeled images may be obtained first, and then the first channel image of each labeled image is extracted, so that the extracted first channel images carry the annotated target detection frame labels; alternatively, a number of images may be obtained first, and the first channel image of each image is then extracted and annotated with a target detection frame label.
For example, a number of labeled images may be obtained, each containing an annotated target detection frame label; the luminance channel image in each labeled image is then acquired and associated with the corresponding label for training the target detection network.
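As a toy sketch of this pairing, the data layout and the helper name `make_training_sample` below are hypothetical:

```python
import numpy as np

def make_training_sample(labeled_rgb, boxes):
    """Pair the luminance channel of a labeled image with its annotated
    target detection frame labels (hypothetical data layout)."""
    y = (0.299 * labeled_rgb[..., 0] + 0.587 * labeled_rgb[..., 1]
         + 0.114 * labeled_rgb[..., 2])
    return {"image": y, "boxes": boxes}

# One labeled image and one annotated detection frame (x, y, w, h).
sample = make_training_sample(np.ones((64, 64, 3)), [(10, 10, 20, 30)])
```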
In some embodiments, the target detection network may output the predicted target detection box directly for the adjusted first channel image.
In some embodiments, the target detection network may include at least two detection branches, and different detection branches may perform different branch tasks to accomplish target detection.
In some embodiments, detecting, by the target detection network, the adjusted first channel image to obtain a target detection result may include: processing the adjusted first channel image through at least two detection branches of the target detection network to obtain detection information which is output by each detection branch of the at least two detection branches and corresponds to the branch task; and determining a target detection result according to the detection information output by the at least two detection branches.
Target detection can be performed on the adjusted first channel image through the detection branches in the target detection network, each branch executing its own branch task to obtain detection information, so that the target detection result can be conveniently and comprehensively determined from the detection information output by the detection branches.
Of course, the present embodiment does not limit the specific inputs of the detection branches in the target detection network. The adjusted first channel image may be input directly to the detection branches to execute the branch tasks, or it may first undergo some processing, such as feature extraction, before being input to the detection branches.
In some embodiments, before processing the adjusted first channel image through at least two detection branches of the target detection network, the method may further include: acquiring first characteristics of the adjusted first channel image in multiple scales; and obtaining a second feature of the first set dimension according to the first features of the plurality of dimensions.
Processing the adjusted first channel image through at least two detection branches of the target detection network may include: and processing the second feature with the first set scale through at least two detection branches of the target detection network.
In an optional embodiment, detecting, by the target detection network, the adjusted first channel image to obtain a target detection result may include: acquiring first characteristics of multiple scales of the adjusted first channel image; obtaining a second feature of a first set scale according to the first features of the multiple scales; inputting the second characteristics of the first set scale into a plurality of detection branches to obtain detection information which is output by each detection branch and corresponds to the detection branch task; and determining a target detection result according to the detection information output by the plurality of detection branches.
The embodiment does not limit the method for acquiring the first feature.
In some embodiments, the first features of multiple scales may be extracted from the adjusted first channel image by convolution.
Specifically, convolution kernels are used to perform different numbers of convolution operations on the adjusted first channel image, so as to obtain convolution results of different scales as the first features of those scales. The convolution kernels may or may not be identical.
In one particular example, the target detection network may include a backbone neural network; each convolutional layer in the backbone neural network may include a number of convolutional filters, so that first features of multiple scales are output for the input image.
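As an illustrative sketch of how multi-scale first features can arise, the following code alternates a 3x3 convolution with stride-2 subsampling; the averaging kernel and the number of scales are assumptions standing in for the learned backbone filters.

```python
import numpy as np

def conv3x3(x, kernel):
    """Valid-mode 3x3 convolution on a 2-D array (no padding)."""
    h, w = x.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * kernel)
    return out

def multi_scale_features(x, num_scales=3):
    """Apply a convolution followed by stride-2 subsampling repeatedly,
    collecting one 'first feature' per scale. The averaging kernel is an
    illustrative stand-in for learned convolutional filters."""
    kernel = np.full((3, 3), 1.0 / 9.0)
    feats = []
    for _ in range(num_scales):
        x = conv3x3(x, kernel)[::2, ::2]  # halve the spatial scale
        feats.append(x)
    return feats
```

Each element of the returned list corresponds to one scale, which is what the pyramid neural network described below would receive as input.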
In some embodiments, deriving the second feature of the first set dimension from the first feature of the plurality of dimensions may include: respectively carrying out interpolation processing on the first characteristic of each scale in the multiple scales to obtain an interpolation result of the same scale; and extracting the features of the sum of the interpolation results corresponding to all scales to obtain second features of the first set scale.
However, the present embodiment does not limit a specific method of interpolation processing. The interpolation may be linear or non-linear.
The present embodiment also does not limit the specific method of summing the interpolation results. Either a direct sum or a weighted sum may be calculated.
In one particular example, the target detection network may include a pyramid neural network. The input of the pyramid neural network can be the first features of multiple scales, specifically those output by the backbone neural network; these are brought to the same scale through bilinear interpolation and summed, and feature extraction is then performed through a convolution filter to output the second feature of the first set scale.
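The pyramid fusion step described above can be sketched as follows; the `bilinear_resize` helper is a hypothetical implementation, and the trailing convolution filter for further feature extraction is omitted for brevity.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinear interpolation of a 2-D array to (out_h, out_w)."""
    h, w = x.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = x[np.ix_(y0, x0)] * (1 - wx) + x[np.ix_(y0, x1)] * wx
    bot = x[np.ix_(y1, x0)] * (1 - wx) + x[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def fuse_features(feats, set_h, set_w):
    """Resize each first feature to the first set scale via bilinear
    interpolation and sum them; in the full network a convolution
    filter would then extract the second feature from this sum."""
    return sum(bilinear_resize(f, set_h, set_w) for f in feats)
```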
In an alternative embodiment, the target detection network may comprise a plurality of detection branches, which may be used to detect for the second feature, thereby outputting detection information.
In some embodiments, detection may be performed for each point in the second feature, outputting detection information of a first set scale. The second feature may specifically be a feature matrix, and each element in the matrix may include multiple channels of information. For convenience of description, the elements in the second feature are referred to as points, and may specifically correspond to pixel points in the image to be detected. The image to be detected can also be regarded as a digital matrix, each element comprises multi-channel data, and each element can represent a pixel point.
The present embodiment does not limit specific detection information as long as at least target detection can be performed and a target detection frame can be output.
The embodiment also does not limit the specific form of the detection branch as long as the second feature can be detected. The detection branch may include several layers of neural networks, and is configured to detect the second feature and output detection information.
In some embodiments, in order to perform target detection and obtain a target detection frame, the plurality of detection branches may include detection of the center point of the target detection frame and detection of the size of the target detection frame.
The center point of the target detection frame is usually the intersection point of the diagonals of the target detection frame. The position of the target detection frame can be conveniently and rapidly determined by detecting the central point of the target detection frame, and then the target detection frame can be determined by utilizing the size of the corresponding target detection frame.
The detection branch usually performs center point detection on the second feature with the first set scale, rather than directly performing center point detection on the adjusted first channel image or the image to be detected. Therefore, it is also necessary to determine the corresponding pixel points of each point included in the second feature in the image to be detected, so as to detect the probability that the corresponding pixel points are the central points.
Because the scale of the second feature is usually smaller than that of the image to be detected, the points in the second feature do not cover all the pixel points in the image to be detected. Therefore, the actual central point may be a pixel point other than those corresponding to the points in the second feature. That is, even if a certain point in the second feature is determined as the center point, it may not be the actually predicted center point, and a further correction is required to obtain the predicted center point.
Further, a detection branch is needed to detect, for each point included in the second feature, the offset from the real center point. The real center point can be the center point of the actually labeled target detection box, and the detected offset is likewise predicted by a detection branch.
In other words, for each point contained in the second feature, two detection branches can respectively detect the probability of being the central point and the offset from the real central point; these two types of detection information can then be integrated to determine the predicted central point position in the image to be detected.
Thus, in some embodiments, the plurality of detection branches may include:
and outputting a detection branch of first detection information, wherein the first detection information can be used for indicating the probability that each point in the second characteristic is the first central point of the target to be detected.
And outputting a detection branch of second detection information, wherein the second detection information can be used for indicating a first offset between a first central point and a real central point of the target to be detected.
And outputting a detection branch of third detection information, wherein the third detection information can be used for indicating the size of a detection frame of the target to be detected.
In some embodiments, the detection information may include at least two of: first detection information for indicating a probability that each point in the second feature is a first center point of the target to be detected; second detection information for indicating a first offset between a first central point of the target to be detected and the real central point; and third detection information for indicating the size of the detection frame of the object to be detected. Correspondingly, determining the target detection result according to the detection information output by the at least two detection branches may include:
determining a first central point of a target to be detected according to the first detection information;
determining a target center point of a detection frame of a target to be detected according to a first offset indicated by a corresponding point of the first center point in second detection information;
determining the size of a detection frame of the target to be detected according to the size of the detection frame indicated by the corresponding point of the first central point in the third detection information;
and determining a detection frame of the target to be detected according to the determined target central point and the size of the detection frame in the image to be detected.
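The decoding steps above can be sketched as follows; the array layouts, the `stride` parameter mapping second-feature points back to image pixels, and the probability threshold are all illustrative assumptions.

```python
import numpy as np

def decode_boxes(heatmap, offsets, sizes, stride, prob_thresh=0.5):
    """Decode detection frames from the three kinds of detection information.

    heatmap: (H, W) probability that each point is a first center point
    offsets: (H, W, 2) first offset (dx, dy) to the refined center
    sizes:   (H, W, 2) detection-frame (width, height) in image pixels
    stride:  assumed scale ratio between the image to be detected and
             the second feature
    Returns a list of (x1, y1, x2, y2) boxes in image coordinates.
    """
    boxes = []
    ys, xs = np.where(heatmap > prob_thresh)  # first center points
    for y, x in zip(ys, xs):
        dx, dy = offsets[y, x]
        w, h = sizes[y, x]
        # corresponding pixel in the image to be detected, corrected by
        # the first offset to obtain the target center point
        cx = (x + dx) * stride
        cy = (y + dy) * stride
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes
```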
In some embodiments, a point of the first detection information whose corresponding probability value is greater than a preset probability threshold may be determined as a first center point. In some embodiments, the first center points of one or more objects to be detected may be determined. In some embodiments, since the scales of all the detection information may be the first set scale, the corresponding point of the first central point in the second detection information may be the point whose relative position in the second detection information is the same as the relative position of the first central point in the first detection information; the same applies to the corresponding point of the first center point in the third detection information.
The first central point of the target to be detected is determined according to the first detection information; this embodiment does not limit the specific form of the first detection information or the determination method of the first central point, as long as the first detection information represents the probability that each point in the second feature is the first central point of the target to be detected.
For example, in the first detection information, it is determined that the point on the 4 th row and the 3 rd column is the first center point, and in the second detection information, the corresponding point of the first center point may be the point on the 4 th row and the 3 rd column in the second detection information.
In the other detection information, the corresponding point of the first center point may be a point having the same relative position in the other detection information as the relative position of the first center point in the first detection information.
For each first center point, the first offset represented by its corresponding point in the second detection information is determined, so that the actually predicted target center point, that is, the position (specifically, the pixel position) of the predicted target center point in the image to be detected, can be determined. Further, the detection frame size characterized by the corresponding point in the third detection information can be determined, for example, 100 pixels wide and 200 pixels high.
And then the target detection frame of the target to be detected can be determined in the image to be detected by utilizing the target central point and the corresponding detection frame size.
In some embodiments, after the target detection box is determined, a non-maxima suppression algorithm may be used for screening.
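A minimal sketch of the greedy non-maximum suppression screening mentioned above; the IoU threshold is an assumed parameter.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring box,
    drop lower-scoring boxes that overlap it too much, and repeat.
    Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep
```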
This method does not limit the specific form of the target to be detected, which can be a human body or an object.
In some embodiments, other information may also be detected in the plurality of detection branches, such as attributes or key points of the target. The embodiment can synchronously detect other information of the target, and improves the detection efficiency and the detection effect.
In some embodiments, in a case where the object to be detected is a human body, the plurality of detection branches may include a detection branch that outputs fourth detection information; the fourth detection information may be used to indicate a second offset between the first center point of the human body and the key point of the human body. Correspondingly, the position information of the human body key point corresponding to the detection frame of the human body can be determined according to the second offset indicated by the corresponding point of the first center point in the fourth detection information.
Therefore, in some embodiments, in the case that the object to be detected is a human body, the detection information may further include: and fourth detection information indicating a second offset between the first center point of the human body and the key point of the human body.
Wherein, the human body key points may include: key points such as human faces, joints, limbs and the like.
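A possible decoding of the second offsets into key-point positions is sketched below; the `(H, W, K, 2)` layout of the fourth detection information and the `stride` mapping back to image pixels are assumptions.

```python
import numpy as np

def decode_keypoints(center_yx, keypoint_offsets, stride):
    """Recover human body key-point positions from the second offsets
    stored at the first center point's corresponding point.

    center_yx:        (row, col) of the first center point in the grid
    keypoint_offsets: (H, W, K, 2) second offsets (dx, dy) from the
                      center to each of K key points (assumed layout)
    Returns a list of (x, y) key-point positions in image coordinates.
    """
    y, x = center_yx
    pts = []
    for dx, dy in keypoint_offsets[y, x]:
        pts.append(((x + dx) * stride, (y + dy) * stride))
    return pts
```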
In some embodiments, in a case where the object to be detected is a human body, the plurality of detection branches may include a detection branch that outputs fifth detection information; the fifth detection information may be used to indicate human body attribute information of the human body.
Correspondingly, the human body attribute feature corresponding to the detection frame of the human body can be determined according to the human body attribute information indicated by the corresponding point of the first central point in the fifth detection information.
Under the condition that the target to be detected is a human body, the detection information further comprises: fifth detection information indicating human body attribute information of the human body.
The human body attribute characteristics can include age, gender, whether glasses are worn, whether a hat is worn, and the like. And the human body attribute information may include an age probability distribution, a gender probability distribution, a probability of wearing glasses, a probability of wearing a hat, etc.
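The mapping from human body attribute information (probability distributions) to attribute features could look like the following sketch; the slicing layout of the attribute vector and the 0.5 decision thresholds are purely illustrative assumptions.

```python
import numpy as np

def decode_attributes(attr_vector):
    """Turn the raw human body attribute information read at the first
    center point's corresponding point into attribute features. The
    assumed layout is: entries 0-99 an age probability distribution,
    entry 100 a gender probability, 101 a glasses probability, and
    102 a hat probability."""
    age_dist = attr_vector[:100]
    return {
        "age": int(np.argmax(age_dist)),
        # interpreting entry 100 as P(female) is an assumption
        "gender": "female" if attr_vector[100] > 0.5 else "male",
        "glasses": bool(attr_vector[101] > 0.5),
        "hat": bool(attr_vector[102] > 0.5),
    }
```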
In the above embodiment, the content of the target detection result can be enriched by detecting the position information of the human body key points related to the human body or the attribute characteristics of the human body, so that the output target detection frame can be accompanied by various detection information.
For ease of understanding, a specific example is provided below.
In an actual application scenario, a target detection network often has trade-off problems in the aspects of precision, speed, power consumption and the like.
In general, in practical application scenarios, a smaller target detection network is required to increase the detection speed, and a sufficiently high detection accuracy is also required, but both aspects are often not satisfied at the same time.
For example, in a monitoring scene, pedestrians often need to be detected in real time, each video frame shot in real time needs to be detected, and if the detection speed is slow, real-time pedestrian detection often cannot be realized; if the detection precision is lower, the monitoring security effect is poorer.
In addition, because the detection accuracy is often insufficient, in an actual application scene, the detection effect is often poor for a fuzzy target or a target with a small size in the image to be detected.
For example, a person running in the monitored scene appears blurred and is often difficult to detect accurately. Some pedestrians are far away from the shooting device in the monitoring scene; their size in a video frame shot in real time is small and blurring is generally present, so accurate detection is often difficult.
In the related art, a target detection network takes a three-channel sRGB image as input. However, for human vision, a human body in a scene can be clearly distinguished from simple texture and grayscale information alone. On this basis, the present embodiment provides a pedestrian detection method, in which the image to be detected can be an RGB three-channel image.
The method may comprise the following steps.
Firstly, a Y-channel image of an image to be detected is obtained, wherein the image to be detected is an RGB image.
Specifically, channel separation is performed on the RGB image first, and then the Y-channel image is obtained according to the color gamut conversion formula shown in formula (1).
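The channel separation and color gamut conversion can be sketched as follows. Formula (1) itself is not reproduced in this excerpt, so the BT.601 luma weights below are used as an assumed stand-in.

```python
import numpy as np

def y_channel(rgb):
    """Channel-separate an RGB image (H, W, 3) and apply a luma
    conversion to obtain the Y-channel image. The BT.601 weights are
    a common choice and serve as an assumption for formula (1)."""
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    b = rgb[..., 2].astype(np.float64)
    return 0.299 * r + 0.587 * g + 0.114 * b
```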
Next, the resolution of the Y-channel image is adjusted to a first resolution, wherein the first resolution is higher than the resolution of the RGB image.
Then, the adjusted Y-channel image is input to a target detection network, where the target detection network includes a backbone neural network, a pyramid neural network, and multiple detection branches, as shown in fig. 2.
The backbone neural network is used for acquiring first features of multiple scales, such as first features of four scales, of the adjusted Y-channel image;
the pyramid neural network is used for bringing the first features of multiple scales output by the backbone neural network to the same scale through bilinear interpolation and summing them, and then performing feature extraction through a convolution filter to obtain the second feature of the first set scale;
and each of the plurality of detection branches takes as input the second feature output by the pyramid neural network and outputs the detection information corresponding to its branch task.
And then, the target detection result is determined according to the detection information output by the plurality of detection branches, including the predicted human body detection frame, attributes and human body key point information on the image to be detected.
And finally, non-maximum suppression is performed on the target detection result, overlapping prediction frames with lower confidence are filtered out, and the final detection result of the image to be detected is output.
In the embodiment of the disclosure, by extracting the Y-channel image in the image to be detected and increasing the resolution of the Y-channel image to perform target detection, input redundant information can be reduced, the resolution of the input image can be increased under the same network size so as to improve the detection performance, that is, the target detection of a small-scale and fuzzy target can be optimized under the condition that the network size is limited; and various attributes of pedestrians can be predicted by utilizing deep learning under the limited network size, so that the power consumption and the whole algorithm running time can be saved.
Corresponding to the method embodiment, the embodiment of the disclosure also provides a device embodiment.
Fig. 3 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present disclosure.
The apparatus may include the following modules.
The obtaining module 301 is configured to obtain a first channel image of an image to be detected.
The adjusting module 302 is configured to adjust a resolution of the first channel image to a first resolution, where the first resolution is higher than a resolution of the image to be detected.
And the detection module 303 is configured to detect the adjusted first channel image to obtain a target detection result.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
The embodiments of the present disclosure also provide a computer device, which at least includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any of the foregoing embodiments when executing the program.
Fig. 4 is a schematic diagram illustrating a more specific hardware structure of a computing device according to an embodiment of the present disclosure, where the computing device may include: a processor 401, a memory 402, an input/output interface 403, a communication interface 404, and a bus 405. Wherein the processor 401, the memory 402, the input/output interface 403 and the communication interface 404 are communicatively connected to each other within the device by a bus 405.
The processor 401 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present disclosure. The processor 401 may further include a graphics card, such as an NVIDIA Titan X or 1080 Ti graphics card.
The Memory 402 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random access Memory), a static storage device, a dynamic storage device, or the like. The memory 402 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present disclosure is implemented by software or firmware, the relevant program codes are stored in the memory 402 and called to be executed by the processor 401.
The input/output interface 403 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 404 is used to connect a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
The bus 405 includes a path that transfers information between the various components of the device, such as the processor 401, memory 402, input/output interface 403, and communication interface 404.
It should be noted that although the above-mentioned device only shows the processor 401, the memory 402, the input/output interface 403, the communication interface 404 and the bus 405, in a specific implementation, the device may also include other components necessary for normal operation. Moreover, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present disclosure, and need not include all of the components shown in the figures.
The embodiments of the present disclosure also provide a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the method of any of the foregoing embodiments.
Computer-readable media, including both volatile and non-volatile, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
From the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present disclosure can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, or the like, and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments or some parts of the embodiments of the present disclosure.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, it is relatively simple to describe, and reference may be made to some descriptions of the method embodiment for relevant points. The above-described apparatus embodiments are merely illustrative, and the modules described as separate components may or may not be physically separate, and the functions of the modules may be implemented in one or more software and/or hardware when implementing the embodiments of the present disclosure. And part or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The foregoing is merely a detailed description of the embodiments of the disclosure. It should be noted that modifications and refinements can be made by those skilled in the art without departing from the principles of the embodiments of the disclosure, and such modifications and refinements should also be regarded as falling within the scope of protection of the embodiments of the disclosure.

Claims (14)

1. A method of object detection, the method comprising:
acquiring a first channel image of an image to be detected;
adjusting the resolution of the first channel image to a first resolution, wherein the first resolution is higher than the resolution of the image to be detected;
and detecting the adjusted first channel image to obtain a target detection result.
2. The method according to claim 1, wherein the first channel image contains fewer channels than the image to be detected.
3. The method according to claim 1, wherein the first channel image comprises a luminance channel image, and the image to be detected comprises an RGB image or an HSV image; the acquiring of the first channel image of the image to be detected comprises:
converting the image to be detected into a YUV image;
and acquiring a brightness channel image in the YUV image.
4. The method of claim 1, wherein the first channel image comprises a luminance channel image, and the image to be detected comprises an RGB image; the first channel image of the image to be detected is obtained, and the method comprises the following steps:
carrying out channel separation on the image to be detected to obtain a channel separation result, wherein the channel separation result comprises pixel values of an R channel, a G channel and a B channel;
and performing color gamut conversion on the channel separation result to obtain the brightness channel image.
5. The method according to any one of claims 1 to 4, wherein the detecting the adjusted first channel image to obtain a target detection result comprises:
processing the adjusted first channel image through at least two detection branches of a target detection network to obtain detection information which is output by each detection branch of the at least two detection branches and corresponds to the branch task;
and determining a target detection result according to the detection information output by the at least two detection branches.
6. The method of claim 5, further comprising, prior to processing the adjusted first channel image through at least two detection branches of the object detection network:
acquiring first characteristics of the adjusted first channel image in multiple scales;
obtaining a second feature of a first set scale according to the first features of the multiple scales;
the processing the adjusted first channel image through at least two detection branches of the target detection network includes:
and processing the second feature with the first set scale through at least two detection branches of the target detection network.
7. The method of claim 6, wherein deriving the second feature of the first set scale from the first feature of the plurality of scales comprises:
respectively carrying out interpolation processing on the first characteristic of each scale in the multiple scales to obtain an interpolation result of the same scale;
and performing feature extraction on the sum of the interpolation results corresponding to each scale to obtain second features of the first set scale.
8. The method according to any one of claims 5 to 7, wherein the detection information includes at least two of:
first detection information for indicating a probability that each point in the second feature is a first center point of a target to be detected;
second detection information for indicating a first offset between a first central point of the target to be detected and a real central point;
and third detection information used for indicating the size of the detection frame of the target to be detected.
9. The method of claim 8, wherein determining the target detection result according to the detection information output by the at least two detection branches comprises:
determining the first center point of the target to be detected according to the first detection information;
determining a target center point of the detection frame of the target to be detected according to the first offset indicated by the point corresponding to the first center point in the second detection information;
determining the size of the detection frame of the target to be detected according to the detection frame size indicated by the point corresponding to the first center point in the third detection information;
and determining the detection frame of the target to be detected in the image to be detected according to the determined target center point and the determined size of the detection frame.
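The decoding described in claims 8 and 9 can be sketched as follows. This is a minimal, center-point-style illustration under assumed tensor layouts (heatmap as (H, W), offset and size as (2, H, W)); the function name, threshold, and layouts are illustrative, not taken from the patent.

```python
import numpy as np

def decode_boxes(heatmap, offset, size, score_thresh=0.5):
    # heatmap: (H, W) first detection information, probability that each
    #          point is a first center point of a target to be detected
    # offset:  (2, H, W) second detection information, first offset (dx, dy)
    #          from the first center point toward the real center point
    # size:    (2, H, W) third detection information, detection frame (w, h)
    boxes = []
    for y, x in zip(*np.where(heatmap > score_thresh)):
        cx = x + offset[0, y, x]   # target center point = first center point
        cy = y + offset[1, y, x]   # refined by the first offset
        w, h = size[0, y, x], size[1, y, x]
        boxes.append((cx - w / 2, cy - h / 2,
                      cx + w / 2, cy + h / 2, heatmap[y, x]))
    return boxes
```

Each branch contributes one piece of information: the heatmap selects candidate center points, the offset branch refines their positions, and the size branch turns each refined center into a detection frame.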
10. The method according to claim 8, wherein, in a case where the target to be detected is a human body, the detection information further includes: fourth detection information indicating a second offset between the first center point of the human body and a human body key point;
and the determining the target detection result according to the detection information output by the at least two detection branches includes:
determining position information of the human body key point corresponding to the detection frame of the human body according to the second offset indicated by the point corresponding to the first center point in the fourth detection information.
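The keypoint recovery of claim 10 follows the same pattern: each key point is the first center point plus its second offset read at the corresponding position. A hedged sketch, assuming the fourth detection information is a (2K, H, W) tensor with channels 2k and 2k+1 holding key point k's offset (a layout chosen for illustration, not stated in the patent):

```python
import numpy as np

def decode_keypoints(center, kp_offset):
    # center: (cx, cy) first center point of the human body
    # kp_offset: (2K, H, W) fourth detection information; channels 2k, 2k+1
    # hold the second offset (dx, dy) from the center to key point k
    cx, cy = center
    x, y = int(round(cx)), int(round(cy))
    num_kp = kp_offset.shape[0] // 2
    return [(cx + kp_offset[2 * k, y, x], cy + kp_offset[2 * k + 1, y, x])
            for k in range(num_kp)]
```

Because the offsets are indexed at the point corresponding to the first center point, keypoints are associated with their detection frame for free, with no separate grouping step.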
11. The method according to claim 8, wherein, in a case where the target to be detected is a human body, the detection information further includes: fifth detection information indicating human body attribute information of the human body;
the determining the target detection result according to the detection information output by the at least two detection branches includes:
determining a human body attribute feature corresponding to the detection frame of the human body according to the human body attribute information indicated by the point corresponding to the first center point in the fifth detection information.
12. A target detection apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a first channel image of an image to be detected;
an adjusting module, configured to adjust a resolution of the first channel image to a first resolution, wherein the first resolution is higher than the resolution of the image to be detected;
and a detection module, configured to detect the adjusted first channel image to obtain a target detection result.
13. An electronic device, characterized in that the device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the executable instructions stored in the memory to implement the target detection method of any one of claims 1 to 11.
14. A computer-readable storage medium, characterized in that the storage medium stores a computer program for implementing the target detection method of any one of claims 1 to 11.
CN202210557427.6A 2022-05-20 2022-05-20 Target detection method and device, electronic equipment and storage medium Pending CN114882436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210557427.6A CN114882436A (en) 2022-05-20 2022-05-20 Target detection method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114882436A true CN114882436A (en) 2022-08-09

Family

ID=82676989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210557427.6A Pending CN114882436A (en) 2022-05-20 2022-05-20 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114882436A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination