CN108229497B

CN108229497B - Image processing method, image processing apparatus, storage medium, computer program, and electronic device

Info

Publication number: CN108229497B
Application number: CN201710632941.0A
Authority: CN
Inventors: 杨巍; 欧阳万里; 李爽; 李鸿升; 王晓刚
Original assignee: Beijing Sensetime Technology Development Co Ltd
Current assignee: Beijing Sensetime Technology Development Co Ltd
Priority date: 2017-07-28
Filing date: 2017-07-28
Publication date: 2021-01-05
Anticipated expiration: 2037-07-28
Also published as: CN108229497A; WO2019020075A1

Abstract

An embodiment of the invention provides an image processing method, an image processing device, a storage medium, a computer program and an electronic device, wherein the image processing method comprises the following steps: acquiring a characteristic diagram of an image to be detected; extracting features of the feature map based on at least two different scales through a neural network to obtain at least two other feature maps; and combining the characteristic diagram and each other characteristic diagram to obtain a first characteristic diagram of the image to be detected. By adopting the technical scheme of the embodiment of the invention, the neural network can be utilized to learn and extract the features with different scales, and the accuracy and the robustness of feature extraction are improved.

Description

Image processing method, image processing apparatus, storage medium, computer program, and electronic device

Technical Field

The embodiment of the invention relates to the technical field of computer vision, in particular to an image processing method, an image processing device, a storage medium, a computer program and electronic equipment.

Background

Human body posture estimation mainly positions the positions of all parts of a human body in a given image or video, is an important research subject in the field of computer vision, and is mainly applied to aspects of action recognition, behavior recognition, clothing analysis, task comparison, human-computer interaction and the like.

At present, the human body posture estimation method depends on the characteristics detected by an object detector, and the existing object detector is generally obtained by training on a fixed scale.

Disclosure of Invention

The embodiment of the invention provides an image processing scheme.

According to a first aspect of embodiments of the present invention, there is provided an image processing method including: acquiring a characteristic diagram of an image to be detected; extracting features of the feature map based on at least two different scales through a neural network to obtain at least two other feature maps; and combining the characteristic diagram and each other characteristic diagram to obtain a first characteristic diagram of the image to be detected.

Optionally, the method further comprises: and carrying out key point detection on the target object in the image to be detected according to the first characteristic diagram.

Optionally, the performing, according to the first feature map, keypoint detection on the target object includes: respectively acquiring a score map of at least one key point of the target object according to the first feature map; and determining the position of the corresponding key point of the target object according to the scores of the pixel points included in each score map.

Optionally, the neural network comprises at least one feature pyramid sub-network, the feature pyramid sub-network comprises a first branch network and at least one second branch network respectively connected in parallel with the first branch network; the other feature map comprises a second feature map or a third feature map; the first branch network performs feature extraction on the feature map based on the original scale of the feature map to obtain a second feature map; and each second branch network respectively extracts the features of the feature map based on other scales different from the original scale to obtain the third feature map.

Optionally, the first branch network comprises a second convolutional layer, a third convolutional layer and a fourth convolutional layer; the second convolutional layer reduces the dimension of the feature map; the third convolution layer performs convolution processing on the feature map with the reduced dimensionality based on the original scale of the feature map; and the fourth convolution layer promotes the dimension of the feature map subjected to convolution processing to obtain the second feature map.

Optionally, at least one of the second branch networks comprises a fifth convolutional layer, a downsampling layer, a sixth convolutional layer, an upsampling layer, and a seventh convolutional layer; the fifth convolutional layer reduces the dimension of the feature map; the down-sampling layer down-samples the feature map with the reduced dimensionality according to a set down-sampling proportion, wherein the scale of the feature map after down-sampling is smaller than the original scale of the feature map; the sixth convolution layer performs convolution processing on the downsampled feature map; the upsampling layer upsamples the convolved feature map according to a set upsampling proportion, wherein the scale of the upsampled feature map is equal to the original scale of the feature map; and the seventh convolutional layer promotes the dimension of the feature map after the upsampling to obtain the third feature map.

Optionally, there are a plurality of said second branch networks; the set down-sampling ratios of at least two of the second branch networks are different, and/or the set down-sampling ratios of at least two of the second branch networks are the same.

Optionally, there are a plurality of said second branch networks; the sixth convolutional layers of at least two of the second branch networks share parameters.

Optionally, the second branch network comprises a fifth convolutional layer, an expansion convolutional layer and a seventh convolutional layer; the fifth convolutional layer reduces the dimension of the feature map; and the expansion convolutional layer performs expansion convolution processing on the feature map with the dimensionality reduced, and the seventh convolutional layer promotes the dimensionality of the feature map after the expansion convolution to obtain the third feature map.

Optionally, there are a plurality of said second branch networks; at least two of said second branch networks sharing said fifth convolutional layer and/or said seventh convolutional layer; and/or at least two of said second branch networks have respective said fifth convolutional layers and/or said seventh convolutional layers.

Optionally, the feature pyramid sub-network further comprises a first output merging layer; the first output merging layer merges respective outputs of at least two of the second branch networks sharing the seventh convolutional layer before the seventh convolutional layer, and outputs a merged result to the shared seventh convolutional layer.

Optionally, the neural network comprises at least two feature pyramid sub-networks; and each feature pyramid sub-network takes a first feature map output by a previous feature pyramid sub-network connected with the current feature pyramid sub-network as input, and extracts the first feature map of the current feature pyramid sub-network based on different scales according to the input first feature map.

Optionally, the neural network is an HOURGLASS neural network comprising at least one HOURGLASS module comprising at least one of the feature pyramid sub-networks.

Optionally, the initialized network parameter of at least one network layer of the neural network is obtained from network parameter distribution determined according to a mean and a variance of the initialized network parameter, and the mean of the initialized network parameter is zero.

Optionally, if there is a situation including at least two identical mapping additions in the neural network, an output adjustment module is disposed in at least one identical mapping branch that needs to be added, and the output adjustment module adjusts the first characteristic diagram output by the identical mapping branch.

According to a second aspect of the embodiments of the present invention, there is provided an image processing apparatus including: the acquisition module is used for acquiring a characteristic diagram of an image to be detected; the extraction module is used for extracting the features of the feature map based on at least two different scales through a neural network to obtain at least two other feature maps; and the merging module is used for merging the characteristic diagram and each other characteristic diagram to obtain a first characteristic diagram of the image to be detected.

Optionally, the method further comprises: and the detection module is used for detecting key points of the target object in the image to be detected according to the first characteristic diagram.

Optionally, the detection module includes: the scoring unit is used for respectively acquiring a scoring graph of at least one key point of the target object according to the first feature graph; and the determining unit is used for determining the positions of corresponding key points of the target object according to the scores of the pixel points included in each score map.

Optionally, the neural network comprises at least one feature pyramid sub-network, the feature pyramid sub-network comprises a first branch network and at least one second branch network respectively connected in parallel with the first branch network; the other feature map comprises a second feature map or a third feature map; the first branch network is used for extracting features of the feature map based on the original scale of the feature map to obtain a second feature map; and each second branch network is used for respectively extracting the features of the feature map based on other scales different from the original scale to obtain the third feature map.

Optionally, the first branch network comprises a second convolutional layer, a third convolutional layer and a fourth convolutional layer; the second convolution layer is used for reducing the dimension of the feature map; the third convolution layer is used for performing convolution processing on the feature map with the reduced dimensionality based on the original scale of the feature map; and the fourth convolution layer is used for improving the dimension of the feature map subjected to convolution processing to obtain the second feature map.

Optionally, at least one of the second branch networks comprises a fifth convolutional layer, a downsampling layer, a sixth convolutional layer, an upsampling layer, and a seventh convolutional layer; the fifth convolution layer is used for reducing the dimension of the feature map; the down-sampling layer is used for down-sampling the feature map with reduced dimensionality according to a set down-sampling proportion, wherein the scale of the down-sampled feature map is smaller than the original scale of the feature map; the sixth convolutional layer is used for carrying out convolution processing on the down-sampled feature map; the upsampling layer is used for upsampling the convolved feature map according to a set upsampling proportion, wherein the scale of the upsampled feature map is equal to the original scale of the feature map; and the seventh convolutional layer is used for improving the dimension of the feature map subjected to the upsampling to obtain the third feature map.

Optionally, the second branch network comprises a fifth convolutional layer, an expansion convolutional layer and a seventh convolutional layer; the fifth convolution layer is used for reducing the dimension of the feature map; the expansion convolution layer is used for performing expansion convolution processing on the feature map with the dimensionality reduced; and the seventh convolution layer is used for improving the dimension of the feature map after expansion convolution to obtain the third feature map.

Optionally, the feature pyramid sub-network further comprises a first output merging layer; the first output merging layer is configured to merge respective outputs of at least two of the second branch networks sharing the seventh convolutional layer before the seventh convolutional layer, and output a merging result to the shared seventh convolutional layer.

Optionally, the neural network comprises at least two feature pyramid sub-networks; and each feature pyramid sub-network is used for taking a first feature map output by a previous feature pyramid sub-network connected with the current feature pyramid sub-network as input, and extracting the first feature map of the current feature pyramid sub-network based on different scales according to the input first feature map.

Optionally, if there is a situation including at least two identical mapping additions in the neural network, an output adjustment module is disposed in at least one identical mapping branch that needs to be added, and the output adjustment module is configured to adjust the first characteristic diagram output by the identical mapping branch.

According to a third aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer program instructions, wherein the program instructions, when executed by a processor, implement the steps of any of the preceding image processing methods.

According to a fourth aspect of embodiments of the present invention, there is provided an electronic apparatus, including: the system comprises a processor, a memory, a communication element and a communication bus, wherein the processor, the memory and the communication element are communicated with each other through the communication bus; the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the image processing method in any one of the preceding claims.

According to a fifth aspect of embodiments of the present invention, there is provided a computer program comprising: at least one executable instruction, when being processed by the processor, is used for realizing the corresponding operation of any one of the image processing methods.

According to the image processing scheme provided by the embodiment of the invention, after the characteristic diagram of the image to be detected is obtained, the characteristic diagram is subjected to characteristic extraction through the neural network based on a plurality of different scales to obtain a plurality of other characteristic diagrams, the characteristic diagram and the other characteristic diagrams are combined to obtain the first characteristic diagram of the image to be detected, and the characteristics of the neural network at different scales are learned and extracted, so that the accuracy and the robustness of the characteristic extraction of the neural network are improved.

Drawings

FIG. 1 is a flowchart illustrating steps of an image processing method according to a first embodiment of the present invention;

FIG. 2 is a flowchart illustrating steps of an image processing method according to a second embodiment of the present invention;

FIG. 3 is a first structural diagram of a feature pyramid sub-network according to a second embodiment of the present invention;

FIG. 4 is a second structural diagram of a feature pyramid sub-network according to a second embodiment of the present invention;

FIG. 5 is a schematic diagram of a third structure of a feature pyramid sub-network according to a second embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a neural network for image processing according to a second embodiment of the present invention;

fig. 7 is a schematic structural diagram of a HOURGLASSs network according to a second embodiment of the present invention;

FIG. 8 is a score plot of the output of the image processing method according to the second embodiment of the present invention;

FIG. 9 is a schematic diagram of an identity mapping addition according to the second embodiment of the present invention;

fig. 10 is a block diagram of an image processing apparatus according to a third embodiment of the present invention;

fig. 11 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the invention is provided in conjunction with the accompanying drawings (like numerals indicate like elements throughout the several views) and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

It will be understood by those of skill in the art that the terms "first," "second," and the like in the embodiments of the present invention are used merely to distinguish one element, step, device, module, or the like from another element, and do not denote any particular technical or logical order therebetween.

Example one

Referring to fig. 1, a flowchart illustrating steps of an image processing method according to a first embodiment of the present invention is shown.

The image processing method of the embodiment includes the steps of:

step S102: and acquiring a characteristic map of the image to be detected.

In this embodiment, any image analysis processing method may be adopted to perform feature extraction processing on the image to be detected, so as to obtain a feature map of the image to be detected. Optionally, a Feature extraction operation is performed on the image to be detected through, for example, a convolutional neural network, and a Feature Map (Feature Map) including Feature information of the image to be detected is acquired. The image to be detected may be an independent still image or an image of any frame in a video sequence.

It is noted here that the acquired feature map may be a global feature map of the image to be detected, or may be a non-global feature map, which is not limited in this embodiment. For example, in practical applications, a global feature map of an image to be detected or a local feature map including a target object may be acquired according to different application scenarios where the acquired feature maps are used for image processing, object recognition, or the like.

Step S104: and performing feature extraction on the feature map based on at least two different scales through a neural network to obtain at least two other feature maps.

The at least two other feature maps are feature maps of the image to be detected by the neural network, and are obtained by further feature extraction operation based on at least two different scales, wherein each scale corresponds to one other feature map.

The scale on which the neural network performs the feature extraction operation can be defined. In the embodiment of the invention, the neural network extracts the characteristics of the image to be detected based on different scales, and the characteristics of the image to be detected can be stably and accurately extracted by learning and extracting the characteristics of different scales through the neural network. The embodiment of the invention can effectively solve the problem of transmission change of the characteristic scale of the image to be detected caused by the problems of shielding, perspective and the like, thereby improving the robustness of characteristic extraction.

In practical applications, the dimensions on which the feature extraction is based may be, but are not limited to, physical sizes of images that are different, or sizes of effective portions of images that are different (for example, although the physical sizes of the images are the same, pixel values of some pixels of the images have been processed in a manner of, but not limited to, zeroing, and the like, and a portion made up of other pixels except the processed pixels corresponds to the effective portion, and the size of the effective portion is smaller than the physical size of the images).

Optionally, the at least two different scales may include an original scale and at least one scale different from the original scale of the image to be detected, or at least two different scales different from the original scale.

Step S106: and combining the characteristic diagram and other characteristic diagrams to obtain a first characteristic diagram of the image to be detected.

And combining the feature map and each other feature map to obtain a first feature map, so that the first feature map comprises the extracted features with different scales. The first feature map obtained by combination can be used for carrying out subsequent image processing on the image to be detected, such as key point detection, object identification, image segmentation, object clustering and the like, and the subsequent image processing effect can be improved.

According to the image processing method provided by the embodiment of the invention, after the characteristic diagram of the image to be detected is obtained, the characteristic diagram is subjected to characteristic extraction through the neural network based on a plurality of different scales to obtain a plurality of other characteristic diagrams, the characteristic diagram and the other characteristic diagrams are combined to obtain the first characteristic diagram of the image to be detected, and the characteristics of the neural network at different scales are learned and extracted, so that the accuracy and the robustness of the characteristic extraction of the neural network are improved.

Any of the image processing methods provided by embodiments of the present invention may be performed by any suitable device having data processing capabilities, including but not limited to: terminal equipment, a server and the like. Alternatively, any image processing method provided by the embodiments of the present invention may be executed by a processor, for example, the processor may execute any image processing method mentioned by the embodiments of the present invention by calling a corresponding instruction stored in a memory. And will not be described in detail below.

Example two

Referring to fig. 2, a flowchart illustrating steps of an image processing method according to a second embodiment of the present invention is shown.

The image processing method of the embodiment includes the steps of:

step S202: and acquiring a characteristic map of the image to be detected.

In this embodiment, a feature image is obtained by performing a feature extraction operation on an image to be detected through a neural network. For example, the neural network includes a Convolution layer (Conv) for feature extraction, and performs preliminary detection and feature extraction operations on an image to be detected input into the neural network to obtain an initial feature map including the image to be detected.

Step S204: and performing feature extraction on the feature map based on at least two different scales through a neural network to obtain at least two other feature maps.

Optionally, the neural network includes at least one feature pyramid sub-network, and is configured to perform feature extraction on the feature map based on at least two different scales to obtain at least two other feature maps. The feature pyramid comprises a first branch network and at least one second branch network connected in parallel to the first branch network, respectively. The first branch network further extracts the features of the feature map of the input feature pyramid based on the original scale of the image to be detected to obtain a second feature map; and each second branch network further extracts the features of the feature map based on other scales different from the original scale to obtain a third feature map. That is, the at least two other feature maps include the second feature map and the third feature map.

In an alternative embodiment, with reference to fig. 3, the first branch network comprises a second convolutional layer (Convolutio 2, Conv 2), a third convolutional layer (Conv 3) and a fourth convolutional layer (Conv 4). The at least one second branch network comprises a fifth convolutional layer (Conv 5), a down-sampling layer, a sixth convolutional layer (Conv 6), an up-sampling layer and a seventh convolutional layer (Conv 7).

The first branch network is f₀Each second branch network is f₁To f_cWherein f is₀The original scale of the input features is preserved. Inputting characteristic graphs of the characteristic pyramid sub-network into f₀To f_c。f₀And f a second convolution layer₁To f_cThe fifth convolutional layer of (2) may each employ a 1 × 1 convolutional network for reducing the dimensionality of the input feature map. f. of₁To f_cThe down-sampling layers respectively down-sample the dimensionality-reduced feature maps output by the fifth convolution layers according to the set down-sampling ratios Ratio 1 to Ratio c to obtain feature maps with different resolutions. And the scale of the characteristic diagram after the down sampling is smaller than the original scale of the characteristic diagram. f. of₀And f a third convolution layer₁To f_cThe sixth convolutional layers can adopt a convolution network of 3 × 3, and are used for respectively convolving the feature map output by the second convolutional layer after dimensionality reduction and the downsampled feature map output by the corresponding downsampling layer, and learning and extracting features of different scales. f. of₁To f_cThe upsampling layer performs upsampling on the convolved feature maps output by the sixth convolution layers respectively based on different upsampling proportions, wherein the scale of the upsampled feature maps is equal to the original scale of the feature maps. f. of₀The fourth convolutional layer promotes the dimension of the feature map output by the third convolutional layer and subjected to convolution processing, and a second feature map is obtained. f. of₁To f_cThe seventh convolutional layer of (1) promotes the dimensionality of the up-sampled characteristic diagram output by each corresponding up-sampling layer, and respectively obtains a third characteristic diagram.

Wherein each second branch network f₁To f_cThe set down-sampling ratios of the at least two second branch networks are different and/or the set down-sampling ratios of the at least two second branch networks are the same. That is, the down-sampling ratios adopted by the second branch networks may be different, may be partially the same, or may be all the same. For these three cases, the first one is based on the original scaleThe branch networks are matched, and the characteristic pyramid sub-network can extract different characteristics based on at least two different scales.

In addition, due to f₀Preserving the original dimensions of the input features without changing the resolution of the features, therefore f₀Without employing down-sampling and up-sampling layers, in practice, f₀A down-sampling layer and an up-sampling layer with a down-sampling ratio and an up-sampling ratio of 1 may also be used.

Optionally, the sixth convolutional layer of the at least two second branch networks shares parameters. For example, the convolution kernels are shared by the sixth convolutional layers of the at least two second branch networks, that is, the convolution kernels of the at least two sixth convolutional layers have the same parameter, so that the number of parameters is reduced by adopting an internal parameter sharing mechanism, and a higher accuracy can be obtained based on the parameters obtained by data and task learning.

In another alternative embodiment, the structural form of the feature pyramid sub-network shown in fig. 4 may also be adopted, and at least one second branch network includes a fifth convolutional layer, an expansion convolutional layer, and a seventh convolutional layer; a fifth convolutional layer reduces the dimension of the feature map; the expansion convolution layer carries out expansion convolution processing on the feature map with the reduced dimensionality; and the seventh convolution layer promotes the dimension of the feature map after the expansion convolution to obtain a third feature map. That is, the down-sampling layer, the sixth convolution layer, and the up-sampling layer of at least one second branch network are replaced by expanded convolution (shown as dsTride 1 to dsTride c in the figure), the network structure inside the feature pyramid sub-network is simplified, the resolution of the input features can be increased, the expanded convolution layer is used to complete the sampling operation of features with different resolutions, the extraction operation of features with different scales, the sampling operation of features with the same resolution, and the like, so as to obtain the features with different scales. The dilation convolution process may also achieve downsampling by, for example, setting the pixel values of some pixels in the feature map to 0, and reducing the portion of the feature map having valid pixel values while keeping the physical size of the image uniform, thereby achieving the same effect of downsampling.

Optionally, at least two second branch networks share a fifth convolutional layer and/or a seventh convolutional layer; and/or at least two second branch networks have respective fifth convolutional layers and/or seventh convolutional layers.

For example, to simplify the structure of the feature pyramid sub-network, at least two second branch networks may share the same fifth convolutional layer in the structure of the feature pyramid sub-network shown in fig. 5. For example, the fifth convolutional layer is a 1 × 1 convolutional network, and the features of the input feature pyramid sub-network are subjected to dimensionality reduction, and then output to the down-sampling layers of the second branch networks sharing the fifth convolutional layer. The feature pyramid sub-network of the structure has fewer parameters and lower calculation complexity.

Optionally, the feature pyramid sub-network further includes a first output merging layer, where the first output merging layer merges respective outputs of at least two second branch networks sharing the seventh convolutional layer before the seventh convolutional layer, and outputs a merging result to the shared seventh convolutional layer.

For example, the first output merging layer is connected between the shared seventh convolutional layer upsampling layer and the seventh convolutional layer, and is configured to merge the feature maps output by the upsampling layers of the second branch networks, and output the merged feature maps to the seventh convolutional layer. Here, the merging process may include an addition operation or a concatenation operation. For example, as shown in the figures

Indicating an output addition operation, of

Can also be replaced by

To indicate output tandem operation (condensation). The addition operation may be expressed as a point-to-point addition of the tensors, and the concatenation operation may be expressed as a concatenation of the tensors in one dimension. If c second branch networks f₁To f_cC 256 × 64 × 64 feature maps are output,the feature map of 256 × 64 × 64 is obtained after the addition operation, and the feature map of (256 × c) × 64 × 64 is obtained after the series operation.

The seventh convolutional layer is also used to linearly transform the characteristics of each second branch network output to add the characteristics of the original scale of the first branch network output. The seventh convolution layer is further configured to perform a map transformation process on the feature map output by the first output merging layer to map the feature map into the size of the feature map before concatenation if the merging process performed by the first output merging layer is a concatenation operation. For example, the feature map of (256 × c) × 64 × 64 described above is mapped and converted into a feature map of 256 × 64 × 64.

Step S206: and combining the characteristic diagram and other characteristic diagrams to obtain a first characteristic diagram of the image to be detected.

Optionally, the feature pyramid sub-network further comprises a second output combining layer, the outputs of the first branch network and each second branch network are connected to the second output combining layer, where the outputs of the second branch networks comprise the output of the shared seventh convolutional layer and the output of the upsampling layer of each second branch network that does not share the seventh convolutional layer. And the second output merging layer is used for merging the feature graph, the second feature graph output by the first branch network and the third feature graph output by each second branch network to obtain the first feature graph. Here, the merging processing is an addition operation.

In this embodiment, the neural network includes at least two feature pyramid sub-networks; and each characteristic pyramid sub-network takes a first characteristic graph output by a previous characteristic pyramid sub-network connected with the current characteristic pyramid sub-network as input, and extracts the first characteristic graph of the current characteristic pyramid sub-network based on different scales according to the input first characteristic graph. Wherein, the input of the first feature pyramid network is the feature map obtained in step S202, and step S204 to step S206 are executed to obtain the first feature map; and the input of the non-first characteristic pyramid sub-network is the first characteristic diagram output by the previous characteristic pyramid sub-network, and the steps S204 to S206 are executed, the characteristic extraction is carried out on the input first characteristic diagram based on at least two different scales, and the acquired other characteristic diagrams are combined with the input first characteristic diagram to obtain the first characteristic diagram of the current characteristic pyramid sub-network.

In this embodiment, each sub-neural network includes a plurality of feature pyramid sub-networks, and the output of the previous feature pyramid sub-network may be the input of the next adjacent feature pyramid sub-network. For example, if x^(l)And W^(l)Representing the input (feature map) and parameters of the ith feature pyramid sub-network, the output of the feature pyramid sub-network, i.e., the input of the next feature pyramid sub-network, can be represented as:

x^(l+1)＝x^(l)+p(x^(l)+W^(l)) (1)

wherein, p (x)^(l)+W^(l)) The feature extraction operation performed for a feature pyramid sub-network, and may be further expressed as:

wherein c is the number of the second branch networks,

representing the respective second branch network f_cThe feature extraction operation that is performed is,

representing a first branch network f₀The feature extraction operation that is performed is,

the processing performed by the seventh convolutional layer is shown.

In practical application, the neural network can extract features of different scales by taking the feature pyramid sub-network as a basic composition module and utilizing a feature pyramid learning mechanism.

In an alternative embodiment, the neural network may employ, but is not limited to, the HOURGLASSs (HOURGLASS) network structure shown in fig. 6 as an alternative basic network structure. The neural network structure comprises a plurality of HOURGLASSs structures connected end-to-end forming a HOURGLASS network structure, each HOURGLASS structure comprising at least one feature pyramid sub-network. The output of the former HOURGLASS structure is the input of the adjacent latter HOURGLASS structure, and through the network structure, the analysis and the learning from bottom to top and from top to bottom are enabled to run through the model all the time, so that the characteristics extracted by the neural network are enabled to be more effective and accurate, and the accuracy of the acquired first characteristic diagram is ensured. Since the HOURGLASS network uses a Residual Unit as a basic component Module, the feature pyramid sub-network of this embodiment may be a feature Pyramid Residual Module (PRM) for forming the HOURGLASS network structure. Here, the number of HOURGLASS structures and feature pyramid subnetworks may be set as appropriate according to actual needs.

In the HOURGLASS network structure shown in fig. 7, each HOURGLASS structure may be composed of a plurality of feature pyramid sub-networks, so as to learn and extract features of different scales by using the feature pyramid sub-networks, and output a first feature map. The feature pyramid sub-network may adopt any one of the structures of the feature pyramid sub-networks shown in fig. 3 to 5. Wherein, the neural network shown in fig. 7 further includes a first convolution layer (Conv1) operable to execute the aforementioned step S202 to obtain the feature map; and a Pooling layer (Pool) for continuously reducing the resolution of the feature map to obtain global features, and then performing interpolation amplification on the global features to combine with the positions of the corresponding resolutions in the feature map, i.e., performing global Pooling on the feature map to obtain the feature map of the image to be detected. The obtained feature map can be input into the feature pyramid sub-network, so that the feature pyramid sub-network can carry out deeper learning and extraction on the feature map, and further extract the first feature map based on different scales. Optionally, a feature pyramid sub-network or a convolution layer may be further disposed between the pooling layer and the feature pyramid sub-network, and is used to adjust attributes such as resolution of the feature map.

Step S208: and detecting key points of the target object in the image to be detected according to the first characteristic diagram.

Optionally, a score map of at least one key point of the target object is respectively obtained according to the first feature map; and determining the position of the corresponding key point of the target object according to the scores of the pixel points included in each score map. The first feature map of the image to be detected, which is acquired through the feature pyramid sub-network, is used for detecting and extracting the features of the image to be detected based on different scales, so that the features of the different scales can be stably and accurately detected, and on the basis, the key point detection is performed according to the first feature map, so that the accuracy of the key point detection is effectively improved.

In an alternative embodiment, for a certain key point, the position with the highest score in the score map represents the detected position of the key point. As shown in fig. 8, the output score map corresponds to each key point of the target object in the image to be detected, corresponding to the image to be detected input to the neural network. The target object in the image to be detected is a human, and includes 16 key points, such as a hand, a knee, and the like. And determining the position of the corresponding key point according to the position with the highest score in the 16 score maps, so that the positioning detection of the 16 key points can be finished.

In an actual application scenario, the image processing method provided by the embodiment of the invention can be used for, but is not limited to, human body posture estimation, video understanding analysis, behavior recognition and human-computer interaction, image segmentation, object clustering and the like.

For example, when human body posture estimation is performed, an image to be detected is input into a neural network, feature extraction is performed based on different scales by using a feature pyramid sub-network, and key point detection is performed on a target object according to the extracted features, so that human body posture estimation is performed according to the positions of the detected key points. For example, positions (e.g., coordinates) of key points corresponding to the 16 score maps shown in fig. 8 are acquired, and the human body posture can be accurately estimated from the positions of the 16 key points. The image processing method of the embodiment utilizes the feature pyramid learning mechanism to extract features, so that target objects with different scales can be detected, and the robustness of human posture estimation is ensured.

For another example, for a video sequence including a target object, the image processing method of this embodiment may be adopted, and a feature pyramid learning mechanism is used to stably extract a feature map of a video frame image, so as to accurately perform key point positioning of the target object, which is helpful for implementing video understanding analysis.

Optionally, the initialized network parameter of at least one network layer of the neural network of this embodiment is obtained from a network parameter distribution determined according to a mean and a variance of the network parameter. The network parameter distribution can be a set Gaussian distribution or a uniform distribution, the mean value and the variance of the network parameter distribution are determined by the input and output number of the parameter layer, and the initialized network parameters can be obtained by random sampling from the network parameter distribution. The parameter initialization method can train the neural network with a multi-branch network structure, is suitable for not only being proposed based on a single-branch network, but also solving the problem of training a characteristic pyramid residual module with the multi-branch network, and enables the training process of the neural network to be more stable.

For example, in the network parameter initialization process, for the neural network forward propagation process, the mean value of the network parameters is initialized to 0 to ensure that the variances of the input and output of each layer of the neural network are substantially consistent. After the variance σ of the network parameter is obtained, the initialized network parameter can be sampled from a gaussian distribution or a uniform distribution with a mean value of 0 and a variance σ as the initialized network parameter of the forward propagation process. For the neural network back propagation process, the mean value of the network parameters is initialized to 0, so that the mean value of the gradients of the network parameters is 0, and the variance of the input gradient and the output gradient of each layer of the neural network is basically consistent. After obtaining the variance σ 'of the gradient of the network parameter, the initialized network parameter can be sampled from a gaussian distribution or uniform distribution with a mean of 0 and a variance σ' of the gradient as the initialized network parameter of the back propagation process.

Optionally, if there is a situation in the neural network that includes at least two Identity Mapping (Identity Mapping) additions, an output adjustment module is disposed in at least one Identity Mapping branch that needs to be added, and the output adjustment module adjusts the first characteristic diagram output by the Identity Mapping branch.

For example, if there are at least two instances of adding identity maps in the neural network (not illustrated in fig. 9, two instances are taken as an example), a BN-ReLU-Conv (batch normalization-normalized Linear Units-Convolution) module is provided in one of the identity map branches to adjust parameters such as a range of variance of outputs of the identity map branch. Taking the case of adding two identity maps shown in fig. 9 as an example, an output adjustment module may be provided in any one of the two identity map branches.

For another example, in the neural networks mentioned in the embodiments corresponding to fig. 3 to 5, there are also cases where a plurality of identity mapping branches are added, and at least one identity mapping branch (e.g. f) can be added in the neural networks₀、f₁… … or f_c) And a BN-ReLU-Conv layer is additionally arranged, so that the output of the branch is adjusted, and the problems of overlapping of corresponding variances and the like caused by adding a plurality of identical mapping branches are avoided.

According to the image processing method provided by the embodiment of the invention, the characteristic of the image to be detected is extracted based on various different scales through the characteristic pyramid sub-network of the neural network, and a plurality of other characteristic graphs are obtained and combined with the characteristic graph to obtain the first characteristic graph of the image to be detected; on the basis, the key point detection is carried out according to the acquired first characteristic diagram, and the accuracy of the key point detection is effectively improved.

EXAMPLE III

Referring to fig. 10, a block diagram of an image processing apparatus according to a third embodiment of the present invention is shown.

The image processing apparatus of the present embodiment includes: an obtaining module 1002, configured to obtain a feature map of an image to be detected; an extraction module 1004, configured to perform feature extraction on the feature map based on at least two different scales through a neural network to obtain at least two other feature maps; a merging module 1006, configured to merge the feature map and each of the other feature maps to obtain a first feature map of the image to be detected.

Optionally, the method further comprises: and the detection module 1008 is configured to perform key point detection on the target object in the image to be detected according to the first feature map.

Optionally, the detection module 1008 includes: a scoring unit (not shown in the figure) for respectively obtaining a scoring graph of at least one key point of the target object according to the first feature graph; a determining unit (not shown in the figure) configured to determine, according to the scores of the pixel points included in each score map, the positions of the corresponding key points of the target object.

The image processing apparatus of this embodiment is used to implement the corresponding image processing method in the foregoing method embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.

The present embodiment also provides a computer-readable storage medium, on which computer program instructions are stored, wherein the program instructions, when executed by a processor, implement the steps of any one of the image processing methods provided by the embodiments of the present invention.

The present embodiment also provides a computer program, including: at least one executable instruction, which when executed by a processor, is used to implement the steps of any one of the image processing methods provided by the embodiments of the present invention.

Example four

The fourth embodiment of the present invention provides an electronic device, which may be, for example, a mobile terminal, a Personal Computer (PC), a tablet computer, a server, or the like. Referring now to fig. 11, there is shown a schematic block diagram of an electronic device 1100 suitable for use as a terminal device or server for implementing embodiments of the invention: as shown in fig. 11, the electronic device 1100 includes one or more processors, communication elements, and the like, for example: one or more Central Processing Units (CPU)1101, and/or one or more image processors (GPU)1113, etc., which may perform various suitable actions and processes in accordance with executable instructions stored in a Read Only Memory (ROM)1102 or loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. The communication element includes a communication component 1112 and/or a communication interface 1109. Among other things, the communication component 1112 may include, but is not limited to, a network card, which may include, but is not limited to, an ib (infiniband) network card, the communication interface 1109 includes a communication interface of a network interface card such as a LAN card, a modem, or the like, and the communication interface 1109 performs communication processing via a network such as the internet.

The processor may communicate with the read-only memory 1102 and/or the random access memory 1103 to execute executable instructions, and is connected to the communication component 1112 through the communication bus 1104, and communicates with other target devices through the communication component 1112, so as to complete operations corresponding to any image processing method provided by the embodiment of the present invention, for example, acquiring a feature map of an image to be detected; extracting features of the feature map based on at least two different scales through a neural network to obtain at least two other feature maps; and combining the characteristic diagram and each other characteristic diagram to obtain a first characteristic diagram of the image to be detected.

In addition, in the RAM1103, various programs and data necessary for the operation of the apparatus can also be stored. The CPU1101 or GPU1113, ROM1102, and RAM1103 are connected to each other by a communication bus 1104. The ROM1102 is an optional module in case of the RAM 1103. The RAM1103 stores or writes executable instructions into the ROM1102 at runtime, and the executable instructions cause the processor to perform operations corresponding to the above-described communication method. An input/output (I/O) interface 1105 is also connected to communication bus 1104. The communications component 1112 may be integrated or configured with multiple sub-modules (e.g., IB cards) and linked over a communications bus.

The following components are connected to the I/O interface 1105: an input portion 1106 including a keyboard, mouse, and the like; an output portion 1107 including a signal output unit such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 1108 including a hard disk and the like; and a communication interface 1109 including a network interface card such as a LAN card, a modem, or the like. A driver 1110 is also connected to the I/O interface 1105 as necessary. A removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1110 as necessary, so that a computer program read out therefrom is mounted into the storage section 1108 as necessary.

It should be noted that the architecture shown in fig. 11 is only an optional implementation manner, and in a specific practical process, the number and types of the components in fig. 11 may be selected, deleted, added, or replaced according to actual needs; in different functional component settings, separate settings or integrated settings may also be used, for example, the GPU and the CPU may be separately set or the GPU may be integrated on the CPU, the communication element may be separately set, or the GPU and the CPU may be integrated, and so on. These alternative embodiments are all within the scope of the present invention.

In particular, according to an embodiment of the present invention, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the invention include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing a method as illustrated in the flowcharts, the program code may include instructions corresponding to performing steps of a method as provided by embodiments of the invention, e.g., obtaining a feature map of an image to be detected; extracting features of the feature map based on at least two different scales through a neural network to obtain at least two other feature maps; and combining the characteristic diagram and each other characteristic diagram to obtain a first characteristic diagram of the image to be detected. In such an embodiment, the computer program may be downloaded and installed from a network through the communication element, and/or installed from the removable media 1111. Which when executed by a processor performs the above-described functions defined in the method of an embodiment of the invention.

It should be noted that, according to the implementation requirement, each component/step described in the embodiment of the present invention may be divided into more components/steps, and two or more components/steps or partial operations of the components/steps may also be combined into a new component/step to achieve the purpose of the embodiment of the present invention.

The above-described method according to an embodiment of the present invention may be implemented in hardware, firmware, or as software or computer code storable in a recording medium such as a CD ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium downloaded through a network and to be stored in a local recording medium, so that the method described herein may be stored in such software processing on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the processing methods described herein. Further, when a general-purpose computer accesses code for implementing the processes shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the processes shown herein.

Those of ordinary skill in the art will appreciate that the various illustrative elements and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.

The above embodiments are only for illustrating the embodiments of the present invention and not for limiting the embodiments of the present invention, and those skilled in the art can make various changes and modifications without departing from the spirit and scope of the embodiments of the present invention, so that all equivalent technical solutions also belong to the scope of the embodiments of the present invention, and the scope of patent protection of the embodiments of the present invention should be defined by the claims.

Claims

1. An image processing method comprising:

acquiring a characteristic diagram of an image to be detected;

extracting features of the feature map based on at least two different scales through a neural network to obtain at least two other feature maps;

combining the characteristic diagram and each other characteristic diagram to obtain a first characteristic diagram of the image to be detected,

the neural network comprises at least one characteristic pyramid sub-network, and the characteristic pyramid sub-network comprises a first branch network and a plurality of second branch networks which are respectively connected with the first branch network in parallel; the other feature map comprises a second feature map or a third feature map; the first branch network performs feature extraction on the feature map based on an original scale of the feature map to obtain the second feature map, and each second branch network performs feature extraction on the feature map based on a scale different from the original scale to obtain the third feature map;

wherein at least two of the second branch networks include a fifth convolutional layer, a downsampling layer, a sixth convolutional layer, an upsampling layer, and a seventh convolutional layer, and at least two of the second branch networks share the fifth convolutional layer and/or the seventh convolutional layer.

2. The method of claim 1, further comprising:

and carrying out key point detection on the target object in the image to be detected according to the first characteristic diagram.

3. The method of claim 2, wherein the performing keypoint detection on the target object according to the first feature map comprises:

respectively acquiring a score map of at least one key point of the target object according to the first feature map;

and determining the position of the corresponding key point of the target object according to the scores of the pixel points included in each score map.

4. The method of claim 1, wherein the first branch network includes a second convolutional layer, a third convolutional layer, and a fourth convolutional layer;

the second convolutional layer reduces the dimension of the feature map;

the third convolution layer performs convolution processing on the feature map with the reduced dimensionality based on the original scale of the feature map;

and the fourth convolution layer promotes the dimension of the feature map subjected to convolution processing to obtain the second feature map.

5. The method of claim 4, wherein the fifth convolutional layer reduces the dimension of the feature map;

the down-sampling layer down-samples the feature map with the reduced dimensionality according to a set down-sampling proportion, wherein the scale of the feature map after down-sampling is smaller than the original scale of the feature map;

the sixth convolution layer performs convolution processing on the downsampled feature map;

the upsampling layer upsamples the convolved feature map according to a set upsampling proportion, wherein the scale of the upsampled feature map is equal to the original scale of the feature map;

and the seventh convolutional layer promotes the dimension of the feature map after the upsampling to obtain the third feature map.

6. The method of claim 5, wherein there are a plurality of said second branch networks;

the set down-sampling ratios of at least two of the second branch networks are different, and/or the set down-sampling ratios of at least two of the second branch networks are the same.

7. The method of claim 5, wherein there are a plurality of said second branch networks;

the sixth convolutional layers of at least two of the second branch networks share parameters.

8. The method of any of claims 1-7, wherein the second branch network includes a fifth convolutional layer, an intumescent convolutional layer, and a seventh convolutional layer;

the fifth convolutional layer reduces the dimension of the feature map;

the expansion convolution layer performs expansion convolution processing on the feature map with the dimensionality reduced,

and the seventh convolution layer promotes the dimension of the feature map after the expansion convolution to obtain the third feature map.

9. A method according to any one of claims 5 to 7, wherein at least two of said second branch networks have respective said fifth convolutional layers and/or said seventh convolutional layers.

10. The method of claim 9, wherein the feature pyramid sub-network further comprises a first output merge layer;

the first output merging layer merges respective outputs of at least two of the second branch networks sharing the seventh convolutional layer before the seventh convolutional layer, and outputs a merged result to the shared seventh convolutional layer.

11. The method of any of claims 1 to 7, wherein the neural network comprises at least two feature pyramid sub-networks;

and each feature pyramid sub-network takes a first feature map output by a previous feature pyramid sub-network connected with the current feature pyramid sub-network as input, and extracts the first feature map of the current feature pyramid sub-network based on different scales according to the input first feature map.

12. The method of claim 11, wherein the neural network is an HOURGLASS neural network comprising at least one HOURGLASS module comprising at least one of the feature pyramid sub-networks.

13. The method according to any one of claims 1 to 7, wherein the initialization network parameters of at least one network layer of the neural network are obtained from a network parameter distribution determined from the mean and variance of the initialization network parameters, and the mean of the initialization network parameters is zero.

14. The method according to any one of claims 1 to 7, wherein if there is a situation in the neural network that includes at least two identical mapping additions, an output adjustment module is provided in at least one identical mapping branch that needs to be added, and the first characteristic diagram output by the identical mapping branch is adjusted by the output adjustment module.

15. An image processing apparatus comprising:

the acquisition module is used for acquiring a characteristic diagram of an image to be detected;

the extraction module is used for extracting the features of the feature map based on at least two different scales through a neural network to obtain at least two other feature maps;

a merging module for merging the characteristic diagram and each other characteristic diagram to obtain a first characteristic diagram of the image to be detected,

16. The apparatus of claim 15, further comprising:

and the detection module is used for detecting key points of the target object in the image to be detected according to the first characteristic diagram.

17. The apparatus of claim 16, wherein the detection module comprises:

the scoring unit is used for respectively acquiring a scoring graph of at least one key point of the target object according to the first feature graph;

and the determining unit is used for determining the positions of corresponding key points of the target object according to the scores of the pixel points included in each score map.

18. The apparatus of claim 15, wherein the first branch network comprises a second convolutional layer, a third convolutional layer, and a fourth convolutional layer;

the second convolution layer is used for reducing the dimension of the feature map;

the third convolution layer is used for performing convolution processing on the feature map with the reduced dimensionality based on the original scale of the feature map;

and the fourth convolution layer is used for improving the dimension of the feature map subjected to convolution processing to obtain the second feature map.

19. The apparatus of claim 18, wherein the fifth convolutional layer is used to reduce the dimensionality of the feature map;

the down-sampling layer is used for down-sampling the feature map with reduced dimensionality according to a set down-sampling proportion, wherein the scale of the down-sampled feature map is smaller than the original scale of the feature map;

the sixth convolutional layer is used for carrying out convolution processing on the down-sampled feature map;

the upsampling layer is used for upsampling the convolved feature map according to a set upsampling proportion, wherein the scale of the upsampled feature map is equal to the original scale of the feature map;

and the seventh convolutional layer is used for improving the dimension of the feature map subjected to the upsampling to obtain the third feature map.

20. The apparatus of claim 19, wherein there are a plurality of said second branch networks;

21. The apparatus of claim 19, wherein there are a plurality of said second branch networks;

22. The apparatus of any of claims 15 to 21, wherein the second branch network comprises a fifth convolutional layer, an expandable convolutional layer, and a seventh convolutional layer;

the fifth convolution layer is used for reducing the dimension of the feature map;

the expansion convolution layer is used for carrying out expansion convolution processing on the feature map with the reduced dimensionality,

and the seventh convolution layer is used for improving the dimension of the feature map after expansion convolution to obtain the third feature map.

23. Apparatus according to any of claims 19 to 21, wherein at least two of said second branch networks have respective said fifth convolutional layers and/or said seventh convolutional layers.

24. The apparatus of claim 23, wherein the feature pyramid sub-network further comprises a first output merge layer;

the first output merging layer is configured to merge respective outputs of at least two of the second branch networks sharing the seventh convolutional layer before the seventh convolutional layer, and output a merging result to the shared seventh convolutional layer.

25. The apparatus of any of claims 15 to 21, wherein the neural network comprises at least two feature pyramid sub-networks;

and each feature pyramid sub-network is used for taking a first feature map output by a previous feature pyramid sub-network connected with the current feature pyramid sub-network as input, and extracting the first feature map of the current feature pyramid sub-network based on different scales according to the input first feature map.

26. The apparatus of claim 25, wherein the neural network is an HOURGLASS HOURGLASSs neural network comprising at least one HOURGLASS module comprising at least one of the feature pyramid sub-networks.

27. The apparatus of any one of claims 15 to 17, wherein the initialization network parameters of at least one network layer of the neural network are obtained from a network parameter distribution determined from a mean and a variance of the initialization network parameters, and the mean of the initialization network parameters is zero.

28. The apparatus according to any one of claims 15 to 21, wherein if there is a situation in the neural network that includes at least two identical mapping additions, an output adjustment module is provided in at least one identical mapping branch that needs to be added, and the output adjustment module is configured to adjust the first characteristic map output by the identical mapping branch.

29. A computer readable storage medium having stored thereon computer program instructions, wherein the program instructions, when executed by a processor, implement the steps of the image processing method of any of claims 1 to 14.

30. An electronic device, comprising: the system comprises a processor, a memory, a communication element and a communication bus, wherein the processor, the memory and the communication element are communicated with each other through the communication bus;

the memory is used for storing at least one executable instruction, and the executable instruction causes the processor to execute the operation corresponding to the image processing method according to any one of claims 1 to 14.