CN113469017A - Image processing method and device and electronic equipment - Google Patents

Image processing method and device and electronic equipment

Info

Publication number
CN113469017A
Authority
CN
China
Prior art keywords
hand
dimensional image
image
area
branch
Prior art date
Legal status
Pending
Application number
CN202110725463.4A
Other languages
Chinese (zh)
Inventor
刘昕
刘文韬
钱晨
谢符宝
Current Assignee
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202110725463.4A
Publication of CN113469017A
Priority to PCT/CN2021/127474 (published as WO2023273071A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the invention disclose an image processing method, an image processing apparatus, and an electronic device. The method includes the following steps: acquiring multiple frames of two-dimensional images containing a target object; detecting a hand in a first two-dimensional image among the multiple frames of two-dimensional images to obtain an initial detection frame of the hand of the target object in the first two-dimensional image; determining a first area in a second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image, and obtaining, based on the pixel points in the first area of the second two-dimensional image, at least one of a detection frame of the hand, key point information of the hand, and state category information corresponding to the hand in the second two-dimensional image; the second two-dimensional image is a frame image after the first two-dimensional image.

Description

Image processing method and device and electronic equipment
Technical Field
The present invention relates to image processing technologies, and in particular, to an image processing method and apparatus, and an electronic device.
Background
In recent years, touch-based interaction has seen a large number of applications and interaction designs on mobile terminals such as mobile phones and tablet computers, greatly improving the interaction experience. Gesture-based interaction (gesture operation) is a new technical mode that manufacturers are now exploring and pursuing. At present, gesture interaction acquires data through a depth camera or an infrared camera, which is costly and has low stability.
Disclosure of Invention
In order to solve the existing technical problems, embodiments of the present invention provide an image processing method and apparatus, and an electronic device.
In order to achieve the above purpose, the technical solution of the embodiment of the present invention is realized as follows:
the embodiment of the invention provides an image processing method, which comprises the following steps:
acquiring multiple frames of two-dimensional images containing a target object;
detecting a hand in a first two-dimensional image among the multiple frames of two-dimensional images to obtain an initial detection frame of the hand of the target object in the first two-dimensional image;
determining a first area in a second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image, and obtaining, based on the pixel points in the first area of the second two-dimensional image, at least one of a detection frame of the hand, key point information of the hand, and state category information corresponding to the hand in the second two-dimensional image; the second two-dimensional image is a frame image after the first two-dimensional image.
In the foregoing solution, the obtaining, based on the pixel points in the first area of the second two-dimensional image, at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand in the second two-dimensional image includes:
cropping the second two-dimensional image according to the first area to obtain a cropped image;
and performing feature recognition on the cropped image, and determining at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand based on the recognized features.
In the above scheme, the method further comprises: performing hand detection on the cropped image based on the recognized features to obtain judgment information indicating whether the cropped image includes a hand.
In the foregoing solution, the performing feature recognition on the cropped image, and determining at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand based on the recognized features includes: performing feature recognition on the cropped image through a feature extraction part of a first network to obtain a feature image set, wherein the feature image set includes a plurality of feature images with different receptive fields;
inputting at least some of the feature images in the feature image set to the first network to perform at least one of the following:
performing hand detection on the feature images input to the first network based on a first branch of the first network to obtain the detection frame of the hand;
performing hand key point detection on the feature images input to the first network based on a second branch of the first network to obtain the key point information of the hand;
performing hand state recognition on the feature images input to the first network based on a third branch of the first network to obtain state category information corresponding to the hand state;
wherein the weighting parameters applied to the feature images input to the first network differ, at least in part, among the first branch, the second branch, and the third branch.
In the above scheme, the method further comprises: performing hand detection on the feature images input to the first network based on a fourth branch of the first network to obtain judgment information indicating whether the cropped image includes a hand; wherein the weighting parameters applied to the feature images in the fourth branch differ, at least in part, from the weighting parameters applied to the feature images in the first branch, the second branch, and the third branch.
In the above scheme, the method further comprises: in response to the judgment information indicating that the cropped image does not include a hand, re-detecting the hand in the second two-dimensional image to obtain a detection frame of the hand of the target object in the second two-dimensional image.
In the foregoing solution, the determining a first area in a second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image includes:
performing equal-amplitude enlargement on the area of the initial detection frame in the first two-dimensional image to obtain a second area;
and determining, according to the second area, a first area in the second two-dimensional image corresponding to the position range of the second area.
In the above solution, before the determining the first area in the second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image, the method further includes:
detecting a hand in a third two-dimensional image among the multiple frames of two-dimensional images, and determining the position of the hand in the third two-dimensional image; the third two-dimensional image is a frame of image before the second two-dimensional image;
and determining a movement trend of the hand based on the position of the hand in the third two-dimensional image and the position of the hand in the first two-dimensional image.
In the foregoing solution, the determining a first area in a second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image includes:
enlarging the area of the initial detection frame in the first two-dimensional image based on the movement trend of the hand to obtain a second area, wherein the enlargement amplitude of the sub-area corresponding to the movement trend, within the area of the initial detection frame in the first two-dimensional image, is larger than the enlargement amplitudes of the other sub-areas;
and determining, according to the second area, a first area in the second two-dimensional image corresponding to the position range of the second area.
An embodiment of the present invention further provides an image processing apparatus, the apparatus including: an acquisition unit, a detection unit, a determination unit, and a processing unit; wherein:
the acquisition unit is used for acquiring a plurality of frames of two-dimensional images containing a target object;
the detection unit is used for detecting a hand in a first two-dimensional image among the multiple frames of two-dimensional images to obtain an initial detection frame of the hand of the target object in the first two-dimensional image;
the determining unit is used for determining a first area in a second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image;
the processing unit is configured to obtain at least one of a detection frame of the hand, key point information of the hand, and state category information corresponding to the hand in the second two-dimensional image based on a pixel point in the first region in the second two-dimensional image; the second two-dimensional image is a frame image after the first two-dimensional image.
In some optional embodiments of the present invention, the processing unit is configured to crop the second two-dimensional image according to the first area to obtain a cropped image, perform feature recognition on the cropped image, and determine at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand based on the recognized features.
In some optional embodiments of the present invention, the processing unit is further configured to perform hand detection on the cropped image based on the recognized features to obtain judgment information indicating whether the cropped image includes a hand.
In some optional embodiments of the present invention, the processing unit is configured to perform feature recognition on the cropped image through a feature extraction part of a first network to obtain a feature image set, wherein the feature image set includes a plurality of feature images with different receptive fields; and input at least some of the feature images in the feature image set to the first network to perform at least one of the following:
performing hand detection on the feature images input to the first network based on a first branch of the first network to obtain the detection frame of the hand;
performing hand key point detection on the feature images input to the first network based on a second branch of the first network to obtain the key point information of the hand;
performing hand state recognition on the feature images input to the first network based on a third branch of the first network to obtain state category information corresponding to the hand state;
wherein the weighting parameters applied to the feature images input to the first network differ, at least in part, among the first branch, the second branch, and the third branch.
In some optional embodiments of the present invention, the processing unit is further configured to perform hand detection on the feature images input to the first network based on a fourth branch of the first network to obtain judgment information indicating whether the cropped image includes a hand; wherein the weighting parameters applied to the feature images in the fourth branch differ, at least in part, from the weighting parameters applied to the feature images in the first branch, the second branch, and the third branch.
In some optional embodiments of the present invention, the detection unit is further configured to, in response to the judgment information obtained by the processing unit indicating that the cropped image does not include a hand, re-detect the hand in the second two-dimensional image to obtain an initial detection frame of the hand of the target object in the second two-dimensional image.
In some optional embodiments of the present invention, the determining unit is configured to perform equal-amplitude enlargement on the area of the initial detection frame in the first two-dimensional image to obtain a second area; and determine, according to the second area, a first area in the second two-dimensional image corresponding to the position range of the second area.
In some optional embodiments of the present invention, the apparatus further comprises a trend detection unit, configured to detect a hand in a third two-dimensional image among the multiple frames of two-dimensional images and determine the position of the hand in the third two-dimensional image, the third two-dimensional image being a frame of image before the second two-dimensional image; and determine a movement trend of the hand based on the position of the hand in the third two-dimensional image and the position of the hand in the first two-dimensional image.
In some optional embodiments of the present invention, the determining unit is configured to enlarge the area of the initial detection frame in the first two-dimensional image based on the movement trend of the hand to obtain a second area, wherein the enlargement amplitude of the sub-area corresponding to the movement trend, within the area of the initial detection frame in the first two-dimensional image, is larger than the enlargement amplitudes of the other sub-areas; and determine, according to the second area, a first area in the second two-dimensional image corresponding to the position range of the second area.
Embodiments of the present invention also provide a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the steps of the method according to an embodiment of the present invention.
The embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor, and when the processor executes the computer program, the steps of the method according to the embodiment of the present invention are implemented.
According to the image processing method and apparatus and the electronic device provided by the embodiments of the present invention, the hand in the first two-dimensional image is detected to obtain an initial detection frame of the hand, and the first area in a subsequent image (the second two-dimensional image) is then determined based on the initial detection frame, so as to obtain at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand in the second two-dimensional image. On the one hand, no depth image acquisition component, such as a depth camera or an infrared camera, is required, which greatly reduces the implementation cost; on the other hand, the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand are obtained in a multitask manner, so the recognition information is rich, support is provided for subsequent gesture interaction functions, and the information acquisition time is shortened.
Drawings
FIG. 1 is a first flowchart illustrating an image processing method according to an embodiment of the present invention;
FIG. 2 is a second flowchart illustrating an image processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method of step 1032 in the image processing method according to an embodiment of the invention;
FIG. 4 is a diagram illustrating key points of a hand in an image processing method according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a first network in the image processing method according to the embodiment of the present invention;
FIG. 6 is a schematic diagram of a structure of an image processing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic diagram of a hardware component structure of the electronic device according to the embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two.
The embodiment of the invention provides an image processing method. FIG. 1 is a first flowchart illustrating an image processing method according to an embodiment of the present invention; as shown in fig. 1, the method includes:
step 101: acquiring multiple frames of two-dimensional images containing a target object;
step 102: detecting a hand in a first two-dimensional image among the multiple frames of two-dimensional images to obtain an initial detection frame of the hand of the target object in the first two-dimensional image;
step 103: determining a first area in a second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image, and obtaining, based on the pixel points in the first area of the second two-dimensional image, at least one of a detection frame of the hand, key point information of the hand, and state category information corresponding to the hand in the second two-dimensional image; the second two-dimensional image is a frame image after the first two-dimensional image.
The image processing method of the embodiment can be applied to an image processing device, and the image processing device can be disposed in an electronic device with a processing function, such as a personal computer, a server, and the like, wherein the electronic device can also be a display device, such as a smart television, a projector, an intelligent screen, an outdoor display, and the like, or a processor or a chip executes a computer program to implement the image processing method.
In this embodiment, the multi-frame two-dimensional image may be a continuous video captured by an image capturing device built in or externally connected to the electronic device, or may also be a received video transmitted by another electronic device, or the like. In some optional embodiments, an image capturing component (e.g., a camera) may be included in the electronic device, and a plurality of frames of two-dimensional images containing the target object are obtained through the image capturing component. In other alternative embodiments, the electronic device may include a communication component, and the communication component obtains multiple frames of two-dimensional images including the target object captured by other cameras (for example, a camera independently disposed in the image capture area, or a camera in other electronic devices). For example, taking the electronic device as a mobile phone, a front camera of the mobile phone may be used to collect multiple frames of two-dimensional images including a target object. In other alternative embodiments, the frames of two-dimensional images may be videos stored in a local or other video library.
For example, the image capture device (or image capture component) may be an ordinary camera; it need not be capable of capturing depth data, as a depth camera or an infrared camera is. Illustratively, the multiple frames of two-dimensional images may be RGB images. In the embodiments of the present application, an ordinary image acquisition device (or image acquisition component) can thus be used to obtain ordinary two-dimensional images, and the hand-related information is recognized based on those images; no depth image acquisition component, such as a depth camera or an infrared camera, is needed, which greatly reduces the implementation cost.
Note that the two-dimensional image in the present embodiment may be simply referred to as an image.
In this embodiment, the target object may specifically be a target person; the target person may specifically be a person in the image that is in the foreground; alternatively, the target person may be a specified person in the image.
In this embodiment, each of the multiple frames of two-dimensional images may be referred to as a frame image and is the minimum unit constituting a video (i.e., the images to be processed). It can be understood that the multiple frames of two-dimensional images are a group of temporally continuous frame images, ordered by the acquisition time of each frame, so that the time parameters of the frames are continuous. For example, taking the target object as an actual person, when the target object appears in the multiple frames of two-dimensional images, one or more target objects may appear throughout the time range corresponding to the multiple frames, or only in part of that time range; this embodiment does not limit this.
In this embodiment, the first two-dimensional image is any one of the multiple frames of two-dimensional images; optionally, it may be the first frame of the multiple frames. The second two-dimensional image is a frame of two-dimensional image after the first two-dimensional image. The second two-dimensional image may be the subsequent frame image temporally continuous with the first two-dimensional image: for example, if the multiple frames comprise 10 frames and the first two-dimensional image is the 2nd frame, the second two-dimensional image is the 3rd frame. Alternatively, the second two-dimensional image may be a frame that is a preset number of frames after the first two-dimensional image: for example, if the multiple frames comprise 20 frames, the first two-dimensional image is the 2nd frame, and the preset number is 3, the second two-dimensional image may be the 6th frame. The preset number may be set in advance according to the actual situation, for example according to the moving speed of the target object. This can effectively reduce the amount of data processed and thus the consumption of the image processing apparatus.
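As a minimal sketch of this frame-pairing logic, assuming a fixed preset interval (the function and variable names are ours, not the patent's):

```python
def pair_frames(frames, preset_interval=0):
    """Yield (first_image, second_image) pairs from a frame sequence.

    With preset_interval == 0 the second image is the immediately
    following frame; with preset_interval == k it is k frames later,
    trading tracking granularity for less computation.
    """
    step = preset_interval + 1
    for i in range(0, len(frames) - step, step):
        yield frames[i], frames[i + step]
```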
In this embodiment, the hand in the first two-dimensional image may be detected through the target detection network, so as to obtain an initial detection frame of the hand of the target object in the first two-dimensional image. The target detection network can be obtained through training of sample images of detection frames marked with hands, and the hands in the images can be detected, so that initial detection frames of the hands are obtained. The target detection network may adopt any network structure capable of detecting a hand of a target object, which is not limited in this embodiment.
For example, feature extraction may be performed on the first two-dimensional image through the target detection network, and the coordinates of two diagonally opposite corners of the area where the hand of the target object is located, or the coordinates of all four corners of that area, may be determined based on the extracted feature map. Taking the two diagonal coordinates as an example, these may be the coordinates of the upper-left and lower-right corners, and the initial detection frame of the hand is then obtained from them.
In some optional embodiments, the determining a first area in a second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image comprises: performing equal-amplitude enlargement on the area of the initial detection frame in the first two-dimensional image to obtain a second area; and determining, according to the second area, a first area in the second two-dimensional image corresponding to the position range of the second area.
For example, if the height of the initial detection frame is H and its width is W, the area can be expanded about its center point, with each of the four sides extended away from the center: by H/4 in the height direction and by W/4 in the width direction, thereby obtaining the second area. Of course, the degree of enlargement of the area where the initial detection frame is located is not limited to the above in this embodiment; other enlargement parameters also fall within the protection scope of this embodiment.
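A minimal sketch of this equal-amplitude enlargement, using the H/4 and W/4 margins from the example above (the corner-based box representation and the clamping to the image bounds are our assumptions):

```python
def enlarge_box_equal(box, img_w, img_h):
    """Enlarge a detection box symmetrically about its center.

    box: (x1, y1, x2, y2), with (x1, y1) the upper-left corner.
    Each side moves away from the center by a quarter of the box
    size in that dimension; the result is clamped to the image.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return (max(0, x1 - w / 4), max(0, y1 - h / 4),
            min(img_w, x2 + w / 4), min(img_h, y2 + h / 4))
```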
In further optional embodiments, before the determining the first area in the second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image, the method further comprises: detecting a hand in a third two-dimensional image among the multiple frames of two-dimensional images, and determining the position of the hand in the third two-dimensional image, the third two-dimensional image being a frame of image before the second two-dimensional image; and determining a movement trend of the hand based on the position of the hand in the third two-dimensional image and the position of the hand in the first two-dimensional image.
The determining a first area in a second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image comprises: enlarging the area of the initial detection frame in the first two-dimensional image based on the movement trend of the hand to obtain a second area, wherein the enlargement amplitude of the sub-area corresponding to the movement trend, within the area of the initial detection frame in the first two-dimensional image, is larger than the enlargement amplitudes of the other sub-areas; and determining, according to the second area, a first area in the second two-dimensional image corresponding to the position range of the second area.
In this embodiment, the hand may be moving quickly: for example, although the initial detection frame of the hand is detected in area A of the first two-dimensional image, only part of the hand, or no hand at all, may appear in area A of the second two-dimensional image. Based on this, in this embodiment the movement trend of the hand is determined from the position of the hand in the third two-dimensional image and its position in the first two-dimensional image, and the area of the initial detection frame in the first two-dimensional image is then enlarged with non-uniform amplitude according to that movement trend to obtain the second area.
For example, if the third two-dimensional image is a frame of image after the first two-dimensional image and before the second two-dimensional image, a displacement between the two positions may be determined according to a position of the hand in the third two-dimensional image and a position of the hand in the first two-dimensional image, where the direction of the displacement represents a moving direction of the hand, and the magnitude of the displacement represents a distance that the hand moves within a corresponding time length range between the third two-dimensional image and the first two-dimensional image. Further, according to the displacement, the area of the initial detection frame in the first two-dimensional image is amplified to obtain a second area.
Illustratively, the direction of the displacement determines the sub-area. For example, establish a plane coordinate system with the center point of the image as the origin, and suppose the direction of the displacement is the positive x-axis direction. Then, in enlarging the area of the initial detection frame (height H, width W) in the first two-dimensional image to obtain the second area, the four sides of the area extend away from the center point, with the extension toward the positive x-axis larger than the extension in the other directions: for example, H/4 in the height direction, W/4 in the negative x-axis direction, and W/2 in the positive x-axis direction. Of course, in this embodiment the expansion parameter for the sub-area may also be determined by the magnitude of the displacement: the faster the hand moves, the larger the expansion parameter may be; the slower it moves, the smaller. This reduces the chance that the hand falls outside the first area in the second two-dimensional image.
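The sketch below applies this trend-aware enlargement, doubling the default margin on the side facing the movement (the displacement-to-margin rule and the clamping are our assumptions; the patent leaves the exact expansion parameters open):

```python
def enlarge_box_with_trend(box, disp, img_w, img_h):
    """Enlarge a box asymmetrically along the hand's movement trend.

    box:  (x1, y1, x2, y2) detection box in the first image.
    disp: (dx, dy) displacement of the hand between the third and
          the first two-dimensional images.
    The side facing the movement direction gets twice the default
    quarter-size margin; all coordinates are clamped to the image.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    mx, my = w / 4, h / 4                    # default margins
    left  = 2 * mx if disp[0] < 0 else mx    # grow more toward motion
    right = 2 * mx if disp[0] > 0 else mx
    up    = 2 * my if disp[1] < 0 else my
    down  = 2 * my if disp[1] > 0 else my
    return (max(0, x1 - left), max(0, y1 - up),
            min(img_w, x2 + right), min(img_h, y2 + down))
```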
In this embodiment, according to the second area obtained by enlarging the area where the initial detection frame is located, the first area in the second two-dimensional image corresponding to the position range of the second area can be determined. Optionally, the size of the second area is required to satisfy a condition. Because the hand in the second two-dimensional image is obtained by tracking the hand in the first two-dimensional image, the first and second two-dimensional images are two adjacent frames or frames close in acquisition time, and the size of the same hand of the same target object is generally similar in the two images. Based on this, the size condition of the second area may be: the magnification ratio of the size of the second area to the size of the initial detection frame is not greater than a first threshold; that is, the ratio of the size of the target object's hand in the second area to the size of the second area needs to be greater than or equal to a second threshold, and correspondingly the ratio of the size of the hand in the first area to the size of the first area needs to be greater than or equal to the second threshold. In this way, the hand occupies a larger proportion of the image, more effective data is available, and the hand-related information can be better obtained simply by processing through the first network.
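A sketch of this size condition; the patent does not fix the threshold values, so max_ratio here is an assumed stand-in for the first threshold:

```python
def second_area_valid(box_size, area_size, max_ratio=4.0):
    """Check that the enlarged second area is not too large relative
    to the initial detection frame.

    box_size, area_size: (width, height) of the initial detection
    frame and of the second area, respectively.
    """
    bw, bh = box_size
    aw, ah = area_size
    return (aw * ah) / max(bw * bh, 1e-6) <= max_ratio
```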
In this embodiment, the first area in the second two-dimensional image is determined according to the area of the initial detection frame in the first two-dimensional image; then, based on the pixel points in the first area of the second two-dimensional image and a first network, at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand in the second two-dimensional image is obtained. That is, the pixel points in the first area of the second two-dimensional image are used as the input of the first network, and the processing of the first network yields at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand in the second two-dimensional image.
By adopting the technical solution of the embodiments of the present invention, the hand in the first two-dimensional image is detected to obtain an initial detection frame of the hand, and the first area in a subsequent image (the second two-dimensional image) is then determined based on the initial detection frame, so as to obtain at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand in the second two-dimensional image. On the one hand, no depth image acquisition component, such as a depth camera or an infrared camera, is required, which greatly reduces the implementation cost; on the other hand, the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand are obtained in a multitask manner, so the recognition information is rich, support is provided for subsequent gesture interaction functions, and the information acquisition time is shortened.
FIG. 2 is a second flowchart illustrating an image processing method according to an embodiment of the present invention. On the basis of the embodiment shown in FIG. 1, step 103 in this embodiment may further include:
step 1031: cropping the second two-dimensional image according to the first area to obtain a cropped image;
step 1032: performing feature recognition on the cropped image, and determining at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand based on the recognized features.
In this embodiment, since feature recognition is performed on the cropped image based on a first network, and at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand is determined based on the recognized features, the second two-dimensional image needs to be cropped according to the first area so that the resulting cropped image matches the input size of the first network. Illustratively, the size of the cropped image may be 80 x 80.
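A minimal crop-and-resize sketch for preparing the first network's input, using the 80 x 80 size mentioned above (OpenCV is our choice of library, not the patent's):

```python
import cv2

def crop_for_network(image, first_area, input_size=(80, 80)):
    """Crop the second two-dimensional image to the first area and
    resize the patch to the first network's input resolution."""
    x1, y1, x2, y2 = (int(v) for v in first_area)
    patch = image[y1:y2, x1:x2]
    return cv2.resize(patch, input_size)
```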
Alternatively, as shown in fig. 3, step 1032 may comprise:
step 10321: performing feature recognition on the cropped image through a feature extraction part of a first network to obtain a feature image set, wherein the feature image set includes a plurality of feature images with different receptive fields;
step 10322: performing hand detection on the feature images input to the first network based on a first branch of the first network to obtain the detection frame of the hand;
step 10323: performing hand key point detection on the feature images input to the first network based on a second branch of the first network to obtain the key point information of the hand;
step 10324: performing hand state recognition on the feature images input to the first network based on a third branch of the first network to obtain state category information corresponding to the hand state.
In this embodiment, the weighting parameters applied to the feature images input to the first network differ, at least in part, among the first branch, the second branch, and the third branch.
In this embodiment, the execution sequence of step 10322 to step 10324 is not limited to the above-described sequence, and step 10322 to step 10324 may be executed in parallel.
In this embodiment, on the one hand, the first network includes a feature extraction part, which performs feature extraction on the cropped image to obtain a plurality of feature images with different receptive fields (i.e., the feature image set). Illustratively, the feature extraction part of the first network has convolution kernels of several sizes, and feature extraction is performed on the cropped image with each kernel size, yielding feature images processed by convolution kernels of different sizes. The receptive field of a feature image corresponds to the size of its convolution kernel: feature images produced by small convolution kernels have smaller receptive fields and thus focus more on local features, while feature images produced by large convolution kernels have larger receptive fields and thus focus more on global features.
On the other hand, the first network has at least three branches, namely a first branch, a second branch, and a third branch. The first branch obtains the detection frame of the hand, the second branch obtains the key point information of the hand, and the third branch recognizes the state of the hand to obtain the corresponding state category information. Since each branch performs a different task, the weighting parameters applied to the feature images input to each branch differ, at least in part, among the first, second, and third branches. For example, the feature images with different receptive fields are input to each of the three branches; because the tasks differ, the features each branch emphasizes also differ, so each feature image receives at least partly different weighting parameters in the first, second, and third branches. For the first branch, detecting the detection frame of the hand emphasizes the shape and overall state of the hand, so the weighting parameters for feature images with large receptive fields may be relatively large and those for feature images with small receptive fields relatively small. For the second branch, detecting the key points of the hand emphasizes local information of the hand, so the weighting parameters for feature images with small receptive fields may be relatively large and those for feature images with large receptive fields relatively small.
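A PyTorch sketch of the branch-specific weighting of multi-receptive-field features described above (the learnable per-scale scalar weights are our assumed formulation; the patent only states that the weighting parameters differ across branches):

```python
import torch
import torch.nn as nn

class BranchFusion(nn.Module):
    """Fuse feature images of different receptive fields with
    weights learned separately for a single branch."""
    def __init__(self, num_scales):
        super().__init__()
        # one learnable weight per receptive-field scale
        self.weights = nn.Parameter(torch.ones(num_scales))

    def forward(self, feature_images):
        # feature_images: list of tensors, all shaped (N, C, H, W)
        w = torch.softmax(self.weights, dim=0)
        return sum(wi * f for wi, f in zip(w, feature_images))
```

Giving each branch its own BranchFusion instance lets, say, the first branch learn larger weights for large-receptive-field feature images while the second branch favors small ones.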
In some alternative embodiments, the key points of the hand may be as described with reference to FIG. 4, and may include at least one of: wrist keypoints, joint keypoints of the fingers, fingertip (TIP) keypoints of the fingers, and the like. The joint keypoints of the fingers include at least one of: Metacarpophalangeal Point (MCP) keypoints, Proximal Interphalangeal Point (PIP) keypoints, and Distal Interphalangeal Point (DIP) keypoints. The fingers may include at least one of: the thumb, the index finger, the middle finger, the ring finger, and the little finger. As shown in FIG. 4, the wrist keypoint may be P1; the thumb keypoints may include at least one of P2, P3, and P4; the index finger keypoints may include at least one of P5, P6, P7, and P8; the middle finger keypoints may include at least one of P9, P10, P11, and P12; the ring finger keypoints may include at least one of P13, P14, P15, and P16; and the little finger keypoints may include at least one of P17, P18, P19, and P20.
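For reference, one possible zero-based index layout for the P1 to P20 labels above (the ordering is hypothetical; the patent names the points but does not fix their indices):

```python
# Hypothetical index layout for the 20 hand keypoints of FIG. 4.
HAND_KEYPOINTS = {
    "wrist":  [0],                 # P1
    "thumb":  [1, 2, 3],           # P2-P4
    "index":  [4, 5, 6, 7],        # P5-P8:  MCP, PIP, DIP, TIP
    "middle": [8, 9, 10, 11],      # P9-P12
    "ring":   [12, 13, 14, 15],    # P13-P16
    "little": [16, 17, 18, 19],    # P17-P20
}
```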
In some alternative embodiments, the hand state may be, for example, a palm state, a fist state, or another gesture state. For example, when the five fingers of the hand are open and either the palm or the back of the hand faces the image acquisition component, the hand can be determined to be in the palm state; when the hand is clenched and either the palm or the back of the hand faces the image acquisition component, the hand can be determined to be in the fist state. Of course, the hand states in this embodiment are not limited to the above examples; other hand states also fall within the protection scope of this embodiment.
In this embodiment, the state category information may be preset or predefined hand state categories. If the electronic device detects that the state of the hand corresponds to certain preset or predefined state category information, the electronic device can execute a corresponding instruction based on that state category information. The hand states can be considered to correspond to N states, the number of state category information items is M, and N is a positive integer greater than or equal to M.
In some optional embodiments of the invention, the method may further include: performing hand detection on the cropped image based on the recognized features to obtain judgment information indicating whether the cropped image includes a hand.
Optionally, as shown in fig. 3, after step 10321, the method may further include:
step 10325: performing hand detection on the feature images input to the first network based on a fourth branch of the first network to obtain judgment information indicating whether the cropped image includes a hand;
wherein the weighting parameters applied to the feature images in the fourth branch differ, at least in part, from the weighting parameters applied to the feature images in the first branch, the second branch, and the third branch.
In response to the judgment information indicating that the cropped image does not include a hand, the hand is re-detected in the second two-dimensional image to obtain an initial detection frame of the hand of the target object in the second two-dimensional image.
In this embodiment, the execution sequence of step 10325 and steps 10322 to 10324 is not limited to the above, and steps 10322 to 10325 may be executed in parallel.
In this embodiment, the fourth branch processes the plurality of feature images to obtain the judgment information of whether the cropped image includes a hand. For example, the output of the fourth branch may be "1" or "0", where "1" indicates that the cropped image includes a hand and "0" indicates that it does not. If the cropped image does not include a hand, the hand has not been tracked in the second two-dimensional image, and the initial detection frame of the hand of the target object in the second two-dimensional image needs to be re-detected in the detection manner of step 102.
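Putting the tracking-with-fallback logic together, a sketch of the per-frame control flow, reusing the helpers sketched earlier (detect_hand and first_network stand in for the target detection network and the first network; their interfaces are assumptions):

```python
def track_hands(frames, detect_hand, first_network):
    """Detect once, then track frame to frame, re-detecting whenever
    the fourth branch reports that the crop no longer holds a hand."""
    box = None
    for frame in frames:
        if box is None:
            box = detect_hand(frame)          # step 102: full detection
            continue
        h, w = frame.shape[:2]
        first_area = enlarge_box_equal(box, w, h)
        out = first_network(crop_for_network(frame, first_area))
        if out["has_hand"]:                   # fourth-branch judgment
            box = out["box"]                  # first-branch detection frame
        else:
            box = detect_hand(frame)          # fall back to re-detection
```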
Exemplarily, FIG. 5 is a schematic structural diagram of the first network in the image processing method according to the embodiment of the present invention. As shown in FIG. 5, the first network includes at least a feature extraction part and a first branch, a second branch, a third branch, and a fourth branch. The feature extraction part performs feature extraction on the cropped hand image to obtain a feature image set containing a plurality of feature images with different receptive fields. The feature image set then serves as the input data of the first, second, third, and fourth branches, so as to obtain the hand detection frame, the hand key point information, the hand state category information, and the hand-presence judgment information (i.e., the judgment result indicating whether the cropped image includes a hand) output by the respective branches.
The feature extraction part and the first, second, third, and fourth branch networks each include a plurality of convolution layers for performing convolution processing on the image. The second branch network additionally has a heat map network layer to regress the hand key points, while the first, third, and fourth branch networks each additionally have a fully connected layer.
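A minimal PyTorch sketch of a network with this shape (the layer sizes, channel counts, and a 20-channel keypoint heat map are our assumptions; the patent specifies only the shared feature extraction part and the four branches):

```python
import torch
import torch.nn as nn

class FirstNetwork(nn.Module):
    """Shared feature extractor with four task branches, laid out
    like the first network of FIG. 5 (all sizes illustrative)."""
    def __init__(self, num_states=5, num_keypoints=20):
        super().__init__()
        self.features = nn.Sequential(                 # feature extraction part
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.box_head = nn.Sequential(                 # first branch: detection frame
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 4))
        self.heatmap_head = nn.Conv2d(64, num_keypoints, 1)  # second branch
        self.state_head = nn.Sequential(               # third branch: state category
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_states))
        self.presence_head = nn.Sequential(            # fourth branch: hand present?
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, x):                              # x: (N, 3, 80, 80)
        f = self.features(x)
        return {
            "box": self.box_head(f),                   # detection frame
            "heatmaps": self.heatmap_head(f),          # key point heat maps
            "state": self.state_head(f),               # state category logits
            "has_hand": torch.sigmoid(self.presence_head(f)) > 0.5,
        }
```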
The embodiment of the invention also provides an image processing apparatus. FIG. 6 is a schematic diagram of the structure of the image processing apparatus according to an embodiment of the present invention. As shown in FIG. 6, the apparatus includes: an acquisition unit 31, a detection unit 32, a determination unit 33, and a processing unit 34; wherein:
the acquiring unit 31 is configured to acquire a plurality of frames of two-dimensional images including a target object;
the detection unit 32 is configured to detect a hand in a first two-dimensional image among the multiple frames of two-dimensional images to obtain an initial detection frame of the hand of the target object in the first two-dimensional image;
the determining unit 33 is configured to determine a first region in the second two-dimensional image based on the region of the initial detection frame in the first two-dimensional image;
the processing unit 34 is configured to obtain at least one of a detection frame of the hand, key point information of the hand, and state category information corresponding to the hand in the second two-dimensional image based on a pixel point in the first region in the second two-dimensional image; the second two-dimensional image is a frame image after the first two-dimensional image.
In some optional embodiments of the present invention, the processing unit 34 is configured to crop the second two-dimensional image according to the first area to obtain a cropped image, perform feature recognition on the cropped image, and determine at least one of the detection frame of the hand, the key point information of the hand, and the state category information corresponding to the hand based on the recognized features.
In some optional embodiments of the present invention, the processing unit 34 is further configured to perform hand detection on the cropped image based on the recognized features to obtain judgment information indicating whether the cropped image includes a hand.
In some optional embodiments of the present invention, the processing unit 34 is configured to perform feature recognition on the cropped image through a feature extraction part of a first network to obtain a feature image set, wherein the feature image set includes a plurality of feature images with different receptive fields; and input at least some of the feature images in the feature image set to the first network to perform at least one of the following:
performing hand detection on the feature images input to the first network based on a first branch of the first network to obtain the detection frame of the hand;
performing hand key point detection on the feature images input to the first network based on a second branch of the first network to obtain the key point information of the hand;
performing hand state recognition on the feature images input to the first network based on a third branch of the first network to obtain state category information corresponding to the hand state;
wherein the weighting parameters applied to the feature images input to the first network differ, at least in part, among the first branch, the second branch, and the third branch.
In some optional embodiments of the present invention, the processing unit 34 is further configured to perform hand detection on the feature images input to the first network based on a fourth branch of the first network to obtain judgment information indicating whether the cropped image includes a hand; wherein the weighting parameters applied to the feature images in the fourth branch differ, at least in part, from the weighting parameters applied to the feature images in the first branch, the second branch, and the third branch.
In some optional embodiments of the present invention, the detection unit 32 is further configured to, in response to the judgment information obtained by the processing unit 34 indicating that the cropped image does not include a hand, re-detect the hand in the second two-dimensional image to obtain an initial detection frame of the hand of the target object in the second two-dimensional image.
In some optional embodiments of the present invention, the determining unit 33 is configured to perform equal-amplitude enlargement on the area of the initial detection frame in the first two-dimensional image to obtain a second area; and determine, according to the second area, a first area in the second two-dimensional image corresponding to the position range of the second area.
In some optional embodiments of the present invention, the apparatus further comprises a trend detection unit, configured to detect a hand in a third two-dimensional image among the multiple frames of two-dimensional images and determine the position of the hand in the third two-dimensional image, the third two-dimensional image being a frame of image before the second two-dimensional image; and determine a movement trend of the hand based on the position of the hand in the third two-dimensional image and the position of the hand in the first two-dimensional image.
In some optional embodiments of the present invention, the determining unit 33 is configured to enlarge the area of the initial detection frame in the first two-dimensional image based on the movement trend of the hand to obtain a second area, wherein the enlargement amplitude of the sub-area corresponding to the movement trend, within the area of the initial detection frame in the first two-dimensional image, is larger than the enlargement amplitudes of the other sub-areas; and determine, according to the second area, a first area in the second two-dimensional image corresponding to the position range of the second area.
In the embodiment of the present invention, the acquisition unit 31, the detection unit 32, the determination unit 33, the processing unit 34, and the trend detection unit in the image processing apparatus may, in practical applications, be implemented by a Central Processing Unit (CPU), a Digital Signal Processor (DSP), a Micro Control Unit (MCU), or a Field Programmable Gate Array (FPGA).
It should be noted that: the image processing apparatus provided in the above embodiment is exemplified by the division of each program module when performing image processing, and in practical applications, the processing may be distributed to different program modules according to needs, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the image processing apparatus and the image processing method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
The embodiment of the invention also provides the electronic equipment. Fig. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention, as shown in fig. 7, the electronic device includes a memory 42, a processor 41, and a computer program stored in the memory 42 and capable of running on the processor 41, and when the processor 41 executes the computer program, the steps of the image processing method according to the embodiment of the present invention are implemented.
Optionally, the various components in the electronic device are coupled together by a bus system 43. It will be appreciated that the bus system 43 is used to enable communications among the components. The bus system 43 includes a power bus, a control bus, and a status signal bus in addition to the data bus. For clarity of illustration, however, the various buses are labeled as bus system 43 in fig. 7.
It will be appreciated that the memory 42 can be either volatile memory or nonvolatile memory, and can include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), a Programmable Read Only Memory (PROM), an Erasable Programmable Read Only Memory (EPROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface storage may be disk storage or tape storage. Volatile memory can be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 42 described in the embodiments of the invention is intended to include, without being limited to, these and any other suitable types of memory.
The method disclosed in the above embodiments of the present invention may be applied to the processor 41 or implemented by the processor 41. The processor 41 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 41 or by instructions in the form of software. The processor 41 may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 41 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or any conventional processor, or the like. The steps of the method disclosed in the embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, the storage medium is located in the memory 42, and the processor 41 reads the information in the memory 42 and completes the steps of the foregoing method in combination with its hardware.
In an exemplary embodiment, the electronic device may be implemented by one or more Application-Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), FPGAs, general-purpose processors, controllers, MCUs, microprocessors, or other electronic components, for performing the aforementioned methods.
In an exemplary embodiment, an embodiment of the present invention further provides a computer-readable storage medium, for example, the memory 42 including a computer program, where the computer program can be executed by the processor 41 of the electronic device to complete the steps of the aforementioned method. The computer-readable storage medium may be a memory such as an FRAM, a ROM, a PROM, an EPROM, an EEPROM, a Flash Memory, a magnetic surface memory, an optical disc, or a CD-ROM, or may be any device including one or any combination of the above memories.
The computer-readable storage medium provided by the embodiment of the present invention stores a computer program thereon; when the computer program is executed by a processor, the steps of the image processing method according to the embodiment of the present invention are implemented.
The methods disclosed in the several method embodiments, apparatus embodiments, device embodiments, etc. provided in this application may be combined arbitrarily without conflict to obtain new method embodiments, apparatus embodiments, device embodiments, etc.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units is only a division by logical function, and there may be other division manners in actual implementation, for example: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions; the program may be stored in a computer-readable storage medium, and when executed, the program performs the steps including those of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
Alternatively, if the integrated unit of the present invention is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (12)

1. An image processing method, characterized in that the method comprises:
acquiring multiple frames of two-dimensional images containing a target object;
performing hand detection on a first two-dimensional image in the multiple frames of two-dimensional images to obtain an initial detection frame of the hand of the target object in the first two-dimensional image;
determining a first area in a second two-dimensional image based on an area of the initial detection frame in the first two-dimensional image, and obtaining at least one of a detection frame of the hand, key point information of the hand, and state category information corresponding to the hand in the second two-dimensional image based on pixel points in the first area in the second two-dimensional image; the second two-dimensional image is an image frame subsequent to the first two-dimensional image.
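As a reading aid only, the following Python sketch outlines the detect-once-then-track flow of claim 1: full-frame detection yields the initial detection frame, each later frame is processed only within a first area derived by enlarging the previous frame's box, and tracking falls back to full detection when the hand is lost (cf. claim 6). Here detect_hand and track_in_region are hypothetical callables, and uniform_enlarge mirrors the equal-magnitude enlargement of claim 7; none of these names come from the disclosure.

```python
def uniform_enlarge(box, ratio=0.15):
    """Equal-magnitude enlargement of a box; ratio is an assumed value."""
    x1, y1, x2, y2 = box
    pw, ph = (x2 - x1) * ratio, (y2 - y1) * ratio
    return (x1 - pw, y1 - ph, x2 + pw, y2 + ph)

def process_sequence(frames, detect_hand, track_in_region):
    """detect_hand(frame) -> box or None; track_in_region(frame, region)
    -> (box, keypoints, state) or None. Both are hypothetical callables."""
    box, results = None, []
    for frame in frames:
        if box is None:
            box = detect_hand(frame)              # full-frame detection (first image)
            results.append((box, None, None))
            continue
        region = uniform_enlarge(box)             # first area for the next frame
        tracked = track_in_region(frame, region)  # use only pixels inside the first area
        if tracked is None:                       # hand not found: re-detect next frame
            box = None
            results.append((None, None, None))
        else:
            box, keypoints, state = tracked       # new box seeds the following frame
            results.append((box, keypoints, state))
    return results
```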
2. The method according to claim 1, wherein the obtaining at least one of a detection frame of the hand, key point information of the hand, and state category information corresponding to the hand in the second two-dimensional image based on pixel points in the first area in the second two-dimensional image comprises:
cutting the second two-dimensional image according to the first area to obtain a cut image;
and performing feature recognition on the cut image, and determining at least one of a detection frame of the hand, key point information of the hand, and state category information corresponding to the hand based on the recognized features.
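A minimal NumPy sketch of the cutting step in claim 2, assuming integer pixel coordinates after rounding and simple boundary clipping (both assumptions, since the claim fixes neither):

```python
import numpy as np

def cut_first_area(image: np.ndarray, region) -> np.ndarray:
    """Cut the first area out of the second two-dimensional image.
    image is an H x W x C array; region is (x1, y1, x2, y2) in pixels."""
    h, w = image.shape[:2]
    x1, y1, x2, y2 = region
    x1, y1 = max(0, int(round(x1))), max(0, int(round(y1)))
    x2, y2 = min(w, int(round(x2))), min(h, int(round(y2)))
    return image[y1:y2, x1:x2]
```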
3. The method of claim 2, further comprising:
and performing hand detection on the cut image based on the recognized features to obtain judgment information indicating whether the cut image includes a hand.
4. The method according to claim 3, wherein the performing feature recognition on the cut image, and determining at least one of a detection frame of the hand, key point information of the hand, and state category information corresponding to the hand based on the recognized features comprises:
performing feature recognition on the cut image through a feature extraction part of a first network to obtain a feature image set, wherein the feature image set comprises a plurality of feature images with different receptive fields;
inputting at least some of the feature images in the feature image set to the first network to perform at least one of the following:
performing hand detection on the feature image input to the first network based on a first branch in the first network to obtain a detection frame of the hand;
performing hand key point detection on the feature image input to the first network based on a second branch in the first network to obtain key point information of the hand;
performing hand state recognition on the feature image input to the first network based on a third branch in the first network to obtain state category information corresponding to the hand state;
wherein, for the feature image input to the first network, at least some of the plurality of weighting parameters respectively corresponding to the first branch, the second branch, and the third branch are different.
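For illustration, the following PyTorch sketch shows one shared-backbone, multi-branch layout consistent with claims 4 and 5: a common feature extraction part produces feature maps with different receptive fields, and separate heads (hence at least partly different weighting parameters) emit the detection frame, the key points, the state category, and the hand-presence judgment. All layer sizes, the number of key points, and the number of state categories are assumptions; the disclosure does not specify an architecture.

```python
import torch
import torch.nn as nn

class FirstNetwork(nn.Module):
    """Illustrative sketch of the multi-branch 'first network'.
    21 key points and 5 state categories are assumed values."""
    def __init__(self, num_keypoints=21, num_states=5):
        super().__init__()
        # Shared feature extraction part: two stages yield feature maps
        # with different receptive fields.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        # First branch: detection frame of the hand (box regression).
        self.box_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 4))
        # Second branch: key point information (x, y per key point).
        self.kpt_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(32, num_keypoints * 2))
        # Third branch: state category information.
        self.state_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(32, num_states))
        # Fourth branch (claim 5): does the cut image include a hand?
        self.hand_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1))

    def forward(self, x):
        f1 = self.stage1(x)   # smaller receptive field
        f2 = self.stage2(f1)  # larger receptive field, fed to all branches here
        return {
            "box": self.box_head(f2),
            "keypoints": self.kpt_head(f2),
            "state": self.state_head(f2),
            "has_hand": torch.sigmoid(self.hand_head(f2)),
        }
```

With a cut image resized to a fixed resolution, e.g. `FirstNetwork()(torch.randn(1, 3, 128, 128))`, one forward pass yields all four outputs, which is why the branches can share the cost of the backbone.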
5. The method of claim 4, further comprising:
performing hand detection on the feature image input to the first network based on a fourth branch in the first network to obtain judgment information indicating whether the cut image includes a hand;
wherein at least some of the plurality of weighting parameters of the feature image input to the first network corresponding to the fourth branch are different from the weighting parameters corresponding to the first branch, the second branch, and the third branch.
6. The method according to any one of claims 3 to 5, further comprising:
and in response to the judgment information indicating that the cut image does not include a hand, performing hand detection on the second two-dimensional image again to obtain a detection frame of the hand of the target object in the second two-dimensional image.
7. The method of any of claims 1 to 6, wherein determining the first region in the second two-dimensional image based on the region of the initial detection box in the first two-dimensional image comprises:
performing equal-magnitude enlargement processing on the area of the initial detection frame in the first two-dimensional image to obtain a second area;
and determining, according to the second area, a first area in the second two-dimensional image corresponding to the position range of the second area.
8. The method of any of claims 1 to 6, wherein prior to said determining the first region in the second two-dimensional image based on the region of the initial detection box in the first two-dimensional image, the method further comprises:
performing hand detection on a third two-dimensional image in the multiple frames of two-dimensional images, and determining a position of the hand in the third two-dimensional image, wherein the third two-dimensional image is an image frame preceding the second two-dimensional image;
determining a movement trend of the hand based on the position of the hand in the third two-dimensional image and the position of the hand in the first two-dimensional image.
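A small sketch of how the movement trend of claim 8 could be derived, assuming (as the claim permits but does not require) that the trend is the displacement of the box centers and that the third image precedes the first:

```python
def box_center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def movement_trend(box_in_third, box_in_first):
    """(dx, dy) displacement of the hand from the third to the first
    two-dimensional image; can drive the directional enlargement of claim 9."""
    cx0, cy0 = box_center(box_in_third)
    cx1, cy1 = box_center(box_in_first)
    return (cx1 - cx0, cy1 - cy0)
```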
9. The method of claim 8, wherein determining the first region in the second two-dimensional image based on the region of the initial detection box in the first two-dimensional image comprises:
enlarging the area of the initial detection frame in the first two-dimensional image based on the movement trend of the hand to obtain a second area, wherein the enlargement magnitude of the sub-area corresponding to the movement trend is larger than the enlargement magnitudes of the other sub-areas;
and determining, according to the second area, a first area in the second two-dimensional image corresponding to the position range of the second area.
10. An image processing apparatus, characterized in that the apparatus comprises: an acquisition unit, a detection unit, a determination unit, and a processing unit; wherein:
the acquisition unit is configured to acquire multiple frames of two-dimensional images containing a target object;
the detection unit is configured to perform hand detection on a first two-dimensional image in the multiple frames of two-dimensional images to obtain an initial detection frame of the hand of the target object in the first two-dimensional image;
the determination unit is configured to determine a first area in a second two-dimensional image based on the area of the initial detection frame in the first two-dimensional image;
the processing unit is configured to obtain at least one of a detection frame of the hand, key point information of the hand, and state category information corresponding to the hand in the second two-dimensional image based on pixel points in the first area in the second two-dimensional image; the second two-dimensional image is an image frame subsequent to the first two-dimensional image.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1 to 9 are implemented when the program is executed by the processor.
CN202110725463.4A 2021-06-29 2021-06-29 Image processing method and device and electronic equipment Pending CN113469017A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110725463.4A CN113469017A (en) 2021-06-29 2021-06-29 Image processing method and device and electronic equipment
PCT/CN2021/127474 WO2023273071A1 (en) 2021-06-29 2021-10-29 Image processing method and apparatus and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110725463.4A CN113469017A (en) 2021-06-29 2021-06-29 Image processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN113469017A 2021-10-01

Family

ID=77873835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110725463.4A Pending CN113469017A (en) 2021-06-29 2021-06-29 Image processing method and device and electronic equipment

Country Status (2)

Country Link
CN (1) CN113469017A (en)
WO (1) WO2023273071A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023273071A1 (en) * 2021-06-29 2023-01-05 北京市商汤科技开发有限公司 Image processing method and apparatus and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986137A (en) * 2017-11-30 2018-12-11 成都通甲优博科技有限责任公司 Human body tracing method, device and equipment
CN111539992A (en) * 2020-04-29 2020-08-14 北京市商汤科技开发有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN112464862A (en) * 2020-12-10 2021-03-09 安徽鸿程光电有限公司 Image recognition method, device, equipment and computer storage medium
WO2021115181A1 (en) * 2019-12-13 2021-06-17 RealMe重庆移动通信有限公司 Gesture recognition method, gesture control method, apparatuses, medium and terminal device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192127B (en) * 2021-05-12 2024-01-02 北京市商汤科技开发有限公司 Image processing method, device, electronic equipment and storage medium
CN113469017A (en) * 2021-06-29 2021-10-01 北京市商汤科技开发有限公司 Image processing method and device and electronic equipment

Also Published As

Publication number Publication date
WO2023273071A1 (en) 2023-01-05

Similar Documents

Publication Publication Date Title
US9690388B2 (en) Identification of a gesture
US10915998B2 (en) Image processing method and device
CN108737739B (en) Preview picture acquisition method, preview picture acquisition device and electronic equipment
US9275275B2 (en) Object tracking in a video stream
CN112068698A (en) Interaction method and device, electronic equipment and computer storage medium
CN112714253B (en) Video recording method and device, electronic equipment and readable storage medium
WO2018050128A1 (en) Target tracking method, electronic device and storage medium
WO2019222889A1 (en) Image feature extraction method and device
CN105227838A (en) A kind of image processing method and mobile terminal
CN110297545B (en) Gesture control method, gesture control device and system, and storage medium
US11113998B2 (en) Generating three-dimensional user experience based on two-dimensional media content
US20150062005A1 (en) Method and system for providing user interaction when capturing content in an electronic device
US20190356854A1 (en) Portable electronic device and image capturing method thereof
WO2023273071A1 (en) Image processing method and apparatus and electronic device
US8610831B2 (en) Method and apparatus for determining motion
CN111986229A (en) Video target detection method, device and computer system
CN110222576B (en) Boxing action recognition method and device and electronic equipment
CN113192127B (en) Image processing method, device, electronic equipment and storage medium
CN114089868A (en) Touch operation method and device and electronic equipment
US9952671B2 (en) Method and apparatus for determining motion
CN115278084A (en) Image processing method, image processing device, electronic equipment and storage medium
WO2021087773A1 (en) Recognition method and apparatus, electronic device, and storage medium
JP2022534314A (en) Picture-based multi-dimensional information integration method and related equipment
CN112150351A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
WO2020027813A1 (en) Cropping portions of images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40051317
Country of ref document: HK