CN112655021A - Image processing method, image processing device, electronic equipment and storage medium - Google Patents

Image processing method, image processing device, electronic equipment and storage medium

Info

Publication number
CN112655021A
CN112655021A (application CN202080004938.4A)
Authority
CN
China
Prior art keywords
user
determining
image
information
target
Prior art date
Legal status
Pending
Application number
CN202080004938.4A
Other languages
Chinese (zh)
Inventor
任创杰
李思晋
李鑫超
Current Assignee
SZ DJI Technology Co Ltd
Original Assignee
SZ DJI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by SZ DJI Technology Co Ltd
Publication of CN112655021A

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T 7/00 Image analysis > G06T 7/70 Determining position or orientation of objects or cameras
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N 3/00 Computing arrangements based on biological models > G06N 3/02 Neural networks > G06N 3/08 Learning methods
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T 7/00 Image analysis > G06T 7/10 Segmentation; Edge detection > G06T 7/11 Region-based segmentation
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T 2207/00 Indexing scheme for image analysis or image enhancement > G06T 2207/10 Image acquisition modality > G06T 2207/10016 Video; Image sequence
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T 2207/00 Indexing scheme for image analysis or image enhancement > G06T 2207/20 Special algorithmic details > G06T 2207/20081 Training; Learning
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T 2207/00 Indexing scheme for image analysis or image enhancement > G06T 2207/20 Special algorithmic details > G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T 2207/00 Indexing scheme for image analysis or image enhancement > G06T 2207/20 Special algorithmic details > G06T 2207/20112 Image segmentation details > G06T 2207/20132 Image cropping
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T 2207/00 Indexing scheme for image analysis or image enhancement > G06T 2207/30 Subject of image; Context of image processing > G06T 2207/30204 Marker

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the invention provide an image processing method, an image processing device, electronic equipment and a storage medium, wherein the method comprises: acquiring a shot video stream; determining, according to at least one frame of image in the video stream, a target whose posture information satisfies a preset condition; and starting a function corresponding to the preset condition. The image processing method, image processing device, electronic equipment and storage medium provided by the embodiments of the invention can acquire the shot video stream, determine a target whose posture information satisfies the preset condition according to at least one frame of image in the video stream, and start the function corresponding to the preset condition, thereby simplifying the steps required to use the corresponding function, reducing the time spent, improving the use efficiency of the equipment, providing the user with a more complete human-computer interaction function and a friendlier human-computer interaction experience, and improving the user experience.

Description

Image processing method, image processing device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of unmanned aerial vehicles, in particular to an image processing method, an image processing device, electronic equipment and a storage medium.
Background
In the prior art, in the process of interaction between an intelligent device and a user, the user often needs to perform certain operations before a corresponding function can be used. Taking an unmanned aerial vehicle that provides an intelligent following function as an example, if the user wants to enter the intelligent following mode, he or she needs to perform a series of complicated operations on the unmanned aerial vehicle or on a bound mobile phone, completing the specified steps one by one according to the prompts, before the intelligent following function of the unmanned aerial vehicle can be used.
The defects of the prior art are that the steps required for using the corresponding functions are complex, the time spent is long, and the use efficiency of the equipment is low.
Disclosure of Invention
The embodiment of the invention provides an image processing method and device, electronic equipment and a storage medium, which are used for solving the technical problems of complicated operation steps and low operation efficiency of the electronic equipment in the prior art.
A first aspect of the present invention provides an image processing method, including:
acquiring a shot video stream;
determining a target with posture information meeting preset conditions according to at least one frame of image in the video stream;
and starting the function corresponding to the preset condition.
A second aspect of the present invention provides an image processing apparatus comprising:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory to implement:
acquiring a shot video stream;
determining a target with posture information meeting preset conditions according to at least one frame of image in the video stream;
and starting the function corresponding to the preset condition.
A third aspect of the present invention provides an electronic device comprising the image processing apparatus of the second aspect.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, in which program instructions are stored, and the program instructions are used to implement the method according to the first aspect.
According to the image processing method and device, the electronic equipment and the storage medium provided by the embodiments of the invention, the shot video stream can be obtained, a target whose posture information satisfies a preset condition can be determined according to at least one frame of image in the video stream, and the function corresponding to the preset condition can be started, thereby simplifying the steps required to use the corresponding function, reducing the time spent, improving the use efficiency of the equipment, providing the user with a more complete human-computer interaction function and a friendlier human-computer interaction experience, and improving the user experience.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
fig. 1 is a schematic flowchart of an image processing method according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of an image processing method according to a second embodiment of the present invention;
fig. 3 is a schematic flowchart of an image processing method according to a third embodiment of the present invention;
fig. 4 is a schematic diagram of a key point position of a single-hand waving gesture in an image processing method according to a third embodiment of the present invention;
fig. 5 is a schematic flowchart illustrating a process of determining user key point information in an image processing method according to a third embodiment of the present invention;
fig. 6 is a schematic diagram illustrating a principle of determining key point information in an image processing method according to a third embodiment of the present invention;
fig. 7 is a schematic position diagram of a gaussian distribution area and a zero response background of a confidence feature map in an image processing method according to a third embodiment of the present invention;
fig. 8 is a schematic structural diagram of an image processing apparatus according to a fourth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
The image processing method provided by the embodiments of the invention can determine the posture information of a user from a shot image and start a corresponding function according to the posture information. The method provided by the embodiments of the invention can be applied to any electronic equipment, such as a mobile phone, a camera, a gimbal, an unmanned aerial vehicle, an unmanned vehicle, Augmented Reality (AR) equipment, monitoring equipment and the like.
In the following, some embodiments of the present invention are described in detail by taking the electronic device as an unmanned aerial vehicle as an example with reference to the accompanying drawings. The features of the embodiments or embodiments described below may be combined with each other without conflict between the embodiments.
Example one
The embodiment of the invention provides an image processing method. Fig. 1 is a flowchart illustrating an image processing method according to an embodiment of the present invention. As shown in fig. 1, the image processing method in the present embodiment may include:
step 101, acquiring a shot video stream.
The execution subject of the method in this embodiment may be an image processing device in an unmanned aerial vehicle. The unmanned aerial vehicle may be provided with a shooting device, and acquiring the shot video stream in this step may specifically include: acquiring a video stream shot by the shooting device of the unmanned aerial vehicle.
And step 102, determining a target of which the posture information meets a preset condition according to at least one frame of image in the video stream.
The video stream photographed by the photographing device may include a plurality of frames of images, at least one frame of image is selected from the plurality of frames of images, and a target in which the posture information satisfies a preset condition is determined.
The target may be a person or an object such as a vehicle. If the target is a person, the pose information may include, but is not limited to: standing, walking, squatting, lying down, etc. If the target is a vehicle, the attitude information may include, but is not limited to: straight, left turn, right turn, etc.
And step 103, starting a function corresponding to the preset condition.
The function started in this step may be any function that the unmanned aerial vehicle has, and the preset condition and the started function may be set according to actual needs. For example, the posture information satisfying the preset condition may include, but is not limited to, any one or more of the following: a predetermined posture occurs, a predetermined posture is maintained for more than a preset time, a first posture changes into a second posture, and the like. The corresponding functions that may be started include, but are not limited to: taking off, landing, changing posture, recording video, taking a picture, entering a power-saving mode, shutting down, and the like.
In an optional implementation manner, the unmanned aerial vehicle may be provided with an audio playing device, and if it is detected that the user claps his hands, the function of automatically playing music may be started.
In another alternative embodiment, the drone may be used to track a vehicle and start corresponding functions according to the attitude information of the vehicle; for example, if the vehicle is detected to be in a turning state, the drone may ascend to enlarge its field of view and avoid losing track of the vehicle.
The image processing method provided by this embodiment can acquire the shot video stream, determine a target whose posture information satisfies a preset condition according to at least one frame of image in the video stream, and start the function corresponding to the preset condition, thereby simplifying the steps required to use the corresponding function, reducing the time spent, improving the use efficiency of the unmanned aerial vehicle, providing the user with a more complete human-computer interaction function and a friendlier human-computer interaction experience, and improving the user experience.
Example two
The second embodiment of the invention provides an image processing method. On the basis of the technical solution provided by the first embodiment, this embodiment automatically enters the following mode when a hand-waving gesture of the user is detected.
Fig. 2 is a flowchart illustrating an image processing method according to a second embodiment of the present invention. As shown in fig. 2, the image processing method in the present embodiment may include:
step 201, acquiring a shot video stream, wherein at least one frame of image in the video stream is used for determining the posture information of a user.
In this step, an image for determining the user posture information is recorded as an image to be processed.
In an optional implementation manner, one frame of image can be selected from the video stream as an image to be processed, so that the method is simple and convenient to calculate, and the efficiency of detecting the user posture can be effectively improved.
In another optional implementation manner, continuous multi-frame images of the video stream can be used as images to be processed, so that the accuracy of user gesture detection can be effectively improved.
In yet another alternative embodiment, multiple frames of images may be taken from the video stream at intervals, for example one frame of image every 1 second, so as to balance efficiency and accuracy, as illustrated by the sketch below.
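By way of illustration only (this is not part of the claimed method), the following minimal Python sketch samples roughly one frame per second from a video source using OpenCV; the source path, the 1-second interval and the fallback frame rate are assumptions chosen for the example.

```python
import cv2

def sample_frames(video_path, interval_s=1.0):
    """Yield one frame roughly every `interval_s` seconds of video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if the FPS is unknown
    step = max(1, int(round(fps * interval_s)))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield frame                        # frame selected as an image to be processed
        idx += 1
    cap.release()
```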
Step 202, determining, for each of the at least one frame of image, pose information of at least one user in the image.
Optionally, a neural network may be trained with samples, and the trained neural network may be used to process the image to obtain the corresponding posture information. Alternatively, the posture information of the user in the image may be detected directly by an algorithm such as OpenPose or YOLO.
In the case where the image to be processed has only one frame, the pose information of at least one user in the image can be obtained by step 202.
In the case that there are multiple frames in the image to be processed, the user's posture information in the image of multiple frames can be obtained through step 202. Some users may only appear in one or a few frames of images, but the pose information of these users can still be detected.
Step 203, determining a target to be followed according to the determined posture information of the at least one user, wherein the target to be followed is a user of which the posture information meets a preset condition.
Optionally, the number of targets to be followed may be one or more. In a scenario of following a plurality of targets, when the targets separate from each other, the following may be stopped, or a subset of the targets may be selected to continue the following. In this embodiment, a single target to be followed is taken as an example for explanation.
In an optional embodiment, determining the target to be followed according to the determined posture information of the at least one user may include: and if the gesture information of only one user meets the preset condition, determining that the user is the target to be followed.
For example, the preset condition may be that a preset posture is maintained for more than a preset time. If one and only one user maintains the preset posture for more than the preset time, determining that the user is the target to be followed.
Optionally, the preset posture may be a single-hand waving posture, and the preset time may be 1 second. Then, only when a single user remains in the one-handed waving state for more than 1 second can that user become the target to be followed. If a person waves both hands, lowers both hands, raises a hand only briefly, or several people wave one hand at the same time, the target to be followed cannot be determined. Because the automatic following function is triggered only when exactly one user meets the preset condition, single-target tracking can be realized quickly and accurately, and tracking the wrong target is avoided. A sketch of the timing check is given below.
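As an illustrative sketch only (the class name, the 1-second threshold and the per-user identifiers are assumptions), the "maintained for more than a preset time" check can be implemented by keeping a timer per tracked user across frames:

```python
import time

class WaveHoldDetector:
    """Tracks, per user id, how long the one-handed waving posture has persisted."""
    def __init__(self, hold_seconds=1.0):
        self.hold_seconds = hold_seconds
        self.wave_start = {}                       # user_id -> time the wave began

    def update(self, user_id, is_waving, now=None):
        """Return True once the user has kept waving longer than the threshold."""
        now = time.monotonic() if now is None else now
        if not is_waving:
            self.wave_start.pop(user_id, None)     # posture broken: reset the timer
            return False
        start = self.wave_start.setdefault(user_id, now)
        return (now - start) > self.hold_seconds
```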
In another alternative embodiment, determining the target to be followed according to the determined posture information of the at least one user may include: and if the posture information of a plurality of users meets the preset condition, determining that the user which is detected to meet the preset condition firstly in the plurality of users is the target to be followed.
For example, if several users each wave one hand for more than 1 second, the user who was first detected to wave one hand for more than 1 second may be taken as the target to be followed. Setting the user who first meets the posture condition as the target to be followed can effectively avoid interference from other users and ensure that the following proceeds smoothly.
In another alternative embodiment, determining the target to be followed according to the determined posture information of the at least one user may include: and if the posture information of a plurality of users meets the preset condition, determining the user closest to the center of the shooting picture among the users as the target to be followed.
For example, if several users each wave one hand for more than 1 second, the user closest to the center of the picture may be selected from among them as the target to be followed. By selecting the user closest to the center of the picture from among the multiple users meeting the condition, the target to be followed is guaranteed to be nearest the center of the picture, the time needed to turn toward the target is saved, and the following efficiency is improved. A sketch of this selection is given below.
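Purely as an illustration (the bounding-box representation and the function name are assumptions), selecting the candidate nearest the picture center can be sketched as follows:

```python
def pick_center_most(candidates, frame_width, frame_height):
    """candidates: list of (user_id, box) with box = (x1, y1, x2, y2).
    Returns the candidate whose box center is nearest the picture center."""
    cx, cy = frame_width / 2.0, frame_height / 2.0

    def dist_to_center(item):
        _, (x1, y1, x2, y2) = item
        bx, by = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        return (bx - cx) ** 2 + (by - cy) ** 2     # squared distance is enough for ranking

    return min(candidates, key=dist_to_center) if candidates else None
```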
In another alternative embodiment, determining the target to be followed according to the determined posture information of the at least one user may include: if the posture information of a plurality of users meets a preset condition and the plurality of users comprise preset users, determining that the preset users are targets to be followed.
For example, if several users each wave one hand for more than 1 second, identity recognition may be performed on these users, and if a preset user is among them, the preset user may be taken as the target to be followed.
The identity recognition can be realized through face recognition, iris recognition and the like. The preset user may be any user set in advance. For example, the owner of the unmanned aerial vehicle can set himself or herself as the preset user; when several people make a one-handed waving gesture at the same time, the unmanned aerial vehicle can recognize the owner and take the owner as the target to be followed. Preferentially following the preset user can effectively meet the personalized needs of the user.
And step 204, following the target.
After the target to be followed is determined, the following mode can be entered and the target can be followed, so the following mode can be entered automatically by waving one hand. Of course, gestures other than a single-handed wave may also be used to trigger automatic following, such as clapping or nodding.
Optionally, while following the target, the distance between the unmanned aerial vehicle and the target may always be kept within a preset range. For example, if the target moves forward, the drone moves forward, and if the target stops, the drone also stops. The specific following strategy can be set according to actual needs, which is not limited in this embodiment.
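Only as a simplified illustration of one possible distance-keeping strategy along the forward axis (not the patent's method; the distance bounds, the gain and the function name are assumptions, and a real flight controller would be considerably more involved):

```python
def follow_command(distance_m, min_d=3.0, max_d=6.0, gain=0.5):
    """Return a forward/backward speed command (m/s) that keeps the
    drone-to-target distance inside [min_d, max_d]; 0 means hold position."""
    if distance_m > max_d:            # target moved away: advance
        return gain * (distance_m - max_d)
    if distance_m < min_d:            # target too close: back off
        return -gain * (min_d - distance_m)
    return 0.0
```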
In the image processing method provided by this embodiment, a shot video stream is obtained, at least one frame of image in the video stream is used to determine the posture information of a user, and the posture information of at least one user is determined for each of the at least one frame of image. If a user's posture meets a preset condition, such as a single-handed wave or another posture, that user can be determined as the target to be followed and then followed. In this way, the following mode can be entered directly and automatically through a single-handed wave or another posture. Compared with a scheme in which the following mode can only be entered through a series of operations, such as connecting the remote controller of the unmanned aerial vehicle to a mobile phone, opening an application program, clicking a series of buttons and selecting the following target, this simplifies the steps required to enter the following mode, reduces the time spent, improves the efficiency of automatic following by the unmanned aerial vehicle, saves the electric quantity of the unmanned aerial vehicle, and prolongs its service life.
EXAMPLE III
The third embodiment of the invention provides an image processing method. On the basis of the technical solutions provided by the foregoing embodiments, this embodiment detects the user posture by first determining key points and then determining the posture information.
Fig. 3 is a flowchart illustrating an image processing method according to a third embodiment of the present invention. As shown in fig. 3, the image processing method in the present embodiment may include:
step 301, acquiring a shot video stream, wherein at least one frame of image in the video stream is used for determining the posture information of a user.
In this embodiment, reference may be made to the foregoing embodiments for specific implementation principles and methods of step 301, which are not described herein again.
Step 302, for each frame image of the at least one frame image, determining at least one user to be analyzed according to the image.
In this step, all users in the image may be identified by a Multi-Object Tracking (MOT) algorithm or the like, and the at least one user to be analyzed may be all or part of users detected in the image.
Optionally, a preset number of users may be selected from the all users as the at least one user to be analyzed, so that the efficiency of the algorithm can be effectively improved, and the burden of the device can be reduced. The preset number may be set according to actual needs, and may be 4, for example.
Specifically, if the number of all users in the image is less than or equal to a preset number, taking all users as objects to be analyzed; if the number of all users in the image is larger than the preset number, the users can be screened according to certain conditions.
In an alternative embodiment, a preset number of users near the center of the image may be selected as the at least one user to be analyzed.
The center of the image may refer to a horizontal center line of the image, may also refer to a vertical center line of the image, or may also refer to a center point of the image.
In another alternative embodiment, a predetermined number of users that are most foreground in the image may be selected as the at least one user to be analyzed. The most foreground preset number of users may refer to a preset number of users closest to the device.
For example, if five users are detected in the image, four of whom are about 3 meters from the device and one of whom is about 10 meters away, the first four may be selected as the objects to be analyzed. The distance can be judged by changes in image sharpness, by infrared detection, or the like.
By selecting a preset number of users meeting a certain condition from all users, users at important positions in the image can be prevented from being overlooked on the basis of improving the efficiency, and the equipment is ensured to normally enter a following mode.
Step 303, in each frame of image, determining, for each user of the at least one user to be analyzed, the key point information of the user, and determining the posture information of the user according to the key point information of the user.
After determining at least one user to be analyzed, keypoint information of each user may be detected, and pose information of the user may be determined according to the keypoint information.
Alternatively, the key point information in the image can be directly determined through a deep learning algorithm such as a neural network. Wherein the key point information of the user may include location information of a plurality of key points of the user. The position information may specifically be coordinates where the key points are located.
Optionally, the plurality of key points may include, but are not limited to: at least two of a nose, a middle shoulder, a right shoulder joint, a right elbow joint, a right hand, a left shoulder joint, a left elbow joint, a left hand, a right hip joint, a right knee, a right ankle, a left hip joint, a left knee, and a left ankle.
After determining the key point information of the user, the pose information of the user may be determined according to the key point information of the user.
In the scenario of entering the following mode through a single-handed wave, if the elbow joint on either side of the user is higher than the shoulder joint on the same side while the elbow joint on the other side is lower than the shoulder joint on that side, it can be determined that the user is in the single-handed waving posture. Through the height relationship between the shoulder joints and elbow joints on the two sides, whether the user is in the single-handed waving posture can be determined quickly and accurately.
Fig. 4 is a schematic diagram of a key point position of a single-hand waving gesture in an image processing method according to a third embodiment of the present invention. As shown in fig. 4, the black dots indicate key points of the user, in which the left elbow joint 401 is higher than the ipsilateral shoulder joint 402, and the right elbow joint 404 is lower than the ipsilateral shoulder joint 403, so that it can be determined that the user is in a one-handed waving state.
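The height comparison can be sketched as follows; this is an illustrative assumption of how the rule might be coded, using hypothetical key point names and image coordinates in which a smaller y value means higher in the picture:

```python
def is_one_hand_wave(kp):
    """kp: dict mapping key point name -> (x, y) pixel coordinates, or None if absent.
    In image coordinates a smaller y value means the joint is higher in the picture."""
    needed = ("left_elbow", "left_shoulder", "right_elbow", "right_shoulder")
    if any(kp.get(name) is None for name in needed):
        return False
    left_elbow_up = kp["left_elbow"][1] < kp["left_shoulder"][1]
    right_elbow_up = kp["right_elbow"][1] < kp["right_shoulder"][1]
    left_elbow_down = kp["left_elbow"][1] > kp["left_shoulder"][1]
    right_elbow_down = kp["right_elbow"][1] > kp["right_shoulder"][1]
    # one elbow above its shoulder while the other elbow is below its shoulder
    return (left_elbow_up and right_elbow_down) or (right_elbow_up and left_elbow_down)
```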
Step 304, determining a target to be followed according to the determined posture information of at least one user, wherein the target to be followed is the user of which the posture information meets the preset condition.
Step 305, following the target.
In this embodiment, the specific implementation principle and process of step 304 to step 305 may refer to the foregoing embodiments, and are not described herein again.
According to the image processing method provided by this embodiment, for each frame of image, at least one user to be analyzed can be determined from the image; for each of the at least one user to be analyzed, the key point information of the user is determined, and the posture information of the user is determined according to the key point information. This can effectively improve the detection efficiency so that the corresponding function can be started in a timely and accurate manner. Moreover, by determining the key point information first and then determining the corresponding posture information, the human body posture can be analyzed more comprehensively. Compared with a scheme in which a neural network directly outputs the posture information, this approach has higher recognition accuracy and is more flexible; when the types of actions to be recognized change, it is not necessary to re-annotate all the samples, which saves labor cost and reduces the amount of development work when requirements change.
In the technical solution provided by the third embodiment, for each frame of image, when determining the key point information of the user, one optional implementation is to determine the key point information of the user directly from the whole image through a deep learning algorithm. Another optional implementation is to first determine an image of the region of interest (ROI) where the user is located, and then determine the key point information in the ROI image according to a neural network.
Fig. 5 is a schematic flowchart of determining user key point information in an image processing method according to a third embodiment of the present invention. The method of fig. 5 may be employed to determine the keypoint information for each user to be analyzed in the image. As shown in fig. 5, determining the key point information of the user may include:
step 501, determining the ROI image where the user is located.
Optionally, the shot image may be cropped according to the bounding box where the user is located, so as to obtain the ROI image corresponding to the user.
Fig. 6 is a schematic diagram illustrating a principle of determining key point information in an image processing method according to a third embodiment of the present invention. As shown in fig. 6, the captured image may be an RGB image, and a bounding box in which a user is located in the RGB image may be determined through a multi-target tracking algorithm or other algorithms, where the category of the bounding box is a person. The representation form of the bounding box can be coordinate information of four corners of the bounding box, and the ROI image corresponding to the user can be determined through the bounding box and the RGB image.
As described above, all users in the image can be identified by a multi-target tracking algorithm or the like, and the users to be analyzed are selected from the identified users. Specifically, the bounding boxes corresponding to a plurality of users can be obtained through the multi-target tracking algorithm; when the number of bounding boxes is greater than the preset number, a preset number of bounding boxes are selected from them, and the corresponding ROI images are obtained by taking the RGB image and the preset number of bounding boxes as input.
For example, using the MOT algorithm, the bounding boxes of 5 users may be determined from the RGB image, from which the bounding boxes of 4 users may be selected. According to the selected 4 bounding boxes, 4 ROI images can be cropped from the RGB image, corresponding to the 4 users respectively.
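As an illustrative sketch only (the array layout and the preset number of 4 are assumptions consistent with the example above), cropping ROI images from tracker bounding boxes might look like this:

```python
import numpy as np

def crop_rois(rgb_image, boxes, max_users=4):
    """rgb_image: H x W x 3 array; boxes: list of (x1, y1, x2, y2) from the tracker.
    Returns one ROI image per kept bounding box."""
    h, w = rgb_image.shape[:2]
    rois = []
    for (x1, y1, x2, y2) in boxes[:max_users]:           # keep at most the preset number
        x1, y1 = max(0, int(x1)), max(0, int(y1))        # clamp the box to the image
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        if x2 > x1 and y2 > y1:
            rois.append(rgb_image[y1:y2, x1:x2].copy())
    return rois
```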
Step 502, inputting the ROI image into a neural network to obtain confidence feature maps corresponding to a plurality of key points.
The confidence characteristic graph corresponding to any key point comprises the probability that each pixel point belongs to the key point.
After the ROI image of each user is acquired, the ROI image of the user may be input into a neural network model, and the confidence feature maps corresponding to the user are determined by using the model. In this embodiment, the model may be a convolutional neural network (CNN), and specifically may be a fully convolutional network (FCN).
In this embodiment, the processing of the neural network may include two stages: training and detection. The training stage may be carried out before the detection stage, or the neural network may be trained between any two detections. In the training stage, samples are used to train the neural network, and the parameters in the neural network are adjusted so that its output approaches the target result. In the detection stage, the fully trained neural network parameters are used to process the image and output the confidence feature maps.
The training phase of the neural network model is described first. Optionally, the training process may include: obtaining a training sample, where the training sample includes a sample image and a confidence feature map corresponding to the sample image; and training the neural network according to the training sample. Using the confidence feature map as the target result to train the neural network, so that the output of the neural network approaches the target result, can effectively improve the anti-interference performance of the neural network and avoid over-fitting.
Optionally, the process of acquiring the training sample may include: acquiring a sample image and the position information of the key points in the sample image; and determining the confidence feature map corresponding to the sample image according to the position information of the key points. In the confidence feature map corresponding to the sample image, pixel points closer to the key points have higher probabilities.
The sample image can be an ROI image cut out from any image acquired from a database, for each sample image, the position information of key points in the image is determined by a manual labeling method, and a confidence characteristic map is generated according to the position information of the key points.
Assuming that the position coordinates of the shoulder joint in the image are determined to be (50, 50) through manual annotation, a confidence feature map corresponding to the shoulder joint can be generated according to this position information. The principle of generating the confidence feature map is that the closer a pixel point is to the true position of the shoulder joint, the greater the probability that the pixel point belongs to the shoulder joint. For example, the pixel point at (50, 50) has the greatest probability, say 0.8; the probability at (55, 55) should be greater than the probability at (60, 60), for example 0.1 and 0.01 respectively; and a pixel point at the edge of the image, far away from (50, 50), has a probability of belonging to the shoulder joint that is close to 0.
Optionally, the confidence feature map corresponding to the sample image may be generated from the position information of the key point through a two-dimensional Gaussian distribution. Specifically, in the confidence feature map, the position coordinates of the pixel points may follow a two-dimensional Gaussian distribution whose expectation is the coordinates of the key point and whose variance is D1; alternatively, the distance between a pixel point and the annotated key point may follow a Gaussian distribution with an expectation of 0 and a variance of D2. The variances D1 and D2 can be set according to actual needs. Determining the confidence feature map corresponding to the sample image through a two-dimensional Gaussian distribution can effectively model the probability that each pixel point belongs to the key point, and improves the detection accuracy.
Alternatively, the confidence feature map may also consist of a Gaussian distribution and a zero-response background. Specifically, within a preset range around the key point, the probability corresponding to each pixel point can be determined according to the Gaussian distribution, and outside the preset range a zero-response background can be set, that is, the probability corresponding to each pixel point outside the preset range is set to 0.
Taking the shoulder joint as an example of a key point, within a preset range around the position of the shoulder joint, the probabilities corresponding to the pixel points are generated using a Gaussian distribution; for example, the preset range may be a circle centered on the shoulder joint with a radius of 5 pixels. When a pixel point is more than 5 pixels away from the coordinate point of the shoulder joint in the image, it is almost impossible for that pixel point to belong to the shoulder joint, and the corresponding probability is 0.
Fig. 7 is a schematic position diagram of the Gaussian distribution area and the zero-response background of a confidence feature map in an image processing method according to the third embodiment of the present invention. As shown in fig. 7, in the confidence feature map, the middle black point represents a manually annotated key point, the shaded portion represents the Gaussian distribution region, and the probability corresponding to each pixel point in this region is determined by the Gaussian distribution; the region outside the shadow is the zero-response background region, and the probability corresponding to each pixel point in it is 0. Composing the confidence feature map from a Gaussian distribution and a zero-response background can effectively simplify the generation of the confidence feature map and improve its generation efficiency and accuracy.
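As an illustration only (the Gaussian width sigma, the 5-pixel radius and the function name are assumptions; the patent itself does not fix these values), a target confidence map combining a two-dimensional Gaussian with a zero-response background can be generated as follows:

```python
import numpy as np

def gaussian_heatmap(height, width, keypoint_xy, sigma=2.0, radius=5):
    """Target confidence map for one annotated key point: a 2-D Gaussian centred on
    the key point, with a zero-response background outside `radius` pixels."""
    kx, ky = keypoint_xy
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - kx) ** 2 + (ys - ky) ** 2                  # squared distance to the key point
    heatmap = np.exp(-d2 / (2.0 * sigma ** 2))            # closer pixels get higher values
    heatmap[d2 > radius ** 2] = 0.0                       # zero-response background
    return heatmap.astype(np.float32)

# e.g. the shoulder joint annotated at (50, 50) in a 100 x 100 sample image
target = gaussian_heatmap(100, 100, (50, 50))
```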
Besides the Gaussian distribution, other methods can also be adopted to generate the confidence feature map according to the position of the annotated key point, as long as the farther a pixel point is from the key point, the lower the probability that the pixel point belongs to the key point.
If multiple keypoints are labeled in the sample image, a confidence feature map may be generated for each keypoint. And acquiring a plurality of sample images and corresponding confidence coefficient characteristic graphs, and training a neural network, wherein the neural network is trained to determine the confidence coefficient characteristic graphs corresponding to the key points according to the images.
After the training is completed, the actually shot image can be processed according to the neural network obtained by training. As shown in fig. 6, the ROI image is input to a neural network, and a confidence feature map corresponding to a plurality of key points can be obtained.
Step 503, determining the key point information of the user according to the confidence characteristic maps corresponding to the plurality of key points.
As shown in fig. 6, after determining the confidence feature maps corresponding to the plurality of key points, the position information of the plurality of key points may be determined according to the confidence feature maps.
For example, when the posture information of the target requires 4 key points, namely the left shoulder joint, the right shoulder joint, the left elbow joint and the right elbow joint, the shot image is input into the neural network, confidence feature maps corresponding to the 4 key points can be obtained through the neural network, and the positions of the 4 key points can be determined from the 4 confidence feature maps respectively.
Optionally, the determining the key point information of the user according to the confidence feature maps corresponding to the plurality of key points in this step may include: determining a pixel point with the highest probability belonging to any key point in a confidence characteristic graph corresponding to the key point; and if the probability corresponding to the pixel point with the highest probability is greater than a preset threshold, the position information of the key point of the user is the position information of the pixel point with the highest probability.
For example, in the confidence feature map corresponding to the shoulder joint, if the pixel point with the highest probability is located at (10, 10) and its probability is 0.7, which is greater than the preset threshold, then the confidence that this pixel point belongs to the shoulder joint is high enough, and the coordinates of the shoulder joint are considered to be (10, 10). If the probability of the pixel point with the highest probability is smaller than the preset threshold, it indicates that no pixel point has a sufficiently high probability of belonging to the shoulder joint, and the shoulder joint is considered to be absent from the image. The preset threshold may be set according to actual needs, and may be 0.5, for example.
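A minimal sketch of this decoding step is given below; the 0.5 threshold follows the example above, while the stride of 4 (undoing the 1/4 downscaling described later in this embodiment) and the function name are assumptions.

```python
import numpy as np

def decode_keypoint(heatmap, threshold=0.5, stride=4):
    """heatmap: h' x w' confidence map for one key point category.
    Returns (x, y) in ROI-image pixels, or None if the key point is judged absent.
    `stride` undoes the downscaling between the ROI image and the output map."""
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    if heatmap[iy, ix] <= threshold:                      # best pixel not confident enough
        return None
    return ix * stride, iy * stride                       # map back to ROI coordinates
```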
After the key point information of the target is determined according to the neural network, the corresponding posture information can be determined according to the key point information. Specifically, after the key points are obtained, limbs can be formed according to the connection relationships between the key points, and the formed limbs can serve as the basis for judging the posture.
The method for determining the user's key point information provided in fig. 5 determines the positions of the key points through the confidence feature maps. Compared with a scheme that directly uses the key point coordinates as the training target, it is less prone to over-fitting, has higher recognition accuracy and stronger anti-interference performance, does not require collecting a large number of samples and annotating the corresponding data, and reduces the workload of manual annotation. Through the two-dimensional Gaussian distribution, the confidence feature map corresponding to a sample image can be determined quickly and accurately, which makes the training process more stable, avoids the influence of manual annotation errors, provides anti-interference capability, and improves the accuracy of key point recognition.
On the basis of the technical scheme provided by the embodiment, optionally, the number of the pixel points of the confidence characteristic diagram output by the neural network may be smaller than the number of the pixel points of the input ROI image.
For example, the ROI image is an RGB image of size h × w × 3, where h and w are the height and width of the input; the neural network outputs confidence feature maps of size h' × w' × k, where h' and w' are the height and width of the output, h' = 0.25 × h, w' = 0.25 × w, and k is the number of key point categories. In this embodiment, k = 4, corresponding to the left and right shoulder joints and the left and right elbow joints.
Assuming that the input ROI image has 100 × 100 pixel points, 8 confidence feature maps are output, each including 25 × 25 pixel points. In training, the size of the target result can be set to 1/4 of the input image, and the function of reducing the image through the neural network can be realized.
Setting the number of pixel points in the output confidence feature map to be smaller than that of the input ROI image can improve the processing efficiency of the shot image and reduce the storage space occupied by the output result. Moreover, since manually annotated key points contain certain errors, reducing the size of the output image can mitigate these errors to some extent and improve the recognition accuracy.
Example four
Fig. 8 is a schematic structural diagram of an image processing apparatus according to a fourth embodiment of the present invention. The image processing apparatus may execute the image processing method corresponding to fig. 1, and as shown in fig. 8, the image processing apparatus may include:
a memory 11 for storing a computer program;
a processor 12 for executing the computer program stored in the memory to implement:
acquiring a shot video stream;
determining a target with posture information meeting preset conditions according to at least one frame of image in the video stream;
and starting the function corresponding to the preset condition.
Optionally, the image processing apparatus may further include a communication interface 13 for communicating with other devices or a communication network.
In an implementable manner, when the function corresponding to the preset condition is enabled, the processor 12 is specifically configured to:
and following the target.
In an implementation manner, when determining, according to at least one frame of image in the video stream, that the pose information meets the target of the preset condition, the processor 12 is specifically configured to:
determining, for each of the at least one frame of images, pose information for at least one user in the image;
and determining a target to be followed according to the determined posture information of at least one user, wherein the target to be followed is the user of which the posture information meets the preset condition.
In an implementable manner, when determining the target to be followed based on the posture information of the at least one user, the processor 12 is specifically configured to:
and if the gesture information of only one user meets the preset condition, determining that the user is the target to be followed.
In an implementable manner, when determining the target to be followed based on the posture information of the at least one user, the processor 12 is specifically configured to:
and if the posture information of a plurality of users meets the preset condition, determining that the user which is detected to meet the preset condition firstly in the plurality of users is the target to be followed.
In an implementable manner, when determining the target to be followed based on the posture information of the at least one user, the processor 12 is specifically configured to:
and if the posture information of a plurality of users meets the preset condition, determining the user closest to the center of the shooting picture among the users as the target to be followed.
In an implementable manner, when determining the target to be followed based on the posture information of the at least one user, the processor 12 is specifically configured to:
if the posture information of a plurality of users meets a preset condition and the plurality of users comprise preset users, determining that the preset users are targets to be followed.
In an implementation manner, when it is determined that the user is the target to be followed if there is and only one user's posture information satisfies the preset condition, the processor 12 is specifically configured to:
and if one user maintains the preset posture for more than the preset time, determining that the user is the target to be followed.
In one practical manner, the preset gesture is a single-hand waving gesture.
In one practical implementation, in determining the pose information of at least one user in the image, the processor 12 is specifically configured to:
determining at least one user to be analyzed from the image;
for each user of the at least one user to be analyzed, determining key point information of the user, and determining posture information of the user according to the key point information of the user, wherein the key point information of the user comprises position information of a plurality of key points of the user.
In an implementable manner, in determining at least one user to be analyzed from the image, the processor 12 is specifically configured to:
identifying all users in the image through a multi-target tracking algorithm;
selecting a preset number of users from the total users as the at least one user to be analyzed.
In an implementable manner, when a preset number of users are selected from the total number of users as the at least one user to be analyzed, the processor 12 is specifically configured to:
and if the number of all users in the image is larger than the preset number, selecting the preset number of users close to the center of the image as the at least one user to be analyzed.
In one practical implementation, when determining the key point information of the user, the processor 12 is specifically configured to:
determining a region of interest ROI image where the user is located;
and determining key point information in the ROI image according to a neural network.
In an implementable manner, in determining the ROI image of the region of interest where the user is located, the processor 12 is specifically configured to:
and cutting the shot image according to the determined boundary box of the user according to the multi-target tracking algorithm to obtain the ROI image corresponding to the user.
In one practical implementation, when determining the keypoint information in the ROI image according to a neural network, the processor 12 is specifically configured to:
inputting the ROI image into a neural network to obtain confidence characteristic maps corresponding to a plurality of key points, wherein the confidence characteristic map corresponding to any key point comprises the probability that each pixel point belongs to the key point;
and determining the key point information of the user according to the confidence characteristic graphs corresponding to the key points.
In an implementable manner, when determining the key point information of the user according to the confidence feature maps corresponding to the plurality of key points, the processor 12 is specifically configured to:
determining a pixel point with the highest probability belonging to any key point in a confidence characteristic graph corresponding to the key point;
and if the probability corresponding to the pixel point with the highest probability is greater than a preset threshold, the position information of the key point of the user is the position information of the pixel point with the highest probability.
In one implementable manner, prior to determining keypoint information in the ROI image from a neural network, the processor 12 is further configured to:
obtaining a training sample, wherein the training sample comprises a sample image and a confidence coefficient characteristic diagram corresponding to the sample image;
and training the neural network according to the training sample.
In one practical implementation, when obtaining the training samples, the processor 12 is specifically configured to:
acquiring a sample image and position information of key points in the sample image;
determining a confidence characteristic diagram corresponding to the sample image according to the position information of the key points;
and in the confidence characteristic graph corresponding to the sample image, the probability of corresponding to the pixel points which are closer to the key points is higher.
In an implementation manner, when determining the confidence feature map corresponding to the sample image according to the location information of the key point, the processor 12 is specifically configured to:
and determining a confidence characteristic map corresponding to the sample image through two-dimensional Gaussian distribution according to the position information of the key points.
In an implementable manner, the number of pixels of the confidence feature map output by the neural network is less than the number of pixels of the ROI image.
In an implementable manner, when determining the pose information of the user based on the key point information of the user, the processor 12 is specifically configured to:
and if the elbow joint on any side of the user is higher than the shoulder joint on the same side, and the elbow joint on the other side is lower than the shoulder joint on the same side, determining that the user is in the hand waving posture of one hand.
The image processing apparatus shown in fig. 8 can execute the method of the embodiment shown in fig. 1-7, and the related description of the embodiment shown in fig. 1-7 can be referred to for the part not described in detail in this embodiment. The implementation process and technical effect of the technical solution refer to the descriptions in the embodiments shown in fig. 1 to 7, and are not described herein again.
An embodiment of the present invention further provides an electronic device, including the image processing apparatus according to any of the above embodiments.
Optionally, the electronic device is an unmanned aerial vehicle or an unmanned vehicle.
Optionally, the electronic device may further include:
shooting means for sending a shot video stream to the processor;
and the driving device is used for driving the electronic equipment to follow the target under the control of the processor.
The driving device can be a motor and the like, and the electronic equipment can move through the driving device, so that the target can be followed.
The structure and function of each component in the electronic device provided by the embodiment of the present invention may refer to the foregoing embodiments, and are not described herein again.
In addition, an embodiment of the present invention provides a storage medium, which is a computer-readable storage medium, and program instructions are stored in the computer-readable storage medium, and the program instructions are used to implement the image processing method in the embodiments shown in fig. 1 to 7.
The technical solutions and technical features in the above embodiments may, in the case of no conflict, be used alone or in combination, and as long as they do not exceed the scope recognized by those skilled in the art, all such solutions are equivalent embodiments falling within the protection scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed related devices and methods can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer processor to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (46)

1. An image processing method, comprising:
acquiring a shot video stream;
determining a target with posture information meeting preset conditions according to at least one frame of image in the video stream;
and enabling a function corresponding to the preset condition.
2. The method of claim 1, wherein enabling the function corresponding to the preset condition comprises:
and following the target.
3. The method of claim 2, wherein determining a target for which the pose information satisfies a preset condition according to at least one frame of image in the video stream comprises:
determining, for each of the at least one frame of images, pose information for at least one user in the image;
and determining a target to be followed according to the determined posture information of at least one user, wherein the target to be followed is the user of which the posture information meets the preset condition.
4. The method of claim 3, wherein determining the target to follow based on the pose information of the at least one user comprises:
and if the gesture information of only one user meets the preset condition, determining that the user is the target to be followed.
5. The method of claim 3, wherein determining the target to follow based on the pose information of the at least one user comprises:
and if the posture information of a plurality of users meets the preset condition, determining that the user, among the plurality of users, who is first detected to meet the preset condition is the target to be followed.
6. The method of claim 3, wherein determining the target to follow based on the pose information of the at least one user comprises:
and if the posture information of a plurality of users meets the preset condition, determining the user closest to the center of the shooting picture among the users as the target to be followed.
7. The method of claim 3, wherein determining the target to follow based on the pose information of the at least one user comprises:
if the posture information of a plurality of users meets the preset condition and the plurality of users include a preset user, determining that the preset user is the target to be followed.
8. The method according to claim 4, wherein, if the posture information of only one user meets the preset condition, determining that the user is the target to be followed comprises:
and if one user maintains the preset posture for more than the preset time, determining that the user is the target to be followed.
9. The method of claim 8, wherein the preset posture is a single-handed waving posture.
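By way of illustration only, the temporal condition of claims 8 and 9 — maintaining the preset posture for longer than a preset duration — could be tracked per user as sketched below; the 2-second default and the state dictionary are assumptions for illustration.

```python
import time

def confirm_target(user_id, is_waving_now, state, hold_seconds=2.0):
    """Return True once `user_id` has maintained the preset posture (here,
    single-handed waving) for longer than `hold_seconds`."""
    now = time.time()
    if not is_waving_now:
        state.pop(user_id, None)   # posture broken: reset the timer
        return False
    started = state.setdefault(user_id, now)
    return (now - started) > hold_seconds

# Usage: call once per analyzed frame with a shared `state` dict.
state = {}
confirm_target("user-1", True, state)   # starts the timer, returns False
```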
10. The method of claim 3, wherein determining pose information for at least one user in the image comprises:
determining at least one user to be analyzed from the image;
for each user of the at least one user to be analyzed, determining key point information of the user, and determining posture information of the user according to the key point information of the user, wherein the key point information of the user comprises position information of a plurality of key points of the user.
11. The method of claim 10, wherein determining at least one user to analyze from the image comprises:
identifying all users in the image through a multi-target tracking algorithm;
selecting a preset number of users from all the users as the at least one user to be analyzed.
12. The method of claim 11, wherein selecting a preset number of users from the total number of users as the at least one user to be analyzed comprises:
and if the number of all users in the image is larger than the preset number, selecting the preset number of users close to the center of the image as the at least one user to be analyzed.
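By way of illustration only, the selection in claims 11 and 12 — keeping at most a preset number of users and preferring those nearest the image center — could be sketched as follows, assuming each detected user is represented by a bounding box (x0, y0, x1, y1); the function and parameter names are illustrative.

```python
def select_users_to_analyze(bounding_boxes, image_width, image_height, max_users):
    """Keep at most `max_users` bounding boxes, ordered by the distance of
    each box center from the image center."""
    cx, cy = image_width / 2.0, image_height / 2.0

    def distance_to_center(box):
        x0, y0, x1, y1 = box
        bx, by = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        return (bx - cx) ** 2 + (by - cy) ** 2

    if len(bounding_boxes) <= max_users:
        return list(bounding_boxes)
    return sorted(bounding_boxes, key=distance_to_center)[:max_users]
```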
13. The method of claim 10, wherein determining the key point information of the user comprises:
determining a region of interest ROI image where the user is located;
and determining key point information in the ROI image according to a neural network.
14. The method of claim 13, wherein determining the ROI image of the region of interest where the user is located comprises:
and cropping the captured image according to a bounding box of the user determined by the multi-target tracking algorithm, to obtain the ROI image corresponding to the user.
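By way of illustration only, the cropping of claim 14 amounts to slicing the captured frame with the tracker's bounding box; a minimal sketch assuming the frame is an H×W×C array and the box is given in pixel coordinates:

```python
def crop_roi(frame, box):
    """Crop the captured frame to the user's bounding box (x0, y0, x1, y1),
    yielding the ROI image fed to the key-point network."""
    x0, y0, x1, y1 = (int(round(v)) for v in box)
    return frame[y0:y1, x0:x1]
```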
15. The method of claim 13, wherein determining keypoint information in the ROI image from a neural network comprises:
inputting the ROI image into a neural network to obtain confidence feature maps corresponding to a plurality of key points, wherein the confidence feature map corresponding to any key point comprises the probability that each pixel point belongs to the key point;
and determining the key point information of the user according to the confidence feature maps corresponding to the plurality of key points.
16. The method of claim 15, wherein determining the keypoint information of the user from the confidence feature maps corresponding to the plurality of keypoints comprises:
for any key point, determining, in the confidence feature map corresponding to the key point, the pixel point with the highest probability of belonging to the key point;
and if the probability corresponding to the pixel point with the highest probability is greater than a preset threshold, taking the position information of the pixel point with the highest probability as the position information of the key point of the user.
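By way of illustration only, the key-point recovery of claims 15 and 16 — taking the most probable pixel in each confidence feature map and keeping it only above a preset threshold — could be sketched as below; the threshold value and the scale parameter (which maps a smaller output map back to ROI-image coordinates, cf. claim 20) are assumptions.

```python
import numpy as np

def keypoints_from_confidence_maps(conf_maps, threshold=0.5, scale=1.0):
    """For each key point's confidence feature map, take the pixel with the
    highest probability; keep its (x, y) position only if that probability
    exceeds the threshold, rescaling back to ROI-image coordinates."""
    keypoints = []
    for cmap in conf_maps:                      # one map per key point
        y, x = np.unravel_index(np.argmax(cmap), cmap.shape)
        if cmap[y, x] > threshold:
            keypoints.append((x * scale, y * scale))
        else:
            keypoints.append(None)              # key point not confidently located
    return keypoints
```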
17. The method of claim 13, further comprising, prior to determining keypoint information in the ROI image from a neural network:
obtaining a training sample, wherein the training sample comprises a sample image and a confidence feature map corresponding to the sample image;
and training the neural network according to the training sample.
18. The method of claim 17, wherein obtaining training samples comprises:
acquiring a sample image and position information of key points in the sample image;
determining a confidence feature map corresponding to the sample image according to the position information of the key points;
wherein, in the confidence feature map corresponding to the sample image, pixel points closer to the key points correspond to higher probabilities.
19. The method of claim 18, wherein determining the confidence feature map corresponding to the sample image according to the location information of the keypoint comprises:
and determining the confidence feature map corresponding to the sample image through a two-dimensional Gaussian distribution according to the position information of the key points.
20. The method of claim 15, wherein the number of pixels of the confidence feature map output by the neural network is less than the number of pixels of the ROI image.
21. The method of claim 10, wherein determining the pose information of the user based on the keypoint information of the user comprises:
and if the elbow joint on one side of the user is higher than the shoulder joint on the same side and the elbow joint on the other side is lower than the shoulder joint on that side, determining that the user is in a single-handed waving posture.
22. An image processing apparatus characterized by comprising:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory to implement:
acquiring a shot video stream;
determining a target with posture information meeting preset conditions according to at least one frame of image in the video stream;
and enabling a function corresponding to the preset condition.
23. The apparatus of claim 22, wherein when the function corresponding to the preset condition is enabled, the processor is specifically configured to:
and following the target.
24. The apparatus according to claim 23, wherein, when determining the target whose posture information meets the preset condition according to at least one frame of image in the video stream, the processor is specifically configured to:
determining, for each of the at least one frame of images, pose information for at least one user in the image;
and determining a target to be followed according to the determined posture information of at least one user, wherein the target to be followed is the user of which the posture information meets the preset condition.
25. The apparatus of claim 24, wherein in determining the target to follow based on the pose information of the at least one user, the processor is specifically configured to:
and if the gesture information of only one user meets the preset condition, determining that the user is the target to be followed.
26. The apparatus of claim 24, wherein in determining the target to follow based on the pose information of the at least one user, the processor is specifically configured to:
and if the posture information of a plurality of users meets the preset condition, determining that the user, among the plurality of users, who is first detected to meet the preset condition is the target to be followed.
27. The apparatus of claim 24, wherein in determining the target to follow based on the pose information of the at least one user, the processor is specifically configured to:
and if the posture information of a plurality of users meets the preset condition, determining the user closest to the center of the shooting picture among the users as the target to be followed.
28. The apparatus of claim 24, wherein in determining the target to follow based on the pose information of the at least one user, the processor is specifically configured to:
if the posture information of a plurality of users meets the preset condition and the plurality of users include a preset user, determining that the preset user is the target to be followed.
29. The apparatus according to claim 25, wherein, when determining that the user is the target to be followed if the posture information of only one user meets the preset condition, the processor is specifically configured to:
and if one user maintains the preset posture for more than the preset time, determining that the user is the target to be followed.
30. The apparatus of claim 29, wherein the preset posture is a single-handed waving posture.
31. The apparatus of claim 24, wherein in determining pose information for at least one user in the image, the processor is specifically configured to:
determining at least one user to be analyzed from the image;
for each user of the at least one user to be analyzed, determining key point information of the user, and determining posture information of the user according to the key point information of the user, wherein the key point information of the user comprises position information of a plurality of key points of the user.
32. The apparatus of claim 31, wherein in determining at least one user to analyze from the image, the processor is specifically configured to:
identifying all users in the image through a multi-target tracking algorithm;
selecting a preset number of users from all the users as the at least one user to be analyzed.
33. The apparatus of claim 32, wherein in selecting a preset number of users from the total number of users as the at least one user to be analyzed, the processor is specifically configured to:
and if the number of all users in the image is larger than the preset number, selecting the preset number of users close to the center of the image as the at least one user to be analyzed.
34. The apparatus of claim 31, wherein in determining the keypoint information of the user, the processor is specifically configured to:
determining a region of interest ROI image where the user is located;
and determining key point information in the ROI image according to a neural network.
35. The apparatus according to claim 34, wherein the processor, in determining the ROI image of the region of interest where the user is located, is specifically configured to:
and cropping the captured image according to a bounding box of the user determined by the multi-target tracking algorithm, to obtain the ROI image corresponding to the user.
36. The apparatus of claim 34, wherein in determining keypoint information in the ROI image from a neural network, the processor is specifically configured to:
inputting the ROI image into a neural network to obtain confidence feature maps corresponding to a plurality of key points, wherein the confidence feature map corresponding to any key point comprises the probability that each pixel point belongs to the key point;
and determining the key point information of the user according to the confidence feature maps corresponding to the plurality of key points.
37. The apparatus according to claim 36, wherein when determining the keypoint information of the user from the confidence feature maps corresponding to the plurality of keypoints, the processor is specifically configured to:
for any key point, determining, in the confidence feature map corresponding to the key point, the pixel point with the highest probability of belonging to the key point;
and if the probability corresponding to the pixel point with the highest probability is greater than a preset threshold, taking the position information of the pixel point with the highest probability as the position information of the key point of the user.
38. The apparatus of claim 34, wherein prior to determining keypoint information in the ROI image from a neural network, the processor is further configured to:
obtaining a training sample, wherein the training sample comprises a sample image and a confidence feature map corresponding to the sample image;
and training the neural network according to the training sample.
39. The apparatus of claim 38, wherein in obtaining training samples, the processor is specifically configured to:
acquiring a sample image and position information of key points in the sample image;
determining a confidence feature map corresponding to the sample image according to the position information of the key points;
wherein, in the confidence feature map corresponding to the sample image, pixel points closer to the key points correspond to higher probabilities.
40. The apparatus according to claim 39, wherein when determining the confidence feature map corresponding to the sample image according to the location information of the keypoint, the processor is specifically configured to:
and determining the confidence feature map corresponding to the sample image through a two-dimensional Gaussian distribution according to the position information of the key points.
41. The apparatus of claim 36, wherein the number of pixels of the confidence feature map output by the neural network is less than the number of pixels of the ROI image.
42. The apparatus of claim 31, wherein in determining the pose information of the user based on the keypoint information of the user, the processor is specifically configured to:
and if the elbow joint on one side of the user is higher than the shoulder joint on the same side and the elbow joint on the other side is lower than the shoulder joint on that side, determining that the user is in a single-handed waving posture.
43. An electronic device characterized by comprising the image processing apparatus of any one of claims 22-42.
44. The electronic device of claim 43, wherein the electronic device is an unmanned aerial vehicle or an unmanned vehicle.
45. The electronic device of claim 43, further comprising:
a shooting device configured to send a captured video stream to the processor;
and a driving device configured to drive the electronic device to follow the target under the control of the processor.
46. A computer-readable storage medium, wherein program instructions for implementing the image processing method according to any one of claims 1 to 21 are stored in the computer-readable storage medium.
CN202080004938.4A 2020-04-09 2020-04-09 Image processing method, image processing device, electronic equipment and storage medium Pending CN112655021A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/083997 WO2021203368A1 (en) 2020-04-09 2020-04-09 Image processing method and apparatus, electronic device and storage medium

Publications (1)

Publication Number Publication Date
CN112655021A true CN112655021A (en) 2021-04-13

Family

ID=75368434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080004938.4A Pending CN112655021A (en) 2020-04-09 2020-04-09 Image processing method, image processing device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112655021A (en)
WO (1) WO2021203368A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11157729B2 (en) * 2020-01-17 2021-10-26 Gm Cruise Holdings Llc Gesture based authentication for autonomous vehicles
CN113886212A (en) * 2021-10-25 2022-01-04 北京字跳网络技术有限公司 User state control method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9710709B1 (en) * 2014-03-07 2017-07-18 Trace Live Network Inc. Cascade recognition for personal tracking via unmanned aerial vehicle (UAV)
EP3145811A4 (en) * 2014-05-23 2018-05-23 LR Acquisition, LLC Unmanned aerial copter for photography and/or videography
CN107835371A (en) * 2017-11-30 2018-03-23 广州市华科尔科技股份有限公司 A kind of multi-rotor unmanned aerial vehicle gesture self-timer method
CN109448007B (en) * 2018-11-02 2020-10-09 北京迈格威科技有限公司 Image processing method, image processing apparatus, and storage medium

Also Published As

Publication number Publication date
WO2021203368A1 (en) 2021-10-14

Similar Documents

Publication Publication Date Title
CN107239728B (en) Unmanned aerial vehicle interaction device and method based on deep learning attitude estimation
WO2019128508A1 (en) Method and apparatus for processing image, storage medium, and electronic device
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
Siagian et al. Biologically inspired mobile robot vision localization
CN109325456B (en) Target identification method, target identification device, target identification equipment and storage medium
CN109241820B (en) Unmanned aerial vehicle autonomous shooting method based on space exploration
CN111328396A (en) Pose estimation and model retrieval for objects in images
CN109934065B (en) Method and device for gesture recognition
CN111797657A (en) Vehicle peripheral obstacle detection method, device, storage medium, and electronic apparatus
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
WO2020125499A9 (en) Operation prompting method and glasses
CN110555481A (en) Portrait style identification method and device and computer readable storage medium
WO2021175071A1 (en) Image processing method and apparatus, storage medium, and electronic device
Jebari et al. Multi-sensor semantic mapping and exploration of indoor environments
US20220262093A1 (en) Object detection method and system, and non-transitory computer-readable medium
CN104281839A (en) Body posture identification method and device
CN102054165A (en) Image processing apparatus and image processing method
US20190066311A1 (en) Object tracking
JP2021503139A (en) Image processing equipment, image processing method and image processing program
CN112639874A (en) Object following method, object following apparatus, removable device, and storage medium
CN112655021A (en) Image processing method, image processing device, electronic equipment and storage medium
Kerdvibulvech A methodology for hand and finger motion analysis using adaptive probabilistic models
CN108563238B (en) Method, device, equipment and system for remotely controlling unmanned aerial vehicle
CN113591562A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
US11462040B2 (en) Distractor classifier

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination