CN112418153A - Image processing method, image processing device, electronic equipment and computer storage medium


Info

Publication number
CN112418153A
CN112418153A
Authority
CN
China
Prior art keywords
video data
image
frame
feature vector
coordinate information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011417021.5A
Other languages
Chinese (zh)
Inventor
吴天行
张研
吴玉东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Technology Development Co Ltd
Original Assignee
Shanghai Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Technology Development Co Ltd filed Critical Shanghai Sensetime Technology Development Co Ltd
Priority to CN202011417021.5A priority Critical patent/CN112418153A/en
Publication of CN112418153A publication Critical patent/CN112418153A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the disclosure provides an image processing method, an image processing device, an electronic device and a computer storage medium, wherein the method comprises the following steps: acquiring first video data and second video data; carrying out normalization processing on coordinates of human key points of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data; normalizing the coordinates of the human key points of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data; and determining the human body action similarity of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data.

Description

Image processing method, image processing device, electronic equipment and computer storage medium
Technical Field
The present disclosure relates to computer vision processing technologies, and in particular, to an image processing method and apparatus, an electronic device, and a computer storage medium.
Background
At present, human motion and posture recognition has become a popular research topic in many fields, and with the emergence of various human posture recognition algorithms it has made considerable progress. The measurement of human motion-posture similarity is widely applied in motion learning, fitness games, human-computer interaction, virtual reality, and the like; for example, in an online video teaching scenario, the comparison of motion-posture similarity is particularly important.
However, in the related art, the similarity of the human motions in videos is calculated on the basis of human joint angles, and a joint angle cannot intuitively and accurately reflect how standard a motion is; for example, a clockwise included angle and a counterclockwise included angle of the same magnitude correspond to different motions.
Disclosure of Invention
The embodiment of the disclosure is expected to provide a technical scheme of image processing, which can accurately obtain human motion similarity in different videos.
The embodiment of the present disclosure provides an image processing method, including:
acquiring first video data and second video data;
carrying out normalization processing on coordinates of human key points of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data; normalizing the coordinates of the human key points of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data;
and determining the human body action similarity of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data.
In some embodiments, the determining the similarity of the human body motions of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data includes:
merging the normalized coordinate information of each frame of image in the first video data to obtain the key point feature vector of each frame of image in the first video data; merging the normalized coordinate information of each frame of image in the second video data to obtain the key point feature vector of each frame of image in the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the key point feature vector of each frame of image in the first video data and the key point feature vector of each frame of image in the second video data.
In some embodiments, the determining the similarity of human body actions of the first video data and the second video data according to the key point feature vector of each frame of image in the first video data and the key point feature vector of each frame of image in the second video data includes:
merging the key point feature vectors of each frame of image in the first video data to obtain a feature vector time sequence of the first video data;
merging the key point feature vectors of each frame of image in the second video data to obtain a feature vector time sequence of the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data.
In some embodiments, the merging the feature vectors of the key points of each frame of image in the first video data to obtain the time series of feature vectors of the first video data includes:
normalizing the feature vectors of the key points of each frame of image in the first video data to obtain normalized feature vectors of each frame of image in the first video data, and merging the normalized feature vectors of each frame of image in the first video data to obtain a feature vector time sequence of the first video data;
the merging the feature vectors of the key points of each frame of image in the second video data to obtain the feature vector time sequence of the second video data includes:
and normalizing the feature vectors of the key points of the images in the second video data to obtain normalized feature vectors of the images in the second video data, and merging the normalized feature vectors of the images in the second video data to obtain a feature vector time sequence of the second video data.
In some embodiments, the determining the similarity of human body actions of the first video data and the second video data according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data includes:
determining the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data by adopting a dynamic time warping (DTW) method according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the distance of the normalized feature vectors of the corresponding frame images in the first video data and the second video data.
In some embodiments, the determining the human motion similarity of the first video data and the second video data according to the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data includes:
determining similarity scoring values of each corresponding frame image in the first video data and the second video data according to the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data;
and determining the human body action similarity of the first video data and the second video data according to the similarity scoring values of the corresponding frame images in the first video data and the second video data.
In some embodiments, the determining the human motion similarity of the first video data and the second video data according to the similarity score values of the corresponding frame images in the first video data and the second video data includes:
and determining the human body action similarity of the first video data and the second video data according to the average value of the similarity scoring values of the corresponding frame images in the first video data and the second video data.
In some embodiments, the obtaining the first video data and the second video data comprises:
acquiring first initial video data and second initial video data to be compared;
and preprocessing the first initial video data and the second initial video data to obtain the first video data and the second video data with the same frame number.
In some embodiments, the normalizing the coordinates of the human key points of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data includes:
carrying out noise reduction smoothing processing on the coordinates of the human key points of each frame of image in the first video data, and carrying out normalization processing on the coordinates of the human key points after the noise reduction smoothing processing corresponding to the first video data to obtain normalized coordinate information of each frame of image in the first video data;
the normalizing the coordinates of the key points of the human body of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data includes:
and performing noise reduction smoothing processing on the coordinates of the human key points of each frame of image in the second video data, and performing normalization processing on the coordinates of the human key points after the noise reduction smoothing processing corresponding to the second video data to obtain normalized coordinate information of each frame of image in the second video data.
An embodiment of the present disclosure further provides an image processing apparatus, including:
the acquisition module is used for acquiring first video data and second video data;
the first processing module is used for carrying out normalization processing on the coordinates of the human key points of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data; normalizing the coordinates of the human key points of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data;
and the second processing module is used for determining the human body action similarity of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data.
In some embodiments, the second processing module is configured to determine human motion similarity of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data, and includes:
merging the normalized coordinate information of each frame of image in the first video data to obtain the key point feature vector of each frame of image in the first video data; merging the normalized coordinate information of each frame of image in the second video data to obtain the key point feature vector of each frame of image in the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the key point feature vector of each frame of image in the first video data and the key point feature vector of each frame of image in the second video data.
In some embodiments, the second processing module is configured to determine human motion similarity of the first video data and the second video data according to the key point feature vector of each frame of image in the first video data and the key point feature vector of each frame of image in the second video data, and includes:
merging the key point feature vectors of each frame of image in the first video data to obtain a feature vector time sequence of the first video data;
merging the key point feature vectors of each frame of image in the second video data to obtain a feature vector time sequence of the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data.
In some embodiments, the second processing module is configured to combine the feature vectors of the key points of each frame of image in the first video data to obtain the feature vector time sequence of the first video data, and includes:
normalizing the feature vectors of the key points of each frame of image in the first video data to obtain normalized feature vectors of each frame of image in the first video data, and merging the normalized feature vectors of each frame of image in the first video data to obtain a feature vector time sequence of the first video data;
the second processing module is configured to merge the feature vectors of the key points of each frame of image in the second video data to obtain a feature vector time sequence of the second video data, and includes:
and normalizing the feature vectors of the key points of the images in the second video data to obtain normalized feature vectors of the images in the second video data, and merging the normalized feature vectors of the images in the second video data to obtain a feature vector time sequence of the second video data.
In some embodiments, the second processing module is configured to determine human motion similarity of the first video data and the second video data according to the feature vector time series of the first video data and the feature vector time series of the second video data, and includes:
determining the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data by adopting a DTW (dynamic time warping) method according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the distance of the normalized feature vectors of the corresponding frame images in the first video data and the second video data.
In some embodiments, the second processing module is configured to determine human motion similarity of the first video data and the second video data according to a distance between normalized feature vectors of corresponding frame images in the first video data and the second video data, and includes:
determining similarity scoring values of each corresponding frame image in the first video data and the second video data according to the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data;
and determining the human body action similarity of the first video data and the second video data according to the similarity scoring values of the corresponding frame images in the first video data and the second video data.
In some embodiments, the second processing module is configured to determine human motion similarity of the first video data and the second video data according to similarity score values of corresponding frame images in the first video data and the second video data, and includes:
and determining the human body action similarity of the first video data and the second video data according to the average value of the similarity scoring values of the corresponding frame images in the first video data and the second video data.
In some embodiments, the obtaining module is configured to obtain the first video data and the second video data, and includes:
acquiring first initial video data and second initial video data to be compared;
and preprocessing the first initial video data and the second initial video data to obtain the first video data and the second video data with the same frame number.
In some embodiments, the first processing module is configured to perform normalization processing on coordinates of human body key points of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data, and includes:
carrying out noise reduction smoothing processing on the coordinates of the human key points of each frame of image in the first video data, and carrying out normalization processing on the coordinates of the human key points after the noise reduction smoothing processing corresponding to the first video data to obtain normalized coordinate information of each frame of image in the first video data;
the first processing module is configured to perform normalization processing on coordinates of human key points of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data, and includes:
and performing noise reduction smoothing processing on the coordinates of the human key points of each frame of image in the second video data, and performing normalization processing on the coordinates of the human key points after the noise reduction smoothing processing corresponding to the second video data to obtain normalized coordinate information of each frame of image in the second video data.
The disclosed embodiments also provide an electronic device comprising a processor and a memory for storing a computer program capable of running on the processor; wherein
the processor is configured to run the computer program to perform any one of the image processing methods described above.
The disclosed embodiments also provide a computer storage medium having a computer program stored thereon, which when executed by a processor implements any of the image processing methods described above.
In the image processing method, the image processing device, the electronic equipment and the computer storage medium provided by the embodiment of the disclosure, first video data and second video data are acquired; carrying out normalization processing on coordinates of human key points of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data; normalizing the coordinates of the human key points of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data; the normalized coordinate information represents the normalized coordinates of the human body key points; and determining the human body action similarity of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data.
It can be seen that, in the embodiment of the present disclosure, the coordinates of the human body key points of each frame of image in the video data can be normalized, and then, the comparison of the human body motion similarity in the video data can be performed based on the normalized coordinate information of each frame of image in the video data; the normalization processing is carried out on the coordinates of the human key points of each frame of image in the video data, so that the influence caused by different factors such as background, human position and scale in the video can be eliminated, and further, the human action similarity of different video data can be visually and accurately compared.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart of an image processing method according to an embodiment of the present disclosure;
Fig. 2 is a schematic diagram of an application scenario according to an embodiment of the present disclosure;
Fig. 3A is a frame of image in the first initial video data according to an embodiment of the present disclosure;
Fig. 3B is a frame of image in the second initial video data according to an embodiment of the present disclosure;
Fig. 4 is a schematic diagram of the composition structure of an image processing apparatus according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
The present disclosure will be described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the examples provided herein are merely illustrative of the present disclosure and are not intended to limit the present disclosure. In addition, the embodiments provided below are some embodiments for implementing the disclosure, not all embodiments for implementing the disclosure, and the technical solutions described in the embodiments of the disclosure may be implemented in any combination without conflict.
It should be noted that, in the embodiments of the present disclosure, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a method or apparatus including a series of elements includes not only the explicitly recited elements but also other elements not explicitly listed or inherent to the method or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other elements (e.g., steps in a method, or units in a device such as part of a circuit, a processor, a program, or software) in the method or device that includes the element.
For example, the image processing method provided by the embodiment of the present disclosure includes a series of steps, but the image processing method provided by the embodiment of the present disclosure is not limited to the described steps, and similarly, the image processing apparatus provided by the embodiment of the present disclosure includes a series of modules, but the apparatus provided by the embodiment of the present disclosure is not limited to include the explicitly described modules, and may also include modules that are required to be configured to acquire related information or perform processing based on the information.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
The disclosed embodiments may be implemented in computer systems comprising terminals and/or servers and may be operational with numerous other general purpose or special purpose computing system environments or configurations. Here, the terminal may be a thin client, a thick client, a hand-held or laptop device, a microprocessor-based system, a set-top box, a programmable consumer electronics, a network personal computer, a small computer system, etc., and the server may be a server computer system, a small computer system, a mainframe computer system, a distributed cloud computing environment including any of the above, etc.
The electronic devices of the terminal, server, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
In the related art, owing to the limitations of shooting conditions, most videos are captured from a single angle, so the captured videos are planar two-dimensional video data, and the motions in them are unlikely to be completely synchronized with the standard motions; therefore, the human posture-motion similarity needs to be computed for video data captured from different angles.
The human motion similarity across different video data can be computed based on human joint angles and the DTW (dynamic time warping) method. In some embodiments, the human joint angles serve as the comparison standard to eliminate the interference of the video background and the influence of different human body sizes and positions; the joint angle sequences are then processed with an exponential smoothing method, each video data sequence (i.e., the frames of images arranged in temporal order) is segmented using angle differences, and finally the distances between the sequences of different video data are obtained with DTW.
The method for calculating the human body motion similarity has the following problems:
1) The human joint angle is used as the comparison standard; however, the joint angle cannot sufficiently and reliably reflect how standard a motion is, so the resulting human motion similarity is not accurate enough. Moreover, the DTW algorithm determines the distance based on the shortest path between the sequences of different video data, which may give low discrimination between the human motions of different video data; for example, two completely different human motions may be recognized as highly similar.
2) Each video data sequence is segmented using angle differences; however, since each person's range of motion differs, it is difficult to determine a general segmentation standard for different human bodies. This results in different numbers of frames for the two video data to be compared, which is not conducive to comparing the human motion similarity in the video data.
In view of the above technical problems, in some embodiments of the present disclosure, an image processing method is provided, and embodiments of the present disclosure may be applied to scenes such as human motion and gesture recognition.
Fig. 1 is a flowchart of an image processing method according to an embodiment of the disclosure, and as shown in fig. 1, the flowchart may include:
step 101: first video data and second video data are acquired.
Here, the first video data and the second video data represent two video data that need to be compared; the first video data and the second video data each include a plurality of frames of human body images.
In some embodiments, the first video data or the second video data may be acquired by an image acquisition device such as a camera, or may be video data acquired from a network side or a local storage space; in other embodiments, the initial video data acquired by the image acquisition device may be acquired first, or the initial video data may be acquired from a network side, or the initial video data may be acquired from a local storage space, and then the initial video data may be preprocessed to obtain the first video data or the second video data.
In some embodiments, the number of frames of the first video data and the second video data may be the same or different, and the embodiments of the present disclosure are not limited thereto.
Step 102: carrying out normalization processing on coordinates of human key points of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data; and carrying out normalization processing on the coordinates of the human key points of each frame of image in the second video data to obtain the normalized coordinate information of each frame of image in the second video data.
In the embodiment of the disclosure, the human key point of each frame of image is at least one human key point which is specified in advance; in practical application, at least one human body key point can be specified from all human body key points according to practical requirements.
In some embodiments, the human body key points of each frame of image and their sequence numbers can be expressed as: {0, "nose (Nose)"}, {1, "neck (Neck)"}, {2, "right shoulder (RShoulder)"}, {3, "right elbow (RElbow)"}, {4, "right wrist (RWrist)"}, {5, "left shoulder (LShoulder)"}, {6, "left elbow (LElbow)"}, {7, "left wrist (LWrist)"}, {8, "hip middle (MidHip)"}, {9, "right hip (RHip)"}, {10, "right knee (RKnee)"}, {11, "right ankle (RAnkle)"}, {12, "left hip (LHip)"}, {13, "left knee (LKnee)"}, {14, "left ankle (LAnkle)"}, where 0 to 14 represent the sequence numbers of the respective human body key points.
In some embodiments, the human body key points and sequence numbers can further be expressed as: {15, "right eye (REye)"}, {16, "left eye (LEye)"}, {17, "right ear (REar)"}, {18, "left ear (LEar)"}, {19, "left big toe (LBigToe)"}, {20, "left small toe (LSmallToe)"}, {21, "left heel (LHeel)"}, {22, "right big toe (RBigToe)"}, {23, "right small toe (RSmallToe)"}, {24, "right heel (RHeel)"}, where 15 to 24 represent the sequence numbers of the respective human body key points.
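For illustration, the key point indices listed above can be collected into a single lookup table. The following minimal Python sketch simply transcribes the list; note that this layout matches the OpenPose BODY_25 convention, which is an assumption rather than something stated in the description:

```python
# Human body key point indices as listed above. The assumption that this
# is the OpenPose BODY_25 layout is based only on the names and ordering.
KEYPOINT_NAMES = {
    0: "Nose", 1: "Neck", 2: "RShoulder", 3: "RElbow", 4: "RWrist",
    5: "LShoulder", 6: "LElbow", 7: "LWrist", 8: "MidHip",
    9: "RHip", 10: "RKnee", 11: "RAnkle", 12: "LHip", 13: "LKnee",
    14: "LAnkle", 15: "REye", 16: "LEye", 17: "REar", 18: "LEar",
    19: "LBigToe", 20: "LSmallToe", 21: "LHeel",
    22: "RBigToe", 23: "RSmallToe", 24: "RHeel",
}
```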
In the embodiment of the disclosure, after the first video data and the second video data are obtained, each frame of image in the first video data and the second video data can be processed by adopting a human key point identification algorithm to obtain coordinates of human key points of each frame of image in the first video data and coordinates of human key points of each frame of image in the second video data; here, the human body key point recognition algorithm may be implemented using a neural network.
In some embodiments, after obtaining the coordinates of the key points of the human body of each frame of image in the first video data and the second video data, the noise reduction smoothing processing may be performed on the coordinates of the key points of the human body of each frame of image in the first video data and the second video data, and the normalization processing may be performed on the coordinates of the key points of the human body after the noise reduction smoothing processing corresponding to the first video data, so as to obtain normalized coordinate information of each frame of image in the first video data; and normalizing the coordinates of the key points of the human body after the noise reduction smoothing processing corresponding to the second video data to obtain normalized coordinate information of each frame of image in the second video data.
In the embodiment of the present disclosure, the principle of performing noise reduction smoothing processing on the coordinates of the human body key points of each frame of image in the video data is as follows: correcting the coordinates of the human body key points according to the coordinates of the pixel points in the neighborhood of the human body key points, for example, taking the average value of the coordinates of the pixel points in the neighborhood as the coordinates of the human body key points after noise reduction and smoothing treatment; it can be understood that, by performing noise reduction smoothing processing on the coordinates of the human body key points of each frame of image in the video data, inaccurate coordinates (i.e., noise data) of the human body key points can be corrected to a certain extent, so as to achieve the purpose of noise reduction. Illustratively, the method of the noise reduction smoothing process includes, but is not limited to, exponential smoothing, mean filtering, kalman filtering, and the like.
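As a concrete illustration of one of the listed options, a minimal exponential-smoothing sketch over a per-frame key point coordinate sequence is shown below; the array shape and the smoothing factor alpha are assumptions, not values fixed by the disclosure:

```python
import numpy as np

def exponential_smooth(coords, alpha=0.5):
    """Exponentially smooth key point coordinates over time.

    coords: array of shape (num_frames, num_keypoints, 2) holding the
    (x, y) coordinates of each key point in each frame.
    alpha: smoothing factor in (0, 1]; 0.5 is a hypothetical default.
    """
    smoothed = np.empty_like(coords, dtype=float)
    smoothed[0] = coords[0]
    for t in range(1, len(coords)):
        # Blend the current observation with the previous smoothed value,
        # which damps jitter (noise) in the detected key point positions.
        smoothed[t] = alpha * coords[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed
```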
In some embodiments, the normalized coordinate information represents normalized coordinates of the human body key points, and for example, the coordinates of the human body key points of each frame of image in the first video data or the second video data may be normalized based on the following formulas (1) and (2):
$$x_i' = \frac{x_i - \min\{x_i\}}{\max\{x_i\} - \min\{x_i\}} \qquad (1)$$

$$y_i' = \frac{y_i - \min\{y_i\}}{\max\{y_i\} - \min\{y_i\}} \qquad (2)$$

wherein $x_i$ denotes the abscissa of the $i$-th human body key point of each frame of image in the first video data or the second video data, and $x_i'$ denotes the abscissa of the $i$-th human body key point after normalization. When $x_i$ denotes the abscissa of the $i$-th human body key point of each frame of image in the first video data, $\min\{x_i\}$ and $\max\{x_i\}$ denote the minimum and maximum values of the abscissas of the human body key points in the first video data; when $x_i$ denotes the abscissa of the $i$-th human body key point of each frame of image in the second video data, $\min\{x_i\}$ and $\max\{x_i\}$ denote the corresponding minimum and maximum values in the second video data. Likewise, $y_i$ denotes the ordinate of the $i$-th human body key point of each frame of image in the first video data or the second video data, and $y_i'$ denotes the ordinate of the $i$-th human body key point after normalization; $\min\{y_i\}$ and $\max\{y_i\}$ denote the minimum and maximum values of the ordinates of the human body key points in the corresponding video data. Here, $i$ is an integer greater than or equal to 1.
It can be understood that, based on the above formula (1) and formula (2), the coordinates of the human body key points of each frame of image in the first video data or the second video data may be subjected to translation and scaling processing, so that the ranges of the abscissa and the ordinate of each human body key point after normalization processing are unified to [0,1 ].
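A minimal sketch of this normalization, under the assumption (one reading of equations (1) and (2)) that the minima and maxima are taken per axis over all human body key points of the whole video:

```python
import numpy as np

def normalize_keypoints(coords):
    """Min-max normalize key point coordinates into [0, 1], per (1) and (2).

    coords: array of shape (num_frames, num_keypoints, 2). Assumes the
    coordinates are not all identical along an axis (non-zero range).
    """
    flat = coords.reshape(-1, 2)
    mins = flat.min(axis=0)   # [min{x_i}, min{y_i}]
    maxs = flat.max(axis=0)   # [max{x_i}, max{y_i}]
    # Translate by the minima and scale by the ranges, per axis.
    return (coords - mins) / (maxs - mins)
```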
Step 103: and determining the human body action similarity of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data.
Fig. 2 is a schematic view of an application scenario of the embodiment of the disclosure, as shown in fig. 2, first video data 201 and second video data 202 may be input to an image processing apparatus 203, and the image processing apparatus 203 may perform processing by the image processing method described in the foregoing embodiment, so as to obtain human motion similarity of the first video data and the second video data.
In practical applications, the steps 101 to 103 may be implemented by a Processor in an electronic Device, where the Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor.
It can be seen that, in the embodiment of the present disclosure, the coordinates of the human body key points of each frame of image in the video data can be normalized, and then, the comparison of the human body motion similarity in the video data can be performed based on the normalized coordinate information of each frame of image in the video data; the normalization processing is carried out on the coordinates of the human key points of each frame of image in the video data, so that the influence caused by different factors such as background, human position and scale in the video can be eliminated, and further, the human action similarity of different video data can be visually and accurately compared.
In some embodiments, the obtaining of the first video data and the second video data may be implemented by obtaining first initial video data and second initial video data to be compared; and preprocessing the first initial video data and the second initial video data to obtain the first video data and the second video data with the same frame number.
In one embodiment, the first initial video data is video data acquired from an image acquisition device, a network terminal or a local storage space, and the second initial video data is video data which is acquired in advance and contains standard actions of a human body; for example, the first initial video data is used to represent the tai chi boxing action to be compared, and the second initial video data represents the standard tai chi boxing action.
In an embodiment, the first initial video data and the second initial video data may be preprocessed by performing frame dropping on the first initial video data and the second initial video data to obtain the first video data and the second video data with the same number of frames.
In the embodiment of the disclosure, the first video data and the second video data with the same number of frames can be obtained by preprocessing the first initial video data and the second initial video data; obtaining two videos with the same number of frames facilitates comparing their human body motion similarity.
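One possible form of this preprocessing is sketched below; the even-subsampling rule used to drop frames is an assumption, since the description does not fix the exact rule:

```python
import numpy as np

def equalize_frame_counts(frames_a, frames_b):
    """Drop frames from the longer video so both have the same frame count."""
    target = min(len(frames_a), len(frames_b))

    def subsample(frames):
        if len(frames) == target:
            return list(frames)
        # Evenly spaced indices from the first to the last frame.
        idx = np.linspace(0, len(frames) - 1, target).round().astype(int)
        return [frames[i] for i in idx]

    return subsample(frames_a), subsample(frames_b)
```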
In the embodiment of the present disclosure, as for the implementation manner of step 103, exemplarily, normalized coordinate information of each frame of image in the first video data may be merged to obtain a feature vector of a key point of each frame of image in the first video data; and combining the normalized coordinate information of each frame of image in the second video data to obtain the key point feature vector of each frame of image in the second video data.
The human body motion similarity of the first video data and the second video data can then be determined according to the key point feature vector of each frame of image in the first video data and the key point feature vector of each frame of image in the second video data.
In some embodiments, after obtaining the normalized coordinate information of each frame of image in the first video data and the second video data, the keypoint feature vector of each frame of image in the first video data or the second video data may be obtained according to formula (3):
$$v = (x_1', y_1', \ldots, x_n', y_n') \qquad (3)$$

where $v$ denotes the key point feature vector of each frame of image in the first video data or the second video data, $x_1'$ to $x_n'$ denote the normalized abscissas of the 1st to $n$-th human body key points of each frame of image, $y_1'$ to $y_n'$ denote the corresponding normalized ordinates, and $n$ denotes the number of pre-designated human body key points.
In the embodiment of the present disclosure, the normalized coordinate information of different frame images can be combined in the same order, so that in the resulting key point feature vectors of different frame images, elements at the same position have the same meaning; for example, in the key point feature vectors of different frame images, the 3rd element always represents a coordinate of the right elbow key point of the human body.
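Concretely, building the per-frame key point feature vector of formula (3) amounts to flattening the normalized coordinates in the fixed key point order; a short sketch:

```python
import numpy as np

def frame_feature_vector(norm_frame):
    """Flatten one frame's normalized key points into v = (x1', y1', ..., xn', yn').

    norm_frame: array of shape (num_keypoints, 2) in the pre-designated key
    point order, so a given vector position means the same thing in every frame.
    """
    return np.asarray(norm_frame, dtype=float).reshape(-1)
```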
For determining an implementation manner of human motion similarity of first video data and second video data according to a key point feature vector of each frame of image in the first video data and a key point feature vector of each frame of image in the second video data, in one example, the key point feature vectors of each frame of image in the first video data may be merged to obtain a feature vector time sequence of the first video data; merging the key point feature vectors of each frame of image in the second video data to obtain a feature vector time sequence of the second video data; and determining the human body motion similarity of the first video data and the second video data according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data.
Here, the feature vector time series of the first video data represents a plurality of key point feature vectors arranged in time series; the feature vector time series of the second video data represents a plurality of keypoint feature vectors arranged in a temporal order.
In the embodiment of the disclosure, the first video data and the second video data both include multi-frame data, and after the key point feature vectors of each frame of image in the first video data or the second video data are obtained, the key point feature vectors of each frame of image in the first video data or the second video data can be spliced in a time dimension to obtain a feature vector time sequence of the first video data or a feature vector time sequence of the second video data; in some embodiments, the feature vector time-series of the first video data or the feature vector time-series of the second video data may be a matrix.
It can be understood that obtaining the feature vector time sequence of the first video data and that of the second video data facilitates aligning the two in the time dimension, i.e., aligning the actions at different time points in the two video data, which in turn helps to accurately obtain the human body motion similarity of the first video data and the second video data.
For the implementation manner in which the feature vectors of the key points of each frame of image in the first video data are merged to obtain the feature vector time sequence of the first video data, for example, the feature vectors of the key points of each frame of image in the first video data may be normalized to obtain the normalized feature vectors of each frame of image in the first video data, and the normalized feature vectors of each frame of image in the first video data are merged to obtain the feature vector time sequence of the first video data.
For the implementation manner in which the feature vectors of the key points of each frame of image in the second video data are merged to obtain the feature vector time sequence of the second video data, for example, the feature vectors of the key points of each frame of image in the second video data may be normalized to obtain the normalized feature vectors of each frame of image in the second video data, and the normalized feature vectors of each frame of image in the second video data are merged to obtain the feature vector time sequence of the second video data.
In some embodiments, the key point feature vectors of each frame of image in the first video data or the second video data may be subjected to L2 normalization according to the following formula (4):

$$v' = \frac{v}{\|v\|_2} \qquad (4)$$

where $v'$ denotes the normalized feature vector of each frame of image in the first video data or the second video data.
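A one-line sketch of this L2 normalization; the small epsilon guarding against an all-zero vector is an added assumption:

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Return v' = v / ||v||_2, per formula (4)."""
    return v / (np.linalg.norm(v) + eps)  # eps avoids division by zero
```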
In some embodiments of the present application, after obtaining the feature vector time sequence of the first video data and the feature vector time sequence of the second video data, the distance between the normalized feature vectors of each corresponding frame image in the first video data and the second video data may be determined according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data by using a DTW method; and then, determining the human body motion similarity of the first video data and the second video data according to the distance of the normalized feature vectors of the corresponding frame images in the first video data and the second video data.
In some embodiments, each corresponding frame image in the first video data and the second video data represents a pair of images with the same frame sequence number in the two videos. When the numbers of frames of the first video data and the second video data are the same, the distances of the normalized feature vectors of images with the same frame number can be compared directly, where the frame number indicates the position of a frame in the video data; for example, if the first video data and the second video data both include m frames of images, the frame number of the j-th frame in each is j, where j = 1, ..., m and m is an integer greater than 1.
In other embodiments, when the numbers of frames of the first video data and the second video data differ, the video data with the smaller number of frames may be linearly mapped, based on the DTW method, using a linear expansion approach, so that its number of frames after mapping reaches a target number of frames, where the target number of frames is the larger of the two frame counts. Once the numbers of frames of the two video data are the same, the steps described in the foregoing embodiments are performed on them: coordinate normalization, merging of the normalized coordinate information, normalization of the key point feature vectors, merging of the normalized feature vectors, and calculation of the distances between the normalized feature vectors, thereby determining the distances of the normalized feature vectors of each corresponding frame image in the two video data with the same number of frames. Here, the distances of the normalized feature vectors of each corresponding frame image in the two video data with the same number of frames are the distances of the normalized feature vectors of each corresponding frame image in the first video data and the second video data.
In some embodiments, the distance between the normalized feature vectors of each corresponding frame image in the first video data and the second video data is the cosine similarity d.
It can be seen that the DTW method can dynamically compare the motions of two video data, that is, the normalized feature vectors of each corresponding frame image are compared in the context of the whole feature vector time sequence; compared with a static frame-by-frame comparison by frame number, this helps to accurately obtain the human motion similarity of different video data. In addition, the embodiments of the present disclosure do not need to segment each video data sequence by angle differences; instead, a feature vector time sequence that more accurately distinguishes the similarity of different human bodies is obtained from the normalized coordinate information of each frame of image in the video data, and the DTW method is then used to compare the normalized feature vectors of each corresponding frame image in different video data, which facilitates an accurate comparison of the human motion similarity in the video data.
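To make the alignment step concrete, below is a textbook DTW sketch over two feature vector time sequences. It uses 1 minus the cosine similarity as the per-pair step cost; the exact cost function and path constraints of the disclosure are not specified, so treat this as an assumed instantiation:

```python
import numpy as np

def dtw_align(seq_a, seq_b):
    """Align two feature vector time sequences with dynamic time warping.

    seq_a, seq_b: arrays of shape (frames, dim) of L2-normalized feature
    vectors. Returns the aligned (i, j) frame pairs and the cosine
    similarity d of each aligned pair.
    """
    n, m = len(seq_a), len(seq_b)
    sim = seq_a @ seq_b.T          # cosine similarities (unit-norm vectors)
    step_cost = 1.0 - sim          # turn similarity into a step cost
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = step_cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack the cheapest accumulated path to recover frame pairs.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        move = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if move == 0:
            i, j = i - 1, j - 1
        elif move == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path, [float(sim[i, j]) for i, j in path]
```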
For an implementation of determining human motion similarity of the first video data and the second video data according to the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data, for example, the similarity score value of each corresponding frame image in the first video data and the second video data may be determined according to the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data; and determining the human body action similarity of the first video data and the second video data according to the similarity grade values of the corresponding frame images in the first video data and the second video data.
In some embodiments, the human motion similarity of the first video data and the second video data may be determined according to an average value of the similarity score values of each corresponding frame image in the first video data and the second video data.
In some embodiments, after the distances of the normalized feature vectors of each corresponding frame image in the first video data and the second video data are obtained, the similarity score values of each corresponding frame image may be derived from those distances according to a preset scoring mode.
In some embodiments, the similarity score values of the respective corresponding frame images in the first video data and the second video data, derived on the basis of the preset scoring mode, lie in the range [0, 100].
In some embodiments, the predetermined scoring manner may be described according to the following equation (5).
$$s = 100 \times d \qquad (5)$$
Wherein $s$ denotes the similarity score value of each corresponding frame image in the first video data and the second video data; it can be seen that, based on equation (5), the distance of the normalized feature vectors of each corresponding frame image in the first video data and the second video data can be mapped to a score value from 0 to 100, i.e., a static pose score is derived.
It should be noted that the above-mentioned contents are merely exemplary illustrations of preset scoring manners, and the embodiments of the present disclosure are not limited thereto.
In some embodiments, the average value of the similarity score values of each corresponding frame image in the first video data and the second video data may be denoted as s_mean, and the human body motion similarity of the first video data and the second video data may be determined as s_mean; in this way, the average value s_mean visually reflects the similarity of the human body actions in the first video data and the second video data.
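Putting the scoring step together: the sketch below maps each aligned pair's similarity d to a score and averages the scores into s_mean; the mapping s = 100 × d is the assumed reading of equation (5) noted above, since the original formula is available only as an image:

```python
import numpy as np

def motion_similarity(pair_similarities):
    """Average per-frame-pair scores into the overall similarity s_mean.

    pair_similarities: cosine similarities d of each corresponding
    (DTW-aligned) frame pair; each d in [0, 1] is mapped to a score
    s = 100 * d, matching the stated [0, 100] score range (an assumption).
    """
    scores = [100.0 * float(d) for d in pair_similarities]
    return float(np.mean(scores))  # s_mean
```

For example, feeding in the pair similarities returned by dtw_align above yields s_mean directly.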
Fig. 3A is a frame of image in the first video data according to an embodiment of the disclosure, and Fig. 3B is a frame of image in the second video data according to an embodiment of the disclosure. Figs. 3A and 3B show different human body images: Fig. 3A shows a tai chi action to be compared, and Fig. 3B shows the standard tai chi action. In one example, the average value of the similarity score values of the corresponding frame images in the first video data and the second video data is 95.6, and the similarity score value of the pair of corresponding frames shown in Figs. 3A and 3B is 95.6.
The embodiment of the disclosure can be applied to scenes such as human body action comparison and the like; in some embodiments, a user may record a motion of the user into a video, and input the video together with a standard video, and then based on the image processing method of the embodiments of the present disclosure, a similarity between the motion of the user and the motion in the standard video may be obtained, and a similarity score value of the motion of the user and the motion in the standard video in each corresponding frame may be determined, so that a portion of the motion of the user that is not in accordance with the motion of the standard video may be obtained.
In some embodiments, a user may capture his or her own motion with a camera or similar device, compare it in real time with the motion in the standard video, determine the similarity score value of the user's motion against the standard video's motion for each corresponding frame, and display these per-frame similarity score values on a screen in real time.
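As a sketch of this real-time use, the following assumes OpenCV for capture and display, with a caller-supplied scoring function standing in for the keypoint-extraction and per-frame comparison pipeline described above; all names here are illustrative.

import cv2

def run_realtime(score_fn):
    # score_fn(frame_index, frame) -> similarity score in [0, 100];
    # stands in for keypoint extraction plus per-frame comparison.
    cap = cv2.VideoCapture(0)  # default camera
    idx = 0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        score = score_fn(idx, frame)
        cv2.putText(frame, "score: %.1f" % score, (20, 40),
                    cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 255, 0), 2)
        cv2.imshow("motion comparison", frame)
        if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
            break
        idx += 1
    cap.release()
    cv2.destroyAllWindows()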
It will be understood by those skilled in the art that, in the methods of the present disclosure, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
On the basis of the image processing method proposed by the foregoing embodiment, an embodiment of the present disclosure proposes an image processing apparatus.
Fig. 4 is a schematic structural diagram of an image processing apparatus according to an embodiment of the disclosure; as shown in fig. 4, the apparatus may include:
an obtaining module 400, configured to obtain first video data and second video data;
a first processing module 401, configured to perform normalization processing on coordinates of human key points of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data; normalizing the coordinates of the human key points of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data;
a second processing module 402, configured to determine human motion similarity between the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data.
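As a minimal sketch of how these three modules might compose (the module names follow Fig. 4; the callable signatures are illustrative assumptions, not the patent's interfaces):

class ImageProcessor:
    # Sketch of the apparatus of Fig. 4: acquisition, per-video
    # normalization, and cross-video similarity computation.
    def __init__(self, obtain, normalize, compare):
        self.obtain = obtain        # obtaining module 400
        self.normalize = normalize  # first processing module 401
        self.compare = compare      # second processing module 402

    def run(self, source_a, source_b):
        video_a, video_b = self.obtain(source_a, source_b)
        coords_a = [self.normalize(frame) for frame in video_a]
        coords_b = [self.normalize(frame) for frame in video_b]
        return self.compare(coords_a, coords_b)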
In some embodiments, the second processing module 402, configured to determine the human motion similarity of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data, includes:
merging the normalized coordinate information of each frame of image in the first video data to obtain the key point feature vector of each frame of image in the first video data; merging the normalized coordinate information of each frame of image in the second video data to obtain the key point feature vector of each frame of image in the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the key point feature vector of each frame of image in the first video data and the key point feature vector of each frame of image in the second video data.
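A minimal numpy sketch of this merging step, assuming each frame yields K keypoints with normalized (x, y) coordinates (the concatenation order is an assumption):

import numpy as np

def keypoint_feature_vector(normalized_coords):
    # Flatten K normalized (x, y) pairs of one frame into a single
    # 2K-dimensional key point feature vector.
    coords = np.asarray(normalized_coords, dtype=np.float64)  # shape (K, 2)
    return coords.reshape(-1)                                 # shape (2K,)

# Example: 3 keypoints -> 6-dimensional feature vector
print(keypoint_feature_vector([[0.1, 0.2], [0.5, 0.5], [0.9, 0.8]]))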
In some embodiments, the second processing module 402 is configured to combine the feature vectors of the key points of each frame of image in the first video data to obtain the feature vector time sequence of the first video data, and includes:
normalizing the feature vectors of the key points of each frame of image in the first video data to obtain normalized feature vectors of each frame of image in the first video data, and merging the normalized feature vectors of each frame of image in the first video data to obtain a feature vector time sequence of the first video data;
the second processing module 402 is configured to merge the feature vectors of the key points of each frame of image in the second video data to obtain a feature vector time sequence of the second video data, and includes:
and normalizing the feature vectors of the key points of the images in the second video data to obtain normalized feature vectors of the images in the second video data, and merging the normalized feature vectors of the images in the second video data to obtain a feature vector time sequence of the second video data.
In some embodiments, the second processing module 402 is configured to determine human motion similarity of the first video data and the second video data according to the key point feature vector of each frame of image in the first video data and the key point feature vector of each frame of image in the second video data, and includes:
normalizing the feature vectors of the key points of each frame of image in the first video data to obtain normalized feature vectors of each frame of image in the first video data, and merging the normalized feature vectors of each frame of image in the first video data to obtain a feature vector time sequence of the first video data;
normalizing the feature vectors of the key points of each frame of image in the second video data to obtain normalized feature vectors of each frame of image in the second video data, and merging the normalized feature vectors of each frame of image in the second video data to obtain a feature vector time sequence of the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data.
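A minimal sketch of these two steps, assuming L2 normalization (the patent states only that the vectors are normalized, not which norm is used):

import numpy as np

def normalized_feature_vector(vector, eps=1e-12):
    # L2-normalize one frame's key point feature vector (norm assumed).
    vector = np.asarray(vector, dtype=np.float64)
    return vector / (np.linalg.norm(vector) + eps)

def feature_vector_time_series(per_frame_vectors):
    # Stack per-frame normalized vectors into a (num_frames, dim) series.
    return np.stack([normalized_feature_vector(v) for v in per_frame_vectors])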
In some embodiments, the second processing module 402, configured to determine human motion similarity of the first video data and the second video data according to the feature vector time series of the first video data and the feature vector time series of the second video data, includes:
determining the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data by adopting a dynamic time warping (DTW) method according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the distance of the normalized feature vectors of the corresponding frame images in the first video data and the second video data.
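A textbook DTW sketch over two feature-vector time series follows; it returns the per-pair distances along the optimal warping path, i.e., the distances of the normalized feature vectors of corresponding frame images. This is the classic O(n*m) algorithm, not necessarily the exact variant used in the patent.

import numpy as np

def dtw_frame_distances(series_a, series_b):
    # Classic dynamic time warping with Euclidean frame distances.
    # series_a, series_b: arrays of shape (num_frames, dim).
    n, m = len(series_a), len(series_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(series_a[i - 1] - series_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Backtrack from (n, m) to recover the optimal warping path.
    path, i, j = [], n, m
    while i > 0 or j > 0:
        path.append((i - 1, j - 1))
        candidates = []
        if i > 0 and j > 0:
            candidates.append((cost[i - 1, j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((cost[i - 1, j], i - 1, j))
        if j > 0:
            candidates.append((cost[i, j - 1], i, j - 1))
        _, i, j = min(candidates)
    path.reverse()
    # Distance of the normalized feature vectors for each aligned pair.
    return [float(np.linalg.norm(series_a[a] - series_b[b])) for a, b in path]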
In some embodiments, the second processing module 402 is configured to determine human motion similarity of the first video data and the second video data according to a distance between normalized feature vectors of corresponding frame images in the first video data and the second video data, and includes:
determining similarity scoring values of each corresponding frame image in the first video data and the second video data according to the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data;
and determining the human body action similarity of the first video data and the second video data according to the similarity scoring values of the corresponding frame images in the first video data and the second video data.
In some embodiments, the second processing module 402 is configured to determine human motion similarity of the first video data and the second video data according to similarity score values of corresponding frame images in the first video data and the second video data, and includes:
and determining the human body action similarity of the first video data and the second video data according to the average value of the similarity scoring values of the corresponding frame images in the first video data and the second video data.
In some embodiments, the obtaining module 400 is configured to obtain the first video data and the second video data, and includes:
acquiring first initial video data and second initial video data to be compared;
and preprocessing the first initial video data and the second initial video data to obtain the first video data and the second video data with the same frame number.
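A minimal sketch of one such preprocessing, assuming uniform resampling of the longer sequence so both videos end up with the same frame count (the patent leaves the specific preprocessing open):

import numpy as np

def equalize_frame_counts(frames_a, frames_b):
    # Uniformly subsample the longer video to the shorter one's length.
    target = min(len(frames_a), len(frames_b))

    def subsample(frames):
        idx = np.linspace(0, len(frames) - 1, num=target).round().astype(int)
        return [frames[i] for i in idx]

    return subsample(frames_a), subsample(frames_b)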
In some embodiments, the first processing module 401 is configured to perform normalization processing on coordinates of a human body key point of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data, and includes:
carrying out noise reduction smoothing processing on the coordinates of the human key points of each frame of image in the first video data, and carrying out normalization processing on the coordinates of the human key points after the noise reduction smoothing processing corresponding to the first video data to obtain normalized coordinate information of each frame of image in the first video data;
the first processing module 401 is configured to perform normalization processing on coordinates of key points of a human body of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data, and includes:
and performing noise reduction smoothing processing on the coordinates of the human key points of each frame of image in the second video data, and performing normalization processing on the coordinates of the human key points after the noise reduction smoothing processing corresponding to the second video data to obtain normalized coordinate information of each frame of image in the second video data.
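A minimal sketch of the noise-reduction smoothing, assuming a simple temporal moving-average filter over each keypoint trajectory (the patent does not commit to a particular filter):

import numpy as np

def smooth_keypoints(coords, window=5):
    # coords: (num_frames, K, 2) raw (x, y) keypoint coordinates.
    # Applies a moving average along the time axis, per keypoint and axis.
    # Note: mode="same" zero-pads at the sequence edges.
    coords = np.asarray(coords, dtype=np.float64)
    kernel = np.ones(window) / window
    out = np.empty_like(coords)
    for k in range(coords.shape[1]):
        for axis in range(2):
            out[:, k, axis] = np.convolve(coords[:, k, axis], kernel, mode="same")
    return out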
In practical applications, the obtaining module 400, the first processing module 401, and the second processing module 402 may all be implemented by a processor in an electronic device, where the processor may be at least one of an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a CPU, a controller, a microcontroller, and a microprocessor.
In addition, the functional modules in this embodiment may be integrated into one processing unit, each module may exist physically on its own, or two or more modules may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.
Based on this understanding, the technical solution of this embodiment, in essence or in the part contributing over the prior art, or in whole or in part, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the method of this embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Specifically, the computer program instructions corresponding to the image processing method of this embodiment may be stored on a storage medium such as an optical disc, a hard disk, or a USB flash drive; when the computer program instructions corresponding to the image processing method in the storage medium are read and executed by an electronic device, any one of the image processing methods of the foregoing embodiments is implemented.
Based on the same technical concept as the foregoing embodiments, and referring to fig. 5, an embodiment of the present disclosure provides an electronic device 5, which may include a memory 501 and a processor 502, wherein:
the memory 501 is used for storing computer programs and data;
the processor 502 is configured to execute the computer program stored in the memory to implement any one of the image processing methods of the foregoing embodiments.
In practical applications, the memory 501 may be a volatile memory such as a RAM; a non-volatile memory such as a ROM, a flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memories. It provides instructions and data to the processor 502.
The processor 502 may be at least one of an ASIC, a DSP, a DSPD, a PLD, an FPGA, a CPU, a controller, a microcontroller, and a microprocessor. It is understood that other electronic components may also implement the above processor functions; the embodiments of the present disclosure are not particularly limited in this regard.
In some embodiments, the functions of, or the modules included in, the apparatus provided in the embodiments of the present disclosure may be used to execute the methods described in the above method embodiments; for their specific implementation, reference may be made to the description of the above method embodiments, which is not repeated here for brevity.
The foregoing description of the various embodiments tends to emphasize the differences between them; for their common or similar aspects, the embodiments may be referred to one another. These are not repeated here for brevity.
The methods disclosed in the method embodiments provided by the present application can be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in various product embodiments provided by the application can be combined arbitrarily to obtain new product embodiments without conflict.
The features disclosed in the various method or apparatus embodiments provided herein may be combined in any combination to arrive at new method or apparatus embodiments without conflict.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solutions of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the methods according to the embodiments of the present disclosure.
While the present disclosure has been described with reference to the embodiments shown in the drawings, it is not limited to those embodiments, which are illustrative rather than restrictive; it will be apparent to those skilled in the art that various changes and modifications can be made without departing from the spirit and scope of the disclosure as defined in the appended claims.

Claims (12)

1. An image processing method, characterized in that the method comprises:
acquiring first video data and second video data;
carrying out normalization processing on coordinates of human key points of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data; normalizing the coordinates of the human key points of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data;
and determining the human body action similarity of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data.
2. The method of claim 1, wherein determining the similarity of the human body actions of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data comprises:
merging the normalized coordinate information of each frame of image in the first video data to obtain the key point feature vector of each frame of image in the first video data; merging the normalized coordinate information of each frame of image in the second video data to obtain the key point feature vector of each frame of image in the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the key point feature vector of each frame of image in the first video data and the key point feature vector of each frame of image in the second video data.
3. The method of claim 2, wherein determining the similarity of human body actions of the first video data and the second video data according to the key point feature vector of each frame of image in the first video data and the key point feature vector of each frame of image in the second video data comprises:
merging the key point feature vectors of each frame of image in the first video data to obtain a feature vector time sequence of the first video data;
merging the key point feature vectors of each frame of image in the second video data to obtain a feature vector time sequence of the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data.
4. The method according to claim 3, wherein the merging the feature vectors of the keypoints of the images in the first video data to obtain the time series of feature vectors of the first video data comprises:
normalizing the feature vectors of the key points of each frame of image in the first video data to obtain normalized feature vectors of each frame of image in the first video data, and merging the normalized feature vectors of each frame of image in the first video data to obtain a feature vector time sequence of the first video data;
the merging the feature vectors of the key points of each frame of image in the second video data to obtain the feature vector time sequence of the second video data includes:
and normalizing the feature vectors of the key points of the images in the second video data to obtain normalized feature vectors of the images in the second video data, and merging the normalized feature vectors of the images in the second video data to obtain a feature vector time sequence of the second video data.
5. The method according to claim 3 or 4, wherein the determining the similarity of the human body motions of the first video data and the second video data according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data comprises:
determining the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data by adopting a Dynamic Time Warping (DTW) method according to the feature vector time sequence of the first video data and the feature vector time sequence of the second video data;
and determining the human body motion similarity of the first video data and the second video data according to the distance of the normalized feature vectors of the corresponding frame images in the first video data and the second video data.
6. The method of claim 5, wherein determining the human motion similarity of the first video data and the second video data according to the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data comprises:
determining similarity scoring values of each corresponding frame image in the first video data and the second video data according to the distance of the normalized feature vector of each corresponding frame image in the first video data and the second video data;
and determining the human body action similarity of the first video data and the second video data according to the similarity scoring values of the corresponding frame images in the first video data and the second video data.
7. The method according to claim 6, wherein the determining human motion similarity of the first video data and the second video data according to the similarity score values of the corresponding frame images in the first video data and the second video data comprises:
and determining the human body action similarity of the first video data and the second video data according to the average value of the similarity scoring values of the corresponding frame images in the first video data and the second video data.
8. The method of claim 1, wherein the obtaining the first video data and the second video data comprises:
acquiring first initial video data and second initial video data to be compared;
and preprocessing the first initial video data and the second initial video data to obtain the first video data and the second video data with the same frame number.
9. The method according to claim 1, wherein the normalizing the coordinates of the key points of the human body of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data comprises:
carrying out noise reduction smoothing processing on the coordinates of the human key points of each frame of image in the first video data, and carrying out normalization processing on the coordinates of the human key points after the noise reduction smoothing processing corresponding to the first video data to obtain normalized coordinate information of each frame of image in the first video data;
the normalizing the coordinates of the key points of the human body of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data includes:
and performing noise reduction smoothing processing on the coordinates of the human key points of each frame of image in the second video data, and performing normalization processing on the coordinates of the human key points after the noise reduction smoothing processing corresponding to the second video data to obtain normalized coordinate information of each frame of image in the second video data.
10. An image processing apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring first video data and second video data;
the first processing module is used for carrying out normalization processing on the coordinates of the human key points of each frame of image in the first video data to obtain normalized coordinate information of each frame of image in the first video data; normalizing the coordinates of the human key points of each frame of image in the second video data to obtain normalized coordinate information of each frame of image in the second video data;
and the second processing module is used for determining the human body action similarity of the first video data and the second video data according to the normalized coordinate information of each frame of image in the first video data and the normalized coordinate information of each frame of image in the second video data.
11. An electronic device comprising a processor and a memory for storing a computer program operable on the processor, wherein
the processor is configured to run the computer program to perform the method of any one of claims 1 to 9.
12. A computer storage medium on which a computer program is stored, characterized in that the computer program realizes the method of any one of claims 1 to 9 when executed by a processor.
CN202011417021.5A 2020-12-04 2020-12-04 Image processing method, image processing device, electronic equipment and computer storage medium Pending CN112418153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011417021.5A CN112418153A (en) 2020-12-04 2020-12-04 Image processing method, image processing device, electronic equipment and computer storage medium


Publications (1)

Publication Number Publication Date
CN112418153A true CN112418153A (en) 2021-02-26

Family

ID=74775768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011417021.5A Pending CN112418153A (en) 2020-12-04 2020-12-04 Image processing method, image processing device, electronic equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN112418153A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110305394A1 (en) * 2010-06-15 2011-12-15 David William Singer Object Detection Metadata
CN105930767A (en) * 2016-04-06 2016-09-07 南京华捷艾米软件科技有限公司 Human body skeleton-based action recognition method
WO2018202089A1 (en) * 2017-05-05 2018-11-08 商汤集团有限公司 Key point detection method and device, storage medium and electronic device
CN107920257A (en) * 2017-12-01 2018-04-17 北京奇虎科技有限公司 Video Key point real-time processing method, device and computing device
CN107967693A (en) * 2017-12-01 2018-04-27 北京奇虎科技有限公司 Video Key point processing method, device, computing device and computer-readable storage medium
CN108615055A (en) * 2018-04-19 2018-10-02 咪咕动漫有限公司 A kind of similarity calculating method, device and computer readable storage medium
CN110210284A (en) * 2019-04-12 2019-09-06 哈工大机器人义乌人工智能研究院 A kind of human body attitude behavior intelligent Evaluation method
CN110711374A (en) * 2019-10-15 2020-01-21 石家庄铁道大学 Multi-modal dance action evaluation method
CN110738192A (en) * 2019-10-29 2020-01-31 腾讯科技(深圳)有限公司 Human motion function auxiliary evaluation method, device, equipment, system and medium
CN111476097A (en) * 2020-03-06 2020-07-31 平安科技(深圳)有限公司 Human body posture assessment method and device, computer equipment and storage medium
CN111626137A (en) * 2020-04-29 2020-09-04 平安国际智慧城市科技股份有限公司 Video-based motion evaluation method and device, computer equipment and storage medium

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112906630A (en) * 2021-03-17 2021-06-04 北京市商汤科技开发有限公司 Video processing method and device, computer readable storage medium and computer equipment

Similar Documents

Publication Publication Date Title
CN108230383B (en) Hand three-dimensional data determination method and device and electronic equipment
US8644551B2 (en) Systems and methods for tracking natural planar shapes for augmented reality applications
RU2617557C1 (en) Method of exposure to virtual objects of additional reality
US9418480B2 (en) Systems and methods for 3D pose estimation
CN110909651A (en) Video subject person identification method, device, equipment and readable storage medium
CN111310705A (en) Image recognition method and device, computer equipment and storage medium
JP2013012190A (en) Method of approximating gabor filter as block-gabor filter, and memory to store data structure for access by application program running on processor
US10254831B2 (en) System and method for detecting a gaze of a viewer
CN110147708B (en) Image data processing method and related device
CN109063776B (en) Image re-recognition network training method and device and image re-recognition method and device
CN111104925A (en) Image processing method, image processing apparatus, storage medium, and electronic device
CN112651380A (en) Face recognition method, face recognition device, terminal equipment and storage medium
CN111815768B (en) Three-dimensional face reconstruction method and device
CN110032941B (en) Face image detection method, face image detection device and terminal equipment
CN112418153A (en) Image processing method, image processing device, electronic equipment and computer storage medium
WO2024022301A1 (en) Visual angle path acquisition method and apparatus, and electronic device and medium
CN109598201B (en) Action detection method and device, electronic equipment and readable storage medium
CN115223240B (en) Motion real-time counting method and system based on dynamic time warping algorithm
CN111104911A (en) Pedestrian re-identification method and device based on big data training
CN111611941B (en) Special effect processing method and related equipment
CN112613457B (en) Image acquisition mode detection method, device, computer equipment and storage medium
CN114627542A (en) Eye movement position determination method and device, storage medium and electronic equipment
CN113724176A (en) Multi-camera motion capture seamless connection method, device, terminal and medium
CN108446653B (en) Method and apparatus for processing face image
CN108446737B (en) Method and device for identifying objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination