CN109583363B - Method and system for detecting and improving posture and body movement of lecturer based on human body key points - Google Patents


Info

Publication number
CN109583363B
CN109583363B (application CN201811422699.5A)
Authority
CN
China
Prior art keywords
key point, human body, body key, point, combination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811422699.5A
Other languages
Chinese (zh)
Other versions
CN109583363A (en)
Inventor
夏东
佐凯
张翀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Vision Miracle Intelligent Technology Co ltd
Original Assignee
Hunan Vision Miracle Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Vision Miracle Intelligent Technology Co ltd filed Critical Hunan Vision Miracle Intelligent Technology Co ltd
Priority to CN201811422699.5A
Publication of CN109583363A
Application granted
Publication of CN109583363B
Active legal status
Anticipated expiration legal status

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/165Detection; Localisation; Normalisation using facial parts and geometric relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • G06V40/28Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a method and a system for detecting and improving a speaker's posture and body movements based on human body key points. The method comprises the following steps: storing standard preset human body key point data in a database, together with a description corresponding to each group of preset data; acquiring a video stream with depth information through a camera device; extracting human body key point signals from the video stream and comparing them one by one with the preset data in the database, where a difference below a threshold is judged a successful comparison and the corresponding description is retrieved; and broadcasting the description to the user by voice. Compared with traditional training methods, the method offers real-time operation, high efficiency, and low cost.

Description

Method and system for detecting and improving posture and body movement of lecturer based on human body key points
Technical Field
The invention relates to the field of education and the field of computer vision, in particular to a method and a system for detecting and improving the posture and body actions of a speaker based on human body key points.
Background
Most people exhibit slight gestures of which they are unaware when speaking in public, and these redundant small movements can significantly weaken the overall effect of a speech. To overcome them, speakers typically either train deliberately by repeatedly watching recordings of themselves or hire a professional coach for in-person supervision. Self-study from video alone is inefficient, while professional coaching is hard to popularize because of its high cost and the uneven quality of coaches.
Disclosure of Invention
The invention provides a method and a system for detecting and improving a speaker's posture and movements based on human body key points, addressing the low efficiency and high cost of existing approaches to improving speaking posture.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a method for detecting and improving the posture and body movement of a speaker based on human body key points comprises the following steps:
storing standard preset human body key point data in a database, together with a description corresponding to each group of preset data;
acquiring a video stream with depth information through a camera device;
extracting human body key point signals from the video stream and comparing them one by one with the preset human body key point data in the database; if the difference between a signal and a preset entry is less than a threshold, the comparison is judged successful and the description corresponding to that preset data is obtained;
broadcasting the description to the user by voice.
As a further improvement of the process of the invention:
preferably, if the difference between the human body key point signal and the preset human body key point data in the database is greater than or equal to the threshold value, returning to obtain the video stream with the depth information again, extracting the human body key point signal again, and comparing.
Preferably, the human body key point signal is a combination of human body 2D key point positions, comprising: combination I, the 4 key point positions of the right eye (upper-eyelid outer point 37, upper-eyelid inner point 38, lower-eyelid inner point 40, and lower-eyelid outer point 41); combination II, the 4 key point positions of the left eye (upper-eyelid inner point 43, upper-eyelid outer point 44, lower-eyelid outer point 46, and lower-eyelid inner point 47); and combination III, the key point positions of the inner point 21 of the right eyebrow and the inner point 22 of the left eyebrow.
Preferably, the human body 3D key point position combination comprises: the 3D point cloud coordinates of the 21 key points on each of the left and right hands.
Preferably, the human body 3D key point position combination comprises: the 3D point cloud coordinates of 3 key points: nose 0, right wrist 4, and left wrist 7.
The invention also provides a system for detecting and improving a speaker's posture and body movements based on human body key points, comprising:
the storage module, used to store standard preset human body key point data in a database, together with a description corresponding to each group of preset data;
the input module, used to acquire a video stream with depth information through the camera device;
the 3D human body key point detection module, used to extract human body key point signals from the video stream with depth information and input them into the control module;
the control module, used to compare the human body key point signals one by one with the preset human body key point data in the database; if the difference between a signal and a preset entry is less than a threshold, the comparison is judged successful and the description corresponding to that preset data is obtained;
and the voice prompt module, used to broadcast the description to the user.
Preferably, the control module is further configured, when the difference between the human body key point signal and the preset human body key point data in the database is greater than or equal to the threshold, to notify the input module to reacquire the video stream with depth information and to re-extract the human body key point signal through the 3D human body key point detection module for comparison.
Preferably, the human body key point signal is a combination of human body 2D key point positions, comprising: combination I, the 4 key point positions of the right eye (upper-eyelid outer point 37, upper-eyelid inner point 38, lower-eyelid inner point 40, and lower-eyelid outer point 41); combination II, the 4 key point positions of the left eye (upper-eyelid inner point 43, upper-eyelid outer point 44, lower-eyelid outer point 46, and lower-eyelid inner point 47); and combination III, the key point positions of the inner point 21 of the right eyebrow and the inner point 22 of the left eyebrow.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods described above when executing the computer program.
The invention has the following beneficial effects:
the method and the system for improving the posture and body action of the speaker based on the human body key point detection adopt a 3D human body key point detection module to detect the 3D action of a human body, obtain a 3D human body key point signal, and a control module compares the 3D human body key point signal with 3D human body key point data preset in a storage module one by one to obtain a description language corresponding to the successfully compared 3D human body key point, so as to prompt a user and achieve the aim of continuously improving the posture and body action of the speaker. Compared with the traditional training method, the method has the advantages of real-time performance, high efficiency, low cost and the like.
In addition to the objects, features and advantages described above, the present invention has other objects, features and advantages, which will be described in further detail below with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a flow chart illustrating a method for improving the posture and body movement of a speaker based on human body key point detection according to a preferred embodiment of the present invention;
FIG. 2 is a schematic structural diagram of detecting an improved lecturer gesture based on human body key points according to preferred embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of 70 key points of the face in accordance with the preferred embodiment of the present invention;
FIG. 4 is a schematic diagram of 21 key points on a hand according to the preferred embodiment 3 of the present invention;
FIG. 5 is a schematic diagram of 25 key points of the body in accordance with a preferred embodiment 4 of the present invention;
fig. 6 is a schematic diagram of 3D human key points according to the preferred embodiment 1 of the present invention.
Detailed Description
The embodiments of the invention will be described in detail below with reference to the drawings, but the invention can be implemented in many different ways as defined and covered by the claims.
Referring to fig. 1, the method for detecting and improving the posture and body movement of a lecturer based on the human body key points comprises the following steps:
storing standard preset human body key point data in a database, together with a description corresponding to each group of preset data;
acquiring a video stream with depth information through a camera device;
extracting human body key point signals from the video stream and comparing them one by one with the preset human body key point data in the database; if the difference between a signal and a preset entry is less than a threshold, the comparison is judged successful and the description corresponding to that preset data is obtained;
broadcasting the description to the user by voice.
In the above steps, a 3D human body key point detection module detects the 3D movements of the human body and obtains 3D human body key point signals; the control module compares these signals one by one with the 3D human body key point data preset in the storage module and obtains the description corresponding to each successfully matched key point combination, which is used to prompt the user, so that the speaker's posture and body movements are continuously improved.
In practice, the above method can be expanded or applied as follows. All technical features in the following embodiments can be combined with one another; the embodiments serve only as examples and do not limit the ordinary combinations of these features.
Example 1:
referring to fig. 1, the method for detecting and improving the posture and body movement of a lecturer based on the human body key points comprises the following steps:
storing standard preset human body key point data and a description language corresponding to each group of preset human body key point data in a database;
acquiring a video stream with depth information through a camera device;
extracting human body key point signals from the video stream with depth information and comparing them one by one with the preset human body key point data in the database:
if the difference between a signal and a preset entry is less than a threshold, the comparison is judged successful, the description corresponding to that preset data is obtained, and the description is broadcast to the user by voice;
and if the difference between the human body key point signal and the preset human body key point data in the database is greater than or equal to the threshold value, returning to obtain the video stream with the depth information again, and extracting the human body key point signal again for comparison. Fig. 6 is a schematic diagram of 3D human key points of preferred embodiment 1 of the present invention, wherein the number of key points is 135 in total, not shown in fig. 6, and actually includes 70 key points of the face of fig. 2, 21 key points of the hand of fig. 4 (multiplied by 2), and the sum of 25 key points of the body of fig. 5 (wherein, the key point 0 of the wrist is repeated with the key points of the body, and is calculated only once).
Referring to fig. 2, the present embodiment further provides a system for detecting and improving the posture and body movement of a lecturer based on the human body key points, including:
the storage module, used to store standard preset human body key point data in a database, together with a description corresponding to each group of preset data;
the input module, used to acquire a video stream with depth information through the camera device;
the 3D human body key point detection module, used to extract human body key point signals from the video stream with depth information and input them into the control module;
the control module, used to compare the human body key point signals one by one with the preset human body key point data in the database; if the difference between a signal and a preset entry is less than a threshold, the comparison is judged successful and the description corresponding to that preset data is obtained; the control module is also used, when the difference is greater than or equal to the threshold, to notify the input module to reacquire the video stream with depth information and to re-extract the human body key point signal through the 3D human body key point detection module for comparison;
and the voice prompt module, used to broadcast the description to the user once it has been obtained.
Example 2:
This embodiment is an application example of embodiment 1; its steps are substantially the same as those of embodiment 1 and are not repeated here. The difference is that the human body key point signal is a combination of human body 2D key point positions, as shown in fig. 3, comprising: combination I, the 4 key point positions of the right eye, No. 37 (upper-eyelid outer point), No. 38 (upper-eyelid inner point), No. 40 (lower-eyelid inner point), and No. 41 (lower-eyelid outer point); combination II, the 4 key point positions of the left eye, No. 43 (upper-eyelid inner point), No. 44 (upper-eyelid outer point), No. 46 (lower-eyelid outer point), and No. 47 (lower-eyelid inner point); and combination III, key point No. 21 (inner point of the right eyebrow) and key point No. 22 (inner point of the left eyebrow).
Front-view face images are acquired for several eye states: glaring (wide open), normally open, half open, and closed, and the position coordinates of all key points in combinations I and II are obtained. For combination I, the Euclidean distances between the v values of coordinates No. 37 and No. 41 and between coordinates No. 38 and No. 40 in the image coordinate system (the abscissa u is the column index in the image array, the ordinate v the row index) are calculated and summed; the resulting values serve as the thresholds for subsequently judging whether the right eye is glaring, normally open, half open, or closed. For combination II, the Euclidean distances between the v values of coordinates No. 43 and No. 47 and between coordinates No. 44 and No. 46 are likewise calculated and summed to obtain the corresponding thresholds for the left eye.
Several front-view images of a normal facial expression and a frowning expression are acquired; the position coordinates of key points No. 21 and No. 22 in combination III are obtained, the Euclidean distance between their u values in the image coordinate system is calculated, and the resulting values serve as the thresholds for subsequently judging whether the user is frowning.
During judgment, the distance values of combinations I, II, and III, obtained by real-time detection with the calculation above, are compared against the corresponding thresholds. If a distance exceeds the normal-open threshold, the voice prompt is: "the left/right eye is opened too wide"; if it is below the half-open threshold: "left/right eye closed". If the combination III distance is below the normal-expression threshold and its difference from the frown threshold is less than half the difference between the normal and frown thresholds, the user is considered to be frowning and the voice prompt is: "frowning".
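A minimal sketch of embodiment 2's eye and eyebrow checks, assuming `landmarks` maps the face key point numbers used above to (u, v) pixel coordinates. The function names and the threshold values in the usage below are placeholders; in the patent the thresholds come from the calibration images described above.

```python
def eye_openness(landmarks, upper_ids, lower_ids):
    """Sum of vertical (v-value) distances between paired upper- and
    lower-eyelid key points, e.g. pairs (37, 41) and (38, 40) for the right eye."""
    return sum(abs(landmarks[u][1] - landmarks[l][1])
               for u, l in zip(upper_ids, lower_ids))

def brow_gap(landmarks):
    """Horizontal (u-value) distance between inner eyebrow points 21 and 22."""
    return abs(landmarks[21][0] - landmarks[22][0])

def eye_prompt(openness, open_thr, half_thr):
    """Map an openness value to the patent's voice prompts; None = normal."""
    if openness > open_thr:
        return "eye opened too wide"
    if openness < half_thr:
        return "eye closed"
    return None
```

For example, with right-eye landmarks 37=(30, 50), 38=(34, 49), 40=(34, 57), 41=(30, 58), the openness is 8 + 8 = 16, which an open-eye threshold of 12 would flag as opened too wide.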
Example 3:
This embodiment is an application example of embodiment 1; its steps are substantially the same as those of embodiment 1 and are not repeated here. The difference is that the human body 3D key point position combination, as shown in fig. 4, comprises the 3D point cloud coordinates of the 21 key points on each of the left and right hands.
Hand detection images of several users are acquired with both hands open, slightly open, and clenched into a fist; their 3D point cloud coordinates are obtained, and the variance of all key points of each hand in each state is calculated and used as that hand's state threshold.
During judgment, the variance of all key points of each hand, obtained by real-time detection and calculation, is compared against the thresholds of each state. If it is below the slightly-open threshold and its difference from the fist threshold is less than half the difference between the slightly-open and fist thresholds, the user is considered to be making a fist and the voice prompt is: "left/right hand clenched into a fist".
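Embodiment 3's fist test can be sketched as below, where the total variance of the 21 hand key points stands in for the per-state variance described above; the function names and the threshold values used in the test rule are illustrative assumptions, not taken from the patent.

```python
def hand_spread(points):
    """Total variance of a hand's 3D key points, summed over the x, y, z axes.
    A clenched fist clusters the 21 points together, so the spread shrinks."""
    n = len(points)
    spread = 0.0
    for axis in range(3):
        vals = [p[axis] for p in points]
        mean = sum(vals) / n
        spread += sum((v - mean) ** 2 for v in vals) / n  # population variance
    return spread

def is_fist(spread, slightly_open_thr, fist_thr):
    """Patent rule: spread below the slightly-open threshold AND within half
    the slightly-open/fist gap of the fist threshold."""
    return (spread < slightly_open_thr
            and spread - fist_thr < 0.5 * (slightly_open_thr - fist_thr))
```

A spread of 2.0 against a slightly-open threshold of 5.0 and a fist threshold of 1.0 triggers the prompt, while a spread of 4.0 does not, matching the "less than half the gap" rule.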
Example 4:
This embodiment is an application example of embodiment 1; its steps are substantially the same as those of embodiment 1 and are not repeated here. The difference is that the human body 3D key point position combination, as shown in fig. 5, comprises the 3D point cloud coordinates of 3 key points: No. 0 (nose), No. 4 (right wrist), and No. 7 (left wrist).
Several body detection images of users in a normal standing posture are acquired; their 3D point cloud coordinates are obtained, the Euclidean distances between points No. 4 and No. 0 and between points No. 7 and No. 0 are calculated, and these values serve as the thresholds of the normal standing posture.
Several body detection images are acquired of users touching the left ear, the nose, and the head with the left hand, and the right ear, the nose, and the head with the right hand; their 3D point cloud coordinates are obtained, the Euclidean distances between points No. 4 and No. 0 and between points No. 7 and No. 0 are calculated, the average over the three touch states is taken for each hand, and this value serves as that hand's head-touching threshold.
During judgment, the Euclidean distances between points No. 4 and No. 0 and between points No. 7 and No. 0, obtained by real-time detection and calculation, are compared against the thresholds. If a detected distance is below the normal standing-posture threshold and its difference from that hand's head-touching threshold is less than 1/3 of the difference between the standing and head-touching thresholds, the user is considered to be touching their head and the voice prompt is: "left/right hand touching the head".
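Embodiment 4's head-touch test reduces to a nose-to-wrist Euclidean distance and the 1/3 rule above. This sketch assumes 3D point cloud coordinates in consistent units; the function names and the threshold values in the example are placeholders for the calibrated values described in the text.

```python
import math

def wrist_nose_distance(nose, wrist):
    """Euclidean distance between the nose (point 0) and a wrist (point 4 or 7)."""
    return math.dist(nose, wrist)

def is_touching_head(dist, standing_thr, touch_thr):
    """Patent rule: distance below the normal standing-posture threshold AND
    within 1/3 of the standing/touch gap of the head-touching threshold."""
    return (dist < standing_thr
            and dist - touch_thr < (standing_thr - touch_thr) / 3.0)
```

For instance, with a standing threshold of 0.6 and a touch threshold of 0.1, a detected distance of 0.2 triggers the prompt (0.1 gap is under 1/3 of 0.5), while 0.4 does not.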
Example 5:
the present embodiment provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of any of the above embodiments when executing the computer program.
In conclusion, compared with traditional training methods, the invention offers real-time operation, high efficiency, and low cost.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (3)

1. A method for detecting and improving the posture and body movement of a speaker based on human body key points is characterized by comprising the following steps:
storing standard preset human body key point data and a description language corresponding to each group of preset human body key point data in a database;
acquiring a video stream with depth information through a camera device;
extracting human body key point signals from the video stream with the depth information, comparing the human body key point signals with preset human body key point data in a database one by one, comparing the Euclidean distance of a human body key point combination with a threshold value of a preset human body key point combination state in the database, judging that the comparison is successful if the Euclidean distance is smaller than the threshold value in a normal state and the difference between the Euclidean distance and the threshold value in an abnormal state is smaller than 1/2 of the threshold value difference between the normal state and the abnormal state, and obtaining a description language corresponding to the group of preset human body key point data; broadcasting the descriptive voice to a user;
wherein the human body key point signal comprises a combination of human body 2D key point positions, including: a first combination consisting of the 4 key point positions of the upper-eyelid outer point (37), upper-eyelid inner point (38), lower-eyelid inner point (40), and lower-eyelid outer point (41) of the right eye; a second combination consisting of the 4 key point positions of the upper-eyelid inner point (43), upper-eyelid outer point (44), lower-eyelid outer point (46), and lower-eyelid inner point (47) of the left eye; and a third combination consisting of the key point positions of the inner point (21) of the right eyebrow and the inner point (22) of the left eyebrow; the Euclidean distance of a human body key point combination is the sum of the Euclidean distances between the v values or u values, in the image coordinate system, of the coordinates of the corresponding key points in the first, second, and third combinations respectively;
and if the difference between the human body key point signal and the preset human body key point data in the database is greater than or equal to a threshold value, returning to obtain the video stream with the depth information again, re-extracting the human body key point signal, and comparing.
2. A system for detecting and improving a speaker's posture based on human body keypoints, comprising:
the storage module is used for storing standard preset human body key point data and a description language corresponding to each group of preset human body key point data in a database;
the input module is used for acquiring a video stream with depth information through the camera device;
the 3D human key point detection module is used for extracting human key point signals from the video stream with the depth information and inputting the human key point signals into the control module;
the control module is used for comparing the human body key point signals with preset human body key point data in a database one by one, comparing the Euclidean distance of a human body key point combination with a threshold value of a preset human body key point combination state in the database, and if the Euclidean distance is smaller than the threshold value in a normal state and the difference between the Euclidean distance and the threshold value in an abnormal state is smaller than 1/2 of the difference between the threshold values in the normal state and the abnormal state, judging that the comparison is successful, and acquiring a description language corresponding to the group of preset human body key point data; the voice prompt module is used for broadcasting the descriptive voice to a user;
wherein the human body key point signal comprises a combination of human body 2D key point positions, including: a first combination consisting of the 4 key point positions of the upper-eyelid outer point (37), upper-eyelid inner point (38), lower-eyelid inner point (40), and lower-eyelid outer point (41) of the right eye; a second combination consisting of the 4 key point positions of the upper-eyelid inner point (43), upper-eyelid outer point (44), lower-eyelid outer point (46), and lower-eyelid inner point (47) of the left eye; and a third combination consisting of the key point positions of the inner point (21) of the right eyebrow and the inner point (22) of the left eyebrow; the Euclidean distance of a human body key point combination is the sum of the Euclidean distances between the v values or u values, in the image coordinate system, of the coordinates of the corresponding key points in the first, second, and third combinations respectively;
and the control module is also used for returning to inform the input module to reacquire the video stream with the depth information when the difference between the human body key point signal and the preset human body key point data in the database is greater than or equal to a threshold value, and extracting the human body key point signal again for comparison.
3. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of claim 1 are performed when the computer program is executed by the processor.
CN201811422699.5A 2018-11-27 2018-11-27 Method and system for detecting and improving posture and body movement of lecturer based on human body key points Active CN109583363B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811422699.5A CN109583363B (en) 2018-11-27 2018-11-27 Method and system for detecting and improving posture and body movement of lecturer based on human body key points

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811422699.5A CN109583363B (en) 2018-11-27 2018-11-27 Method and system for detecting and improving posture and body movement of lecturer based on human body key points

Publications (2)

Publication Number Publication Date
CN109583363A CN109583363A (en) 2019-04-05
CN109583363B true CN109583363B (en) 2022-02-11

Family

ID=65924785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811422699.5A Active CN109583363B (en) 2018-11-27 2018-11-27 Method and system for detecting and improving posture and body movement of lecturer based on human body key points

Country Status (1)

Country Link
CN (1) CN109583363B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299132A (en) * 2021-06-08 2021-08-24 上海松鼠课堂人工智能科技有限公司 Student speech skill training method and system based on virtual reality scene

Citations (4)

Publication number Priority date Publication date Assignee Title
EP3115956A1 (en) * 2015-07-09 2017-01-11 Fujitsu Limited Interest degree determination device, interest degree determination method, and interest degree determination program
CN106997243A (en) * 2017-03-28 2017-08-01 北京光年无限科技有限公司 Speech scene monitoring method and device based on intelligent robot
CN107281710A (en) * 2017-07-31 2017-10-24 尚晟 A kind of method of remedial action error
CN108256433A (en) * 2017-12-22 2018-07-06 银河水滴科技(北京)有限公司 A kind of athletic posture appraisal procedure and system

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US8418085B2 (en) * 2009-05-29 2013-04-09 Microsoft Corporation Gesture coach
TWI528336B (en) * 2013-11-29 2016-04-01 Speech skills of audio and video automatic assessment and training system
TWI508034B (en) * 2014-01-08 2015-11-11 Ind Tech Res Inst Cpr teaching system and method
CN107194361B (en) * 2017-05-27 2021-04-02 成都通甲优博科技有限责任公司 Two-dimensional posture detection method and device


Also Published As

Publication number Publication date
CN109583363A (en) 2019-04-05

Similar Documents

Publication Publication Date Title
TWI646444B (en) Method for waking up intelligent robot and intelligent robot
Hsieh et al. A real time hand gesture recognition system using motion history image
Yang et al. Eigenjoints-based action recognition using naive-bayes-nearest-neighbor
JP5323770B2 (en) User instruction acquisition device, user instruction acquisition program, and television receiver
Li et al. Recognition system for home-service-related sign language using entropy-based K-means algorithm and ABC-based HMM
CN105809144A (en) Gesture recognition system and method adopting action segmentation
CN110531853B (en) Electronic book reader control method and system based on human eye fixation point detection
WO2020124993A1 (en) Liveness detection method and apparatus, electronic device, and storage medium
WO2018103416A1 (en) Method and device for detecting facial image
CN111046825A (en) Human body posture recognition method, device and system and computer readable storage medium
CN114779922A (en) Control method for teaching apparatus, control apparatus, teaching system, and storage medium
CN110221693A (en) A kind of intelligent retail terminal operating system based on human-computer interaction
CN111241922B (en) Robot, control method thereof and computer readable storage medium
CN111860394A (en) Gesture estimation and gesture detection-based action living body recognition method
CN112784926A (en) Gesture interaction method and system
CN109583363B (en) Method and system for detecting and improving posture and body movement of lecturer based on human body key points
CN113347381B (en) Method and system for predicting inelegant lifting track
CN104992085A (en) Method and device for human body in-vivo detection based on touch trace tracking
CN116301381A (en) Interaction method, related equipment and system
Hu et al. Gesture detection from RGB hand image using modified convolutional neural network
CN112527103B (en) Remote control method and device for display equipment, equipment and computer readable storage medium
CN113093907B (en) Man-machine interaction method, system, equipment and storage medium
Manresa-Yee et al. Towards hands-free interfaces based on real-time robust facial gesture recognition
Dhamanskar et al. Human computer interaction using hand gestures and voice
KR20200081529A (en) HMD based User Interface Method and Device for Social Acceptability

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant