CN111368800A - Gesture recognition method and device - Google Patents

Gesture recognition method and device

Info

Publication number
CN111368800A
CN111368800A
Authority
CN
China
Prior art keywords
gesture
training
information
voice
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010227340.3A
Other languages
Chinese (zh)
Other versions
CN111368800B (en)
Inventor
徐林嘉
李晓萍
纪耀宗
马格
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202010227340.3A
Publication of CN111368800A
Application granted
Publication of CN111368800B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Psychiatry (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a gesture recognition method and device. The gesture recognition method comprises the following steps: acquiring a color image, a depth image, an infrared image and human body skeleton point information of a gesture made by a user; and inputting the color image, the depth image, the infrared image and the human body skeleton point information into a trained gesture recognition model to obtain semantic information of the gesture made by the user. The invention achieves the technical effect of quickly and accurately recognizing the semantic information contained in user gestures.

Description

Gesture recognition method and device
Technical Field
The invention relates to the field of artificial intelligence, in particular to a gesture recognition method and device.
Background
Gestures are one of the most convenient and common modes of communication between people, and they have long played an important role in human social production practice. With the development of artificial intelligence, human-computer interaction is applied to ever more aspects of daily life, and gestures have become increasingly important in the human-computer interaction process. The natural and convenient character of gestures greatly improves the efficiency of human-computer interaction and greatly expands its application scenarios. However, human gestures are inherently complex, and different recognition methods are subject to various environmental interferences; how to quickly and accurately recognize the complex semantic information contained in human gestures has therefore become a key problem in gesture recognition research.
Disclosure of Invention
In order to solve at least one technical problem in the background art, the invention provides a gesture recognition method and device.
In order to achieve the above object, according to an aspect of the present invention, there is provided a gesture recognition method including:
acquiring a color image, a depth image, an infrared image and human body skeleton point information of a gesture made by a user;
and inputting the color image, the depth image, the infrared image and the human body skeleton point information into a trained gesture recognition model to obtain semantic information of the gesture made by the user.
Optionally, the trained gesture recognition model is obtained by using a gesture sample labeled with semantic information as training data and training by using a preset machine learning algorithm, wherein the gesture sample includes a color image, a depth image, an infrared image and human skeleton point information of a gesture made by a user.
Optionally, the gesture recognition method further includes:
acquiring a training sample set, wherein the training sample set comprises a plurality of gesture samples marked with semantic information, and the gesture samples comprise color images, depth images, infrared images and human skeleton point information of gestures made by a user;
and performing model training by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
Optionally, the machine learning algorithm includes: the CenterNet algorithm.
Optionally, the gesture recognition method further includes:
acquiring collected voice information of a user;
inputting the voice information into a trained voice recognition model to obtain a voice recognition result, wherein the trained voice recognition model is obtained by training with preset voice samples as training data using a Transformer algorithm;
and outputting gesture information corresponding to the voice recognition result.
In order to achieve the above object, according to another aspect of the present invention, there is provided a gesture recognition apparatus including:
the gesture acquisition unit is used for acquiring a color image, a depth image, an infrared image and human body skeleton point information of a gesture made by a user;
and the gesture recognition unit is used for inputting the color image, the depth image, the infrared image and the human skeleton point information into a trained gesture recognition model to obtain semantic information of the gesture made by the user.
Optionally, the trained gesture recognition model is obtained by using a gesture sample labeled with semantic information as training data and training by using a preset machine learning algorithm, wherein the gesture sample includes a color image, a depth image, an infrared image and human skeleton point information of a gesture made by a user.
Optionally, the gesture recognition apparatus further includes:
the training sample set acquisition unit is used for acquiring a training sample set, wherein the training sample set comprises a plurality of gesture samples marked with semantic information, and the gesture samples comprise color images, depth images, infrared images and human skeleton point information of gestures made by a user;
and the model training unit is used for performing model training by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
Optionally, the machine learning algorithm includes: the CenterNet algorithm.
Optionally, the gesture recognition apparatus further includes:
the voice information acquisition unit is used for acquiring the acquired voice information of the user;
the voice recognition unit is used for inputting the voice information into a trained voice recognition model to obtain a voice recognition result, wherein the trained voice recognition model is obtained by training with preset voice samples as training data using a Transformer algorithm;
and the gesture output unit is used for outputting gesture information corresponding to the voice recognition result.
In order to achieve the above object, according to another aspect of the present invention, there is also provided a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the gesture recognition method when executing the computer program.
In order to achieve the above object, according to another aspect of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed in a computer processor, implements the steps in the gesture recognition method described above.
The invention has the beneficial effects that: the gesture recognition model is trained on the color image, the depth image, the infrared image and the human body skeleton point information captured while the user makes a static gesture, and the trained model then recognizes the semantic information corresponding to the user's gesture, thereby achieving the technical effect of quickly and accurately recognizing the semantic information contained in user gestures.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts. In the drawings:
FIG. 1 is a flow chart of a gesture recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a training process of a gesture recognition model according to an embodiment of the present invention;
FIG. 3 is a flow chart of speech translation to gestures in accordance with an embodiment of the present invention;
FIG. 4 is a first block diagram of a gesture recognition apparatus according to an embodiment of the present invention;
FIG. 5 is a second block diagram of a gesture recognition apparatus according to an embodiment of the present invention;
FIG. 6 is a third block diagram of a gesture recognition apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It should be noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and the above-described drawings, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
The invention provides a Kinect-based gesture recognition method that realizes translation from static sign language to voice and from voice to the corresponding sign language, and effectively improves the accuracy of static sign language recognition. With a computer and a Kinect camera, the invention realizes both functions: converting sign language into voice and converting voice into sign language.
Fig. 1 is a flowchart of a gesture recognition method according to an embodiment of the present invention, and as shown in fig. 1, the gesture recognition method according to the embodiment includes steps S101 to S102.
And step S101, acquiring a color image, a depth image, an infrared image and human body skeleton point information of a gesture made by a user.
In an optional embodiment of the invention, this step can be completed by acquiring the user's gesture images with a Kinect camera. The Kinect has a built-in color camera, depth camera, infrared camera and microphone array; the color, depth and infrared cameras can respectively acquire the color image information, depth image information and infrared image information of the gesture action. In addition, the Kinect camera can be used to collect and generate the user's human body skeleton point information.
During collection, the user stands in front of the Kinect camera and makes gesture actions. After the Kinect camera has collected the color image, depth image, infrared image information and human body skeleton point information of the gesture made by the user, the acquired image information is preprocessed with geometric transformation and image enhancement by an image preprocessing algorithm, so as to reduce the number of low-quality images.
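By way of illustration only, the following Python sketch shows what such a preprocessing step could look like using OpenCV; the target resolution, filter sizes and enhancement choices are assumptions made for the sketch and are not specified by the patent.

```python
import cv2

def preprocess_frames(color, depth, infrared, size=(512, 424)):
    """Hypothetical preprocessing: geometric transformation (resizing the
    three modalities to a common resolution) followed by simple image
    enhancement (denoising and contrast equalization) on the color image."""
    # Geometric transformation: bring all modalities to one resolution.
    color = cv2.resize(color, size)
    depth = cv2.resize(depth, size, interpolation=cv2.INTER_NEAREST)
    infrared = cv2.resize(infrared, size)

    # Image enhancement: light denoising, then histogram equalization on
    # the luma channel (assumes 8-bit images), to cut low-quality frames.
    color = cv2.GaussianBlur(color, (3, 3), 0)
    ycrcb = cv2.cvtColor(color, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    color = cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
    return color, depth, infrared
```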
And S102, inputting the color image, the depth image, the infrared image and the human skeleton point information into a trained gesture recognition model to obtain semantic information of the gesture made by the user.
In an optional embodiment of the present invention, the trained gesture recognition model is obtained by using gesture samples labeled with semantic information as training data and training with a preset machine learning algorithm, wherein each gesture sample includes a color image, a depth image, an infrared image and human skeleton point information of a gesture made by a user.
In an alternative embodiment of the present invention, the semantic information of the gesture made by the user may be represented in the form of semantic words or preset numbers. After the semantic information of the gesture is obtained, this step may further determine the voice information corresponding to that semantic information according to a preset correspondence and play it, thereby realizing the conversion from gesture to voice.
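As a minimal, non-authoritative sketch of this step, the snippet below assumes a trained PyTorch model `model` that accepts the four modalities, and a preset table `semantic_to_audio` mapping semantic labels to recorded voice clips; both names are hypothetical.

```python
import torch

def gesture_to_voice(model, color, depth, infrared, skeleton, semantic_to_audio):
    """Run the trained gesture recognition model on the four input
    modalities, then look up the voice clip for the predicted label."""
    model.eval()
    with torch.no_grad():
        logits = model(color, depth, infrared, skeleton)  # (1, num_classes)
        label = int(logits.argmax(dim=-1))
    # Preset correspondence: semantic label -> path of a voice recording.
    return label, semantic_to_audio[label]
```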
From the above description, it can be seen that the gesture recognition model is trained through the color image, the depth image, the infrared image and the human body skeleton point information when the user makes the static gesture, and then the semantic information corresponding to the user gesture is recognized according to the trained gesture recognition model, so that the technical effect of quickly and accurately recognizing the semantic information contained in the user gesture is realized.
Fig. 2 is a training flowchart of the gesture recognition model according to the embodiment of the present invention, and as shown in fig. 2, the specific training process of the gesture recognition model of step S102 includes steps S201 to S202.
Step S201, a training sample set is obtained, wherein the training sample set comprises a plurality of gesture samples marked with semantic information, and the gesture samples comprise color images, depth images, infrared images and human skeleton point information of gestures made by a user.
And S202, performing model training by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
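For orientation, here is a generic supervised training sketch of step S202 in PyTorch. It stands in for the preset machine learning algorithm and omits the heatmap, offset and embedding losses that a full CenterNet objective would use; all names and hyperparameters are assumptions.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class GestureSampleSet(Dataset):
    """Each item: (color, depth, infrared, skeleton, semantic_label)."""
    def __init__(self, samples):
        self.samples = samples
    def __len__(self):
        return len(self.samples)
    def __getitem__(self, idx):
        return self.samples[idx]

def train_model(model, samples, epochs=10, lr=1e-4):
    loader = DataLoader(GestureSampleSet(samples), batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for color, depth, infrared, skeleton, label in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(color, depth, infrared, skeleton), label)
            loss.backward()
            optimizer.step()
    return model
```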
In an alternative embodiment of the present invention, the machine learning algorithm may be any of various existing machine learning algorithms. Preferably, the machine learning algorithm may adopt the CenterNet algorithm.
The CenterNet algorithm is a well-performing one-stage object detection algorithm that detects objects using keypoint triplets.
The CenterNet model derives center heatmaps and corner heatmaps through center pooling and cascade corner pooling, respectively, in order to predict the locations of keypoints.
Center firing: the center of an object does not necessarily contain strong semantic information that is easily distinguished from other classes. While center firing can be used to enrich the center point features. The center firing extracts the maximum values of the horizontal direction and the vertical direction of the center point and adds the maximum values, thereby providing information beyond the position of the center point. This operation gives the central point the opportunity to obtain semantic information that is more easily distinguished from other categories.
Cascade corn pooling; generally, the corner points are located outside the object, and the located positions do not contain semantic information of the associated object, which brings difficulty to the detection of the corner points. The method comprises the steps of extracting object boundary maximum values at first, then continuously extracting maximum values towards the inside (along the direction of a dotted line in the figure) at the boundary maximum values, and adding the maximum values with the boundary maximum values, so as to provide richer associated object semantic information for corner point features.
After the positions and classes of the corner points are obtained, the corner positions are mapped back to the corresponding positions on the output image through offsets, and embeddings are used to judge which two corner points belong to the same object so as to form a detection box. Because this grouping process lacks assistance from information inside the target region, it produces a large number of false detections. To solve this problem, the CenterNet algorithm predicts not only the corner points but also the center points. A central region is defined for each prediction box, and whether the central region of each target box contains a center point is checked: if so, the box is kept and its confidence is set to the average of the scores of the center point, the top-left corner point and the bottom-right corner point; if not, the box is removed. This gives the network the ability to perceive information inside the target region and effectively removes wrong target boxes.
A central region that is too small leads to a low recall rate for small-scale target boxes, while a central region that is too large leads to low precision for large-scale target boxes, so the CenterNet algorithm uses a scale-adjustable definition of the central region. Following the central-region definition of the cited CenterNet paper, for a prediction box with top-left corner (tl_x, tl_y) and bottom-right corner (br_x, br_y), the corners (ctl_x, ctl_y) and (cbr_x, cbr_y) of its central region are:

ctl_x = ((n + 1) * tl_x + (n - 1) * br_x) / (2n)
ctl_y = ((n + 1) * tl_y + (n - 1) * br_y) / (2n)
cbr_x = ((n - 1) * tl_x + (n + 1) * br_x) / (2n)
cbr_y = ((n - 1) * tl_y + (n + 1) * br_y) / (2n)

where n is an odd number that controls the scale of the central region. This definition yields a relatively small central region when the prediction box is large and a relatively large central region when the prediction box is small.
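The check described above can be sketched in a few lines of Python; the formula follows the cited CenterNet paper, while the scale threshold of 150 for choosing n is an assumption taken from that paper rather than from this patent.

```python
def central_region(tlx, tly, brx, bry, n):
    """Corners of the scale-adjustable central region of a box with
    top-left (tlx, tly) and bottom-right (brx, bry); n is odd."""
    ctlx = ((n + 1) * tlx + (n - 1) * brx) / (2.0 * n)
    ctly = ((n + 1) * tly + (n - 1) * bry) / (2.0 * n)
    cbrx = ((n - 1) * tlx + (n + 1) * brx) / (2.0 * n)
    cbry = ((n - 1) * tly + (n + 1) * bry) / (2.0 * n)
    return ctlx, ctly, cbrx, cbry

def keep_box(box, center_point, scale_threshold=150):
    """Keep a corner-pair box only if a predicted center point falls
    inside its central region (n = 3 for small boxes, 5 for large)."""
    tlx, tly, brx, bry = box
    n = 3 if max(brx - tlx, bry - tly) < scale_threshold else 5
    ctlx, ctly, cbrx, cbry = central_region(tlx, tly, brx, bry, n)
    cx, cy = center_point
    return ctlx <= cx <= cbrx and ctly <= cy <= cbry
```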
Therefore, the gesture recognition model trained by the CenterNet algorithm has high recognition accuracy and recognition efficiency.
The invention can also realize the conversion of the voice into the sign language by means of the computer and the Kinect camera. Fig. 3 is a flowchart of voice conversion to gesture according to an embodiment of the present invention, and as shown in fig. 3, the flow of voice conversion to gesture according to an embodiment of the present invention includes steps S301 to S303.
Step S301, acquiring the collected voice information of the user.
In an optional embodiment of the invention, the step can finish the collection of the user voice through a microphone array of the Kinect camera to obtain the voice information of the user.
In an optional embodiment of the invention, after the user's voice information is collected, processing such as filtering and preprocessing is carried out to improve the quality of the voice information.
Step S302, inputting the voice information into a trained voice recognition model to obtain a voice recognition result, wherein the trained voice recognition model is obtained by training with preset voice samples as training data using a Transformer algorithm.
The Transformer model remedies the widely criticized drawback of slow RNN training: its self-attention mechanism enables fast parallel computation, and the residual structure added in the Transformer allows the network depth to be increased substantially, fully exploiting the characteristics of DNN models and improving the model's recognition accuracy.
In an optional embodiment of the present invention, the voice recognition result may be semantic information corresponding to the voice information, and the semantic information may be represented in the form of semantic characters or preset numbers.
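As a hedged illustration of such a model, the sketch below builds a minimal encoder-only Transformer in PyTorch that maps a sequence of acoustic feature frames to one of the semantic labels just described; the architecture, feature dimension and hyperparameters are assumptions, not the patent's specification. Note that residual connections are built into `nn.TransformerEncoderLayer`, matching the residual structure mentioned above.

```python
import torch
import torch.nn as nn

class SpeechTransformer(nn.Module):
    """Encoder-only Transformer: acoustic frames in, semantic label out."""
    def __init__(self, feat_dim=80, d_model=256, num_classes=100):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)  # residuals built in
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, feats):                # feats: (batch, time, feat_dim)
        x = self.encoder(self.proj(feats))   # self-attention runs in parallel
        return self.head(x.mean(dim=1))      # pool over time, then classify
```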
And step S303, outputting gesture information corresponding to the voice recognition result.
In an optional embodiment of the present invention, the step determines, plays and displays the sign language video and/or the text information corresponding to the voice recognition result.
The above embodiments show that the invention realizes a method for translating between static gestures and voice in both directions. The Kinect camera collects multiple kinds of image information while a person makes a static gesture, supplemented by the human skeleton information obtained from the Kinect, and the CenterNet object detection algorithm implements the static sign language recognition function. Compared with common computer-vision-based sign language recognition methods, this reduces the influence of a noisy background on the recognition result, makes full use of multi-dimensional information, and improves recognition accuracy and stability. Meanwhile, the invention realizes the voice recognition function using the Kinect's built-in microphone and a voice recognition algorithm.
Therefore, compared with common sign language recognition technology, the Kinect-based gesture recognition method has higher accuracy and better withstands the influence of complex backgrounds and varying illumination intensities; moreover, it realizes bidirectional translation between sign language and voice, giving it stronger functionality.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
Based on the same inventive concept, an embodiment of the present invention further provides a gesture recognition apparatus, which can be used to implement the gesture recognition method described in the foregoing embodiment, as described in the following embodiments. Because the principle of the gesture recognition apparatus for solving the problem is similar to that of the gesture recognition method, the embodiment of the gesture recognition apparatus can be referred to the embodiment of the gesture recognition method, and repeated details are not repeated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 4 is a first structural block diagram of a gesture recognition apparatus according to an embodiment of the present invention, and as shown in fig. 4, the gesture recognition apparatus according to the embodiment of the present invention includes: gesture collection unit 1 and gesture recognition unit 2.
The gesture collection unit 1 is used for obtaining a color image, a depth image, an infrared image and human body skeleton point information of a gesture made by a user.
And the gesture recognition unit 2 is used for inputting the color image, the depth image, the infrared image and the human skeleton point information into a trained gesture recognition model to obtain semantic information of the gesture made by the user.
In an optional embodiment of the present invention, the trained gesture recognition model is obtained by using gesture samples labeled with semantic information as training data and training with a preset machine learning algorithm, wherein each gesture sample includes a color image, a depth image, an infrared image and human skeleton point information of a gesture made by a user.
Fig. 5 is a second structural block diagram of the gesture recognition apparatus according to the embodiment of the present invention, and as shown in fig. 5, the gesture recognition apparatus according to the embodiment of the present invention further includes: a training sample set acquisition unit 3 and a model training unit 4.
The training sample set obtaining unit 3 is configured to obtain a training sample set, where the training sample set includes a plurality of gesture samples labeled with semantic information, and the gesture samples include color images, depth images, infrared images, and human skeleton point information of gestures made by a user.
And the model training unit 4 is used for performing model training by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
In an alternative embodiment of the invention, the machine learning algorithm comprises: the CenterNet algorithm.
Fig. 6 is a block diagram of a third structure of the gesture recognition apparatus according to the embodiment of the present invention, and as shown in fig. 6, the gesture recognition apparatus according to the embodiment of the present invention further includes: a voice information acquisition unit 5, a voice recognition unit 6, and a gesture output unit 7.
And the voice information acquisition unit 5 is used for acquiring the acquired voice information of the user.
And the voice recognition unit 6 is configured to input the voice information into a trained voice recognition model to obtain a voice recognition result, where the trained voice recognition model is obtained by training with preset voice samples as training data using a Transformer algorithm.
And the gesture output unit 7 is used for outputting gesture information corresponding to the voice recognition result.
To achieve the above object, according to another aspect of the present application, there is also provided a computer apparatus. As shown in fig. 7, the computer device comprises a memory, a processor, a communication interface and a communication bus, wherein a computer program that can be run on the processor is stored in the memory, and the steps of the method of the above embodiment are realized when the processor executes the computer program.
The processor may be a Central Processing Unit (CPU). The processor may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or a combination thereof.
The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs and units, such as the program units corresponding to the method embodiments of the present invention described above. By executing the non-transitory software programs, instructions and modules stored in the memory, the processor carries out its various functional applications and the processing of working data, that is, implements the method in the above method embodiments.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created by the processor, and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be coupled to the processor via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more units are stored in the memory and when executed by the processor perform the method of the above embodiments.
The specific details of the computer device may be understood by referring to the corresponding related descriptions and effects in the above embodiments, and are not described herein again.
In order to achieve the above object, according to another aspect of the present application, there is also provided a computer-readable storage medium storing a computer program which, when executed in a computer processor, implements the steps in the gesture recognition method described above. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may alternatively be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, or fabricated separately as individual integrated circuit modules, or fabricated as a single integrated circuit module from multiple modules or steps. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A gesture recognition method, comprising:
acquiring a color image, a depth image, an infrared image and human body skeleton point information of a gesture made by a user;
and inputting the color image, the depth image, the infrared image and the human body skeleton point information into a trained gesture recognition model to obtain semantic information of the gesture made by the user.
2. The gesture recognition method according to claim 1, wherein the trained gesture recognition model is obtained by using gesture samples labeled with semantic information as training data and training with a preset machine learning algorithm, wherein each gesture sample comprises a color image, a depth image, an infrared image and human skeleton point information of a gesture made by a user.
3. The gesture recognition method according to claim 1, further comprising:
acquiring a training sample set, wherein the training sample set comprises a plurality of gesture samples marked with semantic information, and the gesture samples comprise color images, depth images, infrared images and human skeleton point information of gestures made by a user;
and performing model training by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
4. The gesture recognition method according to claim 2 or 3, wherein the machine learning algorithm comprises: the CenterNet algorithm.
5. The gesture recognition method according to claim 1, further comprising:
acquiring collected voice information of a user;
inputting the voice information into a trained voice recognition model to obtain a voice recognition result, wherein the trained voice recognition model is obtained by training with preset voice samples as training data using a Transformer algorithm;
and outputting gesture information corresponding to the voice recognition result.
6. A gesture recognition apparatus, comprising:
the gesture acquisition unit is used for acquiring a color image, a depth image, an infrared image and human body skeleton point information of a gesture made by a user;
and the gesture recognition unit is used for inputting the color image, the depth image, the infrared image and the human skeleton point information into a trained gesture recognition model to obtain semantic information of the gesture made by the user.
7. The gesture recognition device according to claim 6, wherein the trained gesture recognition model is obtained by using gesture samples labeled with semantic information as training data and training with a preset machine learning algorithm, wherein each gesture sample comprises a color image, a depth image, an infrared image and human skeleton point information of a gesture made by a user.
8. The gesture recognition device of claim 6, further comprising:
the training sample set acquisition unit is used for acquiring a training sample set, wherein the training sample set comprises a plurality of gesture samples marked with semantic information, and the gesture samples comprise color images, depth images, infrared images and human skeleton point information of gestures made by a user;
and the model training unit is used for performing model training by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
9. The gesture recognition apparatus according to claim 7 or 8, wherein the machine learning algorithm comprises: the CenterNet algorithm.
10. The gesture recognition device of claim 6, further comprising:
the voice information acquisition unit is used for acquiring the acquired voice information of the user;
the voice recognition unit is used for inputting the voice information into a trained voice recognition model to obtain a voice recognition result, wherein the trained voice recognition model is obtained by training with preset voice samples as training data using a Transformer algorithm;
and the gesture output unit is used for outputting gesture information corresponding to the voice recognition result.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 5 when executing the computer program.
12. A computer-readable storage medium, in which a computer program is stored which, when executed in a computer processor, implements the method of any one of claims 1 to 5.
CN202010227340.3A 2020-03-27 2020-03-27 Gesture recognition method and device Active CN111368800B (en)

Priority Applications (1)

Application Number: CN202010227340.3A (granted as CN111368800B)
Priority Date: 2020-03-27
Filing Date: 2020-03-27
Title: Gesture recognition method and device

Applications Claiming Priority (1)

Application Number: CN202010227340.3A (granted as CN111368800B)
Priority Date: 2020-03-27
Filing Date: 2020-03-27
Title: Gesture recognition method and device

Publications (2)

Publication Number: CN111368800A, Publication Date: 2020-07-03
Publication Number: CN111368800B, Publication Date: 2023-11-28

Family

ID=71212100

Family Applications (1)

Application Number: CN202010227340.3A (Active, granted as CN111368800B)
Title: Gesture recognition method and device
Priority Date: 2020-03-27
Filing Date: 2020-03-27

Country Status (1)

Country Link
CN (1) CN111368800B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130077820A1 (en) * 2011-09-26 2013-03-28 Microsoft Corporation Machine learning gesture detection
CN104598915A (en) * 2014-01-24 2015-05-06 深圳奥比中光科技有限公司 Gesture recognition method and gesture recognition device
CN107679491A (en) * 2017-09-29 2018-02-09 华中师范大学 A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data
CN110209273A (en) * 2019-05-23 2019-09-06 Oppo广东移动通信有限公司 Gesture identification method, interaction control method, device, medium and electronic equipment
CN110728191A (en) * 2019-09-16 2020-01-24 北京华捷艾米科技有限公司 Sign language translation method, and MR-based sign language-voice interaction method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KAIWEN DUAN et al., "CenterNet: Keypoint Triplets for Object Detection", retrieved from the Internet: https://arxiv.org/pdf/1904.08189.pdf *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112750437A (en) * 2021-01-04 2021-05-04 欧普照明股份有限公司 Control method, control device and electronic equipment
CN113515191A (en) * 2021-05-12 2021-10-19 中国工商银行股份有限公司 Information interaction method and device based on sign language identification and synthesis
CN115471917A (en) * 2022-09-29 2022-12-13 中国电子科技集团公司信息科学研究院 Gesture detection and recognition system and method
CN115471917B (en) * 2022-09-29 2024-02-27 中国电子科技集团公司信息科学研究院 Gesture detection and recognition system and method

Also Published As

Publication Number: CN111368800B (en)
Publication Date: 2023-11-28

Similar Documents

Publication Publication Date Title
CN107273800B (en) Attention mechanism-based motion recognition method for convolutional recurrent neural network
TWI714834B (en) Human face live detection method, device and electronic equipment
CN109740534B (en) Image processing method, device and processing equipment
Nguyen et al. Yolo based real-time human detection for smart video surveillance at the edge
US8792722B2 (en) Hand gesture detection
JP2021502627A (en) Image processing system and processing method using deep neural network
CN111368800A (en) Gesture recognition method and device
US12033374B2 (en) Image processing method, apparatus, and device, and storage medium
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
US10997730B2 (en) Detection of moment of perception
WO2021047587A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
CN107786867A (en) Image identification method and system based on deep learning architecture
CN111722700A (en) Man-machine interaction method and man-machine interaction equipment
CN112749646A (en) Interactive point-reading system based on gesture recognition
CN109977875A (en) Gesture identification method and equipment based on deep learning
Lahiani et al. Hand pose estimation system based on Viola-Jones algorithm for android devices
CN112580395A (en) Depth information-based 3D face living body recognition method, system, device and medium
CN113792807A (en) Skin disease classification model training method, system, medium and electronic device
Hebda et al. A compact deep convolutional neural network architecture for video based age and gender estimation
CN109871128B (en) Question type identification method and device
Rayeed et al. Bangla sign digits recognition using depth information
Wang et al. Research on Gesture Recognition Algorithm Based on Lightweight YOLOv4
Rawat et al. Indian sign language recognition system for interrogative words using deep learning
Kheldoun et al. Algsl89: An algerian sign language dataset
Jeny et al. Hand Gesture Recognition for Sign Language Using Convolutional Neural Network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant