CN111368800B - Gesture recognition method and device - Google Patents
Gesture recognition method and device
- Publication number
- CN111368800B (application CN202010227340.3A)
- Authority
- CN
- China
- Prior art keywords
- information
- gesture
- voice
- central
- points
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Abstract
The application discloses a gesture recognition method and a gesture recognition device. The method comprises: acquiring a color image, a depth image, an infrared image and human skeleton point information of a gesture made by a user; and inputting the color image, the depth image, the infrared image and the human skeleton point information into a trained gesture recognition model to obtain semantic information of the gesture made by the user. The application thereby achieves the technical effect of quickly and accurately recognizing the semantic information contained in the user's gesture.
Description
Technical Field
The application relates to the field of artificial intelligence, in particular to a gesture recognition method and device.
Background
Gestures are one of the most convenient and commonly used modes of communication between people and have long played an important role in human social and productive activities. With the development of artificial intelligence, human-computer interaction has gradually entered many aspects of daily life, and gestures occupy an increasingly important position in the human-computer interaction process. The naturalness and convenience of gestures greatly improve the efficiency of human-computer interaction and broaden its application scenarios. However, human gestures are very complex, and different recognition methods are subject to various environmental disturbances, so quickly and accurately recognizing the complex semantic information contained in human gestures has become a key problem in gesture recognition research.
Disclosure of Invention
The application provides a gesture recognition method and device to solve at least one of the technical problems noted in the background art.
To achieve the above object, according to one aspect of the present application, there is provided a gesture recognition method, including:
acquiring color images, depth images, infrared images and human skeleton point information of gestures made by a user;
and inputting the color image, the depth image, the infrared image and the human skeleton point information into a trained gesture recognition model to obtain semantic information of gestures made by the user.
Optionally, the trained gesture recognition model is obtained by using a gesture sample marked with semantic information as training data and training by using a preset machine learning algorithm, wherein the gesture sample comprises a color image, a depth image, an infrared image and human skeleton point information of a gesture made by a user.
Optionally, the gesture recognition method further includes:
acquiring a training sample set, wherein the training sample set comprises a plurality of gesture samples marked with semantic information, and the gesture samples comprise color images, depth images, infrared images and human skeleton point information of gestures made by a user;
and training the model by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
Optionally, the machine learning algorithm includes the CenterNet algorithm.
Optionally, the gesture recognition method further includes:
acquiring collected voice information of a user;
inputting the voice information into a trained voice recognition model to obtain a voice recognition result, wherein the trained voice recognition model is obtained by training on preset voice samples using a Transformer algorithm;
and outputting gesture information corresponding to the voice recognition result.
In order to achieve the above object, according to another aspect of the present application, there is provided a gesture recognition apparatus including:
the gesture acquisition unit is used for acquiring color images, depth images, infrared images and human skeleton point information of gestures made by a user;
the gesture recognition unit is used for inputting the color image, the depth image, the infrared image and the human skeleton point information into a trained gesture recognition model to obtain semantic information of gestures made by the user.
Optionally, the trained gesture recognition model is obtained by using a gesture sample marked with semantic information as training data and training by using a preset machine learning algorithm, wherein the gesture sample comprises a color image, a depth image, an infrared image and human skeleton point information of a gesture made by a user.
Optionally, the gesture recognition apparatus further includes:
the training sample set acquisition unit is used for acquiring a training sample set, wherein the training sample set comprises a plurality of gesture samples marked with semantic information, and the gesture samples comprise color images, depth images, infrared images and human skeleton point information of gestures made by a user;
and the model training unit is used for carrying out model training by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
Optionally, the machine learning algorithm includes the CenterNet algorithm.
Optionally, the gesture recognition apparatus further includes:
the voice information acquisition unit is used for acquiring the acquired voice information of the user;
the voice recognition unit is used for inputting the voice information into a trained voice recognition model to obtain a voice recognition result, wherein the trained voice recognition model is obtained by training on preset voice samples using a Transformer algorithm;
and the gesture output unit is used for outputting gesture information corresponding to the voice recognition result.
To achieve the above object, according to another aspect of the present application, there is also provided a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the gesture recognition method described above when executing the computer program.
To achieve the above object, according to another aspect of the present application, there is also provided a computer-readable storage medium storing a computer program which, when executed in a computer processor, implements the steps of the gesture recognition method described above.
The beneficial effects of the application are as follows: the gesture recognition model is trained on the color image, depth image, infrared image and human skeleton point information captured while the user makes a static gesture, so that the semantic information corresponding to the user's gesture is recognized by the trained gesture recognition model, achieving the technical effect of quickly and accurately recognizing the semantic information contained in the user's gesture.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for the embodiments or for the description of the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained from them by a person skilled in the art without inventive effort. In the drawings:
FIG. 1 is a flow chart of a gesture recognition method according to an embodiment of the present application;
FIG. 2 is a training flow diagram of a gesture recognition model in accordance with an embodiment of the present application;
FIG. 3 is a flow chart of voice conversion to gestures according to an embodiment of the present application;
FIG. 4 is a first block diagram of a gesture recognition apparatus according to an embodiment of the present application;
FIG. 5 is a second block diagram of a gesture recognition apparatus according to an embodiment of the present application;
FIG. 6 is a third block diagram of a gesture recognition apparatus according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a computer device according to an embodiment of the application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It is noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present application and in the foregoing figures, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
The Kinect-based gesture recognition method provided by the application realizes translation from static sign language to voice and from voice to the corresponding sign language, and effectively improves the accuracy of static sign language recognition. With a computer and a Kinect camera, the application can perform both sign-language-to-voice and voice-to-sign-language conversion.
Fig. 1 is a flowchart of a gesture recognition method according to an embodiment of the present application, as shown in fig. 1, the gesture recognition method of the present embodiment includes steps S101 to S102.
Step S101, color image, depth image, infrared image, and human skeleton point information of a gesture made by a user are acquired.
In an alternative embodiment of the present application, this step may acquire the user's gesture images through a Kinect camera. The Kinect has a built-in color camera, depth camera, infrared camera and microphone array; the color, depth and infrared cameras respectively capture the color image, depth image and infrared image of the gesture action. In addition, the Kinect camera can collect and generate the human skeleton point information of the user.
During acquisition, the user stands in front of the Kinect camera and makes gesture actions. After the Kinect camera acquires the color images, depth images, infrared images and human skeleton point information of the user's gesture, the acquired image information is preprocessed with an image preprocessing algorithm using geometric transformation and image enhancement, reducing the number of low-quality images.
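As an illustration of this preprocessing step, the sketch below applies a geometric transformation (resizing) and a simple image enhancement (histogram equalization of the luminance channel) to one captured frame. The target resolution and the choice of enhancement are assumptions for illustration, since the patent does not fix the concrete operations.

```python
import cv2
import numpy as np

def preprocess_frame(frame: np.ndarray, size=(512, 512)) -> np.ndarray:
    """Geometric transformation plus image enhancement for one captured frame.
    A minimal sketch; the actual preprocessing pipeline is not specified in
    detail by the patent."""
    # Geometric transformation: resize to the assumed network input resolution.
    resized = cv2.resize(frame, size, interpolation=cv2.INTER_LINEAR)
    # Image enhancement: equalize the luminance channel to reduce the
    # influence of uneven illumination.
    ycrcb = cv2.cvtColor(resized, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)
```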
Step S102, inputting the color image, the depth image, the infrared image and the human skeleton point information into a trained gesture recognition model to obtain semantic information of gestures made by the user.
In an optional embodiment of the present application, the trained gesture recognition model is obtained by using a gesture sample labeled with semantic information as training data and using a preset machine learning algorithm, where the gesture sample includes a color image, a depth image, an infrared image, and human skeleton point information of a gesture made by a user.
In an alternative embodiment of the present application, the semantic information of the gesture made by the user may be represented in the form of semantic text or a preset number. In an alternative embodiment of the present application, after the semantic information of the gesture made by the user is obtained, this step may further determine the voice information corresponding to the semantic information according to a preset correspondence, and play the voice information, thereby implementing the conversion from gesture to voice.
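A minimal sketch of this gesture-to-voice conversion is shown below: a preset correspondence table maps the recognized semantic label to an audio clip, which is then played. The table contents, file paths and the audio backend are hypothetical, not part of the patent.

```python
# Hypothetical mapping from recognized gesture semantics to audio clips.
SEMANTIC_TO_AUDIO = {
    "hello": "audio/hello.wav",
    "thank_you": "audio/thank_you.wav",
}

def gesture_to_voice(semantic_label: str) -> None:
    """Look up the voice clip for a recognized gesture and play it."""
    path = SEMANTIC_TO_AUDIO.get(semantic_label)
    if path is None:
        return  # no preset correspondence for this label
    # The playback backend is an assumption; any audio player would do here.
    import simpleaudio as sa
    sa.WaveObject.from_wave_file(path).play().wait_done()
```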
From the above description, the gesture recognition model is trained on the color image, depth image, infrared image and human skeleton point information captured while the user makes a static gesture, so that the semantic information corresponding to the user's gesture is recognized by the trained gesture recognition model, achieving the technical effect of quickly and accurately recognizing the semantic information contained in the user's gesture.
Fig. 2 is a training flowchart of the gesture recognition model according to an embodiment of the present application, and as shown in fig. 2, the specific training flowchart of the gesture recognition model in step S102 includes steps S201 to S202.
Step S201, a training sample set is obtained, where the training sample set includes a plurality of gesture samples labeled with semantic information, and the gesture samples include color images, depth images, infrared images, and human skeleton point information of a gesture made by a user.
Step S202, training a model by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
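The sketch below illustrates steps S201 to S202 in code: one way to stack the four modalities of a gesture sample into a single input tensor, followed by a generic supervised training loop. The early-fusion layout, the optimizer and the classification loss are assumptions for illustration; a real CenterNet model would instead be trained with its own heatmap, offset and embedding losses.

```python
import torch
from torch.utils.data import DataLoader

def stack_modalities(color, depth, infrared, skeleton_map):
    """Fuse one gesture sample into a single multi-channel tensor.
    color: (3, H, W); depth, infrared, skeleton_map: (1, H, W) each.
    The early-fusion layout is an illustrative assumption; the patent
    does not fix how the four modalities are combined."""
    return torch.cat([color, depth, infrared, skeleton_map], dim=0)  # (6, H, W)

def train(model, dataset, epochs=10, lr=1e-4, device="cpu"):
    """Generic supervised training loop over labeled gesture samples.
    This only shows the overall training flow of steps S201-S202."""
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for _ in range(epochs):
        for x, y in loader:  # x: stacked modalities, y: semantic label
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
    return model
```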
In alternative embodiments of the present application, the machine learning algorithm described above may be any of a variety of existing machine learning algorithms. Preferably, the machine learning algorithm is the CenterNet algorithm.
The CenterNet algorithm is an excellent one-stage object detection algorithm that detects objects using keypoint triplets.
The CenterNet model obtains the center heatmap through center pooling and the corner heatmaps through cascade corner pooling, and uses them to predict the positions of the keypoints.
Center pooling: the center of an object does not necessarily contain strong semantic information that is easily distinguished from other categories, so center pooling is used to enrich the center point features. Center pooling extracts the maximum responses along the horizontal and vertical directions of the center point and adds them, thereby providing information beyond the center point's own position. This operation gives the center point the opportunity to obtain semantic information that is more easily distinguished from other categories.
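A simplified sketch of the center pooling idea described above: for every location of the feature map, the maximum response along its row is added to the maximum along its column. The official CenterNet implementation builds this from cascaded corner-pooling layers, so the reduction over full rows and columns here is a simplification.

```python
import torch

def center_pooling(fmap: torch.Tensor) -> torch.Tensor:
    """Center pooling on a feature map of shape (B, C, H, W): for every
    location, add the maximum response along its row to the maximum
    response along its column. A simplified sketch of the idea."""
    row_max = fmap.max(dim=3, keepdim=True).values   # max over width  -> (B, C, H, 1)
    col_max = fmap.max(dim=2, keepdim=True).values   # max over height -> (B, C, 1, W)
    return row_max + col_max                          # broadcasts to (B, C, H, W)
```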
Cascade corner pooling: in general, corner points lie outside the object, and their positions do not contain semantic information about the object itself, which makes corner detection difficult. Cascade corner pooling first extracts the boundary maximum of the object, then continues inward from the location of that boundary maximum to extract an internal maximum, and adds the two together, thereby providing richer object-related semantic information for the corner features.
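The sketch below approximates cascade corner pooling for a top-left corner by chaining a right-to-left cumulative maximum (the boundary maximum) with a bottom-to-top cumulative maximum (the internal maximum) and adding the input back. The official implementation uses dedicated corner-pooling operators, so this is only an illustration of the idea, not the patent's exact computation.

```python
import torch

def cascade_top_left_pool(fmap: torch.Tensor) -> torch.Tensor:
    """Approximate cascade corner pooling for a top-left corner heatmap on a
    feature map of shape (B, C, H, W). Step 1 scans each row from right to
    left to propagate the boundary maximum; step 2 scans that result from
    bottom to top to pull in an internal maximum; the input is added back."""
    boundary_max = fmap.flip([3]).cummax(dim=3).values.flip([3])      # max over columns [j, W)
    cascaded = boundary_max.flip([2]).cummax(dim=2).values.flip([2])  # then max over rows [i, H)
    return cascaded + fmap
```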
After the positions and categories of the corner points are obtained, the corner positions are mapped to the corresponding positions of the output image through offsets, and embeddings are used to decide which two corner points belong to the same object, so as to form a detection box. As mentioned above, this grouping process produces a large number of false positives because it lacks assistance from information inside the target region. To solve this problem, the CenterNet algorithm predicts not only corner points but also center points. A central region is defined for each prediction box, and it is judged whether the central region of each candidate box contains a center point: if it does, the box is kept, and its confidence is the average of the confidences of the center point, the top-left corner and the bottom-right corner; if it does not, the box is removed. This gives the network the ability to perceive information inside the target region, so erroneous target boxes can be effectively removed.
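As a sketch of this center-point check, the function below keeps a candidate box only when a predicted center point of the same class falls inside its central region, and scores it as the average of the three keypoint confidences. The plain-tuple data layout and the pluggable region function are illustrative assumptions, not the patent's tensor format.

```python
def filter_boxes_by_center(boxes, centers, region_fn):
    """Keep a corner-pair detection only if a predicted center point of the
    same class lies inside its central region; score the kept box as the
    mean of the top-left, bottom-right and center confidences.
    boxes:   iterable of (x1, y1, x2, y2, score_tl, score_br, cls)
    centers: iterable of (x, y, score_center, cls)
    region_fn: maps a box to its central region (cx1, cy1, cx2, cy2)."""
    kept = []
    for (x1, y1, x2, y2, s_tl, s_br, cls) in boxes:
        cx1, cy1, cx2, cy2 = region_fn(x1, y1, x2, y2)
        for (px, py, s_ct, c_cls) in centers:
            if c_cls == cls and cx1 <= px <= cx2 and cy1 <= py <= cy2:
                kept.append((x1, y1, x2, y2, (s_tl + s_br + s_ct) / 3.0, cls))
                break  # the box is confirmed by a center point; keep it once
    return kept
```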
A central region that is too small prevents many erroneous small-scale target boxes from being removed, while a central region that is too large prevents many erroneous large-scale target boxes from being removed; the CenterNet algorithm therefore uses a scale-adjustable central region definition, which can be as follows:
the method defines a relatively small central region when the scale of the prediction box is large, and a relatively large central region when the scale of the prediction box is small.
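One way to realize the scale-adjustable central region is sketched below, following the published CenterNet paper: a larger divisor n is used for large boxes (giving a relatively smaller central region) and a smaller n for small boxes. The n = 3 / n = 5 split at a box scale of 150 comes from that paper and is an assumption here, since the patent only states that the region is scale-adjustable. The function can serve as the region_fn argument of the filtering sketch above.

```python
def central_region(x1, y1, x2, y2, scale_threshold=150.0):
    """Scale-adjustable central region of a box (x1, y1, x2, y2).
    n = 5 for large boxes and n = 3 for small ones is an assumption taken
    from the CenterNet paper; the region side length equals box side / n."""
    n = 5 if max(x2 - x1, y2 - y1) > scale_threshold else 3
    cx1 = ((n + 1) * x1 + (n - 1) * x2) / (2 * n)
    cy1 = ((n + 1) * y1 + (n - 1) * y2) / (2 * n)
    cx2 = ((n - 1) * x1 + (n + 1) * x2) / (2 * n)
    cy2 = ((n - 1) * y1 + (n + 1) * y2) / (2 * n)
    return cx1, cy1, cx2, cy2
```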
Therefore, the gesture recognition model trained by the CenterNet algorithm has high recognition accuracy and recognition efficiency.
The application can also convert voice into sign language by means of the computer and the Kinect camera. Fig. 3 is a flowchart of converting voice into gestures according to an embodiment of the present application; as shown in Fig. 3, the conversion includes steps S301 to S303.
Step S301, acquiring collected voice information of a user.
In an alternative embodiment of the present application, the step may complete the collection of the user's voice through the microphone array of the Kinect camera, so as to obtain the voice information of the user.
In an alternative embodiment of the application, after the user's voice information is acquired, filtering, preprocessing and other processing are performed on it to improve the quality of the voice information.
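A minimal sketch of such voice preprocessing is shown below: amplitude normalization followed by a pre-emphasis filter. The patent only requires filtering and preprocessing in general terms, so the concrete steps and the filter coefficient are assumptions.

```python
import numpy as np

def preprocess_speech(waveform: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Minimal speech clean-up: amplitude normalization followed by a
    pre-emphasis filter that boosts high frequencies. Illustrative only."""
    x = waveform.astype(np.float64)
    x = x / (np.max(np.abs(x)) + 1e-8)                    # normalize amplitude
    emphasized = np.append(x[0], x[1:] - alpha * x[:-1])  # pre-emphasis filter
    return emphasized
```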
Step S302, inputting the voice information into a trained voice recognition model to obtain a voice recognition result, wherein the trained voice recognition model is obtained by training on preset voice samples using a Transformer algorithm.
The Transformer model addresses the much-criticized slow training of RNNs and achieves fast parallelization through the self-attention mechanism; in addition, residual structures in the Transformer allow the network to be made very deep, fully exploiting the characteristics of deep neural network models and improving recognition accuracy.
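The self-attention operation this paragraph refers to can be sketched as follows: a single-head scaled dot-product attention with a residual connection. It is a sketch of the mechanism, not the patent's speech recognition model; the caller-supplied projection matrices are assumptions.

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor, w_q, w_k, w_v) -> torch.Tensor:
    """Scaled dot-product self-attention over a sequence x of shape (T, d).
    Single head, no masking; w_q, w_k, w_v are (d, d) projection matrices
    supplied by the caller."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)  # (T, T) attention logits
    attn = F.softmax(scores, dim=-1)                          # every position attends to all others
    out = attn @ v
    return out + x  # residual connection, as highlighted in the text
```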
In an alternative embodiment of the present application, the speech recognition result may be semantic information corresponding to the speech information, where the semantic information may be represented in the form of semantic text or a preset number.
Step S303, outputting gesture information corresponding to the voice recognition result.
In an alternative embodiment of the present application, the step determines sign language video and/or text information corresponding to the voice recognition result, and plays and displays the sign language video and/or text information.
This embodiment realizes mutual translation between static gestures and voice. The Kinect camera acquires various kinds of image information while a person makes a static gesture, assisted by the human skeleton information the Kinect collects, and static sign language recognition is realized with the CenterNet object detection algorithm. Compared with a common computer-vision-based sign language recognition method, this reduces the influence of a noisy background on recognition, makes full use of multi-dimensional information, and improves the accuracy and stability of recognition. Meanwhile, the application can realize the voice recognition function using the microphone built into the Kinect together with a voice recognition algorithm.
Therefore, compared with common sign language recognition technology, the Kinect-based gesture recognition method of the application has higher accuracy and better reduces the influence of complex backgrounds and varying illumination intensities; moreover, it realizes two-way translation between sign language and voice and therefore offers stronger functionality.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in an order other than that illustrated herein.
Based on the same inventive concept, the embodiment of the present application also provides a gesture recognition apparatus, which can be used to implement the gesture recognition method described in the above embodiment, as described in the following embodiments. Since the principle by which the gesture recognition apparatus solves the problem is similar to that of the gesture recognition method, reference may be made to the embodiments of the gesture recognition method, and repeated details are not described again. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 4 is a first block diagram of a gesture recognition apparatus according to an embodiment of the present application, as shown in FIG. 4, the gesture recognition apparatus according to an embodiment of the present application includes: a gesture acquisition unit 1 and a gesture recognition unit 2.
The gesture acquisition unit 1 is used for acquiring color images, depth images, infrared images and human skeleton point information of gestures made by a user.
The gesture recognition unit 2 is configured to input the color image, the depth image, the infrared image, and the human skeleton point information into a trained gesture recognition model, so as to obtain semantic information of a gesture made by the user.
In an optional embodiment of the present application, the trained gesture recognition model is obtained by using a gesture sample labeled with semantic information as training data and using a preset machine learning algorithm, where the gesture sample includes a color image, a depth image, an infrared image, and human skeleton point information of a gesture made by a user.
FIG. 5 is a second block diagram of the gesture recognition apparatus according to the embodiment of the present application, as shown in FIG. 5, where the gesture recognition apparatus according to the embodiment of the present application further includes: a training sample set acquisition unit 3 and a model training unit 4.
The training sample set obtaining unit 3 is configured to obtain a training sample set, where the training sample set includes a plurality of gesture samples labeled with semantic information, and the gesture samples include a color image, a depth image, an infrared image, and human skeleton point information of a gesture made by a user.
And the model training unit 4 is used for carrying out model training by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
In an alternative embodiment of the application, the machine learning algorithm comprises: the centrnet algorithm.
FIG. 6 is a third block diagram of a gesture recognition apparatus according to an embodiment of the present application, as shown in FIG. 6, where the gesture recognition apparatus according to an embodiment of the present application further includes: a voice information acquisition unit 5, a voice recognition unit 6, and a gesture output unit 7.
And the voice information acquisition unit 5 is used for acquiring the acquired voice information of the user.
The voice recognition unit 6 is configured to input the voice information into a trained voice recognition model to obtain a voice recognition result, where the trained voice recognition model is obtained by training on preset voice samples using a Transformer algorithm.
And the gesture output unit 7 is used for outputting gesture information corresponding to the voice recognition result.
To achieve the above object, according to another aspect of the present application, there is also provided a computer device. As shown in Fig. 7, the computer device includes a memory, a processor, a communication interface, and a communication bus; the memory stores a computer program executable on the processor, and the processor implements the steps of the methods in the above embodiments when executing the computer program.
The processor may be a central processing unit (Central Processing Unit, CPU). The processor may also be any other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof.
The memory serves as a non-transitory computer-readable storage medium for storing non-transitory software programs, non-transitory computer-executable programs, and units, such as the program units corresponding to the above-described method embodiments of the application. By running the non-transitory software programs, instructions, and modules stored in the memory, the processor executes various functional applications and performs data processing, i.e., implements the methods of the method embodiments described above.
The memory may include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application program required for a function, and the data storage area may store data created by the processor, etc. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, and the remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The one or more units are stored in the memory, which when executed by the processor, performs the method in the above embodiments.
The details of the computer device may be correspondingly understood by referring to the corresponding relevant descriptions and effects in the above embodiments, and will not be repeated here.
To achieve the above object, according to another aspect of the present application, there is also provided a computer-readable storage medium storing a computer program which, when executed in a computer processor, implements the steps of the gesture recognition method described above. It will be appreciated by those skilled in the art that all or part of the flows of the above method embodiments may be implemented by a computer program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the flows of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also comprise a combination of memories of the kinds described above.
It will be apparent to those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, or they may alternatively be implemented in program code executable by computing devices, such that they may be stored in a memory device for execution by the computing devices, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps within them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.
Claims (6)
1. A method of gesture recognition, comprising:
acquiring color images, depth images, infrared images and human skeleton point information of a gesture made by a user, wherein the user stands in front of a Kinect camera to make gesture actions, and the Kinect camera acquires the color images, the depth images, the infrared image information and the human skeleton point information of the gesture made by the user, wherein the Kinect camera is internally provided with the color camera, the depth camera, the infrared camera and a microphone array;
inputting the color image, the depth image, the infrared image and the human skeleton point information into a trained gesture recognition model to obtain semantic information of gestures made by the user, further determining voice information corresponding to the semantic information according to a preset corresponding relation, and playing the voice information to realize conversion from the gestures to the voices, wherein the trained gesture recognition model is obtained by taking gesture samples marked with the semantic information as training data and training by adopting a preset machine learning algorithm, and the gesture samples comprise the color image, the depth image, the infrared image and the human skeleton point information of the gestures made by the user;
the gesture recognition model is specifically a CenterNet algorithm model, which obtains a center heatmap through center pooling and a corner heatmap through cascade corner pooling, so as to predict the positions of keypoints from the center heatmap and the corner heatmap; the center pooling is used to enrich the center point features: it extracts the maximum values along the horizontal and vertical directions of the center point and adds them, thereby providing the center point with information beyond its own position, so that the center point has the opportunity to obtain semantic information that is more easily distinguished from other categories; the cascade corner pooling extracts the boundary maximum value of an object, continues to extract an internal maximum value from the location of the boundary maximum value, and adds it to the boundary maximum value, providing richer object-related semantic information for the corner features; after the positions and categories of the corner points are obtained, the corner positions are mapped to the corresponding positions of the output image through an offset feature map, and an embedding feature map is used to judge which two corner points belong to the same object so as to form a detection box; the CenterNet algorithm model predicts both corner points and center points, defines a central region for each prediction box, and judges whether the central region of each candidate box contains a center point: if so, the box is kept, and its confidence is the average of the confidences of the center point, the top-left corner point and the bottom-right corner point; if not, the box is removed, so that the network has the ability to perceive information inside the target region and erroneous target boxes can be effectively removed; the CenterNet algorithm model uses a scale-adjustable central region definition method;
acquiring voice information of a collected user, wherein the collection of the voice of the user is completed through a microphone array of a Kinect camera;
inputting the voice information into a trained voice recognition model to obtain a voice recognition result, wherein the trained voice recognition model is obtained by training with a Transformer algorithm according to preset voice samples, the voice recognition result is semantic information corresponding to the voice information, and the semantic information is represented in the form of semantic text or a preset number;
and determining sign language video and/or text information corresponding to the voice recognition result, and playing and displaying the sign language video and/or text information.
2. The gesture recognition method of claim 1, further comprising:
acquiring a training sample set, wherein the training sample set comprises a plurality of gesture samples marked with semantic information;
and training the model by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
3. A gesture recognition apparatus, comprising:
the gesture acquisition unit is used for acquiring color images, depth images, infrared images and human skeleton point information of gestures made by a user, wherein the user stands in front of a Kinect camera to make gesture actions, the Kinect camera acquires the color images, the depth images, the infrared image information and the human skeleton point information of the gestures made by the user, and the Kinect camera is internally provided with the color camera, the depth camera, the infrared camera and a microphone array;
the gesture recognition unit is used for inputting the color image, the depth image, the infrared image and the human skeleton point information into a trained gesture recognition model to obtain semantic information of gestures made by the user, further determining voice information corresponding to the semantic information according to a preset corresponding relation, and playing the voice information to realize conversion from the gestures to the voices, wherein the trained gesture recognition model is obtained by taking a gesture sample marked with the semantic information as training data and training by adopting a preset machine learning algorithm, and the gesture sample comprises the color image, the depth image, the infrared image and the human skeleton point information of the gestures made by the user;
the gesture recognition model is specifically a CenterNet algorithm model, which obtains a center heatmap through center pooling and a corner heatmap through cascade corner pooling, so as to predict the positions of keypoints from the center heatmap and the corner heatmap; the center pooling is used to enrich the center point features: it extracts the maximum values along the horizontal and vertical directions of the center point and adds them, thereby providing the center point with information beyond its own position, so that the center point has the opportunity to obtain semantic information that is more easily distinguished from other categories; the cascade corner pooling extracts the boundary maximum value of an object, continues to extract an internal maximum value from the location of the boundary maximum value, and adds it to the boundary maximum value, providing richer object-related semantic information for the corner features; after the positions and categories of the corner points are obtained, the corner positions are mapped to the corresponding positions of the output image through an offset feature map, and an embedding feature map is used to judge which two corner points belong to the same object so as to form a detection box; the CenterNet algorithm model predicts both corner points and center points, defines a central region for each prediction box, and judges whether the central region of each candidate box contains a center point: if so, the box is kept, and its confidence is the average of the confidences of the center point, the top-left corner point and the bottom-right corner point; if not, the box is removed, so that the network has the ability to perceive information inside the target region and erroneous target boxes can be effectively removed; the CenterNet algorithm model uses a scale-adjustable central region definition method;
the voice information acquisition unit is used for acquiring the acquired voice information of the user, wherein the acquisition of the voice of the user is completed through a microphone array of the Kinect camera;
the voice recognition unit is used for inputting the voice information into a trained voice recognition model to obtain a voice recognition result, wherein the trained voice recognition model is obtained by training with a Transformer algorithm according to preset voice samples, the voice recognition result is semantic information corresponding to the voice information, and the semantic information is represented in the form of semantic text or a preset number;
and the gesture output unit is used for determining sign language video and/or text information corresponding to the voice recognition result, and playing and displaying the sign language video and/or text information.
4. The gesture recognition apparatus of claim 3, further comprising:
the training sample set acquisition unit is used for acquiring a training sample set, wherein the training sample set comprises a plurality of gesture samples marked with semantic information;
and the model training unit is used for carrying out model training by adopting a preset machine learning algorithm according to the training sample set to obtain a trained gesture recognition model.
5. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 2 when executing the computer program.
6. A computer readable storage medium storing a computer program, characterized in that the computer program when executed in a computer processor implements the method of any one of claims 1 to 2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010227340.3A CN111368800B (en) | 2020-03-27 | 2020-03-27 | Gesture recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010227340.3A CN111368800B (en) | 2020-03-27 | 2020-03-27 | Gesture recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111368800A CN111368800A (en) | 2020-07-03 |
CN111368800B true CN111368800B (en) | 2023-11-28 |
Family
ID=71212100
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010227340.3A Active CN111368800B (en) | 2020-03-27 | 2020-03-27 | Gesture recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111368800B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112750437A (en) * | 2021-01-04 | 2021-05-04 | 欧普照明股份有限公司 | Control method, control device and electronic equipment |
CN113515191A (en) * | 2021-05-12 | 2021-10-19 | 中国工商银行股份有限公司 | Information interaction method and device based on sign language identification and synthesis |
CN115471917B (en) * | 2022-09-29 | 2024-02-27 | 中国电子科技集团公司信息科学研究院 | Gesture detection and recognition system and method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104598915A (en) * | 2014-01-24 | 2015-05-06 | 深圳奥比中光科技有限公司 | Gesture recognition method and gesture recognition device |
CN107679491A (en) * | 2017-09-29 | 2018-02-09 | 华中师范大学 | A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data |
CN110209273A (en) * | 2019-05-23 | 2019-09-06 | Oppo广东移动通信有限公司 | Gesture identification method, interaction control method, device, medium and electronic equipment |
CN110728191A (en) * | 2019-09-16 | 2020-01-24 | 北京华捷艾米科技有限公司 | Sign language translation method, and MR-based sign language-voice interaction method and system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130077820A1 (en) * | 2011-09-26 | 2013-03-28 | Microsoft Corporation | Machine learning gesture detection |
- 2020-03-27: application CN202010227340.3A (China), patent CN111368800B, active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104598915A (en) * | 2014-01-24 | 2015-05-06 | 深圳奥比中光科技有限公司 | Gesture recognition method and gesture recognition device |
CN107679491A (en) * | 2017-09-29 | 2018-02-09 | 华中师范大学 | A kind of 3D convolutional neural networks sign Language Recognition Methods for merging multi-modal data |
CN110209273A (en) * | 2019-05-23 | 2019-09-06 | Oppo广东移动通信有限公司 | Gesture identification method, interaction control method, device, medium and electronic equipment |
CN110728191A (en) * | 2019-09-16 | 2020-01-24 | 北京华捷艾米科技有限公司 | Sign language translation method, and MR-based sign language-voice interaction method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111368800A (en) | 2020-07-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |