CN114067425A - Gesture recognition method and device - Google Patents

Gesture recognition method and device

Info

Publication number
CN114067425A
Authority
CN
China
Prior art keywords
gesture
image
vector
feature vector
gesture recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010749828.2A
Other languages
Chinese (zh)
Inventor
李明阳 (Li Mingyang)
周振坤 (Zhou Zhenkun)
徐羽琼 (Xu Yuqiong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010749828.2A priority Critical patent/CN114067425A/en
Publication of CN114067425A publication Critical patent/CN114067425A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • User Interface Of Digital Computer (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a gesture recognition method and device. The method includes: acquiring a first image vector and the weight of the first image vector, where the first image vector is the image vector of a key frame in an image sequence generated from a video of the gesture to be recognized, and the weight of the first image vector is the weight of that key frame within the image sequence; generating a first gesture feature vector from the first image vector and its weight; and determining the gesture recognition result of the gesture to be recognized from the first gesture feature vector and a first gesture feature vector set, which is a set of gesture feature vectors whose gesture recognition results are known. By converting the image vectors of the key frames of the image sequence of the gesture to be recognized into a fixed-length gesture feature vector, the method turns gesture recognition into a retrieval of gesture feature vectors, avoiding the introduction of noise, the loss of feature information, and increased device overhead.

Description

Gesture recognition method and device
Technical Field
The application relates to the technical field of terminal equipment, in particular to a gesture recognition method and device.
Background
With the development of human-computer interaction and computer vision technology, human-computer interaction is becoming an important part of daily life. For example, gesture recognition based on computer vision is a particularly user-friendly mode of human-computer interaction.
Gesture recognition may include dynamic gesture recognition, i.e., recognition by a computer of a dynamic gesture made by a user. A dynamic gesture is typically composed of multiple hand movements. Currently, dynamic gestures can be recognized either with a deep learning model or by template matching.
When a deep learning model is used to recognize dynamic gestures, features must first be extracted from them. Currently, a fixed feature acquisition window is generally used to extract features from dynamic gestures that different users make to express the same meaning. However, because the motion frequency, amplitude, and hand shape of such gestures usually differ greatly between users, a fixed window introduces noise data for users whose motions are relatively fast and loses part of the feature information for users whose motions are relatively slow, which degrades recognition accuracy.
When dynamic gestures are recognized by template matching, the large differences in motion frequency, amplitude, and hand shape between users expressing the same meaning, together with the random variability of gestures the same user makes for the same meaning, require a large number of templates. This reduces recognition efficiency, increases device overhead, and results in poor applicability.
Disclosure of Invention
The application provides a gesture recognition method and a gesture recognition device to address the low accuracy, low efficiency, and high device overhead of conventional dynamic gesture recognition.
In a first aspect, the present application provides a gesture recognition method, including: acquiring a first image vector and the weight of the first image vector; the first image vector is an image vector of a key frame in a first image sequence; the weight of the first image vector is the weight of the key frame in the first image sequence; the first image sequence is an image sequence generated according to a video image of a gesture to be recognized; the key frame is an image carrying hand information in the first image sequence; generating a first gesture feature vector according to the first image vector and the weight of the first image vector; determining a gesture recognition result of the gesture to be recognized according to the first gesture feature vector and the first gesture feature vector set; the first gesture feature vector set is a set formed by gesture feature vectors with known gesture recognition results.
In this implementation, a first image vector is extracted for each key frame in a first image sequence of a gesture to be recognized, together with a weight representing the importance of that key frame within the sequence; a first gesture feature vector is generated from the first image vectors and their weights; and the gesture recognition result of the gesture to be recognized is determined from the first gesture feature vector and a first gesture feature vector set containing gesture feature vectors whose recognition results are known. With this scheme, even when different users perform a gesture of the same meaning differently and the same user performs it with random variation, key frame extraction captures the important gesture features, avoiding the noise or loss of feature information caused by a fixed feature acquisition window and improving recognition accuracy. Second, fusing the key frame image vectors with their weights yields a first gesture feature vector of fixed length, so the varying key frame image vectors of repeated gestures with the same meaning map to vectors of the same length; this reduces the number of templates to match and lowers terminal device overhead. In addition, the recognition result is determined by matching the first gesture feature vector against the first gesture feature vector set, turning gesture recognition into a retrieval of the first gesture feature vector; the process is simpler, the gesture is recognized more quickly, and recognition efficiency improves.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the obtaining a first image vector and a weight of the first image vector includes: acquiring a first image vector set; the first set of image vectors is a set of image vectors of images comprised by the first sequence of images; calculating an attention weight of each image vector in the first set of image vectors according to an attention model; determining an image vector with an attention weight greater than an attention threshold and the attention weight as a first image vector and a weight of the first image vector.
In this implementation, the attention weight of each frame in the first image sequence is determined by that frame's importance to the gesture recognition result. Comparing the attention weights with an attention threshold extracts the key frames of the first image sequence, i.e., the first image vectors, and determines their weights, so the obtained first image vectors and weights are more accurate.
With reference to the first aspect, in a second possible implementation manner of the first aspect, the generating a first gesture feature vector according to the first image vector and a weight of the first image vector includes: coding the first image vector to obtain a coded first image vector; and generating a first gesture feature vector according to the encoded first image vector and the weight of the first image vector.
In this implementation, encoding converts the first image vector into a first gesture feature vector of fixed length, which reduces the number of templates to match during gesture recognition and lowers terminal device overhead.
With reference to the first aspect, in a third possible implementation manner of the first aspect, the gesture feature vectors included in the first gesture feature vector set and having known gesture recognition results correspond to at least one known gesture category, and one known gesture category corresponds to one known gesture recognition result; the determining a gesture recognition result of the gesture to be recognized according to the first gesture feature vector and the first gesture feature vector set comprises: determining a first gesture category; the first gesture category is a known gesture category which has the largest similarity with the first gesture feature vector in the at least one known gesture category; and if the similarity between the first gesture feature vector and the first gesture category is greater than a preset similarity threshold value, determining that the known gesture recognition result corresponding to the first gesture category is the gesture recognition result of the gesture to be recognized.
In this implementation, the gesture recognition result of the gesture to be recognized is determined by similarity matching between the first gesture feature vector and the gesture categories corresponding to the known gesture feature vectors in the first gesture feature vector set. Converting gesture recognition into similarity matching between the first gesture feature vector and known gesture categories makes the process simpler, allows the gesture to be recognized more quickly, and improves recognition efficiency.
With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the method further includes, if the similarity between the first gesture feature vector and the first gesture category is less than or equal to a preset similarity threshold, obtaining a first gesture recognition result input by the user; and determining that the first gesture recognition result is the gesture recognition result of the gesture to be recognized.
In this implementation, when similarity matching fails for the first gesture feature vector, the recognition result entered by the user is taken directly as the recognition result of the gesture to be recognized. The gesture does not need to be remodeled or re-detected, so the recognition process is simpler and more widely applicable.
With reference to the first aspect, in a fifth possible implementation manner of the first aspect, the method further includes: and updating the first gesture feature vector set according to the first gesture feature vector and the gesture recognition result of the gesture to be recognized.
In this implementation, updating the first gesture feature vector set, that is, updating the known gesture feature vectors, simply adds the new gesture feature vector to the set; the newly added gesture does not need to be recognized by a model, so updating the known gesture feature vectors is simpler.
With reference to the first aspect, in a sixth possible implementation manner of the first aspect, the attention model is preset.
In this implementation, the attention model is preset, so it can be obtained simply and quickly, with better applicability.
With reference to the first aspect, in a seventh possible implementation manner of the first aspect, the first gesture feature vector set is preset.
In this implementation, the first gesture feature vector set is preset, so it can be obtained simply and quickly, with better applicability.
In a second aspect, the present application provides a gesture recognition apparatus, comprising: an obtaining module, configured to obtain a first image vector and a weight of the first image vector; the first image vector is an image vector of a key frame in a first image sequence; the weight of the first image vector is the weight of the key frame in the first image sequence; the first image sequence is an image sequence generated according to a video image of a gesture to be recognized; the key frame is an image carrying hand information in the first image sequence; the processing module is used for generating a first gesture feature vector according to the first image vector and the weight of the first image vector; determining a gesture recognition result of the gesture to be recognized according to the first gesture feature vector and the first gesture feature vector set; the first gesture feature vector set is a set formed by gesture feature vectors with known gesture recognition results.
The apparatus of this implementation extracts a first image vector for each key frame in a first image sequence of a gesture to be recognized, obtains a weight representing the importance of that key frame within the sequence, generates a first gesture feature vector from the first image vectors and their weights, and finally determines the gesture recognition result from the first gesture feature vector and a first gesture feature vector set containing gesture feature vectors whose recognition results are known. Even when different users perform a gesture of the same meaning differently and the same user performs it with random variation, key frame extraction captures the important gesture features, avoiding the noise or loss of feature information caused by a fixed feature acquisition window and improving recognition accuracy. Second, fusing the key frame image vectors with their weights yields a first gesture feature vector of fixed length, which reduces the number of templates to match and lowers terminal device overhead. In addition, determining the recognition result by matching against the first gesture feature vector set turns gesture recognition into a retrieval of the first gesture feature vector, making the process simpler and faster and improving recognition efficiency.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the obtaining module is specifically configured to: acquiring a first image vector set; the first set of image vectors is a set of image vectors of images comprised by the first sequence of images; calculating an attention weight of each image vector in the first set of image vectors according to an attention model; determining an image vector with an attention weight greater than an attention threshold and the attention weight as a first image vector and a weight of the first image vector.
The apparatus of this implementation determines the attention weight of each frame in the first image sequence according to that frame's importance to the gesture recognition result, extracts the key frames of the first image sequence (the first image vectors) by comparing the attention weights with an attention threshold, and determines their weights, so the obtained first image vectors and weights are more accurate.
With reference to the second aspect, in a second possible implementation manner of the second aspect, the processing module is specifically configured to: coding the first image vector to obtain a coded first image vector; and generating a first gesture feature vector according to the encoded first image vector and the weight of the first image vector.
The apparatus of this implementation encodes the first image vector into a first gesture feature vector of fixed length, reducing the number of templates to match during gesture recognition and lowering terminal device overhead.
With reference to the second aspect, in a third possible implementation manner of the second aspect, the gesture feature vectors included in the first set of gesture feature vectors and having known gesture recognition results correspond to at least one known gesture category, and one known gesture category corresponds to one known gesture recognition result; the processing module is specifically configured to: determining a first gesture category; the first gesture category is a known gesture category which has the largest similarity with the first gesture feature vector in the at least one known gesture category; and if the similarity between the first gesture feature vector and the first gesture category is greater than a preset similarity threshold value, determining that the known gesture recognition result corresponding to the first gesture category is the gesture recognition result of the gesture to be recognized.
The apparatus of this implementation determines the gesture recognition result by similarity matching between the first gesture feature vector and the gesture categories corresponding to the known gesture feature vectors in the first gesture feature vector set. Converting gesture recognition into similarity matching between the first gesture feature vector and known gesture categories makes the process simpler, allows faster recognition, and improves efficiency.
With reference to the second aspect, in a fourth possible implementation manner of the second aspect, the processing module is further configured to: if the similarity between the first gesture feature vector and the first gesture category is smaller than or equal to a preset similarity threshold value, acquiring a first gesture recognition result input by a user; and determining that the first gesture recognition result is the gesture recognition result of the gesture to be recognized.
With the apparatus of this implementation, when similarity matching fails for the first gesture feature vector, the recognition result entered by the user is taken directly as the recognition result of the gesture to be recognized; the gesture does not need to be remodeled or re-detected, so the recognition process is simpler and more widely applicable.
With reference to the second aspect, in a fifth possible implementation manner of the second aspect, the processing module is further configured to: and updating the first gesture feature vector set according to the first gesture feature vector and the gesture recognition result of the gesture to be recognized.
With the apparatus of this implementation, updating the first gesture feature vector set, that is, updating the known gesture feature vectors, simply adds the new gesture feature vector to the set; the newly added gesture does not need to be recognized by a model, so the update process is simpler.
With reference to the second aspect, in a sixth possible implementation manner of the second aspect, the attention model is preset.
In the apparatus of this implementation, the attention model is preset, so it can be obtained simply and quickly, with better applicability.
With reference to the second aspect, in a seventh possible implementation manner of the second aspect, the first set of gesture feature vectors is preset.
In the apparatus of this implementation, the first gesture feature vector set is preset, so it can be obtained simply and quickly, with better applicability.
In a third aspect, embodiments of the present application provide an apparatus including a processor; when the processor executes a computer program or instructions in a memory, the method according to the first aspect is performed.
In a fourth aspect, embodiments of the present application provide an apparatus comprising a processor and a memory for storing computer programs or instructions; the processor is adapted to execute computer programs or instructions stored by the memory to cause the apparatus to perform the respective method as shown in the first aspect.
In a fifth aspect, an embodiment of the present application provides an apparatus, which includes a processor, a memory, and a transceiver; the transceiver is used for receiving signals or sending signals; the memory for storing computer programs or instructions; the processor for invoking the computer program or instructions from the memory to perform the method according to the first aspect.
In a sixth aspect, an embodiment of the present application provides an apparatus, which includes a processor and an interface circuit; the interface circuit is used for receiving a computer program or instructions and transmitting the computer program or instructions to the processor; the processor executes the computer program or instructions to perform the respective method as shown in the first aspect.
In a seventh aspect, an embodiment of the present application provides a computer storage medium for storing a computer program or instructions, which when executed, enable the method of the first aspect to be implemented.
In an eighth aspect, embodiments of the present application provide a computer program product comprising a computer program or instructions, which when executed, cause the method of the first aspect to be implemented.
To address the low accuracy and efficiency and the high device overhead of conventional dynamic gesture recognition, the application provides a gesture recognition method and device. The method extracts a first image vector for each key frame in a first image sequence of a gesture to be recognized, obtains a weight representing the importance of that key frame within the sequence, generates a first gesture feature vector from the first image vectors and their weights, and determines the gesture recognition result from the first gesture feature vector and a first gesture feature vector set containing gesture feature vectors whose recognition results are known. Key frame extraction captures the important gesture features even when different users perform a gesture of the same meaning differently and the same user performs it with random variation, avoiding the noise or loss of feature information caused by a fixed feature acquisition window and improving recognition accuracy. Fusing the key frame image vectors with their weights produces a first gesture feature vector of fixed length, which reduces the number of templates to match and lowers terminal device overhead. Matching the first gesture feature vector against the first gesture feature vector set turns gesture recognition into a retrieval process, making recognition simpler, faster, and more efficient.
Drawings
FIG. 1 is a schematic flowchart of an embodiment of the gesture recognition method provided in the present application;
FIG. 2 is a block diagram of an embodiment of the gesture recognition apparatus provided in the present application;
FIG. 3 is a block diagram of a chip according to an embodiment of the present application.
Detailed Description
The technical solution of the present application is described below with reference to the accompanying drawings.
In the description of this application, "/" means "or" unless otherwise stated; for example, A/B may mean A or B. "And/or" merely describes an association between objects and indicates that three relationships are possible; for example, A and/or B may mean: A alone, both A and B, or B alone. Further, "at least one" means one or more, and "a plurality" means two or more. Terms such as "first" and "second" do not limit the number or execution order of the objects they qualify, nor do they imply that the objects must differ.
It is noted that, in the present application, words such as "exemplary" or "for example" indicate an example, illustration, or description. Any embodiment or design described as "exemplary" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, such words are intended to present related concepts in a concrete fashion.
The embodiments of the application provide a gesture recognition method and device. The technical scheme provided by the application can be applied to a terminal device, which may be mobile or stationary. The terminal device may be user equipment (UE), such as a mobile phone, a tablet computer (Pad), or a personal digital assistant (PDA), and may run any of various operating systems.
Embodiments of the gesture recognition method provided in the present application are explained below.
Referring to FIG. 1, FIG. 1 is a schematic flowchart of an embodiment of the gesture recognition method provided in the present application. The method may include the following steps:
Step S101: a first image vector and the weight of the first image vector are obtained.
The first image vector is the image vector of a key frame in the first image sequence. The first image sequence may contain one key frame or several, so there may be one or more first image vectors. The weight of each first image vector is the weight, within the first image sequence, of the key frame corresponding to that vector, and indicates how important the key frame is in the sequence. The first image sequence is an image sequence generated from the video of the gesture to be recognized.
When the terminal device recognizes a gesture, it first captures the gesture made by the user through its camera, obtaining a video of the gesture to be recognized. The gesture to be recognized may be a dynamic gesture, so the video may contain multiple frames. Some of these frames carry hand information and some do not; those that carry hand information are the key frames. In general, each key frame corresponds to one hand motion, and the hand motions of all the key frames together form the gesture to be recognized.
A key frame in this embodiment can also be understood as follows: an action sequence is composed of multiple frames; a frame in which nothing changes is an ordinary frame, while a frame in which a change occurs or ends is a key frame. For example, the beginning or end of an action, or the middle of the motion, are key frames of the sequence. The notion of key frame is common in the image field and differs somewhat between scenarios; the present application does not limit it, and the scheme of the application may adopt key frame recognition or extraction approaches known in the art for different scenarios.
In one possible implementation, the first image sequence consists of all the frames of the video of the gesture to be recognized. In another possible implementation, a subset of those frames is selected to form the first image sequence.
After capturing the video of the gesture to be recognized and generating the first image sequence from it, the terminal device vectorizes each frame of the first image sequence to obtain that frame's image vector. The image vectors of all the frames form the first image vector set. The first image vector set is input into an attention model, which computes an attention weight for each image vector in the set. Each image vector whose attention weight exceeds the attention threshold is taken as a first image vector, and its attention weight becomes the weight of that first image vector; the frames corresponding to the first image vectors are the key frames of the first image sequence. The attention threshold can be set according to the requirements of the application scenario.
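For illustration only, the key frame selection of step S101 might be sketched in Python as follows; the frame vectorizer, the attention model interface, and the threshold value are assumptions made for the sketch, not details fixed by this embodiment:

```python
import numpy as np

def vectorize_frame(frame: np.ndarray) -> np.ndarray:
    # Hypothetical vectorizer: flatten and L2-normalize the frame.
    # A real system could use any image feature extractor here.
    v = frame.astype(np.float32).ravel()
    return v / (np.linalg.norm(v) + 1e-8)

def select_key_frames(frames, attention_model, attention_threshold=0.1):
    """Step S101 sketch: keep the image vectors (and weights) of frames
    whose attention weight exceeds the attention threshold."""
    image_vectors = np.stack([vectorize_frame(f) for f in frames])
    # The assumed attention model maps the sequence of image vectors
    # to one attention weight per frame.
    weights = np.asarray(attention_model(image_vectors))  # shape (num_frames,)
    keep = weights > attention_threshold
    return image_vectors[keep], weights[keep]
```

The frames that survive the threshold are the key frames; their attention weights become the weights of the first image vectors.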
The attention model can be obtained by training a deep learning model based on an attention mechanism. An attention model extracts the key pieces of information from a body of original information; for example, it can extract the keywords, key sentences, or abstract of a document. Likewise, it can extract the key frames of an image sequence, as described in the foregoing embodiment.
Extracting the image vectors of the key frames (the first image vectors) through the attention model captures the important gesture features of the gesture to be recognized, from which its recognition result can be determined accurately. This avoids the inaccuracy caused by a fixed feature acquisition window, which either loses gesture feature information or introduces noise.
As described above, in one possible implementation, the attention model must be obtained before the first image vector and its weight can be acquired. The attention model may be obtained in several ways, for example:
in one possible implementation, the attention model may be generated by training. Illustratively, the attention-based deep learning model can be trained by using a preset known image sequence set to generate an attention model. In the embodiment of the present application, the set of known image sequences used for training to generate the attention model is referred to as a first set of known image sequences.
The set of known image sequences comprises a plurality of known image sequences. A known image sequence is an image sequence generated from the video of a known dynamic gesture; it can be generated in the same manner as the first image sequence in the foregoing embodiments, which is not repeated here.
The known dynamic gesture is a dynamic gesture with a known gesture category (i.e., the gesture category is determined) and a known gesture meaning (i.e., the gesture meaning is determined), or the known dynamic gesture can be considered as a dynamic gesture with a known gesture category and a known gesture recognition result (i.e., the gesture recognition result is determined). The gesture category refers to a category to which the dynamic gesture belongs. For example, the gesture category may be a left swipe, a right swipe, a clockwise circle, a counterclockwise circle, and so on.
The gesture category corresponding to the known dynamic gesture can be defined as a known gesture category, the gesture recognition result corresponding to the known dynamic gesture can be defined as a known gesture recognition result, and the gesture meaning corresponding to the known dynamic gesture can be defined as a known gesture meaning. The known gesture category and the known gesture recognition result or the known gesture meaning have corresponding relation. In one possible implementation, the known gesture categories and the known gesture recognition results or the known gesture meanings may correspond one to one. In one possible implementation, multiple known gesture categories may correspond to the same known gesture recognition result or known gesture meaning.
When the first known image sequence set is used for training the attention-based deep learning model, each known image sequence contained in the first known image sequence set has a label, and the label records the known gesture class corresponding to the known image sequence.
Illustratively, training an attention-based deep learning model with the first known image sequence set to generate the attention model may proceed as follows. Select a known image sequence from the first known image sequence set in turn and vectorize each of its frames to obtain the frame's image vector; the image vectors of all its frames form a second image vector set. Input the second image vector set into the attention-based deep learning model and train it. Repeat over the sequences in the set until the loss function of the model converges, and take the model trained at that point as the attention model.
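As a minimal sketch of such training, assuming a simple attention-pooling classifier in PyTorch and a cross-entropy loss over the labeled known gesture categories (the embodiment only requires an attention-based deep learning model trained until its loss converges; the architecture below is an illustrative assumption):

```python
import torch
import torch.nn as nn

class AttentionScorer(nn.Module):
    """Assumed architecture: score each frame vector with an attention
    weight, then classify the weighted sum into a known gesture category."""
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)            # one attention logit per frame
        self.classify = nn.Linear(dim, num_classes)

    def forward(self, frame_vectors):             # (num_frames, dim)
        weights = torch.softmax(self.score(frame_vectors).squeeze(-1), dim=0)
        pooled = (weights.unsqueeze(-1) * frame_vectors).sum(dim=0)
        return self.classify(pooled), weights

def train_attention_model(vector_sets, labels, dim, num_classes,
                          epochs=20, lr=1e-3):
    model = AttentionScorer(dim, num_classes)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):                       # in practice: until convergence
        for vecs, label in zip(vector_sets, labels):
            logits, _ = model(vecs)               # vecs: (num_frames, dim) tensor
            loss = loss_fn(logits.unsqueeze(0), torch.tensor([label]))
            opt.zero_grad(); loss.backward(); opt.step()
    return model
```

After convergence, the per-frame weights produced by the model serve as the attention weights used for key frame extraction.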
In a possible implementation manner, the trained attention model may be preset in the terminal device, and when the attention model is obtained, the preset attention model may be obtained. Therefore, the attention model can be acquired quickly, the acquisition process of the attention model is simpler, and the applicability is better.
Step S102: a first gesture feature vector is generated according to the first image vector and the weight of the first image vector.
After each first image vector and its weight have been acquired, all the first image vectors are input into a first encoder, which encodes them so that they are mapped into the same coordinate system. The encoded first image vectors are then fused with their corresponding weights by weighted fusion, producing an image vector of fixed dimension, which is taken as the first gesture feature vector. In this way, gestures the user makes repeatedly with the same meaning yield first gesture feature vectors of the same fixed length, so the subsequent recognition of the gesture reduces to retrieving and matching a fixed-length vector. The gesture can therefore be recognized simply and quickly, recognition efficiency is higher, the large number of templates otherwise required by the randomness and variability of users' dynamic gestures is avoided, and applicability is better.
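A sketch of step S102, under the assumption that the first encoder is a linear map into a common feature space and that weighted fusion is a weight-normalized sum (the embodiment does not fix either choice):

```python
import numpy as np

def gesture_feature_vector(first_image_vectors: np.ndarray,
                           weights: np.ndarray,
                           encoder_matrix: np.ndarray) -> np.ndarray:
    """Encode the key frame image vectors into one coordinate system,
    then fuse them with their weights into a single vector whose
    dimension is fixed regardless of the number of key frames."""
    encoded = first_image_vectors @ encoder_matrix.T   # (k, d_feat)
    w = weights / weights.sum()                        # normalize key frame weights
    return (w[:, None] * encoded).sum(axis=0)          # shape (d_feat,)
```

Because the fused output has shape (d_feat,) whatever the number k of key frames, repeated performances of the same gesture always yield feature vectors of the same length.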
As described above, in one possible implementation, a first encoder must be obtained before the first gesture feature vector can be generated from the first image vector and its weight. The first encoder may be obtained in several ways, for example:
in one possible implementation, the first encoder is obtained by: acquiring a plurality of third image vector sets; and training a preset encoder by using the plurality of third image vector sets to generate a first encoder. For example, the preset encoder may be an Auto Encoder (AE).
Optionally, when the attention model is generated by training with the first known image sequence set, each second image vector set is input to the attention-based deep learning model; after the attention weight of each image vector in the second image vector set is obtained, the image vectors whose attention weights exceed the attention threshold form a third image vector set. In this way a plurality of third image vector sets is obtained.
Optionally, a known image sequence may be selected in turn from a second known image sequence set; each of its frames is vectorized to obtain the frame's image vector, and the image vectors of all its frames form a fourth image vector set. The fourth image vector set is input into the attention model, which computes the attention weight of each image vector; the image vectors whose attention weights exceed the attention threshold form a third image vector set. Repeating this yields a plurality of third image vector sets. The second known image sequence set may be the same as or different from the first known image sequence set.
Optionally, each frame of an unknown image sequence may be vectorized to obtain its image vector; the image vectors of all its frames form a fifth image vector set. The fifth image vector set is input into the attention model, which computes the attention weight of each image vector, and the image vectors whose attention weights exceed the attention threshold form a third image vector set. In this way, a plurality of third image vector sets can be derived from a plurality of unknown image sequences.
An unknown image sequence is an image sequence generated from the video of an unknown dynamic gesture; it can be determined in the same manner as the first image sequence in the foregoing embodiments. An unknown dynamic gesture is one whose gesture category is unknown (i.e., not determined) and whose gesture meaning or recognition result is unknown. Unknown dynamic gestures can be collected by filming.
In a possible implementation manner, the trained first encoder may be preset in the terminal device, and when the first encoder is obtained, the preset first encoder may be obtained. Therefore, the first encoder can be acquired quickly, the acquisition process of the first encoder is simpler, and the applicability is better.
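For illustration, training the preset encoder as an autoencoder over the third image vector sets might look like the following sketch; the layer shapes and training schedule are assumptions:

```python
import torch
import torch.nn as nn

def train_first_encoder(third_image_vector_sets, dim, latent_dim=64,
                        epochs=20, lr=1e-3):
    """Train an autoencoder on key frame image vectors by reconstruction;
    its encoder half becomes the first encoder used in step S102."""
    encoder = nn.Linear(dim, latent_dim)
    decoder = nn.Linear(latent_dim, dim)
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for vecs in third_image_vector_sets:       # vecs: (k, dim) tensor
            recon = decoder(encoder(vecs))
            loss = loss_fn(recon, vecs)            # reconstruction loss
            opt.zero_grad(); loss.backward(); opt.step()
    return encoder
```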
Step S103: a gesture recognition result of the gesture to be recognized is determined according to the first gesture feature vector and the first gesture feature vector set.
The first gesture feature vector set is a set formed by gesture feature vectors with known gesture recognition results or known gesture meanings. In the embodiment of the application, the gesture feature vector with a known gesture recognition result or a known gesture meaning is defined as a known gesture feature vector. The known gesture feature vectors included in the first set of gesture feature vectors may correspond to at least one known gesture category, wherein a plurality of known gesture feature vectors may correspond to the same known gesture category. That is, the known gesture feature vectors included in the first gesture feature vector set belong to one or more known gesture categories, where each known gesture category may correspond to one known gesture feature vector or a plurality of known gesture feature vectors. According to the known gesture type corresponding to the known gesture feature vector, a known gesture recognition result or a known gesture meaning corresponding to the known gesture feature vector can be determined.
Before step S103 is executed, the first gesture feature vector set must be acquired. It can be obtained as follows. Select a known image sequence in turn from a third known image sequence set and vectorize each of its frames; the image vectors of all its frames form a sixth image vector set. Input the sixth image vector set into the attention model and compute the attention weight of each image vector; the image vectors whose attention weights exceed the attention threshold form a seventh image vector set. Input the seventh image vector set into the first encoder, encode all of its image vectors, and perform weighted fusion of the encoded image vectors with their attention weights to generate a second gesture feature vector of fixed dimension. The second gesture feature vectors obtained from all the known image sequences form the first gesture feature vector set.
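Combining the sketches above, building this set of known gesture feature vectors could look as follows; the helper functions are the illustrative ones sketched earlier and remain assumptions:

```python
def build_known_gallery(known_sequences, known_categories,
                        attention_model, encoder_matrix,
                        attention_threshold=0.1):
    """Map each known gesture category to its second gesture feature
    vectors; together these form the first gesture feature vector set."""
    gallery = {}   # known gesture category -> list of fixed-length vectors
    for frames, category in zip(known_sequences, known_categories):
        vecs, weights = select_key_frames(frames, attention_model,
                                          attention_threshold)
        feature = gesture_feature_vector(vecs, weights, encoder_matrix)
        gallery.setdefault(category, []).append(feature)
    return gallery
```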
Because each second gesture feature vector is generated from a known image sequence, it is a known gesture feature vector with a known gesture recognition result; its known recognition result or meaning is that of the known dynamic gesture corresponding to its known image sequence.
The third set of known image sequences may be the same as or different from the first set of known image sequences or the second set of known image sequences.
In a possible implementation manner, the first gesture feature vector set obtained according to the above implementation manner may be preset in the terminal device, and when the first gesture feature vector set needs to be obtained, the first gesture feature vector set is directly obtained from the terminal device.
In one possible implementation, after the first gesture feature vector set and the first gesture feature vector are obtained, the first gesture feature vector is matched against the at least one known gesture category corresponding to the known gesture feature vectors (the second gesture feature vectors) in the set: the similarity between the first gesture feature vector and each known gesture category is computed, and the known gesture category with the largest similarity is taken as the first gesture category.
If the similarity between the first gesture feature vector and the first gesture category is greater than a preset similarity threshold, the known gesture recognition result or meaning corresponding to the first gesture category is taken as the recognition result or meaning of the gesture to be recognized. The preset similarity threshold can be set according to the requirements of the application scenario.
Otherwise, if the similarity between the first gesture feature vector and the first gesture category is less than or equal to the preset similarity threshold, input information is obtained from the user, comprising a second gesture category and a first gesture recognition result or first gesture meaning. The first gesture recognition result or meaning is taken as the recognition result or meaning of the gesture to be recognized, and the second gesture category as the gesture category of the first gesture feature vector. Optionally, in this case the first gesture feature vector set is also updated: the first gesture feature vector is added to the set as a known gesture feature vector, and its correspondence with the second gesture category and the first gesture recognition result or meaning is recorded.
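A sketch of this optional update, reusing the assumed gallery structure above; note that no model retraining is involved, the user-labeled vector is simply added to the set:

```python
def update_gallery(gallery, first_gesture_vector, second_gesture_category):
    """Add the user-labeled first gesture feature vector to the first
    gesture feature vector set as a new known gesture feature vector."""
    gallery.setdefault(second_gesture_category, []).append(first_gesture_vector)
```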
In another possible implementation, after the similarities between the first gesture feature vector and each known gesture category have been computed, it may instead be checked whether any similarity exceeds the preset similarity threshold. If so, the known gesture category with the largest similarity among those exceeding the threshold is taken as a third gesture category, and the known gesture recognition result or meaning of the third gesture category is taken as the recognition result or meaning of the gesture to be recognized.
If no similarity exceeds the preset similarity threshold, input information is obtained from the user, comprising a second gesture category and a first gesture recognition result or first gesture meaning; the first gesture recognition result or meaning is taken as the recognition result or meaning of the gesture to be recognized, and the second gesture category as the gesture category of the first gesture feature vector. Optionally, the first gesture feature vector set is then updated as above: the first gesture feature vector is added as a known gesture feature vector, with its correspondence to the second gesture category and the first gesture recognition result or meaning.
In the above embodiments, matching the first gesture feature vector against the known gesture categories and computing the similarity to each category can be implemented as follows: for each known gesture category, compute the distance (for example, the cosine distance) between the first gesture feature vector and every known gesture feature vector of that category, then average all the distances obtained for that category; the mean is taken as the similarity between the first gesture feature vector and the category.
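A sketch of this retrieval step, assuming cosine similarity is used as the similarity measure (so larger means more similar) and the gallery structure from the earlier sketches:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def recognize(first_gesture_vector, gallery, similarity_threshold=0.8):
    """Step S103 sketch: per known gesture category, average the similarity
    of the query to every known gesture feature vector of that category,
    then pick the best category if it clears the threshold."""
    similarities = {
        category: float(np.mean([cosine_similarity(first_gesture_vector, v)
                                 for v in vectors]))
        for category, vectors in gallery.items()
    }
    best = max(similarities, key=similarities.get)   # the first gesture category
    if similarities[best] > similarity_threshold:
        return best        # known recognition result of this category applies
    return None            # fall back to user input and update the gallery
```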
In summary, the gesture recognition method provided by the embodiments of the application extracts a first image vector for each key frame in a first image sequence of a gesture to be recognized, obtains a weight representing the importance of that key frame within the sequence, generates a first gesture feature vector from the first image vectors and their weights, and determines the gesture recognition result from the first gesture feature vector and a first gesture feature vector set containing gesture feature vectors whose recognition results are known. Key frame extraction captures the important gesture features even when different users perform a gesture of the same meaning differently and the same user performs it with random variation, avoiding the noise or loss of feature information caused by a fixed feature acquisition window and improving recognition accuracy. Fusing the key frame image vectors with their weights produces a first gesture feature vector of fixed length, reducing the number of templates to match and lowering terminal device overhead. Matching the first gesture feature vector against the first gesture feature vector set turns gesture recognition into a retrieval process, making recognition simpler and faster and improving efficiency.
The various method embodiments described herein may be implemented as stand-alone solutions or combined in accordance with inherent logic and are intended to fall within the scope of the present application.
It is to be understood that, in the above-described method embodiments, the method and operations implemented by the terminal device may also be implemented by a component (e.g., a chip or a circuit) that can be used for the terminal device.
The scheme provided by the embodiment of the present application has been introduced above mainly from the perspective of interaction between network elements. It is understood that each network element, for example the acquiring module and the processing module of the terminal device, includes a corresponding hardware structure, software module, or combination of both in order to implement each function. Those skilled in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present application.
In the embodiment of the present application, the terminal device may be divided into functional modules according to the above method example; for example, each functional module may correspond to one function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. It should be noted that the division of modules in the embodiment of the present application is schematic and represents only one kind of logical function division; other divisions are possible in actual implementation. The following description takes the division of one functional module per function as an example.
The method provided by the embodiment of the present application is described in detail above with reference to fig. 1. Hereinafter, the apparatus provided by the embodiment of the present application is described in detail with reference to fig. 2 and 3. It should be understood that the description of the apparatus embodiments corresponds to the description of the method embodiments; for brevity, details not described here may be found in the above method embodiments.
Referring to fig. 2, fig. 2 is a block diagram illustrating a structure of an embodiment of a gesture recognition apparatus provided in the present application. As shown in fig. 2, the apparatus 200 may include: an acquisition module 201 and a processing module 202. The apparatus 200 may be used to perform the actions performed by the terminal device in the above method embodiments.
For example: an obtaining module 201, configured to obtain a first image vector and a weight of the first image vector; the first image vector is an image vector of a key frame in a first image sequence; the weight of the first image vector is the weight of the key frame in the first image sequence; the first image sequence is an image sequence generated according to the video image of the gesture to be recognized.
A processing module 202, which may be configured to generate a first gesture feature vector according to the first image vector and a weight of the first image vector; determining a gesture recognition result of the gesture to be recognized according to the first gesture feature vector and the first gesture feature vector set; the first gesture feature vector set is a set formed by gesture feature vectors with known gesture recognition results.
Optionally, the obtaining module 201 is specifically configured to: acquiring a first image vector set; the first set of image vectors is a set of image vectors of images comprised by the first sequence of images; calculating an attention weight of each image vector in the first set of image vectors according to an attention model; determining an image vector with an attention weight greater than an attention threshold and the attention weight as a first image vector and a weight of the first image vector.
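For illustration only, the key-frame selection performed by the obtaining module may be sketched as follows. This is a minimal Python/NumPy sketch; the attention model that produces the weights is assumed to have been applied already, and the function name select_key_frames, the array shapes, and the strictly-greater comparison are assumptions of the sketch.

import numpy as np

def select_key_frames(image_vectors: np.ndarray,
                      attention_weights: np.ndarray,
                      attention_threshold: float):
    # image_vectors:     (T, D) array, one row per frame of the first image sequence.
    # attention_weights: (T,) array produced by the attention model.
    # Frames whose attention weight exceeds the threshold become the key
    # frames; their vectors and weights are the first image vectors and
    # the weights of the first image vectors.
    mask = attention_weights > attention_threshold
    return image_vectors[mask], attention_weights[mask]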
Optionally, the processing module 202 is specifically configured to: coding the first image vector to obtain a coded first image vector; and generating a first gesture feature vector according to the encoded first image vector and the weight of the first image vector.
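The embodiment does not fix a particular fusion operator for combining the encoded first image vectors with their weights; one plausible reading is a weighted sum, sketched below. The encoder that produces encoded_vectors (e.g. a neural network applied per key frame) is assumed to have been applied already, and all names here are hypothetical.

import numpy as np

def fuse_gesture_feature(encoded_vectors: np.ndarray,
                         weights: np.ndarray) -> np.ndarray:
    # encoded_vectors: (K, D) array of encoded key-frame image vectors.
    # weights:         (K,) array of the corresponding key-frame weights.
    # A weight-normalized sum maps any number K of key frames to a single
    # D-dimensional (fixed-length) first gesture feature vector.
    w = weights / weights.sum()
    return (encoded_vectors * w[:, None]).sum(axis=0)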
Optionally, the gesture feature vectors included in the first gesture feature vector set and having known gesture recognition results correspond to at least one known gesture category, and one known gesture category corresponds to one known gesture recognition result; the processing module 202 is specifically configured to: determining a first gesture category; the first gesture category is a known gesture category which has the largest similarity with the first gesture feature vector in the at least one known gesture category; and if the similarity between the first gesture feature vector and the first gesture category is greater than a preset similarity threshold value, determining that the known gesture recognition result corresponding to the first gesture category is the gesture recognition result of the gesture to be recognized.
Optionally, the processing module 202 is further configured to obtain a first gesture recognition result input by the user if the similarity between the first gesture feature vector and the first gesture category is less than or equal to a preset similarity threshold; and determining that the first gesture recognition result is the gesture recognition result of the gesture to be recognized.
Optionally, the processing module 202 is further configured to: and updating the first gesture feature vector set according to the first gesture feature vector and the gesture recognition result of the gesture to be recognized.
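Combining the three optional behaviors above, a sketch of the processing module's decision logic might look as follows. It reuses category_similarities from the earlier sketch; ask_user is a hypothetical callback standing in for acquiring the first gesture recognition result input by the user, and updating the template dictionary in place is one possible realization of updating the first gesture feature vector set.

def recognize_gesture(query, templates, similarity_threshold, ask_user):
    # Rank the known gesture categories by their mean similarity to the
    # first gesture feature vector (see category_similarities above).
    sims = category_similarities(query, templates)
    first_category = max(sims, key=sims.get)
    if sims[first_category] > similarity_threshold:
        # The first gesture category clears the preset similarity threshold:
        # its known gesture recognition result is the recognition result.
        return first_category
    # Otherwise fall back to the recognition result input by the user and
    # update the first gesture feature vector set with the new sample.
    label = ask_user()
    templates.setdefault(label, []).append(query)
    return label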
Optionally, the attention model is preset.
Optionally, the first gesture feature vector set is preset.
That is, the apparatus 200 may implement the steps or flows executed by the terminal device in the method shown in fig. 1 according to the embodiment of the present application, and may include modules for executing that method. The modules and the other operations and/or functions of the apparatus 200 respectively implement the corresponding steps of the method shown in fig. 1. For example, in one possible design, the obtaining module 201 in the apparatus 200 may be configured to perform step S101 of the method shown in fig. 1, and the processing module 202 may be configured to perform steps S102 and S103 of that method.
It should be understood that the specific processes of the modules for executing the corresponding steps are already described in detail in the above method embodiments, and therefore, for brevity, detailed descriptions thereof are omitted.
In addition, the apparatus 200 may be a terminal device, and the terminal device may perform the functions of the terminal device in the foregoing method embodiments, or implement the steps or processes performed by the terminal device in the foregoing method embodiments.
The terminal device may include a processor and a transceiver. Optionally, the terminal device may further include a memory. The processor, the transceiver, and the memory may communicate with one another through an internal connection path to transfer control and/or data signals; the memory is used for storing computer programs or instructions, and the processor is used for calling and running the computer programs or instructions from the memory to control the transceiver to receive signals and/or transmit signals. Optionally, the terminal device may further include an antenna, configured to send out the uplink data or uplink control signaling output by the transceiver as a wireless signal.
The processor may be combined with memory to form a processing device, the processor being configured to execute computer programs or instructions stored in the memory to implement the functions described above. In particular implementations, the memory may be integrated within the processor or may be separate from the processor. The processor may correspond to the processing module in fig. 2.
The above-mentioned transceiver may also be referred to as a transceiving unit. A transceiver may include a receiver (or receiving circuit) and/or a transmitter (or transmitting circuit), where the receiver is used for receiving signals and the transmitter is used for sending signals.
It should be understood that the terminal device described above is capable of implementing the various processes involving the terminal device in the method embodiments shown above. The operation and/or function of each module in the terminal device are respectively for implementing the corresponding flow in the above method embodiment. Specifically, reference may be made to the description of the above method embodiments, and the detailed description is appropriately omitted herein to avoid redundancy.
Optionally, the terminal device may further include a power supply for supplying power to various devices or circuits in the terminal device.
In addition, in order to improve the functions of the terminal device, the terminal device may further include one or more of an input unit, a display unit, an audio circuit, a camera, a sensor, and the like, and the audio circuit may further include a speaker, a microphone, and the like.
The embodiment of the application also provides a processing device which comprises a processor and an interface. The processor may be adapted to perform the method of the above-described method embodiments.
It should be understood that the processing means may be a chip. For example, referring to fig. 3, fig. 3 is a block diagram of a chip according to an embodiment of the present application. The chip shown in fig. 3 may be a general-purpose processor or a dedicated processor. The chip 300 includes a processor 301. The processor 301 may be configured to support the apparatus shown in fig. 2 in executing the technical solution shown in fig. 1.
Optionally, the chip 300 may further include a transceiver 302; the transceiver 302 operates under the control of the processor 301 and is configured to support the apparatus shown in fig. 2 in executing the technical solution shown in fig. 1. Optionally, the chip 300 shown in fig. 3 may further include a storage medium 303.
It should be noted that the chip shown in fig. 3 can be implemented using the following circuits or devices: one or more field programmable gate arrays (FPGAs), programmable logic devices (PLDs), application specific integrated circuits (ASICs), systems on chip (SoCs), central processing units (CPUs), network processors (NPs), digital signal processors (DSPs), micro controller units (MCUs), controllers, state machines, gate logic, discrete hardware components, any other suitable circuitry, or any combination of circuits capable of performing the various functions described throughout this application.
In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The steps of a method disclosed in connection with the embodiments of the present application may be directly executed by a hardware processor, or executed by a combination of hardware and software modules in the processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, details are not described here.
It should be noted that the processor in the embodiments of the present application may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method embodiments may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or execute the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
It will be appreciated that the memory in the embodiments of the present application may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
According to the method provided by the embodiment of the present application, an embodiment of the present application further provides a computer program product, which includes a computer program or instructions that, when run on a computer, cause the computer to perform the method of any one of the embodiments shown in fig. 1.
According to the method provided by the embodiment of the present application, a computer storage medium is further provided. The computer storage medium stores a computer program or instructions which, when run on a computer, cause the computer to execute the method of any one of the embodiments shown in fig. 1.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented by software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer program or instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer program or instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)).
As used in this specification, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
Those of ordinary skill in the art will appreciate that the various illustrative logical blocks and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or as a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints of the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present application.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional modules in the embodiments of the present application may be integrated into one processing unit, or each module may exist alone physically, or two or more modules are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The gesture recognition apparatus, the computer storage medium, the computer program product, and the chip provided in the embodiments of the present application are all configured to execute the method provided above, and therefore, the beneficial effects achieved by the gesture recognition apparatus, the computer storage medium, the computer program product, and the chip can refer to the beneficial effects corresponding to the method provided above, and are not described herein again.
It should be understood that, in the embodiments of the present application, the sequence numbers of the steps do not imply an execution order; the execution order of the steps should be determined by their functions and inherent logic, and does not limit the implementation process of the embodiments.
The parts of this specification are described in a progressive manner; identical or similar parts of the embodiments may be cross-referenced, and each embodiment focuses on its differences from the other embodiments. In particular, since the embodiments of the gesture recognition apparatus, the computer storage medium, the computer program product, and the chip are substantially similar to the method embodiments, their description is brief; for relevant points, refer to the description in the method embodiments.
While preferred embodiments of the present application have been described, those skilled in the art may make additional variations and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all variations and modifications falling within the scope of the application.
The above-described embodiments of the present application do not limit the scope of the present application.

Claims (17)

1. A gesture recognition method, comprising:
acquiring a first image vector and the weight of the first image vector; the first image vector is an image vector of a key frame in a first image sequence; the weight of the first image vector is the weight of the key frame in the first image sequence; the first image sequence is an image sequence generated according to a video image of a gesture to be recognized;
generating a first gesture feature vector according to the first image vector and the weight of the first image vector;
determining a gesture recognition result of the gesture to be recognized according to the first gesture feature vector and the first gesture feature vector set; the first gesture feature vector set is a set formed by gesture feature vectors with known gesture recognition results.
2. The gesture recognition method according to claim 1, wherein the obtaining of the first image vector and the weight of the first image vector comprises:
acquiring a first image vector set; the first set of image vectors is a set of image vectors of images comprised by the first sequence of images;
calculating an attention weight of each image vector in the first set of image vectors according to an attention model;
determining an image vector whose attention weight is greater than an attention threshold as the first image vector, and determining the attention weight of that image vector as the weight of the first image vector.
3. The gesture recognition method according to claim 1 or 2, wherein the generating a first gesture feature vector according to the first image vector and a weight of the first image vector comprises:
coding the first image vector to obtain a coded first image vector;
and generating a first gesture feature vector according to the encoded first image vector and the weight of the first image vector.
4. The gesture recognition method according to any one of claims 1 to 3, wherein the first set of gesture feature vectors includes gesture feature vectors with known gesture recognition results corresponding to at least one known gesture category, and one known gesture category corresponds to one known gesture recognition result;
the determining a gesture recognition result of the gesture to be recognized according to the first gesture feature vector and the first gesture feature vector set comprises:
determining a first gesture category; the first gesture category is a known gesture category which has the largest similarity with the first gesture feature vector in the at least one known gesture category;
and if the similarity between the first gesture feature vector and the first gesture category is greater than a preset similarity threshold value, determining that the known gesture recognition result corresponding to the first gesture category is the gesture recognition result of the gesture to be recognized.
5. The gesture recognition method according to claim 4, further comprising:
if the similarity between the first gesture feature vector and the first gesture category is smaller than or equal to a preset similarity threshold value, acquiring a first gesture recognition result input by a user;
and determining that the first gesture recognition result is the gesture recognition result of the gesture to be recognized.
6. The gesture recognition method according to claim 5, further comprising:
and updating the first gesture feature vector set according to the first gesture feature vector and the gesture recognition result of the gesture to be recognized.
7. A gesture recognition apparatus, comprising:
an obtaining module, configured to obtain a first image vector and a weight of the first image vector; the first image vector is an image vector of a key frame in a first image sequence; the weight of the first image vector is the weight of the key frame in the first image sequence; the first image sequence is an image sequence generated according to a video image of a gesture to be recognized;
the processing module is used for generating a first gesture feature vector according to the first image vector and the weight of the first image vector; determining a gesture recognition result of the gesture to be recognized according to the first gesture feature vector and the first gesture feature vector set; the first gesture feature vector set is a set formed by gesture feature vectors with known gesture recognition results.
8. The gesture recognition device of claim 7, wherein the acquisition module is specifically configured to:
acquiring a first image vector set; the first set of image vectors is a set of image vectors of images comprised by the first sequence of images;
calculating an attention weight of each image vector in the first set of image vectors according to an attention model;
determining an image vector whose attention weight is greater than an attention threshold as the first image vector, and determining the attention weight of that image vector as the weight of the first image vector.
9. The gesture recognition apparatus according to claim 7 or 8, wherein the processing module is specifically configured to:
coding the first image vector to obtain a coded first image vector;
and generating a first gesture feature vector according to the encoded first image vector and the weight of the first image vector.
10. The gesture recognition device according to any one of claims 7 to 9, wherein the first set of gesture feature vectors includes gesture feature vectors with known gesture recognition results corresponding to at least one known gesture category, and one known gesture category corresponds to one known gesture recognition result;
the processing module is specifically configured to:
determining a first gesture category; the first gesture category is a known gesture category which has the largest similarity with the first gesture feature vector in the at least one known gesture category;
and if the similarity between the first gesture feature vector and the first gesture category is greater than a preset similarity threshold value, determining that the known gesture recognition result corresponding to the first gesture category is the gesture recognition result of the gesture to be recognized.
11. The gesture recognition device of claim 10, wherein the processing module is further configured to:
if the similarity between the first gesture feature vector and the first gesture category is smaller than or equal to a preset similarity threshold value, acquiring a first gesture recognition result input by a user;
and determining that the first gesture recognition result is the gesture recognition result of the gesture to be recognized.
12. The gesture recognition device of claim 11, wherein the processing module is further configured to:
and updating the first gesture feature vector set according to the first gesture feature vector and the gesture recognition result of the gesture to be recognized.
13. An apparatus comprising a processor and a memory;
the processor for executing a computer program or instructions stored in the memory, the computer program or instructions, when executed, performing the method of any of claims 1 to 6.
14. An apparatus comprising a processor, a transceiver, and a memory;
the transceiver is used for receiving signals or sending signals; the processor for executing a computer program or instructions stored in the memory, which when executed, causes the apparatus to carry out the method of any one of claims 1 to 6.
15. A computer storage medium comprising computer programs or instructions which, when executed, perform the method of any one of claims 1 to 6.
16. A computer program product, characterized in that it causes a computer to carry out the method according to any one of claims 1 to 6, when said computer program product is run on a computer.
17. A chip comprising a processor coupled to a memory for executing a computer program or instructions stored in the memory, the computer program or instructions when executed causing the method of any of claims 1 to 6 to be performed.
CN202010749828.2A 2020-07-30 2020-07-30 Gesture recognition method and device Pending CN114067425A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010749828.2A CN114067425A (en) 2020-07-30 2020-07-30 Gesture recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010749828.2A CN114067425A (en) 2020-07-30 2020-07-30 Gesture recognition method and device

Publications (1)

Publication Number Publication Date
CN114067425A true CN114067425A (en) 2022-02-18

Family

ID=80227249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010749828.2A Pending CN114067425A (en) 2020-07-30 2020-07-30 Gesture recognition method and device

Country Status (1)

Country Link
CN (1) CN114067425A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination