CN115690894A - Gesture recognition model training method, gesture recognition device, gesture recognition equipment and gesture recognition medium


Info

Publication number
CN115690894A
CN115690894A
Authority
CN
China
Prior art keywords
gesture
sample image
gesture recognition
category
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110835534.6A
Other languages
Chinese (zh)
Inventor
高雪松
苗启广
梁思宇
张玉
李宇楠
史媛媛
房慧娟
扶小龙
陈绘州
苗凯彬
刘如意
刘向增
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Group Holding Co Ltd
Original Assignee
Hisense Group Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Group Holding Co Ltd
Priority to CN202110835534.6A
Publication of CN115690894A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a gesture recognition model training method, a gesture recognition method, an apparatus, a device and a medium, which are used for improving the accuracy of gesture recognition. Because the electronic device can determine the value of the loss function based on an anchor sample image, a positive sample image and a negative sample image, and can adjust the parameters of the gesture recognition model to be trained based on that value, the gesture recognition model is helped to distinguish the anchor sample image (and the positive sample image) from the negative sample image. The model is thereby helped to distinguish different gestures, including similar gestures, so that the accuracy of gesture recognition is improved when the trained gesture recognition model is used for gesture recognition.

Description

Gesture recognition model training method, gesture recognition device, gesture recognition equipment and gesture recognition medium
Technical Field
The present application relates to the field of gesture recognition technologies, and in particular, to a gesture recognition model training method, a gesture recognition method, an apparatus, a device, and a medium.
Background
In smart home scenarios, controlling household appliances by recognizing the user's gestures helps the user interact with the appliances more conveniently and quickly. For example, when a video is playing, the user's gestures can be recognized to fast-forward or rewind the video, increase or decrease the volume, pause playback, and the like. Compared with controlling household appliances through a remote controller, this avoids situations in which an appliance cannot be controlled because, for example, the user cannot find the remote controller; it achieves control of household appliances with as few devices as possible and improves the user experience.
Existing methods for recognizing a user's gestures are usually based on a pre-trained gesture recognition model. For example, Darrell et al. of the MIT (Massachusetts Institute of Technology) Media Lab trained a gesture recognition model on the standard gestures in a given dataset and achieved a 97% recognition rate when recognizing the "Hello" gesture among the standard gestures in that dataset. However, when a gesture recognition model that performs well on a designated dataset is migrated to actual use, overfitting tends to occur, and the accuracy of gesture recognition is low.
Disclosure of Invention
The application provides a gesture recognition model training method, a gesture recognition method, an apparatus, a device and a medium, which are used for improving the accuracy of gesture recognition.
In a first aspect, the present application provides a method for training a gesture recognition model, where the method includes:
acquiring a positive sample image and a negative sample image of any anchor sample image in a sample set, wherein the gesture category contained in the positive sample image is the same as the gesture category contained in the anchor sample image, and the gesture category contained in the negative sample image is different from the gesture category contained in the anchor sample image; any sample image in the sample set corresponds to an annotated sample category label, and the sample category label is used for identifying the category of the gesture contained in the sample image;
respectively inputting the anchor sample image, the positive sample image and the negative sample image into a gesture recognition model to be trained, and respectively determining recognition class labels corresponding to the anchor sample image, the positive sample image and the negative sample image;
determining a value of a loss function based on the sample class label and an identification class label, wherein the loss function comprises a triplet loss function; and adjusting the parameters of the gesture recognition model to be trained according to the value of the loss function.
In a second aspect, the present application further provides a gesture recognition method based on any one of the above gesture recognition model training methods, where the method includes:
selecting a set number of video frames including a current video frame, sequentially inputting the set number of video frames into a gesture recognition model which is trained in advance, and respectively determining candidate categories of gestures contained in the set number of video frames through the gesture recognition model;
and judging whether the number of video frames of any candidate category among the set number of video frames is not less than a set number threshold, and if so, determining that candidate category as the target category of the gesture contained in the current video frame.
In a third aspect, the present application further provides a gesture recognition model training apparatus, including:
the acquisition module is used for acquiring a positive sample image and a negative sample image of any anchor sample image in a sample set, wherein the gesture category contained in the positive sample image is the same as the gesture category contained in the anchor sample image, and the gesture category contained in the negative sample image is different from the gesture category contained in the anchor sample image; any sample image in the sample set corresponds to an annotated sample category label, and the sample category label is used for identifying the sample category of the gesture contained in the sample image;
the input module is used for respectively inputting the anchor sample image, the positive sample image and the negative sample image into a gesture recognition model to be trained, and respectively determining the recognition category labels corresponding to the anchor sample image, the positive sample image and the negative sample image;
an adjustment module to determine a value of a loss function based on the sample class label and the identification class label, wherein the loss function comprises a triplet loss function; and adjusting the parameters of the gesture recognition model to be trained according to the value of the loss function.
In a fourth aspect, the present application further provides a gesture recognition apparatus, including:
the selection module is used for selecting a set number of video frames including the current video frame, sequentially inputting the set number of video frames into a gesture recognition model which is trained in advance, and respectively determining candidate categories of gestures contained in the set number of video frames through the gesture recognition model;
and the determining module is used for judging whether the number of video frames of any candidate category among the set number of video frames is not less than a set number threshold, and if so, determining that candidate category as the target category of the gesture contained in the current video frame.
In a fifth aspect, the present application further provides an electronic device, which at least includes a processor and a memory, where the processor is configured to implement the steps of the gesture recognition model training method according to any one of the above when executing a computer program stored in the memory; or, implementing the steps of the gesture recognition method as described in any of the above.
In a sixth aspect, the present application further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the gesture recognition model training method according to any one of the above methods; or, implementing the steps of the gesture recognition method as described in any of the above.
When the gesture recognition model is trained, a positive sample image and a negative sample image of any anchor sample image in a sample set can be obtained, wherein the gesture category contained in the positive sample image is the same as the gesture category contained in the anchor sample image, and the gesture category contained in the negative sample image is different from the gesture category contained in the anchor sample image. Any sample image in the sample set corresponds to an annotated sample category label, which can be used to identify the category of the gesture contained in the sample image. The electronic device can input the anchor sample image, the positive sample image and the negative sample image into the gesture recognition model to be trained, and determine the recognition category labels corresponding to the three images according to the output of the model; it then determines the value of a loss function based on the sample category labels and the recognition category labels, where the loss function comprises a triplet loss function, and adjusts the parameters of the gesture recognition model to be trained according to the value of the loss function. Because the electronic device determines the value of the loss function based on the anchor sample image, the positive sample image and the negative sample image, adjusting the parameters of the model based on that value helps the gesture recognition model distinguish the anchor sample image (and the positive sample image) from the negative sample image, and therefore helps it better distinguish different gestures, for example similar gestures.
Drawings
In order to more clearly illustrate the embodiments of the present application or the implementations in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can derive other drawings from them.
FIG. 1 illustrates a schematic diagram of a gesture recognition model training process provided by some embodiments;
FIG. 2a illustrates a schematic diagram of gestures included in a first image provided by some embodiments;
FIG. 2b illustrates a schematic diagram of a gesture contained in a second image provided by some embodiments;
FIG. 2c illustrates a schematic diagram of a gesture contained in a third image provided by some embodiments;
FIG. 2d illustrates a schematic diagram of a gesture contained in a fourth image provided by some embodiments;
FIG. 3a is a schematic diagram illustrating the result of adjusting parameters of a gesture recognition model to be trained according to a value of a mean square error loss function alone, and performing gesture recognition through the trained gesture recognition model according to some embodiments;
fig. 3b is a schematic diagram illustrating a result of gesture recognition performed by the trained gesture recognition model when parameters of the gesture recognition model to be trained are adjusted based on values of a mean square error loss function and values of a triplet loss function according to some embodiments;
FIG. 4 illustrates a schematic diagram of a gesture recognition model structure provided by some embodiments;
FIG. 5 illustrates a gesture recognition process diagram provided by some embodiments;
FIG. 6 illustrates a schematic diagram of determining a category of a gesture contained in a current video frame through sliding window integration according to some embodiments;
FIG. 7 illustrates another gesture recognition process diagram provided by some embodiments;
FIG. 8 illustrates a schematic diagram of a gesture recognition model training apparatus provided by some embodiments;
FIG. 9 illustrates a schematic diagram of a gesture recognition apparatus provided by some embodiments;
FIG. 10 illustrates an electronic device in accordance with certain embodiments;
FIG. 11 illustrates a schematic structural diagram of another electronic device according to some embodiments.
Detailed Description
In order to improve the accuracy of gesture recognition, the embodiments of the present application provide a gesture recognition model training method, a gesture recognition method, an apparatus, a device, and a medium.
To make the purpose and embodiments of the present application clearer, the following will clearly and completely describe the exemplary embodiments of the present application with reference to the attached drawings in the exemplary embodiments of the present application, and it is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
It should be noted that the brief descriptions of the terms in the present application are only for convenience of understanding of the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
In actual use, when the electronic device trains the gesture recognition model, it can obtain a positive sample image and a negative sample image of any anchor sample image in a sample set, wherein the gesture category contained in the positive sample image is the same as the gesture category contained in the anchor sample image, and the gesture category contained in the negative sample image is different from the gesture category contained in the anchor sample image. Any sample image in the sample set corresponds to an annotated sample category label, which is used to identify the category of the gesture contained in the sample image. The electronic device can input the anchor sample image, the positive sample image and the negative sample image into the gesture recognition model to be trained, and determine the recognition category labels corresponding to the three images according to the output of the model; it then determines the value of a loss function based on the sample category labels and the recognition category labels, where the loss function comprises a triplet loss function, and adjusts the parameters of the gesture recognition model to be trained according to the value of the loss function. Because the electronic device determines the value of the loss function based on the anchor sample image, the positive sample image and the negative sample image, adjusting the parameters of the model based on that value helps the gesture recognition model better distinguish the anchor sample image (and the positive sample image) from the negative sample image, and therefore better distinguish different gestures, for example similar gestures.
Fig. 1 is a schematic diagram illustrating a gesture recognition model training process provided by some embodiments, and as shown in fig. 1, the process includes the following steps:
s101: acquiring a positive sample image and a negative sample image of any anchor sample image in a sample set, wherein the gesture category contained in the positive sample image is the same as the gesture category contained in the anchor sample image, and the gesture category contained in the negative sample image is different from the gesture category contained in the anchor sample image; wherein any sample image in the sample set corresponds to an annotated sample category label, and the sample category label is used for identifying the sample category of the gesture contained in the sample image.
The gesture recognition model training method provided by the embodiment of the application is applied to electronic equipment, and the electronic equipment can be a server, or equipment such as a PC (personal computer) and a mobile terminal.
In one possible implementation, the sample set contains a plurality of sample images, and each sample image corresponds to an annotated sample category label that can be used to identify the category of the gesture contained in that sample image. The sample category labels can be annotated manually or by an electronic device, and annotating sample images with sample category labels can be done with existing techniques, which are not described here again.
For example, fig. 2a through fig. 2d illustrate schematic diagrams of the gestures contained in four images provided by some embodiments. As shown in fig. 2a, when the category of the gesture contained in the sample image is a ring finger, the sample category label corresponding to the sample image may be annotated as 1 (for convenience of description, gesture No. 1); as shown in fig. 2b, when the category of the gesture is a palm, the label may be annotated as 2 (gesture No. 2); as shown in fig. 2c, when the category of the gesture is a fist, the label may be annotated as 3 (gesture No. 3); and as shown in fig. 2d, when the category of the gesture is neither a ring finger, a palm nor a fist, the label may be annotated as 0 (gesture No. 0). The gesture categories contained in the sample images and the sample category label corresponding to each gesture category may be flexibly set as required, which is not specifically limited in the present application. In one possible embodiment, the number of gesture categories covered by the sample set may be small, for example 3 or 4.
In one possible implementation, if the numbers of sample images of different gesture categories in the sample set differ greatly, a gesture recognition model trained on that sample set will have low accuracy and stability when recognizing gestures. In order to improve the stability and accuracy of gesture recognition, a sample-expansion (data enhancement) approach may be used to increase the number of sample images of a gesture category that has relatively few samples. For example, existing images of that gesture category in the sample set may be mirror-flipped, rotated, and so on to increase their number; alternatively, some sample images of that gesture category may be extracted from other similar datasets, such as an American Sign Language (ASL) dataset, and added to the sample set of the present application (relative to the ASL datasets, the sample set of the present application may be regarded as a custom sample set).
For example, if the number of sample images of the ring finger gesture category (gesture No. 1) in the sample set is relatively small, the existing sample images of gesture No. 1 may be mirror-flipped, rotated, and so on to increase their number. For example, the existing sample images of gesture No. 1 may be randomly rotated with a rotation angle in the range [-30°, +60°]. In addition, in actual use the angles of users' gestures vary widely even within the same gesture category, so expanding the sample set by mirror flipping, random rotation and the like on the existing sample images makes the gestures contained in the sample images better match real usage. This helps prevent overfitting when the trained gesture recognition model is migrated to actual use, and thus improves the accuracy of gesture recognition.
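For illustration only, the following is a minimal sketch of such an augmentation step in Python; the application does not specify an implementation, so the use of Pillow and the function name are assumptions.

```python
from PIL import Image, ImageOps
import random

def augment_sample(image: Image.Image) -> Image.Image:
    """Hypothetical helper: return a mirrored/rotated copy of a sample image.

    Mirror flipping and random rotation in [-30, +60] degrees follow the
    data enhancement manner described above; everything else is assumed.
    """
    out = image
    if random.random() < 0.5:            # mirror-flip half of the samples
        out = ImageOps.mirror(out)
    angle = random.uniform(-30.0, 60.0)  # random rotation angle in [-30, +60]
    return out.rotate(angle)
```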
For another example, if the number of sample images of the palm gesture category (gesture No. 2) in the sample set is relatively small, some sample images of gesture No. 2 can be extracted from the ASL dataset and added to the sample set of the present application, so as to increase the number of sample images of that gesture category in the sample set.
When training the gesture recognition model to be trained, any sample image in the sample set can be obtained; for convenience of description, that sample image is called the anchor sample image. In order to improve the accuracy of gesture recognition, after the anchor sample image is obtained, a positive sample image and a negative sample image of the anchor sample image may also be obtained. In one possible implementation, the positive and negative sample images may be obtained according to the sample category label of each sample image: one sample image may be randomly selected as the positive sample image from the sample images whose sample category label is the same as that of the anchor sample image, and one sample image may be randomly selected as the negative sample image from the sample images whose sample category label differs from that of the anchor sample image.
In one possible embodiment, in order to improve the accuracy of gesture recognition, the process of acquiring the negative sample image includes:
determining a target second gesture category corresponding to the gesture category contained in the positive sample image according to the stored corresponding relation between the first gesture category and the second gesture category; and selecting a negative sample image from the sample images of the target second gesture category.
In a possible implementation manner, a user may preset a corresponding relationship between a first gesture category and a second gesture category and store the corresponding relationship in the electronic device, and when the electronic device selects a negative sample image of an anchor sample image, the electronic device may first determine a target second gesture category corresponding to a gesture category included in the positive sample image (anchor sample image) according to the stored corresponding relationship between the first gesture category and the second gesture category, and then randomly select one sample image from the sample images of the target second gesture category as the negative sample image. For example, referring to table 1, if the gesture category included in the selected anchor sample image is a ring finger (gesture No. 1), the gesture included in the positive sample image is also a gesture No. 1, and the target second gesture category corresponding to the gesture category included in the positive sample image may be a gesture No. 0, a gesture No. 2, or a gesture No. 3, then a sample image may be randomly selected from the sample images of the gesture No. 0, the gesture No. 2, or the gesture No. 3 (sample category labels are 0, 2, and 3) as the negative sample image.
TABLE 1

First gesture category    Target second gesture categories    Probability ratio
Gesture No. 0             Gesture No. 1, No. 2, No. 3         1:1:1
Gesture No. 1             Gesture No. 0, No. 2, No. 3         4:2:4
Gesture No. 2             Gesture No. 0, No. 1                1:1
Gesture No. 3             Gesture No. 0, No. 1                1:1
In a possible embodiment, in order to improve the accuracy of gesture recognition, the selecting a negative sample image from the sample images of the target second gesture category includes:
and if the number of the target second gesture categories is at least two, selecting a negative sample image from the sample images of the target second gesture categories according to the set probability proportion that the sample images of the target second gesture categories are selected as the negative sample images.
In a possible implementation manner, when the user sets the correspondence between the first gesture category and the second gesture category, if there are at least two second gesture categories corresponding to a first gesture category, the probability ratio with which the sample images of each corresponding second gesture category are selected as the negative sample image may also be set. If the number of target second gesture categories is at least two, then when the negative sample image is selected, one sample image can be randomly selected from the sample images of any target second gesture category in the sample set according to the set probability ratio with which the sample images of each target second gesture category are selected as the negative sample image.
For ease of understanding, the above example is continued here. Referring to table 1, if the gesture category contained in the selected anchor sample image is a ring finger (gesture No. 1), the gesture category contained in the positive sample image is also gesture No. 1, and the target second gesture categories corresponding to the gesture category contained in the positive sample image may be gesture No. 0, gesture No. 2 and gesture No. 3, with a probability ratio of 4:2:4 of being selected as the negative sample image. When selecting the negative sample image, one sample image is then selected from the sample images of gesture No. 0, gesture No. 2 and gesture No. 3 in the sample set according to that probability ratio.
In one possible embodiment, the probability ratio with which the sample images of the target second gesture categories are selected as the negative sample image may be controlled by, among other ways, setting a correspondence between random-number interval ranges and the target second gesture categories. For example, still taking the above example, the overall range of the random number may be set to [0, 1], and the correspondences between [0, 0.4) and gesture No. 0, between [0.4, 0.6) and gesture No. 2, and between [0.6, 1] and gesture No. 3 may be stored. Since a randomly generated target random number falls in [0, 0.4) with a probability of 40%, in [0.4, 0.6) with a probability of 20%, and in [0.6, 1] with a probability of 40%, when one sample image is randomly selected as the negative sample image from the sample images of the target second gesture category corresponding to the target random number, the probability that a sample image of gesture No. 0 is selected is 40%, the probability that a sample image of gesture No. 2 is selected is 20%, and the probability that a sample image of gesture No. 3 is selected is 40%, which satisfies the set probability ratio.
Referring to table 1, for another example, if the gesture category contained in the selected anchor sample image is gesture No. 0, the gesture category contained in the positive sample image is also gesture No. 0, and the target second gesture categories corresponding to the gesture category contained in the positive sample image may be gesture No. 1, gesture No. 2 and gesture No. 3, with a probability ratio of 1:1:1 of being selected as the negative sample image. When selecting the negative sample image, one sample image is then selected from the sample images of gesture No. 1, gesture No. 2 and gesture No. 3 in the sample set according to that probability ratio.
For another example, referring to table 1, if the gesture category contained in the selected anchor sample image is gesture No. 2, the gesture category contained in the positive sample image is also gesture No. 2, and the target second gesture categories may be gesture No. 0 and gesture No. 1 (since gesture No. 2 is easily distinguished from gesture No. 3, gesture No. 3 may be excluded from the target second gesture categories), with a probability ratio of 1:1 of being selected as the negative sample image. When selecting the negative sample image, one sample image is selected from the sample images of gesture No. 0 and gesture No. 1 in the sample set according to the probability distribution in which a sample image of gesture No. 0 is selected with a probability of 50% and a sample image of gesture No. 1 is selected with a probability of 50%, and this image serves as the negative sample image actually used in training the gesture recognition model.
For another example, referring to table 1, if the gesture category contained in the selected anchor sample image is gesture No. 3, the gesture category contained in the positive sample image is also gesture No. 3, and the target second gesture categories may be gesture No. 0 and gesture No. 1 (since gesture No. 3 is easily distinguished from gesture No. 2, gesture No. 2 may be excluded from the target second gesture categories), with a probability ratio of 1:1 of being selected as the negative sample image. When selecting the negative sample image, one sample image is selected from the sample images of gesture No. 0 and gesture No. 1 in the sample set, each category being chosen with a probability of 50%, and this image serves as the negative sample image actually used in training the gesture recognition model.
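The interval-based selection described in the examples above can be sketched as follows. This is an illustrative assumption: the dictionary encodes Table 1 as reconstructed above, and the function names are hypothetical rather than the application's implementation.

```python
import random

# Assumed encoding of Table 1: anchor gesture category ->
# list of (target second gesture category, selection probability).
NEGATIVE_RULES = {
    0: [(1, 1 / 3), (2, 1 / 3), (3, 1 / 3)],  # ratio 1:1:1
    1: [(0, 0.4), (2, 0.2), (3, 0.4)],        # ratio 4:2:4
    2: [(0, 0.5), (1, 0.5)],                  # gesture No. 3 excluded
    3: [(0, 0.5), (1, 0.5)],                  # gesture No. 2 excluded
}

def pick_negative_category(anchor_category: int) -> int:
    """Map a random number in [0, 1] onto the stored interval ranges."""
    r = random.random()
    cumulative = 0.0
    for category, prob in NEGATIVE_RULES[anchor_category]:
        cumulative += prob
        if r < cumulative:
            return category
    return NEGATIVE_RULES[anchor_category][-1][0]  # guard against rounding

def pick_negative_image(anchor_category: int, samples_by_category: dict):
    """Randomly select one sample image from the chosen negative category."""
    negative_category = pick_negative_category(anchor_category)
    return random.choice(samples_by_category[negative_category])
```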
In the present application, the target second gesture categories corresponding to the gesture category contained in the positive sample image can be determined according to the stored correspondence between the first gesture category and the second gesture category, and the negative sample image can then be selected from the sample images of the target second gesture categories. Compared with not setting such a correspondence and simply selecting the negative sample image from any sample images whose gesture category differs from that of the anchor sample image, the gesture category of the negative sample image can be controlled based on the stored correspondence. For example, gesture categories that are easily confused with the gesture of the anchor sample image can be designated as target second gesture categories, while gesture categories that are easily distinguished from it can be excluded. The acquired negative sample images are then more likely to belong to categories easily confused with the gesture of the anchor sample image, so a gesture recognition model trained in this way achieves higher accuracy in gesture recognition.
In addition, because the negative sample image is selected from the sample images of the target second gesture categories according to the set probability ratio, the probability that the negative sample image belongs to a particular gesture category can be further controlled, so that gesture categories easily confused with the gesture of the anchor sample image are obtained with higher probability. This further improves the accuracy of gesture recognition when the model trained in this way is used.
S102: and respectively inputting the anchor sample image, the positive sample image and the negative sample image into a gesture recognition model to be trained, and respectively determining recognition class labels corresponding to the anchor sample image, the positive sample image and the negative sample image.
After the anchor sample image, the positive sample image and the negative sample image are obtained (i.e., the triplet is obtained), they may be respectively input into the gesture recognition model to be trained, so that the model determines the recognition category label corresponding to each of the anchor sample image, the positive sample image and the negative sample image. How a gesture recognition model determines the recognition category label of a sample image is known in the art and is not described here again.
S103: determining a value of a loss function based on the sample class label and an identification class label, wherein the loss function comprises a triplet loss function; and adjusting parameters of the gesture recognition model to be trained according to the value of the loss function.
After the gesture recognition model to be trained has determined the recognition category labels corresponding to the anchor sample image, the positive sample image and the negative sample image, and because the sample category labels corresponding to these three images are stored in advance, whether the recognition result of the model is accurate can be determined by checking whether the sample category label and the recognition category label agree for the anchor sample image, for the positive sample image, and for the negative sample image. In specific implementation, if they do not agree, the recognition result of the gesture recognition model to be trained is inaccurate, and the parameters of the model need to be adjusted, thereby training the gesture recognition model.
Specifically, the value of the loss function may be determined according to whether the sample class label and the identification class label corresponding to the anchor sample image are consistent, whether the sample class label and the identification class label corresponding to the positive sample image are consistent, and whether the sample class label and the identification class label corresponding to the negative sample image are consistent, where the loss function at least includes a triplet loss function. When the recognition result is less accurate, the value of the loss function can be larger, and the adjustment amplitude of the parameters of the gesture recognition model can be larger.
In specific implementation, when the parameters of the gesture recognition model are adjusted, a gradient descent algorithm can be used to back-propagate the gradients of the model parameters, thereby training the gesture recognition model.
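In a PyTorch-style sketch (an assumption; the application does not name a framework or optimizer), one such parameter adjustment step could look like this:

```python
import torch

def train_step(model: torch.nn.Module,
               optimizer: torch.optim.Optimizer,
               loss: torch.Tensor) -> None:
    """One gradient-descent update of the gesture recognition model.

    `loss` is the value of the loss function computed for the current
    triplet; the optimizer choice and learning rate are assumptions.
    """
    optimizer.zero_grad()  # clear gradients left over from the previous step
    loss.backward()        # back-propagate gradients of the model parameters
    optimizer.step()       # adjust the parameters along the negative gradient

# Illustrative setup: stochastic gradient descent with an assumed learning rate.
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
```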
In one possible implementation, the above operation may be performed on each sample image in the sample set, and when a preset convergence condition is satisfied, it is determined that the training of the gesture recognition model is completed.
The preset convergence condition may be, for example, that the number of sample images in the sample set correctly recognized by the gesture recognition model being trained is greater than a set number, or that the number of iterations of training the gesture recognition model reaches a set maximum number of iterations, and so on. This can be configured flexibly and is not limited here.
In a possible implementation manner, when training the gesture recognition model, the sample images in the sample set may be divided into training sample images and test sample images; the original gesture recognition model is trained on the training sample images, and the reliability of the trained model is then verified on the test sample images. In a possible implementation manner, the real-time performance of the gesture recognition model can also be tested, so that the trained model meets the requirement of recognizing gestures in real time. For example, the trained gesture recognition model may take less than 30 ms to recognize the gesture in one video frame, so that the gesture contained in every video frame of a 30 fps video stream can be recognized in real time.
When the gesture recognition model is trained, a positive sample image and a negative sample image of any anchor sample image in the sample set can thus be obtained, wherein the gesture category contained in the positive sample image is the same as the gesture category contained in the anchor sample image, and the gesture category contained in the negative sample image is different from it. Any sample image in the sample set corresponds to an annotated sample category label, which can be used to identify the category of the gesture contained in the sample image. The electronic device can input the anchor sample image, the positive sample image and the negative sample image into the gesture recognition model to be trained and determine the corresponding recognition category labels from the model's output; it then determines the value of a loss function, which comprises a triplet loss function, based on the sample category labels and the recognition category labels, and adjusts the parameters of the model accordingly. Because the value of the loss function is determined from the anchor, positive and negative sample images, adjusting the parameters based on it helps the gesture recognition model distinguish the anchor sample image (and the positive sample image) from the negative sample image, and therefore better distinguish different gestures, for example similar gestures.
In one possible embodiment, the determining the value of the loss function based on the sample class label and the identification class label comprises:
the loss function comprises a triplet loss function and a mean square error loss function, and the value of the triplet loss function and the value of the mean square error loss function are respectively determined based on the sample category label and the recognition category label;
and determining the value of the loss function according to the value of the triplet loss function, a preset first weight value corresponding to the triplet loss function, the value of the mean square error loss function, and a preset second weight value corresponding to the mean square error loss function.
In a possible implementation manner, if the loss function includes the triplet loss function, the value of the triplet loss function may be determined according to the sample category label and the recognition category label corresponding to the anchor sample image, to the positive sample image, and to the negative sample image. In one possible embodiment, the value of the triplet loss function may be determined based on the following formula:

L_Triplet = max(d(a, p) - d(a, n) + margin, 0);

where L_Triplet is the value of the triplet loss function; a is the anchor sample image; p is the positive sample image; n is the negative sample image; d(a, p) and d(a, n) are distance functions, with d(a, p) representing the distance between the anchor sample image and the positive sample image and d(a, n) representing the distance between the anchor sample image and the negative sample image; and margin is an interval parameter that can be set flexibly as required. When d(a, p) - d(a, n) + margin is greater than 0, L_Triplet takes the value d(a, p) - d(a, n) + margin; when d(a, p) - d(a, n) + margin is less than or equal to 0, L_Triplet takes the value 0.
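As a minimal sketch of this formula (assuming PyTorch and Euclidean distance over feature vectors; the application does not fix either choice):

```python
import torch

def triplet_loss(anchor: torch.Tensor,
                 positive: torch.Tensor,
                 negative: torch.Tensor,
                 margin: float = 0.2) -> torch.Tensor:
    """L_Triplet = max(d(a, p) - d(a, n) + margin, 0).

    Euclidean distance and the margin value 0.2 are illustrative
    assumptions; `anchor`, `positive` and `negative` are feature vectors.
    """
    d_ap = torch.norm(anchor - positive, dim=-1)  # d(a, p)
    d_an = torch.norm(anchor - negative, dim=-1)  # d(a, n)
    return torch.clamp(d_ap - d_an + margin, min=0.0).mean()
```

PyTorch's built-in torch.nn.TripletMarginLoss implements the same form and could be used instead.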
In a possible implementation manner, when the loss function includes a mean square error loss function, a first function value of the mean square error loss function may be determined according to the sample category label and the recognition category label corresponding to the anchor sample image; a second function value may be determined according to the sample category label and the recognition category label corresponding to the positive sample image; and a third function value may be determined according to the sample category label and the recognition category label corresponding to the negative sample image. The processes of determining the first, second and third function values are similar, and each can be determined based on the formula

L_MSE = (1/n) · Σ_{i=1}^{n} (target_i - y_i)²

where n is the total number of gesture categories in the sample set (if the gesture categories in the sample set are gesture No. 0, gesture No. 1, gesture No. 2 and gesture No. 3, then n is 4), target is the sample category label, and y is the recognition category label. In one possible embodiment, after the first, second and third function values of the mean square error loss function are determined, their average value may be determined as the value of the mean square error loss function.
In order to accurately determine the value of the loss function, a first weight value corresponding to the triplet loss function and a second weight value corresponding to the mean square error loss function may be preset; a first product of the value of the triplet loss function and the first weight value and a second product of the value of the mean square error loss function and the second weight value are then determined, and the sum of the first product and the second product is determined as the value of the loss function.
For convenience of description, the process of determining the value of the loss function is expressed as a formula. If the value of the mean square error loss function is denoted L_MSE, the second weight value corresponding to the mean square error loss function is denoted α, the value of the triplet loss function is denoted L_Triplet, and the first weight value corresponding to the triplet loss function is denoted β, then the value of the loss function is L = α·L_MSE + β·L_Triplet. In one possible implementation, the sum of the first weight value and the second weight value may be 1.
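Putting the two terms together, the following is a hedged sketch of the combined loss (continuing the PyTorch assumption; the one-hot encoding of the labels and the choice α = β = 0.5 are illustrative, consistent with the weights summing to 1):

```python
import torch
import torch.nn.functional as F

def combined_loss(out_a: torch.Tensor, out_p: torch.Tensor, out_n: torch.Tensor,
                  onehot_a: torch.Tensor, onehot_p: torch.Tensor, onehot_n: torch.Tensor,
                  feat_a: torch.Tensor, feat_p: torch.Tensor, feat_n: torch.Tensor,
                  alpha: float = 0.5, beta: float = 0.5,
                  margin: float = 0.2) -> torch.Tensor:
    """L = alpha * L_MSE + beta * L_Triplet.

    `out_*` are the model outputs over the n gesture categories, `onehot_*`
    the one-hot sample category labels, and `feat_*` the feature vectors
    used for the triplet distances. All of these names are assumptions.
    """
    # Mean square error averaged over the anchor, positive and negative images.
    l_mse = (F.mse_loss(out_a, onehot_a) +
             F.mse_loss(out_p, onehot_p) +
             F.mse_loss(out_n, onehot_n)) / 3.0
    # Triplet term as in the formula above.
    d_ap = torch.norm(feat_a - feat_p, dim=-1)
    d_an = torch.norm(feat_a - feat_n, dim=-1)
    l_triplet = torch.clamp(d_ap - d_an + margin, min=0.0).mean()
    return alpha * l_mse + beta * l_triplet
```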
Fig. 3a illustrates the result of performing gesture recognition with a trained gesture recognition model whose parameters were adjusted according to the value of the mean square error loss function alone. As shown in fig. 3a, there are 719 gesture No. 2 samples, of which 28 are mistakenly recognized as gesture No. 0, 12 are mistakenly recognized as gesture No. 1, and 32 are mistakenly recognized as gesture No. 3, while 647 are correctly recognized as gesture No. 2. As another example, referring to fig. 3a, there are 719 gesture No. 3 samples, of which 8 are mistakenly recognized as gesture No. 0, 22 are mistakenly recognized as gesture No. 1, and 35 are mistakenly recognized as gesture No. 2, while 654 are correctly recognized as gesture No. 3.
Fig. 3b shows the result of gesture recognition with the trained gesture recognition model when the parameters of the model were adjusted based on both the value of the mean square error loss function and the value of the triplet loss function. As shown in fig. 3b, there are likewise 719 gesture No. 2 samples, of which 9 are mistakenly recognized as gesture No. 0, 9 are mistakenly recognized as gesture No. 1, and 35 are mistakenly recognized as gesture No. 3, while 666 in total are correctly recognized as gesture No. 2. Therefore, when the parameters of the gesture recognition model to be trained are adjusted based on the values of both the mean square error loss function and the triplet loss function, the accuracy of gesture recognition is improved.
Likewise, referring to fig. 3b, there are 719 gesture No. 3 samples, of which 8 are mistakenly recognized as gesture No. 0, 9 are mistakenly recognized as gesture No. 1, and 44 are mistakenly recognized as gesture No. 2, while 658 in total are correctly recognized as gesture No. 3. This again shows that adjusting the parameters of the gesture recognition model based on the values of both the mean square error loss function and the triplet loss function improves the accuracy of gesture recognition.
In a possible implementation manner, fig. 4 shows a schematic structural diagram of a gesture recognition model provided in some embodiments. As shown in fig. 4, the gesture recognition model in the present application may include six parts. The first part is a convolutional layer (7×7 conv): the input image first enters this convolutional layer, which is followed by a rectified linear unit activation layer (relu) and a max pooling layer (maxpool). The next four parts (BasicBlock1, BasicBlock2, BasicBlock3, BasicBlock4) are residual network structure blocks of the same structure, each of which contains two modules (for convenience of description, a first module and a second module). Both modules use a 3×3 convolutional layer (3×3 conv) for feature extraction, a batch normalization layer (BN) for regularization, and then a relu activation function. Illustratively, for the convolutional layer in the first module of BasicBlock1 through BasicBlock4, the convolutional kernel size (kernel_size, denoted k) is 3, the convolution stride (stride, denoted s) is 2, and the feature map padding width (denoted p) is 1; the convolutional layers in the second modules of BasicBlock1 through BasicBlock4 all have k of 3, s of 1 and p of 1.
BasicBlock1, BasicBlock2, BasicBlock3 and BasicBlock4 are all residual network structure blocks formed by the first module, the second module and an identity shortcut connection from the input, implemented as a 1×1 convolution (1×1 conv). The identity shortcut connections in BasicBlock1 through BasicBlock4 all have k of 1, s of 2 and p of 0.
The last part, connected after the final residual network structure block (BasicBlock4), consists of an average pooling layer (avgpool) and a fully connected layer (FC), which perform the final feature extraction and produce the final recognition result of the gesture recognition model.
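A hedged sketch of this six-part structure in PyTorch follows; the channel widths and the number of output categories are assumptions for illustration, since fig. 4 fixes only the layer types and the k/s/p values given above.

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block from Fig. 4: two 3x3 conv modules plus a 1x1 shortcut."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # First module: k=3, s=2, p=1, followed by BN and relu.
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        # Second module: k=3, s=1, p=1, followed by BN.
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # Identity shortcut connection: k=1, s=2, p=0.
        self.shortcut = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=2, padding=0)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))

class GestureNet(nn.Module):
    """Six-part model: 7x7 conv stem, four BasicBlocks, avgpool and FC."""
    def __init__(self, num_classes: int = 4):  # e.g. gestures No. 0-3
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        self.blocks = nn.Sequential(
            BasicBlock(64, 64), BasicBlock(64, 128),
            BasicBlock(128, 256), BasicBlock(256, 512),
        )
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.avgpool(self.blocks(self.stem(x))).flatten(1)
        return self.fc(feat)
```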
Based on the same technical concept, the present application provides a gesture recognition method, and fig. 5 shows a schematic diagram of a gesture recognition process provided by some embodiments, as shown in fig. 5, the process includes the following steps:
s501: selecting a set number of video frames including a current video frame, sequentially inputting the set number of video frames into a gesture recognition model which is trained in advance, and respectively determining candidate categories of gestures contained in the set number of video frames through the gesture recognition model.
S502: judging whether the number of video frames of any candidate category among the set number of video frames is not less than a set number threshold, and if so, determining that candidate category as the target category of the gesture contained in the current video frame.
The gesture recognition method provided by the embodiments of the present application is applied to an electronic device, which may be a household appliance such as a PC, a mobile terminal or a smart television, or may be a server.
In one possible implementation, the electronic device may include an image capture module such as a camera, and may capture video frames in real time based on the image capture module. In addition, other electronic devices may also acquire the video frame and then send the video frame to the electronic device, which is not specifically limited in this application.
In a possible implementation manner, the electronic device may input each video frame of the video stream into the pre-trained gesture recognition model in real time and determine the category of the gesture contained in each video frame according to the output of the model. In one possible embodiment, in order to improve the accuracy of gesture recognition, the category of the gesture contained in the current video frame may be determined comprehensively based on the categories of the gestures contained in a set number of video frames including the current video frame. This process is described in detail below.
The electronic device may first acquire a set number of video frames including the current video frame, where the set number may be flexibly set according to requirements and is not specifically limited in this application; it may be, for example, 5 or 10. After the set number of video frames are selected, they may be sequentially input into the pre-trained gesture recognition model, and the category of the gesture contained in each of these video frames (for convenience of description, referred to as a candidate category) may be determined according to the output result of the model.
After the candidate category of the gesture contained in each of the set number of video frames is determined, the number of video frames of each candidate category may be counted, and it may be judged whether the number of video frames of any candidate category is not less than a set number threshold; if so, that candidate category is determined as the category of the gesture contained in the current video frame (for convenience of description, referred to as the target category). The number threshold may be flexibly set according to requirements, which is not specifically limited in the present application. For example, if the set number is 10 and the number threshold is 7, and the gestures contained in the 10 consecutive video frames including the current video frame are gesture No. 1 and gesture No. 2, with 2 video frames of gesture No. 1 and 8 video frames of gesture No. 2, the gesture category of the current video frame is determined to be gesture No. 2.
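For illustration, the counting and threshold check described above might look as follows in Python; all names here are ours, not the application's:

```python
from collections import Counter
from typing import List, Optional

def vote_target_category(candidates: List[str], count_threshold: int) -> Optional[str]:
    """Count the candidate categories over the window's frames and return the
    category whose frame count is not less than the threshold, else None."""
    if not candidates:
        return None
    category, count = Counter(candidates).most_common(1)[0]
    return category if count >= count_threshold else None

# Example from the text: 2 frames of gesture No. 1 and 8 frames of gesture
# No. 2 with threshold 7 -> gesture No. 2 is the target category.
assert vote_target_category(["No.1"] * 2 + ["No.2"] * 8, 7) == "No.2"
```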
In a possible implementation manner, if the number of video frames of every candidate category is less than the set number threshold, it may be considered that the current video frame contains no gesture, or that the contained gesture is not a preset control gesture, and no output is produced.
For convenience of understanding, the gesture recognition process provided in the embodiment of the present application is described below by using a specific embodiment.
In one possible implementation, the user may control the electronic device by making certain gestures at a distance of about 2 m from the image acquisition module of the electronic device. The image acquisition module may be a camera such as a hawaiwei C930E; the frame rate of the acquired video stream may be 30 fps, the resolution may be 920 pixels × 1080 pixels, and the video frames may be in RGB format, corresponding to the three RGB color channels. The electronic device may adjust the image parameters of each video frame in the acquired video stream, the adjusted image parameters being represented by C × W × H, where C is the number of color channels of the image, W is the width of the image (in pixels), and H is the height of the image (in pixels).
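As a small illustrative sketch of this parameter adjustment (the (H, W, C) capture layout and the scaling to [0, 1] are assumptions on our part):

```python
import numpy as np

def to_cwh(frame: np.ndarray) -> np.ndarray:
    """Rearrange an H x W x C RGB frame into the C x W x H layout described
    above, scaling pixel values to [0, 1] (an assumed normalization)."""
    scaled = frame.astype(np.float32) / 255.0
    return np.transpose(scaled, (2, 1, 0))  # (H, W, C) -> (C, W, H)
```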
The electronic device may sequentially input each video frame into the pre-trained gesture recognition model and determine the category of the gesture contained in each video frame according to the output result of the model. For convenience of description, the video frame currently input into the gesture recognition model is referred to as the current video frame. In a possible implementation, to improve the accuracy of gesture recognition, the category of the gesture contained in the current video frame may be comprehensively determined based on the categories of the gestures contained in a set number of video frames including the current video frame. For convenience of description, this manner of comprehensive determination is referred to as sliding-window based, and the process of comprehensively determining the category of the gesture contained in the current video frame through the sliding window is illustrated below.
Fig. 6 illustrates a schematic diagram for comprehensively determining the category of the gesture included in the current video frame through the sliding window according to some embodiments, and as shown in fig. 6, assuming that the category of the gesture included in the current video frame is comprehensively determined based on the categories of the gestures included in 10 video frames including the current video frame, the length of the sliding window may be 10 frames, that is, the length of the sliding window is the same as the set number.
In a possible implementation, the selecting a set number of video frames including the current video frame includes:
and sequentially selecting a set number of video frames including the current video frame according to the set time step interval.
In one possible implementation, the set time step interval determines which video frames' gesture categories are used to comprehensively determine the category of the gesture contained in the current video frame. In one possible embodiment, referring to fig. 6, when the set time step interval is the time interval for generating one video frame (for convenience of description, referred to as a step size of 1), a set number of temporally consecutive video frames including the current video frame may be sequentially selected, for example, the 10 video frames frame0, frame1, frame2, …, frame9, and the category of the gesture contained in the current video frame (the target category) is determined based on the categories of the gestures contained in these 10 video frames (the candidate categories).
As another example, when the set time step interval is the time interval for generating two video frames (for convenience of description, referred to as a step size of 2), taking the set number as 5 and the current video frame as frame0 as an example, the 5 video frames frame0, frame2, frame4, frame6 and frame8 may be sequentially selected, and the category of the gesture contained in the current video frame (the target category) may be determined based on the categories of the gestures contained in these 5 video frames (the candidate categories).
The following description takes a step size of 1 and a set number of 10 as an example. When the video stream starts to be acquired, the video frames contained in the video stream are sequentially frame0, frame1, frame2, frame3, and so on. Taking the current video frame as frame0 as an example, the category of the gesture contained in frame0 (the target category) may be determined according to the output results of the gesture recognition model for the 10 video frames frame0, frame1, frame2, …, frame9 (that is, according to the candidate categories of the gestures contained in these 10 video frames). Taking the number threshold as 7, assuming that the gestures contained in the 10 video frames are gesture No. 1 and gesture No. 2, with 2 video frames of gesture No. 1 and 8 video frames of gesture No. 2, the gesture category of frame0 is determined to be gesture No. 2.
As time goes by, the electronic device continuously acquires video frames. Assuming that the target categories of frame0, frame1, frame2 and frame3 have already been determined and the target category of frame4 is to be determined, taking frame4 as the current video frame, the category of the gesture contained in frame4 (the target category) may be determined according to the output results of the gesture recognition model for the 10 video frames frame4, frame5, frame6, …, frame13 (that is, according to the candidate categories of the gestures contained in these 10 video frames). The determination process is the same as in the above embodiment and is not repeated here.
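Under the assumption that frames are addressed by index, the window selection at a given step size might be sketched as:

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def select_window(frames: Sequence[T], current: int,
                  set_number: int, step: int = 1) -> List[T]:
    """Starting from the current video frame, pick `set_number` frames spaced
    `step` apart: step 1 on frame0 yields frame0..frame9 (set number 10);
    step 2 on frame0 yields frame0, frame2, frame4, frame6, frame8 (set number 5)."""
    return [frames[current + i * step]
            for i in range(set_number)
            if current + i * step < len(frames)]
```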
In one possible implementation manner, after the category of the gesture contained in a video frame is determined, a corresponding control signal may be generated based on the gesture category, so that the electronic device may be controlled based on the control signal. Generating the control signal based on the gesture category contained in the video frame and controlling the electronic device based on the control signal may adopt existing techniques, and are not described herein again.
For convenience of understanding, the gesture recognition process provided in this embodiment is further described below with a specific embodiment. FIG. 7 shows another schematic diagram of a gesture recognition process provided by some embodiments; the process includes the following steps:
S701: acquiring video frames in real time.
S702: and determining candidate categories of the gestures contained in the video frame through the pre-trained gesture recognition model.
S703: and comprehensively determining the target category of the gesture contained in the current video frame based on the candidate categories of the gesture contained in the set number of video frames including the current video frame.
S704: and generating a corresponding control signal based on the target category of the gesture contained in the video frame, and controlling the electronic equipment based on the control signal.
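For illustration, steps S701-S704 can be strung together by reusing the vote_target_category and select_window sketches given earlier; classify_frame and on_gesture below are assumed, application-specific callbacks:

```python
from typing import Callable, Sequence

def gesture_control_loop(frames: Sequence, classify_frame: Callable,
                         on_gesture: Callable[[str], None],
                         set_number: int = 10, count_threshold: int = 7,
                         step: int = 1) -> None:
    """classify_frame wraps the pre-trained model (S702); on_gesture
    generates the control signal for the electronic device (S704)."""
    last = len(frames) - (set_number - 1) * step
    for current in range(max(last, 0)):
        window = select_window(frames, current, set_number, step)     # S701
        candidates = [classify_frame(f) for f in window]              # S702
        target = vote_target_category(candidates, count_threshold)    # S703
        if target is not None:
            on_gesture(target)                                        # S704
```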
Based on the same technical concept, the present application further provides a gesture recognition model training apparatus. Fig. 8 shows a schematic diagram of a gesture recognition model training apparatus provided in some embodiments; the apparatus includes:
the obtaining module 81 is configured to obtain, for any anchor sample image in a sample set, a positive sample image and a negative sample image of the anchor sample image, where a gesture category included in the positive sample image is the same as a gesture category included in the anchor sample image, and a gesture category included in the negative sample image is different from the gesture category included in the anchor sample image; any sample image in the sample set corresponds to an annotated sample category label, and the sample category label is used for identifying the sample category of the gesture contained in the sample image;
an input module 82, configured to input the anchor sample image, the positive sample image, and the negative sample image into a gesture recognition model to be trained, and determine recognition category labels corresponding to the anchor sample image, the positive sample image, and the negative sample image, respectively;
an adjusting module 83, configured to determine a value of a loss function based on the sample class label and the identification class label, where the loss function includes a triplet loss function; and adjusting the parameters of the gesture recognition model to be trained according to the value of the loss function.
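As a non-authoritative sketch of one training update in the spirit of modules 81-83, assuming the model's outputs are used directly as embeddings for the triplet loss and an arbitrary margin:

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               anchor: torch.Tensor, positive: torch.Tensor,
               negative: torch.Tensor, margin: float = 1.0) -> float:
    """Forward the anchor, positive and negative sample images, compute the
    triplet loss, and adjust the model parameters from its value."""
    triplet = nn.TripletMarginLoss(margin=margin)  # margin is an assumed value
    optimizer.zero_grad()
    loss = triplet(model(anchor), model(positive), model(negative))
    loss.backward()
    optimizer.step()
    return loss.item()
```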
In a possible implementation manner, the obtaining module 81 is specifically configured to determine, according to a stored correspondence between a first gesture category and a second gesture category, a target second gesture category corresponding to a gesture category included in the positive sample image; and selecting a negative sample image from the sample images of the target second gesture category.
In a possible implementation manner, the obtaining module 81 is specifically configured to, if the number of the target second gesture categories is at least two, select a negative sample image from the sample images of the target second gesture category according to a set probability ratio that the sample images of the target second gesture category are selected as the negative sample images.
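A minimal sketch of this negative-sample selection follows; similar_map (the stored first-to-second gesture category correspondence), samples_by_category and ratio_map (the set probability ratios) are illustrative names, not the application's:

```python
import random
from typing import Dict, List

def select_negative_sample(positive_category: str,
                           similar_map: Dict[str, List[str]],
                           samples_by_category: Dict[str, List[str]],
                           ratio_map: Dict[str, float]) -> str:
    """Find the target second gesture categories corresponding to the positive
    category; when there are at least two, pick one according to the set
    probability ratio, then draw a negative sample image from that category."""
    targets = similar_map[positive_category]
    if len(targets) >= 2:
        weights = [ratio_map[c] for c in targets]
        chosen = random.choices(targets, weights=weights, k=1)[0]
    else:
        chosen = targets[0]
    return random.choice(samples_by_category[chosen])
```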
In a possible implementation, the adjusting module 83 is specifically configured to determine the values of the triple loss function and the mean square error loss function based on the sample class label and the identification class label;
and determining the value of the loss function according to the value of the triple loss function, a first weight value corresponding to a preset triple loss function, the value of the mean square error loss function and a second weight value corresponding to a preset mean square error loss function.
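The weighted combination might be sketched as follows; treating the recognition output as logits compared against one-hot sample category labels for the mean square error term is our assumption:

```python
import torch
import torch.nn.functional as F

triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)  # margin assumed

def combined_loss(anchor_emb: torch.Tensor, positive_emb: torch.Tensor,
                  negative_emb: torch.Tensor, logits: torch.Tensor,
                  one_hot_labels: torch.Tensor,
                  first_weight: float, second_weight: float) -> torch.Tensor:
    """Value of the loss function: the preset first weight times the triplet
    loss plus the preset second weight times the mean square error loss."""
    l_tri = triplet_loss(anchor_emb, positive_emb, negative_emb)
    l_mse = F.mse_loss(logits, one_hot_labels)
    return first_weight * l_tri + second_weight * l_mse
```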
Based on the same technical concept, the present application further provides a gesture recognition apparatus. Fig. 9 shows a schematic diagram of a gesture recognition apparatus provided in some embodiments; the apparatus includes:
the selection module 91 is configured to select a set number of video frames including a current video frame, sequentially input the set number of video frames into a gesture recognition model which is trained in advance, and respectively determine candidate categories of gestures included in the set number of video frames through the gesture recognition model;
a determining module 92, configured to determine whether the number of video frames in any candidate category in the set number of video frames is not less than a set number threshold, and if so, determine the candidate category as a target category of a gesture included in the current video frame.
In a possible implementation manner, the selecting module 91 is specifically configured to sequentially select a set number of video frames including a current video frame according to a set time step interval.
Based on the same technical concept, the present application further provides an electronic device. Fig. 10 shows a schematic structural diagram of an electronic device provided by some embodiments; as shown in fig. 10, the electronic device includes a processor 101, a communication interface 102, a memory 103 and a communication bus 104, wherein the processor 101, the communication interface 102 and the memory 103 communicate with one another through the communication bus 104;
the memory 103 has stored therein a computer program which, when executed by the processor 101, causes the processor 101 to perform the steps of:
aiming at any anchor sample image in a sample set, acquiring a positive sample image and a negative sample image of the anchor sample image, wherein the gesture category contained in the positive sample image is the same as the gesture category contained in the anchor sample image, and the gesture category contained in the negative sample image is different from the gesture category contained in the anchor sample image; any sample image in the sample set corresponds to an annotated sample category label, and the sample category label is used for identifying the sample category of the gesture contained in the sample image;
respectively inputting the anchor sample image, the positive sample image and the negative sample image into a gesture recognition model to be trained, and respectively determining recognition class labels corresponding to the anchor sample image, the positive sample image and the negative sample image;
determining a value of a loss function based on the sample class label and the identification class label, wherein the loss function comprises a triplet loss function; and adjusting the parameters of the gesture recognition model to be trained according to the value of the loss function.
In a possible implementation manner, the processor 101 is specifically configured to determine, according to a stored correspondence between a first gesture category and a second gesture category, a target second gesture category corresponding to a gesture category included in the positive sample image; and selecting a negative sample image from the sample images of the target second gesture category.
In a possible implementation manner, the processor 101 is specifically configured to, if the number of the target second gesture categories is at least two, select a negative sample image from the sample images of the target second gesture category according to a set probability ratio that the sample images of the target second gesture category are selected as the negative sample images.
In a possible implementation, the processor 101 is specifically configured to determine the values of the triple loss function and the mean square error loss function based on the sample class label and the identification class label;
and determining the value of the loss function according to the value of the triple loss function, a first weight value corresponding to a preset triple loss function, the value of the mean square error loss function and a second weight value corresponding to a preset mean square error loss function.
Because the electronic device solves problems on a principle similar to that of the gesture recognition model training method, the implementation of the electronic device may refer to the implementation of the gesture recognition model training method, and repeated parts are not described again.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 102 is used for communication between the above-described electronic device and other devices.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a central processing unit, a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
Based on the same technical concept, the present application further provides an electronic device. Fig. 11 shows another schematic structural diagram of an electronic device provided in some embodiments; as shown in fig. 11, the electronic device includes a processor 111, a communication interface 112, a memory 113 and a communication bus 114, wherein the processor 111, the communication interface 112 and the memory 113 communicate with one another through the communication bus 114;
the memory 113 has stored therein a computer program which, when executed by the processor 111, causes the processor 111 to perform the steps of:
selecting a set number of video frames including a current video frame, sequentially inputting the set number of video frames into a gesture recognition model which is trained in advance, and respectively determining candidate categories of gestures contained in the set number of video frames through the gesture recognition model;
and judging whether the number of the video frames of any candidate category in the set number of video frames is not less than a set number threshold, and if so, determining the candidate category as a target category of the gesture contained in the current video frame.
In a possible embodiment, the processor is specifically configured to sequentially select a set number of video frames including the current video frame according to a set time step interval.
Because the electronic device solves problems on a principle similar to that of the gesture recognition method, the implementation of the electronic device may refer to the implementation of the gesture recognition method, and repeated details are not described again.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this is not intended to represent only one bus or type of bus.
The communication interface 112 is used for communication between the above-described electronic apparatus and other apparatuses.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Alternatively, the memory may be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a central processing unit, a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc.
Based on the same technical concept, the present application provides a computer-readable storage medium having stored therein a computer program executable by an electronic device, the program, when executed on the electronic device, causing the electronic device to perform the following steps:
aiming at any anchor sample image in a sample set, acquiring a positive sample image and a negative sample image of the anchor sample image, wherein the gesture category contained in the positive sample image is the same as the gesture category contained in the anchor sample image, and the gesture category contained in the negative sample image is different from the gesture category contained in the anchor sample image; any sample image in the sample set corresponds to an annotated sample category label, and the sample category label is used for identifying the sample category of the gesture contained in the sample image;
respectively inputting the anchor sample image, the positive sample image and the negative sample image into a gesture recognition model to be trained, and respectively determining recognition class labels corresponding to the anchor sample image, the positive sample image and the negative sample image;
determining a value of a loss function based on the sample class label and the identification class label, wherein the loss function comprises a triplet loss function; and adjusting the parameters of the gesture recognition model to be trained according to the value of the loss function.
In one possible embodiment, the process of acquiring the negative sample image includes:
determining a target second gesture category corresponding to the gesture category contained in the positive sample image according to the stored corresponding relation between the first gesture category and the second gesture category; and selecting a negative sample image from the sample images of the target second gesture category.
In a possible implementation manner, the selecting a negative sample image from the sample images of the target second gesture category includes:
and if the number of the target second gesture categories is at least two, selecting a negative sample image from the sample images of the target second gesture categories according to the set probability proportion that the sample images of the target second gesture categories are selected as the negative sample images.
In one possible embodiment, the determining the value of the loss function based on the sample class label and the identification class label comprises:
the loss function comprises a triple loss function and a mean square error loss function, and the value of the triple loss function and the value of the mean square error loss function are respectively determined based on the sample class label and the identification class label;
and determining the value of the loss function according to the value of the triple loss function, a first weight value corresponding to a preset triple loss function, the value of the mean square error loss function and a second weight value corresponding to a preset mean square error loss function.
The computer readable storage medium may be any available medium or data storage device that can be accessed by a processor in an electronic device, including but not limited to magnetic memory such as floppy disks, hard disks, magnetic tapes and magneto-optical disks (MO), optical memory such as CDs, DVDs, BDs and HVDs, and semiconductor memory such as ROMs, EPROMs, EEPROMs, non-volatile memories (NAND FLASH) and Solid State Disks (SSDs).
Based on the same technical concept, the present application provides another computer-readable storage medium having stored therein a computer program executable by an electronic device, the program, when executed on the electronic device, causing the electronic device to perform the following steps:
selecting a set number of video frames including a current video frame, sequentially inputting the set number of video frames into a gesture recognition model which is trained in advance, and respectively determining candidate categories of gestures contained in the set number of video frames through the gesture recognition model;
and judging whether the number of the video frames of any candidate category in the set number of video frames is not less than a set number threshold, and if so, determining the candidate category as a target category of the gesture contained in the current video frame.
In a possible implementation, the selecting a set number of video frames including the current video frame includes:
and sequentially selecting a set number of video frames including the current video frame according to the set time step interval.
The computer readable storage medium may be any available medium or data storage device that can be accessed by a processor in an electronic device, including but not limited to magnetic memory such as floppy disks, hard disks, magnetic tapes and magneto-optical disks (MO), optical memory such as CDs, DVDs, BDs and HVDs, and semiconductor memory such as ROMs, EPROMs, EEPROMs, non-volatile memories (NAND FLASH) and Solid State Disks (SSDs).
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method for training a gesture recognition model, the method comprising:
aiming at any anchor sample image in a sample set, acquiring a positive sample image and a negative sample image of the anchor sample image, wherein the gesture category contained in the positive sample image is the same as the gesture category contained in the anchor sample image, and the gesture category contained in the negative sample image is different from the gesture category contained in the anchor sample image; any sample image in the sample set corresponds to an annotated sample category label, and the sample category label is used for identifying the sample category of the gesture contained in the sample image;
respectively inputting the anchor sample image, the positive sample image and the negative sample image into a gesture recognition model to be trained, and respectively determining recognition class labels corresponding to the anchor sample image, the positive sample image and the negative sample image;
determining a value of a loss function based on the sample class label and the identification class label, wherein the loss function comprises a triplet loss function; and adjusting the parameters of the gesture recognition model to be trained according to the value of the loss function.
2. The method of claim 1, wherein the process of acquiring the negative sample image comprises:
determining a target second gesture category corresponding to the gesture category contained in the positive sample image according to the stored corresponding relation between the first gesture category and the second gesture category; and selecting a negative sample image from the sample images of the target second gesture category.
3. The method of claim 2, wherein the selecting a negative sample image from the sample images of the target second gesture category comprises:
and if the number of the target second gesture categories is at least two, selecting a negative sample image from the sample images of the target second gesture categories according to the set probability proportion that the sample images of the target second gesture categories are selected as the negative sample images.
4. The method of claim 1, wherein determining a value of a loss function based on the sample class label and the identification class label comprises:
the loss function comprises a triple loss function and a mean square error loss function, and the value of the triple loss function and the value of the mean square error loss function are respectively determined based on the sample class label and the identification class label;
and determining the value of the loss function according to the value of the triple loss function, a first weight value corresponding to a preset triple loss function, the value of the mean square error loss function and a second weight value corresponding to a preset mean square error loss function.
5. A gesture recognition method based on the gesture recognition model training method of any one of claims 1-4, characterized in that the method comprises:
selecting a set number of video frames including a current video frame, sequentially inputting the set number of video frames into a gesture recognition model which is trained in advance, and respectively determining candidate categories of gestures contained in the set number of video frames through the gesture recognition model;
and judging whether the number of the video frames of any candidate category in the set number of video frames is not less than a set number threshold, and if so, determining the candidate category as a target category of the gesture contained in the current video frame.
6. The method of claim 5, wherein selecting a set number of video frames including the current video frame comprises:
and sequentially selecting a set number of video frames including the current video frame according to the set time step interval.
7. A gesture recognition model training apparatus, the apparatus comprising:
the acquisition module is used for acquiring a positive sample image and a negative sample image of any anchor sample image in a sample set, wherein the gesture category contained in the positive sample image is the same as the gesture category contained in the anchor sample image, and the gesture category contained in the negative sample image is different from the gesture category contained in the anchor sample image; any sample image in the sample set corresponds to an annotated sample category label, and the sample category label is used for identifying the sample category of the gesture contained in the sample image;
the input module is used for respectively inputting the anchor sample image, the positive sample image and the negative sample image into a gesture recognition model to be trained and respectively determining recognition type labels corresponding to the anchor sample image, the positive sample image and the negative sample image;
an adjustment module to determine a value of a loss function based on the sample class label and the identification class label, wherein the loss function comprises a triplet loss function; and adjusting the parameters of the gesture recognition model to be trained according to the value of the loss function.
8. A gesture recognition apparatus, the apparatus comprising:
the selection module is used for selecting a set number of video frames including the current video frame, sequentially inputting the set number of video frames into a gesture recognition model which is trained in advance, and respectively determining candidate categories of gestures contained in the set number of video frames through the gesture recognition model;
and the determining module is used for judging whether the number of the video frames of any candidate category in the set number of video frames is not less than a set number threshold, and if so, determining the candidate category as the target category of the gesture contained in the current video frame.
9. An electronic device, characterized in that the electronic device comprises at least a processor and a memory, the processor being configured to implement the steps of the gesture recognition model training method according to any of claims 1-4 when executing a computer program stored in the memory; or, implementing the steps of the gesture recognition method according to any of claims 5-6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when being executed by a processor, carries out the steps of the gesture recognition model training method according to any one of claims 1 to 4; or, implementing the steps of the gesture recognition method according to any of claims 5-6.
CN202110835534.6A 2021-07-23 2021-07-23 Gesture recognition model training method, gesture recognition device, gesture recognition equipment and gesture recognition medium Pending CN115690894A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110835534.6A 2021-07-23 2021-07-23 Gesture recognition model training method, gesture recognition device, gesture recognition equipment and gesture recognition medium

Publications (1)

Publication Number Publication Date
CN115690894A 2023-02-03

Family

ID=85044281

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110835534.6A Pending CN115690894A (en) 2021-07-23 2021-07-23 Gesture recognition model training method, gesture recognition device, gesture recognition equipment and gesture recognition medium

Country Status (1)

Country Link
CN (1) CN115690894A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination