CN115761892A - Gesture recognition model training method and device based on streaming image and electronic equipment

Gesture recognition model training method and device based on streaming image and electronic equipment

Info

Publication number
CN115761892A
CN115761892A
Authority
CN
China
Prior art keywords
gesture
recognition model
gesture recognition
training
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211486388.1A
Other languages
Chinese (zh)
Inventor
林垠
沈锦瑞
殷保才
胡金水
殷兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202211486388.1A
Publication of CN115761892A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a gesture recognition model training method and device based on streaming images, and an electronic device. The main concept is, on one hand, to perform streaming information reading, streaming feature extraction, and streaming result output on continuous frame images, so that the training process approximates the real application scenario, the mismatch between training and inference is eliminated, and training comes closer to actual application deployment; on the other hand, under this streaming base training mode, a first gesture recognition model and a second gesture recognition model are trained, with a preset mutual learning strategy established between them, so that the second gesture recognition model, which is finally deployed and relies only on historical image information, acquires the ability to predict future information. This improves the recognition effect of the gesture recognition model while keeping inference close to real time, and can effectively improve the usability of gesture interaction in human-computer interaction.

Description

Gesture recognition model training method and device based on streaming image and electronic equipment
Technical Field
The invention relates to the technical field of human-computer interaction, and in particular to a gesture recognition model training method and device based on streaming images, and an electronic device.
Background
Existing deep-learning-based gesture recognition model training strategies mainly take input in two forms: a single-frame image or an image sequence. Single-frame schemes treat gesture recognition directly as a single-image classification task. Image-sequence schemes divide the image sequence into independent gesture segments (including segments with no gesture) according to the annotated intervals, then perform feature extraction and temporal modeling on each segment, and finally predict the gesture category.
The most important problem with single-frame image modeling is that it cannot model the temporal relations of an image sequence, so gesture category information can only be extracted and predicted from a single image; dynamic gestures that depend on temporal information (such as waving left and right, or clockwise and counterclockwise rotation) cannot be modeled.
Image-sequence schemes mostly divide the image sequence into independent gesture segments according to the annotated intervals in advance, and then extract fixed-length features from the gesture information within each segment for modeling and prediction, which to some extent solves the single-frame schemes' inability to handle dynamic gestures. However, in a real gesture interaction scenario, a gesture instruction may be issued at any time and must be responded to in real time. This means, on one hand, that the model cannot know the start and end times of a gesture instruction in advance, and on the other hand, that the sampling mode of the real inference (application) process is necessarily dense sampling. This is the biggest problem with the above schemes: the training and testing (application) processes do not match.
Disclosure of Invention
In view of the foregoing, the present invention provides a gesture recognition model training method and device based on streaming images, and an electronic device, to solve the problem that current gesture recognition model training does not match the actual recognition application.
The technical scheme adopted by the invention is as follows:
In a first aspect, the present invention provides a gesture recognition model training method based on streaming images, wherein the method comprises:
sampling an image sequence containing gesture instructions to obtain N temporally continuous frames of images and their corresponding labels, and grouping the N frames of images into a proposal; wherein labels are assigned with a single-frame image as the minimum unit;
extracting an image feature sequence corresponding to the image sequence of each proposal, and performing gesture category prediction training based on the image feature sequence; the image feature sequence is N groups of equal-length features extracted continuously from the N frames of images;
based on the gesture category prediction training, performing base training on a preset first gesture recognition model and a preset second gesture recognition model, wherein the first gesture recognition model and the second gesture recognition model adopt a preset mutual learning supervision strategy; during training, the first gesture recognition model predicts the gesture category at the current moment from historical information and future information, and the second gesture recognition model predicts the gesture category at the current moment from historical information;
deploying the trained second gesture recognition model for actual prediction scenarios.
In at least one possible implementation manner, the performing gesture category prediction training based on the image feature sequence includes:
presetting an effective response duration T;
and selecting T continuous features from the image feature sequence to predict the gesture category at each current moment.
In at least one possible implementation manner, the predicting the gesture category at each current moment includes:
according to the time sequence and a preset step length, sequentially reading, with a sliding window, the feature information of the current frame corresponding to the current moment in the image feature sequence of each proposal, and constructing on that basis a feature segment of length T for gesture category judgment;
and inputting the feature segment into a preset classifier for gesture category judgment, as the gesture category response signal of the current moment within the sliding window.
In at least one possible implementation manner, the constructing, based on the feature information, a feature segment of length T for gesture category judgment includes: when the number of frames read by the sliding window is less than T, filling by zero padding.
In at least one possible implementation manner, the constructing, based on the feature information, a feature segment of length T for gesture category judgment includes:
storing the single-frame image feature corresponding to the current moment read by the sliding window, and fusing the single-frame image feature of the current moment with previously stored single-frame image features of historical moments.
In at least one possible implementation manner, the label assignment manner includes:
for the part of a proposal containing a complete gesture segment, multiplexing the preset gesture type information and assigning it to each frame of image as a training data label;
and for the part of a proposal containing an incomplete gesture segment, assigning a label characterized as an ignored item, which does not participate in model training.
In at least one possible implementation manner, the mutual learning supervision strategy includes:
modeling and outputting local features, via the second gesture recognition model, from the image features extracted from the image sequence; modeling and outputting global features via the first gesture recognition model; and applying feature distribution constraints to the output local features and global features.
In a second aspect, the present invention provides a gesture recognition model training apparatus based on streaming images, comprising:
an information streaming reading module, configured to sample an image sequence containing gesture instructions to obtain N temporally continuous frames of images and their corresponding labels, and group the N frames of images into a proposal; wherein labels are assigned with a single-frame image as the minimum unit;
a feature streaming extraction module, configured to extract an image feature sequence corresponding to the image sequence of each proposal and perform gesture category prediction training based on the image feature sequence; the image feature sequence is N groups of equal-length features extracted continuously from the N frames of images;
a dual-model mutual learning module, configured to perform, based on the gesture category prediction training, base training on a preset first gesture recognition model and a preset second gesture recognition model, the first gesture recognition model and the second gesture recognition model adopting a preset mutual learning supervision strategy; during training, the first gesture recognition model predicts the gesture category at the current moment from historical information and future information, and the second gesture recognition model predicts the gesture category at the current moment from historical information;
and a model deployment module, configured to deploy the trained second gesture recognition model for actual prediction scenarios.
In a third aspect, the present invention provides an electronic device, comprising:
one or more processors, a memory (which may employ a non-volatile storage medium), and one or more computer programs stored in the memory, the one or more computer programs comprising instructions which, when executed by the device, cause the device to perform the method of the first aspect or any possible implementation of the first aspect.
The main concept of the invention is, on one hand, to perform streaming information reading, streaming feature extraction, and streaming result output on continuous frame images (image sequences), so that the training process approximates the real application scenario, the mismatch between training and inference is eliminated, and training comes closer to actual application deployment; on the other hand, under this streaming base training mode, a first gesture recognition model and a second gesture recognition model are trained, with a preset mutual learning strategy established between them, so that the second gesture recognition model, which is finally deployed and relies only on historical image information, acquires the ability to predict future information. This improves the recognition effect of the gesture recognition model while keeping inference close to real time, and can effectively improve the usability of gesture interaction in human-computer interaction.
Drawings
To make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings, in which:
FIG. 1 is a flowchart of an embodiment of a method for training a gesture recognition model based on streaming images according to the present invention;
FIG. 2 is a schematic diagram of an embodiment of a gesture recognition model training apparatus based on streaming images according to the present invention;
FIG. 3 is a schematic diagram of an embodiment of an electronic device provided by the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative only and should not be construed as limiting the invention.
The present invention provides an embodiment of a gesture recognition model training method based on streaming images, as shown in FIG. 1, which specifically includes the following steps:
s1, sampling an image sequence containing a gesture instruction to obtain N frames of images with continuous time sequence and corresponding labels thereof, and integrating the N frames of images into a set of proposal; wherein the label is distributed with a single frame image as a minimum unit;
specifically, in the aspect of training data production, the method reads image sequence information (videos and continuous frame images) collected by image collection equipment including but not limited to an RGB camera, an infrared camera, a depth camera and the like in a sampling mode including but not limited to dense sampling and equal-interval sampling, and samples the image sequence according to a time sequence to obtain N frames of images and corresponding label information; at this point, the data reader may add the sampled N frames of images and their corresponding label information as a set of proposals (promosa l) to the model training.
For example, a certain image sequence sequentially includes a first instruction (gesture segment) of a first gesture, a second instruction of the first gesture, a third instruction of the first gesture, a first instruction of a second gesture, a second instruction of the second gesture, and so on.
For this streaming information, unlike other training modes that divide each individually annotated gesture segment into an independent proposal, the present invention directly takes the whole image sequence as training data and assigns labels to it. Considering the computational limits of the actual training process, N is taken as the proposal length; it should be understood that N must be larger than the maximum duration of any valid gesture segment. To illustrate the scheme, following the foregoing example one may take N = 64 and generate candidate proposals: the first proposal corresponds to the first gesture and spans the time interval t = 0 to t = 64 of the streaming image data; the second proposal corresponds to the second gesture and spans the time interval t = 60 to t = 124; and so on.
After the proposal time intervals are divided as described above, their labels also need to be assigned. Unlike other training modes that assign labels with an independent gesture segment as the minimum unit, the present invention assigns labels with a single-frame image as the minimum unit, to ensure that the prediction result is real-time. A specific label assignment manner is provided for reference (an illustrative sketch follows the list):
a. For the part of a proposal containing a complete gesture segment, the preset gesture type information (id) is multiplexed and assigned to each frame of image as the training data label. For example, the first instruction in the first proposal (the first gesture) is the starting stage (e.g., time interval 0-20); by default this belongs to a complete gesture segment and merely represents "no gesture", so it can be given type label 1, i.e., id = 1. The second instruction in the first proposal (e.g., time interval 20-36) may be given type label 2, i.e., id = 2; the third instruction (e.g., time interval 36-60) may be given type label 3, i.e., id = 3. Similarly, the label of the starting stage (e.g., time interval 60-120) of the first instruction of the second proposal (the second gesture) in the streaming data may be 1, and so on.
b. For the part of a proposal containing an incomplete gesture segment, its type label may be set to -100, i.e., characterized as an ignored item; image data carrying this label does not participate in the loss function and gradient computation during subsequent model training. Examples are the part of the first proposal with time interval 60-64 mentioned above, or the part of the second proposal with time interval 120-124.
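For illustration only, the following is a minimal sketch of the proposal construction and per-frame label assignment described above, assuming a Python implementation; the function name build_proposals, the fixed stride, and the IGNORE_ID constant are hypothetical conveniences, not part of the patent disclosure.

```python
# Illustrative sketch (not part of the patent disclosure): split a streaming
# label track into fixed-length proposals and assign per-frame labels.
# Assumed conventions: id 1 = "no gesture", IGNORE_ID marks ignored items.
from typing import List, Tuple

IGNORE_ID = -100  # frames of incomplete gesture segments skip loss/gradient

def build_proposals(frame_labels: List[int],
                    segments: List[Tuple[int, int, int]],
                    n: int = 64, stride: int = 60) -> List[List[int]]:
    """frame_labels: per-frame gesture id over the whole stream (1 = no gesture).
    segments: (start, end, id) of every annotated gesture segment.
    Returns one per-frame label list per proposal; frames of a gesture
    segment cut by a proposal boundary are re-labelled IGNORE_ID."""
    proposals = []
    total = len(frame_labels)
    for p_start in range(0, total, stride):
        p_end = min(p_start + n, total)
        labels = frame_labels[p_start:p_end]
        for s, e, _ in segments:
            # segment only partially inside this proposal -> ignore its frames
            if s < p_start < e or s < p_end < e:
                for t in range(max(s, p_start), min(e, p_end)):
                    labels[t - p_start] = IGNORE_ID
        proposals.append(labels)
    return proposals
```

With n = 64 and stride = 60 this reproduces the example intervals above: the first proposal spans t = 0-64, the second t = 60-124, and the frames of a gesture segment cut by a proposal boundary (e.g., t = 60-64 in the first proposal) are ignored.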
S2, extracting an image feature sequence corresponding to the image sequence of each proposal, and performing gesture category prediction training based on the image feature sequence; the image feature sequence is N groups of equal-length features extracted continuously from the N frames of images;
after the proposal read in the above steps, a spatio-temporal feature extraction network including, but not limited to, 2D-CNN, 3D-CNN, CNN-LSTN, RNN, and transform may be used to extract N sets of features of equal length for an image sequence of length N.
Specifically, assume the batch size of a single training pass is B, the height and width of the original image sequence are H and W, its number of channels is C, and the height, width, and number of channels of the feature map after network extraction are h, w, and c, respectively. The feature extraction process of the present invention can then be briefly described as follows: after the image sequence of dimension B × N × C × H × W passes through the temporal and spatial feature extraction network, a feature sequence of dimension B × N × c × h × w, i.e., the image feature sequence, is obtained.
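As a shape-level illustration of this step, the following sketch assumes a PyTorch implementation with a 2D-CNN backbone (ResNet-18 stands in for any of the networks listed above); note that the global average pooling inside this particular backbone collapses h × w, whereas a backbone returning B × N × c × h × w feature maps would be handled analogously.

```python
import torch
import torchvision.models as models

B, N, C, H, W = 2, 64, 3, 224, 224       # batch, frames, channels, height, width
frames = torch.randn(B, N, C, H, W)      # one batch of proposals

backbone = models.resnet18(weights=None)
backbone.fc = torch.nn.Identity()        # keep the c-dim feature, drop the classifier head

# Each frame is embedded exactly once: B x N x C x H x W -> (B*N) x c -> B x N x c
feats = backbone(frames.flatten(0, 1))   # shape (B*N, 512)
feat_seq = feats.view(B, N, -1)          # image feature sequence, here c = 512
print(feat_seq.shape)                    # torch.Size([2, 64, 512])
```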
In actual gesture interaction applications, the gesture recognition model is expected to judge (respond to) gesture instructions in near real time. To bring the training process close to the real application (inference) requirements, some preferred training strategies of the present invention set an effective response duration T: on the basis of the N groups of pre-extracted image features, T continuous features are selected for gesture category prediction (T << N). The intent is to require the model to judge gesture category information accurately from only T features, which guarantees real-time behavior in actual use. On this basis, the gesture classification at each moment in the present invention needs only the T features nearest to the current moment.
To achieve this efficiently, the present invention preferably adopts a "feature sliding window" mechanism (the default step length may be preset to S = 1, characterizing one processing unit per moment): according to the time sequence and the preset step length, the feature information of T continuous frames in the image feature sequence of each proposal is read in turn and formed into a feature segment for gesture category judgment (zero padding may be used when fewer than T frames are available). Sliding a window over the originally extracted N continuous groups of features avoids repeatedly computing features for the same image, so each image in a group of image sequences needs feature extraction only once per iteration, greatly saving model computation, memory occupation, and training time.
Continuing this idea, the feature segment of length T is further fused using methods including but not limited to averaging, weighted averaging, and exponential moving averaging, and the fused feature is finally sent to a classifier for gesture category judgment, serving as the gesture category response signal of a certain moment in the current sliding window (preferably the last moment in the window, to guarantee real-time behavior). It can then be understood that with step length S = 1 there is a gesture category response (including the special category "no gesture") for every moment, i.e., a frame-by-frame prediction result is obtained (the dimension may be B × N × C, where C here denotes the number of gesture categories). Supervision and optimization may then adopt loss functions including but not limited to cross-entropy loss and mean-square-error loss, forming a streaming prediction output (gesture category predictions at all N moments), which is not described in detail here.
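The following is a minimal sketch of the feature sliding window described above, assuming a PyTorch implementation with step length S = 1, left-side zero padding, average fusion, and a linear classifier (the linear classifier and the class count are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def streaming_logits(feat_seq: torch.Tensor, classifier: torch.nn.Module,
                     T: int = 8) -> torch.Tensor:
    """feat_seq: B x N x c features, each frame extracted exactly once.
    Returns a gesture-class response for every moment (B x N x num_classes).
    Step length S = 1; windows shorter than T are zero-padded on the left,
    and each window's response is assigned to its last moment."""
    B, N, c = feat_seq.shape
    padded = F.pad(feat_seq, (0, 0, T - 1, 0))   # zero-pad T-1 frames before t = 0
    windows = padded.unfold(1, T, 1)             # B x N x c x T sliding windows
    fused = windows.mean(dim=-1)                 # average fusion over each window
    return classifier(fused)                     # B x N x num_classes

# Usage under assumed sizes: a linear classifier over the fused c-dim feature.
clf = torch.nn.Linear(512, 10)                   # 10 gesture classes (assumption)
logits = streaming_logits(torch.randn(2, 64, 512), clf, T=8)
targets = torch.randint(0, 10, (2 * 64,))        # per-frame labels; -100 would be skipped
loss = F.cross_entropy(logits.flatten(0, 1), targets, ignore_index=-100)
```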
For example, when predicting the gesture category information at moment t = 1, the image feature of frame t = 1 is sent to a feature memory, the remaining part short of T is zero-padded, and the result is sent to the classifier for classification as the gesture category information at moment t = 1. Similarly, at moment t = 2, the image feature of frame t = 2 is sent to the memory and fused with the historically stored image feature of frame t = 1 (the remaining part short of T is still zero-padded), and the fused feature is finally sent to the classifier for gesture category judgment as the gesture category information at moment t = 2. By analogy, at moment t = T + 1, the invention sends the image feature of frame t = T + 1 into the memory, clears the image feature of frame t = 1, fuses the T features of t = 2 through t = T + 1 existing in the memory, and sends them to the classifier for judgment as the gesture category information at moment t = T + 1; and so on until the t = N frames have been processed, i.e., gesture category prediction at all N moments is complete. With this design, the invention avoids repeatedly extracting features for a single frame image multiple times, and alleviates the response delay caused by an overly large N, since streaming data is collected and features are extracted in a streaming manner.
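A minimal sketch of the inference-time behavior just described, assuming a Python/PyTorch implementation; the class name FeatureMemory is hypothetical, and masking details beyond simple zero filling are omitted:

```python
from collections import deque
import torch

class FeatureMemory:
    """Illustrative inference-time memory: keeps at most T single-frame
    features, so pushing the t = T+1 feature evicts the t = 1 feature and
    every image is embedded only once."""
    def __init__(self, T: int, feat_dim: int):
        self.T, self.feat_dim = T, feat_dim
        self.buf = deque(maxlen=T)               # oldest feature drops out automatically

    def step(self, frame_feat: torch.Tensor) -> torch.Tensor:
        self.buf.append(frame_feat)              # store the current frame's feature
        feats = list(self.buf)
        while len(feats) < self.T:               # fewer than T frames so far:
            feats.insert(0, torch.zeros(self.feat_dim))   # fill the rest with zeros
        return torch.stack(feats).mean(dim=0)    # fused feature for the classifier

# memory = FeatureMemory(T=8, feat_dim=512)
# at each moment: fused = memory.step(frame_feat); logits = clf(fused)
```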
S3, based on the gesture category prediction training, performing base training on a preset first gesture recognition model and a preset second gesture recognition model, wherein the first gesture recognition model and the second gesture recognition model adopt a preset mutual learning supervision strategy; during training, the first gesture recognition model predicts the gesture category at the current moment from historical information and future information, and the second gesture recognition model predicts the gesture category at the current moment from historical information;
S4, deploying the trained second gesture recognition model for actual prediction scenarios.
To improve the performance of the gesture recognition model, the present invention proposes a mutual learning strategy, including but not limited to mutual learning between an offline model (the first gesture recognition model) and an online model (the second gesture recognition model). The offline model can see information of both "historical moments" and "future moments" during training (it can receive historical and future image sequences as input simultaneously), whereas the online model can only see information of "historical moments" during training (it can only receive historical image sequences as input). It should be noted that in practical applications only the online model needs to be deployed; the offline model is simply discarded. The purpose of this mutual learning strategy is to let the online model (i.e., the gesture recognition model actually used at test time) acquire the ability to predict future information without seeing it. Experiments prove that, for models with high real-time requirements such as gesture recognition, the proposed training strategy brings an obvious performance improvement (introduced later with verification data).
For convenience of description, the aforementioned mutual learning strategy may, for example but not limited to, be a scheme of mutual learning between a large teacher model and a small student model; the terms offline model and online model are still used here. In implementing the present invention, the online and offline models may use the same backbone feature extraction model, including but not limited to mainstream deep convolutional neural networks such as VGG, ResNet, DenseNet, and MobileNet.
As mentioned above, the most important difference between the two mutually learning models is that the offline model can see both "historical moment" and "future moment" information during training: assuming the current moment is t = T, the offline model can rely simultaneously on information from moments t = 1 to t = T - 1 (historical information) and from moments t = T + 1 to t = N (future information) to judge the gesture category at the current moment, whereas the online model attends only to "historical moment" information during training. For temporal modeling of features, the present invention may select a temporal feature modeling unit including but not limited to CNN, RNN, GRU, LSTM, and Bi-LSTM, or use a spatial feature extractor; together these may be described as a spatio-temporal feature modeling unit (i.e., the aforementioned spatio-temporal feature extraction network).
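As an illustration of this asymmetry, the sketch below contrasts a unidirectional LSTM (online branch, history only) with a Bi-LSTM (offline branch, history plus future), both drawn from the modeling units listed above; the hidden sizes are assumptions chosen so that the two branches output features of equal dimension:

```python
import torch

c = 512
# Online branch: unidirectional LSTM, the output at moment t sees only t' <= t
online_rnn = torch.nn.LSTM(input_size=c, hidden_size=c, batch_first=True)
# Offline branch: Bi-LSTM, the output at moment t also sees t' > t
offline_rnn = torch.nn.LSTM(input_size=c, hidden_size=c // 2,
                            batch_first=True, bidirectional=True)

feat_seq = torch.randn(2, 64, c)          # B x N x c image feature sequence
local_feats, _ = online_rnn(feat_seq)     # "local features":  history only
global_feats, _ = offline_rnn(feat_seq)   # "global features": history + future
assert local_feats.shape == global_feats.shape == (2, 64, c)
```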
Regarding the mutual learning supervision strategy: the image features extracted from the image sequence are modeled and output as "local features" via the online branch (the second gesture recognition model) and as "global features" via the offline branch (the first gesture recognition model). Feature distribution constraints are applied to the two groups of features output by the two branches, including: an "entropy minimum constraint" via the loss function added to each branch, which guarantees classification accuracy; and a "consistency constraint" applied across the two groups of features, for which constraint schemes including but not limited to KL loss, L1 loss, and L2 loss may be adopted. This strategy enables the online model (i.e., the final deployed model) to perceive and predict future information despite "never seeing the future", further improving the performance of the gesture recognition model.
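A minimal sketch of this supervision, assuming a PyTorch implementation with cross-entropy as the entropy minimum constraint and a symmetric KL term as the consistency constraint (the weight alpha is an assumption; masking of ignored frames in the KL term is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(online_logits: torch.Tensor,
                         offline_logits: torch.Tensor,
                         labels: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Per-branch cross entropy ("entropy minimum constraint") plus a
    symmetric KL term tying the two output distributions together
    ("consistency constraint"); alpha weights the consistency term."""
    ce = (F.cross_entropy(online_logits, labels, ignore_index=-100)
          + F.cross_entropy(offline_logits, labels, ignore_index=-100))
    log_p_on = F.log_softmax(online_logits, dim=-1)
    log_p_off = F.log_softmax(offline_logits, dim=-1)
    kl = (F.kl_div(log_p_on, log_p_off, log_target=True, reduction='batchmean')
          + F.kl_div(log_p_off, log_p_on, log_target=True, reduction='batchmean'))
    return ce + alpha * kl
```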
Compared with the training effect of the traditional approach of dividing an image sequence into independent gesture segments by annotated intervals, the category response curve obtained with the proposed streaming training strategy is cleaner in the gesture instruction response region. This is mainly reflected in a more distinct jump of the response curve for the labeled (i.e., true) gesture category at the start and end moments of a gesture instruction, a wider and higher-peaked response region, and obvious suppression of the response regions of the remaining gestures.
In addition, on top of the streaming training strategy, adding the online-offline model mutual learning strategy markedly improves model performance. In the table below, the first row gives the recall, precision, and F1-score of a model trained with the traditional scheme; the second and third rows give the results of two models trained with, respectively, the basic streaming training strategy alone and the basic streaming strategy plus the online-offline mutual learning strategy. The results show that adding the basic streaming training strategy clearly improves precision and F1-score, consistent with the observations above, and adding the mutual learning strategy on that basis improves the metrics further. The proposed training strategy therefore effectively improves the performance of the gesture recognition model.
Model                                                    Recall    Precision    F1-score
Conventional scheme                                      0.9433    0.7562       0.8394
Basic streaming training strategy (present invention)    0.9412    0.8304       0.8823
Basic streaming + mutual learning (present invention)    0.9377    0.8589       0.8966
In summary, the main idea of the present invention is, on one hand, to perform streaming information reading, streaming feature extraction, and streaming result output on continuous frame images (image sequences), so that the training process approximates the real application scenario, the mismatch between training and inference is eliminated, and training comes closer to actual application deployment; on the other hand, under this streaming base training mode, a first gesture recognition model and a second gesture recognition model are trained, with a preset mutual learning strategy established between them, so that the second gesture recognition model, which is finally deployed and relies only on historical image information, acquires the ability to predict future information. This improves the recognition effect of the gesture recognition model while keeping inference close to real time, and can effectively improve the usability of gesture interaction in human-computer interaction.
Corresponding to the above embodiments and preferred solutions, the present invention further provides an embodiment of a gesture recognition model training apparatus based on streaming images; as shown in FIG. 2, the apparatus may specifically include the following components:
the information flow type reading module 1 is used for sampling an image sequence containing a gesture instruction to obtain N frames of images with continuous time sequence and corresponding labels thereof, and integrating the N frames of images into a set of proposal; wherein the label is distributed with a single frame image as a minimum unit;
the feature stream type extraction module 2 is used for extracting an image feature sequence corresponding to the image sequence of each set of proposal, and performing gesture class prediction training based on the image feature sequence; the image feature sequence is N groups of features with equal length, which are continuously extracted from N frames of images;
the double-model mutual learning module 3 is used for performing prediction training according to the gesture categories, performing basic training on a preset first gesture recognition model and a preset second gesture recognition model, and adopting a preset mutual learning supervision strategy for the first gesture recognition model and the second gesture recognition model; the first gesture recognition model predicts the gesture category at the current moment through historical information and future information in the training process; the second gesture recognition model predicts the gesture category at the current moment through historical information in the training process;
and the model deployment module 4 is used for deploying the trained second gesture recognition model for actual prediction scenes.
It should be understood that the division of components in the streaming-image-based gesture recognition model training apparatus shown in FIG. 2 is merely a division of logical functions; in actual implementation they may be wholly or partially integrated into one physical entity, or physically separated. These components may all be implemented as software invoked by a processing element, entirely as hardware, or partly as software invoked by a processing element and partly as hardware. For example, a module may be a separately established processing element or may be integrated into a chip of the electronic device; the other components are implemented similarly. In addition, all or some of the components can be integrated together or implemented independently. In implementation, each step of the above method, or each of the above components, may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above components may be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). As another example, these components may be integrated together and implemented in the form of a system-on-a-chip (SoC).
In view of the foregoing examples and their preferred embodiments, those skilled in the art will appreciate that, in practice, the technical idea underlying the present invention may be applied in a variety of embodiments; the present invention is schematically illustrated by the following carriers:
(1) An electronic device is provided. The device may specifically include: one or more processors, memory, and one or more computer programs stored in the memory, the one or more computer programs comprising instructions which, when executed by the apparatus, cause the apparatus to perform the steps/functions of the foregoing embodiments or equivalent implementations.
The electronic device may specifically be a computer-related electronic device, such as but not limited to various interactive terminals and electronic products, a mobile terminal, and the like.
FIG. 3 is a schematic structural diagram of an embodiment of an electronic device provided by the present invention; specifically, the electronic device 900 includes a processor 910 and a memory 930. The processor 910 and the memory 930 can communicate with each other and transfer control and/or data signals through an internal connection path; the memory 930 is used for storing a computer program, and the processor 910 is used for calling and running the computer program from the memory 930. The processor 910 and the memory 930 may be combined into a single processing device, or more generally be separate components, with the processor 910 configured to execute the program code stored in the memory 930 to implement the functions described above. In specific implementations, the memory 930 may be integrated with the processor 910 or separate from it.
In addition, to further enhance the functionality of the electronic device 900, the device 900 may further include one or more of an input unit 960, a display unit 970, an audio circuit 980, a camera 990, a sensor 901, and the like, where the audio circuit may further include a speaker 982, a microphone 984, and so on, and the display unit 970 may include a display screen.
Further, the apparatus 900 may also include a power supply 950 for providing power to various devices or circuits within the apparatus 900.
It should be understood that the operations and/or functions of the various components of the apparatus 900 may be referred to in detail in the foregoing description of the embodiments of the method, system, etc., and the detailed description is omitted here where appropriate to avoid repetition.
It should be understood that the processor 910 in the electronic device 900 shown in FIG. 3 may be a system-on-chip (SoC), and the processor 910 may include a central processing unit (CPU) as well as other types of processors, such as a graphics processing unit (GPU), as described further below.
In summary, various portions of the processors or processing units within the processor 910 may cooperate to implement the foregoing method flows, and corresponding software programs for the various portions of the processors or processing units may be stored in the memory 930.
(2) A computer data storage medium having stored thereon a computer program or the above apparatus which, when executed, causes a computer to perform the steps/functions of the preceding embodiments or equivalent implementations.
In the several embodiments provided by the present invention, any function, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-accessible data storage medium. Based on this understanding, some aspects of the present invention, or the portions thereof that substantially contribute over the prior art, may be embodied in the form of a software product, as described below.
In particular, it should be noted that the storage medium may reside in a server or similar computer device; specifically, the aforementioned computer program or apparatus is stored in a storage device of that server or similar computer device.
(3) A computer program product (which may include the above apparatus) which, when run on a terminal device, causes the terminal device to execute the gesture recognition model training method based on streaming images of the foregoing embodiments or their equivalent implementations.
From the above description of the embodiments, it is clear to those skilled in the art that all or part of the steps of the above implementation method can be completed by software plus a necessary general hardware platform. With this understanding, the above computer program product may include, but is not limited to, an app.
In the foregoing, the device/terminal may be a computer device, and the hardware structure of the computer device may further specifically include: at least one processor, at least one communication interface, at least one memory, and at least one communication bus; the processor, communication interface, and memory communicate with one another through the communication bus. The processor may be a central processing unit (CPU), a DSP, a microcontroller, or a digital signal processor, and may further include a GPU, an embedded neural-network processing unit (NPU), and an image signal processor (ISP); it may further include an ASIC, or one or more integrated circuits configured to implement the embodiments of the present invention. The processor may run one or more software programs, which may be stored in a storage medium such as the memory. The aforementioned memory/storage medium may include non-volatile memories such as non-removable magnetic disks, USB flash drives, removable hard disks, and optical disks, as well as read-only memory (ROM), random access memory (RAM), and the like.
In the embodiments of the present invention, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, and indicates that three relationships may exist, for example, a and/or B, and may indicate that a exists alone, a and B exist simultaneously, and B exists alone. Wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" and similar expressions refer to any combination of these items, including any combination of singular or plural items. For example, at least one of a, b, and c may represent: a, b, c, a and b, a and c, b and c or a and b and c, wherein a, b and c can be single or multiple.
Those skilled in the art will appreciate that the various modules, elements, and method steps described in the embodiments disclosed in this specification can be implemented as electronic hardware, computer software, or a combination of the two. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Moreover, modules and units described herein as separate components may or may not be physically separate, i.e., they may be located in one place or distributed across multiple places, e.g., nodes of a system network. Some or all of the modules and units can be selected according to actual needs to achieve the purpose of the above embodiments; those skilled in the art can understand and implement this without inventive effort.
The structure, features, and effects of the present invention have been described in detail with reference to the embodiments shown in the drawings, but the above embodiments are only preferred embodiments of the present invention. The technical features of the above embodiments and their preferred modes can be reasonably combined and configured into various equivalent schemes by those skilled in the art without departing from or changing the design idea and technical effects of the present invention; therefore, the invention is not limited to the specific embodiments shown in the drawings, and all changes made or equivalent embodiments modified without departing from the spirit and scope of the invention are intended to be covered by the specification and drawings.

Claims (10)

1. A gesture recognition model training method based on streaming images, characterized by comprising:
sampling an image sequence containing gesture instructions to obtain N temporally continuous frames of images and their corresponding labels, and grouping the N frames of images into a proposal; wherein labels are assigned with a single-frame image as the minimum unit;
extracting an image feature sequence corresponding to the image sequence of each proposal, and performing gesture category prediction training based on the image feature sequence; the image feature sequence is N groups of equal-length features extracted continuously from the N frames of images;
based on the gesture category prediction training, performing base training on a preset first gesture recognition model and a preset second gesture recognition model, wherein the first gesture recognition model and the second gesture recognition model adopt a preset mutual learning supervision strategy; during training, the first gesture recognition model predicts the gesture category at the current moment from historical information and future information, and the second gesture recognition model predicts the gesture category at the current moment from historical information;
deploying the trained second gesture recognition model for actual prediction scenarios.
2. The gesture recognition model training method based on streaming images according to claim 1, wherein the performing gesture category prediction training based on the image feature sequence comprises:
presetting an effective response duration T;
and selecting T continuous features from the image feature sequence to predict the gesture category at each current moment.
3. The gesture recognition model training method based on streaming images according to claim 2, wherein the predicting the gesture category at each current moment comprises:
according to the time sequence and a preset step length, sequentially reading, with a sliding window, the feature information of the current frame corresponding to the current moment in the image feature sequence of each proposal, and constructing on that basis a feature segment of length T for gesture category judgment;
and inputting the feature segment into a preset classifier for gesture category judgment, as the gesture category response signal of the current moment within the sliding window.
4. The gesture recognition model training method based on streaming images according to claim 3, wherein the constructing, based on the feature information, a feature segment of length T for gesture category judgment comprises: when the number of frames read by the sliding window is less than T, filling by zero padding.
5. The gesture recognition model training method based on streaming images according to claim 3, wherein the constructing, based on the feature information, a feature segment of length T for gesture category judgment comprises:
storing the single-frame image feature corresponding to the current moment read by the sliding window, and fusing the single-frame image feature of the current moment with previously stored single-frame image features of historical moments.
6. The gesture recognition model training method based on streaming images according to claim 3, wherein the label assignment manner comprises:
for the part of a proposal containing a complete gesture segment, multiplexing the preset gesture type information and assigning it to each frame of image as a training data label;
and for the part of a proposal containing an incomplete gesture segment, assigning a label characterized as an ignored item, which does not participate in model training.
7. The gesture recognition model training method based on streaming images according to any one of claims 1 to 6, wherein the mutual learning supervision strategy comprises:
modeling and outputting local features, via the second gesture recognition model, from the image features extracted from the image sequence; modeling and outputting global features via the first gesture recognition model; and applying feature distribution constraints to the output local features and global features.
8. A gesture recognition model training apparatus based on streaming images, characterized by comprising:
an information streaming reading module, configured to sample an image sequence containing gesture instructions to obtain N temporally continuous frames of images and their corresponding labels, and group the N frames of images into a proposal; wherein labels are assigned with a single-frame image as the minimum unit;
a feature streaming extraction module, configured to extract an image feature sequence corresponding to the image sequence of each proposal and perform gesture category prediction training based on the image feature sequence; the image feature sequence is N groups of equal-length features extracted continuously from the N frames of images;
a dual-model mutual learning module, configured to perform, based on the gesture category prediction training, base training on a preset first gesture recognition model and a preset second gesture recognition model, the first gesture recognition model and the second gesture recognition model adopting a preset mutual learning supervision strategy; during training, the first gesture recognition model predicts the gesture category at the current moment from historical information and future information, and the second gesture recognition model predicts the gesture category at the current moment from historical information;
and a model deployment module, configured to deploy the trained second gesture recognition model for actual prediction scenarios.
9. An electronic device, comprising:
one or more processors, a memory, and one or more computer programs stored in the memory, the one or more computer programs comprising instructions which, when executed by the electronic device, cause the electronic device to perform the gesture recognition model training method based on streaming images according to any one of claims 1 to 7.
10. A computer data storage medium, wherein the computer data storage medium stores a computer program which, when run on a computer, causes the computer to execute the gesture recognition model training method based on streaming images according to any one of claims 1 to 7.
CN202211486388.1A 2022-11-24 2022-11-24 Gesture recognition model training method and device based on streaming image and electronic equipment Pending CN115761892A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211486388.1A CN115761892A (en) 2022-11-24 2022-11-24 Gesture recognition model training method and device based on streaming image and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211486388.1A CN115761892A (en) 2022-11-24 2022-11-24 Gesture recognition model training method and device based on streaming image and electronic equipment

Publications (1)

Publication Number Publication Date
CN115761892A true CN115761892A (en) 2023-03-07

Family

ID=85337534

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211486388.1A Pending CN115761892A (en) 2022-11-24 2022-11-24 Gesture recognition model training method and device based on streaming image and electronic equipment

Country Status (1)

Country Link
CN (1) CN115761892A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117576784A (en) * 2024-01-15 2024-02-20 吉林大学 Method and system for recognizing diver gesture by fusing event and RGB data
CN117576784B (en) * 2024-01-15 2024-03-26 吉林大学 Method and system for recognizing diver gesture by fusing event and RGB data


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination