CN117373130A - Behavior recognition-based hand direction recognition method, electronic device and storage medium - Google Patents
- Publication number
- CN117373130A (application CN202311423880.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- identified
- hand direction
- hand
- window
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06V10/761—Proximity, similarity or dissimilarity measures
- G06V10/764—Image or video recognition or understanding using classification, e.g. of video objects
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/82—Image or video recognition or understanding using neural networks
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The embodiments of this application relate to the technical field of behavior recognition and disclose a hand direction recognition method based on behavior recognition, an electronic device, and a storage medium. The method comprises the following steps: acquiring a plurality of handicraft-making videos as sample videos; dividing each sample video into a plurality of video segments, calculating the similarity between key frames extracted from the video segments, and storing the video segments whose key frames have a similarity below a first threshold in the same folder; extracting the feature vectors of all video segments in a folder, calculating a first average value, labeling each video segment in the folder with that average, and training a preset recognition model on the video segments and their labels; and inputting a video to be identified into the trained recognition model to obtain the hand directions in it and display them visually in the video, so that users can learn intangible cultural heritage handicraft skills more carefully and conveniently and those skills continue to be passed on.
Description
Technical Field
The embodiment of the application relates to the technical field of behavior recognition, in particular to a hand direction recognition method based on behavior recognition, electronic equipment and a storage medium.
Background
With the rapid development of the mobile internet, logging in to short-video platforms has become part of everyday life. As more and more inheritors of intangible cultural heritage settle on short-video platforms, these platforms have become a new channel for heritage transmission, and more and more inheritors and makers of traditional handicraft skills choose to present and spread their craft there. In passing on these skills, the direction and technique of the hand movements are crucial; yet traditional transmission relies on in-person demonstration by a master, with apprentices learning by watching and imitating. This process is time-consuming and inefficient, its effect is limited by the apprentices' powers of observation and understanding, and misunderstandings can occur.
With the development of modern technology, people have gradually realized that combining intangible cultural heritage with modern technology can effectively promote its transmission and revitalization, and gesture recognition is an important practice in this combination. Gesture recognition converts human hand movements into computer input and can be implemented with sensors, but sensor-based approaches suffer from low recognition accuracy and poor real-time performance, so their effect on heritage transmission has been unsatisfactory.
Disclosure of Invention
The embodiments of this application aim to provide a hand direction recognition method based on behavior recognition, an electronic device, and a storage medium that can accurately recognize hand directions in video in real time and visualize the recognized directions and movements, so that users can learn intangible heritage handicraft skills more carefully and conveniently and those skills continue to be passed on.
To solve the above technical problem, an embodiment of the present application provides a hand direction recognition method based on behavior recognition, comprising the following steps: acquiring, as sample videos, a plurality of handicraft-making videos covering different handicraft categories and different craftspeople; dividing each sample video into a plurality of video segments, extracting key frames from the video segments, calculating the similarity between the key frames, and storing the video segments whose key frames have a similarity below a first threshold in the same folder; extracting the feature vectors of all video segments in a folder, calculating a first average value, labeling each video segment in the folder with that average, and training a preset recognition model on the video segments and their labels; and inputting a video to be identified into the trained recognition model to obtain the hand direction in it and display it visually in the video.
An embodiment of the present application also provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, the memory storing instructions executable by the at least one processor to enable it to perform the behavior-recognition-based hand direction recognition method described above.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the above-described behavior recognition-based hand direction recognition method.
With the hand direction recognition method, electronic device, and storage medium based on behavior recognition provided by the embodiments of this application, a recognition model built on a C3D network is trained. A C3D network can learn and model temporal information, so a recognition model built on it achieves higher recognition accuracy than a traditional two-dimensional convolutional neural network. When selecting training samples, handicraft-making videos of different handicraft categories and different craftspeople are crawled as sample videos, which improves the generalization ability and universality of the trained model and makes it applicable to different recognition scenarios. Visualizing the hand directions and movements recognized in the video to be identified lets users see clearly, carefully, and accurately the detailed steps an intangible-heritage craftsperson performs when making a handicraft, so they can learn and make the craft more easily, keep doing so with a sense of achievement and satisfaction, and in turn become new makers and propagators of the craft, keeping the heritage alive.
In some alternative embodiments, acquiring the handicraft-making videos of a plurality of different handicraft categories as sample videos includes: preprocessing the acquired handicraft-making videos of different categories and using the preprocessed videos as the sample videos. The preprocessing at least comprises: denoising and smoothing each handicraft video; normalizing the frame size of each video so that all normalized videos share the same frame size; and extracting three-dimensional coordinate data of the hand movements in each video and aligning them in time. Preprocessing removes interference from the sample videos and converts the information of every sample into a unified system, which facilitates training of the recognition model and improves the accuracy and stability of the trained model's recognition.
In some alternative embodiments, calculating the similarity between the key frames includes: for a first key frame and a second key frame, establishing a first window and a second window of the same size at the same initial position in the two frames; calculating a local structural similarity index (SSIM) from the pixel values of the pixels in the first window and the pixel values of the pixels in the second window; sliding the two windows synchronously by a preset step and calculating each local SSIM; and averaging all the calculated local SSIMs to obtain a global SSIM, which is taken as the similarity between the first and second key frames. The SSIM index measures the similarity between two images well; however, because hand movements are subtle and a frame contains many regions, directly calculating a single global SSIM is inaccurate and dilutes the differences in some local regions. This scheme therefore calculates the local SSIMs first and averages them into a global SSIM, which determines the similarity between key frames more reliably.
In some alternative embodiments, the local SSIM is calculated from the pixel values of the pixels in the first window and the pixel values of the pixels in the second window by the following formula:

SSIM(x, y) = ((2μ_x μ_y + c₁)(2σ_xy + c₂)) / ((μ_x² + μ_y² + c₁)(σ_x² + σ_y² + c₂))

c₁ = (k₁L)², c₂ = (k₂L)², k₁ = 0.01, k₂ = 0.03

where x denotes the first key frame, y the second key frame, μ_x the mean of the pixel values of the pixels in the first window, μ_y the mean of the pixel values of the pixels in the second window, σ_x² the variance of the pixel values of the pixels in the first window, σ_y² the variance of the pixel values of the pixels in the second window, σ_xy the covariance between the pixel values of the pixels in the first window and those in the second window, and L the dynamic range of the pixel values.
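The windowed SSIM averaging described above can be sketched as follows; the window size, stride, and 8-bit dynamic range are illustrative choices, not values fixed by the patent:

```python
import numpy as np

K1, K2, L_RANGE = 0.01, 0.03, 255  # constants from the formula above


def local_ssim(wx, wy):
    """SSIM of two equally sized grayscale windows, per the formula above."""
    c1, c2 = (K1 * L_RANGE) ** 2, (K2 * L_RANGE) ** 2
    mu_x, mu_y = wx.mean(), wy.mean()
    var_x, var_y = wx.var(), wy.var()
    cov_xy = ((wx - mu_x) * (wy - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


def global_ssim(x, y, win=8, step=4):
    """Slide both windows synchronously and average the local SSIMs."""
    h, w = x.shape
    vals = [local_ssim(x[i:i + win, j:j + win], y[i:i + win, j:j + win])
            for i in range(0, h - win + 1, step)
            for j in range(0, w - win + 1, step)]
    return float(np.mean(vals))
```

Two identical key frames yield a global SSIM of 1.0, and noise or motion between frames lowers it; segments whose key frames score below the first threshold would then be grouped into the same folder.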
In some optional embodiments, extracting the feature vectors of the video segments in a folder, calculating a first average value, and labeling each video segment in the folder with it includes: extracting the feature vectors of all video segments in the folder and calculating their first average value, which is taken as the representative vector of the folder; calculating the distance between the representative vector and five preset standard vectors; labeling each video segment in the folder with the standard vector closest to the representative vector as its first label, and with the category of that closest standard vector as its second label. The five classes of standard vectors are: a first standard vector characterizing the hand moving from top to bottom, a second characterizing the hand moving from bottom to top, a third characterizing the hand rotating clockwise, a fourth characterizing the hand rotating anticlockwise, and a fifth characterizing the hands moving in opposite directions. Splitting the label into a first label carrying the feature vector and a second label carrying the category further improves the training precision of the recognition model.
In some optional embodiments, training the recognition model pre-built on the C3D network with the video segments and their labels includes: training a preset C3D network to convergence on the video segments and their first labels to obtain a trained C3D network; training a preset classifier to convergence on the video segments and their second labels to obtain a trained classifier, the classifier being a support vector machine (SVM); and splicing the trained C3D network and the trained classifier into the trained recognition model. Training the C3D network and the classifier separately allows the feature-extraction and classification performance to be optimized independently, and splicing the two trained parts into the recognition model effectively improves training speed and effect.
In some optional embodiments, the visual display in the video to be identified includes: adding an arrow mark and a text mark at the corresponding position in the video to represent the recognized hand direction, the corresponding position being where the hand movement corresponding to that direction occurs. The arrow mark is vivid and the text mark is precise and reliable; combining the two for the visual display helps users learn the hand movements in the video better.
In some optional embodiments, inputting the video to be identified into the trained recognition model to obtain the hand direction in it includes: dividing the video to be identified into a plurality of segments to be identified and inputting each into the trained recognition model to obtain the hand direction in each segment. After the hand direction in the video is obtained, the method further includes: recording attribute information for the video, segment by segment, and storing it in a preset database. The attribute information includes: the file name of the video to be identified, the number of each segment within the video, the feature vector extracted from the segment, the hand direction identified in the segment, the recognition time of the segment, the name of the recognition model used to identify the hand direction, and the data source of the video to which the segment belongs. Storing the recognized video segment by segment, with its attribute information, in the database lets users retrieve and review it at any time, further improving the transmission of intangible heritage handicraft skills.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.
FIG. 1 is a flow chart of a behavior recognition based hand direction recognition method provided by one embodiment of the present application;
FIG. 2 is a schematic illustration of a hand direction visualization provided in one embodiment of the present application;
FIG. 3 is a flow chart of preprocessing a handicraft video in one embodiment of the present application;
FIG. 4 is a flow chart of computing similarity between key frames in one embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of this application clearer, the embodiments are described in detail below with reference to the accompanying drawings. As those of ordinary skill in the art will appreciate, numerous technical details are set forth in the embodiments to help the reader understand the application; the claimed technical solutions, however, can be implemented without these details, based on various changes and modifications of the following embodiments. The division into embodiments is for convenience of description only and should not be construed as limiting the specific implementations; the embodiments may be combined and cross-referenced where they do not contradict one another.
An embodiment of the present application relates to a hand direction recognition method based on behavior recognition, applied to an electronic device; the device may be a terminal or a server, and this embodiment and the following ones take a server as an example. The implementation details of the method are described below; they are provided to aid understanding and are not all required to implement this embodiment.
The specific flow of the hand direction recognition method based on behavior recognition in this embodiment may be as shown in fig. 1, including:
Step 101, acquiring a plurality of handicraft-making videos of different handicraft categories as sample videos.
In a specific implementation, short-video platforms hold a rich stock of handicraft-making videos, and the server can crawl a plurality of such videos of different handicraft categories and different craftspeople from these platforms, using packet-capture software such as Fiddler, as the sample videos. The more handicraft categories and craftsperson sources the videos cover, the stronger the generalization ability and universality of the trained recognition model.
In some examples, the craftwork videos of different craftwork categories may include a floss-making video, a paper-cut-making video, an embroidery-making video, and the like.
Step 102, dividing the sample video into a plurality of video segments, extracting key frames from each video segment, calculating the similarity between the key frames, and storing the video segments corresponding to the key frames with the similarity smaller than a first threshold value into the same folder.
In a specific implementation, after obtaining a sample video, the server may divide it into a plurality of video segments using a boundary-division method, where each video segment contains only one staged action of the handicraft-making process. Since these staged actions fall into just a few classes in terms of direction features, the server can calculate the similarity between key frames, treat the segments whose key frames have a similarity below the first threshold as segments sharing the same direction feature, and store those segments in the same folder.
In some examples, the server may divide the sample video into a plurality of equal-length video segments of 16 frames each, with overlapping frames between temporally adjacent segments, and extract the 8th frame of each segment as that segment's key frame.
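The equal-length split with overlap can be sketched as follows; an 8-frame overlap is an assumption, since the text only states that adjacent segments share frames:

```python
def split_into_segments(frames, seg_len=16, overlap=8):
    """Split a frame sequence into fixed-length segments with temporal overlap."""
    step = seg_len - overlap
    return [frames[start:start + seg_len]
            for start in range(0, len(frames) - seg_len + 1, step)]


def key_frame(segment):
    """The 8th frame (index 7) serves as the segment's key frame."""
    return segment[7]
```

For a 40-frame clip this yields four overlapping 16-frame segments, and each segment contributes one key frame to the similarity comparison.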
In some examples, the server may use image-similarity metrics such as SSIM (Structural Similarity Index), MSE (Mean Squared Error), or PSNR (Peak Signal-to-Noise Ratio) to calculate the similarity between two key frames.
Step 103, extracting the feature vectors of all video segments in a folder, calculating a first average value, labeling each video segment in the folder with that average, and training a recognition model pre-built on a C3D network with the video segments and their labels.
In a specific implementation, the server traverses each folder, extracts the feature vector of every video segment in it, calculates the first average value of those feature vectors, and labels each video segment in the folder with that average. The server then takes all labeled video segments as training samples and iteratively trains the recognition model pre-built on the C3D network until it converges, obtaining the trained recognition model. A C3D network can learn and model temporal information, so a recognition model built on it achieves much higher recognition accuracy than a traditional two-dimensional convolutional neural network.
In some examples, the server extracts the feature vectors of the video segments in a folder and calculates their first average value, which serves as the folder's representative vector. It then calculates the distance between the representative vector and five preset standard vectors, labels each video segment in the folder with the standard vector closest to the representative vector as its first label, and with the category of that closest standard vector as its second label. Splitting the label into a first label carrying the feature vector and a second label carrying the category further improves the training precision of the recognition model.
In some examples, the five classes of standard vectors include: a first standard vector characterizing the hand moving from top to bottom, a second characterizing the hand moving from bottom to top, a third characterizing the hand rotating clockwise, a fourth characterizing the hand rotating anticlockwise, and a fifth characterizing the hands moving in opposite directions. These five classes of standard vectors are obtained by extracting direction features from the video streams.
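The double-labelling step can be sketched as below. The toy 2-D standard vectors and their names are placeholders; the patent's real standard vectors come from direction features of the video streams:

```python
import numpy as np


def label_folder(feature_vectors, standard_vectors):
    """Return (first_label, second_label) shared by every segment in one folder.

    first_label  -- the standard vector nearest the folder's representative
                    vector (the first average value of the feature vectors)
    second_label -- the category name of that nearest standard vector
    """
    rep = np.mean(feature_vectors, axis=0)  # representative vector
    category = min(standard_vectors,
                   key=lambda name: np.linalg.norm(rep - standard_vectors[name]))
    return standard_vectors[category], category
```

With two hypothetical direction classes, a folder whose segments all point roughly upward gets the "bottom_to_top" vector as its first label and the category name as its second label.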
In some examples, when training the recognition model, the server trains a preset C3D network to convergence on the video segments and their first labels to obtain a trained C3D network, trains a preset classifier to convergence on the video segments and their second labels to obtain a trained classifier, and finally splices the trained C3D network and the trained classifier into the trained recognition model. Training the C3D network and the classifier separately allows the feature-extraction and classification performance to be optimized independently, and splicing the two trained parts into the recognition model effectively improves training speed and effect.
In some examples, the recognition model may use an SVM (support vector machine) as the classifier. The prediction of a trained SVM can be expressed by the following decision function:

f(x₀) = sign( Σᵢ αᵢ yᵢ K(xᵢ, x₀) + b )

where x₀ denotes a test sample, xᵢ the i-th training sample, yᵢ the label of the i-th training sample, αᵢ the Lagrange multiplier of the i-th training sample, K(·,·) the kernel function, and b the bias term.
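The kernelized decision function defined by x₀, xᵢ, yᵢ, αᵢ, K, and b above can be written out directly; the RBF kernel, its gamma value, and the toy support vectors are illustrative assumptions:

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2), a common SVM kernel choice."""
    return np.exp(-gamma * np.sum((a - b) ** 2))


def svm_decision(x0, train_samples, labels, alphas, bias, kernel=rbf_kernel):
    """f(x0) = sign(sum_i alpha_i * y_i * K(x_i, x0) + b)."""
    s = sum(a * y * kernel(xi, x0)
            for a, y, xi in zip(alphas, labels, train_samples))
    return 1 if s + bias >= 0 else -1
```

A sample near the positive support vector is classified +1 and one near the negative support vector is classified -1, matching the sign of the weighted kernel sum.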
Step 104, inputting the video to be identified into the trained recognition model to obtain the hand direction in it, and displaying that direction visually in the video.
In a specific implementation, after the trained recognition model is obtained, the server may input the video to be identified into it, obtain the hand direction in the video, and display that direction visually in the video: an arrow mark and a text mark are added at the corresponding position in the video to represent the recognized hand direction, the corresponding position being where the hand movement corresponding to that direction occurs. The arrow mark is vivid and the text mark is precise and reliable; combining the two for the visual display helps users learn the hand movements in the video better.
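In practice the overlay would typically be drawn with OpenCV's `cv2.arrowedLine` and `cv2.putText`; as a dependency-free sketch, the arrow shaft can be rasterized into a grayscale frame like this (the start/end coordinates and pixel value are hypothetical styling choices):

```python
import numpy as np


def overlay_arrow(frame, start, end, value=255):
    """Burn a straight arrow shaft from `start` to `end` into a grayscale frame.

    `start` and `end` are (row, col) pairs; the shaft is sampled densely
    enough that every step moves at most one pixel per axis.
    """
    (r0, c0), (r1, c1) = start, end
    n = max(abs(r1 - r0), abs(c1 - c0)) + 1
    rows = np.linspace(r0, r1, n).round().astype(int)
    cols = np.linspace(c0, c1, n).round().astype(int)
    frame[rows, cols] = value
    return frame
```

Drawing a downward shaft through column 5 of a 10x10 frame marks exactly those ten pixels, which is the kind of per-frame annotation the recognized direction would drive.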
In some examples, the craftsmanship video added with the arrow mark and the text mark can be as shown in fig. 2.
In some examples, the server inputs the video to be identified into the trained recognition model and obtains the hand direction in it as follows: the video is divided into a plurality of segments to be identified, and each segment is input into the trained model to obtain the hand direction in that segment. After obtaining the hand directions, the server may record attribute information for the video, segment by segment, and store it in a preset database. The attribute information includes: the file name of the video to be identified (Video Name), the number of the segment within the video (Segment Number), the feature vector extracted from the segment (Feature Vector), the hand direction identified in the segment (Hand Direction), the recognition time of the segment (Recognition Time), the name of the recognition model used to identify the hand direction (Model Name), the data source of the video to which the segment belongs (Data Source), and the like. Storing the recognized video segment by segment, with its attribute information, in the database lets users retrieve and review it at any time, further improving the transmission of intangible heritage handicraft skills.
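The segment-level attribute record could be persisted as in this sketch; the table name and column names are assumptions mapping one-to-one onto the fields listed above:

```python
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS recognized_segments (
    video_name       TEXT,    -- file name of the video to be identified
    segment_number   INTEGER, -- number of the segment within the video
    feature_vector   TEXT,    -- serialized feature vector of the segment
    hand_direction   TEXT,    -- recognized hand direction
    recognition_time TEXT,    -- when the segment was recognized
    model_name       TEXT,    -- recognition model used
    data_source      TEXT     -- platform the video came from
)"""


def store_segment(conn, record):
    """Insert one per-segment attribute record (a 7-tuple in SCHEMA order)."""
    conn.execute(
        "INSERT INTO recognized_segments VALUES (?, ?, ?, ?, ?, ?, ?)", record)
    conn.commit()
```

An in-memory database is enough to exercise the round trip; retrieval for later review is then a plain SELECT on, e.g., the video name or hand direction.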
According to this embodiment, the recognition model constructed based on the C3D network is trained; because the C3D network can learn and model temporal information, the recognition accuracy of a recognition model built on it is considerably higher than that of a traditional two-dimensional convolutional neural network. When selecting training samples, handicraft production videos of different handicraft categories and different craftsmen are crawled as sample videos, which improves the generalization capability and universality of the trained recognition model and makes it applicable to different recognition scenes. Visually displaying the hand direction and the hand action identified from the video to be identified lets a user see more clearly, carefully, and accurately the detailed steps taken by an intangible-heritage handicraft inheritor in making a handicraft work, so that the user can learn and produce easily with a sense of achievement and satisfaction, become a new handicraft producer and propagator, and keep the intangible-heritage handicraft inheritance alive.
In some embodiments, after obtaining the handicraft production videos of the different craftsmen, the server also needs to preprocess them and use the preprocessed videos as the sample videos. The preprocessing may be implemented through the steps shown in fig. 3, which specifically include:
step 201, denoising and smoothing the hand-made video.
In a specific implementation, after the server crawls a handicraft production video, it first performs denoising and smoothing on the video, that is, removes noise and unnecessary information from the video and retains only the hand information.
Step 202, normalizing the picture size of each craftwork video, wherein the picture size of each normalized craftwork video is the same.
In a specific implementation, the handicraft production videos obtained by crawling differ in specification and standard, which is not conducive to stable training of the recognition model. The server therefore normalizes the picture sizes of the handicraft production videos so that all normalized videos have the same picture size, which improves the accuracy and stability of the recognition effect of the trained recognition model.
And 203, extracting three-dimensional coordinate data of hand movements in the handicraft production video and performing time alignment.
In a specific implementation, in order to facilitate subsequent feature extraction, the server may first extract three-dimensional coordinate data of hand motions in the hand-craftwork production video and perform time alignment in a preprocessing stage, so as to shorten model training time.
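Since the patent does not specify the alignment method, the sketch below shows one simple form: resampling each extracted (T, 3) hand-coordinate track onto a fixed number of uniformly spaced time steps, so that clips of different lengths become directly comparable. The sample count of 64 is an assumption.

```python
import numpy as np

def time_align(track, n_samples=64):
    """Resample a (T, 3) hand-coordinate track onto n_samples uniformly
    spaced time steps (one simple form of time alignment; the method
    itself is an assumption, not specified by the patent)."""
    track = np.asarray(track, dtype=float)
    src = np.linspace(0.0, 1.0, len(track))   # original time axis, normalized
    dst = np.linspace(0.0, 1.0, n_samples)    # common target time axis
    # Interpolate each coordinate axis (x, y, z) independently.
    return np.stack([np.interp(dst, src, track[:, k]) for k in range(3)], axis=1)

short = np.random.rand(30, 3)   # 30-frame clip
long_ = np.random.rand(90, 3)   # 90-frame clip
a, b = time_align(short), time_align(long_)
```

More elaborate alignment (e.g. dynamic time warping between tracks) would serve the same purpose where motion speed varies nonlinearly between videos.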
In some examples, the server calculates the similarity between the key frames, which may be implemented by the steps shown in fig. 4, including:
step 301, for a first key frame and a second key frame, establishing a first window and a second window with the same size at the same initial position of the first key frame and the second key frame, respectively.
In a specific implementation, for the first key frame and the second key frame, the server establishes a first window and a second window with the same size at the same initial position of the first key frame and the second key frame, for example, establishes a first window and a second window with the size of n×n at the upper left corner of the first key frame and the second key frame, respectively.
Step 302, calculating local SSIM according to the pixel value of each pixel point in the first window and the pixel value of each pixel point in the second window.
In a specific implementation, for the first window and the second window, the server may calculate the local SSIM according to the pixel value of each pixel point in the first window and the pixel value of each pixel point in the second window by the following formula:

SSIM(x, y) = ((2μ_x μ_y + c_1)(2σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))

c_1 = (k_1 L)², c_2 = (k_2 L)², k_1 = 0.01, k_2 = 0.03

where x represents the first key frame, y represents the second key frame, μ_x is the mean of the pixel values of the pixel points in the first window, μ_y is the mean of the pixel values of the pixel points in the second window, σ_x² is the variance of the pixel values of the pixel points in the first window, σ_y² is the variance of the pixel values of the pixel points in the second window, σ_xy is the covariance between the pixel values of the pixel points in the first window and the pixel values of the pixel points in the second window, and L represents the dynamic range of the pixel values, generally 255.
Step 303, sliding the first window and the second window synchronously according to a preset step length, and calculating to obtain each local SSIM.
In a specific implementation, the server may slide the first window and the second window synchronously according to a preset step length, calculating a local SSIM at each position. For example, if the sizes of the first window and the second window are N×N and the preset step length is N, the server slides both windows N pixels to the right at each step.
Step 304, average all the local SSIMs obtained by calculation to obtain a global SSIM, and taking the global SSIM as the similarity between the first key frame and the second key frame.
In a specific implementation, the SSIM index measures the similarity between two pictures well. However, because hand motion is subtle and a frame contains many regions, directly calculating a global SSIM is inaccurate and dilutes the differences in individual local regions. The method therefore calculates each local SSIM and then averages them to obtain the global SSIM, which better determines the similarity between the first key frame and the second key frame.
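Steps 301-304 can be sketched as follows, using the formula and constants given above. The window size of 8 and step of 8 are assumptions (the patent only specifies N×N windows with a preset step length), and the inputs are single-channel frames for simplicity.

```python
import numpy as np

K1, K2, L = 0.01, 0.03, 255.0
C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2

def local_ssim(wx, wy):
    """SSIM of one window pair, per the formula above (step 302)."""
    mx, my = wx.mean(), wy.mean()
    vx, vy = wx.var(), wy.var()
    cov = ((wx - mx) * (wy - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx**2 + my**2 + C1) * (vx + vy + C2))

def global_ssim(x, y, n=8, step=8):
    """Slide an n x n window over both frames in lockstep (steps 301, 303)
    and average the local SSIMs into a global SSIM (step 304)."""
    vals = [local_ssim(x[i:i+n, j:j+n], y[i:i+n, j:j+n])
            for i in range(0, x.shape[0] - n + 1, step)
            for j in range(0, x.shape[1] - n + 1, step)]
    return float(np.mean(vals))

frame = np.random.rand(64, 64) * 255  # a synthetic grayscale key frame
```

Identical frames yield a global SSIM of 1, and the method then groups video segments whose key frames fall below the first similarity threshold into separate folders.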
The above steps of the methods are divided, for clarity of description, and may be combined into one step or split into multiple steps when implemented, so long as they include the same logic relationship, and they are all within the protection scope of this patent; it is within the scope of this patent to add insignificant modifications to the algorithm or flow or introduce insignificant designs, but not to alter the core design of its algorithm and flow.
Another embodiment of the present application relates to an electronic device, as shown in fig. 5, comprising: at least one processor 401; and a memory 402 communicatively coupled to the at least one processor 401; wherein the memory 402 stores instructions executable by the at least one processor 401, the instructions being executable by the at least one processor 401 to enable the at least one processor 401 to perform the behavior recognition based hand direction recognition method in the above embodiments.
Where the memory and the processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors and the memory together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over the wireless medium via the antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory may be used to store data used by the processor in performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.
That is, it will be understood by those skilled in the art that all or part of the steps in implementing the methods of the embodiments described above may be implemented by a program stored in a storage medium, where the program includes several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps in the methods of the embodiments described herein. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, etc., which can store program codes.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific embodiments in which the present application is implemented and that various changes in form and details may be made therein without departing from the spirit and scope of the present application.
Claims (10)
1. The hand direction recognition method based on behavior recognition is characterized by comprising the following steps of:
acquiring a plurality of hand craft production videos of different hand craft categories and different craftsmen as sample videos;
dividing the sample video into a plurality of video clips, extracting key frames from the video clips, calculating the similarity between the key frames, and storing the video clips corresponding to the key frames with the similarity smaller than a first threshold value into the same folder;
extracting feature vectors of all video clips in the folder, solving a first average value, marking the first average value as labels of all video clips in the folder, and training an identification model which is built in advance based on a C3D network based on the video clips and the labels thereof;
inputting the video to be identified into the identification model after training, obtaining the hand direction in the video to be identified, and visually displaying the hand direction in the video to be identified.
2. The behavior recognition-based hand direction recognition method according to claim 1, wherein the acquiring a plurality of handicraft production videos of different handicraft categories and different craftsmen as sample videos comprises:
preprocessing a plurality of acquired craftsman production videos with different craftsman types, and taking the preprocessed craftsman production videos as sample videos;
the pretreatment at least comprises:
denoising and smoothing the handicraft video;
normalizing the picture size of each of the craftwork video, wherein the picture sizes of the normalized craftwork video are the same;
and extracting three-dimensional coordinate data of hand movements in the handicraft video and performing time alignment.
3. The behavior recognition-based hand direction recognition method according to claim 1, wherein the calculating the similarity between the key frames comprises:
for a first key frame and a second key frame, respectively establishing a first window and a second window with the same size at the same initial position of the first key frame and the second key frame;
calculating a local structure similarity index SSIM according to the pixel value of each pixel point in the first window and the pixel value of each pixel point in the second window;
synchronously sliding the first window and the second window according to a preset step length, and calculating to obtain each local SSIM;
and averaging all the local SSIM obtained by calculation to obtain a global SSIM, and taking the global SSIM as the similarity between the first key frame and the second key frame.
4. A method of hand direction recognition based on behavior recognition according to claim 3, wherein the local SSIM is calculated according to the pixel value of each pixel point in the first window and the pixel value of each pixel point in the second window by the following formula:

SSIM(x, y) = ((2μ_x μ_y + c_1)(2σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))

c_1 = (k_1 L)², c_2 = (k_2 L)², k_1 = 0.01, k_2 = 0.03

wherein x represents the first key frame, y represents the second key frame, μ_x is the mean of the pixel values of the pixel points in the first window, μ_y is the mean of the pixel values of the pixel points in the second window, σ_x² is the variance of the pixel values of the pixel points in the first window, σ_y² is the variance of the pixel values of the pixel points in the second window, σ_xy is the covariance between the pixel values of the pixel points in the first window and the pixel values of the pixel points in the second window, and L represents the dynamic range of the pixel values.
5. The behavior recognition-based hand direction recognition method according to claim 1, wherein the extracting feature vectors of each video clip in the folder and calculating a first average value, and labeling the first average value as a label of each video clip in the folder comprises:
extracting feature vectors of all video clips in the folder, solving a first average value, and taking the first average value as a representative vector of the folder;
calculating the distance between the representative vector and five preset standard vectors, marking the standard vector with the minimum distance with the representative vector as a first label of each video segment in the folder, and marking the category of the standard vector with the minimum distance with the representative vector as a second label of each video segment in the folder;
wherein the five standard vectors include: a first standard vector representing the hand moving from top to bottom, a second standard vector representing the hand moving from bottom to top, a third standard vector representing the hand rotating clockwise, a fourth standard vector representing the hand rotating counterclockwise, and a fifth standard vector representing the hand rotating in opposite directions.
6. The behavior recognition-based hand direction recognition method according to claim 5, wherein training a recognition model previously constructed based on a C3D network based on the video clip and a tag thereof comprises:
training a preset C3D network to be converged based on the video segment and the first label thereof to obtain a trained C3D network;
training a preset classifier to be converged based on the video segment and the second label thereof to obtain a trained classifier; wherein, the classifier is a Support Vector Machine (SVM);
and splicing the trained C3D network and the trained classifier to obtain a trained recognition model.
7. The behavior recognition-based hand direction recognition method according to any one of claims 1 to 6, wherein the visually displaying in the video to be recognized includes:
adding an arrow mark and a character mark at corresponding positions in the video to be identified to represent the identified hand direction; the corresponding positions are positions when the hand motion corresponding to the hand direction occurs.
8. The behavior recognition-based hand direction recognition method according to any one of claims 1 to 6, wherein the inputting the video to be recognized into the recognition model after training to obtain the hand direction in the video to be recognized includes:
dividing a video to be identified into a plurality of video segments to be identified, and inputting each video segment to be identified into an identification model after training is completed to obtain the hand direction in each video segment to be identified;
after the obtaining the hand direction in the video to be identified, the method further includes:
taking the video segment to be identified as a unit, recording attribute information and then storing the video to be identified into a preset database; wherein the attribute information includes: the video file name of the video to be identified, the number of the video segment to be identified within the video to be identified, the feature vector extracted from the video segment to be identified, the hand direction identified in the video segment to be identified, the identification time corresponding to the video segment to be identified, the name of the identification model used when identifying the hand direction in the video segment to be identified, and the data source of the video to be identified to which the video segment to be identified belongs.
9. An electronic device, comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the behavior recognition based hand direction recognition method of any one of claims 1 to 8.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the behavior recognition-based hand direction recognition method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311423880.9A CN117373130A (en) | 2023-10-31 | 2023-10-31 | Behavior recognition-based hand direction recognition method, electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117373130A true CN117373130A (en) | 2024-01-09 |
Family
ID=89402138
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311423880.9A Pending CN117373130A (en) | 2023-10-31 | 2023-10-31 | Behavior recognition-based hand direction recognition method, electronic device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117373130A (en) |
- 2023-10-31: CN202311423880.9A filed; patent application CN117373130A, status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102266529B1 (en) | Method, apparatus, device and readable storage medium for image-based data processing | |
CN108073910B (en) | Method and device for generating human face features | |
CN108280477B (en) | Method and apparatus for clustering images | |
CN104199834B (en) | The method and system for obtaining remote resource from information carrier surface interactive mode and exporting | |
CN112287820A (en) | Face detection neural network, face detection neural network training method, face detection method and storage medium | |
CN108229293A (en) | Face image processing process, device and electronic equipment | |
CN107895160A (en) | Human face detection and tracing device and method | |
CN107908641B (en) | Method and system for acquiring image annotation data | |
US20200285951A1 (en) | Figure captioning system and related methods | |
EP3852061A1 (en) | Method and device for damage segmentation of vehicle damage image | |
WO2023020005A1 (en) | Neural network model training method, image retrieval method, device, and medium | |
US20230041943A1 (en) | Method for automatically producing map data, and related apparatus | |
CN105493078A (en) | Color sketch image searching | |
KR20200059993A (en) | Apparatus and method for generating conti for webtoon | |
CN110059637B (en) | Face alignment detection method and device | |
CN110399547B (en) | Method, apparatus, device and storage medium for updating model parameters | |
CN110390724A (en) | A kind of SLAM method with example segmentation | |
CN112381118B (en) | College dance examination evaluation method and device | |
CN113505786A (en) | Test question photographing and judging method and device and electronic equipment | |
CN108369647B (en) | Image-based quality control | |
CN111144466B (en) | Image sample self-adaptive depth measurement learning method | |
CN116958512A (en) | Target detection method, target detection device, computer readable medium and electronic equipment | |
CN117373130A (en) | Behavior recognition-based hand direction recognition method, electronic device and storage medium | |
CN113255701B (en) | Small sample learning method and system based on absolute-relative learning framework | |
US20220300836A1 (en) | Machine Learning Techniques for Generating Visualization Recommendations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||