CN117373130A - Behavior recognition-based hand direction recognition method, electronic device and storage medium - Google Patents
- Publication number
- CN117373130A (application CN202311423880.9A)
- Authority
- CN
- China
- Prior art keywords
- video
- identified
- hand direction
- hand
- window
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/08—Learning methods
- G06V10/761—Proximity, similarity or dissimilarity measures
- G06V10/764—Image or video recognition or understanding using classification, e.g. of video objects
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V10/82—Image or video recognition or understanding using neural networks
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Abstract
The embodiments of this application relate to the technical field of behavior recognition and disclose a hand direction recognition method based on behavior recognition, an electronic device, and a storage medium. The method comprises the following steps: acquiring a plurality of handicraft-making videos as sample videos; dividing each sample video into a plurality of video segments, calculating the similarity between key frames extracted from the video segments, and storing the video segments whose key frames have a similarity below a first threshold in the same folder; extracting the feature vectors of all video segments in a folder, calculating a first average value, labeling each video segment in the folder with that average, and training a preset recognition model on the video segments and their labels; and inputting a video to be identified into the trained recognition model to obtain the hand directions in it and display them visually in the video, so that users can learn intangible cultural heritage handicraft skills more carefully and conveniently and those skills continue to be passed on.
Description
Technical Field
The embodiment of the application relates to the technical field of behavior recognition, in particular to a hand direction recognition method based on behavior recognition, electronic equipment and a storage medium.
Background
With the rapid development of the mobile internet, logging in to short-video platforms has become part of everyday life. As more and more inheritors of intangible cultural heritage settle on short-video platforms, these platforms have become a new channel for heritage transmission, and more and more inheritors and makers of traditional handicraft skills choose to present and spread their craft there. In passing on these skills, the direction and technique of the hand movements are crucial; yet traditional transmission relies on in-person demonstration by a master, with apprentices learning by watching and imitating. This process is time-consuming and inefficient, its effect is limited by the apprentices' powers of observation and understanding, and misunderstandings can occur.
With the development of modern technology, people have gradually realized that combining intangible cultural heritage with modern technology can effectively promote its transmission and revitalization, and gesture recognition is an important practice in this combination. Gesture recognition converts human hand movements into computer input and can be implemented with sensors, but sensor-based approaches suffer from low recognition accuracy and poor real-time performance, so their effect on heritage transmission has been unsatisfactory.
Disclosure of Invention
The embodiments of this application aim to provide a hand direction recognition method based on behavior recognition, an electronic device, and a storage medium that can accurately recognize hand directions in video in real time and visualize the recognized directions and movements, so that users can learn intangible heritage handicraft skills more carefully and conveniently and those skills continue to be passed on.
To solve the above technical problem, an embodiment of the present application provides a hand direction recognition method based on behavior recognition, comprising the following steps: acquiring, as sample videos, a plurality of handicraft-making videos covering different handicraft categories and different craftspeople; dividing each sample video into a plurality of video segments, extracting key frames from the video segments, calculating the similarity between the key frames, and storing the video segments whose key frames have a similarity below a first threshold in the same folder; extracting the feature vectors of all video segments in a folder, calculating a first average value, labeling each video segment in the folder with that average, and training a preset recognition model on the video segments and their labels; and inputting a video to be identified into the trained recognition model to obtain the hand direction in it and display it visually in the video.
An embodiment of the present application also provides an electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor, the memory storing instructions executable by the at least one processor to enable it to perform the behavior-recognition-based hand direction recognition method described above.
Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements the above-described behavior recognition-based hand direction recognition method.
With the hand direction recognition method, electronic device, and storage medium based on behavior recognition provided by the embodiments of this application, a recognition model built on a C3D network is trained. A C3D network can learn and model temporal information, so a recognition model built on it achieves higher recognition accuracy than a traditional two-dimensional convolutional neural network. When selecting training samples, handicraft-making videos of different handicraft categories and different craftspeople are crawled as sample videos, which improves the generalization ability and universality of the trained model and makes it applicable to different recognition scenarios. Visualizing the hand directions and movements recognized in the video to be identified lets users see clearly, carefully, and accurately the detailed steps an intangible-heritage craftsperson performs when making a handicraft, so they can learn and make the craft more easily, keep doing so with a sense of achievement and satisfaction, and in turn become new makers and propagators of the craft, keeping the heritage alive.
In some alternative embodiments, acquiring the handicraft-making videos of a plurality of different handicraft categories as sample videos includes: preprocessing the acquired handicraft-making videos of different categories and using the preprocessed videos as the sample videos. The preprocessing at least comprises: denoising and smoothing each handicraft video; normalizing the frame size of each video so that all normalized videos share the same frame size; and extracting three-dimensional coordinate data of the hand movements in each video and aligning them in time. Preprocessing removes interference from the sample videos and converts the information of every sample into a unified system, which facilitates training of the recognition model and improves the accuracy and stability of the trained model's recognition.
In some alternative embodiments, calculating the similarity between the key frames includes: for a first key frame and a second key frame, establishing a first window and a second window of the same size at the same initial position in the two frames; calculating a local structural similarity index (SSIM) from the pixel values of the pixels in the first window and the pixel values of the pixels in the second window; sliding the two windows synchronously by a preset step and calculating each local SSIM; and averaging all the calculated local SSIMs to obtain a global SSIM, which is taken as the similarity between the first and second key frames. The SSIM index measures the similarity between two images well; however, because hand movements are subtle and a frame contains many regions, directly calculating a single global SSIM is inaccurate and dilutes the differences in some local regions. This scheme therefore calculates the local SSIMs first and averages them into a global SSIM, which determines the similarity between key frames more reliably.
In some alternative embodiments, the local SSIM is calculated from the pixel values of the pixels in the first window and the pixel values of the pixels in the second window by the following formula:

SSIM(x, y) = ((2μ_x μ_y + c₁)(2σ_xy + c₂)) / ((μ_x² + μ_y² + c₁)(σ_x² + σ_y² + c₂))

c₁ = (k₁L)², c₂ = (k₂L)², k₁ = 0.01, k₂ = 0.03

where x denotes the first key frame, y the second key frame, μ_x the mean of the pixel values of the pixels in the first window, μ_y the mean of the pixel values of the pixels in the second window, σ_x² the variance of the pixel values of the pixels in the first window, σ_y² the variance of the pixel values of the pixels in the second window, σ_xy the covariance between the pixel values of the pixels in the first window and those in the second window, and L the dynamic range of the pixel values.
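The windowed SSIM averaging described above can be sketched as follows; the window size, stride, and 8-bit dynamic range are illustrative choices, not values fixed by the patent:

```python
import numpy as np

K1, K2, L_RANGE = 0.01, 0.03, 255  # constants from the formula above


def local_ssim(wx, wy):
    """SSIM of two equally sized grayscale windows, per the formula above."""
    c1, c2 = (K1 * L_RANGE) ** 2, (K2 * L_RANGE) ** 2
    mu_x, mu_y = wx.mean(), wy.mean()
    var_x, var_y = wx.var(), wy.var()
    cov_xy = ((wx - mu_x) * (wy - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


def global_ssim(x, y, win=8, step=4):
    """Slide both windows synchronously and average the local SSIMs."""
    h, w = x.shape
    vals = [local_ssim(x[i:i + win, j:j + win], y[i:i + win, j:j + win])
            for i in range(0, h - win + 1, step)
            for j in range(0, w - win + 1, step)]
    return float(np.mean(vals))
```

Two identical key frames yield a global SSIM of 1.0, and noise or motion between frames lowers it; segments whose key frames score below the first threshold would then be grouped into the same folder.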
In some optional embodiments, extracting the feature vectors of the video segments in a folder, calculating a first average value, and labeling each video segment in the folder with it includes: extracting the feature vectors of all video segments in the folder and calculating their first average value, which is taken as the representative vector of the folder; calculating the distance between the representative vector and five preset standard vectors; labeling each video segment in the folder with the standard vector closest to the representative vector as its first label, and with the category of that closest standard vector as its second label. The five classes of standard vectors are: a first standard vector characterizing the hand moving from top to bottom, a second characterizing the hand moving from bottom to top, a third characterizing the hand rotating clockwise, a fourth characterizing the hand rotating anticlockwise, and a fifth characterizing the hands moving in opposite directions. Splitting the label into a first label carrying the feature vector and a second label carrying the category further improves the training precision of the recognition model.
In some optional embodiments, training the recognition model pre-built on the C3D network with the video segments and their labels includes: training a preset C3D network to convergence on the video segments and their first labels to obtain a trained C3D network; training a preset classifier to convergence on the video segments and their second labels to obtain a trained classifier, the classifier being a support vector machine (SVM); and splicing the trained C3D network and the trained classifier into the trained recognition model. Training the C3D network and the classifier separately allows the feature-extraction and classification performance to be optimized independently, and splicing the two trained parts into the recognition model effectively improves training speed and effect.
In some optional embodiments, the visual display in the video to be identified includes: adding an arrow mark and a text mark at the corresponding position in the video to represent the recognized hand direction, the corresponding position being where the hand movement corresponding to that direction occurs. The arrow mark is vivid and the text mark is precise and reliable; combining the two for the visual display helps users learn the hand movements in the video better.
In some optional embodiments, inputting the video to be identified into the trained recognition model to obtain the hand direction in it includes: dividing the video to be identified into a plurality of segments to be identified and inputting each into the trained recognition model to obtain the hand direction in each segment. After the hand direction in the video is obtained, the method further includes: recording attribute information for the video, segment by segment, and storing it in a preset database. The attribute information includes: the file name of the video to be identified, the number of each segment within the video, the feature vector extracted from the segment, the hand direction identified in the segment, the recognition time of the segment, the name of the recognition model used to identify the hand direction, and the data source of the video to which the segment belongs. Storing the recognized video segment by segment, with its attribute information, in the database lets users retrieve and review it at any time, further improving the transmission of intangible heritage handicraft skills.
Drawings
One or more embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.
FIG. 1 is a flow chart of a behavior recognition based hand direction recognition method provided by one embodiment of the present application;
FIG. 2 is a schematic illustration of a hand direction visualization provided in one embodiment of the present application;
FIG. 3 is a flow chart of preprocessing a handicraft video in one embodiment of the present application;
FIG. 4 is a flow chart of computing similarity between key frames in one embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of this application clearer, the embodiments are described in detail below with reference to the accompanying drawings. As those of ordinary skill in the art will appreciate, numerous technical details are set forth in the embodiments to help the reader understand the application; the claimed technical solutions, however, can be implemented without these details, based on various changes and modifications of the following embodiments. The division into embodiments is for convenience of description only and should not be construed as limiting the specific implementations; the embodiments may be combined and cross-referenced where they do not contradict one another.
An embodiment of the present application relates to a hand direction recognition method based on behavior recognition, applied to an electronic device; the device may be a terminal or a server, and this embodiment and the following ones take a server as an example. The implementation details of the method are described below; they are provided to aid understanding and are not all required to implement this embodiment.
The specific flow of the hand direction recognition method based on behavior recognition in this embodiment may be as shown in fig. 1, including:
Step 101, acquiring a plurality of handicraft-making videos of different handicraft categories as sample videos.
In a specific implementation, short-video platforms hold a rich stock of handicraft-making videos, and the server can crawl a plurality of such videos of different handicraft categories and different craftspeople from these platforms, using packet-capture software such as Fiddler, as the sample videos. The more handicraft categories and craftsperson sources the videos cover, the stronger the generalization ability and universality of the trained recognition model.
In some examples, the craftwork videos of different craftwork categories may include a floss-making video, a paper-cut-making video, an embroidery-making video, and the like.
Step 102, dividing the sample video into a plurality of video segments, extracting key frames from each video segment, calculating the similarity between the key frames, and storing the video segments corresponding to the key frames with the similarity smaller than a first threshold value into the same folder.
In a specific implementation, after obtaining a sample video, the server may divide it into a plurality of video segments using a boundary-division method, where each video segment contains only one staged action of the handicraft-making process. Since these staged actions fall into just a few classes in terms of direction features, the server can calculate the similarity between key frames, treat the segments whose key frames have a similarity below the first threshold as segments sharing the same direction feature, and store those segments in the same folder.
In some examples, the server may divide the sample video into a plurality of equal-length video segments of 16 frames each, with overlapping frames between temporally adjacent segments, and extract the 8th frame of each segment as that segment's key frame.
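The equal-length split with overlap can be sketched as follows; an 8-frame overlap is an assumption, since the text only states that adjacent segments share frames:

```python
def split_into_segments(frames, seg_len=16, overlap=8):
    """Split a frame sequence into fixed-length segments with temporal overlap."""
    step = seg_len - overlap
    return [frames[start:start + seg_len]
            for start in range(0, len(frames) - seg_len + 1, step)]


def key_frame(segment):
    """The 8th frame (index 7) serves as the segment's key frame."""
    return segment[7]
```

For a 40-frame clip this yields four overlapping 16-frame segments, and each segment contributes one key frame to the similarity comparison.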
In some examples, the server may use image-similarity metrics such as SSIM (Structural Similarity Index), MSE (Mean Squared Error), or PSNR (Peak Signal-to-Noise Ratio) to calculate the similarity between two key frames.
Step 103, extracting the feature vectors of all video segments in a folder, calculating a first average value, labeling each video segment in the folder with that average, and training a recognition model pre-built on a C3D network with the video segments and their labels.
In a specific implementation, the server traverses each folder, extracts the feature vector of every video segment in it, calculates the first average value of those feature vectors, and labels each video segment in the folder with that average. The server then takes all labeled video segments as training samples and iteratively trains the recognition model pre-built on the C3D network until it converges, obtaining the trained recognition model. A C3D network can learn and model temporal information, so a recognition model built on it achieves much higher recognition accuracy than a traditional two-dimensional convolutional neural network.
In some examples, the server extracts the feature vectors of the video segments in a folder and calculates their first average value, which serves as the folder's representative vector. It then calculates the distance between the representative vector and five preset standard vectors, labels each video segment in the folder with the standard vector closest to the representative vector as its first label, and with the category of that closest standard vector as its second label. Splitting the label into a first label carrying the feature vector and a second label carrying the category further improves the training precision of the recognition model.
In some examples, the five classes of standard vectors include: a first standard vector characterizing the hand moving from top to bottom, a second characterizing the hand moving from bottom to top, a third characterizing the hand rotating clockwise, a fourth characterizing the hand rotating anticlockwise, and a fifth characterizing the hands moving in opposite directions. These five classes of standard vectors are obtained by extracting direction features from the video streams.
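The double-labelling step can be sketched as below. The toy 2-D standard vectors and their names are placeholders; the patent's real standard vectors come from direction features of the video streams:

```python
import numpy as np


def label_folder(feature_vectors, standard_vectors):
    """Return (first_label, second_label) shared by every segment in one folder.

    first_label  -- the standard vector nearest the folder's representative
                    vector (the first average value of the feature vectors)
    second_label -- the category name of that nearest standard vector
    """
    rep = np.mean(feature_vectors, axis=0)  # representative vector
    category = min(standard_vectors,
                   key=lambda name: np.linalg.norm(rep - standard_vectors[name]))
    return standard_vectors[category], category
```

With two hypothetical direction classes, a folder whose segments all point roughly upward gets the "bottom_to_top" vector as its first label and the category name as its second label.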
In some examples, when training the recognition model, the server trains a preset C3D network to convergence on the video segments and their first labels to obtain a trained C3D network, trains a preset classifier to convergence on the video segments and their second labels to obtain a trained classifier, and finally splices the trained C3D network and the trained classifier into the trained recognition model. Training the C3D network and the classifier separately allows the feature-extraction and classification performance to be optimized independently, and splicing the two trained parts into the recognition model effectively improves training speed and effect.
In some examples, the recognition model may use an SVM (support vector machine) as the classifier. The prediction of a trained SVM can be expressed by the following decision function:

f(x₀) = sign( Σᵢ αᵢ yᵢ K(xᵢ, x₀) + b )

where x₀ denotes a test sample, xᵢ the i-th training sample, yᵢ the label of the i-th training sample, αᵢ the Lagrange multiplier of the i-th training sample, K(·,·) the kernel function, and b the bias term.
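The kernelized decision function defined by x₀, xᵢ, yᵢ, αᵢ, K, and b above can be written out directly; the RBF kernel, its gamma value, and the toy support vectors are illustrative assumptions:

```python
import numpy as np


def rbf_kernel(a, b, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2), a common SVM kernel choice."""
    return np.exp(-gamma * np.sum((a - b) ** 2))


def svm_decision(x0, train_samples, labels, alphas, bias, kernel=rbf_kernel):
    """f(x0) = sign(sum_i alpha_i * y_i * K(x_i, x0) + b)."""
    s = sum(a * y * kernel(xi, x0)
            for a, y, xi in zip(alphas, labels, train_samples))
    return 1 if s + bias >= 0 else -1
```

A sample near the positive support vector is classified +1 and one near the negative support vector is classified -1, matching the sign of the weighted kernel sum.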
Step 104, inputting the video to be identified into the trained recognition model to obtain the hand direction in it, and displaying that direction visually in the video.
In a specific implementation, after the trained recognition model is obtained, the server may input the video to be identified into it, obtain the hand direction in the video, and display that direction visually in the video: an arrow mark and a text mark are added at the corresponding position in the video to represent the recognized hand direction, the corresponding position being where the hand movement corresponding to that direction occurs. The arrow mark is vivid and the text mark is precise and reliable; combining the two for the visual display helps users learn the hand movements in the video better.
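In practice the overlay would typically be drawn with OpenCV's `cv2.arrowedLine` and `cv2.putText`; as a dependency-free sketch, the arrow shaft can be rasterized into a grayscale frame like this (the start/end coordinates and pixel value are hypothetical styling choices):

```python
import numpy as np


def overlay_arrow(frame, start, end, value=255):
    """Burn a straight arrow shaft from `start` to `end` into a grayscale frame.

    `start` and `end` are (row, col) pairs; the shaft is sampled densely
    enough that every step moves at most one pixel per axis.
    """
    (r0, c0), (r1, c1) = start, end
    n = max(abs(r1 - r0), abs(c1 - c0)) + 1
    rows = np.linspace(r0, r1, n).round().astype(int)
    cols = np.linspace(c0, c1, n).round().astype(int)
    frame[rows, cols] = value
    return frame
```

Drawing a downward shaft through column 5 of a 10x10 frame marks exactly those ten pixels, which is the kind of per-frame annotation the recognized direction would drive.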
In some examples, the craftsmanship video added with the arrow mark and the text mark can be as shown in fig. 2.
In some examples, the server inputs the video to be identified into the trained recognition model and obtains the hand direction in it as follows: the video is divided into a plurality of segments to be identified, and each segment is input into the trained model to obtain the hand direction in that segment. After obtaining the hand directions, the server may record attribute information for the video, segment by segment, and store it in a preset database. The attribute information includes: the file name of the video to be identified (Video Name), the number of the segment within the video (Segment Number), the feature vector extracted from the segment (Feature Vector), the hand direction identified in the segment (Hand Direction), the recognition time of the segment (Recognition Time), the name of the recognition model used to identify the hand direction (Model Name), the data source of the video to which the segment belongs (Data Source), and the like. Storing the recognized video segment by segment, with its attribute information, in the database lets users retrieve and review it at any time, further improving the transmission of intangible heritage handicraft skills.
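The segment-level attribute record could be persisted as in this sketch; the table name and column names are assumptions mapping one-to-one onto the fields listed above:

```python
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS recognized_segments (
    video_name       TEXT,    -- file name of the video to be identified
    segment_number   INTEGER, -- number of the segment within the video
    feature_vector   TEXT,    -- serialized feature vector of the segment
    hand_direction   TEXT,    -- recognized hand direction
    recognition_time TEXT,    -- when the segment was recognized
    model_name       TEXT,    -- recognition model used
    data_source      TEXT     -- platform the video came from
)"""


def store_segment(conn, record):
    """Insert one per-segment attribute record (a 7-tuple in SCHEMA order)."""
    conn.execute(
        "INSERT INTO recognized_segments VALUES (?, ?, ?, ?, ?, ?, ?)", record)
    conn.commit()
```

An in-memory database is enough to exercise the round trip; retrieval for later review is then a plain SELECT on, e.g., the video name or hand direction.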
According to this embodiment, the recognition model constructed based on the C3D network is trained; because the C3D network can learn and model temporal information, the recognition accuracy of a recognition model built on it is considerably higher than that of a traditional two-dimensional convolutional neural network. When selecting training samples, handicraft production videos of different handicraft categories and different craftsmen are crawled as sample videos, which improves the generalization capability and universality of the trained recognition model and makes it applicable to different recognition scenes. Visually displaying the hand direction and the hand action identified from the video to be identified lets a user see more clearly, carefully, and accurately the detailed steps taken by an intangible-heritage handicraft inheritor in making a handicraft work, so that the user can learn and produce easily with a sense of achievement and satisfaction, become a new handicraft producer and propagator, and keep the intangible-heritage handicraft inheritance alive.
In some embodiments, after obtaining the handicraft production videos of the different craftsmen, the server also needs to preprocess them and use the preprocessed videos as the sample videos. The preprocessing may be implemented through the steps shown in fig. 3, which specifically include:
step 201, denoising and smoothing the hand-made video.
In a specific implementation, after the server crawls a handicraft production video, it first performs denoising and smoothing on the video, that is, removes noise and unnecessary information from the video and retains only the hand information.
Step 202, normalizing the picture size of each craftwork video, wherein the picture size of each normalized craftwork video is the same.
In a specific implementation, the handicraft production videos obtained by crawling differ in specification and standard, which is not conducive to stable training of the recognition model. The server therefore normalizes the picture sizes of the handicraft production videos so that all normalized videos have the same picture size, which improves the accuracy and stability of the recognition effect of the trained recognition model.
And 203, extracting three-dimensional coordinate data of hand movements in the handicraft production video and performing time alignment.
In a specific implementation, in order to facilitate subsequent feature extraction, the server may first extract three-dimensional coordinate data of hand motions in the hand-craftwork production video and perform time alignment in a preprocessing stage, so as to shorten model training time.
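Since the patent does not specify the alignment method, the sketch below shows one simple form: resampling each extracted (T, 3) hand-coordinate track onto a fixed number of uniformly spaced time steps, so that clips of different lengths become directly comparable. The sample count of 64 is an assumption.

```python
import numpy as np

def time_align(track, n_samples=64):
    """Resample a (T, 3) hand-coordinate track onto n_samples uniformly
    spaced time steps (one simple form of time alignment; the method
    itself is an assumption, not specified by the patent)."""
    track = np.asarray(track, dtype=float)
    src = np.linspace(0.0, 1.0, len(track))   # original time axis, normalized
    dst = np.linspace(0.0, 1.0, n_samples)    # common target time axis
    # Interpolate each coordinate axis (x, y, z) independently.
    return np.stack([np.interp(dst, src, track[:, k]) for k in range(3)], axis=1)

short = np.random.rand(30, 3)   # 30-frame clip
long_ = np.random.rand(90, 3)   # 90-frame clip
a, b = time_align(short), time_align(long_)
```

More elaborate alignment (e.g. dynamic time warping between tracks) would serve the same purpose where motion speed varies nonlinearly between videos.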
In some examples, the server calculates the similarity between the key frames, which may be implemented by the steps shown in fig. 4, including:
step 301, for a first key frame and a second key frame, establishing a first window and a second window with the same size at the same initial position of the first key frame and the second key frame, respectively.
In a specific implementation, for the first key frame and the second key frame, the server establishes a first window and a second window with the same size at the same initial position of the first key frame and the second key frame, for example, establishes a first window and a second window with the size of n×n at the upper left corner of the first key frame and the second key frame, respectively.
Step 302, calculating local SSIM according to the pixel value of each pixel point in the first window and the pixel value of each pixel point in the second window.
In a specific implementation, for the first window and the second window, the server may calculate the local SSIM according to the pixel value of each pixel point in the first window and the pixel value of each pixel point in the second window by the following formula:

SSIM(x, y) = ((2μ_x μ_y + c_1)(2σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))

c_1 = (k_1 L)², c_2 = (k_2 L)², k_1 = 0.01, k_2 = 0.03

where x represents the first key frame, y represents the second key frame, μ_x is the mean of the pixel values of the pixel points in the first window, μ_y is the mean of the pixel values of the pixel points in the second window, σ_x² is the variance of the pixel values of the pixel points in the first window, σ_y² is the variance of the pixel values of the pixel points in the second window, σ_xy is the covariance between the pixel values of the pixel points in the first window and the pixel values of the pixel points in the second window, and L represents the dynamic range of the pixel values, generally 255.
Step 303, sliding the first window and the second window synchronously according to a preset step length, and calculating to obtain each local SSIM.
In a specific implementation, the server may slide the first window and the second window synchronously according to a preset step length, calculating a local SSIM at each position. For example, if the sizes of the first window and the second window are N×N and the preset step length is N, the server slides both windows N pixels to the right at each step.
Step 304, average all the local SSIMs obtained by calculation to obtain a global SSIM, and taking the global SSIM as the similarity between the first key frame and the second key frame.
In a specific implementation, the SSIM index measures the similarity between two pictures well. However, because hand motion is subtle and a frame contains many regions, directly calculating a global SSIM is inaccurate and dilutes the differences in individual local regions. The method therefore calculates each local SSIM and then averages them to obtain the global SSIM, which better determines the similarity between the first key frame and the second key frame.
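Steps 301-304 can be sketched as follows, using the formula and constants given above. The window size of 8 and step of 8 are assumptions (the patent only specifies N×N windows with a preset step length), and the inputs are single-channel frames for simplicity.

```python
import numpy as np

K1, K2, L = 0.01, 0.03, 255.0
C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2

def local_ssim(wx, wy):
    """SSIM of one window pair, per the formula above (step 302)."""
    mx, my = wx.mean(), wy.mean()
    vx, vy = wx.var(), wy.var()
    cov = ((wx - mx) * (wy - my)).mean()
    return ((2 * mx * my + C1) * (2 * cov + C2)) / \
           ((mx**2 + my**2 + C1) * (vx + vy + C2))

def global_ssim(x, y, n=8, step=8):
    """Slide an n x n window over both frames in lockstep (steps 301, 303)
    and average the local SSIMs into a global SSIM (step 304)."""
    vals = [local_ssim(x[i:i+n, j:j+n], y[i:i+n, j:j+n])
            for i in range(0, x.shape[0] - n + 1, step)
            for j in range(0, x.shape[1] - n + 1, step)]
    return float(np.mean(vals))

frame = np.random.rand(64, 64) * 255  # a synthetic grayscale key frame
```

Identical frames yield a global SSIM of 1, and the method then groups video segments whose key frames fall below the first similarity threshold into separate folders.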
The above steps of the methods are divided, for clarity of description, and may be combined into one step or split into multiple steps when implemented, so long as they include the same logic relationship, and they are all within the protection scope of this patent; it is within the scope of this patent to add insignificant modifications to the algorithm or flow or introduce insignificant designs, but not to alter the core design of its algorithm and flow.
Another embodiment of the present application relates to an electronic device, as shown in fig. 5, comprising: at least one processor 401; and a memory 402 communicatively coupled to the at least one processor 401; wherein the memory 402 stores instructions executable by the at least one processor 401, the instructions being executable by the at least one processor 401 to enable the at least one processor 401 to perform the behavior recognition based hand direction recognition method in the above embodiments.
Where the memory and the processor are connected by a bus, the bus may comprise any number of interconnected buses and bridges, the buses connecting the various circuits of the one or more processors and the memory together. The bus may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or may be a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor is transmitted over the wireless medium via the antenna, which further receives the data and transmits the data to the processor.
The processor is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And memory may be used to store data used by the processor in performing operations.
Another embodiment of the present application relates to a computer-readable storage medium storing a computer program. The computer program implements the above-described method embodiments when executed by a processor.
That is, it will be understood by those skilled in the art that all or part of the steps in implementing the methods of the embodiments described above may be implemented by a program stored in a storage medium, where the program includes several instructions for causing a device (which may be a single-chip microcomputer, a chip or the like) or a processor (processor) to perform all or part of the steps in the methods of the embodiments described herein. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, etc., which can store program codes.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific embodiments in which the present application is implemented and that various changes in form and details may be made therein without departing from the spirit and scope of the present application.
Claims (10)
1. The hand direction recognition method based on behavior recognition is characterized by comprising the following steps of:
acquiring a plurality of hand craft production videos of different hand craft categories and different craftsmen as sample videos;
dividing the sample video into a plurality of video clips, extracting key frames from the video clips, calculating the similarity between the key frames, and storing the video clips corresponding to the key frames with the similarity smaller than a first threshold value into the same folder;
extracting feature vectors of all video clips in the folder, solving a first average value, marking the first average value as labels of all video clips in the folder, and training an identification model which is built in advance based on a C3D network based on the video clips and the labels thereof;
inputting the video to be identified into the identification model after training, obtaining the hand direction in the video to be identified, and visually displaying the hand direction in the video to be identified.
2. The behavior recognition-based hand direction recognition method according to claim 1, wherein the acquiring a plurality of handicraft production videos of different handicraft categories and different craftsmen as sample videos comprises:
preprocessing a plurality of acquired craftsman production videos with different craftsman types, and taking the preprocessed craftsman production videos as sample videos;
the pretreatment at least comprises:
denoising and smoothing the handicraft video;
normalizing the picture size of each of the craftwork video, wherein the picture sizes of the normalized craftwork video are the same;
and extracting three-dimensional coordinate data of hand movements in the handicraft video and performing time alignment.
3. The behavior recognition-based hand direction recognition method according to claim 1, wherein the calculating the similarity between the key frames comprises:
for a first key frame and a second key frame, respectively establishing a first window and a second window with the same size at the same initial position of the first key frame and the second key frame;
calculating a local structure similarity index SSIM according to the pixel value of each pixel point in the first window and the pixel value of each pixel point in the second window;
synchronously sliding the first window and the second window according to a preset step length, and calculating to obtain each local SSIM;
and averaging all the local SSIM obtained by calculation to obtain a global SSIM, and taking the global SSIM as the similarity between the first key frame and the second key frame.
4. A method of hand direction recognition based on behavior recognition according to claim 3, wherein the local SSIM is calculated according to the pixel value of each pixel point in the first window and the pixel value of each pixel point in the second window by the following formula:

SSIM(x, y) = ((2μ_x μ_y + c_1)(2σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2))

c_1 = (k_1 L)², c_2 = (k_2 L)², k_1 = 0.01, k_2 = 0.03

wherein x represents the first key frame, y represents the second key frame, μ_x is the mean of the pixel values of the pixel points in the first window, μ_y is the mean of the pixel values of the pixel points in the second window, σ_x² is the variance of the pixel values of the pixel points in the first window, σ_y² is the variance of the pixel values of the pixel points in the second window, σ_xy is the covariance between the pixel values of the pixel points in the first window and the pixel values of the pixel points in the second window, and L represents the dynamic range of the pixel values.
5. The behavior recognition-based hand direction recognition method according to claim 1, wherein the extracting feature vectors of each video clip in the folder and calculating a first average value, and labeling the first average value as a label of each video clip in the folder comprises:
extracting feature vectors of all video clips in the folder, solving a first average value, and taking the first average value as a representative vector of the folder;
calculating the distance between the representative vector and five preset standard vectors, marking the standard vector with the minimum distance with the representative vector as a first label of each video segment in the folder, and marking the category of the standard vector with the minimum distance with the representative vector as a second label of each video segment in the folder;
wherein the five standard vectors include: a first standard vector representing the hand moving from top to bottom, a second standard vector representing the hand moving from bottom to top, a third standard vector representing the hand rotating clockwise, a fourth standard vector representing the hand rotating counterclockwise, and a fifth standard vector representing the hand rotating in opposite directions.
6. The behavior recognition-based hand direction recognition method according to claim 5, wherein training a recognition model previously constructed based on a C3D network based on the video clip and a tag thereof comprises:
training a preset C3D network to be converged based on the video segment and the first label thereof to obtain a trained C3D network;
training a preset classifier to be converged based on the video segment and the second label thereof to obtain a trained classifier; wherein, the classifier is a Support Vector Machine (SVM);
and splicing the trained C3D network and the trained classifier to obtain a trained recognition model.
7. The behavior recognition-based hand direction recognition method according to any one of claims 1 to 6, wherein the visually displaying in the video to be recognized includes:
adding an arrow mark and a character mark at corresponding positions in the video to be identified to represent the identified hand direction; the corresponding positions are positions when the hand motion corresponding to the hand direction occurs.
8. The behavior recognition-based hand direction recognition method according to any one of claims 1 to 6, wherein the inputting the video to be recognized into the recognition model after training to obtain the hand direction in the video to be recognized includes:
dividing a video to be identified into a plurality of video segments to be identified, and inputting each video segment to be identified into an identification model after training is completed to obtain the hand direction in each video segment to be identified;
after the obtaining the hand direction in the video to be identified, the method further includes:
taking the video segment to be identified as a unit, recording attribute information and then storing the video to be identified into a preset database; wherein the attribute information includes: the video file name of the video to be identified, the number of the video segment to be identified within the video to be identified, the feature vector extracted from the video segment to be identified, the hand direction identified in the video segment to be identified, the identification time corresponding to the video segment to be identified, the name of the identification model used when identifying the hand direction in the video segment to be identified, and the data source of the video to be identified to which the video segment to be identified belongs.
9. An electronic device, comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the behavior recognition based hand direction recognition method of any one of claims 1 to 8.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the behavior recognition-based hand direction recognition method of any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311423880.9A CN117373130A (en) | 2023-10-31 | 2023-10-31 | Behavior recognition-based hand direction recognition method, electronic device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117373130A true CN117373130A (en) | 2024-01-09 |
Family
ID=89402138
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311423880.9A Pending CN117373130A (en) | 2023-10-31 | 2023-10-31 | Behavior recognition-based hand direction recognition method, electronic device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117373130A (en) |
- 2023-10-31: CN202311423880.9A filed; patent application CN117373130A, status Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102266529B1 (en) | Method, apparatus, device and readable storage medium for image-based data processing | |
CN108073910B (en) | Method and device for generating human face features | |
CN108280477B (en) | Method and apparatus for clustering images | |
CN104199834B (en) | The method and system for obtaining remote resource from information carrier surface interactive mode and exporting | |
CN112287820A (en) | Face detection neural network, face detection neural network training method, face detection method and storage medium | |
CN108229293A (en) | Face image processing process, device and electronic equipment | |
CN107895160A (en) | Human face detection and tracing device and method | |
CN107908641B (en) | Method and system for acquiring image annotation data | |
US20200285951A1 (en) | Figure captioning system and related methods | |
EP3852061A1 (en) | Method and device for damage segmentation of vehicle damage image | |
WO2023020005A1 (en) | Neural network model training method, image retrieval method, device, and medium | |
US20230041943A1 (en) | Method for automatically producing map data, and related apparatus | |
CN105493078A (en) | Color sketch image searching | |
KR20200059993A (en) | Apparatus and method for generating conti for webtoon | |
CN110059637B (en) | Face alignment detection method and device | |
CN110399547B (en) | Method, apparatus, device and storage medium for updating model parameters | |
CN110390724A (en) | A kind of SLAM method with example segmentation | |
CN112381118B (en) | College dance examination evaluation method and device | |
CN113505786A (en) | Test question photographing and judging method and device and electronic equipment | |
CN108369647B (en) | Image-based quality control | |
CN111144466B (en) | Image sample self-adaptive depth measurement learning method | |
CN116958512A (en) | Target detection method, target detection device, computer readable medium and electronic equipment | |
CN117373130A (en) | Behavior recognition-based hand direction recognition method, electronic device and storage medium | |
CN113255701B (en) | Small sample learning method and system based on absolute-relative learning framework | |
US20220300836A1 (en) | Machine Learning Techniques for Generating Visualization Recommendations |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||