CN113158748A - Hand detection tracking and musical instrument detection combined interaction method and system - Google Patents

Hand detection tracking and musical instrument detection combined interaction method and system

Info

Publication number
CN113158748A
Authority
CN
China
Prior art keywords
hand
detection
recognition model
value
instrument
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110147352.XA
Other languages
Chinese (zh)
Inventor
段若愚
史明
韩钰浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xiaopangxiong Technology Co ltd
Original Assignee
Hangzhou Xiaopangxiong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xiaopangxiong Technology Co ltd filed Critical Hangzhou Xiaopangxiong Technology Co ltd
Priority to CN202110147352.XA priority Critical patent/CN113158748A/en
Publication of CN113158748A publication Critical patent/CN113158748A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 - Static hand or arm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4046 - Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/20 - Scenes; Scene-specific elements in augmented reality scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 - Static hand or arm
    • G06V 40/117 - Biometrics derived from hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention provides a combined hand detection tracking and musical instrument detection interaction method and system. The method comprises collecting video and/or images with a collecting device, and further comprises the following steps: generating a hand recognition model and a musical instrument detection model; inputting the collected video into the hand recognition model and the musical instrument detection model respectively for detection; and using a judgment rule to judge whether the hand key detection point position lies within the position of the musical instrument to be identified.

Description

Hand detection tracking and musical instrument detection combined interaction method and system
Technical Field
The invention relates to the technical field of AR interaction, in particular to a hand detection tracking and musical instrument detection combined interaction method and system.
Background
As technology becomes more deeply combined with entertainment content, audience demands for entertainment experiences are gradually shifting from single-content, low-frequency interaction toward more personalized, higher-quality, and more interactive experiences. Interactive video has emerged in response: viewers can click interactive components that appear in the video to choose the branch storyline or viewing angle they want to watch. Whereas traditional video is watched passively, interactive video gives the viewer greater immersion and interactivity.
AR is a comprehensive integration technology involving computer graphics, human-computer interaction, sensing technology, artificial intelligence, and other fields. It uses a computer to generate realistic three-dimensional visual, auditory, olfactory, and other sensory stimuli, so that a participant can naturally experience and interact with a virtual world through suitable devices. When the user moves, the computer immediately performs complex calculations and returns an accurate 3D image of the world, generating a sense of presence. The technology integrates the latest developments in computer graphics (CG), computer simulation, artificial intelligence, sensing, display, and network parallel processing, and is a high-level simulation system built with the aid of computer technology.
Existing AR technology requires the user to wear AR glasses. People, however, perceive the world through all five senses, while AR technology only assists perception through vision and hearing. Although some tactile feedback can be obtained by holding an interactive device in the hand, it is still very different from real physical touch and therefore cannot provide the best AR interaction experience.
The invention patent application with publication number 109358748A discloses "a device and method for interaction between a hand and an AR virtual object on a mobile phone", comprising a mobile phone and a gesture tracking device matched with it; the gesture tracking device is mounted on the back of the mobile phone and is in communication connection with it. The mobile phone runs an Android system that supports the ARCore technology and has an ARCore mobile phone app matched with the gesture tracking device installed on it. The gesture tracking device comprises a software API and a gesture recognition module, which are used to capture data on the motion, position, and displacement of the user's hand within the working range and transmit the data to the ARCore mobile phone app in real time; the display screen of the mobile phone is used to display the AR scene. The disadvantage of this method is that it cannot identify the playing area of a musical instrument and cannot provide hand key point parameters.
Disclosure of Invention
To solve the above technical problems, the present invention provides a combined hand detection tracking and musical instrument detection interaction method and system, which detect the key points of a hand and the key areas of a musical instrument, judge whether the hand key points lie within the key areas of the instrument, and then generate the corresponding sounds according to the judgment result.
The invention provides a combined interaction method of hand detection tracking and musical instrument detection, which comprises the steps of using a collecting device to collect video and/or images, and further comprises the following steps:
generating a hand recognition model and a musical instrument detection model;
respectively inputting the collected videos into the hand recognition model and the musical instrument detection model for detection;
and judging whether the hand key detection point position is in the position of the musical instrument to be identified by using a judgment rule.
Preferably, the hand recognition model generation method includes the following substeps:
configuring camera parameters, collecting instrument data in batches, and labeling in a manner of labeling X, Y values of N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points;
processing the acquired images and generating an estimate X, Y value;
calculating the Euclidean distance between the estimated X, Y value and the X, Y value obtained by collecting data;
and when the Euclidean distance is smaller than a designated threshold value, saving the parameters to generate a hand recognition model.
In any of the above solutions, it is preferable that the method of processing the acquired image and generating the estimated X, Y value includes the following sub-steps:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
performing a pooling operation on the feature map with the average pooling layer, and estimating the X, Y values by regression on the pooled output (a minimal code sketch of this network follows these sub-steps).
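The patent does not give code for this network, so the following is only a minimal sketch under the assumption of a TensorFlow/Keras implementation; the 256 x 256 input size and the 21 key points are taken from the embodiments, and all function and loss names are illustrative.

```python
import tensorflow as tf

N_KEYPOINTS = 21      # number of hand key points N, value taken from the embodiments
INPUT_SIZE = 256      # images are compressed to 256 x 256 before training

def build_hand_keypoint_model():
    # feature extraction with a MobileNet deep convolutional neural network
    backbone = tf.keras.applications.MobileNet(
        input_shape=(INPUT_SIZE, INPUT_SIZE, 3), include_top=False, weights=None)
    # average pooling layer over the feature map
    pooled = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    # regression of the N normalized (X, Y) key point values
    keypoints = tf.keras.layers.Dense(2 * N_KEYPOINTS, activation="sigmoid")(pooled)
    return tf.keras.Model(backbone.input, keypoints)

def euclidean_distance_loss(y_true, y_pred):
    # Euclidean distance between the estimated X, Y values and the labeled X, Y values
    return tf.reduce_mean(tf.sqrt(tf.reduce_sum(tf.square(y_true - y_pred), axis=-1)))

model = build_hand_keypoint_model()
model.compile(optimizer="adam", loss=euclidean_distance_loss)
model.summary()
```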
In any of the above aspects, it is preferable that the method for generating the instrument recognition model includes the substeps of:
configuring camera parameters, collecting instrument data in batches, marking the instrument data in a way that a rectangular frame is adopted to represent the position of an object, and recording the values of xmin and ymin at the upper left corner and xmax and ymax at the lower right corner of the rectangular frame;
converting the marked top-left values x_tl, y_tl and bottom-right values x_br, y_br using the formulas x/width and y/height respectively, to generate the normalized top-left values (x_tlw, y_tlh) and the normalized bottom-right values (x_brw, y_brh), where width represents the width of the photo and height represents the height of the photo (a short sketch of this conversion follows these sub-steps);
learning the image data using a neural network and generating vector values for estimates (xmin, ymin) and (xmax, ymax);
calculating the cross entropy between the estimated vector values (xmin, ymin) and (xmax, ymax) and the vector values (x_tlw, y_tlh) and (x_brw, y_brh) obtained from the collected data;
and when the cross entropy is smaller than a specified threshold value, saving the weights to generate the model.
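As a concrete illustration of the coordinate conversion in these sub-steps, the sketch below applies the x/width and y/height formulas to one labeled rectangle. It is a plain-Python example with made-up numbers, not code from the patent.

```python
def normalize_box(x_tl, y_tl, x_br, y_br, width, height):
    """Apply the x/width and y/height formulas to the labeled corner coordinates."""
    return (x_tl / width, y_tl / height), (x_br / width, y_br / height)

# example: a 640 x 640 photo with a box labeled from (100, 150) to (300, 400)
top_left, bottom_right = normalize_box(100, 150, 300, 400, width=640, height=640)
print(top_left)       # (0.15625, 0.234375)
print(bottom_right)   # (0.46875, 0.625)
```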
In any of the above aspects, preferably, the method for learning image data using a neural network and generating vector values of the estimates (xmin, ymin) and (xmax, ymax) includes the sub-steps of:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
and performing a pooling operation on the feature map with the average pooling layer, and estimating the x and y values by regression on the pooled output.
In any of the above schemes, preferably, the method for inputting the collected video into the hand recognition model and the instrument detection model for detection includes starting the collection device, loading the hand recognition model and the instrument recognition model at the same time, and transmitting the video frames collected by the collection device to the hand recognition model and the instrument recognition model respectively.
In any of the above schemes, preferably, the judgment rule is to judge whether the following formula is a logical true value; if so, the corresponding note can be played, and if not, the corresponding note cannot be played, wherein the formula is [xmin ≤ X and X ≤ xmax and ymin ≤ Y and Y ≤ ymax].
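The judgment rule reduces to a simple interval test. A minimal sketch is shown below (plain Python, with illustrative names and made-up coordinates).

```python
def key_point_in_instrument(x, y, xmin, ymin, xmax, ymax):
    """Logical truth value of [xmin <= x and x <= xmax and ymin <= y and y <= ymax]."""
    return xmin <= x <= xmax and ymin <= y <= ymax

# a hand key point at (0.4, 0.5) inside an instrument region (0.3, 0.2)-(0.7, 0.8)
print(key_point_in_instrument(0.4, 0.5, 0.3, 0.2, 0.7, 0.8))   # True -> the note can be played
```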
A second object of the present invention is to provide a combined interaction system of hand detection tracking and musical instrument detection, comprising a capturing device for capturing video and/or images, further comprising the following modules:
a training module: used for generating the hand recognition model and the musical instrument detection model;
a detection module: used for inputting the collected video into the hand recognition model and the musical instrument detection model respectively for detection, and for using a judgment rule to judge whether the hand key detection point position lies within the position of the musical instrument to be recognized;
the system performs a combined hand detection tracking and instrument detection interaction in accordance with the method of claim 1.
Preferably, the hand recognition model generation method includes the following substeps:
configuring camera parameters, collecting instrument data in batches, and labeling in a manner of labeling X, Y values of N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points;
processing the acquired images and generating an estimate X, Y value;
calculating the Euclidean distance between the estimated X, Y value and the X, Y value obtained by collecting data;
and when the Euclidean distance is smaller than a designated threshold value, saving the parameters to generate a hand recognition model.
In any of the above solutions, it is preferable that the method of processing the acquired image and generating the estimated X, Y value includes the following sub-steps:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
performing a pooling operation on the feature map with the average pooling layer, and estimating the X, Y values by regression on the pooled output.
In any of the above aspects, it is preferable that the method for generating the instrument recognition model includes the substeps of:
configuring camera parameters, collecting instrument data in batches, marking the instrument data in a way that a rectangular frame is adopted to represent the position of an object, and recording the values of xmin and ymin at the upper left corner and xmax and ymax at the lower right corner of the rectangular frame;
converting the marked top-left values x_tl, y_tl and bottom-right values x_br, y_br using the formulas x/width and y/height respectively, to generate the normalized top-left values (x_tlw, y_tlh) and the normalized bottom-right values (x_brw, y_brh), where width represents the width of the photo and height represents the height of the photo;
learning the image data using a neural network, generating vector values of estimates (xmin, ymin) and (xmax, ymax);
calculating the cross entropy between the estimated vector values (xmin, ymin) and (xmax, ymax) and the vector values (x_tlw, y_tlh) and (x_brw, y_brh) obtained from the collected data;
and when the cross entropy is smaller than a specified threshold value, saving the weight generation model.
In any of the above aspects, preferably, the method for learning image data using a neural network and generating vector values of the estimates (xmin, ymin) and (xmax, ymax) includes the sub-steps of:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
and performing a pooling operation on the feature map with the average pooling layer, and estimating the x and y values by regression on the pooled output.
In any of the above schemes, preferably, the detection module is further configured to start the collection device, load the hand recognition model and the musical instrument recognition model at the same time, and transmit the video frames collected by the collection device to the hand recognition model and the musical instrument recognition model, respectively.
In any of the above schemes, preferably, the judgment rule is to judge whether the following formula is a logical true value; if so, the corresponding note can be played, and if not, the corresponding note cannot be played, wherein the formula is [xmin ≤ X and X ≤ xmax and ymin ≤ Y and Y ≤ ymax].
The invention provides a combined hand detection tracking and musical instrument detection interaction method and system, which can realize interaction between the hand and real objects in an AR scene and produce interaction effects such as sound and animation.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a combined hand detection tracking and instrument detection interaction method according to the present invention.
FIG. 2 is a block diagram of a preferred embodiment of the combined interaction system of hand detection tracking and instrument detection according to the present invention.
FIG. 3 is a flowchart of an embodiment of program-initiated recognition for a combined hand detection tracking and instrument detection interaction method according to the present invention.
FIG. 4 is a flowchart of an embodiment of the hand keypoint training of the combined interaction method of hand detection tracking and instrument detection according to the present invention.
FIG. 5 is a flow chart of an embodiment of instrument recognition training according to the combined interaction method of hand detection tracking and instrument detection of the present invention.
FIG. 6 is a diagram of an embodiment of a combined hand detection tracking and instrument detection interaction method for hand-to-board instrument contact identification in accordance with the present invention.
Detailed Description
The invention is further illustrated with reference to the figures and the specific examples.
Example one
As shown in fig. 1, step 100 is performed to capture video and/or images using a capture device.
Step 110 is executed to generate a hand recognition model and a musical instrument detection model. The generation method of the hand recognition model comprises the following substeps:
step 101: and (3) configuring camera parameters, collecting instrument data in batches, and labeling, wherein the labeling mode is that X, Y values are labeled on N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points.
Step 102: the acquired image is processed to generate an estimate X, Y value. Compressing the batch of collected images, performing normalization operation on the compressed images, performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map, performing pooling operation on the feature map by using an average pooling layer, and performing regression estimation X, Y on the generated value of the average pooling layer.
Step 103: the euclidean distance between the estimated X, Y value and the X, Y value obtained from the collected data is calculated.
Step 104: it is determined whether the euclidean distance is smaller than a specified threshold value (in the present embodiment, the specified threshold value is set to 0.1). If the Euclidean distance is greater than the specified threshold, then steps 105 and 103 are performed sequentially, the reduction threshold (in this embodiment, the reduction threshold is set to 0.01) will be subtracted from the initialization parameter of the deep convolutional neural network, and the Euclidean distance is recalculated. If the Euclidean distance is smaller than a designated threshold value, step 106 is executed, and parameters are saved to generate a hand recognition model.
The method for generating the instrument identification model comprises the following sub-steps:
step 111: and configuring camera parameters, collecting instrument data in batches, marking in a way of adopting a rectangular frame to represent the position of an object, and recording the values of xmin and ymin at the upper left corner and xmax and ymax at the lower right corner of the rectangular frame.
Step 112: to the marked upper left corner xtl、ytlValue and lower right corner xbr、ybrThe values are respectively converted by using x/width and y/height formulas to generate a normalized value (x) at the upper left cornert1w、ytlh) Normalized to the lower right corner (x)brw、ybrh) Wherein width represents the width of the photo and height represents the height of the photo.
Step 113: the vector values of the estimates (xmin, ymin) and (xmax, ymax) are generated using neural network learning image data. Compressing the batch of collected images, performing normalization operation on the compressed images, performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map, performing pooling operation on the feature map by using an average pooling layer, and performing a regression vector value on a generated value of the average pooling layer.
Step 114: calculating the vector values of the estimates (xmin, ymin) and (xmax, ymax) and (x) obtained by collecting datat1w、ytlh) And (x)brw、ybrh) Cross entropy of vector values of (a).
Step 115: it is judged whether or not the cross entropy is smaller than a prescribed threshold value (in the present embodiment, the prescribed threshold value is set to 0.1). If the cross entropy is greater than the specified threshold, then steps 105 and 103 are performed sequentially, the initialization parameter of the deep convolutional neural network is subtracted by the reduction threshold (in this embodiment, the reduction threshold is set to 0.01), and the cross entropy is recalculated. If the cross entropy is less than the specified threshold, step 106 is performed to save the weight generation model.
Step 120 is executed to input the collected video into the hand recognition model and the musical instrument detection model respectively for detection: the collection device is started, the hand recognition model and the musical instrument recognition model are loaded at the same time, and the video frames collected by the collection device are transmitted to the hand recognition model and the musical instrument recognition model respectively.
Step 130 is executed to judge, with the judgment rule, whether the hand key detection point position is in the position of the musical instrument to be identified. The judgment rule is to judge whether the following formula is a logical true value; if so, the corresponding note can be played, and if not, the corresponding note cannot be played, wherein the formula is [xmin ≤ X and X ≤ xmax and ymin ≤ Y and Y ≤ ymax].
Example two
As shown in fig. 2, a combined interaction system of hand detection tracking and musical instrument detection includes a collection device 200, a training module 210 and a detection module 220.
The collection device 200: for capturing video and/or images.
The training module 210: for generating hand recognition models and instrument detection models. The generation method of the hand recognition model comprises the following substeps:
step 101: and (3) configuring camera parameters, collecting instrument data in batches, and labeling, wherein the labeling mode is that X, Y values are labeled on N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points.
Step 102: the acquired image is processed to generate an estimate X, Y value. Compressing the batch of collected images, performing normalization operation on the compressed images, performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map, performing pooling operation on the feature map by using an average pooling layer, and performing regression estimation X, Y on the generated value of the average pooling layer.
Step 103: the euclidean distance between the estimated X, Y value and the X, Y value obtained from the collected data is calculated.
Step 104: when the euclidean distance is smaller than a specified threshold value (in the present embodiment, the specified threshold value is set to 0.1), the hand recognition model is generated by saving the parameters.
The method for generating the instrument identification model comprises the following sub-steps:
step 111: and configuring camera parameters, collecting instrument data in batches, marking in a way of adopting a rectangular frame to represent the position of an object, and recording the values of xmin and ymin at the upper left corner and xmax and ymax at the lower right corner of the rectangular frame.
Step 112: to the marked upper left corner xtl、ytlValue and lower right corner xbr、ybrThe values are respectively converted by using x/width and y/height formulas to generate a normalized value (x) at the upper left cornert1w、ytlh) Normalized to the lower right corner (x)brw、ybrh) Wherein width represents the width of the photo and height represents the height of the photo.
Step 113: the image data is learned using a neural network and vector values of estimates (xmin, ymin) and (xmax, ymax) are generated. Compressing the batch of collected images, performing normalization operation on the compressed images, performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map, performing pooling operation on the feature map by using an average pooling layer, and performing regression estimation on vector values of (xmin, ymin) and (xmax, ymax) by using the generated value of the average pooling layer. Step 114: calculating the vector values of the estimates (xmin, ymin) and (xmax, ymax) and (x) obtained by collecting datat1w、ytlh) And (x)brw、ybrh) Cross entropy of vector values of (a).
Step 115: when the cross entropy is smaller than a prescribed threshold (in the present embodiment, the prescribed threshold is set to 0.1), the weight generation model is saved.
The detection module 220: used for inputting the collected video into the hand recognition model and the musical instrument detection model respectively for detection, and for judging, with the judgment rule, whether the hand key detection point position is in the position of the musical instrument to be recognized.
The detection module 220 is further configured to start the collection device, load the hand recognition model and the musical instrument recognition model at the same time, and transmit the video frames collected by the collection device to the hand recognition model and the musical instrument recognition model respectively. The judgment rule is to judge whether the following formula is a logical true value; if so, the corresponding note can be played, and if not, the corresponding note cannot be played, wherein the formula is [xmin ≤ X and X ≤ xmax and ymin ≤ Y and Y ≤ ymax].
EXAMPLE III
The invention can realize the interaction between the hand and the actual object in the AR scene and can generate the interaction effects of sound, animation and the like. The implementation method is shown in fig. 3, and the technical scheme is as follows:
First, the hand detection method (as shown in FIG. 4)
1. Hand detection and tracking;
2. Hand and object detection and tracking.
3. The camera parameters are configured so that photos of size 480 x 480 are collected; instrument data are collected in batches and labeled. The labeling marks X, Y values for the 21 hand key points according to the position of the hand in the image.
4. The batch-collected images are compressed to 256 × 256 and normalized; features are extracted with a MobileNet deep convolutional neural network to generate a feature map; the feature map is pooled with an average pooling layer; the X, Y values are estimated by regression on the pooled output; and the Euclidean distance between the estimated X, Y values and the labeled X, Y values of the collected data is computed.
5. If the computed Euclidean distance is greater than 0.1, 0.01 is subtracted from the initialization parameters of the deep convolutional neural network and step 4 is repeated; training stops once the Euclidean distance falls below 0.1.
Second, musical instrument detection method (as shown in FIG. 5)
6. The camera parameters are configured so that photos of size 640 x 640 are collected; instrument data are collected in batches and labeled. The labeling uses a rectangular box to represent the position of the object, recording the x and y values of the top-left corner and the x and y values of the bottom-right corner.
7. The marked top-left x, y values and bottom-right x, y values are converted with the formulas x/width and y/height, where width and height represent the width and height of the photo respectively.
8. The batch-collected images are scaled to 320 x 320 and the compressed images are normalized; features are extracted with a MobileNet deep convolutional neural network to generate a feature map; the feature map is pooled with an average pooling layer; the X, Y values are estimated by regression on the pooled output; and the cross entropy between the estimated X, Y values and the normalized X, Y values of the collected data is computed.
9. If the computed cross entropy is greater than 0.1, 0.01 is subtracted from the initialization parameters of the deep convolutional neural network and step 8 is repeated; once the cross entropy falls below 0.1, training stops and the weights are saved to generate the model file (a code sketch of this branch follows).
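For the instrument branch, a corresponding sketch is shown below, again assuming TensorFlow/Keras and placeholder data. Sigmoid outputs keep the estimated corners in [0, 1] so that a binary cross entropy against the normalized corner values can stand in for the cross entropy described in steps 8-9; the saved file name is illustrative.

```python
import numpy as np
import tensorflow as tf

INPUT_SIZE = 320                                   # images are scaled to 320 x 320 (step 8)

# placeholder data standing in for the labeled batch of collected instrument images
images = np.random.rand(8, INPUT_SIZE, INPUT_SIZE, 3).astype("float32")
boxes = np.random.rand(8, 4).astype("float32")     # normalized (xmin, ymin, xmax, ymax) labels

backbone = tf.keras.applications.MobileNet(
    input_shape=(INPUT_SIZE, INPUT_SIZE, 3), include_top=False, weights=None)
pooled = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
corners = tf.keras.layers.Dense(4, activation="sigmoid")(pooled)   # estimated corner vector
model = tf.keras.Model(backbone.input, corners)

# cross entropy between the estimated corners and the normalized labels, as described in step 8
model.compile(optimizer="adam", loss="binary_crossentropy")

for _ in range(100):                               # guard against non-convergence on fake data
    loss = model.fit(images, boxes, epochs=1, verbose=0).history["loss"][0]
    if loss < 0.1:                                 # step 9: stop once the loss is below 0.1
        break

model.save("instrument_detection.h5")              # save the weights to generate the model file
```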
10. The app starts the Camera, loads the hand recognition model and the instrument detection model at the same time, and transmits the video frames collected by the Camera to hand detection and instrument detection respectively.
11. The returned results are used to calculate whether the hand key detection point position is within the positions of the musical instruments to be identified, using the formula [xmin ≤ x and x ≤ xmax and ymin ≤ y and y ≤ ymax], where xmin, ymin, xmax, ymax represent the top-left and bottom-right x, y values of the instrument recognition result, and x, y represent the x, y values of the hand key point. If the hand key point is within the position of the instrument to be identified, the formula is a logical true value and the corresponding note is considered to be played. If the hand key point is not within the position of the instrument to be identified, the formula is a logical false value and the corresponding note cannot be played.
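Steps 10 and 11 can be sketched as a simple capture-and-judge loop. The sketch below assumes OpenCV for the camera, loads the illustrative model files saved by the training sketches above, and uses a hypothetical play_note() placeholder for the sound effect; it is not the patent's actual app code.

```python
import cv2
import tensorflow as tf

# illustrative file names, matching the training sketches above
hand_model = tf.keras.models.load_model("hand_recognition.h5", compile=False)
instrument_model = tf.keras.models.load_model("instrument_detection.h5", compile=False)

def play_note(index):
    print(f"play note {index}")                    # placeholder for the real sound effect

cap = cv2.VideoCapture(0)                          # step 10: start the camera
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    hand_in = cv2.resize(frame, (256, 256))[None].astype("float32") / 255.0
    inst_in = cv2.resize(frame, (320, 320))[None].astype("float32") / 255.0
    keypoints = hand_model.predict(hand_in, verbose=0).reshape(-1, 2)      # 21 normalized (x, y)
    regions = instrument_model.predict(inst_in, verbose=0).reshape(-1, 4)  # (xmin, ymin, xmax, ymax)
    for i, (xmin, ymin, xmax, ymax) in enumerate(regions):                 # step 11: judgment rule
        if any(xmin <= x <= xmax and ymin <= y <= ymax for x, y in keypoints):
            play_note(i)                           # a key point is inside the instrument region
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```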
Example four
As shown in fig. 6, the hand is in contact with a cardboard musical instrument. The points detected by the hand detection method and the points of the cardboard obtained by the musical instrument detection method are fused, and the corresponding sound effect is triggered when a hand point falls within the corresponding area of the cardboard instrument, so that the effect of playing music is achieved.
For a better understanding of the present invention, the foregoing detailed description has been given in conjunction with specific embodiments thereof, but not with the intention of limiting the invention thereto. Any simple modifications of the above embodiments according to the technical essence of the present invention still fall within the scope of the technical solution of the present invention. In the present specification, each embodiment is described with emphasis on differences from other embodiments, and the same or similar parts between the respective embodiments may be referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims (10)

1. A combined interaction method of hand detection tracking and musical instrument detection comprises the steps of collecting videos and/or images by using a collecting device, and is characterized by further comprising the following steps:
generating a hand recognition model and a musical instrument detection model;
respectively inputting the collected videos into the hand recognition model and the musical instrument detection model for detection;
and judging whether the hand key detection point position is in the position of the musical instrument to be identified by using a judgment rule.
2. A combined hand detection tracking and instrument detection interaction method as claimed in claim 1, wherein the generation method of the hand recognition model comprises the sub-steps of:
configuring camera parameters, collecting instrument data in batches, and labeling in a manner of labeling X, Y values of N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points;
processing the acquired images and generating an estimate X, Y value;
calculating the Euclidean distance between the estimated X, Y value and the X, Y value obtained by collecting data;
and when the Euclidean distance is smaller than a designated threshold value, saving the parameters to generate a hand recognition model.
3. A combined hand detection tracking and instrument detection interaction method as claimed in claim 2, wherein said method of processing the acquired images and generating an estimate X, Y value comprises the sub-steps of:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
performing a pooling operation on the feature map with the average pooling layer, and estimating the X, Y values by regression on the pooled output.
4. The method of interacting hand detection tracking with instrument detection as recited in claim 2, wherein the method of generating the instrument recognition model comprises the sub-steps of:
configuring camera parameters, collecting instrument data in batches, marking the instrument data in a way that a rectangular frame is adopted to represent the position of an object, and recording the values of xmin and ymin at the upper left corner and xmax and ymax at the lower right corner of the rectangular frame;
converting the marked top-left values x_tl, y_tl and bottom-right values x_br, y_br using the formulas x/width and y/height respectively, to generate the normalized top-left values (x_tlw, y_tlh) and the normalized bottom-right values (x_brw, y_brh), where width represents the width of the photo and height represents the height of the photo;
learning the image data using a neural network and generating vector values for estimates (xmin, ymin) and (xmax, ymax);
calculating the cross entropy between the estimated vector values (xmin, ymin) and (xmax, ymax) and the vector values (x_tlw, y_tlh) and (x_brw, y_brh) obtained from the collected data;
and when the cross entropy is smaller than a specified threshold value, saving the weight generation model.
5. The combined hand detection tracking and musical instrument detection interaction method according to claim 4, wherein the method of learning image data using a neural network and generating vector values for the estimates (xmin, ymin) and (xmax, ymax) comprises the sub-steps of:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
and performing pooling operation on the feature map by using the average pooling layer, and performing regression estimation on the x and y values by using the generated value of the average pooling layer.
6. The method of claim 4, wherein the step of inputting the captured video into the hand recognition model and the instrument detection model for detection comprises activating the capturing device, loading the hand recognition model and the instrument detection model simultaneously, and transmitting the video frames captured by the capturing device to the hand recognition model and the instrument detection model respectively.
7. The method for combined interaction of hand detection tracking and musical instrument detection as claimed in claim 6, wherein said judgment rule is to judge whether the following formula is a logical true value; if yes, the corresponding note can be played, and if not, the corresponding note cannot be played, wherein the formula is [xmin ≤ X and X ≤ xmax and ymin ≤ Y and Y ≤ ymax].
8. A combined interactive system for hand detection tracking and instrument detection comprises a collecting device for collecting video and/or images, and is characterized by further comprising the following modules:
a training module: used for generating the hand recognition model and the musical instrument detection model;
a detection module: used for inputting the collected video into the hand recognition model and the musical instrument detection model respectively for detection, and for using a judgment rule to judge whether the hand key detection point position lies within the position of the musical instrument to be recognized;
the system performs a combined hand detection tracking and instrument detection interaction in accordance with the method of claim 1.
9. The combined interactive system of hand detection tracking and instrument detection of claim 8, wherein the generation method of the hand recognition model comprises the sub-steps of:
configuring camera parameters, collecting instrument data in batches, and labeling in a manner of labeling X, Y values of N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points;
processing the acquired images and generating an estimate X, Y value;
calculating the Euclidean distance between the estimated X, Y value and the X, Y value obtained by collecting data;
and when the Euclidean distance is smaller than a designated threshold value, saving the parameters to generate a hand recognition model.
10. The combined interactive hand detection tracking and instrument detection system of claim 8, wherein said method of processing the acquired images and generating an estimated X, Y value comprises the sub-steps of:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
performing a pooling operation on the feature map with the average pooling layer, and estimating the X, Y values by regression on the pooled output.
CN202110147352.XA 2021-02-03 2021-02-03 Hand detection tracking and musical instrument detection combined interaction method and system Pending CN113158748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110147352.XA CN113158748A (en) 2021-02-03 2021-02-03 Hand detection tracking and musical instrument detection combined interaction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110147352.XA CN113158748A (en) 2021-02-03 2021-02-03 Hand detection tracking and musical instrument detection combined interaction method and system

Publications (1)

Publication Number Publication Date
CN113158748A true CN113158748A (en) 2021-07-23

Family

ID=76882682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110147352.XA Pending CN113158748A (en) 2021-02-03 2021-02-03 Hand detection tracking and musical instrument detection combined interaction method and system

Country Status (1)

Country Link
CN (1) CN113158748A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657185A (en) * 2021-07-26 2021-11-16 广东科学技术职业学院 Intelligent auxiliary method, device and medium for piano practice



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination