CN113158748A - Hand detection tracking and musical instrument detection combined interaction method and system - Google Patents

Hand detection tracking and musical instrument detection combined interaction method and system

Info

Publication number
CN113158748A
Authority
CN
China
Prior art keywords
hand
detection
recognition model
value
instrument
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110147352.XA
Other languages
Chinese (zh)
Inventor
段若愚
史明
韩钰浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xiaopangxiong Technology Co ltd
Original Assignee
Hangzhou Xiaopangxiong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xiaopangxiong Technology Co ltd filed Critical Hangzhou Xiaopangxiong Technology Co ltd
Priority to CN202110147352.XA priority Critical patent/CN113158748A/en
Publication of CN113158748A publication Critical patent/CN113158748A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 - Static hand or arm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 - Geometric image transformations in the plane of the image
    • G06T 3/40 - Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4046 - Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/20 - Analysis of motion
    • G06T 7/246 - Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/40 - Extraction of image or video features
    • G06V 10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/20 - Scenes; Scene-specific elements in augmented reality scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 - Static hand or arm
    • G06V 40/117 - Biometrics derived from hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The invention provides a combined hand detection tracking and musical instrument detection interaction method and system. The method comprises collecting video and/or images with a collecting device, and further comprises the following steps: generating a hand recognition model and a musical instrument detection model; inputting the collected video into the hand recognition model and the musical instrument detection model respectively for detection; and using a judgment rule to judge whether the hand key detection point position lies within the position of the musical instrument to be identified.

Description

Hand detection tracking and musical instrument detection combined interaction method and system
Technical Field
The invention relates to the technical field of AR interaction, in particular to a hand detection tracking and musical instrument detection combined interaction method and system.
Background
As technology becomes more deeply combined with entertainment content, audience demands for entertainment experiences are gradually shifting from single-content, low-frequency interaction toward more personalized, higher-quality, and more interactive experiences. Interactive video has emerged in response: viewers can click interactive components that appear in the video to choose the branch storyline or viewing angle they want to watch. Whereas traditional video is watched passively, interactive video gives the viewer greater immersion and interactivity.
AR is a comprehensive integration technology involving computer graphics, human-computer interaction, sensing technology, artificial intelligence, and other fields. It uses a computer to generate realistic three-dimensional visual, auditory, olfactory, and other sensory stimuli, so that a participant can naturally experience and interact with a virtual world through suitable devices. When the user moves, the computer immediately performs complex calculations and returns an accurate 3D image of the world, generating a sense of presence. The technology integrates the latest developments in computer graphics (CG), computer simulation, artificial intelligence, sensing, display, and network parallel processing, and is a high-level simulation system built with the aid of computer technology.
Existing AR technology requires the user to wear AR glasses. People, however, perceive the world through all five senses, while AR technology only assists perception through vision and hearing. Although some tactile feedback can be obtained by holding an interactive device in the hand, it is still very different from real physical touch and therefore cannot provide the best AR interaction experience.
The invention patent application with publication number 109358748A discloses "a device and method for interaction between a hand and an AR virtual object on a mobile phone", comprising a mobile phone and a gesture tracking device matched with it; the gesture tracking device is mounted on the back of the mobile phone and is in communication connection with it. The mobile phone runs an Android system that supports the ARCore technology and has an ARCore mobile phone app matched with the gesture tracking device installed on it. The gesture tracking device comprises a software API and a gesture recognition module, which are used to capture data on the motion, position, and displacement of the user's hand within the working range and transmit the data to the ARCore mobile phone app in real time; the display screen of the mobile phone is used to display the AR scene. The disadvantage of this method is that it cannot identify the playing area of a musical instrument and cannot provide hand key point parameters.
Disclosure of Invention
To solve the above technical problems, the present invention provides a combined hand detection tracking and musical instrument detection interaction method and system, which detect the key points of a hand and the key areas of a musical instrument, judge whether the hand key points lie within the key areas of the instrument, and then generate the corresponding sounds according to the judgment result.
The invention provides a combined interaction method of hand detection tracking and musical instrument detection, which comprises the steps of using a collecting device to collect video and/or images, and further comprises the following steps:
generating a hand recognition model and a musical instrument detection model;
respectively inputting the collected videos into the hand recognition model and the musical instrument detection model for detection;
and judging whether the hand key detection point position is in the position of the musical instrument to be identified by using a judgment rule.
Preferably, the hand recognition model generation method includes the following substeps:
configuring camera parameters, collecting instrument data in batches, and labeling in a manner of labeling X, Y values of N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points;
processing the acquired images and generating an estimate X, Y value;
calculating the Euclidean distance between the estimated X, Y value and the X, Y value obtained by collecting data;
and when the Euclidean distance is smaller than a designated threshold value, saving the parameters to generate a hand recognition model.
In any of the above solutions, it is preferable that the method of processing the acquired image and generating the estimated X, Y value includes the following sub-steps:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
performing a pooling operation on the feature map with the average pooling layer, and estimating the X, Y values by regression on the pooled output (a minimal code sketch of this network follows these sub-steps).
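The patent does not give code for this network, so the following is only a minimal sketch under the assumption of a TensorFlow/Keras implementation; the 256 x 256 input size and the 21 key points are taken from the embodiments, and all function and loss names are illustrative.

```python
import tensorflow as tf

N_KEYPOINTS = 21      # number of hand key points N, value taken from the embodiments
INPUT_SIZE = 256      # images are compressed to 256 x 256 before training

def build_hand_keypoint_model():
    # feature extraction with a MobileNet deep convolutional neural network
    backbone = tf.keras.applications.MobileNet(
        input_shape=(INPUT_SIZE, INPUT_SIZE, 3), include_top=False, weights=None)
    # average pooling layer over the feature map
    pooled = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
    # regression of the N normalized (X, Y) key point values
    keypoints = tf.keras.layers.Dense(2 * N_KEYPOINTS, activation="sigmoid")(pooled)
    return tf.keras.Model(backbone.input, keypoints)

def euclidean_distance_loss(y_true, y_pred):
    # Euclidean distance between the estimated X, Y values and the labeled X, Y values
    return tf.reduce_mean(tf.sqrt(tf.reduce_sum(tf.square(y_true - y_pred), axis=-1)))

model = build_hand_keypoint_model()
model.compile(optimizer="adam", loss=euclidean_distance_loss)
model.summary()
```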
In any of the above aspects, it is preferable that the method for generating the instrument recognition model includes the substeps of:
configuring camera parameters, collecting instrument data in batches, marking the instrument data in a way that a rectangular frame is adopted to represent the position of an object, and recording the values of xmin and ymin at the upper left corner and xmax and ymax at the lower right corner of the rectangular frame;
converting the marked top-left values x_tl, y_tl and bottom-right values x_br, y_br using the formulas x/width and y/height respectively, to generate the normalized top-left values (x_tlw, y_tlh) and the normalized bottom-right values (x_brw, y_brh), where width represents the width of the photo and height represents the height of the photo (a short sketch of this conversion follows these sub-steps);
learning the image data using a neural network and generating vector values for estimates (xmin, ymin) and (xmax, ymax);
calculating the cross entropy between the estimated vector values (xmin, ymin) and (xmax, ymax) and the vector values (x_tlw, y_tlh) and (x_brw, y_brh) obtained from the collected data;
and when the cross entropy is smaller than a specified threshold value, saving the weights to generate the model.
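As a concrete illustration of the coordinate conversion in these sub-steps, the sketch below applies the x/width and y/height formulas to one labeled rectangle. It is a plain-Python example with made-up numbers, not code from the patent.

```python
def normalize_box(x_tl, y_tl, x_br, y_br, width, height):
    """Apply the x/width and y/height formulas to the labeled corner coordinates."""
    return (x_tl / width, y_tl / height), (x_br / width, y_br / height)

# example: a 640 x 640 photo with a box labeled from (100, 150) to (300, 400)
top_left, bottom_right = normalize_box(100, 150, 300, 400, width=640, height=640)
print(top_left)       # (0.15625, 0.234375)
print(bottom_right)   # (0.46875, 0.625)
```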
In any of the above aspects, preferably, the method for learning image data using a neural network and generating vector values of the estimates (xmin, ymin) and (xmax, ymax) includes the sub-steps of:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
and performing a pooling operation on the feature map with the average pooling layer, and estimating the x and y values by regression on the pooled output.
In any of the above schemes, preferably, the method for inputting the collected video into the hand recognition model and the instrument detection model for detection includes starting the collection device, loading the hand recognition model and the instrument recognition model at the same time, and transmitting the video frames collected by the collection device to the hand recognition model and the instrument recognition model respectively.
In any of the above schemes, preferably, the judgment rule is to judge whether the following formula is a logical true value; if so, the corresponding note can be played, and if not, the corresponding note cannot be played, wherein the formula is [xmin ≤ X and X ≤ xmax and ymin ≤ Y and Y ≤ ymax].
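The judgment rule reduces to a simple interval test. A minimal sketch is shown below (plain Python, with illustrative names and made-up coordinates).

```python
def key_point_in_instrument(x, y, xmin, ymin, xmax, ymax):
    """Logical truth value of [xmin <= x and x <= xmax and ymin <= y and y <= ymax]."""
    return xmin <= x <= xmax and ymin <= y <= ymax

# a hand key point at (0.4, 0.5) inside an instrument region (0.3, 0.2)-(0.7, 0.8)
print(key_point_in_instrument(0.4, 0.5, 0.3, 0.2, 0.7, 0.8))   # True -> the note can be played
```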
A second object of the present invention is to provide a combined interaction system of hand detection tracking and musical instrument detection, comprising a capturing device for capturing video and/or images, further comprising the following modules:
a training module: used for generating the hand recognition model and the musical instrument detection model;
a detection module: used for inputting the collected video into the hand recognition model and the musical instrument detection model respectively for detection, and for using a judgment rule to judge whether the hand key detection point position lies within the position of the musical instrument to be recognized;
the system performs a combined hand detection tracking and instrument detection interaction in accordance with the method of claim 1.
Preferably, the hand recognition model generation method includes the following substeps:
configuring camera parameters, collecting instrument data in batches, and labeling in a manner of labeling X, Y values of N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points;
processing the acquired images and generating an estimate X, Y value;
calculating the Euclidean distance between the estimated X, Y value and the X, Y value obtained by collecting data;
and when the Euclidean distance is smaller than a designated threshold value, saving the parameters to generate a hand recognition model.
In any of the above solutions, it is preferable that the method of processing the acquired image and generating the estimated X, Y value includes the following sub-steps:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
performing a pooling operation on the feature map with the average pooling layer, and estimating the X, Y values by regression on the pooled output.
In any of the above aspects, it is preferable that the method for generating the instrument recognition model includes the substeps of:
configuring camera parameters, collecting instrument data in batches, marking the instrument data in a way that a rectangular frame is adopted to represent the position of an object, and recording the values of xmin and ymin at the upper left corner and xmax and ymax at the lower right corner of the rectangular frame;
converting the marked top-left values x_tl, y_tl and bottom-right values x_br, y_br using the formulas x/width and y/height respectively, to generate the normalized top-left values (x_tlw, y_tlh) and the normalized bottom-right values (x_brw, y_brh), where width represents the width of the photo and height represents the height of the photo;
learning the image data using a neural network, generating vector values of estimates (xmin, ymin) and (xmax, ymax);
calculating the cross entropy between the estimated vector values (xmin, ymin) and (xmax, ymax) and the vector values (x_tlw, y_tlh) and (x_brw, y_brh) obtained from the collected data;
and when the cross entropy is smaller than a specified threshold value, saving the weight generation model.
In any of the above aspects, preferably, the method for learning image data using a neural network and generating vector values of the estimates (xmin, ymin) and (xmax, ymax) includes the sub-steps of:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
and performing a pooling operation on the feature map with the average pooling layer, and estimating the x and y values by regression on the pooled output.
In any of the above schemes, preferably, the detection module is further configured to start the collection device, load the hand recognition model and the musical instrument recognition model at the same time, and transmit the video frames collected by the collection device to the hand recognition model and the musical instrument recognition model, respectively.
In any of the above schemes, preferably, the judgment rule is to judge whether the following formula is a logical true value; if so, the corresponding note can be played, and if not, the corresponding note cannot be played, wherein the formula is [xmin ≤ X and X ≤ xmax and ymin ≤ Y and Y ≤ ymax].
The invention provides a combined hand detection tracking and musical instrument detection interaction method and system, which can realize interaction between the hand and real objects in an AR scene and produce interaction effects such as sound and animation.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a combined hand detection tracking and instrument detection interaction method according to the present invention.
FIG. 2 is a block diagram of a preferred embodiment of the combined interaction system of hand detection tracking and instrument detection according to the present invention.
FIG. 3 is a flowchart of an embodiment of program-initiated recognition for a combined hand detection tracking and instrument detection interaction method according to the present invention.
FIG. 4 is a flowchart of an embodiment of the hand keypoint training of the combined interaction method of hand detection tracking and instrument detection according to the present invention.
FIG. 5 is a flow chart of an embodiment of instrument recognition training according to the combined interaction method of hand detection tracking and instrument detection of the present invention.
FIG. 6 is a diagram of an embodiment of a combined hand detection tracking and instrument detection interaction method for hand-to-board instrument contact identification in accordance with the present invention.
Detailed Description
The invention is further illustrated with reference to the figures and the specific examples.
Example one
As shown in fig. 1, step 100 is performed to capture video and/or images using a capture device.
Step 110 is executed to generate a hand recognition model and a musical instrument detection model. The generation method of the hand recognition model comprises the following substeps:
step 101: and (3) configuring camera parameters, collecting instrument data in batches, and labeling, wherein the labeling mode is that X, Y values are labeled on N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points.
Step 102: the acquired image is processed to generate an estimate X, Y value. Compressing the batch of collected images, performing normalization operation on the compressed images, performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map, performing pooling operation on the feature map by using an average pooling layer, and performing regression estimation X, Y on the generated value of the average pooling layer.
Step 103: the euclidean distance between the estimated X, Y value and the X, Y value obtained from the collected data is calculated.
Step 104: it is determined whether the euclidean distance is smaller than a specified threshold value (in the present embodiment, the specified threshold value is set to 0.1). If the Euclidean distance is greater than the specified threshold, then steps 105 and 103 are performed sequentially, the reduction threshold (in this embodiment, the reduction threshold is set to 0.01) will be subtracted from the initialization parameter of the deep convolutional neural network, and the Euclidean distance is recalculated. If the Euclidean distance is smaller than a designated threshold value, step 106 is executed, and parameters are saved to generate a hand recognition model.
The method for generating the instrument identification model comprises the following sub-steps:
step 111: and configuring camera parameters, collecting instrument data in batches, marking in a way of adopting a rectangular frame to represent the position of an object, and recording the values of xmin and ymin at the upper left corner and xmax and ymax at the lower right corner of the rectangular frame.
Step 112: to the marked upper left corner xtl、ytlValue and lower right corner xbr、ybrThe values are respectively converted by using x/width and y/height formulas to generate a normalized value (x) at the upper left cornert1w、ytlh) Normalized to the lower right corner (x)brw、ybrh) Wherein width represents the width of the photo and height represents the height of the photo.
Step 113: the vector values of the estimates (xmin, ymin) and (xmax, ymax) are generated using neural network learning image data. Compressing the batch of collected images, performing normalization operation on the compressed images, performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map, performing pooling operation on the feature map by using an average pooling layer, and performing a regression vector value on a generated value of the average pooling layer.
Step 114: calculating the vector values of the estimates (xmin, ymin) and (xmax, ymax) and (x) obtained by collecting datat1w、ytlh) And (x)brw、ybrh) Cross entropy of vector values of (a).
Step 115: it is judged whether or not the cross entropy is smaller than a prescribed threshold value (in the present embodiment, the prescribed threshold value is set to 0.1). If the cross entropy is greater than the specified threshold, then steps 105 and 103 are performed sequentially, the initialization parameter of the deep convolutional neural network is subtracted by the reduction threshold (in this embodiment, the reduction threshold is set to 0.01), and the cross entropy is recalculated. If the cross entropy is less than the specified threshold, step 106 is performed to save the weight generation model.
Step 120 is executed to input the collected video into the hand recognition model and the musical instrument detection model respectively for detection: the collection device is started, the hand recognition model and the musical instrument recognition model are loaded at the same time, and the video frames collected by the collection device are transmitted to the hand recognition model and the musical instrument recognition model respectively.
Step 130 is executed to judge, with the judgment rule, whether the hand key detection point position is in the position of the musical instrument to be identified. The judgment rule is to judge whether the following formula is a logical true value; if so, the corresponding note can be played, and if not, the corresponding note cannot be played, wherein the formula is [xmin ≤ X and X ≤ xmax and ymin ≤ Y and Y ≤ ymax].
Example two
As shown in fig. 2, a combined interaction system of hand detection tracking and musical instrument detection includes a collection device 200, a training module 210 and a detection module 220.
The collection device 200: for capturing video and/or images.
The training module 210: for generating hand recognition models and instrument detection models. The generation method of the hand recognition model comprises the following substeps:
step 101: and (3) configuring camera parameters, collecting instrument data in batches, and labeling, wherein the labeling mode is that X, Y values are labeled on N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points.
Step 102: the acquired image is processed to generate an estimate X, Y value. Compressing the batch of collected images, performing normalization operation on the compressed images, performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map, performing pooling operation on the feature map by using an average pooling layer, and performing regression estimation X, Y on the generated value of the average pooling layer.
Step 103: the euclidean distance between the estimated X, Y value and the X, Y value obtained from the collected data is calculated.
Step 104: when the euclidean distance is smaller than a specified threshold value (in the present embodiment, the specified threshold value is set to 0.1), the hand recognition model is generated by saving the parameters.
The method for generating the instrument identification model comprises the following sub-steps:
step 111: and configuring camera parameters, collecting instrument data in batches, marking in a way of adopting a rectangular frame to represent the position of an object, and recording the values of xmin and ymin at the upper left corner and xmax and ymax at the lower right corner of the rectangular frame.
Step 112: to the marked upper left corner xtl、ytlValue and lower right corner xbr、ybrThe values are respectively converted by using x/width and y/height formulas to generate a normalized value (x) at the upper left cornert1w、ytlh) Normalized to the lower right corner (x)brw、ybrh) Wherein width represents the width of the photo and height represents the height of the photo.
Step 113: the image data is learned using a neural network and vector values of estimates (xmin, ymin) and (xmax, ymax) are generated. Compressing the batch of collected images, performing normalization operation on the compressed images, performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map, performing pooling operation on the feature map by using an average pooling layer, and performing regression estimation on vector values of (xmin, ymin) and (xmax, ymax) by using the generated value of the average pooling layer. Step 114: calculating the vector values of the estimates (xmin, ymin) and (xmax, ymax) and (x) obtained by collecting datat1w、ytlh) And (x)brw、ybrh) Cross entropy of vector values of (a).
Step 115: when the cross entropy is smaller than a prescribed threshold (in the present embodiment, the prescribed threshold is set to 0.1), the weight generation model is saved.
The detection module 220: used for inputting the collected video into the hand recognition model and the musical instrument detection model respectively for detection, and for judging, with the judgment rule, whether the hand key detection point position is in the position of the musical instrument to be recognized.
The detection module 220 is further configured to start the collection device, load the hand recognition model and the musical instrument recognition model at the same time, and transmit the video frames collected by the collection device to the hand recognition model and the musical instrument recognition model respectively. The judgment rule is to judge whether the following formula is a logical true value; if so, the corresponding note can be played, and if not, the corresponding note cannot be played, wherein the formula is [xmin ≤ X and X ≤ xmax and ymin ≤ Y and Y ≤ ymax].
EXAMPLE III
The invention can realize the interaction between the hand and the actual object in the AR scene and can generate the interaction effects of sound, animation and the like. The implementation method is shown in fig. 3, and the technical scheme is as follows:
First, the hand detection method (as shown in FIG. 4)
1. Hand detection and tracking;
2. Hand and object detection and tracking.
3. The camera parameters are configured so that photos of size 480 x 480 are collected; instrument data are collected in batches and labeled. The labeling marks X, Y values for the 21 hand key points according to the position of the hand in the image.
4. The batch-collected images are compressed to 256 × 256 and normalized; features are extracted with a MobileNet deep convolutional neural network to generate a feature map; the feature map is pooled with an average pooling layer; the X, Y values are estimated by regression on the pooled output; and the Euclidean distance between the estimated X, Y values and the labeled X, Y values of the collected data is computed.
5. If the computed Euclidean distance is greater than 0.1, 0.01 is subtracted from the initialization parameters of the deep convolutional neural network and step 4 is repeated; training stops once the Euclidean distance falls below 0.1.
Second, musical instrument detection method (as shown in FIG. 5)
6. The camera parameters are configured so that photos of size 640 x 640 are collected; instrument data are collected in batches and labeled. The labeling uses a rectangular box to represent the position of the object, recording the x and y values of the top-left corner and the x and y values of the bottom-right corner.
7. The marked top-left x, y values and bottom-right x, y values are converted with the formulas x/width and y/height, where width and height represent the width and height of the photo respectively.
8. The batch-collected images are scaled to 320 x 320 and the compressed images are normalized; features are extracted with a MobileNet deep convolutional neural network to generate a feature map; the feature map is pooled with an average pooling layer; the X, Y values are estimated by regression on the pooled output; and the cross entropy between the estimated X, Y values and the normalized X, Y values of the collected data is computed.
9. If the computed cross entropy is greater than 0.1, 0.01 is subtracted from the initialization parameters of the deep convolutional neural network and step 8 is repeated; once the cross entropy falls below 0.1, training stops and the weights are saved to generate the model file (a code sketch of this branch follows).
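For the instrument branch, a corresponding sketch is shown below, again assuming TensorFlow/Keras and placeholder data. Sigmoid outputs keep the estimated corners in [0, 1] so that a binary cross entropy against the normalized corner values can stand in for the cross entropy described in steps 8-9; the saved file name is illustrative.

```python
import numpy as np
import tensorflow as tf

INPUT_SIZE = 320                                   # images are scaled to 320 x 320 (step 8)

# placeholder data standing in for the labeled batch of collected instrument images
images = np.random.rand(8, INPUT_SIZE, INPUT_SIZE, 3).astype("float32")
boxes = np.random.rand(8, 4).astype("float32")     # normalized (xmin, ymin, xmax, ymax) labels

backbone = tf.keras.applications.MobileNet(
    input_shape=(INPUT_SIZE, INPUT_SIZE, 3), include_top=False, weights=None)
pooled = tf.keras.layers.GlobalAveragePooling2D()(backbone.output)
corners = tf.keras.layers.Dense(4, activation="sigmoid")(pooled)   # estimated corner vector
model = tf.keras.Model(backbone.input, corners)

# cross entropy between the estimated corners and the normalized labels, as described in step 8
model.compile(optimizer="adam", loss="binary_crossentropy")

for _ in range(100):                               # guard against non-convergence on fake data
    loss = model.fit(images, boxes, epochs=1, verbose=0).history["loss"][0]
    if loss < 0.1:                                 # step 9: stop once the loss is below 0.1
        break

model.save("instrument_detection.h5")              # save the weights to generate the model file
```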
10. The app starts the Camera, loads the hand recognition model and the instrument detection model at the same time, and transmits the video frames collected by the Camera to hand detection and instrument detection respectively.
11. The returned results are used to calculate whether the hand key detection point position is within the positions of the musical instruments to be identified, using the formula [xmin ≤ x and x ≤ xmax and ymin ≤ y and y ≤ ymax], where xmin, ymin, xmax, ymax represent the top-left and bottom-right x, y values of the instrument recognition result, and x, y represent the x, y values of the hand key point. If the hand key point is within the position of the instrument to be identified, the formula is a logical true value and the corresponding note is considered to be played. If the hand key point is not within the position of the instrument to be identified, the formula is a logical false value and the corresponding note cannot be played.
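Steps 10 and 11 can be sketched as a simple capture-and-judge loop. The sketch below assumes OpenCV for the camera, loads the illustrative model files saved by the training sketches above, and uses a hypothetical play_note() placeholder for the sound effect; it is not the patent's actual app code.

```python
import cv2
import tensorflow as tf

# illustrative file names, matching the training sketches above
hand_model = tf.keras.models.load_model("hand_recognition.h5", compile=False)
instrument_model = tf.keras.models.load_model("instrument_detection.h5", compile=False)

def play_note(index):
    print(f"play note {index}")                    # placeholder for the real sound effect

cap = cv2.VideoCapture(0)                          # step 10: start the camera
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    hand_in = cv2.resize(frame, (256, 256))[None].astype("float32") / 255.0
    inst_in = cv2.resize(frame, (320, 320))[None].astype("float32") / 255.0
    keypoints = hand_model.predict(hand_in, verbose=0).reshape(-1, 2)      # 21 normalized (x, y)
    regions = instrument_model.predict(inst_in, verbose=0).reshape(-1, 4)  # (xmin, ymin, xmax, ymax)
    for i, (xmin, ymin, xmax, ymax) in enumerate(regions):                 # step 11: judgment rule
        if any(xmin <= x <= xmax and ymin <= y <= ymax for x, y in keypoints):
            play_note(i)                           # a key point is inside the instrument region
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
```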
Example four
As shown in fig. 6, the hand is in contact with a cardboard musical instrument. The points detected by the hand detection method and the points of the cardboard obtained by the musical instrument detection method are fused, and the corresponding sound effect is triggered when a hand point falls within the corresponding area of the cardboard instrument, so that the effect of playing music is achieved.
For a better understanding of the present invention, the foregoing detailed description has been given in conjunction with specific embodiments thereof, but not with the intention of limiting the invention thereto. Any simple modifications of the above embodiments according to the technical essence of the present invention still fall within the scope of the technical solution of the present invention. In the present specification, each embodiment is described with emphasis on differences from other embodiments, and the same or similar parts between the respective embodiments may be referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims (10)

1. A combined interaction method of hand detection tracking and musical instrument detection comprises the steps of collecting videos and/or images by using a collecting device, and is characterized by further comprising the following steps:
generating a hand recognition model and a musical instrument detection model;
respectively inputting the collected videos into the hand recognition model and the musical instrument detection model for detection;
and judging whether the hand key detection point position is in the position of the musical instrument to be identified by using a judgment rule.
2. A combined hand detection tracking and instrument detection interaction method as claimed in claim 1, wherein the generation method of the hand recognition model comprises the sub-steps of:
configuring camera parameters, collecting instrument data in batches, and labeling in a manner of labeling X, Y values of N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points;
processing the acquired images and generating an estimate X, Y value;
calculating the Euclidean distance between the estimated X, Y value and the X, Y value obtained by collecting data;
and when the Euclidean distance is smaller than a designated threshold value, saving the parameters to generate a hand recognition model.
3. A combined hand detection tracking and instrument detection interaction method as claimed in claim 2, wherein said method of processing the acquired images and generating an estimate X, Y value comprises the sub-steps of:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
performing a pooling operation on the feature map with the average pooling layer, and estimating the X, Y values by regression on the pooled output.
4. The method of interacting hand detection tracking with instrument detection as recited in claim 2, wherein the method of generating the instrument recognition model comprises the sub-steps of:
configuring camera parameters, collecting instrument data in batches, marking the instrument data in a way that a rectangular frame is adopted to represent the position of an object, and recording the values of xmin and ymin at the upper left corner and xmax and ymax at the lower right corner of the rectangular frame;
converting the marked top-left values x_tl, y_tl and bottom-right values x_br, y_br using the formulas x/width and y/height respectively, to generate the normalized top-left values (x_tlw, y_tlh) and the normalized bottom-right values (x_brw, y_brh), where width represents the width of the photo and height represents the height of the photo;
learning the image data using a neural network and generating vector values for estimates (xmin, ymin) and (xmax, ymax);
calculating the cross entropy between the estimated vector values (xmin, ymin) and (xmax, ymax) and the vector values (x_tlw, y_tlh) and (x_brw, y_brh) obtained from the collected data;
and when the cross entropy is smaller than a specified threshold value, saving the weight generation model.
5. The combined hand detection tracking and musical instrument detection interaction method according to claim 4, wherein the method of learning image data using a neural network and generating vector values for the estimates (xmin, ymin) and (xmax, ymax) comprises the sub-steps of:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
and performing pooling operation on the feature map by using the average pooling layer, and performing regression estimation on the x and y values by using the generated value of the average pooling layer.
6. The method of claim 4, wherein the step of inputting the captured video into the hand recognition model and the instrument detection model for detection comprises activating the capturing device, loading the hand recognition model and the instrument detection model simultaneously, and transmitting the video frames captured by the capturing device to the hand recognition model and the instrument detection model respectively.
7. The method for combined interaction of hand detection tracking and musical instrument detection as claimed in claim 6, wherein said judgment rule is to judge whether the following formula is a logical true value; if yes, the corresponding note can be played, and if not, the corresponding note cannot be played, wherein the formula is [xmin ≤ X and X ≤ xmax and ymin ≤ Y and Y ≤ ymax].
8. A combined interactive system for hand detection tracking and instrument detection comprises a collecting device for collecting video and/or images, and is characterized by further comprising the following modules:
a training module: used for generating the hand recognition model and the musical instrument detection model;
a detection module: used for inputting the collected video into the hand recognition model and the musical instrument detection model respectively for detection, and for using a judgment rule to judge whether the hand key detection point position lies within the position of the musical instrument to be recognized;
the system performs a combined hand detection tracking and instrument detection interaction in accordance with the method of claim 1.
9. The combined interactive system of hand detection tracking and instrument detection of claim 8, wherein the generation method of the hand recognition model comprises the sub-steps of:
configuring camera parameters, collecting instrument data in batches, and labeling in a manner of labeling X, Y values of N key points of the hand according to the position of the hand in the image, wherein N is the number of the key points;
processing the acquired images and generating an estimate X, Y value;
calculating the Euclidean distance between the estimated X, Y value and the X, Y value obtained by collecting data;
and when the Euclidean distance is smaller than a designated threshold value, saving the parameters to generate a hand recognition model.
10. The combined interactive hand detection tracking and instrument detection system of claim 8, wherein said method of processing the acquired images and generating an estimated X, Y value comprises the sub-steps of:
compressing the batch of collected images, and performing normalization operation on the compressed images;
performing feature extraction by using a MobileNet deep convolution neural network to generate a feature map;
performing a pooling operation on the feature map with the average pooling layer, and estimating the X, Y values by regression on the pooled output.
CN202110147352.XA 2021-02-03 2021-02-03 Hand detection tracking and musical instrument detection combined interaction method and system Pending CN113158748A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110147352.XA CN113158748A (en) 2021-02-03 2021-02-03 Hand detection tracking and musical instrument detection combined interaction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110147352.XA CN113158748A (en) 2021-02-03 2021-02-03 Hand detection tracking and musical instrument detection combined interaction method and system

Publications (1)

Publication Number Publication Date
CN113158748A true CN113158748A (en) 2021-07-23

Family

ID=76882682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110147352.XA Pending CN113158748A (en) 2021-02-03 2021-02-03 Hand detection tracking and musical instrument detection combined interaction method and system

Country Status (1)

Country Link
CN (1) CN113158748A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113657185A (en) * 2021-07-26 2021-11-16 广东科学技术职业学院 Intelligent auxiliary method, device and medium for piano practice



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination