CN108064006A - Intelligent sound box and control method for playing back - Google Patents
- Publication number
- CN108064006A CN108064006A CN201810142948.9A CN201810142948A CN108064006A CN 108064006 A CN108064006 A CN 108064006A CN 201810142948 A CN201810142948 A CN 201810142948A CN 108064006 A CN108064006 A CN 108064006A
- Authority
- CN
- China
- Prior art keywords
- gesture
- sound box
- intelligent sound
- profile
- gesture motion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/20—Arrangements for obtaining desired frequency or directional characteristics
- H04R1/22—Arrangements for obtaining desired frequency or directional characteristics for obtaining desired frequency characteristic only
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R3/00—Circuits for transducers, loudspeakers or microphones
- H04R3/04—Circuits for transducers, loudspeakers or microphones for correcting frequency response
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2430/00—Signal processing covered by H04R, not provided for in its groups
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2430/00—Signal processing covered by H04R, not provided for in its groups
- H04R2430/01—Aspects of volume control, not necessarily automatic, in sound systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R27/00—Public address systems
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Otolaryngology (AREA)
- Image Analysis (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The present invention provides an intelligent sound box and a playback control method. The playback control method includes: the intelligent sound box performs human-body detection; when a human body is detected, the gesture motion of the human body is recognized; and the playback state of the intelligent sound box is adjusted according to the gesture motion. The method of the present invention adds an interaction mode for the intelligent sound box, so that the user can control it by gesture, improving the user experience.
Description
Technical field
The present invention relates to the field of intelligent sound boxes, and in particular to an intelligent sound box and a playback control method.
Background technology
An intelligent sound box is an upgraded speaker product: a tool with which home users access the Internet by voice, for example to request songs, shop online, or check the weather forecast. It can also control smart home devices, for example opening the curtains, setting the refrigerator temperature, or pre-heating the water heater.
Intelligent sound boxes represented by the Amazon Echo actually belong to intelligent voice technology, and their operation is controlled by voice commands. However, the background noise in a typical home environment is considerable, and this noise interferes with the correct recognition of voice commands, degrading the user experience. More interaction modes are therefore needed so that users can interact with the intelligent sound box conveniently and the user experience is improved.
Summary of the invention
The main object of the present invention is to provide an intelligent sound box and a playback control method that enhance the experience of users of the intelligent sound box.
The present invention provides a playback control method comprising the following steps:
the intelligent sound box performs human-body detection;
when a human body is detected, the gesture motion of the human body is recognized;
the playback state of the intelligent sound box is adjusted according to the gesture motion.
Preferably, the step of recognizing the gesture motion of the human body includes:
separating the gesture in each frame of gesture images of the detected human body from the background, and finding the gesture contour in each frame;
matching the gesture contours frame by frame against a preset start-gesture contour, and determining the first matching gesture contour as the start-gesture contour;
matching the gesture contours that follow the start-gesture contour, frame by frame, against a preset end-gesture contour, and determining the first matching gesture contour as the end-gesture contour;
determining the sequence of gesture contours that begins with the start-gesture contour and ends with the end-gesture contour as one recognized gesture motion.
Preferably, the step of adjusting the playback state of the intelligent sound box according to the gesture motion includes:
determining the control instruction corresponding to the gesture motion;
adjusting the playback state of the intelligent sound box according to the control instruction.
Preferably, the step of determining the control instruction corresponding to the gesture motion includes:
performing feature extraction on the gesture motion to obtain gesture motion features;
encoding the gesture motion features to obtain an encoding result;
determining the control instruction corresponding to the encoding result.
Preferably, the method further includes:
calculating the physical distance between the intelligent sound box and the human body;
adjusting the volume of the intelligent sound box according to the physical distance.
Preferably, the step of the intelligent sound box performing human-body detection includes:
the intelligent sound box performing human-body detection based on the histogram of oriented gradients.
Preferably, the step of the intelligent sound box performing human-body detection based on the histogram of oriented gradients includes:
performing first-order gradient calculation on the image in the detection window;
calculating the gradient orientation histogram of each cell in the image;
normalizing all the cells in each block of the image to obtain the gradient orientation histogram of the block;
normalizing all the blocks in the image to obtain the gradient orientation histogram of the detection window, and using the gradient orientation histogram of the detection window as the human-body feature vector.
In another aspect, the present invention also proposes an intelligent sound box, including:
a detection module for performing human-body detection;
a recognition module for recognizing, when a human body is detected, the gesture motion of the human body;
an adjustment module for adjusting the playback state of the intelligent sound box according to the gesture motion.
Preferably, the recognition module includes:
a separation unit for separating the gesture in each frame of gesture images of the detected human body from the background, and finding the gesture contour in each frame;
a start-gesture unit for matching the gesture contours frame by frame against a preset start-gesture contour, and determining the first matching gesture contour as the start-gesture contour;
an end-gesture unit for matching the gesture contours that follow the start-gesture contour, frame by frame, against a preset end-gesture contour, and determining the first matching gesture contour as the end-gesture contour;
a gesture-motion unit for determining the sequence of gesture contours that begins with the start-gesture contour and ends with the end-gesture contour as one recognized gesture motion.
Preferably, the adjustment module includes:
an instruction-determining unit for determining the control instruction corresponding to the gesture motion;
an adjustment unit for adjusting the playback state of the intelligent sound box according to the control instruction.
Preferably, the instruction-determining unit includes:
a feature-obtaining subunit for performing feature extraction on the gesture motion to obtain gesture motion features;
an encoding subunit for encoding the gesture motion features to obtain an encoding result;
an instruction-determining subunit for determining the control instruction corresponding to the encoding result.
Preferably, the intelligent sound box further includes:
a distance calculation module for calculating the physical distance between the intelligent sound box and the human body;
a volume adjustment module for adjusting the volume of the intelligent sound box according to the physical distance.
Preferably, the detection module includes:
a gradient detection unit for performing human-body detection based on the histogram of oriented gradients.
Preferably, the gradient detection unit includes:
a first-order gradient computation subunit for performing first-order gradient calculation on the image in the detection window;
a cell gradient subunit for calculating the gradient orientation histogram of each cell in the image;
a block gradient subunit for normalizing all the cells in each block of the image to obtain the gradient orientation histogram of the block;
a feature-vector generation subunit for normalizing all the blocks in the image to obtain the gradient orientation histogram of the detection window, and using the gradient orientation histogram of the detection window as the human-body feature vector.
The present invention also provides an intelligent sound box including a memory, a processor, and at least one application program stored in the memory and configured to be executed by the processor, the application program being configured to perform the above playback control method.
The present invention provides an intelligent sound box and a playback control method. The playback control method includes: the intelligent sound box performs human-body detection; when a human body is detected, the gesture motion of the human body is recognized; and the playback state of the intelligent sound box is adjusted according to the gesture motion. The method of the present invention adds an interaction mode for the intelligent sound box, so that the user can control it by gesture, improving the user experience.
Description of the drawings
Fig. 1 is a flow diagram of one embodiment of the playback control method of the present invention;
Fig. 2 is a structural diagram of one embodiment of the intelligent sound box of the present invention.
The realization of the object, the functions, and the advantages of the present invention will be further described with reference to the accompanying drawings and the embodiments.
Specific embodiment
It should be appreciated that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit it.
With reference to Fig. 1, an embodiment of the present invention proposes a playback control method comprising the following steps:
S10, the intelligent sound box performs human-body detection;
S20, when a human body is detected, recognizing the gesture motion of the human body;
S30, adjusting the playback state of the intelligent sound box according to the gesture motion.
In this embodiment, a depth sensor is installed on the intelligent sound box. Depth sensors fall into two classes: passive stereo cameras and active depth cameras. A passive stereo camera observes the scene with two or more cameras and estimates the depth of the scene from the disparity (displacement) of features between the multiple camera views. An active depth camera projects invisible infrared light onto the scene and estimates the depth of the scene from the reflected information. In one application scenario, user A stands at some position relative to the intelligent sound box and makes gesture instructions, such as a start-playback instruction, toward its depth sensor; after the intelligent sound box recognizes the meaning of user A's gesture instruction, it plays sound.
In step S10, the intelligent sound box performs human-body detection through the depth sensor. Detection can be based on image features such as the histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT), local binary patterns (LBP), or Haar features.
In step S20, when the intelligent sound box detects a human body, the gesture motion of the human body is recognized. Specifically, a segment of video data containing a gesture is acquired through the depth sensor, which here acts as a video recorder. The video data can be acquired according to a preset rule; for example, when the depth sensor observes that the user makes a large gesture motion, that segment of video data is determined to be video data containing a gesture.
The video data is decomposed into a sequence of frames, the background in each frame is separated from the gesture, and the gesture contour in each frame is found. The start frame and end frame of the gesture motion are determined by a preset rule, and the gesture contours between the start frame and the end frame are determined as the gesture motion. That is, a gesture motion consists of the gesture contours of multiple frames.
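The per-frame step just described can be sketched as follows: the gesture is separated from the background of a depth frame by thresholding, and only its contour pixels are kept. The depth range and 4-neighbourhood contour rule are illustrative assumptions; the patent does not fix a particular segmentation method.

```python
def segment_gesture(frame, near=500, far=900):
    """Return a binary mask of pixels whose depth lies in the gesture range.

    `frame` is a 2-D list of depth values (e.g. millimetres); the near/far
    thresholds are hypothetical values for a hand in front of the body.
    """
    return [[1 if near <= d <= far else 0 for d in row] for row in frame]

def gesture_contour(mask):
    """Return the set of foreground pixels that touch the background
    (4-neighbourhood) or the image border, i.e. the gesture contour."""
    h, w = len(mask), len(mask[0])
    contour = set()
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                # image border or background neighbour -> contour pixel
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    contour.add((x, y))
                    break
    return contour
```

Applied frame by frame to the decomposed video, this yields the per-frame gesture contours used in the following steps.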
In step S30, after the gesture motion is obtained, feature extraction is performed on it to obtain gesture motion features; the features are recognized to obtain a recognition result, and finally a control instruction is generated from the recognition result.
The intelligent sound box adjusts its playback state according to the control instruction. For example, if the control instruction obtained is a start-playback instruction, the intelligent sound box starts playing sound; if it is a stop-playback instruction, the intelligent sound box stops playing sound.
Optionally, step S20 includes:
separating the gesture in each frame of gesture images of the detected human body from the background, and finding the gesture contour in each frame;
matching the gesture contours frame by frame against a preset start-gesture contour, and determining the first matching gesture contour as the start-gesture contour;
matching the gesture contours that follow the start-gesture contour, frame by frame, against a preset end-gesture contour, and determining the first matching gesture contour as the end-gesture contour;
determining the sequence of gesture contours that begins with the start-gesture contour and ends with the end-gesture contour as one recognized gesture motion.
In this embodiment, the intelligent sound box stores the preset start-gesture contours and preset end-gesture contours corresponding to the different control instructions. Each gesture contour of the video data is first matched frame by frame against the preset start-gesture contour, and the contour of the first matching frame is determined as the start-gesture contour. The gesture contours after that frame are matched frame by frame against the preset end-gesture contour, and the contour of the first matching frame is determined as the end-gesture contour. The sequence of gesture contours that begins with the start-gesture contour and ends with the end-gesture contour is then determined as the gesture motion. The obtained gesture motion can be used to recognize the meaning the gesture carries and, in turn, to generate the corresponding control instruction.
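A minimal sketch of this segmentation rule: scan the per-frame contours, take the first frame matching the preset start contour and the first later frame matching the preset end contour, and return the span between them. The similarity measure (intersection-over-union of pixel sets) and the threshold are illustrative assumptions; the patent does not fix a particular matcher.

```python
def contour_similarity(a, b):
    """IoU of two contours represented as sets of (x, y) pixels."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def segment_motion(contours, start_ref, end_ref, threshold=0.6):
    """Return the contours from the start gesture to the end gesture,
    or None if no complete motion is found."""
    start = next((i for i, c in enumerate(contours)
                  if contour_similarity(c, start_ref) >= threshold), None)
    if start is None:
        return None
    end = next((i for i in range(start + 1, len(contours))
                if contour_similarity(contours[i], end_ref) >= threshold), None)
    if end is None:
        return None
    return contours[start:end + 1]
```

Returning None when either boundary gesture is missing mirrors the text's requirement that a motion be delimited by both a matched start contour and a matched end contour.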
Optionally, step S30 includes:
determining the control instruction corresponding to the gesture motion;
adjusting the playback state of the intelligent sound box according to the control instruction.
In this embodiment, a storage chip on the intelligent sound box prestores the control instructions corresponding to multiple different gesture motions. For example, it can be specified that the gesture motion "wave upward" corresponds to the "increase volume" instruction, "wave downward" to the "decrease volume" instruction, "wave sideways" to the "stop playback" instruction, and "clap both hands" to the "start playback" instruction. When the intelligent sound box determines that the gesture motion made by the user corresponds to the start-playback instruction, it starts playing accordingly; the content played can be music or news. Likewise, when the intelligent sound box determines that the gesture motion made by the user corresponds to the stop-playback instruction, it stops playing the sound content, so that the user is no longer disturbed by it.
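The prestored mapping described above can be sketched as a lookup from recognized gesture to control instruction, plus the resulting playback-state change. The gesture and instruction names, the volume step, and the 0-100 range are illustrative assumptions.

```python
# Hypothetical gesture-to-instruction table, following the examples in the text.
GESTURE_TO_INSTRUCTION = {
    "wave_up": "increase_volume",
    "wave_down": "decrease_volume",
    "wave_sideways": "stop_playback",
    "clap_both_hands": "start_playback",
}

def apply_instruction(state, gesture):
    """Return a new (playing, volume) playback state after the gesture."""
    playing, volume = state
    instruction = GESTURE_TO_INSTRUCTION.get(gesture)
    if instruction == "start_playback":
        playing = True
    elif instruction == "stop_playback":
        playing = False
    elif instruction == "increase_volume":
        volume = min(100, volume + 10)
    elif instruction == "decrease_volume":
        volume = max(0, volume - 10)
    # unrecognized gestures leave the playback state unchanged
    return playing, volume
```

Keeping the table in storage, as the text describes, means new gesture-instruction pairs can be added without changing the recognition pipeline.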
Optionally, the step of determining the control instruction corresponding to the gesture motion includes:
performing feature extraction on the gesture motion to obtain gesture motion features;
encoding the gesture motion features to obtain an encoding result;
determining the control instruction corresponding to the encoding result.
In this embodiment, the gesture motion features are the ordered set of the contour features of each frame. To obtain the gesture motion features, the feature values of each contour in each frame must be calculated. Specifically, for the extracted gesture contours, the contour feature values of each contour are computed; the contour feature values of each contour include the region histogram, the moments, and the centroid displacement distance of the contour.
The extracted gesture motion features are then encoded using 8 reference direction vectors, and the encoding result is calculated. The 8 reference directions are the eight directions obtained by dividing 360 degrees equally.
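The 8-reference-direction encoding can be sketched as a chain code: each displacement of the contour centroid between consecutive frames is quantized to the nearest of eight directions dividing 360 degrees equally. Using only the centroid trajectory is an illustrative simplification of the per-contour features named above.

```python
import math

def quantize_direction(dx, dy):
    """Map a displacement vector to one of 8 equal sectors
    (0 = direction of +x, numbered counter-clockwise)."""
    angle = math.atan2(dy, dx) % (2 * math.pi)
    # shift by half a sector so each reference direction sits mid-sector
    return int((angle + math.pi / 8) // (math.pi / 4)) % 8

def encode_trajectory(centroids):
    """Chain-code a list of (x, y) centroids, skipping zero displacements."""
    code = []
    for (x0, y0), (x1, y1) in zip(centroids, centroids[1:]):
        if (x1, y1) != (x0, y0):
            code.append(quantize_direction(x1 - x0, y1 - y0))
    return code
```

The resulting symbol sequence is a compact per-gesture code suitable as input to the template matching described next.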
The encoding result can be calculated with the DTW (dynamic time warping) algorithm. In the DTW algorithm, each gesture stored in the template library becomes a reference template, expressed as {T(1), T(2), ..., T(m), ..., T(M)}. The input gesture to be recognized is the test template, expressed as {S(1), S(2), ..., S(n), ..., S(N)}. Plotting the frame numbers m = 1, ..., M of the reference template on the vertical axis and the frame numbers n = 1, ..., N of the test template on the horizontal axis forms a grid, in which each intersection point (n, m) represents the pairing of some frame of the test template with some frame of the reference template.
The DTW algorithm can then be reduced to finding a path through the lattice points of this grid. Suppose the path passes through the lattice points (n1, m1), ..., (ni, mi), ..., (nN, mN) in order, where (n1, m1) = (1, 1) and (nN, mN) = (N, M). The path can be described by a function mi = f(ni), with ni = i for i = 1, 2, ..., N, f(1) = 1, and f(N) = M. So that the path is not too steep, its slope can be constrained to the range 0-2: if the path passes through the lattice point (ni, mi), its previous node can only be one of the three points (ni-1, mi), (ni-1, mi-1), or (ni-1, mi-2). The cumulative distance of the path is D[(ni, mi)] = d[S(ni), T(mi)] + D[(ni-1, m')], where the predecessor term D[(ni-1, m')] is determined by the following formula:
D[(ni-1, m')] = min{ D[(ni-1, mi)], D[(ni-1, mi-1)], D[(ni-1, mi-2)] }.
Finally, the control instruction corresponding to the encoding result is determined. The obtained encoding result is compared with preset encoding data, and the control instruction corresponding to the closest preset encoding data is output. To reduce the false detection rate, a proximity threshold can also be set: if the matching degree between the obtained encoding result and the preset encoding data is too low, no control instruction is output.
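The DTW recurrence above can be sketched directly: the cumulative distance is D(i, j) = d(S(i), T(j)) + min over the three allowed predecessors (i-1, j), (i-1, j-1), (i-1, j-2), with the path pinned to (1, 1) and (N, M). Treating frames as chain-code symbols with a 0/1 mismatch cost d() is an illustrative assumption.

```python
def dtw_distance(test, template):
    """Cumulative DTW distance between a test sequence S and a template T,
    with the slope constraint from the text (predecessors (i-1, j),
    (i-1, j-1), (i-1, j-2))."""
    INF = float("inf")
    n_len, m_len = len(test), len(template)
    cost = lambda i, j: 0 if test[i] == template[j] else 1
    D = [[INF] * m_len for _ in range(n_len)]
    D[0][0] = cost(0, 0)                     # the path starts at (1, 1)
    for i in range(1, n_len):
        for j in range(m_len):
            prev = min(D[i - 1][j],
                       D[i - 1][j - 1] if j >= 1 else INF,
                       D[i - 1][j - 2] if j >= 2 else INF)
            if prev < INF:
                D[i][j] = prev + cost(i, j)
    return D[n_len - 1][m_len - 1]           # the path must end at (N, M)

def best_match(test, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(test, templates[label]))
```

In line with the proximity threshold mentioned above, a caller could refuse to emit any control instruction when even the best distance exceeds a preset bound.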
Optionally, the playback control method further includes:
calculating the physical distance between the intelligent sound box and the human body;
adjusting the volume of the intelligent sound box according to the physical distance.
In this embodiment, the distance between the intelligent sound box and the user can be calculated directly by the active depth camera, and the volume is then adjusted according to the distance between the user and the intelligent sound box, so that the volume after adjustment reaches a preset value. For example, if the user hears a volume of 50 decibels at 5 meters from the intelligent sound box, then at 10 meters the volume of the intelligent sound box must be raised for the user to still hear 50 decibels. Since, indoors, distance and volume form a definite correspondence, the volume of the intelligent sound box can be adjusted according to that correspondence, so that the volume the user hears is the same at different locations. The preset value here can be the volume the user hears at 5 meters, or the factory-default volume at some physical distance.
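One concrete form of the distance-volume correspondence is the free-field point-source law: the level heard falls by 20*log10(d / d_ref) dB, so holding the user's perceived level constant means raising the source output by the same amount. Real rooms add reflections, so treating the free-field law as the stored correspondence is an assumption for illustration.

```python
import math

def required_output_db(target_db, distance_m, ref_distance_m=5.0):
    """Source level (dB) needed so a listener at distance_m hears target_db,
    given that target_db is calibrated at ref_distance_m (free-field law)."""
    return target_db + 20 * math.log10(distance_m / ref_distance_m)
```

For the example in the text, moving from 5 m to 10 m doubles the distance, so the output must rise by about 6 dB for the user to keep hearing 50 dB.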
Optionally, step S10 includes:
the intelligent sound box performing human-body detection based on the histogram of oriented gradients.
In this embodiment, the intelligent sound box can perform human-body detection based on the histogram of oriented gradients (HOG).
The histogram of oriented gradients is a local descriptor similar to the scale-invariant feature transform: it forms the human-body features by computing histograms of gradient orientation over local regions. Unlike the scale-invariant feature transform, which performs feature extraction at keypoints and is therefore a sparse description method, the histogram of oriented gradients is a dense description method.
The histogram-of-oriented-gradients description method has the following advantages: it represents the structural features of edges (gradients) and can therefore describe local shape information; the quantization of the position and orientation space suppresses, to a certain extent, the influence of translation and rotation; and the normalization over local regions partially offsets the influence of illumination. The embodiment of the present invention therefore preferably performs human-body detection based on the histogram of oriented gradients.
Optionally, the step of the intelligent sound box performing human-body detection based on the histogram of oriented gradients includes:
performing first-order gradient calculation on the image in the detection window;
calculating the gradient orientation histogram of each cell in the image;
normalizing all the cells in each block of the image to obtain the gradient orientation histogram of the block;
normalizing all the blocks in the image to obtain the gradient orientation histogram of the detection window, and using the gradient orientation histogram of the detection window as the human-body feature vector.
In this embodiment, first-order gradient calculation is first performed on the image in the detection window. Specifically, a detection window of standardized size (for example 64x128) is taken as input, and the gradients in the horizontal and vertical directions of the image in the detection window are calculated with the first-order (one-dimensional) Sobel operator [-1, 0, 1].
The benefit of using a single window as the classifier input is that the classifier is consistent with respect to the position and scale of the target. For an input image to be detected, the detection window must be moved along the horizontal and vertical directions, while the image is also zoomed over multiple scales, so that human bodies at different scales can be detected.
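The first-order gradient step can be sketched with the one-dimensional [-1, 0, 1] operator: horizontal and vertical central differences per pixel, from which gradient magnitude and orientation follow. Border pixels fall back to their nearest valid neighbour, an illustrative boundary choice.

```python
import math

def gradients(img):
    """Return (gx, gy, magnitude, orientation) maps for a 2-D grayscale image,
    using the 1-D [-1, 0, 1] operator in each direction."""
    h, w = len(img), len(img[0])
    gx = [[img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
           for x in range(w)] for y in range(h)]
    gy = [[img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
           for x in range(w)] for y in range(h)]
    mag = [[math.hypot(gx[y][x], gy[y][x]) for x in range(w)] for y in range(h)]
    # unsigned orientation folded into [0, pi), as used for the cell histograms
    ori = [[math.atan2(gy[y][x], gx[y][x]) % math.pi for x in range(w)]
           for y in range(h)]
    return gx, gy, mag, ori
```

In a full detector, this computation runs once per detection window while the window slides over the (multi-scale) input image.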
Then, the gradient orientation histogram of each cell in the image is calculated. Specifically, gradient orientation histograms are obtained by dense computation over a grid of cells and blocks: the image is divided into several cells, each cell consists of multiple pixels, and a block consists of several adjacent cells.
In this embodiment, the gradient of each pixel in the image is first calculated, and then the gradient orientation histogram of all the pixels in each cell, i.e. the gradient orientation histogram of the cell, is accumulated. When accumulating the gradient orientation histogram of a cell, the range [0, π] is first divided into multiple bins; then a weighted vote is cast according to the gradient orientation of each pixel in the cell, yielding the gradient orientation histogram of all the pixels in the cell.
When casting weighted votes, the weight of each pixel is preferably the gradient magnitude of that pixel. To eliminate aliasing, trilinear interpolation is preferably used for the weighted voting.
Traversing every cell in the image yields the gradient orientation histograms of the cells of the image.
All cells in described image in each block are normalized, it is straight to obtain described piece of gradient direction
Fang Tu.In block, the gradient orientation histogram of the cell in the block is normalized, to eliminate the influence of illumination,
So as to obtain the gradient orientation histogram of the block.Each block in traversing graph picture, the gradient direction for obtaining each block in image are straight
Fang Tu.
All pieces in described image are normalized, obtain the gradient orientation histogram of the detection window,
And using the gradient orientation histogram of the detection window as characteristics of human body's vector.By the detection window obtained after each piece of normalization
Gradient orientation histogram, form characteristics of human body vector, so as to fulfill human testing.
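The remaining steps can be sketched end to end: per-pixel orientations are voted, weighted by magnitude, into 9-bin cell histograms over [0, π), and the cell histograms of each block are L2-normalized and concatenated into the detection-window feature vector. The cell size, block size, and 9 bins follow the common HOG convention, an assumption the text does not fix; the nearest-bin vote replaces the trilinear interpolation mentioned above for brevity.

```python
import math

def cell_histogram(mag, ori, x0, y0, cell=4, bins=9):
    """Magnitude-weighted orientation histogram of one cell (nearest-bin vote)."""
    hist = [0.0] * bins
    for y in range(y0, y0 + cell):
        for x in range(x0, x0 + cell):
            b = int(ori[y][x] / math.pi * bins) % bins
            hist[b] += mag[y][x]
    return hist

def hog_vector(mag, ori, cell=4, block=2):
    """Concatenate the L2-normalized block histograms of the whole window."""
    h, w = len(mag), len(mag[0])
    cells = [[cell_histogram(mag, ori, x, y, cell)
              for x in range(0, w, cell)] for y in range(0, h, cell)]
    feature = []
    # blocks of `block` x `block` cells, stepping one cell at a time
    for by in range(len(cells) - block + 1):
        for bx in range(len(cells[0]) - block + 1):
            blk = [v for dy in range(block) for dx in range(block)
                   for v in cells[by + dy][bx + dx]]
            norm = math.sqrt(sum(v * v for v in blk)) or 1.0
            feature.extend(v / norm for v in blk)
    return feature
```

The resulting vector is what the text calls the human-body feature vector; a classifier trained on such vectors then decides whether the window contains a person.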
Since the histogram of oriented gradients is a dense computation, the amount of calculation is large. To reduce the computation and increase detection speed, one may choose to calculate the gradient orientation histogram only in key regions that have an obvious human-body contour, thereby achieving the purpose of reducing the dimensionality.
The present invention provides a playback control method including: the intelligent sound box performs human-body detection; when a human body is detected, the gesture motion of the human body is recognized; and the playback state of the intelligent sound box is adjusted according to the gesture motion. The method of the present invention adds an interaction mode for the intelligent sound box, so that the user can control it by gesture, improving the user experience.
With reference to Fig. 2, an embodiment of the present invention also proposes an intelligent sound box, including:
a detection module 10 for performing human-body detection;
a recognition module 20 for recognizing, when a human body is detected, the gesture motion of the human body;
an adjustment module 30 for adjusting the playback state of the intelligent sound box according to the gesture motion.
In this embodiment, a depth sensor is installed on the intelligent sound box. Depth sensors fall into two classes: passive stereo cameras and active depth cameras. A passive stereo camera observes the scene with two or more cameras and estimates the depth of the scene from the disparity (displacement) of features between the multiple camera views. An active depth camera projects invisible infrared light onto the scene and estimates the depth of the scene from the reflected information. In one application scenario, user A stands at some position relative to the intelligent sound box and makes gesture instructions, such as a start-playback instruction, toward its depth sensor; after the intelligent sound box recognizes the meaning of user A's gesture instruction, it plays sound.
In the detection module 10, the intelligent sound box performs human-body detection through the depth sensor. Detection can be based on image features such as the histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT), local binary patterns (LBP), or Haar features.
In the recognition module 20, when the intelligent sound box detects a human body, the gesture motion of the human body is recognized. Specifically, a segment of video data containing a gesture is acquired through the depth sensor, which here acts as a video recorder. The video data can be acquired according to a preset rule; for example, when the depth sensor observes that the user makes a large gesture motion, that segment of video data is determined to be video data containing a gesture.
The video data is decomposed into a sequence of frames, the background in each frame is separated from the gesture, and the gesture contour in each frame is found. The start frame and end frame of the gesture motion are determined by a preset rule, and the gesture contours between the start frame and the end frame are determined as the gesture motion. That is, a gesture motion consists of the gesture contours of multiple frames.
In adjustment module 30, after the gesture motion is obtained, feature extraction is performed on it to obtain gesture motion features; the features are recognized to obtain a recognition result, and a control instruction is finally generated from the recognition result. The intelligent sound box adjusts its playback state according to the control instruction: if the control instruction obtained is a start-playback instruction, the intelligent sound box starts playing sound; if it is a stop-playback instruction, the intelligent sound box stops playing sound.
Optionally, identification module 20 includes:
A separation unit, for separating the gesture from the background in each frame of gesture images of the detected human body, and finding the gesture profile in each frame of gesture images;
A start-gesture unit, for matching the gesture profiles frame by frame against a preset start gesture profile, and determining the first matching gesture profile as the start gesture profile;
An end-gesture unit, for matching the gesture profiles that follow the start gesture profile in sequence, frame by frame, against a preset end gesture profile, and determining the first matching gesture profile as the end gesture profile;
A gesture-motion unit, for determining the gesture sequence that begins with the start gesture profile and ends with the end gesture profile as one recognized gesture motion.
In the present embodiment, the intelligent sound box stores preset start gesture profiles and preset end gesture profiles corresponding to the different control instructions. Each gesture profile of the video data is first matched, frame by frame, against the preset start gesture profile, and the first matching frame is determined as the start gesture profile. The gesture profiles after that frame are then matched, frame by frame, against the preset end gesture profile, and the first matching frame is determined as the end gesture profile. The gesture profile sequence that begins with the start gesture profile and ends with the end gesture profile is determined as the gesture motion. The gesture motion thus obtained can be used to recognize the meaning the gesture carries and, in turn, to generate the corresponding control instruction.
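The frame-by-frame matching just described can be sketched as below. This is an illustration only: `matches_start` and `matches_end` stand in for whatever profile-comparison test the intelligent sound box actually applies.

```python
def segment_gesture(frames, matches_start, matches_end):
    """Cut one gesture motion out of a sequence of per-frame gesture
    profiles: from the first frame matching the preset start profile
    to the first later frame matching the preset end profile."""
    start = next((i for i, f in enumerate(frames) if matches_start(f)), None)
    if start is None:
        return None
    end = next((i for i in range(start + 1, len(frames))
                if matches_end(frames[i])), None)
    if end is None:
        return None
    return frames[start:end + 1]

# Toy run: profiles are labels; 'S' matches the start template, 'E' the end.
seq = ['idle', 'S', 'mid1', 'mid2', 'E', 'idle']
print(segment_gesture(seq, lambda f: f == 'S', lambda f: f == 'E'))
```

Frames before the start match and after the end match are discarded, mirroring the text's rule that only the profiles between the two matched frames form the gesture motion.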
Optionally, adjustment module 30 includes:
An instruction-determining unit, for determining the control instruction corresponding to the gesture motion;
An adjustment unit, for adjusting the playback state of the intelligent sound box according to the control instruction.
In the present embodiment, a storage chip on the intelligent sound box prestores the control instructions corresponding to multiple different gesture motions. For example, it can be specified that the gesture "wave upward" corresponds to a "raise volume" instruction, "wave downward" to a "lower volume" instruction, "wave" to a "stop playback" instruction, and "clap both hands" to a "start playback" instruction. When the intelligent sound box determines that the gesture made by the user corresponds to the start-playback instruction, it starts playback accordingly; the content played can be music or news. Likewise, when the intelligent sound box determines that the gesture corresponds to the stop-playback instruction, it stops playing the sound content, sparing the user the disturbance of the content that was playing.
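The prestored gesture-to-instruction mapping can be pictured as a simple lookup table. The gesture and instruction names below are hypothetical labels for the gestures listed above, not identifiers from the patent:

```python
# Hypothetical labels mirroring the gestures named in the text; the real
# table is whatever is prestored on the speaker's storage chip.
GESTURE_COMMANDS = {
    "wave_up": "raise_volume",
    "wave_down": "lower_volume",
    "wave": "stop_playback",
    "clap_both_hands": "start_playback",
}

def command_for(gesture, match_degree=1.0, threshold=0.8):
    """Look up the control instruction; emit nothing when the match is
    too uncertain, to keep the false-detection rate down."""
    if match_degree < threshold:
        return None
    return GESTURE_COMMANDS.get(gesture)

print(command_for("clap_both_hands"))  # start_playback
```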
Optionally, the instruction-determining unit includes:
A feature-obtaining subunit, for performing feature extraction on the gesture motion to obtain gesture motion features;
A coding subunit, for encoding the gesture motion features to obtain a coding result;
An instruction-determining subunit, for determining the control instruction corresponding to the coding result.
In the present embodiment, the gesture motion features are the ordered set of the contour features of every frame. To obtain them, the feature value of each profile in every frame must be calculated. Specifically, for the extracted gesture profiles, the contour feature value of each profile is calculated; the contour feature values include the region histogram, the moments, and the centroid displacement distance of each profile.
Then, the extracted gesture motion features are encoded using 8 reference direction vectors, and the coding result is calculated. The 8 reference directions are the eight directions obtained by dividing 360 degrees equally.
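Quantizing a motion vector into one of the eight equal 45-degree sectors can be sketched as follows. The convention of code 0 pointing right (east) with codes increasing counter-clockwise is an assumption; the patent does not fix an orientation convention:

```python
import math

def direction_code(dx, dy):
    """Quantize a motion vector into one of 8 reference directions:
    360 degrees split into eight 45-degree sectors, code 0 pointing
    east and codes increasing counter-clockwise."""
    angle = math.atan2(dy, dx) % (2 * math.pi)        # fold into [0, 2*pi)
    return int((angle + math.pi / 8) // (math.pi / 4)) % 8

# Right, up, left and down map to codes 0, 2, 4 and 6.
print([direction_code(1, 0), direction_code(0, 1),
       direction_code(-1, 0), direction_code(0, -1)])  # [0, 2, 4, 6]
```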
The coding result can be calculated with the DTW (dynamic time warping) algorithm. In the DTW algorithm, each gesture stored in the template library becomes a reference template, expressed as {T(1), T(2), ..., T(m), ..., T(M)}. The input gesture to be recognized is the test template, expressed as {S(1), S(2), ..., S(n), ..., S(N)}. Marking the frame numbers m = 1 to M of the reference template on the vertical axis and the frame numbers n = 1 to N of the test template on the horizontal axis forms a grid, in which each intersection point (n, m) represents the alignment of a frame of the test template with a frame of the reference template.
The DTW algorithm can thus be reduced to finding a path through the lattice points of this grid. Suppose the lattice points the path passes through are, in order, (n1, m1), ..., (ni, mi), ..., (nN, mN), where (n1, m1) = (1, 1) and (nN, mN) = (N, M). The path can be described by a function mi = f(ni), where ni = i, i = 1, 2, ..., N, f(1) = 1 and f(N) = M. To keep the path from tilting too steeply, its slope can be constrained to the range 0-2: if the path passes through lattice point (ni, mi), its previous node can only be one of the following three: (ni-1, mi), (ni-1, mi-1) or (ni-1, mi-2). The accumulated distance of the path is D[(ni, mi)] = d[S(ni), T(mi)] + D[(ni-1, mi-1)], where the previous accumulated distance D[(ni-1, mi-1)] is taken as the minimum over the three admissible predecessors:
D[(ni-1, mi-1)] = min{D[(ni-1, mi)], D[(ni-1, mi-1)], D[(ni-1, mi-2)]}.
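The recurrence above, with the three admissible predecessors (ni-1, mi), (ni-1, mi-1) and (ni-1, mi-2), can be implemented as a small dynamic program. A sketch assuming scalar frame features and an absolute-difference local distance d:

```python
def dtw_distance(test, template, dist=lambda a, b: abs(a - b)):
    """DTW distance between a test template S(1..N) and a reference
    template T(1..M), with the slope constraint from the text: node
    (n, m) is reached from (n-1, m), (n-1, m-1) or (n-1, m-2)."""
    INF = float("inf")
    N, M = len(test), len(template)
    D = [[INF] * M for _ in range(N)]
    D[0][0] = dist(test[0], template[0])              # path starts at (1, 1)
    for n in range(1, N):
        for m in range(M):
            prev = min(D[n - 1][m],
                       D[n - 1][m - 1] if m >= 1 else INF,
                       D[n - 1][m - 2] if m >= 2 else INF)
            if prev < INF:
                D[n][m] = prev + dist(test[n], template[m])
    return D[N - 1][M - 1]                            # path ends at (N, M)

# Aligning a time-stretched copy of a sequence costs nothing.
print(dtw_distance([1, 2, 2, 3], [1, 2, 3]))  # 0
```

At recognition time, the test template would be scored against every stored reference template and the gesture with the smallest DTW distance selected (subject to the proximity threshold discussed in the text).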
Finally, the control instruction corresponding to the coding result is determined. The coding result obtained is compared against preset coding data, and the control instruction of the closest preset coding data is output. To reduce the false-detection rate, a proximity threshold can also be set: if the matching degree between the obtained coding result and the preset coding data is too low, no control instruction is output.
Optionally, the intelligent sound box further includes:
A distance calculation module, for calculating the physical distance between the intelligent sound box and the human body;
A volume adjustment module, for adjusting the volume of the intelligent sound box according to the physical distance.
In the present embodiment, the distance between the intelligent sound box and the user can be calculated directly by the active depth camera, and the volume is then adjusted according to that distance. For example, when the user is 5 meters from the intelligent sound box and hears a volume of 50 decibels, then at 10 meters, for the user to still hear 50 decibels, the volume of the intelligent sound box must be raised. Since the scene is indoors, distance and volume follow a certain correspondence, and the volume of the intelligent sound box can be adjusted according to that correspondence so that the volume the user hears is the same at different locations. The preset value here can be the volume the user hears at 5 meters, or the factory-default volume at some physical distance.
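Under a free-field inverse-square assumption, keeping the perceived level constant requires about 20·log10(d/d0) dB of extra gain when the listener moves from the reference distance d0 to distance d. Indoors the real distance-volume correspondence is flatter because of reflections, as the text notes, so this formula is only a first approximation of that correspondence:

```python
import math

def volume_gain_db(distance_m, ref_distance_m=5.0):
    """Extra gain (dB) needed for the listener to hear the reference
    level at distance_m, under free-field inverse-square falloff."""
    return 20 * math.log10(distance_m / ref_distance_m)

# Moving from the 5 m reference to 10 m costs about 6 dB of gain.
print(round(volume_gain_db(10.0), 2))  # 6.02
```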
Optionally, detection module 10 includes:
A gradient detection unit, for performing human body detection based on the gradient orientation histogram.
In the present embodiment, the intelligent sound box can perform human body detection based on the gradient orientation histogram (Histogram of Oriented Gradients, HOG).
The gradient orientation histogram is a local descriptor similar to the scale-invariant feature transform: it forms the human body features by computing histograms of gradient orientations over local regions. It differs from the scale-invariant feature transform in that the latter extracts features at keypoints and is therefore a sparse description method, whereas the gradient orientation histogram is a dense description method.
The gradient orientation histogram description method has the following advantages: it represents edge (gradient) structure and can therefore describe local shape information; the quantization of the position and orientation space suppresses, to a certain extent, the influence of translation and rotation; and normalization within local regions partially offsets the influence of illumination. The embodiment of the present invention therefore preferably performs human body detection based on the gradient orientation histogram.
Optionally, the gradient detection unit includes:
A first-order gradient calculation subunit, for performing a first-order gradient calculation on the image in the detection window;
A cell gradient subunit, for calculating the gradient orientation histograms of the cells in the image;
A block gradient subunit, for normalizing all the cells within each block of the image to obtain the gradient orientation histogram of the block;
A feature-vector generation subunit, for normalizing all the blocks in the image to obtain the gradient orientation histogram of the detection window, and taking the gradient orientation histogram of the detection window as the human body feature vector.
In the present embodiment, a first-order gradient calculation is first performed on the image in the detection window. Specifically, a detection window (Detection Window) of standardized size (e.g., 64x128) is taken as input, and the gradients in the horizontal and vertical directions of the image in the detection window are calculated with the first-order, one-dimensional derivative mask [-1, 0, 1].
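The gradient computation with the 1-D mask [-1, 0, 1] can be sketched in pure Python as below (borders replicated; a real implementation would operate on image arrays):

```python
import math

def gradients(img):
    """Per-pixel gradient magnitude and unsigned orientation in [0, pi),
    using the 1-D derivative mask [-1, 0, 1]; img is a 2-D list of
    grayscale values and borders are replicated."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.atan2(gy, gx) % math.pi
    return mag, ang

# A vertical step edge produces a purely horizontal gradient.
mag, ang = gradients([[0, 0, 10, 10]] * 3)
print(mag[1][1], ang[1][1])  # 10.0 0.0
```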
The benefit of using a single fixed window as the classifier input is that the classifier is consistent with respect to the position and scale of the target. For an input image to be detected, the detection window must therefore be moved both horizontally and vertically, while the image is scaled over multiple levels so that human bodies at different scales can be detected.
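The horizontal/vertical window sweep combined with multi-scale resizing can be sketched as a window generator over an image pyramid. The 1.2 scale step and 8-pixel stride are common choices, not values fixed by the patent:

```python
def sliding_windows(img_w, img_h, win_w=64, win_h=128, stride=8, scale=1.2):
    """Yield (x, y, s) detection-window origins over an image pyramid.
    The window stays win_w x win_h while the image shrinks by `scale`
    per level, so larger bodies are caught at coarser levels."""
    s = 1.0
    while img_w / s >= win_w and img_h / s >= win_h:
        w, h = int(img_w / s), int(img_h / s)
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                yield x, y, s
        s *= scale

# Count the windows scanned for a 128x256 input image.
print(sum(1 for _ in sliding_windows(128, 256)))
```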
Then, the gradient orientation histograms of the cells in the image are calculated. Specifically, the gradient orientation histogram is obtained by dense computation over a grid of cells (Cell) and blocks (Block): the image is divided into several cells, each cell is composed of multiple pixels, and a block is composed of several adjacent cells.
In this embodiment, the gradient of each pixel in the image is first calculated, and then the gradient orientation histogram of all the pixels within each cell, i.e., the gradient orientation histogram of that cell, is accumulated. When accumulating a cell's gradient orientation histogram, the orientation range [0, π] is first divided into multiple bins for that cell, and each pixel in the cell then casts a weighted vote according to its gradient orientation, yielding the gradient orientation histogram of all the pixels in the cell.
When casting the weighted votes, the weight of each pixel is preferably the gradient magnitude of that pixel. To reduce aliasing, trilinear interpolation (Trilinear Interpolation) is preferably used for the weighted voting.
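The weighted vote for one cell can be sketched as follows, using 9 bins over [0, π) and the gradient magnitude as the vote weight (the trilinear interpolation the text prefers is omitted for brevity):

```python
import math

def cell_histogram(magnitudes, angles, bins=9):
    """Orientation histogram of one cell: [0, pi) is split into `bins`
    equal sectors and each pixel votes with its gradient magnitude."""
    hist = [0.0] * bins
    width = math.pi / bins
    for m, a in zip(magnitudes, angles):
        hist[int(a / width) % bins] += m
    return hist

# Two pixels: a strong horizontal gradient and a weaker one at ~25 degrees.
print(cell_histogram([5.0, 1.0], [0.0, math.radians(25)]))
```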
Traversing every cell in the image yields the gradient orientation histograms of all the cells in the image.
All the cells within each block of the image are then normalized to obtain the gradient orientation histogram of the block. Within a block, the gradient orientation histograms of the cells in the block are normalized to eliminate the influence of illumination, thereby obtaining the gradient orientation histogram of the block. Traversing every block in the image yields the gradient orientation histograms of all the blocks in the image.
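Block normalization is typically an L2 normalization of the block's concatenated cell histograms; a sketch follows (the L2 scheme is an assumption, since the patent does not name the norm):

```python
import math

def l2_normalize(block_hist, eps=1e-6):
    """L2-normalize one block's concatenated cell histograms, damping
    illumination and contrast changes; eps avoids division by zero."""
    norm = math.sqrt(sum(v * v for v in block_hist) + eps * eps)
    return [v / norm for v in block_hist]

# Scaling all inputs (e.g. a brighter scene) leaves the output unchanged.
print(l2_normalize([3.0, 4.0]))
```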
All the blocks in the image are normalized to obtain the gradient orientation histogram of the detection window, which is taken as the human body feature vector. The gradient orientation histograms obtained after normalizing each block are assembled into the human body feature vector of the detection window, thereby accomplishing human body detection.
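As a concreteness check on the resulting vector size: with the common (assumed, not patent-specified) parameters of 8x8-pixel cells, 2x2-cell blocks, an 8-pixel block stride and 9 orientation bins, a 64x128 window yields 7x15 block positions and a 3780-dimensional human body feature vector:

```python
def hog_dims(win_w=64, win_h=128, cell=8, block_cells=2, stride=8, bins=9):
    """Length of the final HOG vector for one detection window (common
    Dalal-Triggs parameters; the patent does not fix these values)."""
    block_px = block_cells * cell
    blocks_x = (win_w - block_px) // stride + 1   # 7 for a 64-px width
    blocks_y = (win_h - block_px) // stride + 1   # 15 for a 128-px height
    return blocks_x * blocks_y * block_cells * block_cells * bins

print(hog_dims())  # 3780
```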
Since the gradient orientation histogram is computed densely, the amount of calculation is large. To reduce the calculation and raise the detection speed, one may consider computing the gradient orientation histogram only in key regions with an obvious human body contour, thereby reducing the dimensionality.
The present invention provides an intelligent sound box that performs human body detection; identifies, when a human body is detected, the gesture motion of that human body; and adjusts its playback state according to the gesture motion. The intelligent sound box provided by the invention adds an interactive mode, allowing the user to control the intelligent sound box by gesture and improving the user experience.
The invention also provides an intelligent sound box including a memory, a processor, and at least one application program that is stored in the memory and configured to be executed by the processor, the application program being configured to perform the above playback control method.
In the embodiments of the present invention, the processor included in the intelligent sound box also has the following functions:
performing human body detection;
identifying, when a human body is detected, the gesture motion of the human body;
adjusting the playback state of the intelligent sound box according to the gesture motion.
The foregoing is merely embodiments of the present invention and is not intended to limit the invention; for those skilled in the art, the invention may be variously modified and varied. Any modifications, equivalent replacements, and improvements made within the spirit and principles of the invention shall fall within the scope of the claims of the present invention.
Claims (10)
1. A playback control method, which is characterized in that it comprises the following steps:
an intelligent sound box performs human body detection;
when a human body is detected, a gesture motion of the human body is identified;
a playback state of the intelligent sound box is adjusted according to the gesture motion.
2. The playback control method according to claim 1, which is characterized in that the step of identifying the gesture motion of the human body includes:
separating the gesture from the background in each frame of gesture images of the detected human body, and finding the gesture profile in each frame of gesture images;
matching the gesture profiles frame by frame against a preset start gesture profile, and determining the first matching gesture profile as a start gesture profile;
matching the gesture profiles that follow the start gesture profile in sequence, frame by frame, against a preset end gesture profile, and determining the first matching gesture profile as an end gesture profile;
determining the gesture profile sequence that begins with the start gesture profile and ends with the end gesture profile as one recognized gesture motion.
3. The playback control method according to claim 1, which is characterized in that the step of adjusting the playback state of the intelligent sound box according to the gesture motion includes:
determining a control instruction corresponding to the gesture motion;
adjusting the playback state of the intelligent sound box according to the control instruction.
4. The playback control method according to claim 3, which is characterized in that the step of determining the control instruction corresponding to the gesture motion includes:
performing feature extraction on the gesture motion to obtain gesture motion features;
encoding the gesture motion features to obtain a coding result;
determining a control instruction corresponding to the coding result.
5. The playback control method according to any one of claims 1 to 4, which is characterized in that the method further includes:
calculating a physical distance between the intelligent sound box and the human body;
adjusting a volume of the intelligent sound box according to the physical distance.
6. An intelligent sound box, which is characterized in that it includes:
a detection module, for performing human body detection;
an identification module, for identifying a gesture motion of the human body when a human body is detected;
an adjustment module, for adjusting a playback state of the intelligent sound box according to the gesture motion.
7. The intelligent sound box according to claim 6, which is characterized in that the identification module includes:
a separation unit, for separating the gesture from the background in each frame of gesture images of the detected human body, and finding the gesture profile in each frame of gesture images;
a start-gesture unit, for matching the gesture profiles frame by frame against a preset start gesture profile, and determining the first matching gesture profile as a start gesture profile;
an end-gesture unit, for matching the gesture profiles that follow the start gesture profile in sequence, frame by frame, against a preset end gesture profile, and determining the first matching gesture profile as an end gesture profile;
a gesture-motion unit, for determining the gesture sequence that begins with the start gesture profile and ends with the end gesture profile as one recognized gesture motion.
8. The intelligent sound box according to claim 6, which is characterized in that the adjustment module includes:
an instruction-determining unit, for determining a control instruction corresponding to the gesture motion;
an adjustment unit, for adjusting the playback state of the intelligent sound box according to the control instruction.
9. The intelligent sound box according to claim 8, which is characterized in that the instruction-determining unit includes:
a feature-obtaining subunit, for performing feature extraction on the gesture motion to obtain gesture motion features;
a coding subunit, for encoding the gesture motion features to obtain a coding result;
an instruction-determining subunit, for determining a control instruction corresponding to the coding result.
10. The intelligent sound box according to any one of claims 6 to 9, which is characterized in that it further includes:
a distance calculation module, for calculating a physical distance between the intelligent sound box and the human body;
a volume adjustment module, for adjusting a volume of the intelligent sound box according to the physical distance.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810142948.9A CN108064006A (en) | 2018-02-11 | 2018-02-11 | Intelligent sound box and control method for playing back |
PCT/CN2018/077458 WO2019153382A1 (en) | 2018-02-11 | 2018-02-27 | Intelligent speaker and playing control method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810142948.9A CN108064006A (en) | 2018-02-11 | 2018-02-11 | Intelligent sound box and control method for playing back |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108064006A true CN108064006A (en) | 2018-05-22 |
Family
ID=62134459
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810142948.9A Pending CN108064006A (en) | 2018-02-11 | 2018-02-11 | Intelligent sound box and control method for playing back |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108064006A (en) |
WO (1) | WO2019153382A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113659950B (en) * | 2021-08-13 | 2024-03-22 | 深圳市百匠科技有限公司 | High-fidelity multipurpose sound control method, system, device and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101763515A (en) * | 2009-09-23 | 2010-06-30 | 中国科学院自动化研究所 | Real-time gesture interaction method based on computer vision |
CN103092332A (en) * | 2011-11-08 | 2013-05-08 | 苏州中茵泰格科技有限公司 | Digital image interactive method and system of television |
CN103458288A (en) * | 2013-09-02 | 2013-12-18 | 湖南华凯创意展览服务有限公司 | Gesture sensing method, gesture sensing device and audio/video playing system |
CN103679154A (en) * | 2013-12-26 | 2014-03-26 | 中国科学院自动化研究所 | Three-dimensional gesture action recognition method based on depth images |
CN105744434A (en) * | 2016-02-25 | 2016-07-06 | 深圳市广懋创新科技有限公司 | Intelligent loudspeaker box control method and system based on gesture recognition |
CN106358120A (en) * | 2016-09-23 | 2017-01-25 | 成都创慧科达科技有限公司 | Audio play device with various regulation methods |
2018
- 2018-02-11 CN CN201810142948.9A patent/CN108064006A/en active Pending
- 2018-02-27 WO PCT/CN2018/077458 patent/WO2019153382A1/en active Application Filing
Non-Patent Citations (1)
Title |
---|
LI, Kai: "An Improved DTW Dynamic Gesture Recognition Method", Journal of Chinese Computer Systems (《小型微型计算机系统》) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111242149A (en) * | 2018-11-28 | 2020-06-05 | 珠海格力电器股份有限公司 | Smart home control method and device, storage medium, processor and smart home |
CN111182381A (en) * | 2019-10-10 | 2020-05-19 | 广东小天才科技有限公司 | Camera control method of intelligent sound box, intelligent sound box and storage medium |
CN111182381B (en) * | 2019-10-10 | 2021-08-20 | 广东小天才科技有限公司 | Camera control method of intelligent sound box, intelligent sound box and storage medium |
CN112992796A (en) * | 2021-02-09 | 2021-06-18 | 深圳市众芯诺科技有限公司 | Intelligent visual sound box chip |
CN113311939A (en) * | 2021-04-01 | 2021-08-27 | 江苏理工学院 | Intelligent sound box control system based on gesture recognition |
Also Published As
Publication number | Publication date |
---|---|
WO2019153382A1 (en) | 2019-08-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108064006A (en) | Intelligent sound box and control method for playing back | |
US11450146B2 (en) | Gesture recognition method, apparatus, and device | |
US20230077355A1 (en) | Tracker assisted image capture | |
JP5366824B2 (en) | Method and system for converting 2D video to 3D video | |
CN108197618B (en) | Method and device for generating human face detection model | |
US11042991B2 (en) | Determining multiple camera positions from multiple videos | |
CN109858563B (en) | Self-supervision characterization learning method and device based on transformation recognition | |
KR20150110697A (en) | Systems and methods for tracking and detecting a target object | |
JP2007328746A (en) | Apparatus and method for tracking object | |
JP2011134114A (en) | Pattern recognition method and pattern recognition apparatus | |
WO2021031954A1 (en) | Object quantity determination method and apparatus, and storage medium and electronic device | |
CN107169417B (en) | RGBD image collaborative saliency detection method based on multi-core enhancement and saliency fusion | |
CN107169503B (en) | Indoor scene classification method and device | |
CN111914878A (en) | Feature point tracking training and tracking method and device, electronic equipment and storage medium | |
JP2017228224A (en) | Information processing device, information processing method, and program | |
CN103105924A (en) | Man-machine interaction method and device | |
JP2020071875A (en) | Deep learning model used for image recognition, and apparatus and method for training the model | |
CN109697385A (en) | A kind of method for tracking target and device | |
KR102102164B1 (en) | Method, apparatus and computer program for pre-processing video | |
CN112149557A (en) | Person identity tracking method and system based on face recognition | |
CN109711267A (en) | A kind of pedestrian identifies again, pedestrian movement's orbit generation method and device | |
KR20140068746A (en) | Method, system and computer-readable recording media for motion recognition | |
Cordea et al. | Real-time 2 (1/2)-D head pose recovery for model-based video-coding | |
CN116611491A (en) | Training method and device of target detection model, electronic equipment and storage medium | |
CN110197123A (en) | A kind of human posture recognition method based on Mask R-CNN |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180522 |