WO2019153382A1 - Smart speaker and playback control method - Google Patents

Smart speaker and playback control method

Info

Publication number
WO2019153382A1
WO2019153382A1 (PCT/CN2018/077458)
Authority
WO
WIPO (PCT)
Prior art keywords
gesture
smart speaker
human body
profile
image
Prior art date
Application number
PCT/CN2018/077458
Other languages
English (en)
French (fr)
Inventor
王声平
张立新
Original Assignee
深圳市沃特沃德股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市沃特沃德股份有限公司
Publication of WO2019153382A1

Links

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/22 Arrangements for obtaining desired frequency characteristic only
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/04 Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H04R 27/00 Public address systems
    • H04R 2430/00 Signal processing covered by H04R, not provided for in its groups
    • H04R 2430/01 Aspects of volume control, not necessarily automatic, in sound systems

Definitions

  • The invention relates to the field of smart speakers, and in particular to a smart speaker and a playback control method.
  • A smart speaker is an upgraded loudspeaker: a tool that lets home users access the Internet by voice, for example to request songs, shop online, or check the weather forecast. It can also control smart home devices, such as opening curtains, setting the refrigerator temperature, or heating the water heater in advance.
  • Smart speakers represented by the Amazon Echo all rely on intelligent voice technology, so their operation is controlled by voice commands. However, typical home environments contain substantial background noise, which interferes with the correct recognition of voice commands and degrades the user experience. Additional interaction methods are therefore needed to make it easier for users to interact with smart speakers and to improve the user experience.
  • The main object of the present invention is to provide a smart speaker and a playback control method that enhance the user experience of using a smart speaker.
  • The invention provides a playback control method comprising the following steps: the smart speaker performs human body detection; when a human body is detected, a gesture action of the human body is recognized; and the playing state of the smart speaker is adjusted according to the gesture action.
  • Preferably, the step of recognizing the gesture action of the human body comprises: separating the gesture in each frame of the detected gesture images from the background and finding the gesture contour in each frame; matching the gesture contours frame by frame against a preset start gesture contour and determining the first matched contour as the start gesture contour; matching the contours that follow the start gesture contour in time frame by frame against a preset end gesture contour and determining the first matched contour as the end gesture contour; and determining the gesture action that starts with the start gesture contour and ends with the end gesture contour as the recognized set of gesture actions.
  • Preferably, the step of adjusting the playing state of the smart speaker according to the gesture action comprises: determining the control instruction corresponding to the gesture action; and adjusting the playing state of the smart speaker according to the control instruction.
  • Preferably, the step of determining the control instruction corresponding to the gesture action comprises: performing feature extraction on the gesture action to obtain gesture action features; encoding the gesture action features to obtain an encoding result; and determining the control instruction corresponding to the encoding result.
  • Preferably, the method further includes: calculating the physical distance between the smart speaker and the human body; and adjusting the volume of the smart speaker according to the physical distance.
  • Preferably, the step of the smart speaker performing human body detection comprises: the smart speaker performs human body detection based on the histogram of oriented gradients.
  • Preferably, the step of the smart speaker performing human body detection based on the histogram of oriented gradients comprises: performing a first-order gradient calculation on the image within the detection window; calculating the histogram of oriented gradients of each cell in the image; normalizing all cells within each block of the image to obtain the histogram of oriented gradients of the block; and normalizing all blocks within the image to obtain the histogram of oriented gradients of the detection window, which is used as the human body feature vector.
  • The invention also provides a smart speaker comprising:
  • a detection module configured to perform human body detection;
  • an identification module configured to recognize a gesture action of the human body when a human body is detected;
  • an adjustment module configured to adjust the playing state of the smart speaker according to the gesture action.
  • Preferably, the identification module comprises:
  • a separation unit configured to separate the gesture in each frame of the detected gesture images from the background and find the gesture contour in each frame;
  • a start gesture unit configured to match the gesture contours frame by frame against a preset start gesture contour and determine the first matched contour as the start gesture contour;
  • an end gesture unit configured to match the contours that follow the start gesture contour in time frame by frame against a preset end gesture contour and determine the first matched contour as the end gesture contour;
  • a gesture action unit configured to determine the gesture action that starts with the start gesture contour and ends with the end gesture contour as the recognized set of gesture actions.
  • Preferably, the adjustment module comprises:
  • an instruction determining unit configured to determine the control instruction corresponding to the gesture action;
  • an adjusting unit configured to adjust the playing state of the smart speaker according to the control instruction.
  • Preferably, the instruction determining unit comprises:
  • a feature obtaining subunit configured to perform feature extraction on the gesture action to obtain gesture action features;
  • an encoding subunit configured to encode the gesture action features to obtain an encoding result;
  • an instruction determining subunit configured to determine the control instruction corresponding to the encoding result.
  • Preferably, the smart speaker further comprises:
  • a distance calculation module configured to calculate the physical distance between the smart speaker and the human body;
  • a volume adjustment module configured to adjust the volume of the smart speaker according to the physical distance.
  • Preferably, the detection module comprises:
  • a gradient detection unit configured to perform human body detection based on the histogram of oriented gradients.
  • Preferably, the gradient detection unit comprises:
  • a first-order gradient calculation subunit configured to perform a first-order gradient calculation on the image within the detection window;
  • a cell gradient subunit configured to calculate the histogram of oriented gradients of each cell in the image;
  • a block gradient subunit configured to normalize all cells within each block of the image to obtain the histogram of oriented gradients of the block;
  • a feature vector generation subunit configured to normalize all blocks within the image to obtain the histogram of oriented gradients of the detection window, which is used as the human body feature vector.
  • The present invention also provides a smart speaker comprising a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform the playback control method described above.
  • The invention provides a smart speaker and a playback control method, the playback control method comprising: the smart speaker performs human body detection; when a human body is detected, a gesture action of the human body is recognized; and the playing state of the smart speaker is adjusted according to the gesture action.
  • The method adds an interaction mode to the smart speaker, allowing the user to control it through gestures and improving the user experience.
  • FIG. 1 is a schematic flow chart of an embodiment of a playback control method according to the present invention.
  • FIG. 2 is a schematic structural view of an embodiment of a smart speaker according to the present invention.
  • Referring to FIG. 1, an embodiment of the present invention provides a playback control method comprising the following steps: S10, the smart speaker performs human body detection; S20, when a human body is detected, a gesture action of the human body is recognized; S30, the playing state of the smart speaker is adjusted according to the gesture action.
  • In this embodiment, a depth sensor is mounted on the smart speaker.
  • Depth sensors fall into two types: passive stereo cameras and active depth cameras.
  • A passive stereo camera uses two or more cameras to observe the scene and estimates scene depth from the disparity (shift) between features in the multiple views.
  • An active depth camera projects invisible infrared light into the scene and estimates scene depth from the reflected signal.
  • In one application scenario, user A stands at some distance from the smart speaker and makes gesture commands toward its depth sensor, such as a start-playback command; once the smart speaker recognizes the meaning of the gesture command, it plays sound.
  • In step S10, the smart speaker performs human body detection through the depth sensor. Detection can be based on image features such as the histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT), local binary patterns (LBP), or Haar-like features; a library-based sketch follows.
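  • As a concrete illustration of such feature-based person detection, the sketch below uses OpenCV's stock HOG pedestrian detector; the choice of OpenCV and of its default people detector is an assumption made here for illustration, since the patent does not name a library.

```python
# Minimal sketch of HOG-based human body detection, assuming OpenCV is
# available; the patent does not prescribe a library, so this is illustrative.
import cv2

def detect_people(frame):
    """Return bounding boxes of people detected in a BGR frame."""
    hog = cv2.HOGDescriptor()  # default 64x128 detection window
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
    # Slide the window and scan an image pyramid (multi-scale detection).
    boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)
    return boxes
```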
  • In step S20, when the smart speaker detects a human body, it recognizes the gesture action of that body.
  • Specifically, the depth sensor captures a segment of video data containing a gesture; here the depth sensor acts as a video recorder.
  • Video data can be selected according to preset rules. For example, when the depth sensor observes a large gesture movement by the user, that segment is determined to be video data containing a gesture.
  • The video data is parsed into consecutive frames, the background in each frame is separated from the gesture, and the gesture contour in each frame is found.
  • The start frame and end frame of the gesture action are determined according to preset rules.
  • The gesture contours between the start frame and the end frame are determined as the gesture action; that is, a gesture action consists of the gesture contours of multiple frames.
  • In step S30, after the gesture action is obtained, features are extracted from it to obtain gesture action features; the features are recognized to obtain a recognition result; and finally a control instruction is generated from the recognition result.
  • The smart speaker adjusts its playing state according to the control instruction: if the obtained instruction is a start-playback instruction, the smart speaker starts playing sound; if it is a stop-playback instruction, it stops.
  • Optionally, step S20 includes: separating the gesture in each frame of the detected gesture images from the background and finding the gesture contour in each frame; matching the contours frame by frame against a preset start gesture contour and determining the first match as the start gesture contour; matching the contours that follow in time frame by frame against a preset end gesture contour and determining the first match as the end gesture contour; and determining the gesture action that starts with the start gesture contour and ends with the end gesture contour as the recognized set of gesture actions.
  • In this embodiment, the smart speaker stores preset start gesture contours and preset end gesture contours corresponding to different control instructions.
  • The gesture contours of the video data are first matched frame by frame against the preset start gesture contour, and the first matching contour is determined as the start gesture contour.
  • The contours in the frames that follow are then matched frame by frame against the preset end gesture contour, and the first matching contour is determined as the end gesture contour.
  • The contour sequence that starts with the start gesture contour and ends with the end gesture contour is determined as the gesture action, as sketched below.
  • The obtained gesture action can then be used to identify the meaning of the gesture and generate the corresponding control instruction.
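  • A minimal sketch of this start/end segmentation follows, assuming per-frame contours have already been extracted and that matches(contour, template) is a hypothetical thresholded shape-similarity test; none of these names come from the patent.

```python
# Sketch of cutting a gesture action out of a sequence of per-frame contours,
# under the assumption that `matches` implements some contour-similarity test.
def segment_gesture(contours, start_template, end_template, matches):
    """Return the contour subsequence from the first frame matching the start
    template through the first later frame matching the end template, or None
    if no complete gesture is found."""
    start = next((i for i, c in enumerate(contours)
                  if matches(c, start_template)), None)
    if start is None:
        return None
    end = next((j for j in range(start + 1, len(contours))
                if matches(contours[j], end_template)), None)
    if end is None:
        return None
    return contours[start:end + 1]  # the recognized gesture action
```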
  • Optionally, step S30 includes: determining the control instruction corresponding to the gesture action; and adjusting the playing state of the smart speaker according to the control instruction.
  • In this embodiment, a storage chip on the smart speaker pre-stores control instructions corresponding to several different gesture actions. For example, it may be specified that:
  • the gesture "upward swing" corresponds to the "raise volume" instruction;
  • the gesture "downward swing" corresponds to the "lower volume" instruction;
  • the gesture "hand wave" corresponds to the "stop playback" instruction;
  • the gesture "clap with both hands" corresponds to the "start playback" instruction.
  • When the smart speaker determines that the user's gesture corresponds to the start-playback instruction, it plays according to that instruction; the content played may be music or news.
  • Likewise, when the smart speaker determines that the user's gesture corresponds to the stop-playback instruction, it stops playing sound, sparing the user interference from the content that was playing; a small mapping sketch follows.
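  • One plausible way to hold such a gesture-to-instruction table is a plain lookup, sketched below; the gesture labels and speaker method names are illustrative assumptions, not part of the patent.

```python
# Illustrative gesture-to-instruction table; all labels are assumptions.
GESTURE_COMMANDS = {
    "upward_swing":   "raise_volume",
    "downward_swing": "lower_volume",
    "hand_wave":      "stop_playback",
    "double_clap":    "start_playback",
}

def dispatch(gesture, speaker):
    """Map a recognized gesture label to a playback-state change."""
    command = GESTURE_COMMANDS.get(gesture)
    if command is not None:           # unknown gestures are ignored
        getattr(speaker, command)()   # e.g., speaker.start_playback()
```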
  • Optionally, the step of determining the control instruction corresponding to the gesture action comprises: performing feature extraction on the gesture action to obtain gesture action features; encoding the features to obtain an encoding result; and determining the control instruction corresponding to the encoding result.
  • The gesture action features are the sequence of contour features of each frame.
  • To obtain them, the feature values of every contour in every frame must be calculated.
  • For the extracted gesture contours, the contour feature values of each contour are computed.
  • The contour feature values of each contour include its region histogram, moments, and earth mover's distance (two of these are sketched below).
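  • A small sketch of computing per-contour moment features with OpenCV follows; the region-histogram and earth-mover's-distance components are omitted for brevity, and the use of OpenCV contour arrays is an assumption for illustration.

```python
# Sketch of per-contour feature values, assuming OpenCV-style contours.
import cv2

def contour_features(contour):
    """contour: an OpenCV contour (Nx1x2 int array). Returns a feature dict
    with the raw spatial moments' area term and the 7 Hu invariant moments."""
    m = cv2.moments(contour)        # spatial/central moments of the contour
    hu = cv2.HuMoments(m).ravel()   # scale- and rotation-invariant moments
    return {"area": m["m00"], "hu_moments": hu}
```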
  • The extracted gesture action features are then encoded using eight reference direction vectors, and the encoding result is computed.
  • The eight reference directions are the eight directions that divide 360 degrees equally, as in the quantization sketch below.
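  • A minimal sketch of quantizing motion into these eight directions follows; reducing the gesture to inter-frame centroid displacements is an assumption made here for illustration, since the patent does not specify the exact quantity being encoded.

```python
# Quantize inter-frame motion vectors into the 8 reference directions
# (360 degrees / 8 = 45-degree sectors). Using contour-centroid
# displacements as the encoded quantity is an illustrative assumption.
import math

def encode_directions(points):
    """points: per-frame contour centroids [(x, y), ...]; returns codes 0..7."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        angle = math.atan2(y1 - y0, x1 - x0) % (2 * math.pi)
        codes.append(int((angle + math.pi / 8) // (math.pi / 4)) % 8)
    return codes
```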
  • The DTW (dynamic time warping) algorithm can be used to score the encoding result against the stored templates.
  • Each gesture already stored in the template library serves as a sample template, represented as {T(1), T(2), ..., T(m), ..., T(M)}; the input gesture to be recognized is the test template, represented as {S(1), S(2), ..., S(n), ..., S(N)}.
  • DTW searches for a path through the grid of frame pairs (n, m). To keep the path from becoming too skewed, its slope can be constrained to the range 0-2: if the path passes through grid point (n_i, m_i), the previous node can only be one of (n_i - 1, m_i), (n_i - 1, m_i - 1), or (n_i - 1, m_i - 2).
  • The accumulated distance along the path is D[(n_i, m_i)] = d[S(n_i), T(m_i)] + min{ D[(n_i - 1, m_i)], D[(n_i - 1, m_i - 1)], D[(n_i - 1, m_i - 2)] }; a runnable sketch follows.
  • Finally, the control instruction corresponding to the encoding result is determined: the obtained encoding result is compared with preset encoded data, and the control instruction corresponding to the closest preset encoded data is output.
  • To reduce the false detection rate, a similarity threshold can also be set: if the match between the obtained encoding result and the preset encoded data is too weak, no control instruction is output.
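  • A minimal runnable sketch of this DTW recursion follows; treating the frames as 8-direction codes and using a circular code difference for d[S(n), T(m)] are assumptions for illustration.

```python
# DTW with the predecessor set {(n-1, m), (n-1, m-1), (n-1, m-2)} described
# above. Sequences are 8-direction codes; the local distance is an assumed
# circular difference between codes.
INF = float("inf")

def ddist(a, b):
    """Circular distance between two 8-direction codes (0..7)."""
    diff = abs(a - b) % 8
    return min(diff, 8 - diff)

def dtw(test, template):
    """Accumulated DTW distance between test S(1..N) and template T(1..M)."""
    N, M = len(test), len(template)
    D = [[INF] * M for _ in range(N)]
    D[0][0] = ddist(test[0], template[0])
    for n in range(1, N):
        for m in range(M):
            prev = min(D[n - 1][m],
                       D[n - 1][m - 1] if m >= 1 else INF,
                       D[n - 1][m - 2] if m >= 2 else INF)
            if prev < INF:
                D[n][m] = ddist(test[n], template[m]) + prev
    return D[N - 1][M - 1]  # endpoint constraint f(N) = M

# The template with the smallest DTW distance wins; thresholding that
# distance suppresses spurious commands, as described above.
```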
  • Optionally, the playback control method further includes: calculating the physical distance between the smart speaker and the human body; and adjusting the volume of the smart speaker according to the physical distance.
  • The distance between the smart speaker and the user can be computed directly by the active depth camera, and the volume is then adjusted according to that distance so that the adjusted volume reaches a preset value.
  • For example, when the user is 5 meters from the smart speaker, the perceived volume is 50 decibels; when the user is 10 meters away, the speaker's volume must be raised for the user to still hear 50 decibels. Since distance and volume follow a certain correspondence indoors, the speaker's volume can be adjusted according to this correspondence so that the user hears the same volume at different locations.
  • The preset value here can be the volume heard by the user at 5 meters, or the volume at a physical distance preset by the manufacturer; a sketch of one such correspondence follows.
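  • One simple form of this distance-volume correspondence is the free-field rule that sound level falls by 20*log10 of the distance ratio; treating the room as approximately free-field is an assumption made here for illustration, not something the patent specifies.

```python
# Sketch of distance-compensated volume, assuming an approximately free-field
# 20*log10 distance law; real rooms deviate, so a measured lookup table could
# replace this formula.
import math

def output_gain_db(distance_m, target_db=50.0, ref_distance_m=5.0):
    """Level (dB) the speaker must produce, referenced to the output that
    yields `target_db` at `ref_distance_m`."""
    return target_db + 20.0 * math.log10(distance_m / ref_distance_m)

# e.g. output_gain_db(10.0) -> ~56 dB: +6 dB compensates the doubled distance.
```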
  • Optionally, step S10 includes: the smart speaker performs human body detection based on the histogram of oriented gradients.
  • In this embodiment, the smart speaker can perform human body detection based on the histogram of oriented gradients (HOG).
  • The HOG is a local descriptor similar to the scale-invariant feature transform: it builds the human body feature by computing histograms of gradient orientations over local regions.
  • Unlike the scale-invariant feature transform, which extracts features at keypoints and is therefore a sparse description method, the HOG is a dense description method.
  • The HOG description has the following advantages: it represents the structural features of edges (gradients), so it can describe local shape information; quantizing the position and orientation spaces suppresses, to some extent, the effects of translation and rotation; and normalization over local regions partially cancels the effect of illumination. For these reasons, embodiments of the present invention preferably perform human body detection based on the HOG.
  • Optionally, the step of the smart speaker performing human body detection based on the HOG includes the following (a numpy sketch follows this list):
  • A first-order gradient calculation is first performed on the image within the detection window. Specifically, a detection window of normalized size (e.g., 64x128) is taken as input, and the first-order (one-dimensional) Sobel operator [-1, 0, 1] computes the horizontal and vertical gradients of the image within the window.
  • The advantage of using a single window as classifier input is that the classifier is invariant to the position and scale of the target.
  • For an input image, the detection window must be moved horizontally and vertically, and the image must be scaled at multiple scales so that human bodies of different sizes can be detected.
  • Next, the histogram of oriented gradients of each cell in the image is calculated. The HOG is obtained by dense computation within grids called cells and blocks.
  • The image is divided into cells, each consisting of multiple pixels, while a block is composed of several adjacent cells.
  • The gradient of every pixel is calculated first, and then the histogram of the gradient orientations of all pixels in each cell, i.e. the cell's HOG, is accumulated.
  • To accumulate a cell's HOG, the orientation range [0, π] is first divided into multiple bins, and each pixel in the cell casts a weighted vote according to its gradient orientation, yielding the cell's histogram.
  • The weight of each pixel is preferably its gradient magnitude.
  • To reduce aliasing, trilinear interpolation is preferably used in the weighted voting.
  • All cells within each block are then normalized to obtain the block's HOG.
  • Normalizing the cell histograms within a block removes the influence of illumination, yielding the block's HOG.
  • Every block in the image is traversed to obtain the HOG of each block.
  • The detection-window HOG obtained from the normalized blocks constitutes the human body feature vector, realizing human body detection.
  • Because the HOG is computed densely, its cost is high. To reduce computation and increase detection speed, one can compute the HOG only in key regions with relatively clear human contours, thereby reducing the dimensionality.
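  • The numpy sketch below walks through the pipeline just described: [-1, 0, 1] gradients, per-cell orientation histograms over [0, π), and block normalization. The 8x8-pixel cells, 2x2-cell blocks, 9 bins, and simple (non-trilinear) binning are conventional choices assumed for illustration, not values mandated by the patent.

```python
# Minimal HOG sketch following the steps above: [-1,0,1] gradients, 9-bin
# cell histograms over [0, pi), and L2 block normalization.
import numpy as np

def hog_descriptor(window, cell=8, bins=9):
    window = window.astype(np.float64)
    gx = np.zeros_like(window); gy = np.zeros_like(window)
    gx[:, 1:-1] = window[:, 2:] - window[:, :-2]   # horizontal [-1,0,1]
    gy[1:-1, :] = window[2:, :] - window[:-2, :]   # vertical [-1,0,1]
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi               # unsigned orientation
    ch, cw = window.shape[0] // cell, window.shape[1] // cell
    hists = np.zeros((ch, cw, bins))
    for i in range(ch):                            # per-cell histograms
        for j in range(cw):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            idx = np.minimum((a / np.pi * bins).astype(int), bins - 1)
            np.add.at(hists[i, j], idx, m)         # magnitude-weighted vote
    feat = []
    for i in range(ch - 1):                        # 2x2-cell blocks, stride 1
        for j in range(cw - 1):
            block = hists[i:i+2, j:j+2].ravel()
            feat.append(block / (np.linalg.norm(block) + 1e-6))
    return np.concatenate(feat)                    # human body feature vector

# e.g. hog_descriptor(np.random.rand(128, 64)).shape -> (3780,)
```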
  • The present invention provides a playback control method comprising: the smart speaker performs human body detection; when a human body is detected, a gesture action of the human body is recognized; and the playing state of the smart speaker is adjusted according to the gesture action.
  • The method adds an interaction mode to the smart speaker, allowing the user to control it through gestures and improving the user experience.
  • Referring to FIG. 2, an embodiment of the present invention further provides a smart speaker, including:
  • a detection module 10, configured to perform human body detection;
  • an identification module 20, configured to recognize a gesture action of the human body when a human body is detected;
  • an adjustment module 30, configured to adjust the playing state of the smart speaker according to the gesture action.
  • In this embodiment, a depth sensor is mounted on the smart speaker.
  • Depth sensors fall into two types: passive stereo cameras and active depth cameras.
  • A passive stereo camera uses two or more cameras to observe the scene and estimates scene depth from the disparity (shift) between features in the multiple views.
  • An active depth camera projects invisible infrared light into the scene and estimates scene depth from the reflected signal.
  • In one application scenario, user A stands at some distance from the smart speaker and makes gesture commands toward its depth sensor, such as a start-playback command; once the smart speaker recognizes the meaning of the gesture command, it plays sound.
  • In the detection module 10, the smart speaker performs human body detection through the depth sensor. Detection can be based on image features such as the histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT), local binary patterns (LBP), or Haar-like features.
  • In the identification module 20, when the smart speaker detects a human body, the gesture action of that body is recognized.
  • Specifically, the depth sensor captures a segment of video data containing a gesture; here the depth sensor acts as a video recorder.
  • Video data can be selected according to preset rules. For example, when the depth sensor observes a large gesture movement by the user, that segment is determined to be video data containing a gesture.
  • The video data is parsed into consecutive frames, the background in each frame is separated from the gesture, and the gesture contour in each frame is found.
  • The start frame and end frame of the gesture action are determined according to preset rules.
  • The gesture contours between the start frame and the end frame are determined as the gesture action; that is, a gesture action consists of the gesture contours of multiple frames.
  • In the adjustment module 30, after the gesture action is obtained, features are extracted from it to obtain gesture action features; the features are recognized to obtain a recognition result; and finally a control instruction is generated from the recognition result.
  • The smart speaker adjusts its playing state according to the control instruction: if the obtained instruction is a start-playback instruction, the smart speaker starts playing sound; if it is a stop-playback instruction, it stops.
  • Optionally, the identification module 20 includes:
  • a separation unit configured to separate the gesture in each frame of the detected gesture images from the background and find the gesture contour in each frame;
  • a start gesture unit configured to match the gesture contours frame by frame against a preset start gesture contour and determine the first matched contour as the start gesture contour;
  • an end gesture unit configured to match the contours that follow the start gesture contour in time frame by frame against a preset end gesture contour and determine the first matched contour as the end gesture contour;
  • a gesture action unit configured to determine the gesture action that starts with the start gesture contour and ends with the end gesture contour as the recognized set of gesture actions.
  • In this embodiment, the smart speaker stores preset start gesture contours and preset end gesture contours corresponding to different control instructions.
  • The gesture contours of the video data are first matched frame by frame against the preset start gesture contour, and the first matching contour is determined as the start gesture contour.
  • The contours in the frames that follow are then matched frame by frame against the preset end gesture contour, and the first matching contour is determined as the end gesture contour.
  • The contour sequence that starts with the start gesture contour and ends with the end gesture contour is determined as the gesture action.
  • The obtained gesture action can then be used to identify the meaning of the gesture and generate the corresponding control instruction.
  • Optionally, the adjustment module 30 includes:
  • an instruction determining unit configured to determine the control instruction corresponding to the gesture action;
  • an adjusting unit configured to adjust the playing state of the smart speaker according to the control instruction.
  • In this embodiment, a storage chip on the smart speaker pre-stores control instructions corresponding to several different gesture actions. For example, it may be specified that:
  • the gesture "upward swing" corresponds to the "raise volume" instruction;
  • the gesture "downward swing" corresponds to the "lower volume" instruction;
  • the gesture "hand wave" corresponds to the "stop playback" instruction;
  • the gesture "clap with both hands" corresponds to the "start playback" instruction.
  • When the smart speaker determines that the user's gesture corresponds to the start-playback instruction, it plays according to that instruction; the content played may be music or news.
  • Likewise, when the smart speaker determines that the user's gesture corresponds to the stop-playback instruction, it stops playing sound, sparing the user interference from the content that was playing.
  • Optionally, the instruction determining unit includes:
  • a feature obtaining subunit configured to perform feature extraction on the gesture action to obtain gesture action features;
  • an encoding subunit configured to encode the gesture action features to obtain an encoding result;
  • an instruction determining subunit configured to determine the control instruction corresponding to the encoding result.
  • The gesture action features are the sequence of contour features of each frame.
  • To obtain them, the feature values of every contour in every frame must be calculated: for the extracted gesture contours, the contour feature values of each contour are computed, including its region histogram, moments, and earth mover's distance.
  • The extracted gesture action features are then encoded using eight reference direction vectors (the eight directions that divide 360 degrees equally), and the encoding result is computed.
  • The DTW algorithm can be used to score the encoding result against the stored templates.
  • Each gesture already stored in the template library serves as a sample template, represented as {T(1), T(2), ..., T(m), ..., T(M)}; the input gesture to be recognized is the test template, represented as {S(1), S(2), ..., S(n), ..., S(N)}.
  • DTW searches for a path through the grid of frame pairs (n, m). To keep the path from becoming too skewed, its slope can be constrained to the range 0-2: if the path passes through grid point (n_i, m_i), the previous node can only be one of (n_i - 1, m_i), (n_i - 1, m_i - 1), or (n_i - 1, m_i - 2).
  • The accumulated distance along the path is D[(n_i, m_i)] = d[S(n_i), T(m_i)] + min{ D[(n_i - 1, m_i)], D[(n_i - 1, m_i - 1)], D[(n_i - 1, m_i - 2)] }.
  • Finally, the control instruction corresponding to the encoding result is determined: the obtained encoding result is compared with preset encoded data, and the control instruction corresponding to the closest preset encoded data is output.
  • To reduce the false detection rate, a similarity threshold can also be set: if the match between the obtained encoding result and the preset encoded data is too weak, no control instruction is output.
  • Optionally, the smart speaker also includes:
  • a distance calculation module configured to calculate the physical distance between the smart speaker and the human body;
  • a volume adjustment module configured to adjust the volume of the smart speaker according to the physical distance.
  • The distance between the smart speaker and the user can be computed directly by the active depth camera, and the volume is then adjusted according to that distance. For example, when the user is 5 meters from the smart speaker, the perceived volume is 50 decibels; when the user is 10 meters away, the speaker's volume must be raised for the user to still hear 50 decibels. Since distance and volume follow a certain correspondence indoors, the speaker's volume can be adjusted according to this correspondence so that the user hears the same volume at different locations.
  • The preset value here can be the volume heard by the user at 5 meters, or the volume at a physical distance preset by the manufacturer.
  • Optionally, the detection module 10 includes:
  • a gradient detection unit configured to perform human body detection based on the histogram of oriented gradients.
  • In this embodiment, the smart speaker can perform human body detection based on the histogram of oriented gradients (HOG).
  • The HOG is a local descriptor similar to the scale-invariant feature transform: it builds the human body feature by computing histograms of gradient orientations over local regions.
  • Unlike the scale-invariant feature transform, which extracts features at keypoints and is therefore a sparse description method, the HOG is a dense description method.
  • The HOG description has the following advantages: it represents the structural features of edges (gradients), so it can describe local shape information; quantizing the position and orientation spaces suppresses, to some extent, the effects of translation and rotation; and normalization over local regions partially cancels the effect of illumination. For these reasons, embodiments of the present invention preferably perform human body detection based on the HOG.
  • Optionally, the gradient detection unit includes:
  • a first-order gradient calculation subunit configured to perform a first-order gradient calculation on the image within the detection window;
  • a cell gradient subunit configured to calculate the histogram of oriented gradients of each cell in the image;
  • a block gradient subunit configured to normalize all cells within each block of the image to obtain the histogram of oriented gradients of the block;
  • a feature vector generation subunit configured to normalize all blocks within the image to obtain the histogram of oriented gradients of the detection window, which is used as the human body feature vector.
  • A first-order gradient calculation is first performed on the image within the detection window. Specifically, a detection window of normalized size (e.g., 64x128) is taken as input, and the first-order (one-dimensional) Sobel operator [-1, 0, 1] computes the horizontal and vertical gradients of the image within the window.
  • The advantage of using a single window as classifier input is that the classifier is invariant to the position and scale of the target.
  • For an input image, the detection window must be moved horizontally and vertically, and the image must be scaled at multiple scales so that human bodies of different sizes can be detected.
  • Next, the histogram of oriented gradients of each cell in the image is calculated: the HOG is obtained by dense computation within grids called cells and blocks.
  • The image is divided into cells, each consisting of multiple pixels, while a block is composed of several adjacent cells.
  • The gradient of every pixel is calculated first, and then the histogram of the gradient orientations of all pixels in each cell, i.e. the cell's HOG, is accumulated.
  • To accumulate a cell's HOG, the orientation range [0, π] is first divided into multiple bins, and each pixel in the cell casts a weighted vote according to its gradient orientation, yielding the cell's histogram.
  • The weight of each pixel is preferably its gradient magnitude, and trilinear interpolation is preferably used in the weighted voting to reduce aliasing.
  • All cells within each block are then normalized: normalizing the cell histograms within a block removes the influence of illumination, yielding the block's HOG.
  • Every block in the image is traversed to obtain the HOG of each block.
  • The detection-window HOG obtained from the normalized blocks constitutes the human body feature vector, realizing human body detection.
  • Because the HOG is computed densely, its cost is high. To reduce computation and increase detection speed, one can compute the HOG only in key regions with relatively clear human contours, thereby reducing the dimensionality.
  • The invention provides a smart speaker that performs human body detection; when a human body is detected, the gesture action of the human body is recognized, and the playing state of the smart speaker is adjusted according to the gesture action.
  • The smart speaker provided by the invention adds an interaction mode, allowing the user to control the speaker through gestures and improving the user experience.
  • The present invention also provides a smart speaker comprising a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform the playback control method described above.
  • The processor included in the smart speaker further has the following functions: performing human body detection; recognizing a gesture action of the human body when a human body is detected; and adjusting the playing state of the smart speaker according to the gesture action.

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention provides a smart speaker and a playback control method. The playback control method comprises: the smart speaker performs human body detection; when a human body is detected, a gesture action of the human body is recognized; and the playing state of the smart speaker is adjusted according to the gesture action. The method provided by the present invention adds an interaction mode to the smart speaker, allowing the user to control the smart speaker through gestures and improving the user experience.

Description

Smart speaker and playback control method

Technical field

The present invention relates to the field of smart speakers, and in particular to a smart speaker and a playback control method.

Background art

A smart speaker is an upgraded loudspeaker: a tool that lets home users access the Internet by voice, for example to request songs, shop online, or check the weather forecast. It can also control smart home devices, such as opening curtains, setting the refrigerator temperature, or heating the water heater in advance.

Smart speakers represented by the Amazon Echo all rely on intelligent voice technology, so their operation is controlled by voice commands. However, typical home environments contain substantial background noise, which interferes with the correct recognition of voice commands and degrades the user experience. Additional interaction methods are therefore needed to make it easier for users to interact with smart speakers and to improve the user experience.
Technical problem

The main object of the present invention is to provide a smart speaker and a playback control method that enhance the user experience of using a smart speaker.

Technical solution

The present invention provides a playback control method comprising the following steps:

the smart speaker performs human body detection;

when a human body is detected, recognizing a gesture action of the human body;

adjusting the playing state of the smart speaker according to the gesture action.
Preferably, the step of recognizing the gesture action of the human body comprises:

separating the gesture in each frame of the detected gesture images of the human body from the background, and finding the gesture contour in each frame of gesture image;

matching the gesture contours frame by frame against a preset start gesture contour, and determining the first matched gesture contour as the start gesture contour;

matching the gesture contours that follow the start gesture contour in time frame by frame against a preset end gesture contour, and determining the first matched gesture contour as the end gesture contour;

determining the gesture action that starts with the start gesture contour and ends with the end gesture contour as the recognized set of gesture actions.

Preferably, the step of adjusting the playing state of the smart speaker according to the gesture action comprises:

determining the control instruction corresponding to the gesture action;

adjusting the playing state of the smart speaker according to the control instruction.

Preferably, the step of determining the control instruction corresponding to the gesture action comprises:

performing feature extraction on the gesture action to obtain gesture action features;

encoding the gesture action features to obtain an encoding result;

determining the control instruction corresponding to the encoding result.

Preferably, the method further comprises:

calculating the physical distance between the smart speaker and the human body;

adjusting the volume of the smart speaker according to the physical distance.

Preferably, the step of the smart speaker performing human body detection comprises:

the smart speaker performs human body detection based on the histogram of oriented gradients.

Preferably, the step of the smart speaker performing human body detection based on the histogram of oriented gradients comprises:

performing a first-order gradient calculation on the image within the detection window;

calculating the histogram of oriented gradients of each cell in the image;

normalizing all cells within each block of the image to obtain the histogram of oriented gradients of the block;

normalizing all blocks within the image to obtain the histogram of oriented gradients of the detection window, and using the histogram of oriented gradients of the detection window as the human body feature vector.
In another aspect, the present invention further provides a smart speaker, comprising:

a detection module, configured to perform human body detection;

an identification module, configured to recognize a gesture action of the human body when a human body is detected;

an adjustment module, configured to adjust the playing state of the smart speaker according to the gesture action.

Preferably, the identification module comprises:

a separation unit, configured to separate the gesture in each frame of the detected gesture images of the human body from the background and find the gesture contour in each frame of gesture image;

a start gesture unit, configured to match the gesture contours frame by frame against a preset start gesture contour and determine the first matched gesture contour as the start gesture contour;

an end gesture unit, configured to match the gesture contours that follow the start gesture contour in time frame by frame against a preset end gesture contour and determine the first matched gesture contour as the end gesture contour;

a gesture action unit, configured to determine the gesture action that starts with the start gesture contour and ends with the end gesture contour as the recognized set of gesture actions.

Preferably, the adjustment module comprises:

an instruction determining unit, configured to determine the control instruction corresponding to the gesture action;

an adjusting unit, configured to adjust the playing state of the smart speaker according to the control instruction.

Preferably, the instruction determining unit comprises:

a feature obtaining subunit, configured to perform feature extraction on the gesture action to obtain gesture action features;

an encoding subunit, configured to encode the gesture action features to obtain an encoding result;

an instruction determining subunit, configured to determine the control instruction corresponding to the encoding result.

Preferably, the smart speaker further comprises:

a distance calculation module, configured to calculate the physical distance between the smart speaker and the human body;

a volume adjustment module, configured to adjust the volume of the smart speaker according to the physical distance.

Preferably, the detection module comprises:

a gradient detection unit, configured to perform human body detection based on the histogram of oriented gradients.

Preferably, the gradient detection unit comprises:

a first-order gradient calculation subunit, configured to perform a first-order gradient calculation on the image within the detection window;

a cell gradient subunit, configured to calculate the histogram of oriented gradients of each cell in the image;

a block gradient subunit, configured to normalize all cells within each block of the image to obtain the histogram of oriented gradients of the block;

a feature vector generation subunit, configured to normalize all blocks within the image to obtain the histogram of oriented gradients of the detection window and use it as the human body feature vector.

The present invention further provides a smart speaker comprising a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform the playback control method described above.
Beneficial effects

The present invention provides a smart speaker and a playback control method. The playback control method comprises: the smart speaker performs human body detection; when a human body is detected, a gesture action of the human body is recognized; and the playing state of the smart speaker is adjusted according to the gesture action. The method provided by the present invention adds an interaction mode to the smart speaker, allowing the user to control the smart speaker through gestures and improving the user experience.

Description of drawings

FIG. 1 is a schematic flowchart of an embodiment of the playback control method of the present invention;

FIG. 2 is a schematic structural diagram of an embodiment of the smart speaker of the present invention.

The realization of the objects, the functional features, and the advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.

Best mode for carrying out the invention

It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Referring to FIG. 1, an embodiment of the present invention provides a playback control method comprising the following steps:

S10: the smart speaker performs human body detection;

S20: when a human body is detected, recognizing a gesture action of the human body;

S30: adjusting the playing state of the smart speaker according to the gesture action.
In this embodiment, a depth sensor is mounted on the smart speaker. Depth sensors fall into two types: passive stereo cameras and active depth cameras. A passive stereo camera uses two or more cameras to observe the scene and estimates scene depth from the disparity (shift) between features in the multiple views. An active depth camera projects invisible infrared light into the scene and estimates scene depth from the reflected signal. In one application scenario, user A stands at some distance from the smart speaker and makes gesture commands toward its depth sensor, such as a start-playback command; once the smart speaker recognizes the meaning of user A's gesture command, it plays sound.

In step S10, the smart speaker performs human body detection through the depth sensor. Detection can be based on image features such as the histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT), local binary patterns (LBP), or Haar-like features.

In step S20, when the smart speaker detects a human body, it recognizes the gesture action of that body. Specifically, the depth sensor captures a segment of video data containing a gesture; here the depth sensor acts as a video recorder. Video data can be selected according to preset rules; for example, when the depth sensor observes a large gesture movement by the user, that segment is determined to be video data containing a gesture.

The video data is parsed into consecutive frames, the background in each frame is separated from the gesture, and the gesture contour in each frame is found. The start frame and end frame of the gesture action are determined according to preset rules, and the gesture contours between them are determined as the gesture action; that is, a gesture action consists of the gesture contours of multiple frames.

In step S30, after the gesture action is obtained, features are extracted from it to obtain gesture action features; the features are recognized to obtain a recognition result; and finally a control instruction is generated from the recognition result.

The smart speaker adjusts its playing state according to the control instruction: if the obtained control instruction is a start-playback instruction, the smart speaker starts playing sound; if it is a stop-playback instruction, the smart speaker stops playing sound.
Optionally, step S20 comprises:

separating the gesture in each frame of the detected gesture images of the human body from the background, and finding the gesture contour in each frame of gesture image;

matching the gesture contours frame by frame against a preset start gesture contour, and determining the first matched gesture contour as the start gesture contour;

matching the gesture contours that follow the start gesture contour in time frame by frame against a preset end gesture contour, and determining the first matched gesture contour as the end gesture contour;

determining the gesture action that starts with the start gesture contour and ends with the end gesture contour as the recognized set of gesture actions.

In this embodiment, the smart speaker stores preset start gesture contours and preset end gesture contours corresponding to different control instructions. The gesture contours of the video data are first matched frame by frame against the preset start gesture contour, and the first matching contour is determined as the start gesture contour. The contours after that frame are then matched frame by frame against the preset end gesture contour, and the first matching contour is determined as the end gesture contour. The contour sequence that starts with the start gesture contour and ends with the end gesture contour is then determined as the gesture action. The obtained gesture action can be used to identify the meaning of the gesture and generate the corresponding control instruction.
Optionally, step S30 comprises:

determining the control instruction corresponding to the gesture action;

adjusting the playing state of the smart speaker according to the control instruction.

In this embodiment, a storage chip on the smart speaker pre-stores control instructions corresponding to several different gesture actions. For example, it may be specified that an upward swing corresponds to the "raise volume" instruction, a downward swing to the "lower volume" instruction, a hand wave to the "stop playback" instruction, and a clap with both hands to the "start playback" instruction. When the smart speaker determines that the gesture made by the user corresponds to the start-playback instruction, it plays according to that instruction; the content played may be music or news. Likewise, when the smart speaker determines that the user's gesture corresponds to the stop-playback instruction, it stops playing the sound content according to that instruction, sparing the user interference from the content that was playing.
Optionally, the step of determining the control instruction corresponding to the gesture action comprises:

performing feature extraction on the gesture action to obtain gesture action features;

encoding the gesture action features to obtain an encoding result;

determining the control instruction corresponding to the encoding result.

In this embodiment, the gesture action features are the sequence of contour features of each frame. To obtain them, the feature values of every contour in each frame must be calculated: for the extracted gesture contours, the contour feature values of each contour are computed, including its region histogram, moments, and earth mover's distance.

The extracted gesture action features are then encoded using eight reference direction vectors, and the encoding result is computed. The eight reference directions are the eight directions that divide 360 degrees equally.

The DTW algorithm can be used to compute the match for the encoding result. In the DTW algorithm, each gesture already stored in the template library serves as a sample template, represented as {T(1), T(2), ..., T(m), ..., T(M)}; the input gesture to be recognized is the test template, represented as {S(1), S(2), ..., S(n), ..., S(N)}. Marking the template frame numbers m = 1..M on the vertical axis and the test-template frame numbers n = 1..N on the horizontal axis, and drawing lines through these frame-number coordinates, forms a grid in which each crossing point (n, m) represents the meeting of a frame of the test template with a frame of the training pattern.

The DTW algorithm amounts to finding a path through certain grid points of this grid. To describe the path, suppose the grid points it passes through are, in order, (n_1, m_1), ..., (n_i, m_i), ..., (n_N, m_N), where (n_1, m_1) = (1, 1) and (n_N, m_N) = (N, M). The path can be described by a function m_i = f(n_i), where n_i = i, i = 1, 2, ..., N, f(1) = 1, f(N) = M. To keep the path from becoming too skewed, its slope can be constrained to the range 0-2: if the path passes through grid point (n_i, m_i), the previous node can only be one of (n_i - 1, m_i), (n_i - 1, m_i - 1), or (n_i - 1, m_i - 2). The accumulated distance along the path is D[(n_i, m_i)] = d[S(n_i), T(m_i)] + D[(n_i - 1, m_i - 1)], where (n_i - 1, m_i - 1) is determined by:

D[(n_i - 1, m_i - 1)] = min{ D[(n_i - 1, m_i)], D[(n_i - 1, m_i - 1)], D[(n_i - 1, m_i - 2)] }.

Finally, the control instruction corresponding to the encoding result is determined: the obtained encoding result is compared with the preset encoded data, and the control instruction corresponding to the closest preset encoded data is output. To reduce the false detection rate, a similarity threshold can also be set: if the match between the obtained encoding result and the preset encoded data is too weak, no control instruction is output.
Optionally, the playback control method further comprises:

calculating the physical distance between the smart speaker and the human body;

adjusting the volume of the smart speaker according to the physical distance.

In this embodiment, the distance between the smart speaker and the user can be computed directly by the active depth camera, and the volume is then adjusted according to that distance so that the adjusted volume reaches a preset value. For example, when the user is 5 meters from the smart speaker, the volume heard is 50 decibels; when the user is 10 meters away, the speaker's volume must be raised for the user to still hear 50 decibels. Since distance and volume follow a certain correspondence indoors, the speaker's volume can be adjusted according to this correspondence so that the user hears the same volume at different locations. The preset value here can be the volume heard by the user at 5 meters, or the volume at a physical distance preset by the manufacturer.
Optionally, step S10 comprises:

the smart speaker performs human body detection based on the histogram of oriented gradients.

In this embodiment, the smart speaker can perform human body detection based on the histogram of oriented gradients (HOG).

The HOG is a local descriptor similar to the scale-invariant feature transform: it builds the human body feature by computing histograms of gradient orientations over local regions. Unlike the scale-invariant feature transform, which extracts features at keypoints and is therefore a sparse description method, the HOG is a dense description method.

The HOG description method has the following advantages: it represents the structural features of edges (gradients), so it can describe local shape information; quantizing the position and orientation spaces suppresses, to some extent, the effects of translation and rotation; and normalization over local regions partially cancels the effect of illumination. For these reasons, embodiments of the present invention preferably perform human body detection based on the HOG.
Optionally, the step of the smart speaker performing human body detection based on the HOG comprises:

performing a first-order gradient calculation on the image within the detection window;

calculating the histogram of oriented gradients of each cell in the image;

normalizing all cells within each block of the image to obtain the histogram of oriented gradients of the block;

normalizing all blocks within the image to obtain the histogram of oriented gradients of the detection window, and using the histogram of oriented gradients of the detection window as the human body feature vector.

In this embodiment, a first-order gradient calculation is first performed on the image within the detection window. Specifically, a detection window of normalized size (e.g., 64x128) is taken as input, and the first-order (one-dimensional) Sobel operator [-1, 0, 1] computes the horizontal and vertical gradients of the image within the window.

The advantage of using a single window as classifier input is that the classifier is invariant to the position and scale of the target. For an input image to be examined, the detection window must be moved horizontally and vertically, and the image must be scaled at multiple scales so that human bodies of different sizes can be detected.

Next, the histogram of oriented gradients of each cell in the image is calculated: the HOG is obtained by dense computation within grids called cells and blocks. The image is divided into cells, each consisting of multiple pixels, while a block is composed of several adjacent cells.

In this embodiment, the gradient of every pixel in the image is calculated first, and then the histogram of the gradient orientations of all pixels in each cell, i.e. the cell's HOG, is accumulated. When accumulating each cell's histogram, the range [0, π] is first divided into multiple bins for the cell, and a weighted vote is computed from the gradient orientation of each pixel in the cell to obtain the histogram of all pixels in that cell.

In the weighted vote, the weight of each pixel is preferably the gradient magnitude of that pixel. To reduce aliasing, trilinear interpolation is preferably used for the weighted voting.

Every cell in the image is traversed to obtain the HOG of each cell.

All cells within each block of the image are normalized to obtain the block's HOG: within each block, the HOGs of its cells are normalized to remove the influence of illumination, yielding the block's HOG. Every block in the image is traversed to obtain the HOG of each block.

All blocks within the image are normalized to obtain the HOG of the detection window, which is used as the human body feature vector: the detection-window HOG obtained from the normalized blocks constitutes the human body feature vector, realizing human body detection.

Because the HOG is a dense computation method, its cost is high. To reduce computation and increase detection speed, one can compute the HOG only in key regions with relatively clear human contours, thereby reducing the dimensionality.
The present invention provides a playback control method comprising: the smart speaker performs human body detection; when a human body is detected, a gesture action of the human body is recognized; and the playing state of the smart speaker is adjusted according to the gesture action. The method provided by the present invention adds an interaction mode to the smart speaker, allowing the user to control the smart speaker through gestures and improving the user experience.
Referring to FIG. 2, an embodiment of the present invention further provides a smart speaker, comprising:

a detection module 10, configured to perform human body detection;

an identification module 20, configured to recognize a gesture action of the human body when a human body is detected;

an adjustment module 30, configured to adjust the playing state of the smart speaker according to the gesture action.
In this embodiment, a depth sensor is mounted on the smart speaker. Depth sensors fall into two types: passive stereo cameras and active depth cameras. A passive stereo camera uses two or more cameras to observe the scene and estimates scene depth from the disparity (shift) between features in the multiple views. An active depth camera projects invisible infrared light into the scene and estimates scene depth from the reflected signal. In one application scenario, user A stands at some distance from the smart speaker and makes gesture commands toward its depth sensor, such as a start-playback command; once the smart speaker recognizes the meaning of user A's gesture command, it plays sound.

In the detection module 10, the smart speaker performs human body detection through the depth sensor. Detection can be based on image features such as the histogram of oriented gradients (HOG), the scale-invariant feature transform (SIFT), local binary patterns (LBP), or Haar-like features.

In the identification module 20, when the smart speaker detects a human body, it recognizes the gesture action of that body. Specifically, the depth sensor captures a segment of video data containing a gesture; here the depth sensor acts as a video recorder. Video data can be selected according to preset rules; for example, when the depth sensor observes a large gesture movement by the user, that segment is determined to be video data containing a gesture.

The video data is parsed into consecutive frames, the background in each frame is separated from the gesture, and the gesture contour in each frame is found. The start frame and end frame of the gesture action are determined according to preset rules, and the gesture contours between them are determined as the gesture action; that is, a gesture action consists of the gesture contours of multiple frames.

In the adjustment module 30, after the gesture action is obtained, features are extracted from it to obtain gesture action features; the features are recognized to obtain a recognition result; and finally a control instruction is generated from the recognition result.

The smart speaker adjusts its playing state according to the control instruction: if the obtained control instruction is a start-playback instruction, the smart speaker starts playing sound; if it is a stop-playback instruction, the smart speaker stops playing sound.
Optionally, the identification module 20 comprises:

a separation unit, configured to separate the gesture in each frame of the detected gesture images of the human body from the background and find the gesture contour in each frame of gesture image;

a start gesture unit, configured to match the gesture contours frame by frame against a preset start gesture contour and determine the first matched gesture contour as the start gesture contour;

an end gesture unit, configured to match the gesture contours that follow the start gesture contour in time frame by frame against a preset end gesture contour and determine the first matched gesture contour as the end gesture contour;

a gesture action unit, configured to determine the gesture action that starts with the start gesture contour and ends with the end gesture contour as the recognized set of gesture actions.

In this embodiment, the smart speaker stores preset start gesture contours and preset end gesture contours corresponding to different control instructions. The gesture contours of the video data are first matched frame by frame against the preset start gesture contour, and the first matching contour is determined as the start gesture contour. The contours after that frame are then matched frame by frame against the preset end gesture contour, and the first matching contour is determined as the end gesture contour. The contour sequence that starts with the start gesture contour and ends with the end gesture contour is then determined as the gesture action. The obtained gesture action can be used to identify the meaning of the gesture and generate the corresponding control instruction.
Optionally, the adjustment module 30 comprises:

an instruction determining unit, configured to determine the control instruction corresponding to the gesture action;

an adjusting unit, configured to adjust the playing state of the smart speaker according to the control instruction.

In this embodiment, a storage chip on the smart speaker pre-stores control instructions corresponding to several different gesture actions. For example, it may be specified that an upward swing corresponds to the "raise volume" instruction, a downward swing to the "lower volume" instruction, a hand wave to the "stop playback" instruction, and a clap with both hands to the "start playback" instruction. When the smart speaker determines that the gesture made by the user corresponds to the start-playback instruction, it plays according to that instruction; the content played may be music or news. Likewise, when the smart speaker determines that the user's gesture corresponds to the stop-playback instruction, it stops playing the sound content according to that instruction, sparing the user interference from the content that was playing.
Optionally, the instruction determining unit comprises:

a feature obtaining subunit, configured to perform feature extraction on the gesture action to obtain gesture action features;

an encoding subunit, configured to encode the gesture action features to obtain an encoding result;

an instruction determining subunit, configured to determine the control instruction corresponding to the encoding result.

In this embodiment, the gesture action features are the sequence of contour features of each frame. To obtain them, the feature values of every contour in each frame must be calculated: for the extracted gesture contours, the contour feature values of each contour are computed, including its region histogram, moments, and earth mover's distance.

The extracted gesture action features are then encoded using eight reference direction vectors, and the encoding result is computed. The eight reference directions are the eight directions that divide 360 degrees equally.

The DTW algorithm can be used to compute the match for the encoding result. In the DTW algorithm, each gesture already stored in the template library serves as a sample template, represented as {T(1), T(2), ..., T(m), ..., T(M)}; the input gesture to be recognized is the test template, represented as {S(1), S(2), ..., S(n), ..., S(N)}. Marking the template frame numbers m = 1..M on the vertical axis and the test-template frame numbers n = 1..N on the horizontal axis, and drawing lines through these frame-number coordinates, forms a grid in which each crossing point (n, m) represents the meeting of a frame of the test template with a frame of the training pattern.

The DTW algorithm amounts to finding a path through certain grid points of this grid. To describe the path, suppose the grid points it passes through are, in order, (n_1, m_1), ..., (n_i, m_i), ..., (n_N, m_N), where (n_1, m_1) = (1, 1) and (n_N, m_N) = (N, M). The path can be described by a function m_i = f(n_i), where n_i = i, i = 1, 2, ..., N, f(1) = 1, f(N) = M. To keep the path from becoming too skewed, its slope can be constrained to the range 0-2: if the path passes through grid point (n_i, m_i), the previous node can only be one of (n_i - 1, m_i), (n_i - 1, m_i - 1), or (n_i - 1, m_i - 2). The accumulated distance along the path is D[(n_i, m_i)] = d[S(n_i), T(m_i)] + D[(n_i - 1, m_i - 1)], where (n_i - 1, m_i - 1) is determined by:

D[(n_i - 1, m_i - 1)] = min{ D[(n_i - 1, m_i)], D[(n_i - 1, m_i - 1)], D[(n_i - 1, m_i - 2)] }.

Finally, the control instruction corresponding to the encoding result is determined: the obtained encoding result is compared with the preset encoded data, and the control instruction corresponding to the closest preset encoded data is output. To reduce the false detection rate, a similarity threshold can also be set: if the match between the obtained encoding result and the preset encoded data is too weak, no control instruction is output.
Optionally, the smart speaker further comprises:

a distance calculation module, configured to calculate the physical distance between the smart speaker and the human body;

a volume adjustment module, configured to adjust the volume of the smart speaker according to the physical distance.

In this embodiment, the distance between the smart speaker and the user can be computed directly by the active depth camera, and the volume is then adjusted according to that distance. For example, when the user is 5 meters from the smart speaker, the volume heard is 50 decibels; when the user is 10 meters away, the speaker's volume must be raised for the user to still hear 50 decibels. Since distance and volume follow a certain correspondence indoors, the speaker's volume can be adjusted according to this correspondence so that the user hears the same volume at different locations. The preset value here can be the volume heard by the user at 5 meters, or the volume at a physical distance preset by the manufacturer.
Optionally, the detection module 10 comprises:

a gradient detection unit, configured to perform human body detection based on the histogram of oriented gradients.

In this embodiment, the smart speaker can perform human body detection based on the histogram of oriented gradients (HOG).

The HOG is a local descriptor similar to the scale-invariant feature transform: it builds the human body feature by computing histograms of gradient orientations over local regions. Unlike the scale-invariant feature transform, which extracts features at keypoints and is therefore a sparse description method, the HOG is a dense description method.

The HOG description method has the following advantages: it represents the structural features of edges (gradients), so it can describe local shape information; quantizing the position and orientation spaces suppresses, to some extent, the effects of translation and rotation; and normalization over local regions partially cancels the effect of illumination. For these reasons, embodiments of the present invention preferably perform human body detection based on the HOG.
Optionally, the gradient detection unit comprises:

a first-order gradient calculation subunit, configured to perform a first-order gradient calculation on the image within the detection window;

a cell gradient subunit, configured to calculate the histogram of oriented gradients of each cell in the image;

a block gradient subunit, configured to normalize all cells within each block of the image to obtain the histogram of oriented gradients of the block;

a feature vector generation subunit, configured to normalize all blocks within the image to obtain the histogram of oriented gradients of the detection window and use it as the human body feature vector.

In this embodiment, a first-order gradient calculation is first performed on the image within the detection window. Specifically, a detection window of normalized size (e.g., 64x128) is taken as input, and the first-order (one-dimensional) Sobel operator [-1, 0, 1] computes the horizontal and vertical gradients of the image within the window.

The advantage of using a single window as classifier input is that the classifier is invariant to the position and scale of the target. For an input image to be examined, the detection window must be moved horizontally and vertically, and the image must be scaled at multiple scales so that human bodies of different sizes can be detected.

Next, the histogram of oriented gradients of each cell in the image is calculated: the HOG is obtained by dense computation within grids called cells and blocks. The image is divided into cells, each consisting of multiple pixels, while a block is composed of several adjacent cells.

In this embodiment, the gradient of every pixel in the image is calculated first, and then the histogram of the gradient orientations of all pixels in each cell, i.e. the cell's HOG, is accumulated. When accumulating each cell's histogram, the range [0, π] is first divided into multiple bins for the cell, and a weighted vote is computed from the gradient orientation of each pixel in the cell to obtain the histogram of all pixels in that cell.

In the weighted vote, the weight of each pixel is preferably the gradient magnitude of that pixel. To reduce aliasing, trilinear interpolation is preferably used for the weighted voting.

Every cell in the image is traversed to obtain the HOG of each cell.

All cells within each block of the image are normalized to obtain the block's HOG: within each block, the HOGs of its cells are normalized to remove the influence of illumination, yielding the block's HOG. Every block in the image is traversed to obtain the HOG of each block.

All blocks within the image are normalized to obtain the HOG of the detection window, which is used as the human body feature vector: the detection-window HOG obtained from the normalized blocks constitutes the human body feature vector, realizing human body detection.

Because the HOG is a dense computation method, its cost is high. To reduce computation and increase detection speed, one can compute the HOG only in key regions with relatively clear human contours, thereby reducing the dimensionality.
The present invention provides a smart speaker that performs human body detection; when a human body is detected, a gesture action of the human body is recognized, and the playing state of the smart speaker is adjusted according to the gesture action. The smart speaker provided by the present invention adds an interaction mode, allowing the user to control the smart speaker through gestures and improving the user experience.

The present invention further provides a smart speaker comprising a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, the application being configured to perform the playback control method described above.

In this embodiment of the present invention, the processor included in the smart speaker further has the following functions:

performing human body detection;

when a human body is detected, recognizing a gesture action of the human body;

adjusting the playing state of the smart speaker according to the gesture action.

The above are merely embodiments of the present invention and are not intended to limit it; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims (15)

  1. A playback control method, characterized by comprising the following steps:
    the smart speaker performs human body detection;
    when a human body is detected, recognizing a gesture action of the human body;
    adjusting the playing state of the smart speaker according to the gesture action.
  2. The playback control method according to claim 1, characterized in that the step of recognizing the gesture action of the human body comprises:
    separating the gesture in each frame of the detected gesture images of the human body from the background, and finding the gesture contour in each frame of gesture image;
    matching the gesture contours frame by frame against a preset start gesture contour, and determining the first matched gesture contour as the start gesture contour;
    matching the gesture contours that follow the start gesture contour in time frame by frame against a preset end gesture contour, and determining the first matched gesture contour as the end gesture contour;
    determining the gesture trajectory that starts with the start gesture contour and ends with the end gesture contour as a recognized set of gesture actions.
  3. The playback control method according to claim 1, characterized in that the step of adjusting the playing state of the smart speaker according to the gesture action comprises:
    determining the control instruction corresponding to the gesture action;
    adjusting the playing state of the smart speaker according to the control instruction.
  4. The playback control method according to claim 3, characterized in that the step of determining the control instruction corresponding to the gesture action comprises:
    performing feature extraction on the gesture action to obtain gesture action features;
    encoding the gesture action features to obtain an encoding result;
    determining the control instruction corresponding to the encoding result.
  5. The playback control method according to claim 4, characterized in that the method further comprises:
    calculating the physical distance between the smart speaker and the human body;
    adjusting the volume of the smart speaker according to the physical distance.
  6. The playback control method according to claim 1, characterized in that the step of the smart speaker performing human body detection comprises:
    the smart speaker performs human body detection based on the histogram of oriented gradients.
  7. The playback control method according to claim 6, characterized in that the step of the smart speaker performing human body detection based on the histogram of oriented gradients comprises:
    performing a first-order gradient calculation on the image within the detection window;
    calculating the histogram of oriented gradients of each cell in the image;
    normalizing all cells within each block of the image to obtain the histogram of oriented gradients of the block;
    normalizing all blocks within the image to obtain the histogram of oriented gradients of the detection window, and using the histogram of oriented gradients of the detection window as the human body feature vector.
  8. A smart speaker, characterized by comprising:
    a detection module, configured to perform human body detection;
    an identification module, configured to recognize a gesture action of the human body when a human body is detected;
    an adjustment module, configured to adjust the playing state of the smart speaker according to the gesture action.
  9. The smart speaker according to claim 8, characterized in that the identification module comprises:
    a separation unit, configured to separate the gesture in each frame of the detected gesture images of the human body from the background and find the gesture contour in each frame of gesture image;
    a start gesture unit, configured to match the gesture contours frame by frame against a preset start gesture contour and determine the first matched gesture contour as the start gesture contour;
    an end gesture unit, configured to match the gesture contours that follow the start gesture contour in time frame by frame against a preset end gesture contour and determine the first matched gesture contour as the end gesture contour;
    a gesture action unit, configured to determine the gesture action that starts with the start gesture contour and ends with the end gesture contour as a recognized set of gesture actions.
  10. The smart speaker according to claim 8, characterized in that the adjustment module comprises:
    an instruction determining unit, configured to determine the control instruction corresponding to the gesture action;
    an adjusting unit, configured to adjust the playing state of the smart speaker according to the control instruction.
  11. The smart speaker according to claim 10, characterized in that the instruction determining unit comprises:
    a feature obtaining subunit, configured to perform feature extraction on the gesture action to obtain gesture action features;
    an encoding subunit, configured to encode the gesture action features to obtain an encoding result;
    an instruction determining subunit, configured to determine the control instruction corresponding to the encoding result.
  12. The smart speaker according to claim 11, characterized by further comprising:
    a distance calculation module, configured to calculate the physical distance between the smart speaker and the human body;
    a volume adjustment module, configured to adjust the volume of the smart speaker according to the physical distance.
  13. The smart speaker according to claim 8, characterized in that the detection module comprises:
    a gradient detection unit, configured to perform human body detection based on the histogram of oriented gradients.
  14. The smart speaker according to claim 13, characterized in that the gradient detection unit comprises:
    a first-order gradient calculation subunit, configured to perform a first-order gradient calculation on the image within the detection window;
    a cell gradient subunit, configured to calculate the histogram of oriented gradients of each cell in the image;
    a block gradient subunit, configured to normalize all cells within each block of the image to obtain the histogram of oriented gradients of the block;
    a feature vector generation subunit, configured to normalize all blocks within the image to obtain the histogram of oriented gradients of the detection window and use it as the human body feature vector.
  15. A smart speaker comprising a memory, a processor, and at least one application stored in the memory and configured to be executed by the processor, characterized in that the application is configured to perform the playback control method according to claim 1.
PCT/CN2018/077458 2018-02-11 2018-02-27 Smart speaker and playback control method WO2019153382A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810142948.9 2018-02-11
CN201810142948.9A CN108064006A (zh) 2018-02-11 2018-02-11 Smart speaker and playback control method

Publications (1)

Publication Number Publication Date
WO2019153382A1 (zh) 2019-08-15

Family

ID=62134459

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/077458 WO2019153382A1 (zh) 2018-02-11 2018-02-27 Smart speaker and playback control method

Country Status (2)

Country Link
CN (1) CN108064006A (zh)
WO (1) WO2019153382A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113659950A (zh) * 2021-08-13 2021-11-16 深圳市百匠科技有限公司 High-fidelity multi-purpose audio control method, system, apparatus, and storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111242149A (zh) * 2018-11-28 2020-06-05 珠海格力电器股份有限公司 Control method and apparatus for a smart home, storage medium, processor, and smart home
CN111182381B (zh) * 2019-10-10 2021-08-20 广东小天才科技有限公司 Camera control method for a smart speaker, smart speaker, and storage medium
CN112992796A (zh) * 2021-02-09 2021-06-18 深圳市众芯诺科技有限公司 Smart vision speaker chip
CN113311939A (zh) * 2021-04-01 2021-08-27 江苏理工学院 Smart speaker control system based on gesture recognition

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092332A (zh) * 2011-11-08 2013-05-08 苏州中茵泰格科技有限公司 Television digital image interaction method and system
CN103458288A (zh) * 2013-09-02 2013-12-18 湖南华凯创意展览服务有限公司 Gesture sensing method, gesture sensing apparatus, and audio/video playback system
CN105744434A (zh) * 2016-02-25 2016-07-06 深圳市广懋创新科技有限公司 Smart speaker control method and system based on gesture recognition
CN106358120A (zh) * 2016-09-23 2017-01-25 成都创慧科达科技有限公司 Audio playback apparatus with multiple adjustment methods

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101763515B (zh) * 2009-09-23 2012-03-21 中国科学院自动化研究所 Real-time gesture interaction method based on computer vision
CN103679154A (zh) * 2013-12-26 2014-03-26 中国科学院自动化研究所 Recognition method for three-dimensional gesture actions based on depth images

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103092332A (zh) * 2011-11-08 2013-05-08 苏州中茵泰格科技有限公司 Television digital image interaction method and system
CN103458288A (zh) * 2013-09-02 2013-12-18 湖南华凯创意展览服务有限公司 Gesture sensing method, gesture sensing apparatus, and audio/video playback system
CN105744434A (zh) * 2016-02-25 2016-07-06 深圳市广懋创新科技有限公司 Smart speaker control method and system based on gesture recognition
CN106358120A (zh) * 2016-09-23 2017-01-25 成都创慧科达科技有限公司 Audio playback apparatus with multiple adjustment methods

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113659950A (zh) * 2021-08-13 2021-11-16 深圳市百匠科技有限公司 High-fidelity multi-purpose audio control method, system, apparatus, and storage medium
CN113659950B (zh) 2024-03-22 High-fidelity multi-purpose audio control method, system, apparatus, and storage medium

Also Published As

Publication number Publication date
CN108064006A (zh) 2018-05-22

Similar Documents

Publication Publication Date Title
WO2019153382A1 (zh) Smart speaker and playback control method
US10621991B2 (en) Joint neural network for speaker recognition
JP7431291B2 (ja) Systems and methods for domain adaptation in neural networks using a domain classifier
US9899025B2 (en) Speech recognition system adaptation based on non-acoustic attributes and face selection based on mouth motion using pixel intensities
US9311534B2 (en) Method and apparatus for tracking object
JP7108144B2 (ja) Systems and methods for domain adaptation in neural networks using cross-domain batch normalization
US20230325663A1 (en) Systems and methods for domain adaptation in neural networks
US20170124400A1 (en) Automatic video summarization
TWI515605B (zh) Gesture recognition and control method and device
US20140062862A1 (en) Gesture recognition apparatus, control method thereof, display instrument, and computer readable medium
JP2018500645A (ja) Systems and methods for tracking an object
CN103353935A (zh) 3D dynamic gesture recognition method for a smart home system
CN104508597A (zh) Method and apparatus for controlling augmented reality
KR20160106691A (ko) System and method for controlling playback of media using gestures
WO2020124993A1 (zh) Liveness detection method and apparatus, electronic device, and storage medium
EP3757878A1 (en) Head pose estimation
CN114779922A (zh) Control method for a teaching device, control device, teaching system, and storage medium
TWI544367B (zh) Gesture recognition and control method and device
CN108961314B (zh) Moving image generation method and apparatus, electronic device, and computer-readable storage medium
WO2023193803A1 (zh) Volume control method and apparatus, storage medium, and electronic device
US20140301603A1 (en) System and method for computer vision control based on a combined shape
KR101514551B1 (ko) Multimodal user recognition robust to environmental changes
TWI777771B (zh) Mobile audio-video device and audio-video playback control method
TW202315414A (zh) Infrared remote-controlled audio-video device and infrared remote-controlled audio-video playback method
CN116386645A (zh) Method and apparatus for recognizing a speaking subject, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18904697

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18904697

Country of ref document: EP

Kind code of ref document: A1