WO2019071989A1 - Voice enhancement method and apparatus for a smart device, and smart device - Google Patents

Voice enhancement method and apparatus for a smart device, and smart device

Info

Publication number
WO2019071989A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
smart device
depth image
sound source
determining
Prior art date
Application number
PCT/CN2018/094658
Other languages
English (en)
French (fr)
Inventor
朱剑
张向东
于振宇
罗志平
严栋
Original Assignee
歌尔股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 歌尔股份有限公司
Priority to US16/475,013 (granted as US10984816B2)
Publication of WO2019071989A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/028 Voice signal separating using properties of sound source
    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01S RADIO DIRECTION-FINDING; RADIO NAVIGATION; DETERMINING DISTANCE OR VELOCITY BY USE OF RADIO WAVES; LOCATING OR PRESENCE-DETECTING BY USE OF THE REFLECTION OR RERADIATION OF RADIO WAVES; ANALOGOUS ARRANGEMENTS USING OTHER WAVES
    • G01S5/00 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations
    • G01S5/18 Position-fixing by co-ordinating two or more direction or position line determinations; Position-fixing by co-ordinating two or more distance determinations using ultrasonic, sonic, or infrasonic waves
    • G01S5/20 Position of source determined by a plurality of spaced direction-finders
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/40 Analysis of texture
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 Speaker identification or verification
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L21/0216 Noise filtering characterised by the method used for estimating noise
    • G10L2021/02161 Number of inputs available containing the signal or the noise to be suppressed
    • G10L2021/02166 Microphone arrays; Beamforming

Definitions

  • The present invention relates to the field of sound source localization technologies, and in particular to a voice enhancement method and apparatus for a smart device, and to a smart device.
  • In existing voice enhancement methods, when the user emits a voice signal, the user's sound source direction is determined by sound source localization; beamforming is then used to increase the intensity of the sound from the user's sound source direction within the collected voice signal, while signals from other directions in the collected voice signal are treated as noise and filtered out. The accuracy of the sound source direction determined by sound source localization is therefore crucial to the effect of speech enhancement: if the determined sound source direction is inaccurate, the user's actual voice signal is filtered out as noise, and the voice command cannot be obtained and recognized.
  • In actual use, when the user moves, the user's sound source direction changes accordingly. If the beamforming direction in the speech enhancement algorithm remains unchanged, the user's actual speech signal may also be filtered out as external noise, so that the voice command in the user's voice signal cannot be recognized. The user then needs to re-enter the voice signal (voice keyword) that turns the sound source localization function on, sound source localization has to be performed again to determine the user's sound source direction, and speech enhancement has to be carried out with the newly determined sound source direction before the voice command in the voice signal can be recognized correctly.
  • the present invention provides a voice enhancement method, device and smart device.
  • An embodiment of the present invention provides a voice enhancement method for a smart device, including: monitoring and collecting, in real time, a voice signal sent by a user; determining the direction of the user according to the voice signal; collecting a depth image in the direction in which the user is located; determining the sound source direction of the user according to the depth image; and adjusting the beamforming direction of the microphone array on the smart device according to the user's sound source direction, and performing enhancement processing on the voice signal.
  • Another embodiment of the present invention further provides a voice enhancement apparatus for a smart device, including:
  • a voice signal collecting unit configured to monitor and collect voice signals sent by the user in real time
  • a user direction determining unit configured to determine a direction of the user according to the voice signal
  • a depth image acquisition unit for collecting a depth image in a direction in which the user is located
  • a sound source direction determining unit configured to determine a sound source direction of the user according to the depth image
  • the enhancement processing unit is configured to adjust a beamforming direction of the microphone array on the smart device according to a direction of the user's sound source, and perform enhancement processing on the voice signal.
  • A smart device, including a memory and a processor connected by an internal bus, and further including a microphone array and a depth camera each connected to the processor. The microphone array monitors and collects, in real time, the voice signal sent by the user and sends the voice signal to the processor; the depth camera collects the depth image in the direction in which the user is located and sends the depth image to the processor; the memory stores program instructions executable by the processor, and when the program instructions are executed by the processor, the voice enhancement method of the smart device described above can be implemented.
  • The invention has the beneficial effects that the direction of the user is first roughly determined from the acquired voice signal; after the approximate direction of the user is obtained, a depth image in that direction is further collected, and the user's sound source direction is precisely located according to the depth image. The sound source direction determined from the depth image is then used as the reference for adjusting the beamforming direction of the microphone array, improving the sound quality and intensity in the user's sound source direction. Compared with the prior art, the present invention determines the user's sound source direction more accurately through the depth image, which makes it easier to adjust the beamforming direction of the microphone array accurately, so that the microphone array can be precisely aimed at the user's sound source direction and voice enhancement is achieved. This avoids the prior-art defect in which, because the determined sound source direction is inaccurate, the voice signal actually sent by the user is misjudged as noise during voice enhancement processing and the voice command in the voice signal cannot be recognized.
  • FIG. 1 is a schematic flowchart of a voice enhancement method of a smart device according to an embodiment of the present invention
  • FIG. 2 is a schematic structural diagram of a smart device according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram showing the relationship between a depth camera, a microphone array, and a user's spatial coordinates according to an embodiment of the present invention
  • FIG. 4 is a schematic diagram showing a relationship between a preset part of a user and a spatial coordinate of a microphone array according to an embodiment of the present invention
  • FIG. 5 is a schematic flowchart of a voice enhancement method of a smart device according to another embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a voice enhancement apparatus of a smart device according to an embodiment of the present invention.
  • FIG. 7 is a schematic block diagram of a smart device according to an embodiment of the present invention.
  • In order to solve, at least in part, the problem described above, the inventors of the present application conceived of roughly determining the direction of the user from the acquired voice signal sent by the user, collecting a depth image in the direction in which the user is located, determining the user's sound source direction according to the depth image, and using the sound source direction determined from the depth image as the reference for adjusting the beamforming direction of the microphone array, thereby improving the sound intensity in the user's sound source direction. Compared with the prior art, the present invention determines the user's sound source direction more accurately through the depth image and therefore determines the beamforming direction of the microphone array more accurately, avoiding the prior-art defect that an inaccurately determined sound source direction prevents the voice command from being obtained and recognized, and improving the effect of voice enhancement so that the voice command can be recognized.
  • FIG. 1 is a schematic flowchart diagram of a voice enhancement method of a smart device according to an embodiment of the present invention. As shown in FIG. 1, the method of the embodiment of the present invention includes:
  • S11: Monitoring and collecting, in real time, the voice signal sent by the user of the smart device.
  • the voice signal sent by the user can be collected by the voice collector.
  • the microphone array is preferably used as the voice collector.
  • the microphone array 21 is disposed on a side surface of the smart device. It can be understood that the arrangement of the microphone array 21 shown in FIG. 2 is only schematic. When the smart device is a robot, the microphone array can also be disposed at the head or other parts of the robot.
  • the smart device in the embodiment of the present invention determines the direction of the user according to the voice signal by using the sound source localization mode, where the user's direction is the current general direction of the user.
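The excerpt above does not say which sound source localization technique supplies this rough direction. One common choice for a pair of microphones in the array is GCC-PHAT time-delay estimation; the sketch below is a minimal illustration under that assumption only, and the sampling rate, microphone spacing and signal arrays are supplied by the caller rather than taken from the patent.

```python
import numpy as np

SOUND_SPEED = 343.0  # speed of sound in air, m/s

def gcc_phat_delay(sig, ref, fs):
    """Estimate the time delay (seconds) between two microphone channels via GCC-PHAT."""
    n = len(sig) + len(ref)
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12                 # PHAT weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def rough_user_direction(sig_left, sig_right, fs, mic_spacing):
    """Approximate azimuth (degrees) of the user relative to a two-microphone sub-array."""
    tau = gcc_phat_delay(sig_left, sig_right, fs)
    # Far-field assumption: sin(theta) = c * tau / d, clipped to the valid range.
    return np.degrees(np.arcsin(np.clip(SOUND_SPEED * tau / mic_spacing, -1.0, 1.0)))
```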
  • the embodiment of the present invention collects a depth image of the direction in which the user is located through the depth camera.
  • the depth camera 22 is disposed on a side surface of the smart device. It can be understood that the setting manner of the depth camera 22 shown in FIG. 2 is only schematic. When the smart device is a robot, the depth camera can also be disposed at other parts of the robot.
  • A depth image is an image in which the distance (depth) from the depth camera to each point in the scene is used as the pixel value. The spatial position coordinates of the user's sound source can be determined from the depth image, and the user's sound source direction is then determined from those spatial position coordinates.
  • S15 Adjust a beamforming direction of the microphone array on the smart device according to the direction of the user's sound source, and perform enhancement processing on the voice signal.
  • Enhancement processing of the speech signal means increasing the intensity of the speech signal in the beamforming direction and filtering out speech signals from other directions. If the user direction determined from the voice signal alone were used as the beamforming direction for voice enhancement, then whenever that direction is inaccurate the user's actual voice signal would be filtered out as external noise and the user's voice command could not be recognized; the embodiments therefore perform voice enhancement by using the sound source direction determined from the depth image as the beamforming direction of the microphone array.
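The patent describes the enhancement step only as boosting the signal from the beamforming direction and filtering the rest; it does not fix a beamforming algorithm. The minimal frequency-domain delay-and-sum beamformer below, for a linear array, illustrates the idea as a sketch; the array geometry and steering convention are illustrative choices, not the patent's implementation.

```python
import numpy as np

SOUND_SPEED = 343.0  # m/s

def delay_and_sum(frames, mic_x, steer_deg, fs):
    """Steer a linear array toward steer_deg (angle from the array's X axis) and sum.

    frames: array of shape (num_mics, num_samples) with time-domain channels.
    mic_x:  microphone x-coordinates in metres along the array axis.
    """
    num_mics, num_samples = frames.shape
    freqs = np.fft.rfftfreq(num_samples, d=1.0 / fs)
    theta = np.radians(steer_deg)
    out = np.zeros(num_samples)
    for m in range(num_mics):
        # Delay that time-aligns channel m for a plane wave arriving from steer_deg.
        tau = mic_x[m] * np.cos(theta) / SOUND_SPEED
        spectrum = np.fft.rfft(frames[m]) * np.exp(-2j * np.pi * freqs * tau)
        out += np.fft.irfft(spectrum, n=num_samples)
    return out / num_mics
```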
  • The voice enhancement method of the smart device provided by the embodiment of the present invention first roughly determines the direction of the user from the acquired voice signal; after the approximate direction is obtained, it further collects a depth image in the direction in which the user is located and precisely locates the user's sound source direction according to the depth image, using that direction as the reference for adjusting the beamforming direction of the microphone array and thereby improving the sound quality and intensity in the user's sound source direction. Compared with the prior art, the method determines the user's sound source direction more accurately through the depth image, so the beamforming direction of the microphone array can be determined more accurately and the array can be precisely aimed at the user's sound source, achieving voice enhancement. This avoids the prior-art defect in which an inaccurately determined sound source direction causes the user's actual voice signal to be misjudged as noise so that the voice command cannot be recognized, and thus improves the effect and accuracy of speech enhancement and, in turn, the accuracy of speech command recognition.
  • In practical applications the user may move. When the user moves, i.e. when the sound source direction changes, if no new depth image is collected and the voice signal is still enhanced according to the sound source direction determined from the depth image captured before the user moved, the user's actual voice signal may be filtered out as noise and the voice command cannot be recognized; the user then has to repeat the keyword (the keyword activates the sound source localization function of the microphone array) so that the microphone array can perform sound source localization again, which harms the user experience.
  • In an optional implementation of the embodiment of the present invention, the method further includes: monitoring the movement of the user in real time; acquiring the user's moving direction when movement is detected; and controlling the smart device to move toward the user's moving direction and collecting a depth image after the user has moved.
  • A depth camera can be installed on the smart device, and the user's movement can be monitored in real time from the depth images it collects. When movement is detected, the user's moving direction is obtained and the smart device is controlled to turn toward that direction, so that when the user moves in a certain direction the smart device rotates accordingly. After moving, the smart device re-collects the user's depth image and determines the user's sound source direction from the re-collected image. In this way the user's actual voice signal is prevented from being filtered out as external noise, the user's command can still be recognized while the user is moving, and the user does not need to repeat the keyword, which improves the user experience.
  • Specifically, determining the sound source direction of the user according to the depth image includes: determining the spatial position coordinates of a preset part of the user according to the depth image; and determining the user's sound source direction according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.
  • It can be understood that the depth image contains depth information; a skeleton algorithm can therefore accurately determine the spatial position coordinates of the preset part of the user from the depth image, and the user's sound source direction is then determined from the spatial position coordinates of the microphone array and of the preset part. Preferably, the preset part of the user is the head or the neck.
  • A spatial coordinate system is established with the depth camera as the origin: the direction perpendicular to the ground and pointing vertically upward is the positive Y axis; the X axis and the Z axis are parallel to the ground; the Z axis coincides with the central axis of the depth camera, with the depth camera capturing the depth image along the positive Z direction; and the X axis is perpendicular to the Z axis. In this coordinate system, the direction passing through the center point of the microphone array and parallel to the X axis is taken as the reference 0 degree direction.
  • The specific process of determining the user's sound source direction and the distance between the smart device and the user from the spatial position coordinates of the microphone array and of the user's head is described below with reference to FIG. 4.
  • Suppose that, in the depth image of the user captured by the depth camera, the extracted spatial position coordinates of the user's head are (X1, Y1, Z1), and the spatial position coordinates of the center point of the microphone array are (X2, Y2, Z2) (the latter are a fixed value that can be obtained from the positional relationship between the microphone array and the depth camera). Referring to FIG. 4, the angle C between the X axis and the line connecting the center point of the microphone array with the user's head, the angle D between the X axis and that line's projection onto the horizontal plane, and the straight-line distance L between the center point of the microphone array and the user's head can then be calculated from these coordinates (the corresponding formulas are provided as images in the original publication).
  • Since the microphone array is arranged on the smart device, the straight-line distance L can be approximated as the straight-line distance between the smart device and the user's head. Therefore, by combining the angle C, the angle D and the distance L, the distance between the smart device and the user can be determined and the user's sound source direction can be precisely located.
  • Further, to ensure that the distance between the smart device and the user stays within the pickup range, the method also includes: determining the distance between the smart device and the user according to the depth image in the direction in which the user is located; determining, according to that distance, whether to control the smart device to move; and if so, controlling the smart device to move toward the direction in which the user is located, shortening the distance between the smart device and the user.
  • In practice, the smart device may be controlled to move toward the user in preset steps while the post-movement distance is obtained in real time; alternatively, the difference between the distance from the smart device to the user and the pickup range can be calculated, and the smart device controlled to move toward the user by that specified distance, so that the distance between the smart device and the user falls within the pickup range.
  • Specifically, determining the distance between the smart device and the user according to the depth image in the direction in which the user is located includes: determining the spatial position coordinates of the preset part of the user according to the depth image; and determining the distance between the smart device and the preset part of the user according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.
  • It can be understood that, in the embodiment of the present invention, the distance between the smart device and the preset part of the user refers to the straight-line distance.
  • Specifically, determining whether to control the smart device to move according to the distance between the smart device and the user includes: if the distance between the smart device and the preset part of the user is greater than a preset distance threshold, controlling the smart device to move.
  • It should be noted that, when the distance between the smart device and the preset part of the user is greater than the preset distance threshold, the smart device is controlled to move so as to shorten the distance between the smart device and the user, ensuring that this distance stays within the pickup range so that voice command recognition can be completed.
  • Controlling the smart device's motion includes controlling both translation and rotation. Suppose the pickup range of the microphone array, i.e. the preset distance threshold, is S. If L is greater than S, the distance between the smart device and the preset part of the user exceeds the threshold, that is, the user is outside the pickup range; the smart device then needs to be moved by L - S to shorten the distance between the smart device and the user.
  • the horizontal rotation direction of the smart device may be determined according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset portion of the user.
  • Referring to FIG. 4, the angle between the X axis and the projection, onto the horizontal plane, of the line connecting the center point of the smart device's microphone array with the user's head is determined according to a formula provided as an image in the original publication, where:
  • D is the angle between the X axis and the projection, on the horizontal plane, of the line connecting the center point of the microphone array with the user's head, i.e. the horizontal rotation direction of the smart device. Rotating the smart device horizontally to direction D brings the user into the pickup range of the microphone array.
  • It should be noted that, since the direction along the X axis through the center point of the microphone array is the reference 0 degree direction, the user's sound source direction is determined by the angle C between the X axis and the line connecting the center point of the microphone array with the user's head, together with the angle D between the X axis and that line's projection onto the horizontal plane.
  • In practice, when the distance between the smart device and the preset part of the user is greater than the preset distance threshold, the smart device is controlled to rotate horizontally to direction D and to move by L - S; the depth image in the user's direction is then re-collected, and the user's sound source direction is determined from the re-collected depth image.
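The reposition-then-re-acquire behaviour described above can be condensed into a small decision helper. This is only a sketch: the coordinate convention and the heading formula follow the assumptions of the geometry discussion above, and the actual rotation and translation are left to whatever drive interface the device exposes.

```python
import math

def plan_motion(head_xyz, mic_xyz, pickup_range_s):
    """Return (needs_motion, heading_deg, advance_m) for the move-or-enhance decision.

    head_xyz, mic_xyz: (X, Y, Z) of the user's head and of the microphone-array
    center in the depth-camera coordinate system (Y up, Z along the optical axis).
    """
    dx, dy, dz = (h - m for h, m in zip(head_xyz, mic_xyz))
    distance_l = math.sqrt(dx * dx + dy * dy + dz * dz)   # straight-line distance L
    heading_d = math.degrees(math.atan2(dz, dx))          # horizontal direction D (assumed form)
    if distance_l > pickup_range_s:
        # Rotate horizontally to D and advance by L - S before re-acquiring the depth image.
        return True, heading_d, distance_l - pickup_range_s
    return False, heading_d, 0.0
```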
  • The complete process by which voice enhancement is achieved while the distance between the smart device and the user is kept within the pickup range is shown in FIG. 5; the method of the embodiment of the present invention includes:
  • S51: monitoring and collecting, in real time through the microphone array, the voice signal sent by the user;
  • S52: determining the direction of the user according to the voice signal;
  • S53: collecting, through the depth camera, a depth image in the direction in which the user is located;
  • S54: determining the user's sound source direction according to the depth image;
  • S55: determining the distance between the smart device and the user according to the depth image in the direction in which the user is located;
  • S56: judging whether the distance between the smart device and the preset part of the user is greater than the preset distance threshold; if yes, proceeding to step S58, otherwise proceeding to step S57;
  • S57: adjusting the beamforming direction of the microphone array on the smart device according to the user's sound source direction, and performing enhancement processing on the voice signal;
  • S58: determining the moving distance and the horizontal moving direction of the smart device;
  • S59: controlling the smart device to move, shortening the distance between the smart device and the user, and returning to step S53 to re-collect the user's depth image.
  • The embodiment of the present invention first roughly determines the direction of the user from the acquired voice signal; after the approximate direction is obtained, it further collects a depth image in the direction in which the user is located and precisely locates the user's sound source direction according to the depth image, using that direction as the reference for adjusting the beamforming direction of the microphone array and thereby improving the sound quality and intensity in the user's sound source direction. Compared with the prior art, the user's sound source direction is determined more accurately through the depth image, which makes it easier to adjust the beamforming direction of the microphone array accurately so that the array is precisely aimed at the user's sound source; this avoids the prior-art defect in which an inaccurately determined sound source direction causes the user's actual voice signal to be misjudged as noise and the voice command to go unrecognized, and improves the effect of voice enhancement so that the voice command can be recognized.
  • Moreover, when the distance between the smart device and the preset part of the user is greater than the preset distance threshold, the moving distance and the horizontal moving direction are sent to the smart device to control its motion, shortening the distance between the smart device and the user and ensuring that it remains within the pickup range so that voice command recognition can be completed.
  • FIG. 6 is a schematic structural diagram of a voice enhancement apparatus of a smart device according to an embodiment of the present invention.
  • the apparatus of the embodiment of the present invention includes a voice signal collecting unit 61, a user direction determining unit 62, a depth image collecting unit 63, a sound source direction determining unit 64, and an enhancement processing unit 65, specifically:
  • the voice signal collecting unit 61 is configured to monitor and collect voice signals sent by the user in real time;
  • a user direction determining unit 62 configured to determine a direction of the user according to the voice signal
  • a depth image acquisition unit 63 configured to collect a depth image in a direction in which the user is located
  • a sound source direction determining unit 64 configured to determine a sound source direction of the user according to the depth image
  • the enhancement processing unit 65 is configured to adjust a beamforming direction of the microphone array on the smart device according to a sound source direction of the user, and perform enhancement processing on the voice signal.
  • The voice enhancement apparatus of the smart device provided by the embodiment of the present invention first roughly determines the direction of the user from the acquired voice signal; after the approximate direction is obtained, it further collects a depth image in the direction in which the user is located and precisely locates the user's sound source direction according to the depth image, using that direction as the reference for adjusting the beamforming direction of the microphone array and thereby improving the sound quality and intensity in the user's sound source direction. Compared with the prior art, the apparatus determines the user's sound source direction more accurately through the depth image, so the beamforming direction of the microphone array can be determined more accurately and the array can be precisely aimed at the user's sound source, achieving voice enhancement; this avoids the prior-art defect in which an inaccurately determined sound source direction causes the voice signal actually sent by the user to be misjudged as noise during voice enhancement processing and the voice command to go unrecognized, and improves the effect of voice enhancement so that the voice command can be recognized.
  • In an optional implementation of the embodiment of the present invention, the apparatus further includes:
  • a mobile monitoring unit for monitoring the movement of the user in real time
  • a moving direction collecting unit configured to collect a moving direction of the user when monitoring the user to move
  • a motion control unit configured to control the movement of the smart device toward the moving direction of the user
  • the depth image acquisition unit is further configured to collect a depth image after the user moves.
  • The sound source direction determining unit 64 is specifically configured to: determine the spatial position coordinates of the preset part of the user according to the depth image; and determine the user's sound source direction according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.
  • the device further includes:
  • a distance determining unit configured to determine a distance between the smart device and the user according to the depth image of the direction in which the user is located;
  • a determining unit configured to determine, according to a distance between the smart device and the user, whether to control the motion of the smart device
  • the motion control unit is configured to control the smart device to move toward the direction of the user when determining the motion of the control smart device, and shorten the distance between the smart device and the user.
  • the distance determining unit is specifically configured to determine a spatial position coordinate of the preset part of the user according to the depth image of the direction in which the user is located;
  • the distance between the smart device and the preset portion of the user is determined according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset portion of the user.
  • the determining unit is specifically configured to determine to control the movement of the smart device when the distance between the smart device and the preset portion of the user is greater than the preset distance threshold.
  • the voice enhancement device of the smart device in the embodiment of the present invention may be used to perform the foregoing method embodiments, and the principles and technical effects thereof are similar, and details are not described herein again.
  • FIG. 7 is a schematic block diagram of a smart device according to an embodiment of the present invention.
  • the smart device includes a memory 71 and a processor 72.
  • the memory 71 and the processor 72 are communicably connected by an internal bus 73.
  • A microphone array 74 and a depth camera 75 are also included, each connected to the processor 72. The microphone array 74 monitors and collects, in real time, the voice signal sent by the user and sends the voice signal to the processor 72; the depth camera 75 collects the depth image in the direction in which the user is located and sends the depth image to the processor 72. The memory 71 stores program instructions executable by the processor 72, and when these program instructions are executed by the processor 72, the voice enhancement method of the smart device described above can be implemented.
  • In addition, when the logic instructions in the above memory 71 are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • Based on this understanding, the part of the technical solution of the present invention that is essential or that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
  • the embodiment of the present invention provides a computer readable storage medium, where the computer readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the voice enhancement method of the smart device provided by the foregoing method embodiments.
  • In summary, the direction of the user is first roughly determined from the acquired voice signal sent by the user; after the approximate direction of the user is obtained, a depth image in the direction in which the user is located is further collected, and precise localization of the user's sound source direction is achieved according to the depth image. The sound source direction determined from the depth image is used as the reference for adjusting the beamforming direction of the microphone array, improving the sound quality and intensity in the user's sound source direction. Compared with the prior art, the present invention determines the user's sound source direction more accurately through the depth image and thus determines the beamforming direction of the microphone array more conveniently and more accurately, so that the microphone array can be precisely aimed at the user's sound source direction and voice enhancement is achieved; this avoids the prior-art defect in which an inaccurately determined sound source direction causes the voice signal actually sent by the user to be misjudged as noise and eliminated during voice enhancement processing, so that the voice command cannot be recognized, and improves the effect of voice enhancement so that the voice command can be recognized.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.

Abstract

The present invention discloses a voice enhancement method and apparatus for a smart device, and a smart device. The method includes: monitoring and collecting, in real time, a voice signal sent by a user; determining the direction of the user according to the voice signal; collecting a depth image in the direction in which the user is located; determining the user's sound source direction according to the depth image; and adjusting the beamforming direction of the microphone array on the smart device according to the user's sound source direction, and performing enhancement processing on the voice signal. As can be seen, the present invention first obtains the approximate direction of the user through sound source localization, then collects a depth image in that direction and uses the depth image to achieve precise localization of the user's sound source direction; the sound source direction determined from the depth image serves as the reference for adjusting the beamforming direction of the microphone array, increasing the sound intensity in the user's sound source direction. This avoids the prior-art defect that an inaccurately determined sound source direction prevents the voice command from being obtained and recognized, and improves the effect of voice enhancement.

Description

Voice enhancement method and apparatus for a smart device, and smart device

Technical Field

The present invention relates to the field of sound source localization technologies, and in particular to a voice enhancement method and apparatus for a smart device, and to a smart device.

Background Art

As the acoustic environments in which robots and other smart devices operate become more and more complex, speech recognition for smart hardware becomes increasingly challenging. When the user is relatively far from the microphone, the smart hardware may fail to recognize the voice signal input by the user, so the input voice signal needs to undergo voice enhancement processing.

In existing voice enhancement methods, when the user emits a voice signal, the user's sound source direction is determined by sound source localization; beamforming is then used to increase the intensity of the sound from the user's sound source direction within the collected voice signal, while signals from other directions in the collected voice signal are treated as noise and filtered out. The accuracy of the sound source direction determined by sound source localization is therefore crucial to the effect of voice enhancement: if the determined sound source direction is inaccurate, the user's actual voice signal is filtered out as noise, and the voice command cannot be obtained and recognized.

In actual use, when the user moves, the user's sound source direction changes accordingly. If the beamforming direction in the voice enhancement algorithm remains unchanged, the user's actual voice signal may also be filtered out as external noise, so that the voice command in the user's voice signal cannot be recognized. The user then has to re-input the voice signal (voice keyword) that turns on the sound source localization function, sound source localization has to be performed again to determine the user's sound source direction, and voice enhancement has to be carried out with the newly determined sound source direction before the voice command in the voice signal can be recognized correctly. When the user keeps walking around and wants to control a robot or other smart device by voice, the voice keyword that turns on the sound source localization function has to be repeated constantly so that the user's sound source direction can be re-located, which degrades the user experience.
Summary of the Invention

In order to solve the problem that existing voice enhancement methods cannot obtain and recognize voice commands because the determined sound source direction of the user is inaccurate, the present invention provides a voice enhancement method, a voice enhancement apparatus and a smart device.

An embodiment of the present invention provides a voice enhancement method for a smart device, including:

monitoring and collecting, in real time, a voice signal sent by a user;

determining the direction of the user according to the voice signal;

collecting a depth image in the direction in which the user is located;

determining the sound source direction of the user according to the depth image;

adjusting the beamforming direction of the microphone array on the smart device according to the user's sound source direction, and performing enhancement processing on the voice signal.

Another embodiment of the present invention further provides a voice enhancement apparatus for a smart device, including:

a voice signal collecting unit, configured to monitor and collect, in real time, the voice signal sent by the user;

a user direction determining unit, configured to determine the direction of the user according to the voice signal;

a depth image acquisition unit, configured to collect a depth image in the direction in which the user is located;

a sound source direction determining unit, configured to determine the user's sound source direction according to the depth image;

an enhancement processing unit, configured to adjust the beamforming direction of the microphone array on the smart device according to the user's sound source direction, and to perform enhancement processing on the voice signal.

Another embodiment of the present invention provides a smart device, including a memory and a processor connected by an internal bus for communication, and further including a microphone array and a depth camera each connected to the processor. The microphone array monitors and collects, in real time, the voice signal sent by the user and sends the voice signal to the processor; the depth camera collects the depth image in the direction in which the user is located and sends the depth image to the processor; the memory stores program instructions executable by the processor, and when the program instructions are executed by the processor, the voice enhancement method of the smart device described above can be implemented.

The beneficial effects of the present invention are as follows. The direction of the user is first roughly determined from the acquired voice signal sent by the user; after the approximate direction of the user is obtained, a depth image in the direction in which the user is located is further collected, and precise localization of the user's sound source direction is achieved according to the depth image. The sound source direction determined from the depth image is used as the reference for adjusting the beamforming direction of the microphone array, improving the sound quality and intensity in the user's sound source direction. Compared with the prior art, the present invention determines the user's sound source direction more accurately through the depth image, which makes it easier to adjust the beamforming direction of the microphone array more accurately, so that the microphone array can be precisely aimed at the user's sound source direction and voice enhancement is achieved. This avoids the prior-art defect that, because the determined sound source direction of the user is inaccurate, the voice signal actually sent by the user is misjudged as noise and eliminated during voice enhancement processing, so that the voice command in the voice signal cannot be recognized; the effect and accuracy of voice enhancement are thus improved, and the accuracy of voice command recognition is improved as well.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a voice enhancement method of a smart device according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a smart device according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the spatial coordinate relationship between the depth camera, the microphone array and the user according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the spatial coordinate relationship between a preset part of the user and the microphone array according to an embodiment of the present invention;

FIG. 5 is a schematic flowchart of a voice enhancement method of a smart device according to another embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a voice enhancement apparatus of a smart device according to an embodiment of the present invention;

FIG. 7 is a schematic block diagram of a smart device according to an embodiment of the present invention.
Detailed Description of the Embodiments

In order to solve, or partly solve, the technical problem raised in the background section, the inventors of the present application conceived of roughly determining the direction of the user from the acquired voice signal sent by the user, collecting a depth image in the direction in which the user is located, determining the user's sound source direction according to the depth image, and using the sound source direction determined from the depth image as the reference for adjusting the beamforming direction of the microphone array, thereby increasing the sound intensity in the user's sound source direction. Compared with the prior art, the present invention determines the user's sound source direction more accurately through the depth image and therefore determines the beamforming direction of the microphone array more accurately, avoiding the prior-art defect that the voice command cannot be obtained and recognized because the determined sound source direction of the user is inaccurate, and improving the effect of voice enhancement so that the voice command can be recognized.

In order to make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

FIG. 1 is a schematic flowchart of a voice enhancement method of a smart device according to an embodiment of the present invention. As shown in FIG. 1, the method of the embodiment of the present invention includes:

S11: monitoring and collecting, in real time, the voice signal sent by the user of the smart device.

In specific implementations, the voice signal sent by the user can be collected by a voice collector; in the embodiment of the present invention, a microphone array is preferably used as the voice collector. As shown in FIG. 2, the microphone array 21 is arranged on a side surface of the smart device. It can be understood that the arrangement of the microphone array 21 shown in FIG. 2 is only schematic; when the smart device is a robot, the microphone array can also be arranged on the head or another part of the robot.

S12: determining the direction of the user according to the voice signal.

It can be understood that the smart device in the embodiment of the present invention determines the direction of the user from the voice signal by means of sound source localization; here, the direction of the user is the current approximate direction of the user.

S13: collecting a depth image in the direction in which the user is located.

In the embodiment of the present invention, after the approximate direction of the user is determined, a depth image in the direction in which the user is located is collected by a depth camera. As shown in FIG. 2, the depth camera 22 is arranged on a side surface of the smart device. It can be understood that the arrangement of the depth camera 22 shown in FIG. 2 is only schematic; when the smart device is a robot, the depth camera can also be arranged on other parts of the robot.

S14: determining the user's sound source direction according to the depth image.

A depth image is an image in which the distance (depth) from the depth camera to each point in the scene is used as the pixel value. The spatial position coordinates of the user's sound source can be determined from the depth image, and the user's sound source direction is then determined from those spatial position coordinates.

S15: adjusting the beamforming direction of the microphone array on the smart device according to the user's sound source direction, and performing enhancement processing on the voice signal.

It can be understood that performing enhancement processing on the voice signal means increasing the intensity of the voice signal in the beamforming direction and filtering out voice signals from other directions. If the user direction determined from the voice signal were used as the beamforming direction for voice enhancement, then whenever that direction is inaccurate the user's actual voice signal would be filtered out as external noise and the user's voice command could not be recognized; the embodiment of the present invention instead uses the sound source direction determined from the depth image as the beamforming direction of the microphone array to complete the voice enhancement.

The voice enhancement method for a smart device provided by the embodiment of the present invention first roughly determines the direction of the user from the acquired voice signal sent by the user; after the approximate direction of the user is obtained, it further collects a depth image in the direction in which the user is located and precisely locates the user's sound source direction according to the depth image, using the sound source direction determined from the depth image as the reference for adjusting the beamforming direction of the microphone array and thereby improving the sound quality and intensity in the user's sound source direction. Compared with the prior art, the present invention determines the user's sound source direction more accurately through the depth image, which makes it easier to determine the beamforming direction of the microphone array more accurately, so that the microphone array can be precisely aimed at the user's sound source direction and voice enhancement is achieved. This avoids the prior-art defect that, because the determined sound source direction is inaccurate, the voice signal actually sent by the user is misjudged as noise and eliminated during voice enhancement processing, so that the voice command in the voice signal cannot be recognized; the effect and accuracy of voice enhancement are thus improved, and so is the accuracy of voice command recognition.
In practical applications, the user may move. When the user moves, i.e. when the sound source direction changes, if no new depth image of the moved user is collected and the voice signal is still enhanced according to the sound source direction determined from the depth image captured before the user moved, the user's actual voice signal may be filtered out as noise, so that the voice command cannot be obtained and recognized; the user then has to repeat the keyword (the keyword can activate the sound source localization function of the microphone array) and the microphone array has to perform sound source localization again, which harms the user experience.

In an optional implementation of the embodiment of the present invention, similar to the method in FIG. 1, the method further includes:

monitoring the movement of the user in real time;

when user movement is detected, acquiring the user's moving direction;

controlling the smart device to move toward the user's moving direction, and collecting a depth image after the user has moved.

In the embodiment of the present invention, a depth camera can be installed on the smart device, and the user's movement is monitored in real time from the depth images of the user collected by the depth camera. When user movement is detected, the user's moving direction is acquired and the smart device is controlled to turn toward the user's moving direction, so that when the user moves in a certain direction the smart device also turns in that direction. After moving, the smart device re-collects the user's depth image and determines the user's sound source direction from the re-collected depth image. This prevents the user's actual voice signal from being filtered out as external noise, ensures that the user's command can still be recognized while the user moves, and spares the user from repeating the keyword, thereby improving the user experience.
Specifically, determining the user's sound source direction according to the depth image includes:

determining the spatial position coordinates of a preset part of the user according to the depth image;

determining the user's sound source direction according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.

It can be understood that the depth image contains depth information; a skeleton algorithm can therefore be used to accurately determine the spatial position coordinates of the preset part of the user from the depth image containing the depth information, and the user's sound source direction is then determined from the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user. Preferably, the preset part of the user is the head or the neck.
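The skeleton algorithm itself is not detailed in the patent. As background only: once a skeleton tracker (or any detector) has located the head or neck pixel in the depth image, the corresponding 3D coordinates in the camera frame follow from the pinhole model. The intrinsics and pixel values below are hypothetical and merely illustrate that back-projection step; note that the usual image convention has Y pointing down, so an axis flip may be needed to match the Y-up coordinate system defined in the next paragraph.

```python
def depth_pixel_to_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth depth_m (metres) into camera coordinates
    using pinhole intrinsics (fx, fy, cx, cy)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy   # image convention: +y points down; flip for a Y-up frame
    z = depth_m
    return x, y, z

# Hypothetical intrinsics and a hypothetical detected head pixel, for illustration only.
X1, Y1, Z1 = depth_pixel_to_xyz(u=320, v=180, depth_m=2.4,
                                fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```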
In the embodiment of the present invention, a spatial coordinate system is established with the depth camera as the origin: the direction perpendicular to the ground and pointing vertically upward is the positive Y axis; the X axis and the Z axis are parallel to the ground; the Z axis coincides with the central axis of the depth camera, and the depth camera captures the depth image along the positive Z direction; the X axis is perpendicular to the Z axis. As shown in FIG. 3, the spatial position coordinates of the center point of the depth camera are taken as the coordinate origin (0, 0, 0), and all points on its central axis satisfy X = 0, Y = 0. In this spatial coordinate system, the direction passing through the center point of the microphone array and parallel to the X axis is taken as the reference 0 degree direction. The specific process of determining the user's sound source direction and the distance between the smart device and the user from the spatial position coordinates of the microphone array and of the user's head is described below with reference to FIG. 4.

Suppose that, in the depth image of the user captured by the depth camera, the extracted spatial position coordinates of the user's head are (X1, Y1, Z1), and the spatial position coordinates of the center point of the microphone array are (X2, Y2, Z2) (the latter can be obtained from the positional relationship between the microphone array and the depth camera and are a fixed value). Then:

Referring to FIG. 4, the angle C between the X axis and the line connecting the center point of the microphone array with the user's head can be calculated according to the formula provided as image PCTCN2018094658-appb-000001 in the original publication.

The angle D between the X axis and the projection, onto the horizontal plane, of the line connecting the center point of the microphone array with the user's head is calculated according to the formula provided as image PCTCN2018094658-appb-000002.

The straight-line distance L between the center point of the microphone array and the user's head is given by the formula provided as image PCTCN2018094658-appb-000003 (the Euclidean distance between the two points).

Since the microphone array is arranged on the smart device, the straight-line distance L can be approximated as the straight-line distance between the smart device and the user's head.

Therefore, by combining the angle C, the angle D and the distance L, the distance between the smart device and the user can be determined and the user's sound source direction can be precisely located.
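The formulas for the angle C, the angle D and the distance L appear only as images in the PCT publication. The sketch below reconstructs them from the stated geometry (C is the angle between the connecting line and the X axis, D is the angle between that line's horizontal projection and the X axis, L is the straight-line distance); the algebraic form actually used in the filing may differ, so treat this as an assumption.

```python
import math

def source_geometry(head, mic_center):
    """Angles C and D (degrees) and distance L between the microphone-array center and
    the user's head, in the depth-camera frame (Y up, Z along the optical axis)."""
    x1, y1, z1 = head          # (X1, Y1, Z1)
    x2, y2, z2 = mic_center    # (X2, Y2, Z2)
    dx, dy, dz = x1 - x2, y1 - y2, z1 - z2
    distance_l = math.sqrt(dx ** 2 + dy ** 2 + dz ** 2)
    # Angle between the 3D connecting line and the X axis.
    angle_c = math.degrees(math.acos(abs(dx) / distance_l)) if distance_l else 0.0
    # Angle between the line's projection onto the horizontal (XZ) plane and the X axis.
    angle_d = math.degrees(math.atan2(dz, dx))
    return angle_c, angle_d, distance_l

# Example with made-up coordinates: head at (0.5, 1.6, 2.0), array center at (0.0, 0.3, 0.1).
print(source_geometry((0.5, 1.6, 2.0), (0.0, 0.3, 0.1)))
```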
In practical applications, when the user moves, the distance between the smart device and the user may exceed the pickup range of the microphone array, in which case the user's command cannot be obtained and recognized.

Further, in order to ensure that the distance between the smart device and the user stays within the pickup range, the method also includes:

determining the distance between the smart device and the user according to the depth image in the direction in which the user is located;

determining, according to the distance between the smart device and the user, whether to control the smart device to move;

if so, controlling the smart device to move toward the direction in which the user is located, shortening the distance between the smart device and the user.

In practical applications, the smart device may be controlled to move toward the user in preset steps while the post-movement distance is obtained in real time; alternatively, the difference between the distance from the smart device to the user and the pickup range can be calculated, and the smart device controlled to move toward the user by that specified distance, so that the distance between the smart device and the user falls within the pickup range.
Specifically, determining the distance between the smart device and the user according to the depth image in the direction in which the user is located includes:

determining the spatial position coordinates of the preset part of the user according to the depth image in the direction in which the user is located;

determining the distance between the smart device and the preset part of the user according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.

It can be understood that, in the embodiment of the present invention, the distance between the smart device and the preset part of the user refers to the straight-line distance.

Specifically, determining whether to control the smart device to move according to the distance between the smart device and the user includes:

if the distance between the smart device and the preset part of the user is greater than a preset distance threshold, controlling the smart device to move.

It should be noted that, in the present invention, when the distance between the smart device and the preset part of the user is greater than the preset distance threshold, the smart device is controlled to move so as to shorten the distance between the smart device and the user, ensuring that this distance stays within the pickup range so that voice command recognition can be completed.

In practical applications, controlling the smart device to move includes controlling the smart device to translate and to rotate. Suppose the pickup range of the microphone array, i.e. the preset distance threshold, is S. If L is greater than S, the distance between the smart device and the preset part of the user is greater than the preset distance threshold, that is, the distance between the smart device and the user is outside the pickup range; the smart device then needs to be moved by L - S to shorten the distance between the smart device and the user.
To control the rotation of the smart device, the horizontal rotation direction of the smart device can be determined from the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.

Referring to FIG. 4, the angle between the X axis and the projection, onto the horizontal plane, of the line connecting the center point of the smart device's microphone array with the user's head is determined according to the formula provided as image PCTCN2018094658-appb-000004,

where D is the angle between the X axis and the projection, onto the horizontal plane, of the line connecting the center point of the microphone array with the user's head, i.e. the horizontal rotation direction of the smart device. By controlling the smart device to rotate horizontally to direction D, the user can be brought into the pickup range of the microphone array.

It should be noted that, since the direction along the X axis through the center point of the microphone array is the reference 0 degree direction, the user's sound source direction is determined by the angle C between the X axis and the line connecting the center point of the microphone array with the user's head, together with the angle D between the X axis and that line's projection onto the horizontal plane.

In practical applications, when the distance between the smart device and the preset part of the user is greater than the preset distance threshold, the smart device is controlled to rotate horizontally to direction D and to move by L - S; a depth image in the direction in which the user is located is then re-collected, and the user's sound source direction is determined from the re-collected depth image.
The complete process by which the present invention achieves voice enhancement while ensuring that the distance between the smart device and the user stays within the pickup range is described below with reference to FIG. 5. As shown in FIG. 5, the method of the embodiment of the present invention includes:

S51: monitoring and collecting, in real time through the microphone array, the voice signal sent by the user;

S52: determining the direction of the user according to the voice signal;

S53: collecting, through the depth camera, a depth image in the direction in which the user is located;

S54: determining the user's sound source direction according to the depth image;

S55: determining the distance between the smart device and the user according to the depth image in the direction in which the user is located;

S56: judging whether the distance between the smart device and the preset part of the user is greater than the preset distance threshold; if yes, proceeding to step S58, otherwise proceeding to step S57;

S57: adjusting the beamforming direction of the microphone array on the smart device according to the user's sound source direction, and performing enhancement processing on the voice signal;

S58: determining the moving distance and the horizontal moving direction of the smart device;

S59: controlling the smart device to move, shortening the distance between the smart device and the user, and returning to step S53 to re-collect the user's depth image.
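Putting the steps of FIG. 5 together, the control flow can be sketched as a loop. Every helper used here (capture_audio, localize_roughly, capture_depth_image, head_coordinates, mic_center, beamform_towards, rotate_and_move) is a placeholder invented for this sketch rather than an interface defined by the patent, and source_geometry is the reconstruction sketched earlier.

```python
def voice_enhancement_loop(device, pickup_range_s):
    """One rendering of the FIG. 5 flow (S51-S59), repeated while the device runs."""
    while device.running:
        audio = device.capture_audio()                              # S51
        rough_direction = device.localize_roughly(audio)            # S52
        while True:
            depth = device.capture_depth_image(rough_direction)     # S53
            head = device.head_coordinates(depth)                   # S54
            angle_c, angle_d, distance_l = source_geometry(head, device.mic_center)  # S55
            if distance_l <= pickup_range_s:                        # S56
                device.beamform_towards(angle_c, angle_d)           # S57: enhance the voice signal
                break
            # S58 / S59: rotate to direction D, advance by L - S, then re-acquire the depth image.
            device.rotate_and_move(angle_d, distance_l - pickup_range_s)
```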
The embodiment of the present invention first roughly determines the direction of the user from the acquired voice signal sent by the user; after the approximate direction of the user is obtained, it further collects a depth image in the direction in which the user is located and precisely locates the user's sound source direction according to the depth image, using the sound source direction determined from the depth image as the reference for adjusting the beamforming direction of the microphone array and thereby improving the sound quality and intensity in the user's sound source direction. Compared with the prior art, the present invention determines the user's sound source direction more accurately through the depth image, which makes it easier to adjust the beamforming direction of the microphone array more accurately, so that the microphone array can be precisely aimed at the user's sound source direction and voice enhancement is achieved; this avoids the prior-art defect that, because the determined sound source direction is inaccurate, the voice signal actually sent by the user is misjudged as noise and eliminated during voice enhancement processing, so that the voice command cannot be obtained and recognized, and improves the effect of voice enhancement so that the voice command can be recognized.

Moreover, in the embodiment of the present invention, when the distance between the smart device and the preset part of the user is greater than the preset distance threshold, the moving distance and the horizontal moving direction are sent to the smart device to control the smart device's movement, shortening the distance between the smart device and the user and ensuring that this distance stays within the pickup range so that voice command recognition can be completed.
FIG. 6 is a schematic structural diagram of a voice enhancement apparatus of a smart device according to an embodiment of the present invention. As shown in FIG. 6, the apparatus of the embodiment of the present invention includes a voice signal collecting unit 61, a user direction determining unit 62, a depth image acquisition unit 63, a sound source direction determining unit 64 and an enhancement processing unit 65. Specifically:

the voice signal collecting unit 61 is configured to monitor and collect, in real time, the voice signal sent by the user;

the user direction determining unit 62 is configured to determine the direction of the user according to the voice signal;

the depth image acquisition unit 63 is configured to collect a depth image in the direction in which the user is located;

the sound source direction determining unit 64 is configured to determine the user's sound source direction according to the depth image;

the enhancement processing unit 65 is configured to adjust the beamforming direction of the microphone array on the smart device according to the user's sound source direction, and to perform enhancement processing on the voice signal.
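Purely as a structural illustration of how units 61 to 65 cooperate, the apparatus can be modelled as a simple composition of callables; the class and field names below are invented for this sketch and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

@dataclass
class VoiceEnhancementApparatus:
    """Mirrors units 61-65: each field holds the callable performing that unit's job."""
    collect_voice: Callable[[], Any]               # voice signal collecting unit 61
    user_direction: Callable[[Any], float]         # user direction determining unit 62
    collect_depth: Callable[[float], Any]          # depth image acquisition unit 63
    source_direction: Callable[[Any], Tuple]       # sound source direction determining unit 64
    enhance: Callable[[Any, Tuple], Any]           # enhancement processing unit 65

    def run_once(self):
        voice = self.collect_voice()
        rough = self.user_direction(voice)
        depth = self.collect_depth(rough)
        direction = self.source_direction(depth)
        return self.enhance(voice, direction)
```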
The voice enhancement apparatus for a smart device provided by the embodiment of the present invention first roughly determines the direction of the user from the acquired voice signal sent by the user; after the approximate direction of the user is obtained, it further collects a depth image in the direction in which the user is located and precisely locates the user's sound source direction according to the depth image, using the sound source direction determined from the depth image as the reference for adjusting the beamforming direction of the microphone array and thereby improving the sound quality and intensity in the user's sound source direction. Compared with the prior art, the present invention determines the user's sound source direction more accurately through the depth image, which makes it easier to determine the beamforming direction of the microphone array more accurately, so that the microphone array can be precisely aimed at the user's sound source direction and voice enhancement is achieved; this avoids the prior-art defect that, because the determined sound source direction is inaccurate, the user's actual voice signal is misjudged as noise and eliminated during voice enhancement processing, so that the voice command cannot be obtained and recognized, and improves the effect of voice enhancement so that the voice command can be recognized.
In an optional implementation of the embodiment of the present invention, the apparatus further includes:

a movement monitoring unit, configured to monitor the movement of the user in real time;

a moving direction acquisition unit, configured to acquire the user's moving direction when user movement is detected;

a motion control unit, configured to control the smart device to move toward the user's moving direction;

the depth image acquisition unit is further configured to collect a depth image after the user has moved.

The sound source direction determining unit 64 is specifically configured to: determine the spatial position coordinates of the preset part of the user according to the depth image; and determine the user's sound source direction according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.

Further, the apparatus also includes:

a distance determining unit, configured to determine the distance between the smart device and the user according to the depth image in the direction in which the user is located;

a judging unit, configured to determine, according to the distance between the smart device and the user, whether to control the smart device to move;

a motion control unit, configured to, when it is determined that the smart device should be controlled to move, control the smart device to move toward the direction in which the user is located, shortening the distance between the smart device and the user.

In one implementation of the embodiment of the present invention, the distance determining unit is specifically configured to determine the spatial position coordinates of the preset part of the user according to the depth image in the direction in which the user is located, and to determine the distance between the smart device and the preset part of the user according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.

Further, the judging unit is specifically configured to determine that the smart device should be controlled to move when the distance between the smart device and the preset part of the user is greater than the preset distance threshold.

The voice enhancement apparatus of the smart device in the embodiment of the present invention can be used to carry out the foregoing method embodiments; its principles and technical effects are similar and are not described again here.
FIG. 7 is a schematic block diagram of a smart device according to an embodiment of the present invention.

Referring to FIG. 7, the smart device includes a memory 71 and a processor 72, which are communicably connected through an internal bus 73, and further includes a microphone array 74 and a depth camera 75 each connected to the processor 72. The microphone array 74 monitors and collects, in real time, the voice signal sent by the user and sends the voice signal to the processor 72; the depth camera 75 collects the depth image in the direction in which the user is located and sends the depth image to the processor 72; the memory 71 stores program instructions executable by the processor 72, and when the program instructions are executed by the processor 72, the voice enhancement method of the smart device described above can be implemented.

In addition, when the logic instructions in the above memory 71 are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the part of the technical solution of the present invention that is essential or that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to perform all or part of the steps of the methods described in the various embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
An embodiment of the present invention provides a computer-readable storage medium that stores computer instructions, and the computer instructions cause a computer to execute the voice enhancement method for a smart device provided by each of the foregoing method embodiments.

In summary, according to the technical solution of the present invention, the direction of the user is first roughly determined from the acquired voice signal sent by the user; after the approximate direction of the user is obtained, a depth image in the direction in which the user is located is further collected, and precise localization of the user's sound source direction is achieved according to the depth image. The sound source direction determined from the depth image is used as the reference for adjusting the beamforming direction of the microphone array, improving the sound quality and intensity in the user's sound source direction. Compared with the prior art, the present invention determines the user's sound source direction more accurately through the depth image and thus determines the beamforming direction of the microphone array more conveniently and more accurately, so that the microphone array can be precisely aimed at the user's sound source direction and voice enhancement is achieved; this avoids the prior-art defect that, because the determined sound source direction is inaccurate, the voice signal actually sent by the user is misjudged as noise and eliminated during voice enhancement processing, so that the voice command cannot be obtained and recognized, and improves the effect of voice enhancement so that the voice command can be recognized.
Those skilled in the art will appreciate that the embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the methods, devices (systems) and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

It should be noted that the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or also includes elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

Numerous specific details are set forth in the description of the present invention. It should be understood, however, that the embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail so as not to obscure the understanding of this description. Similarly, it should be understood that, in order to streamline the disclosure and to aid the understanding of one or more of the various inventive aspects, the various features of the present invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments of the present invention. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the claims reflect, the inventive aspects lie in less than all features of a single previously disclosed embodiment. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.

The above are merely specific embodiments of the present invention. Under the above teaching of the present invention, those skilled in the art can make other improvements or modifications on the basis of the above embodiments. Those skilled in the art should understand that the above specific description is only intended to better explain the purpose of the present invention, and the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (16)

  1. A voice enhancement method for a smart device, comprising:
    monitoring and collecting, in real time, a voice signal sent by a user;
    determining the direction of the user according to the voice signal;
    collecting a depth image in the direction in which the user is located;
    determining the sound source direction of the user according to the depth image;
    adjusting the beamforming direction of a microphone array on the smart device according to the sound source direction of the user, and performing enhancement processing on the voice signal.
  2. The method according to claim 1, wherein the method further comprises:
    monitoring the movement of the user in real time;
    when user movement is detected, acquiring the moving direction of the user;
    controlling the smart device to move toward the moving direction of the user, and collecting a depth image after the user has moved.
  3. The method according to claim 1, wherein the voice signal sent by the user is monitored and collected in real time through a microphone array, and the depth image in the direction in which the user is located is collected through a depth camera;
    the determining the sound source direction of the user according to the depth image comprises:
    determining the spatial position coordinates of a preset part of the user according to the depth image;
    determining the sound source direction of the user according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.
  4. The method according to claim 3, wherein the method further comprises:
    determining the distance between the smart device and the user according to the depth image in the direction in which the user is located;
    determining, according to the distance between the smart device and the user, whether to control the smart device to move;
    if so, controlling the smart device to move toward the direction in which the user is located, shortening the distance between the smart device and the user.
  5. The method according to claim 4, wherein the determining the distance between the smart device and the user according to the depth image in the direction in which the user is located comprises:
    determining the spatial position coordinates of the preset part of the user according to the depth image in the direction in which the user is located;
    determining the distance between the smart device and the preset part of the user according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user;
    and the determining, according to the distance between the smart device and the user, whether to control the smart device to move comprises:
    if the distance between the smart device and the preset part of the user is greater than a preset distance threshold, controlling the smart device to move.
  6. The method according to claim 3, wherein the preset part of the user is the head or the neck of the user.
  7. A voice enhancement apparatus for a smart device, comprising:
    a voice signal collecting unit, configured to monitor and collect, in real time, a voice signal sent by a user;
    a user direction determining unit, configured to determine the direction of the user according to the voice signal;
    a depth image acquisition unit, configured to collect a depth image in the direction in which the user is located;
    a sound source direction determining unit, configured to determine the sound source direction of the user according to the depth image;
    an enhancement processing unit, configured to adjust the beamforming direction of a microphone array on the smart device according to the sound source direction of the user, and to perform enhancement processing on the voice signal.
  8. The apparatus according to claim 7, further comprising:
    a movement monitoring unit, configured to monitor the movement of the user in real time;
    a moving direction acquisition unit, configured to acquire the moving direction of the user when user movement is detected;
    a motion control unit, configured to control the smart device to move toward the moving direction of the user;
    wherein the depth image acquisition unit is further configured to collect a depth image after the user has moved.
  9. The apparatus according to claim 7, wherein the sound source direction determining unit is specifically configured to: determine the spatial position coordinates of a preset part of the user according to the depth image;
    and determine the sound source direction of the user according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.
  10. The apparatus according to claim 9, further comprising:
    a distance determining unit, configured to determine the distance between the smart device and the preset part of the user according to the depth image in the direction in which the user is located;
    a judging unit, configured to determine that the smart device should be controlled to move when the distance between the smart device and the preset part of the user is greater than a preset distance threshold;
    a motion control unit, configured to, when it is determined that the smart device should be controlled to move, control the smart device to move toward the direction in which the user is located, shortening the distance between the smart device and the user.
  11. A smart device, comprising: a memory and a processor, the memory and the processor being communicably connected through an internal bus; and further comprising a voice collector and a depth camera each connected to the processor;
    wherein the voice collector monitors and collects, in real time, a voice signal sent by a user and sends the voice signal to the processor; the depth camera collects a depth image in the direction in which the user is located and sends the depth image to the processor;
    the memory stores program instructions executable by the processor, and when the program instructions are executed by the processor, the following steps can be implemented:
    determining the direction of the user according to the received voice signal;
    receiving the depth image in the direction in which the user is located, and determining the sound source direction of the user according to the depth image;
    adjusting the beamforming direction of a microphone array on the smart device according to the sound source direction of the user, and performing enhancement processing on the voice signal.
  12. The smart device according to claim 11, wherein the depth camera is further configured to monitor the movement of the user in real time and to send the movement of the user to the processor;
    when the program instructions are executed by the processor, the following steps are further implemented:
    according to the received movement of the user:
    when user movement is detected, acquiring the moving direction of the user;
    controlling the smart device to move toward the moving direction of the user, and collecting a depth image after the user has moved.
  13. The smart device according to claim 11, wherein the processor determining the sound source direction of the user according to the depth image comprises:
    determining the spatial position coordinates of a preset part of the user according to the depth image;
    determining the sound source direction of the user according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user.
  14. The smart device according to claim 13, wherein when the program instructions are executed by the processor, the following steps are further implemented:
    determining the distance between the smart device and the user according to the depth image in the direction in which the user is located;
    determining, according to the distance between the smart device and the user, whether to control the smart device to move;
    if so, controlling the smart device to move toward the direction in which the user is located, shortening the distance between the smart device and the user.
  15. The smart device according to claim 14, wherein the processor determining the distance between the smart device and the user according to the depth image in the direction in which the user is located comprises:
    determining the spatial position coordinates of the preset part of the user according to the depth image in the direction in which the user is located; and determining the distance between the smart device and the preset part of the user according to the spatial position coordinates of the microphone array and the spatial position coordinates of the preset part of the user;
    and the processor determining, according to the distance between the smart device and the user, whether to control the smart device to move comprises:
    if the distance between the smart device and the preset part of the user is greater than a preset distance threshold, controlling the smart device to move.
  16. The smart device according to claim 11, wherein the voice collector is a microphone array.
PCT/CN2018/094658 2017-10-13 2018-07-05 Voice enhancement method and apparatus for a smart device, and smart device WO2019071989A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/475,013 US10984816B2 (en) 2017-10-13 2018-07-05 Voice enhancement using depth image and beamforming

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710954593.9A 2017-10-13 2017-10-13 Voice enhancement method and apparatus for a smart device
CN201710954593.9 2017-10-13

Publications (1)

Publication Number Publication Date
WO2019071989A1 true WO2019071989A1 (zh) 2019-04-18

Family

ID=61141524

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/094658 WO2019071989A1 (zh) 2017-10-13 2018-07-05 一种智能设备的语音增强方法、装置及智能设备

Country Status (3)

Country Link
US (1) US10984816B2 (zh)
CN (1) CN107680593A (zh)
WO (1) WO2019071989A1 (zh)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107680593A (zh) * 2017-10-13 2018-02-09 歌尔股份有限公司 一种智能设备的语音增强方法及装置
CN108322855B (zh) * 2018-02-11 2020-11-17 北京百度网讯科技有限公司 用于获取音频信息的方法及装置
CN108364648B (zh) * 2018-02-11 2021-08-03 北京百度网讯科技有限公司 用于获取音频信息的方法及装置
CN108615534B (zh) * 2018-04-04 2020-01-24 百度在线网络技术(北京)有限公司 远场语音降噪方法及系统、终端以及计算机可读存储介质
CN108957392A (zh) * 2018-04-16 2018-12-07 深圳市沃特沃德股份有限公司 声源方向估计方法和装置
CN108877787A (zh) * 2018-06-29 2018-11-23 北京智能管家科技有限公司 语音识别方法、装置、服务器及存储介质
CN110673716B (zh) * 2018-07-03 2023-07-07 百度在线网络技术(北京)有限公司 智能终端与用户交互的方法、装置、设备及存储介质
CN110764520B (zh) * 2018-07-27 2023-03-24 杭州海康威视数字技术股份有限公司 飞行器控制方法、装置、飞行器和存储介质
CN111067354B (zh) * 2018-10-19 2022-06-07 佛山市顺德区美的饮水机制造有限公司 饮水机及其移动方法与装置
CN109410983A (zh) * 2018-11-23 2019-03-01 广东小天才科技有限公司 一种语音搜题方法及系统
CN110503970B (zh) * 2018-11-23 2021-11-23 腾讯科技(深圳)有限公司 一种音频数据处理方法、装置及存储介质
CN111273232B (zh) * 2018-12-05 2023-05-19 杭州海康威视系统技术有限公司 一种室内异常情况判断方法及系统
CN109640224B (zh) * 2018-12-26 2022-01-21 北京猎户星空科技有限公司 一种拾音方法及装置
CN109688512B (zh) * 2018-12-26 2020-12-22 北京猎户星空科技有限公司 一种拾音方法及装置
CN110121129B (zh) * 2019-06-20 2021-04-20 歌尔股份有限公司 耳机的麦克风阵列降噪方法、装置、耳机及tws耳机
CN110493690B (zh) * 2019-08-29 2021-08-13 北京搜狗科技发展有限公司 一种声音采集方法及装置
CN112578338A (zh) * 2019-09-27 2021-03-30 阿里巴巴集团控股有限公司 声源定位方法、装置、设备及存储介质
CN110864440B (zh) * 2019-11-20 2020-10-30 珠海格力电器股份有限公司 一种送风方法及送风装置、空调
CN111782045A (zh) * 2020-06-30 2020-10-16 歌尔科技有限公司 一种设备角度调节方法、装置、智能音箱及存储介质
CN112614508B (zh) * 2020-12-11 2022-12-06 北京华捷艾米科技有限公司 音视频结合的定位方法、装置、电子设备以及存储介质
CN113031901B (zh) 2021-02-19 2023-01-17 北京百度网讯科技有限公司 语音处理方法、装置、电子设备以及可读存储介质
CN113299287A (zh) * 2021-05-24 2021-08-24 山东新一代信息产业技术研究院有限公司 基于多模态的服务机器人交互方法、系统及存储介质
CN115575896B (zh) * 2022-12-01 2023-03-10 杭州兆华电子股份有限公司 一种针对非点声源声源图像的特征增强方法
CN116705047B (zh) * 2023-07-31 2023-11-14 北京小米移动软件有限公司 音频采集方法、装置及存储介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045122A (zh) * 2015-06-24 2015-11-11 张子兴 一种基于音频和视频的智能家居自然交互系统
CN106251857A (zh) * 2016-08-16 2016-12-21 青岛歌尔声学科技有限公司 声源方向判断装置、方法及麦克风指向性调节系统、方法
CN107680593A (zh) * 2017-10-13 2018-02-09 歌尔股份有限公司 一种智能设备的语音增强方法及装置

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8570378B2 (en) * 2002-07-27 2013-10-29 Sony Computer Entertainment Inc. Method and apparatus for tracking three-dimensional movements of an object using a depth sensing camera
US9384737B2 (en) * 2012-06-29 2016-07-05 Microsoft Technology Licensing, Llc Method and device for adjusting sound levels of sources based on sound source priority
KR102150013B1 (ko) * 2013-06-11 2020-08-31 삼성전자주식회사 음향신호를 위한 빔포밍 방법 및 장치
US9516412B2 (en) * 2014-03-28 2016-12-06 Panasonic Intellectual Property Management Co., Ltd. Directivity control apparatus, directivity control method, storage medium and directivity control system
CN104965426A (zh) * 2015-06-24 2015-10-07 百度在线网络技术(北京)有限公司 基于人工智能的智能机器人控制系统、方法和装置
US9530426B1 (en) * 2015-06-24 2016-12-27 Microsoft Technology Licensing, Llc Filtering sounds for conferencing applications
CN105058389A (zh) * 2015-07-15 2015-11-18 深圳乐行天下科技有限公司 一种机器人系统、机器人控制方法及机器人
US10079028B2 (en) * 2015-12-08 2018-09-18 Adobe Systems Incorporated Sound enhancement through reverberation matching
US10424314B2 (en) * 2015-12-23 2019-09-24 Intel Corporation Techniques for spatial filtering of speech
CN106024003B (zh) * 2016-05-10 2020-01-31 北京地平线信息技术有限公司 结合图像的语音定位和增强系统及方法
CN106203259A (zh) * 2016-06-27 2016-12-07 旗瀚科技股份有限公司 机器人的交互方向调整方法及装置


Also Published As

Publication number Publication date
CN107680593A (zh) 2018-02-09
US20190378530A1 (en) 2019-12-12
US10984816B2 (en) 2021-04-20

Similar Documents

Publication Publication Date Title
WO2019071989A1 (zh) 一种智能设备的语音增强方法、装置及智能设备
CN109506568B (zh) 一种基于图像识别和语音识别的声源定位方法及装置
US9977954B2 (en) Robot cleaner and method for controlling a robot cleaner
US9532140B2 (en) Listen to people you recognize
EP3872689B1 (en) Liveness detection method and device, electronic apparatus, storage medium and related system using the liveness detection method
CN106131413B (zh) 一种拍摄设备的控制方法及拍摄设备
US10659670B2 (en) Monitoring system and control method thereof
TWI622474B (zh) 機器人系統及其控制方法
WO2016019768A1 (zh) 用于视频监控的声源定向控制装置及方法
TW201941104A (zh) 智慧設備的控制方法、裝置、設備和存儲介質
WO2018157827A1 (zh) 一种动态人眼跟踪的虹膜采集装置、动态人眼跟踪的虹膜识别装置及方法
US20160094812A1 (en) Method And System For Mobile Surveillance And Mobile Infant Surveillance Platform
WO2018090252A1 (zh) 机器人语音指令识别的方法及相关机器人装置
CN109151393A (zh) 一种声音定位识别侦测方法
WO2017101292A1 (zh) 自动对焦的方法、装置和系统
JP2016532217A (ja) グリントにより眼を検出する方法および装置
JP2020526094A (ja) 使用者信号処理方法およびこのような方法を遂行する装置
JP6845121B2 (ja) ロボットおよびロボット制御方法
CN209579577U (zh) 一种视觉机器人的声源跟踪系统和清洁机器人
CN111103807A (zh) 一种家用终端设备的控制方法及装置
KR100936244B1 (ko) 로봇용 지능형 음성입력 장치 및 그 운용 방법
CN113034526B (zh) 一种抓取方法、抓取装置及机器人
US10796711B2 (en) System and method for dynamic optical microphone
CN115958589A (zh) 用于机器人的手眼标定的方法和装置
TW202117321A (zh) 聲音擷取裝置及加工機刀具狀態偵測設備

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18865785

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18865785

Country of ref document: EP

Kind code of ref document: A1