WO2020125038A1 - Voice control method and device - Google Patents

Voice control method and device

Info

Publication number
WO2020125038A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
threshold
lip
feature vectors
user
Prior art date
Application number
PCT/CN2019/100982
Other languages
English (en)
French (fr)
Inventor
张文涛
乔慧丽
Original Assignee
南京人工智能高等研究院有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京人工智能高等研究院有限公司 filed Critical 南京人工智能高等研究院有限公司
Publication of WO2020125038A1 publication Critical patent/WO2020125038A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/24 - Speech recognition using non-acoustical features
    • G10L15/25 - Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L15/26 - Speech to text systems

Definitions

  • the invention relates to the field of voice recognition, and in particular to a voice control method and device.
  • the embodiments of the present application provide a voice control method and device.
  • a voice control method, which includes: acquiring voice feature data of a user and acquiring lip feature data corresponding to the voice feature data; determining a control word for controlling a terminal based on the voice feature data and the lip feature data; and controlling the terminal to perform the operation corresponding to the control word.
  • a voice control device, including: an acquisition module for acquiring the user's voice feature data and the lip feature data corresponding to the voice feature data; a determination module for determining the control word of the control terminal based on the voice feature data and the lip feature data; and a control module for controlling the terminal to perform the operation corresponding to the control word.
  • a computer-readable storage medium stores a computer program, and the computer program is used to execute the foregoing voice control method.
  • an electronic device including: a processor; a memory for storing processor-executable instructions, wherein the processor is used to execute the foregoing voice control method.
  • Embodiments of the present application provide a voice control method and device that recognize control words in a user's voice by fusing voice feature data and lip feature data. In many situations such as high noise, dim light, and low sound energy, this improves the accuracy with which the user's voice is captured and recognized, thereby improving the accuracy of voice recognition, improving the user experience, and enhancing the naturalness of human-computer interaction.
  • FIG. 1A is a schematic diagram of a system architecture of a voice control system provided by an exemplary embodiment of the present application.
  • FIG. 1B is a schematic flowchart of a voice control method provided by an exemplary embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a voice control device provided by an exemplary embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a voice control device provided by another exemplary embodiment of the present application.
  • FIG. 8 is a block diagram of an electronic device provided by an exemplary embodiment of the present application.
  • the voice interaction process includes a voice wake-up process and a voice command recognition process.
  • In practical application scenarios, wind noise, horn sounds, road noise, and other loud noises are easily mixed in, which interferes with the voice interaction process.
  • Taking the voice wake-up process as an example, in noisy environments problems such as false wake-up, failure to wake, and the user having to shout are prone to occur, resulting in a poor user experience.
  • FIG. 1A is a schematic diagram of a system architecture of a voice control system 1 according to an exemplary embodiment of the present application, which shows an application scenario of waking up a terminal (for example, a vehicle-mounted device).
  • the voice control system 1 includes an electronic device 10, a voice collection device 20 (for example, a microphone array), an image collection device 30 (for example, a camera), and a terminal 40.
  • the voice collecting device 20 is used to collect user's voice.
  • the image acquisition device 30 is used to acquire a video image containing the user's lip image.
  • the electronic device 10 is used to receive a voice signal and a video image signal from the voice collection device 20 and the image collection device 30, respectively, perform control word (e.g., wake-up word) recognition on the voice signal and the video image signal, and control the terminal 40 to perform the corresponding operation according to the control word recognition result.
  • It should be noted that the voice collection device 20 and the image collection device 30 in this application may also be integrated on the electronic device 10.
  • FIG. 1B is a schematic flowchart of a voice control method provided by an exemplary embodiment of the present application.
  • the execution subject of this embodiment may be, for example, the electronic device in FIG. 1A, as shown in FIG. 1B, the method includes the following steps:
  • Step 110 Obtain the voice characteristic data of the user.
  • Step 120 Acquire lip feature data corresponding to the voice feature data.
  • step 120 may be executed before step 110, or may be executed simultaneously with step 110.
  • In the embodiments of the present application, the technical solution is described in detail by taking an in-vehicle device as the terminal, where the in-vehicle device may be a speaker, a display device, or the like in an in-vehicle system, and the execution subject of the method (the electronic device) may be the controller of the in-vehicle device or the controller of the in-vehicle system.
  • the vehicle-mounted system may further include a camera and a microphone, and the camera and the microphone may be installed at positions corresponding to the main driver in the car.
  • the controller can collect the lip feature data of the user by controlling the camera, and at the same time collect the voice feature data of the user by controlling the microphone.
  • the lip feature data and the voice feature data correspond. For example, if the user says "start”, then the lip feature data and the voice feature data both correspond to "start".
  • the lip feature data may be an image representing changes in lip motion, or may be matrix or vector data extracted from the image to characterize the content of the lip language.
  • the voice feature data may be a voice segment, or may be matrix or vector data extracted from the voice segment to characterize the voice content.
  • Step 130 Determine the control word of the control terminal based on the voice feature data and the lip feature data.
  • the control word can be a wake-up word or a voice command.
  • the control word can be determined by comparing the voice feature data with preset voice feature data, comparing the lip feature data with preset lip feature data, and combining the comparison results of the two kinds of data.
  • the control word may be determined by simultaneously inputting the voice feature data and the lip feature data into the control word recognition model.
  • the control word recognition model may be obtained by training multiple samples, where each sample contains sample speech feature data and sample lip feature data. During training, multiple samples in different light environments or different noise environments can be trained and learned to obtain a control word recognition model.
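  • A concrete architecture for such a control word recognition model is not specified in this publication; purely as an illustration, a small fusion classifier over paired acoustic and lip feature sequences might be trained along the following lines (the shapes, names, and hyperparameters below are assumptions):

```python
# Hypothetical sketch of a control-word (wake-word) recognition model that fuses
# per-frame acoustic and lip feature vectors; shapes and names are assumptions.
import torch
import torch.nn as nn

class ControlWordRecognizer(nn.Module):
    def __init__(self, acoustic_dim=39, lip_dim=40, hidden=128, num_words=2):
        super().__init__()
        # num_words: e.g. 0 = "no control word", 1 = wake-up word
        self.rnn = nn.GRU(acoustic_dim + lip_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_words)

    def forward(self, acoustic, lip):
        # acoustic: (batch, frames, 39), lip: (batch, frames, 40),
        # frames assumed to be aligned one-to-one as described in the text
        combined = torch.cat([acoustic, lip], dim=-1)   # combined feature vectors
        _, h = self.rnn(combined)                       # summarize the sequence
        return self.classifier(h[-1])                   # logits over control words

# Training on samples recorded under different light / noise conditions:
model = ControlWordRecognizer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
acoustic = torch.randn(8, 50, 39)      # stand-in batch: 8 samples, 50 frames
lip = torch.randn(8, 50, 40)
labels = torch.randint(0, 2, (8,))
loss = loss_fn(model(acoustic, lip), labels)
loss.backward()
optimizer.step()
```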
  • Step 140 The control terminal performs an operation corresponding to the control word.
  • the controller controls the terminal to perform the corresponding operation.
  • When the control word is a wake-up word, the terminal is woken up; when the control word is a voice command, the terminal performs the operation corresponding to the command; when the control word includes both a wake-up word and a voice command, the terminal is woken up and then performs the operation corresponding to the command.
  • Taking a speaker as the terminal: when the speaker is off and the control word is the wake-up word "start", the speaker is woken up, enters the working state, and starts playing music; when the speaker is working and the control word is a voice command such as "decrease volume" or "reduce volume", the speaker's volume is lowered; when the speaker is off and the control word is "start, decrease volume" or "start, reduce volume", the speaker enters the working state and starts playing music at a volume lower than the setting before starting. The degree of volume reduction can be set in advance according to usage.
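  • As a toy illustration of mapping recognized control words to terminal operations in the speaker example above (the Speaker class and its methods are invented for this sketch, not taken from the publication):

```python
# Hypothetical dispatch of a recognized control word to a terminal operation.
class Speaker:
    def __init__(self):
        self.awake = False
        self.volume = 8

    def wake(self):
        self.awake = True                         # enter working state, play music

    def decrease_volume(self, step=2):
        self.volume = max(0, self.volume - step)  # preset degree of reduction

def handle_control_word(speaker, control_word):
    if "start" in control_word:                   # wake-up word
        speaker.wake()
    if "decrease volume" in control_word or "reduce volume" in control_word:
        speaker.decrease_volume()                 # voice command

speaker = Speaker()
handle_control_word(speaker, "start, decrease volume")
print(speaker.awake, speaker.volume)              # True 6
```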
  • An embodiment of the present application provides a voice control method that recognizes control words in a user's voice by fusing voice feature data and lip feature data. Combining the auditory and visual sensory channels yields multimodal information, which strengthens control word recognition and makes up for the shortcomings of recognizing control words from voice or lip data alone. This greatly improves the accuracy with which the user's voice is captured and recognized under conditions such as high noise, dim light, and low sound energy, thereby improving the accuracy of voice recognition, improving the user experience, and enhancing the naturalness of human-computer interaction.
  • the voice feature data may be voice feature vectors, and a voice feature vector may include FBank voice feature parameters or Mel-frequency cepstral coefficient (MFCC) voice feature parameters.
  • step 120 includes: collecting continuous multi-frame images containing changes in the user's lip motion, and extracting lip feature data based on the continuous multi-frame images.
  • In this embodiment, a video of the changes in the user's face over a certain period of time may be recorded by a camera, the continuous multi-frame images corresponding to the video within that period may be collected, and the lip feature data may then be extracted based on the continuous multi-frame images.
  • the continuous multi-frame images may be images containing the user's complete facial features, or images containing part of the facial features (lips).
  • For example, collecting the continuous multi-frame images may involve using a machine vision recognition model to identify continuous multi-frame face images of the user from multiple continuous video images, and then collecting the continuous multi-frame images (images containing part of the facial features) from those face images.
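  • A minimal sketch of that collection step, assuming an OpenCV Haar cascade as the machine vision recognition model and the lower third of the detected face as the lip region (both are assumptions, since the publication does not name a detector):

```python
# Sketch: collect lip-region frames from a video with an assumed face detector.
import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = face_detector.detectMultiScale(gray, 1.1, 5)
        if len(faces) == 0:
            continue
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest face
        # keep only the lower third of the face as a rough lip region
        frames.append(gray[y + 2 * h // 3 : y + h, x : x + w])
    cap.release()
    return frames
```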
  • step 120 may include: collecting continuous multi-frame images containing the changes in the user's lip motion; for each frame in the continuous multi-frame images, extracting a plurality of feature points describing the shape of the lips; and normalizing the coordinates of the feature points of each frame to obtain the lip feature data.
  • Compared with recognizing lip language by comparing the collected multi-frame lip images with preset multi-frame lip images containing control words, lip feature data obtained through feature points can improve the accuracy of lip-language recognition.
  • In addition, normalizing the coordinates of the feature points accelerates the convergence of data processing and thus increases the speed of lip recognition.
  • Specifically, each frame of the image corresponds to one lip shape; as shown in FIG. 9, the upper image is the lip image when the user is not speaking, and the lower image is the lip image when the user is speaking.
  • Multiple feature points can be extracted around the inner and outer lip edges in each lip shape, and each feature point can be represented by coordinates, and the multiple feature points can represent the lip shape in the frame image.
  • When the user is speaking, the head may swing, which can cause the coordinate origin in each frame to deviate when extracting feature points. Therefore, the feature points of the multi-frame images can be normalized so that the coordinate origins of the feature points in each frame are consistent, improving the accuracy of the lip feature data and thus the recognition rate of control words.
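  • A rough sketch of such per-frame normalization, assuming the lip center is used as the per-frame coordinate origin and the lip bounding-box diagonal as the scale (the publication does not fix a particular scheme):

```python
# Sketch: normalize lip feature points per frame so that head movement between
# frames does not shift the coordinate origin. The scheme here is an assumption.
import numpy as np

def normalize_lip_points(frames_points):
    """frames_points: list of (N, 2) arrays of inner/outer lip-edge points."""
    normalized = []
    for pts in frames_points:
        pts = np.asarray(pts, dtype=np.float32)
        center = pts.mean(axis=0)                  # per-frame origin: lip center
        scale = np.linalg.norm(pts.max(0) - pts.min(0)) + 1e-6
        normalized.append(((pts - center) / scale).ravel())   # lip feature vector
    return np.stack(normalized)                    # (frames, 2 * N)
```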
  • The control method further includes: determining whether the illuminance of the environment in which the user is located is greater than a first threshold, and determining whether the angle of the face in each of the continuous multi-frame images (the angle is zero when the face is directly facing the camera) is less than or equal to a second threshold, wherein, if the illuminance is greater than the first threshold and the angle is less than or equal to the second threshold, extracting the lip feature data based on the continuous multi-frame images is performed.
  • When the illuminance of the environment is lower than or equal to the first threshold, a light-emitting device can be controlled to supplement the light, and it is then determined whether the angle of the face is less than or equal to the second threshold; if so, the continuous multi-frame images containing the changes in the user's lip motion are collected.
  • Alternatively, when the illuminance is lower than or equal to the first threshold, the acquisition of lip feature data may be abandoned and the control word determined directly from the voice feature data.
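  • The gating described above might be sketched as follows, where estimating illuminance from mean pixel brightness and obtaining the face angle from a separate head-pose estimator are assumptions made only for illustration:

```python
# Sketch of the illuminance / face-angle gate; thresholds are assumed values.
import numpy as np

FIRST_THRESHOLD = 60       # assumed illuminance proxy (mean gray level)
SECOND_THRESHOLD = 30.0    # assumed maximum face angle in degrees

def should_extract_lip_features(gray_frames, estimate_yaw, turn_on_light):
    """gray_frames: grayscale frames; estimate_yaw: hypothetical pose estimator."""
    illuminance = float(np.mean([f.mean() for f in gray_frames]))
    if illuminance <= FIRST_THRESHOLD:
        turn_on_light()        # supplement the light with a light-emitting device
        return False           # or fall back to voice feature data only
    # 0 degrees means the face is directly facing the camera
    return all(abs(estimate_yaw(f)) <= SECOND_THRESHOLD for f in gray_frames)
```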
  • The control method further includes: determining whether the energy of the user's voice is greater than a third threshold, and determining whether the duration of the sound is greater than a fourth threshold, wherein, if the energy is greater than the third threshold and the duration is greater than the fourth threshold, acquiring the voice feature data of the user is performed.
  • When the energy of the user's voice is too small, effective voice feature data cannot be obtained from the sound; that is, even if lip feature data is obtained, it is difficult to determine the control word.
  • In addition, when the user's voice lasts too short a time, it may not contain an actual word, for example "um" or "oh".
  • In these two cases, obtaining voice feature data from the sound would increase the computational burden of the controller without yielding a control word. Therefore, the voice feature data is acquired from the sound only when its energy is greater than the third threshold and its duration is greater than the fourth threshold.
  • the third threshold and the fourth threshold can be set according to the control word actually used.
  • the acquisition of speech feature data may be abandoned, and the control word is directly determined based on the lip feature data.
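  • A minimal sketch of the energy/duration gate, with arbitrary assumed threshold values and a crude activity mask standing in for a real voice activity detector:

```python
# Sketch: voice features are only extracted when the sound is loud enough and
# long enough. Thresholds and the activity mask are assumptions.
import numpy as np

THIRD_THRESHOLD = 1e-3     # mean frame energy, assumed value
FOURTH_THRESHOLD = 0.3     # seconds, assumed value

def passes_voice_gate(samples, sample_rate=16000):
    samples = np.asarray(samples, dtype=np.float32)   # assumed in [-1, 1]
    voiced = samples[np.abs(samples) > 0.01]          # crude activity mask
    energy = float(np.mean(samples ** 2))
    duration = len(voiced) / sample_rate
    return energy > THIRD_THRESHOLD and duration > FOURTH_THRESHOLD
```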
  • In an embodiment, the control word includes a wake-up word, the lip feature data includes a plurality of lip feature vectors, the voice feature data includes a plurality of acoustic feature vectors, and the control word recognition model is a wake-up word recognition model.
  • step 130 includes: using multiple lip feature vectors and multiple acoustic feature vectors to obtain multiple combined feature vectors; using a wake-up word recognition model to perform wake-up word recognition on the multiple combined feature vectors to obtain wake-up words.
  • the above speech feature vector may specifically be an acoustic feature vector.
  • The acoustic feature vector may be a vector composed of MFCC (Mel-frequency cepstral coefficient) parameters, differential parameters (characterizing changes in the MFCC parameters), and frame energy, where the MFCC parameters characterize the static features of speech, the differential parameters characterize its dynamic features, and the frame energy characterizes its energy features.
  • the acoustic feature vector is a 39-dimensional MFCC vector, including 12-dimensional MFCC parameters, 12-dimensional first-order differential parameters, 12-dimensional second-order differential parameters, and 3-dimensional frame energy.
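  • One way to assemble such 39-dimensional acoustic feature vectors with librosa is sketched below; treating the 3 energy dimensions as RMS frame energy plus its first- and second-order deltas is an assumption about the exact composition:

```python
# Sketch: 12 MFCCs + 12 first-order deltas + 12 second-order deltas + 3 energy
# terms = one 39-dimensional acoustic feature vector per frame.
import librosa
import numpy as np

def acoustic_features(path):
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)       # static features
    d1 = librosa.feature.delta(mfcc)                          # dynamic features
    d2 = librosa.feature.delta(mfcc, order=2)
    rms = librosa.feature.rms(y=y)                            # frame energy
    energy = np.vstack([rms, librosa.feature.delta(rms),
                        librosa.feature.delta(rms, order=2)])
    feats = np.vstack([mfcc, d1, d2, energy])                 # (39, frames)
    return feats.T                                            # (frames, 39)
```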
  • Each image in a continuous multi-frame image corresponds to a lip feature vector
  • the lip feature vector may be composed of multiple feature points.
  • the voice collected by the microphone can also be divided into multiple frames according to time, and each frame of voice corresponds to a sound feature vector.
  • Multiple lip feature vectors and multiple voice feature vectors can be in one-to-one correspondence, and multiple combined feature vectors can be obtained by combining the lip feature vector and the voice feature vector.
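  • A sketch of that pairing and combination step; because audio and video frame rates usually differ, nearest-frame resampling is used here to make the correspondence one-to-one (an implementation choice, not something stated in the publication):

```python
# Sketch: pair per-frame lip feature vectors with per-frame acoustic feature
# vectors and concatenate them into combined feature vectors.
import numpy as np

def combine(lip_vectors, acoustic_vectors):
    lip_vectors = np.asarray(lip_vectors)            # (n_video_frames, lip_dim)
    acoustic_vectors = np.asarray(acoustic_vectors)  # (n_audio_frames, 39)
    n = acoustic_vectors.shape[0]
    # map each audio frame to the nearest video frame (one-to-one pairing)
    idx = np.linspace(0, len(lip_vectors) - 1, n).round().astype(int)
    return np.hstack([lip_vectors[idx], acoustic_vectors])   # (n, lip_dim + 39)
```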
  • FIG. 2 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
  • FIG. 2 is an example of FIG. 1B. In order to avoid repetition, the same parts are not explained in detail.
  • the control method includes the following.
  • Step 205 Use the microphone array to collect the user's voice.
  • the microphone array may refer to a microphone containing multiple channels.
  • Step 210 Acquire user's voice characteristic data.
  • Step 215 Collect continuous multi-frame images containing changes in the user's lips.
  • Step 220 If the illuminance is greater than the first threshold and the angle faced by the face is less than or equal to the second threshold, then step 225 is performed; otherwise, step 250 is performed.
  • Step 225 Extract lip feature data based on continuous multi-frame images.
  • Step 230 Determine the control word of the control terminal based on the voice characteristic data and the lip characteristic data.
  • Step 240 The control terminal performs an operation corresponding to the control word.
  • Step 250 If the illuminance is greater than the first threshold and the angle the face is facing is greater than the second threshold, then perform steps 255 and 260, otherwise end the entire process.
  • Step 255 Determine the sound source location of the sound using the time difference of arrival method (TDOA).
  • Step 260 Adjust the camera angle of the camera used to shoot consecutive multi-frame images according to the position of the sound source.
  • the location of the sound source of the sound can be determined by the TDOA method, and then the camera angle can be adjusted so that the camera can collect the frontal image of the user.
  • Step 255 and step 215 can be performed at the same time, that is, the user's facial image is collected while adjusting the camera angle.
  • At a given moment, the angle of the face relative to the camera may be greater than the second threshold; adjusting the camera angle according to the sound source position can bring the angle below the second threshold at the next moment, which improves the accuracy of the lip feature data.
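  • The publication only names TDOA for step 255; a common way to realize it for a two-microphone array is GCC-PHAT, sketched below with an assumed microphone spacing:

```python
# Sketch: estimate the direction of the sound source from a two-channel
# microphone signal with GCC-PHAT; the camera angle could then follow it.
import numpy as np

def gcc_phat_delay(sig, ref, fs):
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12                    # PHAT weighting
    corr = np.fft.irfft(cross, n=n)
    corr = np.concatenate((corr[-(n // 2):], corr[: n // 2 + 1]))
    return (np.argmax(np.abs(corr)) - n // 2) / fs    # delay in seconds

def source_angle(sig, ref, fs=16000, mic_distance=0.1, c=343.0):
    tdoa = gcc_phat_delay(sig, ref, fs)
    value = np.clip(tdoa * c / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(value))               # bearing relative to broadside
```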
  • FIG. 3 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
  • FIG. 3 is an example of FIG. 1B. In order to avoid repetition, the same parts are not explained in detail.
  • the control method includes the following.
  • Step 305 Use the microphone array to collect the user's voice.
  • Step 310 Determine whether the energy of the user's voice is greater than the third threshold, and determine whether the duration of the voice is greater than the fourth threshold.
  • If the energy of the sound is greater than the third threshold and the duration of the sound is greater than the fourth threshold, step 315 is performed; otherwise, step 320 is performed.
  • Step 315 Obtain the voice characteristic data of the user.
  • Step 320 If the energy of the sound is less than or equal to the third threshold, or the duration of the sound is less than or equal to the fourth threshold, it is further determined whether the similarity between the lip feature data of two adjacent images in the continuous multi-frame images is below a fifth threshold.
  • The controller may determine, in chronological order, whether the similarity between the lip feature data of each pair of adjacent frames is below the fifth threshold. If the similarity between the lip feature data of two adjacent images is below the fifth threshold, the user's lips are moving; the user may be about to say a sentence containing a control word again, so step 305 can be repeated within a preset time.
  • Step 325 Acquire continuous multi-frame images containing changes in the user's lips.
  • Step 330 Extract lip feature data based on continuous multi-frame images.
  • Step 330 may be performed before step 320.
  • Step 340 Determine the control word of the control terminal based on the voice characteristic data and the lip characteristic data.
  • Step 345 The control terminal performs the operation corresponding to the control word.
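  • The adjacent-frame similarity check in step 320 might be realized as follows, using cosine similarity between lip feature vectors and an assumed fifth-threshold value:

```python
# Sketch: if two adjacent frames' lip feature vectors are dissimilar enough,
# the lips are judged to be moving and the voice is collected again.
import numpy as np

FIFTH_THRESHOLD = 0.98     # assumed similarity threshold

def lips_are_moving(lip_vectors):
    for a, b in zip(lip_vectors, lip_vectors[1:]):
        sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
        if sim < FIFTH_THRESHOLD:
            return True    # re-collect the user's voice within the preset time
    return False
```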
  • FIG. 4 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
  • FIG. 4 is an example of FIG. 1B. In order to avoid repetition, the same parts are not explained in detail.
  • the control method includes the following.
  • Step 410 Acquire multiple acoustic feature vectors of the user.
  • Step 420 Acquire multiple lip feature vectors corresponding to multiple acoustic feature vectors.
  • When the energy of the user's voice is small, or the surrounding environment is noisy (for example due to horns, engine noise, or other passengers talking), it is difficult for the controller to extract accurate acoustic feature vectors. In these cases the controller can reduce the weight of the acoustic feature vectors and increase the weight of the lip feature vectors when forming the combined feature vectors; that is, by adjusting the weight values of the two kinds of vectors, more accurate combined feature vectors can be obtained.
  • If the energy of the sound is less than a sixth threshold or the noise level of the sound is greater than a seventh threshold, step 431 is executed.
  • the sixth threshold may be greater than or equal to the third threshold in FIG. 3.
  • Step 431 Determine a first weight value corresponding to multiple lip feature vectors and a second weight value corresponding to multiple acoustic feature vectors.
  • When the energy of the sound is less than the sixth threshold or the noise level of the sound is greater than the seventh threshold, the first weight value is greater than the second weight value.
  • If the energy of the sound is greater than or equal to the sixth threshold and the noise level is less than or equal to the seventh threshold, step 431 may also be performed, in which case the first weight value may be equal to the second weight value.
  • the distribution of weight values may be determined by a weight value distribution model, which may be obtained by performing deep learning on samples of different scenes (sound environment, illuminance, angle of face facing).
  • Step 432 Use the first weight value and the second weight value to perform weighted calculation on the multiple lip feature vectors and the multiple acoustic feature vectors to obtain multiple combined feature vectors.
  • When the sound energy is small or the noise is large, increasing the weight of the lip feature vectors and reducing the weight of the acoustic feature vectors improves the accuracy of the combined feature vectors and thus the accuracy of wake-up word recognition.
  • Step 433 Use a wake-up word recognition model to recognize wake-up words from multiple combined feature vectors to obtain a wake-up word.
  • Step 440 Wake up the terminal.
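  • A sketch of the weighted combination in steps 431 and 432 (and, symmetrically, steps 531 and 532 below); the concrete weight values are assumptions, since the publication leaves them to a weight distribution model:

```python
# Sketch: scale lip and acoustic feature vectors by scene-dependent weights
# before concatenating them into combined feature vectors.
import numpy as np

def combined_vectors(lip_vectors, acoustic_vectors, noisy=False, dark=False):
    if noisy:                  # low sound energy or high noise: trust lips more
        w_lip, w_acoustic = 0.7, 0.3
    elif dark:                 # low illuminance: trust the acoustics more
        w_lip, w_acoustic = 0.3, 0.7
    else:
        w_lip = w_acoustic = 0.5
    return np.hstack([w_lip * np.asarray(lip_vectors),
                      w_acoustic * np.asarray(acoustic_vectors)])
```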
  • FIG. 5 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
  • FIG. 5 is an example of FIG. 1B. In order to avoid repetition, the same parts are not explained in detail.
  • the control method includes the following.
  • Step 510 Acquire multiple acoustic feature vectors of the user.
  • Step 520 Acquire multiple lip feature vectors corresponding to multiple acoustic feature vectors.
  • It is determined whether the illuminance of the environment in which the user is located is less than an eighth threshold; if the illuminance is less than the eighth threshold, step 531 is executed.
  • Step 531 Determine a third weight value corresponding to multiple lip feature vectors and a fourth weight value corresponding to multiple acoustic feature vectors.
  • When the illuminance of the user's environment is less than the eighth threshold, the third weight value is less than the fourth weight value.
  • If the illuminance is greater than or equal to the eighth threshold, step 531 may also be executed, in which case the third weight value may be equal to the fourth weight value.
  • Step 532 Use the third weight value and the fourth weight value to perform weighted calculation on the multiple lip feature vectors and the multiple acoustic feature vectors to obtain multiple combined feature vectors.
  • In dim light, reducing the weight of the lip feature vectors and increasing the weight of the acoustic feature vectors improves the accuracy of the combined feature vectors and thus the accuracy of wake-up word recognition.
  • Step 533 Use a wake-up word recognition model to recognize wake-up words from multiple combined feature vectors to obtain a wake-up word.
  • Step 540 Wake up the terminal.
  • the weight values of the lip feature vector and the speech feature vector may also be redistributed according to the angle of the face in the image.
  • The embodiments shown in FIG. 1 to FIG. 5 can complement each other to obtain a control word recognition process with more complete functions and higher accuracy.
  • FIG. 6 is a schematic structural diagram of a voice control device 600 provided by an exemplary embodiment of the present application. As shown in FIG. 6, the device 600 includes an acquisition module 610, a determination module 620, and a control module 630.
  • The obtaining module 610 is used to obtain the user's voice feature data and the lip feature data corresponding to the voice feature data; the determination module 620 is used to determine the control word of the control terminal based on the voice feature data and the lip feature data; and the control module 630 is used to control the terminal to perform the operation corresponding to the control word.
  • An embodiment of the present application provides a voice control device that recognizes control words in a user's voice by fusing voice feature data and lip feature data. Fusing the auditory and visual sensory channels yields multimodal information, which strengthens control word recognition and makes up for the shortcomings of recognizing control words from voice or lip data alone. This greatly improves the accuracy with which the user's voice is captured and recognized under conditions such as high noise, dim light, and low sound energy, thereby improving the accuracy of voice recognition, improving the user experience, and enhancing the naturalness of human-computer interaction.
  • the obtaining module 610 is used to collect continuous multi-frame images containing changes in the user's lip motion, and extract lip feature data based on the continuous multi-frame images.
  • The acquisition module 610 is used to extract, for each frame in the continuous multi-frame images, a plurality of feature points describing the shape of the lips, and to normalize the coordinates of the feature points of each frame to obtain the lip feature data.
  • the determination module 620 is further used to determine whether the illuminance of the environment where the user is located is greater than the first threshold, and determine whether the angle of the face in each frame of the continuous multi-frame images is less than or equal to the second Threshold.
  • In this embodiment, if the illuminance is greater than the first threshold and the angle is less than or equal to the second threshold, the acquisition module 610 extracts the lip feature data based on the continuous multi-frame images.
  • the determination module 620 is further used to determine whether the energy of the user's voice is greater than the third threshold, and determine whether the duration of the voice is greater than the fourth threshold.
  • FIG. 7 is a schematic structural diagram of a voice control apparatus 700 provided by another exemplary embodiment of the present application.
  • FIG. 7 is an example of FIG. 6, and the same points will not be repeated.
  • The apparatus 700 includes an acquisition module 710, a determination module 720, a control module 730, and a collection module 740.
  • For the specific functions of the acquisition module 710, the determination module 720, and the control module 730, refer to the description of FIG. 6.
  • the collection module 740 is used to collect the user's voice using the microphone array.
  • If the angle of the face is greater than the second threshold, the determination module 720 is also used to determine the sound source position of the sound using the time difference of arrival (TDOA) method.
  • the device 700 further includes: an adjustment module 750 for adjusting the camera angle of the camera used to capture consecutive multi-frame images according to the position of the sound source.
  • In this embodiment, if the energy of the sound is greater than the third threshold and the duration of the sound is greater than the fourth threshold, the acquiring module 710 acquires the voice feature data of the user.
  • the determination module 720 is further configured to determine whether the similarity between the lip feature data of two adjacent images in consecutive multi-frame images is lower than a fifth threshold.
  • If the similarity is below the fifth threshold, and the energy of the sound is less than or equal to the third threshold or the duration of the sound is less than or equal to the fourth threshold, the collection module 740 repeatedly collects the user's voice within a preset time.
  • The control word includes a wake-up word, the lip feature data includes multiple lip feature vectors, and the voice feature data includes multiple acoustic feature vectors; the determination module 720 is configured to obtain multiple combined feature vectors from the multiple lip feature vectors and the multiple acoustic feature vectors, and to use the wake-up word recognition model to perform wake-up word recognition on the multiple combined feature vectors to obtain the wake-up word.
  • the determining module 720 is further used to determine whether the energy of the user's voice is less than the sixth threshold, and determine whether the noise level of the voice is greater than the seventh threshold.
  • If the energy of the sound is less than the sixth threshold or the noise level of the sound is greater than the seventh threshold, the determination module 720 is used to determine a first weight value corresponding to the lip feature vectors and a second weight value corresponding to the acoustic feature vectors, where the first weight value is greater than the second weight value, and to use the first and second weight values to perform a weighted calculation on the multiple lip feature vectors and multiple acoustic feature vectors to obtain the multiple combined feature vectors.
  • the determination module 720 is further used to determine whether the illuminance of the environment where the user is located is less than the eighth threshold.
  • If the illuminance is less than the eighth threshold, the determination module 720 is used to determine a third weight value corresponding to the lip feature vectors and a fourth weight value corresponding to the acoustic feature vectors, where the third weight value is less than the fourth weight value, and to use the third and fourth weight values to perform a weighted calculation on the multiple lip feature vectors and multiple acoustic feature vectors to obtain the multiple combined feature vectors.
  • the acquisition module 710 is used to identify a user's continuous multi-frame face images from multiple continuous video images using a machine vision recognition model, and collect continuous multi-frame images from the continuous multi-frame face images.
  • the electronic device 80 can perform the aforementioned voice control process.
  • FIG. 8 illustrates a block diagram of an electronic device according to an embodiment of the present application.
  • the electronic device 80 includes one or more processors 81 and memory 82.
  • the processor 81 may be a central processing unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 80 to perform desired functions.
  • the memory 82 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • the volatile memory may include, for example, random access memory (RAM) and/or cache memory.
  • the non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, and the like.
  • One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 81 may execute the program instructions to implement the voice control methods of the embodiments of the present application described above and/or other desired functions.
  • Various contents such as voice signals, video image signals, etc. can also be stored in the computer-readable storage medium.
  • the electronic device 80 may further include: an input device 83 and an output device 84, and these components are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
  • the input device 83 may be the aforementioned microphone array and camera, which are used to capture input signals of voice and video images, respectively.
  • the input device 83 may be a communication network connector for receiving the collected input signals from the microphone array and the camera.
  • the input device 83 may also include, for example, a keyboard, a mouse, and the like.
  • the output device 84 can output various information to the outside, including the determined control word and the like.
  • the output device 84 may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output device.
  • the electronic device 80 may also include any other suitable components.
  • Embodiments of the present application may also be computer program products, which include computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice control methods according to the various embodiments of the present application described in the "Exemplary Method" section of this specification.
  • The computer program product may include program code for performing the operations of the embodiments of the present application written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages.
  • The program code may execute entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • An embodiment of the present application may also be a computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, cause the processor to perform the steps in the voice control methods according to the various embodiments of the present application described in the "Exemplary Method" section of this specification.
  • the computer-readable storage medium may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • The readable storage medium may include, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • In the devices, apparatuses, and methods of the present application, each component or step can be decomposed and/or recombined; such decompositions and/or recombinations shall be regarded as equivalent solutions of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice control method and device. The voice control method includes: acquiring voice feature data of a user (110) and acquiring lip feature data corresponding to the voice feature data (120); determining a control word for controlling a terminal based on the voice feature data and the lip feature data (130); and controlling the terminal to perform the operation corresponding to the control word (140). In many situations such as high noise, dim light, and low sound energy, the accuracy with which the user's voice is captured can be improved, thereby improving the accuracy of voice recognition, improving the user experience, and enhancing the naturalness of human-computer interaction.

Description

Voice control method and device
Technical Field
The present invention relates to the field of voice recognition, and in particular to a voice control method and device.
Background
As people's requirements for human-computer interaction increase, more and more devices make use of voice recognition technology. Most existing voice interaction methods use voice to wake up a device or to control it to execute an instruction corresponding to the voice. This kind of interaction adapts poorly; for example, in a noisy environment the accuracy of voice recognition is low and the device responds badly, resulting in a poor user experience.
Therefore, there is an urgent need for a voice control method and device with high accuracy.
Summary of the Invention
In order to solve the above technical problem, embodiments of the present application provide a voice control method and device.
According to one aspect of the present application, a voice control method is provided, including: acquiring voice feature data of a user and acquiring lip feature data corresponding to the voice feature data; determining a control word for controlling a terminal based on the voice feature data and the lip feature data; and controlling the terminal to perform the operation corresponding to the control word.
According to another aspect of the present application, a voice control device is provided, including: an acquisition module for acquiring the user's voice feature data and the lip feature data corresponding to the voice feature data; a determination module for determining the control word of the control terminal based on the voice feature data and the lip feature data; and a control module for controlling the terminal to perform the operation corresponding to the control word.
According to yet another aspect of the present application, a computer-readable storage medium is provided, which stores a computer program used to execute the above voice control method.
According to yet another aspect of the present application, an electronic device is provided, including: a processor; and a memory for storing processor-executable instructions, wherein the processor is used to execute the above voice control method.
Embodiments of the present application provide a voice control method and device that recognize control words in a user's voice by fusing voice feature data and lip feature data. In many situations such as high noise, dim light, and low sound energy, this improves the accuracy with which the user's voice is captured, thereby improving the accuracy of voice recognition, improving the user experience, and enhancing the naturalness of human-computer interaction.
Brief Description of the Drawings
The above and other objectives, features, and advantages of the present application will become more apparent from the more detailed description of its embodiments given in combination with the accompanying drawings. The drawings provide a further understanding of the embodiments of the present application, constitute a part of the specification, and together with the embodiments serve to explain the application without limiting it. In the drawings, the same reference numerals generally denote the same components or steps.
FIG. 1A is a schematic diagram of the system architecture of a voice control system provided by an exemplary embodiment of the present application.
FIG. 1B is a schematic flowchart of a voice control method provided by an exemplary embodiment of the present application.
FIG. 2 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
FIG. 3 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
FIG. 4 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
FIG. 5 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application.
FIG. 6 is a schematic structural diagram of a voice control device provided by an exemplary embodiment of the present application.
FIG. 7 is a schematic structural diagram of a voice control device provided by another exemplary embodiment of the present application.
FIG. 8 is a block diagram of an electronic device provided by an exemplary embodiment of the present application.
FIG. 9 shows lip images provided by an exemplary embodiment of the present application.
Detailed Description
Hereinafter, example embodiments of the present application are described in detail with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present application rather than all of them, and it should be understood that the present application is not limited by the example embodiments described here.
Application Overview
With the development of human-computer interaction technology, more and more products have a voice interaction function. For example, in the field of intelligent driving, interacting with in-vehicle devices by voice frees the user's hands and improves driving safety. The voice interaction process includes a voice wake-up process and a voice command recognition process. In practical application scenarios, however, wind noise, horn sounds, road noise, and other loud noises are easily mixed in, interfering with voice interaction. Taking the voice wake-up process as an example, in a noisy environment problems such as false wake-up, failure to wake, and the user having to shout are prone to occur, resulting in a poor user experience.
Exemplary System
FIG. 1A is a schematic diagram of the system architecture of a voice control system 1 according to an example embodiment of the present application, showing an application scenario of waking up a terminal (for example, an in-vehicle device). As shown in FIG. 1A, the voice control system 1 includes an electronic device 10, a voice collection device 20 (for example, a microphone array), an image collection device 30 (for example, a camera), and a terminal 40. The voice collection device 20 is used to collect the user's voice. The image collection device 30 is used to collect video images containing the user's lip images. The electronic device 10 is used to receive a voice signal and a video image signal from the voice collection device 20 and the image collection device 30, respectively, perform control word (for example, wake-up word) recognition on them, and control the terminal 40 to perform the corresponding operation according to the control word recognition result.
It should be noted that the voice collection device 20 and the image collection device 30 in this application may also be integrated on the electronic device 10.
It should also be noted that the above application scenario is shown only to facilitate understanding of the spirit and principles of the present application, and the embodiments of the present application are not limited to it. On the contrary, the embodiments of the present application can be applied to any applicable scenario.
Exemplary Method
FIG. 1B is a schematic flowchart of a voice control method provided by an exemplary embodiment of the present application. The execution subject of this embodiment may be, for example, the electronic device in FIG. 1A. As shown in FIG. 1B, the method includes the following steps:
Step 110: Acquire voice feature data of the user.
Step 120: Acquire lip feature data corresponding to the voice feature data.
The execution order of steps 110 and 120 is not limited to the order listed above; that is, step 120 may be executed before step 110 or simultaneously with it.
In the embodiments of the present application, the technical solution is described in detail by taking an in-vehicle device as the terminal, where the in-vehicle device may be a speaker, a display device, or the like in an in-vehicle system, and the execution subject of the method (the electronic device) may be the controller of the in-vehicle device or the controller of the in-vehicle system.
The in-vehicle system may further include a camera and a microphone, which may be installed at positions in the car corresponding to the driver's seat. While the user is driving, when the user starts to speak, the controller can collect the user's lip feature data by controlling the camera and at the same time collect the user's voice feature data by controlling the microphone; the lip feature data and the voice feature data correspond to each other. For example, if the user says "start", both the lip feature data and the voice feature data correspond to "start".
The lip feature data may be images representing changes in lip motion, or matrix or vector data extracted from such images to characterize the lip-language content. The voice feature data may be a voice segment, or matrix or vector data extracted from the voice segment to characterize the voice content.
Step 130: Determine a control word for controlling the terminal based on the voice feature data and the lip feature data.
The control word may be a wake-up word or a voice command. In this embodiment, the control word may be determined by comparing the voice feature data with preset voice feature data, comparing the lip feature data with preset lip feature data, and combining the comparison results of the two kinds of data. Alternatively, the control word may be determined by inputting the voice feature data and the lip feature data into a control word recognition model at the same time. The control word recognition model may be obtained by training on multiple samples, each of which contains sample voice feature data and sample lip feature data. During training, multiple samples collected under different light or noise environments can be trained and learned from to obtain the control word recognition model.
Step 140: Control the terminal to perform the operation corresponding to the control word.
If what the user says contains a control word and the controller recognizes it, the controller controls the terminal to perform the corresponding operation. When the control word is a wake-up word, the terminal is woken up; when the control word is a voice command, the terminal performs the operation corresponding to the command; when the control word includes both a wake-up word and a voice command, the terminal is woken up and then performs the operation corresponding to the command.
Taking a speaker as the terminal: when the speaker is off and the control word is the wake-up word "start", the speaker is woken up, enters the working state, and starts playing music; when the speaker is working and the control word is a voice command such as "decrease volume" or "reduce volume", the speaker's volume is lowered; when the speaker is off and the control word is "start, decrease volume" or "start, reduce volume", the speaker enters the working state and starts playing music at a volume lower than the setting before starting. In this embodiment, the degree of volume reduction can be set in advance according to usage.
An embodiment of the present application provides a voice control method that recognizes control words in a user's voice by fusing voice feature data and lip feature data. Combining the auditory and visual sensory channels yields multimodal information, which strengthens control word recognition and makes up for the shortcomings of recognizing control words from voice or lip data alone. This greatly improves the accuracy with which the user's voice is captured under conditions such as high noise, dim light, and low sound energy, thereby improving the accuracy of voice recognition, improving the user experience, and enhancing the naturalness of human-computer interaction.
According to an embodiment of the present application, the voice feature data may be voice feature vectors, and a voice feature vector may include FBank voice feature parameters or Mel-frequency cepstral coefficient (MFCC) voice feature parameters.
According to an embodiment of the present application, step 120 includes: collecting continuous multi-frame images containing the changes in the user's lip motion, and extracting the lip feature data based on the continuous multi-frame images.
In this embodiment, a camera may record a video of the changes in the user's face over a certain period of time, the continuous multi-frame images corresponding to the video within that period may be collected, and the lip feature data may then be extracted from them. The continuous multi-frame images may contain the user's complete facial features or only part of them (the lips).
For example, collecting the continuous multi-frame images may involve using a machine vision recognition model to identify continuous multi-frame face images of the user from multiple continuous video images, and then collecting the continuous multi-frame images (images containing part of the facial features) from those face images.
According to an embodiment of the present application, step 120 may include: collecting continuous multi-frame images containing the changes in the user's lip motion; for each frame in the continuous multi-frame images, extracting a plurality of feature points describing the shape of the lips; and normalizing the coordinates of the feature points of each frame to obtain the lip feature data.
Compared with recognizing lip language by comparing the collected multi-frame lip images with preset multi-frame lip images containing control words, lip feature data obtained through feature points can improve the accuracy of lip-language recognition. In addition, normalizing the coordinates of the feature points accelerates the convergence of data processing and thus increases the speed of lip recognition.
Specifically, each frame of the image corresponds to one lip shape; as shown in FIG. 9, the upper image is the lip image when the user is not speaking, and the lower image is the lip image when the user is speaking. Multiple feature points can be extracted around the inner and outer lip edges in each lip shape; each feature point can be represented by coordinates, and together the feature points represent the lip shape in that frame.
Because the user's head may swing while speaking, the coordinate origin in each frame may deviate when extracting feature points. Therefore, the feature points of the multi-frame images can be normalized so that the coordinate origins of the feature points in each frame are consistent, improving the accuracy of the lip feature data and thus the recognition rate of control words.
According to an embodiment of the present application, the control method further includes: determining whether the illuminance of the environment in which the user is located is greater than a first threshold, and determining whether the angle of the face in each of the continuous multi-frame images (the angle is zero when the face is directly facing the camera) is less than or equal to a second threshold, wherein, if the illuminance is greater than the first threshold and the angle is less than or equal to the second threshold, extracting the lip feature data based on the continuous multi-frame images is performed.
When the user's environment is dark, or the angle of the face relative to the camera lens is greater than the second threshold, it is difficult to extract valid lip feature data from the collected images; that is, even if lip feature data is obtained, it is difficult to determine the control word. Therefore, to avoid burdening the controller, the illuminance of the environment and the angle of the face can first be compared with the preset thresholds, and the lip feature data is extracted from the collected images only when the illuminance is greater than the first threshold and the angle is less than or equal to the second threshold.
In this embodiment, when the illuminance of the environment is lower than or equal to the first threshold, a light-emitting device can be controlled to supplement the light, and it is then determined whether the angle of the face is less than or equal to the second threshold; if so, the continuous multi-frame images containing the changes in the user's lip motion are collected. Alternatively, when the illuminance is lower than or equal to the first threshold, the acquisition of lip feature data may be abandoned and the control word determined directly from the voice feature data.
According to an embodiment of the present application, the control method further includes: determining whether the energy of the user's voice is greater than a third threshold, and determining whether the duration of the sound is greater than a fourth threshold, wherein, if the energy is greater than the third threshold and the duration is greater than the fourth threshold, acquiring the voice feature data of the user is performed.
When the energy of the user's voice is too small, effective voice feature data cannot be obtained from the sound; that is, even if lip feature data is obtained, it is difficult to determine the control word. In addition, when the user's voice lasts too short a time, it may not contain an actual word, for example "um" or "oh". In these two cases, obtaining voice feature data from the sound would increase the computational burden of the controller without yielding a control word, so the voice feature data is acquired from the sound only when its energy is greater than the third threshold and its duration is greater than the fourth threshold. The third and fourth thresholds can be set according to the control words actually used.
In this embodiment, if the energy of the sound is less than or equal to the third threshold, or its duration is less than or equal to the fourth threshold, the acquisition of voice feature data may be abandoned and the control word determined directly from the lip feature data.
According to an embodiment of the present application, the control word includes a wake-up word, the lip feature data includes multiple lip feature vectors, the voice feature data includes multiple acoustic feature vectors, and the control word recognition model is a wake-up word recognition model. In this embodiment, step 130 includes: obtaining multiple combined feature vectors from the multiple lip feature vectors and the multiple acoustic feature vectors; and using the wake-up word recognition model to perform wake-up word recognition on the multiple combined feature vectors to obtain the wake-up word.
Combining the lip feature vectors and acoustic feature vectors in this way allows the two kinds of information to be used together for wake-up word recognition, making up for the shortcomings of using only one and improving the accuracy of wake-up word recognition.
The above voice feature vector may specifically be an acoustic feature vector. Specifically, the acoustic feature vector may be a vector composed of MFCC (Mel-frequency cepstral coefficient) parameters, differential parameters (characterizing changes in the MFCC parameters), and frame energy, where the MFCC parameters characterize the static features of speech, the differential parameters characterize its dynamic features, and the frame energy characterizes its energy features. For example, the acoustic feature vector is a 39-dimensional MFCC vector, including 12-dimensional MFCC parameters, 12-dimensional first-order differential parameters, 12-dimensional second-order differential parameters, and 3-dimensional frame energy.
Each frame in the continuous multi-frame images corresponds to one lip feature vector, which may be composed of multiple feature points. The voice collected by the microphone can also be divided into multiple frames over time, with each voice frame corresponding to one acoustic feature vector. The multiple lip feature vectors and the multiple acoustic feature vectors can be in one-to-one correspondence, and the combined feature vectors are obtained by combining them.
FIG. 2 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application. FIG. 2 is an example of FIG. 1B; to avoid repetition, the same parts are not explained in detail. As shown in FIG. 2, the control method includes the following.
Step 205: Collect the user's voice using a microphone array.
The microphone array may refer to a microphone containing multiple channels.
Step 210: Acquire the user's voice feature data.
Step 215: Collect continuous multi-frame images containing the changes in the user's lips.
Determine whether the illuminance of the user's environment is greater than the first threshold, and determine whether the angle of the face in each of the continuous multi-frame images is less than or equal to the second threshold.
Step 220: If the illuminance is greater than the first threshold and the angle of the face is less than or equal to the second threshold, perform step 225; otherwise, perform step 250.
Step 225: Extract the lip feature data based on the continuous multi-frame images.
Step 230: Determine the control word of the control terminal based on the voice feature data and the lip feature data.
Step 240: Control the terminal to perform the operation corresponding to the control word.
Step 250: If the illuminance is greater than the first threshold and the angle of the face is greater than the second threshold, perform steps 255 and 260; otherwise, end the entire process.
Step 255: Determine the position of the sound source using the time difference of arrival (TDOA) method.
Step 260: Adjust the camera angle of the camera used to capture the continuous multi-frame images according to the sound source position.
When the microphone is collecting the user's voice and the camera is collecting the user's facial images, the position of the sound source can be determined by the TDOA method and the camera angle adjusted accordingly, so that the camera can capture a frontal image of the user.
Steps 255 and 215 can be performed at the same time, that is, the user's facial images are collected while the camera angle is being adjusted. At a given moment the angle of the face relative to the camera may be greater than the second threshold; adjusting the camera angle according to the sound source position can bring the angle below the second threshold at the next moment, improving the accuracy of the lip feature data.
FIG. 3 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application. FIG. 3 is an example of FIG. 1B; to avoid repetition, the same parts are not explained in detail. As shown in FIG. 3, the control method includes the following.
Step 305: Collect the user's voice using a microphone array.
Step 310: Determine whether the energy of the user's voice is greater than the third threshold, and determine whether the duration of the sound is greater than the fourth threshold.
If the energy of the sound is greater than the third threshold and the duration is greater than the fourth threshold, perform step 315; otherwise, perform step 320.
Step 315: Acquire the user's voice feature data.
Step 320: If the energy of the sound is less than or equal to the third threshold, or the duration of the sound is less than or equal to the fourth threshold, further determine whether the similarity between the lip feature data of two adjacent images in the continuous multi-frame images is below the fifth threshold.
The controller may determine, in chronological order, whether the similarity between the lip feature data of each pair of adjacent frames is below the fifth threshold. If the similarity between the lip feature data of two adjacent images is below the fifth threshold, the user's lips are moving; the user may be about to say a sentence containing a control word again, so step 305 can be repeated within a preset time.
Step 325: Collect continuous multi-frame images containing the changes in the user's lips.
Step 330: Extract the lip feature data based on the continuous multi-frame images.
Step 330 may be performed before step 320.
Step 340: Determine the control word of the control terminal based on the voice feature data and the lip feature data.
Step 345: Control the terminal to perform the operation corresponding to the control word.
FIG. 4 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application. FIG. 4 is an example of FIG. 1B; to avoid repetition, the same parts are not explained in detail. As shown in FIG. 4, the control method includes the following.
Step 410: Acquire multiple acoustic feature vectors of the user.
Step 420: Acquire multiple lip feature vectors corresponding to the multiple acoustic feature vectors.
Determine whether the energy of the user's voice is less than the sixth threshold, and determine whether the noise level of the sound is greater than the seventh threshold.
When the energy of the user's voice is less than the sixth threshold, it is difficult for the controller to extract accurate acoustic feature vectors from the sound; likewise when the surrounding environment is noisy, for example due to vehicle horns, engine noise, or other passengers talking. In these cases, when later obtaining the combined feature vectors, the controller can reduce the weight of the acoustic feature vectors and increase the weight of the lip feature vectors; that is, by adjusting the weight values of the two kinds of vectors, more accurate combined feature vectors can be obtained.
If the energy of the sound is less than the sixth threshold or the noise level of the sound is greater than the seventh threshold, perform step 431. The sixth threshold may be greater than or equal to the third threshold in FIG. 3.
Step 431: Determine a first weight value corresponding to the multiple lip feature vectors and a second weight value corresponding to the multiple acoustic feature vectors.
When the energy of the sound is less than the sixth threshold or the noise level of the sound is greater than the seventh threshold, the first weight value is greater than the second weight value.
If the energy of the sound is greater than or equal to the sixth threshold and the noise level is less than or equal to the seventh threshold, step 431 may also be performed, in which case the first weight value may be equal to the second weight value.
The distribution of the weight values may be determined by a weight distribution model, which may be obtained by deep learning on samples of different scenes (sound environment, illuminance, face angle).
Step 432: Use the first weight value and the second weight value to perform a weighted calculation on the multiple lip feature vectors and the multiple acoustic feature vectors, respectively, to obtain multiple combined feature vectors.
When the sound energy is small or the noise is large, increasing the weight of the lip feature vectors and reducing the weight of the acoustic feature vectors improves the accuracy of the combined feature vectors and thus the accuracy of wake-up word recognition.
Step 433: Use the wake-up word recognition model to perform wake-up word recognition on the multiple combined feature vectors to obtain the wake-up word.
Step 440: Wake up the terminal.
FIG. 5 is a schematic flowchart of a voice control method provided by another exemplary embodiment of the present application. FIG. 5 is an example of FIG. 1B; to avoid repetition, the same parts are not explained in detail. As shown in FIG. 5, the control method includes the following.
Step 510: Acquire multiple acoustic feature vectors of the user.
Step 520: Acquire multiple lip feature vectors corresponding to the multiple acoustic feature vectors.
Determine whether the illuminance of the user's environment is less than the eighth threshold. If the illuminance is less than the eighth threshold, perform step 531.
Step 531: Determine a third weight value corresponding to the multiple lip feature vectors and a fourth weight value corresponding to the multiple acoustic feature vectors.
When the illuminance of the user's environment is less than the eighth threshold, the third weight value is less than the fourth weight value.
If the illuminance of the user's environment is greater than or equal to the eighth threshold, step 531 may also be performed, in which case the third weight value may be equal to the fourth weight value.
Step 532: Use the third weight value and the fourth weight value to perform a weighted calculation on the multiple lip feature vectors and the multiple acoustic feature vectors, respectively, to obtain multiple combined feature vectors.
In dim light, reducing the weight of the lip feature vectors and increasing the weight of the acoustic feature vectors improves the accuracy of the combined feature vectors and thus the accuracy of wake-up word recognition.
Step 533: Use the wake-up word recognition model to perform wake-up word recognition on the multiple combined feature vectors to obtain the wake-up word.
Step 540: Wake up the terminal.
In this embodiment, the weight values of the lip feature vectors and the voice feature vectors may also be redistributed according to the angle of the face in the images.
The embodiments shown in FIG. 1 to FIG. 5 can complement each other to obtain a control word recognition process with more complete functions and higher accuracy.
Exemplary Device
FIG. 6 is a schematic structural diagram of a voice control device 600 provided by an exemplary embodiment of the present application. As shown in FIG. 6, the device 600 includes an acquisition module 610, a determination module 620, and a control module 630.
The acquisition module 610 is used to acquire the user's voice feature data and the lip feature data corresponding to the voice feature data; the determination module 620 is used to determine the control word of the control terminal based on the voice feature data and the lip feature data; and the control module 630 is used to control the terminal to perform the operation corresponding to the control word.
For the specific working process and effects of each module, refer to the description of FIG. 1 above, which will not be repeated here.
An embodiment of the present application provides a voice control device that recognizes control words in a user's voice by fusing voice feature data and lip feature data. Fusing the auditory and visual sensory channels yields multimodal information, which strengthens control word recognition and makes up for the shortcomings of recognizing control words from voice or lip data alone. This greatly improves the accuracy with which the user's voice is captured under conditions such as high noise, dim light, and low sound energy, thereby improving the accuracy of voice recognition, improving the user experience, and enhancing the naturalness of human-computer interaction.
According to an embodiment of the present application, the acquisition module 610 is used to collect continuous multi-frame images containing the changes in the user's lip motion and to extract the lip feature data based on the continuous multi-frame images.
According to an embodiment of the present application, the acquisition module 610 is used to extract, for each frame in the continuous multi-frame images, a plurality of feature points describing the shape of the lips, and to normalize the coordinates of the feature points of each frame to obtain the lip feature data.
According to an embodiment of the present application, the determination module 620 is further used to determine whether the illuminance of the user's environment is greater than the first threshold, and to determine whether the angle of the face in each of the continuous multi-frame images is less than or equal to the second threshold.
In this embodiment, if the illuminance is greater than the first threshold and the angle is less than or equal to the second threshold, the acquisition module 610 extracts the lip feature data based on the continuous multi-frame images.
According to an embodiment of the present application, the determination module 620 is further used to determine whether the energy of the user's voice is greater than the third threshold, and to determine whether the duration of the sound is greater than the fourth threshold.
FIG. 7 is a schematic structural diagram of a voice control device 700 provided by another exemplary embodiment of the present application. FIG. 7 is an example of FIG. 6; the same points will not be repeated. As shown in FIG. 7, the device 700 includes an acquisition module 710, a determination module 720, a control module 730, and a collection module 740. For the specific functions of the acquisition module 710, the determination module 720, and the control module 730, refer to the description of FIG. 6.
The collection module 740 is used to collect the user's voice using a microphone array.
If the angle of the face is greater than the second threshold, the determination module 720 is further used to determine the sound source position of the sound using the time difference of arrival (TDOA) method. The device 700 further includes an adjustment module 750 for adjusting the camera angle of the camera used to capture the continuous multi-frame images according to the sound source position.
In this embodiment, if the energy of the sound is greater than the third threshold and the duration of the sound is greater than the fourth threshold, the acquisition module 710 acquires the user's voice feature data.
According to an embodiment of the present application, the determination module 720 is further used to determine whether the similarity between the lip feature data of two adjacent images in the continuous multi-frame images is below the fifth threshold.
In this embodiment, if the similarity is below the fifth threshold, and the energy of the sound is less than or equal to the third threshold or the duration of the sound is less than or equal to the fourth threshold, the collection module 740 repeatedly collects the user's voice within a preset time.
According to an embodiment of the present application, the control word includes a wake-up word, the lip feature data includes multiple lip feature vectors, and the voice feature data includes multiple acoustic feature vectors, wherein the determination module 720 is used to obtain multiple combined feature vectors from the multiple lip feature vectors and the multiple acoustic feature vectors, and to use the wake-up word recognition model to perform wake-up word recognition on the multiple combined feature vectors to obtain the wake-up word.
According to an embodiment of the present application, the determination module 720 is further used to determine whether the energy of the user's voice is less than the sixth threshold, and to determine whether the noise level of the sound is greater than the seventh threshold.
If the energy of the sound is less than the sixth threshold or the noise level of the sound is greater than the seventh threshold, the determination module 720 is used to determine a first weight value corresponding to the multiple lip feature vectors and a second weight value corresponding to the multiple acoustic feature vectors, where the first weight value is greater than the second weight value, and to use the first and second weight values to perform a weighted calculation on the multiple lip feature vectors and the multiple acoustic feature vectors to obtain the multiple combined feature vectors.
According to an embodiment of the present application, the determination module 720 is further used to determine whether the illuminance of the user's environment is less than the eighth threshold.
If the illuminance is less than the eighth threshold, the determination module 720 is used to determine a third weight value corresponding to the multiple lip feature vectors and a fourth weight value corresponding to the multiple acoustic feature vectors, where the third weight value is less than the fourth weight value, and to use the third and fourth weight values to perform a weighted calculation on the multiple lip feature vectors and the multiple acoustic feature vectors to obtain the multiple combined feature vectors.
According to an embodiment of the present application, the acquisition module 710 is used to identify the user's continuous multi-frame face images from multiple continuous video images using a machine vision recognition model, and to collect the continuous multi-frame images from the continuous multi-frame face images.
For the specific working process and effects of each module, refer to the descriptions of FIG. 1 to FIG. 5 above, which will not be repeated here.
Exemplary Electronic Device
An electronic device according to an embodiment of the present application is described below with reference to FIG. 8. The electronic device 80 can execute the voice control processes described above.
FIG. 8 illustrates a block diagram of an electronic device according to an embodiment of the present application.
As shown in FIG. 8, the electronic device 80 includes one or more processors 81 and a memory 82.
The processor 81 may be a central processing unit (CPU) or another form of processing unit having data processing capability and/or instruction execution capability, and may control other components in the electronic device 80 to perform desired functions.
The memory 82 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 81 may execute the program instructions to implement the voice control methods of the embodiments of the present application described above and/or other desired functions. Various contents such as voice signals and video image signals can also be stored in the computer-readable storage medium.
In one example, the electronic device 80 may further include an input device 83 and an output device 84, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, the input device 83 may be the microphone array and camera described above, used to capture input signals of voice and video images, respectively. When the electronic device is a stand-alone device, the input device 83 may be a communication network connector for receiving the collected input signals from the microphone array and the camera.
In addition, the input device 83 may also include, for example, a keyboard, a mouse, and the like.
The output device 84 can output various information to the outside, including the determined control word. The output device 84 may include, for example, a display, a speaker, a printer, a communication network, and remote output devices connected to it.
Of course, for simplicity, FIG. 8 shows only some of the components in the electronic device 80 that are relevant to the present application, omitting components such as a bus and input/output interfaces. In addition, the electronic device 80 may include any other suitable components depending on the specific application.
Exemplary Computer Program Product and Computer-Readable Storage Medium
In addition to the above methods and devices, embodiments of the present application may also be a computer program product, which includes computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice control methods according to the various embodiments of the present application described in the "Exemplary Method" section of this specification.
The computer program product may include program code for performing the operations of the embodiments of the present application written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
Furthermore, an embodiment of the present application may also be a computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, cause the processor to perform the steps in the voice control methods according to the various embodiments of the present application described in the "Exemplary Method" section of this specification.
The computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in combination with specific embodiments. However, it should be pointed out that the advantages, strengths, and effects mentioned in this application are merely examples rather than limitations, and they should not be considered necessary for every embodiment of the application. In addition, the specific details disclosed above are provided only for the purposes of illustration and ease of understanding, rather than limitation, and they do not limit the application to being implemented with those specific details.
The block diagrams of components, devices, equipment, and systems involved in this application are only illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown; as those skilled in the art will recognize, they can be connected, arranged, or configured in any manner. Words such as "including", "comprising", and "having" are open-ended terms meaning "including but not limited to" and can be used interchangeably with it. The words "or" and "and" as used here refer to "and/or" and can be used interchangeably with it, unless the context clearly indicates otherwise. The word "such as" as used here refers to "such as but not limited to" and can be used interchangeably with it.
It should also be pointed out that, in the devices, apparatuses, and methods of the present application, each component or step can be decomposed and/or recombined; such decompositions and/or recombinations shall be regarded as equivalent solutions of this application.
The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other aspects without departing from the scope of the application. Therefore, the application is not intended to be limited to the aspects shown here, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The above description has been given for the purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the forms disclosed here. Although a number of example aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, changes, additions, and sub-combinations thereof.

Claims (13)

  1. A voice control method, comprising:
    acquiring voice feature data of a user, and acquiring lip feature data corresponding to the voice feature data;
    determining a control word for controlling a terminal based on the voice feature data and the lip feature data; and
    controlling the terminal to perform an operation corresponding to the control word.
  2. The method according to claim 1, wherein acquiring the lip feature data corresponding to the voice feature data comprises:
    collecting continuous multi-frame images containing changes in the user's lip motion, and, for each frame in the continuous multi-frame images, extracting a plurality of feature points describing the shape of the lips; and
    normalizing the coordinates of the plurality of feature points of each frame in the continuous multi-frame images to obtain the lip feature data.
  3. The method according to claim 2, further comprising:
    determining whether the illuminance of the environment in which the user is located is greater than a first threshold, and determining whether the angle of the face in each frame of the continuous multi-frame images is less than or equal to a second threshold,
    wherein, if the illuminance is greater than the first threshold and the angle is less than or equal to the second threshold, extracting the lip feature data based on the continuous multi-frame images is performed.
  4. The method according to claim 3, further comprising:
    collecting the user's voice using a microphone array;
    if the angle is greater than the second threshold, determining the sound source position of the voice using the time difference of arrival (TDOA) method; and
    adjusting the camera angle of the camera used to capture the continuous multi-frame images according to the sound source position.
  5. The method according to claim 2, further comprising:
    determining whether the energy of the user's voice is greater than a third threshold, and determining whether the duration of the voice is greater than a fourth threshold,
    wherein, if the energy of the voice is greater than the third threshold and the duration of the voice is greater than the fourth threshold, acquiring the voice feature data of the user is performed.
  6. The method according to claim 5, further comprising:
    determining whether the similarity between the lip feature data of two adjacent images in the continuous multi-frame images is below a fifth threshold; and
    if the similarity is below the fifth threshold, and the energy of the voice is less than or equal to the third threshold or the duration of the voice is less than or equal to the fourth threshold, repeatedly collecting the user's voice within a preset time.
  7. The method according to claim 2, wherein the control word comprises a wake-up word, the lip feature data comprises a plurality of lip feature vectors, and the voice feature data comprises a plurality of acoustic feature vectors,
    wherein determining the control word for controlling the terminal based on the voice feature data and the lip feature data comprises:
    obtaining a plurality of combined feature vectors from the plurality of lip feature vectors and the plurality of acoustic feature vectors; and
    performing wake-up word recognition on the plurality of combined feature vectors using a wake-up word recognition model to obtain the wake-up word.
  8. The method according to claim 7, further comprising:
    determining whether the energy of the user's voice is less than a sixth threshold, and determining whether the noise level of the voice is greater than a seventh threshold,
    wherein obtaining the plurality of combined feature vectors from the plurality of lip feature vectors and the plurality of acoustic feature vectors comprises:
    if the energy of the voice is less than the sixth threshold or the noise level of the voice is greater than the seventh threshold, determining a first weight value corresponding to the plurality of lip feature vectors and a second weight value corresponding to the plurality of acoustic feature vectors, wherein the first weight value is greater than the second weight value; and
    performing a weighted calculation on the plurality of lip feature vectors and the plurality of acoustic feature vectors using the first weight value and the second weight value, respectively, to obtain the plurality of combined feature vectors.
  9. The method according to claim 7, further comprising:
    determining whether the illuminance of the environment in which the user is located is less than an eighth threshold,
    wherein obtaining the plurality of combined feature vectors from the plurality of lip feature vectors and the plurality of acoustic feature vectors comprises:
    if the illuminance is less than the eighth threshold, determining a third weight value corresponding to the plurality of lip feature vectors and a fourth weight value corresponding to the plurality of acoustic feature vectors, wherein the third weight value is less than the fourth weight value; and
    performing a weighted calculation on the plurality of lip feature vectors and the plurality of acoustic feature vectors using the third weight value and the fourth weight value, respectively, to obtain the plurality of combined feature vectors.
  10. The method according to any one of claims 1 to 9, wherein acquiring the voice feature data of the user comprises:
    collecting the user's voice using a microphone array; and
    extracting continuous voice feature vectors from the voice, the voice feature vectors comprising FBank voice feature parameters or Mel-frequency cepstral coefficient (MFCC) voice feature parameters.
  11. A voice control device, comprising:
    an acquisition module for acquiring voice feature data of a user and acquiring lip feature data corresponding to the voice feature data;
    a determination module for determining a control word for controlling a terminal based on the voice feature data and the lip feature data; and
    a control module for controlling the terminal to perform an operation corresponding to the control word.
  12. A computer-readable storage medium storing a computer program, the computer program being used to execute the voice control method according to any one of claims 1 to 10.
  13. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor,
    wherein the processor is used to execute the voice control method according to any one of claims 1 to 10.
PCT/CN2019/100982 2018-12-17 2019-08-16 Voice control method and device WO2020125038A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811543052.8 2018-12-17
CN201811543052.8A CN111326152A (zh) Voice control method and device

Publications (1)

Publication Number Publication Date
WO2020125038A1 true WO2020125038A1 (zh) 2020-06-25

Family

ID=71100644

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/100982 WO2020125038A1 (zh) Voice control method and device

Country Status (2)

Country Link
CN (1) CN111326152A (zh)
WO (1) WO2020125038A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436618A (zh) * 2020-08-22 2021-09-24 彭玲玲 Signal accuracy adjustment system for voice instruction capture
CN113689858B (zh) * 2021-08-20 2024-01-05 广东美的厨房电器制造有限公司 Control method and apparatus for cooking equipment, electronic device, and storage medium
CN113723528B (zh) * 2021-09-01 2023-12-29 斑马网络技术有限公司 In-vehicle speech-vision fusion multimodal interaction method, system, device, and storage medium
CN114093354A (zh) * 2021-10-26 2022-02-25 惠州市德赛西威智能交通技术研究院有限公司 Method and system for improving the recognition accuracy of an in-vehicle voice assistant
CN117672228B (zh) * 2023-12-06 2024-06-25 江苏中科重德智能科技有限公司 Machine learning-based false wake-up system and method for intelligent voice interaction

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023703A (zh) * 2009-09-22 2011-04-20 现代自动车株式会社 Multimodal interface system combining lip reading and speech recognition
CN102314595A (zh) * 2010-06-17 2012-01-11 微软公司 RGB/depth camera for improving speech recognition
US20130054240A1 (en) * 2011-08-25 2013-02-28 Samsung Electronics Co., Ltd. Apparatus and method for recognizing voice by using lip image
CN103218924A (zh) * 2013-03-29 2013-07-24 上海众实科技发展有限公司 Spoken-language learning monitoring method based on audio-visual dual modalities
CN105096935A (zh) * 2014-05-06 2015-11-25 阿里巴巴集团控股有限公司 Voice input method, apparatus, and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004024863A (ja) * 1994-05-13 2004-01-29 Matsushita Electric Ind Co Ltd Lip recognition device and utterance segment recognition device
CN102117115B (zh) * 2009-12-31 2016-11-23 上海量科电子科技有限公司 System and implementation method for text input selection using lip language
JP5323770B2 (ja) * 2010-06-30 2013-10-23 日本放送協会 User instruction acquisition device, user instruction acquisition program, and television receiver
CN102004549B (zh) * 2010-11-22 2012-05-09 北京理工大学 Automatic lip-language recognition system suitable for Chinese
CN105389097A (zh) * 2014-09-03 2016-03-09 中兴通讯股份有限公司 Human-computer interaction device and method
CN107799125A (zh) * 2017-11-09 2018-03-13 维沃移动通信有限公司 Speech recognition method, mobile terminal, and computer-readable storage medium
CN108346427A (zh) * 2018-02-05 2018-07-31 广东小天才科技有限公司 Speech recognition method, apparatus, device, and storage medium
CN108446675A (zh) * 2018-04-28 2018-08-24 北京京东金融科技控股有限公司 Facial image recognition method and apparatus, electronic device, and computer-readable medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023703A (zh) * 2009-09-22 2011-04-20 现代自动车株式会社 Multimodal interface system combining lip reading and speech recognition
CN102314595A (zh) * 2010-06-17 2012-01-11 微软公司 RGB/depth camera for improving speech recognition
US20130054240A1 (en) * 2011-08-25 2013-02-28 Samsung Electronics Co., Ltd. Apparatus and method for recognizing voice by using lip image
CN103218924A (zh) * 2013-03-29 2013-07-24 上海众实科技发展有限公司 Spoken-language learning monitoring method based on audio-visual dual modalities
CN105096935A (zh) * 2014-05-06 2015-11-25 阿里巴巴集团控股有限公司 Voice input method, apparatus, and system

Also Published As

Publication number Publication date
CN111326152A (zh) 2020-06-23

Similar Documents

Publication Publication Date Title
US11854527B2 (en) Electronic device and method of controlling speech recognition by electronic device
WO2020125038A1 (zh) Voice control method and device
WO2021093449A1 (zh) Artificial intelligence-based wake-up word detection method, apparatus, device, and medium
CN110364143B (zh) Voice wake-up method, apparatus, and intelligent electronic device thereof
US10777193B2 (en) System and device for selecting speech recognition model
US11854550B2 (en) Determining input for speech processing engine
US20150325240A1 (en) Method and system for speech input
US20200335128A1 (en) Identifying input for speech recognition engine
CN107972028B (zh) Human-computer interaction method, apparatus, and electronic device
CN110310623A (zh) Sample generation method, model training method, apparatus, medium, and electronic device
JP2012047924A (ja) Information processing device, information processing method, and program
US11741943B2 (en) Method and system for acoustic model conditioning on non-phoneme information features
CN110706707B (zh) Method, apparatus, device, and computer-readable storage medium for voice interaction
CN112017633B (zh) Speech recognition method, apparatus, storage medium, and electronic device
CN114038457B (zh) Method for voice wake-up, electronic device, storage medium, and program
US11322151B2 (en) Method, apparatus, and medium for processing speech signal
CN113516990A (zh) Speech enhancement method, neural network training method, and related device
US10847154B2 (en) Information processing device, information processing method, and program
CN110853669A (zh) Audio recognition method, apparatus, and device
CN113611316A (zh) Human-computer interaction method, apparatus, device, and storage medium
WO2020073839A1 (zh) Voice wake-up method, apparatus, system, and electronic device
JP7511374B2 (ja) Speech segment detection device, speech recognition device, speech segment detection system, speech segment detection method, and speech segment detection program
CN117649848A (zh) Speech signal processing device and method
CN115148188A (zh) Language identification method, apparatus, electronic device, and medium
CN116259309A (zh) Terminal device and method for detecting a custom wake-up word

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19898488

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19898488

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 09.02.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19898488

Country of ref document: EP

Kind code of ref document: A1