WO2023006033A1 - Speech interaction method, electronic device, and medium - Google Patents


Info

Publication number
WO2023006033A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
voice
electronic device
recognition
lip
Prior art date
Application number
PCT/CN2022/108624
Other languages
French (fr)
Chinese (zh)
Inventor
朱维峰 (Zhu Weifeng)
曾俊飞 (Zeng Junfei)
查永东 (Zha Yongdong)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023006033A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/24: Speech recognition using non-acoustical features
    • G10L15/25: Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis

Definitions

  • the present application relates to the technical field of human-computer interaction, and in particular to a voice interaction method, electronic equipment and media.
  • in some cases, the robot may be unable to receive or recognize the user's voice instructions and execute them. For example, when the current environment is too noisy, the robot cannot determine when the user's voice command ends and therefore remains in a continuous listening state, or it cannot determine when the voice command starts and therefore never begins receiving sound. As a result, it cannot provide feedback such as executing the user's voice command, which seriously affects the user experience.
  • the first aspect of the embodiment of the present application provides a voice interaction method, an electronic device and a medium.
  • the method can be applied to electronic equipment, and the method includes:
  • in the method, if it is judged that it is difficult to recognize the user's voice command by speech recognition, it is further determined whether to use lip recognition by judging whether the user is interacting with the voice assistant. The method can effectively improve the accuracy of voice command recognition, thereby further improving the accuracy with which the electronic device executes the user's voice commands.
  • the noise value around the electronic device and the duration of the electronic device receiving sound, etc. all belong to the category of the voice interaction environment of the electronic device.
  • the image acquisition device may be a camera device for collecting images, such as a camera.
  • the electronic device when the electronic device receives the user's voice instruction, it can simultaneously collect the user's voice and the user's mouth change features.
  • voice recognition is used to recognize the user's voice received by the electronic device to obtain a voice recognition result.
  • the changes in the user's mouth captured by the image acquisition device of the electronic device can be analyzed by means of lip recognition to obtain a lip recognition result.
  • the speech recognition conditions include:
  • the noise value around the electronic device is lower than the set noise value; or
  • the duration time for the electronic device to receive sound is greater than zero and less than the set time.
  • to judge whether the user's voice command can be recognized by speech recognition, it may first be determined whether the noise value around the electronic device is lower than the set noise value. If so, the noise around the electronic device is small and the user's voice command can be recognized by speech recognition. If not, the noise around the electronic device is already large and the external environment is noisy; in that case, it is further judged whether the duration for which the electronic device has been receiving sound is greater than zero and less than the set time.
  • if the duration for which the electronic device has been receiving sound is greater than zero and less than the set time, the electronic device can still accurately determine the time point at which the user's voice is cut off, and it is confirmed that the user's voice command can be recognized by speech recognition at this time. If the duration is greater than or equal to the first set value (the set time), or no sound has been received at all, the electronic device can no longer accurately determine the time point at which the user's voice is cut off, and it is determined that speech recognition has become difficult to use at this time.
  • the voice recognition condition includes: the duration of the electronic device receiving sound is greater than zero and less than a set time.
  • alternatively, whether the user's voice command can be recognized by speech recognition can be judged directly by whether the duration for which the electronic device receives sound is greater than zero and less than the set time. If so, the electronic device can still accurately determine the time point at which the user's voice is cut off, and it is determined that the user's voice command can be recognized by speech recognition at this time. If the duration is greater than or equal to the first set value, or no sound has been received, the electronic device can no longer accurately determine that time point; it is then inferred that the external environment is too noisy, and it is determined that speech recognition has become difficult to use for recognizing the user's voice commands.
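  • to make the two-step check above concrete, the following Python sketch expresses it in code. This is an editorial illustration, not part of the claimed method; the function name and the 70 dB and 10 s default thresholds are assumptions echoing the examples given elsewhere in this text.

```python
def can_use_speech_recognition(noise_db, sound_duration_s,
                               noise_threshold_db=70.0, max_duration_s=10.0):
    """Decide whether speech recognition is expected to be reliable.

    Mirrors the two-step check described above: if ambient noise is below
    the set noise value, speech recognition is usable outright; otherwise
    it is usable only while the device can still detect the cut-off point
    of the user's voice, i.e. the continuous sound duration is greater
    than zero and less than the set time.
    """
    if noise_db < noise_threshold_db:
        return True
    return 0 < sound_duration_s < max_duration_s
```

When the environment is quiet the duration is irrelevant; the duration test only matters as a fallback in noisy surroundings.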
  • the lip recognition conditions include:
  • the user and the electronic device are in an interactive state within a set time.
  • when the user and the electronic device are in an interactive state, the electronic device can capture the changes in the user's mouth relatively clearly, so that lip recognition can be applied to those captured mouth changes to obtain the lip recognition result.
  • the method for determining whether the user is in an interactive state with the electronic device within a set time includes:
  • the interaction intensity value is related to the distance between the user and the electronic device and the face orientation of the user.
  • before detecting the interaction intensity value between the user and the electronic device, it may first be determined whether the user interacting with the electronic device within the above-mentioned set time has changed. If not, the current user can be taken as the object for the subsequent detection of the interaction intensity value. When it is further determined that the interaction intensity value between the user and the electronic device reaches the set intensity value, it can be confirmed that the user is in an interactive state with the electronic device, and the changing features of the user's mouth are then recognized to obtain the lip recognition result.
  • the interaction intensity value may be acquired based on the distance between the user's face and the electronic device, the orientation of the face, and the like within a set time. For example, if the distance between the user's face and the electronic device is relatively short within the set time, and the user's face faces the electronic device, the interaction intensity value is high, and vice versa.
  • the interaction intensity value mentioned in the embodiments of the present application has the same meaning as the interaction willingness value; only the expression differs.
  • the set intensity value may be the second set value mentioned in the following embodiments.
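  • the application does not give a formula for the interaction intensity value; it states only that the value rises when the user is close to the device and facing it. The Python sketch below is one hypothetical way to combine those two signals, with all names and constants invented for illustration and not drawn from the claims.

```python
def interaction_intensity(distance_m, face_angle_deg,
                          max_distance_m=3.0, max_angle_deg=90.0):
    """Illustrative interaction-intensity value in [0, 1].

    Closeness and facing are each normalized to [0, 1] and multiplied,
    so the value is high only when the user is both near the device and
    facing it, as described above. This weighting is an assumption.
    """
    closeness = max(0.0, 1.0 - distance_m / max_distance_m)
    facing = max(0.0, 1.0 - abs(face_angle_deg) / max_angle_deg)
    return closeness * facing


def is_interacting(distance_m, face_angle_deg, set_intensity=0.25):
    """True when the intensity reaches the (hypothetical) set intensity value."""
    return interaction_intensity(distance_m, face_angle_deg) >= set_intensity
```

A user half a metre away and roughly facing the camera scores high; a user at the edge of range looking away scores near zero.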
  • the method for confirming whether the lip recognition result is correct includes:
  • the step of confirming whether the lip recognition result is correct further includes:
  • in the step of confirming whether the lip recognition result is correct, since the previous steps have confirmed that the current environment is noisy, speech recognition may be unable to accurately recognize the user's voice command; therefore, while the voice assistant is performing the confirmation, the visual recognition function can be turned on at the same time.
  • the visual recognition function can obtain the features of the user's body movements, facilitating recognition of a reply that the user gives through body movements, such as a gesture indicating that the lip recognition result is correct.
  • in addition to the visual recognition function, the noise detection function can remain enabled while performing voice confirmation to the user, so as to detect the surrounding environmental noise in real time.
  • if the ambient noise has fallen below the set value, the user's voice command can be recognized by speech recognition at this time, and the user's confirmation instruction or other subsequent voices can be recognized in that way. If the ambient noise is still higher than the set value, lip recognition, visual recognition, or a combination of the two is used to recognize the user's confirmation instruction or other subsequent voice instructions.
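  • the confirmation-stage behaviour just described can be sketched as a small modality selector. This is illustrative only; the function name is invented, and the default threshold mirrors the 70 dB example used elsewhere in this text.

```python
def confirmation_modality(noise_db, noise_threshold_db=70.0):
    """Pick how to recognize the user's confirmation reply.

    Per the scheme above: if ambient noise has dropped below the set
    value, fall back to ordinary speech recognition; otherwise keep
    using lip recognition and/or visual (body-gesture) recognition.
    """
    if noise_db < noise_threshold_db:
        return "speech"
    return "lip_or_visual"
```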
  • the electronic device is a robot.
  • when it is judged that it is difficult to recognize the user's voice command by speech recognition, it is further determined whether to use lip recognition by judging whether the user is interacting with the voice assistant. The method can effectively improve the accuracy of voice command recognition, thereby further improving the accuracy with which the electronic device executes the user's voice commands.
  • the second aspect of the embodiment of the present application provides an electronic device, including:
  • a memory for storing instructions to be executed by the one or more processors of the electronic device; and
  • the processor is one of the one or more processors of the electronic device, configured to execute the above voice interaction method.
  • the third aspect of the embodiments of the present application provides a computer-readable storage medium, where instructions are stored on the computer-readable storage medium, and when the instructions are executed, the computer executes the above voice interaction method.
  • the fourth aspect of the embodiments of the present application provides a computer program product, where the computer program product includes instructions, and when the instructions are executed, the computer executes the above voice interaction method.
  • Fig. 1 shows a schematic diagram of a scene of a voice interaction method according to some embodiments of the present application;
  • Fig. 2 shows a schematic structural diagram of an electronic device according to some embodiments of the present application;
  • Fig. 3 shows a schematic flowchart of a voice interaction method according to some embodiments of the present application;
  • Fig. 4 shows a schematic scene diagram of a voice interaction method according to some embodiments of the present application;
  • Fig. 5 shows a schematic diagram of a scene of a voice interaction method according to some embodiments of the present application;
  • Fig. 6 shows a schematic flowchart of a voice interaction method according to some embodiments of the present application.
  • the embodiment of the application discloses a voice interaction method, electronic equipment and media.
  • the electronic devices applicable to the embodiments of the present application may be various electronic devices with voice recognition functions, including but not limited to robots, laptop computers, desktop computers, tablet computers, smart phones, servers, wearable devices, Head-mounted displays, mobile email devices, portable game consoles, portable music players, reader devices, televisions with one or more processors embedded in or coupled to them, or other electronic devices with computing capabilities.
  • the speech recognition function of the above-mentioned electronic device can be implemented in the form of various applications, for example as a standalone voice assistant, or built into an application program of the electronic device, such as the voice search in a maps application.
  • in the following description, the case in which the electronic device is a robot and the speech recognition function is implemented as the robot's voice assistant is taken as an example.
  • users can control electronic devices such as robots through voice commands.
  • when the voice assistant of the robot uses speech recognition to recognize the user's voice commands, the voice assistant may be unable to judge when the user's voice command ends, so that it remains in the receiving state, or unable to judge when the voice command starts, so that it never begins receiving sound. It then cannot execute the user's voice command, which in turn affects the user experience.
  • the embodiment of the present application provides a voice interaction method.
  • in one possible scheme, when the voice assistant is awakened by the user, it can detect the surrounding noise level through a noise detection function. If the noise level is higher than a set threshold, the current speech recognition mode is switched to the lip recognition mode, so that the electronic device recognizes the user's voice command through lip recognition technology and executes it.
  • for example, suppose the voice assistant detects through the noise detection function that the surrounding noise level is higher than the set threshold: if the threshold is set to 70 decibels and the detected noise value is 78 decibels, the voice assistant switches the current speech recognition method to lip recognition, recognizes the user's voice command through lip recognition technology, and executes the "storytelling" voice command.
  • the above technique can recognize voice commands in some circumstances, but in ordinary scenarios the accuracy of lip recognition is generally lower than that of speech recognition. In the above scheme, there may therefore be scenarios in which speech recognition could still recognize accurately even though the surrounding environment is noisy; switching from speech recognition to lip recognition at that point increases the risk of recognition errors.
  • in contrast, the present method does not directly switch from the speech recognition mode to the lip recognition mode to obtain a lip recognition result after judging through the noise detection function that the surrounding environment is too noisy. Instead, it first judges whether speech recognition has truly become unusable; only if so does it judge whether the condition for adopting lip recognition is satisfied, and the lip recognition result is obtained only after that condition is judged to be met. Judging whether speech recognition cannot be used proceeds as follows:
  • by judging that the voice assistant's listening time is too long, for example exceeding the system's conventional set value, it can be determined that the voice assistant cannot judge the end time point of the user's voice command because the external environment is too noisy; or, by judging that no sound can be received, it can be determined that the voice assistant cannot judge the start time point of the user's voice command for the same reason. In either case it is determined that it is difficult to accurately recognize the user's voice command through speech recognition at this time, and it is confirmed that the command cannot currently be recognized in that way.
  • alternatively, the surrounding environment can first be judged through the noise detection function. If the noise value of the surrounding environment is less than the set noise value, it is directly determined that speech recognition can be used. If the noise value is greater than or equal to the set noise value, it is further judged whether the voice assistant has been listening for too long, for example beyond the system's conventional set value, or whether it is unable to receive sound at all. If so, it is determined that the external environment is too noisy for the voice assistant to judge the end or start time point of the user's voice command, and thus that the command is difficult to recognize accurately through speech recognition at this time.
  • then, by checking whether the user's face has been facing the camera within the set time period, whether the face is within the shooting range of the robot's camera, and so on, it is confirmed whether the user is interacting with the voice assistant, and thereby whether the lip recognition result should be used. If it is confirmed that the user is interacting with the voice assistant, it can be determined that lip recognition can recognize the user's voice command relatively accurately; the lip recognition result is then obtained, and the function corresponding to it is executed.
  • when the voice assistant is awakened, it enters the speech recognition mode and starts to listen.
  • when user 001 sends out the voice command "tell a story", the voice assistant does not detect that the user's voice instruction has ended, and it keeps receiving sound continuously.
  • when the sound-receiving time exceeds the system set value, for example 10 seconds, the voice assistant can judge that it cannot determine the end time point of user 001's voice instruction because the external environment is too noisy, and thus that the voice instruction is difficult to recognize accurately through speech recognition at this time. It then detects whether user 001's face was facing the camera within the set time during the sound reception just performed, and whether the face remained within the shooting range of the robot's camera.
  • if the detection result is yes, it is determined that user 001 is interacting with the electronic device. It can therefore be determined that user 001's voice command can be recognized relatively accurately by means of lip recognition; the command is then recognized through lip recognition, and the "tell a story" voice command is executed.
  • in the voice interaction method provided by the embodiment of the present application, it is first judged that the voice assistant has been receiving sound for a long period, from which it is judged that the voice assistant may be unable to determine when user 001's voice command ends and hence that the external environment is too noisy. On that basis it can be judged more accurately that speech recognition is difficult to use in this case, and whether to use lip recognition is further determined by judging whether user 001 is interacting with the voice assistant. This effectively avoids the situation in which a lip recognition result is used, lowering accuracy, when speech recognition would still have worked, and effectively improves the accuracy of voice command recognition.
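  • the overall fallback flow of the example above can be sketched as follows. This is an editorial illustration rather than the claimed method; the 10-second limit is taken from the example, and the function and mode names are hypothetical.

```python
def choose_recognition_mode(listening_s, face_toward_camera, face_in_frame,
                            system_limit_s=10.0):
    """Decide which recognition mode to use, following the example flow:

    speech recognition is abandoned only when the listening time exceeds
    the system limit, and lip recognition is adopted only if the user was
    interacting (facing the camera and within the frame) during that
    time; otherwise no mode is chosen.
    """
    if listening_s <= system_limit_s:
        return "speech"
    if face_toward_camera and face_in_frame:
        return "lip"
    return None
```

Returning `None` corresponds to the case where neither modality is expected to be reliable, which the text leaves to further handling.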
  • the embodiment is illustrated taking the electronic device as the robot 002. It should be understood that the robot 002 in the embodiment of the present application can also interact with a cloud server, sending the recognized voice command of user 001 to the cloud server, and the cloud server can use a database to feed interaction content back to the robot 002.
  • the interaction content includes, for example, songs, stories, and the like.
  • the robot 002 may include a processor 110 , a power module 140 , a memory 180 , a sensor module 190 , an audio module 150 , a camera 170 , an interface module 160 , buttons 101 , and a display screen 102 .
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the robot 002.
  • the robot 002 may include more or fewer components than shown in the illustration, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a neural-network processing unit (NPU), a micro-programmed control unit (MCU), an artificial intelligence (AI) processor, a field-programmable gate array (FPGA), or other processing modules or circuits. The different processing units may be independent devices or may be integrated into one or more processors.
  • a storage unit may be provided in the processor 110 for storing instructions and data. In some embodiments, the storage unit in processor 110 is cache memory 180 .
  • the processor 110 may control a corresponding program to execute the voice interaction method provided in the embodiment of the present application.
  • an artificial intelligence processor can be used to recognize the received voice to obtain a recognition result;
  • the above-mentioned image processor can be used to analyze the collected lip movements of user 001 to obtain a recognition result; meanwhile, it can also be used to recognize the collected body movements of user 001 and obtain a recognition result.
  • the processor 110 may be used to detect the noise around the electronic device in real time, so as to select a more accurate identification method.
  • the power module 140 may include a power supply, power management components, and the like.
  • the power source can be a battery.
  • the power management component is used to manage the charging of the power supply and the power supply from the power supply to other modules.
  • the power management component includes a charge management module and a power management module.
  • the charging management module is used to receive charging input from the charger; the power management module is used to connect the power supply, the charging management module and the processor 110 .
  • the power management module receives power and/or input from the charging management module, and supplies power to the processor 110 , the display screen 102 , the camera 170 , and the wireless communication module 120 .
  • the wireless communication module 120 may include an antenna, and transmit and receive electromagnetic waves via the antenna.
  • the wireless communication module 120 can provide applications on the robot 002 including wireless local area networks (wireless local area networks, WLAN) (such as wireless fidelity (wireless fidelity, Wi-Fi) network), bluetooth (bluetooth, BT), global navigation satellite system ( Global navigation satellite system (GNSS), frequency modulation (frequency modulation, FM), near field communication (near field communication, NFC), infrared technology (infrared, IR) and other wireless communication solutions.
  • the robot 002 can communicate with the network and other devices through wireless communication technology.
  • the robot 002 can communicate with the cloud server through the wireless communication module 120 .
  • the display screen 102 is used for displaying human-computer interaction interfaces, images, videos, and the like.
  • the display screen 102 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, quantum dot light-emitting diodes (QLED), etc.
  • the display screen 102 may be used to display various application program interfaces of the robot 002 .
  • the sensor module 190 may include a proximity light sensor, a pressure sensor, a gyro sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
  • the audio module 150 is used to convert digital audio information into analog audio signal output, or convert analog audio input into digital audio signal.
  • the audio module 150 may also be used to encode and decode audio signals.
  • the audio module 150 can be set in the processor 110, or some functional modules of the audio module 150 can be set in the processor 110.
  • the audio module 150 may include a speaker, an earpiece, a microphone, and an earphone jack.
  • the audio module 150 can be used to receive voice instructions from the user 001.
  • the audio module 150 can also be used to perform operations such as playing music and telling stories according to the voice instructions of the user 001.
  • Camera 170 is used to capture still images or video.
  • the object generates an optical image through the lens and projects it to the photosensitive element.
  • the photosensitive element converts the light signal into an electrical signal, and then transmits the electrical signal to the image signal processor (ISP) to be converted into a digital image signal.
  • the robot 002 can realize the shooting function through the ISP, the camera 170, the video codec, the graphics processing unit (GPU), the display screen 102, and the application processor.
  • the camera 170 can obtain user 001's face images, lip movement images, and the like.
  • the interface module 160 includes an external memory interface, a universal serial bus (universal serial bus, USB) interface, and the like.
  • the external memory interface can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the robot 002.
  • the external memory card communicates with the processor 110 through the external memory interface to realize the data storage function.
  • the universal serial bus interface is used for the robot 002 to communicate with other electronic devices.
  • the robot 002 further includes a button 101 .
  • the button 101 may include a volume key, a power key, and the like.
  • FIG. 3 shows a schematic diagram of a voice interaction method, wherein the voice interaction method shown in FIG. 3 can be executed by a voice assistant of the robot 002 .
  • the voice interaction method shown in Figure 3 includes:
  • the speech recognition mode is a mode for recognizing the speech instruction of the user 001 received by the robot 002 .
  • after the voice assistant is woken up, it turns on sound reception and acquires the features of user 001's mouth changes. Turning on sound reception makes it possible, when a subsequent step determines that user 001's voice command is to be recognized by speech recognition, to recognize the received sound directly. Acquiring user 001's mouth change features likewise makes it possible, when a subsequent step determines that the command is to be recognized by lip recognition, to recognize the acquired mouth change features directly.
  • the receiving of sound can be realized based on the microphone of the robot 002, and the acquisition of the mouth change characteristics of the user 001 can be realized based on the camera device of the robot 002.
  • the voice assistant of the robot 002 needs to be woken up by receiving a wake-up word from user 001 before it enters the speech recognition mode. For example, if the wake-up word of the voice assistant is "Hi, Xiaoyi", then when user 001 speaks "Hi, Xiaoyi", the voice assistant enters the speech recognition mode, turns on sound reception, and acquires the user's mouth change features, in order to receive voice instructions from user 001.
  • the speech recognition function and the lip recognition function can be enabled selectively, as described below:
  • in some embodiments, the speech recognition function and the lip recognition function can both be turned on during the entire process of receiving sound, so that the received sound is recognized directly in real time by speech recognition to obtain the speech recognition result, and the captured mouth features of user 001 are recognized directly in real time by lip recognition to obtain the lip recognition result.
  • in this way, if speech recognition is later chosen, the speech recognition result can be obtained directly without performing speech recognition again; and if lip recognition is later chosen, the lip recognition result can likewise be obtained directly without performing lip recognition again, which effectively saves the time needed to recognize user 001's voice.
  • in other embodiments, after entering the speech recognition mode, the subsequent steps can first judge whether lip recognition or speech recognition is to be used to recognize user 001's voice command, and only then turn on the speech recognition function or the lip recognition function. This scheme can effectively reduce unnecessary speech recognition or lip recognition computation.
  • in still other embodiments, the speech recognition function can be turned on during the whole process of sound reception, and the lip recognition function can be turned on only when it is later judged that lip recognition is needed.
  • this implementation is chosen because most scenarios still use speech recognition to recognize user 001's voice commands, so keeping the speech recognition function on avoids repeatedly turning it on and off, reducing the computation load of the processor 110 in the most common scenarios and increasing its running speed. Lip recognition, by contrast, is needed only in a small number of scenarios, so turning on the lip recognition function only when it is determined to be needed effectively reduces the computational load of the recognition process.
  • the speech recognition function can be implemented based on the artificial intelligence processor of the robot 002 .
  • the artificial intelligence processor can intelligently recognize the voice of user 001.
  • the lip language recognition function can be realized based on the image processor of the robot 002: the image processor can continuously detect faces in the images, determine which person is speaking, and extract that person's continuous mouth-shape change features; these features are input into the lip language recognition model in the image processor to recognize the pronunciations corresponding to the speaker's mouth shapes; then, based on the recognized pronunciations, the most likely natural language sentence is obtained.
  • the recognition rate of the lip language recognition method is not high in general, open-ended scenarios.
  • in some scenarios, however, the recognition accuracy rate is relatively high, for example, it can reach more than 90%.
  • S302: Detect whether the condition for recognizing the user's voice instruction by means of voice recognition is satisfied. If it is satisfied, the speech recognition method can be used to recognize the user's voice command, and the process proceeds to S308 to obtain the speech recognition result; if it is not satisfied, the speech recognition method can no longer accurately recognize the user's voice command, and the process goes to S303 to check whether the condition for entering the lip recognition mode is met.
  • the voice assistant can collect sound by controlling the microphone, and can judge the time points at which user 001's voice starts or is cut off by using voice activity detection (VAD) technology.
  • the condition for recognizing the voice command of user 001 by means of voice recognition may be that the voice assistant has already started receiving sound and the continuous sound collection has lasted less than the first set value.
  • the first set value can be determined according to the relevant performance of the device and the typical duration of user 001's voice. For example, although the device can still recognize speech exceeding 10 s, its dialogue system can no longer give an effective answer; or user 001 generally takes no more than 10 seconds to issue a voice command, with at least one pause within those 10 seconds. When continuous sound collection exceeds 10 seconds, the VAD technology of the device can no longer accurately identify the cut-off time point of the human voice in the audio. Therefore, the first set value can be set to 10 seconds.
  • when the voice assistant continues to listen for longer than the first set value, it can be judged that voice boundary detection technology has been unable to accurately determine the time point at which user 001's voice ends, which is why the voice assistant has kept listening. It is therefore determined that the external environment is too noisy, and it can be concluded that it is difficult to recognize the voice instruction of user 001 through voice recognition at this time.
  • turning on sound collection may mean that the sound collection function is enabled and collection is allowed, but in some cases it may still be impossible to actually receive sound.
  • for example, the voice boundary detection technology may be unable to accurately determine the time point at which user 001's voice starts, so the voice assistant may not be able to receive the voice.
  • when the voice assistant has been unable to receive sound, it can be judged that the voice boundary detection technology has been unable to accurately determine the time point at which user 001's voice starts. It can therefore be determined that the external environment is too noisy, and it is difficult to recognize the voice instruction of user 001 through voice recognition at this time.
  • the above solution confirms that the voice assistant has started to receive sound and that the continuous sound collection is less than or equal to the first set value, thereby confirming that VAD technology can still accurately judge the time points at which user 001's voice starts and is cut off; it is then determined that the speech recognition mode can be used to recognize the voice instruction of user 001 at this time.
  • the condition for recognizing the voice command of the user 001 by means of voice recognition may be that the noise value of the surrounding environment is less than a set value.
  • a continuous sound-collection duration that is less than the first set value indicates that the VAD technology can still accurately determine the time point at which the voice of user 001 is cut off, and it is then determined that the voice command of user 001 can be recognized by voice recognition at this time.
  • if the duration of the voice assistant's sound collection is greater than or equal to the first set value, as mentioned above, this indicates that the VAD technology has been unable to accurately determine the time point at which the voice of user 001 is cut off, and it is then determined that the speech recognition method can no longer accurately recognize the voice commands of user 001 at this time.
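The S302 condition described above can be expressed as a small sketch; the function name and the way the listening state and duration are supplied are assumptions for illustration only:

```python
# Illustrative sketch of the S302 check; names and the 10-second
# threshold follow the example in the text, not a fixed requirement.

FIRST_SET_VALUE_S = 10.0  # e.g. the 10-second limit discussed above

def speech_condition_met(listening_started: bool, duration_s: float) -> bool:
    """VAD can still locate the start/cut-off points of the user's voice
    only while sound collection has started and has lasted less than the
    first set value; otherwise speech recognition is deemed unreliable."""
    return listening_started and duration_s < FIRST_SET_VALUE_S
```

When this returns False, the flow proceeds to the lip-recognition checks of S303 instead of S308.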
  • S303 Detect whether the condition for recognizing the user's voice command by means of lip recognition is satisfied.
  • only when the voice assistant detects that user 001 has kept interacting with it, for example, the user 001 in front of the robot 002 has remained the same person, the face has been facing the camera device of the robot 002, and the distance between user 001 and robot 002 is within the set range, is it considered that the conditions for entering the lip recognition mode are met.
  • this indicates that the movement of the user's mouth can be accurately captured at this time, and the voice command of user 001 can therefore be recognized more accurately by using the lip recognition method.
  • the lip recognition result may be the instruction keyword included in the user 001 instruction.
  • the above-mentioned keywords can be conventional instruction keywords that have been stored in the voice assistant and trained into the model, for example, tell a story, read a picture book, play music, tell a joke, exit, return, etc. It is understandable that, because the above-mentioned keywords have already been stored in the voice assistant, these instruction keywords can be accurately identified by the lip recognition method.
  • in order to avoid misrecognition when command words are mixed into long sentences, a keyword may be used as the result of lip recognition only when there is a pause both before and after the recognized keyword.
  • for example, if user 001 pauses before and after saying "telling a story", "telling a story" can be directly used as the lip recognition result, so that the function corresponding to the lip recognition result "telling a story" is executed.
  • This scheme can effectively avoid the misrecognition that may be caused by command words appearing in long sentences.
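The pause-bounded keyword rule above can be sketched as follows, assuming a hypothetical tokenized lip-recognition output in which pauses appear as explicit markers:

```python
def lip_keyword_result(tokens, keywords, pause="<pause>"):
    """Accept a stored command keyword as the lip recognition result
    only when it is bounded by pauses (or utterance boundaries), so a
    command word embedded in a longer sentence is not misrecognized."""
    for i, tok in enumerate(tokens):
        if tok in keywords:
            before_ok = i == 0 or tokens[i - 1] == pause
            after_ok = i == len(tokens) - 1 or tokens[i + 1] == pause
            if before_ok and after_ok:
                return tok
    return None
```

With this rule, "tell a story" spoken in isolation is accepted, while the same phrase inside a longer sentence is ignored.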
  • S305: Confirm whether the lip language recognition result is correct. If the result is yes, the lip language recognition is accurate, and the process goes to S306; if the result is no, the lip language recognition is incorrect, and the process goes to S307.
  • the voice recognition method provided in the embodiment of the present application may include a step of confirming the voice command to the user 001 .
  • the confirmation of the voice command with user 001 may be asking user 001 whether he wants the voice assistant to perform the function corresponding to the lip recognition result. For example, as shown in FIG. 4, if the recognized keyword is "telling a story", the way to confirm the voice command with user 001 can be to ask user 001, where the content of the inquiry can be: "Do you want me to tell you a story?" or a similar question.
  • the confirmation may be indicated by voice answering "Yes".
  • since it has been determined that speech recognition may not accurately recognize the voice command of user 001, when the voice assistant asks user 001 for confirmation, the visual recognition function can be turned on at the same time, so as to recognize a confirmation that user 001 performs through body movements; for example, user 001 may express confirmation through a nodding action or an OK gesture.
  • the visual recognition function can be a function that detects the body movements of user 001, and it can be realized based on the image processor of robot 002: the collected body movements are parsed to obtain a visual recognition result. For example, after the image processor collects an image of user 001's nodding action, it can analyze the image, and the recognition result obtained can be the text corresponding to the nodding action, such as "confirm", "yes", etc.
  • in order to further increase the accuracy of the voice assistant's recognition of user 001's voice commands, the voice assistant can also enable the noise detection function in addition to the visual recognition function when performing voice confirmation with user 001, so as to detect ambient noise in real time.
  • if the ambient noise falls below the set value, the voice command of user 001 can be recognized by the voice recognition method at this time, and the confirmation command of user 001 or subsequent voice commands can then be recognized by voice recognition.
  • if the ambient noise is still higher than the set value, lip recognition, visual recognition, or a combination of lip recognition and visual recognition is used to recognize user 001's confirmation command or other subsequent voice commands.
  • S306 Execute a function corresponding to the lip recognition result based on the lip recognition result.
  • the voice assistant can execute the function corresponding to "telling a story”.
  • the visual recognition function can be kept on continuously so that the voice assistant can keep recognizing the body movements of user 001.
  • the lip recognition function can be continuously turned on to obtain the lip recognition results of user 001.
  • as an example of a body action recognition result, user 001 makes a five-finger gesture within the range captured by the camera device of robot 002 to indicate that the storytelling task should be stopped, and the voice assistant can recognize the gesture and stop executing the task.
  • the noise detection function can be kept on for real-time detection of ambient noise. After it is judged that the ambient noise is lower than the set value, it can be determined that the voice command of user 001 can be recognized using the voice recognition mode, and the voice recognition mode can then be used; if the ambient noise is still higher than the set value, the lip recognition method, the visual recognition method, or a combination of the two is used to accurately recognize other voice commands of user 001.
  • for example, when the voice assistant performs the task of "telling a story", it simultaneously turns on the visual recognition function, the lip language recognition function and the noise detection function. At a certain moment, the voice assistant detects that the ambient noise is lower than the set value.
  • the voice assistant can then switch to the voice recognition method. For example, while the task is running, user 001 issues the command "stop telling the story"; the voice assistant recognizes this command by voice recognition and executes the function corresponding to the voice recognition result "stop telling the story".
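The noise-driven switch in this example can be sketched as follows; the decibel threshold and mode labels are illustrative assumptions, not values from the text:

```python
def choose_recognition_mode(ambient_noise_db: float,
                            threshold_db: float = 60.0) -> str:
    """While a task runs with noise detection on, fall back to speech
    recognition once ambient noise drops below the set value; otherwise
    keep using lip and/or visual recognition."""
    return "speech" if ambient_noise_db < threshold_db else "lip_or_visual"
```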
  • the voice recognition function and the lip language recognition function are always on during the process of the above-mentioned voice assistant executing the voice command of user 001.
  • the way of reminding the user 001 of the recognition failure may be to display prompt information such as "recognition error” and "unrecognizable” on the screen of the robot 002 .
  • the method of reminding the user 001 of the recognition failure may also be to prompt the user 001 through voice messages such as "recognition error” and "unrecognizable”.
  • after reminding user 001 that the recognition has failed, user 001 can also be prompted to face the camera, raise his voice, etc., and reminded to issue the voice command again.
  • a function corresponding to the speech recognition result may be executed based on the speech recognition result.
  • the speech recognition method provided in FIG. 3 first judges that the voice assistant has been receiving sound for a long period of time and that the voice assistant may therefore be unable to judge when the voice command of user 001 ends, from which it is determined that the external environment is too noisy and that it is difficult to recognize the voice command of user 001 by voice recognition under such circumstances. Further, whether to use the lip recognition method is determined by judging whether user 001 is interacting with the voice assistant; this effectively avoids the situation in which using the lip recognition result reduces the accuracy of the recognition result when speech recognition is still possible, and effectively improves the accuracy of voice command recognition. In addition, user 001 can be asked to confirm after the lip recognition result is obtained, which effectively ensures the accuracy of the recognition result.
  • whether user 001 has been maintaining the interactive state with the voice assistant in step 303 can be judged from the following aspects:
  • if the voice assistant detects that the user 001 interacting with it during the listening process has not changed, the possibility that user 001 is interacting with the voice assistant is relatively high.
  • if the user interacting with the voice assistant has changed, the user 001 who issued the voice command may have left. In this case, in some embodiments, it can be directly determined that the received voice is invalid. In some other embodiments, the voice command of the last user 001 who interacted with the voice assistant during the listening process can be detected instead. For example, if the voice assistant detects that the object interacting with it changed once during the listening process, that is, two users 001 interacted with the voice assistant during sound collection, the voice instruction of the second user 001 who interacted with the voice assistant during sound collection is detected.
  • whether the user interacting with the voice assistant has changed can be detected by detecting whether the face of the user 001 in front of the voice assistant has changed.
  • the interaction willingness value may be calculated based on the distance between the face of the user 001 and the voice assistant, the orientation of the face, and the like within a period of time. For example, if the distance between the face of the user 001 and the voice assistant is relatively short within a period of time, and the face of the user 001 is facing the voice assistant, the interaction willingness value is higher, and vice versa.
  • for example, the voice assistant can obtain the face angle of user 001 and the distance between user 001 and robot 002 by collecting images of user 001 over a period of time, and then obtain the interaction willingness value of user 001 through the interaction willingness value model according to the face angle and the distance between user 001 and the smart device. The higher the interaction willingness value, the stronger the interaction between user 001 and the voice assistant.
  • in the interaction willingness value model, different face angles can be defined to correspond to different values, and different distances between user 001 and robot 002 can likewise correspond to different values; the value corresponding to the face angle and the value corresponding to the distance can be assigned different weights.
  • for example, since the face angle better reflects whether user 001 is interacting with the voice assistant, the weight corresponding to the face angle can account for 60%, and the weight corresponding to the distance between user 001 and robot 002 can account for 40%.
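A minimal sketch of such an interaction willingness value model, using the 60%/40% example weights above; the normalization ranges (90 degrees, 3 meters) are illustrative assumptions:

```python
def interaction_willingness(face_angle_deg: float, distance_m: float,
                            max_angle_deg: float = 90.0,
                            max_distance_m: float = 3.0) -> float:
    """Weighted willingness score in [0, 1]: a frontal, nearby face
    scores high. Face angle carries 60% of the weight and distance 40%,
    matching the example weights in the text."""
    angle_score = max(0.0, 1.0 - abs(face_angle_deg) / max_angle_deg)
    distance_score = max(0.0, 1.0 - distance_m / max_distance_m)
    return 0.6 * angle_score + 0.4 * distance_score
```

A user facing the camera from nearby scores close to 1, while a turned-away or distant user scores close to 0; the score can then be compared against a set value in step S303D.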
  • in some embodiments, steps 302 and 303 in FIG. 3 can be reordered and supplemented; the specific speech recognition method is shown in FIG. 6, and step 301 and steps 304-308 are as described above and will not be repeated here.
  • steps 302-303 can be adjusted as follows:
  • S302A Determine whether there is a human voice in the received sound.
  • the human voice detection model in the artificial intelligence processor can be used to detect whether there is a human voice in the received sound. If there is a human voice, S302B is executed to further determine whether the duration of sound collection detected by the voice assistant is less than the first set value. If there is no human voice, then after the set interval time, the process goes to S302C, sound collection is restarted, and the duration of sound collection is recalculated.
  • the interval setting time can be 200ms.
  • if the judgment result is yes, the speech recognition method can be used to recognize the user's voice command, and the process proceeds to S308 to obtain the speech recognition result; if the judgment result is no, the process goes to S303A to check whether the conditions for entering the lip recognition mode are met.
  • the conditions for voice command recognition are as described in step S302 in FIG. 3 and will not be repeated here.
  • S302C Start receiving sound again, and recalculate the duration of receiving sound.
  • S303A: Determine whether the face-tracked user 001 has remained unchanged.
  • if the judgment result is yes, the user interacting with the voice assistant during the sound collection process has always been the same user, and the process goes to S303B, where this user is taken as the user interacting with the voice assistant; if the judgment result is no, the user interacting with the voice assistant changed during the sound collection process, and the process goes to S303C, where the last user 001 collected by the camera device is taken as the user 001 interacting with the voice assistant.
  • S303B The current user is used as the user interacting with the voice assistant. Wherein, the current user is a user who has been interacting with the voice assistant during the listening process.
  • S303C Use the last user 001 captured by the camera device as the user 001 interacting with the voice assistant.
  • S303D: Determine whether the interaction willingness value of the user 001 interacting with the voice assistant reaches the set value.
  • if the judgment result is yes, it indicates that user 001 is interacting with the device, and the process goes to S305 to obtain the lip recognition result; if the judgment result is no, it indicates that user 001 is not willing to interact with the voice assistant and that even the lip recognition method would have difficulty recognizing the user's voice command, so the process goes to step S307 and the user is prompted that the recognition has failed.
  • the speech recognition method shown in FIG. 6 of the present application orders the several judgment conditions for using lip recognition, and can more accurately judge the timing for using lip recognition.
  • in addition, the current process can be ended early and the next round of detection started, which avoids unnecessary subsequent recognition steps and effectively improves recognition efficiency.
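One possible ordering of the FIG. 6 checks can be sketched as follows; the thresholds and branch labels are illustrative assumptions, not terms from the patent:

```python
# Hypothetical ordering of the FIG. 6 checks (S302A-S303D).

def recognition_flow(has_human_voice: bool, duration_s: float,
                     willingness: float,
                     duration_limit_s: float = 10.0,
                     willingness_threshold: float = 0.5) -> str:
    if not has_human_voice:
        return "restart_listening"      # S302C: re-collect sound
    if duration_s < duration_limit_s:
        return "speech_recognition"     # S302B yes -> S308
    # S303A-S303C pick the interacting user (e.g. the last tracked face);
    # S303D then checks that user's interaction willingness value.
    if willingness >= willingness_threshold:
        return "lip_recognition"        # obtain the lip recognition result
    return "recognition_failed"         # S307: prompt the user
```

Ending the flow early at the first failed check is what avoids unnecessary subsequent recognition steps.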
  • the voice recognition method provided by the embodiments of the present application first judges that the voice assistant has been receiving sound for a long period of time and that the voice assistant may therefore be unable to judge when the voice command of user 001 ends, from which it is determined that the external environment is too noisy and that it is difficult to recognize the voice command of user 001 by voice recognition under such circumstances. Further, whether to use the lip recognition method is determined by judging whether user 001 is interacting with the voice assistant; this effectively avoids the situation in which using the lip recognition result reduces the accuracy of the recognition result when speech recognition is still possible, and effectively improves the accuracy of voice command recognition.
  • the user 001 can be asked again after the lip language recognition result is obtained, which can effectively ensure the accuracy of the recognition result.
  • the visual recognition function and the noise detection function can be turned on while confirming to the user 001.
  • on the one hand, the voice assistant can keep recognizing the body movements of user 001; on the other hand, as the ambient noise changes, it can adjust the way of recognizing user 001's voice commands in time, increasing the accuracy of voice command recognition.
  • the embodiment of the present application also provides a voice interaction device, including:
  • the detection module is configured to control the electronic device to enter the voice recognition mode after detecting that the user 001 wakes up the voice assistant.
  • the recognition control module is configured to control the electronic device to recognize the voice command of the user 001 by voice recognition to obtain a voice recognition result if it detects that the current voice interaction environment of the electronic device satisfies the voice recognition condition.
  • the recognition control module is further configured to, if it is detected that the current voice interaction environment of the electronic device does not satisfy the voice recognition condition, control the electronic device to use a lip recognition method to recognize the mouth change characteristics of user 001 acquired by the electronic device through the image acquisition device, so as to obtain a lip recognition result.
  • the execution module is configured to control the electronic device to execute the function corresponding to the recognition result according to the recognition result acquired by the electronic device. For example, if the electronic device obtains the result of lip recognition, the electronic device is controlled to execute the function corresponding to the lip recognition result; if the electronic device obtains the result of speech recognition, the electronic device is controlled to execute the function corresponding to the speech recognition result.
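The three modules above can be sketched as a small skeleton; the callback interfaces are hypothetical and only illustrate how the modules fit together:

```python
class VoiceInteractionApparatus:
    """Illustrative skeleton of the detection, recognition control and
    execution modules; the recognizer callbacks are placeholders."""

    def __init__(self, speech_ok, speech_fn, lip_fn):
        self.speech_ok = speech_ok    # () -> bool: voice environment check
        self.speech_fn = speech_fn    # audio -> speech recognition result
        self.lip_fn = lip_fn          # mouth frames -> lip recognition result

    def detect_wakeup(self, heard_wake_word: bool) -> bool:
        # detection module: enter voice recognition mode on wake-up
        return heard_wake_word

    def recognize(self, audio, mouth_frames):
        # recognition control module: choose speech or lip recognition
        if self.speech_ok():
            return self.speech_fn(audio)
        return self.lip_fn(mouth_frames)

    def execute(self, result):
        # execution module: run the function matching the recognition result
        return f"executing: {result}"
```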
  • Embodiments disclosed in this application may be implemented in hardware, software, firmware, or a combination of these implementation methods.
  • Embodiments of the present application may be implemented as a computer program or program code executed on a programmable system comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements) , at least one input device, and at least one output device.
  • Program code can be applied to input instructions to perform the functions described herein and to generate output information.
  • the output information may be applied to one or more output devices in known manner.
  • a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), microcontroller, application specific integrated circuit (ASIC), or microprocessor.
  • the program code can be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system.
  • Program code can also be implemented in assembly or machine language, if desired.
  • the mechanisms described in this application are not limited in scope to any particular programming language. In either case, the language may be a compiled or interpreted language.
  • the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof.
  • the disclosed embodiments can also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which can be read and executed by one or more processors.
  • instructions may be distributed over a network or via other computer-readable media.
  • a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy disks, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical discs, read-only memory (ROM), random-access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable memory used to transmit information over the Internet by electrical, optical, acoustic, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
  • a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (eg, a computer).
  • the embodiment of the present application also provides a computer program or a computer program product including the computer program.
  • when the computer program is executed on a computer, it enables the computer to implement the above voice command execution method.
  • the computer program product may include instructions, and the instructions are used to implement the above voice interaction method.
  • each unit/module mentioned in the device embodiments of this application is a logical unit/module.
  • physically, a logical unit/module can be a physical unit/module, can be a part of a physical unit/module, or can be realized with a combination of multiple physical units/modules. The physical implementation of these logical units/modules is not the most important; the combination of the functions realized by these logical units/modules is the key to solving the technical problems raised in this application.
  • in addition, the above device embodiments of this application do not introduce units/modules that are not closely related to solving the technical problems proposed by this application, which does not mean that other units/modules do not exist in the above device embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Manipulator (AREA)

Abstract

A speech interaction method, an electronic device, and a medium. The speech interaction method comprises: where it is detected that the current speech interaction environment of an electronic device does not meet a speech recognition condition, determining whether the current interaction state of a user meets a lip reading recognition condition (S303); if so, acquiring a lip reading recognition result, which is obtained by recognizing, by means of lip reading recognition, a speech instruction of the user which is received by the electronic device (S304); and executing a function corresponding to the lip reading recognition result (S306). By means of the method, where it is determined that the speech instruction of the user has become difficult to recognize by means of speech recognition, whether lip reading recognition is used is further determined by determining whether the user is interacting with a voice assistant, such that the accuracy of speech instruction recognition can be effectively improved, thereby further improving the accuracy of executing the speech instruction of the user by the electronic device.

Description

Voice interaction method, electronic device and medium
This application claims priority to the Chinese patent application with application number 202110865871.X, entitled "Voice Interaction Method, Electronic Device and Medium", filed with the China Patent Office on July 29, 2021, the entire contents of which are incorporated by reference in this application.
Technical Field
The present application relates to the technical field of human-computer interaction, and in particular to a voice interaction method, an electronic device and a medium.
Background
With the development of artificial intelligence technology, electronic devices such as robots have been widely used in catering, education, medical care, culture, smart home, finance, telecommunications and other industries, and can provide users with comprehensive artificial intelligence services.
Users can interact with electronic devices such as robots through touch screens, voice, remote control, etc. When a user interacts with a robot by voice, the robot can recognize the user's voice command and execute it. For example, as shown in FIG. 1, if the user wants to command the robot to perform the "telling a story" operation, the user can issue a "telling a story" voice command; after the robot recognizes this voice command, it can perform the "telling a story" operation.
However, in a noisy environment, when the user gives instructions to the robot by voice, the robot may be unable to receive or recognize the user's voice instructions and execute them. For example, because the current environment is too noisy, the robot cannot judge when the user's voice command ends and therefore remains in a continuous listening state, or it cannot judge when the user's voice command starts and therefore remains in a non-listening state; as a result, it cannot provide feedback such as executing the user's voice instructions, which seriously affects the user experience.
发明内容Contents of the invention
To solve the above technical problem that an electronic device in a noisy environment may be unable to receive or recognize a user's voice instruction and execute it, a first aspect of the embodiments of the present application provides a voice interaction method, an electronic device, and a medium. The method can be applied to an electronic device, and the method includes:
when it is detected that the current voice interaction environment of the electronic device does not satisfy a voice recognition condition, determining whether the user's current interaction state satisfies a lip recognition condition;
when it is determined that the user's current interaction state satisfies the lip recognition condition, obtaining a lip recognition result produced by applying lip recognition to the user's mouth-movement features captured by the electronic device through an image acquisition apparatus; and
executing the function corresponding to the lip recognition result.
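The steps above can be sketched as a small dispatch routine. This is an illustrative sketch only: the predicate and callback names (`env_ok`, `lip_ok`, `recognize_lip`, `execute`) are hypothetical stand-ins for the detection and recognition components described in this application, not part of the claimed method.

```python
def handle_voice_interaction(env_ok, lip_ok, recognize_lip, execute):
    """Sketch of the claimed method: fall back to lip recognition only
    when the voice interaction environment rules out speech recognition
    AND the user's current interaction state supports lip recognition."""
    if env_ok():                 # voice recognition condition is satisfied
        return "speech"          # proceed with ordinary speech recognition
    if not lip_ok():             # lip recognition condition is not satisfied
        return "none"            # no reliable recognition path at this point
    result = recognize_lip()     # recognize the user's mouth-movement features
    execute(result)              # perform the function the result corresponds to
    return "lip"
```

For instance, in a noisy room where the user keeps interacting with the device, `handle_voice_interaction(lambda: False, lambda: True, lambda: "tell a story", execute)` takes the lip-recognition path and passes "tell a story" to the execution callback.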
When the voice interaction method provided by the embodiments of the present application determines that the user's voice instruction can no longer be reliably recognized by voice recognition, it further determines whether to use lip recognition by judging whether the user is interacting with the voice assistant. This method can effectively improve the accuracy of voice instruction recognition, and thus further improve the rate at which the electronic device correctly executes the user's voice instructions.
It can be understood that, in the embodiments of the present application, the noise value around the electronic device and the duration of the electronic device's sound pickup both fall within the scope of the electronic device's voice interaction environment.
It can be understood that, in the embodiments of the present application, the image acquisition apparatus may be a camera apparatus used to capture images, such as a camera.
It can be understood that, in the embodiments of the present application, while receiving the user's voice instruction, the electronic device can simultaneously capture the user's voice and the user's mouth-movement features.
When it is determined that the current voice interaction environment of the electronic device satisfies the voice recognition condition, voice recognition is applied to the user's voice received by the electronic device to obtain a voice recognition result.
When it is determined that the current voice interaction environment of the electronic device does not satisfy the voice recognition condition but the user's current interaction state satisfies the lip recognition condition, lip recognition can be applied to the user's mouth-movement features captured by the electronic device's image acquisition apparatus to obtain a lip recognition result.
In a possible implementation of the first aspect above, the voice recognition condition includes:
the noise value around the electronic device is lower than a set noise value;
or,
when the noise value around the electronic device is greater than or equal to the set noise value, the duration of the electronic device's sound pickup is greater than zero and less than a set time.
It can be understood that, in some embodiments, determining whether the user's voice instruction can be recognized by voice recognition may begin by judging whether the noise value around the electronic device is lower than the set noise value. If so, the noise around the electronic device is low, and the user's voice instruction can be recognized by voice recognition at this point. If not, the noise around the electronic device is already high and the external environment is noisy; in that case, it is further judged whether the duration of the electronic device's sound pickup is greater than zero and less than the set time.
If the duration of the electronic device's sound pickup is greater than zero and less than the set time, the electronic device can still accurately determine the point at which the user's speech is cut off, and it is determined that the user's voice instruction can be recognized by voice recognition at this point. If the duration of the sound pickup is greater than or equal to the first set value (the set time), or no sound is picked up at all, the electronic device can no longer accurately determine the point at which the user's speech is cut off, and it is determined that voice recognition can no longer reliably recognize the user's voice instruction.
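The two-branch voice recognition condition described above can be expressed as a single predicate. The 70 dB noise limit and 10 s time limit below are illustrative defaults chosen for the sketch, not values fixed by this application.

```python
def speech_recognition_condition(noise_db, pickup_seconds,
                                 noise_limit=70.0, time_limit=10.0):
    """Return True when speech recognition is expected to work.

    Mirrors the two-branch condition: either the ambient noise is below
    the set noise value, or, in a noisy environment, the device's sound
    pickup has lasted more than zero and less than the set time, meaning
    the end point of the user's utterance can still be located.
    """
    if noise_db < noise_limit:          # quiet environment: speech recognition is fine
        return True
    # noisy environment: usable only while the pickup duration stays in range
    return 0 < pickup_seconds < time_limit
```

A pickup duration of zero models the "non-pickup state" (no detectable start point), and a duration at or beyond the time limit models the "continuous listening state" (no detectable end point); both make the predicate return False in a noisy environment.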
In a possible implementation of the first aspect above, the voice recognition condition includes: the duration of the electronic device's sound pickup is greater than zero and less than a set time.
It can be understood that, in some embodiments, whether the user's voice instruction can be recognized by voice recognition may be judged by directly determining whether the duration of the electronic device's sound pickup is greater than zero and less than the set time. If so, the electronic device can still accurately determine the point at which the user's speech is cut off, and it is determined that the user's voice instruction can be recognized by voice recognition at this point. If the duration of the sound pickup is greater than or equal to the first set value (the set time), or no sound is picked up at all, the electronic device can no longer accurately determine the point at which the user's speech is cut off; it can then be inferred that the external environment is too noisy, and it is determined that voice recognition can no longer reliably recognize the user's voice instruction.
In a possible implementation of the first aspect above, the lip recognition condition includes:
the user and the electronic device are in an interactive state within a set time.
It can be understood that if the user remains in an interactive state with the electronic device, the user still needs the electronic device to execute a voice instruction; if the user is no longer interacting with the electronic device, for example has walked away, the user no longer needs the electronic device to execute a voice instruction.
In some embodiments, if the user remains in an interactive state with the electronic device, the electronic device can capture the user's mouth-movement features more clearly, which makes it easier to apply lip recognition to the captured mouth-movement features and obtain a lip recognition result.
In a possible implementation of the first aspect above, the method for determining whether the user and the electronic device are in an interactive state within the set time includes:
determining whether the user interacting with the electronic device has changed within the set time;
detecting whether an interaction strength value between the user and the electronic device reaches a set strength value; and
when it is determined that the user interacting with the electronic device has not changed within the set time and the interaction strength value between the user and the electronic device reaches the set strength value, confirming that the user and the electronic device are in an interactive state;
wherein the interaction strength value is related to the distance between the user and the electronic device and to the orientation of the user's face.
In some embodiments of the present application, before the interaction strength value between the user and the electronic device is detected, it may be judged whether the user interacting with the electronic device within the above set time has changed. If no change has occurred, the current user can be taken as the subject for the subsequent interaction strength detection. When it is further determined that the interaction strength value between this user and the electronic device reaches the set strength value, it can be confirmed that the user and the electronic device are in an interactive state, and it can be judged that lip recognition can, at this point, identify the user's mouth-movement features more accurately and obtain a lip recognition result.
It can be understood that, in the embodiments of the present application, the interaction strength value can be obtained based on, for example, the distance between the user's face and the electronic device and the orientation of the user's face within the set time. For example, if the user's face is close to the electronic device and faces it directly within the set time, the interaction strength value is high; otherwise it is low.
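One way to realize the description above is to score closeness and face orientation separately and combine them. The particular formula, the 3 m / 45° normalization limits, and the 0.25 set strength value below are all illustrative assumptions for the sketch; the application only specifies that the strength value depends on distance and face orientation.

```python
def interaction_strength(distance_m, face_yaw_deg,
                         max_distance_m=3.0, max_yaw_deg=45.0):
    """Hypothetical interaction-strength score in [0, 1]: rises as the
    user stands closer to the device and as the face turns toward the
    camera (yaw 0 means facing the device directly)."""
    closeness = max(0.0, 1.0 - distance_m / max_distance_m)
    facing = max(0.0, 1.0 - abs(face_yaw_deg) / max_yaw_deg)
    return closeness * facing

def lip_condition(same_user, distance_m, face_yaw_deg, set_strength=0.25):
    """Interactive state: the user has not changed within the set time
    AND the interaction strength reaches the set strength value."""
    return same_user and interaction_strength(distance_m, face_yaw_deg) >= set_strength
```

A user half a metre away and facing the camera scores high and satisfies the condition, while a user four metres away, or a user who has been replaced mid-session, does not.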
It can be understood that the interaction strength value mentioned in the embodiments of the present application has the same meaning as the interaction willingness value; only the expression differs. The set strength value may be the second set value mentioned in the embodiments below.
In a possible implementation of the first aspect above, before the step of executing the function corresponding to the lip recognition result, the method further includes:
confirming whether the lip recognition result is correct; and
when it is confirmed that the lip recognition result is correct, executing the function corresponding to the lip recognition result.
In the embodiments of the present application, confirming once more whether the lip recognition result is correct before executing the corresponding function can effectively improve the rate at which the electronic device correctly executes the user's voice instructions.
In a possible implementation of the first aspect above, the method for confirming whether the lip recognition result is correct includes:
asking the user whether the function corresponding to the lip recognition result should be executed; and
when the user confirms that the function corresponding to the lip recognition result should be executed, confirming that the lip recognition result is correct.
In a possible implementation of the first aspect above, concurrently with the step of confirming whether the lip recognition result is correct, the method further includes:
obtaining the user's body-movement features and the noise value around the electronic device.
In the embodiments of the present application, during the step of confirming whether the lip recognition result is correct, the preceding steps have already established that the current environment is noisy and that voice recognition may no longer accurately recognize the user's voice instruction. Therefore, while the voice assistant asks the user for confirmation, a visual recognition function can be enabled at the same time. The visual recognition function can capture the user's body-movement features and makes it easier to recognize replies the user gives through body movements; for example, the user may confirm that the lip recognition result is correct by nodding or by making an "OK" gesture.
In some embodiments, to further increase the accuracy with which the electronic device recognizes the user's voice instructions, a noise detection function can also be enabled, in addition to the visual recognition function, while voice confirmation is requested from the user, so that the ambient noise is measured in real time. When the ambient noise is found to be lower than the set value, it can be determined that voice recognition is again able to recognize the user's voice instructions, and voice recognition can then be used to recognize the user's confirmation instruction and subsequent voice instructions. If the ambient noise is still higher than the set value, lip recognition, visual recognition, or a combination of the two is used to recognize the user's confirmation instruction and subsequent voice instructions.
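The confirmation step with its real-time noise re-check can be sketched as follows. The function and parameter names, and the 70 dB default, are illustrative stand-ins; in a real device the spoken and visual replies would come from the speech and vision pipelines rather than being passed in directly.

```python
def confirm_lip_result(result, noise_db, spoken_reply, visual_reply,
                       noise_limit=70.0):
    """Ask the user whether the recognized command should run.

    While the question is asked, the ambient noise is re-measured:
    below the set value the spoken reply is trusted (speech recognition
    works again); otherwise the visually recognized reply (a nod, an
    'OK' gesture, or lip movements) is used instead. Returns the
    confirmed result, or None if the user rejects it."""
    reply = spoken_reply if noise_db < noise_limit else visual_reply
    return result if reply == "yes" else None
```

So in a room that has quieted down the user can simply say "yes", while in a room that is still noisy a nod recognized by the camera serves the same purpose.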
In a possible implementation of the first aspect above, the electronic device is a robot.
When the voice interaction method provided by the first aspect of the embodiments of the present application determines that the user's voice instruction can no longer be reliably recognized by voice recognition, it further determines whether to use lip recognition by judging whether the user is interacting with the voice assistant. This method can effectively improve the accuracy of voice instruction recognition, and thus further improve the rate at which the electronic device correctly executes the user's voice instructions.
A second aspect of the embodiments of the present application provides an electronic device, including:
a memory configured to store instructions to be executed by one or more processors of the electronic device; and
a processor, which is one of the one or more processors of the electronic device and is configured to perform the above voice interaction method.
A third aspect of the embodiments of the present application provides a computer-readable storage medium storing instructions that, when executed, cause a computer to perform the above voice interaction method.
A fourth aspect of the embodiments of the present application provides a computer program product, the computer program product including instructions that, when executed, cause a computer to perform the above voice interaction method.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a scene of a voice interaction method according to some embodiments of the present application;
FIG. 2 is a schematic structural diagram of an electronic device according to some embodiments of the present application;
FIG. 3 is a schematic flowchart of a voice interaction method according to some embodiments of the present application;
FIG. 4 is a schematic diagram of a scene of a voice interaction method according to some embodiments of the present application;
FIG. 5 is a schematic diagram of a scene of a voice interaction method according to some embodiments of the present application;
FIG. 6 is a schematic flowchart of a voice interaction method according to some embodiments of the present application.
Detailed Description
The embodiments of the present application disclose a voice interaction method, an electronic device, and a medium.
It can be understood that the electronic devices applicable to the embodiments of the present application may be various electronic devices with a voice recognition function, including but not limited to robots, laptop computers, desktop computers, tablet computers, smartphones, servers, wearable devices, head-mounted displays, mobile email devices, portable game consoles, portable music players, reader devices, televisions with one or more processors embedded in or coupled to them, or other electronic devices with computing capability.
In addition, the voice recognition function of the above electronic devices can be implemented in the form of various application programs, for example, in the form of a voice assistant; alternatively, the voice recognition function can be built into an application program of the electronic device, for example, for voice search within an application, such as voice search in a maps application.
For ease of description, the following takes a robot as the electronic device and the robot's voice assistant as the implementation of the voice recognition function.
As described above, a user can control an electronic device such as a robot through voice instructions. In a noisy environment, however, if the robot's voice assistant uses voice recognition to recognize the user's voice instruction, the voice assistant may be unable to determine when the user's voice instruction ends and therefore remain in a sound-pickup state, or be unable to determine when the user's voice instruction begins and therefore remain in a non-pickup state. In either case it cannot execute the user's voice instruction, which degrades the user experience.
To solve the above problem, an embodiment of the present application provides a voice interaction method. After the voice assistant is woken up by the user, it can detect the ambient noise level through a noise detection function. If the noise level is higher than a set threshold, the current voice recognition mode is switched to a lip recognition mode, so that the electronic device can recognize the user's voice instruction through lip recognition technology and execute that voice instruction.
For example, taking the scene shown in FIG. 1: when user 001 issues the voice instruction "tell a story", the voice assistant detects through the noise detection function, while the user is issuing the instruction, that the ambient noise level is higher than the set threshold; for example, the set threshold is 70 dB and the detected ambient noise is 78 dB. The voice assistant then switches the current voice recognition mode to the lip recognition mode, so that it can recognize the user's voice instruction through lip recognition technology and execute the "tell a story" voice instruction.
The above technique can recognize voice instructions in certain circumstances. In ordinary scenarios, however, the accuracy of lip recognition is generally lower than that of voice recognition. Because the above scheme switches to lip recognition as soon as the noise is high, there may be scenarios in which voice recognition could still recognize the instruction accurately despite high ambient noise; converting the voice recognition mode to the lip recognition mode in such scenarios increases the risk of recognition errors.
Therefore, an embodiment of the present application provides another voice interaction method. This method does not switch the voice recognition mode directly to the lip recognition mode and obtain a lip recognition result as soon as the noise detection function finds the surrounding environment too noisy. Instead, it first determines whether voice recognition is indeed unusable, and only when voice recognition is determined to be unusable does it judge whether the conditions for lip recognition are met. The lip recognition result is obtained only after the lip recognition conditions are found to be satisfied. Whether voice recognition is unusable is determined as follows:
In one implementable scheme, the voice assistant may determine that its sound pickup has lasted too long, for example beyond the system's normal set value, and conclude that the external environment may be too noisy for it to determine the end point of the user's voice instruction; or it may determine that no sound can be picked up at all, and conclude that the external environment may be too noisy for it to determine the start point of the user's voice instruction. In either case it is determined that voice recognition can no longer accurately recognize the user's voice instruction, and it is confirmed that voice recognition cannot be used at this point.
In another implementable scheme, the surrounding environment may first be judged through the noise detection function. If the ambient noise value is lower than the set noise value, it is directly determined that voice recognition can be used. If the ambient noise value is greater than or equal to the set noise value, it is further judged whether the voice assistant's sound pickup has lasted too long, for example beyond the system's normal set value, or whether no sound can be picked up at all. Either outcome indicates that the external environment is so noisy that the voice assistant cannot determine the end point or the start point of the user's voice instruction, and it is then determined that voice recognition can no longer accurately recognize the user's voice instruction.
After it is determined that the user's voice instruction can no longer be recognized by voice recognition, whether the user is interacting with the voice assistant is confirmed by judging, for example, whether the user's face has been oriented toward the camera and has remained within the shooting range of the robot's camera during a set time period; this determines whether the lip recognition result should be used. If it is confirmed that the user is indeed interacting with the voice assistant, it can be determined that lip recognition can recognize the user's voice instruction relatively accurately; the lip recognition result is then obtained, and the function corresponding to the lip recognition result is executed.
For example, taking the scene shown in FIG. 1: after the voice assistant is woken up, it enters the voice recognition mode and starts sound pickup. After user 001 issues the voice instruction "tell a story", the voice assistant does not detect that the user's voice instruction has ended and keeps picking up sound. When the pickup time exceeds the system's set value, for example 10 seconds, the voice assistant can conclude that the external environment may be too noisy for it to determine the end point of user 001's voice instruction, and hence that voice recognition can no longer accurately recognize the user's voice instruction. It then checks whether, during the set time within the pickup just performed, user 001's face remained oriented toward the camera and within the shooting range of the robot's camera. If so, it determines that user 001 is interacting with the electronic device, and hence that lip recognition can recognize user 001's voice instruction relatively accurately; lip recognition is then applied to user 001's voice instruction, and the "tell a story" voice instruction is executed.
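The "tell a story" walkthrough above combines the timeout-based speech failure check with the face-toward-camera check. A compact sketch of that decision, using the 10-second example value from the text (the function name and argument shapes are illustrative):

```python
def choose_recognition_mode(pickup_seconds, face_toward_camera,
                            face_in_frame, time_limit=10.0):
    """Pickup that runs past the set time (or never starts) means the
    assistant cannot locate the instruction's end (or start) point; if
    the user's face stayed toward the camera and inside the frame
    during that window, switch to lip recognition."""
    if 0 < pickup_seconds < time_limit:
        return "speech"          # speech end point can still be detected
    if face_toward_camera and face_in_frame:
        return "lip"             # user is interacting: lip recognition is reliable
    return "none"                # user has left: no recognition path
```

In the FIG. 1 scenario the pickup runs past 10 seconds while user 001 keeps facing the camera, so the function returns "lip" and the assistant proceeds with lip recognition.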
The voice interaction method provided by the embodiments of the present application first judges that the voice assistant's sound pickup has lasted for a considerably long time, concludes that the voice assistant may no longer be able to determine when user 001's voice instruction has ended, and thereby concludes that the external environment is too noisy. It thus determines more precisely that voice recognition can no longer recognize user 001's voice instruction in this situation, and further determines whether to use lip recognition by judging whether user 001 is interacting with the voice assistant. This effectively avoids the drop in recognition accuracy caused by using a lip recognition result when voice recognition is still possible, and effectively improves the accuracy of voice instruction recognition.
Before the other voice interaction method provided by the embodiments of the present application is described in detail, the electronic device provided by the embodiments of the present application is introduced first.
For ease of description, a robot 002 is taken as the electronic device. It should be understood that the robot 002 in the embodiments of the present application can also interact with a cloud server, sending the recognized voice instruction of user 001 to the cloud server; the cloud server can use a database to feed interaction content, such as songs and stories, back to the robot 002.
As shown in FIG. 2, the robot 002 may include a processor 110, a power module 140, a memory 180, a sensor module 190, an audio module 150, a camera 170, an interface module 160, buttons 101, a display screen 102, and the like.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the robot 002. In other embodiments of the present application, the robot 002 may include more or fewer components than shown, combine certain components, split certain components, or arrange the components differently. The illustrated components can be implemented in hardware, software, or a combination of software and hardware.
处理器110可以包括一个或多个处理单元,例如,可以包括中央处理器CPU(Central Processing Unit)、图像处理器GPU(Graphics Processing Unit)、数字信号处理器DSP、神经网络处理器(neural-network processing unit,NPU)、微处理器MCU(Micro-programmed Control Unit)、AI(Artificial Intelligence,人工智能)处理器或可编程逻辑器件FPGA(Field Programmable Gate Array)等的处理模块或处理电路。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。处理器110中可以设置存储单元,用于存储指令和数据。在一些实施例中,处理器110中的存储单元为高速缓冲存储器180。The processor 110 may include one or more processing units, for example, may include a central processing unit CPU (Central Processing Unit), an image processor GPU (Graphics Processing Unit), a digital signal processor DSP, a neural network processor (neural-network processing unit, NPU), microprocessor MCU (Micro-programmed Control Unit), AI (Artificial Intelligence, artificial intelligence) processor or programmable logic device FPGA (Field Programmable Gate Array) and other processing modules or processing circuits. Wherein, different processing units may be independent devices, or may be integrated in one or more processors. A storage unit may be provided in the processor 110 for storing instructions and data. In some embodiments, the storage unit in processor 110 is cache memory 180 .
可以理解,本申请实施例中,处理器110可以控制相应程序执行本申请实施例提供的语音交互方法。具体的,可以采用人工智能处理器对接收到的语音进行识别,获取识别结果;可以上述图像处理器对采集到的用户001的嘴唇动作进行解析,获取识别结果;同时,可以采用上述图像处理器对采集到的用户001的肢体动作进行识别,获取识别结果。另外可以采用处理器110实时检测电子设备周围的噪音,以选择更准确的识别方式。It can be understood that in the embodiment of the present application, the processor 110 may control a corresponding program to execute the voice interaction method provided in the embodiment of the present application. Specifically, an artificial intelligence processor can be used to recognize the received voice to obtain a recognition result; the above-mentioned image processor can be used to analyze the collected lip movements of user 001 to obtain a recognition result; meanwhile, the above-mentioned image processor can be used Recognize the collected body movements of user 001, and obtain the recognition result. In addition, the processor 110 may be used to detect the noise around the electronic device in real time, so as to select a more accurate identification method.
电源模块140可以包括电源、电源管理部件等。电源可以为电池。电源管理部件用于管理电源的充电和电源向其他模块的供电。在一些实施例中,电源管理部件包括充电管理模块和电源管理模块。充电管理模块用于从充电器接收充电输入;电源管理模块用于连接电源,充电管理模块与处理器110。电源管理模块接收电源和/或充电管理模块的输入,为处理器110,显示屏102,摄像头170,及无线通信模块120等供电。The power module 140 may include a power supply, power management components, and the like. The power source can be a battery. The power management component is used to manage the charging of the power supply and the power supply from the power supply to other modules. In some embodiments, the power management component includes a charge management module and a power management module. The charging management module is used to receive charging input from the charger; the power management module is used to connect the power supply, the charging management module and the processor 110 . The power management module receives power and/or input from the charging management module, and supplies power to the processor 110 , the display screen 102 , the camera 170 , and the wireless communication module 120 .
The wireless communication module 120 may include an antenna, and transmits and receives electromagnetic waves via the antenna. The wireless communication module 120 may provide solutions for wireless communication applied to the robot 002, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The robot 002 may communicate with networks and other devices through wireless communication technologies. For example, the robot 002 may communicate with a cloud server through the wireless communication module 120.
The display screen 102 is used to display human-computer interaction interfaces, images, videos, and the like. The display screen 102 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, quantum dot light-emitting diodes (QLED), or the like. In the embodiments of the present application, the display screen 102 may be used to display various application program interfaces of the robot 002.
The sensor module 190 may include a proximity light sensor, a pressure sensor, a gyroscope sensor, a barometric pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
The audio module 150 is used to convert digital audio information into an analog audio signal for output, or to convert an analog audio input into a digital audio signal. The audio module 150 may also be used to encode and decode audio signals. In some embodiments, the audio module 150 may be disposed in the processor 110, or some functional modules of the audio module 150 may be disposed in the processor 110. In some embodiments, the audio module 150 may include a speaker, an earpiece, a microphone, and an earphone jack. In the embodiments of the present application, the audio module 150 may be used to receive voice instructions from the user 001; in some implementations, the audio module 150 may also be used to perform operations such as playing music and telling stories according to the voice instructions of the user 001.
The camera 170 is used to capture still images or video. An object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the image signal processor (ISP) to be converted into a digital image signal. The robot 002 may implement the shooting function through the ISP, the camera 170, a video codec, the GPU, the display screen 102, an application processor, and the like. In the embodiments of the present application, the camera 170 may be used to capture face images, lip movement images, and the like of the user 001.
The interface module 160 includes an external memory interface, a universal serial bus (USB) interface, and the like. The external memory interface may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the robot 002. The external memory card communicates with the processor 110 through the external memory interface to implement a data storage function. The USB interface is used for communication between the robot 002 and other electronic devices.
In some embodiments, the robot 002 further includes buttons 101. The buttons 101 may include a volume button, a power button, and the like.
Another voice interaction method according to the embodiments of the present application is described in detail below in conjunction with the above-mentioned robot 002. FIG. 3 shows a schematic diagram of a voice interaction method, where the voice interaction method shown in FIG. 3 may be executed by the voice assistant of the robot 002. As shown in FIG. 3, the method includes:
S301: After detecting that the user 001 has woken up the voice assistant, enter the speech recognition mode. In the embodiments of the present application, the speech recognition mode is a mode in which the voice instructions of the user 001 received by the robot 002 are recognized.
After the voice assistant is woken up, it starts audio capture and acquires the mouth movement features of the user 001. Starting audio capture makes it possible, once a subsequent step determines that the voice instruction of the user 001 should be recognized by speech, to recognize the received sound directly. Likewise, acquiring the mouth movement features of the user 001 makes it possible, once a subsequent step determines that the voice instruction of the user 001 should be recognized by lip reading, to recognize the acquired mouth movement features directly.
In the embodiments of the present application, receiving sound may be implemented by the microphone of the robot 002, and acquiring the mouth movement features of the user 001 may be implemented by the camera device of the robot 002.
It can be understood that, in some embodiments, the voice assistant of the robot 002 is woken up only after receiving a wake-up word from the user 001, and then enters the speech recognition mode. For example, if the wake-up word of the voice assistant is "Hi, Xiaoyi", when the user 001 says "Hi, Xiaoyi", the voice assistant enters the speech recognition mode, starts audio capture, and acquires the user's mouth movement features, so as to receive the voice instructions of the user 001.
In the embodiments of the present application, while the speech recognition mode is running, the speech recognition function and the lip recognition function may be enabled selectively, as follows:
In one implementable solution, both the speech recognition function and the lip recognition function may be kept on throughout audio capture. In this way, the received sound can be recognized in real time by speech recognition to obtain a speech recognition result, and the captured mouth features of the user 001 can be recognized in real time by lip recognition to obtain a lip recognition result. When the subsequent judgment selects the speech recognition method, the speech recognition result can be obtained directly without performing speech recognition again; when the subsequent judgment selects the lip recognition method, the lip recognition result can be obtained directly without performing lip recognition again, effectively saving the time needed to recognize the voice of the user 001.
In a second implementable solution, after the speech recognition mode is entered, the speech recognition function or the lip recognition function may be enabled only after a subsequent step determines whether the voice instruction of the user 001 is to be recognized by lip reading or by speech. This solution can effectively reduce unnecessary speech recognition or lip recognition computation.
In a third implementable solution, the speech recognition function may be kept on throughout audio capture, and the lip recognition function is enabled only when a subsequent judgment determines that lip recognition is needed. This arrangement reflects the fact that in most scenarios the voice instructions of the user 001 are still recognized by speech, so keeping the speech recognition function on avoids repeatedly switching it on and off, reducing the computation load on the processor 110 in most common scenarios and increasing the running speed of the processor 110. Lip recognition, by contrast, is needed only in a minority of scenarios, so enabling the lip recognition function only when its use is confirmed effectively reduces the lip recognition computation, and thus the total computation of the whole recognition process.
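The three enabling strategies above can be summarized as a small policy function. This is an illustrative sketch only: the strategy names and the dictionary structure are assumptions made for clarity, not part of the disclosed implementation.

```python
# Illustrative sketch of the three strategies for enabling the speech (ASR)
# and lip-reading engines. Strategy names are hypothetical labels.

def engines_enabled(strategy, lip_needed):
    """Return which recognition engines are on under a given strategy.

    "always_both": scheme 1 - both engines on throughout capture
    "on_decision": scheme 2 - one engine enabled only after the decision step
    "asr_always":  scheme 3 - ASR always on, lip reading enabled on demand
    lip_needed: whether the decision step has selected lip recognition
    """
    if strategy == "always_both":
        return {"asr": True, "lip": True}
    if strategy == "on_decision":
        return {"asr": not lip_needed, "lip": lip_needed}
    if strategy == "asr_always":
        return {"asr": True, "lip": lip_needed}
    raise ValueError("unknown strategy: %s" % strategy)
```

Under scheme 3, for example, the ASR engine stays on even when lip recognition is also enabled, matching the description that most scenarios are still handled by speech.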
It can be understood that, in the embodiments of the present application, the speech recognition function may be implemented by the AI processor of the robot 002, which can intelligently recognize the voice of the user 001. The lip recognition function may be implemented by the GPU of the robot 002: the GPU continuously detects faces in the images, determines which person is speaking, and extracts that person's continuous mouth-shape change features; the continuously changing features are then fed into a lip recognition model on the GPU to recognize the pronunciation corresponding to the speaker's mouth shapes; finally, based on the recognized pronunciation, the most likely natural-language sentence is obtained.
It can be understood that the recognition rate of lip reading is still not high in general scenarios. In vertical scenarios (that is, scenarios in which a set of keywords trained into the model is recognized), however, the recognition accuracy is relatively high, for example above ninety percent.
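The vertical-scenario matching described above can be sketched as follows. Everything here is a stand-in assumption: the integer "mouth-shape" templates replace real frame-level visual features, and the nearest-template match replaces a trained neural lip-reading model.

```python
# Hypothetical sketch of vertical-scenario lip reading: mouth-shape feature
# sequences are matched against templates for keywords trained into a model.

TRAINED_KEYWORDS = {
    # keyword -> toy template sequence (a real system would use frame-level
    # visual features and a neural lip-reading model instead)
    "tell a story": [3, 1, 4, 1],
    "play music":   [2, 7, 1, 8],
    "exit":         [9, 9, 0],
}

def lip_recognize(frames):
    """Return the trained keyword whose template best matches the frames,
    or None when no template is close enough (general-scenario fallback)."""
    best_kw, best_dist = None, float("inf")
    for kw, tmpl in TRAINED_KEYWORDS.items():
        if len(tmpl) != len(frames):
            continue
        dist = sum(abs(a - b) for a, b in zip(tmpl, frames))
        if dist < best_dist:
            best_kw, best_dist = kw, dist
    return best_kw if best_dist <= 2 else None
```

Restricting the output to a closed keyword vocabulary is what makes the vertical scenario far more accurate than open-vocabulary lip reading.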
S302: Detect whether the condition for recognizing the user's voice instruction by speech recognition is satisfied. If it is satisfied, the user's voice instruction can be recognized by speech recognition; go to S308 to obtain the speech recognition result. If it is not satisfied, speech recognition can no longer accurately recognize the user's voice instruction; go to S303 to detect whether the condition for entering the lip recognition mode is satisfied.
In the embodiments of the present application, the voice assistant may control the microphone to capture audio, and may use voice activity detection (VAD) technology to determine the time points at which the voice of the user 001 starts or is cut off.
In one implementable solution, the condition for recognizing the voice instruction of the user 001 by speech recognition may be that the voice assistant has already received sound and the duration of continuous audio capture is less than a first set value.
The first set value may be determined according to the relevant capabilities of the device and the typical speech duration of the user 001. For example, although the device can still recognize speech longer than 10 seconds, its dialogue system can no longer give an effective answer; moreover, the user 001 generally takes no more than 10 seconds to issue a voice instruction, and a pause usually occurs within 10 seconds. When continuous audio capture exceeds 10 seconds, the VAD technology of the device can no longer accurately identify the cut-off point of the human voice in the audio. The first set value may therefore be set to 10 seconds.
It can be understood that, in the embodiments of the present application, in one case, when the duration of continuous audio capture exceeds the first set value, it can be determined that VAD can no longer accurately determine the time point at which the voice of the user 001 ends, which is why the voice assistant keeps capturing audio. It can thereby be determined that the external environment is too noisy, and that it is difficult at this point to recognize the voice instruction of the user 001 by speech recognition.
It can be understood that, in the embodiments of the present application, starting audio capture may mean that the audio capture function is enabled and capture is permitted; in some cases, however, no audio may actually be captured. For example, VAD may be unable to accurately determine the time point at which the voice of the user 001 starts, so the voice assistant cannot capture any audio.
Therefore, in another case, when the voice assistant is persistently unable to capture audio, it can be determined that VAD can no longer accurately determine the time point at which the voice of the user 001 starts, which is why the voice assistant cannot capture audio. It can thereby be determined that the external environment is too noisy, and that it is difficult at this point to recognize the voice instruction of the user 001 by speech recognition.
The above solution confirms, by determining that the voice assistant has started receiving sound and that the duration of continuous audio capture is less than or equal to the first set value, that VAD can still accurately determine the time points at which the voice of the user 001 starts and is cut off, and hence that the voice instruction of the user 001 can be recognized by speech recognition at this time.
In another implementable solution, the condition for recognizing the voice instruction of the user 001 by speech recognition may be that the noise value of the surrounding environment is less than a set value.
It should be noted that when the noise value of the surrounding environment is greater than the set value, the external environment is judged to be noisy, but it is not directly concluded that speech recognition cannot be used. Instead, it is further judged whether the duration of the voice assistant's audio capture is less than the first set value. As described above, if it is, VAD can still accurately determine the cut-off point of the voice of the user 001, and the voice instruction of the user 001 can therefore still be recognized by speech recognition. If the duration of audio capture is greater than or equal to the first set value, as described above, VAD can no longer accurately determine the cut-off point of the voice of the user 001, and speech recognition therefore can no longer accurately recognize the voice instruction of the user 001.
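The S302 check combining the two conditions can be sketched as one decision function. The 10-second limit follows the description above; the noise threshold and all names are illustrative assumptions.

```python
# Hedged sketch of the S302 condition: speech recognition is considered
# usable when the environment is quiet, or, in a noisy environment, when
# audio has been received and capture has not run past the first set value
# (beyond which VAD can no longer find the cut-off point).

FIRST_SET_VALUE_S = 10.0   # continuous-capture limit from the description
NOISE_THRESHOLD = 60.0     # illustrative ambient-noise limit (assumed units)

def can_use_speech_recognition(audio_received, capture_duration_s, ambient_noise):
    if ambient_noise < NOISE_THRESHOLD:
        # quiet environment: speech recognition is reliable
        return True
    # noisy environment: usable only while VAD can still find the cut-off
    return audio_received and capture_duration_s < FIRST_SET_VALUE_S
```

When this function returns False, the flow falls through to the S303 lip recognition check.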
S303: Detect whether the condition for recognizing the user's voice instruction by lip recognition is satisfied.
If it is satisfied, the user's voice instruction can be recognized by lip recognition; go to S304 to obtain the lip recognition result. If it is not satisfied, the user's voice instruction cannot be recognized by lip recognition either; go to S307 to prompt the user that recognition has failed.
It can be understood that, in some embodiments of the present application, the condition for entering the lip recognition mode is considered satisfied only when the voice assistant detects that the user 001 has remained in an interactive state with the voice assistant, for example, the user 001 in front of the robot 002 has remained the same, the user's face has stayed oriented toward the camera device of the robot 002, and the distance between the user and the robot 002 is within a set range. If the user 001 in front of the robot 002 is always the same user, the user's face stays oriented toward the camera device of the robot 002, and the distance to the robot 002 is within the set range, the movements of the user's mouth can be captured accurately, and lip recognition can then recognize the voice instruction of the user 001 more precisely.
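A minimal sketch of this S303 entry condition follows. The distance range and the parameter names are assumptions introduced for illustration; the patent only states that the distance must fall within "a set range".

```python
# Illustrative check for entering the lip recognition mode: same user,
# face oriented toward the camera, and distance within a set range.

DISTANCE_RANGE_M = (0.3, 1.5)  # hypothetical set range, in metres

def can_use_lip_recognition(same_user, face_toward_camera, distance_m):
    lo, hi = DISTANCE_RANGE_M
    return same_user and face_toward_camera and lo <= distance_m <= hi
```

All three conditions together guarantee that the mouth movements can be captured clearly enough for lip reading.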
S304: Obtain the lip recognition result.
In the embodiments of the present application, the lip recognition result may be an instruction keyword included in the instruction of the user 001. The keyword may be a conventional instruction keyword already stored in the voice assistant and trained into the model, for example: tell a story, read a picture book, play music, tell a joke, exit, return, and so on. It can be understood that, because these keywords are already stored in the voice assistant, lip recognition can identify these instruction keywords accurately.
In some embodiments, in order to avoid misrecognition in scenarios where a command word is embedded in a long sentence, a keyword may be taken as the lip recognition result only when a pause interval is detected both before and after the keyword.
For example, suppose the user 001 first issues the voice instruction "tell a story" and later says "I heard that this robot 002 can also play music and the like". The user 001 actually wants the robot 002 to tell a story, not to play music. If the voice assistant took both keywords, "tell a story" and "play music", as lip recognition results, it would either be unable to decide which of the two results' corresponding functions to execute, or would directly execute the function corresponding to the second result, "play music". If, however, the voice assistant takes a keyword as the lip recognition result only when there is a pause interval before and after it, then "tell a story" can be taken directly as the lip recognition result, and the function corresponding to the lip recognition result "tell a story" is executed. This solution effectively avoids the misrecognition that command words appearing in long sentences might otherwise cause.
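The pause rule above can be sketched as a simple filter. The tuple representation of detected keywords (keyword, pause-before, pause-after) is an assumption made to keep the example self-contained.

```python
# Sketch of the pause rule: a trained keyword counts as a lip recognition
# result only when a pause interval is detected both before and after it,
# which drops command words embedded inside long sentences.

def accept_keywords(detections):
    """detections: list of (keyword, pause_before, pause_after) tuples."""
    return [kw for kw, before, after in detections if before and after]
```

In the story/music example, "tell a story" was issued as a standalone command (pauses on both sides), while "play music" occurred mid-sentence, so only the former survives the filter.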
S305: Confirm whether the lip recognition result is correct. If the result is yes, the lip recognition is accurate; go to S306. If the result is no, the lip recognition is incorrect; go to S307.
In order to further confirm whether the lip recognition result is accurate, the speech recognition method provided in the embodiments of the present application may include a step of confirming the voice instruction with the user 001. Confirming the voice instruction with the user 001 may consist of asking the user 001 whether they want the voice assistant to execute the function corresponding to the lip recognition result. For example, as shown in FIG. 4, if the recognized keyword is "tell a story", the voice instruction may be confirmed by asking the user 001 a question such as: "Would you like me to tell you a story?"
When the user 001 confirms the voice instruction, the user may, as shown in FIG. 5, indicate confirmation by answering "Yes" by voice.
In the embodiments of the present application, since the preceding steps have confirmed that the current environment is noisy, speech recognition may no longer be able to accurately recognize the voice instruction of the user 001. Therefore, while the voice assistant is confirming with the user 001, the visual recognition function may be enabled at the same time to recognize confirmations made by the user 001 through body movements; for example, the user 001 may indicate confirmation by nodding or with an OK gesture.
The visual recognition function may be a function capable of detecting the body movements of the user 001, and may be implemented by the GPU of the robot 002: the GPU captures images of the user 001 and parses the body movements of the user 001 in the captured images to obtain a visual recognition result. For example, after the GPU captures an image sequence of the user 001 nodding, it can parse the nodding movement, and the recognition result obtained may be text corresponding to the nodding movement, such as "confirm" or "yes".
In some embodiments, in order to further increase the accuracy with which the voice assistant recognizes the voice instructions of the user 001, the voice assistant may, while confirming with the user 001 by voice, enable not only the visual recognition function but also a noise detection function, so as to detect the ambient noise in real time. When the ambient noise is judged to be below the set value, it can be determined that speech recognition is now able to recognize the voice instructions of the user 001, and speech recognition may then be used to recognize the confirmation instruction of the user 001 or subsequent voice instructions. If the ambient noise is still above the set value, lip recognition, visual recognition, or a combination of the two is used to recognize the confirmation instruction of the user 001 or other subsequent voice instructions.
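A sketch of this noise-gated confirmation step follows. The noise threshold, the accepted answer strings, and the gesture labels are all illustrative assumptions.

```python
# Hedged sketch of the confirmation step: in a noisy environment the
# nod / OK-gesture channel decides; once ambient noise falls below the
# threshold, the spoken answer can be trusted again.

NOISE_THRESHOLD = 60.0  # assumed set value for ambient noise

def confirmation_accepted(ambient_noise, spoken_answer, gesture):
    if ambient_noise < NOISE_THRESHOLD:
        # quiet enough: rely on the speech recognition result
        return spoken_answer.strip().lower() in ("yes", "ok", "confirm")
    # too noisy for speech: fall back to the visual recognition result
    return gesture in ("nod", "ok_gesture")
```

Both channels can of course run in parallel; the point of the sketch is which channel's result is authoritative at each noise level.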
S306: Execute the function corresponding to the lip recognition result, based on the lip recognition result.
In the implementation of the present application, as shown in FIG. 5, if the lip recognition result is "tell a story", then after the user 001 confirms that the lip recognition result is correct, the voice assistant can execute the function corresponding to "tell a story".
In some embodiments, during the execution of the task corresponding to the voice instruction of the user 001, the visual recognition function may be kept on continuously, so that the voice assistant maintains recognition of the body movements of the user 001.
For example, during the execution of the "tell a story" task, the lip recognition function may be kept on continuously to obtain the lip recognition results of the user 001, and the visual recognition function may be enabled at the same time so that the voice assistant keeps recognizing the body movements of the user 001. For example, if the user 001 makes an open-hand (five fingers spread) gesture within the range that the camera device of the robot 002 can capture, indicating that the storytelling task should stop, the voice assistant can recognize the gesture and stop executing the task.
It can be understood that, in the embodiments of the present application, while the visual recognition function is enabled, the speech recognition function and the lip recognition function are both enabled at the same time.
In other embodiments, in order to further increase the accuracy with which the voice assistant recognizes the voice instructions of the user 001, the noise detection function may, as described above, also be enabled during the execution of the task corresponding to the voice instruction of the user 001, in addition to the visual recognition function, so as to detect the ambient noise in real time. When the ambient noise is judged to be below the set value, it can be determined that speech recognition is now able to recognize the voice instructions of the user 001, and speech recognition may then be used. If the ambient noise is still above the set value, lip recognition, visual recognition, or a combination of the two is used to accurately recognize other voice instructions of the user 001.
For example, while the voice assistant is executing the "tell a story" task with the visual recognition, lip recognition, and noise detection functions all enabled, it may detect at some moment that the ambient noise has dropped below the set value and determine that speech recognition is now able to recognize the voice instructions of the user 001; it then switches to speech recognition for the voice instructions issued by the user 001 during the task. For example, after the voice assistant has switched to speech recognition, if the user 001 issues the command "stop telling the story", the voice assistant can recognize the voice instruction "stop telling the story" by speech recognition, obtain the speech recognition result "stop telling the story", and execute the function corresponding to that result.
It can be understood that, in the embodiments of the present application, the speech recognition function and the lip recognition function both remain enabled throughout the process in which the voice assistant executes the voice instructions of the user 001.
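The in-task mode switch described above can be sketched as a per-sample decision over a stream of noise readings. The threshold value and mode labels are illustrative assumptions.

```python
# Sketch of the in-task mode switch: ambient noise is sampled while a task
# runs; whenever it drops below the set value the assistant trusts speech
# recognition, otherwise it relies on lip and visual recognition.

NOISE_THRESHOLD = 60.0  # assumed set value

def recognition_modes(noise_samples):
    """Return the recognition mode chosen for each noise sample."""
    return ["speech" if n < NOISE_THRESHOLD else "lip+visual"
            for n in noise_samples]
```

A stream that starts noisy and quietens down thus produces a switch from lip/visual recognition back to speech recognition mid-task.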
S307: Remind the user 001 that recognition has failed.
In some embodiments, the user 001 may be reminded of the recognition failure by displaying prompt information such as "recognition error" or "unable to recognize" on the screen of the robot 002.
In some embodiments, the user 001 may also be reminded of the recognition failure by voice messages such as "recognition error" or "unable to recognize".
In some embodiments, after reminding the user 001 of the recognition failure, the user 001 may also be prompted, for example, to face the camera or to speak louder, and to issue the voice instruction again.
S308: Obtain a voice recognition result.
In the embodiments of the present application, after the voice recognition result is obtained, the function corresponding to the voice recognition result may be executed based on that result.
In the embodiments of the present application, the voice recognition method provided in FIG. 3 first determines that the voice assistant has been receiving sound for a considerably long period of time, and concludes that the voice assistant may no longer be able to determine when the voice command of user 001 has ended. From this it is determined that the external environment is too noisy, and consequently that it is difficult to recognize the voice command of user 001 by means of voice recognition under such circumstances. Whether to use lip language recognition is then determined by judging whether user 001 is interacting with the voice assistant. This effectively avoids the reduction in recognition accuracy that would result from using lip language recognition results when voice recognition is still possible, and thus effectively improves the accuracy of voice command recognition. In addition, user 001 can be asked for confirmation after the lip language recognition result is obtained, which effectively ensures the accuracy of the recognition result.
In the embodiments of the present application, whether user 001 has remained in an interactive state with the voice assistant in step 303 may be judged from the following aspects:
First, whether the user 001 interacting with the voice assistant has remained unchanged during the sound-collection process.
If the voice assistant detects that the user 001 interacting with it during the sound-collection process has never changed, the probability that user 001 is interacting with the voice assistant is relatively high.
If the user 001 interacting with the voice assistant has changed, the user 001 who issued the voice command may have left. In this case, in some implementations, the received voice may be directly determined to be invalid. In other implementations, the voice command of the last user 001 to interact with the voice assistant during the sound-collection process may be detected instead. For example, if the voice assistant detects that the object interacting with it changed once during the sound-collection process, that is, two users 001 interacted with the voice assistant during that process, then the voice command of the second user 001 to interact with the voice assistant during the sound-collection process is detected.
In the embodiments of the present application, face tracking technology may be used to detect whether the user 001 interacting with the voice assistant has changed.
In some embodiments, if an electronic device does not have face tracking technology and therefore lacks the capability of face tracking, whether the user 001 interacting with the voice assistant has changed may instead be detected by checking whether the face of the user 001 directly in front of the voice assistant has changed.
Second, whether the interaction willingness value between user 001 and the voice assistant reaches the second set value.
In some embodiments, the interaction willingness value may be calculated based on factors such as the distance between the face of user 001 and the voice assistant and the orientation of the face over a period of time. For example, if the face of user 001 is relatively close to the voice assistant and is oriented directly toward it over a period of time, the interaction willingness value is relatively high; otherwise, it is relatively low.
Specifically, in one possible implementation, the voice assistant may obtain the face angle of user 001 and the distance between user 001 and robot 002 by capturing images of user 001 over a period of time, and then obtain the interaction willingness value of user 001 from the face angle and the distance between user 001 and the smart device through an interaction willingness value model. The higher the interaction willingness value, the greater the interaction intensity between user 001 and the voice assistant.
In the interaction willingness value model, different face angles may be defined to correspond to different values, and different distances between user 001 and robot 002 may be defined to correspond to different values; furthermore, the value corresponding to the face angle and the value corresponding to the distance may be assigned different weights. For example, since the face angle is relatively more indicative of whether user 001 is interacting with the voice assistant, the weight corresponding to the face angle may account for 60%, while the weight corresponding to the distance between user 001 and robot 002 may account for 40%.
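Purely as an illustration, the weighted model described above can be sketched as follows. The normalization constants, the mapping of angle and distance to scores, and the function names are assumptions for the sketch, not part of the embodiments; only the 60%/40% weighting comes from the text.

```python
def face_angle_score(angle_deg):
    """Map the deviation from directly facing the device (0 deg) to a 0-1 score."""
    return max(0.0, 1.0 - abs(angle_deg) / 90.0)

def distance_score(distance_m, max_range_m=3.0):
    """Map the user-to-device distance to a 0-1 score; closer is higher.
    max_range_m is an assumed useful interaction range."""
    return max(0.0, 1.0 - distance_m / max_range_m)

def interaction_willingness(angle_deg, distance_m,
                            angle_weight=0.6, distance_weight=0.4):
    """Weighted combination: face angle weighted 60%, distance 40%,
    as in the example weighting above."""
    return (angle_weight * face_angle_score(angle_deg)
            + distance_weight * distance_score(distance_m))
```

A user facing the device at close range yields a high willingness value, which would then be compared against the set value in step S303D.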
It can be understood that when the interaction willingness value between user 001 and the electronic device is low, user 001 is relatively far from the electronic device and the face angle deviates to some extent from the angle of directly facing the electronic device; in this case the electronic device cannot accurately capture and recognize the lip movements of user 001, and it is therefore difficult to recognize the voice command of user 001 by means of lip language recognition. Conversely, when the interaction willingness value between user 001 and the electronic device is high, user 001 is relatively close to the electronic device and the face angle is close or equal to the angle of directly facing the electronic device; in this case the electronic device can accurately capture and recognize the lip movements of user 001, and the voice command of user 001 can therefore be accurately recognized by means of lip language recognition.
In some embodiments, in order to judge more accurately whether lip language recognition should be used to obtain the recognition result, the judgment conditions in steps 302 and 303 in FIG. 3 may be reordered and supplemented. The resulting voice recognition method is shown in FIG. 6. Step 301 and steps 304-308 are as described above and are not repeated here; steps 302-303 are described in detail below. Specifically, steps 302-303 may be adjusted as follows:
S302A: Determine whether there is a human voice in the received sound.
If the judgment result is yes, indicating that a user 001 has issued a voice command, proceed to S302B; if the judgment result is no, indicating that no user 001 has issued a voice command, proceed to S302C to start receiving sound again and perform detection anew.
In the embodiments of the present application, a human-voice detection model in an artificial intelligence processor may be used to detect whether there is a human voice in the received sound. If a human voice is present, S302B is executed to further determine whether the duration of sound collection detected by the voice assistant is less than the first set value. If no human voice is present, the method may proceed to S302C after a set interval, start receiving sound again, and recalculate the duration of sound collection. For example, the set interval may be 200 ms.
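The S302A/S302C loop can be sketched as below. `capture_audio` and `detect_human_voice` are stand-ins for the microphone pipeline and the human-voice detection model; the polling structure, `max_rounds` bound, and helper names are assumptions for illustration — only the 200 ms retry interval is taken from the text.

```python
import time

RETRY_INTERVAL_S = 0.2  # the 200 ms set interval mentioned above

def listen_for_voice(capture_audio, detect_human_voice, max_rounds=10):
    """Poll until a human voice is detected (S302A); after each silent
    interval, restart sound collection and restart the duration timer (S302C)."""
    for _ in range(max_rounds):
        start = time.monotonic()          # duration is recalculated each restart
        audio = capture_audio()
        if detect_human_voice(audio):
            # Return the audio and this round's sound-collection duration,
            # which S302B compares against the first set value.
            return audio, time.monotonic() - start
        time.sleep(RETRY_INTERVAL_S)      # wait the set interval, then retry
    return None, 0.0
```

The returned duration feeds directly into the S302B check of whether the voice-recognition conditions are satisfied.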
S302B: Determine whether the conditions for recognizing the user's voice command by means of voice recognition are satisfied.
If the judgment result is yes, indicating that the user's voice command can be recognized by means of voice recognition, proceed to S308 to obtain the voice recognition result; if not, indicating that voice recognition can no longer accurately recognize the user's voice command, proceed to S303A to detect whether the conditions for entering the lip language recognition mode are satisfied.
The conditions for voice command recognition are as described in step S302 in FIG. 3 and are not repeated here.
S302C: Start receiving sound again, and recalculate the duration of sound collection.
S303A: Determine whether the face-tracked user 001 has remained unchanged.
If the judgment result is yes, indicating that the user interacting with the voice assistant during the sound-collection process has always been the same user, proceed to S303B and take that user as the user interacting with the voice assistant; if the judgment result is no, indicating that the user interacting with the voice assistant has changed during the sound-collection process, proceed to S303C and take the last user 001 captured by the camera device as the user 001 interacting with the voice assistant.
It can be understood that, when judging whether a user is interacting with the voice assistant, the user interacting with the voice assistant must first be determined. Because the environment during a sound-collection period is too noisy, the VAD technique has difficulty determining the time point of the human voice, so multiple users may have interacted with the voice assistant during that period; that is, the user interacting with the voice assistant may have changed. For example, two users may have interacted with the voice assistant in turn during the sound-collection period, and the user before the change may already have left. The second user after the change may therefore be taken as the user interacting with the voice assistant, and lip language recognition may be performed on the mouth features of that second user.
S303B: Take the current user as the user interacting with the voice assistant. The current user is the user who has been interacting with the voice assistant throughout the sound-collection process.
S303C: Take the last user 001 captured by the camera device as the user 001 interacting with the voice assistant.
S303D: Determine whether the interaction willingness value of the user 001 interacting with the voice assistant reaches the first set value.
If the judgment result is yes, indicating that user 001 is determined to be interacting with the device, proceed to S305 to obtain the lip language recognition result; if the judgment result is no, indicating that user 001 has little willingness to interact with the voice assistant and that the user's voice command would be difficult to recognize even by means of lip language recognition, proceed to step S307 to prompt the user that recognition has failed.
The voice recognition method shown in FIG. 6 of the present application orders the several judgment conditions for using lip language recognition, and can therefore determine more accurately when lip language recognition should be used. In addition, when it is determined that there is no human voice in the received sound, the current process can be ended early and the next round of detection started, which avoids unnecessary subsequent recognition steps and effectively improves recognition efficiency.
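Purely as a sketch, the ordered decision flow of steps S302A through S303D can be expressed as a single function. The predicate names and the string labels returned are illustrative assumptions; the ordering of the checks follows FIG. 6 as described above.

```python
def choose_recognition_mode(has_human_voice, speech_conditions_met,
                            willingness, first_set_value):
    """Walk the FIG. 6 checks in order: S302A -> S302B -> S303D.
    (Between S302B and S303D, S303A-S303C select the interacting user:
    the current user if unchanged, otherwise the last user captured.)"""
    if not has_human_voice:                # S302A: no human voice in the sound
        return "S302C: restart sound collection"
    if speech_conditions_met:              # S302B: voice recognition suffices
        return "S308: voice recognition"
    if willingness >= first_set_value:     # S303D: willingness reaches set value
        return "S305: lip language recognition"
    return "S307: remind user of failure"
```

This makes explicit why ordering the conditions saves work: a silent round exits at the first check without ever reaching the user-selection or willingness steps.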
To sum up, the voice recognition method provided by the embodiments of the present application first determines that the voice assistant has been receiving sound for a considerably long period of time, and concludes that the voice assistant may no longer be able to determine when the voice command of user 001 has ended. From this it is determined that the external environment is too noisy, and consequently that it is difficult to recognize the voice command of user 001 by means of voice recognition under such circumstances. Whether to use lip language recognition is then determined by judging whether user 001 is interacting with the voice assistant, which effectively avoids the reduction in recognition accuracy that would result from using lip language recognition results when voice recognition is still possible, and effectively improves the accuracy of voice command recognition.
In addition, user 001 can be asked for confirmation after the lip language recognition result is obtained, which effectively ensures the accuracy of the recognition result.
Furthermore, in the voice recognition method provided by the embodiments of the present application, the visual recognition function and the noise detection function may be enabled while confirming with user 001. On the one hand, this preserves the voice assistant's recognition results for the body movements of user 001; on the other hand, the manner of recognizing the voice commands of user 001 can be adjusted in time according to changes in the surrounding environmental noise, increasing the accuracy of voice command recognition.
An embodiment of the present application further provides a voice interaction apparatus, including:
a detection module, configured to control the electronic device to enter the voice recognition mode after detecting that user 001 has woken up the voice assistant;
a recognition control module, configured to, if it is detected that the current voice interaction environment of the electronic device satisfies the voice recognition condition, control the electronic device to recognize the voice command of user 001 by means of voice recognition to obtain a voice recognition result;
and, if it is detected that the current voice interaction environment of the electronic device does not satisfy the voice recognition condition, detect whether the current interaction state of user 001 satisfies the lip language recognition condition, and, when it is determined that the current interaction state of user 001 satisfies the lip language recognition condition, control the electronic device to use lip language recognition to recognize the mouth change features of user 001 acquired by the electronic device through an image acquisition apparatus, to obtain a lip language recognition result; and
an execution module, configured to control the electronic device to execute the function corresponding to the recognition result acquired by the electronic device. For example, if the electronic device acquires a lip language recognition result, the electronic device is controlled to execute the function corresponding to the lip language recognition result; if the electronic device acquires a voice recognition result, the electronic device is controlled to execute the function corresponding to the voice recognition result.
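As an illustrative skeleton only, the three modules above can be arranged as follows. The class, method, and device-interface names are assumptions, not the apparatus as claimed; the point is the detection -> recognition-control -> execution division of responsibility.

```python
class VoiceInteractionApparatus:
    """Detection, recognition control, and execution, mirroring the three modules."""

    def __init__(self, device):
        self.device = device  # abstraction over the electronic device (e.g., robot 002)

    def detect_wakeup(self):
        # Detection module: enter voice recognition mode once wake-up is detected.
        if self.device.wakeup_detected():
            self.device.enter_voice_recognition_mode()

    def recognize(self):
        # Recognition control module: prefer voice recognition; fall back to
        # lip language recognition when the environment does not permit it.
        if self.device.voice_environment_ok():
            return self.device.recognize_speech()
        if self.device.lip_conditions_met():
            return self.device.recognize_lips()
        return None  # neither mode applicable; caller may remind the user (S307)

    def execute(self, result):
        # Execution module: run the function corresponding to the recognition result.
        if result is not None:
            self.device.run(result)
```

A concrete device implementation would supply the environment checks (noise value, sound-collection duration) and the two recognizers behind this interface.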
The embodiments disclosed in the present application may be implemented in hardware, software, firmware, or a combination of these implementation methods. The embodiments of the present application may be implemented as a computer program or program code executed on a programmable system including at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described in the present application and to generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of the present application, a processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with the processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described in the present application are not limited in scope to any particular programming language. In either case, the language may be a compiled or an interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy disks, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memory (ROM), random-access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable memory used to transmit information over the Internet by means of electrical, optical, acoustic, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
An embodiment of the present application further provides a computer program, or a computer program product including a computer program, which, when executed on a computer, causes the computer to implement the above voice command execution method. In an implementable form, the computer program product may include instructions for implementing the above voice interaction method.
In the drawings, some structural or methodological features may be shown in a particular arrangement and/or order. However, it should be understood that such a particular arrangement and/or ordering may not be required. Rather, in some embodiments, these features may be arranged in a manner and/or order different from that shown in the illustrative drawings. In addition, the inclusion of a structural or methodological feature in a particular drawing does not imply that such a feature is required in all embodiments; in some embodiments, such features may not be included or may be combined with other features.
It should be noted that the units/modules mentioned in the device embodiments of the present application are all logical units/modules. Physically, a logical unit/module may be a physical unit/module, a part of a physical unit/module, or a combination of multiple physical units/modules. The physical implementation of these logical units/modules is not of primary importance; rather, the combination of functions implemented by these logical units/modules is the key to solving the technical problem raised by the present application. In addition, in order to highlight the innovative part of the present application, the above device embodiments do not introduce units/modules that are less closely related to solving the technical problem raised by the present application, which does not mean that the above device embodiments do not contain other units/modules.
It should be noted that, in the examples and description of this patent, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or device that comprises the element.
Although the present application has been illustrated and described with reference to certain preferred embodiments thereof, those of ordinary skill in the art will understand that various changes in form and detail may be made therein without departing from the scope of the present application.

Claims (12)

  1. A voice interaction method, applied to an electronic device, wherein the method comprises:
    when it is detected that a current voice interaction environment of the electronic device does not satisfy a voice recognition condition, determining whether a current interaction state of a user satisfies a lip language recognition condition;
    when it is determined that the current interaction state of the user satisfies the lip language recognition condition, obtaining a lip language recognition result obtained by using lip language recognition to recognize mouth change features of the user acquired by the electronic device through an image acquisition apparatus; and
    executing a function corresponding to the lip language recognition result.
  2. The method according to claim 1, wherein the voice recognition condition comprises:
    a noise value around the electronic device being lower than a set noise value;
    or
    in a case where the noise value around the electronic device is greater than or equal to the set noise value, a duration of sound collection by the electronic device being greater than zero and less than a set time.
  3. The method according to claim 1, wherein the voice recognition condition comprises: a duration of sound collection by the electronic device being greater than zero and less than a set time.
  4. The method according to any one of claims 1-3, wherein the lip language recognition condition comprises:
    the user and the electronic device being in an interactive state within a set time.
  5. The method according to claim 4, wherein the method of determining whether the user and the electronic device are in an interactive state within the set time comprises:
    determining whether the user interacting with the electronic device has changed within the set time;
    detecting whether an interaction intensity value between the user and the electronic device reaches a set intensity value; and
    when it is determined that the user interacting with the electronic device has not changed within the set time and the interaction intensity value between the user and the electronic device reaches the set intensity value, confirming that the user and the electronic device are in an interactive state;
    wherein the interaction intensity value is related to a distance between the user and the electronic device and to a face orientation of the user.
  6. The method according to any one of claims 1-5, further comprising, before the step of executing the function corresponding to the lip language recognition result:
    confirming whether the lip language recognition result is correct; and
    when it is confirmed that the lip language recognition result is correct, executing the function corresponding to the lip language recognition result.
  7. The method according to claim 6, wherein the method of confirming whether the lip language recognition result is correct comprises:
    asking the user whether the function corresponding to the lip language recognition result needs to be executed; and
    when the user confirms that the function corresponding to the lip language recognition result needs to be executed, confirming that the lip language recognition result is correct.
  8. The method according to claim 6 or 7, further comprising, simultaneously with the step of confirming whether the lip language recognition result is correct:
    obtaining body movement features of the user and a noise value around the electronic device.
  9. The method according to any one of claims 1-8, wherein the electronic device is a robot.
  10. An electronic device, comprising:
    a memory, configured to store instructions to be executed by one or more processors of the electronic device; and
    a processor, being one of the one or more processors of the electronic device, configured to execute the voice interaction method according to any one of claims 1-9.
  11. A computer-readable storage medium, wherein instructions are stored on the computer-readable storage medium, and when executed, the instructions cause a computer to execute the voice interaction method according to any one of claims 1 to 9.
  12. A computer program product, wherein the computer program product comprises instructions which, when executed, cause a computer to execute the voice interaction method according to any one of claims 1 to 9.
PCT/CN2022/108624 2021-07-29 2022-07-28 Speech interaction method, electronic device, and medium WO2023006033A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110865871.XA CN115691498A (en) 2021-07-29 2021-07-29 Voice interaction method, electronic device and medium
CN202110865871.X 2021-07-29

Publications (1)

Publication Number Publication Date
WO2023006033A1 true WO2023006033A1 (en) 2023-02-02

Family

ID=85059169

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/108624 WO2023006033A1 (en) 2021-07-29 2022-07-28 Speech interaction method, electronic device, and medium

Country Status (2)

Country Link
CN (1) CN115691498A (en)
WO (1) WO2023006033A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112164389B (en) * 2020-09-18 2023-06-02 国营芜湖机械厂 Multi-mode voice recognition speech transmitting device and control method thereof

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102023703A (en) * 2009-09-22 2011-04-20 Hyundai Motor Company Combined lip reading and voice recognition multimodal interface system
JP2014240856A (en) * 2013-06-11 2014-12-25 Alpine Electronics, Inc. Voice input system and computer program
CN107799125A (en) * 2017-11-09 2018-03-13 Vivo Mobile Communication Co., Ltd. Speech recognition method, mobile terminal, and computer-readable storage medium
CN108537207A (en) * 2018-04-24 2018-09-14 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Lip reading recognition method, apparatus, storage medium, and mobile terminal
CN110517685A (en) * 2019-09-25 2019-11-29 Shenzhen Zhuiyi Technology Co., Ltd. Speech recognition method, apparatus, electronic device, and storage medium
WO2020122677A1 (en) * 2018-12-14 2020-06-18 Samsung Electronics Co., Ltd. Method of performing function of electronic device and electronic device using same
CN112132095A (en) * 2020-09-30 2020-12-25 Guangdong OPPO Mobile Telecommunications Corp., Ltd. Dangerous state recognition method and apparatus, electronic device, and storage medium
CN112633208A (en) * 2020-12-30 2021-04-09 Hisense Visual Technology Co., Ltd. Lip language recognition method, service device, and storage medium

Also Published As

Publication number Publication date
CN115691498A (en) 2023-02-03

Similar Documents

Publication Publication Date Title
US11810562B2 (en) Reducing the need for manual start/end-pointing and trigger phrases
CN105654952B (en) Electronic device, server and method for outputting voice
KR102411766B1 (en) Method for activating voice recognition servive and electronic device for the same
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
CN112863510B (en) Method for executing operation on client device platform and client device platform
EP3593958A1 (en) Data processing method and nursing robot device
US20190187787A1 (en) Non-verbal engagement of a virtual assistant
US11430438B2 (en) Electronic device providing response corresponding to user conversation style and emotion and method of operating same
KR102412523B1 (en) Method for operating speech recognition service, electronic device and server supporting the same
WO2021008538A1 (en) Voice interaction method and related device
CN110910887B (en) Voice wake-up method and device
CN110706707B (en) Method, apparatus, device and computer-readable storage medium for voice interaction
WO2021212388A1 (en) Interactive communication implementation method and device, and storage medium
WO2023006033A1 (en) Speech interaction method, electronic device, and medium
CN114333774B (en) Speech recognition method, device, computer equipment and storage medium
WO2022227507A1 (en) Wake-up degree recognition model training method and speech wake-up degree acquisition method
WO2016206646A1 (en) Method and system for urging machine device to generate action
CN112634911B (en) Man-machine conversation method, electronic device and computer readable storage medium
US20200090663A1 (en) Information processing apparatus and electronic device
Zhang et al. Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing
WO2020087534A1 (en) Generating response in conversation
US11997445B2 (en) Systems and methods for live conversation using hearing devices
WO2024055831A1 (en) Voice interaction method and apparatus, and terminal
CN117809625A (en) Terminal equipment and wake-up method for dual-mode verification
US20210082427A1 (en) Information processing apparatus and information processing method

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22848638

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE