WO2023006033A1 - Voice interaction method, electronic device and medium - Google Patents
Voice interaction method, electronic device and medium
- Publication number: WO2023006033A1 (PCT/CN2022/108624)
- Authority: WIPO (PCT)
- Prior art keywords: user, voice, electronic device, recognition, lip
- Prior art date
Classifications
- G — PHYSICS
- G10 — MUSICAL INSTRUMENTS; ACOUSTICS
- G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00 — Speech recognition
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L15/24 — Speech recognition using non-acoustical features
- G10L15/25 — Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
Definitions
- the present application relates to the technical field of human-computer interaction, and in particular to a voice interaction method, electronic equipment and media.
- the robot may be unable to receive or recognize the user's voice instructions and execute them. For example, because the current environment is too noisy, the robot cannot determine when the user's voice command ends and therefore remains in a continuous listening state, or it cannot determine when the user's voice command starts and therefore never begins receiving sound; in either case it cannot give feedback such as executing the user's voice command, which seriously degrades the user experience.
- the first aspect of the embodiment of the present application provides a voice interaction method, an electronic device and a medium.
- the method can be applied to electronic equipment, and the method includes:
- if it is judged that the user's voice command is difficult to recognize by speech recognition, the method further determines whether to use lip recognition by judging whether the user is interacting with the voice assistant. This can effectively improve the accuracy of voice command recognition, and thereby further improve the accuracy with which the electronic device executes the user's voice commands.
- the noise value around the electronic device, the duration for which the electronic device has been receiving sound, and similar factors all belong to the voice interaction environment of the electronic device.
- the image acquisition device may be a camera device for collecting images, such as a camera.
- when the electronic device receives the user's voice instruction, it can simultaneously collect the user's voice and the user's mouth change features.
- voice recognition is used to recognize the user's voice received by the electronic device to obtain a voice recognition result.
- lip recognition can be used to analyze the mouth change features of the user acquired by the image acquisition device of the electronic device, so as to obtain a lip recognition result.
- the speech recognition conditions include:
- the noise value around the electronic equipment is lower than the set noise value
- the duration time for the electronic device to receive sound is greater than zero and less than the set time.
- to judge whether the user's voice command can be recognized by speech recognition, it can first be determined whether the noise value around the electronic device is lower than the set noise value. If so, the noise around the electronic device is small, and the user's voice command can be recognized through speech recognition. If not, the noise around the electronic device is already large and the external environment is noisy; in that case it is further judged whether the duration for which the electronic device has been receiving sound is greater than zero and less than the set time.
- if the duration for which the electronic device has received sound is greater than zero and less than the set time, the electronic device can still accurately determine the time point at which the user's voice is cut off, and it is confirmed that the user's voice command can be recognized by speech recognition. If the duration is greater than or equal to the first set value, or no sound has been received at all, the electronic device can no longer accurately determine the time point at which the user's voice is cut off, and it is determined that recognition by speech recognition has become difficult.
- the voice recognition condition includes: the duration of the electronic device receiving sound is greater than zero and less than a set time.
- alternatively, judging whether the user's voice command can be recognized by speech recognition can consist of directly judging whether the duration for which the electronic device has been receiving sound is greater than zero and less than the set time. If so, the electronic device can still accurately determine the time point at which the user's voice is cut off, and it is determined that the voice command can be recognized by speech recognition. If the duration is greater than or equal to the first set value, or no sound has been received at all, the electronic device can no longer accurately determine the time point at which the user's voice is cut off; it is then inferred that the external environment is too noisy and that recognizing the user's voice commands by speech recognition has become difficult.
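- the two-stage condition check described above can be summarized in code. The following is a minimal illustrative sketch, not the patent's literal implementation; the numeric values chosen for the set noise value and the set time are assumptions.

```python
NOISE_THRESHOLD_DB = 70.0   # assumed "set noise value"
MAX_LISTEN_SECONDS = 10.0   # assumed "set time" / first set value

def can_use_speech_recognition(ambient_noise_db: float,
                               listen_duration_s: float) -> bool:
    """Return True when the voice interaction environment satisfies
    the speech recognition conditions described above."""
    if ambient_noise_db < NOISE_THRESHOLD_DB:
        # Quiet environment: speech recognition is assumed reliable.
        return True
    # Noisy environment: usable only if the utterance boundaries are
    # still detectable, i.e. some sound was received and reception has
    # not dragged on past the set time.
    return 0.0 < listen_duration_s < MAX_LISTEN_SECONDS
```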
- the lip recognition conditions include:
- the user and the electronic device are in an interactive state within a set time.
- the electronic device can capture the changes in the user's mouth relatively clearly, so lip recognition can be used to identify the acquired mouth changes and obtain the lip recognition result.
- the method for determining whether the user is in an interactive state with the electronic device within a set time includes:
- the interaction intensity value is related to the distance between the user and the electronic device and the face orientation of the user.
- before detecting the interaction intensity value between the user and the electronic device, it may be determined whether the user interacting with the electronic device within the above-mentioned set time has changed. If no change has occurred, the current user may be taken as the object for the subsequent detection of the interaction intensity value. When it is further determined that the interaction intensity value between the user and the electronic device reaches the set intensity value, it can be confirmed that the user is in an interactive state with the electronic device; lip recognition is then used to identify the changing features of the user's mouth to obtain the lip recognition result.
- the interaction intensity value may be acquired based on the distance between the user's face and the electronic device, the orientation of the face, and the like within a set time. For example, if the distance between the user's face and the electronic device is relatively short within the set time, and the user's face faces the electronic device, the interaction intensity value is high, and vice versa.
- the meaning of the interaction strength value mentioned in the embodiment of the present application is the same as that of the interaction willingness value, but the expression is different.
- the set intensity value may be the second set value mentioned in the following embodiments.
- the method for confirming whether the lip recognition result is correct includes:
- the step of confirming whether the lip language result is correct also includes:
- in the step of confirming whether the lip recognition result is correct, because the previous steps have confirmed that the current environment is noisy, speech recognition may not be able to accurately recognize the user's voice command; therefore, when the voice assistant asks for confirmation, the visual recognition function can be turned on at the same time.
- the visual recognition function can capture the characteristics of the user's body movements, making it convenient to recognize a reply that the user gives through body movements, for example a movement indicating that the lip recognition result is correct.
- the noise detection function can also be enabled, in addition to the visual recognition function, when asking the user for voice confirmation, so as to detect the surrounding environment noise in real time.
- if the ambient noise falls below the set value, the user's voice command can again be recognized by speech recognition, and the user's confirmation command or other subsequent voice commands can be recognized that way. If the ambient noise is still higher than the set value, lip recognition, visual recognition, or a combination of the two is used to recognize the user's confirmation instruction or other subsequent voice instructions.
- the electronic device is a robot.
- when it is judged that the user's voice command is difficult to recognize by speech recognition, the method further determines whether to use lip recognition by judging whether the user is interacting with the voice assistant. This can effectively improve the accuracy of voice command recognition, and thereby further improve the accuracy with which the electronic device executes the user's voice commands.
- the second aspect of the embodiment of the present application provides an electronic device, including:
- a memory for storing instructions to be executed by the one or more processors of the electronic device
- the processor is one of the one or more processors of the electronic device, configured to execute the above voice interaction method.
- the third aspect of the embodiments of the present application provides a computer-readable storage medium, where instructions are stored on the computer-readable storage medium, and when the instructions are executed, the computer executes the above voice interaction method.
- the fourth aspect of the embodiments of the present application provides a computer program product, where the computer program product includes instructions, and when the instructions are executed, the computer executes the above voice interaction method.
- FIG. 1 shows a schematic diagram of a scene of a voice interaction method according to some embodiments of the present application
- Fig. 2 shows a schematic structural diagram of an electronic device according to some embodiments of the present application
- Fig. 3 shows a schematic flowchart of a voice interaction method according to some embodiments of the present application
- Fig. 4 shows a schematic scene diagram of a voice interaction method according to some embodiments of the present application
- Fig. 5 shows a schematic diagram of a scene of a voice interaction method according to some embodiments of the present application
- Fig. 6 shows a schematic flowchart of a voice interaction method according to some embodiments of the present application.
- the embodiment of the application discloses a voice interaction method, electronic equipment and media.
- the electronic devices applicable to the embodiments of the present application may be various electronic devices with speech recognition functions, including but not limited to robots, laptop computers, desktop computers, tablet computers, smart phones, servers, wearable devices, head-mounted displays, mobile email devices, portable game consoles, portable music players, reader devices, televisions with one or more processors embedded in or coupled to them, or other electronic devices with computing capabilities.
- the speech recognition function of the above-mentioned electronic device can be implemented in the form of various applications, for example appearing as a voice assistant, or built into an application program of the electronic device, such as the voice search in a Maps app.
- the electronic device is used as a robot, and the speech recognition function is implemented as a robot's voice assistant as an example.
- users can control electronic devices such as robots through voice commands.
- when the robot's voice assistant uses speech recognition to recognize the user's voice commands, the voice assistant may be unable to judge when the user's voice command ends and thus remain in the receiving state, or be unable to judge when the voice command starts and thus never begin receiving sound; in either case the user's voice command cannot be executed, which in turn degrades the user experience.
- the embodiment of the present application provides a voice interaction method.
- when the voice assistant is awakened by the user, it can detect the surrounding noise level through the noise detection function. If the noise level is higher than the set threshold, the current speech recognition mode is switched to the lip recognition mode, so that the electronic device can recognize the user's voice command through lip recognition technology and execute it.
- for example, the voice assistant detects through the noise detection function that the surrounding noise level is higher than the set threshold: if the threshold is set to 70 decibels and the detected noise value is 78 decibels, the voice assistant switches the current speech recognition mode to the lip recognition mode, recognizes the user's voice command through lip recognition technology, and executes the "tell a story" voice command.
- the above technique can recognize voice commands in certain circumstances, but in ordinary scenarios the accuracy of lip recognition is generally lower than that of speech recognition. When the above solution switches from speech recognition to lip recognition, there may therefore be scenarios in which speech recognition could still recognize accurately even though the surroundings are noisy; switching to lip recognition in such scenarios increases the risk of recognition errors.
- this method does not directly switch from the speech recognition mode to the lip recognition mode to obtain a lip recognition result after the noise detection function judges that the surroundings are too noisy. Instead, it first judges whether it is certain that speech recognition cannot be used; only if so does it judge whether the conditions for adopting lip recognition are satisfied, and the lip recognition result is obtained only after the lip recognition conditions are judged to be met. Judging whether speech recognition cannot be used proceeds as follows:
- by judging that the voice assistant's listening time is too long, for example exceeding the system's conventional set value, it can be determined that the external environment is too noisy for the voice assistant to judge the end time point of the user's voice command; or, by judging that no sound can be received, it can be determined that the external environment is too noisy for the voice assistant to judge the start time point of the user's voice command. In either case it is determined that the user's voice command is difficult to accurately recognize through speech recognition, and it is confirmed that the user cannot be recognized by speech recognition at this time.
- the surrounding environment can first be judged through the noise detection function. If the noise value of the surrounding environment is less than the set noise value, it is directly determined that speech recognition can be used. If the noise value is greater than or equal to the set noise value, it is further judged whether the voice assistant has been listening for too long, for example beyond the system's conventional set value, or whether it has been unable to receive sound at all; if so, it is determined that the external environment is too noisy for the voice assistant to judge the end time point or start time point of the user's voice command, and hence that the user's voice command is difficult to accurately recognize through speech recognition at this time.
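- the order of these checks can be sketched as follows; this is an illustrative outline under assumed inputs, not the patent's literal implementation:

```python
def choose_recognition_path(speech_conditions_met: bool,
                            user_interacting: bool) -> str:
    """Mirror the order of checks described above: first decide whether
    speech recognition is still usable, and only then consider lip
    recognition."""
    if speech_conditions_met:        # e.g. can_use_speech_recognition(...)
        return "speech_recognition"
    if user_interacting:             # lip recognition condition satisfied
        return "lip_recognition"
    return "prompt_recognition_failure"
```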
- whether the user's face has been facing the camera within the set time period and whether the face is within the shooting range of the robot's camera can then be used to confirm whether the user is interacting with the voice assistant, and thus whether the lip recognition result should be used. If it is confirmed that the user is interacting with the voice assistant, it can be determined that lip recognition can recognize the user's voice command relatively accurately; the lip recognition result is then obtained, and the function corresponding to it is executed.
- when the voice assistant is awakened, it enters the speech recognition mode and starts to listen.
- when user 001 issues the voice command "tell a story", the voice assistant does not detect that the user's voice command has ended and keeps receiving audio. When the audio reception time exceeds the system set value, for example 10 seconds, the voice assistant can judge that it cannot determine the end time point of user 001's voice instruction because the external environment is too noisy, and hence that the voice instruction is difficult to accurately recognize through speech recognition at this time. It then detects whether user 001's face was facing the camera within the set time during the reception that just took place, and whether the face remained within the shooting range of the robot's camera.
- if the detection result is yes, it is determined that user 001 is interacting with the electronic device. It can therefore be determined that user 001's voice command can be recognized relatively accurately by lip recognition; the command is then recognized by lip recognition, and the "tell a story" voice command is executed.
- the speech recognition method provided by the embodiment of the present application first judges, from the fact that the voice assistant has been receiving sound for a long time, that the voice assistant may be unable to judge when user 001's voice command has ended, and thereby that the external environment is too noisy. It can thus judge more accurately that user 001's voice command is difficult to recognize by speech recognition in this situation, and it further determines whether to use lip recognition by judging whether user 001 is interacting with the voice assistant. This effectively avoids the drop in accuracy caused by using a lip recognition result when speech recognition is still possible, and effectively improves the accuracy of voice command recognition.
- the electronic device is robot 002 as an example for illustration. It should be understood that the robot 002 in the embodiment of the present application can also interact with a cloud server, sending the recognized command of user 001 to the cloud server, and the cloud server can use its database to feed interaction content back to robot 002.
- the interaction content may be, for example, songs, stories, and the like.
- the robot 002 may include a processor 110 , a power module 140 , a memory 180 , a sensor module 190 , an audio module 150 , a camera 170 , an interface module 160 , buttons 101 , and a display screen 102 .
- the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the robot 002 .
- the robot 002 may include more or fewer components than shown in the illustration, or combine certain components, or separate certain components, or arrange different components.
- the illustrated components can be realized in hardware, software or a combination of software and hardware.
- the processor 110 may include one or more processing units, for example a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a neural-network processing unit (NPU), a micro-programmed control unit (MCU), an artificial intelligence (AI) processor, or a programmable logic device such as a field programmable gate array (FPGA), among other processing modules or processing circuits. The different processing units may be independent devices, or may be integrated in one or more processors.
- a storage unit may be provided in the processor 110 for storing instructions and data. In some embodiments, the storage unit in processor 110 is cache memory 180 .
- the processor 110 may control a corresponding program to execute the voice interaction method provided in the embodiment of the present application.
- an artificial intelligence processor can be used to recognize the received voice to obtain a recognition result;
- the above-mentioned image processor can be used to analyze the collected lip movements of user 001 to obtain a recognition result; the image processor can likewise recognize the collected body movements of user 001 and obtain a recognition result.
- the processor 110 may be used to detect the noise around the electronic device in real time, so as to select a more accurate identification method.
- the power module 140 may include a power supply, power management components, and the like.
- the power source can be a battery.
- the power management component is used to manage the charging of the power supply and the power supply from the power supply to other modules.
- the power management component includes a charge management module and a power management module.
- the charging management module is used to receive charging input from the charger; the power management module is used to connect the power supply, the charging management module and the processor 110 .
- the power management module receives power and/or input from the charging management module, and supplies power to the processor 110 , the display screen 102 , the camera 170 , and the wireless communication module 120 .
- the wireless communication module 120 may include an antenna, and transmit and receive electromagnetic waves via the antenna.
- the wireless communication module 120 can provide applications on the robot 002 including wireless local area networks (wireless local area networks, WLAN) (such as wireless fidelity (wireless fidelity, Wi-Fi) network), bluetooth (bluetooth, BT), global navigation satellite system ( Global navigation satellite system (GNSS), frequency modulation (frequency modulation, FM), near field communication (near field communication, NFC), infrared technology (infrared, IR) and other wireless communication solutions.
- the robot 002 can communicate with the network and other devices through wireless communication technology.
- the robot 002 can communicate with the cloud server through the wireless communication module 120 .
- the display screen 102 is used for displaying human-computer interaction interfaces, images, videos, and the like.
- the display screen 102 includes a display panel.
- the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a MiniLED, a MicroLED, a Micro-OLED, quantum dot light-emitting diodes (QLED), etc.
- the display screen 102 may be used to display various application program interfaces of the robot 002 .
- the sensor module 190 may include a proximity light sensor, a pressure sensor, a gyro sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
- the audio module 150 is used to convert digital audio information into analog audio signal output, or convert analog audio input into digital audio signal.
- the audio module 150 may also be used to encode and decode audio signals.
- the audio module 150 can be set in the processor 110, or some functional modules of the audio module 150 can be set in the processor 110.
- the audio module 150 may include a speaker, an earpiece, a microphone, and an earphone jack.
- the audio module 150 can be used to receive voice instructions from the user 001.
- the audio module 150 can also be used to perform operations such as playing music and telling stories according to the voice instructions of the user 001.
- Camera 170 is used to capture still images or video.
- the object generates an optical image through the lens and projects it to the photosensitive element.
- the photosensitive element converts the light signal into an electrical signal, and then transmits the electrical signal to the ISP (Image Signal Processing, Image Signal Processing) to convert it into a digital image signal.
- the robot 002 can realize the shooting function through ISP, camera 170, video codec, GPU (Graphic Processing Unit, graphics processor), display screen 102 and application processor.
- the camera 170 can capture user 001's face image, lip movement images, and the like.
- the interface module 160 includes an external memory interface, a universal serial bus (universal serial bus, USB) interface, and the like.
- the external memory interface can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the robot 002.
- the external memory card communicates with the processor 110 through the external memory interface to realize the data storage function.
- the universal serial bus interface is used for the robot 002 to communicate with other electronic devices.
- the robot 002 further includes a button 101 .
- the key 101 may include a volume key, an on/off key, and the like.
- FIG. 3 shows a schematic diagram of a voice interaction method, wherein the voice interaction method shown in FIG. 3 can be executed by a voice assistant of the robot 002 .
- the voice interaction method shown in Figure 3 includes:
- the speech recognition mode is a mode for recognizing the speech instruction of the user 001 received by the robot 002 .
- after the voice assistant is woken up, it turns on sound reception and acquires the characteristics of user 001's mouth changes. Turning on sound reception makes it possible, when the subsequent steps determine that user 001's voice command is to be recognized by speech recognition, to recognize the received sound directly. Acquiring user 001's mouth change features makes it possible, when the subsequent steps determine that the voice command is to be recognized by lip recognition, to recognize the acquired mouth change features directly.
- the receiving of sound can be realized based on the microphone of the robot 002, and the acquisition of the mouth change characteristics of the user 001 can be realized based on the camera device of the robot 002.
- the voice assistant of the robot 002 needs to be woken up by receiving the wake-up word from user 001 in order to enter the speech recognition mode. For example, if the wake-up word of the voice assistant is "Hi, Xiaoyi", then when user 001 speaks "Hi, Xiaoyi", the voice assistant enters the speech recognition mode, turns on sound reception, and acquires the user's mouth change characteristics in order to receive voice instructions from user 001.
- the voice recognition function and the lip language recognition function can be selectively enabled. As described below:
- the speech recognition function and the lip recognition function can both be turned on during the entire process of receiving sound, so that the received sound can be recognized directly in real time by speech recognition to obtain the speech recognition result, and the captured mouth features of user 001 can be recognized directly in real time by lip recognition to obtain the lip recognition result.
- the speech recognition result can be obtained directly without performing speech recognition again.
- the lip recognition result can be obtained directly without performing lip recognition again, which effectively saves the time for recognizing the voice of user 001.
- in other embodiments, after entering the speech recognition mode, the subsequent steps can first judge whether to recognize user 001's voice command by lip recognition or by speech recognition, and only then turn on the speech recognition function or the lip recognition function. This can effectively reduce unnecessary speech recognition or lip recognition computation.
- the voice recognition function can be turned on during the whole process of radio reception, and the lip recognition function can be turned on when it is later judged that lip recognition is needed for recognition.
- this arrangement reflects the fact that most scenarios still use speech recognition to recognize user 001's voice commands; keeping the speech recognition function on for a long time therefore avoids repeatedly turning it on and off, reduces the computation load of the processor 110 in the most common scenarios, and increases the running speed of the processor 110. Lip recognition is needed only in a small number of scenarios, so turning on the lip recognition function only when it is determined that lip recognition will be used effectively reduces the computation spent on lip recognition.
- the speech recognition function can be implemented based on the artificial intelligence processor of the robot 002 .
- the artificial intelligence processor can intelligently recognize the voice of user 001.
- the lip recognition function can be realized based on the image processor of the robot 002: the image processor continuously recognizes faces in the images, determines which person is speaking, and extracts that person's continuous mouth shape change features; these features are fed into the lip recognition model in the image processor to recognize the pronunciations corresponding to the speaker's mouth shapes; then, from the recognized pronunciations, the most likely natural language sentence is derived.
- the recognition rate of lip recognition is not high in general scenarios, but for the instruction keywords already stored and trained in the voice assistant the recognition accuracy is relatively high, for example above 90%.
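- the pipeline just described (face tracking, mouth-feature extraction, lip model, sentence decoding) can be sketched as follows; all component names here are hypothetical stand-ins, not APIs from the patent:

```python
def lip_recognize(frames, face_tracker, lip_model, language_model):
    """Sketch of the lip recognition pipeline: track the active speaker,
    extract mouth-shape features, map them to pronunciations, and decode
    the most likely sentence."""
    speaker = face_tracker.current_speaker(frames)            # who is talking
    mouth_features = face_tracker.mouth_features(frames, speaker)
    pronunciations = lip_model.predict(mouth_features)        # mouth shapes -> sounds
    return language_model.most_likely_sentence(pronunciations)
```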
- S302 Detect whether the condition for recognizing the user's voice instruction by speech recognition is satisfied. If it is satisfied, speech recognition can be used to recognize the user's voice command, and the method proceeds to S308 to obtain the speech recognition result. If it is not satisfied, speech recognition can no longer accurately recognize the user's voice command, and the method proceeds to S303 to check whether the condition for entering the lip recognition mode is met.
- the voice assistant can collect sound by controlling the microphone, and can judge the time points at which user 001's voice starts or is cut off by using voice activity detection (VAD) technology.
- the condition for recognizing user 001's voice command by speech recognition may be that the voice assistant has already started receiving sound and the continuous reception time is less than the first set value.
- the first set value can be determined according to the relevant performance of the device and user 001's typical speaking duration. For example, although speech longer than 10 s can still be recognized, the device's dialogue system can no longer give an effective answer; or, user 001 generally takes no more than 10 seconds to issue a voice command and usually pauses within 10 seconds. When continuous reception exceeds 10 seconds, the device's VAD technology can no longer accurately identify the cut-off time point of the human voice in the audio. The first set value can therefore be set to 10 seconds.
- when the voice assistant continues to listen for longer than the first set value, it can be judged that VAD technology can no longer accurately determine the time point at which user 001's voice ends, which is why the voice assistant has kept listening. It is therefore determined that the external environment is too noisy and that user 001's voice instruction is difficult to recognize through speech recognition at this time.
- turning on the sound collection may refer to turning on the sound collection function and allowing sound collection, but in some cases, it may still be impossible to receive sound.
- similarly, VAD technology may be unable to accurately determine the time point at which user 001's voice starts, in which case the voice assistant may fail to receive the voice. When the voice assistant has been unable to receive sound, it can be judged that VAD technology can no longer accurately determine the time point at which user 001's voice starts, which is why no sound has been received. It can therefore be determined that the external environment is too noisy and that user 001's voice instruction is difficult to recognize through speech recognition at this time.
- the above solution confirms that the voice assistant has started to receive sound and that the continuous reception time is less than the first set value, thereby confirming that VAD technology can still accurately judge the time points at which user 001's voice starts and is cut off, and hence that the speech recognition mode can be used to recognize user 001's voice instruction at this time.
- the condition for recognizing the voice command of the user 001 by means of voice recognition may be that the noise value of the surrounding environment is less than a set value.
- if the duration of sound reception is less than the first set value, VAD technology can still accurately determine the time point at which user 001's voice is cut off, and it is determined that user 001's voice command can be recognized by speech recognition at this time.
- if the duration of the voice assistant's sound reception is greater than or equal to the first set value, then as described above, VAD technology can no longer accurately determine the time point at which user 001's voice is cut off, and it is determined that speech recognition can no longer accurately recognize user 001's voice commands at this time.
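- the two VAD failure modes described above (no detectable start point, no detectable cut-off point) can be illustrated as follows; this toy classifier and its default threshold are assumptions for illustration:

```python
def vad_failure_mode(speech_started: bool, listen_duration_s: float,
                     first_set_value_s: float = 10.0) -> str:
    """Classify why speech recognition is considered unusable."""
    if not speech_started:
        return "no_start_point"    # VAD never found where the voice begins
    if listen_duration_s >= first_set_value_s:
        return "no_end_point"      # VAD never found where the voice is cut off
    return "boundaries_ok"         # boundaries detectable: speech mode usable
```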
- S303 Detect whether the condition for recognizing the user's voice command by means of lip recognition is satisfied.
- the condition for entering the lip recognition mode is considered to be met only when the voice assistant detects that user 001 has kept interacting with it, for example, the user in front of robot 002 has remained the same, the user's face has been facing robot 002's camera device, and the distance between the user and robot 002 is within the set range.
- in this case the movement of the user's mouth can be captured accurately, and user 001's voice command can be recognized relatively accurately by lip recognition.
- the lip recognition result may be the instruction keyword included in the user 001 instruction.
- the above-mentioned keywords can be conventional instruction keywords that are already stored in the voice assistant and trained into the model, for example: tell a story, read a picture book, play music, tell a joke, exit, return, and so on. Because these keywords are already stored in the voice assistant, they can be accurately identified by lip recognition.
- in order to avoid misrecognition when command words are mixed into long sentences, a keyword may be used as the lip recognition result only when a pause is detected before and after it.
- for example, when "tell a story" is recognized with a pause before and after it, it can be used directly as the lip recognition result, and the function corresponding to the lip recognition result "tell a story" is executed.
- This scheme can effectively avoid the misrecognition that may be caused by command words appearing in long sentences.
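- a minimal sketch of this pause-delimited keyword rule follows; the keyword list comes from the examples above, while the segment format and pause threshold are illustrative assumptions:

```python
KNOWN_KEYWORDS = {"tell a story", "read a picture book", "play music",
                  "tell a joke", "exit", "return"}
MIN_PAUSE_S = 0.3   # assumed pause length marking a keyword boundary

def extract_command(segments):
    """segments: list of (text, pause_before_s, pause_after_s) tuples
    produced by the lip recognizer."""
    for text, pause_before, pause_after in segments:
        if (text in KNOWN_KEYWORDS
                and pause_before >= MIN_PAUSE_S
                and pause_after >= MIN_PAUSE_S):
            return text     # accept a keyword only when isolated by pauses
    return None             # no isolated keyword: do not act on the sentence
```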
- S305 Confirm whether the lip recognition result is correct. If the result is yes, the lip recognition is accurate, and the method proceeds to S306; if the result is no, the lip recognition is incorrect, and the method proceeds to S307.
- the voice recognition method provided in the embodiment of the present application may include a step of confirming the voice command to the user 001 .
- confirming the voice command with user 001 may consist of asking user 001 whether he wants the voice assistant to perform the function corresponding to the lip recognition result. For example, as shown in Figure 4, if the recognized keyword is "tell a story", the voice command can be confirmed by asking user 001 a question such as: "Shall I tell you a story?"
- the confirmation may be indicated by voice answering "Yes".
- the speech recognition method may not be able to accurately recognize the voice command of user 001. Therefore, when the voice assistant asks user 001 for confirmation, the visual recognition function can be turned on at the same time, making it convenient to recognize a confirmation that user 001 gives through body movements; for example, user 001 may express confirmation by nodding or by making an OK gesture.
- the visual recognition function detects the body movements of user 001 and can be realized based on the image processor of robot 002: body movements are parsed to obtain visual recognition results. For example, after images of user 001's nodding action are collected, the image processor can analyze them, and the recognition result can be the text corresponding to the nod, such as "confirm" or "yes".
- in order to further increase the accuracy with which the voice assistant recognizes user 001's voice commands, the noise detection function can also be enabled, in addition to the visual recognition function, when asking user 001 for voice confirmation, so as to detect the ambient noise in real time.
- if the ambient noise falls below the set value, user 001's voice commands can again be recognized by speech recognition, and the confirmation command or other subsequent voice commands are recognized that way; if the ambient noise is still higher than the set value, lip recognition, visual recognition, or a combination of the two is used to recognize user 001's confirmation command or other subsequent voice commands.
- S306 Execute a function corresponding to the lip recognition result based on the lip recognition result.
- the voice assistant can execute the function corresponding to "telling a story”.
- the visual recognition function can remain turned on so that the voice assistant keeps recognizing user 001's body movements.
- the lip recognition function can be continuously turned on to obtain the lip recognition results of user 001.
- for example, user 001 makes an open-palm (five-finger) gesture within the range captured by robot 002's camera device to indicate that the storytelling task should stop, and the voice assistant recognizes the gesture and stops executing the task.
- the noise detection function can likewise remain turned on for real-time detection of ambient noise. Once the ambient noise is judged to be lower than the set value, it is determined that user 001's voice commands can be recognized in speech recognition mode, and that mode is then used; if the ambient noise is still higher than the set value, lip recognition, visual recognition, or a combination of the two is used to recognize user 001's other voice commands.
- for example, when the voice assistant performs the "tell a story" task, it simultaneously turns on the visual recognition function, the lip recognition function, and the noise detection function. At a certain moment, the voice assistant detects that the ambient noise is lower than the set value and switches back to speech recognition; user 001 then issues the command "stop telling the story" during the task, the command is recognized by speech recognition, and the function corresponding to the recognition result "stop telling the story" is executed.
- the voice recognition function and the lip language recognition function are always on during the process of the above-mentioned voice assistant executing the voice command of user 001.
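- this runtime switching can be sketched as the loop below; the recognizer handles passed in are hypothetical stand-ins, and the threshold value is an assumption:

```python
def recognition_loop(task, measure_noise_db, speech_recognize,
                     lip_or_visual_recognize, noise_threshold_db=70.0):
    """While a task runs, pick the recognizer based on current noise."""
    while task.running:
        if measure_noise_db() < noise_threshold_db:
            command = speech_recognize()          # quiet again: back to speech
        else:
            command = lip_or_visual_recognize()   # still noisy: lip/visual
        if command == "stop telling the story":
            task.stop()
```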
- the way of reminding the user 001 of the recognition failure may be to display prompt information such as "recognition error” and "unrecognizable” on the screen of the robot 002 .
- the method of reminding the user 001 of the recognition failure may also be to prompt the user 001 through voice messages such as "recognition error” and "unrecognizable”.
- after reminding user 001 that recognition has failed, user 001 can also be prompted to face the camera, raise their voice, and so on, and to issue the voice command again.
- a function corresponding to the speech recognition result may be executed based on the speech recognition result.
- the speech recognition method provided in FIG. 3 first judges, from the fact that the voice assistant has been receiving sound for a long time, that the voice assistant may be unable to judge when user 001's voice command has ended, and thereby determines that the external environment is too noisy and that user 001's voice command is difficult to recognize by speech recognition in these circumstances. It then determines whether to use lip recognition by judging whether user 001 is interacting with the voice assistant. This effectively avoids the drop in accuracy caused by using a lip recognition result when speech recognition is still possible, and effectively improves the accuracy of voice command recognition. In addition, user 001 can be asked again after the lip recognition result is obtained, which effectively ensures the accuracy of the recognition result.
- whether user 001 has been maintaining an interactive state with the voice assistant in step S303 can be judged from the following aspects:
- if the voice assistant detects that the user interacting with it during the listening process has not changed, the likelihood that user 001 is interacting with the voice assistant is relatively high.
- if the user interacting with the voice assistant has changed, the user who issued the voice command may have left. In this case, some embodiments directly determine that the received voice is invalid. Other embodiments instead detect the voice command of the last user who interacted with the voice assistant during the listening process. For example, if the voice assistant detects that the object interacting with it changed once during listening, that is, two users interacted with the voice assistant during sound collection, it detects the voice instruction of the second user.
- whether the user interacting with the voice assistant has changed can be detected by detecting whether the face of the user in front of the voice assistant has changed.
- the interaction willingness value may be calculated based on the distance between the face of the user 001 and the voice assistant, the orientation of the face, and the like within a period of time. For example, if the distance between the face of the user 001 and the voice assistant is relatively short within a period of time, and the face of the user 001 is facing the voice assistant, the interaction willingness value is higher, and vice versa.
- the voice assistant can obtain user 001's face angle and the distance between user 001 and robot 002 by collecting images of user 001 over a period of time, and then feed the face angle and the distance into the interaction willingness value model to obtain user 001's interaction willingness value. The higher the interaction willingness value, the stronger the interaction between user 001 and the voice assistant.
- in the interaction willingness value model, different face angles can be mapped to different scores, and different distances between user 001 and robot 002 can be mapped to different scores; the face angle score and the distance score can then be combined with different weights. For example, because the face angle better reflects whether user 001 is interacting with the voice assistant, the weight of the face angle can be 60%, and the weight of the distance between user 001 and robot 002 can be 40%.
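- a minimal sketch of such a weighted model follows, using the 60%/40% weighting above; the score mappings and the set intensity value are illustrative assumptions:

```python
def interaction_willingness(face_angle_deg: float, distance_m: float) -> float:
    # Map face angle to [0, 1]: 0 degrees = facing the camera directly.
    angle_score = max(0.0, 1.0 - abs(face_angle_deg) / 90.0)
    # Map distance to [0, 1]: closer than 0.5 m scores 1, beyond 3 m scores 0.
    dist_score = min(1.0, max(0.0, (3.0 - distance_m) / 2.5))
    return 0.6 * angle_score + 0.4 * dist_score   # 60% angle, 40% distance

SET_INTENSITY = 0.7   # assumed "set intensity value" (second set value)

def user_is_interacting(face_angle_deg: float, distance_m: float) -> bool:
    return interaction_willingness(face_angle_deg, distance_m) >= SET_INTENSITY
```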
- steps S302 and S303 in FIG. 3 can be refined and supplemented; the resulting speech recognition method is shown in FIG. 6. Step S301 and steps S304-S308 are as described above and are not repeated here.
- steps S302-S303 can be adjusted as follows:
- S302A Determine whether there is a human voice in the received sound.
- the human voice detection model in the artificial intelligence processor can be used to detect whether there is a human voice in the received sound. If there is a human voice, execute S302B to further determine whether the duration of sound collection detected by the voice assistant is less than the first set value. If there is no human voice, then after the set interval, go to S302C to start receiving sound again and recalculate the duration of sound collection.
- the set interval can be 200 ms.
- if the judgment result of S302B is yes, speech recognition can be used to recognize the user's voice command, and the method proceeds to S308 to obtain the speech recognition result; if the result is no, the method proceeds to S303A to check whether the conditions for entering the lip recognition mode are met.
- the conditions for voice command recognition are as described in step S302 of FIG. 3 and are not repeated here.
- S302C Start receiving sound again, and recalculate the duration of receiving sound.
- S303A Determine whether the user 001 being face-tracked has remained unchanged.
- if the judgment result is yes, the user interacting with the voice assistant during sound collection has always been the same user, and the method proceeds to S303B, taking that user as the user interacting with the voice assistant. If the judgment result is no, the user interacting with the voice assistant changed during sound collection, and the method proceeds to S303C, taking the last user 001 captured by the camera device as the user interacting with the voice assistant.
- S303B Take the current user as the user interacting with the voice assistant, where the current user is the user who has been interacting with the voice assistant throughout the listening process.
- S303C Use the last user 001 captured by the camera device as the user 001 interacting with the voice assistant.
- S303D Determine whether the interaction willingness value of the user 001 interacting with the voice assistant reaches the second set value.
- if the judgment result is yes, user 001 is interacting with the device, and the method proceeds to obtain the lip recognition result; if the judgment result is no, user 001 is not willing to interact with the voice assistant and the user's voice command would be difficult to recognize even by lip recognition, so the method proceeds to step S307 and prompts the user that recognition has failed.
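- the refined flow of FIG. 6 can be sketched end to end as below; the duration threshold, willingness threshold, and 200 ms retry interval come from the description above, while the `assistant` methods are hypothetical stand-ins:

```python
import time

FIRST_SET_VALUE_S = 10.0   # assumed duration threshold (S302B)
SECOND_SET_VALUE = 0.7     # assumed interaction willingness threshold (S303D)

def refined_flow(assistant):
    """One pass through the refined checks S302A-S303D of FIG. 6."""
    while not assistant.human_voice_detected():          # S302A
        time.sleep(0.2)                                  # wait the 200 ms interval
        assistant.restart_listening()                    # S302C: recount duration
    if assistant.listen_duration() < FIRST_SET_VALUE_S:  # S302B
        return assistant.speech_recognition_result()     # S308
    if assistant.tracked_user_unchanged():               # S303A
        user = assistant.current_user()                  # S303B
    else:
        user = assistant.last_captured_user()            # S303C
    if assistant.interaction_willingness(user) >= SECOND_SET_VALUE:  # S303D
        return assistant.lip_recognition_result(user)    # obtain lip result
    return assistant.prompt_recognition_failure()        # S307
```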
- the speech recognition method shown in FIG. 6 of the present application orders the several judgment conditions for using lip recognition and can more accurately judge when lip recognition should be used. When a condition fails, the current process can be ended early and the next round of detection started, which avoids unnecessary subsequent recognition steps and effectively improves recognition efficiency.
- the voice recognition method provided by the embodiment of the present application first judges, from the fact that the voice assistant has been receiving sound for a long time, that the voice assistant may be unable to judge when user 001's voice command has ended, thereby determining that the external environment is too noisy and that user 001's voice command is difficult to recognize by speech recognition in these circumstances. It then determines whether to use lip recognition by judging whether user 001 is interacting with the voice assistant, which effectively avoids the drop in accuracy caused by using a lip recognition result when speech recognition is still possible, and effectively improves the accuracy of voice command recognition.
- the user 001 can be asked again after the lip language recognition result is obtained, which can effectively ensure the accuracy of the recognition result.
- the visual recognition function and the noise detection function can be turned on while confirming to the user 001.
- on the one hand, the voice assistant can keep recognizing user 001's body movements; on the other hand, it can adjust the way user 001's voice commands are recognized in time as the ambient noise changes, increasing the accuracy of voice command recognition.
- the embodiment of the present application also provides a voice interaction device, including:
- the detection module is configured to control the electronic device to enter the voice recognition mode after detecting that the user 001 wakes up the voice assistant.
- the recognition control module is configured to control the electronic device to recognize the voice command of the user 001 by voice recognition to obtain a voice recognition result if it detects that the current voice interaction environment of the electronic device satisfies the voice recognition condition.
- if it detects that the current voice interaction environment of the electronic device satisfies the lip recognition condition, the recognition control module controls the electronic device to use lip recognition to recognize the mouth change features of user 001 acquired through the image acquisition device, so as to obtain a lip recognition result.
- the execution module is configured to control the electronic device to execute the function corresponding to the recognition result according to the recognition result acquired by the electronic device. For example, if the electronic device obtains the result of lip recognition, the electronic device is controlled to execute the function corresponding to the lip recognition result; if the electronic device obtains the result of speech recognition, the electronic device is controlled to execute the function corresponding to the speech recognition result.
- Embodiments disclosed in this application may be implemented in hardware, software, firmware, or a combination of these implementation methods.
- Embodiments of the present application may be implemented as a computer program or program code executed on a programmable system comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
- Program code can be applied to input instructions to perform the functions described herein and to generate output information.
- the output information may be applied to one or more output devices in a known manner.
- a processing system includes any system having a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
- the program code can be implemented in a high-level procedural language or an object-oriented programming language to communicate with the processing system.
- Program code can also be implemented in assembly or machine language, if desired.
- the mechanisms described in this application are not limited in scope to any particular programming language. In either case, the language may be a compiled or interpreted language.
- the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof.
- the disclosed embodiments can also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which can be read and executed by one or more processors.
- instructions may be distributed over a network or via other computer-readable media.
- a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy disks, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical discs, read-only memory (ROM), random-access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, flash memory, or tangible machine-readable memory used to transmit information over the Internet via electrical, optical, acoustic, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.).
- a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (eg, a computer).
- the embodiment of the present application also provides a computer program or a computer program product including the computer program.
- when the computer program is executed on a computer, it enables the computer to implement the above voice command execution method.
- the computer program product may include instructions for implementing the above voice interaction method.
- each unit/module mentioned in each device embodiment of this application is a logical unit/module.
- physically, a logical unit/module may be a physical unit/module, may be a part of a physical unit/module, or may be realized as a combination of multiple physical units/modules.
- the physical implementation of these logical units/modules is not the most important; the combination of the functions realized by these logical units/modules is the key to solving the technical problem raised by this application.
- in addition, the above device embodiments of this application do not introduce units/modules that are not closely related to solving the technical problem proposed by this application, but this does not mean that no other units/modules exist in the above device embodiments.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- User Interface Of Digital Computer (AREA)
- Manipulator (AREA)
Abstract
The invention relates to a voice interaction method, an electronic device, and a medium. The voice interaction method comprises: when it is detected that the current voice interaction environment of an electronic device does not satisfy a voice recognition condition, determining whether the current interaction state of a user satisfies a lip-reading recognition condition (S303); if so, acquiring a lip-reading recognition result obtained by recognizing, by means of lip-reading recognition, a voice instruction of the user received by the electronic device (S304); and executing a function corresponding to the lip-reading recognition result (S306). By means of this method, when it is determined that the user's voice instructions are difficult to recognize by means of speech recognition, whether lip-reading recognition is to be used is further determined by judging whether the user is interacting with a voice assistant, so that the accuracy of voice instruction recognition can be effectively improved, thereby further improving the accuracy with which the electronic device executes the user's voice instructions.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110865871.XA CN115691498A (zh) | 2021-07-29 | 2021-07-29 | Voice interaction method, electronic device and medium |
CN202110865871.X | 2021-07-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023006033A1 (fr) | 2023-02-02 |
Family
ID=85059169
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/108624 WO2023006033A1 (fr) | 2021-07-29 | 2022-07-28 | Procédé d'interaction vocale, dispositif électronique et support |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115691498A (fr) |
WO (1) | WO2023006033A1 (fr) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112164389B (zh) * | 2020-09-18 | 2023-06-02 | 国营芜湖机械厂 | 一种多模式语音识别送话装置及其控制方法 |
- 2021-07-29 CN CN202110865871.XA patent/CN115691498A/zh active Pending
- 2022-07-28 WO PCT/CN2022/108624 patent/WO2023006033A1/fr active Application Filing
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102023703A (zh) * | 2009-09-22 | 2011-04-20 | 现代自动车株式会社 | 组合唇读与语音识别的多模式界面系统 |
JP2014240856A (ja) * | 2013-06-11 | 2014-12-25 | アルパイン株式会社 | 音声入力システム及びコンピュータプログラム |
CN107799125A (zh) * | 2017-11-09 | 2018-03-13 | 维沃移动通信有限公司 | 一种语音识别方法、移动终端及计算机可读存储介质 |
CN108537207A (zh) * | 2018-04-24 | 2018-09-14 | Oppo广东移动通信有限公司 | 唇语识别方法、装置、存储介质及移动终端 |
WO2020122677A1 (fr) * | 2018-12-14 | 2020-06-18 | Samsung Electronics Co., Ltd. | Procédé d'exécution de fonction de dispositif électronique et dispositif électronique l'utilisant |
CN110517685A (zh) * | 2019-09-25 | 2019-11-29 | 深圳追一科技有限公司 | 语音识别方法、装置、电子设备及存储介质 |
CN112132095A (zh) * | 2020-09-30 | 2020-12-25 | Oppo广东移动通信有限公司 | 危险状态的识别方法、装置、电子设备及存储介质 |
CN112633208A (zh) * | 2020-12-30 | 2021-04-09 | 海信视像科技股份有限公司 | 一种唇语识别方法、服务设备及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN115691498A (zh) | 2023-02-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12118999B2 (en) | Reducing the need for manual start/end-pointing and trigger phrases | |
CN105654952B (zh) | 用于输出语音的电子设备、服务器和方法 | |
WO2021036644A1 (fr) | Procédé et appareil d'animation à commande vocale basés sur l'intelligence artificielle | |
EP2959474B1 (fr) | Échelonnage de performance hybride pour la reconnaissance de la parole | |
US11430438B2 (en) | Electronic device providing response corresponding to user conversation style and emotion and method of operating same | |
EP3593958A1 (fr) | Procédé de traitement de données, et dispositif de robot de soins infirmiers | |
WO2021008538A1 (fr) | Procédé d'interaction vocale et dispositif associé | |
CN111492328A (zh) | 虚拟助手的非口头接合 | |
CN110263131B (zh) | 回复信息生成方法、装置及存储介质 | |
KR102412523B1 (ko) | 음성 인식 서비스 운용 방법, 이를 지원하는 전자 장치 및 서버 | |
WO2022227507A1 (fr) | Procédé d'apprentissage de modèle de reconnaissance de degré de réveil et procédé d'acquisition de degré de réveil vocal | |
WO2021212388A1 (fr) | Procédé et dispositif de mise en œuvre d'une communication interactive, et support de stockage | |
CN110910887A (zh) | 语音唤醒方法和装置 | |
CN112634911B (zh) | 人机对话方法、电子设备及计算机可读存储介质 | |
WO2023006033A1 (fr) | Procédé d'interaction vocale, dispositif électronique et support | |
CN114333774B (zh) | 语音识别方法、装置、计算机设备及存储介质 | |
WO2016206646A1 (fr) | Procédé et système pour pousser un dispositif de machine à générer une action | |
US20200090663A1 (en) | Information processing apparatus and electronic device | |
CN116860913A (zh) | 语音交互方法、装置、设备及存储介质 | |
US11997445B2 (en) | Systems and methods for live conversation using hearing devices | |
CN111971670B (zh) | 在对话中生成响应 | |
Zhang et al. | Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing | |
US20210082427A1 (en) | Information processing apparatus and information processing method | |
WO2024055831A1 (fr) | Procédé et appareil d'interaction vocale et terminal | |
CN117809625A (zh) | 一种终端设备及双模型校验的唤醒方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22848638 Country of ref document: EP Kind code of ref document: A1 |
NENP | Non-entry into the national phase |
Ref country code: DE |
122 | Ep: pct application non-entry in european phase |
Ref document number: 22848638 Country of ref document: EP Kind code of ref document: A1 |