WO2024055831A1 - Voice interaction method and apparatus, and terminal - Google Patents

Voice interaction method and apparatus, and terminal

Info

Publication number
WO2024055831A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
terminal
mouth
breath
voice
Application number
PCT/CN2023/114613
Other languages
French (fr)
Chinese (zh)
Inventor
王石磊
Original Assignee
荣耀终端有限公司 (Honor Device Co., Ltd.)
Application filed by 荣耀终端有限公司 (Honor Device Co., Ltd.)
Publication of WO2024055831A1


Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01D MEASURING NOT SPECIALLY ADAPTED FOR A SPECIFIC VARIABLE; ARRANGEMENTS FOR MEASURING TWO OR MORE VARIABLES NOT COVERED IN A SINGLE OTHER SUBCLASS; TARIFF METERING APPARATUS; MEASURING OR TESTING NOT OTHERWISE PROVIDED FOR
    • G01D21/00 Measuring or testing not otherwise provided for
    • G01D21/02 Measuring two or more variables by means not covered by a single other subclass
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the present application belongs to the field of human-computer interaction technology and in particular relates to a voice interaction method, apparatus, and terminal.
  • Voice interaction is a new generation of interaction mode based on voice input: based on the voice information the user inputs to the terminal, a feedback result corresponding to that voice information can be obtained.
  • the voice interaction system (such as a voice assistant) on the terminal must first be awakened.
  • the voice assistant can be awakened through a specific wake-up word.
  • the user can conduct voice interaction with the terminal.
  • the terminal outputs the feedback result corresponding to that voice. The user can then speak the next utterance, thus realizing a continuous dialogue with the terminal.
  • the current continuous dialogue function of the terminal is achieved by extending the terminal's sound-collection time. For example, after the terminal outputs the feedback result corresponding to the first voice message, it continues to listen for a period of time, such as 10 seconds. If no voice signal is received within those 10 seconds, the terminal stops collecting; if a voice signal is received within 10 seconds, the terminal outputs a feedback result for the received voice information. In this way, during the extended listening period, if the user does not make any sound but other people are talking nearby, the terminal will keep responding to what those other people say, which annoys the user and degrades the user experience.
  • This application provides a voice interaction method, apparatus, and terminal, which can solve the problem of the terminal responding incorrectly to other people or surrounding noise during the extended sound-collection period.
  • the present application provides a voice interaction method, which includes: detecting a wake-up instruction that initiates voice interaction; entering the working state of voice interaction in response to the wake-up instruction; detecting first voice information; outputting the feedback result for the first voice information; if second voice information is detected within a preset time period, detecting the user's breath; and if the user's breath is detected, outputting the feedback result for the second voice information.
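To make the claimed control flow concrete, the following is a minimal Python sketch of this first aspect. All helper names (`wait_for_wake`, `listen`, `detect_breath`, `respond`) and the 10-second window are hypothetical stand-ins, not taken from the application.

```python
def voice_interaction_first_aspect(wait_for_wake, listen, detect_breath, respond,
                                   preset_s: float = 10.0) -> None:
    """Sketch of the first aspect: wake-up -> first voice -> feedback ->
    breath-gated second voice -> feedback (or end of the working state)."""
    wait_for_wake()                          # detect the wake-up instruction
    first_voice = listen()                   # detect the first voice information
    respond(first_voice)                     # output feedback for the first voice information

    second_voice = listen(timeout=preset_s)  # preset time period of extended collection
    if second_voice is not None and detect_breath(second_voice):
        respond(second_voice)                # user's breath detected: output feedback
    # otherwise the voice interaction working state ends
```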
  • the method further includes: determining whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extending the working state of voice interaction by the preset duration; if it is determined that the terminal is not close to the user's mouth, ending the working state of voice interaction.
  • before performing user breath detection, the terminal first determines whether it is close to the user's mouth. If it is determined that the terminal is close to the user's mouth, the sound-collection time is extended; if it is determined that the terminal is not close to the user's mouth, sound collection is ended directly. This can greatly reduce the power consumption caused by prolonged sound collection.
  • the method further includes: determining whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, detecting the user's breath; if it is determined that the terminal is not close to the user's mouth, ending the working state of voice interaction.
  • when the second voice information is detected, it is first determined whether the terminal is close to the user's mouth, and only then whether to detect the user's breath. If the terminal is not close to the user's mouth, the second voice information is considered not to be a sound made by the user, and there is no need to detect the user's breath.
  • determining whether the terminal is close to the user's mouth includes: identifying the user's gesture in the working state of voice interaction; if the user's gesture is a first gesture, determining that the terminal is close to the user's mouth, where the first gesture represents that the user is holding the terminal in a stationary state; if the user's gesture is a second gesture, determining that the terminal is not close to the user's mouth, where the second gesture represents that the user is holding the terminal and moving it away from the user's mouth.
  • in other words, whether the terminal 100 is still near the user's mouth is determined by determining whether the user has moved the handheld terminal 100 away from the user's mouth.
  • the method includes: determining, before outputting the feedback result for the first voice information, whether a third gesture is recognized, where the third gesture represents that the user holds the terminal and moves it toward the user's mouth; if the third gesture is recognized, determining, after the feedback result for the first voice information is output, whether the terminal is still close to the user's mouth; if the third gesture is not recognized, ending the working state of voice interaction.
  • that is, the present application may first determine whether the user moved the handheld terminal toward the user's mouth before the feedback result for the first voice information was output. If so, it is then determined whether the terminal is still near the user's mouth after the feedback result for the first voice information is output.
  • identifying the user's gesture in the working state of voice interaction includes: obtaining the angular velocity and acceleration at different times in the working state of voice interaction; and determining the user's gesture using the angular velocity and acceleration at different times together with a gesture recognition module, where the gesture recognition module is used to recognize that the user is moving the handheld terminal toward the user's mouth, moving the handheld terminal away from the user's mouth, or holding the terminal in a stationary state.
  • the gesture recognition module can be used to determine the user's gesture based on angular velocity and acceleration data at different times.
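As an illustration of this proximity decision, here is a short sketch. The gesture labels mirror the first, second, and third gestures described above, while the model call and the IMU sample format are assumptions.

```python
from enum import Enum, auto

class Gesture(Enum):
    STATIONARY = auto()   # first gesture: holding the terminal still
    MOVING_AWAY = auto()  # second gesture: moving the terminal away from the mouth
    APPROACHING = auto()  # third gesture: moving the terminal toward the mouth

def terminal_close_to_mouth(imu_samples, gesture_model) -> bool:
    """Map the recognized hold gesture to a mouth-proximity decision.

    imu_samples: (angular velocity, acceleration) pairs at different times.
    gesture_model: hypothetical trained recognizer returning a Gesture."""
    gesture = gesture_model(imu_samples)
    return gesture is not Gesture.MOVING_AWAY  # only moving away means "not close"
```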
  • detecting the user's breath includes: inputting the second voice information into a breath recognition module, where the breath recognition module is used to identify whether the second voice information is a sound emitted by the user's mouth within a preset distance from the terminal; if the breath recognition module recognizes that the second voice information is a sound emitted by the user's mouth within the preset distance from the terminal, determining that the user's breath is detected; if the breath recognition module recognizes that the second voice information is not such a sound, determining that the user's breath is not detected.
  • the breath recognition module can be used to perform feature recognition on the second voice information to determine whether the second voice information is the sound produced by the user's mouth close to the terminal.
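A sketch of how such a breath recognition module might be invoked; the model is assumed to be a trained binary classifier returning a close-talk probability, which is not specified in the application.

```python
def detect_breath_from_audio(second_voice_pcm, breath_model,
                             threshold: float = 0.5) -> bool:
    """Pass the second voice information through the breath recognition module.

    breath_model is a hypothetical trained network scoring how likely the audio
    was produced by the user's mouth within the preset distance of the terminal
    (e.g. from the plosive 'pop' signature of close-talking speech)."""
    score = breath_model(second_voice_pcm)
    return score >= threshold  # True -> user's breath detected
```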
  • the terminal includes a pressure sensor
  • detecting the user's breath includes: obtaining the pressure value corresponding to the pressure sensor when the second voice information is collected; if the pressure value is greater than a preset pressure threshold, determining that the user's breath is detected; if the pressure value is less than or equal to the preset pressure threshold, determining that the user's breath is not detected.
  • the terminal includes a temperature sensor
  • detecting the user's breath includes: obtaining a first temperature and a second temperature, where the first temperature is the temperature corresponding to the temperature sensor before the second voice information is collected, and the second temperature is the temperature corresponding to the temperature sensor when the second voice information is collected; if the second temperature is greater than the first temperature, determining that the user's breath is detected; if the second temperature is less than or equal to the first temperature, determining that the user's breath is not detected.
  • the terminal includes a humidity sensor
  • detecting the user's breath includes: obtaining the humidity corresponding to the humidity sensor when the second voice information is collected; if the humidity is greater than a preset humidity threshold, determining that the user's breath is detected; if the humidity is less than or equal to the preset humidity threshold, determining that the user's breath is not detected.
  • the terminal includes a carbon dioxide sensor
  • detecting the user's breath includes: obtaining the carbon dioxide concentration corresponding to the carbon dioxide sensor when the second voice information is collected; if the carbon dioxide concentration is greater than a preset carbon dioxide concentration threshold, determining that the user's breath is detected; if the carbon dioxide concentration is less than or equal to the preset carbon dioxide concentration threshold, determining that the user's breath is not detected.
  • in other words, this application can use a pressure sensor, temperature sensor, humidity sensor, or carbon dioxide sensor to detect the user's breath.
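These four sensor variants reduce to simple threshold tests. A sketch follows; the pressure figure echoes the 0.07 MPa example given later in the description, while the humidity and CO2 thresholds are illustrative assumptions, and the temperature rule simply compares the two claimed samples.

```python
def breath_detected_by_sensor(kind: str, reading: float, baseline: float = 0.0) -> bool:
    """Threshold checks mirroring the four claimed sensor variants."""
    if kind == "pressure":
        return reading > 0.07      # MPa; preset pressure threshold (example value)
    if kind == "temperature":
        return reading > baseline  # second temperature vs. first temperature
    if kind == "humidity":
        return reading > 60.0      # %RH; preset humidity threshold (assumed)
    if kind == "co2":
        return reading > 800.0     # ppm; preset CO2 threshold (assumed)
    raise ValueError(f"unknown sensor kind: {kind}")
```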
  • the present application provides a voice interaction method, which includes: detecting a wake-up instruction that initiates voice interaction; entering the working state of voice interaction in response to the wake-up instruction; detecting first voice information; outputting the feedback result for the first voice information; determining whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extending the working state of voice interaction for a preset time; and if second voice information is detected within the preset time, outputting the feedback result for the second voice information.
  • the present application provides a voice interaction method, which includes: detecting a wake-up instruction that initiates voice interaction; entering the working state of voice interaction in response to the wake-up instruction; detecting first voice information; outputting the feedback result for the first voice information; if second voice information is detected within a preset time period, determining whether the terminal is close to the user's mouth; and if it is determined that the terminal is close to the user's mouth, outputting the feedback result for the second voice information.
  • the present application provides a voice interaction apparatus, which includes a processor; the processor is configured to: detect a wake-up instruction that initiates voice interaction; enter the working state of voice interaction in response to the wake-up instruction; detect first voice information; output the feedback result for the first voice information; if second voice information is detected within a preset time period, detect the user's breath; and if the user's breath is detected, output the feedback result for the second voice information.
  • the processor is further configured to: after outputting the feedback result for the first voice information, determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extend the working state of voice interaction by the preset duration; if it is determined that the terminal is not close to the user's mouth, end the working state of voice interaction.
  • the processor is further configured to: determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, detect the user's breath; if it is determined that the terminal is not close to the user's mouth, end the working state of voice interaction.
  • the processor is further configured to: identify the user's gesture in the working state of voice interaction; if the user's gesture is a first gesture, determine that the terminal is close to the user's mouth, where the first gesture represents that the user is holding the terminal in a stationary state; if the user's gesture is a second gesture, determine that the terminal is not close to the user's mouth, where the second gesture represents that the user is holding the terminal and moving it away from the user's mouth.
  • the processor is further configured to: determine, before outputting the feedback result for the first voice information, whether a third gesture is recognized, where the third gesture represents that the user holds the terminal and moves it toward the user's mouth; if the third gesture is recognized, determine whether the terminal is still close to the user's mouth after the feedback result for the first voice information is output; if the third gesture is not recognized, end the working state of voice interaction.
  • the processor is further configured to: obtain the angular velocity and acceleration at different times in the working state of voice interaction; and determine the user's gesture using the angular velocity and acceleration at different times together with a gesture recognition module, where the gesture recognition module is used to recognize that the user is moving the handheld terminal toward the user's mouth, moving it away from the user's mouth, or holding the terminal in a stationary state.
  • the processor is further configured to: input the second voice information into a breath recognition module, where the breath recognition module is used to identify whether the second voice information is a sound emitted by the user's mouth within a preset distance from the terminal; if the breath recognition module recognizes that the second voice information is such a sound, determine that the user's breath is detected; if the breath recognition module recognizes that the second voice information is not such a sound, determine that the user's breath is not detected.
  • the terminal includes a pressure sensor
  • the processor is further configured to: obtain the pressure value corresponding to the pressure sensor when the second voice information is collected; if the pressure value is greater than a preset pressure threshold, determine that the user's breath is detected; if the pressure value is less than or equal to the preset pressure threshold, determine that the user's breath is not detected.
  • the terminal includes a temperature sensor
  • the processor is further configured to: obtain a first temperature and a second temperature, where the first temperature is the temperature corresponding to the temperature sensor before the second voice information is collected, and the second temperature is the temperature corresponding to the temperature sensor when the second voice information is collected; if the second temperature is greater than the first temperature, determine that the user's breath is detected; if the second temperature is less than or equal to the first temperature, determine that the user's breath is not detected.
  • the terminal includes a humidity sensor
  • the processor is further configured to: obtain the humidity corresponding to the humidity sensor when the second voice information is collected; if the humidity is greater than a preset humidity threshold, determine that the user's breath is detected; if the humidity is less than or equal to the preset humidity threshold, determine that the user's breath is not detected.
  • the terminal includes a carbon dioxide sensor
  • the processor is further configured to: obtain the carbon dioxide concentration corresponding to the carbon dioxide sensor when the second voice information is collected; if the carbon dioxide concentration is greater than a preset carbon dioxide concentration threshold, determine that the user's breath is detected; if the carbon dioxide concentration is less than or equal to the preset carbon dioxide concentration threshold, determine that the user's breath is not detected.
  • the present application provides a voice interaction apparatus, which includes a processor; the processor is configured to: detect a wake-up instruction that initiates voice interaction; enter the working state of voice interaction in response to the wake-up instruction; detect first voice information; output the feedback result for the first voice information; determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extend the working state of voice interaction for a preset time; if it is determined that the terminal is not close to the user's mouth, end the working state of voice interaction.
  • the present application provides a voice interaction apparatus, which includes a processor; the processor is configured to: detect a wake-up instruction that initiates voice interaction; enter the working state of voice interaction in response to the wake-up instruction; detect first voice information; output the feedback result for the first voice information; if second voice information is detected within a preset time period, determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, output the feedback result for the second voice information.
  • the present application provides a terminal.
  • the terminal includes a memory and a processor; the memory is coupled to the processor; the memory is used to store computer program code, and the computer program code includes computer instructions.
  • when the processor executes the computer instructions, the terminal is caused to execute the method described in any one of the first to third aspects.
  • the present application provides a computer-readable storage medium in which computer programs or instructions are stored.
  • when the computer programs or instructions are executed, the method described in any one of the first to third aspects is executed.
  • the present application provides a computer program product.
  • the computer program product includes a computer program or instructions.
  • when the computer program or instructions are run on a computer, the computer is caused to perform the method described in any one of the first to third aspects.
  • the voice interaction method, apparatus, and terminal provided by this application can detect the user's breath and/or determine whether the terminal is close to the user's mouth, thereby recognizing with high probability whether the user himself intends to continue the voice interaction. This effectively reduces the terminal's erroneous responses to other people or other surrounding noises and improves the accuracy and user experience of voice interaction.
  • Figure 1 is an application scenario diagram of voice interaction provided by an embodiment of the present application
  • FIG. 2 is a hardware structure block diagram of the terminal 100 provided by the embodiment of the present application.
  • Figure 3 is a flow chart of a voice interaction method provided by an embodiment of the present application.
  • Figure 4 is a flow chart of a first implementation method for determining whether the user has the intention to continue voice interaction provided by the embodiment of the present application;
  • Figure 5 is a flow chart of a second implementation method for determining whether the user has the intention to continue voice interaction provided by the embodiment of the present application;
  • Figure 6 is a flow chart of a third implementation method for determining whether the user has the intention to continue voice interaction provided by the embodiment of the present application;
  • Figure 7 is a flow chart of a fourth implementation method for determining whether the user has the intention to continue voice interaction provided by the embodiment of the present application;
  • Figure 8 is a flow chart of a fifth implementation method for determining whether the user has the intention to continue voice interaction provided by the embodiment of the present application;
  • Figure 9 is a schematic structural diagram of a voice interaction device provided by an embodiment of the present application.
  • FIG. 1 is an application scenario diagram of voice interaction provided by an embodiment of the present application.
  • the application scenario diagram includes a terminal 100 and a user 200.
  • the terminal 100 has a voice interaction function, and the user 200 can perform voice interaction with the terminal 100 .
  • a specific event needs to trigger the voice interaction function of the terminal so that the terminal 100 can enter the voice interaction working state.
  • triggering the voice interaction function of the terminal is referred to as waking up voice interaction.
  • voice interaction can be awakened by a wake-up word, by long-pressing the power button, by tapping the voice assistant application on the desktop, and so on; this application does not limit this.
  • the user 200 can perform voice interaction with the terminal 100 .
  • the terminal 100 outputs a feedback result corresponding to the voice. For example, after the voice interaction function is awakened, user 200 says "How is the weather today?". After receiving the voice message "How is the weather today?" from user 200, the terminal 100 recognizes the voice information and outputs the feedback corresponding to that piece of voice information; for example, the terminal 100 outputs "the weather is sunny today" through the speaker.
  • the user 200 can directly speak the next voice message after the terminal 100 has fed back the previous voice message, thus realizing a continuous conversation with the terminal 100.
  • the terminal 100 implements the above continuous dialogue function by extending the sound-collection time after each round of voice interaction with the user 200 is completed. For example, after the terminal 100 outputs the feedback result corresponding to the first voice message, the terminal 100 does not stop sound collection, but continues to monitor for a period of time, such as 10 seconds. If no voice signal is received within those 10 seconds, the terminal 100 stops collecting sound; if a voice signal is received within 10 seconds, the terminal 100 continues to output feedback for the received voice information.
  • during the period when the terminal 100 extends the sound collection, if the user 200 does not make any sound, that is, the user 200 has no intention to continue the conversation, but there are other people talking or other noises around, the terminal 100 will keep providing feedback on what other people say or on the surrounding noise, which annoys the user 200 and affects the user experience.
  • this application provides a voice interaction method, which can effectively reduce the terminal 100's erroneous response to other people or other surrounding noises and improve the accuracy of voice interaction.
  • the voice interaction method provided by this application can be applied to the terminal 100.
  • the terminal 100 may be a mobile phone, a remote control, or a smart wearable device such as a watch or bracelet.
  • the hardware structure of the terminal 100 is introduced below.
  • FIG. 2 is a hardware structure block diagram of the terminal 100 provided by the embodiment of the present application.
  • the terminal 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, and a battery 142 , Antenna 1, Antenna 2, mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone interface 170D, sensor module 180, button 190, motor 191, indicator 192, camera 193 , display screen 194, and subscriber identification module (subscriber identification module, SIM) card interface 195, etc.
  • the above-mentioned sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, a humidity sensor 180N, a carbon dioxide sensor 180P, and other sensors.
  • the structure illustrated in this embodiment does not constitute a specific limitation on the terminal 100.
  • the terminal 100 may include more or fewer components than shown, or some components may be combined, or some components may be separated, or may be arranged differently.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the terminal 100.
  • the controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in the processor 110 is a cache memory. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, they can be called directly from this memory, which avoids repeated access and reduces the waiting time of the processor 110, thus improving the efficiency of the system.
  • processor 110 may include one or more interfaces.
  • the interface connection relationships between the modules illustrated in this embodiment are only schematic illustrations and do not constitute a structural limitation on the terminal 100 .
  • in other embodiments, the terminal 100 may also adopt an interface connection method different from that in the above embodiment, or a combination of multiple interface connection methods.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger. While charging the battery 142, the charging management module 140 can also provide power to the terminal through the power management module 141.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, internal memory 121, external memory, display screen 194, camera 193, wireless communication module 160, etc.
  • the wireless communication function of the terminal 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, and the baseband processor.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the terminal 100 may be used to cover one or more communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, antenna 1 can be multiplexed as a diversity antenna for a wireless LAN. In other embodiments, antennas may be used in conjunction with tuning switches.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied to the terminal 100.
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation.
  • the wireless communication module 160 can provide wireless communication solutions applied to the terminal 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) technology, and the like.
  • the antenna 1 of the terminal 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the terminal 100 can communicate with the network and other devices through wireless communication technology.
  • the terminal 100 implements the display function through the GPU, the display screen 194, and the application processor.
  • the GPU is an image processing microprocessor and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 194 is used to display images, videos, etc.
  • the display screen 194 includes a display panel.
  • display screen 194 may be a touch screen.
  • the terminal 100 can implement the shooting function through the ISP, camera 193, video codec, GPU, display screen 194, application processor, etc.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving files such as music and videos in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes instructions stored in the internal memory 121 to execute various functional applications and data processing of the terminal 100 .
  • the processor 110 can execute instructions stored in the internal memory 121, and the internal memory 121 can include a program storage area and a data storage area.
  • the program storage area can store an operating system and at least one application program required for a function (such as a sound playback function or an image playback function).
  • the data storage area can store data created during use of the terminal 100 (such as audio data and a phone book).
  • the terminal 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor.
  • the user's voice information can be collected through the microphone 170C, and the feedback result for the user's voice information can be played through the speaker 170A.
  • The touch sensor is also called a "touch panel".
  • the touch sensor can be disposed on the display screen 194, and the touch sensor and the display screen 194 form a touch screen, which is also called a "touch screen”. Touch sensors are used to detect touches on or near them.
  • the touch sensor can pass the detected touch operation to the application processor to determine the touch event type.
  • Visual output related to the touch operation may be provided through display screen 194 .
  • the touch sensor may also be disposed on the surface of the terminal 100 in a position different from that of the display screen 194 .
  • the terminal 100 can detect the touch operation input by the user on the touch screen through the touch sensor, and collect one or more of the touch position of the touch operation on the touch screen, the touch time, and the like. In some embodiments, the terminal 100 can determine the touch location of the touch operation on the touch screen through a combination of the touch sensor 180K and the pressure sensor 180A.
  • the buttons 190 include a power button, a volume button, etc.
  • Key 190 may be a mechanical key or a touch key.
  • the terminal 100 may receive key input and generate key signal input related to user settings and function control of the terminal 100. For example, the voice interaction function can be awakened by long pressing the power button.
  • the motor 191 can generate vibration prompts.
  • the motor 191 can be used for vibration prompts for incoming calls and can also be used for touch vibration feedback.
  • touch operations for different applications can correspond to different vibration feedback effects.
  • the motor 191 can also respond to different vibration feedback effects for touch operations in different areas of the display screen 194 .
  • Different application scenarios (such as time reminders, receiving messages, alarm clocks, and games) can also correspond to different vibration feedback effects.
  • the touch vibration feedback effect can also be customized.
  • the indicator 192 may be an indicator light, which may be used to indicate charging status, power changes, or may be used to indicate messages, missed calls, notifications, etc.
  • the SIM card interface 195 is used to connect a SIM card.
  • the SIM card can be connected to or separated from the terminal 100 by inserting it into the SIM card interface 195 or pulling it out from the SIM card interface 195 .
  • the terminal 100 can support 1 or N SIM card interfaces, where N is a positive integer greater than 1.
  • SIM card interface 195 can support Nano SIM card, Micro SIM card, SIM card, etc.
  • the gyro sensor 180B may be a three-axis gyroscope, used to track state changes of the terminal 100 in six directions.
  • the acceleration sensor 180E is used to detect the movement speed, direction and displacement of the terminal 100 .
  • the terminal 100 can detect the status and position of the terminal 100 through the gyroscope sensor 180B and the acceleration sensor 180E, and can determine the gesture of the user holding the terminal 100 based on the status and position of the terminal 100 at different times. For example, the user holds the terminal 100 toward the user's mouth, or the user holds the terminal 100 away from the user's mouth.
  • Figure 3 is a flow chart of a voice interaction method provided by an embodiment of the present application. As shown in Figure 3, the method may include the following steps:
  • Step S1 A wake-up instruction to initiate voice interaction is detected.
  • the wake-up instruction is used to wake up the terminal 100 to enter the working state of voice interaction.
  • the wake-up instruction may be a specific wake-up word input by the user to the terminal 100, the user may press and hold the power button, or the user may click on the desktop voice assistant application, etc.
  • the embodiment of the present application also provides a breath wake-up method.
  • Breath awakening means that the user awakens the terminal 100 to enter the voice interaction working state by pointing the mouth towards the terminal 100 and generating breath (such as speaking or blowing) within a preset distance range from the terminal 100 .
  • the user can put the terminal 100 to his mouth and speak or blow directly into the terminal 100 to wake up the terminal 100 and enter the voice interaction working state without using a specific wake-up word or pressing a button.
  • when the terminal 100 detects the user's breath, it enters the voice interaction working state.
  • the user's breath detection can be implemented in the following way: the microphone 170C is used to collect voice information. If voice information is collected, the breath recognition module can be used to determine whether the collected voice information is speech or blown air produced by the user's mouth facing the terminal 100 within a preset distance range from the terminal 100.
  • the breath recognition module may be a neural network trained for identifying breath.
  • the breath recognition module can be obtained by training on the characteristics of the plosive (popping) sound produced when the user speaks close to the microphone 170C.
  • the breath recognition module is a trained neural network that can identify whether the input voice information is a sound input close to the microphone 170C.
  • a well-trained neural network can accurately detect human voices within 5 centimeters of the microphone. In this way, when the breath recognition module recognizes that the input voice information is a human voice within 5 cm of the microphone, it is determined that the user's breath is detected, thereby waking up the terminal 100 to enter the voice interaction working state.
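Putting the wake-up path together, here is a sketch under the same assumptions: a stream of microphone clips and a trained close-talk detector. The helper names and the 0.5 score threshold are hypothetical.

```python
def breath_wakeup(mic_clips, breath_model, enter_voice_interaction) -> None:
    """Breath wake-up: wake the terminal when collected voice information is
    classified as a human voice within ~5 cm of the microphone 170C."""
    for clip in mic_clips:             # voice information collected by the microphone
        if breath_model(clip) >= 0.5:  # close-talk voice detected (assumed threshold)
            enter_voice_interaction()  # enter the voice interaction working state
            return
```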
  • when the user blows toward the terminal 100, the microphone 170C can also collect the resulting sound; in the embodiments of this application, the sound generated by blowing is also called voice information.
  • the user's breath detection method can also adopt the following implementation: if the microphone 170C collects voice information, obtain the pressure value collected by the pressure sensor 180A when the voice information is collected. If the pressure value is greater than the preset pressure threshold, it is determined that the user's breath is detected.
  • the airflow generated when the user's mouth is facing the terminal and speaks or blows within a preset distance range from the terminal 100 will produce a certain pressure on the terminal 100 .
  • the embodiment of the present application can use the pressure sensor 180A to detect the pressure generated on the terminal 100 when the user speaks. If the pressure value is greater than the preset pressure threshold, it means that the user's mouth is facing the terminal 100 and the user is speaking or blowing within the preset distance range from the terminal 100, so it can be determined that the user's breath is detected. Conversely, if the pressure value is less than or equal to the preset pressure threshold, it means that the user is not speaking or blowing within the preset distance range from the terminal 100, so it can be determined that the user's breath is not detected.
  • the parameters of the pressure sensor 180A must be able to meet the accuracy requirements of breath detection.
  • for example, suppose the airflow generated when the user's mouth faces the terminal 100 and the user speaks or blows within the preset distance range produces a pressure of 0.07 MPa on the terminal 100; in this case, the measurement range of the pressure sensor 180A may be 0 to 0.3 MPa, with an accuracy of 0.001 MPa.
  • the pressure sensor 180A can be disposed near the microphone 170C. In this way, when the user speaks close to the microphone 170C, the pressure sensor 180A near the microphone 170C can detect the pressure exerted on the pressure sensor 180A by the airflow generated by speaking.
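With those example figures, the pressure variant is a one-line threshold test. A sketch, where `read_pressure_mpa` is a hypothetical driver call for the pressure sensor 180A:

```python
PRESSURE_THRESHOLD_MPA = 0.07  # example threshold from the paragraph above

def breath_detected_by_pressure(read_pressure_mpa) -> bool:
    """Sample the pressure sensor 180A (example range 0-0.3 MPa, accuracy
    0.001 MPa) at the moment the voice information is collected."""
    return read_pressure_mpa() > PRESSURE_THRESHOLD_MPA
```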
  • the user's breath detection method can also adopt the following implementation: if the microphone 170C collects voice information, the first temperature and the second temperature are obtained.
  • the first temperature is the temperature collected by the temperature sensor 180J before the microphone 170C collects the voice information;
  • the second temperature is the temperature collected by the temperature sensor 180J when the microphone 170C collects the voice information. If the second temperature is greater than the first temperature, it is determined that the user's breath is detected; if the second temperature is less than or equal to the first temperature, it is determined that the user's breath is not detected.
  • the user's breath detection method can also be implemented in the following implementation: if the microphone 170C collects voice information, the humidity collected by the humidity sensor 180N when the voice information is collected is obtained. If the humidity is greater than the preset humidity threshold, it is determined that the user's breath is detected; if the humidity is less than or equal to the preset humidity threshold, it is determined that the user's breath is not detected.
  • the user's breath detection method can also adopt the following implementation: if the microphone 170C collects voice information, obtain the carbon dioxide concentration collected by the carbon dioxide sensor 180P when the voice information is collected. If the carbon dioxide concentration is greater than the preset carbon dioxide concentration threshold, it is determined that the user's breath is detected; if the carbon dioxide concentration is less than or equal to the preset carbon dioxide concentration threshold, it is determined that the user's breath is not detected.
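The temperature variant differs from the single-threshold variants in that it needs two samples, one taken before collection and one taken while the voice information is being collected. A sketch of that timing, with all helper names assumed:

```python
def breath_detected_by_temperature(read_temperature, collect_voice) -> bool:
    """Temperature variant: compare the temperature before voice collection
    (first temperature) with the temperature while the voice information is
    being collected (second temperature)."""
    first_temperature = read_temperature()   # sampled before collection starts
    voice = collect_voice()                  # microphone 170C collects voice information
    second_temperature = read_temperature()  # sampled during/just after collection
    return voice is not None and second_temperature > first_temperature
```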
  • in other words, the embodiments of this application can determine whether the user's breath is detected based on the data collected by the temperature sensor 180J, the humidity sensor 180N, or the carbon dioxide sensor 180P.
  • the above embodiments only illustrate the implementation of detecting the user's breath and do not limit the specific implementation of detecting the user's breath.
  • the various implementation methods listed in the above embodiments can also be used in combination.
  • for example, the "breath recognition module" solution can be used in combination with the "pressure sensor" solution, the "temperature sensor" solution, the "humidity sensor" solution, or the "carbon dioxide sensor" solution.
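One way such a combination could be wired up, shown for the "breath recognition module" plus "pressure sensor" pair. Requiring both cues to agree (AND) favors fewer false responses, while OR would favor recall; the fusion policy, helper names, and thresholds are all assumptions, as the application does not specify them.

```python
def breath_detected_combined(audio_clip, breath_model, read_pressure_mpa,
                             score_threshold: float = 0.5,
                             pressure_threshold_mpa: float = 0.07) -> bool:
    """Fuse the acoustic breath recognition module with the pressure sensor:
    declare breath only when both cues agree."""
    acoustic_hit = breath_model(audio_clip) >= score_threshold
    pressure_hit = read_pressure_mpa() > pressure_threshold_mpa
    return acoustic_hit and pressure_hit
```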
  • it should be noted that in some scenarios the breath wake-up provided by the embodiment of the present application is disabled. For example, when the user makes a call using the terminal 100, even if the user's mouth is facing the terminal 100 and breath is generated within the preset distance range from the terminal 100, the terminal 100 will not be awakened to enter the voice interaction working state.
  • Step S2 In response to the wake-up instruction, enter the voice interaction working state.
  • after entering the voice interaction working state, the terminal 100 continues to collect sound to obtain the user's voice information.
  • Step S3 The first voice information is detected.
  • Step S4 Output the feedback result for the first voice information.
  • the feedback result for the first voice information may be voice, text, image, or entering an application program, etc. This application does not limit this.
  • for example, after the terminal 100 enters the voice interaction working state, the user says a sentence, such as "How is the weather today?", and "How is the weather today" is detected by the terminal 100 as the first voice information. The terminal 100 then outputs the feedback result for the first voice information: for example, the terminal 100 outputs the voice "Today's weather is sunny" through the speaker 170A, or displays the text "Today's weather is sunny" on the display screen 194.
  • as another example, after the terminal 100 enters the voice interaction working state, the user says a sentence such as "Dial Zhang San", and "Dial Zhang San" is detected by the terminal 100 as the first voice information. The terminal 100 then outputs the feedback result for the first voice information: for example, the terminal 100 enters the voice call application and dials Zhang San's phone number.
  • Step S5 Determine whether the user has the intention to continue voice interaction.
  • if the user has the intention to continue voice interaction, the voice interaction working state is maintained; if the user has no intention to continue voice interaction, the voice interaction working state is ended.
  • while the voice interaction working state is maintained, the terminal 100 can continue to collect sounds; after the voice interaction working state ends, the terminal stops collecting sounds.
  • the embodiments of this application provide the following implementation methods for determining whether the user has the intention to continue voice interaction.
  • Figure 4 is a flowchart of a first implementation method for determining whether the user has the intention to continue voice interaction provided by the embodiment of the present application.
  • the first implementation method of determining whether the user has the intention to continue voice interaction may include the following steps:
  • Step S51 Determine whether the terminal 100 is close to the user's mouth.
  • determining whether the terminal 100 is close to the user's mouth means determining whether the terminal 100 is near the user's mouth.
  • if the working state of voice interaction was awakened by breath, the terminal 100 was near the user's mouth when it was awakened. Therefore, after the feedback result for the first voice information is output, whether the terminal 100 is still at the user's mouth can be determined by determining whether the user moved the handheld terminal 100 away from the user's mouth during the period from step S1 to step S4. If yes, the terminal 100 is considered no longer near the user's mouth; in this case, it can be considered that the user has no intention to continue voice interaction, and the voice interaction working state can be ended. If no, the terminal 100 is considered still near the user's mouth; in this case, the user may have the intention to continue voice interaction, and subsequent steps can be performed.
  • alternatively, the present application may first determine whether the user moved the handheld terminal 100 toward the user's mouth before the feedback result for the first voice information was output. If it is determined that the user did move the handheld terminal 100 toward the user's mouth before the feedback result was output, it is then determined whether the terminal 100 is still at the user's mouth after the feedback result for the first voice information is output (specifically, whether the user moved the handheld terminal 100 away from the user's mouth after the feedback result was output). If it is determined that the user did not move the handheld terminal 100 toward the user's mouth before the feedback result for the first voice information was output, it can be considered that the user has no intention to continue voice interaction, and the working state of voice interaction can be ended.
  • the gyro sensor 180B and the acceleration sensor 180E on the terminal 100 can be used to collect the angular velocity and acceleration of the terminal 100; the collected angular velocity and acceleration are then used to determine the user's gesture.
  • the user's gestures may include a first gesture, a second gesture and a third gesture.
  • the first gesture is used to indicate that the user's handheld terminal 100 is in a stationary state
  • the second gesture is used to indicate that the user's handheld terminal 100 is moving away from the user's mouth.
  • the third gesture indicates that the user holds the terminal 100 closer to the user's mouth.
  • the gyro sensor 180B and the acceleration sensor 180E can be used to collect the angular velocity and acceleration during the period from step S1 to step S4. Then, the collected angular velocity and acceleration are input into the gesture recognition module.
  • the gesture recognition module can be a neural network trained for gesture recognition. After processing by the gesture recognition module, the user's gesture is output. The gesture recognition module can determine the user's handheld gesture based on the angular velocity and acceleration of the terminal 100 at different times.
  • small movements may still be regarded as stationary. For example, if the user first holds the terminal 100 at a distance of 5 cm from the mouth and then holds it at a distance of 4 cm from the mouth, the user's gesture is still considered to be the first gesture.
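A sketch of this sampling-plus-classification step follows; the sampling rate, window length, and model call are illustrative assumptions, not values from the application.

```python
import time

def recognize_gesture(read_gyro_180b, read_accel_180e, gesture_model,
                      window_s: float = 1.0):
    """Collect angular velocity (gyro sensor 180B) and acceleration
    (acceleration sensor 180E) over a window, then classify the hold gesture
    (first/second/third) with the trained gesture recognition module."""
    samples = []
    t_end = time.monotonic() + window_s
    while time.monotonic() < t_end:
        samples.append((read_gyro_180b(), read_accel_180e()))  # (rad/s, m/s^2)
        time.sleep(0.01)                                       # ~100 Hz (assumed)
    return gesture_model(samples)
```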
  • Step S52 If it is determined that the terminal 100 is close to the user's mouth, the voice interaction working state is extended for a preset time.
  • this application extends the voice interaction working state for a preset time. Within the extended preset time period, the terminal 100 continues to receive audio.
  • This application does not limit the preset time, for example, it can be 5s, 10s, 20s, etc.
  • Step S53 Determine whether the second voice message is detected within a preset time period.
  • Step S54 If the second voice information is detected within the preset time period, determine whether the user's breath is detected.
  • the present application further detects the user's breath to determine whether the second voice information is what the user said to the terminal 100 .
  • in the working state of voice interaction, the user's mouth needs to be close to the terminal 100 to perform voice interaction with it. Therefore, if the second voice message was spoken by the user, the breath produced by the user while speaking can be detected by the terminal. In other words, this application can determine whether the second voice information is what the user said or what other people nearby said based on whether the user's breath can be detected.
  • the breath recognition module can be used to detect the user's breath.
  • Step S55 If the user's breath is detected, the feedback result for the second voice information is output.
  • if the user's breath is detected, it means that the second voice information was spoken by the user with the mouth toward the terminal 100 and within the preset distance range from the terminal 100. In this case, it is considered that the user has the intention to continue voice interaction, and the feedback result for the second voice information is output. If the user's breath is not detected, it means that the second voice information is what other people nearby said, not what the user said. In this case, it is considered that the user has no intention of continuing the voice interaction, and the working state of voice interaction can be ended.
  • to summarize the first implementation: if it is determined that the terminal 100 is not at the user's mouth, the voice interaction working state is ended; if it is determined that the terminal 100 is near the user's mouth, the voice interaction working state is extended for a preset time. Then, if the second voice information is not detected within the preset time period, the voice interaction working state is ended; if the second voice information is detected within the preset time period, the user's breath is detected.
  • if the user's breath is not detected, the voice interaction working state is ended; if the user's breath is detected, the feedback result for the second voice information is output. That is to say, in the first implementation, if the terminal 100 is near the user's mouth and the user's breath can be detected, it is determined that the user has the intention to continue voice interaction. In this way, the voice interaction method provided by the embodiments of the present application can identify with high probability that the user himself has the intention to continue voice interaction, effectively reducing the terminal 100's erroneous responses to other people or other surrounding noises, and improving the accuracy and user experience of voice interaction.
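Steps S51-S55 can be summarized in one routine; a sketch with the same hypothetical helpers as before:

```python
def first_implementation(near_mouth, listen, detect_breath, respond,
                         preset_s: float = 10.0) -> None:
    """Figure 4 flow: S51 proximity check -> S52 extend the working state ->
    S53 wait for second voice -> S54 breath check -> S55 output feedback."""
    if not near_mouth():                     # S51: terminal at the user's mouth?
        return                               # no: end the voice interaction working state
    second_voice = listen(timeout=preset_s)  # S52+S53: extended sound collection
    if second_voice is None:
        return                               # nothing detected: end the working state
    if detect_breath(second_voice):          # S54: user's breath detected?
        respond(second_voice)                # S55: feedback for the second voice information
    # breath not detected: treated as bystander speech or noise; working state ends
```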
  • Figure 5 is a flowchart of a second implementation method for determining whether the user has the intention to continue voice interaction provided by the embodiment of the present application.
  • the second implementation method of determining whether the user has the intention to continue voice interaction may include the following steps:
  • Step S61 After outputting the feedback result for the first voice information, the voice interaction working state is extended for a preset time period.
  • Step S62 Determine whether the second voice message is detected within a preset time period.
  • Step S63 If the second voice information is detected within the preset time period, determine whether the terminal 100 is close to the user's mouth.
  • Step S64 If it is determined that the terminal 100 is close to the user's mouth, it is determined whether the user's breath is detected.
  • Step S65 If the user's breath is detected, the feedback result for the second voice information is output.
  • that is, in the second implementation, after the feedback result for the first voice information is output, the voice interaction working state is directly extended for a preset time. If the second voice information is not detected within the preset time period, the voice interaction working state is ended. If the second voice information is detected within the preset time period, it is first determined whether the terminal 100 is near the user's mouth. If the terminal 100 is not near the user's mouth, the voice interaction working state is ended. If it is determined that the terminal 100 is near the user's mouth, the user's breath is then detected. If the user's breath is not detected, the voice interaction working state is ended; if the user's breath is detected, the feedback result for the second voice information is output.
•   For the specific implementation of determining whether the terminal 100 is near the user's mouth in step S63, refer to the description of step S51; for the specific implementation of detecting the user's breath in step S64, refer to the description of step S54.
•   For the specific implementation of step S65, refer to the description of step S55, which will not be repeated here.
•   Figure 6 is a flow chart of a third implementation manner for determining whether the user has the intention to continue voice interaction provided by the embodiment of the present application.
•   As shown in Figure 6, the third implementation manner of determining whether the user has the intention to continue voice interaction may include the following steps:
•   Step S71: After outputting the feedback result for the first voice information, the voice interaction working state is extended for a preset time period.
•   Step S72: Determine whether the second voice information is detected within the preset time period.
•   Step S73: If the second voice information is detected within the preset time period, determine whether the user's breath is detected.
•   Step S74: If the user's breath is detected, output a feedback result for the second voice information.
•   In the third implementation manner, the voice interaction working state is directly extended for a preset time period. If the second voice information is not detected within the preset time period, the voice interaction working state is ended. If the second voice information is detected within the preset time period, the user's breath is detected. If the user's breath is not detected, the voice interaction working state is ended. If the user's breath is detected, a feedback result for the second voice information is output.
•   For the specific implementation of detecting the user's breath in step S73, refer to the description of step S54; for the specific implementation of step S74, refer to the description of step S55, which will not be repeated here.
•   Figure 7 is a flow chart of a fourth implementation manner for determining whether the user has the intention to continue voice interaction provided by the embodiment of the present application.
•   As shown in Figure 7, the fourth implementation manner of determining whether the user has the intention to continue voice interaction may include the following steps:
•   Step S81: After outputting the feedback result for the first voice information, the voice interaction working state is extended for a preset time period.
•   Step S82: Determine whether the second voice information is detected within the preset time period.
•   Step S83: If the second voice information is detected within the preset time period, determine whether the terminal 100 is close to the user's mouth.
•   Step S84: If it is determined that the terminal 100 is close to the user's mouth, output a feedback result for the second voice information.
•   In the fourth implementation manner, the voice interaction working state is directly extended for a preset time period. If the second voice information is not detected within the preset time period, the voice interaction working state is ended. If the second voice information is detected within the preset time period, it is determined whether the terminal 100 is near the user's mouth. If the terminal 100 is not near the user's mouth, the voice interaction working state is ended. If it is determined that the terminal 100 is near the user's mouth, the feedback result for the second voice information is output.
•   For the specific implementation of determining whether the terminal 100 is near the user's mouth in step S83, refer to the description of step S51; for the specific implementation of step S84, refer to the description of step S55, which will not be repeated here.
  • Figure 8 is a flowchart of a fifth implementation method for determining whether the user has the intention to continue voice interaction provided by the embodiment of the present application.
•   As shown in Figure 8, the fifth implementation manner of determining whether the user has the intention to continue voice interaction may include the following steps:
•   Step S91: Determine whether the terminal 100 is close to the user's mouth.
•   Step S92: If it is determined that the terminal 100 is close to the user's mouth, the voice interaction working state is extended for a preset time period.
•   Step S93: Determine whether the second voice information is detected within the preset time period.
•   Step S94: If the second voice information is detected within the preset time period, output a feedback result for the second voice information.
•   In the fifth implementation manner, it is first determined whether the terminal 100 is near the user's mouth. If it is determined that the terminal 100 is near the user's mouth, the voice interaction working state is extended for a preset time period. If it is determined that the terminal 100 is not near the user's mouth, the voice interaction working state is ended. In this way, the power consumption of the terminal 100 can be reduced. Further, if the second voice information is detected within the preset time period, the feedback result for the second voice information is output. If the second voice information is not detected within the preset time period, the voice interaction working state is ended. In the fifth implementation manner, after the feedback result for the first voice information is output, if the terminal 100 is still near the user's mouth, it is considered that the user has the intention to continue voice interaction, so the listening time can be extended.
•   Optionally, before outputting the feedback result for the second voice information, the user's breath can be detected first. If the user's breath is detected, the feedback result for the second voice information is output. For the specific implementation, refer to the first implementation manner described above, which will not be repeated here.
•   In this way, the voice interaction method provided by the embodiments of the present application can identify with a high probability that the user himself has the intention to continue voice interaction, effectively reduce the terminal 100's incorrect responses to other people or other surrounding noises, and improve the accuracy and user experience of voice interaction.
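Viewed side by side, the five implementation manners differ only in whether the proximity check and the breath check are applied, and in whether the proximity check gates the extension of the listening window itself. The sketch below, reusing the hypothetical helper names from the earlier example, summarizes the second through fifth manners; it is an illustrative outline under those assumptions, not the terminal's actual logic.

```python
def continue_dialog(terminal, preset_duration,
                    check_mouth=False, check_breath=False,
                    gate_extension_on_mouth=False):
    """Generic sketch covering the second through fifth implementation
    manners (Figures 5-8); the flags select which checks are applied."""
    # Fifth manner only: the proximity check gates the extension itself, so
    # the listening window is never opened when the terminal is away from
    # the mouth, which saves power.
    if gate_extension_on_mouth and not terminal.is_near_mouth():
        terminal.end_voice_interaction()
        return

    second_voice = terminal.listen_for_voice(timeout=preset_duration)
    if second_voice is None:
        terminal.end_voice_interaction()
        return

    # Second and fourth manners: proximity check after the second voice.
    if check_mouth and not terminal.is_near_mouth():
        terminal.end_voice_interaction()
        return
    # Second and third manners: breath check after the second voice.
    if check_breath and not terminal.breath_detected(second_voice):
        terminal.end_voice_interaction()
        return

    terminal.output_feedback(second_voice)

# Second manner (Figure 5): continue_dialog(t, 10, check_mouth=True, check_breath=True)
# Third manner (Figure 6):  continue_dialog(t, 10, check_breath=True)
# Fourth manner (Figure 7): continue_dialog(t, 10, check_mouth=True)
# Fifth manner (Figure 8):  continue_dialog(t, 10, gate_extension_on_mouth=True)
```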
  • the methods and operations implemented by the electronic device can also be implemented by components (such as chips or circuits) that can be used in the electronic device.
•   In order to implement the above functions, the terminal includes hardware structures and/or software modules corresponding to each function.
  • Persons skilled in the art should easily realize that, with the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein, the present application can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving the hardware depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each specific application, but such implementations should not be considered beyond the scope of this application.
  • Figure 9 is a schematic structural diagram of a voice interaction device provided by an embodiment of the present application.
  • the terminal can implement corresponding functions through the hardware device shown in Figure 9.
  • the device 1000 may include: a processor 1001 and a memory 1002.
  • the processor 1001 may include one or more processing units.
  • the processor 1001 may include an application processor, a modem processor, a graphics processor, an image signal processor, a controller, a video codec, a digital signal processor, baseband processor, and/or neural network processor, etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • Memory 1002 is coupled to processor 1001 for storing various software programs and/or sets of instructions.
•   Memory 1002 may include volatile memory and/or non-volatile memory.
  • the device 1000 can perform the operations performed in the above method embodiments.
•   The processor 1001 may be configured to: detect a wake-up instruction initiating voice interaction; in response to the wake-up instruction, enter the working state of voice interaction; detect the first voice information; output a feedback result for the first voice information; if the second voice information is detected within a preset time period, detect the user's breath; and if the user's breath is detected, output a feedback result for the second voice information.
•   The processor is further configured to: after outputting the feedback result for the first voice information, determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extend the voice interaction working state for the preset time period; and if it is determined that the terminal is not close to the user's mouth, end the working state of voice interaction.
•   The processor is further configured to: determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, detect the user's breath; and if it is determined that the terminal is not close to the user's mouth, end the working state of voice interaction.
•   The processor is further configured to: recognize the user's gesture in the working state of the voice interaction; if the user's gesture is a first gesture, determine that the terminal is close to the user's mouth, where the first gesture is used to represent that the user is holding the terminal in a stationary state; and if the user's gesture is a second gesture, determine that the terminal is not close to the user's mouth, where the second gesture is used to represent that the user is holding the terminal and moving it away from the user's mouth.
•   The processor is further configured to: determine whether a third gesture was recognized before the feedback result for the first voice information was output, where the third gesture is used to represent that the user is holding the terminal and moving it toward the user's mouth; if the third gesture was recognized, determine whether the terminal is still close to the user's mouth after the feedback result for the first voice information is output; and if the third gesture was not recognized, end the working state of voice interaction.
•   The processor is further configured to: obtain the angular velocity and acceleration at different times in the working state of the voice interaction; and determine the user's gesture using the angular velocity and acceleration at different times together with a gesture recognition module, where the gesture recognition module is used to recognize that the user is holding the terminal and moving it toward the user's mouth, that the user is holding the terminal and moving it away from the user's mouth, or that the user is holding the terminal in a stationary state.
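As an illustration of how angular velocity and acceleration samples might feed such a gesture recognition module, the sketch below classifies the three gestures with simple thresholds and a crude vertical displacement estimate. The embodiments do not specify the module's internals (a trained model is equally plausible), so the thresholds, the assumed 100 Hz sampling rate, and the neglect of device orientation are all illustration-only assumptions.

```python
import numpy as np

def classify_gesture(gyro_samples, accel_samples, dt=0.01, motion_threshold=0.5):
    """Hypothetical gesture classifier over angular velocity (rad/s) and
    acceleration (m/s^2) samples collected at different times.

    Returns "stationary", "toward_mouth", or "away_from_mouth".
    """
    gyro = np.asarray(gyro_samples, dtype=float)    # shape (n, 3)
    accel = np.asarray(accel_samples, dtype=float)  # shape (n, 3)

    # Little rotation and little variation in acceleration: terminal held still.
    if np.abs(gyro).max() < motion_threshold and accel.std() < motion_threshold:
        return "stationary"

    # Crude vertical displacement: integrate (a_z - g) twice. A positive net
    # displacement suggests the terminal was raised toward the mouth.
    vertical_accel = accel[:, 2] - 9.81
    vertical_velocity = np.cumsum(vertical_accel) * dt
    displacement = np.sum(vertical_velocity) * dt
    return "toward_mouth" if displacement > 0 else "away_from_mouth"
```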
•   The processor is further configured to: input the second voice information into a breath recognition module, where the breath recognition module is used to identify whether the second voice information is a sound emitted by the user's mouth within a preset distance from the terminal; if the breath recognition module recognizes that the second voice information is a sound emitted by the user's mouth within a preset distance from the terminal, determine that the user's breath is detected; and if the breath recognition module recognizes that the second voice information is not such a sound, determine that the user's breath is not detected.
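One plausible way such a breath recognition module could distinguish close-range speech is by the strong low-frequency "pop" energy that the breath stream produces when the mouth is within a short distance of the microphone. The sketch below illustrates this idea with a simple spectral-energy ratio; the actual module's features and classifier are not specified by the embodiments, so the cutoff frequency and threshold here are assumed values.

```python
import numpy as np

def breath_detected(audio, sample_rate=16000, low_cut_hz=100.0, ratio_threshold=0.2):
    """Hypothetical breath check on the captured second voice information.

    Compares the spectral energy below low_cut_hz with the total energy;
    speech spoken with the mouth very close to the microphone tends to carry
    a disproportionate share of such low-frequency breath-pop energy.
    """
    audio = np.asarray(audio, dtype=float)
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(audio.size, d=1.0 / sample_rate)

    low_energy = spectrum[freqs < low_cut_hz].sum()
    total_energy = spectrum.sum() + 1e-12  # guard against silence
    return (low_energy / total_energy) > ratio_threshold
```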
•   In an implementable manner, the terminal includes a pressure sensor, and the processor is further configured to: obtain the pressure value corresponding to the pressure sensor when the second voice information is collected; if the pressure value is greater than a preset pressure threshold, determine that the user's breath is detected; and if the pressure value is less than or equal to the preset pressure threshold, determine that the user's breath is not detected.
•   In an implementable manner, the terminal includes a temperature sensor, and the processor is further configured to: obtain a first temperature and a second temperature, where the first temperature is the temperature corresponding to the temperature sensor before the second voice information is collected, and the second temperature is the temperature corresponding to the temperature sensor when the second voice information is collected; if the second temperature is greater than the first temperature, determine that the user's breath is detected; and if the second temperature is less than or equal to the first temperature, determine that the user's breath is not detected.
•   In an implementable manner, the terminal includes a humidity sensor, and the processor is further configured to: obtain the humidity corresponding to the humidity sensor when the second voice information is collected; if the humidity is greater than a preset humidity threshold, determine that the user's breath is detected; and if the humidity is less than or equal to the preset humidity threshold, determine that the user's breath is not detected.
•   In an implementable manner, the terminal includes a carbon dioxide sensor, and the processor is further configured to: obtain the carbon dioxide concentration corresponding to the carbon dioxide sensor when the second voice information is collected; if the carbon dioxide concentration is greater than a preset carbon dioxide concentration threshold, determine that the user's breath is detected; and if the carbon dioxide concentration is less than or equal to the preset carbon dioxide concentration threshold, determine that the user's breath is not detected.
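The four sensor-based checks above are all simple threshold comparisons performed at the moment the second voice information is collected. The following sketch gathers them in one place; the sensor-reading helpers and all threshold values are assumptions for illustration and would need per-device calibration.

```python
def breath_from_sensors(sensors,
                        pressure_threshold=5.0,   # Pa, placeholder value
                        humidity_threshold=60.0,  # %RH, placeholder value
                        co2_threshold=800.0):     # ppm, placeholder value
    """Hypothetical sensor-based breath detection at the moment the second
    voice information is collected. Any single modality may be used alone;
    the sensors object and all thresholds are assumed for illustration.
    """
    # Pressure: the airflow of close-range speech presses on the sensor.
    if sensors.pressure() > pressure_threshold:
        return True
    # Temperature: exhaled air warms the sensor relative to the reading
    # taken before the second voice information was collected.
    if sensors.temperature_now() > sensors.temperature_before():
        return True
    # Humidity: exhaled air is more humid than the ambient air.
    if sensors.humidity() > humidity_threshold:
        return True
    # Carbon dioxide: exhaled air raises the CO2 concentration at the mic.
    if sensors.co2_concentration() > co2_threshold:
        return True
    return False
```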
•   The processor 1001 may also be configured to: detect a wake-up instruction initiating voice interaction; in response to the wake-up instruction, enter the working state of voice interaction; detect the first voice information; output the feedback result for the first voice information; determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extend the working state of the voice interaction for a preset time period; and if it is determined that the terminal is not close to the user's mouth, end the voice interaction working state.
•   The processor 1001 may also be configured to: detect a wake-up instruction initiating voice interaction; in response to the wake-up instruction, enter the working state of voice interaction; detect the first voice information; output the feedback result for the first voice information; if the second voice information is detected within a preset time period, determine whether the terminal is close to the user's mouth; and if it is determined that the terminal is close to the user's mouth, output the feedback result for the second voice information.
•   Each step of the above method can be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software.
  • the steps of the methods disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware processor for execution, or can be executed by a combination of hardware and software modules in the processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, it will not be described in detail here.
  • the processor in the embodiment of the present application may be an integrated circuit chip with signal processing capabilities.
  • each step of the above method embodiment can be completed through an integrated logic circuit of hardware in the processor or instructions in the form of software.
  • the above-mentioned processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • the memory in the embodiment of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories.
•   The non-volatile memory can be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory.
  • Volatile memory may be random access memory (RAM), which is used as an external cache.
•   By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM).
  • the embodiment of the present application also provides a computer program product.
•   The computer program product includes a computer program or instructions. When the computer program or instructions are run on a computer, the computer is caused to execute the method of any one of the method embodiments.
  • the embodiment of the present application also provides a computer-readable storage medium.
•   The computer-readable storage medium stores a computer program or instructions. When the computer program or instructions are run on a computer, the computer is caused to execute the method of any one of the method embodiments.
•   The embodiment of the present application also provides a terminal, including a memory and a processor; the memory is coupled to the processor; the memory is used to store computer program code, and the computer program code includes computer instructions; when the processor executes the computer instructions, the electronic device is caused to execute the method of any one of the method embodiments.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of modules is only a logical function division. In actual implementation, there may be other division methods.
  • multiple modules or components may be combined or can be integrated into another system, or some features can be ignored, or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application can be integrated into a processing unit, or each module can exist physically alone, or two or more modules can be integrated into one unit.
•   If the functions described are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
•   Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
•   The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
•   The aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code.
•   The voice interaction devices, chips, computer storage media, computer program products, and terminals provided by the above embodiments of the present application are all used to execute the methods provided above; therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding methods provided above, which will not be described again here.
•   It should be understood that the sequence numbers of the steps do not imply an order of execution; the execution order of the steps should be determined by their functions and internal logic, and does not constitute any limitation on the implementation process of the embodiments.

Abstract

A voice interaction method and apparatus, and a terminal. The method comprises: detecting a wake-up instruction for initiating voice interaction (S1); in response to the wake-up instruction, entering an operation state of voice interaction (S2); detecting first voice information (S3); outputting a feedback result for the first voice information (S4); determining whether a terminal (100) is close to the mouth of a user (S51); if it is determined that the terminal (100) is close to the mouth of the user, prolonging the operation state of voice interaction for a preset duration (S52); if second voice information is detected within the preset duration, detecting the breath of the user (S54); and if the breath of the user is detected, outputting a feedback result for the second voice information (S55). By means of such a voice interaction method, a user himself/herself having an intent to continue voice interaction can be identified with high probability, thereby effectively reducing erroneous responses of a terminal to other people or other surrounding noise, and thus improving the accuracy and user experience of voice interaction.

Description

A voice interaction method, device and terminal
This application claims priority to the Chinese patent application submitted to the State Intellectual Property Office on September 14, 2022, with application number 202211113419.9 and invention title "A voice interaction method, device and terminal", the entire content of which is incorporated by reference in this application.
Technical field
The present application belongs to the field of human-computer interaction technology, and in particular relates to a voice interaction method, device and terminal.
Background
Voice interaction is a new generation of interaction mode based on voice input. Based on the voice information input by the user to the terminal, feedback results corresponding to the input voice information can be obtained.
Before performing voice interaction with the terminal, the voice interaction system (such as a voice assistant) on the terminal must first be awakened; for example, the voice assistant can be awakened by a specific wake-up word. After the voice assistant is awakened, the user can conduct voice interaction with the terminal. During the voice interaction between the user and the terminal, generally after the user finishes speaking a voice, the terminal outputs the feedback result corresponding to that voice; then, the user can speak the next voice, thus realizing a continuous dialogue with the terminal.
However, the current continuous dialogue function of the terminal is achieved by extending the terminal's sound collection time. For example, after the terminal outputs the feedback result corresponding to the first voice message, the terminal continues to listen for a period of time, such as 10 seconds. If no voice signal is received within 10 seconds, the terminal stops collecting; if a voice signal is received within 10 seconds, the terminal continues to output feedback results for the received voice information. In this way, during the period in which the terminal extends its listening, if the user does not make any sound but other people are talking nearby, the terminal will continue to provide feedback on what those other people say, which causes trouble and annoyance to the user and affects the user experience.
Contents of the invention
This application provides a voice interaction method, device and terminal, which can solve the problem of erroneous responses to other people or other surrounding noises during the period in which the terminal extends its sound collection.
In a first aspect, the present application provides a voice interaction method. The method includes: detecting a wake-up instruction initiating voice interaction; in response to the wake-up instruction, entering the working state of voice interaction; detecting first voice information; outputting a feedback result for the first voice information; if second voice information is detected within a preset time period, detecting the user's breath; and if the user's breath is detected, outputting a feedback result for the second voice information.
In this way, through user breath detection, it can be identified with a high probability that the user himself has the intention to continue voice interaction, effectively reducing the terminal's erroneous responses to other people or other surrounding noises, and improving the accuracy of voice interaction and the user experience.
In an implementable manner, after outputting the feedback result for the first voice information, the method further includes: determining whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extending the working state of the voice interaction by the preset duration; and if it is determined that the terminal is not close to the user's mouth, ending the working state of the voice interaction.
In this way, before performing user breath detection, it is first determined whether the terminal is close to the user's mouth. If it is determined that the terminal is close to the user's mouth, the sound collection time is extended; if it is determined that the terminal is not close to the user's mouth, the sound collection is ended directly. This can greatly reduce the power consumption caused by sound collection.
In an implementable manner, if the second voice information is detected within the preset time period, the method further includes: determining whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, detecting the user's breath; and if it is determined that the terminal is not close to the user's mouth, ending the working state of the voice interaction.
In this way, when the second voice information is detected, it is first determined whether the terminal is close to the user's mouth, and then it is decided whether to detect the user's breath. If the terminal is not close to the user's mouth, the second voice information is considered not to be a sound made by the user, and there is no need to detect the user's breath.
In an implementable manner, if the wake-up instruction is the user's breath, determining whether the terminal is close to the user's mouth includes: recognizing the user's gesture in the working state of the voice interaction; if the user's gesture is a first gesture, determining that the terminal is close to the user's mouth, where the first gesture is used to represent that the user is holding the terminal in a stationary state; and if the user's gesture is a second gesture, determining that the terminal is not close to the user's mouth, where the second gesture is used to represent that the user is holding the terminal and moving it away from the user's mouth.
In this way, if the working state of the voice interaction was awakened by the breath awakening manner, it means that the terminal was at the user's mouth when the terminal was awakened. Therefore, after outputting the feedback result for the first voice information, whether the terminal 100 is still at the user's mouth can be determined by judging whether the user is holding the terminal 100 and moving it away from the user's mouth.
In an implementable manner, if the wake-up instruction is in a manner other than the user's breath, before determining whether the terminal is close to the user's mouth, the method includes: determining whether a third gesture was recognized before the feedback result for the first voice information was output, where the third gesture is used to represent that the user is holding the terminal and moving it toward the user's mouth; if the third gesture was recognized, determining whether the terminal is still close to the user's mouth after the feedback result for the first voice information is output; and if the third gesture was not recognized, ending the working state of the voice interaction.
In this way, if the working state of the voice interaction was not awakened by the breath awakening manner, it means that the terminal was not at the user's mouth when the terminal was awakened. In this case, after entering the working state of the voice interaction, it can first be determined whether the user held the terminal and moved it toward the user's mouth before the feedback result for the first voice information was output. If it is determined that the user did so, it is then determined whether the terminal is still at the user's mouth after the feedback result for the first voice information is output.
In an implementable manner, recognizing the user's gesture in the working state of the voice interaction includes: obtaining the angular velocity and acceleration at different times in the working state of the voice interaction; and determining the user's gesture using the angular velocity and acceleration at different times together with a gesture recognition module, where the gesture recognition module is used to recognize that the user is holding the terminal and moving it toward the user's mouth, that the user is holding the terminal and moving it away from the user's mouth, or that the user is holding the terminal in a stationary state.
In this way, the gesture recognition module can be used to determine the user's gesture based on the angular velocity and acceleration data at different times.
In an implementable manner, detecting the user's breath includes: inputting the second voice information into a breath recognition module, where the breath recognition module is used to identify whether the second voice information is a sound emitted by the user's mouth within a preset distance from the terminal; if the breath recognition module recognizes that the second voice information is a sound emitted by the user's mouth within a preset distance from the terminal, determining that the user's breath is detected; and if the breath recognition module recognizes that the second voice information is not a sound emitted by the user's mouth within a preset distance from the terminal, determining that the user's breath is not detected.
In this way, the breath recognition module can be used to perform feature recognition on the second voice information to determine whether the second voice information is a sound emitted by the user's mouth close to the terminal.
In an implementable manner, the terminal includes a pressure sensor, and detecting the user's breath includes: obtaining the pressure value corresponding to the pressure sensor when the second voice information is collected; if the pressure value is greater than a preset pressure threshold, determining that the user's breath is detected; and if the pressure value is less than or equal to the preset pressure threshold, determining that the user's breath is not detected.
In an implementable manner, the terminal includes a temperature sensor, and detecting the user's breath includes: obtaining a first temperature and a second temperature, where the first temperature is the temperature corresponding to the temperature sensor before the second voice information is collected, and the second temperature is the temperature corresponding to the temperature sensor when the second voice information is collected; if the second temperature is greater than the first temperature, determining that the user's breath is detected; and if the second temperature is less than or equal to the first temperature, determining that the user's breath is not detected.
In an implementable manner, the terminal includes a humidity sensor, and detecting the user's breath includes: obtaining the humidity corresponding to the humidity sensor when the second voice information is collected; if the humidity is greater than a preset humidity threshold, determining that the user's breath is detected; and if the humidity is less than or equal to the preset humidity threshold, determining that the user's breath is not detected.
In an implementable manner, the terminal includes a carbon dioxide sensor, and detecting the user's breath includes: obtaining the carbon dioxide concentration corresponding to the carbon dioxide sensor when the second voice information is collected; if the carbon dioxide concentration is greater than a preset carbon dioxide concentration threshold, determining that the user's breath is detected; and if the carbon dioxide concentration is less than or equal to the preset carbon dioxide concentration threshold, determining that the user's breath is not detected.
In this way, if the user speaks with the mouth close to the terminal, the airflow generated by speaking exerts a certain pressure on the terminal, and the temperature, humidity, and carbon dioxide concentration near the terminal also change to a certain extent. The present application can therefore use a pressure sensor, a temperature sensor, a humidity sensor, or a carbon dioxide sensor to detect the user's breath.
In a second aspect, the present application provides a voice interaction method. The method includes: detecting a wake-up instruction initiating voice interaction; in response to the wake-up instruction, entering the working state of voice interaction; detecting first voice information; outputting a feedback result for the first voice information; determining whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extending the working state of the voice interaction for a preset time period; and if second voice information is detected within the preset time period, outputting a feedback result for the second voice information.
In a third aspect, the present application provides a voice interaction method. The method includes: detecting a wake-up instruction initiating voice interaction; in response to the wake-up instruction, entering the working state of voice interaction; detecting first voice information; outputting a feedback result for the first voice information; if second voice information is detected within a preset time period, determining whether the terminal is close to the user's mouth; and if it is determined that the terminal is close to the user's mouth, outputting a feedback result for the second voice information.
In a fourth aspect, the present application provides a voice interaction device. The device includes a processor. The processor is configured to: detect a wake-up instruction initiating voice interaction; in response to the wake-up instruction, enter the working state of voice interaction; detect first voice information; output a feedback result for the first voice information; if second voice information is detected within a preset time period, detect the user's breath; and if the user's breath is detected, output a feedback result for the second voice information.
In an implementable manner, the processor is further configured to: after outputting the feedback result for the first voice information, determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extend the working state of the voice interaction by the preset duration; and if it is determined that the terminal is not close to the user's mouth, end the working state of the voice interaction.
In an implementable manner, the processor is further configured to: determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, detect the user's breath; and if it is determined that the terminal is not close to the user's mouth, end the working state of the voice interaction.
In an implementable manner, the processor is further configured to: recognize the user's gesture in the working state of the voice interaction; if the user's gesture is a first gesture, determine that the terminal is close to the user's mouth, where the first gesture is used to represent that the user is holding the terminal in a stationary state; and if the user's gesture is a second gesture, determine that the terminal is not close to the user's mouth, where the second gesture is used to represent that the user is holding the terminal and moving it away from the user's mouth.
In an implementable manner, if the wake-up instruction is in a manner other than the user's breath, before determining whether the terminal is close to the user's mouth, the processor is further configured to: determine whether a third gesture was recognized before the feedback result for the first voice information was output, where the third gesture is used to represent that the user is holding the terminal and moving it toward the user's mouth; if the third gesture was recognized, determine whether the terminal is still close to the user's mouth after the feedback result for the first voice information is output; and if the third gesture was not recognized, end the working state of the voice interaction.
In an implementable manner, the processor is further configured to: obtain the angular velocity and acceleration at different times in the working state of the voice interaction; and determine the user's gesture using the angular velocity and acceleration at different times together with a gesture recognition module, where the gesture recognition module is used to recognize that the user is holding the terminal and moving it toward the user's mouth, that the user is holding the terminal and moving it away from the user's mouth, or that the user is holding the terminal in a stationary state.
In an implementable manner, the processor is further configured to: input the second voice information into a breath recognition module, where the breath recognition module is used to identify whether the second voice information is a sound emitted by the user's mouth within a preset distance from the terminal; if the breath recognition module recognizes that the second voice information is such a sound, determine that the user's breath is detected; and if the breath recognition module recognizes that the second voice information is not such a sound, determine that the user's breath is not detected.
In an implementable manner, the terminal includes a pressure sensor, and the processor is further configured to: obtain the pressure value corresponding to the pressure sensor when the second voice information is collected; if the pressure value is greater than a preset pressure threshold, determine that the user's breath is detected; and if the pressure value is less than or equal to the preset pressure threshold, determine that the user's breath is not detected.
In an implementable manner, the terminal includes a temperature sensor, and the processor is further configured to: obtain a first temperature and a second temperature, where the first temperature is the temperature corresponding to the temperature sensor before the second voice information is collected, and the second temperature is the temperature corresponding to the temperature sensor when the second voice information is collected; if the second temperature is greater than the first temperature, determine that the user's breath is detected; and if the second temperature is less than or equal to the first temperature, determine that the user's breath is not detected.
In an implementable manner, the terminal includes a humidity sensor, and the processor is further configured to: obtain the humidity corresponding to the humidity sensor when the second voice information is collected; if the humidity is greater than a preset humidity threshold, determine that the user's breath is detected; and if the humidity is less than or equal to the preset humidity threshold, determine that the user's breath is not detected.
In an implementable manner, the terminal includes a carbon dioxide sensor, and the processor is further configured to: obtain the carbon dioxide concentration corresponding to the carbon dioxide sensor when the second voice information is collected; if the carbon dioxide concentration is greater than a preset carbon dioxide concentration threshold, determine that the user's breath is detected; and if the carbon dioxide concentration is less than or equal to the preset carbon dioxide concentration threshold, determine that the user's breath is not detected.
In a fifth aspect, the present application provides a voice interaction device. The device includes a processor. The processor is configured to: detect a wake-up instruction initiating voice interaction; in response to the wake-up instruction, enter the working state of voice interaction; detect first voice information; output a feedback result for the first voice information; determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extend the working state of the voice interaction for a preset time period; and if it is determined that the terminal is not close to the user's mouth, end the working state of the voice interaction.
In a sixth aspect, the present application provides a voice interaction device. The device includes a processor. The processor is configured to: detect a wake-up instruction initiating voice interaction; in response to the wake-up instruction, enter the working state of voice interaction; detect first voice information; output a feedback result for the first voice information; if second voice information is detected within a preset time period, determine whether the terminal is close to the user's mouth; and if it is determined that the terminal is close to the user's mouth, output a feedback result for the second voice information.
In a seventh aspect, the present application provides a terminal. The terminal includes a memory and a processor, the memory being coupled to the processor. The memory is used to store computer program code, the computer program code includes computer instructions, and when the processor executes the computer instructions, the electronic device is caused to execute the method described in any one of the first to third aspects.
In an eighth aspect, the present application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program or instructions; when the computer program or instructions are executed, the method described in any one of the first to third aspects is executed.
In a ninth aspect, the present application provides a computer program product. The computer program product includes a computer program or instructions; when the computer program or instructions are run on a computer, the computer is caused to execute the method described in any one of the first to third aspects.
In summary, the voice interaction method, device, and terminal provided by this application can, by detecting the user's breath and/or determining whether the terminal is close to the user's mouth, identify with a high probability that the user himself has the intention to continue voice interaction, effectively reducing the terminal's erroneous responses to other people or other surrounding noises, and improving the accuracy of voice interaction and the user experience.
Description of drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative effort.
Figure 1 is an application scenario diagram of voice interaction provided by an embodiment of the present application;
Figure 2 is a hardware structure block diagram of the terminal 100 provided by an embodiment of the present application;
Figure 3 is a flow chart of a voice interaction method provided by an embodiment of the present application;
Figure 4 is a flow chart of a first implementation manner, provided by an embodiment of the present application, of determining whether the user has the intention to continue voice interaction;
Figure 5 is a flow chart of a second implementation manner, provided by an embodiment of the present application, of determining whether the user has the intention to continue voice interaction;
Figure 6 is a flow chart of a third implementation manner, provided by an embodiment of the present application, of determining whether the user has the intention to continue voice interaction;
Figure 7 is a flow chart of a fourth implementation manner, provided by an embodiment of the present application, of determining whether the user has the intention to continue voice interaction;
Figure 8 is a flow chart of a fifth implementation manner, provided by an embodiment of the present application, of determining whether the user has the intention to continue voice interaction;
Figure 9 is a schematic structural diagram of a voice interaction device provided by an embodiment of the present application.
具体实施方式Detailed ways
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, rather than all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without making creative efforts fall within the scope of protection of this application.
在对本申请的技术方案进行说明之前,先对本申请的应用场景进行说明。Before describing the technical solution of the present application, the application scenarios of the present application will be described first.
Figure 1 is a diagram of an application scenario of voice interaction according to an embodiment of the present application. As shown in Figure 1, the application scenario includes a terminal 100 and a user 200. The terminal 100 has a voice interaction function, and the user 200 can perform voice interaction with the terminal 100. At present, a specific event is needed to trigger the voice interaction function before the terminal 100 enters the voice interaction working state. Triggering the voice interaction function of the terminal is usually referred to as waking up voice interaction. Voice interaction may be woken up by a wake-up word, by a long press on the power button, by tapping the voice assistant application on the home screen, or in other manners, which are not limited in the present application.
After the voice interaction function is woken up, the user 200 can perform voice interaction with the terminal 100. During the voice interaction between the user 200 and the terminal 100, generally, after the user 200 finishes speaking a piece of voice, the terminal 100 outputs a feedback result corresponding to that voice. For example, after the voice interaction function is woken up, the user 200 says "How is the weather today?". After receiving this voice, the terminal 100 recognizes the voice information and outputs the corresponding feedback; for example, the terminal 100 outputs "The weather is sunny today" through the speaker.
Then, if the user 200 wants to continue the voice interaction with the terminal 100, the user 200 can directly speak the next piece of voice after the terminal 100 has finished feeding back the previous one, thereby realizing a continuous dialogue with the terminal 100.
In one implementation, the terminal 100 realizes the above continuous dialogue function by extending the sound pickup time after each round of voice interaction with the user 200. For example, after the terminal 100 outputs the feedback result corresponding to the first piece of voice, the terminal 100 does not stop picking up sound, but continues to listen for a period of time, for example 10 s. If no voice signal is received within the 10 s, the terminal 100 then stops picking up sound; if a voice signal is received within the 10 s, the terminal 100 continues to output feedback for the received voice information.
However, during the period in which the terminal 100 extends the sound pickup, if the user 200 does not make any sound, that is, the user 200 has no intention to continue the dialogue, while other people are talking nearby or there is other ambient noise, the terminal 100 will still respond to what those people say or to the ambient noise. This troubles and annoys the user 200 and degrades the user experience.
To solve the above technical problem, the present application provides a voice interaction method that can effectively reduce erroneous responses of the terminal 100 to other people or to ambient noise and improve the accuracy of voice interaction. The voice interaction method provided by the present application can be applied to the terminal 100. In the embodiments of the present application, the terminal 100 may be a mobile phone, a remote control, or a smart wearable device such as a watch or a wristband.
The hardware structure of the terminal 100 is introduced below, taking a mobile phone as an example.
Figure 2 is a block diagram of the hardware structure of the terminal 100 according to an embodiment of the present application. As shown in Figure 2, the terminal 100 may include: a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like.
The sensor module 180 may include sensors such as a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, a humidity sensor 180N, and a carbon dioxide sensor 180P.
It can be understood that the structure illustrated in this embodiment does not constitute a specific limitation on the terminal 100. In other embodiments, the terminal 100 may include more or fewer components than shown, some components may be combined or split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and the like. Different processing units may be independent devices or may be integrated in one or more processors.
The controller may be the nerve center and command center of the terminal 100. The controller can generate operation control signals based on instruction operation codes and timing signals, to control instruction fetching and instruction execution.
A memory may further be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, they can be called directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving system efficiency. In some embodiments, the processor 110 may include one or more interfaces.
It can be understood that the interface connection relationships between the modules illustrated in this embodiment are merely schematic and do not constitute a structural limitation on the terminal 100. In other embodiments, the terminal 100 may use interface connection manners different from those in the above embodiment, or a combination of multiple interface connection manners.
The charging management module 140 is configured to receive charging input from a charger. The charger may be a wireless charger or a wired charger. While charging the battery 142, the charging management module 140 can also supply power to the terminal through the power management module 141.
The power management module 141 is configured to connect the battery 142 and the charging management module 140 to the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, the wireless communication module 160, and the like.
The wireless communication function of the terminal 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antenna 1 and the antenna 2 are configured to transmit and receive electromagnetic wave signals. Each antenna in the terminal 100 can be used to cover one or more communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization. For example, the antenna 1 can be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antennas may be used in combination with tuning switches.
The mobile communication module 150 can provide solutions for wireless communication including 2G/3G/4G/5G applied to the terminal 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 can also amplify a signal modulated by the modem processor and convert it into electromagnetic waves for radiation through the antenna 1.
The wireless communication module 160 can provide solutions for wireless communication applied to the terminal 100, including wireless local area networks (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
In some embodiments, the antenna 1 of the terminal 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the terminal 100 can communicate with networks and other devices through wireless communication technologies.
The terminal 100 implements the display function through the GPU, the display screen 194, the application processor, and the like. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is configured to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is configured to display images, videos, and the like. The display screen 194 includes a display panel. For example, the display screen 194 may be a touchscreen.
The terminal 100 can implement the photographing function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the terminal 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example, saving files such as music and videos in the external memory card.
The internal memory 121 can be used to store computer-executable program code, where the executable program code includes instructions. The processor 110 executes the instructions stored in the internal memory 121 to perform the various functional applications and data processing of the terminal 100. For example, in the embodiments of the present application, the processor 110 may perform the voice interaction method by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The program storage area can store the operating system and the applications required by at least one function (such as a sound playback function and an image playback function). The data storage area can store data created during the use of the terminal 100 (such as audio data and a phone book).
The terminal 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and the like. For example, the user's voice information can be collected through the microphone 170C, and the feedback result for the user's voice information can be played through the speaker 170A.
The touch sensor is also called a "touch panel". The touch sensor may be disposed on the display screen 194; the touch sensor and the display screen 194 form a touchscreen, also called a "touch screen". The touch sensor is configured to detect touch operations performed on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation can be provided through the display screen 194. In other embodiments, the touch sensor may also be disposed on the surface of the terminal 100 at a position different from that of the display screen 194.
In the embodiments of the present application, the terminal 100 can detect, through the touch sensor, a touch operation input by the user on the touchscreen, and collect one or more of the touch position of the touch operation on the touchscreen, the touch time, and the like. In some embodiments, the terminal 100 can determine the touch position of the touch operation on the touchscreen by combining the touch sensor 180K and the pressure sensor 180A.
The button 190 includes a power button, a volume button, and the like. The button 190 may be a mechanical button or a touch button. The terminal 100 can receive button input and generate button signal input related to user settings and function control of the terminal 100. For example, the voice interaction function can be woken up by a long press on the power button.
The motor 191 can generate vibration prompts. The motor 191 can be used for incoming-call vibration prompts and for touch vibration feedback. For example, touch operations applied to different applications (such as photographing and audio playback) can correspond to different vibration feedback effects. The motor 191 can also provide different vibration feedback effects for touch operations applied to different areas of the display screen 194. Different application scenarios (such as time reminders, receiving messages, alarm clocks, and games) can also correspond to different vibration feedback effects. The touch vibration feedback effect can further be customized.
The indicator 192 may be an indicator light, which can be used to indicate the charging status and battery level changes, and can also be used to indicate messages, missed calls, notifications, and the like. The SIM card interface 195 is used to connect a SIM card. The SIM card can be brought into contact with or separated from the terminal 100 by being inserted into or pulled out of the SIM card interface 195. The terminal 100 can support one or N SIM card interfaces, where N is a positive integer greater than 1. The SIM card interface 195 can support a Nano SIM card, a Micro SIM card, a SIM card, and the like.
The gyroscope sensor 180B may be a three-axis gyroscope, used to track state changes of the terminal 100 in six directions. The acceleration sensor 180E is used to detect the movement speed, direction, and displacement of the terminal 100. In the embodiments of the present application, the terminal 100 can detect its own state and position through the gyroscope sensor 180B and the acceleration sensor 180E, and can determine, based on the state and position of the terminal 100 at different moments, the gesture with which the user holds the terminal 100, for example, the user moving the hand-held terminal 100 toward the user's mouth, or the user moving the hand-held terminal 100 away from the user's mouth.
The methods in the following embodiments can all be implemented in the terminal 100 having the above hardware structure.
The voice interaction method provided by the embodiments of the present application is described below by way of example.
Figure 3 is a flowchart of a voice interaction method according to an embodiment of the present application. As shown in Figure 3, the method may include the following steps:
Step S1: detect a wake-up indication for initiating voice interaction.
The wake-up indication is used to wake the terminal 100 into the voice interaction working state. The wake-up indication may be a specific wake-up word input by the user to the terminal 100, an operation in which the user presses and holds the power button, an operation in which the user taps the voice assistant application on the home screen, or the like.
An embodiment of the present application further provides a breath wake-up manner. Breath wake-up means that the user wakes the terminal 100 into the voice interaction working state by generating breath (for example, by speaking or blowing) with the mouth facing the terminal 100 and within a preset distance range from the terminal 100. In this way, the user can put the terminal 100 to the mouth and speak or blow directly at it to wake the terminal 100 into the voice interaction working state, without using a specific wake-up word or pressing a button. Correspondingly, when the terminal 100 detects the user's breath, it enters the voice interaction working state.
In one implementation, the user's breath can be detected as follows: the microphone 170C collects voice information, and if voice information is collected, a breath recognition module can be used to determine whether the collected voice information is words spoken or air blown by the user with the mouth facing the terminal 100 and within the preset distance range from the terminal 100. The breath recognition module may be a neural network trained to recognize breath.
For example, when the user speaks at different distances from the microphone 170C, different airflows are formed at the microphone 170C. When the user speaks close to the microphone 170C, consonants in the speech such as "b, c, d, f, j, k, l, p, q, r, s, t, v, w, x, y, z" cause plosive pops at the microphone 170C. Thus, the breath recognition module can be obtained by training on and learning the characteristics of the pops produced when the user speaks into the microphone 170C. The breath recognition module is a trained neural network that can determine, for input voice information, whether that voice information is sound input close to the microphone 170C. For example, a well-trained neural network can accurately detect a human voice within 5 cm of the microphone. Thus, when the breath recognition module recognizes that the input voice information is a human voice within 5 cm of the microphone, it is determined that the user's breath is detected, and the terminal 100 is woken into the voice interaction working state.
It should be noted that when the user blows at the terminal 100, the microphone 170C can also collect the sound; in the present application, the sound generated by blowing is also referred to as voice information.
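For illustration only, the following is a minimal sketch of such a breath-recognition check. It is not the patented implementation: `model` stands in for the trained neural network described above, `model.predict` is an assumed inference interface, and the frame layout and thresholds are assumptions.

```python
import numpy as np

FRAME_LEN = 400            # assumed front end: 25 ms frames at 16 kHz
NEAR_MOUTH_PROB = 0.8      # assumed per-frame decision threshold

def is_user_breath(frames: np.ndarray, model) -> bool:
    """Return True if the captured audio looks like close-range speech/blowing.

    frames: (num_frames, FRAME_LEN) float32 PCM frames from the microphone.
    model:  trained classifier mapping frames to per-frame probabilities that
            the sound source is within the preset distance of the microphone
            (e.g. learned from plosive "pop" artifacts).
    """
    probs = model.predict(frames)  # hypothetical inference call
    # Declare "user breath" if enough frames score high; averaging makes the
    # decision robust to a few silent or noisy frames in the utterance.
    return float(np.mean(probs > NEAR_MOUTH_PROB)) > 0.5
```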
In one implementation, the user's breath can also be detected as follows: if the microphone 170C collects voice information, the pressure value collected by the pressure sensor 180A at the time the voice information is collected is obtained. If the pressure value is greater than a preset pressure threshold, it is determined that the user's breath is detected.
The airflow generated when the user speaks or blows with the mouth facing the terminal and within the preset distance range from the terminal 100 exerts a certain pressure on the terminal 100. Thus, in the embodiments of the present application, the pressure sensor 180A can be used to detect the pressure exerted on the terminal 100 when the user speaks. If the pressure value is greater than the preset pressure threshold, it indicates that the user is speaking or blowing with the mouth facing the terminal 100 and within the preset distance range from the terminal 100, so it can be determined that the user's breath is detected. Conversely, if the pressure value is less than or equal to the preset pressure threshold, it indicates that the user is not speaking or blowing within the preset distance range from the terminal 100, so it can be determined that the user's breath is not detected.
It should be noted that in the embodiments of the present application, the parameters of the pressure sensor 180A must meet the accuracy requirements of breath detection. For example, the airflow generated when the user speaks or blows with the mouth facing the terminal 100 and within the preset distance range exerts a pressure of 0.07 MPa on the terminal 100, and the pressure sensor 180A has a measurement range of 0 to 0.3 MPa and a measurement accuracy of 0.001 MPa.
It should also be noted that, to improve the detection accuracy of the pressure sensor 180A, the pressure sensor 180A may be disposed near the microphone 170C. In this way, when the user speaks close to the microphone 170C, the pressure sensor 180A near the microphone 170C can detect the pressure exerted on it by the airflow generated by speaking.
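As an illustrative sketch of this variant, the check reduces to a single threshold comparison. The text gives 0.07 MPa as the pressure that close-range breath exerts but does not fix the threshold itself; the value below is therefore an assumption chosen under that figure.

```python
# Assumed threshold, picked below the ~0.07 MPa example figure from the text.
PRESET_PRESSURE_THRESHOLD_MPA = 0.05

def breath_detected_by_pressure(pressure_sample_mpa: float) -> bool:
    """Pressure sampled (by the sensor next to the microphone) while the
    voice information was captured; values at or below the threshold are
    treated as ambient sound rather than user breath."""
    return pressure_sample_mpa > PRESET_PRESSURE_THRESHOLD_MPA
```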
In one implementation, the user's breath can also be detected as follows: if the microphone 170C collects voice information, a first temperature and a second temperature are obtained, where the first temperature is the temperature collected by the temperature sensor 180J before the microphone 170C collects the voice information, and the second temperature is the temperature collected by the temperature sensor 180J while the microphone 170C collects the voice information. If the second temperature is greater than the first temperature, it is determined that the user's breath is detected; if the second temperature is less than or equal to the first temperature, it is determined that the user's breath is not detected.
In one implementation, the user's breath can also be detected as follows: if the microphone 170C collects voice information, the humidity collected by the humidity sensor 180N at the time the voice information is collected is obtained. If the humidity is greater than a preset humidity threshold, it is determined that the user's breath is detected; if the humidity is less than or equal to the preset humidity threshold, it is determined that the user's breath is not detected.
In one implementation, the user's breath can also be detected as follows: if the microphone 170C collects voice information, the carbon dioxide concentration collected by the carbon dioxide sensor 180P at the time the voice information is collected is obtained. If the carbon dioxide concentration is greater than a preset carbon dioxide concentration threshold, it is determined that the user's breath is detected; if the carbon dioxide concentration is less than or equal to the preset carbon dioxide concentration threshold, it is determined that the user's breath is not detected.
When the user speaks or blows with the mouth facing the terminal 100 and within the preset distance range from the terminal 100, the temperature, humidity, and carbon dioxide concentration near the terminal 100 change to a certain extent. Therefore, the embodiments of the present application can determine whether the user's breath is detected based on the data collected by the temperature sensor 180J, the humidity sensor 180N, or the carbon dioxide sensor 180P.
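The three environmental checks can be sketched directly from the rules above. The temperature rule is fully specified by the text (second reading must exceed the first); the humidity and carbon dioxide threshold values shown here are assumptions, since the text only says "preset threshold".

```python
def breath_by_temperature(t_before_c: float, t_during_c: float) -> bool:
    # Exhaled air is warmer than ambient air, so the temperature sampled
    # while the voice information is captured must exceed the earlier sample.
    return t_during_c > t_before_c

def breath_by_humidity(humidity_pct: float, threshold_pct: float = 70.0) -> bool:
    # Threshold value is an assumption; exhaled air is more humid than ambient.
    return humidity_pct > threshold_pct

def breath_by_co2(co2_ppm: float, threshold_ppm: float = 1500.0) -> bool:
    # Threshold value is an assumption; exhaled air is CO2-rich.
    return co2_ppm > threshold_ppm
```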
It should be noted that the above embodiments merely give examples of implementations of detecting the user's breath and do not limit the specific implementation. For example, the implementations listed above may also be used in combination: the "breath recognition module" may be combined with the "pressure sensor", the "breath recognition module" with the "temperature sensor", the "breath recognition module" with the "humidity sensor", or the "breath recognition module" with the "carbon dioxide sensor".
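One possible combination policy, reusing the helpers sketched above, is to require both signals to agree before declaring user breath. Combining by logical AND is an assumption; the text leaves the fusion policy open.

```python
def breath_detected(frames, model, pressure_sample_mpa: float) -> bool:
    # Example fusion of "breath recognition module" + "pressure sensor":
    # both the acoustic classifier and the pressure check must fire.
    return (is_user_breath(frames, model)
            and breath_detected_by_pressure(pressure_sample_mpa))
```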
It should also be noted that when another application occupies the microphone, the voice interaction method provided by the embodiments of the present application is unavailable. For example, when the user is making a call with the terminal 100, even if the user generates breath with the mouth facing the terminal 100 and within the preset distance range from the terminal 100, the terminal 100 is not woken into the voice interaction working state.
Step S2: enter the voice interaction working state in response to the wake-up indication.
After entering the voice interaction working state, the terminal 100 continues to pick up sound so as to obtain the user's voice information.
Step S3: detect first voice information.
Step S4: output a feedback result for the first voice information.
In the embodiments of the present application, the feedback result for the first voice information may be voice, text, an image, entering an application, or the like, which is not limited in the present application.
For example, after the voice interaction working state is entered, the user says a sentence such as "How is the weather today?", which is then detected by the terminal 100 as the first voice information. The terminal 100 then outputs the feedback result for the first voice information: for example, the terminal 100 outputs the voice "The weather is sunny today" through the speaker 170A, or the terminal 100 displays the text "The weather is sunny today" on the display screen 194.
For another example, after the voice interaction working state is entered, the user says a sentence such as "Call Zhang San", which is then detected by the terminal 100 as the first voice information. The terminal 100 then outputs the feedback result for the first voice information: for example, the terminal 100 enters the voice call application and dials Zhang San's phone number.
Step S5: determine whether the user has the intention to continue the voice interaction.
If the user has the intention to continue the voice interaction, the voice interaction working state is maintained; if the user has no intention to continue the voice interaction, the voice interaction working state is ended.
It should be noted that in the voice interaction working state, the terminal 100 can pick up sound continuously; after the voice interaction working state ends, the terminal stops picking up sound.
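Purely as an illustration, steps S1 to S5 can be read as the following loop. The callables are placeholders for the mechanisms described in this document (wake-up detection, sound pickup, feedback output, and the step-S5 intention check), not a definitive implementation.

```python
from typing import Callable, Optional

def voice_interaction_session(
    detect_wakeup: Callable[[], None],
    listen: Callable[[], Optional[str]],
    respond: Callable[[str], None],
    continue_intended: Callable[[], bool],
) -> None:
    detect_wakeup()                        # S1: wake-up indication detected
    interacting = True                     # S2: enter the voice interaction state
    while interacting:
        utterance = listen()               # S3: detect voice information
        if utterance is not None:
            respond(utterance)             # S4: output the feedback result
        interacting = continue_intended()  # S5: keep or end the working state
    # Leaving the loop corresponds to ending the state; sound pickup stops.
```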
The embodiments of the present application provide the following implementations of determining whether the user has the intention to continue the voice interaction.
Figure 4 is a flowchart of a first implementation of determining whether the user has the intention to continue voice interaction according to an embodiment of the present application.
As shown in Figure 4, the first implementation of determining whether the user has the intention to continue the voice interaction may include the following steps:
Step S51: determine whether the terminal 100 is close to the user's mouth.
In the embodiments of the present application, determining whether the terminal 100 is close to the user's mouth means determining whether the terminal 100 is at the user's mouth.
If the voice interaction working state was woken up by breath, the terminal 100 was at the user's mouth when it was woken up. Therefore, after the feedback result for the first voice information is output, whether the terminal 100 is still at the user's mouth can be determined by judging whether, during the period from step S1 to step S4, the user moved the hand-held terminal 100 away from the user's mouth. If so, the terminal 100 is considered no longer at the user's mouth; in this case it can be considered that the user has no intention to continue the voice interaction, and the voice interaction working state can be ended. If not, the terminal 100 is considered still at the user's mouth; in this case it is considered that the user may have the intention to continue the voice interaction, and the subsequent steps can continue to be performed.
If the voice interaction working state was not woken up by breath, the terminal 100 was not at the user's mouth when it was woken up. In this case, after the voice interaction working state is entered, the present application can first determine whether, before the feedback result for the first voice information is output, the user moved the hand-held terminal 100 toward the user's mouth. If it is determined that the user did move the hand-held terminal 100 toward the user's mouth before the feedback result for the first voice information was output, it is then determined whether the terminal 100 is still at the user's mouth after the feedback result for the first voice information is output (specifically, whether the user moved the hand-held terminal 100 away from the user's mouth after the feedback result for the first voice information was output). If it is determined that, before the feedback result for the first voice information was output, the user did not move the hand-held terminal 100 toward the user's mouth, it can be considered that the user has no intention to continue the voice interaction, and the voice interaction working state can be ended.
In one implementation, the gyroscope sensor 180B and the acceleration sensor 180E on the terminal 100 can be used to collect the angular velocity and acceleration of the terminal 100; the collected angular velocity and acceleration are then used to determine the user's gesture. The user's gesture may include a first gesture, a second gesture, and a third gesture, where the first gesture indicates that the user is holding the terminal 100 still, the second gesture indicates that the user is moving the hand-held terminal 100 away from the user's mouth, and the third gesture indicates that the user is moving the hand-held terminal 100 toward the user's mouth.
For example, the gyroscope sensor 180B and the acceleration sensor 180E can be used to collect the angular velocity and acceleration during the period from step S1 to step S4. The collected angular velocity and acceleration are then input to a gesture recognition module, which may be a neural network trained for gesture recognition. After processing by the gesture recognition module, the user's gesture is output. The gesture recognition module can determine the user's hand-held gesture based on the angular velocity and acceleration of the terminal 100 at different moments.
It should be noted that in the embodiments of the present application, if the change in the user's gesture is small and within a preset change range, the user's gesture is considered still. For example, if the user's hand-held terminal 100 changes from 5 cm away from the mouth to 4 cm away from the mouth, the user's gesture is considered the first gesture.
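A hedged sketch of the gesture check follows. The real module is a trained network; this stub only fixes an interface and the three gesture labels named above, and `model.infer` is an assumed inference call.

```python
from enum import Enum

class Gesture(Enum):
    HOLDING_STILL = 1   # first gesture: terminal held steady (small excursions
                        # within the preset change range still count as still)
    MOVING_AWAY = 2     # second gesture: terminal moved away from the mouth
    MOVING_CLOSER = 3   # third gesture: terminal moved toward the mouth

def classify_gesture(angular_velocity_trace, acceleration_trace, model) -> Gesture:
    """Feed gyroscope + accelerometer traces collected between steps S1 and S4
    to the trained gesture-recognition network and return its label."""
    return model.infer(angular_velocity_trace, acceleration_trace)
```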
Step S52: if it is determined that the terminal 100 is close to the user's mouth, extend the voice interaction working state by a preset duration.
If the terminal 100 is still at the user's mouth after the feedback result for the first voice information is output, it is considered that the user may have the intention to continue the voice interaction. In this case, the present application extends the voice interaction working state by the preset duration. During the extended preset duration, the terminal 100 keeps picking up sound.
The present application does not limit the preset duration; for example, it may be 5 s, 10 s, or 20 s.
Step S53: determine whether second voice information is detected within the preset duration.
Step S54: if second voice information is detected within the preset duration, determine whether the user's breath is detected.
If no second voice information is detected within the preset duration, the voice interaction working state is ended. If second voice information is detected within the preset duration, the second voice information detected by the terminal 100 may be what the user said, or it may be other people talking nearby. Therefore, the present application further detects the user's breath to determine whether the second voice information is words the user spoke to the terminal 100.
It should be noted here that in the embodiments of the present application, in the voice interaction working state, the user's mouth needs to be close to the terminal 100 for voice interaction with the terminal 100. Therefore, if the second voice information is words spoken by the user, the breath produced by the user while speaking can be detected by the terminal. That is, the present application can determine, based on whether the user's breath can be detected, whether the second voice information is words spoken by the user or words spoken by other people nearby.
For the manner of detecting the user's breath, reference may be made to the description of step S1 above, which is not repeated here. For example, the breath recognition module, the pressure sensor 180A, the temperature sensor 180J, the humidity sensor 180N, or the carbon dioxide sensor 180P can be used to detect the user's breath.
Step S55: if the user's breath is detected, output a feedback result for the second voice information.
If the user's breath is detected, the second voice information is words spoken by the user with the mouth facing the terminal 100 and within the preset distance range from the terminal 100. In this case, it is considered that the user has the intention to continue the voice interaction, and the feedback result for the second voice information is output. If the user's breath is not detected, the second voice information is words spoken by other people nearby, not words spoken by the user. In this case, it is considered that the user has no intention to continue the voice interaction, and the voice interaction working state can be ended.
To sum up, in the first implementation of determining whether the user has the intention to continue the voice interaction provided by the embodiments of the present application, after the feedback result for the first voice information is output, it is first determined whether the terminal 100 is at the user's mouth. If the terminal 100 is not at the user's mouth, the voice interaction working state is ended; if it is determined that the terminal 100 is at the user's mouth, the voice interaction working state is extended by the preset duration. Then, if no second voice information is detected within the preset duration, the voice interaction working state is ended; if second voice information is detected within the preset duration, the user's breath is detected. If the user's breath is not detected, the voice interaction working state is ended; if the user's breath is detected, the feedback result for the second voice information is output. That is, in the first implementation, if the terminal 100 is at the user's mouth and the user's breath can be detected, it is determined that the user has the intention to continue the voice interaction. In this way, the voice interaction method provided by the embodiments of the present application can recognize with high probability that it is the user who intends to continue the voice interaction, effectively reducing erroneous responses of the terminal 100 to other people or to ambient noise and improving the accuracy of voice interaction and the user experience.
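For illustration, the first implementation condenses to the following decision chain. The second to fifth implementations described below permute or drop the same predicates (terminal near mouth, second voice detected, breath detected), so only this ordering is sketched; the callables and the window value are placeholders.

```python
from typing import Callable, Optional

PRESET_WINDOW_S = 10.0   # example value; the text allows 5 s, 10 s, 20 s, etc.

def continue_intent_first_impl(
    terminal_near_mouth: Callable[[], bool],
    listen_for: Callable[[float], Optional[str]],
    breath_detected: Callable[[], bool],
    respond: Callable[[str], None],
) -> bool:
    """Return True to keep the voice interaction state, False to end it."""
    if not terminal_near_mouth():         # S51: still at the user's mouth?
        return False
    second = listen_for(PRESET_WINDOW_S)  # S52/S53: extended sound pickup
    if second is None:                    # no second voice information
        return False
    if not breath_detected():             # S54: filter out bystander speech
        return False
    respond(second)                       # S55: feedback for the second voice
    return True
```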
Figure 5 is a flowchart of a second implementation of determining whether the user has the intention to continue voice interaction according to an embodiment of the present application.
As shown in Figure 5, the second implementation of determining whether the user has the intention to continue the voice interaction may include the following steps:
Step S61: after the feedback result for the first voice information is output, extend the voice interaction working state by a preset duration.
Step S62: determine whether second voice information is detected within the preset duration.
Step S63: if second voice information is detected within the preset duration, determine whether the terminal 100 is close to the user's mouth.
Step S64: if it is determined that the terminal 100 is close to the user's mouth, determine whether the user's breath is detected.
Step S65: if the user's breath is detected, output a feedback result for the second voice information.
To sum up, in the second implementation, after the feedback result for the first voice information is output, the voice interaction working state is directly extended by the preset duration. If no second voice information is detected within the preset duration, the voice interaction working state is ended. If second voice information is detected within the preset duration, it is first determined whether the terminal 100 is at the user's mouth; if the terminal 100 is not at the user's mouth, the voice interaction working state is ended. If it is determined that the terminal 100 is at the user's mouth, the user's breath is then detected. If the user's breath is not detected, the voice interaction working state is ended; if the user's breath is detected, the feedback result for the second voice information is output.
It should be noted that for the specific implementation of determining whether the terminal 100 is at the user's mouth in step S63, reference may be made to the description of step S51; for the specific implementation of detecting the user's breath in step S64, reference may be made to the description of step S54; and for the specific implementation of step S65, reference may be made to the description of step S55, which are not repeated here.
Figure 6 is a flowchart of a third implementation of determining whether the user has the intention to continue voice interaction according to an embodiment of the present application.
As shown in Figure 6, the third implementation of determining whether the user has the intention to continue the voice interaction may include the following steps:
Step S71: after the feedback result for the first voice information is output, extend the voice interaction working state by a preset duration.
Step S72: determine whether second voice information is detected within the preset duration.
Step S73: if second voice information is detected within the preset duration, determine whether the user's breath is detected.
Step S74: if the user's breath is detected, output a feedback result for the second voice information.
To sum up, in the third implementation, after the feedback result for the first voice information is output, the voice interaction working state is directly extended by the preset duration. If no second voice information is detected within the preset duration, the voice interaction working state is ended. If second voice information is detected within the preset duration, the user's breath is detected. If the user's breath is not detected, the voice interaction working state is ended. If the user's breath is detected, the feedback result for the second voice information is output.
It should be noted that for the specific implementation of detecting the user's breath in step S73, reference may be made to the description of step S54, and for the specific implementation of step S74, reference may be made to the description of step S55, which are not repeated here.
Figure 7 is a flowchart of a fourth implementation of determining whether the user has the intention to continue voice interaction according to an embodiment of the present application.
As shown in Figure 7, the fourth implementation of determining whether the user has the intention to continue the voice interaction may include the following steps:
Step S81: after the feedback result for the first voice information is output, extend the voice interaction working state by a preset duration.
Step S82: determine whether second voice information is detected within the preset duration.
Step S83: if second voice information is detected within the preset duration, determine whether the terminal 100 is close to the user's mouth.
Step S84: if it is determined that the terminal 100 is close to the user's mouth, output a feedback result for the second voice information.
To sum up, in the fourth implementation, after the feedback result for the first voice information is output, the voice interaction working state is directly extended by the preset duration. If no second voice information is detected within the preset duration, the voice interaction working state is ended. If second voice information is detected within the preset duration, it is determined whether the terminal 100 is at the user's mouth. If the terminal 100 is not at the user's mouth, the voice interaction working state is ended. If it is determined that the terminal 100 is at the user's mouth, the feedback result for the second voice information is output.
It should be noted that for the specific implementation of determining whether the terminal 100 is at the user's mouth in step S83, reference may be made to the description of step S51, and for the specific implementation of step S84, reference may be made to the description of step S55, which are not repeated here.
Figure 8 is a flowchart of a fifth implementation of determining whether the user has the intention to continue voice interaction according to an embodiment of the present application.
As shown in Figure 8, the fifth implementation of determining whether the user has the intention to continue the voice interaction may include the following steps:
Step S91: determine whether the terminal 100 is close to the user's mouth.
Step S92: if it is determined that the terminal 100 is close to the user's mouth, extend the voice interaction working state by a preset duration.
Step S93: determine whether second voice information is detected within the preset duration.
Step S94: if second voice information is detected within the preset duration, output a feedback result for the second voice information.
To sum up, in the fifth implementation, it is first determined whether the terminal 100 is at the user's mouth. If it is determined that the terminal 100 is at the user's mouth, the voice interaction working state is extended by the preset duration; if it is determined that the terminal 100 is not at the user's mouth, the voice interaction working state is ended. In this way, the power consumption of the terminal 100 can be reduced. Further, if second voice information is detected within the preset duration, the feedback result for the second voice information is output; if no second voice information is detected within the preset duration, the voice interaction working state is ended. In the fifth implementation, after the feedback result for the first voice information is output, if the terminal 100 is still at the user's mouth, it is considered that the user has the intention to continue the voice interaction, so the sound pickup time can be extended.
Further, to improve the recognition of the user's intention to continue the voice interaction, after second voice information is detected within the preset duration, the user's breath may first be detected, and the feedback result for the second voice information is output only if the user's breath is detected. For details, reference may be made to the first implementation above, which is not repeated here.
To sum up, the voice interaction method provided by the embodiments of the present application can recognize with high probability that it is the user who intends to continue the voice interaction, effectively reducing erroneous responses of the terminal 100 to other people or to ambient noise and improving the accuracy of voice interaction and the user experience.
The method embodiments described herein may each be an independent solution or may be combined according to their internal logic; all such solutions fall within the protection scope of the present application.
It can be understood that in the above method embodiments, the methods and operations implemented by the electronic device may also be implemented by a component (such as a chip or a circuit) usable in the electronic device.
The above embodiments introduce the voice interaction method provided by the present application. It can be understood that, to implement the above functions, the terminal includes corresponding hardware structures and/or software modules for performing each function. A person skilled in the art should readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present application can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
The method provided by the embodiments of the present application has been described in detail above. The apparatus provided by the embodiments of the present application is described in detail below with reference to Figure 9. It should be understood that the descriptions of the apparatus embodiments correspond to the descriptions of the method embodiments; therefore, for content not described in detail, reference may be made to the method embodiments above, and for brevity, details are not repeated here.
FIG. 9 is a schematic structural diagram of a voice interaction apparatus provided by an embodiment of this application. In one embodiment, the terminal can implement the corresponding functions through the hardware apparatus shown in FIG. 9. As shown in FIG. 9, the apparatus 1000 may include a processor 1001 and a memory 1002. The processor 1001 may include one or more processing units; for example, it may include an application processor, a modem processor, a graphics processor, an image signal processor, a controller, a video codec, a digital signal processor, a baseband processor, and/or a neural-network processor. Different processing units may be independent devices or may be integrated into one or more processors. The memory 1002 is coupled to the processor 1001 and is used to store various software programs and/or sets of instructions; it may include volatile and/or non-volatile memory.
The apparatus 1000 can perform the operations performed in the above method embodiments.
For example, in an optional embodiment of this application, the processor 1001 may be configured to: detect a wake-up indication that initiates voice interaction; enter the voice-interaction working state in response to the wake-up indication; detect first voice information; output a feedback result for the first voice information; if second voice information is detected within a preset duration, detect the user's breath; and if the user's breath is detected, output a feedback result for the second voice information.
In one implementable manner, the processor is further configured to: after outputting the feedback result for the first voice information, determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extend the voice-interaction working state by the preset duration; and if it is determined that the terminal is not close to the user's mouth, end the voice-interaction working state.
In one implementable manner, the processor is further configured to: determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, detect the user's breath; and if it is determined that the terminal is not close to the user's mouth, end the voice-interaction working state.
In one implementable manner, the processor is further configured to recognize the user's gesture in the voice-interaction working state. If the user's gesture is a first gesture, it is determined that the terminal is close to the user's mouth, the first gesture indicating that the user is holding the terminal in a stationary state; if the user's gesture is a second gesture, it is determined that the terminal is not close to the user's mouth, the second gesture indicating that the user is holding the terminal and moving it away from the user's mouth.
In one implementable manner, if the wake-up indication is given in a manner other than the user's breath, then before determining whether the terminal is close to the user's mouth, the processor is further configured to determine whether a third gesture was recognized before the feedback result for the first voice information was output, the third gesture indicating that the user is holding the terminal and moving it toward the user's mouth. If the third gesture was recognized, it is determined whether the terminal is still close to the user's mouth after the feedback result for the first voice information is output; if the third gesture was not recognized, the voice-interaction working state is ended.
In one implementable manner, the processor is further configured to: obtain angular velocities and accelerations at different moments in the voice-interaction working state; and determine the user's gesture using the angular velocities and accelerations at the different moments together with a gesture recognition module, where the gesture recognition module is used to recognize whether the user is holding the terminal and moving it toward the user's mouth, holding the terminal and moving it away from the user's mouth, or holding the terminal in a stationary state.
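As a rough, assumed illustration of how a gesture recognition module might consume those angular-velocity and acceleration samples, the threshold heuristic below classifies the three hand states named above. The threshold value and the vertical-axis heuristic are invented for this sketch; a real module might instead be a trained classifier and would typically fuse the gyroscope data as well.

```python
# Hypothetical labels for the three hand states described above.
TOWARD_MOUTH, AWAY_FROM_MOUTH, STATIONARY = "toward", "away", "stationary"

MOTION_THRESHOLD = 0.5  # assumed mean-|acceleration| threshold (m/s^2), illustrative

def classify_gesture(samples):
    """samples: (gyro_xyz, accel_xyz) tuples captured at different moments in the
    voice-interaction working state; accel_xyz is assumed to be gravity-compensated
    linear acceleration, with index 2 along the raise/lower axis."""
    vertical = [accel[2] for _gyro, accel in samples]
    motion_energy = sum(abs(v) for v in vertical) / len(vertical)
    if motion_energy < MOTION_THRESHOLD:
        return STATIONARY        # first gesture: terminal held still near the mouth
    # Net velocity change along the vertical axis as a crude direction cue.
    if sum(vertical) > 0:
        return TOWARD_MOUTH      # third gesture: raising the terminal to the mouth
    return AWAY_FROM_MOUTH       # second gesture: moving the terminal away
```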
In one implementable manner, the processor is further configured to input the second voice information into a breath recognition module, which identifies whether the second voice information is a sound emitted with the user's mouth within a preset distance of the terminal. If the breath recognition module identifies the second voice information as such a sound, it is determined that the user's breath is detected; if not, it is determined that the user's breath is not detected.
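The document does not disclose the internals of the breath recognition module, but sound produced with the mouth within a short distance of the microphone typically carries a strong low-frequency airflow component that distant speech lacks. One plausible realization, sketched here with NumPy under that assumption, is a low-band energy-ratio test; BREATH_BAND_HZ and BREATH_RATIO_THRESHOLD are illustrative constants.

```python
import numpy as np

BREATH_BAND_HZ = 100.0        # assumed upper edge of the breath/airflow band
BREATH_RATIO_THRESHOLD = 0.3  # assumed minimum share of energy below that band

def breath_detected(audio: np.ndarray, sample_rate: int) -> bool:
    """Return True if the second voice information looks like sound emitted with
    the user's mouth within the preset distance of the terminal."""
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sample_rate)
    low_band_energy = spectrum[freqs < BREATH_BAND_HZ].sum()
    total_energy = spectrum.sum() + 1e-12  # guard against division by zero
    return (low_band_energy / total_energy) > BREATH_RATIO_THRESHOLD
```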
In one implementable manner, the terminal includes a pressure sensor, and the processor is further configured to obtain the pressure value of the pressure sensor when the second voice information is collected. If the pressure value is greater than a preset pressure threshold, it is determined that the user's breath is detected; if the pressure value is less than or equal to the preset pressure threshold, it is determined that the user's breath is not detected.
In one implementable manner, the terminal includes a temperature sensor, and the processor is further configured to obtain a first temperature and a second temperature, where the first temperature is the temperature of the temperature sensor before the second voice information is collected and the second temperature is the temperature of the temperature sensor when the second voice information is collected. If the second temperature is greater than the first temperature, it is determined that the user's breath is detected; if the second temperature is less than or equal to the first temperature, it is determined that the user's breath is not detected.
In one implementable manner, the terminal includes a humidity sensor, and the processor is further configured to obtain the humidity of the humidity sensor when the second voice information is collected. If the humidity is greater than a preset humidity threshold, it is determined that the user's breath is detected; if the humidity is less than or equal to the preset humidity threshold, it is determined that the user's breath is not detected.
In one implementable manner, the terminal includes a carbon dioxide sensor, and the processor is further configured to obtain the carbon dioxide concentration of the carbon dioxide sensor when the second voice information is collected. If the carbon dioxide concentration is greater than a preset carbon dioxide concentration threshold, it is determined that the user's breath is detected; if the carbon dioxide concentration is less than or equal to the preset carbon dioxide concentration threshold, it is determined that the user's breath is not detected.
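The four sensor-based variants above (pressure, temperature, humidity, carbon dioxide) all follow one pattern: read the sensor when the second voice information is collected and compare the reading against a baseline or a preset threshold. The sketch below OR-combines them purely for illustration; every threshold value is an assumption, since the document leaves the concrete values unspecified.

```python
from dataclasses import dataclass

# All threshold values below are assumptions for illustration only.
PRESSURE_THRESHOLD_PA = 5.0    # pressure rise from exhaled airflow on the sensor
HUMIDITY_THRESHOLD_PCT = 60.0  # exhaled air is more humid than ambient air
CO2_THRESHOLD_PPM = 1500.0     # exhaled air is rich in carbon dioxide

@dataclass
class SensorReadings:
    pressure_pa: float    # pressure value when the second voice information is collected
    temp_before_c: float  # first temperature: before the second voice information
    temp_during_c: float  # second temperature: while it is being collected
    humidity_pct: float
    co2_ppm: float

def breath_detected_by_sensors(r: SensorReadings) -> bool:
    """Each comparison corresponds to one of the variants above; any single one
    suffices on its own in the document."""
    return (
        r.pressure_pa > PRESSURE_THRESHOLD_PA
        or r.temp_during_c > r.temp_before_c  # breath warms the temperature sensor
        or r.humidity_pct > HUMIDITY_THRESHOLD_PCT
        or r.co2_ppm > CO2_THRESHOLD_PPM
    )
```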
For another example, in an optional embodiment of this application, the processor 1001 may be configured to: detect a wake-up indication that initiates voice interaction; enter the voice-interaction working state in response to the wake-up indication; detect first voice information; output a feedback result for the first voice information; determine whether the terminal is close to the user's mouth; if it is determined that the terminal is close to the user's mouth, extend the voice-interaction working state by a preset duration; and if it is determined that the terminal is not close to the user's mouth, end the voice-interaction working state.
For yet another example, in an optional embodiment of this application, the processor 1001 may be configured to: detect a wake-up indication that initiates voice interaction; enter the voice-interaction working state in response to the wake-up indication; detect first voice information; output a feedback result for the first voice information; if second voice information is detected within a preset duration, determine whether the terminal is close to the user's mouth; and if it is determined that the terminal is close to the user's mouth, output a feedback result for the second voice information.
During implementation, each step of the above method can be completed by an integrated hardware logic circuit in the processor or by instructions in software form. The steps of the methods disclosed in connection with the embodiments of this application can be embodied directly as being executed by a hardware processor, or by a combination of hardware and software modules in the processor. The software module may reside in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, it is not described in detail here.
It should be noted that the processor in the embodiments of this application may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method embodiments can be completed by an integrated hardware logic circuit in the processor or by instructions in software form. The above processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component; it can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the embodiments of this application can be embodied directly as being executed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may reside in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
It can be understood that the memory in the embodiments of this application may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), which is used as an external cache. By way of illustration rather than limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to include, without limitation, these and any other suitable types of memory.
According to the method provided by the embodiments of this application, an embodiment of this application further provides a computer program product, which includes a computer program or instructions that, when run on a computer, cause the computer to execute the method of any one of the method embodiments.
According to the method provided by the embodiments of this application, an embodiment of this application further provides a computer-readable storage medium storing a computer program or instructions that, when run on a computer, cause the computer to execute the method of any one of the method embodiments.
According to the method provided by the embodiments of this application, an embodiment of this application further provides a terminal, including a memory and a processor coupled to each other; the memory is used to store computer program code that includes computer instructions, and when the processor executes the computer instructions, the terminal is caused to execute the method of any one of the method embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative logical blocks and steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functionality differently for each particular application, but such implementations should not be considered beyond the scope of this application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and modules described above can be found in the corresponding processes in the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division of logical functions, and other divisions are possible in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments.
In addition, the functional modules in the embodiments of this application may be integrated into one processing unit, or each module may exist physically alone, or two or more modules may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The voice interaction apparatus, chip, computer storage medium, computer program product, and terminal provided by the above embodiments of this application are all used to execute the methods provided above; therefore, for the beneficial effects they can achieve, refer to the beneficial effects corresponding to the methods provided above, which are not repeated here.
It should be understood that, in the embodiments of this application, the execution order of the steps should be determined by their functions and internal logic; the magnitude of the step numbers does not imply an execution order and does not limit the implementation process of the embodiments.
Each part of this specification is described in a progressive manner; for identical or similar parts between the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, the embodiments of the voice interaction apparatus, chip, computer storage medium, computer program product, and terminal are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the descriptions in the method embodiments.
Although preferred embodiments of this application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of this application.
The embodiments of this application described above do not limit the protection scope of this application.

Claims (19)

  1. A voice interaction method, characterized in that the method comprises:
    detecting a wake-up indication that initiates voice interaction;
    entering a voice-interaction working state in response to the wake-up indication;
    detecting first voice information;
    outputting a feedback result for the first voice information;
    if second voice information is detected within a preset duration, detecting a user's breath; and
    if the user's breath is detected, outputting a feedback result for the second voice information.
  2. The method according to claim 1, characterized in that, after the feedback result for the first voice information is output, the method further comprises:
    determining whether a terminal is close to the user's mouth;
    if it is determined that the terminal is close to the user's mouth, extending the voice-interaction working state by the preset duration; and
    if it is determined that the terminal is not close to the user's mouth, ending the voice-interaction working state.
  3. The method according to claim 1, characterized in that, if the second voice information is detected within the preset duration, the method further comprises:
    determining whether a terminal is close to the user's mouth;
    if it is determined that the terminal is close to the user's mouth, detecting the user's breath; and
    if it is determined that the terminal is not close to the user's mouth, ending the voice-interaction working state.
  4. The method according to claim 2 or 3, characterized in that, if the wake-up indication is the user's breath, determining whether the terminal is close to the user's mouth comprises:
    recognizing the user's gesture in the voice-interaction working state;
    if the user's gesture is a first gesture, determining that the terminal is close to the user's mouth, wherein the first gesture indicates that the user is holding the terminal in a stationary state; and
    if the user's gesture is a second gesture, determining that the terminal is not close to the user's mouth, wherein the second gesture indicates that the user is holding the terminal and moving it away from the user's mouth.
  5. The method according to claim 2 or 3, characterized in that, if the wake-up indication is given in a manner other than the user's breath, before determining whether the terminal is close to the user's mouth, the method comprises:
    determining whether a third gesture was recognized before the feedback result for the first voice information was output, wherein the third gesture indicates that the user is holding the terminal and moving it toward the user's mouth;
    if the third gesture was recognized, determining whether the terminal is still close to the user's mouth after the feedback result for the first voice information is output; and
    if the third gesture was not recognized, ending the voice-interaction working state.
  6. The method according to claim 4, characterized in that recognizing the user's gesture in the voice-interaction working state comprises:
    obtaining angular velocities and accelerations at different moments in the voice-interaction working state; and
    determining the user's gesture using the angular velocities and accelerations at the different moments together with a gesture recognition module, wherein the gesture recognition module is used to recognize whether the user is holding the terminal and moving it toward the user's mouth, holding the terminal and moving it away from the user's mouth, or holding the terminal in a stationary state.
  7. The method according to claim 1, characterized in that detecting the user's breath comprises:
    inputting the second voice information into a breath recognition module, wherein the breath recognition module is used to identify whether the second voice information is a sound emitted with the user's mouth within a preset distance of a terminal;
    if the breath recognition module identifies the second voice information as a sound emitted with the user's mouth within the preset distance of the terminal, determining that the user's breath is detected; and
    if the breath recognition module identifies the second voice information as not being a sound emitted with the user's mouth within the preset distance of the terminal, determining that the user's breath is not detected.
  8. The method according to claim 1, characterized in that a terminal comprises a pressure sensor and detecting the user's breath comprises:
    obtaining a pressure value of the pressure sensor when the second voice information is collected;
    if the pressure value is greater than a preset pressure threshold, determining that the user's breath is detected; and
    if the pressure value is less than or equal to the preset pressure threshold, determining that the user's breath is not detected.
  9. The method according to claim 1, characterized in that a terminal comprises a temperature sensor and detecting the user's breath comprises:
    obtaining a first temperature and a second temperature, wherein the first temperature is a temperature of the temperature sensor before the second voice information is collected and the second temperature is a temperature of the temperature sensor when the second voice information is collected;
    if the second temperature is greater than the first temperature, determining that the user's breath is detected; and
    if the second temperature is less than or equal to the first temperature, determining that the user's breath is not detected.
  10. The method according to claim 1, characterized in that a terminal comprises a humidity sensor and detecting the user's breath comprises:
    obtaining a humidity of the humidity sensor when the second voice information is collected;
    if the humidity is greater than a preset humidity threshold, determining that the user's breath is detected; and
    if the humidity is less than or equal to the preset humidity threshold, determining that the user's breath is not detected.
  11. The method according to claim 1, characterized in that a terminal comprises a carbon dioxide sensor and detecting the user's breath comprises:
    obtaining a carbon dioxide concentration of the carbon dioxide sensor when the second voice information is collected;
    if the carbon dioxide concentration is greater than a preset carbon dioxide concentration threshold, determining that the user's breath is detected; and
    if the carbon dioxide concentration is less than or equal to the preset carbon dioxide concentration threshold, determining that the user's breath is not detected.
  12. A voice interaction method, characterized in that the method comprises:
    detecting a wake-up indication that initiates voice interaction;
    entering a voice-interaction working state in response to the wake-up indication;
    detecting first voice information;
    outputting a feedback result for the first voice information;
    determining whether a terminal is close to a user's mouth;
    if it is determined that the terminal is close to the user's mouth, extending the voice-interaction working state by a preset duration; and
    if second voice information is detected within the preset duration, outputting a feedback result for the second voice information.
  13. A voice interaction method, characterized in that the method comprises:
    detecting a wake-up indication that initiates voice interaction;
    entering a voice-interaction working state in response to the wake-up indication;
    detecting first voice information;
    outputting a feedback result for the first voice information;
    if second voice information is detected within a preset duration, determining whether a terminal is close to a user's mouth; and
    if it is determined that the terminal is close to the user's mouth, outputting a feedback result for the second voice information.
  14. A voice interaction apparatus, characterized in that the apparatus comprises a processor;
    the processor is configured to: detect a wake-up indication that initiates voice interaction; enter a voice-interaction working state in response to the wake-up indication; detect first voice information; output a feedback result for the first voice information; if second voice information is detected within a preset duration, detect a user's breath; and if the user's breath is detected, output a feedback result for the second voice information.
  15. A voice interaction apparatus, characterized in that the apparatus comprises a processor;
    the processor is configured to: detect a wake-up indication that initiates voice interaction; enter a voice-interaction working state in response to the wake-up indication; detect first voice information; output a feedback result for the first voice information; determine whether a terminal is close to a user's mouth; if it is determined that the terminal is close to the user's mouth, extend the voice-interaction working state by a preset duration; and if it is determined that the terminal is not close to the user's mouth, end the voice-interaction working state.
  16. A voice interaction apparatus, characterized in that the apparatus comprises a processor;
    the processor is configured to: detect a wake-up indication that initiates voice interaction; enter a voice-interaction working state in response to the wake-up indication; detect first voice information; output a feedback result for the first voice information; if second voice information is detected within a preset duration, determine whether a terminal is close to a user's mouth; and if it is determined that the terminal is close to the user's mouth, output a feedback result for the second voice information.
  17. A terminal, characterized in that the terminal comprises a memory and a processor coupled to each other; the memory is configured to store computer program code comprising computer instructions, and when the processor executes the computer instructions, the terminal is caused to perform the method according to any one of claims 1-13.
  18. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program or instructions which, when executed, cause the method according to any one of claims 1-13 to be performed.
  19. A computer program product, characterized in that the computer program product comprises a computer program or instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1-13.
PCT/CN2023/114613 2022-09-14 2023-08-24 Voice interaction method and apparatus, and terminal WO2024055831A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211113419.9 2022-09-14
CN202211113419.9A CN117746849A (en) 2022-09-14 2022-09-14 Voice interaction method, device and terminal

Publications (1)

Publication Number Publication Date
WO2024055831A1 true WO2024055831A1 (en) 2024-03-21

Family

ID=90274207

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/114613 WO2024055831A1 (en) 2022-09-14 2023-08-24 Voice interaction method and apparatus, and terminal

Country Status (2)

Country Link
CN (1) CN117746849A (en)
WO (1) WO2024055831A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180211665A1 (en) * 2017-01-20 2018-07-26 Samsung Electronics Co., Ltd. Voice input processing method and electronic device for supporting the same
CN109712621A * 2018-12-27 2019-05-03 Vivo Mobile Communication Co., Ltd. Voice interaction control method and terminal
CN110097875A * 2019-06-03 2019-08-06 Tsinghua University Electronic device, method and medium for voice-interaction wake-up based on microphone signals
CN110262767A * 2019-06-03 2019-09-20 Tsinghua University Voice-input wake-up apparatus, method and medium based on mouth-proximity detection
CN111402900A * 2018-12-29 2020-07-10 Huawei Technologies Co., Ltd. Voice interaction method, device and system

Also Published As

Publication number Publication date
CN117746849A (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN108711430B (en) Speech recognition method, intelligent device and storage medium
CN108710615B (en) Translation method and related equipment
WO2014008843A1 (en) Method for updating voiceprint feature model and terminal
US9570076B2 (en) Method and system for voice recognition employing multiple voice-recognition techniques
CN107919138B (en) Emotion processing method in voice and mobile terminal
US9818404B2 (en) Environmental noise detection for dialog systems
EP4191579A1 (en) Electronic device and speech recognition method therefor, and medium
CN110364156A Voice interaction method, system, terminal and readable storage medium
US11348584B2 (en) Method for voice recognition via earphone and earphone
US20220239269A1 (en) Electronic device controlled based on sound data and method for controlling electronic device based on sound data
WO2021103449A1 (en) Interaction method, mobile terminal and readable storage medium
WO2021212388A1 (en) Interactive communication implementation method and device, and storage medium
US20200125603A1 (en) Electronic device and system which provides service based on voice recognition
CN111681655A (en) Voice control method and device, electronic equipment and storage medium
US20230239800A1 (en) Voice Wake-Up Method, Electronic Device, Wearable Device, and System
CN112256135A (en) Equipment control method and device, equipment and storage medium
WO2024055831A1 (en) Voice interaction method and apparatus, and terminal
WO2023006033A1 (en) Speech interaction method, electronic device, and medium
CN114999496A (en) Audio transmission method, control equipment and terminal equipment
CN115841814A (en) Voice interaction method and electronic equipment
CN114765026A (en) Voice control method, device and system
CN111681654A (en) Voice control method and device, electronic equipment and storage medium
CN115331672B (en) Device control method, device, electronic device and storage medium
US20220189477A1 (en) Method for controlling ambient sound and electronic device therefor
CN114093357A (en) Control method, intelligent terminal and readable storage medium