CN111292737A - Voice interaction and voice wake-up detection method, apparatus, device and storage medium

Voice interaction and voice wake-up detection method, apparatus, device and storage medium

Info

Publication number
CN111292737A
Authority
CN
China
Prior art keywords
voice
user
awakening
wake
characteristic
Prior art date
2018-12-07
Legal status
Pending
Application number
CN201811495094.9A
Other languages
Chinese (zh)
Inventor
王德淼
孟伟
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
2018-12-07
Filing date
2018-12-07
Publication date
2020-06-16
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201811495094.9A
Publication of CN111292737A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0316 Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L 21/0324 Details of processing therefor
    • G10L 21/034 Automatic adjustment
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/225 Feedback of the input speech

Abstract

The invention discloses a voice interaction method, a voice wake-up detection method, and corresponding apparatuses, devices and storage media. A user's voice input is analyzed, and parameters related to voice interaction are adjusted based on the analysis result. The scheme can be used for voice interaction after the device has been woken up: speech characteristics of the user's voice input, such as speech rate, volume and pitch, can be analyzed, and the emotional state of the voice input can be analyzed as well. Based on the analysis result, the relevant parameters of the voice output fed back to the user can be adjusted accordingly, so as to improve the user's interaction experience. The invention can also be used for voice wake-up detection: when the wake-up detection result is lower than the current wake-up threshold, the wake-up threshold can be lowered, for example for a following period of time, so that the wake-up success rate during that period is improved and the user's wake-up experience is improved.

Description

Voice interaction and voice wake-up detection method, apparatus, device and storage medium
Technical Field
The present invention relates to the field of voice interaction, and in particular, to a method, an apparatus, a device, and a storage medium for voice interaction and voice wake-up detection.
Background
Voice interaction belongs to the field of human-computer interaction and is one of the most advanced interaction modes that human-computer interaction has developed to date. Voice interaction is the process by which a user gives instructions to a machine through natural language to achieve his or her own goals. Current voice interaction schemes mainly consider how to improve the accuracy of speech recognition, but ignore that the essence of voice interaction is to provide convenience for users; as a result, existing voice interaction schemes are not user-friendly.
Accordingly, there is a need for an improved voice interaction scheme to provide a more comfortable interaction experience for the user.
Disclosure of Invention
It is an object of the present invention to provide a voice interaction scheme that provides a more comfortable interaction experience for the user.
According to a first aspect of the present invention, there is provided a voice interaction method, comprising: analyzing a voice input of a user to determine a first characteristic of the voice input; based on the first characteristic, a second characteristic of the speech output that is fed back to the user is determined.
Optionally, the first feature comprises a speech feature and/or an emotional state feature, and/or the second feature comprises a speech feature and/or an emotional state feature.
Optionally, the first feature and/or the second feature comprises at least one of: speech rate; volume; pitch; timbre; emotional state.
Optionally, the second feature is the same as or similar to the first feature; or the second characteristic is opposite to the first characteristic.
Optionally, the step of determining a second characteristic of the voice output to be fed back to the user comprises: determining the second characteristic based on the first characteristic, in combination with the time of the voice input and/or the current state of the device and/or text information obtained by recognizing the user's voice input.
Optionally, the first characteristic comprises a first volume, and the step of determining the second characteristic comprises: in the event that the current time satisfies a first predetermined condition, and/or the device is in a do-not-disturb mode, and/or the user does not adjust the system volume of the device for a first predetermined length of time, and/or the first volume differs from the system volume of the device by more than a first predetermined threshold, adjusting a second volume of the speech output that is fed back to the user to the first volume, or to near the first volume.
Optionally, the first feature comprises a first speech rate, and the step of determining the second feature comprises: adjusting the second speech rate of the voice output fed back to the user to be the same as, or close to, the first speech rate.
Optionally, the method further comprises: the system volume is turned down if the current time meets a second predetermined condition and/or the device is in a do-not-disturb mode and/or no voice input is received for a second predetermined period of time and/or the system volume is greater than a second predetermined threshold and/or the user does not adjust the system volume for a third predetermined period of time and/or there is no voice output currently.
Optionally, the method further comprises: performing wake-up detection on the detected voice input; and lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold.
Optionally, the step of lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold includes: lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold and higher than a third predetermined threshold.
According to a second aspect of the present invention, there is also provided a voice wake-up detection method, including: performing wake-up detection on the detected voice input; and lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold.
Optionally, the step of lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold includes: lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold and higher than a third predetermined threshold.
According to a third aspect of the present invention, there is also provided a voice interaction method, including: analyzing the voice input of the user; and adjusting parameters related to the voice interaction based on the analysis result.
According to a fourth aspect of the present invention, there is also provided a voice interaction apparatus, including: the analysis module is used for analyzing the voice input of the user to determine a first characteristic of the voice input; and the determining module is used for determining a second characteristic of the voice output fed back to the user based on the first characteristic.
According to a fifth aspect of the present invention, there is also provided a voice wake-up detection apparatus, including: a wake-up detection module for performing wake-up detection on the detected voice input; and a wake-up threshold adjustment module for lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold.
According to a sixth aspect of the present invention, there is also provided a voice interaction apparatus, including: the analysis module is used for analyzing the voice input of the user; and an adjustment module for adjusting parameters related to the voice interaction based on the analysis result.
According to a seventh aspect of the present invention, there is also provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as set forth in any one of the first to third aspects of the invention.
According to an eighth aspect of the present invention, there is also provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform a method as set forth in any one of the first to third aspects of the present invention.
The voice interaction scheme of the invention can be used for voice wake-up as well as for voice interaction after the device has been woken up: the wake-up threshold can be lowered for a second wake-up attempt to improve the wake-up success rate, and the user's speech rate, volume and emotional state can be imitated during voice interaction, thereby improving the user's interaction experience.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 is a schematic flow chart diagram illustrating a voice interaction method according to an embodiment of the present invention.
Fig. 2 is a schematic flow chart diagram illustrating a voice wake-up detection method according to an embodiment of the present invention.
Fig. 3 is a schematic block diagram showing the structure of a voice interaction apparatus according to an embodiment of the present invention.
Fig. 4 is a schematic block diagram showing the structure of a voice interaction apparatus according to another embodiment of the present invention.
Fig. 5 is a schematic block diagram illustrating the structure of a voice wake-up detecting apparatus according to an embodiment of the present invention.
FIG. 6 illustrates a schematic structural diagram of a computing device according to an embodiment of the invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to provide a more comfortable interactive experience for users, the invention provides a voice interaction scheme: the user's voice input is analyzed, and parameters related to voice interaction are adjusted based on the analysis result so as to improve the user's interaction experience. Parameters related to voice interaction are parameters that can affect the user's interaction experience; the purpose of adjusting them is that voice interaction performed with the adjusted parameters gives the user a better experience.
The voice interaction scheme of the invention is applicable to various voice interaction scenarios, such as voice interaction after the device has been woken up and voice wake-up detection before the device is woken up. In different scenarios, the purpose of analyzing the user's voice input, the implementation, the adjusted parameters and the achievable effects all differ.
When applied to voice interaction after the device has been woken up, the scheme can analyze speech characteristics of the user's voice input, such as speech rate, volume, pitch and timbre, and can also analyze the emotional state of the voice input. Based on the analysis result, the relevant parameters of the voice output fed back to the user can be adjusted accordingly. The adjusted parameters may include, but are not limited to, the speech rate, volume, pitch, timbre and emotional state of the voice output fed back to the user.
For example, the speech rate of the voice output fed back to the user may be adjusted according to the speech rate of the user's voice input; for instance, it may be adjusted to be the same as or similar to the speech rate of the user's voice input, that is, the device imitates the user's speech rate when responding. If the user lazily drawls "What's the weather like today?", the device can reply "The weather today is sunny..." at a similarly unhurried speech rate.
For another example, the user may also be replied to with the same or similar emotional state according to the analyzed emotional state of the voice input of the user.
During voice interaction, replying to the user while imitating characteristics of the user's input, such as speech rate, volume, timbre and emotional state, can improve the user's interaction experience and can, to a certain extent, increase the probability that the replied content is accepted by the user. For example, if the device is woken up in the middle of the night, the system volume of the device may be relatively loud while the detected volume of the user's voice input is relatively low; by imitating the user's volume when replying, the invention can provide a more considerate interactive experience.
Taking voice wake-up detection before voice interaction between the user and the device as an example: in existing wake-up detection schemes, if the user fails to wake the device because, due to a noisy environment or similar problems, the voice input does not reach the wake-up threshold, then as long as the surrounding environment has not noticeably improved, further wake-up attempts will keep failing and the user experience is poor.
The invention can perform wake-up detection on the user's voice input and, when the wake-up detection result is lower than the current wake-up threshold, lower the wake-up threshold; for example, the wake-up threshold can be lowered for a following period of time, so that the wake-up success rate during that period is improved and the user's wake-up experience is improved.
The aspects of the invention are further described below.
[ VOICE INTERACTION METHOD ]
FIG. 1 is a schematic flow chart diagram illustrating a voice interaction method according to an embodiment of the present invention. Wherein the method shown in fig. 1 may be performed by a device capable of voice interaction with a user.
Referring to fig. 1, a voice input of a user is analyzed to determine a first characteristic of the voice input at step S110.
The first feature may comprise a speech feature and/or an emotional state feature. The speech features may include, but are not limited to, acoustic features such as speech rate, volume, pitch and timbre; the emotional state features may be features such as happiness, anger and sadness that characterize the type of the user's emotion.
In the invention, the first characteristic of the voice input can be determined by performing acoustic analysis, voice recognition, semantic analysis and the like on the voice input of the user.
Taking speech rate as an example of the first characteristic, the duration of the voice input may be acquired, and the speech rate may be determined from that duration and the number of words in the text information obtained by performing speech recognition on the voice input: the speech rate is the number of words spoken by the user per unit time.
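As a rough illustrative sketch (not part of the patent text), the speech rate computation just described might look as follows; the function name and the fallback to character counting for unsegmented Chinese text are assumptions of this sketch:

```python
def estimate_speech_rate(duration_seconds: float, transcript: str) -> float:
    """Estimate the speech rate of a voice input as words per second.

    `transcript` is assumed to be the text obtained by running speech
    recognition on the voice input. For languages written without spaces
    (e.g. Chinese), each character is counted as one word.
    """
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    tokens = transcript.split()
    # Fall back to character counting when tokenization yields a single
    # long token, as happens for unsegmented Chinese text.
    word_count = len(tokens) if len(tokens) > 1 else len(transcript.strip())
    return word_count / duration_seconds


# Example: a 2-second clip recognized as 7 characters -> 3.5 words/second.
rate = estimate_speech_rate(2.0, "今天天气怎么样")
```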
Taking the first characteristic as volume, pitch, and timbre as an example, the volume, pitch, and timbre of the voice input can be determined by performing acoustic analysis (e.g., waveform analysis) on the voice input of the user. The detailed analysis process is not described herein.
Taking an emotional state feature as an example of the first feature: the speech characteristics of the user's voice input can, to a certain extent, indicate the user's current emotional state. For example, when the user is happy, the produced voice is generally crisp and forceful and the speech rate is faster; when the user is tired or sad, the produced voice is generally deep and dull and the speech rate is slower. The emotional state of the voice input may therefore be determined based on the speech characteristics (speech rate, volume, pitch, timbre, etc.) obtained by analyzing it. The text information of the user's voice input can also reflect the current emotional state to a certain extent, so the emotional state may likewise be determined by performing semantic analysis on the voice input. Alternatively, the emotional state may be determined based on both the speech characteristics of the voice input and the semantic analysis result. The specific implementation is not described herein again.
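The text above does not specify a concrete emotion classifier. Purely as a hedged illustration of the acoustic-feature branch, a toy rule-based mapping could look like the following; every threshold value is invented for the example:

```python
def classify_emotion(speech_rate: float, volume_db: float, pitch_hz: float) -> str:
    """Toy rule-based emotion guess from acoustic features.

    The thresholds below are illustrative assumptions only; a real system
    would more likely use a trained model, optionally combined with
    semantic analysis of the recognized text.
    """
    if speech_rate > 4.0 and volume_db > 60.0:
        return "happy"   # crisp, forceful voice with fast speech
    if speech_rate < 2.0 and volume_db < 45.0:
        return "sad"     # deep, dull voice with slow speech
    if volume_db > 70.0 and pitch_hz > 250.0:
        return "angry"   # loud, high-pitched speech
    return "neutral"
```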
In step S120, a second feature of the speech output fed back to the user is determined based on the first feature.
The voice output fed back to the user may be a reply made by the device to the user's voice input. After receiving the user's voice input, the content fed back to the user can be determined using a preset dialogue script; the present invention does not limit the content of the voice output fed back to the user or the way that content is determined.
The invention is mainly concerned with determining the second feature of the voice output fed back to the user according to the first feature of the voice input. Similar to the first feature, the second feature may also include a speech feature and/or an emotional state feature. The speech features may include, but are not limited to, acoustic features such as speech rate, volume, pitch and timbre; the emotional state features may be features such as happiness, anger and sadness that characterize the type of emotion.
The first feature and the second feature mentioned in the present invention may refer to features of the same dimension or of different dimensions. For example: the first feature refers to speech rate, and the second feature refers to speech rate as well; the first feature refers to volume, and the second feature also refers to volume; the first feature refers to pitch, as does the second feature; the first feature refers to timbre, and the second feature refers to timbre as well; the first feature refers to an emotional state, as does the second feature. As another example, the first feature may refer to speech rate and/or volume and/or pitch and/or timbre, while the second feature refers to an emotional state.
The second feature may be the same as or similar to the first feature. That is, in feeding back the speech output to the user, the second characteristic of the speech output may be adjusted to be the same as or similar to the first characteristic to mimic the first characteristic of the user's speech input. For example, speech output may be fed back to the user in a manner that mimics the speech rate, volume, pitch, timbre, emotional state, etc. characteristics of the speech input.
The second feature may also be the opposite of the first feature. For example, when the emotional state of the user's voice input is detected to be rather sad, the voice output corresponding to that input may be fed back in a cheerful emotional state, so as to help lift the user out of the low mood.
As an example of the present invention, other factors may also be referenced when determining the second characteristic of the speech output that is fed back to the user based on the first characteristic. For example, reference may be made to the timing of the voice input (e.g., the time of the voice input) and/or the current state of the device (e.g., whether it is in a do-not-disturb mode) and/or textual information (e.g., the semantics of the textual information) parsed from the user's voice input.
That is, the second characteristic may be determined based on the first characteristic in combination with the timing of the voice input and/or the current state of the device and/or textual information resulting from parsing the voice input by the user.
Several application scenarios of the voice interaction scheme of the present invention are exemplified below with reference to specific embodiments.
Example 1: Speech rate imitation
Consider the case where the first feature is the speech rate (referred to as the "first speech rate" for ease of distinction). The speech rate of the voice output fed back to the user (the "second speech rate") may be adjusted to be the same as, or close to, the first speech rate.
Adjusting the second speech rate to be close to the first speech rate means that the second speech rate may be adjusted within a predetermined range above and below the first speech rate, where the range may be a predetermined value or a predetermined ratio. For example, if the first speech rate is V0, the second speech rate may be adjusted within the interval [V0 - V1, V0 + V2], where V1 and V2 are constants that may be equal or different. As another example, if the first speech rate is V0, the second speech rate may be adjusted within the interval [(1 - α)·V0, (1 + β)·V0], where α and β are proportional coefficients in the interval (0, 1) and may be equal or different.
As an example, several speech rate intervals may be divided in advance, with different intervals corresponding to different speech rate ranges. The trigger condition for speech rate imitation may then be that the speech rate interval to which the first speech rate of the user's voice input belongs has changed, at which point the second speech rate of the voice output is adjusted. Optionally, the second speech rate may be adjusted within the speech rate range corresponding to the interval to which the first speech rate currently belongs, so that the adjusted second speech rate and the first speech rate fall into the same interval, as in the sketch below.
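A minimal sketch of the two adjustment strategies of Example 1 might look as follows; the interval boundaries and coefficient defaults are illustrative assumptions, since the text leaves the concrete values open:

```python
SPEECH_RATE_INTERVALS = [(0.0, 2.0), (2.0, 4.0), (4.0, 8.0)]  # words/s, assumed

def clamp_to_range(v0: float, v1: float = 0.5, v2: float = 0.5) -> tuple:
    """Interval [v0 - v1, v0 + v2] within which the second speech rate
    may be adjusted; v1 and v2 are predetermined constants."""
    return (v0 - v1, v0 + v2)

def clamp_to_ratio(v0: float, alpha: float = 0.2, beta: float = 0.2) -> tuple:
    """Interval [(1 - alpha) * v0, (1 + beta) * v0], with proportional
    coefficients alpha and beta in (0, 1)."""
    return ((1 - alpha) * v0, (1 + beta) * v0)

def interval_index(rate: float) -> int:
    """Index of the pre-divided speech rate interval the rate falls into."""
    for i, (lo, hi) in enumerate(SPEECH_RATE_INTERVALS):
        if lo <= rate < hi:
            return i
    return len(SPEECH_RATE_INTERVALS) - 1

def maybe_adjust_second_rate(prev_first: float, first: float,
                             second: float) -> float:
    """Re-adjust the second speech rate only when the first speech rate
    has moved to a different interval, clamping it into that interval so
    that both rates end up in the same band."""
    old_i, new_i = interval_index(prev_first), interval_index(first)
    if old_i == new_i:
        return second
    lo, hi = SPEECH_RATE_INTERVALS[new_i]
    return min(max(second, lo), hi)
```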
Example 2: Volume imitation
Consider the case where the first characteristic is the volume (referred to as the "first volume" for ease of distinction). The volume of the voice output fed back to the user (the "second volume") may be adjusted to be the same as, or close to, the first volume.
Adjusting the second volume to be close to the first volume means that the second volume may be adjusted within a predetermined range above and below the first volume, where the range may be a predetermined value or a predetermined ratio. For example, if the first volume is C0, the second volume may be adjusted within the interval [C0 - C1, C0 + C2], where C1 and C2 are constants that may be equal or different. As another example, if the first volume is C0, the second volume may be adjusted within the interval [(1 - α)·C0, (1 + β)·C0], where α and β are proportional coefficients and may be equal or different.
As an example, the trigger condition for volume imitation may be set as satisfying one or more of the following conditions.
(1) The current time satisfies a first predetermined condition. The first predetermined condition may be a condition characterizing a time range, such as 9 pm to 8 am, or the working hours of weekdays (Monday to Friday), e.g., 9 am to 5 pm.
(2) The device is in a do-not-disturb mode. The do-not-disturb mode can be a silent mode, a vibration mode, and the like. The on/off of the do-not-disturb mode may be actively set by the user.
(3) The user does not adjust the system volume of the device for the first predetermined length of time. The value of the first predetermined period of time may be set according to actual conditions, such as 10 minutes.
(4) The difference between the first volume and the system volume of the device is greater than a first predetermined threshold. The first predetermined threshold may be set according to practical situations, for example, 10 dB.
When the trigger condition for volume imitation is met, the second volume of the voice output fed back to the user can be adjusted to the first volume, or to a value close to the first volume. Conversely, when the trigger condition for volume imitation no longer holds, the second volume may be restored to its previous state, that is, the state before volume imitation was performed.
When the trigger condition for volume imitation includes condition (4), adjusting the second volume to be close to the first volume means adjusting it within a predetermined range above and below the first volume, where this range is smaller than the first predetermined threshold.
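A minimal sketch of Example 2 is given below. The quiet-hours window, the 10-minute duration and the 10 dB difference mirror the examples in the text, while the OR-combination of the four conditions and all names are assumptions of this sketch:

```python
import datetime
from dataclasses import dataclass
from typing import Optional

@dataclass
class DeviceState:
    system_volume_db: float
    do_not_disturb: bool
    seconds_since_user_volume_change: float
    saved_volume_db: Optional[float] = None  # volume before imitation began

QUIET_START_H, QUIET_END_H = 21, 8    # "9 pm to 8 am", condition (1) example
FIRST_PREDETERMINED_S = 10 * 60       # 10 minutes, condition (3) example
FIRST_PREDETERMINED_DB = 10.0         # 10 dB, condition (4) example

def imitation_triggered(dev: DeviceState, first_volume_db: float,
                        now: datetime.datetime) -> bool:
    """True when at least one of the four trigger conditions holds."""
    in_quiet_hours = now.hour >= QUIET_START_H or now.hour < QUIET_END_H
    return (in_quiet_hours
            or dev.do_not_disturb
            or dev.seconds_since_user_volume_change >= FIRST_PREDETERMINED_S
            or abs(first_volume_db - dev.system_volume_db) > FIRST_PREDETERMINED_DB)

def output_volume(dev: DeviceState, first_volume_db: float,
                  now: datetime.datetime) -> float:
    """Second volume of the voice output fed back to the user."""
    if imitation_triggered(dev, first_volume_db, now):
        if dev.saved_volume_db is None:
            dev.saved_volume_db = dev.system_volume_db  # remember prior state
        return first_volume_db                          # mimic the first volume
    if dev.saved_volume_db is not None:
        previous, dev.saved_volume_db = dev.saved_volume_db, None
        return previous                                 # restore previous state
    return dev.system_volume_db
```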
Example 3: Intelligent volume reduction
As an example, the triggering condition for intelligently lowering the volume may be set to satisfy one or more of the following conditions.
(1) The current time satisfies a second predetermined condition. The second predetermined condition may be a condition characterizing a time range, such as 9 pm to 8 am, or the working hours of weekdays (Monday to Friday), e.g., 9 am to 5 pm.
(2) The device is in a do-not-disturb mode. The do-not-disturb mode can be a silent mode, a vibration mode, and the like. The on/off of the do-not-disturb mode may be actively set by the user.
(3) No voice input is received for a second predetermined length of time. I.e. no speech input for the second predetermined length of time. The value of the second predetermined time period may be set according to actual conditions, for example, may be 10 minutes.
(4) The system volume is greater than a second predetermined threshold. The second predetermined threshold may be set according to practical conditions, such as 60 dB.
(5) The user does not adjust the system volume for the third predetermined length of time. The value of the third predetermined period of time may be set according to actual conditions, such as 10 minutes.
(6) There is currently no speech output. I.e. the device does not emit sound.
When the trigger condition for intelligently reducing the volume is met, the system volume can be turned down, for example automatically adjusted to 40.
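Example 3 might be sketched analogously; as above, the text allows the trigger to be built from one or more of the six conditions, and this sketch simply treats any single condition as sufficient:

```python
SECOND_PREDETERMINED_DB = 60.0  # example threshold from condition (4)
LOWERED_TARGET = 40.0           # example target volume from the text

def maybe_lower_system_volume(system_volume_db: float, in_quiet_hours: bool,
                              do_not_disturb: bool,
                              seconds_since_voice_input: float,
                              seconds_since_user_volume_change: float,
                              speech_output_active: bool) -> float:
    """Return the (possibly lowered) system volume per Example 3."""
    triggered = (in_quiet_hours                                   # (1)
                 or do_not_disturb                                # (2)
                 or seconds_since_voice_input >= 10 * 60          # (3)
                 or system_volume_db > SECOND_PREDETERMINED_DB    # (4)
                 or seconds_since_user_volume_change >= 10 * 60   # (5)
                 or not speech_output_active)                     # (6)
    return min(system_volume_db, LOWERED_TARGET) if triggered else system_volume_db
```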
Example 4: Emotion imitation
In feeding back the speech output to the user, the user may be replied to with an emotional state that is the same as or similar to the emotional state of the user's speech input.
As an example, the trigger condition for emotion imitation may be set as: the emotion of the user's voice input is detected to be in a state such as happiness, sadness or anger. When the trigger condition for emotion imitation is fulfilled, the device may reply to the user with the same or a similar emotion. Different emotional states can correspond to different speech characteristics, so the user can be replied to with the same or a similar emotion by adjusting speech characteristics of the voice output such as pitch, timbre and volume.
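One hypothetical way to realize such emotion imitation is a prosody table consulted by the speech synthesizer; the scaling values below are invented for illustration and are not taken from the patent:

```python
# Prosody scaling factors per emotion, applied to the voice output;
# all numbers are illustrative assumptions, not values from the patent.
EMOTION_PROSODY = {
    "happy":   {"pitch": 1.15, "rate": 1.10, "volume": 1.05},
    "sad":     {"pitch": 0.90, "rate": 0.85, "volume": 0.90},
    "angry":   {"pitch": 1.05, "rate": 1.05, "volume": 1.10},
    "neutral": {"pitch": 1.00, "rate": 1.00, "volume": 1.00},
}

def reply_prosody(user_emotion: str, mirror: bool = True) -> dict:
    """Prosody for the reply: mirror the user's emotion, or, per the
    'opposite feature' variant described earlier, answer a sad input
    with a cheerful voice."""
    if not mirror and user_emotion == "sad":
        return EMOTION_PROSODY["happy"]
    return EMOTION_PROSODY.get(user_emotion, EMOTION_PROSODY["neutral"])
```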
It should be noted that the method shown in fig. 1 may be applied to voice interaction after the device wakes up, and may also be applied to voice wake-up detection before the device wakes up. That is, the method shown in fig. 1 may be performed during voice wake-up detection to determine the second characteristic of the voice output fed back to the user after wake-up; in that case, the second characteristic can be regarded as the characteristic of the first voice output fed back to the user after waking up. During the subsequent voice interaction, the method shown in fig. 1 may likewise be performed, with the second characteristic of the voice output adjusted according to the first characteristic of the user's voice input determined in real time.
[ VOICE WAKE-UP DETECTION ]
Fig. 2 is a schematic flow chart diagram illustrating a voice wake-up detection method according to an embodiment of the present invention. Wherein the method shown in fig. 2 may be performed by a device capable of voice interaction with a user.
Referring to fig. 2, in step S210, a wake-up detection is performed on a detected voice input.
Wake-up detection may be performed using existing wake-up detection algorithms. For example, the detected voice input can be recognized, the recognition result compared with a preset wake-up word, and the device woken up when the voice input matches the preset wake-up word.
In step S220, if the wake-up detection result is lower than the current wake-up threshold, the wake-up threshold is decreased.
The wake-up detection result may be the probability, obtained by performing wake-up detection on the voice input, that the input is recognized as the wake-up word, and the wake-up threshold may be a preset comparison threshold. When the wake-up detection result is greater than or equal to the wake-up threshold, wake-up is considered successful, the voice interaction function of the device is started, and voice interaction with the user begins.
In existing solutions, if the voice input does not reach the wake-up threshold because of problems such as a noisy environment, the device is not woken up. When the surrounding environment has not noticeably improved, the user keeps trying, the device still cannot be woken up, and the user experience is poor.
The invention lowers the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold; for example, the wake-up threshold can be lowered for the next predetermined period of time. The success rate of the user's subsequent continued wake-up attempts is thereby greatly improved. The specific adjustment value may be set according to actual conditions and is not described herein again.
Optionally, the wake-up threshold may be lowered only if the wake-up detection result is lower than the current wake-up threshold and higher than a third predetermined threshold. The third predetermined threshold can be set according to actual conditions, so as to avoid abnormal wake-ups caused by environmental noise.
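Putting the two thresholds together, the wake-up logic of Fig. 2 might be sketched as follows. The concrete score values, the lowered threshold and the validity window are assumptions; the patent leaves them to be set according to actual conditions:

```python
import time

class WakeThresholdManager:
    """Sketch of the adaptive wake-up threshold described above."""

    def __init__(self, base_threshold: float = 0.80,
                 lowered_threshold: float = 0.65,
                 third_threshold: float = 0.40,  # guards against pure noise
                 window_seconds: float = 30.0):  # "next period of time"
        self.base = base_threshold
        self.lowered = lowered_threshold
        self.third = third_threshold
        self.window = window_seconds
        self._lowered_until = 0.0

    def current_threshold(self) -> float:
        """Lowered threshold while the window is active, else the base."""
        return self.lowered if time.monotonic() < self._lowered_until else self.base

    def on_detection(self, score: float) -> bool:
        """`score` is the probability that the voice input is the wake-up
        word. Returns True when the device should wake up."""
        if score >= self.current_threshold():
            return True
        # Near miss: lower the threshold for the next window so that an
        # immediate retry is more likely to succeed, but only when the
        # score exceeds the third threshold, to avoid reacting to noise.
        if score > self.third:
            self._lowered_until = time.monotonic() + self.window
        return False
```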
In conclusion, the voice interaction scheme of the invention can be used for voice wake-up as well as for voice interaction after the device has been woken up: the wake-up threshold can be lowered for a second wake-up attempt to improve the wake-up success rate, and the user's speech rate, volume and emotional state can be imitated during voice interaction, thereby improving the user's interaction experience.
[ VOICE INTERACTION APPARATUS ]
Fig. 3 is a schematic block diagram showing the structure of a voice interaction apparatus according to an embodiment of the present invention. Wherein the functional blocks of the voice interaction apparatus can be implemented by hardware, software or a combination of hardware and software which implement the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 3 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
The functional modules that the voice interaction apparatus may have and the operations that each functional module may perform are briefly described, and for the details related thereto, reference may be made to the above-mentioned related description, which is not repeated herein.
Referring to fig. 3, the voice interaction apparatus 300 includes an analysis module 310 and an adjustment module 320. The analysis module 310 is used for analyzing the voice input of the user, and the adjusting module 320 is used for adjusting the parameters related to the voice interaction based on the analysis result. Parameters related to voice interaction, i.e. parameters that can affect the user's interaction experience. The purpose of the adjusting module 320 to adjust the parameters is to improve the interaction experience of the user when performing voice interaction based on the adjusted parameters.
The voice interaction apparatus 300 may be suitable for various voice interaction scenarios, such as voice interaction after the device wakes up, or voice wake-up detection before the device wakes up. When the method is applied to different scenes, the purpose, the implementation mode, the adjusted parameters and the influence on the interactive experience of the user are different when the method is used for analyzing the voice input of the user.
In the present invention, the voice interaction apparatus 300 may perform the voice interaction method described above with reference to fig. 1, and may also perform the voice wake-up detection method described above with reference to fig. 2.
That is, the analysis module 310 may analyze the user's voice input to determine a first characteristic of the voice input. The adjustment module 320 may determine a second characteristic of the speech output that is fed back to the user based on the first characteristic. For the first feature, the second feature and the related details, reference may be made to the description above in connection with fig. 1, which is not repeated here.
The analysis module 310 may also perform wake-up detection on the detected voice input, and the adjustment module 320 may adjust the wake-up threshold to be lower if the wake-up detection result is lower than the current wake-up threshold. For the relevant details, reference may be made to the description above in connection with fig. 2, which is not repeated here.
Fig. 4 is a schematic block diagram showing the structure of a voice interaction apparatus according to another embodiment of the present invention. Wherein the functional blocks of the voice interaction apparatus can be implemented by hardware, software or a combination of hardware and software which implement the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 4 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
The functional modules that the voice interaction apparatus may have and the operations that each functional module may perform are briefly described below; for the related details, reference may be made to the description above in conjunction with fig. 1, which is not repeated herein.
Referring to fig. 4, the voice interaction apparatus 400 includes an analysis module 410 and a determination module 420.
The analysis module 410 is configured to analyze the voice input of the user to determine a first characteristic of the voice input, and the determination module 420 is configured to determine a second characteristic of the voice output to be fed back to the user based on the first characteristic.
As an example, the determining module 420 may determine the second characteristic based on the first characteristic in combination with the timing of the voice input and/or the current state of the device and/or textual information parsed from the voice input of the user.
Taking the example that the first characteristic includes a first volume, the determining module 420 may adjust the second volume of the speech output that is fed back to the user to the first volume, or to near the first volume, if the current time satisfies a first predetermined condition, and/or the device is in a do-not-disturb mode, and/or the user has not adjusted the system volume of the device for a first predetermined length of time, and/or the first volume differs from the system volume of the device by more than a first predetermined threshold.
Taking the example that the first characteristic includes the first speech rate, the determining module 420 may adjust the second speech rate of the speech output fed back to the user to be the same as the first speech rate or to be close to the first speech rate.
Optionally, the determining module 420 may turn down the system volume if the current time meets a second predetermined condition, and/or the device is in the do-not-disturb mode, and/or no voice input is received for a second predetermined period of time, and/or the system volume is greater than a second predetermined threshold, and/or the user does not adjust the system volume for a third predetermined period of time, and/or no voice output is currently available.
Optionally, the voice interaction apparatus 400 may further include a voice wake-up detection module and a wake-up threshold adjustment module (not shown in the figure). The voice wake-up detection module is used for performing wake-up detection on the detected voice input, and the wake-up threshold adjustment module is used for lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold. For example, the wake-up threshold adjustment module may lower the wake-up threshold if the wake-up detection result is lower than the current wake-up threshold and higher than a third predetermined threshold.
[ VOICE WAKE-UP DETECTING DEVICE ]
Fig. 5 is a schematic block diagram illustrating the structure of a voice wake-up detecting apparatus according to an embodiment of the present invention. The functional modules of the voice wake-up detection apparatus can be implemented by hardware, software or a combination of hardware and software for implementing the principles of the present invention. It will be appreciated by those skilled in the art that the functional blocks described in fig. 5 may be combined or divided into sub-blocks to implement the principles of the invention described above. Thus, the description herein may support any possible combination, or division, or further definition of the functional modules described herein.
The functional modules that the voice wake-up detection apparatus may have and the operations that each functional module may perform are briefly described, and for the details related thereto, reference may be made to the above related description, which is not repeated herein.
Referring to fig. 5, the voice wake-up detection apparatus 500 includes a voice wake-up detection module 510 and a wake-up threshold adjustment module 520.
The voice wake-up detection module 510 is configured to perform wake-up detection on the detected voice input, and the wake-up threshold adjustment module 520 is configured to decrease the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold. For example, the wake threshold adjustment module 520 may decrease the wake threshold if the wake detection result is lower than the current wake threshold and higher than a third predetermined threshold.
[ COMPUTING DEVICE ]
Fig. 6 is a schematic structural diagram of a computing device that can be used to implement the voice interaction or voice wake detection method according to an embodiment of the present invention.
Referring to fig. 6, computing device 600 includes memory 610 and processor 620.
The processor 620 may be a multi-core processor or may include a plurality of processors. In some embodiments, processor 620 may include a general-purpose host processor and one or more special coprocessors such as a Graphics Processor (GPU), a Digital Signal Processor (DSP), or the like. In some embodiments, processor 620 may be implemented using custom circuits, such as an Application Specific Integrated Circuit (ASIC) or a Field Programmable Gate Array (FPGA).
The memory 610 may include various types of storage units, such as system memory, read-only memory (ROM) and permanent storage. The ROM may store static data or instructions required by the processor 620 or other modules of the computer. The permanent storage may be a readable and writable, non-volatile storage device that does not lose the stored instructions and data even after the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the permanent storage. In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or an optical drive). The system memory may be a readable and writable, volatile memory device, such as dynamic random access memory, and may store instructions and data that some or all of the processors require at runtime. In addition, the memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory 610 may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, or the like. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted by wireless or wired means.
The memory 610 has stored thereon executable code that, when processed by the processor 620, causes the processor 620 to perform the voice interaction or voice wake detection methods described above.
The voice interaction and voice wake-up detection method, apparatus and device according to the present invention have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (18)

1. A method of voice interaction, comprising:
analyzing a voice input of a user to determine a first characteristic of the voice input;
based on the first characteristic, a second characteristic of the speech output that is fed back to the user is determined.
2. The voice interaction method of claim 1,
the first feature comprises a speech feature and/or an emotional state feature, and/or
The second feature comprises a speech feature and/or an emotional state feature.
3. The voice interaction method of claim 1, wherein the first feature and/or the second feature comprises at least one of:
speech rate;
volume;
pitch;
timbre;
emotional state.
4. The voice interaction method of claim 1,
the second feature is the same as or similar to the first feature; or
The second characteristic is opposite to the first characteristic.
5. The voice interaction method of claim 1, for enabling interaction between a user and a device, wherein the step of determining a second characteristic of the voice output fed back to the user comprises:
and determining the second characteristic based on the first characteristic and in combination with the time of the voice input and/or the current state of the equipment and/or text information obtained by analyzing the voice input of the user.
6. The method of claim 5, wherein the first characteristic comprises a first volume, and wherein the step of determining the second characteristic comprises:
in the event that a first predetermined condition is met at a current time, and/or the device is in a do-not-disturb mode, and/or a user does not adjust a system volume of the device for a first predetermined length of time, and/or the first volume differs from the system volume of the device by more than a first predetermined threshold, adjusting a second volume of the speech output that is fed back to the user to the first volume, or to near the first volume.
7. The method of claim 1, wherein the first characteristic comprises a first speech rate, and wherein the step of determining the second characteristic comprises:
and adjusting a second speech speed of the speech output fed back to the user to be the same as the first speech speed or to be close to the first speech speed.
8. The voice interaction method of claim 1, further comprising:
the system volume is turned down if the current time meets a second predetermined condition and/or the device is in a do-not-disturb mode and/or no voice input is received for a second predetermined period of time and/or the system volume is greater than a second predetermined threshold and/or the user does not adjust the system volume for a third predetermined period of time and/or there is no voice output currently.
9. The voice interaction method of claim 1, further comprising:
performing wake-up detection on the detected voice input;
and lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold.
10. The voice interaction method of claim 9, wherein the step of turning down the wake-up threshold if the wake-up detection result is lower than the current wake-up threshold comprises:
and when the awakening detection result is lower than the current awakening threshold and higher than a third preset threshold, the awakening threshold is reduced.
11. A voice wake-up detection method, comprising:
performing wake-up detection on the detected voice input;
and lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold.
12. The voice wake-up detection method according to claim 11, wherein the step of lowering the wake-up threshold if the wake-up detection result is lower than the current wake-up threshold comprises:
and when the awakening detection result is lower than the current awakening threshold and higher than a third preset threshold, the awakening threshold is reduced.
13. A method of voice interaction, comprising:
analyzing the voice input of the user; and
parameters related to the voice interaction are adjusted based on the analysis results.
14. A voice interaction apparatus, comprising:
the analysis module is used for analyzing the voice input of a user to determine a first characteristic of the voice input;
a determination module to determine a second characteristic of the speech output to be fed back to the user based on the first characteristic.
15. A voice wake-up detection apparatus, comprising:
a wake-up detection module for performing wake-up detection on the detected voice input;
and a wake-up threshold adjustment module for lowering the wake-up threshold when the wake-up detection result is lower than the current wake-up threshold.
16. A voice interaction apparatus, comprising:
the analysis module is used for analyzing the voice input of the user; and
an adjustment module for adjusting parameters related to voice interaction based on the analysis result.
17. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1 to 13.
18. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-13.
CN201811495094.9A 2018-12-07 2018-12-07 Voice interaction and voice wake-up detection method, apparatus, device and storage medium Pending CN111292737A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811495094.9A CN111292737A (en) 2018-12-07 2018-12-07 Voice interaction and voice wake-up detection method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
CN111292737A true CN111292737A (en) 2020-06-16

Family

ID=71023029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811495094.9A Pending CN111292737A (en) Voice interaction and voice wake-up detection method, apparatus, device and storage medium

Country Status (1)

Country Link
CN (1) CN111292737A (en)

Patent Citations (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123938A (en) * 2013-04-29 2014-10-29 富泰华工业(深圳)有限公司 Voice control system, electronic device and voice control method
US20160077794A1 (en) * 2014-09-12 2016-03-17 Apple Inc. Dynamic thresholds for always listening speech trigger
CN104378564A (en) * 2014-11-06 2015-02-25 四川长虹电器股份有限公司 Automatic volume adjusting method for smart television set
CN106302978A (en) * 2015-06-29 2017-01-04 上海卓易科技股份有限公司 Incoming call reminding method and device
CN105120343A (en) * 2015-09-06 2015-12-02 成都爆米花信息技术有限公司 Intelligent shutdown method for television
US20170125036A1 (en) * 2015-11-03 2017-05-04 Airoha Technology Corp. Electronic apparatus and voice trigger method therefor
CN105654950A (en) * 2016-01-28 2016-06-08 百度在线网络技术(北京)有限公司 Self-adaptive voice feedback method and device
CN105869626A (en) * 2016-05-31 2016-08-17 宇龙计算机通信科技(深圳)有限公司 Automatic speech rate adjusting method and terminal
CN205864668U (en) * 2016-06-03 2017-01-04 万魔声学科技有限公司 Radiationproof screening apparatus for back-located antenna mobile telephone
CN106303700A (en) * 2016-08-29 2017-01-04 北海华源电子有限公司 Volume is automatically adjusted television set
CN107800856A (en) * 2016-08-29 2018-03-13 中兴通讯股份有限公司 A kind of voice broadcast method, device and mobile terminal
CN106453856A (en) * 2016-09-23 2017-02-22 上海斐讯数据通信技术有限公司 Method and system for controlling music play
CN106710590A (en) * 2017-02-24 2017-05-24 广州幻境科技有限公司 Voice interaction system with emotional function based on virtual reality environment and method
CN107562195A (en) * 2017-08-17 2018-01-09 英华达(南京)科技有限公司 Man-machine interaction method and system
CN108469966A (en) * 2018-03-21 2018-08-31 北京金山安全软件有限公司 Voice broadcast control method and device, intelligent device and medium
CN108711423A (en) * 2018-03-30 2018-10-26 百度在线网络技术(北京)有限公司 Intelligent sound interacts implementation method, device, computer equipment and storage medium
CN108847219A (en) * 2018-05-25 2018-11-20 四川斐讯全智信息技术有限公司 A kind of wake-up word presets confidence threshold value adjusting method and system
CN108833941A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Man-machine dialogue system method, apparatus, user terminal, processing server and system
CN108847239A (en) * 2018-08-31 2018-11-20 上海擎感智能科技有限公司 Interactive voice/processing method, system, storage medium, engine end and server-side

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816178A (en) * 2020-07-07 2020-10-23 云知声智能科技股份有限公司 Voice equipment control method, device and equipment
CN112102831A (en) * 2020-09-15 2020-12-18 海南大学 Cross-data, information and knowledge modal content encoding and decoding method and component
CN112435668A (en) * 2020-11-06 2021-03-02 联想(北京)有限公司 Voice recognition method, device and storage medium
CN113766710A (en) * 2021-05-06 2021-12-07 深圳市杰理微电子科技有限公司 Intelligent desk lamp control method based on voice detection and related equipment
CN113766710B (en) * 2021-05-06 2023-12-01 深圳市杰理微电子科技有限公司 Intelligent desk lamp control method based on voice detection and related equipment

Similar Documents

Publication Publication Date Title
CN111292737A (en) Voice interaction and voice wake-up detection method, apparatus, device and storage medium
US10699702B2 (en) System and method for personalization of acoustic models for automatic speech recognition
CN110060685B (en) Voice wake-up method and device
US20170256270A1 (en) Voice Recognition Accuracy in High Noise Conditions
CN106653016B (en) Intelligent interaction method and device
CN105529028A (en) Voice analytical method and apparatus
KR20160119274A (en) Determining hotword suitability
CN104969289A (en) Voice trigger for a digital assistant
CN110473536B (en) Awakening method and device and intelligent device
CN111312222A (en) Awakening and voice recognition model training method and device
US11308946B2 (en) Methods and apparatus for ASR with embedded noise reduction
CN112652306B (en) Voice wakeup method, voice wakeup device, computer equipment and storage medium
CN112669822B (en) Audio processing method and device, electronic equipment and storage medium
CN108509225B (en) Information processing method and electronic equipment
CN111192588B (en) System awakening method and device
CN112700782A (en) Voice processing method and electronic equipment
CN110473542B (en) Awakening method and device for voice instruction execution function and electronic equipment
CN111063356B (en) Electronic equipment response method and system, sound box and computer readable storage medium
KR20230113368A (en) Hotphrase triggering based on sequence of detections
US20210158816A1 (en) Method and apparatus for voice interaction, device and computer readable storage medium
CN112786047B (en) Voice processing method, device, equipment, storage medium and intelligent sound box
CN113362830A (en) Starting method, control method, system and storage medium of voice assistant
JP2023553994A (en) Adaptation of automatic speech recognition parameters based on hotword characteristics
CN114429766A (en) Method, device and equipment for adjusting playing volume and storage medium
CN105575392A (en) System and method for user interaction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200616