CN108962283B

CN108962283B - Method and device for determining question end mute time and electronic equipment

Info

Publication number: CN108962283B
Application number: CN201810083491.9A
Authority: CN
Inventors: 高慧湍; 李宝祥
Original assignee: Beijing Orion Star Technology Co Ltd
Current assignee: Beijing Orion Star Technology Co Ltd
Priority date: 2018-01-29
Filing date: 2018-01-29
Publication date: 2020-11-06
Anticipated expiration: 2038-01-29
Also published as: CN108962283A

Abstract

The embodiment of the invention provides a method, a device and electronic equipment for determining question end mute time, wherein the method comprises the following steps: acquiring a user voice signal acquired by an intelligent voice terminal; determining the speech rate information of the user speech signal, wherein the speech rate information is information for identifying the speech rate characteristics of the user speech signal; and determining the question end mute time according to the speech rate information and a preset mute time setting rule. By determining the question end mute time in this way, a reasonable question end mute time can be set according to the speech rate characteristics of the user, so that the intelligent voice terminal can respond accurately to users with different speech rates, and the response accuracy and the user experience of the intelligent voice terminal are greatly improved.

Description

Method and device for determining question end mute time and electronic equipment

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a device for determining question end mute time and electronic equipment.

Background

In recent years, with the rapid development of artificial intelligence technology, many artificial intelligence devices have appeared on the market. Some of these devices embed intelligent voice technology, so that users can control them and interact with them by voice, for example to query the weather, set an alarm clock, listen to stories, or chat. Artificial intelligence devices capable of voice interaction with users may be called intelligent voice terminals, for example intelligent sound boxes and robots capable of voice interaction.

When an intelligent voice terminal implements the voice interaction function, the voice response speed is very important. When the intelligent voice terminal collects a user voice signal, it sends the collected signal in real time to a server in communication connection with the terminal. On receiving the user voice signal, the server monitors the mute time of the signal; when the mute time reaches a preset time, the server determines that the user voice signal has ended. In other words, once the user has been silent for a period of time after speaking, the user's voice question is judged to be finished, and the server can then perform analysis such as voice recognition on the user voice signal. This preset time may be referred to as the question end mute time, which identifies the end of the user's question.

The question end mute time of a typical intelligent voice terminal is preset and cannot be changed. Because speech rates differ greatly between users, a fixed question end mute time often means that a user who speaks quickly must wait a long time after actually finishing a question before the intelligent voice terminal responds, while a user who speaks slowly is frequently cut off, with the intelligent voice terminal responding before the user has finished speaking. This way of determining the question end mute time therefore makes the response of the intelligent voice terminal inaccurate and the user experience poor.

Disclosure of Invention

The embodiment of the invention aims to provide a method and a device for determining question end mute time and electronic equipment, so as to improve the response accuracy and user experience of an intelligent voice terminal. The specific technical scheme is as follows:

In a first aspect, an embodiment of the present invention provides a method for determining a question end mute time, where the method includes:

acquiring a user voice signal acquired by an intelligent voice terminal;

determining the speech rate information of the user speech signal, wherein the speech rate information is information for identifying the speech rate characteristics of the user speech signal;

and determining question end mute time according to the speech rate information and a preset mute time setting rule.

Optionally, the step of obtaining the user voice signal collected by the intelligent voice terminal includes:

acquiring a user voice signal acquired by an intelligent voice terminal in real time;

before the step of determining the speech rate information of the user speech signal, the method includes:

detecting that the duration of the user voice signal reaches a preset duration;

the step of determining the question end mute time according to the speech rate information and a preset mute time setting rule includes:

and determining question end mute time corresponding to the currently acquired user voice signal according to the speech rate information and a preset mute time setting rule.

Optionally, the speech rate information is an average speech rate;

the step of determining the speech rate information of the user speech signal includes:

acquiring the duration of the user voice signal;

carrying out voice recognition on the user voice signal to obtain the number of characters corresponding to the user voice signal;

and determining the average speech speed of the user speech signal according to the number of the characters and the duration.

Optionally, the step of determining the question end mute time according to the speech rate information and a preset mute time setting rule includes:

and determining the question end mute time according to the magnitude relation between the average speech rate and a preset speech rate threshold value.

Optionally, the preset speech rate threshold includes a first preset speech rate threshold and a second preset speech rate threshold, where the first preset speech rate threshold is smaller than the second preset speech rate threshold;

the step of determining the question end mute time according to the magnitude relation between the average speech rate and a preset speech rate threshold value comprises the following steps:

when the average speech rate is smaller than the first preset speech rate threshold value, determining the question end mute time as a first mute time;

when the average speech rate is greater than the first preset speech rate threshold and less than the second preset speech rate threshold, determining question end mute time as second mute time;

and when the average speech rate is greater than the second preset speech rate threshold value, determining that the question end mute time is a third mute time, wherein the first mute time is greater than the second mute time, and the second mute time is greater than the third mute time.
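The three-tier rule above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the class name `SpeechRateRule`, the function `mute_time_from_rate`, and every numeric default (thresholds in characters per second, mute times in milliseconds) are assumptions chosen only to satisfy the stated orderings. The text does not say how an average speech rate exactly equal to a threshold is handled; the sketch assigns it to the faster tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechRateRule:
    # All numeric defaults are illustrative assumptions, not values from the text.
    first_threshold: float = 3.0   # characters per second (slower boundary)
    second_threshold: float = 5.0  # characters per second (faster boundary)
    first_mute_ms: int = 800       # slow speakers get the longest wait
    second_mute_ms: int = 600
    third_mute_ms: int = 400       # fast speakers get the shortest wait

    def __post_init__(self):
        # Enforce the orderings stated in the text.
        if not self.first_threshold < self.second_threshold:
            raise ValueError("first threshold must be smaller than second threshold")
        if not self.first_mute_ms > self.second_mute_ms > self.third_mute_ms:
            raise ValueError("mute times must decrease from first to third")


def mute_time_from_rate(avg_rate: float, rule: SpeechRateRule = SpeechRateRule()) -> int:
    """Map an average speech rate to a question end mute time in milliseconds."""
    if avg_rate < rule.first_threshold:
        return rule.first_mute_ms
    # Assumption: a rate exactly equal to a threshold falls into the faster tier.
    if avg_rate < rule.second_threshold:
        return rule.second_mute_ms
    return rule.third_mute_ms
```

Keeping the thresholds and mute times in one validated object means a misconfigured rule (for example, a first mute time shorter than the third) is rejected when the rule is created rather than producing inverted behavior at run time.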

Optionally, the speech rate information is an average interval time between characters;

the step of determining the speech rate information of the user speech signal includes:

carrying out voice recognition on the user voice signal to obtain the interval time between adjacent characters in the characters corresponding to the user voice signal;

and calculating the average interval time corresponding to the user voice signal according to the interval time.

Optionally, the step of determining the question end mute time according to the speech rate information and a preset mute time setting rule includes:

determining the question end mute time according to the magnitude relation between the average interval time and a preset time threshold.

Optionally, the preset time threshold includes a first preset time threshold and a second preset time threshold, where the first preset time threshold is greater than the second preset time threshold;

the step of determining the question end mute time according to the magnitude relation between the average interval time and a preset time threshold value comprises the following steps:

when the average interval time is greater than the first preset time threshold, determining that the question end mute time is a fourth mute time;

when the average interval time is smaller than the first preset time threshold and larger than the second preset time threshold, determining the question end mute time as fifth mute time;

and when the average interval time is smaller than the second preset time threshold, determining that the question end mute time is a sixth mute time, wherein the fourth mute time is larger than the fifth mute time, and the fifth mute time is larger than the sixth mute time.
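The interval-based variant above can be sketched in the same way. The sketch assumes that the speech recognizer reports a start and end time for each recognized character; that input format, the function names, and all numeric values are hypothetical. Overlapping timestamps are clamped to a zero gap, and an average exactly equal to a threshold is assigned to the shorter mute time, since the text leaves both cases open.

```python
from typing import Sequence, Tuple


def average_interval(char_times: Sequence[Tuple[float, float]]) -> float:
    """Average gap in seconds between adjacent characters.

    char_times holds one (start_s, end_s) pair per recognized character,
    in spoken order. This input format is an assumption of the sketch.
    """
    if len(char_times) < 2:
        raise ValueError("at least two characters are needed to measure an interval")
    gaps = [
        max(0.0, nxt[0] - cur[1])  # clamp overlapping timestamps to zero
        for cur, nxt in zip(char_times, char_times[1:])
    ]
    return sum(gaps) / len(gaps)


def mute_time_from_interval(
    avg_interval_s: float,
    first_threshold_s: float = 0.30,   # assumed value, larger threshold
    second_threshold_s: float = 0.15,  # assumed value, smaller threshold
    mute_ms: Tuple[int, int, int] = (800, 600, 400),  # fourth, fifth, sixth (assumed)
) -> int:
    """Map an average inter-character interval to a question end mute time."""
    fourth, fifth, sixth = mute_ms
    if avg_interval_s > first_threshold_s:
        return fourth  # long pauses between characters: wait longest
    if avg_interval_s > second_threshold_s:
        return fifth
    return sixth       # short pauses: wait shortest
```

Note that the comparison direction is the reverse of the speech rate rule: a larger average interval indicates slower speech and therefore maps to a longer mute time.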

Optionally, after the step of determining the question end mute time according to the speech rate information and the preset mute time setting rule, the method further includes:

when it is detected that the mute time corresponding to the user voice signal currently collected by the intelligent voice terminal reaches the determined question end mute time, responding to a user instruction corresponding to the currently acquired user voice signal, wherein the user instruction is an instruction determined according to the semantics of the currently acquired user voice signal.

In a second aspect, an embodiment of the present invention provides an apparatus for determining a question end mute time, where the apparatus includes:

the voice signal acquisition module is used for acquiring a user voice signal acquired by the intelligent voice terminal;

a speech rate information determining module, configured to determine speech rate information of the user speech signal, where the speech rate information is information that identifies a speech rate characteristic of the user speech signal;

and a mute time determining module, configured to determine the question end mute time according to the speech rate information and a preset mute time setting rule.

Optionally, the voice signal obtaining module includes:

the real-time acquisition submodule is used for acquiring a user voice signal acquired by the intelligent voice terminal in real time;

the device further comprises:

a preset duration monitoring module, configured to detect that the duration of the user voice signal reaches a preset duration before the speech rate information of the user voice signal is determined;

the mute time determining module comprises:

a mute time determining submodule, configured to determine the question end mute time corresponding to the currently acquired user voice signal according to the speech rate information and a preset mute time setting rule.

Optionally, the speech rate information is an average speech rate;

the speech rate information determining module comprises:

the time length obtaining submodule is used for obtaining the time length of the user voice signal;

the character number determining submodule is used for carrying out voice recognition on the user voice signal to obtain the character number corresponding to the user voice signal;

and the average speed determining submodule is used for determining the average speed of speech of the user speech signal according to the number of the characters and the duration.

Optionally, the mute time determining module includes:

a first determining submodule, configured to determine the question end mute time according to the magnitude relation between the average speech rate and a preset speech rate threshold.

Optionally, the preset speech rate threshold includes a first preset speech rate threshold and a second preset speech rate threshold, where the first preset speech rate threshold is smaller than the second preset speech rate threshold;

the first determining submodule includes:

a first determining unit, configured to determine that the question end mute time is a first mute time when the average speech rate is smaller than the first preset speech rate threshold;

a second determining unit, configured to determine that the question end mute time is a second mute time when the average speech rate is greater than the first preset speech rate threshold and is less than the second preset speech rate threshold;

a third determining unit, configured to determine, when the average speech rate is greater than the second preset speech rate threshold, that the question end mute time is a third mute time, where the first mute time is greater than the second mute time, and the second mute time is greater than the third mute time.

Optionally, the speech rate information is average interval time between words;

the speech rate information determining module comprises:

the interval time determining submodule is used for carrying out voice recognition on the user voice signal to obtain the interval time between adjacent characters in the characters corresponding to the user voice signal;

and the average interval time determining submodule is used for calculating the average interval time corresponding to the user voice signal according to the interval time.

Optionally, the mute time determining module includes:

a second determining submodule, configured to determine the question end mute time according to the magnitude relation between the average interval time and a preset time threshold.

Optionally, the preset time threshold includes a first preset time threshold and a second preset time threshold, where the first preset time threshold is greater than the second preset time threshold;

the second determining submodule includes:

a fourth determining unit, configured to determine that the question end mute time is a fourth mute time when the average interval time is greater than the first preset time threshold;

a fifth determining unit, configured to determine that the question end mute time is a fifth mute time when the average interval time is smaller than the first preset time threshold and larger than the second preset time threshold;

a sixth determining unit, configured to determine, when the average interval time is smaller than the second preset time threshold, that the question end mute time is a sixth mute time, where the fourth mute time is greater than the fifth mute time, and the fifth mute time is greater than the sixth mute time.

Optionally, the apparatus further comprises:

an instruction response module, configured to, after the question end mute time is determined according to the speech rate information and the preset mute time setting rule, respond to a user instruction corresponding to the currently acquired user voice signal when it is detected that the mute time corresponding to the user voice signal currently collected by the intelligent voice terminal reaches the determined question end mute time, where the user instruction is an instruction determined according to the semantics of the currently acquired user voice signal.

In a third aspect, an embodiment of the present invention further provides an electronic device, including a processor, a memory, and a communication bus, where the processor and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

and the processor is used for implementing the steps of the above method for determining the question end mute time when executing the program stored in the memory.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when executed by a processor, the computer program implements the above method for determining the question end mute time.

In the scheme provided by the embodiment of the invention, a user voice signal acquired by an intelligent voice terminal is first acquired, the speech rate information of the user voice signal is then determined, and finally the question end mute time is determined according to the speech rate information and a preset mute time setting rule, wherein the speech rate information is information for identifying the speech rate characteristics of the user voice signal. By determining the question end mute time in this way, a reasonable question end mute time can be set according to the speech rate characteristics of the user, so that the intelligent voice terminal can respond accurately to users with different speech rates, and the response accuracy and the user experience of the intelligent voice terminal are greatly improved.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for determining a question end mute time according to an embodiment of the present invention;

Fig. 2 is a detailed flowchart of step S102 in the embodiment shown in Fig. 1;

Fig. 3 is another detailed flowchart of step S102 in the embodiment shown in Fig. 1;

Fig. 4 is a schematic structural diagram of an apparatus for determining a question end mute time according to an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.

In order to improve response accuracy and user experience of an intelligent voice terminal, the embodiment of the invention provides a method and a device for determining question end mute time, electronic equipment and a computer-readable storage medium.

First, a method for determining a question end mute time provided in an embodiment of the present invention is described below.

The method for determining the question end mute time provided by the embodiment of the invention can be applied to a server in communication connection with an intelligent voice terminal, and is hereinafter referred to as the server for short. The intelligent voice terminal may be any intelligent device capable of performing voice interaction with a user through voice control, for example, the intelligent voice terminal may be an intelligent sound box, a voice robot, and the like, which is not limited specifically herein.

As shown in Fig. 1, a method for determining a question end mute time includes:

S101, acquiring a user voice signal acquired by an intelligent voice terminal;

S102, determining the speech rate information of the user speech signal, wherein the speech rate information is information for identifying the speech rate characteristics of the user speech signal;

S103, determining the question end mute time according to the speech rate information and a preset mute time setting rule.

It can be seen that, in the scheme provided in the embodiment of the present invention, the server may first obtain the user voice signal acquired by the intelligent voice terminal, then determine the speech rate information of the user voice signal, and finally determine the question end mute time according to the speech rate information and the preset mute time setting rule, where the speech rate information is information identifying the speech rate characteristics of the user voice signal. By determining the question end mute time in this way, a reasonable question end mute time can be set according to the speech rate characteristics of the user, so that the intelligent voice terminal can respond accurately to users with different speech rates, and the response accuracy and the user experience of the intelligent voice terminal are greatly improved.

In the step S101, when the user speaks, that is, sends the user voice signal, the intelligent voice terminal collects the user voice signal and sends the user voice signal to the server in real time, so that the server can obtain the user voice signal collected by the intelligent voice terminal.

In an embodiment, the user voice signal obtained by the server may be the user voice signal collected by the intelligent voice terminal at the current moment, for example the voice signal corresponding to a sentence or a few words that the user is speaking at the current moment. The question end mute time determined by the server is then the question end mute time corresponding to that sentence or those few words. That is, when the user finishes the current question and the intelligent voice terminal collects the next segment of user voice signal, the server determines the question end mute time again for that next segment, so that the question end mute time is determined dynamically and in real time for each segment of user voice signal.

In another embodiment, the user voice signal obtained by the server may be the user voice signal collected by the intelligent voice terminal within a period of time, where the period of time may be 3 days, 5 days, a week, and the like, and is not specifically limited herein. That is to say, the server may determine the question end mute time once per preset time, based on the speech rate information of all or a part of the user voice signals collected by the intelligent voice terminal within that preset time.

After the server obtains the user voice signal collected by the intelligent voice terminal, it can determine the speech rate information of the user voice signal, that is, execute step S102. The speech rate information is information for identifying the speech rate characteristics of the user voice signal, that is, information indicating how fast the user speaks. It may be, for example, a speech rate or an average interval time between characters, and is not specifically limited herein. The server may determine the speech rate information of the user voice signal in any manner commonly used in the field of voice signal processing, such as voice recognition, which is not specifically limited or described herein.

For example, if the speech rate information of the user speech signal is an average speech rate and the preset time is 3 days, the server may calculate the average speech rate of all or a part of the user speech signals acquired within 3 days as the speech rate information, in which case, the server may set the question end mute time every 3 days.
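The periodic variant in the example above can be sketched as follows. The record format (timestamp, character count, duration), the function name `windowed_average_rate`, and the choice to return `None` when no speech falls inside the window are assumptions of the sketch; the text only states that the average speech rate of all or part of the signals collected within the preset time is used.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# One record per recognized utterance: (timestamp, character count, duration in seconds).
# This record layout is hypothetical.
Utterance = Tuple[datetime, int, float]


def windowed_average_rate(
    utterances: Iterable[Utterance],
    now: datetime,
    window: timedelta = timedelta(days=3),
) -> Optional[float]:
    """Average speech rate (characters per second) over the preset time window."""
    recent = [(n, d) for ts, n, d in utterances if now - window <= ts <= now]
    total_chars = sum(n for n, _ in recent)
    total_duration = sum(d for _, d in recent)
    if total_duration <= 0:
        return None  # no usable speech in the window; caller keeps the previous setting
    # Total characters over total duration, computed on the same set of signals.
    return total_chars / total_duration
```

A server following this variant might run such a computation once per window (every 3 days in the example) and feed the result into its mute time setting rule.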

Next, in step S103, the server may determine the question end mute time according to the speech rate information and the preset mute time setting rule. For example, the question end mute time may be determined according to the speech rate of the user speech signal and the magnitude relationship of a preset speech rate threshold. For clarity of the scheme and layout, a specific implementation manner in which the server determines the question end silent time according to the speech rate information and the preset silent time setting rule will be described in the following.

It should be noted that, in this document, "characters" and "words" refer to the units that make up a sentence, divided according to the conventions of each language and usually separated by pauses in the user's speech. For example, in Chinese, the unit is the Chinese character: the sentence "今天天气怎么样" ("How is the weather today") contains 7 characters, namely "今", "天", "天", "气", "怎", "么" and "样". In English, the unit may be a word. Similarly, in other languages such as Korean, Japanese and French, the unit is whatever forms a sentence according to the conventions of that language; these are not illustrated one by one here.

In order to adjust the question end mute time corresponding to the user voice signal in real time, so that the intelligent voice terminal can accurately respond to voice signals sent by different users, as an implementation manner of the embodiment of the present invention, the step of acquiring the user voice signal acquired by the intelligent voice terminal may include: acquiring the user voice signal acquired by the intelligent voice terminal in real time.

The server can acquire the user voice signal acquired by the intelligent voice terminal in real time, namely, the user voice signal acquired by the intelligent voice terminal is transmitted to the server, and the server receives the user voice signal and performs corresponding processing.

Correspondingly, before the step of determining the speech rate information of the user speech signal, the method may further include: detecting that the duration of the user voice signal reaches a preset duration.

In this case, the question end mute time is determined in real time; that is, while the user is speaking a sentence, the question end mute time corresponding to that sentence has not yet been determined. In order to respond to the user voice signal, the server may therefore monitor, while acquiring the user voice signal, whether its duration has reached the preset duration and, if so, perform the step of determining the speech rate information of the user voice signal.

The preset duration can be determined according to the length of time a typical user takes to speak a sentence. It is not specifically limited herein, as long as it ensures that the characters corresponding to the user voice signal include more than two characters.

Correspondingly, the step of determining the question end mute time according to the speech rate information and the preset mute time setting rule may include: determining the question end mute time corresponding to the currently acquired user voice signal according to the speech rate information and the preset mute time setting rule.

Once the server detects that the duration of the user voice signal has reached the preset duration, it can determine the speech rate information of the user voice signal and then, according to the speech rate information and the preset mute time setting rule, determine the question end mute time corresponding to the currently acquired user voice signal, which can be understood as the question end mute time corresponding to the sentence the user is currently speaking.

For example, suppose the preset duration is 500 milliseconds. When the server detects that the duration of the user voice signal has reached 500 milliseconds, it determines the speech rate information of the user voice signal and then determines the corresponding question end mute time according to the speech rate information and the preset mute time setting rule. Suppose the question end mute time is determined to be 600 milliseconds. Because the server monitors the mute time of the user voice signal while receiving it, when the monitored mute time reaches 600 milliseconds the server judges that the user's question has ended, and then performs processing such as recognition and analysis in order to respond to the user instruction corresponding to the user voice signal.
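The real-time flow in this example can be sketched as a loop over fixed-length audio frames. Everything here is a hypothetical simplification: each frame is reduced to a boolean produced by some voice activity detector, `decide_mute_ms` stands in for the speech rate analysis and rule lookup, and `default_mute_ms` is an assumed fallback used before the preset duration has been reached, which the text does not describe.

```python
from typing import Callable, Iterable, Optional


def find_question_end(
    frames: Iterable[bool],              # True = speech detected in this frame
    frame_ms: int,                       # length of one frame in milliseconds
    preset_ms: int,                      # preset duration, e.g. 500
    decide_mute_ms: Callable[[], int],   # stand-in for steps S102 and S103
    default_mute_ms: int = 1000,         # assumed fallback before the rule has run
) -> Optional[int]:
    """Return the time in ms, measured from the start of speech, at which the
    question is judged to have ended, or None if the stream ends first."""
    signal_ms = 0    # duration of the user voice signal so far
    silence_ms = 0   # length of the current run of silence
    mute_ms = None   # question end mute time, unknown until preset_ms is reached
    for is_speech in frames:
        if signal_ms == 0 and not is_speech:
            continue  # ignore silence before the user starts speaking
        signal_ms += frame_ms
        silence_ms = 0 if is_speech else silence_ms + frame_ms
        if mute_ms is None and signal_ms >= preset_ms:
            mute_ms = decide_mute_ms()  # determined once per utterance
        limit = mute_ms if mute_ms is not None else default_mute_ms
        if silence_ms >= limit:
            return signal_ms
    return None
```

With 100 ms frames, 800 ms of speech followed by silence, a preset duration of 500 ms and a determined mute time of 600 ms, the function reports the question as ended 1400 ms after speech began, matching the 600 ms wait in the example.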

Therefore, in this embodiment, the server can dynamically set the question end mute time corresponding to each sentence spoken by the user according to the user voice signal acquired in real time. When different users use the same intelligent voice terminal, the server can respond accurately to each sentence according to the speech rate characteristics of each user, so that the user experience is further improved.

As an implementation manner of the embodiment of the present invention, regarding the case that the speech rate information is an average speech rate, as shown in fig. 2, the step of determining the speech rate information of the user speech signal may include:

S201, acquiring the duration of the user voice signal;

The server may obtain the duration of the user voice signal, for example by recording it while receiving the user voice signal. Because any manner of obtaining the duration of a voice signal that is used in the field of voice signal processing may be adopted, no limitation or further description is given herein.

If the user voice signal acquired by the server is the user voice signal acquired by the intelligent voice terminal within the preset time, the server can acquire the total duration of all or part of the user voice signals. For example, the user voice signal acquired by the server is a user voice signal in a week, and the duration of the user voice signal acquired by the server may be the total duration of all the user voice signals in the week or the total duration of a part of the user voice signals in the week.

If the server acquires the user voice signal acquired by the intelligent voice terminal at the current moment, the duration of the user voice signal acquired by the server is the duration of the user voice signal acquired by the intelligent voice terminal at the current moment.

S202, carrying out voice recognition on the user voice signal to obtain the number of characters corresponding to the user voice signal;

Next, the server may perform speech recognition on the obtained user speech signal to obtain the number of characters corresponding to the user speech signal. It can be understood that, when the server performs speech recognition on the user speech signal, it obtains the text content corresponding to the user speech signal and can therefore also obtain the number of characters corresponding to the user speech signal.

For example, if speech recognition yields the text content "play next song" (5 characters in the original Chinese), the server can determine that the number of characters corresponding to the user speech signal is 5.

It can be understood that the number of characters and the duration determined in step S201 must refer to the same user speech signal. That is, if the user speech signal is a part of the speech signals collected within a preset time, the number of characters is likewise the number obtained by performing speech recognition on that same part of the user speech signals.

S203, determining the average speech speed of the user speech signal according to the number of the characters and the duration.

After obtaining the number of characters corresponding to the user voice signal, the server can determine the average speed of speech of the user voice signal according to the number of characters and the duration. It can be understood that the speech rate is the speech speed of the user, and can be represented by the number of words spoken in unit time, i.e. the quotient of the number of words corresponding to the speech signal of the user and the duration.

For example, if the number of characters corresponding to the user speech signal is 6, and the duration corresponding to the user speech signal is 3 seconds, the average speech rate of the user speech signal is 6/3 = 2 characters per second, that is, the average speech rate of the user speaking is 2 characters per second.
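Steps S201 to S203 can be sketched as follows, taking the recognized text and the duration as already available. The helper `count_units`, the language codes, and the counting rules are assumptions of the sketch: Chinese is counted as characters in the basic CJK Unified Ideographs range (U+4E00 to U+9FFF), and English as whitespace-separated words. The sample sentences are illustrative inputs only.

```python
def count_units(text: str, language: str) -> int:
    """Count the sentence-forming units in recognized text.

    Assumed rules: "zh" counts characters in the basic CJK Unified
    Ideographs range; "en" counts whitespace-separated words.
    """
    if language == "zh":
        return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    if language == "en":
        return len(text.split())
    raise ValueError(f"unsupported language code: {language!r}")


def average_speech_rate(unit_count: int, duration_s: float) -> float:
    """Average speech rate: number of units divided by duration in seconds."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return unit_count / duration_s


# Illustrative usage: a 6-character utterance lasting 3 seconds.
rate = average_speech_rate(count_units("播放下一首歌", "zh"), 3.0)
```

Punctuation and digits are deliberately ignored in the Chinese branch, so recognizer output that includes punctuation does not inflate the rate.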

Therefore, in this embodiment, the server may obtain the duration of the user voice signal, perform voice recognition on the user voice signal to obtain the number of characters corresponding to the user voice signal, and then determine the average speech rate of the user voice signal according to the number of characters and the duration. The speech rate information of the user speech signal can be quickly and accurately determined, and the speed and the accuracy of subsequently determining the question end mute time are improved.
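The quotient described in step S203 can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name `average_speech_rate` and the input validation are assumptions added for the example.

```python
def average_speech_rate(char_count: int, duration_s: float) -> float:
    """Return the average speech rate in characters per second.

    char_count: number of characters recognized in the user speech signal.
    duration_s: duration of that same signal, in seconds.
    """
    if char_count < 0:
        raise ValueError("char_count must be non-negative")
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return char_count / duration_s


# Worked example from the text: 6 characters spoken in 3 seconds.
print(average_speech_rate(6, 3.0))  # 2.0
```

Both arguments must describe the same portion of the signal, as noted for step S202.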

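As an implementation manner of the embodiment of the present invention, when the speech rate information is an average speech rate, the step of determining the question end mute time according to the speech rate information and a preset mute time setting rule may include:

determining the question end mute time according to the magnitude relationship between the average speech rate and a preset speech rate threshold.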

In this embodiment, the server may determine the question end mute time according to the magnitude relationship between the average speech rate calculated in the embodiment shown in fig. 2 and the preset speech rate threshold. The preset speech rate threshold may be determined according to factors such as the average speech rate of ordinary speakers, and may be, for example, 3 characters per second, 4 characters per second, or 5 characters per second, which is not specifically limited herein. There may be one or more preset speech rate thresholds; either is reasonable and is not specifically limited herein.

In this case, as an implementation manner of the embodiment of the present invention, the preset speech rate threshold may include a first preset speech rate threshold and a second preset speech rate threshold, where the first preset speech rate threshold is smaller than the second preset speech rate threshold.

In one embodiment, the average speech rate of a person with slower speech may be used as the first preset speech rate threshold, and the average speech rate of a person with faster speech may be used as the second preset speech rate threshold.

Correspondingly, the step of determining the question end mute time according to the magnitude relationship between the average speech rate and the preset speech rate threshold may include:

when the average speech rate is smaller than the first preset speech rate threshold value, determining the question end mute time as a first mute time; when the average speech rate is greater than the first preset speech rate threshold and less than the second preset speech rate threshold, determining question end mute time as second mute time; and when the average speech rate is greater than the second preset speech rate threshold value, determining the question end mute time as a third mute time. The first mute time is greater than the second mute time, and the second mute time is greater than the third mute time.

When the server determines the question end mute time according to the magnitude relationship between the average speech rate and the preset speech rate threshold, the server may compare the average speech rate with the first preset speech rate threshold and the second preset speech rate threshold. If the average speech rate is less than the first preset speech rate threshold, the average speech rate is slow, that is, the user speaks slowly, and the server may determine the question end mute time as the first mute time. It can be understood that the first mute time should be relatively long, so that the intelligent voice terminal does not respond prematurely, before the user has finished speaking. Generally, the first mute time may be 700 milliseconds, which keeps the intelligent voice terminal from responding prematurely to the user instruction without making the response too slow.

If the average speech rate is greater than the first preset speech rate threshold and less than the second preset speech rate threshold, the average speech rate is moderate, neither fast nor slow, that is, the user speaks at a moderate rate, and the server may determine the question end mute time as the second mute time. It can be understood that the second mute time should be neither too long nor too short. Generally, the second mute time may be 500 milliseconds, which keeps the intelligent voice terminal from responding prematurely to the user instruction without making the response too slow.

If the average speech rate is greater than the second preset speech rate threshold, the average speech rate is fast, that is, the user speaks quickly, and the server may determine the question end mute time as the third mute time. It can be understood that the third mute time should be relatively short, so that the response speed is improved as much as possible while the intelligent voice terminal still does not respond prematurely, and the user does not wait long after finishing speaking. Generally, the third mute time may be 300 milliseconds, which keeps the intelligent voice terminal from responding prematurely to the user instruction while improving the response speed as much as possible.

Therefore, in this embodiment, the server may set three question end mute times of different lengths according to the magnitude relationship between the average speech rate and the first and second preset speech rate thresholds. This keeps the intelligent voice terminal from responding prematurely to the user instruction as far as possible, improves the response speed as much as possible, adapts to the speaking habits of different users, and further improves the user experience.
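The three-tier rule above can be sketched as a small lookup. The mute times of 700, 500 and 300 milliseconds are the example values given in the text; the threshold values of 3 and 5 characters per second are assumptions picked from the example range, and the handling of an average speech rate exactly equal to a threshold (assigned here to the middle tier) is also an assumption, since the text only states strict inequalities.

```python
# Example mute times from the text, in milliseconds.
FIRST_MUTE_MS = 700   # slow speaker
SECOND_MUTE_MS = 500  # moderate speaker
THIRD_MUTE_MS = 300   # fast speaker

# Assumed thresholds, in characters per second (illustrative only).
FIRST_RATE_THRESHOLD = 3.0
SECOND_RATE_THRESHOLD = 5.0


def mute_time_from_speech_rate(
    avg_rate: float,
    first_threshold: float = FIRST_RATE_THRESHOLD,
    second_threshold: float = SECOND_RATE_THRESHOLD,
) -> int:
    """Map an average speech rate to a question end mute time in ms."""
    if first_threshold >= second_threshold:
        raise ValueError("first threshold must be smaller than the second")
    if avg_rate < first_threshold:
        return FIRST_MUTE_MS
    if avg_rate > second_threshold:
        return THIRD_MUTE_MS
    # Between the thresholds; equality is placed here by assumption.
    return SECOND_MUTE_MS


print(mute_time_from_speech_rate(2.0))  # 700
print(mute_time_from_speech_rate(4.0))  # 500
print(mute_time_from_speech_rate(6.0))  # 300
```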

As an implementation manner of the embodiment of the present invention, regarding the case that the speech rate information is an average interval time between words, as shown in fig. 3, the step of determining the speech rate information of the user speech signal may include:

S301, performing voice recognition on the user voice signal to obtain the interval time between adjacent characters in the characters corresponding to the user voice signal;

The server can perform voice recognition on the user voice signal while receiving it, and thereby obtain the interval time between adjacent characters in the characters corresponding to the user voice signal. It should be noted that the interval time between adjacent characters refers to the interval time between the units that make up a sentence, as distinguished according to the habits of each language.

Illustratively, if the text corresponding to the user speech signal is "你在干什么" ("what are you doing"), the interval times between adjacent characters are the interval times between "你" and "在", between "在" and "干", between "干" and "什", and between "什" and "么".

If the user voice signal acquired by the server is the user voice signal acquired by the intelligent voice terminal within the preset time, the server can acquire the interval time between adjacent characters in all or part of corresponding characters in the user voice signals. For example, the user speech signal acquired by the server is a user speech signal within 3 days, and then the interval time between adjacent characters in the characters corresponding to the user speech signal acquired by the server is the interval time between adjacent characters in all or a part of the characters corresponding to the user speech signal within the 3 days.

If the server acquires the user voice signal acquired by the intelligent voice terminal at the current moment, the interval time between adjacent characters in the characters corresponding to the user voice signal acquired by the server is the interval time between adjacent characters in the characters corresponding to the user voice signal acquired by the intelligent voice terminal at the current moment.

The specific manner of obtaining the interval time between adjacent characters in the characters corresponding to the voice signal of the user may be determined by the frequency spectrum corresponding to the voice signal or the time corresponding to the peak and the trough in the waveform diagram, and is not specifically limited herein.

S302, calculating the average interval time corresponding to the user voice signal according to the interval time.

After the interval time is determined, the server can calculate the average interval time corresponding to the voice signal of the user. For example, if the text corresponding to the user speech signal is "你在干什么" ("what are you doing"), and the interval times between "你" and "在", between "在" and "干", between "干" and "什", and between "什" and "么" are 400 ms, 450 ms, 420 ms and 435 ms respectively, then the average interval time corresponding to the user speech signal is (400 + 450 + 420 + 435)/4 = 426.25 ms.

As can be seen, in this embodiment, the server may perform speech recognition on the user speech signal to obtain an interval time between adjacent characters in the characters corresponding to the user speech signal, and then determine an average interval time of the user speech signal according to the interval time. The speech rate information of the user speech signal can be quickly and accurately determined, and the speed and the accuracy of subsequently determining the question end mute time are improved.
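Steps S301 and S302 can be sketched as follows, assuming the recognizer reports a start and end time for each character. That per-character timing format is an assumption made for the example; the text leaves the extraction method open (for instance the spectrum, or peaks and troughs in the waveform). The function names are hypothetical.

```python
from typing import List, Tuple

# (start_ms, end_ms) of one recognized character; hypothetical format.
CharSpan = Tuple[int, int]


def adjacent_intervals_ms(spans: List[CharSpan]) -> List[int]:
    """Gap between the end of each character and the start of the next."""
    return [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]


def average_interval_ms(intervals: List[int]) -> float:
    """Arithmetic mean of the interval times, in milliseconds."""
    if not intervals:
        raise ValueError("at least two characters are needed")
    return sum(intervals) / len(intervals)


# Five characters whose gaps reproduce the example in the text.
spans = [(0, 200), (600, 800), (1250, 1450), (1870, 2070), (2505, 2705)]
intervals = adjacent_intervals_ms(spans)
print(intervals)                       # [400, 450, 420, 435]
print(average_interval_ms(intervals))  # 426.25
```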

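As an implementation manner of the embodiment of the present invention, when the speech rate information is an average interval time, the step of determining the question end mute time according to the speech rate information and a preset mute time setting rule may include:

determining the question end mute time according to the magnitude relationship between the average interval time and a preset time threshold.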

In this embodiment, the server may determine the question end mute time according to the magnitude relationship between the average interval time calculated in the embodiment shown in fig. 3 and the preset time threshold. The preset time threshold may be determined according to factors such as statistics of the intervals between characters when ordinary people speak, and may be, for example, 350 milliseconds, 400 milliseconds, or 450 milliseconds, which is not specifically limited herein. There may be one or more preset time thresholds; either is reasonable and is not specifically limited herein.

In this case, as an implementation manner of the embodiment of the present invention, the preset time threshold may include a first preset time threshold and a second preset time threshold, where the first preset time threshold is greater than the second preset time threshold.

In one embodiment, the average interval time between characters for people who generally speak slowly may be used as the first preset time threshold, and the average interval time between characters for people who generally speak fast may be used as the second preset time threshold.

Correspondingly, the step of determining the question end mute time according to the magnitude relationship between the average interval time and the preset time threshold may include:

when the average interval time is greater than the first preset time threshold, determining that the question end mute time is a fourth mute time; when the average interval time is smaller than the first preset time threshold and larger than the second preset time threshold, determining the question end mute time as fifth mute time; and when the average interval time is smaller than the second preset time threshold, determining that the question end mute time is a sixth mute time, wherein the fourth mute time is larger than the fifth mute time, and the fifth mute time is larger than the sixth mute time.

When the server determines the question end mute time according to the magnitude relationship between the average interval time and the preset time threshold, the server may compare the average interval time with the first preset time threshold and the second preset time threshold. If the average interval time is greater than the first preset time threshold, the average interval time is long, that is, the user leaves long intervals between characters when speaking, and the server may determine the question end mute time as the fourth mute time. It can be understood that the fourth mute time should be relatively long, so that the intelligent voice terminal does not respond prematurely, before the user has finished speaking. Generally, the fourth mute time may be 700 milliseconds, which keeps the intelligent voice terminal from responding prematurely to the user instruction without making the response too slow.

If the average interval time is smaller than the first preset time threshold and larger than the second preset time threshold, the average interval time is moderate, neither long nor short, that is, the user leaves moderate intervals between characters when speaking, and the server may determine the question end mute time as the fifth mute time. It can be understood that the fifth mute time should be neither too long nor too short. Generally, the fifth mute time may be 500 milliseconds, which keeps the intelligent voice terminal from responding prematurely to the user instruction without making the response too slow.

If the average interval time is less than the second preset time threshold, the average interval time is short, that is, the user leaves short intervals between characters when speaking, and the server may determine the question end mute time as the sixth mute time. It can be understood that the sixth mute time should be relatively short, so that the response speed is improved as much as possible while the intelligent voice terminal still does not respond prematurely, and the user does not wait long after finishing speaking. Generally, the sixth mute time may be 300 milliseconds, which keeps the intelligent voice terminal from responding prematurely to the user instruction while improving the response speed as much as possible.

Therefore, in this embodiment, the server can set three question end mute times of different lengths according to the magnitude relationship between the average interval time and the first and second preset time thresholds. This keeps the intelligent voice terminal from responding prematurely to the user instruction as far as possible, improves the response speed as much as possible, adapts to the speaking habits of different users, and further improves the user experience.
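The interval-based rule mirrors the speech-rate rule with the comparison direction reversed, because a longer interval indicates a slower speaker. The sketch below uses the example mute times from the text; the threshold values of 450 and 350 milliseconds are assumptions taken from the example range, and placing an exactly-equal value in the middle tier is likewise an assumption.

```python
# Example mute times from the text, in milliseconds.
FOURTH_MUTE_MS = 700  # long gaps between characters
FIFTH_MUTE_MS = 500   # moderate gaps
SIXTH_MUTE_MS = 300   # short gaps

# Assumed thresholds in milliseconds (illustrative only).
FIRST_TIME_THRESHOLD_MS = 450.0
SECOND_TIME_THRESHOLD_MS = 350.0


def mute_time_from_interval(
    avg_interval_ms: float,
    first_threshold_ms: float = FIRST_TIME_THRESHOLD_MS,
    second_threshold_ms: float = SECOND_TIME_THRESHOLD_MS,
) -> int:
    """Map an average inter-character interval to a mute time in ms."""
    if first_threshold_ms <= second_threshold_ms:
        raise ValueError("first threshold must be greater than the second")
    if avg_interval_ms > first_threshold_ms:
        return FOURTH_MUTE_MS
    if avg_interval_ms < second_threshold_ms:
        return SIXTH_MUTE_MS
    # Between the thresholds; equality is placed here by assumption.
    return FIFTH_MUTE_MS


print(mute_time_from_interval(500.0))   # 700
print(mute_time_from_interval(426.25))  # 500
print(mute_time_from_interval(300.0))   # 300
```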

As an implementation manner of the embodiment of the present invention, after the step of determining the question end mute time according to the speech rate information and the preset mute time setting rule, the method may further include:

when it is detected that the mute time corresponding to the user voice signal currently collected by the intelligent voice terminal reaches the determined question end mute time, responding to the user instruction corresponding to the currently collected user voice signal.

The user instruction is determined according to the semantics of the currently acquired user voice signal. For example, the server determines the semantic meaning of the currently collected user voice signal as "how today's weather" through voice recognition, and then the user instruction may be "play weather forecast". For another example, the server determines that the semantic of the currently collected user voice signal is "play next song" through voice recognition, and then the user command may be "play next song".

As described above, the server receives the user voice signal sent by the intelligent voice terminal and detects, in real time, the mute time corresponding to that signal. When the mute time corresponding to the user voice signal currently collected by the intelligent voice terminal reaches the determined question end mute time, the user's question has ended. The server can then perform voice recognition on the received user voice signal, determine the semantics of the currently collected user voice signal and the user instruction corresponding to those semantics, and respond to the user instruction.

For example, if the user instruction is "play weather forecast", the server may obtain weather forecast information from a network resource or by other means, and send the weather forecast information to the intelligent voice terminal, so that the intelligent voice terminal plays the weather forecast information, and the user may know the weather forecast.

It can be seen that, in this embodiment, when the server detects that the mute time corresponding to the user voice signal currently collected by the intelligent voice terminal reaches the determined question end mute time, it responds to the user instruction corresponding to that signal. The server can thus judge from the determined question end mute time that the user has finished asking and then respond to the user instruction, which improves the user experience.
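The detection step can be sketched as a scan over fixed-length audio frames, each already labelled as speech or silence. The frame labelling (voice activity detection), the 20 ms frame length and the function name `find_question_end` are assumptions made for the example; the text does not specify how the mute time is measured.

```python
from typing import Iterable, Optional


def find_question_end(
    frames_have_speech: Iterable[bool],
    frame_ms: int,
    mute_time_ms: int,
) -> Optional[int]:
    """Return the index of the frame at which trailing silence first
    reaches mute_time_ms after speech has started, or None if never."""
    silence_ms = 0
    speech_seen = False
    for index, has_speech in enumerate(frames_have_speech):
        if has_speech:
            speech_seen = True
            silence_ms = 0  # any speech resets the silence counter
        elif speech_seen:
            silence_ms += frame_ms
            if silence_ms >= mute_time_ms:
                return index
    return None


# 100 ms speech, a 400 ms pause, 100 ms speech, then 800 ms of silence.
frames = [True] * 5 + [False] * 20 + [True] * 5 + [False] * 40

print(find_question_end(frames, 20, 300))   # 19
print(find_question_end(frames, 20, 700))   # 64
print(find_question_end(frames, 20, 1000))  # None
```

With a 300 ms mute time the 400 ms mid-sentence pause is taken as the end of the question, whereas a 700 ms mute time waits for the real end. This is the premature-response problem that a longer mute time for slow speakers is meant to avoid.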

Corresponding to the above method embodiment, the embodiment of the present invention further provides a device for determining the question end mute time. The following describes a device for determining question end mute time according to an embodiment of the present invention.

As shown in fig. 4, an apparatus for determining an end-of-question mute time, the apparatus comprising:

a voice signal obtaining module 410, configured to obtain a user voice signal collected by an intelligent voice terminal;

a speech rate information determining module 420, configured to determine speech rate information of the user speech signal;

the speech rate information is information for identifying speech rate characteristics of the user speech signal.

And a mute time determination module 430, configured to determine a mute time for ending the question according to the speech rate information and a preset mute time setting rule.

It can be seen that, in the scheme provided in the embodiment of the present invention, the user voice signal acquired by the intelligent voice terminal is first acquired, then the speech rate information of the user voice signal is determined, and finally the question end mute time is determined according to the speech rate information and the preset mute time setting rule, where the speech rate information is information identifying the speech rate characteristics of the user voice signal. By adopting the method, the question end mute time is determined, the reasonable question end mute time can be set according to the speech speed characteristics of the user, the intelligent voice terminal can also accurately respond to the users with different speech speeds, and the response accuracy and the user experience of the intelligent voice terminal are greatly improved.
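The module structure of fig. 4 can be pictured as three pluggable stages run in sequence. The sketch below is only a structural illustration: the class name `MuteTimeDevice`, the callable signatures and the stub stages are all hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MuteTimeDevice:
    """Three stages mirroring modules 410, 420 and 430."""

    obtain_signal: Callable[[], bytes]             # module 410
    determine_rate_info: Callable[[bytes], float]  # module 420
    determine_mute_time: Callable[[float], int]    # module 430

    def run(self) -> int:
        signal = self.obtain_signal()
        rate_info = self.determine_rate_info(signal)
        return self.determine_mute_time(rate_info)


# Stub stages, for illustration only.
device = MuteTimeDevice(
    obtain_signal=lambda: b"\x00" * 16,
    determine_rate_info=lambda signal: 2.0,  # pretend 2 characters/second
    determine_mute_time=lambda rate: 700 if rate < 3.0 else 500,
)
print(device.run())  # 700
```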

As an implementation manner of the embodiment of the present invention, the voice signal obtaining module 410 may include:

a real-time obtaining sub-module (not shown in fig. 4) for obtaining a user voice signal collected by the intelligent voice terminal in real time;

the apparatus may further include:

a preset duration monitoring module (not shown in fig. 4) configured to monitor that a duration of the user voice signal reaches a preset duration before the determining of the speech rate information of the user voice signal;

the mute time determination module 430 may include:

a mute time determination submodule (not shown in fig. 4) configured to determine a question end mute time corresponding to the currently acquired user speech signal according to the speech rate information and a preset mute time setting rule.
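The real-time variant can be sketched as follows: no mute time is chosen until the current signal has lasted for the preset duration, after which the speech rate measured so far selects the mute time for that same signal. The preset duration of 2 seconds and the thresholds of 3 and 5 characters per second are assumptions for illustration; the mute times are the example values from the method embodiments.

```python
from typing import Optional

PRESET_DURATION_S = 2.0  # assumed preset duration


def mute_time_for_current_signal(
    chars_so_far: int,
    elapsed_s: float,
    preset_duration_s: float = PRESET_DURATION_S,
) -> Optional[int]:
    """Return a mute time in ms once enough audio has been heard,
    or None while the signal is still shorter than the preset duration."""
    if preset_duration_s <= 0:
        raise ValueError("preset_duration_s must be positive")
    if elapsed_s < preset_duration_s:
        return None  # keep listening; not enough speech to measure yet
    rate = chars_so_far / elapsed_s  # characters per second so far
    if rate < 3.0:   # assumed first threshold
        return 700
    if rate > 5.0:   # assumed second threshold
        return 300
    return 500


print(mute_time_for_current_signal(3, 1.0))   # None
print(mute_time_for_current_signal(4, 2.0))   # 700
print(mute_time_for_current_signal(12, 2.0))  # 300
```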

As an implementation manner of the embodiment of the present invention, the speech rate information may be an average speech rate;

the speech rate information determining module 420 may include:

a time length obtaining sub-module (not shown in fig. 4) for obtaining the time length of the user voice signal;

a text number determining sub-module (not shown in fig. 4) configured to perform speech recognition on the user speech signal to obtain a text number corresponding to the user speech signal;

and an average speech rate determining submodule (not shown in fig. 4) configured to determine an average speech rate of the user speech signal according to the number of characters and the duration.

As an implementation manner of the embodiment of the present invention, the mute time determination module 430 may include:

a first determining submodule (not shown in fig. 4) configured to determine the question end mute time according to a size relationship between the average speech rate and a preset speech rate threshold.

As an implementation manner of the embodiment of the present invention, the preset speech rate threshold may include a first preset speech rate threshold and a second preset speech rate threshold, where the first preset speech rate threshold is smaller than the second preset speech rate threshold;

the first determining sub-module may include:

a first determining unit (not shown in fig. 4) configured to determine, when the average speech rate is smaller than the first preset speech rate threshold, that the question end mute time is a first mute time;

a second determining unit (not shown in fig. 4), configured to determine that the question end mute time is a second mute time when the average speech rate is greater than the first preset speech rate threshold and smaller than the second preset speech rate threshold;

a third determining unit (not shown in fig. 4), configured to determine, when the average speech rate is greater than the second preset speech rate threshold, that the question end mute time is a third mute time, where the first mute time is greater than the second mute time, and the second mute time is greater than the third mute time.

As an implementation manner of the embodiment of the present invention, the speech rate information may be average interval time between words;

the speech rate information determination module may include:

an interval time determining submodule (not shown in fig. 4) configured to perform speech recognition on the user speech signal to obtain an interval time between adjacent characters in the characters corresponding to the user speech signal;

and an average interval time determining submodule (not shown in fig. 4) for calculating an average interval time corresponding to the user voice signal according to the interval time.

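As an implementation manner of the embodiment of the present invention, the mute time determination module 430 may include:

a second determining submodule (not shown in fig. 4) configured to determine the question end mute time according to a magnitude relationship between the average interval time and a preset time threshold.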

As an implementation manner of the embodiment of the present invention, the preset time threshold may include a first preset time threshold and a second preset time threshold, where the first preset time threshold is greater than the second preset time threshold;

the second determination submodule may include:

a fourth determining unit (not shown in fig. 4) configured to determine the question end mute time as a fourth mute time when the average interval time is greater than the first preset time threshold;

a fifth determining unit (not shown in fig. 4) configured to determine the question end mute time as a fifth mute time when the average interval time is smaller than the first preset time threshold and larger than the second preset time threshold;

a sixth determining unit (not shown in fig. 4) configured to determine the question end mute time as a sixth mute time when the average interval time is smaller than the second preset time threshold.

Wherein the fourth mute time is greater than the fifth mute time, and the fifth mute time is greater than the sixth mute time.

As an implementation manner of the embodiment of the present invention, the apparatus may further include:

an instruction response module (not shown in fig. 4) configured to, after the question end mute time is determined according to the speech rate information and the preset mute time setting rule, respond to the user instruction corresponding to the currently collected user voice signal when it is detected that the mute time corresponding to the user voice signal currently collected by the intelligent voice terminal reaches the determined question end mute time.

And the user instruction is determined according to the semantics of the currently acquired user voice signal.

An embodiment of the present invention further provides an electronic device, as shown in fig. 5, which includes a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 complete mutual communication through the communication bus 504,

a memory 503 for storing a computer program;

the processor 501, when executing the program stored in the memory 503, implements the following steps:

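acquiring a user voice signal collected by an intelligent voice terminal;

determining speech rate information of the user speech signal, wherein the speech rate information is information for identifying speech rate characteristics of the user speech signal;

and determining the question end mute time according to the speech rate information and a preset mute time setting rule.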

It can be seen that, in the scheme provided in the embodiment of the present invention, the electronic device may first obtain the user voice signal acquired by the intelligent voice terminal, then determine the speech rate information of the user voice signal, and finally determine the question end mute time according to the speech rate information and the preset mute time setting rule, where the speech rate information is information identifying the speech rate characteristics of the user voice signal. By adopting the method, the question end mute time is determined, the reasonable question end mute time can be set according to the speech speed characteristics of the user, the intelligent voice terminal can also accurately respond to the users with different speech speeds, and the response accuracy and the user experience of the intelligent voice terminal are greatly improved.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.

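Wherein, the step of acquiring the user voice signal collected by the intelligent voice terminal may include:

acquiring, in real time, the user voice signal collected by the intelligent voice terminal;

before the step of determining the speech rate information of the user speech signal, the method may further include:

monitoring that the duration of the user voice signal reaches a preset duration;

the step of determining the question end mute time according to the speech rate information and the preset mute time setting rule may include:

determining the question end mute time corresponding to the currently acquired user speech signal according to the speech rate information and the preset mute time setting rule.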

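Wherein, the speech rate information may be an average speech rate;

the step of determining the speech rate information of the user speech signal may include:

acquiring the duration of the user voice signal;

performing speech recognition on the user speech signal to obtain the number of characters corresponding to the user speech signal;

determining the average speech rate of the user speech signal according to the number of characters and the duration.

The step of determining the question end mute time according to the speech rate information and the preset mute time setting rule may include:

determining the question end mute time according to the magnitude relationship between the average speech rate and a preset speech rate threshold.

The preset speech rate threshold may include a first preset speech rate threshold and a second preset speech rate threshold, where the first preset speech rate threshold is smaller than the second preset speech rate threshold;

the step of determining the question end mute time according to the magnitude relationship between the average speech rate and the preset speech rate threshold may include:

when the average speech rate is smaller than the first preset speech rate threshold, determining the question end mute time as a first mute time; when the average speech rate is greater than the first preset speech rate threshold and less than the second preset speech rate threshold, determining the question end mute time as a second mute time; and when the average speech rate is greater than the second preset speech rate threshold, determining the question end mute time as a third mute time, wherein the first mute time is greater than the second mute time, and the second mute time is greater than the third mute time.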

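The speech rate information may be the average interval time between words;

the step of determining the speech rate information of the user speech signal may include:

performing speech recognition on the user speech signal to obtain the interval time between adjacent characters in the characters corresponding to the user speech signal;

calculating the average interval time corresponding to the user voice signal according to the interval time.

The step of determining the question end mute time according to the speech rate information and the preset mute time setting rule may include:

determining the question end mute time according to the magnitude relationship between the average interval time and a preset time threshold.

The preset time threshold may include a first preset time threshold and a second preset time threshold, where the first preset time threshold is greater than the second preset time threshold;

the step of determining the question end mute time according to the magnitude relationship between the average interval time and the preset time threshold may include:

when the average interval time is greater than the first preset time threshold, determining the question end mute time as a fourth mute time; when the average interval time is smaller than the first preset time threshold and larger than the second preset time threshold, determining the question end mute time as a fifth mute time; and when the average interval time is smaller than the second preset time threshold, determining the question end mute time as a sixth mute time, wherein the fourth mute time is greater than the fifth mute time, and the fifth mute time is greater than the sixth mute time.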

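After the step of determining the question end mute time according to the speech rate information and the preset mute time setting rule, the method may further include:

when it is detected that the mute time corresponding to the user voice signal currently collected by the intelligent voice terminal reaches the determined question end mute time, responding to the user instruction corresponding to the currently collected user voice signal, wherein the user instruction is determined according to the semantics of the currently collected user voice signal.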

An embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and when executed by a processor, the computer program implements the following steps:

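acquiring a user voice signal collected by an intelligent voice terminal;

determining speech rate information of the user speech signal, wherein the speech rate information is information for identifying speech rate characteristics of the user speech signal;

and determining the question end mute time according to the speech rate information and a preset mute time setting rule.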

It can be seen that, in the solution provided in the embodiment of the present invention, when the computer program is executed by the processor, the user voice signal acquired by the intelligent voice terminal is first acquired, then the speech rate information of the user voice signal is determined, and finally the question end mute time is determined according to the speech rate information and the preset mute time setting rule, where the speech rate information is information identifying the speech rate characteristics of the user voice signal. By adopting the method, the question end mute time is determined, the reasonable question end mute time can be set according to the speech speed characteristics of the user, the intelligent voice terminal can also accurately respond to the users with different speech speeds, and the response accuracy and the user experience of the intelligent voice terminal are greatly improved.

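Wherein, the speech rate information may be an average speech rate;

the step of determining the speech rate information of the user speech signal may include:

acquiring the duration of the user voice signal;

performing speech recognition on the user speech signal to obtain the number of characters corresponding to the user speech signal;

determining the average speech rate of the user speech signal according to the number of characters and the duration.

The speech rate information may alternatively be the average interval time between words;

the step of determining the speech rate information of the user speech signal may then include:

performing speech recognition on the user speech signal to obtain the interval time between adjacent characters in the characters corresponding to the user speech signal;

calculating the average interval time corresponding to the user voice signal according to the interval time.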

It should be noted that, for the above-mentioned apparatus, electronic device and computer-readable storage medium embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.

It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above describes only preferred embodiments of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for determining a question end mute time, the method comprising:

acquiring a user voice signal acquired by an intelligent voice terminal;

determining a question end mute time according to the speech rate information and a preset mute time setting rule, wherein the question end mute time is used to identify that the user has finished the current question;

wherein determining the question end mute time according to the speech rate information and the preset mute time setting rule comprises:

when the average speech rate of the user voice signal is less than a first preset speech rate threshold, determining the question end mute time to be a first mute time; when the average speech rate of the user voice signal is greater than the first preset speech rate threshold and less than a second preset speech rate threshold, determining the question end mute time to be a second mute time; when the average speech rate of the user voice signal is greater than the second preset speech rate threshold, determining the question end mute time to be a third mute time, wherein the first mute time is greater than the second mute time, and the second mute time is greater than the third mute time; or,

when the average interval time of the user voice signal is greater than a first preset time threshold, determining the question end mute time to be a fourth mute time; when the average interval time of the user voice signal is less than the first preset time threshold and greater than a second preset time threshold, determining the question end mute time to be a fifth mute time; and when the average interval time of the user voice signal is less than the second preset time threshold, determining the question end mute time to be a sixth mute time, wherein the fourth mute time is greater than the fifth mute time, and the fifth mute time is greater than the sixth mute time.

2. The method of claim 1, wherein the step of obtaining the user voice signal collected by the intelligent voice terminal comprises:

3. The method of claim 1, wherein said speech rate information is an average speech rate;

acquiring the duration of the user voice signal;

4. The method of claim 1, wherein the speech rate information is the average interval time between words;

5. The method according to any one of claims 1-4, wherein after said step of determining a question end mute time according to said speech rate information and a preset mute time setting rule, said method further comprises:

6. An apparatus for determining a question end mute time, the apparatus comprising:

a mute time determination module, configured to determine a question end mute time according to the speech rate information and a preset mute time setting rule, wherein the question end mute time is used to identify that the user has finished the current question;

the mute time determination module is specifically configured to: when the average speech rate of the user voice signal is less than a first preset speech rate threshold, determine the question end mute time to be a first mute time; when the average speech rate of the user voice signal is greater than the first preset speech rate threshold and less than a second preset speech rate threshold, determine the question end mute time to be a second mute time; when the average speech rate of the user voice signal is greater than the second preset speech rate threshold, determine the question end mute time to be a third mute time, wherein the first mute time is greater than the second mute time, and the second mute time is greater than the third mute time; or, when the average interval time of the user voice signal is greater than a first preset time threshold, determine the question end mute time to be a fourth mute time; when the average interval time of the user voice signal is less than the first preset time threshold and greater than a second preset time threshold, determine the question end mute time to be a fifth mute time; and when the average interval time of the user voice signal is less than the second preset time threshold, determine the question end mute time to be a sixth mute time, wherein the fourth mute time is greater than the fifth mute time, and the fifth mute time is greater than the sixth mute time.

7. The apparatus of claim 6, wherein the voice signal acquisition module comprises:

the device further comprises:

the mute time determination module comprises:

8. The apparatus of claim 6, wherein said speech rate information is an average speech rate;

the speech rate information determining module comprises:

9. The apparatus of claim 6, wherein said speech rate information is the average interval time between words;

the speech rate information determining module comprises:

10. The apparatus of any of claims 6-9, wherein the apparatus further comprises:

11. An electronic device, characterized by comprising a processor, a memory and a communication bus, wherein the processor and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any one of claims 1 to 5 when executing a program stored in the memory.

12. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements the method steps of any one of claims 1-5.