CN109377998B - Voice interaction method and device

Info

Publication number
CN109377998B
Authority
CN
China
Prior art keywords
sentence
pause
inter-sentence
user
current
Legal status
Active
Application number
CN201811513534.9A
Other languages
Chinese (zh)
Other versions
CN109377998A
Inventor
马雪涛
薛臣臣
熊勇军
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN201811513534.9A
Publication of CN109377998A
Application granted
Publication of CN109377998B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/04 - Segmentation; Word boundary detection

Abstract

The application discloses a voice interaction method and apparatus. The method comprises: in the current round of voice interaction, determining the user speech of the current user in the current round according to the current inter-sentence pause threshold; then determining a new inter-sentence pause threshold according to that user speech, updating the current inter-sentence pause threshold with the new threshold, and responding to the current round of user speech.

Description

Voice interaction method and device
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech interaction method and apparatus.
Background
With the development of intelligent voice interaction technology, a robot's understanding of and response to semantics have become increasingly human-like; however, the ability to adapt to the expression rhythms of different users is still deficient.
Although existing voice interaction methods can simulate human turn-taking to a certain extent, each user has different expression habits. If a unified rule is used to judge whether a segment of speech has ended, the judgment result may be inaccurate, which degrades the accuracy of subsequent semantic understanding and voice response and worsens the user experience.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a voice interaction method and apparatus that can improve the accuracy of voice response results and enhance the user experience.
The embodiment of the application provides a voice interaction method, which comprises the following steps:
in the current round of voice interaction, determining the user speech of the current user in the current round according to the current inter-sentence pause threshold;
determining a new inter-sentence pause threshold according to the user speech, updating the current inter-sentence pause threshold with the new inter-sentence pause threshold, and responding to the user speech.
Optionally, the determining a new inter-sentence pause threshold according to the user speech includes:
determining the actual inter-sentence pauses and end-of-sentence pauses in the user speech;
determining a new inter-sentence pause threshold according to the determined inter-sentence pauses and end-of-sentence pauses.
Optionally, the determining the actual inter-sentence pauses and end-of-sentence pauses in the user speech includes:
performing syntactic analysis on the recognition text of the user speech to determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech.
Optionally, the performing syntactic analysis on the recognition text of the user speech to determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech includes:
extracting semantic role information from the recognition text of the user speech, where each piece of semantic role information includes a predicate and an object;
determining the actual inter-sentence pauses and end-of-sentence pauses in the user speech according to the number of pieces of extracted semantic role information.
Optionally, the determining a new inter-sentence pause threshold according to the determined inter-sentence pauses and end-of-sentence pauses includes:
determining a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and end-of-sentence pauses.
Optionally, the determining a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and end-of-sentence pauses includes:
weighting the durations of the determined inter-sentence pauses to obtain the inter-sentence pause duration of the current round;
weighting the durations of the determined end-of-sentence pauses to obtain the end-of-sentence pause duration of the current round;
selecting a value between the inter-sentence pause duration of the current round and the end-of-sentence pause duration of the current round as the new inter-sentence pause threshold.
Optionally, if the current round of voice interaction is any round other than the current user's first round, the determining a new inter-sentence pause threshold according to the determined pauses and their corresponding durations includes:
weighting the durations of the determined inter-sentence pauses, and weighting the result together with the inter-sentence pause duration obtained in the previous round to obtain the inter-sentence pause duration of the current round;
weighting the durations of the determined end-of-sentence pauses, and weighting the result together with the end-of-sentence pause duration obtained in the previous round to obtain the end-of-sentence pause duration of the current round;
selecting a value between the inter-sentence pause duration of the current round and the end-of-sentence pause duration of the current round as the new inter-sentence pause threshold.
Optionally, the responding to the user speech includes:
taking the recognition text of the user speech as the text to be responded to;
extracting the high-frequency words in the text to be responded to;
matching each extracted high-frequency word against the invalid words in a pre-constructed invalid word bank, and removing the matched high-frequency words and the low-frequency words from the text to be responded to, where the invalid word bank stores the invalid words used by the current user;
performing a voice response according to the text after the removal operation.
Optionally, after removing the matched high-frequency words and the low-frequency words from the text to be responded to, the method further includes:
detecting whether any invalid word exists among the high-frequency words remaining after the matching operation;
if so, storing the detected invalid words in the invalid word bank and removing them from the text to be responded to.
An embodiment of the present application further provides a voice interaction apparatus, including:
a user speech determining unit, configured to determine, during the current round of voice interaction, the user speech of the current user in the current round according to the current inter-sentence pause threshold;
a pause threshold determining unit, configured to determine a new inter-sentence pause threshold according to the user speech, and to update the current inter-sentence pause threshold with the new inter-sentence pause threshold;
a user speech response unit, configured to respond to the user speech.
Optionally, the pause threshold determining unit includes:
an actual pause determining subunit, configured to determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech;
a pause threshold determining subunit, configured to determine a new inter-sentence pause threshold according to the determined inter-sentence pauses and end-of-sentence pauses.
Optionally, the actual pause determining subunit is specifically configured to:
perform syntactic analysis on the recognition text of the user speech to determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech.
Optionally, the actual pause determining subunit includes:
a role information extracting subunit, configured to extract semantic role information from the recognition text of the user speech, where each piece of semantic role information includes a predicate and an object;
a pause determining subunit, configured to determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech according to the number of pieces of extracted semantic role information.
Optionally, the pause threshold determining subunit is specifically configured to:
determine a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and end-of-sentence pauses.
Optionally, the pause threshold determining subunit includes:
a first pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses to obtain the inter-sentence pause duration of the current round;
a second pause duration determining subunit, configured to weight the durations of the determined end-of-sentence pauses to obtain the end-of-sentence pause duration of the current round;
a first pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the end-of-sentence pause duration of the current round as the new inter-sentence pause threshold.
Optionally, if the current round of voice interaction is any round other than the current user's first round, the pause threshold determining subunit includes:
a third pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses, and to weight the result together with the inter-sentence pause duration obtained in the previous round to obtain the inter-sentence pause duration of the current round;
a fourth pause duration determining subunit, configured to weight the durations of the determined end-of-sentence pauses, and to weight the result together with the end-of-sentence pause duration obtained in the previous round to obtain the end-of-sentence pause duration of the current round;
a second pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the end-of-sentence pause duration of the current round as the new inter-sentence pause threshold.
Optionally, the user speech response unit includes:
a text determining subunit, configured to take the recognition text of the user speech as the text to be responded to;
a word extracting subunit, configured to extract the high-frequency words in the text to be responded to;
a word matching subunit, configured to match each extracted high-frequency word against the invalid words in a pre-constructed invalid word bank, and to remove the matched high-frequency words and the low-frequency words from the text to be responded to, where the invalid word bank stores the invalid words used by the current user;
a voice response subunit, configured to perform a voice response according to the text after the removal operation.
Optionally, the apparatus further includes:
a word detecting unit, configured to detect, after the matched high-frequency words and the low-frequency words have been removed from the text to be responded to, whether any invalid word exists among the high-frequency words remaining after the matching operation;
a word removing unit, configured to, if an invalid word is detected among the remaining high-frequency words, store the detected invalid word in the invalid word bank and remove it from the text to be responded to.
An embodiment of the present application further provides a voice interaction device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is configured to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform any implementation of the above voice interaction method.
An embodiment of the present application further provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to perform any implementation of the above voice interaction method.
An embodiment of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to perform any implementation of the above voice interaction method.
According to the voice interaction method and apparatus provided by the embodiments of the application, in the current round of voice interaction, the user speech of the current user in the current round is first determined according to the current inter-sentence pause threshold; a new inter-sentence pause threshold is then determined according to that user speech, and the current inter-sentence pause threshold is updated with the new one, so that in the next round of voice interaction the user speech can be determined according to the updated threshold; meanwhile, the current round of user speech is responded to. Compared with the prior art, the personalized expression habit of the current user, namely the user's habitual pause durations, is taken into account during voice interaction, which improves the accuracy of the voice response result, reduces the number of times the user must repeat the same question, and improves the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a schematic flowchart of a voice interaction method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of receiving user voice data in a first round of voice interaction according to an embodiment of the present application;
FIG. 3 is a flowchart of determining a new inter-sentence pause threshold from the user speech according to an embodiment of the present application;
fig. 4 is a schematic diagram of the recognition text corresponding to a user's speech according to an embodiment of the present application;
FIG. 5 is an exemplary diagram of a parse tree provided by an embodiment of the present application;
FIG. 6 is a diagram illustrating extraction of semantic role information from a recognition text according to an embodiment of the present application;
fig. 7 is a schematic flowchart of determining a new inter-sentence pause threshold from the durations of the determined inter-sentence and end-of-sentence pauses according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating an example of updating the inter-sentence pause threshold according to an embodiment of the present application;
fig. 9 is a schematic flowchart of responding to the speech preceding each end-of-sentence pause according to an embodiment of the present application;
FIG. 10 is a diagram illustrating an example of extracting high-frequency words from recognized text according to an embodiment of the present application;
fig. 11 is a schematic composition diagram of a voice interaction apparatus according to an embodiment of the present application.
Detailed Description
In some voice interaction scenarios, in order to simulate a real-person conversation, a silence detection technique is usually used to determine whether the speech uttered by a user has ended. Specifically, while the user is speaking, the silence detection technique monitors the volume of the user's voice in real time. When the volume drops below a preset volume threshold, it is determined that the user is not speaking during that period. If this condition lasts only a short time, the period may be judged to be an inter-sentence pause; otherwise, if it lasts long enough to exceed a preset time threshold, the period may be judged to be an end-of-sentence pause, indicating that the user's speech has ended, after which the system (e.g., a robot) may respond based on the speech uttered by the user.
However, this manner of judging inter-sentence and end-of-sentence pauses does not account for each user's different pause habits. For example, some users speak quickly, with correspondingly short inter-sentence and end-of-sentence pauses, while others speak slowly, with correspondingly long pauses. Using a unified rule to judge whether a segment of speech has ended for all users can therefore lead to judgment errors, which lowers the accuracy of subsequent voice responses and degrades the user experience.
To address these drawbacks, an embodiment of the present application provides a voice interaction method. An inter-sentence pause threshold, for example a relatively large one, is preset; when voice interaction begins, the user speech of the current user in the current round is determined according to this preset threshold. A new inter-sentence pause threshold is then determined from the current round of user speech and used as the threshold for determining the user speech in the next round of voice interaction with the current user; meanwhile, the current round of user speech is responded to. In this way, the inter-sentence pause threshold used in each round can be dynamically adjusted to the pause habits of the current user, improving the accuracy of sentence segmentation of the user speech, producing more accurate responses, and effectively avoiding the situation in which the user must repeat the same question many times because no response is obtained. Compared with the prior art, the personalized expression habit of the current user, namely the user's habitual pause durations, is taken into account during voice interaction, which improves the accuracy of the voice response result, reduces repetition, and improves the user experience.
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on these embodiments without creative effort shall fall within the protection scope of the present application.
First embodiment
Referring to fig. 1, a schematic flow chart of a voice interaction method provided in this embodiment is shown, where the method includes the following steps:
s101: and in the current round of voice interaction process, determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold.
In this embodiment, the system (e.g., a robot) may perform one or more rounds of voice interaction with the current user; for example, the robot may interact with the current user for the first time, or for the second, third, or a later time. During each round, a silence detection technique may be used to determine the user speech of that round according to the inter-sentence pause threshold corresponding to that round. Note that this embodiment does not limit the language of the user speech; for example, the speech may be in Chinese, English, or another language.
Specifically, while receiving the user's voice in the current round, if a continuous silent period shorter than the current inter-sentence pause threshold appears, that period is judged to be an inter-sentence pause and reception continues; if a continuous silent period longer than the current inter-sentence pause threshold appears, that period is judged to be an end-of-sentence pause, indicating that the user has finished speaking. Reception then stops, and the received speech is taken as the user speech of the current round of interaction.
The current round of voice interaction may be the current user's first round, or a later round such as the second or third round.
If the current round is the current user's first round of voice interaction, a relatively long inter-sentence pause threshold may be preset before the round begins. This threshold may be higher than the inter-sentence pause threshold commonly used in ordinary voice interaction, so that more complete user speech can be captured. For example, if the commonly used threshold is 1000 milliseconds, the threshold in this embodiment may be preset to a larger value, such as 2000 milliseconds, before the first round of voice interaction.
As an illustration, see fig. 2, and assume the preset inter-sentence pause threshold is 2000 milliseconds. During the first round of voice interaction, when the user's utterance is detected from time "1" in fig. 2, the system starts receiving the user's voice data. After the first speech segment, a silent segment of 1200 milliseconds is detected; since its duration does not exceed the preset threshold (1200 < 2000), it is judged to be an inter-sentence pause, and reception continues. After the second speech segment, a silent segment of 1800 milliseconds is detected; since it still does not exceed the threshold (1800 < 2000), it is again judged to be an inter-sentence pause, and reception continues. After the third speech segment, a silent segment of 2000 milliseconds is detected; because its duration reaches the preset threshold of 2000 milliseconds, it is judged to be an end-of-sentence pause, reception stops, and the three received speech segments are taken as the user speech of the current round of interaction.
In addition, if the current round is not the first round of voice interaction, the threshold in use is the one produced by the previous round. For example, if the current round is the second round, the inter-sentence pause threshold used is the new threshold determined during the first round; if the current round is the third round, the threshold used is the one determined during the second round; and so on. Thus, in the second and later rounds, the user speech of each round can be determined with a silence detection technique or other means according to that round's inter-sentence pause threshold.
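As an illustration of this endpointing logic, the following is a minimal sketch in Python. It assumes a frame-level voice activity detector upstream that labels fixed-size audio frames as speech or silence; the frame size, function name, and data layout are illustrative assumptions rather than details taken from the patent.

```python
FRAME_MS = 10  # assumed VAD frame size in milliseconds

def collect_user_speech(frames, pause_threshold_ms):
    """frames: iterable of (is_speech, audio_bytes) pairs from a VAD front end."""
    collected = []
    silence_ms = 0
    started = False
    for is_speech, audio in frames:
        if is_speech:
            started = True
            silence_ms = 0
            collected.append(audio)
        elif started:
            silence_ms += FRAME_MS
            if silence_ms >= pause_threshold_ms:
                break                    # end-of-sentence pause: user has finished
            collected.append(audio)      # inter-sentence pause: keep listening
    return b"".join(collected)
```

With a threshold of 2000 milliseconds, the silent segments of 1200 and 1800 milliseconds in the fig. 2 example are treated as inter-sentence pauses, and reception stops only at the 2000-millisecond silence.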
S102: and determining a new inter-sentence pause threshold according to the voice of the user in the current round, updating the current inter-sentence pause threshold by using the new inter-sentence pause threshold, and responding to the voice of the user in the current round.
In this embodiment, after the user speech of the current user in the current round is determined through step S101, the speech may be processed to determine a new inter-sentence pause threshold, that is, a pause threshold adapted to the pause characteristics of the current user's speech. The current inter-sentence pause threshold from S101 is then updated with the new one; in other words, the current threshold is replaced by the new threshold, to be used in the next round of voice interaction, so that the next round of user speech can be determined according to the updated threshold.
The current round of user speech must also be responded to. For example, if the user says "What's the weather like today?", the robot, after semantic understanding, may answer "The weather is good today, suitable for going out." The specific implementation of "responding to the current round of user speech" in step S102 is described in the second embodiment.
In an implementation of this embodiment, as shown in fig. 3, the implementation of "determining a new inter-sentence pause threshold according to the current round of user speech" in step S102 may specifically include steps S1021-S1022.
S1021: and determining the actual pause between each sentence and the pause at the tail of each sentence in the voice of the user in the current round.
In this embodiment, after the user speech of the current user in the current round of interaction is determined through step S101, any existing or future speech recognition method may be used to recognize it and obtain the corresponding recognition text. As an illustration, based on the example in fig. 2, after speech recognition of the three speech segments in fig. 2, the corresponding recognition texts are obtained as shown in fig. 4: the first text is roughly "I ask", the second is "how much money is this product of yours", and the third is "then I ask how much money does this product of yours sell for". The pauses at indicated positions "1" and "2" in fig. 4 are inter-sentence pauses determined according to the inter-sentence pause threshold of this round, and the pause at indicated position "3" is an end-of-sentence pause determined according to the same threshold.
However, inter-sentence and end-of-sentence pauses distinguished only by the inter-sentence pause threshold may not be accurate. The recognition text of the current round of user speech may therefore be further processed with text processing methods (e.g., methods based on a language model) to determine the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech.
In an optional implementation, syntactic analysis may be performed on the recognition text of the current round of user speech to determine the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech.
In this implementation, word segmentation may first be performed on the recognition text of the current round of user speech to obtain the words it contains. The syntactic functions of these words are then analyzed to identify the subject, predicate, object, attribute, adverbial, complement, and so on in the recognition text. Based on these syntactic constituents, the number of complete sentences contained in the recognition text can be determined: the silent segment at the end of each complete sentence is an actual end-of-sentence pause, and a silent segment inside a complete sentence is an actual inter-sentence pause. In this way the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech are determined. The flow specifically includes the following steps A1-A2:
step A1: semantic role information is extracted from the recognition text of the voice of the current round of the user, and the semantic role information comprises a group of predicates and objects.
In this implementation, the syntactic analysis method may be used to perform syntactic analysis on the recognition text of the user's voice in the current round, for example, a semantic role labeling method based on a shallow syntactic analysis result may be used to perform semantic role labeling on the recognition text, so as to extract one or more semantic role information from the recognition text of the user's voice in the current round, where each semantic role information includes a set of predicates and objects, and of course, the semantic role information may further include a subject. The predicate is a statement or description of a subject (an action or a subject of an action of executing a text), indicates an action or an action such as "what" the subject does "," what "or" what "and is the core of the whole sentence text; an object refers to the recipient of an action (predicate).
Specifically, the noun collocated with the predicate is called argument, and the roles assumed by different arguments in the text may be different, such as different roles of an actor (Agent), a victim (parent), an object (Theme), an Experiencer, a Beneficiary (Beneficiary), a tool (Instrument), a place (Location), a Goal (Goal), and a Source (Source), and among these different arguments with different roles, a subject and an object are usually included, such as the actor (Agent) being the subject and the victim (parent) being the object.
According to grammatical rules, a complete sentence can generally consist of a subject, predicates and objects (the subject can be omitted from the imperative sentence), while a second set of predicates and objects in a piece of text, once present, represents the occurrence of another complete sentence. Based on the above, when semantic role information is extracted from the recognition text of the user voice in the current round by adopting a semantic role labeling method based on a shallow syntactic analysis result, a complete sentence can be found by finding out a subject (i.e., an actor) and an object (a victim) around the "predicate" with the "predicate" as a core, and all the complete sentences contained in the recognition text can be found by analogy in sequence.
For example, the following steps are carried out: as shown in fig. 5, taking the found complete sentence text "Xiaoming yesterday meets a little red in a park at night" as an example, semantic character information extracted from the text by adopting a semantic character labeling method based on a shallow syntactic analysis result includes: "meet" is the predicate, "Xiaoming" is the subject (actor) and "Xiaohong" is the object (victim). In addition, regarding other words, such as "yesterday" is the Time of occurrence (Time) of the event described by the text, "park" is the Location of occurrence (Location) of the event described by the text.
Step A2: Determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech according to the number of pieces of extracted semantic role information.
In this embodiment, after the semantic role information (each piece including a predicate and an object) has been extracted from the recognition text of the current round of user speech through step A1, the actual inter-sentence pauses and end-of-sentence pauses in the user speech can be determined according to the number of pieces extracted. Specifically, since a complete sentence contains at least a predicate and an object, once a predicate and an object matching it have been extracted from the recognition text, a complete sentence can be identified; the pause after that sentence is an end-of-sentence pause, and a pause occurring inside it is an inter-sentence pause. By analogy, the number of complete sentences contained in the recognition text can be determined from the number of extracted predicates and matching objects, and the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech can then be determined.
As an illustration, still taking the three recognition texts shown in fig. 4 as an example: as shown in fig. 6, subject 1 "I" and predicate 1 "ask" can be extracted from the first text, and object 1 "product" can be extracted from the second text, so the first and second texts together contain one predicate and one object collocated with it and thus form one complete sentence. Then subject 2 "I", predicate 2 "ask", and object 2 "product" can be extracted from the third text, so the third text by itself forms another complete sentence. Consequently, the pause between the first two texts and the third text (i.e., the second silent segment, 1800 milliseconds) is an end-of-sentence pause, while the pause between the first and second texts (i.e., the first silent segment, 1200 milliseconds) is an inter-sentence pause.
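To make step A2 concrete, here is a small sketch of the pause-classification logic, assuming a semantic role labeling front end has already tagged each recognized speech segment with the roles found in it; the function and data layout are illustrative assumptions, not the patent's implementation.

```python
def classify_pauses(segment_roles, pauses_ms):
    """segment_roles: per speech segment, the set of roles found by SRL.
    pauses_ms: duration of the silence following each segment (same length).
    Returns (inter_sentence_pauses, end_of_sentence_pauses) in milliseconds."""
    inter, final = [], []
    seen = set()
    for roles, pause in zip(segment_roles, pauses_ms):
        seen |= roles
        if {"predicate", "object"} <= seen:   # a complete sentence has closed here
            final.append(pause)
            seen = set()
        else:                                 # the sentence is still open
            inter.append(pause)
    return inter, final

# The fig. 6 example: "ask" (predicate) in segment 1, "product" (object) in
# segment 2, both roles in segment 3; silences of 1200, 1800, and 2000 ms.
inter, final = classify_pauses(
    [{"predicate"}, {"object"}, {"predicate", "object"}],
    [1200, 1800, 2000])
# inter == [1200], final == [1800, 2000]
```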
S1022: and determining a new inter-sentence pause threshold according to the determined inter-sentence pauses and the determined end-of-sentence pauses.
In this embodiment, after the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech are determined in step S1021, a new inter-sentence pause threshold may be determined from them.
In an implementation of this embodiment, the new inter-sentence pause threshold may be determined according to the durations corresponding to the determined inter-sentence pauses and end-of-sentence pauses.
In this implementation, after the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech have been determined through step S1021, their corresponding durations may be processed to determine a new inter-sentence pause threshold, which serves as the inter-sentence pause threshold used in the next round of voice interaction.
Note that when the current round is the first round of voice interaction, the new inter-sentence pause threshold can be determined directly from the actual inter-sentence pauses and end-of-sentence pauses of this round. When the current round is any round other than the current user's first round (such as the second or third round), the actual pauses determined in this round can be combined with those determined in previous rounds to jointly determine the new threshold.
As shown in fig. 7, the specific flow of this implementation includes the following steps S701-S703:
s701: and weighting the determined duration of pause between each sentence to obtain the pause duration between the sentences of the current round.
When the current round is the current user's first round of voice interaction, as shown in fig. 8, the preset inter-sentence pause threshold is A. If only one inter-sentence pause is determined in the user speech through step S1021, its duration can be used directly as the inter-sentence pause duration of the first round, i.e., the first-round inter-sentence pause B shown in fig. 8. If multiple actual inter-sentence pauses are determined, their durations may be weighted to obtain the first-round inter-sentence pause duration, i.e., the first-round inter-sentence pause B shown in fig. 8, where the weights sum to 1 and may be obtained experimentally or set empirically. For example, suppose three actual inter-sentence pauses of 1000, 1200, and 1400 milliseconds are determined, with weights 0.2, 0.3, and 0.5 respectively; the inter-sentence pause duration of this round is then B = 1000 × 0.2 + 1200 × 0.3 + 1400 × 0.5 = 1260 milliseconds.
When the current round is any round other than the current user's first round of voice interaction, there are two calculation manners. The first is to compute the inter-sentence pause duration of the current round in the same way as in the first round and use it directly as the final value for this round. The second is to compute the current round's value in the same way, then weight it together with the final value of the previous round to obtain the final inter-sentence pause duration of the current round.
That is, when the current round is any round other than the first and the second manner is adopted, step S701 may be replaced with: weight the durations of the determined inter-sentence pauses, then weight the result together with the inter-sentence pause duration obtained in the previous round, to obtain the inter-sentence pause duration of the current round.
Specifically, as shown in fig. 8, take the current round to be the current user's second round of voice interaction, and let the inter-sentence pause threshold determined in the first round be D. If only one actual inter-sentence pause of the second round is determined through step S1021, its duration can be used directly as the second round's inter-sentence pause duration, defined as E. If multiple actual inter-sentence pauses are determined in the second round, their durations may be weighted to obtain the initial inter-sentence pause duration E1 of the second round; the calculation is the same as that of the first-round inter-sentence pause duration B.
Further, to obtain the inter-sentence pause duration E of the second round, the computed initial value E1 may be weighted together with the first round's inter-sentence pause duration B, with the weights summing to 1. This can be done in two ways. One is to take the average, i.e., E = (E1 + B)/2, meaning both weights are 0.5. The other is to weight E1 and B differently: since the second round's initial value E1 reflects the user's inter-sentence pauses in the second round better than the first round's value B does, E1 may be given the larger weight and B the smaller one. For example, with the weight of E1 set to 0.7 and the weight of B set to 0.3, the formula for the second round's inter-sentence pause duration is E = E1 × 0.7 + B × 0.3.
In the third, fourth, and subsequent rounds of voice interaction, the inter-sentence pause duration is calculated in the same way as in the second round, and is not repeated here.
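The weighting schemes of S701 can be sketched as follows, using the patent's own first-round numbers; the specific weight values are examples, since the patent only requires that the weights sum to 1.

```python
def weighted_duration(durations_ms, weights):
    """Weighted combination of pause durations; weights are assumed to sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(d * w for d, w in zip(durations_ms, weights))

# First round: B = 1000*0.2 + 1200*0.3 + 1400*0.5 = 1260 ms (the example above)
B = weighted_duration([1000, 1200, 1400], [0.2, 0.3, 0.5])

# A later round, second manner: weight this round's initial value E1 together
# with the previous round's value B, E1 getting the larger weight (0.7 here).
E1 = weighted_duration([900, 1100], [0.5, 0.5])   # illustrative second-round pauses
E = E1 * 0.7 + B * 0.3                            # smoothed inter-sentence duration
```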
S702: and weighting the determined duration of each sentence end pause to obtain the sentence end pause duration of the current round.
When the current round is the current user's first round of voice interaction, as shown in fig. 8, the preset inter-sentence pause threshold is A. If only one end-of-sentence pause is determined in the user speech through step S1021, its duration can be used directly as the end-of-sentence pause duration of the first round, i.e., the first-round end-of-sentence pause C shown in fig. 8. If multiple actual end-of-sentence pauses are determined, their durations may be weighted, with the weights summing to 1, to obtain the first-round end-of-sentence pause duration, i.e., the first-round end-of-sentence pause C shown in fig. 8.
When the current round is any round other than the current user's first round of voice interaction, there are likewise two calculation manners: either compute the end-of-sentence pause duration of the current round in the same way as in the first round and use it directly as the final value for this round, or compute it that way and then weight it together with the final value of the previous round to obtain the final end-of-sentence pause duration of the current round.
That is, when the current round is any round other than the first and the second manner is adopted, step S702 may be replaced with: weight the durations of the determined end-of-sentence pauses, then weight the result together with the end-of-sentence pause duration obtained in the previous round, to obtain the end-of-sentence pause duration of the current round.
Specifically, as shown in fig. 8, still taking the current round to be the current user's second round of voice interaction, let the inter-sentence pause threshold determined in the first round be D. If only one actual end-of-sentence pause of the second round is determined through step S1021, its duration can be used directly as the second round's end-of-sentence pause duration, defined as F. If multiple actual end-of-sentence pauses are determined in the second round, the calculation of the initial end-of-sentence pause duration F1 is the same as that of the first-round end-of-sentence pause duration C.
Further, to obtain the end-of-sentence pause duration F of the second round, the computed initial value F1 may be weighted together with the first round's end-of-sentence pause duration C, with the weights summing to 1. Again this can be done in two ways. One is to take the average, i.e., F = (F1 + C)/2, meaning both weights are 0.5. The other is to weight F1 and C differently: since the second round's initial value F1 reflects the user's end-of-sentence pauses in the second round better than the first round's value C does, F1 may be given the larger weight and C the smaller one. For example, with the weight of F1 set to 0.6 and the weight of C set to 0.4, the formula for the second round's end-of-sentence pause duration is F = F1 × 0.6 + C × 0.4.
In the third, fourth, and subsequent rounds of voice interaction, the end-of-sentence pause duration is calculated in the same way as in the second round, and is not repeated here.
S703: and selecting a numerical value between the sentence pause time length of the current round and the sentence tail pause time length of the current round as a new sentence pause threshold value.
Specifically, when the current round is the current user's first round of voice interaction, as shown in fig. 8, after the first round's inter-sentence pause duration B and end-of-sentence pause duration C are calculated through steps S701 and S702, a value may be selected between B and C as the new inter-sentence pause threshold, defined as D. For example, the average of B and C may be taken: assuming B is 1200 milliseconds and C is 1800 milliseconds, D = (1200 + 1800)/2 = 1500 milliseconds. The threshold D is then used as the inter-sentence pause threshold in the second round of voice interaction, so that the user speech of the second round can be determined according to D.
Similarly, when the current round is any round other than the current user's first round, a new inter-sentence pause threshold can be generated by the same method. Taking the second round as an example, as shown in fig. 8, a value may be selected between E and F as the new inter-sentence pause threshold, defined as G; for example, the average of E and F may be taken as G, and G is then used as the threshold in the third round of voice interaction. By analogy, a new inter-sentence pause threshold is generated in each subsequent round, achieving continuous, round-by-round dynamic adjustment of the current inter-sentence pause threshold.
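A minimal sketch of the threshold update in S703 follows, using the midpoint choice from the fig. 8 example; any value between the two durations would satisfy the method.

```python
def new_pause_threshold(inter_ms, final_ms):
    """Pick a value between the round's inter-sentence and end-of-sentence
    pause durations; the midpoint is the choice used in the patent's example."""
    return (inter_ms + final_ms) / 2

D = new_pause_threshold(1200, 1800)   # 1500.0 ms, matching the fig. 8 example
```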
In summary, this embodiment can dynamically adjust the inter-sentence pause threshold used in each round of interaction and thereby adapt to the pause habits of the current user. Compared with the prior art, the personalized expression habit of the current user, namely the user's habitual pause durations, is taken into account during voice interaction, which improves the accuracy of the voice response result, reduces the number of times the user must repeat the same question, and improves the user experience.
Second embodiment
This embodiment describes the specific implementation of "responding to the current round of user speech" in step S102 of the first embodiment.
It should be noted that during voice interaction, users often insert filler words and modal particles such as "oh", "ya", "this", and "that" into their speech. These meaningless words are invalid expressions: they contribute nothing to semantic understanding of the user speech and can even interfere with it, especially when they occur in the middle of meaningful words. Moreover, different users produce different personalized words, such as modal particles and interjections, according to their own expression habits. Therefore, after the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech have been determined, one approach is to directly perform semantic understanding on the speech preceding each end-of-sentence pause and respond accordingly. Another approach is, once the actual pauses have been determined through the first embodiment and the interference of the user's habitual pauses eliminated, to additionally filter out the user's personalized invalid expressions to ensure the accuracy of semantic understanding. As shown in fig. 9, the specific flow of this approach includes the following steps S901-S904:
s901: and taking the recognized text of the voice of the user in the current round as the text to be responded.
In this embodiment, any existing or future speech recognition method may be used to recognize the current round of user speech and obtain the corresponding recognition text, which is taken as the text to be responded to, so that an accurate response to it can be achieved through the subsequent steps S902-S904.
S902: and extracting each high-frequency vocabulary in the text to be responded.
In this embodiment, after the text to be responded to is obtained in step S901, a word segmentation method may be used to segment it into words and to count how many times each word occurs in it. Given a preset count threshold, for example 2, every word whose occurrence count in the text to be responded to is not lower than the threshold is defined as a high-frequency word. All high-frequency words are then extracted from the text, and the remaining words are low-frequency words.
As an illustration, with the count threshold preset to 2, suppose the obtained recognition text (the text to be responded to) is, as in fig. 10, along the lines of "I ask oh how much money is this product of yours, then I ask oh how much money does this product of yours sell for". After word segmentation, the words "I", "ask", "oh", "you", "this", "product", "how much", and "money" each occur twice in the text, which equals the preset count threshold, so they are taken as high-frequency words and can be numbered in order as shown in fig. 10. Correspondingly, the remaining words in the text, such as "then" and "sell", occur only once and are low-frequency words.
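A small sketch of S902 follows, assuming the recognition text has already been segmented into tokens (for Chinese, a segmenter such as jieba could be used; the toy English tokens below merely mirror the fig. 10 example):

```python
from collections import Counter

def split_by_frequency(tokens, count_threshold=2):
    """Split tokens into high-frequency and low-frequency words by count."""
    counts = Counter(tokens)
    high = [w for w, c in counts.items() if c >= count_threshold]
    low = [w for w, c in counts.items() if c < count_threshold]
    return high, low

tokens = ("I ask oh how-much money you this product "
          "then I ask oh how-much money you this product sell").split()
high, low = split_by_frequency(tokens)
# high: the words occurring twice ("I", "ask", "oh", ...); low: ["then", "sell"]
```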
S903: and matching each extracted high-frequency vocabulary with each invalid vocabulary in a pre-constructed invalid vocabulary library, and removing the matched high-frequency vocabulary and the low-frequency vocabulary in the text to be responded from the text to be responded, wherein the invalid vocabulary used by the current user is stored in the invalid vocabulary library.
In this embodiment, after the high frequency words and the low frequency words in the text to be responded are extracted in step S902, the high frequency words and the low frequency words in the text to be responded may be matched with the invalid words in the pre-constructed invalid word bank, and the matched high frequency words and the low frequency words in the text to be responded are removed from the text to be responded, that is, the high frequency words overlapping with the invalid word bank are removed, the high frequency words not overlapping with the invalid word bank are retained, and the low frequency words in the text to be responded are also removed.
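As a sketch of this removal step, the filtering can be expressed in a few lines, reusing `tokens` and `high` from the sketch above; `invalid_word_bank` is assumed to already hold the current user's invalid words:

```python
def filter_text(tokens, high_frequency_words, invalid_word_bank):
    """Step S903: keep only the high-frequency words that do not
    match the user's invalid word bank; matched high-frequency
    words and all low-frequency words are removed."""
    return [w for w in tokens
            if w in high_frequency_words and w not in invalid_word_bank]

# With the fig. 10 high-frequency words and a bank containing "o"/"this",
# only "I", "ask", "you", "product", "how much" and "money" remain.
kept = filter_text(tokens, high, invalid_word_bank={"o", "this"})
```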
The invalid word bank stores the invalid words used by the current user. It should be noted that, during voice interaction, different users produce different personalized invalid words, such as modal particles and verbal tics, according to their own expression habits. Therefore, to ensure accurate semantic understanding of different users' speech, a personalized invalid word bank needs to be created for each user, so that each invalid word bank stores the invalid words of exactly one user and a better filtering effect is achieved.

One way to implement the personalized invalid word bank for each user is as follows. First, segment the recognized text corresponding to the current round of user voice into words, and extract the user's commonly used high-frequency words in the manner of step S902. Then invoke a natural language understanding model to analyze the semantics of all the extracted high-frequency words, separating out the words that carry no clear semantics. For example, based on the example shown in fig. 10, after the high-frequency words "I", "ask", "o", "you", "this", "product", "how much" and "money" are obtained, the natural language understanding model is invoked to analyze their semantics; it finds that "o" and "this" carry no clear semantics, so they are classified as invalid words and placed in the personalized invalid word bank of this user. Over further rounds of voice interaction, the invalid words of the user keep accumulating, further enriching the user's personalized invalid word bank.
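A minimal sketch of this bank-building step follows. The function `has_clear_semantics` is a hypothetical stand-in for the natural language understanding model's judgment, not an API named in this application:

```python
def update_invalid_word_bank(high_frequency_words, invalid_word_bank,
                             has_clear_semantics):
    """Add to the user's personalized bank every commonly used
    high-frequency word that carries no clear semantics."""
    for word in high_frequency_words:
        if not has_clear_semantics(word):
            invalid_word_bank.add(word)
    return invalid_word_bank

# Toy stand-in for the NLU judgment in the fig. 10 example:
bank = update_invalid_word_bank(
    {"I", "ask", "o", "you", "this", "product", "how much", "money"},
    set(),
    has_clear_semantics=lambda w: w not in {"o", "this"},
)
print(sorted(bank))  # ['o', 'this']
```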
Accordingly, in an optional implementation, after the matched high-frequency words and the low-frequency words have been removed from the text to be responded in step S903, the following steps B1-B2 may further be performed:

Step B1: Detect whether any invalid words exist among the high-frequency words remaining after the matching operation.

In this implementation, after each extracted high-frequency word has been matched against the invalid words in the pre-constructed invalid word bank in step S903 and the matched high-frequency words have been removed from the text to be responded, it may further be detected whether invalid words still exist among the remaining high-frequency words. If so, the pre-constructed personalized invalid word bank of the current user does not yet contain these words, and step B2 needs to be executed; if not, step S904 is executed directly.

Step B2: If so, store the detected invalid words in the invalid word bank and remove them from the text to be responded.

In this implementation, if step B1 detects that invalid words still exist among the high-frequency words remaining after the matching operation, these words are not yet contained in the current user's pre-constructed personalized invalid word bank. The detected invalid words may then be stored in that bank, so that a better filtering effect is achieved when processing this user's speech in subsequent voice interactions; meanwhile, they also need to be removed from the text to be responded to ensure the accuracy of subsequent semantic understanding.
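Putting step S903 and steps B1-B2 together, one round of text cleanup can be sketched as below; as before, `has_clear_semantics` is a hypothetical placeholder for the natural language understanding model, and the data shapes are illustrative assumptions:

```python
from collections import Counter

def clean_text_for_response(tokens, invalid_word_bank, has_clear_semantics,
                            count_threshold=2):
    """One round of S903 + B1-B2 over a segmented text to be responded."""
    counts = Counter(tokens)
    high = {w for w, c in counts.items() if c >= count_threshold}
    # S903: drop bank-matched high-frequency words and all low-frequency words.
    kept = [w for w in tokens if w in high and w not in invalid_word_bank]
    # B1: detect invalid words the bank does not contain yet.
    newly_invalid = {w for w in set(kept) if not has_clear_semantics(w)}
    # B2: store them in the bank and remove them from the text.
    invalid_word_bank |= newly_invalid
    kept = [w for w in kept if w not in newly_invalid]
    return kept, invalid_word_bank
```

Because the newly detected invalid words are written back into the bank, later rounds for the same user are filtered by the plain S903 matching alone, which is the accumulation effect described above.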
S904: and performing voice response according to the text after the removing operation.
In this embodiment, after the invalid vocabulary in the text to be responded is removed in step S903, the remaining text content may be semantically understood to obtain a semantic understanding result. Moreover, after the invalid vocabulary is removed, the obtained semantic understanding result can be closer to the meaning which the current user wants to express, and further the system can make a more accurate response which is more in line with the meaning of the user.
In summary, in the embodiment, the personalized invalid word bank corresponding to each user is pre-constructed, and the invalid words in the recognized text corresponding to the voice of the user are personalized and filtered, so that a good filtering effect is achieved, the accuracy of semantic understanding can be ensured, and the user experience is further improved.
Third embodiment
In this embodiment, a voice interaction apparatus is described; for related content, please refer to the method embodiments above.
Referring to fig. 11, a schematic composition diagram of a voice interaction apparatus provided in this embodiment is shown, where the apparatus 1100 includes:
a user voice determining unit 1101, configured to determine, during the current round of voice interaction, the user speech of the current user in the current round according to the current inter-sentence pause threshold;
a pause threshold determining unit 1102, configured to determine a new inter-sentence pause threshold according to the user speech, and update the current inter-sentence pause threshold by using the new inter-sentence pause threshold;
a user voice response unit 1103, configured to respond to the user voice.
In an implementation manner of this embodiment, the pause threshold determining unit 1102 includes:

an actual pause determining subunit, configured to determine the actual inter-sentence pauses and sentence-end pauses in the user speech;

and a pause threshold determining subunit, configured to determine a new inter-sentence pause threshold according to the determined inter-sentence pauses and sentence-end pauses.
In an implementation manner of this embodiment, the actual pause determining subunit is specifically configured to:
and carrying out syntactic analysis on the recognition text of the user voice, and determining actual pause among all sentences and pause at all sentence ends in the user voice.
In an implementation manner of this embodiment, the actual pause determining subunit includes:
a role information extracting subunit, configured to extract semantic role information from the recognized text of the user speech, wherein each piece of semantic role information includes a group formed by a predicate and its object;

and a pause determining subunit, configured to determine the actual inter-sentence pauses and sentence-end pauses in the user speech according to the number of pieces of extracted semantic role information.
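This application does not fix a particular parsing toolchain, but the role-counting logic of these subunits can be sketched as follows. Here `extract_semantic_roles` is a hypothetical stand-in for a semantic role labeler returning the (predicate, object) groups found in a text, and treating a pause as a real sentence boundary only once the accumulated text completes at least one new group is one plausible reading of the counting rule, not a definitive algorithm:

```python
def classify_pauses(segments, pause_durations, extract_semantic_roles):
    """Label detected pauses using semantic role counts.

    `segments` are the recognized-text chunks between detected pauses
    and `pause_durations` their trailing silences in seconds. A pause
    counts as a sentence boundary only if the text accumulated since
    the previous boundary yields at least one (predicate, object)
    group; other pauses are treated as habitual and ignored. The last
    boundary of the round is taken as the sentence-end pause.
    """
    boundaries, buffered = [], ""
    for segment, pause in zip(segments, pause_durations):
        buffered += segment
        if extract_semantic_roles(buffered):  # at least one complete group
            boundaries.append(pause)
            buffered = ""
    sentence_end_pauses = boundaries[-1:]      # closing pause of the round
    inter_sentence_pauses = boundaries[:-1]    # pauses between sentences
    return inter_sentence_pauses, sentence_end_pauses
```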
In an implementation manner of this embodiment, the pause threshold determining subunit is specifically configured to:
and determining a new inter-sentence pause threshold according to the determined duration corresponding to each inter-sentence pause and each sentence end pause.
In an implementation manner of this embodiment, the pause threshold determining subunit includes:

a first pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses to obtain the inter-sentence pause duration of the current round;

a second pause duration determining subunit, configured to weight the durations of the determined sentence-end pauses to obtain the sentence-end pause duration of the current round;

and a first pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
In an implementation manner of this embodiment, if the current round of voice interaction is any round other than the current user's first round of voice interaction, the pause threshold determining subunit includes:

a third pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses, and weight the result together with the inter-sentence pause duration obtained in the previous round, to obtain the inter-sentence pause duration of the current round;

a fourth pause duration determining subunit, configured to weight the durations of the determined sentence-end pauses, and weight the result together with the sentence-end pause duration obtained in the previous round, to obtain the sentence-end pause duration of the current round;

and a second pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
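The weighting performed by these subunits can be sketched as below. The uniform per-pause weights, the smoothing factor for blending in the previous round, and the midpoint choice for "a value between" the two durations are all illustrative assumptions; the application itself only requires some weighting and some value between the two durations:

```python
def update_inter_sentence_threshold(inter_pauses, end_pauses,
                                    prev_inter=None, prev_end=None,
                                    smoothing=0.5, mix=0.5):
    """Compute the current round's pause durations and the new
    inter-sentence pause threshold; all values are in seconds, and
    at least one pause of each kind is assumed to have been observed."""
    inter = sum(inter_pauses) / len(inter_pauses)  # uniform weighting
    end = sum(end_pauses) / len(end_pauses)
    if prev_inter is not None and prev_end is not None:
        # Rounds after the first: blend with the previous round's durations.
        inter = smoothing * inter + (1 - smoothing) * prev_inter
        end = smoothing * end + (1 - smoothing) * prev_end
    threshold = inter + mix * (end - inter)        # a value between the two
    return threshold, inter, end

# First round: inter-sentence pauses of 0.4 s and 0.6 s and a sentence-end
# pause of 1.2 s give durations (0.5, 1.2) and a midpoint threshold of 0.85 s.
threshold, inter, end = update_inter_sentence_threshold([0.4, 0.6], [1.2])
print(threshold, inter, end)  # 0.85 0.5 1.2
```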
In an implementation manner of this embodiment, the user voice response unit 1103 includes:
a text determining subunit, configured to take the recognized text of the user voice as a text to be responded;

a vocabulary extracting subunit, configured to extract each high-frequency word in the text to be responded;

a vocabulary matching subunit, configured to match each extracted high-frequency word against the invalid words in a pre-constructed invalid word bank, and remove the matched high-frequency words and the low-frequency words in the text to be responded from the text to be responded, wherein the invalid word bank stores the invalid words used by the current user;

and a voice response subunit, configured to perform a voice response according to the text left after the removal operation.
In an implementation manner of this embodiment, the apparatus further includes:
a vocabulary detecting unit, configured to, after the matched high-frequency words and the low-frequency words have been removed from the text to be responded, detect whether any invalid words exist among the high-frequency words remaining after the matching operation;

and a vocabulary removing unit, configured to, if invalid words are detected among the high-frequency words remaining after the matching operation, store the detected invalid words in the invalid word bank and remove them from the text to be responded.
Further, an embodiment of the present application further provides a voice interaction device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any one of the implementation methods of the voice interaction method.
Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to execute any implementation method of the above voice interaction method.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation method of the above voice interaction method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A method of voice interaction, comprising:
in the current round of voice interaction process, determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold;
determining a new inter-sentence pause threshold according to the user voice, updating the current inter-sentence pause threshold by using the new inter-sentence pause threshold, and responding to the user voice;
the determining a new inter-sentence pause threshold according to the user speech includes:
determining the actual inter-sentence pauses and sentence-end pauses in the user voice;

and determining a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and sentence-end pauses.
2. The method of claim 1, wherein determining a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and sentence-end pauses comprises:

weighting the durations of the determined inter-sentence pauses to obtain the inter-sentence pause duration of the current round;

weighting the durations of the determined sentence-end pauses to obtain the sentence-end pause duration of the current round;

and selecting a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
3. The method of claim 1, wherein if the current round of voice interaction is any round other than the current user's first round of voice interaction, determining a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and sentence-end pauses comprises:

weighting the durations of the determined inter-sentence pauses, and weighting the result together with the inter-sentence pause duration obtained in the previous round, to obtain the inter-sentence pause duration of the current round;

weighting the durations of the determined sentence-end pauses, and weighting the result together with the sentence-end pause duration obtained in the previous round, to obtain the sentence-end pause duration of the current round;

and selecting a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
4. A method of voice interaction, comprising:
in the current round of voice interaction process, determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold;
determining a new inter-sentence pause threshold according to the user voice, updating the current inter-sentence pause threshold by using the new inter-sentence pause threshold, and responding to the user voice;
the determining a new inter-sentence pause threshold according to the user speech includes:
determining the actual inter-sentence pauses and sentence-end pauses in the user voice;

determining a new inter-sentence pause threshold according to the determined inter-sentence pauses and sentence-end pauses;

wherein determining the actual inter-sentence pauses and sentence-end pauses in the user voice comprises:

performing syntactic analysis on the recognized text of the user voice, and determining the actual inter-sentence pauses and sentence-end pauses in the user voice;

wherein performing syntactic analysis on the recognized text of the user voice to determine the actual inter-sentence pauses and sentence-end pauses comprises:

extracting semantic role information from the recognized text of the user voice, wherein each piece of semantic role information comprises a group formed by a predicate and its object;

and determining the actual inter-sentence pauses and sentence-end pauses in the user voice according to the number of pieces of extracted semantic role information.
5. A method of voice interaction, comprising:
in the current round of voice interaction process, determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold;
determining a new inter-sentence pause threshold according to the user voice, updating the current inter-sentence pause threshold by using the new inter-sentence pause threshold, and responding to the user voice;
the responding to the user voice comprises:
taking the recognized text of the user voice as a text to be responded;

extracting each high-frequency word in the text to be responded;

matching each extracted high-frequency word against the invalid words in a pre-constructed invalid word bank, and removing the matched high-frequency words and the low-frequency words in the text to be responded from the text to be responded, wherein the invalid word bank stores the invalid words used by the current user;

and performing a voice response according to the text left after the removing operation.
6. The method of claim 5, wherein after removing the matched high-frequency words and the low-frequency words from the text to be responded, the method further comprises:

detecting whether any invalid words exist among the high-frequency words remaining after the matching operation;

and if so, storing the detected invalid words in the invalid word bank, and removing the detected invalid words from the text to be responded.
7. A voice interaction apparatus, comprising:
the user voice determining unit is used for determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold in the current round of voice interaction;
a pause threshold determining unit, configured to determine a new inter-sentence pause threshold according to the user speech, and update the current inter-sentence pause threshold by using the new inter-sentence pause threshold;
a user voice response unit, configured to respond to the user voice;
the pause threshold determining unit includes:

an actual pause determining subunit, configured to determine the actual inter-sentence pauses and sentence-end pauses in the user voice;

and a pause threshold determining subunit, configured to determine a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and sentence-end pauses.
8. The apparatus of claim 7, wherein the pause threshold determining subunit comprises:

a first pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses to obtain the inter-sentence pause duration of the current round;

a second pause duration determining subunit, configured to weight the durations of the determined sentence-end pauses to obtain the sentence-end pause duration of the current round;

and a first pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
9. The apparatus of claim 7, wherein if the current round of voice interaction is any round other than the current user's first round of voice interaction, the pause threshold determining subunit includes:

a third pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses, and weight the result together with the inter-sentence pause duration obtained in the previous round, to obtain the inter-sentence pause duration of the current round;

a fourth pause duration determining subunit, configured to weight the durations of the determined sentence-end pauses, and weight the result together with the sentence-end pause duration obtained in the previous round, to obtain the sentence-end pause duration of the current round;

and a second pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
10. A voice interaction apparatus, comprising:
the user voice determining unit is used for determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold in the current round of voice interaction;
a pause threshold determining unit, configured to determine a new inter-sentence pause threshold according to the user speech, and update the current inter-sentence pause threshold by using the new inter-sentence pause threshold;
a user voice response unit, configured to respond to the user voice;
the pause threshold determining unit includes:

an actual pause determining subunit, configured to determine the actual inter-sentence pauses and sentence-end pauses in the user voice;

and a pause threshold determining subunit, configured to determine a new inter-sentence pause threshold according to the determined inter-sentence pauses and sentence-end pauses;

wherein the actual pause determining subunit is specifically configured to:

perform syntactic analysis on the recognized text of the user voice, and determine the actual inter-sentence pauses and sentence-end pauses in the user voice;

and wherein the actual pause determining subunit includes:

a role information extracting subunit, configured to extract semantic role information from the recognized text of the user voice, wherein each piece of semantic role information comprises a group formed by a predicate and its object;

and a pause determining subunit, configured to determine the actual inter-sentence pauses and sentence-end pauses in the user voice according to the number of pieces of extracted semantic role information.
11. A voice interaction apparatus, comprising:
the user voice determining unit is used for determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold in the current round of voice interaction;
a pause threshold determining unit, configured to determine a new inter-sentence pause threshold according to the user speech, and update the current inter-sentence pause threshold by using the new inter-sentence pause threshold;
a user voice response unit, configured to respond to the user voice;
the user voice response unit includes:

a text determining subunit, configured to take the recognized text of the user voice as a text to be responded;

a vocabulary extracting subunit, configured to extract each high-frequency word in the text to be responded;

a vocabulary matching subunit, configured to match each extracted high-frequency word against the invalid words in a pre-constructed invalid word bank, and remove the matched high-frequency words and the low-frequency words in the text to be responded from the text to be responded, wherein the invalid word bank stores the invalid words used by the current user;

and a voice response subunit, configured to perform a voice response according to the text left after the removal operation.
12. A voice interaction device, comprising: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-6.
13. A computer-readable storage medium having stored therein instructions that, when executed on a terminal device, cause the terminal device to perform the method of any one of claims 1-6.
14. A computer program product, characterized in that the computer program product, when run on a terminal device, causes the terminal device to perform the method of any of claims 1-6.
CN201811513534.9A 2018-12-11 2018-12-11 Voice interaction method and device Active CN109377998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811513534.9A CN109377998B (en) 2018-12-11 2018-12-11 Voice interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811513534.9A CN109377998B (en) 2018-12-11 2018-12-11 Voice interaction method and device

Publications (2)

Publication Number Publication Date
CN109377998A CN109377998A (en) 2019-02-22
CN109377998B true CN109377998B (en) 2022-02-25

Family

ID=65374130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811513534.9A Active CN109377998B (en) 2018-12-11 2018-12-11 Voice interaction method and device

Country Status (1)

Country Link
CN (1) CN109377998B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189751A (en) * 2019-04-24 2019-08-30 中国联合网络通信集团有限公司 Method of speech processing and equipment
CN110310632A (en) * 2019-06-28 2019-10-08 联想(北京)有限公司 Method of speech processing and device and electronic equipment
CN110502631B (en) * 2019-07-17 2022-11-04 招联消费金融有限公司 Input information response method and device, computer equipment and storage medium
CN110400576B (en) * 2019-07-29 2021-10-15 北京声智科技有限公司 Voice request processing method and device
US11288459B2 (en) * 2019-08-01 2022-03-29 International Business Machines Corporation Adapting conversation flow based on cognitive interaction
CN111145756B (en) * 2019-12-26 2022-06-14 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
CN112037786A (en) * 2020-08-31 2020-12-04 百度在线网络技术(北京)有限公司 Voice interaction method, device, equipment and storage medium
CN112116907A (en) * 2020-10-22 2020-12-22 浙江同花顺智能科技有限公司 Speech recognition model establishing method, speech recognition device, speech recognition equipment and medium
CN113436617B (en) * 2021-06-29 2023-08-18 平安科技(深圳)有限公司 Voice sentence breaking method, device, computer equipment and storage medium
CN113393840B (en) * 2021-08-17 2021-11-05 硕广达微电子(深圳)有限公司 Mobile terminal control system and method based on voice recognition
CN115512687B (en) * 2022-11-08 2023-02-17 之江实验室 Voice sentence-breaking method and device, storage medium and electronic equipment

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020042709A1 (en) * 2000-09-29 2002-04-11 Rainer Klisch Method and device for analyzing a spoken sequence of numbers
US6950796B2 (en) * 2001-11-05 2005-09-27 Motorola, Inc. Speech recognition by dynamical noise model adaptation
CN103680500B (en) * 2012-08-29 2018-10-16 北京百度网讯科技有限公司 A kind of method and apparatus of speech recognition
CN103077718B (en) * 2013-01-09 2015-11-25 华为终端有限公司 Method of speech processing, system and terminal
CN103345922B (en) * 2013-07-05 2016-07-06 张巍 A kind of large-length voice full-automatic segmentation method
US9697828B1 (en) * 2014-06-20 2017-07-04 Amazon Technologies, Inc. Keyword detection modeling using contextual and environmental information
CN105812535A (en) * 2014-12-29 2016-07-27 中兴通讯股份有限公司 Method of recording speech communication information and terminal
US9980033B2 (en) * 2015-12-21 2018-05-22 Bragi GmbH Microphone natural speech capture voice dictation system and method
CN106101094A (en) * 2016-06-08 2016-11-09 联想(北京)有限公司 Audio-frequency processing method, sending ending equipment, receiving device and audio frequency processing system
US10032456B2 (en) * 2016-08-17 2018-07-24 International Business Machines Corporation Automated audio data selector
CN107632980B (en) * 2017-08-03 2020-10-27 北京搜狗科技发展有限公司 Voice translation method and device for voice translation

Also Published As

Publication number Publication date
CN109377998A (en) 2019-02-22

Similar Documents

Publication Publication Date Title
CN109377998B (en) Voice interaction method and device
US10991366B2 (en) Method of processing dialogue query priority based on dialog act information dependent on number of empty slots of the query
JP6066471B2 (en) Dialog system and utterance discrimination method for dialog system
EP3370230B1 (en) Voice interaction apparatus, its processing method, and program
WO2017084334A1 (en) Language recognition method, apparatus and device and computer storage medium
KR20190004495A (en) Method, Apparatus and System for processing task using chatbot
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
CN109858038B (en) Text punctuation determination method and device
CN109119070B (en) Voice endpoint detection method, device, equipment and storage medium
CN105654943A (en) Voice wakeup method, apparatus and system thereof
CN110556105B (en) Voice interaction system, processing method thereof, and program thereof
WO2018192186A1 (en) Speech recognition method and apparatus
US11270691B2 (en) Voice interaction system, its processing method, and program therefor
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN112614514B (en) Effective voice fragment detection method, related equipment and readable storage medium
CN112825248A (en) Voice processing method, model training method, interface display method and equipment
CN112071310A (en) Speech recognition method and apparatus, electronic device, and storage medium
CN114385800A (en) Voice conversation method and device
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
US20130238321A1 (en) Dialog text analysis device, method and program
CN114155839A (en) Voice endpoint detection method, device, equipment and storage medium
CN110767240B (en) Equipment control method, equipment, storage medium and device for identifying child accent
CN115512687B (en) Voice sentence-breaking method and device, storage medium and electronic equipment
CN108899016B (en) Voice text normalization method, device and equipment and readable storage medium
CN114242064A (en) Speech recognition method and device, and training method and device of speech recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant