CN109377998B - Voice interaction method and device

Info

Publication number
CN109377998B
Authority
CN
China
Prior art keywords
sentence
pause
inter-sentence
user
current
Legal status
Active
Application number
CN201811513534.9A
Other languages
Chinese (zh)
Other versions
CN109377998A
Inventor
马雪涛
薛臣臣
熊勇军
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Application filed by iFlytek Co Ltd
Priority to CN201811513534.9A
Publication of CN109377998A
Application granted
Publication of CN109377998B


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/04 - Segmentation; Word boundary detection

Abstract

The application discloses a voice interaction method and apparatus. The method comprises: in the current round of voice interaction, determining the user speech of the current user in the current round according to the current inter-sentence pause threshold; then determining a new inter-sentence pause threshold according to that user speech, updating the current inter-sentence pause threshold with the new threshold, and responding to the current round of user speech.

Description

Voice interaction method and device
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech interaction method and apparatus.
Background
With the development of intelligent voice interaction technology, a robot's understanding of and response to semantics have become increasingly human-like; however, the ability to adapt to the expression rhythms of different users is still deficient.
Although existing voice interaction methods can simulate human turn-taking to a certain extent, each user has different expression habits. If a unified rule is used to judge whether a segment of speech has ended, the judgment result may be inaccurate, which degrades the accuracy of subsequent semantic understanding and voice response and worsens the user experience.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a voice interaction method and apparatus that can improve the accuracy of voice response results and enhance the user experience.
The embodiment of the application provides a voice interaction method, which comprises the following steps:
in the current round of voice interaction, determining the user speech of the current user in the current round according to the current inter-sentence pause threshold;
determining a new inter-sentence pause threshold according to the user speech, updating the current inter-sentence pause threshold with the new inter-sentence pause threshold, and responding to the user speech.
Optionally, the determining a new inter-sentence pause threshold according to the user speech includes:
determining the actual inter-sentence pauses and end-of-sentence pauses in the user speech;
determining a new inter-sentence pause threshold according to the determined inter-sentence pauses and end-of-sentence pauses.
Optionally, the determining the actual inter-sentence pauses and end-of-sentence pauses in the user speech includes:
performing syntactic analysis on the recognition text of the user speech to determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech.
Optionally, the performing syntactic analysis on the recognition text of the user speech to determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech includes:
extracting semantic role information from the recognition text of the user speech, where each piece of semantic role information includes a predicate and an object;
determining the actual inter-sentence pauses and end-of-sentence pauses in the user speech according to the number of pieces of extracted semantic role information.
Optionally, the determining a new inter-sentence pause threshold according to the determined inter-sentence pauses and end-of-sentence pauses includes:
determining a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and end-of-sentence pauses.
Optionally, the determining a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and end-of-sentence pauses includes:
weighting the durations of the determined inter-sentence pauses to obtain the inter-sentence pause duration of the current round;
weighting the durations of the determined end-of-sentence pauses to obtain the end-of-sentence pause duration of the current round;
selecting a value between the inter-sentence pause duration of the current round and the end-of-sentence pause duration of the current round as the new inter-sentence pause threshold.
Optionally, if the current round of voice interaction is any round other than the current user's first round, the determining a new inter-sentence pause threshold according to the determined pauses and their corresponding durations includes:
weighting the durations of the determined inter-sentence pauses, and weighting the result together with the inter-sentence pause duration obtained in the previous round to obtain the inter-sentence pause duration of the current round;
weighting the durations of the determined end-of-sentence pauses, and weighting the result together with the end-of-sentence pause duration obtained in the previous round to obtain the end-of-sentence pause duration of the current round;
selecting a value between the inter-sentence pause duration of the current round and the end-of-sentence pause duration of the current round as the new inter-sentence pause threshold.
Optionally, the responding to the user speech includes:
taking the recognition text of the user speech as the text to be responded to;
extracting the high-frequency words in the text to be responded to;
matching each extracted high-frequency word against the invalid words in a pre-constructed invalid word bank, and removing the matched high-frequency words and the low-frequency words from the text to be responded to, where the invalid word bank stores the invalid words used by the current user;
performing a voice response according to the text after the removal operation.
Optionally, after removing the matched high-frequency words and the low-frequency words from the text to be responded to, the method further includes:
detecting whether any invalid word exists among the high-frequency words remaining after the matching operation;
if so, storing the detected invalid words in the invalid word bank and removing them from the text to be responded to.
An embodiment of the present application further provides a voice interaction apparatus, including:
a user speech determining unit, configured to determine, during the current round of voice interaction, the user speech of the current user in the current round according to the current inter-sentence pause threshold;
a pause threshold determining unit, configured to determine a new inter-sentence pause threshold according to the user speech, and to update the current inter-sentence pause threshold with the new inter-sentence pause threshold;
a user speech response unit, configured to respond to the user speech.
Optionally, the pause threshold determining unit includes:
an actual pause determining subunit, configured to determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech;
a pause threshold determining subunit, configured to determine a new inter-sentence pause threshold according to the determined inter-sentence pauses and end-of-sentence pauses.
Optionally, the actual pause determining subunit is specifically configured to:
perform syntactic analysis on the recognition text of the user speech to determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech.
Optionally, the actual pause determining subunit includes:
a role information extracting subunit, configured to extract semantic role information from the recognition text of the user speech, where each piece of semantic role information includes a predicate and an object;
a pause determining subunit, configured to determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech according to the number of pieces of extracted semantic role information.
Optionally, the pause threshold determining subunit is specifically configured to:
determine a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and end-of-sentence pauses.
Optionally, the pause threshold determining subunit includes:
a first pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses to obtain the inter-sentence pause duration of the current round;
a second pause duration determining subunit, configured to weight the durations of the determined end-of-sentence pauses to obtain the end-of-sentence pause duration of the current round;
a first pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the end-of-sentence pause duration of the current round as the new inter-sentence pause threshold.
Optionally, if the current round of voice interaction is any round other than the current user's first round, the pause threshold determining subunit includes:
a third pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses, and to weight the result together with the inter-sentence pause duration obtained in the previous round to obtain the inter-sentence pause duration of the current round;
a fourth pause duration determining subunit, configured to weight the durations of the determined end-of-sentence pauses, and to weight the result together with the end-of-sentence pause duration obtained in the previous round to obtain the end-of-sentence pause duration of the current round;
a second pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the end-of-sentence pause duration of the current round as the new inter-sentence pause threshold.
Optionally, the user speech response unit includes:
a text determining subunit, configured to take the recognition text of the user speech as the text to be responded to;
a word extracting subunit, configured to extract the high-frequency words in the text to be responded to;
a word matching subunit, configured to match each extracted high-frequency word against the invalid words in a pre-constructed invalid word bank, and to remove the matched high-frequency words and the low-frequency words from the text to be responded to, where the invalid word bank stores the invalid words used by the current user;
a voice response subunit, configured to perform a voice response according to the text after the removal operation.
Optionally, the apparatus further includes:
a word detecting unit, configured to detect, after the matched high-frequency words and the low-frequency words have been removed from the text to be responded to, whether any invalid word exists among the high-frequency words remaining after the matching operation;
a word removing unit, configured to, if an invalid word is detected among the remaining high-frequency words, store the detected invalid word in the invalid word bank and remove it from the text to be responded to.
An embodiment of the present application further provides a voice interaction device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is configured to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform any implementation of the above voice interaction method.
An embodiment of the present application further provides a computer-readable storage medium storing instructions which, when run on a terminal device, cause the terminal device to perform any implementation of the above voice interaction method.
An embodiment of the present application further provides a computer program product which, when run on a terminal device, causes the terminal device to perform any implementation of the above voice interaction method.
According to the voice interaction method and apparatus provided by the embodiments of the application, in the current round of voice interaction, the user speech of the current user in the current round is first determined according to the current inter-sentence pause threshold; a new inter-sentence pause threshold is then determined according to that user speech, and the current inter-sentence pause threshold is updated with the new one, so that in the next round of voice interaction the user speech can be determined according to the updated threshold; meanwhile, the current round of user speech is responded to. Compared with the prior art, the personalized expression habit of the current user, namely the user's habitual pause durations, is taken into account during voice interaction, which improves the accuracy of the voice response result, reduces the number of times the user must repeat the same question, and improves the user experience.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is a schematic flowchart of a voice interaction method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of receiving user voice data in a first round of voice interaction according to an embodiment of the present application;
FIG. 3 is a flowchart of determining a new inter-sentence pause threshold from the user speech according to an embodiment of the present application;
fig. 4 is a schematic diagram of the recognition text corresponding to a user's speech according to an embodiment of the present application;
FIG. 5 is an exemplary diagram of a parse tree provided by an embodiment of the present application;
FIG. 6 is a diagram illustrating extraction of semantic role information from a recognition text according to an embodiment of the present application;
fig. 7 is a schematic flowchart of determining a new inter-sentence pause threshold from the durations of the determined inter-sentence and end-of-sentence pauses according to an embodiment of the present application;
FIG. 8 is a flowchart illustrating an example of updating the inter-sentence pause threshold according to an embodiment of the present application;
fig. 9 is a schematic flowchart of responding to the speech preceding each end-of-sentence pause according to an embodiment of the present application;
FIG. 10 is a diagram illustrating an example of extracting high-frequency words from recognized text according to an embodiment of the present application;
fig. 11 is a schematic composition diagram of a voice interaction apparatus according to an embodiment of the present application.
Detailed Description
In some voice interaction scenarios, in order to simulate a real-person conversation, a silence detection technique is usually used to determine whether the speech uttered by a user has ended. Specifically, while the user is speaking, the silence detection technique monitors the volume of the user's voice in real time. When the volume drops below a preset volume threshold, it is determined that the user is not speaking during that period. If this condition lasts only a short time, the period may be judged to be an inter-sentence pause; otherwise, if it lasts long enough to exceed a preset time threshold, the period may be judged to be an end-of-sentence pause, indicating that the user's speech has ended, after which the system (e.g., a robot) may respond based on the speech uttered by the user.
However, this manner of judging inter-sentence and end-of-sentence pauses does not account for each user's different pause habits. For example, some users speak quickly, with correspondingly short inter-sentence and end-of-sentence pauses, while others speak slowly, with correspondingly long pauses. Using a unified rule to judge whether a segment of speech has ended for all users can therefore lead to judgment errors, which lowers the accuracy of subsequent voice responses and degrades the user experience.
To address these drawbacks, an embodiment of the present application provides a voice interaction method. An inter-sentence pause threshold, for example a relatively large one, is preset; when voice interaction begins, the user speech of the current user in the current round is determined according to this preset threshold. A new inter-sentence pause threshold is then determined from the current round of user speech and used as the threshold for determining the user speech in the next round of voice interaction with the current user; meanwhile, the current round of user speech is responded to. In this way, the inter-sentence pause threshold used in each round can be dynamically adjusted to the pause habits of the current user, improving the accuracy of sentence segmentation of the user speech, producing more accurate responses, and effectively avoiding the situation in which the user must repeat the same question many times because no response is obtained. Compared with the prior art, the personalized expression habit of the current user, namely the user's habitual pause durations, is taken into account during voice interaction, which improves the accuracy of the voice response result, reduces repetition, and improves the user experience.
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art based on these embodiments without creative effort shall fall within the protection scope of the present application.
First embodiment
Referring to fig. 1, a schematic flow chart of a voice interaction method provided in this embodiment is shown, where the method includes the following steps:
s101: and in the current round of voice interaction process, determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold.
In this embodiment, the system (e.g., a robot) may perform one or more rounds of voice interaction with the current user; for example, the robot may interact with the current user for the first time, or for the second, third, or a later time. During each round, a silence detection technique may be used to determine the user speech of that round according to the inter-sentence pause threshold corresponding to that round. Note that this embodiment does not limit the language of the user speech; for example, the speech may be in Chinese, English, or another language.
Specifically, while receiving the user's voice in the current round, if a continuous silent period shorter than the current inter-sentence pause threshold appears, that period is judged to be an inter-sentence pause and reception continues; if a continuous silent period longer than the current inter-sentence pause threshold appears, that period is judged to be an end-of-sentence pause, indicating that the user has finished speaking. Reception then stops, and the received speech is taken as the user speech of the current round of interaction.
The current round of voice interaction may be the current user's first round, or a later round such as the second or third round.
If the current round is the current user's first round of voice interaction, a relatively long inter-sentence pause threshold may be preset before the round begins. This threshold may be higher than the inter-sentence pause threshold commonly used in ordinary voice interaction, so that more complete user speech can be captured. For example, if the commonly used threshold is 1000 milliseconds, the threshold in this embodiment may be preset to a larger value, such as 2000 milliseconds, before the first round of voice interaction.
As an illustration, see fig. 2, and assume the preset inter-sentence pause threshold is 2000 milliseconds. During the first round of voice interaction, when the user's utterance is detected from time "1" in fig. 2, the system starts receiving the user's voice data. After the first speech segment, a silent segment of 1200 milliseconds is detected; since its duration does not exceed the preset threshold (1200 < 2000), it is judged to be an inter-sentence pause, and reception continues. After the second speech segment, a silent segment of 1800 milliseconds is detected; since it still does not exceed the threshold (1800 < 2000), it is again judged to be an inter-sentence pause, and reception continues. After the third speech segment, a silent segment of 2000 milliseconds is detected; because its duration reaches the preset threshold of 2000 milliseconds, it is judged to be an end-of-sentence pause, reception stops, and the three received speech segments are taken as the user speech of the current round of interaction.
In addition, if the current round is not the first round of voice interaction, the threshold in use is the one produced by the previous round. For example, if the current round is the second round, the inter-sentence pause threshold used is the new threshold determined during the first round; if the current round is the third round, the threshold used is the one determined during the second round; and so on. Thus, in the second and later rounds, the user speech of each round can be determined with a silence detection technique or other means according to that round's inter-sentence pause threshold.
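As an illustration of this endpointing logic, the following is a minimal sketch in Python. It assumes a frame-level voice activity detector upstream that labels fixed-size audio frames as speech or silence; the frame size, function name, and data layout are illustrative assumptions rather than details taken from the patent.

```python
FRAME_MS = 10  # assumed VAD frame size in milliseconds

def collect_user_speech(frames, pause_threshold_ms):
    """frames: iterable of (is_speech, audio_bytes) pairs from a VAD front end."""
    collected = []
    silence_ms = 0
    started = False
    for is_speech, audio in frames:
        if is_speech:
            started = True
            silence_ms = 0
            collected.append(audio)
        elif started:
            silence_ms += FRAME_MS
            if silence_ms >= pause_threshold_ms:
                break                    # end-of-sentence pause: user has finished
            collected.append(audio)      # inter-sentence pause: keep listening
    return b"".join(collected)
```

With a threshold of 2000 milliseconds, the silent segments of 1200 and 1800 milliseconds in the fig. 2 example are treated as inter-sentence pauses, and reception stops only at the 2000-millisecond silence.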
S102: and determining a new inter-sentence pause threshold according to the voice of the user in the current round, updating the current inter-sentence pause threshold by using the new inter-sentence pause threshold, and responding to the voice of the user in the current round.
In this embodiment, after the user speech of the current user in the current round is determined through step S101, the speech may be processed to determine a new inter-sentence pause threshold, that is, a pause threshold adapted to the pause characteristics of the current user's speech. The current inter-sentence pause threshold from S101 is then updated with the new one; in other words, the current threshold is replaced by the new threshold, to be used in the next round of voice interaction, so that the next round of user speech can be determined according to the updated threshold.
The current round of user speech must also be responded to. For example, if the user says "What's the weather like today?", the robot, after semantic understanding, may answer "The weather is good today, suitable for going out." The specific implementation of "responding to the current round of user speech" in step S102 is described in the second embodiment.
In an implementation of this embodiment, as shown in fig. 3, the implementation of "determining a new inter-sentence pause threshold according to the current round of user speech" in step S102 may specifically include steps S1021-S1022.
S1021: and determining the actual pause between each sentence and the pause at the tail of each sentence in the voice of the user in the current round.
In this embodiment, after the user speech of the current user in the current round of interaction is determined through step S101, any existing or future speech recognition method may be used to recognize it and obtain the corresponding recognition text. As an illustration, based on the example in fig. 2, after speech recognition of the three speech segments in fig. 2, the corresponding recognition texts are obtained as shown in fig. 4: the first text is roughly "I ask", the second is "how much money is this product of yours", and the third is "then I ask how much money does this product of yours sell for". The pauses at indicated positions "1" and "2" in fig. 4 are inter-sentence pauses determined according to the inter-sentence pause threshold of this round, and the pause at indicated position "3" is an end-of-sentence pause determined according to the same threshold.
However, inter-sentence and end-of-sentence pauses distinguished only by the inter-sentence pause threshold may not be accurate. The recognition text of the current round of user speech may therefore be further processed with text processing methods (e.g., methods based on a language model) to determine the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech.
In an optional implementation, syntactic analysis may be performed on the recognition text of the current round of user speech to determine the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech.
In this implementation, word segmentation may first be performed on the recognition text of the current round of user speech to obtain the words it contains. The syntactic functions of these words are then analyzed to identify the subject, predicate, object, attribute, adverbial, complement, and so on in the recognition text. Based on these syntactic constituents, the number of complete sentences contained in the recognition text can be determined: the silent segment at the end of each complete sentence is an actual end-of-sentence pause, and a silent segment inside a complete sentence is an actual inter-sentence pause. In this way the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech are determined. The flow specifically includes the following steps A1-A2:
step A1: semantic role information is extracted from the recognition text of the voice of the current round of the user, and the semantic role information comprises a group of predicates and objects.
In this implementation, the syntactic analysis method may be used to perform syntactic analysis on the recognition text of the user's voice in the current round, for example, a semantic role labeling method based on a shallow syntactic analysis result may be used to perform semantic role labeling on the recognition text, so as to extract one or more semantic role information from the recognition text of the user's voice in the current round, where each semantic role information includes a set of predicates and objects, and of course, the semantic role information may further include a subject. The predicate is a statement or description of a subject (an action or a subject of an action of executing a text), indicates an action or an action such as "what" the subject does "," what "or" what "and is the core of the whole sentence text; an object refers to the recipient of an action (predicate).
Specifically, the noun collocated with the predicate is called argument, and the roles assumed by different arguments in the text may be different, such as different roles of an actor (Agent), a victim (parent), an object (Theme), an Experiencer, a Beneficiary (Beneficiary), a tool (Instrument), a place (Location), a Goal (Goal), and a Source (Source), and among these different arguments with different roles, a subject and an object are usually included, such as the actor (Agent) being the subject and the victim (parent) being the object.
According to grammatical rules, a complete sentence can generally consist of a subject, predicates and objects (the subject can be omitted from the imperative sentence), while a second set of predicates and objects in a piece of text, once present, represents the occurrence of another complete sentence. Based on the above, when semantic role information is extracted from the recognition text of the user voice in the current round by adopting a semantic role labeling method based on a shallow syntactic analysis result, a complete sentence can be found by finding out a subject (i.e., an actor) and an object (a victim) around the "predicate" with the "predicate" as a core, and all the complete sentences contained in the recognition text can be found by analogy in sequence.
For example, the following steps are carried out: as shown in fig. 5, taking the found complete sentence text "Xiaoming yesterday meets a little red in a park at night" as an example, semantic character information extracted from the text by adopting a semantic character labeling method based on a shallow syntactic analysis result includes: "meet" is the predicate, "Xiaoming" is the subject (actor) and "Xiaohong" is the object (victim). In addition, regarding other words, such as "yesterday" is the Time of occurrence (Time) of the event described by the text, "park" is the Location of occurrence (Location) of the event described by the text.
Step A2: Determine the actual inter-sentence pauses and end-of-sentence pauses in the user speech according to the number of pieces of extracted semantic role information.
In this embodiment, after the semantic role information (each piece including a predicate and an object) has been extracted from the recognition text of the current round of user speech through step A1, the actual inter-sentence pauses and end-of-sentence pauses in the user speech can be determined according to the number of pieces extracted. Specifically, since a complete sentence contains at least a predicate and an object, once a predicate and an object matching it have been extracted from the recognition text, a complete sentence can be identified; the pause after that sentence is an end-of-sentence pause, and a pause occurring inside it is an inter-sentence pause. By analogy, the number of complete sentences contained in the recognition text can be determined from the number of extracted predicates and matching objects, and the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech can then be determined.
As an illustration, still taking the three recognition texts shown in fig. 4 as an example: as shown in fig. 6, subject 1 "I" and predicate 1 "ask" can be extracted from the first text, and object 1 "product" can be extracted from the second text, so the first and second texts together contain one predicate and one object collocated with it and thus form one complete sentence. Then subject 2 "I", predicate 2 "ask", and object 2 "product" can be extracted from the third text, so the third text by itself forms another complete sentence. Consequently, the pause between the first two texts and the third text (i.e., the second silent segment, 1800 milliseconds) is an end-of-sentence pause, while the pause between the first and second texts (i.e., the first silent segment, 1200 milliseconds) is an inter-sentence pause.
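To make step A2 concrete, here is a small sketch of the pause-classification logic, assuming a semantic role labeling front end has already tagged each recognized speech segment with the roles found in it; the function and data layout are illustrative assumptions, not the patent's implementation.

```python
def classify_pauses(segment_roles, pauses_ms):
    """segment_roles: per speech segment, the set of roles found by SRL.
    pauses_ms: duration of the silence following each segment (same length).
    Returns (inter_sentence_pauses, end_of_sentence_pauses) in milliseconds."""
    inter, final = [], []
    seen = set()
    for roles, pause in zip(segment_roles, pauses_ms):
        seen |= roles
        if {"predicate", "object"} <= seen:   # a complete sentence has closed here
            final.append(pause)
            seen = set()
        else:                                 # the sentence is still open
            inter.append(pause)
    return inter, final

# The fig. 6 example: "ask" (predicate) in segment 1, "product" (object) in
# segment 2, both roles in segment 3; silences of 1200, 1800, and 2000 ms.
inter, final = classify_pauses(
    [{"predicate"}, {"object"}, {"predicate", "object"}],
    [1200, 1800, 2000])
# inter == [1200], final == [1800, 2000]
```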
S1022: and determining a new inter-sentence pause threshold according to the determined inter-sentence pauses and the determined end-of-sentence pauses.
In this embodiment, after the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech are determined in step S1021, a new inter-sentence pause threshold may be determined from them.
In an implementation of this embodiment, the new inter-sentence pause threshold may be determined according to the durations corresponding to the determined inter-sentence pauses and end-of-sentence pauses.
In this implementation, after the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech have been determined through step S1021, their corresponding durations may be processed to determine a new inter-sentence pause threshold, which serves as the inter-sentence pause threshold used in the next round of voice interaction.
Note that when the current round is the first round of voice interaction, the new inter-sentence pause threshold can be determined directly from the actual inter-sentence pauses and end-of-sentence pauses of this round. When the current round is any round other than the current user's first round (such as the second or third round), the actual pauses determined in this round can be combined with those determined in previous rounds to jointly determine the new threshold.
As shown in fig. 7, the specific flow of this implementation includes the following steps S701-S703:
s701: and weighting the determined duration of pause between each sentence to obtain the pause duration between the sentences of the current round.
When the current round is the current user's first round of voice interaction, as shown in fig. 8, the preset inter-sentence pause threshold is A. If only one inter-sentence pause is determined in the user speech through step S1021, its duration can be used directly as the inter-sentence pause duration of the first round, i.e., the first-round inter-sentence pause B shown in fig. 8. If multiple actual inter-sentence pauses are determined, their durations may be weighted to obtain the first-round inter-sentence pause duration, i.e., the first-round inter-sentence pause B shown in fig. 8, where the weights sum to 1 and may be obtained experimentally or set empirically. For example, suppose three actual inter-sentence pauses of 1000, 1200, and 1400 milliseconds are determined, with weights 0.2, 0.3, and 0.5 respectively; the inter-sentence pause duration of this round is then B = 1000 × 0.2 + 1200 × 0.3 + 1400 × 0.5 = 1260 milliseconds.
When the current round is any round other than the current user's first round of voice interaction, there are two calculation manners. The first is to compute the inter-sentence pause duration of the current round in the same way as in the first round and use it directly as the final value for this round. The second is to compute the current round's value in the same way, then weight it together with the final value of the previous round to obtain the final inter-sentence pause duration of the current round.
That is, when the current round is any round other than the first and the second manner is adopted, step S701 may be replaced with: weight the durations of the determined inter-sentence pauses, then weight the result together with the inter-sentence pause duration obtained in the previous round, to obtain the inter-sentence pause duration of the current round.
Specifically, as shown in fig. 8, take the current round to be the current user's second round of voice interaction, and let the inter-sentence pause threshold determined in the first round be D. If only one actual inter-sentence pause of the second round is determined through step S1021, its duration can be used directly as the second round's inter-sentence pause duration, defined as E. If multiple actual inter-sentence pauses are determined in the second round, their durations may be weighted to obtain the initial inter-sentence pause duration E1 of the second round; the calculation is the same as that of the first-round inter-sentence pause duration B.
Further, to obtain the inter-sentence pause duration E of the second round, the computed initial value E1 may be weighted together with the first round's inter-sentence pause duration B, with the weights summing to 1. This can be done in two ways. One is to take the average, i.e., E = (E1 + B)/2, meaning both weights are 0.5. The other is to weight E1 and B differently: since the second round's initial value E1 reflects the user's inter-sentence pauses in the second round better than the first round's value B does, E1 may be given the larger weight and B the smaller one. For example, with the weight of E1 set to 0.7 and the weight of B set to 0.3, the formula for the second round's inter-sentence pause duration is E = E1 × 0.7 + B × 0.3.
In the third, fourth, and subsequent rounds of voice interaction, the inter-sentence pause duration is calculated in the same way as in the second round, and is not repeated here.
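The weighting schemes of S701 can be sketched as follows, using the patent's own first-round numbers; the specific weight values are examples, since the patent only requires that the weights sum to 1.

```python
def weighted_duration(durations_ms, weights):
    """Weighted combination of pause durations; weights are assumed to sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(d * w for d, w in zip(durations_ms, weights))

# First round: B = 1000*0.2 + 1200*0.3 + 1400*0.5 = 1260 ms (the example above)
B = weighted_duration([1000, 1200, 1400], [0.2, 0.3, 0.5])

# A later round, second manner: weight this round's initial value E1 together
# with the previous round's value B, E1 getting the larger weight (0.7 here).
E1 = weighted_duration([900, 1100], [0.5, 0.5])   # illustrative second-round pauses
E = E1 * 0.7 + B * 0.3                            # smoothed inter-sentence duration
```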
S702: and weighting the determined duration of each sentence end pause to obtain the sentence end pause duration of the current round.
When the current round is the current user's first round of voice interaction, as shown in fig. 8, the preset inter-sentence pause threshold is A. If only one end-of-sentence pause is determined in the user speech through step S1021, its duration can be used directly as the end-of-sentence pause duration of the first round, i.e., the first-round end-of-sentence pause C shown in fig. 8. If multiple actual end-of-sentence pauses are determined, their durations may be weighted, with the weights summing to 1, to obtain the first-round end-of-sentence pause duration, i.e., the first-round end-of-sentence pause C shown in fig. 8.
When the current round is any round other than the current user's first round of voice interaction, there are likewise two calculation manners: either compute the end-of-sentence pause duration of the current round in the same way as in the first round and use it directly as the final value for this round, or compute it that way and then weight it together with the final value of the previous round to obtain the final end-of-sentence pause duration of the current round.
That is, when the current round is any round other than the first and the second manner is adopted, step S702 may be replaced with: weight the durations of the determined end-of-sentence pauses, then weight the result together with the end-of-sentence pause duration obtained in the previous round, to obtain the end-of-sentence pause duration of the current round.
Specifically, as shown in fig. 8, still taking the current round to be the current user's second round of voice interaction, let the inter-sentence pause threshold determined in the first round be D. If only one actual end-of-sentence pause of the second round is determined through step S1021, its duration can be used directly as the second round's end-of-sentence pause duration, defined as F. If multiple actual end-of-sentence pauses are determined in the second round, the calculation of the initial end-of-sentence pause duration F1 is the same as that of the first-round end-of-sentence pause duration C.
Further, to obtain the end-of-sentence pause duration F of the second round, the computed initial value F1 may be weighted together with the first round's end-of-sentence pause duration C, with the weights summing to 1. Again this can be done in two ways. One is to take the average, i.e., F = (F1 + C)/2, meaning both weights are 0.5. The other is to weight F1 and C differently: since the second round's initial value F1 reflects the user's end-of-sentence pauses in the second round better than the first round's value C does, F1 may be given the larger weight and C the smaller one. For example, with the weight of F1 set to 0.6 and the weight of C set to 0.4, the formula for the second round's end-of-sentence pause duration is F = F1 × 0.6 + C × 0.4.
In the third, fourth, and subsequent rounds of voice interaction, the end-of-sentence pause duration is calculated in the same way as in the second round, and is not repeated here.
S703: and selecting a numerical value between the sentence pause time length of the current round and the sentence tail pause time length of the current round as a new sentence pause threshold value.
Specifically, when the current round is the current user's first round of voice interaction, as shown in fig. 8, after the first round's inter-sentence pause duration B and end-of-sentence pause duration C are calculated through steps S701 and S702, a value may be selected between B and C as the new inter-sentence pause threshold, defined as D. For example, the average of B and C may be taken: assuming B is 1200 milliseconds and C is 1800 milliseconds, D = (1200 + 1800)/2 = 1500 milliseconds. The threshold D is then used as the inter-sentence pause threshold in the second round of voice interaction, so that the user speech of the second round can be determined according to D.
Similarly, when the current round is any round other than the current user's first round, a new inter-sentence pause threshold can be generated by the same method. Taking the second round as an example, as shown in fig. 8, a value may be selected between E and F as the new inter-sentence pause threshold, defined as G; for example, the average of E and F may be taken as G, and G is then used as the threshold in the third round of voice interaction. By analogy, a new inter-sentence pause threshold is generated in each subsequent round, achieving continuous, round-by-round dynamic adjustment of the current inter-sentence pause threshold.
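A minimal sketch of the threshold update in S703 follows, using the midpoint choice from the fig. 8 example; any value between the two durations would satisfy the method.

```python
def new_pause_threshold(inter_ms, final_ms):
    """Pick a value between the round's inter-sentence and end-of-sentence
    pause durations; the midpoint is the choice used in the patent's example."""
    return (inter_ms + final_ms) / 2

D = new_pause_threshold(1200, 1800)   # 1500.0 ms, matching the fig. 8 example
```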
In summary, this embodiment can dynamically adjust the inter-sentence pause threshold used in each round of interaction and thereby adapt to the pause habits of the current user. Compared with the prior art, the personalized expression habit of the current user, namely the user's habitual pause durations, is taken into account during voice interaction, which improves the accuracy of the voice response result, reduces the number of times the user must repeat the same question, and improves the user experience.
Second embodiment
This embodiment describes the specific implementation of "responding to the current round of user speech" in step S102 of the first embodiment.
It should be noted that during voice interaction, users often insert filler words and modal particles such as "oh", "ya", "this", and "that" into their speech. These meaningless words are invalid expressions: they contribute nothing to semantic understanding of the user speech and can even interfere with it, especially when they occur in the middle of meaningful words. Moreover, different users produce different personalized words, such as modal particles and interjections, according to their own expression habits. Therefore, after the actual inter-sentence pauses and end-of-sentence pauses in the current round of user speech have been determined, one approach is to directly perform semantic understanding on the speech preceding each end-of-sentence pause and respond accordingly. Another approach is, once the actual pauses have been determined through the first embodiment and the interference of the user's habitual pauses eliminated, to additionally filter out the user's personalized invalid expressions to ensure the accuracy of semantic understanding. As shown in fig. 9, the specific flow of this approach includes the following steps S901-S904:
s901: and taking the recognized text of the voice of the user in the current round as the text to be responded.
In this embodiment, any existing or future speech recognition method may be used to recognize the current round of user speech and obtain the corresponding recognition text, which is taken as the text to be responded to, so that an accurate response to it can be achieved through the subsequent steps S902-S904.
S902: and extracting each high-frequency vocabulary in the text to be responded.
In this embodiment, after the text to be responded to is obtained in step S901, a word segmentation method may be used to segment it into words and to count how many times each word occurs in it. Given a preset count threshold, for example 2, every word whose occurrence count in the text to be responded to is not lower than the threshold is defined as a high-frequency word. All high-frequency words are then extracted from the text, and the remaining words are low-frequency words.
As an illustration, with the count threshold preset to 2, suppose the obtained recognition text (the text to be responded to) is, as in fig. 10, along the lines of "I ask oh how much money is this product of yours, then I ask oh how much money does this product of yours sell for". After word segmentation, the words "I", "ask", "oh", "you", "this", "product", "how much", and "money" each occur twice in the text, which equals the preset count threshold, so they are taken as high-frequency words and can be numbered in order as shown in fig. 10. Correspondingly, the remaining words in the text, such as "then" and "sell", occur only once and are low-frequency words.
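A small sketch of S902 follows, assuming the recognition text has already been segmented into tokens (for Chinese, a segmenter such as jieba could be used; the toy English tokens below merely mirror the fig. 10 example):

```python
from collections import Counter

def split_by_frequency(tokens, count_threshold=2):
    """Split tokens into high-frequency and low-frequency words by count."""
    counts = Counter(tokens)
    high = [w for w, c in counts.items() if c >= count_threshold]
    low = [w for w, c in counts.items() if c < count_threshold]
    return high, low

tokens = ("I ask oh how-much money you this product "
          "then I ask oh how-much money you this product sell").split()
high, low = split_by_frequency(tokens)
# high: the words occurring twice ("I", "ask", "oh", ...); low: ["then", "sell"]
```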
S903: and matching each extracted high-frequency vocabulary with each invalid vocabulary in a pre-constructed invalid vocabulary library, and removing the matched high-frequency vocabulary and the low-frequency vocabulary in the text to be responded from the text to be responded, wherein the invalid vocabulary used by the current user is stored in the invalid vocabulary library.
In this embodiment, after the high frequency words and the low frequency words in the text to be responded are extracted in step S902, the high frequency words and the low frequency words in the text to be responded may be matched with the invalid words in the pre-constructed invalid word bank, and the matched high frequency words and the low frequency words in the text to be responded are removed from the text to be responded, that is, the high frequency words overlapping with the invalid word bank are removed, the high frequency words not overlapping with the invalid word bank are retained, and the low frequency words in the text to be responded are also removed.
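As a sketch of this removal step, the filtering can be expressed in a few lines, reusing `tokens` and `high` from the sketch above; `invalid_word_bank` is assumed to already hold the current user's invalid words:

```python
def filter_text(tokens, high_frequency_words, invalid_word_bank):
    """Step S903: keep only the high-frequency words that do not
    match the user's invalid word bank; matched high-frequency
    words and all low-frequency words are removed."""
    return [w for w in tokens
            if w in high_frequency_words and w not in invalid_word_bank]

# With the fig. 10 high-frequency words and a bank containing "o"/"this",
# only "I", "ask", "you", "product", "how much" and "money" remain.
kept = filter_text(tokens, high, invalid_word_bank={"o", "this"})
```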
The invalid word bank stores the invalid words used by the current user. It should be noted that, during voice interaction, different users produce different personalized invalid words, such as modal particles and verbal tics, according to their own expression habits. Therefore, to ensure accurate semantic understanding of different users' speech, a personalized invalid word bank needs to be created for each user, so that each invalid word bank stores the invalid words of exactly one user and a better filtering effect is achieved.

One way to implement the personalized invalid word bank for each user is as follows. First, segment the recognized text corresponding to the current round of user voice into words, and extract the user's commonly used high-frequency words in the manner of step S902. Then invoke a natural language understanding model to analyze the semantics of all the extracted high-frequency words, separating out the words that carry no clear semantics. For example, based on the example shown in fig. 10, after the high-frequency words "I", "ask", "o", "you", "this", "product", "how much" and "money" are obtained, the natural language understanding model is invoked to analyze their semantics; it finds that "o" and "this" carry no clear semantics, so they are classified as invalid words and placed in the personalized invalid word bank of this user. Over further rounds of voice interaction, the invalid words of the user keep accumulating, further enriching the user's personalized invalid word bank.
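A minimal sketch of this bank-building step follows. The function `has_clear_semantics` is a hypothetical stand-in for the natural language understanding model's judgment, not an API named in this application:

```python
def update_invalid_word_bank(high_frequency_words, invalid_word_bank,
                             has_clear_semantics):
    """Add to the user's personalized bank every commonly used
    high-frequency word that carries no clear semantics."""
    for word in high_frequency_words:
        if not has_clear_semantics(word):
            invalid_word_bank.add(word)
    return invalid_word_bank

# Toy stand-in for the NLU judgment in the fig. 10 example:
bank = update_invalid_word_bank(
    {"I", "ask", "o", "you", "this", "product", "how much", "money"},
    set(),
    has_clear_semantics=lambda w: w not in {"o", "this"},
)
print(sorted(bank))  # ['o', 'this']
```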
Accordingly, in an optional implementation, after the matched high-frequency words and the low-frequency words have been removed from the text to be responded in step S903, the following steps B1-B2 may further be performed:

Step B1: Detect whether any invalid words exist among the high-frequency words remaining after the matching operation.

In this implementation, after each extracted high-frequency word has been matched against the invalid words in the pre-constructed invalid word bank in step S903 and the matched high-frequency words have been removed from the text to be responded, it may further be detected whether invalid words still exist among the remaining high-frequency words. If so, the pre-constructed personalized invalid word bank of the current user does not yet contain these words, and step B2 needs to be executed; if not, step S904 is executed directly.

Step B2: If so, store the detected invalid words in the invalid word bank and remove them from the text to be responded.

In this implementation, if step B1 detects that invalid words still exist among the high-frequency words remaining after the matching operation, these words are not yet contained in the current user's pre-constructed personalized invalid word bank. The detected invalid words may then be stored in that bank, so that a better filtering effect is achieved when processing this user's speech in subsequent voice interactions; meanwhile, they also need to be removed from the text to be responded to ensure the accuracy of subsequent semantic understanding.
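Putting step S903 and steps B1-B2 together, one round of text cleanup can be sketched as below; as before, `has_clear_semantics` is a hypothetical placeholder for the natural language understanding model, and the data shapes are illustrative assumptions:

```python
from collections import Counter

def clean_text_for_response(tokens, invalid_word_bank, has_clear_semantics,
                            count_threshold=2):
    """One round of S903 + B1-B2 over a segmented text to be responded."""
    counts = Counter(tokens)
    high = {w for w, c in counts.items() if c >= count_threshold}
    # S903: drop bank-matched high-frequency words and all low-frequency words.
    kept = [w for w in tokens if w in high and w not in invalid_word_bank]
    # B1: detect invalid words the bank does not contain yet.
    newly_invalid = {w for w in set(kept) if not has_clear_semantics(w)}
    # B2: store them in the bank and remove them from the text.
    invalid_word_bank |= newly_invalid
    kept = [w for w in kept if w not in newly_invalid]
    return kept, invalid_word_bank
```

Because the newly detected invalid words are written back into the bank, later rounds for the same user are filtered by the plain S903 matching alone, which is the accumulation effect described above.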
S904: and performing voice response according to the text after the removing operation.
In this embodiment, after the invalid vocabulary in the text to be responded is removed in step S903, the remaining text content may be semantically understood to obtain a semantic understanding result. Moreover, after the invalid vocabulary is removed, the obtained semantic understanding result can be closer to the meaning which the current user wants to express, and further the system can make a more accurate response which is more in line with the meaning of the user.
In summary, in the embodiment, the personalized invalid word bank corresponding to each user is pre-constructed, and the invalid words in the recognized text corresponding to the voice of the user are personalized and filtered, so that a good filtering effect is achieved, the accuracy of semantic understanding can be ensured, and the user experience is further improved.
Third embodiment
In this embodiment, a voice interaction apparatus is described; for related content, please refer to the method embodiments above.
Referring to fig. 11, a schematic composition diagram of a voice interaction apparatus provided in this embodiment is shown, where the apparatus 1100 includes:
a user voice determining unit 1101, configured to determine, during the current round of voice interaction, the user speech of the current user in the current round according to the current inter-sentence pause threshold;
a pause threshold determining unit 1102, configured to determine a new inter-sentence pause threshold according to the user speech, and update the current inter-sentence pause threshold by using the new inter-sentence pause threshold;
a user voice response unit 1103, configured to respond to the user voice.
In an implementation manner of this embodiment, the pause threshold determining unit 1102 includes:

an actual pause determining subunit, configured to determine the actual inter-sentence pauses and sentence-end pauses in the user speech;

and a pause threshold determining subunit, configured to determine a new inter-sentence pause threshold according to the determined inter-sentence pauses and sentence-end pauses.
In an implementation manner of this embodiment, the actual pause determining subunit is specifically configured to:
and carrying out syntactic analysis on the recognition text of the user voice, and determining actual pause among all sentences and pause at all sentence ends in the user voice.
In an implementation manner of this embodiment, the actual pause determining subunit includes:
a role information extracting subunit, configured to extract semantic role information from the recognized text of the user speech, wherein each piece of semantic role information includes a group formed by a predicate and its object;

and a pause determining subunit, configured to determine the actual inter-sentence pauses and sentence-end pauses in the user speech according to the number of pieces of extracted semantic role information.
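This application does not fix a particular parsing toolchain, but the role-counting logic of these subunits can be sketched as follows. Here `extract_semantic_roles` is a hypothetical stand-in for a semantic role labeler returning the (predicate, object) groups found in a text, and treating a pause as a real sentence boundary only once the accumulated text completes at least one new group is one plausible reading of the counting rule, not a definitive algorithm:

```python
def classify_pauses(segments, pause_durations, extract_semantic_roles):
    """Label detected pauses using semantic role counts.

    `segments` are the recognized-text chunks between detected pauses
    and `pause_durations` their trailing silences in seconds. A pause
    counts as a sentence boundary only if the text accumulated since
    the previous boundary yields at least one (predicate, object)
    group; other pauses are treated as habitual and ignored. The last
    boundary of the round is taken as the sentence-end pause.
    """
    boundaries, buffered = [], ""
    for segment, pause in zip(segments, pause_durations):
        buffered += segment
        if extract_semantic_roles(buffered):  # at least one complete group
            boundaries.append(pause)
            buffered = ""
    sentence_end_pauses = boundaries[-1:]      # closing pause of the round
    inter_sentence_pauses = boundaries[:-1]    # pauses between sentences
    return inter_sentence_pauses, sentence_end_pauses
```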
In an implementation manner of this embodiment, the pause threshold determining subunit is specifically configured to:
and determining a new inter-sentence pause threshold according to the determined duration corresponding to each inter-sentence pause and each sentence end pause.
In an implementation manner of this embodiment, the pause threshold determining subunit includes:

a first pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses to obtain the inter-sentence pause duration of the current round;

a second pause duration determining subunit, configured to weight the durations of the determined sentence-end pauses to obtain the sentence-end pause duration of the current round;

and a first pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
In an implementation manner of this embodiment, if the current round of voice interaction is any round other than the current user's first round of voice interaction, the pause threshold determining subunit includes:

a third pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses, and weight the result together with the inter-sentence pause duration obtained in the previous round, to obtain the inter-sentence pause duration of the current round;

a fourth pause duration determining subunit, configured to weight the durations of the determined sentence-end pauses, and weight the result together with the sentence-end pause duration obtained in the previous round, to obtain the sentence-end pause duration of the current round;

and a second pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
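The weighting performed by these subunits can be sketched as below. The uniform per-pause weights, the smoothing factor for blending in the previous round, and the midpoint choice for "a value between" the two durations are all illustrative assumptions; the application itself only requires some weighting and some value between the two durations:

```python
def update_inter_sentence_threshold(inter_pauses, end_pauses,
                                    prev_inter=None, prev_end=None,
                                    smoothing=0.5, mix=0.5):
    """Compute the current round's pause durations and the new
    inter-sentence pause threshold; all values are in seconds, and
    at least one pause of each kind is assumed to have been observed."""
    inter = sum(inter_pauses) / len(inter_pauses)  # uniform weighting
    end = sum(end_pauses) / len(end_pauses)
    if prev_inter is not None and prev_end is not None:
        # Rounds after the first: blend with the previous round's durations.
        inter = smoothing * inter + (1 - smoothing) * prev_inter
        end = smoothing * end + (1 - smoothing) * prev_end
    threshold = inter + mix * (end - inter)        # a value between the two
    return threshold, inter, end

# First round: inter-sentence pauses of 0.4 s and 0.6 s and a sentence-end
# pause of 1.2 s give durations (0.5, 1.2) and a midpoint threshold of 0.85 s.
threshold, inter, end = update_inter_sentence_threshold([0.4, 0.6], [1.2])
print(threshold, inter, end)  # 0.85 0.5 1.2
```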
In an implementation manner of this embodiment, the user voice response unit 1103 includes:
a text determining subunit, configured to take the recognized text of the user voice as a text to be responded;

a vocabulary extracting subunit, configured to extract each high-frequency word in the text to be responded;

a vocabulary matching subunit, configured to match each extracted high-frequency word against the invalid words in a pre-constructed invalid word bank, and remove the matched high-frequency words and the low-frequency words in the text to be responded from the text to be responded, wherein the invalid word bank stores the invalid words used by the current user;

and a voice response subunit, configured to perform a voice response according to the text left after the removal operation.
In an implementation manner of this embodiment, the apparatus further includes:
a vocabulary detecting unit, configured to, after the matched high-frequency words and the low-frequency words have been removed from the text to be responded, detect whether any invalid words exist among the high-frequency words remaining after the matching operation;

and a vocabulary removing unit, configured to, if invalid words are detected among the high-frequency words remaining after the matching operation, store the detected invalid words in the invalid word bank and remove them from the text to be responded.
Further, an embodiment of the present application further provides a voice interaction device, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any one of the implementation methods of the voice interaction method.
Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is caused to execute any implementation method of the above voice interaction method.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation method of the above voice interaction method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A method of voice interaction, comprising:
in the current round of voice interaction process, determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold;
determining a new inter-sentence pause threshold according to the user voice, updating the current inter-sentence pause threshold by using the new inter-sentence pause threshold, and responding to the user voice;
the determining a new inter-sentence pause threshold according to the user speech includes:
determining the actual inter-sentence pauses and sentence-end pauses in the user voice;

and determining a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and sentence-end pauses.
2. The method of claim 1, wherein determining a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and sentence-end pauses comprises:

weighting the durations of the determined inter-sentence pauses to obtain the inter-sentence pause duration of the current round;

weighting the durations of the determined sentence-end pauses to obtain the sentence-end pause duration of the current round;

and selecting a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
3. The method of claim 1, wherein if the current round of voice interaction is any round other than the current user's first round of voice interaction, determining a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and sentence-end pauses comprises:

weighting the durations of the determined inter-sentence pauses, and weighting the result together with the inter-sentence pause duration obtained in the previous round, to obtain the inter-sentence pause duration of the current round;

weighting the durations of the determined sentence-end pauses, and weighting the result together with the sentence-end pause duration obtained in the previous round, to obtain the sentence-end pause duration of the current round;

and selecting a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
4. A method of voice interaction, comprising:
in the current round of voice interaction process, determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold;
determining a new inter-sentence pause threshold according to the user voice, updating the current inter-sentence pause threshold by using the new inter-sentence pause threshold, and responding to the user voice;
the determining a new inter-sentence pause threshold according to the user speech includes:
determining the actual inter-sentence pauses and sentence-end pauses in the user voice;

determining a new inter-sentence pause threshold according to the determined inter-sentence pauses and sentence-end pauses;

wherein determining the actual inter-sentence pauses and sentence-end pauses in the user voice comprises:

performing syntactic analysis on the recognized text of the user voice, and determining the actual inter-sentence pauses and sentence-end pauses in the user voice;

wherein performing syntactic analysis on the recognized text of the user voice to determine the actual inter-sentence pauses and sentence-end pauses comprises:

extracting semantic role information from the recognized text of the user voice, wherein each piece of semantic role information comprises a group formed by a predicate and its object;

and determining the actual inter-sentence pauses and sentence-end pauses in the user voice according to the number of pieces of extracted semantic role information.
5. A method of voice interaction, comprising:
in the current round of voice interaction process, determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold;
determining a new inter-sentence pause threshold according to the user voice, updating the current inter-sentence pause threshold by using the new inter-sentence pause threshold, and responding to the user voice;
the responding to the user voice comprises:
taking the recognized text of the user voice as a text to be responded;

extracting each high-frequency word in the text to be responded;

matching each extracted high-frequency word against the invalid words in a pre-constructed invalid word bank, and removing the matched high-frequency words and the low-frequency words in the text to be responded from the text to be responded, wherein the invalid word bank stores the invalid words used by the current user;

and performing a voice response according to the text left after the removing operation.
6. The method of claim 5, wherein after removing the matched high-frequency words and the low-frequency words from the text to be responded, the method further comprises:

detecting whether any invalid words exist among the high-frequency words remaining after the matching operation;

and if so, storing the detected invalid words in the invalid word bank, and removing the detected invalid words from the text to be responded.
7. A voice interaction apparatus, comprising:
the user voice determining unit is used for determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold in the current round of voice interaction;
a pause threshold determining unit, configured to determine a new inter-sentence pause threshold according to the user speech, and update the current inter-sentence pause threshold by using the new inter-sentence pause threshold;
a user voice response unit, configured to respond to the user voice;
the pause threshold determining unit includes:

an actual pause determining subunit, configured to determine the actual inter-sentence pauses and sentence-end pauses in the user voice;

and a pause threshold determining subunit, configured to determine a new inter-sentence pause threshold according to the durations corresponding to the determined inter-sentence pauses and sentence-end pauses.
8. The apparatus of claim 7, wherein the pause threshold determining subunit comprises:

a first pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses to obtain the inter-sentence pause duration of the current round;

a second pause duration determining subunit, configured to weight the durations of the determined sentence-end pauses to obtain the sentence-end pause duration of the current round;

and a first pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
9. The apparatus of claim 7, wherein if the current round of voice interaction is any round other than the current user's first round of voice interaction, the pause threshold determining subunit includes:

a third pause duration determining subunit, configured to weight the durations of the determined inter-sentence pauses, and weight the result together with the inter-sentence pause duration obtained in the previous round, to obtain the inter-sentence pause duration of the current round;

a fourth pause duration determining subunit, configured to weight the durations of the determined sentence-end pauses, and weight the result together with the sentence-end pause duration obtained in the previous round, to obtain the sentence-end pause duration of the current round;

and a second pause threshold determining subunit, configured to select a value between the inter-sentence pause duration of the current round and the sentence-end pause duration of the current round as the new inter-sentence pause threshold.
10. A voice interaction apparatus, comprising:
the user voice determining unit is used for determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold in the current round of voice interaction;
a pause threshold determining unit, configured to determine a new inter-sentence pause threshold according to the user speech, and update the current inter-sentence pause threshold by using the new inter-sentence pause threshold;
a user voice response unit, configured to respond to the user voice;
the pause threshold determining unit includes:

an actual pause determining subunit, configured to determine the actual inter-sentence pauses and sentence-end pauses in the user voice;

and a pause threshold determining subunit, configured to determine a new inter-sentence pause threshold according to the determined inter-sentence pauses and sentence-end pauses;

wherein the actual pause determining subunit is specifically configured to:

perform syntactic analysis on the recognized text of the user voice, and determine the actual inter-sentence pauses and sentence-end pauses in the user voice;

and wherein the actual pause determining subunit includes:

a role information extracting subunit, configured to extract semantic role information from the recognized text of the user voice, wherein each piece of semantic role information comprises a group formed by a predicate and its object;

and a pause determining subunit, configured to determine the actual inter-sentence pauses and sentence-end pauses in the user voice according to the number of pieces of extracted semantic role information.
11. A voice interaction apparatus, comprising:
the user voice determining unit is used for determining the user voice of the current user in the current round of interaction according to the current inter-sentence pause threshold in the current round of voice interaction;
a pause threshold determining unit, configured to determine a new inter-sentence pause threshold according to the user speech, and update the current inter-sentence pause threshold by using the new inter-sentence pause threshold;
a user voice response unit, configured to respond to the user voice;
the user voice response unit includes:

a text determining subunit, configured to take the recognized text of the user voice as a text to be responded;

a vocabulary extracting subunit, configured to extract each high-frequency word in the text to be responded;

a vocabulary matching subunit, configured to match each extracted high-frequency word against the invalid words in a pre-constructed invalid word bank, and remove the matched high-frequency words and the low-frequency words in the text to be responded from the text to be responded, wherein the invalid word bank stores the invalid words used by the current user;

and a voice response subunit, configured to perform a voice response according to the text left after the removal operation.
12. A voice interaction device, comprising: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of claims 1-6.
13. A computer-readable storage medium having stored therein instructions that, when executed on a terminal device, cause the terminal device to perform the method of any one of claims 1-6.
14. A computer program product, characterized in that the computer program product, when run on a terminal device, causes the terminal device to perform the method of any of claims 1-6.
CN201811513534.9A 2018-12-11 2018-12-11 Voice interaction method and device Active CN109377998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811513534.9A CN109377998B (en) 2018-12-11 2018-12-11 Voice interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811513534.9A CN109377998B (en) 2018-12-11 2018-12-11 Voice interaction method and device

Publications (2)

Publication Number Publication Date
CN109377998A CN109377998A (en) 2019-02-22
CN109377998B true CN109377998B (en) 2022-02-25

Family

ID=65374130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811513534.9A Active CN109377998B (en) 2018-12-11 2018-12-11 Voice interaction method and device

Country Status (1)

Country Link
CN (1) CN109377998B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189751A (en) * 2019-04-24 2019-08-30 中国联合网络通信集团有限公司 Method of speech processing and equipment
CN110310632A (en) * 2019-06-28 2019-10-08 联想(北京)有限公司 Method of speech processing and device and electronic equipment
CN110502631B (en) * 2019-07-17 2022-11-04 招联消费金融有限公司 Input information response method and device, computer equipment and storage medium
CN110400576B (en) * 2019-07-29 2021-10-15 北京声智科技有限公司 Voice request processing method and device
US11288459B2 (en) * 2019-08-01 2022-03-29 International Business Machines Corporation Adapting conversation flow based on cognitive interaction
CN111145756B (en) * 2019-12-26 2022-06-14 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
CN112037786A (en) * 2020-08-31 2020-12-04 百度在线网络技术(北京)有限公司 Voice interaction method, device, equipment and storage medium
CN112116907A (en) * 2020-10-22 2020-12-22 浙江同花顺智能科技有限公司 Speech recognition model establishing method, speech recognition device, speech recognition equipment and medium
CN113436617B (en) * 2021-06-29 2023-08-18 平安科技(深圳)有限公司 Voice sentence breaking method, device, computer equipment and storage medium
CN113393840B (en) * 2021-08-17 2021-11-05 硕广达微电子(深圳)有限公司 Mobile terminal control system and method based on voice recognition
CN115512687B (en) * 2022-11-08 2023-02-17 之江实验室 Voice sentence-breaking method and device, storage medium and electronic equipment

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020042709A1 (en) * 2000-09-29 2002-04-11 Rainer Klisch Method and device for analyzing a spoken sequence of numbers
US6950796B2 (en) * 2001-11-05 2005-09-27 Motorola, Inc. Speech recognition by dynamical noise model adaptation
CN103680500B (en) * 2012-08-29 2018-10-16 北京百度网讯科技有限公司 A kind of method and apparatus of speech recognition
CN103077718B (en) * 2013-01-09 2015-11-25 华为终端有限公司 Method of speech processing, system and terminal
CN103345922B (en) * 2013-07-05 2016-07-06 张巍 A kind of large-length voice full-automatic segmentation method
US9697828B1 (en) * 2014-06-20 2017-07-04 Amazon Technologies, Inc. Keyword detection modeling using contextual and environmental information
CN105812535A (en) * 2014-12-29 2016-07-27 中兴通讯股份有限公司 Method of recording speech communication information and terminal
US9980033B2 (en) * 2015-12-21 2018-05-22 Bragi GmbH Microphone natural speech capture voice dictation system and method
CN106101094A (en) * 2016-06-08 2016-11-09 联想(北京)有限公司 Audio-frequency processing method, sending ending equipment, receiving device and audio frequency processing system
US10032456B2 (en) * 2016-08-17 2018-07-24 International Business Machines Corporation Automated audio data selector
CN107632980B (en) * 2017-08-03 2020-10-27 北京搜狗科技发展有限公司 Voice translation method and device for voice translation

Also Published As

Publication number Publication date
CN109377998A (en) 2019-02-22

Similar Documents

Publication Publication Date Title
CN109377998B (en) Voice interaction method and device
US10991366B2 (en) Method of processing dialogue query priority based on dialog act information dependent on number of empty slots of the query
JP6066471B2 (en) Dialog system and utterance discrimination method for dialog system
EP3370230B1 (en) Voice interaction apparatus, its processing method, and program
WO2017084334A1 (en) Language recognition method, apparatus and device and computer storage medium
KR20190004495A (en) Method, Apparatus and System for processing task using chatbot
CN111968679B (en) Emotion recognition method and device, electronic equipment and storage medium
CN109858038B (en) Text punctuation determination method and device
CN109119070B (en) Voice endpoint detection method, device, equipment and storage medium
CN105654943A (en) Voice wakeup method, apparatus and system thereof
CN110556105B (en) Voice interaction system, processing method thereof, and program thereof
WO2018192186A1 (en) Speech recognition method and apparatus
US11270691B2 (en) Voice interaction system, its processing method, and program therefor
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN112614514B (en) Effective voice fragment detection method, related equipment and readable storage medium
CN112825248A (en) Voice processing method, model training method, interface display method and equipment
CN112071310A (en) Speech recognition method and apparatus, electronic device, and storage medium
CN114385800A (en) Voice conversation method and device
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
US20130238321A1 (en) Dialog text analysis device, method and program
CN114155839A (en) Voice endpoint detection method, device, equipment and storage medium
CN110767240B (en) Equipment control method, equipment, storage medium and device for identifying child accent
CN115512687B (en) Voice sentence-breaking method and device, storage medium and electronic equipment
CN108899016B (en) Voice text normalization method, device and equipment and readable storage medium
CN114242064A (en) Speech recognition method and device, and training method and device of speech recognition model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant