WO2023207212A1

WO2023207212A1 - Voice dialogue detection method and apparatus

Info

Publication number: WO2023207212A1
Application number: PCT/CN2023/070200
Authority: WO
Inventors: 邓成东; 曾琳铖曦; 郭江; 吴海英
Original assignee: 马上消费金融股份有限公司
Priority date: 2022-04-24
Filing date: 2023-01-03
Publication date: 2023-11-02
Also published as: CN114842849A; CN114842849B

Abstract

A voice dialogue detection method and apparatus. The voice dialogue detection method comprises: on the basis of dialogue-related information and dialogue text of a plurality of dialogue sentences in target voice data, performing interruption behavior pre-detection on the plurality of dialogue sentences, so as to determine one or more candidate dialogue sentences that may have an interruption behavior in the plurality of dialogue sentences (S102), wherein the target voice data comprises dialogue sentences of speakers of different roles, and each piece of the dialogue-related information comprises dialogue start-stop time information and a speaker role; and for each candidate dialogue sentence, determining, on the basis of at least one of an emotion recognition result of the candidate dialogue sentence and a voice feature of the candidate dialogue sentence, whether an interruption behavior exists in the candidate dialogue sentence (S104).

Description

Voice dialogue detection method and device

Cross-references to related applications

This application claims priority from Chinese Patent Application No. 202210451120.8 filed on April 24, 2022, the entire content of which is incorporated herein by reference.

Technical field

The present application relates to speech processing technology, and in particular to speech dialogue detection methods and devices.

Background

Detecting whether participants in a voice conversation engage in interrupting behavior is an important part of voice conversation detection and is widely used in scenarios such as telephone operations and intelligent question and answer.

In some cases, voice dialogue detection relies mainly on simple detection rules to determine whether a participant in a voice dialogue engages in interrupting behavior. For example, if participant A responds before participant B has finished speaking, participant A is determined to have interrupted. However, such a rule is too simplistic to detect interrupting behavior accurately in complex conversation scenarios. For example, when participant A is talking at length, participant B may respond before participant A has finished speaking purely out of patience and respect, without actually interrupting or cutting off participant A.

Summary of the invention

In view of this, embodiments of the present application provide voice dialogue detection methods and devices.

A voice dialogue detection method provided by an embodiment of the present application includes: performing interruption pre-detection on a plurality of dialogue sentences in target voice data, based on the dialogue-related information and dialogue text of each of the dialogue sentences, to determine one or more candidate dialogue sentences among the plurality of dialogue sentences that may involve interrupting behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and a speaker role; and, for each candidate dialogue sentence in the one or more candidate dialogue sentences, determining whether the candidate dialogue sentence involves interrupting behavior based on at least one of an emotion recognition result, obtained by performing emotion recognition on the candidate dialogue sentence using an emotion recognition model, and a speech feature of the candidate dialogue sentence.

A voice dialogue detection device provided by an embodiment of the present application includes: a first determination module, configured to perform interruption pre-detection on a plurality of dialogue sentences in target voice data, based on the dialogue-related information and dialogue text of each of the dialogue sentences, to determine one or more candidate dialogue sentences among the plurality of dialogue sentences that may involve interrupting behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and a speaker role; and a second determination module, configured to determine, for each candidate dialogue sentence in the one or more candidate dialogue sentences, whether the candidate dialogue sentence involves interrupting behavior based on at least one of an emotion recognition result, obtained by performing emotion recognition on the candidate dialogue sentence using an emotion recognition model, and a speech feature of the candidate dialogue sentence.

An electronic device provided by an embodiment of the present application includes: a processor; and a memory used to store instructions executable by the processor, wherein the processor is configured to implement the above voice dialogue detection method when executing the instructions.

An embodiment of the present application provides a computer-readable storage medium. When instructions in the storage medium are executed by a processor of an electronic device, the electronic device can implement the above voice dialogue detection method.

Description of the drawings

Figure 1 is a schematic flow chart of a voice dialogue detection method provided by an embodiment of the present application;

Figure 2 is a schematic flow chart of a voice dialogue detection method provided by another embodiment of the present application;

Figure 3 is a schematic flow chart of a voice dialogue detection method provided by another embodiment of the present application;

Figure 4 is a schematic diagram of applicable scenarios for the voice dialogue detection method provided by an embodiment of the present application;

Figure 5 is a schematic diagram of a configuration interface provided by an embodiment of the present application;

Figure 6 is a schematic diagram of a configuration interface provided by another embodiment of the present application;

Figure 7 is a schematic structural diagram of a voice dialogue detection device provided by an embodiment of the present application;

Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed description

In order to make the purpose, technical solutions and advantages of the present application clearer, the technical solutions of the present application will be clearly and completely described below in conjunction with specific embodiments of the present application and corresponding drawings. The described embodiments are illustrative only and are not intended to limit the application. Based on the embodiments described in this application, all other embodiments obtained by those of ordinary skill in the art without creative efforts fall within the scope of protection of this application.

The terms "first", "second", etc. in this application are used to distinguish similar objects and are not used to describe a specific order or sequence. It is to be understood that data so used are interchangeable under appropriate circumstances so that embodiments of the present application can be practiced in sequences other than those illustrated or described herein. In addition, "and/or" in this application means at least one of the connected objects, and the character "/" generally means that the related objects are in an "or" relationship.

Some concept explanations:

Interruption: One party participating in a conversation interrupts the other party by starting to speak before the other party has finished speaking.

Intelligent customer service quality inspection system: a system that uses detection models, detection algorithms and the like to inspect the text content of voice, video and other data. It is used to check the behavior of customer service personnel, for example to detect whether a dialogue participant engages in interrupting behavior, which helps improve the service quality of customer service personnel.

Automatic Speech Recognition (ASR) technology: refers to the conversion from speech to text, that is, using computers to convert meaningful speech produced by humans into written language.

Interrupting behavior usually means that one party starts speaking before the other party has finished speaking, and that the interrupting speech is not too brief. Using this regularity, the embodiments of the present application propose a dialogue sentence detection solution. First, candidate dialogue sentences that may involve interrupting behavior are determined from the dialogue sentences of speakers with different roles, based on dialogue-related information such as the dialogue start and end time information and speaker roles of these dialogue sentences, together with their dialogue texts. Then, using the regularity that a speaker who interrupts usually speaks louder and shows negative or agitated emotion, the emotion recognition results and/or voice features of the candidate dialogue sentences are used to further determine whether the candidate dialogue sentences involve interrupting behavior. Compared with the overly simplistic approach of treating any case where one party speaks before the other party has finished speaking as interrupting behavior, the embodiments of the present application avoid misjudging, as interrupting behavior, cases where one party responds before the other party has finished speaking out of patience and respect, thereby improving the detection accuracy for voice dialogues.

It should be understood that the voice dialogue detection method provided by the embodiment of the present application can be executed by an electronic device or software installed in the electronic device, and specifically can be executed by a terminal device or a server device.

The embodiments of the present application will be described in detail below with reference to the accompanying drawings.

Referring to Figure 1, a voice dialogue detection method provided by an embodiment of the present application may include steps S102 to S104.

In step S102, based on the dialogue-related information and dialogue text of each of the plurality of dialogue sentences in the target voice data, interruption pre-detection is performed on the plurality of dialogue sentences to determine one or more candidate dialogue sentences among them that may involve interrupting behavior.

The target speech data may include conversational sentences of speakers with different roles. For example, in a telephone operation scenario, the target voice data includes conversational sentences between users and customer service personnel; for another example, in a video conference scenario, the target voice data includes conversational sentences between different conference participants, and so on.

The dialogue-related information of the dialogue sentence may include the dialogue start and end time information of the dialogue sentence and the speaker's role. The start and end time information of the dialogue sentence includes the start time of the dialogue sentence (that is, the time when the speaker starts speaking) and the end time (that is, the time when the speaker stops speaking).

The dialogue text of the dialogue sentence represents the dialogue content of the dialogue sentence. In practical applications, the dialogue text of the dialogue sentence can be obtained by identifying the dialogue sentence based on ASR technology.

Considering that interrupting behavior usually means that one party starts speaking before the other party has finished speaking, and that the interrupting speech is not too brief, the dialogue sentences of speakers with different roles can be pre-detected based on their dialogue-related information (the dialogue start and end time information and the speaker roles) and their dialogue texts, to determine the candidate dialogue sentences that may involve interrupting behavior.

For example, in one implementation, suppose the first dialogue sentence and the second dialogue sentence are any two adjacent dialogue sentences in the target voice data. If the two have different speaker roles, and the start time of the second dialogue sentence is after the start time of the first dialogue sentence and before the end time of the first dialogue sentence, interruption pre-detection can be performed on the second dialogue sentence. Specifically, the intersection duration between the first dialogue sentence and the second dialogue sentence may be determined based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence. If the end time of the first dialogue sentence is before the end time of the second dialogue sentence, the intersection duration is equal to the difference between the end time of the first dialogue sentence and the start time of the second dialogue sentence; if the end time of the first dialogue sentence is after the end time of the second dialogue sentence, the intersection duration is equal to the difference between the end time and the start time of the second dialogue sentence. If the intersection duration exceeds a preset duration, or the number of characters contained in the dialogue text of the second dialogue sentence exceeds a preset number of characters, the second dialogue sentence is determined to be a candidate dialogue sentence that may involve interrupting behavior. In actual applications, the preset duration and the preset number of characters can be set according to actual needs. For example, the preset duration can be set to 3 seconds, and the preset number of characters can be set to 5.
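The pairwise rule above can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the `Utterance` type, its field names, the role labels and the function names are all assumptions introduced here, and the character count is taken as the plain length of the text string. The two threshold values are the examples given in the description.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One dialogue sentence (hypothetical container; field names are assumptions)."""
    role: str      # speaker role, e.g. "agent" or "user"
    start_ms: int  # time at which the speaker starts speaking
    end_ms: int    # time at which the speaker stops speaking
    text: str      # dialogue text, e.g. obtained through ASR


PRESET_DURATION_MS = 3000  # example preset duration from the description (3 seconds)
PRESET_CHAR_COUNT = 5      # example preset number of characters from the description


def intersection_ms(first: Utterance, second: Utterance) -> int:
    """Overlap of two sentences, assuming `second` starts while `first` is ongoing.

    If `first` ends earlier, the overlap is first.end - second.start;
    otherwise it is second.end - second.start. Taking the minimum of the
    two end times covers both cases.
    """
    return min(first.end_ms, second.end_ms) - second.start_ms


def is_candidate(first: Utterance, second: Utterance) -> bool:
    """Pre-detect whether `second` may interrupt `first`."""
    if first.role == second.role:
        return False  # same speaker role: no pre-detection
    if not (first.start_ms < second.start_ms < first.end_ms):
        return False  # `second` does not start while `first` is speaking
    return (intersection_ms(first, second) > PRESET_DURATION_MS
            or len(second.text) > PRESET_CHAR_COUNT)
```

With these thresholds, a one-character reply that overlaps the other speaker for 440 ms is not flagged, whereas an 18-character reply that overlaps for 4670 ms is.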

In order to avoid missed detections, as shown in Figure 2, the dialogue sentences in the target voice data can be sorted by start time from earliest to latest, and the above steps can then be performed on each pair of adjacent dialogue sentences in turn until all dialogue sentences in the target voice data have been judged. For example, if the intersection duration between the Nth dialogue sentence (N being a positive integer) and the (N+1)th dialogue sentence exceeds the preset duration, or the number of characters contained in the dialogue text of the (N+1)th dialogue sentence exceeds the preset number of characters, the (N+1)th dialogue sentence is determined to be a candidate dialogue sentence; otherwise, the above process is repeated for the (N+1)th and (N+2)th dialogue sentences, and so on, until all dialogue sentences in the target voice data have been judged.
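The sorted scan over adjacent pairs might look like the sketch below. The `Utterance` tuple, the function name `find_candidates` and its parameter names are assumptions made for illustration; the default thresholds are the example values mentioned earlier (3 seconds and 5 characters).

```python
from typing import List, NamedTuple


class Utterance(NamedTuple):
    """One dialogue sentence (hypothetical container; field names are assumptions)."""
    role: str
    start_ms: int
    end_ms: int
    text: str


def find_candidates(utterances: List[Utterance],
                    preset_duration_ms: int = 3000,
                    preset_char_count: int = 5) -> List[Utterance]:
    """Return the sentences that may involve interrupting behavior."""
    # Sort by start time, earliest first, then walk adjacent pairs (N, N+1).
    ordered = sorted(utterances, key=lambda u: u.start_ms)
    candidates = []
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.role == cur.role:
            continue  # same speaker role: skip pre-detection
        if not (prev.start_ms < cur.start_ms < prev.end_ms):
            continue  # `cur` did not start while `prev` was speaking
        overlap = min(prev.end_ms, cur.end_ms) - cur.start_ms
        if overlap > preset_duration_ms or len(cur.text) > preset_char_count:
            candidates.append(cur)
    return candidates
```

In a scenario where only one role is of interest, such as customer service staff in a telephone operation, the returned list could additionally be filtered by `role`.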

For example, taking the telephone operation scenario as an example, the target voice data includes conversational sentences between customer service personnel and users. Since in this scenario we only need to pay attention to whether the customer service staff has interrupting behavior, we only determine whether there is interrupting behavior in the customer service staff's conversational statements. It is assumed that the dialogue-related information and dialogue text of the dialogue sentence in the target speech data are as follows:

In the target speech data shown above, there are no other dialogue sentences before the first dialogue sentence, so it can be determined that there is no interrupting behavior in the first dialogue sentence. The speaker role of the second dialogue sentence is the user, so the pre-detection of interjection is not performed on the second dialogue sentence.

The speaker role of the third dialogue sentence is customer service staff, which is different from the speaker role of the second dialogue sentence, and the start time of the third dialogue sentence (5880 ms) is after the start time of the second dialogue sentence (4760 ms) and before the end time of the second dialogue sentence (10240 ms), so interruption pre-detection is performed on the third dialogue sentence. The end time of the third dialogue sentence (6320 ms) is before the end time of the second dialogue sentence (10240 ms), so the intersection duration between the third dialogue sentence and the second dialogue sentence is equal to the difference between the end time (6320 ms) and the start time (5880 ms) of the third dialogue sentence, that is, 440 ms, which is less than the preset duration of 3 seconds. In addition, the dialogue text of the third dialogue sentence contains 1 character, which is less than the preset number of characters, 5. Therefore, it can be determined that the third dialogue sentence does not involve interrupting behavior.

The speaker role of the fourth dialogue sentence is the same as the speaker role of the third dialogue sentence, so the fourth dialogue sentence is not pre-detected for interjection. The speaker role of the fifth dialogue sentence is the user, so the fifth dialogue sentence is not pre-detected for interrupting.

The speaker role of the sixth dialogue sentence is customer service staff, which is different from the speaker role of the fifth dialogue sentence, and the start time of the sixth dialogue sentence (15830 ms) is after the start time of the fifth dialogue sentence (14640 ms) and before the end time of the fifth dialogue sentence (23270 ms), so interruption pre-detection is performed on the sixth dialogue sentence. The end time of the sixth dialogue sentence (20500 ms) is before the end time of the fifth dialogue sentence (23270 ms), so the intersection duration between the sixth dialogue sentence and the fifth dialogue sentence is equal to the difference between the end time (20500 ms) and the start time (15830 ms) of the sixth dialogue sentence, that is, 4670 ms, which exceeds the preset duration of 3 seconds. From this, it can be determined that the sixth dialogue sentence is a candidate dialogue sentence that may involve interrupting behavior. Alternatively, the dialogue text of the sixth dialogue sentence contains 18 characters, which exceeds the preset number of characters, 5, so the sixth dialogue sentence can likewise be determined to be a candidate dialogue sentence that may involve interrupting behavior.

It can be understood that, in actual conversation scenarios, when one party to a voice dialogue is talking at length, the other party will sometimes respond before the first party has finished speaking, out of patience and respect, for example with polite words such as "um" or "okay", without any intention to interrupt. If the oversimplified approach of treating any case where one party speaks before the other party has finished speaking as interrupting behavior were adopted, dialogue sentences of this type would be misjudged as interrupting behavior. In view of this, the dialogue sentences of speakers with different roles can be pre-detected based on the intersection duration between two dialogue sentences with different speaker roles and the number of characters contained in the dialogue text, to determine the dialogue sentences that may involve interrupting behavior. Specifically, when the intersection duration is long or the dialogue text contains many characters, it can be determined that the dialogue sentence may involve interrupting behavior. In this way, dialogue sentences such as the polite words above are not misjudged as interrupting behavior, which helps improve the accuracy of voice dialogue detection.

In addition, in actual dialogue scenarios, dialogue participants may add polite words, greetings and the like when speaking, out of patience, respect and politeness. If there are many such words, a dialogue sentence consisting of polite words or greetings may, under the above method, be misjudged as possibly involving interrupting behavior. In view of this, in one implementation, in order to improve the detection accuracy for voice dialogues, the voice dialogue detection method provided by the embodiment of the present application may also include: before step S102, determining whether the dialogue text of each dialogue sentence in the target voice data contains preset words, and, if the dialogue text of a dialogue sentence contains preset words, deleting the preset words from that dialogue text. The preset words can be set according to actual needs. For example, the preset words can include the above-mentioned polite words, greetings and the like.

In the embodiment of the present application, any appropriate method may be used to determine whether the dialogue text of a dialogue sentence contains preset words. For example, in one implementation, the dialogue text of the dialogue sentence can be segmented to obtain the words it contains; each of these words is then matched against the preset words in a preset word library to determine whether the dialogue text of the dialogue sentence contains preset words.

For example, the preset word library can be obtained by exhaustively enumerating the preset words. A regular-expression matching algorithm is then used to match each word contained in the dialogue text of the dialogue sentence against the preset words in the preset word library. If the matching result indicates that the matching degree value between a word contained in the dialogue text and a preset word in the preset word library exceeds a preset matching degree value, it can be determined that the dialogue text of the dialogue sentence contains a preset word.

It can be understood that determining whether the dialogue text of a dialogue sentence contains preset words by segmenting the dialogue text and matching the resulting words against the preset word library is highly accurate, and is suitable for scenarios in which the preset words in the preset word library do not change much.
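As a rough illustration of library-based matching followed by deletion, the sketch below removes preset words with a regular expression from Python's standard `re` module. It makes several assumptions that the description does not state: the word library contents are hypothetical, the text is whitespace-delimited English (Chinese text would first need a word segmenter, since `\b` word boundaries do not separate Chinese words), and matching is exact rather than based on a matching degree value.

```python
import re

# Hypothetical preset word library of polite words and acknowledgements.
PRESET_WORDS = {"um", "ok", "okay", "sure"}

# One alternation pattern built from the library. re.escape guards any
# special characters; longer words come first so they win over prefixes.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(sorted(map(re.escape, PRESET_WORDS), key=len, reverse=True))
    + r")\b[,.!?]*\s*",
    flags=re.IGNORECASE,
)


def contains_preset_word(text: str) -> bool:
    """True if any preset word occurs in the text as a whole word."""
    return _PATTERN.search(text) is not None


def strip_preset_words(text: str) -> str:
    """Delete preset words, with trailing punctuation, from the dialogue text."""
    return _PATTERN.sub("", text).strip()
```

Applying `strip_preset_words` before counting characters keeps a reply made up only of acknowledgements from exceeding the preset number of characters.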

In another implementation, the dialogue text of the dialogue sentence can be input into a pre-trained word recognition model to obtain a word recognition result for the dialogue text. The word recognition result indicates whether the dialogue text of the dialogue sentence contains a preset word. For example, the word recognition result can indicate the similarity between a word in the dialogue text and one or more preset words. The similarity is usually a floating-point value between 0 and 1, and a larger value means a higher similarity. The word recognition model is trained on sample text and the word labels of the words contained in the sample text, where the word label of a word indicates whether the word is a preset word. In practical applications, the word label of a word can be represented by one-hot encoding. For example, a word label of [0,1] means that the word is not a preset word, and a word label of [1,0] means that the word is a preset word. For example, if the sample text is "Well, okay", the words it contains are {"Well", "Okay"}, where the word label corresponding to "Well" is [1,0] and the word label corresponding to "Okay" is also [1,0].
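The labelling convention and the use of the similarity values can be sketched as below. This shows only the data handling around the model, not the model itself; the function names and the 0.5 decision threshold are assumptions for illustration, and the similarity scores are taken as already produced by some word recognition model.

```python
from typing import Dict, List, Set, Tuple

PRESET = [1, 0]      # one-hot label: the word is a preset word
NOT_PRESET = [0, 1]  # one-hot label: the word is not a preset word


def label_words(words: List[str], preset_words: Set[str]) -> List[Tuple[str, List[int]]]:
    """Build (word, one-hot label) training pairs for a sample text."""
    return [(w, PRESET if w.lower() in preset_words else NOT_PRESET) for w in words]


def contains_preset_word(similarities: Dict[str, float], threshold: float = 0.5) -> bool:
    """Decide from per-word similarity scores in [0, 1] (hypothetical threshold)."""
    return any(score > threshold for score in similarities.values())


# The sample text "Well, okay" from the description yields two preset-word labels.
pairs = label_words(["Well", "Okay"], {"well", "okay"})
```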

It should be noted that in actual applications, the type of word recognition model can be selected according to actual needs. For example, the word recognition model can be a Bidirectional Encoder Representations from Transformers (BERT) model.

It can be understood that, because the model is trained on sample text and the word labels of the words contained in the sample text, the trained word recognition model has generalized recognition capabilities, and its recognition ability and accuracy can be improved continuously by adding new sample text. Based on the trained word recognition model, whether the dialogue text of a dialogue sentence contains preset words such as polite words can be recognized simply and accurately.

In step S104, for each candidate dialogue sentence, whether the candidate dialogue sentence involves interrupting behavior is determined based on at least one of the emotion recognition result, obtained by performing emotion recognition on the candidate dialogue sentence using the emotion recognition model, and the speech features of the candidate dialogue sentence.

In the embodiment of this application, the emotion recognition model refers to a pre-trained machine learning model with emotion recognition capabilities. Specifically, the emotion recognition model can be trained by using the emotion-related features of the sample dialogue sentences and the emotion labels corresponding to the sample dialogue sentences. The emotion-related features of the sample dialogue sentence refer to the characteristics of the sample dialogue sentence that can represent the speaker's emotion, such as the spectrogram characteristics of the sample dialogue sentence, etc. The emotion label corresponding to the sample dialogue sentence is used to indicate the emotional tendency of the sample dialogue sentence, such as positive emotion or negative emotion. In one implementation, the tendency value of the emotional tendency corresponding to the sample dialogue sentence may include, for example, a positive emotional value and a negative emotional value. If the positive sentiment value of the sample conversational sentence is higher, it means that the sample conversational sentence is more inclined to positive sentiment; if the negative sentiment value of the sample conversational sentence is higher, it means that the sample conversational sentence is more inclined to negative sentiment. It should be noted that in actual applications, the type of emotion recognition model can be selected according to actual needs.

As an example, feature extraction can be performed on the candidate dialogue sentences to obtain the emotion-related features of the candidate dialogue sentences, and then the emotion-related features of the candidate dialogue sentences are input into the emotion recognition model to obtain the emotion recognition results of the candidate dialogue sentences. The emotion recognition result may represent the emotional tendency of the candidate dialogue sentence, or may represent the tendency value of the emotional tendency of the candidate dialogue sentence.

The voice features of the candidate dialogue sentence may include, but are not limited to, the volume of the candidate dialogue sentence and/or the volume change value of the candidate dialogue sentence relative to a first associated dialogue sentence, where the speaker role of the first associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence. For example, if the speaker role of the candidate dialogue sentence is customer service staff, the first associated dialogue sentence may be a dialogue sentence uttered by the customer service staff before the candidate dialogue sentence.
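The description does not specify how volume is measured. One possible choice, shown purely as an assumption, is the RMS level of the audio samples expressed in decibels relative to full scale, with the volume change value taken as the difference between the candidate sentence and the first associated sentence. The function names are hypothetical.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level in dB relative to full scale, for samples in [-1.0, 1.0]."""
    if not samples:
        raise ValueError("samples must not be empty")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Floor the value so that digital silence does not produce log10(0).
    return 20.0 * math.log10(max(rms, 1e-10))


def volume_change_db(candidate: Sequence[float], associated: Sequence[float]) -> float:
    """Volume change of the candidate relative to the first associated sentence.

    A positive value means the candidate is louder than the speaker's
    earlier sentence.
    """
    return rms_dbfs(candidate) - rms_dbfs(associated)
```

Doubling the amplitude corresponds to an increase of about 6 dB under this measure.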

Considering that a speaker who interrupts usually speaks louder and shows negative or agitated emotion, determining whether the candidate dialogue sentence involves interrupting behavior based on its emotion recognition result and/or its voice features can improve the detection accuracy for voice dialogues.

In one implementation, as shown in Figure 3, whether the candidate dialogue sentence satisfies a preset interruption condition can be determined based on the emotion recognition result and/or the volume change value. If so, it is determined that the candidate dialogue sentence involves interrupting behavior. The preset interruption condition may include that the negative emotion value of the candidate dialogue sentence exceeds a preset emotion threshold or that the volume change value exceeds a preset volume value. In actual applications, the preset emotion threshold and the preset volume value can be set according to actual needs.
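The preset interruption condition amounts to an "or" over two threshold checks, either of which may be unavailable. A minimal sketch follows; the function name, the parameter names and both default thresholds are placeholders chosen for illustration and do not come from the description.

```python
from typing import Optional


def satisfies_interruption_condition(
    negative_emotion: Optional[float] = None,
    volume_change: Optional[float] = None,
    emotion_threshold: float = 0.7,  # hypothetical preset emotion threshold
    volume_threshold: float = 6.0,   # hypothetical preset volume value
) -> bool:
    """True if either available signal exceeds its preset threshold.

    Pass None for a signal that was not computed; at least one of the two
    is expected, matching the "and/or" wording of the description.
    """
    emotion_hit = negative_emotion is not None and negative_emotion > emotion_threshold
    volume_hit = volume_change is not None and volume_change > volume_threshold
    return emotion_hit or volume_hit
```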

Still taking the above target voice data as an example, after the sixth dialogue sentence is determined to be a candidate dialogue sentence, if the volume change value of the sixth dialogue sentence relative to the fourth dialogue sentence (serving as the first associated dialogue sentence) exceeds the preset volume value, it can be determined that the sixth dialogue sentence involves interrupting behavior.

It can be understood that the candidate dialogue sentence is determined to involve interrupting behavior when its negative emotion value exceeds the preset emotion threshold or when its volume change value relative to the first associated dialogue sentence exceeds the preset volume value. Compared with the oversimplified approach of treating any case where one party speaks before the other party has finished speaking as interrupting behavior, this avoids misjudging, as interrupting behavior, cases where one party responds before the other party has finished speaking out of patience and respect, which helps improve the detection accuracy for voice dialogues.

In addition, in actual dialogue scenarios, the following situation may occur: when the parties to a dialogue clearly intend to end the dialogue, one party suddenly asks a question or the like, so that the other participants start speaking before the first party has finished speaking, without intending to interrupt. In order to avoid misjudging such behavior of the other participants as interrupting behavior, in one implementation, inspection exemption conditions can be set in advance for this situation. For any candidate dialogue sentence, if it is determined that the candidate dialogue sentence satisfies the inspection exemption condition, it can be directly determined that the candidate dialogue sentence does not involve interrupting behavior, without performing the above step S104, as shown in Figure 3.

For example, it may be determined whether the candidate dialogue sentence satisfies the preset exemption condition based on the second associated dialogue sentence and the third associated dialogue sentence of the candidate dialogue sentence. The speaker role of the second associated dialogue sentence is different from the speaker role of the candidate dialogue sentence, and the speaker role of the third associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence.

The preset exemption condition can be set according to actual needs. For example, to further improve the accuracy of interruption detection, the preset exemption condition may include: the intention of the second associated dialogue sentence is to end the dialogue, and the matching degree value between the dialogue text of the third associated dialogue sentence and the preset end dialogue text exceeds the first preset degree threshold. The preset end dialogue text may be a standard text used to end a conversation, such as "Thank you for calling, goodbye".

In one implementation, intention recognition can be performed on the second associated dialogue sentence, based on the intention recognition model and the dialogue text of the second associated dialogue sentence, to obtain an intention recognition result, and the dialogue text of the third associated dialogue sentence can be matched with the preset end dialogue text to obtain a first matching result. Then, based on the intention recognition result and the first matching result, it can be determined whether the candidate dialogue sentence satisfies the preset exemption condition. The start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence, and the start time of the third associated dialogue sentence is between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
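The exemption check can be illustrated with the following sketch. Several parts are assumptions: the matching degree is computed here with Python's `difflib.SequenceMatcher`, which the source does not prescribe; `keyword_end_intent` is a toy stand-in for the trained intention recognition model; and the threshold value is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable

END_TEXT = "Thank you for calling, goodbye"  # example ending text from the description
MATCH_THRESHOLD = 0.8                        # hypothetical first preset degree threshold


def match_degree(text: str, reference: str = END_TEXT) -> float:
    """Similarity in [0, 1] between a dialogue text and the preset end dialogue text."""
    return SequenceMatcher(None, text.lower(), reference.lower()).ratio()


def keyword_end_intent(text: str) -> bool:
    """Toy stand-in for the intention recognition model (assumption)."""
    lowered = text.lower()
    return any(k in lowered for k in ("that's all", "nothing else", "goodbye"))


def is_exempt(second_text: str, third_text: str,
              is_end_intent: Callable[[str], bool] = keyword_end_intent,
              threshold: float = MATCH_THRESHOLD) -> bool:
    """Exempt when the other party intends to end the dialogue and the
    candidate speaker's earlier sentence matches the preset end dialogue text."""
    return is_end_intent(second_text) and match_degree(third_text) > threshold
```

Passing the intention recognizer in as a callable keeps the sketch independent of any particular model type, which the description leaves open.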

In the embodiment of this application, the intent recognition model refers to a pre-trained machine learning model with intent recognition capabilities. Specifically, the intent recognition model can be trained by using the intent-related features of the sample conversation text and the intent tags corresponding to the sample conversation text. Intention-related features of the sample dialogue text may include word features and/or sentence features of the sample dialogue text that can characterize the speaker's intention. The intent tag corresponding to the sample conversation text is used to indicate the intent of the sample conversation text, for example, indicating whether the intent of the sample conversation text is to end the conversation. It should be noted that in actual applications, the type of intent recognition model can be selected according to actual needs.

In one implementation, to identify the intention of the second associated dialogue sentence, feature extraction can be performed on the dialogue text of the second associated dialogue sentence to obtain the intention-related features of that dialogue text. These intention-related features are then input into the intention recognition model to determine whether the intention of the second associated dialogue sentence is to end the dialogue.

For example, in a voice call scenario, the conversation between the calling party and the called party is as follows:

Among the above dialogue sentences, assume that the fourth dialogue sentence is determined to be a candidate dialogue sentence through step S102 above. Based on the start and end time information and the speaker role of each dialogue sentence, the first dialogue sentence can be determined as the second associated dialogue sentence of the candidate dialogue sentence, and the second dialogue sentence can be determined as the third associated dialogue sentence of the candidate dialogue sentence. By performing intention recognition on the second associated dialogue sentence (i.e., the first dialogue sentence) through the intention recognition model, it can be determined that the intention of the second associated dialogue sentence is to end the dialogue. By matching the dialogue text of the third associated dialogue sentence (i.e., the second dialogue sentence) with the preset end dialogue text, it can be determined that the matching degree value between the two exceeds the first preset degree threshold. Therefore, it can be determined that the candidate dialogue sentence (i.e., the fourth dialogue sentence) satisfies the preset exemption condition, and thus that the candidate dialogue sentence has no interruption behavior. That is, the fourth dialogue sentence belongs to the situation in which the calling party suddenly asks a question when both parties clearly intend to end the conversation, causing the called party to start speaking before the calling party has finished. Therefore, it can be determined that the fourth dialogue sentence is not an interruption by the called party.

As described above, intention recognition can first be performed on a preceding dialogue sentence of a speaker with a different role, a preceding dialogue sentence of the candidate dialogue sentence's own speaker can be matched with the preset end dialogue text, and whether the candidate dialogue sentence satisfies the preset exemption condition can be determined by combining the intention recognition result and the matching result. Then, if it is determined that the candidate dialogue sentence does not satisfy the preset exemption condition, whether the candidate dialogue sentence has an interruption behavior can be determined based on the emotion recognition result and/or the speech features of the candidate dialogue sentence. In this way, some special situations in actual dialogue scenarios are not misjudged as interruptions, which helps improve the detection accuracy of voice dialogue.

The voice dialogue detection method in the embodiment of the present application can be used in a variety of scenarios that require interruption detection, such as telephone operations and intelligent question and answer. Taking the telephone operation scenario as an example, the voice dialogue detection method provided by the embodiment of the present application is described below.

As shown in Figure 4, the telephone operation scenario involves the client 10 and the intelligent customer service quality inspection system 20. The client 10 can display a configuration interface for developer A to configure quality inspection rules. For example, as shown in Figure 5, developer A can configure rule 1 corresponding to the preset exemption condition, rule 2 corresponding to interruption pre-detection, rule 3 corresponding to secondary interruption detection, rule 4 corresponding to excluding interruptions with a small number of words, and so on. For example, rule 1 may include the intention of the above-mentioned second associated dialogue sentence and the condition to be satisfied by the third associated dialogue sentence. Rule 2 may include the preset intersection duration, the preset number of characters, the interruption delay, and so on (as shown in Figure 6). Rule 3 may include an emotion recognition model, an intention recognition model, and so on, which are used to further determine whether a candidate dialogue sentence has an interruption behavior. Rule 4 may include a preset number of characters and so on.
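One way to represent the four configurable rules is a small configuration object, sketched below. All class names, field names and default values are hypothetical; the description lists the rule contents but gives no schema or values, and it does not define how the interruption delay is used.

```python
from dataclasses import dataclass, field


@dataclass
class PreDetectionRule:
    """Rule 2: interruption pre-detection parameters (values are placeholders)."""
    min_intersection_seconds: float = 1.0    # preset intersection duration
    min_characters: int = 5                  # preset number of characters
    interruption_delay_seconds: float = 0.0  # listed in the source, semantics not given


@dataclass
class QualityInspectionRules:
    # Rule 1: preset exemption condition
    end_dialogue_text: str = "Thank you for calling, goodbye"
    match_degree_threshold: float = 0.8
    # Rule 2: interruption pre-detection
    pre_detection: PreDetectionRule = field(default_factory=PreDetectionRule)
    # Rule 3: secondary interruption detection
    emotion_threshold: float = 0.7
    volume_threshold: float = 6.0
    # Rule 4: exclude sentences with a small number of words
    short_sentence_max_characters: int = 2


rules = QualityInspectionRules()
```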

The client 10 can send the quality inspection rules configured by developer A to the intelligent customer service quality inspection system 20 for its use. The client 10 can also display a voice data import interface. User B, who has voice dialogue quality inspection requirements, can import the target voice data to be detected into the client 10 through the voice data import interface. The client 10 can send the imported target voice data to the intelligent customer service quality inspection system 20 and, according to the voice dialogue detection trigger instruction input by user B, send to the intelligent customer service quality inspection system 20 a request to detect dialogue sentences with interruption behavior in the target voice data.

The intelligent customer service quality inspection system 20 may include a server or a server cluster composed of multiple servers. Based on the pre-configured quality inspection rules, the intelligent customer service quality inspection system 20 can execute the voice dialogue detection method disclosed in the above embodiments of the present application to determine whether the target voice data contains dialogue sentences with interruption behavior, and return the detection results to the client 10. The client 10 displays the detection results to user B, so that user B can take corresponding measures to improve customer service quality based on the detection results.

Specifically, the intelligent customer service quality inspection system 20 can obtain the speech features and dialogue-related information (for example, the dialogue start and end times and the speaker role) of each dialogue sentence in the target voice data and, based on ASR technology, convert the target voice data into corresponding text to obtain the dialogue text of each dialogue sentence. Next, based on rule 4, the intelligent customer service quality inspection system 20 can exclude dialogue sentences with a small number of words from the target voice data. Then, based on the dialogue-related information and dialogue text of the remaining dialogue sentences, it can perform interruption pre-detection on these dialogue sentences according to rule 2 to determine candidate dialogue sentences that may have an interruption behavior. The intelligent customer service quality inspection system 20 may then determine whether each candidate dialogue sentence satisfies the preset exemption condition based on the second associated dialogue sentence and the third associated dialogue sentence of the candidate dialogue sentence. If the candidate dialogue sentence satisfies the preset exemption condition, the intelligent customer service quality inspection system 20 can determine that the candidate dialogue sentence has no interruption behavior. If the candidate dialogue sentence does not satisfy the preset exemption condition, the intelligent customer service quality inspection system 20 can, based on rule 3, call the emotion recognition model to perform emotion recognition on the candidate dialogue sentence to obtain an emotion recognition result, and determine whether the candidate dialogue sentence has an interruption behavior based on the emotion recognition result and/or the speech features of the candidate dialogue sentence.
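The order of operations in this flow can be sketched as a small pipeline. This is only an outline under stated assumptions: `Sentence`, `detect_interruptions` and the three callables are hypothetical names, the callables stand in for rules 2, 1 and 3, and ASR and feature extraction are assumed to have already produced the `Sentence` records.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sentence:
    role: str     # speaker role, e.g. "agent" or "customer"
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # dialogue text obtained from ASR


def detect_interruptions(
    sentences: List[Sentence],
    min_chars: int,
    is_candidate: Callable[[Sentence, Sentence], bool],
    is_exempt: Callable[[List[Sentence], int], bool],
    secondary_check: Callable[[Sentence], bool],
) -> List[Sentence]:
    """Return the sentences judged to have an interruption behavior."""
    # Rule 4: exclude sentences with a small number of characters.
    kept = [s for s in sentences if len(s.text) >= min_chars]
    flagged = []
    for i in range(1, len(kept)):
        previous, current = kept[i - 1], kept[i]
        if not is_candidate(previous, current):  # rule 2: pre-detection
            continue
        if is_exempt(kept, i):                   # rule 1: exemption condition
            continue
        if secondary_check(current):             # rule 3: emotion / speech features
            flagged.append(current)
    return flagged
```

Keeping the three checks as injected callables mirrors the way the rules are configured separately from the detection system.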

In addition, corresponding to the above-mentioned voice dialogue detection method, embodiments of the present application also provide a voice dialogue detection device. Referring to FIG. 7 , the voice dialogue detection device 700 may include a first determination module 710 and a second determination module 730 .

The first determination module 710 may perform interruption pre-detection on the plurality of dialogue sentences in the target voice data, based on the dialogue-related information and dialogue text of each of the plurality of dialogue sentences, to determine one or more candidate dialogue sentences, among the plurality of dialogue sentences, that may have an interruption behavior. The target voice data may include dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and speaker roles.

For each candidate dialogue sentence determined by the first determination module 710, the second determination module 730 may determine whether the candidate dialogue sentence has an interruption behavior based on at least one of the emotion recognition result, obtained by using the emotion recognition model to perform emotion recognition on the candidate dialogue sentence, and the speech features of the candidate dialogue sentence.

In one embodiment, the emotion recognition result includes the negative emotion value of the candidate dialogue sentence, and the speech features of the candidate dialogue sentence include the volume change value of the candidate dialogue sentence relative to the first associated dialogue sentence, where the speaker role of the first associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence.

In this case, the second determination module 730 may include a first interruption determination sub-module. If it is determined that the negative emotion value of the candidate dialogue sentence exceeds the preset emotion threshold or that the volume change value exceeds the preset volume threshold, the first interruption determination sub-module may determine that the candidate dialogue sentence has an interruption behavior.

In one implementation, the voice dialogue detection device 700 may also include an exemption identification module.

For each candidate dialogue sentence determined by the first determination module 710, the exemption identification module can determine whether the candidate dialogue sentence satisfies the preset exemption condition based on the second associated dialogue sentence and the third associated dialogue sentence of the candidate dialogue sentence. The speaker role of the second associated dialogue sentence is different from the speaker role of the candidate dialogue sentence, and the speaker role of the third associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence.

When the exemption identification module determines that the candidate dialogue sentence satisfies the preset exemption condition, the second determination module 730 can directly determine that the candidate dialogue sentence has no interruption behavior. When the exemption identification module determines that the candidate dialogue sentence does not satisfy the preset exemption condition, the second determination module 730 may determine whether the candidate dialogue sentence has an interruption behavior based on the emotion recognition result of the candidate dialogue sentence and/or the speech features of the candidate dialogue sentence.

In one implementation, the preset exemption condition includes: the intention of the second associated dialogue sentence is to end the dialogue, and the matching degree value between the dialogue text of the third associated dialogue sentence and the preset end dialogue text exceeds the first preset degree threshold.

In this case, the exemption identification module may include an intention recognition sub-module, a matching sub-module and an exemption recognition sub-module.

The intention recognition sub-module can perform intention recognition on the second associated dialogue sentence, based on the intention recognition model and the dialogue text of the second associated dialogue sentence, to obtain an intention recognition result, wherein the start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence.

The matching sub-module can match the dialogue text of the third associated dialogue sentence with the preset end dialogue text to obtain the first matching result, wherein the start time of the third associated dialogue sentence is between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.

The exemption recognition sub-module may determine whether the candidate dialogue statement satisfies the preset exemption conditions based on the intention recognition result obtained by the intention recognition sub-module and the first matching result obtained by the matching sub-module.

In one implementation, the first determination module 710 may include an intersection duration determination sub-module and a candidate dialogue sentence determination sub-module.

If the first dialogue sentence and the second dialogue sentence are any two adjacent dialogue sentences in the target voice data, the first dialogue sentence and the second dialogue sentence have different speaker roles, and the start time of the second dialogue sentence is after the start time of the first dialogue sentence and before the end time of the first dialogue sentence, the intersection duration determination sub-module may determine the intersection duration between the first dialogue sentence and the second dialogue sentence based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence.

If the intersection duration determined by the intersection duration determination sub-module exceeds the preset duration, or the number of characters contained in the dialogue text of the second dialogue sentence exceeds the preset number of characters, the candidate dialogue sentence determination sub-module may determine the second dialogue sentence to be a candidate dialogue sentence that may have an interruption behavior.
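The pre-detection performed by these two sub-modules can be sketched as below. The overlap formula (the earlier of the two end times minus the start time of the second sentence) is an assumption consistent with the three time points named above, not a formula given in the source, and the default thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    role: str
    start: float  # seconds
    end: float    # seconds
    text: str


def starts_during(first: Sentence, second: Sentence) -> bool:
    """True if a different speaker starts while the first sentence is still ongoing."""
    return first.role != second.role and first.start < second.start < first.end


def intersection_duration(first: Sentence, second: Sentence) -> float:
    """Overlap between the two sentences, or 0.0 if the precondition fails."""
    if not starts_during(first, second):
        return 0.0
    return min(first.end, second.end) - second.start


def is_candidate(first: Sentence, second: Sentence,
                 min_duration: float = 1.0, min_chars: int = 5) -> bool:
    """Candidate if the overlap or the text length exceeds its preset value."""
    if not starts_during(first, second):
        return False
    return (intersection_duration(first, second) > min_duration
            or len(second.text) > min_chars)
```

With these defaults, a one-word reply that overlaps only briefly is not treated as a candidate, while a long overlap or a long overlapping utterance is.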

In one implementation, the voice dialogue detection device 700 may further include a third determination module and a deletion module.

Before the first determination module 710 performs interruption pre-detection on the plurality of dialogue sentences in the target voice data, the third determination module may determine whether the dialogue text of each dialogue sentence contains a preset word. If the third determination module determines that the dialogue text of a dialogue sentence contains the preset word, the deletion module may delete the preset word from the dialogue text of that dialogue sentence.

In one implementation, the third determination module may include a word segmentation sub-module and a matching sub-module.

The word segmentation sub-module can perform word segmentation processing on the dialogue text of the dialogue sentence to obtain the words contained in the dialogue text of the dialogue sentence. The matching sub-module may match each word contained in the dialogue text of the dialogue statement with the preset words in the preset word library to determine whether the dialogue text of the dialogue statement contains the preset word.
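The segmentation-and-matching approach can be sketched as follows. The lexicon contents are an assumption (filler words are used purely for illustration), and whitespace splitting stands in for a real word segmenter, which Chinese dialogue text would require.

```python
from typing import List, Set

# Hypothetical preset word library; the entries are illustrative only.
PRESET_WORDS: Set[str] = {"um", "uh", "er"}


def segment(text: str) -> List[str]:
    """Naive whitespace segmentation (placeholder for a real word segmenter)."""
    return text.split()


def _is_preset(word: str, lexicon: Set[str]) -> bool:
    return word.lower().strip(",.!?") in lexicon


def contains_preset_word(text: str, lexicon: Set[str] = PRESET_WORDS) -> bool:
    """Match each segmented word against the preset word library."""
    return any(_is_preset(w, lexicon) for w in segment(text))


def remove_preset_words(text: str, lexicon: Set[str] = PRESET_WORDS) -> str:
    """Delete preset words so they do not inflate the character count."""
    return " ".join(w for w in segment(text) if not _is_preset(w, lexicon))
```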

In one implementation, the third determination module may include a word determination sub-module.

The word determination sub-module can input the dialogue text of the dialogue sentence into the pre-trained word recognition model to obtain the word recognition result of the dialogue text of the dialogue sentence. The word recognition result is used to indicate whether the dialogue text of the dialogue sentence contains a preset word. The word recognition model is obtained by model training based on sample text and the word tags of the words contained in the sample text, where the word tag of a word is used to indicate whether the word is a preset word.

In addition, embodiments of the present application also provide an electronic device. Referring to Figure 8, the electronic device may include a processor, an internal bus, a network interface, a memory, and so on. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as disk storage. Of course, the electronic device may also include other hardware required by the business.

The processor, network interface and memory can be connected to each other through an internal bus, which can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one bidirectional arrow is used in Figure 8, but this does not mean that there is only one bus or one type of bus.

The memory is used to store programs. Specifically, a program may include program code that includes computer operating instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to form a voice dialogue detection device at the logical level. The processor executes the program stored in the memory to execute the above voice dialogue detection method.

The processor may be an integrated circuit chip that has signal processing capabilities. During implementation, each step of the above method can be completed by hardware integrated logic circuits in the processor or by instructions in the form of software. The above-mentioned processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor can implement or execute each method, step and logical block diagram disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in conjunction with the embodiments of the present application can be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software module can be located in a storage medium that is mature in this field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.

Embodiments of the present application also propose a computer-readable storage medium that stores one or more programs. The one or more programs include instructions that, when executed by a processor of an electronic device, cause the electronic device to execute the above voice dialogue detection method.

The systems, devices, modules or units described in the above embodiments may be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

Computer-readable media include permanent and non-permanent, removable and non-removable media that can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "includes," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements that are not expressly listed or that are inherent to the process, method, article, or device. Without further limitation, an element defined by the statement "comprises a..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the stated element.

The above describes some embodiments of the present application and is not intended to limit the present application. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desired results. Multitasking and parallel processing are also possible or may be advantageous in certain implementations. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of this application shall be included in the scope of this application.

Claims

1. A voice dialogue detection method, including:

Based on the dialogue-related information and dialogue text of each of a plurality of dialogue sentences in target voice data, performing interruption pre-detection on the plurality of dialogue sentences to determine one or more candidate dialogue sentences, among the plurality of dialogue sentences, that may have an interruption behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and speaker roles;

For each candidate dialogue sentence in the one or more candidate dialogue sentences, determining whether the candidate dialogue sentence has an interruption behavior based on at least one of an emotion recognition result, obtained by using an emotion recognition model to perform emotion recognition on the candidate dialogue sentence, and the speech features of the candidate dialogue sentence.
2. The method of claim 1, wherein,

The emotion recognition result includes the negative emotion value of the candidate dialogue sentence;

Determining whether the candidate dialogue sentence has an interruption behavior based on at least one of the emotion recognition result and the speech features includes: in response to determining that the negative emotion value exceeds a preset emotion threshold, determining that the candidate dialogue sentence has an interruption behavior.
3. The method of claim 1, wherein,

The speech features of the candidate dialogue sentence include a volume change value of the candidate dialogue sentence relative to a first associated dialogue sentence of the candidate dialogue sentence, the speaker role of the first associated dialogue sentence being the same as the speaker role of the candidate dialogue sentence;

Determining whether the candidate dialogue sentence has an interruption behavior based on at least one of the emotion recognition result and the speech features includes: in response to determining that the volume change value exceeds a preset volume value, determining that the candidate dialogue sentence has an interruption behavior.
4. The method according to claim 1, wherein determining whether the candidate dialogue sentence has an interruption behavior based on at least one of the emotion recognition result and the speech features includes:

In response to determining that the candidate dialogue sentence does not satisfy a preset exemption condition, determining whether the candidate dialogue sentence has an interruption behavior based on at least one of the emotion recognition result and the speech features.
5. The method according to claim 4, wherein the preset exemption condition includes that the intention of a second associated dialogue sentence of the candidate dialogue sentence is to end the dialogue, and that the matching degree value between the dialogue text of a third associated dialogue sentence of the candidate dialogue sentence and a preset end dialogue text exceeds a first preset degree threshold,

Wherein, the speaker role of the second associated dialogue sentence is different from the speaker role of the candidate dialogue sentence, and the start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence; the speaker role of the third associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence, and the start time of the third associated dialogue sentence is between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
6. The method of claim 5, wherein,

The intention of the second associated dialogue sentence is obtained by performing intention recognition on the second associated dialogue sentence based on the intention recognition model and the dialogue text of the second associated dialogue sentence.
7. The method according to claim 1, wherein performing interruption pre-detection on the plurality of dialogue sentences to determine the one or more candidate dialogue sentences includes: for a first dialogue sentence and a second dialogue sentence that are adjacent among the plurality of dialogue sentences, wherein the first dialogue sentence and the second dialogue sentence have different speaker roles, and the start time of the second dialogue sentence is after the start time of the first dialogue sentence and before the end time of the first dialogue sentence,

Based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence, determining the intersection duration between the first dialogue sentence and the second dialogue sentence;

In response to determining that the intersection duration exceeds a preset duration or that the dialogue text of the second dialogue sentence contains more than a preset number of characters, determining the second dialogue sentence to be the candidate dialogue sentence.
8. The method according to claim 1, further comprising: before performing interruption pre-detection on the plurality of dialogue sentences, for each dialogue sentence in the plurality of dialogue sentences,

Determine whether the dialogue text of the dialogue statement contains preset words;

In response to determining that the dialogue text of the dialogue statement contains the preset word, the preset word in the dialogue text of the dialogue statement is deleted.
9. The method according to claim 8, wherein determining whether the dialogue text of the dialogue statement contains a preset word includes:

Perform word segmentation processing on the dialogue text of the dialogue sentence to obtain the target words contained in the dialogue text of the dialogue sentence;

Each target word is matched with a preset word in a preset word library to determine whether the dialogue text of the dialogue sentence contains the preset word.
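The word-segmentation and lexicon-matching steps above, together with the deletion step, can be sketched as follows. This is a minimal sketch under stated assumptions: the preset word library `PRESET_WORDS`, the helper names, and the whitespace-based `segment` function are hypothetical stand-ins. In particular, text without word delimiters (such as Chinese) would need a real word segmenter in place of `segment`.

```python
# Hypothetical preset word library (filler words); the claims do not list its contents.
PRESET_WORDS = {"um", "uh", "er", "hmm"}


def segment(text):
    """Stand-in word segmentation: split on whitespace only."""
    return text.split()


def _normalize(word):
    """Lowercase a word and drop surrounding punctuation before matching."""
    return word.lower().strip(".,!?")


def contains_preset_word(text, preset_words=PRESET_WORDS):
    """Match each target word against the preset word library."""
    return any(_normalize(w) in preset_words for w in segment(text))


def remove_preset_words(text, preset_words=PRESET_WORDS):
    """Delete every preset word and rejoin the remaining target words."""
    kept = [w for w in segment(text) if _normalize(w) not in preset_words]
    return " ".join(kept)
```

Removing such words before pre-detection matters for the character-count condition: a sentence made up only of filler words would otherwise count toward the preset number of characters.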
The method according to claim 8, wherein determining whether the dialogue text of the dialogue sentence contains a preset word includes:

The dialogue text of the dialogue sentence is input into a pre-trained word recognition model to obtain a word recognition result of the dialogue text of the dialogue sentence. The word recognition result indicates whether the dialogue text of the dialogue sentence contains the preset word.
A voice dialogue detection device, including:

The first determination module is configured to perform interruption pre-detection on multiple dialogue sentences in the target voice data, based on the respective dialogue-related information and dialogue text of the multiple dialogue sentences, to determine one or more candidate dialogue sentences among the multiple dialogue sentences that may have an interruption behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and speaker roles;

The second determination module is configured to, for each candidate dialogue sentence in the one or more candidate dialogue sentences, determine whether the candidate dialogue sentence has an interruption behavior based on at least one of an emotion recognition result, obtained by performing emotion recognition on the candidate dialogue sentence using an emotion recognition model, and a voice feature of the candidate dialogue sentence.
The device of claim 11, wherein:

The emotion recognition result includes the negative emotion value of the candidate dialogue sentence;

The second determination module includes: a first interrupting judgment sub-module, configured to determine that the candidate dialogue sentence has an interrupting behavior in response to determining that the negative emotion value exceeds a preset emotion threshold.
The device of claim 11, wherein:

The voice feature of the candidate dialogue sentence includes a volume change value of the candidate dialogue sentence relative to a first associated dialogue sentence of the candidate dialogue sentence, wherein the speaker role of the first associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence;

The second determination module includes: a first interrupting judgment sub-module, configured to determine that the candidate dialogue sentence has an interrupting behavior in response to determining that the volume change value exceeds a preset volume value.
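The two threshold checks described for the second determination module (negative emotion value and volume change value) can be sketched together as follows. This is a sketch under stated assumptions: the function name, the default thresholds, and the unit of the volume change are hypothetical, and combining the two checks with a logical OR is only one reading of "at least one of". The emotion value is assumed to come from an emotion recognition model that is not shown here.

```python
from typing import Optional


def has_interruption(
    negative_emotion: Optional[float] = None,
    volume_change: Optional[float] = None,
    emotion_threshold: float = 0.6,  # hypothetical preset emotion threshold
    volume_threshold: float = 6.0,   # hypothetical preset volume value
) -> bool:
    """Decide whether a candidate sentence has an interruption behavior.

    Either input may be omitted when that feature is not available.
    """
    if negative_emotion is not None and negative_emotion > emotion_threshold:
        return True
    if volume_change is not None and volume_change > volume_threshold:
        return True
    return False
```

With these assumed defaults, `has_interruption(negative_emotion=0.8)` returns `True`, and `has_interruption(volume_change=2.0)` returns `False`.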
The device of claim 11, further comprising:

An exemption recognition module configured to, for each of the one or more candidate dialogue sentences determined by the first determination module, determine, based on a second associated dialogue sentence and a third associated dialogue sentence of the candidate dialogue sentence, whether the candidate dialogue sentence satisfies a preset exemption condition, wherein the speaker role of the second associated dialogue sentence is different from the speaker role of the candidate dialogue sentence, and the speaker role of the third associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence,

Wherein, the second determination module is configured to: in response to the exemption recognition module determining that the candidate dialogue sentence does not satisfy the preset exemption condition, determine whether the candidate dialogue sentence has an interruption behavior based on at least one of the emotion recognition result and the voice feature.
The device according to claim 14, wherein the preset exemption condition includes that the intention of the second associated dialogue sentence is to end the dialogue, and that a matching degree value between the dialogue text of the third associated dialogue sentence and a preset end dialogue text exceeds a first preset degree threshold,

Wherein, the start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence, and the start time of the third associated dialogue sentence is between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
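The exemption condition above can be sketched as follows. This is a sketch under stated assumptions: the intent label `"end_dialogue"`, the preset end dialogue texts, the threshold of 0.8, and the use of `difflib.SequenceMatcher` as the matching degree measure are all hypothetical, since the claims do not specify how the matching degree value is computed. The second and third associated dialogue sentences are assumed to have been selected already according to the ordering of start times described above, and the intent is assumed to come from an intention recognition model that is not shown here.

```python
from difflib import SequenceMatcher

# Hypothetical preset end dialogue texts.
END_TEXTS = ["goodbye", "thank you, bye"]


def matching_degree(text, preset_texts=END_TEXTS):
    """Best similarity ratio (0.0 to 1.0) between text and any preset end text."""
    lowered = text.lower()
    return max(SequenceMatcher(None, lowered, p).ratio() for p in preset_texts)


def is_exempt(second_intent, third_text, degree_threshold=0.8):
    """Check the exemption condition for one candidate sentence.

    second_intent: intent label of the second associated sentence (other speaker).
    third_text: dialogue text of the third associated sentence (same speaker
    as the candidate).
    """
    return (
        second_intent == "end_dialogue"
        and matching_degree(third_text) > degree_threshold
    )
```

A candidate for which `is_exempt` returns `True` would skip the emotion and volume checks, since both parties had already signalled the end of the dialogue before the overlapping speech occurred.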
The device of claim 15, wherein:

The intention of the second associated dialogue sentence is obtained by performing intention recognition on the second associated dialogue sentence based on the intention recognition model and the dialogue text of the second associated dialogue sentence.
The device according to claim 11, wherein the first determining module includes:

A cross duration determination submodule, configured to, for a first dialogue sentence and a second dialogue sentence that are adjacent among the plurality of dialogue sentences, determine the intersection duration between the first dialogue sentence and the second dialogue sentence based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence, wherein the speaker roles of the first dialogue sentence and the second dialogue sentence are different, and the start time of the second dialogue sentence is after the start time of the first dialogue sentence and before the end time of the first dialogue sentence;

A candidate dialogue sentence determination submodule, configured to determine the second dialogue sentence as the candidate dialogue sentence in response to determining that the intersection duration exceeds a preset duration or that the dialogue text of the second dialogue sentence contains more than a preset number of characters.
The device of claim 11, further comprising:

A third determination module configured to determine, for each dialogue sentence in the plurality of dialogue sentences, whether the dialogue text of the dialogue sentence contains a preset word, before the first determination module performs interruption pre-detection on the plurality of dialogue sentences;

A deletion module configured to delete the preset word from the dialogue text of the dialogue sentence in response to the third determination module determining that the dialogue text of the dialogue sentence contains the preset word.
An electronic device including:

A processor;

A memory for storing instructions executable by the processor;

Wherein, the processor is configured to implement the method according to any one of claims 1 to 10 when executing the instructions.
A computer-readable storage medium, wherein, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to implement the method according to any one of claims 1 to 10.