WO2023207212A1 - Voice dialogue detection method and apparatus - Google Patents

Voice dialogue detection method and apparatus - Download PDF

Info

Publication number
WO2023207212A1
WO2023207212A1 (PCT/CN2023/070200)
Authority
WO
WIPO (PCT)
Prior art keywords
dialogue
sentence
candidate
statement
dialogue sentence
Prior art date
Application number
PCT/CN2023/070200
Other languages
English (en)
Chinese (zh)
Inventor
邓成东
曾琳铖曦
郭江
吴海英
Original Assignee
马上消费金融股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 马上消费金融股份有限公司 filed Critical 马上消费金融股份有限公司
Publication of WO2023207212A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L25/78 Detection of presence or absence of voice signals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to speech processing technology, and in particular to speech dialogue detection methods and devices.
  • Detecting whether participants in a voice conversation engage in interrupting behavior is an important part of voice conversation detection and is widely used in scenarios such as telephone operations and intelligent question and answer.
  • The existing voice dialogue detection method mainly relies on simple detection rules to determine whether the participants in a voice dialogue interrupt each other. For example, if participant A responds before participant B has finished speaking, participant A is judged to have interrupting behavior.
  • However, this detection method is too simplistic and cannot accurately detect interrupting behavior in complex conversation scenarios. For example, when participant A talks on and on, participant B may, out of patience and respect for participant A, merely respond before participant A has finished speaking, without genuinely interrupting participant A.
  • embodiments of the present application provide voice dialogue detection methods and devices.
  • A voice dialogue detection method provided by an embodiment of the present application includes: based on the dialogue-related information and dialogue text of each of multiple dialogue sentences in the target voice data, performing interruption pre-detection on the multiple dialogue sentences to determine one or more candidate dialogue sentences that may involve interrupting behavior.
  • The target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and the speaker role.
  • For each candidate dialogue sentence among the one or more candidate dialogue sentences, whether the candidate dialogue sentence involves interrupting behavior is determined based on at least one of: the emotion recognition result obtained by performing emotion recognition on the candidate dialogue sentence with an emotion recognition model, and the speech characteristics of the candidate dialogue sentence.
  • A voice dialogue detection device provided by an embodiment of the present application includes: a first determination module, configured to perform interruption pre-detection on multiple dialogue sentences in the target voice data based on their respective dialogue-related information and dialogue text, to determine one or more candidate dialogue sentences among the multiple dialogue sentences that may involve interrupting behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and the speaker role;
  • and a second determination module, configured, for each candidate dialogue sentence among the one or more candidate dialogue sentences, to determine whether the candidate dialogue sentence involves interrupting behavior based on at least one of the emotion recognition result obtained by performing emotion recognition on the candidate dialogue sentence with an emotion recognition model and the speech characteristics of the candidate dialogue sentence.
  • An electronic device provided by an embodiment of the present application includes: a processor; and a memory for storing instructions executable by the processor, wherein the processor is configured to implement the above voice dialogue detection method when executing the instructions.
  • An embodiment of the present application provides a computer-readable storage medium storing instructions that, when executed by an electronic device, cause the electronic device to implement the above voice dialogue detection method.
  • Figure 1 is a schematic flow chart of a voice dialogue detection method provided by an embodiment of the present application.
  • Figure 2 is a schematic flow chart of a voice dialogue detection method provided by another embodiment of the present application.
  • Figure 3 is a schematic flow chart of a voice dialogue detection method provided by another embodiment of the present application.
  • Figure 4 is a schematic diagram of applicable scenarios for the voice dialogue detection method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a configuration interface provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of a configuration interface provided by another embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a voice dialogue detection device provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Intelligent customer service quality inspection system: a system that inspects the textual content of voice, video and other data through detection models, detection algorithms, etc. It monitors the behavior of customer service personnel, such as detecting whether dialogue participants interrupt one another, which helps improve the service quality of customer service personnel.
  • ASR: Automatic Speech Recognition.
  • The embodiments of this application exploit the rule that interrupting behavior usually manifests as one party starting to speak before the other party has finished, with the interjection not being overly brief, and propose a dialogue sentence detection solution on this basis.
  • candidate dialogue sentences that may involve interrupting behaviors are determined from these dialogue sentences.
  • The emotion recognition results and/or voice characteristics of the candidate dialogue sentences can then be combined to further determine whether the candidate dialogue sentences involve interrupting behavior.
  • In this way, the embodiment of the present application can avoid misjudging behaviors such as one party, out of patience and respect, responding before the other party has finished speaking as interrupting behavior, thus improving the detection accuracy of voice dialogue.
  • the voice dialogue detection method provided by the embodiment of the present application can be executed by an electronic device or software installed in the electronic device, and specifically can be executed by a terminal device or a server device.
  • a voice dialogue detection method provided by an embodiment of the present application may include steps S102 to S104.
  • Step S102: based on the dialogue-related information and dialogue text of each of the multiple dialogue sentences in the target voice data, perform interruption pre-detection on the multiple dialogue sentences to determine one or more candidate dialogue sentences among them that may involve interrupting behavior.
  • the target speech data may include conversational sentences of speakers with different roles.
  • the target voice data includes conversational sentences between users and customer service personnel; for another example, in a video conference scenario, the target voice data includes conversational sentences between different conference participants, and so on.
  • the dialogue-related information of the dialogue sentence may include the dialogue start and end time information of the dialogue sentence and the speaker's role.
  • the start and end time information of the dialogue sentence includes the start time of the dialogue sentence (that is, the time when the speaker starts speaking) and the end time (that is, the time when the speaker stops speaking).
  • the dialogue text of the dialogue sentence represents the dialogue content of the dialogue sentence.
  • the dialogue text of the dialogue sentence can be obtained by identifying the dialogue sentence based on ASR technology.
  • Specifically, based on the dialogue-related information (the dialogue start and end times and the speaker roles) and the dialogue text of the dialogue sentences of speakers with different roles, interruption pre-detection is performed on these dialogue sentences to determine candidate dialogue sentences that may involve interrupting behavior.
  • If the first dialogue sentence and the second dialogue sentence are any two adjacent dialogue sentences in the target speech data with different speaker roles, and the start time of the second dialogue sentence falls after the start time of the first dialogue sentence and before its end time, interruption pre-detection can be performed on the second dialogue sentence.
  • the intersection duration between the first dialogue statement and the second dialogue statement may be determined based on the end time of the first dialogue statement and the start time and end time of the second dialogue statement.
  • If the end time of the first dialogue sentence is before the end time of the second dialogue sentence, the intersection duration between them equals the difference between the end time of the first dialogue sentence and the start time of the second dialogue sentence; if the end time of the first dialogue sentence is after the end time of the second dialogue sentence, the intersection duration equals the difference between the end time and the start time of the second dialogue sentence. If the intersection duration exceeds a preset duration, or the number of characters in the dialogue text of the second dialogue sentence exceeds a preset number of characters, the second dialogue sentence is determined to be a candidate dialogue sentence that may involve interrupting behavior. In practice, the preset duration and preset number of characters can be set according to actual needs; for example, the preset duration can be set to 3 seconds and the preset number of characters to 5.
  • In practice, the multiple dialogue sentences can be sorted by start time from earliest to latest, and the above judgment is then performed in turn on each pair of adjacent dialogue sentences until all dialogue sentences in the target speech data have been judged. For example, if the intersection duration between the Nth dialogue sentence (N is a positive integer) and the (N+1)th dialogue sentence exceeds the preset duration, or the dialogue text of the (N+1)th dialogue sentence contains more than the preset number of characters, the (N+1)th dialogue sentence is determined to be a candidate dialogue sentence; otherwise, the process is repeated for the (N+1)th and (N+2)th dialogue sentences, until all dialogue sentences in the target speech data have been judged.
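The pre-detection rule described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the tuple layout for a dialogue sentence, and the default thresholds (3000 ms and 5 characters, taken from the example values mentioned above) are all assumptions.

```python
def is_candidate(first, second, max_overlap_ms=3000, max_chars=5):
    """Interruption pre-detection for two adjacent dialogue sentences.

    `first` and `second` are (start_ms, end_ms, role, text) tuples, with
    `second` the later-starting sentence. Returns True if `second` may
    involve interrupting behavior.
    """
    start1, end1, role1, _ = first
    start2, end2, role2, text2 = second
    # Pre-detection only applies when roles differ and `second` starts
    # after `first` starts but before `first` ends.
    if role1 == role2 or not (start1 < start2 < end1):
        return False
    # Intersection duration: whichever sentence ends first bounds the overlap.
    overlap_ms = min(end1, end2) - start2
    # Candidate if the overlap is long, or the interjection is not brief.
    return overlap_ms > max_overlap_ms or len(text2) > max_chars
```

With the worked example below, the third sentence (5880-6320 ms, 1 character) against the second (4760-10240 ms) yields an overlap of 440 ms and is not a candidate, while the sixth sentence (15830-20500 ms, 18 characters) against the fifth (14640-23270 ms) yields 4670 ms and is.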
  • For example, the target voice data includes conversational sentences between customer service personnel and users. Since in this scenario only the customer service staff's behavior is of interest, only the customer service staff's dialogue sentences are checked for interrupting behavior. Assume the dialogue-related information and dialogue text of the dialogue sentences in the target speech data are as follows:
  • the speaker role of the third dialogue sentence is a customer service staff, which is different from the speaker role of the second dialogue sentence, and the starting time of the third dialogue sentence (5880ms) is located at the starting time of the second dialogue sentence ( 4760ms) and before the end time (10240ms) of the second dialogue sentence, so the third dialogue sentence is pre-detected for interrupting.
  • The end time of the third dialogue sentence (6320ms) is before the end time of the second dialogue sentence (10240ms), so the intersection duration between them equals the difference between the end time (6320ms) and the start time (5880ms) of the third dialogue sentence, i.e., 440ms, which is less than the preset duration of 3 seconds. The dialogue text of the third dialogue sentence contains 1 character, fewer than the preset 5 characters. Therefore, it can be determined that the third dialogue sentence involves no interrupting behavior.
  • the speaker role of the fourth dialogue sentence is the same as the speaker role of the third dialogue sentence, so the fourth dialogue sentence is not pre-detected for interjection.
  • the speaker role of the fifth dialogue sentence is the user, so the fifth dialogue sentence is not pre-detected for interrupting.
  • the speaker role of the sixth dialogue sentence is a customer service staff, which is different from the speaker role of the fifth dialogue sentence, and the starting time of the sixth dialogue sentence (15830ms) is located at the starting time of the fifth dialogue sentence ( 14640ms) and before the end time (23270ms) of the fifth dialogue sentence, so the sixth dialogue sentence is pre-detected for interrupting.
  • The end time of the sixth dialogue sentence (20500ms) is before the end time of the fifth dialogue sentence (23270ms), so the intersection duration between them equals the difference between the end time (20500ms) and the start time (15830ms) of the sixth dialogue sentence, i.e., 4670ms, which exceeds the preset duration of 3 seconds. From this, the sixth dialogue sentence can be determined to be a candidate dialogue sentence that may involve interrupting behavior. Alternatively, the dialogue text of the sixth dialogue sentence contains 18 characters, exceeding the preset 5 characters, from which the same conclusion follows.
  • Only such dialogue sentences are treated as possibly including interrupting behavior. In this way, conversational sentences such as the brief polite acknowledgements mentioned above can be prevented from being misjudged as interrupting behavior, which helps improve the accuracy of voice dialogue detection.
  • Optionally, before step S102, the voice dialogue detection method may also include: determining whether the dialogue text of each dialogue sentence in the target voice data contains preset words, and if so, deleting the preset words from the dialogue text of that dialogue sentence.
  • the preset words can be set according to actual needs. For example, the preset words can include the above-mentioned polite words, greeting words, etc.
  • any appropriate method may be used to determine whether the dialogue text of the dialogue sentence contains the preset words.
  • the dialogue text of the dialogue statement can be segmented to obtain the words contained in the dialogue text of the dialogue statement; then, each word contained in the dialogue text of the dialogue statement is compared with the words in the preset word library The preset words are matched to determine whether the dialogue text of the dialogue statement contains the preset words.
  • For example, the preset word library can be obtained by exhaustively enumerating the preset words. A regular-expression matching algorithm is then used to match each word contained in the dialogue text of the dialogue sentence against the preset words in the library. If the matching result indicates that the matching degree between some word in the dialogue text and a preset word in the library exceeds a preset matching threshold, it can be determined that the dialogue text of the dialogue sentence contains that preset word.
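The preset-word removal step can be sketched as below. This is an illustrative sketch only: the word list (Chinese equivalents of acknowledgements such as "Well" and "Okay" from the sample text later in this document) and the function name are assumptions, and a real system would enumerate its preset-word library per its own needs.

```python
import re

# Illustrative preset-word library of polite/greeting words (assumed examples).
PRESET_WORDS = ["嗯", "好的", "您好", "不好意思"]

def strip_preset_words(text, preset_words=PRESET_WORDS):
    """Delete every occurrence of a preset word from the dialogue text."""
    # Match longest words first so a longer preset word is removed before a
    # shorter word that overlaps it.
    pattern = "|".join(
        re.escape(w) for w in sorted(preset_words, key=len, reverse=True)
    )
    return re.sub(pattern, "", text)
```

For instance, `strip_preset_words("嗯好的我知道了")` removes the leading acknowledgements and leaves only the substantive content.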
  • the dialogue text of the dialogue sentence can be input into a pre-trained word recognition model to obtain a word recognition result of the dialogue text of the dialogue sentence.
  • the word recognition result indicates whether the dialogue text of the dialogue sentence contains a preset word.
  • the word recognition result can indicate the similarity between a certain word in the dialogue text of the dialogue sentence and one or more preset words.
  • The similarity is usually a floating-point value between 0 and 1; the larger the value, the higher the similarity.
  • the word recognition model is trained based on the sample text and the word tags of the words contained in the sample text. The word tags of the words are used to indicate whether the words are preset words.
  • The word tag of a word can be represented by one-hot encoding. For example, a word tag of [0,1] means the word is not a preset word, and a word tag of [1,0] means it is a preset word. For example, for the sample text "Well, okay", the words it contains are {"Well", "Okay"}, and the word label of both "Well" and "Okay" is [1,0].
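The one-hot labelling scheme above can be expressed as a small helper. This is a toy sketch; the function name and the set-based lookup are assumptions used only to make the encoding concrete.

```python
def word_label(word, preset_words):
    """One-hot word tag: [1, 0] = preset word, [0, 1] = not a preset word."""
    return [1, 0] if word in preset_words else [0, 1]

# The sample text "Well, okay" segments into {"Well", "Okay"}; both are
# preset words here, so both receive the label [1, 0].
labels = [word_label(w, {"Well", "Okay"}) for w in ["Well", "Okay"]]
```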
  • the type of word recognition model can be selected according to actual needs.
  • the word recognition model can be a Bidirectional Encoder Representation from Transformers (BERT) model.
  • Model training is performed using sample text and the word labels of the words it contains, so that the trained word recognition model has generalized recognition capability; by continuously supplementing new sample text, the model's recognition ability and accuracy can be continuously improved.
  • In this way, whether the dialogue text of a dialogue sentence contains preset words such as polite expressions can be recognized simply and accurately.
  • Step S104: for each candidate dialogue sentence, determine whether the candidate dialogue sentence involves interrupting behavior, based on at least one of the emotion recognition result obtained by performing emotion recognition on the candidate dialogue sentence with an emotion recognition model and the speech characteristics of the candidate dialogue sentence.
  • the emotion recognition model refers to a pre-trained machine learning model with emotion recognition capabilities.
  • the emotion recognition model can be trained by using the emotion-related features of the sample dialogue sentences and the emotion labels corresponding to the sample dialogue sentences.
  • the emotion-related features of the sample dialogue sentence refer to the characteristics of the sample dialogue sentence that can represent the speaker's emotion, such as the spectrogram characteristics of the sample dialogue sentence, etc.
  • the emotion label corresponding to the sample dialogue sentence is used to indicate the emotional tendency of the sample dialogue sentence, such as positive emotion or negative emotion.
  • the tendency value of the emotional tendency corresponding to the sample dialogue sentence may include, for example, a positive emotional value and a negative emotional value.
  • the type of emotion recognition model can be selected according to actual needs.
  • feature extraction can be performed on the candidate dialogue sentences to obtain the emotion-related features of the candidate dialogue sentences, and then the emotion-related features of the candidate dialogue sentences are input into the emotion recognition model to obtain the emotion recognition results of the candidate dialogue sentences.
  • the emotion recognition result may represent the emotional tendency of the candidate dialogue sentence, or may represent the tendency value of the emotional tendency of the candidate dialogue sentence.
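The inference step in the two preceding bullets can be sketched as below. This is a hedged illustration, not the patent's model: the feature extractor is reduced to a list of precomputed emotion-related features (e.g. spectrogram statistics), and `toy_model` is a stand-in linear scorer used only to show the tendency-value output form.

```python
import math

def recognize_emotion(features, model):
    """Run emotion recognition and return emotional-tendency values.

    `features` are the candidate sentence's emotion-related features;
    `model` is any trained scorer returning (positive, negative) values.
    """
    positive, negative = model(features)
    return {"positive": positive, "negative": negative}

def toy_model(features):
    # Stand-in for a trained emotion recognition model: a fixed
    # logistic scorer over the summed features, purely for illustration.
    negative = 1.0 / (1.0 + math.exp(-sum(features)))
    return (1.0 - negative, negative)
```

The returned dictionary mirrors the "tendency value" form described above: a positive emotional value and a negative emotional value.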
  • The voice characteristics of the candidate dialogue sentence may include, but are not limited to, the volume of the candidate dialogue sentence and/or its volume change relative to a first associated dialogue sentence, where the speaker role of the first associated dialogue sentence is the same as that of the candidate dialogue sentence.
  • the first associated dialogue statement may be a dialogue statement output by the customer service staff before the candidate dialogue statement.
  • Whether the candidate dialogue sentence involves interrupting behavior is thus determined based on its emotion recognition result and/or its voice characteristics, which can improve the detection accuracy of voice dialogue.
  • the preset interruption condition may include that the negative emotion value of the candidate dialogue sentence exceeds the preset emotion threshold or the volume change value exceeds the preset volume value.
  • the preset emotional threshold and preset volume value can be set according to actual needs.
  • If the negative emotion value of the candidate dialogue sentence exceeds the preset emotion threshold, or its volume change relative to the first associated dialogue sentence exceeds the preset volume threshold, the candidate dialogue sentence is determined to involve interrupting behavior.
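The decision rule in the bullets above reduces to a single disjunction. The sketch below is illustrative only; the patent states merely that the thresholds are "preset according to actual needs", so the default values here are assumptions.

```python
def has_interruption(neg_emotion, volume_delta,
                     emotion_threshold=0.8, volume_threshold=10.0):
    """Secondary interruption check for a candidate dialogue sentence.

    `neg_emotion` is the negative-emotion tendency value from the emotion
    recognition model; `volume_delta` is the candidate sentence's volume
    change relative to its first associated dialogue sentence. Both
    thresholds are illustrative placeholders.
    """
    return neg_emotion > emotion_threshold or volume_delta > volume_threshold
```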
  • Compared with the oversimplified approach of treating any case in which one party speaks before the other has finished as interrupting behavior, this avoids misjudging behaviors such as one party, out of patience and respect, responding before the other party has finished speaking as interrupting behavior, which helps improve the detection accuracy of voice dialogue.
  • inspection exemption conditions can be set in advance for this situation. For any candidate dialogue sentence, if it is determined that the candidate dialogue sentence satisfies the inspection exemption condition, it can be directly determined that the candidate dialogue sentence does not engage in interfering behavior without performing the above step S104, as shown in FIG. 3 .
  • the speaker role of the second associated dialogue sentence is different from the speaker role of the candidate dialogue sentence, and the speaker role of the third associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence.
  • Preset inspection exemption conditions can be set according to actual needs.
  • Specifically, the preset exemption conditions may include: the intention of the second associated dialogue sentence is to end the dialogue, and the matching degree between the dialogue text of the third associated dialogue sentence and a preset end-dialogue text exceeds a first preset matching threshold.
  • the preset ending conversation text may be a standard text used to end a conversation, such as "Thank you for calling, goodbye", etc.
  • Specifically, intention recognition can be performed on the second associated dialogue sentence to obtain an intention recognition result, and the dialogue text of the third associated dialogue sentence can be matched against the preset end-dialogue text to obtain a first matching result. Then, based on the intention recognition result and the first matching result, it can be determined whether the candidate dialogue sentence satisfies the preset exemption condition. The start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence, and the start time of the third associated dialogue sentence is between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
  • the intent recognition model refers to a pre-trained machine learning model with intent recognition capabilities.
  • the intent recognition model can be trained by using the intent-related features of the sample conversation text and the intent tags corresponding to the sample conversation text.
  • Intention-related features of the sample dialogue text may include word features and/or sentence features of the sample dialogue text that can characterize the speaker's intention.
  • the intent tag corresponding to the sample conversation text is used to indicate the intent of the sample conversation text, for example, indicating whether the intent of the sample conversation text is to end the conversation. It should be noted that in actual applications, the type of intent recognition model can be selected according to actual needs.
  • Specifically, feature extraction can be performed on the dialogue text of the second associated dialogue sentence to obtain its intention-related features, which are then input into the intention recognition model to determine whether the intention of the second associated dialogue sentence is to end the dialogue.
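The exemption check described above can be sketched as follows. This is an assumption-laden illustration: the intent label string, the use of `difflib.SequenceMatcher` as the text-matching algorithm, and the 0.8 threshold are all stand-ins for whatever intent model and matching method the system actually employs; the end-dialogue text is the Chinese equivalent of the "Thank you for calling, goodbye" example above.

```python
import difflib

# Illustrative preset end-dialogue text ("Thank you for calling, goodbye").
PRESET_END_TEXT = "感谢您的来电，再见"

def is_exempt(second_assoc_intent, third_assoc_text,
              end_text=PRESET_END_TEXT, match_threshold=0.8):
    """Exemption check: both parties already intended to end the call.

    `second_assoc_intent` is the intent-recognition result for the other
    party's earlier sentence; `third_assoc_text` is the same-role party's
    reply, matched against the preset end-dialogue text.
    """
    match = difflib.SequenceMatcher(None, third_assoc_text, end_text).ratio()
    return second_assoc_intent == "end_dialogue" and match > match_threshold
```

A candidate sentence for which `is_exempt(...)` returns True would skip step S104 entirely and be deemed free of interrupting behavior.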
  • the conversation between the calling party and the called party is as follows:
  • the fourth dialogue sentence is determined to be a candidate dialogue sentence through the above-mentioned step S102.
  • The first dialogue sentence can be determined as the second associated dialogue sentence of the candidate dialogue sentence, and the second dialogue sentence as its third associated dialogue sentence.
  • By performing intention recognition on the second associated dialogue sentence (i.e., the first dialogue sentence) with the intention recognition model, it can be determined that its intention is to end the dialogue.
  • It can further be determined that the matching degree between the dialogue text of the third associated dialogue sentence and the preset end-dialogue text exceeds the first preset matching threshold. Therefore, the candidate dialogue sentence (i.e., the fourth dialogue sentence) satisfies the preset exemption condition, and it is determined not to involve interrupting behavior. That is, the fourth dialogue sentence belongs to the situation in which, when both parties clearly intend to end the conversation, the calling party suddenly asks a question, causing the called party to start speaking before the calling party has finished; the fourth dialogue sentence is therefore not an interruption by the called party.
  • the voice dialogue detection method in the embodiment of the present application can be used in a variety of scenarios that require detection of interrupting calls, such as telephone operations, intelligent question and answer and other scenarios. Taking the telephone operation scenario as an example, the voice dialogue detection method provided by the embodiment of the present application will be described below.
  • the telephone operation scenario involves the client 10 and the intelligent customer service quality inspection system 20.
  • the client 10 can display a configuration interface for developer A to configure quality inspection rules.
  • the rule 1 corresponding to the preset exemption condition may include the intention of the above-mentioned second associated dialogue statement and the conditions that need to be satisfied by the third associated dialogue statement.
  • Rule 2, corresponding to the interjection pre-detection, may include the preset intersection duration, the preset number of characters, the interjection delay, etc. (as shown in Figure 6).
  • Rule 3 corresponding to the secondary interrupting detection may include an emotion recognition model, an intent recognition model, etc., which are used to further determine whether a candidate dialogue sentence has an interrupting behavior.
  • Rule 4, corresponding to excluding cases where the interrupting sentence contains only a small number of words, may include a preset number of characters, and so on.
  • the client 10 can send the quality inspection rules configured by developer A to the intelligent customer service quality inspection system 20 for use by the intelligent customer service quality inspection system 20 .
  • the client 10 can display a voice data import interface.
  • User B, who has voice dialogue quality inspection requirements, can import the target voice data to be detected into the client 10 through the voice data import interface.
  • the client 10 can send the imported target voice data to the intelligent customer service quality inspection system 20 and, according to the voice dialogue detection triggering instruction input by user B, request the intelligent customer service quality inspection system 20 to detect whether the target voice data contains dialogue sentences exhibiting interrupting behavior.
  • the intelligent customer service quality inspection system 20 may include a server (Server) or a server cluster (Cluster) composed of multiple servers.
  • the intelligent customer service quality inspection system 20 can execute the voice dialogue detection method disclosed in the above embodiments of the present application based on pre-configured quality inspection rules to determine whether there are dialogue statements that interrupt the conversation in the target voice data, and return the detection results.
  • the client 10 displays the detection results to user B, so that user B can take corresponding measures to improve customer service quality based on the detection results.
  • the intelligent customer service quality inspection system 20 can obtain the voice characteristics and dialogue-related information of each dialogue sentence in the target voice data (for example, including the start and end time of the dialogue and the speaker role), and, based on ASR technology, convert the target voice data into corresponding text to obtain the dialogue text of each dialogue sentence.
  • the intelligent customer service quality inspection system 20 can first exclude dialogue sentences with a small number of words from the target voice data based on rule 4; then, based on the dialogue-related information and dialogue text of the remaining dialogue sentences in the target voice data and according to rule 2 corresponding to the interjection pre-detection, perform interjection pre-detection on these dialogue sentences to determine candidate dialogue sentences that may exhibit interrupting behavior. The intelligent customer service quality inspection system 20 may then determine whether each candidate dialogue sentence satisfies the preset exemption condition based on the second associated dialogue sentence and the third associated dialogue sentence of that candidate dialogue sentence.
  • If the candidate dialogue sentence satisfies the preset exemption condition, the intelligent customer service quality inspection system 20 can determine that the candidate dialogue sentence does not exhibit interrupting behavior. If the candidate dialogue sentence does not satisfy the preset exemption condition, the intelligent customer service quality inspection system 20 can, according to rule 3 corresponding to the secondary interruption detection, call the emotion recognition model to perform emotion recognition on the candidate dialogue sentence to obtain an emotion recognition result, and determine whether the candidate dialogue sentence exhibits interrupting behavior based on the emotion recognition result and/or the speech characteristics of the candidate dialogue sentence.
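The secondary detection step described above reduces to a simple decision rule: an exempt candidate is judged non-interrupting directly; otherwise the candidate is flagged when its negative-emotion value or its volume change exceeds a threshold. A minimal sketch of that rule follows; the field names and threshold values are illustrative assumptions, not values fixed by the patent:

```python
# Sketch of the secondary interruption check. Field names ('exempt',
# 'negative_emotion', 'volume_change') and the default thresholds are
# illustrative assumptions; the patent does not prescribe concrete values.

def is_interrupting(candidate, emotion_threshold=0.7, volume_threshold=10.0):
    """Return True if the candidate dialogue sentence is judged interrupting.

    candidate: dict with keys
      'exempt'           - bool, result of the exemption (inspection-free) check
      'negative_emotion' - float in [0, 1] from the emotion recognition model
      'volume_change'    - volume increase (e.g. in dB) relative to the same
                           speaker's previous (first associated) sentence
    """
    if candidate['exempt']:
        # Sentences satisfying the preset exemption condition are judged
        # non-interrupting without any secondary check.
        return False
    return (candidate['negative_emotion'] > emotion_threshold
            or candidate['volume_change'] > volume_threshold)
```

Either signal alone suffices to flag the sentence, mirroring the "and/or" wording of the method.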
  • the voice dialogue detection device 700 may include a first determination module 710 and a second determination module 730 .
  • the first determination module 710 may perform interjection pre-detection on the plurality of dialogue sentences based on the dialogue-related information and dialogue text of each of the plurality of dialogue sentences in the target voice data, to determine candidate dialogue sentences among the plurality of dialogue sentences that may exhibit interrupting behavior.
  • the target voice data may include dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and speaker roles.
  • the second determination module 730 may determine whether the candidate dialogue sentence exhibits interrupting behavior based on at least one of: the emotion recognition result obtained by performing emotion recognition on the candidate dialogue sentence with the emotion recognition model, and the speech characteristics of the candidate dialogue sentence.
  • the emotion recognition result includes the negative emotion value of the candidate dialogue sentence
  • the voice characteristics of the candidate dialogue sentence include the volume change value of the candidate dialogue sentence relative to the first associated dialogue sentence
  • the speaker role of the first associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence.
  • the second determination module 730 may include a first interruption determination sub-module. If it is determined that the negative emotion value of the candidate dialogue sentence exceeds the preset emotion threshold or that the volume change value exceeds the preset volume value, the first interruption determination sub-module may determine that the candidate dialogue sentence exhibits interrupting behavior.
  • the voice dialogue detection device 700 may also include an exemption identification module.
  • the exemption identification module can determine whether the candidate dialogue sentence satisfies the preset exemption condition based on the second associated dialogue sentence and the third associated dialogue sentence of the candidate dialogue sentence.
  • the speaker role of the second associated dialogue sentence is different from the speaker role of the candidate dialogue sentence, and the speaker role of the third associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence.
  • If the exemption identification module determines that the candidate dialogue sentence satisfies the preset exemption condition, the second determination module 730 can directly determine that the candidate dialogue sentence does not exhibit interrupting behavior; if the exemption identification module determines that the candidate dialogue sentence does not satisfy the preset exemption condition, the second determination module 730 may determine whether the candidate dialogue sentence exhibits interrupting behavior based on the emotion recognition result of the candidate dialogue sentence and/or the voice characteristics of the candidate dialogue sentence.
  • the preset exemption condition includes: the intention of the second associated dialogue sentence is to end the dialogue, and the matching degree value between the dialogue text of the third associated dialogue sentence and the preset end-dialogue text exceeds the first preset matching threshold.
  • the exemption identification module may include an intention recognition sub-module, a matching sub-module and an exemption identification sub-module.
  • the intention recognition sub-module can perform intention recognition on the second associated dialogue sentence based on the intention recognition model and the dialogue text of the second associated dialogue sentence to obtain an intention recognition result, wherein the start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence.
  • the matching sub-module can match the dialogue text of the third associated dialogue sentence against the preset end-dialogue text to obtain a first matching result, wherein the start time of the third associated dialogue sentence lies between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
  • the exemption identification sub-module may determine whether the candidate dialogue sentence satisfies the preset exemption condition based on the intention recognition result obtained by the intention recognition sub-module and the first matching result obtained by the matching sub-module.
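The exemption condition combines the two sub-module outputs: an "end dialogue" intent for the second associated sentence, and a text-matching degree above a threshold for the third associated sentence. A minimal sketch follows, assuming a string-similarity ratio as the matching degree; the intent label, the closing phrases, and the threshold are all illustrative placeholders for the patent's intention recognition model and matcher:

```python
# Sketch of the exemption (inspection-free) check. The intent label
# "end_dialogue", the END_DIALOGUE_TEXTS library, and the 0.8 threshold
# are illustrative assumptions; difflib stands in for the real matcher.
import difflib

END_DIALOGUE_TEXTS = ["ok, goodbye", "thank you, bye bye"]  # illustrative

def match_degree(text, references):
    """Highest string similarity between text and any preset end-dialogue text."""
    return max(difflib.SequenceMatcher(None, text, ref).ratio()
               for ref in references)

def is_exempt(second_intent, third_text, match_threshold=0.8):
    """True if the candidate sentence satisfies the preset exemption condition:
    the other speaker intends to end the dialogue AND the same speaker's
    preceding sentence closely matches a preset closing phrase."""
    return (second_intent == "end_dialogue"
            and match_degree(third_text, END_DIALOGUE_TEXTS) > match_threshold)
```

Both conditions must hold, matching the conjunctive wording of the preset exemption condition.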
  • the first determination module 710 may include an intersection duration determination sub-module and a candidate dialogue sentence determination sub-module.
  • the intersection duration determination sub-module may determine the intersection duration between the first dialogue sentence and the second dialogue sentence based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence.
  • if the intersection duration exceeds the preset intersection duration, the candidate dialogue sentence determination sub-module may determine the second dialogue sentence to be a candidate dialogue sentence that may exhibit interrupting behavior.
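The intersection duration above is the overlap between the tail of the first (earlier-starting) sentence and the second sentence of a different speaker. A minimal sketch, assuming times in seconds and an illustrative overlap threshold:

```python
# Sketch of the interjection pre-detection based on time overlap.
# The 0.5-second minimum overlap is an illustrative assumption; the
# patent leaves the preset intersection duration configurable.

def intersection_duration(first_end, second_start, second_end):
    """Overlap in seconds between the first sentence and the second sentence,
    given the first sentence's end time and the second's start/end times."""
    return max(0.0, min(first_end, second_end) - second_start)

def is_candidate(first_end, second_start, second_end, min_overlap=0.5):
    """True if the second sentence overlaps the first long enough to be a
    candidate dialogue sentence for interruption detection."""
    return intersection_duration(first_end, second_start, second_end) >= min_overlap
```

If the second sentence starts after the first has ended, the overlap is zero and no candidate is produced.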
  • the voice dialogue detection device 700 may further include a third determination module and a deletion module.
  • the third determination module may determine whether the dialogue text of each dialogue sentence contains a preset word. If the third determination module determines that the dialogue text of the dialogue statement contains the preset words, the deletion module may delete the preset words in the dialogue text of the dialogue statement.
  • the third determination module may include a word segmentation sub-module and a matching sub-module.
  • the word segmentation sub-module can perform word segmentation processing on the dialogue text of the dialogue sentence to obtain the words contained in the dialogue text of the dialogue sentence.
  • the matching sub-module may match each word contained in the dialogue text of the dialogue statement with the preset words in the preset word library to determine whether the dialogue text of the dialogue statement contains the preset word.
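The segmentation-and-matching steps above amount to filtering each dialogue text against a preset word library (e.g. filler words). A minimal sketch, where whitespace tokenization stands in for the real word segmenter and the library contents are illustrative assumptions:

```python
# Sketch of the preset-word detection and deletion steps. Whitespace
# split() is a stand-in for the word segmentation sub-module; the
# PRESET_WORDS library contents are illustrative, not from the patent.

PRESET_WORDS = {"um", "uh", "hmm"}  # illustrative filler-word library

def contains_preset_word(text):
    """True if the dialogue text contains any word from the preset library."""
    return any(word in PRESET_WORDS for word in text.split())

def remove_preset_words(text):
    """Return the dialogue text with all preset words deleted."""
    return " ".join(w for w in text.split() if w not in PRESET_WORDS)
```

For Chinese dialogue text a dedicated segmenter would replace `split()`, since words are not whitespace-delimited.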
  • the third determination module may include a word determination sub-module.
  • the word determination sub-module can input the dialogue text of the dialogue sentence into the pre-trained word recognition model to obtain the word recognition result of the dialogue text of the dialogue sentence.
  • the word recognition result is used to indicate whether the dialogue text of the dialogue sentence contains a preset word.
  • the word recognition model is obtained by model training based on the sample text and the word tags of the words contained in the sample text.
  • the word tags of the words are used to indicate whether the words are preset words.
  • the electronic device may include a processor, an internal bus, a network interface, a memory, etc.
  • The memory may include internal memory, such as high-speed random access memory (RAM), and may also include non-volatile memory, such as disk storage.
  • the electronic equipment may also include other hardware required by the business.
  • the processor, network interface and memory can be connected to each other through an internal bus, which can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one bidirectional arrow is used in Figure 8, but it does not mean that there is only one bus or one type of bus.
  • The memory is used to store programs.
  • a program may include program code including computer operating instructions.
  • Memory may include internal memory and non-volatile memory and provides instructions and data to the processor.
  • the processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to form a voice dialogue detection device at the logical level.
  • the processor executes the program stored in the memory to execute the above voice dialogue detection method.
  • the processor may be an integrated circuit chip that has signal processing capabilities.
  • each step of the above method can be completed by instructions in the form of hardware integrated logic circuits or software in the processor.
  • the above-mentioned processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • Embodiments of the present application also propose a computer-readable storage medium that stores one or more programs.
  • the one or more programs include instructions that, when executed by a processor of an electronic device, cause the electronic device to execute the above voice dialogue detection method.
  • a typical implementation device is a computer.
  • the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • Computer-readable media includes permanent and non-permanent, removable and non-removable media, in which information storage can be implemented by any method or technology.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • computer-readable media does not include transitory media, such as modulated data signals and carrier waves.


Abstract

The present invention relates to a voice dialogue detection method and apparatus. The voice dialogue detection method comprises: on the basis of dialogue-related information and dialogue text of a plurality of dialogue sentences in target voice data, performing interrupting-behavior pre-detection on the plurality of dialogue sentences so as to determine one or more candidate dialogue sentences among the plurality of dialogue sentences that may exhibit interrupting behavior (S102), the target voice data comprising dialogue sentences of speakers having different roles, and each item of dialogue-related information comprising dialogue start and end time information and a speaker role; and, for each candidate dialogue sentence, determining, on the basis of an emotion recognition result of the candidate dialogue sentence and/or a voice characteristic of the candidate dialogue sentence, whether interrupting behavior exists in the candidate dialogue sentence (S104).
PCT/CN2023/070200 2022-04-24 2023-01-03 Procédé et appareil de détection de dialogue vocal WO2023207212A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210451120.8A CN114842849B (zh) 2022-04-24 2022-04-24 语音对话检测方法及装置
CN202210451120.8 2022-04-24

Publications (1)

Publication Number Publication Date
WO2023207212A1 true WO2023207212A1 (fr) 2023-11-02

Family

ID=82568107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/070200 WO2023207212A1 (fr) 2022-04-24 2023-01-03 Procédé et appareil de détection de dialogue vocal

Country Status (2)

Country Link
CN (1) CN114842849B (fr)
WO (1) WO2023207212A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842849B (zh) * 2022-04-24 2023-08-08 马上消费金融股份有限公司 语音对话检测方法及装置

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204195A (zh) * 2017-05-19 2017-09-26 四川新网银行股份有限公司 一种基于情绪分析的智能质检方法
US20200364271A1 (en) * 2017-11-03 2020-11-19 Money Brain Co., Ltd. Method and computer device for providing natural language conversation by providing interjection response in timely manner, and computer-readable recording medium
CN111968679A (zh) * 2020-10-22 2020-11-20 深圳追一科技有限公司 情感识别方法、装置、电子设备及存储介质
CN112017629A (zh) * 2020-07-15 2020-12-01 马上消费金融股份有限公司 语音机器人的会话控制方法及设备、存储介质
CN112885332A (zh) * 2021-01-08 2021-06-01 天讯瑞达通信技术有限公司 一种语音质检方法、系统及存储介质
CN113539275A (zh) * 2020-04-22 2021-10-22 北京有限元科技有限公司 确定话术的方法、装置以及存储介质
CN114842849A (zh) * 2022-04-24 2022-08-02 马上消费金融股份有限公司 语音对话检测方法及装置

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8675854B2 (en) * 2012-05-01 2014-03-18 Mitel Networks Corporation Multi-modal communications with conferencing and clients
KR102429498B1 (ko) * 2017-11-01 2022-08-05 현대자동차주식회사 차량의 음성인식 장치 및 방법
US10783476B2 (en) * 2018-01-26 2020-09-22 Walmart Apollo, Llc System for customized interactions-related assistance
CN111508474B (zh) * 2019-08-08 2021-04-06 马上消费金融股份有限公司 一种语音打断方法、电子设备及存储装置
CN111210842B (zh) * 2019-12-27 2023-04-28 中移(杭州)信息技术有限公司 语音质检方法、装置、终端及计算机可读存储介质
CN111835925A (zh) * 2020-06-16 2020-10-27 杭州云嘉云计算有限公司 一种面向呼叫中心的离线语音质检及分析系统
CN111951831A (zh) * 2020-08-24 2020-11-17 浙江百应科技有限公司 一种基于ai实现音频质检的方法
CN115148205A (zh) * 2022-06-23 2022-10-04 鼎富新动力(北京)智能科技有限公司 一种语音交互方法、系统、电子设备及存储介质


Also Published As

Publication number Publication date
CN114842849A (zh) 2022-08-02
CN114842849B (zh) 2023-08-08

Similar Documents

Publication Publication Date Title
US11227124B2 (en) Context-aware human-to-computer dialog
CN108428447B (zh) 一种语音意图识别方法及装置
US9348817B2 (en) Automatic generation of question-answer pairs from conversational text
KR20200129182A (ko) 적절한 에이전트의 자동화된 어시스턴트 호출
JP2017534941A (ja) オーファン発話検出システム及び方法
WO2023246393A1 (fr) Apprentissage de modèle de reconnaissance d'intention et reconnaissance d'intention d'utilisateur
JP5496863B2 (ja) 感情推定装置、その方法、プログラム及びその記録媒体
US10199035B2 (en) Multi-channel speech recognition
US11170763B2 (en) Voice interaction system, its processing method, and program therefor
WO2023207212A1 (fr) Procédé et appareil de détection de dialogue vocal
CN112053687A (zh) 一种语音处理方法、装置、计算机可读存储介质及设备
JP7151181B2 (ja) 音声対話システム、その処理方法及びプログラム
CN110581927A (zh) 通话内容的处理及提示方法、装置
CN113779208A (zh) 用于人机对话的方法和装置
CN115545002A (zh) 一种模型训练和业务处理的方法、装置、存储介质及设备
CN110209768B (zh) 自动问答的问题处理方法和装置
Church et al. Speaker diarization: a perspective on challenges and opportunities from theory to practice
CN112908315A (zh) 一种基于声音特征和语音识别的问答意图判断方法
US10824520B2 (en) Restoring automated assistant sessions
CN111970311B (zh) 会话切分方法、电子设备及计算机可读介质
JP2020140169A (ja) 話者決定装置、話者決定方法、および話者決定装置の制御プログラム
CN114860910A (zh) 智能对话方法及系统
CN115186051A (zh) 敏感词检测方法、装置及计算机可读存储介质
US20240143925A1 (en) Method and apparatus for automatic entity recognition in customer service environments
US20230133027A1 (en) Method and apparatus for intent-guided automated speech recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23794652

Country of ref document: EP

Kind code of ref document: A1