WO2023207212A1 - Voice dialogue detection method and apparatus - Google Patents

Voice dialogue detection method and apparatus

Info

Publication number
WO2023207212A1
Authority
WO
WIPO (PCT)
Prior art keywords
dialogue
sentence
candidate
statement
dialogue sentence
Prior art date
Application number
PCT/CN2023/070200
Other languages
French (fr)
Chinese (zh)
Inventor
邓成东
曾琳铖曦
郭江
吴海英
Original Assignee
马上消费金融股份有限公司
Priority date
Filing date
Publication date
Application filed by 马上消费金融股份有限公司
Publication of WO2023207212A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • The present application relates to speech processing technology, and in particular to a voice dialogue detection method and apparatus.
  • Detecting whether participants in a voice conversation engage in interrupting behavior is an important part of voice conversation detection and is widely used in scenarios such as telephone operations and intelligent question and answer.
  • Existing voice dialogue detection methods rely mainly on simple detection rules to determine whether a participant in a voice dialogue engages in interrupting behavior. For example, if participant A responds before participant B has finished speaking, participant A is determined to have engaged in interrupting behavior.
  • This detection method is too simplistic and cannot accurately detect interrupting behavior in complex conversation scenarios. For example, when participant A talks at length, participant B may, out of patience and respect for participant A, merely respond before participant A has finished speaking, without actually interrupting participant A.
  • Embodiments of the present application provide a voice dialogue detection method and apparatus.
  • A voice dialogue detection method provided by an embodiment of the present application includes: performing interruption pre-detection on multiple dialogue sentences in target voice data, based on the dialogue-related information and dialogue text of each dialogue sentence, to determine one or more candidate dialogue sentences that may involve interrupting behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and the speaker role; and, for each of the one or more candidate dialogue sentences, determining whether the candidate dialogue sentence involves interrupting behavior based on at least one of the emotion recognition result obtained by performing emotion recognition on the candidate dialogue sentence with an emotion recognition model and the speech features of the candidate dialogue sentence.
  • A voice dialogue detection apparatus provided by an embodiment of the present application includes: a first determination module, configured to perform interruption pre-detection on multiple dialogue sentences in target voice data, based on the dialogue-related information and dialogue text of each dialogue sentence, to determine one or more candidate dialogue sentences that may involve interrupting behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and the speaker role; and
  • a second determination module, configured to determine, for each of the one or more candidate dialogue sentences, whether the candidate dialogue sentence involves interrupting behavior based on at least one of the emotion recognition result obtained by performing emotion recognition on the candidate dialogue sentence with an emotion recognition model and the speech features of the candidate dialogue sentence.
  • An electronic device provided by an embodiment of the present application includes: a processor; and a memory used to store instructions executable by the processor, wherein the processor is configured to implement the above voice dialogue detection method when executing the instructions.
  • An embodiment of the present application provides a computer-readable storage medium. When the instructions stored in the storage medium are executed by a processor of an electronic device, the electronic device can implement the above voice dialogue detection method.
  • Figure 1 is a schematic flow chart of a voice dialogue detection method provided by an embodiment of the present application.
  • Figure 2 is a schematic flow chart of a voice dialogue detection method provided by another embodiment of the present application.
  • Figure 3 is a schematic flow chart of a voice dialogue detection method provided by another embodiment of the present application.
  • Figure 4 is a schematic diagram of applicable scenarios for the voice dialogue detection method provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a configuration interface provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of a configuration interface provided by another embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a voice dialogue detection device provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • Intelligent customer service quality inspection system: a system that inspects the text content of voice, video, and other data through detection models, detection algorithms, and the like. It is used to check the behavior of customer service personnel, for example to detect whether dialogue participants engage in interrupting behavior, which helps improve the service quality of customer service personnel.
  • ASR: Automatic Speech Recognition.
  • The embodiments of this application exploit the regularity that interrupting behavior usually occurs when one party starts speaking before the other party has finished speaking, and the interrupting speech is not too brief, and on that basis propose a dialogue sentence detection solution.
  • Candidate dialogue sentences that may involve interrupting behavior are first determined from the dialogue sentences.
  • The emotion recognition results and/or speech features of the candidate dialogue sentences can then be used to further determine whether the candidate dialogue sentences involve interrupting behavior.
  • In this way, the embodiments of the present application avoid misjudging as interrupting behavior cases in which one party, out of patience and respect for the other party, responds before the other party has finished speaking, thus improving the detection accuracy of voice dialogue.
  • the voice dialogue detection method provided by the embodiment of the present application can be executed by an electronic device or software installed in the electronic device, and specifically can be executed by a terminal device or a server device.
  • a voice dialogue detection method provided by an embodiment of the present application may include steps S102 to S104.
  • In step S102, based on the dialogue-related information and dialogue text of each of the multiple dialogue sentences in the target voice data, interruption pre-detection is performed on the dialogue sentences to determine one or more candidate dialogue sentences that may involve interrupting behavior.
  • the target speech data may include conversational sentences of speakers with different roles.
  • For example, the target voice data may include dialogue sentences between users and customer service personnel; as another example, in a video conference scenario, the target voice data includes dialogue sentences between different conference participants, and so on.
  • the dialogue-related information of the dialogue sentence may include the dialogue start and end time information of the dialogue sentence and the speaker's role.
  • the start and end time information of the dialogue sentence includes the start time of the dialogue sentence (that is, the time when the speaker starts speaking) and the end time (that is, the time when the speaker stops speaking).
  • the dialogue text of the dialogue sentence represents the dialogue content of the dialogue sentence.
  • the dialogue text of the dialogue sentence can be obtained by identifying the dialogue sentence based on ASR technology.
  • Based on the dialogue-related information (dialogue start and end time information and speaker role) and the dialogue text of the dialogue sentences of speakers with different roles, interruption pre-detection can be performed on these dialogue sentences to determine the candidate dialogue sentences that may involve interrupting behavior.
  • Let the first dialogue sentence and the second dialogue sentence be any two adjacent dialogue sentences in the target voice data. If the two sentences have different speaker roles and the start time of the second dialogue sentence falls after the start time of the first dialogue sentence and before the end time of the first dialogue sentence, interruption pre-detection can be performed on the second dialogue sentence.
  • The intersection duration between the first dialogue sentence and the second dialogue sentence may be determined based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence.
  • If the end time of the first dialogue sentence is before the end time of the second dialogue sentence, the intersection duration equals the difference between the end time of the first dialogue sentence and the start time of the second dialogue sentence; if the end time of the first dialogue sentence is after the end time of the second dialogue sentence, the intersection duration equals the difference between the end time and the start time of the second dialogue sentence. If the intersection duration exceeds the preset duration, or the number of characters contained in the dialogue text of the second dialogue sentence exceeds the preset number of characters, the second dialogue sentence is determined to be a candidate dialogue sentence that may involve interrupting behavior. In practice, the preset duration and the preset number of characters can be set according to actual needs. For example, the preset duration can be set to 3 seconds, and the preset number of characters can be set to 5.
  • In practice, the dialogue sentences can be sorted by start time from earliest to latest, and the above steps can then be performed on each pair of adjacent dialogue sentences in turn until all dialogue sentences in the target voice data have been judged. For example, if the intersection duration between the Nth (N is a positive integer) dialogue sentence and the (N+1)th dialogue sentence exceeds the preset duration, or the dialogue text of the (N+1)th dialogue sentence contains more than the preset number of characters, the (N+1)th dialogue sentence is determined to be a candidate dialogue sentence; otherwise, the above process is repeated for the (N+1)th dialogue sentence and the (N+2)th dialogue sentence, until all dialogue sentences in the target voice data have been judged.
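  • The pre-detection pass described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Sentence` class, the role labels, and the function names are hypothetical, and the thresholds default to the example values of 3 seconds and 5 characters.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    role: str      # speaker role, e.g. "agent" or "user" (hypothetical labels)
    start_ms: int  # time the speaker starts speaking
    end_ms: int    # time the speaker stops speaking
    text: str      # dialogue text, e.g. produced by ASR


def intersection_ms(first: Sentence, second: Sentence) -> int:
    """Overlap when `second` starts while `first` is still speaking."""
    # Covers both cases in the text: whichever sentence ends first bounds the overlap.
    return min(first.end_ms, second.end_ms) - second.start_ms


def pre_detect(sentences, preset_ms=3000, preset_chars=5):
    """Return the sentences that may involve interrupting behavior."""
    ordered = sorted(sentences, key=lambda s: s.start_ms)
    candidates = []
    for first, second in zip(ordered, ordered[1:]):
        if first.role == second.role:
            continue  # same speaker role: no pre-detection
        if not (first.start_ms < second.start_ms < first.end_ms):
            continue  # second sentence does not start inside the first
        if (intersection_ms(first, second) > preset_ms
                or len(second.text) > preset_chars):
            candidates.append(second)
    return candidates
```

  • A scenario that only cares about one role (for example, customer service staff) could additionally skip pairs whose second sentence belongs to any other role.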
  • Suppose the target voice data includes dialogue sentences between customer service personnel and users. Since in this scenario only the interrupting behavior of the customer service personnel is of interest, only the dialogue sentences of the customer service personnel are checked for interrupting behavior. Assume that the dialogue-related information and dialogue text of the dialogue sentences in the target voice data are as follows:
  • The speaker role of the third dialogue sentence is customer service staff, which differs from the speaker role of the second dialogue sentence, and the start time of the third dialogue sentence (5880 ms) falls after the start time (4760 ms) and before the end time (10240 ms) of the second dialogue sentence, so interruption pre-detection is performed on the third dialogue sentence.
  • The end time of the third dialogue sentence (6320 ms) is before the end time of the second dialogue sentence (10240 ms), so the intersection duration between the third dialogue sentence and the second dialogue sentence equals the difference between the end time (6320 ms) and the start time (5880 ms) of the third dialogue sentence, that is, 440 ms, which is less than the preset duration of 3 seconds. In addition, the dialogue text of the third dialogue sentence contains 1 character, which is less than the preset number of characters, 5. Therefore, it can be determined that the third dialogue sentence does not involve interrupting behavior.
  • The speaker role of the fourth dialogue sentence is the same as the speaker role of the third dialogue sentence, so interruption pre-detection is not performed on the fourth dialogue sentence.
  • The speaker role of the fifth dialogue sentence is the user, so interruption pre-detection is not performed on the fifth dialogue sentence.
  • The speaker role of the sixth dialogue sentence is customer service staff, which differs from the speaker role of the fifth dialogue sentence, and the start time of the sixth dialogue sentence (15830 ms) falls after the start time (14640 ms) and before the end time (23270 ms) of the fifth dialogue sentence, so interruption pre-detection is performed on the sixth dialogue sentence.
  • The end time of the sixth dialogue sentence (20500 ms) is before the end time of the fifth dialogue sentence (23270 ms), so the intersection duration between the sixth dialogue sentence and the fifth dialogue sentence equals the difference between the end time (20500 ms) and the start time (15830 ms) of the sixth dialogue sentence, that is, 4670 ms, which exceeds the preset duration of 3 seconds. From this, it can be determined that the sixth dialogue sentence is a candidate dialogue sentence that may involve interrupting behavior. Alternatively, the dialogue text of the sixth dialogue sentence contains 18 characters, which exceeds the preset number of characters, 5, so the sixth dialogue sentence can likewise be determined to be a candidate dialogue sentence that may involve interrupting behavior.
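  • The arithmetic in this worked example can be checked with a few lines of Python. The timestamps and character counts come from the example above; the function and variable names are illustrative only.

```python
PRESET_MS = 3000   # preset duration of 3 seconds, from the example
PRESET_CHARS = 5   # preset number of characters, from the example


def intersection_ms(first_end, second_start, second_end):
    """Overlap between a sentence and the adjacent sentence that starts inside it."""
    return min(first_end, second_end) - second_start


# Third sentence (5880-6320 ms, 1 character) against the second (ends at 10240 ms).
third = intersection_ms(first_end=10240, second_start=5880, second_end=6320)
third_is_candidate = third > PRESET_MS or 1 > PRESET_CHARS

# Sixth sentence (15830-20500 ms, 18 characters) against the fifth (ends at 23270 ms).
sixth = intersection_ms(first_end=23270, second_start=15830, second_end=20500)
sixth_is_candidate = sixth > PRESET_MS or 18 > PRESET_CHARS

print(third, third_is_candidate)  # 440 False
print(sixth, sixth_is_candidate)  # 4670 True
```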
  • the dialogue sentence may involve interrupting behavior. In this way, dialogue sentences such as the above-mentioned polite words are not misjudged as interrupting behavior, which helps improve the accuracy of voice dialogue detection.
  • The voice dialogue detection method may also include: before step S102, determining whether the dialogue text of each dialogue sentence in the target voice data contains preset words and, if it does, deleting the preset words from the dialogue text of that dialogue sentence.
  • The preset words can be set according to actual needs. For example, the preset words can include the above-mentioned polite words, greeting words, etc.
  • Any appropriate method may be used to determine whether the dialogue text of the dialogue sentence contains the preset words.
  • For example, the dialogue text of the dialogue sentence can be segmented to obtain the words it contains; each of these words is then matched against the preset words in a preset word library to determine whether the dialogue text contains a preset word.
  • The preset word library can be obtained by exhaustively enumerating the preset words. A regular-expression matching algorithm is then used to match each word contained in the dialogue text of the dialogue sentence against the preset words in the preset word library. If the matching result indicates that the matching degree value between a word contained in the dialogue text and a preset word in the preset word library exceeds the preset matching degree value, it can be determined that the dialogue text of the dialogue sentence contains the preset word.
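  • A minimal sketch of library-based preset word removal using Python's `re` module is shown below. The word list is a hypothetical stand-in for the preset word library, and the sketch simplifies the description in two ways: it uses exact, case-insensitive matching rather than a matching degree value, and it relies on English word boundaries instead of a word segmentation step.

```python
import re

# Hypothetical preset word library (polite words, greetings, and the like).
PRESET_WORDS = ["well", "okay", "uh-huh"]

# One alternation over the escaped preset words; trailing punctuation and
# whitespace are consumed so that no stray separators are left behind.
_PRESET_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in PRESET_WORDS) + r")\b[,.!?]?\s*",
    flags=re.IGNORECASE,
)


def contains_preset_word(text: str) -> bool:
    """True if the dialogue text contains at least one preset word."""
    return _PRESET_PATTERN.search(text) is not None


def strip_preset_words(text: str) -> str:
    """Delete preset words from the dialogue text before pre-detection."""
    return _PRESET_PATTERN.sub("", text).strip()
```

  • With this cleanup applied first, a reply made up only of preset words becomes empty and cannot exceed the preset number of characters.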
  • the dialogue text of the dialogue sentence can be input into a pre-trained word recognition model to obtain a word recognition result of the dialogue text of the dialogue sentence.
  • the word recognition result indicates whether the dialogue text of the dialogue sentence contains a preset word.
  • the word recognition result can indicate the similarity between a certain word in the dialogue text of the dialogue sentence and one or more preset words.
  • The similarity is usually a floating-point value between 0 and 1; the larger the value, the higher the similarity.
  • The word recognition model is trained on sample text and the word tags of the words contained in the sample text. The word tag of a word indicates whether the word is a preset word.
  • The word tag of a word can be represented by one-hot encoding. For example, a word tag of [0,1] means that the word is not a preset word, and a word tag of [1,0] means that the word is a preset word. For example, the sample text is "Well, okay", and the words it contains are {"Well", "Okay"}, where the word tag corresponding to "Well" is [1,0] and the word tag corresponding to "Okay" is also [1,0].
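  • The one-hot word tags described above could be prepared as follows. This is only an illustration of the labelling convention; the helper name `tag_words` is hypothetical and the sketch does not train any model.

```python
PRESET = [1, 0]      # one-hot tag: the word is a preset word
NOT_PRESET = [0, 1]  # one-hot tag: the word is not a preset word


def tag_words(words, preset_words):
    """Pair each word of a sample text with its one-hot word tag."""
    return [(w, PRESET if w in preset_words else NOT_PRESET) for w in words]


# The sample text "Well, okay" from the description, already split into words.
sample = tag_words(["Well", "Okay"], preset_words={"Well", "Okay"})
print(sample)  # [('Well', [1, 0]), ('Okay', [1, 0])]
```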
  • the type of word recognition model can be selected according to actual needs.
  • The word recognition model can be a Bidirectional Encoder Representations from Transformers (BERT) model.
  • Model training is performed using sample text and the word tags of the words contained in the sample text, so that the trained word recognition model has generalized recognition capabilities, and the recognition ability and accuracy of the word recognition model can be continuously improved by supplementing new sample text.
  • In this way, whether the dialogue text of a dialogue sentence contains preset words such as polite words can be recognized simply and accurately.
  • In step S104, for each candidate dialogue sentence, it is determined whether the candidate dialogue sentence involves interrupting behavior based on at least one of the emotion recognition result obtained by performing emotion recognition on the candidate dialogue sentence with the emotion recognition model and the speech features of the candidate dialogue sentence.
  • the emotion recognition model refers to a pre-trained machine learning model with emotion recognition capabilities.
  • the emotion recognition model can be trained by using the emotion-related features of the sample dialogue sentences and the emotion labels corresponding to the sample dialogue sentences.
  • the emotion-related features of the sample dialogue sentence refer to the characteristics of the sample dialogue sentence that can represent the speaker's emotion, such as the spectrogram characteristics of the sample dialogue sentence, etc.
  • the emotion label corresponding to the sample dialogue sentence is used to indicate the emotional tendency of the sample dialogue sentence, such as positive emotion or negative emotion.
  • the tendency value of the emotional tendency corresponding to the sample dialogue sentence may include, for example, a positive emotional value and a negative emotional value.
  • the type of emotion recognition model can be selected according to actual needs.
  • feature extraction can be performed on the candidate dialogue sentences to obtain the emotion-related features of the candidate dialogue sentences, and then the emotion-related features of the candidate dialogue sentences are input into the emotion recognition model to obtain the emotion recognition results of the candidate dialogue sentences.
  • the emotion recognition result may represent the emotional tendency of the candidate dialogue sentence, or may represent the tendency value of the emotional tendency of the candidate dialogue sentence.
  • The speech features of the candidate dialogue sentence may include, but are not limited to, the volume of the candidate dialogue sentence and/or the volume change value of the candidate dialogue sentence relative to a first associated dialogue sentence, where the speaker role of the first associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence.
  • For example, the first associated dialogue sentence may be a dialogue sentence uttered by the customer service staff before the candidate dialogue sentence.
  • Determining whether the candidate dialogue sentence involves interrupting behavior based on its emotion recognition result and/or its speech features can improve the detection accuracy of voice dialogue.
  • the preset interruption condition may include that the negative emotion value of the candidate dialogue sentence exceeds the preset emotion threshold or the volume change value exceeds the preset volume value.
  • the preset emotional threshold and preset volume value can be set according to actual needs.
  • If the negative emotion value of the candidate dialogue sentence exceeds the preset emotion threshold, or the volume change value of the candidate dialogue sentence relative to the first associated dialogue sentence exceeds the preset volume value, it is determined that the candidate dialogue sentence involves interrupting behavior.
  • Compared with the oversimplified approach of simply judging "one party speaking before the other party has finished speaking" to be interrupting behavior, this avoids misjudging as interrupting behavior cases in which one party, out of patience and respect for the other party, responds before the other party has finished speaking, which helps improve the detection accuracy of voice dialogue.
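  • The preset interruption condition can be sketched as a simple predicate. The threshold values and the volume unit below are assumptions chosen for illustration (the description leaves them to be set according to actual needs), and the negative emotion value is assumed to come from an emotion recognition model that is not shown.

```python
def volume_change(candidate_volume: float, reference_volume: float) -> float:
    """Volume change of the candidate relative to the first associated sentence."""
    return candidate_volume - reference_volume


def is_interrupting(negative_emotion: float,
                    vol_change: float,
                    emotion_threshold: float = 0.7,    # hypothetical preset emotion threshold
                    volume_threshold: float = 10.0) -> bool:  # hypothetical preset volume value
    """Apply the preset interruption condition to one candidate dialogue sentence."""
    return negative_emotion > emotion_threshold or vol_change > volume_threshold


# A calm reply at a steady volume is not flagged; a sharp rise in volume is.
print(is_interrupting(0.2, volume_change(62.0, 60.0)))  # False
print(is_interrupting(0.2, volume_change(72.0, 60.0)))  # True
```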
  • Inspection exemption conditions can be set in advance for this situation. For any candidate dialogue sentence, if it is determined that the candidate dialogue sentence satisfies the inspection exemption condition, it can be directly determined that the candidate dialogue sentence does not involve interrupting behavior, without performing the above step S104, as shown in Figure 3.
  • the speaker role of the second associated dialogue sentence is different from the speaker role of the candidate dialogue sentence, and the speaker role of the third associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence.
  • Preset inspection exemption conditions can be set according to actual needs.
  • The preset inspection exemption condition may include: the intention of the second associated dialogue sentence is to end the dialogue, and the matching degree value between the dialogue text of the third associated dialogue sentence and the preset end dialogue text exceeds the first preset degree threshold.
  • The preset end dialogue text may be a standard text used to end a conversation, such as "Thank you for calling, goodbye".
  • Intention recognition can be performed on the second associated dialogue sentence to obtain an intention recognition result, and the dialogue text of the third associated dialogue sentence can be matched against the preset end dialogue text to obtain a first matching result. Then, based on the intention recognition result and the first matching result, it can be determined whether the candidate dialogue sentence satisfies the preset inspection exemption condition. Here, the start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence, and the start time of the third associated dialogue sentence is between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
  • the intent recognition model refers to a pre-trained machine learning model with intent recognition capabilities.
  • the intent recognition model can be trained by using the intent-related features of the sample conversation text and the intent tags corresponding to the sample conversation text.
  • Intention-related features of the sample dialogue text may include word features and/or sentence features of the sample dialogue text that can characterize the speaker's intention.
  • the intent tag corresponding to the sample conversation text is used to indicate the intent of the sample conversation text, for example, indicating whether the intent of the sample conversation text is to end the conversation. It should be noted that in actual applications, the type of intent recognition model can be selected according to actual needs.
  • Feature extraction can be performed on the dialogue text of the second associated dialogue sentence to obtain its intention-related features, and these features are then input into the intention recognition model to determine whether the intention of the second associated dialogue sentence is to end the dialogue.
  • the conversation between the calling party and the called party is as follows:
  • The fourth dialogue sentence is determined to be a candidate dialogue sentence through the above step S102.
  • The first dialogue sentence can be determined as the second associated dialogue sentence of the candidate dialogue sentence, and the second dialogue sentence can be determined as the third associated dialogue sentence of the candidate dialogue sentence.
  • By performing intention recognition on the second associated dialogue sentence (i.e., the first dialogue sentence) with the intention recognition model, it can be determined that the intention of the second associated dialogue sentence is to end the dialogue.
  • It can also be determined that the matching degree value between the dialogue text of the third associated dialogue sentence and the preset end dialogue text exceeds the first preset degree threshold. Therefore, the candidate dialogue sentence (i.e., the fourth dialogue sentence) satisfies the preset inspection exemption condition, and it is determined that the candidate dialogue sentence does not involve interrupting behavior. That is, the fourth dialogue sentence belongs to the situation in which, after both parties have clearly signalled their intention to end the conversation, the calling party suddenly asks a question, causing the called party to start speaking before the calling party has finished speaking. Therefore, it can be determined that the fourth dialogue sentence is not an interruption by the called party.
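  • The exemption check can be sketched as below. The intention recognition result is passed in as a boolean, standing in for the output of an intention recognition model, and `difflib.SequenceMatcher` is used as an assumed substitute for the unspecified text matching method; the 0.8 threshold is hypothetical.

```python
from difflib import SequenceMatcher

# Standard closing text quoted in the description.
END_DIALOGUE_TEXT = "Thank you for calling, goodbye"


def matching_degree(text: str, reference: str = END_DIALOGUE_TEXT) -> float:
    """Similarity in [0, 1] between a dialogue text and the end dialogue text."""
    return SequenceMatcher(None, text.lower(), reference.lower()).ratio()


def is_exempt(second_intent_is_end: bool,
              third_text: str,
              degree_threshold: float = 0.8) -> bool:  # hypothetical first preset degree threshold
    """True if the candidate sentence satisfies the inspection exemption condition."""
    return second_intent_is_end and matching_degree(third_text) > degree_threshold
```

  • A candidate for which `is_exempt` returns True would be reported as not involving interrupting behavior without running step S104.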
  • The voice dialogue detection method in the embodiments of the present application can be used in a variety of scenarios that require detection of interrupting behavior, such as telephone operations and intelligent question and answer. Taking the telephone operation scenario as an example, the voice dialogue detection method provided by the embodiments of the present application is described below.
  • the telephone operation scenario involves the client 10 and the intelligent customer service quality inspection system 20.
  • the client 10 can display a configuration interface for developer A to configure quality inspection rules.
  • the rule 1 corresponding to the preset exemption condition may include the intention of the above-mentioned second associated dialogue statement and the conditions that need to be satisfied by the third associated dialogue statement.
  • Rule 2, corresponding to interruption pre-detection, may include a preset intersection duration, a preset number of characters, an interruption delay, etc. (as shown in Figure 6).
  • Rule 3, corresponding to secondary interruption detection, may include an emotion recognition model, an intention recognition model, etc., which are used to further determine whether a candidate dialogue sentence has interruption behavior.
  • Rule 4, corresponding to excluding interruptions that contain only a few words, may include a preset number of characters, etc.
  • the client 10 can send the quality inspection rules configured by developer A to the intelligent customer service quality inspection system 20 for use by the intelligent customer service quality inspection system 20.
  • the client 10 can display a voice data import interface.
  • User B, who has voice dialogue quality inspection requirements, can import the target voice data to be inspected into the client 10 through the voice data import interface.
  • the client 10 can send the imported target voice data to the intelligent customer service quality inspection system 20 and, in response to a voice dialogue detection trigger instruction input by user B, send the intelligent customer service quality inspection system 20 a request to detect dialogue sentences with interruption behavior in the target voice data.
  • the intelligent customer service quality inspection system 20 may include a server (Server) or a server cluster (Cluster) composed of multiple servers.
  • the intelligent customer service quality inspection system 20 can execute the voice dialogue detection method disclosed in the above embodiments of the present application based on the pre-configured quality inspection rules to determine whether the target voice data contains dialogue sentences with interruption behavior, and return the detection results.
  • the client 10 displays the detection results to user B, so that user B can take corresponding measures to improve customer service quality based on the detection results.
  • the intelligent customer service quality inspection system 20 can obtain the voice features and dialogue-related information of each dialogue sentence in the target voice data (for example, the start and end times of the dialogue sentence and the speaker role) and, using ASR technology, convert the target voice data into corresponding text to obtain the dialogue text of each dialogue sentence.
  • the intelligent customer service quality inspection system 20 can first exclude dialogue sentences with few words from the target voice data based on rule 4, which corresponds to excluding interruptions that contain only a few words. Then, based on the dialogue-related information and dialogue text of the remaining dialogue sentences, it can perform interruption pre-detection on these dialogue sentences according to rule 2 to determine the candidate dialogue sentences among them that may have interruption behavior. Next, the intelligent customer service quality inspection system 20 may determine whether a candidate dialogue sentence satisfies the preset exemption condition based on the second associated dialogue sentence and the third associated dialogue sentence of the candidate dialogue sentence.
  • If the candidate dialogue sentence satisfies the preset exemption condition, the intelligent customer service quality inspection system 20 can determine that the candidate dialogue sentence does not have interruption behavior. If the candidate dialogue sentence does not satisfy the preset exemption condition, the intelligent customer service quality inspection system 20 can, based on rule 3 corresponding to secondary interruption detection, call the emotion recognition model to perform emotion recognition on the candidate dialogue sentence and obtain an emotion recognition result, and then determine whether the candidate dialogue sentence has interruption behavior based on the emotion recognition result and/or the voice features of the candidate dialogue sentence.
  • the voice dialogue detection device 700 may include a first determination module 710 and a second determination module 730.
  • the first determination module 710 may perform interruption pre-detection on the plurality of dialogue sentences, based on the dialogue-related information and dialogue text of each of the plurality of dialogue sentences in the target voice data, to determine one or more candidate dialogue sentences among them that may have interruption behavior.
  • the target voice data may include dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and speaker roles.
  • the second determination module 730 may determine whether the candidate dialogue sentence has interruption behavior based on at least one of the emotion recognition result, obtained by performing emotion recognition on the candidate dialogue sentence with the emotion recognition model, and the voice features of the candidate dialogue sentence.
  • the emotion recognition result includes the negative emotion value of the candidate dialogue sentence
  • the voice characteristics of the candidate dialogue sentence include the volume change value of the candidate dialogue sentence relative to the first associated dialogue sentence
  • the speaker role of the first associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence.
  • the second determination module 730 may include a first interruption determination sub-module. If it is determined that the negative emotion value of the candidate dialogue sentence exceeds the preset emotion threshold or the volume change value exceeds the preset volume value, the first interruption determination sub-module may determine that the candidate dialogue sentence has interruption behavior.
  • the voice dialogue detection device 700 may also include an exemption identification module.
  • the exemption identification module can determine whether the candidate dialogue sentence satisfies the preset exemption condition based on the second associated dialogue sentence and the third associated dialogue sentence of the candidate dialogue sentence.
  • the speaker role of the second associated dialogue sentence is different from the speaker role of the candidate dialogue sentence, and the speaker role of the third associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence.
  • when the exemption identification module determines that the candidate dialogue sentence satisfies the preset exemption condition, the second determination module 730 can directly determine that the candidate dialogue sentence does not have interruption behavior; and when the exemption identification module determines that the candidate dialogue sentence does not satisfy the preset exemption condition, the second determination module 730 may determine whether the candidate dialogue sentence has interruption behavior based on the emotion recognition result of the candidate dialogue sentence and/or the voice features of the candidate dialogue sentence.
  • the preset exemption condition includes: the intention of the second associated dialogue sentence is to end the dialogue, and the matching degree value between the dialogue text of the third associated dialogue sentence and the preset end-of-dialogue text exceeds the first preset degree threshold.
  • the exemption identification module may include an intention recognition sub-module, a matching sub-module and an exemption identification sub-module.
  • the intention recognition sub-module can perform intention recognition on the second associated dialogue sentence based on the intention recognition model and the dialogue text of the second associated dialogue sentence to obtain an intention recognition result, wherein the start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence.
  • the matching sub-module can match the dialogue text of the third associated dialogue sentence against the preset end-of-dialogue text to obtain a first matching result, wherein the start time of the third associated dialogue sentence lies between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
  • the exemption identification sub-module may determine whether the candidate dialogue sentence satisfies the preset exemption condition based on the intention recognition result obtained by the intention recognition sub-module and the first matching result obtained by the matching sub-module.
  • the first determination module 710 may include an intersection duration determination sub-module and a candidate dialogue sentence determination sub-module.
  • the intersection duration determination sub-module may determine the intersection duration between the first dialogue sentence and the second dialogue sentence based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence.
  • if the intersection duration exceeds the preset duration or the number of characters in the dialogue text of the second dialogue sentence exceeds the preset number of characters, the candidate dialogue sentence determination sub-module may determine the second dialogue sentence to be a candidate dialogue sentence that may have interruption behavior.
  • the voice dialogue detection device 700 may further include a third determination module and a deletion module.
  • the third determination module may determine whether the dialogue text of each dialogue sentence contains a preset word. If the third determination module determines that the dialogue text of the dialogue statement contains the preset words, the deletion module may delete the preset words in the dialogue text of the dialogue statement.
  • the third determination module may include a word segmentation sub-module and a matching sub-module.
  • the word segmentation sub-module can perform word segmentation processing on the dialogue text of the dialogue sentence to obtain the words contained in the dialogue text of the dialogue sentence.
  • the matching sub-module may match each word contained in the dialogue text of the dialogue statement with the preset words in the preset word library to determine whether the dialogue text of the dialogue statement contains the preset word.
  • the third determination module may include a word determination sub-module.
  • the word determination sub-module can input the dialogue text of the dialogue sentence into the pre-trained word recognition model to obtain the word recognition result of the dialogue text of the dialogue sentence.
  • the word recognition result is used to indicate whether the dialogue text of the dialogue sentence contains a preset word.
  • the word recognition model is obtained by model training based on the sample text and the word tags of the words contained in the sample text.
  • the word tags of the words are used to indicate whether the words are preset words.
  • the electronic device may include a processor, an internal bus, a network interface, a memory, etc.
  • The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as disk storage.
  • the electronic equipment may also include other hardware required by the business.
  • the processor, network interface and memory can be connected to each other through an internal bus, which can be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one bidirectional arrow is used in Figure 8, but it does not mean that there is only one bus or one type of bus.
  • the memory is used to store programs.
  • a program may include program code including computer operating instructions.
  • Memory may include internal memory and non-volatile memory and provides instructions and data to the processor.
  • the processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it to form a voice dialogue detection device at the logical level.
  • the processor executes the program stored in the memory to execute the above voice dialogue detection method.
  • the processor may be an integrated circuit chip that has signal processing capabilities.
  • each step of the above method can be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software.
  • the above-mentioned processor can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
  • the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
  • Embodiments of the present application also propose a computer-readable storage medium that stores one or more programs.
  • the one or more programs include instructions that, when executed by a processor of an electronic device, cause the electronic device to execute the above voice dialogue detection method.
  • a typical implementation device is a computer.
  • the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
  • Computer-readable media include both persistent and non-persistent, removable and non-removable media that can store information by any method or technology.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD), magnetic tape cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
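
The second-stage decision described in the items above (exemption check first, then the emotion and volume thresholds) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the field names, the `"end_dialogue"` intent label, the `END_TEXTS` list, all threshold values, and the use of `difflib.SequenceMatcher` as the text-matching measure are hypothetical. The emotion score and intent label are assumed to come from separately trained models.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

# Hypothetical preset end-of-dialogue texts (assumption for illustration).
END_TEXTS = ("goodbye", "thank you, goodbye")


@dataclass
class Candidate:
    negative_emotion: float   # negative emotion value from an emotion model
    volume_change: float      # volume change vs. the first associated sentence
    second_assoc_intent: str  # intent label of the second associated sentence
    third_assoc_text: str     # dialogue text of the third associated sentence


def is_exempt(c: Candidate, match_threshold: float = 0.8) -> bool:
    """Exemption: the other party intends to end the dialogue AND the
    speaker's own earlier sentence matches a preset end-of-dialogue text."""
    if c.second_assoc_intent != "end_dialogue":
        return False
    best = max(
        SequenceMatcher(None, c.third_assoc_text.lower(), t).ratio()
        for t in END_TEXTS
    )
    return best > match_threshold


def has_interruption(c: Candidate, emotion_threshold: float = 0.6,
                     volume_threshold: float = 6.0) -> bool:
    """Second-stage decision for a candidate dialogue sentence."""
    if is_exempt(c):
        return False  # exempt candidates are not treated as interruptions
    # Either signal exceeding its threshold is enough.
    return (c.negative_emotion > emotion_threshold
            or c.volume_change > volume_threshold)
```

Keeping the exemption check ahead of the thresholds mirrors the order in the text: an exempt candidate is never passed to the emotion or volume test.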

Abstract

A voice dialogue detection method and apparatus. The voice dialogue detection method comprises: on the basis of dialogue-related information and dialogue text of a plurality of dialogue sentences in target voice data, performing interruption behavior pre-detection on the plurality of dialogue sentences, so as to determine one or more candidate dialogue sentences that may have an interruption behavior in the plurality of dialogue sentences (S102), wherein the target voice data comprises dialogue sentences of speakers of different roles, and each piece of the dialogue-related information comprises dialogue start-stop time information and a speaker role; and for each candidate dialogue sentence, determining, on the basis of at least one of an emotion recognition result of the candidate dialogue sentence and a voice feature of the candidate dialogue sentence, whether an interruption behavior exists in the candidate dialogue sentence (S104).

Description

Voice dialogue detection method and apparatus
Cross-reference to related applications
This application claims priority to Chinese Patent Application No. 202210451120.8, filed on April 24, 2022, the entire content of which is incorporated herein by reference.
Technical field
The present application relates to speech processing technology, and in particular to a voice dialogue detection method and apparatus.
Background
Detecting whether a participant in a voice dialogue engages in interruption behavior is an important part of voice dialogue detection and is widely used in scenarios such as telephone operations and intelligent question answering.
In some cases, voice dialogue detection methods rely on simple detection rules to decide whether a participant has interrupted. For example, if participant A responds before participant B has finished speaking, participant A is judged to have interrupted. This approach is overly simplistic and cannot accurately detect interruption behavior in complex dialogue scenarios. For example, while participant A is talking at length, participant B may respond before participant A has finished purely out of patience and respect for participant A, without actually cutting in on or talking over participant A.
Summary
In view of this, embodiments of the present application provide a voice dialogue detection method and apparatus.
A voice dialogue detection method provided by an embodiment of the present application includes: performing interruption pre-detection on a plurality of dialogue sentences in target voice data, based on the dialogue-related information and dialogue text of each of the plurality of dialogue sentences, to determine one or more candidate dialogue sentences among them that may have interruption behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and the speaker role; and, for each of the one or more candidate dialogue sentences, determining whether the candidate dialogue sentence has interruption behavior based on at least one of an emotion recognition result, obtained by performing emotion recognition on the candidate dialogue sentence with an emotion recognition model, and a voice feature of the candidate dialogue sentence.
A voice dialogue detection apparatus provided by an embodiment of the present application includes: a first determination module, configured to perform interruption pre-detection on a plurality of dialogue sentences in target voice data, based on the dialogue-related information and dialogue text of each of the plurality of dialogue sentences, to determine one or more candidate dialogue sentences among them that may have interruption behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and the speaker role; and a second determination module, configured to determine, for each of the one or more candidate dialogue sentences, whether the candidate dialogue sentence has interruption behavior based on at least one of an emotion recognition result, obtained by performing emotion recognition on the candidate dialogue sentence with an emotion recognition model, and a voice feature of the candidate dialogue sentence.
An electronic device provided by an embodiment of the present application includes: a processor; and a memory for storing instructions executable by the processor, wherein the processor is configured to implement the above voice dialogue detection method when executing the instructions.
An embodiment of the present application provides a computer-readable storage medium. When instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to implement the above voice dialogue detection method.
Brief description of the drawings
Figure 1 is a schematic flowchart of a voice dialogue detection method provided by an embodiment of the present application;
Figure 2 is a schematic flowchart of a voice dialogue detection method provided by another embodiment of the present application;
Figure 3 is a schematic flowchart of a voice dialogue detection method provided by yet another embodiment of the present application;
Figure 4 is a schematic diagram of a scenario to which the voice dialogue detection method provided by an embodiment of the present application is applicable;
Figure 5 is a schematic diagram of a configuration interface provided by an embodiment of the present application;
Figure 6 is a schematic diagram of a configuration interface provided by another embodiment of the present application;
Figure 7 is a schematic structural diagram of a voice dialogue detection apparatus provided by an embodiment of the present application;
Figure 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
Detailed description
To make the purpose, technical solutions and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to specific embodiments of the present application and the corresponding drawings. The described embodiments are merely exemplary and are not intended to limit the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments described in the present application without creative effort fall within the scope of protection of the present application.
The terms "first", "second" and the like in the present application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present application can be implemented in orders other than those illustrated or described here. In addition, "and/or" in the present application denotes at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the objects before and after it.
Explanation of some concepts:
Interruption: one party to a dialogue starts speaking before the other party has finished, thereby cutting off the other party's speech.
Intelligent customer service quality inspection system: a system that inspects the text content of voice, video and other data by means of detection models, detection algorithms and the like. It serves to monitor the behavior of customer service personnel, for example by detecting whether a dialogue participant has interrupted, which helps improve the service quality of customer service personnel.
Automatic Speech Recognition (ASR): the conversion of speech to text, that is, using a computer to turn meaningful human speech into written language.
Embodiments of the present application propose a dialogue sentence detection scheme that exploits a regularity of interruption behavior: one party typically starts speaking before the other has finished, and what it says is not overly brief. First, candidate dialogue sentences that may have interruption behavior are determined from the dialogue sentences of speakers with different roles, based on dialogue-related information such as the start and end times and speaker roles of these sentences, together with their dialogue text. Next, exploiting the regularity that a speaker who interrupts usually speaks louder and shows negative, agitated emotion, the emotion recognition result and/or voice features of a candidate dialogue sentence can be used to further determine whether it has interruption behavior. Compared with the overly simplistic approach of treating any instance of one party speaking before the other has finished as an interruption, embodiments of the present application can avoid misjudging behaviors such as one party responding before the other has finished out of patience and respect, thereby improving the accuracy of voice dialogue detection.
It should be understood that the voice dialogue detection method provided by the embodiments of the present application can be executed by an electronic device or by software installed in an electronic device, and specifically by a terminal device or a server device.
The embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to Figure 1, a voice dialogue detection method provided by an embodiment of the present application may include steps S102 to S104.
In step S102, interruption pre-detection is performed on a plurality of dialogue sentences in the target voice data, based on the dialogue-related information and dialogue text of each of the plurality of dialogue sentences, to determine one or more candidate dialogue sentences among them that may have interruption behavior.
The target voice data may include dialogue sentences of speakers with different roles. For example, in a telephone operation scenario, the target voice data includes dialogue sentences between a user and customer service personnel; in a video conference scenario, the target voice data includes dialogue sentences between different conference participants; and so on.
The dialogue-related information of a dialogue sentence may include its dialogue start and end time information and the speaker role. The start and end time information includes the start time of the dialogue sentence (the time at which the speaker starts speaking) and its end time (the time at which the speaker stops speaking).
The dialogue text of a dialogue sentence represents its dialogue content. In practice, the dialogue text can be obtained by recognizing the dialogue sentence with ASR technology.
Since interruption behavior usually means that one party starts speaking before the other has finished and that what it says is not overly brief, interruption pre-detection can be performed on the dialogue sentences of speakers with different roles, based on dialogue-related information such as their start and end times and speaker roles, together with their dialogue text, to determine candidate dialogue sentences that may have interruption behavior.
For example, in one implementation, suppose the first dialogue sentence and the second dialogue sentence are any two adjacent dialogue sentences in the target speech data. If the two sentences have different speaker roles, and the start time of the second dialogue sentence is after the start time of the first dialogue sentence and before the end time of the first dialogue sentence, interruption pre-detection may be performed on the second dialogue sentence. Specifically, the overlap duration between the first dialogue sentence and the second dialogue sentence may be determined from the end time of the first dialogue sentence and the start and end times of the second dialogue sentence. If the first dialogue sentence ends before the second dialogue sentence ends, the overlap duration equals the difference between the end time of the first dialogue sentence and the start time of the second dialogue sentence. If the first dialogue sentence ends after the second dialogue sentence ends, the overlap duration equals the difference between the end time and the start time of the second dialogue sentence, that is, the full duration of the second dialogue sentence. If the overlap duration exceeds a preset duration, or the number of characters in the dialogue text of the second dialogue sentence exceeds a preset number of characters, the second dialogue sentence is determined to be a candidate dialogue sentence that may involve interruption. In practice, both the preset duration and the preset number of characters can be set as needed; for example, the preset duration may be set to 3 seconds and the preset number of characters to 5.
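The overlap computation and the pre-detection rule described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Utterance` type, the function names, and the role labels are assumptions introduced here, and the default thresholds are simply the example values of 3 seconds and 5 characters.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical container for one dialogue sentence."""
    role: str      # speaker role, e.g. "agent" or "user" (labels are assumptions)
    start_ms: int  # start time in milliseconds
    end_ms: int    # end time in milliseconds
    text: str      # dialogue text of the sentence


def overlap_ms(first: Utterance, second: Utterance) -> int:
    """Overlap duration, assuming `second` starts while `first` is still speaking.

    Taking the earlier of the two end times covers both cases in the text:
    first ends earlier  -> first.end - second.start
    second ends earlier -> second.end - second.start
    """
    return min(first.end_ms, second.end_ms) - second.start_ms


def is_candidate(first: Utterance, second: Utterance,
                 min_overlap_ms: int = 3000, min_chars: int = 5) -> bool:
    """Return True if `second` may involve interruption of `first`."""
    if first.role == second.role:
        return False  # only sentences with different speaker roles are checked
    if not (first.start_ms < second.start_ms < first.end_ms):
        return False  # `second` must start while `first` is in progress
    return (overlap_ms(first, second) > min_overlap_ms
            or len(second.text) > min_chars)
```

The case in which both sentences end at exactly the same time is not addressed in the text; the sketch treats it like either branch, since both formulas give the same value there.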
To avoid missed detections, as shown in FIG. 2, the dialogue sentences in the target speech data may be sorted by start time from earliest to latest, and the above steps may then be performed on each pair of adjacent dialogue sentences in turn until all dialogue sentences in the target speech data have been evaluated. For example, if the overlap duration between the N-th dialogue sentence (N being a positive integer) and the (N+1)-th dialogue sentence exceeds the preset duration, or the number of characters in the dialogue text of the (N+1)-th dialogue sentence exceeds the preset number of characters, the (N+1)-th dialogue sentence is determined to be a candidate dialogue sentence; otherwise, the process is repeated for the (N+1)-th and (N+2)-th dialogue sentences, until all dialogue sentences in the target speech data have been evaluated.
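The sort-and-scan procedure can be sketched as below. The pairwise rule is passed in as a function so that the sketch shows only the traversal; the `Utterance` tuple, `find_candidates`, and the `precheck` parameter are hypothetical names, not terms from the application.

```python
from typing import Callable, List, NamedTuple


class Utterance(NamedTuple):
    """Hypothetical record for one dialogue sentence."""
    role: str
    start_ms: int
    end_ms: int
    text: str


def find_candidates(
    utterances: List[Utterance],
    precheck: Callable[[Utterance, Utterance], bool],
) -> List[Utterance]:
    """Sort by start time, then test every pair of adjacent sentences.

    `precheck(prev, cur)` is assumed to implement the pre-detection rule
    (role difference, start-time window, overlap or character threshold).
    """
    ordered = sorted(utterances, key=lambda u: u.start_ms)
    candidates = []
    # zip pairs the N-th sentence with the (N+1)-th for every N
    for prev, cur in zip(ordered, ordered[1:]):
        if precheck(prev, cur):
            candidates.append(cur)
    return candidates
```

Because every adjacent pair is visited regardless of the previous outcome, no sentence is skipped, which is the point of the ordering step.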
As an example, in a telephone operation scenario, the target speech data includes dialogue sentences between a customer service agent and a user. Since only interruption by the customer service agent matters in this scenario, only the agent's dialogue sentences are checked for interruption. Suppose the dialogue-related information and dialogue text of the dialogue sentences in the target speech data are as follows:
Figure PCTCN2023070200-appb-000001
Figure PCTCN2023070200-appb-000002
In the target speech data shown above, no dialogue sentence precedes the 1st dialogue sentence, so it can be determined that the 1st dialogue sentence does not involve interruption. The speaker role of the 2nd dialogue sentence is the user, so interruption pre-detection is not performed on the 2nd dialogue sentence.
The speaker role of the 3rd dialogue sentence is the customer service agent, which differs from that of the 2nd dialogue sentence, and the start time of the 3rd dialogue sentence (5880 ms) is after the start time of the 2nd dialogue sentence (4760 ms) and before the end time of the 2nd dialogue sentence (10240 ms), so interruption pre-detection is performed on the 3rd dialogue sentence. The end time of the 3rd dialogue sentence (6320 ms) is before the end time of the 2nd dialogue sentence (10240 ms), so the overlap duration between the two equals the difference between the end time (6320 ms) and the start time (5880 ms) of the 3rd dialogue sentence, namely 440 ms, which is less than the preset duration of 3 seconds. In addition, the dialogue text of the 3rd dialogue sentence contains 1 character, which is less than the preset number of 5 characters. It can therefore be determined that the 3rd dialogue sentence does not involve interruption.
The speaker role of the 4th dialogue sentence is the same as that of the 3rd dialogue sentence, so interruption pre-detection is not performed on the 4th dialogue sentence. The speaker role of the 5th dialogue sentence is the user, so interruption pre-detection is not performed on the 5th dialogue sentence.
The speaker role of the 6th dialogue sentence is the customer service agent, which differs from that of the 5th dialogue sentence, and the start time of the 6th dialogue sentence (15830 ms) is after the start time of the 5th dialogue sentence (14640 ms) and before the end time of the 5th dialogue sentence (23270 ms), so interruption pre-detection is performed on the 6th dialogue sentence. The end time of the 6th dialogue sentence (20500 ms) is before the end time of the 5th dialogue sentence (23270 ms), so the overlap duration between the two equals the difference between the end time (20500 ms) and the start time (15830 ms) of the 6th dialogue sentence, namely 4670 ms, which is greater than the preset duration of 3 seconds. The 6th dialogue sentence can therefore be determined to be a candidate dialogue sentence that may involve interruption. Alternatively, the dialogue text of the 6th dialogue sentence contains 18 characters, which exceeds the preset number of 5 characters, so the 6th dialogue sentence can likewise be determined to be a candidate dialogue sentence that may involve interruption.
It will be appreciated that, in real conversations, when one party speaks at length, the other party sometimes responds before the speaker has finished out of patience and respect, for example with polite acknowledgements such as "mm-hm" or "okay"; this is not interruption. An oversimplified rule that treats any instance of one party speaking before the other has finished as interruption would misjudge such dialogue sentences. In view of this, interruption pre-detection may be performed on the dialogue sentences of speakers with different roles based on the overlap duration between two dialogue sentences with different speaker roles and on the number of characters in the dialogue text, to determine dialogue sentences that may involve interruption. Specifically, a dialogue sentence may be judged as possibly involving interruption when the overlap duration is long or the dialogue text contains many characters. This avoids misjudging dialogue sentences such as the polite acknowledgements above as interruption, which helps improve the accuracy of voice dialogue detection.
Furthermore, in real conversations, participants may add polite expressions and pleasantries when speaking, out of patience, respect, or courtesy. If a dialogue sentence contains many such words, the approach above may misjudge it as possibly involving interruption. In view of this, in one implementation, to improve the accuracy of voice dialogue detection, the voice dialogue detection method provided by the embodiments of the present application may further include: before step S102, determining whether the dialogue text of each dialogue sentence in the target speech data contains preset words, and, if the dialogue text of a dialogue sentence contains preset words, deleting the preset words from that dialogue text. The preset words can be set as needed; for example, they may include the polite expressions and pleasantries mentioned above.
In the embodiments of the present application, any suitable method may be used to determine whether the dialogue text of a dialogue sentence contains preset words. For example, in one implementation, the dialogue text of the dialogue sentence may be segmented into words; each word in the dialogue text is then matched against the preset words in a preset word library to determine whether the dialogue text contains a preset word.
As an example, the preset word library may be built by exhaustively enumerating the preset words. A regular-expression matching algorithm is then used to match each word in the dialogue text of the dialogue sentence against the preset words in the preset word library. If the matching result indicates that the matching degree value between a word in the dialogue text and a preset word in the preset word library exceeds a preset matching degree value, it can be determined that the dialogue text of the dialogue sentence contains that preset word.
It will be appreciated that determining whether the dialogue text of a dialogue sentence contains preset words by segmenting the text and matching it against a preset word library is highly accurate and is suited to scenarios in which the preset words in the library change little.
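A minimal sketch of the lexicon-based filtering is given below. It assumes the dialogue text has already been segmented into words by some upstream tokenizer, and it simplifies the "matching degree" test to an exact regular-expression match; the function names and the sample lexicon are illustrative assumptions only.

```python
import re
from typing import Iterable, List, Pattern


def build_matcher(preset_words: Iterable[str]) -> Pattern[str]:
    """Compile the preset word library into one alternation pattern."""
    # re.escape keeps any regex metacharacters in the lexicon literal
    return re.compile("|".join(re.escape(w) for w in preset_words))


def remove_preset_words(tokens: List[str], matcher: Pattern[str]) -> List[str]:
    """Drop every token that is, in its entirety, a preset word."""
    return [t for t in tokens if not matcher.fullmatch(t)]


# Hypothetical lexicon of polite acknowledgements ("mm", "okay")
matcher = build_matcher(["嗯", "好的"])
# Hypothetical pre-segmented text: "mm / okay / I / want / query / bill"
tokens = ["嗯", "好的", "我", "想", "查询", "账单"]
kept = remove_preset_words(tokens, matcher)
remaining_chars = sum(len(t) for t in kept)  # character count used by pre-detection
```

Using `fullmatch` on whole tokens, rather than searching the raw string, avoids deleting a preset word that merely appears inside a longer word.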
In another implementation, the dialogue text of a dialogue sentence may be input into a pre-trained word recognition model to obtain a word recognition result for the dialogue text, the word recognition result indicating whether the dialogue text contains preset words. For example, the word recognition result may indicate the similarity between a word in the dialogue text and one or more preset words; the similarity is typically a floating-point value between 0 and 1, with a larger value indicating higher similarity. The word recognition model is obtained by training on sample texts and the word labels of the words they contain, where the word label of a word indicates whether the word is a preset word. In practice, a word label may be represented with one-hot encoding: a label of [0,1] indicates that the word is not a preset word, and a label of [1,0] indicates that it is. As an example, for the sample text "mm, okay", the words it contains are {"mm", "okay"}, where the word label of "mm" is [1,0] and the word label of "okay" is also [1,0].
It should be noted that, in practice, the type of word recognition model can be chosen as needed; for example, the word recognition model may be a Bidirectional Encoder Representations from Transformers (BERT) model.
It will be appreciated that training the model on sample texts and the word labels of the words they contain gives the resulting word recognition model the ability to generalize, and that its recognition capability and precision can be improved continually by adding new sample texts. Recognizing the dialogue text of a dialogue sentence with the trained word recognition model makes it possible to determine simply and accurately whether the dialogue text contains preset words such as polite expressions.
In step S104, for each candidate dialogue sentence, whether the candidate dialogue sentence involves interruption is determined based on at least one of an emotion recognition result, obtained by performing emotion recognition on the candidate dialogue sentence with an emotion recognition model, and the speech features of the candidate dialogue sentence.
In the embodiments of the present application, the emotion recognition model is a pre-trained machine learning model capable of emotion recognition. Specifically, the emotion recognition model may be trained using the emotion-related features of sample dialogue sentences and the emotion labels corresponding to those sentences. The emotion-related features of a sample dialogue sentence are features that can characterize the speaker's emotion, such as spectrogram features of the sample dialogue sentence. The emotion label corresponding to a sample dialogue sentence indicates its emotional tendency, such as positive emotion or negative emotion. In one embodiment, the tendency values of the emotional tendency corresponding to a sample dialogue sentence may include, for example, a positive emotion value and a negative emotion value. The higher the positive emotion value of a sample dialogue sentence, the more the sentence tends toward positive emotion; the higher its negative emotion value, the more it tends toward negative emotion. It should be noted that, in practice, the type of emotion recognition model can be chosen as needed.
As an example, feature extraction may be performed on a candidate dialogue sentence to obtain its emotion-related features, which are then input into the emotion recognition model to obtain the emotion recognition result for the candidate dialogue sentence. The emotion recognition result may represent the emotional tendency of the candidate dialogue sentence, or the tendency value of that emotional tendency.
The speech features of a candidate dialogue sentence may include, but are not limited to, the volume of the candidate dialogue sentence and/or the volume change value of the candidate dialogue sentence relative to a first associated dialogue sentence, where the speaker role of the first associated dialogue sentence is the same as that of the candidate dialogue sentence. For example, if the speaker role of the candidate dialogue sentence is the customer service agent, the first associated dialogue sentence may be a dialogue sentence uttered by the customer service agent before the candidate dialogue sentence.
Since a speaker who interrupts usually speaks louder and shows negative, agitated emotion, determining whether a candidate dialogue sentence involves interruption by combining its emotion recognition result and/or its speech features can improve the accuracy of voice dialogue detection.
In one implementation, as shown in FIG. 3, whether a candidate dialogue sentence satisfies a preset interruption condition may be determined based on the emotion recognition result and/or the volume change value; if so, the candidate dialogue sentence is determined to involve interruption. The preset interruption condition may include the negative emotion value of the candidate dialogue sentence exceeding a preset emotion threshold, or the volume change value exceeding a preset volume value. In practice, both the preset emotion threshold and the preset volume value can be set as needed.
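The preset interruption condition is a simple disjunction, which can be sketched as follows. The function name, the decibel unit for the volume change, and both default thresholds are assumptions made for illustration; the application leaves these values to be configured.

```python
from typing import Optional


def meets_interruption_condition(
    negative_emotion: Optional[float],
    volume_change_db: Optional[float],
    emotion_threshold: float = 0.8,    # hypothetical preset emotion threshold
    volume_threshold_db: float = 6.0,  # hypothetical preset volume value
) -> bool:
    """True if either available signal exceeds its preset threshold.

    Either input may be None when that signal was not computed, reflecting
    that the decision may rest on at least one of the two.
    """
    emotion_hit = (negative_emotion is not None
                   and negative_emotion > emotion_threshold)
    volume_hit = (volume_change_db is not None
                  and volume_change_db > volume_threshold_db)
    return emotion_hit or volume_hit
```

A candidate for which neither signal exceeds its threshold is treated as not involving interruption.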
Continuing with the target speech data above, after the 6th dialogue sentence has been determined to be a candidate dialogue sentence, if the volume change value of the 6th dialogue sentence relative to the 4th dialogue sentence (serving as the first associated dialogue sentence) exceeds the preset volume value, the 6th dialogue sentence can be determined to involve interruption.
It will be appreciated that a candidate dialogue sentence is judged to involve interruption when its negative emotion value exceeds the preset emotion threshold, or when its volume change value relative to the first associated dialogue sentence exceeds the preset volume value. Compared with the oversimplified rule of treating any instance of one party speaking before the other has finished as interruption, this avoids misjudging behavior such as responding out of patience and respect before the other party has finished speaking, and thus helps improve the accuracy of voice dialogue detection.
Furthermore, the following situation may arise in real conversations: when both or all parties clearly intend to end the conversation, one party suddenly asks a question, causing the other participants to start speaking before that party has finished, although they are not deliberately interrupting. To avoid misjudging such behavior as interruption, in one embodiment an exemption condition may be set in advance for this situation. For any candidate dialogue sentence, if the candidate dialogue sentence is determined to satisfy the exemption condition, it can be directly determined that the candidate dialogue sentence does not involve interruption, without performing step S104 above, as shown in FIG. 3.
For example, whether a candidate dialogue sentence satisfies the preset exemption condition may be determined based on a second associated dialogue sentence and a third associated dialogue sentence of the candidate dialogue sentence. The speaker role of the second associated dialogue sentence differs from that of the candidate dialogue sentence, and the speaker role of the third associated dialogue sentence is the same as that of the candidate dialogue sentence.
The preset exemption condition can be set as needed. For example, to further improve the accuracy of interruption detection, the preset exemption condition may include: the intent of the second associated dialogue sentence is to end the conversation, and the matching degree value between the dialogue text of the third associated dialogue sentence and a preset closing text exceeds a first preset degree threshold. The preset closing text may be a standard text used to end a conversation, such as "Thank you for calling, goodbye".
In one implementation, intent recognition may be performed on the second associated dialogue sentence, based on an intent recognition model and the dialogue text of the second associated dialogue sentence, to obtain an intent recognition result; and the dialogue text of the third associated dialogue sentence may be matched against the preset closing text to obtain a first matching result. Whether the candidate dialogue sentence satisfies the preset exemption condition can then be determined based on the intent recognition result and the first matching result. Here, the start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence, and the start time of the third associated dialogue sentence is between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
In the embodiments of the present application, the intent recognition model is a pre-trained machine learning model capable of intent recognition. Specifically, the intent recognition model may be trained using the intent-related features of sample dialogue texts and the intent labels corresponding to those texts. The intent-related features of a sample dialogue text may include word features and/or sentence features that can characterize the speaker's intent. The intent label corresponding to a sample dialogue text indicates the intent of the text, for example whether the intent is to end the conversation. It should be noted that, in practice, the type of intent recognition model can be chosen as needed.
In one implementation, to perform intent recognition on the second associated dialogue sentence, feature extraction may be performed on its dialogue text to obtain the intent-related features of that text, which are then input into the intent recognition model to determine whether the intent of the second associated dialogue sentence is to end the conversation.
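The exemption check described above can be sketched as follows. The intent recognition model is represented by a caller-supplied function, since the application does not fix its type, and the matching degree is approximated with the similarity ratio from Python's standard `difflib` module; that choice of similarity measure, the function names, and the 0.8 threshold are all assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable


def is_exempt(
    second_assoc_text: str,
    third_assoc_text: str,
    is_end_intent: Callable[[str], bool],
    closing_text: str = "Thank you for calling, goodbye",
    match_threshold: float = 0.8,  # hypothetical first preset degree threshold
) -> bool:
    """Return True if the candidate sentence should skip step S104.

    `second_assoc_text`: text of the earlier sentence by the OTHER speaker.
    `third_assoc_text`: text of the earlier sentence by the SAME speaker
    as the candidate.
    `is_end_intent`: stand-in for the intent recognition model.
    """
    if not is_end_intent(second_assoc_text):
        return False
    # ratio() returns a similarity score in [0, 1]
    score = SequenceMatcher(None, third_assoc_text, closing_text).ratio()
    return score > match_threshold
```

Only when both parts of the condition hold is the candidate exempted; otherwise it proceeds to the emotion and volume check.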
For example, in a voice call scenario, the dialogue sentences between the calling party and the called party are as follows:
Figure PCTCN2023070200-appb-000003
In the dialogue sentences above, suppose the 4th dialogue sentence has been determined to be a candidate dialogue sentence through step S102. Based on the start and end times and the speaker roles of the dialogue sentences, the 1st dialogue sentence can be determined as the second associated dialogue sentence of the candidate dialogue sentence, and the 2nd dialogue sentence as its third associated dialogue sentence. Performing intent recognition on the second associated dialogue sentence (the 1st dialogue sentence) with the intent recognition model shows that its intent is to end the conversation. Matching the dialogue text of the third associated dialogue sentence (the 2nd dialogue sentence) against the preset closing text shows that the matching degree value between them exceeds the first preset degree threshold. It can therefore be determined that the candidate dialogue sentence (the 4th dialogue sentence) satisfies the preset exemption condition, and hence that it does not involve interruption. In other words, the 4th dialogue sentence falls into the situation in which, when both parties clearly intend to end the conversation, a sudden question from the calling party causes the called party to start speaking before the calling party has finished; the 4th dialogue sentence is therefore not a case of the called party interrupting the calling party.
As described above, intent recognition may first be performed on an earlier dialogue sentence of the speaker with the other role, and an earlier dialogue sentence of the candidate dialogue sentence's own speaker may be matched against the preset closing text; the intent recognition result and the matching result are then combined to determine whether the candidate dialogue sentence satisfies the preset exemption condition. If the candidate dialogue sentence does not satisfy the preset exemption condition, whether it involves interruption can then be determined based on the emotion recognition result and/or the speech features of the candidate dialogue sentence. This avoids misjudging certain special situations in real conversations as interruption, which helps improve the accuracy of voice dialogue detection.
The voice dialogue detection method of the embodiments of the present application can be used in a variety of scenarios that require interruption detection, such as telephone operations and intelligent question answering. The method is described below using a telephone operation scenario as an example.
As shown in FIG. 4, the telephone operation scenario involves a client 10 and an intelligent customer service quality inspection system 20. The client 10 may display a configuration interface in which developer A configures quality inspection rules. As an example, as shown in FIG. 5, the following may be configured: rule 1, corresponding to the preset exemption condition; rule 2, corresponding to interruption pre-detection; rule 3, corresponding to secondary interruption detection; and rule 4, corresponding to excluding cases in which the interrupting speech contains few characters. For example, rule 1 may include the intent of the second associated dialogue sentence and the condition to be satisfied by the third associated dialogue sentence, as described above. Rule 2 may include the preset overlap duration, the preset number of characters, an interruption delay, and so on (as shown in FIG. 6). Rule 3 may include the emotion recognition model, the intent recognition model, and the like used to further determine whether a candidate dialogue sentence involves interruption. Rule 4 may include the preset number of characters, and so on.
客户端10可将开发人员A配置的质检规则发送给智能客户服务质检系统20,以供智能客户服务质检系统20使用。客户端10可展示语音数据导入界面,具有语音对话质检需求的用户B可通过该语音数据导入界面将需要检测的目标语音数据导入客户端10。客户端10可将导入的目标语音数据发送给智能客户服务质检系统20,并根据用户B输入的语音对话检测触发指令,向智能客户服务质检系统20发送检测目标语音数据中存在插抢话行为的对话语句的请求。The client 10 can send the quality inspection rules configured by developer A to the intelligent customer service quality inspection system 20 for use by the intelligent customer service quality inspection system 20 . The client 10 can display a voice data import interface. User B, who has voice dialogue quality inspection requirements, can import the target voice data that needs to be detected into the client 10 through the voice data import interface. The client 10 can send the imported target voice data to the intelligent customer service quality inspection system 20, and according to the voice dialogue detection triggering instruction input by user B, send the detection target voice data to the intelligent customer service quality inspection system 20 to detect the presence of interfering calls. A request for a conversational statement of behavior.
智能客户服务质检系统20可以包括一台服务器(Server)或者由多台服务器组成的服务器集群(Cluster)。智能客户服务质检系统20可基于预先配置的质检规则,执行上述本申请实施例所揭示的语音对话检测方法,以确定目标语音数据中存在插抢话行为的对话语句,并将检测结果返回给客户端10。客户端10将检测结果展示给用户B,使得用户B可以基于检测结果采取相应 措施提升客户服务质量。The intelligent customer service quality inspection system 20 may include a server (Server) or a server cluster (Cluster) composed of multiple servers. The intelligent customer service quality inspection system 20 can execute the voice dialogue detection method disclosed in the above embodiments of the present application based on pre-configured quality inspection rules to determine whether there are dialogue statements that interrupt the conversation in the target voice data, and return the detection results. Give the client 10. The client 10 displays the detection results to user B, so that user B can take corresponding measures to improve customer service quality based on the detection results.
Specifically, the intelligent customer service quality inspection system 20 can obtain the voice features and dialogue-related information (for example, the dialogue start and end times and the speaker role) of each dialogue sentence in the target voice data and, using ASR technology, convert the target voice data into text to obtain the dialogue text of each dialogue sentence. Next, based on rule 4, the system can exclude dialogue sentences with few characters from the target voice data. Then, based on the dialogue-related information and dialogue text of the remaining dialogue sentences, the system performs interruption pre-detection on them according to rule 2 to determine candidate dialogue sentences that may exhibit interruption behavior. The system can then determine, based on the second and third associated dialogue sentences of a candidate dialogue sentence, whether the candidate satisfies the preset exemption condition. If it does, the system can determine that the candidate does not exhibit interruption behavior. If it does not, the system can, based on rule 3, call the emotion recognition model to perform emotion recognition on the candidate and obtain an emotion recognition result, and then determine whether the candidate exhibits interruption behavior based on the emotion recognition result and/or the voice features of the candidate.
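The order of the stages in this workflow can be sketched in Python as follows. This is only an illustration of the control flow under stated assumptions: the function name `run_quality_check` and its four callable parameters are hypothetical stand-ins for rules 4, 2, 1 and 3, and the application does not prescribe any particular programming interface.

```python
from typing import Callable, Iterable, List, TypeVar

U = TypeVar("U")  # any representation of a dialogue sentence


def run_quality_check(
    sentences: Iterable[U],
    long_enough: Callable[[U], bool],        # rule 4: drop sentences with few characters
    pre_detect: Callable[[List[U]], List[U]],  # rule 2: interruption pre-detection
    is_exempt: Callable[[U], bool],          # rule 1: preset exemption condition
    second_pass: Callable[[U], bool],        # rule 3: emotion and/or voice-feature check
) -> List[U]:
    """Return the sentences judged to exhibit interruption behavior."""
    remaining = [s for s in sentences if long_enough(s)]
    candidates = pre_detect(remaining)
    flagged = []
    for candidate in candidates:
        if is_exempt(candidate):
            continue  # exempt candidates are treated as non-interruptions
        if second_pass(candidate):
            flagged.append(candidate)
    return flagged
```

Passing the rules in as callables keeps the ordering explicit: short sentences are removed first, exemption is checked before the more expensive secondary detection, and only non-exempt candidates reach the emotion or voice-feature check.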
In addition, corresponding to the voice dialogue detection method described above, the embodiments of the present application also provide a voice dialogue detection apparatus. Referring to FIG. 7, the voice dialogue detection apparatus 700 may include a first determination module 710 and a second determination module 730.
The first determination module 710 may perform interruption pre-detection on multiple dialogue sentences in the target voice data, based on the dialogue-related information and dialogue text of each, to determine one or more candidate dialogue sentences that may exhibit interruption behavior. The target voice data may include dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and the speaker role.
For each candidate dialogue sentence determined by the first determination module 710, the second determination module 730 may determine whether the candidate exhibits interruption behavior based on at least one of an emotion recognition result, obtained by performing emotion recognition on the candidate with an emotion recognition model, and the voice features of the candidate.
In one embodiment, the emotion recognition result includes a negative emotion value of the candidate dialogue sentence, and the voice features of the candidate dialogue sentence include a volume change value of the candidate dialogue sentence relative to a first associated dialogue sentence, where the speaker role of the first associated dialogue sentence is the same as that of the candidate dialogue sentence.
In this case, the second determination module 730 may include a first interruption judgment sub-module. If the negative emotion value of the candidate dialogue sentence exceeds a preset emotion threshold, or the volume change value exceeds a preset volume value, the first interruption judgment sub-module may determine that the candidate dialogue sentence exhibits interruption behavior.
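The threshold test performed by the first interruption judgment sub-module can be illustrated with a short Python sketch. The function name `has_interruption`, the default threshold values, and the units of the volume change are assumptions made for illustration; the application only states that either value exceeding its preset threshold is sufficient.

```python
from typing import Optional


def has_interruption(
    negative_emotion: Optional[float] = None,
    volume_change: Optional[float] = None,
    emotion_threshold: float = 0.6,   # hypothetical preset emotion threshold
    volume_threshold: float = 6.0,    # hypothetical preset volume value
) -> bool:
    """Flag a candidate if either available signal exceeds its threshold.

    Either input may be None, reflecting that the decision may rest on
    at least one of the emotion result and the voice feature.
    """
    emotion_hit = negative_emotion is not None and negative_emotion > emotion_threshold
    volume_hit = volume_change is not None and volume_change > volume_threshold
    return emotion_hit or volume_hit
```

For example, `has_interruption(negative_emotion=0.8, volume_change=1.0)` is true because the emotion value alone exceeds its threshold, while a candidate with neither signal above threshold is not flagged.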
In one embodiment, the voice dialogue detection apparatus 700 may further include an exemption identification module.
For each candidate dialogue sentence determined by the first determination module 710, the exemption identification module may determine, based on a second associated dialogue sentence and a third associated dialogue sentence of the candidate, whether the candidate satisfies a preset exemption condition. The speaker role of the second associated dialogue sentence differs from that of the candidate dialogue sentence, and the speaker role of the third associated dialogue sentence is the same as that of the candidate dialogue sentence.
When the exemption identification module determines that the candidate dialogue sentence satisfies the preset exemption condition, the second determination module 730 may directly determine that the candidate does not exhibit interruption behavior. When the exemption identification module determines that the candidate does not satisfy the preset exemption condition, the second determination module 730 may determine whether the candidate exhibits interruption behavior based on the emotion recognition result for the candidate and/or the voice features of the candidate.
In one embodiment, the preset exemption condition includes that the intent of the second associated dialogue sentence is to end the dialogue, and that the matching degree value between the dialogue text of the third associated dialogue sentence and a preset end-of-dialogue text exceeds a first preset degree threshold.
In this case, the exemption identification module may include an intent recognition sub-module, a matching sub-module, and an exemption identification sub-module.
The intent recognition sub-module may perform intent recognition on the second associated dialogue sentence, based on an intent recognition model and the dialogue text of the second associated dialogue sentence, to obtain an intent recognition result, where the start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence.
The matching sub-module may match the dialogue text of the third associated dialogue sentence against the preset end-of-dialogue text to obtain a first matching result, where the start time of the third associated dialogue sentence falls between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
The exemption identification sub-module may determine whether the candidate dialogue sentence satisfies the preset exemption condition based on the intent recognition result obtained by the intent recognition sub-module and the first matching result obtained by the matching sub-module.
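A minimal Python sketch of the exemption check follows. Several parts are assumptions rather than details taken from the application: the intent label `"end_conversation"` stands in for the output of the intent recognition model, the phrases in `END_PHRASES` are hypothetical preset end-of-dialogue texts, and the matching degree is approximated with the standard library's `difflib.SequenceMatcher`, since the application does not specify how the matching degree value is computed.

```python
from difflib import SequenceMatcher

# Hypothetical preset end-of-dialogue texts.
END_PHRASES = ("goodbye", "thank you, bye")


def match_degree(text: str, phrases=END_PHRASES) -> float:
    """Best similarity ratio (0.0 to 1.0) between the text and any preset phrase."""
    cleaned = text.strip().lower()
    return max(SequenceMatcher(None, cleaned, phrase).ratio() for phrase in phrases)


def is_exempt(second_intent: str, third_text: str, threshold: float = 0.8) -> bool:
    """Exempt only if the other speaker intended to end the dialogue AND the
    same speaker's earlier sentence closely matches a closing phrase."""
    return second_intent == "end_conversation" and match_degree(third_text) > threshold
```

Both conditions must hold: a closing phrase from the candidate's speaker does not exempt the candidate unless the other party's preceding sentence was also recognized as ending the dialogue.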
In one embodiment, the first determination module 710 may include an overlap duration determination sub-module and a candidate dialogue sentence determination sub-module.
Suppose a first dialogue sentence and a second dialogue sentence are any two adjacent dialogue sentences in the target voice data, their speaker roles differ, and the start time of the second dialogue sentence is after the start time and before the end time of the first dialogue sentence. In that case, the overlap duration determination sub-module may determine the overlap duration between the two based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence.
If the overlap duration determined by the overlap duration determination sub-module exceeds a preset duration, or the number of characters in the dialogue text of the second dialogue sentence exceeds a preset number of characters, the candidate dialogue sentence determination sub-module may determine the second dialogue sentence to be a candidate dialogue sentence that may exhibit interruption behavior.
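The pre-detection step can be sketched in Python as follows. The `Utterance` class, the function names, and the default thresholds are hypothetical. The overlap is computed here as the earlier of the two end times minus the second sentence's start time, which is one natural reading of "based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    role: str     # speaker role, e.g. "agent" or "customer"
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # dialogue text obtained from ASR


def overlap_duration(first: Utterance, second: Utterance) -> float:
    """Overlap when `second` starts while `first` is still being spoken."""
    return min(first.end, second.end) - second.start


def find_candidates(
    utterances: List[Utterance],
    max_overlap: float = 1.0,  # hypothetical preset duration, in seconds
    max_chars: int = 5,        # hypothetical preset number of characters
) -> List[Utterance]:
    """Return second sentences of adjacent pairs that may be interruptions."""
    ordered = sorted(utterances, key=lambda u: u.start)
    candidates = []
    for first, second in zip(ordered, ordered[1:]):
        if first.role == second.role:
            continue  # only different speaker roles are considered
        if not (first.start < second.start < first.end):
            continue  # second must start while first is in progress
        if overlap_duration(first, second) > max_overlap or len(second.text) > max_chars:
            candidates.append(second)
    return candidates
```

With an agent sentence spanning 0.0 to 5.0 seconds and a customer sentence spanning 3.0 to 6.0 seconds, the overlap is 2.0 seconds, so the customer sentence becomes a candidate under the assumed 1.0-second threshold even if its text is short.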
In one embodiment, the voice dialogue detection apparatus 700 may further include a third determination module and a deletion module.
Before the first determination module 710 performs interruption pre-detection on the multiple dialogue sentences in the target voice data, the third determination module may determine, for each dialogue sentence, whether its dialogue text contains a preset word. If the third determination module determines that the dialogue text of a dialogue sentence contains a preset word, the deletion module may delete the preset word from that dialogue text.
In one embodiment, the third determination module may include a word segmentation sub-module and a matching sub-module.
The word segmentation sub-module may perform word segmentation on the dialogue text of the dialogue sentence to obtain the words it contains. The matching sub-module may match each of those words against the preset words in a preset word library to determine whether the dialogue text contains a preset word.
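The segmentation and matching steps, followed by deletion, might look like the Python sketch below. It is illustrative only: the word library `PRESET_WORDS` is a hypothetical example, and whitespace splitting stands in for a real word segmenter, which Chinese dialogue text would require.

```python
# Hypothetical preset word library; the actual contents are configuration-specific.
PRESET_WORDS = {"um", "uh", "er"}


def segment(text: str) -> list:
    """Stand-in word segmentation: split on whitespace (an assumption)."""
    return text.split()


def remove_preset_words(text: str, preset_words=PRESET_WORDS) -> str:
    """Drop every word that matches the preset library, ignoring case and
    surrounding punctuation, and rejoin the remaining words."""
    kept = [
        word for word in segment(text)
        if word.lower().strip(",.!?") not in preset_words
    ]
    return " ".join(kept)
```

Removing such words before pre-detection means they do not count toward the character total that is compared with the preset number of characters.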
In one embodiment, the third determination module may include a word determination sub-module.
The word determination sub-module may input the dialogue text of the dialogue sentence into a pre-trained word recognition model to obtain a word recognition result for that dialogue text. The word recognition result indicates whether the dialogue text contains a preset word. The word recognition model is obtained by training on sample text and the word labels of the words in the sample text, where a word label indicates whether the word is a preset word.
In addition, the embodiments of the present application also provide an electronic device. Referring to FIG. 8, the electronic device may include a processor, an internal bus, a network interface, a memory, and so on. The memory may include internal memory, such as high-speed random-access memory (RAM), and may also include non-volatile memory, such as disk storage. The electronic device may of course also include hardware required by other services.
The processor, network interface, and memory may be connected to one another through the internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, FIG. 8 uses a single bidirectional arrow, but this does not mean that there is only one bus or one type of bus.
The memory is used to store a program. Specifically, the program may include program code, and the program code includes computer operating instructions. The memory may include internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into internal memory and then runs it, forming the voice dialogue detection apparatus at the logical level. The processor executes the program stored in the memory to carry out the voice dialogue detection method described above.
The processor may be an integrated circuit chip with signal processing capability. During implementation, each step of the above method may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
The embodiments of the present application also provide a computer-readable storage medium that stores one or more programs. The one or more programs include instructions that, when executed by a processor of an electronic device, cause the electronic device to execute the voice dialogue detection method described above.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a cellular phone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random-access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "includes a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The above describes some embodiments of the present application and is not intended to limit the present application. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of the present application.

Claims (20)

  1. A voice dialogue detection method, comprising:
    performing interruption pre-detection on a plurality of dialogue sentences in target voice data, based on respective dialogue-related information and dialogue text of the plurality of dialogue sentences, to determine one or more candidate dialogue sentences among the plurality of dialogue sentences that may exhibit interruption behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and a speaker role; and
    for each candidate dialogue sentence of the one or more candidate dialogue sentences, determining whether the candidate dialogue sentence exhibits interruption behavior based on at least one of an emotion recognition result, obtained by performing emotion recognition on the candidate dialogue sentence using an emotion recognition model, and a voice feature of the candidate dialogue sentence.
  2. The method according to claim 1, wherein
    the emotion recognition result includes a negative emotion value of the candidate dialogue sentence; and
    determining whether the candidate dialogue sentence exhibits interruption behavior based on at least one of the emotion recognition result and the voice feature comprises: in response to determining that the negative emotion value exceeds a preset emotion threshold, determining that the candidate dialogue sentence exhibits interruption behavior.
  3. The method according to claim 1, wherein
    the voice feature of the candidate dialogue sentence includes a volume change value of the candidate dialogue sentence relative to a first associated dialogue sentence of the candidate dialogue sentence, the speaker role of the first associated dialogue sentence being the same as the speaker role of the candidate dialogue sentence; and
    determining whether the candidate dialogue sentence exhibits interruption behavior based on at least one of the emotion recognition result and the voice feature comprises: in response to determining that the volume change value exceeds a preset volume value, determining that the candidate dialogue sentence exhibits interruption behavior.
  4. The method according to claim 1, wherein determining whether the candidate dialogue sentence exhibits interruption behavior based on at least one of the emotion recognition result and the voice feature comprises:
    in response to determining that the candidate dialogue sentence does not satisfy a preset exemption condition, determining whether the candidate dialogue sentence exhibits interruption behavior based on at least one of the emotion recognition result and the voice feature.
  5. The method according to claim 4, wherein the preset exemption condition includes that the intent of a second associated dialogue sentence of the candidate dialogue sentence is to end the dialogue, and that a matching degree value between the dialogue text of a third associated dialogue sentence of the candidate dialogue sentence and a preset end-of-dialogue text exceeds a first preset degree threshold,
    wherein the speaker role of the second associated dialogue sentence differs from the speaker role of the candidate dialogue sentence, and the start time of the second associated dialogue sentence is before the start time of the candidate dialogue sentence; and the speaker role of the third associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence, and the start time of the third associated dialogue sentence falls between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
  6. The method according to claim 5, wherein
    the intent of the second associated dialogue sentence is obtained by performing intent recognition on the second associated dialogue sentence based on an intent recognition model and the dialogue text of the second associated dialogue sentence.
  7. The method according to claim 1, wherein performing interruption pre-detection on the plurality of dialogue sentences to determine the one or more candidate dialogue sentences comprises: for a first dialogue sentence and a second dialogue sentence that are adjacent among the plurality of dialogue sentences, wherein the first dialogue sentence and the second dialogue sentence have different speaker roles, and the start time of the second dialogue sentence is after the start time of the first dialogue sentence and before the end time of the first dialogue sentence,
    determining an overlap duration between the first dialogue sentence and the second dialogue sentence based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence; and
    in response to determining that the overlap duration exceeds a preset duration or that the number of characters in the dialogue text of the second dialogue sentence exceeds a preset number of characters, determining the second dialogue sentence to be the candidate dialogue sentence.
  8. The method according to claim 1, further comprising: before performing interruption pre-detection on the plurality of dialogue sentences, for each dialogue sentence of the plurality of dialogue sentences,
    determining whether the dialogue text of the dialogue sentence contains a preset word; and
    in response to determining that the dialogue text of the dialogue sentence contains the preset word, deleting the preset word from the dialogue text of the dialogue sentence.
  9. The method according to claim 8, wherein determining whether the dialogue text of the dialogue sentence contains a preset word comprises:
    performing word segmentation on the dialogue text of the dialogue sentence to obtain target words contained in the dialogue text of the dialogue sentence; and
    matching each of the target words against preset words in a preset word library to determine whether the dialogue text of the dialogue sentence contains the preset word.
  10. The method according to claim 8, wherein determining whether the dialogue text of the dialogue sentence contains a preset word comprises:
    inputting the dialogue text of the dialogue sentence into a pre-trained word recognition model to obtain a word recognition result for the dialogue text of the dialogue sentence, the word recognition result indicating whether the dialogue text of the dialogue sentence contains the preset word.
  11. A voice dialogue detection apparatus, comprising:
    a first determination module, configured to perform interruption pre-detection on a plurality of dialogue sentences in target voice data, based on respective dialogue-related information and dialogue text of the plurality of dialogue sentences, to determine one or more candidate dialogue sentences among the plurality of dialogue sentences that may exhibit interruption behavior, wherein the target voice data includes dialogue sentences of speakers with different roles, and the dialogue-related information includes dialogue start and end time information and a speaker role; and
    a second determination module, configured to, for each candidate dialogue sentence of the one or more candidate dialogue sentences, determine whether the candidate dialogue sentence exhibits interruption behavior based on at least one of an emotion recognition result, obtained by performing emotion recognition on the candidate dialogue sentence using an emotion recognition model, and a voice feature of the candidate dialogue sentence.
  12. The apparatus according to claim 11, wherein:
    the emotion recognition result includes a negative emotion value of the candidate dialogue sentence;
    the second determination module includes a first interruption judgment sub-module, configured to determine that the candidate dialogue sentence exhibits interrupting behavior in response to determining that the negative emotion value exceeds a preset emotion threshold.
  13. The apparatus according to claim 11, wherein:
    the voice feature of the candidate dialogue sentence includes a volume change value of the candidate dialogue sentence relative to a first associated dialogue sentence of the candidate dialogue sentence, the speaker role of the first associated dialogue sentence being the same as the speaker role of the candidate dialogue sentence;
    the second determination module includes a first interruption judgment sub-module, configured to determine that the candidate dialogue sentence exhibits interrupting behavior in response to determining that the volume change value exceeds a preset volume value.
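The threshold checks of claims 12 and 13 might be sketched as below. The thresholds, the decibel unit, the function names, and the choice to combine the two criteria with a logical OR are all assumptions made for illustration; the claims only require that the decision rest on at least one of the emotion recognition result and the voice feature, and the emotion recognition model itself is not modeled here.

```python
from typing import Optional

# Hypothetical thresholds; the application does not give concrete values.
EMOTION_THRESHOLD = 0.6    # preset emotion threshold for the negative emotion value
VOLUME_THRESHOLD_DB = 6.0  # preset volume value, assumed here to be in decibels


def volume_change(candidate_db: float, first_associated_db: float) -> float:
    """Volume change of the candidate relative to an earlier sentence by the same speaker role."""
    return candidate_db - first_associated_db


def has_interrupting_behavior(
    negative_emotion: Optional[float] = None,
    volume_change_db: Optional[float] = None,
) -> bool:
    """Return True if either available signal exceeds its preset threshold."""
    if negative_emotion is not None and negative_emotion > EMOTION_THRESHOLD:
        return True
    if volume_change_db is not None and volume_change_db > VOLUME_THRESHOLD_DB:
        return True
    return False
```

Passing `None` for a signal models the case where only one of the two criteria is used, which the "at least one of" wording of claim 11 allows.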
  14. The apparatus according to claim 11, further comprising:
    an exemption identification module, configured to determine, for each candidate dialogue sentence of the one or more candidate dialogue sentences determined by the first determination module, whether the candidate dialogue sentence satisfies a preset exemption condition, based on a second associated dialogue sentence and a third associated dialogue sentence of the candidate dialogue sentence, wherein the speaker role of the second associated dialogue sentence is different from the speaker role of the candidate dialogue sentence, and the speaker role of the third associated dialogue sentence is the same as the speaker role of the candidate dialogue sentence,
    wherein the second determination module is configured to: in response to the exemption identification module determining that the candidate dialogue sentence does not satisfy the preset exemption condition, determine whether the candidate dialogue sentence exhibits interrupting behavior based on at least one of the emotion recognition result and the voice feature.
  15. The apparatus according to claim 14, wherein the preset exemption condition includes that the intent of the second associated dialogue sentence is to end the dialogue, and that the degree of match between the dialogue text of the third associated dialogue sentence and a preset end-of-dialogue text exceeds a first preset degree threshold,
    wherein the start time of the second associated dialogue sentence precedes the start time of the candidate dialogue sentence, and the start time of the third associated dialogue sentence lies between the start time of the second associated dialogue sentence and the start time of the candidate dialogue sentence.
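A minimal sketch of the exemption check in claims 14 and 15 follows. It assumes the intent of the second associated dialogue sentence has already been produced by an intent recognition model and is passed in as a label. The label `"end_dialogue"`, the list `END_DIALOGUE_TEXTS`, the threshold value, and the use of `difflib.SequenceMatcher` as the matching-degree measure are all illustrative assumptions; the application does not specify how the degree of match is computed.

```python
from difflib import SequenceMatcher

# Hypothetical preset end-of-dialogue texts and threshold; not taken from the application.
END_DIALOGUE_TEXTS = ["goodbye", "thank you, goodbye"]
FIRST_DEGREE_THRESHOLD = 0.8


def match_degree(text, presets=END_DIALOGUE_TEXTS):
    """Best similarity ratio (0.0 to 1.0) between the text and any preset end-of-dialogue text."""
    lowered = text.lower()
    return max(SequenceMatcher(None, lowered, preset).ratio() for preset in presets)


def satisfies_exemption(second_intent, third_text, threshold=FIRST_DEGREE_THRESHOLD):
    """Exempt the candidate when the other party was ending the dialogue and the
    candidate's own speaker had already given an end-of-dialogue reply."""
    return second_intent == "end_dialogue" and match_degree(third_text) > threshold
```

Under this sketch, a candidate that satisfies the exemption condition would skip the emotion and volume checks entirely, which matches the conditional wording of claim 14.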
  16. The apparatus according to claim 15, wherein:
    the intent of the second associated dialogue sentence is obtained by performing intent recognition on the second associated dialogue sentence based on an intent recognition model and the dialogue text of the second associated dialogue sentence.
  17. The apparatus according to claim 11, wherein the first determination module includes:
    an overlap duration determination sub-module, configured to determine, for an adjacent first dialogue sentence and second dialogue sentence among the plurality of dialogue sentences, an overlap duration between the first dialogue sentence and the second dialogue sentence, based on the end time of the first dialogue sentence and the start time and end time of the second dialogue sentence, wherein the first dialogue sentence and the second dialogue sentence have different speaker roles, and the start time of the second dialogue sentence is after the start time of the first dialogue sentence and before the end time of the first dialogue sentence;
    a candidate dialogue sentence determination sub-module, configured to determine the second dialogue sentence as the candidate dialogue sentence in response to determining that the overlap duration exceeds a preset duration or that the number of characters contained in the dialogue text of the second dialogue sentence exceeds a preset number of characters.
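The pre-detection of claim 17 might be sketched as follows. The `Sentence` dataclass, the default thresholds, and the specific overlap formula (the earlier of the two end times minus the start time of the second sentence) are assumptions made for illustration; the claim only states which timestamps the overlap duration is based on. Treating "adjacent" as adjacent in start-time order is likewise an interpretation.

```python
from dataclasses import dataclass


@dataclass
class Sentence:
    role: str     # speaker role, e.g. "agent" or "customer"
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # dialogue text


def overlap_duration(first, second):
    """Overlap between two sentences; 0.0 if the claimed preconditions do not hold."""
    if first.role == second.role:
        return 0.0
    if not (first.start < second.start < first.end):
        return 0.0
    return min(first.end, second.end) - second.start


def find_candidates(sentences, preset_duration=1.0, preset_chars=5):
    """Return second sentences that overlap a different-role sentence and exceed
    either the preset overlap duration or the preset character count."""
    ordered = sorted(sentences, key=lambda s: s.start)
    candidates = []
    for first, second in zip(ordered, ordered[1:]):
        overlap = overlap_duration(first, second)
        if overlap <= 0.0:
            continue  # no overlap between different roles: not an interruption candidate
        if overlap > preset_duration or len(second.text) > preset_chars:
            candidates.append(second)
    return candidates
```

A short acknowledgement such as "ok" that overlaps briefly stays below both thresholds and is not flagged, while a longer utterance that cuts across the other speaker becomes a candidate for the second determination stage.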
  18. The apparatus according to claim 11, further comprising:
    a third determination module, configured to determine, for each dialogue sentence of the plurality of dialogue sentences, whether the dialogue text of the dialogue sentence contains a preset word, before the first determination module performs the interruption pre-detection on the plurality of dialogue sentences;
    a deletion module, configured to delete the preset word from the dialogue text of the dialogue sentence in response to the third determination module determining that the dialogue text of the dialogue sentence contains the preset word.
  19. An electronic device, comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to implement the method according to any one of claims 1 to 10 when executing the instructions.
  20. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to implement the method according to any one of claims 1 to 10.
PCT/CN2023/070200 (priority date 2022-04-24, filing date 2023-01-03) Voice dialogue detection method and apparatus WO2023207212A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210451120.8A CN114842849B (en) 2022-04-24 2022-04-24 Voice dialogue detection method and device
CN202210451120.8 2022-04-24

Publications (1)

Publication Number Publication Date
WO2023207212A1 (en) 2023-11-02

Family

ID=82568107

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/070200 WO2023207212A1 (en) 2022-04-24 2023-01-03 Voice dialogue detection method and apparatus

Country Status (2)

Country Link
CN (1) CN114842849B (en)
WO (1) WO2023207212A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114842849B (en) * 2022-04-24 2023-08-08 马上消费金融股份有限公司 Voice dialogue detection method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107204195A (en) * 2017-05-19 2017-09-26 四川新网银行股份有限公司 A kind of intelligent quality detecting method analyzed based on mood
US20200364271A1 (en) * 2017-11-03 2020-11-19 Money Brain Co., Ltd. Method and computer device for providing natural language conversation by providing interjection response in timely manner, and computer-readable recording medium
CN111968679A (en) * 2020-10-22 2020-11-20 深圳追一科技有限公司 Emotion recognition method and device, electronic equipment and storage medium
CN112017629A (en) * 2020-07-15 2020-12-01 马上消费金融股份有限公司 Conversation control method and equipment of voice robot and storage medium
CN112885332A (en) * 2021-01-08 2021-06-01 天讯瑞达通信技术有限公司 Voice quality inspection method, system and storage medium
CN113539275A (en) * 2020-04-22 2021-10-22 北京有限元科技有限公司 Method, apparatus and storage medium for determining dialogs
CN114842849A (en) * 2022-04-24 2022-08-02 马上消费金融股份有限公司 Voice conversation detection method and device

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8675854B2 (en) * 2012-05-01 2014-03-18 Mitel Networks Corporation Multi-modal communications with conferencing and clients
KR102429498B1 (en) * 2017-11-01 2022-08-05 현대자동차주식회사 Device and method for recognizing voice of vehicle
US10783476B2 (en) * 2018-01-26 2020-09-22 Walmart Apollo, Llc System for customized interactions-related assistance
CN111508474B (en) * 2019-08-08 2021-04-06 马上消费金融股份有限公司 Voice interruption method, electronic equipment and storage device
CN111210842B (en) * 2019-12-27 2023-04-28 中移(杭州)信息技术有限公司 Voice quality inspection method, device, terminal and computer readable storage medium
CN111835925A (en) * 2020-06-16 2020-10-27 杭州云嘉云计算有限公司 Off-line voice quality inspection and analysis system for call center
CN111951831A (en) * 2020-08-24 2020-11-17 浙江百应科技有限公司 Method for realizing audio quality inspection based on AI
CN115148205A (en) * 2022-06-23 2022-10-04 鼎富新动力(北京)智能科技有限公司 Voice interaction method, system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114842849A (en) 2022-08-02
CN114842849B (en) 2023-08-08

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23794652

Country of ref document: EP

Kind code of ref document: A1