CN116129903A - Call audio processing method and device
- Publication number: CN116129903A
- Application number: CN202310027081.3A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G10L 15/26: Speech recognition - speech-to-text systems
- G06F 40/35: Handling natural language data - semantic analysis - discourse or dialogue representation
- G06N 3/08: Computing arrangements based on biological models - neural networks - learning methods
- G10L 15/16: Speech classification or search using artificial neural networks
- G10L 15/1815: Speech classification or search using natural language modelling - semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
- G10L 21/028: Voice signal separating using properties of sound source
Abstract
The application provides a call audio processing method and device. The method comprises the following steps: performing channel separation on the call audio and extracting the valid speech segments; after recognizing the corresponding conversation texts, arranging them in chronological order; extracting question-answer pairs based on the role and time sequence corresponding to each conversation text; using a language understanding model to infer and predict on the question-answer pairs, generating the question-answer type and question-answer result of each pair; and finally clustering, by question-answer type, the question-answer results of the pairs with higher question-answer correlation to obtain the key information of the call audio. The whole method analyzes call audio in terms of question-answer pairs, which is closer to the actual application scenario, combines a language understanding model to automatically extract the key information of the call audio, and finally organizes the key information corresponding to the call audio. This can greatly improve operating efficiency, avoid the loss of information from communication conducted outside the intelligent voice system, and facilitate control of overall business progress.
Description
Technical Field
The present disclosure relates to the field of speech synthesis technologies, and in particular, to a method and an apparatus for processing call audio.
Background
Intelligent voice service systems, such as collection systems, telemarketing systems and customer-service SaaS systems, support both human-to-human and human-machine conversations. A human agent can use the intelligent voice system to dial a customer's phone and talk with the customer directly; alternatively, in highly repetitive interaction scenarios such as telemarketing and telephone collection, the intelligent voice customer service carried on the intelligent voice system can communicate with the customer automatically. Human-machine collaboration thus forms a closed loop, which can effectively improve operating efficiency.
To systematically track the progress of each case, after every direct call with a customer a human agent needs to organize the call audio data and record the key information, forming the collection record or sales record of the case. In addition, in actual operation, many calls are made by human agents dialing customers directly outside the intelligent voice system, so the intelligent voice system cannot obtain that communication information in time. The loss of this communication information affects monitoring of overall business progress, so the missing information also has to be supplemented and summarized manually, lowering overall operating efficiency.
Therefore, how to automatically extract key information from call audio has become an urgent problem to be solved.
Disclosure of Invention
The application provides a call audio processing method and device, which can solve the technical problem that existing intelligent voice systems lack a function for automatically extracting key information from call audio and rely on manual organization, resulting in low operating efficiency.
In a first aspect, an embodiment of the present application provides a call audio processing method, where the method includes:
carrying out channel separation on call audio to be processed to obtain mono audio corresponding to each role;
acquiring a plurality of conversation texts corresponding to each piece of mono audio;
arranging all the conversation texts in chronological order based on their positions in the call audio to obtain a conversation text set;
extracting a plurality of question-answer pairs from the conversation text set based on the corresponding roles and time sequences of each conversation text, wherein each question-answer pair comprises a question text and an answer text;
inputting the question text and the answer text into a pre-constructed language understanding model to obtain a question-answer type, a question-answer result and a question-answer correlation probability of the question-answer pair;
and clustering, according to the question-answer types, the question-answer results of all question-answer pairs whose question-answer correlation probability is greater than a preset threshold, to generate a key-information clustering result of the call audio.
With reference to the first aspect, in an implementation manner of the first aspect, the language understanding model includes a first encoding module and a second encoding module, an input end of the first encoding module is used for inputting the question text, and an input end of the second encoding module is used for inputting the answer text;
the first output end of the first coding module is connected with the input end of the first pooling layer, the second output end of the first coding module is connected with the input end of the bidirectional long short-term memory network (BiLSTM) layer, the first output end of the second coding module is connected with the input end of the second pooling layer, and the second output end of the second coding module is also connected with the input end of the BiLSTM layer;
the output end of the BiLSTM layer is connected with the input end of the attention module, the output end of the first pooling layer and the output end of the second pooling layer are all connected with the input end of the full-connection layer, the first output end of the full-connection layer is connected with the input end of the first activation layer, the second output end of the full-connection layer is connected with the input end of the second activation layer, the output end of the first activation layer is used for outputting question-answer types and question-answer results of the question-answer pairs, and the output end of the second activation layer is used for outputting question-answer correlation probabilities of the question-answer pairs.
With reference to the first aspect, in an implementation manner of the first aspect, the first encoding module and the second encoding module are both RoBERTa pre-training models.
With reference to the first aspect, in an implementation manner of the first aspect, the inputting the question text and the answer text into a pre-constructed language understanding model to obtain a question-answer type, a question-answer result and a question-answer correlation probability of the question-answer pair includes:
inputting the question text into the first coding module for coding to obtain a sentence embedding vector and word embedding vectors of the question text;
inputting the answer text into the second coding module for coding to obtain a sentence embedding vector and word embedding vectors of the answer text;
inputting the sentence embedding vector of the question text into the first pooling layer for pooling to obtain a first pooling result;
inputting the sentence embedding vector of the answer text into the second pooling layer for pooling to obtain a second pooling result;
inputting the word embedding vectors of the question text and the word embedding vectors of the answer text into the BiLSTM layer for semantic recognition to obtain an output vector;
inputting the output vector into the attention module to calculate a weight for each position and weight the word vector at each position, obtaining a sentence representation vector;
inputting the first pooling result, the second pooling result and the sentence representation vector into the full-connection layer for splicing to obtain a spliced vector;
inputting the spliced vector into the first activation layer for classification to obtain the question-answer type and question-answer result of the question-answer pair;
and inputting the spliced vector into the second activation layer for correlation prediction to obtain the question-answer correlation probability of the question-answer pair.
With reference to the first aspect, in an implementation manner of the first aspect, the extracting, based on the role and time sequence corresponding to each conversation text, a plurality of question-answer pairs from the conversation text set includes:
classifying each conversation text in the conversation text set by using a preset question/answer judging rule to obtain the question texts and answer texts;
starting from the session initiation time of a question text corresponding to a first role, determining the question text corresponding to the first role and an answer text corresponding to a second role as a question-answer pair, wherein the answer text corresponding to the second role is an answer text initiated by the second role that is located, in time sequence, after the question text corresponding to the first role and before the session initiation time of the next question text corresponding to the first role;
and sequentially extracting a plurality of question-answer pairs from the conversation text set in chronological order from earliest to latest.
With reference to the first aspect, in an implementation manner of the first aspect, the performing channel separation on the call audio to be processed to obtain mono audio corresponding to each role includes:
and extracting the audio of each audio channel in the call audio to be processed by using a preset SoX tool to obtain each mono audio, wherein different mono audio corresponds to different roles.
With reference to the first aspect, in an implementation manner of the first aspect, the acquiring a plurality of conversation texts corresponding to each piece of mono audio includes:
extracting a plurality of non-muted voice fragments in each piece of mono audio;
and acquiring the conversation text corresponding to each voice fragment.
In a second aspect, an embodiment of the present application provides a call audio processing apparatus, applied to an intelligent voice system, where the apparatus includes:
the channel separation module is configured to perform channel separation on the call audio to be processed to obtain mono audio corresponding to each role;
the voice recognition module is configured to acquire a plurality of conversation texts corresponding to each piece of mono audio;
the conversation text sorting module is configured to arrange all the conversation texts in chronological order based on their positions in the call audio to obtain a conversation text set;
a question-answer pair extraction module configured to extract a plurality of question-answer pairs from the conversation text set based on the role and time sequence corresponding to each conversation text, each question-answer pair including a question text and an answer text;
the prediction module is configured to input the question text and the answer text into a pre-constructed language understanding model to obtain a question-answer type, a question-answer result and a question-answer correlation probability of the question-answer pair;
and the key information clustering module is configured to cluster, according to the question-answer types, the question-answer results of all question-answer pairs whose question-answer correlation probability is greater than a preset threshold, generating the key-information clustering result of the call audio.
With reference to the second aspect, in an implementation manner of the second aspect, the question-answer pair extraction module is configured to extract a plurality of question-answer pairs from the conversation text set based on the role and time sequence corresponding to each conversation text, including:
classifying each conversation text in the conversation text set by using a preset question/answer judging rule to obtain the question texts and answer texts;
starting from the session initiation time of a question text corresponding to a first role, determining the question text corresponding to the first role and an answer text corresponding to a second role as a question-answer pair, wherein the answer text corresponding to the second role is an answer text initiated by the second role that is located, in time sequence, after the question text corresponding to the first role and before the session initiation time of the next question text corresponding to the first role;
and sequentially extracting a plurality of question-answer pairs from the conversation text set in chronological order from earliest to latest.
With reference to the second aspect, in an implementation manner of the second aspect, the channel separation module is configured to perform channel separation on call audio to be processed, and obtain mono audio corresponding to each role, where the method includes:
and extracting the audio of each audio channel in the call audio to be processed by using a preset SoX tool to obtain each mono audio, wherein different mono audio corresponds to different roles.
With reference to the second aspect, in an implementation manner of the second aspect, the voice recognition module is configured to acquire a plurality of conversation texts corresponding to each piece of mono audio, including:
extracting a plurality of non-muted voice fragments from each piece of mono audio;
and acquiring the conversation text corresponding to each voice fragment.
With reference to the second aspect, in an implementation manner of the second aspect, the language understanding model includes a first encoding module and a second encoding module, an input end of the first encoding module is used for inputting the question text, and an input end of the second encoding module is used for inputting the answer text;
the first output end of the first coding module is connected with the input end of the first pooling layer, the second output end of the first coding module is connected with the input end of the bidirectional long short-term memory network (BiLSTM) layer, the first output end of the second coding module is connected with the input end of the second pooling layer, and the second output end of the second coding module is also connected with the input end of the BiLSTM layer;
the output end of the BiLSTM layer is connected with the input end of the attention module, the output end of the first pooling layer and the output end of the second pooling layer are all connected with the input end of the full-connection layer, the first output end of the full-connection layer is connected with the input end of the first activation layer, the second output end of the full-connection layer is connected with the input end of the second activation layer, the output end of the first activation layer is used for outputting question-answer types and question-answer results of the question-answer pairs, and the output end of the second activation layer is used for outputting question-answer correlation probabilities of the question-answer pairs.
With reference to the second aspect, in an implementation manner of the second aspect, the first encoding module and the second encoding module are both a RoBERTa pre-training model.
With reference to the second aspect, in an implementation manner of the second aspect, the prediction module is configured to input the question text and the answer text into a pre-constructed language understanding model to obtain a question-answer type, a question-answer result and a question-answer correlation probability of the question-answer pair, including:
inputting the question text into the first coding module for coding to obtain a sentence embedding vector and word embedding vectors of the question text;
inputting the answer text into the second coding module for coding to obtain a sentence embedding vector and word embedding vectors of the answer text;
inputting the sentence embedding vector of the question text into the first pooling layer for pooling to obtain a first pooling result;
inputting the sentence embedding vector of the answer text into the second pooling layer for pooling to obtain a second pooling result;
inputting the word embedding vectors of the question text and the word embedding vectors of the answer text into the BiLSTM layer for semantic recognition to obtain an output vector;
inputting the output vector into the attention module to calculate a weight for each position and weight the word vector at each position, obtaining a sentence representation vector;
inputting the first pooling result, the second pooling result and the sentence representation vector into the full-connection layer for splicing to obtain a spliced vector;
inputting the spliced vector into the first activation layer for classification to obtain the question-answer type and question-answer result of the question-answer pair;
and inputting the spliced vector into the second activation layer for correlation prediction to obtain the question-answer correlation probability of the question-answer pair.
In the call audio processing method described above, based on a human-call scenario, channel separation is performed on the call audio and the valid speech segments are extracted; after the corresponding conversation texts are recognized, they are arranged in chronological order; question-answer pairs are extracted based on the role and time sequence corresponding to each conversation text; a language understanding model is used to infer and predict on the question-answer pairs, generating the question-answer type and question-answer result of each pair; and finally the question-answer results of the pairs with higher question-answer correlation are clustered by question-answer type to obtain the key information of the call audio. The whole method analyzes call audio in terms of question-answer pairs, which is closer to the actual application scenario, combines a language understanding model to automatically extract the key information in the call audio, and finally organizes the key information corresponding to the call audio. This can greatly improve operating efficiency, avoid the loss of information from communication conducted outside the intelligent voice system, and facilitate control of overall business progress.
Drawings
Fig. 1 is a schematic diagram of an application scenario of a call audio processing method provided in an embodiment of the present application;
Fig. 2 is a schematic workflow diagram of a call audio processing method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a language understanding model according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a call audio processing device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an application scenario of the call audio processing method provided in an embodiment of the present application. As shown in Fig. 1, in highly repetitive two-party interaction scenarios such as telemarketing and telephone collection, a human agent can use the intelligent voice system A1 to dial the customer's phone B and communicate with the customer directly, or can dial the customer's phone B from a personal phone A2 outside the intelligent voice system A1.
In this application scenario, after the communication ends, the human agent needs to organize key information such as the collection record or sales information according to the communication content, so that the progress of the related business can be accurately tracked. To improve operating efficiency and prevent important information from being missed, the intelligent voice system needs to be able to extract key information from the call audio automatically to assist the human agent.
To solve the technical problem that existing intelligent voice systems lack a function for automatically extracting key information from call audio and rely solely on manual organization, resulting in low operating efficiency, the application discloses a call audio processing method through the following embodiments. The method is applied to an intelligent voice system, which uses it to automatically generate structured key-information records from call audio and thereby improve working efficiency. Referring to the workflow diagram shown in Fig. 2, the call audio processing method provided in the embodiments of the present application specifically includes the following steps:
101: Performing channel separation on the call audio to be processed to obtain the mono audio corresponding to each role.
In some embodiments, the call audio to be processed may originate from an imported external recording, or from the call recording generated after an agent finishes communicating with a customer through the intelligent voice system.
The call audio may include data of a plurality of parallel audio channels, typically mono or two-channel, and may be in the wav (waveform audio file) format. In the embodiments of the present application, based on the scenario of a human agent communicating with a customer for telephone collection or telemarketing, the call audio to be processed defaults to two-channel audio, whose two channels respectively record the speech of a first role (such as the human agent) and a second role (such as the customer).
There are various ways to perform channel separation on the call audio to be processed. In one embodiment, a preset SoX (Sound eXchange) tool may be used to extract the audio of each audio channel from the call audio to be processed to obtain the mono audio streams, where different mono audio corresponds to different roles. Illustratively, the left-channel and right-channel audio are extracted from the mixed call audio file of a human agent and a customer; the left-channel audio is the human agent's mono audio and the right-channel audio is the customer's mono audio.
The SoX tool is an audio processing tool that can split and merge audio channels and cut, splice, and format-convert audio; the specific SoX tool used is not limited in the embodiments of the present application.
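As an illustrative sketch, this separation step can be scripted by invoking the SoX command line from Python; the file names and the channel-to-role mapping below are assumptions for illustration, not fixed by this embodiment.

```python
import subprocess

def split_channels(stereo_wav: str) -> dict:
    """Split a two-channel call recording into one mono file per role."""
    # Assumed mapping: channel 1 (left) = human agent, channel 2 (right) = customer.
    role_to_channel = {"agent": 1, "customer": 2}
    outputs = {}
    for role, channel in role_to_channel.items():
        mono_path = f"{role}.wav"
        # SoX "remix N" keeps only channel N of the input, yielding mono audio.
        subprocess.run(["sox", stereo_wav, mono_path, "remix", str(channel)],
                       check=True)
        outputs[role] = mono_path
    return outputs
```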
102: Acquiring a plurality of conversation texts corresponding to each piece of mono audio.
In some embodiments, the plurality of conversation texts corresponding to each piece of mono audio may be obtained as follows:
In the first step, a plurality of non-muted speech segments are extracted from each piece of mono audio.
When two roles communicate, usually one role speaks while the other listens, so the mono audio of the listening role contains silent segments during those moments; a silent segment here denotes a speech segment whose energy is below a certain threshold.
In some embodiments, VAD (Voice Activity Detection) segmentation may be performed on each piece of mono audio to obtain multiple speech segments, and the non-muted speech segments are then selected from them. In particular, VAD segmentation yields the start time and end time of each speech segment.
There are various ways to perform VAD segmentation, for example: framing the mono audio and detecting frame by frame whether the energy of each frame signal is greater than a preset threshold; if the energy of a frame signal is greater than the preset threshold, the frame is a non-silent valid frame. Consecutive valid frames constitute a non-muted speech segment, and the non-muted speech segments are separated by silent segments. The embodiments of the present application do not specifically limit the VAD segmentation manner.
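A minimal energy-based VAD along these lines might look as follows; the frame length and energy threshold are illustrative values that would be tuned in practice.

```python
import numpy as np

def energy_vad(samples: np.ndarray, sample_rate: int,
               frame_ms: int = 25, energy_threshold: float = 1e-4) -> list:
    """Return (start_sec, end_sec) tuples for the non-muted segments."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    # A frame is a valid (non-silent) frame if its mean energy exceeds the threshold.
    voiced = [
        float(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2)) > energy_threshold
        for i in range(n_frames)
    ]
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i                      # a non-muted segment begins
        elif not v and start is not None:  # a silent frame ends the segment
            segments.append((start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            start = None
    if start is not None:
        segments.append((start * frame_len / sample_rate,
                         n_frames * frame_len / sample_rate))
    return segments
```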
In the second step, the conversation text corresponding to each speech segment is acquired.
Specifically, automatic speech recognition (ASR) technology may be used to convert a speech segment into the corresponding conversation text, such as "Hello, is this xxx?" or "Are you currently employed?". The start time and end time of a speech segment in the call audio to be processed are also the start time and end time of its corresponding conversation text in the call audio. There are many automatic speech recognition techniques, and the embodiments of the present application do not specifically limit them.
In other embodiments, a third-party ASR service may be invoked to process each piece of mono audio and directly output the conversation texts corresponding to that mono audio together with their start and end times in the call audio to be processed. The start time and end time of a conversation text in the call audio to be processed are equivalent to its start and end time in the corresponding mono audio.
103: Arranging all the conversation texts in chronological order based on their positions in the call audio to obtain a conversation text set.
Specifically, after the conversation texts corresponding to each piece of mono audio are acquired, all the conversation texts of the two mono audio streams may be sorted together from earliest to latest. Illustratively, the conversation text set includes "Hello, is this xxx?", "Yes, it's me", "Can you handle YYY now?", "I'm busy now, no time", ..., "OK, bye".
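For illustration, merging and ordering the per-channel ASR outputs can be as simple as a tagged sort by start time; the dict schema with "start"/"end" timestamps is an assumption.

```python
def merge_transcripts(agent_texts: list, customer_texts: list) -> list:
    """Combine the per-role conversation texts into one chronological set.

    Each item is assumed to be a dict like {"text": ..., "start": ..., "end": ...},
    with times measured in the original call audio.
    """
    for t in agent_texts:
        t["role"] = "agent"
    for t in customer_texts:
        t["role"] = "customer"
    # The position of each conversation text in the call audio is its start time.
    return sorted(agent_texts + customer_texts, key=lambda t: t["start"])
```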
104: Extracting a plurality of question-answer pairs from the conversation text set based on the role and time sequence corresponding to each conversation text.
Wherein each question-answer pair includes a question text and an answer text.
Based on the application scenario of the embodiments of the present application, each conversation text corresponds to one role, and the two roles communicate with each other interactively. In some embodiments, a plurality of question-answer pairs may be extracted from the conversation text set as follows:
In the first step, each conversation text in the conversation text set is classified by using a preset question/answer judging rule to obtain the question texts and answer texts.
Specifically, the preset question/answer judging rule determines which conversation texts are questions and which are answers. For example, conversation text 1 "Hello, is this xxx?" is a question and is classified as a question text, while conversation text 2 "Yes, it's me" is an answer and is classified as an answer text.
In the second step, starting from the session initiation time of a question text corresponding to the first role, the question text corresponding to the first role and the answer text corresponding to the second role are determined as a question-answer pair.
The first role may be the human agent and the second role may be the customer. The answer text corresponding to the second role is an answer text initiated by the second role that is located, in time sequence, after the question text corresponding to the first role and before the session initiation time of the next question text corresponding to the first role.
That is, in an interaction a question text is usually raised first and an answer text then responds to it, and the question text and answer text belong to different roles; therefore, the question text corresponding to the first role and the answer text with which the second role responds are determined as a question-answer pair.
Illustratively, the human agent's conversation text "Hello, is this xxx?" and the customer's responding conversation text "I don't know him" may form a question-answer pair.
It should be noted that, as a special case, after the first role raises a question, the second role may pause mid-reply (for example, to think) and then resume; the same reply is then split into two non-muted speech segments during VAD segmentation, so two conversation texts are also produced during recognition. When extracting question-answer pairs, if it is detected that the next conversation text is still an answer text corresponding to the second role, the two answer texts of the second role can be merged and treated as a single reply to the question text of the first role. This can be determined flexibly based on experience and actual conditions.
In the third step, a plurality of question-answer pairs are sequentially extracted from the conversation text set in chronological order from earliest to latest, as in the sketch below.
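A sketch of this rule-based pairing under the stated assumptions; the `is_question` predicate stands in for the preset question/answer judging rule, whose details this embodiment does not fix.

```python
def extract_qa_pairs(conversation_texts: list, is_question) -> list:
    """Pair each agent question with the customer answers that follow it,
    up to the session initiation time of the agent's next question."""
    pairs, question, answers = [], None, []
    for t in conversation_texts:  # already in chronological order
        if t["role"] == "agent" and is_question(t["text"]):
            if question is not None and answers:
                # Consecutive customer replies are merged into a single answer.
                pairs.append((question, " ".join(answers)))
            question, answers = t["text"], []
        elif t["role"] == "customer" and question is not None:
            answers.append(t["text"])
    if question is not None and answers:
        pairs.append((question, " ".join(answers)))
    return pairs
```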
In other embodiments, question-answer pairs may instead be extracted with a deep learning method, for example automatically by a pre-trained recognition model; the embodiments of the present application do not specifically limit the extraction manner of the question-answer pairs.
105: Inputting the question text and the answer text into a pre-constructed language understanding model to obtain the question-answer type, question-answer result and question-answer correlation probability of the question-answer pair.
Fig. 3 is a schematic structural diagram of the language understanding model provided in an embodiment of the present application. As shown in Fig. 3, in some embodiments, the language understanding model includes a first encoding module 310, a second encoding module 320, a first pooling layer 330, a second pooling layer 340, a BiLSTM layer 350, an attention module 360, a full-connection layer 370, a first activation layer 380 and a second activation layer 390. The input of the first encoding module is used for inputting the question text, and the input of the second encoding module is used for inputting the answer text.
A first output of the first encoding module 310 is connected to an input of the first pooling layer 330; a second output of the first encoding module 310 is connected to an input of the BiLSTM (bidirectional long short-term memory network) layer 350; a first output of the second encoding module 320 is connected to an input of the second pooling layer 340; and a second output of the second encoding module 320 is also connected to an input of the BiLSTM layer 350. The output of the BiLSTM layer 350 is connected to the input of the attention module 360, and the outputs of the first pooling layer 330 and the second pooling layer 340 are connected to the input of the full-connection layer 370. The first output of the full-connection layer 370 is connected to the input of the first activation layer 380, and the second output of the full-connection layer 370 is connected to the input of the second activation layer 390; the first activation layer 380 outputs the question-answer type and question-answer result of the question-answer pair, and the second activation layer 390 outputs the question-answer correlation probability of the question-answer pair.
In some embodiments, the first encoding module 310 and the second encoding module 320 may each be a RoBERTa pre-training model.
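For concreteness, a sketch of the Fig. 3 architecture in PyTorch follows; it is an illustrative reading, not the authors' implementation. The mean-pooling choice, the hidden sizes, the joint type-result label space, and the RoBERTa checkpoint name are all assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class QALanguageUnderstandingModel(nn.Module):
    """Illustrative rendering of the dual-encoder architecture in Fig. 3."""

    def __init__(self, num_type_result_classes: int,
                 hidden: int = 768, lstm_hidden: int = 256):
        super().__init__()
        # Two RoBERTa encoders; the checkpoint name is an assumption.
        self.q_encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
        self.a_encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")
        self.bilstm = nn.LSTM(hidden, lstm_hidden, batch_first=True,
                              bidirectional=True)
        self.attention = nn.Linear(2 * lstm_hidden, 1)
        self.fc = nn.Linear(2 * hidden + 2 * lstm_hidden, 2 * lstm_hidden)
        # First activation layer: joint QA type + result classes (softmax via CE).
        self.type_head = nn.Linear(2 * lstm_hidden, num_type_result_classes)
        # Second activation layer: QA correlation probability (sigmoid).
        self.corr_head = nn.Linear(2 * lstm_hidden, 1)

    def forward(self, q_inputs: dict, a_inputs: dict):
        q_out = self.q_encoder(**q_inputs)
        a_out = self.a_encoder(**a_inputs)
        # "First output": sentence embedding, realized here as mean pooling.
        q_pooled = q_out.last_hidden_state.mean(dim=1)
        a_pooled = a_out.last_hidden_state.mean(dim=1)
        # "Second output": word embeddings, concatenated front-to-back for the BiLSTM.
        tokens = torch.cat([q_out.last_hidden_state, a_out.last_hidden_state], dim=1)
        lstm_out, _ = self.bilstm(tokens)
        # Attention: a weight per position, then a weighted sum -> sentence vector.
        weights = torch.softmax(self.attention(lstm_out), dim=1)
        sentence_vec = (weights * lstm_out).sum(dim=1)
        # Full-connection layer over the spliced (concatenated) representations.
        joint = self.fc(torch.cat([q_pooled, a_pooled, sentence_vec], dim=-1))
        type_logits = self.type_head(joint)
        corr_prob = torch.sigmoid(self.corr_head(joint)).squeeze(-1)
        return type_logits, corr_prob
```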
After the architecture is built, the language understanding model provided in the embodiments of the application can be trained with pre-collected annotated training data consisting of paired texts and double labels. For example, one training sample may have the question text "May I ask, are you XX?", the answer text "I don't know him", the annotated question-answer type and result "identity confirmation - not the person", and the annotated question-answer correlation probability "1". Another sample may have the question text "Can you handle part of it?", the answer text "What did you say?", the annotated question-answer type and result "UNKNOWN", and the annotated question-answer correlation probability "0". The cross-entropy loss function used in training can be set as:
loss = a * loss_qa-type + b * loss_qa-related
where loss is the total loss, a is the question-answer type loss weight coefficient, loss_qa-type is the question-answer type loss, b is the question-answer correlation loss weight coefficient, and loss_qa-related is the question-answer correlation loss. Typically, a may be set to 0.8 and b to 0.2.
The built language understanding model is then iterated and evaluated on the annotated paired-text, double-label training data until convergence, completing the training.
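Under the same assumptions as the sketch above, the weighted loss can be written as follows, with binary cross-entropy serving as the cross-entropy term for the correlation head.

```python
import torch.nn.functional as F

def combined_loss(type_logits, corr_prob, type_labels, corr_labels,
                  a: float = 0.8, b: float = 0.2):
    # loss = a * loss_qa-type + b * loss_qa-related, with the weights
    # a = 0.8 and b = 0.2 suggested above.
    loss_type = F.cross_entropy(type_logits, type_labels)
    loss_corr = F.binary_cross_entropy(corr_prob, corr_labels.float())
    return a * loss_type + b * loss_corr
```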
Further, after the question text and the answer text are input into the language understanding model, the following inference process may be performed:
in the first step, the question text is input to the first encoding module 310 to be encoded, so as to obtain sentence embedded vectors and word embedded vectors of the question text. The answer text is input to the second encoding module 320 for encoding, and sentence-embedded vectors and word-embedded vectors of the answer text are obtained.
In the second step, sentence-embedded vectors of the question text are input into the first pooling layer 330 for pooling, so as to obtain a first pooling result. And inputting sentence embedded vectors of the answer text into a second pooling layer 340 for pooling processing to obtain a second pooling result.
Third, the word-embedded vector of the question text and the word-embedded vector of the answer text are input into the BiLSTM layer 350 for semantic recognition to obtain an output vector. The word embedded vector of the question text and the word embedded vector of the answer text are input to the BiLSTM layer 350 after being connected in front-to-back.
Fourth, the output vector is input to the attention module 360 for weighting the weight of each position and the vectors of the words at each position to obtain sentence representation vectors.
And fifthly, inputting the first pooling result, the second pooling result and the sentence representation vector into the full connection layer 370 for stitching to obtain a stitching vector.
And sixthly, inputting the spliced vector into the first activation layer 380 for classification to obtain the question-answer type and the question-answer result of the question-answer pair. The splice vector is input to the second activation layer 390 for correlation prediction to obtain question-answer correlation probabilities for question-answer pairs.
Illustratively, for the input question text "About the loan, can you repay it now?" and answer text "I cannot handle it at present", the question-answer type and result finally output by the language understanding model are "repayment willingness - refuses to repay", and the question-answer correlation probability is 0.85.
106: Clustering, according to the question-answer types, the question-answer results of all question-answer pairs whose question-answer correlation probability is greater than a preset threshold, to generate the key-information clustering result of the call audio.
The preset threshold can be set according to needs and actual conditions, for example 0.8; this is not specifically limited in the embodiments of the present application.
After each question-answer pair has been inferred and recognized by the language understanding model, clustering can be performed according to the inference result of each question-answer pair to finally generate the key-information clustering result of the call audio. The question-answer types can serve as the key information fields of the call audio and can be preset in the language understanding model parameters as required. In scenarios such as telemarketing and telephone collection, the key-information clustering result of the call audio amounts to a summary of the sales points or collection points.
For example, the generated key-information clustering result of the call audio may be "connected: yes; is the person: yes; repayment willingness: refuses to repay; answering attitude: mild; ...".
In this way, the key-information clustering result of the call audio is structured information, which facilitates comparison and monitoring across different cases; a sketch of this clustering step follows.
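A minimal sketch of this final step, assuming each model output is a (qa_type, qa_result, correlation_prob) tuple:

```python
from collections import defaultdict

def cluster_key_info(predictions: list, threshold: float = 0.8) -> dict:
    """Group QA results by QA type, keeping only pairs whose correlation
    probability exceeds the preset threshold."""
    clusters = defaultdict(list)
    for qa_type, qa_result, corr_prob in predictions:
        if corr_prob > threshold:
            clusters[qa_type].append(qa_result)
    # e.g. {"repayment willingness": ["refuses to repay"], ...}
    return dict(clusters)
```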
The embodiments of the application thus provide a call audio processing method that, based on a human-call scenario, performs channel separation on the call audio, extracts the valid speech segments, recognizes the corresponding conversation texts and arranges them in chronological order, extracts question-answer pairs based on the role and time sequence corresponding to each conversation text, uses a language understanding model to infer and predict on the question-answer pairs to generate the question-answer type and question-answer result of each pair, and finally clusters, by question-answer type, the question-answer results of the pairs with higher question-answer correlation to obtain the key information of the call audio. The whole method analyzes call audio in terms of question-answer pairs, which is closer to the actual application scenario, combines a language understanding model to automatically extract the key information in the call audio, and finally organizes the key information corresponding to the call audio. This can greatly improve operating efficiency, avoid the loss of information from communication conducted outside the intelligent voice system, and facilitate control of overall business progress.
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Fig. 4 is a schematic structural diagram of a call audio processing device according to an embodiment of the present application. As shown in Fig. 4, the device provided in the embodiments of the present application is applied to an intelligent voice system and has the function of implementing the call audio processing method above; the function may be implemented by hardware, or by hardware executing corresponding software. The apparatus may include: a channel separation module 401, a speech recognition module 402, a conversation text sorting module 403, a question-answer pair extraction module 404, a prediction module 405 and a key information clustering module 406. Wherein:
the channel separation module 401 is configured to perform channel separation on the call audio to be processed, and obtain mono audio corresponding to each role.
The speech recognition module 402 is configured to obtain a plurality of conversation texts corresponding to each monaural audio.
The conversation text sorting module 403 is configured to arrange all the conversation texts in chronological order based on their positions in the call audio to obtain a conversation text set.
The question-answer pair extraction module 404 is configured to extract a plurality of question-answer pairs from the set of conversation texts based on the corresponding roles and time sequence of each conversation text, each question-answer pair including a question text and an answer text.
The prediction module 405 is configured to input the question text and the answer text into a pre-constructed language understanding model, and obtain a question-answer type, a question-answer result and a question-answer relevance probability of the question-answer pair.
The key information clustering module 406 is configured to cluster, according to the question-answer types, the question-answer results of all question-answer pairs whose question-answer correlation probability is greater than a preset threshold, generating the key-information clustering result of the call audio.
In one implementation, the question-answer pair extraction module 404 is configured to extract a plurality of question-answer pairs from the conversation text set based on the role and time sequence corresponding to each conversation text, including:
classifying each conversation text in the conversation text set by using a preset question/answer judging rule to obtain the question texts and answer texts;
starting from the session initiation time of a question text corresponding to the first role, determining the question text corresponding to the first role and the answer text corresponding to the second role as a question-answer pair, where the answer text corresponding to the second role is an answer text initiated by the second role that is located, in time sequence, after the question text corresponding to the first role and before the session initiation time of the next question text corresponding to the first role;
and sequentially extracting a plurality of question-answer pairs from the conversation text set in chronological order from earliest to latest.
In one implementation manner, the channel separation module 401 is configured to perform channel separation on call audio to be processed, and obtain mono audio corresponding to each role, including:
and extracting the audio of each audio channel in the call audio to be processed by using a preset SoX tool to obtain each mono audio, wherein different mono audio corresponds to different roles.
In one implementation, the speech recognition module 402 is configured to acquire a plurality of conversation texts corresponding to each piece of mono audio, including:
extracting a plurality of non-muted speech segments from each piece of mono audio;
and acquiring the conversation text corresponding to each speech segment.
In one implementation, the language understanding model includes a first encoding module and a second encoding module, an input of the first encoding module is used for inputting the question text, and an input of the second encoding module is used for inputting the answer text.
The first output end of the first coding module is connected with the input end of the first pooling layer, the second output end of the first coding module is connected with the input end of the bidirectional long short-term memory network (BiLSTM) layer, the first output end of the second coding module is connected with the input end of the second pooling layer, and the second output end of the second coding module is also connected with the input end of the BiLSTM layer.
The output end of the BiLSTM layer is connected with the input end of the attention module, the output end of the first pooling layer and the output end of the second pooling layer are connected with the input end of the full-connection layer, the first output end of the full-connection layer is connected with the input end of the first activation layer, the second output end of the full-connection layer is connected with the input end of the second activation layer, the output end of the first activation layer is used for outputting question-answer types and question-answer results of question-answer pairs, and the output end of the second activation layer is used for outputting question-answer correlation probabilities of the question-answer pairs.
In one implementation, the first encoding module and the second encoding module are both RoBERTa pre-training models.
In one implementation, the prediction module 405 is configured to input the question text and the answer text into a pre-constructed language understanding model to obtain the question-answer type, question-answer result and question-answer correlation probability of the question-answer pair, including:
inputting the question text into the first coding module for coding to obtain the sentence embedding vector and word embedding vectors of the question text;
inputting the answer text into the second coding module for coding to obtain the sentence embedding vector and word embedding vectors of the answer text;
inputting the sentence embedding vector of the question text into the first pooling layer for pooling to obtain a first pooling result;
inputting the sentence embedding vector of the answer text into the second pooling layer for pooling to obtain a second pooling result;
inputting the word embedding vectors of the question text and the word embedding vectors of the answer text into the BiLSTM layer for semantic recognition to obtain an output vector;
inputting the output vector into the attention module to calculate a weight for each position and weight the word vector at each position, obtaining a sentence representation vector;
inputting the first pooling result, the second pooling result and the sentence representation vector into the full-connection layer for splicing to obtain a spliced vector;
inputting the spliced vector into the first activation layer for classification to obtain the question-answer type and question-answer result of the question-answer pair;
and inputting the spliced vector into the second activation layer for correlation prediction to obtain the question-answer correlation probability of the question-answer pair.
In one implementation, the key information clustering module 406 is configured to perform clustering according to the inference result of each question-answer pair after the language understanding model has inferred and recognized each question-answer pair, finally generating the key-information clustering result of the call audio. The question-answer types can serve as the key information fields of the call audio and can be preset in the language understanding model parameters as required.
In this way, the call audio processing device provided in the embodiments of the application, based on a human-call scenario, performs channel separation on the call audio, extracts the valid speech segments, recognizes the corresponding conversation texts and arranges them in chronological order, extracts question-answer pairs based on the role and time sequence corresponding to each conversation text, uses a language understanding model to infer and predict on the question-answer pairs to generate the question-answer type and question-answer result of each pair, and finally clusters, by question-answer type, the question-answer results of the pairs with higher question-answer correlation to obtain the key information of the call audio. The whole device analyzes call audio in terms of question-answer pairs, which is closer to the actual application scenario, combines a language understanding model to automatically extract the key information in the call audio, and finally organizes the key information corresponding to the call audio. This can greatly improve operating efficiency, avoid the loss of information from communication conducted outside the intelligent voice system, and facilitate control of overall business progress.
The foregoing detailed description has been provided for the purposes of illustration in connection with specific embodiments and exemplary examples, but such description is not to be construed as limiting the application. Those skilled in the art will appreciate that various equivalent substitutions, modifications and improvements may be made to the technical solution of the present application and its embodiments without departing from the spirit and scope of the present application, and these all fall within the scope of the present application. The scope of the application is defined by the appended claims.
Claims (10)
1. A method for processing call audio, the method comprising:
carrying out channel separation on call audio to be processed to obtain mono audio corresponding to each role;
acquiring a plurality of conversation texts corresponding to each piece of mono audio;
arranging all the conversation texts in chronological order based on their positions in the call audio to obtain a conversation text set;
extracting a plurality of question-answer pairs from the conversation text set based on the corresponding roles and time sequences of each conversation text, wherein each question-answer pair comprises a question text and an answer text;
inputting the question text and the answer text into a pre-constructed language understanding model to obtain a question-answer type, a question-answer result and a question-answer correlation probability of the question-answer pair;
and clustering, according to the question-answer types, the question-answer results of all question-answer pairs whose question-answer correlation probability is greater than a preset threshold, to generate a key-information clustering result of the call audio.
2. The method of claim 1, wherein the language understanding model comprises a first encoding module and a second encoding module, an input of the first encoding module for inputting the question text, and an input of the second encoding module for inputting the answer text;
wherein a first output end of the first encoding module is connected to an input end of a first pooling layer, a second output end of the first encoding module is connected to an input end of a bidirectional long short-term memory (BiLSTM) network layer, a first output end of the second encoding module is connected to an input end of a second pooling layer, and a second output end of the second encoding module is also connected to the input end of the BiLSTM layer;
and wherein an output end of the BiLSTM layer is connected to an input end of an attention module; an output end of the attention module, an output end of the first pooling layer, and an output end of the second pooling layer are all connected to an input end of a fully connected layer; a first output end of the fully connected layer is connected to an input end of a first activation layer, and a second output end of the fully connected layer is connected to an input end of a second activation layer; the output end of the first activation layer is used for outputting the question-answer type and the question-answer result of the question-answer pair, and the output end of the second activation layer is used for outputting the question-answer correlation probability of the question-answer pair.
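A minimal PyTorch sketch of this topology is given below; the checkpoint name, hidden sizes, label count, and the treatment of the pooling layers are all illustrative assumptions rather than details taken from the patent (a forward pass matching claim 4 is sketched after that claim).

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class LanguageUnderstandingModel(nn.Module):
    """Sketch of the wiring described in claim 2 (and claim 3)."""

    def __init__(self, num_qa_types, hidden=768, lstm_hidden=256):
        super().__init__()
        # First and second encoding modules (claim 3: RoBERTa).
        self.question_encoder = RobertaModel.from_pretrained("roberta-base")
        self.answer_encoder = RobertaModel.from_pretrained("roberta-base")
        # Shared BiLSTM fed by the word embeddings of both encoders.
        self.bilstm = nn.LSTM(hidden, lstm_hidden,
                              batch_first=True, bidirectional=True)
        # Attention module: scores each BiLSTM output position.
        self.attention = nn.Linear(2 * lstm_hidden, 1)
        # Fully connected layer over the two pooling results plus the
        # attention-weighted sentence representation.
        self.fc = nn.Linear(2 * hidden + 2 * lstm_hidden, hidden)
        # First activation path: question-answer type / result classifier.
        self.type_head = nn.Linear(hidden, num_qa_types)
        # Second activation path: question-answer correlation probability.
        self.correlation_head = nn.Linear(hidden, 1)
```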
3. The method of claim 2, wherein the first encoding module and the second encoding module are each a RoBERTa pre-trained model.
4. The method of claim 2, wherein the inputting the question text and the answer text into the pre-constructed language understanding model to obtain the question-answer type, the question-answer result, and the question-answer correlation probability of the question-answer pair comprises:
inputting the question text into the first encoding module for encoding to obtain a sentence embedding vector and word embedding vectors of the question text;
inputting the answer text into the second encoding module for encoding to obtain a sentence embedding vector and word embedding vectors of the answer text;
inputting the sentence embedding vector of the question text into the first pooling layer for pooling to obtain a first pooling result;
inputting the sentence embedding vector of the answer text into the second pooling layer for pooling to obtain a second pooling result;
inputting the word embedding vectors of the question text and the word embedding vectors of the answer text into the BiLSTM layer for semantic recognition to obtain an output vector;
inputting the output vector into the attention module to calculate a weight for each position and weight the word vector at each position, so as to obtain a sentence representation vector;
inputting the first pooling result, the second pooling result, and the sentence representation vector into the fully connected layer for concatenation to obtain a concatenated vector;
inputting the concatenated vector into the first activation layer for classification to obtain the question-answer type and the question-answer result of the question-answer pair;
and inputting the concatenated vector into the second activation layer for correlation prediction to obtain the question-answer correlation probability of the question-answer pair.
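Continuing the class sketch given after claim 2, a forward pass that follows these steps might look as follows; treating the mean over token embeddings as the pooling operation, and reading the sentence and word embedding vectors as the encoder's token-level hidden states, are both interpretive assumptions.

```python
# Continuation of the LanguageUnderstandingModel sketch from claim 2.
def forward(self, q_ids, q_mask, a_ids, a_mask):
    q_out = self.question_encoder(input_ids=q_ids, attention_mask=q_mask)
    a_out = self.answer_encoder(input_ids=a_ids, attention_mask=a_mask)
    # First / second pooling results (mean pooling is an assumption).
    q_pooled = q_out.last_hidden_state.mean(dim=1)
    a_pooled = a_out.last_hidden_state.mean(dim=1)
    # Word embeddings of question and answer through the shared BiLSTM.
    words = torch.cat([q_out.last_hidden_state,
                       a_out.last_hidden_state], dim=1)
    lstm_out, _ = self.bilstm(words)
    # Attention: a weight per position, then a weighted sum of the
    # position vectors yields the sentence representation vector.
    weights = torch.softmax(self.attention(lstm_out), dim=1)
    sentence_repr = (weights * lstm_out).sum(dim=1)
    # Fully connected layer concatenates the three vectors.
    joint = torch.relu(self.fc(
        torch.cat([q_pooled, a_pooled, sentence_repr], dim=-1)))
    qa_type_logits = self.type_head(joint)                          # first activation
    correlation_prob = torch.sigmoid(self.correlation_head(joint))  # second activation
    return qa_type_logits, correlation_prob
```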
5. The method of claim 1, wherein the extracting a plurality of question-answer pairs from the conversation text set based on the role and time sequence corresponding to each conversation text comprises:
classifying each conversation text in the conversation text set by using a preset question-answer judgment rule to obtain the question texts and the answer texts;
starting from the session initiation time of a question text corresponding to a first role, determining the question text corresponding to the first role and an answer text corresponding to a second role as a question-answer pair, wherein the answer text corresponding to the second role is an answer text initiated by the second role that is located, in time sequence, after the question text corresponding to the first role and before the session initiation time of the next question text corresponding to the first role;
and sequentially extracting a plurality of question-answer pairs from the conversation text set in chronological order of the sessions, from earliest to latest.
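One way to realize this pairing rule is sketched below; the attribute names (`role`, `start_time`, `text`, `is_question`) and the role labels are illustrative, and `is_question` is assumed to have been set beforehand by the preset question-answer judgment rule.

```python
def extract_qa_pairs(conversation_texts,
                     first_role="agent", second_role="customer"):
    """Hypothetical sketch of the pairing rule in claim 5."""
    ordered = sorted(conversation_texts, key=lambda t: t.start_time)
    questions = [t for t in ordered
                 if t.role == first_role and t.is_question]
    pairs = []
    for i, q in enumerate(questions):
        # Window: after this question, before the session initiation
        # time of the first role's next question.
        next_start = (questions[i + 1].start_time
                      if i + 1 < len(questions) else float("inf"))
        answers = [t.text for t in ordered
                   if t.role == second_role and not t.is_question
                   and q.start_time < t.start_time < next_start]
        if answers:
            pairs.append((q.text, " ".join(answers)))
    return pairs
```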
6. The method of claim 1, wherein the performing channel separation on the call audio to be processed to obtain the mono audio corresponding to each role comprises:
extracting the audio of each audio channel in the call audio to be processed by using a preset SoX tool to obtain each piece of mono audio, wherein different pieces of mono audio correspond to different roles.
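For instance, SoX's `remix` effect can pick out a single channel per invocation; the sketch below assumes a two-channel recording and illustrative output file names.

```python
import subprocess

def separate_channels(call_audio_path, num_channels=2):
    """Split an n-channel call recording into per-role mono files
    by shelling out to the SoX command-line tool."""
    mono_paths = []
    for ch in range(1, num_channels + 1):
        out_path = f"role_{ch}.wav"
        # `remix N` keeps only channel N, producing a mono file.
        subprocess.run(
            ["sox", call_audio_path, out_path, "remix", str(ch)],
            check=True)
        mono_paths.append(out_path)
    return mono_paths
```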
7. The method of claim 1, wherein the acquiring a plurality of conversation texts corresponding to each piece of mono audio comprises:
extracting a plurality of non-silent voice fragments from each piece of mono audio;
and acquiring the conversation text corresponding to each voice fragment.
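A simple energy-based detector is one way to find the non-silent fragments; the frame length, the threshold, and the assumption that samples are floats normalized to [-1, 1] are all illustrative choices, not details from the patent.

```python
import numpy as np

def extract_voice_fragments(samples, rate, frame_ms=30, energy_thresh=1e-4):
    """Return the non-silent fragments of a mono signal, found by
    thresholding per-frame mean energy."""
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    voiced = (frames.astype(np.float64) ** 2).mean(axis=1) > energy_thresh
    fragments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * frame_len                  # a fragment opens
        elif not v and start is not None:
            fragments.append(samples[start:i * frame_len])
            start = None                           # the fragment closes
    if start is not None:
        fragments.append(samples[start:])
    return fragments  # each fragment is then sent to speech recognition
```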
8. A call audio processing apparatus for use in an intelligent voice system, the apparatus comprising:
a channel separation module configured to perform channel separation on call audio to be processed to obtain mono audio corresponding to each role;
a speech recognition module configured to acquire a plurality of conversation texts corresponding to each piece of mono audio;
a conversation text sorting module configured to arrange all the conversation texts in chronological order, based on the positions of the conversation texts in the call audio, to obtain a conversation text set;
a question-answer pair extraction module configured to extract a plurality of question-answer pairs from the conversation text set based on the role and time sequence corresponding to each conversation text, each question-answer pair comprising a question text and an answer text;
a prediction module configured to input the question text and the answer text into a pre-constructed language understanding model to obtain a question-answer type, a question-answer result, and a question-answer correlation probability of the question-answer pair;
and a key information clustering module configured to cluster, according to the question-answer types, the question-answer results of all question-answer pairs whose question-answer correlation probability is greater than a preset threshold, and to generate the key information clustering result of the call audio.
9. The apparatus of claim 8, wherein the question-answer pair extraction module being configured to extract a plurality of question-answer pairs from the conversation text set based on the role and time sequence corresponding to each conversation text comprises:
classifying each conversation text in the conversation text set by using a preset question-answer judgment rule to obtain the question texts and the answer texts;
starting from the session initiation time of a question text corresponding to a first role, determining the question text corresponding to the first role and an answer text corresponding to a second role as a question-answer pair, wherein the answer text corresponding to the second role is an answer text initiated by the second role that is located, in time sequence, after the question text corresponding to the first role and before the session initiation time of the next question text corresponding to the first role;
and sequentially extracting a plurality of question-answer pairs from the conversation text set in chronological order of the sessions, from earliest to latest.
10. The apparatus of claim 8, wherein the channel separation module being configured to perform channel separation on the call audio to be processed to obtain the mono audio corresponding to each role comprises:
extracting the audio of each audio channel in the call audio to be processed by using a preset SoX tool to obtain each piece of mono audio, wherein different pieces of mono audio correspond to different roles.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310027081.3A CN116129903A (en) | 2023-01-09 | 2023-01-09 | Call audio processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310027081.3A CN116129903A (en) | 2023-01-09 | 2023-01-09 | Call audio processing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116129903A true CN116129903A (en) | 2023-05-16 |
Family
ID=86298776
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310027081.3A Pending CN116129903A (en) | 2023-01-09 | 2023-01-09 | Call audio processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116129903A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118051582A (en) * | 2024-01-05 | 2024-05-17 | 深圳市六度人和科技有限公司 | Method, device, equipment and medium for identifying potential customers based on telephone voice analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111026843B (en) | Artificial intelligent voice outbound method, system and storage medium | |
CN112289323B (en) | Voice data processing method and device, computer equipment and storage medium | |
CN110136749A (en) | The relevant end-to-end speech end-point detecting method of speaker and device | |
CN107818798A (en) | Customer service quality evaluating method, device, equipment and storage medium | |
CN109065052B (en) | Voice robot | |
CN113239147B (en) | Intelligent session method, system and medium based on graph neural network | |
CN110379441B (en) | Voice service method and system based on countermeasure type artificial intelligence network | |
CN111128241A (en) | Intelligent quality inspection method and system for voice call | |
CN112131359A (en) | Intention identification method based on graphical arrangement intelligent strategy and electronic equipment | |
CN114328867A (en) | Intelligent interruption method and device in man-machine conversation | |
CN112562682A (en) | Identity recognition method, system, equipment and storage medium based on multi-person call | |
CN114818649A (en) | Service consultation processing method and device based on intelligent voice interaction technology | |
CN110602334A (en) | Intelligent outbound method and system based on man-machine cooperation | |
CN110570847A (en) | Man-machine interaction system and method for multi-person scene | |
CN116129903A (en) | Call audio processing method and device | |
CN112102807A (en) | Speech synthesis method, apparatus, computer device and storage medium | |
CN116013257A (en) | Speech recognition and speech recognition model training method, device, medium and equipment | |
CN113744742A (en) | Role identification method, device and system in conversation scene | |
CN114974294A (en) | Multi-mode voice call information extraction method and system | |
CN118072734A (en) | Speech recognition method, device, processor, memory and electronic equipment | |
CN112087726B (en) | Method and system for identifying polyphonic ringtone, electronic equipment and storage medium | |
JP7304627B2 (en) | Answering machine judgment device, method and program | |
EP4093005A1 (en) | System method and apparatus for combining words and behaviors | |
CN115691500A (en) | Power customer service voice recognition method and device based on time delay neural network | |
KR102370437B1 (en) | Virtual Counseling System and counseling method using the same |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||