CN111916111A - Intelligent voice outbound method and device with emotion, server and storage medium

Intelligent voice outbound method and device with emotion, server and storage medium

Info

Publication number
CN111916111A
CN111916111A (application number CN202010699699.0A)
Authority
CN
China
Prior art keywords
emotion
voice
customer service
voice data
client
Prior art date
Legal status
Granted
Application number
CN202010699699.0A
Other languages
Chinese (zh)
Other versions
CN111916111B (en)
Inventor
林希帆
张雷妮
张奕宁
欧歆
关生力
王臻杰
刘柄廷
Current Assignee
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202010699699.0A
Publication of CN111916111A
Application granted
Publication of CN111916111B
Active legal status
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L15/26 — Speech to text systems
    • H — ELECTRICITY
    • H04 — ELECTRIC COMMUNICATION TECHNIQUE
    • H04M — TELEPHONIC COMMUNICATION
    • H04M3/00 — Automatic or semi-automatic exchanges
    • H04M3/42 — Systems providing special services or facilities to subscribers
    • H04M3/50 — Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M3/527 — Centralised call answering arrangements not requiring operator intervention

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The embodiment of the application discloses an intelligent voice outbound method and device with emotion, a server and a storage medium. The method comprises the following steps: obtaining customer service corpora and customer corpora of a first process node, converting the customer service corpora into first customer service voice data and sending the first customer service voice data to a target customer; receiving voice data of the target client, and determining the target process baseline where the first process node is located according to the matching result between the target client text corresponding to the voice data and the client corpus; determining the next process node according to the process sequence value of the first process node in the target process baseline; obtaining an emotion label of the voice data according to the voice characteristic information and the emotion text information of the voice data; obtaining a target emotion label according to the emotion comparison matrix and the emotion label; and generating second customer service voice data according to the target emotion label and the customer service corpus of the next process node, and returning the second customer service voice data to the target client. By adopting the method and the device, the outbound success rate and the anthropomorphic degree of the outbound voice can be improved.

Description

Intelligent voice outbound method and device with emotion, server and storage medium
Technical Field
The application relates to the field of intelligent customer service, in particular to an intelligent voice outbound method and device with emotion, a server and a storage medium.
Background
At present, intelligent voice outbound mainly works as follows: a robot matches the text corresponding to the acquired client voice against the client corpus in a corpus, obtains a customer service corpus from the corpus according to the matching result, and replies to the client with the voice corresponding to that customer service corpus. In this scheme the robot analyzes the client voice mechanically and responds according to the analysis result, so the response voice is uniform and monotonous and its anthropomorphic similarity is poor, where the anthropomorphic similarity of the response voice refers to the similarity between the response emotion the robot simulates for the emotion expressed in the client voice and the response emotion a human customer service agent would adopt for that same emotion.
Disclosure of Invention
The embodiment of the application provides an intelligent voice outbound method and device with emotion, a server and a storage medium, so as to improve the outbound success rate and the anthropomorphic degree of outbound voice.
In a first aspect, an embodiment of the present application provides an intelligent voice outbound method with emotion, including:
the method comprises the steps of obtaining customer service corpora and customer corpora corresponding to a first process node, converting the customer service corpora into first customer service voice data and sending the first customer service voice data to a target customer;
receiving voice data of the target client, and matching a target client text corresponding to the voice data with the client corpus to obtain a matching result;
determining a target process baseline corresponding to the first process node in a plurality of process baselines according to the matching result, wherein the plurality of process baselines are determined based on historical dialogue voice records, and each process baseline comprises a plurality of process nodes and a process sequence value of each process node in the process baselines;
determining a next process node corresponding to the first process node according to the process sequence value of the first process node in the target process baseline, and acquiring customer service corpora corresponding to the next process node;
extracting voice characteristic information and emotion text information of the voice data;
inputting the voice characteristic information and the emotion text information into an emotion recognition model to obtain emotion labels of the voice data, wherein the emotion recognition model is obtained by training the emotion labels of each voice data in a sample voice set;
acquiring a plurality of groups of customer service client emotion comparison groups containing emotion labels from an emotion comparison matrix, wherein the emotion comparison matrix comprises at least one group of customer service client emotion comparison groups and outbound success probabilities corresponding to each group of customer service client emotion comparison groups, and the outbound success probabilities corresponding to the customer service client emotion comparison groups are determined by the historical dialogue voice records containing the customer service emotion labels and the customer emotion labels of the customer service client emotion comparison groups and the successful outbound dialogue voice records, wherein the historical dialogue voice records comprise the customer service emotion labels, the customer emotion labels and the outbound results of each dialogue voice record;
determining a customer service emotion label in a customer service client emotion contrast group with the highest outbound success probability as a target emotion label;
and generating the second customer service voice data according to the target emotion label and the customer service corpus corresponding to the next process node, and replying the second customer service voice data to the target client.
Optionally, before obtaining the customer service corpus and the customer corpus corresponding to the first process node, the method further includes:
converting each of the historical dialogue voice records into a plurality of texts, wherein each text in the plurality of texts carries a dialogue sequence value, and the plurality of texts comprises a first text;
calculating the matching degree between the keywords of the first text and the respective process labels of each process category to obtain a plurality of category matching degree values between the first text and each process category;
determining a process category corresponding to the maximum value in the plurality of category matching values as a process category of the first text;
obtaining a dialogue link of each dialogue voice record according to the dialogue sequence value carried by each text and the process category of each text, and further obtaining a dialogue link set of the historical dialogue voice record, wherein the dialogue link comprises a plurality of process nodes, and the process nodes correspond to the process categories one by one;
and determining the conversation link with the occurrence frequency larger than a preset occurrence frequency threshold value in the conversation link set as the process baseline.
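For illustration only, the following Python sketch shows one way the flow-baseline construction described above could look in code; the matching-degree helper, category labels and occurrence threshold are assumptions, not part of the patent:

```python
# Hedged sketch: each historical dialogue record becomes an ordered list of process
# categories (a dialogue link), and links occurring more often than a preset
# threshold are kept as flow baselines.
from collections import Counter

def text_category(keywords, category_labels, pair_degree):
    """Pick the process category whose process labels best match the text's keywords."""
    scores = {cat: sum(pair_degree(k, lab) for k in keywords for lab in labels)
              for cat, labels in category_labels.items()}
    return max(scores, key=scores.get)

def dialogue_link(texts, category_labels, pair_degree):
    """texts: list of (dialogue_sequence_value, keywords), one entry per converted text."""
    ordered = sorted(texts, key=lambda t: t[0])
    return tuple(text_category(kw, category_labels, pair_degree) for _, kw in ordered)

def flow_baselines(all_links, min_occurrences):
    counts = Counter(all_links)
    return [list(link) for link, count in counts.items() if count > min_occurrences]
```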
Optionally, the determining, according to the matching result, a target process baseline corresponding to the first process node in the multiple process baselines includes:
acquiring at least one optional flow baseline comprising the first flow node from the plurality of flow baselines, wherein the first flow node in each optional flow baseline carries a preset matching result of the target customer text and the customer corpus;
and determining the optional process baseline which is consistent with the matching result in the preset matching result carried by the first process node in the at least one optional process baseline as the target process baseline.
Optionally, before the obtaining of the plurality of groups of customer service client emotion comparison groups including the emotion labels from the emotion comparison matrix, the method further includes:
dividing each conversation voice record into client voice data and customer service voice data;
inputting the voice characteristic information and emotion text information of the client voice data into the emotion recognition model to obtain a client emotion label of each conversation voice record;
inputting the voice characteristic information and emotion text information of the customer service voice data into the emotion recognition model to obtain a customer service emotion label of each conversation voice record;
and calculating to obtain the emotion comparison matrix according to the client emotion label, the customer service emotion label and the outbound result recorded by each conversation voice.
Optionally, the historical conversational speech record comprises a first conversational speech record;
the dividing of each dialogue voice record into customer voice data and customer service voice data comprises:
obtaining a plurality of voice segments from the first dialogue voice record, and extracting Mel-frequency cepstral coefficient (MFCC) features of each voice segment in the plurality of voice segments;
inputting the MFCC features into an identity vector model to obtain identity vector features of each voice segment;
calculating the similarity between every two identity vector features in the first dialogue voice record;
clustering each voice segment in the first dialogue voice record according to the similarity to obtain a first speaker voice set and a second speaker voice set;
calculating the matching degree between the keywords corresponding to each speaker voice set and the customer service label and the customer label respectively, and determining the label with the highest matching degree as the label corresponding to each speaker voice set;
and dividing the first dialogue voice record into the customer voice data and the customer service voice data according to the labels corresponding to the speaker voice sets.
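For illustration only, the following Python sketch mirrors this speaker-separation step; it substitutes simple MFCC statistics for a trained identity-vector (i-vector) model and uses invented keyword lists for the customer service and customer labels, so it is a sketch of the idea rather than the patented implementation:

```python
# Hedged sketch: MFCC features per segment, a stand-in embedding in place of an
# i-vector model, clustering into two speakers, then keyword-based role labels.
import numpy as np
import librosa
from sklearn.cluster import AgglomerativeClustering

def segment_embedding(segment, sr):
    """Mean/std MFCC statistics as a crude stand-in for an identity vector."""
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def split_speakers(segments, sr):
    emb = np.stack([segment_embedding(s, sr) for s in segments])
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)   # length-normalise
    return AgglomerativeClustering(n_clusters=2).fit_predict(emb)  # 0/1 speaker sets

AGENT_KEYWORDS = {"hello", "introduce", "product", "register"}    # hypothetical labels
CLIENT_KEYWORDS = {"not interested", "how much", "ok", "busy"}

def assign_roles(transcripts, labels):
    """Pick the customer-service / customer label by keyword overlap per cluster."""
    roles = {}
    for cluster in (0, 1):
        text = " ".join(t for t, l in zip(transcripts, labels) if l == cluster).lower()
        agent_hits = sum(k in text for k in AGENT_KEYWORDS)
        client_hits = sum(k in text for k in CLIENT_KEYWORDS)
        roles[cluster] = "customer_service" if agent_hits >= client_hits else "customer"
    return roles
```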
Optionally, the emotion recognition model includes a convolutional layer, a cyclic layer and a transcription layer;
the inputting the voice characteristic information and the emotion text information into an emotion recognition model to obtain an emotion label of the voice data comprises:
fusing the voice characteristic information and the emotion text information to obtain fusion information, and inputting the fusion information into the emotion recognition model;
extracting features of the fusion information through a convolutional layer to obtain emotional features of the voice data;
predicting the emotion characteristics through the circulation layer to obtain a prediction sequence of the voice data;
converting, by the transcription layer, the predicted sequence into an emotion tag for the speech data.
Optionally, the extracting emotion text information of the voice data includes:
splitting text information corresponding to the voice data into at least one word;
calculating the degree of correlation between a target word in the at least one word and each emotion word in a preset emotion text set respectively to obtain a plurality of degree of correlation values between the target word and each emotion word;
determining a maximum value of the plurality of relevance degree values as an emotion score of the target word, thereby obtaining an emotion score of each word in the text information;
determining words with emotion scores larger than a preset threshold value as emotion text information of the voice data.
Optionally, the speech feature information includes a fundamental frequency;
the extracting of the voice feature information of the voice data comprises:
framing the voice data to obtain at least one voice frame, and performing wavelet decomposition on each voice frame in the at least one voice frame to obtain a plurality of wavelet decomposition signals corresponding to each voice frame, wherein each wavelet decomposition signal comprises wavelet high-frequency decomposition signals and wavelet low-frequency decomposition signals of a plurality of voice sampling points;
determining the number of target wavelet decomposition times according to the maximum value in the amplitude of the wavelet low-frequency decomposition signal corresponding to each adjacent two times of wavelet decomposition of the at least one voice frame;
and calculating the fundamental frequency of the voice data according to the period of the wavelet high-frequency decomposition signal under the target wavelet decomposition times.
In a second aspect, an intelligent voice outbound device with emotion is provided for an embodiment of the present application, including:
the first acquiring and sending module is used for acquiring a customer service corpus and a client corpus corresponding to the first process node, converting the customer service corpus into first customer service voice data and sending the first customer service voice data to a target client;
the receiving and matching module is used for receiving the voice data of the target client and matching the target client text corresponding to the voice data with the client linguistic data to obtain a matching result;
a target baseline determining module, configured to determine, according to the matching result, a target process baseline corresponding to the first process node among multiple process baselines, where the multiple process baselines are determined based on a historical dialogue voice record, and each process baseline includes multiple process nodes and a process sequence value of each process node in the process baseline;
a determining and obtaining module, configured to determine, according to a process sequence value of the first process node in the target process baseline, a next process node corresponding to the first process node, and obtain a customer service corpus corresponding to the next process node;
the extraction module is used for extracting the voice characteristic information and the emotion text information of the voice data;
the client emotion acquisition module is used for inputting the voice characteristic information and the emotion text information into an emotion recognition model to obtain emotion labels of the voice data, and the emotion recognition model is obtained by training the emotion labels of each voice data in a sample voice set;
the system comprises an obtaining comparison group module and a comparison group module, wherein the obtaining comparison group module is used for obtaining a plurality of groups of customer service client emotion comparison groups containing emotion labels from an emotion comparison matrix, the emotion comparison matrix comprises at least one group of customer service client emotion comparison groups and outbound success probabilities corresponding to each group of customer service client emotion comparison groups, the outbound success probabilities corresponding to the customer service client emotion comparison groups are determined by the historical dialogue voice records which contain the customer service emotion labels and the customer emotion labels of the customer service client emotion comparison groups and which are successful in outbound, and the historical dialogue voice records comprise the customer service emotion labels, the customer emotion labels and the outbound results of each dialogue voice record;
the customer service emotion determining module is used for determining a customer service emotion label in a customer service customer emotion contrast group with the highest probability of successful outbound call as a target emotion label;
and the generation reply module is used for generating the second customer service voice data according to the target emotion tag and the customer service corpus corresponding to the next process node, and replying the second customer service voice data to the target client.
Optionally, the apparatus further comprises: a flow baseline determination module.
The process baseline determination module is used for converting each dialogue voice record in the historical dialogue voice records into a plurality of texts, wherein each text in the plurality of texts carries a dialogue sequence value, and the plurality of texts comprise a first text;
calculating the matching degree between the keywords of the first text and the respective process labels of each process category to obtain a plurality of category matching degree values between the first text and each process category;
determining a process category corresponding to the maximum value in the plurality of category matching values as a process category of the first text;
obtaining a dialogue link of each dialogue voice record according to the dialogue sequence value carried by each text and the process category of each text, and further obtaining a dialogue link set of the historical dialogue voice record, wherein the dialogue link comprises a plurality of process nodes, and the process nodes correspond to the process categories one by one;
and determining the conversation link with the occurrence frequency larger than a preset occurrence frequency threshold value in the conversation link set as the process baseline.
Optionally, the target baseline determination module is specifically configured to:
acquiring at least one optional flow baseline comprising the first flow node from the plurality of flow baselines, wherein the first flow node in each optional flow baseline carries a preset matching result of the target customer text and the customer corpus;
and determining the optional process baseline which is consistent with the matching result in the preset matching result carried by the first process node in the at least one optional process baseline as the target process baseline.
Optionally, the apparatus further comprises: and an emotion comparison matrix determining module.
The emotion comparison matrix determining module is used for dividing each dialogue voice record into client voice data and customer service voice data;
inputting the voice characteristic information and emotion text information of the client voice data into the emotion recognition model to obtain a client emotion label of each conversation voice record;
inputting the voice characteristic information and emotion text information of the customer service voice data into the emotion recognition model to obtain a customer service emotion label of each conversation voice record;
and calculating to obtain the emotion comparison matrix according to the client emotion label, the customer service emotion label and the outbound result recorded by each conversation voice.
Optionally, the historical conversational speech record comprises a first conversational speech record;
the emotion comparison matrix determining module is used for obtaining a plurality of voice segments according to the first pair of speech voice data and extracting MFCC (Mel frequency cepstrum coefficient) features of each voice segment in the plurality of voice segments;
inputting the MFCC features into an identity vector model to obtain identity vector features of each voice segment;
calculating the similarity between every two identity vector features in the first pair of speech data;
clustering each voice segment in the first speech data pair according to the similarity to obtain a first speaker voice set and a second speaker voice set;
calculating the matching degree between the keywords corresponding to each speaker voice set and the customer service label and the customer label respectively, and determining the label with the highest matching degree as the label corresponding to each speaker voice set;
and dividing the first pair of voice data into the customer voice data and the customer service voice data according to the label corresponding to each speaker voice set.
Optionally, the emotion recognition model includes a convolutional layer, a cyclic layer and a transcription layer;
the client emotion acquisition module is specifically configured to:
fusing the voice characteristic information and the emotion text information to obtain fusion information, and inputting the fusion information into the emotion recognition model;
extracting features of the fusion information through a convolutional layer to obtain emotional features of the voice data;
predicting the emotion characteristics through the circulation layer to obtain a prediction sequence of the voice data;
converting, by the transcription layer, the predicted sequence into an emotion tag for the speech data.
Optionally, the extracting module is specifically configured to:
splitting text information corresponding to the voice data into at least one word;
calculating the degree of correlation between a target word in the at least one word and each emotion word in a preset emotion text set respectively to obtain a plurality of degree of correlation values between the target word and each emotion word;
determining a maximum value of the plurality of relevance degree values as an emotion score of the target word, thereby obtaining an emotion score of each word in the text information;
determining words with emotion scores larger than a preset threshold value as emotion text information of the voice data.
In a third aspect, an embodiment of the present application provides a server, including a processor, a memory and a transceiver that are connected to each other, where the memory is used to store a computer program that supports the server in executing the intelligent voice outbound method, and the computer program includes program instructions; the processor is configured to invoke the program instructions to execute the intelligent voice outbound method with emotion described in the first aspect of the embodiments of the present application.
In a fourth aspect, a storage medium is provided for embodiments of the present application, the storage medium storing a computer program, the computer program comprising program instructions; the program instructions, when executed by a processor, cause the processor to perform an intelligent voice outbound method with emotion as described in an aspect of an embodiment of the present application.
In the embodiment of the application, the customer service corpus and the client corpus corresponding to the first process node are obtained, the customer service corpus is converted into first customer service voice data, and the first customer service voice data is sent to a target client; receiving voice data of a target client, and matching a target client text corresponding to the voice data with the client corpus to obtain a matching result; determining a target process baseline corresponding to the first process node in the plurality of process baselines according to the matching result; determining a next process node corresponding to the first process node according to the process sequence value of the first process node in the target process baseline, and acquiring customer service corpora corresponding to the next process node; extracting voice characteristic information and emotion text information of the voice data; inputting the voice characteristic information and the emotion text information into an emotion recognition model to obtain an emotion label of the voice data; acquiring a plurality of groups of customer service client emotion comparison groups containing emotion labels from the emotion comparison matrix; determining a customer service emotion label in a customer service client emotion contrast group with the highest outbound success probability as a target emotion label; and generating second customer service voice data according to the target emotion label and the customer service corpus corresponding to the next process node, and replying the second customer service voice data to the target customer. By adopting the method and the device, the outbound success rate and the outbound voice personification degree can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flowchart of an intelligent voice outbound method with emotion according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for intelligent voice outbound call with emotion according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an intelligent voice outbound device with emotion according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Please refer to fig. 1, which is a flowchart illustrating an intelligent voice outbound method with emotion according to an embodiment of the present application. As shown in fig. 1, this method embodiment comprises the steps of:
s101, obtaining a customer service corpus and a client corpus corresponding to the first process node, converting the customer service corpus into first customer service voice data and sending the first customer service voice data to a target client.
The first process node may be any one of the process nodes on the plurality of process baselines. The multiple flow baselines are determined based on the historical dialogue voice record, and for a specific implementation process, reference is made to the description of the subsequent embodiments, which is not described herein again.
If the first process node is the starting node of the process baselines, then before step S101 is executed the intelligent voice outbound device dials the target customer's number, and after the call is connected, obtains the customer service corpus and client corpus corresponding to the first process node, converts the customer service corpus into first customer service voice data, and sends the first customer service voice data to the target customer.
Specifically, the intelligent voice outbound device acquires customer service corpora and client corpora in the process category according to the process category corresponding to the first process node, namely the customer service corpora and the client corpora corresponding to the first process node, converts the customer service corpora into voice, obtains first customer service voice data, and sends the first customer service voice data to the target client. After that, step S102 is executed.
And S102, receiving voice data of a target client, and matching the target client text corresponding to the voice data with the client linguistic data to obtain a matching result.
Wherein the client corpus contains a plurality of types of client text.
Specifically, the intelligent voice outbound device receives the voice data of the target client and converts the voice data into text to obtain the target client text. It then calculates the matching degree between the keywords of the target client text and the type labels of each type of client text, so as to obtain the matching degree between the target client text and each type of client text. For example, if the keywords of the target client text are A, B and C, and the type labels of the first type of client text are a, b and c, the matching degree between the target client text and the first type of client text may be the sum of the pairwise matching degrees between the keywords A, B, C and the type labels a, b, c, where the matching degree between a single keyword and a single type label is obtained from a preset matching degree table. The matching degree between the target client text and each type of client text is calculated in this manner to obtain a plurality of matching degree values, and the matching result is determined according to the type corresponding to the maximum value among the plurality of matching degree values.
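For illustration only, a minimal Python sketch of this matching step is given below; the preset matching-degree table, keywords and type labels are invented placeholders:

```python
# Hedged sketch of S102: the matching degree between the target customer text and one
# type of customer text is the sum of pairwise keyword/type-label matching degrees
# looked up from a preset matching-degree table (all names and values are illustrative).
PAIR_MATCH = {                       # preset matching-degree table (assumed values)
    ("price", "cost"): 0.9,
    ("price", "agree"): 0.1,
    ("refuse", "reject"): 0.95,
}

def pair_degree(keyword, type_label):
    return PAIR_MATCH.get((keyword, type_label), 0.0)

def text_type_degree(keywords, type_labels):
    return sum(pair_degree(k, t) for k in keywords for t in type_labels)

def match_result(target_keywords, type_label_sets):
    """Return the customer-text type with the highest matching-degree value."""
    degrees = {name: text_type_degree(target_keywords, labels)
               for name, labels in type_label_sets.items()}
    return max(degrees, key=degrees.get), degrees
```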
S103, determining a target process baseline corresponding to the first process node in the plurality of process baselines according to the matching result.
Wherein the plurality of process baselines are determined based on a historical dialogue voice record, and the process baselines comprise a plurality of process nodes and a process sequence value of each process node in the process baselines.
In one possible implementation manner, at least one optional process baseline including the first process node is obtained from the plurality of process baselines, and the first process node in each optional process baseline carries a preset matching result of the target customer text and the customer corpus;
and determining the optional process baseline which is consistent with the matching result in the preset matching result carried by the first process node in the at least one optional process baseline as the target process baseline.
For example, from the plurality of flow baselines "flow baseline 1: owner confirmation - product introduction - win-back - successful registration", "flow baseline 2: owner confirmation - product introduction - successful registration", "flow baseline 3: owner confirmation - product introduction - win-back failure" and "flow baseline 4: owner confirmation - confirmation failure", the intelligent voice outbound device obtains the optional flow baselines that include the product introduction node, namely flow baseline 1, flow baseline 2 and flow baseline 3. The preset matching results of the target customer text and the customer corpus carried by the product introduction nodes in flow baseline 1, flow baseline 2 and flow baseline 3 are the second type of matching, the first type of matching and the second type of matching respectively, where the first type of matching is agreeing to purchase and the second type of matching is refusing to purchase. The optional flow baseline whose preset matching result is consistent with the matching result obtained in step S102, namely the first type of matching, is flow baseline 2, which is therefore determined as the target flow baseline.
And S104, determining a next process node corresponding to the first process node according to the process sequence value of the first process node in the target process baseline, and acquiring customer service corpora corresponding to the next process node.
Specifically, the intelligent voice outbound device adds 1 to the process sequence value of the first process node in the target process baseline to obtain the process sequence value of the next process node in the target process baseline, and determines the next process node corresponding to the first process node according to the process sequence value of the next process node in the target process baseline.
For example, based on the flow sequence value 2 of the product introduction node in the target flow baseline "owner confirmation - product introduction - successful registration", the intelligent voice outbound device calculates that the flow sequence value of the next flow node in the target flow baseline is 3, and determines the successful registration node with flow sequence value 3 in the target flow baseline as the next flow node corresponding to the product introduction node.
Then, the intelligent voice outbound device acquires the customer service corpus corresponding to the next process node from the corpus according to the process category corresponding to that node.
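For illustration only, the data structures below sketch how the target flow baseline and the next flow node could be represented and looked up; the field names are assumptions, not part of the patent:

```python
# Hedged sketch of S103/S104: pick the baseline whose preset matching result at the
# current node equals the S102 result, then advance to the node whose flow sequence
# value is the current value plus one.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class FlowNode:
    category: str                 # process category, e.g. "product introduction"
    sequence: int                 # flow sequence value within the baseline
    preset_match: Optional[str]   # preset matching result carried by this node

@dataclass
class FlowBaseline:
    name: str
    nodes: List[FlowNode]

def select_target_baseline(baselines, current_category, match_result):
    """S103: optional baseline whose preset match at the current node equals the S102 result."""
    for baseline in baselines:
        for node in baseline.nodes:
            if node.category == current_category and node.preset_match == match_result:
                return baseline, node
    return None, None

def next_node(baseline, current):
    """S104: the node whose flow sequence value is the current value plus one."""
    wanted = current.sequence + 1
    return next((n for n in baseline.nodes if n.sequence == wanted), None)
```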
And S105, extracting the voice characteristic information and the emotion text information of the voice data.
Optionally, the speech feature information includes a fundamental frequency;
the extracting of the voice feature information of the voice data comprises:
framing the voice data to obtain at least one voice frame, and performing wavelet decomposition on each voice frame in the at least one voice frame to obtain a plurality of wavelet decomposition signals corresponding to each voice frame, wherein each wavelet decomposition signal comprises wavelet high-frequency decomposition signals and wavelet low-frequency decomposition signals of a plurality of voice sampling points;
determining the number of target wavelet decomposition times according to the maximum value in the amplitude of the wavelet low-frequency decomposition signal corresponding to each adjacent two times of wavelet decomposition of the at least one voice frame;
and calculating the fundamental frequency of the voice data according to the period of the wavelet high-frequency decomposition signal under the target wavelet decomposition times.
Here, the fundamental frequency refers to the pitch frequency. Sound is produced by the vibration of an object; during vibration the airflow changes instantaneously, these changes cause instantaneous sharp changes in the voice signal and produce mutation points, and the reciprocal of the time interval between two adjacent mutation points is the pitch frequency at that moment. Because wavelets have a strong ability to detect signal mutation points, the pitch frequency can be determined by locating the maximum value points after wavelet transformation.
Wavelet decomposition is described in detail below. Each of the at least one voice frame obtained by framing the client's voice data is subjected to wavelet decomposition; this embodiment is described by taking a first voice frame of the at least one voice frame as an example. The wavelet decomposition process can be regarded as high-pass and low-pass filtering, whose characteristics depend on the type of filter selected; a 16-tap Daubechies-8 wavelet may be chosen as an example. The level-1 wavelet decomposition signal is obtained through the high-pass and low-pass filters and comprises a wavelet low-frequency decomposition signal L1 and a wavelet high-frequency decomposition signal H1. The wavelet low-frequency decomposition signal L1 of the level-1 decomposition is further high-pass and low-pass filtered to obtain the wavelet low-frequency decomposition signal L2 and wavelet high-frequency decomposition signal H2 of the level-2 decomposition, and L2 is filtered again to obtain the wavelet low-frequency decomposition signal L3 and wavelet high-frequency decomposition signal H3 of the level-3 decomposition; by analogy, the input signal can be decomposed over multiple wavelet levels, of which only three are shown here as an example. It is understood that L3 and H3 contain all the information of L2, L2 and H2 contain all the information of L1, and L1 and H1 contain all the information of the first voice frame, so the first wavelet decomposition signal formed by concatenating L3, H3, H2 and H1 can represent the first voice frame.
In a possible implementation, the intelligent voice outbound device samples the client's voice data at 16 kHz and frames it with a frame shift of 10 ms and a frame length of 10 ms, so that each voice frame contains 160 voice sampling points, and performs wavelet decomposition on each voice frame. After the first high-pass filtering the wavelet high-frequency decomposition signal contains 160 voice sampling points, and after the first low-pass filtering the wavelet low-frequency decomposition signal also contains 160 voice sampling points, forming the level-1 wavelet decomposition signal. To keep the number of voice sampling points after wavelet decomposition consistent with the number of voice sampling points of the original voice frame, the high-pass and low-pass filtered signals can be downsampled: downsampling the wavelet low-frequency decomposition signal after the first low-pass filtering to half the sampling frequency of the first voice frame leaves 80 voice sampling points, and likewise the wavelet high-frequency decomposition signal after the first high-pass filtering and downsampling contains 80 voice sampling points, so the level-1 wavelet decomposition signal contains 80 + 80 = 160 voice sampling points, consistent with the number of voice sampling points of one voice frame.
Further, the intelligent voice outbound device calculates the ratio α between the maximum amplitude of the wavelet low-frequency decomposition signals corresponding to the (k+1)-th wavelet decomposition of all voice frames in the voice data and the maximum amplitude of the wavelet low-frequency decomposition signals corresponding to the k-th wavelet decomposition of all voice frames, and if the ratio is smaller than a first preset threshold a, determines k+1 as the target number of wavelet decompositions. The first preset threshold a may be π/2.
Then, the intelligent voice outbound device determines at least one maximum value sampling point among the plurality of voice sampling points according to the amplitude of the wavelet high-frequency decomposition signal corresponding to the (k+1)-th wavelet decomposition, and calculates, in sampling order, the time interval between every two adjacent maximum value sampling points and the number of occurrences of each time interval value. If the time interval values occur with different frequencies, the reciprocal of the most frequent time interval value is determined as the pitch frequency; if every time interval value occurs the same number of times, the reciprocal of the maximum time interval value, or the reciprocal of the average of all time interval values, may be determined as the pitch frequency.
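For illustration only, the following sketch implements a pitch estimate along these lines with the PyWavelets library, assuming a Daubechies-8 wavelet; the level-selection rule and peak detection are simplified:

```python
# Hedged sketch of the pitch (fundamental frequency) estimate in S105.
import numpy as np
import pywt

def choose_level(frame, wavelet="db8", max_level=6, threshold=np.pi / 2):
    """Return k+1 once the ratio of successive low-frequency maxima drops below the threshold."""
    prev_max = None
    approx = frame
    for level in range(1, max_level + 1):
        approx, _detail = pywt.dwt(approx, wavelet)   # one more decomposition level
        cur_max = np.max(np.abs(approx))
        if prev_max is not None and cur_max / prev_max < threshold:
            return level
        prev_max = cur_max
    return max_level

def pitch_from_frame(frame, sr, wavelet="db8"):
    level = choose_level(frame, wavelet)
    coeffs = pywt.wavedec(frame, wavelet, level=level)
    detail = coeffs[1]                                # high-frequency signal at the chosen level
    # local maxima of the detail signal approximate the "mutation points"
    peaks = np.where((detail[1:-1] > detail[:-2]) & (detail[1:-1] > detail[2:]))[0] + 1
    if len(peaks) < 2:
        return 0.0
    intervals = np.diff(peaks)
    vals, counts = np.unique(intervals, return_counts=True)
    dominant = vals[np.argmax(counts)]                # most frequent interval value
    effective_sr = sr / (2 ** level)                  # each DWT level halves the rate
    return effective_sr / dominant
```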
Optionally, the speech feature information further includes formants. The specific implementation process of the intelligent voice outbound device for extracting the formants of the voice data is as follows:
the intelligent voice call-out device uses a Linear Predictive Coding (LPC) model to represent the voice data in an LPC form, namely:
Figure BDA0002592566850000121
wherein u (n) is an excitation function, G is a gain parameter, beta is an LPC parameter, and gamma represents the number of poles.
Further, the transfer function G (n) z (n)/(G u (n)) 1/, (1-a) is obtained from the LPC form of the speech dataβ*n)=1/∏(1-nβ*n) Wherein n isβIs the beta pole of g (n) in the n-plane, all poles of g (n) are within the unit circle of the z-plane. The frequency and bandwidth of the beta-th formant are respectively thetaβT2 pi and ln (r)β) And/pi T. And (5) carrying out root finding on the g and the n to obtain a formant of the voice data.
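For illustration only, the following sketch estimates formant frequencies and bandwidths from LPC coefficients in this way, using librosa for the LPC fit; the prediction order and the plausibility filter are assumptions:

```python
# Hedged sketch: roots of the LPC polynomial give pole angles/radii, from which the
# formant frequency theta/(2*pi*T) and bandwidth -ln(r)/(pi*T) follow.
import numpy as np
import librosa

def formants(frame, sr, order=12):
    a = librosa.lpc(frame.astype(float), order=order)   # [1, -a1, ..., -a_order]
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # keep one pole of each conjugate pair
    T = 1.0 / sr
    freqs = np.angle(roots) / (2 * np.pi * T)    # theta_beta / (2*pi*T)
    bws = -np.log(np.abs(roots)) / (np.pi * T)   # -ln(r_beta) / (pi*T)
    keep = (freqs > 90) & (bws < 400)            # crude plausibility filter (assumed limits)
    order_idx = np.argsort(freqs[keep])
    return freqs[keep][order_idx], bws[keep][order_idx]
```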
Optionally, the extracting emotion text information of the voice data includes:
converting the voice data into text information, and splitting the text information into at least one word;
calculating the degree of correlation between a target word in the at least one word and each emotion word in a preset emotion text set respectively to obtain a plurality of degree of correlation values between the target word and each emotion word;
determining a maximum value of the plurality of relevance degree values as an emotion score of the target word, thereby obtaining an emotion score of each word in the text information;
determining words with emotion scores larger than a preset threshold value as emotion text information of the voice data.
Specifically, the intelligent voice outbound device converts voice data of the client into text information, divides the text information into at least one word, and calculates a degree value of correlation between a target word in the at least one word and each emotion word in a preset emotion text set, wherein the calculation formula is as follows:
(correlation degree formula shown as an image in the original, expressed in terms of the counts A, B, C and D below)
where A is the number of words that contain the target word T and belong to the emotion word c_i, B is the number of words that contain the target word T but do not belong to the emotion word c_i, C is the number of words that do not contain the target word T but belong to the emotion word c_i, and D is the number of words that neither contain the target word T nor belong to the emotion word c_i.
And then, determining the maximum value of the multiple relevance degree values of the target word as the emotion score of the target word, obtaining the emotion score of each word in the text information according to the method, sequencing at least one word of the text information according to the emotion score, and determining the first k words of which the emotion scores are more than or equal to a preset threshold value as the emotion text information of the voice data, namely one k-dimensional emotion text information of the voice data.
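For illustration only, the sketch below computes such emotion scores; since the correlation formula itself is only shown as an image in the original, a chi-square style statistic over the counts A, B, C and D is assumed here:

```python
# Hedged sketch of the emotion-text selection in S105 (chi-square statistic assumed).
def correlation(A, B, C, D):
    N = A + B + C + D
    denom = (A + B) * (C + D) * (A + C) * (B + D)
    return N * (A * D - B * C) ** 2 / denom if denom else 0.0

def emotion_score(word, emotion_words, counts):
    """counts[(word, emotion_word)] -> (A, B, C, D) from the preset emotion text set."""
    return max(correlation(*counts[(word, e)]) for e in emotion_words)

def emotion_text(words, emotion_words, counts, threshold, k):
    scored = sorted(((emotion_score(w, emotion_words, counts), w) for w in words),
                    reverse=True)
    return [w for s, w in scored[:k] if s >= threshold]   # top-k words above the threshold
```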
And S106, inputting the voice characteristic information and the emotion text information into the emotion recognition model to obtain an emotion label of the voice data.
Before inputting the voice feature information and the emotion text information into the emotion recognition model, the intelligent voice outbound device can obtain the emotion recognition model by means of a convolutional recurrent neural network.
Specifically, the intelligent voice outbound device randomly selects a sample voice set from the artificial outbound voice library; the sample voice set comprises at least one artificial outbound voice and the text information of each artificial outbound voice, and the actual emotion label of each artificial outbound voice in the sample voice set is obtained by manual labeling. The intelligent voice outbound device then extracts the voice feature information and emotion text information of each artificial outbound voice in the sample voice set (the specific implementation is the same as that for extracting the voice feature information and emotion text information of the voice data in this step and is not repeated here), so as to obtain the voice feature information and emotion text information of the sample voice set, and divides the sample voice set into a training set and a verification set according to a certain proportion. For example, the sample voice set has 2000 artificial outbound voices together with the text information corresponding to each artificial outbound voice; manual classification yields 200 angry, 1000 enthusiastic, 600 flat and 200 sad voices, and splitting at a ratio of 7:3 gives 1400 voices in the training set and 600 voices in the verification set.
And then, the intelligent voice outbound device fuses the voice characteristic information and the emotion text information of the sample voice set to obtain the fusion information of each artificial outbound voice.
Specifically, the fundamental frequency and the formants in the voice feature information are fused: so that the data volume of the formants is comparable to that of the fundamental frequency, the formants can first be optimized (dimension-reduced) by principal component analysis, and the fundamental frequency is then concatenated with the optimized formants to obtain the audio features of each artificial outbound voice. Further, the audio features and the emotion text information of each artificial outbound voice are input into a Restricted Boltzmann Machine (RBM) model to obtain the fusion information of each artificial outbound voice.
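For illustration only, the following sketch mimics this fusion step with scikit-learn, using PCA for the formant reduction and a Bernoulli restricted Boltzmann machine as a stand-in for the RBM fusion; dimensions and hyperparameters are assumptions:

```python
# Hedged sketch: PCA-reduce formants, concatenate with fundamental-frequency features
# and emotion-text vectors, then take RBM hidden activations as the fusion information.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import minmax_scale

def fuse(f0_feats, formant_feats, emotion_text_vecs, n_components=4, n_hidden=64):
    formant_low = PCA(n_components=n_components).fit_transform(formant_feats)
    audio = np.hstack([f0_feats, formant_low])
    x = minmax_scale(np.hstack([audio, emotion_text_vecs]))   # RBM expects values in [0, 1]
    rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.05, n_iter=20, random_state=0)
    return rbm.fit_transform(x)                               # fusion information
```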
Then, the intelligent voice outbound device inputs the fusion information and actual emotion label of each artificial outbound voice in the training set into the initial convolutional recurrent neural network model for training and learning to obtain a first convolutional recurrent neural network model, inputs the fusion information of the verification set into the first convolutional recurrent neural network model to obtain the predicted emotion labels of the verification set, calculates the proportion of voices whose predicted emotion label is consistent with the actual emotion label among the total number of voices contained in the verification set, and judges from this proportion whether the first convolutional recurrent neural network model has reached the convergence condition. For example, if the intelligent voice outbound device calculates that this proportion is 60%, the convergence condition, namely a preset output accuracy of 95%, is not reached; the intelligent voice outbound device therefore changes the parameters of the first convolutional recurrent neural network model, such as the learning rate and the weights of each layer of the network, according to the proportion (i.e., the output accuracy of the model), until the first convolutional recurrent neural network model meets the convergence condition; when the output accuracy of the model reaches 95%, the first convolutional recurrent neural network model at that moment is determined to be the emotion recognition model.
Then, the intelligent voice outbound device inputs the voice feature information and emotion text information into the emotion recognition model to obtain the emotion label of the voice data.
Optionally, the emotion recognition model includes a convolutional layer, a cyclic layer and a transcription layer;
the inputting the voice characteristic information and the emotion text information into an emotion recognition model to obtain an emotion label of the voice data comprises:
fusing the voice characteristic information and the emotion text information to obtain fusion information, and inputting the fusion information into the emotion recognition model;
extracting features of the fusion information through a convolutional layer to obtain emotional features of the voice data;
predicting the emotion characteristics through the circulation layer to obtain a prediction sequence of the voice data;
converting, by the transcription layer, the predicted sequence into an emotion tag for the speech data.
Specifically, the intelligent voice outbound device fuses the fundamental frequency, the formants and the emotion text information of the voice data to obtain the fusion information of the voice data, and please refer to the specific implementation manner of obtaining the fusion information of each artificial outbound voice in the sample voice set in this step, which is not described herein again.
Then, the fusion information of the voice data is input into the emotion recognition model, and feature extraction is performed on it by the convolutional layer of the emotion recognition model. The convolutional layer can be understood as a standard convolutional neural network model without the fully connected layer: the fusion information of the voice data first enters the convolutional layer of the standard convolutional neural network model, a small part of the fusion information is randomly selected as a sample, some feature information is learned from this small sample, and a convolution operation is then performed between the feature information learned from the sample and the fusion information. After the convolution operation, the features of the fusion information have been extracted, but the number of features extracted by the convolution operation alone is large. To reduce the amount of computation, a pooling operation is needed: the features extracted by the convolution operation are passed to the pooling layer of the standard convolutional neural network model and aggregated statistically, and the order of magnitude of the pooled features is far lower than that of the features extracted by the convolution operation, while the classification effect is improved. Commonly used pooling methods mainly include average pooling and max pooling. Average pooling computes the average of a feature set to represent that feature set; max pooling extracts the maximum feature in a feature set to represent it. Through the convolution processing of the convolutional layer and the pooling processing of the pooling layer, the static structural feature information of the fusion information can be extracted, and the emotion features of the voice data are obtained.
Thereafter, the emotion features of the voice data are input into the recurrent layer of the emotion recognition model, which can be understood as a recurrent neural network model. Although a recurrent neural network can model a time series, its ability to learn long-term dependency information is limited: if the current output is associated with a sequence far in the past, that dependency is difficult to learn, because an overly long sequence and long-term dependencies can lead to gradient vanishing or gradient explosion. Therefore, in this embodiment a special recurrent neural network model, namely a Long Short-Term Memory (LSTM) model, is selected to predict the emotion features of the voice data and obtain the prediction sequence of the voice data. The prediction sequence of the voice data can be understood as a probability row vector with N elements, where element a_i represents the probability that the emotion label of the voice data is the i-th emotion label.
Then, the prediction sequence of the voice data is sent to the transcription layer, which can be understood as using a Connectionist Temporal Classification (CTC) algorithm to find, from the probability row vector of N elements, the emotion label with the highest probability, namely the emotion label of the voice data.
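For illustration only, the PyTorch sketch below mirrors the convolution + recurrent + transcription structure; the transcription step is simplified to pooling the prediction sequence and taking the most probable label, and the layer sizes and the four emotion labels are assumptions:

```python
# Hedged sketch of the emotion recognition model in S106 (architecture details assumed).
import torch
import torch.nn as nn

EMOTIONS = ["angry", "enthusiastic", "flat", "sad"]           # assumed label set

class EmotionCRNN(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, n_labels=len(EMOTIONS)):
        super().__init__()
        self.conv = nn.Sequential(                            # convolutional layer: emotion features
            nn.Conv1d(feat_dim, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.LSTM(128, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_labels)            # per-step prediction sequence

    def forward(self, fused):                                 # fused: (batch, time, feat_dim)
        x = self.conv(fused.transpose(1, 2)).transpose(1, 2)  # (batch, time', 128)
        x, _ = self.rnn(x)
        logits = self.out(x)                                  # prediction sequence
        return logits.mean(dim=1).softmax(dim=-1)             # simplified "transcription"

model = EmotionCRNN()
probs = model(torch.randn(1, 100, 64))                        # one utterance of fused features
label = EMOTIONS[int(probs.argmax(dim=-1))]                   # emotion label of the voice data
```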
S107, a plurality of groups of customer service client emotion comparison groups containing the emotion label are obtained from the emotion comparison matrix.
The emotion comparison matrix comprises at least one group of customer service client emotion comparison groups and an outbound success probability corresponding to each group. The outbound success probability corresponding to a customer service client emotion comparison group is determined from the dialogue voice records in the historical dialogue voice records that contain the customer service emotion label and the client emotion label of that comparison group and whose outbound call succeeded, where the historical dialogue voice records include the customer service emotion label, the client emotion label and the outbound result of each dialogue voice record.
Here, a specific implementation of obtaining the emotion comparison matrix is described first. After dividing each dialogue voice record in the historical dialogue voice records into client voice data and customer service voice data, the intelligent voice outbound device extracts the voice feature information and emotion text information of the client voice data and the customer service voice data of each dialogue voice record, and fuses them respectively to obtain the fusion information of the client voice data and the fusion information of the customer service voice data. The two kinds of fusion information are input into the emotion recognition model to obtain the client emotion label and the customer service emotion label of each dialogue voice record. Then the device calculates the ratio Pij between the number of dialogue voice records in the historical dialogue voice records whose customer service emotion label is the ith emotion label, whose client emotion label is the jth emotion label and whose outbound call succeeded, and the number of dialogue voice records whose customer service emotion label is the ith emotion label and whose client emotion label is the jth emotion label. This gives a group of customer service client emotion comparison groups in which the customer service emotion label is the ith emotion label and the client emotion label is the jth emotion label, with an outbound success probability of Pij. In this way, a plurality of groups of customer service client emotion comparison groups and the outbound success probability of each group, namely the emotion comparison matrix, are obtained.
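Illustratively, the calculation of the ratios Pij from the labelled historical dialogue voice records can be sketched as follows; the tuple layout of a record (customer service emotion label index, client emotion label index, outbound result) is an assumption made for the sketch only.

    from collections import Counter

    def build_emotion_matrix(records, n_labels=4):
        # records: iterable of (service_label, client_label, success) tuples, where the
        # labels are indices 0..n_labels-1 and success indicates whether the outbound
        # call of that dialogue voice record succeeded.
        total = Counter()   # dialogues whose service label is i and client label is j
        wins = Counter()    # the subset of those dialogues whose outbound call succeeded
        for service_label, client_label, success in records:
            total[(service_label, client_label)] += 1
            if success:
                wins[(service_label, client_label)] += 1
        # Pij = successful dialogues / all dialogues of the (i, j) comparison group.
        return [[wins[(i, j)] / total[(i, j)] if total[(i, j)] else 0.0
                 for j in range(n_labels)]
                for i in range(n_labels)]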
The emotion comparison matrix is a square matrix, that is, the number of rows and the number of columns of the matrix are equal, and the number m of rows and the number n of columns are integers greater than or equal to 3. When n is 3, the customer service emotion labels and the client emotion labels include negative emotion, neutral emotion and positive emotion; when n is 4, they include anger, enthusiasm, blandness and sadness. Illustratively, when n is 4, the emotion comparison matrix is as follows:
                     anger      enthusiasm   bland      sad       (client emotion)
    anger            P11        P12          P13        P14
    enthusiasm       P21        P22          P23        P24
    bland            P31        P32          P33        P34
    sad              P41        P42          P43        P44
    (customer service emotion)

where each row in the emotion comparison matrix represents, when the customer service emotion is the emotion label corresponding to that row, the outbound success probabilities for the client emotion taking the different emotion labels. The emotion comparison matrix thus comprises 16 customer service client emotion comparison groups and the outbound success probability corresponding to each group, namely the anger-anger emotion comparison group with outbound success probability P11, the anger-enthusiasm emotion comparison group with outbound success probability P12, the anger-bland emotion comparison group with outbound success probability P13, ..., and the sad-sad emotion comparison group with outbound success probability P44.
For example, if the emotion label of the voice data obtained by the intelligent voice outbound device is bland, 4 customer service client emotion comparison groups whose client emotion label is bland are obtained from the emotion comparison matrix, namely the anger-bland emotion comparison group, the enthusiasm-bland emotion comparison group, the bland-bland emotion comparison group and the sad-bland emotion comparison group.
And S108, determining the customer service emotion label in the customer service client emotion comparison group with the highest outbound success probability as the target emotion label.
For example, if the outbound success probabilities of the 4 groups whose client emotion label is bland, namely the anger-bland, enthusiasm-bland, bland-bland and sad-bland emotion comparison groups, are 0.2, 0.9, 0.5 and 0.1 respectively, the intelligent voice outbound device determines the customer service emotion label in the enthusiasm-bland emotion comparison group with the highest outbound success probability, namely enthusiasm, as the target emotion label.
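Illustratively, the selection of the target emotion label from the emotion comparison matrix can be sketched as follows; the label names and their ordering are assumptions made for the sketch.

    def pick_target_emotion(matrix, client_label,
                            label_names=("anger", "enthusiasm", "bland", "sad")):
        # One comparison group per customer service emotion label: the column of the
        # matrix selected by the client emotion label.
        column = [row[client_label] for row in matrix]
        best = max(range(len(column)), key=column.__getitem__)
        return label_names[best]

    # With the example probabilities 0.2, 0.9, 0.5 and 0.1 in the "bland" column,
    # the function returns "enthusiasm" as the target emotion label.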
And S109, generating second customer service voice data according to the target emotion label and the customer service corpus corresponding to the next process node, and returning the second customer service voice data to the target client.
The intelligent voice outbound device obtains the emotion parameters corresponding to the target emotion label, such as speech rate and intonation, by looking up a voice emotion parameter table, adjusts these emotion parameters, generates second customer service voice data whose emotion is the target emotion label and whose content is the customer service corpus of the next process node, and replies the second customer service voice data to the target client.
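Illustratively, the lookup of emotion parameters for the target emotion label can be sketched as follows; the parameter table values and the synthesize() function are hypothetical placeholders, not part of the embodiment.

    # Hypothetical voice emotion parameter table; the values are illustrative only.
    EMOTION_PARAMS = {
        "anger":      {"rate": 1.10, "pitch": 1.0},
        "enthusiasm": {"rate": 1.15, "pitch": 2.0},
        "bland":      {"rate": 1.00, "pitch": 0.0},
        "sad":        {"rate": 0.85, "pitch": -2.0},
    }

    def make_reply_audio(target_emotion, corpus_text, synthesize):
        # synthesize(text, rate, pitch) is a stand-in for whichever TTS engine is used.
        params = EMOTION_PARAMS[target_emotion]
        return synthesize(corpus_text, rate=params["rate"], pitch=params["pitch"])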
In the embodiment of the application, after converting the customer service corpus corresponding to the first process node into voice data and sending it to the target client, the intelligent voice outbound device matches the target client text corresponding to the voice data of the target client with the client corpus corresponding to the first process node, determines a target process baseline corresponding to the first process node among the plurality of process baselines according to the matching result, and determines the next process node corresponding to the first process node according to the process sequence value of the first process node in the target process baseline. The device then obtains the emotion label of the voice data of the target client through the emotion recognition model and, according to this emotion label, determines the customer service emotion label in the customer service client emotion comparison group that contains the emotion label and has the highest outbound success probability in the emotion comparison matrix as the target emotion label. In other words, the emotion label of the reply voice sent to the target client is the customer service emotion label in the comparison group that contains the emotion label of the target client's voice data and has the highest outbound success probability, so the success rate of outbound marketing is improved and the degree of personification of the outbound voice is also improved.
Please refer to fig. 2, which is a flowchart illustrating an intelligent voice outbound method with emotion according to an embodiment of the present application. As shown in fig. 2, the method embodiment comprises the following steps:
S201, determining a plurality of process baselines according to the historical dialogue voice record.
In one possible implementation, the intelligent voice outbound device converts each of the historical dialogue voice records into a plurality of texts, wherein each of the plurality of texts carries a dialogue sequence value, and the plurality of texts comprises a first text;
calculating the matching degree between the keywords of the first text and the respective process labels of each process category to obtain a plurality of category matching degree values between the first text and each process category;
determining a process category corresponding to the maximum value in the plurality of category matching values as a process category of the first text;
obtaining a dialogue link of each dialogue voice record according to the dialogue sequence value carried by each text and the process category of each text, and further obtaining a dialogue link set of the historical dialogue voice record, wherein the dialogue link comprises a plurality of process nodes, and the process nodes correspond to the process categories one by one;
and determining the conversation link with the occurrence frequency larger than a preset occurrence frequency threshold value in the conversation link set as the process baseline.
Specifically, the intelligent voice outbound device converts each dialogue voice record in the historical dialogue voice records into text to obtain a plurality of texts corresponding to each dialogue voice record, and calculates the matching degree between the keywords of a first text among the plurality of texts and the process labels of each process category to obtain a plurality of category matching degree values between the first text and each process category. The process category corresponding to the maximum value among the plurality of category matching degree values is determined as the process category of the first text. The process category of each text in each dialogue voice record is obtained in the same manner, and the process categories of the texts are ordered according to their dialogue sequence values from small to large, thereby obtaining the dialogue link of each dialogue voice record. Illustratively, if the dialogue sequence values carried by texts a1 to a7 in the first dialogue voice record are 1 to 7 respectively, and the process categories of texts a1 to a7 are owner confirmation, product introduction, retrieval and successful registration, then the dialogue link of the first dialogue voice record comprises four process nodes, namely an owner confirmation node, a product introduction node, a retrieval node and a successful registration node, and the process categories corresponding to the four process nodes are owner confirmation, product introduction, retrieval and successful registration respectively. In this manner the dialogue link of each dialogue voice record is obtained, and further the dialogue link set of the historical dialogue voice records. The dialogue links whose number of occurrences in the dialogue link set is greater than a preset occurrence threshold (for example, 100) are determined as process baselines; alternatively, the dialogue links in the set are sorted in descending order of the number of occurrences, and the first n dialogue links in the sorted set are determined as process baselines.
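Illustratively, the determination of the process baselines from the dialogue link set can be sketched as follows; the representation of a dialogue link as a tuple of process categories is an assumption made for the sketch.

    from collections import Counter

    def find_process_baselines(dialogue_links, min_count=100, top_n=None):
        # dialogue_links: list of tuples of process categories, e.g.
        # ("owner confirmation", "product introduction", "retrieval", "successful registration").
        counts = Counter(dialogue_links)
        if top_n is not None:
            # Alternative: keep the top_n most frequent dialogue links.
            return [link for link, _ in counts.most_common(top_n)]
        # Default: keep the links occurring more than the preset threshold.
        return [link for link, count in counts.items() if count > min_count]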
S202, obtaining customer service linguistic data and customer linguistic data corresponding to the first process node, converting the customer service linguistic data into first customer service voice data and sending the first customer service voice data to a target customer.
Before step S202 is executed, the intelligent voice outbound device determines a corpus according to the historical dialogue voice record, and the specific implementation manner is as follows:
optionally, the historical dialogue voice records include a first dialogue voice record, and the first dialogue voice record includes first dialogue voice data;
the dividing of each dialogue voice data into customer voice data and customer service voice data includes:
obtaining a plurality of voice segments according to the first dialogue voice data, and extracting MFCC (Mel-frequency cepstral coefficient) features of each voice segment in the plurality of voice segments;
inputting the MFCC features into an identity vector model to obtain identity vector features of each voice segment;
calculating the similarity between every two identity vector features in the first dialogue voice data;
clustering each voice segment in the first dialogue voice data according to the similarity to obtain a first speaker voice set and a second speaker voice set;
calculating the matching degree between the keywords corresponding to each speaker voice set and the customer service label and the customer label respectively, and determining the label with the highest matching degree as the label corresponding to each speaker voice set;
and dividing the first dialogue voice data into the customer voice data and the customer service voice data according to the label corresponding to each speaker voice set.
Specifically, the intelligent voice outbound device performs voice detection on the first dialogue voice data through a Gaussian mixture model to obtain a plurality of voice segments each containing only one speaker, concatenates these voice segments into new voice data, and cuts the new voice data into a plurality of voice segments that are identical in length and partially overlapping.
Then, the intelligent voice outbound device extracts the MFCC characteristics of each voice segment in the multiple voice segments, and the specific implementation mode is as follows:
the intelligent voice outbound device can intercept each voice segment by using a window function with limited length to form an analysis frame, and the window function sets sampling points outside a region needing to be processed to zero to obtain a current voice frame.
Optionally, the window function in the embodiment of the present application may be a Hamming window function, that is,
ω(n) = 0.54 − 0.46 × cos(2πn/(N − 1)),  0 ≤ n ≤ N − 1;  ω(n) = 0 otherwise
where N is the frame length, which is 256 or 512.
The windowed speech sample Sω(n) corresponding to time n is then obtained as
Sω(n) = S(n) × ω(n)
where S(n) is the speech sample value at time n.
Then the windowed speech Sω(n) is pre-emphasized: the windowed speech is processed with y(n) = x(n) − a × x(n − 1), where x(n) is the sample value of the windowed speech Sω(n) at time n, a is the pre-emphasis coefficient with a value between 0.9 and 1 (illustratively, a = 0.9375), and y(n) is the pre-emphasized signal. It can be understood that the pre-emphasis processing passes the speech segment through a high-pass filter to compensate the high-frequency components, reducing the high-frequency loss caused by lip articulation or microphone recording.
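Illustratively, the framing, Hamming windowing and pre-emphasis described above can be sketched as follows; the frame length and the order of operations follow this embodiment, while the hop size is an assumption made for the sketch.

    import numpy as np

    def frame_window_preemphasize(signal, frame_len=256, hop=128, a=0.9375):
        # Hamming window w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1)), N = frame length.
        window = 0.54 - 0.46 * np.cos(2 * np.pi * np.arange(frame_len) / (frame_len - 1))
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            x = signal[start:start + frame_len] * window      # S_w(n) = S(n) * w(n)
            y = np.append(x[0], x[1:] - a * x[:-1])           # y(n) = x(n) - a*x(n-1)
            frames.append(y)
        return np.array(frames)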
After windowing and pre-emphasis processing is performed on the voice segments to obtain each voice frame of the voice segments, fast fourier transform is also performed on each voice frame to obtain the frequency spectrum of each voice frame. Illustratively, the spectrum of each speech frame can be obtained by performing a discrete fourier transform on each speech frame according to the following formula.
X(k) = Σ (n = 0 to N − 1) x(n) × e^(−j2πnk/N),  k = 0, 1, ..., N − 1

where x(n) is the speech frame after windowing and pre-emphasis processing, and N represents the number of points of the Fourier transform.
The intelligent voice outbound device obtains the energy spectrum of each voice frame by squaring the spectral amplitude of each voice frame. Because the cochlea acts as a filter bank when the human ear distinguishes speech, filtering the voice on a logarithmic scale, the Mel frequency fMel = 2595 × log10(1 + f/700) is closer to the auditory mechanism of the human ear than the linear frequency f. Therefore the energy spectrum of each speech frame is passed through a group of Mel-frequency filter banks (M Mel band-pass filters) to obtain the output power spectrum of the M Mel band-pass filters.
The intelligent voice outbound device takes the logarithm of the output power spectrum and then performs a discrete cosine transform to obtain a plurality of MFCC coefficients (Mel-Frequency Cepstral Coefficients), namely the static features, of which there are generally 12 to 16. The static features can be calculated by the following formula:
C(n) = Σ (k = 1 to M) log(X(k)) × cos(πn(2k − 1)/(2M)),  n = 1, 2, ..., L

where X(k) is the output power spectrum of the kth Mel band-pass filter and C0, the coefficient for n = 0, is the spectral energy.
Then the intelligent voice outbound device performs first-order and second-order differencing on the static features and the spectral energy to obtain the dynamic features, combines the static features and the dynamic features to obtain the feature vector corresponding to each voice frame, and thereby obtains the MFCC features of each voice segment.
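Illustratively, the extraction of static MFCC features and their first- and second-order differences can be approximated with an open-source library as follows; the sampling rate and the number of coefficients are assumptions, and the library's internal filter-bank settings may differ from the exact steps described above.

    import numpy as np
    import librosa

    def mfcc_with_deltas(segment, sr=8000, n_mfcc=13):
        static = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc)  # static features
        delta1 = librosa.feature.delta(static, order=1)                 # first-order difference
        delta2 = librosa.feature.delta(static, order=2)                 # second-order difference
        # Combine static and dynamic features into one matrix per voice segment.
        return np.vstack([static, delta1, delta2])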
Optionally, the identity VECTOR model comprises an X-VECTOR model.
Then the intelligent voice outbound device inputs the MFCC features of each voice segment into the trained X-VECTOR model to obtain the X-VECTOR features of each voice segment, and calculates the cosine similarity between every two X-VECTOR features as S = <X1, X2>/(‖X1‖ × ‖X2‖), where X1 and X2 represent X-VECTOR features and S represents the cosine of the angle between the two X-VECTOR features; the smaller the angle, the higher the similarity between the two X-VECTOR features.
Further, according to the similarity, each voice segment in the first dialogue voice data is clustered to obtain the first speaker voice set and the second speaker voice set.
Specifically, the intelligent voice outbound device uses the k-means algorithm to cluster the X-VECTOR features of the voice segments in the first dialogue voice data. The implementation process is as follows: 1) randomly select two X-VECTOR features from the plurality of X-VECTOR features as the centers of two groups, that is, group C1 and group C2 with group centers Q1 and Q2 respectively; 2) traverse the remaining X-VECTOR features, compute the similarity of each to Q1 and to Q2, and compare the two; if the similarity between the X-VECTOR feature and Q1 is higher, assign the X-VECTOR feature to group C1, otherwise to group C2, and complete the assignment of every X-VECTOR feature in this manner; 3) recalculate the group centers of group C1 and group C2, and repeat steps 2) and 3) until the k-means algorithm reaches the convergence condition or the maximum number of iterations is reached, at which point clustering ends and the first speaker voice set and the second speaker voice set are obtained.
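Illustratively, the cosine-similarity based k-means clustering (k = 2) of the X-VECTOR features can be sketched as follows; the random initialization and the iteration budget are assumptions made for the sketch.

    import numpy as np

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def two_speaker_clustering(xvectors, n_iter=20, seed=0):
        xvectors = np.asarray(xvectors, dtype=float)
        rng = np.random.default_rng(seed)
        # Step 1): randomly pick two X-VECTOR features as the initial group centers.
        centers = xvectors[rng.choice(len(xvectors), size=2, replace=False)]
        labels = np.zeros(len(xvectors), dtype=int)
        for _ in range(n_iter):
            # Step 2): assign every X-VECTOR to the group whose center is more similar.
            for i, x in enumerate(xvectors):
                labels[i] = 0 if cosine(x, centers[0]) >= cosine(x, centers[1]) else 1
            # Step 3): recompute the group centers; stop when they no longer change.
            new_centers = centers.copy()
            for k in (0, 1):
                members = xvectors[labels == k]
                if len(members):
                    new_centers[k] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels   # 0 -> first speaker voice set, 1 -> second speaker voice set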
Then the matching degree between the keywords corresponding to each speaker voice set and the customer service label is calculated. For example, if the keywords corresponding to the first speaker voice set are A, B and C, and the customer service labels are a, b and c, the matching degree between the first speaker voice set and the customer service label may be the sum of the matching degrees between the keywords A, B, C and the customer service labels a, b, c respectively, where the matching degree between a single keyword and a single customer service label may be obtained from a preset matching degree table. The matching degree between the first speaker voice set and the client label is calculated in the same manner, and the label with the larger of the two matching degrees is determined as the label corresponding to the first speaker voice set. The label corresponding to the second speaker voice set is determined in the same manner. Further, the dialogue voice data of each dialogue voice record in the historical dialogue voice records is divided into customer service voice data and client voice data in the above manner.
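Illustratively, the keyword-to-label matching degree, computed as a sum of pairwise values from a preset matching degree table, can be sketched as follows; the table is assumed to be a dictionary keyed by (keyword, label) pairs.

    def label_match_degree(keywords, labels, match_table):
        # Sum of the pairwise matching degrees read from a preset matching degree table;
        # unknown (keyword, label) pairs default to 0.
        return sum(match_table.get((k, l), 0.0) for k in keywords for l in labels)

    # The speaker voice set is then given whichever of the customer service label set or
    # the client label set yields the larger match degree.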
The intelligent voice outbound device converts the client voice data and the customer service voice data recorded by all the conversation voices into texts to obtain a client text set and a customer service text set.
Each word in a first client text in the client text set is converted into a discrete symbol in a One-Hot encoding manner, that is, each word in the first client text corresponds to a row vector in which only one value is 1 and the remaining values are 0; the row vector corresponding to each word is the initial word vector of that word, and the dimension of the initial word vector is set manually and is not limited here. Because different client texts contain different numbers of words, in order to make the numbers of rows and columns of the initial matrices of all client texts consistent, the initial word vectors of the words in a client text are ordered according to the order in which the words appear in the text to obtain a matrix with m rows and n columns; the number l of words contained in the client text is then compared with the preset number L of rows of the initial matrix, and if l is less than L, (L − l) n-dimensional zero vectors are appended row by row starting at the (m + 1)th row of the m-by-n matrix to obtain a matrix with L rows and n columns, namely the initial matrix of the client text. The initial matrix of the client text is multiplied by an input weight matrix whose number of columns is a preset value (the number of hidden-layer neurons) to obtain the text matrix of the client text, that is, the word2vec word vector of each word. The input weight matrix is obtained by training based on a sample text set and the initial word vectors of each text in the set. The training process can be understood as the Continuous Bag-of-Words (CBOW) model in the Word2vec model, that is, a neural network with one hidden layer: the initial word vectors of the words other than a target word in each text are used to predict the word vector of the target word, giving a predicted word vector of the target word; the error between each element of the predicted word vector and the corresponding element of the initial word vector of the target word is continuously reduced by adjusting the initial input weight matrix and the initial output weight matrix of the CBOW model, and when the errors are minimized, the adjusted initial input weight matrix is determined as the input weight matrix.
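Illustratively, the construction of the padded initial matrix of a client text and its word2vec word vectors can be sketched as follows; the vocabulary mapping and the pre-trained CBOW input weight matrix are assumed to be available as described above.

    import numpy as np

    def client_text_matrix(words, vocab, input_weights, max_rows):
        # vocab maps each word to its index; the initial word vector of a word is a
        # one-hot row vector of dimension len(vocab).
        n = len(vocab)
        one_hot = np.zeros((max_rows, n))
        for row, word in enumerate(words[:max_rows]):
            one_hot[row, vocab[word]] = 1.0
        # Rows beyond the number of words stay as zero vectors (the padding step);
        # multiplying by the trained input weight matrix (n x hidden) yields the
        # word2vec word vector of each word.
        return one_hot @ input_weights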
Then the word2vec word vector of each word in the first client text is input into a pre-trained ELMo (Embeddings from Language Models) model. For the word2vec word vector of each word, an L-layer biLM model produces 2L + 1 representations; each layer has a forward LSTM output and a backward LSTM output, and after the two are spliced each layer yields a 2X × 1 column vector. The vectors output by the top-layer LSTM are summarized as the ELMo vector of each word. The ELMo model is obtained by training based on a sample text set and the word2vec word vectors of each text in the set; the specific training process is not repeated here. At least one ELMo vector of each client text in the client text set is obtained in the above manner. The ELMo vectors of all client texts in the client text set are then clustered using the k-means algorithm, as follows: 1) randomly select N ELMo vectors from the plurality of ELMo vectors as the cluster centers of N clusters, that is, cluster C1, cluster C2, ..., cluster CN with cluster centers Q1, Q2, ..., QN respectively; 2) traverse the remaining ELMo vectors, compute the Euclidean distance of each to Q1, Q2, ..., QN, and compare the distances; if the Euclidean distance between an ELMo vector and Q1 is the smallest, assign the ELMo vector to cluster C1, and complete the assignment of every ELMo vector in this manner; 3) recalculate the cluster centers of clusters C1, C2, ..., CN, and repeat steps 2) and 3) until the k-means algorithm reaches the convergence condition or the maximum number of iterations is reached, at which point clustering ends and N text sets are obtained, where N is the number of process categories.
Further, the matching degree between the high-frequency words appearing in each of the N text sets and the process category labels of each process category is calculated. If the high-frequency words of the first text set are A, B and C, and the process category labels of the first process category are a, b and c, the category matching degree between the first text set and the first process category may be the sum of the matching degrees between the high-frequency words A, B, C and the process category labels a, b, c respectively, where the matching degree between a single high-frequency word and a single process category label may be obtained from a preset matching degree table. The category matching degree between the first text set and each process category is calculated in this manner to obtain a plurality of category matching degree values, the process category corresponding to the highest value is determined as the process category corresponding to the first text set, and the process category corresponding to each of the N text sets is obtained in the same manner.
Further, the customer service text set is divided into N text sets according to the above method, and the process category corresponding to each of the N text sets is obtained. Because each customer service text in the customer service text set carries an outbound result, for each of the N text sets, the text whose outbound result is successful and whose number of occurrences is the largest is determined as the customer service corpus of the process category corresponding to that text set, thereby obtaining the customer service corpus corresponding to each process category in the corpus. The client corpus corresponding to each process category in the corpus is obtained as follows: the ELMo vectors of all client texts contained in the client corpus of the first process category are clustered using the k-means algorithm to obtain m text sets of the client corpus of the first process category; the keywords of the first of the m text sets are matched with the type labels of the client corpus types under the first process category to obtain a plurality of matching values, and the client corpus type corresponding to the maximum value is determined as the client corpus type of the first text set. In the above manner, the client corpus type of each text set in the client corpus of the first process category, and further of each process category, is obtained. The customer service corpus of each process category and the plurality of client corpora corresponding to each process category form the corpus, as shown in Table 1.
TABLE 1 corpus
(Table 1 lists the customer service corpus and the client corpora corresponding to each process category; its contents are provided as an image in the original publication and are not reproduced here.)
And then, the intelligent voice outbound device acquires the customer service corpus and the client corpus under the process category, namely the customer service corpus and the client corpus corresponding to the first process node, from the corpus according to the process category corresponding to the first process node, converts the customer service corpus into voice, obtains first customer service voice data, and sends the first customer service voice data to the target client.
S203, receiving the voice data of the target client, and matching the target client text corresponding to the voice data with the client linguistic data to obtain a matching result.
And S204, determining a target process baseline corresponding to the first process node in the plurality of process baselines according to the matching result.
Wherein the plurality of process baselines are determined based on a historical dialogue voice record, and the process baselines comprise a plurality of process nodes and a process sequence value of each process node in the process baselines.
S205, according to the process sequence value of the first process node in the target process baseline, determining a next process node corresponding to the first process node, and acquiring customer service corpora corresponding to the next process node.
And S206, extracting the voice characteristic information and the emotion text information of the voice data.
And S207, inputting the voice characteristic information and the emotion text information into the emotion recognition model to obtain an emotion label of the voice data.
And S208, acquiring a plurality of groups of customer service client emotion comparison groups containing emotion labels from the emotion comparison matrix.
S209, determining the customer service emotion label in the customer service emotion comparison group with the highest outbound success probability as a target emotion label.
And S210, generating second customer service voice data according to the target emotion label and the customer service corpus corresponding to the next process node, and returning the second customer service voice data to the target client.
Here, the specific implementation manner of steps S203 to S210 may refer to the description of steps S102 to S109 in the embodiment corresponding to fig. 1, and is not described herein again.
In the embodiment of the application, after converting the customer service corpus corresponding to the first process node into voice data and sending it to the target client, the intelligent voice outbound device matches the target client text corresponding to the voice data of the target client with the client corpus corresponding to the first process node, determines a target process baseline corresponding to the first process node among the plurality of process baselines according to the matching result, and determines the next process node corresponding to the first process node according to the process sequence value of the first process node in the target process baseline. The device then obtains the emotion label of the voice data of the target client through the emotion recognition model and, according to this emotion label, determines the customer service emotion label in the customer service client emotion comparison group that contains the emotion label and has the highest outbound success probability in the emotion comparison matrix as the target emotion label. In other words, the emotion label of the reply voice sent to the target client is the customer service emotion label in the comparison group that contains the emotion label of the target client's voice data and has the highest outbound success probability, so the success rate of outbound marketing is improved and the degree of personification of the outbound voice is also improved.
In addition, after receiving the voice data of the target client, the intelligent voice outbound device can also determine the next process node corresponding to the first process node in the following manner:
the intelligent voice outbound device obtains a plurality of optional flow baselines including a first flow node from the plurality of flow baselines, obtains an optional flow node with a flow sequence value n +1 of the first flow node on each optional flow baseline according to a flow sequence value n of the first flow node on each optional flow baseline, further obtains a plurality of optional flow nodes of the first flow node on the plurality of optional flow baselines, calculates a matching degree between a target customer text corresponding to the voice data and a customer corpus corresponding to each optional flow node, obtains a plurality of matching degree values, and determines the optional flow node corresponding to the maximum value in the plurality of matching degree values as a next flow node corresponding to the first flow node. Here, the customer corpus corresponding to each process node is a text of the customer voice conversion acquired by the intelligent voice outbound device at the previous process node of the process node, wherein the customer corpus and the customer corpus corresponding to each process node are both acquired according to the historical dialogue voice record, and in addition, the first process node in the process baseline contains the customer corpus but does not contain the customer corpus.
Please refer to fig. 3, which is a schematic structural diagram of an intelligent voice outbound device with emotion according to an embodiment of the present application. As shown in fig. 3, the intelligent voice outbound call device with emotion 3 includes a first acquiring and sending module 31, a receiving and matching module 32, a target baseline determining module 33, a determining and acquiring module 34, an extracting module 35, a client emotion acquiring module 36, an acquiring and comparing module 37, a client emotion determining module 38 and a reply generating module 39.
The first acquiring and sending module 31 is configured to acquire a customer service corpus and a client corpus corresponding to the first process node, convert the customer service corpus into first customer service voice data, and send the first customer service voice data to a target client;
a receiving matching module 32, configured to receive voice data of the target client, and match a target client text corresponding to the voice data with the client corpus to obtain a matching result;
a target flow baseline determining module 33, configured to determine, according to the matching result, a target flow baseline corresponding to the first flow node among multiple flow baselines, where the multiple flow baselines are determined based on a historical dialogue voice record, and each flow baseline includes multiple flow nodes and a flow sequence value of each flow node in the flow baseline;
a determining and obtaining module 34, configured to determine, according to the process sequence value of the first process node in the target process baseline, a next process node corresponding to the first process node, and obtain a customer service corpus corresponding to the next process node;
the extraction module 35 is configured to extract voice feature information and emotion text information of the voice data;
a client emotion obtaining module 36, configured to input the speech feature information and the emotion text information into an emotion recognition model to obtain an emotion tag of the speech data, where the emotion recognition model is obtained by performing emotion tag training based on each speech data in a sample speech set;
an obtaining comparison group module 37, configured to obtain multiple groups of customer service client emotion comparison groups including the emotion labels from an emotion comparison matrix, where the emotion comparison matrix includes at least one group of customer service client emotion comparison groups and an outbound success probability corresponding to each group of customer service client emotion comparison groups, and the outbound success probability corresponding to the customer service client emotion comparison group is determined by a historical dialogue voice record that includes the customer service emotion labels and the client emotion labels of the customer service client emotion comparison groups and that is successful in outbound, where the historical dialogue voice record includes the customer service emotion labels, the client emotion labels and the outbound results of each dialogue voice record;
a customer service emotion determining module 38, configured to determine a customer service emotion tag in a customer service emotion comparison group with the highest probability of successful outbound call as a target emotion tag;
and a reply generation module 39, configured to generate the second customer service voice data according to the target emotion tag and the customer service corpus corresponding to the next process node, and reply the second customer service voice data to the target client.
Optionally, the apparatus further comprises: a flow baseline determination module 310.
The process baseline determination module 310 is configured to convert each of the historical dialogue speech records into a plurality of texts, where each of the plurality of texts carries a dialogue sequence value, and the plurality of texts includes a first text;
calculating the matching degree between the keywords of the first text and the respective process labels of each process category to obtain a plurality of category matching degree values between the first text and each process category;
determining a process category corresponding to the maximum value in the plurality of category matching values as a process category of the first text;
obtaining a dialogue link of each dialogue voice record according to the dialogue sequence value carried by each text and the process category of each text, and further obtaining a dialogue link set of the historical dialogue voice record, wherein the dialogue link comprises a plurality of process nodes, and the process nodes correspond to the process categories one by one;
and determining the conversation link with the occurrence frequency larger than a preset occurrence frequency threshold value in the conversation link set as the process baseline.
Optionally, the target baseline determination module 33 is specifically configured to:
acquiring at least one optional flow baseline comprising the first flow node from the plurality of flow baselines, wherein the first flow node in each optional flow baseline carries a preset matching result of the target customer text and the customer corpus;
and determining the optional process baseline which is consistent with the matching result in the preset matching result carried by the first process node in the at least one optional process baseline as the target process baseline.
Optionally, the apparatus further comprises: emotion matching matrix determination module 311.
The emotion comparison matrix determining module 311 is configured to divide each piece of dialogue voice data into client voice data and customer service voice data;
inputting the voice characteristic information and emotion text information of the client voice data into the emotion recognition model to obtain a client emotion label of each conversation voice record;
inputting the voice characteristic information and emotion text information of the customer service voice data into the emotion recognition model to obtain a customer service emotion label of each conversation voice record;
and calculating to obtain the emotion comparison matrix according to the client emotion label, the customer service emotion label and the outbound result recorded by each conversation voice.
Optionally, the historical dialogue voice records include a first dialogue voice record;
the emotion comparison matrix determining module 311 is configured to obtain a plurality of voice segments according to the first dialogue voice data, and extract the MFCC features of each of the plurality of voice segments;
inputting the MFCC features into an identity vector model to obtain identity vector features of each voice segment;
calculating the similarity between every two identity vector features in the first dialogue voice data;
clustering each voice segment in the first dialogue voice data according to the similarity to obtain a first speaker voice set and a second speaker voice set;
calculating the matching degree between the keywords corresponding to each speaker voice set and the customer service label and the customer label respectively, and determining the label with the highest matching degree as the label corresponding to each speaker voice set;
and dividing the first dialogue voice data into the client voice data and the customer service voice data according to the label corresponding to each speaker voice set.
Optionally, the emotion recognition model includes a convolutional layer, a cyclic layer and a transcription layer;
the client emotion obtaining module 36 is specifically configured to:
fusing the voice characteristic information and the emotion text information to obtain fusion information, and inputting the fusion information into the emotion recognition model;
extracting features of the fusion information through a convolutional layer to obtain emotional features of the voice data;
predicting the emotion characteristics through the circulation layer to obtain a prediction sequence of the voice data;
converting, by the transcription layer, the predicted sequence into an emotion tag for the speech data.
Optionally, the extracting module 35 is specifically configured to:
splitting text information corresponding to the voice data into at least one word;
calculating the degree of correlation between a target word in the at least one word and each emotion word in a preset emotion text set respectively to obtain a plurality of degree of correlation values between the target word and each emotion word;
determining a maximum value of the plurality of relevance degree values as an emotion score of the target word, thereby obtaining an emotion score of each word in the text information;
determining words with emotion scores larger than a preset threshold value as emotion text information of the voice data.
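Illustratively, the word-level emotion scoring performed by the extraction module 35 can be sketched as follows; the relevance() function and the threshold value are placeholders assumed for the sketch.

    def emotion_text(words, emotion_lexicon, relevance, threshold=0.5):
        # relevance(word, emotion_word) is a stand-in correlation function and the
        # threshold is illustrative; words scoring above it form the emotion text
        # information of the voice data.
        selected = []
        for word in words:
            score = max(relevance(word, e) for e in emotion_lexicon)
            if score > threshold:
                selected.append(word)
        return selected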
It is understood that the intelligent speech calling device with emotion 3 is used for implementing the steps executed by the intelligent speech calling device with emotion in the embodiment of fig. 1 and 2. For specific implementation and corresponding beneficial effects of the functional blocks included in the intelligent speech outbound device 3 with emotion of fig. 3, reference may be made to the detailed description of the embodiments of fig. 1 and fig. 2, which is not repeated herein.
The intelligent voice outbound device 3 with emotion in the embodiment shown in fig. 3 can be implemented by the server 400 shown in fig. 4. Please refer to fig. 4, which provides a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 4, the server 400 may include: one or more processors 401, memory 402, and a transceiver 403. The processor 401, memory 402 and transceiver 403 are connected by a bus 404. Wherein the transceiver 403 is configured to receive or transmit data, and the memory 402 is configured to store a computer program, which includes program instructions; processor 401 is configured to execute program instructions stored in memory 402 to perform the following operations:
the method comprises the steps of obtaining customer service corpora and customer corpora corresponding to a first process node, converting the customer service corpora into first customer service voice data and sending the first customer service voice data to a target customer;
receiving voice data of the target client, and matching a target client text corresponding to the voice data with the client corpus to obtain a matching result;
determining a target process baseline corresponding to the first process node in a plurality of process baselines according to the matching result, wherein the plurality of process baselines are determined based on historical dialogue voice records, and each process baseline comprises a plurality of process nodes and a process sequence value of each process node in the process baselines;
determining a next process node corresponding to the first process node according to the process sequence value of the first process node in the target process baseline, and acquiring customer service corpora corresponding to the next process node;
extracting voice characteristic information and emotion text information of the voice data;
inputting the voice characteristic information and the emotion text information into an emotion recognition model to obtain emotion labels of the voice data, wherein the emotion recognition model is obtained by training the emotion labels of each voice data in a sample voice set;
acquiring a plurality of groups of customer service client emotion comparison groups containing emotion labels from an emotion comparison matrix, wherein the emotion comparison matrix comprises at least one group of customer service client emotion comparison groups and outbound success probabilities corresponding to each group of customer service client emotion comparison groups, and the outbound success probabilities corresponding to the customer service client emotion comparison groups are determined by the historical dialogue voice records containing the customer service emotion labels and the customer emotion labels of the customer service client emotion comparison groups and the successful outbound dialogue voice records, wherein the historical dialogue voice records comprise the customer service emotion labels, the customer emotion labels and the outbound results of each dialogue voice record;
determining a customer service emotion label in a customer service client emotion contrast group with the highest outbound success probability as a target emotion label;
and generating the second customer service voice data according to the target emotion label and the customer service corpus corresponding to the next process node, and replying the second customer service voice data to the target client.
Optionally, before the processor 401 obtains the customer service corpus and the customer corpus corresponding to the first process node, the following operation is further specifically performed:
converting each of the historical dialogue voice records into a plurality of texts, wherein each text in the plurality of texts carries a dialogue sequence value, and the plurality of texts comprises a first text;
calculating the matching degree between the keywords of the first text and the respective process labels of each process category to obtain a plurality of category matching degree values between the first text and each process category;
determining a process category corresponding to the maximum value in the plurality of category matching values as a process category of the first text;
obtaining a dialogue link of each dialogue voice record according to the dialogue sequence value carried by each text and the process category of each text, and further obtaining a dialogue link set of the historical dialogue voice record, wherein the dialogue link comprises a plurality of process nodes, and the process nodes correspond to the process categories one by one;
and determining the conversation link with the occurrence frequency larger than a preset occurrence frequency threshold value in the conversation link set as the process baseline.
Optionally, the processor 401 determines, according to the matching result, a target process baseline corresponding to the first process node among the multiple process baselines, and specifically performs the following operations:
acquiring at least one optional flow baseline comprising the first flow node from the plurality of flow baselines, wherein the first flow node in each optional flow baseline carries a preset matching result of the target customer text and the customer corpus;
and determining the optional process baseline which is consistent with the matching result in the preset matching result carried by the first process node in the at least one optional process baseline as the target process baseline.
Optionally, before the processor 401 obtains a plurality of groups of emotion comparison groups of customer service clients including the emotion labels from the emotion comparison matrix, the following operations are specifically performed:
dividing each dialogue voice data into client voice data and client service voice data;
inputting the voice characteristic information and emotion text information of the client voice data into the emotion recognition model to obtain a client emotion label of each conversation voice record;
inputting the voice characteristic information and emotion text information of the customer service voice data into the emotion recognition model to obtain a customer service emotion label of each conversation voice record;
and calculating to obtain the emotion comparison matrix according to the client emotion label, the customer service emotion label and the outbound result recorded by each conversation voice.
Optionally, the historical dialogue voice records include a first dialogue voice record;
the processor 401 divides each of the dialogue voice data into client voice data and customer service voice data, and specifically performs the following operations:
obtaining a plurality of voice segments according to the first dialogue voice data, and extracting MFCC (Mel-frequency cepstral coefficient) features of each voice segment in the plurality of voice segments;
inputting the MFCC features into an identity vector model to obtain identity vector features of each voice segment;
calculating the similarity between every two identity vector features in the first dialogue voice data;
clustering each voice segment in the first dialogue voice data according to the similarity to obtain a first speaker voice set and a second speaker voice set;
calculating the matching degree between the keywords corresponding to each speaker voice set and the customer service label and the customer label respectively, and determining the label with the highest matching degree as the label corresponding to each speaker voice set;
and dividing the first dialogue voice data into the client voice data and the customer service voice data according to the label corresponding to each speaker voice set.
Optionally, the emotion recognition model includes a convolutional layer, a cyclic layer and a transcription layer;
the processor 401 inputs the speech feature information and the emotion text information into an emotion recognition model to obtain an emotion tag of the speech data, and specifically executes the following operations:
fusing the voice characteristic information and the emotion text information to obtain fusion information, and inputting the fusion information into the emotion recognition model;
extracting features of the fusion information through a convolutional layer to obtain emotional features of the voice data;
predicting the emotion characteristics through the circulation layer to obtain a prediction sequence of the voice data;
converting, by the transcription layer, the predicted sequence into an emotion tag for the speech data.
Optionally, the processor 401 extracts emotion text information of the voice data, and specifically performs the following operations:
splitting text information corresponding to the voice data into at least one word;
calculating the degree of correlation between a target word in the at least one word and each emotion word in a preset emotion text set respectively to obtain a plurality of degree of correlation values between the target word and each emotion word;
determining a maximum value of the plurality of relevance degree values as an emotion score of the target word, thereby obtaining an emotion score of each word in the text information;
determining words with emotion scores larger than a preset threshold value as emotion text information of the voice data.
In an embodiment of the present application, a computer storage medium may be provided, which may be used to store computer software instructions for the intelligent speech outbound device with emotion in the embodiment shown in fig. 3, and which contains a program designed for the intelligent speech outbound device with emotion in the embodiment. The storage medium includes, but is not limited to, flash memory, hard disk, solid state disk.
In the embodiment of the present application, a computer program product is also provided, and when executed by a computing device, the computer program product can implement the functions of the intelligent voice outbound device with emotion designed in the embodiment shown in fig. 3.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
In the present application, "A and/or B" means one of the following cases: A, B, or A and B. "At least one of ..." refers to any combination of the listed items or any number of the listed items; for example, "at least one of A, B and C" refers to any one of the following seven cases: A; B; C; A and B; B and C; A and C; A, B and C.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The method and the related apparatus provided by the embodiments of the present application are described with reference to the flowchart and/or the structural diagram of the method provided by the embodiments of the present application, and each flow and/or block of the flowchart and/or the structural diagram of the method, and the combination of the flow and/or block in the flowchart and/or the block diagram can be specifically implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block or blocks of the block diagram. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block or blocks.
The above disclosure is only for the purpose of illustrating the preferred embodiments of the present application and is not to be construed as limiting the scope of the present application, so that the present application is not limited thereto, and all equivalent variations and modifications can be made to the present application.

Claims (10)

1. An intelligent voice outbound method with emotion, characterized by comprising the following steps:
obtaining a customer service corpus and a client corpus corresponding to a first process node, converting the customer service corpus into first customer service voice data, and sending the first customer service voice data to a target client;
receiving voice data of the target client, and matching a target client text corresponding to the voice data with the client corpus to obtain a matching result;
determining a target process baseline corresponding to the first process node in a plurality of process baselines according to the matching result, wherein the plurality of process baselines are determined based on historical dialogue voice records, and each process baseline comprises a plurality of process nodes and a process sequence value of each process node in the process baselines;
determining a next process node corresponding to the first process node according to the process sequence value of the first process node in the target process baseline, and acquiring customer service corpora corresponding to the next process node;
extracting voice characteristic information and emotion text information of the voice data;
inputting the voice characteristic information and the emotion text information into an emotion recognition model to obtain an emotion label of the voice data, wherein the emotion recognition model is trained using a sample voice set and the emotion label of each piece of voice data in the sample voice set;
acquiring a plurality of groups of customer service client emotion comparison groups containing the emotion label from an emotion comparison matrix, wherein the emotion comparison matrix comprises at least one group of customer service client emotion comparison groups and an outbound success probability corresponding to each group of customer service client emotion comparison groups, the outbound success probability corresponding to a customer service client emotion comparison group is determined from the historical dialogue voice records containing the customer service emotion label and the client emotion label of that group and from the dialogue voice records of successful outbound calls, and the historical dialogue voice records comprise the customer service emotion label, the client emotion label and the outbound result of each dialogue voice record;
determining the customer service emotion label in the customer service client emotion comparison group with the highest outbound success probability as a target emotion label;
and generating second customer service voice data according to the target emotion label and the customer service corpus corresponding to the next process node, and replying to the target client with the second customer service voice data.
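For illustration only, the following Python sketch shows how the node-advance and emotion-selection steps of claim 1 could be realised over plain dictionaries. The data shapes (a node-to-sequence-value map for a process baseline, an (agent emotion, client emotion)-to-probability map for the emotion comparison matrix) and all example values are assumptions, not taken from the patent.

```python
# Sketch, not the patent's implementation: next-node and target-emotion selection.

def choose_next_node(baseline_sequence, current_node):
    """baseline_sequence maps each process node to its process sequence value."""
    current = baseline_sequence[current_node]
    later = {n: v for n, v in baseline_sequence.items() if v > current}
    return min(later, key=later.get) if later else None   # node with next-smallest value

def choose_agent_emotion(emotion_matrix, client_emotion):
    """emotion_matrix maps (agent_emotion, client_emotion) to outbound success probability."""
    candidates = {agent: p for (agent, client), p in emotion_matrix.items()
                  if client == client_emotion}
    return max(candidates, key=candidates.get)             # highest success probability

if __name__ == "__main__":
    seq = {"greeting": 1, "product_intro": 2, "objection_handling": 3, "closing": 4}
    matrix = {("soothing", "angry"): 0.42, ("cheerful", "angry"): 0.18,
              ("cheerful", "happy"): 0.61, ("neutral", "happy"): 0.35}
    print(choose_next_node(seq, "product_intro"))   # -> objection_handling
    print(choose_agent_emotion(matrix, "angry"))    # -> soothing
```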
2. The method of claim 1, wherein before the obtaining of the customer service corpus and the client corpus corresponding to the first process node, the method further comprises:
converting each of the historical dialogue voice records into a plurality of texts, wherein each text in the plurality of texts carries a dialogue sequence value, and the plurality of texts comprises a first text;
calculating the matching degree between the keywords of the first text and the respective process labels of each process category to obtain a plurality of category matching degree values between the first text and each process category;
determining the process category corresponding to the maximum value among the plurality of category matching degree values as the process category of the first text;
obtaining a dialogue link of each dialogue voice record according to the dialogue sequence value carried by each text and the process category of each text, thereby obtaining a dialogue link set of the historical dialogue voice records, wherein each dialogue link comprises a plurality of process nodes, and the process nodes correspond to the process categories one to one;
and determining the conversation link with the occurrence frequency larger than a preset occurrence frequency threshold value in the conversation link set as the process baseline.
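A minimal sketch of the baseline-mining procedure of claim 2, assuming each historical dialogue voice record has already been transcribed into an ordered list of texts; the keyword sets per process category and the occurrence threshold are illustrative assumptions.

```python
# Sketch: classify each text into a process category by keyword matching degree,
# form the dialogue link per record, and keep frequent links as process baselines.
from collections import Counter

CATEGORY_KEYWORDS = {          # illustrative keyword table, not from the patent
    "greeting": {"hello", "morning", "calling"},
    "product_intro": {"rate", "card", "offer"},
    "closing": {"thanks", "goodbye", "bye"},
}

def classify_text(text):
    """Process category whose keywords overlap the text the most."""
    words = set(text.lower().split())
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get)

def build_process_baselines(records, threshold=1):
    """records: list of dialogues, each an ordered list of transcribed texts."""
    links = Counter()
    for dialogue in records:
        # Dialogue link: the sequence of process categories in dialogue order.
        links[tuple(classify_text(t) for t in dialogue)] += 1
    # Keep links whose occurrence count exceeds the preset threshold.
    return [list(link) for link, count in links.items() if count > threshold]
```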
3. The method of claim 1, wherein said determining a target process baseline corresponding to the first process node among a plurality of process baselines according to the matching result comprises:
acquiring at least one optional process baseline comprising the first process node from the plurality of process baselines, wherein the first process node in each optional process baseline carries a preset matching result of the target client text and the client corpus;
and determining, among the at least one optional process baseline, the optional process baseline whose preset matching result carried by the first process node is consistent with the matching result as the target process baseline.
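One possible reading of claim 3 as code, under the assumption that every optional process baseline stores, per process node, the preset matching result it expects for the client reply at that node:

```python
# Sketch (assumed data shape, not from the patent text).

def select_target_baseline(process_baselines, first_node, matching_result):
    optional = [b for b in process_baselines if first_node in b["expected_match"]]
    for baseline in optional:
        if baseline["expected_match"][first_node] == matching_result:
            return baseline
    return None  # no optional baseline is consistent with the matching result
```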
4. The method of claim 1, wherein before the obtaining of the plurality of groups of customer service client emotion comparison groups containing the emotion label from the emotion comparison matrix, the method further comprises:
dividing each dialogue voice record into client voice data and customer service voice data;
inputting the voice characteristic information and emotion text information of the client voice data into the emotion recognition model to obtain a client emotion label of each dialogue voice record;
inputting the voice characteristic information and emotion text information of the customer service voice data into the emotion recognition model to obtain a customer service emotion label of each dialogue voice record;
and calculating the emotion comparison matrix according to the client emotion label, the customer service emotion label, and the outbound result of each dialogue voice record.
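The emotion comparison matrix of claim 4 can be understood as an empirical success rate per (customer service emotion, client emotion) pair. The sketch below assumes each labelled historical record is a dictionary with the two emotion labels and an outbound result flag; that format is an assumption for illustration.

```python
# Sketch: empirical outbound success probability per emotion pair.
from collections import defaultdict

def build_emotion_matrix(records):
    """records: iterable of dicts with 'agent_emotion', 'client_emotion', 'success' keys."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for r in records:
        key = (r["agent_emotion"], r["client_emotion"])
        totals[key] += 1
        successes[key] += 1 if r["success"] else 0
    # Matrix entry: successful outbound calls / all calls with this emotion pair.
    return {key: successes[key] / totals[key] for key in totals}
```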
5. The method of claim 4, wherein the historical dialogue voice records comprise a first dialogue voice record;
the dividing of each dialogue voice record into client voice data and customer service voice data comprises:
obtaining a plurality of voice segments from the first dialogue voice record, and extracting Mel-frequency cepstral coefficient (MFCC) features of each voice segment in the plurality of voice segments;
inputting the MFCC features into an identity vector model to obtain identity vector features of each voice segment;
calculating the similarity between every two identity vector features in the first dialogue voice record;
clustering the voice segments in the first dialogue voice record according to the similarity to obtain a first speaker voice set and a second speaker voice set;
calculating the matching degree between the keywords corresponding to each speaker voice set and a customer service label and a client label respectively, and determining the label with the highest matching degree as the label corresponding to that speaker voice set;
and dividing the first dialogue voice record into the client voice data and the customer service voice data according to the label corresponding to each speaker voice set.
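A simplified sketch of the speaker-separation steps of claim 5: librosa MFCCs stand in for the claimed MFCC extraction, the per-segment mean MFCC vector stands in for the identity vector features (a real system would use a trained identity vector model), and length-normalised k-means approximates the cosine-similarity clustering. Segment length, sampling rate, and file handling are assumptions.

```python
# Sketch: split one dialogue recording into two speaker clusters.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def segment_embeddings(wav_path, segment_sec=2.0, n_mfcc=20):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(segment_sec * sr)
    segments = [y[i:i + hop] for i in range(0, len(y), hop)
                if len(y[i:i + hop]) > sr // 2]          # drop fragments under 0.5 s
    # One embedding per segment: mean MFCC over time (identity-vector stand-in).
    return [librosa.feature.mfcc(y=s, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
            for s in segments]

def split_speakers(wav_path):
    emb = np.vstack(segment_embeddings(wav_path))
    # Length-normalise so Euclidean k-means approximates cosine-similarity clustering.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
    return labels   # cluster id per segment; keywords then map clusters to agent/client
```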
6. The method of claim 1, wherein the emotion recognition model comprises a convolutional layer, a cyclic layer, and a transcription layer;
the inputting the voice characteristic information and the emotion text information into an emotion recognition model to obtain an emotion label of the voice data comprises:
fusing the voice characteristic information and the emotion text information to obtain fusion information, and inputting the fusion information into the emotion recognition model;
extracting features of the fusion information through the convolutional layer to obtain emotion features of the voice data;
predicting the emotion features through the cyclic layer to obtain a predicted sequence of the voice data;
and converting, through the transcription layer, the predicted sequence into the emotion label of the voice data.
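A structural sketch of the three layers named in claim 6, written in PyTorch under assumptions: the fused voice/emotion-text information arrives as a (batch, time, feature) tensor, and the transcription layer is approximated by a linear classifier over the pooled recurrent outputs rather than a full sequence-to-label transcription.

```python
# Sketch of a convolutional + recurrent ("cyclic") + transcription stack.
import torch
import torch.nn as nn

class EmotionCRNN(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, num_emotions=5):
        super().__init__()
        # Convolutional layer: local patterns over the fused feature sequence.
        self.conv = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        # Recurrent layer: sequence prediction over the emotion features.
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # Transcription layer (approximation): maps the predicted sequence to a label.
        self.out = nn.Linear(2 * hidden, num_emotions)

    def forward(self, fused):                 # fused: (batch, time, feat_dim)
        x = self.conv(fused.transpose(1, 2))  # -> (batch, hidden, time)
        x, _ = self.rnn(x.transpose(1, 2))    # -> (batch, time, 2 * hidden)
        return self.out(x.mean(dim=1))        # pooled -> (batch, num_emotions) logits

logits = EmotionCRNN()(torch.randn(2, 100, 64))   # toy forward pass
```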
7. The method of claim 1, wherein the extracting of the emotion text information of the voice data comprises:
splitting text information corresponding to the voice data into at least one word;
calculating the degree of correlation between a target word in the at least one word and each emotion word in a preset emotion text set, to obtain a plurality of correlation degree values between the target word and the emotion words;
determining the maximum value of the plurality of correlation degree values as an emotion score of the target word, thereby obtaining an emotion score of each word in the text information;
and determining words with emotion scores greater than a preset threshold as the emotion text information of the voice data.
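A toy sketch of the word-level scoring in claim 7: difflib string similarity stands in for the claimed correlation degree between a word and each emotion word (a real system would more likely use word embeddings), and the emotion lexicon and threshold are illustrative assumptions.

```python
# Sketch: score each word against an emotion lexicon and keep high-scoring words.
from difflib import SequenceMatcher

EMOTION_LEXICON = ["angry", "annoyed", "happy", "pleased", "worried", "satisfied"]

def emotion_words(text, threshold=0.75):
    selected = []
    for word in text.lower().split():
        # Correlation of this word with each emotion word; the maximum is its score.
        score = max(SequenceMatcher(None, word, emo).ratio() for emo in EMOTION_LEXICON)
        if score > threshold:
            selected.append(word)
    return selected   # emotion text information of the utterance

print(emotion_words("I am really annoyed about this extra fee"))   # -> ['annoyed']
```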
8. An intelligent voice outbound device with emotion, characterized by comprising:
a first acquiring and sending module, configured to acquire a customer service corpus and a client corpus corresponding to a first process node, convert the customer service corpus into first customer service voice data, and send the first customer service voice data to a target client;
a receiving and matching module, configured to receive voice data of the target client, and match a target client text corresponding to the voice data with the client corpus to obtain a matching result;
a target baseline determining module, configured to determine, according to the matching result, a target process baseline corresponding to the first process node among multiple process baselines, where the multiple process baselines are determined based on a historical dialogue voice record, and each process baseline includes multiple process nodes and a process sequence value of each process node in the process baseline;
a determining and obtaining module, configured to determine, according to a process sequence value of the first process node in the target process baseline, a next process node corresponding to the first process node, and obtain a customer service corpus corresponding to the next process node;
an extraction module, configured to extract voice characteristic information and emotion text information of the voice data;
a client emotion acquisition module, configured to input the voice characteristic information and the emotion text information into an emotion recognition model to obtain an emotion label of the voice data, wherein the emotion recognition model is trained using a sample voice set and the emotion label of each piece of voice data in the sample voice set;
a comparison group obtaining module, configured to obtain a plurality of groups of customer service client emotion comparison groups containing the emotion label from an emotion comparison matrix, wherein the emotion comparison matrix comprises at least one group of customer service client emotion comparison groups and an outbound success probability corresponding to each group of customer service client emotion comparison groups, the outbound success probability corresponding to a customer service client emotion comparison group is determined from the historical dialogue voice records containing the customer service emotion label and the client emotion label of that group and from the dialogue voice records of successful outbound calls, and the historical dialogue voice records comprise the customer service emotion label, the client emotion label and the outbound result of each dialogue voice record;
a customer service emotion determining module, configured to determine the customer service emotion label in the customer service client emotion comparison group with the highest outbound success probability as a target emotion label;
and a generating and replying module, configured to generate second customer service voice data according to the target emotion label and the customer service corpus corresponding to the next process node, and reply to the target client with the second customer service voice data.
9. A server, comprising a processor, a memory, and a transceiver that are connected to each other, wherein the transceiver is configured to receive or transmit data, the memory is configured to store program code, and the processor is configured to call the program code to execute the intelligent voice outbound method with emotion according to any one of claims 1-7.
10. A storage medium, characterized in that the storage medium stores a computer program comprising program instructions; the program instructions, when executed by a processor, cause the processor to perform the intelligent voice outbound method with emotion of any of claims 1-7.
CN202010699699.0A 2020-07-20 2020-07-20 Intelligent voice outbound method and device with emotion, server and storage medium Active CN111916111B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010699699.0A CN111916111B (en) 2020-07-20 2020-07-20 Intelligent voice outbound method and device with emotion, server and storage medium


Publications (2)

Publication Number Publication Date
CN111916111A (en) 2020-11-10
CN111916111B (en) 2023-02-03

Family

ID=73280488

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010699699.0A Active CN111916111B (en) 2020-07-20 2020-07-20 Intelligent voice outbound method and device with emotion, server and storage medium

Country Status (1)

Country Link
CN (1) CN111916111B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190182383A1 (en) * 2017-12-08 2019-06-13 Asapp, Inc. Transfer of customer support to automated processing
CN110298682A (en) * 2019-05-22 2019-10-01 深圳壹账通智能科技有限公司 Intelligent Decision-making Method, device, equipment and medium based on user information analysis
CN110351444A (en) * 2019-06-20 2019-10-18 杭州智飘网络科技有限公司 A kind of intelligent sound customer service system
CN110442701A (en) * 2019-08-15 2019-11-12 苏州思必驰信息科技有限公司 Voice dialogue processing method and device
CN110457709A (en) * 2019-08-16 2019-11-15 北京一链数云科技有限公司 Outgoing call dialog process method, apparatus and server
CN111078846A (en) * 2019-11-25 2020-04-28 青牛智胜(深圳)科技有限公司 Multi-turn dialog system construction method and system based on business scene
CN110995945A (en) * 2019-11-29 2020-04-10 中国银行股份有限公司 Data processing method, device, equipment and system for generating outbound flow
CN110895940A (en) * 2019-12-17 2020-03-20 集奥聚合(北京)人工智能科技有限公司 Intelligent voice interaction method and device
CN110955770A (en) * 2019-12-18 2020-04-03 圆通速递有限公司 Intelligent dialogue system
CN111063370A (en) * 2019-12-31 2020-04-24 中国银行股份有限公司 Voice processing method and device

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967725A (en) * 2021-02-26 2021-06-15 平安科技(深圳)有限公司 Voice conversation data processing method and device, computer equipment and storage medium
CN112835805A (en) * 2021-02-26 2021-05-25 中国银行股份有限公司 Customer service system testing method and device and electronic equipment
CN113593521A (en) * 2021-07-29 2021-11-02 北京三快在线科技有限公司 Speech synthesis method, device, equipment and readable storage medium
CN113593521B (en) * 2021-07-29 2022-09-20 北京三快在线科技有限公司 Speech synthesis method, device, equipment and readable storage medium
CN114022192A (en) * 2021-10-20 2022-02-08 百融云创科技股份有限公司 Data modeling method and system based on intelligent marketing scene
CN113743126A (en) * 2021-11-08 2021-12-03 北京博瑞彤芸科技股份有限公司 Intelligent interaction method and device based on user emotion
CN113743126B (en) * 2021-11-08 2022-06-14 北京博瑞彤芸科技股份有限公司 Intelligent interaction method and device based on user emotion
CN114025050A (en) * 2021-11-08 2022-02-08 浙江百应科技有限公司 Speech recognition method and device based on intelligent outbound and text analysis
CN114242070A (en) * 2021-12-20 2022-03-25 阿里巴巴(中国)有限公司 Video generation method, device, equipment and storage medium
CN115188374A (en) * 2022-06-22 2022-10-14 百融睿诚信息科技有限公司 Method and device for updating dialect
CN115171731A (en) * 2022-07-11 2022-10-11 腾讯科技(深圳)有限公司 Emotion category determination method, device and equipment and readable storage medium
CN115083434A (en) * 2022-07-22 2022-09-20 平安银行股份有限公司 Emotion recognition method and device, computer equipment and storage medium
CN115083434B (en) * 2022-07-22 2022-11-25 平安银行股份有限公司 Emotion recognition method and device, computer equipment and storage medium
CN116389644A (en) * 2022-11-10 2023-07-04 八度云计算(安徽)有限公司 Outbound system based on big data analysis
CN116389644B (en) * 2022-11-10 2023-11-03 八度云计算(安徽)有限公司 Outbound system based on big data analysis

Also Published As

Publication number Publication date
CN111916111B (en) 2023-02-03

Similar Documents

Publication Publication Date Title
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
JP5768093B2 (en) Speech processing system
EP4018437B1 (en) Optimizing a keyword spotting system
Demircan et al. Feature extraction from speech data for emotion recognition
CN111932296B (en) Product recommendation method and device, server and storage medium
US5594834A (en) Method and system for recognizing a boundary between sounds in continuous speech
JPH06274200A (en) Equipment and method for audio coding
US5734793A (en) System for recognizing spoken sounds from continuous speech and method of using same
CN107767881B (en) Method and device for acquiring satisfaction degree of voice information
CN112017694B (en) Voice data evaluation method and device, storage medium and electronic device
CN114550703A (en) Training method and device of voice recognition system, and voice recognition method and device
CN113889090A (en) Multi-language recognition model construction and training method based on multi-task learning
CN114708857A (en) Speech recognition model training method, speech recognition method and corresponding device
CN116631412A (en) Method for judging voice robot through voiceprint matching
CN111968652A (en) Speaker identification method based on 3DCNN-LSTM and storage medium
Thukroo et al. Spoken language identification system for kashmiri and related languages using mel-spectrograms and deep learning approach
CA2190619A1 (en) Speech-recognition system utilizing neural networks and method of using same
CN111640423B (en) Word boundary estimation method and device and electronic equipment
Sakamoto et al. Stargan-vc+ asr: Stargan-based non-parallel voice conversion regularized by automatic speech recognition
Syfullah et al. Efficient vector code-book generation using K-means and Linde-Buzo-Gray (LBG) algorithm for Bengali voice recognition
Mengistu Automatic text independent amharic language speaker recognition in noisy environment using hybrid approaches of LPCC, MFCC and GFCC
Singh et al. Speaker Recognition Assessment in a Continuous System for Speaker Identification
Andra et al. Contextual keyword spotting in lecture video with deep convolutional neural network
CN114694688A (en) Speech analyzer and related methods
Nijhawan et al. Real time speaker recognition system for hindi words

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant