CN110782916B - Multi-mode complaint identification method, device and system - Google Patents

Multi-mode complaint identification method, device and system Download PDF

Info

Publication number
CN110782916B
Authority
CN
China
Prior art keywords
emotion
voice
complaint
model
acoustic waveform
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910943563.7A
Other languages
Chinese (zh)
Other versions
CN110782916A (en)
Inventor
苏绥绥
常富洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qilu Information Technology Co Ltd
Original Assignee
Beijing Qilu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qilu Information Technology Co Ltd filed Critical Beijing Qilu Information Technology Co Ltd
Priority to CN201910943563.7A priority Critical patent/CN110782916B/en
Publication of CN110782916A publication Critical patent/CN110782916A/en
Application granted granted Critical
Publication of CN110782916B publication Critical patent/CN110782916B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00: Commerce
    • G06Q30/01: Customer relationship services
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Accounting & Taxation (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Finance (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Psychiatry (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a multi-modal complaint identification method, device and system for identifying whether user call content contains complaint content. The method comprises the following steps: receiving user voice in the user call and converting the user voice into an acoustic waveform; converting the acoustic waveform into image sequence data while recognizing text content data of the acoustic waveform; calculating a score reflecting the probability of a complaint from the image sequence data and the text content data; and judging, according to the score, whether the user call contains complaint content. By converting the user voice into both image sequence data and text content data and performing emotion recognition on both, the invention improves the accuracy of emotion recognition.

Description

Multi-mode complaint identification method, device and system
Technical Field
The invention relates to the field of computer information processing, in particular to a multi-mode complaint identification method, device and system.
Background
Customer service centers are the main bridge between enterprises and their users, and the main channel for improving user satisfaction. In the past, customer service centers relied mainly on human agents, with professional customer service personnel serving users.
With the development of computer information processing technology, more and more customer service centers have begun to adopt voice robots to serve users, solving the problem of excessively long waits for human agents.
At present, voice robots generally cannot recognize the user's emotion. To address this, some customer service centers have introduced voice recognition to analyze and judge the customer's emotion. However, recognition through voice alone is not very accurate, leading to misjudgments and omissions.
There is therefore a need for a technique that recognizes the user's emotion more accurately and from multiple angles, discovers the user's emotional fluctuations earlier, and reduces user complaints.
Disclosure of Invention
The invention aims to solve the problem of low accuracy of the existing user emotion recognition technology.
In order to solve the above technical problem, a first aspect of the present invention provides a multi-modal complaint recognition method for recognizing whether a user call content includes complaint content, the complaint recognition method comprising:
receiving user voice in the user call, and converting the user voice into an acoustic waveform;
converting the acoustic waveform into image sequence data while identifying text content data of the acoustic waveform;
calculating a score reflecting a probability of complaints from the image sequence data and the text content data;
and judging whether the user call contains complaint content or not according to the score.
According to a preferred embodiment of the present invention, calculating a score reflecting a probability of complaints from the image sequence data and the text content data includes:
and inputting the image sequence data and the text content data into a complaint probability judging model for calculation, wherein the complaint probability judging model is a machine self-learning model trained on historical user call records.
According to a preferred embodiment of the present invention, inputting the image sequence data and the text content data into a complaint probability judging model for calculation includes:
and vectorizing the image sequence data and the text content data, and inputting the vectorized data into the complaint probability judging model for calculation.
According to a preferred embodiment of the present invention, converting the user voice input into an acoustic waveform specifically comprises: detecting the voice input using a VAD (voice activity detection) algorithm to obtain the acoustic waveform.
According to a preferred embodiment of the present invention, continuously sampling the acoustic waveform specifically comprises: setting the length of a sliding window as the sampling period, setting an overlap length for the sliding window, and cutting the acoustic waveform with the overlapping sliding window to obtain a series of waveform samples.
According to a preferred embodiment of the present invention, the speech emotion judgment model is a recurrent neural network (RNN) model.
According to a preferred embodiment of the present invention, the text emotion judgment model is a convolutional neural network (CNN) model.
In order to solve the above-mentioned technical problem, a second aspect of the present invention provides a multi-modal complaint recognition device for recognizing whether a user call content includes complaint content, the complaint recognition device comprising:
the voice receiving module is used for receiving user voice in the user call and converting the user voice into an acoustic waveform;
the voice conversion module is used for converting the acoustic waveform into image sequence data and identifying text content data of the acoustic waveform at the same time;
a probability calculation module for calculating a score reflecting a probability of complaints from the image sequence data and the text content data;
and the complaint judging module is used for judging whether the user call contains complaint content according to the score.
According to a preferred embodiment of the present invention, calculating a score reflecting a probability of complaints from the image sequence data and the text content data includes:
and inputting the image sequence data and the text content data into a complaint probability judging model for calculation, wherein the complaint probability judging model is a machine self-learning model trained on historical user call records.
According to a preferred embodiment of the present invention, inputting the image sequence data and the text content data into a complaint probability judging model for calculation includes:
and vectorizing the image sequence data and the text content data, and inputting the vectorized data into the complaint probability judging model for calculation.
According to a preferred embodiment of the present invention, converting the user voice input into an acoustic waveform specifically comprises: detecting the voice input using a VAD (voice activity detection) algorithm to obtain the acoustic waveform.
According to a preferred embodiment of the present invention, continuously sampling the acoustic waveform specifically comprises: setting the length of a sliding window as the sampling period, setting an overlap length for the sliding window, and cutting the acoustic waveform with the overlapping sliding window to obtain a series of waveform samples.
According to a preferred embodiment of the present invention, the speech emotion judgment model is a recurrent neural network (RNN) model.
According to a preferred embodiment of the present invention, the text emotion judgment model is a convolutional neural network (CNN) model.
In order to solve the above technical problem, a third aspect of the present invention provides a multi-modal complaint recognition system, including:
A storage unit configured to store a computer-executable program;
and the processing unit is used for reading the computer executable program in the storage unit so as to execute the multi-mode complaint identification method.
In order to solve the above-mentioned technical problem, a fourth aspect of the present invention proposes a computer-readable medium storing a computer-readable program for executing a multi-modal complaint recognition method.
By adopting the technical scheme, the existing data is utilized to train a voice emotion judgment model, and the complaint probability is analyzed and judged through the image sequence data and the text content data, so that the accuracy of complaint recognition is improved.
Drawings
In order to make the technical problems solved, the technical means adopted, and the technical effects achieved by the present invention clearer, specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be noted, however, that the drawings described below illustrate merely exemplary embodiments of the invention, and that those skilled in the art may derive other drawings from them without creative effort.
FIG. 1 is a flow chart of a multi-modal complaint identification method in an embodiment of the invention;
FIG. 2A is a diagram of a speech waveform in the time domain according to one embodiment of the present invention;
FIG. 2B is a time domain speech waveform diagram for an embodiment of the present invention within 800 ms;
FIG. 2C is a sequence of images of the speech waveform of FIG. 2A;
FIG. 3 is a schematic diagram of a multi-modal complaint recognition device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a structural framework of a multi-modal complaint recognition system in an embodiment of the invention;
fig. 5 is a schematic diagram of a computer-readable storage medium in an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings, in which exemplary embodiments are shown; the exemplary embodiments may, however, be practiced in various specific ways. These exemplary embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the invention to those skilled in the art.
The structures, capabilities, effects, or other features described in a particular embodiment may be incorporated in one or more other embodiments in any suitable manner without departing from the spirit of the present invention.
In describing particular embodiments, specific details of construction, performance, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by those skilled in the art. It is not excluded, however, that one skilled in the art may implement the present invention in a particular situation in a solution that does not include the structures, properties, effects, or other characteristics described above.
The flow diagrams in the figures are merely exemplary flow illustrations and do not represent that all of the elements, operations, and steps in the flow diagrams must be included in the aspects of the present invention, nor that the steps must be performed in the order shown in the figures. For example, some operations/steps in the flowcharts may be decomposed, some operations/steps may be combined or partially combined, etc., and the order of execution shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different network and/or processing unit means and/or microcontroller means.
The same reference numerals in the drawings denote the same or similar elements, components or portions, and repeated descriptions of them may therefore be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or portions, these devices, elements, components or portions should not be limited by these terms; the terms are used merely to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the invention. Furthermore, the term "and/or" is meant to include all combinations of any one or more of the items listed.
The invention is mainly applied to voice robots. As described above, current voice robots cannot recognize the user's emotion from the user's voice and thus cannot adopt a corresponding coping strategy. To solve this problem, the invention provides a method that identifies the user's emotion by feeding acoustic waveform graphs and text data into pre-trained models for analysis.
FIG. 1 shows a multi-modal complaint recognition method for recognizing whether the user call content includes complaint content. As shown in fig. 1, the method of the present invention comprises the following steps:
S1, receiving user voice in the user call, and converting the user voice into an acoustic waveform.
On the basis of the above technical solution, converting the user voice input into an acoustic waveform specifically comprises: detecting the voice input using the VAD algorithm to obtain the acoustic waveform.
In this embodiment, while the voice robot communicates with the customer to resolve an issue, it processes the user's voice, filters out the non-human-voice portion, and retains only the human voice, which facilitates subsequent analysis and improves accuracy.
The voice activity detection (VAD) algorithm is also known as a voice endpoint detection or voice boundary detection algorithm. In this embodiment, owing to environmental noise, device noise, and the like, the user's voice input often contains not only the user's voice but also the noise of the user's surroundings; if this noise is not filtered out, the analysis result is affected. The VAD algorithm therefore marks the voice segments and non-voice segments in the audio data, and the non-voice segments are removed according to the marking result: the user's voice input is detected, environmental noise is filtered out, only the user's human voice is retained, and the voice is converted into an acoustic waveform.
There are many specific VAD algorithms; in this embodiment a Gaussian mixture model (GMM) algorithm is used for human voice detection. In other embodiments, other VAD algorithms may also be employed.
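As an illustration of the speech/non-speech marking step, the following is a minimal energy-threshold VAD sketch. The patent's embodiment prefers a trained GMM-based detector; this stand-in merely thresholds per-frame energy, and the frame length and threshold ratio are made-up values:

```python
import numpy as np

def simple_vad(signal, frame_len=400, energy_ratio=0.1):
    """Mark frames as speech when their energy exceeds a fraction of the
    peak frame energy. A toy stand-in for the GMM-based VAD the patent
    prefers; real systems use a trained statistical model."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energies = (frames.astype(np.float64) ** 2).mean(axis=1)
    threshold = energy_ratio * energies.max()
    is_speech = energies > threshold
    # Keep only the frames flagged as speech: this retained signal plays
    # the role of the "acoustic waveform" used in later steps.
    return frames[is_speech].ravel(), is_speech

# Example: 1 s of near-silence followed by 1 s of loud signal, at 16 kHz
rng = np.random.default_rng(0)
silence = rng.normal(0, 0.01, 16000)
speech = rng.normal(0, 0.5, 16000)
voiced, mask = simple_vad(np.concatenate([silence, speech]))
```

The quiet first half is dropped and only the high-energy frames survive, mimicking the "retain only the human voice" behaviour described above.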
S2, converting the acoustic waveform into image sequence data, and identifying text content data of the acoustic waveform.
The speech input received from the user is typically analog audio data, but may also be digital audio data, which usually carries a certain compression rate. After receiving the user's voice input, the voice robot on the one hand performs speech-to-text recognition on the audio data and then applies a semantic understanding engine to the recognized content. Unlike the prior art, the invention additionally converts the audio data in real time into graphic data that the data processing apparatus can process, so that the graphics can be recognized in a subsequent step to acquire emotion information.
In the present invention, the graphic data refers to a voice waveform obtained by processing an input voice.
In one embodiment, the speech waveform is represented graphically in the time dimension by the energy value of the speech. Speech data may be presented as a speech-energy waveform, one form of which is the time-domain waveform; that is, a section of speech can be shown as a graph of its energy level over time.
Fig. 2A is a waveform diagram in the time domain of one embodiment of the present invention. As shown in fig. 2A, which shows a time domain waveform of a voice over a period of time from 0 to 600ms, it can be seen that different voices will exhibit different waveforms.
Furthermore, fig. 2A shows a continuous curve; if a longer time range is taken, the waveform may instead appear as a solid block, as shown in fig. 2B, which shows a time-domain speech waveform within 800 ms. In other embodiments, a fill algorithm may also be used to convert the line drawing into a block drawing. The present invention is not limited to a particular graphical presentation method.
Resampling of the audio data is required whether the input is analog or digital audio. As described above, the present invention preferably detects the voice input with the VAD algorithm (in this embodiment, a Gaussian mixture model (GMM) algorithm) to filter out environmental noise, retain only the user's human voice, and obtain the acoustic waveform; other VAD algorithms may also be employed in other embodiments.
In order to convert the speech waveform into a format that a machine learning model can recognize, the waveform must be segmented. That is, the speech waveform graph is divided over a predetermined time window, so that the user's voice input produces temporally successive waveform graph segments. For example, the waveform graph can be cut continuously with a time window, generating successive segments; the window length may be predetermined, e.g. 25 ms, 50 ms, 100 ms, or 200 ms.
In this embodiment, continuously sampling the acoustic waveform specifically comprises: setting the length of a sliding window as the sampling period, setting an overlap length for the sliding window, and cutting the acoustic waveform with the overlapping sliding window to obtain a series of waveform samples. Fig. 2C shows the plurality of segment images obtained by continuously cutting the speech waveform of fig. 2A.
In another embodiment, the user's voice input may also produce temporally overlapping waveform segments. To avoid missing edge information at segment boundaries, the invention may adopt overlapped segmentation; for example, the waveform of fig. 2A can be cut into 0-50 ms, 25-75 ms, 50-100 ms, 75-125 ms, and so on.
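The overlapped sliding-window cutting described above can be sketched as follows; the 16 kHz sample rate, the 50 ms window, and the 25 ms hop are illustrative values matching the 0-50 ms / 25-75 ms example:

```python
import numpy as np

def overlapped_windows(waveform, win_len, hop):
    """Cut a waveform into overlapping segments of win_len samples,
    advancing by hop samples each time (hop < win_len gives overlap)."""
    starts = range(0, len(waveform) - win_len + 1, hop)
    return [waveform[s:s + win_len] for s in starts]

sr = 16000                      # assumed sample rate
wave = np.arange(sr // 8)       # 125 ms of dummy samples
# 50 ms windows advancing by 25 ms: segments 0-50, 25-75, 50-100, 75-125 ms
segments = overlapped_windows(wave, win_len=sr // 20, hop=sr // 40)
```

Each segment would then be rendered as a waveform image (e.g. a jpg) to build the image sequence data.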
The cut images may be stored as jpg files, or converted into image files of other formats in other embodiments. The converted images are also represented as vectors for input into the emotion judgment model.
And S3, calculating a score reflecting the complaint probability according to the image sequence data and the text content data.
On the basis of the above technical solution, calculating a score reflecting the probability of a complaint from the image sequence data and the text content data comprises: inputting the image sequence data and the text content data into a complaint probability judging model for calculation, wherein the complaint probability judging model is a machine self-learning model trained on historical user call records.
On the basis of the above technical solution, further, inputting the image sequence data and the text content data into a complaint probability judgment model to calculate includes: and vectorizing the image sequence data and the text content data, and inputting the vectorized data into the complaint probability judging model for calculation.
A common technique in deep neural networks is pretraining. Multiple studies have demonstrated that initializing a neural network's parameters with vectors obtained from unsupervised or supervised training on extensive data can yield a better model than training from random initialization. Therefore, in this embodiment the machine self-learning model is trained on historical user call records.
On the basis of the above technical scheme, the speech emotion judgment model is a recurrent neural network (RNN) model.
The RNN is a class of deep networks usable for both unsupervised and supervised learning, and its depth can even match the length of the input sequence. In the unsupervised mode, the RNN predicts future items of a data sequence from previous samples without using category information, which makes it very well suited to modeling sequence data.
Moreover, in the field of language processing the RNN is one of the most widely used neural network models. A language model is generally used to analyze how the preceding context affects what follows; through its recurrent hidden layer the RNN naturally exploits the preceding context and can in theory use all of it, which conventional language models cannot do. Therefore, in the present embodiment, the speech emotion judgment model is an RNN.
In this embodiment, the speech emotion judgment model comprises an input layer, a hidden layer, and an output layer. The input layer receives the image sequence data and the output layer outputs the user's sequence of emotion judgment values; the input layer and the output layer have the same number of nodes.
Because the number of output nodes equals the number of input nodes, an emotion judgment value is output for each sample in the image sequence data, and these values together form the emotion judgment value sequence.
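A minimal, untrained sketch of such a per-segment recurrent scoring pass follows. The shapes and random weights are made up purely for illustration; the patent's model would be trained on historical call records, and the key property shown is that one emotion value is emitted per input segment:

```python
import numpy as np

def rnn_emotion_sequence(x_seq, Wx, Wh, Wo):
    """Elman-style recurrent pass: a hidden state is carried across the
    image-segment sequence and one sigmoid emotion score in (0, 1) is
    emitted per segment, so the output sequence has the same length as
    the input sequence (matching the equal input/output node counts)."""
    h = np.zeros(Wh.shape[0])
    scores = []
    for x in x_seq:
        h = np.tanh(Wx @ x + Wh @ h)                       # recurrent update
        scores.append(float(1.0 / (1.0 + np.exp(-(Wo @ h)))))  # sigmoid score
    return scores

# Untrained random weights, only to show the shapes involved:
rng = np.random.default_rng(1)
Wx, Wh, Wo = rng.normal(size=(8, 16)), rng.normal(size=(8, 8)), rng.normal(size=8)
segment_vectors = [rng.normal(size=16) for _ in range(5)]  # 5 vectorized images
emotion_seq = rnn_emotion_sequence(segment_vectors, Wx, Wh, Wo)
```

The resulting `emotion_seq` plays the role of the emotion judgment value sequence processed in step S4.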
On the basis of the technical scheme, the text emotion judgment model is a CNN convolutional neural network model.
In this embodiment, the text emotion judgment model based on the convolutional neural network (CNN) performs emotion classification on the text content data of the problem domain using vocabulary semantic vectors generated in the target domain. Its input is a sentence or document expressed as a matrix: each row of the matrix corresponds to one token and is a vector representing one word.
In this embodiment, the text emotion judgment model outputs a text emotion fluctuation value.
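A toy illustration of the sentence-as-matrix convolution with max-over-time pooling follows. All weights are random and untrained, and the embedding and kernel sizes are made up; only the shapes follow the description above, with the sigmoid output standing in for the text emotion fluctuation value:

```python
import numpy as np

def text_cnn_score(word_matrix, kernel, w_out=2.0):
    """Sketch of a text CNN: each row of `word_matrix` is one word's
    embedding vector; a kernel spanning k consecutive words slides down
    the rows, tanh activations are max-pooled over time, and a sigmoid
    maps the pooled feature to a value in (0, 1)."""
    k = kernel.shape[0]
    feats = [np.tanh(np.sum(word_matrix[i:i + k] * kernel))
             for i in range(word_matrix.shape[0] - k + 1)]
    pooled = max(feats)                       # max-over-time pooling
    return float(1.0 / (1.0 + np.exp(-w_out * pooled)))

rng = np.random.default_rng(2)
sentence = rng.normal(size=(7, 4))   # 7 words, 4-dim embeddings (illustrative)
kern = rng.normal(size=(3, 4))       # kernel spans 3 consecutive words
value = text_cnn_score(sentence, kern)
```

Because of the max pooling, the scalar output is independent of sentence length, which is why the matrix rows can vary per utterance.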
And S4, judging whether the user call contains complaint content or not according to the score.
In this embodiment, the speech emotion judgment model outputs a sequence of emotion judgment values that requires further processing: the variance of the sequence is taken, and the resulting value is the speech emotion fluctuation value, with different fluctuation values corresponding to different emotions.
In this embodiment, the variance of the emotion judgment value sequence measures the magnitude of the user's emotional fluctuation: the larger the variance, the larger the fluctuation value, and the stronger the user's emotional swings.
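As a concrete illustration of the variance computation (the emotion values are made up):

```python
import numpy as np

# Hypothetical emotion judgment values, one per waveform segment:
emotion_values = np.array([0.2, 0.8, 0.1, 0.9, 0.3])
speech_fluctuation = np.var(emotion_values)   # population variance

# A flat sequence (steady emotion) yields a much smaller fluctuation value:
steady_fluctuation = np.var(np.array([0.5, 0.5, 0.5, 0.5, 0.5]))
```

The swinging sequence produces a variance of about 0.106 while the steady one produces 0, matching the rule that larger variance means larger emotional fluctuation.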
In this embodiment, weights are set for the speech emotion fluctuation value and the text emotion fluctuation value, and a global emotion fluctuation value is calculated from them. A global emotion fluctuation threshold is preset; when the calculated global value exceeds this threshold, the user's emotion is fluctuating severely and the probability of a complaint is high. The voice robot's dialogue strategy then needs to be adjusted, for example by changing its speech speed, intonation, or speaking content.
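A sketch of the weighted combination and threshold check; the 0.6/0.4 weights and the 0.5 threshold are hypothetical, as the patent does not fix their concrete values:

```python
def global_fluctuation(speech_val, text_val, w_speech=0.6, w_text=0.4):
    """Weighted combination of the two per-modality fluctuation values.
    Weights are illustrative, not specified by the patent."""
    return w_speech * speech_val + w_text * text_val

THRESHOLD = 0.5   # hypothetical preset global emotion fluctuation threshold
score = global_fluctuation(0.7, 0.4)
adjust_dialogue = score > THRESHOLD   # e.g. change speed, intonation, content
```

With these sample values the global score is 0.58, exceeding the threshold, so the robot would adjust its dialogue strategy.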
As shown in fig. 3, in this embodiment, there is further provided a multi-modal complaint recognition apparatus 300, including:
the voice receiving module 301 is configured to receive a user voice in the user call, and convert the user voice into an acoustic waveform.
On the basis of the above technical solution, converting the user voice input into an acoustic waveform specifically comprises: detecting the voice input using the VAD algorithm to obtain the acoustic waveform.
In this embodiment, while the voice robot communicates with the customer to resolve an issue, it processes the user's voice, filters out the non-human-voice portion, and retains only the human voice, which facilitates subsequent analysis and improves accuracy.
The voice activity detection (VAD) algorithm is also known as a voice endpoint detection or voice boundary detection algorithm. In this embodiment, owing to environmental noise, device noise, and the like, the user's voice input often contains not only the user's voice but also the noise of the user's surroundings; if this noise is not filtered out, the analysis result is affected. The VAD algorithm therefore marks the voice segments and non-voice segments in the audio data, and the non-voice segments are removed according to the marking result: the user's voice input is detected, environmental noise is filtered out, only the user's human voice is retained, and the voice is converted into an acoustic waveform.
There are many specific VAD algorithms; in this embodiment a Gaussian mixture model (GMM) algorithm is used for human voice detection. In other embodiments, other VAD algorithms may also be employed.
The voice conversion module 302 is configured to convert the acoustic waveform into image sequence data, and identify text content data of the acoustic waveform.
The speech input received from the user is typically analog audio data, but may also be digital audio data, which usually carries some degree of compression. After receiving the user's voice input, the voice robot on the one hand performs speech-to-text recognition on the audio data and then applies a semantic understanding engine to the recognized content. Unlike the prior art, the present invention also converts the audio data in real time into graphic data that the data processing apparatus can process, so that the graphics can be recognized in a subsequent step to obtain emotion information.
In the present invention, the graphic data refers to a voice waveform obtained by processing an input voice.
In one embodiment, the speech waveform is represented graphically as the speech energy value over the time dimension. Speech data can be presented as a waveform of speech energy in the time domain; that is, a section of speech can be shown as a graphical pattern of its energy level over time.
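The energy-over-time representation described above can be sketched as follows; the sample rate and window size are illustrative assumptions:

```python
import numpy as np

# Computing the data behind a time-domain graph of speech energy level
# over time. Sample rate (8 kHz) and window (200 samples) are
# illustrative assumptions.

def energy_envelope(samples, sample_rate=8000, window=200):
    """Return (times_in_seconds, per-window energies), suitable for
    rendering the speech waveform graph described above."""
    n = len(samples) // window
    frames = samples[:n * window].reshape(n, window)
    energies = (frames ** 2).sum(axis=1)          # energy per time window
    times = np.arange(n) * window / sample_rate   # window start times
    return times, energies

# A constant-amplitude signal yields a flat energy envelope.
t, e = energy_envelope(np.ones(1000))
```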
Resampling of the audio data is required, whether it is analog or digital audio data. Preferably, the present invention detects the speech input with the VAD algorithm to obtain the acoustic waveform: the speech and non-speech segments in the audio data are labeled, the non-speech segments are removed according to the labeling result, environmental noise is filtered out, and only the user's voice is retained and converted into an acoustic waveform. As in the embodiment above, a Gaussian mixture model (GMM) algorithm is preferably used for voice detection, although other VAD algorithms may also be employed.
To convert the speech waveform into a format that a machine learning model can recognize, the waveform must be segmented: the waveform graph is divided over a predetermined time window, so that the user's speech input produces temporally successive waveform graph segments. For example, the waveform graph can be divided continuously with a time window, generating successive waveform graph segments. The window length may be predetermined, e.g., 25 ms, 50 ms, 100 ms, or 200 ms.
In this embodiment, the continuous sampling of the acoustic waveform specifically includes: setting the length of a sliding window as a sampling period, setting the overlapping length of the sliding window, and cutting the acoustic waveform by using the overlapping sliding window to obtain a series of waveform samples.
In another embodiment, the user's voice input may also produce temporally overlapping speech waveforms. To avoid losing edge information at the boundaries of consecutive pictures, the invention can use overlapping segmentation; for example, the waveform shown in fig. 2A can be cut into 0-50 ms, 25-75 ms, 50-100 ms, 75-125 ms, and so on.
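The overlapping segmentation above (0-50 ms, 25-75 ms, 50-100 ms, ...) can be sketched as follows, assuming an 8 kHz sample rate so that a 50 ms window is 400 samples and a 25 ms hop is 200 samples:

```python
import numpy as np

# Overlapping sliding-window cutting of the acoustic waveform, following
# the 0-50 ms / 25-75 ms / 50-100 ms example above. The 8 kHz sample
# rate is an assumption: 50 ms window = 400 samples, 25 ms hop = 200.

def cut_overlapping(samples, window=400, hop=200):
    """Return a list of overlapping waveform segments."""
    segments = []
    start = 0
    while start + window <= len(samples):
        segments.append(samples[start:start + window])
        start += hop
    return segments

# 1000 samples = 125 ms: windows start at 0, 200, 400, 600 samples,
# i.e. the four segments 0-50, 25-75, 50-100, 75-125 ms from the text.
segs = cut_overlapping(np.zeros(1000))
```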
The cut image may be stored as a jpg file; in other embodiments it may be converted into an image file of another format, and the converted image may also be represented as a vector for input into the emotion judgment model.

The probability calculation module 303 is configured to calculate a score reflecting the probability of complaints from the image sequence data and the text content data.
On the basis of the above technical solution, further, calculating a score reflecting the probability of complaints from the image sequence data and the text content data includes: inputting the image sequence data and the text content data into a complaint probability judgment model for calculation, wherein the complaint probability judgment model is a machine self-learning model trained on historical user call records.
On the basis of the above technical solution, further, inputting the image sequence data and the text content data into a complaint probability judgment model to calculate includes: and vectorizing the image sequence data and the text content data, and inputting the vectorized data into the complaint probability judging model for calculation.
A common technique in deep neural networks is pretraining. Multiple studies have shown that initializing a neural network's parameters with vectors obtained from unsupervised or supervised training on extensive data can yield a better model than training from random initialization. Therefore, in this embodiment, the machine self-learning model is trained on historical user call records.
On the basis of the technical scheme, the voice emotion judging model is an RNN circulating neural network model.
The recurrent neural network (RNN) is a type of deep network usable for both unsupervised and supervised learning, and its depth can even match the length of the input sequence. In the unsupervised mode, an RNN predicts future elements of a data sequence from previous samples, without using category information during learning, so it is very well suited to modeling sequence data.
Moreover, in the field of language processing, the RNN model is among the most widely used neural networks. That field typically analyzes the effect of preceding context on what follows using a language model; the RNN naturally exploits preceding context through its recurrently fed-back hidden layer and can, in theory, use all of it, which a conventional language model cannot. Therefore, in this embodiment, the speech emotion judgment model is an RNN.
In this embodiment, the speech emotion judgment model includes an input layer, a hidden layer, and an output layer; the input layer receives the image sequence data, the output layer outputs the user's emotion judgment value sequence, and the number of nodes in the input layer is the same as in the output layer.
In this embodiment, image sequence data is input to an input layer of a speech emotion judgment model, the number of nodes of an output layer of the speech emotion judgment model is the same as the number of nodes of the input layer, emotion judgment values corresponding to each sample in the image sequence data are output, and the output emotion judgment values constitute an emotion judgment value sequence.
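A minimal recurrent layer with as many outputs as input frames illustrates this equal-node-count structure; the sizes and weights below are random and untrained, so the values are purely illustrative:

```python
import numpy as np

# Minimal sketch of a recurrent model that emits one emotion judgment
# value per input sample, so the output sequence has as many entries as
# the input sequence. Sizes and weights are illustrative assumptions;
# the network is untrained.

rng = np.random.default_rng(0)
n_features, n_hidden = 8, 16

W_in = 0.1 * rng.standard_normal((n_hidden, n_features))
W_rec = 0.1 * rng.standard_normal((n_hidden, n_hidden))
w_out = 0.1 * rng.standard_normal(n_hidden)

def emotion_sequence(frames):
    """frames: (T, n_features) vectorized waveform samples ->
    (T,) emotion judgment values, one per input frame."""
    h = np.zeros(n_hidden)
    values = []
    for x in frames:
        h = np.tanh(W_in @ x + W_rec @ h)  # recurrent hidden state
        values.append(float(w_out @ h))    # one scalar per frame
    return np.array(values)

values = emotion_sequence(rng.standard_normal((5, n_features)))
```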
On the basis of the technical scheme, the text emotion judgment model is a CNN convolutional neural network model.
In this embodiment, the text emotion judgment model based on a convolutional neural network (CNN) performs emotion classification of text content data in the problem domain using vocabulary semantic vectors generated in the target domain. Its input is a sentence or document expressed as a matrix, in which each row corresponds to one segmented word and is the vector representing that word.
In this embodiment, the text emotion judgment model outputs a text emotion fluctuation value.
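A miniature version of such a convolution over a word-vector matrix, with one filter spanning full word vectors followed by max-over-time pooling, can be sketched as follows (random, untrained weights; sizes are illustrative assumptions):

```python
import numpy as np

# Sketch of the text-CNN input described above: a sentence as a matrix
# with one row per segmented word, each row a word vector, convolved
# with a filter that spans whole word-vector rows.

rng = np.random.default_rng(1)
embed_dim, filter_width = 4, 2   # illustrative sizes
filt = rng.standard_normal((filter_width, embed_dim))

def conv_feature(sentence_matrix):
    """Slide the filter over windows of consecutive word rows and
    max-pool over time, yielding one feature value for the sentence."""
    n_words = sentence_matrix.shape[0]
    activations = [float(np.sum(filt * sentence_matrix[t:t + filter_width]))
                   for t in range(n_words - filter_width + 1)]
    return max(activations)  # max-over-time pooling

sentence = rng.standard_normal((6, embed_dim))  # one row per word
feature = conv_feature(sentence)
```

A real model would use many such filters of several widths and feed the pooled features to a classifier that outputs the text emotion fluctuation value.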
The complaint judging module 304 is configured to judge whether the user call contains complaint content according to the score.
In this embodiment, the speech emotion judgment model outputs an emotion judgment value sequence that requires further processing: the variance of the sequence is computed, and the resulting value is the speech emotion fluctuation value; different fluctuation values correspond to different emotions.
In this embodiment, the variance of the emotion determination value sequence is calculated to determine the magnitude of the emotion fluctuation of the user, and the larger the variance value is, the larger the emotion fluctuation value is, which means that the emotion fluctuation of the user is larger.
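The variance computation described above reduces to:

```python
import numpy as np

# The speech emotion fluctuation value is the variance of the emotion
# judgment value sequence: a flat sequence means a calm user, a widely
# swinging one means strong emotional fluctuation.

def speech_fluctuation(emotion_values):
    """Population variance of the emotion judgment value sequence."""
    return float(np.var(emotion_values))

calm = speech_fluctuation([0.5, 0.5, 0.5, 0.5])      # flat sequence
agitated = speech_fluctuation([0.1, 0.9, 0.2, 0.8])  # swinging sequence
```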
In this embodiment, weights are set for the speech emotion fluctuation value and the text emotion fluctuation value, and a global emotion fluctuation value is calculated from their weighted combination. A global emotion fluctuation threshold is preset; when the calculated global emotion fluctuation value exceeds this threshold, the user's emotion is fluctuating severely and the probability of a complaint is high. The voice robot's dialogue strategy then needs to be adjusted, for example by changing the speech speed, the intonation, or the speaking content.
As shown in FIG. 4, one embodiment of the present invention further discloses a multi-modal complaint recognition system. The information processing system shown in FIG. 4 is merely an example and should not limit the functionality or scope of use of the embodiments of the present invention.
A multi-modal complaint recognition system 400 includes a storage unit 420 for storing a computer-executable program; and a processing unit 410 for reading the computer executable program in the storage unit to perform the steps of the various embodiments of the present invention.
The multi-modal complaint recognition system 400 in this embodiment further includes a bus 430, a display unit 440, and the like, which connect the different system components (including the storage unit 420 and the processing unit 410).
The storage unit 420 stores a computer readable program, which may be source code or read-only program code. The program may be executed by the processing unit 410, causing the processing unit 410 to perform the steps of various embodiments of the present invention; for example, the processing unit 410 may perform the steps shown in fig. 1.
The memory unit 420 may include readable media in the form of volatile memory units, such as Random Access Memory (RAM) 4201 and/or cache memory 4202, and may further include Read Only Memory (ROM) 4203. The storage unit 420 may also include a program/utility 4204 having a set (at least one) of program modules 4205, such program modules 4205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 430 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
Multimodal complaint recognition system 400 may also communicate with one or more external devices 470 (e.g., keyboard, display, network device, bluetooth device, etc.) such that a user can interact with processing unit 410 via these external devices 470 through input/output (I/O) interface 450, as well as with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) through network adapter 460. Network adapter 460 may communicate with other modules of multi-modal complaint recognition system 400 via bus 430. It should be appreciated that although not shown, other hardware and/or software modules may be used in multi-modal complaint identification system 400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
FIG. 5 is a schematic diagram of one embodiment of a computer readable medium of the present invention. As shown in fig. 5, the computer program may be stored on one or more computer readable media. The computer readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage unit, a magnetic storage unit, or any suitable combination of the foregoing. When executed by one or more data processing devices, the computer program stored on the computer readable medium carries out the above-described method of the present invention, namely:
S1, receiving user voice in the user call, and converting the user voice into an acoustic waveform;
s2, converting the acoustic waveform into image sequence data, and identifying text content data of the acoustic waveform at the same time;
s3, calculating a score reflecting the complaint probability according to the image sequence data and the text content data;
and S4, judging whether the user call contains complaint content or not according to the score.
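Steps S1 to S4 can be tied together in a pipeline skeleton; every component below is a hypothetical placeholder standing in for the VAD, conversion, and emotion models described in the embodiments, and the weights and threshold are illustrative assumptions:

```python
# Skeleton of steps S1-S4 above. All components passed in are
# hypothetical placeholders for the models the patent describes;
# weights and threshold are illustrative assumptions.

def recognize_complaint(call_audio,
                        vad, to_image_sequence, speech_to_text,
                        speech_model, text_model,
                        w_speech=0.6, w_text=0.4, threshold=0.5):
    waveform = vad(call_audio)              # S1: user voice -> acoustic waveform
    images = to_image_sequence(waveform)    # S2: waveform -> image sequence data
    text = speech_to_text(waveform)         # S2: waveform -> text content data
    speech_fluct = speech_model(images)     # S3: speech emotion fluctuation
    text_fluct = text_model(text)           # S3: text emotion fluctuation
    score = w_speech * speech_fluct + w_text * text_fluct
    return score > threshold                # S4: contains complaint content?

# Stub components just to show the call shape:
# 0.6 * 0.9 + 0.4 * 0.8 = 0.86 > 0.5, so the call is flagged.
flag = recognize_complaint(
    call_audio=[0.0] * 100,
    vad=lambda a: a,
    to_image_sequence=lambda w: [w],
    speech_to_text=lambda w: "text",
    speech_model=lambda imgs: 0.9,
    text_model=lambda t: 0.8,
)
```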
From the above description of embodiments, those skilled in the art will readily appreciate that the exemplary embodiments described herein may be implemented in software, or may be implemented in software in combination with necessary hardware. Thus, the technical solution according to the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a computer readable storage medium (may be a CD-ROM, a usb disk, a mobile hard disk, etc.) or on a network, comprising several instructions to cause a data processing device (may be a personal computer, a server, or a network device, etc.) to perform the above-described method according to the present invention.
A computer readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic or optical signals, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected via the Internet using an Internet service provider).
In summary, the present invention may be implemented in a method, apparatus, electronic device, or computer readable medium that executes a computer program. Some or all of the functions of the present invention may be implemented in practice using general-purpose data processing devices such as a micro-processing unit or a digital signal processing unit (DSP).
The above-described specific embodiments further describe the objects, technical solutions and advantageous effects of the present invention in detail, and it should be understood that the present invention is not inherently related to any particular computer, virtual device or electronic apparatus, and various general-purpose devices may also implement the present invention. The foregoing description of the embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (16)

1. A multi-mode complaint identification method is used for identifying whether user communication content contains complaint content or not, and is characterized by comprising the following steps:
receiving user voice in the user call, removing a non-voice section of audio data in the user voice by using an algorithm labeling result, and converting the user voice into an acoustic waveform by resampling the audio data;
converting the acoustic waveform into image sequence data and vectorizing it, and simultaneously identifying text content data of the acoustic waveform and vectorizing it; wherein converting the acoustic waveform into image sequence data comprises: converting the acoustic waveform into a time-domain waveform graph by graphing the speech energy value over the time dimension so that the graph shows the energy level; obtaining continuously sampled waveform sample images by performing overlapping cutting of the acoustic waveform graph using an overlapping sliding window; and converting the acoustic waveform into a format recognizable by a machine learning model and representing it as a vector;
Inputting the vector of the image sequence data and the vector of the text content data into a complaint probability judging model for calculation, wherein the method comprises the following steps: inputting the vector of the image sequence data into a voice emotion judgment model, outputting an emotion judgment value sequence formed by corresponding emotion judgment values, and solving variance of the emotion judgment value sequence to obtain a voice emotion fluctuation value; inputting the vector of the text content data into a text emotion judgment model, and outputting a corresponding text emotion fluctuation value; calculating a global emotion fluctuation value according to the set weight of the voice emotion fluctuation value and the set weight of the text emotion fluctuation value;
judging whether the user call contains complaint content according to the global emotion fluctuation value, wherein the method comprises the following steps: judging whether the calculated global emotion fluctuation value exceeds a preset global emotion fluctuation value threshold value, and if so, indicating that the probability of complaint is larger.
2. The method of claim 1, wherein inputting the vector of the image sequence data and the vector of the text content data into a complaint probability judging model for calculation comprises:
and judging the complaint probability as a machine self-learning model, wherein the machine self-learning model is trained through historical user call records.
3. The method of claim 1, wherein inputting the vector of the image sequence data into a speech emotion judgment model, outputting a sequence of emotion judgment values constituted by corresponding emotion judgment values, comprises:
the number of nodes of the output layer of the speech emotion judgment model is the same as the number of nodes of its input layer; the vector of the image sequence data is input into the input layer, an emotion judgment value corresponding to each sampled sample in the image sequence data is output, and the output emotion judgment values form the emotion judgment value sequence.
4. A method according to any one of claims 1 to 3, characterized in that converting the user speech input into acoustic waveforms is in particular: the voice input is detected using the VAD algorithm to obtain the acoustic waveform.
5. A method according to any one of claims 1 to 3, wherein obtaining continuously sampled waveform sample images by overlapping cutting of the acoustic waveform map using overlapping sliding windows further comprises:
setting the length of a sliding window as a sampling period, setting the overlapping length of the sliding window, and performing overlapping cutting on the acoustic waveform by using the overlapping sliding window to obtain a series of waveform samples.
6. A method according to any one of claims 1 to 3, wherein the speech emotion judgment model is an RNN recurrent neural network model.
7. A method according to any one of claims 1 to 3, wherein the text emotion judgment model is a CNN convolutional neural network model.
8. A multi-modal complaint recognition device for recognizing whether user call content includes complaint content, the complaint recognition device comprising:
the voice receiving module is used for receiving the user voice in the user call, removing a non-voice section of the audio data in the user voice by using an algorithm labeling result, and converting the user voice into an acoustic waveform by resampling the audio data;
the voice conversion module is used for converting the acoustic waveform into image sequence data and vectorizing it, and for identifying text content data of the acoustic waveform and vectorizing it; wherein converting the acoustic waveform into image sequence data comprises: converting the acoustic waveform into a time-domain waveform graph by graphing the speech energy value over the time dimension so that the graph shows the energy level; obtaining continuously sampled waveform sample images by performing overlapping cutting of the acoustic waveform graph using an overlapping sliding window; and converting the acoustic waveform into a format recognizable by a machine learning model and representing it as a vector;
The probability calculation module is used for inputting the vector of the image sequence data and the vector of the text content data into a complaint probability judgment model for calculation, and comprises the following steps: inputting the vector of the image sequence data into a voice emotion judgment model, outputting an emotion judgment value sequence formed by corresponding emotion judgment values, and solving variance of the emotion judgment value sequence to obtain a voice emotion fluctuation value; inputting the vector of the text content data into a text emotion judgment model, and outputting a corresponding text emotion fluctuation value; calculating a global emotion fluctuation value according to the set weight of the voice emotion fluctuation value and the set weight of the text emotion fluctuation value;
the complaint judging module is used for judging whether the user call contains complaint content according to the global emotion fluctuation value, and comprises the following steps: judging whether the calculated global emotion fluctuation value exceeds a preset global emotion fluctuation value threshold value, and if so, indicating that the probability of complaint is larger.
9. The apparatus of claim 8, wherein inputting the vector of the image sequence data and the vector of the text content data into a complaint probability judging model for calculation comprises:
and judging the complaint probability as a machine self-learning model, wherein the machine self-learning model is trained through historical user call records.
10. The apparatus of claim 8, wherein inputting the vector of the image sequence data into a speech emotion judgment model, outputting a sequence of emotion judgment values constituted by corresponding emotion judgment values, comprises:
the number of nodes of the output layer of the speech emotion judgment model is the same as the number of nodes of its input layer; the vector of the image sequence data is input into the input layer, an emotion judgment value corresponding to each sampled sample in the image sequence data is output, and the output emotion judgment values form the emotion judgment value sequence.
11. The apparatus according to any one of claims 8 to 10, wherein converting the user speech input into acoustic waveforms is in particular: the voice input is detected using the VAD algorithm to obtain the acoustic waveform.
12. The apparatus of any of claims 8 to 10, wherein obtaining continuously sampled waveform sample images by overlapping cutting of the acoustic waveform map using an overlapping sliding window further comprises: setting the length of a sliding window as a sampling period, setting the overlapping length of the sliding window, and performing overlapping cutting on the acoustic waveform by using the overlapping sliding window to obtain a series of waveform samples.
13. The apparatus according to any one of claims 8 to 10, wherein the speech emotion judgment model is an RNN recurrent neural network model.
14. The apparatus of any one of claims 8 to 10, wherein the text emotion judgment model is a CNN convolutional neural network model.
15. A multi-modal complaint recognition system, comprising:
a storage unit configured to store a computer-executable program;
a processing unit for reading the computer executable program in the storage unit to perform the method of any of claims 1 to 7.
16. A computer readable medium storing a computer readable program for performing the method of any one of claims 1 to 7.
CN201910943563.7A 2019-09-30 2019-09-30 Multi-mode complaint identification method, device and system Active CN110782916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910943563.7A CN110782916B (en) 2019-09-30 2019-09-30 Multi-mode complaint identification method, device and system


Publications (2)

Publication Number Publication Date
CN110782916A CN110782916A (en) 2020-02-11
CN110782916B true CN110782916B (en) 2023-09-05

Family

ID=69385079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910943563.7A Active CN110782916B (en) 2019-09-30 2019-09-30 Multi-mode complaint identification method, device and system

Country Status (1)

Country Link
CN (1) CN110782916B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101046B (en) * 2020-11-02 2022-04-29 北京淇瑀信息科技有限公司 Conversation analysis method, device and system based on conversation behavior
CN112804400B (en) * 2020-12-31 2023-04-25 中国工商银行股份有限公司 Customer service call voice quality inspection method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339606A (en) * 2011-05-17 2012-02-01 首都医科大学宣武医院 Depressed mood phone automatic speech recognition screening system
WO2014069443A1 (en) * 2012-10-31 2014-05-08 日本電気株式会社 Complaint call determination device and complaint call determination method
CN103811009A (en) * 2014-03-13 2014-05-21 华东理工大学 Smart phone customer service system based on speech analysis
CN105810205A (en) * 2014-12-29 2016-07-27 中国移动通信集团公司 Speech processing method and device
CN109817245A (en) * 2019-01-17 2019-05-28 深圳壹账通智能科技有限公司 Generation method, device, computer equipment and the storage medium of meeting summary

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107210045B (en) * 2015-02-03 2020-11-17 杜比实验室特许公司 Meeting search and playback of search results


Also Published As

Publication number Publication date
CN110782916A (en) 2020-02-11

Similar Documents

Publication Publication Date Title
CN110648691B (en) Emotion recognition method, device and system based on energy value of voice
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
US8209182B2 (en) Emotion recognition system
EP2387031B1 (en) Methods and systems for grammar fitness evaluation as speech recognition error predictor
CN109686383B (en) Voice analysis method, device and storage medium
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN111901627B (en) Video processing method and device, storage medium and electronic equipment
CN112397056B (en) Voice evaluation method and computer storage medium
CN111177186A (en) Question retrieval-based single sentence intention identification method, device and system
CN113053410B (en) Voice recognition method, voice recognition device, computer equipment and storage medium
CN113539240A (en) Animation generation method and device, electronic equipment and storage medium
CN110782916B (en) Multi-mode complaint identification method, device and system
CN114330371A (en) Session intention identification method and device based on prompt learning and electronic equipment
CN110782902A (en) Audio data determination method, apparatus, device and medium
CN114420169A (en) Emotion recognition method and device and robot
CN112885379A (en) Customer service voice evaluation method, system, device and storage medium
CN110619894B (en) Emotion recognition method, device and system based on voice waveform diagram
CN116580698A (en) Speech synthesis method, device, computer equipment and medium based on artificial intelligence
KR20210130024A (en) Dialogue system and method of controlling the same
CN112017668B (en) Intelligent voice conversation method, device and system based on real-time emotion detection
CN112101046B (en) Conversation analysis method, device and system based on conversation behavior
CN113763992A (en) Voice evaluation method and device, computer equipment and storage medium
Kamble et al. Spontaneous emotion recognition for Marathi spoken words
KR20110071742A (en) Apparatus for utterance verification based on word specific confidence threshold
CN117275458B (en) Speech generation method, device and equipment for intelligent customer service and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant