CN110580899A - Voice recognition method and device, storage medium and computing equipment - Google Patents

Voice recognition method and device, storage medium and computing equipment

Info

Publication number
CN110580899A
CN110580899A (application CN201910967019.6A)
Authority
CN
China
Prior art keywords
voice data
emotion
data
detection model
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910967019.6A
Other languages
Chinese (zh)
Inventor
李君浩
邹婷婷
顾少丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Lake Information Technology Co Ltd
Original Assignee
Shanghai Lake Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Lake Information Technology Co Ltd (2019-10-12)
Priority to CN201910967019.6A
Publication of CN110580899A (2019-12-17)
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/26 - Speech to text systems
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques specially adapted for estimating an emotional state
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04M - TELEPHONIC COMMUNICATION
    • H04M3/00 - Automatic or semi-automatic exchanges
    • H04M3/42 - Systems providing special services or facilities to subscribers
    • H04M3/50 - Centralised arrangements for answering calls; Centralised arrangements for recording messages for absent or busy subscribers; Centralised arrangements for recording messages
    • H04M3/51 - Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing
    • H04M3/5175 - Call or contact centers supervision arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice recognition method and device, a storage medium and a computing device are provided. The voice recognition method comprises the following steps: extracting emotion feature vectors from a set of voice data and converting the set of voice data into text data; training an emotion detection model based on the emotion feature vectors and the text data, wherein the emotion detection model is used for calculating an emotion score; calculating the emotion score of voice data to be detected based on the voice data to be detected and the emotion detection model; and judging, based on the emotion score, whether the voice data to be detected has a violation risk. The technical solution provided by the invention can detect voice data efficiently and accurately, improving the detection rate of illegal voice.

Description

Voice recognition method and device, storage medium and computing equipment
Technical Field
The invention relates to the technical field of voice detection, in particular to a voice recognition method and device, a storage medium and computing equipment.
Background
With the development of communication technology, call centers generate huge numbers of telephone recording files every day. In traditional quality inspection of call content, a small number of recording files are manually spot-checked at random to judge whether the call content of customer service personnel violates the rules. This traditional method is inefficient: it cannot check every recording file one by one, and it is difficult to assess the work quality of customer service staff from the recordings in a timely manner.
Disclosure of Invention
The invention solves the technical problem of how to efficiently and accurately identify illegal voices.
To solve the foregoing technical problem, an embodiment of the present invention provides a speech recognition method, including: extracting emotion feature vectors from a set of voice data and converting the set of voice data into text data; training an emotion detection model based on the emotion feature vectors and the text data, wherein the emotion detection model is used for calculating an emotion score; calculating the emotion score of the voice data to be detected based on the voice data to be detected and the emotion detection model; and judging whether the voice data to be detected has a violation risk based on the emotion score.
Optionally, the judging whether the voice data to be detected has a violation risk based on the emotion score includes: determining that the voice data to be detected has a violation risk when the emotion score is higher than a preset threshold.
Optionally, the speech recognition method further includes: labeling the voice data to be detected that has a violation risk.
Optionally, the training to obtain the emotion detection model based on the emotion feature vector and the text data includes: training with a neural network algorithm, based on the emotion feature vector and the text data, to obtain the emotion detection model.
Optionally, the training to obtain the emotion detection model based on the emotion feature vector and the text data includes: training with a logistic regression algorithm, based on the emotion feature vector and the text data, to obtain the emotion detection model.
Optionally, the emotion feature vector is used to represent an emotion type, and the emotion type is selected from: happiness, sadness, anger, fear, disgust.
Optionally, the converting the set of voice data into text data includes: converting the voice data into the text data using a speech-to-text technique.
Optionally, the voice data includes voice data of a first role and voice data of a second role, and the extracting an emotion feature vector from a set of voice data and converting the set of voice data into text data includes: distinguishing the voice data of the first role from the voice data of the second role in the set of voice data to obtain the voice data of the first role and the voice data of the second role; and extracting emotion feature vectors of the voice data of the first role and of the voice data of the second role respectively, and converting the voice data of the first role and the voice data of the second role into text data respectively.
In order to solve the above technical problem, an embodiment of the present invention further provides a speech recognition apparatus, including: an extraction module, configured to extract emotion feature vectors from a set of voice data and convert the set of voice data into text data; a training module, configured to train an emotion detection model based on the emotion feature vectors and the text data, the emotion detection model being used for calculating an emotion score; a calculation module, configured to calculate the emotion score of the voice data to be detected based on the voice data to be detected and the emotion detection model; and a judging module, configured to judge whether the voice data to be detected has a violation risk based on the emotion score.
To solve the above technical problem, an embodiment of the present invention further provides a storage medium having computer instructions stored thereon, where the computer instructions, when executed, perform the steps of the above method.
In order to solve the above technical problem, an embodiment of the present invention further provides a computing device, including a memory and a processor, where the memory stores computer instructions executable on the processor, and the processor executes the computer instructions to perform the steps of the above method.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides a voice recognition method, which comprises the following steps: extracting emotion feature vectors from a set of voice data and converting the set of voice data into text data; training an emotion detection model based on the emotion feature vectors and the text data, wherein the emotion detection model is used for calculating an emotion score; calculating the emotion score of the voice data to be detected based on the voice data to be detected and the emotion detection model; and judging whether the voice data to be detected has a violation risk based on the emotion score. In the embodiment of the invention, the emotion feature vectors extracted from the voice data and the corresponding text data are used as input data to train the emotion detection model. Because a large amount of voice data can be used as training input, statistical regularities can be exploited and an emotion detection model with high accuracy can be obtained. Judging the voice data to be detected with this highly accurate model allows detection of the voice data to be completed more efficiently and accurately, improving the detection rate of illegal voice. Furthermore, the embodiment of the invention is suitable for detecting massive amounts of voice and extends the range of voice detection scenarios.
Further, the training of the emotion detection model based on the emotion feature vector and the text data includes: training with a neural network algorithm, based on the emotion feature vector and the text data, to obtain the emotion detection model. By adopting a neural network as the emotion detection model, an emotion detection model with higher accuracy can be trained by virtue of the advantages of neural networks, further improving the detection rate of illegal voice.
Drawings
FIG. 1 is a flow chart of a speech recognition method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating a speech recognition method in an exemplary scenario according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
As noted in the background section, the prior art relies on manual spot checks to find illegal voice, which is inefficient.
The inventors of the present application found that the prior art may also determine whether voice data is illegal voice as follows: first, the voice data to be detected is converted into text data, and an emotion feature vector of the voice data to be detected is extracted; second, voice features are determined from the emotion feature vector, and the converted text data is searched for preset keywords; then, whether the audio data is illegal voice data is determined by combining the voice features and the preset keywords.
However, when this prior-art scheme is used to analyze each recording in a large number of telephone recording files, statistical regularities common to illegal voice data cannot be captured, and the accuracy is low.
The embodiment of the invention provides a voice recognition method, which comprises the following steps: extracting emotion feature vectors from a set of voice data and converting the set of voice data into text data; training an emotion detection model based on the emotion feature vectors and the text data, wherein the emotion detection model is used for calculating an emotion score; calculating the emotion score of the voice data to be detected based on the voice data to be detected and the emotion detection model; and judging whether the voice data to be detected has a violation risk based on the emotion score.
In the embodiment of the invention, the emotion feature vectors extracted from the voice data and the corresponding text data are used as input data to train the emotion detection model. Because a large amount of voice data can be used as training input, statistical regularities can be exploited and an emotion detection model with high accuracy can be obtained. Judging the voice data to be detected with this highly accurate model allows detection of the voice data to be completed more efficiently and accurately, improving the detection rate of illegal voice. Furthermore, the embodiment of the invention is suitable for detecting massive amounts of voice and extends the range of voice detection scenarios.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present invention. The speech recognition method may be performed by a computing device, such as a server, a personal terminal, or the like.
Specifically, the speech recognition method may include the steps of:
Step S101, extracting emotion feature vectors from a set of voice data, and converting the set of voice data into text data;
Step S102, training to obtain an emotion detection model based on the emotion feature vector and the text data, wherein the emotion detection model is used for calculating an emotion score;
Step S103, calculating emotion scores of the voice data to be detected based on the voice data to be detected and the emotion detection model;
And step S104, judging whether the voice data to be detected has a violation risk based on the emotion score.
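Read together, steps S101 to S104 form a train-then-score pipeline. The following is a minimal Python sketch of how the four steps fit together; every callable in it (extract_emotion, asr_transcribe, train_model and the model's score method) is a hypothetical placeholder, not an interface named by the patent:

    # Hedged sketch of steps S101-S104; all callables are hypothetical.
    def train_emotion_detection_model(voice_dataset, extract_emotion,
                                      asr_transcribe, train_model):
        """Steps S101-S102: build training inputs, then fit the model."""
        emotion_vectors = [extract_emotion(v) for v in voice_dataset]  # S101
        transcripts = [asr_transcribe(v) for v in voice_dataset]       # S101
        return train_model(emotion_vectors, transcripts)               # S102

    def detect_violation_risk(voice_to_check, model, extract_emotion,
                              asr_transcribe, threshold):
        """Steps S103-S104: score one recording, then judge the risk."""
        score = model.score(extract_emotion(voice_to_check),
                            asr_transcribe(voice_to_check))            # S103
        return score > threshold                                       # S104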
More specifically, each recording file of a call center can be treated as one piece of voice data, so a huge amount of voice data can be obtained.
In step S101, at least a portion of this massive voice data may be treated as a set of voice data. An emotion feature vector is extracted from each piece of voice data in the set, yielding a plurality of emotion feature vectors.
The emotion feature vector can be used to represent or describe an emotion type, where the emotion type can be happiness, sadness, anger, fear, or disgust.
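As a minimal illustration (not taken from the patent itself), an emotion feature vector can be represented as a score distribution over these five emotion types; the values below are invented for the example:

    import numpy as np

    EMOTION_TYPES = ["happiness", "sadness", "anger", "fear", "disgust"]

    # Hypothetical vector for an utterance scored mostly as "anger"
    # by an upstream acoustic emotion model.
    emotion_vector = np.array([0.05, 0.10, 0.70, 0.10, 0.05])

    dominant = EMOTION_TYPES[int(np.argmax(emotion_vector))]
    print(dominant)  # -> anger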
Those skilled in the art will understand that each piece of voice data may contain speech output by multiple roles. For example, a recording made at a call center typically includes speech output by two roles, e.g., by a customer service person and by a customer.
Taking voice data that includes two roles as an example, the voice data may include voice data of a first role and voice data of a second role. In this case, the voice data of the first role and the voice data of the second role may first be distinguished from each other to obtain the voice data of the first role and the voice data of the second role.
In a specific implementation, the voice data of the customer service person and the voice data of the customer can be distinguished in advance in the recorded voice data to be evaluated. For example, the customer service person outputs voice at a first frequency and the customer outputs voice at a second frequency, the second frequency being different from the first frequency. As another example, the distinction may be made by keywords or by expressions commonly used by the different roles.
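A hedged sketch of the frequency-based distinction mentioned above, assuming the two speakers differ in average pitch; librosa's pyin pitch tracker is used purely for illustration, and the 165 Hz cutoff is an invented value that a real system would calibrate per recording:

    import librosa
    import numpy as np

    def assign_role_by_pitch(segment, sr, cutoff_hz=165.0):
        # Estimate the fundamental frequency track of one speech segment.
        f0, voiced_flag, voiced_probs = librosa.pyin(
            segment, fmin=librosa.note_to_hz("C2"),
            fmax=librosa.note_to_hz("C7"), sr=sr)
        # Label the segment by its median pitch; the cutoff is hypothetical.
        return "first_role" if np.nanmedian(f0) >= cutoff_hz else "second_role"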
Thereafter, emotion feature vectors of the voice data of the first role and of the voice data of the second role may be extracted, respectively.
Further, each piece of voice data may be converted into text data. In one embodiment, an Automatic Speech Recognition (ASR) technique may be used to convert each piece of voice data into text data, thereby obtaining a plurality of pieces of text data.
Take, again, voice data that includes voice data of a first role and voice data of a second role. In a specific implementation, the voice data of the first role and the voice data of the second role can be distinguished in advance, and then the voice data of the first role and the voice data of the second role are converted into text data respectively.
In step S102, an emotion detection model may be trained based on the emotion feature vectors and the text data. The voice data to be detected then serves as the input of the emotion detection model, which outputs the emotion score of the voice data to be detected.
In one embodiment, the emotion detection model may be obtained by training with a neural network algorithm based on the emotion feature vector and the text data. Preferably, the neural network algorithm may be a Long Short-Term Memory (LSTM) network, a type of recurrent neural network.
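A minimal sketch of the LSTM variant, assuming each training sample carries a tokenized transcript (integer ids) plus the five-dimensional emotion feature vector from above; the vocabulary size, sequence length and layer widths are arbitrary choices, not values from the patent:

    import tensorflow as tf

    VOCAB_SIZE, MAX_LEN, EMOTION_DIM = 10000, 200, 5

    # Two inputs: transcript token ids, and the emotion feature vector.
    text_in = tf.keras.Input(shape=(MAX_LEN,), dtype="int32", name="text")
    emo_in = tf.keras.Input(shape=(EMOTION_DIM,), name="emotion")

    x = tf.keras.layers.Embedding(VOCAB_SIZE, 64)(text_in)
    x = tf.keras.layers.LSTM(64)(x)                 # LSTM over the transcript
    x = tf.keras.layers.Concatenate()([x, emo_in])  # fuse text + emotion
    score = tf.keras.layers.Dense(1, activation="sigmoid", name="score")(x)

    model = tf.keras.Model([text_in, emo_in], score)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    # model.fit([token_ids, emotion_vectors], violation_labels, epochs=...)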
In another embodiment, the emotion feature vector and the text data may be used as input data of a Logistic Regression (LR) algorithm, and the emotion detection model is obtained through training.
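And a corresponding sketch of the logistic-regression variant, here assuming the transcripts are vectorized with TF-IDF and concatenated with the emotion feature vectors, with the predicted probability serving as the emotion score; the feature sizes are arbitrary:

    import numpy as np
    from scipy.sparse import hstack, csr_matrix
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def train_lr(transcripts, emotion_vectors, violation_labels):
        vectorizer = TfidfVectorizer(max_features=5000)
        X = hstack([vectorizer.fit_transform(transcripts),
                    csr_matrix(np.asarray(emotion_vectors, dtype=float))])
        clf = LogisticRegression(max_iter=1000).fit(X, violation_labels)
        return vectorizer, clf

    def emotion_score(vectorizer, clf, transcript, emotion_vector):
        X = hstack([vectorizer.transform([transcript]),
                    csr_matrix(np.asarray([emotion_vector], dtype=float))])
        return float(clf.predict_proba(X)[0, 1])  # probability as the score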
In a specific implementation, the voice data and the text data of each role can be input to the emotion detection model together to train it. For example, the voice data and text data of each role are marked so that the voice data and text data of different roles can be distinguished.
In step S103, the emotion score of the voice data to be detected is calculated based on the voice data to be detected and the emotion detection model. The voice data to be detected can be voice data from a preset time period or a recorded voice file. The voice data to be detected is input into the emotion detection model, which calculates its emotion score.
In one embodiment, the voice data to be detected is a voice file that includes voice data of a first role and voice data of a second role. Suppose the voice data of the first role is the voice data of the customer service person, and the voice data of the second role is the voice data of the customer. After the roles in the voice file have been distinguished and marked, the voice file may be input to the emotion detection model, and the emotion score it outputs is the emotion score of the first role (e.g., the customer service person).
It should be noted that the voice data of the second role helps the emotion detection model calculate the emotion score of the first role.
In step S104, it may be determined whether the voice data to be detected has a violation risk based on the emotion score. Taking a call center customer service person as an example, a violation can mean that the person uses aggressive, abusive or similar language during a conversation with the customer.
In a specific implementation, a preset threshold may be set for the emotion detection model, and whether the voice data to be detected has a violation risk is determined against this preset threshold.
If the emotion score is not higher than the preset threshold, it may be determined that the voice data to be detected carries no violation risk.
If the emotion score is higher than the preset threshold, it may be determined that the voice data to be detected has a violation risk. Further, voice data to be detected that has a violation risk can be labeled.
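The judging step then reduces to a comparison against the preset threshold; a small sketch, with the 0.8 default being illustrative only:

    def flag_if_violation(score, threshold=0.8):
        # Mark a recording as at risk when its emotion score exceeds the
        # preset threshold; flagged items go on to manual review.
        at_risk = score > threshold
        return {"emotion_score": score,
                "violation_risk": at_risk,
                "label": "needs_manual_review" if at_risk else None}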
In practical applications, the labeled voice data may be passed to manual review for further confirmation.
Fig. 2 is a flowchart illustrating a speech recognition method in a typical scenario according to an embodiment of the present invention. As shown in Fig. 2, in a typical scenario, a recording file recorded by a call center may be used as voice data; after the emotion detection model is obtained, it is used to determine whether a recording file has a violation risk.
Specifically, first, operation S201 may be performed to acquire voice data, for example, to acquire a recording file of a call center.
Next, operation S202 may be performed to convert the voice data into text data. Specifically, ASR technology can be used to obtain the text content corresponding to each audio file and to distinguish the two conversational roles, i.e., the customer service person and the customer.
Again, operation S203 may be performed to extract an emotion feature vector from the voice data. Specifically, an acoustic emotion model in the related art can be used to determine which of the five emotions, i.e., happiness, sadness, anger, fear and disgust, the emotion of each of the two conversational roles belongs to, and to output the corresponding emotion feature vector.
Further, operation S204 may be performed to train the emotion detection model. Specifically, the text content and the emotion feature vectors can be used as training input, and the emotion detection model is obtained by training with a neural network algorithm or a logistic regression algorithm.
Thereafter, operations S205 and S206 may be performed to input the voice file to be detected to the emotion detection model and calculate an emotion score. Specifically, a voice file to be detected is input to the emotion detection model, and an emotion score is output.
Further, if the emotion score output by the emotion detection model exceeds a preset threshold, the recording file may be tagged (not shown).
Further, the tagged recordings may be provided to a human reviewer for further confirmation. The preset threshold may be determined comprehensively according to the available review manpower and accuracy-related indices (not shown).
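One hedged way to derive that threshold from review manpower, assuming emotion scores for a recent batch of recordings are available: choose the cutoff so that the expected number of flagged files matches what reviewers can recheck.

    import numpy as np

    def threshold_for_capacity(scores, daily_review_capacity, daily_call_volume):
        # E.g. 50 reviewable files out of 10,000 calls -> flag the top 0.5%
        # of scores. Purely illustrative; not a rule stated in the patent.
        flag_fraction = daily_review_capacity / daily_call_volume
        return float(np.quantile(np.asarray(scores), 1.0 - flag_fraction))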
Therefore, the embodiment of the invention makes full use of massive voice data for training, so as to obtain a training model (namely the emotion detection model) with higher accuracy. The model is suitable for detecting massive amounts of voice, can complete the detection of voice data efficiently and accurately, and improves the detection rate of illegal voice.
Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention. The speech recognition apparatus 3 may implement the method solutions shown in Fig. 1 and Fig. 2 and may be executed by a computing device.
Specifically, the speech recognition apparatus 3 may include: an extraction module 31, configured to extract emotion feature vectors from a set of voice data and convert the set of voice data into text data; a training module 32, configured to train an emotion detection model based on the emotion feature vectors and the text data, wherein the emotion detection model is used for calculating an emotion score; a calculation module 33, configured to calculate the emotion score of the voice data to be detected based on the voice data to be detected and the emotion detection model; and a judging module 34, configured to judge whether the voice data to be detected has a violation risk based on the emotion score.
In a specific implementation, the judging module 34 may include: a determining submodule 341, configured to determine that the voice data to be detected has a violation risk when the emotion score is higher than a preset threshold.
In a specific implementation, the speech recognition apparatus 3 may further include: a marking module 35, configured to label the voice data to be detected that has a violation risk.
In one embodiment, the training module 32 may include: a first training submodule 321, which obtains the emotion detection model by training with a neural network algorithm based on the emotion feature vector and the text data.
In another embodiment, the training module 32 may include: a second training submodule 322, which obtains the emotion detection model by training with a logistic regression algorithm based on the emotion feature vector and the text data.
In a specific implementation, the emotion feature vector may be used to represent an emotion type, which may be selected from: happiness, sadness, anger, fear, disgust.
In a specific implementation, the extraction module 31 may include: a conversion submodule 311, configured to convert the voice data into the text data using a speech-to-text technique.
In a specific implementation, the voice data may include voice data of a first role and voice data of a second role, and the extraction module 31 may include: a distinguishing submodule 312, configured to distinguish the voice data of the first role from the voice data of the second role in the set of voice data to obtain the voice data of the first role and the voice data of the second role; and an extracting submodule 313, configured to extract emotion feature vectors of the voice data of the first role and of the voice data of the second role respectively, and to convert the voice data of the first role and the voice data of the second role into text data respectively.
For more details of the operating principle and operating mode of the speech recognition apparatus 3, reference may be made to the related descriptions of Fig. 1 and Fig. 2, which are not repeated here.
Further, the embodiment of the present invention also discloses a storage medium on which computer instructions are stored; when the computer instructions are executed, the technical solution of the method in the embodiments shown in Fig. 1 and Fig. 2 is performed. Preferably, the storage medium may include a computer-readable storage medium such as a non-volatile memory or a non-transitory memory. The storage medium may include ROM, RAM, magnetic disks or optical disks, etc.
Further, the embodiment of the present invention also discloses a computing device, which includes a memory and a processor, where the memory stores computer instructions capable of running on the processor, and the processor executes the computer instructions to perform the technical solutions of the methods described in the embodiments shown in Fig. 1 and Fig. 2.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (11)

1. A speech recognition method, comprising:
Extracting emotion feature vectors from a set of voice data and converting the set of voice data into text data;
Training to obtain an emotion detection model based on the emotion feature vector and the text data, wherein the emotion detection model is used for calculating an emotion score;
Calculating the emotion score of the voice data to be detected based on the voice data to be detected and the emotion detection model;
And judging whether the voice data to be detected has a violation risk based on the emotion score.
2. The voice recognition method according to claim 1, wherein the judging whether the voice data to be detected has a violation risk based on the emotion score comprises:
And determining that the voice data to be detected has a violation risk when the emotion score is higher than a preset threshold.
3. The speech recognition method of claim 2, further comprising:
And labeling the voice data to be detected that has a violation risk.
4. The speech recognition method of any one of claims 1 to 3, wherein training to obtain the emotion detection model based on the emotion feature vector and the text data comprises:
And training by adopting a neural network algorithm based on the emotion feature vector and the text data to obtain the emotion detection model.
5. The speech recognition method of any one of claims 1 to 3, wherein training to obtain the emotion detection model based on the emotion feature vector and the text data comprises:
And training by adopting a logistic regression algorithm based on the emotion feature vector and the text data to obtain the emotion detection model.
6. A speech recognition method according to any one of claims 1 to 3, wherein the emotion feature vector is used to represent an emotion type selected from: happiness, sadness, anger, fear, disgust.
7. The speech recognition method of any one of claims 1 to 3, wherein the converting the set of speech data into text data comprises:
And converting the voice data into the text data by adopting a voice-to-text technology.
8. The speech recognition method of any one of claims 1 to 3, wherein the voice data comprises voice data of a first role and voice data of a second role, and wherein extracting an emotion feature vector from a set of voice data and converting the set of voice data into text data comprises:
Distinguishing the voice data of the first role from the voice data of the second role in the set of voice data to obtain the voice data of the first role and the voice data of the second role;
And extracting emotion feature vectors of the voice data of the first role and of the voice data of the second role respectively, and converting the voice data of the first role and the voice data of the second role into text data respectively.
9. A speech recognition apparatus, comprising:
The extraction module is used for extracting emotion feature vectors from a set of voice data and converting the set of voice data into text data;
The training module is used for training to obtain an emotion detection model based on the emotion feature vector and the text data, and the emotion detection model is used for calculating an emotion score;
The calculation module is used for calculating the emotion score of the voice data to be detected based on the voice data to be detected and the emotion detection model;
And the judging module is used for judging whether the voice data to be detected has a violation risk based on the emotion score.
10. A storage medium having stored thereon computer instructions, characterized in that the computer instructions are operative to perform the steps of the method of any one of claims 1 to 8.
11. A computing device comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, wherein the processor, when executing the computer instructions, performs the steps of the method of any of claims 1 to 8.
CN201910967019.6A 2019-10-12 2019-10-12 Voice recognition method and device, storage medium and computing equipment Pending CN110580899A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910967019.6A CN110580899A (en) 2019-10-12 2019-10-12 Voice recognition method and device, storage medium and computing equipment

Publications (1)

Publication Number Publication Date
CN110580899A 2019-12-17

Family

ID=68814476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910967019.6A Pending CN110580899A (en) 2019-10-12 2019-10-12 Voice recognition method and device, storage medium and computing equipment

Country Status (1)

Country Link
CN (1) CN110580899A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107886951A (en) * 2016-09-29 2018-04-06 百度在线网络技术(北京)有限公司 A kind of speech detection method, device and equipment
CN107705807A (en) * 2017-08-24 2018-02-16 平安科技(深圳)有限公司 Voice quality detecting method, device, equipment and storage medium based on Emotion identification
CN107885723A (en) * 2017-11-03 2018-04-06 广州杰赛科技股份有限公司 Conversational character differentiating method and system
CN108122552A (en) * 2017-12-15 2018-06-05 上海智臻智能网络科技股份有限公司 Voice mood recognition methods and device
US20190198040A1 (en) * 2017-12-22 2019-06-27 Beijing Baidu Netcom Science And Technology Co., Ltd. Mood recognition method, electronic device and computer-readable storage medium
CN109102805A (en) * 2018-09-20 2018-12-28 北京长城华冠汽车技术开发有限公司 Voice interactive method, device and realization device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291184A (en) * 2020-01-20 2020-06-16 百度在线网络技术(北京)有限公司 Expression recommendation method, device, equipment and storage medium
CN111291184B (en) * 2020-01-20 2023-07-18 百度在线网络技术(北京)有限公司 Expression recommendation method, device, equipment and storage medium
CN111405128A (en) * 2020-03-24 2020-07-10 中国—东盟信息港股份有限公司 Call quality inspection system based on voice-to-text conversion


Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191217)