CN113642317A - Text error correction method and system based on voice recognition result

Info

Publication number: CN113642317A
Application number: CN202110922253.4A
Authority: CN
Inventors: 王晓虎; 冉猛; 汪哲逸; 宋佳鑫; 陈浩楠
Original assignee: Zhejiang Geely Holding Group Co Ltd; Guangyu Mingdao Digital Technology Co Ltd
Current assignee: Zhejiang Geely Holding Group Co Ltd; Guangyu Mingdao Digital Technology Co Ltd
Priority date: 2021-08-12
Filing date: 2021-08-12
Publication date: 2021-11-12

Abstract

The invention belongs to the technical field of natural language processing and provides a text error correction method and system based on a voice recognition result. The method comprises the following steps: collecting sample data and preprocessing the sample data, wherein the preprocessing comprises converting voice data into text data; marking the preprocessed sample data to form a first sample data set, and establishing a first model according to the first sample data set; acquiring an output result of the first model, using the output result as a second sample data set, and constructing a second model according to the second sample data set, wherein the second model is used for performing text error correction according to the correct probability of the output result of the first model; and acquiring target voice data, and obtaining correct text data after the target voice data is subjected to the preprocessing, the first model and the second model in sequence. The invention addresses the poor recognition effect of voice recognition in the prior art.

Description

Text error correction method and system based on voice recognition result

Technical Field

The invention relates to the technical field of natural language processing, in particular to a text error correction method and system based on a voice recognition result.

Background

At present, more and more people want to buy cars. To protect the legal rights and interests of consumers and to improve the service quality of salespeople in automobile 4S shops, speech recognition needs to be performed on dialogues in the sales process, and the resulting text then needs to be analyzed with natural language processing technology. However, current speech recognition technology does not achieve a good recognition effect, especially for professional terms and homophones in the automobile industry. In addition, voice data comes from different sources, so the quality of the text data converted from it is uneven, which makes subsequent data mining and data analysis very difficult.

Disclosure of Invention

The invention provides a text error correction method and system based on a voice recognition result, and aims to solve the problem that the recognition effect of voice recognition is poor in the prior art.

The text error correction method based on the voice recognition result provided by the invention comprises the following steps:

collecting sample data, and preprocessing the sample data, wherein the preprocessing comprises converting voice data into text data;

marking the preprocessed sample data to form a first sample data set, and establishing a first model according to the first sample data set, wherein the first model is used for fault diagnosis and classification of text data;

acquiring an output result of the first model, taking the output result as a second sample data set, and constructing a second model according to the second sample data set, wherein the second model is used for performing text error correction according to the correct probability of the output result of the first model;

and acquiring target voice data, and acquiring correct text data after the target voice data is subjected to the preprocessing, the first model and the second model in sequence.

Optionally, the first model is a text classifier, and marking the preprocessed sample data to form a first sample data set and establishing a first model according to the first sample data set specifically includes:

marking the preprocessed sample data to form a first sample data set;

training the text classifier using the first sample data set;

and inputting the first sample data set into the text classifier, and outputting the probability that each item of data in the first sample data set belongs to each category, wherein the categories comprise correct and error.

Optionally, the training of the text classifier by using the first sample data set specifically includes:

extracting features of the first sample data set by adopting a deep neural network to obtain a feature data set;

and training the text classifier by adopting the characteristic data set.

Optionally, the second model is a text error correction model, and acquiring the output result of the first model, taking the output result as a second sample data set, and constructing the second model according to the second sample data set specifically includes:

acquiring an output result of the first model, and acquiring the correct probability of the output result according to the output result of the first model;

acquiring a second sample data set according to the correct probability and the probability threshold of the output result;

and training the text error correction model by adopting the second sample data set.

Optionally, the training the text error correction model by using the second sample data set specifically includes:

and inputting the second sample data set into the text error correction model, performing mask processing on the second sample data set, predicting the masked second sample data set, and outputting a predicted text.

Optionally, obtaining correct text data after the target voice data is subjected to the preprocessing, the first model and the second model in sequence specifically includes:

performing the preprocessing on the target voice data;

inputting the preprocessed target voice data into the first model to obtain a target classification result;

acquiring text data to be corrected according to the target classification result;

and inputting the text data to be corrected into the second model to obtain correct text data.

Optionally, the voice data is generated in the automobile sales process, and the method further includes:

acquiring the automobile preference of a target user according to the correct text data;

and obtaining a target automobile recommendation scheme according to the automobile preference of the target user.

The invention also provides a text error correction system based on the voice recognition result, which comprises:

the data acquisition module is used for acquiring sample data and preprocessing the sample data, wherein the preprocessing comprises converting voice data into text data;

the first model building module is used for marking the preprocessed sample data to form a first sample data set, and building a first model according to the first sample data set, wherein the first model is used for fault diagnosis and classification of the text data;

the second model building module is used for obtaining the output result of the first model, using the output result as a second sample data set, and building a second model according to the second sample data set, wherein the second model is used for performing text error correction according to the correct probability of the output result of the first model;

and the target data acquisition module is used for acquiring target voice data, and acquiring correct text data after the target voice data is subjected to the preprocessing, the first model and the second model in sequence.

The present invention also provides an electronic device comprising: a processor and a memory;

the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the text error correction method based on the voice recognition result.

The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a text error correction method based on a speech recognition result as described above.

The invention has the following beneficial effects: in the text error correction method based on the voice recognition result, voice data is collected and converted into text data, the text data is marked to form a first sample data set, and a first model for fault diagnosis and classification of the text data is established according to the first sample data set; an output result of the first model is then obtained and taken as a second sample data set, and a second model for performing text error correction according to the correct probability of the output result of the first model is constructed; target voice data is acquired, and correct text data is obtained after the target voice data is subjected to the preprocessing, the first model and the second model in sequence. Accurate recognition of the target voice is thereby achieved, the accuracy of voice recognition in the target field is improved, and subsequent data analysis and mining are facilitated.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of a text error correction method based on a speech recognition result according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating a method for obtaining correct text data according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a text error correction system based on a speech recognition result in an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

In order to explain the technical means of the present invention, the following description will be given by way of specific examples.

First embodiment

Fig. 1 is a flowchart illustrating a text error correction method based on a speech recognition result according to an embodiment of the present invention.

As shown in fig. 1, the text error correction method based on the speech recognition result includes steps S110 to S140:

S110, collecting sample data and preprocessing the sample data, wherein the preprocessing comprises converting voice data into text data;

S120, marking the preprocessed sample data to form a first sample data set, and establishing a first model for text data fault diagnosis and classification according to the first sample data set;

S130, obtaining an output result of the first model, using the output result as a second sample data set, and constructing, according to the second sample data set, a second model for performing text error correction according to the correct probability of the output result of the first model;

and S140, acquiring target voice data, and sequentially carrying out the preprocessing, the first model processing and the second model processing on the target voice data to acquire correct text data.

In step S110 of this embodiment, the sample data is voice data. The voice data may be voice data in the automobile field, and specifically may be historical dialogue voice generated in the automobile sales process. The voice data is Chinese voice data and may be labeled or unlabeled. Converting the voice data into text data specifically includes: recognizing the voice data as a Chinese pinyin sequence by using a trained acoustic model, and then recognizing the Chinese pinyin sequence as a Chinese character sequence by using a trained language model, thereby obtaining the text data.

The acoustic model is obtained as follows: feature extraction is performed on the voice data to obtain a time-frequency map, a time-frequency data set is acquired from the time-frequency map, and the acoustic model is trained with the time-frequency data set to obtain the trained acoustic model. Extracting features of the voice data to obtain the time-frequency map specifically includes: performing pre-emphasis, framing, windowing and Fourier transform processing on the voice data. Pre-emphasis reduces the influence of lip radiation and improves the resolution of the high-frequency part of the voice, which improves recognition accuracy. The acoustic model may be a deep convolutional neural network model, with CTC (connectionist temporal classification) used as the loss function during training.

The language model is obtained as follows: pinyin sequences and character sequences are acquired from an existing corpus and used to train the language model, thereby obtaining the trained language model. The existing corpus may be the THCHS-30 corpus, and the language model may be an N-Gram model, a feedforward neural network model, an RNN model, or the like.
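As a non-limiting illustration, the feature extraction described above (pre-emphasis, framing, windowing and Fourier transform) may be sketched as follows. The pre-emphasis coefficient 0.97, the Hamming window, the frame length of 64 samples, the hop of 32 samples and all function names are assumptions of this sketch and are not specified by the embodiment; a naive discrete Fourier transform is used only to keep the example self-contained.

```python
import cmath
import math


def pre_emphasis(signal, alpha=0.97):
    """Boost high frequencies: y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1] for n in range(1, len(signal))]


def frame_signal(signal, frame_len, hop):
    """Split the signal into overlapping frames; trailing samples that do not fill a frame are dropped."""
    return [signal[start:start + frame_len]
            for start in range(0, len(signal) - frame_len + 1, hop)]


def hamming(frame_len):
    """Hamming window coefficients (the window type is an assumption of this sketch)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1)) for n in range(frame_len)]


def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2; a real pipeline would use an FFT."""
    n_samples = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_samples)
                    for n in range(n_samples)))
            for k in range(n_samples // 2 + 1)]


def time_frequency_map(signal, frame_len=64, hop=32, alpha=0.97):
    """Pre-emphasis -> framing -> windowing -> Fourier transform, one spectrum per frame."""
    emphasized = pre_emphasis(signal, alpha)
    window = hamming(frame_len)
    return [magnitude_spectrum([s * w for s, w in zip(frame, window)])
            for frame in frame_signal(emphasized, frame_len, hop)]


# A synthetic 1 kHz tone sampled at 8 kHz stands in for real speech.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(256)]
spectrogram = time_frequency_map(tone)
# Index of the strongest frequency bin in the first frame.
peak_bin = max(range(len(spectrogram[0])), key=lambda k: spectrogram[0][k])
```

Each row of `spectrogram` is the magnitude spectrum of one frame, so the rows stacked over time form the time-frequency map from which a time-frequency data set could be built.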

In step S120 of this embodiment, marking the preprocessed sample data may mean marking only a part of the preprocessed sample data, so as to form the first sample data set. The first model may be a text classifier. Establishing the first model for text fault diagnosis classification according to the first sample data set specifically comprises: training the text classifier with the first sample data set to establish the first model for text fault diagnosis classification; and inputting the first sample data set into the text classifier, and outputting the probability that each item of data in the first sample data set belongs to each category, wherein the categories comprise correct and error.

In an embodiment, training the text classifier with the first sample data set to establish the first model for text fault diagnosis classification may proceed as follows: features of the first sample data set are extracted with a deep neural network to obtain a feature data set, and the text classifier is trained with the feature data set. Specifically, the deep neural network comprises an ALBERT layer, also called a vector layer, in which a trained ALBERT pre-training model vectorizes the first sample data set to obtain a vector data set. The output of the vector layer is then passed to a dropout layer, which randomly retains features in a given proportion. The features retained by the dropout layer are input into an activation layer, where a ReLU function generates the activation state. Finally, the probability of each category is obtained through a fully connected layer and a softmax function.
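As a non-limiting illustration, the layers that follow the vector layer (dropout, ReLU activation, fully connected layer and softmax) may be sketched as below. The ALBERT vector layer is not reproduced: a hypothetical 4-dimensional sentence vector stands in for its output, and the weights, the dropout rate of 0.1 and all function names are assumptions of this sketch rather than values from the embodiment.

```python
import math
import random


def dropout(features, rate, rng, training=True):
    """Zero each feature with probability `rate` and rescale the survivors (inverted dropout)."""
    if not training or rate == 0.0:
        return list(features)
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


def relu(features):
    """Activation layer: negative values become zero."""
    return [max(0.0, x) for x in features]


def dense(features, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over the category scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence_vector, weights, bias, rng, rate=0.1, training=False):
    """Return the probabilities of the two categories as [correct, error]."""
    hidden = relu(dropout(sentence_vector, rate, rng, training))
    return softmax(dense(hidden, weights, bias))


rng = random.Random(0)
# Hypothetical sentence vector standing in for the output of the ALBERT vector layer.
sentence_vector = [0.5, -1.2, 0.3, 2.0]
# Hypothetical, untrained weights: row 0 scores "correct", row 1 scores "error".
weights = [[0.4, 0.1, -0.3, 0.8], [-0.2, 0.5, 0.6, -0.7]]
bias = [0.0, 0.0]
p_correct, p_error = classify(sentence_vector, weights, bias, rng)
```

Dropout is active only when `training=True`; at prediction time the features pass through unchanged, so the two returned probabilities are deterministic and sum to one.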

In step S130 of this embodiment, acquiring the second sample data set specifically includes: converting acquired historical voice data into historical text data through voice recognition, inputting the historical text data into the first model, and obtaining a classification result of the historical text data; and obtaining text data to be processed according to the classification result of the historical text data and a probability threshold, so as to form the second sample data set. Specifically, if the probability of the correct category of an item of historical text data is greater than or equal to the probability threshold, that item is text data to be processed, and a plurality of items of text data to be processed are collected to form the second sample data set. The probability threshold may be determined according to the accuracy with which voice recognition converts the historical voice data into the historical text data: if the accuracy of voice recognition is high, for example 88% or 90%, the probability threshold is set to a higher value, for example 90% or 95%; if the accuracy of voice recognition is low, the probability threshold is set to a lower value, for example 60% or 65%. The historical voice data may be historical voice data in the automobile field.
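As a non-limiting illustration, the selection of the second sample data set by probability threshold may be sketched as follows. The example sentences and probabilities are hypothetical, and the rule in `choose_threshold` (a single cut-off at 0.88 mapping to 0.90 or 0.60) is an assumption that merely reuses example values from the description; the embodiment does not prescribe how accuracy maps to a threshold.

```python
def choose_threshold(asr_accuracy, high_accuracy_cutoff=0.88):
    """Pick a higher threshold when speech recognition is accurate, a lower one otherwise.

    The cut-off and the returned values reuse example figures from the description;
    combining them into this rule is an assumption of the sketch.
    """
    return 0.90 if asr_accuracy >= high_accuracy_cutoff else 0.60


def build_second_sample_set(classified_texts, threshold):
    """Keep the texts whose correct-category probability is >= threshold."""
    return [text for text, p_correct in classified_texts if p_correct >= threshold]


# Hypothetical first-model outputs: (historical text, probability of the correct category).
classified = [
    ("这款车的轴距是多少", 0.97),
    ("我想看一下新能源车型", 0.91),
    ("这个车的有耗怎么样", 0.42),
]
threshold = choose_threshold(asr_accuracy=0.90)
second_sample_set = build_second_sample_set(classified, threshold)
```

With an assumed recognition accuracy of 90%, the threshold is 0.90 and only the first two texts enter the second sample data set.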

In an embodiment, the second model is a text error correction model, and establishing, according to the second sample data set, the second model for performing text error correction according to the correct probability of the output result of the first model specifically includes: training the text error correction model with the second sample data set; and inputting the second sample data set into the text error correction model, performing mask processing on the second sample data set, predicting the masked second sample data set, and outputting a predicted text. The mask processing on the second sample data set specifically includes: acquiring the erroneous characters of the text data to be processed in the second sample data set, and masking the erroneous characters; the masked characters are then learned and recovered, so that the masked second sample data set is predicted and a predicted text is output. In the training stage, the text error correction model adopts the Masked LM of UniLM and constrains the encoding process according to the pre-training objective, thereby correcting erroneous text data and improving the accuracy of voice recognition.
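As a non-limiting illustration, the mask processing may be sketched as follows. The sketch assumes that a reference text of equal length, aligned character by character with the recognized text, is available to locate the erroneous characters; the embodiment does not state how the erroneous characters are found. The `[MASK]` token string, the function names and the example sentences are likewise hypothetical, and the prediction itself (the Masked LM) is replaced by a lookup.

```python
MASK = "[MASK]"


def mask_errors(recognized, reference):
    """Mask the characters where the recognized text differs from the reference.

    Returns the masked token list and a dict {position: character to recover}.
    """
    if len(recognized) != len(reference):
        raise ValueError("this sketch assumes equal-length, character-aligned texts")
    tokens, targets = [], {}
    for i, (rec, ref) in enumerate(zip(recognized, reference)):
        if rec == ref:
            tokens.append(rec)
        else:
            tokens.append(MASK)
            targets[i] = ref
    return tokens, targets


def fill_masks(tokens, predictions):
    """Assemble the predicted text by placing a predicted character at each masked position."""
    return "".join(predictions[i] if tok == MASK else tok for i, tok in enumerate(tokens))


# Hypothetical pair: the recognizer produced a homophone error for "轴距" (wheelbase).
tokens, targets = mask_errors("这款车的周距很长", "这款车的轴距很长")
# A trained model would predict the masked characters; the targets stand in for its output here.
predicted_text = fill_masks(tokens, targets)
```

During training, `tokens` would be the model input and `targets` the characters the model learns to recover at the masked positions.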

In step S140 of this embodiment, the correct text data is obtained after the target voice data is subjected to the preprocessing, the first model and the second model in sequence. Referring to fig. 2, fig. 2 is a flowchart illustrating a method for obtaining the correct text data according to an embodiment of the present invention.

As shown in fig. 2, the method for acquiring correct text data may include the following steps S210-S240:

S210, preprocessing target voice data;

S220, inputting the preprocessed target voice data into a first model to obtain a target classification result;

S230, acquiring text data to be corrected according to the target classification result;

S240, inputting the text data to be corrected into the second model to obtain correct text data.

In an embodiment, the target voice data may be obtained from voice communication content in the automobile field; specifically, it may be obtained from the voice communication between a salesperson and a customer in the automobile sales process. Voice recognition of the target voice data follows the same process as voice recognition of the sample data: the target voice data is recognized as a Chinese pinyin sequence by using the trained acoustic model, and the Chinese pinyin sequence is then recognized as a Chinese character sequence by using the trained language model, so as to obtain the target text data.
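As a non-limiting illustration, the conversion of a pinyin sequence into a character sequence with an N-Gram language model may be sketched for the bigram case as follows. The candidate characters, the bigram probabilities, the `<s>` start symbol, the penalty for unseen bigrams and the function name are all hypothetical; a real model would be estimated from a corpus such as THCHS-30, and the Viterbi search shown here is one common decoding choice rather than the procedure of the embodiment.

```python
import math


def decode_pinyin(syllables, candidates, bigram_logp, unseen_logp=-10.0):
    """Viterbi search for the most probable character sequence under a bigram model."""
    # best[ch] = (log-probability, best character path ending in ch)
    best = {ch: (bigram_logp.get(("<s>", ch), unseen_logp), [ch])
            for ch in candidates[syllables[0]]}
    for syllable in syllables[1:]:
        step = {}
        for ch in candidates[syllable]:
            # Extend every surviving path with ch and keep only the best one.
            step[ch] = max(
                ((score + bigram_logp.get((prev, ch), unseen_logp), path + [ch])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
        best = step
    return "".join(max(best.values(), key=lambda item: item[0])[1])


# Hypothetical candidate characters for each toneless pinyin syllable.
candidates = {"zhou": ["周", "轴"], "ju": ["距", "句"]}
# Hypothetical bigram log-probabilities; ("<s>", x) is the probability of starting with x.
bigram_logp = {
    ("<s>", "周"): math.log(0.6),
    ("<s>", "轴"): math.log(0.4),
    ("轴", "距"): math.log(0.9),
    ("周", "距"): math.log(0.05),
}
decoded = decode_pinyin(["zhou", "ju"], candidates, bigram_logp)
```

Although "周" is the more likely first character on its own, the bigram "轴距" (wheelbase) scores higher overall, which is the kind of domain term the description identifies as hard for speech recognition.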

The target text data is input into the first model to obtain a target classification result, wherein the target classification result comprises the correct-category probability and the error-category probability of the target text data.

In an embodiment, the determination manner of the probability threshold is the same as the determination manner of the probability threshold in step S130, and is not described herein again. The target voice data is preprocessed, namely the target voice data is converted into target text data, and then the target text data is input into the first model to obtain a target classification result. When the correct category probability of the target text data is greater than or equal to the probability threshold, the recognition accuracy in the voice recognition process of the target voice is high, so that the text data is determined to be correct text data, and subsequent error correction processing is not performed; when the probability of the correct category of the target text data is smaller than the probability threshold, the recognition accuracy in the process of carrying out voice recognition on the target voice is low, so that the text data is determined as text data to be corrected, and then the text data to be corrected is subjected to error correction processing to obtain the correct text data. By inputting the target text data into the first model, the correct category probability of the target text data is obtained, so that the target text data is determined to be the correct text data or the text data to be corrected, and the recognition accuracy and the processing efficiency of the target voice are improved.
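As a non-limiting illustration, the threshold decision described above may be sketched as follows. The two toy functions are hypothetical stand-ins for the trained first model and second model, and the threshold of 0.90 is one of the example values mentioned for step S130.

```python
def route_transcript(text, classify, correct, threshold):
    """Keep the text if its correct-category probability reaches the threshold; otherwise correct it.

    Returns (final text, whether error correction was applied).
    """
    p_correct = classify(text)
    if p_correct >= threshold:
        return text, False
    return correct(text), True


def toy_classifier(text):
    """Hypothetical first model: flags one known misrecognition with a low probability."""
    return 0.30 if "有耗" in text else 0.95


def toy_corrector(text):
    """Hypothetical second model: repairs the same misrecognition ("油耗", fuel consumption)."""
    return text.replace("有耗", "油耗")


kept, kept_was_corrected = route_transcript(
    "这款车的轴距是多少", toy_classifier, toy_corrector, threshold=0.90)
fixed, fixed_was_corrected = route_transcript(
    "这个车的有耗怎么样", toy_classifier, toy_corrector, threshold=0.90)
```

Only the second text falls below the threshold and is sent to the corrector, which reflects the efficiency argument above: text already judged correct skips the error correction step.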

In an embodiment, if the target text data is text data to be corrected, the text data to be corrected is input into the second model established in step S130 and subjected to error correction processing, so as to obtain correct text data. In summary, historical dialogue voice generated in the automobile sales process is acquired, voice recognition is performed on the historical dialogue voice to obtain historical text data, and, on the basis of the historical text data, a first model for fault diagnosis classification and a second model for performing text error correction according to the correct probability of the output result of the first model are established. Target voice data is then obtained from voice communication content in the automobile field, target text data is obtained by performing voice recognition on the target voice data, and the target text data is input into the first model to obtain a target classification result. Text data to be corrected is acquired according to the target classification result and input into the second model to obtain correct text data. Accurate recognition of the target voice data is thereby achieved, the recognition accuracy of professional terms in the automobile field in the target voice is improved, and subsequent data analysis and mining are facilitated.

In one embodiment, the voice data in the sample data is historical dialogue voice generated in automobile sales. On this basis, a first model for fault diagnosis classification and a second model for performing text error correction according to the correct probability of the output result of the first model are established. The acquired target voice data is converted into target text data, and the target text data is then input into the first model, or into the first model and the second model, to obtain the correct text data. The automobile preference of the target user can then be obtained according to the correct text data, and a target automobile recommendation scheme can be obtained according to the automobile preference of the target user. The automobile preference of the target user is obtained by extracting the target user information from the correct text data. The automobile salesperson obtains the automobile recommendation scheme according to the automobile preference of the target user and makes personalized recommendations to the target user according to the scheme, so that a transaction with the target user is easier to reach, the service quality of the automobile salesperson is improved, and the transaction rate of automobile sales is improved.

Second embodiment

Based on the same inventive concept as the method in the first embodiment, correspondingly, the embodiment also provides a text error correction system based on the voice recognition result.

Fig. 3 is a schematic structural diagram of a text error correction system based on a speech recognition result provided by the present invention.

As shown in fig. 3, the system 3 comprises: a data acquisition module 31, a first model building module 32, a second model building module 33 and a target data acquisition module 34.

The second model building module is used for obtaining the output result of the first model and taking the output result as a second sample data set to build a second model, wherein the second model is used for performing text error correction according to the correct probability of the output result of the first model.

In some exemplary embodiments, the first model building module comprises:

the first sample data set acquisition unit is used for marking the preprocessed sample data to form a first sample data set;

the first training unit is used for training a first model by adopting the first sample data set, wherein the first model is a text classifier;

and the first model establishing unit is used for inputting the first sample data set into the text classifier and outputting the probability that each item of data in the first sample data set belongs to each category, wherein the categories comprise correct and error.

In some exemplary embodiments, the first training unit includes:

the characteristic extraction subunit is used for extracting the characteristics of the first sample data set by adopting a deep neural network to obtain a characteristic data set;

a first training subunit for training the text classifier using the feature data set.

In some exemplary embodiments, the second model building module comprises:

the output result acquisition unit is used for acquiring the output result of the first model and acquiring the correct probability of the output result according to the output result of the first model;

the second sample data set acquisition unit is used for acquiring a second sample data set according to the correct probability and the probability threshold of the output result;

and the second training unit is used for training a second model by adopting a second sample data set, wherein the second model is a text error correction model.

In some exemplary embodiments, the second training unit comprises:

and the second training subunit is used for inputting the second sample data set into the text error correction model, performing mask processing on the second sample data set, predicting the masked second sample data set and outputting a predicted text.

In some exemplary embodiments, the target data acquisition module includes:

the preprocessing unit is used for acquiring target voice data and preprocessing the target voice data;

the target classification result acquisition unit is used for inputting the preprocessed target voice data into the first model to acquire a target classification result;

the data acquisition unit to be corrected is used for acquiring text data to be corrected according to the target classification result;

and the correct text data acquisition unit is used for inputting the text data to be corrected into the second model to acquire correct text data.

In some exemplary embodiments, the text correction system based on the speech recognition result further includes:

the automobile preference acquisition module is used for acquiring the automobile preference of the target user according to the correct text data;

and the recommendation scheme acquisition module is used for acquiring a target automobile recommendation scheme according to the automobile preference of the target user.

The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.

The present embodiment also provides an electronic device, including: a processor and a memory;

the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the method in the embodiment.

The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The electronic device provided by this embodiment comprises a processor, a memory, a transceiver and a communication interface. The memory and the communication interface are connected with the processor and the transceiver and communicate with one another; the memory is used for storing a computer program, the communication interface is used for communication, and the processor and the transceiver are used for running the computer program so that the electronic device executes the steps of the above method.

In this embodiment, the memory may include a Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; the processor may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In the above-described embodiments, references in the specification to "the present embodiment," "an embodiment," "another embodiment," "in some exemplary embodiments," or "other embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments. The various appearances of the phrase "the present embodiment," "one embodiment," or "another embodiment" are not necessarily all referring to the same embodiment.

In the embodiments described above, although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory structures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments. The embodiments of the invention are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims.

The embodiments in this specification are described in a progressive manner. For parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described only briefly; for the relevant details, refer to the corresponding description of the method embodiment.

The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
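As a non-limiting illustration of how the method could be organized as program modules, the following minimal Python sketch chains preprocessing, first model processing, and second model processing in the order stated in the method. All function names, the `Scored` type, and the threshold rule are hypothetical and introduced only for illustration; the speech recognizer, the classifier, and the correction model are passed in as callables because their concrete form is not assumed here. In particular, correcting only texts whose correct probability falls below a threshold is one possible reading of "performing text error correction according to the correct probability", not a statement of the disclosed implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Scored:
    """A recognized text with the first model's correct probability (hypothetical type)."""
    text: str
    p_correct: float


def preprocess(voice_data: Sequence[str],
               transcribe: Callable[[str], str]) -> List[str]:
    """Preprocessing: convert each item of voice data into text data."""
    return [transcribe(item).strip() for item in voice_data]


def first_model(texts: Sequence[str],
                score: Callable[[str], float]) -> List[Scored]:
    """First model: attach a correct probability to each recognized text."""
    return [Scored(t, score(t)) for t in texts]


def second_model(scored: Sequence[Scored],
                 correct: Callable[[str], str],
                 threshold: float = 0.5) -> List[str]:
    """Second model: correct texts judged likely wrong.

    Assumption: a text is corrected only when p_correct < threshold.
    """
    return [s.text if s.p_correct >= threshold else correct(s.text)
            for s in scored]


def run_pipeline(voice_data: Sequence[str],
                 transcribe: Callable[[str], str],
                 score: Callable[[str], float],
                 correct: Callable[[str], str],
                 threshold: float = 0.5) -> List[str]:
    """Apply preprocessing, the first model, and the second model in sequence."""
    texts = preprocess(voice_data, transcribe)
    return second_model(first_model(texts, score), correct, threshold)
```

In use, `transcribe` would wrap a speech recognizer, `score` a trained text classifier, and `correct` a trained text correction model; any of the three can be replaced by a stub for testing without changing the processing order.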

The foregoing embodiments merely illustrate the principles of the present invention and its effects, and are not to be construed as limiting the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes that can be made by those skilled in the art without departing from the spirit and technical concept of the present invention be covered by the claims of the present invention.

Claims

1. A text error correction method based on a voice recognition result is characterized by comprising the following steps:

acquiring an output result of the first model, using the output result as a second sample data set, and constructing a second model according to the second sample data set, wherein the second model is used for performing text error correction according to the correct probability of the output result of the first model;

2. The method according to claim 1, wherein the first model is a text classifier, the labeling of the preprocessed sample data forms a first sample data set, and the building of the first model according to the first sample data set specifically includes:

marking the preprocessed sample data to form a first sample data set;

training the text classifier using the first sample data set;

3. The method of claim 2, wherein the training of the text classifier using the first sample data set comprises:

and training the text classifier by adopting the characteristic data set.

4. The method according to claim 2, wherein the second model is a text correction model, the obtaining of the output result of the first model is used as a second sample data set, and the constructing of the second model according to the second sample data set includes:

5. The method of claim 4, wherein the training of the text correction model using the second sample data set comprises:

6. The method according to claim 2, wherein the preprocessing, the first model processing and the second model processing are sequentially performed on the target voice data to obtain correct text data, and specifically comprises:

performing the preprocessing on the target voice data;

7. The text error correction method based on the voice recognition result according to any one of claims 1 or 6, wherein the voice data is voice data generated by car sales, the method further comprising:

8. A text correction system based on speech recognition results, the system comprising:

9. An electronic device comprising a processor, a memory, and a communication bus;

the communication bus is used for connecting the processor and the memory;

the processor is configured to execute a computer program stored in the memory to implement the method of any one of claims 1-7.

10. A computer-readable storage medium, having stored thereon a computer program for causing a computer to perform the method of any one of claims 1-7.