CN110223675B - Method and system for screening training text data for voice recognition - Google Patents


Info

Publication number
CN110223675B
Authority
CN
China
Prior art keywords
screening
training text
neural network
processing
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910510814.2A
Other languages
Chinese (zh)
Other versions
CN110223675A (en)
Inventor
陈明佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201910510814.2A
Publication of CN110223675A
Application granted
Publication of CN110223675B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a method for screening training text data for voice recognition. The method comprises the following steps: normalizing the training text data, and preprocessing the normalized training text data before input, the preprocessing comprising: converting the normalized training text data into input information for a data screening model, wherein the input information comprises a unique number corresponding to each sentence of the training text; and importing the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screening the text sentences that reach a preset positive case probability score threshold value in the output of the fusion screening model as training text data for voice recognition. The embodiment of the invention also provides a system for screening training text data for voice recognition. Because the whole process is automated, the embodiments save a large amount of labor cost and improve reusability, and by taking the linguistic relationships within the text content into account they improve the screening of the training text data.

Description

Method and system for screening training text data for voice recognition
Technical Field
The invention relates to the field of intelligent voice, in particular to a method and a system for screening training text data for voice recognition.
Background
Training a speech recognition model often requires a large amount of high-quality training text data to achieve good results. In existing schemes, large amounts of training text data are usually obtained either by coarse matching based on simple character rules, or by manual inspection after such simple rule matching.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
Existing text training data for voice recognition are obtained partly through a small amount of manual inspection and labeling, but mostly through simple, coarse processing tools; no efficient, automated method is used. Although data obtained through manual labeling are generally of high quality, the volume of speech recognition training text is usually very large, often hundreds of megabytes (MB, on the order of a hundred million characters) or even terabytes (TB), so manual labeling would consume a large amount of manpower and material resources.
First, the amount of text training data used for speech recognition is usually huge and its distribution is very diverse. A method based on simple rules may work on text data within some fixed range or domain, but when the same scheme is migrated to other data, much of the work has to be redone. The reusability of such a scheme is extremely low, and it is difficult to popularize.
Secondly, general automated schemes cannot effectively take the semantic and linguistic relationships in the text content into account, so the quality of the screened data is not high, even though judging whether a text is reasonable or fluent depends precisely on those semantic and linguistic relationships.
In addition, general automated schemes adopt rule-based screening and filtering. Once the number of rules grows, many rules become contradictory or redundant, and the same input data may match several rules at once; deciding which rule actually applies requires manual intervention or rule weights set in advance. Furthermore, in such schemes it is usually impossible to predict in advance which rule or rules a given piece of input data will need, so every piece of data has to be checked against the rules one by one, which makes it difficult for these schemes to process data quickly and efficiently at large scale and in a distributed manner.
Disclosure of Invention
Embodiments of the invention at least solve the problem in the prior art that text training data are obtained either by coarse matching based on simple character rules or by manual inspection after simple rule matching.
In a first aspect, an embodiment of the present invention provides a method for screening training text data for speech recognition, including:
carrying out standardization processing on training text data, and carrying out preprocessing before inputting on the training text data after the standardization processing, wherein the preprocessing before inputting at least comprises the following steps: converting the training text data after the normalization processing into input information of a data screening model, wherein the input information comprises a unique number corresponding to a sentence of the training text;
and importing the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screening the text sentences reaching a preset positive case probability score threshold value in the output of the fusion screening model into training text data for voice recognition.
In a second aspect, an embodiment of the present invention provides a system for screening training text data for speech recognition, including:
the pre-processing module is used for carrying out normalization processing on the training text data and carrying out pre-input preprocessing on the training text data after the normalization processing, wherein the pre-input preprocessing at least comprises the following steps: converting the training text data after the normalization processing into input information of a data screening model, wherein the input information comprises a unique number corresponding to a sentence of the training text;
and the training text screening module is used for importing the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screening the text sentences which reach a preset positive case probability score threshold value in the output of the fusion screening model into training text data for voice recognition.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for screening training text data for speech recognition according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for screening training text data for speech recognition according to any embodiment of the present invention.
The embodiment of the invention has the following beneficial effects: the training text data are normalized and converted into input information for a data screening model, so the whole process is automated, which saves a large amount of labor cost, reduces expenses for enterprises, and improves reusability. By fusing several kinds of neural network screening models and judging across multiple dimensions, the linguistic relationships in the text content are effectively taken into account and the screening of the training text data is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for screening training text data for speech recognition according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a system for screening training text data for speech recognition according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for screening training text data for speech recognition according to an embodiment of the present invention, which includes the following steps:
s11: carrying out standardization processing on training text data, and carrying out preprocessing before inputting on the training text data after the standardization processing, wherein the preprocessing before inputting at least comprises the following steps: converting the training text data after the normalization processing into input information of a data screening model, wherein the input information comprises a unique number corresponding to a sentence of the training text;
s12: and importing the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screening the text sentences reaching a preset positive case probability score threshold value in the output of the fusion screening model into training text data for voice recognition.
In this embodiment, neural network classification based on deep semantic representations of the text is adopted. Because deep learning is used to represent the semantics of the text, effective semantic information can be obtained with a very simple and efficient scheme, which makes the screening more accurate. A deep-learning-based scheme only needs different input and output information and can share the same model structure across different fields or scenarios, so the model is highly extensible. Moreover, both model training and formal deployment of the model can run on a distributed computer cluster, so the screening efficiency can be multiplied.
For step S11, the training text data are normalized. As an embodiment, the normalization processing at least includes text format processing and/or character form processing. The text format processing includes: converting training text data in a non-standard format into a text form of one word per line or one sentence per line, where the non-standard formats include HTML and JSON. The character form processing includes: removing illegal symbols from the training text data, where the illegal symbols include web page tags and emoticons.
For example, because training text data are collected from a wide range of sources, the acquired data are relatively cluttered. HTML-formatted data might be obtained as follows:
var articleTitle = "Text classification for machine learning (with training set + data set + all code)"; <br/> 1. Precise mode, which tries to segment the sentence most accurately and is suitable for text analysis. <br/> 2. Full mode scans out all the words in the sentence that can form words very fast but cannot resolve ambiguity <br/>
Such training text sentences usually carry markup symbols like these. During data training such illegal characters need to be removed; after text format processing the result is:
Text classification for machine learning (with training set + data set + all code)
1. Precise mode, which tries to segment the sentence most accurately and is suitable for text analysis.
2. Full mode scans out all the words in the sentence that can form words very fast but cannot resolve ambiguity.
Performing text format processing and character form processing on data acquired from web pages in this way makes the training text more standardized and improves the accuracy of the training text data.
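To make this concrete, the following is a minimal Python sketch of the text format and character form processing described above; the regular expressions, the allowed character set and the function name are illustrative assumptions rather than details taken from the patent.

```python
import re
from html import unescape

def normalize_text(raw: str) -> list:
    """Text format + character form processing: strip web-page tags and
    illegal symbols, return a list of non-empty cleaned lines."""
    text = unescape(raw)
    text = re.sub(r"<[^>]+>", "\n", text)  # drop web-page tags such as <br/>
    # keep CJK, alphanumerics, common punctuation and whitespace; drop emoticons etc.
    text = re.sub(r"[^\u4e00-\u9fff0-9A-Za-z，。！？、,.!?\s]", "", text)
    return [line.strip() for line in text.splitlines() if line.strip()]

raw = 'var articleTitle = "机器学习的文本分类";<br/>1. 精确模式，适合文本分析。<br/>'
print(normalize_text(raw))
```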
After the character form processing, the normalization further includes: sentence-breaking processing.
The sentence-breaking processing comprises: breaking sentences according to the punctuation marks in the training text data; when a stretch of training text longer than a preset length contains no punctuation mark, punctuation is first added through character form processing, and sentence breaking is then applied to the training text data with the added punctuation.
For example, the sentence above, "2. Full mode scans out all the words in the sentence that can form words very fast but cannot resolve ambiguity", can be broken into "2. Full mode scans out all the words in the sentence that can form words very fast but cannot resolve ambiguity." with a final period added.
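A small sketch of this sentence-breaking rule might look like the following; the maximum length of 30 characters, the punctuation set and the choice to append a Chinese full stop are assumptions made for illustration.

```python
import re

MAX_LEN = 30  # assumed "preset length"

def break_sentences(line: str) -> list:
    """Split after sentence-final punctuation; if an over-long line has no
    punctuation at all, append a full stop first, as in the example above."""
    if len(line) > MAX_LEN and not re.search(r"[。！？!?.，,]", line):
        line = line + "。"
    parts = re.split(r"(?<=[。！？!?.])", line)
    return [p for p in parts if p.strip()]

print(break_sentences("全模式把句子中所有可以成词的词语都扫描出来速度非常快但是不能解决歧义"))
# -> ['全模式把句子中所有可以成词的词语都扫描出来速度非常快但是不能解决歧义。']
```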
After normalization, the training text data undergo data conversion into input information suitable for the data screening model, and this input information contains a unique number for each sentence of the training text. Because the input to the data screening module is usually a vector or a matrix, the words of each sentence cannot be used directly as input. Instead, the feature conversion module maps every word to a unique number, so that each sentence of text corresponds to a number string, and the vector formed from these number strings serves as the input data of the data screening module.
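A toy version of this feature conversion is sketched below; it assumes the vocabulary is built on the fly and that unseen words map to a reserved unknown id, since the patent does not specify how the unique numbers are generated.

```python
class FeatureConverter:
    """Map every word to a unique id so that each sentence becomes a number
    string (vector); id 0 is reserved for unknown words (an assumption)."""

    def __init__(self):
        self.vocab = {"<unk>": 0}

    def fit(self, segmented_sentences):
        for words in segmented_sentences:
            for w in words:
                self.vocab.setdefault(w, len(self.vocab))

    def transform(self, words):
        return [self.vocab.get(w, 0) for w in words]

conv = FeatureConverter()
conv.fit([["梵高", "向日葵", "漂亮"]])
print(conv.transform(["梵高", "向日葵", "漂亮"]))  # [1, 2, 3]
```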
For step S12, the converted input information is imported into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and the text sentences whose output from the fusion screening model reaches a preset positive case probability score threshold value are screened as training text data for speech recognition. Each batch of input data is sent simultaneously into the several neural network screening models; each model performs its own screening computation and produces a score for every classification label of the data. The several groups of scores are then fused by some fusion method into a single group of scores, one per classification label, where the labels are the positive example label and the negative example label of a sentence. For example, a garbled, nonsensical sentence might receive a positive example label score of 0.1 and a negative example label score of 0.9, while the sentence "It's too dark, turn on the light" might receive a positive example label score of 0.95 and a negative example label score of 0.05. With a preset positive case probability score threshold of 0.5, the sentence "It's too dark, turn on the light" is screened as training text data for speech recognition.
According to this embodiment, the training text data are normalized and converted into input information for the data screening model, so the whole process is automated, which saves a large amount of labor cost, reduces expenses for enterprises, and improves reusability. By fusing several kinds of neural network screening models and judging across multiple dimensions, the linguistic relationships in the text content are effectively taken into account and the screening of the training text data is improved.
As an implementation manner, in this embodiment, the pre-input preprocessing at least includes:
performing word segmentation on the training text data after the normalization processing to obtain word string combinations with uniform granularity;
and converting the word string combination into input information suitable for a data screening model, wherein the input information comprises a unique number corresponding to the word string combination.
In this embodiment, the normalized training text data are segmented into words. For example, the sentence "The sunflowers painted by Van Gogh are beautiful" is segmented into word string combinations of uniform granularity such as "Van Gogh", "sunflowers" and "beautiful", and these word string combinations are converted into input for the data screening model, namely their corresponding unique numbers: for example, "Van Gogh" might correspond to the number 37954567646040612330 (or a number of any other type, without limitation). Each word typically has one number corresponding to it, and these numbers form a number string.
By segmenting the training text data into words, this embodiment further improves the discriminability of the data and avoids misclassification in the subsequent screening process.
As an implementation manner, in this embodiment, the method further includes:
performing word segmentation on the training text data after the normalization processing to obtain word string combinations with uniform granularity, and determining the part of speech of each segmented word in the word string combinations;
and converting the word string combinations and their corresponding parts of speech into input information suitable for the data screening model, wherein the input information comprises the unique number obtained from the combination of each word string and its corresponding part of speech.
In this embodiment, in addition to determining the word string combination, the part of speech of each segmented word is also determined, for example adjectives, conjunctions and other part-of-speech categories. The word string combination and the corresponding parts of speech are then converted together into input information for the data screening model.
In this embodiment, each sentence of training text is therefore represented by both the word information and the part-of-speech information corresponding to the words. Because the way parts of speech combine largely reflects how fluent a sentence is, using this information as a component of the scheme increases the accuracy of the screening model.
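As an illustration of this step, the sketch below uses the open-source jieba tokenizer to obtain both the segmented words and their part-of-speech tags; the patent does not name a specific segmentation or tagging tool, so jieba and the tags shown in the comments are assumptions.

```python
import jieba.posseg as pseg  # pip install jieba (assumed tool, not named in the patent)

def words_and_pos(sentence: str):
    """Return the segmented words and the part-of-speech tag of each word."""
    pairs = [(word, flag) for word, flag in pseg.cut(sentence)]
    words = [w for w, _ in pairs]
    pos_tags = [t for _, t in pairs]
    return words, pos_tags

words, pos_tags = words_and_pos("梵高笔下的向日葵很漂亮")
print(words)     # e.g. ['梵高', '笔下', '的', '向日葵', '很', '漂亮']
print(pos_tags)  # e.g. ['nr', 's', 'uj', 'n', 'd', 'a']
```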
As an implementation manner, in this embodiment, screening the text sentences in the output of the fusion screening model that reach a preset positive case probability score threshold value as training text data for speech recognition includes:
respectively obtaining the positive case probability score output by each neural network screening model, and screening a text sentence as training text data for voice recognition when the highest of these positive case probability scores reaches the preset positive case probability score threshold value; or
respectively obtaining the positive case probability score output by each neural network screening model, and screening a text sentence as training text data for voice recognition when the weighted mean of these positive case probability scores reaches the preset positive case probability score threshold value.
In this embodiment, the positive example probability score output by each neural network can be obtained, and a text sentence is screened as training text data for speech recognition when the highest of these scores reaches the preset positive case probability score threshold. For example, for the sentence "It's too dark, turn on the light", the positive example label score output by the first neural network is 0.95 and that output by the second neural network is 0.75, so the score of 0.95 is compared against the preset positive case probability score threshold. Alternatively, a weighted average may be taken, giving a score of (0.95 + 0.75) / 2 = 0.85.
This embodiment shows that different score determination methods can satisfy different user needs: different fusion modes are provided according to user requirements, so the screened text training data better fit those requirements.
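The two fusion strategies just described, taking the highest score or a weighted mean and then comparing against the threshold, might be sketched as follows; the equal weights and the 0.5 threshold simply mirror the examples in the text.

```python
def fuse_scores(positive_scores, method="max", weights=None, threshold=0.5):
    """Fuse the per-model positive-example probabilities and decide whether
    to keep the sentence as training text."""
    if method == "max":
        fused = max(positive_scores)
    elif method == "weighted_mean":
        weights = weights or [1.0 / len(positive_scores)] * len(positive_scores)
        fused = sum(w * s for w, s in zip(weights, positive_scores))
    else:
        raise ValueError("unknown fusion method")
    return fused, fused >= threshold

# Scores from the example above: first model 0.95, second model 0.75
print(fuse_scores([0.95, 0.75], method="max"))            # (0.95, True)
print(fuse_scores([0.95, 0.75], method="weighted_mean"))  # about (0.85, True)
```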
As an implementation manner, in this embodiment, each neural network screening model has a first fully connected layer and a second fully connected layer, wherein the dimension of the first fully connected layer is larger than that of the second fully connected layer, and the neural network screening model is trained using dropout in the second fully connected layer to prevent overfitting.
In this embodiment, each model in the screening model structure has two fully connected layers of different dimensions: the first fully connected layer is large and the second is small, and the second fully connected layer uses dropout, so that the model does not overfit when the screening model is trained.
As a result, the model generalizes better and is more robust in formal use.
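A possible PyTorch rendering of this two-layer head is sketched below; the layer dimensions and the dropout rate are assumptions, since the text only states that the first layer is larger than the second and that dropout is applied to the second.

```python
import torch
import torch.nn as nn

class ScreeningHead(nn.Module):
    """Two fully connected layers: a wider first layer and a narrower second
    layer with dropout during training to reduce overfitting."""

    def __init__(self, in_dim=256, wide_dim=512, narrow_dim=64, num_labels=2, p_drop=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, wide_dim)      # first (larger) fully connected layer
        self.fc2 = nn.Linear(wide_dim, narrow_dim)  # second (smaller) fully connected layer
        self.dropout = nn.Dropout(p_drop)           # dropout applied on the second layer
        self.classifier = nn.Linear(narrow_dim, num_labels)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(torch.relu(self.fc2(x)))
        return torch.softmax(self.classifier(x), dim=-1)  # positive / negative probabilities
```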
As an implementation manner, in this embodiment the plurality of neural network screening models includes at least two: a long short-term memory network screening model, into which each word of a sentence must be input in turn, and a convolutional neural network screening model, into which the complete word string of a sentence can be input at once.
In this embodiment, in the CNN (Convolutional Neural Network) screening model, each word after feature conversion is first mapped to a corresponding word embedding; the embeddings are then processed by the convolutional layers, the main features after convolution are selected by a max pooling layer and fed into two consecutive fully connected layers, and finally a classification layer produces a probability score for each classification label of the current input.
The main flow of the LSTM (Long Short-Term Memory network) screening model is similar to that of the CNN screening model, with two main differences: first, its principal structure is a long short-term memory network layer; second, the LSTM screening model requires each word of a sentence to be input in turn, whereas the CNN screening model allows the complete word string of a sentence to be input at once. The CNN screening model is therefore usually faster to compute than the LSTM model.
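The two screening models could be sketched in PyTorch roughly as follows; the vocabulary size, embedding dimension, filter count and hidden sizes are illustrative assumptions, and only the overall structure (embedding, convolution or LSTM, max pooling, two fully connected layers, classification layer) follows the description above. In the fusion screening model, both networks would receive the same batch of number strings in parallel.

```python
import torch
import torch.nn as nn

class CNNScreeningModel(nn.Module):
    """Word embedding -> convolution -> max pooling -> two fully connected
    layers -> per-label probabilities; the whole sentence is input at once."""

    def __init__(self, vocab_size=30000, emb_dim=128, num_filters=128, num_labels=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(num_filters, 256)
        self.fc2 = nn.Linear(256, 64)
        self.out = nn.Linear(64, num_labels)

    def forward(self, ids):                      # ids: (batch, seq_len)
        x = self.embedding(ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))
        x = torch.max(x, dim=2).values           # max pooling keeps the main features
        x = torch.relu(self.fc2(torch.relu(self.fc1(x))))
        return torch.softmax(self.out(x), dim=-1)

class LSTMScreeningModel(nn.Module):
    """The recurrent layer consumes the sentence word by word; its final
    hidden state feeds the same two-layer fully connected head."""

    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128, num_labels=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc1 = nn.Linear(hidden, 256)
        self.fc2 = nn.Linear(256, 64)
        self.out = nn.Linear(64, num_labels)

    def forward(self, ids):
        x = self.embedding(ids)
        _, (h_n, _) = self.lstm(x)               # h_n: (num_layers, batch, hidden)
        x = torch.relu(self.fc2(torch.relu(self.fc1(h_n[-1]))))
        return torch.softmax(self.out(x), dim=-1)
```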
As an implementation manner, in this embodiment the training text data come from raw data collected by a web crawler or uploaded manually.
In this embodiment, because massive amounts of data need to be screened and must be obtained from somewhere, relying on manual upload alone would be very labor-intensive; automatically collecting the initial training text with a web crawler is faster.
In this way, on top of fully automating the screening of the training text data, the source of the training text data is automated as well, reducing enterprise expenses even further.
On the whole, the quality and quantity of text training data greatly influence speech recognition performance; selecting a large amount of high-quality training data through an automated scheme and using it to train the speech recognition model therefore greatly improves recognition performance. Because the scheme is based on a classification method, similar training data screening tasks can reuse this framework with only small changes. For example, screening picture data or audio data only requires corresponding changes to the data preprocessing module, the word segmentation module, the feature extraction module and the word embedding layer in the screening model.
Fig. 2 is a schematic structural diagram of a system for screening training text data for speech recognition according to an embodiment of the present invention, which can execute the method for screening training text data for speech recognition according to any of the embodiments described above and is configured in a terminal.
The present embodiment provides a system for screening training text data for speech recognition, including: a preprocessing module 11 and a training text screening module 12.
The pre-processing module 11 is configured to perform normalization processing on training text data, and perform pre-processing before inputting on the training text data after the normalization processing, where the pre-processing before inputting at least includes: converting the training text data after the normalization processing into input information of a data screening model, wherein the input information comprises a unique number corresponding to a sentence of the training text; the training text screening module 12 is configured to import the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screen a text sentence, which reaches a preset positive case probability score threshold value in the output of the fusion screening model, as training text data for speech recognition.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the method for screening the training text data for voice recognition in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
carrying out standardization processing on training text data, and carrying out preprocessing before inputting on the training text data after the standardization processing, wherein the preprocessing before inputting at least comprises the following steps: converting the training text data after the normalization processing into input information of a data screening model, wherein the input information comprises a unique number corresponding to a sentence of the training text;
and importing the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screening the text sentences reaching a preset positive case probability score threshold value in the output of the fusion screening model into training text data for voice recognition.
As a non-volatile computer readable storage medium, it may be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as the program instructions/modules corresponding to the method for screening training text data for speech recognition in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer readable storage medium and, when executed by a processor, perform the method for screening training text data for speech recognition in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-volatile computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: the apparatus includes at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for filtering training text data for speech recognition according to any of the embodiments of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of automated screening of training text data for speech recognition, comprising:
acquiring original data collected by a webpage crawler or uploaded manually;
carrying out normalization processing of text format processing and/or character form processing on the original data;
performing word segmentation on the normalized original data;
converting the original data after word segmentation processing into input information which is suitable for a neural network screening model and corresponds to a unique number, wherein the input information comprises: the unique number corresponding to the combination of the word string combination and the corresponding part of speech;
importing the input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, wherein the plurality of neural network screening models at least comprise: a long short-term memory network screening model into which each word of a sentence is input in turn, and a convolutional neural network screening model into which the complete word string of a sentence is input at a time;
and fusing the output results of the plurality of neural network screening models, and screening out training text data according to the fusion result.
2. The method of claim 1, wherein performing word segmentation on the normalized original data comprises:
carrying out the word segmentation to obtain word string combinations with uniform granularity.
3. The method of claim 2, wherein the method further comprises:
performing word segmentation on the normalized original data to obtain word string combinations with uniform granularity, and determining the part of speech of each segmented word in the word string combinations.
4. The method of claim 1, wherein fusing the output results of the plurality of neural network screening models and screening out training text data according to the fusion result comprises:
respectively obtaining the positive case probability score output by each neural network screening model, and screening a text sentence as training text data for voice recognition when the highest of these positive case probability scores reaches a preset positive case probability score threshold value; or
respectively obtaining the positive case probability score output by each neural network screening model, and screening a text sentence as training text data for voice recognition when the weighted mean of these positive case probability scores reaches a preset positive case probability score threshold value.
5. The method of claim 1, wherein each neural network screening model has a first fully-connected layer and a second fully-connected layer, wherein the first fully-connected layer is larger in dimension than the second fully-connected layer, wherein the neural network screening model is trained using dropout in the second fully-connected layer to prevent overfitting.
6. The method of claim 1, wherein,
the text format processing includes: converting original data in a non-standard format into a text form of one word per line or one sentence per line, wherein the non-standard format comprises HTML and JSON;
the character form processing includes: removing illegal symbols from the original data, wherein the illegal symbols comprise webpage labels and emoticons.
7. The method of claim 6, wherein after the character form processing, the normalization processing further comprises: sentence breaking processing;
the sentence breaking processing comprises: breaking sentences according to the punctuation marks in the original data, adding punctuation through character form processing when no punctuation mark exists in a training text whose length exceeds a preset length, and performing sentence breaking on the original data after the punctuation is added.
8. An automated screening system of training text data for speech recognition, comprising:
an original data acquisition program module, configured to acquire original data collected by a webpage crawler or uploaded manually;
a normalization processing program module, configured to perform normalization processing of text format processing and/or character form processing on the original data;
a word segmentation program module, configured to perform word segmentation on the original data after the normalization processing;
a conversion program module, configured to convert the original data after word segmentation processing into input information which is suitable for a neural network screening model and corresponds to a unique number, wherein the input information comprises: the unique number corresponding to the combination of the word string combination and the corresponding part of speech;
a screening program module, configured to import the input information into a fusion screening model formed by the parallel combination of a plurality of neural network screening models, wherein the plurality of neural network screening models at least comprise: a long short-term memory network screening model into which each word of a sentence is input in turn, and a convolutional neural network screening model into which the complete word string of a sentence is input at a time;
and a training program module, configured to fuse the output results of the plurality of neural network screening models and screen out training text data according to the fusion result.
CN201910510814.2A 2019-06-13 2019-06-13 Method and system for screening training text data for voice recognition Active CN110223675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910510814.2A CN110223675B (en) 2019-06-13 2019-06-13 Method and system for screening training text data for voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910510814.2A CN110223675B (en) 2019-06-13 2019-06-13 Method and system for screening training text data for voice recognition

Publications (2)

Publication Number Publication Date
CN110223675A CN110223675A (en) 2019-09-10
CN110223675B (en) 2022-04-19

Family

ID=67816839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910510814.2A Active CN110223675B (en) 2019-06-13 2019-06-13 Method and system for screening training text data for voice recognition

Country Status (1)

Country Link
CN (1) CN110223675B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929532B (en) * 2019-11-21 2023-03-21 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN111145732B (en) * 2019-12-27 2022-05-10 思必驰科技股份有限公司 Processing method and system after multi-task voice recognition
CN111090970B (en) * 2019-12-31 2023-05-12 思必驰科技股份有限公司 Text standardization processing method after voice recognition
CN111429913B (en) * 2020-03-26 2023-03-31 厦门快商通科技股份有限公司 Digit string voice recognition method, identity verification device and computer readable storage medium
CN112560453B (en) * 2020-12-18 2023-07-14 平安银行股份有限公司 Voice information verification method and device, electronic equipment and medium
CN113361644B (en) * 2021-07-03 2024-05-14 上海理想信息产业(集团)有限公司 Model training method, telecommunication service characteristic information extraction method, device and equipment
CN116911305A (en) * 2023-09-13 2023-10-20 中博信息技术研究院有限公司 Chinese address recognition method based on fusion model
CN117252539A (en) * 2023-09-20 2023-12-19 广东筑小宝人工智能科技有限公司 Engineering standard specification acquisition method and system based on neural network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514170B (en) * 2012-06-20 2017-03-29 中国移动通信集团安徽有限公司 A kind of file classification method and device of speech recognition
CN104217717B (en) * 2013-05-29 2016-11-23 腾讯科技(深圳)有限公司 Build the method and device of language model
CN105957518B (en) * 2016-06-16 2019-05-31 内蒙古大学 A kind of method of Mongol large vocabulary continuous speech recognition
US9972308B1 (en) * 2016-11-08 2018-05-15 International Business Machines Corporation Splitting utterances for quick responses
CN107229684B (en) * 2017-05-11 2021-05-18 合肥美的智能科技有限公司 Sentence classification method and system, electronic equipment, refrigerator and storage medium
CN107680579B (en) * 2017-09-29 2020-08-14 百度在线网络技术(北京)有限公司 Text regularization model training method and device, and text regularization method and device
CN108509411B (en) * 2017-10-10 2021-05-11 腾讯科技(深圳)有限公司 Semantic analysis method and device
CN109460472A (en) * 2018-11-09 2019-03-12 北京京东金融科技控股有限公司 File classification method and device and electronic equipment

Also Published As

Publication number Publication date
CN110223675A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN110223675B (en) Method and system for screening training text data for voice recognition
CN108920622B (en) Training method, training device and recognition device for intention recognition
CN108874776B (en) Junk text recognition method and device
CN111339305B (en) Text classification method and device, electronic equipment and storage medium
CN111177326B (en) Key information extraction method and device based on fine labeling text and storage medium
CN111160031A (en) Social media named entity identification method based on affix perception
CN111460162B (en) Text classification method and device, terminal equipment and computer readable storage medium
CN113553848B (en) Long text classification method, system, electronic device, and computer-readable storage medium
CN111460149A (en) Text classification method, related equipment and readable storage medium
CN105975497A (en) Automatic microblog topic recommendation method and device
CN112380866A (en) Text topic label generation method, terminal device and storage medium
CN113469298A (en) Model training method and resource recommendation method
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN111291551A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN106202349B (en) Webpage classification dictionary generation method and device
CN111079433A (en) Event extraction method and device and electronic equipment
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN110969005A (en) Method and device for determining similarity between entity corpora
CN113806483A (en) Data processing method and device, electronic equipment and computer program product
CN111178080A (en) Named entity identification method and system based on structured information
CN110874408A (en) Model training method, text recognition device and computing equipment
CN112559750A (en) Text data classification method and device, nonvolatile storage medium and processor
CN113761874A (en) Event reality prediction method and device, electronic equipment and storage medium
CN112541352A (en) Policy interpretation method based on deep learning
CN112015891A (en) Method and system for classifying messages of network inquiry platform based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant