CN110223675B - Method and system for screening training text data for voice recognition - Google Patents


Info

Publication number
CN110223675B
Authority
CN
China
Prior art keywords
screening
training text
neural network
processing
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910510814.2A
Other languages
Chinese (zh)
Other versions
CN110223675A (en)
Inventor
陈明佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN201910510814.2A
Publication of CN110223675A
Application granted
Publication of CN110223675B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a method for screening training text data for voice recognition. The method comprises the following steps: normalizing the training text data, and preprocessing the normalized training text data before input, the preprocessing comprising: converting the normalized training text data into input information for a data screening model, wherein the input information comprises a unique number corresponding to each sentence of the training text; and importing the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screening the text sentences that reach a preset positive case probability score threshold value in the output of the fusion screening model as training text data for voice recognition. The embodiment of the invention also provides a system for screening training text data for voice recognition. Because the whole process is automated, the embodiments save a large amount of labor cost and improve reusability, and by taking the linguistic relationships within the text content into account they improve the screening of the training text data.

Description

Method and system for screening training text data for voice recognition
Technical Field
The invention relates to the field of intelligent voice, in particular to a method and a system for screening training text data for voice recognition.
Background
Training a speech recognition model often requires a large amount of high-quality training text data to achieve good results. In existing schemes, large amounts of training text data are usually obtained either by coarse matching based on simple character rules, or by manual inspection after such simple rule matching.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the related art:
Existing text training data for voice recognition are obtained partly through a small amount of manual inspection and labeling, but mostly through simple, coarse processing tools; no efficient, automated method is used. Although data obtained through manual labeling are generally of high quality, the volume of speech recognition training text is usually very large, often hundreds of megabytes (MB, on the order of a hundred million characters) or even terabytes (TB), so manual labeling would consume a large amount of manpower and material resources.
First, the amount of text training data used for speech recognition is usually huge and its distribution is very diverse. A method based on simple rules may work on text data within some fixed range or domain, but when the same scheme is migrated to other data, much of the work has to be redone. The reusability of such a scheme is extremely low, and it is difficult to popularize.
Secondly, general automated schemes cannot effectively take the semantic and linguistic relationships in the text content into account, so the quality of the screened data is not high, even though judging whether a text is reasonable or fluent depends precisely on those semantic and linguistic relationships.
In addition, general automated schemes adopt rule-based screening and filtering. Once the number of rules grows, many rules become contradictory or redundant, and the same input data may match several rules at once; deciding which rule actually applies requires manual intervention or rule weights set in advance. Furthermore, in such schemes it is usually impossible to predict in advance which rule or rules a given piece of input data will need, so every piece of data has to be checked against the rules one by one, which makes it difficult for these schemes to process data quickly and efficiently at large scale and in a distributed manner.
Disclosure of Invention
Embodiments of the invention at least solve the problem in the prior art that text training data are obtained either by coarse matching based on simple character rules or by manual inspection after simple rule matching.
In a first aspect, an embodiment of the present invention provides a method for screening training text data for speech recognition, including:
carrying out standardization processing on training text data, and carrying out preprocessing before inputting on the training text data after the standardization processing, wherein the preprocessing before inputting at least comprises the following steps: converting the training text data after the normalization processing into input information of a data screening model, wherein the input information comprises a unique number corresponding to a sentence of the training text;
and importing the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screening the text sentences reaching a preset positive case probability score threshold value in the output of the fusion screening model into training text data for voice recognition.
In a second aspect, an embodiment of the present invention provides a system for screening training text data for speech recognition, including:
the pre-processing module is used for carrying out normalization processing on the training text data and carrying out pre-input preprocessing on the training text data after the normalization processing, wherein the pre-input preprocessing at least comprises the following steps: converting the training text data after the normalization processing into input information of a data screening model, wherein the input information comprises a unique number corresponding to a sentence of the training text;
and the training text screening module is used for importing the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screening the text sentences which reach a preset positive case probability score threshold value in the output of the fusion screening model into training text data for voice recognition.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for screening training text data for speech recognition according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for screening training text data for speech recognition according to any embodiment of the present invention.
The embodiment of the invention has the following beneficial effects: the training text data are normalized and converted into input information for a data screening model, so the whole process is automated, which saves a large amount of labor cost, reduces expenses for enterprises, and improves reusability. By fusing several kinds of neural network screening models and judging across multiple dimensions, the linguistic relationships in the text content are effectively taken into account and the screening of the training text data is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for screening training text data for speech recognition according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a system for screening training text data for speech recognition according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart of a method for screening training text data for speech recognition according to an embodiment of the present invention, which includes the following steps:
s11: carrying out standardization processing on training text data, and carrying out preprocessing before inputting on the training text data after the standardization processing, wherein the preprocessing before inputting at least comprises the following steps: converting the training text data after the normalization processing into input information of a data screening model, wherein the input information comprises a unique number corresponding to a sentence of the training text;
s12: and importing the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screening the text sentences reaching a preset positive case probability score threshold value in the output of the fusion screening model into training text data for voice recognition.
In this embodiment, neural network classification based on deep semantic representations of the text is adopted. Because deep learning is used to represent the semantics of the text, effective semantic information can be obtained with a very simple and efficient scheme, which makes the screening more accurate. A deep-learning-based scheme only needs different input and output information and can share the same model structure across different fields or scenarios, so the model is highly extensible. Moreover, both model training and formal deployment of the model can run on a distributed computer cluster, so the screening efficiency can be multiplied.
For step S11, the training text data are normalized. As an embodiment, the normalization processing at least includes text format processing and/or character form processing. The text format processing includes: converting training text data in a non-standard format into a text form of one word per line or one sentence per line, where the non-standard formats include HTML and JSON. The character form processing includes: removing illegal symbols from the training text data, where the illegal symbols include web page tags and emoticons.
For example, because training text data are collected from a wide range of sources, the acquired data are relatively cluttered. HTML-formatted data might be obtained as follows:
var articleTitle = "Text classification for machine learning (with training set + data set + all code)"; <br/> 1. Precise mode, which tries to segment the sentence most accurately and is suitable for text analysis. <br/> 2. Full mode scans out all the words in the sentence that can form words very fast but cannot resolve ambiguity <br/>
Such training text sentences usually carry markup symbols like these. During data training such illegal characters need to be removed; after text format processing the result is:
Text classification for machine learning (with training set + data set + all code)
1. Precise mode, which tries to segment the sentence most accurately and is suitable for text analysis.
2. Full mode scans out all the words in the sentence that can form words very fast but cannot resolve ambiguity.
Performing text format processing and character form processing on data acquired from web pages in this way makes the training text more standardized and improves the accuracy of the training text data.
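To make this concrete, the following is a minimal Python sketch of the text format and character form processing described above; the regular expressions, the allowed character set and the function name are illustrative assumptions rather than details taken from the patent.

```python
import re
from html import unescape

def normalize_text(raw: str) -> list:
    """Text format + character form processing: strip web-page tags and
    illegal symbols, return a list of non-empty cleaned lines."""
    text = unescape(raw)
    text = re.sub(r"<[^>]+>", "\n", text)  # drop web-page tags such as <br/>
    # keep CJK, alphanumerics, common punctuation and whitespace; drop emoticons etc.
    text = re.sub(r"[^\u4e00-\u9fff0-9A-Za-z，。！？、,.!?\s]", "", text)
    return [line.strip() for line in text.splitlines() if line.strip()]

raw = 'var articleTitle = "机器学习的文本分类";<br/>1. 精确模式，适合文本分析。<br/>'
print(normalize_text(raw))
```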
After the character form processing, the normalization further includes: sentence-breaking processing.
The sentence-breaking processing comprises: breaking sentences according to the punctuation marks in the training text data; when a stretch of training text longer than a preset length contains no punctuation mark, punctuation is first added through character form processing, and sentence breaking is then applied to the training text data with the added punctuation.
For example, the sentence above, "2. Full mode scans out all the words in the sentence that can form words very fast but cannot resolve ambiguity", can be broken into "2. Full mode scans out all the words in the sentence that can form words very fast but cannot resolve ambiguity." with a final period added.
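A small sketch of this sentence-breaking rule might look like the following; the maximum length of 30 characters, the punctuation set and the choice to append a Chinese full stop are assumptions made for illustration.

```python
import re

MAX_LEN = 30  # assumed "preset length"

def break_sentences(line: str) -> list:
    """Split after sentence-final punctuation; if an over-long line has no
    punctuation at all, append a full stop first, as in the example above."""
    if len(line) > MAX_LEN and not re.search(r"[。！？!?.，,]", line):
        line = line + "。"
    parts = re.split(r"(?<=[。！？!?.])", line)
    return [p for p in parts if p.strip()]

print(break_sentences("全模式把句子中所有可以成词的词语都扫描出来速度非常快但是不能解决歧义"))
# -> ['全模式把句子中所有可以成词的词语都扫描出来速度非常快但是不能解决歧义。']
```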
After normalization, the training text data undergo data conversion into input information suitable for the data screening model, and this input information contains a unique number for each sentence of the training text. Because the input to the data screening module is usually a vector or a matrix, the words of each sentence cannot be used directly as input. Instead, the feature conversion module maps every word to a unique number, so that each sentence of text corresponds to a number string, and the vector formed from these number strings serves as the input data of the data screening module.
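A toy version of this feature conversion is sketched below; it assumes the vocabulary is built on the fly and that unseen words map to a reserved unknown id, since the patent does not specify how the unique numbers are generated.

```python
class FeatureConverter:
    """Map every word to a unique id so that each sentence becomes a number
    string (vector); id 0 is reserved for unknown words (an assumption)."""

    def __init__(self):
        self.vocab = {"<unk>": 0}

    def fit(self, segmented_sentences):
        for words in segmented_sentences:
            for w in words:
                self.vocab.setdefault(w, len(self.vocab))

    def transform(self, words):
        return [self.vocab.get(w, 0) for w in words]

conv = FeatureConverter()
conv.fit([["梵高", "向日葵", "漂亮"]])
print(conv.transform(["梵高", "向日葵", "漂亮"]))  # [1, 2, 3]
```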
For step S12, the converted input information is imported into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and the text sentences whose output from the fusion screening model reaches a preset positive case probability score threshold value are screened as training text data for speech recognition. Each batch of input data is sent simultaneously into the several neural network screening models; each model performs its own screening computation and produces a score for every classification label of the data. The several groups of scores are then fused by some fusion method into a single group of scores, one per classification label, where the labels are the positive example label and the negative example label of a sentence. For example, a garbled, nonsensical sentence might receive a positive example label score of 0.1 and a negative example label score of 0.9, while the sentence "It's too dark, turn on the light" might receive a positive example label score of 0.95 and a negative example label score of 0.05. With a preset positive case probability score threshold of 0.5, the sentence "It's too dark, turn on the light" is screened as training text data for speech recognition.
According to this embodiment, the training text data are normalized and converted into input information for the data screening model, so the whole process is automated, which saves a large amount of labor cost, reduces expenses for enterprises, and improves reusability. By fusing several kinds of neural network screening models and judging across multiple dimensions, the linguistic relationships in the text content are effectively taken into account and the screening of the training text data is improved.
As an implementation manner, in this embodiment, the pre-input preprocessing at least includes:
performing word segmentation on the training text data after the normalization processing to obtain word string combinations with uniform granularity;
and converting the word string combination into input information suitable for a data screening model, wherein the input information comprises a unique number corresponding to the word string combination.
In this embodiment, the normalized training text data are segmented into words. For example, the sentence "The sunflowers painted by Van Gogh are beautiful" is segmented into word string combinations of uniform granularity such as "Van Gogh", "sunflowers" and "beautiful", and these word string combinations are converted into input for the data screening model, namely their corresponding unique numbers: for example, "Van Gogh" might correspond to the number 37954567646040612330 (or a number of any other type, without limitation). Each word typically has one number corresponding to it, and these numbers form a number string.
By segmenting the training text data into words, this embodiment further improves the discriminability of the data and avoids misclassification in the subsequent screening process.
As an implementation manner, in this embodiment, the method further includes:
performing word segmentation on the training text data after the normalization processing to obtain word string combinations with uniform granularity, and determining the part of speech of each segmented word in the word string combinations;
and converting the word string combinations and their corresponding parts of speech into input information suitable for the data screening model, wherein the input information comprises the unique number obtained from the combination of each word string and its corresponding part of speech.
In this embodiment, in addition to determining the word string combination, the part of speech of each segmented word is also determined, for example adjectives, conjunctions and other part-of-speech categories. The word string combination and the corresponding parts of speech are then converted together into input information for the data screening model.
In this embodiment, each sentence of training text is therefore represented by both the word information and the part-of-speech information corresponding to the words. Because the way parts of speech combine largely reflects how fluent a sentence is, using this information as a component of the scheme increases the accuracy of the screening model.
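As an illustration of this step, the sketch below uses the open-source jieba tokenizer to obtain both the segmented words and their part-of-speech tags; the patent does not name a specific segmentation or tagging tool, so jieba and the tags shown in the comments are assumptions.

```python
import jieba.posseg as pseg  # pip install jieba (assumed tool, not named in the patent)

def words_and_pos(sentence: str):
    """Return the segmented words and the part-of-speech tag of each word."""
    pairs = [(word, flag) for word, flag in pseg.cut(sentence)]
    words = [w for w, _ in pairs]
    pos_tags = [t for _, t in pairs]
    return words, pos_tags

words, pos_tags = words_and_pos("梵高笔下的向日葵很漂亮")
print(words)     # e.g. ['梵高', '笔下', '的', '向日葵', '很', '漂亮']
print(pos_tags)  # e.g. ['nr', 's', 'uj', 'n', 'd', 'a']
```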
As an implementation manner, in this embodiment, screening the text sentences in the output of the fusion screening model that reach a preset positive case probability score threshold value as training text data for speech recognition includes:
respectively obtaining the positive case probability score output by each neural network screening model, and screening a text sentence as training text data for voice recognition when the highest of these positive case probability scores reaches the preset positive case probability score threshold value; or
respectively obtaining the positive case probability score output by each neural network screening model, and screening a text sentence as training text data for voice recognition when the weighted mean of these positive case probability scores reaches the preset positive case probability score threshold value.
In this embodiment, the positive example probability score output by each neural network can be obtained, and a text sentence is screened as training text data for speech recognition when the highest of these scores reaches the preset positive case probability score threshold. For example, for the sentence "It's too dark, turn on the light", the positive example label score output by the first neural network is 0.95 and that output by the second neural network is 0.75, so the score of 0.95 is compared against the preset positive case probability score threshold. Alternatively, a weighted average may be taken, giving a score of (0.95 + 0.75) / 2 = 0.85.
This embodiment shows that different score determination methods can satisfy different user needs: different fusion modes are provided according to user requirements, so the screened text training data better fit those requirements.
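The two fusion strategies just described, taking the highest score or a weighted mean and then comparing against the threshold, might be sketched as follows; the equal weights and the 0.5 threshold simply mirror the examples in the text.

```python
def fuse_scores(positive_scores, method="max", weights=None, threshold=0.5):
    """Fuse the per-model positive-example probabilities and decide whether
    to keep the sentence as training text."""
    if method == "max":
        fused = max(positive_scores)
    elif method == "weighted_mean":
        weights = weights or [1.0 / len(positive_scores)] * len(positive_scores)
        fused = sum(w * s for w, s in zip(weights, positive_scores))
    else:
        raise ValueError("unknown fusion method")
    return fused, fused >= threshold

# Scores from the example above: first model 0.95, second model 0.75
print(fuse_scores([0.95, 0.75], method="max"))            # (0.95, True)
print(fuse_scores([0.95, 0.75], method="weighted_mean"))  # about (0.85, True)
```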
As an implementation manner, in this embodiment, each neural network screening model has a first fully connected layer and a second fully connected layer, wherein the dimension of the first fully connected layer is larger than that of the second fully connected layer, and the neural network screening model is trained using dropout in the second fully connected layer to prevent overfitting.
In this embodiment, each model in the screening model structure has two fully connected layers of different dimensions: the first fully connected layer is large and the second is small, and the second fully connected layer uses dropout, so that the model does not overfit when the screening model is trained.
As a result, the model generalizes better and is more robust in formal use.
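A possible PyTorch rendering of this two-layer head is sketched below; the layer dimensions and the dropout rate are assumptions, since the text only states that the first layer is larger than the second and that dropout is applied to the second.

```python
import torch
import torch.nn as nn

class ScreeningHead(nn.Module):
    """Two fully connected layers: a wider first layer and a narrower second
    layer with dropout during training to reduce overfitting."""

    def __init__(self, in_dim=256, wide_dim=512, narrow_dim=64, num_labels=2, p_drop=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, wide_dim)      # first (larger) fully connected layer
        self.fc2 = nn.Linear(wide_dim, narrow_dim)  # second (smaller) fully connected layer
        self.dropout = nn.Dropout(p_drop)           # dropout applied on the second layer
        self.classifier = nn.Linear(narrow_dim, num_labels)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = self.dropout(torch.relu(self.fc2(x)))
        return torch.softmax(self.classifier(x), dim=-1)  # positive / negative probabilities
```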
As an implementation manner, in this embodiment the plurality of neural network screening models includes at least two: a long short-term memory network screening model, into which each word of a sentence must be input in turn, and a convolutional neural network screening model, into which the complete word string of a sentence can be input at once.
In this embodiment, in the CNN (Convolutional Neural Network) screening model, each word after feature conversion is first mapped to a corresponding word embedding; the embeddings are then processed by the convolutional layers, the main features after convolution are selected by a max pooling layer and fed into two consecutive fully connected layers, and finally a classification layer produces a probability score for each classification label of the current input.
The main flow of the LSTM (Long Short-Term Memory network) screening model is similar to that of the CNN screening model, with two main differences: first, its principal structure is a long short-term memory network layer; second, the LSTM screening model requires each word of a sentence to be input in turn, whereas the CNN screening model allows the complete word string of a sentence to be input at once. The CNN screening model is therefore usually faster to compute than the LSTM model.
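The two screening models could be sketched in PyTorch roughly as follows; the vocabulary size, embedding dimension, filter count and hidden sizes are illustrative assumptions, and only the overall structure (embedding, convolution or LSTM, max pooling, two fully connected layers, classification layer) follows the description above. In the fusion screening model, both networks would receive the same batch of number strings in parallel.

```python
import torch
import torch.nn as nn

class CNNScreeningModel(nn.Module):
    """Word embedding -> convolution -> max pooling -> two fully connected
    layers -> per-label probabilities; the whole sentence is input at once."""

    def __init__(self, vocab_size=30000, emb_dim=128, num_filters=128, num_labels=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(num_filters, 256)
        self.fc2 = nn.Linear(256, 64)
        self.out = nn.Linear(64, num_labels)

    def forward(self, ids):                      # ids: (batch, seq_len)
        x = self.embedding(ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        x = torch.relu(self.conv(x))
        x = torch.max(x, dim=2).values           # max pooling keeps the main features
        x = torch.relu(self.fc2(torch.relu(self.fc1(x))))
        return torch.softmax(self.out(x), dim=-1)

class LSTMScreeningModel(nn.Module):
    """The recurrent layer consumes the sentence word by word; its final
    hidden state feeds the same two-layer fully connected head."""

    def __init__(self, vocab_size=30000, emb_dim=128, hidden=128, num_labels=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.fc1 = nn.Linear(hidden, 256)
        self.fc2 = nn.Linear(256, 64)
        self.out = nn.Linear(64, num_labels)

    def forward(self, ids):
        x = self.embedding(ids)
        _, (h_n, _) = self.lstm(x)               # h_n: (num_layers, batch, hidden)
        x = torch.relu(self.fc2(torch.relu(self.fc1(h_n[-1]))))
        return torch.softmax(self.out(x), dim=-1)
```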
As an implementation manner, in this embodiment the training text data come from raw data collected by a web crawler or uploaded manually.
In this embodiment, because massive amounts of data need to be screened and must be obtained from somewhere, relying on manual upload alone would be very labor-intensive; automatically collecting the initial training text with a web crawler is faster.
In this way, on top of fully automating the screening of the training text data, the source of the training text data is automated as well, reducing enterprise expenses even further.
On the whole, the quality and quantity of text training data greatly influence speech recognition performance; selecting a large amount of high-quality training data through an automated scheme and using it to train the speech recognition model therefore greatly improves recognition performance. Because the scheme is based on a classification method, similar training data screening tasks can reuse this framework with only small changes. For example, screening picture data or audio data only requires corresponding changes to the data preprocessing module, the word segmentation module, the feature extraction module and the word embedding layer in the screening model.
Fig. 2 is a schematic structural diagram of a system for screening training text data for speech recognition according to an embodiment of the present invention, which can execute the method for screening training text data for speech recognition according to any of the embodiments described above and is configured in a terminal.
The present embodiment provides a system for screening training text data for speech recognition, including: a preprocessing module 11 and a training text screening module 12.
The pre-processing module 11 is configured to perform normalization processing on training text data, and perform pre-processing before inputting on the training text data after the normalization processing, where the pre-processing before inputting at least includes: converting the training text data after the normalization processing into input information of a data screening model, wherein the input information comprises a unique number corresponding to a sentence of the training text; the training text screening module 12 is configured to import the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screen a text sentence, which reaches a preset positive case probability score threshold value in the output of the fusion screening model, as training text data for speech recognition.
The embodiment of the invention also provides a nonvolatile computer storage medium, wherein the computer storage medium stores computer executable instructions which can execute the method for screening the training text data for voice recognition in any method embodiment;
as one embodiment, a non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
carrying out standardization processing on training text data, and carrying out preprocessing before inputting on the training text data after the standardization processing, wherein the preprocessing before inputting at least comprises the following steps: converting the training text data after the normalization processing into input information of a data screening model, wherein the input information comprises a unique number corresponding to a sentence of the training text;
and importing the converted input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, and screening the text sentences reaching a preset positive case probability score threshold value in the output of the fusion screening model into training text data for voice recognition.
As a non-volatile computer readable storage medium, it may be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as the program instructions/modules corresponding to the method for screening training text data for speech recognition in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer readable storage medium and, when executed by a processor, perform the method for screening training text data for speech recognition in any of the method embodiments described above.
The non-volatile computer-readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created according to the use of the device, and the like. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-volatile computer readable storage medium optionally includes memory located remotely from the processor, which may be connected to the device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device, which includes: the apparatus includes at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for filtering training text data for speech recognition according to any of the embodiments of the present invention.
The client of the embodiment of the present application exists in various forms, including but not limited to:
(1) mobile communication devices, which are characterized by mobile communication capabilities and are primarily targeted at providing voice and data communications. Such terminals include smart phones, multimedia phones, functional phones, and low-end phones, among others.
(2) The ultra-mobile personal computer equipment belongs to the category of personal computers, has calculation and processing functions and generally has the characteristic of mobile internet access. Such terminals include PDA, MID, and UMPC devices, such as tablet computers.
(3) Portable entertainment devices: such devices can display and play multimedia content. They include audio and video players, handheld game consoles, electronic books, intelligent toys and portable vehicle-mounted navigation devices.
(4) Other electronic devices with data processing capabilities.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of automated screening of training text data for speech recognition, comprising:
acquiring original data collected by a webpage crawler or uploaded manually;
carrying out normalization processing of text format processing and/or character form processing on the original data;
performing word segmentation on the normalized original data;
converting the original data after word segmentation processing into input information which is suitable for a neural network screening model and corresponds to a unique number, wherein the input information comprises: the unique number corresponding to the combination of the word string combination and the corresponding part of speech;
importing the input information into a fusion screening model formed by combining a plurality of neural network screening models in parallel, wherein the plurality of neural network screening models at least comprise: a long short-term memory network screening model into which each word of a sentence is input in turn, and a convolutional neural network screening model into which the complete word string of a sentence is input at a time;
and fusing the output results of the plurality of neural network screening models, and screening out training text data according to the fusion result.
2. The method of claim 1, wherein performing word segmentation on the normalized original data comprises:
carrying out the word segmentation to obtain word string combinations with uniform granularity.
3. The method of claim 2, wherein the method further comprises:
performing word segmentation on the normalized original data to obtain word string combinations with uniform granularity, and determining the part of speech of each segmented word in the word string combinations.
4. The method of claim 1, wherein fusing the output results of the plurality of neural network screening models and screening out training text data according to the fusion result comprises:
respectively obtaining the positive case probability score output by each neural network screening model, and screening a text sentence as training text data for voice recognition when the highest of these positive case probability scores reaches a preset positive case probability score threshold value; or
respectively obtaining the positive case probability score output by each neural network screening model, and screening a text sentence as training text data for voice recognition when the weighted mean of these positive case probability scores reaches a preset positive case probability score threshold value.
5. The method of claim 1, wherein each neural network screening model has a first fully-connected layer and a second fully-connected layer, wherein the first fully-connected layer is larger in dimension than the second fully-connected layer, wherein the neural network screening model is trained using dropout in the second fully-connected layer to prevent overfitting.
6. The method of claim 1, wherein,
the text format processing includes: converting original data in a non-standard format into a text form of one word per line or one sentence per line, wherein the non-standard format comprises HTML and JSON;
the character form processing includes: removing illegal symbols from the original data, wherein the illegal symbols comprise webpage labels and emoticons.
7. The method of claim 6, wherein after the character form processing, the normalization processing further comprises: sentence breaking processing;
the sentence breaking processing comprises: breaking sentences according to the punctuation marks in the original data, adding punctuation through character form processing when no punctuation mark exists in a training text whose length exceeds a preset length, and performing sentence breaking on the original data after the punctuation is added.
8. An automated screening system of training text data for speech recognition, comprising:
an original data acquisition program module, configured to acquire original data collected by a webpage crawler or uploaded manually;
a normalization processing program module, configured to perform normalization processing of text format processing and/or character form processing on the original data;
a word segmentation program module, configured to perform word segmentation on the original data after the normalization processing;
a conversion program module, configured to convert the original data after word segmentation processing into input information which is suitable for a neural network screening model and corresponds to a unique number, wherein the input information comprises: the unique number corresponding to the combination of the word string combination and the corresponding part of speech;
a screening program module, configured to import the input information into a fusion screening model formed by the parallel combination of a plurality of neural network screening models, wherein the plurality of neural network screening models at least comprise: a long short-term memory network screening model into which each word of a sentence is input in turn, and a convolutional neural network screening model into which the complete word string of a sentence is input at a time;
and a training program module, configured to fuse the output results of the plurality of neural network screening models and screen out training text data according to the fusion result.
CN201910510814.2A 2019-06-13 2019-06-13 Method and system for screening training text data for voice recognition Active CN110223675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910510814.2A CN110223675B (en) 2019-06-13 2019-06-13 Method and system for screening training text data for voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910510814.2A CN110223675B (en) 2019-06-13 2019-06-13 Method and system for screening training text data for voice recognition

Publications (2)

Publication Number Publication Date
CN110223675A CN110223675A (en) 2019-09-10
CN110223675B (en) 2022-04-19

Family

ID=67816839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910510814.2A Active CN110223675B (en) 2019-06-13 2019-06-13 Method and system for screening training text data for voice recognition

Country Status (1)

Country Link
CN (1) CN110223675B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929532B (en) * 2019-11-21 2023-03-21 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN111145732B (en) * 2019-12-27 2022-05-10 思必驰科技股份有限公司 Processing method and system after multi-task voice recognition
CN111090970B (en) * 2019-12-31 2023-05-12 思必驰科技股份有限公司 Text standardization processing method after voice recognition
CN111429913B (en) * 2020-03-26 2023-03-31 厦门快商通科技股份有限公司 Digit string voice recognition method, identity verification device and computer readable storage medium
CN112560453B (en) * 2020-12-18 2023-07-14 平安银行股份有限公司 Voice information verification method and device, electronic equipment and medium
CN113361644B (en) * 2021-07-03 2024-05-14 上海理想信息产业(集团)有限公司 Model training method, telecommunication service characteristic information extraction method, device and equipment
CN116911305A (en) * 2023-09-13 2023-10-20 中博信息技术研究院有限公司 Chinese address recognition method based on fusion model
CN117252539A (en) * 2023-09-20 2023-12-19 广东筑小宝人工智能科技有限公司 Engineering standard specification acquisition method and system based on neural network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514170B (en) * 2012-06-20 2017-03-29 中国移动通信集团安徽有限公司 A kind of file classification method and device of speech recognition
CN104217717B (en) * 2013-05-29 2016-11-23 腾讯科技(深圳)有限公司 Build the method and device of language model
CN105957518B (en) * 2016-06-16 2019-05-31 内蒙古大学 A kind of method of Mongol large vocabulary continuous speech recognition
US9972308B1 (en) * 2016-11-08 2018-05-15 International Business Machines Corporation Splitting utterances for quick responses
CN107229684B (en) * 2017-05-11 2021-05-18 合肥美的智能科技有限公司 Sentence classification method and system, electronic equipment, refrigerator and storage medium
CN107680579B (en) * 2017-09-29 2020-08-14 百度在线网络技术(北京)有限公司 Text regularization model training method and device, and text regularization method and device
CN108509411B (en) * 2017-10-10 2021-05-11 腾讯科技(深圳)有限公司 Semantic analysis method and device
CN109460472A (en) * 2018-11-09 2019-03-12 北京京东金融科技控股有限公司 File classification method and device and electronic equipment

Also Published As

Publication number Publication date
CN110223675A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN110223675B (en) Method and system for screening training text data for voice recognition
CN108920622B (en) Training method, training device and recognition device for intention recognition
CN108874776B (en) Junk text recognition method and device
CN111339305B (en) Text classification method and device, electronic equipment and storage medium
CN111177326B (en) Key information extraction method and device based on fine labeling text and storage medium
CN111160031A (en) Social media named entity identification method based on affix perception
CN111460162B (en) Text classification method and device, terminal equipment and computer readable storage medium
CN113553848B (en) Long text classification method, system, electronic device, and computer-readable storage medium
CN111460149A (en) Text classification method, related equipment and readable storage medium
CN105975497A (en) Automatic microblog topic recommendation method and device
CN112380866A (en) Text topic label generation method, terminal device and storage medium
CN113469298A (en) Model training method and resource recommendation method
CN115600605A (en) Method, system, equipment and storage medium for jointly extracting Chinese entity relationship
CN111291551A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN106202349B (en) Webpage classification dictionary generation method and device
CN111079433A (en) Event extraction method and device and electronic equipment
CN113076720B (en) Long text segmentation method and device, storage medium and electronic device
CN110969005A (en) Method and device for determining similarity between entity corpora
CN113806483A (en) Data processing method and device, electronic equipment and computer program product
CN111178080A (en) Named entity identification method and system based on structured information
CN110874408A (en) Model training method, text recognition device and computing equipment
CN112559750A (en) Text data classification method and device, nonvolatile storage medium and processor
CN113761874A (en) Event reality prediction method and device, electronic equipment and storage medium
CN112541352A (en) Policy interpretation method based on deep learning
CN112015891A (en) Method and system for classifying messages of network inquiry platform based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant