CN113076749A - Text recognition method and system - Google Patents


Info

Publication number
CN113076749A
CN113076749A (application CN202110417492.4A)
Authority
CN
China
Prior art keywords
text
training
word segmentation
recognized
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110417492.4A
Other languages
Chinese (zh)
Inventor
王珏
史文华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yunshen Intelligent Technology Co ltd
Original Assignee
Shanghai Yunshen Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yunshen Intelligent Technology Co ltd filed Critical Shanghai Yunshen Intelligent Technology Co ltd
Priority to CN202110417492.4A
Publication of CN113076749A
Legal status: Withdrawn (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/205: Parsing
    • G06F 40/216: Parsing using statistical methods

Abstract

The invention discloses a text recognition method and system, comprising the following steps: training a language model on a generalized and expanded training corpus; preprocessing a text to be recognized to obtain a processed text; performing word segmentation and vectorization on the processed text to obtain a word segmentation sequence, arranged in the character order of the text to be recognized; and inputting the word segmentation sequence into the pre-trained language model to obtain candidate recognition results, with the candidate having the highest joint probability value determined as the final recognition result. The text is thus recognized accurately and efficiently even with only a small amount of sample data.

Description

Text recognition method and system
Technical Field
The invention relates to the field of computer processing, in particular to a text recognition method and a text recognition system.
Background
Language is the most important means of human communication and the principal vehicle of human expression; the preservation and transmission of human civilization has depended on it. Writing gives language a visual form and breaks the temporal and spatial limits of speech, allowing human wisdom and intellectual wealth to be handed down completely in written form, so that humanity could build education systems, raise its collective intelligence, develop science and technology, and enter civilized society.
In natural language processing, text recognition determines which language a given text is written in from its content. With the development of cross-language retrieval technology, text recognition has attracted growing attention as one of its core technologies; it is applied mainly in machine translation and multilingual retrieval. Current research on text recognition falls into rule-based methods and machine-learning-based methods. Rule-based methods require experts to manually summarize and induce linguistic rules and then perform string matching; they demand a large number of language specialists, and their accuracy is difficult to guarantee.
In the field of key information extraction, the prior art suffers from scarce corpus samples and poor generalization, so texts to be recognized from new domains are difficult to recognize correctly and the accuracy of text recognition cannot be guaranteed.
Disclosure of Invention
The invention aims to provide a text recognition method and system that can quickly obtain a large amount of training corpus, improve the accuracy of the trained language model, and thereby improve the accuracy and reliability of text recognition.
The technical scheme provided by the invention is as follows:
the invention provides a text recognition method, which comprises the following steps:
training according to the generalized and expanded training corpus to obtain a language model;
preprocessing a text to be recognized to obtain a processed text;
performing word segmentation on the processed text and vectorizing to obtain a word segmentation sequence; the word segmentation sequence is arranged according to the character sequence of the text to be recognized;
and inputting the word segmentation sequence into a pre-trained language model to obtain a candidate recognition result, and determining the candidate recognition result with the highest joint probability value as a final recognition result.
Further, training on the generalized and expanded training corpus to obtain the language model includes the steps of:
carrying out generalization preprocessing on the obtained sample corpus to obtain training corpora, and dividing all the training corpora into a training set and a verification set;
performing word segmentation on the training corpora in the training set to obtain word segmentation results, and vectorizing the word segmentation results to obtain corresponding word vectors;
and training according to the training set and the verification set to obtain the language model.
Further, the step of performing generalization preprocessing on the obtained sample corpus to obtain the training corpus comprises:
establishing a replacement dictionary in advance according to wrongly written characters and similar meaning words; the replacement dictionary comprises a corresponding relation between preset words and replacement words;
and carrying out word replacement on the sample corpus according to the replacement dictionary to obtain an expanded corpus, and summarizing the sample corpus and the expanded corpus to obtain the training corpus.
Further, the step of inputting the word segmentation sequence into a pre-trained language model to obtain a candidate recognition result, and determining the candidate recognition result with the highest joint probability value as a final recognition result includes:
sequentially inputting each word vector to be recognized in the word segmentation sequence to the language model according to the character sequence of the text to be recognized, and outputting the occurrence probability of each word vector to be recognized in the text to be recognized through the language model;
and calculating the joint probability value of each candidate recognition result by a similarity algorithm according to the occurrence probability of each word vector to be recognized in the text to be recognized, and determining the candidate recognition result with the highest joint probability value as the final recognition result.
The present invention also provides a text recognition system, comprising:
the preprocessing module is used for preprocessing the text to be recognized to obtain a processed text;
the word segmentation module is used for performing word segmentation on the processed text and performing vectorization processing to obtain a word segmentation sequence; the word segmentation sequence is arranged according to the character sequence of the text to be recognized;
and the recognition processing module is used for inputting the word segmentation sequence into a pre-trained language model to obtain a candidate recognition result, and determining the candidate recognition result with the highest joint probability value as a final recognition result.
Further, the text recognition system further includes:
the system comprises a sample processing module, a word segmentation module and a word segmentation module, wherein the sample processing module is used for carrying out generalization pretreatment on the obtained sample linguistic data to obtain training linguistic data, dividing all the training linguistic data into a training set and a verification set respectively, carrying out word segmentation on the training linguistic data in the training set to obtain word segmentation results, and carrying out vectorization on the word segmentation results to obtain corresponding word vectors;
and the model training module is used for training according to the training set and the verification set to obtain the language model.
Further, the sample processing module comprises:
the dictionary creating unit is used for pre-establishing a replacement dictionary according to the wrongly written characters and the similar meaning words; the replacement dictionary comprises a corresponding relation between preset words and replacement words;
and the generalization processing unit is used for carrying out word replacement on the sample corpus according to the replacement dictionary to obtain an expanded corpus, and summarizing the sample corpus and the expanded corpus to obtain the training corpus.
Further, the identification processing module includes:
the input unit is used for sequentially inputting each word vector to be recognized in the word segmentation sequence to the language model according to the character sequence of the text to be recognized, and outputting the occurrence probability of each word vector to be recognized in the text to be recognized through the language model;
and the processing unit is used for calculating the joint probability value of each candidate recognition result through a similarity algorithm according to the occurrence probability of each word vector to be recognized in the text to be recognized, and for determining the candidate recognition result with the highest joint probability value as the final recognition result.
The invention has the following beneficial effects and advantages:
the text recognition method and the text recognition system can accurately and efficiently recognize the text, the number of recognized texts is far more than that of the text recognition methods, the number of recognized texts can be continuously expanded on the premise of sample data, and the recognition accuracy of the text to be recognized is greatly improved.
Drawings
FIG. 1 is a flow diagram of one embodiment of a text recognition method of the present invention;
FIG. 2 is a flow diagram of another embodiment of a text recognition method of the present invention;
FIG. 3 is a flow diagram of another embodiment of a text recognition method of the present invention;
FIG. 4 is a flow diagram of another embodiment of a text recognition method of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
For the sake of simplicity, the drawings only schematically show the parts relevant to the present invention, and they do not represent the actual structure as a product. In addition, in order to make the drawings concise and understandable, components having the same structure or function in some of the drawings are only schematically illustrated or only labeled. In this document, "one" means not only "only one" but also a case of "more than one".
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
One embodiment of the present invention, as shown in fig. 1, is a text recognition method, including the steps of:
s100, training according to the generalized and expanded training corpus to obtain a language model;
specifically, the terminal device or the server performs training according to the generalized and expanded training corpus to obtain a language model, where the language model is generally an N-Gram language model (a statistical language model based on byte fragment sequences with a length of N), or a Bert model.
S200, preprocessing a text to be recognized to obtain a processed text;
specifically, the text to be recognized may be obtained by a voice audio signal collected by a microphone, and the corresponding text to be recognized may be obtained by performing voice recognition on the voice audio signal by using a voice recognition technology of scientific news. Of course, the text to be recognized may also be a text to be recognized obtained by performing image recognition on image data captured by a camera. Of course, the user may also directly input or select a text segment manually to obtain the text to be recognized.
After the terminal device or the server has obtained the language model through the training above, once a text to be recognized arrives it is preprocessed into a processed text. The preprocessing includes, but is not limited to, filtering out common auxiliary words (structural particles such as 的, 地 and 得; aspect particles such as 着, 了 and 过; modal particles such as 吗, 呢, 吧 and 啊), prepositions (e.g. 从 'from', 到 'to', 向 'toward', 在 'in', 当 'when') and other interfering words, and deleting symbols and emoticons (e.g. Chinese emoticons, character emoticons and the like).
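A sketch of this preprocessing under stated assumptions: the stop-list below is an illustrative subset of Chinese function words, not the full filter a production system would use, and filtering single characters directly can clip content words, so a real pipeline would filter after segmentation.

```python
import re

# Illustrative stop-list: structural particles (的, 地, 得), aspect particles
# (着, 了, 过), modal particles (吗, 呢, 吧, 啊) and a few prepositions
# (从, 到, 向, 在, 当). Assumed subset for demonstration only.
STOP_CHARS = set("的地得着了过吗呢吧啊从到向在当")

def preprocess(text):
    """Strip symbols/emoticons, then filter common function words."""
    text = re.sub(r"[^\w]", "", text)  # drop punctuation, emoji, whitespace
    return "".join(ch for ch in text if ch not in STOP_CHARS)
```

For example, `preprocess("打开了台灯!")` drops the exclamation mark and the aspect particle 了, leaving the content words only.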
S300, performing word segmentation and vectorization processing on the processed text to obtain a word segmentation sequence; the word segmentation sequence is arranged according to the character sequence of the text to be recognized;
s400, inputting the word segmentation sequence into a pre-trained language model to obtain a candidate recognition result, and determining the candidate recognition result with the highest joint probability value as a final recognition result.
Specifically, the terminal device or the server segments the processed text with a word segmentation tool (the Jieba, ICTCLAS or MMSeg segmenter, for example) to obtain word segmentation results, and vectorizes each segment. Vectorization methods include, but are not limited to, TF-IDF, word2vec, or self-learned embeddings (the word vectors are randomly initialized and their representations are then learned during training).
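The segmentation-plus-vectorization step can be illustrated with forward maximum matching (the greedy dictionary strategy behind segmenters of the MMSeg family) and a simple vocabulary index. The dictionary and vocabulary here are toy assumptions; a real pipeline would use Jieba or ICTCLAS with word2vec or learned embeddings.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-first dictionary segmentation (forward maximum matching)."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + size]
            if size == 1 or cand in dictionary:
                words.append(cand)  # single characters always fall through
                i += size
                break
    return words

def vectorize(words, vocab):
    """Map each word to its vocabulary index; 0 marks out-of-vocabulary."""
    return [vocab.get(w, 0) for w in words]

dictionary = {"文本", "识别", "方法"}                       # toy lexicon
vocab = {w: i + 1 for i, w in enumerate(sorted(dictionary))}  # 1-based ids
```

The resulting index sequence preserves the character order of the input text, matching the ordering requirement stated in step S300.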
Then, the terminal device or the server arranges the vectorized word segmentation results in the character order of the text to be recognized, yielding the word segmentation sequence corresponding to that text. Finally, the terminal device or the server feeds the word segmentation sequence into the trained language model to obtain several candidate recognition results, computes the occurrence probability of each word in each candidate to form a joint probability value, and determines the candidate with the highest joint probability value as the final recognition result.
The invention enlarges the coverage of the training corpus used to obtain the language model, which improves the model's training effect, greatly improves the recognition accuracy of the text to be recognized, and supports recognizing texts from many domains. In addition, because the language model is trained on the generalized and expanded corpus, the period for constructing the data set is markedly shortened, its time cost is reduced, training becomes more efficient, and texts from different domains are recognized far more efficiently. Preprocessing the text to be recognized also reduces the influence of non-key information on the text's semantics.
An embodiment of the present invention, as shown in fig. 2, is a text recognition method, including the steps of:
s110, carrying out generalization pretreatment on the obtained sample corpus to obtain training corpuses, and dividing all the training corpuses into a training set and a verification set respectively;
s120, performing word segmentation processing on the training corpus in the training set to obtain word segmentation results, and labeling the word segmentation results to obtain corresponding word vectors;
s130, training according to the training set and the verification set to obtain the language model;
specifically, the terminal device or the server performs generalization preprocessing on the obtained sample corpus to obtain a corpus, and then divides all corpora into a training set and a verification set. Referring to the above embodiment, the terminal device or the server performs word segmentation on the training corpus in the training set to obtain word segmentation results, and performs vectorization on the word segmentation results to obtain corresponding word vectors, and the terminal device or the server performs training according to the training set and the verification set to obtain the language model. In the training, the training of the model is supervised by adopting an early training method, after the data training of each batch size is finished (or after each N training periods), the verification set can verify the error or the accuracy of the current model, when the error obtained by the batch size is smaller than that obtained by the last time or the last two times (which can be specified), the training can be continued until the error becomes larger, the training is stopped, and the model trained after the last batch size is saved as the final language model.
Early stopping works by monitoring the model's performance on a held-out verification set: the terminal device or server terminates training when that performance has not improved for a pre-specified number of consecutive evaluations. This avoids overfitting by automatically selecting the inflection point at which verification performance begins to degrade while training-set performance is still improving, that is, the point where the model starts to overfit.
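The stopping rule described above (halt once validation performance stops improving for a set number of checks) can be sketched generically; `train_step` and `eval_loss` are placeholder callbacks standing in for one epoch of language-model training and a verification-set evaluation.

```python
def train_with_early_stopping(train_step, eval_loss, max_epochs=100, patience=2):
    """Stop when validation loss fails to improve for `patience` checks.

    `train_step(epoch)` runs one training epoch; `eval_loss()` returns the
    current validation loss. Returns the epoch whose model should be kept.
    """
    best_loss, best_epoch, bad = float("inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_step(epoch)
        loss = eval_loss()
        if loss < best_loss:
            best_loss, best_epoch, bad = loss, epoch, 0  # new best: reset
        else:
            bad += 1                  # no improvement this evaluation
            if bad >= patience:
                break                 # inflection point reached: stop
    return best_epoch
```

In a real training loop the parameters from `best_epoch` would be checkpointed and restored as the final language model.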
The terminal device or the server estimates the parameters of the language model by maximum likelihood, with the loss function defined as the negated log-likelihood. Maximum likelihood estimation thus turns the abstract model parameters into concrete values, and early stopping allows gradient updates to be terminated ahead of time.
S200, preprocessing a text to be recognized to obtain a processed text;
s300, performing word segmentation and vectorization on the processed text to obtain a word segmentation sequence; the word segmentation sequence is arranged according to the character sequence of the text to be recognized;
s400, inputting the word segmentation sequence into a pre-trained language model to obtain a candidate recognition result, and determining the candidate recognition result with the highest joint probability value as a final recognition result.
The method effectively expands the coverage of the data set, enhancing and extending the training corpus and supplying more material for subsequent machine learning; it markedly shortens the period for constructing the data set, reduces its cost, helps improve the training effect of the language model, and greatly improves the recognition accuracy of the text to be recognized. Because the language model is trained on the generalized and expanded corpus, training efficiency rises and texts from different domains are recognized far more efficiently. Preprocessing the text to be recognized likewise reduces the influence of non-key information on the text's semantics.
One embodiment of the present invention, as shown in fig. 3, is a text recognition method, including the steps of:
s111, establishing a replacement dictionary in advance according to wrongly written characters and synonyms; the replacement dictionary comprises a corresponding relation between preset words and replacement words;
s112, performing word replacement on the sample corpus according to the replacement dictionary to obtain an expanded corpus, and summarizing the sample corpus and the expanded corpus to obtain the training corpus;
s113, dividing all training corpora into a training set and a verification set respectively;
s120, performing word segmentation on the training corpus in the training set to obtain word segmentation results, and performing vectorization on the word segmentation results to obtain corresponding word vectors;
s130, training according to the training set and the test set to obtain the language model;
specifically, the terminal device or the server establishes a lexicon of alternatives in advance, the lexicon of alternatives includes the corresponding relation between the preset words and the substitute words, and the terminal device or the server performs random substitution processing on the sample corpus according to the lexicon of alternatives to obtain a large amount of expanded corpuses to achieve enhancement of the training corpus, so that the terminal device or the server can obtain the training sample after data enhancement. According to the method, the new sample can be generated through word replacement to obtain the expanded corpus, the generated new sample is used for any model training, and compared with the method of directly training by using the original sample, the model trained by using the new sample and the original sample is better in performance.
Preferably, the method uses the language model BERT as the basis for training; it can extract deep bidirectional semantic features while improving computational efficiency. Word replacement does not change the labels of samples before and after enhancement, which keeps the labels of the newly enhanced samples accurate. It also avoids the trouble of adding and retraining "label embeddings": training can proceed directly on the language model BERT without modifying the network structure, reducing training difficulty.
S200, preprocessing a text to be recognized to obtain a processed text;
s300, performing word segmentation and vectorization on the processed text to obtain a word segmentation sequence; the word segmentation sequence is arranged according to the character sequence of the text to be recognized;
s400, inputting the word segmentation sequence into a pre-trained language model to obtain a candidate recognition result, and determining the candidate recognition result with the highest joint probability value as a final recognition result.
The invention pre-trains the language model BERT on a data set from the target domain, yielding a model that both carries target-domain knowledge and fits the text distribution of the data set, so the predicted final recognition result is more apt and semantically closer to the original sentence. The data enhancement helps generate more diverse training corpora and greatly improves the generalization ability of the model. Because the replacement targets only the text words of the sample corpus, swapping in misspelled characters or near-synonyms, no label information needs to be introduced and the labels of the enhanced samples are guaranteed to remain unchanged. This reasonable and effective enhancement and extension of the training corpus satisfies the demand for training data while improving the robustness and generalization ability of the language model.
An embodiment of the present invention, as shown in fig. 4, is a text recognition method, including the steps of:
s100, training according to the generalized and expanded training corpus to obtain a language model;
s200, preprocessing a text to be recognized to obtain a processed text;
s300, performing word segmentation and vectorization processing on the processed text to obtain a word segmentation sequence; the word segmentation sequence is arranged according to the character sequence of the text to be recognized;
s410, sequentially inputting each word vector to be recognized in the word segmentation sequence to the language model according to the character sequence of the text to be recognized, and outputting the occurrence probability of each word vector to be recognized in the text to be recognized through the language model;
S420, calculating the joint probability value of each candidate recognition result through a similarity algorithm according to the occurrence probability of each word vector to be recognized in the text to be recognized, and determining the candidate recognition result with the highest joint probability value as the final recognition result.
Specifically, the word segmentation sequence obtained after the terminal device or the server processes the text to be recognized is as follows:
S = {w1, w2, w3, …, wn}
P(S) = P(w1, w2, w3, …, wn)
     = P(w1) · P(w2|w1) · … · P(wn|w1, w2, …, wn-1)
wherein S is the word vector sequence, n is the number of word vectors in the sequence, wn is the n-th word vector, P(S) is the probability that the string formed by the word vectors in their arrangement order is a sentence, and P(wn|w1, w2, …, wn-1) is the probability that the current word vector wn occurs given that the preceding n-1 word vectors are (w1, w2, …, wn-1).
The terminal device or the server sequentially inputs each word vector to be recognized in the word segmentation sequence into the language model in the character order of the text to be recognized, and the language model outputs the occurrence probability of each word vector in the text. The terminal device or the server then computes, through a similarity algorithm, the joint probability value of each candidate recognition result, Pn = P(wn|w1, w2, …, wn-1) · P(wn-1|w1, w2, …, wn-2) · … · P(w2|w1) · P(w1); that is, it multiplies the occurrence probabilities of the word vectors of each candidate sequence to obtain the candidate's joint probability value, and determines the candidate recognition result with the highest joint probability value as the final recognition result.
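The chain-rule product above is usually evaluated in log space to avoid floating-point underflow when many conditional probabilities are multiplied; a sketch with made-up probability values follows.

```python
import math

def joint_log_prob(cond_probs):
    """log P(S) = sum over n of log P(w_n | w_1 .. w_{n-1})."""
    return sum(math.log(p) for p in cond_probs)

def best_candidate(candidates):
    """Pick the candidate whose conditional probabilities multiply highest."""
    return max(candidates, key=lambda c: joint_log_prob(c[1]))

# (text, [P(w1), P(w2|w1), ...]) pairs; the probabilities are made up
candidates = [("打开台灯", [0.5, 0.4, 0.6]),
              ("打开抬灯", [0.5, 0.4, 0.1])]
```

Because log is monotonic, ranking candidates by summed log-probabilities selects the same winner as ranking by the raw product Pn.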
To address insufficient model generalization, the invention adopts the pre-trained model BERT and performs word segmentation on the training corpus at the same time. For model fitting, early stopping is used to terminate gradient updates ahead of time. For model control, words of the sample corpus are matched and replaced using the replacement dictionary, so no new labels need to be introduced, which helps mitigate language-model overfitting. First, the dictionary is established; words in the dictionary have high priority and are the ones selected for replacement. As an example, replacing the corresponding words in 50% of the corpus with replacement words from the dictionary yields new sample corpora, which serve as the training corpus input to the initial language model above. At prediction time, words of the sample corpus that appear in the replacement dictionary can likewise be replaced, so the dictionary intervenes in the model to realize the extraction capability. The language model of the invention adds word segmentation result information rather than sharing task parameters, and this additional information helps address the generalization problem. In addition, training with words substituted via the replacement dictionary enhances and extends the training corpus, and model training on such data-enhanced samples improves the robustness and generalization ability of the resulting language model.
The method generalizes well and recognizes texts from new domains easily, accurately and efficiently; because the language model is trained on an enlarged set of training samples, the overfitting problem is unlikely to arise. The method fine-tunes a pre-trained language model using early stopping and a maximum-likelihood loss to obtain the language model. Even with a small training corpus, the pre-trained model and the word segmentation result information increase the model's generalization ability, and early stopping prevents overfitting during training.
In one embodiment of the present invention, a text recognition system includes:
the preprocessing module is used for preprocessing the text to be recognized to obtain a processed text;
the word segmentation module is used for performing word segmentation on the processed text and performing vectorization processing to obtain a word segmentation sequence; the word segmentation sequence is arranged according to the character sequence of the text to be recognized;
and the recognition processing module is used for inputting the word segmentation sequence into a pre-trained language model to obtain a candidate recognition result, and determining the candidate recognition result with the highest joint probability value as a final recognition result.
Specifically, this embodiment is a system embodiment corresponding to the above method embodiment, and specific effects refer to the above method embodiment, which is not described in detail herein.
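The word segmentation module above can be sketched with forward maximum matching, one common segmentation strategy chosen here purely for illustration (the patent does not fix a particular algorithm); the lexicon, vocabulary, and function name are hypothetical:

```python
def segment_and_vectorize(text, lexicon, vocab):
    """Greedy forward maximum matching against a lexicon, then map each
    token to a vocabulary index; the token order follows the character
    order of the text to be recognized."""
    tokens, i = [], 0
    max_len = max(map(len, lexicon))
    while i < len(text):
        # Try the longest lexicon match first, fall back to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            if length == 1 or text[i:i + length] in lexicon:
                tokens.append(text[i:i + length])
                i += length
                break
    return tokens, [vocab.get(t, 0) for t in tokens]  # 0 = unknown token
```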
Based on the foregoing embodiment, the text recognition system further includes:
the sample processing module is used for performing generalization preprocessing on the obtained sample corpora to obtain training corpora, dividing all the training corpora into a training set and a verification set, performing word segmentation on the training corpora in the training set to obtain word segmentation results, and vectorizing the word segmentation results to obtain corresponding word vectors;
and the model training module is used for performing finetune training on the pre-trained language model according to the training set and the verification set to obtain the language model.
Specifically, this embodiment is a system embodiment corresponding to the above method embodiment, and specific effects refer to the above method embodiment, which is not described in detail herein.
Based on the foregoing embodiments, the sample processing module includes:
the dictionary creating unit is used for pre-establishing a replacement dictionary according to the wrongly written characters and the similar meaning words; the replacement dictionary comprises a corresponding relation between preset words and replacement words;
and the generalization processing unit is used for carrying out word replacement on the sample corpus according to the replacement dictionary to obtain an expanded corpus, and summarizing the sample corpus and the expanded corpus to obtain the training corpus.
Specifically, this embodiment is a system embodiment corresponding to the above method embodiment, and specific effects refer to the above method embodiment, which is not described in detail herein.
Based on the foregoing embodiment, the identification processing module includes:
the input unit is used for sequentially inputting each word vector to be recognized in the word segmentation sequence to the language model according to the character sequence of the text to be recognized, and outputting the occurrence probability of each word vector to be recognized in the text to be recognized through the language model;
and the processing unit is used for calculating the joint probability value of each candidate recognition result through a similarity algorithm according to the occurrence probability of each word vector to be recognized in the text to be recognized, and determining the candidate recognition result with the highest joint probability value as the final recognition result.
Specifically, this embodiment is a system embodiment corresponding to the above method embodiment, and specific effects refer to the above method embodiment, which is not described in detail herein.
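The processing unit's scoring step leaves the similarity algorithm unspecified; one plausible reading, sketched below under that assumption, multiplies the per-token occurrence probabilities (computed in log space for numerical stability) and keeps the highest-scoring candidate:

```python
import math

def best_candidate(candidates, prob_fn):
    """Return the candidate recognition result with the highest joint
    probability, where prob_fn(token) is the occurrence probability the
    language model assigns to a word vector in the text to be recognized."""
    def joint_log_prob(tokens):
        # Product of probabilities == sum of log-probabilities.
        return sum(math.log(prob_fn(t)) for t in tokens)
    return max(candidates, key=joint_log_prob)
```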
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of program modules is illustrated, and in practical applications, the above-described distribution of functions may be performed by different program modules, that is, the internal structure of the apparatus may be divided into different program units or modules to perform all or part of the above-described functions. Each program module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one processing unit, and the integrated unit may be implemented in a form of hardware, or may be implemented in a form of software program unit. In addition, the specific names of the program modules are only used for distinguishing the program modules from one another, and are not used for limiting the protection scope of the application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or recited in detail in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed method and system may be implemented in other ways. For example, the division of the module or unit is only one logic function division, and there may be another division manner in actual implementation.
It should be noted that the above embodiments can be freely combined as necessary. The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A text recognition method, comprising the steps of:
training according to the generalized and expanded training corpus to obtain a language model;
preprocessing a text to be recognized to obtain a processed text;
performing word segmentation on the processed text and vectorizing to obtain a word segmentation sequence; the word segmentation sequence is arranged according to the character sequence of the text to be recognized;
and inputting the word segmentation sequence into a pre-trained language model to obtain a candidate recognition result, and determining the candidate recognition result with the highest joint probability value as a final recognition result.
2. The method according to claim 1, wherein said training according to the generalized and expanded corpus to obtain the language model comprises the steps of:
carrying out generalization preprocessing on the obtained sample corpora to obtain training corpora, and dividing all the training corpora into a training set and a verification set;
performing word segmentation on the training corpus in the training set to obtain word segmentation results, and performing vectorization on the word segmentation results to obtain corresponding word vectors;
and training according to the training set and the verification set to obtain the language model.
3. The text recognition method according to claim 2, wherein the step of performing generalization preprocessing on the obtained sample corpus to obtain the training corpus comprises:
establishing a replacement dictionary in advance according to wrongly written characters and similar meaning words; the replacement dictionary comprises a corresponding relation between preset words and replacement words;
and carrying out word replacement on the sample corpus according to the replacement dictionary to obtain an expanded corpus, and summarizing the sample corpus and the expanded corpus to obtain the training corpus.
4. The text recognition method according to any one of claims 1 to 3, wherein the step of inputting the word segmentation sequence into a pre-trained language model to obtain a candidate recognition result, and the step of determining the candidate recognition result with the highest joint probability value as the final recognition result comprises the steps of:
sequentially inputting each word vector to be recognized in the word segmentation sequence to the language model according to the character sequence of the text to be recognized, and outputting the occurrence probability of each word vector to be recognized in the text to be recognized through the language model;
and calculating the joint probability value of each candidate recognition result by a similarity algorithm according to the occurrence probability of each word vector to be recognized in the text to be recognized, and determining the candidate recognition result with the highest joint probability value as the final recognition result.
5. A text recognition system, comprising:
the preprocessing module is used for preprocessing the text to be recognized to obtain a processed text;
the word segmentation module is used for performing word segmentation on the processed text and performing vectorization processing to obtain a word segmentation sequence; the word segmentation sequence is arranged according to the character sequence of the text to be recognized;
and the recognition processing module is used for inputting the word segmentation sequence into a pre-trained language model to obtain a candidate recognition result, and determining the candidate recognition result with the highest joint probability value as a final recognition result.
6. The text recognition system of claim 5, further comprising:
the sample processing module is used for performing generalization preprocessing on the obtained sample corpora to obtain training corpora, dividing all the training corpora into a training set and a verification set, performing word segmentation on the training corpora in the training set to obtain word segmentation results, and vectorizing the word segmentation results to obtain corresponding word vectors;
and the model training module is used for training according to the training set and the verification set to obtain the language model.
7. The text recognition system of claim 6, wherein the sample processing module comprises:
the dictionary creating unit is used for pre-establishing a replacement dictionary according to the wrongly written characters and the similar meaning words; the replacement dictionary comprises a corresponding relation between preset words and replacement words;
and the generalization processing unit is used for carrying out word replacement on the sample corpus according to the replacement dictionary to obtain an expanded corpus, and summarizing the sample corpus and the expanded corpus to obtain the training corpus.
8. The text recognition system of any one of claims 5-7, wherein the recognition processing module comprises:
the input unit is used for sequentially inputting each word vector to be recognized in the word segmentation sequence to the language model according to the character sequence of the text to be recognized, and outputting the occurrence probability of each word vector to be recognized in the text to be recognized through the language model;
and the processing unit is used for calculating the joint probability value of each candidate recognition result through a similarity algorithm according to the occurrence probability of each word vector to be recognized in the text to be recognized, and determining the candidate recognition result with the highest joint probability value as the final recognition result.
CN202110417492.4A 2021-04-19 2021-04-19 Text recognition method and system Withdrawn CN113076749A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110417492.4A CN113076749A (en) 2021-04-19 2021-04-19 Text recognition method and system


Publications (1)

Publication Number Publication Date
CN113076749A true CN113076749A (en) 2021-07-06

Family

ID=76618059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110417492.4A Withdrawn CN113076749A (en) 2021-04-19 2021-04-19 Text recognition method and system

Country Status (1)

Country Link
CN (1) CN113076749A (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150006155A1 (en) * 2012-03-07 2015-01-01 Mitsubishi Electric Corporation Device, method, and program for word sense estimation
CN111797898A (en) * 2020-06-03 2020-10-20 武汉大学 Online comment automatic reply method based on deep semantic matching

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Pan Hangyu: "Research on Joint Entity-Relation Extraction Methods Based on Deep Learning", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114647727A (en) * 2022-03-17 2022-06-21 北京百度网讯科技有限公司 Model training method, device and equipment applied to entity information recognition
CN116911288A (en) * 2023-09-11 2023-10-20 戎行技术有限公司 Discrete text recognition method based on natural language processing technology
CN116911288B (en) * 2023-09-11 2023-12-12 戎行技术有限公司 Discrete text recognition method based on natural language processing technology

Similar Documents

Publication Publication Date Title
CN112101041B (en) Entity relationship extraction method, device, equipment and medium based on semantic similarity
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN110765759B (en) Intention recognition method and device
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN109408824B (en) Method and device for generating information
CN111611810B (en) Multi-tone word pronunciation disambiguation device and method
CN111223476B (en) Method and device for extracting voice feature vector, computer equipment and storage medium
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN112002323A (en) Voice data processing method and device, computer equipment and storage medium
CN113076749A (en) Text recognition method and system
CN113886601A (en) Electronic text event extraction method, device, equipment and storage medium
CN113326702A (en) Semantic recognition method and device, electronic equipment and storage medium
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN113934909A (en) Financial event extraction method based on pre-training language and deep learning model
CN112417878A (en) Entity relationship extraction method, system, electronic equipment and storage medium
JP2022145623A (en) Method and device for presenting hint information and computer program
CN113486178A (en) Text recognition model training method, text recognition device and medium
CN112632956A (en) Text matching method, device, terminal and storage medium
CN112183106A (en) Semantic understanding method and device based on phoneme association and deep learning
CN112188311A (en) Method and apparatus for determining video material of news
CN116483314A (en) Automatic intelligent activity diagram generation method
CN116186241A (en) Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium
CN115169370A (en) Corpus data enhancement method and device, computer equipment and medium
CN114722832A (en) Abstract extraction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210706