CN111460827A - Text information processing method, system, equipment and computer readable storage medium - Google Patents

Text information processing method, system, equipment and computer readable storage medium

Info

Publication number
CN111460827A
Authority
CN
China
Prior art keywords
text
processed
error correction
corrected
standard
Prior art date
Legal status
Granted
Application number
CN202010248972.8A
Other languages
Chinese (zh)
Other versions
CN111460827B (en)
Inventor
邬国锐
李杨
Current Assignee
Beijing Aikaka Information Technology Co ltd
Original Assignee
Beijing Aikaka Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Aikaka Information Technology Co ltd filed Critical Beijing Aikaka Information Technology Co ltd
Priority to CN202010248972.8A priority Critical patent/CN111460827B/en
Publication of CN111460827A publication Critical patent/CN111460827A/en
Application granted granted Critical
Publication of CN111460827B publication Critical patent/CN111460827B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F18/214: Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N3/044: Computing arrangements based on specific computational models; Neural networks; Architecture, e.g. interconnection topology; Recurrent networks, e.g. Hopfield networks
    • G06N3/045: Computing arrangements based on specific computational models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06V30/10: Image or video recognition or understanding; Character recognition

Abstract

The invention discloses a text information processing method, a text information processing system, a text information processing device and a computer readable storage medium. The method performs error correction processing on the text to be processed by using an error correction model trained in advance on an error correction training set corresponding to the type of the text to be processed, obtaining at least one corrected text of the text to be processed and thereby correcting font errors and the like in the text to be processed; it then extracts the structured features of the corrected text with a named entity recognition model trained in advance on a structured feature training set corresponding to the type of the text to be processed, matches the structured features of the corrected text against the structured features of each piece of standard text information in a trusted data set, and determines the standard text information corresponding to the corrected text, so that named entity errors remaining in the corrected text are further corrected through the structured features and the accuracy of text information recognition is improved.

Description

Text information processing method, system, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method, a system, a device, and a computer-readable storage medium for processing text information.
Background
In daily work and life, paper documents such as bills and certificates, for example invoices and business licenses, are widely used, and automatically recognizing the text information printed on them by computer technology has become a trend. Key text information such as company names has special structural characteristics and requires high recognition accuracy; in many financial settings, a company name and text information similar to a company name are not allowed to contain any error.
At present, Optical Character Recognition (OCR) technology is mainly used for recognizing text information printed on paper; it uses optical technology and computer technology to read characters printed or written on paper and convert them into a format that can be accepted by a computer and understood by people. The OCR processing steps mainly comprise: image preprocessing, layout analysis, text positioning (or image cutting), character cutting and recognition, and the like.
However, the following recognition errors often occur when text is recognized with OCR technology: for some long text information on a paper document, the head and/or tail of the text is cut off because of text positioning deviation; because Chinese characters are structured glyphs, a character with a left-right or top-bottom structure is easily mis-recognized as two or more characters, for example "咔" being recognized as "口卡"; unclear paper documents, slanted printing, partially covered or overlapping text, low brightness and the like make recognition difficult, so that characters with complex structures are recognized as other characters; and mixed Chinese and English text is mis-recognized, for example "IBM" as "18M". As a result, the error rate of recognizing text information on paper documents is high, and how to correct the recognition result and improve the accuracy of text information recognition is a technical problem that needs to be solved urgently.
Disclosure of Invention
The invention provides a text information processing method, a text information processing system, a text information processing device and a computer readable storage medium, which are used for overcoming the technical problems in the prior art and improving the accuracy of recognition of text information on paper documents.
The invention provides a text information processing method, which comprises the following steps:
carrying out error correction processing on a text to be processed through an error correction model to obtain at least one corrected text of the text to be processed, wherein the error correction model is obtained by training on an error correction training set corresponding to the type of the text to be processed;
extracting the structural features of the corrected text through a named entity recognition model, wherein the named entity recognition model is obtained by training on a structural feature training set corresponding to the type of the text to be processed;
and matching the structural features of the corrected text with the structural features of each piece of standard text information in a trusted data set, and determining the standard text information corresponding to the corrected text.
The present invention also provides a text information processing system, comprising:
the first error correction module is used for carrying out error correction processing on a text to be processed through an error correction model to obtain at least one corrected text of the text to be processed, wherein the error correction model is obtained by training on an error correction training set corresponding to the type of the text to be processed;
the structural feature extraction module is used for extracting the structural features of the corrected text through a named entity recognition model, wherein the named entity recognition model is obtained by training on a structural feature training set corresponding to the type of the text to be processed;
and the second error correction module is used for matching the structural features of the corrected text with the structural features of each piece of standard text information in a trusted data set and determining the standard text information corresponding to the corrected text.
The present invention also provides a text information processing apparatus comprising:
a processor, a memory, and a computer program stored on the memory and executable on the processor; wherein the processor implements the text information processing method as described above when running the computer program.
The present invention also provides a computer-readable storage medium storing a computer program that can be executed to perform the text information processing method described above.
The method performs error correction processing on the text to be processed by using an error correction model trained in advance on an error correction training set corresponding to the type of the text to be processed, obtaining at least one corrected text of the text to be processed and thereby correcting font errors and the like in the text to be processed; it then extracts the structured features of the corrected text with a named entity recognition model trained in advance on a structured feature training set corresponding to the type of the text to be processed, matches the structured features of the corrected text against the structured features of each piece of standard text information in a trusted data set, and determines the standard text information corresponding to the corrected text, so that named entity errors remaining in the corrected text are further corrected through the structured features and the accuracy of text information recognition is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a text information processing method according to an embodiment of the present invention;
fig. 2 is a flowchart of a text information processing method according to a second embodiment of the present invention;
fig. 3 is a schematic structural diagram of a text information processing system according to a third embodiment of the present invention;
fig. 4 is a schematic structural diagram of a text information processing system according to a fourth embodiment of the present invention;
fig. 5 is a schematic structural diagram of a text information processing apparatus according to a fifth embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first", "second", etc. referred to in the present invention are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. In the description of the following examples, "plurality" means two or more unless specifically limited otherwise.
In order to make the technical solution of the present invention clearer, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
The method and the device can be particularly applied to further correcting the recognition result of text information with structural characteristics on a paper document. The paper document can be a bill such as an invoice, a certificate such as a business license, or another paper document containing text information. The text information with structured features can be text information whose structure is defined by regulation or by common usage, such as a company name on an invoice or business license, a company address on a business license, and so forth.
For example, in the case of company names, the structured features of company names are determined according to the company name administration measures issued by the State Administration for Market Regulation, and all company names have such structured features. A company name usually includes parts such as an administrative region, a trade name (word size), an industry and an organizational form, each part serving as a feature item, and the trade name is usually unique and can uniquely identify a company. For a company name, the arrangement order of the characters is unique and identical company names are not allowed to exist; in many financial settings a company name is not allowed to be wrong by even a single character.
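As an illustrative sketch only, such a structured feature can be thought of as a mapping from feature items to name segments; the feature item names below (place, brand, industry, org_form) and the made-up example name are assumptions used for illustration, not definitions taken from this disclosure:

```python
# Illustrative only: the feature item names and the decomposition are assumed,
# and the company name is a made-up example.
company_name = "北京某某信息技术有限公司"

structured_feature = {
    "place": "北京",          # administrative region
    "brand": "某某",          # trade name ("word size"), usually unique to a company
    "industry": "信息技术",    # industry descriptor
    "org_form": "有限公司",    # organizational form
}
```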
For example, addresses typically include provinces, cities, districts/counties, towns, and so forth.
For text information with such structural characteristics, even an existing OCR recognition model with a good recognition effect inevitably produces glyph or word-shape errors.
The method of the invention can take the recognition result of the paper text as the text to be processed and perform error correction processing on it through the error correction model to obtain at least one corrected text of the text to be processed, so as to correct font errors in the text to be processed; it then extracts the structural features of the corrected text through the named entity recognition model, matches the structural features of the corrected text with the structural features of each piece of standard text information in the trusted data set, and determines the standard text information corresponding to the corrected text, so as to further correct named entity recognition errors in the corrected text and thereby improve the accuracy of text information recognition.
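To make the overall flow easier to follow, the sketch below strings the three stages together; the object and method names (error_correction_model.correct, ner_model.extract_features, trusted_set.match, m.score) are hypothetical placeholders, not an API defined by this disclosure.

```python
def process_text(ocr_text, error_correction_model, ner_model, trusted_set):
    """Hypothetical end-to-end sketch of the described flow (all names are assumed)."""
    # 1. Primary error correction of the OCR output (fixes font / word-shape errors).
    corrected_candidates = error_correction_model.correct(ocr_text)  # one or more candidates

    matches = []
    for corrected in corrected_candidates:
        # 2. Extract the structured features of the corrected text.
        features = ner_model.extract_features(corrected)
        # 3. Match against each piece of standard text information in the trusted data set.
        matches.extend(trusted_set.match(features))

    # The best-matching standard text information is the final recognition result.
    return sorted(matches, key=lambda m: m.score, reverse=True)
```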
The text information processing method of the present invention is described in detail below taking a company name as an example of the text to be processed, that is, taking the company name as the type of the text to be processed.
Fig. 1 is a flowchart of a text information processing method according to an embodiment of the present invention, and as shown in fig. 1, the text information processing method according to the embodiment includes the following steps:
step 101, performing error correction processing on the text to be processed through an error correction model to obtain at least one corrected text of the text to be processed, wherein the error correction model is obtained through training of an error correction training set corresponding to the type of the text to be processed.
The text to be processed may be specifically a recognition result of text information from a paper document by an OCR technology or the like.
The text to be processed may be text information with structured features such as a company name, an address, or an account, that is, the type of the text to be processed may be the company name, the address, or the account. In addition, the type of the text to be processed may also be other types, and may be specifically set and adjusted by a technician according to an actual application scenario and needs, which is not specifically limited in this embodiment.
In this embodiment, the error correction model is a deep learning model trained in advance, and is used to correct font errors, morphological errors, and the like in the text to be processed, so as to implement error correction processing on the text to be processed.
For example, for different types of texts to be processed, corresponding error correction models can be trained in advance. Specifically, an error correction training set corresponding to each type of the text to be processed can be obtained, and the error correction model corresponding to the type can be trained by adopting the corresponding error correction training set, so that the error correction model can be trained specifically for common errors of the texts to be processed of different types, and the error correction effect of the error correction model on the text to be processed is better.
Optionally, a unified error correction model may also be trained for two or more types, and the unified error correction model may be used to perform error correction processing on the two or more types of texts to be processed.
Illustratively, in this embodiment, the error correction training set includes a plurality of pieces of error correction training data, and each piece of error correction training data includes: a piece of text information containing at least one error and its corresponding standard text information.
And 102, extracting the structural features of the corrected text through a named entity recognition model, wherein the named entity recognition model is obtained through training of a structural feature training set corresponding to the type of the text to be processed.
For example, the named entity recognition model can be an LSTM+CRF model, which is a named entity recognition method based on word-frequency features and a probabilistic graph and is applied to perform named entity recognition on phrases that contain no wrong characters and have normal semantics.
Illustratively, the structured feature training set may be a pre-acquired training set corresponding to the type of the text to be processed, and the structured feature training set includes a plurality of pieces of structured feature training data, and each piece of structured feature training data includes: standard text information and its structured features.
And 103, matching the structural features of the corrected text with the structural features of each piece of standard text information in the trusted data set, and determining the standard text information corresponding to the corrected text.
The trusted data set is a data set which is established in advance and corresponds to the type of the text to be processed; different types of texts to be processed correspond to different trusted data sets. The trusted data set includes: standard text information and its structured features.
For example, for a company name, the corresponding trusted data set includes the correct company name, as well as the structured features corresponding to each company name.
In this embodiment, for the corrected text obtained by the error correction processing, the structured features of the corrected text are further extracted and matched with the structured features of each piece of standard text information in the trusted data set, the matching degree between the corrected text and each piece of standard text information is determined, the standard text information is sorted according to the matching degree with the corrected text, and the one or more pieces of standard text information ranked first are determined as the standard text information corresponding to the corrected text; that is, the one or more pieces of standard text information whose structured features match the corrected text to the highest degree are determined and taken as the target text obtained after processing the text to be processed, i.e. the final recognition result.
The embodiment of the invention performs error correction processing on the text to be processed by using an error correction model trained in advance on an error correction training set corresponding to the type of the text to be processed, obtaining at least one corrected text of the text to be processed and thereby correcting font errors and the like in the text to be processed; it then extracts the structured features of the corrected text with a named entity recognition model trained in advance on a structured feature training set corresponding to the type of the text to be processed, matches the structured features of the corrected text against the structured features of each piece of standard text information in the trusted data set, and determines the standard text information corresponding to the corrected text, so that named entity errors remaining in the corrected text are further corrected through the structured features and the accuracy of text information recognition is improved.
Fig. 2 is a flowchart of a text information processing method according to a second embodiment of the present invention, where on the basis of the first embodiment, before performing error correction processing on a text to be processed by using an error correction model to obtain at least one corrected text of the text to be processed, the method further includes: acquiring an error correction training set corresponding to the type of the text to be processed, wherein the error correction training set comprises a plurality of pieces of error correction training data; and carrying out deep learning model training through an error correction training set to obtain an error correction model. Before extracting the structural features of the corrected text through the named entity recognition model, the method further comprises the following steps: acquiring a structural feature training set corresponding to the type of the text to be processed, wherein the structural feature training set comprises a plurality of pieces of structural feature extraction training data; and carrying out model training on the initial named entity recognition model through a structural feature training set to obtain the named entity recognition model.
As shown in fig. 2, the text information processing method in this embodiment includes the following steps:
step 201, an error correction training set corresponding to the type of the text to be processed is obtained, where the error correction training set includes a plurality of error correction training data.
The text to be processed may be specifically a recognition result of text information from a paper document by an OCR technology or the like.
The text to be processed may be text information with structured features such as a company name, an address, or an account, that is, the type of the text to be processed may be the company name, the address, or the account. In addition, the type of the text to be processed may also be other types, and may be specifically set and adjusted by a technician according to an actual application scenario and needs, which is not specifically limited in this embodiment.
In this embodiment, the step may be specifically implemented as follows:
acquiring a plurality of pieces of standard text information corresponding to the type of the text to be processed; and constructing an error text corresponding to each piece of standard text information, wherein each error text and its corresponding standard text information form one piece of error correction training data.
Optionally, to construct the error text corresponding to a piece of standard text information, statistics may be collected on the error types that occur in practice for texts of the type to be processed, together with the frequency and probability of each error type, and the standard text is then deformed with at least one error type according to the statistical results to generate the corresponding error text. The error conditions at least include missing characters, extra characters, and replacement of characters or words with similar-looking characters or words.
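A minimal sketch of such error-text construction is given below; the similar-character table, the filler characters and the uniform sampling are illustrative assumptions, whereas in practice the deformation types and their probabilities would follow the collected statistics.

```python
import random

SIMILAR_CHARS = {"日": ["曰"], "未": ["末"], "B": ["8"], "O": ["0"]}  # assumed confusion pairs

def corrupt(standard_text: str) -> str:
    """Apply one randomly chosen error type to a standard text (illustrative only)."""
    chars = list(standard_text)
    i = random.randrange(len(chars))
    error_type = random.choice(["missing", "extra", "similar"])
    if error_type == "missing":                        # missing character
        del chars[i]
    elif error_type == "extra":                        # extra character
        chars.insert(i, random.choice("口木日月"))      # assumed filler characters
    else:                                              # similar-character substitution
        chars[i] = random.choice(SIMILAR_CHARS.get(chars[i], [chars[i]]))
    return "".join(chars)

def make_training_pair(standard_text: str):
    """One piece of error correction training data: (error text, standard text)."""
    return corrupt(standard_text), standard_text
```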
Optionally, the plurality of pieces of standard text information corresponding to the type of the text to be processed may be acquired by collecting a large amount (on the order of tens of millions) of text corpora through crawler technology, or by other methods of acquiring big data, which is not specifically limited in this embodiment. For example, for company names, as many valid officially registered company names as possible can be acquired; about 22 million valid company names can currently be acquired.
And 202, carrying out deep learning model training through an error correction training set to obtain an error correction model.
In this embodiment, the error correction model may be a Transformer model, which is an effective entity recognition method applied in the translation field of Natural Language Processing (NLP); it can perform cross validation by using bag-of-words masking to form features and finally recognize entities effectively.
After the error correction training set is obtained, the Transformer model is trained with the error correction training set in this step, so as to obtain the error correction model used for error correction processing of the text to be processed, and font errors, morphological errors and the like in the text to be processed are corrected through the error correction model.
The sequence generation process of the Transformer model comprises the following steps: encoding the input text into word vectors; the Transformer encoder encodes the word vectors into semantic vectors; and the Transformer decoder, together with an attention mechanism, decodes the semantic vectors to obtain the generated result.
In the application scenario of recognizing structured text information such as company names, the difference between the OCR recognition result and the correct result is generally small, whereas the input and output of a conventional Transformer model used for neural network translation differ greatly. In order to make the error correction effect of the Transformer model better when it is applied to the present application scenario, in the embodiment of the present invention, before the decoder decodes the word vectors, the decoder of the Transformer model randomly deletes one or more dimensions of the word vector to further destroy it, and then recovers each dimension of the word vector, so as to improve the correctness and integrity of the word vector and thereby further improve the error correction effect of the error correction model. In the embodiment of the invention, for characters that OCR frequently mis-recognizes, the improved Transformer model treats the erroneous characters as masks to perform correction, so as to learn the correspondence that improves correctness.
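One simple way to realize the random dimension deletion described above is sketched below, assuming the word vectors are held in a NumPy array and that deletion is realized by zeroing dimensions; the deletion probability is an assumed value.

```python
import numpy as np

def randomly_delete_dimensions(word_vectors: np.ndarray, delete_prob: float = 0.1,
                               rng: np.random.Generator | None = None) -> np.ndarray:
    """Zero out ("delete") random dimensions of each word vector before decoding.

    Sketch of the corruption step only; the decoder is then expected to recover
    the deleted dimensions during training. delete_prob = 0.1 is an assumption.
    """
    rng = rng or np.random.default_rng()
    keep_mask = rng.random(word_vectors.shape) >= delete_prob
    return word_vectors * keep_mask
```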
Illustratively, the encoder part of the model uses a 6-layer Transformer as the encoder, and the decoder part uses three 2-layer Transformers, corresponding respectively to the three operations of deletion, insertion and replacement.
In particular, the decoder of the Transformer model may comprise three classifiers, each classifier comprising one 2-layer Transformer. These three classifiers are used to implement the following functions for one dimension of the word vector: deleting a character (namely replacing the character with a blank character), inserting a blank character, and replacing a blank character with another character.
Illustratively, a quintuple (Y, A, ε, R, y_0) may be used to describe the decoder of the Transformer model, where ε is a medium, like a black box, whose input is a behavior and whose return is the new sequence; A is the behavior set, representing all possible behaviors; Y is the set of all possible sequences of length up to N_max; R is a feedback (reward) function measuring the distance between the generated sequence and the true sequence; and y_0 is the initial sequence, which may be empty.
For example, the input sequence may be an n-dimensional word vector, denoted y_k = y_{1:n}, and the output sequence to be generated may be denoted y_{k+1} = ε(y_k, a_{k+1}), where k is zero or a positive integer used to distinguish different sequences, and a_{k+1} denotes the behavior acting on y_k.
In particular, the behavior set may include a deletion behavior and an insertion behavior. The deletion behavior determines, for each character of the input sequence, whether to delete the character through a deletion policy. The deletion behavior may be described as follows: for each character y_i ∈ y, where the index i denotes the position of the character in the sequence, the deletion policy is expressed as π_del(d | i, y), where d | i denotes the behavior of deleting the character y_i, and the deletion policy makes a binary decision on whether to delete the character y_i. Illustratively, the deletion policy may be implemented with a binary classification model; for example, π_del(d | i, y) may represent the probability of deleting the character y_i in the sequence y, and the character y_i may be deleted when that probability meets a preset condition.
The insertion behavior determines, for each slot of the input sequence, whether to insert a placeholder in the slot through an insertion policy, and determines the character generated at the inserted placeholder through a generation policy. The insertion behavior may be described as follows: for every slot (y_i, y_{i+1}) of the sequence y, where y_i ∈ y and the index i denotes the position of the character in the sequence, the insertion policy gives the probability of inserting a placeholder in the slot (y_i, y_{i+1}) and is expressed as π_plh(p | i, y), where p | i denotes the behavior of inserting a placeholder in the slot (y_i, y_{i+1}); the generation policy gives the policy of generating another character at the generated placeholder and is expressed as π_tok(t | i, y), which gives the probability of generating another character at the placeholder inserted in the slot (y_i, y_{i+1}), where t | i denotes the behavior of generating another character at that placeholder.
Illustratively, the insertion policy may be implemented with a binary classification model, and the generation policy may be implemented with a multi-class classification model: for one placeholder, the generation policy gives the probabilities of inserting the candidate characters into the placeholder, and the character with the highest probability is selected and inserted into the placeholder.
In summary, the overall behavior of the decoder of the Transformer model on the input sequence may be expressed as a = {d_0, …, d_n; p_0, …, p_{n-1}; t_0, …, t_{n-1}}, where the subscripts 0 to n denote the positions of characters in the sequence, y_i denotes the character at the i-th position in the sequence, d_0, …, d_n denote the behaviors of deleting the characters at the corresponding positions in the sequence, p_0, …, p_{n-1} denote the behaviors of inserting placeholders in the corresponding slots of the sequence (for example, p_i denotes the behavior of inserting a placeholder in the slot (y_i, y_{i+1})), and t_0, …, t_{n-1} denote the behaviors of generating the corresponding characters at the placeholders (for example, t_i denotes the behavior of generating another character in the placeholder that p_i inserted into the slot (y_i, y_{i+1})). The overall policy may be expressed as π(a | y) = ∏ π_del(d_i | i, y) · ∏ π_plh(p_i | i, y′) · ∏ π_tok(t_i | i, y″), where y′ = ε(y, d) denotes the sequence obtained after the deletion policy performs the deletion operations on the sequence y, and y″ = ε(y′, p) denotes the sequence obtained after the insertion policy performs the insertion operations on y′.
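For illustration, one refinement pass of such a decoder can be sketched as follows; the three policy functions stand in for the three two-layer Transformer classifiers, and the 0.5 decision thresholds are assumptions.

```python
PLACEHOLDER = "[PLH]"

def refine_once(seq, delete_policy, insert_policy, generate_policy):
    """One pass y -> y' -> y'' of the delete / insert-placeholder / generate behaviors.

    delete_policy(i, seq)   -> probability of deleting the character seq[i]
    insert_policy(i, seq)   -> probability of inserting a placeholder in slot (i, i+1)
    generate_policy(i, seq) -> most probable character for the placeholder at position i
    The policies are hypothetical stand-ins for the decoder's three classifiers.
    """
    # Deletion behavior: y' = ε(y, d)
    y1 = [c for i, c in enumerate(seq) if delete_policy(i, seq) < 0.5]

    # Insertion behavior: y'' = ε(y', p), inserting placeholders into chosen slots
    y2 = []
    for i, c in enumerate(y1):
        y2.append(c)
        if insert_policy(i, y1) >= 0.5:
            y2.append(PLACEHOLDER)

    # Generation behavior: fill each placeholder with the most probable character
    return [generate_policy(i, y2) if c == PLACEHOLDER else c for i, c in enumerate(y2)]
```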
In the embodiment of the present invention, the overall processing operation of the decoder of the Transformer model on the input sequence may include deletion, insertion and generation: the word vector to be decoded is further damaged by the deletion operation, and a new word vector is then recovered and generated by the insertion and generation operations, so as to improve the correctness and integrity of the word vector and further improve the error correction effect of the error correction model.
In this step, the three classifiers of the decoder of the Transformer model are also trained during the training of the Transformer model, so that the error correction capability of the trained error correction model can be improved.
And 203, performing error correction processing on the text to be processed through the error correction model to obtain at least one corrected text of the text to be processed.
After the error correction model corresponding to the type of the text to be processed is obtained, the text to be processed is subjected to error correction processing through the error correction model, so that at least one corrected text of the text to be processed is obtained, and the process is a primary error correction process of the text to be processed.
In this embodiment, the corrected text is further corrected through the subsequent steps 205 to 207 to obtain the final recognition result.
And 204, acquiring a structural feature training set corresponding to the type of the text to be processed, wherein the structural feature training set comprises a plurality of pieces of structural feature extraction training data.
In this embodiment, the step may be specifically implemented as follows:
acquiring a plurality of pieces of standard text information corresponding to the type of the text to be processed; performing word segmentation processing on each piece of standard text information to obtain word segmentation results; and converting the word segmentation result of each piece of standard text information into corresponding structural features according to a structural feature rule corresponding to the type of the text to be processed, wherein each piece of standard text information and its corresponding structural features form one piece of structural feature extraction training data.
For example, the word segmentation processing on each piece of standard text information may be performed with a Chinese word segmentation tool based on a hidden Markov model, which requires no training, can be used out of the box and, compared with other word segmentation tools, has the most mature interface and the richest functionality.
In this embodiment, the structured feature corresponding to the type of the text to be processed may include a plurality of feature items. For example, a company name typically includes parts such as an administrative region, a trade name, an industry and an organizational form, each serving as a feature item.
In this step, after the word segmentation result of the standard text information is obtained, the feature items corresponding to the respective words in the word segmentation result may be analyzed and determined according to the structural feature rule corresponding to the type of the text to be processed, so that the respective words in the word segmentation result of the standard text information are mapped to the respective feature items of the structural feature, and the structural feature corresponding to the standard text information is obtained.
Optionally, when the word segmentation result is mapped to the feature items of the structural features, each segmented word can be classified through a pre-trained classification model. For example, for the feature item "industry" of a company name, a classification model trained to identify whether an input word describes an industry can be used, and the classification result is obtained by processing the input word with this model.
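A compact sketch of turning a standard company name into one piece of structural feature extraction training data is shown below; jieba is named only as one possible word segmentation tool, and the rule lists used by classify_segment are illustrative assumptions standing in for the structural feature rule or the pre-trained classification models.

```python
import jieba  # one possible Chinese word segmentation tool, named only as an example

# Assumed, illustrative rule lists; a real system would use the structural
# feature rule of the text type or pre-trained classification models instead.
KNOWN_PLACES = ("北京", "上海", "深圳")
ORG_FORMS = ("有限公司", "股份有限公司", "集团")
INDUSTRY_WORDS = ("信息技术", "科技", "贸易")

def classify_segment(segment: str) -> str:
    if segment in KNOWN_PLACES:
        return "place"
    if segment in ORG_FORMS:
        return "org_form"
    if segment in INDUSTRY_WORDS:
        return "industry"
    return "brand"          # treat the remaining segment as the trade name

def build_structured_feature_sample(standard_text: str):
    segments = jieba.cut(standard_text)                 # word segmentation result
    features = {}
    for seg in segments:
        item = classify_segment(seg)                    # map segment -> feature item
        features[item] = features.get(item, "") + seg
    # One piece of structural feature extraction training data.
    return standard_text, features
```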
For example, for company names, as many valid officially registered company names as possible can be obtained; about 22 million valid company names can currently be obtained, and in the subsequent steps an LSTM+CRF model is used to form the named entity distribution probabilities and the feature model of company names.
In addition, for named entities that are invalid, training data can be constructed by manual labeling to train and improve the accuracy of the model.
And 205, performing model training on the initial named entity recognition model through a structured feature training set to obtain the named entity recognition model.
After the structural feature training set is obtained, model training is carried out on the initial named entity recognition model through the structural feature training set, and the named entity recognition model is obtained.
The initial named entity recognition model can be an LSTM+CRF model. A Long Short-Term Memory network (LSTM) is a variant of the recurrent neural network; the advantage of the LSTM model is that training can automatically extract the characteristics of the data, especially of time-series data, without manual feature construction.
The named entity recognition process of the LSTM+CRF model comprises: encoding the input text into word vectors, encoding the word vectors into semantic vectors through the LSTM model, and decoding the semantic vectors through the CRF decoder to obtain the named entity recognition result.
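For illustration only, the model's output can be viewed as a per-character tag sequence; the sketch below converts an assumed BIO-style tag sequence into the structured features used later for matching. The tag scheme and the example tags are assumptions, not something fixed by this disclosure.

```python
def tags_to_features(chars, tags):
    """Collect characters into feature items from BIO-style tags such as "B-place"."""
    features = {}
    for ch, tag in zip(chars, tags):
        if tag == "O":
            continue
        item = tag.split("-", 1)[1]      # "B-place" -> "place"
        features[item] = features.get(item, "") + ch
    return features

chars = list("北京某某信息技术有限公司")          # made-up example name
tags = ["B-place", "I-place", "B-brand", "I-brand",
        "B-industry", "I-industry", "I-industry", "I-industry",
        "B-org_form", "I-org_form", "I-org_form", "I-org_form"]
print(tags_to_features(chars, tags))
# {'place': '北京', 'brand': '某某', 'industry': '信息技术', 'org_form': '有限公司'}
```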
In this embodiment, in order to further improve the accuracy with which the trained named entity recognition model extracts structured features, the network layer used to encode the input text into word vectors is also trained during the training of the named entity recognition model, and the length of the neurons is adjusted in the training process, so that this network layer of the trained named entity recognition model is better suited to short texts, the word vectors it generates by encoding better match the characteristics of the structured features, and the structured features extracted by the trained named entity recognition model are more accurate.
And step 206, extracting the structural characteristics of the corrected text through the named entity recognition model.
In this embodiment, after obtaining at least one corrected text of the text to be processed, the structural feature of the corrected text is extracted through the named entity recognition model, and then the structural feature of the corrected text may be matched with the structural feature of each standard text information in the trusted data set through step 207, so as to determine the standard text information corresponding to the corrected text.
And step 207, matching the structural features of the corrected text with the structural features of each piece of standard text information in the trusted data set, and determining the standard text information corresponding to the corrected text.
The trusted data set can be stored in a database to facilitate reading and storing of data.
Before the step, a trusted data set is generated in advance, and the method specifically comprises the following steps: acquiring standard text information corresponding to the type of the text to be processed; and extracting the structural features corresponding to each standard text message through the named entity recognition model to obtain a credible data set.
Optionally, the structural feature training set corresponding to the type of the text to be processed, which is obtained in step 204, includes multiple pieces of structural feature extraction training data, and each piece of structural feature extraction training data includes standard text information and a structural feature corresponding to the standard text information.
In this embodiment, the structured feature corresponding to the type of the text to be processed may include a plurality of feature items. The step can be realized by the following method:
determining the matching degree between each feature item of the corrected text and the corresponding feature item of each piece of standard text information in the trusted data set; and determining at least one piece of standard text information corresponding to the corrected text according to the matching degree between each feature item of the corrected text and the corresponding feature item of each piece of standard text information in the trusted data set.
For example, the matching degree between each feature item of the corrected text and the corresponding feature item of each piece of standard text information in the trusted data set may be determined by using a minimum edit distance algorithm, or by using any existing method for calculating the similarity between two short texts, which is not described here again.
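A minimal sketch of per-feature-item matching with the minimum edit distance algorithm is given below; normalizing the distance by the longer string's length into a similarity in [0, 1] is an assumed choice.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming minimum edit distance (Levenshtein distance)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,            # delete ca
                                     dp[j - 1] + 1,        # insert cb
                                     prev + (ca != cb))    # substitute
    return dp[len(b)]

def feature_similarity(a: str, b: str) -> float:
    """Similarity of two feature item values, normalized to [0, 1] (assumed scheme)."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```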
Illustratively, determining at least one piece of standard text information corresponding to the corrected text according to the matching degree between each feature item of the corrected text and the corresponding feature item of each piece of standard text information in the trusted data set may be implemented as follows:
determining the overall matching degree between the corrected text and each piece of standard text information in the trusted data set according to the matching degree between each feature item of the corrected text and the corresponding feature item of each piece of standard text information in the trusted data set; and determining at least one piece of standard text information corresponding to the corrected text according to the overall matching degree between the corrected text and each piece of standard text information in the trusted data set.
Further, according to the matching degree between each feature item of the corrected text and the corresponding feature item of each standard text message in the trusted data set, the overall matching degree between the corrected text and each standard text message in the trusted data set is determined, and specifically, the overall matching degree can be calculated by giving corresponding weights to each feature item, and calculating by means of weighted summation, weighted average and the like.
Optionally, some feature items may be set as hard matching items and the other feature items as soft matching items. When at least one piece of standard text information corresponding to the corrected text is determined according to the matching degree between each feature item of the corrected text and the corresponding feature item of each piece of standard text information in the trusted data set, the standard text information in the trusted data set is first screened according to whether its hard matching items match completely, and only the standard text information whose hard matching items completely match the corrected text is retained; then the similarity between the soft matching items of the retained standard text information and those of the corrected text is calculated, the overall matching degree between the retained standard text information and the corrected text is computed comprehensively, and at least one piece of standard text information corresponding to the corrected text is determined according to this overall matching degree.
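The hard/soft matching and weighted aggregation described above can be sketched as follows, reusing the feature_similarity helper from the previous sketch; which items are hard matching items, the soft item weights, and the assumption that each trusted data set entry is a dict with a "features" field are all illustrative choices.

```python
HARD_ITEMS = ("org_form",)                                    # assumed hard matching items
SOFT_WEIGHTS = {"place": 0.2, "brand": 0.5, "industry": 0.3}  # assumed weights

def overall_match_degree(corrected: dict, standard: dict) -> float:
    """Weighted overall matching degree; 0.0 if any hard matching item differs."""
    for item in HARD_ITEMS:
        if corrected.get(item) != standard.get(item):
            return 0.0                                        # hard item screening
    # feature_similarity: the per-item similarity sketched after the edit-distance step
    return sum(weight * feature_similarity(corrected.get(item, ""), standard.get(item, ""))
               for item, weight in SOFT_WEIGHTS.items())

def best_matches(corrected: dict, trusted_set, top_n: int = 5):
    """Rank the trusted data set entries by overall matching degree (top_n is assumed)."""
    scored = [(overall_match_degree(corrected, entry["features"]), entry)
              for entry in trusted_set]                       # entry layout is assumed
    scored = [s for s in scored if s[0] > 0.0]
    return sorted(scored, key=lambda s: s[0], reverse=True)[:top_n]
```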
Further, according to the overall matching degree between the corrected text and each standard text information in the trusted data set, at least one standard text information corresponding to the corrected text is determined, which can be specifically implemented in any one of the following manners:
a first possible implementation: and sequencing the standard text information according to the matching degree with the corrected text, and determining one or more standard text information which is/are sequenced most in front as the standard text information corresponding to the corrected text, namely determining one or more standard text information which is/are matched with the structured features of the corrected text at the highest degree, and taking the standard text information as the target text after the text to be processed is processed to obtain the final recognition result.
A second possible implementation: if exactly one piece of standard text information completely matches each feature item of the corrected text, that standard text information is determined as the standard text information corresponding to the corrected text. If no standard text information completely matches each feature item of the corrected text, the standard text information is sorted according to the matching degree with the corrected text as in the first possible implementation, and the one or more pieces of standard text information ranked first are determined as the standard text information corresponding to the corrected text.
The method of the embodiment of the invention effectively solves the problem of OCR recognition errors caused by characters being cut off at the head or tail of the text; if a named entity contains errors, the correct result can be obtained by inference over the trusted data set, and if the degree of error in the text is too large, a plurality of named entities with high similarity can be recommended.
In addition, in order to evaluate the error correction capability of the present invention for text information, a top-1 accuracy and a top-n hit accuracy may be used as two evaluation indexes, where n is a positive integer equal to the number of pieces of standard text information determined for the text to be processed; n may be 5, for example. The top-1 accuracy refers to the percentage of cases in which the first piece of standard text information determined by the invention is correct, and the top-n hit accuracy refers to the percentage of cases in which the n pieces of standard text information determined by the invention include the correct name.
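A minimal sketch of computing these two evaluation indexes over a labeled test set is given below; the data layout (a ranked candidate list per sample) is an assumption.

```python
def evaluate(ranked_candidates, ground_truth, n: int = 5):
    """Return (top-1 accuracy, top-n hit accuracy).

    ranked_candidates: for each sample, the standard text information ranked best-first.
    ground_truth: the correct standard text information for each sample.
    """
    total = len(ground_truth)
    top1 = sum(preds[0] == truth
               for preds, truth in zip(ranked_candidates, ground_truth) if preds)
    topn = sum(truth in preds[:n]
               for preds, truth in zip(ranked_candidates, ground_truth))
    return top1 / total, topn / total
```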
This embodiment performs error correction processing on the text to be processed by using an error correction model trained in advance on an error correction training set corresponding to the type of the text to be processed, obtaining at least one corrected text of the text to be processed and thereby correcting font errors and the like in the text to be processed; it then extracts the structured features of the corrected text with a named entity recognition model trained in advance on a structured feature training set corresponding to the type of the text to be processed, matches the structured features of the corrected text against the structured features of each piece of standard text information in the trusted data set, and determines the standard text information corresponding to the corrected text, so that named entity errors remaining in the corrected text are further corrected through the structured features and the accuracy of text information recognition is improved. It has been verified that after the OCR recognition result is corrected by the text information processing method of this embodiment, the recognition accuracy is effectively improved.
Fig. 3 is a schematic structural diagram of a text information processing system according to a third embodiment of the present invention, and as shown in fig. 3, the text information processing system in this embodiment includes: a first error correction module 301, a structured feature extraction module 302, and a second error correction module 303.
Specifically, the first error correction module 301 is configured to perform error correction processing on a text to be processed through an error correction model to obtain at least one corrected text of the text to be processed, where the error correction model is obtained by training on an error correction training set corresponding to the type of the text to be processed.
The structural feature extraction module 302 is configured to extract structural features of the corrected text through a named entity recognition model, where the named entity recognition model is obtained through training of a structural feature training set corresponding to the type of the text to be processed.
The second error correction module 303 is configured to match the structural features of the corrected text with the structural features of each standard text information in the trusted data set, and determine the standard text information corresponding to the corrected text.
The above functional modules are respectively used for completing a corresponding operation function of the method embodiment of the present invention, and similar functional effects are also achieved, and detailed descriptions are omitted.
Fig. 4 is a schematic structural diagram of a text information processing system according to a fourth embodiment of the present invention, and on the basis of the third embodiment, in this embodiment, the text information processing system further includes an error correction model training module 304.
The error correction model training module 304 is configured to: acquiring an error correction training set corresponding to the type of the text to be processed, wherein the error correction training set comprises a plurality of pieces of error correction training data; and carrying out deep learning model training through an error correction training set to obtain an error correction model.
Optionally, the error correction model training module 304 is further configured to: acquiring a plurality of standard text messages corresponding to the types of the texts to be processed; and constructing an error text corresponding to each standard text message, wherein each error text and the corresponding standard text message form an error correction training data.
Optionally, the textual information processing system further includes a named entity recognition model training module 305. The named entity recognition model training module 305 is configured to: acquiring a structural feature training set corresponding to the type of the text to be processed, wherein the structural feature training set comprises a plurality of pieces of structural feature extraction training data; and carrying out model training on the initial named entity recognition model through a structural feature training set to obtain the named entity recognition model.
Optionally, the named entity recognition model training module 305 is further configured to: acquiring a plurality of standard text messages corresponding to the types of the texts to be processed; performing word segmentation processing on each standard text message to obtain word segmentation results; and converting the word segmentation result of each standard text message into a corresponding structural feature according to a structural feature rule corresponding to the type of the text to be processed, wherein each standard text message and the corresponding structural feature form a piece of structural feature extraction training data.
Optionally, the text information processing system further comprises a trusted data set acquisition module 306. The trusted data set acquisition module 306 is configured to: acquiring standard text information corresponding to the type of the text to be processed; and extracting the structural features corresponding to each standard text message through the named entity recognition model to obtain a credible data set.
Optionally, the structured feature corresponding to the type of the text to be processed includes a plurality of feature items, and the second error correction module 303 is further configured to: determine the matching degree between each feature item of the corrected text and the corresponding feature item of each piece of standard text information in the trusted data set; and determine at least one piece of standard text information corresponding to the corrected text according to the matching degree between each feature item of the corrected text and the corresponding feature item of each piece of standard text information in the trusted data set.
Optionally, the second error correction module 303 is further configured to: determine the overall matching degree between the corrected text and each piece of standard text information in the trusted data set according to the matching degree between each feature item of the corrected text and the corresponding feature item of each piece of standard text information in the trusted data set; and determine at least one piece of standard text information corresponding to the corrected text according to the overall matching degree between the corrected text and each piece of standard text information in the trusted data set.
Optionally, the error correction model is a Transformer model.
Optionally, the initial named entity recognition model is an LSTM+CRF model.
The above functional modules are respectively used for completing the corresponding operation functions of the method embodiment of the present invention, and similar functional effects are also achieved, and detailed descriptions are omitted.
Fig. 5 is a schematic structural diagram of a text information processing apparatus according to a fifth embodiment of the present invention. As shown in fig. 5, the apparatus 50 includes: a processor 501, a memory 502, and computer programs stored on the memory 502 and executable on the processor 501.
When the processor 501 runs the computer program, the text information processing method provided by any one of the above method embodiments is implemented.
The embodiment of the invention performs error correction processing on the text to be processed by using an error correction model trained in advance on an error correction training set corresponding to the type of the text to be processed, obtaining at least one corrected text of the text to be processed and thereby correcting font errors and the like in the text to be processed; it then extracts the structured features of the corrected text with a named entity recognition model trained in advance on a structured feature training set corresponding to the type of the text to be processed, matches the structured features of the corrected text against the structured features of each piece of standard text information in the trusted data set, and determines the standard text information corresponding to the corrected text, so that named entity errors remaining in the corrected text are further corrected through the structured features and the accuracy of text information recognition is improved.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes: ROM/RAM, magnetic disks, optical disks, etc., and the computer-readable storage medium stores a computer program that can be executed by a hardware device such as a terminal device, a computer, or a server to execute the text information processing method provided by any of the above embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A text information processing method, comprising:
carrying out error correction processing on a text to be processed through an error correction model to obtain at least one corrected text of the text to be processed, wherein the error correction model is obtained by training on an error correction training set corresponding to the type of the text to be processed;
extracting the structural features of the corrected text through a named entity recognition model, wherein the named entity recognition model is obtained by training on a structural feature training set corresponding to the type of the text to be processed;
and matching the structural features of the corrected text with the structural features of each piece of standard text information in a trusted data set, and determining the standard text information corresponding to the corrected text.
2. The method according to claim 1, wherein before the error correction processing is performed on the text to be processed by the error correction model to obtain at least one corrected text of the text to be processed, the method further comprises:
acquiring an error correction training set corresponding to the type of the text to be processed, wherein the error correction training set comprises a plurality of pieces of error correction training data;
and carrying out deep learning model training through the error correction training set to obtain the error correction model.
3. The method of claim 2, wherein the obtaining an error correction training set corresponding to the type of the text to be processed, the error correction training set comprising a plurality of pieces of error correction training data, comprises:
acquiring a plurality of standard text messages corresponding to the type of the text to be processed;
and constructing an error text corresponding to each standard text message, wherein each error text and the corresponding standard text message form one piece of error correction training data.
4. The method of claim 1, wherein before extracting the structural features of the corrected text through the named entity recognition model, the method further comprises:
acquiring a structural feature training set corresponding to the type of the text to be processed, wherein the structural feature training set comprises a plurality of pieces of structural feature extraction training data;
and carrying out model training on the initial named entity recognition model through the structural feature training set to obtain the named entity recognition model.
5. The method according to claim 4, wherein the obtaining a structural feature training set corresponding to the type of the text to be processed, the structural feature training set comprising a plurality of pieces of structural feature extraction training data, comprises:
acquiring a plurality of standard text messages corresponding to the type of the text to be processed;
performing word segmentation processing on each standard text message to obtain word segmentation results;
and converting the word segmentation result of each piece of standard text information into corresponding structural features according to a structural feature rule corresponding to the type of the text to be processed, wherein each piece of standard text information and the corresponding structural feature form one piece of structural feature extraction training data.
6. The method of claim 1, wherein before matching the structural features of the corrected text with the structural features of each standard text message in the credible data set and determining the standard text message corresponding to the corrected text, the method further comprises:
acquiring standard text information corresponding to the type of the text to be processed;
and extracting the structural features corresponding to each standard text message through the named entity recognition model to obtain the credible data set.
7. The method of claim 1, wherein the structural features corresponding to the type of the text to be processed comprise a plurality of feature items,
and the matching the structural features of the corrected text with the structural features of each standard text message in the credible data set to determine the standard text message corresponding to the corrected text comprises:
determining the matching degree between each feature item of the corrected text and the corresponding feature item of each standard text message in the credible data set;
and determining at least one standard text message corresponding to the corrected text according to the matching degree between each feature item of the corrected text and the corresponding feature item of each standard text message in the credible data set.
8. The method of claim 7, wherein determining at least one standard text message corresponding to the corrected text according to the matching degree between each feature item of the corrected text and the corresponding feature item of each standard text message in the credible data set comprises:
determining the overall matching degree of the corrected text and each standard text message in the credible data set according to the matching degree between each feature item of the corrected text and the corresponding feature item of each standard text message in the credible data set;
and determining at least one standard text message corresponding to the corrected text according to the overall matching degree of the corrected text and each standard text message in the credible data set.
9. A text information processing system, comprising:
the first error correction module is used for carrying out error correction processing on a text to be processed through an error correction model to obtain at least one corrected text of the text to be processed, and the error correction model is obtained by training with an error correction training set corresponding to the type of the text to be processed;
the structural feature extraction module is used for extracting the structural features of the corrected text through a named entity recognition model, and the named entity recognition model is obtained by training with a structural feature training set corresponding to the type of the text to be processed;
and the second error correction module is used for matching the structural features of the corrected text with the structural features of each standard text message in the credible data set and determining the standard text message corresponding to the corrected text.
10. A text information processing apparatus characterized by comprising:
a processor, a memory, and a computer program stored on the memory and executable on the processor;
wherein the processor, when executing the computer program, implements the method of any of claims 1 to 8.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which can be executed to perform the method according to any one of claims 1 to 8.
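Claims 3 and 5 describe how the two training sets are built from standard text information of the given type. The following Python sketch illustrates one possible realization, assuming a confusable-character table for constructing error texts and the jieba package for word segmentation; the table, the dependency, and the feature rules shown are illustrative assumptions, not requirements of the claims.

```python
# Hypothetical construction of the two training sets (cf. claims 3 and 5).
# The confusion table, the jieba dependency, and the feature rules are
# assumptions for illustration; the patent does not fix any of them.
import random
import jieba  # pip install jieba

CONFUSABLE = {"己": ["已", "巳"], "日": ["曰"], "0": ["O", "o"]}  # toy table

def make_error_text(standard: str, noise_rate: float = 0.1) -> str:
    """Construct an error text by swapping characters for confusable ones."""
    chars = []
    for ch in standard:
        if ch in CONFUSABLE and random.random() < noise_rate:
            chars.append(random.choice(CONFUSABLE[ch]))
        else:
            chars.append(ch)
    return "".join(chars)

def build_error_correction_set(standard_texts):
    """One (error text, standard text) pair per standard text (claim 3)."""
    return [(make_error_text(t), t) for t in standard_texts]

def to_structural_features(standard: str, rules):
    """Segment the text and map each word to a feature item by rule (claim 5)."""
    features = {}
    for word in jieba.lcut(standard):
        for item, predicate in rules.items():
            if predicate(word):
                features.setdefault(item, []).append(word)
    return features

def build_structural_feature_set(standard_texts, rules):
    """One (standard text, structural features) pair per text (claim 5)."""
    return [(t, to_structural_features(t, rules)) for t in standard_texts]
```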
CN202010248972.8A 2020-04-01 2020-04-01 Text information processing method, system, equipment and computer readable storage medium Active CN111460827B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010248972.8A CN111460827B (en) 2020-04-01 2020-04-01 Text information processing method, system, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010248972.8A CN111460827B (en) 2020-04-01 2020-04-01 Text information processing method, system, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111460827A true CN111460827A (en) 2020-07-28
CN111460827B CN111460827B (en) 2020-12-15

Family

ID=71685808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010248972.8A Active CN111460827B (en) 2020-04-01 2020-04-01 Text information processing method, system, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111460827B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162767A (en) * 2018-02-12 2019-08-23 北京京东尚科信息技术有限公司 The method and apparatus of text error correction
CN109299462A (en) * 2018-09-20 2019-02-01 武汉理工大学 Short text similarity calculating method based on multidimensional convolution feature
CN109657251A (en) * 2018-12-17 2019-04-19 北京百度网讯科技有限公司 Method and apparatus for translating sentence
CN110134950A (en) * 2019-04-28 2019-08-16 北京百分点信息科技有限公司 A kind of text auto-collation that words combines
CN110275936A (en) * 2019-05-09 2019-09-24 浙江工业大学 A kind of similar law case retrieving method based on from coding neural network
CN110516256A (en) * 2019-08-30 2019-11-29 的卢技术有限公司 A kind of Chinese name entity extraction method and its system
CN110674647A (en) * 2019-09-27 2020-01-10 电子科技大学 Layer fusion method based on Transformer model and computer equipment
CN110782881A (en) * 2019-10-25 2020-02-11 四川长虹电器股份有限公司 Video entity error correction method after speech recognition and entity recognition
CN110826335A (en) * 2019-11-14 2020-02-21 北京明略软件系统有限公司 Named entity identification method and device
CN110909535A (en) * 2019-12-06 2020-03-24 北京百分点信息科技有限公司 Named entity checking method and device, readable storage medium and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Hyejin Cho & Hyunju Lee: "Biomedical named entity recognition using deep neural networks with contextual information", BMC Bioinformatics *
Li Lingfang et al.: "Named entity recognition for Chinese electronic medical records based on BERT", Journal of Inner Mongolia University of Science and Technology *
Lin Huaiyi et al.: "Imbalanced text emotion classification based on word vector pre-training", Journal of Chinese Information Processing *
Wang Haining et al.: "Named entity recognition in the ethnic handicrafts domain by fusing deep learning and rules", Journal of Yunnan Normal University (Natural Sciences Edition) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112053597A (en) * 2020-10-13 2020-12-08 北京灵伴即时智能科技有限公司 Artificial seat training and checking method and system
CN112560855A (en) * 2020-12-18 2021-03-26 平安银行股份有限公司 Image information extraction method and device, electronic equipment and storage medium
CN112560855B (en) * 2020-12-18 2022-10-14 平安银行股份有限公司 Image information extraction method and device, electronic equipment and storage medium
CN112651240A (en) * 2020-12-30 2021-04-13 广东电力信息科技有限公司 Service conference information processing system, method, electronic device and storage medium
CN113095067A (en) * 2021-03-03 2021-07-09 北京邮电大学 OCR error correction method, device, electronic equipment and storage medium
CN113221879A (en) * 2021-04-30 2021-08-06 北京爱咔咔信息技术有限公司 Text recognition and model training method, device, equipment and storage medium
CN114049528A (en) * 2022-01-12 2022-02-15 上海蜜度信息技术有限公司 Method and equipment for identifying brand name
CN114049528B (en) * 2022-01-12 2022-06-28 上海蜜度信息技术有限公司 Brand name identification method and equipment
CN116306599A (en) * 2023-05-23 2023-06-23 上海蜜度信息技术有限公司 Faithfulness optimization method, system, equipment and storage medium based on generated text
CN116306599B (en) * 2023-05-23 2023-09-08 上海蜜度信息技术有限公司 Faithfulness optimization method, system, equipment and storage medium based on generated text

Also Published As

Publication number Publication date
CN111460827B (en) 2020-12-15

Similar Documents

Publication Publication Date Title
CN111460827B (en) Text information processing method, system, equipment and computer readable storage medium
Mahmoud et al. KHATT: An open Arabic offline handwritten text database
AU2019219746A1 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
CN113408535B (en) OCR error correction method based on Chinese character level features and language model
Thokchom et al. Recognition of Handwritten Character of Manipuri Script.
CN113935710A (en) Contract auditing method and device, electronic equipment and storage medium
Guillevic Unconstrained handwriting recognition applied to the processing of bank cheques
CN115687621A (en) Short text label labeling method and device
Pal et al. OCR error correction of an inflectional indian language using morphological parsing
Rahman et al. Bn-htrd: A benchmark dataset for document level offline bangla handwritten text recognition (HTR) and line segmentation
CN117076455A (en) Intelligent identification-based policy structured storage method, medium and system
CN112651392A (en) Certificate information acquisition method and device, storage medium and computer equipment
CN112036330A (en) Text recognition method, text recognition device and readable storage medium
Al Ghamdi A novel approach to printed Arabic optical character recognition
Kumar et al. Line based robust script identification for indianlanguages
Pourreza et al. Sub-word based Persian OCR using auto-encoder features and cascade classifier
CA3156204A1 (en) Domain based text extraction
Chaowicharat et al. A step toward an automatic handwritten homework grading system for mathematics
CN111666928A (en) Computer file similarity recognition system and method based on image analysis
CN111461109A (en) Method for identifying documents based on environment multi-type word bank
AbdelRaouf Offline printed Arabic character recognition
Al_Barraq et al. Handwritten Arabic Text Recognition System Using Window Based Moment Invariant Method
US20240020473A1 (en) Domain Based Text Extraction
Slavin et al. Matching Digital Copies of Documents Based on OCR
Sharma et al. Feature Extraction and Image Recognition of Cursive Handwritten English Words Using Neural Network and IAM Off‐Line Database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant