CN111339910A - Text processing method and device and text classification model training method and device - Google Patents

Text processing method and device and text classification model training method and device

Info

Publication number
CN111339910A
Authority
CN
China
Prior art keywords
text
data
ocr
sample
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010111039.6A
Other languages
Chinese (zh)
Other versions
CN111339910B (en
Inventor
李哲
李若愚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Labs Singapore Pte Ltd
Original Assignee
Alipay Labs Singapore Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Labs Singapore Pte Ltd
Priority to CN202010111039.6A
Publication of CN111339910A
Application granted
Publication of CN111339910B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The embodiments of this specification provide a text processing method and device and a text classification model training method and device. The text processing method comprises: acquiring target OCR text data of a target certificate; for the text content of each text line or text column in the target OCR text data, identifying the data types to which the text content may belong by using a text classification model; and determining the data type to which the text content of each text line or text column in the target OCR text data belongs according to the identified data types and a type determination model. The text classification model is trained on the sample OCR text data set corresponding to each certificate; the sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data, and the data type to which the text content of each text line or text column in the correct and wrong sample OCR text data belongs.

Description

Text processing method and device and text classification model training method and device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for processing a text and training a text classification model.
Background
With the rapid development of computer and internet technologies, Optical Character Recognition (OCR) technology has come into wide use. When a user must be authenticated during a business transaction, the user's certificate is scanned or the user uploads a photo of it; a back-end service then applies OCR to the certificate image and converts it into machine-readable text. Finally, the information required for the current authentication, such as the name and certificate number, must be extracted from the text obtained by OCR recognition.
Therefore, it is necessary to provide a technical solution to reliably extract the required information from the OCR text.
Disclosure of Invention
The embodiment of the specification aims to provide a text processing method and a text classification model training method and device, so that required information can be reliably extracted from an OCR text.
In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:
an embodiment of the present specification provides a text processing method, including:
acquiring target OCR text data of a target certificate;
for the text content of each text line or text column in the target OCR text data, identifying the data type to which the text content may belong by using a text classification model; the text classification model is obtained by training based on a sample OCR text data set corresponding to the certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data and a data type to which the text content of each text line or text column in the correct sample OCR text data and the wrong sample OCR text data belongs;
and determining the data type of the text content of each text line or text column in the target OCR text data according to the data type and the type determination model.
An embodiment of the present specification further provides a method for training a text classification model, including:
determining a format arrangement template corresponding to each certificate based on format arrangement of each certificate;
for each certificate arranged in a format, configuring a sample OCR text data set for a format arrangement template corresponding to the certificate; wherein the sample OCR text data set comprises correct sample OCR text data and incorrect sample OCR text data, and a data type to which text content of each text row or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs;
and training the text classification model based on the sample OCR text data set corresponding to each certificate.
An embodiment of the present specification further provides a text processing apparatus, where the apparatus includes:
the acquisition module acquires target OCR text data of the target certificate;
the recognition module is used for recognizing the data type to which the text content possibly belongs by using a text classification model aiming at the text content of each text line or text column in the target OCR text data; the text classification model is obtained by training based on a sample OCR text data set corresponding to the certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data and a data type to which the text content of each text line or text column in the correct sample OCR text data and the wrong sample OCR text data belongs;
and the first determining module is used for determining the data type of the text content of each text line or text column in the target OCR text data according to the data type and type determining model.
An embodiment of the present specification further provides a device for training a text classification model, where the device includes:
the second determining module is used for determining the layout arrangement template corresponding to each certificate based on the layout arrangement of each certificate;
the configuration module is used for configuring a sample OCR text data set for the layout arrangement template corresponding to each type of certificate; wherein the sample OCR text data set comprises correct sample OCR text data and incorrect sample OCR text data, and a data type to which text content of each text row or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs;
and the training module is used for training the text classification model based on the sample OCR text data set corresponding to each certificate.
An embodiment of the present specification further provides a text processing apparatus, including:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring target OCR text data of a target certificate;
for the text content of each text line or text column in the target OCR text data, identifying the data type to which the text content may belong by using a text classification model; the text classification model is obtained by training based on a sample OCR text data set corresponding to the certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data and a data type to which the text content of each text line or text column in the correct sample OCR text data and the wrong sample OCR text data belongs;
and determining the data type of the text content of each text line or text column in the target OCR text data according to the data type and the type determination model.
An embodiment of the present specification further provides a training device for a text classification model, including:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
determining a format arrangement template corresponding to each certificate based on format arrangement of each certificate;
for each certificate arranged in a format, configuring a sample OCR text data set for a format arrangement template corresponding to the certificate; wherein the sample OCR text data set comprises correct sample OCR text data and incorrect sample OCR text data, and a data type to which text content of each text row or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs;
and training the text classification model based on the sample OCR text data set corresponding to each certificate.
Embodiments of the present specification also provide a storage medium for storing computer-executable instructions, which when executed implement the following processes:
acquiring target OCR text data of a target certificate;
for the text content of each text line or text column in the target OCR text data, identifying the data type to which the text content may belong by using a text classification model; the text classification model is obtained by training based on a sample OCR text data set corresponding to the certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data and a data type to which the text content of each text line or text column in the correct sample OCR text data and the wrong sample OCR text data belongs;
and determining the data type of the text content of each text line or text column in the target OCR text data according to the data type and the type determination model.
Embodiments of the present specification also provide a storage medium for storing computer-executable instructions, which when executed implement the following processes:
determining a format arrangement template corresponding to each certificate based on format arrangement of each certificate;
for each certificate arranged in a format, configuring a sample OCR text data set for a format arrangement template corresponding to the certificate; wherein the sample OCR text data set comprises correct sample OCR text data and incorrect sample OCR text data, and a data type to which text content of each text row or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs;
and training the text classification model based on the sample OCR text data set corresponding to each certificate.
According to the technical solution of this embodiment, the data types to which the text content of each text line or text column in the target OCR text data may belong are identified using a trained text classification model, and the data type to which the text content actually belongs is then determined from those candidates according to the candidate data types and a type determination model. When the text classification model is trained, a sample OCR text data set is configured for each certificate in each format arrangement, so the trained model can recognize certificates arranged in multiple formats, and OCR text data from such certificates can be processed at the same time. Moreover, errors that may occur during OCR recognition are included in training as wrong sample OCR text data, so even erroneous OCR text data can be processed. This improves the applicability of the text classification model, handles the special problems of various OCR scenarios better, improves the accuracy of text type recognition, and allows the text content of the required data type to be extracted accurately.
Drawings
To illustrate the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a first flowchart of a text processing method provided in an embodiment of this specification;
Fig. 2 is a second flowchart of a text processing method provided in an embodiment of this specification;
Fig. 3 is a schematic diagram of a layout template in a text processing method provided in an embodiment of this specification;
Fig. 4(a) is a first schematic diagram of sample OCR text data in a text processing method provided in an embodiment of this specification;
Fig. 4(b) is a second schematic diagram of sample OCR text data in a text processing method provided in an embodiment of this specification;
Fig. 4(c) is a third schematic diagram of sample OCR text data in a text processing method provided in an embodiment of this specification;
Fig. 5 is a third flowchart of a text processing method provided in an embodiment of this specification;
Fig. 6 is a flowchart of a method for training a text classification model provided in an embodiment of this specification;
Fig. 7 is a schematic block diagram of a text processing apparatus provided in an embodiment of this specification;
Fig. 8 is a schematic block diagram of a training apparatus for a text classification model provided in an embodiment of this specification;
Fig. 9 is a schematic structural diagram of a text processing device provided in an embodiment of this specification.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The idea of the embodiment of the present specification is that a text classification model is used to identify a data type to which text content of each text line or text column in OCR text data may belong, and the text classification model is generated based on a sample data set corresponding to certificates arranged in each format, so that certificates arranged in multiple formats can be identified at the same time, and influence caused by different format arrangements is avoided. Based on this, embodiments of the present specification provide a text processing method, apparatus, device and storage medium, which are used for processing OCR text data, and will be discussed in detail below.
In a specific application scenario, the method for processing a text provided in this specification may be applied to an authentication device, that is, an execution subject of the method may be an authentication device, and specifically, may be a text processing apparatus installed on the authentication device. The authentication device may be an authentication client or an authentication server.
Fig. 1 is a flowchart of a method of processing a text according to an embodiment of the present disclosure, where the method shown in fig. 1 at least includes the following steps:
Step 102: acquire target OCR text data of the target certificate.
The target certificate may be an identity card, a passport, a driver's license, or the like.
In a specific implementation mode, when the identity of a user needs to be verified, a certificate image of a target certificate of the user is collected, and the certificate image is identified through an OCR identification module, so that target OCR text data corresponding to the target certificate is obtained.
In addition, in the embodiment of the present specification, when performing OCR recognition on a target certificate, the layout arrangement of the target certificate is not changed, that is, the layout arrangement of the obtained target OCR text data is consistent with the layout arrangement of the target certificate.
For example, if the target certificate is arranged according to rows, the identified target OCR text is also arranged according to rows, and the arrangement sequence and content of each row are consistent with those of the target certificate, and the number of rows remains unchanged; and if the target certificate is arranged according to the columns, the target OCR texts obtained by recognition are also arranged according to the columns, the arrangement sequence and content of each column are consistent with those of the target certificate, and the number of the columns is kept unchanged.
Step 104: for the text content of each text line or text column in the target OCR text data, identify the data types to which the text content may belong by using a text classification model. The text classification model is trained on the sample OCR text data sets corresponding to the certificates arranged in various formats, and each sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data, and the data type to which the text content of each text line or text column in the correct and wrong sample OCR text data belongs.
The data types refer to fields such as "name", "gender", "certificate number", "address", and the like.
The text classification model performs type recognition on the text content of each text line or text column in the target OCR text data. Its output comprises, for each text content, every possible data type and the probability that the text content belongs to that data type. In a specific implementation, for each text content, a set number of data types may be selected from the possible data types output by the model, in descending order of probability, as the data types to which the text content may belong.
For ease of understanding, the following description will be given by way of example.
For example, for text content A of a certain text line in the target OCR text data, one possible recognition result obtained by the text classification model is as follows:
the probability that text content A belongs to "name" is 96%;
the probability that text content A belongs to "gender" is 28%;
the probability that text content A belongs to "ethnicity" is 65%;
the probability that text content A belongs to "certificate number" is 54%;
the probability that text content A belongs to "address" is 38%.
The data types identified for text content A are sorted in descending order of probability, giving: name, ethnicity, certificate number, address, gender. The top 3 data types are then taken from the sorted sequence as the data types to which text content A may belong, namely name, ethnicity, and certificate number.
Of course, the data types, the probability values and the numbers of the intercepted data types in the above examples are only illustrative and do not constitute a limitation to the embodiments of the present specification.
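The selection of candidate data types in the example above can be sketched in a few lines of Python. This is only an illustration: the function name top_k_types and the dictionary representation of the model output are assumptions made here, not part of the described embodiments, and the cut-off of 3 is the illustrative value used in the example.

```python
def top_k_types(probabilities: dict[str, float], k: int = 3) -> list[str]:
    """Return the k data types with the highest probability, highest first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [data_type for data_type, _ in ranked[:k]]


# Probabilities for text content A, taken from the example above.
probs_a = {
    "name": 0.96,
    "gender": 0.28,
    "ethnicity": 0.65,
    "certificate number": 0.54,
    "address": 0.38,
}

candidates_a = top_k_types(probs_a, k=3)
print(candidates_a)  # ['name', 'ethnicity', 'certificate number']
```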
The layout arrangement of the certificate refers to the arrangement of each text content in the certificate on the certificate.
In addition, in the embodiment of the present specification, when generating the text classification model, the sample OCR data is a sample OCR text data set corresponding to each document arranged in each format. Because there may be multiple layouts for the same document, multiple sets of sample OCR text data may appear for the same document. Therefore, the generated text classification model can be suitable for the certificates arranged in various formats, and therefore recognition of various formats of the same certificate can be achieved.
In addition, when characters are recognized using OCR technology, text lines are often missed and similar-looking characters are often misrecognized. For example, an N on a certificate may be recognized as M, or an S as 5. Therefore, in the embodiments of this specification, when the sample OCR text data set corresponding to a certificate is acquired, wrong sample OCR text data may be generated based on the errors that commonly occur in OCR recognition, and the text classification model is trained on the correct and the wrong sample OCR text data together. The trained model can thus better handle missed or misrecognized OCR text, which improves its applicability.
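One way wrong sample OCR text data might be derived from correct samples is sketched below in Python. The specification does not say how such samples are produced, so everything here is an assumption for illustration: the confusion table (only the N to M and S to 5 pairs come from the text), the substitution and line-drop rates, the function names, and the sample values.

```python
import random

# Hypothetical confusion table; only N->M and S->5 are mentioned in the text.
CONFUSIONS = {"N": "M", "S": "5"}


def corrupt_line(text: str, rng: random.Random, rate: float = 0.5) -> str:
    """Replace confusable characters with their look-alikes at the given rate."""
    return "".join(
        CONFUSIONS[ch] if ch in CONFUSIONS and rng.random() < rate else ch
        for ch in text
    )


def make_error_sample(lines, rng, drop_rate=0.2, sub_rate=0.5):
    """Build one wrong sample from a correct one.

    `lines` is a list of (text, data_type) pairs. Labels are kept unchanged,
    so a corrupted line still carries its true data type.
    """
    sample = []
    for text, data_type in lines:
        if rng.random() < drop_rate:
            continue  # simulate a text line missed by OCR
        sample.append((corrupt_line(text, rng, sub_rate), data_type))
    return sample


# Made-up correct sample: (text, data type) per text line.
correct = [("SAM NOLAN", "name"), ("S1234567N", "certificate number")]
wrong = make_error_sample(correct, random.Random(0))
```

Keeping the original label on each corrupted line is the point of the exercise: the classifier sees "5AM MOLAM"-style text still labelled as a name.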
Step 106: determine the data type to which the text content of each text line or text column in the target OCR text data belongs according to each data type and the type determination model.
In the embodiments of this specification, after the data type to which the text content of each text line or text column in the target OCR text data belongs has been determined, the text content of a specified data type is extracted from the target OCR text data according to the data type to which each text content belongs.
For example, in a specific implementation, a certificate number of a user needs to be acquired, and after a data type to which each text content in target OCR text data corresponding to a target certificate belongs is identified, a text line or a text column corresponding to the "certificate number" is found, where the text content corresponding to the text line or the text column is the certificate number.
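The extraction step just described amounts to a lookup by data type once every text line has been assigned one. A minimal Python sketch follows; the function name extract_field, the parallel-list representation, and the sample values are assumptions for illustration only.

```python
from typing import Optional


def extract_field(lines: list[str], line_types: list[str], wanted: str) -> Optional[str]:
    """Return the text of the first line whose determined data type is `wanted`."""
    for text, data_type in zip(lines, line_types):
        if data_type == wanted:
            return text
    return None  # the requested data type was not found in this certificate


# Made-up OCR lines and the data types determined for them in step 106.
ocr_lines = ["SAM NOLAN", "M", "S1234567N"]
determined = ["name", "gender", "certificate number"]

number = extract_field(ocr_lines, determined, "certificate number")
```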
Optionally, in step 106, according to the data types to which each text content may belong and the corresponding probabilities, the data type with the maximum probability for each text content is selected as the data type to which that text content belongs. If two text contents are assigned the same data type in this way, the second-highest probabilities of the two text contents are compared, and the data type corresponding to the higher of these probabilities is taken as the data type to which the corresponding text content belongs.
In a specific implementation manner, in the step 106, determining a data type to which the text content of each text line or text column in the target OCR text data belongs according to each data type and type determination model specifically includes the following processes:
combining the data types to which the text contents possibly belong to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data; inputting a plurality of possible data type combination sequences into a type determination model for processing, and determining one data type combination sequence output by the type determination model as a data type combination sequence corresponding to target OCR text data; and determining the data type of each text content according to the data type combination sequence corresponding to the target OCR text data.
Generally, a certificate contains text contents of various data types; an identity card, for example, contains "name", "gender", "ethnicity", "address", "identification number", and so on. The text content of each text line or text column belongs to a different data type. If the data type of each text content were determined separately, considering only the probability that the text content belongs to each data type, two or more text contents could be assigned the same data type. Therefore, in the embodiments of this specification, the data types to which the text contents in the target OCR text data may belong are combined to obtain the possible data type combination sequences of the target OCR text data. That is, the data types are determined for the target OCR text data as a whole, which avoids assigning the same data type to two or more text contents.
Because the characters in each certificate are arranged according to a certain rule, for example, the first line is the name, the second line is the gender, and the like, the arrangement of each text content in the OCR text data obtained through OCR recognition is consistent with that of the original certificate. In order to facilitate determining the data type to which each text content belongs from the data type combination sequence corresponding to the target OCR text data, in a specific embodiment, each data type in the data type combination sequence may be arranged according to the arrangement order of each text content in the target OCR text data.
For example, the target OCR text data includes the text contents of three text lines, denoted text content 1, text content 2, and text content 3, where text content 1 is on the line before text content 2 and text content 2 is on the line before text content 3. When the data type combination sequences are generated, the possible data type combination sequences corresponding to the target OCR text data may therefore be generated in the order: data type of text content 1, data type of text content 2, data type of text content 3.
To facilitate understanding of the specific process of the above data type combination, the following description will be given by way of example.
For example, in one specific implementation, the target OCR text data includes text content A, text content B, and text content C, arranged in the first, second, and third lines of the target OCR text data respectively. The data types to which text content A, text content B, and text content C may belong, as identified by the text classification model, are as follows:
text content a may belong to "name", "gender", and "ethnicity";
text content B may belong to "gender" and "name";
the text content C may belong to "ethnicity" and "gender".
Combining the data types to which the text content A, the text content B and the text content C respectively possibly belong, wherein the obtained possible data type combination sequence corresponding to the target OCR text data is as follows:
Sequence 1: name, gender, ethnicity
Sequence 2: name, name, ethnicity
Sequence 3: name, gender, gender
Sequence 4: name, name, gender
Sequence 5: gender, gender, ethnicity
Sequence 6: gender, gender, gender
Sequence 7: gender, name, ethnicity
Sequence 8: gender, name, gender
Sequence 9: ethnicity, gender, ethnicity
Sequence 10: ethnicity, gender, gender
Sequence 11: ethnicity, name, ethnicity
Sequence 12: ethnicity, name, gender
The resulting data type combination sequences are then input into a pre-trained type determination model, which selects one of them as the data type combination sequence corresponding to the target OCR text data.
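The combination step in this example is a Cartesian product of the per-line candidate lists, which can be sketched in Python as follows. The type determination model itself is a trained model whose internals are not described here, so the choose_sequence function below is only a hypothetical stand-in that looks the sequences up in an assumed set of known layout orders; it is not the model of the embodiments. The enumeration order produced by itertools.product need not match the numbering of the sequences listed above.

```python
from itertools import product

# Candidate data types per text line, in line order (from the example above).
candidates = [
    ["name", "gender", "ethnicity"],  # text content A (line 1)
    ["gender", "name"],               # text content B (line 2)
    ["ethnicity", "gender"],          # text content C (line 3)
]

# Every possible data type combination sequence: 3 * 2 * 2 = 12 in total.
sequences = list(product(*candidates))


def choose_sequence(sequences, known_layouts):
    """Hypothetical stand-in for the type determination model.

    Returns the first candidate sequence that matches a known layout order,
    or None when no candidate matches.
    """
    for seq in sequences:
        if seq in known_layouts:
            return seq
    return None


# Assumed layout order for illustration: name, then gender, then ethnicity.
known_layouts = {("name", "gender", "ethnicity")}
best = choose_sequence(sequences, known_layouts)
print(len(sequences), best)
```

Because the chosen sequence is ordered like the text lines, position i of the result is the data type of line i.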
Fig. 2 is a second flowchart of a text processing method provided in an embodiment of the present disclosure, where the method shown in fig. 2 at least includes the following steps:
Step 202: acquire target OCR text data of the target certificate.
Step 204: for the text content of each text line or text column in the target OCR text data, identify the data types to which the text content may belong by using a text classification model.
Step 206: combine the data types to which the text contents may belong to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data.
Step 208: determine the data type combination sequence corresponding to the target OCR text data from the plurality of possible data type combination sequences by using the trained type determination model.
Step 210: determine the data type of each text content according to the data type combination sequence corresponding to the target OCR text data.
In specific implementation, before the method provided by the embodiment of the present specification is executed, the text classification model needs to be trained, and therefore, before the step 102 is executed to acquire the target OCR text data of the target certificate, the method provided by the embodiment of the present specification further includes the following steps:
determining a layout arrangement template corresponding to each certificate based on the layout arrangement of each certificate; for the certificate of each layout arrangement, configuring a sample OCR text data set for the layout arrangement template corresponding to the certificate; and training a text classification model based on the sample OCR text data set corresponding to each certificate; wherein the sample OCR text data set comprises correct sample OCR text data, error sample OCR text data, and the data type to which the text content of each text line or text column in the correct sample OCR text data and the error sample OCR text data belongs.
The layout arrangement template represents the text line in which each type of text content in the certificate is located. A schematic of the layout arrangement template is shown in fig. 3.
In specific implementation, if a certain certificate has multiple layout arrangements, one layout arrangement template is determined for each layout arrangement of the certificate. That is, in the embodiment of the present specification, one layout arrangement corresponds to one template. For example, for the certificate a, if there are three types of layout arrangements in the certificate a, they are respectively marked as layout arrangement 1, layout arrangement 2, and layout arrangement 3, and when configuring the sample OCR text data set, the sample OCR text data set is configured for the template corresponding to the layout arrangement 1, the sample OCR text data set is configured for the template corresponding to the layout arrangement 2, and the sample OCR text data set is configured for the template corresponding to the layout arrangement 3.
In this specification, a name and address database is pre-established, and when configuring a sample OCR text data set for each layout template, information such as a name and an address may be selected from the name and address database to configure the sample OCR text data set.
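As a rough illustration of how correct sample OCR text data might be configured for a layout arrangement template, the sketch below fills a template from small in-memory lists. Everything here is assumed for illustration: the template contents, the stand-in `NAME_DB` and `ADDRESS_DB` lists, and the random certificate number generator are hypothetical and are not described in the specification.

```python
import random

# Hypothetical layout arrangement template: the data type found on each text line.
TEMPLATE_1 = ["name", "certificate number", "address"]

# Hypothetical stand-ins for the pre-established name and address database.
NAME_DB = ["Zhang San", "Li Si", "Wang Wu"]
ADDRESS_DB = ["1 Example Road", "22 Sample Street"]

def random_certificate_number(rng, length=12):
    """Illustrative only; real certificate numbers follow document-specific rules."""
    return "".join(rng.choice("0123456789") for _ in range(length))

def make_correct_sample(template, rng):
    """Build one correct sample as a list of (text content, data type) pairs, one per line."""
    fillers = {
        "name": lambda: rng.choice(NAME_DB),
        "address": lambda: rng.choice(ADDRESS_DB),
        "certificate number": lambda: random_certificate_number(rng),
    }
    return [(fillers[data_type](), data_type) for data_type in template]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
correct_samples = [make_correct_sample(TEMPLATE_1, rng) for _ in range(5)]
```

Keeping the data type next to each line means the labels needed for training are produced together with the sample text.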
In addition, it should be noted that, in the embodiment of the present specification, when configuring the sample OCR text data set for the layout template, it is necessary to configure the correct sample OCR text data for the layout template, and also to configure the incorrect sample OCR text data based on an error that may occur frequently in the OCR recognition process.
In an embodiment of the present specification, the above-mentioned erroneous sample OCR text data includes at least one or more of the following sample data:
sample data obtained by deleting the text content of at least one text line or text column in the OCR text data of the correct sample;
sample data resulting from replacing characters in the correct sample OCR text data with similar characters.
In specific implementation, for correct sample data corresponding to each layout arrangement template, a set number of correct sample OCR text data can be selected based on the probability of missing lines during OCR recognition to generate incorrect sample OCR text data.
For ease of understanding, the following description will be given by way of example.
For example, in a specific embodiment, 1000 correct sample OCR text data are configured for a certain layout arrangement template. If the probability of missing lines in the OCR recognition process is 5%, one or more lines of text content are deleted from each of 50 of the correct sample OCR text data. In the remaining correct sample OCR text data, characters that are often misrecognized are then replaced with similar characters. For example, if a character M appears in a correct sample OCR text data, the character M may be replaced by a character N to obtain an error sample OCR text data; if a character S appears in a correct sample OCR text data, the character S may be replaced by the number 5 to obtain an error sample OCR text data; and if a number 8 appears in a correct sample OCR text data, 8 may be replaced by 9 to obtain an error sample OCR text data.
A schematic diagram of correct sample OCR text data is shown in fig. 4(a), a schematic diagram of error sample OCR text data obtained after missing line simulation is shown in fig. 4(b), and a schematic diagram of error sample OCR text data obtained after similar character replacement (o is replaced by 0, S is replaced by 5, and 9 is replaced by 8) is shown in fig. 4 (c).
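The two error simulations described above (missing lines and similar character replacement) can be sketched as follows. This is a minimal illustration under stated assumptions: samples are represented as lists of (text, data type) pairs, only one line is deleted per selected sample although the text allows more, and the `SIMILAR` table holds only the character pairs mentioned in the examples. All function names are hypothetical.

```python
import random

# Confusable character pairs taken from the examples in the text; a real table
# would be built from the errors observed for the OCR engine actually in use.
SIMILAR = {"M": "N", "S": "5", "8": "9", "o": "0"}

def drop_line(sample, rng):
    """Simulate a missing line: delete one randomly chosen text line."""
    if len(sample) <= 1:
        return list(sample)
    idx = rng.randrange(len(sample))
    return sample[:idx] + sample[idx + 1:]

def replace_similar(sample, table=SIMILAR):
    """Replace each confusable character with its look-alike; labels are unchanged."""
    trans = str.maketrans(table)
    return [(text.translate(trans), data_type) for text, data_type in sample]

def make_error_samples(correct, miss_rate, rng):
    """Delete a line in a miss_rate share of the samples, replace characters in the rest."""
    n_missing = round(len(correct) * miss_rate)
    chosen = set(rng.sample(range(len(correct)), n_missing))
    return [
        drop_line(sample, rng) if i in chosen else replace_similar(sample)
        for i, sample in enumerate(correct)
    ]

base = [("Sam Moss", "name"), ("880123", "certificate number"), ("8 Stone Road", "address")]
correct = [list(base) for _ in range(1000)]
errors = make_error_samples(correct, miss_rate=0.05, rng=random.Random(0))
```

With 1000 correct samples and a 5% missing-line rate, 50 samples lose a line and the other 950 receive character replacements, matching the arithmetic of the example. Because each line keeps its data type label, the error samples can be labelled without extra work.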
The data types of the text content of each line in the correct sample OCR text data and the error sample OCR text data are then labelled, so as to obtain the sample OCR text data set corresponding to each layout arrangement template, and the text classification model is trained based on the sample OCR text data set corresponding to each layout arrangement template.
It should be noted that, in the embodiment of the present specification, the text classification model adopted is a bidirectional long short-term memory (BiLSTM) neural network text classification model. Other existing text classification models may also be used: any model capable of text classification can be applied to the embodiment of the present specification, and the embodiment of the present specification does not limit the specific form of the text classification model.
In addition, in the embodiment of the present specification, the type determination model is a Markov (Markov) probabilistic model. The training process is as follows:
After the sample OCR text data set corresponding to each layout arrangement template is obtained, a data type combination sequence corresponding to each sample OCR text data is generated based on the data type to which the text content of each text line or text column belongs in each sample OCR text data (including the correct sample OCR text data and the error sample OCR text data) in the sample OCR text data set.
For example, with respect to the OCR text data shown in fig. 4(a), the data type combination sequence corresponding to it is: name, certificate number, year and month of birth, address, and issue date.
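The specification does not detail the form of the Markov probability model. One plausible reading, sketched below purely as an assumption, is a first-order Markov chain over data types: transition counts are collected from the training sequences, and each candidate sequence is scored by its smoothed log-probability. The `<start>` and `<end>` markers, the add-alpha smoothing, and all function names are illustrative choices, and the toy training data reuses the three data types of the earlier example rather than real certificate layouts.

```python
import math
from collections import Counter, defaultdict
from itertools import product

START, END = "<start>", "<end>"

def train_markov(sequences):
    """Count transitions between consecutive data types in the training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        path = [START, *seq, END]
        for prev, cur in zip(path, path[1:]):
            counts[prev][cur] += 1
    return counts

def log_prob(counts, seq, vocab_size, alpha=1.0):
    """Log-probability of a sequence under the chain, with add-alpha smoothing."""
    path = [START, *seq, END]
    total = 0.0
    for prev, cur in zip(path, path[1:]):
        row = counts.get(prev, Counter())
        total += math.log((row[cur] + alpha) / (sum(row.values()) + alpha * vocab_size))
    return total

def best_sequence(counts, candidates, vocab_size):
    """Pick the candidate data type combination sequence with the highest score."""
    return max(candidates, key=lambda seq: log_prob(counts, seq, vocab_size))

# Toy training sequences: mostly the full layout, plus two missing-line variants.
training = (
    [("name", "gender", "ethnicity")] * 8
    + [("name", "ethnicity")]
    + [("gender", "ethnicity")]
)
counts = train_markov(training)

candidates = list(product(
    ["name", "gender", "ethnicity"], ["gender", "name"], ["ethnicity", "gender"]
))
# vocab_size counts the three data types plus the END marker.
chosen = best_sequence(counts, candidates, vocab_size=4)
print(chosen)
```

Because the training sequences include those of the error samples, transitions that skip a line still receive probability mass, which is one way such a model could tolerate missing lines.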
In order to facilitate understanding of the methods provided by the embodiments of the present disclosure, the following description will be provided with reference to specific application scenarios. In a specific application scenario, when the user a is authenticated, the certificate number of the user a needs to be extracted from the certificate of the user a. Based on the application scenario, fig. 5 shows a third method flowchart of the text processing method provided in the embodiment of the present specification, and the method shown in fig. 5 at least includes the following steps:
Step 502, collecting a certificate image of the user A, and performing OCR recognition on the certificate image to obtain OCR text data of the certificate of the user A.
Step 504, for the text content of each text line in the OCR text data, using a pre-trained BiLSTM classification model to identify the data types corresponding to the text content and the probability that the text content belongs to each data type.
Step 506, for each text content, selecting a set number of data types, in descending order of the probability that the text content belongs to each data type, as the data types to which the text content may belong.
Step 508, combining the data types to which the text contents may belong to obtain a plurality of possible data type combination sequences corresponding to the OCR text data of the user A.
Step 510, determining, by using a pre-trained Markov probability model, the data type combination sequence corresponding to the OCR text data from the plurality of possible data type combination sequences.
Step 512, determining the data type of the text content of each text line according to the data type combination sequence corresponding to the OCR text data.
Step 514, determining the text line corresponding to the certificate number based on the data type of the text content of each text line, and extracting the text content of that text line.
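The candidate selection and the final extraction in the flow above can be sketched as follows. The classifier probabilities and OCR lines are made-up values, the selected sequence is simply assumed rather than computed, and the function names `top_k_types` and `extract_by_type` are hypothetical.

```python
def top_k_types(probabilities, k=2):
    """Keep the k most probable data types for one text line, highest first."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return [data_type for data_type, _ in ranked[:k]]

def extract_by_type(lines, chosen_sequence, wanted_type):
    """Return the text of every line whose determined data type is wanted_type."""
    return [
        text
        for text, data_type in zip(lines, chosen_sequence)
        if data_type == wanted_type
    ]

# Made-up OCR lines and classifier output (one probability dict per line).
lines = ["ZHANG SAN", "1234 5678 9012", "1 Example Road"]
line_probs = [
    {"name": 0.7, "address": 0.2, "certificate number": 0.1},
    {"certificate number": 0.6, "address": 0.3, "name": 0.1},
    {"address": 0.8, "name": 0.15, "certificate number": 0.05},
]

candidates = [top_k_types(p, k=2) for p in line_probs]

# Assume the type determination model selected this combination sequence.
chosen = ("name", "certificate number", "address")
certificate_numbers = extract_by_type(lines, chosen, "certificate number")
print(certificate_numbers)
```

The value of `k` corresponds to the "set number" of data types kept per line; a smaller `k` reduces the number of combination sequences to score, at the risk of dropping the correct type for a badly recognized line.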
In the text processing method provided in the embodiment of the present specification, the data types to which the text content of each text line or text column in the target OCR text data may belong are identified by a trained text classification model, and the data type to which each text content belongs is then determined from those candidate data types according to the data types and the type determination model. In this technical scheme, when the text classification model is trained, a sample OCR text data set is configured for the certificates of each format arrangement, so that the trained text classification model can identify certificates arranged in multiple formats, and OCR text data corresponding to certificates arranged in multiple formats can therefore be processed at the same time during text processing. Moreover, errors that may occur during OCR recognition are included in model training as error sample OCR text data, so that erroneous OCR text data produced by OCR recognition can also be processed. This improves the applicability of the text classification model, allows the particular problems of various OCR scenarios to be handled better, improves the accuracy of text type recognition, and allows the text content of the required data type to be extracted accurately.
The embodiment of the present specification also provides a method for training a text classification model, where the trained text classification model is applied to the embodiments shown in fig. 1 to 5. Fig. 6 is a flowchart of a method for training a text classification model according to an embodiment of the present disclosure, where the method shown in fig. 6 at least includes the following steps:
Step 602, determining a layout arrangement template corresponding to each certificate based on the layout arrangement of each certificate;
Step 604, for the certificate of each layout arrangement, configuring a sample OCR text data set for the layout arrangement template corresponding to the certificate; wherein the sample OCR text data set comprises correct sample OCR text data, error sample OCR text data, and the data type to which the text content of each text line or text column in the correct sample OCR text data and the error sample OCR text data belongs;
Step 606, training a text classification model based on the sample OCR text data sets corresponding to the certificates.
Specifically, in this embodiment of this specification, in step 604, for each certificate arranged in a format, configuring a sample OCR text data set for a format arrangement template corresponding to the certificate, includes the following steps:
configuring a plurality of sample user data for each layout arrangement template according to a pre-established sample user database to obtain a plurality of correct sample OCR text data corresponding to the layout arrangement template; processing the correct sample OCR text data according to a set rule to obtain a plurality of error sample OCR text data corresponding to the layout arrangement template; and combining the plurality of correct sample OCR text data, the plurality of error sample OCR text data, and the data types to which the text content of each text line or text column in the correct sample OCR text data and the error sample OCR text data belongs, to serve as the sample OCR text data set.
Specifically, processing the correct sample OCR text data according to the set rule includes:
deleting the text content of at least one text line or text column in the correct sample OCR text data;
and/or,
replacing characters in the correct sample OCR text data with similar characters.
Processing the correct sample OCR text data according to the set rule thus covers at least the following three implementation modes:
deleting the text content of at least one text line or text column in the correct sample OCR text data;
replacing characters in the correct sample OCR text data with similar characters;
deleting the text content of at least one text line or text column in the correct sample OCR text data, and replacing characters in the correct sample OCR text data with similar characters.
The specific implementation process of each step in the embodiments of this specification may refer to the embodiments shown in fig. 1 to 5, which are not described herein again.
In the training method of the text classification model provided in the embodiment of the present specification, a sample OCR text data set is configured for the certificates of each format arrangement, so that the trained text classification model can identify certificates arranged in multiple formats, and OCR text data corresponding to certificates arranged in multiple formats can therefore be processed at the same time during text processing. Moreover, errors that may occur during OCR recognition are included in model training as error sample OCR text data, so that erroneous OCR text data produced by OCR recognition can also be processed. This improves the applicability of the text classification model, allows the particular problems of various OCR scenarios to be handled better, improves the accuracy of text type recognition, and allows the text content of the required data type to be extracted accurately.
Corresponding to the text processing method provided by the embodiments shown in fig. 1 to fig. 5 of the present specification, and based on the same idea, the embodiments of the present specification further provide a text processing apparatus for executing the text processing method provided by the embodiments shown in fig. 1 to fig. 5. Fig. 7 is a schematic diagram illustrating a module composition of a text processing apparatus according to an embodiment of the present disclosure, where the apparatus shown in fig. 7 at least includes the following modules:
an obtaining module 702, configured to obtain target OCR text data of a target certificate;
a recognition module 704, configured to, for the text content of each text row or text column in the target OCR text data, recognize a data type to which the text content may belong using a text classification model; the text classification model is obtained by training based on a sample OCR text data set corresponding to the certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data and a data type to which the text content of each text line or text column in the correct sample OCR text data and the wrong sample OCR text data belongs;
and a first determining module 706, configured to determine, according to the data types and a type determination model, the data type to which the text content of each text line or text column in the target OCR text data belongs.
Optionally, the first determining module 706 includes:
the combination unit is used for combining the data types to which the text contents possibly belong to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data;
the first determining unit is used for inputting a plurality of possible data type combination sequences into the type determining model for processing, and determining one data type combination sequence output by the type determining model as a data type combination sequence corresponding to the target OCR text data;
and the second determining unit is used for determining the data type of each text content according to the data type combination sequence to which the target OCR text data belongs.
Optionally, the apparatus provided in this specification further includes:
the third determining module is used for determining the layout arrangement template corresponding to the certificate based on the layout arrangement of each certificate;
the configuration module is used for configuring a sample OCR text data set for the layout arrangement template corresponding to the certificate aiming at the certificate arranged in each layout; the sample OCR text data set comprises correct sample OCR text data, error sample OCR text data and a data type to which the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data belongs;
and the training module is used for training a text classification model based on the sample OCR text data set corresponding to each certificate.
Optionally, the OCR text data of the error sample includes at least one or more of the following sample data:
sample data obtained by deleting the text content of at least one text line or text column in the OCR text data of the correct sample;
sample data resulting from replacing characters in the correct sample OCR text data with similar characters.
Optionally, the apparatus provided in this specification further includes:
and the extraction module is used for extracting the text content of the specified data type from the target OCR text data according to the data type corresponding to each text content in the target OCR text data.
Optionally, the text classification model is a bidirectional long short-term memory (BiLSTM) recurrent neural network model;
and the type determination model is a Markov probability model.
It should be noted that the text processing apparatus provided in the embodiment of the present specification and the text processing method provided in the embodiments shown in fig. 1 to fig. 5 of the present specification are based on the same inventive concept; therefore, for the specific implementation of this embodiment, reference may be made to the implementation of the text processing method, and repeated details are not described again.
In the text processing apparatus provided in the embodiment of the present specification, the data types to which the text content of each text line or text column in the target OCR text data may belong are identified by a trained text classification model, and the data type to which each text content belongs is then determined from those candidate data types according to the data types and the type determination model. In this technical scheme, when the text classification model is trained, a sample OCR text data set is configured for the certificates of each format arrangement, so that the trained text classification model can identify certificates arranged in multiple formats, and OCR text data corresponding to certificates arranged in multiple formats can therefore be processed at the same time during text processing. Moreover, errors that may occur during OCR recognition are included in model training as error sample OCR text data, so that erroneous OCR text data produced by OCR recognition can also be processed. This improves the applicability of the text classification model, allows the particular problems of various OCR scenarios to be handled better, improves the accuracy of text type recognition, and allows the text content of the required data type to be extracted accurately.
Based on the same idea, the embodiment of this specification further provides a training apparatus for a text classification model, which is used to execute the method provided by the embodiment shown in fig. 6 of this specification. Fig. 8 is a schematic diagram of the module composition of the training apparatus for a text classification model provided by the embodiment of this specification, and the apparatus shown in fig. 8 at least includes:
a second determining module 802, configured to determine, based on the layout arrangement of each certificate, a layout arrangement template corresponding to the certificate;
the configuration module 804 is used for configuring a sample OCR text data set for the layout arrangement template corresponding to the certificate aiming at the certificate arranged in each layout; the sample OCR text data set comprises correct sample OCR text data, error sample OCR text data and a data type to which the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data belongs;
and the training module 806 is configured to train a text classification model based on the sample OCR text data sets corresponding to the respective certificates.
Optionally, the configuring module 804 includes:
the configuration unit is used for configuring a plurality of sample user data for each format arrangement template according to a pre-established sample user database to obtain a plurality of correct sample OCR text data corresponding to the format arrangement templates;
the processing unit is used for processing the OCR text data of the correct samples according to a set rule to obtain a plurality of OCR text data of the wrong samples corresponding to the format arrangement template;
and the combination unit is used for combining the data types to which the text content of each text row or text column in the plurality of correct sample OCR text data, the plurality of error sample OCR text data, the correct sample OCR text data and the error sample OCR text data belongs to serve as a sample OCR text data set.
Optionally, the processing unit is specifically configured to:
deleting the text content of at least one text line or text column in the correct sample OCR text data;
and/or,
replacing characters in the correct sample OCR text data with similar characters.
It should be noted that the training apparatus for the text classification model provided in the embodiment of the present specification and the training method for the text classification model provided in the embodiment shown in fig. 6 of the present specification are based on the same inventive concept; therefore, for the specific implementation of this embodiment, reference may be made to the implementation of the training method for the text classification model, and repeated details are not described again.
In the training apparatus for the text classification model provided in the embodiment of the present specification, a sample OCR text data set is configured for the certificates of each format arrangement, so that the trained text classification model can identify certificates arranged in multiple formats, and OCR text data corresponding to certificates arranged in multiple formats can therefore be processed at the same time during text processing. Moreover, errors that may occur during OCR recognition are included in model training as error sample OCR text data, so that erroneous OCR text data produced by OCR recognition can also be processed. This improves the applicability of the text classification model, allows the particular problems of various OCR scenarios to be handled better, improves the accuracy of text type recognition, and allows the text content of the required data type to be extracted accurately.
Further, based on the methods shown in fig. 1 to fig. 5, an embodiment of the present specification further provides a text processing apparatus, as shown in fig. 9.
The text processing device may vary considerably depending on its configuration or performance, and may include one or more processors 901 and a memory 902, in which one or more applications or data may be stored. The memory 902 may be transient storage or persistent storage. An application program stored in the memory 902 may include one or more modules (not shown), each of which may include a series of computer-executable instruction information for the text processing device. Further, the processor 901 may be configured to communicate with the memory 902 and to execute, on the text processing device, the series of computer-executable instruction information in the memory 902. The text processing device may also include one or more power supplies 903, one or more wired or wireless network interfaces 904, one or more input-output interfaces 905, one or more keyboards 906, and the like.
In one particular embodiment, an apparatus for processing text includes a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instruction information for the apparatus for processing text, and the one or more programs configured to be executed by the one or more processors include the computer-executable instruction information for:
acquiring target OCR text data of a target certificate;
identifying a data type to which the text content may belong by using a text classification model aiming at the text content of each text line or text column in the target OCR text data; the text classification model is obtained by training based on a sample OCR text data set corresponding to the certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data and a data type to which the text content of each text line or text column in the correct sample OCR text data and the wrong sample OCR text data belongs;
and determining the data type of the text content of each text line or text column in the target OCR text data according to the data types and the type determination models.
Optionally, when executed, the computer-executable instruction information determines, according to the respective data types and type determination models, a data type to which text content of each text line or text column in the target OCR text data belongs, including:
combining the data types to which the text contents possibly belong to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data;
inputting a plurality of possible data type combination sequences into a type determination model for processing, and determining one type combination sequence output by the type determination model as a data type combination sequence to which target OCR text data belongs;
and determining the data type of each text content according to the data type combination sequence of the target OCR text data.
Optionally, before the computer-executable instruction information is executed and target OCR text data of the target certificate is acquired, the following steps may be further executed:
determining a format arrangement template corresponding to each certificate based on the format arrangement of each certificate;
configuring a sample OCR text data set for a format arrangement template corresponding to the certificate aiming at the certificate arranged in each format; the sample OCR text data set comprises correct sample OCR text data, error sample OCR text data and a data type to which the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data belongs;
and training a text classification model based on the sample OCR text data set corresponding to each certificate.
Optionally, when the computer-executable instruction information is executed, the error sample OCR text data includes at least one of the following sample data:
sample data obtained by deleting the text content of at least one text line or text column in the OCR text data of the correct sample;
sample data resulting from replacing characters in the correct sample OCR text data with similar characters.
Optionally, after determining the data type to which the text content of each text line or text column in the target OCR text data belongs according to each data type and the probability corresponding to the data type when the computer-executable instruction information is executed, the following steps may be further performed:
and extracting the text content of the specified data type from the target OCR text data according to the data type corresponding to each text content in the target OCR text data.
Optionally, when the computer-executable instruction information is executed, the text classification model is a bidirectional long short-term memory (BiLSTM) recurrent neural network model;
and the type determination model is a Markov probability model.
In the text processing device provided in the embodiment of the present specification, the data types to which the text content of each text line or text column in the target OCR text data may belong are identified by a trained text classification model, and the data type to which each text content belongs is then determined from those candidate data types according to the data types and the type determination model. In this technical scheme, when the text classification model is trained, a sample OCR text data set is configured for the certificates of each format arrangement, so that the trained text classification model can identify certificates arranged in multiple formats, and OCR text data corresponding to certificates arranged in multiple formats can therefore be processed at the same time during text processing. Moreover, errors that may occur during OCR recognition are included in model training as error sample OCR text data, so that erroneous OCR text data produced by OCR recognition can also be processed. This improves the applicability of the text classification model, allows the particular problems of various OCR scenarios to be handled better, improves the accuracy of text type recognition, and allows the text content of the required data type to be extracted accurately.
Further, based on the methods shown in fig. 1 to fig. 5, an embodiment of the present specification further provides a training device for a text classification model, and a specific structure of the training device may refer to a processing device for a text shown in fig. 9.
In one particular embodiment, a training apparatus for a text classification model includes a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer-executable instruction information for the training apparatus for the text classification model, and the one or more programs configured to be executed by one or more processors include the computer-executable instruction information for:
determining a format arrangement template corresponding to each certificate based on the format arrangement of each certificate;
configuring a sample OCR text data set for a format arrangement template corresponding to the certificate aiming at the certificate arranged in each format; the sample OCR text data set comprises correct sample OCR text data, error sample OCR text data and a data type to which the text content of each text row or text column in the correct sample OCR text data and the error sample OCR text data belongs;
and training a text classification model based on the sample OCR text data set corresponding to each certificate.
Optionally, when the computer-executable instruction information is executed, configuring, for each certificate format arrangement, a sample OCR text data set for the format arrangement template corresponding to the certificate includes:
configuring a plurality of sample user data for each format arrangement template according to a pre-established sample user database, to obtain a plurality of correct sample OCR text data corresponding to the format arrangement template;
processing the correct sample OCR text data according to a set rule, to obtain a plurality of erroneous sample OCR text data corresponding to the format arrangement template;
combining the plurality of correct sample OCR text data, the plurality of erroneous sample OCR text data, and the data type to which the text content of each text line or text column in the correct and erroneous sample OCR text data belongs, to form the sample OCR text data set.
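The configuration of correct samples described above can be illustrated with a minimal sketch. It is only an assumption about one possible representation: the template structure (an ordered list of data type and format string pairs), the field names, the `Sample` class, and the example user records are all hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical representation: a format arrangement template is an ordered
# list of (data_type, format_string) pairs, one pair per text line.
Template = List[Tuple[str, str]]


@dataclass
class Sample:
    lines: List[str]   # text content of each text line
    labels: List[str]  # data type to which each text line belongs


# Two made-up format arrangements of the same kind of certificate.
TEMPLATE_A: Template = [
    ("title", "IDENTITY CARD"),
    ("name", "Name: {name}"),
    ("id_number", "No. {id_number}"),
]
TEMPLATE_B: Template = [
    ("title", "IDENTITY CARD"),
    ("id_number", "{id_number}"),
    ("name", "{name}"),
]

# Stand-in for the pre-established sample user database.
USER_DB: List[Dict[str, str]] = [
    {"name": "Li Hua", "id_number": "A1234567"},
    {"name": "Tan Mei", "id_number": "B7654321"},
]


def fill_template(template: Template, user: Dict[str, str]) -> Sample:
    """Fill one template with one user record to get a correct sample."""
    lines = [fmt.format(**user) for _, fmt in template]
    labels = [data_type for data_type, _ in template]
    return Sample(lines, labels)


def build_correct_samples(templates: List[Template],
                          users: List[Dict[str, str]]) -> List[Sample]:
    """Configure every user record for every format arrangement template."""
    return [fill_template(t, u) for t in templates for u in users]
```

Because each text line is generated from a known template slot, its data type label comes for free, which is what makes the synthetic samples usable as labeled training data.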
Optionally, when the computer-executable instruction information is executed, processing the correct sample OCR text data according to a set rule includes:
deleting the text content of at least one text line or text column in the correct sample OCR text data;
and/or,
replacing characters in the correct sample OCR text data with similar characters.
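The two set rules above (deleting a line and substituting similar characters) can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: the `SIMILAR` confusion table, the replacement rate, and the choice to drop the label together with the deleted line are all hypothetical.

```python
import random

# Hypothetical table of visually similar characters that OCR may confuse.
SIMILAR = {"0": "O", "O": "0", "1": "l", "l": "1",
           "5": "S", "S": "5", "8": "B", "B": "8"}


def delete_line(lines, labels, rng):
    """Simulate a missed line: drop one text line together with its label."""
    i = rng.randrange(len(lines))
    return lines[:i] + lines[i + 1:], labels[:i] + labels[i + 1:]


def replace_similar(lines, labels, rng, rate=0.3):
    """Simulate misrecognition: swap confusable characters with probability `rate`."""
    noisy = [
        "".join(SIMILAR[c] if c in SIMILAR and rng.random() < rate else c
                for c in line)
        for line in lines
    ]
    # The labels are unchanged: the line still has the same data type.
    return noisy, list(labels)


def make_error_samples(lines, labels, n, seed=0):
    """Derive n erroneous samples from one correct sample."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        rule = rng.choice([delete_line, replace_similar])
        samples.append(rule(lines, labels, rng))
    return samples
```

Keeping the labels aligned with the corrupted lines lets the erroneous samples be mixed directly with the correct ones in the sample OCR text data set.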
The training device for the text classification model provided by the embodiments of this specification configures a sample OCR text data set for each certificate format arrangement, so the trained text classification model can recognize certificates in multiple format arrangements, and OCR text data from certificates in multiple format arrangements can be processed at the same time during text processing. Moreover, errors that may occur during OCR recognition are included in model training as erroneous sample OCR text data, so even erroneous OCR text data produced by OCR recognition can be processed. This broadens the applicability of the text classification model, handles the particular problems of different OCR scenarios better, improves the accuracy of text type recognition, and allows the text content of the required data type to be extracted accurately.
Further, based on the methods shown in fig. 1 to fig. 5, in a specific embodiment, a storage medium, which may be a USB flash drive, an optical disc, a hard disk, or the like, stores computer-executable instruction information that, when executed by a processor, implements the following processes:
acquiring target OCR text data of a target certificate;
for the text content of each text line or text column in the target OCR text data, identifying the data types to which the text content may belong by using a text classification model; the text classification model is trained on sample OCR text data sets corresponding to certificates in multiple format arrangements, and each sample OCR text data set comprises correct sample OCR text data, erroneous sample OCR text data, and the data type to which the text content of each text line or text column in the correct and erroneous sample OCR text data belongs;
and determining the data type to which the text content of each text line or text column in the target OCR text data belongs according to the respective data types and a type determination model.
Optionally, when the computer-executable instruction information stored in the storage medium is executed by the processor, determining the data type to which the text content of each text line or text column in the target OCR text data belongs according to the respective data types and the type determination model includes:
combining the data types to which the text contents possibly belong to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data;
inputting a plurality of possible data type combination sequences into the type determination model for processing, and determining one data type combination sequence output by the type determination model as a data type combination sequence corresponding to the target OCR text data;
and determining the data type of each text content according to the data type combination sequence of the target OCR text data.
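The combination and selection steps above can be sketched as follows. The source states only that the type determination model may be a Markov probability model; scoring each candidate sequence with a first-order chain of start and transition probabilities, the `floor` value for unseen transitions, and all example probabilities are assumptions made for illustration.

```python
import math
from itertools import product


def candidate_sequences(per_line_candidates):
    """Combine the candidate data types of every line into full sequences."""
    return [list(seq) for seq in product(*per_line_candidates)]


def sequence_log_prob(seq, start, trans, floor=1e-6):
    """Log-probability of a data type sequence under a first-order Markov chain.

    `start` maps a data type to its probability of appearing first;
    `trans` maps (previous_type, next_type) to a transition probability.
    Unseen entries fall back to `floor` so the logarithm stays defined.
    """
    log_p = math.log(start.get(seq[0], floor))
    for prev, nxt in zip(seq, seq[1:]):
        log_p += math.log(trans.get((prev, nxt), floor))
    return log_p


def best_sequence(per_line_candidates, start, trans):
    """Pick the candidate sequence that the Markov chain scores highest."""
    return max(candidate_sequences(per_line_candidates),
               key=lambda seq: sequence_log_prob(seq, start, trans))


# Made-up example: the classifier is unsure about lines 2 and 3.
candidates = [["title"], ["name", "address"], ["id_number", "name"]]
start = {"title": 0.9, "name": 0.1}
trans = {
    ("title", "name"): 0.8,
    ("title", "address"): 0.1,
    ("name", "id_number"): 0.7,
    ("address", "id_number"): 0.2,
    ("name", "name"): 0.01,
}
print(best_sequence(candidates, start, trans))
```

Enumerating every combination grows exponentially with the number of lines; for certificates with many lines, a dynamic-programming search such as the Viterbi algorithm would find the same highest-scoring sequence without full enumeration.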
Optionally, when the computer-executable instruction information stored in the storage medium is executed by the processor, the following steps may further be performed before the target OCR text data of the target certificate is acquired:
determining a format arrangement template corresponding to each certificate based on the format arrangement of each certificate;
for each certificate format arrangement, configuring a sample OCR text data set for the format arrangement template corresponding to the certificate; the sample OCR text data set comprises correct sample OCR text data, erroneous sample OCR text data, and the data type to which the text content of each text line or text column in the correct and erroneous sample OCR text data belongs;
and training a text classification model based on the sample OCR text data set corresponding to each certificate.
Optionally, when the computer-executable instruction information stored in the storage medium is executed by the processor, the erroneous sample OCR text data includes one or more of the following sample data:
sample data obtained by deleting the text content of at least one text line or text column in the OCR text data of the correct sample;
sample data resulting from replacing characters in the correct sample OCR text data with similar characters.
Optionally, after determining the data type to which the text content of each text line or text column in the target OCR text data belongs according to each data type and the probability corresponding to the data type when the computer-executable instruction information stored in the storage medium is executed by the processor, the following steps may also be performed:
and extracting the text content of the specified data type from the target OCR text data according to the data type corresponding to each text content in the target OCR text data.
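Once every text line has a determined data type, the extraction step reduces to a filter over the lines. The sketch below assumes the lines and their data types are held in two parallel lists; the function name and the example values are hypothetical.

```python
def extract_by_type(lines, data_types, wanted):
    """Collect the text content of every line whose data type is in `wanted`."""
    extracted = {}
    for text, data_type in zip(lines, data_types):
        if data_type in wanted:
            # A data type may span several lines (e.g. a multi-line address).
            extracted.setdefault(data_type, []).append(text)
    return extracted


lines = ["IDENTITY CARD", "Li Hua", "A1234567"]
data_types = ["title", "name", "id_number"]
print(extract_by_type(lines, data_types, {"name", "id_number"}))
```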
Optionally, when the computer-executable instruction information stored in the storage medium is executed by the processor, the text classification model is a bidirectional long short-term memory (BiLSTM) recurrent neural network model;
the type determination model is a Markov probability model.
When executed by a processor, the computer-executable instruction information stored in the storage medium provided by the embodiments of this specification uses a trained text classification model to identify the data types to which the text content of each text line or text column in the target OCR text data may belong, and then uses a type determination model to determine, from those candidate data types, the data type to which the text content actually belongs. In this technical solution, a sample OCR text data set is configured for each certificate format arrangement when the text classification model is trained, so the trained model can recognize certificates in multiple format arrangements, and OCR text data from certificates in multiple format arrangements can be processed at the same time. Moreover, errors that may occur during OCR recognition are included in model training as erroneous sample OCR text data, so even erroneous OCR text data produced by OCR recognition can be processed. This broadens the applicability of the text classification model, handles the particular problems of different OCR scenarios better, improves the accuracy of text type recognition, and allows the text content of the required data type to be extracted accurately.
Further, based on the method shown in fig. 6, in a specific embodiment, a storage medium, which may be a USB flash drive, an optical disc, a hard disk, or the like, stores computer-executable instruction information that, when executed by a processor, implements the following processes:
determining a format arrangement template corresponding to each certificate based on the format arrangement of each certificate;
for each certificate format arrangement, configuring a sample OCR text data set for the format arrangement template corresponding to the certificate; the sample OCR text data set comprises correct sample OCR text data, erroneous sample OCR text data, and the data type to which the text content of each text line or text column in the correct and erroneous sample OCR text data belongs;
and training a text classification model based on the sample OCR text data set corresponding to each certificate.
Optionally, when the computer-executable instruction information stored in the storage medium is executed by the processor, configuring, for each certificate format arrangement, a sample OCR text data set for the format arrangement template corresponding to the certificate includes:
configuring a plurality of sample user data for each format arrangement template according to a pre-established sample user database, to obtain a plurality of correct sample OCR text data corresponding to the format arrangement template;
processing the correct sample OCR text data according to a set rule, to obtain a plurality of erroneous sample OCR text data corresponding to the format arrangement template;
combining the plurality of correct sample OCR text data, the plurality of erroneous sample OCR text data, and the data type to which the text content of each text line or text column in the correct and erroneous sample OCR text data belongs, to form the sample OCR text data set.
Optionally, when the computer-executable instruction information stored in the storage medium is executed by the processor, processing the correct sample OCR text data according to a set rule includes:
deleting the text content of at least one text line or text column in the correct sample OCR text data;
and/or,
replacing characters in the correct sample OCR text data with similar characters.
When the computer-executable instruction information stored in the storage medium provided by the embodiments of this specification is executed by the processor, a sample OCR text data set is configured for each certificate format arrangement, so the trained text classification model can recognize certificates in multiple format arrangements, and OCR text data from certificates in multiple format arrangements can be processed at the same time during text processing. Moreover, errors that may occur during OCR recognition are included in model training as erroneous sample OCR text data, so even erroneous OCR text data produced by OCR recognition can be processed. This broadens the applicability of the text classification model, handles the particular problems of different OCR scenarios better, improves the accuracy of text type recognition, and allows the text content of the required data type to be extracted accurately.
In the 1990s, an improvement to a technology could be clearly classified either as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or as an improvement in software (an improvement to a method flow). As technology has advanced, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (PLD), such as a field programmable gate array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Furthermore, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a hardware description language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can readily be obtained simply by logically programming the method flow in one of the hardware description languages above and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, or the form of logic gates, switches, an application-specific integrated circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for performing various functions may be regarded as structures within the hardware component. Alternatively, the means for performing the functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and the units are described separately. Of course, when the present application is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the specification. It will be understood that each flow and/or block in the flowcharts and/or block diagrams, and any combination of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instruction information. This computer program instruction information may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instruction information, when executed by the processor of the computer or other programmable data processing apparatus, produces means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
This computer program instruction information may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instruction information stored in the computer-readable memory produces an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
This computer program instruction information may also be loaded onto a computer or other programmable data processing apparatus, causing a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instruction information executed on the computer or other programmable apparatus provides steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instruction information, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The application may be described in the general context of computer-executable instruction information, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner; for the parts that are the same or similar among the embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (16)

1. A method of processing text, the method comprising:
acquiring target OCR text data of a target certificate;
for the text content of each text line or text column in the target OCR text data, identifying the data type to which the text content may belong by using a text classification model; the text classification model is obtained by training based on a sample OCR text data set corresponding to the certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data and a data type to which the text content of each text line or text column in the correct sample OCR text data and the wrong sample OCR text data belongs;
and determining the data type of the text content of each text line or text column in the target OCR text data according to the data type and the type determination model.
2. The method of claim 1, wherein determining a data type to which text content of each text line or text column in the target OCR text data belongs according to the respective data type and type determination model comprises:
combining the data types to which the text contents possibly belong to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data;
inputting a plurality of possible data type combination sequences into the type determination model for processing, and determining one data type combination sequence output by the type determination model as a data type combination sequence corresponding to the target OCR text data;
and determining the data type of each text content according to the data type combination sequence corresponding to the target OCR text data.
3. The method of claim 1 or 2, prior to the acquiring target OCR text data for a target document, the method further comprising:
determining a format arrangement template corresponding to each certificate based on format arrangement of each certificate;
for each certificate arranged in a format, configuring a sample OCR text data set for a format arrangement template corresponding to the certificate; wherein the sample OCR text data set comprises correct sample OCR text data and incorrect sample OCR text data, and a data type to which text content of each text row or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs;
and training the text classification model based on the sample OCR text data set corresponding to each certificate.
4. The method of claim 3, the erroneous sample OCR text data comprising at least one or more of the following sample data:
sample data obtained by deleting the text content of at least one text line or text column in the correct sample OCR text data;
sample data resulting from replacing characters in the correct sample OCR text data with similar characters.
5. The method of claim 1, after determining the data type to which the text content of each text row or text column in the target OCR text data belongs according to each of the data types and the probabilities corresponding to the data types, the method further comprises:
and extracting the text content of the specified data type from the target OCR text data according to the data type of each text content in the target OCR text data.
6. The method of claim 2, wherein the text classification model is a bidirectional long short-term memory (BiLSTM) recurrent neural network model;
the type determination model is a Markov probability model.
7. A method of training a text classification model, the method comprising:
determining a format arrangement template corresponding to each certificate based on format arrangement of each certificate;
for each certificate arranged in a format, configuring a sample OCR text data set for a format arrangement template corresponding to the certificate; wherein the sample OCR text data set comprises correct sample OCR text data and incorrect sample OCR text data, and a data type to which text content of each text row or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs;
and training the text classification model based on the sample OCR text data set corresponding to each certificate.
8. The method of claim 7, wherein for each layout certificate, configuring a sample OCR text data set for a layout template corresponding to the certificate comprises:
configuring a plurality of sample user data for each layout arrangement template according to a pre-established sample user database to obtain a plurality of correct sample OCR text data corresponding to the layout arrangement templates;
processing the correct sample OCR text data according to a set rule to obtain a plurality of error sample OCR text data corresponding to the format arrangement template;
combining the plurality of correct sample OCR text data, the plurality of incorrect sample OCR text data, and the data type to which the text content of each text row or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs, as the sample OCR text data set.
9. The method of claim 8, wherein the processing the correct sample OCR text data according to the set rule comprises:
deleting the text content of at least one text line or text column in the correct sample OCR text data;
and/or,
replacing characters in the correct sample OCR text data with similar characters.
10. An apparatus for processing text, the apparatus comprising:
the acquisition module acquires target OCR text data of the target certificate;
the recognition module is used for recognizing the data type to which the text content possibly belongs by using a text classification model aiming at the text content of each text line or text column in the target OCR text data; the text classification model is obtained by training based on a sample OCR text data set corresponding to the certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data and a data type to which the text content of each text line or text column in the correct sample OCR text data and the wrong sample OCR text data belongs;
and the first determining module is used for determining the data type of the text content of each text line or text column in the target OCR text data according to the data type and type determining model.
11. The apparatus of claim 10, the first determining module, comprising:
the combination unit is used for combining the data types to which the text contents possibly belong to obtain a plurality of possible data type combination sequences corresponding to the target OCR text data;
the first determining unit is used for inputting a plurality of possible data type combination sequences into the type determining model for processing, and determining one data type combination sequence output by the type determining model as a data type combination sequence corresponding to the target OCR text data;
and the second determining unit is used for determining the data type of each text content according to the data type combination sequence corresponding to the target OCR text data.
12. An apparatus for training a text classification model, the apparatus comprising:
the second determining module is used for determining the layout arrangement template corresponding to each certificate based on the layout arrangement of each certificate;
the configuration module is used for configuring a sample OCR text data set for the layout arrangement template corresponding to each type of certificate; wherein the sample OCR text data set comprises correct sample OCR text data and incorrect sample OCR text data, and a data type to which text content of each text row or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs;
and the training module is used for training the text classification model based on the sample OCR text data set corresponding to each certificate.
13. A device for processing text, comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
acquiring target OCR text data of a target certificate;
for the text content of each text line or text column in the target OCR text data, identifying the data type to which the text content may belong by using a text classification model; the text classification model is obtained by training based on a sample OCR text data set corresponding to the certificates arranged in various formats, and the sample OCR text data set comprises correct sample OCR text data, wrong sample OCR text data and a data type to which the text content of each text line or text column in the correct sample OCR text data and the wrong sample OCR text data belongs;
and determining the data type of the text content of each text line or text column in the target OCR text data according to the data type and the type determination model.
14. A training apparatus for a text classification model, comprising:
a processor; and
a memory arranged to store computer executable instructions that, when executed, cause the processor to:
determine a format arrangement template corresponding to each certificate based on the format arrangement of each certificate;
for each certificate, configure a sample OCR text data set for the format arrangement template corresponding to the certificate; wherein the sample OCR text data set comprises correct sample OCR text data and incorrect sample OCR text data, and a data type to which the text content of each text line or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs;
and train the text classification model based on the sample OCR text data set corresponding to each certificate.
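The configuration step of claim 14 can be illustrated, under stated assumptions, by the following Python sketch. It assumes that "incorrect" sample OCR text data means correct field text with a simulated character misrecognition; the `CONFUSIONS` table, the `corrupt` and `build_sample_set` helpers, and the field names are all hypothetical. The sketch covers only the construction of the labelled sample set, not the training of a classifier on it.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical OCR confusion pairs; real confusions depend on the OCR engine.
CONFUSIONS = {
    "0": "O", "O": "0",
    "1": "l", "l": "1",
    "5": "S", "S": "5",
    "8": "B", "B": "8",
}

# One sample: (text, data type label, whether the text is the correct form).
Sample = Tuple[str, str, bool]


def corrupt(text: str, rng: random.Random) -> str:
    """Return text with one confusable character swapped, if it has any."""
    positions = [i for i, ch in enumerate(text) if ch in CONFUSIONS]
    if not positions:
        return text
    i = rng.choice(positions)
    return text[:i] + CONFUSIONS[text[i]] + text[i + 1:]


def build_sample_set(template: Dict[str, List[str]], seed: int = 0) -> List[Sample]:
    """Build correct and incorrect labelled samples for one template.

    `template` maps each data type to example field texts for that type.
    """
    rng = random.Random(seed)
    samples: List[Sample] = []
    for data_type, texts in template.items():
        for text in texts:
            samples.append((text, data_type, True))
            noisy = corrupt(text, rng)
            if noisy != text:
                # The noisy variant keeps the same data type label.
                samples.append((noisy, data_type, False))
    return samples


if __name__ == "__main__":
    template = {"id_number": ["A10058"], "name": ["LI ZHE"]}
    for sample in build_sample_set(template):
        print(sample)
```

Both the correct and the corrupted strings carry the same data type label, so a classifier trained on the resulting tuples sees each field in its clean form and in a misrecognized form.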
15. A storage medium storing computer-executable instructions that, when executed, implement the following:
acquiring target OCR text data of a target certificate;
for the text content of each text line or text column in the target OCR text data, identifying, by using a text classification model, the data type to which the text content may belong; wherein the text classification model is obtained by training based on sample OCR text data sets corresponding to certificates with various format arrangements, and each sample OCR text data set comprises correct sample OCR text data, incorrect sample OCR text data, and the data type to which the text content of each text line or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs;
and determining the data type of the text content of each text line or text column in the target OCR text data according to the data type to which the text content may belong and a type determination model.
16. A storage medium storing computer-executable instructions that, when executed, implement the following:
determining a format arrangement template corresponding to each certificate based on format arrangement of each certificate;
for each certificate, configuring a sample OCR text data set for the format arrangement template corresponding to the certificate; wherein the sample OCR text data set comprises correct sample OCR text data and incorrect sample OCR text data, and a data type to which the text content of each text line or text column in the correct sample OCR text data and the incorrect sample OCR text data belongs;
and training a text classification model based on the sample OCR text data set corresponding to each certificate.
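The first step of the training method, determining a format arrangement template for each certificate, is not detailed in the claims. One possible reading is sketched below in Python, assuming that a format arrangement can be summarized by a reading orientation and an ordered list of field data types, and that certificates with identical arrangements share one template. The `Layout` type, the `assign_templates` function, and the certificate names are hypothetical.

```python
from typing import Dict, NamedTuple, Tuple


class Layout(NamedTuple):
    """Assumed summary of a certificate's format arrangement."""
    orientation: str          # "rows" or "columns"
    fields: Tuple[str, ...]   # field data types in reading order


def assign_templates(certificates: Dict[str, Layout]) -> Dict[str, int]:
    """Map each certificate name to a template id shared by equal layouts."""
    template_ids: Dict[Layout, int] = {}
    result: Dict[str, int] = {}
    for name, layout in certificates.items():
        if layout not in template_ids:
            # First time this arrangement is seen: open a new template.
            template_ids[layout] = len(template_ids)
        result[name] = template_ids[layout]
    return result


if __name__ == "__main__":
    certs = {
        "passport_a": Layout("rows", ("name", "id_number")),
        "passport_b": Layout("rows", ("name", "id_number")),
        "permit": Layout("columns", ("id_number", "name")),
    }
    print(assign_templates(certs))
```

Under this assumption, a sample OCR text data set would then be configured once per template id rather than once per individual certificate.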
CN202010111039.6A 2020-02-24 2020-02-24 Text processing and text classification model training method and device Active CN111339910B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010111039.6A CN111339910B (en) 2020-02-24 2020-02-24 Text processing and text classification model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010111039.6A CN111339910B (en) 2020-02-24 2020-02-24 Text processing and text classification model training method and device

Publications (2)

Publication Number Publication Date
CN111339910A (en) 2020-06-26
CN111339910B (en) 2023-11-28

Family

ID=71185404

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010111039.6A Active CN111339910B (en) 2020-02-24 2020-02-24 Text processing and text classification model training method and device

Country Status (1)

Country Link
CN (1) CN111339910B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434686A (en) * 2020-11-16 2021-03-02 浙江大学 End-to-end error-containing text classification recognition instrument for OCR (optical character recognition) picture
WO2022134580A1 (en) * 2020-12-22 2022-06-30 深圳壹账通智能科技有限公司 Method and apparatus for acquiring certificate information, and storage medium and computer device

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090182733A1 (en) * 2008-01-11 2009-07-16 Hideo Itoh Apparatus, system, and method for information search
CN102103627A (en) * 2010-11-26 2011-06-22 中兴通讯股份有限公司 Method and device for identifying two-dimensional codes on mobile terminal
CN105138515A (en) * 2015-09-02 2015-12-09 百度在线网络技术(北京)有限公司 Named entity recognition method and device
CN105260727A (en) * 2015-11-12 2016-01-20 武汉大学 Academic-literature semantic restructuring method based on image processing and sequence labeling
CN106056114A (en) * 2016-05-24 2016-10-26 腾讯科技(深圳)有限公司 Business card content identification method and business card content identification device
US20170330048A1 (en) * 2016-05-13 2017-11-16 Abbyy Development Llc Optical character recognition of series of images
CN108596268A (en) * 2018-05-03 2018-09-28 湖南大学 A kind of data classification method
CN109360571A (en) * 2018-10-31 2019-02-19 深圳壹账通智能科技有限公司 Processing method and processing device, storage medium, the computer equipment of credit information
CN109376219A (en) * 2018-10-31 2019-02-22 北京锐安科技有限公司 Matching process, device, electronic equipment and the storage medium of text attributes field
CN109389115A (en) * 2017-08-11 2019-02-26 腾讯科技(上海)有限公司 Text recognition method, device, storage medium and computer equipment
CN110008331A (en) * 2019-04-15 2019-07-12 三角兽(北京)科技有限公司 Information displaying method, device, electronic equipment and computer readable storage medium
CN110245557A (en) * 2019-05-07 2019-09-17 平安科技(深圳)有限公司 Image processing method, device, computer equipment and storage medium
CN110263740A (en) * 2019-06-26 2019-09-20 四川新网银行股份有限公司 Different type block letter document dubbing method based on OCR technique
CN110265024A (en) * 2019-05-20 2019-09-20 平安普惠企业管理有限公司 Requirement documents generation method and relevant device
CN110362826A (en) * 2019-07-05 2019-10-22 武汉莱博信息技术有限公司 Periodical submission method, equipment and readable storage medium storing program for executing based on artificial intelligence
CN110414519A (en) * 2019-06-27 2019-11-05 众安信息技术服务有限公司 A kind of recognition methods of picture character and its identification device
CN110647829A (en) * 2019-09-12 2020-01-03 全球能源互联网研究院有限公司 Bill text recognition method and system
CN110688833A (en) * 2019-09-16 2020-01-14 苏州创意云网络科技有限公司 Text correction method, device and equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A. RIAD ET AL: "Classification and Information Extraction for Complex and Nested Tabular Structures in Images" *
M. RAMANAN ET AL: "A preprocessing method for printed Tamil documents: Skew correction and textual classification" *
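Sun Ting et al.: "Row and column information recognition in an expert-assisted recognition system for document typesetting features" (original title: "文书排版特征专家辅助识别系统之行列信息识别") *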

Also Published As

Publication number Publication date
CN111339910B (en) 2023-11-28

Similar Documents

Publication Publication Date Title
CN109766438B (en) Resume information extraction method, resume information extraction device, computer equipment and storage medium
US10915788B2 (en) Optical character recognition using end-to-end deep learning
EP3869385B1 (en) Method for extracting structural data from image, apparatus and device
CN109190007B (en) Data analysis method and device
US9626555B2 (en) Content-based document image classification
CN108108342B (en) Structured text generation method, search method and device
RU2613846C2 (en) Method and system for extracting data from images of semistructured documents
KR101377601B1 (en) System and method for providing recognition and translation of multiple language in natural scene image using mobile camera
CN105631393A (en) Information recognition method and device
CN111507214A (en) Document identification method, device and equipment
US11295175B1 (en) Automatic document separation
CN112149680B (en) Method and device for detecting and identifying wrong words, electronic equipment and storage medium
CN111339910B (en) Text processing and text classification model training method and device
JP2019079347A (en) Character estimation system, character estimation method, and character estimation program
CN111357015B (en) Text conversion method, apparatus, computer device, and computer-readable storage medium
CN112287071A (en) Text relation extraction method and device and electronic equipment
CN114332873A (en) Training method and device for recognition model
CN104252446A (en) Computing device, and verification system and method for consistency of contents of files
CN117216279A (en) Text extraction method, device and equipment of PDF (portable document format) file and storage medium
JP2012234512A (en) Method for text segmentation, computer program product and system
CN112149678A (en) Character recognition method and device for special language and recognition model training method and device
CN116757183A (en) Project information processing method and device
KR102468975B1 (en) Method and apparatus for improving accuracy of recognition of precedent based on artificial intelligence
US20220044048A1 (en) System and method to recognise characters from an image
Chowdhury et al. Implementation of an optical character reader (ocr) for bengali language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant