WO2023087702A1

WO2023087702A1 - Text recognition method for form certificate image file, and computing device

Info

Publication number: WO2023087702A1
Application number: PCT/CN2022/100026
Authority: WO
Inventors: 郎志刚; 付勇; 范增虎
Original assignee: 深圳前海微众银行股份有限公司
Priority date: 2021-11-22
Filing date: 2022-06-21
Publication date: 2023-05-25
Also published as: CN114049642A

Abstract

Provided in the embodiments of the present invention are a text recognition method for a form certificate image file, and a computing device. The method comprises: for any type of form certificate image file, performing text content recognition on the type of form certificate image file, so as to determine first text content; when the number of character strings of a first text line in the first text content is different from the number of key fields of a form certificate of the type, splicing any character string of the first text line with each character string in a second text line, and verifying a spliced character string; and when any spliced character string conforms to a text content rule of any key field in the form certificate of the type, determining the spliced character string to be text content of the key field. Therefore, the text content of each key field can be determined. In this way, by means of the solution, the accuracy of recognizing text content in a form certificate image file can be effectively improved, and the costs accumulated when maintaining different content templates are reduced.

Description

A text recognition method and computing device for form certificate image

Cross References to Related Applications

This application claims priority to Chinese patent application No. 202111382325.7, filed with the China Patent Office on November 22, 2021 and entitled "A Text Recognition Method and Computing Equipment for Image Documents of Forms and Certificates", the entire content of which is incorporated herein by reference.

Technical Field

Embodiments of the present invention relate to the field of financial technology (Fintech), and in particular to a text recognition method and computing equipment for form certificate image documents.

Background

With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually transforming into financial technology. However, the security and real-time requirements of the financial industry also place higher demands on these technologies. In the financial field, when users handle financial services (such as loans), they are required to upload images of relevant certificates for auxiliary review in order to ensure the safety of business operations. For example, a user may upload an image of their own housing property certificate, motor vehicle registration certificate, or enterprise business registration, so that business personnel can use OCR (Optical Character Recognition) technology to extract and review the content of the uploaded certificate image.

At this stage, before recognition is performed on a given type of form certificate image file, the coordinate values of the four corners of each cell of that type of form certificate are introduced while defining the rows and columns of the content template corresponding to that type, so that the text content in image files of that type can be recognized. Specifically, when a form certificate image file of a certain type uploaded by a user is recognized, each cell is first segmented according to the coordinate values of its four corners, OCR is then performed on each segmented cell to obtain its text content, and the text recognized for each cell is finally parsed and structured through the content template, so that the complete content information corresponding to each key field of that type of form certificate is obtained. However, because of scanning deviation, shooting deviation, or changes in the position of fields in the image file, cells segmented according to the four-corner coordinate values may be inaccurate, and as a result the text content recognized for at least one key field is also inaccurate. Moreover, the existing scheme configures, for each type of form certificate, a fixed content template that meets the content format requirements of that type. When the content of a type of form certificate changes, the corresponding content template has to be redesigned and redeveloped, so versatility is poor and maintenance cost is high. For example, the content of house property right certificates varies greatly between cities, so a content template must be configured for the certificate defined by each city, which leads to a long development cycle and high labor cost for these templates.

To sum up, there is an urgent need for a text recognition method for form certificate images, which can effectively improve the accuracy of recognizing text content in form certificate images, and can reduce the cost of maintaining content templates.

Summary of the Invention

In the first aspect, an embodiment of the present invention provides a method for text recognition of an image of a form certificate, including:

For any type of form certificate image, by performing text content recognition on the form certificate image of the type, determine the first text content of the form certificate image;

When it is determined that the number of character strings in a first text line of the first text content differs from the number of key fields of the type of form certificate, splicing any character string in the first text line with each character string in a second text line, and verifying each spliced character string; the second text line is the nearest line before the first text line in the first text content;

When any concatenated character string conforms to the text content rules of any key field in the type of form certificate, determine the concatenated character string as the text content of the key field;

The text content of each key field is determined as the second text content of the type of form certificate image file.

In the above technical solution, the starting point is that the prior art configures, for each type of form certificate (such as a house property right certificate), a content template that meets the content format requirements of that type. Once the content of a type of form certificate changes, the corresponding content template has to be redesigned and redeveloped, so versatility is poor and maintenance cost is high. In addition, if there is scanning deviation or shooting deviation, or if the position of fields in the image file of that type changes, the text content recognized for at least one key field will be inaccurate. Based on this, the technical solution of the present invention automatically judges whether, in the text content recognized from a form certificate image file of a certain type, the number of character strings in one or several text lines differs from the number of key fields of that type of form certificate. When such a difference is found, the affected text lines are processed automatically according to the corresponding processing rules. This effectively ensures that the text content corresponding to each key field is accurate, and avoids the inaccuracies caused by scanning deviation, shooting deviation, or changes in field position.

Specifically, for any type of form certificate image file, text content recognition is performed to determine the first text content of that image file. When the number of character strings in a first text line of the first text content differs from the number of key fields of that type of form certificate, any character string in the first text line is spliced with each character string in the second text line, and each spliced string is verified. If a spliced string conforms to the text content rule of a key field of that type of form certificate, the spliced string is determined to be the text content of that key field, so that the text content of each key field can be determined. In this way, the solution only needs to splice each character string of a text line whose string count does not match with each character string of the nearest preceding text line, and to determine whether a spliced string conforms to the text content rule of some key field. The text content of that key field can then be determined accurately, which effectively improves the accuracy of recognizing text content in form certificate image files. There is no need to configure a content template for each type of form certificate, and no need to introduce the four-corner coordinate values of each cell while defining the rows and columns of such a template. The solution is therefore highly versatile and can meet the needs of different usage scenarios: for example, it can handle different form certificates issued in different regions, as well as a type of form certificate whose content has changed, and it reduces the maintenance cost of configuring different content templates for different types of form certificates.
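The splice-and-verify idea can be illustrated with a minimal Python sketch. It is not the embodiment's implementation: the function name `align_row`, the field names, and the regular expressions are hypothetical, and the sketch additionally assumes that the preceding row is already aligned with the header so that each cell is checked against the rule of its own column.

```python
import re
from typing import Callable, Dict, List

# A rule is a predicate deciding whether a string is valid content for a key field.
Rule = Callable[[str], bool]


def align_row(headers: List[str], prev_row: List[str],
              short_row: List[str], rules: Dict[str, Rule]) -> List[str]:
    """Append each fragment of `short_row` to the cell of `prev_row` it completes.

    Every fragment is spliced onto every cell of the preceding row in turn;
    the first spliced string accepted by that column's rule is kept (greedy).
    """
    merged = list(prev_row)
    for fragment in short_row:
        for col, cell in enumerate(merged):
            candidate = cell + fragment
            if rules[headers[col]](candidate):
                merged[col] = candidate
                break  # fragment consumed, move on to the next one
    return merged


# Hypothetical key fields and content rules, for illustration only.
HEADERS = ["Owner", "ID number", "Serial"]
RULES: Dict[str, Rule] = {
    "Owner": lambda s: re.fullmatch(r"[A-Za-z ]+", s) is not None,
    "ID number": lambda s: re.fullmatch(r"\d{18}", s) is not None,
    "Serial": lambda s: re.fullmatch(r"[A-Z]{2}-\d{4}", s) is not None,
}

aligned = align_row(HEADERS, ["Li Lei", "440301199001", "AB-"],
                    ["011234", "2023"], RULES)
print(aligned)
```

Because the sketch accepts the first match, a fragment that happens to satisfy an earlier column's rule would be attached there; stricter rules or a scoring step would be needed to handle such ambiguity.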

Optionally, whether the spliced character string conforms to the text content rule of any key field in the type of form certificate is determined in the following manner:

If the spliced character string is of the pure-digit type, determining, from the key fields, the key fields whose text content is of the pure-digit type;

For any key field whose text content is of the pure-digit type, performing a length check and a regular expression check on the spliced character string according to the text content rule of the key field, so as to determine whether the spliced character string conforms to the text content rule of the key field.

In the above technical solution, if a spliced character string is of the pure-digit type, the key fields whose text content is of the pure-digit type are first selected from the key fields, and the length check and regular expression check are performed on the spliced string only against the text content rules of those key fields, rather than against the content rules of every key field. This saves the time that would otherwise be spent checking a pure-digit spliced string against every key field's rule, which improves verification efficiency and therefore the efficiency of determining the text content of key fields, and it also improves the recognition accuracy for pure-digit key fields in the form certificate image file.
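A minimal sketch of the pure-digit check follows. The rule table `NUMERIC_RULES`, its field names, lengths, and patterns are invented for illustration; only the two-stage check (length first, then regular expression) comes from the text above.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical rules for pure-digit key fields: name -> (required length, pattern).
NUMERIC_RULES: Dict[str, Tuple[int, str]] = {
    "postal_code": (6, r"\d{6}"),
    "phone": (11, r"1\d{10}"),
}


def match_numeric_fields(candidate: str) -> List[str]:
    """Return the pure-digit key fields whose rule the spliced string satisfies."""
    # Only ASCII digit strings are routed to the pure-digit rules.
    if not (candidate.isascii() and candidate.isdigit()):
        return []
    hits = []
    for name, (length, pattern) in NUMERIC_RULES.items():
        if len(candidate) != length:          # cheap length check first
            continue
        if re.fullmatch(pattern, candidate):  # then the regular expression check
            hits.append(name)
    return hits
```

Running the length check before the regular expression keeps the common rejection path cheap, which is in line with the efficiency argument made for this option.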

Optionally, whether the spliced character string conforms to the text content rule of any key field in the type of form certificate is determined in the following manner:

If the spliced character string is of the letter + at least one special character type, determining, from the key fields, the key fields whose text content is of the letter + at least one special character type;

For any key field whose text content is of the letter + at least one special character type, performing a regular expression check on the spliced character string according to the text content rule of the key field, so as to determine whether the spliced character string conforms to the text content rule of the key field.

In the above technical solution, if a spliced character string is of the letter + at least one special character type, the key fields whose text content is of that type are first selected from the key fields, and the regular expression check is performed on the spliced string only against the text content rules of those key fields, rather than against the content rules of every key field. This saves the time that would otherwise be spent checking such a spliced string against every key field's rule, which improves the efficiency of the regular expression check and therefore of determining the text content of key fields, and it also improves the recognition accuracy for key fields of the letter + at least one special character type in the form certificate image file.
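The letter + special character check can be sketched in the same spirit. The rule table `LETTER_SPECIAL_RULES` and its two patterns are hypothetical, and "special character" is approximated here by ASCII punctuation, which is an assumption of this sketch.

```python
import re
import string
from typing import Dict, List

# Hypothetical rules for key fields made of letters plus special characters.
LETTER_SPECIAL_RULES: Dict[str, str] = {
    "series": r"[A-Z]+(?:-[A-Z]+)+",
    "email": r"[A-Za-z.]+@[A-Za-z]+\.[A-Za-z]{2,}",
}

_ALLOWED = set(string.ascii_letters) | set(string.punctuation)


def match_letter_special_fields(candidate: str) -> List[str]:
    """Return the letter+special key fields whose pattern the string fully matches."""
    has_letter = any(c in string.ascii_letters for c in candidate)
    has_special = any(c in string.punctuation for c in candidate)
    # Type gate: letters and punctuation only, with at least one of each.
    if not (has_letter and has_special and set(candidate) <= _ALLOWED):
        return []
    return [name for name, pattern in LETTER_SPECIAL_RULES.items()
            if re.fullmatch(pattern, candidate)]
```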

If the spliced character string is of the type containing at least one Chinese word, determining, from the key fields, the key fields whose text content is of the type containing at least one Chinese word;

Processing the spliced character string through a set language model to determine the sentence probability that the spliced character string conforms to the text content rule of any key field whose text content is of the type containing at least one Chinese word, and thereby determining whether the spliced character string conforms to the text content rule of that key field.

In the above technical solution, if a spliced character string is of the type containing at least one Chinese word, it only needs to be processed through the set language model to determine the sentence probability that it conforms to the text content rule of a key field whose text content is of that type. Based on this sentence probability, it can be accurately determined whether the spliced character string is the text content of such a key field, which improves both the efficiency of determining the text content of key fields and the recognition accuracy for key fields containing at least one Chinese word in the form certificate image file.
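The three options above imply a dispatch step that decides which family of rules a spliced string is checked against. A minimal sketch of such a dispatcher is shown below; the function name `classify` and the returned labels are hypothetical, and "contains at least one Chinese word" is approximated by "contains at least one CJK unified ideograph", which is an assumption.

```python
import re
import string

# Basic CJK Unified Ideographs block, used as a stand-in for "Chinese word".
_CJK = re.compile(r"[\u4e00-\u9fff]")
_SPECIALS = set(string.punctuation)


def classify(spliced: str) -> str:
    """Route a spliced string to one of the rule families described above."""
    if _CJK.search(spliced):
        return "chinese"
    if spliced.isascii() and spliced.isdigit():
        return "digits"
    letters = [c for c in spliced if c in string.ascii_letters]
    specials = [c for c in spliced if c in _SPECIALS]
    # Letters plus at least one special character, and nothing else.
    if letters and specials and len(letters) + len(specials) == len(spliced):
        return "letters+special"
    return "other"
```

Strings labelled "other" match none of the three families in this sketch; how the embodiment treats them is not stated here.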

Optionally, processing the spliced character string through the set language model to determine the sentence probability that the spliced character string conforms to the text content rule of any key field whose text content is of the type containing at least one Chinese word includes:

Performing word segmentation on the spliced character string through a set word segmentation tool to obtain at least one word segment and the part of speech of each word segment;

Combining the at least one word segment into at least one sentence by permutation and combination;

For each sentence, processing each word segment in the sentence through the set language model to determine a first sub-sentence probability of the sentence, and processing the part of speech of each word segment in the sentence through the set language model to determine a second sub-sentence probability of the sentence; determining the sentence probability of the sentence based on the first sub-sentence probability and the second sub-sentence probability; the first sub-sentence probability is determined by counting the word frequency of each word segment in the sentence, and the second sub-sentence probability is determined by counting the frequency of the part of speech of each word segment in the sentence;

Comparing the sentence probabilities of the at least one sentence to determine the maximum sentence probability, and determining the maximum sentence probability as the sentence probability corresponding to the spliced character string.

Optionally, the set language model is a bigram (binary) model;

The sentence probability of the sentence satisfies the following form:

P = Ω × P_w × (1 − Ω) × P_c

Wherein, the first sub-sentence probability, determined by statistically processing each word segment in the sentence through the bigram model, satisfies the following form:

The second sub-sentence probability, determined by statistically processing the part of speech of each word segment in the sentence through the bigram model, satisfies the following form:

Wherein, P represents the sentence probability of the sentence, P_w represents the first sub-sentence probability of the sentence, P_c represents the second sub-sentence probability of the sentence, Ω represents the weight, W_i represents any word segment in the sentence, and C_i represents the part of speech of any word segment in the sentence.
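The expressions for the two sub-sentence probabilities do not appear in this text. As a reference point only, the textbook bigram factorization is given below; it is an assumption that is consistent with the symbols W_i and C_i and with the word-frequency counting described above, and it is not necessarily the exact form used in the embodiment. The symbols n (number of word segments in the sentence), W_0 and C_0 (sentence-start markers), and count(·) (frequency in a reference corpus) are notation introduced here for illustration.

```latex
P_w = \prod_{i=1}^{n} P(W_i \mid W_{i-1}),
\qquad
P(W_i \mid W_{i-1}) \approx \frac{\operatorname{count}(W_{i-1} W_i)}{\operatorname{count}(W_{i-1})}

P_c = \prod_{i=1}^{n} P(C_i \mid C_{i-1}),
\qquad
P(C_i \mid C_{i-1}) \approx \frac{\operatorname{count}(C_{i-1} C_i)}{\operatorname{count}(C_{i-1})}
```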

In the above technical solution, after word segmentation and part-of-speech tagging are performed on a spliced character string through the set word segmentation tool, both a probability based on the word segments and a probability based on their parts of speech are introduced. The sentence probability corresponding to the spliced character string can thus be determined more fully and accurately, and it can be determined more accurately whether the spliced character string is the text content of a key field whose text content is of the type containing at least one Chinese word.
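The permutation-and-scoring procedure can be sketched as follows. This is an illustration under several assumptions: the input is already segmented and tagged (no segmentation tool is called), the bigram probabilities use add-one smoothing over a toy corpus, and the helper names `train_bigram`, `bigram_prob`, and `best_sentence` are hypothetical. The two sub-sentence probabilities are combined exactly as the expression for P is written above.

```python
from collections import Counter
from itertools import permutations
from typing import List, Sequence, Tuple

Model = Tuple[Counter, Counter]  # (unigram counts, bigram counts)


def train_bigram(corpus: Sequence[Sequence[str]]) -> Model:
    """Count unigrams and bigrams, with "<s>" marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>", *sentence]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(tokens: Sequence[str], unigrams: Counter, bigrams: Counter) -> float:
    """Product of add-one-smoothed conditional probabilities P(t_i | t_{i-1})."""
    vocab = len(unigrams)
    prob, prev = 1.0, "<s>"
    for tok in tokens:
        prob *= (bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab)
        prev = tok
    return prob


def best_sentence(tagged: List[Tuple[str, str]], word_model: Model,
                  pos_model: Model, omega: float = 0.5) -> Tuple[float, Tuple[str, ...]]:
    """Score every ordering of the (word, part-of-speech) pairs and keep the best."""
    best: Tuple[float, Tuple[str, ...]] = (0.0, ())
    for order in permutations(tagged):
        words = [w for w, _ in order]
        tags = [t for _, t in order]
        p_w = bigram_prob(words, *word_model)  # first sub-sentence probability
        p_c = bigram_prob(tags, *pos_model)    # second sub-sentence probability
        p = omega * p_w * (1 - omega) * p_c    # combined as written in the text
        if p > best[0]:
            best = (p, tuple(words))
    return best


# Toy reference corpus (hypothetical): word sequences and their part-of-speech tags.
WORD_MODEL = train_bigram([["深圳市", "南山区"], ["深圳市", "福田区"]])
POS_MODEL = train_bigram([["ns", "ns"], ["ns", "ns"]])
score, words = best_sentence([("南山区", "ns"), ("深圳市", "ns")], WORD_MODEL, POS_MODEL)
```

Enumerating all permutations grows factorially with the number of word segments, so a sketch like this is only practical for the short strings found in a single table cell.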

Optionally, after the text content of each key field is determined as the second text content of the type of form certificate image file, the method further includes:

Taking the key field at the starting position in the type of form certificate as the starting point of construction, and constructing a directed acyclic graph of the type of form certificate according to the second text content;

Taking the node at the starting position in the directed acyclic graph as the starting point of traversal, and sequentially traversing each node in the directed acyclic graph through a depth-first search algorithm, so as to obtain data content conforming to a set data format.

In the above technical solution, the aim is to organize the key-value text content in a form certificate image file of a certain type, conveniently and accurately, into content that conforms to a set data format (such as the JSON data format). To this end, the solution first handles the case in which the number of character strings in a text line of the text content does not match the number of key fields of that type of form certificate (that is, the column misalignment problem) by aligning the character strings of that text line to the columns. A directed acyclic graph is then constructed from the column-aligned text content, and each node in the graph is traversed in turn through the depth-first search algorithm to obtain data content conforming to the set data format. In this way, the solution adaptively completes text content recognition for different types of form certificate image files without relying on different content templates. It is therefore highly versatile, ensures that the text content recognized from different types of form certificate image files is accurate, and reduces the cost of maintaining content templates for different form certificates.
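One way to picture the graph construction and traversal is the sketch below. How the embodiment wires its edges is not visible in this text, so the layout used here is an assumption: every non-empty cell of the column-aligned text is a node, each node points to the cell directly below it, and cells of the first row additionally point to the cell on their right. The names `build_dag` and `dfs_order` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[int, int]  # (row, column) of a cell in the column-aligned text


def build_dag(rows: List[List[Optional[str]]]) -> Tuple[Dict[Node, str], Dict[Node, List[Node]]]:
    """Build node labels and edges; edges only go down or right, so no cycle can form."""
    labels = {(r, c): text
              for r, row in enumerate(rows)
              for c, text in enumerate(row) if text}
    edges: Dict[Node, List[Node]] = {node: [] for node in labels}
    for (r, c) in labels:
        below, right = (r + 1, c), (r, c + 1)
        if below in labels:
            edges[(r, c)].append(below)   # top-to-bottom edge
        if r == 0 and right in labels:
            edges[(r, c)].append(right)   # left-to-right edge along the header row
    return labels, edges


def dfs_order(start: Node, edges: Dict[Node, List[Node]]) -> List[Node]:
    """Depth-first traversal from `start`, visiting each node once."""
    seen, order = set(), []

    def visit(node: Node) -> None:
        if node in seen:
            return
        seen.add(node)
        order.append(node)
        for nxt in edges[node]:
            visit(nxt)

    visit(start)
    return order


labels, edges = build_dag([["Owner", "Area"], ["Li Lei", "89.5"]])
visited = [labels[n] for n in dfs_order((0, 0), edges)]
```

With this layout, a depth-first walk from the top-left cell reaches each key field followed by the cells beneath it, which is the order needed to emit key-value pairs.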

Optionally, taking the node at the starting position in the directed acyclic graph as the starting point of traversal, and sequentially traversing each node in the directed acyclic graph through a depth-first search algorithm, includes:

For any node in the directed acyclic graph, when it is determined that the node is not an empty node, determining whether the node exists in a key field library; the key field library is used to store the key fields of the various types of form certificates;

If so, determining that the node is a first key field node, and traversing a first node adjacent to the first key field node along a first traversal direction; when it is determined that the first node is not an empty node and does not exist in the key field library, determining the first node as the text content of the first key field node; and stopping the traversal when an empty node appears in the first traversal direction; wherein the first traversal direction is from top to bottom.

In the above technical solution, in order to organize the key-value text content in a form certificate image file of a certain type, conveniently and accurately, into content that conforms to a set data format (such as the JSON data format), the solution pre-configures a key field mapping array for each type of form certificate, that is, the key field mappings of each type of form certificate are stored in the key field library. When the depth-first search algorithm reaches a node in the directed acyclic graph, it can therefore promptly determine whether the node is a key field node, that is, whether the node is a key in the form certificate image file of that type (a key field and its text content exist in the form of a key-value pair). If the node is determined to be the first key field node (that is, a key field of a certain type of form certificate), traversing in the top-to-bottom direction promptly finds the value of the first key field node, that is, its text content. The key-value content of the set data format corresponding to the first key field is then updated with this text content, so that the data content of the set data format mapped to the first key field is obtained. In the key field library, the first key field node serves as a key and corresponds to a key in the key-value content of the set data format.

Optionally, after the traversal is stopped when an empty node appears in the first traversal direction, the method further includes:

Traversing a second node adjacent to the first key field node along a second traversal direction, and when it is determined that the second node is not an empty node and exists in the key field library, determining the second node as a second key field node;

Traversing a third node adjacent to the second key field node along the first traversal direction, and when it is determined that the third node is not an empty node and does not exist in the key field library, determining the third node as the text content of the second key field node; stopping the traversal in the first traversal direction when an empty node appears in the first traversal direction; then traversing a fourth node adjacent to the second key field node along the second traversal direction, and stopping the traversal when an empty node appears in the second traversal direction; wherein the second traversal direction is from left to right.

In the above technical solution, after the traversal in the top-to-bottom direction is completed, the second node adjacent to the first key field node is traversed in the left-to-right direction. Similarly, if the second node is an empty node, the traversal in the left-to-right direction ends; if the second node is not an empty node and exists in the key field library, the second node is determined to be the second key field node (that is, a key field of a certain type of form certificate). The third node adjacent to the second key field node is then traversed in the top-to-bottom direction, which promptly finds the value of the second key field node, that is, its text content. The key-value content of the set data format corresponding to the second key field is updated with this text content, so that the data content of the set data format mapped to the second key field is obtained. In the key field library, the second key field node serves as a key and corresponds to a key in the key-value content of the set data format.
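The two traversal directions can be sketched together as follows. The sketch walks the column-aligned text directly as a grid (each cell standing for a graph node) and is written iteratively rather than as a recursive depth-first search. The key field library contents, the output key names, and the choice to concatenate several value cells into one string are all assumptions made for illustration.

```python
import json
from typing import Dict, List, Optional

# Hypothetical key field library: key field text -> key in the output data format.
KEY_FIELD_LIBRARY: Dict[str, str] = {"Owner": "owner", "Area": "area"}

Grid = List[List[Optional[str]]]


def cell_at(grid: Grid, row: int, col: int) -> Optional[str]:
    """Return the cell text, or None for an empty or out-of-range node."""
    if 0 <= row < len(grid) and 0 <= col < len(grid[row]):
        return grid[row][col] or None
    return None


def extract(grid: Grid, row: int = 0, col: int = 0) -> Dict[str, str]:
    """Collect key-value pairs: walk down under each key field, then step right."""
    result: Dict[str, str] = {}
    while True:
        key = cell_at(grid, row, col)
        if key is None or key not in KEY_FIELD_LIBRARY:
            break                      # empty node or non-key: stop moving right
        values = []
        r = row + 1
        while True:                    # first direction: top to bottom
            node = cell_at(grid, r, col)
            if node is None or node in KEY_FIELD_LIBRARY:
                break                  # empty node (or another key) ends the column
            values.append(node)
            r += 1
        result[KEY_FIELD_LIBRARY[key]] = "".join(values)
        col += 1                       # second direction: left to right
    return result


GRID: Grid = [["Owner", "Area"], ["Li Lei", "89.5"], [None, "m2"]]
record = extract(GRID)
print(json.dumps(record, ensure_ascii=False))
```

Stopping the downward walk when another key field is met is a simplification of this sketch; forms with key fields stacked vertically would need that case handled separately.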

In the second aspect, the embodiment of the present invention also provides a text recognition device for a form certificate image document, including:

The recognition unit is configured to, for any type of form certificate image, determine the first text content of the type of form certificate image by performing text content recognition on the type of form certificate image;

A processing unit, configured to: when it is determined that the number of character strings in a first text line of the first text content differs from the number of key fields of the type of form certificate, splice any character string in the first text line with each character string in a second text line, and verify each spliced character string, the second text line being the nearest line before the first text line in the first text content; when any spliced character string conforms to the text content rule of any key field in the type of form certificate, determine the spliced character string as the text content of the key field; and determine the text content of each key field as the second text content of the type of form certificate image file.

In a third aspect, an embodiment of the present invention provides a computing device, including at least one processor and at least one memory, wherein the memory stores a computer program, and when the program is executed by the processor, the processor executes the text recognition method for a form certificate image file described in any implementation of the first aspect.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program executable by a computing device, and when the program runs on the computing device, the computing device executes the text recognition method for a form certificate image file described in any implementation of the first aspect.

Brief Description of the Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic flowchart of a text recognition method for a form certificate image provided by an embodiment of the present invention;

FIG. 2a is a schematic diagram of a form certificate image file provided by an embodiment of the present invention;

FIG. 2b is a schematic diagram of a text result after OCR parsing and processing provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of text content after column alignment processing provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of a directed acyclic graph provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of another form certificate image file provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of another directed acyclic graph provided by an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a text recognition device for a form certificate image document provided by an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of a computing device provided by an embodiment of the present invention.

Detailed Description

In order to make the purpose, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the protection scope of the present invention.

Fig. 1 exemplarily shows the flow of a text recognition method for a form certificate image document provided by an embodiment of the present invention, and the process can be executed by a text recognition device for a form certificate image document.

As shown in Figure 1, the process specifically includes:

Step 101 , for any type of form certificate image, by performing text content recognition on the type of form certificate image, determine the first text content of the type of form certificate image.

In the embodiment of the present invention, for any type of form certificate image submitted by the user (such as house ownership certificate, land use certificate, etc.), the text content of the type of form certificate image can be processed by the set text recognition tool. Recognition processing, so that the text content with at least one text line in the recognized text area can be obtained.

Exemplarily, after a service device (such as a server for processing form certificate image files) receives a certain type of form certificate image file uploaded by a user, it can use a set text recognition tool, such as an open source OCR (Optical Character Recognition) tool, for example the Pillow library or the pytesseract library in a Python environment, to parse this type of form certificate image file and obtain a text result, that is, the text content in this type of form certificate image file. However, the text result may have column inconsistency (that is, the columns are not aligned): the number of character strings in one or several rows differs from the number of key fields of this type of form certificate (that is, the key fields of the header row). For example, Figure 2a is a schematic diagram of a form certificate image file provided by the embodiment of the present invention, and parsing this form certificate image file with an OCR tool yields the text result shown schematically in Figure 2b. It can be seen from Figure 2b that the first row (that is, the header of the form certificate) has 4 key fields, the second row has 4 character strings, and the third row has 3 character strings. The number of character strings in the second row is therefore the same as the number of key fields in the first row, while the number of character strings in the third row is different.
Therefore, the text result of this type of form certificate image file has the problem of column inconsistency, and column alignment processing needs to be performed on it: each character string in the third row is processed according to the processing rule for its character string type, so that the column-aligned text result, that is, the complete text content corresponding to each key field in the first row, is obtained.
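The column check described above can be sketched in Python as follows. This is a minimal sketch, assuming the OCR result has already been split into rows of character strings; the function name `find_misaligned_rows`, the header name "name", and the list-of-lists layout are illustrative assumptions and not part of the embodiment.

```python
def find_misaligned_rows(rows):
    """Return the indices of rows whose string count differs from the header row.

    rows[0] is assumed to be the header row holding the key fields.
    """
    expected = len(rows[0])
    return [i for i, row in enumerate(rows) if i > 0 and len(row) != expected]


# Toy data modeled on the description of Figure 2b (header names are assumed).
rows = [
    ["name", "ID card", "mailbox", "home address"],
    ["Zhang San", "1234567890", "zhangsan", "Longhaijia, Nanshan District, Shenzhen City"],
    ["12345678", "@qq.com", "No. 1818, Building 8, Garden"],
]
misaligned = find_misaligned_rows(rows)
print(misaligned)
```

Only the third row (index 2) is reported here, which is the row that would then go through column alignment processing.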

Step 102, when it is determined that the number of character strings in a first text line of the first text content is different from the number of key fields of the type of form certificate, concatenate any character string in the first text line with each character string in a second text line, and verify the concatenated character strings.

Step 103, when any concatenated character string conforms to the text content rule of any key field in the type of form certificate, determine the concatenated character string as the text content of the key field.

In the embodiment of the present invention, if the number of character strings in a certain text line of the recognized text content is different from the number of key fields of this type of form certificate (that is, the number of character strings in that text line differs from the number of key fields in the key field text line, so that the columns of the whole text area are inconsistent, that is, not aligned), this means that each character string in this text line is a continuation of some character string in the nearest preceding text line. In other words, the character strings in this text line do not exist independently but should be concatenated to corresponding character strings in the nearest preceding text line. Therefore, any character string in the first text line is concatenated with each character string in the second text line, and each concatenated character string is verified to determine which key field it belongs to: if a concatenated character string conforms to the text content rule of a key field of this type of form certificate, it is determined to be the text content of that key field. Here, the second text line is the nearest text line before the first text line in the first text content.
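The concatenation step can be illustrated with a short Python sketch. It assumes that both orders (preceding string first, or fragment first) are tried, as the later examples in this description do; the function name `splice_candidates` is hypothetical.

```python
def splice_candidates(fragment, previous_row):
    """Splice `fragment` with every string of the preceding row, in both orders."""
    candidates = []
    for text in previous_row:
        candidates.append(text + fragment)  # fragment continues the earlier string
        candidates.append(fragment + text)  # reverse order, also checked
    return candidates


# Three of the second-row strings described for Figure 2b.
second_row = ["Zhang San", "1234567890", "zhangsan"]
candidates = splice_candidates("12345678", second_row)
print(candidates)
```

Each candidate would then be verified against the text content rules of the key fields.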

Specifically, when determining whether a concatenated character string conforms to the text content rule of a key field of this type of form certificate, if the concatenated character string is of the pure digit type, the key fields whose text content is of the pure digit type can first be selected from the key fields. For any such key field, a length check and a regular expression check are performed on the concatenated character string according to the text content rule of that key field, to determine whether the concatenated character string conforms to the rule. If it does, the concatenated character string is determined to be the text content of that key field. If it does not, the concatenated character string is checked against the text content rule of the next key field whose text content is of the pure digit type, or the next concatenated character string of the pure digit type is checked. In this way, the scheme only needs to perform the length check and the regular expression check against the text content rules of the pure-digit key fields, rather than against the text content rules of every key field. This saves the time that would otherwise be spent checking pure-digit concatenated character strings against every key field's rule, which improves both the efficiency of determining the text content of the key fields and the recognition accuracy for pure-digit key fields in the form certificate image file.

If the concatenated character string is of the letter + at least one special character type, the key fields whose text content is of the letter + at least one special character type can first be selected from the key fields. For any such key field, a regular expression check is performed on the concatenated character string according to the text content rule of that key field, to determine whether the concatenated character string conforms to the rule. If it does, the concatenated character string is determined to be the text content of that key field. If it does not, the concatenated character string is checked against the text content rule of the next key field whose text content is of the letter + at least one special character type, or the next concatenated character string of that type is checked. In this way, the scheme only needs to perform the regular expression check against the text content rules of the key fields of the letter + at least one special character type, rather than against the text content rules of every key field. This saves the time that would otherwise be spent checking such concatenated character strings against every key field's rule, and improves the recognition accuracy for key fields of the letter + at least one special character type in the form certificate image file.

If the concatenated character string is of the type containing at least one Chinese word, it only needs to be processed by a set language model to determine the sentence probability that it conforms to the text content rule of a key field whose text content contains at least one Chinese word, so that whether the concatenated character string is the text content of that key field can be accurately determined from the sentence probability. If the concatenated character string conforms to the text content rule of the key field, it is determined to be the text content of that key field; if not, it is checked against the text content rule of the next key field whose text content contains at least one Chinese word. Specifically, when the sentence probability is determined by the set language model, a set word segmentation tool first performs word segmentation on the concatenated character string to obtain at least one word segment and the part of speech of each word segment. The word segments are then combined by permutation into at least one sentence. For each sentence, the set language model processes the word segments of the sentence to determine the first sub-sentence probability of the sentence, and processes the parts of speech of the word segments to determine the second sub-sentence probability of the sentence.
Then, the sentence probability of the sentence is determined from its first sub-sentence probability and its second sub-sentence probability. Finally, the sentence probabilities of the sentences are compared, and the maximum is taken as the sentence probability corresponding to the concatenated character string. Here, the first sub-sentence probability is determined by counting the word frequency of each word segment in the sentence, and the second sub-sentence probability is determined by counting the part-of-speech frequency of each word segment in the sentence. By introducing both the probability related to the word segments and the probability related to their parts of speech, the sentence probability corresponding to the concatenated character string can be determined more fully and accurately, which supports a more accurate decision on whether the concatenated character string is the text content of a key field containing at least one Chinese word.
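The three character string types distinguished above can be told apart with simple regular expressions. The following Python sketch is only one possible classification; the function name `classify`, the type labels, and the use of the CJK Unified Ideographs range `\u4e00-\u9fff` to detect Chinese words are assumptions made for illustration.

```python
import re


def classify(text):
    """Assign a concatenated string to one of the three types (or "other")."""
    if re.fullmatch(r"[0-9]+", text):
        return "digits"                      # pure digit type
    if re.search(r"[\u4e00-\u9fff]", text):
        return "chinese"                     # contains at least one Chinese character
    if re.search(r"[A-Za-z]", text) and re.search(r"[^A-Za-z0-9\s]", text):
        return "letters_special"             # letters plus at least one special character
    return "other"


for sample in ["123456789012345678", "zhangsan@qq.com", "张三@qq.com", "zhangsan12345678"]:
    print(sample, "->", classify(sample))
```

Classifying first means that each candidate is only checked against the rules of key fields of the same type, which is the saving the description refers to.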

Exemplarily, continuing with the OCR text result shown in Figure 2b, column alignment processing is performed on each character string in the third row, whose number of character strings differs from the number of key fields in the first row. There are three processing methods. The first processing method applies when the split character string is of the pure digit type, such as the character string "12345678" shown in Figure 2b; pure-digit character strings generally appear in mobile phone numbers, ID card numbers, and the like. To process a pure-digit character string, it is concatenated with each character string in the nearest row before its own row, yielding at least one concatenated character string. A length check and a regular expression check are performed on each concatenated character string to determine whether it meets the content format requirement of a key field in the first row; if so, it is determined to be the text content of that key field. For example, the character string "12345678" in the third row is concatenated with the character string "Zhang San" in the second row to form "Zhang San12345678" or "12345678Zhang San". The length check and the regular expression check are performed on both, and neither meets the content format requirement of any key field in the first row.
Then, "12345678" is concatenated with "1234567890" in the second row to form "123456781234567890" or "123456789012345678". The length check shows that both concatenated character strings meet the digit length requirement of the key field "ID card" in the first row. A matching check against the regular expression of the ID card then shows that only "123456789012345678" meets the number format requirement of the ID card, so "123456789012345678" is used as the text content of the ID card. Before the regular expression check is performed, a regular expression is configured in advance for the text content of each key field of each type of form certificate, so that content verification can be performed promptly and effectively. There is then no need to concatenate "12345678" with the other character strings in the second row (such as "zhangsan" or "Longhaijia, Nanshan District, Shenzhen City"), and the matching check for the character string "12345678" in the third row ends.

Alternatively, each character string in the third row is concatenated with each character string in the second row to obtain multiple concatenated character strings. For each concatenated character string, if it is of the pure digit type, the key fields whose content format requirement is the pure digit type are selected from the key fields in the first row, and the length check and the regular expression check are performed on the concatenated character string according to the content format requirement of each such key field. If the checks succeed, the concatenated character string is determined to be the text content of that key field. For example, concatenating "12345678" in the third row with each character string in the second row yields "Zhang San12345678", "12345678Zhang San", "123456781234567890", "123456789012345678", "12345678zhangsan", "zhangsan12345678", "12345678Longhaijia, Nanshan District, Shenzhen City" and "Longhaijia, Nanshan District, Shenzhen City12345678". Take "123456781234567890" and "123456789012345678" as an example: both are of the pure digit type, and among the key fields in the first row only "ID card" has a pure-digit content format requirement. Therefore, according to the content format requirement of the key field "ID card", the length check is performed on both, and both meet the digit length requirement of "ID card". A matching check against the regular expression of the key field "ID card" then shows that only "123456789012345678" meets the number format requirement of the ID card, so "123456789012345678" is used as the text content of the ID card.
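The length check followed by the regular expression check can be sketched as below. The description does not give the regular expression configured for the key field "ID card", so the rule here is a hypothetical one (18 characters, a non-zero leading digit, and a plausible birth date in positions 7 to 14), and the digit fragments are made up for illustration rather than taken from Figure 2b.

```python
import re

# Hypothetical rule for the "ID card" key field; not the rule used in the embodiment.
ID_RULE = {
    "length": 18,
    "pattern": re.compile(
        r"[1-9][0-9]{5}(19|20)[0-9]{2}(0[1-9]|1[0-2])(0[1-9]|[12][0-9]|3[01])[0-9]{3}[0-9Xx]"
    ),
}


def matches_rule(candidate, rule):
    """Length check first, then the regular expression check."""
    if len(candidate) != rule["length"]:
        return False
    return rule["pattern"].fullmatch(candidate) is not None


# Made-up fragments standing in for the two halves of a split ID number.
upper, lower = "440305199001", "011234"
accepted = [c for c in (upper + lower, lower + upper) if matches_rule(c, ID_RULE)]
print(accepted)
```

Both orders pass the length check, but only one order satisfies the pattern, which mirrors how the description selects a single concatenation order for the ID card.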

The second processing method applies when the split character string is of the letter + at least one special character type (without Chinese words), such as the character string "@qq.com" shown in Figure 2b; character strings of this type generally appear in mailboxes and the like. To process such a character string, it is concatenated with each character string in the nearest row before its own row, yielding at least one concatenated character string, and a regular expression check is performed on each concatenated character string to determine whether it meets the content format requirement of a key field in the first row; if so, it is determined to be the text content of that key field. For example, "@qq.com" in the third row is concatenated with "Zhang San" in the second row to form "Zhang San@qq.com" or "@qq.comZhang San". A regular expression check is performed on both, that is, each is matched against the regular expression of each key field in the first row, and neither meets the content format requirement of any key field in the first row. Then, "@qq.com" is concatenated with "1234567890" in the second row to form "@qq.com1234567890" or "1234567890@qq.com", and the same regular expression check shows that neither meets the content format requirement of any key field in the first row. Afterwards, "@qq.com" is concatenated with "zhangsan" in the second row to form "@qq.comzhangsan" or "zhangsan@qq.com". The regular expression check shows that only "zhangsan@qq.com" meets the content format requirement of the key field "mailbox" in the first row, so "zhangsan@qq.com" is used as the text content of the mailbox. There is then no need to concatenate "@qq.com" with the other character strings in the second row (such as "Longhaijia, Nanshan District, Shenzhen City"), and the matching check for the character string "@qq.com" in the third row ends.

Alternatively, each character string in the third row is concatenated with each character string in the second row to obtain multiple concatenated character strings. For each concatenated character string, if it is of the letter + at least one special character type, the key fields whose content format requirement is the letter + at least one special character type are selected from the key fields in the first row, and a regular expression check is performed on the concatenated character string according to the content format requirement of each such key field. If the check succeeds, the concatenated character string is determined to be the text content of that key field. For example, concatenating "@qq.com" in the third row with each character string in the second row yields "Zhang San@qq.com", "@qq.comZhang San", "@qq.com1234567890", "1234567890@qq.com", "zhangsan@qq.com", "@qq.comzhangsan", "@qq.comLonghaijia, Nanshan District, Shenzhen City" and "Longhaijia, Nanshan District, Shenzhen City@qq.com". Take "zhangsan@qq.com" and "@qq.comzhangsan" as an example: both are of the letter + at least one special character type, and among the key fields in the first row only "mailbox" has a content format requirement of that type. Therefore, according to the content format requirement of the key field "mailbox", a regular expression check is performed on "zhangsan@qq.com" and "@qq.comzhangsan" respectively. Only "zhangsan@qq.com" meets the content format requirement of "mailbox", so "zhangsan@qq.com" is used as the text content of the mailbox.
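The mailbox check can be sketched in Python as follows. The pattern is a hypothetical rule for the key field "mailbox" that requires the local part to start with a letter; this assumption is made so that the sketch reproduces the outcome described above, in which "1234567890@qq.com" is rejected. The actual regular expression configured in the embodiment is not given in the description.

```python
import re

# Hypothetical "mailbox" rule: letter-initial local part, "@", then a dotted domain.
MAILBOX_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._-]*@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")


def mailbox_matches(fragment, previous_row):
    """Return every concatenation (both orders) that satisfies the mailbox rule."""
    matches = []
    for text in previous_row:
        for candidate in (text + fragment, fragment + text):
            if MAILBOX_PATTERN.fullmatch(candidate):
                matches.append(candidate)
    return matches


second_row = ["Zhang San", "1234567890", "zhangsan",
              "Longhaijia, Nanshan District, Shenzhen City"]
result = mailbox_matches("@qq.com", second_row)
print(result)
```

A stricter or looser pattern would change which candidates pass, so the rule for each key field needs to be configured with the expected values of that field in mind.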

The third processing method applies when the split character string is of the type containing at least one Chinese word, such as the character string "No. 1818, Building 8, Garden" shown in Figure 2b. To process such a character string, it is concatenated with each character string in the nearest row before its own row, yielding at least one concatenated character string. For each concatenated character string, a set word segmentation tool, such as the Jieba Chinese word segmentation tool (jieba), performs word segmentation and part-of-speech processing to obtain at least one word segment and the part of speech of each word segment. The word segments are then combined by permutation into multiple sentences. For each sentence, a set language model (such as an n-gram language model) processes the sentence to determine its first sub-sentence probability, and processes the parts of speech of the word segments in the sentence to determine its second sub-sentence probability. The sentence probability of the sentence is then determined from its first and second sub-sentence probabilities, the sentence probabilities of the multiple sentences are compared, and the maximum is taken as the sentence probability corresponding to the concatenated character string.
Finally, the sentence probabilities corresponding to the concatenated character strings are compared to find the maximum, and the concatenated character string with the maximum sentence probability is determined to be the text content of a key field whose text content contains at least one Chinese word.

Wherein, when determining the sentence probability of a certain sentence based on the first sub-sentence probability of a certain sentence and the second sub-sentence probability of a certain sentence, the probability of a certain sentence can be determined in the following manner:

$$P = \Omega \times P_w \times (1 - \Omega) \times P_c$$

Here, $P$ represents the sentence probability of a sentence; $P_w$ represents the sentence probability calculated by statistical processing of the word segments in the sentence (that is, the first sub-sentence probability of the sentence); $P_c$ represents the sentence probability calculated by statistical processing of the parts of speech of the word segments in the sentence (that is, the second sub-sentence probability of the sentence); and $\Omega$ represents a weight, for example 0.5, 0.6 or 0.7, which can be set according to the actual application scenario or the experience of those skilled in the art.
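As a quick check of the formula, the following Python sketch evaluates it for the two pairs of sub-sentence probabilities assumed in the worked example of this description (0.8 and 0.75, then 0.4 and 0.35, both with a weight of 0.5). The function name `sentence_probability` is an illustrative choice.

```python
def sentence_probability(p_w, p_c, omega=0.5):
    """Combine the word-based and part-of-speech-based probabilities as a product."""
    return omega * p_w * (1 - omega) * p_c


best = sentence_probability(0.8, 0.75)    # 0.5 * 0.8 * 0.5 * 0.75
other = sentence_probability(0.4, 0.35)   # 0.5 * 0.4 * 0.5 * 0.35
print(round(best, 3), round(other, 3))
```

Because the formula is a product, the factor Ω × (1 − Ω) is the same for every sentence at a fixed weight, so the ranking of sentences is driven by the product of the two sub-sentence probabilities.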

The following is a brief introduction to the n-gram language model:

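The n-gram language model is usually used to describe the probability that a random word sequence is a sentence with normal semantics. Assuming that a sentence is composed of the word segments $W_1, W_2, \ldots, W_n$, its probability is as follows:

$$P(W_1, W_2, \ldots, W_n) = P(W_1) \times P(W_2 \mid W_1) \times \cdots \times P(W_n \mid W_1, \ldots, W_{n-1})$$

The conditional probability in the n-gram language model can be calculated using word frequency:

$$P(W_i \mid W_1, \ldots, W_{i-1}) = \frac{\mathrm{count}(W_1, \ldots, W_{i-1}, W_i)}{\mathrm{count}(W_1, \ldots, W_{i-1})}$$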

For example, assume that a normal semantic string is "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen". After word segmentation and part-of-speech processing by the Jieba Chinese word segmentation tool, the word segments ["Shenzhen", "Nanshan District", "Longhai", "Homeland", "8", "Building", "1818", "No."] are obtained. The probability that these word segments combine into the normal semantic string "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen" is then:

$$
\begin{aligned}
&P(\text{Shenzhen}, \text{Nanshan District}, \text{Longhai}, \text{Homeland}, 8, \text{Building}, 1818, \text{No.}) \\
&= P(\text{Shenzhen}) \times P(\text{Nanshan District} \mid \text{Shenzhen}) \times P(\text{Longhai} \mid \text{Shenzhen}, \text{Nanshan District}) \\
&\quad \times P(\text{Homeland} \mid \text{Shenzhen}, \text{Nanshan District}, \text{Longhai}) \\
&\quad \times P(8 \mid \text{Shenzhen}, \text{Nanshan District}, \text{Longhai}, \text{Homeland}) \\
&\quad \times P(\text{Building} \mid \text{Shenzhen}, \text{Nanshan District}, \text{Longhai}, \text{Homeland}, 8) \\
&\quad \times P(1818 \mid \text{Shenzhen}, \text{Nanshan District}, \text{Longhai}, \text{Homeland}, 8, \text{Building}) \\
&\quad \times P(\text{No.} \mid \text{Shenzhen}, \text{Nanshan District}, \text{Longhai}, \text{Homeland}, 8, \text{Building}, 1818)
\end{aligned}
$$

Wherein, each probability in the above formula can be calculated by counting the word frequency of the word segmentation, specifically, see Table 1.

Table 1

However, the calculation by the method shown in Table 1 above is too complicated. Therefore, the unary model (unigram, n=1), the binary model (bigram, n=2) and the ternary model (trigram, n=3) of the n-gram language model are compared and, in light of actual needs (the relationship between words should be taken into account, but the calculated relationship matrix should not be too sparse), the binary model is selected to calculate the probability of forming a sentence. For the unary model, the formula for calculating the probability is

$$P(W_1, W_2, \ldots, W_n) \approx \prod_{i=1}^{n} P(W_i)$$

The unary model is simple to calculate, but it does not consider the order relationship between words. For the binary model, the formula for calculating the probability is

$$P(W_1, W_2, \ldots, W_n) \approx \prod_{i=2}^{n} P(W_i \mid W_{i-1})$$

The binary model is relatively simple to calculate and considers the order relationship between two adjacent words. For the ternary model, the formula for calculating the probability is

$$P(W_1, W_2, \ldots, W_n) \approx \prod_{i=3}^{n} P(W_i \mid W_{i-2}, W_{i-1})$$

The ternary model considers the order relationship among three words, but the calculated relationship matrix is too sparse to be practical.

It should be noted that, to calculate each probability in the above formula by counting word frequencies, a Chinese corpus is first obtained, and the open source Jieba Chinese word segmentation tool performs word segmentation on the whole corpus. The vocabulary composed of all distinct word segments is denoted $W = \{W_1, W_2, W_3, \ldots, W_n\}$, where $n$ is the size of the vocabulary, that is, the number of distinct word segments obtained after word segmentation of the sentences in the corpus. Next, the transition probabilities of each word segment are counted. Suppose that, over the whole corpus, the word segment $W_i$ appears $\mathrm{count}(W_i)$ times in total, and the word segments that appear immediately after $W_i$ are $W_1$, $W_2$ and $W_3$, appearing $\mathrm{count}(W_i, W_1)$, $\mathrm{count}(W_i, W_2)$ and $\mathrm{count}(W_i, W_3)$ times respectively. Then the transition probabilities from $W_i$ to $W_1$, $W_2$ and $W_3$ are $\mathrm{count}(W_i, W_1)/\mathrm{count}(W_i)$, $\mathrm{count}(W_i, W_2)/\mathrm{count}(W_i)$ and $\mathrm{count}(W_i, W_3)/\mathrm{count}(W_i)$ respectively, and the transition probability from $W_i$ to every other word segment in the vocabulary $W$ is 0. Finally, the word segment transition probability matrix $X$ of dimension $n \times n$ is obtained, in which the value in row $i$ and column $j$ equals the transition probability from $W_i$ to $W_j$, that is, $\mathrm{count}(W_i, W_j)/\mathrm{count}(W_i)$.
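The counting procedure can be sketched in Python with `collections.Counter`. The sketch assumes the corpus has already been segmented into lists of word segments (for example by jieba, which is not called here), stores the matrix sparsely as a dictionary keyed by word segment pairs instead of a dense n × n array, and uses a two-sentence toy corpus that is made up for illustration.

```python
from collections import Counter


def transition_probabilities(sentences):
    """Estimate count(W_i, W_j) / count(W_i) for every adjacent pair in the corpus."""
    unigram, bigram = Counter(), Counter()
    for tokens in sentences:
        unigram.update(tokens)                  # count(W_i)
        bigram.update(zip(tokens, tokens[1:]))  # count(W_i, W_j) for adjacent pairs
    return {pair: n / unigram[pair[0]] for pair, n in bigram.items()}


# Toy pre-segmented corpus (illustrative only).
corpus = [
    ["Shenzhen", "Nanshan District", "Longhai", "Homeland"],
    ["Shenzhen", "Futian District"],
]
probs = transition_probabilities(corpus)
print(probs[("Shenzhen", "Nanshan District")])
print(probs.get(("Longhai", "Shenzhen"), 0.0))  # pairs never seen default to 0
```

Pairs that never occur are simply absent from the dictionary, which corresponds to the zero entries of the matrix X.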

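The probability of generating the normal semantic string "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen" calculated through the binary model can be expressed as:

$$
\begin{aligned}
&P(\text{Shenzhen}, \text{Nanshan District}, \text{Longhai}, \text{Homeland}, 8, \text{Building}, 1818, \text{No.}) \\
&= P(\text{Nanshan District} \mid \text{Shenzhen}) \times P(\text{Longhai} \mid \text{Nanshan District}) \times P(\text{Homeland} \mid \text{Longhai}) \times P(8 \mid \text{Homeland}) \\
&\quad \times P(\text{Building} \mid 8) \times P(1818 \mid \text{Building}) \times P(\text{No.} \mid 1818)
\end{aligned}
$$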

Wherein, each probability in the above formula can be calculated by counting word frequency, specifically, see Table 2.

Table 2

n-gram language model	Probability
P(Nanshan District | Shenzhen)	count(Shenzhen, Nanshan District) / count(Shenzhen)
P(Longhai | Nanshan District)	count(Nanshan District, Longhai) / count(Nanshan District)
P(Homeland | Longhai)	count(Longhai, Homeland) / count(Longhai)
P(8 | Homeland)	count(Homeland, 8) / count(Homeland)
P(Building | 8)	count(8, Building) / count(8)
P(1818 | Building)	count(Building, 1818) / count(Building)
P(No. | 1818)	count(1818, No.) / count(1818)

However, a major problem encountered in natural language processing is the occurrence of unregistered words, that is, words that did not appear in the training set appear in the test set, so that the probability calculated by the language model is 0. For example, if the word segment pair "Longhai, Homeland" does not appear in the data set, the calculation result of count(Longhai, Homeland)/count(Longhai) in Table 2 above is 0; a certain subsequence may also be absent from the training set, which likewise results in a probability of 0. In view of this, the formula for counting word frequency needs to be smoothed, that is, Laplace smoothing is used (adding 1 to the numerator and the denominator at the same time), so that the sentence probability $P_w$ calculated by word frequency can be expressed as:

$$P_w = \prod_{i=2}^{n} \frac{\mathrm{count}(W_{i-1}, W_i) + 1}{\mathrm{count}(W_{i-1}) + 1}$$
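A minimal Python sketch of the smoothed binary model follows. It applies the smoothing literally as described above, adding 1 to both the numerator and the denominator; note that the textbook form of add-one smoothing adds the vocabulary size to the denominator instead. The function name and the toy counts are assumptions made for illustration.

```python
from collections import Counter


def smoothed_bigram_probability(tokens, unigram, bigram):
    """Product of (count(prev, cur) + 1) / (count(prev) + 1) over adjacent pairs."""
    probability = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Counter returns 0 for unseen keys, so unseen pairs no longer give 0 overall.
        probability *= (bigram[(prev, cur)] + 1) / (unigram[prev] + 1)
    return probability


# Toy counts (illustrative only): "Longhai" seen 3 times, never followed by "Homeland".
unigram = Counter({"Longhai": 3, "Homeland": 1})
bigram = Counter()
p_unseen = smoothed_bigram_probability(["Longhai", "Homeland"], unigram, bigram)

bigram[("Longhai", "Homeland")] = 3  # now every "Longhai" is followed by "Homeland"
p_seen = smoothed_bigram_probability(["Longhai", "Homeland"], unigram, bigram)
print(p_unseen, p_seen)
```

The unseen pair now contributes a small non-zero factor instead of forcing the whole sentence probability to 0.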

It should be noted that, in Chinese sentences, the order in which parts of speech appear follows certain regularities. Therefore, when the probability of generating a sentence is calculated through the binary model, the part of speech of each word segment in the sentence is also considered, that is, a probability associated with the parts of speech is introduced. For example, take the sentence "Zhang San has a black car". After the Jieba Chinese word segmentation tool performs word segmentation and part-of-speech processing on the sentence, the word segments and the part of speech of each word segment are obtained, namely [("Zhang San", "person's name"), ("has", "verb"), ("one", "numeral"), ("black", "noun"), ("car", "noun")]. From this example, it can be seen that a person's name is very unlikely to be directly followed by a numeral, and a numeral is very unlikely to be directly followed by a verb. Therefore, calculating the probability of the order of the parts of speech of the words that make up the sentence compensates for the shortcomings of a sentence probability calculated from word frequency alone, so that a sentence that matches the actual situation can be determined more accurately. Or, take the address sentence "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen City" as an example. After the Jieba Chinese word segmentation tool performs word segmentation and part-of-speech processing on the address sentence, the word segments and the part of speech of each word segment are obtained, namely [("Shenzhen", "place name"), ("Nanshan District", "place name"), ("Longhai", "place name"), ("Homeland", "noun"), ("8", "numeral"), ("Building", "numeral"), ("1818", "numeral"), ("No.", "numeral")].
From this example, it can be seen that a place name is very unlikely to be directly followed by a numeral, and a numeral is very unlikely to be directly followed by a place name.

The sentence probability $P_c$ calculated from the parts of speech is likewise computed with the binary model and Laplace smoothing, specifically:

$$P_c = \prod_{i=2}^{n} \frac{\mathrm{count}(C_{i-1}, C_i) + 1}{\mathrm{count}(C_{i-1}) + 1}$$

Here, $C_i$ represents the part of speech of the $i$-th word segment in the sentence.

It should be noted that, to calculate the probability between each two parts of speech in the above formula by counting part-of-speech frequencies, a Chinese corpus is first obtained, and the open source Jieba Chinese word segmentation tool performs word segmentation and part-of-speech processing on the whole corpus to obtain the part-of-speech tag of each word segment. The set of all parts of speech is denoted by the part-of-speech table $C = \{C_1, C_2, C_3, \ldots, C_m\}$, where $m$ is the size of the part-of-speech set. Next, the transition probabilities of each part of speech are counted. Suppose that, over the whole corpus, the part of speech $C_i$ appears $\mathrm{count}(C_i)$ times in total, and the parts of speech that appear immediately after $C_i$ are $C_1$, $C_2$ and $C_3$, appearing $\mathrm{count}(C_i, C_1)$, $\mathrm{count}(C_i, C_2)$ and $\mathrm{count}(C_i, C_3)$ times respectively. Then the transition probabilities from $C_i$ to $C_1$, $C_2$ and $C_3$ are $\mathrm{count}(C_i, C_1)/\mathrm{count}(C_i)$, $\mathrm{count}(C_i, C_2)/\mathrm{count}(C_i)$ and $\mathrm{count}(C_i, C_3)/\mathrm{count}(C_i)$ respectively, and the transition probability from $C_i$ to every other part of speech in the part-of-speech table $C$ is 0. Finally, the part-of-speech transition probability matrix $Y$ of dimension $m \times m$ is obtained, in which the value in row $i$ and column $j$ equals the transition probability from $C_i$ to $C_j$, that is, $\mathrm{count}(C_i, C_j)/\mathrm{count}(C_i)$.

Exemplarily, continuing with Figure 2b, take the splicing of the string "No. 1818, Building 8, Garden" with the string "Longhaijia, Nanshan District, Shenzhen" in the second line as an example, and assume that the spliced string is "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen". After word segmentation and part-of-speech tagging are performed on the spliced string with the Jieba ("stuttering") Chinese word segmentation tool, at least one word segment is obtained, namely ["Shenzhen", "Nanshan District", "Longhai", "Homeland", "8", "Building", "1818", "number"], together with the part of speech of each word segment, namely [("Shenzhen", "place name"), ("Nanshan District", "place name"), ("Longhai", "place name"), ("Homeland", "noun"), ("8", "numeral"), ("Building", "numeral"), ("1818", "numeral"), ("number", "numeral")]. The word segments are then combined into a plurality of sentences by permutation and combination. For one such sentence, "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen", the binary model (i.e., the binary language model) counts the word frequency of each word segment in the sentence to determine the first sub-sentence probability P ₁ , assumed to be P ₁ =0.8, and counts the word frequency of the part of speech of each word segment to determine the second sub-sentence probability P ₂ , assumed to be P ₂ =0.75. Assuming Ω=0.5, the sentence probability of generating this sentence is P=0.5×0.8×(1-0.5)×0.75=0.15. For another sentence, "No. 1818, Building 8, Longhai Homeland, Shenzhen, Nanshan District", the first sub-sentence probability determined by counting word frequencies with the binary model is assumed to be P ₁ =0.4, and the second sub-sentence probability determined by counting part-of-speech frequencies is assumed to be P ₂ =0.35; with Ω=0.5, the sentence probability of generating this sentence is P=0.5×0.4×(1-0.5)×0.35=0.035. Calculation shows that the probabilities of all other sentences are less than 0.15. Comparing the probabilities of the plurality of sentences therefore gives a maximum of 0.15; that is, splicing the string "No. 1818, Building 8, Garden" with the string "Longhaijia, Nanshan District, Shenzhen" to produce "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen" has the highest probability, namely 0.15.

In addition, calculation shows that splicing the string "No. 1818, Building 8, Garden" with any other character string in the second line yields sentence probabilities below 0.15. For example, when it is spliced with the string "Zhang San" in the second line and the spliced string is segmented and permuted, the sentence with the highest probability under the binary model is assumed to be "No. 1818, Building 8, Zhang Sanyuan", but that probability is still less than 0.15. Likewise, when it is spliced with the string "ID card" in the second line and the spliced string is segmented and permuted, the sentence with the highest probability under the binary model is assumed to be "Yuan 8 Building 1818 ID number", but that probability is also less than 0.15. At the same time, it is determined that the sentence "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen" conforms to the text content rules of the key field "home address". Therefore, the spliced string "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen" is determined as the text content of the key field "home address".
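The arithmetic in this example can be reproduced with a short Python sketch. It only combines the two assumed sub-sentence probabilities with the weight Ω in the product form used in this document and picks the candidate with the largest result; the function name `sentence_probability` and the dictionary of candidates are hypothetical, and the probabilities are the assumed values from the example rather than outputs of a trained model.

```python
def sentence_probability(p_word, p_pos, omega=0.5):
    """P = omega * P_1 * (1 - omega) * P_2, with P_1 word-based and P_2 part-of-speech-based."""
    return omega * p_word * (1 - omega) * p_pos

# Candidate sentences from the example, each with its assumed (P_1, P_2).
candidates = {
    "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen": (0.8, 0.75),
    "No. 1818, Building 8, Longhai Homeland, Shenzhen, Nanshan District": (0.4, 0.35),
}

scores = {
    sentence: sentence_probability(p_word, p_pos)
    for sentence, (p_word, p_pos) in candidates.items()
}
# The spliced string is scored by its most probable word ordering.
best_sentence = max(scores, key=scores.get)
best_score = scores[best_sentence]
```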

Based on this, for a text line whose number of character strings differs from the number of key fields of this type of form certificate, the complete text content of each key field can be obtained after processing according to the above three processing methods, that is, after column alignment processing. For example, FIG. 3 is a schematic diagram of text content after column alignment processing provided by an embodiment of the present invention. It can be seen from Figure 3 that the text content of the key field "name" is "Zhang San", the text content of the key field "ID card" is "123456789012345678", the text content of the key field "mailbox" is "zhangsan@qq.com", and the text content of the key field "home address" is "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen".

Step 104, determining the text content of each key field as the second text content of the type of form certificate image.

In the embodiment of the present invention, after the text content of each key field is determined as the second text content of a type of form certificate image file, the text content, which is in key-value form in that type of form certificate image file, needs to be organized conveniently and accurately into content that conforms to a set data format (such as the Json data format). To this end, for text content in which the number of character strings in a text line does not match the number of key fields of this type of form certificate (that is, the column misalignment problem), this solution performs the corresponding column alignment processing on the character strings in the text line, constructs a valid directed acyclic graph, and then traverses each node in the directed acyclic graph in turn with the depth-first search algorithm to obtain data content that conforms to the set data format. Specifically, the key field at the starting position in the type of form certificate is used as the starting point for construction, and the directed acyclic graph of the type of form certificate is constructed according to the second text content. Then, with the node at the starting position of the directed acyclic graph as the starting point of traversal, each node in the directed acyclic graph is traversed sequentially through the depth-first search algorithm, so that the data content conforming to the set data format is obtained.

When each node in the directed acyclic graph is traversed sequentially through the depth-first search algorithm, for any node in the directed acyclic graph, when it is determined that the node is not an empty node, it is determined whether the node exists in the key field library, where the key field library is used to store the key fields of various types of form certificates. If so, the node is determined to be the first key field node, and the first node adjacent to the first key field node is traversed along the first traversal direction; when it is determined that the first node is not an empty node and does not exist in the key field library, the first node is determined as the text content of the first key field node, and traversal stops when an empty node appears in the first traversal direction. The first traversal direction is top-down. In this way, once the node is determined to be the first key field node (that is, a key field of a certain type of form certificate), traversing the node adjacent below it in the top-down direction promptly finds the value of the first key field node, that is, its text content. Updating the key-value content corresponding to the set data format with this text content then yields the data content of the set data format mapped to the first key field. In the key field library, the first key field node serves as a key and corresponds to a key in the key-value content of the set data format.

After traversal stops because an empty node appears in the first traversal direction, the second node adjacent to the first key field node is traversed along the second traversal direction. When it is determined that the second node is not an empty node and exists in the key field library, the second node is determined as the second key field node, and the third node adjacent to the second key field node is traversed along the first traversal direction; if it is determined that the third node is not an empty node and does not exist in the key field library, the third node is determined as the text content of the second key field node, and traversal stops when an empty node appears in the first traversal direction. Then, after traversal in the first traversal direction stops, the fourth node adjacent to the second key field node is traversed along the second traversal direction, until an empty node appears in the second traversal direction. The second traversal direction is from left to right. In this way, if the second node is not an empty node and exists in the key field library, the second node is determined to be the second key field node (that is, a key field of a certain type of form certificate), so the third node adjacent to the second key field node can be traversed in the top-down direction to promptly find the value of the second key field node, that is, its text content. Updating the key-value content corresponding to the set data format with the text content of the second key field node then yields the data content of the set data format mapped to the second key field. In the key field library, the second key field node serves as a key and corresponds to a key in the key-value content of the set data format.

Exemplarily, taking FIG. 3 as an example, a directed acyclic graph as shown in FIG. 4 can be constructed according to each key field in FIG. 3 and the text content of each key field. The data of each node in Figure 4 can be {"tableValue": "the value in the table, such as name", "rightNode": "the node to the right of the current node", "belowNode": "the node below the current node"}.
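A node of this kind, and a graph with the layout described for Figure 4, might be built as in the following Python sketch. It assumes the simplest layout, one header row of key fields with a single value cell below each key; the class name `Node`, the function `build_graph` and the snake_case attribute names are hypothetical stand-ins for tableValue, rightNode and belowNode.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    table_value: str                       # tableValue: text of the table cell
    right_node: Optional["Node"] = None    # rightNode: node to the right, if any
    below_node: Optional["Node"] = None    # belowNode: node below, if any

def build_graph(columns):
    """Chain key nodes left to right and hang each value below its key.

    columns: list of (key_field, value) pairs in left-to-right order.
    Returns the root node (the key field at the starting position), or None.
    """
    root = None
    previous = None
    for key, value in columns:
        key_node = Node(key, below_node=Node(value))
        if previous is None:
            root = key_node
        else:
            previous.right_node = key_node
        previous = key_node
    return root

# Column-aligned content taken from the Figure 3 example.
root = build_graph([
    ("name", "Zhang San"),
    ("ID card", "123456789012345678"),
    ("mailbox", "zhangsan@qq.com"),
    ("home address", "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen"),
])
```

An empty node is represented here simply by `None`, so "traversal stops when an empty node appears" becomes a check for a missing neighbour.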

For the directed acyclic graph shown in FIG. 4, the depth-first search (DFS) algorithm is used to traverse from the root node, that is, from "tableValue=name". For any node in the directed acyclic graph, it is first determined whether the node is an empty node; if not, it is determined whether the node exists in the keymap, that is, tableValue is matched against the keyname entries in the keymap. For example, when the root node is traversed, "tableValue=name" is used to query the keymap for "keyname=name"; if it exists, the match is {"keyname": "name", "jsonkey": "name"}. At this point, the json key corresponding to the node (the value of "jsonkey" in the keymap, here "name") can be written into the data content of the json data format, so the current data content of the json data format is {"name": Null}. Next, only the value corresponding to this key in the json needs to be found.

The technical solution in the embodiment of the present invention configures a key field mapping array for each type of form certificate in advance, that is, it stores the key field mappings of each type of form certificate in the key field library. Taking the type of form certificate shown in Figure 2a as an example, a keymap can be formed by extracting the key fields in the header of this type of form certificate; that is, the keymap of this type of form certificate can be configured as:

Keymap = [
    {"keyname": "name", "jsonkey": "name"},
    {"keyname": "ID card", "jsonkey": "idNo"},
    {"keyname": "mailbox", "jsonkey": "email"},
    {"keyname": "home address", "jsonkey": "address"},
];

Here, keyname corresponds to the name of a header field in this type of form certificate, and jsonkey corresponds to the key in the finally organized data content in json data format. The keymap is updated as follows: when a field name in the original form certificate of a certain type is changed or added, the new field name is simply inserted into the keymap; when a field name in the original form certificate is deleted, the keymap is not updated. In addition, the keymap does not need to be stored in multiple versions, because it retains all known keys. For each type of form certificate, only the latest keymap needs to be kept for processing that type of form certificate image.
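The append-only update rule can be illustrated with a small Python sketch. The helper `update_keymap` and the `json_keys` argument, which supplies the json key for each newly seen header field, are hypothetical; the text does not say how a jsonkey is chosen for a new field.

```python
def update_keymap(keymap, header_fields, json_keys):
    """Insert mappings for new header fields; never remove existing ones."""
    known = {entry["keyname"] for entry in keymap}
    for field in header_fields:
        if field not in known:
            keymap.append({"keyname": field, "jsonkey": json_keys[field]})
            known.add(field)
    return keymap

keymap = [
    {"keyname": "name", "jsonkey": "name"},
    {"keyname": "ID card", "jsonkey": "idNo"},
]
# In the new form version "ID card" was deleted and "mailbox" was added.
update_keymap(keymap, ["name", "mailbox"], {"mailbox": "email"})
keynames = [entry["keyname"] for entry in keymap]
```

Keeping the deleted "ID card" entry is what lets a single keymap serve both old and new versions of the same form type.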

Then, following the depth-first search algorithm, the node below is found through belowNode. It is first determined whether this node is an empty node; if not, it is determined whether it exists in the keymap. If it does not exist in the keymap, the node is the value node of the previous node, that is, the value of the previous node (because the value node of a key can only appear to its right or below it); if it is an empty node, this branch has been fully traversed. For example, the node "tableValue=Zhang San" below the root node "tableValue=name" is traversed and found to be neither an empty node nor present in the keymap, so it is determined to be the value node of the previous node; that is, the node "tableValue=Zhang San" is the value of the root node "tableValue=name". At this point, the data content {"name": Null} in the json data format can be updated, so that the data content in the json data format becomes {"name": "Zhang San"}.

Next, the depth-first search algorithm continues to find the node below through belowNode. It is first determined whether this node is an empty node; if not, it is determined whether it exists in the keymap, and if it does not exist in the keymap, it is the value node of the previous node. If it is an empty node, the nodes in this traversal direction (that is, this traversal branch) have all been traversed. For example, the node below the node "tableValue=Zhang San" is traversed and found to be an empty node, so this traversal branch is determined to have been fully traversed. The traversal then returns to the starting node of the branch, that is, the root node "tableValue=name", and traverses the node to the right of the root node through rightNode, that is, the node "tableValue=ID card". It is first determined whether the node is an empty node; if not, it is determined whether the node exists in the keymap, that is, "tableValue=ID card" is used to query the keymap for "keyname=ID card"; if it exists, the match is {"keyname": "ID card", "jsonkey": "idNo"}. At this point, the json key corresponding to the node (the value of "jsonkey" in the keymap, here "idNo") can be written into the data content of the json data format, that is, the entry {"idNo": Null} is added to the data content of the json data format. Next, only the value corresponding to this key in the json needs to be found. Then, the node "tableValue=123456789012345678" below the node "tableValue=ID card" is traversed and found to be neither an empty node nor present in the keymap, so it is determined to be the value node of the previous node; that is, the node "tableValue=123456789012345678" is the value of the node "tableValue=ID card". At this time, the entry {"idNo": Null} in the json data format can be updated, so that it becomes {"idNo": "123456789012345678"}. Next, the node below the node "tableValue=123456789012345678" is traversed and found to be an empty node, so this traversal branch is determined to have been fully traversed. The traversal then returns to the starting node of the branch, that is, the node "tableValue=ID card", and traverses the node to its right through rightNode, that is, the node "tableValue=mailbox".

When the node "tableValue=mailbox" is traversed, it is first determined whether the node is an empty node; if not, it is determined whether the node exists in the keymap, that is, "tableValue=mailbox" is used to query the keymap for "keyname=mailbox"; if it exists, the match is {"keyname": "mailbox", "jsonkey": "email"}. At this time, the json key corresponding to the node (the value of "jsonkey" in the keymap, here "email") can be written into the data content of the json data format, that is, the entry {"email": Null} is added to the data content of the json data format. Next, only the value corresponding to this key in the json needs to be found. Then, the node "tableValue=zhangsan@qq.com" below the node "tableValue=mailbox" is traversed and found to be neither an empty node nor present in the keymap, so it is determined to be the value node of the previous node; that is, the node "tableValue=zhangsan@qq.com" is the value of the node "tableValue=mailbox". At this time, the entry {"email": Null} in the json data format can be updated, so that it becomes {"email": "zhangsan@qq.com"}. Then, the node below the node "tableValue=zhangsan@qq.com" is traversed and found to be an empty node, so this traversal branch is determined to have been fully traversed. The traversal then returns to the starting node of the branch, that is, the node "tableValue=mailbox", and traverses the node to its right through rightNode, that is, the node "tableValue=home address".

When the node "tableValue=home address" is traversed, it is first determined whether the node is an empty node; if not, it is determined whether the node exists in the keymap, that is, "tableValue=home address" is used to query the keymap for "keyname=home address"; if it exists, the match is {"keyname": "home address", "jsonkey": "address"}. At this time, the json key corresponding to the node (the value of "jsonkey" in the keymap, here "address") can be written into the data content of the json data format, that is, the entry {"address": Null} is added to the data content of the json data format. Next, only the value corresponding to this key in the json needs to be found. Then, the node "tableValue=No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen" below the node "tableValue=home address" is traversed and found to be neither an empty node nor present in the keymap, so it is determined to be the value node of the previous node; that is, the node "tableValue=No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen" is the value of the node "tableValue=home address". At this time, the entry {"address": Null} in the json data format can be updated, so that it becomes {"address": "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen"}. Next, the node below the node "tableValue=No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen" is traversed and found to be an empty node, so this traversal branch is determined to have been fully traversed. The traversal then returns to the starting node of the branch, that is, the node "tableValue=home address", and traverses the node to its right through rightNode. The node to the right of the node "tableValue=home address" is found to be an empty node, so this traversal branch is also determined to have been fully traversed, and the traversal of the directed acyclic graph ends. The mapped data content in json data format is thereby obtained: {"name": "Zhang San", "idNo": "123456789012345678", "email": "zhangsan@qq.com", "address": "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen"}.
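The walk-through above can be condensed into the following Python sketch. It is an illustration under stated assumptions rather than the embodiment's code: it handles only the Figure 4 layout (values below a key, further keys to its right), the names `Node`, `traverse` and `KEYMAP` are hypothetical, an empty node is modelled as `None`, and Python's `None` stands in for the Null placeholder.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    table_value: str
    right_node: Optional["Node"] = None
    below_node: Optional["Node"] = None

KEYMAP = [
    {"keyname": "name", "jsonkey": "name"},
    {"keyname": "ID card", "jsonkey": "idNo"},
    {"keyname": "mailbox", "jsonkey": "email"},
    {"keyname": "home address", "jsonkey": "address"},
]

def traverse(node, lookup, result):
    """Depth-first walk: go down first to find the value, then right to the next key."""
    if node is None or node.table_value not in lookup:
        return                               # empty node or not a key field: branch ends
    json_key = lookup[node.table_value]
    result[json_key] = None                  # key written, value not yet found
    below = node.below_node
    while below is not None and below.table_value not in lookup:
        if result[json_key] is None:         # Figure 4 has one value per key;
            result[json_key] = below.table_value  # extra cells are ignored here
        below = below.below_node
    traverse(node.right_node, lookup, result)

# The graph of Figure 4, built by hand from right to left.
address = Node("home address", below_node=Node(
    "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen"))
mailbox = Node("mailbox", right_node=address, below_node=Node("zhangsan@qq.com"))
id_card = Node("ID card", right_node=mailbox, below_node=Node("123456789012345678"))
root = Node("name", right_node=id_card, below_node=Node("Zhang San"))

lookup = {entry["keyname"]: entry["jsonkey"] for entry in KEYMAP}
result = {}
traverse(root, lookup, result)
```

Turning the keymap into a dictionary keyed by keyname makes each "does this node exist in the keymap" check a constant-time lookup; a layout in which values sit to the right of their keys would need an additional branch that this sketch does not attempt.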

In addition, starting from the image of a certain type of form certificate shown in FIG. 2a, if the content format of this type of form certificate changes, the result is, for example, the other form certificate image shown in FIG. 5. The form certificate image shown in Figure 5 is then processed in the same way as the form certificate image shown in Figure 2a (for the specific processing, refer to the processing of the form certificate image shown in Figure 2a, which will not be repeated here), and another directed acyclic graph, shown in Figure 6, is constructed from the contents of the key fields obtained after column alignment processing. For the traversal of each node in the directed acyclic graph shown in FIG. 6, reference may be made to the above traversal of each node in the directed acyclic graph shown in FIG. 4, which will not be repeated here. Based on the directed acyclic graph shown in Figure 6, the mapped data content in json data format can likewise be obtained: {"name": "Zhang San", "idNo": "123456789012345678", "email": "zhangsan@qq.com", "address": "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen"}.

Based on the same technical concept, FIG. 7 exemplarily shows a text recognition device for a form certificate image document provided by an embodiment of the present invention, and the device can execute a flow of a text recognition method for a form certificate image document.

As shown in Figure 7, the device includes:

The recognition unit 701 is configured to, for any type of form certificate image, identify the first text content of the type of form certificate image by performing text content recognition on the type of form certificate image;

The processing unit 702 is configured to: when it is determined that the number of character strings in the first text line in the first text content is different from the number of key fields of the type of form certificate, splice any character string in the first text line with each character string in the second text line, and verify the spliced character string, where the second text line is the nearest line before the first text line in the first text content; when any spliced character string conforms to the text content rules of any key field in the type of form certificate, determine the spliced character string as the text content of the key field; and determine the text content of each key field as the second text content of the type of form certificate image.

Optionally, the processing unit 702 is specifically configured to:

Optionally, the set language model is a binary model;

The sentence probability of the sentence satisfies the following form:

P = Ω × P _w × (1 - Ω) × P _c

Optionally, the processing unit 702 is further configured to:

After the text content of each key field is determined as the second text content of the type of form certificate image file, use the key field at the starting position in the type of form certificate as the starting point for construction, and construct a directed acyclic graph of the type of form certificate according to the second text content;

Optionally, the processing unit 702 is specifically configured to:

Optionally, the processing unit 702 is further configured to:

After traversal stops because an empty node appears in the first traversal direction, traverse the second node adjacent to the first key field node along the second traversal direction, and, when it is determined that the second node is not an empty node and exists in the key field library, determine the second node as a second key field node;

Based on the same technical concept, the embodiment of the present invention also provides a computing device, as shown in FIG. 8, including at least one processor 801 and a memory 802 connected to the at least one processor. The embodiment of the present invention does not limit the specific connection medium between the processor 801 and the memory 802; in FIG. 8, a bus connection between the processor 801 and the memory 802 is taken as an example. The bus can be divided into an address bus, a data bus, a control bus, and so on. In the embodiment of the present invention, the memory 802 stores instructions that can be executed by the at least one processor 801, and the at least one processor 801 executes the instructions stored in the memory 802 to perform the steps included in the aforementioned text recognition method for a form certificate image file.

The processor 801 is the control center of the computing device. It can use various interfaces and lines to connect the various parts of the computing device, and it realizes data processing by running or executing the instructions stored in the memory 802 and calling the data stored in the memory 802. Optionally, the processor 801 may include one or more processing units, and the processor 801 may integrate an application processor and a modem processor. The modem processor mainly handles issuing instructions. It can be understood that the foregoing modem processor may also not be integrated into the processor 801. In some embodiments, the processor 801 and the memory 802 can be implemented on the same chip; in other embodiments, they can be implemented on separate chips.

The processor 801 can be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the text recognition method for a form certificate image file can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.

The memory 802, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules. The memory 802 may include at least one type of storage medium, for example, flash memory, hard disk, multimedia card, card memory, random access memory (Random Access Memory, RAM), static random access memory (Static Random Access Memory, SRAM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic memory, magnetic disk, optical disc, and so on. The memory 802 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 802 in the embodiment of the present invention may also be a circuit or any other device capable of implementing a storage function, and is used for storing program instructions and/or data.

Based on the same technical idea, an embodiment of the present invention also provides a computer-readable storage medium, which stores a computer program executable by a computing device; when the program is run on the computing device, the computing device is caused to execute the steps of the above text recognition method for a form certificate image file.

Those skilled in the art should understand that the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein. Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of this application and their equivalent technologies, the present invention also intends to include these modifications and variations.

Claims

A text recognition method for a form certificate image file, characterized in that it includes:

For any type of form certificate image, by performing text content recognition on the form certificate image of the type, determine the first text content of the form certificate image;

When it is determined that the number of character strings in the first text line in the first text content is different from the number of key fields of the type of form certificate, splicing any character string in the first text line with each character string in the second text line, and verifying the spliced character string; the second text line is the nearest line before the first text line in the first text content;

When any concatenated character string conforms to the text content rules of any key field in the type of form certificate, determine the concatenated character string as the text content of the key field;

The text content of each key field is determined as the second text content of the type of form certificate image file.
The method according to claim 1, wherein determining that the spliced character string conforms to the text content rules of any key field in the type of form certificate includes:

If the spliced character string is of the pure-number type, determining, from the key fields, the key fields whose text content is of the pure-number type;

For any key field whose text content is of the pure-number type, performing a length check and a regular expression check on the spliced character string according to the text content rules of the key field, so as to determine whether the spliced character string conforms to the text content rules of the key field.
The method according to claim 1, wherein determining that the spliced character string conforms to the text content rules of any key field in the type of form certificate includes:

If the spliced character string is of the letter + at least one special character type, determining, from the key fields, the key fields whose text content is of the letter + at least one special character type;

For any key field whose text content is of the letter + at least one special character type, performing a regular expression check on the spliced character string according to the text content rules of the key field, so as to determine whether the spliced character string conforms to the text content rules of the key field.
The method according to claim 1, wherein determining that the spliced character string conforms to the text content rules of any key field in the type of form certificate includes:

If the spliced character string is of the type containing at least one Chinese word, determining, from the key fields, the key fields whose text content is of the type containing at least one Chinese word;

Processing the spliced character string through a set language model, and determining the sentence probability that the spliced character string conforms to the text content rules of any key field whose text content is of the type containing at least one Chinese word, so as to determine whether the spliced character string conforms to the text content rules of the key field.
The method according to claim 4, wherein processing the spliced character string through the set language model to determine the sentence probability that the spliced character string conforms to the text content rule of any key field whose text content is of the type containing at least one Chinese word comprises:

Performing word segmentation on the spliced character string through a set word segmentation tool to obtain at least one word segment and the part of speech of each word segment in the at least one word segment;

Combining the at least one word segment into at least one sentence by permutation and combination;

For each sentence, processing each word segment in the sentence through the set language model to determine a first sub-sentence probability of the sentence, and processing the part of speech of each word segment in the sentence through the set language model to determine a second sub-sentence probability of the sentence; determining the sentence probability of the sentence based on the first sub-sentence probability and the second sub-sentence probability of the sentence; wherein the first sub-sentence probability is determined from the word frequency statistics of each word segment in the sentence, and the second sub-sentence probability is determined from the frequency statistics of the part of speech of each word segment in the sentence;

Comparing the sentence probabilities of the at least one sentence to determine the maximum sentence probability, and determining the maximum sentence probability as the sentence probability corresponding to the spliced character string.
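The permutation-and-scoring steps above can be sketched as follows. This is an illustration under stated assumptions: the bigram tables, the `FLOOR` value for unseen pairs, the weight `OMEGA`, and the English toy tokens are all made up, and word segmentation is assumed to have already produced `(word, part of speech)` pairs. The two sub-sentence probabilities are combined in the product form printed in this document.

```python
from itertools import permutations

# Toy bigram probabilities; every value here is invented for illustration.
WORD_BIGRAM = {("<s>", "registered"): 0.5, ("registered", "address"): 0.6,
               ("<s>", "address"): 0.2, ("address", "registered"): 0.05}
POS_BIGRAM = {("<s>", "adj"): 0.4, ("adj", "noun"): 0.7,
              ("<s>", "noun"): 0.5, ("noun", "adj"): 0.1}
FLOOR = 1e-6   # assumed probability for bigrams absent from the tables
OMEGA = 0.5    # assumed weight


def chain_probability(tokens, table):
    """Multiply bigram probabilities along the token sequence, starting at <s>."""
    prob, prev = 1.0, "<s>"
    for tok in tokens:
        prob *= table.get((prev, tok), FLOOR)
        prev = tok
    return prob


def sentence_probability(sentence):
    """Combine word-level and part-of-speech-level probabilities of one ordering."""
    p_w = chain_probability([word for word, _ in sentence], WORD_BIGRAM)
    p_c = chain_probability([pos for _, pos in sentence], POS_BIGRAM)
    return OMEGA * p_w * (1 - OMEGA) * p_c   # form as printed in the document


def best_sentence(segments):
    """Score every ordering of the segments and keep the most probable one."""
    candidates = [list(order) for order in permutations(segments)]
    best = max(candidates, key=sentence_probability)
    return sentence_probability(best), best
```

Enumerating all permutations grows factorially with the number of word segments, so a sketch like this is only practical for the short strings typical of a form field.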
The method according to claim 5, wherein the set language model is a bigram model;

The sentence probability of the sentence satisfies the following form:

$P = \Omega \times P_w \times (1 - \Omega) \times P_c$

Wherein the first sub-sentence probability, determined by statistically processing each word segment in the sentence through the bigram model, satisfies the following form:

The second sub-sentence probability, determined by statistically processing the part of speech of each word segment in the sentence through the bigram model, satisfies the following form:

Wherein $P$ represents the sentence probability of the sentence, $P_w$ represents the first sub-sentence probability of the sentence, $P_c$ represents the second sub-sentence probability of the sentence, $\Omega$ represents the weight, $W_i$ represents any word segment in the sentence, and $C_i$ represents the part of speech of any word segment in the sentence.
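The expressions for the two sub-sentence probabilities are not reproduced in the text of this claim. For reference only, a two-term (bigram) model conventionally factorizes a sequence probability into conditional probabilities estimated from counts. The block below shows that textbook form; the sentence length $n$, the start symbols $W_0$ and $C_0$, and the $\operatorname{count}(\cdot)$ notation are assumptions, and the block should not be read as the formulas of the claim itself.

```latex
\[
P_w = \prod_{i=1}^{n} P(W_i \mid W_{i-1}), \qquad
P(W_i \mid W_{i-1}) \approx \frac{\operatorname{count}(W_{i-1} W_i)}{\operatorname{count}(W_{i-1})}
\]
\[
P_c = \prod_{i=1}^{n} P(C_i \mid C_{i-1}), \qquad
P(C_i \mid C_{i-1}) \approx \frac{\operatorname{count}(C_{i-1} C_i)}{\operatorname{count}(C_{i-1})}
\]
```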
The method according to any one of claims 1 to 6, wherein, after the text content of each key field is determined as the second text content of the form certificate image of the type, the method further comprises:

Taking the key field at the starting position in the form certificate of the type as the construction starting point, and constructing a directed acyclic graph of the form certificate of the type according to the second text content;

Taking the node at the starting position in the directed acyclic graph as the traversal starting point, and traversing each node in the directed acyclic graph in turn through a depth-first search algorithm, so as to obtain data content conforming to a set data format.
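A depth-first traversal of such a graph can be sketched as follows. The adjacency-dictionary representation, the node labels, and the sample values are all hypothetical; the source does not specify how the graph is stored.

```python
def dfs_order(graph, start):
    """Return the nodes reachable from start in depth-first (pre-order) sequence."""
    visited, order = set(), []

    def visit(node):
        if node in visited:
            return
        visited.add(node)
        order.append(node)
        for successor in graph.get(node, []):
            visit(successor)

    visit(start)
    return order


# Hypothetical form: each key field points first to its value, then to the next key field.
form_graph = {
    "Name": ["Li Lei", "Date of birth"],
    "Date of birth": ["1990-01-01"],
}
```

With this graph, `dfs_order(form_graph, "Name")` visits each key field, then its value, before moving on to the next key field, which is the order needed to emit key/value pairs.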
The method according to claim 7, wherein taking the node at the starting position in the directed acyclic graph as the traversal starting point and traversing each node in the directed acyclic graph in turn through a depth-first search algorithm comprises:

For any node in the directed acyclic graph, when it is determined that the node is not an empty node, determining whether the node exists in a key field library; the key field library is used to store the key fields of various types of form certificates;

If so, determining that the node is a first key field node, traversing a first node adjacent to the first key field node along a first traversal direction, and, when it is determined that the first node is not an empty node and does not exist in the key field library, determining the first node as the text content of the first key field node, the traversal stopping when an empty node appears in the first traversal direction; wherein the first traversal direction is from top to bottom.
The method according to claim 8, wherein, after the traversal stops when an empty node appears in the first traversal direction, the method further comprises:

Traversing a second node adjacent to the first key field node along a second traversal direction, and, when it is determined that the second node is not an empty node and exists in the key field library, determining the second node as a second key field node;

Traversing a third node adjacent to the second key field node along the first traversal direction, and, when it is determined that the third node is not an empty node and does not exist in the key field library, determining the third node as the text content of the second key field node, the traversal in the first traversal direction stopping when an empty node appears in the first traversal direction; after the traversal in the first traversal direction stops, traversing a fourth node adjacent to the second key field node along the second traversal direction, the traversal stopping when an empty node appears in the second traversal direction; wherein the second traversal direction is from left to right.
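The two traversal directions can be sketched on a simplified layout. The sketch assumes the form is a rectangular grid in which `None` marks an empty node and the key fields sit in the first row; the grid, the `KEY_FIELD_LIBRARY` contents, and the sample values are hypothetical. Each key field is followed downward until an empty node appears, and only then does the walk move right to the next key field.

```python
KEY_FIELD_LIBRARY = {"Name", "Date of birth", "Address"}  # hypothetical contents


def extract_fields(grid, library):
    """Map each key field in the first row to the text content found below it."""
    result = {}
    if not grid:
        return result
    col = 0
    # Second traversal direction: left to right, until an empty node appears.
    while col < len(grid[0]) and grid[0][col] is not None:
        key = grid[0][col]
        if key not in library:
            break
        values = []
        row = 1
        # First traversal direction: top to bottom, until an empty node appears.
        while (row < len(grid) and grid[row][col] is not None
               and grid[row][col] not in library):
            values.append(grid[row][col])
            row += 1
        result[key] = values
        col += 1
    return result


sample_grid = [
    ["Name", "Date of birth", None],
    ["Li Lei", "1990-01-01", None],
    [None, None, None],
]
```

Calling `extract_fields(sample_grid, KEY_FIELD_LIBRARY)` yields a dictionary keyed by field name, one possible stand-in for the set data format.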
A computing device, comprising at least one processor and at least one memory, wherein the memory stores a computer program which, when executed by the processor, causes the processor to perform the method according to any one of claims 1 to 9.