WO2023087702A1 - Text recognition method for form certificate image file, and computing device - Google Patents

Info

Publication number
WO2023087702A1
Authority
WO
WIPO (PCT)
Prior art keywords
node
sentence
key field
text content
type
Prior art date
Application number
PCT/CN2022/100026
Other languages
French (fr)
Chinese (zh)
Inventor
郎志刚
付勇
范增虎
Original Assignee
深圳前海微众银行股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海微众银行股份有限公司
Publication of WO2023087702A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • Embodiments of the present invention relate to the field of financial technology (Fintech), and in particular to a text recognition method and a computing device for form certificate image files.
  • The coordinate values of the four corners of each cell of this type of form certificate are introduced in the process of defining the rows and columns of the content template corresponding to that type of form certificate, so as to recognize the text content in the image file of this type of form certificate.
  • For an image of this type of form certificate uploaded by a user, recognition may first segment each cell based on the coordinate values of the four corners of each cell of this type of form certificate, then perform OCR on each segmented cell to obtain the text content in each cell, and then parse and structure the text content recognized for each cell through the content template, so that the complete content information corresponding to each key field in this type of form certificate can be obtained.
  • With this solution, however, if there is scanning deviation or shooting deviation, or if the position of the fields in the image file of this type of form certificate changes, the cells segmented according to the coordinate values of the four corners of each cell will be inaccurate,
  • and the recognized text content corresponding to at least one key field will therefore also be inaccurate.
  • The existing scheme configures, for each type of form certificate, a content template that meets the content format requirements of that type, and the content template corresponding to each type of form certificate is fixed. When the content of a certain type of form certificate changes, the content template corresponding to that type must be redesigned and redeveloped, so versatility is poor and maintenance cost is high.
  • The content of house property right certificates defined in different cities varies greatly, so a content template has to be configured for the house property right certificates defined in each city. As a result, the development cycle of the content templates corresponding to house property right certificates is long and the labor cost is high.
  • In a first aspect, an embodiment of the present invention provides a method for text recognition of a form certificate image file, including:
  • when it is determined that the number of character strings in the first text line of the first text content is different from the number of key fields that the type of form certificate has, performing splicing processing on any character string in the first text line with each character string in the second text line, and verifying the spliced string;
  • the second text line is the nearest line preceding the first text line in the first text content;
  • the text content of each key field is determined as the second text content of the type of form certificate image file.
  • The prior art scheme configures, for each type of form certificate (such as a house property right certificate), a content template that meets the content format requirements of that type. Once the content of a certain type of form certificate changes, the content template corresponding to that type must be redesigned and redeveloped, so versatility is poor and maintenance cost is high. In addition, if there is scanning or shooting deviation, or if the position of the fields in the image of this type of form certificate changes, the recognized text content corresponding to at least one key field in the image will also be inaccurate.
  • The technical solution of the present invention automatically judges whether, in the recognized text content of an image of a certain type of form certificate, the number of character strings in one or several text lines differs from the number of key fields of that type of form certificate. When such a difference is found, the affected text lines are processed automatically according to the corresponding processing rules.
  • This effectively ensures that the text content corresponding to each key field of this type of form certificate is accurate, and avoids inaccurate recognized text content for the key fields caused by scanning deviation, shooting deviation, or a change in the position of the fields in the image of this type of form certificate.
  • For any type of form certificate image, the first text content of the image can be determined by performing text content recognition on it. When it is determined that the number of character strings in the first text line of the first text content differs from the number of key fields of this type of form certificate, any character string in the first text line is spliced with each character string in the second text line, and each spliced string is verified. If a spliced string conforms to the text content rules of a key field in this type of form certificate, that spliced string is determined as the text content of the key field, so that the text content of each key field can be determined.
  • The scheme only needs to splice each character string of a text line whose string count does not match with each character string in the nearest preceding text line, and to determine whether a spliced string complies with the text content rules of a key field. The text content of that key field can then be determined accurately, which effectively improves the accuracy of recognizing the text content in the form certificate image.
  • No content template needs to be configured for each type of form certificate, and there is no need to introduce the coordinate values of the four corners of each cell while defining the rows and columns of a content template for each type. The solution is therefore highly versatile and can meet the needs of different usage scenarios: it can handle different form certificates defined in different regions as well as a type of form certificate whose content has changed, and it reduces the maintenance cost of configuring different content templates for different types of form certificates.
  • if the spliced string is of the pure-digit type, determining, from among the key fields, the key fields whose text content is of the pure-digit type;
  • whether the spliced character string conforms to the text content rules of any key field in the type of form certificate is determined in the following manner:
  • a regular expression check is performed on the spliced string according to the text content rules of the key field, so as to determine whether the spliced string complies with the text content rules of said key field.
  • whether the spliced character string conforms to the text content rules of any key field in the type of form certificate is determined in the following manner:
  • if the spliced string is of the type containing at least one Chinese word, determining, from among the key fields, the key fields whose text content is of the type containing at least one Chinese word;
  • If a spliced string is of the type containing at least one Chinese word, it only needs to be processed by the set language model to determine the sentence probability that the spliced string conforms to the text content rule of a key field whose text content is of the type containing at least one Chinese word. Based on that sentence probability, it can be accurately determined whether the spliced string is the text content of such a key field. This improves the efficiency of determining the text content of the key field and the recognition accuracy for key fields of this type in the form certificate image file.
  • Processing the spliced string with the set language model to determine the sentence probability that the spliced string conforms to the text content rule of any key field whose text content is of the type containing at least one Chinese word includes:
  • for each sentence, processing each word segment in the sentence through the set language model to determine the first sub-sentence probability of the sentence, and processing the part of speech of each word segment in the sentence through the set language model to determine the second sub-sentence probability of the sentence; determining the sentence probability of the sentence based on the first sub-sentence probability and the second sub-sentence probability of the sentence;
  • the first sub-sentence probability is determined by counting the word frequency of each word segment in the sentence;
  • the second sub-sentence probability is determined by counting the frequency of the part of speech of each word segment in the sentence;
  • the set language model is a bigram model
  • the sentence probability of the sentence satisfies the following form:
  • each word segment in the sentence is statistically processed, and the resulting first sub-sentence probability satisfies the following form:
  • the part of speech of each word segment in the sentence is statistically processed, and the resulting second sub-sentence probability satisfies the following form:
  • P is used to represent the sentence probability of the sentence
  • P_w is used to represent the first sub-sentence probability of the sentence
  • P_c is used to represent the second sub-sentence probability of the sentence
  • is used to represent the weight
  • W_i is used to represent any word segment in the sentence
  • C_i is used to represent the part of speech of any word segment in the sentence.
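  • The formulas themselves are not reproduced in this text, so the following Python sketch only illustrates one form that is consistent with the symbol definitions above. It assumes a bigram estimate for both P_w (over word segments W_i) and P_c (over parts of speech C_i), and assumes the two are combined as a weighted sum; the function names, the `weight` parameter, and the toy corpus are all hypothetical.

```python
from collections import Counter


def count_ngrams(corpus):
    """Count unigrams and adjacent pairs (bigrams) over a list of token lists."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(tokens, unigrams, bigrams):
    """Product of count(prev, cur) / count(prev); 0.0 if a context is unseen."""
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


def sentence_prob(words, tags, word_stats, tag_stats, weight=0.5):
    """Assumed combination: P = weight * P_w + (1 - weight) * P_c."""
    p_w = bigram_prob(words, *word_stats)  # first sub-sentence probability
    p_c = bigram_prob(tags, *tag_stats)    # second sub-sentence probability
    return weight * p_w + (1 - weight) * p_c


# Toy training data (hypothetical): word segments and their parts of speech.
word_stats = count_ngrams([["Garden", "Building", "8"], ["Garden", "Building", "9"]])
tag_stats = count_ngrams([["n", "n", "m"], ["n", "n", "m"]])
p = sentence_prob(["Garden", "Building", "8"], ["n", "n", "m"], word_stats, tag_stats)
```

  • In this sketch an unseen bigram drives the corresponding sub-sentence probability to zero; a practical model would normally apply smoothing, which the text does not discuss.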
  • By introducing both the probability related to the word segments and the probability related to the parts of speech of the word segments,
  • the sentence probability corresponding to the spliced string can be determined more fully and more accurately, and it can therefore be determined more accurately whether the spliced string is the text content of a key field whose text content is of the type containing at least one Chinese word.
  • after determining the text content of each key field as the second text content of the type of form certificate image file, the method further includes:
  • The solution can adaptively complete text content recognition for different types of form certificate images without relying on different content templates. It is therefore highly versatile, ensures that the text content recognized from different types of form certificate image files is accurate, and reduces the cost of maintaining content templates for different form certificates.
  • a depth-first search algorithm including:
  • the key field library is used to store the key fields of the various types of form certificate
  • when the node is the first key field node, traversing the first node adjacent to the first key field node along the first traversal direction; when it is determined that the first node is not an empty node and does not exist in the key field library, determining the first node as the text content of the first key field node; the traversal stops when an empty node appears in the first traversal direction; wherein the first traversal direction is from top to bottom.
  • This scheme pre-configures a key field mapping array for each type of form certificate, that is, it stores the key field mappings of each type of form certificate in the key field library. When the depth-first search algorithm traverses a node in the directed acyclic graph, it can therefore promptly determine whether the node is a key field node, i.e., whether the node is a key in the image file of this type of form certificate (a key field and the text content of that key field exist in the form of a key-value pair).
  • If the node is the first key field node (that is, a key field that a certain type of form certificate has), traversing along the top-down traversal direction finds the value of the first key field node, i.e., the text content of the first key field node, in time.
  • The key-value content corresponding to the set data format is then updated according to the text content of the first key field node, which yields the data content in the set data format mapped to the first key field.
  • the first key field node is used as a key, and has a corresponding relationship with a certain key in the key-value content corresponding to the set data format.
  • the method further includes:
  • the second node is determined as a second key field node
  • The second node adjacent to the first key field node can be traversed along the left-to-right traversal direction. Similarly, if the second node is an empty node, the traversal in the left-to-right direction ends. If the second node is not an empty node and exists in the key field library, the second node is determined as the second key field node (that is, a key field of a certain type of form certificate), and the third node adjacent to the second key field node can be traversed along the top-down traversal direction. In this way the value of the second key field node, i.e., the text content of the second key field node, can be found in time, and by updating the key-value content corresponding to the set data format according to the text content of the first key field node, the data content in the set data format mapped to the first key field can be obtained.
  • the second key field node is used as a key, and has
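  • The traversal described above can be sketched in Python as follows. This is only an illustrative stand-in: the directed acyclic graph is modeled as a grid in which each node has a neighbor below and a neighbor to the right, `None` marks an empty node, and the key field library is a plain set. The names `extract_key_values`, `grid`, and `key_field_library` are assumptions, not taken from the source.

```python
def extract_key_values(grid, key_field_library):
    """Walk key field nodes left to right; under each, walk top-down for values."""
    result = {}

    def visit(row, col):
        # Stop when the position falls outside the grid.
        if row >= len(grid) or col >= len(grid[row]):
            return
        node = grid[row][col]
        # An empty node or a node that is not a key field ends this direction.
        if node is None or node not in key_field_library:
            return
        # First traversal direction: top to bottom, until an empty node appears.
        values = []
        below = row + 1
        while below < len(grid) and col < len(grid[below]):
            candidate = grid[below][col]
            if candidate is None or candidate in key_field_library:
                break
            values.append(candidate)  # non-empty, not a key field: it is a value
            below += 1
        result[node] = values
        # Second traversal direction: left to right, to the adjacent node.
        visit(row, col + 1)

    visit(0, 0)
    return result


key_field_library = {"name", "mailbox"}  # hypothetical key fields
grid = [
    ["name", "mailbox"],
    ["Zhang San", "zhangsan@qq.com"],
    [None, None],
]
key_values = extract_key_values(grid, key_field_library)
```

  • The resulting dictionary plays the role of the key-value content in the set data format, with each key field node used as a key.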
  • the embodiment of the present invention also provides a text recognition device for a form certificate image document, including:
  • the recognition unit is configured to, for any type of form certificate image, determine the first text content of the type of form certificate image by performing text content recognition on the type of form certificate image;
  • a processing unit, configured to: when it is determined that the number of character strings in the first text line of the first text content is different from the number of key fields that the type of form certificate has, splice any character string in the first text line with each character string in the second text line, and verify the spliced string, the second text line being the nearest line preceding the first text line in the first text content; when any spliced character string conforms to the text content rules of any key field in the type of form certificate, determine the spliced character string as the text content of the key field; and determine the text content of each key field as the second text content of the type of form certificate image file.
  • an embodiment of the present invention provides a computing device, including at least one processor and at least one memory, wherein the memory stores a computer program, and when the program is executed by the processor, the processor is caused to execute the text recognition method for a form certificate image file described in any implementation of the above first aspect.
  • an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program executable by a computing device, and when the program runs on the computing device, the computing device is caused to execute the text recognition method for a form certificate image file described in any implementation of the above first aspect.
  • FIG. 1 is a schematic flowchart of a text recognition method for a form certificate image file provided by an embodiment of the present invention;
  • FIG. 2a is a schematic diagram of a form certificate image file provided by an embodiment of the present invention;
  • FIG. 2b is a schematic diagram of a text result after OCR parsing provided by an embodiment of the present invention;
  • FIG. 3 is a schematic diagram of text content after column alignment processing provided by an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of a directed acyclic graph provided by an embodiment of the present invention;
  • FIG. 5 is a schematic diagram of another form certificate image file provided by an embodiment of the present invention;
  • FIG. 6 is a schematic diagram of another directed acyclic graph provided by an embodiment of the present invention;
  • FIG. 7 is a schematic structural diagram of a text recognition device for a form certificate image file provided by an embodiment of the present invention;
  • FIG. 8 is a schematic structural diagram of a computing device provided by an embodiment of the present invention.
  • Fig. 1 exemplarily shows the flow of a text recognition method for a form certificate image document provided by an embodiment of the present invention, and the process can be executed by a text recognition device for a form certificate image document.
  • the process specifically includes:
  • Step 101 for any type of form certificate image, by performing text content recognition on the type of form certificate image, determine the first text content of the type of form certificate image.
  • The text content of the type of form certificate image can be recognized by the set text recognition tool, so that text content with at least one text line in the recognized text area is obtained.
  • For example, after a service device (such as a server for processing form certificate images) receives an image of a certain type of form certificate uploaded by a user, it can recognize the image with a set text recognition tool, such as an open-source OCR (Optical Character Recognition) tool, e.g., the pytesseract library used together with the Pillow imaging library in a Python environment.
  • This yields a text result, that is, the text content in the image file of this type of form certificate. The text result may, however, have column inconsistency (the columns are not aligned): the number of character strings in one or several rows differs from the number of key fields of this type of form certificate (that is, the key fields of the header row).
  • FIG. 2a is a schematic diagram of a form certificate image file provided by an embodiment of the present invention. After OCR parsing of this image file, the text result shown in FIG. 2b can be obtained.
  • In FIG. 2b, the first line is the header of the form certificate and contains the key fields, the second line contains 4 character strings, and the third line contains 3 character strings.
  • The number of strings in the second line is therefore the same as the number of key fields in the first line, while the number of strings in the third line is different from the number of key fields in the first line.
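  • The column-consistency check described above can be sketched in Python as follows. The sketch assumes the OCR result has already been split into rows of strings; the header names and the placeholder for the fourth string of the second row are hypothetical, since the text does not list them.

```python
def find_misaligned_rows(rows):
    """Return the indices of rows whose string count differs from the header row."""
    expected = len(rows[0])  # number of key fields in the header row
    return [i for i, row in enumerate(rows[1:], start=1) if len(row) != expected]


ocr_rows = [
    ["name", "id_number", "mailbox", "address"],             # header (hypothetical names)
    ["Zhang San", "1234567890", "zhangsan", "<address part>"],  # 4 strings; last is a placeholder
    ["12345678", "@qq.com", "No. 1818, Building 8, Garden"],    # 3 strings
]
misaligned = find_misaligned_rows(ocr_rows)
```

  • Rows reported by this check are the ones whose strings would then be spliced with the strings of the nearest preceding row.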
  • Step 102: when it is determined that the number of character strings in the first text line of the first text content is different from the number of key fields that the type of form certificate has, any character string in the first text line is concatenated with each character string in the second text line, and the concatenated character string is verified.
  • Step 103 when any concatenated character string conforms to the text content rule of any key field in the type of form certificate, determine the concatenated character string as the text content of the key field.
  • If the number of character strings in a text line of the recognized text content differs from the number of key fields that this type of form certificate has (that is, the number of strings in that text line differs from the number of key fields in the key field text line, so that the columns in the entire text area are inconsistent, i.e., not aligned), then any character string in this text line should be a continuation of some string in the nearest preceding text line. In other words, the character strings in this text line do not exist independently, but should each be concatenated to a corresponding string in the nearest preceding text line.
  • Each concatenated character string is then checked to determine which key field's text content it belongs to; that is, if it is determined that a concatenated character string conforms to the text content rules of a key field in this type of form certificate, the concatenated character string is determined to be the text content of that key field.
  • the second text line is the nearest line preceding the first text line in the first text content.
  • a concatenated character string conforms to the text content rules of a key field in this type of form certificate
  • If the concatenated string does not comply with the text content rules of the key field, the check continues with the text content rules of the next key field whose text content is of the pure-digit type, or with the next concatenated string of the pure-digit type. In this way, the scheme only needs to perform a length check and a regular expression check on the concatenated string against the text content rules of key fields whose text content is of the pure-digit type, instead of checking it against the content rules of every key field. This saves verification time, improves the efficiency of determining the text content of the key field, and improves the recognition accuracy for key fields whose text content is of the pure-digit type in the form certificate image file.
  • A regular expression check is performed on the spliced string according to the text content rules of the key field, so as to determine whether the spliced string conforms to those rules. If it does, the spliced string is determined as the text content of the key field. If it does not, the check continues with the text content rule of the next key field whose text content is of the letters + at least one special character type, or with the next spliced string of the letters + at least one special character type.
  • The solution only needs to perform regular expression verification on the concatenated string against the text content rules of key fields whose text content is of the letters + at least one special character type, rather than against the content rules of every key field. This saves the time that would be spent verifying a concatenated string of the letters + at least one special character type against every key field's content rules, and improves the recognition accuracy for key fields of the letters + at least one special character type in the form certificate image.
  • If the spliced string is of the type containing at least one Chinese word, it only needs to be processed by the set language model to determine the sentence probability that the spliced string conforms to the text content rule of a key field whose text content is of the type containing at least one Chinese word, so that it can be accurately determined, based on the sentence probability, whether the spliced string is the text content of such a key field.
  • The spliced string can then be determined as the text content of the key field of the type containing at least one Chinese word; if the spliced string does not conform to the text content rule of the key field, the scheme continues to judge whether it conforms to the text content rule of the next key field whose text content is of the type containing at least one Chinese word.
  • When the set language model is used to determine the sentence probability that the spliced character string conforms to the text content rule of any key field whose text content is of the type containing at least one Chinese word, the concatenated character string is first segmented by the set word segmentation tool to obtain at least one word segment and the part of speech of each word segment. The word segments are then permuted and combined into at least one sentence. For each sentence, each word segment in the sentence is processed through the set language model to determine the first sub-sentence probability of the sentence, and the part of speech of each word segment in the sentence is processed through the set language model to determine the second sub-sentence probability of the sentence.
  • Based on the first sub-sentence probability and the second sub-sentence probability of the sentence, the sentence probability of the sentence is determined. Finally, the sentence probabilities of the at least one sentence are compared, and the maximum sentence probability is taken as the sentence probability corresponding to the concatenated character string.
  • the first sub-sentence probability is determined by counting the word frequency of each word segment in the sentence
  • the second sub-sentence probability is determined by counting the frequency of the part of speech of each word segment in the sentence.
  • In this way, the sentence probability corresponding to the spliced string can be determined more fully and more accurately, which supports a more accurate determination of whether the spliced string is the text content of a key field whose text content is of the type containing at least one Chinese word.
  • The first processing method applies when the character string to be spliced is of the pure-digit type, such as the string "12345678" shown in FIG. 2b. Strings of the pure-digit type generally appear in mobile phone numbers, ID card numbers, and the like.
  • The string "12345678" in the third line is spliced with the string "Zhang San" in the second line to form "Zhang San12345678" or "12345678Zhang San". A length check and a regular expression check are performed on each of them, and it is determined that neither spliced string meets the content format requirements of any key field in the first line.
  • The string "12345678" in the third line is spliced with the string "1234567890" in the second line to form "123456781234567890" or "123456789012345678". A length check and a regular expression check are performed on each of them, and the matching check determines that only the spliced string "123456789012345678" meets the number format requirements of the ID card, so "123456789012345678" is used as the text content of the ID card.
  • A corresponding regular expression is configured for the text content of each key field of each type of form certificate, so that content verification can be performed promptly and effectively on the text content corresponding to each key field.
  • The character string is spliced with each character string in the second line to obtain multiple spliced strings.
  • For each spliced string, if it is of the pure-digit type, the key fields whose text content format is of the pure-digit type are determined from the key fields of the first line, and a length check and a regular expression check are performed on the spliced string against those key fields. If the check succeeds, the spliced string is determined to be the text content of the key field.
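  • A minimal Python sketch of the pure-digit check described above follows. The rule table, the field names, and the patterns are hypothetical; the source does not give the actual regular expressions. Note that the placeholder 18-digit pattern used here cannot by itself tell the two splicing orders of the worked example apart, so a real rule would have to encode the actual ID number format.

```python
import re

# Hypothetical text content rules; only fields of type "digits" are consulted.
KEY_FIELD_RULES = {
    "id_number": {"type": "digits", "length": 18, "pattern": r"[0-9]{18}"},
    "mobile": {"type": "digits", "length": 11, "pattern": r"1[0-9]{10}"},
    "mailbox": {"type": "letters_special", "pattern": r"[^@\s]+@[^@\s]+"},
}


def digit_field_for(candidate, rules):
    """Return the first pure-digit key field whose rules the candidate satisfies."""
    # Type filter: only a pure-digit string is checked against pure-digit fields.
    if not re.fullmatch(r"[0-9]+", candidate):
        return None
    for name, rule in rules.items():
        if rule["type"] != "digits":
            continue
        # Length check first, then the regular expression check.
        if len(candidate) == rule["length"] and re.fullmatch(rule["pattern"], candidate):
            return name
    return None


matched = digit_field_for("1234567890" + "12345678", KEY_FIELD_RULES)
```

  • Filtering by type before applying any pattern is what avoids checking the spliced string against the rules of every key field.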
  • The second processing method applies when the character string to be spliced is of the letters + at least one special character type (excluding Chinese words), such as the string "@qq.com" shown in FIG. 2b. Strings of this type generally appear in mailboxes and the like.
  • The string "@qq.com" in the third line is spliced with the string "Zhang San" in the second line to form "Zhang San@qq.com" or "@qq.comZhang San", and a regular expression check is performed on each of them. That is, each is matched against the regular expression corresponding to each key field in the first line, and it is determined that neither spliced string meets the content format requirements of any key field in the first line.
  • The string "@qq.com" in the third line is spliced with the string "1234567890" in the second line to form "@qq.com1234567890" or "1234567890@qq.com", and a regular expression check is performed on each of them. That is, each is matched against the regular expression corresponding to each key field in the first line, and it is determined that neither spliced string meets the content format requirements of any key field in the first line.
  • The string "@qq.com" in the third line is spliced with the string "zhangsan" in the second line to form "@qq.comzhangsan" or "zhangsan@qq.com", and a regular expression check is performed on each of them. That is, each is matched against the regular expression corresponding to each key field in the first line, and it is determined that only the spliced string "zhangsan@qq.com" meets the content format requirements of the key field "mailbox" in the first line, so the spliced string "zhangsan@qq.com" is used as the text content of the mailbox.
  • the character string is spliced with each character string in the second line to obtain multiple spliced strings.
  • for each spliced string, if it is determined that the spliced string belongs to the letter + at least one special character type, the key fields of the first line whose text content is required to be of that type are determined, a regular expression check is performed on the spliced string according to the content format requirements of those key fields, and if the check succeeds, the spliced string is determined to be the text content of the key field.
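The mailbox walkthrough above can be reproduced with a short sketch. The regular expression is an assumption: the text does not give the mailbox rule, so the pattern below (local part must start with a letter) was chosen only so that the outcome agrees with the example, in which "1234567890@qq.com" is rejected.

```python
import re

# Assumed mailbox rule (not from the patent): the local part starts with a letter.
MAILBOX = re.compile(r"[A-Za-z][A-Za-z0-9._-]*@[A-Za-z0-9-]+(\.[A-Za-z]+)+")

def mailbox_candidates(fragment, previous_line):
    """Return every spliced string (either order) that passes the mailbox check."""
    hits = []
    for other in previous_line:
        for spliced in (other + fragment, fragment + other):
            if MAILBOX.fullmatch(spliced):
                hits.append(spliced)
    return hits

print(mailbox_candidates("@qq.com", ["Zhang San", "1234567890", "zhangsan"]))
```

With these inputs only "zhangsan@qq.com" survives, since every other spliced string either starts with "@", starts with a digit, or contains a space.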
  • the third processing method applies when the segmented character string belongs to the type that contains at least one Chinese word, such as the character string "No. 1818, Building 8, Garden" shown in FIG. 2b.
  • for a character string of the type containing at least one Chinese word, the character string needs to be spliced with the character strings in the nearest line before the line where it is located, so that at least one spliced character string is obtained.
  • for each spliced character string, word segmentation is performed with a set word segmentation tool, such as the Jieba Chinese word segmentation tool, to obtain at least one participle.
  • the at least one participle is permuted and combined into sentences, so that multiple sentences can be formed; for each of these sentences, each participle in the sentence is processed by a set language model (such as an n-gram language model) to determine the first sub-sentence probability of the sentence, and the part of speech of each participle in the sentence is processed by the set language model to determine the second sub-sentence probability of the sentence.
  • based on the first and second sub-sentence probabilities, the sentence probability of the sentence is determined; the sentence probabilities of the multiple sentences are compared to determine the maximum, which is taken as the sentence probability corresponding to the spliced character string.
  • by comparing the sentence probabilities corresponding to the spliced character strings, the maximum sentence probability can be determined, and the spliced character string corresponding to it is determined as the text content of a key field whose text content is of the type containing at least one Chinese word.
  • the sentence probability of a sentence can be determined from its first sub-sentence probability and its second sub-sentence probability in the following manner:
  • P represents the sentence probability of the sentence;
  • P_w represents the sentence probability calculated by statistical processing of each participle in the sentence (that is, the first sub-sentence probability of the sentence);
  • P_c represents the sentence probability calculated by statistical processing of the part of speech of each participle in the sentence (that is, the second sub-sentence probability of the sentence);
  • the weight can be, for example, 0.5, 0.6 or 0.7, and can be set according to the actual application scenario or the experience of those skilled in the art.
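The formula itself is not reproduced in this text. One plausible reading, given a single weight and two sub-sentence probabilities, is a weighted linear combination; the sketch below assumes that form, and the function name and default weight are illustrative only.

```python
def sentence_probability(p_w, p_c, weight=0.5):
    """Assumed form (not confirmed by the text): weight * P_w + (1 - weight) * P_c."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_w + (1.0 - weight) * p_c

# With weight 0.6, the word-based probability dominates the combined score.
print(sentence_probability(0.4, 0.2, weight=0.6))
```

A convex combination keeps the result between the two inputs, which is why the weight is restricted to the range 0 to 1 here.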
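  • the n-gram language model is usually used to describe the probability that a random word sequence forms a normal semantic sentence. Assuming that a sentence is composed of the participles W_1, W_2, ..., W_n, its corresponding probability is as follows: $P(W_1, W_2, \ldots, W_n) = P(W_1)\,P(W_2 \mid W_1) \cdots P(W_n \mid W_1, \ldots, W_{n-1})$
  • the conditional probabilities in the n-gram language model can be calculated from word frequencies: $P(W_i \mid W_1, \ldots, W_{i-1}) = \mathrm{count}(W_1, \ldots, W_i) / \mathrm{count}(W_1, \ldots, W_{i-1})$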
  • for example, for the normal semantic string "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen", after word segmentation and part-of-speech processing by the Jieba Chinese word segmentation tool, the participles ["Shenzhen", "Nanshan District", "Longhai", "Jiayuan", "8", "Building", "1818", "No."] are obtained.
  • the probability of generating the normal semantic string "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen” is:
  • each probability in the above formula can be calculated by counting the word frequency of the word segmentation, specifically, see Table 1.
  • for the unigram model, the probability is calculated as $P(W_1, \ldots, W_n) = \prod_{i=1}^{n} P(W_i)$; it is simple to compute but does not consider the order relationship between words. For the bigram (binary) model, the probability is calculated as $P(W_1, \ldots, W_n) = P(W_1) \prod_{i=2}^{n} P(W_i \mid W_{i-1})$; it is still relatively simple to compute and considers the order relationship between two adjacent words. For the trigram (ternary) model, the probability is calculated as $P(W_1, \ldots, W_n) = P(W_1)\,P(W_2 \mid W_1) \prod_{i=3}^{n} P(W_i \mid W_{i-2}, W_{i-1})$; it considers the order relationship among three words, but the resulting relationship matrix is too sparse to be practical.
  • the transition probability of each participle is counted: assuming that in the whole corpus the participle W_i appears count(W_i) times in total, and the participles adjacent to W_i and appearing after it are W_1, W_2 and W_3, with counts count(W_i, W_1), count(W_i, W_2) and count(W_i, W_3), then the transition probabilities of W_i to W_1, W_2 and W_3 are count(W_i, W_1)/count(W_i), count(W_i, W_2)/count(W_i) and count(W_i, W_3)/count(W_i) respectively, and the transition probability of W_i to every other participle in the vocabulary W is 0.
  • the word segmentation transition probability matrix X can thus be obtained; its dimension is n*n, and the value in row i and column j equals the transition probability of the participle W_i to the participle W_j, that is, count(W_i, W_j)/count(W_i).
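The construction of the transition matrix X can be sketched from a toy corpus. The corpus, the function name and the alphabetical ordering of the vocabulary are assumptions for illustration; only the ratio count(W_i, W_j)/count(W_i) comes from the text.

```python
from collections import Counter

def transition_matrix(corpus):
    """corpus: list of token lists. Returns (vocab, X) with
    X[i][j] = count(W_i, W_j) / count(W_i)."""
    unigram = Counter(tok for sent in corpus for tok in sent)
    bigram = Counter(pair for sent in corpus for pair in zip(sent, sent[1:]))
    vocab = sorted(unigram)                      # fixed row/column order
    X = [[bigram[(a, b)] / unigram[a] for b in vocab] for a in vocab]
    return vocab, X

corpus = [["Shenzhen", "Nanshan District", "Longhai"],
          ["Shenzhen", "Nanshan District", "Jiayuan"]]
vocab, X = transition_matrix(corpus)
i, j = vocab.index("Nanshan District"), vocab.index("Longhai")
print(X[i][j])
```

Pairs that never occur in the corpus get probability 0, which is the sparsity that smoothing is meant to address.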
  • each probability in the above formula can be calculated by counting word frequency, specifically, see Table 2.
  • the bigram (binary) model with Laplace smoothing is likewise used to calculate the sentence probability P_c from the parts of speech, specifically:
  • C_i represents the part of speech of each participle in a sentence.
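The smoothed formula is not reproduced in this text. The sketch below uses the textbook add-one (Laplace) form, P(C_i | C_{i-1}) = (count(C_{i-1}, C_i) + 1) / (count(C_{i-1}) + m), where m is the number of distinct tags; that form, the omission of an initial-tag probability, and the toy tag corpus are all assumptions.

```python
from collections import Counter

def laplace_bigram_probability(sequence, corpus):
    """Add-one smoothed bigram probability of `sequence`, computed as the
    product of P(C_i | C_{i-1}) estimated from `corpus` (a list of tag lists)."""
    unigram = Counter(t for s in corpus for t in s)
    bigram = Counter(p for s in corpus for p in zip(s, s[1:]))
    m = len(set(unigram) | set(sequence))        # number of distinct tags
    prob = 1.0
    for prev, cur in zip(sequence, sequence[1:]):
        prob *= (bigram[(prev, cur)] + 1) / (unigram[prev] + m)
    return prob

corpus = [["place name", "place name", "noun", "number"]]
print(laplace_bigram_probability(["place name", "noun"], corpus))
print(laplace_bigram_probability(["noun", "place name"], corpus))
```

The second call shows the effect of smoothing: the unseen transition from "noun" to "place name" gets a small non-zero probability instead of 0.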
  • to calculate the part-of-speech probability between every two parts of speech in the above formula by counting part-of-speech frequencies, a Chinese corpus is first obtained, and the open-source Jieba Chinese word segmentation tool is then used to perform word segmentation and part-of-speech processing on the whole corpus, which yields the part-of-speech tag of each participle.
  • the transition probability of each part of speech is counted: assuming that in the whole corpus the part of speech C_i appears count(C_i) times in total, and the parts of speech adjacent to C_i and appearing after it are C_1, C_2 and C_3, with counts count(C_i, C_1), count(C_i, C_2) and count(C_i, C_3), then the transition probabilities of C_i to C_1, C_2 and C_3 are count(C_i, C_1)/count(C_i), count(C_i, C_2)/count(C_i) and count(C_i, C_3)/count(C_i) respectively, and the transition probability of C_i to every other part of speech in the part-of-speech set C is 0.
  • the part-of-speech transition probability matrix Y can thus be obtained; its dimension is m*m, and the value in row i and column j equals the transition probability of the part of speech C_i to the part of speech C_j, namely count(C_i, C_j)/count(C_i).
  • for example, after word segmentation and part-of-speech processing, the participles ["Shenzhen", "Nanshan District", "Longhai", "Jiayuan", "8", "Building", "1818", "No."] are obtained, together with the part of speech of each participle, namely [("Shenzhen", "place name"), ("Nanshan District", "place name"), ("Longhai", "place name"), ("Jiayuan", "noun"), ("8", "number"), ("Building", "number"), ("1818", "number"), ("No.", "number")].
  • the participles are combined into multiple sentences, one of which is "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen"; for this sentence, the first sub-sentence probability P_1 is determined by counting the word frequency of each participle through the bigram (binary) language model, for example P_1 = 0.4.
  • the maximum sentence probability is 0.15: splicing the string "No. 1818, Building 8, Garden" with the string "Longhaijia, Nanshan District, Shenzhen" to produce "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen" has the highest probability, namely 0.15.
  • the sentence probabilities obtained by splicing the string "No. 1818, Building 8, Garden" with the other character strings in the second line are all less than 0.15.
  • for example, when it is spliced with the string "Zhang San" in the second line and the spliced string is segmented and permuted, suppose the bigram model determines that the sentence "No. 1818, Building 8, Zhang Sanyuan" has the highest probability; that probability is still less than 0.15. Likewise, when it is spliced with the string "ID card" in the second line and the spliced string is segmented and permuted, suppose the bigram model determines that the sentence "Yuan 8 Building 1818 ID number" has the highest probability; that probability is also less than 0.15.
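The selection step in this example (permute the participles, score each ordering, keep the best, then compare the spliced strings) can be sketched as below. The bigram table, the fallback value `UNSEEN` and the candidate labels are hypothetical numbers chosen for illustration, not values from the patent, and only a word-level score is used rather than the combined sentence probability.

```python
from itertools import permutations

# Toy bigram probabilities (hypothetical numbers).
BIGRAM = {("Longhai", "Jiayuan"): 0.5, ("Jiayuan", "8"): 0.3}
UNSEEN = 0.01   # assumed probability for pairs missing from the table

def score(order):
    """Product of bigram probabilities over adjacent tokens."""
    prob = 1.0
    for pair in zip(order, order[1:]):
        prob *= BIGRAM.get(pair, UNSEEN)
    return prob

def best_sentence(tokens):
    """Return (probability, ordering) for the most probable ordering of `tokens`."""
    return max((score(order), order) for order in permutations(tokens))

def best_candidate(candidates):
    """candidates maps a label of a spliced string to its tokens; return the
    label whose best ordering has the highest probability."""
    return max(candidates, key=lambda label: best_sentence(candidates[label])[0])

candidates = {
    "spliced with address fragment": ["Longhai", "Jiayuan", "8"],
    "spliced with name": ["Zhang San", "Jiayuan", "8"],
}
print(best_candidate(candidates))
```

Enumerating all permutations grows factorially with the number of participles, so this brute-force form is only practical for short strings such as the address fragments in the example.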
  • as shown in FIG. 3, which is a schematic diagram of the text content after column alignment processing provided by an embodiment of the present invention, the text content of the key field "name" is "Zhang San", the text content of the key field "ID card" is "123456789012345678", the text content of the key field "mailbox" is "zhangsan@qq.com", and the text content of the key field "home address" is "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen".
  • Step 104 determining the text content of each key field as the second text content of the type of form certificate image.
  • the text content in key-value form is organized into content that conforms to a set data format (such as the JSON data format).
  • this solution addresses the case where the number of strings in a text line of the text content does not match the number of key fields of the type of form certificate (that is, the column misalignment problem): by performing the corresponding column alignment processing on the strings in the text line and constructing a valid directed acyclic graph, and then traversing each node of the directed acyclic graph in turn with the depth-first search algorithm, data content conforming to the set data format can be obtained.
  • the key field at the starting position in the type of form certificate is taken as the starting point of construction, and the directed acyclic graph of the type of form certificate is constructed according to the second text content; then, with the node at the starting position of the directed acyclic graph as the starting point of traversal, each node is traversed in turn by the depth-first search algorithm, so that the data content conforming to the set data format is obtained.
  • for a traversed node, it is determined whether the node exists in the key field library, which stores the key fields of the various types of form certificates. If so, the node is determined to be a first key field node, and the first node adjacent to the first key field node is traversed along the first traversal direction; when the first node is not an empty node and does not exist in the key field library, the first node is determined to be the text content of the first key field node, and traversal stops when an empty node appears in the first traversal direction.
  • the first traversal direction is top-down.
  • when the node is determined to be the first key field node (that is, a key field of the type of form certificate), the node adjacent below it is traversed in the top-down direction, so that the value of the first key field node, that is, its text content, is found in time; the key-value content corresponding to the set data format is updated according to this text content, and the data content in the set data format mapped to the first key field is obtained.
  • the first key field node is used as a key, and has a corresponding relationship with a certain key in the key-value content corresponding to the set data format.
  • the second node adjacent to the first key field node is traversed along the second traversal direction; when the second node is not an empty node and exists in the key field library, the second node is determined to be a second key field node, and the third node adjacent to the second key field node is traversed along the first traversal direction; when the third node is not an empty node and does not exist in the key field library, the third node is determined to be the text content of the second key field node, and traversal stops when an empty node appears in the first traversal direction.
  • the second traversal direction is from left to right.
  • when the second node is the second key field node (that is, a key field of the type of form certificate), the third node adjacent to the second key field node can be traversed along the top-down direction, so that the value of the second key field node, that is, its text content, is found in time; the key-value content corresponding to the set data format is updated according to the text content of the second key field node, and the data content in the set data format mapped to the second key field is obtained.
  • the second key field node is used as a key, and has a corresponding relationship with a certain key in the key-value content corresponding to the set data format.
  • each node in Figure 4 can be represented as {"tableValue": "the value in the table, such as the name", "rightNode": "the node to the right of the current node", "belowNode": "the node below the current node"}.
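A graph of such nodes can be built from the column-aligned rows, as sketched below. The field names `tableValue`, `rightNode` and `belowNode` come from the text; linking every cell to its right and lower neighbour, and the function name `build_graph`, are assumptions, since Figure 4 is not available here.

```python
def build_graph(rows):
    """rows: aligned table rows (lists of equal length). Returns the node at the
    starting position; each node is a dict with tableValue / rightNode / belowNode."""
    nodes = [[{"tableValue": v, "rightNode": None, "belowNode": None} for v in row]
             for row in rows]
    for r, row in enumerate(nodes):
        for c, node in enumerate(row):
            if c + 1 < len(row):
                node["rightNode"] = row[c + 1]        # neighbour to the right
            if r + 1 < len(nodes):
                node["belowNode"] = nodes[r + 1][c]   # neighbour below
    return nodes[0][0] if nodes and nodes[0] else None

root = build_graph([["name", "ID card"],
                    ["Zhang San", "123456789012345678"]])
print(root["belowNode"]["tableValue"])
```

Because edges only point right or down, the resulting structure contains no cycles, which is what makes it a directed acyclic graph.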
  • during the depth-first search (DFS), when a key field node such as "name" is visited, the key in the JSON corresponding to the node (the value of "jsonkey" in the keymap, here "name") can be written into the data content in the JSON data format, that is, the current data content in the JSON data format is {"name": null}.
  • the technical solution in the embodiment of the present invention configures a key field mapping array (keymap) for each type of form certificate in advance, that is, the key field mappings of each type of form certificate are stored in the key field library.
  • a certain type of form certificate shown in Figure 2a is taken as an example.
  • a keymap can be formed, that is, the keymap of this type of form certificate can be configured as :
  • keyname corresponds to the name of the header field in the form certificate of this type
  • jsonkey corresponds to the key in the data content of the finally sorted out json data format.
  • keymap update: when a field name in the original form certificate of a certain type is changed or a field is added, the new field name only needs to be inserted into the keymap.
  • the keymap will not be updated.
  • the keymap does not need to store multiple versions, since it saves all known keys; for each type of form certificate, only the latest keymap needs to be kept for processing form certificate images of that type.
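A keymap of the kind described above might look as follows. The entries are hypothetical: "name", "idNo" and "address" appear as JSON keys elsewhere in the text, while the "email" key, the header spellings and the helper names `json_key` and `add_field` are assumptions for illustration.

```python
# Hypothetical keymap: "keyname" is the header text in the form certificate,
# "jsonkey" is the key used in the resulting JSON data.
keymap = [
    {"keyname": "name", "jsonkey": "name"},
    {"keyname": "ID card", "jsonkey": "idNo"},
    {"keyname": "mailbox", "jsonkey": "email"},       # "email" is an assumed key
    {"keyname": "home address", "jsonkey": "address"},
]

def json_key(keymap, header):
    """Return the JSON key for a header field, or None if it is not a key field."""
    for entry in keymap:
        if entry["keyname"] == header:
            return entry["jsonkey"]
    return None

def add_field(keymap, keyname, jsonkey):
    """Insert a new or renamed header field; existing entries are kept, so one
    keymap covers every known field name."""
    if json_key(keymap, keyname) is None:
        keymap.append({"keyname": keyname, "jsonkey": jsonkey})

add_field(keymap, "ID number", "idNo")   # a renamed header maps to the same JSON key
print(json_key(keymap, "ID number"))
```

Keeping old and new header names side by side is what lets a single keymap serve both the earlier and the changed version of a form.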
  • when the depth-first search algorithm finds the node below through belowNode, it first judges whether that node is an empty node; if it is not empty, it then judges whether the node exists in the keymap; if it is not in the keymap, the node is the value node of the previous node, that is, it holds the value of the previous node (because the value node of a key can only appear to its right or below it); if it is an empty node, this branch has been fully traversed.
  • in this way, the data content in the JSON data format can be updated from {"idNo": null} to {"idNo": "123456789012345678"}.
  • likewise, the data content in the JSON data format can be updated from {"address": null} to {"address": "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen"}.
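The traversal described above can be sketched as a recursive depth-first search. This is a simplified illustration under stated assumptions: `KEYMAP` is reduced to a header-to-key dictionary, values are only looked for below a key (the text also allows a value to the right), and the helper `node` and the sample graph are made up for the example.

```python
import json

# Header text -> JSON key (illustrative subset of a keymap).
KEYMAP = {"name": "name", "ID card": "idNo", "home address": "address"}

def node(value, right=None, below=None):
    return {"tableValue": value, "rightNode": right, "belowNode": below}

def dfs(current, result, current_key=None):
    """Depth-first traversal: go down first (values), then right (next key field)."""
    if current is None:                    # empty node: this branch is finished
        return
    value = current["tableValue"]
    if value in KEYMAP:                    # key field node
        current_key = KEYMAP[value]
        result[current_key] = None         # e.g. {"name": None} until a value is found
    elif current_key is not None:          # value node of the key above it
        result[current_key] = value
    dfs(current["belowNode"], result, current_key)
    if value in KEYMAP:
        dfs(current["rightNode"], result)

address = node("home address", below=node("No. 1818, Building 8"))
id_card = node("ID card", right=address, below=node("123456789012345678"))
root = node("name", right=id_card, below=node("Zhang San"))

result = {}
dfs(root, result)
print(json.dumps(result, ensure_ascii=False))
```

If several value nodes sit below one key, this sketch keeps only the last one; handling multi-row values would need an explicit policy such as collecting them in a list.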
  • FIG. 7 exemplarily shows a text recognition device for a form certificate image document provided by an embodiment of the present invention, and the device can execute a flow of a text recognition method for a form certificate image document.
  • the device includes:
  • the recognition unit 701 is configured to, for any type of form certificate image, identify the first text content of the type of form certificate image by performing text content recognition on the type of form certificate image;
  • the processing unit 702 is configured to: when it is determined that the number of character strings in a first text line of the first text content is different from the number of key fields of the type of form certificate, splice any character string of the first text line with each character string in a second text line, and verify the spliced character strings, where the second text line is the nearest line before the first text line in the first text content; when any spliced character string conforms to the text content rules of any key field in the type of form certificate, determine the spliced character string as the text content of the key field; and determine the text content of each key field as the second text content of the type of form certificate image.
  • processing unit 702 is specifically configured to:
  • if the spliced string belongs to the pure-digit type, determine from the key fields the key field whose text content is of the pure-digit type;
  • processing unit 702 is specifically configured to:
  • perform a regular expression check on the spliced string according to the text content rules of the key field, so as to determine whether the spliced string conforms to the text content rules of the key field.
  • processing unit 702 is specifically configured to:
  • if the spliced string belongs to the type that contains at least one Chinese word, determine from the key fields the key field whose text content is of the type that contains at least one Chinese word;
  • processing unit 702 is specifically configured to:
  • for each sentence, process each participle in the sentence through the set language model to determine the first sub-sentence probability of the sentence, and process the part of speech of each participle in the sentence through the set language model to determine the second sub-sentence probability of the sentence; determine the sentence probability of the sentence based on the first sub-sentence probability and the second sub-sentence probability of the sentence;
  • the first sub-sentence probability is determined by counting the word frequency of each participle in the sentence;
  • the second sub-sentence probability is determined by the word frequency of the part of speech of each participle in the described sentence;
  • the set language model is a binary model
  • the sentence probability of the sentence satisfies the following form:
  • each participle in the sentence is statistically processed, and the determined first sub-sentence probability satisfies the following form:
  • the part of speech of each participle in the sentence is statistically processed, and the determined second subsentence probability satisfies the following form:
  • P represents the sentence probability of the sentence, P_w the first sub-sentence probability of the sentence, P_c the second sub-sentence probability of the sentence, W_i any participle in the sentence, and C_i the part of speech of any participle in the sentence; the form additionally contains a weight coefficient.
  • processing unit 702 is further configured to:
  • taking the key field at the starting position in the type of form certificate as the starting point of construction, construct a directed acyclic graph of the type of form certificate according to the second text content;
  • processing unit 702 is specifically configured to:
  • for any node, determine whether the node exists in the key field library, the key field library being used to store the key fields of the various types of form certificates; if so, determine that the node is a first key field node, traverse the first node adjacent to the first key field node along the first traversal direction, and, when it is determined that the first node is not an empty node and does not exist in the key field library, determine the first node as the text content of the first key field node, stopping traversal when an empty node appears in the first traversal direction; wherein the first traversal direction is from top to bottom.
  • processing unit 702 is further configured to:
  • the embodiment of the present invention also provides a computing device, as shown in FIG. 8 , including at least one processor 801 and a memory 802 connected to the at least one processor.
  • the embodiment of the present invention does not limit the specific connection medium between the processor 801 and the memory 802; the bus connection between the processor 801 and the memory 802 in FIG. 8 is taken as an example.
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the memory 802 stores instructions that can be executed by the at least one processor 801, and the at least one processor 801 executes the instructions stored in the memory 802 to perform the steps included in the aforementioned text recognition method for form certificate image files.
  • the processor 801 is the control center of the computing device; it can use various interfaces and lines to connect the various parts of the computing device, and realizes data processing by running or executing the instructions stored in the memory 802 and calling the data stored in the memory 802.
  • the processor 801 may include one or more processing units, and the processor 801 may integrate an application processor and a modem processor.
  • the modem processor mainly handles issuing instructions. It can be understood that the foregoing modem processor may not be integrated into the processor 801.
  • the processor 801 and the memory 802 can be implemented on the same chip, and in some embodiments, they can also be implemented on independent chips.
  • the processor 801 can be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application specific integrated circuit (ASIC), a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present invention.
  • a general-purpose processor may be a microprocessor or any conventional processor or the like. The steps of the method disclosed in connection with the embodiment of the text recognition method for form certificate image files can be directly executed by a hardware processor, or executed by a combination of hardware and software modules in the processor.
  • the memory 802 can be used to store non-volatile software programs, non-volatile computer-executable programs and modules.
  • the memory 802 may include at least one type of storage medium, for example flash memory, hard disk, multimedia card, card memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, magnetic disk, optical disc, etc.
  • Memory 802 is, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer.
  • the memory 802 in the embodiment of the present invention may also be a circuit or any other device capable of implementing a storage function, and is used for storing program instructions and/or data.
  • an embodiment of the present invention also provides a computer-readable storage medium, which stores a computer program executable by a computing device, and when the program is run on the computing device, the computing device The steps of the text recognition method for the above-mentioned form certificate image document are executed.
  • the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein. Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of this application and their equivalent technologies, the present invention also intends to include these modifications and variations.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Character Input (AREA)

Abstract

Provided in the embodiments of the present invention are a text recognition method for a form certificate image file, and a computing device. The method comprises: for any type of form certificate image file, performing text content recognition on the type of form certificate image file, so as to determine first text content; when the number of character strings of a first text line in the first text content is different from the number of key fields of a form certificate of the type, splicing any character string of the first text line with each character string in a second text line, and verifying a spliced character string; and when any spliced character string conforms to a text content rule of any key field in the form certificate of the type, determining the spliced character string to be text content of the key field. Therefore, the text content of each key field can be determined. In this way, by means of the solution, the accuracy of recognizing text content in a form certificate image file can be effectively improved, and the costs accumulated when maintaining different content templates are reduced.

Description

A Text Recognition Method for Form Certificate Image Files, and Computing Device
Cross-Reference to Related Applications
This application claims priority to the Chinese patent application No. 202111382325.7, filed with the China Patent Office on November 22, 2021 and entitled "A Text Recognition Method for Form Certificate Image Files, and Computing Device", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to the field of financial technology (Fintech), and in particular to a text recognition method for form certificate image files and a computing device.
背景技术Background technique
随着计算机技术的发展,越来越多的技术应用在金融领域,传统金融业正在逐步向金融科技转变,但由于金融行业的安全性、实时性要求,也对技术提出的更高的要求。在金融领域,用户在办理金融业务(比如贷款业务等)时,为了确保金融业务操作的安全性,需要用户上传相关的证件影像件进行辅助审核,此时就需要用户上传自己相关的证件影像件,比如上传自己的房屋产权证影像件、机动车登记证书影像件或企业工商登记影像件等,以便业务人员利用OCR(Optical Character Recognition,光学字符识别)技术对客户上传的证件影像件进行内容提取并审核。With the development of computer technology, more and more technologies are applied in the financial field, and the traditional financial industry is gradually transforming into financial technology. However, due to the security and real-time requirements of the financial industry, higher requirements are also placed on technology. In the financial field, when users handle financial services (such as loan business, etc.), in order to ensure the safety of financial business operations, users are required to upload relevant certificate images for auxiliary review. At this time, users are required to upload their own relevant certificate images. , such as uploading your own housing property certificate image, motor vehicle registration certificate image, or enterprise business registration image, etc., so that business personnel can use OCR (Optical Character Recognition, Optical Character Recognition) technology to extract the content of the certificate image uploaded by customers and review.
现阶段,在针对每种类型的表格证件影像件进行识别处理之前,会在对该类型的表格证件所对应的内容模板进行定义行与列的过程中,引入该类型的表格证件的每个单元格的四个角的坐标值,以此实现针对该类型的表格证件影像件中文本内容的识别。具体地,对于某一用户上传的某一种类型的表格证件影像件,在针对该类型的表格证件影像件进行识别处理时,可以先根据该类型的表格证件的各单元格的四个角的坐标值针对各单元格进行切分,再针对切分出的每个单元格进行OCR识别,得到每个单元格中的文本内容,然后通过内容模板针对每个单元格识别出的文本内容进行解析和结构化处理,如此即可得到该类型的表格证件中每个关键字段所对应的完整内容信息,但是,这种方案由于某一类型的表格证件影像件若出现扫描偏差、拍摄偏差,或该类型的表格证件影像件中的字段位置发生变动,因此在根据各单元格的四个角的坐标值针对各单元格进行切分时,就会出现切分出的单元格是不准确的,从而导致识别出的至少一个关键字段对应的文本内容也是不准确的。其中,由于现有方案会为每种类型的表格证件都配置一个符合该类型的表格证件所具有的内容格式要求的内容模板,且每种类型的表格证件所对应的内容模板是固定的,在某一类型的表格证件中的内容发生变更时需要重新设计并开发该类型的证件所对应的内容模板,因此通用性较差、维护成本较高。比如针对房屋产权证,不同的城市所定义的房屋产权证的内容会存在大大小小的差异,那么就需要针对每个城市所定义的房屋产权证都配置一种内容模板,如此就会导致房屋产权证所对应的内容模板的开发周期长、人力成本高。At this stage, before performing recognition processing on each type of form certificate image, each unit of this type of form certificate will be introduced in the process of defining the row and column of the content template corresponding to the type of form certificate The coordinate values of the four corners of the grid, so as to realize the recognition of the text content in the image file of this type of form certificate. Specifically, for a certain type of form certificate image uploaded by a certain user, when performing recognition processing on this type of form certificate image, it may first be based on the four corners of each cell of the type of form certificate The coordinate value is segmented for each cell, and then OCR recognition is performed for each segmented cell to obtain the text content in each cell, and then the text content identified for each cell is analyzed through the content template and structured processing, so that the complete content information corresponding to each key field in this type of form certificate can be obtained. 
However, this solution is due to scanning deviation, shooting deviation, or The position of the fields in this type of form certificate image file changes, so when the cells are segmented according to the coordinate values of the four corners of each cell, the segmented cells will be inaccurate. As a result, the identified text content corresponding to at least one key field is also inaccurate. Among them, since the existing scheme configures a content template for each type of form certificate that meets the content format requirements of the type of form certificate, and the content template corresponding to each type of form certificate is fixed, in When the content in a certain type of form certificate changes, it is necessary to redesign and develop the content template corresponding to this type of certificate, so the versatility is poor and the maintenance cost is high. For example, for house property right certificates, the content of the house property right certificates defined in different cities will vary greatly, so it is necessary to configure a content template for the house property right certificates defined in each city, which will lead to house The development cycle of the content template corresponding to the property right certificate is long and the labor cost is high.
In summary, there is an urgent need for a text recognition method for form certificate images that effectively improves the accuracy of recognizing text content in such images and reduces the cost of maintaining content templates.
Summary of the Invention
In a first aspect, an embodiment of the present invention provides a text recognition method for a form certificate image, including:
for a form certificate image of any type, performing text content recognition on the image to determine first text content of the image;
when it is determined that the number of character strings in a first text line of the first text content differs from the number of key fields of that type of form certificate, splicing any character string of the first text line with each character string in a second text line, and verifying each spliced character string, where the second text line is the nearest line preceding the first text line in the first text content;
when any spliced character string conforms to the text content rule of any key field of that type of form certificate, determining the spliced character string as the text content of that key field; and
determining the text content of the key fields as second text content of the form certificate image.
In the above technical solution, the prior art configures, for each type of form certificate (such as a house property certificate), a content template matching that type's content format, and the template must be redesigned and redeveloped whenever the content of that certificate type changes, so the prior art generalizes poorly and is costly to maintain. Moreover, if an image of a certificate suffers from scanning deviation or shooting deviation, or if field positions change, the text content recognized for at least one key field becomes inaccurate. The present solution therefore automatically checks whether the text content recognized from a form certificate image contains one or more text lines whose number of character strings differs from the number of key fields of that certificate type and, if so, processes those lines automatically according to the corresponding processing rules. This ensures that the text content of every key field is accurate and avoids the inaccuracies caused by scanning deviation, shooting deviation, or changed field positions. Specifically, for a form certificate image of any type, text content recognition yields the first text content of the image. When the number of character strings in a first text line of the first text content differs from the number of key fields, any character string of the first text line is spliced with each character string in the second text line and each spliced character string is verified; if a spliced character string conforms to the text content rule of a key field, it is determined as the text content of that key field, and the text content of every key field is determined in this way. The solution thus only needs to splice each character string of a mismatched text line with each character string of the nearest preceding text line and check whether a spliced character string conforms to the text content rule of some key field. This accurately determines the text content of the key field and improves recognition accuracy without configuring a content template for each certificate type and without introducing the four-corner coordinates of every cell when defining the rows and columns of such a template. The solution is therefore highly general: it can handle certificates that differ between regions as well as a certificate type whose content has changed, and it reduces the maintenance cost of configuring different content templates for different certificate types.
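The splicing step described above can be illustrated with a minimal Python sketch. The field names, the regular-expression rules, and the function `splice_short_row` are hypothetical and are not taken from the source; the sketch also simplifies the method by checking each spliced candidate only against the rule of the column it came from.

```python
import re
from typing import Callable, Dict, List

Rule = Callable[[str], bool]

# Hypothetical text content rules, one per key field (illustration only).
RULES: Dict[str, Rule] = {
    "cert_no": lambda s: re.fullmatch(r"\d{8}", s) is not None,
    "owner": lambda s: re.fullmatch(r"[A-Z][a-z]+ [A-Z][a-z]+", s) is not None,
    "address": lambda s: re.fullmatch(r"\d+ .+Road", s) is not None,
}


def splice_short_row(header: List[str], prev_row: List[str],
                     short_row: List[str], rules: Dict[str, Rule]) -> Dict[str, str]:
    """Repair a row whose string count differs from the number of key fields.

    Each string of the short row is appended to each string of the nearest
    preceding row; a spliced candidate is kept when it satisfies the rule of
    the key field in that column.
    """
    # The preceding row is assumed to line up with the header columns.
    fields = dict(zip(header, prev_row))
    if len(short_row) == len(header):
        return fields  # counts match, so no splicing is needed
    for fragment in short_row:
        for name, head in zip(header, prev_row):
            candidate = head + fragment
            if rules[name](candidate):
                fields[name] = candidate
                break  # this fragment has found its key field
    return fields


header = ["cert_no", "owner", "address"]
prev_row = ["44030012", "Zhang San", "12 Keyuan"]
short_row = ["Road"]  # one string instead of three: a wrapped cell
print(splice_short_row(header, prev_row, short_row, RULES))
```

Here the fragment "Road" fails the rules for `cert_no` and `owner` after splicing but satisfies the `address` rule, so only the address is extended.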
Optionally, whether a spliced character string conforms to the text content rule of any key field of that type of form certificate is determined as follows:
if the spliced character string is of the pure-digit type, determining, from the key fields, the key fields whose text content is of the pure-digit type;
for any key field whose text content is of the pure-digit type, performing a length check and a regular expression check on the spliced character string according to the text content rule of that key field, thereby determining whether the spliced character string conforms to the text content rule of that key field.
In the above technical solution, if a spliced character string is of the pure-digit type, the key fields whose text content is of the pure-digit type are first selected from the key fields, and the length check and regular expression check are performed only against the text content rules of those key fields. The spliced character string does not have to be checked against the content rule of every key field, which saves checking time, makes determining the text content of a key field more efficient, and improves the recognition accuracy for key fields whose text content is of the pure-digit type.
Optionally, whether a spliced character string conforms to the text content rule of any key field of that type of form certificate is determined as follows:
if the spliced character string is of the letters-plus-at-least-one-special-character type, determining, from the key fields, the key fields whose text content is of that type;
for any key field whose text content is of the letters-plus-at-least-one-special-character type, performing a regular expression check on the spliced character string according to the text content rule of that key field, thereby determining whether the spliced character string conforms to the text content rule of that key field.
In the above technical solution, if a spliced character string is of the letters-plus-at-least-one-special-character type, the key fields whose text content is of that type are first selected from the key fields, and the regular expression check is performed only against the text content rules of those key fields. The spliced character string does not have to be checked against the content rule of every key field, which saves checking time, makes the regular expression check and the determination of key field text content more efficient, and improves the recognition accuracy for key fields whose text content is of the letters-plus-at-least-one-special-character type.
Optionally, whether a spliced character string conforms to the text content rule of any key field of that type of form certificate is determined as follows:
if the spliced character string is of the type containing at least one Chinese word, determining, from the key fields, the key fields whose text content is of the type containing at least one Chinese word;
processing the spliced character string through a set language model to determine the sentence probability that the spliced character string conforms to the text content rule of any key field whose text content is of the type containing at least one Chinese word, thereby determining whether the spliced character string conforms to the text content rule of that key field.
In the above technical solution, if a spliced character string is of the type containing at least one Chinese word, it only needs to be processed through the set language model to obtain the sentence probability that it conforms to the text content rule of a key field of that type. Based on this sentence probability, it can be accurately determined whether the spliced character string is the text content of that key field, which makes determining key field text content more efficient and improves the recognition accuracy for key fields whose text content contains at least one Chinese word.
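The three type-specific checks above amount to classifying the spliced character string first and then testing it only against key fields of the same type. The following Python sketch shows one way this could look; the field names, lengths, and patterns in `FIELD_SPECS` are hypothetical, and the character ranges used in `classify` are an assumption about how the three types might be told apart.

```python
import re
from typing import Dict, List

# Hypothetical key field rules (illustration only, not from the source).
FIELD_SPECS: Dict[str, dict] = {
    "cert_no": {"type": "digits", "length": 8, "pattern": r"[1-9]\d*"},
    "plot_code": {"type": "letters_special", "pattern": r"[A-Z]+-[A-Z]+"},
}


def classify(text: str) -> str:
    """Assign a spliced string to one of the content types."""
    if re.fullmatch(r"\d+", text):
        return "digits"
    if re.search(r"[\u4e00-\u9fff]", text):
        return "chinese"  # contains at least one CJK character
    if re.search(r"[A-Za-z]", text) and re.search(r"[^A-Za-z0-9]", text):
        return "letters_special"
    return "other"


def matching_fields(text: str) -> List[str]:
    """Return the key fields whose rule the string satisfies.

    Only key fields of the same content type are tested, so rules of
    unrelated fields are never evaluated.
    """
    kind = classify(text)
    matches = []
    for name, spec in FIELD_SPECS.items():
        if spec["type"] != kind:
            continue
        if "length" in spec and len(text) != spec["length"]:
            continue  # length check (used for pure-digit fields)
        if re.fullmatch(spec["pattern"], text) is None:
            continue  # regular expression check
        matches.append(name)
    return matches
```

Strings classified as `chinese` match no entry here because, in the method described above, they are scored by the language model rather than by a regular expression.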
Optionally, processing the spliced character string through the set language model to determine the sentence probability that the spliced character string conforms to the text content rule of any key field whose text content is of the type containing at least one Chinese word includes:
performing word segmentation on the spliced character string with a set word segmentation tool to obtain at least one word segment and the part of speech of each word segment;
combining the at least one word segment into at least one sentence by permutation and combination;
for each sentence, processing the word segments of the sentence through the set language model to determine a first sub-sentence probability of the sentence, and processing the parts of speech of the word segments through the set language model to determine a second sub-sentence probability of the sentence; determining the sentence probability of the sentence based on the first sub-sentence probability and the second sub-sentence probability, where the first sub-sentence probability is determined by counting the word frequencies of the word segments in the sentence, and the second sub-sentence probability is determined by counting the frequencies of the parts of speech of the word segments in the sentence;
comparing the sentence probabilities of the at least one sentence to determine the largest sentence probability, and determining the largest sentence probability as the sentence probability corresponding to the spliced character string.
Optionally, the set language model is a bigram (2-gram) model;
the sentence probability of the sentence satisfies the following form:
$$P = \Omega \times P_w \times (1 - \Omega) \times P_c$$
where the first sub-sentence probability, determined by statistically processing the word segments of the sentence through the bigram model, satisfies the following form:
Figure PCTCN2022100026-appb-000001
and the second sub-sentence probability, determined by statistically processing the parts of speech of the word segments of the sentence through the bigram model, satisfies the following form:
Figure PCTCN2022100026-appb-000002
Here, $P$ denotes the sentence probability of the sentence, $P_w$ denotes the first sub-sentence probability of the sentence, $P_c$ denotes the second sub-sentence probability of the sentence, $\Omega$ denotes a weight, $W_i$ denotes any word segment in the sentence, and $C_i$ denotes the part of speech of any word segment in the sentence.
In the above technical solution, after a spliced character string has been segmented and tagged with parts of speech by the set word segmentation tool, both the probability related to the word segments and the probability related to their parts of speech are taken into account. The sentence probability of the spliced character string can therefore be determined more fully and more precisely, and it can be determined more accurately whether the spliced character string is the text content of a key field whose text content contains at least one Chinese word.
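The scoring procedure above can be sketched in Python as follows. Several points are assumptions: the two sub-sentence probability formulas appear only as figures in the source, so the usual bigram estimate (count of a pair divided by the count of its first element, without smoothing) is used here as a stand-in; the counts are toy values; and the input is taken to be already segmented and tagged, so no word segmentation tool is called. The combination step follows the expression for $P$ as printed above.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple

Counts = Dict[str, int]
PairCounts = Dict[Tuple[str, str], int]


def bigram_probability(tokens: Sequence[str], unigrams: Counts,
                       bigrams: PairCounts) -> float:
    """Product of count(prev, cur) / count(prev) over adjacent tokens.

    Assumed stand-in for the formulas shown only as figures in the source.
    """
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        denom = unigrams.get(prev, 0)
        if denom == 0:
            return 0.0  # unseen context: no smoothing in this sketch
        prob *= bigrams.get((prev, cur), 0) / denom
    return prob


def sentence_probability(sentence: Sequence[Tuple[str, str]],
                         word_uni: Counts, word_bi: PairCounts,
                         pos_uni: Counts, pos_bi: PairCounts,
                         omega: float) -> float:
    words = [word for word, _ in sentence]
    tags = [tag for _, tag in sentence]
    p_w = bigram_probability(words, word_uni, word_bi)  # first sub-sentence probability
    p_c = bigram_probability(tags, pos_uni, pos_bi)     # second sub-sentence probability
    # Combined exactly as the expression for P is printed in the text.
    return omega * p_w * (1 - omega) * p_c


def best_sentence_probability(segments: List[Tuple[str, str]],
                              word_uni: Counts, word_bi: PairCounts,
                              pos_uni: Counts, pos_bi: PairCounts,
                              omega: float = 0.5) -> float:
    """Score every ordering of the segments and keep the largest probability."""
    return max(
        sentence_probability(order, word_uni, word_bi, pos_uni, pos_bi, omega)
        for order in permutations(segments)
    )


# Toy statistics (hypothetical).
WORD_UNI = {"深圳市": 4, "南山区": 2}
WORD_BI = {("深圳市", "南山区"): 2}
POS_UNI = {"ns": 6}
POS_BI = {("ns", "ns"): 3}

segments = [("南山区", "ns"), ("深圳市", "ns")]
print(best_sentence_probability(segments, WORD_UNI, WORD_BI, POS_UNI, POS_BI))
```

Because every ordering is scored, the cost grows factorially with the number of segments, so a sketch like this is only practical for the short strings that result from splicing two cell fragments.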
Optionally, after determining the text content of the key fields as the second text content of the form certificate image, the method further includes:
constructing a directed acyclic graph of that type of form certificate according to the second text content, with the key field at the starting position of the form certificate as the construction starting point;
traversing the nodes of the directed acyclic graph in turn with a depth-first search algorithm, with the node at the starting position of the directed acyclic graph as the traversal starting point, to obtain data content conforming to a set data format.
In the above technical solution, the goal is to organize the key-value text content of a form certificate image into content conforming to a set data format (such as JSON) more conveniently and accurately. When a text line of the text content contains a number of character strings that does not match the number of key fields of that certificate type (that is, a column misalignment problem), the character strings of that text line are column-aligned, a directed acyclic graph is constructed from the column-aligned text content, and the nodes of the graph are traversed in turn with a depth-first search algorithm to obtain data content conforming to the set data format. The solution can thus adaptively recognize the text content of different types of form certificate images without relying on different content templates. It is highly general, ensures that the text content recognized for different types of form certificate images is accurate, and reduces the cost of maintaining content templates for different form certificates.
Optionally, traversing the nodes of the directed acyclic graph in turn with the depth-first search algorithm, with the node at the starting position of the directed acyclic graph as the traversal starting point, includes:
for any node in the directed acyclic graph, when it is determined that the node is not an empty node, determining whether the node exists in a key field library, where the key field library stores the key fields of each type of form certificate;
if so, determining that the node is a first key field node, traversing a first node adjacent to the first key field node along a first traversal direction, and, when it is determined that the first node is not an empty node and does not exist in the key field library, determining the first node as the text content of the first key field node, the traversal stopping when an empty node appears in the first traversal direction; where the first traversal direction is from top to bottom.
In the above technical solution, the goal is again to organize the key-value text content of a form certificate image into content conforming to a set data format (such as JSON) more conveniently and accurately. A key field mapping array is configured in advance for each type of form certificate, that is, the key fields of each certificate type are mapped and stored in the key field library. When a node of the directed acyclic graph is visited by the depth-first search algorithm, it can then be determined promptly whether the node is a key field node, that is, whether the node is a key in the form certificate image (a key field and its text content exist as a key-value pair). If the node is determined to be the first key field node (a key field of that certificate type), the nodes below it are traversed in the top-down direction to find the value of the first key field node, that is, its text content. The key-value content of the set data format is then updated with the text content of the first key field node, which yields the data content of the set data format mapped to the first key field. In the key field library, the first key field node, as a key, corresponds to one of the keys in the key-value content of the set data format.
Optionally, after the traversal stops when an empty node appears in the first traversal direction, the method further includes:
traversing a second node adjacent to the first key field node along a second traversal direction, and, when it is determined that the second node is not an empty node and exists in the key field library, determining the second node as a second key field node;
traversing a third node adjacent to the second key field node along the first traversal direction, and, when it is determined that the third node is not an empty node and does not exist in the key field library, determining the third node as the text content of the second key field node, the traversal stopping when an empty node appears in the first traversal direction; and, after the traversal in the first traversal direction stops, traversing a fourth node adjacent to the second key field node along the second traversal direction, the traversal stopping when an empty node appears in the second traversal direction; where the second traversal direction is from left to right.
In the above technical solution, after the top-down traversal ends, the second node adjacent to the first key field node is traversed in the left-to-right direction. If the second node is an empty node, the traversal in the left-to-right direction ends. If the second node is not an empty node and exists in the key field library, it is determined to be the second key field node (a key field of that certificate type), and the third node adjacent to the second key field node is traversed in the top-down direction to find the value of the second key field node, that is, its text content. The key-value content of the set data format is then updated with the text content of the second key field node, which yields the data content of the set data format mapped to the second key field. In the key field library, the second key field node, as a key, corresponds to one of the keys in the key-value content of the set data format.
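The traversal order described above (down from a key field node until an empty node, then right to the next key field node) can be sketched in Python. As a simplifying assumption, the directed acyclic graph is represented here as a grid in which each cell has an edge to the cell below and the cell to its right, and `None` stands for an empty node; the key field library contents and the function names are hypothetical.

```python
import json
from typing import Dict, List, Optional

Grid = List[List[Optional[str]]]


def cell(grid: Grid, row: int, col: int) -> Optional[str]:
    """Return the node at (row, col), or None for an empty or missing node."""
    if 0 <= row < len(grid) and 0 <= col < len(grid[row]):
        return grid[row][col]
    return None


def collect_fields(grid: Grid, key_library: Dict[str, str],
                   row: int = 0, col: int = 0,
                   out: Optional[Dict[str, List[str]]] = None) -> Dict[str, List[str]]:
    """Depth-first walk: go down from a key field node, then move right."""
    out = {} if out is None else out
    node = cell(grid, row, col)
    if node is None or node not in key_library:
        return out  # empty node or not a key field: stop in this direction
    values = []
    below = row + 1
    while True:
        value = cell(grid, below, col)
        if value is None or value in key_library:
            break  # an empty node ends the top-down traversal
        values.append(value)
        below += 1
    # The library maps the key field label to the key used in the output format.
    out[key_library[node]] = values
    return collect_fields(grid, key_library, row, col + 1, out)


KEY_LIBRARY = {"Owner": "owner", "Address": "address"}  # hypothetical mapping
grid = [
    ["Owner", "Address"],
    ["Zhang San", "12 Keyuan Road"],
    ["Li Si", None],
]
print(json.dumps(collect_fields(grid, KEY_LIBRARY), ensure_ascii=False))
```

Collecting the values of each key field into a list keeps multi-row tables intact; a certificate with a single value per field would simply yield one-element lists.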
In a second aspect, an embodiment of the present invention further provides a text recognition apparatus for a form certificate image, including:
a recognition unit, configured to, for a form certificate image of any type, perform text content recognition on the image to determine first text content of the image;
a processing unit, configured to: when it is determined that the number of character strings in a first text line of the first text content differs from the number of key fields of that type of form certificate, splice any character string of the first text line with each character string in a second text line and verify each spliced character string, where the second text line is the nearest line preceding the first text line in the first text content; when any spliced character string conforms to the text content rule of any key field of that type of form certificate, determine the spliced character string as the text content of that key field; and determine the text content of the key fields as second text content of the form certificate image.
In a third aspect, an embodiment of the present invention provides a computing device, including at least one processor and at least one memory, where the memory stores a computer program that, when executed by the processor, causes the processor to perform the text recognition method for a form certificate image according to any implementation of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program executable by a computing device, where the program, when run on the computing device, causes the computing device to perform the text recognition method for a form certificate image according to any implementation of the first aspect.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a text recognition method for a form certificate image provided by an embodiment of the present invention;
FIG. 2a is a schematic diagram of a form certificate image provided by an embodiment of the present invention;
FIG. 2b is a schematic diagram of a text result after OCR parsing provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of text content after column alignment provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a directed acyclic graph provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of another form certificate image provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of another directed acyclic graph provided by an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a text recognition apparatus for a form certificate image provided by an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a computing device provided by an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
FIG. 1 shows, by way of example, the flow of a text recognition method for a form certificate image provided by an embodiment of the present invention. The flow may be executed by a text recognition apparatus for a form certificate image.
As shown in FIG. 1, the flow includes:
Step 101: for a form certificate image of any type, perform text content recognition on the image to determine first text content of the image.
In this embodiment of the present invention, for a form certificate image of any type submitted by a user (such as a house ownership certificate or a land use certificate), text content recognition is performed on the image with a set text recognition tool, which yields text content having at least one text line in the recognized text region.
By way of example, after a service device (for example, a server for processing form certificate image files) receives a form certificate image file of a certain type uploaded by a user, it may parse the image file with a preset text recognition tool, such as an open-source OCR (Optical Character Recognition) tool, the Pillow library for Python, or the pytesseract library for Python. This yields a text result, that is, the text content of the form certificate image file of that type. The text result may, however, suffer from column inconsistency (the columns are not aligned): the number of character strings in one or more rows differs from the number of key fields of that type of form certificate (the key fields in the header row). For example, FIG. 2a is a schematic diagram of a form certificate image file according to an embodiment of the present invention, and parsing this image file with an OCR tool yields the text result shown in FIG. 2b. As FIG. 2b shows, the first row (the header of the form certificate) contains 4 key fields, the second row contains 4 character strings, and the third row contains 3 character strings. The second row therefore matches the first row in count, whereas the third row does not, so the text result is column-inconsistent and requires column alignment. That is, each character string in the third row is processed according to the processing rule for that string, which yields the column-aligned text result, namely the complete text content corresponding to each key field in the first row.
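The column-consistency test described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the header names (`name`, `id_card`, `email`, `address`) are assumptions, since the key fields of FIG. 2b are not all named in the text, and the rows are taken to be already split into lists of strings.

```python
def misaligned_rows(rows):
    """Return the indices of rows whose string count differs from the header's."""
    header = rows[0]
    return [
        index
        for index, row in enumerate(rows[1:], start=1)
        if len(row) != len(header)
    ]


# Hypothetical OCR result modelled on FIG. 2b; header names are assumed.
rows = [
    ["name", "id_card", "email", "address"],
    ["张三", "1234567890", "zhangsan", "深圳市南山区龙海家"],
    ["12345678", "@qq.com", "园8栋1818号"],
]

print(misaligned_rows(rows))  # prints [2]
```

Only the rows reported here need the concatenation and verification of steps 102 and 103.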
Step 102: when it is determined that the number of character strings in a first text line of the first text content differs from the number of key fields of that type of form certificate, concatenate any character string of the first text line with each character string in a second text line, and verify each concatenated character string.
Step 103: when any concatenated character string conforms to the text content rule of any key field of that type of form certificate, determine the concatenated character string as the text content of that key field.
In this embodiment of the present invention, suppose the recognized text content contains a text line whose number of character strings differs from the number of key fields of that type of form certificate (that is, the line's string count differs from the number of key fields in the key-field line, so the columns of the text region are misaligned). Each character string of that text line should then be a continuation of some character string in the nearest preceding text line. In other words, the strings in that line do not stand alone; each should be concatenated to the corresponding string in the nearest preceding line. Therefore, any character string of the first text line is concatenated with each character string in the second text line, and each concatenated string is verified to determine which key field's text content it belongs to: if a concatenated string conforms to the text content rule of a key field of that type of form certificate, it is determined as the text content of that key field. Here, the second text line is the nearest line preceding the first text line in the first text content.
Specifically, when determining whether a concatenated character string conforms to the text content rule of a key field of that type of form certificate: if the concatenated string is of the pure-digit type, the key fields whose text content is of the pure-digit type are first selected from the key fields. For any such key field, a length check and a regular expression check are performed on the concatenated string according to that key field's text content rule to determine whether the string conforms to it. If it conforms, the concatenated string is determined as the text content of that key field. If it does not, the check continues with the text content rule of the next pure-digit key field, or with the next concatenated string of the pure-digit type. In this way, the length check and regular expression check need only be performed against the rules of the pure-digit key fields rather than against the content rule of every key field. This saves the time that checking a pure-digit concatenated string against every key field's content rule would cost, which improves the efficiency of determining key-field text content and the recognition accuracy for key fields whose text content is of the pure-digit type.
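The pure-digit branch can be sketched as follows. Everything in the rule table is a hypothetical stand-in: the field names, the required lengths, and the regular expressions are assumptions chosen for illustration (a mobile number is used because a realistic ID card pattern would also involve date and checksum rules that the text does not give).

```python
import re

# Hypothetical rule table: field name -> (content type, required length, regex).
# Field names, lengths, and patterns are illustrative assumptions.
FIELD_RULES = {
    "name": ("chinese", None, None),
    "mobile": ("digits", 11, re.compile(r"1[3-9][0-9]{9}")),
    "id_card": ("digits", 18, re.compile(r"[1-9][0-9]{17}")),
    "email": ("letters_special", None, re.compile(r"[^@\s]+@[^@\s]+")),
}


def match_digit_field(candidate, rules=FIELD_RULES):
    """Return the first pure-digit key field whose rule the candidate satisfies."""
    if not re.fullmatch(r"[0-9]+", candidate):
        return None  # not a pure-digit string; other branches handle it
    for field, (content_type, length, pattern) in rules.items():
        if content_type != "digits":
            continue  # skip fields that cannot hold a pure-digit value
        # Length check first, then the regular expression check.
        if len(candidate) == length and pattern.fullmatch(candidate):
            return field
    return None


# A number wrapped across two rows, concatenated in both orders.
print(match_digit_field("1381234" + "5678"))  # prints mobile
print(match_digit_field("5678" + "1381234"))  # prints None
```

Filtering on the content type before running any check is what avoids testing a digit string against the name, email, or address rules.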
If the concatenated string is of the letters-plus-at-least-one-special-character type, the key fields whose text content is of that type are first selected from the key fields. For any such key field, a regular expression check is performed on the concatenated string according to that key field's text content rule to determine whether the string conforms to it. If it conforms, the concatenated string is determined as the text content of that key field. If it does not, the check continues with the text content rule of the next key field of the letters-plus-special-character type, or with the next concatenated string of that type. In this way, the regular expression check need only be performed against the rules of the key fields of the letters-plus-special-character type rather than against the content rule of every key field. This saves the time that checking such a concatenated string against every key field's content rule would cost, and improves the recognition accuracy for key fields whose text content is of the letters-plus-at-least-one-special-character type.
If the concatenated string is of the type containing at least one Chinese word, it only needs to be processed by a preset language model to determine the sentence probability that the string conforms to the text content rule of a key field whose text content contains at least one Chinese word. Based on that sentence probability, it can be determined accurately whether the concatenated string is the text content of that key field. If the concatenated string conforms to the key field's text content rule, it is determined as the text content of that key field; if not, the check continues with the text content rule of the next key field whose text content contains at least one Chinese word. Specifically, to determine the sentence probability with the preset language model, the concatenated string is first segmented with a preset word segmentation tool, yielding at least one segment and the part of speech of each segment. The segments are then combined, by permutation and combination, into at least one sentence. For each sentence, the preset language model processes the sentence to determine its first sub-sentence probability, and processes the parts of speech of its segments to determine its second sub-sentence probability. The sentence probability of the sentence is then determined from its first and second sub-sentence probabilities. Finally, the sentence probabilities of the sentences are compared, and the largest one is taken as the sentence probability of the concatenated string. The first sub-sentence probability is determined by counting the word frequencies of the segments in the sentence; the second sub-sentence probability is determined by counting the frequencies of the parts of speech of the segments in the sentence. Introducing both a segment-based probability and a part-of-speech-based probability allows the sentence probability of the concatenated string to be determined more fully and precisely, which supports a more accurate decision on whether the string is the text content of a key field containing at least one Chinese word.
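The three string categories distinguished above (pure digits, letters plus at least one special character, and strings containing Chinese) could be told apart with a small classifier like the one below. The category labels and the use of the CJK Unified Ideographs range `\u4e00-\u9fff` as the test for "contains Chinese" are assumptions of this sketch, not details given in the text.

```python
import re


def classify(text):
    """Assign a concatenated string to one of the three content types."""
    if re.fullmatch(r"[0-9]+", text):
        return "digits"
    if re.search(r"[\u4e00-\u9fff]", text):
        return "chinese"  # contains at least one Chinese character
    has_letter = re.search(r"[A-Za-z]", text) is not None
    has_special = re.search(r"[^A-Za-z0-9\s]", text) is not None
    if has_letter and has_special:
        return "letters_special"
    return "other"  # e.g. plain letters; not covered by the three types


for sample in ("123456789012345678", "zhangsan@qq.com", "张三12345678"):
    print(sample, "->", classify(sample))
```

The returned label selects which subset of key fields, and which kind of check (length plus regular expression, regular expression only, or language model), is applied to the string.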
By way of example, continuing with the OCR text result shown in FIG. 2b, there are three ways of processing the character strings in the third row, whose string count differs from the number of key fields in the first row. In the first processing mode, the split-off string is of the pure-digit type, such as the string "12345678" in FIG. 2b; pure-digit strings typically appear in mobile phone numbers, ID card numbers, and the like. To process a pure-digit string, it is concatenated with each string in the nearest preceding row, yielding at least one concatenated string. A length check and a regular expression check are performed on each concatenated string to determine whether it satisfies the content format requirement of a key field in the first row; if so, the concatenated string is determined as the text content of that key field. For example, "12345678" from the third row is concatenated with "张三" (Zhang San) from the second row to form "张三12345678" or "12345678张三"; the length check and regular expression check show that neither satisfies the content format requirement of any key field in the first row. Next, "12345678" is concatenated with "1234567890" from the second row to form "123456781234567890" or "123456789012345678". A length check shows that both satisfy the number length requirement of the key field "ID card" in the first row. Matching both against the ID card regular expression then shows that only "123456789012345678" satisfies the ID card number format, so "123456789012345678" is taken as the text content of the ID card field. Before any regular expression check is performed on concatenated strings, a corresponding regular expression is configured for the text content of each key field of each type of form certificate, so that content checks can be performed promptly and effectively. Once the match is found, "12345678" need not be concatenated with the other strings in the second row (such as "zhangsan" or "深圳市南山区龙海家"), and the matching check for "12345678" ends.
Alternatively, each string in the third row is concatenated with each string in the second row to obtain multiple concatenated strings. For each concatenated string, if it is of the pure-digit type, the key fields whose content format is of the pure-digit type are identified among the key fields of the first row, and a length check and a regular expression check are performed on the concatenated string according to the content format requirement of such a key field; if the checks succeed, the concatenated string is determined as the text content of that key field. For example, concatenating "12345678" from the third row with each string in the second row yields "张三12345678", "12345678张三", "123456781234567890", "123456789012345678", "12345678zhangsan", "zhangsan12345678", "12345678深圳市南山区龙海家", and "深圳市南山区龙海家12345678". Take "123456781234567890" and "123456789012345678": both are of the pure-digit type, and among the key fields of the first row only "ID card" has a pure-digit content format. A length check against the "ID card" content format shows that both strings satisfy the number length requirement. Matching both against the "ID card" regular expression then shows that only "123456789012345678" satisfies the ID card number format, so "123456789012345678" is taken as the text content of the ID card field.
In the second processing mode, the split-off string is of the letters-plus-at-least-one-special-character type (containing no Chinese words), such as the string "@qq.com" in FIG. 2b; strings of this type typically appear in email addresses and the like. To process such a string, it is concatenated with each string in the nearest preceding row, yielding at least one concatenated string. A regular expression check is performed on each concatenated string to determine whether it satisfies the content format requirement of a key field in the first row; if so, the concatenated string is determined as the text content of that key field. For example, "@qq.com" from the third row is concatenated with "张三" from the second row to form "张三@qq.com" or "@qq.com张三". Both are matched against the regular expression of each key field in the first row, and neither satisfies the content format requirement of any key field. Next, "@qq.com" is concatenated with "1234567890" from the second row to form "@qq.com1234567890" or "1234567890@qq.com". Both are again matched against the regular expression of each key field in the first row, and neither satisfies the content format requirement of any key field. Then "@qq.com" is concatenated with "zhangsan" from the second row to form "@qq.comzhangsan" or "zhangsan@qq.com". Matching both against the regular expression of each key field in the first row shows that only "zhangsan@qq.com" satisfies the content format requirement of the key field "email", so "zhangsan@qq.com" is taken as the text content of the email field. Once the match is found, "@qq.com" need not be concatenated with the other strings in the second row (such as "深圳市南山区龙海家"), and the matching check for "@qq.com" ends.
Alternatively, each string in the third row is concatenated with each string in the second row to obtain multiple concatenated strings. For each concatenated string, if it is of the letters-plus-at-least-one-special-character type, the key fields whose content format is of that type are identified among the key fields of the first row, and a regular expression check is performed on the concatenated string according to the content format requirement of such a key field; if the check succeeds, the concatenated string is determined as the text content of that key field. For example, concatenating "@qq.com" from the third row with each string in the second row yields "张三@qq.com", "@qq.com张三", "@qq.com1234567890", "1234567890@qq.com", "zhangsan@qq.com", "@qq.comzhangsan", "@qq.com深圳市南山区龙海家", and "深圳市南山区龙海家@qq.com". Take "zhangsan@qq.com" and "@qq.comzhangsan": both are of the letters-plus-at-least-one-special-character type, and among the key fields of the first row only "email" has a content format of that type. A regular expression check against the "email" content format shows that "zhangsan@qq.com" satisfies it, so "zhangsan@qq.com" is taken as the text content of the email field.
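The concatenate-in-both-orders-and-verify loop of the second processing mode can be sketched as below. The email pattern is a hypothetical rule written for this illustration; it requires the local part to start with a letter, which is an assumption made so that "1234567890@qq.com" is rejected as in the walkthrough above. The function stops at the first match, mirroring the early termination described in the text.

```python
import re

# Hypothetical content rule for the "email" key field (an assumption).
EMAIL_RULE = re.compile(r"[A-Za-z][A-Za-z0-9._-]*@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")


def splice_and_match(fragment, previous_row, rule):
    """Concatenate the fragment with each string of the previous row, in both
    orders, and return the first concatenation that fully matches the rule."""
    for neighbor in previous_row:
        for candidate in (neighbor + fragment, fragment + neighbor):
            if rule.fullmatch(candidate):
                return candidate  # stop early; later strings are not tried
    return None


previous_row = ["张三", "1234567890", "zhangsan", "深圳市南山区龙海家"]
print(splice_and_match("@qq.com", previous_row, EMAIL_RULE))  # prints zhangsan@qq.com
```

The same loop serves the pure-digit mode if `rule` is replaced by a check that combines the length test with the field's regular expression.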
In the third processing mode, the split-off string is of the type containing at least one Chinese word, such as the string "园8栋1818号" in FIG. 2b. To process such a string, it is concatenated with each string in the nearest preceding row, yielding at least one concatenated string. Each concatenated string is segmented and part-of-speech tagged with a preset word segmentation tool, such as the jieba Chinese word segmentation tool, yielding at least one segment and the part of speech of each segment. The segments are then arranged and combined into sentences, which can produce multiple sentences. For each of these sentences, a preset language model (such as an n-gram language model) processes the sentence to determine its first sub-sentence probability, and a preset language model (such as an n-gram language model) processes the parts of speech of its segments to determine its second sub-sentence probability. The sentence probability of the sentence is then determined from its first and second sub-sentence probabilities. The sentence probabilities of the sentences are compared, and the largest is taken as the sentence probability of the concatenated string. Finally, the sentence probabilities of the concatenated strings are compared, and the concatenated string with the largest sentence probability is determined as the text content of a key field whose text content contains at least one Chinese word.
When the sentence probability of a sentence is determined from its first sub-sentence probability and its second sub-sentence probability, it may be computed as follows:

$$P = \Omega \times P_w \times (1 - \Omega) \times P_c$$

Here, $P$ denotes the sentence probability of the sentence; $P_w$ denotes the sentence probability computed by statistical processing of the segments in the sentence (the first sub-sentence probability); $P_c$ denotes the sentence probability computed by statistical processing of the parts of speech of the segments in the sentence (the second sub-sentence probability); and $\Omega$ denotes a weight, for example 0.5, 0.6, or 0.7, which may be set according to the actual application scenario or the experience of a person skilled in the art.
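The combination formula and the final "take the largest" step can be sketched as follows. The function names, the restriction of the weight to the open interval (0, 1), and the probability values in the example are all assumptions of this sketch; the formula itself is applied exactly as written above, as a product.

```python
def sentence_probability(p_w, p_c, omega=0.5):
    """Combine the two sub-sentence probabilities: P = Ω × P_w × (1 − Ω) × P_c."""
    # Assumption: the weight lies strictly between 0 and 1.
    if not 0.0 < omega < 1.0:
        raise ValueError("omega must lie strictly between 0 and 1")
    return omega * p_w * (1.0 - omega) * p_c


def best_sentence(scored_sentences, omega=0.5):
    """Pick the sentence with the largest combined probability.

    scored_sentences maps each candidate sentence to a (p_w, p_c) pair.
    """
    return max(
        scored_sentences,
        key=lambda s: sentence_probability(*scored_sentences[s], omega=omega),
    )


# Made-up sub-sentence probabilities for two candidate orderings.
candidates = {
    "深圳市南山区龙海家园8栋1818号": (0.020, 0.30),
    "园8栋1818号深圳市南山区龙海家": (0.001, 0.05),
}
print(best_sentence(candidates, omega=0.6))
```

Because the factor $\Omega(1-\Omega)$ is the same for every candidate, the ranking depends only on the product $P_w P_c$ for a fixed weight.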
The n-gram language model is briefly introduced below.
An n-gram language model is typically used to describe the probability that a random sequence of segments forms a sentence with normal semantics. Suppose a sentence consists of the segments $W_1, W_2, \ldots, W_n$; its probability is then:
Figure PCTCN2022100026-appb-000003
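The formula image referenced here did not survive extraction. It presumably states the chain rule of probability, since the worked expansion for the address example in this section has exactly that shape. A reconstruction under that assumption, not a transcription of the original image:

```latex
P(W_1, W_2, \ldots, W_n)
  = P(W_1)\, P(W_2 \mid W_1)\, P(W_3 \mid W_1, W_2) \cdots P(W_n \mid W_1, \ldots, W_{n-1})
```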
The conditional probabilities in the n-gram language model can be computed from word frequencies:
Figure PCTCN2022100026-appb-000004
For example, suppose a semantically normal character string is "深圳市南山区龙海家园8栋1818号" (No. 1818, Building 8, Longhai Jiayuan, Nanshan District, Shenzhen). After word segmentation and part-of-speech tagging with the Jieba Chinese word segmentation tool, at least one word segment is obtained, namely ["深圳市", "南山区", "龙海", "家园", "8", "栋", "1818", "号"]. The probability that combining these word segments produces the character string "深圳市南山区龙海家园8栋1818号" is then:
P(深圳市, 南山区, 龙海, 家园, 8, 栋, 1818, 号)
= P(深圳市) × P(南山区 | 深圳市) × P(龙海 | 深圳市, 南山区)
× P(家园 | 深圳市, 南山区, 龙海) × P(8 | 深圳市, 南山区, 龙海, 家园)
× P(栋 | 深圳市, 南山区, 龙海, 家园, 8)
× P(1818 | 深圳市, 南山区, 龙海, 家园, 8, 栋)
× P(号 | 深圳市, 南山区, 龙海, 家园, 8, 栋, 1818)
Each probability in the above formula can be computed by counting the frequencies of the word segments; see Table 1.
Table 1
Figure PCTCN2022100026-appb-000005
However, computing the probability in the manner shown in Table 1 is too complex. The unigram model (n=1), the bigram model (n=2) and the trigram model (n=3) of the n-gram language model were therefore compared, and, in view of the practical requirements (the relationship between words should be taken into account, but the resulting relationship matrix must not be too sparse), the bigram model was selected to compute the probability of forming a sentence. For the unigram model, the probability is computed as
Figure PCTCN2022100026-appb-000006
The unigram model is simple to compute but does not consider the order of words. For the bigram model, the probability is computed as
Figure PCTCN2022100026-appb-000007
The bigram model is still relatively simple to compute and takes the order of two adjacent words into account. For the trigram model, the probability is computed as
Figure PCTCN2022100026-appb-000008
The trigram model takes the order of three words into account, but the resulting relationship matrix is too sparse to be practical.
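The three formulas above survive only as image placeholders in this text. Under the standard n-gram definitions, which are assumed here rather than read from the figures, they would take the following form for a sentence W_1, …, W_n:

```latex
\begin{align*}
\text{unigram } (n=1):\quad & P(W_1,\dots,W_n) \approx \prod_{i=1}^{n} P(W_i) \\
\text{bigram } (n=2):\quad  & P(W_1,\dots,W_n) \approx P(W_1)\prod_{i=2}^{n} P(W_i \mid W_{i-1}) \\
\text{trigram } (n=3):\quad & P(W_1,\dots,W_n) \approx P(W_1)\,P(W_2 \mid W_1)\prod_{i=3}^{n} P(W_i \mid W_{i-2}, W_{i-1})
\end{align*}
```

Note that the worked bigram example in this document starts directly from P(W_2 | W_1) and does not include a leading P(W_1) factor, so the figures may differ from the textbook form in that respect.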
It should be noted that, when each probability in the above formula is computed from word-segment frequencies, a Chinese corpus is first obtained, and the whole corpus is then segmented with the open-source Jieba Chinese word segmentation tool. The vocabulary formed by all distinct word segments is denoted W = {W_1, W_2, W_3, …, W_n}, where n is the size of the vocabulary, that is, the number of distinct word segments obtained after segmenting one or more sentences. Next, the transition probability of each word segment is counted. Suppose that, over the whole corpus, word segment W_i occurs count(W_i) times in total, and that the word segments adjacent to W_i and appearing immediately after it are W_1, W_2 and W_3, occurring count(W_i, W_1), count(W_i, W_2) and count(W_i, W_3) times respectively. The transition probabilities from W_i to W_1, W_2 and W_3 are then count(W_i, W_1)/count(W_i), count(W_i, W_2)/count(W_i) and count(W_i, W_3)/count(W_i), and the transition probability from W_i to every other word segment in the vocabulary W is 0. Finally, the word-segment transition probability matrix X is obtained, with dimension n×n; the value in row i, column j of the matrix equals the transition probability from word segment W_i to word segment W_j, namely count(W_i, W_j)/count(W_i).
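The construction of the transition matrix X can be sketched as follows. This is a minimal illustration under stated assumptions: the corpus is a made-up, already-segmented toy list (in practice the segments would come from a tool such as Jieba), and the function name `transition_matrix` is hypothetical.

```python
from collections import Counter


def transition_matrix(segmented_sentences):
    """Return (vocab, X) with X[i][j] = count(W_i, W_j) / count(W_i)."""
    unigram = Counter()
    bigram = Counter()
    for words in segmented_sentences:
        unigram.update(words)
        # Adjacent pairs (W_i immediately followed by W_j).
        bigram.update(zip(words, words[1:]))
    vocab = sorted(unigram)
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    # Pairs never observed keep a transition probability of 0.
    X = [[0.0] * n for _ in range(n)]
    for (a, b), c in bigram.items():
        X[index[a]][index[b]] = c / unigram[a]
    return vocab, X


# Toy corpus, invented for illustration only.
corpus = [
    ["深圳市", "南山区", "龙海", "家园"],
    ["深圳市", "南山区", "科技园"],
    ["深圳市", "福田区"],
]
vocab, X = transition_matrix(corpus)
i, j = vocab.index("深圳市"), vocab.index("南山区")
print(X[i][j])  # count(深圳市, 南山区) / count(深圳市)
```

A dense n×n list is used only to mirror the matrix described in the text; for a realistic vocabulary a sparse mapping from pairs to probabilities would likely be preferable.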
The probability of producing the semantically normal character string "深圳市南山区龙海家园8栋1818号" under the bigram model can be expressed as:
P(深圳市, 南山区, 龙海, 家园, 8, 栋, 1818, 号)
= P(南山区 | 深圳市) × P(龙海 | 南山区) × P(家园 | 龙海) × P(8 | 家园)
× P(栋 | 8) × P(1818 | 栋) × P(号 | 1818)
Each probability in the above formula can be computed by counting word frequencies; see Table 2.
Table 2
n-gram language model | Probability
P(南山区 | 深圳市) | Count(深圳市, 南山区)/count(深圳市)
P(龙海 | 南山区) | Count(南山区, 龙海)/count(南山区)
P(家园 | 龙海) | Count(龙海, 家园)/count(龙海)
P(8 | 家园) | Count(家园, 8)/count(家园)
P(栋 | 8) | Count(8, 栋)/count(8)
P(1818 | 栋) | Count(栋, 1818)/count(栋)
P(号 | 1818) | Count(1818, 号)/count(1818)
However, a major problem in natural language processing is out-of-vocabulary words, that is, words that appear in the test set but never appeared in the training set, which cause the language model to compute a probability of 0. For example, if the word "龙海家园" does not occur in the data set, Count(龙海, 家园)/count(龙海) in Table 2 evaluates to 0; likewise, a subsequence that never appeared in the training set also yields a probability of 0. The word-frequency formula therefore needs to be smoothed, here with Laplace smoothing (adding 1 to both the numerator and the denominator), so that the sentence probability P_w computed from word frequencies can be expressed as:
Figure PCTCN2022100026-appb-000009
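The smoothed formula itself is only available as an image here. The sketch below follows the wording of the text literally (1 added to both the numerator and the denominator of each bigram ratio); the exact formula in the figure may differ, and the textbook add-one variant adds the vocabulary size to the denominator instead. The function name and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def smoothed_bigram_probability(words, unigram, bigram):
    """Product over adjacent pairs of (count(a, b) + 1) / (count(a) + 1).

    Assumption: this mirrors the text's description of the smoothing,
    not a formula read from the figure.
    """
    p = 1.0
    for a, b in zip(words, words[1:]):
        # Counter returns 0 for unseen keys, so unseen pairs stay above 0.
        p *= (bigram[(a, b)] + 1) / (unigram[a] + 1)
    return p


# Toy, already-segmented corpus invented for illustration.
corpus = [["深圳市", "南山区", "科技园"], ["深圳市", "福田区"]]
unigram = Counter(w for sentence in corpus for w in sentence)
bigram = Counter(pair for sentence in corpus for pair in zip(sentence, sentence[1:]))

seen = smoothed_bigram_probability(["深圳市", "南山区"], unigram, bigram)
unseen = smoothed_bigram_probability(["深圳市", "家园"], unigram, bigram)
print(seen, unseen)  # the unseen pair no longer collapses to 0
```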
It should be noted that, in Chinese sentences, the order in which parts of speech appear follows certain patterns. Therefore, when the probability of producing a sentence is computed with the bigram model, the part of speech of each word segment in the sentence is also considered; that is, a part-of-speech probability is introduced. Take the sentence "张三有一台黑色小汽车" ("Zhang San has a black car") as an example. After word segmentation and part-of-speech tagging with the Jieba Chinese word segmentation tool, at least one word segment and the part of speech of each word segment are obtained, namely [("张三", "person name"), ("有", "verb"), ("一台", "numeral"), ("黑色", "noun"), ("小汽车", "noun")]. This example shows that a person name is very unlikely to be followed directly by a numeral, and a numeral is very unlikely to be followed directly by a verb. Computing the probability of the order of the parts of speech of the words that form a sentence can therefore compensate for the shortcomings of a sentence probability computed from word frequency alone, so that the sentence that matches the actual situation is determined more accurately. Alternatively, take the address "深圳市南山区龙海家园8栋1818号" as an example. After word segmentation and part-of-speech tagging with the Jieba Chinese word segmentation tool, at least one word segment and the part of speech of each word segment are obtained, namely [("深圳市", "place name"), ("南山区", "place name"), ("龙海", "place name"), ("家园", "noun"), ("8", "numeral"), ("栋", "numeral"), ("1818", "numeral"), ("号", "numeral")]. This example shows that a place name is very unlikely to be followed directly by a numeral, and a numeral is very unlikely to be followed directly by a place name.
The sentence probability P_c computed from parts of speech is likewise calculated with the bigram model plus Laplace smoothing, specifically:
Figure PCTCN2022100026-appb-000010
where C_i denotes the part of speech of each word segment in the sentence.
It should be noted that, when the probability between every two parts of speech in the above formula is computed from part-of-speech frequencies, a Chinese corpus is first obtained, and the whole corpus is then segmented and tagged with the open-source Jieba Chinese word segmentation tool, which gives the part-of-speech tag of each word segment. The set of all parts of speech is denoted by the part-of-speech table C = {C_1, C_2, C_3, …, C_m}, where m is the size of the set. Next, the transition probability of each part of speech is counted. Suppose that, over the whole corpus, part of speech C_i occurs count(C_i) times in total, and that the parts of speech adjacent to C_i and appearing immediately after it are C_1, C_2 and C_3, occurring count(C_i, C_1), count(C_i, C_2) and count(C_i, C_3) times respectively. The transition probabilities from C_i to C_1, C_2 and C_3 are then count(C_i, C_1)/count(C_i), count(C_i, C_2)/count(C_i) and count(C_i, C_3)/count(C_i), and the transition probability from C_i to every other part of speech in the table C is 0. Finally, the part-of-speech transition probability matrix Y is obtained, with dimension m×m; the value in row i, column j of the matrix equals the transition probability from part of speech C_i to part of speech C_j, namely count(C_i, C_j)/count(C_i).
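As a rough sketch of the part-of-speech statistics just described, the transition probabilities can be collected from tagged sentences as shown below. The tagged input reuses the address example from the text as a one-sentence toy corpus; the function name and the choice of a pair-keyed dictionary instead of an m×m matrix are assumptions for brevity.

```python
from collections import Counter


def pos_transition_probabilities(tagged_sentences):
    """Map (C_i, C_j) to count(C_i, C_j) / count(C_i) over tagged sentences."""
    tag_count = Counter()
    pair_count = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _word, tag in sentence]
        tag_count.update(tags)
        pair_count.update(zip(tags, tags[1:]))
    return {pair: c / tag_count[pair[0]] for pair, c in pair_count.items()}


# One tagged sentence taken from the address example; a real corpus would be far larger.
tagged = [[
    ("深圳市", "place name"), ("南山区", "place name"), ("龙海", "place name"),
    ("家园", "noun"), ("8", "numeral"), ("栋", "numeral"),
    ("1818", "numeral"), ("号", "numeral"),
]]
Y = pos_transition_probabilities(tagged)
print(Y[("place name", "place name")])
# Pairs that never occur are simply absent, i.e. transition probability 0.
print(Y.get(("place name", "numeral"), 0.0))
```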
For example, continuing with FIG. 2b, consider concatenating the character string "园8栋1818号" with the character string "深圳市南山区龙海家" in the second row, so that the concatenated character string is "深圳市南山区龙海家园8栋1818号". After word segmentation and part-of-speech tagging of the concatenated character string with the Jieba Chinese word segmentation tool, at least one word segment is obtained, namely ["深圳市", "南山区", "龙海", "家园", "8", "栋", "1818", "号"], together with the part of speech of each word segment, namely [("深圳市", "place name"), ("南山区", "place name"), ("龙海", "place name"), ("家园", "noun"), ("8", "numeral"), ("栋", "numeral"), ("1818", "numeral"), ("号", "numeral")]. The word segments are combined into multiple sentences by permutation. One of these sentences is "深圳市南山区龙海家园8栋1818号". The bigram language model applied to the word-segment frequencies gives its first sub-sentence probability P_1, say P_1 = 0.8, and the bigram language model applied to the part-of-speech frequencies gives its second sub-sentence probability P_2, say P_2 = 0.75. With Ω = 0.5, the sentence probability of this sentence is P = 0.5 × 0.8 × (1 − 0.5) × 0.75 = 0.15. Another sentence is "南山区龙海家园深圳市8栋1818号". In the same way, suppose its first sub-sentence probability is P_1 = 0.4 and its second sub-sentence probability is P_2 = 0.35; with Ω = 0.5, its sentence probability is P = 0.5 × 0.4 × (1 − 0.5) × 0.35 = 0.035.
Calculation shows that the probabilities of all the other sentences are below 0.15, so comparing the probabilities of the multiple sentences gives a maximum of 0.15. That is, concatenating the character string "园8栋1818号" with the character string "深圳市南山区龙海家" in the second row to produce "深圳市南山区龙海家园8栋1818号" has the highest probability, 0.15. In addition, calculation shows that concatenating "园8栋1818号" with any other character string in the second row yields a sentence probability below 0.15. For example, when "园8栋1818号" is concatenated with the character string "张三" in the second row and the result is segmented and permuted, suppose the bigram model finds the sentence "张三园8栋1818号" to be the most probable; its probability is still below 0.15. Likewise, when "园8栋1818号" is concatenated with the character string "身份证" (ID card) in the second row and the result is segmented and permuted, suppose the bigram model finds the sentence "园8栋1818身份证号" to be the most probable; its probability is also below 0.15. At the same time, "深圳市南山区龙海家园8栋1818号" is determined to conform to the text content rule of the key field "家庭地址" (home address). The concatenated character string "深圳市南山区龙海家园8栋1818号" can therefore be determined as the text content of the key field "家庭地址".
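The permutation-and-compare step in this example can be sketched as follows. This is only an illustration under stated assumptions: the pair probabilities in `pair_prob` are invented placeholders rather than values from a trained model, a coarser three-segment split is used to keep the search small, and `best_ordering` is a hypothetical helper name.

```python
from itertools import permutations


def best_ordering(segments, pair_prob, default=0.01):
    """Score every ordering of the segments with a bigram product and
    return (joined sentence, probability) for the highest-scoring one."""
    def score(order):
        p = 1.0
        for a, b in zip(order, order[1:]):
            # Pairs missing from the table get a small default probability.
            p *= pair_prob.get((a, b), default)
        return p

    best = max(permutations(segments), key=score)
    return "".join(best), score(best)


# Hypothetical bigram probabilities, chosen only to make the example concrete.
pair_prob = {("深圳市", "南山区"): 0.9, ("南山区", "龙海家园"): 0.8}
sentence, p = best_ordering(["南山区", "龙海家园", "深圳市"], pair_prob)
print(sentence, p)
```

The number of orderings grows factorially with the number of segments (8 segments already give 40,320 sentences), so a practical implementation would presumably prune or restrict the candidates.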
Based on this, for a text line whose number of character strings differs from the number of key fields of this type of form certificate, processing according to the three processing manners described above yields the complete text content of each key field; that is, the complete text content of each key field is obtained after column alignment. FIG. 3 is a schematic diagram of text content after column alignment processing according to an embodiment of the present invention. As FIG. 3 shows, the text content of the key field "姓名" (name) is "张三", the text content of the key field "身份证" (ID card) is "123456789012345678", the text content of the key field "邮箱" (email) is "zhangsan@qq.com", and the text content of the key field "家庭地址" (home address) is "深圳市南山区龙海家园8栋1818号".
Step 104: determine the text content of each key field as the second text content of the form certificate image file of this type.
In this embodiment of the present invention, after the text content of each key field is determined as the second text content of the form certificate image file of this type, the key-value text content of the image file needs to be organized, conveniently and accurately, into content that conforms to a set data format (for example, the Json data format). To this end, where the number of character strings in a text line of the text content does not match the number of key fields of this type of form certificate (the column misalignment problem), the solution performs the corresponding column alignment on the character strings in that text line, constructs a directed acyclic graph from the column-aligned text content, and then traverses the nodes of the directed acyclic graph in turn with a depth-first search algorithm to obtain data content in the set data format. Specifically, the key field at the starting position of this type of form certificate is taken as the construction starting point, and the directed acyclic graph of this type of form certificate is constructed from the second text content. The node at the starting position of the directed acyclic graph is then taken as the traversal starting point, and the nodes of the graph are traversed in turn with the depth-first search algorithm, which yields data content in the set data format.
During this traversal, for any node in the directed acyclic graph, when the node is determined not to be an empty node, it is determined whether the node exists in the key field library, which stores the key fields of each type of form certificate. If so, the node is determined to be a first key field node, and the first node adjacent to the first key field node is traversed along the first traversal direction. When the first node is not an empty node and does not exist in the key field library, the first node is determined to be the text content of the first key field node; traversal stops when an empty node appears in the first traversal direction. The first traversal direction is top to bottom. In this way, if the node is a first key field node (that is, a key field of a certain type of form certificate), traversing the adjacent node below it in the top-to-bottom direction promptly finds the value of the first key field node, that is, its text content. The key-value content corresponding to the set data format is then updated with the text content of the first key field node, which gives the data content in the set data format to which the first key field is mapped. In the key field library, the first key field node, as a key, corresponds to one of the keys in the key-value content of the set data format.
After traversal stops because an empty node has appeared in the first traversal direction, the second node adjacent to the first key field node is traversed along the second traversal direction. When the second node is not an empty node and exists in the key field library, the second node is determined to be a second key field node, and the third node adjacent to the second key field node is traversed along the first traversal direction. If the third node is not an empty node and does not exist in the key field library, the third node can be determined to be the text content of the second key field node; traversal stops when an empty node appears in the first traversal direction. After traversal stops in the first traversal direction, the fourth node adjacent to the second key field node is traversed along the second traversal direction, and traversal stops when an empty node appears in the second traversal direction. The second traversal direction is left to right. In this way, if the second node is not an empty node and exists in the key field library, the second node is a second key field node (that is, a key field of a certain type of form certificate), and traversing the third node adjacent to it in the top-to-bottom direction promptly finds the value of the second key field node, that is, its text content. The key-value content corresponding to the set data format is then updated with the text content of the second key field node, which gives the data content in the set data format to which the second key field is mapped. In the key field library, the second key field node, as a key, corresponds to one of the keys in the key-value content of the set data format.
For example, taking FIG. 3, the directed acyclic graph shown in FIG. 4 can be constructed from the key fields in FIG. 3 and the text content of each key field. The data of each node in FIG. 4 can be {"tableValue": "the value in the table, for example 姓名", "rightNode": "the node to the right of the current node", "belowNode": "the node below the current node"}.
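A graph with this node layout could be built from the column-aligned table roughly as follows. This is a sketch under assumptions: FIG. 4 is not reproduced here, so the linking rule (header cells chained left to right, each column chained top to bottom, value cells without right links) is a guess that matches the traversal described in the text, and `build_graph` is a hypothetical helper.

```python
def build_graph(header, rows):
    """Link header cells left to right and each column top to bottom.

    Returns the root node (the first header cell), or None for an empty header.
    """
    def make_node(value):
        return {"tableValue": value, "rightNode": None, "belowNode": None}

    columns = []
    for col, key in enumerate(header):
        top = make_node(key)
        current = top
        for row in rows:
            # Each value cell hangs below the previous cell of its column.
            current["belowNode"] = make_node(row[col])
            current = current["belowNode"]
        columns.append(top)
    for left, right in zip(columns, columns[1:]):
        left["rightNode"] = right
    return columns[0] if columns else None


header = ["姓名", "身份证", "邮箱", "家庭地址"]
rows = [["张三", "123456789012345678", "zhangsan@qq.com", "深圳市南山区龙海家园8栋1818号"]]
root = build_graph(header, rows)
print(root["tableValue"], root["belowNode"]["tableValue"], root["rightNode"]["tableValue"])
```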
For the directed acyclic graph shown in FIG. 4, the depth-first search (DFS) algorithm starts the traversal from the root node, that is, from "tableValue=姓名". For any node in the directed acyclic graph, it is first determined whether the node is an empty node. If it is not, it is determined whether the node exists in the keymap, that is, the tableValue is matched against the keyname entries in the keymap. For example, when the root node is traversed, "tableValue=姓名" is looked up in the keymap to see whether "keyname=姓名" exists; if it does, the entry {"keyname": "姓名", "jsonkey": "name"} is matched. The json key corresponding to this node (the value of "jsonkey" in the keymap, here "name") can then be written into the data content in the json data format, so that the current data content is {"name": Null}. It then only remains to find the value corresponding to this key in the json.
In the technical solution of this embodiment of the present invention, a key field mapping array is configured in advance for each type of form certificate; that is, the mapping of each key field of each type of form certificate is stored in the key field library. Taking the type of form certificate shown in FIG. 2a as an example, extracting the header key fields of this type of form certificate forms a keymap; that is, the keymap of this type of form certificate can be configured as:
Keymap = [
{"keyname": "姓名", "jsonkey": "name"},
{"keyname": "身份证", "jsonkey": "idNo"},
{"keyname": "邮箱", "jsonkey": "email"},
{"keyname": "家庭地址", "jsonkey": "address"},
];
Here, keyname corresponds to the name of a header field in this type of form certificate, and jsonkey corresponds to the key in the finally organized data content in the json data format. The keymap is updated as follows: when a field name in the original form certificate of a certain type is changed or a new field is added, the new field name is simply inserted into the keymap; when a field name in the original form certificate of a certain type is removed, the keymap is not updated. In addition, the keymap does not need to be stored in multiple versions (it keeps the full set of known keys); for each type of form certificate, only the latest keymap needs to be kept for processing form certificate image files of that type.
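The update rule for the keymap (insert new or renamed field names, never delete) might be sketched as below. The helper `update_keymap` and the added field "手机号" with jsonkey "phone" are hypothetical and serve only to illustrate the rule.

```python
def update_keymap(keymap, keyname, jsonkey):
    """Insert a mapping for a new or renamed header field.

    Existing entries are never removed or overwritten, so one keymap
    keeps the full set of known keys for the form certificate type.
    """
    if all(entry["keyname"] != keyname for entry in keymap):
        keymap.append({"keyname": keyname, "jsonkey": jsonkey})
    return keymap


keymap = [
    {"keyname": "姓名", "jsonkey": "name"},
    {"keyname": "身份证", "jsonkey": "idNo"},
]
update_keymap(keymap, "手机号", "phone")  # hypothetical newly added field
update_keymap(keymap, "姓名", "name")     # already known, keymap unchanged
print(len(keymap))
```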
Then, following the depth-first search algorithm, the node below is found through belowNode. It is first determined whether the node below is an empty node. If it is not an empty node, it is determined whether it exists in the keymap; if it does not exist in the keymap, this node is the value node of the previous node, that is, the value of the previous node (because the value node of a key can only appear to its right or below it). If it is an empty node, this branch has been fully traversed. For example, when the node "tableValue=张三" below the root node "tableValue=姓名" is traversed, it is determined that this node is not an empty node and does not exist in the keymap, so it can be determined to be the value node of the previous node; that is, the node "tableValue=张三" is the value of the root node "tableValue=姓名". The data content in the json data format, {"name": Null}, can now be updated, so that it becomes {"name": "张三"}.
Next, still following the depth-first search algorithm, the node below is found through belowNode. It is first determined whether the node below is an empty node. If it is not an empty node, it is determined whether it exists in the keymap; if it does not exist in the keymap, this node is the value node of the previous node. If it is an empty node, the nodes in this traversal direction (that is, on this traversal branch) have all been traversed. For example, when the node below "tableValue=张三" is traversed, it is found to be an empty node, so this traversal branch is complete. The traversal then returns to the node from which the branch started, that is, to the root node "tableValue=姓名", and the node to the right of the root node is traversed through rightNode, that is, the node "tableValue=身份证".
It is first determined whether this node is an empty node. If it is not, it is determined whether it exists in the keymap, that is, "tableValue=身份证" is looked up in the keymap to see whether "keyname=身份证" exists; if it does, the entry {"keyname": "身份证", "jsonkey": "idNo"} is matched. The json key corresponding to this node (the value of "jsonkey" in the keymap, here "idNo") can then be written into the data content in the json data format, so that the current data content is {"idNo": Null}. It then only remains to find the value corresponding to this key in the json. Next, the node "tableValue=123456789012345678" below the node "tableValue=身份证" is traversed. It is determined that this node is not an empty node and does not exist in the keymap, so it can be determined to be the value node of the previous node; that is, the node "tableValue=123456789012345678" is the value of the node "tableValue=身份证". The data content in the json data format, {"idNo": Null}, can now be updated, so that it becomes {"idNo": "123456789012345678"}. Then the node below "tableValue=123456789012345678" is traversed and found to be an empty node, so this traversal branch is complete. The traversal returns to the node from which the branch started, that is, to the node "tableValue=身份证", and the node to its right is traversed through rightNode, that is, the node "tableValue=邮箱".
When the node "tableValue=Email" is traversed, it is first determined whether the node is empty. If not, it is determined whether the node exists in the keymap, that is, the keymap is queried with "tableValue=Email" to see whether an entry with "keyname=Email" exists; if so, the entry {"keyname": "Email", "jsonkey": "email"} is matched. The JSON key corresponding to this node (the "jsonkey" value "email" in the keymap) can then be written into the JSON data, so the current entry is {"email": Null}; it only remains to find the value for this key. Next, the node "tableValue=zhangsan@qq.com" below the node "tableValue=Email" is traversed; it is not an empty node and does not exist in the keymap, so it is determined to be the value node of the previous node, that is, "tableValue=zhangsan@qq.com" is the value of the node "tableValue=Email". The entry {"email": Null} is therefore updated to {"email": "zhangsan@qq.com"}. The node below "tableValue=zhangsan@qq.com" is then traversed and found to be empty, so this branch is complete; the traversal returns to the node from which the branch started, "tableValue=Email", and proceeds through rightNode to the node on its right, "tableValue=Home address".
When the node "tableValue=Home address" is traversed, it is first determined whether the node is empty. If not, it is determined whether the node exists in the keymap, that is, the keymap is queried with "tableValue=Home address" to see whether an entry with "keyname=Home address" exists; if so, the entry {"keyname": "Home address", "jsonkey": "address"} is matched. The JSON key corresponding to this node (the "jsonkey" value "address" in the keymap) can then be written into the JSON data, so the current entry is {"address": Null}; it only remains to find the value for this key. Next, the node "tableValue=No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen" below the node "tableValue=Home address" is traversed; it is not an empty node and does not exist in the keymap, so it is determined to be the value node of the previous node, that is, it is the value of the node "tableValue=Home address". The entry {"address": Null} is therefore updated to {"address": "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen"}. The node below this value node is then traversed and found to be empty, so this branch is complete and the traversal returns to the node from which the branch started, "tableValue=Home address". The node to its right is then traversed through rightNode and is also found to be empty, so this branch is complete as well. This ends the traversal of the directed acyclic graph, and the mapped JSON data is: {"name": "Zhang San", "idNo": "123456789012345678", "email": "zhangsan@qq.com", "address": "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen"}.
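The traversal walked through above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `Node` class, the `traverse` and `build_example` functions, and the English field labels in `KEYMAP` are assumptions made for the example; only the names belowNode, rightNode, tableValue, keymap, and jsonkey come from the description.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    table_value: str                      # recognized cell text (tableValue)
    below_node: Optional["Node"] = None   # belowNode in the description
    right_node: Optional["Node"] = None   # rightNode in the description


# keymap: field label on the form (keyname) -> key in the output JSON (jsonkey).
# The English labels are illustrative stand-ins for the labels on the form.
KEYMAP = {"Name": "name", "ID card": "idNo", "Email": "email", "Home address": "address"}


def traverse(node: Optional[Node], result: Dict[str, Optional[str]],
             pending_key: Optional[str] = None) -> None:
    """Depth-first walk: go down first, then come back and go right."""
    if node is None:                      # empty node: this branch is finished
        return
    json_key = KEYMAP.get(node.table_value)
    if json_key is not None:              # key node: record {"key": None} first
        result[json_key] = None
        traverse(node.below_node, result, json_key)
        traverse(node.right_node, result)
    else:                                 # not in the keymap: value of the key above
        if pending_key is not None:
            result[pending_key] = node.table_value
        traverse(node.below_node, result)


def build_example() -> Node:
    """Key nodes chained left to right, each with its value node below it."""
    cells = [("Name", "Zhang San"),
             ("ID card", "123456789012345678"),
             ("Email", "zhangsan@qq.com"),
             ("Home address", "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen")]
    root = prev = None
    for key, value in cells:
        node = Node(key, below_node=Node(value))
        if prev is None:
            root = node
        else:
            prev.right_node = node
        prev = node
    return root


result: Dict[str, Optional[str]] = {}
traverse(build_example(), result)
print(result)
```

Each key is first written with a `None` value and filled in only when a non-key node is found below it, which mirrors the {"name": Null} to {"name": "Zhang San"} update in the text.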
In addition, suppose the content format of the type of form certificate shown in FIG. 2a changes, giving the other form certificate image file shown in FIG. 5. The image file of FIG. 5 is processed in the same way as the image file of FIG. 2a (the details are not repeated here), and another directed acyclic graph, shown in FIG. 6, is constructed from the contents of the key fields obtained after column alignment. The nodes of the graph in FIG. 6 are traversed in the same way as those of the graph in FIG. 4, which is likewise not repeated here. The graph in FIG. 6 yields the same mapped JSON data: {"name": "Zhang San", "idNo": "123456789012345678", "email": "zhangsan@qq.com", "address": "No. 1818, Building 8, Longhaijiayuan, Nanshan District, Shenzhen"}.
Based on the same technical concept, FIG. 7 exemplarily shows a text recognition apparatus for a form certificate image file provided by an embodiment of the present invention; the apparatus can execute the flow of the text recognition method for a form certificate image file.
As shown in FIG. 7, the apparatus includes:
a recognition unit 701, configured to, for a form certificate image file of any type, determine first text content of the form certificate image file of that type by performing text content recognition on it;
a processing unit 702, configured to: when it is determined that the number of character strings in a first text line of the first text content differs from the number of key fields of the form certificate of that type, splice any character string of the first text line with each character string in a second text line, and verify the spliced character strings, the second text line being the nearest line preceding the first text line in the first text content; when any spliced character string conforms to the text content rule of any key field of the form certificate of that type, determine the spliced character string as the text content of that key field; and determine the text content of the key fields as second text content of the form certificate image file of that type.
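The splice-and-verify step performed by the processing unit can be sketched as below. This is a hedged illustration: the function name `splice_and_validate`, the `RULES` table, and both regular expressions are assumptions, and the splice order (string from the preceding line first, fragment from the current line second) is also an assumption, since the text only says the strings are spliced.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical text content rules, one per key field (JSON key -> predicate).
RULES: Dict[str, Callable[[str], bool]] = {
    "idNo": lambda s: re.fullmatch(r"[0-9]{18}", s) is not None,
    "email": lambda s: re.fullmatch(
        r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+", s) is not None,
}


def splice_and_validate(prev_line: List[str], cur_line: List[str],
                        n_fields: int) -> List[Tuple[str, str]]:
    """Return (key field, spliced string) pairs that satisfy a rule."""
    if len(cur_line) == n_fields:
        return []                          # a full row: nothing to splice
    matches = []
    for fragment in cur_line:              # string from the first text line
        for head in prev_line:             # each string of the preceding line
            candidate = head + fragment
            for field, rule in RULES.items():
                if rule(candidate):
                    matches.append((field, candidate))
    return matches


# A row whose ID number and email address wrapped onto the next line.
prev_line = ["Zhang San", "1234567890123", "zhangsan@qq", "Shenzhen"]
cur_line = ["45678", ".com"]
print(splice_and_validate(prev_line, cur_line, n_fields=4))
```

Only the combinations that pass a rule are kept, so a wrapped fragment is attached to the one field it actually completes.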
Optionally, the processing unit 702 is specifically configured to:
if the spliced character string is of the pure-digit type, determine, from the key fields, the key fields whose text content is of the pure-digit type;
for any key field whose text content is of the pure-digit type, perform a length check and a regular expression check on the spliced character string according to the text content rule of that key field, thereby determining whether the spliced character string conforms to the text content rule of that key field.
Optionally, the processing unit 702 is specifically configured to:
if the spliced character string is of the type consisting of letters plus at least one special character, determine, from the key fields, the key fields whose text content is of that type;
for any key field whose text content is of the type consisting of letters plus at least one special character, perform a regular expression check on the spliced character string according to the text content rule of that key field, thereby determining whether the spliced character string conforms to the text content rule of that key field.
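The two type-specific checks above (length plus regular expression for pure digits, regular expression only for letters plus special characters) might look like this in Python. The rule values in `ID_NO_RULE` and `EMAIL_RULE`, and all function names, are assumptions chosen for illustration; the actual rules would be defined per form type.

```python
import re
from typing import Optional

# Hypothetical rules for two key fields.
ID_NO_RULE = {"length": 18, "pattern": r"[1-9][0-9]{17}"}
EMAIL_RULE = {"pattern": r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"}


def is_pure_digits(s: str) -> bool:
    return re.fullmatch(r"[0-9]+", s) is not None


def is_letters_with_special(s: str) -> bool:
    """ASCII text with at least one letter and at least one special character."""
    return (s.isascii()
            and re.search(r"[A-Za-z]", s) is not None
            and re.search(r"[^A-Za-z0-9]", s) is not None)


def check_digit_field(s: str, rule: dict) -> bool:
    # Length check first, then the regular expression check.
    return len(s) == rule["length"] and re.fullmatch(rule["pattern"], s) is not None


def check_special_field(s: str, rule: dict) -> bool:
    # Regular expression check only.
    return re.fullmatch(rule["pattern"], s) is not None


def validate(s: str) -> Optional[str]:
    """Return the key field the string satisfies, or None."""
    if is_pure_digits(s):
        return "idNo" if check_digit_field(s, ID_NO_RULE) else None
    if is_letters_with_special(s):
        return "email" if check_special_field(s, EMAIL_RULE) else None
    return None


print(validate("123456789012345678"), validate("zhangsan@qq.com"), validate("zhangsan@qq"))
```

Classifying the string by character type first narrows the candidate key fields, so each spliced string is only tested against rules it could plausibly satisfy.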
Optionally, the processing unit 702 is specifically configured to:
if the spliced character string is of the type containing at least one Chinese word, determine, from the key fields, the key fields whose text content is of the type containing at least one Chinese word;
process the spliced character string through a set language model to determine the sentence probability that the spliced character string conforms to the text content rule of any key field whose text content is of the type containing at least one Chinese word, thereby determining whether the spliced character string conforms to the text content rule of that key field.
Optionally, the processing unit 702 is specifically configured to:
perform word segmentation on the spliced character string through a set word segmentation tool to obtain at least one segmented word and the part of speech of each segmented word;
combine the at least one segmented word into at least one sentence by permutation and combination;
for each sentence, process the segmented words in the sentence through the set language model to determine a first sub-sentence probability of the sentence, and process the parts of speech of the segmented words in the sentence through the set language model to determine a second sub-sentence probability of the sentence; determine the sentence probability of the sentence based on its first sub-sentence probability and its second sub-sentence probability; where the first sub-sentence probability is determined by counting the word frequencies of the segmented words in the sentence, and the second sub-sentence probability is determined by counting the frequencies of the parts of speech of the segmented words in the sentence;
compare the sentence probabilities of the at least one sentence to determine the maximum sentence probability, and determine the maximum sentence probability as the sentence probability corresponding to the spliced character string.
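The "permute, score, keep the maximum" procedure above can be sketched as follows. The segmentation tool and the language model are not specified in this passage, so the sketch takes already segmented `(word, part of speech)` pairs as input and accepts the scoring function as a parameter; `best_sentence`, `toy_score`, and the tag strings are hypothetical. It also assumes every full ordering of the segmented words counts as a candidate sentence.

```python
from itertools import permutations
from typing import Callable, Sequence, Tuple

Token = Tuple[str, str]  # (segmented word, part-of-speech tag)


def best_sentence(tokens: Sequence[Token],
                  score: Callable[[Sequence[Token]], float]
                  ) -> Tuple[float, Tuple[Token, ...]]:
    """Score every ordering of the tokens and keep the most probable one."""
    best_p, best_order = -1.0, tuple(tokens)
    for order in permutations(tokens):
        p = score(order)
        if p > best_p:
            best_p, best_order = p, order
    return best_p, best_order


def toy_score(order: Sequence[Token]) -> float:
    """Stand-in for the language model: favors one natural address order."""
    words = [word for word, _ in order]
    return 0.9 if words == ["Shenzhen", "Nanshan District", "Building 8"] else 0.1


tokens = [("Building 8", "m"), ("Shenzhen", "ns"), ("Nanshan District", "ns")]
probability, order = best_sentence(tokens, toy_score)
print(probability, [word for word, _ in order])
```

The number of orderings grows factorially with the number of segmented words, so a practical implementation would likely cap the token count or prune candidates.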
Optionally, the set language model is a bigram model;
the sentence probability of the sentence satisfies the following form:
$P = \Omega \times P_w \times (1 - \Omega) \times P_c$
where the first sub-sentence probability, determined by statistically processing the segmented words in the sentence through the bigram model, satisfies the following form:
Figure PCTCN2022100026-appb-000011
and the second sub-sentence probability, determined by statistically processing the parts of speech of the segmented words in the sentence through the bigram model, satisfies the following form:
Figure PCTCN2022100026-appb-000012
where $P$ denotes the sentence probability of the sentence, $P_w$ denotes the first sub-sentence probability of the sentence, $P_c$ denotes the second sub-sentence probability of the sentence, $\Omega$ denotes a weight, $W_i$ denotes any segmented word in the sentence, and $C_i$ denotes the part of speech of any segmented word in the sentence.
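A small sketch of how the sentence probability could be computed is given below. The expressions for the two sub-sentence probabilities survive here only as figure placeholders, so the sketch assumes the standard bigram factorization, with each conditional probability estimated as count(previous, current) / count(previous) and no smoothing; that choice, the toy corpora, and all function names are assumptions. The final combination follows the sentence-probability formula as it is printed in the text.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Counts = Tuple[Counter, Counter]  # (unigram counts, bigram counts)


def count_ngrams(sequences: List[List[str]]) -> Counts:
    """Count unigrams and adjacent pairs over a training corpus."""
    unigram, bigram = Counter(), Counter()
    for seq in sequences:
        unigram.update(seq)
        bigram.update(zip(seq, seq[1:]))
    return unigram, bigram


def bigram_probability(seq: Sequence[str], unigram: Counter, bigram: Counter) -> float:
    """Product of count(prev, cur) / count(prev) over adjacent pairs (assumed form)."""
    p = 1.0
    for prev, cur in zip(seq, seq[1:]):
        if unigram[prev] == 0:
            return 0.0                    # unseen context: no smoothing in this sketch
        p *= bigram[(prev, cur)] / unigram[prev]
    return p


def sentence_probability(words: Sequence[str], tags: Sequence[str],
                         word_counts: Counts, tag_counts: Counts,
                         omega: float) -> float:
    p_w = bigram_probability(words, *word_counts)   # first sub-sentence probability
    p_c = bigram_probability(tags, *tag_counts)     # second sub-sentence probability
    return omega * p_w * (1 - omega) * p_c          # combination as printed in the text


# Toy corpora: segmented words and their part-of-speech tags (hypothetical).
word_counts = count_ngrams([["Shenzhen", "Nanshan"], ["Shenzhen", "Futian"]])
tag_counts = count_ngrams([["N", "V"], ["N", "V"]])
print(sentence_probability(["Shenzhen", "Nanshan"], ["N", "V"],
                           word_counts, tag_counts, omega=0.5))
```

With these toy counts the word-level probability is 1/2 and the tag-level probability is 1, so the weight of 0.5 gives 0.5 × 0.5 × 0.5 × 1 = 0.125.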
Optionally, the processing unit 702 is further configured to:
after determining the text content of the key fields as the second text content of the form certificate image file of that type, construct a directed acyclic graph of the form certificate of that type according to the second text content, taking the key field at the starting position of the form certificate as the starting point of construction;
taking the node at the starting position of the directed acyclic graph as the starting point of traversal, traverse the nodes of the directed acyclic graph in turn through a depth-first search algorithm, thereby obtaining data content that conforms to a set data format.
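One way the graph construction could be sketched is shown below. It assumes the column-aligned second text content is available as a list of rows and that every cell is linked to the cell below it and the cell to its right; this passage does not describe the linking rule, so that rule, the `Node` class, and `build_graph` are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    table_value: str                      # text of one cell
    below_node: Optional["Node"] = None   # cell in the same column, next row
    right_node: Optional["Node"] = None   # cell in the same row, next column


def build_graph(rows: List[List[str]]) -> Optional[Node]:
    """Link each cell downward and rightward; return the top-left node."""
    grid = [[Node(text) for text in row] for row in rows]
    for r, row in enumerate(grid):
        for c, node in enumerate(row):
            if r + 1 < len(grid) and c < len(grid[r + 1]):
                node.below_node = grid[r + 1][c]
            if c + 1 < len(row):
                node.right_node = row[c + 1]
    return grid[0][0] if grid and grid[0] else None


rows = [["Name", "ID card"],
        ["Zhang San", "123456789012345678"]]
root = build_graph(rows)
print(root.table_value, root.below_node.table_value, root.right_node.table_value)
```

Because edges only point down or to the right, the resulting structure cannot contain a cycle, and the top-left cell serves as the single starting point for a depth-first traversal.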
Optionally, the processing unit 702 is specifically configured to:
for any node in the directed acyclic graph, when it is determined that the node is not an empty node, determine whether the node exists in a key field library, the key field library being used to store the key fields of each type of form certificate;
if so, determine that the node is a first key field node, and traverse, along a first traversal direction, a first node adjacent to the first key field node; when it is determined that the first node is not an empty node and does not exist in the key field library, determine the first node as the text content of the first key field node; and stop traversing when an empty node appears in the first traversal direction; where the first traversal direction is from top to bottom.
Optionally, the processing unit 702 is further configured to:
after the traversal stops upon an empty node appearing in the first traversal direction, traverse, along a second traversal direction, a second node adjacent to the first key field node, and when it is determined that the second node is not an empty node and exists in the key field library, determine the second node as a second key field node;
traverse, along the first traversal direction, a third node adjacent to the second key field node, and when it is determined that the third node is not an empty node and does not exist in the key field library, determine the third node as the text content of the second key field node; stop traversing when an empty node appears in the first traversal direction; and, after the traversal in the first traversal direction stops, traverse, along the second traversal direction, a fourth node adjacent to the second key field node, stopping when an empty node appears in the second traversal direction; where the second traversal direction is from left to right.
Based on the same technical concept, an embodiment of the present invention further provides a computing device. As shown in FIG. 8, it includes at least one processor 801 and a memory 802 connected to the at least one processor. The embodiment does not limit the specific connection medium between the processor 801 and the memory 802; in FIG. 8 they are connected by a bus as an example. Buses can be divided into address buses, data buses, control buses, and so on. In this embodiment, the memory 802 stores instructions executable by the at least one processor 801, and by executing those instructions the at least one processor 801 can perform the steps of the foregoing text recognition method for a form certificate image file.
The processor 801 is the control center of the computing device. It can connect the various parts of the computing device through various interfaces and lines, and performs data processing by running or executing the instructions stored in the memory 802 and calling the data stored in the memory 802. Optionally, the processor 801 may include one or more processing units and may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interface, application programs, and the like, and the modem processor mainly handles issuing instructions. It can be understood that the modem processor may also not be integrated into the processor 801. In some embodiments the processor 801 and the memory 802 are implemented on the same chip; in others they are implemented on separate chips.
The processor 801 may be a general-purpose processor, such as a central processing unit (CPU), a digital signal processor, an application specific integrated circuit (ASIC), a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the text recognition method for a form certificate image file may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory 802, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 802 may include at least one type of storage medium, for example flash memory, a hard disk, a multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic memory, a magnetic disk, an optical disc, and so on. The memory 802 may be, without limitation, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 802 in the embodiments of the present invention may also be a circuit or any other apparatus capable of implementing a storage function, used to store program instructions and/or data.
Based on the same technical concept, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program executable by a computing device; when the program runs on the computing device, it causes the computing device to perform the steps of the foregoing text recognition method for a form certificate image file.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code. Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. If such modifications and variations fall within the scope of the claims of this application and their technical equivalents, the present invention is intended to include them as well.

Claims (10)

  1. A text recognition method for a form certificate image file, characterized by comprising:
    for a form certificate image file of any type, determining first text content of the form certificate image file of that type by performing text content recognition on it;
    when it is determined that the number of character strings in a first text line of the first text content differs from the number of key fields of the form certificate of that type, splicing any character string of the first text line with each character string in a second text line, and verifying the spliced character strings, the second text line being the nearest line preceding the first text line in the first text content;
    when any spliced character string conforms to the text content rule of any key field of the form certificate of that type, determining the spliced character string as the text content of that key field;
    determining the text content of the key fields as second text content of the form certificate image file of that type.
  2. The method according to claim 1, characterized in that whether the spliced character string conforms to the text content rule of any key field of the form certificate of that type is determined as follows:
    if the spliced character string is of the pure-digit type, determining, from the key fields, the key fields whose text content is of the pure-digit type;
    for any key field whose text content is of the pure-digit type, performing a length check and a regular expression check on the spliced character string according to the text content rule of that key field, thereby determining whether the spliced character string conforms to the text content rule of that key field.
  3. The method according to claim 1, characterized in that whether the spliced character string conforms to the text content rule of any key field of the form certificate of that type is determined as follows:
    if the spliced character string is of the type consisting of letters plus at least one special character, determining, from the key fields, the key fields whose text content is of that type;
    for any key field whose text content is of the type consisting of letters plus at least one special character, performing a regular expression check on the spliced character string according to the text content rule of that key field, thereby determining whether the spliced character string conforms to the text content rule of that key field.
  4. The method according to claim 1, characterized in that whether the spliced character string conforms to the text content rule of any key field of the form certificate of that type is determined as follows:
    if the spliced character string is of the type containing at least one Chinese word, determining, from the key fields, the key fields whose text content is of the type containing at least one Chinese word;
    processing the spliced character string through a set language model to determine the sentence probability that the spliced character string conforms to the text content rule of any key field whose text content is of the type containing at least one Chinese word, thereby determining whether the spliced character string conforms to the text content rule of that key field.
  5. The method according to claim 4, characterized in that processing the spliced character string through the set language model to determine the sentence probability that the spliced character string conforms to the text content rule of any key field whose text content is of the type containing at least one Chinese word comprises:
    performing word segmentation on the spliced character string through a set word segmentation tool to obtain at least one segmented word and the part of speech of each segmented word;
    combining the at least one segmented word into at least one sentence by permutation and combination;
    for each sentence, processing the segmented words in the sentence through the set language model to determine a first sub-sentence probability of the sentence, and processing the parts of speech of the segmented words in the sentence through the set language model to determine a second sub-sentence probability of the sentence; determining the sentence probability of the sentence based on its first sub-sentence probability and its second sub-sentence probability; where the first sub-sentence probability is determined by counting the word frequencies of the segmented words in the sentence, and the second sub-sentence probability is determined by counting the frequencies of the parts of speech of the segmented words in the sentence;
    comparing the sentence probabilities of the at least one sentence to determine the maximum sentence probability, and determining the maximum sentence probability as the sentence probability corresponding to the spliced character string.
  6. The method according to claim 5, wherein the set language model is a bigram model;
    the sentence probability of the sentence satisfies the following form:
    P = Ω × P_w × (1 − Ω) × P_c
    wherein the first sub-sentence probability, determined by statistically processing the word segments in the sentence through the bigram model, satisfies the following form:
    Figure PCTCN2022100026-appb-100001
    the second sub-sentence probability, determined by statistically processing the parts of speech of the word segments in the sentence through the bigram model, satisfies the following form:
    Figure PCTCN2022100026-appb-100002
    wherein P denotes the sentence probability of the sentence, P_w denotes the first sub-sentence probability of the sentence, P_c denotes the second sub-sentence probability of the sentence, Ω denotes a weight, W_i denotes any word segment in the sentence, and C_i denotes the part of speech of any word segment in the sentence.
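A hedged Python sketch of the combination in claim 6 follows. The two sub-sentence formulas appear only as figures in the source, so the chain product of conditional bigram probabilities and the add-one smoothing used here are assumptions based on the standard bigram model, not the claimed formulas. The helper names `count_ngrams`, `bigram_probability`, and `sentence_probability` are hypothetical, as is the default weight of 0.5.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple

Counts = Tuple[Counter, Counter, int]  # (unigram counts, bigram counts, vocabulary size)


def count_ngrams(corpus: Iterable[Sequence[str]]) -> Counts:
    """Collect unigram and bigram frequencies from a corpus of sequences."""
    unigrams, bigrams = Counter(), Counter()
    for sequence in corpus:
        unigrams.update(sequence)
        bigrams.update(zip(sequence, sequence[1:]))
    return unigrams, bigrams, len(unigrams)


def bigram_probability(sequence: Sequence[str], counts: Counts) -> float:
    """Product of P(x_i | x_{i-1}) with add-one smoothing (an assumption)."""
    unigrams, bigrams, vocab_size = counts
    prob = 1.0
    for prev, cur in zip(sequence, sequence[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob


def sentence_probability(words, tags, word_counts, tag_counts, omega=0.5):
    """Combine both sub-sentence probabilities as written in the claim:
    P = omega * P_w * (1 - omega) * P_c."""
    p_w = bigram_probability(words, word_counts)  # first sub-sentence probability
    p_c = bigram_probability(tags, tag_counts)    # second sub-sentence probability
    return omega * p_w * (1 - omega) * p_c


word_counts = count_ngrams([["有效", "期限"], ["有效", "日期"]])
tag_counts = count_ngrams([["a", "n"], ["a", "n"]])
print(sentence_probability(["有效", "期限"], ["a", "n"], word_counts, tag_counts))
```

The same `bigram_probability` function serves both word segments and part-of-speech tags, since the claim applies the same bigram model to each sequence.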
  7. The method according to any one of claims 1 to 6, further comprising, after determining the text content of each key field as the second text content of the form certificate image file of the type:
    taking the key field located at the starting position in the form certificate of the type as the construction starting point, constructing a directed acyclic graph of the form certificate of the type according to the second text content;
    taking the node located at the starting position in the directed acyclic graph as the traversal starting point, traversing the nodes of the directed acyclic graph in sequence with a depth-first search algorithm, thereby obtaining data content that conforms to a set data format.
  8. The method according to claim 7, wherein taking the node located at the starting position in the directed acyclic graph as the traversal starting point and traversing the nodes of the directed acyclic graph in sequence with a depth-first search algorithm comprises:
    for any node in the directed acyclic graph, when the node is determined not to be an empty node, determining whether the node exists in a key field library, the key field library being used to store the key fields of each type of form certificate;
    if so, determining that the node is a first key field node, and traversing, along a first traversal direction, a first node adjacent to the first key field node; when the first node is determined not to be an empty node and not to exist in the key field library, determining the first node as the text content of the first key field node; and stopping the traversal when an empty node appears in the first traversal direction; wherein the first traversal direction is from top to bottom.
  9. The method according to claim 8, further comprising, after stopping the traversal when an empty node appears in the first traversal direction:
    traversing, along a second traversal direction, a second node adjacent to the first key field node, and when the second node is determined not to be an empty node and to exist in the key field library, determining the second node as a second key field node;
    traversing, along the first traversal direction, a third node adjacent to the second key field node, and when the third node is determined not to be an empty node and not to exist in the key field library, determining the third node as the text content of the second key field node, and stopping the traversal when an empty node appears in the first traversal direction; and, after the traversal in the first traversal direction stops, traversing, along the second traversal direction, a fourth node adjacent to the second key field node, and stopping the traversal when an empty node appears in the second traversal direction; wherein the second traversal direction is from left to right.
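The traversal described in claims 7 to 9 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the directed acyclic graph is represented implicitly by a grid whose cells have a "down" edge and a "right" edge, an empty node is represented by `None`, and a downward walk also stops at another key field (a case the claims do not address). The names `extract_fields`, `Grid`, and the sample key field library are hypothetical.

```python
from typing import Dict, List, Optional, Set

Grid = List[List[Optional[str]]]  # None stands for an empty node (assumption)


def extract_fields(grid: Grid, key_fields: Set[str]) -> Dict[str, List[str]]:
    """Depth-first walk: go down from each key field first, then move right."""
    result: Dict[str, List[str]] = {}

    def cell(row: int, col: int) -> Optional[str]:
        # Positions outside the grid are treated as empty nodes.
        if 0 <= row < len(grid) and 0 <= col < len(grid[row]):
            return grid[row][col]
        return None

    def visit(row: int, col: int) -> None:
        node = cell(row, col)
        if node is None or node not in key_fields:
            return  # empty node, or not a key field: stop this branch
        values = result.setdefault(node, [])
        # First traversal direction (top to bottom): collect the text content.
        below_row = row + 1
        below = cell(below_row, col)
        while below is not None and below not in key_fields:
            values.append(below)
            below_row += 1
            below = cell(below_row, col)
        # Second traversal direction (left to right): look for the next key field.
        visit(row, col + 1)

    visit(0, 0)  # the node at the starting position
    return result


grid = [
    ["姓名", "证件号码", None],
    ["张三", "123456", "ignored"],
    [None, None, None],
]
print(extract_fields(grid, {"姓名", "证件号码"}))
```

Returning a dictionary keyed by key field is one possible "set data format"; it serializes directly to JSON if a structured output is needed.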
  10. A computing device, comprising at least one processor and at least one memory, wherein the memory stores a computer program which, when executed by the processor, causes the processor to perform the method according to any one of claims 1 to 9.
PCT/CN2022/100026 2021-11-22 2022-06-21 Text recognition method for form certificate image file, and computing device WO2023087702A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111382325.7 2021-11-22
CN202111382325.7A CN114049642A (en) 2021-11-22 2021-11-22 Text recognition method and computing device for form certificate image piece

Publications (1)

Publication Number Publication Date
WO2023087702A1 (en) 2023-05-25

Family

ID=80210390

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/100026 WO2023087702A1 (en) 2021-11-22 2022-06-21 Text recognition method for form certificate image file, and computing device

Country Status (2)

Country Link
CN (1) CN114049642A (en)
WO (1) WO2023087702A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114049642A (en) * 2021-11-22 2022-02-15 深圳前海微众银行股份有限公司 Text recognition method and computing device for form certificate image piece

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101811581B1 (en) * 2016-11-15 2017-12-26 주식회사 셀바스에이아이 Aparatus and method for cell decomposition for a table recognition in document image
CN110532968A (en) * 2019-09-02 2019-12-03 苏州美能华智能科技有限公司 Table recognition method, apparatus and storage medium
CN111191652A (en) * 2019-12-20 2020-05-22 中国建设银行股份有限公司 Certificate image identification method and device, electronic equipment and storage medium
CN111611990A (en) * 2020-05-22 2020-09-01 北京百度网讯科技有限公司 Method and device for identifying table in image
CN114049642A (en) * 2021-11-22 2022-02-15 深圳前海微众银行股份有限公司 Text recognition method and computing device for form certificate image piece

Also Published As

Publication number Publication date
CN114049642A (en) 2022-02-15

Similar Documents

Publication Publication Date Title
US11017178B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
WO2021164226A1 (en) Method and apparatus for querying knowledge map of legal cases, device and storage medium
US20200081899A1 (en) Automated database schema matching
US7636657B2 (en) Method and apparatus for automatic grammar generation from data entries
US8868479B2 (en) Natural language parsers to normalize addresses for geocoding
US11243923B2 (en) Computing the need for standardization of a set of values
US9720903B2 (en) Method for parsing natural language text with simple links
WO2021051517A1 (en) Information retrieval method based on convolutional neural network, and device related thereto
US11397855B2 (en) Data standardization rules generation
US11790174B2 (en) Entity recognition method and apparatus
CN109145287B (en) Indonesia word error detection and correction method and system
CN110083832B (en) Article reprint relation identification method, device, equipment and readable storage medium
US7627567B2 (en) Segmentation of strings into structured records
CN109885641B (en) Method and system for searching Chinese full text in database
EP3598321A1 (en) Method for parsing natural language text with constituent construction links
WO2023087702A1 (en) Text recognition method for form certificate image file, and computing device
US10810368B2 (en) Method for parsing natural language text with constituent construction links
CN114329112A (en) Content auditing method and device, electronic equipment and storage medium
CN111782892B (en) Similar character recognition method, device, apparatus and storage medium based on prefix tree
CN116383412B (en) Functional point amplification method and system based on knowledge graph
US20170286394A1 (en) Method for parsing natural language text with constituent construction links
WO2021227951A1 (en) Naming of front-end page element
CN115831117A (en) Entity identification method, entity identification device, computer equipment and storage medium
WO2021196835A1 (en) Method and apparatus for extracting time character string, and computer device and storage medium
WO2021056740A1 (en) Language model construction method and system, computer device and readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22894245

Country of ref document: EP

Kind code of ref document: A1