WO2023039942A1 - Method, apparatus, device and medium for extracting element information based on text recognition - Google Patents

Method, apparatus, device and medium for extracting element information based on text recognition

Info

Publication number
WO2023039942A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
information
document
error correction
initial
Prior art date
Application number
PCT/CN2021/121116
Other languages
English (en)
French (fr)
Inventor
杨东泉
程佳宇
王天星
钱启
Original Assignee
深圳前海环融联易信息科技服务有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳前海环融联易信息科技服务有限公司
Publication of WO2023039942A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Definitions

  • The present application relates to the technical field of text recognition, and in particular to a method, apparatus, device and medium for extracting element information based on text recognition.
  • The embodiments of the present application provide a method, apparatus, device and medium for extracting element information based on text recognition, aiming to solve the problem that existing methods cannot extract element information from documents accurately and efficiently.
  • In a first aspect, an embodiment of the present application provides a method for extracting element information based on text recognition, which includes:
  • if an input initial document is received, performing paging recognition on the initial document to obtain the document information pages therein; recognizing the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information; judging whether the document information page contains unrecognized document content; if the document information page contains unrecognized document content, recognizing the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information; performing text error correction processing on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error correction text information; and extracting corresponding text element information from the error correction text information according to a preset element extraction rule.
  • An embodiment of the present application provides an apparatus for extracting element information based on text recognition, which includes:
  • a document information page acquisition unit, configured to perform paging recognition on the initial document to acquire the document information pages therein if an input initial document is received; an initial text information acquisition unit, configured to recognize the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information; a document information page judging unit, configured to judge whether the document information page contains unrecognized document content; a handwritten text information acquisition unit, configured to recognize the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information if the document information page contains unrecognized document content; an error correction text information acquisition unit, configured to perform text error correction processing on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error correction text information; and a text element information acquisition unit, configured to extract corresponding text element information from the error correction text information according to a preset element extraction rule.
  • An embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the method for extracting element information based on text recognition described in the first aspect above.
  • An embodiment of the present application further provides a computer-readable storage medium that stores a computer program; when the computer program is executed by a processor, the processor performs the method for extracting element information based on text recognition described in the first aspect above.
  • Embodiments of the present application provide a method, apparatus, computer device, and readable storage medium for extracting element information based on text recognition. Paging recognition is performed on the initial document to obtain the document information pages, and the initial text information is obtained from the document information pages according to the initial text recognition model. If a document information page contains unrecognized document content, that content is recognized according to the handwriting recognition model to obtain handwritten text information. Text error correction processing is performed on the initial text information and the handwritten text information according to the text error correction model to obtain error correction text information, from which the text element information is extracted according to the element extraction rules.
  • With this method, text information is obtained by combining the initial text recognition model and the handwriting recognition model, and the corresponding text element information is extracted after text error correction processing. This improves the flexibility and widens the scope of application of element information extraction, and the text error correction processing improves the accuracy of the text element information obtained, so that element information can be extracted from documents accurately and efficiently.
  • FIG. 1 is a schematic flowchart of the method for extracting element information based on text recognition provided by an embodiment of the present application;
  • FIG. 2 is a schematic sub-flowchart of the method for extracting element information based on text recognition provided by an embodiment of the present application;
  • FIG. 3 is another schematic sub-flowchart of the method for extracting element information based on text recognition provided by an embodiment of the present application;
  • FIG. 4 is another schematic sub-flowchart of the method for extracting element information based on text recognition provided by an embodiment of the present application;
  • FIG. 5 is another schematic sub-flowchart of the method for extracting element information based on text recognition provided by an embodiment of the present application;
  • FIG. 6 is another schematic flowchart of the method for extracting element information based on text recognition provided by an embodiment of the present application;
  • FIG. 7 is another schematic flowchart of the method for extracting element information based on text recognition provided by an embodiment of the present application;
  • FIG. 8 is a schematic block diagram of the apparatus for extracting element information based on text recognition provided by an embodiment of the present application;
  • FIG. 9 is a schematic block diagram of the computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of the method for extracting element information based on text recognition provided by an embodiment of the present application. The method is applied to a user terminal or a management server and is executed by application software installed in the user terminal or the management server.
  • The management server is a server that can execute the method to perform text recognition on the initial document and extract text element information; it can be a server built inside an enterprise or a government department.
  • The user terminal is a terminal device that can execute the method to perform text recognition on the initial document and extract text element information, such as a desktop computer, laptop, tablet or mobile phone.
  • the method includes steps S110-S160.
  • The initial document is a document to be recognized, such as a contract or an agreement.
  • The initial document can be a PDF document or a collection of pictures.
  • The initial document consists of multiple pages, and each page it contains can be recognized separately to obtain the document information pages.
  • For example, if the initial document is a contract document that includes a contract cover page and contract body pages, and the contract body pages contain the content from which element information needs to be extracted, the contract body pages are obtained from the contract document as the document information pages after page-by-page recognition.
  • Step S110 includes sub-steps S111, S112, S113 and S114.
  • The text direction of each page of the initial document can be obtained; each page has only one corresponding text direction.
  • Specifically, the shape of the character blocks formed by the characters in a page can be obtained. Each character block corresponds to one row or one column of characters, and adjacent character blocks are separated by gaps. The text direction is determined from the shape of the character blocks: if the horizontal length of a character block is greater than its vertical length, the text direction is horizontal; if the vertical length is greater than the horizontal length, the text direction is vertical.
  • If the text direction of a page differs from the standard text direction, the page is rotated. Specifically, if the text direction is vertical, the page is rotated 90 degrees clockwise so that the text direction becomes horizontal. It is then judged whether the first sentence of the rotated page begins with a punctuation mark; if so, the page is rotated by a further 180 degrees, after which its text direction is the same as the standard text direction.
  • It is then judged whether the text proportion of each page whose text direction matches the standard text direction is greater than a preset proportion threshold. Specifically, the text proportion is the ratio of the area covered by the character blocks to the area of the page: the coverage areas of all character blocks in the page are summed and divided by the total area of the page, and the result is compared with the proportion threshold.
  • If the text proportion of a page is greater than the proportion threshold, the page contains main content information and is obtained as a document information page. If the text proportion is not greater than the proportion threshold, the page is a cover page that contains no main content information.
  • For example, the proportion threshold can be set to 12%.
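  • The page-selection logic described above can be sketched as follows. This is only an illustration under stated assumptions: the `Block` type, the function names, and the majority vote used to settle on a single direction per page are hypothetical and do not come from the application; only the width-versus-height rule, the area ratio, and the 12% threshold are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Bounding box of one character block (one row or one column of characters)."""
    width: float
    height: float


def text_direction(blocks):
    """Return 'horizontal' if most blocks are wider than tall, else 'vertical'.

    The majority vote is an assumption; the source only states the per-block rule.
    """
    wide = sum(1 for b in blocks if b.width > b.height)
    return "horizontal" if wide >= len(blocks) - wide else "vertical"


def text_ratio(blocks, page_width, page_height):
    """Area covered by all character blocks divided by the total page area."""
    covered = sum(b.width * b.height for b in blocks)
    return covered / (page_width * page_height)


def is_information_page(blocks, page_width, page_height, threshold=0.12):
    """A page is kept only when its text ratio is greater than the threshold."""
    return text_ratio(blocks, page_width, page_height) > threshold
```

  • With a 600×800 page, a cover page holding a single 300×40 title block has a ratio of 0.025 and is discarded, while a body page holding twenty 500×20 lines has a ratio of about 0.42 and is kept.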
  • the initial text recognition model is a model for recognizing non-handwritten characters in the document information page, wherein the initial text recognition model includes a feature vector extraction formula, a character database, and a matching degree threshold.
  • the text content included in each document information page can be identified according to the initial text recognition model, so as to obtain the text information therein as the initial text information.
  • step S120 includes sub-steps S121 , S122 , S123 , S124 , S125 , S126 and S127 .
  • The non-handwritten fonts of different paragraphs in a document information page may or may not be the same.
  • The character strokes corresponding to each paragraph in the document information page can be obtained. Character strokes are representative strokes taken from each paragraph, such as the horizontal stroke, the vertical stroke, and the bottom stroke.
  • The stroke image of each character stroke of a paragraph can be obtained, and the stroke feature vector of the character stroke can be extracted from the stroke image through the feature vector extraction formula; the stroke feature vector quantifies the features of the character stroke.
  • For example, the resolution of the stroke image corresponding to each character stroke is 50×50.
  • According to the calculation formula of the first convolution kernel, a convolution operation is performed with a 6×6 window and a step size of 1 to obtain a 45×45 vector matrix, which is the shallow feature of the stroke image. According to the pooling calculation formula, downsampling is performed with a 9×9 window and a step size of 6 to obtain a 7×7 vector matrix, which is the deep-level feature of the stroke image. According to the calculation formulas of the five second convolution kernels, a convolution operation is performed with a 3×3 window and a step size of 2 to obtain five 3×3 vector matrices.
  • The first full connection formula contains five nodes, each associated with one 3×3 vector matrix; that is, the values of the five nodes are calculated from the five 3×3 vector matrices through five calculation formulas respectively.
  • The values of the five nodes are then processed through the second full connection calculation formula to obtain the final stroke feature vector.
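  • The matrix sizes quoted above follow from the usual sliding-window arithmetic, which the short sketch below reproduces. It assumes square inputs and no padding (an assumption, though it is the only reading consistent with the stated sizes); the function names are hypothetical.

```python
def output_size(size, window, stride):
    """Side length after sliding a square window over a square input, no padding."""
    return (size - window) // stride + 1


def feature_map_sizes(image_size=50):
    """Trace the side lengths through the three stages described in the text."""
    conv1 = output_size(image_size, window=6, stride=1)  # first convolution
    pooled = output_size(conv1, window=9, stride=6)      # pooling (downsampling)
    conv2 = output_size(pooled, window=3, stride=2)      # each of the five second kernels
    return conv1, pooled, conv2
```

  • For a 50×50 stroke image, `feature_map_sizes()` gives side lengths 45, 7 and 3, matching the 45×45, 7×7 and 3×3 matrices in the description.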
  • The character database includes multiple fonts and, for each font, the stroke features corresponding to the representative strokes. The font matching degree between the stroke feature vectors of a paragraph and the stroke features of each font can be calculated.
  • The font matching degree between the stroke feature vectors of a paragraph and a given font can be calculated by formula (1):
  • where N is the total number of stroke feature vectors, Z_i = (z_1i, z_2i, ..., z_9i) is the i-th stroke feature vector, and R_i = (r_1i, r_2i, ..., r_9i) is the i-th stroke feature of the font.
  • It is judged whether the number of fonts whose font matching degree with the stroke feature vectors of each paragraph is greater than the matching degree threshold is greater than zero. If that number is greater than zero for a paragraph, the character database contains at least one font that matches the stroke features of the paragraph, and the font with the highest font matching degree is obtained as the target font of the paragraph.
  • The target font matching the stroke feature vectors of each paragraph can be determined in this way.
  • If the number of fonts whose font matching degree is greater than the matching degree threshold is not greater than zero, the content contained in the paragraph is determined to be unrecognized document content.
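  • The threshold-and-best-match decision can be sketched as follows. Formula (1) is not reproduced in this text, so the similarity used here (the average of 1 / (1 + Euclidean distance) over the N paired vectors) is purely an assumed stand-in, and all function names are hypothetical; only the decision rule (best font above the threshold, otherwise unrecognized content) comes from the description.

```python
import math


def similarity(z, r):
    """Assumed similarity in (0, 1]; NOT the application's formula (1)."""
    return 1.0 / (1.0 + math.dist(z, r))


def font_matching_degree(stroke_vectors, font_features):
    """Average similarity over the N paired stroke vectors and font stroke features."""
    pairs = list(zip(stroke_vectors, font_features))
    return sum(similarity(z, r) for z, r in pairs) / len(pairs)


def match_font(stroke_vectors, character_database, threshold):
    """Return the best-matching font name, or None for unrecognized content."""
    degrees = {
        font: font_matching_degree(stroke_vectors, features)
        for font, features in character_database.items()
    }
    # Keep only fonts whose matching degree is greater than the threshold.
    candidates = {font: d for font, d in degrees.items() if d > threshold}
    if not candidates:
        return None  # paragraph is treated as unrecognized document content
    return max(candidates, key=candidates.get)
```

  • A `None` result corresponds to the case where no font exceeds the threshold, so the paragraph is handed to the handwriting recognition step.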
  • The character database also includes a character template for each font; the character template of a font contains the template of every character in that font. The text content of a paragraph can be recognized with the character template of its target font in the character database: the character template is matched against the character images in the paragraph to recognize the text content corresponding to each character image.
  • Recognizing the text content of every paragraph that has a target font in this way yields the initial text information.
  • It is judged whether the document information page contains unrecognized document content, where the unrecognized document content is the text content of paragraphs whose stroke feature vectors do not match any font in the character database.
  • If the document information page contains unrecognized document content, the unrecognized document content is recognized according to a preset handwriting recognition model to obtain corresponding handwritten text information.
  • In that case the document information page contains content that cannot be recognized with the character database, so that content needs to be recognized through the handwriting recognition model.
  • The handwriting recognition model is a model for recognizing handwritten text content, and it includes a feature vector extraction formula and a handwritten character set.
  • step S140 includes sub-steps S141 , S142 and S143 .
  • The character feature vector of each character in the unrecognized document content can be calculated through the feature vector extraction formula. Specifically, the character image of each character is obtained first, and the character feature vector corresponding to the character image is calculated with the feature vector extraction formula described above.
  • The calculation of the character feature vector is the same as that of the stroke feature vector and is not repeated here.
  • The handwritten character set contains a plurality of handwritten characters and the feature vector corresponding to each handwritten character. The matching degree between a character feature vector and the feature vector of a handwritten character can be calculated by formula (2):
  • where Z = (z_1, z_2, ..., z_9) is the character feature vector of a character, and
  • T = (t_1, t_2, ..., t_9) is the feature vector corresponding to a handwritten character.
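  • Selecting the best-matching handwritten character for each character feature vector and combining the results in order can be sketched as follows. Formula (2) is not reproduced in this text, so the matching degree below (1 / (1 + Euclidean distance)) is an assumed placeholder, and the function names and the sample character set are hypothetical.

```python
import math


def matching_degree(z, t):
    """Assumed matching degree in (0, 1]; NOT the application's formula (2)."""
    return 1.0 / (1.0 + math.dist(z, t))


def recognize_handwriting(char_vectors, handwritten_set):
    """Pick, for every character vector, the handwritten character that matches best.

    handwritten_set maps each handwritten character to its feature vector.
    The chosen characters are joined in their original order.
    """
    text = []
    for z in char_vectors:
        best = max(handwritten_set, key=lambda ch: matching_degree(z, handwritten_set[ch]))
        text.append(best)
    return "".join(text)
```

  • For instance, with a two-character set whose vectors are all ones for "甲" and all zeros for "乙", the vectors `[0.9] * 9` and `[0.1] * 9` are recognized as "甲乙".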
  • the text error correction model includes a conversion dictionary and an error correction neural network.
  • step S150 includes sub-steps S151 , S152 , S153 and S154 .
  • The initial text information and the handwritten text information can be combined in sequence, and the characters they contain can be preprocessed to obtain corresponding preprocessed text information.
  • Step S151 includes sub-steps S1511 and S1512.
  • Invalid characters, such as symbols and spaces, are first filtered out of the initial text information and the handwritten text information to obtain valid text information containing only valid characters.
  • The valid text information is then segmented to obtain the preprocessed text information.
  • The preprocessed text information contains multiple text segments, each corresponding to one text sentence, that is, a sentence that expresses a complete meaning.
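  • The two preprocessing sub-steps can be sketched as follows. The source does not say which characters count as invalid beyond "symbols and spaces", nor how sentence boundaries are found, so this sketch assumes that sentence-ending punctuation is kept as the segment boundary and that every other non-alphanumeric character is dropped; `SENTENCE_ENDINGS` and the function names are hypothetical.

```python
SENTENCE_ENDINGS = "。！？；!?;"  # assumed boundary marks, not taken from the source


def filter_invalid(text):
    """Drop spaces and symbols; keep letters, digits, CJK characters and boundary marks."""
    return "".join(ch for ch in text if ch.isalnum() or ch in SENTENCE_ENDINGS)


def segment(text):
    """Cut the text after each boundary mark; each piece is one text sentence."""
    segments, current = [], []
    for ch in text:
        current.append(ch)
        if ch in SENTENCE_ENDINGS:
            segments.append("".join(current))
            current = []
    if current:  # trailing text without a closing mark
        segments.append("".join(current))
    return segments


def preprocess(initial_text, handwritten_text):
    """Combine both texts in sequence, filter invalid characters, then segment."""
    return segment(filter_invalid(initial_text + handwritten_text))
```

  • Keeping the boundary marks is a design choice of this sketch: if all punctuation were removed in the filtering step, the segmentation step would have nothing to split on.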
  • The characters contained in each text segment of the preprocessed text information are converted according to the conversion dictionary.
  • The conversion dictionary contains the character code corresponding to each character; the character code of each character in a text segment is looked up in the conversion dictionary, and the codes together form the text encoding information.
  • The error correction neural network is a neural network used for error correction of text information.
  • The error correction neural network can be constructed from a BERT (Bidirectional Encoder Representations from Transformers) network and a natural language processing neural network (Natural Language Processing Transformer, NLP neural network).
  • The NLP neural network may be built on a multi-head self-attention network (Multi-Head Self-Attention) and is composed of multiple encoders and multiple decoders. Any code sequence can first be input into the BERT network to obtain the corresponding representation vector, which is then input into the NLP neural network to obtain the corresponding error correction code sequence.
  • The BERT network consists of an input layer, multiple intermediate layers, and an output layer, which are connected through association formulas; each association formula can be expressed as a first-degree (linear) function.
  • When the code sequence of any text segment is input into the BERT network, the corresponding representation vector is obtained from the output layer.
  • The representation vector has size (J, K), that is, it is a vector matrix with J rows and K columns, where J equals the number of character codes in the code sequence and every value in the representation vector lies in the range [0, 1].
  • The corresponding error correction code sequence is obtained by passing the representation vector through the multiple encoders and multiple decoders of the NLP neural network.
  • The number of character codes in the error correction code sequence may or may not equal the number of character codes in the input code sequence of the text segment. If the numbers are equal, the input code sequence contains either no error or only substitution errors; if they are not equal, the input code sequence contains an insertion error or a deletion error.
  • Because the conversion dictionary records the correspondence between characters and character codes, each error correction code sequence can be reverse-converted according to the conversion dictionary, that is, its character codes are converted back into the corresponding characters. The characters obtained form the error correction text of the corresponding text segment, and the error correction texts of all text segments together form the error correction text information.
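  • The dictionary conversion, its reverse, and the length comparison described above can be sketched as follows. The integer codes assigned here are arbitrary and the function names are hypothetical; the neural network itself is not modelled, so the "corrected" sequences in the usage note are hand-written examples rather than model output.

```python
def build_dictionary(characters):
    """Assign every distinct character an integer code (codes are arbitrary here)."""
    return {ch: code for code, ch in enumerate(dict.fromkeys(characters))}


def encode(segment, dictionary):
    """Convert a text segment into its code sequence."""
    return [dictionary[ch] for ch in segment]


def decode(codes, dictionary):
    """Reverse conversion: turn character codes back into characters."""
    reverse = {code: ch for ch, code in dictionary.items()}
    return "".join(reverse[code] for code in codes)


def error_kind(input_codes, corrected_codes):
    """Classify the error by comparing the input and corrected code sequences."""
    if len(input_codes) != len(corrected_codes):
        return "insertion or deletion"
    if input_codes != corrected_codes:
        return "substitution"
    return "none"
```

  • For example, with a dictionary built from "合同编号金额", the segment "合同" encodes to `[0, 1]` and decodes back to "合同"; a corrected sequence of the same length but different codes is reported as a substitution, and one of a different length as an insertion or deletion.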
  • Corresponding text element information is extracted from the error correction text information according to a preset element extraction rule.
  • the text element information can be extracted from the error correction text information according to the element extraction rules, and the text element information is the element information used to reflect the important content in the agreement document or the contract document.
  • the element extraction rule includes an element mapping table and an element checking formula.
  • step S160 includes sub-steps S161 , S162 and S163 .
  • The element field corresponding to each element in the error correction text information can be located according to the element mapping table, which includes an element label for each element. For example, part of the information contained in the element mapping table is shown in Table 1.
  • Table 1 contains four element types, namely date, number, value, and text, and shows the element labels corresponding to the text elements. According to the element labels, the piece of text in the error correction text information that matches each element label is located and used as the element field of the corresponding element.
  • The element field of each element can be checked according to the element checking formula corresponding to that element.
  • The element checking formula can be a regular expression.
  • For example, the regular expression for a date element accepts only digits and Chinese numerals, with at most 8 characters; the regular expression for a number element accepts only combinations of digits and letters, with more than 2 characters; and the regular expression for a text element accepts only Chinese characters. If the element field of an element matches the regular expression of the element type to which the element belongs, the verification result is passed; otherwise, the verification result is failed.
  • The obtained text element information can be displayed as a list. Specifically, each element field whose verification result is passed is used as the specific content, the element corresponding to the element field is used as the element name, and an information table is generated for display, which makes it easier for users to view the extracted text element information. If the verification result of an element field is failed, the element field does not meet the corresponding requirements and cannot be used as text element information extracted from the document.
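  • The checks described above can be sketched with Python's `re` module. The patterns below are one possible reading of the three stated rules; the set of Chinese numerals, the CJK range used for "Chinese characters", and the element names in the usage note are assumptions. No pattern is given for the value element because the text does not describe one.

```python
import re

CHINESE_NUMERALS = "〇零一二三四五六七八九十"  # assumed set of Chinese numerals

ELEMENT_PATTERNS = {
    # Only digits and Chinese numerals, at most 8 characters.
    "date": re.compile(f"[0-9{CHINESE_NUMERALS}]{{1,8}}"),
    # Only digits and letters, more than 2 characters.
    "number": re.compile(r"[0-9A-Za-z]{3,}"),
    # Only Chinese characters (basic CJK Unified Ideographs block).
    "text": re.compile(r"[\u4e00-\u9fff]+"),
}


def check_element(element_type, field):
    """Return True when the whole field matches the pattern of its element type."""
    return ELEMENT_PATTERNS[element_type].fullmatch(field) is not None


def build_information_table(fields):
    """Keep (element name, element field) pairs whose verification result is passed.

    fields is a list of (element name, element type, element field) tuples.
    """
    return [(name, field) for name, etype, field in fields if check_element(etype, field)]
```

  • Given the hypothetical fields `("signing date", "date", "20210927")`, `("contract number", "number", "A1")` and `("party", "text", "深圳")`, the table keeps the date and the party and drops the two-character contract number.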
  • In the method for extracting element information based on text recognition provided by the embodiments of the present application, paging recognition is performed on the initial document to obtain the document information pages, and the initial text information is obtained from the document information pages according to the initial text recognition model. If a document information page contains unrecognized document content, that content is recognized according to the handwriting recognition model to obtain handwritten text information. Text error correction processing is then performed on the initial text information and the handwritten text information according to the text error correction model to obtain error correction text information, from which the text element information is extracted according to the element extraction rules.
  • With this method, text information is obtained by combining the initial text recognition model and the handwriting recognition model, and the corresponding text element information is extracted after text error correction processing. This improves the flexibility and widens the scope of application of element information extraction, and the text error correction processing improves the accuracy of the text element information obtained, so that element information can be extracted from documents accurately and efficiently.
  • The embodiments of the present application also provide an apparatus for extracting element information based on text recognition. The apparatus can be configured in a user terminal or a management server and is used to perform any embodiment of the aforementioned method for extracting element information based on text recognition.
  • FIG. 8 is a schematic block diagram of the apparatus for extracting element information based on text recognition provided by an embodiment of the present application.
  • The apparatus 100 for extracting element information based on text recognition includes a document information page acquisition unit 110, an initial text information acquisition unit 120, a document information page judging unit 130, a handwritten text information acquisition unit 140, an error correction text information acquisition unit 150 and a text element information acquisition unit 160.
  • The document information page acquisition unit 110 is configured to, if an input initial document is received, perform paging recognition on the initial document to acquire the document information pages therein.
  • The document information page acquisition unit 110 includes the following subunits: a text direction judging unit, configured to judge whether the text direction of each page of the initial document is the same as the preset standard text direction; a rotation unit, configured to rotate a page so that its text direction is the same as the standard text direction if the text direction of the page differs from the standard text direction; a text proportion judging unit, configured to judge whether the text proportion of each page whose text direction matches the standard text direction is greater than a preset proportion threshold; and a document information page determination unit, configured to obtain the pages whose text proportion is greater than the proportion threshold and determine them as the document information pages of the initial document.
  • the initial text information acquisition unit 120 is configured to identify the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information.
  • The initial text information acquisition unit 120 includes the following subunits: a character stroke acquisition unit, configured to acquire the character strokes corresponding to each paragraph from each document information page; a stroke feature vector acquisition unit, configured to calculate the stroke feature vector corresponding to the character strokes of each paragraph according to the feature vector extraction formula; a font matching calculation unit, configured to calculate the font matching degree between the stroke feature vector of each paragraph and each font in the character database; a font quantity judging unit, configured to judge whether the number of fonts whose font matching degree with the stroke feature vector is greater than the matching degree threshold is greater than zero; a target font acquiring unit, configured to, if that number is greater than zero, obtain the font with the highest font matching degree as the target font matching the stroke feature vector; and an unrecognized document content determination unit, configured to determine the paragraph corresponding to the stroke feature vector as unrecognized document content if the number of fonts whose matching degree with the stroke feature vector is greater than the matching degree threshold is not greater than zero.
  • the document information page judging unit 130 is configured to judge whether the document information page contains unrecognized document content.
  • the handwritten text information acquisition unit 140 is configured to, if the document information page contains unrecognized document content, identify the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information.
  • the handwritten text information acquisition unit 140 includes subunits: a character feature vector acquisition unit, configured to calculate, according to the feature vector extraction formula, the character feature vector corresponding to each character in the unrecognized document; a matching degree calculation unit, configured to calculate the matching degree between each character feature vector and the feature vector of each handwritten character in the handwritten character set; and a text information acquisition unit, configured to obtain, for each character feature vector, the handwritten character with the highest matching degree and combine these characters in order to obtain the handwritten text information corresponding to the unrecognized document content.
  • the error-corrected text information acquisition unit 150 is configured to perform text error correction on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error-corrected text information.
  • the error-corrected text information acquisition unit 150 includes subunits: a preprocessed text information acquisition unit, configured to preprocess the initial text information and the handwritten text information to obtain corresponding preprocessed text information; a text encoding information acquisition unit, configured to convert the characters contained in each text segment of the preprocessed text information according to the conversion dictionary to obtain corresponding text encoding information;
  • an error-corrected code sequence acquisition unit, configured to input the code sequence corresponding to each text segment in the text encoding information into the error correction neural network in turn to obtain a corresponding error-corrected code sequence;
  • and a sequence inverse conversion unit, configured to inversely convert each error-corrected code sequence according to the conversion dictionary to obtain the corresponding error-corrected text information.
  • the preprocessed text information acquisition unit includes subunits: an effective text information acquisition unit, configured to filter out invalid characters in the initial text information and the handwritten text information to obtain corresponding effective text information;
  • and an effective text information segmenting unit, configured to segment the effective text information according to the text sentences contained in the initial text information and the handwritten text information, to obtain preprocessed text information containing multiple text segments.
  • the text element information acquisition unit 160 is configured to extract corresponding text element information from the error-corrected text information according to preset element extraction rules.
  • the text element information acquisition unit 160 includes subunits: an element field locating unit, configured to locate, in the error-corrected text information, the element field corresponding to each element according to the element label mapped to that element in the element mapping table; a check result acquisition unit, configured to check the element field corresponding to each element according to the element check formula, to obtain a check result indicating whether the field passes;
  • and a text element information determining unit, configured to obtain the element fields whose check result is a pass and determine them as the text element information corresponding to the error-corrected text information.
  • the text recognition-based element information extraction apparatus provided in the embodiment of the present application applies the above-mentioned text recognition-based element information extraction method: page-by-page recognition is performed on the initial document to obtain the document information pages, and the initial text information is obtained from the document information pages according to the initial text recognition model.
  • if a document information page contains unrecognized document content, the unrecognized content is recognized according to the handwriting recognition model to obtain the handwritten text information; text error correction is performed on the initial text information and the handwritten text information according to the text error correction model to obtain the error-corrected text information, and text element information is extracted from it according to the element extraction rules.
  • with this method, text information is obtained by combining the initial text recognition model with the handwriting recognition model, and the corresponding text element information is extracted after text error correction. This makes text element extraction more flexible and widens its scope of application, and the text error correction step improves the accuracy of the extracted text element information, so that element information can be extracted from documents accurately and efficiently.
  • the above-mentioned device for extracting element information based on text recognition can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 9 .
  • FIG. 9 is a schematic block diagram of a computer device provided by an embodiment of the present application.
  • the computer device may be a user terminal or a management server for executing a method for extracting element information based on text recognition to perform text recognition on an initial document and extract text element information.
  • the computer device 500 includes a processor 502 connected through a system bus 501 , a memory and a network interface 505 , wherein the memory may include a storage medium 503 and an internal memory 504 .
  • the storage medium 503 can store an operating system 5031 and a computer program 5032 .
  • when the computer program 5032 is executed, it can cause the processor 502 to execute the method for extracting element information based on text recognition, wherein the storage medium 503 can be a volatile storage medium or a non-volatile storage medium.
  • the processor 502 is used to provide calculation and control capabilities and support the operation of the entire computer device 500 .
  • the internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503.
  • when the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the element information extraction method based on text recognition.
  • the network interface 505 is used for network communication, such as providing data transmission and the like.
  • FIG. 9 is only a block diagram of a partial structure related to the solution of this application, and does not constitute a limitation on the computer device 500 on which the solution of this application is applied.
  • the specific computer device 500 may include more or fewer components than shown, or combine certain components, or have a different arrangement of components.
  • the processor 502 is configured to run the computer program 5032 stored in the memory, so as to realize the corresponding functions in the aforementioned method for extracting element information based on text recognition.
  • the embodiment of the computer device shown in FIG. 9 does not constitute a limitation on the specific composition of the computer device.
  • in other embodiments, the computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
  • the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in FIG. 9 , and will not be repeated here.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), Application Specific Integrated Circuit (ASIC), off-the-shelf programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • another embodiment of the present application provides a computer-readable storage medium, which may be a volatile or non-volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, wherein when the computer program is executed by a processor, the steps included in the above-mentioned element information extraction method based on text recognition are realized.
  • the disclosed devices, apparatuses and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only logical function division.
  • in actual implementation there may be other division methods, and units with the same function may also be combined into one unit; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
  • if the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer software product is stored in a computer-readable storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned computer-readable storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Character Discrimination (AREA)

Abstract

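The present application discloses a text recognition-based element information extraction method, apparatus, device and medium. The method includes: performing page-by-page recognition on an initial document to obtain document information pages; obtaining initial text information from the document information pages according to an initial text recognition model; if a document information page contains unrecognized document content, recognizing the unrecognized content according to a handwriting recognition model to obtain handwritten text information; performing text error correction on the initial text information and the handwritten text information according to a text error correction model to obtain error-corrected text information; and extracting text element information from the error-corrected text information according to element extraction rules.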

Description

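Text recognition-based element information extraction method, apparatus, device and medium
This application claims priority to the Chinese patent application filed with the China Patent Office on September 17, 2021, with application number 202111094018.9 and the invention title "Text recognition-based element information extraction method, apparatus, device and medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of text recognition, and in particular to a text recognition-based element information extraction method, apparatus, device and medium.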
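Background
To manage the agreements and contracts they sign, enterprises usually need to extract key information from documents such as agreements or contracts. Existing extraction methods all analyze the text characters contained in an electronic document in order to obtain the corresponding element information. However, the inventors found that existing extraction methods rely only on keyword matching, taking a piece of information in the electronic document that matches a keyword as the element information. Such a method cannot analyze documents such as pictures and PDF documents, and it lacks flexibility: some element information cannot be extracted because it matches no keyword, so the corresponding element information cannot be extracted from documents accurately and efficiently. Therefore, prior-art methods have the problem that element information cannot be extracted from documents accurately and efficiently.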
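Summary
The embodiments of the present application provide a text recognition-based element information extraction method, apparatus, device and medium, aiming to solve the problem in prior-art methods that element information cannot be extracted from documents accurately and efficiently.
In a first aspect, an embodiment of the present application provides a text recognition-based element information extraction method, which includes:
if an input initial document is received, performing page-by-page recognition on the initial document to obtain the document information pages therein; recognizing the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information; judging whether the document information pages contain unrecognized document content; if the document information pages contain unrecognized document content, recognizing the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information; performing text error correction on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error-corrected text information; and extracting corresponding text element information from the error-corrected text information according to preset element extraction rules.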
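In a second aspect, an embodiment of the present application provides a text recognition-based element information extraction apparatus, which includes:
a document information page acquisition unit, configured to, if an input initial document is received, perform page-by-page recognition on the initial document to obtain the document information pages therein; an initial text information acquisition unit, configured to recognize the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information; a document information page judging unit, configured to judge whether the document information pages contain unrecognized document content; a handwritten text information acquisition unit, configured to, if the document information pages contain unrecognized document content, recognize the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information; an error-corrected text information acquisition unit, configured to perform text error correction on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error-corrected text information; and a text element information acquisition unit, configured to extract corresponding text element information from the error-corrected text information according to preset element extraction rules.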
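In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the text recognition-based element information extraction method of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to execute the text recognition-based element information extraction method of the first aspect.
The embodiments of the present application provide a text recognition-based element information extraction method, apparatus, computer device and readable storage medium. Page-by-page recognition is performed on the initial document to obtain the document information pages; initial text information is obtained from the document information pages according to the initial text recognition model; if a document information page contains unrecognized document content, the unrecognized content is recognized according to the handwriting recognition model to obtain handwritten text information; text error correction is performed on the initial text information and the handwritten text information according to the text error correction model to obtain error-corrected text information; and text element information is extracted from it according to the element extraction rules. With this method, text information is obtained by combining the initial text recognition model with the handwriting recognition model, and the corresponding text element information is extracted after text error correction. This makes text element extraction more flexible and widens its scope of application, and the text error correction step improves the accuracy of the extracted text element information, so that element information can be extracted from documents accurately and efficiently.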
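Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the text recognition-based element information extraction method provided by an embodiment of the present application;
FIG. 2 is a schematic sub-flowchart of the text recognition-based element information extraction method provided by an embodiment of the present application;
FIG. 3 is another schematic sub-flowchart of the text recognition-based element information extraction method provided by an embodiment of the present application;
FIG. 4 is another schematic sub-flowchart of the text recognition-based element information extraction method provided by an embodiment of the present application;
FIG. 5 is another schematic sub-flowchart of the text recognition-based element information extraction method provided by an embodiment of the present application;
FIG. 6 is another schematic flowchart of the text recognition-based element information extraction method provided by an embodiment of the present application;
FIG. 7 is another schematic flowchart of the text recognition-based element information extraction method provided by an embodiment of the present application;
FIG. 8 is a schematic block diagram of the text recognition-based element information extraction apparatus provided by an embodiment of the present application;
FIG. 9 is a schematic block diagram of the computer device provided by an embodiment of the present application.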
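Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the protection scope of the present application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, integers, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the present application. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.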
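Please refer to FIG. 1, which is a schematic flowchart of the text recognition-based element information extraction method provided by an embodiment of the present application. The method is applied in a user terminal or a management server and is executed by application software installed on the user terminal or the management server. The management server is a server that can execute the method to perform text recognition on an initial document and extract text element information; it may be a server built inside an enterprise or a government department. The user terminal is a terminal device that can execute the method to perform text recognition on an initial document and extract text element information, such as a desktop computer, a notebook computer, a tablet computer or a mobile phone. As shown in FIG. 1, the method includes steps S110 to S160.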
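S110. If an input initial document is received, perform page-by-page recognition on the initial document to obtain the document information pages therein.
If an input initial document is received, page-by-page recognition is performed on the initial document to obtain the document information pages therein. A user can input the initial document to the user terminal or the management server. The initial document is the document to be recognized, such as a contract or an agreement, and may be a PDF document or a set of pictures. The initial document is composed of multiple pages, and each page can be recognized separately to obtain the document information pages.
For example, if the initial document is a contract document, it contains a contract cover page and contract body pages. The contract body pages contain the content from which element information needs to be extracted, so after page-by-page recognition the contract body pages are obtained from the contract document as the document information pages.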
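In an embodiment, as shown in FIG. 2, step S110 includes sub-steps S111, S112, S113 and S114.
S111. Judge whether the text direction of each page of the initial document is the same as a preset standard text direction.
First, the text direction of each page of the initial document can be obtained; for any one page, exactly one text direction is obtained. Specifically, the shapes of the character blocks formed by the characters in the page can be obtained. Each character block corresponds to one row or one column of characters, and there are gaps between character blocks. The text direction can be determined from the shape of the character blocks: if the horizontal length of a character block is greater than its vertical length, the text direction is determined to be horizontal; if the vertical length is greater than the horizontal length, the text direction is determined to be vertical.
It can then be judged whether the text direction of each page is the same as the preset standard text direction. The standard text direction is horizontal with the first sentence not beginning with a punctuation mark.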
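The shape test in S111 can be sketched as follows. This is a minimal illustration, not the application's implementation: the `(x, y, width, height)` box format, the majority vote across blocks, and treating a square block as vertical are all assumptions made here, since the text only states the width-versus-height comparison.

```python
from typing import List, Tuple

# Hypothetical bounding box of one character block: (x, y, width, height).
Box = Tuple[int, int, int, int]

def block_direction(box: Box) -> str:
    """Horizontal if the block is wider than tall, otherwise vertical."""
    _, _, width, height = box
    return "horizontal" if width > height else "vertical"

def page_direction(blocks: List[Box]) -> str:
    """Reduce the per-block directions to a single page direction.

    The majority vote is an assumption; the text only says each page
    yields exactly one text direction.
    """
    votes = [block_direction(b) for b in blocks]
    horizontal = votes.count("horizontal")
    return "horizontal" if horizontal >= len(votes) - horizontal else "vertical"
```

A page whose blocks are mostly long rows is reported as horizontal; a page of tall columns is reported as vertical and would then be rotated in S112.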
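S112. If the text direction of the page is not the same as the standard text direction, rotate the page so that its text direction is the same as the standard text direction.
If the text direction of a page is not the same as the standard text direction, the page is rotated. Specifically, if the text direction of the page is vertical, the page is rotated 90 degrees clockwise, after which its text direction is horizontal. It is then judged whether the first sentence of the rotated page begins with a punctuation mark; if so, the page is rotated by a further 180 degrees, after which its text direction is the same as the standard text direction.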
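S113. Judge whether the text proportion of each page in the standard text direction is greater than a preset proportion threshold.
The text proportion of each page whose text direction is the same as the standard text direction is obtained. Specifically, the ratio of the area covered by the character blocks to the area of the page is taken as the text proportion; that is, the total area covered by all character blocks in the page is divided by the total area of the page, and it is judged whether the text proportion of each page is greater than the proportion threshold.
S114. Obtain the pages whose text proportion is greater than the proportion threshold and determine them as the document information pages corresponding to the initial document.
If the text proportion of a page is greater than the proportion threshold, the page contains the main content information and is obtained as a document information page. If the text proportion of a page is not greater than the proportion threshold, the page is a cover page that does not contain the main content information. For example, the proportion threshold may be set to 12%.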
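The area ratio in S113 and the threshold test in S114 can be sketched as follows. The box format and function names are illustrative assumptions, and the sketch assumes character blocks do not overlap; the 12% default is the example value given in the text.

```python
from typing import List, Tuple

# Hypothetical bounding box of one character block: (x, y, width, height).
Box = Tuple[int, int, int, int]

def text_proportion(blocks: List[Box], page_width: int, page_height: int) -> float:
    """Area covered by all character blocks divided by the page area."""
    covered = sum(width * height for _, _, width, height in blocks)
    return covered / (page_width * page_height)

def is_information_page(blocks: List[Box], page_width: int, page_height: int,
                        threshold: float = 0.12) -> bool:
    """A page counts as a document information page only above the threshold."""
    return text_proportion(blocks, page_width, page_height) > threshold
```

A cover page with a single title line typically falls below the threshold and is discarded, while a densely printed body page passes.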
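S120. Recognize the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information.
The text content contained in each document information page is recognized according to the preset initial text recognition model to obtain the corresponding initial text information. The initial text recognition model is a model for recognizing non-handwritten characters in the document information pages, and includes a feature vector extraction formula, a character database and a matching degree threshold. The text content contained in each document information page can be recognized according to the initial text recognition model, and the text information obtained is taken as the initial text information.
In an embodiment, as shown in FIG. 3, step S120 includes sub-steps S121, S122, S123, S124, S125, S126 and S127.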
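S121. Obtain the character strokes corresponding to each paragraph from each document information page.
Different paragraphs in a document information page may use the same or different non-handwritten fonts. To improve the accuracy of recognizing the text information contained in the document information page, the character strokes corresponding to each paragraph can be obtained. The character strokes are representative strokes obtained from each paragraph, such as the horizontal stroke "一", the vertical stroke "丨" and the walking radical "辶".
S122. Calculate, according to the feature vector extraction formula, the stroke feature vector corresponding to the character strokes of each paragraph.
Specifically, a stroke image of each character stroke of the paragraph can be obtained, and the stroke feature vector corresponding to the character stroke is extracted from the stroke image by the feature vector extraction formula. The stroke feature vector quantitatively characterizes the features of the character stroke.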
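For example, the stroke image of each character stroke has a resolution of 50×50. According to the calculation formula of the first convolution kernel in the feature vector extraction formula, a convolution is performed with a 6×6 window and a stride of 1, giving a 45×45 matrix, which is the shallow feature of the stroke image. According to the pooling formula, downsampling is performed with a 9×9 window and a stride of 6, giving a 7×7 matrix, which is the deep feature of the stroke image. According to the calculation formulas of the five second convolution kernels, convolutions are performed with a 3×3 window and a stride of 2, giving five 3×3 matrices. The five 3×3 matrices are then processed by the first fully connected formula, which contains five nodes, each associated with one of the 3×3 matrices; that is, the values of the five nodes are calculated by five formulas. The first formula can be written as $Y_1 = a_1 \times X_1 + b_1$, where $Y_1$ is the calculated value of the first node, $X_1$ denotes the values in the first matrix of the stroke image, and $a_1$ and $b_1$ are the preset parameters of the first formula, which associates the first node with the first matrix. The values of the five nodes are then combined by the second fully connected formula to obtain the final feature vector of the stroke image: $Z_1 = c_1 \times Y_1 + c_2 \times Y_2 + c_3 \times Y_3 + c_4 \times Y_4 + c_5 \times Y_5$, where $Y_1$, $Y_2$, $Y_3$, $Y_4$, $Y_5$ are the values of the five nodes associated with the matrices of the stroke image, and $c_1$, $c_2$, $c_3$, $c_4$, $c_5$ are the preset parameters from the five nodes to the final output node. Since a 3×3 matrix contains 9 values, the resulting feature vector of the stroke image is a 1×9 vector, which can be written as $Z = (z_1, z_2, \ldots, z_9)$.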
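The matrix sizes in this example follow from the usual size rule for an unpadded sliding window, floor((n − window) / stride) + 1. The short check below reproduces the 50 → 45 → 7 → 3 chain; the absence of padding is an assumption, though it is the only reading consistent with the stated sizes.

```python
def output_size(n: int, window: int, stride: int) -> int:
    """Side length after sliding a square window over an n x n input without padding."""
    return (n - window) // stride + 1

# Sizes from the worked example: 50x50 stroke image.
after_first_conv = output_size(50, window=6, stride=1)                 # first convolution
after_pooling = output_size(after_first_conv, window=9, stride=6)      # pooling
after_second_conv = output_size(after_pooling, window=3, stride=2)     # second convolution

# Each 3x3 matrix holds 9 values, matching the 1x9 feature vector Z.
values_per_matrix = after_second_conv * after_second_conv
```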
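S123. Calculate the font matching degree between the stroke feature vectors of each paragraph and each font in the character database.
The character database contains multiple fonts, together with the font stroke features of each font for the representative strokes. The font matching degree between the stroke feature vectors of a paragraph and the font stroke features of each font can therefore be calculated. Specifically, the font matching degree between the stroke feature vectors of a paragraph and a given font can be calculated by formula (1):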
Figure PCTCN2021121116-appb-000001
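where $N$ is the total number of stroke feature vectors, $Z_i = (z_{1i}, z_{2i}, \ldots, z_{9i})$ is the $i$-th stroke feature vector, and $R_i = (r_{1i}, r_{2i}, \ldots, r_{9i})$ is the $i$-th font stroke feature of a given font.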
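S124. Judge whether the number of fonts whose font matching degree with the stroke feature vectors is greater than the matching degree threshold is greater than zero. S125. If that number is greater than zero, obtain the font with the highest matching degree with the stroke feature vectors as the target font matching each stroke feature vector.
Based on the calculation results above, it can be judged, for each paragraph, whether the number of fonts whose font matching degree with the paragraph's stroke feature vectors is greater than the matching degree threshold is greater than zero. If it is, at least one font in the character database matches the font stroke features of the paragraph, and the font with the highest matching degree with the paragraph's stroke feature vectors is obtained as the target font. In this way the target font matching the stroke feature vectors of each paragraph can be determined.
S126. If the number of fonts whose matching degree with the stroke feature vectors is greater than the matching degree threshold is not greater than zero, determine the paragraph corresponding to the stroke feature vectors as unrecognized document content.
If, for a paragraph, the number of fonts whose font matching degree with the paragraph's stroke feature vectors is greater than the matching degree threshold is not greater than zero, no font in the character database matches the font stroke features of the paragraph, and the content of the paragraph is determined as unrecognized document content.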
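The decision logic of S124 to S126 can be sketched as follows. Formula (1) appears only as an image in this text, so the matching function used here, the mean cosine similarity over the N stroke pairs, is a stand-in assumption and not the application's formula; the names `font_match` and `select_font` are likewise hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def font_match(strokes: List[Vector], font_strokes: List[Vector]) -> float:
    """ASSUMED stand-in for formula (1): mean similarity over stroke pairs (Z_i, R_i)."""
    pairs = list(zip(strokes, font_strokes))
    return sum(cosine(z, r) for z, r in pairs) / len(pairs)

def select_font(strokes: List[Vector], fonts: Dict[str, List[Vector]],
                threshold: float) -> Optional[str]:
    """Return the best font above the threshold, or None for unrecognized content."""
    scores = {name: font_match(strokes, ref) for name, ref in fonts.items()}
    passing = {name: score for name, score in scores.items() if score > threshold}
    if not passing:
        return None  # S126: the paragraph becomes unrecognized document content
    return max(passing, key=passing.get)  # S125: font with the highest matching degree
```

A `None` result is what routes a paragraph to the handwriting recognition model in S140.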
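S127. Recognize the text content contained in the paragraph corresponding to each stroke feature vector according to the character template of the target font corresponding to that stroke feature vector in the character database, to obtain the corresponding initial text information.
The character database also includes a character template for each font; the character template of a font contains that font's template for every character. Based on the character template of the target font that matches the stroke feature vectors, the text content of the corresponding paragraph can be recognized: the character templates are matched against the character images in the paragraph to identify the text content of each character image. Recognizing the text content of the paragraphs matched to a target font yields the initial text information.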
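S130. Judge whether the document information pages contain unrecognized document content.
It is judged whether the document information pages contain unrecognized document content. Unrecognized document content is the text content of a paragraph whose stroke feature vectors match no font in the character database.
S140. If the document information pages contain unrecognized document content, recognize the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information.
If the document information pages contain unrecognized document content, part of the content cannot be recognized through the character database, and the unrecognized document content needs to be recognized by the handwriting recognition model. The handwriting recognition model is a model for recognizing handwritten text content, and includes the feature vector extraction formula and a handwritten character set.
In an embodiment, as shown in FIG. 4, step S140 includes sub-steps S141, S142 and S143.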
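S141. Calculate, according to the feature vector extraction formula, the character feature vector corresponding to each character in the unrecognized document.
The character feature vector of each character in the unrecognized document can be calculated by the feature vector extraction formula. Specifically, a character image of each character in the unrecognized document is first obtained, and the character feature vector corresponding to the character image is calculated with the feature vector extraction formula described above. The process is the same as that for calculating the stroke feature vector and is not repeated here.
S142. Calculate the matching degree between each character feature vector and the feature vector of each handwritten character in the handwritten character set.
The handwritten character set contains multiple handwritten characters and the feature vector corresponding to each of them. The matching degree between a character feature vector and the feature vector of a handwritten character can be calculated by formula (2):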
Figure PCTCN2021121116-appb-000002
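where $Z = (z_1, z_2, \ldots, z_9)$ is the character feature vector of a character, and $T = (t_1, t_2, \ldots, t_9)$ is the feature vector corresponding to a handwritten character.
S143. Obtain, for each character feature vector, the handwritten character with the highest matching degree and combine these characters in order, to obtain the handwritten text information corresponding to the unrecognized document content.
For each character feature vector, the handwritten character with the highest matching degree is obtained, and the handwritten characters are combined in order according to the positions in the unrecognized document of the characters to which the character feature vectors correspond, to obtain the handwritten text information corresponding to the unrecognized document content.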
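Steps S142 and S143 amount to a nearest-match lookup followed by concatenation in reading order. In the sketch below, formula (2) is not reproduced (it appears only as an image), so `similarity` is an assumed stand-in based on Euclidean distance, and the dictionary-shaped handwritten character set is a hypothetical representation.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]

def similarity(z: Vector, t: Vector) -> float:
    """ASSUMED stand-in for formula (2): larger when the two vectors are closer."""
    return 1.0 / (1.0 + math.dist(z, t))

def recognize_handwriting(char_vectors: List[Vector],
                          handwritten_set: Dict[str, Vector]) -> str:
    """Pick the best-matching handwritten character for each vector, in order.

    char_vectors must already be sorted by the characters' positions in the
    unrecognized document content.
    """
    best = [
        max(handwritten_set, key=lambda ch: similarity(z, handwritten_set[ch]))
        for z in char_vectors
    ]
    return "".join(best)
```

Note that this always returns some character for every vector; the text describes no rejection threshold for the handwriting step.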
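S150. Perform text error correction on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error-corrected text information.
Text error correction is performed on the initial text information and the handwritten text information according to the preset text error correction model to obtain the corresponding error-corrected text information. If the document information pages contain unrecognized document content, text error correction is performed on both the initial text information and the handwritten text information; if the document information pages do not contain unrecognized document content, text error correction is performed on the initial text information only. The initial text information and the handwritten text information are quite likely to contain wrongly recognized characters. To improve the accuracy of element information extraction, text error correction can be applied to the obtained text information, that is, wrongly recognized characters are replaced, deleted or otherwise processed, so that the text in the resulting error-corrected text information conveys the correct meaning. The text error correction model includes a conversion dictionary and an error correction neural network.
In an embodiment, as shown in FIG. 5, step S150 includes sub-steps S151, S152, S153 and S154.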
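S151. Preprocess the initial text information and the handwritten text information to obtain corresponding preprocessed text information.
The initial text information and the handwritten text information can be combined in order, and the characters in them preprocessed, to obtain the corresponding preprocessed text information.
In an embodiment, as shown in FIG. 6, step S151 includes sub-steps S1511 and S1512.
S1511. Filter out invalid characters in the initial text information and the handwritten text information to obtain corresponding effective text information.
Specifically, invalid characters in the initial text information and the handwritten text information, such as symbols and spaces, are first filtered out to obtain effective text information containing only valid characters.
S1512. Segment the effective text information according to the text sentences contained in the initial text information and the handwritten text information, to obtain preprocessed text information containing multiple text segments.
The effective text information is segmented according to the text sentences contained in the initial text information and the handwritten text information to obtain the preprocessed text information. A text sentence is a sentence that expresses a complete meaning, so the preprocessed text information contains multiple text segments, each corresponding to one text sentence.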
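The two preprocessing sub-steps can be sketched together. The set of sentence-ending marks and the use of `str.isalnum` to decide which characters are valid are assumptions; the text only names symbols and spaces as examples of invalid characters. The sketch splits on the original sentence boundaries before filtering, which gives the same segments as filtering first and then segmenting by the original sentences.

```python
import re
from typing import List

# ASSUMED sentence terminators; the text does not list them.
SENTENCE_END = "。！？!?；;"

def preprocess(text: str) -> List[str]:
    """Split text into sentences, then drop symbols and whitespace from each one."""
    pattern = "[" + re.escape(SENTENCE_END) + "]"
    segments = re.split(pattern, text)
    # Keep letters, digits and CJK characters; everything else counts as invalid here.
    cleaned = ["".join(ch for ch in seg if ch.isalnum()) for seg in segments]
    return [seg for seg in cleaned if seg]
```

Each returned string is one text segment ready for encoding in S152.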
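S152. Convert the characters contained in each text segment of the preprocessed text information according to the conversion dictionary to obtain corresponding text encoding information.
The characters contained in each text segment of the preprocessed text are converted according to the conversion dictionary. The conversion dictionary contains the character code corresponding to each character, so the character code of each character in a text segment can be looked up in the conversion dictionary to obtain the text encoding information.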
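S153. Input the code sequence corresponding to each text segment in the text encoding information into the error correction neural network in turn to obtain a corresponding error-corrected code sequence.
The error correction neural network is an intelligent neural network for correcting errors in text information. It may be built from a BERT (Bidirectional Encoder Representations from Transformers) network and a natural language processing neural network (Natural Language Processing Transformer, NLP neural network). The NLP neural network may be built on a multi-head self-attention network (Multi-Head Self-Attention) and is composed of multiple encoders and multiple decoders. Any code sequence is first input into the BERT network to calculate the corresponding representation vector, and the representation vector is then input into the NLP neural network to calculate the corresponding error-corrected code sequence. The BERT network consists of an input layer, multiple intermediate layers and an output layer. The input layer and the first intermediate layer, each intermediate layer and the next, and the last intermediate layer and the output layer are all connected by association formulas, each of which can be expressed as a linear function. When the code sequence of any text segment is input into the BERT network, the corresponding representation vector is obtained from the output layer. The representation vector has size (J, K), that is, it is a matrix with J rows and K columns, where J equals the number of character codes in the code sequence, and every value in the representation vector lies in the range [0, 1]. The multiple encoders and decoders of the NLP neural network process the representation vector to produce the corresponding error-corrected code sequence. The number of character codes in the error-corrected code sequence may or may not equal the number of character codes in the input code sequence. If the two numbers are equal, the input code sequence contains no error or only substitution errors; if they are not equal, the input code sequence contains insertion errors or deletion errors.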
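The length comparison at the end of this step can be written down directly. The sketch below only encodes the rule stated in the text; the function name and return labels are hypothetical, and distinguishing "no error" from "substitution" by comparing the sequences element-wise is an added convenience, not something the text prescribes.

```python
from typing import Sequence

def classify_correction(input_codes: Sequence[int], corrected_codes: Sequence[int]) -> str:
    """Classify what the correction network changed, using only sequence lengths.

    Equal lengths: no error, or substitution errors only.
    Unequal lengths: the input contained insertion or deletion errors.
    """
    if len(input_codes) == len(corrected_codes):
        if list(input_codes) == list(corrected_codes):
            return "none"
        return "substitution"
    return "insertion_or_deletion"
```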
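S154. Inversely convert each error-corrected code sequence according to the conversion dictionary to obtain the corresponding error-corrected text information.
The conversion dictionary contains the correspondence between characters and character codes, so each error-corrected code sequence can be inversely converted according to the conversion dictionary. Inverse conversion means converting the character codes in the error-corrected code sequence back into the corresponding characters. The characters obtained, arranged in order, form the error-corrected text of the corresponding text segment, and the error-corrected texts of all text segments are combined into the error-corrected text information.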
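The conversion dictionary used in S152 and S154 behaves like a two-way lookup table. The sketch below assumes a one-to-one mapping from characters to integer codes; the actual code values used by the application are not given, so the numbering here is purely illustrative.

```python
from typing import Dict, Iterable, List, Tuple

def build_dictionary(chars: Iterable[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Assign each distinct character a hypothetical integer code, starting at 1."""
    char_to_code: Dict[str, int] = {}
    for ch in chars:
        if ch not in char_to_code:
            char_to_code[ch] = len(char_to_code) + 1
    code_to_char = {code: ch for ch, code in char_to_code.items()}
    return char_to_code, code_to_char

def encode(segment: str, char_to_code: Dict[str, int]) -> List[int]:
    """S152: convert a text segment into its code sequence."""
    return [char_to_code[ch] for ch in segment]

def decode(codes: Iterable[int], code_to_char: Dict[int, str]) -> str:
    """S154: inverse conversion of a code sequence back into text."""
    return "".join(code_to_char[code] for code in codes)
```

Encoding a segment and decoding it again returns the original segment, which is the property S154 relies on once the network has rewritten the code sequence.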
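S160. Extract corresponding text element information from the error-corrected text information according to preset element extraction rules.
Text element information is extracted from the error-corrected text information according to the preset element extraction rules. Text element information is the element information that reflects the important content of an agreement document or a contract document. The element extraction rules include an element mapping table and element check formulas.
In an embodiment, as shown in FIG. 7, step S160 includes sub-steps S161, S162 and S163.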
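S161. Locate, in the error-corrected text information, the element field corresponding to each element according to the element label mapped to that element in the element mapping table.
The element field corresponding to each element in the error-corrected text information can be located according to the element mapping table, which contains the element label corresponding to each element. For example, part of the information contained in the element mapping table is shown in Table 1.
Table 1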
Figure PCTCN2021121116-appb-000003
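The element mapping table contains four kinds of elements: date, serial number, numeric value and text. Table 1 shows the element labels corresponding to the text kind of element. According to the element labels, a piece of text in the error-corrected text information that matches an element label can be located as an element field of the corresponding element.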
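S162. Check the element field corresponding to each element according to the element check formula, to obtain a check result indicating whether the field passes.
The element field of each element can be checked according to the element check formula corresponding to its kind of element. Specifically, the element check formula may be a regular expression. For example, the regular expression for the date kind requires digits and Chinese numerals only, with at most 8 characters; the regular expression for the serial number kind requires a combination of digits and letters only, with more than 2 characters; and the regular expression for the text kind requires Chinese characters only. If the element field of an element matches the regular expression of the kind to which the element belongs, the check result is a pass; otherwise the check result is a fail.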
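The three check rules described above can be sketched with Python's `re` module. The concrete patterns are one possible reading of the prose rules, not the application's actual expressions: the set of Chinese numerals, the CJK range `\u4e00-\u9fff`, and the kind names are assumptions. No rule is given in the text for the numeric value kind, so it is left out.

```python
import re
from typing import Dict, Pattern

# ASSUMED regular expressions derived from the prose description of each rule.
ELEMENT_CHECKS: Dict[str, Pattern[str]] = {
    # Digits and Chinese numerals only, at most 8 characters.
    "date": re.compile(r"[0-9零〇一二三四五六七八九十]{1,8}"),
    # Digits and letters only, more than 2 characters.
    "serial_number": re.compile(r"[0-9A-Za-z]{3,}"),
    # Chinese characters only.
    "text": re.compile(r"[\u4e00-\u9fff]+"),
}

def check_field(kind: str, field: str) -> bool:
    """Return True if the whole field matches the check formula of its element kind."""
    return ELEMENT_CHECKS[kind].fullmatch(field) is not None
```

Using `fullmatch` rather than `search` matters here: every rule says "only", so a single stray character must fail the check.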
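S163. Obtain the element fields whose check result is a pass and determine them as the text element information corresponding to the error-corrected text information.
All element fields whose check result is a pass are obtained as the text element information corresponding to the error-corrected text information. The obtained text element information can be displayed as a list: each passing element field is used as the content, the element corresponding to that field is used as the element name, and an information table is generated for display, making it easier for the user to review the extracted text element information. If the check result of an element field is a fail, the field does not meet the requirements and cannot be used as text element information extracted from the document.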
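The filtering and table generation in S163 can be sketched as follows. The row format with `element` and `content` keys and the `passes` callback are hypothetical; any check function with the same signature, such as one built from the element check formulas, could be supplied.

```python
from typing import Callable, Dict, List, Tuple

def build_info_table(fields: List[Tuple[str, str]],
                     passes: Callable[[str, str], bool]) -> List[Dict[str, str]]:
    """Keep only the element fields that pass the check, as display rows.

    fields holds (element name, element field) pairs located in S161;
    passes(name, field) is the check from S162.
    """
    return [
        {"element": name, "content": value}
        for name, value in fields
        if passes(name, value)
    ]
```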
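In the text recognition-based element information extraction method provided by the embodiment of the present application, page-by-page recognition is performed on the initial document to obtain the document information pages; initial text information is obtained from the document information pages according to the initial text recognition model; if a document information page contains unrecognized document content, the unrecognized content is recognized according to the handwriting recognition model to obtain handwritten text information; text error correction is performed on the initial text information and the handwritten text information according to the text error correction model to obtain error-corrected text information; and text element information is extracted from it according to the element extraction rules. With this method, text information is obtained by combining the initial text recognition model with the handwriting recognition model, and the corresponding text element information is extracted after text error correction. This makes text element extraction more flexible and widens its scope of application, and the text error correction step improves the accuracy of the extracted text element information, so that element information can be extracted from documents accurately and efficiently.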
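An embodiment of the present application further provides a text recognition-based element information extraction apparatus, which can be configured in a user terminal or a management server and is used to execute any embodiment of the aforementioned text recognition-based element information extraction method. Specifically, please refer to FIG. 8, which is a schematic block diagram of the text recognition-based element information extraction apparatus provided by an embodiment of the present application.
As shown in FIG. 8, the text recognition-based element information extraction apparatus 100 includes a document information page acquisition unit 110, an initial text information acquisition unit 120, a document information page judging unit 130, a handwritten text information acquisition unit 140, an error-corrected text information acquisition unit 150 and a text element information acquisition unit 160.
The document information page acquisition unit 110 is configured to, if an input initial document is received, perform page-by-page recognition on the initial document to obtain the document information pages therein.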
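In a specific embodiment, the document information page acquisition unit 110 includes subunits: a text direction judging unit, configured to judge whether the text direction of each page of the initial document is the same as a preset standard text direction; a document rotation unit, configured to, if the text direction of the page is not the same as the standard text direction, rotate the page so that its text direction is the same as the standard text direction; a text proportion judging unit, configured to judge whether the text proportion of each page in the standard text direction is greater than a preset proportion threshold; and a document information page determination unit, configured to obtain the pages whose text proportion is greater than the proportion threshold and determine them as the document information pages corresponding to the initial document.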
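The initial text information acquisition unit 120 is configured to recognize the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information.
In a specific embodiment, the initial text information acquisition unit 120 includes subunits: a character stroke acquisition unit, configured to obtain the character strokes corresponding to each paragraph from each document information page; a stroke feature vector acquisition unit, configured to calculate, according to the feature vector extraction formula, the stroke feature vector corresponding to the character strokes of each paragraph; a font matching degree calculation unit, configured to calculate the font matching degree between the stroke feature vectors of each paragraph and each font in the character database; a font quantity judging unit, configured to judge whether the number of fonts whose font matching degree with the stroke feature vectors is greater than the matching degree threshold is greater than zero; a target font acquisition unit, configured to, if that number is greater than zero, obtain the font with the highest matching degree with the stroke feature vectors as the target font matching each stroke feature vector; an unrecognized document content determination unit, configured to, if that number is not greater than zero, determine the paragraph corresponding to the stroke feature vectors as unrecognized document content; and a text content recognition unit, configured to recognize the text content contained in the paragraph corresponding to each stroke feature vector according to the character template of the target font corresponding to that stroke feature vector in the character database, to obtain the corresponding initial text information.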
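The document information page judging unit 130 is configured to judge whether the document information pages contain unrecognized document content.
The handwritten text information acquisition unit 140 is configured to, if the document information pages contain unrecognized document content, recognize the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information.
In a specific embodiment, the handwritten text information acquisition unit 140 includes subunits: a character feature vector acquisition unit, configured to calculate, according to the feature vector extraction formula, the character feature vector corresponding to each character in the unrecognized document; a matching degree calculation unit, configured to calculate the matching degree between each character feature vector and the feature vector of each handwritten character in the handwritten character set; and a text information acquisition unit, configured to obtain, for each character feature vector, the handwritten character with the highest matching degree and combine these characters in order, to obtain the handwritten text information corresponding to the unrecognized document content.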
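The error-corrected text information acquisition unit 150 is configured to perform text error correction on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error-corrected text information.
In a specific embodiment, the error-corrected text information acquisition unit 150 includes subunits: a preprocessed text information acquisition unit, configured to preprocess the initial text information and the handwritten text information to obtain corresponding preprocessed text information; a text encoding information acquisition unit, configured to convert the characters contained in each text segment of the preprocessed text information according to the conversion dictionary to obtain corresponding text encoding information; an error-corrected code sequence acquisition unit, configured to input the code sequence corresponding to each text segment in the text encoding information into the error correction neural network in turn to obtain a corresponding error-corrected code sequence; and a sequence inverse conversion unit, configured to inversely convert each error-corrected code sequence according to the conversion dictionary to obtain the corresponding error-corrected text information.
In a specific embodiment, the preprocessed text information acquisition unit includes subunits: an effective text information acquisition unit, configured to filter out invalid characters in the initial text information and the handwritten text information to obtain corresponding effective text information; and an effective text information segmenting unit, configured to segment the effective text information according to the text sentences contained in the initial text information and the handwritten text information, to obtain preprocessed text information containing multiple text segments.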
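The text element information acquisition unit 160 is configured to extract corresponding text element information from the error-corrected text information according to preset element extraction rules.
In a specific embodiment, the text element information acquisition unit 160 includes subunits: an element field locating unit, configured to locate, in the error-corrected text information, the element field corresponding to each element according to the element label mapped to that element in the element mapping table; a check result acquisition unit, configured to check the element field corresponding to each element according to the element check formula, to obtain a check result indicating whether the field passes; and a text element information determining unit, configured to obtain the element fields whose check result is a pass and determine them as the text element information corresponding to the error-corrected text information.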
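The text recognition-based element information extraction apparatus provided by the embodiment of the present application applies the above text recognition-based element information extraction method: page-by-page recognition is performed on the initial document to obtain the document information pages; initial text information is obtained from the document information pages according to the initial text recognition model; if a document information page contains unrecognized document content, the unrecognized content is recognized according to the handwriting recognition model to obtain handwritten text information; text error correction is performed on the initial text information and the handwritten text information according to the text error correction model to obtain error-corrected text information; and text element information is extracted from it according to the element extraction rules. With this method, text information is obtained by combining the initial text recognition model with the handwriting recognition model, and the corresponding text element information is extracted after text error correction. This makes text element extraction more flexible and widens its scope of application, and the text error correction step improves the accuracy of the extracted text element information, so that element information can be extracted from documents accurately and efficiently.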
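The above text recognition-based element information extraction apparatus can be implemented in the form of a computer program, and the computer program can run on a computer device as shown in FIG. 9.
Please refer to FIG. 9, which is a schematic block diagram of the computer device provided by an embodiment of the present application. The computer device may be a user terminal or a management server used to execute the text recognition-based element information extraction method, so as to perform text recognition on an initial document and extract text element information.
Referring to FIG. 9, the computer device 500 includes a processor 502, a memory and a network interface 505 connected through a system bus 501, wherein the memory may include a storage medium 503 and an internal memory 504.
The storage medium 503 can store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 can cause the processor 502 to execute the text recognition-based element information extraction method. The storage medium 503 may be a volatile storage medium or a non-volatile storage medium.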
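The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the storage medium 503. When the computer program 5032 is executed by the processor 502, it can cause the processor 502 to execute the text recognition-based element information extraction method.
The network interface 505 is used for network communication, such as transmitting data. Those skilled in the art can understand that the structure shown in FIG. 9 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the computer device 500 to which the solution is applied. A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 502 is configured to run the computer program 5032 stored in the memory, so as to realize the corresponding functions of the above text recognition-based element information extraction method.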
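Those skilled in the art can understand that the embodiment of the computer device shown in FIG. 9 does not limit the specific composition of the computer device. In other embodiments, the computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 9 and are not repeated here.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), and may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium may be a volatile or non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the above text recognition-based element information extraction method are implemented.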
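Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here. A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are executed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementation should not be considered as going beyond the scope of the present application.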
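In the several embodiments provided in the present application, it should be understood that the disclosed devices, apparatuses and methods can be implemented in other ways. For example, the apparatus embodiments described above are only illustrative. The division into units is only a division by logical function; in actual implementation there may be other divisions, and units with the same function may be combined into one unit. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses or units, and may also be an electrical, mechanical or other form of connection.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit can be implemented in the form of hardware or in the form of a software functional unit.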
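If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the part of the technical solution of the present application that is essential or that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned computer-readable storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a magnetic disk or an optical disk.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited to them. Any person familiar with the technical field can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.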

Claims (20)

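  1. A text recognition-based element information extraction method, comprising:
    if an input initial document is received, performing page-by-page recognition on the initial document to obtain the document information pages therein;
    recognizing the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information;
    judging whether the document information pages contain unrecognized document content;
    if the document information pages contain unrecognized document content, recognizing the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information;
    performing text error correction on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error-corrected text information;
    extracting corresponding text element information from the error-corrected text information according to preset element extraction rules.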
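  2. The text recognition-based element information extraction method according to claim 1, wherein performing page-by-page recognition on the initial document to obtain the document information pages therein comprises:
    judging whether the text direction of each page of the initial document is the same as a preset standard text direction;
    if the text direction of the page is not the same as the standard text direction, rotating the page so that its text direction is the same as the standard text direction;
    judging whether the text proportion of each page in the standard text direction is greater than a preset proportion threshold;
    obtaining the pages whose text proportion is greater than the proportion threshold and determining them as the document information pages corresponding to the initial document.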
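  3. The text recognition-based element information extraction method according to claim 1, wherein the initial text recognition model comprises a feature vector extraction formula, a character database and a matching degree threshold, and recognizing the text content contained in each document information page according to the preset initial text recognition model to obtain corresponding initial text information comprises:
    obtaining the character strokes corresponding to each paragraph from each document information page;
    calculating, according to the feature vector extraction formula, the stroke feature vector corresponding to the character strokes of each paragraph;
    calculating the font matching degree between the stroke feature vectors of each paragraph and each font in the character database;
    judging whether the number of fonts whose font matching degree with the stroke feature vectors is greater than the matching degree threshold is greater than zero;
    if the number of fonts whose matching degree with the stroke feature vectors is greater than the matching degree threshold is greater than zero, obtaining the font with the highest matching degree with the stroke feature vectors as the target font matching each stroke feature vector;
    if the number of fonts whose matching degree with the stroke feature vectors is greater than the matching degree threshold is not greater than zero, determining the paragraph corresponding to the stroke feature vectors as unrecognized document content;
    recognizing the text content contained in the paragraph corresponding to each stroke feature vector according to the character template of the target font corresponding to that stroke feature vector in the character database, to obtain the corresponding initial text information.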
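  4. The text recognition-based element information extraction method according to claim 3, wherein the handwriting recognition model comprises the feature vector extraction formula and a handwritten character set, and recognizing the unrecognized document content according to the preset handwriting recognition model to obtain corresponding handwritten text information comprises:
    calculating, according to the feature vector extraction formula, the character feature vector corresponding to each character in the unrecognized document;
    calculating the matching degree between each character feature vector and the feature vector of each handwritten character in the handwritten character set;
    obtaining, for each character feature vector, the handwritten character with the highest matching degree and combining these characters in order, to obtain the handwritten text information corresponding to the unrecognized document content.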
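  5. The text recognition-based element information extraction method according to claim 1, wherein the text error correction model comprises a conversion dictionary and an error correction neural network, and performing text error correction on the initial text information and the handwritten text information according to the preset text error correction model to obtain corresponding error-corrected text information comprises:
    preprocessing the initial text information and the handwritten text information to obtain corresponding preprocessed text information;
    converting the characters contained in each text segment of the preprocessed text information according to the conversion dictionary to obtain corresponding text encoding information;
    inputting the code sequence corresponding to each text segment in the text encoding information into the error correction neural network in turn to obtain a corresponding error-corrected code sequence;
    inversely converting each error-corrected code sequence according to the conversion dictionary to obtain the corresponding error-corrected text information.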
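  6. The text recognition-based element information extraction method according to claim 5, wherein preprocessing the initial text information and the handwritten text information to obtain corresponding preprocessed text information comprises:
    filtering out invalid characters in the initial text information and the handwritten text information to obtain corresponding effective text information;
    segmenting the effective text information according to the text sentences contained in the initial text information and the handwritten text information, to obtain preprocessed text information containing multiple text segments.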
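  7. The text recognition-based element information extraction method according to claim 1, wherein the element extraction rules comprise an element mapping table and an element check formula, and extracting corresponding text element information from the error-corrected text information according to the preset element extraction rules comprises:
    locating, in the error-corrected text information, the element field corresponding to each element according to the element label mapped to that element in the element mapping table;
    checking the element field corresponding to each element according to the element check formula, to obtain a check result indicating whether the field passes;
    obtaining the element fields whose check result is a pass and determining them as the text element information corresponding to the error-corrected text information.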
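  8. A text recognition-based element information extraction apparatus, comprising:
    a document information page acquisition unit, configured to, if an input initial document is received, perform page-by-page recognition on the initial document to obtain the document information pages therein;
    an initial text information acquisition unit, configured to recognize the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information;
    a document information page judging unit, configured to judge whether the document information pages contain unrecognized document content;
    a handwritten text information acquisition unit, configured to, if the document information pages contain unrecognized document content, recognize the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information;
    an error-corrected text information acquisition unit, configured to perform text error correction on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error-corrected text information;
    a text element information acquisition unit, configured to extract corresponding text element information from the error-corrected text information according to preset element extraction rules.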
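  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
    if an input initial document is received, performing page-by-page recognition on the initial document to obtain the document information pages therein;
    recognizing the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information;
    judging whether the document information pages contain unrecognized document content;
    if the document information pages contain unrecognized document content, recognizing the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information;
    performing text error correction on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error-corrected text information;
    extracting corresponding text element information from the error-corrected text information according to preset element extraction rules.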
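  10. The computer device according to claim 9, wherein performing page-by-page recognition on the initial document to obtain the document information pages therein comprises:
    judging whether the text direction of each page of the initial document is the same as a preset standard text direction;
    if the text direction of the page is not the same as the standard text direction, rotating the page so that its text direction is the same as the standard text direction;
    judging whether the text proportion of each page in the standard text direction is greater than a preset proportion threshold;
    obtaining the pages whose text proportion is greater than the proportion threshold and determining them as the document information pages corresponding to the initial document.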
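  11. The computer device according to claim 9, wherein the initial text recognition model comprises a feature vector extraction formula, a character database and a matching degree threshold, and recognizing the text content contained in each document information page according to the preset initial text recognition model to obtain corresponding initial text information comprises:
    obtaining the character strokes corresponding to each paragraph from each document information page;
    calculating, according to the feature vector extraction formula, the stroke feature vector corresponding to the character strokes of each paragraph;
    calculating the font matching degree between the stroke feature vectors of each paragraph and each font in the character database;
    judging whether the number of fonts whose font matching degree with the stroke feature vectors is greater than the matching degree threshold is greater than zero;
    if the number of fonts whose matching degree with the stroke feature vectors is greater than the matching degree threshold is greater than zero, obtaining the font with the highest matching degree with the stroke feature vectors as the target font matching each stroke feature vector;
    if the number of fonts whose matching degree with the stroke feature vectors is greater than the matching degree threshold is not greater than zero, determining the paragraph corresponding to the stroke feature vectors as unrecognized document content;
    recognizing the text content contained in the paragraph corresponding to each stroke feature vector according to the character template of the target font corresponding to that stroke feature vector in the character database, to obtain the corresponding initial text information.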
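  12. The computer device according to claim 11, wherein the handwriting recognition model comprises the feature vector extraction formula and a handwritten character set, and recognizing the unrecognized document content according to the preset handwriting recognition model to obtain corresponding handwritten text information comprises:
    calculating, according to the feature vector extraction formula, the character feature vector corresponding to each character in the unrecognized document;
    calculating the matching degree between each character feature vector and the feature vector of each handwritten character in the handwritten character set;
    obtaining, for each character feature vector, the handwritten character with the highest matching degree and combining these characters in order, to obtain the handwritten text information corresponding to the unrecognized document content.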
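  13. The computer device according to claim 9, wherein the text error correction model comprises a conversion dictionary and an error correction neural network, and performing text error correction on the initial text information and the handwritten text information according to the preset text error correction model to obtain corresponding error-corrected text information comprises:
    preprocessing the initial text information and the handwritten text information to obtain corresponding preprocessed text information;
    converting the characters contained in each text segment of the preprocessed text information according to the conversion dictionary to obtain corresponding text encoding information;
    inputting the code sequence corresponding to each text segment in the text encoding information into the error correction neural network in turn to obtain a corresponding error-corrected code sequence;
    inversely converting each error-corrected code sequence according to the conversion dictionary to obtain the corresponding error-corrected text information.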
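  14. The computer device according to claim 13, wherein preprocessing the initial text information and the handwritten text information to obtain corresponding preprocessed text information comprises:
    filtering out invalid characters in the initial text information and the handwritten text information to obtain corresponding effective text information;
    segmenting the effective text information according to the text sentences contained in the initial text information and the handwritten text information, to obtain preprocessed text information containing multiple text segments.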
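  15. The computer device according to claim 9, wherein the element extraction rules comprise an element mapping table and an element check formula, and extracting corresponding text element information from the error-corrected text information according to the preset element extraction rules comprises:
    locating, in the error-corrected text information, the element field corresponding to each element according to the element label mapped to that element in the element mapping table;
    checking the element field corresponding to each element according to the element check formula, to obtain a check result indicating whether the field passes;
    obtaining the element fields whose check result is a pass and determining them as the text element information corresponding to the error-corrected text information.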
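  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
    if an input initial document is received, performing page-by-page recognition on the initial document to obtain the document information pages therein;
    recognizing the text content contained in each document information page according to a preset initial text recognition model to obtain corresponding initial text information;
    determining whether the document information pages contain unrecognized document content;
    if the document information pages contain unrecognized document content, recognizing the unrecognized document content according to a preset handwriting recognition model to obtain corresponding handwritten text information;
    performing text error correction on the initial text information and the handwritten text information according to a preset text error correction model to obtain corresponding error-corrected text information;
    extracting corresponding text element information from the error-corrected text information according to a preset element extraction rule.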
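
The overall control flow of these operations, in particular the conditional fallback to handwriting recognition, can be sketched as a function that takes each stage as a callable. Every stage below is a trivial hypothetical stub; only the ordering of the calls reflects the claim.

```python
import re


def run_pipeline(document, split_pages, recognize_printed,
                 recognize_handwritten, correct, extract):
    """Chain the claimed operations; each stage is supplied by the caller."""
    texts = []
    for page in split_pages(document):
        printed_text, unrecognized = recognize_printed(page)
        texts.append(printed_text)
        if unrecognized:  # handwriting model runs only on leftover content
            texts.append(recognize_handwritten(unrecognized))
    corrected = correct(" ".join(t for t in texts if t))
    return extract(corrected)


# Hypothetical stubs: a "document" is a list of strings tagged by type.
def stub_recognize_printed(page):
    if page.startswith("typed:"):
        return page[len("typed:"):], None
    return "", page[len("hand:"):]


result = run_pipeline(
    ["typed:Name: Li", "hand:Age: 3O"],
    split_pages=lambda doc: doc,
    recognize_printed=stub_recognize_printed,
    recognize_handwritten=lambda content: content,
    correct=lambda text: text.replace("3O", "30"),
    extract=lambda text: dict(re.findall(r"(\w+): (\w+)", text)),
)
```

Passing the stages as parameters keeps the sketch independent of any particular recognition or correction model.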
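  17. The computer-readable storage medium according to claim 16, wherein performing page-by-page recognition on the initial document to obtain the document information pages therein comprises:
    determining whether the text direction of each page of the initial document is the same as a preset standard text direction;
    if the text direction of the page differs from the standard text direction, rotating the page so that its text direction is the same as the standard text direction;
    determining whether the proportion of text in the page in the standard text direction is greater than a preset proportion threshold;
    determining the pages whose proportion of text is greater than the proportion threshold as the document information pages corresponding to the initial document.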
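  18. The computer-readable storage medium according to claim 16, wherein the initial text recognition model comprises a feature vector extraction formula, a character database and a matching degree threshold, and recognizing the text content contained in each document information page according to the preset initial text recognition model to obtain the corresponding initial text information comprises:
    obtaining the character strokes corresponding to each paragraph from each document information page;
    calculating, according to the feature vector extraction formula, a stroke feature vector corresponding to the character strokes of each paragraph;
    calculating a font matching degree between the stroke feature vector of each paragraph and each font in the character database;
    determining whether the number of fonts whose font matching degree with the stroke feature vector is greater than the matching degree threshold is greater than zero;
    if the number of fonts whose matching degree with the stroke feature vector is greater than the matching degree threshold is greater than zero, obtaining the font with the highest matching degree with the stroke feature vector as the target font matching that stroke feature vector;
    if the number of fonts whose matching degree with the stroke feature vector is greater than the matching degree threshold is not greater than zero, determining the paragraph corresponding to the stroke feature vector as unrecognized document content;
    recognizing the text content contained in the paragraph corresponding to each stroke feature vector according to the character template, in the character database, of the target font corresponding to that stroke feature vector, to obtain the corresponding initial text information.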
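  19. The computer-readable storage medium according to claim 18, wherein the handwriting recognition model comprises the feature vector extraction formula and a handwritten character set, and recognizing the unrecognized document content according to the preset handwriting recognition model to obtain the corresponding handwritten text information comprises:
    calculating, according to the feature vector extraction formula, a character feature vector corresponding to each character in the unrecognized document content;
    calculating a matching degree between each character feature vector and the feature vector corresponding to each handwritten character in the handwritten character set;
    obtaining, for each character feature vector, the handwritten character with the highest matching degree, and combining the obtained characters in order to obtain the handwritten text information corresponding to the unrecognized document content.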
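  20. The computer-readable storage medium according to claim 16, wherein the text error correction model comprises a conversion dictionary and an error correction neural network, and performing text error correction on the initial text information and the handwritten text information according to the preset text error correction model to obtain the corresponding error-corrected text information comprises:
    preprocessing the initial text information and the handwritten text information to obtain corresponding preprocessed text information;
    converting, according to the conversion dictionary, the characters contained in each text segment of the preprocessed text information to obtain corresponding text encoding information;
    inputting the encoding sequence corresponding to each text segment in the text encoding information into the error correction neural network in turn to obtain a corresponding error-corrected encoding sequence;
    inversely converting each error-corrected encoding sequence according to the conversion dictionary to obtain the corresponding error-corrected text information.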
PCT/CN2021/121116 2021-09-17 2021-09-28 Text recognition-based element information extraction method, apparatus, device and medium WO2023039942A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111094018.9A CN113536771B (zh) 2021-09-17 2021-09-17 Text recognition-based element information extraction method, apparatus, device and medium
CN202111094018.9 2021-09-17

Publications (1)

Publication Number Publication Date
WO2023039942A1 (zh) 2023-03-23

Family

ID=78092827

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/121116 WO2023039942A1 (zh) 2021-09-17 2021-09-28 Text recognition-based element information extraction method, apparatus, device and medium

Country Status (2)

Country Link
CN (1) CN113536771B (zh)
WO (1) WO2023039942A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580402A (zh) * 2023-05-26 2023-08-11 读书郎教育科技有限公司 Text recognition method and device for a dictionary pen
CN117197828A (zh) * 2023-08-11 2023-12-08 中国银行保险信息技术管理有限公司 Bill information recognition method, apparatus, medium and device

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114970470B (zh) * 2022-07-27 2022-11-01 中关村科学城城市大脑股份有限公司 Copywriting information processing method, apparatus, electronic device and computer-readable medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170004359A1 (en) * 2015-07-03 2017-01-05 Cognizant Technology Solutions India Pvt. Ltd. System and Method for Efficient Recognition of Handwritten Characters in Documents
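CN106570538A (zh) * 2015-10-10 2017-04-19 北大方正集团有限公司 Character image processing method and apparatus
CN111209740A (zh) * 2019-12-31 2020-05-29 中移(杭州)信息技术有限公司 Text model training method, text error correction method, electronic device and storage medium
CN111523306A (zh) * 2019-01-17 2020-08-11 阿里巴巴集团控股有限公司 Text error correction method, apparatus and system
CN112464781A (zh) * 2020-11-24 2021-03-09 厦门理工学院 Graph neural network-based method for extracting and matching key information from document images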

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
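CN102262619A (zh) * 2010-05-31 2011-11-30 汉王科技股份有限公司 Method and apparatus for extracting text from a document
CN102799879B (zh) * 2012-07-12 2014-04-02 中国科学技术大学 Method for recognizing multilingual, multi-font text in natural scene images
CN103400127A (zh) * 2013-08-05 2013-11-20 苏州鼎富软件科技有限公司 Method for recognizing text in pictures
CN106156794B (zh) * 2016-07-01 2020-12-25 北京旷视科技有限公司 Character recognition method and apparatus based on character style recognition
CN106951832B (zh) * 2017-02-28 2022-02-18 广东数相智能科技有限公司 Verification method and apparatus based on handwritten character recognition
CN107368827B (zh) * 2017-04-01 2020-09-15 阿里巴巴集团控股有限公司 Character recognition method and apparatus, user equipment, and server
JP6711523B2 (ja) * 2018-05-25 2020-06-17 株式会社ふくおかフィナンシャルグループ Form recognition system
CN111444750B (zh) * 2019-01-17 2023-03-21 珠海金山办公软件有限公司 PDF document recognition method, apparatus and electronic device
CN109993112B (zh) * 2019-03-29 2021-04-09 杭州睿琪软件有限公司 Method and apparatus for recognizing tables in pictures
CN110197238B (zh) * 2019-04-15 2023-09-26 广州企图腾科技有限公司 Font category recognition method, system and terminal device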
EP3786814A1 (en) * 2019-08-30 2021-03-03 Accenture Global Solutions Limited Intelligent extraction of information from a document
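CN111626383B (zh) * 2020-05-29 2023-11-07 Oppo广东移动通信有限公司 Font recognition method and apparatus, electronic device, and storage medium
CN112101354A (zh) * 2020-09-23 2020-12-18 广州虎牙科技有限公司 Text recognition model training method, text positioning method and related apparatus
CN112434691A (zh) * 2020-12-02 2021-03-02 上海三稻智能科技有限公司 HS code matching and display method, system and storage medium based on intelligent parsing and recognition
CN112733623A (zh) * 2020-12-26 2021-04-30 科大讯飞华南人工智能研究院(广州)有限公司 Text element extraction method, related device and readable storage medium
CN112766255A (zh) * 2021-01-19 2021-05-07 上海微盟企业发展有限公司 Optical character recognition method, apparatus, device and storage medium
CN112862024B (zh) * 2021-04-28 2021-09-21 明品云(北京)数据科技有限公司 Text recognition method and system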

Also Published As

Publication number Publication date
CN113536771B (zh) 2021-12-24
CN113536771A (zh) 2021-10-22

Similar Documents

Publication Publication Date Title
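WO2023039942A1 (zh) Text recognition-based element information extraction method, apparatus, device and medium
US11455525B2 (en) Method and apparatus of open set recognition and a computer readable storage medium
CN108717406B (zh) Text sentiment analysis method, apparatus and storage medium
US10915788B2 (en) Optical character recognition using end-to-end deep learning
CN108763380B (zh) Trademark recognition and retrieval method, apparatus, computer device and storage medium
US11232300B2 (en) System and method for automatic detection and verification of optical character recognition data
US10489645B2 (en) System and method for automatic detection and verification of optical character recognition data
CN108804423B (zh) Medical text feature extraction and automatic matching method and system
CN110741376B (zh) Automatic document analysis for different natural languages
US11599727B2 (en) Intelligent text cleaning method and apparatus, and computer-readable storage medium
WO2021208727A1 (zh) Artificial intelligence-based text error detection method, apparatus and computer device
CN111814463B (zh) International Classification of Diseases code recommendation method and system, and corresponding device and storage medium
CN111582169A (zh) Image recognition data error correction method, apparatus, computer device and storage medium
CN113094509B (zh) Text information extraction method, system, device and medium
CN111460793A (zh) Error correction method, apparatus, device and storage medium
CN114358001A (zh) Method for standardizing diagnosis results, and related apparatus, device and storage medium
He et al. User-assisted archive document image analysis for digital library construction
Kumar Rai et al. Medical prescription and report analyzer
CN114821610A (zh) Method for generating web page code from images based on a tree-structured neural network
CN109190615B (zh) Method, apparatus, computer device and storage medium for recognizing and judging visually similar characters
WO2021072872A1 (zh) Name storage method based on character conversion, apparatus and computer device
CN115083550A (zh) Patient similarity classification method based on multi-source information
TWM585395U (zh) Insurance claim assistance system using a deep-learning long short-term memory model
CN114065777A (zh) Bilingual corpus detection method, device and computer-readable medium
TWI712979B (zh) Insurance claim assistance system using a deep-learning long short-term memory model, and method thereof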

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21957214

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE