CN115841677B - Text layout analysis method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN115841677B
Authority
CN
China
Prior art keywords
text
layout
sample
analyzed
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211652196.3A
Other languages
Chinese (zh)
Other versions
CN115841677A (en)
Inventor
杨沛灵
闫印强
姚兴仁
姜海昆
范宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changyang Technology Beijing Co ltd
Original Assignee
Changyang Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changyang Technology Beijing Co ltd
Priority to CN202211652196.3A
Publication of CN115841677A
Application granted
Publication of CN115841677B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a text layout analysis method and device, electronic equipment and a storage medium. The method comprises: acquiring the text content in a text layout to be analyzed and the coordinates of the corresponding text boxes, based on the recognition result of an OCR algorithm on the text layout to be analyzed; converting the text content in the text layout to be analyzed into sentence vectors based on the text layout to be analyzed; splicing the coordinates of the text boxes in the text layout to be analyzed to obtain spliced coordinate information; generating character feature information from the characters other than text content in the text layout to be analyzed; splicing the sentence vectors, the coordinate information and the character feature information to obtain a spliced sequence; and inputting the spliced sequence into a pre-trained seq2seq model to output the content identifier of each text content. The scheme can improve the accuracy of the analysis result for a text layout.

Description

Text layout analysis method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computer vision, in particular to a text layout analysis method, a text layout analysis device, electronic equipment and a storage medium.
Background
In a text layout analysis scenario, an OCR algorithm is generally used to detect and recognize the content of a text picture, and key information is then extracted from the detected content. In the prior art, the required key information is obtained from the recognized content by coordinate positioning. However, this approach is less accurate when the text format of the layout changes.
Disclosure of Invention
The embodiment of the invention provides a text layout analysis method, a device, electronic equipment and a storage medium, which can improve the accuracy of an analysis result in a text layout.
In a first aspect, an embodiment of the present invention provides a text layout analysis method, including:
acquiring the text content in a text layout to be analyzed and the coordinates of the corresponding text boxes, based on the recognition result of an OCR algorithm on the text layout to be analyzed;
converting the text content in the text layout to be analyzed into sentence vectors based on the text layout to be analyzed;
splicing the coordinates of the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
generating character feature information from the characters other than text content in the text layout to be analyzed;
splicing the sentence vectors, the coordinate information and the character feature information to obtain a spliced sequence;
and inputting the spliced sequence into a pre-trained seq2seq model to output the content identifier of each text content.
In one possible implementation, the training manner of the seq2seq model includes:
acquiring a plurality of sample text layouts, and executing for each sample text layout:
based on the recognition result of the OCR algorithm on the sample text layout, acquiring sample text content and coordinates of a corresponding text box in the sample text layout;
acquiring an identification ID of sample text content in the sample text layout based on a manual identification mode;
converting the sample text content in the sample text layout into a sample sentence vector based on the sample text layout;
coordinate splicing is carried out on the text boxes in the sample text layout, and spliced sample coordinate information is obtained;
generating sample character characteristic information from characters except text contents in the sample text layout;
performing information splicing on the sample sentence vector, the sample coordinate information and the sample character characteristic information to obtain a sample splicing sequence;
taking the sample splicing sequence as input, and taking the identification ID of the sample text content in the sample text layout as output to obtain a sample pair for training the seq2seq model;
training the seq2seq model based on a plurality of sample pairs.
In one possible implementation, the layout types of the plurality of sample text layouts are not identical.
In one possible implementation, the identification ID of the sample text content in the sample text layout is used as output in one-hot encoded form.
In one possible implementation, the method further includes: training to obtain a doc2vec model corresponding to the layout type by utilizing a plurality of sample text layouts of the same layout type in advance; the doc2vec model is used for converting text contents in a text layout of the layout type into sentence vectors based on the corresponding text layout;
and when sentence vector conversion is carried out, the doc2vec model of the corresponding layout type is utilized.
In one possible implementation, the coordinate stitching includes:
taking coordinate values of four vertexes of each text box as eight-dimensional coordinate values of the corresponding text box;
and performing head-to-tail splicing on the eight-dimensional coordinate values of each text box to obtain spliced coordinate information.
In one possible implementation manner, the characters except for the text content in the text layout to be analyzed include: at least one of numbers, symbols, english, and other characters;
the generating character characteristic information comprises the following steps: and generating character characteristic information of corresponding dimension based on the proportion of at least one character of numbers, symbols, english and other characters in the text layout to be analyzed.
In a second aspect, an embodiment of the present invention further provides a text layout analysis apparatus, including:
an acquisition unit, configured to acquire the text content in a text layout to be analyzed and the coordinates of the corresponding text boxes, based on the recognition result of an OCR algorithm on the text layout to be analyzed;
the conversion unit is used for converting the text content in the text layout to be analyzed into sentence vectors based on the text layout to be analyzed;
the coordinate splicing unit is used for carrying out coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
the generation unit is used for generating character characteristic information of characters except text contents in the text layout to be analyzed;
the information splicing unit is used for carrying out information splicing on the sentence vectors, the coordinate information and the character characteristic information to obtain a splicing sequence;
and the identification unit is used for inputting the splicing sequence into a pre-trained seq2seq model so as to output the content identification of each text content.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory and a processor, where the memory stores a computer program, and when the processor executes the computer program, the method described in any embodiment of the present specification is implemented.
In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform a method according to any of the embodiments of the present specification.
The embodiment of the invention provides a text layout analysis method, a device, electronic equipment and a storage medium. The acquired text content is converted into sentence vectors based on the text layout to be analyzed, the coordinates of the text boxes are spliced, and character feature information is generated from the characters other than text content in the text layout to be analyzed. The sentence vectors, the spliced coordinate information and the character feature information are then spliced together, so that the resulting spliced sequence fully captures the content of the text layout to be analyzed. The spliced sequence is input into a pre-trained seq2seq model, which outputs the content identifier of each text content. Because the seq2seq model can fully learn the characteristics of text layouts during training, its recognition results can be more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a text layout analysis method according to an embodiment of the present invention;
FIG. 2 is a hardware architecture diagram of an electronic device according to an embodiment of the present invention;
FIG. 3 is a block diagram of a text layout analysis device according to an embodiment of the present invention;
FIG. 4 is a block diagram of another text layout analysis device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by those skilled in the art without making any inventive effort based on the embodiments of the present invention are within the scope of protection of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a text layout analysis method, which includes:
step 100, acquiring the text content in a text layout to be analyzed and the coordinates of the corresponding text boxes, based on the recognition result of an OCR algorithm on the text layout to be analyzed;
step 102, converting the text content in the text layout to be analyzed into sentence vectors based on the text layout to be analyzed;
step 104, splicing the coordinates of the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
step 106, generating character feature information from the characters other than text content in the text layout to be analyzed;
step 108, splicing the sentence vectors, the coordinate information and the character feature information to obtain a spliced sequence;
step 110, inputting the spliced sequence into a pre-trained seq2seq model to output the content identifier of each text content.
In the embodiment of the invention, the acquired text content is converted into sentence vectors based on the text layout to be analyzed, the coordinates of the text boxes are spliced, and character feature information is generated from the characters other than text content in the text layout to be analyzed. The sentence vectors, the spliced coordinate information and the character feature information are then spliced together, so that the resulting spliced sequence fully captures the content of the text layout to be analyzed. The spliced sequence is input into a pre-trained seq2seq model, which outputs the content identifier of each text content. Because the seq2seq model can fully learn the characteristics of text layouts during training, its recognition results can be more accurate.
The manner in which the individual steps shown in fig. 1 are performed is described below.
First, a training process of the seq2seq model will be explained.
In the embodiment of the invention, different text layouts have different layout types, and different text layouts of the same layout type also differ in content, so a seq2seq model is a natural choice when a neural network model is used to identify the content of text layouts. Both the input and the output of a seq2seq model may be variable-length sequences, and the model may include an encoder that analyzes the input sequence and a decoder that generates the output sequence.
In order for the seq2seq model to adequately learn the content of the text layout, in one embodiment of the invention, the content of the sample text layout needs to be adequately extracted to generate an input sequence to train the seq2seq model. Specifically, the training process of the seq2seq model includes:
A1, acquiring a plurality of sample text layouts, and executing the following steps A11-A17 for each sample text layout;
A sample text layout may be a form-type document, such as a business license, a tax return or a job position report, each of which is a text layout of a different layout type. In this embodiment, the plurality of sample text layouts may all correspond to the same layout type (for example, multiple business licenses), or may correspond to different layout types.
Preferably, the layout types of the plurality of sample text layouts are not all identical. The trained seq2seq model can then recognize text layouts of different layout types, which widens the range of text layouts that can be analyzed and preserves a good recognition effect even when the text format changes.
A11, acquiring sample text content and coordinates of a corresponding text box in the sample text layout based on an identification result of the OCR algorithm on the sample text layout;
In the embodiment of the invention, the sample text content and the coordinates of the corresponding text boxes can be extracted from the recognition result of the OCR algorithm. Each text content corresponds to one text box, and the coordinates of a text box are the coordinates of its four vertices.
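The pairing of one text content with one four-vertex text box can be sketched as a small data structure. This is only an illustration: the raw OCR result format (a list of dictionaries with `text` and `box` keys) and the names `TextBox` and `parse_ocr_result` are assumptions, since the source does not name a specific OCR engine or output format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class TextBox:
    """One recognized text content and the four vertices of its text box."""
    text: str
    vertices: Tuple[Point, ...]


def parse_ocr_result(raw: List[dict]) -> List[TextBox]:
    # `raw` is a hypothetical OCR output: [{"text": str, "box": [[x, y] * 4]}, ...]
    boxes = []
    for item in raw:
        pts = tuple((float(x), float(y)) for x, y in item["box"])
        if len(pts) != 4:
            raise ValueError("each text box needs exactly four vertices")
        boxes.append(TextBox(text=item["text"], vertices=pts))
    return boxes


# Made-up example input, not taken from the source.
raw_result = [
    {"text": "Company name", "box": [[10, 10], [110, 10], [110, 30], [10, 30]]},
    {"text": "Registered capital", "box": [[10, 40], [150, 40], [150, 60], [10, 60]]},
]
boxes = parse_ocr_result(raw_result)
```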
A12, acquiring an identification ID of sample text content in the sample text layout based on a manual identification mode;
In the embodiment of the invention, either only the key text content or all text content may be labeled. In one implementation, each kind of key text content is labeled with its own identification ID, and all non-key text content is labeled with the same identification ID.
For example, for a business license, the key text content may be the unified social credit code, the company name and the registered capital. The sample text content corresponding to the unified social credit code is then labeled 01, the sample text content corresponding to the company name is labeled 02, the sample text content corresponding to the registered capital is labeled 03, and all other sample text content is labeled 00.
In the embodiment of the invention, to ensure that the seq2seq model can recognize text layouts of different layout types, the identification ID assigned to each kind of text content is unique.
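The labeling scheme in the business license example can be sketched as a lookup table. The field names and IDs come from the example above; the dictionary layout, the `label_for` helper and the sample annotations are illustrative assumptions.

```python
# IDs from the business license example; "00" marks non-key text content.
KEY_FIELD_IDS = {
    "unified social credit code": "01",
    "company name": "02",
    "registered capital": "03",
}
NON_KEY_ID = "00"


def label_for(field_name: str) -> str:
    """Return the identification ID for a manually annotated field name."""
    return KEY_FIELD_IDS.get(field_name, NON_KEY_ID)


# Hypothetical manual annotations for five text boxes in one layout.
annotations = [
    "title",
    "unified social credit code",
    "company name",
    "address",
    "registered capital",
]
labels = [label_for(name) for name in annotations]

# Each key field must map to a distinct ID.
assert len(set(KEY_FIELD_IDS.values())) == len(KEY_FIELD_IDS)
```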
A13, converting the sample text content in the sample text layout into a sample sentence vector based on the sample text layout;
In the embodiment of the invention, text layouts of different layout types differ in text content and format, and text layouts of the same layout type do not have exactly the same text content either, so the text content of each text layout needs to be converted into sentence vectors based on that text layout.
Specifically, the doc2vec model of gensim may be used to convert text content into sentence vectors.
To ensure that sentence vector conversion adapts to more layout types, in one embodiment of the invention a doc2vec model corresponding to each layout type is trained in advance using a plurality of sample text layouts of that layout type; the doc2vec model is used for converting text content in a text layout of the layout type into sentence vectors based on the corresponding text layout. When sentence vector conversion is required, the doc2vec model matching the layout type of the text layout to be converted is used.
Training a separate doc2vec model for each layout type makes the converted sentence vectors better suited to text layouts of that type and therefore more accurate, which in turn improves the accuracy of the subsequent training and recognition of the seq2seq model.
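The selection of a sentence-vector model by layout type can be sketched as a simple registry. To keep the sketch free of third-party dependencies, a hypothetical hashing embedder stands in for a trained doc2vec model; it is not doc2vec and only illustrates the dispatch. The names `make_hashing_embedder`, `MODELS` and `to_sentence_vectors`, as well as the vector size of 8, are assumptions.

```python
import hashlib
from typing import Callable, Dict, List

Embedder = Callable[[str], List[float]]


def make_hashing_embedder(dim: int, salt: str) -> Embedder:
    """Stand-in for a per-layout-type model: hashes tokens into `dim` buckets."""
    def embed(text: str) -> List[float]:
        vec = [0.0] * dim
        for token in text.split():
            digest = hashlib.sha256((salt + token).encode("utf-8")).digest()
            vec[digest[0] % dim] += 1.0
        return vec
    return embed


# One model per layout type; in the described method each entry would be a
# doc2vec model trained on sample layouts of that type.
MODELS: Dict[str, Embedder] = {
    "business_license": make_hashing_embedder(8, "business_license"),
    "tax_return": make_hashing_embedder(8, "tax_return"),
}


def to_sentence_vectors(layout_type: str, texts: List[str]) -> List[List[float]]:
    """Convert each text content using the model of the matching layout type."""
    if layout_type not in MODELS:
        raise ValueError(f"no sentence-vector model for layout type {layout_type!r}")
    model = MODELS[layout_type]
    return [model(t) for t in texts]


vectors = to_sentence_vectors(
    "business_license", ["company name", "registered capital 100"]
)
```

With gensim, the stand-in embedder would be replaced by a trained model per layout type; the dispatch on layout type stays the same.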
A14, carrying out coordinate splicing on the text boxes in the sample text layout to obtain spliced sample coordinate information;
Because each text box has four vertices and each vertex is represented by two-dimensional coordinates, the coordinate values of the four vertices can be taken as the eight-dimensional coordinate value of the corresponding text box, and the eight-dimensional coordinate values of all text boxes are then spliced end to end to obtain the spliced coordinate information.
For example, if the coordinate values of the first text box are (x11, y11), (x12, y12), (x13, y13), (x14, y14) and those of the second text box are (x21, y21), (x22, y22), (x23, y23), (x24, y24), the end-to-end splice may be: (x11, y11), (x12, y12), (x13, y13), (x14, y14), (x21, y21), (x22, y22), (x23, y23), (x24, y24).
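The end-to-end coordinate splicing can be sketched in a few lines. The function names and the sample coordinates are illustrative assumptions; the flattening order (vertex by vertex, box by box) follows the example above.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def box_to_eight_dims(vertices: Sequence[Point]) -> List[float]:
    """Flatten four (x, y) vertices into one eight-dimensional coordinate value."""
    if len(vertices) != 4:
        raise ValueError("a text box must have exactly four vertices")
    return [coord for vertex in vertices for coord in vertex]


def splice_coordinates(boxes: Sequence[Sequence[Point]]) -> List[float]:
    """Splice the eight-dimensional values of all text boxes end to end."""
    spliced: List[float] = []
    for vertices in boxes:
        spliced.extend(box_to_eight_dims(vertices))
    return spliced


# Made-up coordinates for two text boxes.
first_box = [(1, 2), (5, 2), (5, 4), (1, 4)]
second_box = [(1, 6), (7, 6), (7, 8), (1, 8)]
coords = splice_coordinates([first_box, second_box])
```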
A15, generating sample character characteristic information from characters except text contents in the sample text layout;
In the embodiment of the invention, some text layouts include other characters in addition to text content, and these characters also express characteristics of the text layout. To fully express the characteristics of the text layout, character feature information therefore needs to be generated from the characters other than text content.
Specifically, in step A15, the characters in the sample text layout other than the sample text content may include at least one of numbers, symbols, English and other characters, and character feature information of the corresponding dimension can be generated based on the proportion of at least one of numbers, symbols, English and other characters in the text layout.
For example, if the characters other than text content are numbers, symbols, English and other characters, a feature value for each character type is determined from the proportion of that type in the whole text layout (text content + numbers + symbols + English + other characters). For example, with the proportions of numbers, symbols, English and other characters being, respectively, 0.2, 0.1, and 0.01, these feature values can be used as four-dimensional character feature information.
It should be noted that, once determined, the set of character types other than text content stays unchanged throughout the training process and the subsequent processing of text layouts to be analyzed. For example, numbers, symbols, English and other characters are used throughout as the characters other than text content.
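The proportion-based character features can be sketched as follows. The source does not say how characters are classified, so the rules here are assumptions: CJK ideographs count as text content, ASCII letters as English, Unicode punctuation and symbol categories as symbols, and whitespace is ignored. The names `char_type`, `FEATURE_ORDER` and `character_features` are hypothetical.

```python
import unicodedata
from typing import Iterable, List

# Fixed order of the character types, kept identical for training and inference.
FEATURE_ORDER = ("number", "symbol", "english", "other")


def char_type(ch: str) -> str:
    """Classify one character (assumed rules, see lead-in)."""
    if ch.isdigit():
        return "number"
    if ch.isascii() and ch.isalpha():
        return "english"
    if "\u4e00" <= ch <= "\u9fff":
        return "text"
    if unicodedata.category(ch)[0] in ("P", "S"):
        return "symbol"
    return "other"


def character_features(texts: Iterable[str]) -> List[float]:
    """Proportion of each non-text character type over all characters in the layout."""
    counts = {name: 0 for name in FEATURE_ORDER + ("text",)}
    for text in texts:
        for ch in text:
            if not ch.isspace():
                counts[char_type(ch)] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(FEATURE_ORDER)
    return [counts[name] / total for name in FEATURE_ORDER]


# 4 text characters, 3 digits, 3 English letters, 1 symbol: 11 in total.
features = character_features(["注册资本", "100", "RMB", ":"])
```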
A16, performing information splicing on the sample sentence vector, the sample coordinate information and the sample character characteristic information to obtain a sample splicing sequence;
In the embodiment of the invention, steps A13-A15 extract information from different angles of the text layout. Splicing the information extracted from each angle yields a spliced sequence that accurately and fully expresses the content of the text layout.
The information may also be spliced end to end, and the parts may be spliced in any order; once the splicing order is determined, however, the same order must be used throughout the training process and the subsequent processing of text layouts to be analyzed.
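The information splicing with a fixed order can be sketched as follows. The part names, the default order and the flattening of the sentence vectors into one list are assumptions; the source only requires that the chosen order stays the same for training and inference.

```python
from typing import List, Sequence

# Chosen once and reused for training and for layouts to be analyzed.
SPLICE_ORDER = ("sentence", "coords", "chars")


def splice_information(
    sentence_vectors: Sequence[Sequence[float]],
    coords: Sequence[float],
    char_features: Sequence[float],
    order: Sequence[str] = SPLICE_ORDER,
) -> List[float]:
    """Splice the three kinds of information end to end in a fixed order."""
    parts = {
        "sentence": [v for vec in sentence_vectors for v in vec],
        "coords": list(coords),
        "chars": list(char_features),
    }
    if sorted(order) != sorted(parts):
        raise ValueError("order must name each part exactly once")
    sequence: List[float] = []
    for name in order:
        sequence.extend(parts[name])
    return sequence


# Made-up values: two 2-dimensional sentence vectors, one text box, two features.
seq = splice_information(
    [[0.1, 0.2], [0.3, 0.4]],
    [1, 2, 5, 2, 5, 4, 1, 4],
    [0.25, 0.0],
)
```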
A17, taking the sample splicing sequence as input, and taking the identification ID of the sample text content in the sample text layout as output to obtain a sample pair for training the seq2seq model;
Since both the input and the output of the seq2seq model are sequences, the identification IDs of the sample text content in the sample text layout need to be arranged as a sequence when used as output. In one embodiment of the invention, the identification IDs of the sample text content in the sample text layout are used as output in one-hot encoded form.
The length of the output sequence is thus the same as the number of sample text contents in the sample text layout. For example, if there are 10 sample text contents, where the identifier of the 2nd sample text content is 01, that of the 4th is 02, that of the 5th is 03, and the identifiers of the other sample text contents are 00, the output sequence may be [00,01,00,02,03,00,00,00,00,00].
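Building one sample pair from the example above can be sketched as follows. The source does not spell out the exact one-hot layout, so encoding each position as a one-hot vector over the label set {00, 01, 02, 03} is an assumption, as are the helper names and the placeholder input values.

```python
from typing import List, Sequence, Tuple

LABELS = ("00", "01", "02", "03")  # label set from the business license example


def one_hot(label: str, labels: Sequence[str] = LABELS) -> List[int]:
    """One-hot vector for a single identification ID (assumed encoding)."""
    vec = [0] * len(labels)
    vec[list(labels).index(label)] = 1
    return vec


def make_sample_pair(
    spliced_sequence: Sequence[float], ids: Sequence[str]
) -> Tuple[List[float], List[List[int]]]:
    """Pair the spliced input sequence with the encoded ID sequence."""
    return list(spliced_sequence), [one_hot(i) for i in ids]


# ID sequence from the 10-item example; the input values are placeholders.
ids = ["00", "01", "00", "02", "03", "00", "00", "00", "00", "00"]
inputs, targets = make_sample_pair([0.1, 0.2, 0.3], ids)
```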
A2, training the seq2seq model based on a plurality of samples.
This completes the training of the seq2seq model, which can then be applied to an actual analysis scenario.
For step 100, the text content in the text layout to be analyzed and the coordinates of the corresponding text boxes are acquired based on the recognition result of the OCR algorithm on the text layout to be analyzed.
Step 100 is processed in the same way as step A11, which is not repeated here.
For step 102, the text content in the text layout to be analyzed is converted into sentence vectors based on the text layout to be analyzed.
Preferably, based on the layout type of the text layout to be analyzed, the doc2vec model of the corresponding layout type is selected to convert the text content in the text layout to be analyzed into sentence vectors.
Specifically, step 102 is processed in the same way as step A13, which is not repeated here.
Step 104 (splicing the coordinates of the text boxes in the text layout to be analyzed to obtain spliced coordinate information), step 106 (generating character feature information from the characters other than text content in the text layout to be analyzed) and step 108 (splicing the sentence vectors, the coordinate information and the character feature information to obtain a spliced sequence) are processed in the same way as steps A14-A16, respectively, and are not described again.
Finally, for step 110, the splice sequence is input into a pre-trained seq2seq model to output the content identifier of each text content.
The output sequence of the seq2seq model is one-hot encoded, and the content identifier of each text content can be obtained from it.
For example, if the output sequence is [00,00,01,02,03,00,00,00], the third text content is the unified social credit code, the fourth text content is the company name, and the fifth text content is the registered capital.
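Reading the key fields out of an output sequence can be sketched as follows. The ID-to-field mapping follows the business license example; the function name and the placeholder text contents are assumptions and do not come from a real document.

```python
from typing import Dict, Sequence

ID_TO_FIELD = {
    "01": "unified social credit code",
    "02": "company name",
    "03": "registered capital",
}


def extract_key_fields(
    text_contents: Sequence[str], output_ids: Sequence[str]
) -> Dict[str, str]:
    """Map each key field to the text content at the position carrying its ID."""
    if len(text_contents) != len(output_ids):
        raise ValueError("one content identifier is expected per text content")
    fields: Dict[str, str] = {}
    for text, label in zip(text_contents, output_ids):
        name = ID_TO_FIELD.get(label)
        if name is not None:  # "00" and unknown IDs are treated as non-key content
            fields[name] = text
    return fields


# Placeholder text contents for eight text boxes.
texts = ["t1", "t2", "CODE-PLACEHOLDER", "NAME-PLACEHOLDER", "CAPITAL-PLACEHOLDER",
         "t6", "t7", "t8"]
output_ids = ["00", "00", "01", "02", "03", "00", "00", "00"]
fields = extract_key_fields(texts, output_ids)
```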
The embodiment of the invention greatly improves the accuracy of extracting the text content of interest, reduces the workload of subsequent manual proofreading, allows precious computing power to be devoted to the OCR algorithm, and greatly expands the coverage of AI under limited resources.
As shown in FIG. 2 and FIG. 3, an embodiment of the invention provides a text layout analysis device. The device embodiments may be implemented in software, in hardware, or in a combination of the two. In terms of hardware, FIG. 2 is a hardware architecture diagram of the electronic device in which the text layout analysis device is located; in addition to the processor, memory, network interface and nonvolatile memory shown in FIG. 2, the electronic device may include other hardware, such as a forwarding chip responsible for processing messages. Taking a software implementation as an example, as shown in FIG. 3, the device in the logical sense is formed when the CPU of the electronic device in which it is located reads the corresponding computer program from the nonvolatile memory into memory and runs it. The text layout analysis device provided in this embodiment includes:
an acquisition unit 301, configured to acquire, based on a recognition result of an OCR algorithm on a text layout to be analyzed, the text content in the text layout to be analyzed and the coordinates of the corresponding text boxes;
a conversion unit 302, configured to convert text content in the text layout to be analyzed into sentence vectors based on the text layout to be analyzed;
the coordinate splicing unit 303 is configured to perform coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
a generating unit 304, configured to generate character feature information from characters except text content in the text layout to be analyzed;
an information splicing unit 305, configured to splice the sentence vector, the coordinate information and the character feature information to obtain a spliced sequence;
a recognition unit 306, configured to input the spliced sequence into a pre-trained seq2seq model, so as to output the content identifier of each text content.
In one embodiment of the present invention, referring to fig. 4, the apparatus further includes:
a training unit 307 for training to obtain the seq2seq model by:
acquiring a plurality of sample text layouts, and executing for each sample text layout: based on the recognition result of the OCR algorithm on the sample text layout, acquiring sample text content and coordinates of a corresponding text box in the sample text layout; acquiring an identification ID of sample text content in the sample text layout based on a manual identification mode; converting the sample text content in the sample text layout into a sample sentence vector based on the sample text layout; coordinate splicing is carried out on the text boxes in the sample text layout, and spliced sample coordinate information is obtained; generating sample character characteristic information from characters except text contents in the sample text layout; performing information splicing on the sample sentence vector, the sample coordinate information and the sample character characteristic information to obtain a sample splicing sequence; taking the sample splicing sequence as input, and taking the identification ID of the sample text content in the sample text layout as output to obtain a sample pair for training the seq2seq model;
the seq2seq model is trained based on a plurality of samples.
In one embodiment of the present invention, the layout types of the plurality of sample text layouts are not identical.
In one embodiment of the invention, the identification ID of the sample text content in the sample text layout is used as output in one-hot encoded form.
In one embodiment of the present invention, the training unit is further configured to train to obtain a doc2vec model corresponding to the layout type by using a plurality of sample text layouts of the same layout type in advance; the doc2vec model is used for converting text contents in a text layout of the layout type into sentence vectors based on the corresponding text layout;
and when sentence vector conversion is carried out, the doc2vec model of the corresponding layout type is utilized.
In one embodiment of the present invention, the coordinate stitching unit is specifically configured to: taking coordinate values of four vertexes of each text box as eight-dimensional coordinate values of the corresponding text box; and performing head-to-tail splicing on the eight-dimensional coordinate values of each text box to obtain spliced coordinate information.
In one embodiment of the present invention, the characters except text content in the text layout to be analyzed include: at least one of numbers, symbols, english, and other characters;
the generating unit is specifically configured to: and generating character characteristic information of corresponding dimension based on the proportion of at least one character of numbers, symbols, english and other characters in the text layout to be analyzed.
It should be understood that the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the text layout analysis device. In other embodiments of the invention, the text layout analysis device may include more or fewer components than shown, may combine or split certain components, or may arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The information exchange between the modules in the device and their execution processes are based on the same concept as the method embodiments of the present invention; for details, refer to the description of the method embodiments, which is not repeated here.
The embodiment of the invention also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the text layout analysis method in any embodiment of the invention when executing the computer program.
The embodiment of the invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program when executed by a processor causes the processor to execute the text layout analysis method in any embodiment of the invention.
Specifically, a system or apparatus may be provided with a storage medium on which software program code realizing the functions of any of the above embodiments is stored, and a computer (or CPU or MPU) of the system or apparatus may be caused to read out and execute the program code stored in the storage medium.
In this case, the program code itself read from the storage medium may realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code form part of the present invention.
Examples of the storage medium for providing the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer over a communication network.
Further, it should be apparent that the functions of any of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform part or all of the actual operations based on the instructions of the program code.
Further, it is understood that the program code read out from the storage medium may be written into a memory provided in an expansion board inserted into the computer or into a memory provided in an expansion module connected to the computer, and a CPU or the like mounted on the expansion board or the expansion module may then be caused to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above embodiments.
It is noted that relational terms such as first and second, and the like, are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one …" does not exclude the presence of additional identical elements in a process, method, article or apparatus that comprises the element.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be implemented by hardware controlled by program instructions; the program may be stored in a computer readable storage medium and, when executed, performs the steps of the above method embodiments; and the aforementioned storage medium includes various media in which program code may be stored, such as ROM, RAM, magnetic disks or optical disks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A text layout analysis method, comprising:
acquiring text content in a text layout to be analyzed and coordinates of a corresponding text box based on a recognition result of an OCR algorithm on the text layout to be analyzed;
converting text content in the text layout to be analyzed into sentence vectors based on the text layout to be analyzed;
performing coordinate stitching on the text boxes in the text layout to be analyzed to obtain stitched coordinate information;
generating character characteristic information from characters other than text content in the text layout to be analyzed; the characters other than text content in the text layout to be analyzed comprise: at least one of numbers, symbols, and English;
performing information stitching on the sentence vectors, the coordinate information and the character characteristic information to obtain a stitching sequence;
inputting the stitching sequence into a pre-trained seq2seq model to output the content identification of each text content;
wherein the seq2seq model is trained by the following steps:
obtaining a plurality of sample text layouts, wherein the layout types of the plurality of sample text layouts are not all identical, and text layouts of different layout types have different formats; and performing, for each sample text layout:
acquiring sample text content in the sample text layout and coordinates of a corresponding text box, based on the recognition result of the OCR algorithm on the sample text layout;
acquiring, by manual labeling, an identification ID of the sample text content in the sample text layout; key text content is labeled with different identification IDs, and non-key text content is labeled with the same identification ID; the identification of the text content is unique;
converting the sample text content in the sample text layout into a sample sentence vector based on the sample text layout;
performing coordinate splicing on the text boxes in the sample text layout to obtain spliced sample coordinate information;
generating sample character characteristic information from characters other than text content in the sample text layout;
performing information splicing on the sample sentence vector, the sample coordinate information and the sample character characteristic information to obtain a sample splicing sequence;
taking the sample splicing sequence as input and the identification ID of the sample text content in the sample text layout as output, to obtain a sample pair for training the seq2seq model;
training the seq2seq model based on the plurality of sample pairs;
training in advance, using a plurality of sample text layouts of the same layout type, a doc2vec model corresponding to that layout type; the doc2vec model is used for converting text content in a text layout of that layout type into sentence vectors based on the corresponding text layout.
2. The method of claim 1, wherein the identification ID of the sample text content in the sample text layout is one-hot encoded and used as the output.
3. The method as recited in claim 1, further comprising: training in advance, using a plurality of sample text layouts of the same layout type, a doc2vec model corresponding to that layout type; the doc2vec model is used for converting text content in a text layout of that layout type into sentence vectors based on the corresponding text layout;
and when sentence vector conversion is performed, the doc2vec model of the corresponding layout type is used.
4. A method according to any one of claims 1-3, wherein the coordinate stitching comprises:
taking coordinate values of four vertexes of each text box as eight-dimensional coordinate values of the corresponding text box;
and performing head-to-tail splicing on the eight-dimensional coordinate values of each text box to obtain spliced coordinate information.
5. A method according to any one of claims 1-3, wherein said generating character characteristic information comprises: generating character characteristic information of the corresponding dimension based on the proportion of at least one of numbers, symbols and English in the text layout to be analyzed.
6. A text layout analysis device, comprising:
the acquisition unit is used for acquiring text content in a text layout to be analyzed and coordinates of a corresponding text box based on a recognition result of an OCR algorithm on the text layout to be analyzed;
the conversion unit is used for converting the text content in the text layout to be analyzed into sentence vectors based on the text layout to be analyzed;
the coordinate splicing unit is used for carrying out coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
the generation unit is used for generating character characteristic information from characters other than text content in the text layout to be analyzed; the characters other than text content in the text layout to be analyzed comprise: at least one of numbers, symbols, and English;
the information splicing unit is used for carrying out information splicing on the sentence vectors, the coordinate information and the character characteristic information to obtain a splicing sequence;
the identification unit is used for inputting the splicing sequence into a pre-trained seq2seq model so as to output the content identification of each text content;
the training unit is used for training the seq2seq model in the following manner:
obtaining a plurality of sample text layouts, wherein the layout types of the plurality of sample text layouts are not all identical, and text layouts of different layout types have different formats; and performing, for each sample text layout:
acquiring sample text content in the sample text layout and coordinates of a corresponding text box, based on the recognition result of the OCR algorithm on the sample text layout;
acquiring, by manual labeling, an identification ID of the sample text content in the sample text layout; key text content is labeled with different identification IDs, and non-key text content is labeled with the same identification ID; the identification of the text content is unique;
converting the sample text content in the sample text layout into a sample sentence vector based on the sample text layout;
performing coordinate splicing on the text boxes in the sample text layout to obtain spliced sample coordinate information;
generating sample character characteristic information from characters other than text content in the sample text layout;
performing information splicing on the sample sentence vector, the sample coordinate information and the sample character characteristic information to obtain a sample splicing sequence;
taking the sample splicing sequence as input and the identification ID of the sample text content in the sample text layout as output, to obtain a sample pair for training the seq2seq model;
training the seq2seq model based on the plurality of sample pairs;
training in advance, using a plurality of sample text layouts of the same layout type, a doc2vec model corresponding to that layout type; the doc2vec model is used for converting text content in a text layout of that layout type into sentence vectors based on the corresponding text layout.
7. An electronic device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the method of any of claims 1-5 when the computer program is executed.
8. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-5.
CN202211652196.3A 2022-12-21 2022-12-21 Text layout analysis method and device, electronic equipment and storage medium Active CN115841677B (en)

Publications (2)

Publication Number Publication Date
CN115841677A (en) 2023-03-24
CN115841677B (en) 2023-09-05

Family

ID=85579036

Country Status (1)

Country Link
CN (1) CN115841677B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102012005325A1 (en) * 2012-03-19 2013-09-19 Ernst Pechtl Machine image recognition method based on a Kl system
CN110414529A (en) * 2019-06-26 2019-11-05 深圳中兴网信科技有限公司 Paper information extracting method, system and computer readable storage medium
WO2020259060A1 (en) * 2019-06-26 2020-12-30 深圳中兴网信科技有限公司 Test paper information extraction method and system, and computer-readable storage medium
CN113378710A (en) * 2021-06-10 2021-09-10 平安科技(深圳)有限公司 Layout analysis method and device for image file, computer equipment and storage medium
CN113657279A (en) * 2021-08-18 2021-11-16 北京玖安天下科技有限公司 Bill image layout analysis method and device
CN114359913A (en) * 2022-01-04 2022-04-15 深圳思为科技有限公司 Text label determination method and related device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sequence to Sequence Learning with Neural Networks;Ilya Sutskever等;《arXiv:1409.3215v1》;第1-10页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant