CN115841677A - Text layout analysis method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN115841677A
Authority
CN
China
Prior art keywords
text
layout
sample
analyzed
splicing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211652196.3A
Other languages
Chinese (zh)
Other versions
CN115841677B (en)
Inventor
杨沛灵
闫印强
姚兴仁
姜海昆
范宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changyang Technology Beijing Co ltd
Original Assignee
Changyang Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changyang Technology Beijing Co ltd
Priority to CN202211652196.3A
Publication of CN115841677A
Application granted
Publication of CN115841677B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a text layout analysis method, a text layout analysis device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring text contents and coordinates of corresponding text boxes in the text layout to be analyzed based on an OCR algorithm recognition result of the text layout to be analyzed; converting the text content in the text layout to be analyzed into a sentence vector based on the text layout to be analyzed; performing coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information; generating character characteristic information for characters except for text contents in the text layout to be analyzed; performing information splicing on the sentence vectors, the coordinate information and the character characteristic information to obtain a spliced sequence; and inputting the splicing sequence into a pre-trained seq2seq model to output the content identification of each text content. According to the scheme, the accuracy of the analysis result in the text layout can be improved.

Description

Text layout analysis method and device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computer vision, in particular to a text layout analysis method and device, electronic equipment and a storage medium.
Background
In a text layout analysis scenario, content detection and recognition are generally performed on a text picture by using an OCR algorithm, and after detection by the OCR algorithm is completed, key information is extracted for the detected content. In the prior art, for the identified detection content, the required key information is acquired in a coordinate positioning manner. However, this method is less accurate when processing a text layout with a changed text format.
Disclosure of Invention
The embodiment of the invention provides a text layout analysis method and device, electronic equipment and a storage medium, which can improve the accuracy of an analysis result in a text layout.
In a first aspect, an embodiment of the present invention provides a text layout analysis method, including:
acquiring text contents and coordinates of corresponding text boxes in the text layout to be analyzed based on an OCR algorithm recognition result of the text layout to be analyzed;
converting the text content in the text layout to be analyzed into a sentence vector based on the text layout to be analyzed;
performing coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
generating character characteristic information for characters except for text contents in the text layout to be analyzed;
performing information splicing on the sentence vectors, the coordinate information and the character characteristic information to obtain a spliced sequence;
and inputting the splicing sequence into a pre-trained seq2seq model to output the content identification of each text content.
In a possible implementation manner, the training manner of the seq2seq model includes:
obtaining a plurality of sample text layouts, and executing the following steps aiming at each sample text layout:
acquiring sample text content and coordinates of a corresponding text box in the sample text layout based on the recognition result of the OCR algorithm on the sample text layout;
acquiring an identification ID of the sample text content in the sample text layout based on a manual identification mode;
converting the sample text content in the sample text layout into a sample sentence vector based on the sample text layout;
performing coordinate splicing on the text boxes in the sample text layout to obtain spliced sample coordinate information;
generating sample character characteristic information from characters except text content in the sample text layout;
performing information splicing on the sample sentence vector, the sample coordinate information and the sample character characteristic information to obtain a sample splicing sequence;
taking the sample splicing sequence as input, and taking the identifier ID of the sample text content in the sample text layout as output to obtain a sample pair for training a seq2seq model;
the seq2seq model is trained based on a plurality of sample pairs.
In one possible implementation, the layout types of the plurality of sample text layouts are not identical.
In a possible implementation manner, an onehot coding manner is adopted to output the identification ID of the sample text content in the sample text layout.
In one possible implementation manner, the method further includes: a plurality of sample text layouts with the same layout type are used in advance, and a doc2vec model corresponding to the layout type is obtained through training; the doc2vec model is used for converting text contents in the text layout of the layout type into sentence vectors based on the corresponding text layout;
and when sentence vector conversion is carried out, using a doc2vec model of a corresponding layout type to realize the conversion.
In one possible implementation, the coordinate stitching includes:
taking the coordinate values of the four vertexes of each text box as eight-dimensional coordinate values of the corresponding text box;
and performing head-to-tail splicing on the eight-dimensional coordinate values of the text boxes to obtain spliced coordinate information.
In a possible implementation manner, the characters in the text layout to be analyzed, other than the text content, include: at least one of numbers, symbols, English, and other characters;
the generating character feature information includes: generating character characteristic information of the corresponding dimensionality based on the proportion of at least one of the numbers, the symbols, the English and the other characters in the text layout to be analyzed.
In a second aspect, an embodiment of the present invention further provides a text layout analysis apparatus, including:
the acquisition unit is used for acquiring the text content in the text layout to be analyzed and the coordinates of the corresponding text box based on the recognition result of the text layout to be analyzed by an OCR algorithm;
the conversion unit is used for converting the text content in the text layout to be analyzed into a sentence vector based on the text layout to be analyzed;
the coordinate splicing unit is used for carrying out coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
the generating unit is used for generating character characteristic information from characters except for the text content in the text layout to be analyzed;
the information splicing unit is used for performing information splicing on the sentence vectors, the coordinate information and the character characteristic information to obtain a spliced sequence;
and the recognition unit is used for inputting the splicing sequence into a pre-trained seq2seq model so as to output the content identification of each text content.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to implement the method according to any embodiment of this specification.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed in a computer, the computer program causes the computer to execute the method described in any embodiment of the present specification.
The embodiment of the invention provides a text layout analysis method, a text layout analysis device, electronic equipment and a storage medium. The text content is converted into sentence vectors based on the text layout to be analyzed, coordinate splicing is performed on the text boxes, and character characteristic information is generated from the characters other than the text content in the text layout to be analyzed. The sentence vectors, the spliced coordinate information and the character characteristic information are then spliced, so that the resulting splicing sequence fully contains the content of the text layout to be analyzed. The splicing sequence is input into a pre-trained seq2seq model, which outputs the content identification of each text content. In this scheme, the seq2seq model can therefore fully learn the characteristics of the text layout during training, so that the recognition result is more accurate during recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a flowchart of a text layout analysis method according to an embodiment of the present invention;
Fig. 2 is a hardware architecture diagram of an electronic device according to an embodiment of the present invention;
Fig. 3 is a structural diagram of a text layout analysis apparatus according to an embodiment of the present invention;
Fig. 4 is a structural diagram of another text layout analysis apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention, and based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art without creative efforts belong to the scope of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a text layout analysis method, including:
Step 100, acquiring text contents and coordinates of corresponding text boxes in a text layout to be analyzed based on an OCR algorithm recognition result of the text layout to be analyzed;
Step 102, converting the text content in the text layout to be analyzed into a sentence vector based on the text layout to be analyzed;
Step 104, carrying out coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
Step 106, generating character characteristic information for characters except for text contents in the text layout to be analyzed;
Step 108, performing information splicing on the sentence vectors, the coordinate information and the character characteristic information to obtain a spliced sequence;
Step 110, inputting the splicing sequence into a seq2seq model trained in advance so as to output the content identification of each text content.
In the embodiment of the invention, the obtained text content is converted into a sentence vector based on the text layout to be analyzed, the text boxes are subjected to coordinate splicing, characters except for the text content in the text layout to be analyzed are generated into character characteristic information, then the sentence vector, the spliced coordinate information and the character characteristic information are subjected to information splicing, the obtained splicing sequence fully contains the content of the text layout to be analyzed, and the splicing sequence is input into a pre-trained seq2seq model, so that the seq2seq model outputs the content identification of each text content. Therefore, in the scheme, the seq2seq model can fully learn the characteristics of the text layout in the training process, so that the recognition result can be more accurate during recognition.
The manner in which the various steps shown in fig. 1 are performed is described below.
First, a training process of the seq2seq model will be explained.
In the embodiment of the invention, considering that different text layouts are different in types and different text layouts with the same layout type have different contents, if a neural network model is required to identify the contents of the text layouts, a seq2seq model can be considered. The seq2seq model may include an encoder for analyzing an input sequence and a decoder for generating an output sequence, where the input may be an indefinite length sequence and the output may be an indefinite length sequence.
In order to enable the seq2seq model to sufficiently learn the content of the text layout, in an embodiment of the present invention, the content of the sample text layout needs to be sufficiently extracted to generate an input sequence to train the seq2seq model. Specifically, the training process of the seq2seq model includes:
A1, obtaining a plurality of sample text layouts, and executing the following steps A11-A17 for each sample text layout;
The sample text layout may be a type of form, such as a "business license", a "tax declaration form", or a "job declaration form". The business license, the tax declaration form and the job declaration form belong to different types of text layouts. In this embodiment, the sample text layouts may correspond to the same layout type or to different layout types; for example, a plurality of business licenses may be taken as the sample text layouts.
Preferably, the layout types of the plurality of sample text layouts are not identical. Therefore, the trained seq2seq model can identify the text layouts of different layout types, the application range of the text layouts to be analyzed can be enlarged, and good identification effect can be guaranteed even if the text formats are changed.
A11, acquiring sample text contents and coordinates of a corresponding text box in the sample text layout based on an OCR algorithm recognition result;
In the embodiment of the invention, the sample text content and the coordinates of the corresponding text box can be extracted from the recognition result of the OCR algorithm. The coordinates of one text box are the coordinate values of the four vertexes of the text box.
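As a minimal sketch of this extraction step, the snippet below turns an OCR result into text boxes that each carry their text and four vertexes. The input format (a list of dictionaries with `text` and `box` keys) and the names `TextBox` and `parse_ocr_result` are assumptions made for illustration; the source does not specify the OCR output format.

```python
from typing import Dict, List, NamedTuple, Tuple

Point = Tuple[float, float]


class TextBox(NamedTuple):
    """One recognized text box: its text and its four vertexes."""
    text: str
    vertices: Tuple[Point, ...]


def parse_ocr_result(raw: List[Dict]) -> List[TextBox]:
    """Extract text content and four-vertex coordinates from an OCR result.

    The `raw` format is hypothetical: each item is expected to look like
    {"text": "...", "box": [[x1, y1], [x2, y2], [x3, y3], [x4, y4]]}.
    """
    boxes = []
    for item in raw:
        vertices = tuple((float(x), float(y)) for x, y in item["box"])
        if len(vertices) != 4:
            raise ValueError("each text box must have exactly four vertexes")
        boxes.append(TextBox(item["text"], vertices))
    return boxes
```

A real OCR engine will return its own structure, so only the adapter function would need to change.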
A12, acquiring an identification ID of sample text content in the sample text layout based on a manual identification mode;
In the embodiment of the present invention, only the key text content may be identified, or all of the text content may be identified. In one implementation, each key text content is given its own identification ID, and all non-key text content is given the same identification ID.
For example, for a business license, the key text content may be the unified social credit code, the company name, and the registered capital. The sample text content corresponding to the unified social credit code in the sample text layout may then be identified as 01, the sample text content corresponding to the company name as 02, the sample text content corresponding to the registered capital as 03, and all other sample text content as 00.
In the embodiment of the invention, in order to ensure that the seq2seq model can identify text layouts of different layout types, the identifiers of the text contents are unique.
A13, converting the sample text content in the sample text layout into a sample sentence vector based on the sample text layout;
In the embodiment of the invention, because the text contents and formats of text layouts with different layout types are different, and the text contents of text layouts with the same layout type are not completely the same, the text contents of each text layout need to be converted into sentence vectors based on the corresponding text layout.
In particular, the doc2vec model of gensim may be used to convert textual content into sentence vectors.
In order to ensure that text layouts with more layout types can be adapted when sentence vectors are converted, in one embodiment of the invention, a plurality of sample text layouts with the same layout type can be used in advance to train and obtain a doc2vec model corresponding to the layout type; the doc2vec model is used for converting text contents in the text layout of the layout type into sentence vectors based on the corresponding text layout; further, when sentence vector conversion is needed, based on the layout type of the text layout to be converted, the conversion of sentence vectors is realized by using the doc2vec model of the corresponding layout type.
Training a separate doc2vec model for each layout type makes the sentence vector conversion better suited to text layouts of that type, so the converted sentence vectors are more accurate, which in turn improves the accuracy of the subsequent training and recognition of the seq2seq model.
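The per-layout-type model selection can be sketched as a small registry keyed by layout type. This is an illustration under assumptions: `EncoderRegistry` and its methods are hypothetical names, and the encoder is modeled as any callable that maps a token list to a vector. In practice such a callable could wrap a trained doc2vec model (for example, the `infer_vector` method of a gensim `Doc2Vec` instance); the lambda used below is only a placeholder, not a sentence-vector model.

```python
from typing import Callable, Dict, List, Sequence

# An encoder maps the tokens of one text content to a sentence vector.
Encoder = Callable[[Sequence[str]], List[float]]


class EncoderRegistry:
    """Holds one sentence-vector encoder per layout type (hypothetical helper)."""

    def __init__(self) -> None:
        self._encoders: Dict[str, Encoder] = {}

    def register(self, layout_type: str, encoder: Encoder) -> None:
        """Associate a trained encoder with a layout type."""
        self._encoders[layout_type] = encoder

    def encode(self, layout_type: str, tokens: Sequence[str]) -> List[float]:
        """Convert tokens with the encoder trained for this layout type."""
        if layout_type not in self._encoders:
            raise KeyError(f"no encoder registered for layout type {layout_type!r}")
        return list(self._encoders[layout_type](tokens))


# Placeholder encoder for demonstration only; a real one would be a trained model.
registry = EncoderRegistry()
registry.register("business license", lambda tokens: [float(len(tokens))])
vector = registry.encode("business license", ["registered", "capital"])
```

Raising an error for an unregistered layout type is a design choice of this sketch; the source does not say how an unknown layout type should be handled.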
A14, carrying out coordinate splicing on the text boxes in the sample text layout to obtain spliced sample coordinate information;
Because each text box comprises the coordinate values of four vertexes, and each vertex is represented by a two-dimensional coordinate, the coordinate values of the four vertexes of each text box can be used as the eight-dimensional coordinate value of that text box. The eight-dimensional coordinate values of the text boxes are then spliced end to end to obtain the spliced coordinate information.
For example, if the coordinate values of the first text box are (x11, y11), (x12, y12), (x13, y13), and (x14, y14), and the coordinate values of the second text box are (x21, y21), (x22, y22), (x23, y23), and (x24, y24), then the end-to-end splicing may be: (x11, y11), (x12, y12), (x13, y13), (x14, y14), (x21, y21), (x22, y22), (x23, y23), and (x24, y24).
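The end-to-end coordinate splicing described above can be sketched as follows. The function names `box_to_vector` and `splice_coordinates` are hypothetical, and representing the spliced coordinate information as one flat list of numbers is an assumption of this sketch.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def box_to_vector(vertices: Sequence[Point]) -> List[float]:
    """Flatten the four (x, y) vertexes of one text box into eight values."""
    if len(vertices) != 4:
        raise ValueError("a text box must have exactly four vertexes")
    return [coord for x, y in vertices for coord in (x, y)]


def splice_coordinates(boxes: Sequence[Sequence[Point]]) -> List[float]:
    """Splice the eight-dimensional values of all text boxes end to end."""
    spliced: List[float] = []
    for vertices in boxes:
        spliced.extend(box_to_vector(vertices))
    return spliced
```

For two text boxes this yields sixteen values, with the first box's eight values followed by the second box's eight values.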
A15, generating sample character characteristic information from characters except text content in the sample text layout;
In the embodiment of the invention, some text layouts also comprise other characters besides the text content, and these characters also express characteristics of the text layout. Therefore, in order to fully express the characteristics of the text layout, character characteristic information needs to be generated from the characters other than the text content.
Specifically, in step A15, the characters in the sample text layout other than the sample text content may include: at least one of numbers, symbols, English, and other characters; then, based on the proportion of at least one of the numbers, the symbols, the English and the other characters in the text layout to be analyzed, character feature information of the corresponding dimension can be generated.
For example, the characters except for the text content are numbers, symbols, English, and other characters, and the characteristics of the corresponding dimension are determined according to the proportion of the characters in the whole text layout (text content + numbers + symbols + English + other characters), for example, the proportion of the numbers, symbols, English, and other characters is: 0.2, 0.1, and 0.01, then the four feature values can be used as character feature information in four dimensions.
It should be noted that, once determined, the types of characters other than the text content remain unchanged throughout the training process and the subsequent processing of the text layout to be analyzed. For example, numbers, symbols, English, and other characters are uniformly used as the characters other than the text content.
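A minimal sketch of the proportion-based character features is given below. Several points are assumptions, since the source does not define the character classes precisely: "text content" is taken to be CJK ideographs, "English" to be ASCII letters, "symbols" to be characters in the Unicode punctuation or symbol categories, and whitespace is ignored. The names `classify_char`, `character_features`, and `FEATURE_ORDER` are hypothetical.

```python
import unicodedata
from typing import List

# Fixed feature order, kept identical for training and for later analysis.
FEATURE_ORDER = ("number", "symbol", "english", "other")


def classify_char(ch: str) -> str:
    """Assign one character to a class (class boundaries are assumptions)."""
    if ch.isdigit():
        return "number"
    if ch.isascii() and ch.isalpha():
        return "english"
    if "\u4e00" <= ch <= "\u9fff":
        return "text"  # assumed to be ordinary text content
    if unicodedata.category(ch)[0] in ("P", "S"):
        return "symbol"
    return "other"


def character_features(layout_text: str) -> List[float]:
    """Return the proportion of each non-text character class in the layout."""
    counts = {name: 0 for name in FEATURE_ORDER}
    total = 0
    for ch in layout_text:
        if ch.isspace():
            continue  # whitespace is ignored in this sketch
        total += 1
        kind = classify_char(ch)
        if kind in counts:
            counts[kind] += 1
    if total == 0:
        return [0.0] * len(FEATURE_ORDER)
    return [counts[name] / total for name in FEATURE_ORDER]
```

The denominator counts every non-whitespace character, including the text content, which matches the "whole text layout" denominator described above.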
A16, carrying out information splicing on the sample sentence vector, the sample coordinate information and the sample character characteristic information to obtain a sample splicing sequence;
In the embodiment of the invention, steps A13 to A15 extract information from different angles of the text layout. Splicing the information extracted from each angle allows the resulting splicing sequence to express the content of the text layout accurately and fully.
The information splicing may also be performed head to tail, and the parts may be spliced in any order. Once the splicing order is determined, the same order is used throughout the training process and the subsequent processing of the text layout to be analyzed.
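The head-to-tail information splicing with a fixed order can be sketched as follows. The part names, the default order in `SPLICE_ORDER`, and the flattening of the per-text sentence vectors into one list are assumptions of this sketch; the source only requires that one order be chosen and then reused.

```python
from typing import List, Sequence

# The order is arbitrary but must stay the same for training and analysis.
SPLICE_ORDER = ("sentence", "coordinates", "characters")


def splice_information(
    sentence_vectors: Sequence[Sequence[float]],
    coordinate_info: Sequence[float],
    character_features: Sequence[float],
    order: Sequence[str] = SPLICE_ORDER,
) -> List[float]:
    """Concatenate the three kinds of information head to tail."""
    parts = {
        "sentence": [value for vector in sentence_vectors for value in vector],
        "coordinates": list(coordinate_info),
        "characters": list(character_features),
    }
    if sorted(order) != sorted(parts):
        raise ValueError("order must name each part exactly once")
    sequence: List[float] = []
    for name in order:
        sequence.extend(parts[name])
    return sequence
```

Keeping the order in a single constant makes it harder for training and inference code to drift apart.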
A17, taking the sample splicing sequence as input, and taking the identifier ID of the sample text content in the sample text layout as output to obtain a sample pair for training a seq2seq model;
Since both the input and the output of the seq2seq model are sequences, the identifier IDs of the sample text content in the sample text layout also need to be processed into a sequence when they are used as output. In an embodiment of the present invention, an onehot encoding mode may be adopted to output the identifier IDs of the sample text content in the sample text layout.
In this way, the dimension of the output sequence is the same as the number of sample text contents in the sample text layout. For example, if the number of sample text contents is 10, the 2nd sample text content is identified as 01, the 4th as 02, the 5th as 03, and the other sample text contents as 00, the output sequence may be [00, 01, 00, 02, 03, 00, 00, 00, 00, 00].
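Building the label sequence for one sample layout can be sketched as below. The function name `build_output_sequence`, the use of 1-based positions, and the string form of the IDs are assumptions made for illustration; the sketch produces the plain ID sequence and does not perform any further encoding of it.

```python
from typing import Dict, List


def build_output_sequence(
    num_contents: int, key_labels: Dict[int, str], default: str = "00"
) -> List[str]:
    """Return one ID per text content, in reading order.

    key_labels maps the 1-based position of a key text content to its ID;
    every other position receives the shared non-key ID.
    """
    sequence = [default] * num_contents
    for position, label in key_labels.items():
        if not 1 <= position <= num_contents:
            raise ValueError(f"position {position} is outside 1..{num_contents}")
        sequence[position - 1] = label
    return sequence


# Ten text contents; the 2nd, 4th and 5th are key contents.
labels = build_output_sequence(10, {2: "01", 4: "02", 5: "03"})
```

The resulting list has one entry per text content, so its length equals the number of sample text contents.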
A2, training the seq2seq model based on the plurality of sample pairs.
The training of the seq2seq model is thus completed, and the seq2seq model can then be applied to an actual analysis scenario.
For step 100, the text content in the text layout to be analyzed and the coordinates of the corresponding text boxes are obtained based on the recognition result of the OCR algorithm on the text layout to be analyzed.
The processing manner of step 100 is the same as that of step A11, and is not described herein again.
For step 102, the text content in the text layout to be analyzed is converted into a sentence vector based on the text layout to be analyzed.
Preferably, based on the layout type of the text layout to be analyzed, a doc2vec model of a corresponding layout type is selected and used to convert the text content in the text layout to be analyzed into a sentence vector based on the text layout to be analyzed.
Specifically, the processing manner of step 102 is the same as that of step A13, and is not described herein again.
Step 104 ("performing coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information"), step 106 ("generating character feature information for characters except for text contents in the text layout to be analyzed"), and step 108 ("performing information splicing on the sentence vectors, the coordinate information, and the character feature information to obtain a splicing sequence") are processed in the same manner as steps A14 to A16, respectively, and are not described herein again.
Finally, for step 110, the splicing sequence is input into the seq2seq model trained in advance, so as to output the content identification of each text content.
The seq2seq model outputs an output sequence coded by onehot, and the content identifier of each text content can be obtained according to the output sequence.
For example, if the output sequence is [00, 01, 02, 03, 00], it can be known that the second text content is the unified social credit code, the third text content is the company name, and the fourth text content is the registered capital.
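Reading the key fields back out of the model output can be sketched as follows, using the business license IDs from the labeling example (01 for the unified social credit code, 02 for the company name, 03 for the registered capital). The function name `extract_key_fields`, the `FIELD_NAMES` table, and the sample texts are hypothetical.

```python
from typing import Dict, Sequence

# ID-to-field table taken from the business license labeling example.
FIELD_NAMES = {
    "01": "unified social credit code",
    "02": "company name",
    "03": "registered capital",
}


def extract_key_fields(texts: Sequence[str], output_ids: Sequence[str]) -> Dict[str, str]:
    """Pair each recognized text with its predicted ID and keep the key fields."""
    if len(texts) != len(output_ids):
        raise ValueError("texts and output_ids must have the same length")
    fields: Dict[str, str] = {}
    for text, label in zip(texts, output_ids):
        if label in FIELD_NAMES:
            fields[FIELD_NAMES[label]] = text
    return fields


# Made-up texts for illustration only.
texts = ["Business License", "CODE-PLACEHOLDER", "Example Co", "1,000,000", "Seal"]
fields = extract_key_fields(texts, ["00", "01", "02", "03", "00"])
```

Text contents whose ID is 00 are dropped, since that ID marks non-key content.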
According to the embodiment of the invention, the accuracy of extracting the text content of interest from the text is greatly improved, and a large amount of subsequent manual proofreading work can be avoided. Precious computing power can thus be applied to the OCR recognition algorithm, which greatly expands AI coverage when resources are limited.
As shown in Fig. 2 and Fig. 3, an embodiment of the present invention provides a text layout analysis apparatus. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. In terms of hardware, Fig. 2 is a hardware architecture diagram of the electronic device in which the text layout analysis apparatus according to an embodiment of the present invention is located. In addition to the processor, the memory, the network interface, and the nonvolatile memory shown in Fig. 2, the electronic device in which the apparatus is located may also include other hardware, such as a forwarding chip responsible for processing messages. Taking a software implementation as an example, as shown in Fig. 3, the apparatus in a logical sense is formed by the CPU of the electronic device in which it is located reading the corresponding computer program from the nonvolatile memory into the memory and running it. The present embodiment provides a text layout analysis apparatus, including:
an obtaining unit 301, configured to obtain, based on a recognition result of a text layout to be analyzed by an OCR algorithm, text content in the text layout to be analyzed and coordinates of a corresponding text box;
a converting unit 302, configured to convert text content in the text layout to be analyzed into a sentence vector based on the text layout to be analyzed;
a coordinate splicing unit 303, configured to perform coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
a generating unit 304, configured to generate character feature information for characters in the text layout to be analyzed, except for text content;
an information splicing unit 305, configured to perform information splicing on the sentence vector, the coordinate information, and the character feature information to obtain a spliced sequence;
an identifying unit 306, configured to input the splicing sequence into a seq2seq model trained in advance, so as to output a content identifier of each text content.
In an embodiment of the present invention, referring to fig. 4, the apparatus further includes:
a training unit 307, configured to train to obtain a seq2seq model by using the following method:
obtaining a plurality of sample text layouts, and executing the following steps aiming at each sample text layout: acquiring sample text content and coordinates of a corresponding text box in the sample text layout based on the recognition result of the OCR algorithm on the sample text layout; acquiring an identification ID of the sample text content in the sample text layout based on a manual identification mode; converting the sample text content in the sample text layout into a sample sentence vector based on the sample text layout; performing coordinate splicing on the text boxes in the sample text layout to obtain spliced sample coordinate information; generating sample character characteristic information from characters except text content in the sample text layout; performing information splicing on the sample sentence vector, the sample coordinate information and the sample character characteristic information to obtain a sample splicing sequence; taking the sample splicing sequence as input, and taking the identifier ID of the sample text content in the sample text layout as output to obtain a sample pair for training a seq2seq model;
the seq2seq model is trained based on a plurality of sample pairs.
In one embodiment of the present invention, the layout types of the plurality of sample text layouts are not identical.
In an embodiment of the present invention, an onehot coding manner is adopted to output the identifier ID of the sample text content in the sample text layout.
In an embodiment of the present invention, the training unit is further configured to train to obtain a doc2vec model corresponding to the layout type by using a plurality of sample text layouts of the same layout type in advance; the doc2vec model is used for converting text contents in the text layout of the layout type into sentence vectors based on the corresponding text layout;
and when sentence vector conversion is carried out, using a doc2vec model of a corresponding layout type to realize the conversion.
In an embodiment of the present invention, the coordinate splicing unit is specifically configured to: taking the coordinate values of the four vertexes of each text box as eight-dimensional coordinate values of the corresponding text box; and performing head-to-tail splicing on the eight-dimensional coordinate values of the text boxes to obtain spliced coordinate information.
In an embodiment of the present invention, the characters in the text layout to be analyzed, other than the text content, include: at least one of numbers, symbols, English, and other characters;
the generating unit is specifically configured to: generate character characteristic information of the corresponding dimensionality based on the proportion of at least one of the numbers, the symbols, the English and the other characters in the text layout to be analyzed.
It should be understood that the structure illustrated in the embodiment of the present invention does not constitute a specific limitation on the text layout analysis apparatus. In other embodiments of the present invention, the text layout analysis apparatus may include more or fewer components than shown, may combine some components, may split some components, or may arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Because the content of information interaction, execution process, and the like among the modules in the device is based on the same concept as the method embodiment of the present invention, specific content can be referred to the description in the method embodiment of the present invention, and is not described herein again.
An embodiment of the present invention further provides an electronic device comprising a memory and a processor, wherein the memory stores a computer program, and when the processor executes the computer program, the text layout analysis method in any embodiment of the present invention is implemented.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program causes the processor to execute a text layout analysis method in any embodiment of the present invention.
Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion module connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion module to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A text layout analysis method is characterized by comprising the following steps:
acquiring text contents and coordinates of corresponding text boxes in the text layout to be analyzed based on an OCR algorithm recognition result of the text layout to be analyzed;
converting the text content in the text layout to be analyzed into a sentence vector based on the text layout to be analyzed;
performing coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
generating character characteristic information for characters except for text contents in the text layout to be analyzed;
performing information splicing on the sentence vectors, the coordinate information and the character characteristic information to obtain a spliced sequence;
and inputting the splicing sequence into a pre-trained seq2seq model to output the content identification of each text content.
2. The method of claim 1, wherein the seq2seq model is trained by:
obtaining a plurality of sample text layouts, and executing the following steps aiming at each sample text layout:
acquiring sample text content and coordinates of a corresponding text box in the sample text layout based on the recognition result of the OCR algorithm on the sample text layout;
acquiring an identifier ID of the sample text content in the sample text layout through manual labeling;
converting the sample text content in the sample text layout into a sample sentence vector based on the sample text layout;
performing coordinate splicing on the text boxes in the sample text layout to obtain spliced sample coordinate information;
generating sample character characteristic information from characters except text content in the sample text layout;
performing information splicing on the sample sentence vector, the sample coordinate information and the sample character characteristic information to obtain a sample splicing sequence;
taking the sample splicing sequence as input, and taking the identifier ID of the sample text content in the sample text layout as output to obtain a sample pair for training a seq2seq model;
the seq2seq model is trained based on a plurality of sample pairs.
3. The method of claim 2, wherein the layout types of the plurality of sample text layouts are not identical.
4. The method of claim 2, wherein the identifier ID of the sample text content in the sample text layout is output using one-hot encoding.
5. The method of claim 2, further comprising: a plurality of sample text layouts with the same layout type are used in advance, and a doc2vec model corresponding to the layout type is obtained through training; the doc2vec model is used for converting text contents in the text layout of the layout type into sentence vectors based on the corresponding text layout;
and when sentence vector conversion is carried out, using a doc2vec model of a corresponding layout type to realize the conversion.
6. The method according to any one of claims 1-5, wherein the coordinate splicing comprises:
taking the coordinate values of the four vertexes of each text box as eight-dimensional coordinate values of the corresponding text box;
and performing head-to-tail splicing on the eight-dimensional coordinate values of the text boxes to obtain spliced coordinate information.
7. The method according to any one of claims 1-5, wherein the characters other than the text content in the text layout to be analyzed comprise at least one of: numbers, symbols, English letters, and other characters;
the generating of character characteristic information comprises: generating character characteristic information of the corresponding dimensionality based on the proportion, in the text layout to be analyzed, of at least one of the numbers, symbols, English letters, and other characters.
8. A text layout analysis apparatus, comprising:
the acquisition unit is used for acquiring text contents and coordinates of corresponding text boxes in the text layout to be analyzed based on the recognition result of the text layout to be analyzed by an OCR algorithm;
the conversion unit is used for converting the text content in the text layout to be analyzed into a sentence vector based on the text layout to be analyzed;
the coordinate splicing unit is used for carrying out coordinate splicing on the text boxes in the text layout to be analyzed to obtain spliced coordinate information;
the generating unit is used for generating character characteristic information from characters except for the text content in the text layout to be analyzed;
the information splicing unit is used for performing information splicing on the sentence vectors, the coordinate information and the character characteristic information to obtain a spliced sequence;
and the recognition unit is used for inputting the splicing sequence into a pre-trained seq2seq model so as to output the content identification of each text content.
9. An electronic device comprising a memory in which a computer program is stored and a processor which, when executing the computer program, carries out the method according to any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-7.
CN202211652196.3A 2022-12-21 2022-12-21 Text layout analysis method and device, electronic equipment and storage medium Active CN115841677B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211652196.3A CN115841677B (en) 2022-12-21 2022-12-21 Text layout analysis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211652196.3A CN115841677B (en) 2022-12-21 2022-12-21 Text layout analysis method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115841677A true CN115841677A (en) 2023-03-24
CN115841677B CN115841677B (en) 2023-09-05

Family

ID=85579036

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211652196.3A Active CN115841677B (en) 2022-12-21 2022-12-21 Text layout analysis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115841677B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102012005325A1 (en) * 2012-03-19 2013-09-19 Ernst Pechtl Machine image recognition method based on a Kl system
CN110414529A (en) * 2019-06-26 2019-11-05 深圳中兴网信科技有限公司 Paper information extracting method, system and computer readable storage medium
CN113378710A (en) * 2021-06-10 2021-09-10 平安科技(深圳)有限公司 Layout analysis method and device for image file, computer equipment and storage medium
CN113657279A (en) * 2021-08-18 2021-11-16 北京玖安天下科技有限公司 Bill image layout analysis method and device
CN114359913A (en) * 2022-01-04 2022-04-15 深圳思为科技有限公司 Text label determination method and related device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102012005325A1 (en) * 2012-03-19 2013-09-19 Ernst Pechtl Machine image recognition method based on a Kl system
CN110414529A (en) * 2019-06-26 2019-11-05 深圳中兴网信科技有限公司 Paper information extracting method, system and computer readable storage medium
WO2020259060A1 (en) * 2019-06-26 2020-12-30 深圳中兴网信科技有限公司 Test paper information extraction method and system, and computer-readable storage medium
CN113378710A (en) * 2021-06-10 2021-09-10 平安科技(深圳)有限公司 Layout analysis method and device for image file, computer equipment and storage medium
CN113657279A (en) * 2021-08-18 2021-11-16 北京玖安天下科技有限公司 Bill image layout analysis method and device
CN114359913A (en) * 2022-01-04 2022-04-15 深圳思为科技有限公司 Text label determination method and related device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
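ILYA SUTSKEVER et al.: "Sequence to Sequence Learning with Neural Networks", arXiv:1409.3215v1, pages 1 - 10 *
LIN Weibin (蔺伟斌); YANG Shihan (杨世瀚): "Short-text semantic simplification based on a time-recursive sequence model" (基于时间递归序列模型的短文本语义简化), Internet of Things Technologies (物联网技术), no. 05, pages 63 - 68 *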

Also Published As

Publication number Publication date
CN115841677B (en) 2023-09-05

Similar Documents

Publication Publication Date Title
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN112509561A (en) Emotion recognition method, device, equipment and computer readable storage medium
CN115344699A (en) Training method and device of text classification model, computer equipment and medium
CN111414732A (en) Text style conversion method and device, electronic equipment and storage medium
CN113609865A (en) Text emotion recognition method and device, electronic equipment and readable storage medium
CN111368066A (en) Method, device and computer readable storage medium for acquiring dialogue abstract
CN115841677B (en) Text layout analysis method and device, electronic equipment and storage medium
CN116052195A (en) Document parsing method, device, terminal equipment and computer readable storage medium
CN115859121A (en) Text processing model training method and device
CN115565178A (en) Font identification method and apparatus
US11182635B2 (en) Terminal apparatus, character recognition system, and character recognition method
CN110378457B (en) Code label generation method and device
CN110083807B (en) Contract modification influence automatic prediction method, device, medium and electronic equipment
CN113569929A (en) Internet service providing method and device based on small sample expansion and electronic equipment
CN114722823B (en) Method and device for constructing aviation knowledge graph and computer readable medium
CN111475403A (en) Dynamic generation method of test script and related device
CN112906559B (en) Machine-implemented method for correcting formulas and related product
JP2020198023A (en) Information processing apparatus, method, and program
CN117435739B (en) Image text classification method and device
CN112784780B (en) Review method, review device, computer equipment and storage medium
CN116308635B (en) Plasticizing industry quotation structuring method, device, equipment and storage medium
CN111402012B (en) E-commerce defective product identification method based on transfer learning
CN116071768A (en) Table identification method, apparatus, electronic device and storage medium
CN115017313A (en) Intention recognition and model training method, electronic device and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant