CN112347742A - Method for generating document image set based on deep learning - Google Patents
Method for generating document image set based on deep learning
- Publication number
- CN112347742A (application number CN202011178681.2A)
- Authority
- CN
- China
- Prior art keywords
- document
- network
- sequence
- document image
- generating
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F40/166—Handling natural language data; Text processing; Editing, e.g. inserting or deleting
- G06F40/151—Handling natural language data; Text processing; Use of codes for handling textual entities; Transformation
- G06F40/189—Handling natural language data; Text processing; Automatic justification
- G06N3/045—Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/08—Computing arrangements based on biological models; Neural networks; Learning methods
- G06T11/60—2D [Two Dimensional] image generation; Editing figures and text; Combining figures or text
Abstract
The invention discloses a method for generating a document image set based on deep learning, comprising the following steps: first, projecting a page object type sequence from a one-dimensional vector space into a two-dimensional vector space; then constructing a deep convolutional generative adversarial network model; training the network parameters and generating an object type sequence with the trained network model; generating document object content according to the object type sequence produced by the network; and finally converting the document into a document image, thereby generating a document image set. Document images are generated automatically by the deep convolutional generative adversarial network: the discrimination network of the adversarial network learns from existing document images, and the generation network of the adversarial network then automatically generates new document images, producing a document image set. Because the network parameters are trained on existing document images, the generated document images are close to real publications. Compared with manual labeling, the document image set and its labeling information are generated automatically, which saves time and labor cost and avoids the invalid labels that manual labeling can introduce.
Description
Technical Field
The invention relates to an image generation method, belonging to the field of automatic generation of image data sets, and in particular to a method for generating a document image set based on deep learning.
Background
In many fields of document image processing, such as segmentation, classification and retrieval, a labeled document image set is an indispensable data basis for machine learning. With the advent of the big-data era, "end-to-end" deep learning has become an important research method in artificial intelligence, and deep learning requires far more training data than traditional machine learning.
Currently, researchers use automatic image set generation methods to obtain image sets, including document images and annotation information, more efficiently. In a paper at the 2017 International Conference on Document Analysis and Recognition (ICDAR) (D. He, S. Cohen, B. Price, D. Kifer and C. L. Giles, "Multi-Scale Multi-Task FCN for Semantic Page Segmentation and Table Detection"), paragraphs, figures, tables, titles, section headings, lists and other elements are arranged randomly to generate a document image dataset for deep learning training. Similarly, the invention patent with application publication number CN108898188A discloses an image data set auxiliary labeling system and method, which uses the idea of neural network training to perform preliminary feature-extraction training on the images required by the neural network, identifies and labels the images to obtain the label document format required by the neural network, and obtains a large number of label documents of a given type from the image information.
On the other hand, many image sets are still produced by manual labeling. For example, the image annotation tool VIA (Abhishek Dutta and Andrew Zisserman. 2019. The VIA Annotation Software for Images, Audio and Video. In Proceedings of the 27th ACM International Conference on Multimedia (MM '19), October 21-25, 2019, Nice, France. ACM, New York, NY, USA), designed by the Visual Geometry Group (VGG) of Oxford University, allows image regions to be annotated manually with different shapes (rectangles, circles, ellipses, polygons, etc.).
Manual labeling is highly flexible: the labeling strategy can be changed during the labeling process, and the results usually match expectations well. Its obvious disadvantages are that the process is time-consuming and labor-intensive, and the labeling quality depends directly on the proficiency of the labeler. Automatic generation of document image data sets overcomes these defects, but it has its own problems: the publishing industry has its own specifications, and the layout designs of different publications follow particular rules through which document content is presented effectively. If randomly generated document images do not conform to the typesetting rules of real publications, a model trained on them will not achieve its best performance when applied to document images of real publications.
Disclosure of Invention
Aiming at the defects of the existing methods for obtaining document image sets, the invention provides a method for generating a document image set based on deep learning. It adopts a deep convolutional generative adversarial network to generate image documents automatically: the discrimination network of the adversarial network learns from existing document images, and the generation network of the adversarial network then generates new document images automatically, thereby obtaining a document image set.
The invention is realized by adopting the following technical scheme: the method for generating the document image set based on the deep learning comprises the following steps:
step A, vector space projection modeling: the objects in a document image page are regarded as a sequence, in which each node corresponds to the type of one object, giving a document object sequence and an object type sequence in one-to-one correspondence with it; the object type sequence is rearranged into a corresponding two-dimensional matrix, so that the object type sequence is projected from a one-dimensional vector space into a two-dimensional vector space;
step B, deep convolutional generative adversarial network modeling: the adversarial network comprises a discrimination network and a generation network; the discrimination network is trained with existing document images and is used to train the generation network; once trained, the generation network is used to generate two-dimensional matrices, with the aim of automatically generating a document image set in the subsequent steps;
step C, training the network model parameters: training the adversarial network constructed in step B and solving for the network parameters; the document object type sequences of existing document images are rearranged into two-dimensional matrices for training the discrimination network, and the trained discrimination network is then used to train the generation network;
step D, generating an object type sequence: a new two-dimensional matrix is output automatically by the trained generation network and then projected back to the one-dimensional vector space to obtain a new document object type sequence;
step E, generating document object content: various document object data are collected, and the specific contents of the document objects are generated automatically according to the new document object type sequence produced in step D;
step F, converting the document generated in step E into a document image and generating a document image set, the document image set comprising the document images, the coordinate information of the document objects and the specific contents of the document objects.
Further, in the step A, the types of the objects include headers, texts, graphs, icons, tables, formulas, page numbers and footers;
(1) defining the objects in a document image page as a document object sequence, namely:

DO_i, i = 1,2,3...N (1)

wherein DO_i represents the i-th document object and N represents the number of document objects;

and defining the type sequence corresponding to the document object sequence as the object type sequence, namely:

y_i, i = 1,2,3...N (2)

y_i ∈ {Type_j | j = 1,2,3...M} (3)

wherein y_i indicates the type corresponding to the i-th document object, M indicates the number of object types, and Type_j represents the j-th type;

(2) regarding the document object sequence of each document image page as a vector, formulas (1) and (2) are expressed in vector form:

DO = [DO_1, DO_2, DO_3, ... DO_N] (4)

Y = [y_1, y_2, y_3, ... y_N] (5)

(3) given p pages of document images, the document object sequence and the object type sequence of the p-th page are expressed in vector form, respectively:

DO^p = [DO^p_1, DO^p_2, ... DO^p_{Np}] (6)

Y^p = [y^p_1, y^p_2, ... y^p_{Np}] (7)

wherein the superscript p denotes the p-th page, the subscript Np denotes the number of document objects in the p-th page, and the type of the i-th object of the p-th page is y^p_i, 1 ≤ i ≤ Np; the p-th page has Np document objects in total, and the (p-1)-th page has N(p-1) document objects;

(4) arranging the object type sequences of pages 1 to p in page order, the position of y^p_i in the entire sequence is:

pos(y^p_i) = N1 + N2 + ... + N(p-1) + i (8)

wherein Ni represents the number of document objects in the i-th page; the sequence of formula (8) is projected into a two-dimensional matrix, where K represents the number of rows and columns of the matrix (the number of rows being equal to the number of columns); the column coordinate of the two-dimensional matrix is:

k2 = ⌈pos(y^p_i) / K⌉ (9)

and the row coordinate of the two-dimensional matrix is:

k1 = pos(y^p_i) - (k2 - 1) × K (10)

further, it is possible to obtain the inverse transformation:

pos(y^p_i) = (k2 - 1) × K + k1 (11)

(5) the two-dimensional matrix is defined as follows:

A = [a_{k1,k2}]_{K×K} (13)

wherein 1 ≤ k1, k2 ≤ K; according to formulas (9) to (11), the elements of formula (13) are placed in one-to-one correspondence with the object types y^p_i.
Further, in the step C, the adversarial network is trained specifically as follows:

(1) the loss function of the network is defined by the KL divergence:

Loss = -(1/NS) Σ_{i=1..NS} [ log D(A_i) + log(1 - D(G(P_i))) ] (16)

wherein NS represents the number of samples, 1 ≤ i ≤ NS; A_i is the i-th sample point of the matrix represented by formula (13); P_i is the i-th sample point of the random vector at the input of the generation network;

(2) the two-dimensional matrices obtained in step A are taken as input to train the network, and the network parameters are solved by the gradient descent method with the gradient function:

( ∂Loss/∂para-d , ∂Loss/∂para-g ) (19)

wherein D(para-d) denotes the discrimination network structure and para-d its parameters; G(para-g) denotes the generation network structure and para-g its parameters;

in the training process, the discrimination network is first trained with the two-dimensional matrices, and the trained discrimination network is then used to train the generation network.
Further, in the step B, the discrimination network comprises four groups of convolution kernels and a fully connected layer connected in sequence from left to right, and the activation functions used by the four groups of convolution kernels are all ReLU; the generation network comprises a fully connected layer and four groups of convolution kernels connected in sequence from left to right, and the activation functions used by the four groups of convolution kernels are, respectively: ReLU, ReLU, ReLU and Tanh.
Further, in the step A, the total number of document objects satisfies:

N1 + N2 + ... + Np ≫ K × K (12)

that is, enough document pages are selected to construct a sufficient number of two-dimensional matrices for the subsequent modeling analysis.
Compared with the prior art, the invention has the following advantages and positive effects:
the scheme automatically generates the document image set and its labeling information, saving time and labor cost and avoiding the invalid labels introduced by manual labeling; a deep convolutional generative adversarial network is used to generate document images automatically, the discrimination network learning from existing document images and the generation network then generating new document images automatically, so that a document image set is obtained at low cost and high efficiency; in addition, because the network parameters are trained on existing document images, the generated document images are closer to real publications and have better reference value for use; moreover, while the document image set is generated, the text coding information (such as ASCII, Unicode, etc.) of the text objects in the document images is also provided, which better meets the needs of deep learning training.
Drawings
FIG. 1 is a flowchart illustrating a method for generating a document image set based on deep learning according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a document object sequence according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a sequence of object types according to an embodiment of the present invention;
FIG. 4 is (a) a schematic diagram of a document image and (b) a schematic diagram of an arrangement of a "document object type" sequence in a matrix;
FIG. 5 is a schematic diagram of a deep convolution discriminant network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a deep convolution generating network according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a document image generated by a "generating network" according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a document image set structure according to an embodiment of the present invention.
Detailed Description
In order to make the above objects, features and advantages of the present invention more clearly understood, the present invention will be further described with reference to the accompanying drawings and examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and thus, the present invention is not limited to the specific embodiments disclosed below.
The embodiment provides a method for generating a document image set based on deep learning, as shown in FIG. 1, comprising the following steps:
firstly, vector space projection modeling: the objects in a document image page are regarded as a sequence, in which each node corresponds to the type of one object, giving a document object sequence and an object type sequence in one-to-one correspondence with it; the object type sequence is rearranged into a corresponding two-dimensional matrix, so that the object type sequence is projected from a one-dimensional vector space into a two-dimensional vector space;
secondly, deep convolutional generative adversarial network modeling: the adversarial network comprises a discrimination network and a generation network; the discrimination network is trained with existing document images and is used to train the generation network; after the generation network is trained, it is used to generate new two-dimensional matrices, so that document image sets can be generated automatically in the subsequent steps;
thirdly, training the network model parameters: the adversarial network constructed in the second step is trained and the network parameters are solved; the document object type sequences of existing document images are rearranged into two-dimensional matrices for network training, a random vector is taken as the input of the generation network, and the trained discrimination network is used to train the generation network;
fourthly, generating an object type sequence: a new two-dimensional matrix is output automatically by the trained generation network and then projected to a one-dimensional object class vector through the inverse process of the first step, giving a new document object type sequence;
fifthly, generating document object content: various document object data are collected, and the specific contents of the objects in the document are generated based on the new document object type sequence produced in the fourth step;
sixthly, the document generated in the fifth step is converted into a document image and a document image set is generated, comprising the document images, the coordinate information of the document objects and the specific contents of the document objects.
The invention is described in detail below with reference to a specific embodiment:
firstly, vector space projection modeling:
as shown in the first column of FIGS. 2 and 3, the objects in the document page may be viewed as a sequence, with each node (first column of FIG. 3) in the sequence corresponding to a type tag (second column of FIG. 3), and then the sequence is projected into a two-dimensional K by K matrix space, as shown in FIG. 4.
In FIG. 2, a page of a document image includes 11 objects, in order: header, text, graph, drawing, legend, text, graph, legend, text, footer; these objects are arranged from top to bottom and from left to right (see FIG. 3), which is both the reading order and the writing and composing order. The 11 objects are defined as a document object sequence:
DO_i, i = 1,2,3...N (1)

wherein DO_i represents the i-th document object; in FIG. 2, N = 11. The 11 document objects form the left column ("document object sequence") of FIG. 3, and the corresponding right column is the "object type sequence", defined as:

y_i, i = 1,2,3...N (2)

y_i ∈ {Type_j | j = 1,2,3...M} (3)

specifically, in FIGS. 2 and 3, M = 5, and Type_1 to Type_5 are, respectively: header, text, figure, legend, footer; formulas (1) and (2) characterize a "document object"/"object type" sequence pair in a page of a document image, and such a sequence pair has the following characteristics:

<1> the sequence pair represented by formulas (1) and (2) reflects the top-to-bottom, left-to-right writing and typesetting order, i.e. the "sequence relation" between the "document objects" within the same page;

<2> in any document or book longer than one page, a "sequence relation" also exists between any two "document objects" that are not on the same page; therefore the object sequence of each page is regarded as a vector, that is, formulas (1) and (2) can be expressed in vector form:

DO = [DO_1, DO_2, DO_3, ... DO_N] (4)

Y = [y_1, y_2, y_3, ... y_N] (5)

formula (5) represents the one-dimensional vector formed by the "document object type" sequence of one page; the "document object type" sequences of a plurality of pages can be projected into a two-dimensional object, namely a matrix.
As shown in FIG. 4, (a) shows three pages of a document and (b) a K × K matrix; each element of the matrix represents a "document object type", and all the "document object types" of the three pages in (a) fill the K × K matrix in (b) in order from top to bottom and from left to right. Let the "document object" sequence and the "object type" sequence of page p be expressed in vector form, respectively:

DO^p = [DO^p_1, DO^p_2, ... DO^p_{Np}] (6)

Y^p = [y^p_1, y^p_2, ... y^p_{Np}] (7)

wherein the superscript p denotes the p-th page, the subscript Np denotes the number of "document objects" in the p-th page, and the type of the i-th object of the p-th page is y^p_i, 1 ≤ i ≤ Np; the p-th page contains Np document objects and the (p-1)-th page contains N(p-1) document objects. Arranging the "object type" sequences of pages 1 to p in page order, the position of y^p_i in the entire sequence is (counting from 1):

pos(y^p_i) = N1 + N2 + ... + N(p-1) + i (8)

wherein Ni represents the number of document objects in the i-th page. The column coordinate of the K × K matrix of FIG. 4(b) onto which formula (8) is projected is:

k2 = ⌈pos(y^p_i) / K⌉ (9)

and the row coordinate is:

k1 = pos(y^p_i) - (k2 - 1) × K (10)

Formulas (9) to (10) define the coordinate transformation projecting the one-dimensional object class vector into the K × K two-dimensional matrix, while formula (11) is the inverse transformation:

pos(y^p_i) = (k2 - 1) × K + k1 (11)

It should be emphasized that, in the present embodiment, the total number of document objects satisfies:

N1 + N2 + ... + Np ≫ K × K (12)

that is, enough document pages need to be selected to construct a sufficient number of K × K matrices for the subsequent modeling analysis.

The K × K matrix shown in FIG. 4(b) is defined as follows:

A = [a_{k1,k2}]_{K×K} (13)

wherein 1 ≤ k1, k2 ≤ K; the elements of formula (13) can be placed in one-to-one correspondence with the object types y^p_i.
In this embodiment, as shown in FIGS. 2 to 4, the document object sequence is characterized by "spatial" mapping relationships, and the document layout information is abstracted into three "spaces": the "document object" sequence space, the "document object type" sequence space, and the K × K two-dimensional matrix space obtained by rearranging the "document object type" sequence by columns. Two mapping relationships exist between the three spaces: (1) "document object" sequence space ← → "document object type" sequence space, a natural one-to-one mapping; (2) "document object type" sequence space ← → K × K two-dimensional matrix space, a coordinate transformation relation that projects a one-dimensional sequence vector into a two-dimensional matrix space. On the one hand, this makes it convenient to train the deep convolutional generative adversarial network with two-dimensional matrices; on the other hand, it allows the network to generate new two-dimensional matrices from which new "document object type" sequences are obtained, finally realizing automatic generation of document images.
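The mapping relationship (2) above can be sketched in a few lines of code. This is an illustrative reconstruction, not the patented implementation: the function names are hypothetical, and the column-by-column fill follows the "top to bottom, left to right" order of FIG. 4 and formulas (9) to (11).

```python
# Project per-page "document object type" sequences into a K x K matrix
# (column-major fill) and back. Hypothetical helper names; the fill
# order follows the "top to bottom, left to right" rule of FIG. 4.

K = 4  # small side length for illustration; the embodiment uses K = 16

def sequences_to_matrix(pages, k=K):
    """Concatenate the per-page type sequences in page order and fill
    a k x k matrix column by column (formulas (8)-(10))."""
    seq = [t for page in pages for t in page]
    if len(seq) < k * k:
        raise ValueError("not enough document objects to fill the matrix")
    mat = [[0] * k for _ in range(k)]
    for pos, t in enumerate(seq[:k * k]):
        k2, k1 = divmod(pos, k)   # 0-based column index, row index
        mat[k1][k2] = t
    return mat

def matrix_to_sequence(mat):
    """Inverse transformation (formula (11)): read the matrix top to
    bottom, left to right, recovering the one-dimensional sequence."""
    k = len(mat)
    return [mat[k1][k2] for k2 in range(k) for k1 in range(k)]
```

For any input long enough to fill the matrix, `matrix_to_sequence(sequences_to_matrix(pages))` returns the first K × K entries of the concatenated sequence, which is exactly the invertibility that the fourth step relies on.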
Secondly, deep convolutional generative adversarial network modeling:
The K × K two-dimensional matrices obtained in the previous step contain the category information of the objects of many document pages. In the second step, the K × K two-dimensional matrices output by the first step are taken as input, and a deep convolutional generative adversarial network is used for modeling. The model contains two parts: the first, shown in FIG. 5, is called the "discrimination network" and determines whether an input K × K two-dimensional matrix can represent a document object category sequence; the other, shown in FIG. 6, is the "generation network", which generates new K × K two-dimensional matrices.
Specifically, in this embodiment:
On the one hand, a K × K two-dimensional matrix can be taken as the input of the discrimination network. FIG. 5 is a schematic structural diagram of the discrimination network: the first group of convolution kernels consists of 64 kernels of size 3 × 3 × 1, producing 64 feature maps; the input then passes through three further groups of convolution kernels and finally through a fully connected layer, whose output identifies whether the input K × K matrix can represent a document object class sequence. The discrimination network is defined as:

D(para-d) (14)

wherein D(·) is the discrimination network structure and para-d are its parameters.

On the other hand, as shown in FIG. 6, a random vector of dimension d is generated; 512 two-dimensional matrices of size K/8 × K/8 are computed through a fully connected layer of dimension d × 512, and a new K × K matrix is then generated through three groups of fractionally-strided convolution kernels. The new K × K matrix is expected to characterize a document object class sequence well, and the discrimination network of FIG. 5 is used to check whether it does. The generation network is defined as:

G(para-g) (15)

wherein G(·) is the generation network structure and para-g are its parameters.

The discrimination network (formula (14)) of FIG. 5 and the generation network (formula (15)) of FIG. 6 together form the deep convolutional generative adversarial network. In the generation network of FIG. 6, the activation functions used after the four groups of convolution kernels, from left to right, are respectively: ReLU, ReLU, ReLU and Tanh; in the discrimination network of FIG. 5, the activation functions used after the four groups of convolution kernels, from left to right, are all ReLU.
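The layer geometry just described can be checked with simple shape arithmetic. The sketch below assumes stride-2 downsampling in the discrimination network and ×2 fractionally-strided upsampling in the generation network (standard deep-convolutional-GAN choices; the strides are not stated in this text), and an assumed input dimension d = 100:

```python
# Shape walk-through for the generation network of FIG. 6 and the
# discrimination network of FIG. 5. Stride-2 / x2-upsampling choices
# and d = 100 are assumptions; only 512, K/8, 64 and 3x3x1 come from
# the text.

K = 16   # matrix side length in the embodiment
d = 100  # dimension of the generator's random input vector (assumed)

def generator_shapes(k=K, z_dim=d):
    """FIG. 6: z -> fully connected layer -> 512 feature maps of size
    (k/8 x k/8) -> three x2 fractionally-strided conv groups -> a final
    same-size conv group with Tanh producing one k x k matrix."""
    shapes = [("z", (z_dim,)), ("fc", (512, k // 8, k // 8))]
    ch, side = 512, k // 8
    for i in range(3):
        ch, side = ch // 2, side * 2          # halve channels, double size
        shapes.append((f"deconv{i + 1}", (ch, side, side)))
    shapes.append(("conv4_tanh", (1, k, k)))  # single-channel K x K output
    return shapes

def discriminator_shapes(k=K):
    """FIG. 5: k x k input -> 64 feature maps (3x3x1 kernels) -> three
    further conv groups -> fully connected layer -> one score."""
    shapes = [("input", (1, k, k)), ("conv1", (64, k // 2, k // 2))]
    ch, side = 64, k // 2
    for i in range(3):
        ch, side = ch * 2, side // 2          # double channels, halve size
        shapes.append((f"conv{i + 2}", (ch, side, side)))
    shapes.append(("fc", (1,)))               # real/fake score
    return shapes
```

With K = 16 as in the embodiment, the generation network runs 100 → 512 × 2 × 2 → ... → 1 × 16 × 16, and the discrimination network runs 1 × 16 × 16 → 64 × 8 × 8 → ... → 512 × 1 × 1 → a single score.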
Thirdly, training network model parameters:
The loss function of the network is defined by the KL divergence (Kullback–Leibler divergence). This loss function places the discrimination network (FIG. 5) and the generation network (FIG. 6) in a game state, which is why the two networks together are called an adversarial network. The network is trained with the K × K two-dimensional matrices output by the first step, and the network parameters are solved by the gradient descent method.
In this embodiment, after the deep convolutional generative adversarial network is constructed, it must be trained to obtain the optimal network parameters para-d (formula (14)) and para-g (formula (15)); this requires enough document pages (i.e. p in formulas (6)-(12) large enough) to obtain enough matrices A (formula (13)).
The loss function of the network is defined in particular by the KL divergence (Kullabck-Leibler divergence):
wherein NS represents the Number of Samples (the Number of Samples), and naturally, i is more than or equal to 1 and less than or equal to NS; a. theiThe ith sample point of expression (13):
Ai (17)
Ρithe ith sample point of the left-hand d-dimensional random vector of FIG. 6 is represented:
Ρi (18)
the process of training the network model parameters para-d and para-g, equations (14) and (15), i.e., the process of solving the minimum of equation (16), for the gradient function:
the solution is carried out, and the specific algorithm is as follows:
In the second and third steps, a deep convolutional generative adversarial network is adopted; the discrimination network in the adversarial network learns from existing document images, and at the same time the discrimination network is used to train the generation network. In this embodiment the model contains two parts: the first, shown in FIG. 5, is the "discrimination network", which identifies whether an input K × K matrix can represent a document object class sequence; the other, shown in FIG. 6, is the "generation network", which generates new K × K matrices.
Existing document images are learned by the discrimination network of the adversarial network, and new document images are then generated automatically by the generation network, thereby obtaining a document image set. Because the network parameters are trained with existing document images, the generated document images are close to real publications. During training, the KL divergence (Kullback–Leibler divergence) is used to define the loss function of the network; this loss function places the discrimination network (FIG. 5) and the generation network (FIG. 6) in a game state, so the two networks together are called an adversarial network. The network is trained with the K × K matrices output by the first step, and the network parameters are solved by the gradient descent method. In the training process, the K × K two-dimensional matrices are first used to train the "discrimination network", and the "discrimination network" is then used to train the "generation network".
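The alternating training just described can be illustrated at toy scale. The following sketch replaces both deep networks with one-parameter stand-ins (a shift-only "generation network" g(z) = b + z and a logistic "discrimination network" D(x) = sigmoid(w·x + c)) and trains them with alternating gradient steps; it is a didactic analogue of the adversarial game, not the patent's implementation, and all names and constants are illustrative.

```python
# Toy adversarial training loop: a one-parameter generation network is
# trained against a logistic discrimination network so that generated
# samples drift toward real samples centred at 3.0. Didactic stand-in
# for the DCGAN training described in the text.
import math
import random

random.seed(0)
w, c = 0.5, 0.0            # discrimination network parameters (para-d)
b = 0.0                    # generation network parameter (para-g)
lr, steps, batch = 0.05, 2000, 16

def sigmoid(u):
    u = max(-60.0, min(60.0, u))   # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-u))

for _ in range(steps):
    real = [random.gauss(3.0, 0.5) for _ in range(batch)]
    fake = [b + random.gauss(0.0, 0.5) for _ in range(batch)]
    # discrimination step: ascend log D(real) + log(1 - D(fake))
    gw = (sum((1 - sigmoid(w * x + c)) * x for x in real)
          - sum(sigmoid(w * x + c) * x for x in fake)) / batch
    gc = (sum(1 - sigmoid(w * x + c) for x in real)
          - sum(sigmoid(w * x + c) for x in fake)) / batch
    w, c = w + lr * gw, c + lr * gc
    # generation step (non-saturating form): ascend log D(fake)
    gb = sum((1 - sigmoid(w * x + c)) * w for x in fake) / batch
    b += lr * gb
```

After training, the shift parameter b has moved from 0 toward the real mean: the generation network has learned to produce samples the discrimination network can no longer easily separate from the real ones, which is the game state described above.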
Fourthly, generating an object type sequence:
A new K × K matrix is automatically generated by the "generation network" shown in Fig. 6, and is then projected onto the one-dimensional object class vector to obtain a new "document object type" sequence. That is, the generation network trained in the third step outputs a K × K matrix (equation (13)) as the output of the network shown in Fig. 6, and this K × K two-dimensional matrix is projected onto the one-dimensional object class vector according to equation (11), yielding "document object type" sequences of the form of equations (5) and (7). The specific algorithm of the generation process is as follows:
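The projection of a generated K × K matrix back to the one-dimensional object class vector can be sketched as follows (a minimal numpy sketch; row-major ordering and the trailing-padding convention are assumptions consistent with the row/column coordinates of equations (9)-(12)):

```python
import numpy as np

def matrix_to_sequence(A, n_objects=None):
    """Project a K x K class matrix back to the 1-D object-type vector.

    Row-major flattening inverts the projection of equations (9)-(12):
    the entry at row k1, column k2 (1-based) maps to sequence position
    i = (k1 - 1) * K + k2, where K = A.shape[0].
    """
    seq = A.flatten(order="C")  # row-major traversal
    if n_objects is not None:
        seq = seq[:n_objects]   # drop padding beyond the real object count
    return seq

K = 4
A = np.arange(1, K * K + 1).reshape(K, K)  # stand-in for a generated matrix
seq = matrix_to_sequence(A)
```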
Fifthly, generating document object content:
First, various document object data are collected; then the "document object type" sequence generated in the fourth step is used to generate the specific contents of the objects in the document. "Algorithm 2" produces a "document object type" sequence containing 16 × 16 × 100 = 25600 document object types. According to the settings in "Algorithm 1", these 25600 types yield 25600 document objects, spanning p > 4000 document pages in total. Next, the specific "document objects" (equation (1)) are generated from the 25600 "document object types".
To generate a "document object", data is collected from existing PDF documents as defined by equation (3), where the specific parameters are as described in "Algorithm 1": {Type_1, Type_2, ... Type_8}, corresponding to text, formula, graph, legend, table name, header, footer, etc. The collected data are defined as:
Set_1, Set_2, ... Set_8 (the text set through the footer set) (20)
Then, according to the "document object type" sequence generated by "Algorithm 2", the document objects (defined by equation (1)), their coordinate information, and their content information are generated using the TeX markup language and the data sets of equation (20). The coordinate information is:
DO_i-Coors (21)
which is the coordinate information of DO_i in equation (1), where 1 ≤ i ≤ 25600. In addition, the content information is:
DO_i-Content (22)
which refers to the specific content of DO_i, such as text codes, formulas, etc. The specific document object generation process is as follows:
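The TeX-based generation of document object content can be sketched as follows (the template strings and type names are hypothetical placeholders; the actual method draws content from the collected data sets Set_1 ... Set_8 of equation (20)):

```python
# Hypothetical mapping from object types to TeX fragments; in the real
# method the bodies come from the collected sets of equation (20).
TEX_TEMPLATES = {
    "text":    "Sample paragraph text.",
    "formula": r"\begin{equation} E = mc^2 \end{equation}",
    "table":   r"\begin{tabular}{cc} a & b \\ c & d \end{tabular}",
    "header":  r"\fancyhead[C]{Sample Header}",
}

def render_objects(type_sequence):
    """Emit one TeX fragment per entry of a document-object-type sequence."""
    fragments = []
    for i, obj_type in enumerate(type_sequence, start=1):
        body = TEX_TEMPLATES.get(obj_type, f"% unknown type: {obj_type}")
        # Tag each fragment with its object index DO_i for traceability.
        fragments.append(f"% DO_{i} ({obj_type})\n{body}")
    return "\n".join(fragments)

doc = render_objects(["header", "text", "formula"])
```

Compiling the concatenated fragments with a TeX engine would yield the PDF pages whose object coordinates (equation (21)) and contents (equation (22)) are known by construction.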
In the fourth and fifth steps, after the deep convolutional generative adversarial network has been trained, the "generation network" (Fig. 6) produces a new K × K matrix. This new two-dimensional matrix is projected onto the one-dimensional object class vector to obtain a new "document object type" sequence, from which document objects are automatically generated. The document objects are then converted into document images, yielding a new document image set and thereby improving both the generation efficiency and the quality of the document image set.
Sixthly, converting the document into a document image, and generating a document image set:
The document generated in the fifth step is converted into document images, producing a document image set (containing the document images, the coordinate information of the document objects, and the content information of the document objects). Each page of the PDF document generated according to "Algorithm 3" is converted into a document image; an automatically generated image is shown in Fig. 7. Each generated document image is defined as:
DocImage_c, c = 1, 2, ... p (23)
where p represents the number of images in the document image dataset (p > 4000 according to "Algorithm 1" and the fifth step). Meanwhile, the document object space coordinates represented by equation (21) are mapped into the document image, resulting in:
DO_i-Coors′ (24)
then, the document image dataset may be represented as:
DocImageSet = {ele_c}, c = 1, 2, ... p (25)
ele_c = {DocImage_c, DO_{i,c}-Coors′, DO_{i,c}-Content} (26)
Equation (25) defines the document image dataset, in which each ele_c (shown by the dotted-line box in Fig. 8) contains the space coordinate information of the N document objects in one image (DO_{i,c}-Coors′ in equation (26)) in one-to-one correspondence with the specific content information of each document object (DO_{i,c}-Content in equation (26)).
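The structure of one dataset element ele_c of equations (25)-(26) can be sketched as follows (the field names and the image-path representation are illustrative assumptions, not prescribed by the patent):

```python
from dataclasses import dataclass, field

@dataclass
class DocumentObjectAnnotation:
    # DO_{i,c}-Coors': bounding box in image coordinates (equation (24))
    coords: tuple       # (x0, y0, x1, y1)
    # DO_{i,c}-Content: the object's concrete content (equation (22))
    content: str

@dataclass
class DatasetElement:
    # One ele_c of equation (26): a document image plus its N annotations.
    image_path: str
    objects: list = field(default_factory=list)

elem = DatasetElement("doc_0001.png")
elem.objects.append(DocumentObjectAnnotation((10, 20, 300, 60), "Sample header"))
dataset = [elem]  # DocImageSet of equation (25), c = 1, 2, ... p
```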
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention to this form. Any person skilled in the art may, without departing from the technical spirit of the present invention, apply the above modifications or changes to obtain equivalent embodiments; any simple modification or equivalent change made to the above embodiments according to the technical spirit of the present invention still falls within the protection scope of the present invention.
Claims (5)
1. A method for generating a document image set based on deep learning, characterized by comprising the following steps:
step A, vector space projection modeling: the objects in a document image page are regarded as a sequence, where each node in the sequence corresponds to the type of one object, yielding a document object sequence and an object type sequence in one-to-one correspondence with it; the object type sequence is rearranged into a corresponding two-dimensional matrix, so that the object type sequence is projected from a one-dimensional vector space to a two-dimensional vector space;
step B, deep convolutional generative adversarial network modeling: the adversarial network comprises a discrimination network and a generation network; the discrimination network is trained with existing document images and serves to train the generation network; after the generation network is trained, it is used to generate two-dimensional matrices, with the aim of automatically generating a document image set subsequently;
step C, training the network model parameters: the adversarial network constructed in step B is trained and its network parameters are solved; the document object type sequences in existing document images are rearranged into two-dimensional matrices for training the discrimination network; the trained discrimination network is then used to train the generation network;
step D, generating an object type sequence: automatically outputting a new two-dimensional matrix based on the trained generation network; then, projecting the new two-dimensional matrix to a one-dimensional vector space to obtain a new document object type sequence;
step E, generating document object content: various document object data are collected, and the specific contents of the document objects are automatically generated according to the new document object type sequence generated in step D;
step F, converting the document generated in step E into document images, and generating a document image set, wherein the document image set comprises the document images, the coordinate information of the document objects, and the specific contents of the document objects.
2. The method for generating a document image set based on deep learning of claim 1, wherein: in the step A, the types of the objects comprise headers, texts, graphs, icons, tables, formulas, page numbers and footers;
(1) defining several objects in a document image page as a sequence of document objects, namely:
DO_i, i = 1, 2, 3 ... N (1)
wherein DO_i represents the i-th document object and N represents the number of document objects;
and defining a type sequence corresponding to the document object sequence as an object type sequence, namely:
y_i, i = 1, 2, 3 ... N (2)
y_i ∈ {Type_j | j = 1, 2, 3 ... M} (3)
wherein y_i indicates the type corresponding to the i-th document object, M indicates the number of object types, and Type_j represents one type;
(2) the document object sequence in each document image page is regarded as a vector, and formulas (1) and (2) are expressed in vector form:
DO = [DO_1, DO_2, DO_3, ... DO_N] (4)
Y = [y_1, y_2, y_3, ... y_N]; (5)
(3) given p pages of document images, the document object sequence and the object type sequence of the p-th page are respectively expressed in vector form:
wherein the superscript p denotes page p, the subscript N_p denotes the number of document objects on page p, and the type of the i-th object on page p is y_i^(p); page p has N_p document objects, and page p-1 has N_{p-1} document objects;
(4) the object type sequences of pages 1 to p are arranged in page-number order; the position of y_i^(p) in the entire sequence is:
wherein N_i represents the number of document objects in the i-th page; formula (8) is projected into the two-dimensional matrix, where K represents the number of rows and the number of columns of the two-dimensional matrix (the number of rows is equal to the number of columns); the column coordinate of the two-dimensional matrix is:
the row coordinates of the two-dimensional matrix are:
further, it is possible to obtain:
(5) the two-dimensional matrix is defined as follows:
A = [a_{k1,k2}]_{K×K} (13)
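The rearrangement of the one-dimensional object type sequence into the K × K matrix A of equation (13) can be sketched as follows (row-major filling and zero padding of the tail are assumptions consistent with the row/column coordinates of step A):

```python
import numpy as np

def sequence_to_matrix(type_sequence, K, pad_value=0):
    """Rearrange a 1-D object-type sequence into the K x K matrix A of (13).

    Position i (1-based) goes to row k1 = ceil(i / K) and column
    k2 = ((i - 1) % K) + 1, i.e. row-major filling; any unfilled tail
    of the matrix is padded with pad_value.
    """
    A = np.full(K * K, pad_value)
    n = min(len(type_sequence), K * K)
    A[:n] = type_sequence[:n]
    return A.reshape(K, K)

A = sequence_to_matrix([1, 2, 3, 4, 5], K=3)
```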
3. The method for generating a document image set based on deep learning of claim 2, wherein: in the step C, when training the adversarial network, the following method is specifically adopted:
(1) the loss function of the network is defined by the KL divergence:
wherein NS represents the number of samples, 1 ≤ i ≤ NS; A_i is the i-th sample point represented by formula (13); P_i is the i-th sample point of the random vector fed to the input end of the generation network;
(2) the two-dimensional matrices obtained in step A are taken as input to train the network, and the network parameters are solved with a gradient descent method; the gradient functions are:
wherein D(para-d) is the discrimination network structure and para-d is the discrimination network parameter; G(para-g) is the generation network structure and para-g is the generation network parameter;
in the training process, the discrimination network is first trained with the two-dimensional matrices, and the trained discrimination network is then used to train the generation network.
4. The method for generating a document image set based on deep learning of claim 1, wherein: in the step B, the discrimination network comprises four groups of convolution kernels and a fully connected layer connected in sequence from left to right, and the activation functions used by the four groups of convolution kernels are all ReLU; the generation network comprises a fully connected layer and four groups of convolution kernels connected in sequence from left to right, and the activation functions used by the four groups of convolution kernels are, respectively: ReLU, ReLU, ReLU, and Tanh.
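The spatial sizes flowing through the four convolution groups of the discrimination network can be sketched as follows (kernel size 4, stride 2, padding 1, and input size K = 64 are illustrative DCGAN-style assumptions; the claim does not fix these values):

```python
def conv_out(size, kernel=4, stride=2, pad=1):
    """Output spatial size of one strided convolution (here it halves the input)."""
    return (size + 2 * pad - kernel) // stride + 1

def discriminator_shapes(K=64):
    """Spatial sizes through the four convolution groups of the
    discrimination network, before the final fully connected layer."""
    sizes = [K]
    for _ in range(4):
        sizes.append(conv_out(sizes[-1]))
    return sizes

shapes = discriminator_shapes(64)
```

Under these assumed hyperparameters the K × K input shrinks by half at each group before the fully connected layer produces the real/fake score; the generation network mirrors this, expanding a random vector back up to K × K with Tanh bounding the final output.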
5. The method for generating a document image set based on deep learning of claim 2, wherein: in the step A:
K represents the number of rows and the number of columns of the two-dimensional matrix, the number of rows being equal to the number of columns; that is, enough document pages are selected to construct a sufficient number of two-dimensional matrices for the subsequent modeling analysis.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011178681.2A CN112347742B (en) | 2020-10-29 | 2020-10-29 | Method for generating document image set based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112347742A true CN112347742A (en) | 2021-02-09 |
CN112347742B CN112347742B (en) | 2022-05-31 |
Family
ID=74357017
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113127622A (en) * | 2021-04-29 | 2021-07-16 | 西北师范大学 | Method and system for generating voice to image |
CN117272941A (en) * | 2023-09-21 | 2023-12-22 | 北京百度网讯科技有限公司 | Data processing method, apparatus, device, computer readable storage medium and product |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107330444A (en) * | 2017-05-27 | 2017-11-07 | 苏州科技大学 | A kind of image autotext mask method based on generation confrontation network |
CN107943784A (en) * | 2017-11-02 | 2018-04-20 | 南华大学 | Relation extraction method based on generation confrontation network |
US20190050639A1 (en) * | 2017-08-09 | 2019-02-14 | Open Text Sa Ulc | Systems and methods for generating and using semantic images in deep learning for classification and data extraction |
CN109344879A (en) * | 2018-09-07 | 2019-02-15 | 华南理工大学 | A kind of decomposition convolution method fighting network model based on text-image |
CN110516577A (en) * | 2019-08-20 | 2019-11-29 | Oppo广东移动通信有限公司 | Image processing method, device, electronic equipment and storage medium |
CN111783416A (en) * | 2020-06-08 | 2020-10-16 | 青岛科技大学 | Method for constructing document image data set by using prior knowledge |
Non-Patent Citations (4)
Title |
---|
CAO SHI et al.: "Sentiment Analysis of Home Appliance Comment Based on Generative Probabilistic Model", IEEE * |
XU CANHUI et al.: "Graph-based Layout Analysis for PDF Documents", SPIE * |
XU CANHUI et al.: "Graphic composite segmentation for PDF documents with complex layouts", SPIE * |
CHEN Saijian: "Research on Text Image Reconstruction Methods Based on Deep Learning", China Master's Theses Full-text Database (Information Science and Technology) * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||