CN112347742A - Method for generating document image set based on deep learning

Method for generating document image set based on deep learning

Info

Publication number
CN112347742A
Authority
CN
China
Legal status
Granted
Application number
CN202011178681.2A
Other languages
Chinese (zh)
Other versions
CN112347742B (en)
Inventor
史操
许灿辉
刘传琦
程远志
陶冶
马兴录
刘国柱
Current Assignee
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2021-02-09
Application filed by Qingdao University of Science and Technology

Classifications

    • G06F40/166: Handling natural language data; Text processing; Editing, e.g. inserting or deleting
    • G06F40/151: Handling natural language data; Text processing; Use of codes for handling textual entities; Transformation
    • G06F40/189: Handling natural language data; Text processing; Automatic justification
    • G06N3/045: Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
    • G06N3/08: Neural networks; Learning methods
    • G06T11/60: 2D image generation; Editing figures and text; Combining figures or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a method for generating a document image set based on deep learning, which comprises the following steps: first, project the page object type sequence from a one-dimensional vector space into a two-dimensional vector space; then build a deep convolutional generative adversarial network model; train the network parameters and use the trained model to generate object type sequences; generate document object content according to the object type sequences produced by the network; and finally convert the documents into document images to produce a document image set. Image documents are generated automatically by a convolutional generative adversarial network built on a deep learning framework: the discrimination network of the adversarial network learns from existing document images, and the generation network then automatically generates new document images, yielding the document image set. Because the network parameters are trained on existing document images, the generated document images closely resemble real publications. Compared with manual labeling, the document image set and its annotation information are generated automatically, saving time and labor cost and avoiding the annotation errors introduced by manual labeling.

Description

Method for generating document image set based on deep learning
Technical Field
The invention relates to an image generation method, belongs to the field of automatic generation of image data sets, and particularly relates to a method for generating a document image set based on deep learning.
Background
In many fields of document image processing, such as segmentation, classification, retrieval, etc., a labeled document image set is an indispensable data basis in the machine learning process. With the advent of the big data era, "end-to-end" deep learning has become an important research method in the field of artificial intelligence research, and deep learning requires more training data than traditional machine learning.
Currently, researchers use automatic image set generation methods to obtain image sets, including document images and annotation information, more efficiently. In a paper at the 2017 International Conference on Document Analysis and Recognition (ICDAR) (D. He, S. Cohen, B. Price, D. Kifer and C. L. Giles, "Multi-Scale Multi-Task FCN for Semantic Page Segmentation and Table Detection"), paragraphs, figures, tables, titles, paragraph headings, lists and other elements are randomly arranged to generate a document image dataset for deep learning training. Similarly, the invention patent with application publication number CN108898188A discloses an image dataset auxiliary labeling system and method, which uses the idea of neural network training to perform preliminary feature-extraction training on the images required for neural network training, labels the images to obtain the label document format required by the neural network, and obtains a large number of labeled documents of a given type from massive image information.
On the other hand, many image sets are still produced by manual labeling. For example, the image annotation tool VIA (Abhishek Dutta and Andrew Zisserman. 2019. The VIA Annotation Software for Images, Audio and Video. In Proceedings of the 27th ACM International Conference on Multimedia (MM '19), October 21-25, 2019, Nice, France. ACM, New York, NY, USA), designed by the Visual Geometry Group at the University of Oxford, lets users manually annotate image regions with different shapes (rectangles, circles, ellipses, polygons, etc.).
Manual labeling is flexible: the labeling strategy can be adjusted during the labeling process, and the results usually conform well to expectations. Its obvious disadvantages are that the labeling process is time-consuming and labor-intensive, and that labeling quality depends directly on the proficiency of the annotator. Compared with manual labeling, automatic document image dataset generation overcomes these drawbacks, but it has an unavoidable problem of its own: the publishing industry has its own standards, and the layout designs of different publications follow specific rules that present document content effectively. If randomly generated document images do not conform to the typesetting rules of real publications, a model trained on them cannot reach its best performance when applied to document images of real publications.
Disclosure of Invention
Aiming at the shortcomings of existing methods for obtaining document image sets, the invention provides a method for generating a document image set based on deep learning. A convolutional generative adversarial network built on a deep learning framework generates image documents automatically: the discrimination network of the adversarial network learns from existing document images, and the generation network then automatically generates new document images, thereby producing a document image set.
The invention is realized by the following technical scheme. The method for generating a document image set based on deep learning comprises the following steps:
Step A, vector space projection modeling: treat the objects in a document image page as a sequence, where each node in the sequence corresponds to the type of one object, giving a document object sequence and an object type sequence in one-to-one correspondence with it; rearrange the object type sequence into a corresponding two-dimensional matrix, thereby projecting the object type sequence from a one-dimensional vector space into a two-dimensional vector space;
Step B, deep convolutional generative adversarial network modeling: the adversarial network comprises a discrimination network and a generation network; the discrimination network is trained on existing document images and serves to train the generation network; once trained, the generation network is used to generate two-dimensional matrices, with the aim of automatically generating the document image set in the subsequent steps;
Step C, training the network model parameters: train the adversarial network constructed in step B and solve for the network parameters; rearrange the document object type sequences of existing document images into two-dimensional matrices for training the discrimination network; then train the generation network with the trained discrimination network;
Step D, generating object type sequences: automatically output new two-dimensional matrices from the trained generation network; then project each new matrix back into the one-dimensional vector space to obtain a new document object type sequence;
Step E, generating document object content: collect data for the various document object types and, from the new document object type sequences generated in step D, automatically generate the specific content of each document object;
Step F, convert the documents generated in step E into document images and produce the document image set, which comprises the document images, the coordinate information of the document objects, and the specific content of the document objects.
Further, in step A, the object types include headers, text, figures, legends, tables, formulas, page numbers and footers;
(1) defining the objects in a document image page as a document object sequence, namely:
$DO_i, \quad i = 1, 2, 3 \ldots N$ (1)
where $DO_i$ denotes the i-th document object and N denotes the number of document objects;
and defining the type sequence corresponding to the document object sequence as the object type sequence, namely:
$y_i, \quad i = 1, 2, 3 \ldots N$ (2)
$y_i \in \{Type_j \mid j = 1, 2, 3 \ldots M\}$ (3)
where $y_i$ denotes the type of the i-th document object, M denotes the number of object types, and $Type_j$ denotes a type;
(2) treating the document object sequence of each document image page as a vector, formulas (1) and (2) are expressed in vector form:
$DO = [DO_1, DO_2, DO_3, \ldots DO_N]$ (4)
$Y = [y_1, y_2, y_3, \ldots y_N]$ (5)
(3) given p pages of document images, the document object sequence and the object type sequence of the p-th page are expressed in vector form as:
$DO^p = [DO_1^p, DO_2^p, DO_3^p, \ldots DO_{N_p}^p]$ (6)
$Y^p = [y_1^p, y_2^p, y_3^p, \ldots y_{N_p}^p]$ (7)
where the superscript p denotes the p-th page and the subscript $N_p$ denotes the number of document objects on the p-th page; the type of the i-th object on page p is $y_i^p$, with $1 \le i \le N_p$; page p contains $N_p$ document objects and page p-1 contains $N_{p-1}$ document objects;
(4) arranging the object type sequences of pages 1 through p in page order, the position of $y_i^p$ in the whole sequence is:
$pos(y_i^p) = \sum_{j=1}^{p-1} N_j + i$ (8)
where $N_j$ denotes the number of document objects on the j-th page; formula (8) is projected into a two-dimensional matrix, where K denotes the number of rows and columns of the matrix (the two being equal); the column coordinate in the two-dimensional matrix is:
$k_2 = \lfloor (pos - 1) / K \rfloor + 1$ (9)
and the row coordinate of the two-dimensional matrix is:
$k_1 = ((pos - 1) \bmod K) + 1$ (10)
from which one further obtains the inverse:
$pos = (k_2 - 1) \times K + k_1$ (11)
(5) the two-dimensional matrix is defined as follows:
$A = [a_{k_1, k_2}]_{K \times K}$ (13)
where $1 \le k_1, k_2 \le K$; by formulas (9)-(11), the elements of formula (13) are put in one-to-one correspondence with the types $y_i^p$ of the i-th object on the p-th page.
Further, in step C, the adversarial network is trained as follows:
(1) the loss function of the network is defined by the KL divergence:
$L(D, G) = \frac{1}{NS} \sum_{i=1}^{NS} [\log D(A_i) + \log(1 - D(G(P_i)))]$
where NS denotes the number of samples, with $1 \le i \le NS$; $A_i$ denotes the i-th sample of the matrices defined by formula (13); $P_i$ denotes the i-th sample of the random vector fed to the input of the generation network;
(2) the two-dimensional matrices obtained in step A are used as input to train the network, and the network parameters are solved by gradient descent, with the gradient functions:
$\nabla_{para\text{-}d} \frac{1}{NS} \sum_{i=1}^{NS} [\log D(A_i) + \log(1 - D(G(P_i)))]$ and $\nabla_{para\text{-}g} \frac{1}{NS} \sum_{i=1}^{NS} \log(1 - D(G(P_i)))$
where D(para-d) is the discrimination network structure, with para-d its network parameters; G(para-g) is the generation network structure, with para-g its network parameters;
during training, the discrimination network is first trained with the two-dimensional matrices, and the trained discrimination network is then used to train the generation network.
Further, in step B, the discrimination network comprises four groups of convolution kernels and a fully connected layer connected in sequence from left to right, and the activation functions used by all four groups of convolution kernels are ReLU; the generation network comprises a fully connected layer and four groups of convolution kernels connected in sequence from left to right, and the activation functions used by the four groups of convolution kernels are, respectively: ReLU, ReLU, ReLU and Tanh.
Further, in step A:
$\sum_{i=1}^{p} N_i \gg K \times K$ (12)
that is, enough document pages are selected to construct a sufficient number of two-dimensional matrices for the subsequent modeling analysis.
Compared with the prior art, the invention has the following advantages and positive effects:
The scheme automatically generates the document image set and its annotation information, saving time and labor cost and avoiding the annotation errors introduced by manual labeling. A convolutional generative adversarial network built on a deep learning framework generates document images automatically: the discrimination network of the adversarial network learns from existing document images, and the generation network then automatically generates new document images, yielding a document image set at low cost and high efficiency. Moreover, because the network parameters are trained on existing document images, the generated document images closely resemble real publications and therefore have greater reference value. In addition, alongside the document image set, the text encoding information of the text objects in the document images (e.g., ASCII, Unicode) is provided, which better meets the needs of deep learning training.
Drawings
FIG. 1 is a flowchart illustrating a method for generating a document image set based on deep learning according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a document object sequence according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a sequence of object types according to an embodiment of the present invention;
FIG. 4 is (a) a schematic diagram of a document image and (b) a schematic diagram of an arrangement of a "document object type" sequence in a matrix;
FIG. 5 is a schematic diagram of a deep convolution discriminant network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a deep convolution generating network according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a document image generated by a "generating network" according to an embodiment of the present invention;
FIG. 8 is a diagram illustrating a document image set structure according to an embodiment of the present invention.
Detailed Description
In order to make the above objects, features and advantages of the present invention more clearly understood, the invention is further described below with reference to the accompanying drawings and examples. In the following description, numerous specific details are set forth to provide a thorough understanding of the invention; the invention may, however, be practiced in ways other than those described here, and it is therefore not limited to the specific embodiments disclosed below.
This embodiment provides a method for generating a document image set based on deep learning, as shown in FIG. 1, comprising the following steps:
First, vector space projection modeling: treat the objects in a document image page as a sequence, where each node in the sequence corresponds to the type of one object, giving a document object sequence and an object type sequence in one-to-one correspondence with it; rearrange the object type sequence into a corresponding two-dimensional matrix, thereby projecting the object type sequence from a one-dimensional vector space into a two-dimensional vector space;
Second, deep convolutional generative adversarial network modeling: the adversarial network comprises a discrimination network and a generation network; the discrimination network is trained on existing document images and serves to train the generation network; once trained, the generation network is used to generate corresponding two-dimensional matrices, with the aim of automatically generating the document image set in the subsequent steps;
Third, training the network model parameters: train the adversarial network constructed in the second step and solve for the network parameters; rearrange the document object type sequences of existing document images into two-dimensional matrices for network training; with random vectors as the input of the generation network, use the trained discrimination network to train the generation network;
Fourth, generating object type sequences: automatically output new two-dimensional matrices from the trained generation network, then project each new matrix back to a one-dimensional object class vector through the inverse of the first step, obtaining new document object type sequences;
Fifth, generating document object content: collect data for the various document object types and generate the specific content of the objects in the documents from the new document object type sequences produced in the fourth step;
Sixth, convert the documents generated in the fifth step into document images and produce the document image set, which comprises the document images, the coordinate information of the document objects, and the specific content of the document objects.
The technical scheme of the invention is described in detail below with reference to specific embodiments:
First, vector space projection modeling:
As shown in the first column of FIGS. 2 and 3, the objects in a document page can be viewed as a sequence, with each node in the sequence (first column of FIG. 3) corresponding to a type tag (second column of FIG. 3); the sequence is then projected into a two-dimensional K × K matrix space, as shown in FIG. 4.
In FIG. 2, one page of a document image contains 11 objects, in order: header, text, figure, drawing, legend, text, figure, legend, text, footer; these objects are arranged top-to-bottom, left-to-right (see FIG. 3), which is both the reading order and the writing and typesetting order. The 11 objects are defined as a document object sequence:
$DO_i, \quad i = 1, 2, 3 \ldots N$ (1)
where $DO_i$ denotes the i-th document object; in FIG. 2, N = 11. The 11 document objects form the left column ("document object sequence") of FIG. 3, and the corresponding right column is the "object type sequence", defined as:
$y_i, \quad i = 1, 2, 3 \ldots N$ (2)
$y_i \in \{Type_j \mid j = 1, 2, 3 \ldots M\}$ (3)
Specifically, in FIGS. 2 and 3, M = 5, and $Type_1$ to $Type_5$ are, respectively: header, text, figure, legend and footer. Formulas (1) and (2) characterize a "document object"/"object type" sequence pair in one page of a document image; such sequence pairs have the following characteristics:
<1> the sequence pair represented by formulas (1) and (2) reflects the top-to-bottom, left-to-right writing and typesetting order, capturing the ordering relation between the "document objects" within the same page;
<2> for any document or book longer than one page, an ordering relation also holds between any two "document objects" on different pages, in addition to the ordering relation between the "document objects" within the same page; the object sequence of each page is therefore treated as a vector, i.e., formulas (1) and (2) can be expressed in vector form:
$DO = [DO_1, DO_2, DO_3, \ldots DO_N]$ (4)
$Y = [y_1, y_2, y_3, \ldots y_N]$ (5)
Formula (5) is the one-dimensional vector formed by the "document object type" sequence of one page; the "document object type" sequences of several pages can be projected into a two-dimensional structure, namely a matrix.
As shown in FIG. 4, (a) shows three pages of a document and (b) a K × K matrix; each element of the matrix represents a "document object type", and all the "document object types" of the three pages in (a) fill the K × K matrix in (b) in top-to-bottom, left-to-right order. Let the "document object" sequence and the "object type" sequence of page p be expressed in vector form as:
$DO^p = [DO_1^p, DO_2^p, DO_3^p, \ldots DO_{N_p}^p]$ (6)
$Y^p = [y_1^p, y_2^p, y_3^p, \ldots y_{N_p}^p]$ (7)
where the superscript p denotes the p-th page and the subscript $N_p$ the number of "document objects" on page p; the type of the i-th object on page p is $y_i^p$, with $1 \le i \le N_p$; page p contains $N_p$ document objects and page p-1 contains $N_{p-1}$ document objects. Arranging the object type sequences of pages 1 through p in page order,
the position of $y_i^p$ in the entire sequence is (counting from 1):
$pos(y_i^p) = \sum_{j=1}^{p-1} N_j + i$ (8)
where $N_j$ denotes the number of document objects on the j-th page. Formula (8) is projected onto the K × K matrix shown in FIG. 4(b), whose column coordinate is:
$k_2 = \lfloor (pos - 1) / K \rfloor + 1$ (9)
namely: dividing pos - 1 by K, the quotient plus 1 becomes the column coordinate;
and whose row coordinate is:
$k_1 = ((pos - 1) \bmod K) + 1$ (10)
namely: dividing pos - 1 by K, the remainder plus 1 becomes the row coordinate;
meanwhile, once the coordinates $(k_1, k_2)$ of $y_i^p$ are determined, one can calculate:
$pos = (k_2 - 1) \times K + k_1$ (11)
Formulas (9)-(10) define the coordinate transformation that projects the one-dimensional object class vector into the K × K two-dimensional matrix, while formula (11) is the inverse transformation.
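For illustration, the coordinate transformation of formulas (8)-(11) can be sketched in a few lines of Python; the function names and the NumPy representation below are assumptions of this sketch, not part of the patent:

```python
import numpy as np

def project_to_matrix(page_type_sequences, K):
    """Concatenate the per-page object type sequences (formula (8)) and
    fill a K x K matrix top-to-bottom, left-to-right (formulas (9)-(10))."""
    flat = [t for page in page_type_sequences for t in page]
    assert len(flat) >= K * K, "not enough document objects (cf. formula (12))"
    A = np.zeros((K, K), dtype=np.int64)
    for pos, obj_type in enumerate(flat[:K * K], start=1):
        k2 = (pos - 1) // K + 1   # column coordinate, formula (9)
        k1 = (pos - 1) % K + 1    # row coordinate, formula (10)
        A[k1 - 1, k2 - 1] = obj_type
    return A

def matrix_to_sequence(A):
    """Inverse transformation (formula (11)): read the matrix back into a
    one-dimensional object type sequence, column by column."""
    K = A.shape[0]
    return [int(A[k1, k2]) for k2 in range(K) for k1 in range(K)]
```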
It should be emphasized that, in the present embodiment:
$\sum_{i=1}^{p} N_i \gg K \times K$ (12)
that is, enough document pages must be selected to construct a sufficient number of K × K matrices for the subsequent modeling analysis.
The K × K matrix shown in FIG. 4(b) is defined as follows:
$A = [a_{k_1, k_2}]_{K \times K}$ (13)
where $1 \le k_1, k_2 \le K$; via formulas (9)-(11), the elements of formula (13) can be put in one-to-one correspondence with the types $y_i^p$ of the objects on each page.
In this embodiment, as shown in FIGS. 2-4, the document object sequence is characterized through "space" mappings: the document layout information is abstracted into three "spaces", namely the "document object" sequence space, the "document object type" sequence space, and the K × K two-dimensional matrix space obtained by re-stacking the "document object type" sequence in columns. Two mappings connect the three spaces: (1) "document object" sequence space ←→ "document object type" sequence space, a natural one-to-one mapping; (2) "document object type" sequence space ←→ K × K two-dimensional matrix space, a coordinate transformation that projects a one-dimensional sequence vector into a two-dimensional matrix space. On the one hand, this makes it convenient to train the deep convolutional generative adversarial network with two-dimensional matrices; on the other hand, it lets the network conveniently generate new two-dimensional matrices for producing new "document object type" sequences, finally realizing automatic document image generation.
Second, deep convolutional generative adversarial network modeling:
The K × K two-dimensional matrices obtained in the previous step contain the category information of the objects of many document pages. In the second step, the K × K matrices output by the first step are taken as input and a deep convolutional generative adversarial network is used for modeling. The model contains two parts: the first, shown in FIG. 5, is called the "discrimination network" and judges whether an input K × K two-dimensional matrix can represent a document object category sequence; the other, shown in FIG. 6, is the "generation network", which generates new K × K two-dimensional matrices.
Specifically, in this embodiment:
On the one hand, a K × K two-dimensional matrix serves as the input of the discrimination network, whose structure is shown in FIG. 5: the first group of convolution kernels consists of 64 kernels of 3 × 3 × 1 and yields 64 feature maps; the signal then passes through three further groups of convolution kernels and finally through a fully connected layer, producing an output that identifies whether the input K × K matrix can represent a document object class sequence. The discrimination network is defined as:
D(para-d) (14)
where D(·) is the discrimination network structure and para-d its network parameters.
On the other hand, as shown in FIG. 6, a random vector of dimension d is generated; a fully connected layer of size d × 512 computes 512 two-dimensional matrices of size K/8 × K/8, and the subsequent groups of fractional (transposed) convolution kernels finally generate a new K × K matrix. The new K × K matrix is expected to characterize a document object class sequence well, which is checked with the discrimination network shown in FIG. 5. The generation network is defined as:
G(para-g) (15)
where G(·) is the generation network structure and para-g its network parameters.
The discrimination network (formula (14)) shown in FIG. 5 and the generation network (formula (15)) shown in FIG. 6 together form the deep convolutional generative adversarial network. In the generation network of FIG. 6, the activation functions used after the four groups of convolution kernels, from left to right, are respectively: ReLU, ReLU, ReLU and Tanh; in the discrimination network of FIG. 5, the activation functions used after all four groups of convolution kernels are ReLU.
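A minimal PyTorch sketch of the two networks follows. Only the details stated above are taken from the patent (four convolution groups plus one fully connected layer, 64 first-group kernels of 3 × 3 × 1 in the discrimination network, a d × 512 fully connected layer yielding 512 maps of size K/8 × K/8 in the generation network, and the ReLU/Tanh activations); all kernel sizes, strides and remaining channel widths are assumptions for illustration, with K = 16 and d = 100 as in the embodiment:

```python
import torch
import torch.nn as nn

K, Z_DIM = 16, 100  # matrix size and random-vector dimension of the embodiment

class Discriminator(nn.Module):
    """Four groups of convolution kernels then a fully connected layer,
    all ReLU activations (FIG. 5); widths past the first group assumed."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),     # 16 -> 8
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),   # 8 -> 4
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.ReLU(),  # 4 -> 2
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU(),  # 2 -> 1
        )
        self.fc = nn.Linear(512, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(self.features(x).flatten(1)))

class Generator(nn.Module):
    """Fully connected layer producing 512 maps of size K/8 x K/8, then
    four convolution groups with ReLU/ReLU/ReLU/Tanh (FIG. 6); the
    stride pattern is an assumption of this sketch."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(Z_DIM, 512 * (K // 8) * (K // 8))
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),  # 2 -> 4
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.Conv2d(64, 1, 3, stride=1, padding=1), nn.Tanh(),              # 16 -> 16
        )

    def forward(self, z):
        h = self.fc(z).view(-1, 512, K // 8, K // 8)
        return self.deconv(h)
```

The Tanh output of the generation network is a real-valued K × K matrix; a quantization step (illustrated in the fourth step below) maps it back to discrete object types.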
Third, training the network model parameters:
The loss function of the network is defined by the KL divergence (Kullback-Leibler divergence). The loss function places the discrimination network (FIG. 5) and the generation network (FIG. 6) in a game state, which is why the two networks together are called an adversarial network. The network is trained with the K × K two-dimensional matrices output by the first step, and the network parameters are solved by gradient descent.
In this embodiment, after the deep convolutional generative adversarial network has been constructed, it must be trained to obtain the optimal network parameters para-d (formula (14)) and para-g (formula (15)); this requires enough document pages (i.e., p in formulas (6)-(12) large enough) to obtain enough matrices A (formula (13)).
The loss function of the network is specifically defined by the KL divergence (Kullback-Leibler divergence):
$L(D, G) = \frac{1}{NS} \sum_{i=1}^{NS} [\log D(A_i) + \log(1 - D(G(P_i)))]$ (16)
where NS denotes the number of samples (naturally, $1 \le i \le NS$);
$A_i$ (17)
denotes the i-th sample of the matrices defined by formula (13), and
$P_i$ (18)
denotes the i-th sample of the d-dimensional random vector at the left of FIG. 6.
The process of training the network model parameters para-d and para-g of formulas (14) and (15) is the process of solving for the minimum of formula (16), carried out with the gradient functions:
$\nabla_{para\text{-}d} \frac{1}{NS} \sum_{i=1}^{NS} [\log D(A_i) + \log(1 - D(G(P_i)))]$, $\nabla_{para\text{-}g} \frac{1}{NS} \sum_{i=1}^{NS} \log(1 - D(G(P_i)))$ (19)
The specific algorithm is as follows:
[Algorithm 1: alternating training of the discrimination network and the generation network; the pseudocode is rendered only as an image in the source.]
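Since the source renders Algorithm 1 only as an image, the sketch below shows the alternating updates it describes, assuming the standard adversarial objective of formula (16) and Adam as the gradient-descent optimizer (an assumption; the patent only specifies gradient descent):

```python
import torch
import torch.nn as nn

def train_adversarial(D_net, G_net, real_batches, z_dim=100, epochs=100, lr=2e-4):
    """Alternating gradient updates in the spirit of formulas (16) and (19):
    the discrimination network is trained on real K x K matrices (label 1)
    versus generated ones (label 0); the generation network is then updated
    against the fixed discrimination network."""
    bce = nn.BCELoss()
    opt_d = torch.optim.Adam(D_net.parameters(), lr=lr)
    opt_g = torch.optim.Adam(G_net.parameters(), lr=lr)
    for _ in range(epochs):
        for A_real in real_batches:               # tensors of shape (B, 1, K, K)
            B = A_real.size(0)
            ones, zeros = torch.ones(B, 1), torch.zeros(B, 1)
            # discrimination network step: real matrices vs. generated ones
            z = torch.randn(B, z_dim)
            loss_d = bce(D_net(A_real), ones) + bce(D_net(G_net(z).detach()), zeros)
            opt_d.zero_grad(); loss_d.backward(); opt_d.step()
            # generation network step against the fixed discrimination network
            z = torch.randn(B, z_dim)
            loss_g = bce(D_net(G_net(z)), ones)
            opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return D_net, G_net
```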
In the second and third steps, a deep convolutional generative adversarial network is employed: the discrimination network of the adversarial network learns from the existing document images, and at the same time the discrimination network is used to train the generation network. The model contains two parts: the first, shown in FIG. 5, is the "discrimination network", which identifies whether an input K × K matrix can represent a document object class sequence; the other, shown in FIG. 6, is the "generation network", which generates new K × K matrices. After the discrimination network has learned from existing document images, the generation network automatically generates new document images, yielding a document image set; because the network parameters are trained with existing document images, the generated document images closely resemble real publications. During training, the KL divergence (Kullback-Leibler divergence) defines the loss function of the network, which keeps the discrimination network (FIG. 5) and the generation network (FIG. 6) in a game state; the two networks combined are therefore called an adversarial network. The network is trained with the K × K matrices output by the first step, and the network parameters are solved by gradient descent: the "discrimination network" is first trained with the K × K two-dimensional matrices, and the "discrimination network" is then used to train the "generation network".
Fourth, generating object type sequences:
A new K × K matrix is automatically generated with the "generation network" shown in FIG. 6 and then projected back to the one-dimensional object class vector to obtain a new "document object type" sequence. That is, the generation network trained in the third step outputs a K × K matrix (formula (13)), which is projected to the one-dimensional object class vector according to formula (11), giving "document object type" sequences of the form of formulas (5) and (7). The specific generation algorithm is as follows:
[Algorithm 2: generating new "document object type" sequences with the trained generation network; the pseudocode is rendered only as an image in the source.]
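Algorithm 2 is likewise rendered only as an image. The sketch below samples the trained generation network and applies the inverse projection of formula (11); the rule mapping the continuous Tanh outputs back to the M discrete type labels (linear rescaling plus rounding) is an assumption of this sketch:

```python
import torch

def generate_type_sequences(G_net, num_matrices, M=8, z_dim=100):
    """Sample the trained generation network, map each Tanh output in
    [-1, 1] to a discrete object type in {1..M}, then read each matrix
    out column by column (the inverse projection of formula (11))."""
    with torch.no_grad():
        z = torch.randn(num_matrices, z_dim)
        A = G_net(z).squeeze(1)                         # (num_matrices, K, K)
    types = ((A + 1) / 2 * (M - 1)).round().long() + 1  # [-1, 1] -> {1..M}
    return [t.t().reshape(-1).tolist() for t in types]  # column-major read-out
```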
fifthly, generating document object content:
firstly, collecting various document object data, then using the "document object type" sequence generated in the fourth step to generate specific contents of objects in the document, and generating a "document object type" sequence from "algorithm 2", wherein the sequence contains 16 × 16 × 100-25600 document object types. According to the setting in "algorithm 1", 25600 can generate 25600 document objects, and p > 4000 document pages in total. Next, a specific "document object" (expression (1)) is generated from 25600 "document object types".
To generate a "document object", data is collected from an existing PDF document as defined by equation (3), where the specific parameters are as described in "algorithm 1": { Type1,Type2,...Type8Text, formula, graph, legend, table name, header, footer. The data collected is defined as:
Set1,Set2,...Set8a footer set (20)
Then, according to the "document object type" sequence generated by the "algorithm 2", the document object (defined by the equation (1)), the coordinate information, and the content information are generated by using the TeX markup language and the data set of the equation (20). Wherein, the coordinate information:
DOi-Coors (21)
which is DO in formula (1)iWherein i is more than or equal to 1 and less than or equal to 25600. In addition, the content information:
DOi-Content (22)
is referred to as DOiThe specific contents of (1) are as follows: text codes, formulas, etc. The specific document object generation process is as follows:
[Algorithm 3: generating document objects, coordinates and content from a "document object type" sequence with the TeX markup language; the pseudocode is rendered only as an image in the source.]
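Algorithm 3 is also rendered only as an image. A sketch of the content-generation step it describes follows; the TeX templates and helper names are illustrative assumptions, and the eight type names follow formula (3) as completed above:

```python
import random

TYPE_NAMES = ["text", "formula", "figure", "legend",
              "table", "table name", "header", "footer"]   # Type_1 .. Type_8

def generate_document(type_sequence, collected_sets):
    """For each entry of a generated 'document object type' sequence, draw
    concrete content from the corresponding collected set (formula (20))
    and emit a TeX fragment; the drawn strings double as DO_i-Content
    (formula (22))."""
    tex_body, contents = [], []
    for y in type_sequence:                               # y in {1..8}
        content = random.choice(collected_sets[y - 1])    # Set_y
        contents.append(content)
        name = TYPE_NAMES[y - 1]
        if name == "figure":
            tex_body.append(r"\includegraphics{%s}" % content)
        elif name == "formula":
            tex_body.append("$$%s$$" % content)
        else:
            tex_body.append(content)
    return "\n\n".join(tex_body), contents
```

The coordinate information $DO_i$-Coors of formula (21) would be recovered only after the TeX source is typeset (for example, from the object boxes in the compiled PDF); that step is not shown here.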
In the fourth and fifth steps, after the deep convolutional generative adversarial network has been trained, the "generation network" (FIG. 6) generates new K × K matrices; each new two-dimensional K × K matrix is projected to the one-dimensional object class vector to obtain a new "document object type" sequence; document objects are generated automatically from the new "document object type" sequence and then converted into document images, producing a new document image set and thereby improving the generation efficiency and quality of the document image set.
Sixth, converting the documents into document images and generating the document image set:
The documents generated in the fifth step are converted into document images, and the document image set (containing the document images, the coordinate information of the document objects, and the content information of the document objects) is generated: each page of the PDF documents produced by Algorithm 3 is converted into a document image; FIG. 7 gives an automatically generated example. Each generated document image is defined as:
$DocImage_c, \quad c = 1, 2, \ldots p$ (23)
where p denotes the number of images in the document image dataset (p > 4000 according to "Algorithm 1" and the fifth step). Meanwhile, the document object space coordinates represented by formula (21) are mapped into the document images, giving:
$DO_i$-Coors′ (24)
The document image dataset can then be represented as:
$DocImageSet = \{ele_c\}, \quad c = 1, 2, \ldots p$ (25)
$ele_c = \{DocImage_c, DO_{i,c}\text{-Coors}', DO_{i,c}\text{-Content}\}$ (26)
Formula (25) defines the document image dataset, in which each $ele_c$, shown by the dashed box in FIG. 8, contains the spatial coordinate information of the N document objects of one image ($DO_{i,c}$-Coors′ in formula (26)) in one-to-one correspondence with the specific content information of each document object ($DO_{i,c}$-Content in formula (26)).
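The structure of formulas (25)-(26) can be illustrated as a plain Python record; the field names below are assumptions of this sketch:

```python
def build_document_image_set(doc_images, coords_per_image, contents_per_image):
    """Assemble the set of formulas (25)-(26): each element ele_c pairs one
    rendered page image DocImage_c with the mapped coordinates DO-Coors'
    and the content DO-Content of its document objects, in matching order."""
    dataset = []
    for img, coords, contents in zip(doc_images, coords_per_image,
                                     contents_per_image):
        assert len(coords) == len(contents)   # one-to-one correspondence
        dataset.append({
            "DocImage": img,         # DocImage_c, formula (23)
            "DO-Coors": coords,      # DO_{i,c}-Coors', formula (24)
            "DO-Content": contents,  # DO_{i,c}-Content, formula (26)
        })
    return dataset
```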
The above description gives only preferred embodiments of the present invention and is not intended to limit the invention to these forms. Any person skilled in the art may use the technical content disclosed above to make modifications or changes into equivalent embodiments; however, any simple modification, equivalent change or variation made to the above embodiments according to the technical essence of the present invention, without departing from the spirit of the technical scheme of the invention, still falls within the protection scope of the present invention.

Claims (5)

1. A method for generating a document image set based on deep learning, characterized by comprising the following steps:
Step A, vector space projection modeling: treat the objects in a document image page as a sequence, where each node in the sequence corresponds to the type of one object, giving a document object sequence and an object type sequence in one-to-one correspondence with it; rearrange the object type sequence into a corresponding two-dimensional matrix, thereby projecting the object type sequence from a one-dimensional vector space into a two-dimensional vector space;
Step B, deep convolutional generative adversarial network modeling: the adversarial network comprises a discrimination network and a generation network; the discrimination network is trained on existing document images and serves to train the generation network; once trained, the generation network is used to generate two-dimensional matrices, with the aim of automatically generating the document image set in the subsequent steps;
Step C, training the network model parameters: train the adversarial network constructed in step B and solve for the network parameters; rearrange the document object type sequences of existing document images into two-dimensional matrices for training the discrimination network; then train the generation network with the trained discrimination network;
Step D, generating object type sequences: automatically output new two-dimensional matrices from the trained generation network; then project each new matrix back into the one-dimensional vector space to obtain a new document object type sequence;
Step E, generating document object content: collect data for the various document object types and, from the new document object type sequences generated in step D, automatically generate the specific content of each document object;
Step F, convert the documents generated in step E into document images and produce the document image set, which comprises the document images, the coordinate information of the document objects, and the specific content of the document objects.
2. The method for generating a document image set based on deep learning of claim 1, wherein in step A the object types include headers, text, figures, legends, tables, formulas, page numbers and footers;
(1) defining the objects in a document image page as a document object sequence, namely:
$DO_i, \quad i = 1, 2, 3 \ldots N$ (1)
where $DO_i$ denotes the i-th document object and N denotes the number of document objects;
and defining the type sequence corresponding to the document object sequence as the object type sequence, namely:
$y_i, \quad i = 1, 2, 3 \ldots N$ (2)
$y_i \in \{Type_j \mid j = 1, 2, 3 \ldots M\}$ (3)
where $y_i$ denotes the type of the i-th document object, M denotes the number of object types, and $Type_j$ denotes a type;
(2) treating the document object sequence of each document image page as a vector, formulas (1) and (2) are expressed in vector form:
$DO = [DO_1, DO_2, DO_3, \ldots DO_N]$ (4)
$Y = [y_1, y_2, y_3, \ldots y_N]$; (5)
(3) given p pages of document images, the document object sequence and the object type sequence of the p-th page are expressed in vector form as:
$DO^p = [DO_1^p, DO_2^p, DO_3^p, \ldots DO_{N_p}^p]$ (6)
$Y^p = [y_1^p, y_2^p, y_3^p, \ldots y_{N_p}^p]$ (7)
where the superscript p denotes the p-th page and the subscript $N_p$ denotes the number of document objects on the p-th page; the type of the i-th object on page p is $y_i^p$, with $1 \le i \le N_p$; page p contains $N_p$ document objects and page p-1 contains $N_{p-1}$ document objects;
(4) arranging the object type sequences of pages 1 through p in page order, the position of $y_i^p$ in the whole sequence is:
$pos(y_i^p) = \sum_{j=1}^{p-1} N_j + i$ (8)
where $N_j$ denotes the number of document objects on the j-th page; formula (8) is projected into the two-dimensional matrix, where K denotes the number of rows and columns of the two-dimensional matrix (the two being equal); the column coordinate in the two-dimensional matrix is:
$k_2 = \lfloor (pos - 1) / K \rfloor + 1$ (9)
and the row coordinate of the two-dimensional matrix is:
$k_1 = ((pos - 1) \bmod K) + 1$ (10)
from which one further obtains the inverse:
$pos = (k_2 - 1) \times K + k_1$ (11)
(5) the two-dimensional matrix is defined as follows:
$A = [a_{k_1, k_2}]_{K \times K}$ (13)
where $1 \le k_1, k_2 \le K$; by formulas (9)-(11), the elements of formula (13) are put in one-to-one correspondence with the types $y_i^p$ of the i-th object on the p-th page.
3. The method for generating a document image set based on deep learning of claim 2, wherein in step C the adversarial network is trained as follows:
(1) the loss function of the network is defined by the KL divergence:
$L(D, G) = \frac{1}{NS} \sum_{i=1}^{NS} [\log D(A_i) + \log(1 - D(G(P_i)))]$
where NS denotes the number of samples, with $1 \le i \le NS$; $A_i$ denotes the i-th sample of the matrices defined by formula (13); $P_i$ denotes the i-th sample of the random vector input at the input end of the generation network;
(2) the two-dimensional matrices obtained in step A are used as input to train the network, and the network parameters are solved by gradient descent, with the gradient functions:
$\nabla_{para\text{-}d} \frac{1}{NS} \sum_{i=1}^{NS} [\log D(A_i) + \log(1 - D(G(P_i)))]$ and $\nabla_{para\text{-}g} \frac{1}{NS} \sum_{i=1}^{NS} \log(1 - D(G(P_i)))$
where D(para-d) is the discrimination network structure, with para-d the discrimination network parameters; G(para-g) is the generation network structure, with para-g the generation network parameters;
during training, the discrimination network is first trained with the two-dimensional matrices, and the trained discrimination network is then used to train the generation network.
4. The method for generating a document image set based on deep learning of claim 1, wherein in step B the discrimination network comprises four groups of convolution kernels and a fully connected layer connected in sequence from left to right, with the activation functions of all four groups of convolution kernels being ReLU; the generation network comprises a fully connected layer and four groups of convolution kernels connected in sequence from left to right, with the activation functions of the four groups of convolution kernels being, respectively: ReLU, ReLU, ReLU and Tanh.
5. The method for generating a document image set based on deep learning of claim 2, wherein in step A:
$\sum_{i=1}^{p} N_i \gg K \times K$ (12)
where K denotes the number of rows and columns of the two-dimensional matrix (the two being equal); that is, enough document pages are selected to construct a sufficient number of two-dimensional matrices for the subsequent modeling analysis.
Publications

CN202011178681.2A: application filed 2020-10-29
CN112347742A: published 2021-02-09
CN112347742B: granted, published 2022-05-31
