CN110427488B - Document processing method and device


Info

Publication number: CN110427488B (application CN201910697312.5A)
Authority: CN (China)
Inventors: 袁灿, 于政
Assignee: Beijing Mininglamp Software System Co., Ltd.
Filing date: 2019-07-30
Publication dates: 2019-11-08 (CN110427488A), 2022-09-23 (CN110427488B)
Other languages: Chinese (zh)
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 - Clustering; Classification

Abstract

The invention discloses a document processing method and device. The method comprises: obtaining a plurality of text blocks in a document; determining, with a classifier, the label of each text block in the plurality of text blocks, and analyzing every two adjacent text blocks in the plurality of text blocks with a preset language model to obtain the continuity probability of the two adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data comprises two adjacent text blocks and the continuity probability of the two adjacent text blocks; and merging two specified adjacent text blocks when the continuity probability of the two specified adjacent text blocks is greater than a preset threshold and the two specified adjacent text blocks belong to the same label. The method and device solve the technical problem in the prior art that, when texts in a document are merged, line breaks, column breaks and the like in the text make the merging inaccurate.

Description

Document processing method and device
Technical Field
The invention relates to the field of document processing, and in particular to a document processing method and device.
Background
At present, the PDF document has become an important form for storing and transmitting text files because of its advantages such as file stability and good preservation of document formatting. Most knowledge and information in industry and academia exists in PDF form, and an important premise for the intellectualization of today's enterprises and public institutions is that the knowledge stored in PDF documents is parsed and then stored in a structured form for use. However, the content obtained by most current PDF parsing tools suffers from problems such as disordered positions and an undeterminable directory hierarchy, so the parsed text content needs to be classified and merged by technical means to reach the goal of structured or semi-structured storage.
Existing methods for classifying and merging PDF parsing results mainly determine the category of a parsed text block by keyword matching. Keyword matching classifies text into titles, body text, comments and so on through keywords of each text category; titles are usually matched with common keyword characters such as "1.1", and if a match exists the text block is determined to be a title. Multiple text blocks are then merged directly according to their coordinates in the PDF, from top to bottom and from left to right.
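For illustration of the keyword-matching approach described above (prior art, not the method of the invention), a minimal rule of this kind might look like the following sketch; the regular expression and the category names are assumptions chosen for the example.

import re

# Hypothetical keyword rule: text blocks that start with a section number such
# as "1.1" or "2.3.4" are treated as titles; everything else falls back to body.
TITLE_PATTERN = re.compile(r"^\d+(\.\d+)+\s")

def classify_by_keyword(text_block: str) -> str:
    """Rule-based category guess for a parsed text block."""
    return "title" if TITLE_PATTERN.match(text_block.strip()) else "body"

print(classify_by_keyword("1.1 Scope of insurance liability"))   # -> title
print(classify_by_keyword("The insurer shall pay the benefit"))  # -> body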
The main drawbacks of the keyword-matching method are that matching rules have to be accumulated, a large number of rules must be maintained for PDF documents with diverse title forms, and the recall of the rules is low and difficult to optimize. As for merging text blocks, merging purely by coordinates is too coarse and does not consider line breaks, column breaks and the like inside the text, so text merging contains many errors.
For the problem in the prior art that text merging is inaccurate because of line breaks, column breaks and the like when texts in documents are merged, no effective solution has been proposed so far.
Disclosure of Invention
The embodiments of the invention provide a document processing method and device, which at least solve the technical problem in the prior art that text merging is inaccurate because of line breaks, column breaks and the like in the text when texts in a document are merged.
According to one aspect of the embodiments of the present invention, a document processing method is provided, comprising: acquiring a plurality of text blocks in a document; determining, with a classifier, the label of each text block in the plurality of text blocks, and analyzing every two adjacent text blocks in the plurality of text blocks with a preset language model to obtain the continuity probability of the two adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data comprises two adjacent text blocks and the continuity probability of the two adjacent text blocks; and merging two specified adjacent text blocks when it is determined that the continuity probability of the two specified adjacent text blocks is greater than a preset threshold and the two specified adjacent text blocks belong to the same label.
Further, obtaining the plurality of text blocks in the document comprises: converting the document into a picture; and dividing the picture into different areas according to a preset rule, wherein the different areas respectively correspond to the text blocks.
Further, after the picture is divided into different regions according to the preset rule, the method further comprises: labeling the different regions respectively, wherein the labels of the different regions are the labels of the text blocks.
According to another aspect of the embodiments of the present invention, a document processing apparatus is also provided, comprising: an acquisition unit configured to acquire a plurality of text blocks in a document; a determining unit configured to determine, with a classifier, the label of each text block in the plurality of text blocks, and to analyze every two adjacent text blocks in the plurality of text blocks with a preset language model to obtain the continuity probability of the two adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data comprises two adjacent text blocks and the continuity probability of the two adjacent text blocks; and a merging unit configured to merge two specified adjacent text blocks when it is determined that the continuity probability of the two specified adjacent text blocks is greater than a preset threshold and the two specified adjacent text blocks belong to the same label.
Further, the acquisition unit comprises: a first conversion module configured to convert the document into a picture; and a dividing module configured to divide the picture into different areas according to a preset rule, wherein the different areas respectively correspond to the text blocks.
Further, the apparatus further comprises: a labeling unit configured to label the different regions respectively after the picture is divided into the different regions according to the preset rule, wherein the labels of the different regions are the labels of the text blocks.
According to another aspect of the embodiments of the present invention, a storage medium is also provided, comprising a stored program, wherein, when the program runs, the document processing method according to any one of the above is performed.
According to another aspect of the embodiments of the present invention, a processor is also provided, configured to run a program, wherein, when the program runs, the document processing method according to any one of the above is performed.
In the embodiments of the invention, a plurality of text blocks in a document are obtained; the label of each text block in the plurality of text blocks is determined with a classifier, and every two adjacent text blocks in the plurality of text blocks are analyzed with a preset language model to obtain the continuity probability of the two adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data comprises two adjacent text blocks and the continuity probability of the two adjacent text blocks; and two specified adjacent text blocks are merged when the continuity probability of the two specified adjacent text blocks is greater than a preset threshold and the two specified adjacent text blocks belong to the same label. The purpose of merging text blocks according to the labels of the text blocks and the continuity probability of adjacent text blocks is thus achieved, the technical effect of improving the accuracy of text block merging is realized, and the technical problem in the prior art that text merging is inaccurate because of line breaks, column breaks and the like in the text when texts in a document are merged is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention; they do not constitute an undue limitation of the invention. In the drawings:
FIG. 1 is a flow diagram of a method of processing a document according to an embodiment of the invention;
FIG. 2 is a flowchart of a method for classifying and merging PDF parsed content according to a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram of picture annotation according to a preferred embodiment of the present invention;
FIG. 4 is a schematic diagram of text blocks after PDF parsing according to a preferred embodiment of the present invention;
FIG. 5 is a histogram comparing the per-category F1 score test data of the PDF merging method according to the preferred embodiment of the present invention with that of the keyword matching method; and
FIG. 6 is a schematic diagram of a document processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood by those skilled in the art, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some of the terms or expressions appearing in the description of the embodiments of the present invention are explained below:
F1 score: also known as the balanced F-score, defined as the harmonic mean of precision and recall:
F1 = 2 × precision × recall / (precision + recall)
Wherein F1 is F1 score, precision is precision rate, and recall is recall rate.
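As a quick numeric check of the formula, assumed example values of precision 0.9 and recall 0.8 give an F1 score of about 0.847:

precision, recall = 0.9, 0.8                       # assumed example values
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))                                # 0.847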
Bigram language model: a language model is typically constructed as a probability distribution p(s) over a string s, where p(s) attempts to reflect the frequency with which the string s appears as a sentence. For a sentence s consisting of l primitives w1 w2 … wl (a primitive may be a word, a phrase, etc.; hereinafter the word is used as the general term), the bigram model assumes that the word w_i appearing at the i-th position is related only to the single word w_{i-1} immediately preceding it.
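A minimal sketch of estimating such a bigram model from a tokenized corpus and scoring a string with it is shown below; the maximum-likelihood counts with add-one smoothing are an assumption for illustration, since the embodiment does not fix a particular estimation method.

from collections import Counter
from typing import List, Tuple

def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count unigrams and bigrams over tokenized training sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_prob(words: List[str], unigrams: Counter,
                  bigrams: Counter, vocab_size: int) -> float:
    """p(w1 ... wl) under the bigram assumption, with add-one smoothing."""
    prob = 1.0
    padded = ["<s>"] + words + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob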
Classifier: the model used by a classification algorithm, for example a decision tree or logistic regression.
In accordance with an embodiment of the present invention, an embodiment of a document processing method is provided. It is noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one here.
The document processing method according to the embodiment of the present invention will be described in detail below.
Fig. 1 is a flowchart of a document processing method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps.
Step S102, obtaining a plurality of text blocks in the document.
The document in step S102 may include a document in PDF format; that is, with the scheme described in step S102, a plurality of text blocks in a PDF document can be acquired.
It should be noted that obtaining the plurality of text blocks in the document may include: converting the document into a picture; and dividing the picture into different areas according to a preset rule, wherein the different areas respectively correspond to the text blocks. That is, the document is converted into a picture, the picture is labeled according to the different categories, and the labels of the texts in the document are then determined, so that the labels of the texts in the document can be determined conveniently and simply.
It should be further noted that, after the picture is divided into different regions according to the preset rule, the method may further include: labeling the different regions respectively, wherein the labels of the different regions are the labels of each text block.
The labels can be marked on the picture with rectangular frames of different colors, different colors corresponding to different labels.
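A sketch of the conversion and annotation step might look as follows; the pdf2image library and the JSON-style annotation record are assumptions for illustration, since the embodiment only requires that the PDF be rendered as pictures and that each labeled region record its category and rectangle.

from pdf2image import convert_from_path   # assumed converter; any PDF-to-image tool works

def pdf_to_pictures(pdf_path: str, out_dir: str) -> list:
    """Render every page of the PDF as a JPEG picture for annotation."""
    pages = convert_from_path(pdf_path, dpi=200)
    paths = []
    for i, page in enumerate(pages):
        path = f"{out_dir}/page_{i + 1}.jpg"
        page.save(path, "JPEG")
        paths.append(path)
    return paths

# A hypothetical annotation record produced by the labeling tool: one rectangle
# [x1, y1, x2, y2] in picture coordinates plus its category for each region.
annotation = {
    "picture": "page_1.jpg",
    "picture_size": [1240, 1754],   # width a, length b
    "regions": [
        {"bbox": [100, 120, 900, 180], "label": "title"},
        {"bbox": [100, 200, 1100, 600], "label": "body"},
    ],
}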
Step S104, determining, with a classifier, the label of each text block in the plurality of text blocks, and analyzing every two adjacent text blocks in the plurality of text blocks with a preset language model to obtain the continuity probability of the two adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data comprises two adjacent text blocks and the continuity probability of the two adjacent text blocks.
Analyzing every two adjacent text blocks in the plurality of text blocks with the preset language model to obtain the continuity probability of the two adjacent text blocks may include: converting the two adjacent text blocks into voice information; and inputting the voice information into the language model, the continuity probability of the two adjacent text blocks being determined through the language model's analysis of the voice information.
Step S106, merging two specified adjacent text blocks when it is determined that the continuity probability of the two specified adjacent text blocks is greater than a preset threshold and the two specified adjacent text blocks belong to the same label.
Through the above steps, a plurality of text blocks in a document are obtained; the label of each text block in the plurality of text blocks is determined with a classifier, and every two adjacent text blocks are analyzed with a preset language model to obtain their continuity probability, the language model being trained by machine learning using multiple groups of data, each group of data comprising two adjacent text blocks and the continuity probability of the two adjacent text blocks; and two specified adjacent text blocks are merged when their continuity probability is greater than a preset threshold and the two blocks belong to the same label. The purpose of merging text blocks according to their labels and the continuity probability of adjacent text blocks is thus achieved, which yields the technical effect of improving the accuracy of text block merging and solves the technical problem in the prior art that text merging is inaccurate because of line breaks, column breaks and the like in the text when texts in a document are merged.
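Putting steps S102 to S106 together, a condensed sketch of the merging logic might read as follows; the function names and the way the classifier and language model are passed in are assumptions chosen for illustration.

from typing import Callable, Dict, List

def merge_text_blocks(blocks: List[str],
                      classify: Callable[[str], str],
                      continuity_prob: Callable[[str, str], float],
                      threshold: float) -> List[Dict[str, str]]:
    """Merge adjacent text blocks that share a label and are predicted continuous."""
    merged: List[Dict[str, str]] = []
    for block in blocks:
        label = classify(block)                          # step S104: label of the block
        if (merged
                and merged[-1]["label"] == label         # same label ...
                and continuity_prob(merged[-1]["text"], block) > threshold):
            merged[-1]["text"] += block                  # ... and continuous: merge (step S106)
        else:
            merged.append({"text": block, "label": label})
    return merged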
The invention further provides a preferred embodiment, which provides a method for classifying and merging PDF (Portable Document Format) parsed content (corresponding to the document processing method described above).
Fig. 2 shows a flowchart of the method for classifying and merging PDF parsed content. The specific steps are as follows:
Step S202, labeling pictures of the PDF document.
Part of the PDF texts are labeled according to text categories, the categories including two or more of title, body text, header, footer, comment, directory, legend, table, diagram and table description; the labeling method is to convert the PDF document into pictures and label the pictures.
The specific operation of step S202 is as follows. The PDF document to be parsed is parsed and converted with a standard PDF parsing tool and a PDF-to-picture tool to obtain a parsed XML file and JPEG pictures. Because of the special file format of PDF, efficient annotation cannot be carried out on the PDF directly, and annotating the PDF content with an image annotation method solves the problem that PDF content cannot be annotated. The JPEG pictures are labeled with a labeling tool, and the labeling categories include two or more of title, body text, header, footer, comment, directory, legend, table, diagram and table description.
It should be noted that the labeling process is as follows: preset the categories to be labeled in the labeling tool; load the pictures to be labeled into the labeling tool; select the picture range corresponding to the text content to be labeled with a rectangular selection box; and judge the labeling category according to the text in the picture and select the corresponding category.
Taking an insurance PDF as an example, the labeled picture is shown in Fig. 3, a schematic diagram of picture annotation. Rectangular selection boxes of different colors represent different categories.
Step S204, obtaining labels of the PDF-parsed text blocks.
The picture labels are converted into labels of the PDF-parsed text blocks: the picture coordinates are mapped to PDF coordinates, and the labels of the text blocks in the PDF document are thereby obtained.
According to the labeling result of the picture converted from the PDF, the picture size, the labeled coordinate area and the label category are obtained. The text blocks in the XML file obtained by PDF parsing carry PDF coordinates, so according to the ratio of the PDF size to the size of the converted picture, the correspondence between the XML text blocks and the labeled areas of the converted picture can be calculated, and the label of each text block is finally obtained. The specific mapping method is as follows: the picture size is width a and length b, and the labeled area is a rectangle with coordinates [x1, y1, x2, y2], i.e. upper-left corner [x1, y1] and lower-right corner [x2, y2]; the PDF size is width c and length d, and the coordinates of a text block are [m, n, k, l], corresponding to the distance from the top, the distance from the left, the width of the text block and the height of the text block. The labeled rectangle of the converted picture is mapped to the PDF document as [(c/a)·y1, (b/d)·x1, (b/d)·(x2 - x1), (c/a)·(y2 - y1)], and if the proportion of the overlap between the coordinates of a PDF text block and the mapped labeled area is greater than a threshold k1, the PDF text block is labeled with the category of the corresponding annotation on the converted picture.
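A sketch of the proportional mapping and the overlap test might look as follows; the corner-based rectangle convention and the helper names are assumptions used to illustrate the proportional scaling and the overlap-ratio criterion described above.

def map_picture_rect_to_pdf(rect, picture_size, pdf_size):
    """Scale a picture-space rectangle [x1, y1, x2, y2] into PDF space.

    Assumes the coordinates scale linearly by the ratio of the PDF size to
    the picture size, mirroring the proportional mapping described above.
    """
    a, b = picture_size            # picture width a, picture length b
    c, d = pdf_size                # PDF width c, PDF length d
    x1, y1, x2, y2 = rect
    return [x1 * c / a, y1 * d / b, x2 * c / a, y2 * d / b]

def overlap_ratio(block_rect, mapped_rect) -> float:
    """Share of the text block's area covered by the mapped annotation rectangle."""
    bx1, by1, bx2, by2 = block_rect
    mx1, my1, mx2, my2 = mapped_rect
    inter_w = max(0.0, min(bx2, mx2) - max(bx1, mx1))
    inter_h = max(0.0, min(by2, my2) - max(by1, my1))
    block_area = (bx2 - bx1) * (by2 - by1)
    return (inter_w * inter_h) / block_area if block_area else 0.0

# The text block inherits the annotation's label when overlap_ratio exceeds
# the threshold k1 mentioned above.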
Step S206, constructing a classifier for the PDF-parsed text blocks.
For the text blocks obtained by PDF parsing, text-information features, position features and format features of the text blocks are extracted, and a classification model is then constructed and trained with the labels.
It should be noted that a training sample set is constructed from the label data of step S204, features are constructed from the text content information, the position information and the information of the preceding and following text blocks, and a classifier is trained with a decision-tree classification algorithm.
The specific steps for the classifier are as follows: 1) for the labeled text blocks, take each text block as one sample and construct a sample set;
2) construct features, taking the text information, position information and text format of the current text block as the source data for feature construction;
the features fall into the following three categories. Text-information features: the number of numeric characters, the number of Chinese characters, the number of English characters, the number of space characters and the number of special characters. Position-information features: the average, maximum and minimum values of the four page coordinates, and the four coordinate values of the text block. Text-format features: the proportion of the current font among the fonts of the page, and the proportion of the current font size among the font sizes of the page.
3) train and tune the classification model with logistic regression as the classifier.
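A condensed sketch of the feature construction and classifier training might look as follows; the concrete feature helpers, the input record layout and the scikit-learn API are assumptions used to illustrate the text-information, position and format features and the logistic-regression classifier described above.

import re
from typing import Dict, List
from sklearn.linear_model import LogisticRegression

def block_features(block: Dict) -> List[float]:
    """Text-information, position and format features for one labeled text block."""
    text = block["text"]
    digits = sum(ch.isdigit() for ch in text)
    chinese = len(re.findall(r"[\u4e00-\u9fff]", text))
    english = len(re.findall(r"[A-Za-z]", text))
    spaces = text.count(" ")
    special = len(text) - digits - chinese - english - spaces
    x1, y1, x2, y2 = block["bbox"]                     # four coordinate values of the block
    font_size_ratio = block["font_size"] / block["page_font_size"]
    return [digits, chinese, english, spaces, special,
            x1, y1, x2, y2, font_size_ratio]

def train_block_classifier(labeled_blocks: List[Dict]) -> LogisticRegression:
    """Fit a logistic-regression classifier on the labeled sample set."""
    X = [block_features(b) for b in labeled_blocks]
    y = [b["label"] for b in labeled_blocks]
    return LogisticRegression(max_iter=1000).fit(X, y)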
Step S208, performing continuity probability prediction for the PDF-parsed text blocks.
A language model is trained with corpus content, and whether the contents are continuous is predicted with the language model.
Taking an insurance-related PDF document as an example, insurance-related corpus data is obtained, a bigram language model (as distinguished from a speech model) is trained, and the bigram language model is then used to predict the continuity probability of the contents of the text blocks. The specific steps are as follows:
1) Acquire an insurance text corpus by crawling insurance-related websites, and train the bigram language model.
2) A schematic diagram of text blocks after PDF parsing is shown in Fig. 4. The four parsed text blocks are 1, 2, 3 and 4, and the continuity probability of the text contents of the four text blocks is predicted with the language model. Taking text block 1 as an example, three probability values are obtained: P(text block 1, text block 2), P(text block 1, text block 3) and P(text block 1, text block 4). Among these three continuity probabilities, if the contents of text block 1 and text block 2 are continuous, then P(text block 1, text block 2) is greater than P(text block 1, text block 3) and P(text block 1, text block 4). Here P(text block 1, text block 2) represents P(mild condition and severe condition, insurance benefit), that is, the continuity probability predicted by the language model for the text content "mild condition and severe condition" of text block 1 followed by the text content "insurance benefit" of text block 2.
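A sketch of this pairwise comparison might look as follows; the character-level tokenization and the bound scoring function are assumptions, with score standing in for, e.g., the bigram sentence_prob sketched earlier with its counts already bound.

from typing import Callable, Iterable

def continuity_prob(left_text: str, right_text: str,
                    score: Callable[[list], float]) -> float:
    """Language-model probability that right_text directly continues left_text."""
    return score(list(left_text) + list(right_text))   # character-level tokens (assumed)

def best_continuation(left_text: str, candidates: Iterable[str],
                      score: Callable[[list], float]) -> str:
    """Among candidate blocks, pick the one most likely to follow left_text."""
    return max(candidates, key=lambda cand: continuity_prob(left_text, cand, score))

# For text block 1, compare P(block 1, block 2), P(block 1, block 3) and
# P(block 1, block 4); the largest value indicates the continuing block.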
Step S210, predicting and merging the PDF text blocks with the text block category classifier and the content-continuity language model.
The text block category classifier is used to predict the category of each text block, the content-continuity language model is used to compute the content continuity probability of adjacent text blocks, and content merging is finally decided according to the results of the two parts.
It should be noted that the logic for deciding whether to merge PDF text blocks is: merge adjacent text blocks that have the same category and whose continuity probability is greater than the probability threshold. Taking the insurance PDF document of Fig. 4 as an example, the specific steps are as follows:
the text block classifier is utilized to obtain that the categories of the text blocks 1 and 2 are 'titles', and the categories of the text blocks 3 and 4 are 'texts';
obtaining continuous probabilities of the text blocks which are larger than the probability threshold k2 by using the content continuous language model, wherein the continuous probabilities include P (text block 1, text block 2) and P (text block 3, text block 4);
according to the judgment rule that the text blocks with the same category and the continuous probability greater than the probability threshold value are combined, the text block 1 is combined with the text block 2 and has a label 'title', and the text block 3 is combined with the text block 4 and has a title 'body'.
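With assumed probability values for illustration, the decision of this example can be reproduced in a few lines:

# Classifier output and pairwise continuity probabilities for the four text
# blocks of Fig. 4; the probability values here are assumed for illustration.
labels = {1: "title", 2: "title", 3: "body", 4: "body"}
prob = {(1, 2): 0.93, (2, 3): 0.12, (3, 4): 0.88}
k2 = 0.5                                   # probability threshold

for (left, right), p in prob.items():
    if labels[left] == labels[right] and p > k2:
        print(f"merge text block {left} with text block {right} as '{labels[left]}'")
# merges blocks 1 and 2 under 'title', and blocks 3 and 4 under 'body'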
Through the above steps, a training sample set is constructed with an image annotation method, and content classification and merging are performed with a classifier and a language model, so a large number of rules no longer needs to be maintained manually while the accuracy is greatly improved. Fig. 5 is a histogram comparing the per-category F1 score test data of the PDF merging method of the preferred embodiment with that of the keyword matching method. The raw data of Fig. 5 is shown in Table 1.
TABLE 1
Text category          Method of the invention    Keyword matching method
Title                  0.95                       0.64
Body text              0.92                       0.72
Header and footer      0.89                       0.85
Note                   0.90                       0.82
According to an embodiment of the present invention, an embodiment of a document processing apparatus is further provided. It should be noted that the document processing apparatus may be used to execute the document processing method of the embodiment of the present invention; that is, the document processing method of the embodiment of the present invention may be executed in the document processing apparatus.
FIG. 6 is a schematic diagram of a document processing apparatus according to an embodiment of the present invention, which may include, as shown in FIG. 6: an acquisition unit 61, a determination unit 63, and a merging unit 65. The details are as follows.
An obtaining unit 61 is configured to obtain a plurality of text blocks in a document.
The acquisition unit may include: a first conversion module, configured to convert the document into a picture; and a dividing module, configured to divide the picture into different areas according to a preset rule, wherein the different areas respectively correspond to the text blocks.
The determining unit 63 is configured to determine, with the classifier, the label of each text block in the plurality of text blocks, and to analyze every two adjacent text blocks in the plurality of text blocks with a preset language model to obtain the continuity probability of the two adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data comprises two adjacent text blocks and the continuity probability of the two adjacent text blocks.
The determining unit may include: a second conversion module, configured to convert every two adjacent text blocks into voice information; and a determining module, configured to input the voice information into the language model and to determine the continuity probability of the two adjacent text blocks through the language model's analysis of the voice information.
The merging unit 65 is configured to merge two specified adjacent text blocks when it is determined that the continuity probability of the two specified adjacent text blocks is greater than a preset threshold and the two specified adjacent text blocks belong to the same label.
It should be noted that the obtaining unit 61 in this embodiment may be configured to execute step S102 in this embodiment of the present invention, the determining unit 63 in this embodiment may be configured to execute step S104 in this embodiment of the present invention, and the combining unit 65 in this embodiment may be configured to execute step S106 in this embodiment of the present invention. The modules are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to the disclosure of the above embodiments.
Optionally, the apparatus may further include: a labeling unit, configured to label the different regions respectively after the picture is divided into the different regions according to the preset rule, wherein the labels of the different regions are the labels of each text block.
According to another aspect of the embodiments of the present invention, a storage medium is also provided, comprising a stored program, wherein, when executed, the program controls a device on which the storage medium is located to perform the following operations: obtaining a plurality of text blocks in a document; determining, with a classifier, the label of each text block in the plurality of text blocks, and analyzing every two adjacent text blocks in the plurality of text blocks with a preset language model to obtain the continuity probability of the two adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data comprises two adjacent text blocks and the continuity probability of the two adjacent text blocks; and merging two specified adjacent text blocks when it is determined that the continuity probability of the two specified adjacent text blocks is greater than a preset threshold and the two specified adjacent text blocks belong to the same label.
According to another aspect of the embodiments of the present invention, a processor is also provided, configured to run a program, wherein, when running, the program performs the following operations: obtaining a plurality of text blocks in a document; determining, with a classifier, the label of each text block in the plurality of text blocks, and analyzing every two adjacent text blocks in the plurality of text blocks with a preset language model to obtain the continuity probability of the two adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data comprises two adjacent text blocks and the continuity probability of the two adjacent text blocks; and merging two specified adjacent text blocks when it is determined that the continuity probability of the two specified adjacent text blocks is greater than a preset threshold and the two specified adjacent text blocks belong to the same label.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may also be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be construed as the protection scope of the present invention.

Claims (8)

1. A method for processing a document, comprising:
acquiring a plurality of text blocks in a document, wherein each text block corresponds to a color label;
determining, with a classifier, the label of each text block in the plurality of text blocks, and analyzing every two adjacent text blocks in the plurality of text blocks with a preset language model to obtain the continuity probability of the two adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data in the multiple groups of data comprises: two adjacent text blocks and the continuity probability of the two adjacent text blocks;
merging two specified adjacent text blocks when it is determined that the continuity probability of the two specified adjacent text blocks is greater than a preset threshold and the two specified adjacent text blocks belong to the same label;
wherein analyzing every two adjacent text blocks in the plurality of text blocks with the preset language model to obtain the continuity probability of the two adjacent text blocks comprises: converting the two adjacent text blocks into voice information; and inputting the voice information into the language model, and determining the continuity probability of the two adjacent text blocks through the language model's analysis of the voice information.
2. The method of claim 1, wherein obtaining the plurality of text blocks in the document comprises:
converting the document into a picture;
and dividing the picture into different areas according to a preset rule, wherein the different areas respectively correspond to the text blocks.
3. The method of claim 2, wherein after the picture is divided into different regions according to a preset rule, the method further comprises:
labeling the different regions respectively, wherein the labels of the different regions are the labels of each text block.
4. A device for processing a document, comprising:
an acquisition unit, configured to acquire a plurality of text blocks in a document, wherein each text block corresponds to one color label;
a determining unit, configured to determine, with a classifier, the label of each text block in the plurality of text blocks, and to analyze every two adjacent text blocks in the plurality of text blocks with a preset language model to obtain the continuity probability of the two adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data in the multiple groups of data comprises: two adjacent text blocks and the continuity probability of the two adjacent text blocks;
a merging unit, configured to merge two specified adjacent text blocks when it is determined that the continuity probability of the two specified adjacent text blocks is greater than a preset threshold and the two specified adjacent text blocks belong to the same label;
wherein the determining unit is further configured to convert the two adjacent text blocks into voice information, input the voice information into the language model, and determine the continuity probability of the two adjacent text blocks through the language model's analysis of the voice information.
5. The apparatus of claim 4, wherein the obtaining unit comprises:
the first conversion module is used for converting the document into a picture;
and a dividing module, configured to divide the picture into different areas according to a preset rule, wherein the different areas respectively correspond to the text blocks.
6. The apparatus of claim 5, further comprising:
a labeling unit, configured to label the different regions respectively after the picture is divided into the different regions according to the preset rule, wherein the labels of the different regions are the labels of each text block.
7. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program, when executed, controls an apparatus in which the storage medium is located to perform the method of any one of claims 1 to 3.
8. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 3.