CN110427488A - The processing method and processing device of document - Google Patents
- Publication number
- CN110427488A CN110427488A CN201910697312.5A CN201910697312A CN110427488A CN 110427488 A CN110427488 A CN 110427488A CN 201910697312 A CN201910697312 A CN 201910697312A CN 110427488 A CN110427488 A CN 110427488A
- Authority
- CN
- China
- Prior art keywords
- text block
- text
- adjacent
- document
- label
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention discloses a document processing method and apparatus. The method comprises: obtaining multiple text blocks from a document; determining the label of each text block with a classifier, and analyzing each pair of adjacent text blocks with a preset language model to obtain the pair's continuity probability, where the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuity probability; and, when the continuity probability of a specified pair of adjacent text blocks is greater than a preset threshold and both blocks of the pair belong to the same label, merging the specified pair of adjacent text blocks. The invention solves the prior-art technical problem that merging text in a document has low accuracy because of line breaks, column splits, and similar features in the text.
Description
Technical field
The present invention relates to the field of document processing, and in particular to a document processing method and apparatus.
Background technique
Currently, PDF documents are stable and preserve document formatting well, and have therefore become an important format for storing and transmitting text documents. Most knowledge and information in industry and academia exists in PDF form, and an important prerequisite for the digitization of enterprises and institutions is to parse the knowledge stored in PDF documents, store it in a structured form, and then reuse it. However, the content produced by most current PDF parsing tools suffers from problems such as disordered positions and undeterminable directory levels, so the parsed text content must be classified and merged by technical means to achieve structured or semi-structured output.
Existing methods for classifying and merging parsed PDF content mainly determine the category of each parsed text block by keyword matching. Text is classified into categories such as title, body, and annotation according to category keywords; for titles, common title patterns such as "1.1" are matched, and a text block is judged to be a title if such a pattern matches. Multiple text blocks are then merged directly by their coordinates in the PDF, top to bottom and left to right.
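The coordinate-based merging described above can be sketched as follows. This is an illustrative reconstruction of the prior-art behavior, not code from the patent, and the block layout is invented for the example:

```python
# Prior-art style merging: sort parsed text blocks by coordinates
# (top to bottom, then left to right) and concatenate their text.
# Each block is (top, left, text), origin at the page's top-left corner.

def merge_by_coordinates(blocks):
    """Merge blocks purely by position, ignoring line breaks and columns."""
    ordered = sorted(blocks, key=lambda b: (b[0], b[1]))
    return " ".join(b[2] for b in ordered)

# Two-column layout: the naive sort interleaves the columns,
# which is exactly the failure mode criticized in the text.
blocks = [
    (100, 300, "column B line 1"),
    (100, 50, "column A line 1"),
    (120, 50, "column A line 2"),
]
print(merge_by_coordinates(blocks))
# -> column A line 1 column B line 1 column A line 2
```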
The main defect of keyword matching is that matching rules must be accumulated: title formats in PDF documents vary widely, so a large number of rules must be maintained, and the rules also suffer from low recall that is difficult to optimize. As for merging text blocks, merging purely by coordinates is too coarse: it takes no account of line breaks, column splits, and similar features in the text, causing frequent merge errors.
For the prior-art problem that merging text in a document has low accuracy because of line breaks, column splits, and similar features in the text, no effective solution has yet been proposed.
Summary of the invention
Embodiments of the invention provide a document processing method and apparatus, at least to solve the prior-art technical problem that merging text in a document has low accuracy because of line breaks, column splits, and similar features in the text.
According to one aspect of the embodiments of the invention, a document processing method is provided, comprising: obtaining multiple text blocks from a document; determining the label of each of the multiple text blocks with a classifier, and analyzing each pair of adjacent text blocks among the multiple text blocks with a preset language model to obtain the pair's continuity probability, where the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuity probability; and, when the continuity probability of a specified pair of adjacent text blocks is greater than a preset threshold and the specified pair belongs to the same label, merging the specified pair of adjacent text blocks.
Further, obtaining the multiple text blocks from the document includes: converting the document into a picture; and dividing the picture into different regions according to preset rules, where the different regions correspond respectively to the multiple text blocks.
Further, after the picture has been divided into different regions according to the preset rules, the method further includes: labeling each of the different regions, where the label applied to a region is the label of the corresponding text block.
According to another aspect of the embodiments of the invention, a document processing apparatus is also provided, comprising: an acquiring unit for obtaining multiple text blocks from a document; a determination unit for determining the label of each of the multiple text blocks with a classifier and for analyzing each pair of adjacent text blocks among the multiple text blocks with a preset language model to obtain the pair's continuity probability, where the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuity probability; and a combining unit for merging a specified pair of adjacent text blocks when the pair's continuity probability is greater than a preset threshold and the specified pair belongs to the same label.
Further, the acquiring unit includes: a first conversion module for converting the document into a picture; and a division module for dividing the picture into different regions according to preset rules, where the different regions correspond respectively to the multiple text blocks.
Further, the apparatus further includes: a labeling unit for labeling each of the different regions after the picture has been divided into them according to the preset rules, where the label applied to a region is the label of the corresponding text block.
According to another aspect of the embodiments of the invention, a storage medium is also provided. The storage medium comprises a stored program which, when run, executes the document processing method of any of the above embodiments.
According to another aspect of the embodiments of the invention, a processor is also provided. The processor is configured to run a program which, when run, executes the document processing method of any of the above embodiments.
In embodiments of the invention, multiple text blocks are obtained from a document; the label of each text block is determined with a classifier; each pair of adjacent text blocks is analyzed with a preset language model to obtain the pair's continuity probability, the model having been trained by machine learning on groups of data each comprising a pair of adjacent text blocks and their continuity probability; and a specified pair of adjacent text blocks is merged when its continuity probability is greater than a preset threshold and both blocks belong to the same label. Merging text blocks according to both their labels and the continuity probability of adjacent blocks achieves accurate text block merging, thereby solving the prior-art technical problem that merging text in a document has low accuracy because of line breaks, column splits, and similar features in the text.
Detailed description of the invention
The drawings described here are provided for a further understanding of the invention and constitute part of this application; the illustrative embodiments of the invention and their descriptions are used to explain the invention and do not improperly limit it. In the drawings:
Fig. 1 is a flowchart of the document processing method according to an embodiment of the invention;
Fig. 2 is a flowchart of a method for classifying and merging parsed PDF content according to a preferred embodiment of the invention;
Fig. 3 is a schematic diagram of picture annotation according to a preferred embodiment of the invention;
Fig. 4 is a schematic diagram of text blocks after parsing a PDF according to a preferred embodiment of the invention;
Fig. 5 is a bar chart comparing the per-category F1 test scores of the PDF merging method according to a preferred embodiment of the invention with those of the keyword matching method; and
Fig. 6 is a schematic diagram of the document processing apparatus according to an embodiment of the invention.
Specific embodiment
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention; all other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative work shall fall within the scope of protection of the invention.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data so designated are interchangeable where appropriate, so that the embodiments of the invention described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have", and any variants of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.
First, some of the nouns and terms that appear in the description of the embodiments of the invention are explained as follows:
F1 score: also known as the balanced F-score, defined as the harmonic mean of precision and recall:
F1 = 2 * precision * recall / (precision + recall),
where F1 is the F1 score, precision is the precision rate, and recall is the recall rate.
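As a quick check on this definition, the F1 score can be computed directly from precision and recall; this small helper is illustrative, not part of the patent:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The harmonic mean is pulled toward the smaller of the two values.
print(f1_score(0.5, 1.0))
# -> 0.6666666666666666
```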
Bigram language model: a language model is generally constructed as a probability distribution p(s) over a string s, where p(s) attempts to reflect the frequency with which the string s occurs as a sentence. For a sentence s = w1 w2 ... wl composed of l primitives (a "primitive" may be a character, word, or phrase; below we simply say "word"), the bigram model assumes that the word wi at position i depends only on the single word before it, w(i-1).
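A minimal bigram model makes the definition concrete. The sketch below estimates p(wi | w(i-1)) from counts with add-one smoothing; the smoothing scheme and the toy training sentences are assumptions, since the patent does not specify them:

```python
from collections import Counter

class BigramLM:
    """Count-based bigram language model with add-one smoothing
    (the smoothing scheme is an assumption)."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for words in sentences:
            padded = ["<s>"] + list(words) + ["</s>"]
            self.unigrams.update(padded)
            self.bigrams.update(zip(padded, padded[1:]))
        self.vocab = len(self.unigrams)

    def prob(self, prev, word):
        # The word at position i depends only on the single preceding word.
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab)

lm = BigramLM([
    ["severe", "disease", "insurance", "money"],
    ["mild", "disease", "insurance", "money"],
])
# A transition seen in training scores higher than an unseen one.
print(lm.prob("disease", "insurance") > lm.prob("disease", "money"))
# -> True
```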
Classifier: a model used by a classification algorithm, such as the common decision tree or logistic regression.
According to an embodiment of the invention, an embodiment of a document processing method is provided. It should be noted that the steps illustrated in the flowcharts of the drawings may be executed in a computer system such as a set of computer-executable instructions and, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from that given here.
The document processing method of the embodiments of the invention is described in detail below.
Fig. 1 is a flowchart of the document processing method according to an embodiment of the invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain multiple text blocks from a document.
The document in step S102 may be a document in PDF format; that is, through step S102 the multiple text blocks of a PDF document can be obtained.
It should be noted that obtaining the multiple text blocks from the document may include: converting the document into a picture, and dividing the picture into different regions according to preset rules, where the different regions correspond respectively to the multiple text blocks. Converting the document into a picture and annotating the picture by category determines the labels of the text in the document, and can do so conveniently and simply.
It should also be noted that, after the picture has been divided into different regions according to the preset rules, the method may further include: labeling each of the different regions, where the label applied to a region is the label of the corresponding text block.
The labels may be annotated on the picture with different rectangular boxes, using rectangles of different colors.
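The convert-then-divide step can be sketched as below. The third-party pdf2image package is one possible converter (the patent does not name a tool, and pdf2image requires poppler to be installed); the region rectangles and input file name are hypothetical:

```python
def pdf_to_pictures(pdf_path, dpi=200):
    """Render each page of a PDF as a picture using the third-party
    pdf2image package (an assumed tool choice; needs poppler)."""
    from pdf2image import convert_from_path
    return convert_from_path(pdf_path, dpi=dpi)

def divide_regions(page_width, page_height, boxes):
    """Divide a page picture into rectangular regions [x1, y1, x2, y2],
    clipped to the page bounds; each region corresponds to one text block."""
    clipped = []
    for x1, y1, x2, y2 in boxes:
        clipped.append((max(0, x1), max(0, y1),
                        min(page_width, x2), min(page_height, y2)))
    return clipped

if __name__ == "__main__":
    pages = pdf_to_pictures("document.pdf")  # hypothetical input file
    regions = divide_regions(pages[0].width, pages[0].height,
                             [[50, 40, 550, 90], [50, 100, 550, 760]])
    print(regions)
```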
Step S104: determine the label of each of the multiple text blocks with a classifier, and analyze each pair of adjacent text blocks among the multiple text blocks with a preset language model to obtain the pair's continuity probability, where the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuity probability.
Analyzing each pair of adjacent text blocks with the preset language model to obtain the pair's continuity probability may include: converting the pair of adjacent text blocks into speech information; inputting the speech information into the language model; and determining the continuity probability of the pair through the language model's analysis of the speech information.
Step S106: when the continuity probability of a specified pair of adjacent text blocks is greater than the preset threshold and the specified pair belongs to the same label, merge the specified pair of adjacent text blocks.
Through the above steps, multiple text blocks are obtained from a document; the label of each text block is determined with a classifier; each pair of adjacent text blocks is analyzed with a preset language model to obtain the pair's continuity probability, the model having been trained by machine learning on groups of data each comprising a pair of adjacent text blocks and their continuity probability; and a specified pair of adjacent text blocks is merged when its continuity probability is greater than the preset threshold and both blocks belong to the same label. Merging text blocks according to both their labels and the continuity probability of adjacent blocks achieves accurate text block merging, thereby solving the prior-art technical problem that merging text in a document has low accuracy because of line breaks, column splits, and similar features in the text.
The invention also provides a preferred embodiment, which gives a method for classifying and merging parsed PDF content (equivalent to the document processing method).
Fig. 2 shows the flowchart of the method for classifying and merging parsed PDF content. Its specific steps are as follows:
Step S202: annotate picture tags for the PDF document.
Part of the PDF text is annotated by text category, the categories including two or more of title, body, header, footer, annotation, table of contents, legend, table, figure caption, and table caption. The annotation method converts the PDF document into pictures and annotates the pictures.
The concrete operations of step S202 are as follows: each PDF document to be parsed is parsed and converted with a standard PDF parsing tool and a PDF imaging tool, yielding a parsed XML file and pictures in JPEG format. Because of the special file format of PDF itself, PDF content cannot be annotated efficiently in place; annotating images instead solves the annotation problem. The JPEG pictures are annotated with an annotation tool, the annotation categories including two or more of title, body, header, footer, annotation, table of contents, legend, table, figure caption, and table caption.
It should be noted that the annotation process is as follows: preset the categories to be annotated in the annotation tool; load the pictures to be annotated into the annotation tool; select the picture region corresponding to the text content to be annotated with a rectangular selection box; and judge the annotation category from the text in the picture and select the corresponding category.
For an insurance PDF, the annotated picture is shown in Fig. 3, the schematic diagram of picture annotation. Selection boxes of different colors represent different categories.
Step S204: obtain labels for the parsed PDF text blocks.
The labels of the parsed text blocks are obtained from the labels of the PDF-converted pictures: the picture coordinates are mapped to PDF coordinates, yielding the labeled text blocks in the PDF document.
From the annotation results on the PDF-converted picture, the picture dimensions, the annotated coordinate regions, and the label categories are obtained. The XML file produced by parsing the PDF contains PDF coordinates for each text block, so the correspondence between XML text blocks and annotated picture regions can be computed from the ratio between the PDF dimensions and the picture dimensions, finally yielding the label of each text block. The specific mapping method is as follows: the picture has width a and length b, and an annotated region is a rectangle [x1, y1, x2, y2], i.e., top-left corner [x1, y1] and bottom-right corner [x2, y2]; the PDF has width c and length d, and a text block coordinate is [m, n, k, l], corresponding to the distance from the top, the distance from the left, the text block width, and the text block height. An annotated rectangle on the picture maps to PDF document coordinates as [(d/b)*y1, (c/a)*x1, (c/a)*(x2-x1), (d/b)*(y2-y1)]. If the overlap ratio between a PDF text block's coordinates and a mapped annotation rectangle is greater than the threshold k1, the text block is labeled with the annotation category of the corresponding picture region.
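The mapping and the overlap test can be sketched as follows, assuming horizontal coordinates scale by c/a and vertical coordinates by d/b (the dimensionally consistent reading of the mapping above) and using an illustrative overlap threshold k1; the function names are not from the patent:

```python
def map_box_to_pdf(box, pic_w, pic_h, pdf_w, pdf_h):
    """Map an annotated picture rectangle [x1, y1, x2, y2] into PDF
    coordinates [top, left, width, height]."""
    x1, y1, x2, y2 = box
    sx, sy = pdf_w / pic_w, pdf_h / pic_h   # c/a and d/b
    return [sy * y1, sx * x1, sx * (x2 - x1), sy * (y2 - y1)]

def overlap_ratio(block, region):
    """Fraction of the text block [top, left, width, height] covered by region."""
    t1, l1, w1, h1 = block
    t2, l2, w2, h2 = region
    iw = max(0.0, min(l1 + w1, l2 + w2) - max(l1, l2))
    ih = max(0.0, min(t1 + h1, t2 + h2) - max(t1, t2))
    return (iw * ih) / (w1 * h1) if w1 * h1 else 0.0

def label_block(block, labeled_regions, k1=0.5):
    """Assign the category of the first mapped region overlapping above k1."""
    for region, category in labeled_regions:
        if overlap_ratio(block, region) > k1:
            return category
    return None

region = map_box_to_pdf([100, 80, 500, 160], pic_w=1000, pic_h=1600,
                        pdf_w=500, pdf_h=800)   # -> [40.0, 50.0, 200.0, 40.0]
print(label_block([42, 52, 190, 36], [(region, "title")]))
# -> title
```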
Step S206: build the classifier for parsed PDF text blocks.
For each text block after PDF parsing, text features, position features, and format features are extracted and combined with the annotation labels to build and train the classification model.
It should be noted that the label data from step S204 is used to build the training sample set; features are built from the text content of the block, its position, and the surrounding text blocks; and the classifier is trained with a decision-tree classification algorithm.
The specific steps for the classifier are as follows:
1) Each annotated text block is taken as one sample, forming the sample set.
2) Features for the current text block are constructed from its text content, position, and text format, with derived data constructed as needed. The features come from three classes. Text features: counts of numeric characters, Chinese characters, English characters, space characters, and special characters. Position features: the four average coordinate values of the page, the four maximum coordinate values of the page, the four minimum coordinate values of the page, and the four coordinate values of the text block. Format features: the percentage of the page's fonts accounted for by the current font, and the percentage of the page's font sizes accounted for by the current font size.
3) Logistic regression is used as the classifier, and the classification model is trained and tuned.
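The text-content features from step 2) can be sketched directly; the Unicode range used for Chinese characters and the ordering of the feature vector are assumptions. A model such as scikit-learn's LogisticRegression could then be trained and tuned on vectors like these, as in step 3):

```python
import string

def text_features(text):
    """Counts of numeric, Chinese, English, space, and special characters."""
    digits = sum(c.isdigit() for c in text)
    chinese = sum('\u4e00' <= c <= '\u9fff' for c in text)  # CJK range: assumed
    english = sum(c in string.ascii_letters for c in text)
    spaces = sum(c.isspace() for c in text)
    special = len(text) - digits - chinese - english - spaces
    return [digits, chinese, english, spaces, special]

def block_features(text, block_box, page_stats, font_pct, size_pct):
    """Concatenate text, position, and format features into one sample
    (the exact ordering is an assumption)."""
    return text_features(text) + list(block_box) + list(page_stats) + [font_pct, size_pct]

print(text_features("Section 1.1 保险金"))
# -> [2, 3, 7, 2, 1]
```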
Step S208: predict continuity probabilities for the parsed PDF text blocks.
A language model is trained on corpus content; according to the language model, it can then be predicted whether the content of adjacent text blocks is continuous.
For insurance-related PDF documents, insurance-related corpus data is obtained and a bigram language model (corresponding to the speech model above) is trained; the bigram language model is then used to predict the continuity probability of text block content. For an insurance PDF, the specific steps are as follows:
1) Obtain insurance text corpora: crawl related corpora from insurance websites and train the bigram language model.
2) Fig. 4 is a schematic diagram of the text blocks after PDF parsing. The four parsed text blocks are 1, 2, 3, and 4, and the language model predicts the pairwise continuity probabilities of their text content. Taking text block 1 as an example, three probability values are obtained: P(block 1, block 2), P(block 1, block 3), and P(block 1, block 4). If the content of text blocks 1 and 2 is continuous, P(block 1, block 2) should be greater than P(block 1, block 3) and P(block 1, block 4). Here P(block 1, block 2) denotes P("mild and severe disease", "insurance money"), i.e., the continuity probability predicted by the language model for block 1's text "mild and severe disease" followed by block 2's text "insurance money".
Step S210: merge PDF text blocks using the text block category classifier and the content-continuity language model predictions.
The classifier predicts the category of each text block, and the continuity language model assigns a content-continuity probability to each pair of adjacent text blocks; content is then merged according to both results.
It should be noted that the decision logic for merging PDF text blocks is: adjacent text blocks that have the same category and a continuity probability greater than the probability threshold are merged. For the insurance PDF document of Fig. 4, the specific steps are as follows:
The text block classifier gives text blocks 1 and 2 the category "title" and text blocks 3 and 4 the category "body";
the continuity language model finds that the pairs whose continuity probability exceeds the probability threshold k2 are P(block 1, block 2) and P(block 3, block 4);
by the rule "merge adjacent text blocks that have the same category and a continuity probability greater than the probability threshold", text block 1 is merged with text block 2 under the label "title", and text block 3 is merged with text block 4 under the label "body".
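The merge rule of step S210 can be sketched as below; the block texts, labels, continuity probabilities, and threshold value are illustrative, not taken from the patent:

```python
def merge_blocks(blocks, labels, continuity, k2=0.5):
    """Merge adjacent text blocks that share a label and whose continuity
    probability exceeds the threshold k2 (value illustrative).

    blocks: list of text strings; labels: one label per block;
    continuity[i]: P(blocks[i], blocks[i+1]) from the language model."""
    merged = [[blocks[0]]]
    merged_labels = [labels[0]]
    for i in range(1, len(blocks)):
        if labels[i] == merged_labels[-1] and continuity[i - 1] > k2:
            merged[-1].append(blocks[i])      # continue the current group
        else:
            merged.append([blocks[i]])        # start a new group
            merged_labels.append(labels[i])
    return [" ".join(group) for group in merged], merged_labels

texts, labs = merge_blocks(
    ["mild and severe disease", "insurance money", "paragraph one", "paragraph two"],
    ["title", "title", "body", "body"],
    continuity=[0.9, 0.1, 0.8])
print(texts, labs)
# -> blocks 1+2 merge under 'title', blocks 3+4 under 'body'
```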
Through the above steps, the training sample set is built by image annotation, and content classification and merging are performed by the classifier and the language model; there is no need to maintain a large number of rules by hand, while the accuracy is significantly improved. Fig. 5 shows the bar chart comparing the per-category F1 test scores of the PDF merging method of this preferred embodiment with those of the keyword matching method. The raw data behind Fig. 5 is shown in the following table.
Table 1
Text category | Method of the invention | Keyword matching method |
Title | 0.95 | 0.64 |
Body | 0.92 | 0.72 |
Header/footer | 0.89 | 0.85 |
Annotation | 0.90 | 0.82 |
According to an embodiment of the invention, an embodiment of a document processing apparatus is also provided. It should be noted that the apparatus can be used to execute the document processing method of the embodiments of the invention; in other words, the document processing method of the embodiments can be executed in the apparatus.
Fig. 6 is a schematic diagram of the document processing apparatus according to an embodiment of the invention. As shown in Fig. 6, the apparatus may include: an acquiring unit 61, a determination unit 63, and a combining unit 65, detailed as follows.
The acquiring unit 61 obtains multiple text blocks from a document.
The acquiring unit may include: a first conversion module for converting the document into a picture, and a division module for dividing the picture into different regions according to preset rules, where the different regions correspond respectively to the multiple text blocks.
The determination unit 63 determines the label of each of the multiple text blocks with a classifier, and analyzes each pair of adjacent text blocks among the multiple text blocks with a preset language model to obtain the pair's continuity probability, where the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuity probability.
The determination unit may include: a second conversion module for converting a pair of adjacent text blocks into speech information, and a determining module for inputting the speech information into the language model and determining the continuity probability of the pair through the language model's analysis of the speech information.
The combining unit 65 merges a specified pair of adjacent text blocks when the pair's continuity probability is greater than the preset threshold and the specified pair belongs to the same label.
It should be noted that the acquiring unit 61 in this embodiment may be used to execute step S102 of the embodiments of the invention, the determination unit 63 may be used to execute step S104, and the combining unit 65 may be used to execute step S106. The examples and application scenarios realized by these modules are the same as those of the corresponding steps, but are not limited to what the above embodiments disclose.
Optionally, the apparatus may further include: a labeling unit for labeling each of the different regions after the picture has been divided into them according to the preset rules, where the label applied to a region is the label of the corresponding text block.
According to another aspect of the embodiments of the invention, a storage medium is also provided. The storage medium comprises a stored program which, when run, controls the device on which the storage medium resides to perform the following operations: obtaining multiple text blocks from a document; determining the label of each text block with a classifier, and analyzing each pair of adjacent text blocks with a preset language model to obtain the pair's continuity probability, where the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuity probability; and merging a specified pair of adjacent text blocks when its continuity probability is greater than a preset threshold and the specified pair belongs to the same label.
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, wherein, when the program runs, the following operations are performed: obtaining multiple text blocks in a document; determining, by a classifier, the label to which each text block among the multiple text blocks belongs, and analyzing pairwise-adjacent text blocks among the multiple text blocks by a preset language model to obtain the continuous probability of the pairwise-adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data among the multiple groups of data includes pairwise-adjacent text blocks and the continuous probability of the pairwise-adjacent text blocks; and, in a case where it is determined that the continuous probability of specified pairwise-adjacent text blocks is greater than a preset threshold and the specified pairwise-adjacent text blocks belong to the same label, merging the specified pairwise-adjacent text blocks.
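The continuous probability above comes from a language model trained on groups of adjacent text blocks, but the embodiments do not fix a model architecture. As a purely illustrative stand-in, the toy sketch below counts the character pair spanning each block junction and estimates how often such a junction was a continuation in the training data:

```python
from collections import Counter

class BoundaryModel:
    """Toy boundary model: estimates how often the character pair spanning
    a block junction was a continuation among the training pairs."""
    def __init__(self):
        self.cont = Counter()   # junction bigrams seen at continuations
        self.total = Counter()  # junction bigrams seen at any boundary

    def train(self, pairs):
        # pairs: (prev_block, next_block, is_continuation) triples
        for prev, nxt, is_cont in pairs:
            bigram = (prev[-1], nxt[0])
            self.total[bigram] += 1
            if is_cont:
                self.cont[bigram] += 1

    def prob(self, prev, nxt):
        bigram = (prev[-1], nxt[0])
        if self.total[bigram] == 0:
            return 0.5  # unseen junction: no evidence either way
        return self.cont[bigram] / self.total[bigram]

model = BoundaryModel()
model.train([
    ("ends mid", "sentence", True),   # lowercase -> lowercase: continuation
    ("A title.", "Next para", False), # period -> uppercase: a break
])
print(model.prob("ends mid", "sun"))   # junction ('d', 's') -> 1.0
print(model.prob("A title.", "New"))   # junction ('.', 'N') -> 0.0
```

A production system would use a far richer model (e.g., a neural language model scoring whole block pairs); this sketch only shows the train-then-score shape that the embodiments describe.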
The serial numbers of the above embodiments of the present invention are merely for description and do not represent the superiority or inferiority of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other manners. The apparatus embodiments described above are merely exemplary. For example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the displayed or discussed mutual couplings, direct couplings, or communication connections may be implemented through some interfaces, and the indirect couplings or communication connections between units or modules may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or may be implemented in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above descriptions are merely preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. A document processing method, characterized by comprising:
obtaining multiple text blocks in a document;
determining, by a classifier, the label to which each text block among the multiple text blocks belongs, and analyzing pairwise-adjacent text blocks among the multiple text blocks by a preset language model to obtain the continuous probability of the pairwise-adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data among the multiple groups of data comprises pairwise-adjacent text blocks and the continuous probability of the pairwise-adjacent text blocks; and
in a case where it is determined that the continuous probability of specified pairwise-adjacent text blocks is greater than a preset threshold and the specified pairwise-adjacent text blocks belong to the same label, merging the specified pairwise-adjacent text blocks.
2. The method according to claim 1, characterized in that obtaining the multiple text blocks in the document comprises:
converting the document into a picture; and
dividing the picture into different regions according to a preset rule, wherein the different regions correspond respectively to the multiple text blocks.
3. The method according to claim 2, characterized in that, after the picture is divided into different regions according to the preset rule, the method further comprises:
marking each of the different regions with a label, wherein the labels marked on the different regions are the labels to which the respective text blocks belong.
4. A document processing apparatus, characterized by comprising:
an acquiring unit, configured to obtain multiple text blocks in a document;
a determination unit, configured to determine, by a classifier, the label to which each text block among the multiple text blocks belongs, and to analyze pairwise-adjacent text blocks among the multiple text blocks by a preset language model to obtain the continuous probability of the pairwise-adjacent text blocks, wherein the language model is trained by machine learning using multiple groups of data, and each group of data among the multiple groups of data comprises pairwise-adjacent text blocks and the continuous probability of the pairwise-adjacent text blocks; and
a merging unit, configured to merge specified pairwise-adjacent text blocks in a case where it is determined that the continuous probability of the specified pairwise-adjacent text blocks is greater than a preset threshold and the specified pairwise-adjacent text blocks belong to the same label.
5. The apparatus according to claim 4, characterized in that the acquiring unit comprises:
a first conversion module, configured to convert the document into a picture; and
a division module, configured to divide the picture into different regions according to a preset rule, wherein the different regions correspond respectively to the multiple text blocks.
6. The apparatus according to claim 5, characterized in that the apparatus further comprises:
a marking unit, configured to, after the picture is divided into different regions according to the preset rule, mark each of the different regions with a label, wherein the labels marked on the different regions are the labels to which the respective text blocks belong.
7. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to perform the method according to any one of claims 1 to 3.
8. A processor, characterized in that the processor is configured to run a program, wherein, when the program runs, the method according to any one of claims 1 to 3 is performed.
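Claims 2 and 3 divide the document picture into labeled regions according to a preset rule, but the claims deliberately leave the rule open. As one illustrative stand-in (not the rule of this patent), the sketch below splits a binarized page into horizontal bands separated by blank pixel rows, each band yielding one text-block region:

```python
def split_regions(page):
    """page: list of rows, each a list of 0/1 pixels (1 = ink).
    Returns (start_row, end_row) pairs for each non-blank band."""
    regions, start = [], None
    for i, row in enumerate(page):
        if any(row):
            if start is None:
                start = i  # a new band begins at the first inked row
        elif start is not None:
            regions.append((start, i - 1))  # a blank row closes the band
            start = None
    if start is not None:  # band running to the bottom of the page
        regions.append((start, len(page) - 1))
    return regions

page = [
    [0, 1, 1, 0],  # band 1
    [0, 1, 0, 0],
    [0, 0, 0, 0],  # blank separator row
    [1, 1, 1, 1],  # band 2
]
print(split_regions(page))  # two bands: rows 0-1 and row 3
```

Each returned band would then be passed to the classifier to obtain its label, as claim 3 describes.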
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910697312.5A CN110427488B (en) | 2019-07-30 | 2019-07-30 | Document processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110427488A true CN110427488A (en) | 2019-11-08 |
CN110427488B CN110427488B (en) | 2022-09-23 |
Family
ID=68413183
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910697312.5A Active CN110427488B (en) | 2019-07-30 | 2019-07-30 | Document processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110427488B (en) |
2019-07-30: application CN201910697312.5A filed in China; granted as patent CN110427488B (status: active)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110137900A1 (en) * | 2009-12-09 | 2011-06-09 | International Business Machines Corporation | Method to identify common structures in formatted text documents |
CN102033964A (en) * | 2011-01-13 | 2011-04-27 | 北京邮电大学 | Text classification method based on block partition and position weight |
CN105677764A (en) * | 2015-12-30 | 2016-06-15 | 百度在线网络技术(北京)有限公司 | Information extraction method and device |
US20180129944A1 (en) * | 2016-11-07 | 2018-05-10 | Xerox Corporation | Document understanding using conditional random fields |
CN107424166A (en) * | 2017-07-18 | 2017-12-01 | 深圳市速腾聚创科技有限公司 | Point cloud segmentation method and device |
CN107808011A (en) * | 2017-11-20 | 2018-03-16 | 北京大学深圳研究院 | Classification abstracting method, device, computer equipment and the storage medium of information |
CN108416279A (en) * | 2018-02-26 | 2018-08-17 | 阿博茨德(北京)科技有限公司 | Form analysis method and device in file and picture |
CN109857942A (en) * | 2019-03-14 | 2019-06-07 | 北京百度网讯科技有限公司 | For handling the method, apparatus, equipment and storage medium of document |
Non-Patent Citations (2)
Title |
---|
XING WANG et al.: "A Font Setting Based Bayesian Model to Extract Mathematical Expression in PDF Files", 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) * |
YAN Wentan: "Information Extraction from Semi-Structured Chinese Résumés", China Master's Theses Full-text Database (Information Science and Technology) * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113761906A (en) * | 2020-07-16 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Method, device, equipment and computer readable medium for analyzing document |
CN112101308A (en) * | 2020-11-11 | 2020-12-18 | 北京云测信息技术有限公司 | Method and device for combining text boxes based on language model and electronic equipment |
CN112101308B (en) * | 2020-11-11 | 2021-02-09 | 北京云测信息技术有限公司 | Method and device for combining text boxes based on language model and electronic equipment |
CN112818823A (en) * | 2021-01-28 | 2021-05-18 | 建信览智科技(北京)有限公司 | Text extraction method based on bill content and position information |
CN112818823B (en) * | 2021-01-28 | 2024-04-12 | 金科览智科技(北京)有限公司 | Text extraction method based on bill content and position information |
CN114495147A (en) * | 2022-01-25 | 2022-05-13 | 北京百度网讯科技有限公司 | Identification method, device, equipment and storage medium |
CN116306575A (en) * | 2023-05-10 | 2023-06-23 | 杭州恒生聚源信息技术有限公司 | Document analysis method, document analysis model training method and device and electronic equipment |
CN116306575B (en) * | 2023-05-10 | 2023-08-29 | 杭州恒生聚源信息技术有限公司 | Document analysis method, document analysis model training method and device and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110427488B (en) | 2022-09-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110427488A (en) | The processing method and processing device of document | |
CN108717406B (en) | Text emotion analysis method and device and storage medium | |
CN110795919B (en) | Form extraction method, device, equipment and medium in PDF document | |
US7783642B1 (en) | System and method of identifying web page semantic structures | |
CN106201465B (en) | Software project personalized recommendation method for open source community | |
US5669007A (en) | Method and system for analyzing the logical structure of a document | |
US7444325B2 (en) | Method and system for information extraction | |
EP1736901B1 (en) | Method for classifying sub-trees in semi-structured documents | |
US8874581B2 (en) | Employing topic models for semantic class mining | |
CN107590219A (en) | Webpage personage subject correlation message extracting method | |
WO2021084702A1 (en) | Document image analysis device, document image analysis method, and program | |
JP2009026195A (en) | Article classification apparatus, article classification method and program | |
JP2009193571A (en) | Method and device used for extracting webpage content | |
CN107767273B (en) | Asset configuration method based on social data, electronic device and medium | |
US20020016796A1 (en) | Document processing method, system and medium | |
CN115917613A (en) | Semantic representation of text in a document | |
CN108520065B (en) | Method, system, equipment and storage medium for constructing named entity recognition corpus | |
CN110929518B (en) | Text sequence labeling algorithm using overlapping splitting rule | |
Ha et al. | Information extraction from scanned invoice images using text analysis and layout features | |
JP5577546B2 (en) | Computer system | |
CN110020024B (en) | Method, system and equipment for classifying link resources in scientific and technological literature | |
CN112084376A (en) | Map knowledge based recommendation method and system and electronic device | |
Mohemad et al. | Automatic document structure analysis of structured PDF files | |
Bartík | Text-based web page classification with use of visual information | |
US20230177251A1 (en) | Method, device, and system for analyzing unstructured document |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||