CN110427488A - Document processing method and device - Google Patents

Document processing method and device

Info

Publication number
CN110427488A
CN110427488A (application CN201910697312.5A)
Authority
CN
China
Prior art keywords
text block
text
adjacent
document
label
Prior art date
Legal status
Granted
Application number
CN201910697312.5A
Other languages
Chinese (zh)
Other versions
CN110427488B (en)
Inventor
袁灿
于政
Current Assignee
Beijing Mininglamp Software System Co ltd
Original Assignee
Beijing Mininglamp Software System Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Mininglamp Software System Co., Ltd.
Priority to CN201910697312.5A
Publication of CN110427488A
Application granted; publication of CN110427488B
Legal status: Active


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 — Information retrieval of unstructured textual data
    • G06F 16/35 — Clustering; Classification

Abstract

The invention discloses a document processing method and device. The method comprises: obtaining multiple text blocks from a document; determining the label of each of the multiple text blocks with a classifier, and analyzing each pair of adjacent text blocks with a preset language model to obtain the continuation probability of the pair, wherein the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuation probability; and, when it is determined that the continuation probability of a specified pair of adjacent text blocks is greater than a preset threshold and the specified pair belongs to the same label, merging the specified pair of adjacent text blocks. The invention solves the prior-art technical problem that merging text in a document is inaccurate when the text contains line breaks, paragraph splits, and similar situations.

Description

Document processing method and device
Technical field
The present invention relates to the field of document processing, and in particular to a document processing method and device.
Background technique
At present, the PDF document has become an important format for storing and transmitting text documents owing to its stability and good preservation of document formatting. Most knowledge and information in industry and academia exists in the form of PDF documents, and an important prerequisite for the intelligentization of enterprises and institutions is to parse the knowledge stored in various PDF documents, store it in structured form, and then apply it. The content obtained by most current PDF parsing tools suffers from disordered positions, undeterminable table-of-contents levels, and similar problems, so the parsed text content must be classified and merged by technical means to achieve structured or semi-structured storage.
Existing methods for classifying and merging parsed PDF content mainly determine the category of each parsed text block by keyword matching. Keyword matching classifies text by category keywords, for example into title, body text, annotation, and so on; for titles, common title characters such as "1.1" are matched, and a text block is judged to be a title if such a match exists. Multiple text blocks are then merged directly by their coordinates in the PDF, from top to bottom and from left to right.
The main defect of the keyword matching method is that matching rules must be accumulated; the title forms of PDF documents vary endlessly, so a large number of rules must be maintained, and the rules also suffer from low recall that is difficult to optimize. As for merging text blocks, merging by coordinates alone is too coarse: it takes no account of line breaks, paragraph splits, and similar situations in the text, so merging produces many errors.
For the prior-art problem that merging text in a document is inaccurate because the text contains line breaks, paragraph splits, and similar situations, no effective solution has yet been proposed.
Summary of the invention
Embodiments of the present invention provide a document processing method and device, to at least solve the prior-art technical problem that merging text in a document is inaccurate when the text contains line breaks, paragraph splits, and similar situations.
According to one aspect of the embodiments of the present invention, a document processing method is provided, comprising: obtaining multiple text blocks from a document; determining the label of each of the multiple text blocks with a classifier, and analyzing each pair of adjacent text blocks among the multiple text blocks with a preset language model to obtain the continuation probability of the pair, wherein the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuation probability; and, when it is determined that the continuation probability of a specified pair of adjacent text blocks is greater than a preset threshold and the specified pair belongs to the same label, merging the specified pair of adjacent text blocks.
Further, obtaining the multiple text blocks from the document comprises: converting the document into a picture; and dividing the picture into different regions according to preset rules, wherein the different regions correspond respectively to the multiple text blocks.
Further, after dividing the picture into different regions according to preset rules, the method also comprises: marking a label on each of the different regions, wherein the label marked on each region is the label of the corresponding text block.
According to another aspect of the embodiments of the present invention, a document processing device is also provided, comprising: an acquiring unit, configured to obtain multiple text blocks from a document; a determination unit, configured to determine the label of each of the multiple text blocks with a classifier, and to analyze each pair of adjacent text blocks among the multiple text blocks with a preset language model to obtain the continuation probability of the pair, wherein the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuation probability; and a combining unit, configured to merge a specified pair of adjacent text blocks when it is determined that their continuation probability is greater than a preset threshold and the specified pair belongs to the same label.
Further, the acquiring unit comprises: a first conversion module, configured to convert the document into a picture; and a division module, configured to divide the picture into different regions according to preset rules, wherein the different regions correspond respectively to the multiple text blocks.
Further, the device also comprises: a marking unit, configured to mark a label on each of the different regions after the picture is divided into different regions according to preset rules, wherein the label marked on each region is the label of the corresponding text block.
According to another aspect of the embodiments of the present invention, a storage medium is also provided. The storage medium comprises a stored program, wherein, when the program runs, the document processing method of any of the above embodiments is executed.
According to another aspect of the embodiments of the present invention, a processor is also provided. The processor is configured to run a program, wherein, when the program runs, the document processing method of any of the above embodiments is executed.
In the embodiments of the present invention, multiple text blocks are obtained from a document; the label of each text block is determined with a classifier, and each pair of adjacent text blocks is analyzed with a preset language model to obtain the continuation probability of the pair, wherein the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuation probability; and a specified pair of adjacent text blocks is merged when its continuation probability is greater than a preset threshold and both blocks belong to the same label. This achieves the purpose of merging text blocks according to the label of each block and the continuation probability of adjacent blocks, realizes the technical effect of accurate text-block merging, and thereby solves the prior-art technical problem that merging text in a document is inaccurate when the text contains line breaks, paragraph splits, and similar situations.
Detailed description of the invention
The drawings described herein are provided for a further understanding of the present invention and constitute part of this application; the illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute improper limitations of the present invention. In the drawings:
Fig. 1 is the flow chart of the document processing method according to an embodiment of the present invention;
Fig. 2 is the flow chart of a method for classifying and merging parsed PDF content according to a preferred embodiment of the invention;
Fig. 3 is a schematic diagram of picture marking according to a preferred embodiment of the invention;
Fig. 4 is a schematic diagram of text blocks after parsing a PDF according to a preferred embodiment of the invention;
Fig. 5 is a column diagram comparing per-category F1 score test data of the PDF merging method according to a preferred embodiment of the invention and the keyword matching method; and
Fig. 6 is a schematic diagram of the document processing device according to an embodiment of the present invention.
Specific embodiment
In order to enable those skilled in the art to better understand the solution of the present invention, the technical solution in the embodiments of the present invention is described clearly and completely below in conjunction with the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the scope protected by the present invention.
It should be noted that the terms "first", "second", and so on in the description, claims, and drawings of this specification are used to distinguish similar objects, not to describe a particular order or precedence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in sequences other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to that process, method, product, or device.
First, some of the nouns or terms appearing in the description of the embodiments of the present invention are explained as follows:
F1 score: also known as the balanced F score, defined as the harmonic mean of precision and recall:

F1 = 2 × precision × recall / (precision + recall)

where F1 is the F1 score, precision is the precision rate, and recall is the recall rate.
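As a quick check of the definition, the score can be computed with a minimal helper (not part of the patent, just the standard formula):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.9, 0.8), 3))
```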
Bigram language model: a language model is generally constructed as a probability distribution p(s) over strings s, where p(s) attempts to reflect the frequency with which the string s occurs as a sentence. For a sentence s = w1 w2 … wl composed of l primitives (a "primitive" can be a character, word, or phrase; hereinafter simply "word"), the bigram model assumes that the word wi appearing at position i depends only on the single preceding word wi−1.
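A bigram model of this kind can be sketched as follows. The class name is illustrative, and the add-one (Laplace) smoothing is an assumption for the sketch, since the patent does not specify a smoothing scheme:

```python
from collections import defaultdict

class BigramLM:
    """Minimal bigram language model with add-one smoothing (a sketch)."""
    def __init__(self):
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        self.vocab = set()

    def train(self, sentences):
        for words in sentences:
            padded = ["<s>"] + list(words) + ["</s>"]
            for a, b in zip(padded, padded[1:]):
                self.unigrams[a] += 1
                self.bigrams[(a, b)] += 1
                self.vocab.update((a, b))

    def prob(self, prev, word):
        # p(word | prev) with add-one smoothing over the observed vocabulary
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + len(self.vocab))
```

Training on a corpus of tokenized sentences then lets `prob` rank which word is the more likely continuation of a given previous word.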
Classifier: the model used by a classification algorithm, for example the common decision tree or logistic regression.
According to an embodiment of the present invention, an embodiment of a document processing method is provided. It should be noted that the steps illustrated in the flow charts of the drawings may be executed in a computer system such as a set of computer-executable instructions, and that, although a logical order is shown in the flow charts, in some cases the steps shown or described may be executed in an order different from that given herein.
The document processing method of the embodiment of the present invention is described in detail below.
Fig. 1 is the flow chart of the document processing method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: obtain multiple text blocks from a document.
The document in step S102 may be a document in PDF format; that is, through the scheme recorded in step S102, multiple text blocks in a PDF document can be obtained.
It should be noted that obtaining the multiple text blocks from the document may comprise: converting the document into a picture; and dividing the picture into different regions according to preset rules, wherein the different regions correspond respectively to the multiple text blocks. Converting the document into a picture and marking the picture according to category makes it convenient and simple to determine the label of the text in the document.
It should also be noted that, after dividing the picture into different regions according to preset rules, the method may further comprise: marking a label on each of the different regions, wherein the label marked on each region is the label of the corresponding text block.
The labels may be marked on the picture with different rectangular boxes, for example rectangular boxes of different colors.
Step S104: determine the label of each of the multiple text blocks with a classifier, and analyze each pair of adjacent text blocks among the multiple text blocks with a preset language model to obtain the continuation probability of the pair, wherein the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuation probability.
Analyzing each pair of adjacent text blocks with the preset language model to obtain the continuation probability of the pair may comprise: converting the pair of adjacent text blocks into language information; and inputting the language information into the language model, which determines the continuation probability of the pair by analyzing the language information.
Step S106: when it is determined that the continuation probability of a specified pair of adjacent text blocks is greater than a preset threshold and the specified pair belongs to the same label, merge the specified pair of adjacent text blocks.
Through the above steps, multiple text blocks are obtained from a document; the label of each text block is determined with a classifier, and each pair of adjacent text blocks is analyzed with a preset language model to obtain the continuation probability of the pair, wherein the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuation probability; and a specified pair of adjacent text blocks is merged when its continuation probability is greater than a preset threshold and both blocks belong to the same label. This achieves the purpose of merging text blocks according to the label of each block and the continuation probability of adjacent blocks, realizes the technical effect of accurate text-block merging, and thereby solves the prior-art technical problem that merging text in a document is inaccurate when the text contains line breaks, paragraph splits, and similar situations.
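The merge criterion of steps S102–S106 can be sketched in a few lines. `TextBlock`, `merge_adjacent`, and the `continuation_prob` callback are illustrative names, not the patent's; the probability function is assumed to be supplied by the trained language model:

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str
    label: str  # e.g. "title", "body"

def merge_adjacent(blocks, continuation_prob, threshold=0.5):
    """Merge each adjacent pair that shares a label and whose continuation
    probability exceeds the preset threshold (sketch of steps S104-S106)."""
    if not blocks:
        return []
    merged = [blocks[0]]
    for block in blocks[1:]:
        prev = merged[-1]
        if prev.label == block.label and continuation_prob(prev.text, block.text) > threshold:
            merged[-1] = TextBlock(prev.text + block.text, prev.label)
        else:
            merged.append(block)
    return merged
```

With a constant probability function the rule reduces to merging runs of same-label blocks, which makes the two conditions of the criterion easy to see in isolation.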
The present invention also provides a preferred embodiment, which provides a method for classifying and merging parsed PDF content (equivalent to the document processing method).
Fig. 2 shows the flow chart of the method for classifying and merging parsed PDF content. The specific steps are as follows:
Step S202: mark picture tags on the PDF document.
A portion of the PDF text is marked according to text category; the categories include two or more of title, body text, header, footer, annotation, table of contents, legend, table, figure caption, and table caption. The marking method is to convert the PDF document into a picture and mark the picture.
The concrete operations of step S202 are as follows. The PDF document to be parsed is parsed and converted with a standard PDF parsing tool and a PDF picturization tool, yielding a parsed XML file and a JPEG picture. Because the special file format of PDF itself cannot be marked efficiently, marking the PDF content with an image-marking method solves the marking problem. The JPEG picture is marked with an annotation tool; the marking categories include two or more of title, body text, header, footer, annotation, table of contents, legend, table, figure caption, and table caption.
It should be noted that the marking process is as follows: preset the categories to be marked in the annotation tool; load the picture to be marked into the annotation tool; select the picture region corresponding to the text content to be marked with a rectangular selection box; and judge the marking category from the text in the picture and select the corresponding category.
For an insurance PDF, the marked picture is shown in Fig. 3, a schematic diagram of picture marking. Rectangular selection boxes of different colors represent different categories.
Step S204: obtain the labels of the parsed PDF text blocks.
The labels of the parsed PDF text blocks are obtained from the labels of the PDF-converted picture: the picture coordinates are mapped to PDF coordinates, yielding labels for the text blocks in the PDF document.
From the annotation results of the PDF-converted picture, the picture dimensions, marked coordinate regions, and label categories are obtained. The text blocks in the XML file produced by PDF parsing carry PDF coordinates, so the correspondence between XML text blocks and marked regions of the PDF-converted picture can be calculated from the ratio between the PDF dimensions and the picture dimensions, finally yielding the label of each text block. The specific mapping method is as follows: the picture has width a and length b, and a marked region is a rectangle [x1, y1, x2, y2], i.e., top-left corner [x1, y1] and bottom-right corner [x2, y2]; the PDF has width c and length d, and a text block has coordinates [m, n, k, l], corresponding to the distance from the top, the distance from the left, the text block width, and the text block height. A marked rectangle on the PDF-converted picture maps to the PDF document coordinates [(c/a)·y1, (b/d)·x1, (b/d)·(x2−x1), (c/a)·(y2−y1)]. If the overlap ratio between the PDF text block coordinates and the mapped PDF coordinates of the marked region is greater than a threshold k1, the PDF text block is labeled with the marking category of the corresponding region.
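The mapping can be implemented directly as a small function. Variable names follow the patent's notation (picture width a, picture length b, PDF width c, PDF length d); this is a sketch of the stated formula only, and the overlap-ratio test against k1 is left out because the patent does not define the overlap measure precisely:

```python
def map_mark_to_pdf(mark, pic_w, pic_h, pdf_w, pdf_h):
    """Map a marked rectangle [x1, y1, x2, y2] on the picture to the PDF
    coordinate form [top, left, width, height] given in the patent:
    [(c/a)*y1, (b/d)*x1, (b/d)*(x2-x1), (c/a)*(y2-y1)]."""
    x1, y1, x2, y2 = mark
    a, b, c, d = pic_w, pic_h, pdf_w, pdf_h
    return [(c / a) * y1, (b / d) * x1, (b / d) * (x2 - x1), (c / a) * (y2 - y1)]
```

A text block whose PDF rectangle sufficiently overlaps the mapped rectangle would then inherit that region's marking category.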
Step S206: build the classifier for parsed PDF text blocks.
For the text blocks after PDF parsing, text information features, position features, and format features are extracted; together with the marked labels, a classification model is constructed and trained.
It should be noted that a training sample set is constructed with the label data of step S204; features are built from the text content information, the position information, and the information of the preceding and following text blocks; and the classifier is trained with a decision-tree classification algorithm.
The specific steps of the classifier are as follows: 1) each marked text block is taken as one sample to construct the sample set;
2) the text information, position information, and text format of each text block are taken as the current-text-block features, and derived data is constructed to build the features;
Features are constructed from the following three classes. Text information features: number of numeric characters, number of Chinese characters, number of English characters, number of space characters, and number of special characters. Position information features: the average values of the four page coordinates, the maximum values of the four page coordinates, the minimum values of the four page coordinates, and the four coordinate values of the text block. Text format features: the percentage of the page's fonts accounted for by the current font, and the percentage of the page's font sizes accounted for by the current font size.
3) logistic regression is used as the classifier, and the classification model is trained and tuned.
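The feature construction of steps 1)–3) can be sketched as follows. The helper name and the exact page-statistics inputs are illustrative assumptions, and the logistic-regression fit is indicated only in a comment so the sketch stays self-contained:

```python
import re

def block_features(text, block_box, page_box, font_ratio, size_ratio):
    """Feature vector for one text block, following the three feature classes
    named in the patent: text information, position, and text format."""
    return [
        sum(ch.isdigit() for ch in text),                           # numeric characters
        len(re.findall(r"[\u4e00-\u9fff]", text)),                  # Chinese characters
        len(re.findall(r"[A-Za-z]", text)),                         # English characters
        text.count(" "),                                            # space characters
        sum(not ch.isalnum() and not ch.isspace() for ch in text),  # special characters
        *block_box,     # four coordinate values of the text block
        *page_box,      # page coordinate statistics (illustrative placeholder)
        font_ratio,     # current font / page fonts percentage
        size_ratio,     # current font size / page font sizes percentage
    ]

# A classifier would then be fit on the marked samples, e.g. with
# sklearn.linear_model.LogisticRegression().fit(X_train, y_train)
```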
Step S208: perform continuation probability prediction on the parsed PDF text blocks.
Language model training is carried out on corpus content; according to the language model, it can be predicted whether the content of adjacent text blocks is continuous.
For insurance-related PDF documents, insurance-related corpus data is obtained and a bigram language model (as distinct from a speech model) is trained; the bigram language model is then used to predict the continuation probability of text block content. For an insurance PDF, the specific steps are as follows:
1) The insurance text corpus is obtained: related corpus is crawled from insurance-related websites, and the bigram language model is trained.
2) Fig. 4 shows a schematic diagram of text blocks after PDF parsing. The four parsed text blocks are 1, 2, 3, and 4; the language model is used to predict the mutual continuation probabilities of the text content of the four blocks. Taking text block 1 as an example, three probability values are obtained: P(block 1, block 2), P(block 1, block 3), and P(block 1, block 4). For these three text continuation probabilities, if the content of blocks 1 and 2 is continuous, P(block 1, block 2) will be greater than P(block 1, block 3) and P(block 1, block 4). Here P(block 1, block 2) denotes P("mild critical illness", "insurance benefit"), i.e., the continuation probability, predicted with the language model, of the text of block 1 ("mild critical illness") followed by the text of block 2 ("insurance benefit").
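One way to turn a bigram model into a pairwise continuation score is to evaluate the bigrams that span the junction of the two blocks. The patent does not fix the exact scoring formula, so the geometric-mean scoring below and the `bigram_prob(prev, word)` callback are assumptions for illustration:

```python
import math

def continuation_prob(block_a, block_b, bigram_prob):
    """Score how likely block_b continues block_a: geometric mean of the
    bigram probabilities crossing the junction of the two word sequences."""
    words = block_a.split() + block_b.split()
    k = len(block_a.split())
    # bigrams that cross from the end of block_a into block_b
    pairs = list(zip(words[k - 1:], words[k:]))
    logp = sum(math.log(bigram_prob(a, b)) for a, b in pairs)
    return math.exp(logp / len(pairs))
```

With a trained model, a genuinely continuous pair such as (block 1, block 2) should score higher than a discontinuous pair such as (block 1, block 3), matching the comparison described above.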
Step S210: merge PDF text blocks using the text-block category classifier and the content-continuation language model prediction.
The text block category is predicted with the text-block category classifier, the content continuation probability of adjacent text blocks is obtained with the content-continuation language model, and finally the merge judgment is made from the two results.
It should be noted that the decision logic for merging PDF text blocks is: adjacent text blocks that have the same category and a continuation probability greater than the probability threshold are merged. For an insurance PDF document, as in Fig. 4, the specific steps are as follows:
The category of text blocks 1 and 2 obtained with the text block classifier is "title", and the category of text blocks 3 and 4 is "body text";
The continuation probabilities greater than the probability threshold k2 obtained with the content-continuation language model are P(block 1, block 2) and P(block 3, block 4);
According to the judgment rule "adjacent text blocks that have the same category and a continuation probability greater than the probability threshold are merged", text block 1 is merged with text block 2 and carries the label "title", and text block 3 is merged with text block 4 and carries the label "body text".
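The judgment rule can be applied mechanically. The labels and probability values below are illustrative stand-ins for the Fig. 4 example, not figures from the patent:

```python
def should_merge(label_a, label_b, prob, threshold):
    """Merge judgment rule from step S210: same category and continuation
    probability above the threshold."""
    return label_a == label_b and prob > threshold

# Fig. 4 style example (labels and probabilities are illustrative values)
labels = {1: "title", 2: "title", 3: "body", 4: "body"}
probs = {(1, 2): 0.9, (2, 3): 0.2, (3, 4): 0.8}
k2 = 0.5
merges = [(a, b) for (a, b), p in probs.items()
          if should_merge(labels[a], labels[b], p, k2)]
print(merges)  # -> [(1, 2), (3, 4)]
```

The pair (2, 3) is rejected on both grounds here: the labels differ and the probability is below k2.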
Through the above steps, the training sample set is constructed with the image-marking method, and classification and merging of content are then performed with the classifier and the language model; a large number of rules no longer need to be maintained by hand, while accuracy is significantly improved. As shown in Fig. 5, a column diagram compares the per-category F1 score test data of the PDF merging method of this preferred embodiment and the keyword matching method. The raw data in Fig. 5 is shown in the table below.
Table 1

Text category     Method of the invention    Keyword matching method
Title             0.95                       0.64
Body text         0.92                       0.72
Header/footer     0.89                       0.85
Annotation        0.90                       0.82
According to an embodiment of the present invention, an embodiment of a document processing device is also provided. It should be noted that the document processing device can be used to execute the document processing method of the embodiment of the present invention; that is, the document processing method of the embodiment of the present invention can be executed in the document processing device.
Fig. 6 is a schematic diagram of the document processing device according to an embodiment of the present invention. As shown in Fig. 6, the device may comprise: an acquiring unit 61, a determination unit 63, and a combining unit 65, detailed as follows.
Acquiring unit 61, configured to obtain multiple text blocks from a document.
The acquiring unit may comprise: a first conversion module, configured to convert the document into a picture; and a division module, configured to divide the picture into different regions according to preset rules, wherein the different regions correspond respectively to the multiple text blocks.
Determination unit 63, configured to determine the label of each of the multiple text blocks with a classifier, and to analyze each pair of adjacent text blocks among the multiple text blocks with a preset language model to obtain the continuation probability of the pair, wherein the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuation probability.
The determination unit may comprise: a second conversion module, configured to convert the pair of adjacent text blocks into language information; and a determining module, configured to input the language information into the language model, which determines the continuation probability of the pair by analyzing the language information.
Combining unit 65, configured to merge a specified pair of adjacent text blocks when it is determined that their continuation probability is greater than a preset threshold and the specified pair belongs to the same label.
It should be noted that the acquiring unit 61 in this embodiment can be used to execute step S102 of the embodiment of the present invention, the determination unit 63 can be used to execute step S104, and the combining unit 65 can be used to execute step S106. The examples and application scenarios realized by the above modules are the same as those of the corresponding steps, but are not limited to the content disclosed by the above embodiments.
Optionally, the device may also comprise: a marking unit, configured to mark a label on each of the different regions after the picture is divided into different regions according to preset rules, wherein the label marked on each region is the label of the corresponding text block.
According to another aspect of the embodiments of the present invention, a storage medium is also provided. The storage medium comprises a stored program, wherein, when the program runs, the device where the storage medium is located is controlled to perform the following operations: obtain multiple text blocks from a document; determine the label of each of the multiple text blocks with a classifier, and analyze each pair of adjacent text blocks among the multiple text blocks with a preset language model to obtain the continuation probability of the pair, wherein the language model is trained by machine learning on multiple groups of data, each group comprising a pair of adjacent text blocks and their continuation probability; and, when it is determined that the continuation probability of a specified pair of adjacent text blocks is greater than a preset threshold and the specified pair belongs to the same label, merge the specified pair of adjacent text blocks.
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, wherein when the program runs, the following operations are performed: obtaining a plurality of text blocks in a document; determining, by a classifier, the label to which each text block in the plurality of text blocks belongs, and analyzing pairwise-adjacent text blocks in the plurality of text blocks by a preset language model to obtain the continuity probability of the pairwise-adjacent text blocks, wherein the language model is trained through machine learning using multiple groups of data, and each group of data in the multiple groups of data includes a pair of adjacent text blocks and the continuity probability of the pair of adjacent text blocks; and merging specified pairwise-adjacent text blocks when it is determined that the continuity probability of the specified pairwise-adjacent text blocks is greater than a preset threshold and the specified pairwise-adjacent text blocks belong to the same label.
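The obtain/classify/score/merge flow repeated in the paragraphs above can be sketched as a minimal Python example. The classifier and the language model are hypothetical stand-ins (the patent specifies neither a model architecture nor training details), and the threshold value 0.5 is an assumed placeholder for the "preset threshold".

```python
# Minimal sketch of the claimed merging flow. `classify` and `continuity`
# are toy stand-ins for the classifier and the preset language model; in
# practice both would be trained models as described in the embodiments.

THRESHOLD = 0.5  # "preset threshold"; the actual value is not given in the patent

def classify(block):
    # Hypothetical classifier: label a block "title" or "body" by its length.
    return "title" if len(block) < 20 else "body"

def continuity(a, b):
    # Hypothetical language model: probability that block b continues block a.
    # Here: high if a does not end with sentence-final punctuation.
    return 0.9 if not a.rstrip().endswith((".", "!", "?")) else 0.1

def merge_blocks(blocks):
    """Merge adjacent blocks whose continuity probability exceeds the
    threshold and whose labels match, per the flow above."""
    if not blocks:
        return []
    merged = [blocks[0]]
    for block in blocks[1:]:
        prev = merged[-1]
        if continuity(prev, block) > THRESHOLD and classify(prev) == classify(block):
            merged[-1] = prev + " " + block  # same label and continuous: merge
        else:
            merged.append(block)
    return merged
```

Both conditions must hold before a merge: a high continuity score alone is not enough if the two blocks carry different labels, which is what keeps, for example, a heading from being folded into the following paragraph.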
The serial numbers of the above embodiments of the present invention are for description only and do not represent the superiority or inferiority of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in a certain embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technical content may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of the units may be a division of logical functions, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
The above are only preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications may be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A document processing method, characterized by comprising:
obtaining a plurality of text blocks in a document;
determining, by a classifier, the label to which each text block in the plurality of text blocks belongs, and analyzing pairwise-adjacent text blocks in the plurality of text blocks by a preset language model to obtain the continuity probability of the pairwise-adjacent text blocks, wherein the language model is trained through machine learning using multiple groups of data, and each group of data in the multiple groups of data comprises a pair of adjacent text blocks and the continuity probability of the pair of adjacent text blocks; and
merging specified pairwise-adjacent text blocks when it is determined that the continuity probability of the specified pairwise-adjacent text blocks is greater than a preset threshold and the specified pairwise-adjacent text blocks belong to the same label.
2. The method according to claim 1, wherein obtaining the plurality of text blocks in the document comprises:
converting the document into a picture; and
dividing the picture into different regions according to preset rules, wherein the different regions respectively correspond to the plurality of text blocks.
3. The method according to claim 2, wherein after the picture is divided into different regions according to the preset rules, the method further comprises:
marking labels on the different regions respectively, wherein the labels marked on the different regions are the labels to which the respective text blocks belong.
4. A document processing apparatus, characterized by comprising:
an acquiring unit, configured to obtain a plurality of text blocks in a document;
a determination unit, configured to determine, by a classifier, the label to which each text block in the plurality of text blocks belongs, and to analyze pairwise-adjacent text blocks in the plurality of text blocks by a preset language model to obtain the continuity probability of the pairwise-adjacent text blocks, wherein the language model is trained through machine learning using multiple groups of data, and each group of data in the multiple groups of data comprises a pair of adjacent text blocks and the continuity probability of the pair of adjacent text blocks; and
a combining unit, configured to merge specified pairwise-adjacent text blocks when it is determined that the continuity probability of the specified pairwise-adjacent text blocks is greater than a preset threshold and the specified pairwise-adjacent text blocks belong to the same label.
5. The apparatus according to claim 4, wherein the acquiring unit comprises:
a first conversion module, configured to convert the document into a picture; and
a division module, configured to divide the picture into different regions according to preset rules, wherein the different regions respectively correspond to the plurality of text blocks.
6. The apparatus according to claim 5, wherein the apparatus further comprises:
a marking unit, configured to mark labels on the different regions respectively after the picture is divided into different regions according to the preset rules, wherein the labels marked on the different regions are the labels to which the respective text blocks belong.
7. A storage medium, wherein the storage medium comprises a stored program, and when the program runs, a device on which the storage medium is located is controlled to perform the method according to any one of claims 1 to 3.
8. A processor, wherein the processor is configured to run a program, and when the program runs, the method according to any one of claims 1 to 3 is performed.
CN201910697312.5A 2019-07-30 2019-07-30 Document processing method and device Active CN110427488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910697312.5A CN110427488B (en) 2019-07-30 2019-07-30 Document processing method and device


Publications (2)

Publication Number Publication Date
CN110427488A true CN110427488A (en) 2019-11-08
CN110427488B CN110427488B (en) 2022-09-23

Family

ID=68413183

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910697312.5A Active CN110427488B (en) 2019-07-30 2019-07-30 Document processing method and device

Country Status (1)

Country Link
CN (1) CN110427488B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101308A (en) * 2020-11-11 2020-12-18 北京云测信息技术有限公司 Method and device for combining text boxes based on language model and electronic equipment
CN112818823A (en) * 2021-01-28 2021-05-18 建信览智科技(北京)有限公司 Text extraction method based on bill content and position information
CN113761906A (en) * 2020-07-16 2021-12-07 北京沃东天骏信息技术有限公司 Method, device, equipment and computer readable medium for analyzing document
CN114495147A (en) * 2022-01-25 2022-05-13 北京百度网讯科技有限公司 Identification method, device, equipment and storage medium
CN116306575A (en) * 2023-05-10 2023-06-23 杭州恒生聚源信息技术有限公司 Document analysis method, document analysis model training method and device and electronic equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033964A (en) * 2011-01-13 2011-04-27 北京邮电大学 Text classification method based on block partition and position weight
US20110137900A1 (en) * 2009-12-09 2011-06-09 International Business Machines Corporation Method to identify common structures in formatted text documents
CN105677764A (en) * 2015-12-30 2016-06-15 百度在线网络技术(北京)有限公司 Information extraction method and device
CN107424166A (en) * 2017-07-18 2017-12-01 深圳市速腾聚创科技有限公司 Point cloud segmentation method and device
CN107808011A (en) * 2017-11-20 2018-03-16 北京大学深圳研究院 Classification abstracting method, device, computer equipment and the storage medium of information
US20180129944A1 (en) * 2016-11-07 2018-05-10 Xerox Corporation Document understanding using conditional random fields
CN108416279A (en) * 2018-02-26 2018-08-17 阿博茨德(北京)科技有限公司 Form analysis method and device in file and picture
CN109857942A (en) * 2019-03-14 2019-06-07 北京百度网讯科技有限公司 For handling the method, apparatus, equipment and storage medium of document


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XING WANG et al.: "A Font Setting Based Bayesian Model to Extract Mathematical Expression in PDF Files", 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR) *
YAN WENTAN: "Information Extraction from Semi-structured Chinese Resumes", China Master's Theses Full-text Database (Information Science and Technology) *



Similar Documents

Publication Publication Date Title
CN110427488A (en) The processing method and processing device of document
CN108717406B (en) Text emotion analysis method and device and storage medium
CN110795919B (en) Form extraction method, device, equipment and medium in PDF document
US7783642B1 (en) System and method of identifying web page semantic structures
CN106201465B (en) Software project personalized recommendation method for open source community
US5669007A (en) Method and system for analyzing the logical structure of a document
US7444325B2 (en) Method and system for information extraction
EP1736901B1 (en) Method for classifying sub-trees in semi-structured documents
US8874581B2 (en) Employing topic models for semantic class mining
CN107590219A (en) Webpage personage subject correlation message extracting method
WO2021084702A1 (en) Document image analysis device, document image analysis method, and program
JP2009026195A (en) Article classification apparatus, article classification method and program
JP2009193571A (en) Method and device used for extracting webpage content
CN107767273B (en) Asset configuration method based on social data, electronic device and medium
US20020016796A1 (en) Document processing method, system and medium
CN115917613A (en) Semantic representation of text in a document
CN108520065B (en) Method, system, equipment and storage medium for constructing named entity recognition corpus
CN110929518B (en) Text sequence labeling algorithm using overlapping splitting rule
Ha et al. Information extraction from scanned invoice images using text analysis and layout features
JP5577546B2 (en) Computer system
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature
CN112084376A (en) Map knowledge based recommendation method and system and electronic device
Mohemad et al. Automatic document structure analysis of structured PDF files
Bartík Text-based web page classification with use of visual information
US20230177251A1 (en) Method, device, and system for analyzing unstructured document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant