CN102982028A - Method and device for extracting document structure - Google Patents

Method and device for extracting document structure

Info

Publication number: CN102982028A
Application number: CN2011102591727A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 曾建英
Current Assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Priority date: 2011-09-02 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2011-09-02
Publication date: 2013-03-20
Prior art keywords: content, particle, document, mapping rule, label
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd
Priority to CN2011102591727A
Publication of CN102982028A

Abstract

The invention provides a method for extracting a document structure. The method comprises: identifying the particles of a document's content according to preset content style rules; labeling the particles with entry tags; selecting, from a preset mapping rule group, the mapping rule corresponding to the type of the document; mapping the entry tags to structure labels using the selected mapping rule; and labeling the particles with the structure labels. The invention further provides a device for extracting a document structure. The invention improves the efficiency of extracting document structure.

Description

Method and device for extracting document structure
Technical field
The present invention relates to the field of digital publishing, and in particular to a method and device for extracting document structure.
Background art
In traditional publishing, the document formats of books, newspapers and periodicals only need to satisfy the demands of traditional printing. Their description of content is confined to visual elements such as the outline, color and position of text, graphics and images, and they do not represent the logical content and internal relationships of the document. Digital publishing pays more attention to the logical content of a document, its relationships, and the granularity of its content; structuring a document is a precondition for reusing digital content.
The content of a complete, well-formed document can usually be divided into particles at several levels, for example part at the first level, chapter at the second, section at the third, paragraph at the fourth and sentence at the fifth. By setting matching rules in advance for the content styles of parts, chapters, sections, paragraphs and sentences and matching the document against them, the content corresponding to each of these units can be identified and marked with a structure label.
The inventor found that parts, chapters, sections, paragraphs and sentences are all rather abstract and vague concepts. Different document types may have different content styles; for a document such as an examination paper, the content may be divided into particles such as major question, stem, option or answer. To extract document structure from documents of different types, the related art must therefore specify separate matching rules for each content style in order to generate different structure labels. This makes operation cumbersome and error-prone.
Summary of the invention
The present invention aims to provide a method and device for extracting document structure, so as to solve the problem that the related art is cumbersome to operate.
An embodiment of the present invention provides a method for extracting document structure, comprising: identifying the particles of a document's content using preset content style rules; labeling the particles with entry tags; selecting, from a preset mapping rule group, the mapping rule corresponding to the type of the document; mapping the entry tags to structure labels using the selected mapping rule; and labeling the particles with the structure labels.
An embodiment of the present invention also provides a device for extracting document structure, comprising: an identification module, configured to identify the particles of a document's content using preset content style rules; an entry labeling module, configured to label the particles with entry tags; a mapping selection module, configured to select, from a preset mapping rule group, the mapping rule corresponding to the type of the document; a mapping module, configured to map the entry tags to structure labels using the selected mapping rule; and a structure labeling module, configured to label the particles with the structure labels.
Because the method and device of the above embodiments use entry tags to isolate structure labels from content styles, they overcome the cumbersome operation of the related art and improve the efficiency of extracting document structure.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 shows a flow chart of a method for extracting document structure according to an embodiment of the present invention;
Fig. 2 shows an MVC model according to a preferred embodiment of the present invention;
Fig. 3 shows a flow chart of a method for extracting document structure according to a preferred embodiment of the present invention;
Fig. 4 shows a schematic diagram of a device for extracting document structure according to an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows a flow chart of a method for extracting document structure according to an embodiment of the present invention, comprising:
identifying the particles of a document's content using preset content style rules;
labeling the particles with entry tags;
selecting, from a preset mapping rule group, the mapping rule corresponding to the type of the document;
mapping the entry tags to structure labels using the selected mapping rule;
labeling the particles with the structure labels.
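The five steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the regular expressions standing in for content style rules, the document type names, the level-to-label mapping and the function names are all hypothetical and are not taken from the patent.

```python
import re
from dataclasses import dataclass

# Hypothetical content style rules: (level, pattern) pairs that recognize heading lines.
CONTENT_STYLE_RULES = [
    (1, re.compile(r"^Part \d+")),
    (2, re.compile(r"^\d+\.")),
]

# Hypothetical mapping rule group: document type -> {level: structure label name}.
MAPPING_RULE_GROUP = {
    "exam_paper": {1: "subject", 2: "stem"},
    "novel": {1: "chapter", 2: "section"},
}

@dataclass
class Particle:
    text: str
    position: int             # line index where the particle starts
    level: int                # entry tag: hierarchy level
    structure_label: str = ""

def identify_particles(lines):
    """Steps 1 and 2: find particles with the style rules and tag each with its level."""
    particles = []
    for position, line in enumerate(lines):
        for level, pattern in CONTENT_STYLE_RULES:
            if pattern.match(line):
                particles.append(Particle(line, position, level))
                break
    return particles

def extract_structure(lines, doc_type):
    particles = identify_particles(lines)
    rule = MAPPING_RULE_GROUP[doc_type]       # step 3: select the rule by document type
    for p in particles:
        p.structure_label = rule[p.level]     # steps 4 and 5: map the tag and label the particle
    return particles

doc = ["Part 1 Reading", "1. Choose the best answer", "A. First option", "2. Fill in the blank"]
labeled = extract_structure(doc, "exam_paper")
```

Only the dictionary entry chosen in step 3 depends on the document type; the recognition code in steps 1 and 2 is shared by all types, which is the decoupling the embodiment describes.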
To extract document structure from documents of different types, the related art must specify separate matching rules for each content style in order to generate different structure labels. This makes operation cumbersome and error-prone. The inventor studied this in depth and found that the related art fixes the content style of each structure label; that is, structure labels and content styles are tightly coupled, so the approach cannot adapt flexibly to different document types and is hard to maintain and extend.
The method of the above embodiment introduces entry tags. An entry tag records only the hierarchy level of a particle of document content and ignores every other attribute of the particle. Whether the document is an academic paper, an examination paper or another type, this part is the same: the document content must be divided into levels and a tree structure built. Other structure attributes that depend on the document type are handled by the label mapping rules, so that structure labels and content style rules are independent of each other, with entry tags isolating the two. Breaking the coupling between structure labels and content styles in this way allows the method to adapt flexibly to different document types.
Preferably, the entry tag comprises the paragraph heading, paragraph content, position and level of the particle, and the type of the document comprises at least one of: news, novel, text, academic paper, dictionary, examination paper. This preferred embodiment lists some main types, and mapping rules can be defined in advance for them. For example, for an examination paper, the mapping rule for the examination paper type is selected: particles that the entry tags mark as first-level are mapped to the "subject" attribute of the structure label, and particles marked as second-level are mapped to the "stem" attribute. The above description illustrates the invention and does not limit it. Specifying other document types and their corresponding mapping rules clearly also falls within the spirit of the invention.
Preferably, the structure label comprises the content of the entry tag and further comprises: a name, which indicates the structure type of the particle; and a range, which indicates the content from the start position of the current particle to the start position of the next particle. For example, step S10 identifies a particle in an examination paper document, and the entry tag records that the particle's paragraph heading is "Chinese language final examination", its paragraph content is "Chinese language final examination", its position is the start of the document, and its level is one. Because the document is an examination paper, the mapping rule for examination papers is selected and the entry tag is mapped to an examination paper structure label. Besides the content of the entry tag, this structure label can also contain the name "subject" and a range running from the start position of the current first-level particle to the start position of the next particle. The invention is clearly not limited to this: users can define further attributes of the structure label as needed, for example specifying that structure labels for examination papers also include a difficulty attribute, a term attribute and so on.
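As an illustration of how the name and range might be derived, the following Python sketch builds structure labels from entry tags. The class names, field names and character offsets are hypothetical. The text does not say where the range of the last particle ends, so the sketch assumes it runs to the end of the document.

```python
from dataclasses import dataclass

@dataclass
class EntryTag:
    title: str      # paragraph heading
    content: str    # paragraph content
    position: int   # start offset of the particle in the document
    level: int      # hierarchy level (1 = top)

@dataclass
class StructureLabel:
    entry: EntryTag  # the structure label carries the entry tag content
    name: str        # structure type, e.g. "subject"
    start: int       # start position of the current particle
    end: int         # start position of the next particle

def to_structure_labels(entries, level_names, doc_length):
    ordered = sorted(entries, key=lambda e: e.position)
    labels = []
    for i, entry in enumerate(ordered):
        # Assumption: the last particle has no successor, so its range ends at doc_length.
        end = ordered[i + 1].position if i + 1 < len(ordered) else doc_length
        labels.append(StructureLabel(entry, level_names[entry.level], entry.position, end))
    return labels

entries = [
    EntryTag("Chinese language final examination", "Chinese language final examination", 0, 1),
    EntryTag("I. Reading", "I. Reading", 40, 2),
]
labels = to_structure_labels(entries, {1: "subject", 2: "stem"}, doc_length=100)
```

Further attributes such as difficulty or term could be added as extra fields on the hypothetical StructureLabel class without touching the entry tags.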
Preferably, the method further comprises: selecting, from a display rule group, the display rule corresponding to the type of the document; and displaying the content of the document according to its structure labels using the selected display rule. In this preferred embodiment, display processing is also isolated from the definition of content styles, which further improves the efficiency of extracting document structure.
Preferably, the display rules are defined in XML format. XML is a standard structured language, which makes display rules relatively easy to define.
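The patent does not give a schema for the display rules, so the following Python sketch invents one purely for illustration: the element names displayRules, ruleSet and style and their attributes are assumptions. It shows a display rule being selected by document type and applied to labeled particles using the standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical display rule group; the schema is an assumption made for this sketch.
DISPLAY_RULES_XML = """
<displayRules>
  <ruleSet docType="exam_paper">
    <style label="subject" prefix="#"/>
    <style label="stem" prefix="-"/>
  </ruleSet>
  <ruleSet docType="novel">
    <style label="chapter" prefix="=="/>
  </ruleSet>
</displayRules>
"""

def select_display_rule(xml_text, doc_type):
    """Return {structure label: prefix} for the rule set matching the document type."""
    root = ET.fromstring(xml_text)
    for rule_set in root.findall("ruleSet"):
        if rule_set.get("docType") == doc_type:
            return {s.get("label"): s.get("prefix") for s in rule_set.findall("style")}
    raise KeyError(doc_type)

def render(labeled_particles, rule):
    """Format each (structure label, text) pair according to the selected rule."""
    return [f"{rule[label]} {text}" for label, text in labeled_particles]

rule = select_display_rule(DISPLAY_RULES_XML, "exam_paper")
lines = render([("subject", "Final examination"), ("stem", "Question 1")], rule)
```

Because the renderer only reads structure labels, changing how a document type is displayed means editing the XML rather than the recognition or mapping code.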
The above preferred embodiments of the invention form an MVC model, as shown in Fig. 2. The content style rules are encapsulated as a data model module, which divides the document content into particles and builds a tree model; this is the Model in MVC. The concrete mapping method is encapsulated as a label mapping structuring control module, which is the Controller in MVC. Finally, the display rules are encapsulated as a mapping result display module, which combines the structure labels into the final display style; this is the View in MVC. The data model module M, the label mapping structuring control module C and the mapping result display module V are decoupled from one another and each is responsible for its own function, while module M is associated with display module V through control module C. This separates content from form and lays the foundation for the flexibility of the whole system.
Preferably, the above method further comprises: providing an interface that accepts new user-defined mapping rules or modifications to existing mapping rules. The mapping rules initially defined by the software provider do not necessarily meet the user's needs, that is, they do not necessarily produce the structure labels the user expects. With an interactive interface, the user can add or modify mapping rules flexibly and obtain the expected structure labels. Through this interface the user can maintain personalized tag types as circumstances require, for example adding, modifying or deleting tag types. For an examination paper, for instance, the user can add personalized tag types such as stem, option and answer. This information is ultimately stored as an XML file.
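A minimal sketch of the data store that could sit behind such an interface is given below. The TagTypeRegistry class, its method names, the attribute names and the XML layout are hypothetical; the patent describes a window-based interface and XML storage but no concrete API.

```python
import xml.etree.ElementTree as ET

class TagTypeRegistry:
    """Hypothetical store for user-defined tag types (add, modify, delete, save as XML)."""

    def __init__(self):
        self._types = {}  # tag type name -> attribute dict

    def add(self, name, **attrs):
        if name in self._types:
            raise ValueError(f"tag type {name!r} already exists")
        self._types[name] = dict(attrs)

    def modify(self, name, **attrs):
        self._types[name].update(attrs)  # raises KeyError for an unknown tag type

    def delete(self, name):
        del self._types[name]

    def to_xml(self):
        """Serialize all tag types to an XML string (the layout is an assumption)."""
        root = ET.Element("tagTypes")
        for name, attrs in self._types.items():
            ET.SubElement(root, "tagType", name=name,
                          **{key: str(value) for key, value in attrs.items()})
        return ET.tostring(root, encoding="unicode")

registry = TagTypeRegistry()
registry.add("stem", level=2, color="blue")
registry.add("option", level=3, color="gray")
registry.add("answer", level=3, color="green")
registry.modify("option", color="black")
registry.delete("answer")
xml_text = registry.to_xml()
```

Extra personalized attributes such as a font or font size could be passed as further keyword arguments to add or modify and would be written out as additional XML attributes.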
Preferably, in the above method, the content style matching rules and the mapping rules are defined in XML format. XML is a standard structured language, which makes matching rules and mapping rules relatively easy to define.
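As a rough illustration, matching rules and mapping rules kept in one XML document could be loaded as shown below. The element names, the use of regular expressions as matching rules and the sample patterns are all assumptions made for this sketch, since the patent does not specify the XML layout.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file; a raw string keeps the regular expression backslashes intact.
RULES_XML = r"""
<rules>
  <contentStyleRules>
    <match level="1" pattern="^Chapter \d+"/>
    <match level="2" pattern="^\d+\.\d+ "/>
  </contentStyleRules>
  <mappingRules docType="novel">
    <map level="1" label="chapter"/>
    <map level="2" label="section"/>
  </mappingRules>
</rules>
"""

def load_rules(xml_text):
    """Parse the XML into (level, compiled pattern) pairs and per-type mapping tables."""
    root = ET.fromstring(xml_text)
    style_rules = [
        (int(m.get("level")), re.compile(m.get("pattern")))
        for m in root.find("contentStyleRules").findall("match")
    ]
    mapping_rules = {
        group.get("docType"): {int(m.get("level")): m.get("label")
                               for m in group.findall("map")}
        for group in root.findall("mappingRules")
    }
    return style_rules, mapping_rules

style_rules, mapping_rules = load_rules(RULES_XML)
# Match one heading line against the style rules, then map its level for a novel.
level = next(lvl for lvl, pattern in style_rules if pattern.match("1.2 The storm"))
label = mapping_rules["novel"][level]
```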
Fig. 3 shows a flow chart of a method for extracting document structure according to a preferred embodiment of the present invention, comprising the following steps:
Step S0: obtain the content of the document, divide the document content into particles according to the content style rules, and label them with entry tags.
Step S1: collect all entry tags, then store and display them in a tree structure as the data model of the whole system.
Step S2: define personalized tag types using the tag type customization interface.
Further, the system uses a window to accept the user's personalized tag type definitions. The attributes that define a tag type include name, level and display color. Through this window the user can add personalized tag types, modify existing tag types, or delete existing tag types. The user can also extend or maintain the attributes of a specific tag type here, adding personalized attributes such as font and font size.
Step S3: store the resulting personalized structured tag types in XML document format.
Further, the way the customized tag types are written to the XML file can itself be customized: the user can arrange these personalized tag types freely without needing to care what the concrete entry tags are. The entry tags only need to be mapped onto the personalized structured tag types at display time, and the result is then naturally displayed in the style defined in the tag type output XML document.
Step S4: map the entry tags to structure labels using the entry structure mapping algorithm, then assemble the mapping results into a complete structured document for output according to the display style defined in the tag type output XML file.
Further, the entries to be mapped can be found quickly by title content, chapter level, tag tree level or mapped tag type; the collected entries can be sorted by a given attribute; and entries can be screened and searched by one or more attributes or by sibling and hierarchy relationships, or the entry hierarchy can be previewed.
The concrete steps are: first parse the document, collect the corresponding entry resources, and obtain the attribute information of each entry, including the chapter level, tag tree level and title content of the entry tag and the corresponding structure type; then screen the entries to be mapped by their attributes; finally add the corresponding personalized structure tag type to the screened entries in a batch.
Step S5: combine the mapping results into a complete structured document for output.
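Steps S1 and S4 can be illustrated with the following Python sketch, which stores entries in a tree, screens them by attribute and assigns a structure type to the selected entries in one pass. The Entry class and the functions build_tree, walk, screen and batch_map are hypothetical names chosen for this example. The stack-based tree construction is one common way to turn a flat list of leveled headings into a tree and is not necessarily the algorithm used by the inventors.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    title: str
    level: int                  # chapter level recorded by the entry tag
    structure_type: str = ""    # filled in by the mapping step
    children: list = field(default_factory=list)

def build_tree(entries):
    """Step S1: arrange entries, given in document order, into a tree by level."""
    root = Entry("document", 0)
    stack = [root]                               # path from the root to the last entry added
    for entry in entries:
        while stack[-1].level >= entry.level:    # climb until a shallower ancestor is on top
            stack.pop()
        stack[-1].children.append(entry)
        stack.append(entry)
    return root

def walk(node, depth=0):
    """Yield (tree depth, entry) for every entry below the root, in document order."""
    for child in node.children:
        yield depth + 1, child
        yield from walk(child, depth + 1)

def screen(root, **criteria):
    """Step S4: select the entries whose attributes equal every given criterion."""
    return [entry for _, entry in walk(root)
            if all(getattr(entry, key) == value for key, value in criteria.items())]

def batch_map(entries, structure_type):
    """Step S4: assign one structure type to all selected entries in a single pass."""
    for entry in entries:
        entry.structure_type = structure_type

root = build_tree([Entry("Part I", 1), Entry("Question 1", 2),
                   Entry("Question 2", 2), Entry("Part II", 1)])
batch_map(screen(root, level=1), "subject")
batch_map(screen(root, level=2, structure_type=""), "stem")
outline = [(depth, entry.title, entry.structure_type) for depth, entry in walk(root)]
```

The outline list gives a simple preview of the entry hierarchy, and the same walk function could feed sorting or a search by title.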
Fig. 4 shows a schematic diagram of a device for extracting document structure according to an embodiment of the present invention, comprising:
an identification module 10, configured to identify the particles of a document's content using preset content style rules;
an entry labeling module 20, configured to label the particles with entry tags;
a mapping selection module 30, configured to select, from a preset mapping rule group, the mapping rule corresponding to the type of the document;
a mapping module 40, configured to map the entry tags to structure labels using the selected mapping rule;
a structure labeling module 50, configured to label the particles with the structure labels.
This device improves the efficiency of extracting document structure.
Preferably, the entry tag comprises the paragraph heading, paragraph content, position and level of the particle; the structure label comprises the content of the entry tag and further comprises: a name, which indicates the structure type of the particle; and a range, which indicates the content from the start position of the current particle to the start position of the next particle.
Preferably, the device further comprises: a display selection module, configured to select, from a display rule group, the display rule corresponding to the type of the document; and a display module, configured to display the content of the document according to its structure labels using the selected display rule.
As can be seen from the above description, the above embodiments of the invention are mainly used to map document entry tags to structure labels in batches, so as to structure chapters and entries and output XML files to a resource database. The invention achieves the goal of fast structuring.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not restricted to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (10)

1. A method for extracting document structure, characterized in that it comprises:
identifying the particles of a document's content using preset content style rules;
labeling the particles with entry tags;
selecting, from a preset mapping rule group, the mapping rule corresponding to the type of the document;
mapping the entry tags to structure labels using the selected mapping rule;
labeling the particles with the structure labels.
2. The method according to claim 1, characterized in that the entry tag comprises the paragraph heading, paragraph content, position and level of the particle; and the type of the document comprises at least one of: news, novel, text, academic paper, dictionary, examination paper.
3. The method according to claim 1, characterized in that the structure label comprises the content of the entry tag and further comprises: a name, which indicates the structure type of the particle; and a range, which indicates the content from the start position of the current particle to the start position of the next particle.
4. The method according to claim 1, characterized in that it further comprises:
selecting, from a display rule group, the display rule corresponding to the type of the document;
displaying the content of the document according to the structure labels using the selected display rule.
5. The method according to claim 4, characterized in that the display rules are defined in XML format.
6. The method according to any one of claims 1 to 4, characterized in that it further comprises:
providing an interface that accepts new user-defined mapping rules or modifications to existing mapping rules.
7. The method according to any one of claims 1 to 4, characterized in that the content style matching rules and the mapping rules are defined in XML format.
8. A device for extracting document structure, characterized in that it comprises:
an identification module, configured to identify the particles of a document's content using preset content style rules;
an entry labeling module, configured to label the particles with entry tags;
a mapping selection module, configured to select, from a preset mapping rule group, the mapping rule corresponding to the type of the document;
a mapping module, configured to map the entry tags to structure labels using the selected mapping rule;
a structure labeling module, configured to label the particles with the structure labels.
9. The device according to claim 8, characterized in that the entry tag comprises the paragraph heading, paragraph content, position and level of the particle; the structure label comprises the content of the entry tag and further comprises: a name, which indicates the structure type of the particle; and a range, which indicates the content from the start position of the current particle to the start position of the next particle.
10. The device according to claim 8, characterized in that it further comprises:
a display selection module, configured to select, from a display rule group, the display rule corresponding to the type of the document;
a display module, configured to display the content of the document according to the structure labels using the selected display rule.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011102591727A CN102982028A (en) 2011-09-02 2011-09-02 Method and device for extracting document structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2011102591727A CN102982028A (en) 2011-09-02 2011-09-02 Method and device for extracting document structure

Publications (1)

Publication Number Publication Date
CN102982028A true CN102982028A (en) 2013-03-20

Family

ID=47856067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011102591727A Pending CN102982028A (en) 2011-09-02 2011-09-02 Method and device for extracting document structure

Country Status (1)

Country Link
CN (1) CN102982028A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101436185A (en) * 2007-11-12 2009-05-20 北大方正集团有限公司 Method for implementing multiple-file compatibility by XML memory tree
CN101488123A (en) * 2008-01-16 2009-07-22 鸿富锦精密工业(深圳)有限公司 Text resolution system and method
CN101561826A (en) * 2009-05-18 2009-10-21 汤胤 Method and application for sharing and cooperating online non-structural file based on node granularity semantics

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591920A (en) * 2011-12-19 2012-07-18 刘松涛 Method and system for classifying document collection in document management system
CN102591920B (en) * 2011-12-19 2013-11-20 刘松涛 Method and system for classifying document collection in document management system
CN103729412A (en) * 2013-12-11 2014-04-16 《中国激光》杂志社有限公司 System and method applicable to large-scale literature cluster mobile digital publishing
CN105786775B (en) * 2014-12-23 2018-11-16 珠海金山办公软件有限公司 Document schem drawing generating method and system
CN106845467A (en) * 2016-12-14 2017-06-13 北京航天测控技术有限公司 Aeronautical maintenance work card action recognition methods based on OCR
CN107391650A (en) * 2017-07-14 2017-11-24 北京神州泰岳软件股份有限公司 A kind of structuring method for splitting of document, apparatus and system
CN107622087A (en) * 2017-08-17 2018-01-23 珠海云游道科技有限责任公司 User-friendly document management apparatus and method
CN107632969A (en) * 2017-08-17 2018-01-26 珠海云游道科技有限责任公司 Document structure tree method and device for management information system
CN107622087B (en) * 2017-08-17 2024-03-22 珠海云游道科技有限责任公司 Document management apparatus and method convenient for user operation
CN107632969B (en) * 2017-08-17 2024-03-29 珠海云游道科技有限责任公司 Document generation method and device for management information system
CN113065337A (en) * 2021-02-26 2021-07-02 成都环宇知了科技有限公司 Method and system for positioning and scoring documents based on OpenXml

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130320