CN102841893A - Method and device for processing fragmentation data in document

Info

Publication number: CN102841893A
Application number: CN201110168129XA
Authority: CN
Inventors: 黄锴; 翟因为; 陈长刚
Original assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Current assignee: Peking University Founder Group Co Ltd; Beijing Founder Electronics Co Ltd
Priority date: 2011-06-21
Filing date: 2011-06-21
Publication date: 2012-12-26

Abstract

The invention provides a method and a device for processing fragmentation data in a document. The method comprises the following steps: extracting the fragmentation data in the document, and recording in association the attributes of the fragmentation data, the attributes of the document, and the attributes of the publication to which the document belongs. The device comprises an extraction module and a recording module; the extraction module is used for extracting the fragmentation data in the document, and the recording module is used for recording in association the attributes of the fragmentation data, the attributes of the document, and the attributes of the publication to which the document belongs. Because the attributes of the extracted fragmentation data, of the document, and of the publication to which the document belongs are recorded in association, a basis for fast lookup is provided when the fragmentation data is subsequently searched.

Description

Method and apparatus for processing segment data in a document

Technical field

The present invention relates to the field of computer data processing, and in particular to a method and apparatus for processing segment data in a document.

Background technology

In the publishing field today, paper publications are mainly produced through the workflow of "topic planning, soliciting manuscripts, reviewing manuscripts, typesetting, and printing". Books are usually divided into chapters, a collection of papers usually gathers many papers into one publication, and a periodical is made up of many independent contributions. The various types of content in a manuscript, such as pictures, text, video clips, and audio clips, are usually referred to collectively as "segment data".

A publication is usually assembled from many pieces of segment data. The user needs to extract and organize segment data scattered across many publications, and assemble the organized data into a publication.

The inventors found that segment data is scattered across individual electronic documents, and because no data relationships about the segment data exist, it is difficult to query particular segment data. The process by which a user looks up segment data in a publication is cumbersome: to find a single article, or even a single passage of text, in a publication, the whole electronic document of that publication must be browsed, so search efficiency is low.

Summary of the invention

The present invention aims to provide a method and apparatus for processing segment data in a document, so as to solve the above problem that no data relationships about segment data are established.

In an embodiment of the present invention, a method for processing segment data in a document is provided, comprising: extracting the segment data in the document; and recording, in association, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.

In an embodiment of the present invention, a device for processing segment data in a document is provided, comprising: an extraction module, used to extract the segment data in the document; and a recording module, used to record, in association, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.

Embodiments of the invention record, in association, the attributes of the extracted segment data, the attributes of the document it belongs to, and the attributes of the publication it belongs to. This provides a basis for fast lookup when the segment data is subsequently searched.

Description of drawings

The accompanying drawings described here are provided for further understanding of the invention and form part of this application. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:

Fig. 1 shows the flowchart of embodiment one;

Fig. 2 shows the flowchart of embodiment two;

Fig. 3 shows a screenshot of selecting a document in an embodiment;

Fig. 4 shows the structural block diagram of embodiment three.

Embodiment

The invention is described in detail below with reference to the drawings and in combination with the embodiments. Referring to Fig. 1, which is the flowchart of embodiment one of the invention, the method comprises:

Step S11: extract the segment data in the document;

In the embodiment, a publication is made up of multiple documents. For example, a photography publication contains multiple chapters, the content of each chapter is stored in one document, and each document contains segment data such as notes and pictures.

To extract the segment data in a document, the file within the document that stores the segment data can be obtained first. For example, a Word document is made up of multiple sub-documents, including a document for paragraph formats, a document for display styles, a document for stored content, and so on. Converting the Word document yields these documents in XML format; by traversing the nodes of the document that stores the content, the content of each node, i.e. the segment data, can be extracted.
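The traversal described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the sample XML, the element names `body`, `para`, and `pic`, and the function `extract_segments` are all hypothetical stand-ins for the content sub-document of a converted Word file.

```python
import xml.etree.ElementTree as ET

# Hypothetical content sub-document, standing in for the XML obtained
# by converting a Word document (element names are assumptions).
SAMPLE_XML = """
<body>
  <para>The Master said: to learn and to practise is a joy.</para>
  <pic src="fig1.png"/>
  <para>Annotation: an early chapter.</para>
</body>
"""

def extract_segments(xml_text):
    """Traverse every node and return one dict per piece of segment data."""
    root = ET.fromstring(xml_text)
    segments = []
    for node in root.iter():
        if node is root:
            continue  # skip the container element itself
        segments.append({
            "node": node.tag,                    # node name, usable as a segment type
            "content": (node.text or "").strip(),  # character data of the node
            "attrs": dict(node.attrib),          # node attributes, e.g. a picture source
        })
    return segments

segments = extract_segments(SAMPLE_XML)
```

Each returned dict carries both the content and the node-level attributes, so a later recording step has something to store alongside the document and publication attributes.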

Step S12: record, in association, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.

To make later lookup of the extracted segment data easier, the attributes related to the segment data can be recorded together in association. In this embodiment, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs are stored in a single record, which makes it easy to look up the segment data later.

Recording these attributes of the segment data in association makes subsequent queries of the segment data easier. When a keyword entered by the user is received, the segment data associated with the matching attribute data can be found quickly and displayed to the user.

The method in the embodiment may also define in advance an acquisition module for the segment data to be extracted, define the various kinds of segment data through the acquisition module, establish storage identifiers for the segment data, the document, and the publication respectively, and store them in respective databases so that they can be looked up in association. This is explained below through an embodiment; referring to the flowchart of embodiment two shown in Fig. 2, the method comprises the following steps:
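As a rough sketch of recording the three groups of attributes in one record and then searching them by keyword, the following uses Python's built-in `sqlite3` module. The table name `segment_record`, its columns, and the sample rows are assumptions made for illustration; the source does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per piece of segment data; document and publication attributes
# are stored in the same row (table and column names are assumptions).
conn.execute("""
    CREATE TABLE segment_record (
        seg_type TEXT, seg_content TEXT,
        doc_title TEXT, pub_title TEXT, pub_publisher TEXT
    )
""")
conn.executemany(
    "INSERT INTO segment_record VALUES (?, ?, ?, ?, ?)",
    [
        ("text", "To learn and to practise is a joy.",
         "Chapter 1", "The Analects", "Example Press"),
        ("picture", "fig1.png",
         "Chapter 2", "Photography Basics", "Example Press"),
    ],
)

def search(keyword):
    """Return segments whose own, document, or publication attributes match."""
    like = f"%{keyword}%"
    return conn.execute(
        """SELECT seg_type, seg_content, doc_title, pub_title
           FROM segment_record
           WHERE seg_content LIKE ? OR doc_title LIKE ?
              OR pub_title LIKE ? OR pub_publisher LIKE ?""",
        (like, like, like, like),
    ).fetchall()

hits = search("Analects")
```

Because every attribute sits in the same row, a single query over that row reaches the segment data without opening the publication's documents.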

Step S21: collect the segment data in the document according to a predefined template.

This embodiment is described using a Word document as an example, in which the segment data is stored in one of the XML documents that make up the Word document. An acquisition module in XML format must be defined in advance; the acquisition module is used to call the XML document that stores the segment data and thereby extract the segment data.

Part of the code of the acquisition module is as follows:

In the acquisition module, tableMap defines the relationship between the metadata (i.e. attributes) of the segment data in the document and the fields stored in the relational database. The relational database contains multiple tables, each corresponding to one type of segment data. Each record of a table corresponds to one piece of segment data. Each table contains multiple columns, each corresponding to one metadata description of the segment data. The table node defines the name of the table in which the segment data is stored, and the meta node specifies the relationship between the metadata of the segment data and the database storage fields. The meta node has the following three attributes:

name is the node name in the document; it is used to locate that node in the document.

valType is the node processing type. This attribute determines the method used to process the specified node, with each type corresponding to one processing method, for example obtaining the character data of the node, normalizing (or standardizing) the node's character data, converting the format of a picture, or converting the format of an audio file, while also extracting the metadata (i.e. attributes) of the segment data. After the node content has been processed, the attributes of the segment data are saved in colName.

colName is a field name in the database named "chapter storehouse"; it is used to save the result of processing the node.
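The original listing of the acquisition module is not reproduced in this text. Purely as an illustration of how a mapping built from `tableMap`, `table`, and `meta` nodes could drive extraction, the sketch below invents a small template. Only the node names and the attribute names `name`, `valType`, and `colName` come from the description; the `name` attribute on `table`, every attribute value, the handler names, and the function `collect` are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: node and attribute names follow the description,
# all values (and the name attribute on <table>) are assumptions.
TEMPLATE = """
<tableMap>
  <table name="chapter">
    <meta name="title" valType="text" colName="TITLE"/>
    <meta name="para" valType="normalized_text" colName="CONTENT"/>
  </table>
</tableMap>
"""

# One handler per valType; each turns a document node into a column value.
HANDLERS = {
    "text": lambda node: node.text or "",
    "normalized_text": lambda node: " ".join((node.text or "").split()),
}

def collect(template_xml, document_xml):
    """Apply each meta rule to the document and build one record per table."""
    doc = ET.fromstring(document_xml)
    records = {}
    for table in ET.fromstring(template_xml).iter("table"):
        row = {}
        for meta in table.iter("meta"):
            node = doc.find(f".//{meta.get('name')}")  # locate the node by name
            if node is not None:
                row[meta.get("colName")] = HANDLERS[meta.get("valType")](node)
        records[table.get("name")] = row
    return records

DOCUMENT = "<doc><title>Book One</title><para>  To learn   and to practise. </para></doc>"
records = collect(TEMPLATE, DOCUMENT)
```

Keeping the handlers in a dictionary keyed by `valType` mirrors the idea that each processing type corresponds to one processing method; adding a picture or audio conversion would mean registering one more handler.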

During collection, it suffices to select the corresponding publication or document from the books. In the title area shown in Fig. 3, the selected book is The Analects of Confucius, and the selected document is the Analects corpus file (a Word-format file), i.e. the document.

Step S22: record the attributes of the segment data, in association with the attributes of the document and of the publication to which it belongs, in the same database record.

The attributes of the document to which the segment data belongs and the attributes of the publication to which it belongs are recorded in association in advance and stored in a database named "Library". After the segment data is extracted, the document attributes and publication attributes stored in the Library are merged with the attributes of the segment data into one record. Part of the relevant code is as follows:

Here, the meta node has two attributes:

parentColName is the metadata (i.e. attribute) column name of the publication; the metadata to be synchronized is located in the database table through this parentColName attribute.

colName is the metadata column name; it specifies the field name under which the synchronized metadata is stored, i.e. a field name in the chapter storehouse.

The attributes of the document and of the publication to which the document belongs are already stored in the Library; these attributes are stored directly under the colName fields of the chapter storehouse, so that the attributes of the segment data, of the document it belongs to, and of the publication it belongs to are stored in association.
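The synchronization described above, copying columns located by `parentColName` into the fields named by `colName`, can be sketched as a simple mapping step. The rule values, the sample rows, and the function `merge_record` are hypothetical; only the two attribute names are taken from the description.

```python
# Hypothetical sync rules: parentColName (column in the "Library" record)
# -> colName (field in the chapter storehouse). All values are assumptions.
SYNC_RULES = [
    {"parentColName": "BOOK_TITLE", "colName": "PUB_TITLE"},
    {"parentColName": "DOC_NAME", "colName": "DOC_TITLE"},
]

def merge_record(segment_attrs, library_row, rules=SYNC_RULES):
    """Copy the configured parent columns into the segment's record."""
    record = dict(segment_attrs)  # copy, so the caller's dict is not mutated
    for rule in rules:
        record[rule["colName"]] = library_row[rule["parentColName"]]
    return record

library_row = {"BOOK_TITLE": "The Analects", "DOC_NAME": "Corpus file", "PAGES": "n/a"}
segment_attrs = {"CONTENT": "To learn and to practise.", "TYPE": "text"}
record = merge_record(segment_attrs, library_row)
```

Only the columns named in the rules are copied, so Library columns that no rule mentions (here `PAGES`) stay out of the merged record.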

Preferably, the method further comprises: creating storage identifiers for the segment data, the document, and the publication respectively, and, when recording the attributes in association, also recording in association the storage identifier of the segment data, the storage identifier of the document, and the storage identifier of the publication.

The storage identifiers can be recorded in the form of an association table; see Table 1:

Table 1: association table storing the association relationships between resources

In Table 1, the association relationships among segment data, publications, and documents can be stored through the association table. For example, the root resource ID is the storage identifier of the publication, the source resource ID is the storage identifier of the document, and the target resource ID is the storage identifier of the segment data. The association order and the association relationship number are recorded as well. Through the associations between these identifiers in Table 1, the storage location of each piece of segment data, its associated document, and its associated publication can be found.
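The contents of Table 1 are not reproduced in this text, so the following is only a sketch of what such an association table might look like, using `sqlite3`. The column names paraphrase the fields named above (root, source, and target resource IDs, order, relationship number); the exact schema, the identifier values, and the helper functions are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The exact layout of Table 1 is not given, so this schema is an assumption.
conn.execute("""
    CREATE TABLE association (
        relation_no INTEGER PRIMARY KEY,  -- association relationship number
        root_id TEXT,                     -- storage identifier of the publication
        source_id TEXT,                   -- storage identifier of the document
        target_id TEXT,                   -- storage identifier of the segment data
        seq INTEGER                       -- association order within the document
    )
""")
conn.executemany(
    "INSERT INTO association (root_id, source_id, target_id, seq) VALUES (?, ?, ?, ?)",
    [("pub-1", "doc-1", "seg-1", 1),
     ("pub-1", "doc-1", "seg-2", 2),
     ("pub-1", "doc-2", "seg-3", 1)],
)

def owners_of(segment_id):
    """Return (publication id, document id) for a segment, or None."""
    return conn.execute(
        "SELECT root_id, source_id FROM association WHERE target_id = ?",
        (segment_id,),
    ).fetchone()

def segments_of(document_id):
    """Return the segment ids of a document in their recorded order."""
    rows = conn.execute(
        "SELECT target_id FROM association WHERE source_id = ? ORDER BY seq",
        (document_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout a lookup works in both directions: from a segment up to its document and publication, and from a document down to its segments in order.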

Step S23: search the attribute data for the received keyword.

Step S24: return the segment data associated with the attribute data that was found, and at the same time display an access link to the document or publication associated with the segment data.

The access link contains the storage identifier of the publication or document.

Step S25: receive the selected access link and display the corresponding document or publication.

Because the access link carries the storage identifier, the corresponding document or publication is retrieved and displayed according to that identifier.

Through the steps of embodiment two, relevant segment data can be found quickly according to the received keyword, and through the associated storage identifiers, the related document or the segment data in the publication can be found further.
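A minimal sketch of steps S24 and S25, assuming the access link is a URL-style string that carries the storage identifier as a query parameter. The link format (`/view?kind=...&id=...`), the in-memory stores, and both function names are hypothetical; the source only states that the link contains the storage identifier.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical in-memory stores keyed by storage identifier.
DOCUMENTS = {"doc-1": "Analects corpus file"}
PUBLICATIONS = {"pub-1": "The Analects"}

def make_link(kind, storage_id):
    """Build an access link that carries the storage identifier."""
    return "/view?" + urlencode({"kind": kind, "id": storage_id})

def open_link(link):
    """Resolve an access link back to the document or publication it names."""
    query = parse_qs(urlparse(link).query)
    kind, storage_id = query["kind"][0], query["id"][0]
    store = DOCUMENTS if kind == "document" else PUBLICATIONS
    return store[storage_id]

link = make_link("document", "doc-1")
```

Since the identifier travels inside the link, displaying the target needs no further search: the link alone is enough to fetch the document or publication.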

Preferably, the segment data in the document can also be replaced in advance with a placeholder, with the identifier of the segment data stored in the placeholder. For example, the placeholder begins with "PAMCMS://" followed by 4 comma-separated identifiers, as shown in Table 2, namely id, type, lib, and res, i.e. PAMCMS://id,type,lib,res

Table 2: meaning of the 4 identifiers

Name	Meaning
id	Unique identifier of the referenced segment data
type	Type of the referenced resource; can be extended as needed
lib	Location identifier of the referenced segment data
res	Reserved identifier

In step S25 above, the document or publication is accessed according to the access link, and the segment data is extracted and displayed through the storage identifier in the placeholder.

Two embodiments of the invention have been described in detail above. The method of the invention can be integrated into an electronic circuit in the form of modules. A preferred embodiment three is given below and described in detail with the structural diagram of Fig. 4. Referring to the device structure block diagram in Fig. 4, the device comprises:
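Placeholder handling can be sketched as follows, assuming the placeholder is written as `PAMCMS://id,type,lib,res` with no embedded spaces. Looking segments up by the pair `(lib, id)`, the sample store, and the function names are assumptions made for this illustration.

```python
import re

# Four comma-separated fields after the prefix: id, type, lib, res.
PLACEHOLDER = re.compile(r"PAMCMS://([^,\s]*),([^,\s]*),([^,\s]*),([^,\s]*)")

def parse_placeholder(text):
    """Split one placeholder into its four identifiers."""
    match = PLACEHOLDER.fullmatch(text)
    if match is None:
        raise ValueError(f"not a placeholder: {text!r}")
    return dict(zip(("id", "type", "lib", "res"), match.groups()))

def render(document_text, store):
    """Replace each placeholder with the segment data it refers to."""
    def lookup(match):
        seg_id, _type, lib, _res = match.groups()
        return store[(lib, seg_id)]  # keying by (lib, id) is an assumption
    return PLACEHOLDER.sub(lookup, document_text)

store = {("chapter", "42"): "To learn and to practise is a joy."}
page = render("The Master said: PAMCMS://42,text,chapter,0 (Book One)", store)
```

Storing only the placeholder in the document keeps a single copy of each segment; the text is filled in at display time from the identifiers.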

Extraction module 41, used to extract the segment data in the document;

Recording module 42, used to record, in association and according to the segment data extracted by extraction module 41, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.

Preferably, the device further comprises:

Search module 43, used to search the attributes recorded by recording module 42 for the received keyword;

Feedback module 44, used to return the segment data associated, in recording module 42, with the attributes found by search module 43.

Preferably, the device further comprises:

Identification module 45, used to store the segment data, the document, and the publication separately and to generate a storage identifier for each;

Identifier recording module 46, used, when recording module 42 records the attributes in association, to record in association the storage identifier of the segment data, the storage identifier of the document, and the storage identifier of the publication generated by identification module 45.

Preferably, the device further comprises:

Link feedback module 47, used, when feedback module 44 returns segment data, to return, through the storage identifiers recorded by identifier recording module 46, an access link to the document or publication associated with the segment data, with the storage identifier added to the link.

Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.

The above are merely preferred embodiments of the invention and are not intended to limit the invention; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the invention shall fall within the scope of protection of the invention.

Claims

1. A method for processing segment data in a document, characterized by comprising:

extracting the segment data in the document;

recording, in association, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.

2. The method according to claim 1, characterized by further comprising:

receiving a keyword;

searching the attributes for the received keyword;

returning the segment data associated with the attributes found.

3. The method according to claim 1, characterized in that the extracting comprises:

converting the document into a document in XML format;

traversing the content of each node in the document in XML format;

extracting the content as the segment data.

4. The method according to claim 3, characterized in that the recording in association comprises:

during the traversal, extracting the attributes of each piece of segment data from the document in XML format;

storing the attributes of each piece of segment data in a database record created in advance;

determining the publication to which the document belongs;

storing, in each record of the database, the attributes of the respective segment data, the attributes of the document, and the attributes of the publication to which the document belongs.

5. The method according to claim 2, characterized by further comprising:

storing the segment data, the document, and the publication separately, and generating a storage identifier for each;

when recording the attributes in association, also recording in association the storage identifier of the segment data, the storage identifier of the document, and the storage identifier of the publication.

6. The method according to claim 5, characterized in that, after returning the segment data associated with the attributes found, the method further comprises:

returning an access link to the document or the publication associated with the segment data;

wherein the access link contains the storage identifier of the document or the publication.

7. The method according to claim 6, characterized by further comprising:

replacing, in advance, the segment data in the document with a placeholder containing the storage identifier of the segment data;

accessing the document according to the access link;

in the course of displaying the document, obtaining the segment data according to the storage identifier and replacing the placeholder with it.

8. A device for processing segment data in a document, characterized by comprising:

an extraction module, used to extract the segment data in the document;

a recording module, used to record, in association, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.

9. The device according to claim 8, characterized by further comprising:

a search module, used to search the attributes for the received keyword;

a feedback module, used to return the segment data associated with the attributes found.

10. The device according to claim 9, characterized by further comprising:

an identification module, used to store the segment data, the document, and the publication separately and to generate a storage identifier for each;

an identifier recording module, used, when the attributes are recorded in association, to record in association the storage identifier of the segment data, the storage identifier of the document, and the storage identifier of the publication.

11. The device according to claim 10, characterized by further comprising:

a link feedback module, used to return an access link to the document or the publication associated with the segment data.