CN102841893A - Method and device for processing fragmentation data in document - Google Patents

Info

Publication number
CN102841893A
Authority
CN
China
Prior art keywords
document
segment data
attribute
publication
storaging mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201110168129XA
Other languages
Chinese (zh)
Inventor
黄锴
翟因为
陈长刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd filed Critical Peking University Founder Group Co Ltd
Priority to CN201110168129XA priority Critical patent/CN102841893A/en
Publication of CN102841893A publication Critical patent/CN102841893A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for processing fragmentation data in a document. The method comprises: extracting the fragmentation data from the document, and recording the attributes of the fragmentation data, the attributes of the document, and the attributes of the publication to which the document belongs in association with one another. The device comprises an extraction module and a recording module; the extraction module extracts the fragmentation data from the document, and the recording module records the attributes of the fragmentation data, the attributes of the document, and the attributes of the publication to which the document belongs in association with one another. Because these attributes are recorded in association, they provide a basis for quickly locating the fragmentation data in later searches.

Description

Method and device for processing segment data in a document
Technical field
The present invention relates to the field of computer data processing, and in particular to a method and a device for processing segment data in a document.
Background
In the publishing field today, paper publications are mainly produced through the workflow of "topic planning, soliciting manuscripts, reviewing manuscripts, typesetting, printing". Books are usually divided into chapters, a collection of papers usually gathers many papers into one publication, and a periodical consists of many independent manuscripts. The various kinds of content in a manuscript, such as pictures, text, video clips and audio clips, are collectively referred to as "segment data".
A publication is usually assembled from a large number of segment data items. The user needs to extract and organize segment data scattered across many publications and assemble the organized data into a publication.
The inventors found that segment data are scattered across individual electronic documents and, because no data relationships are recorded for the segment data, particular segment data are hard to query. Searching for segment data in a publication is cumbersome: to find a single article, or even a single passage, of a publication, the user has to browse the entire electronic document of that publication, so search efficiency is low.
Summary of the invention
The present invention aims to provide a method and a device for processing segment data in a document, so as to solve the above problem that no data relationships are established for segment data.
An embodiment of the present invention provides a method for processing segment data in a document, comprising: extracting the segment data from the document; and recording, in association with one another, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.
An embodiment of the present invention provides a device for processing segment data in a document, comprising: an extraction module, configured to extract the segment data from the document; and a recording module, configured to record, in association with one another, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.
Embodiments of the invention record the attributes of the extracted segment data, the attributes of the document to which they belong, and the attributes of the publication to which the document belongs in association with one another, which provides a basis for fast lookup when the segment data are searched later.
Description of drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application; the illustrative embodiments of the invention and their descriptions explain the invention and do not unduly limit it. In the drawings:
Fig. 1 shows the flow chart of embodiment one;
Fig. 2 shows the flow chart of embodiment two;
Fig. 3 shows a screenshot of selecting a document in an embodiment;
Fig. 4 shows the structural block diagram of embodiment three.
Embodiments
The present invention is described in detail below with reference to the accompanying drawings and in combination with the embodiments. Referring to Fig. 1, the flow chart of embodiment one of the invention comprises:
Step S11: extract the segment data from the document;
In the embodiments, a publication is made up of a plurality of documents. For example, a photography publication contains several chapters, the content of each chapter is stored in one document, and each document contains segment data such as notes and pictures.
To extract the segment data from a document, the files within the document that store the segment data can first be obtained. For example, a Word document is made up of several sub-documents, including a document of paragraph formats, a document of display styles, a document storing the content, and so on. Converting the Word document yields these sub-documents in XML format; by traversing the nodes of the content document, the content of each node, i.e. the segment data, can be extracted.
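The node traversal described above can be sketched in Python as follows. This is a minimal illustration rather than the applicant's implementation: the sample XML, its element names (chapter, para, note, picture) and the helper extract_segments are assumptions made up for the example, and a real converted Word document has a far richer structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML standing in for the converted content document;
# the element names are illustrative only.
SAMPLE_XML = """
<chapter title="Exposure basics">
  <para>Aperture controls depth of field.</para>
  <note>See the appendix for lens charts.</note>
  <picture src="img/aperture.png"/>
</chapter>
"""

def extract_segments(xml_text):
    """Traverse every node and return one segment per content-bearing node."""
    root = ET.fromstring(xml_text)
    segments = []
    for node in root.iter():
        text = (node.text or "").strip()
        src = node.get("src")  # assumed attribute for media references
        if text or src:
            segments.append({"node": node.tag, "content": text or src})
    return segments

segments = extract_segments(SAMPLE_XML)
for seg in segments:
    print(seg)
```

Nodes that carry neither text nor a media reference (here the enclosing chapter element) are skipped, so each returned entry corresponds to one piece of segment data.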
Step S12: record, in association with one another, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.
To make the extracted segment data easy to find later, the attributes related to the segment data can be recorded together in association. In this embodiment, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs are stored in a single record.
Recording these attributes in association makes later queries for segment data convenient: on receiving a keyword entered by the user, the segment data associated with the matching attribute data can be found quickly and shown to the user.
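As a rough illustration of storing the three groups of attributes in one record and searching them by keyword, the following Python sketch uses an in-memory SQLite database. The table name segment_record, its columns and the sample rows are assumptions invented for the example; the source does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One record holds segment, document and publication attributes together.
conn.execute(
    "CREATE TABLE segment_record ("
    " seg_type TEXT, seg_content TEXT,"
    " doc_title TEXT,"
    " pub_title TEXT, pub_category TEXT)"
)
rows = [
    ("note", "Use a tripod for long exposures", "Chapter 3", "Photography Handbook", "photography"),
    ("picture", "night-street.png", "Chapter 3", "Photography Handbook", "photography"),
    ("note", "Season the wok before first use", "Chapter 1", "Home Cooking", "cooking"),
]
conn.executemany("INSERT INTO segment_record VALUES (?, ?, ?, ?, ?)", rows)

def search(keyword):
    """Return segments whose own, document or publication attributes match."""
    pattern = f"%{keyword}%"
    return conn.execute(
        "SELECT seg_type, seg_content, doc_title, pub_title"
        " FROM segment_record"
        " WHERE seg_content LIKE ? OR doc_title LIKE ?"
        " OR pub_title LIKE ? OR pub_category LIKE ?"
        " ORDER BY rowid",
        (pattern,) * 4,
    ).fetchall()

print(search("tripod"))
print(search("photography"))
```

Because the document and publication attributes sit in the same record as the segment attributes, a keyword that only matches the publication still returns every segment of that publication without browsing the whole document.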
The method of the embodiments may also define in advance an acquisition template for the segment data to be extracted, define the various kinds of segment data through the acquisition template, create storage identifiers for the segment data, the document and the publication respectively, and store them in their respective databases, so that they can be looked up in association. This is explained below through embodiment two, whose flow chart is shown in Fig. 2 and which comprises the following steps:
Step S21: collect the segment data in the document according to a predefined template.
This embodiment takes a Word document as an example; the segment data are stored in one of the XML-format documents that make up the Word document. An acquisition template in XML format is defined in advance; through the acquisition template, the XML-format document that stores the segment data is read and the segment data are extracted.
Part of the code of the acquisition template is as follows:
Figure BSA00000522186100041
In the acquisition template, tableMap defines the relation between the metadata (i.e. attributes) of the segment data in the document and the storage fields of the relational database. The relational database comprises multiple tables, each table corresponding to one type of segment data. Each record of a table corresponds to one segment data item. Each table comprises multiple columns, each column corresponding to one metadata description of the segment data. The table node defines the name of the table in which the segment data are stored, and the meta node specifies the relation between the metadata of the segment data and the database storage fields. The meta node has the following three attributes:
name: the node name in the document, used to locate that node in the document.
valType: the node processing type, which determines how the specified node is processed; each type corresponds to one way of processing a node, for example obtaining the character data of the node, normalizing the node's character data, converting the format of a picture, or converting the format of an audio file, while also extracting the metadata (i.e. attributes) of the segment data. After the node content has been processed, the attributes of the segment data are saved in the column given by colName.
colName: the field name in the database named "chapter storehouse", used to save the result of processing the node.
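The original template listing is only available as an image, so the following Python sketch is a hypothetical reconstruction from the description, not the applicant's code. Only the names tableMap, table, meta, name, valType and colName come from the text; the name attribute on the table node, the valType values "text" and "normalized_text", the sample document and the function collect are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical acquisition template: maps document nodes to database columns.
TEMPLATE = """
<tableMap>
  <table name="chapter_segments">
    <meta name="title" valType="text" colName="seg_title"/>
    <meta name="body" valType="normalized_text" colName="seg_body"/>
  </table>
</tableMap>
"""

# Hypothetical content document to collect from.
DOCUMENT = """
<doc>
  <title>  On Learning </title>
  <body>To learn   and to practise
     what is learned.</body>
</doc>
"""

# One handler per processing type (valType); the type names are assumed.
HANDLERS = {
    "text": lambda node: (node.text or "").strip(),
    "normalized_text": lambda node: " ".join((node.text or "").split()),
}

def collect(template_xml, document_xml):
    """Apply each meta rule: locate the node by name, process it, store by colName."""
    template = ET.fromstring(template_xml)
    document = ET.fromstring(document_xml)
    records = {}
    for table in template.iter("table"):
        row = {}
        for meta in table.iter("meta"):
            node = document.find(f".//{meta.get('name')}")
            if node is not None:
                row[meta.get("colName")] = HANDLERS[meta.get("valType")](node)
        records[table.get("name")] = row
    return records

records = collect(TEMPLATE, DOCUMENT)
print(records)
```

Keeping the mapping in a template rather than in code means a new kind of segment data could be supported by adding a table or meta entry, which appears to be the motivation for the template approach in this embodiment.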
During collection, the corresponding publication or document is selected from the books. In the title area shown in Fig. 3, the selected book is The Analects of Confucius, and the selected document is the corpus file of The Analects of Confucius (a Word-format file).
Step S22: record the attributes of the segment data, the attributes of the document to which the segment data belong, and the attributes of the publication to which the document belongs, in association, in the same record of the database.
The attributes of the document and the attributes of the publication to which it belongs are recorded in association in advance and stored in the database named "Library". After the segment data are extracted, the document attributes and publication attributes stored in the Library are merged with the attributes of the segment data into one record. Part of the relevant code is as follows:
Figure BSA00000522186100051
The meta node here has two attributes:
parentColName: the column name of the publication's metadata (i.e. attribute); the metadata to be synchronized are located in the database table through this attribute.
colName: the metadata column name, specifying the field in which the synchronized metadata are stored, i.e. a field name in the chapter storehouse.
The attributes of the document and of the publication to which the document belongs are already stored in the Library; they are stored directly under the colName fields of the chapter storehouse, so that the attributes of the segment data, of the document, and of the publication are stored in association.
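The synchronization step can be pictured with the short Python sketch below. Only the attribute names parentColName and colName come from the description; the column names, the sample rows and the function merge_parent_metadata are hypothetical, and plain dictionaries stand in for the Library and chapter storehouse databases.

```python
# Hypothetical synchronization rules, one per meta node.
SYNC_MAP = [
    {"parentColName": "book_title", "colName": "pub_title"},
    {"parentColName": "book_category", "colName": "pub_category"},
    {"parentColName": "doc_name", "colName": "doc_title"},
]

# Row assumed to be stored in advance in the "Library" database.
library_row = {
    "book_title": "The Analects of Confucius",
    "book_category": "classics",
    "doc_name": "corpus file",
}

# Attributes of one extracted segment (assumed columns).
segment_row = {"seg_type": "text", "seg_content": "sample passage"}

def merge_parent_metadata(segment, parent, sync_map):
    """Copy each parentColName value into the record under its colName."""
    merged = dict(segment)
    for rule in sync_map:
        parent_col = rule["parentColName"]
        if parent_col in parent:
            merged[rule["colName"]] = parent[parent_col]
    return merged

record = merge_parent_metadata(segment_row, library_row, SYNC_MAP)
print(record)
```

The result is a single record that carries the segment attributes alongside the copied document and publication attributes, which is what allows one query to search all three.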
Preferably, the method further comprises: creating storage identifiers for the segment data, the document and the publication respectively and, while recording the attributes in association, also recording in association the storage identifier of the segment data, the storage identifier of the document and the storage identifier of the publication.
The storage identifiers can be recorded in the form of an association table; see Table 1:
Table 1: association table storing the association relationships between resources
Figure BSA00000522186100061
As shown in Table 1, the associations among segment data, publications and documents can be stored in the association table. For example, the root resource ID is the storage identifier of the publication, the source resource ID is the storage identifier of the document, and the target resource ID is the storage identifier of the segment data. The association order and the association relationship number are recorded as well. Through the associations between these identifiers in Table 1, the storage location of each segment data item, its associated document and its associated publication can be found.
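Since Table 1 itself is only available as an image, the following Python sketch shows one plausible shape for the association table. The three ID roles follow the description; the field names, the identifier values and the reading of the order field as the position of a segment within its document are assumptions.

```python
from typing import NamedTuple

class Association(NamedTuple):
    root_resource_id: str    # storage identifier of the publication
    source_resource_id: str  # storage identifier of the document
    target_resource_id: str  # storage identifier of the segment data
    order: int               # association order (assumed: position in the document)

# Hypothetical association rows.
ASSOCIATIONS = [
    Association("pub-1", "doc-1", "seg-2", 2),
    Association("pub-1", "doc-1", "seg-1", 1),
    Association("pub-1", "doc-2", "seg-3", 1),
]

def owners_of(segment_id):
    """Return (document id, publication id) for a segment, or None."""
    for a in ASSOCIATIONS:
        if a.target_resource_id == segment_id:
            return a.source_resource_id, a.root_resource_id
    return None

def segments_of(document_id):
    """Return the segment ids of a document in association order."""
    rows = [a for a in ASSOCIATIONS if a.source_resource_id == document_id]
    return [a.target_resource_id for a in sorted(rows, key=lambda a: a.order)]

print(owners_of("seg-3"))
print(segments_of("doc-1"))
```

With such a table, a lookup can go in either direction: from a segment up to its document and publication, or from a document down to its segments.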
Step S23: search the attribute data for the received keyword.
Step S24: return the segment data associated with the attribute data found, and at the same time display an access link to the document or publication associated with the segment data.
The access link contains the storage identifier of the publication or document.
Step S25: receive the selected access link and display the corresponding document or publication.
Because the access link carries a storage identifier, the corresponding document or publication is retrieved and displayed according to that storage identifier.
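A small Python sketch of an access link that carries a storage identifier is given below. The link layout (/view?kind=...&id=...), the STORE dictionary and both function names are assumptions for illustration; the source only states that the link contains the storage identifier of the document or publication.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Hypothetical storage keyed by (resource kind, storage identifier).
STORE = {
    ("document", "doc-1"): "Chapter 3 text",
    ("publication", "pub-1"): "Photography Handbook",
}

def build_link(kind, storage_id):
    """Embed the storage identifier in the access link (step S24)."""
    return "/view?" + urlencode({"kind": kind, "id": storage_id})

def open_link(link):
    """Resolve a selected access link back to the stored resource (step S25)."""
    query = parse_qs(urlparse(link).query)
    return STORE[(query["kind"][0], query["id"][0])]

link = build_link("document", "doc-1")
print(link)
print(open_link(link))
```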
Through the steps of embodiment two, relevant segment data can be found quickly from the received keyword, and through the associated storage identifiers the related document or publication containing the segment data can then be found.
Preferably, the segment data in the document can also be replaced in advance with placeholders, with the identifier of the segment data stored in each placeholder. For example, a placeholder begins with "PAMCMS://" followed by four comma-separated identifiers, namely id, type, lib and res, as shown in Table 2, i.e. PAMCMS://id,type,lib,res
Table 2: meaning of the four identifiers
Name    Meaning
id      Unique identifier of the referenced segment data
type    Type of the referenced resource; can be extended as needed
lib     Location identifier of the referenced segment data
res     Reserved identifier
In step S25 above, the document or publication is accessed according to the access link, and the segment data are retrieved through the storage identifier in the placeholder and displayed.
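The placeholder mechanism can be sketched in Python as follows. The PAMCMS://id,type,lib,res layout comes from the description; the regular expression, the SEGMENT_STORE dictionary keyed by (lib, id), the sample values and the function render are assumptions made for the example.

```python
import re

# Matches PAMCMS:// followed by four comma-separated identifiers.
PLACEHOLDER = re.compile(r"PAMCMS://([^,\s]+),([^,\s]+),([^,\s]+),([^,\s]+)")

# Hypothetical segment storage keyed by (lib, id).
SEGMENT_STORE = {("lib1", "42"): "[picture: night-street.png]"}

def render(document_text):
    """Replace each placeholder with the segment data it refers to."""
    def substitute(match):
        seg_id, seg_type, lib, res = match.groups()
        # Unknown references are left untouched rather than dropped.
        return SEGMENT_STORE.get((lib, seg_id), match.group(0))
    return PLACEHOLDER.sub(substitute, document_text)

text = "Figure: PAMCMS://42,picture,lib1,0 shows the scene."
print(render(text))
```

Storing only a reference in the document means the segment data live in one place and are fetched at display time.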
Two embodiments of the invention have been described in detail above. The method of the invention can be integrated into an electronic circuit in the form of modules; preferred embodiment three is given below and described with reference to the structural diagram in Fig. 4. The device shown in the block diagram of Fig. 4 comprises:
an extraction module 41, configured to extract the segment data from the document;
a recording module 42, configured to record in association, for the segment data extracted by the extraction module 41, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.
Preferably, the device further comprises:
a search module 43, configured to search the attributes recorded by the recording module 42 for the received keyword;
a feedback module 44, configured to return the segment data associated in the recording module 42 with the attributes found by the search module 43.
Preferably, the device further comprises:
an identification module 45, configured to store the segment data, the document and the publication separately and to generate a storage identifier for each;
an identifier recording module 46, configured to record in association, when the recording module 42 records the attributes in association, the storage identifier of the segment data, the storage identifier of the document and the storage identifier of the publication generated by the identification module 45.
Preferably, the device further comprises:
a link feedback module 47, configured to return, when the feedback module 44 returns segment data, an access link to the document or publication associated with the segment data, using the storage identifiers recorded by the identifier recording module 46, the storage identifier being added to the link.
Obviously, those skilled in the art will understand that the modules or steps of the invention described above can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The invention is thus not restricted to any specific combination of hardware and software.
The above are merely preferred embodiments of the invention and do not limit the invention; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (11)

1. A method for processing segment data in a document, characterized by comprising:
extracting the segment data from the document;
recording, in association with one another, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.
2. The method according to claim 1, characterized by further comprising:
receiving a keyword;
searching the attributes for the received keyword;
returning the segment data associated with the attributes found.
3. The method according to claim 1, characterized in that the extracting comprises:
converting the document into a document in XML format;
traversing the content of each node in the XML-format document;
extracting the content as the segment data.
4. The method according to claim 3, characterized in that the recording in association comprises:
during the traversal, extracting the attributes of each segment data item from the XML-format document;
storing the attributes of each segment data item in a record of a database created in advance;
determining the publication to which the document belongs;
storing, in each record of the database, the attributes of one segment data item, the attributes of the document, and the attributes of the publication to which the document belongs.
5. The method according to claim 2, characterized by further comprising:
storing the segment data, the document and the publication separately, and generating a storage identifier for each;
when recording the attributes in association, also recording in association the storage identifier of the segment data, the storage identifier of the document and the storage identifier of the publication.
6. The method according to claim 5, characterized in that, after returning the segment data associated with the attributes found, the method further comprises:
returning an access link to the document or the publication associated with the segment data;
the access link containing the storage identifier of the document or the publication.
7. The method according to claim 6, characterized by further comprising:
replacing in advance the segment data in the document with placeholders containing the storage identifiers of the segment data;
accessing the document according to the access link;
while displaying the document, obtaining the segment data according to the storage identifier and replacing the placeholder with the segment data.
8. A device for processing segment data in a document, characterized by comprising:
an extraction module, configured to extract the segment data from the document;
a recording module, configured to record, in association with one another, the attributes of the segment data, the attributes of the document, and the attributes of the publication to which the document belongs.
9. The device according to claim 8, characterized by further comprising:
a search module, configured to search the attributes for a received keyword;
a feedback module, configured to return the segment data associated with the attributes found.
10. The device according to claim 9, characterized by further comprising:
an identification module, configured to store the segment data, the document and the publication separately and to generate a storage identifier for each;
an identifier recording module, configured to record in association, when the attributes are recorded in association, the storage identifier of the segment data, the storage identifier of the document and the storage identifier of the publication.
11. The device according to claim 10, characterized by further comprising:
a link feedback module, configured to return an access link to the document or the publication associated with the segment data.
CN201110168129XA 2011-06-21 2011-06-21 Method and device for processing fragmentation data in document Pending CN102841893A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110168129XA CN102841893A (en) 2011-06-21 2011-06-21 Method and device for processing fragmentation data in document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110168129XA CN102841893A (en) 2011-06-21 2011-06-21 Method and device for processing fragmentation data in document

Publications (1)

Publication Number Publication Date
CN102841893A true CN102841893A (en) 2012-12-26

Family

ID=47369266

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110168129XA Pending CN102841893A (en) 2011-06-21 2011-06-21 Method and device for processing fragmentation data in document

Country Status (1)

Country Link
CN (1) CN102841893A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934336A (en) * 2015-12-31 2017-07-07 珠海金山办公软件有限公司 A kind of method and device of lantern slide identification

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090210264A1 (en) * 2006-08-09 2009-08-20 Anderson Denise M Conversation Mode Booking System
CN101894115A (en) * 2009-05-18 2010-11-24 北京大学 Image data processing method of electronic document and device thereof

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090210264A1 (en) * 2006-08-09 2009-08-20 Anderson Denise M Conversation Mode Booking System
CN101894115A (en) * 2009-05-18 2010-11-24 北京大学 Image data processing method of electronic document and device thereof

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张素智等: "XML数据库及其应用研究", 《计算机工程与应用》 *
陈玲灵等: "数字图书馆中文文本数据对象", 《燕山大学学报》 *
陈玲灵等: "数字图书馆中文文本数据对象转换为XML格式文档的实现方法", 《燕山大学学报》 *
陈玲灵等: "数字图书馆中文文本数据对象转换为XML格式文档的实现方法", 《燕山大学学报》, no. 02, 15 May 2002 (2002-05-15), pages 184 - 186 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106934336A (en) * 2015-12-31 2017-07-07 珠海金山办公软件有限公司 A kind of method and device of lantern slide identification
US10698943B2 (en) 2015-12-31 2020-06-30 Beijing Kingsoft Office Software, Inc. Method and apparatus for recognizing slide
CN106934336B (en) * 2015-12-31 2020-07-03 珠海金山办公软件有限公司 Method and device for identifying slide

Similar Documents

Publication Publication Date Title
CN102521416B (en) Data correlation query method and data correlation query device
US8943054B2 (en) Social media content management system and method
US20140358911A1 (en) Search and discovery system
AU2016345990A1 (en) A system and method for processing big data using electronic document and electronic file-based system that operates on RDBMS
JP5147947B2 (en) Method and system for generating search collection by query
JP2010067175A (en) Hybrid content recommendation server, recommendation system, and recommendation method
CN102314497B (en) Method and equipment for identifying body contents of markup language files
US8880463B2 (en) Standardized framework for reporting archived legacy system data
CA2619230A1 (en) Annotating documents in a collaborative application with data in disparate information systems
CN102184211A (en) File system, and method and device for retrieving, writing, modifying or deleting file
CN101477527B (en) Multimedia resource retrieval method and apparatus
CN103827852B (en) Assemble WEB page on search engine results page
CN103778202A (en) Enterprise electronic document managing server side and system
CN103020322A (en) Query method
US20140372412A1 (en) Dynamic filtering search results using augmented indexes
US20110219017A1 (en) System and methods for citation database construction and for allowing quick understanding of scientific papers
KR20150018880A (en) Information aggregation, classification and display method and system
CN102819601A (en) Information retrieval method and information retrieval equipment
US20150066996A1 (en) Method and system for automatically collecting publication digital resource
CN102841886A (en) Method and device for splitting document
CN110471925A (en) Realize the method and system that index data is synchronous in search system
WO2014144033A1 (en) Multiple schema repository and modular data procedures
WO2016206395A1 (en) Weekly report information processing method and device
CN102841893A (en) Method and device for processing fragmentation data in document
Desyaputri et al. News recommendation in Indonesian language based on user click behavior

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20121226