CN103136304A - Article processing method and device - Google Patents

Info

Publication number
CN103136304A
Authority
CN
China
Prior art keywords
index
xml document
field
module
xpath
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011104013863A
Other languages
Chinese (zh)
Other versions
CN103136304B (en)
Inventor
刘浩
翟因为
陈长刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd
Priority to CN201110401386.3A
Publication of CN103136304A
Application granted
Publication of CN103136304B
Legal status: Expired - Fee Related

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses an article processing method comprising the following steps: creating Extensible Markup Language (XML) documents to record the content of articles, wherein the XPATH of each element in an XML document corresponds to the chapter hierarchy of the article's content; storing each XML document in an XML document field of an article data table; and creating an index on the XML document field according to the XPATH of the elements in the XML documents. The invention also provides an article processing device comprising a structuring module, a database module and an index module. The structuring module creates the XML documents to record the content of the articles, wherein the XPATH of each element in an XML document corresponds to the chapter hierarchy of the article's content. The database module stores each XML document in the XML document field of the article data table. The index module creates the index on the XML document field according to the XPATH of the elements in the XML documents. The article processing method and device improve the efficiency of article retrieval.

Description

Article processing method and device
Technical field
The present invention relates to the field of Internet publishing, and in particular to an article processing method and device.
Background art
Article-type data has a chapter-and-section hierarchy. To preserve the integrity and the hierarchical relationships of an article's content, the whole content can be stored as XML, as a single attribute, in one field of a database table. This field is the XML document field; together with the article's other attributes it makes up a complete record.
When articles are retrieved, the attributes of the article are organized into a search condition field by field, and the articles are then retrieved. When the search condition includes a constraint on an element inside the article content, the records that satisfy the other conditions must be obtained first, and the complete XML fragment of each article's content must be loaded. The element is then evaluated by XPATH, and the qualifying records are obtained by filtering.
The inventors found that this retrieval approach loads XML documents frequently and consumes considerable resources.
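The cost of this filter-style retrieval can be illustrated with a minimal Python sketch. It is not taken from the patent: the record layout, the `author` attribute, the element name `industry_background`, and the helper `filter_by_element` are all hypothetical. Only the field name `DOC_XMLDATA` appears in the example given later in this description.

```python
import xml.etree.ElementTree as ET

# Hypothetical records: ordinary attributes plus one XML document field.
RECORDS = [
    {"id": 1, "author": "A",
     "DOC_XMLDATA": "<paper><industry_background>digital publishing</industry_background></paper>"},
    {"id": 2, "author": "A",
     "DOC_XMLDATA": "<paper><industry_background>print media</industry_background></paper>"},
    {"id": 3, "author": "B",
     "DOC_XMLDATA": "<paper><industry_background>digital publishing</industry_background></paper>"},
]


def filter_by_element(records, author, xpath, keyword):
    """Filter-style retrieval: match the other conditions, then parse each XML."""
    matches, parsed = [], 0
    for record in records:
        if record["author"] != author:               # the "other conditions"
            continue
        root = ET.fromstring(record["DOC_XMLDATA"])  # the whole document is loaded
        parsed += 1
        node = root.find(xpath)                      # element located by its path
        if node is not None and keyword in (node.text or ""):
            matches.append(record["id"])
    return matches, parsed


matches, parsed = filter_by_element(
    RECORDS, "A", "./industry_background", "digital publishing")
print(matches, parsed)
```

Every record that passes the other conditions is parsed in full, even when it does not match the element constraint, which is the repeated loading the inventors describe.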
Summary of the invention
The present invention aims to provide an article processing method and device that improve the efficiency of article retrieval.
An embodiment of the present invention provides an article processing method, comprising: creating an XML document to record the content of an article, wherein the XPATH of each element in the XML document corresponds to the chapter hierarchy in the content of the article; storing each XML document in the XML document field of an article data table; and creating an index on the XML document field of the database according to the XPATH of the elements in the XML document.
An embodiment of the present invention provides an article processing device, comprising: a structuring module, configured to create an XML document to record the content of an article, wherein the XPATH of each element in the XML document corresponds to the chapter hierarchy in the content of the article; a database module, configured to store each XML document in the XML document field of an article data table; and an index module, configured to create an index on the XML document field of the database according to the XPATH of the elements in the XML document.
Because the article processing method and device of the above embodiments create an index on the XML document field, they overcome the low article retrieval efficiency of the prior art and improve the efficiency of article retrieval.
Description of drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their description explain the invention and do not unduly limit it. In the drawings:
Fig. 1 shows an article processing method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the index relationship according to a preferred embodiment of the present invention;
Fig. 3 is a flowchart of index-based retrieval according to a preferred embodiment of the present invention;
Fig. 4 is a screenshot of the index management interface according to a preferred embodiment of the present invention;
Fig. 5 is a schematic diagram of an article processing device according to an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an article processing method according to an embodiment of the present invention, comprising:
Step S10: create an XML document to record the content of an article, wherein the XPATH of each element in the XML document corresponds to the chapter hierarchy in the content of the article;
Step S20: store each XML document in the XML document field of the article data table;
Step S30: create an index on the XML document field of the database according to the XPATH of the elements in the XML document.
In the prior art, retrieving articles stored as XML requires obtaining the complete XML fragment of the article content and then searching it by XPATH. The method of this embodiment instead creates an index on the XML document field, so articles can be retrieved through the index without reloading the whole XML document. This reduces resource consumption, markedly improves retrieval efficiency, and shortens retrieval time.
In addition, the prior art retrieves elements by traversal and addressing, which is slow. This method retrieves articles through the index and does not need to traverse and address the elements again, which further shortens retrieval time.
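Steps S10 to S30 can be sketched in Python with the standard `sqlite3` and `xml.etree.ElementTree` modules. This is a minimal illustration under stated assumptions, not the patented implementation: the table names `article` and `element_index`, the root element `paper`, the element names, and all helper functions are hypothetical, and indexing only leaf elements is a simplification chosen here.

```python
import sqlite3
import xml.etree.ElementTree as ET


def build_element(tag, content):
    """S10: nested dicts become nested elements, so XPath mirrors the chapters."""
    element = ET.Element(tag)
    if isinstance(content, dict):
        for child_tag, child_content in content.items():
            element.append(build_element(child_tag, child_content))
    else:
        element.text = content
    return element


def leaf_paths(element, prefix=""):
    """Yield (xpath, text) for every leaf element, in document order."""
    path = f"{prefix}/{element.tag}"
    children = list(element)
    if not children:
        yield path, element.text or ""
    for child in children:
        yield from leaf_paths(child, path)


def store_article(conn, article_id, content):
    root = build_element("paper", content)
    xml_text = ET.tostring(root, encoding="unicode")
    # S20: the whole document goes into the XML document field.
    conn.execute("INSERT INTO article (id, DOC_XMLDATA) VALUES (?, ?)",
                 (article_id, xml_text))
    # S30: one index entry per leaf element, keyed by its XPath.
    conn.executemany(
        "INSERT INTO element_index (xpath, article_id, content) VALUES (?, ?, ?)",
        [(path, article_id, text) for path, text in leaf_paths(root)],
    )


def search(conn, xpath, keyword):
    """Look up articles through the index only; no XML is parsed here."""
    rows = conn.execute(
        "SELECT article_id FROM element_index "
        "WHERE xpath = ? AND content LIKE ? ORDER BY article_id",
        (xpath, f"%{keyword}%"),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, DOC_XMLDATA TEXT)")
conn.execute("CREATE TABLE element_index (xpath TEXT, article_id INTEGER, content TEXT)")
conn.execute("CREATE INDEX idx_xpath ON element_index (xpath)")

store_article(conn, 1, {"industry_background": "digital publishing",
                        "key_characteristic": {"functional_characteristic": "fast search"}})
store_article(conn, 2, {"industry_background": "print media"})

print(search(conn, "/paper/industry_background", "digital"))
print(search(conn, "/paper/key_characteristic/functional_characteristic", "search"))
```

The search function touches only the index table, so the stored XML documents are never reloaded at query time, which is the effect the embodiment claims.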
Preferably, step S30 comprises: creating a corresponding index for each element in the XML document field, wherein the name of the index = the name of the XML document field + a field-name separator + the XPATH of the element. This embodiment is simple to implement.
Fig. 2 is a schematic diagram of the index relationship according to a preferred embodiment of the present invention. As the figure shows, the association between an index field and the XML document is uniquely determined, so retrieving an element (whose content is article content) can be converted equivalently into retrieving the index field. At the same time, managing the element index data becomes managing the index data table, which makes element retrieval fast and efficient.
For example, consider the following data table:
[Figure BSA00000630234600041: data table, image not included in this text]
In this data table, the XML stored in the field DOC_XMLDATA has the following structure:
[Figure BSA00000630234600042: XML structure, image not included in this text]
According to this preferred embodiment, the generated index names are as follows:
<node text="DOC_XMLDATA_/paper/industry background"/>
<node text="DOC_XMLDATA_/paper/product orientation"/>
<node text="DOC_XMLDATA_/paper/key characteristic/functional characteristic"/>
<node text="DOC_XMLDATA_/paper/key characteristic/performance characteristic"/>
<node text="DOC_XMLDATA_/paper/key characteristic/technical characteristic"/>
<node text="DOC_XMLDATA_/paper/market outlook"/>
<node text="DOC_XMLDATA_/paper/risk assessment"/>
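The naming rule (field name + separator + XPATH) can be sketched as follows. The sample XML is an assumption: its element names are inferred from the index names listed above, with underscores replacing spaces so that they are valid XML names. Generating names only for leaf elements is also an assumption, chosen because the list above contains only leaf paths. The helpers `leaf_xpaths` and `index_names` are hypothetical.

```python
import xml.etree.ElementTree as ET

FIELD_NAME = "DOC_XMLDATA"  # XML document field name from the example above
SEPARATOR = "_"             # field-name separator as it appears in the index names

# Hypothetical article content; structure inferred from the index names above.
SAMPLE_XML = """
<paper>
  <industry_background>...</industry_background>
  <product_orientation>...</product_orientation>
  <key_characteristic>
    <functional_characteristic>...</functional_characteristic>
    <performance_characteristic>...</performance_characteristic>
    <technical_characteristic>...</technical_characteristic>
  </key_characteristic>
  <market_outlook>...</market_outlook>
  <risk_assessment>...</risk_assessment>
</paper>
"""


def leaf_xpaths(element, prefix=""):
    """Yield the XPath of every leaf element under `element`, in document order."""
    path = f"{prefix}/{element.tag}"
    children = list(element)
    if not children:
        yield path
    for child in children:
        yield from leaf_xpaths(child, path)


def index_names(xml_text, field_name=FIELD_NAME, separator=SEPARATOR):
    """Index name = XML document field name + separator + XPath of the element."""
    root = ET.fromstring(xml_text)
    return [f"{field_name}{separator}{xpath}" for xpath in leaf_xpaths(root)]


names = index_names(SAMPLE_XML)
for name in names:
    print(name)
```

Because the name embeds the full path, two elements with the same tag under different chapters still receive distinct index names.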
Preferably, step S30 further comprises: storing the indexes together as an index data table, wherein the name of each index is stored in the index field of the index data table.
Preferably, a title field is also created in the index data table to record a short name for each index field, for presentation to the user.
The index data table created according to the above preferred embodiment is as follows:
[Figure BSA00000630234600061: index data table, image not included in this text]
CLOB denotes a variable-length text field.
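Since the figure of the index data table is not available in this text, the layout below is an assumption rather than the patent's actual table. It is a minimal `sqlite3` sketch with a hypothetical table name `index_catalog`, an index field holding the index name, a type column defaulting to CLOB, and a title field holding the short name shown to the user.

```python
import sqlite3

# Hypothetical rows: (index name, short name shown to the user).
INDEX_ROWS = [
    ("DOC_XMLDATA_/paper/industry_background", "Industry background"),
    ("DOC_XMLDATA_/paper/product_orientation", "Product orientation"),
]


def build_index_table(conn, rows):
    """Create the index data table and store one row per index."""
    conn.execute(
        "CREATE TABLE index_catalog ("
        " index_field TEXT PRIMARY KEY,"             # index name
        " field_type TEXT NOT NULL DEFAULT 'CLOB',"  # variable-length text
        " title TEXT NOT NULL)"                      # short name for the user
    )
    conn.executemany(
        "INSERT INTO index_catalog (index_field, title) VALUES (?, ?)", rows)
    conn.commit()


def titles_for_user(conn):
    """Return the short names in insertion order, e.g. to fill a selection list."""
    query = "SELECT title FROM index_catalog ORDER BY rowid"
    return [row[0] for row in conn.execute(query)]


def index_field_for_title(conn, title):
    """Map a short name chosen by the user back to its index field."""
    row = conn.execute(
        "SELECT index_field FROM index_catalog WHERE title = ?", (title,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
build_index_table(conn, INDEX_ROWS)
print(titles_for_user(conn))
print(index_field_for_title(conn, "Product orientation"))
```

Keeping the short name next to the index name lets the interface show readable labels while the retrieval layer works with the full index names.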
Preferably, the method further comprises:
Presenting the short names recorded in the title field to the user;
Receiving the user's selection of a short name and the search string the user enters;
Searching the index field corresponding to the selected short name, using the search string as the keyword;
Returning to the user the content of the XML document field that the retrieved index points to.
Based on the search condition entered by the user, this preferred embodiment assembles the query syntax for the search engine; the user only needs to select the items to search and enter a keyword. For example, if the user wants to find documents whose industry background or product orientation relates to digital publishing, the assembled query is as follows:
((DOC_XMLDATA_/paper/industry background LIKE 'digital publishing') OR (DOC_XMLDATA_/paper/product orientation LIKE 'digital publishing'))
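Assembling such a condition from the user's selections can be sketched in a few lines of Python. The function `build_condition` is hypothetical, and escaping a single quote by doubling it is an assumption, since the search engine's quoting rules are not stated in this text.

```python
def build_condition(index_fields, keyword):
    """Combine one LIKE clause per selected index field with OR."""
    if not index_fields:
        raise ValueError("at least one index field is required")
    # Assumption: single quotes inside the keyword are escaped by doubling.
    escaped = keyword.replace("'", "''")
    clauses = [f"({field} LIKE '{escaped}')" for field in index_fields]
    return "(" + " OR ".join(clauses) + ")"


condition = build_condition(
    ["DOC_XMLDATA_/paper/industry background",
     "DOC_XMLDATA_/paper/product orientation"],
    "digital publishing",
)
print(condition)
```

With the two index fields and the keyword from the example, the function reproduces the condition shown above.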
The syntax converter converts the query statement into element-retrieval syntax and sends it to the retrieval service. The element-retrieval syntax is as follows:
[Figures BSA00000630234600062 and BSA00000630234600071: element-retrieval syntax, images not included in this text]
The retrieval service receives the search condition, calls the syntax conversion service to convert the query statement, executes the retrieval, and obtains the result set. The search engine returns the result set to the human-machine interface.
Fig. 3 is a flowchart of index-based retrieval according to a preferred embodiment of the present invention, comprising:
Step 1: the search engine receives the retrieval request passed from the front-end page;
Step 2: the search engine calls the syntax converter to convert the page's search condition into element-retrieval syntax;
Step 3: the search engine initiates the retrieval request and passes the query statement to the retrieval service;
Step 4: the retrieval service parses the query syntax, executes the retrieval, and obtains the result set;
Step 5: the retrieval service returns the indexed result set to the search engine;
Step 6: the search engine parses the result set, obtains the result documents according to the index rules, and returns them to the front end for processing.
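The six steps can be sketched as three cooperating Python functions. Everything here is a hypothetical stand-in: the actual element-retrieval syntax is in a figure that is not available in this text, so a simple list of `Term` objects is used in its place, and the index data and article table are small in-memory dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """One element-retrieval condition: search `keyword` in `index_field`."""
    index_field: str
    keyword: str


# Hypothetical in-memory stand-ins for the index data and the article table.
INDEX_DATA = {
    "DOC_XMLDATA_/paper/industry_background": {
        1: "digital publishing is growing", 2: "print media"},
    "DOC_XMLDATA_/paper/product_orientation": {
        1: "for newspapers", 3: "a digital publishing platform"},
}
ARTICLES = {
    1: "<paper>article 1</paper>",
    2: "<paper>article 2</paper>",
    3: "<paper>article 3</paper>",
}


def convert_syntax(selected_fields, keyword):
    """Step 2: turn the page's search condition into element-retrieval terms."""
    return [Term(field, keyword) for field in selected_fields]


def retrieval_service(terms):
    """Steps 4 and 5: evaluate the terms against the index data (OR semantics)."""
    hits = set()
    for term in terms:
        for article_id, text in INDEX_DATA.get(term.index_field, {}).items():
            if term.keyword in text:
                hits.add(article_id)
    return hits


def search_engine(selected_fields, keyword):
    """Steps 1, 3 and 6: take the request, call the service, return documents."""
    terms = convert_syntax(selected_fields, keyword)
    result_set = retrieval_service(terms)
    return {article_id: ARTICLES[article_id] for article_id in sorted(result_set)}


results = search_engine(list(INDEX_DATA), "digital publishing")
print(sorted(results))
```

The stored documents are fetched only in the last step, and only for the identifiers that the index lookup has already matched.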
Fig. 4 is a screenshot of the index management interface according to a preferred embodiment of the present invention.
This preferred embodiment provides a friendlier interactive interface. The title field helps the user select the appropriate index field, so that articles are retrieved through the index, which is easier for the user.
Fig. 5 is a schematic diagram of an article processing device according to an embodiment of the present invention, comprising:
A structuring module 10, configured to create an XML document to record the content of an article, wherein the XPATH of each element in the XML document corresponds to the chapter hierarchy in the content of the article;
A database module 20, configured to store each XML document in the XML document field of the article data table;
An index module 30, configured to create an index on the XML document field of the database according to the XPATH of the elements in the XML document.
This device reduces resource consumption, markedly improves retrieval efficiency, and shortens retrieval time.
Preferably, the index module is configured to create a corresponding index for each element in the XML document field, wherein the name of the index = the name of the XML document field + a field-name separator + the XPATH of the element.
Preferably, the index module is further configured to store the indexes together as an index data table, wherein the name of each index is stored in the index field of the index data table.
Preferably, the index module is further configured to create a title field in the index data table to record a short name for each index field, for presentation to the user.
Preferably, the device further comprises: an interface module, configured to present the short names recorded in the title field to the user; a receiving module, configured to receive the user's selection of a short name and the search string the user enters; a retrieval module, configured to search the index field corresponding to the selected short name, using the search string as the keyword; and a submission module, configured to return to the user the content of the XML document field that the retrieved index points to.
As can be seen from the above description, the present invention achieves the following technical effects:
Direct element retrieval: XML elements are retrieved directly without changing the original XML storage structure.
Reduced repeated loading: retrieval works directly on elements, which reduces repeated loading of complete XML documents, saves resources, and improves resource utilization.
Improved retrieval efficiency: the original traversal-and-addressing approach is replaced by index-based retrieval that accesses elements directly, which improves retrieval efficiency.
Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. Alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not restricted to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

1. An article processing method, characterized by comprising:
Creating an XML document to record the content of an article, wherein the XPATH of each element in the XML document corresponds to the chapter hierarchy in the content of the article;
Storing each XML document in the XML document field of an article data table;
Creating an index on the XML document field of the database according to the XPATH of the elements in the XML document.
2. The method according to claim 1, characterized in that creating an index on the XML document field of the database according to the XPATH of the elements in the XML document comprises:
Creating a corresponding index for each element in the XML document field, wherein the name of the index = the name of the XML document field + a field-name separator + the XPATH of the element.
3. The method according to claim 2, characterized in that creating an index on the XML document field of the database according to the XPATH of the elements in the XML document further comprises:
Storing the indexes together as an index data table, wherein the name of each index is stored in the index field of the index data table.
4. The method according to claim 3, characterized in that a title field is also created in the index data table to record a short name for the index field, for presentation to the user.
5. The method according to claim 4, characterized by further comprising:
Presenting the short names recorded in the title field to the user;
Receiving the user's selection of a short name and the search string the user enters;
Searching the index field corresponding to the selected short name, using the search string as the keyword;
Returning to the user the content of the XML document field that the retrieved index points to.
6. An article processing device, characterized by comprising:
A structuring module, configured to create an XML document to record the content of an article, wherein the XPATH of each element in the XML document corresponds to the chapter hierarchy in the content of the article;
A database module, configured to store each XML document in the XML document field of an article data table;
An index module, configured to create an index on the XML document field of the database according to the XPATH of the elements in the XML document.
7. The device according to claim 6, characterized in that the index module is configured to create a corresponding index for each element in the XML document field, wherein the name of the index = the name of the XML document field + a field-name separator + the XPATH of the element.
8. The device according to claim 7, characterized in that the index module is further configured to store the indexes together as an index data table, wherein the name of each index is stored in the index field of the index data table.
9. The device according to claim 8, characterized in that the index module is further configured to create a title field in the index data table to record a short name for the index field, for presentation to the user.
10. The device according to claim 9, characterized by further comprising:
An interface module, configured to present the short names recorded in the title field to the user;
A receiving module, configured to receive the user's selection of a short name and the search string the user enters;
A retrieval module, configured to search the index field corresponding to the selected short name, using the search string as the keyword;
A submission module, configured to return to the user the content of the XML document field that the retrieved index points to.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110401386.3A CN103136304B (en) 2011-12-05 2011-12-05 Article processing method and device

Publications (2)

Publication Number Publication Date
CN103136304A true CN103136304A (en) 2013-06-05
CN103136304B CN103136304B (en) 2017-02-22

Family

ID=48496136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110401386.3A Expired - Fee Related CN103136304B (en) 2011-12-05 2011-12-05 Article processing method and device

Country Status (1)

Country Link
CN (1) CN103136304B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193849A (en) * 2016-03-15 2017-09-22 北大方正集团有限公司 XML file full-text search index generation method and device
CN109460394A (en) * 2018-11-20 2019-03-12 北京广利核系统工程有限公司 A kind of simplification method of multistage document entry tracing matrix

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050228768A1 (en) * 2004-04-09 2005-10-13 Ashish Thusoo Mechanism for efficiently evaluating operator trees
US20050228828A1 (en) * 2004-04-09 2005-10-13 Sivasankaran Chandrasekar Efficient extraction of XML content stored in a LOB
CN1965316A (en) * 2004-04-09 2007-05-16 甲骨文国际公司 Index for accessing XML data

Also Published As

Publication number Publication date
CN103136304B (en) 2017-02-22

Similar Documents

Publication Publication Date Title
CN102930059B (en) Method for designing focused crawler
Mani et al. Semantic data modeling using XML schemas
CN106126648B (en) It is a kind of based on the distributed merchandise news crawler method redo log
US9753960B1 (en) System, method, and computer program for dynamically generating a visual representation of a subset of a graph for display, based on search criteria
CN104516982A (en) Method and system for extracting Web information based on Nutch
CN102622453A (en) Body-based food security event semantic retrieval system
CN110222110A (en) A kind of resource description framework data conversion storage integral method based on ETL tool
CN102810114A (en) Personal computer resource management system based on body
CN102521232A (en) Distributed acquisition and processing system and method of internet metadata
CN102193798A (en) Method for automatically acquiring Open application programming interface (API) based on Internet
CN103020318A (en) Method for maintenance of database tables in database
US9959305B2 (en) Annotating structured data for search
CN101799890B (en) Certificate data processing method and system
CN103136304A (en) Article processing method and device
CN112417225A (en) Joint query method and system for multi-source heterogeneous data
CN105740250B (en) A kind of method and device for the property index creating XML node
Patil et al. Semantic search using ontology and RDBMS for cricket
CN112905759A (en) Intellectual property retrieval system and method
CN102819594B (en) A kind of method and apparatus of organization website information
Zheng et al. Design and implementation of news collecting and filtering system based on RSS
CN103729422A (en) Information fragment associative output method and system
CN104965924B (en) A kind of date storage method and device
CN104298685A (en) Method and device for achieving heterogeneous system unified searching
Kaczmarek et al. Information extraction from web pages for the needs of expert finding
Saraswathi et al. Design of dynamically updated automatic ontology for mobile phone information retrieval system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170222

Termination date: 20171205
