CN101957866A - Network text information integration method and device

Info

Publication number: CN101957866A
Application number: CN2010105236614A
Authority: CN
Inventors: 高万林; 张树亮; 臧金玉; 李桢; 赵佳宁
Original assignee: China Agricultural University
Current assignee: China Agricultural University
Priority date: 2010-10-25
Filing date: 2010-10-25
Publication date: 2011-01-26

Abstract

The invention discloses a network text information integration method and device. The method comprises the following steps: obtaining basic information of a website and analyzing the basic information by a program; traversing URLs according to the basic information and using the program to obtain the page information of web pages; using the program to arrange and store the obtained page information according to preset rules; and publishing the arranged information to the Internet through a web program. The method and device can analyze web pages at multiple levels according to the user's needs, extract the content the user is interested in, and store and publish it. They can thereby collect and store agricultural network text information efficiently, and solve the prior-art problem of limited breadth and scope of web page collection.

Description

Network text information integration method and device

Technical field

The present invention relates to network information collection technology, and in particular to a network text information integration method and device.

Background technology

With the rapid development of rural informatization, websites that provide agricultural information services to farmers are being built vigorously across the country, and most provinces, cities and regions have their own agricultural information websites. However, because China covers a vast territory and has a rural population of 900 million, the volume of agricultural information is huge. Each local agricultural website therefore collects only the agricultural information of its own area, including network text information such as news, agricultural science and technology information, information on getting rich through agriculture, agricultural market analysis, and policies and regulations concerning farmers.

In the course of realizing the present invention, the inventors found at least the following defects in the prior art. Existing information collection systems are mostly crawler-based systems that obtain web page information by following hyperlinks and aim to satisfy the needs of all Internet users. Techniques for the structured extraction of web page information are embodied to some degree in many web crawler products, but both the techniques and the extraction methods have limitations. This makes them difficult to apply to practical agricultural information extraction:

1. In existing systems, the technique for extracting structured web data is embedded inside the program and uses fixed collection rules to extract structured data. Such extraction is limited to a single web page or to similar web pages.

2. Existing methods set a query radius according to the layout type of the web page and perform structured extraction, but most information cannot be obtained in this way.

3. Existing methods extract structured information through configuration files. However, only people familiar with web page programming can write these configuration files, so extraction in this way greatly narrows the range of users.

It can be seen that, as the range of crawler applications keeps expanding and users' requirements for collecting structured web page data grow ever higher, current web crawler technology cannot meet users' demand for intelligent collection of structured data.

Summary of the invention

The object of the present invention is to provide a network text information integration method and device that collect and store agricultural network text information efficiently and solve the prior-art problem of limited breadth and scope of web page collection.

A network text information integration method of the present invention comprises the following steps: a parameter input step of obtaining basic information of a website and automatically analyzing the basic information; a collection step of traversing URLs according to the basic information and automatically obtaining the page information of web pages; an information arrangement and storage step of automatically analyzing the obtained web page information according to preset rules, including filtering, arranging, classifying and storing; and a publishing step of automatically publishing the arranged information to the Internet.

In the above information integration method, preferably, in the parameter input step, the basic information of the website comprises: the URL of the website, the character set of the website, the storage address of files, whether collection is single-item, and whether publishing is automatic.
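The five items of basic website information listed above can be pictured as a small configuration record. The following is a minimal sketch only: the class name `SiteConfig`, its field names, the default values and the example URL are all hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteConfig:
    """Basic information supplied by the user for one website (names are hypothetical)."""

    entry_url: str               # URL of the website, also used as the crawl entry
    charset: str = "utf-8"       # character set of the website
    save_path: str = "."         # storage address for collected files
    single_item: bool = False    # True: collect a single item; False: collect many
    auto_publish: bool = False   # True: publish automatically after storage


# Example values are assumptions chosen only for illustration.
config = SiteConfig(entry_url="http://example.com/news/index.html", charset="gb2312")
```

Keeping the record immutable (`frozen=True`) is a design choice of this sketch, so that the later steps all see the same parameters the user entered.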

In the above information integration method, preferably, in the parameter input step, automatically analyzing the basic information comprises: according to the basic information, analyzing the user-configured parameters by a program; starting from the website URL entry and based on a crawler, traversing all URLs of the page; and analyzing the URLs by the program so as to divide them into accessible URLs, duplicate URLs and discarded URLs.

In the above information integration method, preferably, in the collection step, the accessible URLs are passed through a loop program that visits the web page corresponding to each URL and obtains the HTML source code of the web page.

In the above information integration method, preferably, in the information arrangement step, the obtained HTML source code of the web page is filtered to obtain the text information.

A network text information integration device of the present invention comprises: a parameter input module for obtaining basic information of a website and automatically analyzing the basic information; a collection module for traversing URLs according to the basic information and automatically obtaining the page information of web pages; an information arrangement and storage module for automatically analyzing the obtained web page information according to preset rules, including filtering, arranging, classifying and storing; and a publishing module for automatically publishing the arranged information to the Internet.

In the above information integration device, preferably, in the parameter input module, the basic information of the website comprises: the URL of the website, the character set of the website, the storage address of files, whether collection is single-item, and whether publishing is automatic.

In the above information integration device, preferably, in the parameter input module, automatically analyzing the basic information consists in: according to the basic information, analyzing the user-configured parameters by a program; starting from the website URL entry and based on a crawler program, traversing all URLs of the page; and analyzing the URLs by the program so as to divide them into accessible URLs, duplicate URLs and discarded URLs.

In the above information integration device, preferably, in the collection module, the accessible URLs are passed through a loop program that visits the web page corresponding to each URL and obtains the HTML source code of the web page.

In the above information integration device, preferably, in the information arrangement module, the obtained HTML source code of the web page is filtered to obtain the text information.

Compared with the prior art, the present invention can analyze web pages at multiple levels according to the user's needs, extract the content the user cares about, and store and publish it. It thereby collects and stores agricultural network text information efficiently and solves the prior-art problem of limited breadth and scope of web page collection.

Description of drawings

Fig. 1 is a flow chart of the steps of the network text information integration method of the present invention;

Fig. 2 is a flow chart of the steps of an embodiment of the network text information integration method of the present invention;

Fig. 3 is a structural diagram of the network text information integration device of the present invention;

Fig. 4 is a structural diagram of an embodiment of the network text information integration device of the present invention.

Embodiment

To make the above objects, features and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the drawings and specific embodiments.

The inventive concept of the present invention is to obtain web page URLs by a program, collect web page information in a loop, and analyze, arrange, store and publish the collected information.

Referring to Fig. 1, which is a flow chart of the steps of the network text information integration method of the present invention, the method comprises: a parameter input step S110 of obtaining basic information of a website and automatically analyzing the basic information; a collection step S120 of traversing URLs according to the basic information and automatically obtaining the page information of web pages; an information arrangement and storage step S130 of automatically analyzing the obtained web page information according to preset rules, including filtering, arranging, classifying and storing; and a publishing step S140 of automatically publishing the arranged information to the Internet.

Method embodiment

Fig. 2 is a flow chart of the steps of an embodiment of the network text information integration method of the present invention. As shown in Fig. 2, and taking a module of a certain website as an example, the text information integration method of this embodiment mainly comprises the following steps.

Step S201: input the basic information of a website, including attributes such as the URL of the website, the character set of the website, the storage address of files, whether collection is single-item, and whether publishing is automatic. The user-configured parameters are analyzed by a program; an existing crawler program starts from the website URL entry and traverses all URLs of the page; and the program analyzes the URLs and divides them into accessible URLs, duplicate URLs and discarded URLs.
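The URL analysis in step S201 can be sketched as follows. The text does not say how a URL is judged accessible, duplicate or discarded, so the criteria here are assumptions: a link is treated as discarded when it is not an http or https address, as a duplicate when the same absolute address has already been seen, and as accessible otherwise. The names `LinkCollector` and `classify_urls` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def classify_urls(entry_url, html):
    """Split the links of one page into accessible, duplicate and discarded lists."""
    collector = LinkCollector()
    collector.feed(html)
    accessible, duplicate, discarded = [], [], []
    seen = set()
    for href in collector.hrefs:
        # Resolve relative links against the entry URL and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(entry_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            discarded.append(href)       # assumed rule: not a fetchable web address
        elif absolute in seen:
            duplicate.append(absolute)   # same address already queued
        else:
            seen.add(absolute)
            accessible.append(absolute)
    return accessible, duplicate, discarded
```

A real system would probably also confirm accessibility by actually requesting each address; this sketch only checks the form of the URL.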

This step manages the basic information of the websites to be collected. Based on the information input by the user, a separate thread is automatically opened for each website to be collected, and that thread completes the collection, arrangement and storage of the information. At the same time, the arrangement and publishing of the data are completed according to the parameters input by the user and the rules formulated.
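The idea of opening one independent thread per website can be illustrated with the standard `threading` module. This is a sketch under stated assumptions: `collect_site` is a hypothetical placeholder standing in for the real collect, arrange and store work, and the function names are not from the patent.

```python
import threading


def collect_site(site_url, results, lock):
    """Placeholder for the collect -> arrange -> store work done for one site."""
    record = f"collected:{site_url}"
    with lock:  # the results list is shared between threads, so guard it
        results.append(record)


def run_sites(site_urls):
    """Start one thread per configured website and wait for all of them."""
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=collect_site, args=(url, results, lock))
        for url in site_urls
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sorted(results)
```

Because each site runs in its own thread, a slow or unreachable site does not hold up the collection of the others.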

Step S202: information is collected mainly by page capture. The accessible URLs obtained in step S201 are passed through a loop program that visits the web page corresponding to each URL and obtains the HTML source code of the web page.
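The loop in step S202 can be sketched with the standard library as below. The function names, the 10-second timeout and the decision to record failures instead of stopping are assumptions of this sketch; the patent only states that each accessible URL is visited and its HTML source obtained.

```python
from urllib.request import urlopen


def fetch_html(url, charset="utf-8", timeout=10):
    """Download one page and decode it with the site's configured character set."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(charset, errors="replace")


def collect_pages(urls, fetch=fetch_html):
    """Visit every accessible URL in turn; return the HTML per URL plus failures."""
    pages, failed = {}, []
    for url in urls:
        try:
            pages[url] = fetch(url)
        except (OSError, ValueError) as exc:  # network errors, malformed URLs
            failed.append((url, str(exc)))
    return pages, failed
```

The `fetch` parameter lets a caller substitute another downloader, for example one bound to the character set from the site configuration, or a stub for offline testing.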

Information collection is carried out mainly according to the basic website information input by the user, and comprises:

1. According to the input parameters, automatically determining whether to collect a single item of data or multiple items of data.

2. According to the input web page URL (which is also the entry through which the system collects web pages), automatically analyzing the overall structure of the website. The structure of the web pages is analyzed using a collimation analysis method, which is more targeted: it accurately obtains the information the user wants and filters out information irrelevant to it. The method obtains all URL link addresses in the same directory as the entry URL, and then traverses the URLs in turn to obtain the source code of the web pages.
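One possible reading of "all URL link addresses in the same directory as the entry URL" is a filter that keeps only links whose host and directory path match those of the entry URL. The sketch below follows that reading, which is an assumption; the patent does not define the analysis method in more detail, and the function names are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def directory_of(url):
    """Return (host, directory path) for a URL, e.g. ('example.com', '/news')."""
    parts = urlparse(url)
    return parts.netloc.lower(), posixpath.dirname(parts.path)


def same_directory(entry_url, candidates):
    """Keep only the candidate URLs that sit in the entry URL's directory."""
    target = directory_of(entry_url)
    return [url for url in candidates if directory_of(url) == target]
```

Under this reading, pages in sibling sections or in deeper subdirectories of the site are left out, which keeps the collection focused on the module the user pointed at.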

Step S203: the HTML code collected in step S202 is filtered by analyzing the HTML tags configured by the user, so as to obtain the agricultural text information in the website. All the information is then classified, arranged and stored on the server. The following is a simple web page text structure:

<title>Agricultural network text information collection technology</title>
<div class="time">2010-6-30</div>
<div class="sourc">Agriculture</div>
<div class="author">Agriculture</div>
<div class="content">The summary of the invention is taken as an example</div>

Rules are configured according to the above web page source code as follows:

1. Title: the begin marker is <title> and the end marker is </title>.

2. Content: the begin marker is class="content" and the end marker is </div>.

3. Time: the begin marker is <div class="time" and the end marker is </div>.

4. Source: the begin marker is class="sourc" and the end marker is </div>.

5. Author: the begin marker is class="author" and the end marker is </div>.

After the rules are configured, the analysis program yields the following results:

1. Title: Agricultural network text information collection technology.

2. Content: The summary of the invention is taken as an example.

3. Time: 2010-6-30.

4. Source: Agriculture.

5. Author: Agriculture.
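The begin-marker and end-marker rules above can be applied with plain string searching. The following sketch reproduces the sample page (in translation) and the five rules. The function name `extract_field`, and the choice to skip to the end of the opening tag when a begin marker such as class="content" stops before the closing ">", are assumptions made so that the markers yield the listed results.

```python
def extract_field(html, begin, end):
    """Return the text between a begin marker and the next end marker, or None."""
    start = html.find(begin)
    if start == -1:
        return None
    start += len(begin)
    if not begin.endswith(">"):
        # The marker stops inside the opening tag; move past that tag's ">".
        close = html.find(">", start)
        if close == -1:
            return None
        start = close + 1
    stop = html.find(end, start)
    if stop == -1:
        return None
    return html[start:stop].strip()


# Rules as configured in the example: field -> (begin marker, end marker).
RULES = {
    "title": ("<title>", "</title>"),
    "content": ('class="content"', "</div>"),
    "time": ('<div class="time"', "</div>"),
    "source": ('class="sourc"', "</div>"),
    "author": ('class="author"', "</div>"),
}

PAGE = (
    "<title>Agricultural network text information collection technology</title>"
    '<div class="time">2010-6-30</div>'
    '<div class="sourc">Agriculture</div>'
    '<div class="author">Agriculture</div>'
    '<div class="content">The summary of the invention is taken as an example</div>'
)

record = {name: extract_field(PAGE, begin, end) for name, (begin, end) in RULES.items()}
```

Because the rules are data rather than code, supporting a new website means supplying a new `RULES` table, not changing the extraction function.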

This step is information arrangement: the web page source code is processed according to the formulated rules, and the parts the user needs are extracted from the source code, namely the title, source, time, author and content. During arrangement the system can automatically detect content pagination. For the title, source, time and author, the system automatically filters out the original styling of the web page during extraction, whereas the content automatically retains its original styling. After arrangement, the next step is the data storage stage.
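The different treatment of styling can be sketched as follows: markup is stripped from every field except the content, which is kept as collected. The names `strip_markup` and `arrange`, and the use of the standard `html.parser` module to remove tags, are assumptions of this sketch and not the patent's stated mechanism.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Keeps only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(fragment):
    """Remove tags from a fragment and collapse runs of whitespace."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.chunks).split())


def arrange(record, keep_markup=("content",)):
    """Strip styling from every field except those listed in keep_markup."""
    return {
        name: value if name in keep_markup else strip_markup(value)
        for name, value in record.items()
    }
```

For instance, a title collected as `<b>Hello</b>  world` becomes `Hello world`, while a content field such as `<p>Body</p>` is stored unchanged.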

It should be noted that the rules for information arrangement must be formulated before this step is carried out. The rules are the key to information arrangement, which standardizes the arrangement of web page data according to the formulated rules. The rules are formulated in several parts: title, time, source, author and pagination. These parts follow the basic structure of text information, and the user can collect different parts as needed. The rule formulation module breaks away from the fixed collection mode, so the program need not be extended each time a new website is collected; the user only needs to configure the basic information and rule information of the website.

In other words, after information collection is completed, the system automatically determines whether rules for information arrangement have been formulated. If rules have been formulated, the system proceeds to information arrangement after obtaining the source code. If no rules have been formulated, the system finishes after traversing the URLs and obtaining the source code.

Step S204: the data arranged in step S203 is classified, filtered, arranged, analyzed and stored in combination with the parameters input in step S201. During arrangement, the program automatically determines, according to the parameters input in step S201, whether to proceed to step S205.

Step S205: the information arranged in step S204 is published to the Internet by a web program, so that all news information can be viewed on one website. For example, once the collection of an item of information has been completed in the above example, if the information is to be published to the network platform manually, the user simply clicks Publish, which is convenient for the user.

Steps S204 and S205 complete the data storage and publishing process. Information is stored in a database: an existing database management system is used to manage the scattered information in a structured way. Publishing means publishing the collected information to an existing network platform. There are two publishing modes: automatic and manual. When storing data, the system determines from the parameters input in the first step whether the information is to be published automatically.
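The storage step and the two publishing modes can be sketched with SQLite from the Python standard library. The patent names no particular database product or schema, so the table `articles`, its columns and the `published` flag are hypothetical, chosen only to show how an auto-publish parameter could be applied at storage time and how a manual publish could follow later.

```python
import sqlite3

FIELDS = ("title", "source", "pub_time", "author", "content")


def store_records(conn, records, auto_publish):
    """Store arranged records; mark them published when auto_publish is set."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "title TEXT, source TEXT, pub_time TEXT, author TEXT, content TEXT, "
        "published INTEGER NOT NULL DEFAULT 0)"
    )
    rows = [
        tuple(record.get(field) for field in FIELDS) + (int(auto_publish),)
        for record in records
    ]
    conn.executemany("INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?)", rows)
    conn.commit()


def publish_manually(conn, title):
    """Manual mode: the user chooses an item and publishes it."""
    conn.execute("UPDATE articles SET published = 1 WHERE title = ?", (title,))
    conn.commit()


def published_titles(conn):
    """Titles a web program would show on the public site."""
    query = "SELECT title FROM articles WHERE published = 1 ORDER BY title"
    return [row[0] for row in conn.execute(query)]
```

With this layout, a web front end only needs to query rows whose `published` flag is set, regardless of which mode set it.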

This embodiment extends an ordinary crawler program and applies it to an agricultural network text information integration system, integrating information collection, filtering, classification, storage and publishing, so that agricultural information is more comprehensive, more accurate and more authoritative.

In another aspect, the present invention also provides a network text information integration device. Referring to Fig. 3, the device comprises: a parameter input module 30, a collection module 32, an information arrangement and storage module 34, and a publishing module 36.

The parameter input module 30 is used to obtain basic information of a website and to analyze the basic information by a program. The collection module 32 is used to traverse URLs according to the basic information and to obtain the page information of web pages by a program. The information arrangement and storage module 34 is used to analyze the obtained web page information by a program according to preset rules, including filtering, arranging, classifying and storing. The publishing module 36 is used to publish the arranged information to the Internet by a web program.

Fig. 4 is a structural diagram of an embodiment of the network text information integration device of the present invention. As shown in Fig. 4, the network text integration system of this embodiment comprises: a parameter input module 401, an information collection module 402, an information arrangement module 403, an information storage module 404, and an information publishing module 405.

In a specific implementation, the parameter input module 401 traverses all URLs by means of a web crawler.

The information collection module 402 obtains the web address input through the parameter input module 401, accesses the corresponding URLs by a program, and obtains the corresponding website information; the initial information is HTML code.

The information arrangement module 403 filters the HTML code obtained by the information collection module 402, obtains information such as the text and the pictures corresponding to the text, and classifies and integrates the information.

The information storage module 404 stores the network text information integrated by the information arrangement module 403 in network storage; the storage mode is NAS.

The information publishing module 405 publishes the integrated agricultural network text information stored in the information storage module 404 to the Internet by a web program.

This embodiment improves an ordinary crawler program and applies it to an agricultural network text information integration system, integrating information collection, filtering, classification, storage and publishing, so that agricultural information is more comprehensive, more accurate and more authoritative.

The network text information integration method and device provided by the present invention have been described in detail above. Specific embodiments are used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core concept. Meanwhile, for those of ordinary skill in the art, the specific implementation and the scope of application may vary in accordance with the concept of the present invention. In summary, the content of this description should not be construed as limiting the present invention.

Claims

1. A network text information integration method, characterized in that the method comprises the following steps:

a parameter input step of obtaining basic information of a website and automatically analyzing the basic information;

a collection step of traversing URLs according to the basic information and automatically obtaining the page information of web pages;

an information arrangement and storage step of automatically analyzing the obtained web page information according to preset rules, including filtering, arranging, classifying and storing; and

a publishing step of automatically publishing the arranged information to the Internet.

2. The information integration method according to claim 1, characterized in that,

in the parameter input step, the basic information of the website comprises: the URL of the website, the character set of the website, the storage address of files, whether collection is single-item, and whether publishing is automatic.

3. The information integration method according to claim 2, characterized in that, in the parameter input step, automatically analyzing the basic information comprises:

according to the basic information, analyzing the user-configured parameters by a program; starting from the website URL entry and based on a crawler, traversing all URLs of the page; and analyzing the URLs by the program so as to divide them into accessible URLs, duplicate URLs and discarded URLs.

4. The information integration method according to claim 3, characterized in that,

in the collection step, the accessible URLs are passed through a loop program that visits the web page corresponding to each URL and obtains the HTML source code of the web page.

5. The information integration method according to claim 4, characterized in that,

in the information arrangement step, the obtained HTML source code of the web page is filtered to obtain the text information.

6. A network text information integration device, characterized in that the device comprises:

a parameter input module for obtaining basic information of a website and automatically analyzing the basic information;

a collection module for traversing URLs according to the basic information and automatically obtaining the page information of web pages;

an information arrangement and storage module for automatically analyzing the obtained web page information according to preset rules, including filtering, arranging, classifying and storing; and

a publishing module for automatically publishing the arranged information to the Internet.

7. The information integration device according to claim 6, characterized in that,

in the parameter input module, the basic information of the website comprises: the URL of the website, the character set of the website, the storage address of files, whether collection is single-item, and whether publishing is automatic.

8. The information integration device according to claim 7, characterized in that, in the parameter input module, automatically analyzing the basic information consists in:

according to the basic information, analyzing the user-configured parameters by a program; starting from the website URL entry and based on a crawler program, traversing all URLs of the page; and analyzing the URLs by the program so as to divide them into accessible URLs, duplicate URLs and discarded URLs.

9. The information integration device according to claim 8, characterized in that,

in the collection module, the accessible URLs are passed through a loop program that visits the web page corresponding to each URL and obtains the HTML source code of the web page.

10. The information integration device according to claim 9, characterized in that,

in the information arrangement module, the obtained HTML source code of the web page is filtered to obtain the text information.