CN101957866A - Network text information integration method and device - Google Patents


Info

Publication number
CN101957866A
Authority
CN
China
Prior art keywords
information
url
website
automatically
program
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010105236614A
Other languages
Chinese (zh)
Inventor
高万林
张树亮
臧金玉
李桢
赵佳宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Agricultural University
Original Assignee
China Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Agricultural University filed Critical China Agricultural University
Priority to CN2010105236614A priority Critical patent/CN101957866A/en
Publication of CN101957866A publication Critical patent/CN101957866A/en
Pending legal-status Critical Current

Abstract

The invention discloses a method and a device for integrating network text information. The method comprises the following steps: obtaining the basic information of a website and analyzing it by program; traversing URLs according to the basic information and obtaining the page information of each web page by program; arranging and storing the obtained page information by program according to preset rules; and publishing the arranged information to the Internet through a web program. The method and the device can analyze web pages at multiple levels according to the user's needs, extract the content the user is interested in, and store and publish it; they can thereby collect and store agricultural network text information efficiently, and address the limited breadth and scope of web page collection in the prior art.

Description

Network text information integration method and device
Technical field
The present invention relates to network information collection technology, and in particular to a network text information integration method and device.
Background art
With the rapid development of rural informatization, websites that provide agricultural information services to rural users are being built vigorously across the country, and most provinces, cities and regions have their own agricultural information websites. However, because China covers a vast territory and has a rural population of 900 million, the volume of agricultural information is enormous. Each local agricultural website therefore collects only the agricultural information of its own region, including network text information such as news, agricultural science and technology information, information on getting rich through agriculture, agricultural market analysis, and policies and regulations concerning farmers.
In the course of realizing the present invention, the inventors found that the prior art has at least the following defects. Most existing information collection systems are crawler-based: they obtain web page information by following hyperlinks and are designed to serve the needs of all Internet users. Many web crawler products implement structured extraction of web page information to some degree, but both the techniques and the extraction methods have limitations, which makes them difficult to apply to practical agricultural information extraction:
1. In existing systems the technique for extracting structured web data is embedded in the program, which applies hard-coded collection rules to extract structured data. Such extraction is limited to a single web page or to similar web pages.
2. Existing methods set a query radius according to the layout of the web page to perform structured extraction, but most information cannot be obtained this way.
3. Existing methods extract structured information by means of configuration files, but only people familiar with web page programming can write these files. Extraction of this kind greatly narrows the range of users.
It can be seen that, as the range of crawler applications keeps growing and users place ever higher demands on the collection of structured web page data, current web crawler technology cannot satisfy users' demand for intelligent collection of structured data.
Summary of the invention
The object of the present invention is to provide a network text information integration method and device that collect and store agricultural network text information efficiently and address the limited breadth and scope of web page collection in the prior art.
A network text information integration method according to the present invention comprises the following steps: a parameter input step of obtaining the basic information of a website and automatically analyzing the basic information; a collection step of traversing URLs according to the basic information and automatically obtaining the page information of web pages; an information arrangement and storage step of automatically analyzing the obtained web page information according to preset rules, including filtering, arranging, classifying and storing; and a publishing step of automatically publishing the arranged information to the Internet.
In the above information integration method, preferably, in the parameter input step the basic information of the website comprises: the URL of the website, the character set of the website, the save path for files, whether collection is single-item, and whether publishing is automatic.
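The five items of basic website information could be held in one small configuration record. The following is a minimal sketch, not the applicant's implementation: the class name `SiteConfig`, its field names, the `validate` check and the example URL are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteConfig:
    """Basic website information entered in the parameter input step (hypothetical names)."""
    entry_url: str      # URL of the website, used as the crawl entry point
    charset: str        # character set used to decode the pages
    save_path: str      # where collected files are saved
    single_item: bool   # True: collect a single item; False: collect multiple items
    auto_publish: bool  # True: publish automatically after storage

    def validate(self) -> None:
        # Assumed sanity check: reject entry URLs that are not HTTP(S).
        if not self.entry_url.startswith(("http://", "https://")):
            raise ValueError(f"unsupported entry URL: {self.entry_url!r}")


config = SiteConfig(
    entry_url="http://example.com/news/index.html",  # placeholder URL
    charset="utf-8",
    save_path="./collected",
    single_item=False,
    auto_publish=True,
)
config.validate()
```

Keeping the record immutable (`frozen=True`) is a design choice of this sketch, so that a collection thread started for one website cannot change its parameters midway.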
In the above information integration method, preferably, in the parameter input step the automatic analysis of the basic information comprises: analyzing the user-configured parameters by program according to the basic information; traversing all URLs of the page with a crawler, starting from the website's entry URL; and analyzing the URLs by program to divide them into accessible URLs, duplicate URLs and discarded URLs.
In the above information integration method, preferably, in the collection step a loop program visits the web page corresponding to each accessible URL and obtains the HTML source code of the web page.
In the above information integration method, preferably, in the information arrangement step the obtained HTML source code is filtered to obtain the text information.
A network text information integrating device according to the present invention comprises: a parameter input module for obtaining the basic information of a website and automatically analyzing the basic information; a collection module for traversing URLs according to the basic information and automatically obtaining the page information of web pages; an information arrangement and storage module for automatically analyzing the obtained web page information according to preset rules, including filtering, arranging, classifying and storing; and a publishing module for automatically publishing the arranged information to the Internet.
In the above information integrating device, preferably, in the parameter input module the basic information of the website comprises: the URL of the website, the character set of the website, the save path for files, whether collection is single-item, and whether publishing is automatic.
In the above information integrating device, preferably, in the parameter input module the automatic analysis of the basic information is: analyzing the user-configured parameters by program according to the basic information; traversing all URLs of the page with a crawler program, starting from the website's entry URL; and analyzing the URLs by program to divide them into accessible URLs, duplicate URLs and discarded URLs.
In the above information integrating device, preferably, in the collection module a loop program visits the web page corresponding to each accessible URL and obtains the HTML source code of the web page.
In the above information integrating device, preferably, in the information arrangement module the obtained HTML source code is filtered to obtain the text information.
Compared with the prior art, the present invention can analyze web pages at multiple levels according to the user's needs, extract the content the user cares about, and store and publish it; it thereby collects and stores agricultural network text information efficiently and addresses the limited breadth and scope of web page collection in the prior art.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of the network text information integration method of the present invention;
Fig. 2 is a flowchart of the steps of an embodiment of the network text information integration method of the present invention;
Fig. 3 is a structural diagram of the network text information integrating device of the present invention;
Fig. 4 is a structural diagram of an embodiment of the network text information integrating device of the present invention.
Detailed description of the embodiments
To make the above objects, features and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
The idea of the invention is to obtain web page URLs by program, collect web page information in a loop, and analyze, arrange, store and publish the collected information.
With reference to Fig. 1, which is a flowchart of the steps of the network text information integration method of the present invention, the method comprises: a parameter input step S110 of obtaining the basic information of a website and automatically analyzing the basic information; a collection step S120 of traversing URLs according to the basic information and automatically obtaining the page information of web pages; an information arrangement and storage step S130 of automatically analyzing the obtained web page information according to preset rules, including filtering, arranging, classifying and storing; and a publishing step S140 of automatically publishing the arranged information to the Internet.
Method embodiment
Fig. 2 is a flowchart of the steps of an embodiment of the network text information integration method of the present invention. As shown in Fig. 2, the text information integration method of this embodiment mainly comprises the following steps, taking one module of a certain website as an example.
Step S201: input the basic information of the website, including attributes such as the URL of the website, the character set of the website, the save path for files, whether collection is single-item, and whether publishing is automatic. The user-configured parameters are analyzed by program; an existing crawler program traverses all URLs of the page starting from the website's entry URL; and the URLs are analyzed by program and divided into accessible URLs, duplicate URLs and discarded URLs.
This step manages the basic information of the websites to be collected. Based on the information entered by the user, a separate thread is automatically started for each website to collect, arrange and store its information. At the same time, the arrangement and publishing of the data are carried out according to the parameters entered and the rules formulated by the user.
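The three-way division of URLs in step S201 might look like the sketch below. The description does not state the criteria for a discarded URL, so this example assumes that non-HTTP links and links to other hosts are discarded; the function name `classify_urls` and the sample links are likewise hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def classify_urls(entry_url, hrefs):
    """Split raw href values into accessible, duplicate and discarded URLs."""
    entry_host = urlparse(entry_url).netloc
    accessible, duplicate, discarded = [], [], []
    seen = set()
    for href in hrefs:
        # Resolve relative links against the entry URL and drop any #fragment.
        url, _fragment = urldefrag(urljoin(entry_url, href))
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or parts.netloc != entry_host:
            discarded.append(url)   # assumed criterion: non-HTTP or off-site
        elif url in seen:
            duplicate.append(url)   # already queued once
        else:
            seen.add(url)
            accessible.append(url)
    return accessible, duplicate, discarded


accessible, duplicate, discarded = classify_urls(
    "http://example.com/news/index.html",
    ["a.html", "a.html#top", "mailto:x@example.com",
     "http://other.example/b.html", "/news/c.html"],
)
```

Only the accessible list would be handed on to the collection step; the other two lists are kept here so that the classification can be inspected.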
Step S202: information is collected mainly by page capture. A loop program visits the web page corresponding to each accessible URL obtained in step S201 and obtains the HTML source code of the web page.
Information collection is carried out mainly according to the basic website information entered by the user, and includes:
1. According to the input parameters, automatically determining whether to collect a single item of data or multiple items.
2. According to the input web page URL (which is also the entry point from which the system collects web pages), automatically analyzing the overall structure of the website. The web page structure is analyzed with a collimation analysis method, which is more targeted: it accurately obtains the information the user wants and filters out irrelevant information. The method obtains all URL link addresses in the same directory as the entry URL, then traverses these URLs in turn to obtain the source code of each web page.
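A rough sketch of this collection loop follows. It is an illustration under assumptions, not the patented analysis method: "same directory" is read as "same host and same parent path", single-item collection is read as "fetch the entry page only", and the names `LinkParser`, `same_directory_links`, `fetch` and `collect_sources` are invented for the example.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def same_directory_links(entry_url, html):
    """Return links that sit in the same directory as the entry URL."""
    parser = LinkParser()
    parser.feed(html)
    entry = urlparse(entry_url)
    entry_dir = posixpath.dirname(entry.path)
    links = []
    for href in parser.hrefs:
        url = urljoin(entry_url, href)
        parts = urlparse(url)
        if (parts.netloc == entry.netloc
                and posixpath.dirname(parts.path) == entry_dir
                and url != entry_url and url not in links):
            links.append(url)
    return links


def fetch(url, charset):
    """Download one page and decode it with the configured character set."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode(charset, errors="replace")


def collect_sources(entry_url, charset, single_item, fetch_page=fetch):
    """Visit the entry page, then each same-directory URL in turn."""
    entry_html = fetch_page(entry_url, charset)
    if single_item:
        return {entry_url: entry_html}  # assumed meaning of single-item collection
    return {url: fetch_page(url, charset)
            for url in same_directory_links(entry_url, entry_html)}
```

The `fetch_page` parameter lets a caller substitute another downloader, for example one that reads saved pages from disk, without touching the traversal logic.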
Step S203: the HTML code collected in step S202 is filtered according to the HTML tags configured by the user to obtain the agricultural text information on the website. All the information is then classified, arranged and stored on the server. The following is a simple web page text structure:
<title>Agricultural network text information collection technology</title>
<div class="time">2010-6-30</div>
<div class="sourc">Agriculture</div>
<div class="author">Agriculture</div>
<div class="content">The summary of the invention is an example</div>
The rules configured for the above web page source code are as follows:
1. Title: start tag <title>, end tag </title>.
2. Content: start marker class="content", end tag </div>.
3. Time: start marker <div class="time", end tag </div>.
4. Source: start marker class="sourc", end tag </div>.
5. Author: start marker class="author", end tag </div>.
After the rules are configured, the analysis program produces the following results:
1. Title: Agricultural network text information collection technology.
2. Content: The summary of the invention is an example.
3. Time: 2010-6-30.
4. Source: Agriculture.
5. Author: Agriculture.
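The marker-based rules above can be sketched as plain string slicing. This is one possible reading, not the applicant's program: the function `extract_between`, the `RULES` table and the sample page string are assumptions, and the sketch assumes that when a start marker stops inside a tag (such as class="time"), extraction begins after that tag's closing ">".

```python
def extract_between(source, start, end):
    """Return the text between a start marker and the next end marker, or None."""
    begin = source.find(start)
    if begin == -1:
        return None
    begin += len(start)
    if not start.endswith(">"):
        # The marker stops inside a tag, so skip ahead to that tag's '>'.
        close = source.find(">", begin)
        if close == -1:
            return None
        begin = close + 1
    finish = source.find(end, begin)
    if finish == -1:
        return None
    return source[begin:finish].strip()


# Sample page patterned on the example in the description.
PAGE = (
    "<title>Agricultural network text information collection technology</title>"
    '<div class="time">2010-6-30</div>'
    '<div class="sourc">Agriculture</div>'
    '<div class="author">Agriculture</div>'
    '<div class="content">The summary of the invention is an example</div>'
)

# One (start marker, end marker) pair per field, as a user might configure them.
RULES = {
    "title": ("<title>", "</title>"),
    "content": ('class="content"', "</div>"),
    "time": ('<div class="time"', "</div>"),
    "source": ('class="sourc"', "</div>"),
    "author": ('class="author"', "</div>"),
}

record = {field: extract_between(PAGE, start, end)
          for field, (start, end) in RULES.items()}
```

Because the rules are data rather than code, collecting a new website only requires a new `RULES` table, which matches the stated aim of avoiding hard-coded collection rules.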
This step is information arrangement: the web page source code is processed according to the formulated rules, and the parts the user needs are extracted from it, namely title, source, time, author and content. During arrangement the system automatically detects whether the content is split across several pages. For the title, source, time and author, the original styling of the web page is filtered out automatically during extraction, whereas the content automatically keeps its original styling. After arrangement, the process enters the data storage stage.
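The different treatment of styling for the short fields and for the content could be sketched as follows. The regular expression is a deliberately simple assumption that works for well-formed inline markup only, and the names `clean_field` and `PLAIN_FIELDS` are hypothetical; paging detection is not covered.

```python
import re
from html import unescape

TAG = re.compile(r"<[^>]+>")  # naive tag matcher, adequate for simple inline markup
PLAIN_FIELDS = {"title", "source", "time", "author"}


def clean_field(name, raw):
    """Strip markup from the short fields; leave the content markup untouched."""
    if name not in PLAIN_FIELDS:
        return raw.strip()              # content keeps its original styling
    text = unescape(TAG.sub("", raw))   # drop tags, then decode entities
    return " ".join(text.split())       # collapse runs of whitespace


title = clean_field("title", "<b>Feed &amp; seed</b>  <i>prices</i>")
body = clean_field("content", " <p>Body text</p> ")
```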
Note that the rules for information arrangement must be formulated before this step is carried out. These rules are the key to information arrangement, which organizes the web page data according to them. The rules are divided into several parts: title, time, source, author and paging. These parts follow the basic structure of text information, and the user can collect different parts as needed. The rule formulation module breaks away from hard-coded collection patterns: there is no need to extend the program each time a new website is collected. The user only needs to configure the basic information and the rule information of the website.
That is, after information collection is complete, the system automatically determines whether rules for information arrangement have been formulated. If they have, the source code is obtained and the process moves on to information arrangement. If they have not, the system finishes after traversing the URLs and obtaining the source code.
Step S204: the data arranged in S203 is classified, filtered, arranged, analyzed and stored according to the parameters entered in step S201. During arrangement, the program automatically determines from the parameters entered in step S201 whether to proceed to step S205.
Step S205: the information arranged in step S204 is published to the Internet through a web program, so that all the news information can be viewed on one website. For example, once one item of information has been collected as in the example above, publishing it manually to the network platform only requires clicking Publish, which is convenient for the user.
Steps S204 and S205 complete data storage and publishing. Information is stored in a database: an existing database management system is used to manage the scattered information in a structured way. Publishing means releasing the collected information to an existing network platform. There are two publishing modes, automatic and manual. When storing data, the system determines from the parameters entered in the first step whether the information is published automatically.
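Storage and the choice between automatic and manual publishing might be sketched with SQLite as below. The description names no particular database, so SQLite, the `article` table layout and the function names are all assumptions; publishing is modeled only as a `published` flag that a web front end could filter on.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) article table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article ("
        " url TEXT PRIMARY KEY, title TEXT, source TEXT, pub_time TEXT,"
        " author TEXT, content TEXT, published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def store(conn, url, record, auto_publish):
    """Save one arranged record; mark it published if automatic mode is set."""
    conn.execute(
        "INSERT OR REPLACE INTO article"
        " (url, title, source, pub_time, author, content, published)"
        " VALUES (?, ?, ?, ?, ?, ?, ?)",
        (url, record["title"], record["source"], record["time"],
         record["author"], record["content"], int(auto_publish)),
    )
    conn.commit()


def publish_manually(conn, url):
    """Manual mode: the user clicks Publish for one stored item."""
    conn.execute("UPDATE article SET published = 1 WHERE url = ?", (url,))
    conn.commit()


def published_titles(conn):
    """Titles visible on the publishing platform, ordered by URL."""
    rows = conn.execute(
        "SELECT title FROM article WHERE published = 1 ORDER BY url")
    return [title for (title,) in rows]
```

For example, a record stored with `auto_publish=False` stays hidden until `publish_manually` is called for its URL, while one stored with `auto_publish=True` is visible immediately.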
This embodiment extends an ordinary crawler program and applies it in an agricultural network text information integration system, integrating information collection, filtering, classification, storage and publishing, so that agricultural information becomes more comprehensive, more accurate and more authoritative.
In another aspect, the present invention also provides a network text information integrating device. With reference to Fig. 3, the device comprises: a parameter input module 30, a collection module 32, an information arrangement and storage module 34, and a publishing module 36.
The parameter input module 30 is used to obtain the basic information of a website and analyze it by program; the collection module 32 is used to traverse URLs according to the basic information and obtain the page information of web pages by program; the information arrangement and storage module 34 is used to analyze the obtained web page information by program according to preset rules, including filtering, arranging, classifying and storing; and the publishing module 36 is used to publish the arranged information to the Internet through a web program.
Fig. 4 is a structural diagram of an embodiment of the network text information integrating device of the present invention. As shown in Fig. 4, the network text integration system of this embodiment comprises: a parameter input module 401, an information collection module 402, an information arrangement module 403, an information storage module 404 and an information publishing module 405.
In a specific implementation, the parameter input module 401 traverses all URLs by means of a web crawler.
The information collection module 402 obtains the web address input through the parameter input module 401, accesses the corresponding URL by program and obtains the corresponding website information; the initial information is HTML code.
The information arrangement module 403 filters the HTML code obtained by the information collection module 402 to obtain the text and related information such as the pictures corresponding to the text, and classifies and integrates the information.
The information storage module 404 stores the network text information integrated by the information arrangement module 403 on the network; the storage mode is NAS.
The information publishing module 405 publishes the integrated agricultural network text information stored in the information storage module 404 to the Internet through a web program.
This embodiment improves an ordinary crawler program and applies it in an agricultural network text information integration system, integrating information collection, filtering, classification, storage and publishing, so that agricultural information becomes more comprehensive, more accurate and more authoritative.
The network text information integration method and device provided by the present invention have been described in detail above. Specific embodiments have been used herein to explain the principle and implementation of the invention; the description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims (10)

1. A network text information integration method, characterized in that the method comprises the following steps:
a parameter input step of obtaining the basic information of a website and automatically analyzing the basic information;
a collection step of traversing URLs according to the basic information and automatically obtaining the page information of web pages;
an information arrangement and storage step of automatically analyzing the obtained web page information according to preset rules, including filtering, arranging, classifying and storing;
a publishing step of automatically publishing the arranged information to the Internet.
2. The information integration method according to claim 1, characterized in that,
in the parameter input step, the basic information of the website comprises: the URL of the website, the character set of the website, the save path for files, whether collection is single-item, and whether publishing is automatic.
3. The information integration method according to claim 2, characterized in that, in the parameter input step, the automatic analysis of the basic information comprises:
analyzing the user-configured parameters by program according to the basic information; traversing all URLs of the page with a crawler, starting from the website's entry URL; and analyzing the URLs by program to divide them into accessible URLs, duplicate URLs and discarded URLs.
4. The information integration method according to claim 3, characterized in that,
in the collection step, a loop program visits the web page corresponding to each accessible URL and obtains the HTML source code of the web page.
5. The information integration method according to claim 4, characterized in that,
in the information arrangement step, the obtained HTML source code is filtered to obtain the text information.
6. A network text information integrating device, characterized in that the device comprises:
a parameter input module for obtaining the basic information of a website and automatically analyzing the basic information;
a collection module for traversing URLs according to the basic information and automatically obtaining the page information of web pages;
an information arrangement and storage module for automatically analyzing the obtained web page information according to preset rules, including filtering, arranging, classifying and storing;
a publishing module for automatically publishing the arranged information to the Internet.
7. The information integrating device according to claim 6, characterized in that,
in the parameter input module, the basic information of the website comprises: the URL of the website, the character set of the website, the save path for files, whether collection is single-item, and whether publishing is automatic.
8. The information integrating device according to claim 7, characterized in that, in the parameter input module, the automatic analysis of the basic information is:
analyzing the user-configured parameters by program according to the basic information; traversing all URLs of the page with a crawler program, starting from the website's entry URL; and analyzing the URLs by program to divide them into accessible URLs, duplicate URLs and discarded URLs.
9. The information integrating device according to claim 8, characterized in that,
in the collection module, a loop program visits the web page corresponding to each accessible URL and obtains the HTML source code of the web page.
10. The information integrating device according to claim 9, characterized in that,
in the information arrangement module, the obtained HTML source code is filtered to obtain the text information.
CN2010105236614A 2010-10-25 2010-10-25 Network text information integration method and device Pending CN101957866A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010105236614A CN101957866A (en) 2010-10-25 2010-10-25 Network text information integration method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010105236614A CN101957866A (en) 2010-10-25 2010-10-25 Network text information integration method and device

Publications (1)

Publication Number Publication Date
CN101957866A true CN101957866A (en) 2011-01-26

Family

ID=43485195

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010105236614A Pending CN101957866A (en) 2010-10-25 2010-10-25 Network text information integration method and device

Country Status (1)

Country Link
CN (1) CN101957866A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101042747A (en) * 2006-03-24 2007-09-26 上海中经互联网络有限公司 Economic operation analysis system
CN101561802A (en) * 2008-04-18 2009-10-21 上海复旦光华信息科技股份有限公司 Web page structural data extraction method and system

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102811177A (en) * 2011-06-03 2012-12-05 腾讯科技(深圳)有限公司 Method and system for sharing network information
CN102811177B (en) * 2011-06-03 2017-03-22 腾讯科技(深圳)有限公司 Method and system for sharing network information
CN103049537A (en) * 2012-12-25 2013-04-17 国云科技股份有限公司 Network information collection method
CN104021170A (en) * 2014-05-30 2014-09-03 华为技术有限公司 Information acquiring method and cloud server
CN104021170B (en) * 2014-05-30 2018-01-16 华为技术有限公司 A kind of information acquisition method and cloud server
CN104462431A (en) * 2014-12-16 2015-03-25 浪潮软件集团有限公司 Method for crawling web page recruitment information
CN105808545A (en) * 2014-12-30 2016-07-27 Tcl集团股份有限公司 Forum data extraction method and forum data extraction apparatus
CN105468664A (en) * 2015-05-12 2016-04-06 北京众标网络科技有限公司 Information acquisition method and apparatus
CN104965929A (en) * 2015-07-24 2015-10-07 网易传媒科技(北京)有限公司 Method and device for data processing
CN104965929B (en) * 2015-07-24 2019-07-02 网易传媒科技(北京)有限公司 A kind of data processing method and device
CN105868258A (en) * 2015-12-28 2016-08-17 乐视网信息技术(北京)股份有限公司 Crawler system
CN106600438A (en) * 2016-11-29 2017-04-26 东莞华南设计创新院 Agricultural information service system
CN107451218A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 On-Line review method for automatically releasing and device
CN107451218B (en) * 2017-07-17 2020-04-03 云润大数据服务有限公司 Automatic publishing method and device for online comments
CN109446425A (en) * 2018-10-30 2019-03-08 郑州市景安网络科技股份有限公司 A kind of network information gathering and dissemination method, system


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20110126