CN101957866A - Network text information integration method and device - Google Patents
- Publication number
- CN101957866A (application CN2010105236614A)
- Authority
- CN
- China
- Prior art keywords
- information
- url
- website
- automatically
- program
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a network text information integration method and device. The method comprises the following steps: obtaining basic information of a website and analyzing the basic information by a program; traversing URLs according to the basic information and using the program to obtain the page information of each web page; using the program to organize and store the obtained page information according to preset rules; and publishing the organized information to the Internet through a web program. The method and device can analyze web pages at multiple levels according to user needs, extract the content the user is interested in, and store and publish it. They can thereby collect and store agricultural network text information efficiently and solve the prior-art problems of limited breadth and scope of collected web pages.
Description
Technical field
The present invention relates to network information collection technology, and in particular to a network text information integration method and device.
Background technology
With the rapid development of rural informatization, websites that provide agricultural information services to farmers are being built vigorously across the country, and most provinces, cities, and regions have their own agricultural information websites. However, because China is vast and has a farming population of 900 million, the volume of agricultural information is huge. Each local agricultural website therefore collects only the agricultural information of its own area, including network text information such as news, agricultural science and technology information, information on getting rich through agriculture, agricultural market analysis, and policies and regulations concerning farmers.
In the course of making the present invention, the inventors found that the prior art has at least the following defects. Most existing information collection systems are crawler-based systems that obtain web page information by following hyperlinks and aim to serve the needs of all Internet users. Many web crawler products implement structured extraction of web page information to some degree, but both the techniques and the extraction methods have limitations, which makes them difficult to apply to practical agricultural information extraction:
1. In existing systems, the technique for extracting structured web data is embedded in the program, which applies hard-coded collection rules to extract structured data. Such extraction is limited to a single web page or to similar web pages.
2. Existing methods set a query radius according to the layout of the web page to perform structured extraction, but most information cannot be obtained in this way.
3. Existing methods extract structured information through configuration files, but only people familiar with web page programming can write these files. Extraction of this kind therefore greatly narrows the range of users.
It can be seen that, as the range of crawler applications keeps expanding and users demand more and more of web page structured data collection, current web crawler technology cannot meet users' demand for intelligent collection of structured data.
Summary of the invention
The object of the present invention is to provide a network text information integration method and device that collect and store agricultural network text information efficiently and solve the prior-art problems of limited breadth and scope of collected web pages.
A network text information integration method of the present invention comprises the following steps: a parameter input step of obtaining basic information of a website and automatically analyzing the basic information; a collection step of traversing URLs according to the basic information and automatically obtaining the page information of the web pages; an information organizing and storing step of automatically analyzing the obtained web page information according to preset rules, including filtering, organizing, classifying, and storing; and a publishing step of automatically publishing the organized information to the Internet.
In the above information integration method, preferably, in the parameter input step, the basic website information comprises: the URL of the website, the character set of the website, the save path for files, whether collection is single-item, and whether publishing is automatic.
In the above information integration method, preferably, in the parameter input step, automatically analyzing the basic information comprises: analyzing, by a program, the parameters configured by the user according to the basic information; traversing all URLs of the pages with a crawler, starting from the entry URL of the website; and analyzing the URLs by the program so as to divide them into accessible URLs, duplicate URLs, and discarded URLs.
In the above information integration method, preferably, in the collection step, a loop program visits the web page corresponding to each accessible URL and obtains the HTML source code of the web page.
In the above information integration method, preferably, in the information organizing step, the obtained HTML source code is filtered to obtain the text information.
A network text information integration device of the present invention comprises: a parameter input module for obtaining basic information of a website and automatically analyzing the basic information; a collection module for traversing URLs according to the basic information and automatically obtaining the page information of the web pages; an information organizing and storage module for automatically analyzing the obtained web page information according to preset rules, including filtering, organizing, classifying, and storing; and a publishing module for automatically publishing the organized information to the Internet.
In the above information integration device, preferably, in the parameter input module, the basic website information comprises: the URL of the website, the character set of the website, the save path for files, whether collection is single-item, and whether publishing is automatic.
In the above information integration device, preferably, in the parameter input module, automatically analyzing the basic information comprises: analyzing, by a program, the parameters configured by the user according to the basic information; traversing all URLs of the pages with a crawler program, starting from the entry URL of the website; and analyzing the URLs by the program so as to divide them into accessible URLs, duplicate URLs, and discarded URLs.
In the above information integration device, preferably, in the collection module, a loop program visits the web page corresponding to each accessible URL and obtains the HTML source code of the web page.
In the above information integration device, preferably, in the information organizing module, the obtained HTML source code is filtered to obtain the text information.
Compared with the prior art, the present invention can analyze web pages at multiple levels according to user needs, extract the content the user cares about, and store and publish it. It thereby collects and stores agricultural network text information efficiently and solves the prior-art problems of limited breadth and scope of collected web pages.
Description of drawings
Fig. 1 is the flow chart of steps of network text information integration method of the present invention;
Fig. 2 is the flow chart of steps of network text information integration method embodiment of the present invention;
Fig. 3 is the structural representation of network text information integrating apparatus of the present invention;
Fig. 4 is the structural representation of network text information integrating apparatus embodiment of the present invention.
Embodiment
To make the above objects, features, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and specific embodiments.
The idea of the present invention is to collect web page URLs by a program, collect web page information in a loop, and analyze, organize, store, and publish the collected information.
Referring to Fig. 1, which is a flow chart of the steps of the network text information integration method of the present invention, the method comprises: a parameter input step S110 of obtaining basic information of a website and automatically analyzing the basic information; a collection step S120 of traversing URLs according to the basic information and automatically obtaining the page information of the web pages; an information organizing and storing step S130 of automatically analyzing the obtained web page information according to preset rules, including filtering, organizing, classifying, and storing; and a publishing step S140 of automatically publishing the organized information to the Internet.
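The four steps S110 to S140 can be pictured as a simple pipeline. The following Python sketch is illustrative only: the names `SiteConfig`, `run_pipeline`, and the five callables passed in are hypothetical and do not come from the patent, which does not prescribe a programming language or an interface.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SiteConfig:
    """Basic website information from the parameter input step (hypothetical field names)."""
    entry_url: str
    charset: str = "utf-8"
    save_path: str = "."
    single_item: bool = False   # collect a single item instead of all items
    auto_publish: bool = False  # publish automatically after storage


def run_pipeline(
    config: SiteConfig,
    discover: Callable[[SiteConfig], List[str]],
    fetch: Callable[[str, str], str],
    organize: Callable[[str], Dict[str, str]],
    store: Callable[[Dict[str, str]], None],
    publish: Callable[[Dict[str, str]], None],
) -> int:
    """Run S110-S140 in order and return the number of stored records."""
    urls = discover(config)                # S110: analyze the basic information
    if config.single_item:
        urls = urls[:1]
    count = 0
    for url in urls:
        html = fetch(url, config.charset)  # S120: obtain the page information
        record = organize(html)            # S130: filter, organize, classify
        store(record)                      # S130: store
        if config.auto_publish:
            publish(record)                # S140: publish
        count += 1
    return count
```

Passing the individual steps in as callables keeps the sketch independent of any particular crawler, parser, or database.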
Method embodiment
Fig. 2 is a flow chart of the steps of an embodiment of the network text information integration method of the present invention. As shown in Fig. 2, the text information integration method of this embodiment mainly comprises the following steps, taking one module of a certain website as an example.
Step S201: input the basic information of the website, including attributes such as the URL of the website, the character set of the website, the save path for files, whether collection is single-item, and whether publishing is automatic. The program analyzes the parameters configured by the user and, starting from the entry URL of the website, uses an existing crawler program to traverse all URLs of the pages. The program then analyzes the URLs and divides them into accessible URLs, duplicate URLs, and discarded URLs.
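The division of URLs into accessible, duplicate, and discarded groups might look like the sketch below. The patent does not state the criteria, so this example assumes a purely syntactic check: a URL is treated as accessible if it is a well-formed http or https address not seen before. The function name `classify_urls` is hypothetical, and a real system would probably also confirm accessibility with an actual request.

```python
from typing import Dict, Iterable, List
from urllib.parse import urldefrag, urlparse


def classify_urls(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Split candidate URLs into accessible, duplicate, and discarded groups."""
    groups: Dict[str, List[str]] = {"accessible": [], "duplicate": [], "discarded": []}
    seen = set()
    for raw in urls:
        # Drop the #fragment so that in-page anchors count as the same page.
        url, _fragment = urldefrag(raw.strip())
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            groups["discarded"].append(raw)   # e.g. javascript: or mailto: links
        elif url in seen:
            groups["duplicate"].append(raw)
        else:
            seen.add(url)
            groups["accessible"].append(url)
    return groups
```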
This step manages the basic information of the websites to be collected. Based on the information entered by the user, the system automatically opens a separate thread for each website to be collected, and that thread completes the collection, organizing, and storage of the information. The organizing and publishing of the data are likewise carried out according to the parameters and rules entered by the user.
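A minimal sketch of the one-thread-per-website arrangement, using Python's standard `threading` module. The names `collect_sites` and `worker` are hypothetical; `worker` stands in for whatever routine collects, organizes, and stores one website.

```python
import threading
from typing import Callable, Dict, Iterable


def collect_sites(entry_urls: Iterable[str], worker: Callable[[str], object]) -> Dict[str, object]:
    """Run `worker` for every website in its own thread and gather the results."""
    results: Dict[str, object] = {}
    lock = threading.Lock()

    def run(url: str) -> None:
        outcome = worker(url)
        with lock:                      # protect the shared result dictionary
            results[url] = outcome

    threads = [threading.Thread(target=run, args=(url,)) for url in entry_urls]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()                   # wait until every website is finished
    return results
```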
Step S202: information is collected mainly by page capture. A loop program visits the web page corresponding to each accessible URL obtained in step S201 and obtains the HTML source code of the web page.
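The loop in step S202 could be sketched as follows with the standard library's `urllib.request.urlopen`. The function names, the timeout value, and the decision to skip pages that fail to load are assumptions of this sketch, not details given in the patent.

```python
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_html(url: str, charset: str = "utf-8", timeout: float = 10.0) -> str:
    """Download one page and decode it with the configured character set."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode(charset, errors="replace")


def collect_pages(
    urls: Iterable[str],
    charset: str,
    fetch: Callable[[str, str], str] = fetch_html,
) -> Dict[str, str]:
    """Visit each accessible URL in turn and keep its HTML source code."""
    pages: Dict[str, str] = {}
    for url in urls:
        try:
            pages[url] = fetch(url, charset)
        except OSError:
            # Network errors raised by urllib derive from OSError; skip the page.
            continue
    return pages
```

The `fetch` parameter makes it possible to substitute a stub when no network is available.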
Information collection is carried out mainly according to the basic website information entered by the user, and comprises:
1. According to the input parameters, automatically determining whether to collect a single item of data or multiple items.
2. According to the input web page URL (which is also the entry through which the system collects web pages), automatically analyzing the overall structure of the website. The web page structure is analyzed with the collimation analysis method, which is more targeted: it can accurately obtain the information the user wants and filter out information unrelated to that content. The method obtains all URL links at the same directory level as the entry URL, then traverses those URLs in turn to obtain the source code of the web pages.
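One way to read "all URL links at the same directory level as the entry URL" is sketched below. The interpretation (same host and same parent directory path) and the names `LinkCollector` and `same_directory_links` are assumptions; the patent does not define the analysis method in more detail.

```python
import posixpath
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_directory_links(entry_url: str, html: str) -> List[str]:
    """Return absolute links on the same host and in the same directory as entry_url."""
    collector = LinkCollector()
    collector.feed(html)
    entry = urlparse(entry_url)
    entry_dir = posixpath.dirname(entry.path)
    result: List[str] = []
    for href in collector.hrefs:
        absolute = urljoin(entry_url, href)   # resolve relative links
        parts = urlparse(absolute)
        same_level = parts.netloc == entry.netloc and posixpath.dirname(parts.path) == entry_dir
        if same_level and absolute not in result:
            result.append(absolute)
    return result
```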
Step S203: according to the HTML tags configured by the user, the HTML code collected in step S202 is filtered to obtain the agricultural text information in the website. All the information is then classified, organized, and stored on the server. The following is a simple web page text structure:
<title>Agricultural network text information collection technology</title>
<div class="time">2010-6-30</div>
<div class="sourc">Agriculture</div>
<div class="author">Agriculture</div>
<div class="content">The summary of the invention is an example</div>
The rules are configured according to the above web page source code as follows:
1. Title: start tag <title>, end tag </title>.
2. Content: start tag class="content", end tag </div>.
3. Time: start tag <div class="time", end tag </div>.
4. Source: start tag class="sourc", end tag </div>.
5. Author: start tag class="author", end tag </div>.
After the rules are configured, the analysis program produces the following result:
1. Title: Agricultural network text information collection technology.
2. Content: The summary of the invention is an example.
3. Time: 2010-6-30.
4. Source: Agriculture.
5. Author: Agriculture.
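The start-tag and end-tag rules above amount to cutting out the text between two marker strings. The sketch below applies that idea to the sample page. The names `RULES`, `intercept`, and `apply_rules` are hypothetical, and the closing `>` appended to each start marker is an assumption made here so that the bracket itself is not captured.

```python
from typing import Dict, Optional, Tuple

# Sample page from the embodiment, with the class name "sourc" kept as given.
SAMPLE = (
    "<title>Agricultural network text information collection technology</title>"
    '<div class="time">2010-6-30</div>'
    '<div class="sourc">Agriculture</div>'
    '<div class="author">Agriculture</div>'
    '<div class="content">The summary of the invention is an example</div>'
)

# field name -> (start marker, end marker); the trailing ">" is an assumption.
RULES: Dict[str, Tuple[str, str]] = {
    "title": ("<title>", "</title>"),
    "content": ('class="content">', "</div>"),
    "time": ('<div class="time">', "</div>"),
    "source": ('class="sourc">', "</div>"),
    "author": ('class="author">', "</div>"),
}


def intercept(html: str, start: str, end: str) -> Optional[str]:
    """Return the text between the first start marker and the next end marker."""
    begin = html.find(start)
    if begin == -1:
        return None
    begin += len(start)
    stop = html.find(end, begin)
    if stop == -1:
        return None
    return html[begin:stop].strip()


def apply_rules(html: str, rules: Dict[str, Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Apply every configured rule to one page of HTML source code."""
    return {field: intercept(html, start, end) for field, (start, end) in rules.items()}
```

Because the rules are plain data, a new website needs only a new rule table rather than new program code, which matches the stated goal of avoiding hard-coded collection rules.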
This step is information organizing: the web page source code is processed according to the configured rules, and the parts the user needs are extracted from it, including the title, source, time, author, and content. During organizing, the system automatically handles content that is split across multiple pages. When the title, source, time, and author are extracted, the original web page styling is automatically filtered out; the content, by contrast, automatically keeps its original styling. After organizing, the process moves on to the data storage stage.
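The different treatment of styling can be sketched as follows: markup is stripped from the title, source, time, and author fields, while the content field is left untouched. The names `strip_markup`, `PLAIN_FIELDS`, and `clean_record` are hypothetical, and the use of Python's `html.parser` is a choice made for this illustration.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _TextOnly(HTMLParser):
    """Keep only the character data of an HTML fragment."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(fragment: str) -> str:
    """Remove tags and collapse runs of whitespace."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()                      # flush any buffered text
    return " ".join("".join(parser.chunks).split())


# Fields whose original styling is filtered out; "content" keeps its styling.
PLAIN_FIELDS = {"title", "source", "time", "author"}


def clean_record(record: Dict[str, str]) -> Dict[str, str]:
    """Strip markup from the plain fields and leave every other field as-is."""
    return {
        key: strip_markup(value) if key in PLAIN_FIELDS else value
        for key, value in record.items()
    }
```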
Note that the information organizing rules must be formulated before this step is carried out. These rules are the key to information organizing, which standardizes the organizing of web page data according to them. The rules are divided into several parts: title, time, source, author, and paging. These parts follow the basic structure of text information, and the user can collect different parts as needed. The rule formulation module breaks away from hard-coded collection: there is no need to extend the program each time a new website is collected. The user only needs to configure the basic information and rule information of the websites.
That is, after information collection is complete, the system automatically determines whether information organizing rules have been formulated. If rules have been formulated, the process enters the information organizing step after the source code is obtained. If no rules have been formulated, the system ends after traversing the URLs and obtaining the source code.
Step S204: the data organized in step S203 is classified, filtered, organized, analyzed, and stored in combination with the parameters input in step S201. During organizing, the program automatically determines, according to the parameters input in step S201, whether to proceed to step S205.
Step S205: the information organized in step S204 is published to the Internet through a web program, so that all the news information can be viewed on one website. For example, once one item of information has been collected as in the above example, if it is to be published to the network platform manually, the user simply clicks Publish, which is convenient for the user.
Steps S204 and S205 complete data storage and publishing. Information is stored in a database: an existing database management system is used to manage the scattered information in structured form. Information publishing means publishing the collected information to an existing network platform. There are two publishing modes: automatic and manual. When storing data, the system determines whether to publish automatically according to the parameters input in the first step.
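Storage followed by an optional automatic publish could be sketched with the standard `sqlite3` module. SQLite, the table layout, and the names `open_store` and `store_record` are assumptions made for illustration; the patent only says that an existing database management system is used.

```python
import sqlite3
from typing import Callable, Dict

FIELDS = ("title", "time", "source", "author", "content")


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the article table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        "id INTEGER PRIMARY KEY, title TEXT, pub_time TEXT, source TEXT, "
        "author TEXT, content TEXT, published INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def store_record(
    conn: sqlite3.Connection,
    record: Dict[str, str],
    auto_publish: bool,
    publish: Callable[[Dict[str, str]], None],
) -> int:
    """Store one record; publish it at once only when auto_publish is set."""
    cursor = conn.execute(
        "INSERT INTO articles (title, pub_time, source, author, content) "
        "VALUES (?, ?, ?, ?, ?)",
        tuple(record.get(field) for field in FIELDS),
    )
    row_id = cursor.lastrowid
    if auto_publish:
        publish(record)
        conn.execute("UPDATE articles SET published = 1 WHERE id = ?", (row_id,))
    conn.commit()
    return row_id
```

Records stored with `auto_publish` set to false keep `published = 0`, so a manual publish action can pick them up later.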
This embodiment extends an ordinary crawler program and applies it in an agricultural network text information integration system, integrating information collection, filtering, classification, storage, and publishing, so that agricultural information becomes more comprehensive, more accurate, and more authoritative.
In another aspect, the present invention also provides a network text information integration device. Referring to Fig. 3, the device comprises a parameter input module 30, a collection module 32, an information organizing and storage module 34, and a publishing module 36.
The parameter input module 30 is used to obtain basic information of a website and analyze the basic information by a program. The collection module 32 is used to traverse URLs according to the basic information and use the program to obtain the page information of the web pages. The information organizing and storage module 34 is used to analyze the obtained web page information by the program according to preset rules, including filtering, organizing, classifying, and storing. The publishing module 36 is used to publish the organized information to the Internet through a web program.
Fig. 4 is a structural diagram of an embodiment of the network text information integration device of the present invention. As shown in Fig. 4, the network text integration system of this embodiment comprises: a parameter input module 401, an information collection module 402, an information organizing module 403, an information storage module 404, and an information publishing module 405.
In a specific implementation, the parameter input module 401 traverses all URLs by means of a web crawler.
The information organizing module 403 filters the HTML code obtained by the information collection module 402 to obtain information such as the text and the pictures corresponding to the text, and classifies and integrates that information.
The information storage module 404 stores the network text information integrated by the information organizing module 403 on the network; the storage mode is NAS.
This embodiment improves an ordinary crawler program and applies it in an agricultural network text information integration system, integrating information collection, filtering, classification, storage, and publishing, so that agricultural information becomes more comprehensive, more accurate, and more authoritative.
The network text information integration method and device provided by the present invention have been described in detail above. Specific embodiments have been used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, a person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this description should not be construed as limiting the present invention.
Claims (10)
1. A network text information integration method, characterized in that the method comprises the following steps:
a parameter input step of obtaining basic information of a website and automatically analyzing the basic information;
a collection step of traversing URLs according to the basic information and automatically obtaining the page information of the web pages;
an information organizing and storing step of automatically analyzing the obtained web page information according to preset rules, including filtering, organizing, classifying, and storing;
a publishing step of automatically publishing the organized information to the Internet.
2. The information integration method according to claim 1, characterized in that,
in the parameter input step, the basic website information comprises: the URL of the website, the character set of the website, the save path for files, whether collection is single-item, and whether publishing is automatic.
3. The information integration method according to claim 2, characterized in that, in the parameter input step, automatically analyzing the basic information comprises:
analyzing, by a program, the parameters configured by the user according to the basic information; traversing all URLs of the pages with a crawler, starting from the entry URL of the website; and analyzing the URLs by the program so as to divide them into accessible URLs, duplicate URLs, and discarded URLs.
4. The information integration method according to claim 3, characterized in that,
in the collection step, a loop program visits the web page corresponding to each accessible URL and obtains the HTML source code of the web page.
5. The information integration method according to claim 4, characterized in that,
in the information organizing step, the obtained HTML source code is filtered to obtain the text information.
6. A network text information integration device, characterized in that the device comprises:
a parameter input module for obtaining basic information of a website and automatically analyzing the basic information;
a collection module for traversing URLs according to the basic information and automatically obtaining the page information of the web pages;
an information organizing and storage module for automatically analyzing the obtained web page information according to preset rules, including filtering, organizing, classifying, and storing;
a publishing module for automatically publishing the organized information to the Internet.
7. The information integration device according to claim 6, characterized in that,
in the parameter input module, the basic website information comprises: the URL of the website, the character set of the website, the save path for files, whether collection is single-item, and whether publishing is automatic.
8. The information integration device according to claim 7, characterized in that, in the parameter input module, automatically analyzing the basic information comprises:
analyzing, by a program, the parameters configured by the user according to the basic information; traversing all URLs of the pages with a crawler program, starting from the entry URL of the website; and analyzing the URLs by the program so as to divide them into accessible URLs, duplicate URLs, and discarded URLs.
9. The information integration device according to claim 8, characterized in that,
in the collection module, a loop program visits the web page corresponding to each accessible URL and obtains the HTML source code of the web page.
10. The information integration device according to claim 9, characterized in that,
in the information organizing module, the obtained HTML source code is filtered to obtain the text information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010105236614A CN101957866A (en) | 2010-10-25 | 2010-10-25 | Network text information integration method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101957866A true CN101957866A (en) | 2011-01-26 |
Family
ID=43485195
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2010105236614A Pending CN101957866A (en) | 2010-10-25 | 2010-10-25 | Network text information integration method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101957866A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101042747A (en) * | 2006-03-24 | 2007-09-26 | 上海中经互联网络有限公司 | Economic operation analysis system |
CN101561802A (en) * | 2008-04-18 | 2009-10-21 | 上海复旦光华信息科技股份有限公司 | Web page structural data extraction method and system |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102811177A (en) * | 2011-06-03 | 2012-12-05 | 腾讯科技(深圳)有限公司 | Method and system for sharing network information |
CN102811177B (en) * | 2011-06-03 | 2017-03-22 | 腾讯科技(深圳)有限公司 | Method and system for sharing network information |
CN103049537A (en) * | 2012-12-25 | 2013-04-17 | 国云科技股份有限公司 | Network information collection method |
CN104021170A (en) * | 2014-05-30 | 2014-09-03 | 华为技术有限公司 | Information acquiring method and cloud server |
CN104021170B (en) * | 2014-05-30 | 2018-01-16 | 华为技术有限公司 | A kind of information acquisition method and cloud server |
CN104462431A (en) * | 2014-12-16 | 2015-03-25 | 浪潮软件集团有限公司 | Method for crawling web page recruitment information |
CN105808545A (en) * | 2014-12-30 | 2016-07-27 | Tcl集团股份有限公司 | Forum data extraction method and forum data extraction apparatus |
CN105468664A (en) * | 2015-05-12 | 2016-04-06 | 北京众标网络科技有限公司 | Information acquisition method and apparatus |
CN104965929A (en) * | 2015-07-24 | 2015-10-07 | 网易传媒科技(北京)有限公司 | Method and device for data processing |
CN104965929B (en) * | 2015-07-24 | 2019-07-02 | 网易传媒科技(北京)有限公司 | A kind of data processing method and device |
CN105868258A (en) * | 2015-12-28 | 2016-08-17 | 乐视网信息技术(北京)股份有限公司 | Crawler system |
CN106600438A (en) * | 2016-11-29 | 2017-04-26 | 东莞华南设计创新院 | Agricultural information service system |
CN107451218A (en) * | 2017-07-17 | 2017-12-08 | 广州特道信息科技有限公司 | On-Line review method for automatically releasing and device |
CN107451218B (en) * | 2017-07-17 | 2020-04-03 | 云润大数据服务有限公司 | Automatic publishing method and device for online comments |
CN109446425A (en) * | 2018-10-30 | 2019-03-08 | 郑州市景安网络科技股份有限公司 | A kind of network information gathering and dissemination method, system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C12 | Rejection of a patent application after its publication | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20110126 |