CN104765823A - Method and device for collecting website data - Google Patents

Info

Publication number
CN104765823A
Authority
CN
China
Prior art keywords
data
website
channel
website data
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510164201.XA
Other languages
Chinese (zh)
Inventor
王兰莎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TVMining Beijing Media Technology Co Ltd
Original Assignee
TVMining Beijing Media Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by TVMining Beijing Media Technology Co Ltd
Priority to CN201510164201.XA
Publication of CN104765823A
Legal status: Pending

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a method and a device for collecting website data, which address the problem that website data cannot be obtained by category and make it possible to obtain the needed data quickly, already classified. The method includes: configuring a root web address of a website in advance; obtaining navigation bar information of the website according to the root web address, wherein the navigation bar information includes channel information; matching the required channels from the channel information; and obtaining the website data level by level according to the matched channels. Because the method obtains the website data level by level for each matched channel, the data is obtained already classified. In addition, the obtained data corresponds to the website's structure cluster, so a later website data classification step is avoided and data collection efficiency is improved.

Description

A method and device for collecting website data
Technical field
The present invention relates to the technical field of data acquisition, and in particular to a method and device for collecting website data.
Background
As Internet resources grow richer and the volume of network information keeps expanding, people depend on the network more and more, yet quickly finding the specific resources one needs in the vast ocean of Internet resources has become inconvenient for users. Information has always been valuable, and with the arrival of the information age every industry is filled with it. The value of information lies in the circulation of data: only when data is circulated and transmitted in a timely manner can information realize its real value. Under market economy conditions, data collection has become an important tool and means.
How to collect valuable data from massive amounts of information and analyze it to form a basis for enterprise decisions is a problem faced by data collectors and market researchers. Quickly finding and obtaining the information and services one needs in a large volume of data is becoming more and more difficult, and users often lose track of their target or get skewed results when querying. Data must be collected, integrated, and analyzed before it produces value; scattered information is merely news and cannot yield real commercial value. Enterprises and information analysts must, on the one hand, filter out the points of real value from a large amount of information and, on the other hand, reduce the cost of obtaining it, so that the practical value of the information exceeds the cost of collecting and analyzing it and the information adds value to enterprise decisions. Data analysis and research, in turn, require that the needed data be obtained first.
Existing approaches to collecting website data fall into two main types. The first is the traditional, manual approach: website data is collected from the target website by copying and pasting. The second uses software, such as a web crawler, which is a program or script that automatically captures web information according to certain rules. Specifically, a web crawler starts from the URLs of one or more initial pages, obtains the URLs found on those pages, and, while capturing pages, keeps extracting new URLs from the current page and adding them to a queue until some stop condition of the system is met. The traditional approach is time-consuming, laborious, and involves a heavy workload. The software approach can collect website data, but it cannot distinguish the relationships among the large amount of website data collected; moreover, web crawlers generally obtain website data based on keywords, so they easily collect useless junk information and the quality of the extracted information is low.
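For illustration only, the queue-driven crawler described in this background section can be sketched as follows. The names `crawl`, `get_links`, and `max_pages` are hypothetical and are not taken from the patent; `get_links` stands in for fetching a page and extracting its URLs, and a page limit is used as one possible stop condition.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs.

    get_links(url) is a caller-supplied function (an assumption of this
    sketch) that returns the URLs found on the page at `url`.
    """
    queue = deque(seeds)      # URLs waiting to be visited
    seen = set(seeds)         # URLs already queued, to avoid revisiting
    visited = []              # URLs in the order they were visited
    while queue and len(visited) < max_pages:   # stop condition
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

Because such a crawler follows every link it finds, nothing in it records which section of the site a page belongs to, which is the limitation the background describes.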
Summary of the invention
The invention provides a method and device for collecting website data, in order to solve the problem that website data cannot be obtained by category and to obtain the needed data quickly, already classified.
The invention provides a method for collecting website data, comprising:
pre-configuring the root web address of a website;
obtaining navigation bar information of the website according to the root web address, the navigation bar information comprising channel information;
matching the required channels from the channel information;
obtaining website data level by level according to the matched channels.
The method for collecting website data provided by the embodiments of the present invention obtains website structure cluster information level by level according to the website's tree structure and, for each matched channel, obtains website data level by level, so that the data is obtained already classified. In addition, the obtained data corresponds to the website's structure cluster, so a later website data classification step is avoided and the efficiency of data collection is improved.
In one embodiment, obtaining website data level by level according to the matched channels specifically comprises:
obtaining the content list in each channel according to the matched channels;
obtaining content data hierarchically according to the content list, the content data being the required website data.
In one embodiment, obtaining content data hierarchically according to the content list specifically comprises:
determining the address of the corresponding content page according to the content list;
determining the source code of the content page according to the address of the content page, and obtaining the content data from the source code.
In the embodiments of the present invention, obtaining content data from the source code effectively screens out advertisements and irrelevant content, and also prevents crawling to address links outside the channel.
In one embodiment, after the step of obtaining website data, the method further comprises:
storing the website data hierarchically, and performing unified encoding processing on the website data.
In one embodiment, storing the website data hierarchically comprises:
setting directory nodes according to the levels of the structure cluster associated with the root web address;
storing the obtained website data level by level under the corresponding directory nodes.
A device for collecting website data, comprising:
a configuration module, for pre-configuring the root web address of a website;
an acquisition module, for obtaining navigation bar information of the website according to the root web address, the navigation bar information comprising channel information;
a matching module, for matching the required channels from the channel information;
a processing module, for obtaining website data level by level according to the matched channels.
In one embodiment, the processing module comprises:
an acquiring unit, for obtaining the content list in each channel according to the matched channels;
a processing unit, for obtaining content data hierarchically according to the content list, the content data being the required website data.
In one embodiment, the processing unit comprises:
a determining subunit, for determining the address of the corresponding content page according to the content list;
an obtaining subunit, for determining the source code of the content page according to the address of the content page, and obtaining the content data from the source code.
In one embodiment, the device further comprises:
a storage module, for storing the website data hierarchically and performing unified encoding processing on the website data.
In one embodiment, the storage module comprises:
a node setting unit, for setting directory nodes according to the levels of the structure cluster associated with the root web address;
a hierarchical storage unit, for storing the obtained website data level by level under the corresponding directory nodes.
The method and device for collecting website data provided by the embodiments of the present invention obtain website structure cluster information level by level according to the website's tree structure and, for each matched channel, obtain website data level by level, so that the data is obtained already classified. In addition, the obtained data corresponds to the website's structure cluster, so a later website data classification step is avoided and the efficiency of data collection is improved. Obtaining content data from the source code effectively screens out advertisements and irrelevant content, and also prevents crawling to address links outside the channel. Performing unified encoding processing on the website data makes it easier to store the data in a consistent format. It also filters out redundant formatting styles in the original web data, which saves storage space.
Other features and advantages of the present invention will be set forth in the following description and will in part become apparent from the description or be understood by practicing the invention. The objects and other advantages of the invention may be realized and obtained by the structures particularly pointed out in the written description, the claims, and the accompanying drawings.
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Brief description of the drawings
The accompanying drawings provide a further understanding of the present invention and form a part of the specification. Together with the embodiments of the present invention, they serve to explain the invention and do not limit it. In the drawings:
Fig. 1 is a flowchart of a method for collecting website data in an embodiment of the present invention;
Fig. 2 is a flowchart of obtaining website data level by level according to the matched channels in an embodiment of the present invention;
Fig. 3 is a flowchart of obtaining content data hierarchically according to the content list in an embodiment of the present invention;
Fig. 4 is a flowchart of a method for collecting website data in Embodiment one of the present invention;
Fig. 5 is a flowchart of a method for collecting website data in Embodiment two of the present invention;
Fig. 6 is a structural diagram of a first device for collecting website data in an embodiment of the present invention;
Fig. 7 is a structural diagram of the processing module in an embodiment of the present invention;
Fig. 8 is a structural diagram of the processing unit in an embodiment of the present invention;
Fig. 9 is a structural diagram of a second device for collecting website data in an embodiment of the present invention;
Fig. 10 is a structural diagram of the storage module in an embodiment of the present invention.
Detailed description of the embodiments
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are only intended to illustrate and explain the present invention and are not intended to limit it.
Fig. 1 is a flowchart of a method for collecting website data in an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps S101-S104:
Step S101, pre-configuring the root web address of a website.
Here, the root web address of a website is the home page under the root directory, that is, the root directory of the server site. By default, the home page configured for a website's root directory is the site's home page address; for example, Baidu's home page address is www.baidu.com.
Note that not every site home page is the root directory of the server site. For example, suppose a server hosts website A (www.a.com) and a forum is then set up in a subfolder (for example, at www.a.com/bbs). For this forum, the root web address is www.a.com, while the home page link is www.a.com/bbs.
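For illustration only, the difference between a home page link and the root address of the site can be expressed with Python's standard `urllib.parse` module. The function name `root_address` and the choice of `http://` as the default scheme are assumptions of this sketch, not part of the described method.

```python
from urllib.parse import urlsplit


def root_address(url):
    """Return the scheme and host of `url`, i.e. the root address of the site."""
    # A bare host such as "www.a.com/bbs" has no scheme; without one,
    # urlsplit would treat the host as part of the path, so add a default.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}/"
```

With the forum example above, `root_address("www.a.com/bbs")` yields the root address of website A rather than the forum's home page link.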
Step S102, obtaining navigation bar information of the website according to the root web address, the navigation bar information comprising channel information.
A navigation bar is generally located in the header area of a page, as a horizontal row of navigation buttons above or below the header banner image, and it links to the pages of the next level. A navigation bar also helps visitors locate the resource area they need more clearly and find resources. For example, the options at the top of Baidu's page, such as "news, web pages, Knows, MP3 ...", are an example of a navigation bar. In the embodiments of the present invention, navigation bar information is information related to the website's navigation bar.
A channel is a directory at the next level under a navigation bar entry. For example, the level below the navigation bar entry "News" may be divided into a "Financial news channel", a "Military news channel", an "Entertainment news channel", a "Sports news channel", and so on. In the embodiments of the present invention, channel information is information about the channels contained in the navigation bar.
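For illustration only, extracting channel information from a home page might be sketched with Python's standard `html.parser`. The sketch assumes that the navigation bar is marked up as a `<nav>` element containing `<a>` links; real sites vary, and the patent does not specify how the navigation bar is located. The names `NavBarParser` and `get_channels` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NavBarParser(HTMLParser):
    """Collect (channel name, absolute URL) pairs from links inside <nav>."""

    def __init__(self, root_url):
        super().__init__()
        self.root_url = root_url
        self.channels = []     # list of (name, url) tuples
        self._nav_depth = 0    # greater than 0 while inside a <nav> element
        self._href = None      # href of the <a> currently being read
        self._text = []        # text fragments of the current link

    def handle_starttag(self, tag, attrs):
        if tag == "nav":
            self._nav_depth += 1
        elif tag == "a" and self._nav_depth:
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "nav" and self._nav_depth:
            self._nav_depth -= 1
        elif tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            # Resolve relative links against the root address.
            self.channels.append((name, urljoin(self.root_url, self._href)))
            self._href = None


def get_channels(root_url, html):
    """Return the channels found in the navigation bar of `html`."""
    parser = NavBarParser(root_url)
    parser.feed(html)
    return parser.channels
```

Links outside the `<nav>` element are ignored, so only navigation bar entries are reported as channels.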
Step S103, matching the required channels from the channel information.
Specifically, the required channel information is determined by the website data that needs to be obtained. Taking the "News" example above again, when only financial news is needed, the "Financial news channel" is matched from the channel information containing the many news channels. Of course, when information from all channels is needed, the required channels are all of the channels.
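For illustration only, the matching step can be sketched as a filter over (name, URL) pairs. Matching by exact channel name is an assumption of this sketch; the patent does not specify the matching rule. The name `match_channels` is hypothetical.

```python
def match_channels(channels, required=None):
    """Select the channels whose names appear in `required`.

    `channels` is a list of (name, url) pairs. When `required` is None,
    every channel is treated as required.
    """
    if required is None:
        return list(channels)
    wanted = set(required)
    return [(name, url) for name, url in channels if name in wanted]
```

Passing `required=None` corresponds to the case in which the information of all channels is needed.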
Step S104, obtaining website data level by level according to the matched channels.
The method for collecting website data provided by the embodiments of the present invention obtains website structure cluster information level by level according to the website's tree structure and, for each matched channel, obtains website data level by level, so that the data is obtained already classified. In addition, the obtained data corresponds to the website's structure cluster, so a later website data classification step is avoided and the efficiency of data collection is improved.
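For illustration only, steps S101-S104 can be tied together in a small driver function. All three callables passed in (`fetch`, `get_channels`, `get_items`) are hypothetical stand-ins for page download, navigation bar parsing, and content list parsing; none of these names comes from the patent.

```python
def collect_site(root_url, required, fetch, get_channels, get_items):
    """Run steps S101-S104 and return the data grouped by channel name.

    Assumed signatures of the caller-supplied functions:
      fetch(url) -> page source as a string
      get_channels(root_url, html) -> [(channel_name, channel_url), ...]
      get_items(channel_url, html) -> [item_url, ...]
    Pass required=None to collect every channel.
    """
    home = fetch(root_url)                        # S101: configured root address
    channels = get_channels(root_url, home)       # S102: navigation bar channels
    matched = [c for c in channels
               if required is None or c[0] in required]   # S103: matching
    collected = {}
    for name, channel_url in matched:             # S104: descend level by level
        item_urls = get_items(channel_url, fetch(channel_url))
        collected[name] = {url: fetch(url) for url in item_urls}
    return collected
```

Because the result is keyed by channel, the collected pages are grouped according to the site structure without a separate classification pass.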
In one embodiment, as shown in Fig. 2, obtaining website data level by level according to the matched channels in step S104 specifically comprises steps S201 and S202:
Step S201, obtaining the content list in each channel according to the matched channels.
The next level under each channel includes the list of content contained in that channel. Taking the "News" example above again, if the required channel is the "Financial news channel", the content list at its next level may include "Stocks", "Financing", "Financial figures", and so on, and the required website data can be obtained according to this content list.
Step S202, obtaining content data hierarchically according to the content list, the content data being the required website data.
In one embodiment, as shown in Fig. 3, obtaining content data hierarchically according to the content list in step S202 specifically comprises the following steps S301-S302:
Step S301, determining the address of the corresponding content page according to the content list.
Each item in the content list has a corresponding address. Once the address of the corresponding content page has been determined, the website data at that address can be extracted.
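For illustration only, determining content page addresses from a content list page might be sketched as below. The sketch assumes that a page belongs to a channel when its URL starts with the channel's URL, which is one simple way to stay inside the channel; the patent does not prescribe this rule. The names `LinkParser` and `content_page_addresses` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> element."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def content_page_addresses(channel_url, list_html):
    """Return the unique content page URLs that lie under `channel_url`."""
    parser = LinkParser()
    parser.feed(list_html)
    seen, result = set(), []
    for href in parser.hrefs:
        url = urljoin(channel_url, href)   # resolve relative links
        inside = url.startswith(channel_url) and url != channel_url
        if inside and url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Links that resolve to other hosts or other channels are dropped, so the collection does not wander outside the matched channel.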
Step S302, determining the source code of the content page according to the address of the content page, and obtaining the content data from the source code.
The structure of each website differs slightly, and some websites also carry information such as advertisements. In the embodiments of the present invention, obtaining content data from the source code effectively screens out advertisements and irrelevant content, and also prevents crawling to address links outside the channel.
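For illustration only, obtaining content data from page source might be sketched as below. The sketch assumes that the main content sits inside an `<article>` element and that everything outside it (advertisements, footers) can be ignored; this is an assumption about page layout, not something the patent specifies. The names `ContentParser` and `extract_content` are hypothetical.

```python
from html.parser import HTMLParser


class ContentParser(HTMLParser):
    """Keep text inside <article>, skipping <script> and <style> blocks."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._article_depth = 0   # greater than 0 while inside <article>
        self._skip_depth = 0      # greater than 0 while inside script/style
        self.chunks = []          # collected text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self._article_depth += 1
        elif tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self._article_depth:
            self._article_depth -= 1
        elif tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._article_depth and not self._skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_content(source_code):
    """Return the article text of a page, one fragment per line."""
    parser = ContentParser()
    parser.feed(source_code)
    parser.close()
    return "\n".join(parser.chunks)
```

A production extractor would need per-site rules for locating the content region; the fixed `<article>` rule here only demonstrates the idea of reading content from the source rather than from the rendered page.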
The method for collecting website data provided by the embodiments of the present invention, which obtains website data level by level, is described below through specific embodiments.
Embodiment one
Fig. 4 shows a method for collecting website data provided in Embodiment one of the present invention. In Embodiment one, the obtained website data is stored hierarchically, which avoids a separate data classification step. As shown in Fig. 4, the method comprises the following steps S401-S406:
Step S401, pre-configuring the root web address of a website.
Step S402, obtaining navigation bar information of the website according to the root web address, the navigation bar information comprising the channel information of each channel.
Step S403, matching the required channels from the channel information.
Step S404, obtaining the content list in each channel according to the matched channels.
Step S405, obtaining content data hierarchically according to the content list, the content data being the required website data.
Step S406, storing the website data hierarchically, and performing unified encoding processing on the website data.
In Embodiment one of the present invention, performing unified encoding processing on the website data makes it easier to store the data in a consistent format. It also filters out redundant formatting styles in the original web data, which saves storage space.
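For illustration only, unified encoding processing might be sketched as decoding each page from its own character set, stripping markup, and re-encoding the result. The choice of UTF-8 as the unified encoding, the regular-expression tag stripping, and the name `unify_encoding` are assumptions of this sketch; the patent does not name a target encoding or a filtering technique.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")    # any HTML tag, including inline styles
SPACE_RE = re.compile(r"\s+")      # runs of whitespace


def unify_encoding(raw, source_encoding):
    """Decode `raw` bytes, drop markup, and return compact UTF-8 bytes."""
    text = raw.decode(source_encoding, errors="replace")
    # Replace tags with a space so that adjacent words do not fuse together.
    text = TAG_RE.sub(" ", text)
    text = SPACE_RE.sub(" ", text).strip()
    return text.encode("utf-8")
```

Removing tags and their style attributes before storage is what reduces the stored size; a regular expression is a crude way to do this and would mishandle edge cases such as `>` inside attribute values.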
Embodiment two
Fig. 5 shows a method for collecting website data provided in Embodiment two of the present invention. In Embodiment two, content data is obtained from the source code, and the obtained website data is stored hierarchically according to the website's structure cluster, which avoids a separate data classification step. As shown in Fig. 5, the method comprises the following steps S501-S509:
Step S501, pre-configuring the root web address of a website.
Step S502, obtaining navigation bar information of the website according to the root web address, the navigation bar information comprising the channel information of each channel.
Step S503, matching the required channels from the channel information.
Step S504, obtaining the content list in each channel according to the matched channels.
Step S505, determining the address of the corresponding content page according to the content list.
Step S506, determining the source code of the content page according to the address of the content page, and obtaining content data from the source code, the content data being the required website data.
Step S507, setting directory nodes according to the levels of the structure cluster associated with the root web address.
Step S508, storing the obtained website data level by level under the corresponding directory nodes.
Step S509, performing unified encoding processing on the website data.
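For illustration only, setting directory nodes that mirror the site structure (steps S507 and S508) might be sketched with Python's standard `pathlib`. Mapping each URL path segment to one directory level is an assumption of this sketch, as are the names `directory_node` and `store_page`.

```python
from pathlib import Path
from urllib.parse import urlsplit


def directory_node(base_dir, page_url):
    """Map a page URL to (directory node, file name) under `base_dir`."""
    parts = urlsplit(page_url)
    # Drop empty segments and "." / ".." so paths cannot escape base_dir.
    segments = [s for s in parts.path.split("/") if s not in ("", ".", "..")]
    if not segments or parts.path.endswith("/"):
        segments.append("index.html")   # a directory URL gets a default file
    *levels, page = segments
    return Path(base_dir, parts.netloc, *levels), page


def store_page(base_dir, page_url, data):
    """Write `data` (bytes) under the directory node for `page_url`."""
    node, page = directory_node(base_dir, page_url)
    node.mkdir(parents=True, exist_ok=True)
    target = node / page
    target.write_bytes(data)
    return target
```

Under this mapping, a page in the stock list of the finance channel lands in a `finance/stock` directory beneath the site's host name, so the stored tree reflects the channel hierarchy.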
Based on the same inventive concept, and corresponding to the method for collecting website data provided by the above embodiments, an embodiment of the present invention further provides a device for collecting website data. As shown in Fig. 6, the device specifically comprises:
a configuration module 61, for pre-configuring the root web address of a website;
an acquisition module 62, for obtaining navigation bar information of the website according to the root web address, the navigation bar information comprising channel information;
a matching module 63, for matching the required channels from the channel information;
a processing module 64, for obtaining website data level by level according to the matched channels.
In one embodiment, as shown in Fig. 7, the processing module 64 comprises:
an acquiring unit 641, for obtaining the content list in each channel according to the matched channels;
a processing unit 642, for obtaining content data hierarchically according to the content list, the content data being the required website data.
In one embodiment, as shown in Fig. 8, the processing unit 642 comprises:
a determining subunit 6421, for determining the address of the corresponding content page according to the content list;
an obtaining subunit 6422, for determining the source code of the content page according to the address of the content page, and obtaining the content data from the source code.
In one embodiment, as shown in Fig. 9, the device further comprises:
a storage module 65, for storing the website data hierarchically and performing unified encoding processing on the website data.
In one embodiment, as shown in Fig. 10, the storage module 65 comprises:
a node setting unit 651, for setting directory nodes according to the levels of the structure cluster associated with the root web address;
a hierarchical storage unit 652, for storing the obtained website data level by level under the corresponding directory nodes.
The method and device for collecting website data provided by the embodiments of the present invention obtain website structure cluster information level by level according to the website's tree structure and, for each matched channel, obtain website data level by level, so that the data is obtained already classified. In addition, the obtained data corresponds to the website's structure cluster, so a later website data classification step is avoided and the efficiency of data collection is improved. Obtaining content data from the source code effectively screens out advertisements and irrelevant content, and also prevents crawling to address links outside the channel. Performing unified encoding processing on the website data makes it easier to store the data in a consistent format. It also filters out redundant formatting styles in the original web data, which saves storage space.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (10)

1. A method for collecting website data, characterized by comprising:
pre-configuring the root web address of a website;
obtaining navigation bar information of the website according to the root web address, the navigation bar information comprising channel information;
matching the required channels from the channel information;
obtaining website data level by level according to the matched channels.
2. The method according to claim 1, characterized in that obtaining website data level by level according to the matched channels comprises:
obtaining the content list in each channel according to the matched channels;
obtaining content data hierarchically according to the content list, the content data being the required website data.
3. The method according to claim 2, characterized in that obtaining content data hierarchically according to the content list specifically comprises:
determining the address of the corresponding content page according to the content list;
determining the source code of the content page according to the address of the content page, and obtaining the content data from the source code.
4. The method according to any one of claims 1-3, characterized in that, after the step of obtaining website data, the method further comprises:
storing the website data hierarchically, and performing unified encoding processing on the website data.
5. The method according to claim 4, characterized in that storing the website data hierarchically comprises:
setting directory nodes according to the levels of the structure cluster associated with the root web address;
storing the obtained website data level by level under the corresponding directory nodes.
6. A device for collecting website data, characterized by comprising:
a configuration module, for pre-configuring the root web address of a website;
an acquisition module, for obtaining navigation bar information of the website according to the root web address, the navigation bar information comprising channel information;
a matching module, for matching the required channels from the channel information;
a processing module, for obtaining website data level by level according to the matched channels.
7. The device according to claim 6, characterized in that the processing module comprises:
an acquiring unit, for obtaining the content list in each channel according to the matched channels;
a processing unit, for obtaining content data hierarchically according to the content list, the content data being the required website data.
8. The device according to claim 6, characterized in that the processing unit comprises:
a determining subunit, for determining the address of the corresponding content page according to the content list;
an obtaining subunit, for determining the source code of the content page according to the address of the content page, and obtaining the content data from the source code.
9. The device according to any one of claims 6-8, characterized by further comprising:
a storage module, for storing the website data hierarchically and performing unified encoding processing on the website data.
10. The device according to claim 9, characterized in that the storage module comprises:
a node setting unit, for setting directory nodes according to the levels of the structure cluster associated with the root web address;
a hierarchical storage unit, for storing the obtained website data level by level under the corresponding directory nodes.
CN201510164201.XA 2015-04-08 2015-04-08 Method and device for collecting website data Pending CN104765823A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510164201.XA CN104765823A (en) 2015-04-08 2015-04-08 Method and device for collecting website data

Publications (1)

Publication Number Publication Date
CN104765823A true CN104765823A (en) 2015-07-08

Family

ID=53647653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510164201.XA Pending CN104765823A (en) 2015-04-08 2015-04-08 Method and device for collecting website data

Country Status (1)

Country Link
CN (1) CN104765823A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776636A (en) * 2015-11-24 2017-05-31 北京国双科技有限公司 Data processing method and device
CN106815273A (en) * 2015-12-02 2017-06-09 北京国双科技有限公司 Date storage method and device
CN110704760A (en) * 2019-10-16 2020-01-17 北京百度网讯科技有限公司 Data processing method and device
CN112230989A (en) * 2020-12-14 2021-01-15 北京智慧星光信息技术有限公司 Webpage channel navigation bar extraction method, system, electronic equipment and storage medium
CN112632356A (en) * 2020-12-25 2021-04-09 深圳市高德信通信股份有限公司 Network information data classification collection method
CN113886661A (en) * 2021-12-06 2022-01-04 北京并行科技股份有限公司 Information acquisition method and device and computing equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101520798A (en) * 2009-03-06 2009-09-02 苏州锐创通信有限责任公司 Webpage classification technology based on vertical search and focused crawler
CN102236691A (en) * 2010-05-04 2011-11-09 张文广 Precision guided searching tool system
CN102521373A (en) * 2011-12-19 2012-06-27 李子平 Method for cross-site search and website system for same
CN103927400A (en) * 2014-05-07 2014-07-16 重庆邮电大学 Web site product detailed information classification crawling and product information base establishing method

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776636A (en) * 2015-11-24 2017-05-31 北京国双科技有限公司 Data processing method and device
CN106815273A (en) * 2015-12-02 2017-06-09 北京国双科技有限公司 Data storage method and device
CN110704760A (en) * 2019-10-16 2020-01-17 北京百度网讯科技有限公司 Data processing method and device
CN110704760B (en) * 2019-10-16 2022-08-02 北京百度网讯科技有限公司 Data processing method and device
CN112230989A (en) * 2020-12-14 2021-01-15 北京智慧星光信息技术有限公司 Webpage channel navigation bar extraction method, system, electronic equipment and storage medium
CN112632356A (en) * 2020-12-25 2021-04-09 深圳市高德信通信股份有限公司 Network information data classification collection method
CN113886661A (en) * 2021-12-06 2022-01-04 北京并行科技股份有限公司 Information acquisition method and device and computing equipment

Similar Documents

Publication Publication Date Title
CN104765823A (en) Method and device for collecting website data
CN101370024B (en) Distributed information collection method and system
DE112020002228T5 (en) COGNITIVE VIDEO AND AUDIO SEARCH AGGREGATION
DE102017111438A1 (en) API LEARNING
CN102750326A (en) Log management optimization method of cluster system based on downsizing strategy
CN105069087A (en) Web log data mining based website optimization method
CN104144181A (en) Terminal aggregation method and system for network videos
CN105721944A (en) News information recommendation method for smart television
EP3030976A1 (en) Method for processing and displaying real-time social data on map
CN106230809B (en) A kind of mobile Internet public sentiment monitoring method and system based on URL
CN103914488A (en) Document collection, identification, association, search and display system
CN103688256A (en) Method, device and system for determining video quality parameter based on comment
Croft et al. Query representation and understanding workshop
CN105260447A (en) Webpage data analysis method and system
CN106844782A (en) The multichannel big data acquisition system and method for a kind of network-oriented
CN103870495A (en) Method and device for extracting information from website
CN105550179A (en) Webpage collection method and browser plug-in
CN104750853A (en) Method and device for searching heterogeneous data
CN105426407A (en) Web data acquisition method based on content analysis
CN105069034A (en) Recommendation information generation method and apparatus
CN103914486A (en) Document search and display system
CN105989167B (en) Collecting method and device based on news client
CN104331512B (en) A kind of BBS pages automatic acquiring method
DE112016004967T5 (en) Automated discovery of information
CN115840863A (en) Webpage content tracing method, knowledge graph construction method and related equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by SIPO to initiate substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150708
