CN105279272A - Content aggregation method based on distributed web crawlers - Google Patents


Info

Publication number
CN105279272A
Authority
CN
China
Prior art keywords
information
crawler
content
crawl
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510724024.6A
Other languages
Chinese (zh)
Inventor
黄韬
魏亮
魏静波
邓晓涛
周洪利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING FUTURE NETWORKS INDUSTRY INNOVATION Co Ltd
Original Assignee
NANJING FUTURE NETWORKS INDUSTRY INNOVATION Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING FUTURE NETWORKS INDUSTRY INNOVATION Co Ltd
Priority to CN201510724024.6A
Publication of CN105279272A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention provides a content aggregation method based on distributed web crawlers. First, different crawler platforms are deployed on different devices and send requests to the network information sources to be crawled; the crawler platforms formulate crawling rules according to the target information the user requires and crawl the information that interests the target user. The crawled network information is then processed, and similarity detection is performed based on the data transmission and conversion method of a real-time database combined with locality-sensitive hashing (LSH), reducing the redundancy of the information. Finally, the system classifies and sorts the information by category, popularity, and keywords and displays it on the user equipment. The method applies LSH and similarity comparison to data acquired from a real network to obtain a comparison result. Compared with results obtained in the prior art by the traditional approach of checking whole records for duplicates, the method computes faster and compares similarity more accurately.

Description

A content aggregation method based on distributed web crawlers
Technical field
The present invention relates to the technical field of web crawlers, and in particular to a content aggregation method based on distributed web crawlers.
Background
With the development of the Internet, the era of big data has arrived, and the value of massive data will be realized more fully. Because Internet information such as massive streaming video resources and rich web content keeps growing, it is difficult for a given user to obtain the network data they need accurately and efficiently on a handheld device within limited fragments of time. Moreover, existing content aggregation technologies mostly demonstrate the superiority of their content aggregation systems through simulation based on an upper-layer architecture, and lack practical application to real network environments and to the specific information needed by potential user groups.
The filter conditions chosen by traditional content aggregation methods are too broad, and information cannot be obtained through large-scale customization, so the timeliness of the information and its relevance to the topic are hard to guarantee. Such methods cannot adapt to rapid, irregular changes in the rules of the information source, so an information source remains usable only briefly and information cannot be acquired over the long term. They cannot identify identical information that comes from multiple different sources on the Internet, which leads to duplicated and redundant information and reduces the efficiency with which target users obtain information. How to improve the persistence of information acquisition, the removal of redundant information, and the aggregation and classification of information on a content aggregation platform is therefore worth studying.
Summary of the invention
The present invention provides a content aggregation method based on distributed web crawlers. Its purpose is to solve the problem that web crawler technology in the prior art cannot effectively aggregate and classify customized network information in large batches.
The content aggregation method based on distributed web crawlers provided by the present invention comprises the following steps:
Step 1, crawl target information: first, different crawler platforms are deployed on different devices and send requests to the network information sources to be crawled; each crawler platform formulates crawling rules according to the target information the user requires and crawls the information that interests the target user;
Step 2, detect similarity in the crawled content: the crawled network information is processed, and similarity detection is performed based on the data transmission and conversion method of a real-time database combined with the locality-sensitive hashing (LSH) method, thereby reducing the redundancy of the information;
Step 3, aggregate and classify the crawled information: building on Step 2, the system classifies and sorts the filtered information by category, popularity, and keywords, and displays it on the user equipment.
In Step 1, setting up the crawler platform further includes the following:
Before a task starts, the crawler platform is deployed and the crawler attributes are configured. A web page analysis algorithm filters out links irrelevant to the user's search, and the retained links are placed in the queue of URLs to be crawled. During filtering, the back-end server first converts the web page content into text form, and a text-based web page analysis algorithm selects the next URL to crawl from the queue. These steps are repeated, traversing all pages, until the program's stop condition is met.
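The queue-and-filter loop described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the in-memory `PAGES` table stands in for real HTTP fetches, the keyword test in `is_relevant` stands in for the unspecified text-based web page analysis algorithm, and `max_pages` is an assumed stop condition.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.test/": ("news portal home",
                             ["http://example.test/news", "http://example.test/ads"]),
    "http://example.test/news": ("latest news headlines", ["http://example.test/news/1"]),
    "http://example.test/ads": ("buy cheap watches", []),
    "http://example.test/news/1": ("news story about crawlers", []),
}


def is_relevant(text, keywords):
    """Stand-in relevance filter: keep a page if any keyword occurs in its text."""
    return any(keyword in text for keyword in keywords)


def crawl(root, keywords, max_pages=100):
    """Breadth-first crawl that only enqueues links whose text passes the filter."""
    queue = deque([root])
    seen = {root}
    visited = []
    while queue and len(visited) < max_pages:  # assumed stop condition
        url = queue.popleft()
        _text, links = PAGES.get(url, ("", []))
        visited.append(url)
        for link in links:
            if link in seen:
                continue
            seen.add(link)
            # Simplification: the filter peeks at the linked page's text directly.
            link_text, _ = PAGES.get(link, ("", []))
            if is_relevant(link_text, keywords):
                queue.append(link)
    return visited
```

Calling `crawl("http://example.test/", ["news"])` visits the home page and the two news pages and never enqueues the advertising page, which is how the filter keeps the crawl scope under control.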
Deploying the crawler includes configuring the crawler service and configuring the crawler task.
Step 1 specifically comprises:
Step 1.1: the root URLs are divided into several major categories according to their business type, and the major category corresponding to the target information is selected for crawling;
Step 1.2: the crawl target addresses are configured from the URLs of the major category corresponding to the target information; each page is entered to obtain its detailed tags, and the specific content is crawled.
Step 1 further comprises Step 1.3: when Step 1.2 is carried out, if the target address page classifies the information into more detailed groups, each group page is entered to obtain its detailed tags and the specific content is crawled; Step 1.3 is repeated until the specific content has been crawled.
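Steps 1.1 to 1.3 amount to a recursive descent from a major category through its group pages down to the pages that carry content. The sketch below illustrates that descent under stated assumptions: the `SITE` dictionary, its `groups`, `tags`, and `content` keys, and the function name are all hypothetical and replace real page fetching and tag extraction.

```python
# Hypothetical site map: each page lists its tags, its sub-group pages,
# and (for leaf pages) the content items found on it.
SITE = {
    "video": {"groups": ["video/film", "video/tv"], "tags": ["video"]},
    "video/film": {"groups": [], "tags": ["film"], "content": ["Film A", "Film B"]},
    "video/tv": {"groups": ["video/tv/drama"], "tags": ["tv"]},
    "video/tv/drama": {"groups": [], "tags": ["drama"], "content": ["Drama C"]},
}


def crawl_category(url, site, path_tags=()):
    """Collect (tags, content) pairs, descending into group pages as in Step 1.3."""
    page = site[url]
    tags = path_tags + tuple(page["tags"])
    items = [(tags, content) for content in page.get("content", [])]
    for group_url in page["groups"]:
        # Repeat the same step on each more detailed group page.
        items.extend(crawl_category(group_url, site, tags))
    return items
```

Each returned item keeps the chain of tags collected on the way down, so later classification can use both the major category and the detailed group.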
Step 2 specifically comprises:
Step 2.1: the crawled network information is processed: blank strings and multimedia elements in the information are replaced, and the pictures and video resources contained in the information are extracted and replaced with corresponding text;
Step 2.2: any substring of length k in the text is defined as a k-shingle, so each piece of information can be represented as the set of k-shingles that occur one or more times in its text; this set is replaced by a much smaller signature, and the similarity of the original sets is estimated by comparing the signatures of the pieces of information;
Step 2.3: locality-sensitive hashing is applied to the information several times, so that similar items are more likely than dissimilar items to be hashed to the same bucket; any pair of pieces of information hashed to the same bucket at least once is treated as a candidate pair, and similarity detection is performed only on these candidate pairs; information whose similarity reaches the set threshold is screened out and deleted, reducing the redundancy of the information.
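As an illustration of Steps 2.1 and 2.2, the sketch below normalizes a fragment of HTML into plain text and builds its k-shingle set. The `[image]` and `[video]` placeholders, the regular expressions, and the function names are assumptions made for this example; the patent does not specify what text replaces a multimedia element. Jaccard similarity is shown as the exact set similarity that the signatures of Step 2.2 are meant to estimate.

```python
import re


def normalize(html):
    """Replace multimedia elements with text placeholders and collapse whitespace."""
    text = re.sub(r"<img\b[^>]*>", " [image] ", html, flags=re.I)
    text = re.sub(r"<video\b[^>]*>.*?</video>", " [video] ", text, flags=re.I | re.S)
    text = re.sub(r"<[^>]+>", " ", text)  # drop any remaining tags
    return re.sub(r"\s+", " ", text).strip()


def shingles(text, k=5):
    """Return the set of all substrings of length k (the k-shingles) of text."""
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact similarity of two shingle sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For example, `normalize("<p>Hello   world</p><img src='a.png'>")` yields `"Hello world [image]"`, and `shingles("abcd", 2)` yields the three 2-shingles `ab`, `bc`, and `cd`.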
Compared with the prior art, the above technical solution of the present invention has the following technical effects:
The method provided by the embodiments of the present invention performs similarity comparison after obtaining the required information, removes redundant information, and obtains information data with lower redundancy. The method applies LSH to data acquired from a real network and performs similarity comparison to obtain a comparison result. Compared with results obtained in the prior art by the traditional approach of checking whole records for duplicates, it computes faster and compares similarity more accurately.
Brief description of the drawings
The invention is further described below with reference to the accompanying drawings:
Fig. 1 is a flowchart of the crawling rules of the web crawler provided by the present invention;
Fig. 2 is a flowchart of the similarity detection process for crawled content provided by the present invention;
Fig. 3 is a flowchart of the aggregation and classification process for crawled information provided by the present invention;
Fig. 4 is a flowchart of the distributed crawler deployment method provided by the present invention;
Fig. 5 is a schematic diagram of the architecture of the distributed crawler deployment system provided by the present invention;
Fig. 6 is a schematic structural diagram of the content aggregation system based on distributed web crawlers provided by the present invention.
Detailed description of the embodiments
The present invention provides a content aggregation method based on distributed web crawlers. To make the purpose, technical solution, and effects of the present invention clearer, the invention is described in more detail below with reference to the accompanying drawings and examples. It should be understood that the specific implementations described here serve only to explain the present invention and are not intended to limit it.
Fig. 6 shows a schematic diagram of the architecture of the content aggregation system provided by the present invention. The system comprises:
User interface: the user manages and schedules tasks in the system through a graphical user interface. The scheduling service is carried out by the crawler on each node and mainly provides task start, task stop, and task status services. The graphical user interface is the visual operating interface that the content aggregation platform offers to the user, and serves as the crawler task management platform;
By calling the underlying service interfaces, this interface acts as a central platform for managing the attributes, status, and logs of crawler node tasks, providing system administrators with an easy-to-use, intuitive control platform.
Content aggregation and classification module: this module provides commands related to crawler tasks and controls the specific state of each crawler task. According to the specific customization requirements, it sets the detailed crawl scope, dynamically tracks the crawl status, and provides write and update services for data related to crawler tasks in order to obtain the specific crawled content. After the information acquisition instruction module has obtained the required data, the module fetches that data from the database as data to be processed.
Information similarity detection module: after the required information has been obtained, this module performs similarity comparison and eliminates redundant information to obtain information data with lower redundancy. It deduplicates the information from the underlying node crawlers, reducing information redundancy in the content aggregation system.
Information preprocessing module: after the similarity detection described above, the web page content is analyzed in depth. A text-based web page analysis algorithm extracts the page's text information, and the aggregation platform automatically obtains the corresponding information, such as the title and body, from the text content. It then fills the corresponding blank template, and the result is relayed by the back end for display on the handheld terminal.
Based on the above system, the content aggregation method based on distributed web crawlers provided by the present invention comprises the following steps:
Step 1, crawl target information: first, different crawler platforms are deployed on different devices and send requests to the network information sources to be crawled; each crawler platform formulates crawling rules according to the target information the user requires and crawls the information that interests the target user;
Step 2, detect similarity in the crawled content: the crawled network information is processed and, based on an improvement to database data transmission combined with the locality-sensitive hashing (LSH) method, similarity detection is performed, thereby reducing the redundancy of the information;
Step 3, aggregate and classify the crawled information: building on Step 2, the system classifies and sorts the filtered information by category, popularity, and keywords, and displays it on the user equipment.
Before a task starts, the crawlers must first be deployed and configured. Fig. 4 shows the distributed crawler deployment proposed by the present invention. The process specifically includes:
Distributed crawler configuration, which comprises crawler service configuration and crawler task configuration. Crawler service configuration ensures that the resources the service depends on can be obtained correctly, supporting normal operation of the tasks. Crawler task configuration sets the task attributes, such as the page download interval, the number of task threads, and the task execution frequency.
Distributed crawler deployment. Specifically, the crawlers are deployed mainly as web services: a Tomcat container must be installed, the crawler is finally published as a WAR package, and each deployed WAR provides its service independently. In actual deployment, one physical device hosts one crawler node. To simplify testing and make full use of resources, multiple containers can be deployed on one physical server and assigned different ports to provide service. An IP address and a port together uniquely identify a crawler node.
Centralized task management platform, which mainly comprises the task scheduling mode and the crawler aggregation management system described above. It connects to each crawler edge node, schedules system tasks, controls the state of each node, and manages and controls tasks.
More specifically, Fig. 5 shows a schematic diagram of the architecture of the distributed crawler deployment system:
The diagram presents the architecture of the distributed crawler deployment system. The centralized task management platform is the content aggregation management platform described above. It connects to each crawler edge node, schedules the tasks of the crawler system, and controls the state of each node; an administrator can select the corresponding working nodes to manage and control tasks.
The crawler edge nodes represent crawler systems deployed on different devices, which embodies the distributed deployment. Different devices represent different crawl tasks, and tasks can also be assigned across nodes. Because the tasks on the devices are independent and do not necessarily depend on one another, limited resources are fully used and task execution is faster.
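The deployment description above names three task attributes (page download interval, thread count, execution frequency) and states that an IP address and port together identify a crawler node. A minimal sketch of how a central platform might hold that information is given below. The class names, field names, and units are hypothetical; the actual system is described as a Tomcat-hosted Java web service, and Python is used here only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeAddress:
    """An IP address and port together uniquely identify one crawler node."""
    ip: str
    port: int


@dataclass
class TaskConfig:
    """Task attributes named in the description; units are assumptions."""
    download_interval_s: float  # pause between page downloads
    threads: int                # number of task threads
    runs_per_day: int           # task execution frequency


class NodeRegistry:
    """Central table of crawler nodes, keyed by (ip, port)."""

    def __init__(self):
        self._nodes = {}

    def register(self, address, config):
        if address in self._nodes:
            raise ValueError(f"node {address.ip}:{address.port} already registered")
        self._nodes[address] = config

    def get(self, address):
        return self._nodes[address]

    def __len__(self):
        return len(self._nodes)
```

Because the key is the (IP, port) pair, two containers on the same physical server register as two distinct nodes as long as their ports differ, which matches the multi-container test setup described above.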
When a task starts, the first step is to crawl the target information. Fig. 1 shows the flowchart; the specific flow is as follows:
Step 101: the crawler sends a request to the network information source site to be crawled;
Step 102: crawling rules are configured according to the business category of each source site. The rules are based on the WebMagic framework. The Internet contains numerous network information source sites, which serve as the crawl root addresses (initial URLs). Taking the URL of the page to crawl as the root URL, the WebMagic-based crawler obtains the various kinds of useful information from the source site according to the web page source code and configures the rules accordingly. For the picture category, the web page format is analyzed in detail, the pictures are classified and marked by their tags, and in a subsequent step they are displayed on the content aggregation platform page in marked order. For the video category, the video redirect URL and the final playback page are obtained according to the web page format and inserted into the content aggregation platform page for final display. Crawling rules can be adjusted and adapted dynamically as the source site changes: if a crawling rule is modified, the crawler uses the updated rule the next time it obtains information.
For example, a client searches for information of interest such as the latest news and trending topics, which is grouped into the news category. The crawl target addresses are first configured from each news URL; the crawler then enters each news page, obtains its detailed tags, and crawls the specific content. For the video and audio category, the crawler first enters the home page URL of each video site and the corresponding video categories, here the three major categories of film, TV series, and variety shows, obtains the detailed playback address on each page, and then obtains the information for each video from the playback page.
Root addresses can be classified into different categories according to their business, such as news, video, and software. On this basis, crawling rules can be configured for the different source sites. The crawling rules are controllable, which improves the validity and readability of the information. Existing web crawler technology extracts URLs from the current page and puts them into a queue until the program's stop condition is met. Source site information changes constantly, however, and this technique has difficulty obtaining a source site's information accurately over the long term. The present invention addresses this by configuring crawling rules dynamically and crawling adaptively. Take the news category, which includes the latest news and trending news: when a task is started, the crawl items are configured (this relates to the distributed deployment and task management of the fourth step); a web page analysis algorithm based on page content filters out links irrelevant to the news (the source URLs of news sites) (this relates to the second step); the retained links are placed in the queue to be crawled, which keeps the crawl scope under control; the next URL to crawl (the URL of the final page) is selected from the queue according to a search strategy; and these steps are repeated until the program's stop condition is met.
This process adapts dynamically to changes at the information source and adaptively obtains the information that interests the target group. Compared with the short-lived, non-customized bulk crawling methods of the prior art, the information it crawls is more useful and the information source stays usable for longer.
Fig. 2 shows the similarity detection process for crawled content; the specific flow is as follows:
Step 201: the crawled network information is processed: blank strings and multimedia elements in the information are replaced, the pictures and video resources contained in the information are extracted and replaced with corresponding text, and the multi-element network information is finally stored as text;
Step 202: a value of k is chosen, the set of k-shingles is constructed for each piece of information, and these k-shingles are mapped to shorter bucket numbers;
Step 203: a length n is chosen for the min-hash signatures, and the min-hash signature of each piece of information is computed;
Step 204: a threshold t is set that defines how similar two pieces of information must be to be regarded as a similar pair. The number of bands b and the number of rows r in each band are chosen so that br = n and the threshold t is approximately equal to (1/b)^(1/r). To avoid false negatives, b and r should be chosen to produce a threshold lower than t, while also taking the speed of the similarity comparison into account. Once suitable b and r have been chosen, LSH is used to build the candidate pairs. The signatures of each candidate pair are checked to determine whether the fraction of positions in which they agree is greater than t; if it is, one of the two pieces of information is deleted to avoid duplicate information.
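Steps 202 to 204 can be sketched end to end as shown below. This is an illustrative reading of the flow, not the patent's code: the seeded BLAKE2b hash family, the default values k = 5, b = 20, r = 5, t = 0.6, and all function names are assumptions. With b = 20 and r = 5 the banding threshold (1/b)^(1/r) is about 0.55, which is below t = 0.6 as Step 204 recommends for avoiding false negatives.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Set of k-shingles; a text shorter than k yields itself as one shingle."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def _hash(seed, shingle):
    """One member of an assumed hash family: 64-bit BLAKE2b of 'seed:shingle'."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, n):
    """Step 203: signature of length n, one minimum per hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(n)]


def lsh_threshold(b, r):
    """Approximate similarity at which a pair becomes likely to be a candidate."""
    return (1 / b) ** (1 / r)


def candidate_pairs(signatures, b, r):
    """Step 204: pairs whose signatures agree on all r rows of at least one band."""
    pairs = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[band * r:(band + 1) * r])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


def agreement(sig_a, sig_b):
    """Fraction of signature positions in which two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, k=5, b=20, r=5, t=0.6):
    """Return the ids that survive after dropping one item of each similar pair."""
    sigs = {i: minhash_signature(shingles(d, k), b * r) for i, d in docs.items()}
    removed = set()
    for first, second in sorted(candidate_pairs(sigs, b, r)):
        if first in removed or second in removed:
            continue
        if agreement(sigs[first], sigs[second]) > t:
            removed.add(second)  # keep the first item, delete its near-duplicate
    return [i for i in docs if i not in removed]
```

Only candidate pairs are compared in `deduplicate`, which is where the speed advantage over comparing every whole record against every other comes from; raising b or lowering r lowers the banding threshold and admits more candidates at the cost of more comparisons.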
The method provided by the embodiments of the present invention performs similarity comparison after obtaining the required information, eliminates redundant information, and obtains information data with lower redundancy. The method applies LSH to data acquired from a real network and performs similarity comparison to obtain a comparison result. Compared with results obtained in the prior art by the traditional approach of checking whole records for duplicates, it computes faster and compares similarity more accurately.
Fig. 3 shows the aggregation and classification process for crawled information; the specific flow comprises:
Step 301: the aggregation platform provides commands related to crawler tasks and controls the specific state of each crawler task. According to the specific customization requirements, it sets the detailed crawl scope and dynamically tracks the crawl status.
Specifically, the crawler information aggregation management platform depends on the physical connection and message communication of the crawler nodes in the network, and on communication with the database. The management platform and the crawler nodes communicate through HTTP services, and the port is the one opened by the container. The crawler nodes write status and other information to the database, and the management platform reads data from the database, so both need to connect to and communicate with the database.
Crawler information aggregation management platform: manages the crawlers and controls crawler state;
Node: each distributed deployment location of a crawler;
Database: stores the crawled information;
Interface: displays the crawled information.
Step 302: the data obtained in Step 301 is preprocessed. According to the customized items, the aggregation platform obtains the first round of data from the database. Based on the crawl time and the crawler task status, it produces a preliminary page display of the crawled content, providing data for the aggregation management page.
Step 303: the data is aggregated into the final data to be processed. Aggregation here means the form the data takes after comprehensive processing. It is a routine kind of data processing: after the steps above, the crawled data is integrated into a centralized form that can be displayed successfully.
Step 304: the data to be processed obtained in Step 303 is presented to the user visually; the user can also customize the information needed through the page options.
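The exchange described under Step 301, in which crawler nodes write their status to the database and the management platform reads it back, can be illustrated with a small sketch. The table name, column names, and the use of an in-memory SQLite database are assumptions for this example; the patent does not name the database or its schema.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical node status table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node_status ("
        "node TEXT PRIMARY KEY, task TEXT, state TEXT, pages INTEGER)"
    )
    return conn


def report_status(conn, node, task, state, pages):
    """Called by a crawler node: write or overwrite its current status row."""
    conn.execute(
        "INSERT OR REPLACE INTO node_status (node, task, state, pages) "
        "VALUES (?, ?, ?, ?)",
        (node, task, state, pages),
    )
    conn.commit()


def list_status(conn):
    """Called by the management platform: read the status of every node."""
    query = "SELECT node, task, state, pages FROM node_status ORDER BY node"
    return conn.execute(query).fetchall()
```

Routing status through the database means the management platform never has to poll each node for its state; it only needs the HTTP interface for commands such as starting or stopping a task.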
Finally, the result display module has two display targets: the PC client and the handheld client. On the PC client, the user enters the corresponding access address, logs in, and performs the specific customization. The customization requirements are transmitted via Step 301, the crawler state is updated, a connection is made to the database, and the data returned by Step 303 is displayed. Conventional web content analysis cannot display information in a regular way. By analyzing web page content in depth, and based on the diversity of information and variability of topics that the data presents, the acquired content is integrated to implement functions such as deep parallel retrieval of source page information and keyword search within a category; the related content is classified according to latent semantics, and the data can be sorted accurately by timeliness. The handheld device works on the same principle as the PC client. Result display also includes a page layout optimization function: for customized content, the user can choose to sort the displayed page by popularity or by update time.
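The display options mentioned above, grouping by category and sorting by popularity or update time, can be sketched as follows. The `Item` fields and the function names are hypothetical, and how the system actually measures popularity is not stated in the source.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Item:
    title: str
    category: str
    popularity: int      # assumed numeric popularity score
    updated: datetime    # time of the last update


def sort_items(items, by="popularity"):
    """Sort for display, most popular or most recently updated first."""
    if by == "popularity":
        return sorted(items, key=lambda item: item.popularity, reverse=True)
    if by == "updated":
        return sorted(items, key=lambda item: item.updated, reverse=True)
    raise ValueError(f"unknown sort key: {by}")


def group_by_category(items):
    """Classify items by category, preserving their incoming order."""
    groups = {}
    for item in items:
        groups.setdefault(item.category, []).append(item)
    return groups
```

Sorting first and grouping afterwards keeps each category's list in the order the user selected, since `group_by_category` preserves incoming order.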
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A content aggregation method based on distributed web crawlers, characterized in that the method comprises the following steps:
Step 1, crawl target information: first, different crawler platforms are deployed on different devices and send requests to the network information sources to be crawled; each crawler platform formulates crawling rules according to the target information the user requires and crawls the information that interests the target user;
Step 2, detect similarity in the crawled content: the crawled network information is processed, and similarity detection is performed based on the data transmission and conversion method of a real-time database combined with the locality-sensitive hashing (LSH) method, thereby reducing the redundancy of the information;
Step 3, aggregate and classify the crawled information: building on Step 2, the system classifies and sorts the filtered information by category, popularity, and keywords, and displays it on the user equipment.
2. The content aggregation method based on distributed web crawlers according to claim 1, characterized in that,
in Step 1, setting up the crawler platform further includes the following:
before a task starts, the crawler platform is deployed and the crawler attributes are configured; a web page analysis algorithm filters out links irrelevant to the user's search, and the retained links are placed in the queue of URLs to be crawled; during filtering, the back-end server first converts the web page content into text form, and a text-based web page analysis algorithm selects the next URL to crawl from the queue; these steps are repeated, traversing all pages, until the program's stop condition is met.
3. The content aggregation method based on distributed web crawlers according to claim 1, characterized in that deploying the crawler includes configuring the crawler service and configuring the crawler task.
4. The content aggregation method based on distributed web crawlers according to claim 1, characterized in that,
Step 1 specifically comprises:
Step 1.1: the root URLs are divided into several major categories according to their business type, and the major category corresponding to the target information is selected for crawling;
Step 1.2: the crawl target addresses are configured from the URLs of the major category corresponding to the target information; each page is entered to obtain its detailed tags, and the specific content is crawled.
5. The content aggregation method based on distributed web crawlers according to claim 3, characterized in that Step 1 further comprises Step 1.3: when Step 1.2 is carried out, if the target address page classifies the information into more detailed groups, each group page is entered to obtain its detailed tags and the specific content is crawled; Step 1.3 is repeated until the specific content has been crawled.
6. The content aggregation method based on distributed web crawlers according to claim 3, characterized in that,
Step 2 specifically comprises:
Step 2.1: the crawled network information is processed: blank strings and multimedia elements in the information are replaced, and the pictures and video resources contained in the information are extracted and replaced with corresponding text;
Step 2.2: any substring of length k in the text is defined as a k-shingle, so each piece of information can be represented as the set of k-shingles that occur one or more times in its text; this set is replaced by a much smaller signature, and the similarity of the original sets is estimated by comparing the signatures of the pieces of information;
Step 2.3: locality-sensitive hashing is applied to the information several times, so that similar items are more likely than dissimilar items to be hashed to the same bucket; any pair of pieces of information hashed to the same bucket at least once is treated as a candidate pair, and similarity detection is performed only on these candidate pairs; information whose similarity reaches the set threshold is screened out and deleted, reducing the redundancy of the information.
CN201510724024.6A 2015-10-30 2015-10-30 Content aggregation method based on distributed web crawlers Pending CN105279272A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510724024.6A CN105279272A (en) 2015-10-30 2015-10-30 Content aggregation method based on distributed web crawlers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510724024.6A CN105279272A (en) 2015-10-30 2015-10-30 Content aggregation method based on distributed web crawlers

Publications (1)

Publication Number Publication Date
CN105279272A true CN105279272A (en) 2016-01-27

Family

ID=55148286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510724024.6A Pending CN105279272A (en) 2015-10-30 2015-10-30 Content aggregation method based on distributed web crawlers

Country Status (1)

Country Link
CN (1) CN105279272A (en)

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893559A (en) * 2016-03-31 2016-08-24 北京奇艺世纪科技有限公司 Data pushing method and device
CN105956070A (en) * 2016-04-28 2016-09-21 优品财富管理有限公司 Method and system for integrating repetitive records
CN106326447A (en) * 2016-08-26 2017-01-11 北京量科邦信息技术有限公司 Detection method and system of data captured by crowd sourcing network crawlers
CN106844774A (en) * 2017-03-01 2017-06-13 苏州朗动网络科技有限公司 A kind of crawler system and grasping means based on C# crawl internet public datas
CN107066492A (en) * 2016-12-29 2017-08-18 百视通网络电视技术发展有限责任公司 Matchmaker provides metadata acquisition method and system
CN107315799A (en) * 2017-06-19 2017-11-03 重庆誉存大数据科技有限公司 A kind of internet duplicate message screening technique and system
CN107590236A (en) * 2017-09-09 2018-01-16 杭州数立方征信有限公司 A kind of big data acquisition method and system towards enterprise in charge of construction
CN107943588A (en) * 2017-11-22 2018-04-20 用友金融信息技术股份有限公司 Data processing method, system, computer equipment and readable storage medium storing program for executing
CN108268498A (en) * 2016-12-30 2018-07-10 北京国双科技有限公司 The treating method and apparatus of batch reptile task
CN108446287A (en) * 2017-02-16 2018-08-24 北京国双科技有限公司 Web page crawl method and device
CN108650260A (en) * 2018-05-09 2018-10-12 北京邮电大学 A kind of recognition methods of malicious websites and device
CN109121001A (en) * 2018-09-05 2019-01-01 深圳市酷开网络科技有限公司 A kind of carousel system, carousel method and the smart television of more content quotient
CN109213824A (en) * 2017-06-29 2019-01-15 北京京东尚科信息技术有限公司 Data grabber system, method and apparatus
CN109246141A (en) * 2018-10-26 2019-01-18 电子科技大学 A kind of anti-excessive crawler method based on SDN
CN109299260A (en) * 2018-09-29 2019-02-01 上海晶赞融宣科技有限公司 Data classification method, device and computer readable storage medium
CN109460500A (en) * 2018-10-24 2019-03-12 深圳市腾讯计算机系统有限公司 Focus incident finds method, apparatus, computer equipment and storage medium
CN109753596A (en) * 2018-12-29 2019-05-14 中国科学院计算技术研究所 Information source management and configuration method and system for the acquisition of large scale network data
CN109815382A (en) * 2018-12-29 2019-05-28 中国科学院计算技术研究所 The perception and acquisition methods and system of large scale network data
CN109840298A (en) * 2018-12-29 2019-06-04 中国科学院计算技术研究所 The multi information source acquisition method and system of large scale network data
CN109902220A (en) * 2019-02-27 2019-06-18 腾讯科技(深圳)有限公司 Webpage information acquisition methods, device and computer readable storage medium
CN110286873A (en) * 2019-06-19 2019-09-27 深圳市微课科技有限公司 Web-page audio playback method, device, computer equipment and storage medium
CN110321466A (en) * 2019-06-14 2019-10-11 广发证券股份有限公司 A kind of security information duplicate checking method and system based on semantic analysis
CN110502689A (en) * 2019-08-28 2019-11-26 上海智臻智能网络科技股份有限公司 The crawling method and device of knowledge point, storage medium, terminal
CN110968770A (en) * 2018-09-29 2020-04-07 北京国双科技有限公司 Method and device for terminating crawling of crawler tool
CN111104617A (en) * 2019-12-11 2020-05-05 西安易朴通讯技术有限公司 Webpage data acquisition method and device, electronic equipment and storage medium
CN111131899A (en) * 2018-10-31 2020-05-08 中国移动通信集团浙江有限公司 Multi-site video playing record integration method and device
CN111859076A (en) * 2020-07-31 2020-10-30 平安健康保险股份有限公司 Data crawling method and device, computer equipment and computer readable storage medium
CN113076459A (en) * 2021-04-27 2021-07-06 无锡星凝互动科技有限公司 Neural network building method and system based on AI consultation
CN113312343A (en) * 2021-06-11 2021-08-27 北京思特奇信息技术股份有限公司 Business opportunity management method and system based on web crawler tool
CN113987569A (en) * 2021-10-14 2022-01-28 武汉联影医疗科技有限公司 Anti-crawler method and device, computer equipment and storage medium
CN114791978A (en) * 2022-04-19 2022-07-26 中国电信股份有限公司 News recommendation method, device, equipment and storage medium
US20220414163A1 (en) * 2020-03-10 2022-12-29 Haenasoft Company, Limited System for selectively importing web data by arbitrarily setting action design

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090157666A1 (en) * 2007-12-14 2009-06-18 Fast Search & Transfer As Method for improving search engine efficiency
CN101645082A (en) * 2009-04-17 2010-02-10 华中科技大学 Similar web page duplicate-removing system based on parallel programming mode
CN101667194A (en) * 2009-09-29 2010-03-10 北京大学 Automatic abstracting method and system based on user comment text feature
CN102831234A (en) * 2012-08-31 2012-12-19 北京邮电大学 Personalized news recommendation device and method based on news content and theme feature
US20150254344A1 (en) * 2008-06-18 2015-09-10 Zeitera, Llc Scalable, Adaptable, and Manageable System for Multimedia Identification

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090157666A1 (en) * 2007-12-14 2009-06-18 Fast Search & Transfer As Method for improving search engine efficiency
US20150254344A1 (en) * 2008-06-18 2015-09-10 Zeitera, Llc Scalable, Adaptable, and Manageable System for Multimedia Identification
CN101645082A (en) * 2009-04-17 2010-02-10 华中科技大学 Similar web page duplicate-removing system based on parallel programming mode
CN101667194A (en) * 2009-09-29 2010-03-10 北京大学 Automatic abstracting method and system based on user comment text feature
CN102831234A (en) * 2012-08-31 2012-12-19 北京邮电大学 Personalized news recommendation device and method based on news content and theme feature

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王扬: "基于web的优惠网购系统的设计与实现", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
章群燕: "社交媒体中协作用户检测", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (44)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893559A (en) * 2016-03-31 2016-08-24 北京奇艺世纪科技有限公司 Data pushing method and device
CN105956070A (en) * 2016-04-28 2016-09-21 优品财富管理有限公司 Method and system for integrating repetitive records
CN106326447A (en) * 2016-08-26 2017-01-11 北京量科邦信息技术有限公司 Detection method and system of data captured by crowd sourcing network crawlers
CN107066492A (en) * 2016-12-29 2017-08-18 百视通网络电视技术发展有限责任公司 Matchmaker provides metadata acquisition method and system
CN108268498A (en) * 2016-12-30 2018-07-10 北京国双科技有限公司 The treating method and apparatus of batch reptile task
CN108446287A (en) * 2017-02-16 2018-08-24 北京国双科技有限公司 Web page crawl method and device
CN106844774A (en) * 2017-03-01 2017-06-13 苏州朗动网络科技有限公司 A kind of crawler system and grasping means based on C# crawl internet public datas
CN107315799A (en) * 2017-06-19 2017-11-03 重庆誉存大数据科技有限公司 A kind of internet duplicate message screening technique and system
CN109213824A (en) * 2017-06-29 2019-01-15 北京京东尚科信息技术有限公司 Data grabber system, method and apparatus
CN109213824B (en) * 2017-06-29 2022-03-04 北京京东尚科信息技术有限公司 Data capture system, method and device
CN107590236A (en) * 2017-09-09 2018-01-16 杭州数立方征信有限公司 A kind of big data acquisition method and system towards enterprise in charge of construction
CN107943588A (en) * 2017-11-22 2018-04-20 用友金融信息技术股份有限公司 Data processing method, system, computer equipment and readable storage medium storing program for executing
CN108650260A (en) * 2018-05-09 2018-10-12 北京邮电大学 A kind of recognition methods of malicious websites and device
CN109121001A (en) * 2018-09-05 2019-01-01 深圳市酷开网络科技有限公司 A kind of carousel system, carousel method and the smart television of more content quotient
CN109121001B (en) * 2018-09-05 2021-07-27 深圳市酷开网络科技股份有限公司 Carousel system and carousel method for multiple content providers and smart television
CN109299260A (en) * 2018-09-29 2019-02-01 上海晶赞融宣科技有限公司 Data classification method, device and computer readable storage medium
CN110968770B (en) * 2018-09-29 2023-09-05 北京国双科技有限公司 Method and device for stopping crawling of crawler tool
CN110968770A (en) * 2018-09-29 2020-04-07 北京国双科技有限公司 Method and device for terminating crawling of crawler tool
CN109299260B (en) * 2018-09-29 2021-01-19 上海晶赞融宣科技有限公司 Data classification method, device and computer readable storage medium
CN109460500A (en) * 2018-10-24 2019-03-12 深圳市腾讯计算机系统有限公司 Focus incident finds method, apparatus, computer equipment and storage medium
CN109246141A (en) * 2018-10-26 2019-01-18 电子科技大学 A kind of anti-excessive crawler method based on SDN
CN109246141B (en) * 2018-10-26 2021-03-12 电子科技大学 SDN-based excessive crawler prevention method
CN111131899A (en) * 2018-10-31 2020-05-08 中国移动通信集团浙江有限公司 Multi-site video playing record integration method and device
CN109840298B (en) * 2018-12-29 2021-09-24 中国科学院计算技术研究所 Multi-information-source acquisition method and system for large-scale network data
CN109753596A (en) * 2018-12-29 2019-05-14 中国科学院计算技术研究所 Information source management and configuration method and system for the acquisition of large scale network data
CN109815382A (en) * 2018-12-29 2019-05-28 中国科学院计算技术研究所 The perception and acquisition methods and system of large scale network data
CN109840298A (en) * 2018-12-29 2019-06-04 中国科学院计算技术研究所 The multi information source acquisition method and system of large scale network data
CN109753596B (en) * 2018-12-29 2021-05-25 中国科学院计算技术研究所 Information source management and configuration method and system for large-scale network data acquisition
CN109902220B (en) * 2019-02-27 2023-11-24 腾讯科技(深圳)有限公司 Webpage information acquisition method, device and computer readable storage medium
CN109902220A (en) * 2019-02-27 2019-06-18 腾讯科技(深圳)有限公司 Webpage information acquisition methods, device and computer readable storage medium
CN110321466A (en) * 2019-06-14 2019-10-11 广发证券股份有限公司 A kind of security information duplicate checking method and system based on semantic analysis
CN110321466B (en) * 2019-06-14 2023-09-15 广发证券股份有限公司 Securities information duplicate checking method and system based on semantic analysis
CN110286873A (en) * 2019-06-19 2019-09-27 深圳市微课科技有限公司 Web-page audio playback method, device, computer equipment and storage medium
CN110502689A (en) * 2019-08-28 2019-11-26 上海智臻智能网络科技股份有限公司 The crawling method and device of knowledge point, storage medium, terminal
CN111104617A (en) * 2019-12-11 2020-05-05 西安易朴通讯技术有限公司 Webpage data acquisition method and device, electronic equipment and storage medium
CN111104617B (en) * 2019-12-11 2023-05-09 西安易朴通讯技术有限公司 Webpage data acquisition method and device, electronic equipment and storage medium
US20220414163A1 (en) * 2020-03-10 2022-12-29 Haenasoft Company, Limited System for selectively importing web data by arbitrarily setting action design
US11836195B2 (en) * 2020-03-10 2023-12-05 Haenasoft Company, Limited System for selectively importing web data by arbitrarily setting action design
CN111859076B (en) * 2020-07-31 2024-04-02 平安健康保险股份有限公司 Data crawling method, device, computer equipment and computer readable storage medium
CN111859076A (en) * 2020-07-31 2020-10-30 平安健康保险股份有限公司 Data crawling method and device, computer equipment and computer readable storage medium
CN113076459A (en) * 2021-04-27 2021-07-06 无锡星凝互动科技有限公司 Neural network building method and system based on AI consultation
CN113312343A (en) * 2021-06-11 2021-08-27 北京思特奇信息技术股份有限公司 Business opportunity management method and system based on web crawler tool
CN113987569A (en) * 2021-10-14 2022-01-28 武汉联影医疗科技有限公司 Anti-crawler method and device, computer equipment and storage medium
CN114791978A (en) * 2022-04-19 2022-07-26 中国电信股份有限公司 News recommendation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN105279272A (en) Content aggregation method based on distributed web crawlers
CN107273409B (en) Network data acquisition, storage and processing method and system
US11620300B2 (en) Real-time measurement and system monitoring based on generated dependency graph models of system components
US10956146B2 (en) Content deployment system having a content publishing module for selectively extracting content items for integration into a specific release and methods for implementing the same
US20200104402A1 (en) System Monitoring Driven By Automatically Determined Operational Parameters Of Dependency Graph Model With User Interface
US9449271B2 (en) Classifying resources using a deep network
CN109902220B (en) Webpage information acquisition method, device and computer readable storage medium
US10037538B2 (en) Selection and presentation of news stories identifying external content to social networking system users
US8756593B2 (en) Map generator for representing interrelationships between app features forged by dynamic pointers
CN108369709A (en) Network-based ad data service delay reduces
CN109241474B (en) Method for providing, displaying and releasing page information, server and client
CN100565518C (en) A kind of method and system that keep page current data information
KR20080028574A (en) Integrated search service system and method
CN101458690A (en) Advertisement publishing method and advertisement server
CN102473190A (en) Keyword assignment to a web page
US10691664B1 (en) User interface structural clustering and analysis
US20150089415A1 (en) Method of processing big data, apparatus performing the same and storage media storing the same
US20220237220A1 (en) Template generation using directed acyclic word graphs
US20170244741A1 (en) Malware Identification Using Qualitative Data
CN107563715A (en) Foreign trade set-off marketing system and method
US11392589B2 (en) Multi-vertical entity-based search system
US20200293160A1 (en) System for superimposed communication by object oriented resource manipulation on a data network
CN105117434A (en) Webpage classification method and webpage classification system
Clarkson et al. Where’s@ Waldo?: finding users on Twitter
JP2017091376A (en) Advertisement system and advertisement delivery method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160127