CN104598536B - A kind of distributed network information structuring processing method - Google Patents

A kind of distributed network information structuring processing method

Info

Publication number
CN104598536B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410840847.0A
Other languages
Chinese (zh)
Other versions
CN104598536A (en)
Inventor
常鹏飞
伍赛
陈珂
寿黎但
陈刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201410840847.0A
Publication of CN104598536A
Application granted
Publication of CN104598536B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed method for the structured processing of network information. A network information acquisition task is configured, and the web pages the user is interested in are saved by category as target web pages. The network information is then acquired: multiple map/reduce processes cooperate to crawl web pages and perform structuring processing, and the results are stored in the HDFS file system. The processed web pages are clustered by structure using tree edit distance. Structured extraction is performed on the clustered web page information, which is saved to a database. The invention adopts a distributed architecture, using the computing and storage capacity of a cluster of inexpensive computers to process very large volumes of network data; it classifies web pages effectively; and it extracts and stores network information in structured form, which facilitates further analysis and processing of that information.

Description

A kind of distributed network information structuring processing method
Technical field
The present invention relates to a web page information processing method in the field of network information acquisition, and more particularly to a distributed method for the structured acquisition and processing of network information.
Background art
A distributed system organizes a cluster of inexpensive computers effectively so that it can perform large-scale data computation and storage.
A distributed system differs from a single-machine system. Because computation and storage are spread across a computer cluster, the computing power of a single node must be balanced against the cost of communication between nodes, and the effects of node failure on system efficiency and on data recoverability must also be considered. Hadoop, with its HDFS distributed file system, is an open-source distributed computing and storage system designed and developed on the basis of the Map/Reduce computation model proposed by Google. Because it solves these distributed-system problems effectively and its framework is simple and general, it has been widely applied in many fields.
Structural clustering is one kind of clustering method. Unlike methods that cluster by content, structural clustering groups items according to their structure, which requires a different similarity measure. Tree edit distance is a method for measuring the similarity of trees: converting one tree into another means performing a series of node insertions, deletions, and substitutions, each of which incurs a certain cost. If the structural difference between two trees is large, the total cost of the operations is high; a low cost indicates a small structural difference. The edit distance between two trees is therefore the minimum cost required to convert one into the other.
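The definition above can be illustrated with a minimal sketch. It assumes ordered, labelled trees and a cost of 1 for every insertion, deletion, and substitution; the patent does not specify the cost model or the algorithm, so the recurrence below (the standard rightmost-root decomposition for ordered forests) and the name tree_edit_distance are illustrative choices only.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Nested tuples are hashable, so results can be cached.

def tree_edit_distance(tree_a, tree_b):
    """Unit-cost edit distance between two ordered, labelled trees (assumed cost model)."""

    @lru_cache(maxsize=None)
    def size(forest):
        # Number of nodes in a forest.
        return sum(1 + size(children) for _, children in forest)

    @lru_cache(maxsize=None)
    def dist(f, g):
        if not f:
            return size(g)  # insert every remaining node of g
        if not g:
            return size(f)  # delete every remaining node of f
        (label_f, kids_f), (label_g, kids_g) = f[-1], g[-1]
        return min(
            # Delete the rightmost root of f; its children take its place.
            dist(f[:-1] + kids_f, g) + 1,
            # Insert the rightmost root of g.
            dist(f, g[:-1] + kids_g) + 1,
            # Match the two rightmost roots, substituting the label if it differs.
            dist(f[:-1], g[:-1]) + dist(kids_f, kids_g) + (label_f != label_g),
        )

    return dist((tree_a,), (tree_b,))


page = ("html", (("head", ()), ("body", (("div", ()),))))
target = ("html", (("body", (("div", ()), ("p", ()))),))
print(tree_edit_distance(page, target))  # 2: delete "head", insert "p"
```

This direct recurrence is easy to check by hand but is not tuned for large pages; a production system would more likely use an optimized algorithm for the same distance.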
Network information acquisition, commonly known as web crawling, is widely used by Internet search engines and similar websites to obtain or update their content and retrieval indexes. A crawler can automatically collect all the page content it is able to access for further processing by a search engine or similar system.
Existing web crawlers store the information they fetch in the storage system in the form of raw web pages. This has the following shortcomings: first, storing raw web pages requires a large amount of storage space; second, the stored information contains a large amount of irrelevant material, such as advertisements; third, a web page is a semi-structured way of holding information, and compared with structured storage, semi-structured storage hinders further use of the information.
Summary of the invention
The object of the present invention is to address the shortcomings of existing network information acquisition techniques by providing a distributed method for the structured acquisition and processing of network information.
The technical solution adopted by the present invention comprises the following steps:
1) Configure the network information acquisition task: save the web pages the user is interested in by category, to serve as target web pages;
2) Acquire the network information: multiple map/reduce processes cooperate to crawl web pages and perform structuring processing, and the results are stored in the HDFS file system;
3) Cluster the processed web pages by structure, using tree edit distance;
4) Perform structured extraction on the clustered web page information and save it to a database.
Step 2) specifically includes:
2.1) Obtain the seed URL file and save the set of seed URLs into the to-be-crawled folder of the HDFS file system, which holds the URLs to be crawled; set the initial sequence number to 1;
2.2) Check whether the to-be-crawled folder is empty; if so, jump to step 2.8); otherwise, continue with step 2.3);
2.3) Using a map/reduce process, fetch the web page corresponding to each URL in the to-be-crawled folder of the HDFS file system and save it in a web page storage folder in the HDFS file system; the web page storage folder holds the pages that have been crawled;
2.4) Using another map/reduce process, parse the pages in the web page storage folder and extract new URLs from them, and save the new URLs in a temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;
2.5) Using a map/reduce process, clean up the temporary folder: filter the URLs in it and remove duplicates, then update the to-be-crawled folder of the HDFS file system with the result;
2.6) Increment the sequence number by 1;
2.7) Check the sequence number: if the current sequence number is greater than the crawl depth value Depth, proceed to step 2.8); otherwise, jump to step 2.2);
2.8) Using one map/reduce process, merge the multiple web page storage folders produced by the steps above into a single web page storage folder and remove duplicate pages.
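The control flow of step 2) can be sketched on a single machine as follows. This is an illustration under stated assumptions, not the patented implementation: in-memory sets and dictionaries stand in for the HDFS folders, each loop iteration stands in for one round of map/reduce jobs, and the names crawl, fetch, and extract_links are hypothetical. Treating "filtering" as the removal of URLs already crawled is also an assumption.

```python
def crawl(seed_urls, fetch, extract_links, depth):
    """Breadth-first crawl mirroring the loop of step 2) with in-memory sets.

    fetch(url) -> page and extract_links(page) -> iterable of URLs are
    caller-supplied, hypothetical helpers.
    """
    to_crawl = set(seed_urls)   # 2.1: the to-be-crawled folder
    seen = set(seed_urls)       # URLs already scheduled (assumed filter rule)
    rounds = []                 # one page store per crawl round
    sequence = 1                # 2.1: initial sequence number

    # 2.2 (empty check) and the depth check on the sequence number.
    while to_crawl and sequence <= depth:
        pages = {url: fetch(url) for url in sorted(to_crawl)}  # 2.3: fetch pages
        rounds.append(pages)
        found = set()                                          # 2.4: temporary folder
        for page in pages.values():
            found.update(extract_links(page))
        to_crawl = found - seen                                # 2.5: filter and deduplicate
        seen |= to_crawl
        sequence += 1                                          # 2.6

    # Final merge step: combine the per-round stores, keeping one copy per URL.
    merged = {}
    for pages in rounds:
        for url, page in pages.items():
            merged.setdefault(url, page)
    return merged


# Toy link graph: each "page" is just its own URL, and web[] lists its links.
web = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["e"], "e": []}
result = crawl(["a"], fetch=lambda url: url, extract_links=lambda page: web[page], depth=2)
print(sorted(result))  # ['a', 'b', 'c']
```

With depth=2 the crawl stops after two rounds, so page "d", which is only discovered in the second round, is never fetched.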
Step 3) performs the clustering with a map/reduce process, as follows:
3.1) In the map stage, for each web page in the merged web page storage folder obtained in step 2), the tree edit distance method is used to compute the tree edit distance DIS_Ci between the page's tag tree TREE_x and the tag tree TREE_Ci of each target web page C_i, giving the set of tree edit distances {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn}, and key-value pairs <C_i, WEB> are generated; the minimum tree edit distance DIS_Cmin is then chosen from the set, and the corresponding key-value pair <C_min, WEB> is passed to the reduce stage;
3.2) In the reduce stage, web pages whose key-value pairs <C_i, WEB> share the same key are merged into one file DOC_Ci as pages of the same class, which is stored in the result folder of the HDFS file system; each file DOC_Ci holds web pages with the same page structure. This yields the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} and completes the structural clustering of the web pages.
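The map and reduce stages of step 3) can be sketched in plain Python as below. The function names map_page and reduce_pages are hypothetical, the stages run in one process rather than on Hadoop, and the distance function is passed in as a parameter. The toy_distance used in the demonstration is only a placeholder that compares flat tag lists; it is not a tree edit distance and is there solely to keep the example short.

```python
from collections import defaultdict


def map_page(page_id, page_tree, targets, distance):
    """Map stage: emit the pair (C_min, page) for the nearest target page."""
    nearest = min(targets, key=lambda name: distance(page_tree, targets[name]))
    return nearest, page_id


def reduce_pages(pairs):
    """Reduce stage: collect pages that share a key into one group DOC_Ci."""
    groups = defaultdict(list)
    for key, page_id in pairs:
        groups[key].append(page_id)
    return dict(groups)


def toy_distance(a, b):
    # Placeholder only: counts tags that appear in one page but not the other.
    return len(set(a) ^ set(b))


targets = {
    "C1": ["html", "body", "h1", "p"],
    "C2": ["html", "body", "table", "tr", "td"],
}
pages = {
    "news-1": ["html", "body", "h1", "p", "span"],
    "specs-1": ["html", "body", "table", "tr"],
}
pairs = [map_page(pid, tree, targets, toy_distance) for pid, tree in pages.items()]
print(reduce_pages(pairs))  # {'C1': ['news-1'], 'C2': ['specs-1']}
```

Because each page is compared only against the n target pages, the map stage needs no communication between pages, which is what makes the clustering straightforward to distribute.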
Step 4), based on the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3), performs extraction on each class of web page: the information in the pages is extracted and saved to the database.
The nodes of the corresponding tag tree in a web page are extracted into the corresponding fields of the database.
Web pages of different classes use different extraction modes.
Step 1) configures the network information acquisition task through an interactive interface. Unlike a search-engine crawler, the main purpose of the present invention is to monitor specific information sources on the network. The web pages the user is interested in, which serve as information sources, can be divided by content type into text, picture, video, and so on, and by content attribute into news, advertising, and so on. The update frequency of different sources also varies. Configuring the acquisition task determines the information about each source to be collected, so that different acquisition methods can be applied to different types of source.
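One way to picture such a task configuration is the sketch below. The patent does not define a configuration format, so the SourceConfig record, its field names, the example.com URLs, and the due_sources helper are all hypothetical; they only show how content type, content attribute, and update frequency might be recorded per source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceConfig:
    """Hypothetical record describing one information source to monitor."""
    name: str
    seed_url: str
    content_type: str   # e.g. "text", "picture", "video"
    attribute: str      # e.g. "news", "advertisement"
    refresh_hours: int  # how often the source should be crawled again


def due_sources(sources, hours_since_last_crawl):
    """Return the names of sources whose refresh interval has elapsed."""
    return [s.name for s in sources if hours_since_last_crawl >= s.refresh_hours]


sources = [
    SourceConfig("news", "http://example.com/news", "text", "news", 1),
    SourceConfig("gallery", "http://example.com/pics", "picture", "advertisement", 24),
]
print(due_sources(sources, hours_since_last_crawl=6))  # ['news']
```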
Step 2) performs a distributed crawl of the acquisition task defined in step 1), based on Hadoop distributed processing and the HDFS distributed file system.
Step 3) performs structural clustering of the web pages. The pages crawled in step 2) are still stored in HDFS as semi-structured documents and cannot be extracted into a database directly. Step 3) clusters the pages by structure using tree edit distance. Web pages are written in HTML, and a single HTML file can be abstracted as a tag tree; pages carrying the same kind of information have similar or even identical tag trees. To measure the similarity of tag trees, the present invention uses the tree edit distance method to classify the pages crawled in step 2).
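The abstraction of an HTML file into a tag tree can be sketched with Python's standard html.parser module. The class name TagTreeBuilder, the synthetic "#document" root, and the choice to ignore text and attributes are assumptions made for illustration; the patent does not say how the tag tree is built.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Builds a tree of tag names, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]   # synthetic root node
        self.stack = [self.root]        # currently open elements

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def freeze(node):
    """Convert the mutable [tag, children] lists into nested tuples."""
    return (node[0], tuple(freeze(child) for child in node[1]))


def tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return freeze(builder.root)


tree = tag_tree("<html><body><h1>Title</h1><p>one<br>two</p></body></html>")
body = tree[1][0][1][0]
print([child[0] for child in body[1]])  # ['h1', 'p']
```

Two pages built from the same template produce the same nested tuple even when their text differs, which is the property that structural clustering relies on.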
Step 4) performs structured extraction of the web page information: based on the structural clustering result of step 3) and the extraction mode defined for each class of page in step 1), the information in the pages is extracted and saved to the database.
The invention has the following advantages:
The present invention adopts a distributed architecture and uses the computing and storage capacity of a cluster of inexpensive computers to process very large volumes of network data; it uses tree-edit-distance-based structural clustering to classify web pages effectively; and it extracts and stores network information in structured form, which facilitates further analysis and processing of that information.
Brief description of the drawings
Fig. 1 is a flow chart of the implementation steps of the present invention.
Fig. 2 shows the web page tag tree of step 3.1) of the present invention.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawings and an embodiment.
As shown in Fig. 1, the present invention comprises the following steps:
1) Configure the network information acquisition task: save the web pages the user is interested in by category, to serve as target web pages. This yields the target web page set {C_1, C_2, C_3, ..., C_n} that the later steps use for clustering. When target pages are saved by category, different information types within the same website can be separated; for example, the information on one website may be divided into news, product data, pictures, and so on.
2) Acquire the network information: multiple map/reduce processes cooperate to crawl web pages and perform structuring processing, and the results are stored in the HDFS file system.
Step 2) performs a distributed crawl of the acquisition task defined in step 1), based on Hadoop distributed processing and the HDFS distributed file system.
2.1) Obtain the seed URL file and save the set of seed URLs into the to-be-crawled folder of the HDFS file system, which holds the URLs to be crawled; set the initial sequence number to 1;
2.2) Check whether the to-be-crawled folder is empty; if so, jump to step 2.8); otherwise, continue with step 2.3);
2.3) Using a map/reduce process, fetch the web page corresponding to each URL in the to-be-crawled folder of the HDFS file system and save it in a web page storage folder in the HDFS file system; the web page storage folder holds the pages that have been crawled;
2.4) Using another map/reduce process, parse the pages in the web page storage folder and extract new URLs from them, and save the new URLs in a temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;
2.5) Using a map/reduce process, clean up the temporary folder: filter the URLs in it and remove duplicates, then update the to-be-crawled folder of the HDFS file system with the result;
2.6) Increment the sequence number by 1;
2.7) Check the sequence number: if the current sequence number is greater than the crawl depth value Depth, proceed to step 2.8); otherwise, jump to step 2.2), so that steps 2.2) to 2.6) repeat until the current sequence number exceeds the crawl depth value Depth;
2.8) Using one map/reduce process, merge the multiple web page storage folders produced by the steps above into a single web page storage folder according to the hash values of the web pages, and remove duplicate pages. The merged web page storage folder can be stored in the web page folder of the HDFS file system, which holds the crawled pages.
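Merging by web page hash value can be sketched as follows. The patent does not name the hash function or say what is hashed, so the use of SHA-256 over the page body, the in-memory dictionaries standing in for the storage folders, and the name merge_page_stores are all assumptions.

```python
import hashlib


def merge_page_stores(stores):
    """Merge per-round page stores, keeping one copy of each distinct page body.

    Each store maps URL -> page body (str). Pages are keyed by the SHA-256
    digest of their body (an assumed choice of hash), so identical pages
    reached through different URLs are stored only once.
    """
    merged = {}
    for store in stores:
        for url, body in store.items():
            digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
            merged.setdefault(digest, (url, body))  # first copy wins
    return merged


round_1 = {"http://example.com/a": "<html>A</html>"}
round_2 = {
    "http://example.com/a?ref=home": "<html>A</html>",  # same body, different URL
    "http://example.com/b": "<html>B</html>",
}
merged = merge_page_stores([round_1, round_2])
print(len(merged))  # 2
```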
3) Cluster the processed web pages by structure, using tree edit distance.
Step 3) performs the clustering with a map/reduce process, as follows:
3.1) In the map stage, each web page in the merged web page storage folder obtained in step 2) and each target web page from step 1) (for example, one page from the news section of a website) is abstracted into a tag tree, as shown in Fig. 2. The tree edit distance method is then used to compute the tree edit distance DIS_Ci between each page's tag tree TREE_x and the tag tree TREE_Ci of each target web page C_i, giving the set of tree edit distances {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn}, and key-value pairs <C_i, WEB> are generated; the minimum tree edit distance DIS_Cmin is then chosen from the set, and the corresponding key-value pair <C_min, WEB> is passed to the reduce stage;
3.2) In the reduce stage, web pages whose key-value pairs <C_i, WEB> share the same key are merged into one file DOC_Ci as pages of the same class, which is stored in the result folder of the HDFS file system; the result folder holds the files DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn, and each file DOC_Ci holds web pages with the same page structure. This yields the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} and completes the structural clustering of the web pages.
4) Perform structured extraction on the clustered web page information and save it to a database.
Step 4), based on the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3), performs extraction on each class of web page, extracting the information in the pages and saving it to the database; extraction may be done by extracting the nodes of the corresponding tag tree in a page into the corresponding fields of the database.
Web pages of different classes use different extraction modes. An extraction mode {R_1, R_2, R_3, ..., R_n} can be defined for each class of page, according to which the information in the pages is extracted and saved to the database.
An embodiment of the present invention is as follows:
For a certain electric vehicle website, the user wants to obtain its news pages and its vehicle model parameter pages. For each of these two classes, one typical page is taken as the target web page, forming the target web page set {C_1, C_2}.
Step 2) performs a distributed crawl of the electric vehicle website to obtain its web page data. The time this takes depends on the size of the website to be crawled and on the size of the cluster performing the crawl; for a ten-node cluster, the crawl speed at full load can reach 100,000 pages per hour.
Step 3) clusters the pages obtained from the electric vehicle website into two classes, news and vehicle model parameters; the clustering accuracy can exceed 95%.
Step 4) performs structured extraction on these two classes of page. For news pages, the title, body text, publication date, source, and similar information are extracted and saved to the database; for vehicle model parameter pages, the parameters of the different models are extracted and saved to the database.
An extraction mode R_i has a form in which PubTitle, Content, PublicationDate, and DCSource are fields in the database, and each field is followed by an XPath path that defines the position of that field in the web page. For each class file DOC_Ci, the corresponding extraction mode R_i is used to extract the corresponding data in the web pages into the corresponding fields of the database.
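A field-to-XPath extraction mode of this kind can be sketched as below. The field names come from the text above, but every XPath expression, the sample page, and the name extract_record are hypothetical, since the concrete form of R_i is not reproduced in this text. The sketch uses the limited XPath subset of Python's xml.etree.ElementTree and therefore assumes a well-formed XHTML page; real pages would usually need a lenient HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical extraction mode for news pages: database field -> XPath.
NEWS_RULE = {
    "PubTitle": ".//h1[@class='title']",
    "Content": ".//div[@class='article']",
    "PublicationDate": ".//span[@class='date']",
    "DCSource": ".//span[@class='source']",
}


def extract_record(xhtml, rule):
    """Apply one extraction mode to one well-formed page and return a field dict."""
    root = ET.fromstring(xhtml)
    record = {}
    for field, path in rule.items():
        node = root.find(path)
        # Join all text under the node; missing nodes become None.
        record[field] = "".join(node.itertext()).strip() if node is not None else None
    return record


page = """<html><body>
<h1 class="title">New model announced</h1>
<span class="date">2014-12-29</span>
<span class="source">Example Wire</span>
<div class="article"><p>First paragraph.</p></div>
</body></html>"""

print(extract_record(page, NEWS_RULE))
```

The returned dictionary maps directly onto one database row, so a second rule with different fields and paths would cover the vehicle model parameter class without changing extract_record.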
As can be seen, throughout the crawl the user only needs to provide a typical page for each of the two classes of interest, and the method extracts the corresponding types of information from the target website in structured form and saves it to the database. The whole process maintains high accuracy (above 95% in this embodiment) while providing fast information processing.
Thus, the present invention is based on the HDFS file system and, in view of the semi-structured nature of network information, proposes clustering web pages by page structure using tree edit distance; on the basis of the clustering result, information is extracted from each class of page according to its extraction mode and saved to the database, thereby achieving structured acquisition of web page information with a significant technical effect.

Claims (4)

1. A distributed network information structuring processing method, characterized in that the method comprises the following steps:
1) Configure the network information acquisition task: save the web pages the user is interested in by category, to serve as target web pages;
2) Acquire the network information: multiple map/reduce processes cooperate to crawl web pages and perform structuring processing, and the results are stored in the HDFS file system;
3) Cluster the processed web pages by structure, using tree edit distance;
4) Perform structured extraction on the clustered web page information and save it to a database;
Said step 2) specifically includes:
2.1) Obtain the seed URL file and save the set of seed URLs into the to-be-crawled folder of the HDFS file system, which holds the URLs to be crawled; set the initial sequence number to 1;
2.2) Check whether the to-be-crawled folder is empty; if so, jump to step 2.8); otherwise, continue with step 2.3);
2.3) Using a map/reduce process, fetch the web page corresponding to each URL in the to-be-crawled folder of the HDFS file system and save it in a web page storage folder in the HDFS file system; the web page storage folder holds the pages that have been crawled;
2.4) Using another map/reduce process, parse the pages in the web page storage folder and extract new URLs from them, and save the new URLs in a temporary folder of the HDFS file system; the temporary folder holds the parsed URLs;
2.5) Using a map/reduce process, clean up the temporary folder: filter the URLs in it and remove duplicates, then update the to-be-crawled folder of the HDFS file system with the result;
2.6) Increment the sequence number by 1;
2.7) Check the sequence number: if the current sequence number is greater than the crawl depth value Depth, proceed to step 2.8); otherwise, jump to step 2.2);
2.8) Using one map/reduce process, merge the multiple web page storage folders produced by the steps above into a single web page storage folder and remove duplicate pages;
Said step 3) performs the clustering with a map/reduce process, as follows:
3.1) In the map stage, for each web page in the merged web page storage folder obtained in step 2), the tree edit distance method is used to compute the tree edit distance DIS_Ci between the page's tag tree TREE_x and the tag tree TREE_Ci of each target web page C_i, giving the set of tree edit distances {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn}, and key-value pairs <C_i, WEB> are generated; the minimum tree edit distance DIS_Cmin is then chosen from the set, and the corresponding key-value pair <C_min, WEB> is passed to the reduce stage;
3.2) In the reduce stage, web pages whose key-value pairs <C_i, WEB> share the same key are merged into one file DOC_Ci as pages of the same class, which is stored in the result folder of the HDFS file system; each file DOC_Ci holds web pages with the same page structure, yielding the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} and completing the structural clustering of the web pages.
2. The distributed network information structuring processing method according to claim 1, characterized in that: said step 4), based on the structural clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3), performs extraction on each class of web page, extracting the information in the pages and saving it to the database.
3. The distributed network information structuring processing method according to claim 2, characterized in that: the nodes of the corresponding tag tree in the web page are extracted into the corresponding fields of the database.
4. The distributed network information structuring processing method according to claim 2, characterized in that: web pages of different classes use different extraction modes.
CN201410840847.0A 2014-12-29 2014-12-29 A kind of distributed network information structuring processing method Active CN104598536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410840847.0A CN104598536B (en) 2014-12-29 2014-12-29 A kind of distributed network information structuring processing method

Publications (2)

Publication Number Publication Date
CN104598536A CN104598536A (en) 2015-05-06
CN104598536B true CN104598536B (en) 2017-10-20

Family

ID=53124321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410840847.0A Active CN104598536B (en) 2014-12-29 2014-12-29 A kind of distributed network information structuring processing method

Country Status (1)

Country Link
CN (1) CN104598536B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106815196B (en) * 2015-11-27 2020-07-31 北京国双科技有限公司 Soft text display frequency statistical method and device
CN106293465A (en) * 2016-08-09 2017-01-04 Tcl移动通信科技(宁波)有限公司 The Web page management method of a kind of mobile terminal and system
CN107451224A (en) * 2017-07-17 2017-12-08 广州特道信息科技有限公司 A kind of clustering method and system based on big data parallel computation
CN109829094A (en) * 2019-01-23 2019-05-31 钟祥博谦信息科技有限公司 Distributed reptile system
CN112115164B (en) * 2019-06-19 2024-09-03 北京金山云网络技术有限公司 Data processing method and device, data query method and device and network equipment
CN111177301B (en) * 2019-11-26 2023-05-26 云南电网有限责任公司昆明供电局 Method and system for identifying and extracting key information
CN113220943B (en) * 2021-06-04 2022-09-30 上海天旦网络科技发展有限公司 Target information positioning method and system in semi-structured flow data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101667201A (en) * 2009-09-18 2010-03-10 浙江大学 Integration method of Deep Web query interface based on tree merging
CN103257975A (en) * 2012-02-21 2013-08-21 腾讯科技(深圳)有限公司 Search method, search device and search system
CN103534700A (en) * 2011-05-20 2014-01-22 惠普发展公司,有限责任合伙企业 System and method for configuration policy extraction
CN104217020A (en) * 2014-09-25 2014-12-17 浪潮(北京)电子信息产业有限公司 Webpage clustering method and system based on MapReduce framework

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9323775B2 (en) * 2010-06-19 2016-04-26 Mapr Technologies, Inc. Map-reduce ready distributed file system
US8473486B2 (en) * 2010-12-08 2013-06-25 Microsoft Corporation Training parsers to approximately optimize NDCG
US8886679B2 (en) * 2011-10-11 2014-11-11 Hewlett-Packard Development Company, L.P. Mining web applications

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
基于树编辑距离的聚类算法数据记录抽取;宫丽娜 等;《赤峰学院学报(自然科学版)》;20130630;第29卷(第6期);第28-30页 *
树编辑距离在Web信息抽取中的应用与实现;聂卉 等;《现代图书情报技术》;20101231(第5期);第29-34页 *

Also Published As

Publication number Publication date
CN104598536A (en) 2015-05-06

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant