CN104598536A - Structured processing method of distributed network information - Google Patents
- Publication number
- CN104598536A (application CN201410840847.0A)
- Authority
- CN
- China
- Prior art keywords
- webpage
- network information
- file
- doc
- tree
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Information Transfer Between Computers (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a structured processing method for distributed network information. A network information collection task is configured, and the web pages the user is interested in are classified and saved as target web pages. Network information is then collected: several map/reduce processes cooperate to crawl web pages and process them structurally, and the results are saved in the HDFS file system. The processed web pages are clustered by structure using tree edit distance, and the information in the clustered web pages is extracted in structured form and saved to a database. The invention adopts a distributed architecture that uses the computing and storage capacity of inexpensive computer clusters to process very large volumes of network data, classifies web pages effectively, and extracts and stores network information in structured form, which facilitates further analysis and processing of that information.
Description
Technical field
The invention relates to a network information processing method in the field of network information collection, and in particular to a structured collection and processing method for distributed network information.
Background art
A distributed system organizes inexpensive computing clusters efficiently in order to perform large-scale data computation and storage.
Unlike a stand-alone system, a distributed system that uses a computer cluster for computation and storage must balance single-node computing power against the cost of communication between nodes, and must also deal with system availability and data recoverability when nodes in the cluster fail. Hadoop distributed processing and the HDFS distributed file system form an open-source distributed computing and storage system designed around the Map/Reduce computing model proposed by Google. Because it addresses these problems effectively and its architecture is simple and general, it has been widely applied in many fields.
Structured clustering is one kind of clustering method. Unlike content-based clustering, it clusters by structure, which requires a different similarity measure. Tree edit distance is a measure of the similarity of tree structures: converting one tree into another means applying a series of node insertions, deletions and replacements, each of which has a cost. A large structural difference between two trees means a high total cost, and a low cost indicates a small structural difference. The edit distance between two trees is therefore the minimum cost required to transform one into the other.
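The definition above can be illustrated with a minimal sketch. It assumes ordered, labelled trees and a unit cost for every insertion, deletion and replacement; the patent text does not fix the cost model or the algorithm, so the names `tree_size`, `forest_distance` and `tree_edit_distance` are hypothetical and the simple memoized recursion below is only one possible way to compute the distance.

```python
from functools import lru_cache

# A tree is (label, children), where children is a tuple of trees.
# A forest is a tuple of trees. Tuples are hashable, so results can be cached.


def tree_size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(tree_size(child) for child in tree[1])


@lru_cache(maxsize=None)
def forest_distance(f1, f2):
    """Minimum number of unit-cost edits turning forest f1 into forest f2."""
    if not f1 and not f2:
        return 0
    if not f1:                       # everything in f2 must be inserted
        return sum(tree_size(t) for t in f2)
    if not f2:                       # everything in f1 must be deleted
        return sum(tree_size(t) for t in f1)
    (label1, kids1), (label2, kids2) = f1[-1], f2[-1]
    # Delete the root of the rightmost tree of f1; its children join the forest.
    delete = forest_distance(f1[:-1] + kids1, f2) + 1
    # Insert the root of the rightmost tree of f2.
    insert = forest_distance(f1, f2[:-1] + kids2) + 1
    # Match the two rightmost roots (free if the labels agree, else one replacement).
    replace = (forest_distance(kids1, kids2)
               + forest_distance(f1[:-1], f2[:-1])
               + (0 if label1 == label2 else 1))
    return min(delete, insert, replace)


def tree_edit_distance(t1, t2):
    return forest_distance((t1,), (t2,))
```

With this cost model, two tag trees that differ by one missing leaf or one renamed tag are at distance 1, and identical trees are at distance 0. The recursion is exponential in the worst case without caching and is meant for small illustrative trees only.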
Network information collection, commonly called web crawling, is widely used by Internet search engines and similar websites to obtain or update their content and retrieval methods. A crawler automatically collects the content of every page it can access so that search engines and other systems can process it further.
Existing web crawlers store the information they capture in the form of the original web pages. This has three disadvantages. First, storing original web pages requires a large amount of storage space. Second, the stored information contains a great deal of irrelevant material, such as advertisements. Third, a web page is a semi-structured form of storage, which, compared with structured storage, hinders further use of the information.
Summary of the invention
The purpose of the invention is to address the deficiencies of existing network information collection technology by providing a structured collection and processing method for distributed network information.
The technical solution adopted by the invention comprises the following steps:
1) Configure the network information collection task, and classify and save the web pages the user is interested in as target web pages;
2) Collect network information: several map/reduce processes cooperate to crawl web pages and process them structurally, and the results are saved in the HDFS file system;
3) Cluster the structurally processed web pages by structure using tree edit distance;
4) Extract the information of the clustered web pages in structured form and save it to a database.
Step 2) specifically comprises:
2.1) Obtain the URL seed files and save the set of URL seed files to the to-be-crawled folder of the HDFS file system, which holds the URLs to be crawled, and set the initial layer number to 1;
2.2) Check whether the to-be-crawled folder is empty; if so, jump to the merge step 2.8); otherwise, proceed to step 2.3);
2.3) Through a map/reduce process, crawl the web pages corresponding to the URL seed files in the HDFS file system and save them in a web page storage folder in the HDFS file system, which holds the unprocessed web pages;
2.4) Through another map/reduce process, parse new URLs out of the crawled web pages in the web page storage folder and save them in a temporary folder of the HDFS file system, which holds the parsed URLs;
2.5) Through a map/reduce process, optimize the temporary folder by filtering its URLs and removing duplicates, then update the to-be-crawled folder in the HDFS file system with the result;
2.6) Increase the layer number by 1;
2.7) Check the layer number: if the current layer number is greater than the crawl depth value Depth, proceed to step 2.8); otherwise, jump to step 2.2);
2.8) Through a map/reduce process, merge the web page storage folders obtained in the steps above into a single web page storage folder and remove duplicate web pages.
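The control flow of the crawl loop in step 2) can be sketched in a single process as follows. This is not Hadoop code: the HDFS folders are stood in for by Python sets and dictionaries, and `crawl`, `fetch` and `extract_links` are hypothetical names. Skipping URLs that were already crawled in an earlier layer is an added assumption; the text only says that duplicate URLs are removed.

```python
def crawl(seed_urls, fetch, extract_links, depth):
    """Layer-by-layer crawl; returns the merged, de-duplicated pages by URL."""
    to_crawl = set(seed_urls)       # to-be-crawled folder, filled from the seeds
    layer = 1                       # initial layer number
    page_folders = []               # one web page storage folder per layer
    crawled = set()                 # assumption: remember what was fetched

    while to_crawl:                 # stop when the to-be-crawled folder is empty
        # Fetch every URL in the to-be-crawled folder (a map/reduce job in the patent).
        pages = {url: fetch(url) for url in sorted(to_crawl)}
        page_folders.append(pages)
        crawled |= to_crawl
        # Parse new URLs out of the fetched pages into a temporary collection.
        temp = [link for page in pages.values() for link in extract_links(page)]
        # Remove duplicates and refresh the to-be-crawled folder.
        to_crawl = set(temp) - crawled
        layer += 1                  # increase the layer number
        if layer > depth:           # depth check against Depth
            break

    # Merge all per-layer folders into one, keeping each URL once.
    merged = {}
    for folder in page_folders:
        for url, page in folder.items():
            merged.setdefault(url, page)
    return merged


# Hypothetical in-memory link graph standing in for a website.
SITE = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["e"], "e": []}
pages = crawl(["a"], fetch=lambda url: url,
              extract_links=lambda page: SITE.get(page, []), depth=2)
```

In a real deployment each commented stage would be its own map/reduce job reading from and writing to HDFS; the sketch only shows how the layer counter and the empty-folder check bound the loop.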
Step 3) performs the clustering through a map/reduce process, as follows:
3.1) In the map stage, for each web page in the merged web page storage folder obtained in step 2), use the tree edit distance method to compute the tree edit distance DIS_Ci between the page's tag tree TREE_x and the tag tree TREE_Ci of each target web page C_i. This gives the set of tree edit distances {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn} and the key-value pairs <C_i, WEB>. Then select the minimum tree edit distance DIS_Cmin from the set and pass the corresponding key-value pair <C_min, WEB> to the reduce stage;
3.2) In the reduce stage, use the key of each key-value pair <C_i, WEB> to merge the web pages that share a key into one file DOC_Ci as web pages of the same class, and save the file in the result folder of the HDFS file system. Each file DOC_Ci holds web pages with the same page structure. The result is the structured web page clustering {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn}, which completes the structured clustering of the web pages.
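The map and reduce stages of step 3) can be sketched as two plain functions. The names `map_page`, `reduce_pages` and `toy_distance` are hypothetical, and the distance is passed in as a parameter. The `toy_distance` used here is only a stand-in that compares sets of tag names; it is not a tree edit distance and is included solely so the sketch runs on its own.

```python
from collections import defaultdict


def map_page(page, targets, distance):
    """Map stage: emit <C_min, WEB> for the target page closest to this page.

    targets maps a class key such as "C1" to that target page's structure.
    """
    nearest = min(targets, key=lambda key: distance(page, targets[key]))
    return nearest, page


def reduce_pages(pairs):
    """Reduce stage: group pages that share a key into one cluster (DOC_Ci)."""
    clusters = defaultdict(list)
    for key, page in pairs:
        clusters[key].append(page)
    return dict(clusters)


def toy_distance(a, b):
    """Stand-in distance: number of tag names present in only one page."""
    return len(set(a) ^ set(b))


# Hypothetical target pages and crawled pages, each reduced to a tuple of tags.
TARGETS = {"C1": ("html", "body", "h1", "p"),
           "C2": ("html", "body", "table", "tr", "td")}
CRAWLED = [("html", "body", "h1", "p", "p"),
           ("html", "body", "table", "tr")]
clusters = reduce_pages(map_page(p, TARGETS, toy_distance) for p in CRAWLED)
```

Because each page is assigned independently of the others, the map stage parallelizes over pages, and the reduce stage only needs to collect pages by key.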
Step 4) uses the structured web page clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3) to extract each class of web pages: the information in the pages is extracted and saved to the database.
The nodes of the corresponding tag tree in a web page are extracted into the corresponding fields of the database.
Different classes of web pages use different extraction methods.
Step 1), the configuration of the network information collection task, is the interactive interface. Unlike a search engine crawler, the invention is mainly intended to monitor specific information sources on the network. The web pages the user is interested in, which serve as information sources, can be divided by content type into text, picture and video information, or by content attribute into news, advertisement and other information. Different information sources are also updated at different frequencies. Configuring the collection task determines the information about the sources to be collected, so that different collection methods can be applied to different types of source.
Step 2) uses Hadoop distributed processing and the HDFS distributed file system to carry out, in a distributed manner, the collection task defined in step 1).
Step 3) performs structured clustering of the web pages. The pages crawled in step 2) are still stored in HDFS as semi-structured files and cannot be extracted directly into a database. Step 3) therefore clusters the pages by structure using tree edit distance. Web pages are written in HTML, and a single HTML file can be abstracted as a tag tree; web pages carrying the same kind of information have tag trees with similar or even identical structure. To measure the similarity of tag trees, the invention uses the tree edit distance method to classify the pages crawled in step 2).
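The abstraction of an HTML file into a tag tree can be sketched with the standard-library `html.parser` module. The patent does not say how the tree is built, so this is one possible approach: text and attributes are ignored, only tag names and nesting are kept, and the names `TagTreeBuilder`, `freeze` and `tag_tree` as well as the synthetic `#document` root are assumptions.

```python
from html.parser import HTMLParser

# HTML void elements never have an end tag, so they must not stay open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagTreeBuilder(HTMLParser):
    """Collects the nesting of tags; text and attributes are discarded."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", []]      # synthetic root node
        self.stack = [self.root]           # currently open elements

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest open element with this name, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def freeze(node):
    """Convert the mutable [label, children] lists into hashable tuples."""
    return (node[0], tuple(freeze(child) for child in node[1]))


def tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return freeze(builder.root)
```

The resulting `(label, children)` tuples are a convenient input for any tree comparison, since two pages generated from the same template tend to produce trees that differ only in a few nodes.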
Step 4) extracts the web page information in structured form: based on the structured clustering result of step 3) and the extraction method defined for each class of web page in step 1), the information in the pages is extracted and saved to the database.
The beneficial effects of the invention are as follows:
The invention adopts a distributed architecture that uses the computing and storage capacity of inexpensive computer clusters to process very large volumes of network data; it uses structural clustering of web pages based on tree edit distance to classify web pages effectively; and it extracts and stores network information in structured form, which facilitates further analysis and processing of that information.
Description of drawings
Fig. 1 is a flowchart of the implementation steps of the invention.
Fig. 2 is the web page tag tree of step 3.1) of the invention.
Detailed description
The invention is further described below with reference to the drawings and an embodiment.
As shown in Fig. 1, the invention comprises the following steps:
1) Configure the network information collection task, and classify and save the web pages the user is interested in as target web pages. This yields the set of target web pages {C_1, C_2, C_3, ..., C_n} used for clustering in the later steps. When target web pages are classified and saved, the different information types of a single website can be classified separately; for example, the information in one website may be divided into news, product data, pictures and so on.
2) Collect network information: several map/reduce processes cooperate to crawl web pages and process them structurally, and the results are saved in the HDFS file system.
Step 2) uses Hadoop distributed processing and the HDFS distributed file system to carry out, in a distributed manner, the collection task defined in step 1).
2.1) Obtain the URL seed files and save the set of URL seed files to the to-be-crawled folder of the HDFS file system, which holds the URLs to be crawled, and set the initial layer number to 1;
2.2) Check whether the to-be-crawled folder is empty; if so, jump to the merge step 2.8); otherwise, proceed to step 2.3);
2.3) Through a map/reduce process, crawl the web pages corresponding to the URL seed files in the HDFS file system and save them in a web page storage folder in the HDFS file system, which holds the unprocessed web pages;
2.4) Through another map/reduce process, parse new URLs out of the crawled web pages in the web page storage folder and save them in a temporary folder of the HDFS file system, which holds the parsed URLs;
2.5) Through a map/reduce process, optimize the temporary folder by filtering its URLs and removing duplicates, then update the to-be-crawled folder in the HDFS file system with the result;
2.6) Increase the layer number by 1;
2.7) Check the layer number: if the current layer number is greater than the crawl depth value Depth, proceed to step 2.8); otherwise, jump to step 2.2), repeating steps 2.2) to 2.6) until the current layer number is greater than Depth;
2.8) Through a map/reduce process, merge the web page storage folders obtained in the steps above into a single web page storage folder according to the hash values of the web pages, removing duplicate pages. The merged web page storage folder can be saved in the web page folder of the HDFS file system, which holds the crawled web pages.
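Merging the per-layer folders by page hash can be sketched as below. The text does not name a hash function, so SHA-256 over the UTF-8 page content is an assumption, as are the names `page_hash` and `merge_page_folders`. In map/reduce terms the hash would be the key emitted by the map stage, and the reduce stage would keep one page per key.

```python
import hashlib


def page_hash(html):
    """Content hash used as the de-duplication key (SHA-256 is an assumption)."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def merge_page_folders(folders):
    """Merge several {url: html} folders, keeping one page per content hash."""
    merged = {}
    for folder in folders:
        for url, html in folder.items():
            # The first page seen with a given hash wins; later copies are dropped.
            merged.setdefault(page_hash(html), (url, html))
    return list(merged.values())
```

Hashing the content rather than the URL means that the same page reached through two different URLs is stored once, which is what removing duplicate web pages requires.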
3) Cluster the structurally processed web pages by structure using tree edit distance.
Step 3) performs the clustering through a map/reduce process, as follows:
3.1) In the map stage, each web page in the merged web page storage folder obtained in step 2) is compared with the target web pages from step 1), that is, the pages the user is interested in, for example one page taken from the news pages of a website. The map stage converts both the target web pages and the crawled web pages into tag trees, as shown in Fig. 2. Using the tree edit distance method, it computes the tree edit distance DIS_Ci between the tag tree TREE_x of each web page and the tag tree TREE_Ci of each target web page C_i. This gives the set of tree edit distances {DIS_C1, DIS_C2, DIS_C3, ..., DIS_Cn} and the key-value pairs <C_i, WEB>. The minimum tree edit distance DIS_Cmin is then selected from the set, and the corresponding key-value pair <C_min, WEB> is passed to the reduce stage;
3.2) In the reduce stage, the key of each key-value pair <C_i, WEB> is used to merge the web pages that share a key into one file DOC_Ci as web pages of the same class, and the file is saved in the result folder of the HDFS file system. The result folder holds the files DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn, and each file DOC_Ci holds web pages with the same page structure. The result is the structured web page clustering {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn}, which completes the structured clustering of the web pages.
4) Extract the information of the clustered web pages in structured form and save it to a database.
Step 4) uses the structured web page clustering result {DOC_C1, DOC_C2, DOC_C3, ..., DOC_Cn} obtained in step 3) to extract each class of web pages: the information in the pages is extracted and saved to the database. The extraction can work by extracting the nodes of the corresponding tag tree in a web page into the corresponding fields of the database.
Different classes of web pages use different extraction methods. An extraction method can be defined for each class, giving {R_1, R_2, R_3, ..., R_n}, and used to extract the information in the pages and save it to the database.
An implementation example of the invention is as follows:
For an electric vehicle website, the user wants to obtain the news pages and the vehicle model parameter pages. For each of these two classes, one typical page is taken as the target web page, forming the target web page set {C_1, C_2}.
In step 2), the electric vehicle website is crawled in a distributed manner to obtain its web page data. The time this takes depends on the size of the website and the size of the cluster performing the crawl; a ten-node cluster at full load can reach a crawl rate of 100,000 items per hour.
In step 3), the web pages obtained from the website are clustered into two classes, news pages and vehicle model parameter pages. The clustering accuracy in this process can exceed 95%.
In step 4), structured extraction is applied to both classes. For news pages, the title, body text, publication date, source and similar information are extracted and saved to the database; for vehicle model parameter pages, the parameters of the different models are extracted and saved to the database.
In one form of the extraction method R_i, PubTitle, Content, PublicationDate and DCSource are fields in the database, and each field is followed by a web page XPath path that defines where that field is located in the page. For each web page class file DOC_Ci, the corresponding extraction method R_i is used to extract the matching data from the pages into the corresponding database fields.
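An extraction method of this field-to-XPath form might look like the sketch below. The field names PubTitle, Content, PublicationDate and DCSource come from the text; the XPath values, the sample page layout, and the names `RULE_NEWS` and `extract_record` are hypothetical, since the concrete paths are not given here. The sketch uses the limited XPath subset of the standard-library `xml.etree.ElementTree` and therefore assumes the page is well-formed XML, which real HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule for news pages: database field -> XPath of its node.
RULE_NEWS = {
    "PubTitle": ".//h1",
    "Content": ".//div[@class='content']",
    "PublicationDate": ".//span[@class='date']",
    "DCSource": ".//span[@class='source']",
}


def extract_record(page_xml, rule):
    """Apply one extraction rule to one page and return a {field: text} record."""
    root = ET.fromstring(page_xml)
    record = {}
    for field, path in rule.items():
        node = root.find(path)
        # Join all text inside the node; a missing node yields None.
        record[field] = "".join(node.itertext()).strip() if node is not None else None
    return record
```

Each returned record has one entry per database field, so it can be handed to whatever database layer is in use as a single row; a separate rule dictionary per page class matches the idea of one extraction method R_i per cluster DOC_Ci.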
As can be seen, throughout the crawl the user only needs to supply a typical page for each of the two classes of interest, and the method extracts the corresponding types of information from the target website in structured form and saves them to the database. The whole process delivers a fast information processing rate while maintaining high accuracy (above 95% in the embodiment).
The invention is thus based on the HDFS file system and, given the semi-structured nature of network information, uses tree edit distance to cluster web pages by page structure. On the basis of the clustering result, information is extracted from each class of web page according to its extraction method and saved to the database, which achieves structured collection of web page information with a significant technical effect.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410840847.0A CN104598536B (en) | 2014-12-29 | 2014-12-29 | A kind of distributed network information structuring processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410840847.0A CN104598536B (en) | 2014-12-29 | 2014-12-29 | A kind of distributed network information structuring processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104598536A true CN104598536A (en) | 2015-05-06 |
CN104598536B CN104598536B (en) | 2017-10-20 |
Family
ID=53124321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410840847.0A Active CN104598536B (en) | 2014-12-29 | 2014-12-29 | A kind of distributed network information structuring processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104598536B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101667201A (en) * | 2009-09-18 | 2010-03-10 | 浙江大学 | Integration method of Deep Web query interface based on tree merging |
US20110313973A1 (en) * | 2010-06-19 | 2011-12-22 | Srivas Mandayam C | Map-Reduce Ready Distributed File System |
US20120150836A1 (en) * | 2010-12-08 | 2012-06-14 | Microsoft Corporation | Training parsers to approximately optimize ndcg |
CN103534700A (en) * | 2011-05-20 | 2014-01-22 | 惠普发展公司,有限责任合伙企业 | System and method for configuration policy extraction |
US20130091414A1 (en) * | 2011-10-11 | 2013-04-11 | Omer BARKOL | Mining Web Applications |
CN103257975A (en) * | 2012-02-21 | 2013-08-21 | 腾讯科技(深圳)有限公司 | Search method, search device and search system |
CN104217020A (en) * | 2014-09-25 | 2014-12-17 | 浪潮(北京)电子信息产业有限公司 | Webpage clustering method and system based on MapReduce framework |
Non-Patent Citations (2)
Title |
---|
宫丽娜 等: "基于树编辑距离的聚类算法数据记录抽取", 《赤峰学院学报(自然科学版)》 * |
聂卉 等: "树编辑距离在Web信息抽取中的应用与实现", 《现代图书情报技术》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106815196A (en) * | 2015-11-27 | 2017-06-09 | 北京国双科技有限公司 | Soft text represents number of times statistical method and device |
CN106815196B (en) * | 2015-11-27 | 2020-07-31 | 北京国双科技有限公司 | Method and device for counting the number of press releases |
CN106293465A (en) * | 2016-08-09 | 2017-01-04 | Tcl移动通信科技(宁波)有限公司 | The Web page management method of a kind of mobile terminal and system |
CN107451224A (en) * | 2017-07-17 | 2017-12-08 | 广州特道信息科技有限公司 | A kind of clustering method and system based on big data parallel computation |
CN109829094A (en) * | 2019-01-23 | 2019-05-31 | 钟祥博谦信息科技有限公司 | Distributed crawler system |
CN112115164A (en) * | 2019-06-19 | 2020-12-22 | 北京金山云网络技术有限公司 | Data processing method and device, data query method and device, and network equipment |
CN112115164B (en) * | 2019-06-19 | 2024-09-03 | 北京金山云网络技术有限公司 | Data processing method and device, data query method and device and network equipment |
CN111177301A (en) * | 2019-11-26 | 2020-05-19 | 云南电网有限责任公司昆明供电局 | Key information identification and extraction method and system |
CN111177301B (en) * | 2019-11-26 | 2023-05-26 | 云南电网有限责任公司昆明供电局 | Method and system for identifying and extracting key information |
CN113220943A (en) * | 2021-06-04 | 2021-08-06 | 上海天旦网络科技发展有限公司 | Target information positioning method and system in semi-structured flow data |
Also Published As
Publication number | Publication date |
---|---|
CN104598536B (en) | 2017-10-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||