CN109033203A

CN109033203A - A parallel processing method for big-data feature extraction

Info

Publication number: CN109033203A
Application number: CN201810697344.0A
Authority: CN
Inventors: 刘震; 梁旭; 黄明; 焦璇; 黄辉
Original assignee: Dalian Jiaotong University
Current assignee: Dalian Jiaotong University
Priority date: 2018-06-29
Filing date: 2018-06-29
Publication date: 2018-12-18

Abstract

The invention discloses a parallel processing method for big-data feature extraction. It modifies the conventional parallel processing method for big-data feature extraction by improving the data capture method of an Internet crawling system. Storage space is first allocated on the GPU for the task data and the feature data. Keywords from a feature keyword library are then supplied, URLs are collected from a search engine, and collection by user-defined keywords is also supported. Crawling proceeds according to the crawl configuration, which greatly improves ease of operation. The target URLs are then determined: URLs containing the required data are found first, and the reliability of the data and the feasibility and difficulty of crawling are assessed. The page content and its organization are analyzed to determine the crawling rules. Finally, regular-expression matching is applied to the text at each level: according to the defined identification strings, a matching search is performed on the web page text to extract the required data. Crawling efficiency is high, and crawling accuracy is also greatly improved.

Description

A parallel processing method for big-data feature extraction

Technical field

The invention belongs to the technical field of big data processing, and more specifically relates to a parallel processing method for big-data feature extraction.

Background art

With the arrival of the big data era, how to process big data quickly and extract useful information has become a frontier research topic in the IT industry. "Big data" refers to data sets that are very large in volume, span many data categories, and require sufficiently fast processing; conventional database tools cannot extract and manage the content of such data sets. According to a search of existing patent literature, the main approaches to processing big data at present are increasing the number of CPU cores, building distributed cluster systems, and optimizing parallel algorithms. However, these approaches rely solely on the computing power of the CPU, and they are constrained by the limited number of CPU cores and by the high cost of building distributed cluster systems, so the methods and capacity for processing big data still need further innovation and improvement. Feature extraction technology is now used more and more widely in image processing, pattern recognition, network intrusion detection and other areas, and, particularly in big data environments, the efficiency of feature extraction has become the bottleneck that limits the ability to process data quickly.

To this end, application No. CN201310487250.8 discloses a parallel processing method for big-data feature extraction that is based on the CUDA architecture and uses the parallel computing capability of the GPU to process big data. When processing big data, it applies a parallelizable matrix-array processing method so that the data are processed by many concurrently executing threads, which greatly accelerates feature extraction. In this matrix-array method, the task data are matched in parallel against each character of the feature data in turn, forming a "01" matrix; the "01" matrix is then processed in parallel according to the length of the feature data to obtain the correct matching results. By exploiting the properties of the matrix array, the method parallelizes well and can parallelize data processing effectively and thoroughly, which makes it especially suitable for fast feature extraction from big data.
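The "01" matrix matching described for the prior-art method can be illustrated with a small sequential sketch. This is only one plausible reading of that description, not the cited application's implementation: the function names `build_01_matrix` and `find_matches` are hypothetical, and the loops that a GPU would run as parallel threads are written here as ordinary Python comprehensions.

```python
def build_01_matrix(task: str, feature: str) -> list[list[int]]:
    """Row j marks every position in `task` whose character equals feature[j].

    Every cell is independent of the others, which is what would make this
    step easy to run in parallel (one thread per cell on a GPU).
    """
    return [[1 if ch == fc else 0 for ch in task] for fc in feature]


def find_matches(task: str, feature: str) -> list[int]:
    """Return the start offsets at which `feature` occurs in `task`."""
    m = len(feature)
    if m == 0 or m > len(task):
        return []
    matrix = build_01_matrix(task, feature)
    # A match starts at offset i when the diagonal matrix[j][i + j]
    # contains only ones for j = 0 .. m-1 (m is the feature length).
    return [
        i
        for i in range(len(task) - m + 1)
        if all(matrix[j][i + j] for j in range(m))
    ]


print(find_matches("abcabcab", "abc"))  # [0, 3]
```

Each candidate offset is also checked independently of the others, so the second stage could be parallelized in the same way as the first.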

However, the above scheme still has a defect: it cannot deduplicate repeated data, so the data volume becomes excessive, which makes later data extraction very difficult and limits the usefulness of the scheme.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the prior art by providing a parallel processing method for big-data feature extraction.

To achieve the above purpose, the invention provides the following technical solution: a parallel processing method for big-data feature extraction, which comprises the following steps:

S1: allocate storage space on the GPU for the task data and the feature data;

S2: supply keywords from the feature keyword library, collect URLs from a search engine, and support collection by user-defined keywords;

S3: according to the crawl configuration, start from the section index page of the target website, crawl the article links that appear on the section index page one by one, and follow each article link to crawl the article's pagination information and body content; the system deduplicates the collected URLs by means of URL checking;

S4: the URL collection crawler supports depth-first and breadth-first algorithms, with configurable crawl depth and user permissions; the service interface for data crawling is determined by a caller service, the implementation service that responds to the service interface is determined by a provider service, and the implementation service is then invoked to crawl the data in other business documents, so that, taking a core business document as the dimension, the data of its associated business documents are crawled automatically, which greatly improves ease of operation;

S5: determine the target URLs to crawl: first find the URLs that contain the required data, then assess the reliability of the data and the feasibility and difficulty of crawling;

S6: analyze the page content and its organization, and determine the crawling rules;

S7: regular-expression matching: for the text at each level, perform a matching search on the web page text according to the defined identification strings to extract the required data.
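The URL deduplication of step S3 and the depth-limited depth-first and breadth-first traversal of step S4 can be sketched together as follows. The patent does not say how "URL checking" is performed, so this sketch assumes a simple approach: normalize each URL and keep a set of URLs already seen. The names `normalize`, `crawl`, `SITE` and `site_links` are hypothetical, and the in-memory `SITE` dictionary stands in for fetching and parsing real pages.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Canonical form for the duplicate check: lower-case scheme and host, no fragment."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def crawl(start, get_links, max_depth=2, strategy="bfs"):
    """Return the URLs reachable from `start` in visiting order, each visited once.

    get_links(url) returns the outgoing links of a page (assumed helper).
    strategy is "bfs" (queue, breadth-first) or "dfs" (stack, depth-first).
    """
    if strategy not in ("bfs", "dfs"):
        raise ValueError("strategy must be 'bfs' or 'dfs'")
    start = normalize(start)
    seen = {start}                 # URLs already scheduled: the deduplication set
    frontier = deque([(start, 0)])
    order = []
    while frontier:
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        if depth >= max_depth:     # configurable crawl depth
            continue
        for link in get_links(url):
            link = normalize(link)
            if link not in seen:   # skip duplicates
                seen.add(link)
                frontier.append((link, depth + 1))
    return order


# Hypothetical link graph standing in for a real website.
SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/c", "HTTP://EXAMPLE.COM/b#top"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": [],
}


def site_links(url):
    return SITE.get(url, [])


print(crawl("http://example.com", site_links, strategy="bfs"))
print(crawl("http://example.com", site_links, strategy="dfs"))
```

Switching between a queue and a stack is the only difference between the two strategies here; the duplicate check and the depth limit are shared.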

As a further optimization of this technical solution, in the parallel processing method for big-data feature extraction of the present invention, in step S4 care should be taken to avoid websites that apply anti-crawling measures, for example: websites that limit the number of times an IP address may access pages within a given period, websites that use JavaScript to encrypt page content, websites that can be browsed only after the user logs in, and websites whose pages can be viewed only by following links from the site's own pages.

As a further optimization of this technical solution, in the parallel processing method for big-data feature extraction of the present invention, in step S5, because a web page is a semi-structured document that contains, besides the data content, a large amount of formatting and other multimedia information, the organizational characteristics of the web page data must be understood before crawling and the recognition rules for the target data items must be determined; this is done by inspecting and analyzing the page source.

As a further optimization of this technical solution, in the parallel processing method for big-data feature extraction of the present invention, in step S6 regular expressions are used in the matching search process in order to maximize flexibility.
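A regular-expression search driven by identification strings might look like the sketch below. The patent does not define the form of an identification string, so this sketch assumes a pair of literal start and end markers taken from the page source; `extract_between` and the sample markup are hypothetical.

```python
import re


def extract_between(text, start_mark, end_mark):
    """Return every piece of text found between start_mark and end_mark."""
    pattern = re.compile(
        re.escape(start_mark) + r"(.*?)" + re.escape(end_mark),
        re.DOTALL,  # let a match span line breaks in the page source
    )
    return [found.strip() for found in pattern.findall(text)]


page = '<div class="price">  12.50 </div><div class="price">8.00</div>'
print(extract_between(page, '<div class="price">', "</div>"))  # ['12.50', '8.00']
```

Escaping the markers with `re.escape` keeps characters such as `.` or `(` in the page source from being read as regular-expression syntax, and the non-greedy `.*?` stops each match at the nearest end marker.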

As a further optimization of this technical solution, in the parallel processing method for big-data feature extraction of the present invention, in step S3 a URL tag parsing function is provided, which extracts and classifies the content under specific tags such as title, date, author and body text; key information within specific tags of the search results can be extracted, and a body text extraction function is provided for domestic news web pages.
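Extraction of the title, date, author and body text from specific tags can be sketched with Python's standard `html.parser` module. The class names in `FIELD_BY_CLASS`, the `ArticleParser` class, the `parse_article` helper and the sample markup are all assumptions made for illustration; a real news site would need its own mapping, determined by inspecting its page source.

```python
from html.parser import HTMLParser

# Hypothetical mapping from an element's class attribute to an output field.
FIELD_BY_CLASS = {"title": "title", "date": "date", "author": "author", "content": "body"}
# Elements that never have a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ArticleParser(HTMLParser):
    """Collects the text inside elements whose class is listed in FIELD_BY_CLASS.

    Assumes reasonably well-formed markup (every non-void tag is closed).
    """

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._field = None   # field currently being collected, if any
        self._chunks = []    # text fragments seen inside that field
        self._depth = 0      # nesting depth inside the field's element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._field is not None:
            self._depth += 1
            return
        for cls in (dict(attrs).get("class") or "").split():
            if cls in FIELD_BY_CLASS:
                self._field = FIELD_BY_CLASS[cls]
                self._chunks = []
                self._depth = 1
                break

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._field is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Join the fragments and collapse runs of whitespace.
            self.fields[self._field] = " ".join(" ".join(self._chunks).split())
            self._field = None

    def handle_data(self, data):
        if self._field is not None:
            self._chunks.append(data)


def parse_article(html):
    parser = ArticleParser()
    parser.feed(html)
    parser.close()
    return parser.fields


SAMPLE = """
<h1 class="title">Sample headline</h1>
<span class="date">2018-06-29</span><span class="author">Reporter A</span>
<div class="content"><p>First paragraph.</p><p>Second<br>paragraph.</p></div>
"""
print(parse_article(SAMPLE))
```

Because the fragments are joined with spaces, inline markup inside a field (for example a bold word in the middle of a sentence) may introduce extra spaces; that trade-off keeps the sketch short.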

As a further optimization of this technical solution, in the parallel processing method for big-data feature extraction of the present invention, in step S4, when an exception occurs during crawling, the log information is recorded and the parallelized distributed Internet data crawling system then retries the crawl until it succeeds.
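The log-and-retry behaviour can be sketched as below. The patent describes retrying until the crawl succeeds; this sketch instead caps the number of attempts with a `max_attempts` parameter, which is an assumption added so that a permanently unreachable page cannot cause an endless loop. The names `fetch_with_retry` and `flaky` are hypothetical, and `flaky` simulates a page that fails twice before responding.

```python
import logging
import time

logger = logging.getLogger("crawler")


def fetch_with_retry(fetch, url, max_attempts=5, delay=1.0):
    """Call fetch(url), logging each failure and retrying up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            # Record the failure before deciding whether to retry.
            logger.warning("attempt %d/%d failed for %s: %s", attempt, max_attempts, url, exc)
            if attempt == max_attempts:
                raise  # give up and let the caller see the last error
            time.sleep(delay)


calls = {"n": 0}


def flaky(url):
    """Simulated fetch that fails on the first two calls."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "page body for " + url


result = fetch_with_retry(flaky, "http://example.com/", delay=0)
print(result)  # page body for http://example.com/
```

Passing the fetch function as a parameter keeps the retry logic independent of the HTTP client actually used by the crawler.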

Technical effects and advantages of the invention: the parallel processing method for big-data feature extraction of the present invention modifies the conventional parallel processing method for big-data feature extraction by improving the data capture method of an Internet crawling system. Storage space is first allocated on the GPU for the task data and the feature data. Keywords from the feature keyword library are then supplied, URLs are collected from a search engine, and collection by user-defined keywords is supported. According to the crawl configuration, crawling starts from the section index page of the target website, the article links that appear on the section index page are crawled one by one, and each article link is followed to crawl the article's pagination information and body content; the system deduplicates the collected URLs by means of URL checking, which improves working efficiency. The URL collection crawler supports depth-first and breadth-first algorithms, with configurable crawl depth and user permissions. The service interface for data crawling is determined by a caller service, the implementation service that responds to the service interface is determined by a provider service, and the implementation service is then invoked to crawl the data in other business documents, so that, taking a core business document as the dimension, the data of its associated business documents are crawled automatically, which greatly improves ease of operation. The target URLs are then determined: URLs containing the required data are found first, and the reliability of the data and the feasibility and difficulty of crawling are assessed. The page content and its organization are analyzed to determine the crawling rules. Finally, regular-expression matching is applied to the text at each level: according to the defined identification strings, a matching search is performed on the web page text to extract the required data. Crawling efficiency is high, and crawling accuracy is also greatly improved.

Specific embodiment

In order to make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments. It should be understood that the specific embodiments described here merely illustrate the invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

The parallel processing method for big-data feature extraction provided by the invention comprises the following steps:

S1: allocate storage space on the GPU for the task data and the feature data;

In summary: the parallel processing method for big-data feature extraction of the present invention modifies the conventional parallel processing method for big-data feature extraction by improving the data capture method of an Internet crawling system. Storage space is first allocated on the GPU for the task data and the feature data. Keywords from the feature keyword library are then supplied, URLs are collected from a search engine, and collection by user-defined keywords is supported. According to the crawl configuration, crawling starts from the section index page of the target website, the article links that appear on the section index page are crawled one by one, and each article link is followed to crawl the article's pagination information and body content; the system deduplicates the collected URLs by means of URL checking, which improves working efficiency. The URL collection crawler supports depth-first and breadth-first algorithms, with configurable crawl depth and user permissions. The service interface for data crawling is determined by a caller service, the implementation service that responds to the service interface is determined by a provider service, and the implementation service is then invoked to crawl the data in other business documents, so that, taking a core business document as the dimension, the data of its associated business documents are crawled automatically, which greatly improves ease of operation. The target URLs are then determined: URLs containing the required data are found first, and the reliability of the data and the feasibility and difficulty of crawling are assessed. The page content and its organization are analyzed to determine the crawling rules. Finally, regular-expression matching is applied to the text at each level: according to the defined identification strings, a matching search is performed on the web page text to extract the required data. Crawling efficiency is high, crawling accuracy is greatly improved, and the method is suitable for wide adoption.

Finally, it should be noted that the foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or make equivalent replacements of some of their technical features. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its protection scope.

Claims

1. A parallel processing method for big-data feature extraction, characterized in that it comprises the following steps:

S1: allocate storage space on the GPU for the task data and the feature data;

S2: supply keywords from the feature keyword library, collect URLs from a search engine, and support collection by user-defined keywords;

S3: according to the crawl configuration, start from the section index page of the target website, crawl the article links that appear on the section index page one by one, and follow each article link to crawl the article's pagination information and body content; the system deduplicates the collected URLs by means of URL checking;

S4: the URL collection crawler supports depth-first and breadth-first algorithms, with configurable crawl depth and user permissions; the service interface for data crawling is determined by a caller service, the implementation service that responds to the service interface is determined by a provider service, and the implementation service is then invoked to crawl the data in other business documents, so that, taking a core business document as the dimension, the data of its associated business documents are crawled automatically, which greatly improves ease of operation;

S5: determine the target URLs to crawl: first find the URLs that contain the required data, then assess the reliability of the data and the feasibility and difficulty of crawling;

S7: regular-expression matching: for the text at each level, perform a matching search on the web page text according to the defined identification strings to extract the required data.

2. The data crawling method based on an Internet data crawling system according to claim 1, characterized in that: in step S5, care is taken to avoid websites that apply anti-crawling measures, for example: websites that limit the number of times an IP address may access pages within a given period, websites that use JavaScript to encrypt page content, websites that can be browsed only after the user logs in, and websites whose pages can be viewed only by following links from the site's own pages.

3. The data crawling method based on an Internet data crawling system according to claim 1, characterized in that: in step S6, because a web page is a semi-structured document that contains, besides the data content, a large amount of formatting and other multimedia information, the organizational characteristics of the web page data must be understood before crawling and the recognition rules for the target data items must be determined; this is done by inspecting and analyzing the page source.

4. The data crawling method based on an Internet data crawling system according to claim 1, characterized in that: in step S7, regular expressions are used in the matching search process in order to maximize flexibility.

5. The parallel processing method for big-data feature extraction according to claim 1, characterized in that: in step S4, a URL tag parsing function is provided, which extracts and classifies the content under specific tags such as title, date, author and body text; key information within specific tags of the search results can be extracted, and a body text extraction function is provided for domestic news web pages.

6. The parallel processing method for big-data feature extraction according to claim 1, characterized in that: in step S5, when an exception occurs during crawling, the log information is recorded and the parallelized distributed Internet data crawling system then retries the crawl until it succeeds.