CN109033203A - A big-data-oriented feature extraction parallel processing method

Publication number
CN109033203A
Authority
CN
China
Prior art keywords
data
crawl
text
parallel processing
feature extraction
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810697344.0A
Other languages
Chinese (zh)
Inventor
刘震
梁旭
黄明
焦璇
黄辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Jiaotong University
Original Assignee
Dalian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-06-29
Filing date
2018-06-29
Publication date
2018-12-18
Application filed by Dalian Jiaotong University
Priority to CN201810697344.0A
Publication of CN109033203A

Abstract

The invention discloses a big-data-oriented feature extraction parallel processing method that modifies the conventional parallel feature extraction approach for big data by improving the data crawling method of an internet crawling system. First, storage space is allocated on the GPU for the task data and the feature data. Next, keywords from a feature keyword database are supplied and URLs are collected from search engines, with support for collection using user-defined keywords; crawling then proceeds according to the crawl configuration information, which greatly improves the convenience of operation. The target URLs to crawl are then determined: the URLs containing the required data are found first, the reliability of the data and the feasibility and difficulty of crawling are assessed, and the page content and its organization are analyzed to determine the crawling rules. Finally, the text at each level is matched with regular expressions: according to the defined identification strings, a matching search is performed on the web page text to extract the required data. Crawling efficiency is high, and crawling accuracy is also greatly improved.

Description

A big-data-oriented feature extraction parallel processing method
Technical field
The invention belongs to the field of big data processing technology, and more specifically relates to a big-data-oriented feature extraction parallel processing method.
Background art
With the arrival of the big data era, how to process big data quickly and extract useful information has become a frontier research topic in the IT industry. "Big data" refers to a data set that is very large in volume, has many data categories, and requires sufficiently fast processing, and whose content cannot be extracted and managed with traditional database tools. According to a search of existing patent documents, the main current approaches to processing big data are increasing the number of CPU cores, building distributed cluster systems, and optimizing parallel algorithms. However, these approaches rely solely on the computing power of the CPU; given the limited number of CPU cores and the high cost of building distributed cluster systems, the methods and capabilities for processing big data still need further innovation and improvement. At present, feature extraction technology is used more and more widely in image processing, pattern recognition, network intrusion detection, and other fields, and, particularly in big data environments, the efficiency of feature extraction has become the bottleneck that limits the ability to process data quickly.
To this end, application No. CN201310487250.8 discloses a big-data-oriented feature extraction parallel processing method, which is based on the CUDA architecture and uses the parallel computing capability of the GPU to process big data. When processing big data, it applies a parallelizable matrix-array processing method so that the data is processed by many threads concurrently, which greatly speeds up feature extraction. In this matrix-array method, every character of the feature data is matched in parallel against the task data in turn to form a "01" matrix; this "01" matrix is then processed in parallel according to the length of the feature data to obtain the correctly matched result. The method exploits the properties of the matrix array, has good parallelism, and can effectively and fully parallelize the data processing, which makes it particularly suitable for fast feature extraction on big data.
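The "01" matrix idea described above can be illustrated with a minimal sketch. The interpretation is an assumption, since the cited application is only summarized here: cell (i, j) is 1 when character i of the feature data equals character j of the task data, and a full match starts at position p when the diagonal beginning at (0, p) contains only ones. The function names are hypothetical, and the loops run sequentially, whereas on a GPU each cell and each diagonal could be handled by its own thread.

```python
def build_01_matrix(task: str, feature: str) -> list[list[int]]:
    # Row i marks every position of the task data that equals feature[i].
    # Each cell is independent, so a GPU could compute all cells concurrently.
    return [[1 if t == f else 0 for t in task] for f in feature]


def find_matches(task: str, feature: str) -> list[int]:
    m, n = len(feature), len(task)
    if m == 0 or m > n:
        return []
    matrix = build_01_matrix(task, feature)
    # A match starts at p when matrix[i][p + i] == 1 for every i < m,
    # that is, when the diagonal of length m starting at column p is all ones.
    return [p for p in range(n - m + 1)
            if all(matrix[i][p + i] for i in range(m))]
```

Each candidate start position is also checked independently of the others, which is what makes the second stage parallelizable as well.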
However, the above scheme still has a defect: it cannot deduplicate repeated data, so the data volume becomes excessive, which makes later data extraction very difficult and limits the usefulness of the method.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings of the prior art by proposing a big-data-oriented feature extraction parallel processing method.
To achieve the above purpose, the invention provides the following technical solution: a big-data-oriented feature extraction parallel processing method, which specifically comprises the following steps:
S1: allocate storage space on the GPU for the task data and the feature data;
S2: supply keywords from the feature keyword database and collect URLs from search engines, with support for collection using user-defined keywords;
S3: according to the crawl configuration information, start from the section index page of the target website, crawl the article links that appear on that index page one by one, and follow each article link to crawl the article's pagination information and body content; the system deduplicates the collected URLs by means of URL verification;
S4: the URL collection crawler includes depth-first and breadth-first algorithms, with configurable crawl depth and user permissions; the service interface for data crawling is determined through a caller service, and the implementation service that responds to that service interface is determined through a provider service; by invoking the implementation service, the data in other business documents is crawled, which makes it possible to take a core business document as the dimension and automatically crawl the data of the business documents associated with it, greatly improving the convenience of operation;
S5: determine the target URLs to crawl: first find the URLs that contain the required data, then assess the reliability of the data and the feasibility and difficulty of crawling;
S6: analyze the page content and its organization, and determine the crawling rules;
S7: match the text at each level with regular expressions: according to the defined identification strings, perform a matching search on the web page text to extract the required data.
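The URL deduplication of step S3 and the depth-first and breadth-first traversal with configurable depth of step S4 can be sketched as follows. This is only an illustration under stated assumptions: the source does not specify what "URL verification" involves, so the sketch assumes it means comparing URLs in a normalized form, and `normalize_url`, `crawl_order`, and the `get_links` callback are hypothetical names. `get_links` stands in for fetching a page and returning the links found on it, so the example needs no network access.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonical form used as the deduplication key (an assumed reading of 'URL verification')."""
    url, _fragment = urldefrag(url.strip())       # drop "#fragment"
    parts = urlsplit(url)
    path = parts.path or "/"                      # treat "" and "/" as the same page
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def crawl_order(start, get_links, max_depth=2, strategy="bfs"):
    """Return the pages in visiting order, never visiting the same normalized URL twice."""
    root = normalize_url(start)
    seen = {root}
    frontier = deque([(root, 0)])
    visited = []
    while frontier:
        # A queue gives breadth-first order, a stack gives depth-first order.
        url, depth = frontier.popleft() if strategy == "bfs" else frontier.pop()
        visited.append(url)
        if depth >= max_depth:                    # configurable crawl depth
            continue
        for link in get_links(url):
            key = normalize_url(link)
            if key not in seen:                   # deduplicate before scheduling
                seen.add(key)
                frontier.append((key, depth + 1))
    return visited
```

Keeping the `seen` set separate from the frontier means a URL is rejected as soon as it is discovered a second time, regardless of which traversal strategy is configured.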
As a further optimization of this technical solution, in the big-data-oriented feature extraction parallel processing method of the present invention, in step S4, care is taken to avoid websites that apply anti-collection measures, for example: limiting the number of page accesses from an IP address within a certain period, encrypting page content with JavaScript, allowing browsing only after the user has logged in, and allowing a page to be viewed only through links from pages of the same site.
As a further optimization of this technical solution, in the big-data-oriented feature extraction parallel processing method of the present invention, in step S5, because a web page is a semi-structured document that contains, besides the data content, a large amount of formatting and other multimedia information, the organizational characteristics of the web page data must be understood before crawling and the recognition rules for the target data items must be determined; this analysis is done by viewing the page source.
As a further optimization of this technical solution, in the big-data-oriented feature extraction parallel processing method of the present invention, in step S6, regular expressions are used during the matching search in order to maximize flexibility.
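A matching search driven by identification strings might look like the sketch below. It assumes that an identification string is a literal marker that precedes or follows the wanted data in the page source; the source does not define the term more precisely, and `extract_between` is a hypothetical helper name.

```python
import re


def extract_between(text: str, start_mark: str, end_mark: str) -> list[str]:
    """Return every piece of text enclosed by the two identification strings."""
    # re.escape keeps characters such as "<", "." or "?" in the markers literal;
    # the non-greedy group stops at the first end marker after each start marker.
    pattern = re.compile(re.escape(start_mark) + r"(.*?)" + re.escape(end_mark), re.DOTALL)
    return [match.strip() for match in pattern.findall(text)]
```

For example, `extract_between(page, '<div class="title">', '</div>')` would collect the text of every title block on a page whose source uses that markup.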
As a further optimization of this technical solution, in the big-data-oriented feature extraction parallel processing method of the present invention, in step S3, a URL tag parsing function is provided, which extracts and classifies the content under specific tags such as title, date, author, and body; key information in specific tags of the search results is extracted; and a body-text extraction function for domestic news web pages is provided.
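The extraction and classification of content under specific tags can be sketched with Python's standard `html.parser` module. The class names `title`, `date`, `author`, and `content` are assumptions chosen for illustration; a real news site would need its own tag rules, which the source does not list.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects the text inside elements whose class attribute names a wanted field."""
    FIELDS = ("title", "date", "author", "content")   # hypothetical class names
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # elements without an end tag

    def __init__(self):
        super().__init__()
        self.parts = {name: [] for name in self.FIELDS}
        self._stack = []   # one entry per open element: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append(next((c for c in classes if c in self.FIELDS), None))

    def handle_endtag(self, tag):
        if tag not in self.VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Text belongs to the innermost enclosing element that names a field.
        field = next((f for f in reversed(self._stack) if f), None)
        if field and data.strip():
            self.parts[field].append(data.strip())

    def result(self) -> dict:
        return {name: " ".join(chunks) for name, chunks in self.parts.items()}
```

Calling `feed()` with the page source and then `result()` yields one string per field, with nested paragraphs inside the body element joined by spaces.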
As a further optimization of this technical solution, in the big-data-oriented feature extraction parallel processing method of the present invention, in step S4, when an exception occurs during crawling, log information is recorded, and the parallelized distributed internet data crawling system then retries the crawl until it succeeds.
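The log-and-retry behaviour can be sketched as below. One deliberate difference from the text is an assumption of this sketch: the source retries until the crawl succeeds, while the sketch caps the number of attempts so that a permanently failing URL cannot block the crawler. `fetch_with_retry` and its parameters are hypothetical names, and `fetch` is any callable that downloads a URL.

```python
import logging
import time

logger = logging.getLogger("crawler")


def fetch_with_retry(fetch, url, max_attempts=5, delay=0.0):
    """Call fetch(url), logging each failure and retrying up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            # Record the failure before deciding whether to try again.
            logger.warning("crawl failed (attempt %d/%d) for %s: %s",
                           attempt, max_attempts, url, exc)
            if attempt == max_attempts:
                raise                      # give up and surface the last error
            time.sleep(delay)              # optional pause between attempts
```

Passing a very large `max_attempts` approximates the retry-until-success behaviour described in the text.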
Technical effects and advantages of the invention: the big-data-oriented feature extraction parallel processing method of the present invention modifies the conventional parallel feature extraction approach for big data by improving the data crawling method of an internet crawling system. First, storage space is allocated on the GPU for the task data and the feature data. Keywords from the feature keyword database are then supplied and URLs are collected from search engines, with support for collection using user-defined keywords. According to the crawl configuration information, crawling starts from the section index page of the target website, the article links that appear on that index page are crawled one by one, and each article link is followed to crawl the article's pagination information and body content; the system deduplicates the collected URLs by means of URL verification, which improves working efficiency. The URL collection crawler includes depth-first and breadth-first algorithms, with configurable crawl depth and user permissions. The service interface for data crawling is determined through a caller service, and the implementation service that responds to that service interface is determined through a provider service; by invoking the implementation service, the data in other business documents is crawled, which makes it possible to take a core business document as the dimension and automatically crawl the data of the business documents associated with it, greatly improving the convenience of operation. The target URLs to crawl are then determined: the URLs containing the required data are found first, the reliability of the data and the feasibility and difficulty of crawling are assessed, and the page content and its organization are analyzed to determine the crawling rules. Finally, the text at each level is matched with regular expressions: according to the defined identification strings, a matching search is performed on the web page text to extract the required data. Crawling efficiency is high, and crawling accuracy is also greatly improved.
Detailed description of the embodiments
In order to make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments. It should be understood that the specific embodiments described here only illustrate the invention and are not intended to limit it. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
The big-data-oriented feature extraction parallel processing method provided by the invention specifically comprises the following steps:
S1: allocate storage space on the GPU for the task data and the feature data;
S2: supply keywords from the feature keyword database and collect URLs from search engines, with support for collection using user-defined keywords;
S3: according to the crawl configuration information, start from the section index page of the target website, crawl the article links that appear on that index page one by one, and follow each article link to crawl the article's pagination information and body content; the system deduplicates the collected URLs by means of URL verification;
S4: the URL collection crawler includes depth-first and breadth-first algorithms, with configurable crawl depth and user permissions; the service interface for data crawling is determined through a caller service, and the implementation service that responds to that service interface is determined through a provider service; by invoking the implementation service, the data in other business documents is crawled, which makes it possible to take a core business document as the dimension and automatically crawl the data of the business documents associated with it, greatly improving the convenience of operation;
S5: determine the target URLs to crawl: first find the URLs that contain the required data, then assess the reliability of the data and the feasibility and difficulty of crawling;
S6: analyze the page content and its organization, and determine the crawling rules;
S7: match the text at each level with regular expressions: according to the defined identification strings, perform a matching search on the web page text to extract the required data.
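The caller and provider services of step S4 are described only in outline, so the following is one possible reading rather than the method itself: a provider registers an implementation under a service-interface name, and the caller chooses which interfaces to invoke for a given core business document. All names here (`PROVIDERS`, `provides`, `crawl_related`, the interface names, and the sample records) are hypothetical, and the implementations return fixed stand-in data instead of crawling anything.

```python
from typing import Callable, Dict, List

# Provider side: maps a service-interface name to the function that implements it.
PROVIDERS: Dict[str, Callable[[str], List[dict]]] = {}


def provides(interface: str):
    """Register the decorated function as the implementation of `interface`."""
    def register(func):
        PROVIDERS[interface] = func
        return func
    return register


@provides("invoices.by_order")   # hypothetical interface name
def invoices_for_order(order_id: str) -> List[dict]:
    # Stand-in for crawling the invoice documents associated with an order.
    return [{"doc": "invoice", "order": order_id, "no": "INV-1"}]


@provides("shipments.by_order")  # hypothetical interface name
def shipments_for_order(order_id: str) -> List[dict]:
    # Stand-in for crawling the shipment documents associated with an order.
    return [{"doc": "shipment", "order": order_id, "no": "SHP-1"}]


def crawl_related(core_doc_id: str, interfaces: List[str]) -> Dict[str, List[dict]]:
    """Caller side: choose the interfaces, then invoke whichever implementation answers each."""
    return {name: PROVIDERS[name](core_doc_id) for name in interfaces}
```

With this arrangement the core business document (here an order identifier) is the only input the caller needs, and new kinds of associated documents can be added by registering another provider.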
As a further optimization of this technical solution, in the big-data-oriented feature extraction parallel processing method of the present invention, in step S4, care is taken to avoid websites that apply anti-collection measures, for example: limiting the number of page accesses from an IP address within a certain period, encrypting page content with JavaScript, allowing browsing only after the user has logged in, and allowing a page to be viewed only through links from pages of the same site.
As a further optimization of this technical solution, in the big-data-oriented feature extraction parallel processing method of the present invention, in step S5, because a web page is a semi-structured document that contains, besides the data content, a large amount of formatting and other multimedia information, the organizational characteristics of the web page data must be understood before crawling and the recognition rules for the target data items must be determined; this analysis is done by viewing the page source.
As a further optimization of this technical solution, in the big-data-oriented feature extraction parallel processing method of the present invention, in step S6, regular expressions are used during the matching search in order to maximize flexibility.
As a further optimization of this technical solution, in the big-data-oriented feature extraction parallel processing method of the present invention, in step S3, a URL tag parsing function is provided, which extracts and classifies the content under specific tags such as title, date, author, and body; key information in specific tags of the search results is extracted; and a body-text extraction function for domestic news web pages is provided.
As a further optimization of this technical solution, in the big-data-oriented feature extraction parallel processing method of the present invention, in step S4, when an exception occurs during crawling, log information is recorded, and the parallelized distributed internet data crawling system then retries the crawl until it succeeds.
In summary: the big-data-oriented feature extraction parallel processing method of the present invention modifies the conventional parallel feature extraction approach for big data by improving the data crawling method of an internet crawling system. First, storage space is allocated on the GPU for the task data and the feature data. Keywords from the feature keyword database are then supplied and URLs are collected from search engines, with support for collection using user-defined keywords. According to the crawl configuration information, crawling starts from the section index page of the target website, the article links that appear on that index page are crawled one by one, and each article link is followed to crawl the article's pagination information and body content; the system deduplicates the collected URLs by means of URL verification, which improves working efficiency. The URL collection crawler includes depth-first and breadth-first algorithms, with configurable crawl depth and user permissions. The service interface for data crawling is determined through a caller service, and the implementation service that responds to that service interface is determined through a provider service; by invoking the implementation service, the data in other business documents is crawled, which makes it possible to take a core business document as the dimension and automatically crawl the data of the business documents associated with it, greatly improving the convenience of operation. The target URLs to crawl are then determined: the URLs containing the required data are found first, the reliability of the data and the feasibility and difficulty of crawling are assessed, and the page content and its organization are analyzed to determine the crawling rules. Finally, the text at each level is matched with regular expressions: according to the defined identification strings, a matching search is performed on the web page text to extract the required data. Crawling efficiency is high, crawling accuracy is greatly improved, and the method is suitable for wide adoption.
Finally, it should be noted that the above is only a preferred embodiment of the present invention and is not intended to limit the invention. Although the invention has been described in detail with reference to the foregoing embodiment, those skilled in the art may still modify the technical solutions described in the foregoing embodiment or make equivalent replacements of some of its technical features. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included within its protection scope.

Claims (6)

1. A big-data-oriented feature extraction parallel processing method, characterized in that it specifically comprises the following steps:
S1: allocate storage space on the GPU for the task data and the feature data;
S2: supply keywords from the feature keyword database and collect URLs from search engines, with support for collection using user-defined keywords;
S3: according to the crawl configuration information, start from the section index page of the target website, crawl the article links that appear on that index page one by one, and follow each article link to crawl the article's pagination information and body content; the system deduplicates the collected URLs by means of URL verification;
S4: the URL collection crawler includes depth-first and breadth-first algorithms, with configurable crawl depth and user permissions; the service interface for data crawling is determined through a caller service, and the implementation service that responds to that service interface is determined through a provider service; by invoking the implementation service, the data in other business documents is crawled, which makes it possible to take a core business document as the dimension and automatically crawl the data of the business documents associated with it, greatly improving the convenience of operation;
S5: determine the target URLs to crawl: first find the URLs that contain the required data, then assess the reliability of the data and the feasibility and difficulty of crawling;
S6: analyze the page content and its organization, and determine the crawling rules;
S7: match the text at each level with regular expressions: according to the defined identification strings, perform a matching search on the web page text to extract the required data.
2. The data crawling method based on an internet data crawling system according to claim 1, characterized in that: in step S5, care is taken to avoid websites that apply anti-collection measures, for example: limiting the number of page accesses from an IP address within a certain period, encrypting page content with JavaScript, allowing browsing only after the user has logged in, and allowing a page to be viewed only through links from pages of the same site.
3. The data crawling method based on an internet data crawling system according to claim 1, characterized in that: in step S6, because a web page is a semi-structured document that contains, besides the data content, a large amount of formatting and other multimedia information, the organizational characteristics of the web page data must be understood before crawling and the recognition rules for the target data items must be determined; this analysis is done by viewing the page source.
4. The data crawling method based on an internet data crawling system according to claim 1, characterized in that: in step S7, regular expressions are used during the matching search in order to maximize flexibility.
5. The big-data-oriented feature extraction parallel processing method according to claim 1, characterized in that: in step S4, a URL tag parsing function is provided, which extracts and classifies the content under specific tags such as title, date, author, and body; key information in specific tags of the search results is extracted; and a body-text extraction function for domestic news web pages is provided.
6. The big-data-oriented feature extraction parallel processing method according to claim 1, characterized in that: in step S5, when an exception occurs during crawling, log information is recorded, and the parallelized distributed internet data crawling system then retries the crawl until it succeeds.
CN201810697344.0A 2018-06-29 2018-06-29 A kind of feature extraction method for parallel processing towards big data Pending CN109033203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810697344.0A CN109033203A (en) 2018-06-29 2018-06-29 A kind of feature extraction method for parallel processing towards big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810697344.0A CN109033203A (en) 2018-06-29 2018-06-29 A kind of feature extraction method for parallel processing towards big data

Publications (1)

Publication Number Publication Date
CN109033203A true CN109033203A (en) 2018-12-18

Family

ID=65520907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810697344.0A Pending CN109033203A (en) 2018-06-29 2018-06-29 A kind of feature extraction method for parallel processing towards big data

Country Status (1)

Country Link
CN (1) CN109033203A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162607A (en) * 2019-02-20 2019-08-23 北京捷风数据技术有限公司 A kind of government organization document information retroactive method and device based on convolutional neural networks
CN112579855A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Method and device for extracting feature codes of WeChat article
CN112966169A (en) * 2021-04-13 2021-06-15 四川省广播电视科学技术研究所 Internet emergency information capturing method
CN112988796A (en) * 2021-03-09 2021-06-18 纽扣互联(北京)科技有限公司 System and method for system data retrieval
CN114547171A (en) * 2022-02-22 2022-05-27 广州品推科技有限公司 Business data processing method and system based on big data analysis

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577160A (en) * 2013-10-17 2014-02-12 江苏科技大学 Characteristic extraction parallel-processing method for big data
CN108052632A (en) * 2017-12-20 2018-05-18 成都律云科技有限公司 A kind of method for obtaining network information, system and company information search system
US20180167336A1 (en) * 2014-10-31 2018-06-14 The Nielsen Company (Us), Llc Method and apparatus to throttle media access by web crawlers

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110162607A (en) * 2019-02-20 2019-08-23 北京捷风数据技术有限公司 A kind of government organization document information retroactive method and device based on convolutional neural networks
CN110162607B (en) * 2019-02-20 2021-08-31 北京捷风数据技术有限公司 Government organization official document information tracing method and device based on convolutional neural network
CN112579855A (en) * 2019-09-30 2021-03-30 北京国双科技有限公司 Method and device for extracting feature codes of WeChat article
CN112988796A (en) * 2021-03-09 2021-06-18 纽扣互联(北京)科技有限公司 System and method for system data retrieval
CN112988796B (en) * 2021-03-09 2023-08-18 纽扣互联(北京)科技有限公司 System and method for system data retrieval
CN112966169A (en) * 2021-04-13 2021-06-15 四川省广播电视科学技术研究所 Internet emergency information capturing method
CN114547171A (en) * 2022-02-22 2022-05-27 广州品推科技有限公司 Business data processing method and system based on big data analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20181218)