CN109783619A

CN109783619A - Data filtering and mining method

Info

Publication number: CN109783619A
Application number: CN201811532016.1A
Authority: CN
Inventors: 柴满; 吴少丹; 刘坤杰
Original assignee: GUANGDONG CREAWOR TECHNOLOGY DEVELOPMENT Co Ltd
Current assignee: GUANGDONG CREAWOR TECHNOLOGY DEVELOPMENT Co Ltd
Priority date: 2018-12-14
Filing date: 2018-12-14
Publication date: 2019-05-21

Abstract

The technical solution of the present invention comprises a data filtering and mining method, which is used to: develop a crawler engine that crawls massive Internet data; and build a core value model with neural networks and machine learning algorithms that filters the data automatically in real time and mines the high-value information in it. The method includes performing big data processing and natural language processing on the acquired data, establishing a data analysis model system comprising a text analysis model, an information value evaluation model and a deep learning algorithm tuning model, and processing target data and pushing it to the user. The beneficial effects of the invention are as follows: it helps users accurately identify potential customers; compared with manual processing, the process is simpler and more efficient and the recognition rate is higher; and high-value information and potential customers are identified automatically and accurately by machine learning algorithms, which reduces human misjudgments and omissions.

Description

A data filtering and mining method

Technical field

The present invention relates to a data filtering and mining method and belongs to the field of computer technology.

Background art

Computer and communication technologies are developing rapidly and user demand is growing quickly, so computer networks are being applied ever more widely and at ever larger scale. Big data technology is a revolutionary new generation of information technology with data at its core; in the course of mining the potential of data, it can drive innovation in concepts, models, technologies and application practice. In the big data era, large numbers of enterprises are moving toward digitization and informatization, yet they face a large volume of redundant and disorderly Internet data, and each user, with limited time and energy, finds it difficult to identify valuable information efficiently and accurately.

Summary of the invention

To solve the above problems, the purpose of the present invention is to provide a data filtering and mining method that: develops a crawler engine to crawl massive Internet data; builds a core value model with neural networks and machine learning algorithms to filter the data automatically in real time and mine the high-value information in it, including performing big data processing and natural language processing on the acquired data; establishes a data analysis model system comprising a text analysis model, an information value evaluation model and a deep learning algorithm tuning model; and processes target data and pushes it to the user.

The technical solution adopted by the present invention to solve this problem is a data filtering and mining method, characterized by comprising the following steps: S100, performing data acquisition on a target website using Nutch distributed crawler technology; S200, performing big data processing and natural language processing on the acquired data, wherein the big data processing includes data extraction, verification, loading, storage and computation, and the natural language processing includes deduplication, filtering, text classification and summarization of the data; S300, establishing a data analysis model system, the data analysis model system including a text analysis model, an information value evaluation model and a deep learning algorithm tuning model; S400, analyzing the processed data using the data analysis model system, and optimizing according to user feedback information.

Further, S100 includes:

S101, creating a uniform resource locator (URL) list file; S102, adding entry URL addresses according to the URL list file to generate a list to be downloaded; S103, extracting the list to be downloaded to generate download tasks; S104, downloading specific page content from the World Wide Web according to the download tasks; S105, analyzing and extracting the page content and judging whether the page content contains uniform resource locators; if so, returning to step S102; otherwise, establishing an index for retrieval according to the page content and completing data acquisition.

Further, in step S105, extracting the page content includes: automatic extraction of web page content, which distinguishes the title and body text in a web page, automatically merges multiple web pages whose content is continuous, and automatically extracts web forum information; web page body text extraction, which automatically analyzes a disorderly page structure and extracts the body text; and conversion among multiple character set encodings, which identifies and converts characters in encodings including but not limited to GB2312 and UTF-8 by means of Unicode encoding.

Further, the deduplication of the data includes deduplication based on uniform resource locators and deduplication based on web page content.

Further, the text classification includes: training on the data with a specified algorithm to obtain different trained models, which are placed in a selected container; and classifying text data according to the trained models obtained.

Further, the specified algorithm includes a linear classification algorithm, a k-nearest-neighbor (KNN) classification algorithm and the Brown word clustering algorithm.

Further, the summarization of the data includes: statistics-based information extraction, which derives a digest by pattern matching according to a cue word dictionary, word frequencies, and the statistical regularities of words and sentences; and understanding-based information extraction, which uses syntactic and semantic knowledge to extract a digest on the basis of understanding the content of the article.

Further, the text analysis model segments long text content into sentences and words based on natural language processing technology, and obtains the keywords contained in the text, keyword classification matches and word frequency statistics, thereby preparing data for the information value evaluation model.

Further, the information value evaluation model takes the data processed by the text analysis model and, based on the sentence and word segmentation results of natural language processing, applies classification rules and calculates article information value scores by keyword matching at the word/phrase level, thereby classifying different articles and scoring their value; according to key sentences that meet the user's business needs, it screens the scored sentences to obtain an article information digest, which is sent to the user.

Further, the deep learning algorithm tuning model is used to iteratively correct, according to the user feedback information received, the keyword base score evaluation system and the value information decision threshold in the value evaluation model; a neural network is trained iteratively on each batch of samples, and the multiplier and the base score are trained a predetermined number of times according to the proportion of positive and negative samples to complete the tuning, so as to approach the expected value, wherein the predetermined number of times can be set by the user.

The beneficial effects of the present invention are as follows: the data filtering and mining method adopted by the invention helps users accurately identify potential customers; compared with manual processing, the process is simpler and more efficient and the recognition rate is higher; and high-value information and potential customers are identified automatically and accurately by machine learning algorithms, which reduces human misjudgments and omissions.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method of a preferred embodiment of the present invention;

Fig. 2 is a schematic flow chart of the method according to a preferred embodiment of the present invention;

Fig. 3 is a schematic flow chart of the crawler according to a preferred embodiment of the present invention;

Fig. 4 shows embodiment one according to a preferred embodiment of the present invention.

Detailed description of the embodiments

The concept, specific structure and technical effects of the present invention are described clearly and completely below with reference to the embodiments and the drawings, so that the purpose, scheme and effects of the invention can be fully understood.

It should be noted that, unless otherwise specified, descriptions such as up, down, left and right used in the present disclosure refer only to the mutual positional relationships of the components of the disclosure in the drawings. The singular forms "a", "said" and "the" used in the present disclosure are also intended to include the plural forms, unless the context clearly indicates otherwise. In addition, unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the art. The terms used in the description herein are intended only to describe specific embodiments and are not intended to limit the invention.

The use of any and all examples or exemplary language ("for example", "such as") provided herein is intended merely to better illustrate the embodiments of the invention and, unless otherwise claimed, does not limit the scope of the invention.

Referring to Fig. 1, which shows a schematic flow chart of the method of a preferred embodiment of the present invention, the method includes the following steps:

Performing data acquisition on the target website using Nutch distributed crawler technology;

Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run one's own search engine, including full-text search and a Web crawler. Although Web search is a basic requirement for navigating the Internet, the number of existing Web search engines is declining, and this may well evolve further into a single company monopolizing almost all Web search for its own commercial interest, which is clearly not in the interest of the vast number of Internet users. Compared with commercial search engines, Nutch, as an open-source search engine, is more transparent and therefore more trustworthy. All major search engines currently use proprietary ranking algorithms and do not explain why a page is ranked at a particular position. In addition, some search engines rank websites according to the fees the websites pay rather than according to the value of the websites themselves. Unlike them, Nutch has nothing to hide and no motive to distort search results; Nutch does its best to provide users with the best search results.

The crawler has two key aspects: its workflow, and the format and meaning of the data files involved. The data files fall into three classes: the Web database, a series of segments, and the index. The physical files of the three are stored under the crawl results directory, in the webdb subfolder of the db directory, the segments folder and the index folder respectively.

A single crawl generates many segments. Each segment stores the web pages fetched by the crawler in one fetch cycle together with the index of those pages. When crawling, the crawler generates the fetchlist needed for each fetch cycle from the link relationships in WebDB according to a given crawl policy; the Fetcher then fetches the pages through the URLs in the fetchlist, indexes them, and stores them in a segment. Segments have a limited lifetime: once the crawler has fetched these pages again, the segments generated by the earlier fetch become obsolete. In storage, segment folders are named by their generation time, which makes it easy to delete obsolete segments and save storage space.

The index is the index of all pages fetched by the crawler, obtained by merging the indexes in all individual segments. Nutch uses Lucene for indexing, so the interfaces Lucene provides for operating on an index work equally on the index in Nutch. Note, however, that a segment in Lucene differs from a segment in Nutch: a segment in Lucene is a part of the index, whereas a segment in Nutch holds the content and index of one portion of the pages in WebDB, and the index finally generated from the segments has no further connection to them.

The Web database, also called WebDB, stores the link structure information among the pages fetched by the crawler. It is used only in the crawler's work and has no relationship with the work of the Searcher. WebDB stores information on two kinds of entities: Page and Link. A Page entity characterizes an actual web page by describing the feature information of a page on the network. Because there are many pages to describe, WebDB indexes these Page entities by two methods: the URL of the page and the MD5 of the page content. The page features described by a Page entity mainly include the number of links in the page, fetch-related information such as the time the page was fetched, and the importance score of the page. Likewise, a Link entity describes the link relationship between two Page entities. WebDB thus forms a link structure graph of the fetched pages, in which Page entities are the nodes and Link entities are the edges. For the specific data acquisition flow, refer to Fig. 2 and Fig. 3.
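The Page/Link structure described above can be illustrated with a minimal Python sketch. It is not Nutch's actual WebDB API or storage format: the names `Page`, `Link`, `LinkGraph` and `md5_hex` are hypothetical, and the sketch only shows pages indexed by URL and by the MD5 of their content, with links as directed edges.

```python
import hashlib
from dataclasses import dataclass


def md5_hex(text: str) -> str:
    """MD5 digest of a string, used here as the content index key."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


@dataclass
class Page:
    """Node of the link graph (hypothetical fields mirroring the description)."""
    url: str
    content: str
    fetch_time: float = 0.0   # when the page was fetched
    score: float = 1.0        # importance score of the page
    link_count: int = 0       # number of outgoing links recorded so far


@dataclass(frozen=True)
class Link:
    """Directed edge between two pages, identified by their URLs."""
    from_url: str
    to_url: str


class LinkGraph:
    def __init__(self):
        self.pages_by_url = {}          # index 1: URL -> Page
        self.pages_by_content_md5 = {}  # index 2: MD5(content) -> [Page, ...]
        self.links = set()

    def add_page(self, page: Page) -> None:
        self.pages_by_url[page.url] = page
        self.pages_by_content_md5.setdefault(md5_hex(page.content), []).append(page)

    def add_link(self, from_url: str, to_url: str) -> None:
        link = Link(from_url, to_url)
        if link not in self.links:      # ignore repeated edges
            self.links.add(link)
            self.pages_by_url[from_url].link_count += 1


graph = LinkGraph()
graph.add_page(Page("http://a.example/", "home"))
graph.add_page(Page("http://a.example/x", "same text"))
graph.add_page(Page("http://a.example/y", "same text"))
graph.add_link("http://a.example/", "http://a.example/x")
graph.add_link("http://a.example/", "http://a.example/y")
graph.add_link("http://a.example/", "http://a.example/x")  # duplicate edge
```

Keeping a second index on the content digest makes pages with identical content under different URLs easy to find, which is one plausible use of the MD5 index mentioned in the description.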

Performing big data processing on the acquired data, wherein the big data processing includes the steps of data extraction, verification, loading, storage and computation;

The big data processing is also referred to as ETL. ETL is responsible for extracting data from scattered, heterogeneous data sources, such as relational data and flat data files, into a temporary intermediate layer, where it is cleaned, transformed and integrated, and finally loading it into a data warehouse or data mart, where it becomes the basis for online analytical processing and data mining and provides data for decision support.

1. Data cleansing:

Gap filling: fill in empty and missing data, and mark the data that cannot be handled.

Data replacement: replace invalid data.

Format normalization: convert the format of the extracted source data into a target format that is convenient to load into and process in the warehouse.

Primary/foreign key constraints: by establishing primary and foreign key constraints, replace illegal data or export it to an error file for reprocessing.
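The four cleansing operations above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the field names `title`, `date` and `site_id`, the `YYYY/MM/DD` source format and the function name `clean_rows` are all hypothetical.

```python
from datetime import datetime


def clean_rows(rows, known_site_ids):
    """Return (clean, errors); rows that cannot be repaired go to the error list."""
    clean, errors = [], []
    for row in rows:
        row = dict(row)  # work on a copy so the source rows stay untouched
        # Gap filling: replace an empty title with a placeholder and mark the row.
        if not row.get("title"):
            row["title"] = "UNKNOWN"
            row["flagged"] = True
        # Format normalization: assumed source format YYYY/MM/DD -> ISO 8601.
        try:
            row["date"] = datetime.strptime(row["date"], "%Y/%m/%d").date().isoformat()
        except (KeyError, TypeError, ValueError):
            errors.append(row)  # exported for reprocessing
            continue
        # Foreign key check: site_id must refer to a known site.
        if row.get("site_id") not in known_site_ids:
            errors.append(row)
            continue
        clean.append(row)
    return clean, errors


rows = [
    {"title": "", "date": "2018/12/14", "site_id": 1},
    {"title": "x", "date": "bad", "site_id": 1},
    {"title": "y", "date": "2019/05/21", "site_id": 99},
]
clean, errors = clean_rows(rows, known_site_ids={1})
```

Rows that fail a check are collected rather than dropped, matching the idea of exporting them to an error file for reprocessing.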

2. Data transformation:

Data merging: implemented through multi-table joins; joins between a large table and a small table use lookup, and intersections of two large tables use join (each field is indexed to guarantee the efficiency of the join query).

Data splitting: split the data according to given rules.

Row/column transposition, sorting or renumbering, and removal of duplicate records.

Data verification: lookup, sum, count.

Implementation:

Performed in the ETL engine (for what cannot be done in SQL).

Performed in the database (for what can be done in SQL).

3. Data loading methods:

Timestamp mode: a field is added uniformly to the business tables as a timestamp; when the OLAP system updates or modifies business data, the value of the timestamp field is modified at the same time.

Log table mode: a log table is added in the OLAP system; when business data changes, the content of the log table is updated and maintained.

Full-table comparison mode: all source data is extracted, and before the target table is updated the data is compared by primary key and fields; rows that have changed are updated or inserted.

Full-table delete-and-insert mode: the data in the target table is deleted and the source data is inserted in full.
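Of the loading methods above, the full-table comparison mode can be sketched in a few lines. This is an illustrative sketch only: the target table is modeled as an in-memory dict keyed by primary key, and the function name `full_table_compare` and the `id` key are assumptions.

```python
def full_table_compare(source_rows, target, key="id"):
    """Compare every source row with the target by primary key and fields.

    New keys are inserted, changed rows are updated, identical rows are skipped.
    Returns (inserted, updated) counts.
    """
    inserted = updated = 0
    for row in source_rows:
        pk = row[key]
        if pk not in target:
            target[pk] = dict(row)
            inserted += 1
        elif target[pk] != row:      # any field differs
            target[pk] = dict(row)
            updated += 1
    return inserted, updated


target_table = {1: {"id": 1, "v": "a"}, 2: {"id": 2, "v": "b"}}
source = [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
counts = full_table_compare(source, target_table)
```

Compared with delete-and-insert, this mode touches only the rows that actually changed, at the cost of reading and comparing the whole source table.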

Exception handling

During ETL, data exceptions are unavoidable. They are handled as follows:

1. Export the erroneous records separately and continue executing the ETL; load the erroneous data separately once it has been corrected. Alternatively, interrupt the ETL and re-execute it after the correction. Principle: accept as much data as possible.

2. For exceptions caused by external factors such as network interruptions, set a number of retries or a retry period; once the count or the time is exceeded, external personnel intervene manually.

3. For exceptional conditions such as changes to the source data structure or to an interface, synchronize the change first and then load the data.
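The first two handling rules above can be sketched as follows. The helper names `run_with_retry` and `load_rows`, the retry limit and the use of `ConnectionError` and `ValueError` as stand-ins for network and data errors are assumptions made for illustration.

```python
import time


def run_with_retry(task, max_attempts=3, delay_seconds=0.0):
    """Rule 2: retry a task on network errors; re-raise once the limit is hit
    so that a person can intervene."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)


def load_rows(rows, validate):
    """Rule 1: keep loading good rows; set bad rows aside with their error."""
    loaded, rejected = [], []
    for row in rows:
        try:
            validate(row)
            loaded.append(row)
        except ValueError as exc:
            rejected.append((row, str(exc)))
    return loaded, rejected


# Hypothetical usage: a fetch that fails twice before succeeding.
attempts = {"n": 0}

def flaky_fetch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("network interrupted")
    return "ok"

def validate(row):
    if row.get("id") is None:
        raise ValueError("missing id")

result = run_with_retry(flaky_fetch, max_attempts=5)
loaded, rejected = load_rows([{"id": 1}, {"id": None}, {"id": 2}], validate)
```

Setting rejected rows aside instead of aborting the run follows the stated principle of accepting as much data as possible.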

Performing natural language processing on the acquired data, wherein the natural language processing includes deduplication, filtering, text classification and summarization of the data;

The NLP toolkit used in this scheme supports natural language processing across multiple languages and also includes the machine learning algorithms and data sets for these tasks; it implements data processing such as automatic deduplication, automatic filtering, text classification of natural language and keyword extraction.

This scheme implements information deduplication in two respects:

URL-based deduplication: this is carried out while Internet information is being fetched.

Content-based deduplication: duplicates are eliminated according to the web page content.
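The two deduplication passes can be sketched together. The patent does not specify how URLs are normalized or how content is compared, so the normalization rules and the whitespace-insensitive MD5 fingerprint below are assumptions; the function names are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case scheme and host, default the path to '/', drop the fragment."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def content_fingerprint(text):
    """Exact-match fingerprint after collapsing whitespace and case."""
    collapsed = " ".join(text.split()).lower()
    return hashlib.md5(collapsed.encode("utf-8")).hexdigest()


def deduplicate(records):
    """Keep the first record for each URL and for each content fingerprint."""
    seen_urls, seen_content, kept = set(), set(), []
    for url, text in records:
        u, f = normalize_url(url), content_fingerprint(text)
        if u in seen_urls or f in seen_content:
            continue
        seen_urls.add(u)
        seen_content.add(f)
        kept.append((url, text))
    return kept


records = [
    ("HTTP://Example.com/a#top", "Hello  World"),
    ("http://example.com/a", "different text"),   # same URL after normalization
    ("http://example.com/b", "hello world"),      # same content after normalization
    ("http://example.com/c", "new text"),
]
kept = deduplicate(records)
```

An exact fingerprint only catches identical content; near-duplicate detection would need a different technique, which the source does not describe.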

Text classification consists of two parts. The first part is text training: the training data is trained with different algorithms to obtain different trained models, and the models generated by training are stored in the selected container. The second part classifies text data according to the trained models obtained.

There are two methods of text training: linear classification and KNN classification. For both, the text analysis program reads the files ending in .data in the selected container and trains on them (the file name corresponds to the class, for example Sport.data). After the container and the training algorithm have been selected, the training task is submitted. When training ends, a model.gz file is generated in the selected container for use in text classification testing. The text clustering function likewise requires selecting a container holding the data to be clustered and a clustering algorithm. For the Brown word clustering algorithm, the clustering program reads the txt files (the format can be changed) in the selected container and then performs cluster analysis on them.
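The train-then-classify flow for KNN can be sketched with the standard library alone. This is a minimal sketch under assumptions: whitespace tokenization, bag-of-words vectors and cosine similarity are choices made here, and the in-memory training list stands in for the per-class .data files and the model.gz artifact, which are not reproduced.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (whitespace tokenization is an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_texts):
    """KNN 'training' just stores the vectorized examples with their labels."""
    return [(vectorize(text), label) for text, label in labelled_texts]


def classify(model, text, k=3):
    """Majority vote among the k most similar training examples."""
    query = vectorize(text)
    nearest = sorted(model, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


model = train([
    ("the team won the football match", "sport"),
    ("a great goal in the football game", "sport"),
    ("the player scored in the match", "sport"),
    ("the bank raised interest rates", "finance"),
    ("stock market prices fell today", "finance"),
    ("the bank reported market profit", "finance"),
])
sport_label = classify(model, "football match goal")
finance_label = classify(model, "bank interest market")
```

For Chinese text the whitespace tokenizer would have to be replaced by a word segmenter, since the text analysis described in this scheme depends on sentence and word segmentation.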

Automatic summarization is an important form of information extraction and mainly takes two approaches: statistics-based and understanding-based. The statistics-based approach derives a digest by pattern matching according to a cue word dictionary, word frequencies, and the statistical regularities of words and sentences; the understanding-based approach uses syntactic and semantic knowledge to extract a digest on the basis of understanding the content of the article. This scheme combines the two approaches, applying statistical and linguistic knowledge together to process documents and provide a high-quality digest system.
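The statistics-based approach can be sketched as word-frequency sentence scoring with a cue word bonus. The stop word list, the scoring formula and the function name `summarize` are assumptions for illustration; the understanding-based approach is not covered by this sketch.

```python
import re
from collections import Counter

# Small illustrative stop word list (an assumption, not from the source).
STOPWORDS = {"the", "a", "an", "and", "was", "of", "to", "in", "is"}


def content_words(sentence):
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2, cue_words=()):
    """Score each sentence by average word frequency plus one point per cue word,
    then return the top sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        if not words:
            return 0.0
        base = sum(freq[w] for w in words) / len(words)
        bonus = sum(1 for w in words if w in cue_words)
        return base + bonus

    top = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(top[:max_sentences])]


text = "The crawler downloads pages. The weather was nice. The crawler scores pages."
digest = summarize(text, max_sentences=2)
```

Sentences that share the document's most frequent content words rank highest, so the off-topic sentence about the weather is left out of the digest.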

Establishing a data analysis model system, the data analysis model system including a text analysis model, an information value evaluation model and a deep learning algorithm tuning model;

After Internet content has been acquired by the web crawler and processed by natural language analysis, a series of words, phrases and sentences is obtained, of which only a small number contain value information relevant to the business. By establishing a data analysis model system, comprising data analysis models such as the text analysis model, the value evaluation model, the user interface feedback mechanism and the deep learning algorithm tuning model, a closed loop of the value evaluation process is formed.

Analyzing the processed data using the data analysis model system, and optimizing according to user feedback information.

The text analysis model segments long text content into sentences and words based on natural language processing technology, and obtains information such as the keywords contained in the text, keyword classification matches and word frequency statistics, thereby preparing data for the information value evaluation model in the next step.

Based on the sentence and word segmentation results of natural language processing, classification rules are applied and article information value scores are calculated by keyword matching at the word/phrase level, which yields the classification and value score of each article.

Through the information value scoring process, the article under analysis has been split into sentences and the information value of each sentence has been scored quantitatively. The next step is to screen the scored sentences, select the key sentences that meet the user's business needs, form an article information digest, and output it to the user as the final result.
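Keyword-based scoring and threshold screening of sentences can be sketched as follows. The keyword base scores, the threshold value and the function names are hypothetical; the patent does not disclose the actual scoring rules.

```python
# Hypothetical keyword base scores and decision threshold (not from the source).
KEYWORD_BASE_SCORES = {"tender": 5.0, "procurement": 4.0, "budget": 3.0}
VALUE_THRESHOLD = 4.0


def score_sentence(sentence, scores=KEYWORD_BASE_SCORES):
    """Sum the base scores of the keywords that occur in the sentence."""
    tokens = sentence.lower().replace(".", " ").replace(",", " ").split()
    return sum(scores.get(token, 0.0) for token in tokens)


def evaluate_article(sentences, threshold=VALUE_THRESHOLD):
    """Return the article's total value score and the sentences kept for the digest."""
    scored = [(s, score_sentence(s)) for s in sentences]
    total = sum(value for _, value in scored)
    digest = [s for s, value in scored if value >= threshold]
    return total, digest


article = [
    "The city opened a tender for road procurement.",
    "The budget was discussed.",
    "It rained all day.",
]
total_score, key_sentences = evaluate_article(article)
```

Here the first sentence scores 9.0 and is kept, the second scores 3.0 and falls below the threshold, and the third scores 0.0.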

Each time user feedback is received, the deep learning algorithm tuning model is started to iteratively correct the keyword base score evaluation system and the value information decision threshold in the value evaluation model. A neural network is trained iteratively on each batch of samples, and the multiplier and the base score are trained N times according to the proportion of positive and negative samples to complete the tuning, continually approaching the expected value.
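The feedback loop can be illustrated with a deliberately simple stand-in: a single logistic unit whose weights play the role of keyword base scores and whose bias plays the role of the decision threshold. This is a swapped-in technique chosen for illustration, not the network described in the patent; the learning rate, the epoch count and the feedback samples are assumptions, and the multiplier and the positive/negative sample proportion are not modeled.

```python
import math


def predict(weights, bias, tokens):
    """Probability that a text containing these keywords is valuable."""
    z = sum(weights.get(t, 0.0) for t in tokens) + bias
    return 1.0 / (1.0 + math.exp(-z))


def tune(weights, bias, feedback, epochs=200, learning_rate=0.5):
    """Repeat N = epochs passes of stochastic gradient descent over the batch.

    feedback: list of (keywords, label) pairs, label 1 = user marked as valuable.
    """
    weights = dict(weights)
    for _ in range(epochs):
        for tokens, label in feedback:
            error = predict(weights, bias, tokens) - label
            for t in tokens:
                weights[t] = weights.get(t, 0.0) - learning_rate * error
            bias -= learning_rate * error
    return weights, bias


# Hypothetical user feedback batch.
feedback_batch = [
    (("tender", "road"), 1),
    (("weather", "road"), 0),
    (("tender", "budget"), 1),
    (("weather", "sports"), 0),
]
tuned_weights, tuned_bias = tune({}, 0.0, feedback_batch)
p_valuable = predict(tuned_weights, tuned_bias, ("tender", "road"))
p_noise = predict(tuned_weights, tuned_bias, ("weather", "road"))
```

After tuning, keywords that users confirmed as valuable carry positive weight and keywords from rejected items carry negative weight, so later scores move toward the users' expectations.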

Referring to Fig. 2, which shows a schematic flow chart of the method according to a preferred embodiment of the present invention: in this embodiment the data source is set to a target website, and data is acquired from the target website by Internet crawler technology. The acquired data is processed, including big data ETL data warehouse technology and natural language processing. The processed data is then analyzed: an analysis model and an information value evaluation model are first established, and the data analyzed by the models is pushed to the user. According to the user's feedback, a second optimization is performed using a deep learning algorithm, and the optimized information is pushed to the user again; the keyword base score evaluation system and the value information decision threshold in the value evaluation model are corrected iteratively. A neural network is trained iteratively on each batch of samples, and the multiplier and the base score are trained N times according to the proportion of positive and negative samples to complete the tuning, continually approaching the expected value.

Referring to Fig. 3, which shows a schematic flow chart according to a preferred embodiment of the present invention, the steps of the Internet crawler are specifically:

Creating a uniform resource locator (URL) list file;

Adding entry URL addresses according to the URL list file to generate a list to be downloaded;

Extracting the list to be downloaded to generate download tasks;

Downloading specific page content from the World Wide Web according to the download tasks;

Analyzing and extracting the page content and judging whether the page content contains uniform resource locators; if so, returning to the second step; otherwise, establishing an index for retrieval according to the page content and completing data acquisition.
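The crawl loop in these steps can be sketched as a breadth-first traversal. This is not Nutch code: the in-memory `FAKE_WEB` dictionary stands in for real HTTP downloads, the regular expression is a simplistic link extractor, and all names are hypothetical.

```python
import re
from collections import deque

# Hypothetical in-memory "web" standing in for real downloads.
FAKE_WEB = {
    "http://a.example/": '<a href="http://a.example/p1">p1</a> <a href="http://a.example/p2">p2</a>',
    "http://a.example/p1": '<a href="http://a.example/p2">p2</a> body one',
    "http://a.example/p2": "body two, no links",
}

LINK_RE = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch):
    """Seed list -> download queue -> fetch -> parse -> enqueue newly found URLs."""
    to_download = deque(seed_urls)   # the list to be downloaded
    seen = set(seed_urls)            # avoids fetching the same URL twice
    index = {}                       # url -> content, a stand-in for the index
    while to_download:
        url = to_download.popleft()  # take the next download task
        content = fetch(url)
        if content is None:          # download failed or page unknown
            continue
        index[url] = content
        for link in LINK_RE.findall(content):
            if link not in seen:     # new URL found: back to the queueing step
                seen.add(link)
                to_download.append(link)
    return index


pages = crawl(["http://a.example/"], FAKE_WEB.get)
```

The loop ends when no downloaded page yields a URL that has not already been queued, which corresponds to the branch that builds the index and finishes acquisition.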

Referring to Fig. 4, which shows embodiment one according to a preferred embodiment of the present invention, an application scenario of data acquisition:

(1) Technical implementation of the web crawler

The target websites in this scheme are numerous and the structure of their content is complex and varied, so a distributed crawler is needed to acquire the data. This scheme uses the open-source Nutch distributed crawler technology, relies on the MapReduce of Hadoop to acquire data from websites at large scale, and adds its own protocol handling and data processing methods through the plugin mechanism of Nutch.

(2) Automatic extraction of web page content

Web pages generally contain content such as advertisements, copyright information and scripts. Intelligent web content extraction technology can efficiently extract the effective information in a web page, distinguish items such as the title and body text, automatically merge multiple web pages whose content is continuous, automatically extract web forum information, and so on.

(3) Web page body text extraction

Automatic body text extraction means automatically analyzing a disorderly page structure and extracting the body text part.

(4) Conversion among multiple character set encodings

The character set of a website depends on factors such as the system used by the website's staff and the design at the time the website was built. Common character sets include GB2312 and UTF-8. Unicode is an international standard for encoding, displaying and converting characters. It covers the languages of the Americas, Europe, the Middle East, Africa, India and the Asia-Pacific region. Websites in some countries use Unicode encoding in order to support global access, and such encodings can be identified and converted automatically.
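Identification and conversion of page encodings can be sketched with trial decoding. This is a heuristic chosen for illustration, not the patent's detection method; the candidate order and the function name `to_unicode` are assumptions.

```python
def to_unicode(raw, candidates=("utf-8", "gb2312", "gbk")):
    """Decode bytes to a Unicode string by trying candidate encodings in order.

    Returns (text, encoding_used). UTF-8 is tried first because it is strict:
    bytes in another encoding rarely form valid UTF-8 by accident.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


gb_text, gb_encoding = to_unicode("数据".encode("gb2312"))
utf_text, utf_encoding = to_unicode("数据".encode("utf-8"))
utf8_bytes = gb_text.encode("utf-8")   # re-encode once the text is Unicode
```

Once every page is held as a Unicode string, later steps such as segmentation and keyword matching no longer depend on the encoding of the source website.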

It should be appreciated that the embodiments of the present invention may be realized or implemented by computer hardware, by a combination of hardware and software, or by computer instructions stored in a non-transitory computer-readable memory. The method may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and drawings described in the specific embodiments. Each program may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, if desired, the program may be implemented in assembly or machine language. In any case, the language may be a compiled or interpreted language. In addition, the program may run on an application-specific integrated circuit programmed for this purpose.

In addition, the operations of the processes described herein may be performed in any suitable order, unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described herein (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (for example, executable instructions, one or more computer programs or one or more applications) executing collectively on one or more processors, by hardware, or by a combination thereof. The computer program includes a plurality of instructions executable by one or more processors.

Further, the method may be implemented in any type of suitable computing platform to which it is operably connected, including but not limited to a personal computer, a minicomputer, a mainframe, a workstation, a networked or distributed computing environment, a separate or integrated computer platform, or a platform in communication with a charged particle tool or other imaging device. Aspects of the present invention may be implemented as machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into the computing platform, such as a hard disk, an optically readable and/or writable storage medium, RAM or ROM, so that it can be read by a programmable computer; when the storage medium or device is read by the computer, it can be used to configure and operate the computer to perform the processes described herein. In addition, the machine-readable code, or portions of it, may be transmitted over a wired or wireless network. The invention described herein includes these and other different types of non-transitory computer-readable storage media when such media contain instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques of the invention.

A computer program can be applied to input data to perform the functions described herein, thereby transforming the input data to generate output data that is stored in non-volatile memory. The output information can also be applied to one or more output devices such as a display. In a preferred embodiment of the present invention, the transformed data represents physical and tangible objects, including a particular visual depiction of the physical and tangible objects produced on a display.

The above are only preferred embodiments of the present invention, and the invention is not limited to the above embodiments; any implementation that achieves the technical effects of the invention by the same means shall fall within the scope of protection of the invention. Within the scope of protection of the invention, the technical solutions and/or embodiments may have various modifications and variations.

Claims

1. A data filtering and mining method, characterized by comprising the following steps:

S100, performing data acquisition on a target website using Nutch distributed crawler technology;

S200, performing big data processing and natural language processing on the acquired data, wherein the big data processing includes data extraction, verification, loading, storage and computation, and the natural language processing includes deduplication, filtering, text classification and summarization of the data;

S300, establishing a data analysis model system, the data analysis model system including a text analysis model, an information value evaluation model and a deep learning algorithm tuning model;

S400, analyzing the processed data using the data analysis model system, and optimizing according to user feedback information.

2. The data filtering and mining method according to claim 1, characterized in that S100 includes:

S101, creating a uniform resource locator (URL) list file;

S102, adding entry URL addresses according to the URL list file to generate a list to be downloaded;

S103, extracting the list to be downloaded to generate download tasks;

S104, downloading specific page content from the World Wide Web according to the download tasks;

S105, analyzing and extracting the page content and judging whether the page content contains uniform resource locators; if so, returning to step S102; otherwise, establishing an index for retrieval according to the page content and completing data acquisition.

3. The data filtering and mining method according to claim 2, characterized in that, in step S105, extracting the page content includes:

automatic extraction of web page content, which distinguishes the title and body text in a web page, automatically merges multiple web pages whose content is continuous, and automatically extracts web forum information;

web page body text extraction, which automatically analyzes a disorderly page structure and extracts the body text;

conversion among multiple character set encodings, which identifies and converts characters in encodings including but not limited to GB2312 and UTF-8 by means of Unicode encoding.

4. The data filtering and mining method according to claim 1, characterized in that the deduplication of the data includes deduplication based on uniform resource locators and deduplication based on web page content.

5. The data filtering and mining method according to claim 1, characterized in that the text classification includes:

training on the data with a specified algorithm to obtain different trained models, which are placed in a selected container;

classifying text data according to the trained models obtained.

6. The data filtering and mining method according to claim 5, characterized in that the specified algorithm includes a linear classification algorithm, a k-nearest-neighbor (KNN) classification algorithm and the Brown word clustering algorithm.

7. The data filtering and mining method according to claim 1, characterized in that the summarization of the data includes:

statistics-based information extraction, which derives a digest by pattern matching according to a cue word dictionary, word frequencies, and the statistical regularities of words and sentences;

understanding-based information extraction, which uses syntactic and semantic knowledge to extract a digest on the basis of understanding the content of the article.

8. The data filtering and mining method according to claim 1, characterized in that the text analysis model segments long text content into sentences and words based on natural language processing technology, and obtains the keywords contained in the text, keyword classification matches and word frequency statistics, thereby preparing data for the information value evaluation model.

9. The data filtering and mining method according to claim 1, characterized in that the information value evaluation model takes the data processed by the text analysis model and, based on the sentence and word segmentation results of natural language processing, applies classification rules and calculates article information value scores by keyword matching at the word/phrase level, thereby classifying different articles and scoring their value;

according to key sentences that meet the user's business needs, the scored sentences are screened to obtain an article information digest, which is sent to the user.

10. The data filtering and mining method according to claim 1, characterized in that the deep learning algorithm tuning model is used to iteratively correct, according to the user feedback information received, the keyword base score evaluation system and the value information decision threshold in the value evaluation model;

a neural network is trained iteratively on each batch of samples, and the multiplier and the base score are trained a predetermined number of times according to the proportion of positive and negative samples to complete the tuning, so as to approach the expected value, wherein the predetermined number of times can be set by the user.