CN108280213A

CN108280213A - Big data analysis system

Info

Publication number: CN108280213A
Application number: CN201810096704.1A
Authority: CN
Inventors: 李永敢
Original assignee: Foshan Gaicheng Intellectual Property Service Co Ltd
Current assignee: Foshan Gaicheng Intellectual Property Service Co Ltd
Priority date: 2018-01-31
Filing date: 2018-01-31
Publication date: 2018-07-13

Abstract

An embodiment of the present invention provides a big data analysis system. The method comprises the steps of: collecting web page data according to a preset data collection rule; filtering and normalizing the collected web page data to obtain screened data; classifying the screened data with a preset classification model to obtain K classes of data; and clustering the K classes of data with a preset clustering model and, according to the classification and clustering results, storing the data uniformly and building an index to form a large database. Embodiments of the present invention can improve the effectiveness of data analysis.

Description

Big data analysis system

Technical field

The present invention relates to the field of electronic technology, and in particular to a big data analysis system.

Background art

With the arrival of the cloud era, big data has attracted more and more attention. The term is commonly used to describe the large volumes of unstructured and semi-structured data that a company creates; loading such data into a relational database for analysis costs excessive time and money. Big data analysis is often associated with cloud computing, because real-time analysis of large data sets requires a framework such as MapReduce to distribute work across tens, hundreds, or even thousands of computers.

In 2016, China's big data industry maintained rapid growth. Governments at all levels and enterprises promoted it vigorously, technological innovation achieved clear breakthroughs, big data applications gained good momentum, an industrial system began to take shape, and supporting capabilities steadily strengthened. Looking ahead to 2017, the big data industry was expected to enter a "golden period": the distinctive development of industrial clusters would advance further, innovation-driven growth would become the main theme of industry development, and the integration of big data into applications would accelerate, providing new momentum for strengthening the digital economy and for the transformation and upgrading of traditional industries.

Big data applications and visions of the future follow closely on "Internet+" and aim to make people's lives more convenient. Envisioned uses include: connecting people through social networks and community culture, in the spirit of "six degrees of separation"; integrating big data with education, on which a country's hopes rest, to support rational early education that benefits individuals, society and the country; helping to control patients' conditions and the onset of disease and, in combination with medical platforms, predicting how an existing lifestyle affects an individual, providing precise medical assistance and helping the elderly reach medical care; reducing the impact of natural disasters on people and on the ecological environment by predicting their occurrence, in the manner of the "butterfly effect"; from a developer's perspective, integrating user data, adapting to market changes, and guessing what "you" like in order to develop applications that meet user needs; and combining big data with face recognition and face analysis so that dynamic advertising becomes fully "automatic" rather than "manual", emphasizing a new social mode in which people guess what "you" like. Big data will be applied more and more widely. How to obtain, master, extract and integrate big data touches every aspect of people's future lives: whoever masters big data masters the future.

Information extraction is an emerging research field. It generally refers to the process of automatically identifying preset types of information, such as entities, relations and events, in a given document collection, and of storing and managing that information in structured form. Information extraction has important applications in many fields.

Information extraction (IE) structures the information contained in text into a table-like organizational form. The input to an information extraction system is raw text, and the output is information points in a fixed format. Extracting information points from a wide variety of documents and then integrating them in a unified form is the main task of information extraction.

The benefit of integrating information in a unified form is that it is easy to inspect and compare, for example when comparing different job postings or product listings. Another benefit is that the data can be processed automatically, for example by using data mining methods to discover and interpret models of the data.

Information extraction technology does not attempt to understand an entire document; it analyzes only the parts of the document that contain relevant information. Which information is relevant depends on the domain fixed when the system is designed. Information extraction is very useful for extracting specific facts from a large number of documents, and the Internet is exactly such a document collection. Online, information on the same subject is usually scattered across different websites and presented in different forms. Collecting this information and storing it in structured form would be beneficial. Because the main carrier of online information is text, information extraction technology is vital for people who treat the Internet as a source of knowledge. An information extraction system can be regarded as a system that converts information from different documents into database records. A successful information extraction system would therefore turn the Internet into a huge database.

In today's increasingly information-based and networked society, how to find the information one needs, and how to classify, filter or extract useful information, has always been a pressing practical problem. Accordingly, the theories, technologies, tools and systems that help people search for, classify and store information have been continually developed and updated, and remain vigorous. In recent years, a technology called information extraction has gradually attracted attention. It is expected to become a widely used practical information technology and to be of great use in people's daily work and life.

In recent years, with the development of networks, there is more and more information on the Internet. Almost all network information is presented to users as structured or semi-structured text. Web page information extraction extracts the relevant information contained in web pages and structures it into a table-like organizational form. Its main task is to extract predetermined information points from a wide variety of web pages and then integrate them in a unified form for easy inspection and comparison.

On the Internet, information on the same subject is usually scattered across different websites and presented in different forms, so in the prior art it is difficult to mine the expected web content completely. In addition, information on the Internet is reprinted frequently, so normalizing duplicate information is also a key problem.

Summary of the invention

The purpose of embodiments of the present invention is to provide a big data analysis system that improves the effectiveness of data analysis.

To achieve the above purpose, an embodiment of the present invention provides a big data analysis system, comprising:

a collection module, for collecting web page data according to a preset data collection rule;

a screening module, for filtering and normalizing the collected web page data to obtain screened data;

a classification module, for classifying the obtained screened data with a preset classification model to obtain K classes of data;

an analysis module, for analyzing and computing over the K classes of data, feeding data collection rules back to the collection module, and generating a push-mode scheme from the results of the big data analysis.

Optionally, the collection module is specifically configured to: customize the web pages from which data is collected according to the target; and, according to the web page structure, determine the main-body data block of the web page, automatically generate a web data extraction template, and extract the web page data.

Optionally, the system further comprises:

a coding module, for encoding each segment of text in the screened data, comparing the segments according to their codes, and determining the degree of data duplication; duplicate data is then normalized and the data is screened.

Optionally, the screening method dynamically screens data using standard quantization parameters. The method takes full account of, and applies, the characteristics of the data (its quantity, its dynamic nature, and its conformity to a statistical probability distribution), and can screen out of massive quantized data the data that satisfies the standard-quantization-parameter screening conditions.
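The text does not define the standard quantization parameters or the screening conditions. As one possible reading only, the sketch below keeps a running mean and sample standard deviation of a numeric feature and accepts a value when it lies within k standard deviations of the current mean, so the threshold moves as data arrives. The class name DynamicScreen, the parameters k and warmup, and the choice of a k-sigma rule are all assumptions made for illustration.

```python
import math


class DynamicScreen:
    """Accept values within k sample standard deviations of the running mean.

    Hypothetical illustration: the patent does not specify this rule.
    """

    def __init__(self, k: float = 2.0, warmup: int = 5) -> None:
        self.k = k
        self.warmup = max(warmup, 2)  # a sample std needs at least 2 values
        self.n = 0        # number of values observed so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def accept(self, x: float) -> bool:
        """Decide on x using the statistics seen so far, then update them."""
        if self.n < self.warmup:
            ok = True  # not enough data yet to estimate a distribution
        else:
            std = math.sqrt(self.m2 / (self.n - 1))
            ok = abs(x - self.mean) <= self.k * std
        # Welford's online update; every observed value moves the threshold.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return ok


if __name__ == "__main__":
    screen = DynamicScreen(k=2.0, warmup=5)
    values = [10, 11, 9, 10, 10, 100, 10]
    print([v for v in values if screen.accept(v)])
```

Because rejected values still update the statistics, a sustained shift in the data eventually moves the acceptance band; updating only on accepted values would give a stricter, less adaptive screen.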

Optionally, the classification module is specifically configured to: according to the classification and clustering results, classify the K classes of data, cluster the data contained in each data class, store the data uniformly and build an index to form a large database.

Optionally, according to the classification results, the database is divided into two levels, topic and data class, and two kinds of cluster analysis are performed on this basis.

Optionally, according to the classification results, the database can be subdivided into four levels (topic, topic cluster, data class and data class cluster), and four kinds of cluster analysis are performed on this basis.

Optionally, the collection module is configured to:

filter the collected web page data using a Bloom filter.

Beneficial effects:

An embodiment of the present invention provides a big data analysis system comprising: a collection module, for collecting web page data according to a preset data collection rule; a screening module, for filtering and normalizing the collected web page data to obtain screened data; a classification module, for classifying the obtained screened data with a preset classification model to obtain K classes of data; and an analysis module, for analyzing and computing over the K classes of data, feeding data collection rules back to the collection module, and generating a push-mode scheme from the results of the big data analysis. The effectiveness of data analysis can thereby be improved.

Description of the drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a first schematic flow diagram of the big data analysis system.

Fig. 2 is a second schematic flow diagram of the big data analysis system.

Fig. 3 is a third schematic flow diagram of the big data analysis system.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

The present invention is described in detail below through specific embodiments.

Referring to Fig. 1, a schematic flow diagram of the big data analysis system provided by the present invention, the system comprises:

a collection module 100, for collecting web page data according to a preset data collection rule;

a screening module 110, for filtering and normalizing the collected web page data to obtain screened data;

a classification module 120, for classifying the obtained screened data with a preset classification model to obtain K classes of data;

an analysis module 130, for analyzing and computing over the K classes of data, feeding data collection rules back to the collection module, and generating a push-mode scheme from the results of the big data analysis.
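The data flow through the four modules can be sketched as a chain of plain functions. This is a minimal illustration under stated assumptions, not the claimed implementation: the collection rule is modelled as a set of keywords, the preset classification model as a keyword set per class, and the feedback step simply adds the name of the largest class to the rule. All function names and these modelling choices are hypothetical.

```python
from collections import Counter


def collect(pages: list[str], rule: set[str]) -> list[str]:
    """Collection module: keep pages that mention any keyword in the rule."""
    return [p for p in pages if any(k in p.lower() for k in rule)]


def screen(pages: list[str]) -> list[str]:
    """Screening module: normalize case and whitespace, drop exact repeats."""
    seen: set[str] = set()
    kept: list[str] = []
    for page in pages:
        norm = " ".join(page.lower().split())
        if norm and norm not in seen:
            seen.add(norm)
            kept.append(norm)
    return kept


def classify(pages: list[str], model: dict[str, set[str]]) -> dict[str, list[str]]:
    """Classification module: assign each page to the class whose keywords
    overlap it most (ties go to the first class in the model)."""
    classes: dict[str, list[str]] = {label: [] for label in model}
    for page in pages:
        words = set(page.split())
        label = max(model, key=lambda name: len(model[name] & words))
        classes[label].append(page)
    return classes


def analyze(classes: dict[str, list[str]], rule: set[str]) -> tuple[Counter, set[str]]:
    """Analysis module: count pages per class and return a fed-back rule.

    The feedback policy (add the largest class name) is an assumption.
    """
    counts = Counter({label: len(docs) for label, docs in classes.items()})
    largest = max(counts, key=counts.get)
    return counts, rule | {largest}


if __name__ == "__main__":
    pages = [
        "Cloud  storage prices fall",
        "cloud storage prices fall",
        "Football cup final tonight",
        "Local weather report",
    ]
    rule = {"cloud", "football"}
    model = {"tech": {"cloud", "storage"}, "sports": {"football", "cup"}}
    classes = classify(screen(collect(pages, rule)), model)
    counts, new_rule = analyze(classes, rule)
    print(classes, counts, sorted(new_rule))
```

Here K is the number of classes in the model (two in the usage example); the returned rule would be passed to the next collection round.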

Optionally, the collection module 100 is specifically configured to: customize the web pages from which data is collected according to the target; and, according to the web page structure, determine the main-body data block of the web page, automatically generate a web data extraction template, and extract the web page data.
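How the main-body data block is determined from the page structure is not specified. One common heuristic, used here purely as an assumption, is to take the block-level element that directly holds the most visible text. The sketch below does this with Python's standard html.parser module; the class name BlockTextCollector and the set of tags treated as blocks are illustrative, and template generation is not covered.

```python
from html.parser import HTMLParser


class BlockTextCollector(HTMLParser):
    """Accumulate visible text per block element (assumes well-nested HTML)."""

    BLOCK_TAGS = {"div", "article", "section", "td"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self.stack: list[list[str]] = []  # text of open blocks, innermost last
        self.blocks: list[str] = []       # text of blocks already closed
        self.skip_depth = 0               # > 0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.stack.append([])

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS and self.stack:
            self.blocks.append(" ".join(self.stack.pop()))

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth and self.stack:
            # Credit the text to the innermost open block only.
            self.stack[-1].append(text)


def extract_body_block(html: str) -> str:
    """Return the text of the block with the most visible text, or ''."""
    parser = BlockTextCollector()
    parser.feed(html)
    parser.close()
    return max(parser.blocks, key=len, default="")


if __name__ == "__main__":
    page = (
        '<html><body><div id="nav">Home | About</div>'
        '<div id="main"><p>Big data analysis needs clean input text from many pages.</p>'
        "<script>var x = 1;</script></div><div>Footer</div></body></html>"
    )
    print(extract_body_block(page))
```

A production extractor would also need to handle malformed markup and nested blocks whose text should be summed; this sketch only shows the selection idea.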

Optionally, the system further comprises:

a coding module 140, for encoding each segment of text in the screened data, comparing the segments according to their codes, and determining the degree of data duplication; duplicate data is then normalized and the data is screened.

Optionally, the collection module 100 is configured to:

filter the collected web page data using a Bloom filter.
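A Bloom filter answers "has this item probably been seen before?" in fixed memory, with possible false positives but no false negatives. The sketch below is a minimal standard-library version applied to page URLs. The bit-array size, the number of hash functions, the use of SHA-256 with double hashing, and the choice to key on URLs are all assumptions; the text does not say what is hashed or how.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Fixed-size Bloom filter over strings (illustrative parameters)."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive the bit positions from two halves of one SHA-256 digest
        # (double hashing); this is a design choice of the sketch.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def filter_new_pages(urls: Iterable[str], seen: BloomFilter) -> Iterator[str]:
    """Yield only URLs the filter has (probably) not seen, recording each."""
    for url in urls:
        if url not in seen:
            seen.add(url)
            yield url


if __name__ == "__main__":
    seen = BloomFilter()
    urls = ["http://a.example/1", "http://a.example/2", "http://a.example/1"]
    print(list(filter_new_pages(urls, seen)))
```

A false positive here means a genuinely new page is occasionally skipped, so the bit-array size and hash count should be chosen from the expected number of pages and the acceptable miss rate.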

Presetting the data sources focuses collection on the web pages the user expects, so that web data extraction is more targeted, which helps improve collection efficiency. Supplementing the data sources with collection points can in turn improve the recall of data collection. Used together, data sources and collection points allow collection efficiency and recall to reach a fairly satisfactory balance.

In this embodiment, the web page data is uniformly encoded, duplicate data is normalized, and the data is screened. Referring to Fig. 3, this specifically includes:

S301: encode each segment of text;

S302: compare the segments according to their codes and determine the degree of data duplication;

S303: normalize duplicate data and screen the data.

Encoding the text segment by segment and comparing the segments can effectively reveal the degree of text duplication and avoid omissions.
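Steps S301 to S303 can be sketched as follows. The encoding scheme and the definition of duplication degree are not given in the text, so this sketch assumes that segments are sentence-like pieces, that each code is a truncated SHA-256 hash of the normalized segment, and that duplication degree is the share of the shorter document's codes also found in the other document. The function names and the 0.8 threshold are hypothetical.

```python
import hashlib
import re


def encode_segments(text: str) -> list[str]:
    """S301: split text into sentence-like segments and encode each one."""
    segments = [s.strip() for s in re.split(r"[.!?;\n]+", text) if s.strip()]
    # Normalize case and whitespace so lightly edited reprints still match.
    return [
        hashlib.sha256(" ".join(s.lower().split()).encode("utf-8")).hexdigest()[:16]
        for s in segments
    ]


def duplication_degree(codes_a: list[str], codes_b: list[str]) -> float:
    """S302: share of the shorter document's segment codes found in the other."""
    if not codes_a or not codes_b:
        return 0.0
    set_a, set_b = set(codes_a), set(codes_b)
    return len(set_a & set_b) / min(len(set_a), len(set_b))


def normalize_duplicates(docs: list[str], threshold: float = 0.8) -> list[str]:
    """S303: keep the first document of each group of near-duplicates."""
    kept: list[tuple[str, list[str]]] = []
    for doc in docs:
        codes = encode_segments(doc)
        if all(duplication_degree(codes, other) < threshold for _, other in kept):
            kept.append((doc, codes))
    return [doc for doc, _ in kept]


if __name__ == "__main__":
    docs = [
        "Big data is growing. Cloud helps. MapReduce scales.",
        "big data is growing.  Cloud helps! MapReduce scales.",
        "Weather is fine today. No rain.",
    ]
    print(normalize_duplicates(docs))
```

Comparing per-segment codes rather than one hash of the whole page lets a reprint that adds or drops a few sentences still register as largely duplicated.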

In this embodiment, according to the classification and clustering results, the data is stored uniformly and indexed to form a large database. This is specifically divided into:

clustering the N data classes;

clustering the data contained in each data class.
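The two clustering passes above can be sketched with a single greedy routine applied at both levels. The preset clustering model is not specified, so this sketch substitutes a simple single-pass clustering on Jaccard similarity of word sets; the function names, the 0.3 threshold, and the use of pooled class text to compare data classes are assumptions.

```python
from typing import Callable, Hashable


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two word sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(
    items: list[Hashable], text_of: Callable[[Hashable], str], threshold: float = 0.3
) -> list[list[Hashable]]:
    """Single-pass clustering: an item joins the first cluster whose seed
    item is similar enough, otherwise it starts a new cluster."""
    clusters: list[tuple[set[str], list[Hashable]]] = []
    for item in items:
        words = set(text_of(item).lower().split())
        for seed_words, members in clusters:
            if jaccard(seed_words, words) >= threshold:
                members.append(item)
                break
        else:
            clusters.append((words, [item]))
    return [members for _, members in clusters]


def cluster_classes(classified: dict[str, list[str]], threshold: float = 0.3) -> list[list[str]]:
    """First pass: cluster the data classes, comparing their pooled text."""
    return greedy_cluster(
        list(classified), lambda label: " ".join(classified[label]), threshold
    )


def cluster_within_classes(
    classified: dict[str, list[str]], threshold: float = 0.3
) -> dict[str, list[list[str]]]:
    """Second pass: cluster the documents contained in each data class."""
    return {
        label: greedy_cluster(docs, lambda doc: doc, threshold)
        for label, docs in classified.items()
    }


if __name__ == "__main__":
    classified = {
        "sports": ["football match result", "football match report", "tennis open final"],
        "finance": ["stock market falls"],
    }
    print(cluster_classes(classified))
    print(cluster_within_classes(classified))
```

Any clustering algorithm with the same input and output shape could replace greedy_cluster; the point of the sketch is only the two-level structure.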

Further, filtering the collected web page data includes:

filtering the collected web page data using a Bloom filter.

According to the classification results, the database is divided into two levels, topic and data class. The two kinds of cluster analysis performed on this basis can subdivide the database into four levels (topic, topic cluster, data class and data class cluster), and an indexing mechanism is further established so that users can manage, retrieve and use the database more conveniently.
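The four-level organization lends itself to an index keyed on each level. The sketch below is one hypothetical way to build such an index in memory: every record carries a value for each of the four levels, and an inverted index maps each (level, value) pair to record ids so that queries can combine levels. The class name HierarchicalIndex and the field names are assumptions, not terms from the text.

```python
from collections import defaultdict


class HierarchicalIndex:
    """In-memory inverted index over the four levels (illustrative only)."""

    LEVELS = ("topic", "topic_cluster", "data_class", "data_class_cluster")

    def __init__(self) -> None:
        self._records: list[dict] = []
        self._index: dict[tuple[str, str], set[int]] = defaultdict(set)

    def add(self, record: dict) -> int:
        """Store a record and index it under each of the four levels."""
        record_id = len(self._records)
        self._records.append(record)
        for level in self.LEVELS:
            self._index[(level, record[level])].add(record_id)
        return record_id

    def query(self, **criteria: str) -> list[dict]:
        """Return records matching all given level=value criteria."""
        ids = set(range(len(self._records)))
        for level, value in criteria.items():
            if level not in self.LEVELS:
                raise KeyError(f"unknown level: {level}")
            ids &= self._index.get((level, value), set())
        return [self._records[i] for i in sorted(ids)]


if __name__ == "__main__":
    index = HierarchicalIndex()
    index.add({"topic": "health", "topic_cluster": "t1", "data_class": "news",
               "data_class_cluster": "c1", "text": "flu season starts"})
    index.add({"topic": "health", "topic_cluster": "t1", "data_class": "forum",
               "data_class_cluster": "c2", "text": "flu remedies thread"})
    index.add({"topic": "sports", "topic_cluster": "t2", "data_class": "news",
               "data_class_cluster": "c1", "text": "cup final tonight"})
    print(index.query(topic="health", data_class="news"))
```

Using .get on the index avoids creating empty entries for values that were never stored, so repeated failed lookups do not grow the index.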

It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.

The embodiments in this specification are described in a related manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention falls within its scope of protection.

Claims

1. A big data analysis system, characterized by comprising:

2. The big data analysis system according to claim 1, characterized in that the collection module is specifically configured to: customize the web pages from which data is collected according to the target; and, according to the web page structure, determine the main-body data block of the web page, automatically generate a web data extraction template, and extract the web page data.

3. The big data analysis system according to claim 1, characterized in that the system further comprises:

4. The big data analysis system according to claim 1, characterized in that the screening method dynamically screens data using standard quantization parameters; the method takes full account of, and applies, the characteristics of the data (its quantity, its dynamic nature, and its conformity to a statistical probability distribution), and can screen out of massive quantized data the data that satisfies the standard-quantization-parameter screening conditions.

5. The big data analysis system according to claim 1, characterized in that the classification module is specifically configured to: according to the classification and clustering results, classify the K classes of data, cluster the data contained in each data class, store the data uniformly and build an index to form a large database.

6. The big data analysis system according to claim 5, characterized in that, according to the classification results, the database is divided into two levels, topic and data class, and two kinds of cluster analysis are performed on this basis.

7. The big data analysis system according to claim 5, characterized in that, according to the classification results, the database can be subdivided into four levels (topic, topic cluster, data class and data class cluster), and four kinds of cluster analysis are performed on this basis.

8. The big data analysis system according to claim 1, characterized in that the collection module is configured to:

filter the collected web page data using a Bloom filter.