CN108280213A - A kind of analysis system of big data - Google Patents

A kind of analysis system of big data

Info

Publication number
CN108280213A
Authority
CN
China
Prior art keywords
data
analysis system
analysis
big data
big
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201810096704.1A
Other languages
Chinese (zh)
Inventor
李永敢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Gaicheng Intellectual Property Service Co Ltd
Original Assignee
Foshan Gaicheng Intellectual Property Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Gaicheng Intellectual Property Service Co Ltd
Priority to CN201810096704.1A
Publication of CN108280213A
Legal status: Withdrawn

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present invention provides a big data analysis system. The method comprises the steps of: collecting web page data according to preset data collection rules; filtering and normalizing the collected web page data to obtain screened data; classifying the screened data with a preset classification model to obtain K classes of data; clustering the K classes of data with a preset clustering model; and, according to the classification and clustering results, storing the data in a unified manner and building an index to form a large database. Embodiments of the invention can improve the effectiveness of data analysis.

Description

A big data analysis system
Technical field
The present invention relates to the field of electronic technology, and in particular to a big data analysis system.
Background
With the arrival of the cloud era, big data has attracted more and more attention. The term "big data" is commonly used to describe the large volumes of unstructured and semi-structured data that a company creates; loading such data into a relational database for analysis costs too much time and money. Big data analysis is often associated with cloud computing, because real-time analysis of large data sets needs a framework such as MapReduce to distribute the work across tens, hundreds, or even thousands of computers.
In 2016, China's big data industry kept developing at high speed: governments at all levels and enterprises pushed it forward vigorously, technological innovation achieved clear breakthroughs, the promotion of big data applications gained good momentum, an industrial system began to take shape, and supporting capabilities grew steadily. Looking ahead to 2017, the big data industry was expected to enter a "golden period": industrial clusters would develop further distinctive characteristics, innovation would become the main driver of industry development, and the fusion of big data with applications would accelerate, providing new impetus for strengthening the digital economy and for the transformation and upgrading of traditional industries.
Big data applications and the vision for the future follow closely on "Internet Plus" and aim to make people's lives more convenient. In social contact, future interpersonal connections extend from social networks and community culture to "six degrees of separation". In education, on which the country's hopes rest, integrating big data with education enables sensible early education that helps the individual and contributes to society and the country. In medicine, big data helps control a patient's condition and its onset; combined with a medical platform, it predicts how an existing lifestyle affects the individual, supports precise medical assistance, and helps the elderly reach diagnosis and treatment. For natural disasters, it reduces their impact on humans and on the ecological environment by predicting their occurrence, in the manner of the "butterfly effect". From a developer's perspective, integrating user data makes it possible to adapt to market changes, infer user needs ("guess what you like"), and develop applications that meet those needs. Combining big data with face recognition and face analysis allows dynamic advertising that is fully automatic rather than manual, and emphasizes a new social mode built on guessing what people like. Big data will be applied ever more widely in the future; how to obtain, master, extract, and integrate big data touches every aspect of people's future lives. Whoever masters big data masters the future.
Information extraction is an emerging research field. It generally refers to the process of automatically identifying preset types of information, such as entities, relations, and events, in a given document collection, and of storing and managing that information in structured form. Information extraction has important applications in many fields.
Information extraction (IE) structures the information contained in text into a table-like organization. The input to an information extraction system is raw text, and the output is information points in a fixed format. Information points are extracted from a variety of documents and then integrated in a unified form. This is the main task of information extraction.
The benefit of integrating information in a unified form is that it is easy to inspect and compare, for example when comparing different job postings or product listings. Another advantage is that the data can be processed automatically, for example by using data mining methods to discover and interpret models of the data.
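As a rough illustration of turning raw text into fixed-format information points, the following Python sketch extracts a few fields from job postings into uniform records. The field names, the regular expressions, and the sample texts are hypothetical and chosen only for this example; the source does not specify any extraction rules.

```python
import re

# Hypothetical field patterns for illustration only; a real system would
# define its own patterns for each target domain.
FIELD_PATTERNS = {
    "title": re.compile(r"Position:\s*(.+)"),
    "city": re.compile(r"Location:\s*(.+)"),
    "salary": re.compile(r"Salary:\s*(\d+)"),
}


def extract_record(text):
    """Turn one raw text into a fixed-format record (missing fields become None)."""
    record = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[field] = match.group(1).strip() if match else None
    return record


# Made-up sample inputs.
ads = [
    "Position: Data Engineer\nLocation: Foshan\nSalary: 15000",
    "Hiring now! Position: Analyst\nSalary: 12000",
]

# Every record has the same keys, so the rows can be inspected and compared.
table = [extract_record(ad) for ad in ads]
```

Because every record shares the same keys, the resulting rows can be compared side by side or loaded into a database table, which is the benefit the passage describes.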
Information extraction does not attempt to understand an entire document; it analyzes only the parts of the document that contain relevant information. Which information counts as relevant depends on the domain fixed when the system is designed. Information extraction is very useful for pulling specific, needed facts out of large document collections, and the Internet is such a collection. Online, information on the same subject is usually scattered across different websites and presented in different forms. Collecting this information and storing it in structured form would be beneficial. Because the main carrier of online information is text, information extraction is vital to those who treat the Internet as a knowledge source. An information extraction system can be regarded as a system that converts information from different documents into database records. A successful information extraction system would therefore turn the Internet into a huge database.
In an increasingly information-based and networked society, how to find the information one needs, and how to classify, filter, or extract useful information, has long been a pressing practical problem. Accordingly, the theories, technologies, tools, and systems that help people search for, classify, and store information have kept developing and renewing themselves, and remain vigorous. In recent years the technique called information extraction has gradually attracted attention. It is expected to become a widely used practical information technology and to be of great use in people's daily work and life.
In recent years, with the development of networks, there is more and more information on the Internet. Almost all online information is presented to users as structured or semi-structured text. Web page information extraction pulls the relevant information out of web pages and structures it into a table-like organization. Its main task is to extract predetermined information points from a variety of web pages and integrate them in a unified form for easy inspection and comparison.
On the Internet, information on the same subject is usually scattered across different websites and presented in different forms, so in the prior art it is difficult to mine the desired web content completely. In addition, information on the Internet is reprinted frequently, so how to normalize duplicated information is also a key problem.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a big data analysis system that improves the effectiveness of data analysis.
To achieve the above purpose, an embodiment of the present invention provides a big data analysis system, comprising:
a collection module, configured to collect web page data according to preset data collection rules;
a screening module, configured to filter and normalize the collected web page data to obtain screened data;
a classification module, configured to classify the screened data with a preset classification model to obtain K classes of data;
an analysis module, configured to analyze the K classes of data, feed the results back to the collection module to adjust the data collection rules, and generate a push scheme from the results of the big data analysis.
Optionally, the collection module is specifically configured to: customize the web pages to be collected according to the target; and determine the main-body data block of a web page according to the page structure, automatically generate a web page data extraction template, and extract the web page data.
Optionally, the system further comprises:
a coding module, configured to encode each text segment of the screened data, compare the segments according to their codes to determine the degree of data repetition, and normalize duplicate data to screen the data.
Optionally, the screening method dynamically screens data using standard quantization parameters. The method takes full account of the volume of the data, its dynamic nature, and its conformity to statistical probability distributions, and can select from massive quantized data the data that meets the screening conditions set by the standard quantization parameters.
Optionally, the classification module is specifically configured to: classify the K classes of data according to the classification and clustering results, cluster the data contained in each data class, store the data in a unified manner, and build an index to form a large database.
Optionally, according to the classification results, the database is divided into two levels, topic and data class, and two kinds of cluster analysis are carried out on that basis.
Optionally, according to the classification results, the database can be subdivided into four levels, namely topic, topic cluster, data class, and data class cluster, and four kinds of cluster analysis are carried out on that basis.
Optionally, the collection module is configured to:
filter the collected web page data using a Bloom filter.
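The source does not define the "standard quantization parameters" used for dynamic screening. One possible reading, shown here purely as an assumption, is a threshold recomputed from the statistics of each batch: values are kept when they lie within k standard deviations of the batch mean. The function name `dynamic_screen`, the parameter `k`, and the sample values are all hypothetical.

```python
import statistics


def dynamic_screen(values, k=2.0):
    """Keep values within k population standard deviations of the batch mean.

    Assumption: the threshold is derived from the current batch, so it moves
    as the data changes. The patent text does not give a concrete rule.
    """
    if len(values) < 2:
        return list(values)
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return list(values)
    return [v for v in values if abs(v - mean) <= k * spread]


# Made-up batch with one obvious outlier.
batch = [10, 11, 9, 10, 12, 10, 100]
screened = dynamic_screen(batch)
```

With this batch the outlier 100 falls outside two standard deviations and is dropped, while the remaining values are kept.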
Advantageous effects:
An embodiment of the present invention provides a big data analysis system comprising: a collection module, configured to collect web page data according to preset data collection rules; a screening module, configured to filter and normalize the collected web page data to obtain screened data; a classification module, configured to classify the screened data with a preset classification model to obtain K classes of data; and an analysis module, configured to analyze the K classes of data, feed the results back to the collection module to adjust the data collection rules, and generate a push scheme from the results of the big data analysis. The effectiveness of data analysis can therefore be improved.
Description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a first schematic flow diagram of the big data analysis system.
Fig. 2 is a second schematic flow diagram of the big data analysis system.
Fig. 3 is a third schematic flow diagram of the big data analysis system.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The present invention is described in detail below through specific embodiments.
Referring to Fig. 1, which is a schematic flow diagram of the big data analysis system provided by the present invention, the system comprises:
a collection module 100, configured to collect web page data according to preset data collection rules;
a screening module 110, configured to filter and normalize the collected web page data to obtain screened data;
a classification module 120, configured to classify the screened data with a preset classification model to obtain K classes of data;
an analysis module 130, configured to analyze the K classes of data, feed the results back to the collection module to adjust the data collection rules, and generate a push scheme from the results of the big data analysis.
Optionally, the collection module 100 is specifically configured to: customize the web pages to be collected according to the target; and determine the main-body data block of a web page according to the page structure, automatically generate a web page data extraction template, and extract the web page data.
Optionally, the system further comprises:
a coding module 140, configured to encode each text segment of the screened data, compare the segments according to their codes to determine the degree of data repetition, and normalize duplicate data to screen the data.
Optionally, the classification module is specifically configured to: classify the K classes of data according to the classification and clustering results, cluster the data contained in each data class, store the data in a unified manner, and build an index to form a large database.
Optionally, the collection module 100 is configured to:
filter the collected web page data using a Bloom filter.
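The module chain described above (collection, screening, classification, analysis with feedback) can be sketched as four small Python functions. This is a minimal sketch under stated assumptions, not the patented implementation: the keyword table `CLASS_KEYWORDS` stands in for the preset classification model, the URL predicate stands in for the preset collection rule, the "feedback" is simply the name of the sparsest class, and all sample pages are made up.

```python
from collections import Counter, defaultdict

# Hypothetical keyword rules standing in for the preset classification model.
CLASS_KEYWORDS = {
    "jobs": ("hiring", "salary"),
    "products": ("price", "discount"),
}


def collect(pages, rule):
    """Collection module: keep pages whose URL satisfies the preset rule."""
    return [text for url, text in pages.items() if rule(url)]


def screen(texts):
    """Screening module: normalize case and whitespace, drop empties and repeats."""
    seen, kept = set(), []
    for text in texts:
        norm = " ".join(text.lower().split())
        if norm and norm not in seen:
            seen.add(norm)
            kept.append(norm)
    return kept


def classify(texts):
    """Classification module: assign each text to the first class with a keyword hit."""
    classes = defaultdict(list)
    for text in texts:
        label = "other"
        for name, words in CLASS_KEYWORDS.items():
            if any(word in text for word in words):
                label = name
                break
        classes[label].append(text)
    return dict(classes)


def analyze(classes):
    """Analysis module: count items per class and report the sparsest class
    as feedback that the collection rule could use to widen coverage."""
    counts = Counter({name: len(items) for name, items in classes.items()})
    sparsest = min(counts, key=counts.get)
    return counts, sparsest


# Made-up pages; example.com and other.test are placeholder hosts.
pages = {
    "https://example.com/a": "Hiring  engineers, salary negotiable",
    "https://example.com/b": "hiring engineers, salary negotiable",
    "https://example.com/c": "Big discount, price cut today",
    "https://example.com/e": "Price list for spring",
    "https://other.test/d": "Unrelated page",
}
collected = collect(pages, lambda url: url.startswith("https://example.com/"))
screened = screen(collected)
counts, sparsest = analyze(classify(screened))
```

Here the two near-identical hiring pages collapse into one after normalization, so the feedback points at the "jobs" class as the one with the least data.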
Presetting the data sources focuses collection on the web pages the user expects, so that the extraction of web page data is more targeted, which helps improve collection efficiency. Collection points supplement the data sources and can ultimately improve the recall of data collection. Because data sources and collection points complement each other, collection efficiency and recall can reach a reasonable balance.
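The collection module is said to determine the main-body data block from the page structure and derive an extraction template from it. The source gives no algorithm, so the sketch below uses a simple heuristic as an assumption: the `<div>` holding the most direct text is treated as the main body, and its `id` serves as a minimal "template". The class name `BodyBlockFinder` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class BodyBlockFinder(HTMLParser):
    """Collect the text inside each <div> and remember the longest one.

    Heuristic assumption: the block with the most text is the page body.
    """

    def __init__(self):
        super().__init__()
        self.stack = []        # one (div id, text parts) entry per open <div>
        self.best = ("", "")   # (div id, text) of the longest block so far

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append((dict(attrs).get("id", ""), []))

    def handle_data(self, data):
        # Text is credited to the innermost open <div>.
        if self.stack and data.strip():
            self.stack[-1][1].append(data.strip())

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            div_id, parts = self.stack.pop()
            text = " ".join(parts)
            if len(text) > len(self.best[1]):
                self.best = (div_id, text)


# Made-up page with navigation, content, and footer blocks.
sample_html = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="content"><p>Big data analysis often runs on distributed frameworks.</p>
<p>Collected pages are filtered before classification.</p></div>
<div id="footer">Copyright</div>
</body></html>
"""

finder = BodyBlockFinder()
finder.feed(sample_html)
template_id, body_text = finder.best
```

The resulting `template_id` could then be reused to pull the same block from other pages that share the layout, which is one simple way to read "automatically generate an extraction template".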
In this embodiment, the web page data is encoded uniformly and duplicate data is normalized to screen the data. Referring to Fig. 3, this specifically includes:
S301: encode each text segment;
S302: compare segments according to their codes to determine the degree of data repetition;
S303: normalize duplicate data to screen the data.
Encoding the text segment by segment and comparing the segments makes it possible to measure the degree of text repetition effectively and to avoid omissions.
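Steps S301 to S303 can be sketched as follows. The source does not say how segments are encoded or how the degree of repetition is computed, so this example makes assumptions: each paragraph is a segment, its code is a SHA-1 digest of the normalized text, the degree of repetition is the share of the shorter document's codes found in the other document, and a document is treated as a duplicate at a hypothetical threshold of 0.8.

```python
import hashlib


def encode_segments(text):
    """S301: encode each non-empty paragraph as a digest of its normalized form."""
    segments = [" ".join(p.lower().split()) for p in text.split("\n")]
    return [hashlib.sha1(s.encode("utf-8")).hexdigest() for s in segments if s]


def repetition_degree(a, b):
    """S302: share of the shorter document's segment codes found in the other."""
    codes_a, codes_b = set(encode_segments(a)), set(encode_segments(b))
    if not codes_a or not codes_b:
        return 0.0
    return len(codes_a & codes_b) / min(len(codes_a), len(codes_b))


def normalize_duplicates(docs, threshold=0.8):
    """S303: keep the first copy; drop later documents at or above the threshold."""
    kept = []
    for doc in docs:
        if all(repetition_degree(doc, earlier) < threshold for earlier in kept):
            kept.append(doc)
    return kept


# Made-up documents: the second is a reprint with an extra trailing line.
original = "Para one.\nPara two.\nPara three."
reprint = "para one.\nPara  two.\nPara three.\nReprinted from elsewhere."
other = "Totally different text."

degree = repetition_degree(original, reprint)
unique_docs = normalize_duplicates([original, reprint, other])
```

Comparing segment codes instead of whole documents lets a reprint be recognized even when a site adds its own header or attribution line.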
In this embodiment, storing the data in a unified manner and building an index to form a large database according to the classification and clustering results is specifically divided into:
clustering the N data classes;
clustering the data contained in each data class.
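The two clustering passes (over the data classes, then over the data inside each class) can be illustrated with a small greedy scheme. The source does not name a clustering model, so Jaccard similarity over word sets, the threshold of 0.3, and the sample classes are all assumptions made for this sketch.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(items, tokens_of, threshold=0.3):
    """Put each item in the first cluster whose seed is similar enough,
    otherwise start a new cluster seeded by that item."""
    clusters = []  # list of (seed tokens, members)
    for item in items:
        tokens = tokens_of(item)
        for seed, members in clusters:
            if jaccard(tokens, seed) >= threshold:
                members.append(item)
                break
        else:
            clusters.append((tokens, [item]))
    return [members for _, members in clusters]


# Made-up data classes, each holding a few short documents.
data_classes = {
    "python jobs": ["python developer wanted", "senior python developer wanted"],
    "java jobs": ["java developer wanted"],
    "phone deals": ["phone discount today", "laptop discount today"],
}


def words(text):
    return set(text.split())


def class_tokens(name):
    """Represent a data class by the union of the words in its documents."""
    return set().union(*(words(doc) for doc in data_classes[name]))


# Pass 1: cluster the data classes themselves.
class_clusters = greedy_cluster(list(data_classes), class_tokens)

# Pass 2: cluster the documents inside each data class.
doc_clusters = {
    name: greedy_cluster(docs, words) for name, docs in data_classes.items()
}
```

In this toy data the two job classes share enough vocabulary to land in one cluster, while the deals class forms its own.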
Further, filtering the collected web page data includes:
filtering the collected web page data using a Bloom filter.
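A Bloom filter is a compact set-membership structure that never misses an item it has stored but may occasionally report an unseen item as present. The sketch below assumes the filter is used to skip pages whose URL has already been collected, which the source does not state explicitly; the bit-array size, the number of hash functions, and the sample URLs are arbitrary illustrative choices.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: no false negatives, rare false positives."""

    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def filter_seen(urls, bloom):
    """Return the URLs not yet recorded in the filter, recording them as we go."""
    fresh = []
    for url in urls:
        if url not in bloom:
            bloom.add(url)
            fresh.append(url)
    return fresh


# Made-up crawl queue with one repeated URL.
bloom = BloomFilter()
queue = ["https://example.com/a", "https://example.com/b", "https://example.com/a"]
fresh = filter_seen(queue, bloom)
```

The trade-off is memory against accuracy: a larger bit array lowers the false-positive rate, and a false positive here only means that a genuinely new page is occasionally skipped.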
According to the classification results, the database is divided into two levels, topic and data class, and two kinds of cluster analysis are carried out on that basis. The database can be further subdivided into four levels, namely topic, topic cluster, data class, and data class cluster, and an indexing mechanism is then established, so that users can manage, retrieve, and use the database more conveniently.
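One way to picture an index over the four levels is a path-keyed lookup table. The source lists the levels but does not say how they nest, so the ordering used below (topic, topic cluster, data class, data class cluster) simply follows the listing order and is an assumption, as are the class name `HierarchicalIndex` and the sample entries.

```python
from collections import defaultdict


class HierarchicalIndex:
    """Index documents by a four-part path and look them up by any path prefix."""

    # Assumed nesting order; the source only enumerates the four levels.
    LEVELS = ("topic", "topic_cluster", "data_class", "data_class_cluster")

    def __init__(self):
        self.postings = defaultdict(list)  # full path -> document ids

    def add(self, doc_id, topic, topic_cluster, data_class, data_class_cluster):
        path = (topic, topic_cluster, data_class, data_class_cluster)
        self.postings[path].append(doc_id)

    def lookup(self, *prefix):
        """Return ids of all documents whose path starts with the given prefix."""
        return [
            doc_id
            for path, doc_ids in self.postings.items()
            if path[: len(prefix)] == prefix
            for doc_id in doc_ids
        ]


# Made-up entries.
index = HierarchicalIndex()
index.add("doc1", "jobs", "engineering", "python jobs", "senior")
index.add("doc2", "jobs", "engineering", "java jobs", "junior")
index.add("doc3", "shopping", "electronics", "phone deals", "discounts")

all_jobs = index.lookup("jobs")
python_jobs = index.lookup("jobs", "engineering", "python jobs")
```

A query can stop at any level, so the same structure answers both broad requests (everything under a topic) and narrow ones (a single data class cluster).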
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise" and "include", and any variants of them, are intended to be non-exclusive, so that a process, method, article, or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that comprises it.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others.
The above are only preferred embodiments of the present invention and are not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims (8)

1. A big data analysis system, characterized by comprising:
a collection module, configured to collect web page data according to preset data collection rules;
a screening module, configured to filter and normalize the collected web page data to obtain screened data;
a classification module, configured to classify the screened data with a preset classification model to obtain K classes of data;
an analysis module, configured to analyze the K classes of data, feed the results back to the collection module to adjust the data collection rules, and generate a push scheme from the results of the big data analysis.
2. The big data analysis system according to claim 1, characterized in that the collection module is specifically configured to: customize the web pages to be collected according to the target; and determine the main-body data block of a web page according to the page structure, automatically generate a web page data extraction template, and extract the web page data.
3. The big data analysis system according to claim 1, characterized in that the system further comprises:
a coding module, configured to encode each text segment of the screened data, compare the segments according to their codes to determine the degree of data repetition, and normalize duplicate data to screen the data.
4. The big data analysis system according to claim 1, characterized in that the screening method dynamically screens data using standard quantization parameters; the method takes full account of the volume of the data, its dynamic nature, and its conformity to statistical probability distributions, and can select from massive quantized data the data that meets the screening conditions set by the standard quantization parameters.
5. The big data analysis system according to claim 1, characterized in that the classification module is specifically configured to: classify the K classes of data according to the classification and clustering results, cluster the data contained in each data class, store the data in a unified manner, and build an index to form a large database.
6. The big data analysis system according to claim 5, characterized in that, according to the classification results, the database is divided into two levels, topic and data class, and two kinds of cluster analysis are carried out on that basis.
7. The big data analysis system according to claim 5, characterized in that, according to the classification results, the database can be subdivided into four levels, namely topic, topic cluster, data class, and data class cluster, and four kinds of cluster analysis are carried out on that basis.
8. The big data analysis system according to claim 1, characterized in that the collection module is configured to:
filter the collected web page data using a Bloom filter.
CN201810096704.1A 2018-01-31 2018-01-31 A kind of analysis system of big data Withdrawn CN108280213A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810096704.1A CN108280213A (en) 2018-01-31 2018-01-31 A kind of analysis system of big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810096704.1A CN108280213A (en) 2018-01-31 2018-01-31 A kind of analysis system of big data

Publications (1)

Publication Number Publication Date
CN108280213A true CN108280213A (en) 2018-07-13

Family

ID=62807158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810096704.1A Withdrawn CN108280213A (en) 2018-01-31 2018-01-31 A kind of analysis system of big data

Country Status (1)

Country Link
CN (1) CN108280213A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205055A (en) * 2014-06-06 2015-12-30 上海商会网网络信息技术有限公司 Big data analyzing system
CN104899303A (en) * 2015-06-10 2015-09-09 杭州祥声通讯股份有限公司 Cloud big data analysis system applied to rail transportation means
CN106528504A (en) * 2015-09-11 2017-03-22 北京国双科技有限公司 Data screening method and device for social application
CN107038608A (en) * 2017-04-21 2017-08-11 北京恒冠网络数据处理有限公司 A kind of big data analysis system
CN107590181A (en) * 2017-08-01 2018-01-16 佛山市深研信息技术有限公司 A kind of intelligent analysis system of big data
CN107577724A (en) * 2017-08-22 2018-01-12 佛山市高研信息技术有限公司 A kind of big data processing method

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711773A (en) * 2018-12-11 2019-05-03 武汉理工大学 A kind of container cargo based on clustering algorithm flows to flow statistical method
CN109711773B (en) * 2018-12-11 2022-08-26 武汉理工大学 Container cargo flow direction and flow rate statistical method based on clustering algorithm
CN109933705A (en) * 2019-03-22 2019-06-25 国家电网有限公司 A kind of big data platform operation management system
CN109933705B (en) * 2019-03-22 2021-10-19 国家电网有限公司 Big data platform operation and maintenance management system
CN110287054A (en) * 2019-06-28 2019-09-27 李璐昆 IT operation management method and IT operation management device
CN112559828A (en) * 2020-07-08 2021-03-26 北京德风新征程科技有限公司 Big data visual analysis and display component type system and interaction method

Similar Documents

Publication Publication Date Title
Jain et al. An intelligent cognitive-inspired computing with big data analytics framework for sentiment analysis and classification
CN105740228B (en) A kind of internet public feelings analysis method and system
CN107577724A (en) A kind of big data processing method
US10565233B2 (en) Suffix tree similarity measure for document clustering
CN111435344B (en) Big data-based drilling acceleration influence factor analysis model
CN108280213A (en) A kind of analysis system of big data
CN109033387A (en) A kind of Internet of Things search system, method and storage medium merging multi-source data
CN106557558A (en) A kind of data analysing method and device
Dave et al. Different clustering algorithms for Big Data analytics: A review
CN111460323B (en) Focus user mining method and device based on artificial intelligence
CN112148881A (en) Method and apparatus for outputting information
Zhang Application of data mining technology in digital library.
Yassir et al. Sentimental classification analysis of polarity multi-view textual data using data mining techniques.
Jurek-Loughrey et al. Semi-supervised and unsupervised approaches to record pairs classification in multi-source data linkage
CN114996549A (en) Intelligent tracking method and system based on active object information mining
Hardaya et al. Application of text mining for classification of community complaints and proposals
EP3535661A2 (en) A system for managing, analyzing, navigating or searching of data information across one or more sources within a computer or a computer network, without copying, moving or manipulating the source or the data information stored in the source
CN111026940A (en) Network public opinion and risk information monitoring system and electronic equipment for power grid electromagnetic environment
Alfred et al. Data summarization approach to relational domain learning based on frequent pattern to support the development of decision making
Zhao et al. Collecting, managing and analyzing social networking data effectively
CN116842936A (en) Keyword recognition method, keyword recognition device, electronic equipment and computer readable storage medium
Onan Artificial immune system based web page classification
Arif et al. Solving social media text classification problems using code fragment-based XCSR
Zhang et al. Robust social event detection via deep clustering
CN112785156A (en) Industrial leader identification method based on clustering and comprehensive evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20180713