CN107103050A - A kind of big data Modeling Platform and method - Google Patents

A big data modeling platform and method

Info

Publication number
CN107103050A
Authority
CN
China
Prior art keywords
data
submodule
variable
algorithm
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710211258.XA
Other languages
Chinese (zh)
Inventor
林伟豪
李学辉
李敬涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Haitong Constant (dalian) Big Data Technology Co Ltd
Original Assignee
Haitong Constant (dalian) Big Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Haitong Constant (dalian) Big Data Technology Co Ltd filed Critical Haitong Constant (dalian) Big Data Technology Co Ltd
Priority to CN201710211258.XA priority Critical patent/CN107103050A/en
Publication of CN107103050A publication Critical patent/CN107103050A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/248Presentation of query results
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A big data modeling platform and method. The platform comprises: a data assets module for uploading data sources, which updates user data to a cloud platform by manual upload or automatic update and lets users process the data they have uploaded through manual drag-and-drop modeling; a data cleansing module for performing ETL processing on the data source, finding and correcting identifiable errors in data files, including checking data consistency and handling invalid and missing values; a data check module for performing detection and basic statistics on the data; an algorithm module that models massive data using classical classification or clustering algorithms from machine learning and then makes predictions with the model; and a front-end display module for graphically presenting processed or unprocessed data. The invention covers structured data modeling and data presentation, supports self-service intelligent analysis, and allows drag-and-drop data presentation and modeling.

Description

A big data modeling platform and method
"Technical field"
The invention belongs to the technical fields of electronic information and big data, and in particular relates to a big data modeling platform and method for the collection, storage, analysis, and presentation of big data.
"Background"
With the rapid development of the internet, the volume of data produced every day is enormous. Before big data technology appeared, traditional data processing ran into many bottlenecks. First, with a traditional database, a very large data volume causes storage to reach its upper limit; the remedy is to switch to larger hard disks, but the cost of doing so is high. Second, a single computer cannot process large data volumes quickly, so data processing speed also becomes a bottleneck.
At present, big data technology can address many bottlenecks of traditional IT infrastructure, such as poor scalability, poor fault tolerance, low performance, and difficult installation, deployment, and maintenance. Storing data in the Hadoop HDFS distributed file system gives good scalability and high fault tolerance. Computing over large data sets (larger than 1 TB) in parallel with Hadoop MapReduce raises computation speed and performance. The Sqoop component transfers data between traditional databases and Hadoop. However, existing big data technology is not easy for non-technical personnel to use.
"Summary of the invention"
The present invention aims to provide a big data modeling platform and method that covers structured data modeling and data presentation, supports self-service intelligent analysis, and allows drag-and-drop data presentation and modeling, so that a management dashboard and an ad hoc query, analysis, and decision platform that supply a basis for decisions can be produced for business decision-makers in a very short time. The object of the present invention is achieved by the following technical solutions:
A big data modeling platform, comprising:
a data assets module for uploading data sources, which updates user data to a cloud platform by manual upload or automatic update, the user processing the data he or she has uploaded through manual drag-and-drop modeling;
a data cleansing module for performing ETL processing on the data source, finding and correcting identifiable errors in data files, including checking data consistency and handling invalid and missing values;
a data check module for performing detection and basic statistics on the data;
an algorithm module that models massive data using classical classification or clustering algorithms from machine learning and then makes predictions with the model;
a front-end display module for graphically presenting processed or unprocessed data.
As a specific technical solution, the data assets module provides three ways of uploading data: local file upload, underlying data upload, and database upload, where database upload supports three databases: MySQL, Oracle, and SQL Server.
As a specific technical solution, the data cleansing module includes an SQL processing submodule, a sampling submodule, a classified summary submodule, a merge data submodule, a delete duplicates submodule, a data partition submodule, a sorting submodule, a data discretization submodule, a data standardization submodule, a filter variable submodule, a transposition submodule, a field rearrangement submodule, a missing value processing submodule, an outlier processing submodule, a lookup and replace submodule, an insert variable submodule, a weighting submodule, a sample balancing submodule, and a word segmentation parsing submodule. The SQL processing submodule executes SQL statements that the user edits directly. The sampling submodule samples the data using different sampling modes. The classified summary submodule computes the contents of field variables in a table by mean, count, or sum and generates the corresponding label variable columns; the summary variable and the calculation variable are configurable. The merge data submodule appends the data of two tables either by row records or by column variables; when appending row records, the column variable names should be kept consistent, otherwise new variable columns are added. The delete duplicates submodule deletes duplicate contents in the selected variable. The data partition submodule specifies the number or proportion of sample data in the training partition and the test partition. The sorting submodule arranges the contents of the selected variable in ascending or descending order. The data discretization submodule discretizes and classifies the selected continuous variable columns by equal-width binning or equal-frequency binning. The data standardization submodule, which supports numeric columns only, applies to the selected variable columns either 0-1 standardization, whose results fall in the interval [0, 1], or Z standardization, whose results have mean 0 and standard deviation 1. The filter variable submodule deletes the selected variable columns. The transposition submodule transposes all rows and columns of the data, that is, swaps rows and columns. The field rearrangement submodule rearranges the positions of the column variables in the data. The missing value processing submodule deletes the row records in which the selected variable is empty. The outlier processing submodule deletes outliers in a set proportion according to an outlier identification rule; the rules include standard deviation and quantile, that is, data lying beyond a certain multiple of the standard deviation from the mean, or beyond a given quantile, are identified as abnormal. The lookup and replace submodule searches the contents of the selected variable according to set conditions and replaces matches with a target value. The insert variable submodule performs arithmetic on the selected variables to generate a new variable column; the name of the variable column is entered manually in the algorithm box and the arithmetic expression is edited there. The weighting submodule weights the selected variable; the weight value is entered in the weight factor field. The sample balancing submodule searches the selected numeric variable columns for target data according to set conditions and, given an input weight factor, weights the target data. The word segmentation parsing submodule parses the text content of the selected field and generates row records from the parsed terms.
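The two outlier identification rules named above (standard deviation and quantile) can be illustrated with a minimal Python sketch. The function names, the multiplier `k`, and the quantile bounds are illustrative assumptions, not details from the patent, and the sketch only identifies candidate outliers rather than deleting a set proportion of them.

```python
from statistics import mean, pstdev


def std_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical: nothing can be an outlier
    return [v for v in values if abs(v - mu) > k * sigma]


def quantile_outliers(values, lower=0.05, upper=0.95):
    """Return the values outside the [lower, upper] empirical quantile range."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower * (n - 1))]  # value at the lower quantile position
    hi = ordered[int(upper * (n - 1))]  # value at the upper quantile position
    return [v for v in values if v < lo or v > hi]
```

A cleansing node built on either function would then drop the identified rows; how the "set proportion" limits that deletion is left open here because the source does not specify it.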
As a specific technical solution, the data check module includes a data audit submodule, a frequency analysis submodule, and a descriptive statistics submodule. The data audit submodule statistically analyzes and checks sample indicators in the selected variable; the indicators include valid values, invalid values, null values, and their proportions. The frequency analysis submodule counts how often each distinct content occurs in the selected variable. The descriptive statistics submodule computes the mean, mode, median, and total of the specified variable columns.
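As a rough illustration of the data audit indicators (valid, invalid, and null counts with their proportions), the following Python sketch counts each category for one column. The function name `audit` and the caller-supplied `is_valid` predicate are assumptions made for this example; the patent does not describe the validity rule in this form.

```python
def audit(values, is_valid):
    """Count valid, invalid, and null entries of one column, with percentages.

    `is_valid` is a caller-supplied predicate (hypothetical) that can combine
    a type check and a value-range check.
    """
    total = len(values)
    nulls = sum(1 for v in values if v is None)
    valid = sum(1 for v in values if v is not None and is_valid(v))
    invalid = total - nulls - valid

    def pct(count):
        # Percentage of all rows, rounded to two decimals; 0.0 for an empty column.
        return round(100.0 * count / total, 2) if total else 0.0

    return {
        "valid": valid, "valid_pct": pct(valid),
        "null": nulls, "null_pct": pct(nulls),
        "invalid": invalid, "invalid_pct": pct(invalid),
    }
```

For example, `audit([1, None, -5, 7], lambda v: isinstance(v, int) and v >= 0)` treats the negative number as invalid and the `None` as null.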
As a specific technical solution, the algorithm model module includes an Apriori algorithm submodule, a K-means algorithm submodule, a naive Bayes algorithm submodule, a logistic regression algorithm submodule, a ridge regression algorithm submodule, a LASSO algorithm submodule, and a linear regression algorithm submodule. The Apriori algorithm submodule counts frequencies over the contents of associated fields, computes the probabilities of frequent 2-itemsets formed from the contents of the dimension fields, and derives analysis indicators such as support. The K-means algorithm submodule divides the data of the selected fields into n clusters; the number of clusters, the number of iterations, and the random-number parameter are configurable, so that the data converge into clusters. The naive Bayes, logistic regression, ridge regression, LASSO, and linear regression submodules are all used to fit a classification algorithm model with which new data are predicted.
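The support computation for frequent 2-itemsets mentioned for the Apriori submodule can be sketched in plain Python. This is a minimal illustration of the general Apriori idea (prune infrequent single items first, then count pairs), assuming transactions are given as lists of items; the function name and the `min_support` threshold are illustrative and not taken from the patent.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support=0.5):
    """Return {(item_a, item_b): support} for item pairs meeting min_support."""
    n = len(transactions)
    # Pass 1: keep only single items that are frequent on their own.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in item_counts.items() if c / n >= min_support}
    # Pass 2: count pairs built only from frequent items (the Apriori pruning step).
    pair_counts = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}
```

Support here is the fraction of transactions containing both items; other indicators such as confidence could be derived from the same counts.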
As a specific technical solution, the front-end display module lets users run experiments by dragging and dropping visual operation components; data cleansing and algorithm analysis are performed according to the established business model, and the result data set is presented in multi-dimensional visual form through the connected charts of the front-end display module.
As a specific technical solution, the multi-dimensional visual presentation includes: 1. presenting different data structures intuitively with different chart types; 2. displaying multi-dimensional data by drill-down; 3. customizing the presentation of special data with node-customized display forms; 4. presenting data more effectively through the linked display of multiple charts.
A modeling method based on the above big data modeling platform, with the following steps:
First, the data source to be processed is uploaded to the platform using the add-data-source node in data assets, ready for later use.
Then, according to the needs of the business scenario, the data are cleaned using the functional nodes of the data cleansing module, for example using missing value processing to delete data rows in which a field is empty, and using filter variable to retain the fields the business needs and delete the others; this series of operations yields data in the desired format.
Next, if the business scenario has no need for an algorithm, the graphical display of the front-end display module can present the final data directly. The front-end display nodes offer many kinds of charts, and different charts are selected to present the data as required. If an algorithm is needed, an algorithm node must be added.
Finally, once the nodes have been selected, the data source node, the data cleansing node, the algorithm node (if any), and the front-end display node are connected and saved; clicking Run executes the whole flow, and the final data are displayed graphically.
In summary, the present invention is very flexible across the whole data processing flow, and users can complete the workflow that matches their needs. At the data source upload stage, several upload modes are available; at the data cleansing stage, several processing modes are available; the algorithm stage likewise includes several algorithms; and the data presentation stage includes several chart types. The invention supports a variety of mainstream core algorithm libraries out of the box, making big data analysis simpler and more accessible: users with very little knowledge of statistics and data mining can easily use the platform to carry out data mining and modeling analysis on big data.
"Brief description of the drawings"
Fig. 1 is a schematic diagram of importing data from a database into HDFS using Sqoop in the big data modeling platform provided by an embodiment of the present invention.
Fig. 2 is a schematic diagram of uploading a file in the form of a data stream in the big data modeling platform provided by an embodiment of the present invention.
"Embodiment"
The embodiment of the present invention is described further below with reference to the accompanying drawings:
The big data modeling platform provided by this embodiment includes: data assets, data cleansing, data check, algorithm model, and front-end display. Each module is introduced in detail below:
The data assets module is used to upload data sources. It updates user data to the cloud platform by manual upload or automatic update, and users can process the data they have uploaded through manual drag-and-drop modeling.
The data cleansing module performs ETL processing on the data source, finding and correcting identifiable errors in data files, including checking data consistency and handling invalid and missing values.
The data check module performs detection and basic statistics on the data.
The algorithm module models massive data using classical classification or clustering algorithms from machine learning and then makes predictions with the model.
The front-end display module graphically presents processed or unprocessed data, giving the user a more intuitive view.
The data assets module provides three ways of uploading data: local file upload, underlying data upload, and database upload, where database upload supports three databases: MySQL, Oracle, and SQL Server.
The data cleansing module includes modules such as SQL processing, sampling, classified summary, merge data, delete duplicates, data partition, sorting, data discretization, data standardization, filter variable, transposition, field rearrangement, missing value processing, lookup and replace, insert variable, weighting, sample balancing, and word segmentation parsing. SQL processing executes SQL statements that the user edits directly. Sampling samples the data using different sampling modes (1-in-N, random percentage, etc.). Classified summary computes the contents of field variables in a table by mean, count (summary), or sum (total) and generates the corresponding label variable columns; the summary variable and the calculation variable are configurable. Merge data appends the data of two tables either by row records or by column variables; when appending row records, the column variable names should be kept consistent, otherwise new variable columns are added. Delete duplicates deletes duplicate contents in the selected variable. Data partition specifies the number or proportion of sample data in the training partition and the test partition. Sorting arranges the contents of the selected variable in ascending or descending order. Data discretization discretizes and classifies the selected continuous variable columns by equal-width binning (given a bin width) or equal-frequency binning (given a number of bins). Data standardization applies to the selected variable columns (numeric only) either 0-1 standardization (results fall in the interval [0, 1]) or Z standardization (results have mean 0 and standard deviation 1). Filter variable deletes the selected variable columns. Transposition transposes all rows and columns of the data, that is, swaps rows and columns. Field rearrangement rearranges the positions of the column variables in the data. Missing value processing deletes the row records in which the selected variable is empty. Outlier processing deletes outliers in a set proportion according to an outlier identification rule (the rules include standard deviation and quantile, that is, data lying beyond a certain multiple of the standard deviation from the mean, or beyond a given quantile, are identified as abnormal). Lookup and replace searches the contents of the selected variable according to set conditions and replaces matches with a target value. Insert variable performs arithmetic on the selected variables to generate a new variable column (alias); the name of the variable column is entered manually in the algorithm box and the arithmetic expression is edited there. Weighting weights the selected variable; the weight value is entered in the weight factor field. Sample balancing searches the selected numeric variable columns for target data according to set conditions and, given an input weight factor, weights the target data. Word segmentation parsing parses the text content of the selected field and generates row records from the parsed terms.
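The two standardization options described for the data standardization module correspond to the usual min-max and z-score formulas. The Python sketch below is an illustration under that assumption; the function names are hypothetical, and the population standard deviation is used because the text does not say which variant the platform applies.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """0-1 standardization: map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Z standardization: shift and scale to mean 0 and standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column: avoid division by zero
    return [(v - mu) / sigma for v in values]
```

Note that Z standardization fixes the mean and standard deviation but does not by itself change the shape of the distribution.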
The data check module includes data audit, frequency distribution, and descriptive statistics modules. Data audit statistically analyzes and checks sample indicators (valid values, invalid values, null values, and their proportions) in the selected variable. Frequency analysis counts how often each distinct content occurs in the selected variable. Descriptive statistics computes statistics such as the mean, mode, median, and total of the specified variable columns.
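Frequency analysis and descriptive statistics as described above map directly onto Python's standard library. The sketch below is only an illustration of the computed quantities; the function names and the dictionary layout are assumptions, not the platform's interface.

```python
from collections import Counter
from statistics import mean, median, mode


def frequency_table(values):
    """Return (content, count) pairs, most frequent first."""
    return Counter(values).most_common()


def describe(values):
    """Return the mean, mode, median, and total of a numeric column."""
    return {
        "mean": mean(values),
        "mode": mode(values),
        "median": median(values),
        "total": sum(values),
    }
```

For instance, `describe([1, 2, 2, 3, 7])` reports a mean of 3, a mode of 2, a median of 2, and a total of 15.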
The algorithm model module includes modules such as the Apriori algorithm, the K-means algorithm, the naive Bayes algorithm, the logistic regression algorithm, the ridge regression algorithm, the LASSO algorithm, and the linear regression algorithm. The Apriori algorithm counts frequencies over the contents of associated fields, computes the probabilities of frequent 2-itemsets formed from the contents of the dimension fields, and derives analysis indicators such as support. The K-means algorithm divides the data of the selected fields into n clusters; the number of clusters, the number of iterations, and the random-number parameter are configurable, so that the data converge into clusters. Naive Bayes, logistic regression, and similar algorithms are all classification algorithms with a similar basic idea: they fit a classification model with which new data are predicted.
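The three configurable K-means parameters named above (number of clusters, number of iterations, random-number parameter) appear explicitly in the following pure-Python sketch. It is a textbook Lloyd-style K-means written for illustration, assuming points are tuples of numbers; it is not the platform's implementation, and treating the random-number parameter as a seed for choosing initial centroids is an assumption.

```python
import random


def _sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, n_clusters, iterations=10, seed=0):
    """Cluster points into n_clusters groups; returns (labels, centroids)."""
    rng = random.Random(seed)                   # random-number parameter (assumed: seed)
    centroids = rng.sample(points, n_clusters)  # pick distinct initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda c: _sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids
```

A fixed iteration count is used instead of a convergence test to keep the sketch short; a production version would typically stop early once the labels no longer change.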
The front-end display module is a platform that integrates data analysis with presentation. Users run experiments by dragging and dropping visual operation components, so that engineers without a machine learning background can also get started with data mining easily. The platform performs data cleansing and algorithm analysis according to the established business model, and the result data set is presented in multiple dimensions through the connected charts of the front-end display module:
1. Different data structures are presented intuitively with different chart types.
2. Multi-dimensional data are displayed by drill-down.
3. Special data are given customized presentation through node-customized display forms.
4. Data are presented more effectively through the linked display of multiple charts.
Data charts are turned into data queries: the data items interact and link with one another, showing the trends, proportions, and relationships of the data from different angles under different dimensional indicators, helping users identify trends and discover the knowledge and patterns behind the data. Beyond the existing presentation modes such as pie charts, column charts, heat maps, and geographic information maps, data can also be analyzed in a series of graphics through the color, brightness, size, shape, and motion trend of images, helping users uncover associations between data through interaction. Drill-up of data and multi-dimensional parallel analysis are also supported, so that data drive decision-making.
Visualization gives the user an overall view first and then, through zooming and filtering, supplies the deeper detail that people need. The visualization process plays a key role in helping people obtain more complete customer information from big data. Intricate relationships are an important part of many big data scenarios, and social networks are perhaps the most prominent example: understanding the big data information in them through text or tables is extremely difficult, whereas visualization can present the trends and inherent patterns of these networks much more clearly. To depict the relationships between social network users, cloud-based visualization methods are commonly used. A correlation model describes the hierarchical relationships of user nodes in the social network, which shows users' social relationships intuitively. In addition, the visualization process can be parallelized on the cloud-based Hadoop software platform, speeding up the collection of social network big data.
Big data visualization can be realized by many methods, such as displaying data from multiple angles, focusing on dynamic changes in massive data, and filtering information (including dynamic query filtering, star-map display, and tight coupling). The following visualization methods are analyzed and classified according to data type (large-volume data, changing data, and dynamic data):
Treemap: a space-filling visualization method for hierarchical data.
Circle packing: a direct alternative to the treemap. It uses the circle as its basic shape and can nest more circles from higher levels of the hierarchy.
Sunburst: the treemap visualization transformed into a polar coordinate system. The variable parameters change from width and height to radius and arc length.
Parallel coordinates: through visual analysis, lays out multiple data factors of different objects side by side.
Streamgraph: a kind of stacked area chart in which the data unfold around a central axis with a flowing, organic shape.
Circular network diagram: data are arranged around a circle and connected by curves according to their degree of correlation. The correlation between data objects is usually conveyed by line width or color saturation.
The main functional modules of the big data modeling platform are described above. Along with these functions, the present invention also discloses the overall workflow, whose steps are as follows:
First, the data source to be processed is uploaded to the platform using the add-data-source node in data assets, ready for later use.
Then, according to the needs of the business scenario, the data are cleaned using the functional nodes of the data cleansing module, for example using missing value processing to delete data rows in which a field is empty, and using filter variable to retain the fields the business needs and delete the others; this series of operations yields data in the desired format.
Next, if the business scenario has no need for an algorithm, the graphical display of the front-end display module can present the final data directly. The front-end display nodes offer many kinds of charts, and different charts are selected to present the data as required. If an algorithm is needed, an algorithm node such as naive Bayes or linear regression must be added.
Finally, once the nodes have been selected, the data source node, the data cleansing node, the algorithm node (if any), and the front-end display node must be connected and saved; clicking Run executes the whole flow, and the final data are displayed graphically.
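The node-and-connection workflow described in these steps can be pictured as a chain of functions, each taking rows and returning rows. The following Python sketch is a loose analogy under that assumption: the node names (`drop_missing`, `keep_fields`, `text_bars`) and the row-as-dictionary representation are invented for illustration and do not reflect how the platform actually executes a flow.

```python
def drop_missing(field):
    """Cleansing node: drop rows whose `field` is empty (missing value processing)."""
    def node(rows):
        return [r for r in rows if r.get(field) not in (None, "")]
    return node


def keep_fields(fields):
    """Cleansing node: retain only the fields the business needs (filter variable)."""
    def node(rows):
        return [{f: r[f] for f in fields} for r in rows]
    return node


def text_bars(label_field, value_field):
    """Display node: render each row as a crude text bar, standing in for a chart."""
    def node(rows):
        return [f"{r[label_field]} {'#' * int(r[value_field])}" for r in rows]
    return node


def run_flow(rows, nodes):
    """Run the connected nodes in order, feeding each node's output to the next."""
    for node in nodes:
        rows = node(rows)
    return rows
```

A flow without an algorithm node is then simply `run_flow(source_rows, [drop_missing("sales"), keep_fields(["city", "sales"]), text_bars("city", "sales")])`; an algorithm node would be one more function inserted before the display node.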
The function and implementation principle of each module are described in further detail below:
1. Data assets
Because the data sources of the modeling platform are not uniform, different data sources must be converted into a unified one. Data from relational databases such as Oracle, MySQL, and SQL Server, data in file formats such as txt and csv, and new data sources formed by processing existing data sources can all be converted into unified Hive data sources, which supply the platform's workflows with data.
(1) Relational databases are handled with Sqoop. Sqoop imports a table from a database through a MapReduce job, which extracts records from the table row by row and writes them to HDFS, as shown in Fig. 1.
Before the import starts, Sqoop examines the table to be imported via JDBC, retrieving all of its columns and their SQL data types. These SQL types (VARCHAR, INTEGER, etc.) are mapped to Java data types (String, Integer, etc.), and the corresponding Java types hold the field values in the MapReduce application. Sqoop's code generator uses this information to create a class corresponding to the table, which holds the records extracted from it.
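The type-mapping and class-generation idea can be mimicked in a few lines of Python. This is a toy illustration only: it is not Sqoop's code generator, the mapping table contains just the two pairs named in the text, and the function names and the generated class layout are assumptions.

```python
# Only the two mappings mentioned in the text; a real tool covers many more types.
SQL_TO_JAVA = {"VARCHAR": "String", "INTEGER": "Integer"}


def java_fields(columns):
    """Map (column_name, sql_type) pairs to (column_name, java_type) pairs."""
    fields = []
    for name, sql_type in columns:
        try:
            fields.append((name, SQL_TO_JAVA[sql_type.upper()]))
        except KeyError:
            raise ValueError(f"no mapping known for SQL type {sql_type!r}") from None
    return fields


def record_class_source(table, columns):
    """Build the source text of a simple Java class holding one table record."""
    lines = [f"public class {table} {{"]
    lines += [f"  private {java_type} {name};" for name, java_type in java_fields(columns)]
    lines.append("}")
    return "\n".join(lines)
```

Calling `record_class_source("users", [("name", "VARCHAR"), ("age", "INTEGER")])` yields a class with one `String` field and one `Integer` field, which is the shape of record holder the paragraph describes.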
(2) Data in file formats are uploaded in the form of a data stream and processed with Hadoop: the local file is uploaded to HDFS via MapReduce and then stored in Hive under the specified table name and field names, as shown in Fig. 2.
(3) Existing data sources are processed with Hive: a Hive SELECT statement creates a new Hive table from the original data source, producing a new data source; an existing data source can also be used directly.
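Creating a new Hive table from a SELECT over an existing one is commonly written as a `CREATE TABLE ... AS SELECT` statement. The Python helper below merely assembles such a statement as text; the helper name, the identifier check, and the example table names are assumptions for illustration, and the sketch does not connect to Hive or claim to reproduce the statements the platform issues.

```python
import re

# Conservative identifier pattern so that arbitrary text cannot be spliced into the SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _check(name):
    """Return name unchanged if it is a plain identifier, else raise ValueError."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def ctas_statement(new_table, source_table, columns):
    """Build a CREATE TABLE ... AS SELECT statement over the given columns."""
    cols = ", ".join(_check(c) for c in columns)
    return f"CREATE TABLE {_check(new_table)} AS SELECT {cols} FROM {_check(source_table)}"
```

For example, `ctas_statement("orders_2017", "orders", ["id", "amount"])` returns `CREATE TABLE orders_2017 AS SELECT id, amount FROM orders`.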
2. data cleansing
In order to ensure that logical permanent big data Modeling Platform, for the requirement of data consistency, data cleansing work(is provided for this Can, mainly include SQL processing, sampling, Classifying Sum, merging data collection, data partition, sequence, data discrete, data standard, Filter scalar, transposition, field rearrangement, weighting, sample equilibrium etc..
(1) SQL processing creates a new data source from the original data source using a Hive select statement.
(2) Sampling uses Hive to sample the original data source, producing a new data source.
(3) Classified summary groups the data by the summary variable and computes the mean, count, or sum of the calculation variable.
(4) Dataset merging is either row-record appending or column-variable appending; column-variable appending requires a merge variable to be selected. Hive performs the merge and produces the dataset.
(5) Duplicate deletion uses Hive to filter out repeated data according to the deduplication variable.
(6) Data partitioning splits the data according to the specified training sample.
(7) Sorting orders the data source by the processing variable.
(8) Data discretization produces discrete data from the processing variable, forming the result set.
(9) Data standardization applies the selected standardization method to the processing variable and produces a dataset.
(10) Variable filtering deletes the specified variables from the result set, producing a new data source.
(11) Transposition swaps all rows and columns. After transposition, the newly generated columns are named transposition_1, transposition_2, ..., transposition_15.
(12) Field reordering rearranges the order of the specified fields.
(13) Missing-value handling removes data according to the processing variable, producing a new result set.
(14) Find-and-replace searches by processing variable and replaces the values that meet the condition.
(15) Outlier handling processes the processing variable using an exclusion mode and an identification rule.
(16) Variable insertion inserts a new variable based on the original columns.
(17) Weighting applies a weighting factor to the processing variable.
(18) Sample balancing adds a condition on the processing variable; data that meet the condition are transformed according to the factor.
Data cleansing is mainly implemented with Hive SQL. It removes undesirable data, chiefly incomplete, erroneous, and duplicate data, and can also operate on existing data.
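Several of the cleansing operations above are simple column transforms. The following is a minimal sketch in plain Python, assuming a numeric column held as a list; the platform itself performs these steps with Hive SQL, and the function names are hypothetical. It covers 0-1 and Z standardization, equal-width binning, and a standard-deviation outlier rule of the kind the claims describe; equal-frequency binning and quantile rules are not shown.

```python
import statistics


def min_max(values):
    """0-1 standardization: rescale values into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: a constant column maps to 0
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Z standardization: the result has mean 0 and population std 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def equal_width_bins(values, n_bins):
    """Equal-width binning: return a bin index in [0, n_bins) per value."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def std_outliers(values, multiple):
    """Flag values farther than `multiple` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [abs(v - mean) > multiple * std for v in values]


data = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = min_max(data)
standardized = z_score(data)
bins = equal_width_bins(data, 3)
flags = std_outliers(data, 1.5)
```

With this sample column, only the value 100.0 is flagged by the 1.5-standard-deviation rule, and it is also the only value that falls in the last equal-width bin.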
3. data check
To meet the big data modeling platform's data-processing requirements, a data check function is provided. It makes data processing more flexible: data that do not meet the indicators can be designated invalid. It mainly includes data audit, frequency distribution, descriptive statistics, and so on.
(1) Data audit processes the data source with a Hive select statement. Invalid-value detection falls into two classes, field-type detection and numeric detection. The indicators are valid samples, valid sample %, null values, null value %, invalid values, and invalid value %. The data are grouped and summarized by processing variable to produce the percentages and then the result set.
(2) Frequency distribution processes the data source with a Hive select statement, grouping by the processing variable and obtaining the total (count).
(3) Descriptive statistics processes the data source with Hive select statements. It first computes the mode, mean, median, total, maximum, minimum, and range of the processing variable; the standard error of the mean can also be obtained. Using the percentile and percentile_approx functions provided by Hive, it obtains statistics such as quartiles and quintiles.
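The three checks above can be sketched in plain Python as follows. This is an illustration only: the platform runs them as Hive select statements, the function names are hypothetical, and the rule that an unparseable value counts as invalid is an assumption about how field-type detection might work. Quartiles here come from Python's `statistics.quantiles`, whose interpolation may differ from Hive's percentile functions.

```python
import statistics
from collections import Counter


def audit(values):
    """Count valid, null, and invalid entries for a numeric field.

    Assumption: None or "" is a null; any other value that cannot be parsed
    as a float is invalid (a field-type check).
    """
    counts = {"valid": 0, "null": 0, "invalid": 0}
    for v in values:
        if v is None or v == "":
            counts["null"] += 1
            continue
        try:
            float(v)
            counts["valid"] += 1
        except (TypeError, ValueError):
            counts["invalid"] += 1
    total = len(values)
    percents = {}
    if total:
        percents = {k + " %": 100.0 * n / total for k, n in counts.items()}
    return counts, percents


def frequency(values):
    """Frequency distribution: the count of each distinct value."""
    return dict(Counter(values))


def describe(numbers):
    """Basic descriptive statistics for a list of numbers."""
    return {
        "mode": statistics.mode(numbers),
        "mean": statistics.fmean(numbers),
        "median": statistics.median(numbers),
        "total": sum(numbers),
        "max": max(numbers),
        "min": min(numbers),
        "range": max(numbers) - min(numbers),
        "quartiles": tuple(statistics.quantiles(numbers, n=4)),
    }


counts, percents = audit(["1", "2", "", "x", None, "2"])
freq = frequency(["a", "b", "a"])
stats = describe([1, 2, 2, 3, 4])
```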
4. algorithm model
The big data modeling platform is a machine learning algorithm platform built on a distributed computing engine. Users run experiments by dragging and dropping visual components, so engineers without a machine learning background can also get started with data mining easily. The platform provides a rich set of machine learning algorithms, including Apriori, K_means, naive Bayes, logistic regression, ridge regression, LASSO, and linear regression.
(1) The algorithm models are mainly implemented with Spark: the prepared dataset and training data are submitted to a Spark cluster, which processes them efficiently and returns the result set.
(2) The algorithms are implemented in Java. The algorithm programs are packaged as jar files and deployed separately from the platform, which reduces the coupling between the modeling platform and the algorithms; deploying an algorithm does not affect use of the platform.
(3) Concretely, an algorithm runs by submitting a task to the Spark cluster, whose distributed computation makes iterative calculation fast and effective.
The algorithm model module of the big data modeling platform is mainly implemented by programming against the spark mllib api. Spark's in-memory computing engine is fast, and the spark mllib library includes many machine learning algorithms: Apriori, kmeans, naive Bayes, logistic regression, ridge regression, lasso, and others. These algorithms fall mainly into two classes: classification and clustering. Among them, kmeans is a clustering algorithm, and the other algorithms listed above are treated as classification algorithms. The two classes of problem follow different logic in the code, so the technical issues of the algorithm model module are explained below from these two angles.
Clustering algorithm
Clustering, or cluster analysis, has the following core task: divide a set of target objects into several clusters so that the objects within each cluster are as similar as possible and the objects in different clusters are as different as possible. Formally, the clustering problem is this: given a set D of elements, each with n observable attributes, use some algorithm to divide D into k subsets such that the dissimilarity among elements within each subset is as low as possible and the dissimilarity among elements of different subsets is as high as possible. Each subset is called a cluster.
Kmeans is an iterative reassignment clustering algorithm based on squared error. Its core idea is very simple:
(1) Randomly select K center points.
(2) Compute the distance from every point to these K center points, and assign each point to the cluster of its nearest center.
(3) Recompute the centers of the K clusters using the arithmetic mean.
(4) Repeat steps 2 and 3 until the clusters no longer change or the maximum number of iterations is reached.
(5) Output the result.
The quality of Kmeans results depends on the choice of initial cluster centers, and the algorithm easily falls into a local optimum. There is no rule to follow for choosing K, the algorithm is sensitive to abnormal data, it can only handle numeric attributes, and the resulting clusters may be unbalanced.
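The five steps above can be sketched in plain Python as follows. This is a minimal single-machine illustration, not the Spark implementation the platform uses; the function names and the `initial_centers` parameter (added so a run can be made deterministic) are assumptions for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=20, seed=0, initial_centers=None):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: choose K centers at random unless the caller supplies them.
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 4: stop when the clusters no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the arithmetic mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    # Step 5: output the centers and the cluster index of every point.
    return centers, assignment


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (9.0, 9.0), (9.2, 8.8), (8.9, 9.1)]
centers, labels = kmeans(points, k=2, initial_centers=[(1.0, 1.0), (9.0, 9.0)])
```

Passing different `initial_centers`, or relying on the random choice, is an easy way to observe the sensitivity to initialization noted above.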
The Kmeans workflow and technical details are as follows. To obtain the data source, add a data source in data assets and drag it onto the canvas. Then connect data cleansing nodes to perform the necessary ETL processing on the data source so that it meets the requirements of the algorithm part. After a data cleansing node has run, the data it produced are stored in the cluster's Hive data warehouse for the algorithm part to call.
The algorithm part is described in detail here. After the data source and data cleansing nodes are complete, a kmeans algorithm node must be connected: drag the kmeans node onto the canvas and double-click it to open the configuration page. The configuration page contains: 1. which columns to run the kmeans algorithm on, because actual business requirements do not necessarily use all columns; 2. the number of clusters, that is, how many classes the current data source should finally be grouped into; 3. the maximum number of iterations the algorithm may perform; 4. the number of random runs. After configuring, click save, then click run to start the program.
In the back-end code, when the program determines that nodeType (the node type) is K_Means, it enters the stepKmeans method of KmeansServiceImpl. This method first obtains the parameters set on the configuration page and calls the set methods of a KmeansInfo instance with them, so that all the parameters the kmeans algorithm needs are stored in the KmeansInfo instance. It then calls the toKmeansString method to obtain a space-separated parameter string.
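The parameter-collecting step can be sketched as below. The source names the class KmeansInfo and the methods stepKmeans and toKmeansString but does not give their fields or the order of the string; the field names, the configuration keys, and the ordering in this Python sketch are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KmeansInfo:
    """Holds the K-means settings from the configuration page (fields assumed)."""
    table_name: str = ""
    columns: List[str] = field(default_factory=list)
    cluster_count: int = 2
    max_iterations: int = 20
    runs: int = 1

    def to_kmeans_string(self) -> str:
        """Join the parameters with spaces, in an assumed order."""
        parts = [
            self.table_name,
            ",".join(self.columns),
            str(self.cluster_count),
            str(self.max_iterations),
            str(self.runs),
        ]
        return " ".join(parts)


def step_kmeans(node_config: dict) -> str:
    """Copy the page settings into a KmeansInfo and return the argument string."""
    info = KmeansInfo()
    info.table_name = node_config["table"]
    info.columns = list(node_config["columns"])
    info.cluster_count = node_config["k"]
    info.max_iterations = node_config["max_iterations"]
    info.runs = node_config["runs"]
    return info.to_kmeans_string()


args = step_kmeans({
    "table": "t_clean",
    "columns": ["age", "income"],
    "k": 3,
    "max_iterations": 50,
    "runs": 2,
})
```

A single space-separated string is convenient because it can be passed unchanged as the command-line arguments of the algorithm jar.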
Next, the formatTableData method in DataRevert is executed. This method performs feature conversion: data sources inevitably contain strings, whereas the spark kmeans algorithm requires data of type double, so whether this method succeeds is critical to the algorithm.
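The source does not say how formatTableData turns strings into doubles. One common approach is to replace each distinct string in a column with a numeric code, sketched below in Python; the encoding rule and the function name `format_table_data` are assumptions, not the platform's method.

```python
def format_table_data(rows):
    """Convert every cell to float.

    Assumption: a cell that cannot be parsed as a number is replaced by a
    code equal to the order in which its value first appeared in that column
    (0.0, 1.0, ...).
    """
    codes = {}  # column index -> {original value -> numeric code}
    converted = []
    for row in rows:
        new_row = []
        for col, cell in enumerate(row):
            try:
                new_row.append(float(cell))
            except (TypeError, ValueError):
                mapping = codes.setdefault(col, {})
                # len(mapping) is evaluated before the new key is inserted.
                new_row.append(mapping.setdefault(cell, float(len(mapping))))
        converted.append(new_row)
    return converted, codes


rows = [["3", "red"], ["4.5", "blue"], ["1", "red"]]
converted, codes = format_table_data(rows)
```

Index codes impose an arbitrary ordering on categories, which matters for a distance-based algorithm such as kmeans; one-hot encoding is a frequent alternative.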
The algorithm jar is executed in Spark's yarn-client submission mode; the advantage of this mode is that the jar can be run directly without writing a script. KMeansInfo in Co-Insight-mllib.jar is then executed with the parameters passed in from the web side. The main idea is to read the selected fields from the specified Hive table and convert the data into the required vector format. The KMeans.train API is then applied to the training data to generate a KMeansModel. This is the most important step of the whole algorithm, because only once the model exists can its predict method determine which cluster each data point belongs to. The results are stored in HDFS, and a Hive table is then created to read the HDFS data. Finally, for display, the result Hive table and the prediction data are merged and shown, which completes the basic kmeans (clustering) algorithm.
Classification algorithm
What is a classification algorithm? Put simply, it assigns an object with certain characteristics to one category in a known set of categories. Mathematically, it can be defined as follows:
Given the sets C = {y1, y2, ..., yn} and I = {x1, x2, ..., xm, ...}, determine a mapping rule y = f(x) such that for any xi ∈ I there is one and only one yj ∈ C for which yj = f(xi) holds.
Here C is the set of categories, I is the set of objects to be classified, and f is the classifier; the main task of a classification algorithm is to construct the classifier f.
Constructing a classifier usually requires a set of known categories for training, and a trained classifier generally cannot reach 100% accuracy. Classifier quality is often related to factors such as the training data, the validation data, and the training sample size.
For example, when we meet a stranger in daily life, the first thing we do is judge that person's sex, and making this judgment is a classification process. From past experience, hair length, clothing, and build are usually enough to judge a person's sex. Here, "past experience" is a trained model for judging sex, and its training data are the many different people encountered in daily life. Then one day someone walks up to you with long flowing hair and tight clothes but a very masculine build, and you are unsure: your past experience, that is, the trained model, cannot determine this person's sex. You then learn to judge by the Adam's apple, and the quality of your trained model improves. Undeniably, though, there will always be someone whose sex you cannot judge. A model therefore never reaches 100% accuracy; it can only approach 100% as the training data keep growing.
Classification algorithms differ in how spark mllib implements them underneath. When calling the api, only the parameters of the training method differ slightly; the rest of the program logic is essentially the same. The naive Bayes algorithm is described in detail here as an example.
Naive Bayes classification is also called the NB algorithm. Its core idea is very simple: for a given item to be predicted, compute the probability that it belongs to each class, then choose the class with the highest probability as the predicted class. It is as if you estimated that the person in the example above is a woman with probability 40% and a man with probability 41%; you would then judge the person to be a man.
The naive Bayes workflow and technical details are as follows. The classification workflow differs from the clustering workflow: it requires two data sources, one carrying a label column as training data and another whose label column is empty as prediction data. The two data sources must have identical field names and types. The dataset-merging function in data cleansing is then used to merge the two data sources into a single Hive table for subsequent processing; row-record appending is normally used here.
After merging, ETL operations can be applied to clean the data. A naive Bayes algorithm node from the algorithm models is then dragged in and connected. On the configuration page of the naive Bayes node you can select the label column, the columns the algorithm should use, the alpha attribute, and the training data ratio. After configuring, save and run; the web-side execution logic is essentially the same as for the kmeans algorithm.
The algorithm code is in NativeBayes in Co-Insight-mllib.jar. When selecting training data, it splits the previously merged Hive table into the training dataset and the dataset to be predicted according to whether the label column is empty. It then trains a NaiveBayesModel with the NaiveBayes.train method and uses the model's predict method to classify the prediction set. The results are finally stored in a Hive table for the front end to display.
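The split-train-predict flow described above can be sketched in plain Python. This is a small multinomial naive Bayes with additive smoothing `alpha`, written for illustration; it is not the Spark MLlib code the platform calls, the function names are hypothetical, and features are assumed to be non-negative counts.

```python
import math
from collections import defaultdict


def split_by_label(rows):
    """Rows with a label form the training set; rows whose label is None
    form the set to be predicted."""
    train = [(label, feats) for label, feats in rows if label is not None]
    to_predict = [feats for label, feats in rows if label is None]
    return train, to_predict


def train_naive_bayes(train, alpha=1.0):
    """Multinomial naive Bayes with additive smoothing `alpha`."""
    n_features = len(train[0][1])
    class_counts = defaultdict(int)
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    for label, feats in train:
        class_counts[label] += 1
        sums = feature_sums[label]
        for i, value in enumerate(feats):
            sums[i] += value
    model = {}
    for label, count in class_counts.items():
        total = sum(feature_sums[label]) + alpha * n_features
        log_prior = math.log(count / len(train))
        log_likelihood = [math.log((s + alpha) / total) for s in feature_sums[label]]
        model[label] = (log_prior, log_likelihood)
    return model


def predict(model, feats):
    """Return the class with the highest log-posterior score."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(x * w for x, w in zip(feats, log_likelihood))
    return max(model, key=score)


merged = [
    ("spam", [3, 0]), ("spam", [2, 1]),
    ("ham", [0, 3]), ("ham", [1, 2]),
    (None, [4, 0]), (None, [0, 5]),
]
train, to_predict = split_by_label(merged)
model = train_naive_bayes(train, alpha=1.0)
predictions = [predict(model, feats) for feats in to_predict]
```

Working in log space avoids numeric underflow when many feature probabilities are multiplied, and the smoothing term keeps an unseen feature from forcing a class probability to zero.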
In summary, the algorithm module mainly relies on calls to the spark mllib api. The algorithm implementations are highly reusable, development is fast, and model training is efficient and makes good use of cluster resources, so the common algorithm needs are essentially met.
5. front-end display
The big data modeling platform provides a rich set of display tools, so that the insights hidden behind rapidly changing, massive data are presented promptly in a vivid and friendly form. In fields such as transportation and communications, helping business staff discover and diagnose business problems through interactive real-time data visualization has become an increasingly vital part of big data solutions. The displays mainly include table display, column chart, bar chart, line chart, scatter chart, bubble chart, worm hole, geographic distribution, and so on.
(1) Table display shows data in tabular form.
(2) Column charts, bar charts, line charts, pie charts, area charts, scatter charts, and similar charts display data along an X axis and a Y axis.
(3) Ring charts are drawn from a classification variable and a summary variable; the calculation is count or sum.
(4) Radar charts are drawn from a classification variable, comparison parameter 1, and comparison parameter 2; the calculation is sum, maximum, or mean.
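The data preparation behind a ring chart can be sketched as follows: group by the classification variable, apply count or sum to the summary variable, and convert each group to a share of the whole. This Python sketch is an assumption about the shape of that step; the function name `ring_chart_slices` and the row format are hypothetical, and the platform's front end itself is built with jQuery.

```python
from collections import defaultdict


def ring_chart_slices(rows, category_key, value_key=None, how="count"):
    """Return (category, value, share) tuples for a ring chart.

    `how` is "count" (number of rows per category) or "sum" (sum of
    `value_key` per category).
    """
    totals = defaultdict(float)
    for row in rows:
        if how == "count":
            totals[row[category_key]] += 1
        elif how == "sum":
            totals[row[category_key]] += row[value_key]
        else:
            raise ValueError(f"unsupported calculation: {how}")
    grand_total = sum(totals.values())
    return [
        (category, value, value / grand_total if grand_total else 0.0)
        for category, value in totals.items()
    ]


rows = [
    {"region": "north", "sales": 10.0},
    {"region": "south", "sales": 5.0},
    {"region": "north", "sales": 2.5},
]
by_count = ring_chart_slices(rows, "region")
by_sum = ring_chart_slices(rows, "region", "sales", how="sum")
```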
The front-end display is mainly implemented with jQuery.
The big data modeling platform of this application can be used easily by non-technical staff with no knowledge of the underlying big data technology. The platform uses Gooflow flow technology: big data processing is achieved simply by dragging and connecting nodes for data sources, data processing, algorithms, and data display. The platform mainly stores data in the Hive data warehouse, and data processing is implemented directly with Hive SQL statements, which copes well with very large data volumes and performs well. The algorithm part of the platform is implemented with Spark MLlib. Spark's advantage is that the intermediate output of a job can be kept in memory, so HDFS no longer needs to be read and written; computation is in-memory and efficient. Spark's machine learning library covers a wide variety of algorithms, including classification and clustering, that meet users' needs. Creating and connecting data source, data processing, algorithm, and data display nodes yields a one-stop data analysis flow that fulfils the business requirement.
The above embodiments only fully disclose the invention and do not limit it. Any replacement with equivalent technical features that is based on the inventive concept and requires no creative work shall be regarded as within the scope disclosed by this application.

Claims (8)

1. A big data modeling platform, characterized by comprising:
a data assets module for uploading data sources, whereby user data are updated to the cloud platform by manual upload or automatic update, and users process the data they uploaded by manual drag-and-drop modeling;
a data cleansing module for performing ETL processing on data sources, finding and correcting identifiable errors in data files, including checking data consistency and handling invalid values and missing values;
a data check module for performing detection and basic statistics on data;
an algorithm module that models massive data using classical classification or clustering algorithms from machine learning and then uses the model for prediction;
a front-end display module for graphically displaying processed or unprocessed data.
2. The big data modeling platform according to claim 1, characterized in that the data assets module includes three ways of uploading data: local file upload, underlying data upload, and database upload, where database upload supports three databases: MySql, Oracle, and Sqlserver.
3. The big data modeling platform according to claim 2, characterized in that the data cleansing module includes an Sql processing submodule, a sampling submodule, a classified summary submodule, a data merging submodule, a duplicate deletion submodule, a data partition submodule, a sorting submodule, a data discretization submodule, a data standardization submodule, a variable filtering submodule, a transposition submodule, a field reordering submodule, a missing-value handling submodule, an outlier handling submodule, a find-and-replace submodule, a variable insertion submodule, a weighting submodule, a sample balancing submodule, and a word segmentation analysis submodule; the Sql processing submodule is used to edit and execute Sql statements directly; the sampling submodule samples data using different sampling modes; the classified summary submodule computes the contents of field variables in a table by mean, count, or sum and generates the corresponding label variable column, where the summary variable and the calculation variable are configurable; the data merging submodule appends the data of two tables by row-record appending or column-variable appending, where the column variable names must be kept consistent during row-record appending, otherwise new variable columns are added; the duplicate deletion submodule deletes duplicate contents in the selected variable; the data partition submodule specifies the quantity or proportion of sample data in the training set and the test set; the sorting submodule arranges the contents of the selected variable in ascending or descending order; the data discretization submodule discretizes and classifies the selected continuous variable column by equal-width binning or equal-frequency binning; the data standardization submodule, which supports only numeric variable columns, either applies 0-1 standardization, so that the results fall in the interval [0,1], or applies Z standardization, so that the data have mean 0 and standard deviation 1; the variable filtering submodule deletes the selected variable columns; the transposition submodule transposes all rows and columns in the data, that is, performs row-column conversion; the field reordering submodule rearranges the positions of the column variables in the data; the missing-value handling submodule deletes the row records in which a selected variable is empty; the outlier handling submodule deletes outliers in a set proportion according to an outlier identification rule, where the identification rules include standard deviation and quantile, that is, data beyond a certain multiple of the standard deviation from the mean, or beyond a quantile, are identified as abnormal data; the find-and-replace submodule searches the contents of the selected variable according to set conditions and replaces them with a target value; the variable insertion submodule performs arithmetic on selected variables to generate a new variable column, where the name of the variable column is entered manually and the formula is edited in the algorithm box; the weighting submodule weights the selected variable, with the weight value entered as the weight factor; the sample balancing submodule searches for target data in the selected numeric variable column according to set conditions and, given an input weight factor, weights the target data; the word segmentation analysis submodule parses the text content of the selected segmentation field and generates row records from the parsed terms.
4. The big data modeling platform according to claim 3, characterized in that the data check module includes a data audit submodule, a frequency analysis submodule, and a descriptive statistics submodule; the data audit submodule computes and checks sample indicators for the selected variable, the indicators including valid values, invalid values, null values, and their proportions; the frequency analysis submodule counts how often each value of the selected variable occurs; the descriptive statistics submodule computes the mean, mode, median, and total of the specified variable column.
5. The big data modeling platform according to claim 4, characterized in that the algorithm model module includes an Apriori algorithm submodule, a Kmeans algorithm submodule, a naive Bayes algorithm submodule, a logistic regression algorithm submodule, a ridge regression algorithm submodule, a LASSO algorithm submodule, and a linear regression algorithm submodule; the Apriori algorithm submodule counts the frequencies of field contents in combination with an association field, computes the probabilities of frequent 2-itemsets over the contents of the dimension field, and derives analysis indicators such as support; the Kmeans algorithm submodule divides the data of the selected fields into n clusters, with the number of clusters, the number of iterations, and the random count parameter configurable, realizing convergence of the data; the naive Bayes, logistic regression, ridge regression, LASSO, and linear regression algorithm submodules are all used to fit a classification model and predict new data.
6. The big data modeling platform according to claim 5, characterized in that the front-end display module lets visual components be operated by drag and drop to run experiments, performs data cleansing and algorithm analysis according to an established business model, and presents the result dataset in multidimensional visual form through the charts in the connected front-end display module.
7. The big data modeling platform according to claim 6, characterized in that the multidimensional visualization includes: 1. presenting different data structures intuitively with different chart types; 2. displaying multidimensional data by drill-down; 3. presenting special data in customized form with custom display nodes; 4. presenting data better through linked multi-chart display.
8. A modeling method based on the big data modeling platform of claim 1, with the following steps:
First, the data source to be processed is uploaded to the platform using the add-data-source node in data assets, for subsequent use;
Then, according to the needs of the business scenario, the data are cleaned using the functional nodes in the data cleansing module, for example using missing-value handling to delete the data rows in which a field is empty, or using variable filtering to keep the fields the business needs and delete the others, to obtain data in the desired format;
Next, if the business scenario has no need for an algorithm, the charts of the front-end display module can be used directly for the final data display; the front-end functional nodes offer many chart types, and different charts are chosen for data display as required; if an algorithm is needed, an algorithm node must be added;
Finally, once the nodes have been selected, the data source node, the data cleansing node, the algorithm node (if any), and the front-end display node are connected and saved; clicking run executes the whole flow, and the final data are displayed graphically.
CN201710211258.XA 2017-03-31 2017-03-31 A kind of big data Modeling Platform and method Pending CN107103050A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710211258.XA CN107103050A (en) 2017-03-31 2017-03-31 A kind of big data Modeling Platform and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710211258.XA CN107103050A (en) 2017-03-31 2017-03-31 A kind of big data Modeling Platform and method

Publications (1)

Publication Number Publication Date
CN107103050A true CN107103050A (en) 2017-08-29

Family

ID=59676193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710211258.XA Pending CN107103050A (en) 2017-03-31 2017-03-31 A kind of big data Modeling Platform and method

Country Status (1)

Country Link
CN (1) CN107103050A (en)

Cited By (79)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107526832A (en) * 2017-09-05 2017-12-29 江苏电力信息技术有限公司 A kind of method for building the big data business model that technology is pulled based on the page
CN107544450A (en) * 2017-10-11 2018-01-05 齐鲁工业大学 Process industry network model construction method and system based on data
CN107609064A (en) * 2017-08-30 2018-01-19 成都中建科联网络科技有限公司 Rival's intelligent analysis method based on data mining
CN107643970A (en) * 2017-09-13 2018-01-30 曙光信息产业(北京)有限公司 What thermal map configured shows method and shows system
CN107679129A (en) * 2017-09-21 2018-02-09 无线生活(杭州)信息科技有限公司 A kind of big data processing method and processing device
CN107844634A (en) * 2017-09-30 2018-03-27 平安科技(深圳)有限公司 Polynary universal model platform modeling method, electronic equipment and computer-readable recording medium
CN107958268A (en) * 2017-11-22 2018-04-24 用友金融信息技术股份有限公司 The training method and device of a kind of data model
CN108170770A (en) * 2017-12-26 2018-06-15 山东联科云计算股份有限公司 A kind of analyzing and training platform based on big data
CN108229828A (en) * 2018-01-04 2018-06-29 上海电气集团股份有限公司 A kind of analysis system based on industrial data
CN108334501A (en) * 2018-03-21 2018-07-27 王欣 Electronic document analysis system based on machine learning and method
CN108415695A (en) * 2018-01-25 2018-08-17 新智数字科技有限公司 A kind of data processing method, device and equipment based on visualization component
CN108447118A (en) * 2018-03-20 2018-08-24 北京知道创宇信息技术有限公司 Big data method for visualizing, device and the electronic equipment that 3D visions are presented
CN108460087A (en) * 2018-01-22 2018-08-28 北京邮电大学 Heuristic high dimensional data visualization device and method
CN108509485A (en) * 2018-02-07 2018-09-07 深圳壹账通智能科技有限公司 Preprocess method, device, computer equipment and the storage medium of data
CN108595627A (en) * 2018-04-23 2018-09-28 温州市鹿城区中津先进科技研究院 A kind of self-service data analysis Modeling Platform
CN108898426A (en) * 2018-06-14 2018-11-27 上海米飞网络科技有限公司 The visualization system and method for payment data processing classification
CN108959480A (en) * 2018-06-21 2018-12-07 江苏赛睿信息科技股份有限公司 The method and device of stream data realization data visualization
CN108981785A (en) * 2018-06-19 2018-12-11 江苏高远智能科技有限公司 A kind of intelligent Detection of coal breaker equipment safety
CN109063964A (en) * 2018-07-02 2018-12-21 浙江百先得服饰有限公司 A kind of platform data processing system
CN109241107A (en) * 2018-08-03 2019-01-18 北京邮电大学 Big data controlling device based on Hadoop
CN109240163A (en) * 2018-09-25 2019-01-18 南京信息工程大学 Intelligent node and its control method for industrialization manufacture
CN109255524A (en) * 2018-08-16 2019-01-22 广西电网有限责任公司电力科学研究院 A kind of measuring equipment data analyzing evaluation method and system
CN109307811A (en) * 2018-08-06 2019-02-05 国网浙江省电力有限公司宁波供电公司 A kind of user's dedicated transformer electricity consumption monitoring method excavated based on big data
CN109325541A (en) * 2018-09-30 2019-02-12 北京字节跳动网络技术有限公司 Method and apparatus for training pattern
CN109376152A (en) * 2018-09-13 2019-02-22 广州帷策智能科技有限公司 Big data system file data preparation method and system
CN109389143A (en) * 2018-06-19 2019-02-26 北京九章云极科技有限公司 A kind of Data Analysis Services system and method for automatic modeling
CN109558398A (en) * 2018-10-31 2019-04-02 平安医疗健康管理股份有限公司 Data cleaning method and relevant apparatus based on big data
CN109558395A (en) * 2018-10-17 2019-04-02 中国光大银行股份有限公司 Data processing system and data digging method
WO2019062444A1 (en) * 2017-09-26 2019-04-04 深圳市宇数科技有限公司 Data exploring and discovering method and system, electronic device and storage medium
CN109635026A (en) * 2018-11-29 2019-04-16 宝晟(广州)生物信息技术有限公司 A kind of biological sample bank data distributing nodes sharing method, system and device
CN109636482A (en) * 2018-12-21 2019-04-16 苏宁易购集团股份有限公司 Data processing method and system based on similarity model
CN109634941A (en) * 2018-11-14 2019-04-16 金色熊猫有限公司 Medical data processing method, device, electronic equipment and storage medium
CN109657803A (en) * 2018-03-23 2019-04-19 新华三大数据技术有限公司 The building of machine learning model
CN109783859A (en) * 2018-12-13 2019-05-21 重庆金融资产交易所有限责任公司 Model building method, device and computer readable storage medium
CN109800277A (en) * 2018-12-18 2019-05-24 合肥天源迪科信息技术有限公司 A kind of machine learning platform and the data model optimization method based on the platform
CN109947826A (en) * 2019-03-29 2019-06-28 山东浪潮云信息技术有限公司 A method of with big data technology building region portrait analysis model
CN110007989A (en) * 2018-12-13 2019-07-12 国网信通亿力科技有限责任公司 Data visualization platform system
CN110175191A (en) * 2019-05-14 2019-08-27 复旦大学 Data filtering rule modeling method in data analysis
CN110188887A (en) * 2018-09-26 2019-08-30 第四范式(北京)技术有限公司 The data managing method and device of Machine oriented study
CN110245875A (en) * 2019-06-21 2019-09-17 深圳前海微众银行股份有限公司 Risk of fraud appraisal procedure, device, equipment and storage medium
CN110363321A (en) * 2018-03-26 2019-10-22 吕纪竹 A kind of method of real-time prediction big data variation tendency
CN110362605A (en) * 2019-06-04 2019-10-22 苏州神州数码捷通科技有限公司 A kind of E book data verification method based on big data
CN110378569A (en) * 2019-06-19 2019-10-25 平安国际智慧城市科技股份有限公司 Industrial relations chain building method, apparatus, equipment and storage medium
WO2019204975A1 (en) * 2018-04-24 2019-10-31 深圳职业技术学院 Multiparty quantum summation method and system
CN110442620A (en) * 2019-08-05 2019-11-12 赵玉德 A kind of big data is explored and cognitive approach, device, equipment and computer storage medium
CN110502509A (en) * 2019-08-27 2019-11-26 广东工业大学 A kind of traffic big data cleaning method and relevant apparatus based on Hadoop Yu Spark frame
CN110727670A (en) * 2019-10-11 2020-01-24 集奥聚合(北京)人工智能科技有限公司 Data structure prediction transfer and automatic data processing method based on flow chart
CN110850824A (en) * 2019-11-12 2020-02-28 北京矿冶科技集团有限公司 Implementation method for acquiring data of distributed control system to Hadoop platform
CN110909039A (en) * 2019-10-25 2020-03-24 北京华如科技股份有限公司 Big data mining tool and method based on drag type process
CN110908573A (en) * 2019-12-03 2020-03-24 北京明略软件系统有限公司 Algorithm model training method, device, equipment and storage medium
CN110928922A (en) * 2019-11-27 2020-03-27 开普云信息科技股份有限公司 Public policy analysis model deployment method and system based on big data mining
CN110990384A (en) * 2019-11-04 2020-04-10 武汉中卫慧通科技有限公司 Big data platform BI analysis method
CN111080170A (en) * 2019-12-30 2020-04-28 北京云享智胜科技有限公司 Workflow modeling method and device, electronic equipment and storage medium
CN111125052A (en) * 2019-10-25 2020-05-08 北京华如科技股份有限公司 Big data intelligent modeling system and method based on dynamic metadata
CN111177200A (en) * 2019-12-31 2020-05-19 北京九章云极科技有限公司 Data processing system and method
CN111177220A (en) * 2019-12-26 2020-05-19 中国平安财产保险股份有限公司 Data analysis method, device and equipment based on big data and readable storage medium
CN111222833A (en) * 2018-11-27 2020-06-02 中云开源数据技术(上海)有限公司 Algorithm configuration combination platform based on data lake server
CN111367969A (en) * 2020-03-19 2020-07-03 北京三维天地科技股份有限公司 Data mining method and system
CN111399838A (en) * 2020-06-04 2020-07-10 成都四方伟业软件股份有限公司 Data modeling method and device based on Spark SQL and materialized view
CN111537982A (en) * 2020-05-08 2020-08-14 东南大学 Distortion drag array line spectrum feature enhancement method and system
CN111538494A (en) * 2020-07-09 2020-08-14 南京红松信息技术有限公司 Big data automatic modeling and verification engine system and method
CN111654853A (en) * 2020-08-04 2020-09-11 索信达(北京)数据技术有限公司 Data analysis method based on user information
CN111756600A (en) * 2020-06-24 2020-10-09 厦门长江电子科技有限公司 Multi-communication system and method for realizing multiple switch test machines
CN111931945A (en) * 2020-07-31 2020-11-13 北京百度网讯科技有限公司 Data processing method, device and equipment based on label engine and storage medium
CN111949640A (en) * 2020-08-04 2020-11-17 上海微亿智造科技有限公司 Intelligent parameter adjusting method and system based on industrial big data
CN112182333A (en) * 2020-09-25 2021-01-05 山东亿云信息技术有限公司 Talent space-time big data processing method and system based on random forest
CN112214524A (en) * 2020-08-27 2021-01-12 优学汇信息科技(广东)有限公司 Data evaluation system and evaluation method based on deep data mining
CN112308410A (en) * 2020-10-30 2021-02-02 云南电网有限责任公司电力科学研究院 Enterprise asset data management method based on asset classification
CN112328216A (en) * 2020-11-03 2021-02-05 成都中科大旗软件股份有限公司 Method, system, computer device and storage medium for developing data based on canvas nodes
CN112506930A (en) * 2020-12-15 2021-03-16 北京三维天地科技股份有限公司 Data insight platform based on machine learning technology
WO2021047506A1 (en) * 2019-09-11 2021-03-18 中兴通讯股份有限公司 System and method for statistical analysis of data, and computer-readable storage medium
CN112667735A (en) * 2020-12-23 2021-04-16 武汉烽火众智数字技术有限责任公司 Visualization model establishing and analyzing system and method based on big data
CN112685380A (en) * 2020-12-03 2021-04-20 成都大数据产业技术研究院有限公司 Big data value discovery and application innovation platform system
CN113220566A (en) * 2021-04-26 2021-08-06 深圳市云网万店科技有限公司 Interface performance test script generation method and device and computer equipment
CN113468187A (en) * 2021-09-02 2021-10-01 太平金融科技服务(上海)有限公司深圳分公司 Multi-party data integration method and device, computer equipment and storage medium
CN114205164A (en) * 2021-12-16 2022-03-18 北京百度网讯科技有限公司 Traffic classification method and device, training method and device, equipment and medium
CN114254588A (en) * 2021-12-16 2022-03-29 马上消费金融股份有限公司 Data tag processing method and device
CN115345461A (en) * 2022-08-08 2022-11-15 航天神舟智慧系统技术有限公司 Police service efficiency evaluation method and device based on data modeling
CN115357657A (en) * 2022-10-24 2022-11-18 成都数联云算科技有限公司 Data processing method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070203939A1 (en) * 2003-07-31 2007-08-30 Mcardle James M Alert Flags for Data Cleaning and Data Analysis
CN102201037A (en) * 2011-06-14 2011-09-28 中国农业大学 Agricultural disaster forecast method
CN105956015A (en) * 2016-04-22 2016-09-21 四川中软科技有限公司 Service platform integration method based on big data
CN106022477A (en) * 2016-05-18 2016-10-12 国网信通亿力科技有限责任公司 Intelligent analysis decision system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
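Guan Dawei (关大伟): "Data Preprocessing in Data Mining" (数据挖掘中的数据预处理), China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology Series *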

Cited By (103)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609064A (en) * 2017-08-30 2018-01-19 成都中建科联网络科技有限公司 Rival's intelligent analysis method based on data mining
CN107526832A (en) * 2017-09-05 2017-12-29 江苏电力信息技术有限公司 A kind of method for building the big data business model that technology is pulled based on the page
CN107643970A (en) * 2017-09-13 2018-01-30 曙光信息产业(北京)有限公司 What thermal map configured shows method and shows system
CN107679129A (en) * 2017-09-21 2018-02-09 无线生活(杭州)信息科技有限公司 A kind of big data processing method and processing device
WO2019062444A1 (en) * 2017-09-26 2019-04-04 深圳市宇数科技有限公司 Data exploring and discovering method and system, electronic device and storage medium
CN107844634A (en) * 2017-09-30 2018-03-27 平安科技(深圳)有限公司 Polynary universal model platform modeling method, electronic equipment and computer-readable recording medium
CN107544450A (en) * 2017-10-11 2018-01-05 齐鲁工业大学 Process industry network model construction method and system based on data
CN107544450B (en) * 2017-10-11 2019-06-21 齐鲁工业大学 Process industry network model construction method and system based on data
CN107958268A (en) * 2017-11-22 2018-04-24 用友金融信息技术股份有限公司 The training method and device of a kind of data model
CN108170770A (en) * 2017-12-26 2018-06-15 山东联科云计算股份有限公司 A kind of analyzing and training platform based on big data
CN108229828A (en) * 2018-01-04 2018-06-29 上海电气集团股份有限公司 A kind of analysis system based on industrial data
CN108460087A (en) * 2018-01-22 2018-08-28 北京邮电大学 Heuristic high dimensional data visualization device and method
CN108415695A (en) * 2018-01-25 2018-08-17 新智数字科技有限公司 A kind of data processing method, device and equipment based on visualization component
CN108509485A (en) * 2018-02-07 2018-09-07 深圳壹账通智能科技有限公司 Preprocess method, device, computer equipment and the storage medium of data
CN108447118A (en) * 2018-03-20 2018-08-24 北京知道创宇信息技术有限公司 Big data method for visualizing, device and the electronic equipment that 3D visions are presented
CN108334501A (en) * 2018-03-21 2018-07-27 王欣 Electronic document analysis system based on machine learning and method
CN108334501B (en) * 2018-03-21 2021-07-20 王欣 Electronic document analysis system and method based on machine learning
CN109657803B (en) * 2018-03-23 2020-04-03 新华三大数据技术有限公司 Construction of machine learning models
CN109657803A (en) * 2018-03-23 2019-04-19 新华三大数据技术有限公司 The building of machine learning model
CN110363321A (en) * 2018-03-26 2019-10-22 吕纪竹 A kind of method of real-time prediction big data variation tendency
CN110363321B (en) * 2018-03-26 2024-04-19 吕纪竹 Method for predicting big data change trend in real time
CN108595627A (en) * 2018-04-23 2018-09-28 温州市鹿城区中津先进科技研究院 A kind of self-service data analysis Modeling Platform
WO2019204975A1 (en) * 2018-04-24 2019-10-31 深圳职业技术学院 Multiparty quantum summation method and system
CN108898426A (en) * 2018-06-14 2018-11-27 上海米飞网络科技有限公司 The visualization system and method for payment data processing classification
CN113935434A (en) * 2018-06-19 2022-01-14 北京九章云极科技有限公司 Data analysis processing system and automatic modeling method
CN108981785A (en) * 2018-06-19 2018-12-11 江苏高远智能科技有限公司 A kind of intelligent Detection of coal breaker equipment safety
CN109389143A (en) * 2018-06-19 2019-02-26 北京九章云极科技有限公司 A kind of Data Analysis Services system and method for automatic modeling
CN108959480B (en) * 2018-06-21 2020-07-14 江苏赛睿信息科技股份有限公司 Method and device for realizing data visualization of stream data
CN108959480A (en) * 2018-06-21 2018-12-07 江苏赛睿信息科技股份有限公司 The method and device of stream data realization data visualization
CN109063964A (en) * 2018-07-02 2018-12-21 浙江百先得服饰有限公司 A kind of platform data processing system
CN109241107A (en) * 2018-08-03 2019-01-18 北京邮电大学 Big data controlling device based on Hadoop
CN109307811A (en) * 2018-08-06 2019-02-05 国网浙江省电力有限公司宁波供电公司 A kind of user's dedicated transformer electricity consumption monitoring method excavated based on big data
CN109255524A (en) * 2018-08-16 2019-01-22 广西电网有限责任公司电力科学研究院 A kind of measuring equipment data analyzing evaluation method and system
CN109376152A (en) * 2018-09-13 2019-02-22 广州帷策智能科技有限公司 Big data system file data preparation method and system
CN109240163A (en) * 2018-09-25 2019-01-18 南京信息工程大学 Intelligent node and its control method for industrialization manufacture
CN109240163B (en) * 2018-09-25 2024-01-02 南京信息工程大学 Intelligent node for industrial manufacturing and control method thereof
CN110188887B (en) * 2018-09-26 2022-11-08 第四范式(北京)技术有限公司 Data management method and device for machine learning
CN110188887A (en) * 2018-09-26 2019-08-30 第四范式(北京)技术有限公司 The data managing method and device of Machine oriented study
CN109325541A (en) * 2018-09-30 2019-02-12 北京字节跳动网络技术有限公司 Method and apparatus for training pattern
CN109558395A (en) * 2018-10-17 2019-04-02 中国光大银行股份有限公司 Data processing system and data digging method
CN109558398B (en) * 2018-10-31 2023-09-19 深圳平安医疗健康科技服务有限公司 Data cleaning method based on big data and related device
CN109558398A (en) * 2018-10-31 2019-04-02 平安医疗健康管理股份有限公司 Data cleaning method and relevant apparatus based on big data
CN109634941A (en) * 2018-11-14 2019-04-16 金色熊猫有限公司 Medical data processing method, device, electronic equipment and storage medium
CN111222833A (en) * 2018-11-27 2020-06-02 中云开源数据技术(上海)有限公司 Algorithm configuration combination platform based on data lake server
CN109635026A (en) * 2018-11-29 2019-04-16 宝晟(广州)生物信息技术有限公司 A kind of biological sample bank data distributing nodes sharing method, system and device
CN110007989A (en) * 2018-12-13 2019-07-12 国网信通亿力科技有限责任公司 Data visualization platform system
CN109783859A (en) * 2018-12-13 2019-05-21 重庆金融资产交易所有限责任公司 Model building method, device and computer readable storage medium
CN109800277A (en) * 2018-12-18 2019-05-24 合肥天源迪科信息技术有限公司 A kind of machine learning platform and the data model optimization method based on the platform
CN109636482A (en) * 2018-12-21 2019-04-16 苏宁易购集团股份有限公司 Data processing method and system based on similarity model
CN109636482B (en) * 2018-12-21 2021-07-27 南京星云数字技术有限公司 Data processing method and system based on similarity model
CN109947826A (en) * 2019-03-29 2019-06-28 山东浪潮云信息技术有限公司 A method of with big data technology building region portrait analysis model
CN110175191A (en) * 2019-05-14 2019-08-27 复旦大学 Data filtering rule modeling method in data analysis
CN110362605A (en) * 2019-06-04 2019-10-22 苏州神州数码捷通科技有限公司 A kind of E book data verification method based on big data
CN110378569A (en) * 2019-06-19 2019-10-25 平安国际智慧城市科技股份有限公司 Industrial relations chain building method, apparatus, equipment and storage medium
CN110245875A (en) * 2019-06-21 2019-09-17 深圳前海微众银行股份有限公司 Risk of fraud appraisal procedure, device, equipment and storage medium
CN110442620B (en) * 2019-08-05 2023-08-29 赵玉德 Big data exploration and cognition method, device, equipment and computer storage medium
CN110442620A (en) * 2019-08-05 2019-11-12 赵玉德 A kind of big data is explored and cognitive approach, device, equipment and computer storage medium
CN110502509A (en) * 2019-08-27 2019-11-26 广东工业大学 A kind of traffic big data cleaning method and relevant apparatus based on Hadoop Yu Spark frame
CN110502509B (en) * 2019-08-27 2023-04-18 广东工业大学 Traffic big data cleaning method based on Hadoop and Spark framework and related device
WO2021047506A1 (en) * 2019-09-11 2021-03-18 中兴通讯股份有限公司 System and method for statistical analysis of data, and computer-readable storage medium
CN110727670A (en) * 2019-10-11 2020-01-24 集奥聚合(北京)人工智能科技有限公司 Data structure prediction transfer and automatic data processing method based on flow chart
CN110727670B (en) * 2019-10-11 2022-08-09 北京小向创新人工智能科技有限公司 Data structure prediction transfer and automatic data processing method based on flow chart
CN110909039A (en) * 2019-10-25 2020-03-24 北京华如科技股份有限公司 Big data mining tool and method based on drag type process
CN111125052A (en) * 2019-10-25 2020-05-08 北京华如科技股份有限公司 Big data intelligent modeling system and method based on dynamic metadata
CN110990384B (en) * 2019-11-04 2023-08-22 武汉中卫慧通科技有限公司 Big data platform BI analysis method
CN110990384A (en) * 2019-11-04 2020-04-10 武汉中卫慧通科技有限公司 Big data platform BI analysis method
CN110850824A (en) * 2019-11-12 2020-02-28 北京矿冶科技集团有限公司 Implementation method for acquiring data of distributed control system to Hadoop platform
CN110928922A (en) * 2019-11-27 2020-03-27 开普云信息科技股份有限公司 Public policy analysis model deployment method and system based on big data mining
CN110928922B (en) * 2019-11-27 2020-07-24 开普云信息科技股份有限公司 Public policy analysis model deployment method and system based on big data mining
CN110908573B (en) * 2019-12-03 2021-07-06 北京明略软件系统有限公司 Algorithm model training method, device, equipment and storage medium
CN110908573A (en) * 2019-12-03 2020-03-24 北京明略软件系统有限公司 Algorithm model training method, device, equipment and storage medium
CN111177220A (en) * 2019-12-26 2020-05-19 中国平安财产保险股份有限公司 Data analysis method, device and equipment based on big data and readable storage medium
CN111080170A (en) * 2019-12-30 2020-04-28 北京云享智胜科技有限公司 Workflow modeling method and device, electronic equipment and storage medium
CN111080170B (en) * 2019-12-30 2023-09-05 北京云享智胜科技有限公司 Workflow modeling method and device, electronic equipment and storage medium
CN111177200A (en) * 2019-12-31 2020-05-19 北京九章云极科技有限公司 Data processing system and method
CN111177200B (en) * 2019-12-31 2021-05-11 北京九章云极科技有限公司 Data processing system and method
CN111367969B (en) * 2020-03-19 2020-12-01 北京三维天地科技股份有限公司 Data mining method and system
CN111367969A (en) * 2020-03-19 2020-07-03 北京三维天地科技股份有限公司 Data mining method and system
CN111537982A (en) * 2020-05-08 2020-08-14 东南大学 Distortion drag array line spectrum feature enhancement method and system
CN111537982B (en) * 2020-05-08 2022-04-12 东南大学 Distortion drag array line spectrum feature enhancement method and system
CN111399838A (en) * 2020-06-04 2020-07-10 成都四方伟业软件股份有限公司 Data modeling method and device based on Spark SQL and materialized view
CN111756600A (en) * 2020-06-24 2020-10-09 厦门长江电子科技有限公司 Multi-communication system and method for realizing multiple switch test machines
CN111538494A (en) * 2020-07-09 2020-08-14 南京红松信息技术有限公司 Big data automatic modeling and verification engine system and method
CN111931945A (en) * 2020-07-31 2020-11-13 北京百度网讯科技有限公司 Data processing method, device and equipment based on label engine and storage medium
CN111654853A (en) * 2020-08-04 2020-09-11 索信达(北京)数据技术有限公司 Data analysis method based on user information
CN111654853B (en) * 2020-08-04 2020-11-10 索信达(北京)数据技术有限公司 Data analysis method based on user information
CN111949640A (en) * 2020-08-04 2020-11-17 上海微亿智造科技有限公司 Intelligent parameter adjusting method and system based on industrial big data
CN112214524A (en) * 2020-08-27 2021-01-12 优学汇信息科技(广东)有限公司 Data evaluation system and evaluation method based on deep data mining
CN112182333A (en) * 2020-09-25 2021-01-05 山东亿云信息技术有限公司 Talent space-time big data processing method and system based on random forest
CN112308410A (en) * 2020-10-30 2021-02-02 云南电网有限责任公司电力科学研究院 Enterprise asset data management method based on asset classification
CN112328216A (en) * 2020-11-03 2021-02-05 成都中科大旗软件股份有限公司 Method, system, computer device and storage medium for developing data based on canvas nodes
CN112685380A (en) * 2020-12-03 2021-04-20 成都大数据产业技术研究院有限公司 Big data value discovery and application innovation platform system
CN112506930A (en) * 2020-12-15 2021-03-16 北京三维天地科技股份有限公司 Data insight platform based on machine learning technology
CN112667735A (en) * 2020-12-23 2021-04-16 武汉烽火众智数字技术有限责任公司 Visualization model establishing and analyzing system and method based on big data
CN113220566A (en) * 2021-04-26 2021-08-06 深圳市云网万店科技有限公司 Interface performance test script generation method and device and computer equipment
CN113468187A (en) * 2021-09-02 2021-10-01 太平金融科技服务(上海)有限公司深圳分公司 Multi-party data integration method and device, computer equipment and storage medium
CN113468187B (en) * 2021-09-02 2021-11-23 太平金融科技服务(上海)有限公司深圳分公司 Multi-party data integration method and device, computer equipment and storage medium
CN114254588A (en) * 2021-12-16 2022-03-29 马上消费金融股份有限公司 Data tag processing method and device
CN114254588B (en) * 2021-12-16 2023-10-13 马上消费金融股份有限公司 Data tag processing method and device
CN114205164A (en) * 2021-12-16 2022-03-18 北京百度网讯科技有限公司 Traffic classification method and device, training method and device, equipment and medium
CN115345461A (en) * 2022-08-08 2022-11-15 航天神舟智慧系统技术有限公司 Police service efficiency evaluation method and device based on data modeling
CN115357657B (en) * 2022-10-24 2023-03-24 成都数联云算科技有限公司 Data processing method and device, computer equipment and storage medium
CN115357657A (en) * 2022-10-24 2022-11-18 成都数联云算科技有限公司 Data processing method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN107103050A (en) A kind of big data Modeling Platform and method
CN107193967A (en) A kind of multi-source heterogeneous industry field big data handles full link solution
US20180196868A1 (en) Multi-Dimensional Modeling in a Functional Information System
Wang et al. Graphs in scientific visualization: A survey
CN108701254A (en) System and method for the tracking of dynamic family, reconstruction and life cycle management
CN110008259A (en) The method and terminal device of visualized data analysis
Dhaenens et al. Metaheuristics for big data
San Martín et al. Representing, querying and transforming social networks with RDF/SPARQL
CN111444348A (en) Method, system and medium for constructing and applying knowledge graph architecture
Wang Big Data Algebra (BDA): A Denotational Mathematical Structure for Big Data Science and Engineering
CN112667735A (en) Visualization model establishing and analyzing system and method based on big data
CN110737805A (en) Method and device for processing graph model data and terminal equipment
Wang et al. Research on evaluation model of music education informatization system based on machine learning
Wang et al. Association rules mining in parallel conditional tree based on grid computing inspired partition algorithm
Ledesma et al. Educational tool for generation and analysis of multidimensional modeling on data warehouse
Agocs et al. Interactive graph query language for multidimensional data in collaboration spotting visual analytics framework
Sayed et al. A conceptual framework for using big data in Egyptian agriculture
Tsitseklis et al. Scalable community detection for complex data graphs via hyperbolic network embedding and graph databases
Wang Graph-based techniques for visual analytics of scientific data sets
Cao Design and optimization of a decision support system for sports training based on data mining technology
Palivela et al. Survey on mining techniques for breast cancer related data
Feng et al. ASMaaS: Automatic Semantic Modeling as a Service
Ulhaq Mapping System Model and Clustering of Fishery Products using K-Means Algorithm with Web GIS Approach
Kamakshaiah et al. Prototype survey analysis of different information retrieval classification and grouping approaches for categorical information
Tiwari et al. DBSCAN: An Assessment of Density Based Clustering and It’s Approaches

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170829