CN112667735A - Visualization model establishing and analyzing system and method based on big data - Google Patents

Info

Publication number
CN112667735A
CN112667735A (application CN202011536216.1A)
Authority
CN
China
Prior art keywords
data, module, algorithm, sub, database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011536216.1A
Other languages
Chinese (zh)
Inventor
张凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Fiberhome Digtal Technology Co Ltd
Original Assignee
Wuhan Fiberhome Digtal Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Fiberhome Digtal Technology Co Ltd filed Critical Wuhan Fiberhome Digtal Technology Co Ltd
Priority to CN202011536216.1A priority Critical patent/CN112667735A/en
Publication of CN112667735A publication Critical patent/CN112667735A/en

Abstract

A big data based visualization model building and analyzing system, comprising a data asset module, a data operation module, an algorithm model module and a front-end display module. The system is a novel front-end-page-based database interaction scheme: by customizing general visual components, a user can write processing scripts and manage processing flows simply by dragging and connecting visual components, thereby interacting with the database and performing business processing. The invention targets business-technology users (users who do not know the database deeply, need to interact with it, and whose requirements are not fixed), and is intuitive, simple, easy to use, highly flexible, loosely coupled and highly extensible. Each visual component node can view the SQL and field information assembled up to that node and parse them in real time, so the state of the script can be checked conveniently and intuitively. The scheme of the invention reduces the difficulty of interacting with the database, and visual SQL configuration is a qualitative improvement over an IDE.

Description

Visualization model establishing and analyzing system and method based on big data
Technical Field
The invention relates to the field of data processing, in particular to a visualization model establishing and analyzing system and method based on big data.
Background
With the rapid development of the internet, the amount of data generated every day is enormous. Before the advent of big data technology, traditional data processing ran into several bottlenecks. First, a conventional database reaches its storage limit under a very large data volume; the workaround is to replace the hard disk with a larger one, but the cost of doing so is very high. Second, a single computer cannot process a large data volume quickly, so data processing speed also becomes a bottleneck.
Data modeling is the process by which an information system defines and analyzes its data requirements and the support those requirements need. The professional data modeling work involved is therefore closely tied to the interests of the enterprise and of the users of its information systems. From the requirements to the actual database there are three different model types. The conceptual data model is essentially the first formal description of a set of recorded data requirements: the data is first used to discuss the initial requirements of the enterprise, and is then transformed into a logical data model that expresses those requirements in terms of data structures implementable in a database. One conceptual data model may require multiple logical data models. The last step of data modeling is to refine the logical data model into a physical data model that meets the specific requirements on data access performance and storage. Data modeling defines not only data elements, but also their structure and the relationships between them.
At present, big data technology can be used to overcome the bottlenecks of a traditional information technology architecture: poor scalability, poor fault tolerance, low performance, and difficult installation, deployment and maintenance. Data is stored in Hadoop's HDFS distributed file system, which scales well and is highly fault tolerant, and Hadoop's MapReduce performs parallel computation on large data sets (larger than 1 TB), increasing computation speed and yielding high performance. However, existing big data technology is not easy for non-technical personnel to use: ordinary users who do not know the database deeply find it difficult to interact with it through SQL statements. Existing big data technology interacts with the database directly, cannot provide high-level encapsulation of services when requirements are not fixed, has low flexibility, and cannot meet the needs of business-technology users.
Disclosure of Invention
In view of the above, the present invention has been developed to provide a big data based visualization model building analysis system and method that overcomes or at least partially solves the above-mentioned problems.
In order to solve the technical problem, the embodiment of the application discloses the following technical scheme:
a big data based visualization model building and analyzing system, comprising:
the system comprises a data asset module, a data operation module, an algorithm model module and a front-end display module; wherein
The data asset module is used for setting a data source and creating a data resource; user data is uploaded to the system by manual creation or resource binding, and the user processes the data by drag-and-drop manual modeling;
the data operation module is used for carrying out ETL processing on data on the data source and finding and correcting recognizable errors in the data file;
the algorithm model module is used for modeling a large amount of data by using a classical algorithm in machine learning and then predicting by using the model;
and the front-end display module is used for graphically displaying the processed data or the unprocessed data.
Further, the data uploading mode of the data asset module at least comprises the following steps: uploading a local file, uploading interface data and uploading a database.
Further, the database upload supported database at least comprises: ORACLE, MYSQL, HIVE.
Further, through checking data consistency, processing invalid values and missing values, recognizable errors in the data file are found and corrected.
Further, the data operation module at least comprises: an Sql processing sub-module, a sampling sub-module, a grouping and aggregation sub-module, a data merging sub-module, a duplicate deletion sub-module, a data partitioning sub-module, a sorting sub-module, a data discretization sub-module, a data standardization sub-module, a variable filtering sub-module, a transposition sub-module, a field rearrangement sub-module, a missing value processing sub-module, an outlier processing sub-module, a find-and-replace sub-module, a variable insertion sub-module, a weighting sub-module and a sample balancing sub-module.
Further, the algorithm model module at least comprises: apriori algorithm submodule, Kmeans algorithm submodule, naive Bayes algorithm submodule, logistic regression algorithm submodule, ridge regression algorithm submodule, LASSO algorithm submodule and linear regression algorithm submodule.
Further, the front-end display module canvas is implemented using the following front-end components: jsPlumb, vue, elementui, draggable, axios; experiments are performed with drag-and-drop visual operation components, data operation and algorithm analysis are performed according to the established service model, and the result data set is displayed visually in multiple dimensions through charts in the connected front-end display module.
Further, visualizing the multi-dimensional presentation comprises: visually displaying the chart in different chart types by using different data structures; displaying multi-dimensional data in a drilling mode; performing customized display on the special data in a node customized display form; and displaying the data in a multi-graph linkage display mode.
The invention also discloses a visualization model establishing and analyzing method based on big data, which comprises the following steps:
adding a data source node for the data source to be processed in the data assets, and uploading the data to the system for subsequent use;
according to the needs of the business scenario, filtering the data with the functional nodes of the data operation module, for example using missing-value processing to delete rows whose fields are empty, or using variable filtering to keep the fields the business needs and delete the others, so as to obtain data in the desired specific format;
if the business scenario has no need for an algorithm, the final data display can be performed directly with the charts of the front-end display module; the front end offers various chart types among its functional nodes, and different charts are chosen to represent the data as required; if an algorithm is needed, an algorithm node must be added;
and after the corresponding nodes are selected, connecting and saving the data source nodes, the data operation nodes, the algorithm model nodes (or omitting them) and the front-end display nodes; the whole flow is run by clicking run, and the final data is displayed graphically with lineage analysis.
The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:
the system and the method for establishing and analyzing the visualization model based on the big data are very flexible in the aspect of the whole data processing flow, and a user can complete corresponding workflow according to different requirements; in the data source uploading stage, a plurality of uploading modes are provided for selection; in the data operation stage, a plurality of processing modes are provided for selection; the algorithm stage also comprises a plurality of algorithms; a data display stage, which comprises a plurality of graphs; the invention supports the use of various core mainstream algorithm libraries after unpacking, simplifies and popularizes big data analysis, and enables users to easily use the system to carry out data mining and modeling analysis on big data by knowing little knowledge in the field of statistics and data mining.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
fig. 1 is a schematic structural diagram of a visualization model building analysis system based on big data in embodiment 1 of the present invention;
fig. 2 is a structural diagram of a custom SQL configuration window in embodiment 1 of the present invention;
fig. 3 is a diagram showing the visualization components and the final connection effect in embodiment 1 of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In order to solve the prior-art problem that existing big data technology is not easy for non-technical personnel to use, the embodiment of the invention provides a system and a method for big data based visualization model building and analysis.
Example 1
A big data based visualization modeling analysis system, as shown in fig. 1, comprising:
the system comprises a data asset module 1, a data operation module 2, an algorithm model module 3 and a front end display module 4; wherein
The data asset module 1 is used for setting a data source and creating a data resource, user data is updated to a system in a manual creation or resource binding mode, and a user processes the user data in a dragging and manual modeling mode.
In this embodiment, the data asset module 1 includes three data uploading modes, namely local file uploading, message data uploading, and database uploading, where the database uploading supports different types of databases such as ORACLE, MYSQL, HIVE:
ORACLE is a relational database management system developed by ORACLE corporation.
MYSQL is a relational database management system.
HIVE is a data warehouse tool based on Hadoop, can map structured data files into a database table, provides a simple sql query function, and can convert sql statements into MapReduce tasks for operation.
Meanwhile, the open-source component Impala is used; data is imported into Hive from mysql, oracle and mongodb databases through Sqoop; Zookeeper provides the data synchronization service; and Impala complements Hive, enabling efficient SQL queries.
Specifically, for the data asset module 1: because the system's data sources are not uniform, different data sources need to be converted into a uniform one. Relational databases such as Oracle, Mysql and Sqlserver, and data in file formats such as txt and csv, can be processed on the basis of the existing data sources to form a new data source, which is converted into a uniform HIVE data source, providing the data source for the system's flow processing.
(1) A relational database is processed with Sqoop: Sqoop imports a table from the database through a MapReduce job that extracts rows of records from the table and writes them to HDFS.
Before the import starts, Sqoop examines the table to be imported using JDBC. It retrieves all the columns of the table, along with their SQL data types. These SQL types (VARCHAR, INTEGER, …) are mapped to Java data types (String, Integer, etc.), which are used in the MapReduce application to hold the field values. Sqoop's code generator uses this information to create a class corresponding to the table, for holding the records extracted from the table.
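The type-mapping step can be illustrated with a small sketch (hypothetical names; Sqoop's actual code generator emits Java source — this merely mimics the SQL-to-Java mapping described above):

```python
# Illustrative only: map SQL column types from JDBC metadata to the
# Java field types Sqoop would use in the generated record class.
SQL_TO_JAVA = {
    "VARCHAR": "String",
    "CHAR": "String",
    "INTEGER": "Integer",
    "BIGINT": "Long",
    "DOUBLE": "Double",
    "DATE": "java.sql.Date",
}

def record_class_fields(columns):
    """columns: list of (name, sql_type) pairs as retrieved via JDBC.
    Unknown SQL types fall back to String here (an assumption)."""
    return [(name, SQL_TO_JAVA.get(sql_type, "String"))
            for name, sql_type in columns]
```

For example, a table with an INTEGER `id` and a VARCHAR `name` would yield fields of Java types `Integer` and `String`.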
(2) Data in file format is uploaded as a data stream and processed with Hadoop: the local file is uploaded to HDFS through MapReduce and stored into Hive under a specified table name and field names.
(3) An existing data source is processed with Hive: a new Hive table is created with a Hive select statement on the basis of the original data source, generating a new data source; the existing data source can also be used directly.
And the data operation module 2 is used for carrying out ETL processing on data on the data source and finding and correcting recognizable errors in the data file. In the embodiment, the recognizable errors in the data file are found and corrected by checking the data consistency and processing invalid values and missing values.
Specifically, the data operation module 2 at least includes: an Sql processing sub-module, a sampling sub-module, a grouping and aggregation sub-module, a data merging sub-module, a duplicate deletion sub-module, a data partitioning sub-module, a sorting sub-module, a data discretization sub-module, a data standardization sub-module, a variable filtering sub-module, a transposition sub-module, a field rearrangement sub-module, a missing value processing sub-module, an outlier processing sub-module, a find-and-replace sub-module, a variable insertion sub-module, a weighting sub-module and a sample balancing sub-module. These sub-modules implement SQL processing, sampling, grouping and aggregation, data set merging, data partitioning, sorting, data discretization, data standardization, variable filtering, transposition, field rearrangement, weighting, sample balancing and other processing steps.
Specifically, in this embodiment:
(1) SQL processing is to create a new data source from the original data source using the select statement of Hive, and the structure diagram of the SQL configuration window is shown in FIG. 2.
(2) Sampling is to sample the original data source by using Hive to generate a new data source.
(3) Grouping and aggregation groups records by the grouping variable, and computes the mean, sum and total of the calculation variable.
(4) The merged data set is divided into row record addition and column variable addition, the column variable addition needs to select a merged variable, and the data set is generated by using Hive processing.
(5) Duplicate deletion removes duplicate data with Hive, filtered by the deduplication variable.
(6) And carrying out data partitioning according to the specified training sample by the data partitioning.
(7) The sorting sorts the data sources according to the processing variables.
(8) The data discretization generates discrete data according to the processing variables to form a result set.
(9) Data normalization selects a normalization method based on the process variables to generate a data set.
(10) Variable filtering removes the specified delete variables from the result set, producing a new data source.
(11) Transposition transposes all rows and columns; the newly generated columns are named transposition_1, transposition_2, …, transposition_15.
(12) The field reordering specifies the order of the fields.
(13) Missing value processing removes data according to processing variables, producing a new result set.
(14) Find-and-replace looks up values of the processing variable and, if a value satisfies the condition, replaces it.
(15) Outlier processing handles outliers according to the processing variable, using an exclusion mode and an identification rule.
(16) Inserting variables inserts new variables according to the original columns.
(17) The weighting adds a weighting factor to it depending on the process variable.
(18) Sample balancing adds a condition to the sample according to the processing variable; if the condition is met, the sample is converted according to the factor.
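As a rough illustration of the semantics of three of the operations above — missing-value processing, duplicate deletion and standardization — here is a pure-Python sketch (the real module performs these with Hive SQL on the cluster; these helper names are hypothetical):

```python
def drop_missing(rows, field):
    """Missing-value processing: drop rows whose field is empty or None."""
    return [r for r in rows if r.get(field) not in (None, "")]

def drop_duplicates(rows, field):
    """Duplicate deletion: keep the first row for each value of the
    deduplication variable."""
    seen, out = set(), []
    for r in rows:
        if r[field] not in seen:
            seen.add(r[field])
            out.append(r)
    return out

def normalize(rows, field):
    """Data standardization (min-max variant): rescale field to [0, 1]."""
    vals = [r[field] for r in rows]
    lo, hi = min(vals), max(vals)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    return [dict(r, **{field: (r[field] - lo) / span}) for r in rows]
```

Each helper takes a list of dict rows and returns a new result set, mirroring how each Hive operation above produces a new data source from the old one.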
Data operation is mainly implemented with Hive SQL: data that does not meet requirements, mainly incomplete, erroneous and duplicate data, is removed, and new data can also be derived on the basis of the original data.
The algorithm model module 3 is used for modeling a large amount of data with classical machine learning algorithms and then making predictions with the resulting model. In this embodiment, the algorithm model module 3 at least includes: an Apriori algorithm sub-module, a Kmeans algorithm sub-module, a naive Bayes algorithm sub-module, a logistic regression algorithm sub-module, a ridge regression algorithm sub-module, a LASSO algorithm sub-module and a linear regression algorithm sub-module. The algorithm model of the system is a machine learning algorithm system based on a distributed computing engine. Users can run experiments through drag-and-drop visual operation components, so that engineers without a machine learning background can easily get started with data mining. The system provides a rich set of machine learning algorithms: Apriori, K-means, naive Bayes, logistic regression, ridge regression, LASSO, linear regression, and so on.
Specifically, the algorithm models in the algorithm model module 3 are mainly implemented with Spark: the prepared data set and training data are submitted to a Spark cluster for efficient processing and retrieval of the result set. The algorithms are implemented in the Java language; the algorithm programs are packaged into jar files and deployed separately from the system, which reduces the coupling between the modeling system and the algorithms, so deploying an algorithm does not affect the use of the system. Concretely, tasks are submitted to the Spark cluster for processing, and the cluster's distributed computation enables fast and effective iterative calculation.
The algorithm model module 3 of the system is mainly implemented by programming against the Spark MLlib API. Spark's in-memory computation engine runs fast, and the MLlib library includes many machine learning algorithms, such as Apriori, kmeans, naive Bayes, logistic regression, ridge regression, lasso and so on. These algorithms mainly fall into two classes, classification and clustering: the kmeans algorithm belongs to clustering, while the other algorithms listed above are classification algorithms. The two kinds of problem have different logic in their code implementation; the technical problems related to the algorithm model module 3 are explained below.
Clustering algorithm
Clustering (cluster analysis) has as its core task: dividing a group of target objects into several clusters, so that objects within each cluster are as similar as possible while objects in different clusters are as different as possible. Formally, the clustering problem is: given a set of elements D, each element having n observable attributes, use an algorithm to divide D into k subsets, requiring that the dissimilarity between elements within each subset be as low as possible while the dissimilarity between elements of different subsets be as high as possible. Each subset is called a cluster.
Kmeans belongs to an iterative redistribution clustering algorithm based on square error, and the core idea is very simple:
(1) k center points are randomly selected.
(2) Calculate the distance from every point to each of the k center points, and assign each point to the cluster of its nearest center.
(3) Recalculate the center of each of the k clusters as the arithmetic mean of its member points.
(4) Repeat steps 2 and 3 until the clusters no longer change or the maximum number of iterations is reached.
(5) Output the result.
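The five steps above can be sketched in a few lines of plain Python (an illustrative, deterministic sketch, not the Spark MLlib implementation the system uses; centers are seeded from the first k points instead of randomly so the example is reproducible):

```python
def kmeans(points, k, max_iter=100):
    # step 1: pick k center points (first k here; random in practice)
    centers = [list(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):                  # step 4: repeat 2 and 3
        # step 2: assign each point to its nearest center (squared distance)
        new_assignment = [
            min(range(k),
                key=lambda c: sum((pi - ci) ** 2
                                  for pi, ci in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:       # clusters unchanged -> done
            break
        assignment = new_assignment
        # step 3: recompute each center as the arithmetic mean of its cluster
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return assignment, centers                 # step 5: output the result
```

On two well-separated groups of points this converges in a couple of iterations, assigning each group to its own cluster.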
The result of the Kmeans algorithm depends on the choice of the initial cluster centers: it easily falls into a local optimum, there is no criterion to follow for the choice of k, it is sensitive to abnormal data, it can only handle numerical attributes, and the resulting cluster structure may be unbalanced.
The following describes the flow and technical details of the Kmeans algorithm. First acquire a data source: add it in the data assets and drag it into the canvas. Then connect a data operation node to perform the necessary ETL processing, so that the data source satisfies what the algorithm part requires. After the data operation node runs, the data it has processed is stored in the cluster's Hive data warehouse for the algorithm part to call.
The algorithm section is described in detail below. After the data source node and the data operation node are complete, the kmeans algorithm node must be connected: drag the kmeans algorithm node into the canvas and double-click it to pop up its configuration page, which contains: 1. which columns the kmeans algorithm runs on, because not all columns are necessarily used by the actual business requirement; 2. the number of clusters, i.e. how many classes the current data source should finally be clustered into; 3. the maximum number of iterations the algorithm may perform; 4. the number of random initializations. After configuration, click save and then click run to start execution.
In the background code, when the program determines that the nodeType (node type) is K_Means, it enters the stepKmeans method of KmeanServiceImpl. This method first obtains the parameters from the configuration interface and calls the set methods of an instantiated KmeansInfo object, so that all the parameters required by the kmeans algorithm are stored in the KmeansInfo instance. The tokmeanString method is then executed to produce a space-separated parameter string.
Subsequently, the format tabledata method in datareverse is executed for feature transformation: a data source inevitably contains character strings, while Spark's kmeans algorithm requires data of type double, so this method is critical to the success of the algorithm execution.
The algorithm jar package is executed using Spark's yarn-client submission mode, which has the advantage that the jar can be run directly without writing a script. KmeansInfo in Co-Insight-mllib.jar is then executed, with the parameters passed from the web end. The main idea is to fetch the selected field data from the specified Hive table, perform the corresponding format conversion, and turn the data into the required vector format. The KMeans.train API trains the data to generate a kmeans model; the next step is the most important of the whole algorithm, because only once the model is generated can the predict method of the model determine how the data clusters. Finally the result is stored into HDFS, and Hive creates a table and reads the HDFS data. The data is then displayed by merging the result Hive table with the predicted data, at which point the basic kmeans (clustering) algorithm is complete.
Classification algorithm
In short, classification assigns objects with certain characteristics to one of the classes in a known class set. Mathematically, it can be defined as follows:
The known sets are C = {y1, y2, …, yn} and I = {x1, x2, …, xm}. Determine a mapping rule y = f(x) such that for every xi ∈ I there is one and only one yj ∈ C with yj = f(xi).
Here C is the set of categories, I is the set of objects to be classified, f is the classifier, and the main task of a classification algorithm is to construct the classifier f.
Constructing a classification algorithm usually requires training on a set of known classes, and the trained classifier usually cannot reach 100% accuracy. The quality of a classifier is often related to factors such as the training data, the validation data, and the size of the training sample.
Among the classification algorithms, the differences between algorithms lie in their Spark MLlib low-level implementations; when calling the API, only the method parameters for training differ slightly, and the rest of the program logic is substantially similar. This is described in detail below, taking the naive Bayes algorithm as an example.
Naive Bayes classification (also called the NB algorithm) has a very simple core idea: for a given item to predict, compute the probability that it belongs to each class, and choose the class with the highest probability as the predicted class. For example, if a sample is predicted to be female with probability 40% and male with probability 41%, it is classified as male.
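The core idea can be sketched in plain Python for categorical features (illustrative only — the system's implementation uses Spark MLlib; `alpha` here plays the role of the Laplace-smoothing attribute, and all names are hypothetical):

```python
from collections import Counter, defaultdict

def train_nb(rows, labels, alpha=1.0):
    """rows: list of feature tuples; labels: parallel list of classes.
    Returns a predict(row) function picking the highest-posterior class."""
    classes = Counter(labels)                       # class -> count
    # counts[class][feature_index][value] -> occurrences
    counts = defaultdict(lambda: defaultdict(Counter))
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[y][i][v] += 1
    total = len(labels)

    def predict(row):
        best, best_p = None, -1.0
        for y, ny in classes.items():
            p = ny / total                          # prior P(y)
            for i, v in enumerate(row):
                vocab = len(counts[y][i]) or 1      # distinct values seen
                # Laplace-smoothed conditional probability P(v | y)
                p *= (counts[y][i][v] + alpha) / (ny + alpha * vocab)
            if p > best_p:
                best, best_p = y, p
        return best

    return predict
```

Training on a handful of labeled rows yields a classifier that scores every class and returns the most probable one, exactly the "highest probability wins" rule described above.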
The naive Bayes algorithm flow and technical details are presented below. A classification algorithm differs from a clustering algorithm in that one flow needs two data sources: one with a label column, used as training data, and one whose label column is empty, used as prediction data. The two data sources must have the same field names and types; the data-merge operation in data operation (plain row-record appending) is then used to merge them into one Hive table for subsequent processing.
After the data sets are merged, ETL operations can be performed to process the data; the result is then connected to the naive Bayes algorithm node dragged in from the algorithm model. In the configuration page of the naive Bayes node one can choose the label column, which columns the algorithm runs on, the alpha attribute and the training data proportion. After configuration, save and run; the execution logic at the web end is basically similar to that of the kmeans algorithm.
The algorithm code, in NativeBayes of Co-Insight-mllib.jar, selects the training data. It then trains a NaiveBayes model with the NaiveBayes.train method, classifies the prediction set with the model's predict method, and finally stores the result in a Hive table for later front-end display.
The algorithm module mainly uses Spark's MLlib API calls: the algorithm implementations are highly reusable, development is fast, model training is efficient, cluster resources are used well, and the common types of algorithm are basically covered.
The front-end presentation module 4 is used for graphically presenting processed or unprocessed data. In this embodiment, the front-end display module 4 canvas is implemented using the following front-end components: jsPlumb, vue, elementui, draggable, axios. Experiments are performed with drag-and-drop visual operation components, data operation and algorithm analysis are performed according to the established service model, and the result data set is displayed visually in multiple dimensions through charts in the connected front-end display module 4. The visual multi-dimensional presentation comprises: displaying charts of different chart types for different data structures; displaying multi-dimensional data by drilling down; customized display of special data in node-customized display forms; and multi-chart linked display of the data.
Specifically, the front-end display module 4 provides rich dashboard displays, making the system more vivid and well formed and presenting in a timely manner the business insight hidden behind transient and voluminous data. In fields such as traffic and communications, interactive real-time data visualization helps business personnel discover and diagnose business problems and has become a vital link in big data solutions. Preferably, the front-end display module 4 works as follows:
(1) The data are presented in the form of a table or card.
(2) A visualization component from the component library is dragged into the canvas, generating a visualization component node there. Owing to the low coupling between visualization components, the system can support different types of databases such as ORACLE, MYSQL and HIVE. The components and canvas are implemented using the following related front-end components: jsPlumb, vue, elementui, draggable, axios;
(3) Right-clicking a visualization component node pops up a custom SQL configuration window for binding the configured SQL;
(4) The SQL configuration window comprises three parts: basic information configuration, field information configuration, and SQL preview; the basic information comprises the creation type, the type of table to create, the library name, the table name, and advanced configuration.
(5) When the mouse hovers over a visualization component node, drag arrows appear; associations between SQL tables are established by dragging, which determines the execution order logic;
(6) After the components are added and the connections configured, the node relations of the visualization components are parsed into a graph structure for storage. The visualization components and the final connection effect are shown in figure 3.
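Step (6) — parsing the configured components and connections into a graph structure for storage — can be sketched as follows. The node fields, ids and JSON layout are hypothetical examples, not the system's actual storage schema:

```python
import json

def build_graph(nodes, connections):
    """Parse component nodes and their drag-connections into an
    adjacency-list graph structure suitable for storage."""
    graph = {
        "nodes": {n["id"]: n for n in nodes},
        "edges": {n["id"]: [] for n in nodes},  # node id -> downstream ids
    }
    for src, dst in connections:
        graph["edges"][src].append(dst)
    return graph

# Hypothetical canvas contents: data source -> ETL node -> algorithm node.
nodes = [
    {"id": "src1", "type": "datasource", "table": "t_orders"},
    {"id": "etl1", "type": "missing_value"},
    {"id": "alg1", "type": "naive_bayes"},
]
graph = build_graph(nodes, [("src1", "etl1"), ("etl1", "alg1")])
print(json.dumps(graph["edges"]))
```

Serializing such a structure (e.g. as JSON) is one straightforward way to persist the node relations for later execution and lineage analysis.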
Corresponding to the above system embodiment, the present embodiment further discloses a big-data-based visualization model building and analyzing method, comprising:
adding the data source to be processed as a data source node in the data assets, and uploading the data to the system for subsequent use;
filtering the data with the functional nodes of the data operation module 2 according to the needs of the service scenario: for example, using missing value processing to delete rows whose fields are empty, using the filter variables node to keep the fields required by the service and delete the others, and performing further processing, so as to obtain data in the desired format;
if the service scenario does not require an algorithm, the final data can be displayed using the graphics of the front-end display module 4; the front end offers various charts among its functional nodes, and different charts are selected to represent the data as required; if an algorithm is needed, an algorithm node is added;
and after the corresponding nodes are selected, connecting and saving the data source node, the data operation node, the algorithm model node (or omitting it) and the front-end display node; one click of the run button executes the whole process, and the final data are displayed graphically through lineage analysis.
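The data-filtering step above (missing value processing followed by filter variables) can be sketched with two small helpers; the function names and sample rows are illustrative assumptions, not the system's actual node implementations:

```python
def process_missing_values(rows, required_fields):
    """Missing value processing: drop any row in which a required
    field is absent or empty."""
    return [r for r in rows
            if all(r.get(f) not in (None, "") for f in required_fields)]

def filter_variables(rows, keep_fields):
    """Filter variables: keep only the fields needed by the service."""
    return [{f: r[f] for f in keep_fields if f in r} for r in rows]

# Hypothetical input rows; the second row has an empty "city" field.
rows = [
    {"id": 1, "city": "Wuhan", "amount": 10, "note": "x"},
    {"id": 2, "city": "",      "amount": 20, "note": "y"},
]
clean = filter_variables(process_missing_values(rows, ["city"]),
                         ["id", "city"])
print(clean)  # → [{'id': 1, 'city': 'Wuhan'}]
```

In the described system these operations would be generated as Hive SQL rather than executed in Python, but the transformation semantics are the same.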
The big-data-based visualization model building and analyzing method and system disclosed by this embodiment can be used easily by non-technical personnel with no knowledge of the underlying big data technology. The system uses the Flowable process technology: a big data processing flow is realized simply by dragging and connecting nodes such as data source, data processing, algorithm and data display. The system mainly stores data in a Hive data warehouse, and the data processing part is implemented directly with Hive SQL statements, which works well under large data volumes and performs excellently. The algorithm part of the system is implemented with Spark's MLlib. Spark's advantage is that intermediate job output can be kept in memory, so there is no need to read and write HDFS, and memory-based computation makes execution efficient. Spark's machine learning library contains many algorithm types, and algorithms such as classification and clustering can meet users' requirements. Connecting the data source, data processing, computation and display nodes realizes a one-stop data analysis process and fulfills the service requirement.
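The one-stop flow described above — running the connected nodes in dependency order and tracing a node's upstream lineage for the final display — can be sketched with Python's standard `graphlib`; the node names are illustrative:

```python
from graphlib import TopologicalSorter

def execution_order(edges):
    """edges: list of (upstream, downstream) node ids. Returns an order
    in which every node runs after all of its upstream dependencies."""
    deps = {}
    for src, dst in edges:
        deps.setdefault(dst, set()).add(src)
        deps.setdefault(src, set())
    return list(TopologicalSorter(deps).static_order())

def lineage(edges, node):
    """All upstream ancestors of `node`, i.e. its data lineage."""
    parents = {}
    for src, dst in edges:
        parents.setdefault(dst, set()).add(src)
    seen, stack = set(), [node]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return seen

# Hypothetical flow: data source -> ETL -> k-means -> chart display.
edges = [("source", "etl"), ("etl", "kmeans"), ("kmeans", "chart")]
print(execution_order(edges))          # → ['source', 'etl', 'kmeans', 'chart']
print(sorted(lineage(edges, "chart"))) # → ['etl', 'kmeans', 'source']
```

A flow engine such as Flowable manages this orchestration in the actual system; the sketch only illustrates the ordering and lineage semantics.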
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".

Claims (9)

1. A big data based visualization model building and analyzing system is characterized by comprising:
the system comprises a data asset module, a data operation module, an algorithm model module and a front-end display module; wherein
The data asset module is used for setting a data source and creating a data resource, uploading user data to the system by manual creation or resource binding, so that the user can process the data by drag-and-drop manual modeling;
the data operation module is used for carrying out ETL processing on data on the data source and finding and correcting recognizable errors in the data file;
the algorithm model module is used for modeling a large amount of data by using a classical algorithm in machine learning and then predicting by using the model;
and the front-end display module is used for graphically displaying the processed data or the unprocessed data.
2. The big-data-based visualization model building and analyzing system of claim 1, wherein the data uploading modes of the data asset module at least comprise: local file upload, interface data upload, and database upload.
3. The big-data-based visualization model building and analyzing system according to claim 2, wherein the databases supported by database upload at least comprise: ORACLE, MYSQL, HIVE.
4. The big-data-based visualization model building and analyzing system of claim 1, wherein discovering and correcting recognizable errors in data files is accomplished by checking data consistency and dealing with invalid and missing values.
5. The big-data-based visualization model building and analyzing system according to claim 1, wherein the data operation module at least comprises: an SQL processing sub-module, a sampling sub-module, a classifying and aggregating sub-module, a data merging sub-module, a duplicate deletion sub-module, a data partitioning sub-module, a sorting sub-module, a data discretization sub-module, a data standardization sub-module, a filter variables sub-module, a transposition sub-module, a field rearrangement sub-module, a missing value processing sub-module, an outlier processing sub-module, a search and conversion sub-module, an insert variable sub-module, a weighting sub-module, and a sample balancing sub-module.
6. The big-data-based visualization modeling and analyzing system of claim 1, wherein the algorithmic model module comprises at least: apriori algorithm submodule, Kmeans algorithm submodule, naive Bayes algorithm submodule, logistic regression algorithm submodule, ridge regression algorithm submodule, LASSO algorithm submodule and linear regression algorithm submodule.
7. The big-data-based visualization model building and analysis system of claim 1, wherein the canvas of the front-end display module is implemented using the following related front-end components: jsPlumb, vue, elementui, draggable, axios; operations are assembled visually in a drag-and-drop mode, data operations and algorithm analysis are performed according to the established service model, and the result data set is displayed in multiple dimensions through charts in the connected front-end display module.
8. The big-data-based visualization model building and analyzing system of claim 7, wherein the visualized multidimensional display comprises: displaying different data structures with different chart types; displaying multidimensional data in a drill-down mode; displaying special data in a customized node display form; and displaying data in a multi-chart linkage mode.
9. A visualization model building and analyzing method based on big data is characterized by comprising the following steps:
adding the data source to be processed as a data source node in the data assets, and uploading the data to the system for subsequent use;
filtering the data with the functional nodes of a data operation module according to the needs of the service scenario: for example, using missing value processing to delete rows whose fields are empty, using the filter variables node to keep the fields required by the service and delete the others, and performing further processing, so as to obtain data in the desired format;
if the service scenario does not require an algorithm, the final data can be displayed using the graphics of a front-end display module; the front end offers various charts among its functional nodes, and different charts are selected to represent the data as required; if an algorithm is needed, an algorithm node is added;
and after the corresponding nodes are selected, connecting and saving the data source node, the data operation node, the algorithm model node (or omitting it) and the front-end display node; one click of the run button executes the whole process, and the final data are displayed graphically through lineage analysis.
CN202011536216.1A 2020-12-23 2020-12-23 Visualization model establishing and analyzing system and method based on big data Pending CN112667735A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011536216.1A CN112667735A (en) 2020-12-23 2020-12-23 Visualization model establishing and analyzing system and method based on big data


Publications (1)

Publication Number Publication Date
CN112667735A true CN112667735A (en) 2021-04-16

Family

ID=75408174

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011536216.1A Pending CN112667735A (en) 2020-12-23 2020-12-23 Visualization model establishing and analyzing system and method based on big data

Country Status (1)

Country Link
CN (1) CN112667735A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107103050A (en) * 2017-03-31 2017-08-29 海通安恒(大连)大数据科技有限公司 A kind of big data Modeling Platform and method
CN110245175A (en) * 2019-06-19 2019-09-17 山东浪潮商用系统有限公司 A kind of visualization process and treat system and method based on big data


Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094776A (en) * 2021-04-19 2021-07-09 城云科技(中国)有限公司 Method and system for constructing visual component model data and electronic equipment
CN113448951A (en) * 2021-09-02 2021-09-28 深圳市信润富联数字科技有限公司 Data processing method, device, equipment and computer readable storage medium
CN113867859A (en) * 2021-09-13 2021-12-31 深圳市鸿普森科技股份有限公司 Visualization method for user side configurable chart
CN113515528A (en) * 2021-09-14 2021-10-19 北京江融信科技有限公司 Asset screening system and method based on big data and ORACLE mass data
CN113515528B (en) * 2021-09-14 2022-04-05 北京江融信科技有限公司 Asset screening system and method based on big data and ORACLE mass data
CN116860227A (en) * 2023-07-12 2023-10-10 北京东方金信科技股份有限公司 Data development system and method based on big data ETL script arrangement
CN116860227B (en) * 2023-07-12 2024-02-09 北京东方金信科技股份有限公司 Data development system and method based on big data ETL script arrangement
CN117171238A (en) * 2023-11-02 2023-12-05 菲特(天津)检测技术有限公司 Big data algorithm platform and data mining method
CN117171238B (en) * 2023-11-02 2024-02-23 菲特(天津)检测技术有限公司 Big data algorithm platform and data mining method

Similar Documents

Publication Publication Date Title
CN112667735A (en) Visualization model establishing and analyzing system and method based on big data
JP7034924B2 (en) Systems and methods for dynamic lineage tracking, reconfiguration, and lifecycle management
CN108038222B (en) System of entity-attribute framework for information system modeling and data access
Karnitis et al. Migration of relational database to document-oriented database: Structure denormalization and data transformation
US10740396B2 (en) Representing enterprise data in a knowledge graph
CN107103050A (en) A kind of big data Modeling Platform and method
CN110618983A (en) JSON document structure-based industrial big data multidimensional analysis and visualization method
AU2011224139B2 (en) Analysis of object structures such as benefits and provider contracts
CN104573124B (en) A kind of education cloud application statistical method based on parallelization association rule algorithm
Kamimura et al. Extracting candidates of microservices from monolithic application code
CN110300963A (en) Data management system in large-scale data repository
CN110134671B (en) Traceability application-oriented block chain database data management system and method
JP2003044267A (en) Data sorting method, data sorting device and data sorting program
EP2663939A2 (en) Systems and methods for high-speed searching and filtering of large datasets
Attwal et al. Exploring data mining tool-Weka and using Weka to build and evaluate predictive models
EP3561688A1 (en) Hierarchical tree data structures and uses thereof
Link et al. Recover and RELAX: concern-oriented software architecture recovery for systems development and maintenance
Singh et al. Spatial data analysis with ArcGIS and MapReduce
Brown et al. Overview—The social media data processing pipeline
Niu Optimization of teaching management system based on association rules algorithm
CN110825792A (en) High-concurrency distributed data retrieval method based on golang middleware coroutine mode
US20140067874A1 (en) Performing predictive analysis
US20130218893A1 (en) Executing in-database data mining processes
CN111612156A (en) Interpretation method for XGboost model
Riazi SparkGalaxy: Workflow-based Big Data Processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210416