CN114579630A

CN114579630A - Data online processing and displaying method and system

Info

Publication number: CN114579630A
Application number: CN202210034609.5A
Authority: CN
Inventors: 毛尚伟; 张涛; 汤槟; 郑成坤; 陶术江; 刘欣; 刘雨佳; 李士果; 王汶; 沈一飞
Original assignee: CISDI Chongqing Information Technology Co Ltd
Current assignee: CISDI Chongqing Information Technology Co Ltd
Priority date: 2022-01-12
Filing date: 2022-01-12
Publication date: 2022-06-03

Abstract

The invention provides a method and a system for online processing and display of data, the method comprising the following steps: acquiring target data, searching the target data, and acquiring characteristic information in the target data; performing data processing on the target data according to the characteristic information, and displaying the target data after the data processing is completed; wherein the data processing comprises at least one of: relevance association, missing value identification, outlier identification, feature controllability management and data discretization management. By processing the target data online in several ways, the invention addresses the poor data processing capability of existing data mining tools. After this online processing, the data can be displayed through various charts, which addresses the poor user-friendliness of existing data mining tools; and because several chart display modes are provided, and each chart can correspond to the same or different operations, the high difficulty of interface operation in existing data mining tools is also addressed.

Description

Data online processing and displaying method and system

Technical Field

The invention relates to the technical field of computer data, in particular to a method and a system for online processing and displaying data.

Background

Data mining refers to the process of using algorithms to search a large amount of data for the information hidden in it. Data mining is generally associated with computer science and achieves this goal through many methods such as statistics, online analytical processing, information retrieval, machine learning, expert systems (which rely on past rules of thumb), and pattern recognition. From the point of view of the data itself, data mining usually requires eight steps, including data cleaning, data transformation, the data mining implementation process, pattern evaluation and knowledge representation. Subject data mining performs data mining processing on data of a certain type, but existing mining tools suffer from poor user-friendliness, difficult interface operation, poor data processing capability and similar defects.

Disclosure of Invention

In view of the above drawbacks of the prior art, the present invention provides a method and a system for online processing and display of data, which are used to solve the problems of data mining tools in the prior art.

In order to achieve the above objects and other related objects, the present invention provides a method for online processing and display of data, comprising the following steps:

acquiring target data to be subjected to data mining from a preset data source;

searching the target data to acquire characteristic information in the target data;

performing data processing on the target data according to the characteristic information, and displaying the target data after the data processing is completed; wherein the data processing comprises at least one of: relevance association, missing value identification, outlier identification, feature controllability management and data discretization management.

Optionally, the process of performing correlation processing on the target data includes:

acquiring any two pieces of characteristic information in the target data, and recording one piece of characteristic information as a first characteristic and recording the other piece of characteristic information as a second characteristic;

comparing whether the values corresponding to the first feature and the second feature are equal or not; if so, determining that the first feature is related to the second feature; if not, judging that the first feature is not related to the second feature;

or calculating the intersection ratio of the first characteristic and the second characteristic; if the intersection ratio is 1, judging that the first feature is related to the second feature; if the intersection ratio is 0, judging that the first feature is not related to the second feature;

or calculating a difference value between the first feature and the second feature, and if the absolute value of the difference value is 1, determining that the first feature is related to the second feature; and if the absolute value of the difference is 0, judging that the first feature is not related to the second feature.

Optionally, the process of identifying the missing value of the target data includes:

judging whether any piece of characteristic information in the target data contains a plurality of data types; if so, taking the data type that occurs most often in that characteristic information as the main data type of the characteristic information, and recording values of the other data types as missing values;

constructing one or more substitute values for missing values in the target data, and generating a corresponding data set according to the constructed substitute values;

and performing the same analysis on each data set, and determining a final target variable according to a corresponding analysis result.

Optionally, the process of performing outlier identification on the target data includes:

carrying out maximum value and minimum value segmentation on the target data, screening the segmented target data, and determining that the characteristic information of each node is the same after segmentation;

and acquiring a value corresponding to the characteristic information on each node, comparing the acquired value with a preset value, and identifying whether the acquired value is a characteristic value or an outlier;

or acquiring a value corresponding to each piece of feature information in the target data, and performing outlier calculation on the acquired value to obtain an outlier corresponding to each piece of feature information;

or acquiring a mean value and a standard deviation of the target data, and generating a first numerical value and a second numerical value according to the mean value and the standard deviation; wherein the first value is greater than the second value;

obtaining a value corresponding to each feature information in the target data, comparing the obtained value with a first numerical value and a second numerical value, and if the obtained value is larger than the first numerical value, judging that the corresponding numerical value is an outlier; if the obtained value is smaller than the second numerical value, the corresponding numerical value is judged as the characteristic value.

Optionally, the performing of the feature controllability management on the target data includes:

identifying the controllability of the continuous numerical values, marking the numerical variables as controllable and marking other variables as uncontrollable;

or setting a range for the variables of the data type, and if a variable exceeds the set range, marking the variable as uncontrollable; if the variable does not exceed the set range, marking the variable as controllable.

Optionally, the process of performing data discretization management on the target data includes:

acquiring a value interval of the target data;

dividing the value interval into two adjacent intervals, wherein each interval corresponds to a discrete attribute value;

judging whether the discrete attribute value corresponding to each interval is significant, and stopping interval splitting if it is not significant; if it is significant, dividing the corresponding interval into two new adjacent intervals, judging whether the discrete attribute value corresponding to each new interval is significant, and repeating or stopping interval splitting based on the judgment result.

Optionally, the method further comprises: displaying the target data according to the characteristic information; the display comprises: displaying the target data in blocks, displaying the target data in single-feature detail, and displaying the target data with single-feature screening;

when the target data are displayed in blocks, the minimum value, the maximum value, the average value, the median, the standard deviation, the missing values, the outliers and/or the factor number of the target data are displayed;

and when the target data are displayed in single-feature detail, they are displayed using a histogram, a scatter plot, a line graph and/or a heat map.

Optionally, in the process of displaying the target data after the data processing is completed, the method includes: the display is carried out by using various colors and the display is carried out by using various charts.

The invention also provides a data online processing and displaying system, which comprises:

the data acquisition module is used for acquiring target data to be subjected to data mining from a preset data source;

the data searching module is used for searching the target data to acquire characteristic information in the target data;

the data processing and displaying module is used for processing the target data according to the characteristic information and displaying the target data after the data processing is finished; wherein the data processing comprises at least one of: relevance association, missing value identification, outlier identification, feature controllability management and data discretization management.

As described above, the present invention provides a method and a system for online processing and display of data, which have the following advantages: firstly, target data to be subjected to data mining are acquired from a preset data source; the target data are then searched to acquire the characteristic information in the target data; data processing is performed on the target data according to the characteristic information, and the target data are displayed after the data processing is completed; wherein the data processing comprises at least one of: relevance association, missing value identification, outlier identification, feature controllability management and data discretization management. By processing the target data online in several ways, the invention addresses the poor data processing capability of existing data mining tools. After this online processing, the data can be displayed through various charts, which addresses the poor user-friendliness of existing data mining tools; and because several chart display modes are provided, and each chart can correspond to the same or different operations, the high difficulty of interface operation in existing data mining tools is also addressed. Therefore, compared with the prior art, the method has a simple and clear flow and is easy to operate; its structure is complete and its data processing capability is strong; it provides a data cleaning function and meets diversified requirements; and it can display the data after processing.

Drawings

FIG. 1 is a schematic flow chart illustrating a method for online processing and displaying of data according to an embodiment;

FIG. 2 is a schematic flow chart of a data online processing display method according to another embodiment;

FIG. 3 is a diagram illustrating an embodiment of obtaining target data from a predetermined data source;

FIG. 4 is a schematic diagram of a data processing flow according to an embodiment;

FIG. 5 is a schematic diagram of a data processing flow according to another embodiment;

FIG. 6 is a schematic diagram of a data processing flow according to yet another embodiment;

FIG. 7 is a diagram illustrating search target data, according to an embodiment;

FIGS. 8a to 8d are schematic diagrams illustrating viewing feature details of the target data list in a histogram according to an embodiment;

FIGS. 9a to 9c are schematic diagrams illustrating the operation of analyzing the correlation between variables according to an embodiment;

FIGS. 10a and 10b are schematic diagrams illustrating operations of deleting and filling data according to an embodiment;

FIGS. 11a to 11d are schematic diagrams illustrating operations of data discretization management according to an embodiment;

FIG. 12 is a schematic hardware structure diagram of the data online processing display system according to an embodiment.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the present embodiment are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

Referring to fig. 1, the present embodiment provides a data online processing and displaying method, including the following steps:

S100, acquiring target data to be subjected to data mining from a preset data source. The data sources in this embodiment include, but are not limited to: a local CSV data file, data files of other projects stored in the cloud, a Hive database, a MySql database, a Prometheus time-series database, a Graphite time-series database, an OpenTSDB time-series database, an InfluxDB time-series database and the like.

S200, searching the target data to acquire the characteristic information in the target data.

S300, performing data processing on the target data according to the characteristic information, and displaying the target data after the data processing is completed; wherein the data processing comprises at least one of: relevance association, missing value identification, outlier identification, feature controllability management and data discretization management. When the processed target data are displayed, the display may use various colors and various charts.

In an exemplary embodiment, the method further comprises: displaying the target data according to the characteristic information; the display comprises: displaying the target data in blocks, displaying the target data in single-feature detail, and displaying the target data with single-feature screening; when the target data are displayed in blocks, the minimum value, the maximum value, the average value, the median, the standard deviation, the missing values, the outliers and/or the factor number of the target data are displayed; and when the target data are displayed in single-feature detail, they are displayed using a histogram, a scatter plot, a line graph and/or a heat map.

Specifically, the process of performing correlation processing on the target data includes: acquiring any two pieces of characteristic information in the target data, and recording one as a first feature and the other as a second feature; comparing whether the values corresponding to the first feature and the second feature are equal; if so, determining that the first feature is related to the second feature; and if not, determining that the first feature is not related to the second feature. As another example, an intersection ratio of the first feature and the second feature is calculated; if the intersection ratio is 1, the first feature is determined to be related to the second feature; and if the intersection ratio is 0, the first feature is determined not to be related to the second feature. As yet another example, a difference between the first feature and the second feature is calculated; if the absolute value of the difference is 1, the first feature is determined to be related to the second feature; and if the absolute value of the difference is 0, the first feature is determined not to be related to the second feature.

Specifically, the process of identifying the missing values of the target data includes: judging whether any piece of characteristic information in the target data contains a plurality of data types; if so, taking the data type that occurs most often in that characteristic information as the main data type of the characteristic information, and recording values of the other data types as missing values; constructing one or more substitute values for the missing values in the target data, and generating a corresponding data set from each constructed substitute value; and performing the same analysis on each data set, and determining a final target variable according to the corresponding analysis results.

Specifically, the process of performing outlier identification on the target data by the method includes: carrying out maximum value and minimum value segmentation on the target data, screening the segmented target data, and determining that the characteristic information of each node is the same after segmentation; and acquiring a value corresponding to the characteristic information on each node, comparing the acquired value with a preset value, and identifying whether the acquired value is a characteristic value or an outlier.

Alternatively, the process of performing outlier identification on the target data includes: acquiring a value corresponding to each piece of characteristic information in the target data, and performing outlier calculation on the acquired value to obtain an outlier corresponding to each piece of characteristic information. Or acquiring a mean value and a standard deviation of the target data, and generating a first numerical value and a second numerical value according to the mean value and the standard deviation, wherein the first numerical value is greater than the second numerical value; acquiring a value corresponding to each piece of characteristic information in the target data and comparing the acquired value with the first numerical value and the second numerical value; if the acquired value is larger than the first numerical value, the value is determined to be an outlier; if the acquired value is smaller than the second numerical value, the value is determined to be a characteristic value.

Specifically, the process of performing feature controllability management on the target data includes: identifying the controllability of continuous numerical values, marking numerical variables as controllable and marking other variables as uncontrollable. Or setting a range for the variables of the data type, and if a variable exceeds the set range, marking the variable as uncontrollable; if the variable does not exceed the set range, marking the variable as controllable.

Specifically, the process of performing data discretization management on the target data includes: acquiring a value interval of the target data; dividing the value interval into two adjacent intervals, wherein each interval corresponds to a discrete attribute value; judging whether the discrete attribute value corresponding to each interval is significant, and stopping interval splitting if it is not significant; if it is significant, dividing the corresponding interval into two new adjacent intervals, judging whether the discrete attribute value corresponding to each new interval is significant, and repeating or stopping interval splitting based on the judgment result.

In another embodiment, as shown in fig. 2, the present embodiment provides a data online processing display method, including the following steps:

S1, performing data access. In this embodiment, the data access in step S1 includes: connecting common databases such as Hive and MySql; uploading a local CSV data file; uploading data files of other projects stored in the cloud; and connecting time-series databases such as Prometheus, Graphite, OpenTSDB and InfluxDB. As shown in fig. 3, the data access includes the following steps: clicking to create a project, filling in the information of the created project, clicking to test connectivity, and performing subsequent operations after the data connection is confirmed.

S2, searching the data. As an example, using the search function: a feature name is input, and the matching feature information is then displayed automatically.

S3, displaying the original data. As an example, the description information of the original data may be presented in blocks, such as a minimum value, a maximum value, a mean value, a median, a standard deviation, missing values, outliers and a factor number. As another example, the original data may be presented in single-feature detail, such as by a histogram, a scatter plot, a line graph, a heat map, or the like. As yet another example, the original data may also be presented with single-feature screening, in which the range of the discrete region is viewed by selecting a range of the chart and dragging the outlier line.
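The block display of step S3 can be illustrated with a minimal sketch that computes the listed description statistics for one feature. The function name `describe` and the convention that a missing value is represented by `None` are assumptions made for illustration and do not come from the embodiment; the outlier count and factor number are omitted because the text does not define how they are computed at this step.

```python
from statistics import mean, median, pstdev


def describe(column):
    """Summarize one numeric feature for a block display (illustrative only)."""
    # Assumption: missing entries are represented by None.
    present = [v for v in column if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        "std": pstdev(present),  # population standard deviation
        "missing": len(column) - len(present),
    }


summary = describe([1.0, 2.0, None, 3.0])
print(summary)
```

Each key of the returned dictionary corresponds to one cell of the block display for that feature.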

S4, performing data cleaning management. In this embodiment, the data cleaning management comprises: relevance association, identification of missing values, identification of outliers, management of feature controllability, and discretization management of variables.

Specifically, the relevance association manner in this embodiment includes, but is not limited to, the following:

First, for features A and B, the correlation is calculated from the indicator $J(A_i, B_i)$, which compares $A_i$ and $B_i$: it is 1 if they are equal and 0 if they are not.

Secondly, for features X and Y, the correlation is calculated as the intersection divided by the union:

$$\frac{|X \cap Y|}{|X \cup Y|}$$

where a value of 1 means the features are related and a value of 0 means they are unrelated.

Thirdly, for features M and N, the correlation is calculated from the term $|M_k - N_k|$, which compares $M_k$ and $N_k$: it is taken as 1 if they are equal and 0 if they are not.
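The first two correlation measures above can be sketched as follows. This is only an illustration under stated assumptions: the text defines the per-record indicator J but not how the indicators are aggregated, so averaging them over all records is an assumption, and the function names `matching_indicator`, `matching_score` and `intersection_over_union` are hypothetical.

```python
from typing import Hashable, Sequence


def matching_indicator(a: Hashable, b: Hashable) -> int:
    """J(A_i, B_i): 1 if the two values are equal, 0 otherwise."""
    return 1 if a == b else 0


def matching_score(feature_a: Sequence, feature_b: Sequence) -> float:
    """Assumed aggregation: the mean of J(A_i, B_i) over paired records."""
    if len(feature_a) != len(feature_b) or len(feature_a) == 0:
        raise ValueError("features must be non-empty and of equal length")
    hits = sum(matching_indicator(a, b) for a, b in zip(feature_a, feature_b))
    return hits / len(feature_a)


def intersection_over_union(feature_x: Sequence, feature_y: Sequence) -> float:
    """Size of the intersection of the value sets divided by the size of the union."""
    set_x, set_y = set(feature_x), set(feature_y)
    union = set_x | set_y
    if not union:
        return 0.0
    return len(set_x & set_y) / len(union)


print(matching_score([1, 2, 3, 4], [1, 2, 0, 4]))
print(intersection_over_union([1, 2, 3], [2, 3, 4]))
```

A ratio of 1 corresponds to the "related" case in the text and a ratio of 0 to the "unrelated" case; how intermediate values are treated is not stated in the embodiment.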

Specifically, the missing value identification manner in this embodiment includes, but is not limited to, the following:

First, across all data records, if a certain feature contains multiple data types, the most numerous type is taken as the main data type and values of the other types are recorded as missing values; if different data types have the same number of records, the main data type of the feature is selected in the order numeric value, then character string.

Secondly, n substitute values are constructed for each missing value in the original data, generating n complete data sets; each complete data set is then analyzed by the same method, giving n results; these results are finally combined according to a chosen principle to obtain the final target variable.
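The two missing value approaches above might be sketched as follows. The names `kind`, `mark_missing` and `impute_and_pool` are hypothetical, `None` is assumed to stand for a missing value, and both the per-data-set analysis (a mean) and the pooling rule (averaging the n results) are placeholders, since the embodiment leaves the analysis method and the combining principle open.

```python
from collections import Counter
from statistics import mean

# Tie-break order when two types are equally frequent (numeric first, then string).
PRIORITY = ("number", "string", "other")


def kind(value):
    """Coarse data type of a single value."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "number"
    if isinstance(value, str):
        return "string"
    return "other"


def mark_missing(column):
    """Pick the most frequent type as the main type; other types become None."""
    counts = Counter(kind(v) for v in column)
    # max() returns the first maximal item, so ties follow PRIORITY order.
    main = max(PRIORITY, key=lambda k: counts[k])
    return main, [v if kind(v) == main else None for v in column]


def impute_and_pool(column, substitutes):
    """Build one complete data set per substitute value, analyze each, pool results."""
    completed = [[s if v is None else v for v in column] for s in substitutes]
    results = [mean(data_set) for data_set in completed]  # same analysis on each set
    return mean(results)  # assumed pooling rule: average of the n results


main_type, cleaned = mark_missing([1.0, "n/a", 3.0, None, 2.0])
print(main_type, cleaned)
print(impute_and_pool([1.0, None, 3.0], [1.0, 3.0]))
```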

Specifically, the outlier identification method in this embodiment includes, but is not limited to, the following:

First, an outlier can be tested with an outlier statistic Y, in which:

$X_b$ is the suspected outlier being examined;

$\bar{X}$ is the arithmetic mean of the set of measurements;

$\sigma$ is the standard deviation, determined from the other measured values, excluding abnormal values.

If the Y value calculated from these quantities is greater than the rejection-limit threshold at the corresponding confidence level, then $X_b$ is discarded as an outlier. The present embodiment may then set these discarded outliers to null for later analysis.

Secondly, following the decision tree principle, the original data can be split on the basis of its maximum and minimum values, and the two resulting blocks of data are then screened layer by layer until all the features of the samples on each node are the same; 1 and -1 are then used to indicate whether a value is an outlier, where 1 denotes a feature value and -1 denotes an outlier.

Thirdly, an outlier can also be identified from the mean (a) and the standard deviation (b) of the original data: if a value is greater than (a + 3b), this embodiment determines that the value is an outlier; if a value is less than (a - 3b), this embodiment determines that the value is a feature value.
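The first and third approaches can be illustrated with a short sketch. The formula for Y is not reproduced in the text, so the form Y = |X_b - mean| / sigma, with the mean and sigma both taken from the remaining measurements, is an assumption based on the symbol definitions; the threshold to compare Y against must come from a rejection-limit table and is not given here. The function names are hypothetical, and the second function only covers the upper comparison against a + 3b.

```python
from statistics import mean, pstdev, stdev


def rejection_statistic(suspect, others):
    """Assumed form: Y = |X_b - mean(others)| / stdev(others)."""
    sigma = stdev(others)  # sample standard deviation of the non-suspect values
    if sigma == 0:
        raise ValueError("standard deviation is zero; statistic is undefined")
    return abs(suspect - mean(others)) / sigma


def three_sigma_bounds(values):
    """First value a + 3b and second value a - 3b (a: mean, b: standard deviation)."""
    a, b = mean(values), pstdev(values)
    return a + 3 * b, a - 3 * b


def values_above_first_bound(values):
    """Values greater than a + 3b, which the text treats as outliers."""
    first, _second = three_sigma_bounds(values)
    return [v for v in values if v > first]


y = rejection_statistic(15.0, [10.0, 10.2, 9.8, 10.1, 9.9])
print(y)
print(values_above_first_bound([10.0] * 20 + [100.0]))
```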

Specifically, the manner of managing the controllability of the features in the present embodiment includes, but is not limited to, the following:

First, continuous numerical controllability identification: numerical variables are marked as controllable and other variables as uncontrollable.

Second, manual controllability marking: the data set is marked as controllable or uncontrollable on the basis of an individual's production experience and expertise.

Thirdly, interval controllability identification: a range is given for variables of the data type, and a variable is marked as uncontrollable if it exceeds the range and as controllable if it does not.
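The first and third controllability rules can be sketched as below. The variable names (`temperature`, `grade`, `pressure`), the ranges and the function names are all hypothetical examples, and representing "controllable" as `True` is an assumption made for illustration.

```python
from numbers import Real


def mark_by_type(record):
    """Numerical variables are controllable (True); all others are not (False)."""
    return {
        name: isinstance(value, Real) and not isinstance(value, bool)
        for name, value in record.items()
    }


def mark_by_range(values, ranges):
    """A variable is controllable only if its value stays inside the given range."""
    result = {}
    for name, value in values.items():
        low, high = ranges[name]
        result[name] = low <= value <= high
    return result


print(mark_by_type({"temperature": 1250.0, "grade": "Q235"}))
print(mark_by_range(
    {"temperature": 1250.0, "pressure": 9.5},
    {"temperature": (1000.0, 1300.0), "pressure": (0.0, 5.0)},
))
```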

Specifically, the present embodiment includes, but is not limited to, the following ways for managing data discretization:

First, the value range of the whole data can be regarded as one discrete attribute value; the range is then divided into two adjacent intervals, each corresponding to a discrete attribute value; these steps are repeated, and interval splitting stops when the attribute value is no longer significant.

Secondly, the values of the original data are sorted by size and each value is regarded as a candidate split point; the interval is divided into two parts at each candidate in turn and the entropy of the two parts is calculated; the candidate with the minimum entropy is selected as the first split point; the interval with the maximum entropy is then selected and the step is repeated; splitting stops when the number of intervals reaches the number specified by the user or a specified termination condition is met.

Thirdly, all variables of the original data can be arranged in descending or ascending order, the rank being the sorting result; variables with the same numerical value are then placed in one group, and interval splitting stops once all numerical values have been grouped.
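The first split of the entropy-based approach can be sketched as follows. The text does not say what the entropy is computed over, so this sketch assumes each value carries a class label and uses the size-weighted label entropy of the two parts, as in common supervised discretization; the function names and the choice of the midpoint between neighbouring values as the cut are also assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def best_split(values, labels):
    """Return (cut, weighted_entropy) for the lowest-entropy two-way split."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    n = len(pairs)
    best_cut, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        if score < best_score:
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut, best_score


cut, score = best_split([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
print(cut, score)
```

Repeating `best_split` on the resulting interval with the highest entropy, until a user-specified number of intervals is reached, would follow the loop described in the text.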

S5, displaying the data after data cleaning management. Specifically, this embodiment displays the processed data in ways including, but not limited to, the following:

First, display of variable correlation: the correlation between variables can be classified by color; if two variables have the same color, their correlation is high, and if the colors differ, their correlation is low. Variables with the same color can also be arranged together so that the user can view them conveniently.

Secondly, display of whether a variable is controllable: the controllability attribute of each variable can be displayed, and the user can set the controllability of each variable while viewing it, which facilitates subsequent analysis.

Thirdly, display of the data type of a variable: the data type of each variable, such as continuous type, subtype and the like, can also be displayed, and the user can view the chart corresponding to each variable type.

Fourthly, display of the missing values and outliers of a variable: missing values and outliers can be displayed and also processed; in this embodiment they are filled with the mean value, a substitute value, or 0, and columns of missing values and outliers can also be deleted, so that the user can view the numerical information of each variable more clearly.

Fifth, display of real-time data: if the user has configured time-series data, this embodiment can also display it, and when the user updates the time-series data the system synchronously processes the updated data and then displays it.
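The filling and deleting operations mentioned for missing values can be sketched as below. The function names, the `strategy` parameter and the use of `None` to represent a missing value or an outlier that has been set to null are assumptions made for illustration.

```python
from statistics import mean


def fill_missing(column, strategy="mean"):
    """Fill None entries with the column mean or with 0."""
    present = [v for v in column if v is not None]
    if strategy == "mean":
        fill = mean(present)
    elif strategy == "zero":
        fill = 0
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


def drop_rows_with_missing(rows):
    """Delete every row that contains at least one None."""
    return [row for row in rows if None not in row]


print(fill_missing([1.0, None, 3.0]))
print(fill_missing([1.0, None], strategy="zero"))
print(drop_rows_with_missing([[1, 2], [None, 3]]))
```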

In another exemplary embodiment, as shown in fig. 3 to 11d, the present embodiment provides a subject data online processing and displaying method, including: data access, data search, original data display, data cleaning and processed data display.

As shown in fig. 3, the data access includes the following steps: clicking to create a project, filling in the information of the created project, clicking to test connectivity, and performing subsequent operations after the data connection is confirmed.

The working steps of the data search are as follows: start, click the magnifying glass icon in the upper right corner, enter content in the input box, and the search content is updated automatically.

The working steps of the original data display are as follows: clicking a histogram cell, dragging the sliding bar below the histogram, and dragging the outlier line on the histogram.

Clicking a histogram cell pops up a histogram detail window for setting and selection; dragging the left and right ends of the sliding bar below the histogram selects the upper and lower limits of the range and the abscissa range; and dragging the outlier line on the histogram selects the discrete region.

The first working step of data cleaning is as follows: as shown in fig. 4 and fig. 5, click the "variable name" drop-down box; delete and fill data; click the drop-down button next to "missing value" and click descending order/delete/fill; click the drop-down button next to "outlier" and click descending order/delete/fill/set to null; after the cleaning settings are made, click execute. Clicking the "variable name" drop-down box includes: clicking the correlation switch, where consecutive variables grouped under the same color have high correlation, and adjusting the correlation grouping threshold parameter to regenerate the variable grouping. Deleting and filling data has the characteristic that it cannot be undone once filling has been executed. The descending order/delete/fill options for missing values are: 1) descending order: sort the variables in descending order of the number of missing values; 2) delete: delete the row data of the missing value; 3) fill: fill the missing values with the mean value. The descending order/delete/fill/set to null options for outliers are: 1) descending order: sort the variables in descending order of the number of outliers; 2) delete: delete the row data of the missing value; 3) fill: fill the missing values with the mean value; 4) set to null: set the outlier to null. The sub-step of clicking execute after the cleaning settings includes the prompt: "The cleaning method takes effect only after execution. Note! It cannot be undone after execute is clicked."

The second working step of data cleaning is as follows: as shown in fig. 6, discretizing a variable by clicking the drop-down button next to the variable name, clicking the discretization button, selecting a screening condition, and clicking the confirm button; and undoing the discretization of a variable by clicking the variable name drop-down button and clicking undo discretization, after which the step ends. The screening conditions are: 1) equal width: binning by equal width over the range of the variable, so that every bin interval has the same length; and 2) equal frequency: binning by equal frequency over the number of samples, so that every bin interval holds substantially the same number of samples. In the undo step, the prompt "already discretized; the execute button cannot be clicked in the discretized state" is displayed; to undo, the variable name drop-down button is clicked and undo discretization is clicked, after which the step ends.
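
As a non-limiting illustration, the two screening conditions (equal width and equal frequency) can be sketched in Python as follows. The function names and the choice of returning a bin index per sample are assumptions of this sketch, not details given in the embodiment.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index; every bin spans the same width of the value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def equal_frequency_bins(values, n_bins):
    """Assign bin indices so that each bin holds roughly the same number of samples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

values = [1, 2, 3, 4, 5, 6, 7, 100]
width_bins = equal_width_bins(values, 4)
freq_bins = equal_frequency_bins(values, 4)
```

With one large value in the sample, equal-width binning leaves the middle bins empty, whereas equal-frequency binning places two samples in every bin.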

Specifically, the implementation flow of this embodiment is as follows:

(1) As shown in fig. 3, click create project and fill in the project information; after the data source user password is filled in, click test connectivity; subsequent operations can be performed once the data connection is confirmed.

(2) As shown in fig. 7, click the magnifying glass icon in the upper right corner and click the input box to enter content; the search results update automatically once content is entered.

(3) As shown in fig. 8a, clicking a histogram cell pops up the histogram detail window; as shown in fig. 8b, dragging the left and right ends of the slider bar below the histogram selects the upper and lower limits of the range; as shown in fig. 8c, dragging the slider bar below the histogram selects the abscissa range; and as shown in fig. 8d, dragging the outlier line on the histogram selects the discrete region.

(4) As shown in fig. 9a, click the "variable name" drop-down box and click the correlation switch; as shown in fig. 9b, consecutive variables of the same color form one group, indicating that the correlation of those variables is high; and as shown in fig. 9c, click the "variable name" drop-down box and adjust the correlation grouping threshold parameter to regenerate the variable groups.
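
As a non-limiting illustration, regenerating the variable groups when the correlation grouping threshold is adjusted can be sketched in Python as follows. The embodiment does not state which correlation measure or grouping rule is used; the Pearson coefficient, the greedy grouping rule and all names below are assumptions of this sketch.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def group_by_correlation(table, threshold):
    """Greedy grouping (assumed rule): a variable joins the first group whose lead
    variable it correlates with at |r| >= threshold, else it starts a new group."""
    groups = []
    for name, col in table.items():
        for group in groups:
            if abs(pearson(table[group[0]], col)) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

table = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # perfectly correlated with "a"
    "c": [5, 1, 4, 2, 3],   # weakly correlated with "a" (r = -0.3)
}
strict_groups = group_by_correlation(table, threshold=0.9)
loose_groups = group_by_correlation(table, threshold=0.2)
```

Lowering the threshold merges more variables into one group, which corresponds to regenerating the variable groups after the threshold parameter is adjusted; each group would then be drawn in a single color.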

(5) As shown in fig. 10a, click the drop-down button next to "missing value"; clicking descending order sorts the variables in descending order of the number of missing values; clicking delete deletes the row data of the missing value; and clicking replace fills the missing values with the mean value. As shown in fig. 10b, click the drop-down button next to "outlier"; clicking descending order sorts the variables in descending order of the number of outliers; clicking delete deletes the row data of the missing value; clicking mean value substitution fills missing values with the mean value; and clicking set to null sets the outlier to null. After the cleaning settings are made, click execute; the cleaning method takes effect only after execution. Note! The operation cannot be undone after execute is clicked.

(6) As shown in fig. 11a, click the drop-down button next to the variable name and click the discretization button; as shown in fig. 11b, select a screening condition: selecting equal width bins the variable by equal width over its range, so that every bin interval has the same length, and selecting equal frequency bins the variable by equal frequency over the number of samples, so that every bin interval holds substantially the same number of samples; as shown in fig. 11c, click the "confirm" button; and as shown in fig. 11d, click the variable name drop-down button and click undo discretization.

In summary, the present invention provides a data online processing and displaying method, which first obtains target data to be subjected to data mining from a preset data source; then searches the target data to acquire characteristic information in the target data; and performs data processing on the target data according to the characteristic information and displays the target data after the data processing is completed; wherein the data processing comprises at least one of: relevance association, missing value identification, outlier identification, feature controllability management and data discretization management. By processing the target data online in multiple ways, the method solves the problem of poor data processing capability of existing data mining tools. After the online processing, the method can also display the data through multiple charts, which solves the problem of poor user-friendliness of existing data mining tools; and because multiple chart display modes are provided and each chart can correspond to the same or different operations, the problem of difficult interface operation in existing data mining tools is also solved. Therefore, compared with the prior art, the method has a simple and clear flow and is easy to operate; its structure is complete and its data processing capability is strong; it has a data cleaning function and meets diversified requirements; and it can display the data after processing and is powerful in function.

As shown in fig. 12, the present invention further provides a data online processing and displaying system, which includes:

the data acquisition module M10 is used for acquiring target data to be subjected to data mining from a preset data source; the data sources in this embodiment include, but are not limited to: a local CSV data file, data files of other projects stored in the cloud, a Hive database, a MySql database, a Prometheus time-series database, a Graphite time-series database, an OpenTSDB time-series database, an InfluxDB time-series database and the like.

The data searching module M20 is configured to search the target data to obtain feature information in the target data;

the data processing and displaying module M30 is configured to perform data processing on the target data according to the feature information, and display the target data after the data processing is completed; wherein the data processing comprises at least one of: relevance association, missing value identification, outlier identification, feature controllability management and data discretization management. When the target data after the data processing is completed is displayed, the method includes: the display is performed by using various colors and various diagrams.

In an exemplary embodiment, the system is further configured for displaying the target data according to the characteristic information; the display includes: displaying the target data in blocks, displaying the target data in single-feature detail, and displaying the target data by single-feature screening; when the target data are displayed in blocks, the minimum value, the maximum value, the average value, the median, the standard deviation, the missing values, the outliers and/or the factor number of the target data are displayed; and when the target data are displayed in single-feature detail, a histogram, a scatter diagram, a curve graph and/or a heat map is used.

Specifically, the process of performing correlation processing on the target data by the system includes: acquiring any two pieces of characteristic information in the target data, and recording one as a first feature and the other as a second feature; comparing whether the values corresponding to the first feature and the second feature are equal; if so, judging that the first feature is related to the second feature; if not, judging that the first feature is not related to the second feature. As another example, an intersection ratio of the first feature and the second feature is calculated; if the intersection ratio is 1, the first feature is judged to be related to the second feature; and if the intersection ratio is 0, the first feature is judged not to be related to the second feature. As yet another example, a difference between the first feature and the second feature is calculated; if the absolute value of the difference is 1, the first feature is judged to be related to the second feature; and if the absolute value of the difference is 0, the first feature is judged not to be related to the second feature.
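
As a non-limiting illustration, the equality comparison and the intersection ratio described above can be sketched in Python as follows. The embodiment does not define how the intersection ratio is computed; taking it as the intersection over the union of the distinct values of the two features is an assumption of this sketch, as are the function names.

```python
def features_equal(first, second):
    """First rule: the features are related when their values are equal."""
    return list(first) == list(second)

def intersection_ratio(first, second):
    """Second rule (assumed definition): |A & B| / |A | B| over distinct values."""
    a, b = set(first), set(second)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

same = features_equal([1, 2, 3], [1, 2, 3])
full_overlap = intersection_ratio([1, 2, 3], [3, 2, 1])
no_overlap = intersection_ratio([1, 2], [3, 4])
partial_overlap = intersection_ratio([1, 2, 3], [2, 3, 4])
```

A ratio of 1 corresponds to the "related" judgment and a ratio of 0 to the "not related" judgment; the text does not say how intermediate ratios are treated, so the sketch only returns the ratio.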

Specifically, the process of the system for identifying missing values in the target data includes: judging whether any piece of characteristic information in the target data contains a plurality of data types; if so, taking the most numerous data type in that characteristic information as the main data type of the characteristic information, and recording the other data types as missing values; constructing one or more substitute values for the missing values in the target data, and generating a corresponding data set from each constructed substitute value; and performing the same analysis on each data set and determining the final target variable according to the corresponding analysis results.
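
As a non-limiting illustration, the identification of the main data type and the recording of other data types as missing values can be sketched in Python as follows. Using the Python type of each value as its "data type" and using None as the missing marker are assumptions of this sketch.

```python
from collections import Counter

def mark_missing_by_type(column):
    """Treat the most frequent Python type in a non-empty column as its main data
    type and replace values of any other type with None (recorded as missing)."""
    main_type = Counter(type(v) for v in column).most_common(1)[0][0]
    cleaned = [v if type(v) is main_type else None for v in column]
    return main_type, cleaned

main_type, cleaned = mark_missing_by_type([1.5, 2.0, "n/a", 3.2, None])
```

The substitute values mentioned in the text (for example a mean value) would then be constructed for the positions marked None.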

Specifically, the process of the system for performing outlier identification on the target data includes: carrying out maximum value and minimum value segmentation on the target data, screening the segmented target data, and determining that the characteristic information of each node is the same after segmentation; and acquiring a value corresponding to the characteristic information on each node, comparing the acquired value with a preset value, and identifying whether the acquired value is a characteristic value or an outlier.

Specifically, the process of the system for performing outlier identification on the target data may alternatively include: acquiring a value corresponding to each piece of characteristic information in the target data, and performing outlier calculation on the acquired values to obtain the outliers corresponding to each piece of characteristic information. Alternatively, acquiring a mean value and a standard deviation of the target data, and generating a first numerical value and a second numerical value according to the mean value and the standard deviation, wherein the first numerical value is greater than the second numerical value; acquiring a value corresponding to each piece of characteristic information in the target data and comparing it with the first numerical value and the second numerical value; if the acquired value is greater than the first numerical value, judging it to be an outlier; and if the acquired value is smaller than the second numerical value, judging it to be a characteristic value.
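
As a non-limiting illustration, generating the first and second numerical values from the mean and standard deviation can be sketched in Python as follows. The text does not state how the two values are derived; taking them as the mean plus and minus k standard deviations, the multiplier k, and the function name are assumptions of this sketch, which implements only the stated rule that a value greater than the first numerical value is an outlier.

```python
from statistics import mean, pstdev

def threshold_outliers(values, k=3.0):
    """Return (first, second, outliers) where first = mean + k*std (assumed),
    second = mean - k*std (assumed), and outliers are the values above `first`."""
    mu, sigma = mean(values), pstdev(values)
    first, second = mu + k * sigma, mu - k * sigma
    outliers = [v for v in values if v > first]
    return first, second, outliers

first, second, outliers = threshold_outliers([10, 11, 9, 10, 12, 10, 11, 100], k=2.0)
```

With k = 2 the single large reading exceeds the first numerical value and is reported as an outlier, while the remaining readings are not.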

Specifically, the process of performing feature controllability management on the target data by the system includes: identifying the controllability of continuous numerical values, marking numerical variables as controllable and marking other variables as uncontrollable. Alternatively, setting a range for variables of a data type; if a variable exceeds the set range, marking it as uncontrollable; and if it does not exceed the set range, marking it as controllable.
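
As a non-limiting illustration, the two controllability rules (by numerical type and by set range) can be sketched in Python as follows. The function names, the treatment of booleans as non-numeric, and the inclusive range bounds are assumptions of this sketch.

```python
from numbers import Real

def controllable_by_type(column):
    """First rule: a variable whose values are all numeric is marked controllable."""
    return all(isinstance(v, Real) and not isinstance(v, bool) for v in column)

def controllable_by_range(column, low, high):
    """Second rule: a variable is controllable only if no value leaves [low, high]."""
    return all(low <= v <= high for v in column)

numeric_flag = controllable_by_type([1.0, 2, 3.5])
text_flag = controllable_by_type(["open", "closed"])
in_range_flag = controllable_by_range([1, 5, 9], 0, 10)
out_of_range_flag = controllable_by_range([1, 50], 0, 10)
```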

Specifically, the process of performing data discretization management on the target data by the system comprises the following steps: acquiring a value interval of the target data; dividing the value interval into two adjacent intervals, wherein each interval corresponds to a discrete attribute value; judging whether the discrete attribute value corresponding to each interval is significant; if not, stopping interval splitting; and if so, dividing the corresponding interval into two new adjacent intervals, judging whether the discrete attribute value corresponding to each new interval is significant, and repeating or stopping interval splitting based on the judgment result.
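
As a non-limiting illustration, the repeated splitting of a value interval into two adjacent intervals can be sketched in Python as follows. The embodiment does not specify the significance test or the split point; splitting at the midpoint and treating a split as significant only while each half keeps at least `min_samples` values are stand-in assumptions of this sketch.

```python
def split_intervals(values, low, high, min_samples=4):
    """Recursively halve [low, high); an interval is split further only while each
    half would keep at least `min_samples` values (a stand-in significance test)."""
    mid = (low + high) / 2
    left = [v for v in values if low <= v < mid]
    right = [v for v in values if mid <= v < high]
    if len(left) < min_samples or len(right) < min_samples:
        # Not significant under the assumed test: stop splitting this interval.
        return [(low, high)]
    return (split_intervals(left, low, mid, min_samples)
            + split_intervals(right, mid, high, min_samples))

values = [1, 2, 3, 4, 11, 12, 13, 14, 16, 17, 18, 19]
intervals = split_intervals(values, 0, 20, min_samples=4)
```

Each returned interval would correspond to one discrete attribute value; any other significance test could be substituted without changing the recursive structure.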

In another embodiment, the present embodiment provides an online data processing and displaying system, configured to perform the following steps:

Performing data access. In this embodiment, data access includes: connecting common databases such as Hive and MySql; uploading a local CSV data file; uploading data files of other projects stored in the cloud; and connecting time-series databases such as Prometheus, Graphite, OpenTSDB and InfluxDB. As shown in fig. 3, data access includes the following steps: clicking create project, filling in the project information, clicking test connectivity, and performing subsequent operations after the data connection is confirmed.

Searching the data. As an example, the search function works as follows: a feature name is entered, and the matching feature information is then displayed automatically.
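
As a non-limiting illustration, searching by feature name can be sketched in Python as follows. Case-insensitive substring matching and the names used are assumptions of this sketch; the embodiment only states that a feature name is entered and the result is displayed automatically.

```python
def search_features(feature_names, query):
    """Return the feature names that contain `query`, ignoring case."""
    needle = query.strip().lower()
    return [name for name in feature_names if needle in name.lower()]

names = ["FurnaceTemp", "inlet_temp", "flow_rate", "pressure"]
hits = search_features(names, "temp")
```

Calling the function again on every change of the input box content would give the automatic update of the search result.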

Displaying the original data. As an example, the original data description information may be presented in blocks, such as presenting the minimum value, maximum value, mean value, median, standard deviation, missing values, outliers and factor number. As another example, the raw data may be presented in single-feature detail, such as by a histogram, scatter plot, curve graph, heat map or the like. As yet another example, the raw data may be presented by single-feature screening, in which the range of the discrete region is viewed by selecting the range of the chart and dragging the outlier line.
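
As a non-limiting illustration, the description information presented in one block for a single feature can be computed in Python as follows. The use of None as the missing marker, the population standard deviation, and the function name are assumptions of this sketch; the outlier count and factor number are omitted because the text does not define them at this point.

```python
from statistics import mean, median, pstdev

def describe(column):
    """Summary values for one feature; None marks a missing value."""
    present = [v for v in column if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": mean(present),
        "median": median(present),
        "std": pstdev(present),
        "missing": len(column) - len(present),
    }

summary = describe([2, 4, 4, 4, 5, 5, 7, 9, None])
```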

Performing cleaning management on the data. In this embodiment, the cleaning management of the data includes: relevance association, identification of missing values, identification of outliers, management of feature controllability, and discretization management of variables.

First, the correlation calculation method for features A and B is:

Secondly, the correlation calculation method for features X and Y is:

Specifically, the missing value identification method in this embodiment includes, but is not limited to, the following:

wherein X_b is the outlier under examination, and

X is the set of measurements whose arithmetic mean is known.

Specifically, the present embodiment manages the data discretization in a manner including, but not limited to, the following:

Displaying the data after the data cleaning management. Specifically, this embodiment displays the processed data in ways including, but not limited to, the following:

Secondly, displaying whether each variable has the controllable attribute. Each variable has a controllable attribute, and the user can select the controllability of each variable when viewing it, which facilitates subsequent analysis.

Thirdly, displaying the data type of each variable, such as continuous type, categorical type and the like. The user can also view the chart corresponding to each variable type.

In another exemplary embodiment, as shown in fig. 3 to 11d, the present embodiment provides a theme data online processing and presentation system, including: a data access module, a data search module, an original data display module, a data cleaning module and a processed data display module.

As shown in fig. 3, the data access module executes the following working steps: clicking create project, filling in the project information, clicking test connectivity, and performing subsequent operations after the data connection is confirmed.

The data searching module executes the following working steps: clicking the magnifying glass icon in the upper right corner and entering content in the input box, whereupon the search results update automatically.

The original data display module executes the following working steps: clicking a histogram cell, dragging the slider bar below the histogram, and dragging the outlier line on the histogram.

The first working step executed by the data cleaning module is as follows: as shown in fig. 4 and 5, clicking the "variable name" drop-down box; deleting and filling data; clicking the drop-down button next to "missing value" and clicking descending order/delete/fill; clicking the drop-down button next to "outlier" and clicking descending order/delete/fill/set to null; and, after the cleaning settings are made, clicking execute. Clicking the "variable name" drop-down box includes: clicking the correlation switch, whereupon consecutive variables of the same color form one group, indicating that the correlation of those variables is high; and adjusting the correlation grouping threshold parameter to regenerate the variable groups. Data deletion and filling cannot be undone once executed. The descending order/delete/fill options for missing values are: 1) descending order: sorting the variables in descending order of the number of missing values; 2) delete: deleting the row data of the missing value; and 3) replace: filling the missing values with the mean value. The descending order/delete/fill/set to null options for outliers are: 1) descending order: sorting the variables in descending order of the number of outliers; 2) delete: deleting the row data of the missing value; 3) replace: filling missing values with the mean value; and 4) set to null: setting the outlier to null. The sub-step of clicking execute after the cleaning settings are made includes displaying the prompt: "The cleaning method takes effect only after execution. Note! The operation cannot be undone after execute is clicked."

The second working step executed by the data cleaning module is as follows: as shown in fig. 6, discretizing a variable by clicking the drop-down button next to the variable name, clicking the discretization button, selecting a screening condition, and clicking the confirm button; and undoing the discretization of a variable by clicking the variable name drop-down button and clicking undo discretization, after which the step ends. The screening conditions are: 1) equal width: binning by equal width over the range of the variable, so that every bin interval has the same length; and 2) equal frequency: binning by equal frequency over the number of samples, so that every bin interval holds substantially the same number of samples. In the undo step, the prompt "already discretized; the execute button cannot be clicked in the discretized state" is displayed; to undo, the variable name drop-down button is clicked and undo discretization is clicked, after which the step ends.

Specifically, the implementation flow of this embodiment is as follows:

(5) As shown in fig. 10a, click the drop-down button next to "missing value"; clicking descending order sorts the variables in descending order of the number of missing values; clicking delete deletes the row data of the missing value; and clicking replace fills the missing values with the mean value. As shown in fig. 10b, click the drop-down button next to "outlier"; clicking descending order sorts the variables in descending order of the number of outliers; clicking delete deletes the row data of the missing value; clicking mean value substitution fills missing values with the mean value; and clicking set to null sets the outlier to null. After the cleaning settings are made, click execute; the cleaning method takes effect only after execution. Note! The operation cannot be undone after execute is clicked.

(6) As shown in fig. 11a, click the drop-down button next to the variable name and click the discretization button; as shown in fig. 11b, select a screening condition: selecting equal width bins the variable by equal width over its range, so that every bin interval has the same length, and selecting equal frequency bins the variable by equal frequency over the number of samples, so that every bin interval holds substantially the same number of samples; as shown in fig. 11c, click the "OK" button; and as shown in fig. 11d, click the variable name drop-down button and click undo discretization.

In summary, the present invention provides a data online processing and displaying system, which first obtains target data to be subjected to data mining from a preset data source; then searches the target data to acquire characteristic information in the target data; and performs data processing on the target data according to the characteristic information and displays the target data after the data processing is completed; wherein the data processing comprises at least one of: relevance association, missing value identification, outlier identification, feature controllability management and data discretization management. By processing the target data online in multiple ways, the system solves the problem of poor data processing capability of existing data mining tools. After the online processing, the system can also display the data through multiple charts, which solves the problem of poor user-friendliness of existing data mining tools; and because the system provides multiple chart display modes and each chart can correspond to the same or different operations, the problem of difficult interface operation in existing data mining tools is also solved. Therefore, compared with the prior art, the system has a simple and clear flow and is easy to operate; its structure is complete and its data processing capability is strong; it has a data cleaning function and meets diversified requirements; and it can display the data after processing and is powerful in function. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A data online processing display method is characterized by comprising the following steps:

acquiring target data to be subjected to data mining from a preset data source;

2. The method for displaying data online processing according to claim 1, wherein the process of performing correlation processing on the target data includes:

3. The method for displaying data online processing according to claim 1, wherein the step of identifying missing values of the target data comprises:

judging whether certain characteristic information exists in the target data and contains a plurality of data types; if the data type exists, taking the data type with the most number in the corresponding characteristic information as a main data type of the corresponding characteristic information, and recording other data types as missing values;

4. The method for displaying data online processing according to claim 1, wherein the process of performing outlier identification on the target data comprises:

5. The method for displaying data online processing according to claim 1, wherein the process of performing outlier identification on the target data comprises:

obtaining a value corresponding to each feature information in the target data, comparing the obtained value with a first numerical value and a second numerical value, and if the obtained value is larger than the first numerical value, judging that the corresponding numerical value is an outlier; if the obtained value is smaller than the second value, the corresponding value is judged to be the characteristic value.

6. The method for displaying data online processing according to claim 1, wherein the process of performing feature controllability management on the target data comprises:

or setting a range for the variables of the data type, and if the range exceeds the set range, marking the variables as uncontrollable; if the set range is not exceeded, the flag is controllable.

7. The method for displaying data online processing according to claim 1, wherein the process of performing data discretization management on the target data comprises:

acquiring a value interval of the target data;

8. The method for displaying data online processing according to claim 1, further comprising: displaying the target data according to the characteristic information; the display comprises the following steps: the target data are displayed in a blocking mode, the target data are displayed in a single-feature detailed mode, and the target data are displayed in a single-feature screening mode;

and when the target data are displayed in single-feature detail, displaying by using a histogram, a scatter diagram, a curve graph and/or a heat map.

9. The method for displaying data online processing according to any one of claims 1 to 8, wherein the displaying of the target data after the data processing comprises: the display is carried out by using various colors and the display is carried out by using various charts.

10. A data online processing display method is characterized by comprising the following steps: