CN110968627A

CN110968627A - Big data analysis method and system

Info

Publication number: CN110968627A
Application number: CN201911097264.2A
Authority: CN
Inventors: 蒋健
Original assignee: Nanjing Fengkaiyunge Data Technology Co Ltd
Current assignee: Nanjing Fengkaiyunge Data Technology Co Ltd
Priority date: 2019-11-11
Filing date: 2019-11-11
Publication date: 2020-04-07

Abstract

The application relates to a big data analysis method which is characterized by comprising the following steps: acquiring big data; preprocessing the acquired big data; performing machine learning on the preprocessed big data to detect abnormal data; and displaying the detection result.

Description

Big data analysis method and system

Technical Field

The application relates to the technical field of the next generation information network industry, in particular to a big data analysis method and system.

Background

Big data analysis refers to the analysis of data on a huge scale. Big data can be summarized by the "5 Vs": Volume (large data volume), Velocity (high speed), Variety (multiple types), Value, and Veracity (authenticity).

Big data is currently one of the hottest terms in the IT industry, and exploiting its commercial value, through data warehousing, data security, data analysis, data mining and the like, has gradually become a profit focus pursued across the industry. In "Big Data: A Revolution That Will Transform How We Live, Work, and Think", Viktor Mayer-Schönberger argues that data can become a valuable asset in the future and that, given time, it will make its way onto the balance sheet.

With the advent of the big data era, whether data has value, that is, whether it is garbage or treasure, depends above all on whether the data to be analyzed and mined is of high quality. Big data analysis should therefore be carried out.

Disclosure of Invention

In order to overcome the problems in the related art, the application provides a big data analysis method and a big data analysis system.

According to an embodiment of the present application, a big data analysis method is provided, which is characterized by including the following steps:

acquiring big data;

preprocessing the acquired big data;

performing machine learning on the preprocessed big data to detect abnormal data;

and displaying the detection result.

Preferably, the preprocessing the acquired big data comprises:

performing data conversion on the acquired big data;

and performing data cleaning on the converted big data.

Preferably, the data conversion of the acquired big data comprises:

converting the data of the string type, the character type and the Boolean type in the acquired big data into data of a numeric type.

Preferably, the preprocessing the acquired big data further comprises:

discretization is performed before data conversion.

Preferably, problematic data is cleaned by adopting a preset logic rule.

Preferably, the machine learning of the preprocessed big data includes:

and discovering abnormal data from the preprocessed big data by adopting a deep learning network model and a statistical process control model.

Preferably, presenting the detection result comprises:

and determining and displaying the acquired big data risk value, error value and correct value according to the detection result of the abnormal data.

Preferably, the method further comprises the following steps: and carrying out data integrity detection on the acquired big data.

Preferably, the data integrity detection of the acquired big data includes:

acquiring data consistency requirement rating of system data transmission/exchange;

establishing a detection sample pool according to the obtained data consistency requirement rating;

randomly selecting samples from a detection sample pool according to the consistency test requirement for detection;

updating data consistency requirements.

In an embodiment of the present invention, a big data analysis system is further provided, including:

the acquisition module is used for acquiring big data;

the preprocessing module is used for preprocessing the acquired big data;

the machine learning module is used for performing machine learning on the preprocessed big data so as to detect abnormal data;

and the display module is used for displaying the detection result.

The technical scheme provided by the embodiment of the application can have the following beneficial effects: the invention provides a big data analysis method and a big data analysis system, which are used for carrying out anomaly detection on big data by adopting machine learning, thereby improving the data quality of the big data and further better mining the value from the big data.

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow diagram illustrating a big data analysis method according to an exemplary embodiment;

FIG. 2 is a schematic diagram of a big data analytics system, shown in accordance with another exemplary embodiment;

FIG. 3 is a flow diagram illustrating a big data analysis method according to an exemplary embodiment;

FIG. 4 is a schematic diagram illustrating a big data analytics system, according to another exemplary embodiment.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings, in which like numerals in different drawings represent the same or similar elements, unless otherwise specified. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.

The following disclosure provides many different embodiments, or examples, for implementing different features of the application. In order to simplify the disclosure of the present application, specific example components and arrangements are described below. Of course, they are merely examples and are not intended to limit the present application. Further, the present application may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. Further, examples of various specific processes and materials are provided herein, but one of ordinary skill in the art may recognize the applicability of other processes and/or the use of other materials. In addition, the structure of a first feature described below as "on" a second feature may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which additional features are formed between the first and second features, such that the first and second features may not be in direct contact.

In the description of the present application, it should be noted that unless otherwise specified and limited, the terms "mounted," "coupled," and "connected" are to be interpreted broadly, and may be, for example, mechanically or electrically connected, or interconnected between two elements, directly or indirectly through intervening media, and the specific meanings of the terms as described above will be understood by those skilled in the art according to specific situations.

Fig. 1 is a flowchart illustrating a big data analysis method according to an exemplary embodiment. Referring to Fig. 1, the method includes the following steps:

step S10, acquiring big data;

step S20, preprocessing the acquired big data;

step S30, machine learning is carried out on the preprocessed big data to detect abnormal data;

and step S40, displaying the detection result.

The most important factor in whether data has value, that is, whether it turns out to be garbage or treasure, is whether the data to be analyzed and mined is of high quality. The big data analysis method of this embodiment uses machine learning to detect anomalies in the big data, which improves the data quality of the big data and allows value to be mined from it more effectively.
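Steps S10 to S40 can be read as a four-stage pipeline. The following is a minimal sketch of that flow, not the implementation of the embodiment: the record layout, the function names, and the fixed threshold standing in for a trained machine learning model are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[float]]  # hypothetical record layout


def acquire() -> List[dict]:
    """S10: stand-in for a real data source (assumed in-memory here)."""
    return [
        {"sensor": "a", "value": 10.0},
        {"sensor": "b", "value": None},   # missing value
        {"sensor": "c", "value": 11.0},
        {"sensor": "d", "value": 500.0},  # suspicious reading
    ]


def preprocess(records: List[dict]) -> List[dict]:
    """S20: drop records whose value is missing (one simple cleaning rule)."""
    return [r for r in records if r["value"] is not None]


def detect(records: List[dict], threshold: float = 100.0) -> List[Tuple[dict, bool]]:
    """S30: placeholder detector; a trained model would replace this rule."""
    return [(r, r["value"] > threshold) for r in records]


def display(results: List[Tuple[dict, bool]]) -> List[str]:
    """S40: render one line per record."""
    return [f"{r['sensor']}: {'ABNORMAL' if bad else 'ok'}" for r, bad in results]


def run_pipeline() -> List[str]:
    """Chain the four steps in the order S10 -> S20 -> S30 -> S40."""
    return display(detect(preprocess(acquire())))
```

Keeping each step as a separate function mirrors the module split of the system embodiment, so any single stage can be swapped out without touching the others.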

Preferably, the preprocessing the acquired big data comprises:

performing data conversion on the acquired big data;

and performing data cleaning on the converted big data.

The preprocessing set by the preferred embodiment can efficiently filter out data with basic data quality problems in advance with a small amount of computation, thereby significantly reducing the amount of computation in subsequent machine learning steps.

Preferably, the data conversion of the acquired big data comprises:

converting the data of the string type, the character type and the Boolean type in the acquired big data into data of a numeric type.

Machine learning computations are generally run on numeric processors such as graphics cards, whose numeric processing performance is very good. In the preferred embodiment, the various types of data are converted into numeric types in advance, which makes the data homogeneous, improves storage efficiency, and significantly improves the efficiency of the subsequent machine learning steps.

Preferably, the data of the character string type, the character type, and the boolean value type in the acquired big data is converted into data of a numeric type by a hash code function.
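The text does not say which hash code function is used. The sketch below assumes a stable 32-bit code taken from a SHA-256 digest; the function name `to_numeric` and the 0/1 mapping for Booleans are illustrative choices. Python's built-in `hash()` is avoided because its result for strings changes between interpreter runs unless `PYTHONHASHSEED` is fixed.

```python
import hashlib
from typing import Union

Value = Union[str, bool, int, float]


def to_numeric(value: Value) -> float:
    """Map strings, characters and Booleans to numbers; pass numbers through."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # Assumed hash code: first 4 bytes of the SHA-256 digest as an
        # unsigned integer, so the same text always yields the same number.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return float(int.from_bytes(digest[:4], "big"))
    raise TypeError(f"unsupported type: {type(value).__name__}")


def convert_record(record: dict) -> dict:
    """Apply the conversion to every field of one record."""
    return {key: to_numeric(val) for key, val in record.items()}
```

A hash code carries no ordering or distance information, so values converted this way behave as category identifiers rather than as measurements.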

Preferably, the preprocessing the acquired big data further comprises:

discretization is performed before data conversion.

Discretization refers to mapping a set of quantities that exist in a continuous space into a particular discrete space. The acquired big data is usually huge in volume and comes from many sources. The preferred embodiment discretizes the acquired big data in advance, which significantly reduces the data volume, significantly improves storage efficiency, and further improves the efficiency of the big data analysis method.

Preferably, the discretization process comprises: the continuous space is partitioned into a plurality of small spaces, and the resulting continuous small spaces are then associated with discrete forms of data values.
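As one simple instance of partitioning a continuous space into small spaces, the sketch below uses equal-width intervals and maps each value to the index of its interval. Equal-width binning is an assumption chosen for illustration; the text does not prescribe how the partition is drawn.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Map each value to the index (0 .. n_bins - 1) of its interval."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so a single interval is enough.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, `equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], 4)` yields `[0, 1, 2, 3, 3]`: five floating-point readings are replaced by four small integer codes.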

Preferably, the discretization process comprises:

(1) Calculate the importance of the attributes of the acquired big data and sort them according to the calculation result to obtain [a_1, …, a_m], where M denotes the original data amount, O denotes the number of output classes, and a denotes an attribute of the original sample.

(2) Set the initial attribute index k to 1 and the discrete point index i to 1. Sorting the original values of feature a_i in ascending order yields a sequence D', from which the minimum and maximum values of the sample instances on attribute a_k are obtained, denoted d_0 and d_m respectively.

(3) Calculate the midpoints of adjacent elements in the sorted set D', and collect the results to construct a discrete point candidate set L.

(4) Initialize the set to obtain L' = [d_0, …, d_m], and set the maximum value to 0.

(5) Based on the discrete point candidate set L and the processed set L', add to L' the point values in L that do not belong to L'.

(6) Select the breakpoint with the maximum value from the calculation result and store it in L'.

(7) For all features, select the first discrete points and calculate the inconsistency of the samples. If the inconsistency does not meet the preset condition, set i = i + 1; while k < m, continue with the previous step, select the i-th discrete point of the k-th attribute, set k = k + 1, and return to the previous step. Once the calculation result meets the preset condition, the process ends.
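Two building blocks of this procedure can be sketched in isolation: constructing the candidate set from midpoints as in step (3), and measuring sample inconsistency as in step (7). The text does not define the inconsistency measure, so the sketch assumes a common one: the fraction of samples whose class differs from the majority class among samples with the same discretized attribute values. All function names are hypothetical, and the breakpoint scoring of step (6) is not covered.

```python
from bisect import bisect_right
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence


def candidate_cut_points(values: Sequence[float]) -> List[float]:
    """Step (3): midpoints of adjacent distinct values in sorted order."""
    d = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(d, d[1:])]


def interval_index(value: float, cuts: Sequence[float]) -> int:
    """Index of the interval that `value` falls into; `cuts` must be sorted."""
    return bisect_right(cuts, value)


def inconsistency(
    rows: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    cuts_per_attr: Sequence[Sequence[float]],
) -> float:
    """Assumed measure: share of samples outside their cell's majority class."""
    cells = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(interval_index(v, c) for v, c in zip(row, cuts_per_attr))
        cells[key][label] += 1
    clashes = sum(sum(c.values()) - max(c.values()) for c in cells.values())
    return clashes / len(rows)
```

With rows `[(1.0,), (2.0,), (3.0,), (4.0,)]` and labels `["n", "n", "y", "y"]`, no cut points give an inconsistency of 0.5, while the single cut point 2.5 separates the classes and brings it to 0.0. A loop in the spirit of step (7) would keep adding cut points until this value meets the preset condition.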

A large number of simulation experiments show that this discretization improves the rate at which correct data is identified.

Preferably, problematic data is cleaned by adopting a preset logic rule.

The preferred embodiment eliminates obviously wrong data and improves the efficiency of big data analysis.

For example, the logic rule may specify that data that is null is abnormal data and should be cleaned out.
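Rule-based cleaning of this kind can be sketched as a list of named predicates applied to every record. The null rule follows the example in the text; the `negative_age` rule, the record layout, and the function names are hypothetical additions for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Tuple[str, Callable[[Record], bool]]  # (name, predicate: True = problematic)


def has_null(record: Record) -> bool:
    """True if any field is None, an empty string, or a float NaN."""
    for value in record.values():
        if value is None or value == "":
            return True
        if isinstance(value, float) and math.isnan(value):
            return True
    return False


def negative_age(record: Record) -> bool:
    """Hypothetical domain rule: an age below zero is obviously wrong."""
    age = record.get("age")
    return isinstance(age, (int, float)) and age < 0


RULES: List[Rule] = [("null_value", has_null), ("negative_age", negative_age)]


def clean(records: List[Record], rules: List[Rule] = RULES):
    """Split records into kept ones and rejected ones with the rules they broke."""
    kept, rejected = [], []
    for record in records:
        broken = [name for name, predicate in rules if predicate(record)]
        if broken:
            rejected.append((record, broken))
        else:
            kept.append(record)
    return kept, rejected
```

Returning the rejected records together with the names of the rules they violated keeps the cleaning step auditable instead of silently discarding data.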

Preferably, the machine learning of the preprocessed big data includes:

discovering abnormal data from the preprocessed big data by adopting a deep learning network model and a statistical process control model.

A growing number of deep learning network models are available, and the preferred embodiment can adopt existing ones, which makes the big data analysis method considerably easier to popularize and apply.

The preferred embodiment provides a data quality control method capable of fusing and applying a deep learning mode and a statistical process control model, which can effectively utilize computing resources and algorithm control to detect outlier data and provide greater value for services.
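The statistical process control half of this combination can be illustrated with a basic control chart. The sketch below assumes limits at the baseline mean plus or minus k sample standard deviations with k = 3; the text does not specify the chart type or the limits, and the deep learning model is not shown.

```python
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def control_limits(baseline: Sequence[float], k: float = 3.0) -> Tuple[float, float]:
    """Lower and upper control limits estimated from in-control baseline data."""
    center = mean(baseline)
    sigma = stdev(baseline)  # sample standard deviation; needs >= 2 points
    return center - k * sigma, center + k * sigma


def out_of_control(values: Sequence[float], limits: Tuple[float, float]) -> List[int]:
    """Indices of the values that fall outside the control limits."""
    lower, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]
```

For a baseline of `[10.0, 10.2, 9.8, 10.1, 9.9]` the limits are roughly 9.53 and 10.47, so in `[10.0, 10.3, 12.0, 9.0]` the readings at indices 2 and 3 are flagged. In a combined setup, points flagged by either the chart or the learned model could be passed on as abnormal data.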

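Preferably, presenting the detection result comprises:

determining and displaying the risk value, error value and correct value of the acquired big data according to the detection result of the abnormal data.

The preferred embodiment meets the application requirements of various scenarios and can better warn users.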
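The disclosure names a risk value, an error value and a correct value for display but does not define them. The sketch below assumes one plausible reading: the error value is the number of records flagged abnormal, the correct value is the number of remaining records, and the risk value is the abnormal share of all records. These definitions and the function names are assumptions.

```python
from typing import Dict, Sequence


def summarize(flags: Sequence[bool]) -> Dict[str, float]:
    """Summarize detection flags (True = abnormal) into three figures."""
    total = len(flags)
    errors = sum(1 for flag in flags if flag)
    correct = total - errors
    risk = errors / total if total else 0.0  # assumed definition of risk
    return {"risk": risk, "error": errors, "correct": correct}


def render(summary: Dict[str, float]) -> str:
    """Format the summary as a single display line."""
    return (
        f"risk={summary['risk']:.1%} "
        f"error={summary['error']} correct={summary['correct']}"
    )
```

For four records with one flagged abnormal, `render(summarize([False, True, False, False]))` produces `risk=25.0% error=1 correct=3`.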

Preferably, the big data analysis method further comprises: and carrying out data integrity detection on the acquired big data.

The data integrity is also an important index of the big data quality, and the data quality of the big data can be further improved by carrying out data integrity detection on the obtained big data.

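Preferably, the data integrity detection of the acquired big data includes:

acquiring the data consistency requirement rating of system data transmission/exchange;

establishing a detection sample pool according to the obtained data consistency requirement rating;

randomly selecting samples from the detection sample pool according to the consistency test requirement for detection;

updating data consistency requirements.

Extensive practical simulation shows that the random algorithm adopted by the preferred embodiment is the simplest and fastest algorithm, with the lowest time complexity.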
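The integrity detection flow (rating, sample pool, random selection, update) can be sketched as follows. The rating names, the pool fractions, the use of SHA-256 checksums to compare sent and received payloads, and the escalation rule are all assumptions made for illustration; the text does not specify any of them.

```python
import hashlib
import random
from typing import Dict, List

# Hypothetical mapping from consistency requirement rating to pool size.
POOL_FRACTION = {"low": 0.1, "medium": 0.5, "high": 1.0}


def checksum(payload: bytes) -> str:
    """Fingerprint used to compare a record before and after transmission."""
    return hashlib.sha256(payload).hexdigest()


def build_pool(sent: Dict[str, bytes], rating: str, rng: random.Random) -> List[str]:
    """Build the detection sample pool; a stricter rating gives a larger pool."""
    keys = sorted(sent)
    size = min(len(keys), max(1, round(len(keys) * POOL_FRACTION[rating])))
    return rng.sample(keys, size)


def check_integrity(sent, received, rating: str, n_samples: int, seed: int = 0):
    """Randomly pick samples from the pool and return the keys that differ."""
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    pool = build_pool(sent, rating, rng)
    picked = rng.sample(pool, min(n_samples, len(pool)))
    return [
        key for key in picked
        if key not in received or checksum(sent[key]) != checksum(received[key])
    ]


def update_rating(rating: str, mismatched: List[str]) -> str:
    """Assumed update rule: any mismatch escalates the requirement to 'high'."""
    return "high" if mismatched else rating
```

Random selection costs time proportional to the number of samples drawn rather than to the size of the data set, which is consistent with the low time complexity claimed for the random algorithm.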

FIG. 2 is a schematic diagram illustrating a big data analysis system according to another exemplary embodiment. As shown, the system includes:

the acquisition module 10 is used for acquiring big data;

the preprocessing module 20 is configured to preprocess the acquired big data;

a machine learning module 30, configured to perform machine learning on the preprocessed big data to detect abnormal data;

and a display module 40 for displaying the detection result.

The most important factor in whether data has value, that is, whether it turns out to be garbage or treasure, is whether the data to be analyzed and mined is of high quality. The big data analysis system of this embodiment uses machine learning to detect anomalies in the big data, which improves the data quality of the big data and allows value to be mined from it more effectively.

Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.

It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims

1. A big data analysis method is characterized by comprising the following steps:

acquiring big data;

preprocessing the acquired big data;

performing machine learning on the preprocessed big data to detect abnormal data;

and displaying the detection result.

2. The big data analysis method according to claim 1, wherein preprocessing the acquired big data comprises:

performing data conversion on the acquired big data;

and performing data cleaning on the converted big data.

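3. The big data analysis method of claim 2, wherein the data conversion of the acquired big data comprises:

converting the data of the string type, the character type and the Boolean type in the acquired big data into data of a numeric type.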

4. The big data analysis method according to claim 3, wherein preprocessing the acquired big data further comprises:

discretization is performed before data conversion.

5. The big data analysis method according to claim 2, wherein problematic data is cleaned using preset logic rules.

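6. The big data analysis method of claim 1, wherein machine learning the preprocessed big data comprises:

discovering abnormal data from the preprocessed big data by adopting a deep learning network model and a statistical process control model.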

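7. The big data analysis method of claim 1, wherein presenting the detection results comprises:

determining and displaying the risk value, error value and correct value of the acquired big data according to the detection result of the abnormal data.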

8. The big data analysis method according to claim 1, further comprising: and carrying out data integrity detection on the acquired big data.

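9. The big data analysis method according to claim 8, wherein performing data integrity detection on the acquired big data comprises:

acquiring the data consistency requirement rating of system data transmission/exchange;

establishing a detection sample pool according to the obtained data consistency requirement rating;

randomly selecting samples from the detection sample pool according to the consistency test requirement for detection;

updating data consistency requirements.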

10. A big data analytics system, comprising:

the acquisition module is used for acquiring big data;

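the preprocessing module is used for preprocessing the acquired big data;

the machine learning module is used for performing machine learning on the preprocessed big data so as to detect abnormal data;

and the display module is used for displaying the detection result.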