CN110968627A - Big data analysis method and system - Google Patents

Info

Publication number
CN110968627A
Authority
CN
China
Prior art keywords
big data
data
analysis method
acquired
preprocessing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911097264.2A
Other languages
Chinese (zh)
Inventor
蒋健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Fengkaiyunge Data Technology Co Ltd
Original Assignee
Nanjing Fengkaiyunge Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Fengkaiyunge Data Technology Co Ltd filed Critical Nanjing Fengkaiyunge Data Technology Co Ltd
Priority to CN201911097264.2A priority Critical patent/CN110968627A/en
Publication of CN110968627A publication Critical patent/CN110968627A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database

Abstract

The application relates to a big data analysis method which is characterized by comprising the following steps: acquiring big data; preprocessing the acquired big data; performing machine learning on the preprocessed big data to detect abnormal data; and displaying the detection result.

Description

Big data analysis method and system
Technical Field
The application relates to the technical field of the next generation information network industry, in particular to a big data analysis method and system.
Background
Big data analysis refers to the analysis of data on a huge scale. Big data is commonly summarized by the five Vs: Volume (large data volume), Velocity (high speed), Variety (many types), Value, and Veracity (authenticity).
Big data is currently the hottest term in the IT industry, and exploiting its commercial value through data warehousing, data security, data analysis, data mining and the like has gradually become the profit focus pursued by the industry. In "Big Data: A Revolution That Will Transform How We Live, Work, and Think", Viktor Mayer-Schönberger argues that data may become a valuable asset in the future and, given time, may well enter the balance sheet.
With the advent of the big data era, whether data has value, that is, whether it is garbage or treasure, depends above all on whether the data to be analyzed and mined is of high quality. Big data analysis should therefore be carried out.
Disclosure of Invention
In order to overcome the problems in the related art, the application provides a big data analysis method and a big data analysis system.
According to an embodiment of the present application, a big data analysis method is provided, which is characterized by including the following steps:
acquiring big data;
preprocessing the acquired big data;
performing machine learning on the preprocessed big data to detect abnormal data;
and displaying the detection result.
Preferably, the preprocessing the acquired big data comprises:
performing data conversion on the acquired big data;
and performing data cleaning on the converted big data.
Preferably, the data conversion of the acquired big data comprises:
converting data of the character string type, the character type and the Boolean type in the acquired big data into data of a numeric type.
Preferably, the preprocessing the acquired big data further comprises:
discretization is performed before data conversion.
Preferably, problematic data is cleaned using preset logic rules.
Preferably, the machine learning of the preprocessed big data includes:
and discovering abnormal data from the preprocessed big data by adopting a deep learning network model and a statistical process control model.
Preferably, presenting the detection result comprises:
determining and displaying a risk value, an error value and a correct value of the acquired big data according to the detection result of the abnormal data.
Preferably, the method further comprises: performing data integrity detection on the acquired big data.
Preferably, the data integrity detection of the acquired big data includes:
acquiring data consistency requirement rating of system data transmission/exchange;
establishing a detection sample pool according to the obtained data consistency requirement rating;
randomly selecting samples from a detection sample pool according to the consistency test requirement for detection;
updating data consistency requirements.
In an embodiment of the present invention, a big data analysis system is further provided, including:
the acquisition module is used for acquiring big data;
the preprocessing module is used for preprocessing the acquired big data;
the machine learning module is used for performing machine learning on the preprocessed big data so as to detect abnormal data;
and the display module is used for displaying the detection result.
The technical scheme provided by the embodiment of the application can have the following beneficial effects: the invention provides a big data analysis method and a big data analysis system, which are used for carrying out anomaly detection on big data by adopting machine learning, thereby improving the data quality of the big data and further better mining the value from the big data.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram illustrating a big data analysis method according to an exemplary embodiment;
FIG. 2 is a schematic diagram of a big data analytics system, shown in accordance with another exemplary embodiment;
FIG. 3 is a flow diagram illustrating a big data analysis method according to an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a big data analytics system, according to another exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings, in which like numerals in different drawings represent the same or similar elements, unless otherwise specified. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The following disclosure provides many different embodiments, or examples, for implementing different features of the application. In order to simplify the disclosure of the present application, specific example components and arrangements are described below. Of course, they are merely examples and are not intended to limit the present application. Further, the present application may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed. Further, examples of various specific processes and materials are provided herein, but one of ordinary skill in the art may recognize the applicability of other processes and/or the use of other materials. In addition, the structure of a first feature described below as "on" a second feature may include embodiments in which the first and second features are formed in direct contact, and may also include embodiments in which additional features are formed between the first and second features, such that the first and second features may not be in direct contact.
In the description of the present application, it should be noted that, unless otherwise specified and limited, the terms "mounted," "coupled," and "connected" are to be interpreted broadly: a connection may be, for example, mechanical or electrical, may be an interconnection between two elements, and may be direct or indirect through an intervening medium. Those skilled in the art will understand the specific meanings of these terms according to the specific situation.
Fig. 1 is a flowchart illustrating a big data analysis method according to an exemplary embodiment. Referring to fig. 1, the method includes the following steps:
step S10, acquiring big data;
step S20, preprocessing the acquired big data;
step S30, machine learning is carried out on the preprocessed big data to detect abnormal data;
and step S40, displaying the detection result.
Whether data has value, that is, whether it turns out to be garbage or treasure, depends above all on whether the data to be analyzed and mined is of high quality. The big data analysis method of this embodiment uses machine learning to detect anomalies in the big data, which improves the data quality of the big data so that value can be better mined from it.
Preferably, the preprocessing the acquired big data comprises:
performing data conversion on the acquired big data;
and performing data cleaning on the converted big data.
The preprocessing of this preferred embodiment filters out, in advance and with little computation, data that has basic data quality problems, thereby significantly reducing the amount of computation in the subsequent machine learning steps.
Preferably, the data conversion of the acquired big data comprises:
converting data of the character string type, the character type and the Boolean type in the acquired big data into data of a numeric type.
Machine learning computations are generally performed on numerical processors such as graphics cards, which perform very well on numeric data. In this preferred embodiment, the various types of data are converted into numeric types in advance, which makes the data homogeneous, improves storage efficiency, and significantly improves the efficiency of the subsequent machine learning steps.
Preferably, the data of the character string type, the character type, and the boolean value type in the acquired big data is converted into data of a numeric type by a hash code function.
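The conversion by hash code function can be sketched as follows. This is only an illustration under stated assumptions: the application does not name a particular hash function, so the use of SHA-256 truncated to 64 bits, the use of the code point for single characters, and the helper name to_numeric are all hypothetical choices.

```python
import hashlib

def to_numeric(value):
    """Map a bool, single character, or string to a number; pass numbers through."""
    if isinstance(value, bool):            # check bool first: bool is a subclass of int
        return int(value)
    if isinstance(value, (int, float)):    # already numeric
        return value
    if isinstance(value, str):
        if len(value) == 1:                # character type: use its code point (assumption)
            return ord(value)
        # String type: a stable 64-bit hash (assumption). The built-in hash()
        # is salted per process for strings, so it is not reproducible across runs.
        digest = hashlib.sha256(value.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    raise TypeError(f"unsupported type: {type(value).__name__}")

record = {"name": "sensor-17", "grade": "B", "active": True, "reading": 3.5}
numeric_record = {key: to_numeric(val) for key, val in record.items()}
```

A hash maps equal strings to equal numbers but carries no ordering or distance information, so a model can only treat such a column as an identifier-like category.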
Preferably, the preprocessing the acquired big data further comprises:
discretization is performed before data conversion.
Discretization refers to mapping a set of values that exist in one space into a particular space. The acquired big data is usually huge in volume and comes from many kinds of sources. This preferred embodiment discretizes the acquired big data in advance, which significantly reduces the data volume, significantly improves storage efficiency, and further improves the efficiency of the big data analysis method.
Preferably, the discretization process comprises: the continuous space is partitioned into a plurality of small spaces, and the resulting continuous small spaces are then associated with discrete forms of data values.
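As a minimal sketch of partitioning a continuous range into small intervals and associating each interval with a discrete value, the following uses equal-width intervals. The equal-width choice and the function names are assumptions made for illustration; the application does not state how the interval boundaries are chosen at this point.

```python
from bisect import bisect_right

def equal_width_cut_points(low, high, bins):
    """Return the interior boundaries that split [low, high] into `bins` equal intervals."""
    width = (high - low) / bins
    return [low + width * i for i in range(1, bins)]

def discretize(value, cut_points):
    """Return the index of the interval that `value` falls into."""
    return bisect_right(cut_points, value)

cuts = equal_width_cut_points(0.0, 100.0, 4)          # boundaries at 25, 50, 75
labels = [discretize(v, cuts) for v in [3.2, 25.0, 49.9, 88.0]]
```

Each continuous reading is replaced by a small integer interval index, which is what reduces the number of distinct values that later steps have to handle.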
Preferably, the discretization process comprises:
(1) Calculate the importance of the attributes of the acquired big data and sort them according to the calculation result to obtain $[a_1, \dots, a_m]$, where M denotes the amount of original data, O denotes the number of output classes, and a denotes an attribute of the original samples.
(2) Set the initial attribute index $k = 1$ and the discrete point index $i = 1$. Sorting the original values of feature $a_k$ in ascending order gives a sequence $D'$, from which the minimum and maximum values of the sample instances on attribute $a_k$ are obtained, denoted $d_0$ and $d_m$ respectively.
(3) Calculate the midpoints of adjacent elements in the sorted set $D'$ and collect the results to construct the candidate set of discrete points $L$.
(4) Initialize the set to obtain $L' = [d_0, \dots, d_m]$, and initialize the maximum value to 0.
(5) Based on the candidate discrete points $L$ and the working set $L'$, add to $L'$ the point values in $L$ that do not belong to $L'$.
(6) From the calculation results, select the breakpoint with the maximum value and store it in $L'$.
(7) For all features, select the first discrete point of each and calculate the inconsistency of the samples. If the inconsistency does not meet the preset condition, set $i = i + 1$; while $k < m$, repeat the previous step, select the $i$-th discrete point of the $k$-th attribute, set $k = k + 1$, and return to the previous step. Once the calculation result meets the preset condition, the process ends.
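Steps (2) to (4) for a single attribute can be sketched as below. The sketch makes several assumptions that are not stated in the application: duplicate values are removed before the midpoints are taken, the initial working set is taken to hold only the two end points, and the function name is hypothetical. The scoring used in steps (5) to (7) to pick breakpoints is not defined in the text, so it is not reproduced here.

```python
def candidate_cut_points(values):
    """Sort one attribute's values, take the extremes, and collect adjacent midpoints."""
    ordered = sorted(set(values))                      # sorted sequence D' (duplicates dropped: assumption)
    d_min, d_max = ordered[0], ordered[-1]             # smallest and largest value of the attribute
    midpoints = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]  # candidate set L
    working_set = [d_min, d_max]                       # initial working set L' (end points only: assumption)
    return midpoints, working_set

L, L_prime = candidate_cut_points([4.0, 1.0, 3.0, 1.0, 8.0])
```

A selection loop would then move points from L into L_prime one at a time according to whatever criterion the implementation adopts, stopping when the inconsistency condition is satisfied.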
A large number of simulation experiments show that this discretization improves the rate at which correct data is identified.
Preferably, problematic data is cleaned using preset logic rules.
This preferred embodiment eliminates obviously erroneous data and improves the efficiency of big data analysis.
For example, a logic rule may specify that null data is abnormal data and should be removed.
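A rule-based cleaning step of this kind might look like the following sketch. The representation of records as dictionaries, the treatment of empty strings as null, and the names is_null, RULES and clean are assumptions for illustration only.

```python
def is_null(record):
    """Rule: a record with any missing (None) or empty-string field is abnormal."""
    return any(v is None or v == "" for v in record.values())

RULES = [is_null]   # hypothetical rule list; further predicates can be appended

def clean(records, rules=RULES):
    """Keep only the records that violate none of the preset rules."""
    return [r for r in records if not any(rule(r) for rule in rules)]

rows = [{"id": 1, "temp": 21.5}, {"id": 2, "temp": None}, {"id": 3, "temp": ""}]
cleaned = clean(rows)
```

Keeping each rule as a separate predicate means new checks, such as range or format rules, can be added without touching the cleaning loop.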
Preferably, the machine learning of the preprocessed big data includes:
and discovering abnormal data from the preprocessed big data by adopting a deep learning network model and a statistical process control model.
A growing number of deep learning network models are available, and this preferred embodiment can use an existing one, which makes the big data analysis method significantly easier to popularize and apply.
This preferred embodiment provides a data quality control method that combines deep learning with a statistical process control model, making effective use of computing resources and algorithmic control to detect outlier data and providing greater value to the business.
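The statistical process control half of this combination can be illustrated with a simple control-limit check. This is a sketch under assumptions: the application does not specify which control chart is used or how it is fused with the deep learning model, so the mean plus or minus three standard deviations rule and the function names below are illustrative choices, and the deep learning model is omitted.

```python
from statistics import mean, stdev

def control_limits(baseline, k=3.0):
    """Return (lower, upper) control limits: mean minus/plus k sample standard deviations."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread

def out_of_control(values, limits):
    """Return the indices of values that fall outside the control limits."""
    lower, upper = limits
    return [i for i, v in enumerate(values) if v < lower or v > upper]

baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]   # data assumed to be in control
limits = control_limits(baseline)
flagged = out_of_control([10.0, 10.3, 14.7, 9.7], limits)
```

The limits are estimated from data assumed to be normal and then applied to new values, so a single extreme reading cannot widen the limits that are used to judge it.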
Preferably, presenting the detection result comprises:
determining and displaying a risk value, an error value and a correct value of the acquired big data according to the detection result of the abnormal data.
The preferred embodiment meets the application requirements of various scenes and can better warn users.
Preferably, the big data analysis method further comprises: performing data integrity detection on the acquired big data.
Data integrity is also an important indicator of big data quality, and performing data integrity detection on the acquired big data can further improve its data quality.
Preferably, the data integrity detection of the acquired big data includes:
acquiring data consistency requirement rating of system data transmission/exchange;
establishing a detection sample pool according to the obtained data consistency requirement rating;
randomly selecting samples from a detection sample pool according to the consistency test requirement for detection;
updating data consistency requirements.
A large number of practical simulations found that the random selection algorithm adopted by this preferred embodiment is the simplest and fastest, with the lowest time complexity.
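The random sampling check can be sketched as follows. Everything specific here is an assumption made for illustration: the rating names, the mapping from rating to sampling fraction, the use of a SHA-256 checksum to compare source and received copies, and the function names. The final step of updating the consistency requirements is not shown.

```python
import hashlib
import random

# Hypothetical mapping from consistency requirement rating to sampling fraction.
SAMPLE_FRACTION = {"low": 0.01, "medium": 0.05, "high": 0.20}

def checksum(record):
    """Stable digest of a record's sorted key/value pairs."""
    payload = repr(sorted(record.items())).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def sample_mismatches(source, received, rating, seed=None):
    """Randomly sample record ids from the pool and return those whose copies differ."""
    pool = sorted(source)                                    # detection sample pool
    size = max(1, round(len(pool) * SAMPLE_FRACTION[rating]))
    rng = random.Random(seed)
    picked = rng.sample(pool, size)
    return [rid for rid in picked
            if rid not in received or checksum(source[rid]) != checksum(received[rid])]

source = {i: {"id": i, "value": i * 2} for i in range(100)}
received = {i: dict(rec) for i, rec in source.items()}
received[7]["value"] = -1                                    # simulate a corrupted transfer
mismatches = sample_mismatches(source, received, rating="high", seed=0)
```

Because only a sample is checked, a corrupted record is caught only if it happens to be drawn; a higher rating raises the sampling fraction and therefore the chance of catching it.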
FIG. 2 is a schematic diagram of a big data analysis system according to another exemplary embodiment. As shown, the system includes:
the acquisition module 10 is used for acquiring big data;
the preprocessing module 20 is configured to preprocess the acquired big data;
a machine learning module 30, configured to perform machine learning on the preprocessed big data to detect abnormal data;
and a display module 40 for displaying the detection result.
Whether data has value, that is, whether it turns out to be garbage or treasure, depends above all on whether the data to be analyzed and mined is of high quality. The big data analysis system of this embodiment uses machine learning to detect anomalies in the big data, which improves the data quality of the big data so that value can be better mined from it.
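The four-module structure can be sketched as a small pipeline in which each module is a pluggable callable. The class name, the callable signatures, and the threshold-based stand-in for the machine learning module are assumptions for illustration and do not reflect an implementation disclosed in the application.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

@dataclass
class BigDataAnalysisSystem:
    acquire: Callable[[], Sequence[Optional[float]]]          # acquisition module 10
    preprocess: Callable[[Sequence[Optional[float]]], List[float]]  # preprocessing module 20
    detect: Callable[[List[float]], List[int]]                # machine learning module 30
    display: Callable[[List[float], List[int]], str]          # display module 40

    def run(self) -> str:
        raw = self.acquire()
        prepared = self.preprocess(raw)
        abnormal = self.detect(prepared)
        return self.display(prepared, abnormal)

system = BigDataAnalysisSystem(
    acquire=lambda: [1.0, None, 1.2, 50.0, 0.9],
    preprocess=lambda rows: [r for r in rows if r is not None],
    # Stand-in for a trained model: flag values above a fixed threshold.
    detect=lambda rows: [i for i, r in enumerate(rows) if r > 10.0],
    display=lambda rows, idx: f"{len(idx)} abnormal of {len(rows)} records",
)
report = system.run()
```

Passing the modules in as callables keeps the order of steps S10 to S40 fixed while allowing any one module to be replaced independently.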
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A big data analysis method is characterized by comprising the following steps:
acquiring big data;
preprocessing the acquired big data;
performing machine learning on the preprocessed big data to detect abnormal data;
and displaying the detection result.
2. The big data analysis method according to claim 1, wherein preprocessing the acquired big data comprises:
performing data conversion on the acquired big data;
and performing data cleaning on the converted big data.
3. The big data analysis method of claim 2, wherein the data transformation of the acquired big data comprises:
converting data of the character string type, the character type and the Boolean type in the acquired big data into data of a numeric type.
4. The big data analysis method according to claim 3, wherein preprocessing the acquired big data further comprises:
discretization is performed before data conversion.
5. The big data analysis method according to claim 2, wherein problematic data is cleaned using preset logic rules.
6. The big data analysis method of claim 1, wherein machine learning the preprocessed big data comprises:
and discovering abnormal data from the preprocessed big data by adopting a deep learning network model and a statistical process control model.
7. The big data analysis method of claim 1, wherein presenting the detection results comprises:
determining and displaying a risk value, an error value and a correct value of the acquired big data according to the detection result of the abnormal data.
8. The big data analysis method according to claim 1, further comprising: performing data integrity detection on the acquired big data.
9. The big data analysis method according to claim 8, wherein performing data integrity check on the obtained big data comprises:
acquiring data consistency requirement rating of system data transmission/exchange;
establishing a detection sample pool according to the obtained data consistency requirement rating;
randomly selecting samples from a detection sample pool according to the consistency test requirement for detection;
updating data consistency requirements.
10. A big data analytics system, comprising:
the acquisition module is used for acquiring big data;
the preprocessing module is used for preprocessing the acquired big data;
the machine learning module is used for performing machine learning on the preprocessed big data so as to detect abnormal data;
and the display module is used for displaying the detection result.
CN201911097264.2A 2019-11-11 2019-11-11 Big data analysis method and system Pending CN110968627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911097264.2A CN110968627A (en) 2019-11-11 2019-11-11 Big data analysis method and system

Publications (1)

Publication Number Publication Date
CN110968627A true CN110968627A (en) 2020-04-07

Family

ID=70030569

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911097264.2A Pending CN110968627A (en) 2019-11-11 2019-11-11 Big data analysis method and system

Country Status (1)

Country Link
CN (1) CN110968627A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649455A (en) * 2016-09-24 2017-05-10 孙燕群 Big data development standardized systematic classification and command set system
CN109213752A (en) * 2018-08-06 2019-01-15 国网福建省电力有限公司信息通信分公司 A kind of data cleansing conversion method based on CIM
CN110019174A (en) * 2018-12-13 2019-07-16 阿里巴巴集团控股有限公司 The quality of data determines method, apparatus, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHENG-JUNG TSAI et al.: "A discretization algorithm based on Class-Attribute Contingency Coefficient", 《INFORMATION SCIENCES》 *
吴信东 et al.: "Data Governance Technology" (数据治理技术), 《软件学报》 (Journal of Software) *

Similar Documents

Publication Publication Date Title
US20200065710A1 (en) Normalizing text attributes for machine learning models
CN109388675A (en) Data analysing method, device, computer equipment and storage medium
CN111815432A (en) Financial service risk prediction method and device
CN110647995A (en) Rule training method, device, equipment and storage medium
CN110781960A (en) Training method, classification method, device and equipment of video classification model
CN111310829A (en) Confusion matrix-based classification result detection method and device and storage medium
CN111444677A (en) Reading model optimization method, device, equipment and medium based on big data
CN113516417A (en) Service evaluation method and device based on intelligent modeling, electronic equipment and medium
CN114492601A (en) Resource classification model training method and device, electronic equipment and storage medium
CN113656797B (en) Behavior feature extraction method and behavior feature extraction device
EP2348403B1 (en) Method and system for analyzing a legacy system based on trails through the legacy system
CN114116998A (en) Reply sentence generation method and device, computer equipment and storage medium
CN111738290B (en) Image detection method, model construction and training method, device, equipment and medium
CN108830302B (en) Image classification method, training method, classification prediction method and related device
CN115186738B (en) Model training method, device and storage medium
CN110968627A (en) Big data analysis method and system
CN115687674A (en) Big data demand analysis method and system serving smart cloud service platform
CN113742488B (en) Embedded knowledge graph completion method and device based on multitask learning
CN115167965A (en) Transaction progress bar processing method and device
CN110851501A (en) Big data analysis method and system
CN112559589A (en) Remote surveying and mapping data processing method and system
CN110968690A (en) Clustering division method and device for words, equipment and storage medium
CN114418752B (en) Method and device for processing user data without type label, electronic equipment and medium
CN115660722B (en) Prediction method and device for silver life customer conversion and electronic equipment
CN117058432B (en) Image duplicate checking method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200407
