CN109684309A - Data quality evaluation method and device, computer equipment and storage medium - Google Patents

Data quality evaluation method and device, computer equipment and storage medium

Info

Publication number
CN109684309A
Authority
CN
China
Prior art keywords
data
abnormal
real
time
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811333857.XA
Other languages
Chinese (zh)
Inventor
刘卫卫
杨訸
刘贺
王晓慧
黄复鹏
陈江琦
张迪
雷舒雅
张希
佟鹏
周洪伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Global Energy Interconnection Research Institute
Economic and Technological Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Global Energy Interconnection Research Institute
Economic and Technological Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Global Energy Interconnection Research Institute, and Economic and Technological Research Institute of State Grid Jiangsu Electric Power Co Ltd


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply

Abstract

The invention discloses a data quality evaluation method and device, computer equipment, and a storage medium. The data quality evaluation method includes the following steps: obtaining a real-time data stream through the SPOUT component of the STORM distributed computing framework; labeling the data in the real-time data stream according to a preset data constraint rule, where data that satisfies the constraint rule is labeled as normal data and data that does not satisfy the constraint rule is labeled as abnormal data; storing the abnormal data in real time; and storing the abnormal data and the normal data offline by category. The STORM distributed computing framework obtains the data stream from a real-time stream processing platform (such as a Kafka system) in real time, evaluates the data in the stream against the constraint rule, and labels the conforming data and the non-conforming data respectively. This achieves real-time evaluation of the data and thereby meets the requirement for real-time analysis of the data.

Description

Data quality evaluation method and device, computer equipment and storage medium
Technical field
The present invention relates to the technical field of data management, and in particular to a data quality evaluation method, a data quality evaluation device, computer equipment, and a computer-readable storage medium.
Background
In the power grid field, data acquisition devices of all kinds, smart meters in particular, now cover the entire network, and the real-time operation and interaction of the various business systems generate a large amount of real-time data. At the same time, the data analysis that guides company operations, service optimization, and precise decision-making faces increasingly strict real-time requirements, and the quality of real-time data has become a key factor limiting the accuracy of real-time data analysis.
In the prior art, fields such as marketing business applications and user information acquisition mostly perform quality detection on data only after it has become offline data. This improves offline data quality and thus the accuracy of offline data analysis. However, there is not yet a complete evaluation process for the quality of real-time data, and the prior art lacks an evaluation method for real-time data. Quality detection based on offline data has difficulty adapting to a rapidly developing and changing network and cannot meet the requirement for real-time analysis of data, which limits the timeliness of company operations and decision-making.
Summary of the invention
Therefore, the technical problem to be solved by the present invention is that quality detection based on offline data in the prior art cannot meet the requirement for real-time analysis of data. To solve it, a data quality evaluation method that performs real-time quality detection on real-time data is provided.
To this end, according to a first aspect, the present invention provides a data quality evaluation method including the following steps: obtaining a real-time data stream through the STORM distributed computing framework; labeling the data in the real-time data stream according to a preset data constraint rule, where data that satisfies the constraint rule is labeled as normal data and data that does not satisfy the constraint rule is labeled as abnormal data; storing the abnormal data in real time; and storing the abnormal data and the normal data offline by category.
Optionally, storing the abnormal data in real time includes the following steps: obtaining the labeled real-time data stream; extracting the abnormal data from the labeled real-time data stream; and writing the abnormal data into a database.
Optionally, before the abnormal data and the normal data are stored offline by category, the method further includes the following steps: receiving a confirmation result for the abnormal data, and relabeling as normal data any abnormal data confirmed to contain no anomaly; and performing feature calculation on the abnormal data and the normal data to update the constraint rule.
Optionally, the data quality evaluation method further includes the following steps: judging, according to the time identifiers of the offline-stored abnormal data and normal data, whether a data transmission anomaly exists; and, when a data transmission anomaly exists, re-acquiring the data of the abnormal period and labeling the data of the abnormal period.
Optionally, re-acquiring the data of the abnormal period and labeling it further includes the following steps: obtaining the data of the abnormal period one item at a time in chronological order; judging whether the currently obtained data has already been labeled; when the currently obtained data has been labeled, obtaining the next item of data in the abnormal period; and when the currently obtained data has not been labeled, labeling the currently obtained data.
Optionally, the period between the time point at which the data transmission anomaly starts and the time point at which it ends is taken as the abnormal period; alternatively, the period obtained by going back a preset duration from the time point at which the data transmission anomaly starts, together with the period between the start and end time points of the data transmission anomaly, is taken as the abnormal period.
Optionally, obtaining the real-time data stream through the STORM distributed computing framework includes: acquiring the data in real time using the SPOUT component of the STORM distributed computing framework; alternatively, labeling the data in the real-time data stream includes: labeling the data using the BOLT component of the STORM distributed computing framework; the data constraint rule is stored in a REDIS system.
According to a second aspect, the present invention also provides a data quality evaluation device, including: a data stream obtaining module, for obtaining a real-time data stream through the STORM distributed computing framework; a data labeling module, for labeling the data in the real-time data stream according to a preset data constraint rule, where data that satisfies the constraint rule is labeled as normal data and data that does not satisfy the constraint rule is labeled as abnormal data; a real-time storage module, for storing the abnormal data in real time; and an offline storage module, for storing the abnormal data and the normal data offline by category.
Optionally, the real-time storage module includes: a data stream obtaining unit, for obtaining the labeled real-time data stream; a data extraction unit, for extracting the abnormal data from the labeled real-time data stream; and a writing unit, for writing the abnormal data into a database.
Optionally, the data quality evaluation device further includes: a confirmation correction module, for receiving a confirmation result for the abnormal data and relabeling as normal data any abnormal data confirmed to contain no anomaly; and a rule update module, for performing feature calculation on the abnormal data and the normal data to update the constraint rule.
Optionally, the data quality evaluation device further includes: a transmission detection module, for judging, according to the time identifiers of the offline-stored abnormal data and normal data, whether a data transmission anomaly exists; and a data rollback module, for re-acquiring the data of the abnormal period and labeling it when a data transmission anomaly exists.
Optionally, the data rollback module includes: a data obtaining unit, for obtaining the data of the abnormal period one item at a time in chronological order; a judging unit, for judging whether the currently obtained data has already been labeled; and a labeling unit, for labeling the currently obtained data.
Optionally, the period between the time point at which the data transmission anomaly starts and the time point at which it ends is taken as the abnormal period; alternatively, the period obtained by going back a preset duration from the time point at which the data transmission anomaly starts, together with the period between the start and end time points of the data transmission anomaly, is taken as the abnormal period.
According to a third aspect, the present invention provides computer equipment, including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor performs the method described in any implementation of the first aspect.
According to a fourth aspect, the present invention provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the method described in any implementation of the first aspect.
The technical solution provided by the embodiments of the present invention has the following advantages:
1. The data quality evaluation method provided by the invention includes the following steps: obtaining a real-time data stream through the STORM distributed computing framework; labeling the data in the real-time data stream according to a preset data constraint rule, where data that satisfies the constraint rule is labeled as normal data and data that does not satisfy the constraint rule is labeled as abnormal data; storing the abnormal data in real time; and storing the abnormal data and the normal data offline by category. The STORM distributed computing framework obtains the data stream from a real-time stream processing platform (such as a Kafka system) in real time, evaluates the data in the stream against the constraint rule, and labels the conforming data and the non-conforming data respectively. This achieves real-time evaluation of the data and thereby meets the requirement for real-time analysis of the data.
2. In the data quality evaluation method provided by the invention, before the abnormal data and the normal data are stored offline by category, the method further includes the following steps: receiving a confirmation result for the abnormal data, and relabeling as normal data any abnormal data confirmed to contain no anomaly; and performing feature calculation on the abnormal data and the normal data to update the constraint rule. Confirming the abnormal data, and promptly correcting normal data that was mistakenly labeled as abnormal, allows the truly abnormal data to be labeled and improves the precision of real-time data quality evaluation. In addition, updating the constraint rule from the confirmed normal data and abnormal data reduces the likelihood that subsequent data is mislabeled and improves the precision of subsequent data quality evaluation.
3. In the data quality evaluation method provided by the invention, whether a data transmission anomaly exists is judged according to the time identifiers of the offline-stored abnormal data and normal data; when a data transmission anomaly exists, the data of the abnormal period is re-acquired and labeled. By checking the time identifiers of the offline-stored data, a data transmission anomaly can be found promptly when the offline database lacks the data of some period, and the data of the missing period can be re-acquired. This prevents data omission and improves the completeness and reliability of the evaluation.
4. In the data quality evaluation method provided by the invention, re-acquiring the data of the abnormal period and labeling it further includes the following steps: obtaining the data of the abnormal period one item at a time in chronological order; judging whether the currently obtained data has already been labeled; when the currently obtained data has been labeled, obtaining the next item of data in the abnormal period; and when the currently obtained data has not been labeled, labeling the currently obtained data. Before data at a given moment in the abnormal period is labeled, it is checked whether that data has already been labeled. This solves the problem that, when data transmission was normal at some moments within the abnormal period, that is, when part of the data in the abnormal period has already been successfully labeled and stored, directly labeling all data of the abnormal period would label part of the data repeatedly. Redundant steps are reduced, which improves the evaluation efficiency of the data quality evaluation method.
Brief description of the drawings
To illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the data quality evaluation method provided in Embodiment 1;
Fig. 2 is a data flow diagram of the data quality evaluation method provided in Embodiment 1;
Fig. 3 is a detailed flow chart of step S800 of the data quality evaluation method provided in Embodiment 1;
Fig. 4 is a structural schematic diagram of the data quality evaluation device provided in Embodiment 2;
Fig. 5 is a hardware structural diagram of the computer equipment provided in Embodiment 3.
Detailed description of the embodiments
The technical solution of the present invention is described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used for description purposes only and are not to be understood as indicating or implying relative importance.
Embodiment 1
This embodiment provides a data quality evaluation method, as shown in Fig. 1. It should be noted that the steps shown in the flow chart can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flow chart, in some cases the steps shown or described can be executed in a different order. The process includes the following steps:
Step S100: obtain a real-time data stream through the STORM distributed computing framework. In this embodiment, the STORM distributed computing framework obtains the real-time data stream from a Kafka system. Specifically, the SPOUT component of the STORM distributed computing framework subscribes to the real-time data stream from the Kafka system and classifies the data stream according to a certain rule, for example classifying electricity consumption data by user.
Step S200: label the data in the real-time data stream according to a preset data constraint rule. In this embodiment, data that satisfies the constraint rule is labeled as normal data, and data that does not satisfy the constraint rule is labeled as abnormal data. The preset data constraint rule is calculated from normal data and abnormal data obtained in advance. The constraint rule consists of the feature constants and rule thresholds that are obtained by processing the data, for example through feature calculation and trend statistics, and that reflect the data quality, that is, whether the data is normal data. For example, when the data is numeric data, a variable in the data is obtained through feature calculation, and the variation threshold of that variable for normal data is obtained through trend statistics; the variable and its variation threshold are then the constraint rule. When the data is graphical data (taking a curve as an example), the key points that reflect the data quality and their corresponding slopes are obtained through feature calculation, and the slope threshold of each key point for a normal curve is obtained through trend statistics; the key points and their corresponding slope thresholds are then the constraint rule. In this embodiment, data labeling is implemented by setting corresponding tags in the abnormal data and the normal data respectively.
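The labeling in step S200 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `NumericRule` class, the `label` field, the `kwh` variable, and the convention that the slope at a key point is measured to the next point are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class NumericRule:
    """Hypothetical constraint rule: a variable and its allowed range."""
    variable: str
    lower: float
    upper: float


def label_record(record: dict, rule: NumericRule) -> dict:
    """Return a copy of the record tagged 'normal' or 'abnormal'."""
    value = record.get(rule.variable)
    # A missing variable is treated as abnormal in this sketch.
    ok = value is not None and rule.lower <= value <= rule.upper
    return {**record, "label": "normal" if ok else "abnormal"}


def curve_is_normal(points, slope_rules) -> bool:
    """Check a curve against per-key-point slope thresholds.

    points: list of (x, y) tuples.
    slope_rules: {key point index: (min slope, max slope)}; the slope is
    taken between the key point and the next point (an assumption).
    """
    for i, (lo, hi) in slope_rules.items():
        (x0, y0), (x1, y1) = points[i], points[i + 1]
        slope = (y1 - y0) / (x1 - x0)
        if not lo <= slope <= hi:
            return False
    return True


rule = NumericRule(variable="kwh", lower=0.0, upper=50.0)
labeled = [
    label_record(r, rule)
    for r in (
        {"user": "u1", "kwh": 12.5},
        {"user": "u1", "kwh": 480.0},
        {"user": "u2"},
    )
]
curve = [(0.0, 0.0), (1.0, 1.0), (2.0, 5.0)]
curve_ok = curve_is_normal(curve, {0: (0.0, 2.0), 1: (0.0, 2.0)})
```

Here the second and third records end up tagged abnormal, and the curve fails because the slope after its second point exceeds the assumed threshold.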
In this embodiment, as shown in Fig. 2, the data is labeled using the BOLT component of the STORM distributed computing framework. Specifically, there can be one or more BOLT components; when there are multiple BOLT components, the labeling of data of the same class (for example, the electricity consumption data of the same user) is completed in the same BOLT component. In this embodiment, the data constraint rule is stored in a REDIS system; the BOLT component calls the constraint rule in the REDIS system and labels the data according to it. Storing the constraint rule separately in the REDIS system makes it convenient to look up and update the constraint rule.
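One way to keep all data of the same class in the same BOLT component is to route each record by a stable hash of its class key. The sketch below only illustrates that idea in plain Python; it does not use the STORM or REDIS APIs, and `bolt_index`, `rule_store`, and the record fields are hypothetical names. A plain dictionary stands in for the REDIS system.

```python
import zlib


def bolt_index(user_id: str, n_bolts: int) -> int:
    """Map a user ID to one of n_bolts workers with a stable hash."""
    return zlib.crc32(user_id.encode("utf-8")) % n_bolts


# Stand-in for the rule store; in the patent the rules live in REDIS.
rule_store = {"rule:kwh": {"lower": 0.0, "upper": 50.0}}

records = [
    {"user": "u1", "kwh": 12.5},
    {"user": "u2", "kwh": 7.0},
    {"user": "u1", "kwh": 480.0},
    {"user": "u3", "kwh": 20.0},
]

# Group records by the worker that would label them.
assignments = {}
for rec in records:
    assignments.setdefault(bolt_index(rec["user"], 3), []).append(rec)

# Record, for each user, the set of workers that received its data.
workers_per_user = {}
for idx, recs in assignments.items():
    for rec in recs:
        workers_per_user.setdefault(rec["user"], set()).add(idx)
```

Because the hash depends only on the user ID, every record of a given user lands on the same worker, which is the property the paragraph above requires.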
In this embodiment, when the anomaly detection algorithm used to label the data is relatively simple, or the data labeling is carried out in only one BOLT component, the anomaly detection algorithm can be written directly into the BOLT component. When the anomaly detection algorithm is relatively complex, or the data labeling is carried out in multiple BOLT components, the anomaly detection algorithm can be encapsulated separately on a server, and the BOLT components label the data by calling the separately encapsulated algorithm. This reduces the configuration time of the BOLT components, improves data quality evaluation efficiency, and makes it convenient to look up and update the anomaly detection algorithm.
Step S300: store the abnormal data in real time. In this embodiment, the labeled abnormal data is obtained from the Kafka system in real time and stored in a database, which completes the real-time storage of the abnormal data.
Step S400: store the abnormal data and the normal data offline by category. In this embodiment, the abnormal data and the normal data are stored offline by category at a preset period, and are stored in chronological order according to their time identifiers. Specifically, the preset period can be set to one day or one hour; other preset periods can of course be chosen according to the needs of the actual application scenario. In this embodiment, the Hadoop Distributed File System (HDFS) is used to store the normal data and the abnormal data by category.
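The classified, time-ordered offline storage of step S400 can be sketched as a partitioning step. This is an illustration under stated assumptions: the `label/period` key layout and the `ts` and `label` field names are hypothetical, and an in-memory dictionary stands in for HDFS.

```python
from collections import defaultdict
from datetime import datetime


def partition_key(record: dict, period: str = "hour") -> str:
    """Build a 'label/period' key from the record's time identifier."""
    ts = datetime.fromisoformat(record["ts"])
    fmt = "%Y-%m-%d" if period == "day" else "%Y-%m-%d-%H"
    return f'{record["label"]}/{ts.strftime(fmt)}'


def build_partitions(records, period: str = "hour") -> dict:
    """Group records by label and period, sorted by time within each group."""
    parts = defaultdict(list)
    for rec in records:
        parts[partition_key(rec, period)].append(rec)
    for recs in parts.values():
        recs.sort(key=lambda r: datetime.fromisoformat(r["ts"]))
    return dict(parts)


records = [
    {"ts": "2018-11-09T16:20:00", "label": "normal", "value": 1.0},
    {"ts": "2018-11-09T16:05:00", "label": "normal", "value": 2.0},
    {"ts": "2018-11-09T16:10:00", "label": "abnormal", "value": 99.0},
    {"ts": "2018-11-09T17:00:00", "label": "normal", "value": 3.0},
]
parts = build_partitions(records, period="hour")
```

Each key could then map to one file or directory in the offline store, so normal and abnormal data stay separate and each batch is already in chronological order.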
In the data quality evaluation method provided in this embodiment, the STORM distributed computing framework obtains the data stream from a real-time stream processing platform (such as a Kafka system) in real time, evaluates the data in the stream against the constraint rule, and labels the conforming data and the non-conforming data respectively. This achieves real-time evaluation of the data and thereby meets the requirement for real-time analysis of the data.
In an optional embodiment, step S300 includes the following steps:
Step S301: obtain the labeled real-time data stream. In this embodiment, the labeled data stream is obtained from the Kafka system in real time.
Step S302: extract the abnormal data from the labeled real-time data stream. In this embodiment, normal data and abnormal data are distinguished according to the tags in the data.
Step S303: write the abnormal data into a database. In this embodiment, as shown in Fig. 2, a MYSQL database is used to store the abnormal data.
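Steps S301 to S303 can be sketched as a filter followed by a batch insert. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the MYSQL database so that it runs without a server; the table name `abnormal_data`, its columns, and the record fields are assumptions made for illustration.

```python
import sqlite3


def store_abnormal(conn: sqlite3.Connection, labeled_stream) -> int:
    """Extract abnormal records from a labeled stream and write them out."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS abnormal_data "
        "(user_id TEXT, ts TEXT, value REAL)"
    )
    # Extraction step: keep only records tagged as abnormal.
    rows = [
        (r["user_id"], r["ts"], r["value"])
        for r in labeled_stream
        if r["label"] == "abnormal"
    ]
    # Write step: insert the abnormal records in one batch.
    conn.executemany("INSERT INTO abnormal_data VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
stream = [
    {"user_id": "u1", "ts": "2018-11-09T16:05:00", "value": 12.5, "label": "normal"},
    {"user_id": "u1", "ts": "2018-11-09T16:10:00", "value": 480.0, "label": "abnormal"},
]
stored = store_abnormal(conn, stream)
count = conn.execute("SELECT COUNT(*) FROM abnormal_data").fetchone()[0]
```

Only the abnormal record reaches the table; normal records are left for the offline storage path.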
In an optional embodiment, the following steps are further included before step S400:
Step S500: receive a confirmation result for the abnormal data, and relabel as normal data any abnormal data confirmed to contain no anomaly. In this embodiment, the user's confirmation result for the abnormal data stored in real time in the MYSQL database is received; when an item of abnormal data is confirmed to contain no anomaly, its tag is changed to the normal-data tag, and the modified data is removed from the MYSQL database.
Step S600: perform feature calculation on the abnormal data and the normal data to update the constraint rule. In this embodiment, the offline-stored abnormal data and normal data are first corrected according to the confirmation result, and the constraint rule is then recalculated from the corrected data. The calculation method is the same as in step S200 and is not repeated here. Confirming the abnormal data, and promptly correcting normal data that was mistakenly labeled as abnormal, allows the truly abnormal data to be labeled and improves the precision of real-time data quality evaluation. In addition, updating the constraint rule from the confirmed normal data and abnormal data reduces the likelihood that subsequent data is mislabeled and improves the precision of subsequent data quality evaluation.
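The correction and rule update in steps S500 and S600 can be sketched as follows. The patent does not specify the feature calculation, so the rule update here simply takes the minimum and maximum of the confirmed normal values; that choice, and the names `apply_confirmations`, `update_rule`, `id`, and `kwh`, are assumptions for illustration only.

```python
def apply_confirmations(abnormal_records, confirmed_normal_ids):
    """Split abnormal records into those kept and those relabeled as normal."""
    still_abnormal, corrected = [], []
    for rec in abnormal_records:
        if rec["id"] in confirmed_normal_ids:
            corrected.append({**rec, "label": "normal"})
        else:
            still_abnormal.append(rec)
    return still_abnormal, corrected


def update_rule(normal_records, variable: str) -> dict:
    """Recompute a range rule from normal data (assumed min/max statistic)."""
    values = [r[variable] for r in normal_records]
    return {"variable": variable, "lower": min(values), "upper": max(values)}


abnormal = [
    {"id": 1, "kwh": 480.0, "label": "abnormal"},
    {"id": 2, "kwh": 55.0, "label": "abnormal"},
]
normal = [
    {"id": 3, "kwh": 12.5, "label": "normal"},
    {"id": 4, "kwh": 30.0, "label": "normal"},
]

# Suppose the user confirms that record 2 contains no anomaly.
still_abnormal, corrected = apply_confirmations(abnormal, {2})
new_rule = update_rule(normal + corrected, "kwh")
```

After the correction, the upper bound of the rule widens to cover the relabeled record, so a similar value would no longer be flagged.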
In an optional embodiment, the data quality evaluation method provided in this embodiment further includes the following steps:
Step S700: judge, according to the time identifiers of the offline-stored abnormal data and normal data, whether a data transmission anomaly exists.
Step S800: when a data transmission anomaly exists, re-acquire the data of the abnormal period and label the data of the abnormal period. In this embodiment, the period between the time point at which the data transmission anomaly starts and the time point at which it ends is taken as the abnormal period; alternatively, the period obtained by going back a preset duration from the time point at which the data transmission anomaly starts, together with the period between the start and end time points of the data transmission anomaly, is taken as the abnormal period. The preset look-back duration can be determined according to the actual application scenario, such as the transmission frequency of the data and the accuracy requirements of the data quality evaluation. When the look-back period is also treated as part of the abnormal period, for example when a data transmission anomaly occurs from 16:00 to 16:30, the data from 15:50 to 16:30 is re-acquired, which prevents data from being omitted again during re-acquisition. It should be noted that these specific values are only an example given to help those skilled in the art understand the scheme of this embodiment, and are not to be understood as limiting the technical solution of this embodiment.
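One simple way to detect a transmission anomaly from time identifiers is to look for gaps larger than the expected interval between consecutive records. The sketch below assumes that criterion, which the patent does not spell out; `find_abnormal_periods`, `expected_interval`, and `lookback` are hypothetical names, and the date is arbitrary.

```python
from datetime import datetime, timedelta


def find_abnormal_periods(timestamps, expected_interval, lookback=timedelta(0)):
    """Return (start, end) periods where consecutive timestamps are too far apart.

    The start is moved back by `lookback`, mirroring the optional
    look-back duration described above.
    """
    ts = sorted(timestamps)
    periods = []
    for prev, cur in zip(ts, ts[1:]):
        if cur - prev > expected_interval:
            periods.append((prev - lookback, cur))
    return periods


day = (2018, 11, 9)
stored = [
    datetime(*day, 15, 30),
    datetime(*day, 15, 40),
    datetime(*day, 15, 50),
    datetime(*day, 16, 0),
    datetime(*day, 16, 30),  # nothing was stored between 16:00 and 16:30
    datetime(*day, 16, 40),
]
periods = find_abnormal_periods(
    stored,
    expected_interval=timedelta(minutes=10),
    lookback=timedelta(minutes=10),
)
```

With a ten-minute look-back, the gap from 16:00 to 16:30 yields a re-acquisition window of 15:50 to 16:30, matching the example in the text.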
In an optional embodiment, as shown in Fig. 3, step S800 includes the following steps:
Step S801: obtain the data of the abnormal period one item at a time in chronological order.
Step S802: judge whether the currently obtained data has already been labeled. In this embodiment, this is determined by checking whether a tag has been set in the currently obtained data. When the currently obtained data has been labeled, the next item of data in the abnormal period is obtained, that is, the process returns to step S801; when the currently obtained data has not been labeled, step S803 is executed.
Step S803: label the currently obtained data.
In the data quality evaluation method provided in this embodiment, before data at a given moment in the abnormal period is labeled, it is checked whether that data has already been labeled. This solves the problem that, when data transmission was normal at some moments within the abnormal period, that is, when part of the data in the abnormal period has already been successfully labeled and stored, directly labeling all data of the abnormal period would label part of the data repeatedly. Redundant steps are reduced, which improves the evaluation efficiency of the data quality evaluation method.
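The loop formed by steps S801 to S803 can be sketched as follows. The record layout and the `label_fn` callback are assumptions; the callback stands for whatever constraint-rule check is in use.

```python
def relabel_abnormal_period(records, label_fn) -> int:
    """Label only the records of the abnormal period that have no tag yet.

    Returns the number of records that were newly labeled.
    """
    newly_labeled = 0
    # S801: walk through the records in chronological order.
    for rec in sorted(records, key=lambda r: r["ts"]):
        # S802: skip data that already carries a label.
        if "label" in rec:
            continue
        # S803: label the currently obtained data.
        rec["label"] = label_fn(rec)
        newly_labeled += 1
    return newly_labeled


period_data = [
    {"ts": 3, "kwh": 480.0},
    {"ts": 1, "kwh": 12.5, "label": "normal"},  # already labeled and stored
    {"ts": 2, "kwh": 20.0},
]
n = relabel_abnormal_period(
    period_data,
    lambda r: "normal" if r["kwh"] <= 50.0 else "abnormal",
)
```

The record that was already labeled is left untouched, so only the two missing records go through the labeling function.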
Embodiment 2
This embodiment provides a data quality evaluation device, which is used to implement Embodiment 1 above and its preferred implementations; what has already been described is not repeated. As used below, the term "module" can be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiment is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The data quality evaluation device provided in this embodiment, as shown in Fig. 4, includes: a data stream obtaining module 100, a data labeling module 200, a real-time storage module 300, and an offline storage module 400.
The data stream obtaining module 100 is used to obtain a real-time data stream through the STORM distributed computing framework. The data labeling module 200 is used to label the data in the real-time data stream according to a preset data constraint rule; data that satisfies the constraint rule is labeled as normal data, and data that does not satisfy the constraint rule is labeled as abnormal data. The real-time storage module 300 is used to store the abnormal data in real time. The offline storage module 400 is used to store the abnormal data and the normal data offline by category.
In an optional embodiment, the real-time storage module 300 includes: a data stream obtaining unit, a data extraction unit, and a writing unit.
The data stream obtaining unit is used to obtain the labeled real-time data stream; the data extraction unit is used to extract the abnormal data from the labeled real-time data stream; the writing unit is used to write the abnormal data into a database.
In an optional embodiment, the data quality evaluation device further includes: a confirmation correction module and a rule update module.
The confirmation correction module is used to receive a confirmation result for the abnormal data and relabel as normal data any abnormal data confirmed to contain no anomaly; the rule update module is used to perform feature calculation on the abnormal data and the normal data to update the constraint rule.
In an optional embodiment, the data quality evaluation device further includes: a transmission detection module and a data rollback module.
The transmission detection module is used to judge, according to the time identifiers of the offline-stored abnormal data and normal data, whether a data transmission anomaly exists; the data rollback module is used to re-acquire the data of the abnormal period and label it when a data transmission anomaly exists.
In an optional embodiment, the data rollback module includes: a data obtaining unit, a judging unit, and a labeling unit.
The data obtaining unit is used to obtain the data of the abnormal period one item at a time in chronological order; the judging unit is used to judge whether the currently obtained data has already been labeled; the labeling unit is used to label the currently obtained data. In this embodiment, the labeling unit is executed when the currently obtained data has not been labeled.
In an optional embodiment, the period between the time point at which the data transmission anomaly starts and the time point at which it ends is taken as the abnormal period; alternatively, the period obtained by going back a preset duration from the time point at which the data transmission anomaly starts, together with the period between the start and end time points of the data transmission anomaly, is taken as the abnormal period.
Embodiment 3
An embodiment of the present invention further provides a computer device. As shown in Fig. 5, the device may include at least one processor 501, such as a CPU (Central Processing Unit), at least one communication interface 503, a memory 504, and at least one communication bus 502. The communication bus 502 provides the connections and communication among these components. The communication interface 503 may include a display and a keyboard, and may optionally also include a standard wired interface and a wireless interface. The memory 504 may be a high-speed RAM (Random Access Memory, a volatile random access memory) or a non-volatile memory, for example at least one disk storage device. Optionally, the memory 504 may also be at least one storage device located remotely from the processor 501. The memory 504 stores an application program, and the processor 501 calls the program code stored in the memory 504 to execute any of the method steps of Embodiment 1, that is, to perform the following operations:
Obtaining a real-time data stream through the STORM distributed computing framework; labeling the data in the real-time data stream according to preset data constraint rules, where data is labeled as normal data when it conforms to the constraint rules and as abnormal data when it does not; storing the abnormal data in real time; and storing the abnormal data and the normal data offline by category.
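The labeling and classification steps can be illustrated with a small sketch. It assumes, for illustration only, that each constraint rule is an allowed numeric range per field and that a missing field violates the rule; the field names, the rule format and the function names are hypothetical, and the stream is a plain list rather than a STORM topology.

```python
def label_record(record, rules):
    """Return "normal" if every constrained field is in range, else "abnormal"."""
    for field, (low, high) in rules.items():
        value = record.get(field)
        if value is None or not (low <= value <= high):
            return "abnormal"
    return "normal"


def classify(stream, rules):
    """Label each record and group the records by label for classified storage."""
    buckets = {"normal": [], "abnormal": []}
    for record in stream:
        buckets[label_record(record, rules)].append(record)
    return buckets


# Hypothetical constraint rules: field name -> (minimum, maximum).
rules = {"voltage": (200.0, 240.0), "current": (0.0, 50.0)}
stream = [
    {"voltage": 220.0, "current": 10.0},
    {"voltage": 260.0, "current": 10.0},  # voltage out of range
    {"voltage": 221.0},                   # current missing
]
buckets = classify(stream, rules)
```

The two buckets correspond to the two categories written to offline storage, while the abnormal bucket alone feeds the real-time storage step.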
In this embodiment of the present invention, the processor 501 calls the program code in the memory 504 and is further configured to perform the following operations: storing the abnormal data in real time includes the following steps: obtaining the labeled real-time data stream; extracting the abnormal data from the labeled real-time data stream; and writing the abnormal data into a database.
In this embodiment of the present invention, the processor 501 calls the program code in the memory 504 and is further configured to perform the following operations: before the abnormal data and the normal data are stored offline by category, the method further includes the following steps: receiving the confirmation results for the abnormal data, and relabeling as normal data any abnormal data confirmed to contain no abnormality; and performing feature calculation on the abnormal data and the normal data and updating the constraint rules.
In this embodiment of the present invention, the processor 501 calls the program code in the memory 504 and is further configured to perform the following operations: the data quality evaluation method further includes the following steps: determining, from the time identifiers of the abnormal data and the normal data in offline storage, whether a data transmission abnormality exists; and, when a data transmission abnormality exists, reacquiring the data of the abnormal period and labeling that data.
In this embodiment of the present invention, the processor 501 calls the program code in the memory 504 and is further configured to perform the following operations: reacquiring the data of the abnormal period and labeling that data further includes the following steps: obtaining the data of the abnormal period one item at a time in chronological order; determining whether the currently obtained data has already been labeled; when the currently obtained data has been labeled, obtaining the next item of data of the abnormal period; and when the currently obtained data has not been labeled, labeling the currently obtained data.
In this embodiment of the present invention, the processor 501 calls the program code in the memory 504 and is further configured to perform the following operations: taking as the abnormal period the period from the time point at which the data transmission abnormality starts to the time point at which it ends; or taking as the abnormal period the period obtained by going back a preset duration from the time point at which the data transmission abnormality starts, together with the period from the time point at which the data transmission abnormality starts to the time point at which it ends.
In this embodiment of the present invention, the processor 501 calls the program code in the memory 504 and is further configured to perform the following operations: obtaining the real-time data stream through the STORM distributed computing framework includes acquiring the data in real time using a SPOUT component of the STORM distributed computing framework; alternatively, labeling the data in the real-time data stream includes labeling the data using a BOLT component of the STORM distributed computing framework; and the data constraint rules are stored in a REDIS system.
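The division of work among a source component, a labeling component and an external rule store can be illustrated without any STORM or REDIS dependency. The sketch below is plain Python and does not use the API of either system: the generator spout stands in for a SPOUT-like source, the generator bolt for a BOLT-like labeling step, and the dictionary rule_store for a key-value store holding the rules as JSON strings. All of these names, the key format and the rule format are assumptions made for the example.

```python
import json

# Stand-in for a key-value rule store; rules are kept as JSON strings.
rule_store = {"rule:temperature": json.dumps({"min": 0, "max": 50})}


def spout(readings):
    """Source step: emit the readings one tuple at a time."""
    for reading in readings:
        yield reading


def bolt(tuples, store):
    """Labeling step: look up the current rule for each tuple, then label it."""
    for tup in tuples:
        rule = json.loads(store["rule:" + tup["metric"]])
        ok = rule["min"] <= tup["value"] <= rule["max"]
        yield {**tup, "label": "normal" if ok else "abnormal"}


readings = [
    {"metric": "temperature", "value": 60},
    {"metric": "temperature", "value": 60},
]
stream = bolt(spout(readings), rule_store)

first = next(stream)  # labeled under the original rule
# Updating the rule in the store takes effect for the tuples that follow.
rule_store["rule:temperature"] = json.dumps({"min": 0, "max": 80})
second = next(stream)
```

Because the labeling step reads the rule at processing time, an updated rule applies to later tuples without restarting the pipeline, which is one plausible reason for keeping the rules in an external store.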
The communication bus 502 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 502 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one line is used in Fig. 5, but this does not mean that there is only one bus or only one type of bus.
The memory 504 may include volatile memory, such as random-access memory (RAM); it may also include non-volatile memory, such as flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 504 may also include a combination of the above kinds of memory.
The processor 501 may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor 501 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
Embodiment 4
An embodiment of the present invention further provides a non-transitory computer storage medium storing computer-executable instructions that can execute any of the method steps of Embodiment 1. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), a flash memory, a hard disk drive (HDD), a solid-state drive (SSD), or the like; the storage medium may also include a combination of the above kinds of memory.
Obviously, the above embodiments are merely examples given for clarity of description and do not limit the implementations. Those of ordinary skill in the art can make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Obvious changes or variations derived from the above still fall within the protection scope of the present invention.

Claims (15)

1. A data quality evaluation method, characterized by comprising the following steps:
obtaining a real-time data stream through the STORM distributed computing framework;
labeling the data in the real-time data stream according to preset data constraint rules, wherein the data is labeled as normal data when the data conforms to the constraint rules, and the data is labeled as abnormal data when the data does not conform to the constraint rules;
storing the abnormal data in real time; and
storing the abnormal data and the normal data offline by category.
2. The data quality evaluation method according to claim 1, characterized in that storing the abnormal data in real time comprises the following steps:
obtaining the labeled real-time data stream;
extracting the abnormal data from the labeled real-time data stream; and
writing the abnormal data into a database.
3. The data quality evaluation method according to claim 1, characterized in that, before the abnormal data and the normal data are stored offline by category, the method further comprises the following steps:
receiving confirmation results for the abnormal data, and relabeling as normal data the abnormal data confirmed to contain no abnormality; and
performing feature calculation on the abnormal data and the normal data, and updating the constraint rules.
4. The data quality evaluation method according to claim 1, characterized by further comprising the following steps:
determining, according to the time identifiers of the abnormal data and the normal data in offline storage, whether a data transmission abnormality exists; and
when a data transmission abnormality exists, reacquiring the data of an abnormal period and labeling the data of the abnormal period.
5. The data quality evaluation method according to claim 4, characterized in that reacquiring the data of the abnormal period and labeling the data of the abnormal period comprises the following steps:
obtaining the data of the abnormal period one item at a time in chronological order;
determining whether the currently obtained data has already been labeled;
when the currently obtained data has been labeled, obtaining the next item of data of the abnormal period; and
when the currently obtained data has not been labeled, labeling the currently obtained data.
6. The data quality evaluation method according to claim 4 or 5, characterized in that the period from the time point at which the data transmission abnormality starts to the time point at which the data transmission abnormality ends is taken as the abnormal period; or
the period obtained by going back a preset duration from the time point at which the data transmission abnormality starts, together with the period from the time point at which the data transmission abnormality starts to the time point at which the data transmission abnormality ends, is taken as the abnormal period.
7. The data quality evaluation method according to any one of claims 1-6, characterized in that obtaining the real-time data stream through the STORM distributed computing framework comprises: acquiring the data in real time using a SPOUT component of the STORM distributed computing framework; or
labeling the data in the real-time data stream comprises: labeling the data using a BOLT component of the STORM distributed computing framework;
the data constraint rules are stored in a REDIS system.
8. A data quality evaluation apparatus, characterized by comprising:
a data stream obtaining module, configured to obtain a real-time data stream through the STORM distributed computing framework;
a data labeling module, configured to label the data in the real-time data stream according to preset data constraint rules, wherein the data is labeled as normal data when the data conforms to the constraint rules, and the data is labeled as abnormal data when the data does not conform to the constraint rules;
a real-time storage module, configured to store the abnormal data in real time; and
an offline storage module, configured to store the abnormal data and the normal data offline by category.
9. The data quality evaluation apparatus according to claim 8, characterized in that the real-time storage module comprises:
a data stream acquiring unit, configured to obtain the labeled real-time data stream;
a data extracting unit, configured to extract the abnormal data from the labeled real-time data stream; and
a writing unit, configured to write the abnormal data into a database.
10. The data quality evaluation apparatus according to claim 8, characterized by further comprising:
a confirmation and correction module, configured to receive confirmation results for the abnormal data and to relabel as normal data the abnormal data confirmed to contain no abnormality; and
a rule update module, configured to perform feature calculation on the abnormal data and the normal data and to update the constraint rules.
11. The data quality evaluation apparatus according to claim 8, characterized by further comprising:
a transmission detection module, configured to determine, according to the time identifiers of the abnormal data and the normal data in offline storage, whether a data transmission abnormality exists; and
a data rollback module, configured to, when a data transmission abnormality exists, reacquire the data of an abnormal period and label the data of the abnormal period.
12. The data quality evaluation apparatus according to claim 11, characterized in that the data rollback module comprises:
a data acquisition unit, configured to obtain the data of the abnormal period one item at a time in chronological order;
a judging unit, configured to determine whether the currently obtained data has already been labeled; and
a labeling unit, configured to label the currently obtained data.
13. The data quality evaluation apparatus according to claim 11 or 12, characterized in that the period from the time point at which the data transmission abnormality starts to the time point at which the data transmission abnormality ends is taken as the abnormal period; or
the period obtained by going back a preset duration from the time point at which the data transmission abnormality starts, together with the period from the time point at which the data transmission abnormality starts to the time point at which the data transmission abnormality ends, is taken as the abnormal period.
14. A computer device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the one processor, and the instructions are executed by the at least one processor so that the at least one processor performs the method according to any one of claims 1-7.
15. A computer-readable storage medium storing computer instructions, characterized in that the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-7.
CN201811333857.XA 2018-08-03 2018-11-09 A kind of quality of data evaluating method and device, computer equipment and storage medium Pending CN109684309A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810875663 2018-08-03
CN2018108756636 2018-08-03

Publications (1)

Publication Number Publication Date
CN109684309A true CN109684309A (en) 2019-04-26

Family

ID=66185306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811333857.XA Pending CN109684309A (en) 2018-08-03 2018-11-09 A kind of quality of data evaluating method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109684309A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851464A (en) * 2019-11-11 2020-02-28 广州及包子信息技术咨询服务有限公司 Data quality treatment method and system
CN111881106A (en) * 2020-07-30 2020-11-03 北京智能工场科技有限公司 Data labeling and processing method based on AI (Artificial Intelligence) inspection
CN112379656A (en) * 2020-10-09 2021-02-19 爱普(福建)科技有限公司 Processing method, device, equipment and medium for detecting abnormal data of industrial system
CN113127635A (en) * 2019-12-31 2021-07-16 阿里巴巴集团控股有限公司 Data processing method, device and system, storage medium and electronic equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104579848A (en) * 2015-01-26 2015-04-29 华东师范大学 Data analyzing and network monitoring integration method for OBS instrument
CN105404581A (en) * 2015-12-25 2016-03-16 北京奇虎科技有限公司 Database evaluation method and device
CN106789885A (en) * 2016-11-17 2017-05-31 国家电网公司 User's unusual checking analysis method under a kind of big data environment
CN106815338A (en) * 2016-12-25 2017-06-09 北京中海投资管理有限公司 A kind of real-time storage of big data, treatment and inquiry system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张瑞: ""网络异常流量检测模型设计与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
李洋: ""基于Storm与Hadoop的日志数据实时处理研究"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110851464A (en) * 2019-11-11 2020-02-28 广州及包子信息技术咨询服务有限公司 Data quality treatment method and system
CN110851464B (en) * 2019-11-11 2023-10-27 广州及包子信息技术咨询服务有限公司 Data quality management method and system
CN113127635A (en) * 2019-12-31 2021-07-16 阿里巴巴集团控股有限公司 Data processing method, device and system, storage medium and electronic equipment
CN113127635B (en) * 2019-12-31 2024-04-02 阿里巴巴集团控股有限公司 Data processing method, device and system, storage medium and electronic equipment
CN111881106A (en) * 2020-07-30 2020-11-03 北京智能工场科技有限公司 Data labeling and processing method based on AI (Artificial Intelligence) inspection
CN111881106B (en) * 2020-07-30 2024-03-29 北京智能工场科技有限公司 Data labeling and processing method based on AI (advanced technology attachment) test
CN112379656A (en) * 2020-10-09 2021-02-19 爱普(福建)科技有限公司 Processing method, device, equipment and medium for detecting abnormal data of industrial system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190426