CN106095870A - Data balancing verification method and device - Google Patents

Publication number
CN106095870A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610393585.7A
Other languages
Chinese (zh)
Inventor
郑宇�
张甲超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Original Assignee
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Application filed by LeTV Holding Beijing Co Ltd and LeTV Information Technology Beijing Co Ltd
Priority to CN201610393585.7A
Publication of CN106095870A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/16: File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/1805: Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815: Journaling file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a data balancing verification method and device. The method includes: receiving log data, and generating a full log file and a short log file according to a configuration file, where the configuration file includes short-log-file generation information and the short log file contains less log information than the full log file; counting, according to the short log files, the number of short log files received within a first preset time threshold; writing the full log files to a distributed file system and parsing them to obtain warehoused log files (log files loaded into the data warehouse); counting, according to the warehoused log files, the number of warehoused log files within the first preset time threshold; and verifying, according to the short log file count and the warehoused log file count, whether the log volume within the first preset time threshold is balanced. The proposed data balancing verification method and device reduce the system resources consumed during data balancing verification.

Description

Data balancing verification method and device
Technical field
The present invention relates to the technical field of data processing, and in particular to a data balancing verification method and device.
Background
Hadoop and Hive are distributed solutions for data storage and query that are widely used in the industry. Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides simple SQL (Structured Query Language) query capability by converting SQL statements into MapReduce jobs. Its advantages are a low learning cost and the ability to implement simple MapReduce statistics quickly through SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.
HDFS, the Hadoop Distributed File System, is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but the differences are also significant. HDFS is highly fault-tolerant and suited to deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets. HDFS relaxes some POSIX (Portable Operating System Interface) requirements in order to enable streaming access to file system data.
In the prior art, log volume balance verification (a form of data monitoring) is usually performed by comparing two counts: the log count obtained by counting the log file (access_log) received by the server, and the log count obtained after the log file is written to HDFS, parsed, and mounted to Hive. The verification checks whether the two counts are equal.
To meet business needs, a log_format (that is, the information that access_log stores for each log entry) is usually preconfigured for access_log in the configuration file. It contains many fields, such as remote_addr, time_local, request, http_content_type, and status. As a result, when the log volume is large at peak times, the log file also becomes very large (up to the GB level). Counting the records in such a log file consumes substantial system resources, so frequent counting is likely to degrade server performance and, in serious cases, may affect the server's normal services.
Summary of the invention
In view of this, the object of the present invention is to provide a data balancing verification method and device that reduce the system resources consumed during data balancing verification.
To this end, the data balancing verification method provided by embodiments of the present invention includes:
Receiving log data, and generating a full log file and a short log file according to a configuration file; the configuration file includes short-log-file generation information, and the short log file contains less log information than the full log file;
Counting, according to the short log files, the number of short log files received within a first preset time threshold;
Writing the full log files to a distributed file system and parsing them to obtain warehoused log files;
Counting, according to the warehoused log files, the number of warehoused log files within the first preset time threshold;
Verifying, according to the short log file count and the warehoused log file count, whether the log volume within the first preset time threshold is balanced.
In some embodiments, the log information contained in the short log file is the local time at which the log data was generated, or the status.
In some embodiments, the warehoused log files include valid log files and invalid log files; the warehoused log file count is the sum of the number of valid log files and the number of invalid log files.
In some embodiments, the step of verifying, according to the short log file count and the warehoused log file count, whether the log volume within the first preset time threshold is balanced includes:
Calculating the ratio of the warehoused log file count to the short log file count within the first preset time threshold;
Determining whether the ratio falls within a preset ratio threshold range;
If the ratio falls within the preset ratio threshold range, determining that the log volume within the first preset time threshold is balanced;
If the ratio does not fall within the preset ratio threshold range, determining that the log volume within the first preset time threshold is unbalanced.
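The ratio test in the four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `is_balanced` is hypothetical, the default bounds 0.97 and 1.0 are borrowed from the example range given later in the detailed description, and the handling of a zero short-log count is an added assumption that the text does not address.

```python
def is_balanced(warehoused_count, short_count, low=0.97, high=1.0):
    """Return True when warehoused_count / short_count lies within [low, high].

    low and high stand for the preset ratio threshold range; the defaults
    are taken from the example range in the detailed description.
    """
    if short_count == 0:
        # Assumption, not from the source: with nothing received, treat the
        # window as balanced only if nothing was warehoused either.
        return warehoused_count == 0
    ratio = warehoused_count / short_count
    return low <= ratio <= high


# Hypothetical counts for one verification window.
print(is_balanced(985, 1000))
print(is_balanced(900, 1000))
```

Setting both bounds to 1.0 reduces the check to the strict equality test that the description also mentions as an option.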
In some embodiments, the configuration file further includes a second preset time threshold, and the step of generating the full log file and the short log file according to the configuration file includes:
Loading the configuration file according to the second preset time threshold, and generating the full log file and the short log file from the log data.
Another aspect of the embodiments of the present invention provides a data balancing verification device, including:
A log file generation module, configured to receive log data and generate a full log file and a short log file according to a configuration file; the configuration file includes short-log-file generation information, and the short log file contains less log information than the full log file;
A short log counting module, configured to count, according to the short log files, the number of short log files received within a first preset time threshold;
A warehoused file obtaining module, configured to write the full log files to a distributed file system and parse them to obtain warehoused log files;
A warehoused file counting module, configured to count, according to the warehoused log files, the number of warehoused log files within the first preset time threshold;
A balance verification module, configured to verify, according to the short log file count and the warehoused log file count, whether the log volume within the first preset time threshold is balanced.
In some embodiments, the log information contained in the short log file is the local time at which the log data was generated, or the status.
In some embodiments, the warehoused log files include valid log files and invalid log files; the warehoused log file count is the sum of the number of valid log files and the number of invalid log files.
In some embodiments, the balance verification module is specifically configured to:
Calculate the ratio of the warehoused log file count to the short log file count within the first preset time threshold;
Determine whether the ratio falls within a preset ratio threshold range;
If the ratio falls within the preset ratio threshold range, determine that the log volume within the first preset time threshold is balanced;
If the ratio does not fall within the preset ratio threshold range, determine that the log volume within the first preset time threshold is unbalanced.
In some embodiments, the configuration file further includes a second preset time threshold, and the log file generation module is specifically configured to:
Load the configuration file according to the second preset time threshold, and generate the full log file and the short log file from the log data.
As can be seen from the above, the data balancing verification method and device provided by embodiments of the present invention count the received log data by counting short log files, count the warehoused log files after the full log files have been written to disk and warehoused, and complete data balancing verification from the two counts. When counting the received log data, only the quantity is needed and the actual content of the log data does not have to be analyzed, so it is sufficient to count the short log files without counting the full log files. Because the received log data is counted through the short log files, system resource usage during data balancing verification is reduced, saving considerable time and resources.
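To make the saving concrete, the following Python sketch renders one log record in a full layout and in a short layout and compares their sizes. It is an illustration only: the sample values are invented, the field names are the ones listed in the background section, and the space-separated layout is an assumption rather than the format used in the patent.

```python
# Hypothetical sample record; the field names follow the background section,
# the values are made up for illustration.
record = {
    "remote_addr": "203.0.113.7",
    "time_local": "12/May/2016:01:10:03 +0800",
    "request": "GET /index.html HTTP/1.1",
    "http_content_type": "text/html",
    "status": "200",
}

# Assumed layouts: the full format keeps every field, the short format
# keeps only the local time.
FULL_FIELDS = ["remote_addr", "time_local", "request", "http_content_type", "status"]
SHORT_FIELDS = ["time_local"]


def render(rec, fields):
    """Render one record as a single space-separated line."""
    return " ".join(rec[name] for name in fields) + "\n"


full_line = render(record, FULL_FIELDS)
short_line = render(record, SHORT_FIELDS)

# Both layouts produce exactly one line per record, so counting lines gives
# the same total, but the short file has far fewer bytes to scan.
print(len(full_line.encode()), len(short_line.encode()))
```

Because each received record still yields exactly one line in either file, a line count over the short file is a valid stand-in for a line count over the full file.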
Brief description of the drawings
Fig. 1 is a schematic flowchart of one embodiment of the data balancing verification method provided by the present invention;
Fig. 2 is a schematic flowchart of another embodiment of the data balancing verification method provided by the present invention;
Fig. 3 is a schematic diagram of the module structure of an embodiment of the data balancing verification device provided by the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two entities or parameters that share a name but are not identical. "First" and "second" are used only for convenience of expression and should not be understood as limiting the embodiments of the present invention; this is not repeated in the embodiments below.
A first aspect of the embodiments of the present invention provides one embodiment of a data balancing verification method that reduces the system resources consumed during data balancing verification. Fig. 1 is a schematic flowchart of this embodiment.
The data balancing verification method includes the following steps:
Step 101: receive log data, and generate a full log file (which contains all the required log information) and a short log file according to the configuration file; the configuration file includes short-log-file generation information, and the short log file contains less log information than the full log file; the full log file may be the log file that the system records under normal circumstances, containing all the log information that a conventional log file needs;
Optionally, the embodiment of the present invention targets an offline analysis architecture and is applied to Nginx; the configuration file here can be the Nginx configuration file itself, with short-log-file generation information added to it; Nginx (pronounced "engine x") is a high-performance HTTP (HyperText Transfer Protocol) and reverse proxy server, and also an IMAP (Internet Mail Access Protocol)/POP3 (Post Office Protocol, Version 3)/SMTP (Simple Mail Transfer Protocol) proxy server; as a load-balancing server, Nginx can directly support Rails (a full-stack framework for developing database-driven web applications) and PHP (Hypertext Preprocessor) programs internally to provide external services, and can also provide external services as an HTTP proxy server;
The Nginx configuration file contains many settings; the setting that specifies the log format of the log file (access_log) is one of them, for example:
Here pv and sm are predefined log formats, for example:
The pv log format contains more data, whereas the sm log format contains very little information;
Optionally, with the configuration of step 101, the logs can be rotated at fixed intervals; after each rotation the server reloads the configuration file and generates two log files under the /logs/con directory of the server: the full log file con.log (corresponding to the pv log format; after rotation its actual name is the renamed form of con.log, for example con.20160512-0110.log) and the short log file cons.log (corresponding to the sm log format; likewise, its actual name is the renamed form, for example cons.20160512-0110.log); when the data volume is large, each log file can hold many log entries; each entry in the full log file con.log contains many data fields, whereas each entry in the short log file cons.log stores only enough data to distinguish one log entry from another, for example the time at which the corresponding log data was received; the larger the log volume, the more pronounced the difference in the time and resources consumed by counting the full log file versus the short log file;
Here, rotation means log rotation: put simply, the existing log file is renamed and a new, empty log file with the original name is created;
For example, the configuration file contains the following configuration information:
Once set, the configuration does not change for some time; without log rotation, all data received by the server would be stored in the two files /logs/con/con.log and /logs/con/cons.log, and the log files would keep growing over time;
To allow log files to be processed promptly, the file that receives the log is renamed after a certain period (depending on the situation, this may be hours, days, or weeks; optionally, it is set to 10 minutes); taking con.log as an example, every 10 minutes con.log is renamed (for example to con.20160512-0110.log) and a new empty log file con.log is created; because the configuration file specifies that received data is stored in con.log, data newly received by the server continues to be written to con.log; after rotation, the data in con.20160512-0110.log can be used for subsequent operations, namely: write to HDFS -> parse the file -> mount to Hive;
Step 102: according to the short log files, count the number of short log files received within the first preset time threshold; the first preset time threshold may be a time period in which balance verification needs to be performed (for example, a period of the day whose data gives the best verification results), or the length of time over which data should be collected to meet the needs of balance verification (for example, verifying once every 2 hours gives the best results); the first preset time threshold can be chosen according to actual needs and adjusted as circumstances change;
Step 103: write the full log files to the distributed file system and parse them to obtain warehoused log files;
Here, after the server receives data, the data is stored on the server's disk; once written to disk on the server, the full log file is stored in the distributed file system HDFS, yielding an on-disk log file in seq file format (SequenceFile, a Hadoop binary key-value file format); a parsing program converts this seq-format log file to an RC file (RCFile, Record Columnar File), which is mounted to Hive to complete warehousing, yielding the warehoused log file;
Specifically, the write-to-disk process of the full log file (that is, the process of writing the full log file to HDFS) may include the following steps: after the server receives log data, the data is temporarily stored in the log file con.log, which is rotated every ten minutes; at rotation, that log file is renamed to another file (the renamed file is the full log file that has been written to disk, for example con.20160512-0110.log) and the configuration file is reloaded, generating a new log file con.log (it has the same name as the previous con.log, but because the previous file became a full log file when it was renamed, the newly generated con.log is a brand-new empty log file that holds the new log content that follows); the renamed full log file can then be stored to HDFS by a program (for example glume, a program similar to Flume); this completes the process from receiving log data, through writing to local disk, to writing to HDFS; Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of log data; Flume supports customizing various data senders in the log system to collect data, and also provides the ability to do simple processing on data and write it to various (customizable) data receivers;
Step 104: according to the warehoused log files, count the number of warehoused log files within the first preset time threshold; optionally, the warehoused log file count is obtained from Hive;
Step 105: according to the short log file count and the warehoused log file count, verify whether the log volume within the first preset time threshold is balanced;
Optionally, whether the log volume within the first preset time threshold is balanced can be verified by determining whether the short log file count and the warehoused log file count are equal: if they are equal, the log volume is balanced; if they are not equal, the log volume is unbalanced.
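Steps 102 and 105 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that rotated short logs are named like the example cons.20160512-0110.log, that each log entry occupies one line, and that the count of interest is the number of lines. The helper names `count_short_records` and `verify_equal` are hypothetical, and the warehoused count (which the description obtains from Hive) is passed in as a plain number.

```python
import os
import re
import tempfile
from datetime import datetime

# Assumed naming scheme for rotated short logs, e.g. cons.20160512-0110.log
ROTATED_SHORT_LOG = re.compile(r"^cons\.(\d{8}-\d{4})\.log$")


def count_short_records(log_dir, start, end):
    """Count lines in rotated short logs whose timestamp lies in [start, end)."""
    total = 0
    for name in sorted(os.listdir(log_dir)):
        match = ROTATED_SHORT_LOG.match(name)
        if not match:
            continue  # skip the live cons.log and unrelated files
        stamp = datetime.strptime(match.group(1), "%Y%m%d-%H%M")
        if start <= stamp < end:
            with open(os.path.join(log_dir, name), "rb") as handle:
                total += sum(1 for _ in handle)
    return total


def verify_equal(short_count, warehoused_count):
    """Strict variant of step 105: balanced only when the counts match."""
    return short_count == warehoused_count


# Demonstration with hypothetical files in a temporary directory.
with tempfile.TemporaryDirectory() as log_dir:
    samples = [("20160512-0100", 3), ("20160512-0110", 2), ("20160512-0300", 4)]
    for stamp, lines in samples:
        with open(os.path.join(log_dir, f"cons.{stamp}.log"), "w") as handle:
            handle.write("12/May/2016:01:00:00 +0800\n" * lines)
    window_count = count_short_records(
        log_dir, datetime(2016, 5, 12, 1, 0), datetime(2016, 5, 12, 3, 0)
    )
    print(window_count, verify_equal(window_count, 5))
```

The two-hour window in the demonstration mirrors the "once every 2 hours" example for the first preset time threshold; the file stamped 03:00 falls outside it and is not counted.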
As can be seen from the above embodiment, the data balancing verification method provided by embodiments of the present invention adds a short log file, counts the received log data by counting short log files, counts the warehoused log files after the full log files have been written to disk and warehoused, and completes data balancing verification from the two counts. When counting the received log data, only the quantity is needed and the actual content of the log data does not have to be analyzed, so it is sufficient to count the short log files without counting the full log files. Because the received log data is counted through the short log files, system resource usage during data balancing verification is reduced and the counting time is shortened; when the volume of log data is very large, considerable time and resources are saved.
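The rotation that step 101 relies on (rename the current log file, then create a new empty file under the original name) can be sketched at the file level as shown below. This is a hedged illustration: the function name `rotate` is hypothetical, the timestamp pattern is inferred from the example name con.20160512-0110.log, and the server-side reload of the configuration file that the description mentions is deliberately left out.

```python
import os
import tempfile
from datetime import datetime


def rotate(log_path, now):
    """Rename log_path to <base>.<YYYYMMDD-HHMM><ext> and recreate it empty.

    Returns the path of the renamed (rotated) file.
    """
    directory, filename = os.path.split(log_path)
    base, ext = os.path.splitext(filename)  # e.g. "con" and ".log"
    rotated_path = os.path.join(directory, f"{base}.{now:%Y%m%d-%H%M}{ext}")
    os.rename(log_path, rotated_path)
    # Recreate an empty file under the original name so new data has a home.
    open(log_path, "w").close()
    return rotated_path


# Demonstration in a temporary directory with a hypothetical log entry.
with tempfile.TemporaryDirectory() as log_dir:
    live_log = os.path.join(log_dir, "con.log")
    with open(live_log, "w") as handle:
        handle.write("entry\n")
    rotated = rotate(live_log, datetime(2016, 5, 12, 1, 10))
    rotated_name = os.path.basename(rotated)
    new_size = os.path.getsize(live_log)
    with open(rotated) as handle:
        rotated_content = handle.read()
    print(rotated_name, new_size)
```

In a real deployment the server would also have to be told to reopen its log files after the rename; the description handles this by reloading the configuration file, which is outside the scope of this sketch.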
A second aspect of the embodiments of the present invention provides another embodiment of a data balancing verification method that reduces the system resources consumed during data balancing verification. Fig. 2 is a schematic flowchart of this embodiment.
The data balancing verification method includes the following steps:
Step 201: receive log data, load the configuration file according to the second preset time threshold, and generate a full log file and a short log file; the configuration file includes short-log-file generation information, and the short log file contains less log information than the full log file; optionally, in some embodiments, the log information contained in the short log file is the local time at which the log data was generated (time_local) or the status (status); these two fields take up few resources and still allow log entries to be distinguished at a basic level, which makes counting easier;
Here, loading the configuration file once generates one full log file con.log and one short log file cons.log, which store the corresponding log data; at every interval of the second preset time threshold, con.log and cons.log are renamed and set aside, the configuration file is loaded again, and a new full log file con.log and a new short log file cons.log are generated; this cycle repeats, so that multiple full log files and short log files are produced within the first preset time threshold for counting; the second preset time threshold is shorter than the first preset time threshold and can be set as required, for example 5 to 10 minutes; when the log data volume is large, the second preset time threshold can be shortened accordingly;
Step 202: according to the short log files, count the number of short log files received within the first preset time threshold;
Step 203: write the full log files to the distributed file system to obtain warehoused log files;
Step 204: according to the warehoused log files, count the number of warehoused log files within the first preset time threshold; the first preset time threshold may be a time period in which balance verification needs to be performed (for example, a period of the day whose data gives the best verification results), or the length of time over which data should be collected to meet the needs of balance verification (for example, verifying once every 2 hours gives the best results); the first preset time threshold can be chosen according to actual needs and adjusted as circumstances change;
Among the warehoused log files written to the distributed file system, some may be cleaned out because their log data does not meet the specification or requirements, and the cleaned-out data can be handled separately; the total number of warehoused log files therefore includes both the number of valid log files and the number of invalid log files, the invalid log files being the data that was cleaned out; thus, in some optional embodiments, the warehoused log files include valid log files and invalid log files, and the warehoused log file count is the sum of the number of valid log files and the number of invalid log files; in this way, the result of data balancing verification is not distorted by leaving the cleaned-out invalid log files uncounted; optionally, the valid log files and invalid log files are obtained through parsing in the distributed file system;
As an optional implementation of the balance verification step (step 105 in the first embodiment), the following steps may be included:
Step 205: calculate the ratio of the warehoused log file count to the short log file count within the first preset time threshold;
Step 206: determine whether the ratio falls within the preset ratio threshold range;
Ideally, the preset ratio threshold is 1, that is, the short log file count and the warehoused log file count must be equal; however, modern networks generate a large amount of log data every day, and under normal conditions some data in the warehoused log files may be lost, or fail to be read, during parsing; the preset ratio threshold range is therefore a preset range of ratios that is accepted as balanced, for example 0.97 to 1, which tolerates the normal loss of a small portion of the data without reporting an imbalance;
Step 207: if the ratio falls within the preset ratio threshold range, determine that the log volume within the first preset time threshold is balanced;
Step 208: if the ratio does not fall within the preset ratio threshold range, determine that the log volume within the first preset time threshold is unbalanced;
Implementing the balance verification step through steps 205 to 208 ensures that the verification is correct while tolerating the normal loss of a small amount of data, so that such loss does not affect the verification result.
As can be seen from the above embodiment, the data balancing verification method provided by embodiments of the present invention adds a short log file, counts the received log data by counting short log files, counts the warehoused log files after the full log files have been written to disk and warehoused, and completes data balancing verification from the two counts. When counting the received log data, only the quantity is needed and the actual content of the log data does not have to be analyzed, so it is sufficient to count the short log files without counting the full log files. Because the received log data is counted through the short log files, system resource usage during data balancing verification is reduced and the counting time is shortened; when the volume of log data is very large, considerable time and resources are saved.
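The reason for adding the invalid count to the valid count before computing the ratio can be shown with a small worked example. All numbers below are hypothetical, and the 0.97 to 1 range is the example range from the description; the helper name `warehoused_total` is also an assumption.

```python
def warehoused_total(valid_count, invalid_count):
    """Warehoused count as described: valid files plus cleaned-out invalid files."""
    return valid_count + invalid_count


# Hypothetical counts for one verification window.
short_count = 1000
valid_count = 940
invalid_count = 50

# Ratio if the cleaned-out data were ignored, and ratio with it included.
ratio_valid_only = valid_count / short_count
ratio_total = warehoused_total(valid_count, invalid_count) / short_count

LOW, HIGH = 0.97, 1.0  # example range from the description
print(ratio_valid_only, LOW <= ratio_valid_only <= HIGH)
print(ratio_total, LOW <= ratio_total <= HIGH)
```

With these numbers, counting only the valid files would report an imbalance even though just 10 of 1000 entries are actually unaccounted for; including the invalid files keeps the ratio inside the accepted range.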
In a third aspect of the embodiments of the present invention, an embodiment of a data balancing verification device is provided that reduces system resource usage during data balancing verification. Figure 3 is a schematic diagram of the module structure of the data balancing verification device embodiment provided by the present invention.
The data balancing verification device includes:
A log file generation module 301, configured to receive log data and to generate a full log file (containing all required log information) and a short log file according to the configuration file. The configuration file includes short log file generation information, and the short log file contains less log information than the full log file. The full log file may be the log file that the system records under normal circumstances, containing all the log information that a conventional log file requires;
Optionally, the embodiment of the present invention targets an offline analysis architecture and is applied to Nginx. The configuration file here can be Nginx's own configuration file, to which the short log file generation information is added. Nginx (pronounced "engine x") is a high-performance HTTP (HyperText Transfer Protocol) and reverse proxy server, as well as an IMAP (Internet Mail Access Protocol)/POP3 (Post Office Protocol, Version 3)/SMTP (Simple Mail Transfer Protocol) proxy server. As a load-balancing server, Nginx can support Rails (a full-stack framework for developing database-driven web applications) and PHP (PHP: Hypertext Preprocessor) programs internally in serving external requests, or can serve externally as an HTTP proxy server;
The Nginx configuration file contains many settings, one of which sets the log format of the log file (access_log). In the example configuration, pv and sm are preconfigured log formats: the pv log format records a large amount of data per entry, whereas the sm log format records very little;
Optionally, under the configuration of step 101, the logs can be rotated at fixed intervals. After each rotation, the server reloads the configuration file and generates two log files under the /log/con directory of the server: the full log file con.log (corresponding to the pv log format; a rotated full log file carries its renamed name, for example con.20160512-0110.log) and the short log file cons.log (corresponding to the sm log format; likewise, a rotated short log file carries its renamed name, for example cons.20160512-0110.log). When the data volume is large, each log file can hold many log entries. Each entry in the full log file con.log contains a large amount of data, whereas each entry in the short log file cons.log stores only enough data to distinguish one log entry from another, such as the time at which the corresponding log data was received. The larger the log volume, the more pronounced the difference in the time and resources consumed by counting full log files versus short log files;
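The relationship between the two log formats can be illustrated with a minimal Python sketch. It is not Nginx's own logging mechanism: the function names and the record layout are hypothetical, and the short entry is assumed to hold only the receive time.

```python
def full_entry(record: dict) -> str:
    """Full (pv-like) entry: every field of the record, tab-separated."""
    return "\t".join(f"{key}={record[key]}" for key in sorted(record))


def short_entry(record: dict) -> str:
    """Short (sm-like) entry: only the receive time, enough to tell entries apart."""
    return str(record["time_local"])


def append_record(record: dict, full_path: str, short_path: str) -> None:
    """Append one received record to both the full log and the short log."""
    with open(full_path, "a", encoding="utf-8") as full_log:
        full_log.write(full_entry(record) + "\n")
    with open(short_path, "a", encoding="utf-8") as short_log:
        short_log.write(short_entry(record) + "\n")
```

Both files receive exactly one line per record, so the short log can stand in for the full log whenever only a count is needed.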
Here, rotation means log rotation: put simply, the existing log file is renamed and an empty log file with the original name is created in its place;
For example, suppose the configuration file contains configuration information that directs the received data to /logs/con/con.log and /logs/con/cons.log. Once set, this configuration does not change for some time. Without log rotation, all data received by the server would be stored in these two files, and the log files would keep growing over time;
So that log files can be processed promptly, the file receiving the log data is generally renamed after a period of time (by the hour, day, or week depending on the situation; optionally, every 10 minutes). Taking con.log as an example, every 10 minutes con.log is renamed (for example, to con.20160512-0110.log) and a new, empty log file con.log is created. Because the configuration file specifies that received data is stored in con.log, data newly received by the server is still written to con.log. After rotation, the data in con.20160512-0110.log can be used for the subsequent operations, namely: write to HDFS -> file parsing -> mount to Hive;
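The rename-and-recreate step of log rotation can be sketched in Python as follows. This is a simplified illustration only: the function name `rotate` is hypothetical, and the timestamp layout is inferred from the example name con.20160512-0110.log. A real server that keeps the log file open would also need to reopen it, which the text handles by reloading the configuration file.

```python
import os
from datetime import datetime


def rotate(log_path: str, now: datetime) -> str:
    """Rename log_path to <stem>.<YYYYMMDD-HHMM>.log and recreate it empty.

    Returns the path of the rotated file, which is then ready for
    downstream processing.
    """
    directory, filename = os.path.split(log_path)
    stem, ext = os.path.splitext(filename)  # e.g. "con", ".log"
    rotated = os.path.join(directory, f"{stem}.{now:%Y%m%d-%H%M}{ext}")
    os.rename(log_path, rotated)
    # Recreate an empty file under the original name for new data.
    open(log_path, "w", encoding="utf-8").close()
    return rotated
```

Calling `rotate` on con.log and on cons.log with the same timestamp yields a matching pair of rotated files for the same interval.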
A short log counting module 302, configured to count, according to the short log files, the quantity of short log files received within the first preset time threshold;
The first preset time threshold may be the time period over which data balancing verification is to be performed (for example, a particular period of the day whose data gives the best verification results), or the time span over which data should be collected to meet the needs of data balancing verification (for example, verification works best when performed once every 2 hours). The first preset time threshold can be chosen according to actual needs and adjusted as circumstances change;
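Counting over the first preset time threshold can be sketched in Python as below. The sketch rests on several assumptions that the text does not state: rotated short log files are named cons.YYYYMMDD-HHMM.log, the timestamp in the name decides whether a file belongs to the window, and each line is one log entry. Since the text does not say whether the "quantity" is a number of files or of entries, the sketch exposes both; all function names are hypothetical.

```python
import os
import re
from datetime import datetime

# Matches rotated short log files only, e.g. cons.20160512-0110.log
ROTATED_SHORT = re.compile(r"^cons\.(\d{8}-\d{4})\.log$")


def short_logs_in_window(directory: str, start: datetime, end: datetime) -> list:
    """Return rotated short log files whose name timestamp lies in [start, end)."""
    selected = []
    for name in sorted(os.listdir(directory)):
        match = ROTATED_SHORT.match(name)
        if not match:
            continue  # skips con.*.log, the active cons.log, and unrelated files
        stamp = datetime.strptime(match.group(1), "%Y%m%d-%H%M")
        if start <= stamp < end:
            selected.append(os.path.join(directory, name))
    return selected


def count_entries(paths: list) -> int:
    """Count log entries across the given files, assuming one entry per line."""
    total = 0
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            total += sum(1 for _ in handle)
    return total
```

Here `len(short_logs_in_window(...))` gives the number of short log files in the window and `count_entries(...)` gives the number of entries they hold.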
A warehoused file acquisition module 303, configured to write the full log file to a distributed file system and parse it to obtain warehoused log files;
Here, after the server receives data, the data is stored on the server's disk. After being written to the server's disk, the full log file is stored in the distributed file system HDFS, yielding an on-disk log file in seq file format. An analysis program then converts this seq-format log file into an RC file, which is mounted to Hive to complete warehousing, yielding the warehoused log file;
Specifically, writing the full log file to disk (that is, the process leading up to writing the full log file to HDFS) may include the following steps. After the server receives log data, the data is temporarily stored in the log file con.log, which is rotated every ten minutes. On rotation, this log file is renamed to another file (the renamed file is the on-disk full log file, for example con.20160512-0110.log) and the configuration file is reloaded, generating a new log file con.log. Although it has the same name as the previous con.log, the previous file became a full log file when it was renamed, so the regenerated con.log is a brand-new empty log file that holds the new log content as it arrives. The renamed full log file can then be stored to HDFS by a program (for example glume, a program similar to Flume), which completes the process from receiving log data, through writing it to local disk, to writing it to HDFS. Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of logs. Flume supports customizing various data senders in a logging system to collect data, and also provides the ability to perform simple processing on the data and write it to various (customizable) data receivers;
A warehoused file counting module 304, configured to count, according to the warehoused log files, the quantity of warehoused log files within the first preset time threshold; optionally, this quantity is obtained by counting in Hive;
A balance verification module 305, configured to verify, according to the short log file quantity and the warehoused log file quantity, whether the log volume within the first preset time threshold is balanced;
Optionally, whether the log volume within the first preset time threshold is balanced may be verified by determining whether the short log file quantity equals the warehoused log file quantity: if they are equal, the log volume is balanced; otherwise, it is unbalanced.
As the above embodiment shows, the data balancing verification device provided by the embodiments of the present invention adds a short log file, counts the received log data by counting short log files, counts the warehoused log files after the full log files have been written to disk and warehoused, and completes the data balancing verification from these two statistics. Because counting the received log data requires only the quantity of log data received, not an analysis of its actual content, it is enough to count the short log files, and the full log files need not be counted. Counting the received log data through the short log files therefore reduces system resource usage and shortens the counting time during data balancing verification; when the volume of log data is very large, this saves substantial time and resources.
Optionally, in some embodiments, the log information contained in the short log file is the local time (time_local) at which the log data was generated, or the status (status). These two fields take up few resources and also allow log entries to be roughly distinguished, which makes counting easier.
Among the warehoused log files that have been written to disk, stored in the distributed file system, and warehoused, some are cleaned out because their log data does not meet the specification or requirements; the cleaned-out data can be processed separately. The total number of warehoused log files therefore comprises the number of valid log files and the number of invalid log files, the invalid log files being the data that was cleaned out. Accordingly, in some optional embodiments, the warehoused log files include valid log files and invalid log files, and the warehoused log file quantity is the sum of the quantity of valid log files and the quantity of invalid log files. In this way, the verification result is not skewed by cleaned-out invalid log files going uncounted. Optionally, the valid and invalid log files are obtained through parsing in the distributed file system.
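The point that cleaned-out data must still be counted can be shown with a small Python sketch. The validity rule is supplied by the caller and is purely hypothetical, as are the function names; the text does not specify how the cleaning is performed.

```python
def split_by_validity(entries, is_valid):
    """Partition entries into (valid, invalid) using a caller-supplied rule."""
    valid, invalid = [], []
    for entry in entries:
        (valid if is_valid(entry) else invalid).append(entry)
    return valid, invalid


def warehoused_quantity(valid, invalid):
    """Warehoused quantity is the sum of the valid and invalid quantities."""
    return len(valid) + len(invalid)
```

Because every entry lands in exactly one of the two lists, the warehoused quantity equals the number of entries that reached the cleaning step, whether or not they passed it.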
Preferably, in some optional embodiments, the balance verification module 305 is specifically configured to:
Calculate the ratio of the warehoused log file quantity to the short log file quantity within the first preset time threshold;
Determine whether the ratio is within the preset ratio threshold range;
Ideally, the preset ratio threshold range is exactly 1, meaning the short log file quantity and the warehoused log file quantity must be equal. In modern networks, however, a large amount of log data is generated every day, so even under normal conditions part of the data may be lost or fail to be read by the time the analysis program has produced the warehoused log files. The preset ratio threshold range therefore refers to a preset range of ratios that still confirms data balance, for example 0.97 to 1, which tolerates the normal loss of a portion of the data without reporting an imbalance;
If the ratio is within the preset ratio threshold range, determine that the log volume within the first preset time threshold is balanced;
If the ratio is not within the preset ratio threshold range, determine that the log volume within the first preset time threshold is unbalanced.
The above embodiment keeps the data balancing verification correct while tolerating the normal loss of a small amount of data, so that such minor loss does not affect the verification result.
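The ratio test can be written as a few lines of Python. This is a sketch under stated assumptions: the default bounds 0.97 and 1.0 are the example range given in the text, the function name is hypothetical, and treating an empty window (no short log files) as balanced only when nothing was warehoused is a choice made here, not something the text specifies.

```python
def is_balanced(warehoused_quantity: int, short_quantity: int,
                low: float = 0.97, high: float = 1.0) -> bool:
    """Return True if warehoused/short lies within the preset ratio range."""
    if short_quantity == 0:
        # Assumption: with nothing received, balance means nothing warehoused.
        return warehoused_quantity == 0
    ratio = warehoused_quantity / short_quantity
    return low <= ratio <= high
```

For example, 980 warehoused against 1000 received gives a ratio of 0.98 and is judged balanced, while 900 against 1000 gives 0.90 and is judged unbalanced.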
Optionally, in some embodiments, the configuration file further includes a second preset time threshold, and the log file generation module 301 is specifically configured to:
According to the second preset time threshold, load the configuration file and generate the log data as full log files and short log files. Here, loading the configuration file once generates one full log file con.log and one short log file cons.log, which store the corresponding log data. Every second preset time threshold, the full log file con.log and the short log file cons.log are renamed and set aside, the configuration file is reloaded, and a new full log file con.log and a new short log file cons.log are generated. Repeating this cycle produces multiple full log files and short log files within the first preset time threshold, which are then used for counting log files. The second preset time threshold is shorter than the first preset time threshold and can be set as needed, for example to 5 to 10 minutes; when the volume of log data is large, the second preset time threshold can be shortened accordingly.
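How the two time thresholds interact can be sketched in Python by listing the rotation times that fall within one verification window. The function name is hypothetical, and the sketch assumes rotations happen at exact multiples of the second threshold counted from the start of the window.

```python
from datetime import datetime, timedelta


def rotation_times(window_start: datetime,
                   first_threshold: timedelta,
                   second_threshold: timedelta) -> list:
    """List the rotation times that fall within one verification window."""
    if second_threshold <= timedelta(0) or second_threshold >= first_threshold:
        raise ValueError("second threshold must be positive and shorter than the first")
    times = []
    current = window_start + second_threshold
    window_end = window_start + first_threshold
    while current <= window_end:
        times.append(current)
        current += second_threshold
    return times
```

With a first threshold of 2 hours and a second threshold of 10 minutes, the window contains 12 rotations, so 12 full log files and 12 short log files are available for counting.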
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is exemplary only and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Within the spirit of the present invention, the technical features of the above embodiment or of different embodiments may also be combined, the steps may be performed in any order, and many other variations of the different aspects of the present invention as described above exist, which are not provided in detail for the sake of brevity.
In addition, to simplify the description and discussion, and so as not to obscure the present invention, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, devices may be shown in block diagram form to avoid obscuring the present invention, which also takes into account the fact that the details of implementing such block diagram devices depend heavily on the platform on which the present invention is to be implemented (that is, these details should be well within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of the present invention, it will be apparent to those skilled in the art that the present invention can be practiced without these specific details or with variations of them. These descriptions are therefore to be regarded as illustrative rather than restrictive.
Although the present invention has been described in conjunction with specific embodiments, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the embodiments discussed.
The embodiments of the present invention are intended to cover all such alternatives, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A data balancing verification method, characterized by comprising:
receiving log data, and generating a full log file and a short log file according to a configuration file, wherein the configuration file includes short log file generation information and the short log file contains less log information than the full log file;
counting, according to the short log files, the quantity of short log files received within a first preset time threshold;
writing the full log file to a distributed file system and parsing it to obtain warehoused log files;
counting, according to the warehoused log files, the quantity of warehoused log files within the first preset time threshold; and
verifying, according to the short log file quantity and the warehoused log file quantity, whether the log volume within the first preset time threshold is balanced.
2. The method according to claim 1, characterized in that the log information contained in the short log file is the local time at which the log data was generated, or the status.
3. The method according to claim 1, characterized in that the warehoused log files include valid log files and invalid log files, and the warehoused log file quantity is the sum of the quantity of valid log files and the quantity of invalid log files.
4. The method according to claim 1, characterized in that the step of verifying, according to the short log file quantity and the warehoused log file quantity, whether the log volume within the first preset time threshold is balanced comprises:
calculating the ratio of the warehoused log file quantity to the short log file quantity within the first preset time threshold;
determining whether the ratio is within a preset ratio threshold range;
if the ratio is within the preset ratio threshold range, determining that the log volume within the first preset time threshold is balanced; and
if the ratio is not within the preset ratio threshold range, determining that the log volume within the first preset time threshold is unbalanced.
5. The method according to any one of claims 1 to 4, characterized in that the configuration file further includes a second preset time threshold, and the step of generating a full log file and a short log file according to the configuration file comprises:
loading the configuration file according to the second preset time threshold, and generating the log data as full log files and short log files.
6. A data balancing verification device, characterized by comprising:
a log file generation module, configured to receive log data and to generate a full log file and a short log file according to a configuration file, wherein the configuration file includes short log file generation information and the short log file contains less log information than the full log file;
a short log counting module, configured to count, according to the short log files, the quantity of short log files received within a first preset time threshold;
a warehoused file acquisition module, configured to write the full log file to a distributed file system and parse it to obtain warehoused log files;
a warehoused file counting module, configured to count, according to the warehoused log files, the quantity of warehoused log files within the first preset time threshold; and
a balance verification module, configured to verify, according to the short log file quantity and the warehoused log file quantity, whether the log volume within the first preset time threshold is balanced.
7. The device according to claim 6, characterized in that the log information contained in the short log file is the local time at which the log data was generated, or the status.
8. The device according to claim 6, characterized in that the warehoused log files include valid log files and invalid log files, and the warehoused log file quantity is the sum of the quantity of valid log files and the quantity of invalid log files.
9. The device according to claim 6, characterized in that the balance verification module is specifically configured to:
calculate the ratio of the warehoused log file quantity to the short log file quantity within the first preset time threshold;
determine whether the ratio is within a preset ratio threshold range;
if the ratio is within the preset ratio threshold range, determine that the log volume within the first preset time threshold is balanced; and
if the ratio is not within the preset ratio threshold range, determine that the log volume within the first preset time threshold is unbalanced.
10. The device according to any one of claims 6 to 9, characterized in that the configuration file further includes a second preset time threshold, and the log file generation module is specifically configured to:
load the configuration file according to the second preset time threshold, and generate the log data as full log files and short log files.
CN201610393585.7A 2016-06-06 2016-06-06 Data balancing verification method and device Pending CN106095870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610393585.7A CN106095870A (en) 2016-06-06 2016-06-06 Data balancing verification method and device

Publications (1)

Publication Number Publication Date
CN106095870A true CN106095870A (en) 2016-11-09

Family

ID=57447288

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610393585.7A Pending CN106095870A (en) 2016-06-06 2016-06-06 Data balancing verification method and device

Country Status (1)

Country Link
CN (1) CN106095870A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108667680A (en) * 2017-10-30 2018-10-16 上海幻电信息科技有限公司 A kind of monitoring system and method for multilink real time data steaming transfer

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004045155A1 (en) * 2002-11-14 2004-05-27 Huawei Technologies Co., Ltd. Network traffic statistics method of ip device
CN101364898A (en) * 2008-09-22 2009-02-11 中国联合通信有限公司 Method and system for flow balance checking
CN102004971A (en) * 2010-12-27 2011-04-06 用友软件股份有限公司 Metering method and system for ERP (Enterprise Resource Planning) system
CN102486795A (en) * 2010-12-03 2012-06-06 中国移动通信集团陕西有限公司 Method and device for inspecting balance of dynamic file
CN105631026A (en) * 2015-12-30 2016-06-01 北京奇艺世纪科技有限公司 Security data analysis system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108667680A (en) * 2017-10-30 2018-10-16 上海幻电信息科技有限公司 A kind of monitoring system and method for multilink real time data steaming transfer
CN108667680B (en) * 2017-10-30 2020-11-24 上海幻电信息科技有限公司 Monitoring system and method for multilink real-time data stream transmission

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20161109
