CN106095870A - Data balancing verification method and device - Google Patents
- Publication number
- CN106095870A
- Authority
- CN
- China
- Prior art keywords
- file
- journal file
- daily record
- short
- preset time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1805—Append-only file systems, e.g. using logs or journals to store data
- G06F16/1815—Journaling file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a data balance verification method and device. The method includes: receiving log data, and generating a full log file and a short log file according to a configuration file, where the configuration file includes short log file generation information and the short log file contains less log information than the full log file; counting, according to the short log file, the number of short log files received within a first preset time threshold; writing the full log file to a distributed file system and parsing it to obtain warehoused log files; counting, according to the warehoused log files, the number of warehoused log files within the first preset time threshold; and verifying, according to the short log file count and the warehoused log file count, whether the log volume within the first preset time threshold is balanced. The proposed method and device reduce the system resources consumed during data balance verification.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a data balance verification method and device.
Background technology
Hadoop and Hive are distributed solutions for data storage and querying that are widely used in industry. Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides simple SQL (Structured Query Language) querying by converting SQL statements into MapReduce jobs. Its advantage is a low learning cost: simple MapReduce statistics can be implemented quickly with SQL-like statements, without developing dedicated MapReduce applications, which makes it well suited to statistical analysis in a data warehouse.
HDFS, the Hadoop Distributed File System, is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems, but its differences from them are also significant. HDFS is highly fault-tolerant and is suited to deployment on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets. HDFS relaxes some of the constraints of the POSIX (Portable Operating System Interface) standard in order to enable streaming reads of file system data.
When the prior art performs a log volume balance check (that is, verification that the log volume is balanced, which is one form of data monitoring), it usually compares two counts: the log count obtained by counting the log file (access_log) received by the server, and the log count obtained after the log file is written to HDFS, parsed, and mounted to Hive. The check passes if the two counts are equal.
To meet general needs, a log_format (that is, the information that access_log stores for each log entry) is usually preconfigured for access_log in the configuration file. It contains many fields, such as remote_addr, time_local, request, http_content_type, and status. As a result, when the log volume is very large at peak times, the resulting log file is also very large (it can reach the GB level). Counting the records in such a log file consumes considerable system resources, so frequent counting is likely to degrade server performance and, in serious cases, may affect the server's normal services.
Summary of the invention
In view of this, an object of the present invention is to provide a data balance verification method and device that reduce the system resources consumed during data balance verification.
To achieve the above object, the data balance verification method provided by an embodiment of the present invention includes:
receiving log data, and generating a full log file and a short log file according to a configuration file, where the configuration file includes short log file generation information and the short log file contains less log information than the full log file;
counting, according to the short log file, the number of short log files received within a first preset time threshold;
writing the full log file to a distributed file system and parsing it to obtain warehoused log files;
counting, according to the warehoused log files, the number of warehoused log files within the first preset time threshold;
verifying, according to the short log file count and the warehoused log file count, whether the log volume within the first preset time threshold is balanced.
In some embodiments, the log information contained in the short log file is the local time at which the log data was generated, or the status.
In some embodiments, the warehoused log files include valid log files and invalid log files, and the warehoused log file count is the sum of the number of valid log files and the number of invalid log files.
In some embodiments, the step of verifying, according to the short log file count and the warehoused log file count, whether the log volume within the first preset time threshold is balanced includes:
calculating the ratio of the warehoused log file count to the short log file count within the first preset time threshold;
judging whether the ratio falls within a preset ratio threshold range;
if the ratio falls within the preset ratio threshold range, judging that the log volume within the first preset time threshold is balanced;
if the ratio does not fall within the preset ratio threshold range, judging that the log volume within the first preset time threshold is unbalanced.
In some embodiments, the configuration file further includes a second preset time threshold, and the step of generating the full log file and the short log file according to the configuration file includes:
loading the configuration file according to the second preset time threshold, and generating the full log file and the short log file from the log data.
In another aspect, an embodiment of the present invention further provides a data balance verification device, including:
a log file generation module, configured to receive log data and generate a full log file and a short log file according to a configuration file, where the configuration file includes short log file generation information and the short log file contains less log information than the full log file;
a short log counting module, configured to count, according to the short log file, the number of short log files received within a first preset time threshold;
a warehoused file obtaining module, configured to write the full log file to a distributed file system and parse it to obtain warehoused log files;
a warehoused file counting module, configured to count, according to the warehoused log files, the number of warehoused log files within the first preset time threshold;
a balance verification module, configured to verify, according to the short log file count and the warehoused log file count, whether the log volume within the first preset time threshold is balanced.
In some embodiments, the log information contained in the short log file is the local time at which the log data was generated, or the status.
In some embodiments, the warehoused log files include valid log files and invalid log files, and the warehoused log file count is the sum of the number of valid log files and the number of invalid log files.
In some embodiments, the balance verification module is specifically configured to:
calculate the ratio of the warehoused log file count to the short log file count within the first preset time threshold;
judge whether the ratio falls within a preset ratio threshold range;
if the ratio falls within the preset ratio threshold range, judge that the log volume within the first preset time threshold is balanced;
if the ratio does not fall within the preset ratio threshold range, judge that the log volume within the first preset time threshold is unbalanced.
In some embodiments, the configuration file further includes a second preset time threshold, and the log file generation module is specifically configured to:
load the configuration file according to the second preset time threshold, and generate the full log file and the short log file from the log data.
As can be seen from the above, the data balance verification method and device provided by the embodiments of the present invention count the received log data by counting short log files, count the warehoused log files after the full log files have been written to disk and warehoused, and complete data balance verification from the two counts. When the received log data is counted, only its quantity is needed and its actual content need not be analyzed, so only the short log files need to be counted, not the full log files. Because the received log data is counted via the short log files, fewer system resources are consumed during data balance verification, which saves a substantial amount of time and resources.
Brief description of the drawings
Fig. 1 is a schematic flowchart of one embodiment of the data balance verification method provided by the present invention;
Fig. 2 is a schematic flowchart of another embodiment of the data balance verification method provided by the present invention;
Fig. 3 is a schematic diagram of the module structure of an embodiment of the data balance verification device provided by the present invention.
Detailed description of the invention
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to specific embodiments and the accompanying drawings.
It should be noted that all uses of "first" and "second" in the embodiments of the present invention serve to distinguish two non-identical entities or parameters that share the same name. "First" and "second" are used only for convenience of expression and should not be construed as limiting the embodiments of the present invention; subsequent embodiments do not repeat this note.
In a first aspect of the embodiments of the present invention, an embodiment of a data balance verification method that reduces the system resources consumed during data balance verification is provided. Fig. 1 is a schematic flowchart of one embodiment of the data balance verification method provided by the present invention.
The data balance verification method includes the following steps:
Step 101: receive log data, and generate a full log file (which contains all the required log information) and a short log file according to the configuration file; the configuration file includes short log file generation information, and the short log file contains less log information than the full log file; the full log file may be the log file that the system records under normal circumstances, containing all the log information that a conventional log file is required to have;
Optionally, the embodiment of the present invention targets an offline analysis architecture and is applied to Nginx. The configuration file here can be Nginx's own configuration file, to which the short log file generation information is added. Nginx (pronounced "engine x") is a high-performance HTTP (HyperText Transfer Protocol) and reverse proxy server, and also an IMAP (Internet Mail Access Protocol)/POP3 (Post Office Protocol version 3)/SMTP (Simple Mail Transfer Protocol) proxy server. As a load-balancing server, Nginx can directly support Rails (a full framework for developing database-driven web applications) and PHP (PHP: Hypertext Preprocessor) programs internally to serve external requests, and it can also serve external requests as an HTTP proxy server;
Nginx's configuration file contains many settings, one of which sets the log format of the log file (access_log), for example:
[configuration listing not reproduced in this text]
Here, pv and sm are predefined log formats, for example:
[log format listing not reproduced in this text]
The pv log format contains more data, whereas the sm log format contains very little;
Optionally, with the configuration of step 101, rotation can be performed at regular intervals. After each rotation the server reloads the configuration file and generates two log files in the server's /log/con directory: the full log file con.log (corresponding to the pv log format; the full log file's actual name is the name it is given on renaming, for example con.20160512-0110.log) and the short log file cons.log (corresponding to the sm log format; likewise, the short log file's actual name is the renamed one, for example cons.20160512-0110.log). When the data volume is large, each log file can hold many log records. Every record in the full log file con.log contains a lot of data, whereas every record in the short log file cons.log stores only enough data to distinguish one log record from another, for example the time at which the record was received. When the log volume is very large, the difference between the time and resources consumed in counting the full log file and in counting the short log file is very marked;
Here, rotation means log rotation: put simply, the existing log file is renamed, and a new, empty log file is created under the original name;
For example, the configuration file contains the following configuration information:
[configuration listing not reproduced in this text]
Once set, the configuration information does not change for some time. Without log rotation, all the data the server receives would be stored in the two files /logs/con/con.log and /logs/con/cons.log, and the log files would keep growing over time;
So that log files can be processed promptly, the file receiving the logs is renamed at regular intervals (depending on the situation, this may be hourly, daily, or weekly; optionally, it is set to 10 minutes). Taking con.log as an example, every 10 minutes con.log is renamed (for example to con.20160512-0110.log) and a new, empty log file con.log is created;
Because the configuration file specifies that received data is stored in con.log, data newly received by the server is still written to the con.log file. After rotation, the data in con.20160512-0110.log can be used for the subsequent operations, namely: write to HDFS -> parse the file -> mount to Hive;
Step 102: according to the short log file, count the number of short log files received within the first preset time threshold. The first preset time threshold may be the time period in which the balance check is to be performed (for example, a particular period of the day whose data gives the best balance-check results), or the length of time over which data must be collected to satisfy the balance check (for example, a balance check every 2 hours gives the best results). The first preset time threshold can be chosen according to actual needs and adjusted as circumstances change;
Step 103: write the full log file to the distributed file system and parse it to obtain warehoused log files;
Here, after the server receives data, the data is stored on the server's disk. Once the full log file has been written to the server's disk, it is stored in the distributed file system HDFS, yielding an on-disk log file in seq file format. This file is passed through the parsing program to produce an RC file, which is mounted to Hive to complete warehousing and yields the warehoused log files;
Specifically, writing the full log file to disk (that is, the process by which the full log file is written to HDFS) may include the following steps. After the server receives log data, it stores the data temporarily in the log file con.log, which is rotated every ten minutes. On rotation, that log file is renamed to another file (the renamed file is the full log file written to disk, for example con.20160512-0110.log), and at the same time the configuration file is reloaded and a new log file con.log is generated. The new file has the same name as the previous con.log, but because the previous file became a full log file when it was renamed, the regenerated con.log is a brand-new empty log file, and what it goes on to hold is new log content. The renamed full log file can then be stored to HDFS by a program (for example glume, a program similar to Flume), which completes the process from receiving the log data, through writing it to local disk, to writing it to HDFS. Flume is a highly available, highly reliable, distributed system for collecting, aggregating, and transporting massive volumes of logs. It supports customizing various data senders in the logging system to collect data, and it can also perform simple processing on the data and write it to various (customizable) data receivers;
Step 104: according to the warehoused log files, count the number of warehoused log files within the first preset time threshold; optionally, the warehoused log file count is obtained from Hive;
Step 105: according to the short log file count and the warehoused log file count, verify whether the log volume within the first preset time threshold is balanced;
Optionally, whether the log volume within the first preset time threshold is balanced may be verified by judging whether the short log file count and the warehoused log file count are equal: if they are equal, the log volume is balanced; if they are not equal, the log volume is unbalanced.
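The counting in step 102 and the equality check in step 105 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation. It assumes that rotated short log files follow the naming pattern of the example cons.20160512-0110.log, that each log record occupies one line, and that the count obtained after warehousing (step 104) is supplied from elsewhere, for instance from a Hive query. The function names are hypothetical.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed naming scheme, taken from the example name cons.20160512-0110.log.
ROTATED_SHORT_LOG = re.compile(r"^cons\.(\d{8}-\d{4})\.log$")


def count_short_log_records(log_dir, window_start, window_end):
    """Count records (assumed one per line) in rotated short log files whose
    file-name timestamp falls in [window_start, window_end)."""
    total = 0
    for path in sorted(Path(log_dir).iterdir()):
        match = ROTATED_SHORT_LOG.match(path.name)
        if match is None:
            continue  # skips full logs and the live, not yet rotated cons.log
        stamp = datetime.strptime(match.group(1), "%Y%m%d-%H%M")
        if window_start <= stamp < window_end:
            with path.open("rb") as handle:
                total += sum(1 for _ in handle)
    return total


def is_balanced_exact(short_count, warehoused_count):
    """Equality variant of step 105: balanced only if both counts match."""
    return short_count == warehoused_count
```

Only the short files are opened, which is the point of the method: the much larger full log files are never read for counting.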
As the above embodiment shows, the data balance verification method provided by the embodiments of the present invention adds a short log file, counts the received log data by counting short log files, counts the warehoused log files after the full log files have been written to disk and warehoused, and completes data balance verification from the two counts. When the received log data is counted, only its quantity is needed and its actual content need not be analyzed, so only the short log files need to be counted, not the full log files. Because the received log data is counted via the short log files, data balance verification consumes fewer system resources and takes less time to count, and when the log data volume is very large a substantial amount of time and resources is saved.
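The log rotation that step 101 relies on (rename the live file, then recreate an empty file under the original name) can be sketched as follows. This is an illustrative sketch only: the function name `rotate_log` is hypothetical, the timestamp format is inferred from the example name con.20160512-0110.log, and the server reload that the text pairs with each rotation is outside the sketch.

```python
import os
from datetime import datetime
from pathlib import Path


def rotate_log(live_path, now=None):
    """Rename the live log to a timestamped name and recreate it empty.

    Returns the path of the rotated file.
    """
    live = Path(live_path)
    now = now or datetime.now()
    # con.log -> con.20160512-0110.log (naming pattern inferred from the text)
    rotated = live.with_name(f"{live.stem}.{now:%Y%m%d-%H%M}{live.suffix}")
    os.rename(live, rotated)
    live.touch()  # fresh, empty file under the original name
    return rotated
```

Calling `rotate_log("con.log")` and `rotate_log("cons.log")` at each interval would yield one full and one short rotated file per interval. A running server keeps writing to the renamed file until it reopens its logs, which is why the text has the server reload its configuration after each rotation.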
In a second aspect of the embodiments of the present invention, another embodiment of a data balance verification method that reduces the system resources consumed during data balance verification is provided. Fig. 2 is a schematic flowchart of another embodiment of the data balance verification method provided by the present invention.
The data balance verification method includes the following steps:
Step 201: receive log data, load the configuration file according to the second preset time threshold, and generate a full log file and a short log file; the configuration file includes short log file generation information, and the short log file contains less log information than the full log file; optionally, in some embodiments, the log information contained in the short log file is the local time at which the log data was generated (time_local) or the status (status). These two fields take up few resources and also allow log records to be roughly distinguished from one another, which makes counting easier;
Here, one configuration file is loaded and one full log file con.log and one short log file cons.log are generated, each storing the corresponding log data. At every second preset time threshold, con.log and cons.log are renamed and stored separately; the configuration file is then reloaded and a new con.log and a new cons.log are generated. Repeating this cycle produces multiple full log files and short log files within the first preset time threshold, which are used for counting log files. The second preset time threshold is smaller than the first preset time threshold and can be set as needed, for example to 5 to 10 minutes; when the log data volume is large, the second preset time threshold can be shortened accordingly;
Step 202: according to the short log file, count the number of short log files received within the first preset time threshold;
Step 203: write the full log file to the distributed file system to obtain warehoused log files;
Step 204: according to the warehoused log files, count the number of warehoused log files within the first preset time threshold. The first preset time threshold may be the time period in which the balance check is to be performed (for example, a particular period of the day whose data gives the best balance-check results), or the length of time over which data must be collected to satisfy the balance check (for example, a balance check every 2 hours gives the best results). The first preset time threshold can be chosen according to actual needs and adjusted as circumstances change;
Among the warehoused log files that have been written to disk and warehoused in the distributed file system, some are cleaned out because the log data they contain does not meet the specification or requirements, and the cleaned-out data may be processed in other ways. The total number of warehoused log files therefore includes the number of valid log files and the number of invalid log files, where the invalid log files are the data that was cleaned out. Accordingly, in some optional embodiments, the warehoused log files include valid log files and invalid log files, and the warehoused log file count is the sum of the number of valid log files and the number of invalid log files. In this way, the data balance verification result is not distorted by leaving the cleaned-out invalid log files uncounted. Optionally, the valid log files and invalid log files are obtained through parsing by the distributed file system;
As an optional implementation of step 104, the following steps may be included:
Step 205: calculate the ratio of the short log file count to the warehoused log file count within the first preset time threshold;
Step 206: judge whether the ratio falls within the preset ratio threshold range;
Ideally, the preset ratio threshold is 1, that is, the short log file count and the warehoused log file count must be equal. In modern networks, however, a large amount of log data is produced every day, so under normal conditions some of the warehoused log files may have lost data during parsing or reading. The preset ratio threshold range therefore refers to a preset range of ratios within which the data is considered balanced, for example 0.97 to 1. This tolerates the normal loss of a portion of the data without reporting a data imbalance;
Step 207: if the ratio falls within the preset ratio threshold range, judge that the log volume within the first preset time threshold is balanced;
Step 208: if the ratio does not fall within the preset ratio threshold range, judge that the log volume within the first preset time threshold is unbalanced;
Implementing step 104 through steps 205 to 208 ensures the correctness of the data balance verification while tolerating the normal loss of a small amount of data, so that minor data loss does not affect the verification result.
As the above embodiment shows, the data balance verification method provided by the embodiments of the present invention adds a short log file, counts the received log data by counting short log files, counts the warehoused log files after the full log files have been written to disk and warehoused, and completes data balance verification from the two counts. When the received log data is counted, only its quantity is needed and its actual content need not be analyzed, so only the short log files need to be counted, not the full log files. Because the received log data is counted via the short log files, data balance verification consumes fewer system resources and takes less time to count, and when the log data volume is very large a substantial amount of time and resources is saved.
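The ratio-based check of steps 205 to 208, combined with counting valid and invalid warehoused files together, can be sketched as below. This is a minimal sketch under stated assumptions: the ratio is taken as warehoused count divided by short count, which matches the example range of 0.97 to 1 where loss occurs on the warehoused side; the default bounds come from that example; the function name and the handling of an empty window are hypothetical.

```python
def is_balanced(short_count, valid_count, invalid_count, low=0.97, high=1.0):
    """Return True if the warehoused/short ratio lies within [low, high].

    The warehoused count is the sum of valid and invalid (cleaned-out)
    files, so cleaning alone does not make the window look unbalanced.
    """
    warehoused = valid_count + invalid_count
    if short_count == 0:
        # Assumption: an empty window is balanced only if nothing was warehoused.
        return warehoused == 0
    ratio = warehoused / short_count
    return low <= ratio <= high
```

For instance, `is_balanced(1000, 960, 20)` gives a ratio of 0.98 and is judged balanced, whereas `is_balanced(1000, 900, 20)` gives 0.92 and is judged unbalanced.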
In a third aspect of the embodiments of the present invention, an embodiment of a data balance verification device that reduces the system resources consumed during data balance verification is provided. Fig. 3 is a schematic diagram of the module structure of an embodiment of the data balance verification device provided by the present invention.
The data balance verification device includes:
a log file generation module 301, configured to receive log data and generate a full log file (which contains all the required log information) and a short log file according to the configuration file; the configuration file includes short log file generation information, and the short log file contains less log information than the full log file; the full log file may be the log file that the system records under normal circumstances, containing all the log information that a conventional log file is required to have;
Optionally, the embodiment of the present invention targets an offline analysis architecture and is applied to Nginx. The configuration file here can be Nginx's own configuration file, to which the short log file generation information is added. Nginx (pronounced "engine x") is a high-performance HTTP (HyperText Transfer Protocol) and reverse proxy server, and also an IMAP (Internet Mail Access Protocol)/POP3 (Post Office Protocol version 3)/SMTP (Simple Mail Transfer Protocol) proxy server. As a load-balancing server, Nginx can directly support Rails (a full framework for developing database-driven web applications) and PHP (PHP: Hypertext Preprocessor) programs internally to serve external requests, and it can also serve external requests as an HTTP proxy server;
The Nginx configuration file contains many configuration items, one of which sets the log format of the log file (access_log). In that setting, pv and sm are preset log formats: the pv log format contains a large amount of data, whereas the sm log format contains very little;
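As a rough illustration only, the following Python sketch renders what such an Nginx configuration fragment might look like. The format names pv and sm, the file names con.log and cons.log, and the time_local and status fields come from the description; the remaining fields chosen for pv, the function name render_log_config and the default directory are assumptions made for this example.

```python
def render_log_config(log_dir: str = "/logs/con") -> str:
    """Return a hypothetical nginx fragment with a full (pv) and a short (sm) access log."""
    lines = [
        "# pv: full format; the exact field list is an assumption",
        "log_format pv '$remote_addr [$time_local] \"$request\" $status '",
        "              '$body_bytes_sent \"$http_referer\" \"$http_user_agent\"';",
        "# sm: short format, just enough to tell entries apart",
        "log_format sm '$time_local';",
        # Two access_log directives: every request is written to both files.
        f"access_log {log_dir}/con.log pv;",
        f"access_log {log_dir}/cons.log sm;",
    ]
    return "\n".join(lines)


print(render_log_config())
```

Writing both logs from the same directive level means each request produces one entry in each file, which is what makes the later comparison of the two counts meaningful.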
Optionally, with the configuration of step 101, rotation can be performed at regular intervals. After each rotation, the server reloads (reload) the configuration file and generates two log files under the /log/con directory of the server: the full log file con.log (corresponding to the pv log format; the actual name of a full log file is the name that con.log is renamed to, e.g. con.20160512-0110.log) and the short log file cons.log (corresponding to the sm log format; likewise, the actual name of a short log file is the renamed name, e.g. cons.20160512-0110.log). When the data volume is large, each log file can store many log entries. Each entry in the full log file con.log contains a large amount of data, whereas each entry in the short log file cons.log stores only some data used to distinguish different log entries, for example the time at which the corresponding log data was received. The larger the log volume, the more pronounced the difference between the time and resources consumed in counting the full log file and in counting the short log file;
Here, rotation refers to log rotation, which simply means renaming the existing log file and then recreating an empty log file with the original name;
For example, suppose the configuration file specifies that received data is stored in the two files /logs/con/con.log and /logs/con/cons.log. Once set, this configuration does not change for a period of time; if no log rotation is performed, all data received by the server is stored in these two files, and the log files grow ever larger over time;
So that log files can be processed in time, the file receiving the logs is generally renamed after a period of time (depending on the situation, this may be hourly, daily or weekly; optionally, it is set to 10 minutes). Taking con.log as an example, every 10 minutes con.log is renamed (e.g. to con.20160512-0110.log) and a new empty log file con.log is created. Because the configuration file specifies that received data is stored in con.log, data newly received by the server is still written to the con.log file; after rotation, the data in con.20160512-0110.log can be used for subsequent operations, namely: write to HDFS -> file parsing -> mount to Hive;
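The rename-then-recreate step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: the helper names rotate and demo are hypothetical, the timestamp suffix follows the example name con.20160512-0110.log, and the step that tells the server to reopen its log (the description reloads the configuration) is omitted.

```python
import os
import tempfile
from datetime import datetime


def rotate(log_path: str, now: datetime) -> str:
    """Rename log_path to <stem>.<YYYYMMDD-HHMM>.log and recreate it empty."""
    directory, name = os.path.split(log_path)
    stem, ext = os.path.splitext(name)  # e.g. "con", ".log"
    rotated = os.path.join(directory, f"{stem}.{now:%Y%m%d-%H%M}{ext}")
    os.rename(log_path, rotated)        # existing entries move to the rotated file
    open(log_path, "w").close()         # fresh empty file under the original name
    return rotated


def demo():
    """Rotate a small temporary con.log and report what happened."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "con.log")
        with open(path, "w") as f:
            f.write("entry 1\nentry 2\n")
        rotated = rotate(path, datetime(2016, 5, 12, 1, 10))
        return (os.path.basename(rotated),
                os.path.getsize(rotated),
                os.path.getsize(path))


print(demo())
```

The rotated file keeps the old entries and can be handed to the downstream steps, while the recreated con.log starts at zero bytes.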
A short log counting module 302, configured to count, according to the short log file, the quantity of short log files received within a first preset time threshold;
The first preset time threshold may be the time period in which data balance verification needs to be performed (for example, a certain period of the day in which the collected data gives the best verification result), or the time span over which data should be collected to meet the needs of data balance verification (for example, performing data balance verification once every 2 hours gives the best result); the first preset time threshold can be selected according to actual needs and adjusted as the actual situation changes;
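One way to count the short log files that fall within the first preset time threshold is to read the timestamp embedded in each rotated file name. The sketch below assumes the naming pattern cons.YYYYMMDD-HHMM.log taken from the example name cons.20160512-0110.log; the function name count_short_logs and the half-open window [start, end) are choices made for this illustration.

```python
import re
from datetime import datetime
from typing import Iterable

# Matches rotated short log files only, e.g. "cons.20160512-0110.log".
ROTATED_SHORT = re.compile(r"^cons\.(\d{8}-\d{4})\.log$")


def count_short_logs(names: Iterable[str], start: datetime, end: datetime) -> int:
    """Count rotated short log files whose timestamp lies in [start, end)."""
    count = 0
    for name in names:
        match = ROTATED_SHORT.match(name)
        if match is None:
            continue  # skips full logs and the live cons.log
        stamp = datetime.strptime(match.group(1), "%Y%m%d-%H%M")
        if start <= stamp < end:
            count += 1
    return count


names = [
    "cons.20160512-0100.log",
    "cons.20160512-0110.log",
    "con.20160512-0110.log",   # full log, ignored
    "cons.log",                # live file, ignored
    "cons.20160512-0300.log",  # outside the window
]
n = count_short_logs(names, datetime(2016, 5, 12, 1, 0), datetime(2016, 5, 12, 3, 0))
print(n)
```

Only file names are inspected here, so the count does not require opening any log file.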
A warehoused file obtaining module 303, configured to write the full log file into a distributed file system and parse it to obtain a warehoused log file;
Here, after the server receives the data, the data is stored on the server's disk; after landing on the server, the full log file is stored into the distributed file system HDFS, giving a landed log file in seq file format; the landed log file in seq file format is passed through a parsing program to obtain an RC file, which is mounted to Hive to complete warehousing, giving the warehoused log file;
Specifically, the landing process of the full log file (i.e. the process of writing the full log file to HDFS) may include the following steps: after the server receives log data, the data is temporarily stored in the log file con.log, which is rotated every ten minutes; after rotation, that log file is renamed to another file (this renamed file is the full log file to be landed, e.g. con.20160512-0110.log), and at the same time the configuration file is reloaded (reload) to generate a new log file con.log (although it has the same name as the previous con.log, the previous log file became a full log file once it was renamed, so the regenerated con.log is a brand-new empty log file, which continues to temporarily store the new log content); the full log file obtained after renaming can then be stored to HDFS using a program (for example glume, a program similar to Flume), which completes the process from receiving the log data, through landing it locally, to writing it to HDFS. Flume is a highly available, highly reliable, distributed system for collecting, aggregating and transmitting massive logs; Flume supports customizing various data senders in a logging system for collecting data, and also provides the ability to perform simple processing on data and write it to various (customizable) data receivers;
A warehoused file counting module 304, configured to count, according to the warehoused log file, the quantity of warehoused log files within the first preset time threshold; optionally, the quantity of warehoused log files is obtained by counting from Hive;
A balance verification module 305, configured to verify, according to the short log file quantity and the warehoused log file quantity, whether the log volume within the first preset time threshold is balanced;
Optionally, whether the log volume within the first preset time threshold is balanced may be verified by judging whether the short log file quantity and the warehoused log file quantity are equal: if they are equal, the log volume is balanced; if they are not equal, the log volume is unbalanced.
As can be seen from the above embodiment, the data balance verification device provided by the embodiment of the present invention adds a short log file, counts the received log data by counting the short log files, counts the warehoused log files after the full log files have been landed on disk and warehoused, and completes data balance verification from the two counts. When counting the received log data, only the quantity of received log data needs to be counted and its actual content need not be analyzed, so only the short log files need to be counted, not the full log files. The received log data is thus counted by counting the short log files, which reduces system resource usage and shortens the counting time during data balance verification; when the volume of log data is very large, substantial time and resources can be saved.
Optionally, in some embodiments, the log information contained in the short log file is the local time (time_local) at which the log data was generated, or the status (status); these two items of data occupy few resources and also allow log data to be preliminarily distinguished, which facilitates counting.
Among the warehoused log files that are landed and warehoused via the distributed file system, some may be cleaned out because the log data in them does not meet the specification or requirements, and the cleaned-out data may undergo other processing; the total quantity of landed and warehoused log files therefore comprises the quantity of valid log files and the quantity of invalid log files, the invalid log files being the data that was cleaned out. Accordingly, in some optional embodiments, the warehoused log file includes valid log files and invalid log files, and the warehoused log file quantity is the sum of the quantity of valid log files and the quantity of invalid log files. In this way, the data balance verification result is not affected by cleaned-out invalid log files going uncounted. Optionally, the valid log files and the invalid log files are obtained through parsing in the distributed file system.
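The effect of including invalid log files in the warehoused count can be seen in a small worked example. The numbers and the class name WarehouseCounts below are invented purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class WarehouseCounts:
    valid: int    # log files kept after cleaning
    invalid: int  # log files cleaned out as non-conforming

    @property
    def total(self) -> int:
        """Warehoused quantity = valid quantity + invalid quantity."""
        return self.valid + self.invalid


short_log_count = 1000  # hypothetical count taken from the short log files
counts = WarehouseCounts(valid=950, invalid=50)

# Comparing only the valid files would wrongly suggest that data was lost.
print(counts.valid == short_log_count)
# Comparing the sum shows that everything received was accounted for.
print(counts.total == short_log_count)
```

Cleaning is an intended processing step rather than data loss, which is why the cleaned-out files still belong in the total.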
Preferably, in some optional embodiments, the balance verification module 305 is specifically configured to:
Calculate the ratio of the warehoused log file quantity to the short log file quantity within the first preset time threshold;
Judge whether the ratio is within a preset ratio threshold range;
Ideally, the preset ratio threshold range is 1, i.e. the short log file quantity and the warehoused log file quantity must be equal; however, in modern networks a large amount of log data is generated every day, so under normal circumstances a portion of the warehoused log files may suffer data loss or failed reads after passing through the parsing program; the preset ratio threshold range therefore refers to a preset range of ratios within which the data can be verified as balanced, for example 0.97 to 1, which allows the normal loss of a portion of the data without reporting a data imbalance;
If the ratio is within the preset ratio threshold range, the log volume within the first preset time threshold is judged to be balanced;
If the ratio is not within the preset ratio threshold range, the log volume within the first preset time threshold is judged to be unbalanced.
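The ratio test above can be sketched in a few lines of Python. The range 0.97 to 1 is the example given in the description; the function name is_balanced, the parameter names and the handling of a zero short log count are assumptions made for this sketch.

```python
def is_balanced(short_count: int, warehoused_count: int,
                low: float = 0.97, high: float = 1.0) -> bool:
    """Return True if warehoused_count / short_count lies within [low, high]."""
    if short_count <= 0:
        # Assumption: with nothing received there is no ratio to judge.
        raise ValueError("short_count must be positive")
    ratio = warehoused_count / short_count
    return low <= ratio <= high


print(is_balanced(1000, 1000))  # ratio 1.0
print(is_balanced(1000, 985))   # ratio 0.985, a small tolerated loss
print(is_balanced(1000, 960))   # ratio 0.96, below the example range
```

Setting low and high both to 1 reduces the check to the strict equality test of the basic embodiment.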
The above embodiment ensures the correctness of data balance verification while allowing a normal small loss of data, so that a small amount of data loss does not affect the verification result.
Optionally, in some embodiments, the configuration file further includes a second preset time threshold, and the log file generation module 301 is specifically configured to:
Load the configuration file according to the second preset time threshold, and generate the log data into a full log file and a short log file. Here, loading the configuration file once generates one full log file con.log and one short log file cons.log, which store the corresponding log data; every second preset time threshold, the full log file con.log and the short log file cons.log are renamed and stored separately, after which the configuration file is loaded again and a new full log file con.log and a new short log file cons.log are generated. Repeating this cycle produces multiple full log files and short log files within the first preset time threshold, whose quantities can then be counted. The second preset time threshold is less than the first preset time threshold and can be set as required, for example 5 to 10 minutes; when the volume of log data is large, the second preset time threshold can be shortened appropriately.
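To illustrate how the two thresholds relate, the sketch below lists the rotated short log file names that would be expected in one verification window. The 2-hour window and 10-minute interval are example values from the description; the function name expected_rotations and the naming pattern are assumptions for this illustration.

```python
from datetime import datetime, timedelta
from typing import List


def expected_rotations(start: datetime, first: timedelta, second: timedelta) -> List[str]:
    """List rotated short log names produced every `second` within [start, start + first)."""
    if not timedelta(0) < second < first:
        raise ValueError("second threshold must be positive and shorter than the first")
    names = []
    t = start
    while t < start + first:
        names.append(f"cons.{t:%Y%m%d-%H%M}.log")
        t += second
    return names


files = expected_rotations(datetime(2016, 5, 12, 1, 0),
                           first=timedelta(hours=2),
                           second=timedelta(minutes=10))
print(len(files), files[0], files[-1])
```

A shorter second threshold yields more, smaller files in the same window, which matches the suggestion to shorten it when the log volume is large.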
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples; within the idea of the present invention, the technical features of the above embodiments or of different embodiments may also be combined, the steps may be implemented in any order, and there are many other variations of the different aspects of the present invention as described above, which are not provided in detail for the sake of brevity.
In addition, to simplify the description and discussion, and so as not to obscure the present invention, well-known power/ground connections to integrated circuit (IC) chips and other components may or may not be shown in the accompanying drawings. Furthermore, devices may be shown in block diagram form to avoid obscuring the present invention, which also takes into account the fact that details of the implementation of such block diagram devices are highly dependent on the platform on which the present invention is to be implemented (i.e., these details should be well within the understanding of those skilled in the art). Where specific details (e.g., circuits) are set forth to describe exemplary embodiments of the present invention, it will be apparent to those skilled in the art that the present invention can be practiced without these specific details or with variations of them. Accordingly, these descriptions are to be regarded as illustrative rather than restrictive.
Although the present invention has been described in conjunction with specific embodiments, many alternatives, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the embodiments discussed.
The embodiments of the present invention are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.
Claims (10)
1. A data balance verification method, characterized by comprising:
receiving log data, and generating a full log file and a short log file according to the configuration file; the configuration file includes short log file generation information, and the log information contained in the short log file is less than the log information contained in the full log file;
counting, according to the short log file, the quantity of short log files received within a first preset time threshold;
writing the full log file into a distributed file system and parsing it to obtain a warehoused log file;
counting, according to the warehoused log file, the quantity of warehoused log files within the first preset time threshold;
verifying, according to the short log file quantity and the warehoused log file quantity, whether the log volume within the first preset time threshold is balanced.
2. The method according to claim 1, characterized in that the log information contained in the short log file is the local time at which the log data was generated, or the status.
3. The method according to claim 1, characterized in that the warehoused log file includes valid log files and invalid log files; the warehoused log file quantity is the sum of the quantity of valid log files and the quantity of invalid log files.
4. The method according to claim 1, characterized in that the step of verifying, according to the short log file quantity and the warehoused log file quantity, whether the log volume within the first preset time threshold is balanced comprises:
calculating the ratio of the warehoused log file quantity to the short log file quantity within the first preset time threshold;
judging whether the ratio is within a preset ratio threshold range;
if the ratio is within the preset ratio threshold range, judging that the log volume within the first preset time threshold is balanced;
if the ratio is not within the preset ratio threshold range, judging that the log volume within the first preset time threshold is unbalanced.
5. The method according to any one of claims 1-4, characterized in that the configuration file further includes a second preset time threshold, and the step of generating a full log file and a short log file according to the configuration file comprises:
loading the configuration file according to the second preset time threshold, and generating the log data into a full log file and a short log file.
6. A data balance verification device, characterized by comprising:
a log file generation module, configured to receive log data and to generate a full log file and a short log file according to the configuration file; the configuration file includes short log file generation information, and the log information contained in the short log file is less than the log information contained in the full log file;
a short log counting module, configured to count, according to the short log file, the quantity of short log files received within a first preset time threshold;
a warehoused file obtaining module, configured to write the full log file into a distributed file system and parse it to obtain a warehoused log file;
a warehoused file counting module, configured to count, according to the warehoused log file, the quantity of warehoused log files within the first preset time threshold;
a balance verification module, configured to verify, according to the short log file quantity and the warehoused log file quantity, whether the log volume within the first preset time threshold is balanced.
7. The device according to claim 6, characterized in that the log information contained in the short log file is the local time at which the log data was generated, or the status.
8. The device according to claim 6, characterized in that the warehoused log file includes valid log files and invalid log files; the warehoused log file quantity is the sum of the quantity of valid log files and the quantity of invalid log files.
9. The device according to claim 6, characterized in that the balance verification module is specifically configured to:
calculate the ratio of the warehoused log file quantity to the short log file quantity within the first preset time threshold;
judge whether the ratio is within a preset ratio threshold range;
if the ratio is within the preset ratio threshold range, judge that the log volume within the first preset time threshold is balanced;
if the ratio is not within the preset ratio threshold range, judge that the log volume within the first preset time threshold is unbalanced.
10. The device according to any one of claims 6-9, characterized in that the configuration file further includes a second preset time threshold, and the log file generation module is specifically configured to:
load the configuration file according to the second preset time threshold, and generate the log data into a full log file and a short log file.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610393585.7A CN106095870A (en) | 2016-06-06 | 2016-06-06 | Data balancing verification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610393585.7A CN106095870A (en) | 2016-06-06 | 2016-06-06 | Data balancing verification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106095870A true CN106095870A (en) | 2016-11-09 |
Family
ID=57447288
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610393585.7A Pending CN106095870A (en) | 2016-06-06 | 2016-06-06 | Data balancing verification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106095870A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108667680A (en) * | 2017-10-30 | 2018-10-16 | 上海幻电信息科技有限公司 | A kind of monitoring system and method for multilink real time data steaming transfer |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004045155A1 (en) * | 2002-11-14 | 2004-05-27 | Huawei Technologies Co., Ltd. | Network traffic statistics method of ip device |
CN101364898A (en) * | 2008-09-22 | 2009-02-11 | 中国联合通信有限公司 | Method and system for flow balance checking |
CN102004971A (en) * | 2010-12-27 | 2011-04-06 | 用友软件股份有限公司 | Metering method and system for ERP (Enterprise Resource Planning) system |
CN102486795A (en) * | 2010-12-03 | 2012-06-06 | 中国移动通信集团陕西有限公司 | Method and device for inspecting balance of dynamic file |
CN105631026A (en) * | 2015-12-30 | 2016-06-01 | 北京奇艺世纪科技有限公司 | Security data analysis system |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108667680A (en) * | 2017-10-30 | 2018-10-16 | 上海幻电信息科技有限公司 | A kind of monitoring system and method for multilink real time data steaming transfer |
CN108667680B (en) * | 2017-10-30 | 2020-11-24 | 上海幻电信息科技有限公司 | Monitoring system and method for multilink real-time data stream transmission |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106293892B (en) | Distributed stream computing system, method and apparatus | |
Power et al. | The inner structure of ΛCDM haloes—I. A numerical convergence study | |
CN104185840B (en) | It is used for being prioritized the mthods, systems and devices of multiple tests in lasting deployment streamline | |
CN103401698B (en) | For the monitoring system that server health is reported to the police in server set group operatione | |
CN103530347B (en) | A kind of Internet resources method for evaluating quality based on big data mining and system | |
CN103473672A (en) | System, method and platform for auditing metadata quality of enterprise-level data center | |
Zoldan et al. | Structural and dynamical properties of galaxies in a hierarchical Universe: sizes and specific angular momenta | |
CN104426713A (en) | Method and device for monitoring network site access effect data | |
CN105939234A (en) | Data monitoring method and device | |
Jirka et al. | A lightweight approach for the sensor observation service to share environmental data across Europe | |
Li et al. | Toward smart distribution management by integrating advanced metering infrastructure | |
CN109800259A (en) | Collecting method, device and terminal device | |
CN110795305A (en) | System, apparatus and method for processing and managing WEB traffic data | |
CN107451058A (en) | A kind of software development methodology and device | |
CN106339321A (en) | Method and device for testing performance of application | |
CN111177193A (en) | Flink-based log streaming processing method and system | |
Hilty et al. | The role of ICT in energy consumption and energy efficiency | |
Barber Jr et al. | Economic performance assessment for the construction industry in the southeastern United States | |
Ducruet et al. | Spatial network analysis of container port operations: the case of ship turnaround times | |
CN107480056A (en) | A kind of method for testing software and device | |
CN106095870A (en) | Data balancing verification method and device | |
CN109726988A (en) | A kind of flow engine call method, device, electronic equipment and readable storage medium storing program for executing | |
CN103902447A (en) | Distributed system testing method and device | |
CN109614380A (en) | Log processing method, system, computer equipment and readable medium | |
CN106326280A (en) | Data processing method, apparatus and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20161109 |