WO2019153735A1 - Data processing method, apparatus, and system - Google Patents

Data processing method, apparatus, and system

Info

Publication number
WO2019153735A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
target type
aggregation
target
group number
Prior art date
Application number
PCT/CN2018/104530
Other languages
English (en)
French (fr)
Inventor
胡洋
张赞
李泽敏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2019153735A1 (zh)
Priority to US16/990,640 (US20200372039A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2291 User-Defined Types; Storage management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • H04L67/63 Routing a service request depending on the request content or context

Definitions

  • The present invention relates to the field of computer technologies, and in particular to a data processing method, apparatus, and system.
  • The statistical regularities of data can be applied to monitoring and analysis.
  • For example, statistics on the CPU (Central Processing Unit) usage of each server in an equipment room can be used to monitor and analyze how the servers are running, and statistics on the precipitation in each region can be used to monitor and analyze meteorological changes in those regions.
  • Likewise, statistics on the grades of every student in a city can be used to monitor and analyze the city's education situation, and statistics on the wages of all citizens in a country in a given year can be used to monitor and analyze national living standards for that year.
  • Monitoring data can be stored at random across multiple storage servers, but when the data volume is large this wastes storage resources. The data can therefore be statistically processed, and the resulting aggregated data stored instead, to reduce storage overhead.
  • Statistical methods generally include computing the maximum, minimum, average, sum, and count. The maximum, minimum, sum, count, and so on computed over the large amount of data collected during a period of time form the aggregated data for that period.
  • The aggregated data reflects the statistical regularities of the data, so the original data is no longer needed when monitoring and analyzing.
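As a rough illustration of what "aggregated data" for one period might look like, the following minimal sketch reduces a list of values to a maximum, minimum, sum, and count, and derives the average from the last two. The `PeriodSummary` class, the `summarize` function, and the sample CPU usage values are hypothetical names and data chosen for this example; they do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodSummary:
    """Hypothetical container for the aggregated data of one period."""
    maximum: float
    minimum: float
    total: float
    count: int

    @property
    def average(self) -> float:
        # The average is derived from the stored sum and count.
        return self.total / self.count


def summarize(values):
    """Reduce the values collected in one period to a PeriodSummary."""
    values = list(values)
    if not values:
        raise ValueError("cannot aggregate an empty period")
    return PeriodSummary(max(values), min(values), sum(values), len(values))


# Illustrative CPU usage samples (percent) collected during one period.
cpu_usage = [35.0, 42.0, 38.0, 51.0]
summary = summarize(cpu_usage)
```

Storing only `summary` in place of `cpu_usage` is the storage saving the text describes: four numbers replace however many samples the period contained.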
  • The computing server can retrieve the data of the same type from each storage server over the network and then perform statistical processing on the retrieved data to obtain aggregated data.
  • However, the computing server must wait for each storage server to transmit its data, which lengthens the time from the trigger to the end of statistical processing and therefore reduces the efficiency of statistical processing.
  • Embodiments of the present invention provide a data processing method, apparatus, and system.
  • The technical solution is as follows:
  • In a first aspect, a data processing method for a distribution server is provided. The method comprises: acquiring original data, where the original data includes a parameter value and at least one attribute value; determining a target type to which the original data belongs, where the target type includes attribute values from among the at least one attribute value; determining, according to the target type, the target computing server to which the original data belongs; and sending a data storage request to the target computing server, where the data storage request carries the original data.
  • In this solution, when the distribution server obtains original data, it can distribute the original data to the target computing server according to the target type of the original data.
  • The distribution server may acquire original data of the target type periodically.
  • The distribution server may determine, from the target type of the original data, the target computing server to which the data should be distributed, and then send that server a data storage request carrying the original data. In this way, original data of the same type is distributed to the same computing server.
  • When the computing server performs statistical processing, the data that the computation depends on is already stored on that computing server, so it no longer has to wait for other servers to transmit data, which improves the efficiency of statistical processing.
  • Optionally, determining the target computing server to which the original data belongs according to the target type includes: determining the group number of the target group corresponding to the target type, and, according to a preset correspondence between groups and computing servers, determining the computing server corresponding to the target group as the target computing server to which the original data belongs. The data storage request also carries the group number of the target group.
  • In this solution, the target group of the original data can be calculated from the target type of the original data.
  • The distribution server can then use the preset correspondence between groups and computing servers to determine the computing server corresponding to the target group; that server is the target computing server to which original data of the target type belongs.
  • The group number of the target group can also be added to the data storage request for the original data.
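The lookup from group number to computing server, and the request that carries both the group number and the original data, can be sketched as follows. This is only an illustration under stated assumptions: the `GROUP_TO_SERVER` table, the server names, the `DataStorageRequest` class, the `build_request` function, and the sample record are all hypothetical, and the group number is taken as already computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStorageRequest:
    """Hypothetical request body: the group number plus the original data."""
    group_number: int
    original_data: dict


# Assumed preset correspondence between group numbers and computing servers.
GROUP_TO_SERVER = {0: "compute-a", 1: "compute-a", 2: "compute-b", 3: "compute-b"}


def build_request(original_data, group_number):
    """Return the target computing server and the request to send to it."""
    try:
        server = GROUP_TO_SERVER[group_number]
    except KeyError:
        raise ValueError(f"no computing server configured for group {group_number}") from None
    return server, DataStorageRequest(group_number, original_data)


# Illustrative record; the attribute names and values are assumptions.
record = {"class": "Class 1", "name": "Zhang San", "subject": "Chinese", "score": 90}
server, request = build_request(record, 2)
```

Because the table is keyed by group rather than by type, several groups can share one computing server, and moving a group only requires changing one table entry.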
  • Optionally, determining the group number of the target group corresponding to the target type includes: calculating the group number of the target group corresponding to original data of the target type based on the attribute values included in the target type.
  • In this solution, the target type can be converted into a corresponding identifier string, and the group number of the target group corresponding to original data of the target type can be calculated from that string.
  • The identifier string uniquely represents the target type, so different types of original data can yield different group numbers.
  • Optionally, calculating the group number of the target group corresponding to the target type based on the attribute values included in the target type includes: determining, for each character in the attribute values included in the target type, its code in a preset encoding type; calculating a feature code corresponding to the target type based on each determined code and a preset calculation function; and performing a remainder operation on the feature code and the total number of groups, and taking the resulting remainder as the group number of the target group corresponding to the target type.
  • In this solution, the distribution server may convert the original data into a first data tuple in a unified format, convert each attribute into a string, convert each character into a code of the preset encoding type, and use the preset calculation function to calculate a feature code that identifies the target type. Dividing the feature code by the total number of groups yields a remainder, and the remainders correspond one-to-one with the group numbers. The remainder can therefore be used directly as the group number of the target group corresponding to the target type, which simplifies the correspondence between remainders and group numbers.
  • Optionally, the preset calculation function includes one of the following functions or a combination of several of them: a sum function, a difference function, a product function, and a bitwise function.
  • In this solution, the feature code corresponding to the target type can be calculated with different preset calculation functions. Whichever calculation function is used, the resulting feature code serves to distinguish the target type from other types.
  • Optionally, the code of the preset encoding type is American Standard Code for Information Interchange (ASCII) code.
  • In this solution, each character has a unique corresponding ASCII code, so the ASCII codes of the characters in the string can be used to represent the target type.
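The group number calculation described above can be sketched as follows, assuming the preset calculation function is a plain sum of ASCII codes and that the identifier string is the attribute values concatenated in order. Both choices, and the function names `feature_code` and `group_number`, are assumptions made for this example; the text permits other calculation functions.

```python
def feature_code(attribute_values):
    """Sum the ASCII codes of every character in the type's identifier string."""
    identifier = "".join(str(value) for value in attribute_values)
    # encode("ascii") raises UnicodeEncodeError for non-ASCII characters,
    # so only genuine ASCII codes are ever summed.
    return sum(identifier.encode("ascii"))


def group_number(attribute_values, total_groups):
    """Take the feature code modulo the total number of groups."""
    if total_groups <= 0:
        raise ValueError("total_groups must be positive")
    return feature_code(attribute_values) % total_groups


# "a" is 97 and "b" is 98 in ASCII, so the feature code is 195 and 195 % 8 is 3.
example_group = group_number(["ab"], 8)
```

A plain sum is order-insensitive, so types whose identifier strings are anagrams of each other share a feature code and land in the same group. That only affects how evenly types spread across groups, since every record of a given type still maps to the same group.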
  • In a second aspect, a data processing method for a computing server is provided. The method comprises: receiving a data storage request sent by a distribution server, where the data storage request carries original data, the original data includes a parameter value and at least one attribute value, the original data belongs to a target type, and the target type includes attribute values from among the at least one attribute value; storing the original data of the target type; and, each time a preset aggregation period is reached, determining the aggregated data of the target type for the current aggregation period according to the original data of the target type received in the current aggregation period.
  • In this solution, the computing server can receive a data storage request from the distribution server at any time, obtain the original data carried in the request, and store it in memory.
  • Each time the preset aggregation period is reached, the computing server can read from memory the original data of the target type received in the current aggregation period, perform statistical processing on it, and calculate the aggregated data of the target type for the current aggregation period.
  • The computing server may receive more than one type of original data and can apply the above processing to each type to obtain the aggregated data of each type for the current aggregation period. The data that the statistical processing depends on no longer has to be transmitted over the network, which reduces network bandwidth usage.
  • Optionally, the data storage request further carries the group number of the target group, and the method further includes storing the group number of the target group corresponding to the target type. In that case, determining the aggregated data of the target type for the current aggregation period each time the preset aggregation period is reached includes: each time the preset aggregation period is reached, for each group number, determining the aggregated data of the target type for the current aggregation period according to the original data of the target type received in the current aggregation period corresponding to that group number.
  • In this solution, the computing server may also obtain the group number of the target group to which the original data belongs and store it in memory together with the original data.
  • Each time the preset aggregation period is reached, each process on the target computing server may read from memory the original data stored in the current aggregation period under the group number of the group that the process is responsible for. The original data of each type is then statistically processed with a user-defined aggregation function to obtain the aggregated data of each type for the current aggregation period.
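A minimal single-process sketch of this store-then-aggregate behaviour is shown below. The `ComputingServer` class, its method names, and the choice to clear the buffer at the end of each period are assumptions for illustration; the per-process parallelism mentioned in the text is omitted.

```python
from collections import defaultdict


class ComputingServer:
    """Hypothetical in-memory store keyed by group number, then by type."""

    def __init__(self, aggregate_fn):
        # aggregate_fn stands in for the user-defined aggregation function.
        self.aggregate_fn = aggregate_fn
        # group number -> type key -> parameter values received this period
        self.buffer = defaultdict(lambda: defaultdict(list))

    def store(self, group_number, type_key, value):
        """Handle one data storage request."""
        self.buffer[group_number][type_key].append(value)

    def on_period_end(self):
        """Aggregate every type in every group, then start a fresh period."""
        results = {
            group: {t: self.aggregate_fn(values) for t, values in types.items()}
            for group, types in self.buffer.items()
        }
        self.buffer.clear()
        return results


server = ComputingServer(max)
type_1 = ("Class 1", "Zhang San", "Chinese")
type_2 = ("Class 2", "Li Si", "Chinese")
server.store(3, type_1, 76)
server.store(3, type_1, 90)
server.store(3, type_2, 85)
period_result = server.on_period_end()
```

Everything `on_period_end` needs is already in `self.buffer`, which reflects the point made above that aggregation no longer waits on data from other servers.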
  • Optionally, the aggregation period includes multiple first-level sub-aggregation periods, and each i-th-level sub-aggregation period includes multiple (i+1)-th-level sub-aggregation periods, where i is greater than 1 and less than n, and n is a preset positive integer. In that case, determining, each time the preset aggregation period is reached and for each group number, the aggregated data of the target type for the current aggregation period according to the original data of the target type received in the current aggregation period corresponding to the group number includes the following steps.
  • Each time an n-th-level sub-aggregation period is reached, the original data corresponding to each group number received in the current n-th-level sub-aggregation period is obtained; for each group number, the original data of the target type among the original data obtained for that group number is statistically processed to obtain the aggregated data of the target type for the current n-th-level sub-aggregation period, and the group number corresponding to each piece of aggregated data is stored.
  • Each time an i-th-level sub-aggregation period is reached, the aggregated data of every (i+1)-th-level sub-aggregation period obtained within the current i-th-level sub-aggregation period is obtained for each group number; for each group number, the aggregated data of all the (i+1)-th-level sub-aggregation periods corresponding to that group number is statistically processed to obtain the aggregated data of the target type for the current i-th-level sub-aggregation period, and the group number corresponding to each piece of aggregated data is stored.
  • Each time the preset aggregation period is reached, the aggregated data of all the first-level sub-aggregation periods obtained within the current aggregation period is obtained for each group number; for each group number, the aggregated data of all the first-level sub-aggregation periods corresponding to that group number is statistically processed to obtain the aggregated data of the target type for the current aggregation period.
  • In this solution, each time an n-th-level sub-aggregation period is reached, statistical processing of the original data is triggered: each process uses the aggregation function to index all the data in its current group and statistically processes the original data of each type, obtaining the aggregated data of the target type for the current period. The aggregated data and the corresponding group number are stored in memory.
  • Each time an i-th-level sub-aggregation period is reached, statistical processing of all the (i+1)-th-level aggregated data in the current period is triggered; the aggregated data of the target type for the current period is obtained for each group, and the aggregated data and the corresponding group number are stored in memory.
  • When the preset aggregation period is reached, statistical processing of all the first-level aggregated data in the current period is triggered; the aggregated data of the target type for the current period is obtained for each group, and the aggregated data and the corresponding group number are stored in memory. In this way, the processing of the original data in the preset aggregation period is spread across the sub-aggregation periods, which reduces the amount of data computed at one time, shortens the computing server's processing time, and improves the efficiency of statistical processing.
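The multi-level scheme can be sketched for one type within one group as follows. The sketch assumes the aggregate is a (maximum, minimum, sum, count) record, which can be merged level by level without revisiting the original data, and that every level has the same fan-out. The names `Agg`, `from_raw`, `merge`, and `aggregate_levels`, and the sample batches, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agg:
    """Hypothetical mergeable aggregate for one type in one sub-period."""
    maximum: float
    minimum: float
    total: float
    count: int


def from_raw(values):
    """Lowest level: aggregate the original data of one sub-period."""
    return Agg(max(values), min(values), sum(values), len(values))


def merge(children):
    """Higher levels: combine child aggregates without the original data."""
    return Agg(
        max(c.maximum for c in children),
        min(c.minimum for c in children),
        sum(c.total for c in children),
        sum(c.count for c in children),
    )


def aggregate_levels(leaf_batches, fan_out):
    """Aggregate lowest-level batches, then merge fan_out children per level."""
    if not leaf_batches:
        raise ValueError("at least one sub-period is required")
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    level = [from_raw(batch) for batch in leaf_batches]
    while len(level) > 1:
        level = [merge(level[i:i + fan_out]) for i in range(0, len(level), fan_out)]
    return level[0]


# Four lowest-level sub-periods merged pairwise, as in a binary division.
batches = [[1, 5], [2, 8], [3, 3], [9, 4]]
result = aggregate_levels(batches, fan_out=2)
```

The average is deliberately not stored: averaging child averages is wrong when children hold different counts, whereas sum and count merge exactly and the average can be derived at the end.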
  • Optionally, the aggregation period includes m first-level sub-aggregation periods, and each i-th-level sub-aggregation period includes m (i+1)-th-level sub-aggregation periods, where m is a preset positive integer.
  • In this solution, the multiple between adjacent levels of aggregation period is the same, so the amount of data used in each statistical calculation is relatively balanced. As a result, the computing efficiency and memory usage of each computing server during data aggregation stay balanced and the data aggregation system runs smoothly.
  • Optionally, after the aggregated data of the current n-th-level sub-aggregation period is obtained, the original data corresponding to each group number received in the current n-th-level sub-aggregation period is deleted.
  • After the aggregated data of the current i-th-level sub-aggregation period is obtained, the aggregated data of all the (i+1)-th-level sub-aggregation periods corresponding to each group number obtained in the current i-th-level sub-aggregation period is deleted.
  • After the aggregated data of the current aggregation period is obtained, the aggregated data of all the first-level sub-aggregation periods corresponding to each group number obtained in the current aggregation period is deleted.
  • In this solution, the data from which the aggregated data was calculated is deleted to save memory.
  • a distribution server comprising at least one module for implementing the data processing method provided by the first aspect above.
  • a computing server comprising at least one module for implementing the data processing method provided by the second aspect above.
  • a data processing system comprising a distribution server and a computing server, wherein:
  • the distribution server is configured to: obtain original data, where the original data includes a parameter value and at least one attribute value; determine a target type to which the original data belongs, where the target type includes attribute values from among the at least one attribute value; determine, according to the target type, a target computing server to which the original data belongs; and send a data storage request to the target computing server, where the data storage request carries the original data;
  • the computing server is configured to: receive the data storage request sent by the distribution server, where the data storage request carries the original data, the original data includes a parameter value and at least one attribute value, the original data belongs to the target type, and the target type includes attribute values from among the at least one attribute value; store the original data of the target type; and, each time the preset aggregation period is reached, determine the aggregated data of the target type for the current aggregation period according to the original data of the target type received in the current aggregation period.
  • a distribution server comprising a processor and a memory, where the processor is configured to execute instructions stored in the memory and implements the data processing method provided by the first aspect by executing the instructions.
  • a computing server comprising a processor and a memory, where the processor is configured to execute instructions stored in the memory and implements the data processing method provided by the second aspect by executing the instructions.
  • a computer readable storage medium comprising instructions for causing a distribution server to perform the method of the first aspect when the computer readable storage medium is run on a distribution server.
  • a computer program product comprising instructions for causing a distribution server to perform the method of the first aspect when the computer program product is run on a distribution server.
  • a computer readable storage medium comprising instructions for causing a computing server to perform the method of the second aspect when the computer readable storage medium is run on a computing server.
  • a computer program product comprising instructions for causing a computing server to perform the method of the second aspect when the computer program product is run on a computing server.
  • In the embodiments of the present invention, the distribution server may determine the target computing server to which the original data belongs according to the target type, and then send the original data of the target type to it in a data storage request. The target computing server receives the data storage request sent by the distribution server and stores the original data of the target type.
  • Each time the preset aggregation period is reached, the aggregated data of each type for the current aggregation period is determined according to the original data of each type received in the current aggregation period. In this way, original data of the same type is distributed to the same computing server.
  • When the computing server performs statistical processing, the data that the computation depends on is already stored on that computing server, so it no longer has to wait for other servers to transmit data, which improves the efficiency of statistical processing.
  • FIG. 1 is a schematic diagram of a system framework provided by an embodiment of the present invention.
  • FIG. 2 is a schematic structural diagram of a distribution server according to an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a computing server according to an embodiment of the present invention.
  • FIG. 4 is a flowchart of a method for data aggregation according to an embodiment of the present invention.
  • FIG. 5 is a flowchart of a method for data aggregation according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a calculation group number according to an embodiment of the present invention.
  • FIG. 7 is a schematic diagram of an aggregation period division according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram of parallel processing according to an embodiment of the present invention.
  • FIG. 9 is a schematic diagram of a binary tree aggregation period division according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram of an apparatus for data aggregation according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram of an apparatus for data aggregation according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram of an apparatus for data aggregation according to an embodiment of the present invention.
  • the embodiment of the present invention provides a data processing method, which can be used in a data processing system.
  • The system includes at least a distribution server and a computing server; it can include multiple computing servers and one or more distribution servers.
  • A communication connection can be established between the distribution server and the computing server.
  • After acquiring original data from a data source, the distribution server can distribute original data of the same type to the same computing server, and can distribute the various types of original data across the computing servers.
  • The computing server can perform statistical processing on the original data to obtain aggregated data.
  • In a real deployment, the functions of the distribution server and the computing server can be implemented by the same server.
  • the server is a logical distribution server when executing the distribution process, and is a logical computing server when executing the calculation process.
  • The distribution server can include a processor 210, a transmitter 220, and a receiver 230; the receiver 230 and the transmitter 220 can each be coupled to the processor 210, as shown in FIG.
  • The receiver 230 can be used to receive messages or data, that is, to receive original data sent by other electronic devices.
  • The transmitter 220 can be used to send messages or data, that is, to send the acquired original data to each computing server.
  • The transmitter 220 and the receiver 230 can be network cards.
  • The processor 210 can be the control center of the server, connecting the various parts of the server, such as the receiver 230 and the transmitter 220, through various interfaces and lines.
  • The processor 210 may be a CPU, which can be used for the processing related to determining the target computing server to which the original data belongs.
  • The processor 210 may include one or more processing units. The processor 210 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system and the modem processor mainly handles wireless communication.
  • Processor 210 can also be a digital signal processor, an application specific integrated circuit, a field programmable gate array, or other programmable logic device or the like.
  • the server may also include a memory 240 that may be used to store software programs and modules, and the processor 210 performs various functional applications and data processing of the server by reading software code and modules stored in the memory.
  • The computing server can include a processor 310, a transmitter 320, and a receiver 330; the receiver 330 and the transmitter 320 can each be coupled to the processor 310, as shown in FIG.
  • The receiver 330 can be used to receive messages or data, that is, to receive original data sent by the distribution servers.
  • The transmitter 320 can be used to send messages or data.
  • The transmitter 320 and the receiver 330 can be network cards.
  • The processor 310 can be the control center of the server, connecting the various parts of the server, such as the receiver 330 and the transmitter 320, through various interfaces and lines.
  • The processor 310 may be a CPU, which can be used for the processing related to determining the aggregated data.
  • The processor 310 may include one or more processing units. The processor 310 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system and the modem processor mainly handles wireless communication.
  • Processor 310 can also be a digital signal processor, an application specific integrated circuit, a field programmable gate array, or other programmable logic device or the like.
  • the server may also include a memory 340 that may be used to store software programs and modules, and the processor 310 performs various functional applications and data processing of the server by reading software code and modules stored in the memory.
  • Step 401: the distribution server acquires the original data.
  • The original data is data provided to the distribution server by a data source device. It includes a parameter value and at least one attribute value; that is, the original data includes the parameter value to be counted and the attribute values associated with that parameter value.
  • A combination of attribute values of the original data can be used to indicate the type of the original data.
  • The target type is the type to which the original data currently acquired by the distribution server belongs, and the attribute values it includes are drawn from the at least one attribute value of the original data.
  • Original data of the same type is aggregated together, so in the subsequent processing of this solution, original data of the same type is stored on the same computing server for aggregation.
  • The technician can set the combination of attributes of the original data needed for the statistics. For example, the long-term trend of any student's score in any subject in any class can be monitored.
  • The original data can be as shown in Table 1 below, where each row corresponds to one piece of original data.
  • Class, name, and subject are attributes, and the score is the parameter. Class 1 and Class 2 are attribute values of the class attribute; Zhang San, Li Si, and Wang Liu are attribute values of the name attribute; Chinese and mathematics are attribute values of the subject attribute; and 90, 85, 100, and so on are parameter values of the score parameter.
  • (Class 1, Zhang San, Chinese) is one type, which can be called type 1; (Class 2, Li Si, Chinese) is another type, which can be called type 2; (Class 1, Zhang San, mathematics) is another type, which can be called type 3; and so on. Table 1 records the scores of only one exam. For each type, the scores of multiple exams can be collected and analyzed.
  • For example, if the Chinese scores of Zhang San of Class 1 in consecutive exams are 76, 79, 82, 86, 88, and 90, then the type 1 scores received during statistical processing are 76, 79, 82, 86, 88, and 90. Analyzing the type 1 data, that is, the Chinese scores of Zhang San of Class 1, shows that his Chinese is improving.
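The idea that a type is a chosen combination of attribute values can be sketched as below. The field names, the `type_key` function, and the Li Si record are illustrative assumptions; only the Zhang San score sequence is taken from the example in the text.

```python
from collections import defaultdict

# Assumed attribute combination chosen by the technician for this statistic.
TYPE_ATTRIBUTES = ("class", "name", "subject")


def type_key(record, attributes=TYPE_ATTRIBUTES):
    """Build the type of a record from the selected attribute values."""
    return tuple(record[attribute] for attribute in attributes)


# Zhang San's consecutive Chinese scores from the example in the text.
records = [
    {"class": "Class 1", "name": "Zhang San", "subject": "Chinese", "score": s}
    for s in (76, 79, 82, 86, 88, 90)
]
# One illustrative record of a different type (the score is an assumption).
records.append({"class": "Class 2", "name": "Li Si", "subject": "Chinese", "score": 85})

scores_by_type = defaultdict(list)
for record in records:
    scores_by_type[type_key(record)].append(record["score"])

type_1_scores = scores_by_type[("Class 1", "Zhang San", "Chinese")]
# A strictly rising sequence is one simple reading of "improving".
improving = all(a < b for a, b in zip(type_1_scores, type_1_scores[1:]))
```

Changing `TYPE_ATTRIBUTES` to `("class", "name")` or `("class",)` gives the coarser types used in the other examples in this description.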
  • As another example, the long-term trend of the total score of any student in any class can be monitored.
  • The original data can be as shown in Table 2 below, where each row corresponds to one piece of original data.
  • Class and name are attributes, and the total score is the parameter. Class 1 and Class 2 are attribute values of the class attribute; Zhang San, Li Si, and Wang Liu are attribute values of the name attribute; and 602, 586, and 627 are parameter values of the total score parameter.
  • (Class 1, Zhang San) is one type, which can be called type 4; (Class 2, Li Si) is another type, which can be called type 5; (Class 1, Wang Liu) is another type, which can be called type 6; and so on. Table 2 records the total scores of only one exam. For each type, the total scores of multiple exams can be collected and analyzed. For example, the total score of Zhang San of Class 1 in consecutive exams is 580.
  • the long-term trend of the average Chinese score of any class can be monitored.
  • the original data can be as shown in Table 3 below, where each row corresponds to one piece of original data.
  • class is the attribute;
  • average score is the parameter;
  • Class 1 and Class 2 are attribute values of the class attribute;
  • 90 and 85 are parameter values of the average score parameter.
  • Class 1 is one type, which can be called type 7; Class 2 is another type, which can be called type 8; and so on.
  • Only the average score of one Chinese exam is recorded in this table. The average scores of multiple Chinese exams can be collected and analyzed.
  • For example, if the average scores of Class 1 in multiple consecutive Chinese exams are 85, 80, 86, 90, 76, and 84, then the type 7 average scores obtained during statistics are 85, 80, 86, 90, 76, 84. Analyzing the type 7 data, that is, the average Chinese scores of Class 1, shows that the average Chinese score of Class 1 is at an excellent level.
  • the source of the original data may be diverse.
  • when the data used for monitoring is student grades, the original data may come from data stored in the cloud on the network side; when the data used for monitoring is the amount of precipitation, the original data may come from data sent by the monitoring device of each monitoring station; when the data used for monitoring is the CPU usage and memory usage of servers, the original data may come from the distribution server itself.
  • the type of the original data can be various.
  • the embodiment of the present invention takes the original data of one type (ie, the target type) as an example, and the processing processes of other types of original data are the same, and are not described again.
  • the distribution server may periodically acquire the raw data. For example, each server in the equipment room can collect CPU usage every 10 seconds, and then can send the collected CPU usage as raw data to the distribution server, and the distribution server can obtain the CPU usage of each server.
  • the format of the original data obtained by the distribution server may be text, RDD (Resilient Distributed Dataset), JSON (JavaScript Object Notation), and the like. Taking the monitoring of server CPU usage as an example, the original data may be "CPU usage of server 1 is 54%", where "server 1" and "CPU usage" are attribute values of the original data and "54%" is the parameter value of the original data.
  • the original data may be converted into a first data tuple data1(p_1, p_2, ..., p_s, d_1, ...), where:
  • p_i is the i-th attribute value in the original data;
  • d_j is the j-th parameter value in the original data;
  • the combination of all p_i in data1 can be used to indicate the type of the data.
  • In step 402, the distribution server determines the target type to which the original data belongs.
  • the distribution server may extract the attribute value of each required attribute (at least one) from the received original data to obtain the target type to which the original data belongs, and then extract the parameter value. Each extracted attribute value is assigned to a p_i of the first data tuple described above, and each extracted parameter value is assigned to a d_j.
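  • As a minimal sketch of step 402, the following Python code builds a first data tuple from one record and reads the target type from its attribute values. The names `DataTuple`, `to_data_tuple`, `target_type`, and the dictionary keys `server`, `metric`, and `value` are hypothetical and chosen only for illustration; the source does not prescribe a record layout.

```python
from typing import NamedTuple


class DataTuple(NamedTuple):
    """Illustrative stand-in for the first data tuple data1."""
    attributes: tuple  # p_1 ... p_s, the attribute values
    parameters: tuple  # d_1 ..., the parameter values


def to_data_tuple(record, attribute_keys, parameter_keys):
    """Extract the required attribute values and parameter values from a record."""
    attrs = tuple(str(record[key]) for key in attribute_keys)
    params = tuple(record[key] for key in parameter_keys)
    return DataTuple(attrs, params)


def target_type(data_tuple):
    """The combination of all attribute values indicates the type."""
    return data_tuple.attributes


# Hypothetical record for "CPU usage of server 1 is 54%".
raw = {"server": "server 1", "metric": "CPU usage", "value": 54.0}
data1 = to_data_tuple(raw, ["server", "metric"], ["value"])
```

  • Here `target_type(data1)` is the pair of attribute values, so two records with the same server and metric belong to the same type regardless of their parameter values.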
  • In step 403, the distribution server determines, according to the target type, the target computing server to which the original data belongs.
  • each time the distribution server obtains a piece of original data, it may determine, according to the target type of the original data, the target computing server to which the original data needs to be distributed.
  • in this way, original data of the same type is distributed to the same computing server. Network bandwidth is occupied only during distribution and not during statistics, which reduces the network transmission overhead of the calculation and shortens the duration of the entire data aggregation flow.
  • the manner of distributing the original data of the target type is not limited here.
  • optionally, the original data may be grouped so that the computing servers process the original data of different groups in parallel. The corresponding processing may be as follows: determine the group number of the target group corresponding to the target type and, according to the preset correspondence between groups and computing servers, determine the computing server corresponding to the target group as the target computing server to which the original data belongs.
  • the parallelism k is the number of processes that can execute simultaneously in the data aggregation system.
  • the parallelism k of the data aggregation system can be preset according to the total number of CPU cores of all computing servers. Generally, k equals 2 to 3 times the total number of CPU cores. For example, if there are 3 computing servers and the CPU of each computing server has 4 cores, there are 12 cores in total, so k can be set to 24.
  • the total number of groups of data may be k, and the groups may be numbered 0 to k-1 so that the k processes each process the data of one group.
  • the group numbers that each computing server is responsible for may be set randomly or according to a certain rule, which is not limited here. The group number and the identifier of the computing server can then be added to a correspondence table to establish the correspondence between groups and computing servers, and this correspondence is stored in the distribution server. For example, when computing server 2 is set to process the data of group 2 and group 3, the correspondence between group 2 and computing server 2 and the correspondence between group 3 and computing server 2 can be stored in the distribution server.
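  • The parallelism and the correspondence table can be sketched as follows. The round-robin assignment of groups to servers is an assumption made for this example (the source allows random assignment or any rule), and the names `parallelism`, `build_group_table`, and the `server-N` identifiers are hypothetical.

```python
def parallelism(num_servers, cores_per_server, factor=2):
    """k as a multiple (typically 2 to 3) of the total number of CPU cores."""
    return num_servers * cores_per_server * factor


def build_group_table(k, server_ids):
    """Map each group number 0..k-1 to a computing server.

    Round-robin is only one possible rule; it is an assumption of this sketch.
    """
    return {group: server_ids[group % len(server_ids)] for group in range(k)}


k = parallelism(num_servers=3, cores_per_server=4)  # 3 * 4 * 2
table = build_group_table(k, ["server-0", "server-1", "server-2"])
```

  • The distribution server would look up `table[group_number]` to find the target computing server for a piece of original data.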
  • the target group to which it belongs can be calculated according to the target type of the original data.
  • the distribution server may calculate the group number of the target group corresponding to the target type based on the attribute value included in the target type, as shown in FIG. 5, and the specific processing may be as follows:
  • In step 4031, the code of a preset coding type corresponding to each attribute value included in the target type is determined.
  • the code of the preset coding type may be an ASCII code, or may be a code based on a preset character-to-number mapping, such as a code based on SHA (Secure Hash Algorithm).
  • the distribution server may convert each p_i into a string, so that the multiple characters of the strings corresponding to the attribute values included in the target type are obtained.
  • the distribution server can then convert each character into the number of its ASCII code.
  • In step 4032, a feature code corresponding to the target type is calculated based on each determined code and a preset calculation function.
  • the ASCII code numbers of the characters determined in step 4031 are passed to the preset calculation function to obtain the feature code, which represents the target type.
  • the preset calculation function may include one of the following functions or a combination of several of them: a sum function, a difference function, a product function, and a bitwise AND function. As shown in the group number calculation diagram of FIG. 6, if the attribute values of the original data are "123" and "abc", each character can be converted into its ASCII code number: "1" corresponds to 49, "2" to 50, "3" to 51, "a" to 97, "b" to 98, and "c" to 99.
  • summing these numbers (49 + 50 + 51 + 97 + 98 + 99) gives the feature code S corresponding to the target type, which is 444.
  • In step 4033, a remainder operation is performed on the feature code and the total number of groups, and the obtained remainder is determined as the group number of the target group corresponding to the target type.
  • dividing the feature code by the total number of groups yields the corresponding remainder.
  • since the total number of groups is k and the groups are numbered 0 to k-1, the remainder also ranges from 0 to k-1 and corresponds one-to-one with the group numbers. Therefore, the obtained remainder can be used directly as the group number of the target group corresponding to the original data of the target type, which simplifies the correspondence between remainder and group number.
  • in the group number calculation shown in FIG. 6, the feature code S corresponding to the target type is 444 and S % k = 60; that is, the target group to which the original data of the target type belongs is group 60.
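  • Steps 4031 to 4033 can be sketched in Python as below, using the sum function as the preset calculation function. The function names are hypothetical. The value k = 64 is an assumption: the text does not state k for the FIG. 6 example, and 64 is merely one value for which 444 % k equals 60.

```python
def feature_code(attribute_values):
    """Steps 4031-4032: sum the character codes of every attribute value.

    ord() returns the ASCII code for ASCII characters; for other characters
    it returns the Unicode code point, which goes beyond the ASCII case
    described in the text.
    """
    return sum(ord(ch) for value in attribute_values for ch in str(value))


def group_number(attribute_values, k):
    """Step 4033: remainder of the feature code divided by the number of groups."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return feature_code(attribute_values) % k


S = feature_code(["123", "abc"])          # 49 + 50 + 51 + 97 + 98 + 99
group = group_number(["123", "abc"], 64)  # k = 64 is an assumed value
```

  • Because the group number depends only on the attribute values, every piece of original data of the same type lands in the same group and therefore on the same computing server.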
  • the distribution server may determine the target computing server corresponding to the target group according to the correspondence between the preset group and the computing server, and the target computing server is the target computing server to which the original data of the target type belongs.
  • the computing server to which each type of original data belongs can be determined according to the above procedure.
  • the computing servers to which different types of raw data belong may be the same or different, but they can still effectively reduce the amount of data that a process needs to process, thereby improving the efficiency of process processing.
  • In step 404, the distribution server sends a data storage request to the target computing server.
  • after determining the target computing server, the distribution server may send it a data storage request for storing the original data.
  • the data storage request carries the original data of the target type.
  • the distribution server occupies network bandwidth only when distributing the original data; the data that subsequent statistical processing depends on no longer needs to be transmitted over the network, which reduces network bandwidth usage.
  • the data storage request may also carry the group number of the target group to which the original data belongs.
  • the original data carried in the data storage request may also be the original data already converted into the first data tuple in the above process, for convenience of subsequent processing.
  • In step 405, the target computing server receives the data storage request sent by the distribution server.
  • the target computing server may receive the data storage request sent by the distribution server and obtain the original data carried in it.
  • the target computing server may also obtain the group number of the target group to which the original data belongs.
  • In step 406, the target computing server stores the original data of the target type.
  • the target computing server may store the obtained original data in memory for subsequent processing.
  • the target computing server may also store, in memory, the group number of the target group to which the original data belongs in correspondence with the original data.
  • the target computing server can receive data storage requests for original data at any time.
  • steps 405 and 406 are executed repeatedly within the aggregation period; step 407 is performed only when the aggregation period ends.
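  • One way to picture steps 405 and 406 is an in-memory store keyed first by group number and then by type. This is only a sketch under that assumed layout; the class name `RawStore` and its methods are hypothetical and not taken from the source.

```python
from collections import defaultdict


class RawStore:
    """In-memory buffer of original data, keyed by group number and type."""

    def __init__(self):
        self._data = defaultdict(lambda: defaultdict(list))

    def store(self, group_no, target_type, parameter_value):
        """Step 406: keep the parameter value with its group number and type."""
        self._data[group_no][target_type].append(parameter_value)

    def drain(self, group_no):
        """Return and remove everything buffered for one group."""
        return dict(self._data.pop(group_no, {}))


store = RawStore()
store.store(60, ("server 1", "CPU usage"), 54.0)
store.store(60, ("server 1", "CPU usage"), 60.0)
buffered = store.drain(60)
```

  • Draining a group both hands its data to the statistical step and frees the memory, which mirrors the optional deletion of processed original data.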
  • In step 407, each time the preset aggregation period is reached, the target computing server determines the aggregated data of the target type for the current aggregation period based on the original data of the target type received during the current aggregation period.
  • Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark can be installed on the computing server, and the processing can be performed based on Spark. A technician can preset the aggregation period in Spark. When the aggregation period is reached, the target computing server can read from memory the original data of the target type received in the current aggregation period and perform statistical processing on it to calculate the aggregated data of the target type for the current aggregation period. For example, the preset aggregation period may be 60 minutes. Counting from the start of the data aggregation program, each time 60 minutes is reached, the maximum, minimum, and average CPU usage of server 1 in those 60 minutes, the number of data items, and the like may be obtained. The target computing server may receive more than one type of original data, and may perform the above processing on each type to obtain the aggregated data of each type for the current aggregation period.
  • optionally, the target computing server may process the original data of each group separately according to the group to which the stored original data belongs. The corresponding processing may be as follows: each time the preset aggregation period is reached, for each group number, the aggregated data of the target type for the current aggregation period is determined according to the original data of the target type that corresponds to the group number and was received in the current aggregation period.
  • the second data tuples having the same attribute are statistically processed to obtain the aggregated data of each type for the current aggregation period.
  • the computing server can also delete the original data that has been statistically processed to reduce memory usage.
  • the processes are independent of each other, so the data of every group can be processed simultaneously, which improves the parallelism of statistical processing.
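  • The statistical processing of one type within one aggregation period can be sketched in plain Python as follows. This is not Spark code; the `Aggregate` class and `aggregate_raw` function are hypothetical names, and keeping the sum and count (rather than only the average) is a design choice of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Aggregate:
    """Aggregated data of one type for one period."""
    maximum: float
    minimum: float
    total: float
    count: int

    @property
    def average(self):
        return self.total / self.count


def aggregate_raw(values):
    """Compute maximum, minimum, sum, and count of the received parameter values."""
    if not values:
        raise ValueError("no original data received in this period")
    return Aggregate(max(values), min(values), sum(values), len(values))


# Hypothetical CPU usage samples of one server within one period.
agg = aggregate_raw([54.0, 60.0, 48.0])
```

  • Storing the sum and count alongside the extremes lets a longer period later recompute an exact average from shorter-period results.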
  • optionally, the aggregation period may be further divided into multiple levels of sub-aggregation periods, and the aggregated data of a longer sub-aggregation period may be generated from the aggregated data of shorter sub-aggregation periods.
  • the aggregation period includes multiple level-1 sub-aggregation periods, and each level-i sub-aggregation period includes multiple level-(i+1) sub-aggregation periods, where i is any positive integer greater than 1 and less than n, and n is a preset positive integer.
  • the lengths of the sub-aggregation periods and the aggregation period can be arranged in ascending order to form an aggregation time series {t_0, t_1, ..., t_w}.
  • for example, a 600-second aggregation period can be divided into two 300-second level-1 sub-aggregation periods, and each 300-second level-1 sub-aggregation period can be divided into five 60-second level-2 sub-aggregation periods, so the aggregation time series can be {60, 300, 600}.
  • the data of each group is processed independently without interference, and the statistical processing can be repeated according to the aggregation time series {t_0, t_1, ..., t_w}.
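  • To illustrate how an aggregation time series such as {60, 300, 600} drives repeated processing, the sketch below lists which period lengths end at a given elapsed time. It assumes that time is counted in whole seconds from the start of the aggregation program; the function name `periods_due` is hypothetical.

```python
def periods_due(elapsed_seconds, time_series):
    """Return the period lengths in the series that end at this elapsed time."""
    return [t for t in time_series if elapsed_seconds % t == 0]


series = [60, 300, 600]
due_at_60 = periods_due(60, series)
due_at_300 = periods_due(300, series)
due_at_600 = periods_due(600, series)
```

  • With this series, only the 60-second level is processed at 60 seconds, the 60-second and 300-second levels at 300 seconds, and all three levels at 600 seconds, shortest first.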
  • each time the level-n sub-aggregation period is reached, the target computing server may obtain the original data corresponding to each group number received in the current level-n sub-aggregation period and, for each group number, statistically process the original data of the target type among the original data corresponding to that group number to obtain the aggregated data of the target type for the current level-n sub-aggregation period, and store the group number corresponding to each piece of aggregated data.
  • the level-n sub-aggregation period is the shortest, and the data its calculation depends on is the original data received in the current period. That is, each time the level-n sub-aggregation period is reached, statistical processing of the original data is triggered. Each process then automatically indexes all the data in its group through the aggregation function and statistically processes the parameter values in the second data tuples that have the same attribute to obtain the aggregated data of the target type for the current period. The aggregated data and the corresponding group number are stored in memory for subsequent processing.
  • in the example above, the 60-second level-2 sub-aggregation period corresponds to the level-n sub-aggregation period here, and the data its calculation depends on is the original data received within the current 60 seconds.
  • after the aggregated data is obtained, the original data corresponding to each group number received in the current level-n sub-aggregation period may also be deleted, that is, the data the current calculation depended on is deleted, to reduce memory usage.
  • the resulting aggregated data can also be stored in a database or output to Kafka, a high-throughput distributed publish-subscribe messaging system, for users to query or use.
  • the aggregated data obtained in the above process may be in the format of the second data tuple. Before being stored in the database or output to Kafka, it may be converted into the format of the first data tuple, that is, the attribute in the second data tuple is split back into the individual attributes of the original first data tuple, which makes it easy to query by different attribute values.
  • each time the level-i sub-aggregation period is reached, the target computing server can obtain the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period and, for each group number, statistically process that aggregated data to obtain the aggregated data of the target type for the current level-i sub-aggregation period, and store the group number corresponding to each piece of aggregated data.
  • the data that the calculation for a level-i sub-aggregation period depends on is all the level-(i+1) aggregated data obtained in the current period. That is, each time the level-i sub-aggregation period is reached, statistical processing of all level-(i+1) aggregated data in the current period is triggered, the aggregated data of the target type for the current period is obtained for each group, and the aggregated data and the corresponding group number are stored in memory.
  • the specific process is similar to the statistical processing performed in the level-n sub-aggregation period described above and is not repeated here.
  • in the example above, the 300-second level-1 sub-aggregation period corresponds to the level-i sub-aggregation period here. The 300-second aggregated data can be calculated from the aggregated data of the five 60-second periods.
  • after that, the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained in the current level-i sub-aggregation period may be deleted, and the newly obtained aggregated data may also be stored in the database or output to Kafka; details are not repeated here.
  • each time the preset aggregation period is reached, the target computing server can obtain the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period and, for each group number, statistically process that aggregated data to obtain the aggregated data of the target type for the current aggregation period.
  • the preset aggregation period is the longest period, and the data its calculation depends on is all the level-1 aggregated data obtained in the current period. That is, each time the preset aggregation period is reached, statistical processing of all level-1 aggregated data in the current period is triggered, and the aggregated data of the target type for the current period is obtained for each group. The specific process is similar to the statistical processing performed in the level-n sub-aggregation period described above and is not repeated here.
  • in the example above, the 600-second aggregation period corresponds to the preset aggregation period here. The 600-second aggregated data can be calculated from the aggregated data of the two 300-second periods.
  • after that, the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained in the current aggregation period may be deleted, and the newly obtained aggregated data may also be stored in the database or output to Kafka; details are not repeated here. Since the aggregation period is the preset longest period, aggregated data is not statistically processed further across aggregation periods. Therefore, after the aggregated data of each type for the current aggregation period is stored in the database or output to Kafka, the aggregated data cached in the computing server can be deleted.
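  • The roll-up from shorter to longer periods can be sketched as a merge of per-period aggregates. The dictionary layout (`max`, `min`, `sum`, `count`) and the sample numbers are assumptions for illustration only. The average of the longer period is recomputed from the merged sum and count, since averaging the sub-period averages would be wrong whenever the sub-periods hold different numbers of data items.

```python
def merge_aggregates(aggregates):
    """Combine the aggregates of several shorter periods into one longer period."""
    if not aggregates:
        raise ValueError("nothing to merge")
    merged = {
        "max": max(a["max"] for a in aggregates),
        "min": min(a["min"] for a in aggregates),
        "sum": sum(a["sum"] for a in aggregates),
        "count": sum(a["count"] for a in aggregates),
    }
    merged["avg"] = merged["sum"] / merged["count"]
    return merged


# Five hypothetical 60-second aggregates for one type in one group.
minute_aggregates = [
    {"max": 60.0, "min": 40.0, "sum": 300.0, "count": 6},
    {"max": 70.0, "min": 50.0, "sum": 360.0, "count": 6},
    {"max": 65.0, "min": 45.0, "sum": 330.0, "count": 6},
    {"max": 55.0, "min": 35.0, "sum": 270.0, "count": 6},
    {"max": 50.0, "min": 30.0, "sum": 240.0, "count": 6},
]
five_minute = merge_aggregates(minute_aggregates)
```

  • The same function can merge two 300-second aggregates into the 600-second result, so each level only ever reads the output of the level below it.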
  • step 407 may be repeated to perform the calculation for the next aggregation period.
  • if all the original data were processed only when the preset aggregation period ends, the amount of data calculated at one time could be relatively large, which could lengthen the processing time of the computing server.
  • with sub-aggregation periods, the processing of the original data in the preset aggregation period is spread across the sub-aggregation periods, and the amount of data calculated at one time is reduced, which shortens the processing time of the computing server and improves the efficiency of statistical processing.
  • optionally, the aggregation period may include m level-1 sub-aggregation periods, and each level-i sub-aggregation period may also include m level-(i+1) sub-aggregation periods, where m is a preset positive integer. That is, the multiple between adjacent levels of aggregation periods is the same. As shown in FIG. 9, with binary division (m = 2) each period is divided into two, and the aggregation time series may be {75, 150, 300, 600}.
  • step 407 can then be performed according to the determined aggregation time series, and details are not repeated here. Since the multiple between adjacent levels is the same, the amount of data used in each statistical calculation is relatively balanced, so the computing efficiency and memory usage of each computing server are balanced during data aggregation and the data aggregation system can run smoothly.
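  • An aggregation time series with a uniform multiple m can be generated as sketched below. The function name `uniform_time_series` and the choice to reject periods that do not divide evenly are assumptions of this example.

```python
def uniform_time_series(aggregation_period, m, levels):
    """Divide the aggregation period by m at each of the given number of levels."""
    series = [aggregation_period]
    for _ in range(levels):
        quotient, remainder = divmod(series[-1], m)
        if remainder:
            raise ValueError("period is not evenly divisible by m")
        series.append(quotient)
    return sorted(series)


series = uniform_time_series(600, m=2, levels=3)
```

  • For a 600-second aggregation period with m = 2 and three levels of sub-aggregation periods, this yields the series 75, 150, 300, 600 given in the text.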
  • the user can query or call the aggregated data according to the required attribute information to analyze the trend of the corresponding thing. For example, the user can query the database for the maximum, minimum, and average CPU usage of the server 1 every 10 minutes in the past hour.
  • in the embodiment of the present invention, the distribution server may determine, according to the target type, the target computing server to which the original data belongs, and then send the original data of the target type to the target computing server in a data storage request. The target computing server may receive the data storage request sent by the distribution server and store the original data of the target type.
  • each time the preset aggregation period is reached, the aggregated data of the target type for the current aggregation period is determined according to the original data of the target type received in the current aggregation period. In this way, original data of the same type is distributed to the same computing server.
  • when the computing server performs statistical processing, the data the calculation depends on is already stored in that computing server, so it no longer needs to wait for other servers to transmit data, which improves the efficiency of statistical processing.
  • the embodiment of the present invention further provides a data processing device, which may be the above-mentioned distribution server. As shown in FIG. 10, the device includes:
  • the obtaining module 1010 is configured to obtain original data, where the original data includes a parameter value and at least one attribute value; specifically, it may implement the obtaining function in step 401 above and other implicit steps;
  • the first determining module 1020 is configured to determine the target type to which the original data belongs, where the attribute values included in the target type are among the at least one attribute value; specifically, it may implement the determining function in step 402 above and other implicit steps;
  • the second determining module 1030 is configured to determine, according to the target type, the target computing server to which the original data belongs; specifically, it may implement the determining function in step 403 above and other implicit steps;
  • the sending module 1040 is configured to send a data storage request to the target computing server, where the data storage request carries the original data of the target type; specifically, it may implement the sending function in step 404 above and other implicit steps.
  • the second determining module 1030 is configured to:
  • the data storage request also carries the group number of the target packet.
  • the second determining module 1030 is configured to:
  • the second determining module 1030 is configured to:
  • the feature code and the total number of groups are subjected to a remainder operation, and the obtained remainder is determined as the group number of the target group corresponding to the original data of the target type.
  • the preset calculation function includes one of the following functions or a combination of several of them: a sum function, a difference function, a product function, and a bitwise AND function.
  • the code of the preset coding type is an American Standard Code for Information Interchange (ASCII) code.
  • the foregoing obtaining module 1010 may be implemented by a transceiver
  • the first determining module 1020 may be implemented by a processor
  • the second determining module 1030 may be implemented by a processor
  • the sending module 1040 may be implemented by a transceiver.
  • the embodiment of the present invention further provides a data processing device, which may be the foregoing computing server. As shown in FIG. 11, the device includes:
  • the receiving module 1110 is configured to receive a data storage request sent by the distribution server, where the data storage request carries original data of a target type, the original data includes a parameter value and at least one attribute value, the original data belongs to the target type, and the attribute values included in the target type are among the at least one attribute value; specifically, it may implement the receiving function in step 405 above and other implicit steps;
  • the storage module 1120 is configured to store the original data of the target type; specifically, it may implement the storage function in step 406 above and other implicit steps;
  • the determining module 1130 is configured to determine, each time a preset aggregation period is reached, the aggregated data of the target type for the current aggregation period according to the original data of the target type received in the current aggregation period; specifically, it may implement the determining function in step 407 above and other implicit steps.
  • optionally, the data storage request further carries the group number of the target group;
  • the storage module 1120 is further configured to store the group number of the target group corresponding to the target type;
  • the determining module 1130 is configured to: each time a preset aggregation period is reached, for each group number, determine the aggregated data of the target type for the current aggregation period according to the original data of the target type that corresponds to the group number and was received in the current aggregation period.
  • optionally, the aggregation period includes multiple level-1 sub-aggregation periods, and each level-i sub-aggregation period includes multiple level-(i+1) sub-aggregation periods, where i is any positive integer greater than 1 and less than n, and n is a preset positive integer; the determining module 1130 is configured to:
  • each time the level-n sub-aggregation period is reached, obtain the original data corresponding to each group number received in the current level-n sub-aggregation period and, for each group number, statistically process the original data of the target type among the original data corresponding to that group number to obtain the aggregated data of the target type for the current level-n sub-aggregation period, and store the group number corresponding to each piece of aggregated data;
  • each time the level-i sub-aggregation period is reached, obtain the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained in the current level-i sub-aggregation period and, for each group number, statistically process that aggregated data to obtain the aggregated data of the target type for the current level-i sub-aggregation period, and store the group number corresponding to each piece of aggregated data;
  • each time the aggregation period is reached, obtain the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained in the current aggregation period and, for each group number, statistically process that aggregated data to obtain the aggregated data of the target type for the current aggregation period.
  • optionally, the aggregation period includes m level-1 sub-aggregation periods, and each level-i sub-aggregation period includes m level-(i+1) sub-aggregation periods, where m is a preset positive integer.
  • optionally, the device further includes:
  • the deleting module 1140, configured to: after the aggregated data corresponding to the current level-n sub-aggregation period is obtained, delete the original data corresponding to each group number received in the current level-n sub-aggregation period; after the aggregated data corresponding to the current level-i sub-aggregation period is obtained, delete the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained in the current level-i sub-aggregation period; and after the aggregated data corresponding to the current aggregation period is obtained, delete the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained in the current aggregation period.
  • receiving module 1110 can be implemented by a transceiver
  • storage module 1120 can be implemented by a memory
  • determining module 1130 can be implemented by a processor
  • deleting module 1140 can be implemented by a processor and a memory.
  • in the embodiment of the present invention, the distribution server may determine, according to the target type, the target computing server to which the original data belongs, and then send the original data of the target type to the target computing server in a data storage request. The target computing server may receive the data storage request sent by the distribution server and store the original data of the target type.
  • each time the preset aggregation period is reached, the aggregated data of the target type for the current aggregation period is determined according to the original data of the target type received in the current aggregation period. In this way, original data of the same type is distributed to the same computing server.
  • when the computing server performs statistical processing, the data the calculation depends on is already stored in that computing server, so it no longer needs to wait for other servers to transmit data, which improves the efficiency of statistical processing.
  • when the data processing apparatus provided by the foregoing embodiments processes data, the above division into functional modules is used only as an example. In actual applications, the functions may be allocated to different functional modules as needed, that is, the internal structures of the distribution server and the computing server may be divided into different functional modules to perform all or part of the functions described above.
  • the data processing apparatus embodiments and the data processing method embodiments belong to the same concept; the specific implementation process is described in detail in the method embodiments and is not repeated here.
  • an embodiment of the present invention further provides a data processing system, where the system includes a distribution server and a computing server, where:
  • the distribution server is configured to: obtain raw data, where the raw data includes a parameter value and at least one attribute value; determine a target type to which the raw data belongs, where the attribute values included in the target type are among the at least one attribute value; determine, according to the target type, a target computing server to which the raw data belongs; and send a data storage request to the target computing server, where the data storage request carries the raw data;
  • the computing server is configured to: receive the data storage request sent by the distribution server, where the data storage request carries raw data of a target type, the raw data includes a parameter value and at least one attribute value, the raw data belongs to the target type, and the attribute values included in the target type are among the at least one attribute value; store the raw data of the target type; and each time the preset aggregation period is reached, determine the aggregated data of the target type for the current aggregation period according to the raw data of the target type received in the current aggregation period.
  • in this embodiment of the present invention, after obtaining raw data of a target type, the distribution server may determine, according to the target type, the target computing server to which the raw data belongs, and then send the raw data of the target type by sending a data storage request to the target computing server. The target computing server may then receive the data storage request sent by the distribution server and store the raw data of the target type. Each time the preset aggregation period is reached, the target computing server determines the aggregated data of the target type for the current aggregation period according to the raw data of the target type received in the current aggregation period. In this way, raw data of the same type is distributed to the same computing server. When the computing server performs statistical processing, all the data that the computation depends on is already stored on that computing server, and it no longer needs to wait for other servers to transmit data, which improves the efficiency of statistical data processing.
  • the computer program product comprises one or more computer instructions. When the computer program instructions are loaded and executed on a device, the processes or functions according to the embodiments of the present invention are produced in whole or in part.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless means (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any usable medium that the device can access, or a data storage device, such as a server or a data center, that integrates one or more usable media.
  • the usable medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a Digital Video Disk (DVD)), or a semiconductor medium (such as a solid-state drive).
  • a person of ordinary skill in the art may understand that all or part of the steps of the foregoing embodiments may be completed by hardware, or by a program instructing the related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)
  • Computer And Data Communications (AREA)

Abstract

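Embodiments of the present invention disclose a data processing method, apparatus, and system, and belong to the field of computer technologies. The method includes: after obtaining raw data, a distribution server determines a target type of the raw data, determines, according to the target type, a target computing server to which the raw data belongs, and then sends the raw data of the target type by sending a data storage request to the target computing server. The target computing server then receives the data storage request sent by the distribution server and stores the raw data of the target type; each time a preset aggregation period is reached, it determines aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period. The present invention can improve the efficiency of statistical data processing.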

Description

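Data processing method, apparatus, and system
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a data processing method, apparatus, and system.
Background
The statistical regularities of data can be applied to monitoring and analysis. For example, statistics on the CPU (Central Processing Unit) usage of the servers in an equipment room can be used to monitor and analyze how the servers are running; statistics on precipitation in each region can be used to monitor and analyze regional weather changes; statistics on the scores of the students in a city can be used to monitor and analyze education in that city; and statistics on the wages of citizens nationwide in the current year can be used to monitor and analyze the national standard of living for that year.
Data used for monitoring may be stored randomly across multiple storage servers, but when the data volume is large this wastes storage resources. Therefore, statistical processing may be performed on the data and only the resulting aggregated data stored, which reduces storage overhead. Statistical methods generally include computing a maximum, a minimum, an average, a sum, a count, and so on: the large amount of data collected within a period of time is reduced to the maximum, minimum, sum, count, and so on for that period, which yields the aggregated data for that period. The aggregated data reflects the statistical regularities of the data, so the raw data is no longer needed for monitoring and analysis. In the prior art, each time a preset aggregation period is reached, a computing server obtains data of the same type from each storage server over the network, and then performs statistical processing on the obtained data to obtain aggregated data.
In the process of implementing the present invention, the inventors found that the prior art has at least the following problem:
With the foregoing processing manner, each time statistical processing is performed, the computing server needs to wait for the storage servers to transmit data. This lengthens the time from the triggering of statistical processing to its completion, and therefore reduces the efficiency of statistical data processing.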
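Summary
To improve the efficiency of statistical data processing, embodiments of the present invention provide a data processing method, apparatus, and system. The technical solutions are as follows:
According to a first aspect, a data processing method is provided. The method is applied to a distribution server and includes: obtaining raw data, where the raw data includes a parameter value and at least one attribute value; determining a target type to which the raw data belongs, where the attribute values included in the target type are among the at least one attribute value; determining, according to the target type, a target computing server to which the raw data belongs; and sending a data storage request to the target computing server, where the data storage request carries the raw data.
In the solution shown in this embodiment of the present invention, when obtaining raw data, the distribution server may distribute the raw data to the target computing server to which it belongs according to the target type of the raw data. The distribution server may obtain raw data of the target type periodically. Each time the distribution server obtains a piece of raw data, it may determine, according to the target type of that raw data, the target computing server to which the raw data is to be distributed, and then send a data storage request carrying the raw data to that target computing server. In this way, raw data of the same type is distributed to the same computing server. When the computing server performs statistical processing, all the data that the computation depends on is already stored on that computing server, and it no longer needs to wait for other servers to transmit data, which improves the efficiency of statistical data processing.
In a possible implementation, determining, according to the target type, the target computing server to which the raw data belongs includes: determining a group number of a target group corresponding to the target type, and determining, according to a preset correspondence between groups and computing servers, the computing server corresponding to the target group as the target computing server to which the raw data belongs. The data storage request further carries the group number of the target group.
In the solution shown in this embodiment of the present invention, each time the distribution server receives raw data, it may compute the target group to which the raw data belongs according to the target type of the raw data. The distribution server may then determine, according to the preset correspondence between groups and computing servers, the target computing server corresponding to the target group; this is the target computing server to which raw data of the target type belongs. Once the target group of the raw data is obtained, the group number of the target group may also be added to the data storage request for that raw data.
In a possible implementation, determining the group number of the target group corresponding to the target type includes: computing, based on the attribute values included in the target type, the group number of the target group corresponding to raw data of the target type.
In the solution shown in this embodiment of the present invention, the target type is converted into a corresponding identifier string, and the group number of the target group corresponding to raw data of the target type may then be computed from the identifier string. The identifier string uniquely represents the target type, so that raw data of different types may yield different group numbers.
In a possible implementation, computing, based on the attribute values included in the target type, the group number of the target group corresponding to the target type includes: determining, for each character in the attribute values included in the target type, a code of a preset encoding type; computing a feature code corresponding to the target type based on each determined code and a preset calculation function; and performing a remainder operation on the feature code and the total number of groups, and determining the resulting remainder as the group number of the target group corresponding to the target type.
In the solution shown in this embodiment of the present invention, each time the distribution server receives raw data, it may convert the raw data into a first data tuple in a unified format, convert each attribute in the tuple into a string, convert each character into a code of the preset encoding type, and compute, using the preset calculation function, the feature code corresponding to the target type, which represents that target type. Dividing the feature code by the total number of groups yields a remainder. The possible remainders correspond one-to-one to the group numbers, so the remainder can be used directly as the group number of the target group corresponding to the target type, which simplifies the correspondence between remainders and group numbers.
In a possible implementation, the preset calculation function includes one of the following functions or a combination of several of them: a summation function, a difference function, a product function, and a bitwise AND function.
In the solution shown in this embodiment of the present invention, the feature code corresponding to the target type may be computed with different preset calculation functions. Whichever calculation function is used, the resulting feature code serves to distinguish the target type from other types.
In a possible implementation, the code of the preset encoding type is an ASCII (American Standard Code for Information Interchange) code.
In the solution shown in this embodiment of the present invention, each character has a unique ASCII code, and the ASCII codes of the characters in a string can be combined to represent the target type.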
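According to a second aspect, a data processing method is provided. The method is applied to a computing server and includes: receiving a data storage request sent by a distribution server, where the data storage request carries raw data, the raw data includes a parameter value and at least one attribute value, the raw data belongs to a target type, and the attribute values included in the target type are among the at least one attribute value; storing the raw data of the target type; and each time a preset aggregation period is reached, determining aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period.
In the solution shown in this embodiment of the present invention, the computing server may receive a data storage request from the distribution server at any time, and may then extract the raw data carried in the data storage request and store it in memory. Each time the aggregation period is reached, the computing server may read from memory the raw data of the target type received within the current aggregation period, perform statistical processing on that raw data, and compute the aggregated data of the target type for the current aggregation period. The computing server may receive raw data of more than one type; it may process each type of raw data in the same way to obtain the aggregated data of each type for the current aggregation period. The data that statistical processing depends on no longer needs to be transmitted over the network, which reduces network bandwidth usage.
In a possible implementation, the data storage request further carries a group number of a target group, and the method further includes: storing the group number of the target group corresponding to the target type. Determining, each time the preset aggregation period is reached, the aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period includes: each time the preset aggregation period is reached, for each group number, determining the aggregated data of the target type for the current aggregation period according to the raw data of the target type that corresponds to the group number and was received within the current aggregation period.
In the solution shown in this embodiment of the present invention, the computing server may also extract the group number of the target group to which the raw data belongs and store it in memory in association with the raw data. Each time raw data needs to be processed, the target computing server may, according to the group corresponding to a process, read from memory the raw data that corresponds to that group's group number and was stored within the current aggregation period. It then performs statistical processing on raw data of the same type using a user-defined aggregate function, and obtains the aggregated data of each type for the current aggregation period.
In a possible implementation, the aggregation period includes multiple level-1 sub-aggregation periods, and a level-i sub-aggregation period includes multiple level-(i+1) sub-aggregation periods, where i is any positive integer greater than 1 and less than n, and n is a preset positive integer. Determining, each time the preset aggregation period is reached and for each group number, the aggregated data of the target type for the current aggregation period according to the raw data of the target type that corresponds to the group number and was received within the current aggregation period includes: each time a level-n sub-aggregation period is reached, separately obtaining the raw data corresponding to each group number received within the current level-n sub-aggregation period, and for each group number, performing statistical processing on the raw data of the target type among the obtained raw data corresponding to the group number, to obtain the aggregated data of the target type for the current level-n sub-aggregation period, and storing the group number corresponding to each piece of aggregated data; each time a level-i sub-aggregation period is reached, separately obtaining the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period, and for each group number, performing statistical processing on the aggregated data of all level-(i+1) sub-aggregation periods corresponding to the group number, to obtain the aggregated data of the target type for the current level-i sub-aggregation period, and storing the group number corresponding to each piece of aggregated data; and each time the preset aggregation period is reached, separately obtaining the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period, and for each group number, performing statistical processing on the aggregated data of all level-1 sub-aggregation periods corresponding to the group number, to obtain the aggregated data of the target type for the current aggregation period.
In the solution shown in this embodiment of the present invention, each time a level-n sub-aggregation period is reached, statistical processing of the raw data is triggered. Each process then uses the aggregate function to automatically index all data in its current group, performs statistical processing on raw data of the same type to obtain the aggregated data of the target type for the current period, and stores the aggregated data and the corresponding group number in memory. Each time a level-i sub-aggregation period is reached, statistical processing of all level-(i+1) aggregated data within the current period is triggered, yielding the aggregated data of the target type for the current period for each group; the aggregated data and the corresponding group number are stored in memory. Each time the preset aggregation period is reached, statistical processing of all level-1 aggregated data within the current period is triggered, yielding the aggregated data of the target type for the current period for each group; the aggregated data and the corresponding group number are stored in memory. In this way, the processing of the raw data within the preset aggregation period is spread across the sub-aggregation periods, so the amount of data handled in a single computation decreases, which shortens the processing time of the computing server and improves the efficiency of statistical data processing.
In a possible implementation, the aggregation period includes m level-1 sub-aggregation periods, and a level-i sub-aggregation period includes m level-(i+1) sub-aggregation periods, where m is a preset positive integer.
In the solution shown in this embodiment of the present invention, the multiple between adjacent levels of aggregation periods is the same, so the amount of data used in each statistical computation is relatively even. As a result, the computing efficiency and memory usage of each computing server are balanced during data aggregation, and the data aggregation system can run smoothly.
In a possible implementation, after the aggregated data corresponding to the current level-n sub-aggregation period is obtained, the raw data corresponding to each group number received within the current level-n sub-aggregation period is deleted; after the aggregated data corresponding to the current level-i sub-aggregation period is obtained, the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period is deleted; and after the aggregated data corresponding to the current aggregation period is obtained, the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period is deleted.
In the solution shown in this embodiment of the present invention, each time aggregated data is obtained, the data that the computation of that aggregated data depended on is deleted, to save memory.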
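According to a third aspect, a distribution server is provided. The distribution server includes at least one module, and the at least one module is configured to implement the data processing method provided in the first aspect.
According to a fourth aspect, a computing server is provided. The computing server includes at least one module, and the at least one module is configured to implement the data processing method provided in the second aspect.
According to a fifth aspect, a data processing system is provided. The system includes a distribution server and a computing server, where:
the distribution server is configured to: obtain raw data, where the raw data includes a parameter value and at least one attribute value; determine a target type to which the raw data belongs, where the attribute values included in the target type are among the at least one attribute value; determine, according to the target type, a target computing server to which the raw data belongs; and send a data storage request to the target computing server, where the data storage request carries the raw data;
the computing server is configured to: receive the data storage request sent by the distribution server, where the data storage request carries raw data, the raw data includes a parameter value and at least one attribute value, the raw data belongs to a target type, and the attribute values included in the target type are among the at least one attribute value; store the raw data of the target type; and each time a preset aggregation period is reached, determine the aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period.
According to a sixth aspect, a distribution server is provided. The distribution server includes a processor and a memory, and the processor is configured to execute instructions stored in the memory; by executing the instructions, the processor implements the data processing method provided in the first aspect.
According to a seventh aspect, a computing server is provided. The computing server includes a processor and a memory, and the processor is configured to execute instructions stored in the memory; by executing the instructions, the processor implements the data processing method provided in the second aspect.
According to an eighth aspect, a computer-readable storage medium is provided, including instructions. When the computer-readable storage medium runs on a distribution server, the distribution server is caused to perform the method according to the first aspect.
According to a ninth aspect, a computer program product containing instructions is provided. When the computer program product runs on a distribution server, the distribution server is caused to perform the method according to the first aspect.
According to a tenth aspect, a computer-readable storage medium is provided, including instructions. When the computer-readable storage medium runs on a computing server, the computing server is caused to perform the method according to the second aspect.
According to an eleventh aspect, a computer program product containing instructions is provided. When the computer program product runs on a computing server, the computing server is caused to perform the method according to the second aspect.
The technical solutions provided in the embodiments of the present invention bring the following beneficial effects:
In the embodiments of the present invention, after obtaining raw data of a target type, the distribution server may determine, according to the target type, the target computing server to which the raw data belongs, and then send the raw data of the target type by sending a data storage request to the target computing server. The target computing server may then receive the data storage request sent by the distribution server and store the raw data of the target type; each time the preset aggregation period is reached, it determines the aggregated data of each type for the current aggregation period according to the raw data of each type received within the current aggregation period. In this way, raw data of the same type is distributed to the same computing server. When the computing server performs statistical processing, all the data that the computation depends on is already stored on that computing server, and it no longer needs to wait for other servers to transmit data, which improves the efficiency of statistical data processing.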
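Brief Description of Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the following briefly introduces the accompanying drawings needed for describing the embodiments. Evidently, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a system framework according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a distribution server according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a computing server according to an embodiment of the present invention;
FIG. 4 is a flowchart of a data aggregation method according to an embodiment of the present invention;
FIG. 5 is a flowchart of a data aggregation method according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of computing a group number according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of aggregation period division according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of parallel processing according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of binary-tree aggregation period division according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a data aggregation apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic diagram of a data aggregation apparatus according to an embodiment of the present invention;
FIG. 12 is a schematic diagram of a data aggregation apparatus according to an embodiment of the present invention.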
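Description of Embodiments
An embodiment of the present invention provides a data processing method. The method may be applied to a data processing system. As shown in FIG. 1, the system may include at least a distribution server and a computing server; the system may include multiple computing servers and one or more distribution servers. A communication connection may be established between the distribution server and the computing servers. To avoid transmitting data between servers during aggregation, after obtaining raw data from a data source, the distribution server may distribute raw data of the same type to the same computing server, and may distribute the various types of raw data among the computing servers. A computing server may perform statistical processing on the raw data to obtain aggregated data. In a real deployment, the distribution server and the computing server may be implemented by one and the same server: when that server runs the distribution process it is logically the distribution server, and when it runs the computing process it is logically the computing server.
The distribution server may include a processor 210, a transmitter 220, and a receiver 230; the receiver 230 and the transmitter 220 may each be connected to the processor 210, as shown in FIG. 2. The receiver 230 may be configured to receive messages or data, that is, it may receive raw data sent by other electronic devices. The transmitter 220 and the receiver 230 may be a network interface card. The transmitter 220 may be configured to send messages or data, that is, it may send the obtained raw data to the computing servers. The processor 210 may be the control center of the server and connects the parts of the entire server, such as the receiver 230 and the transmitter 220, through various interfaces and lines. In the present invention, the processor 210 may be a CPU and may perform the processing related to determining the target computing server to which raw data belongs. Optionally, the processor 210 may include one or more processing units; the processor 210 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system and the modem processor mainly handles wireless communication. The processor 210 may also be a digital signal processor, an application-specific integrated circuit, a field-programmable gate array, or another programmable logic device. The server may further include a memory 240, which may be configured to store software programs and modules; the processor 210 performs the various functional applications and data processing of the server by reading the software code and modules stored in the memory.
The computing server may include a processor 310, a transmitter 320, and a receiver 330; the receiver 330 and the transmitter 320 may each be connected to the processor 310, as shown in FIG. 3. The receiver 330 may be configured to receive messages or data, that is, it may receive raw data sent by the distribution servers. The transmitter 320 and the receiver 330 may be a network interface card. The transmitter 320 may be configured to send messages or data. The processor 310 may be the control center of the server and connects the parts of the entire server, such as the receiver 330 and the transmitter 320, through various interfaces and lines. In the present invention, the processor 310 may be a CPU and may perform the processing related to determining aggregated data. Optionally, the processor 310 may include one or more processing units; the processor 310 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system and the modem processor mainly handles wireless communication. The processor 310 may also be a digital signal processor, an application-specific integrated circuit, a field-programmable gate array, or another programmable logic device. The server may further include a memory 340, which may be configured to store software programs and modules; the processor 310 performs the various functional applications and data processing of the server by reading the software code and modules stored in the memory.
The data aggregation method flowchart shown in FIG. 4 is described in detail below with reference to specific implementations. The content may be as follows:
In step 401, the distribution server obtains raw data.
The raw data is data that a data source device provides to the distribution server, and includes a parameter value and at least one attribute value; that is, the raw data may include a parameter value to be aggregated and the attribute values corresponding to that parameter value. The combination of the attribute values of a piece of raw data may be used to represent the type of that raw data. The target type is the type to which the raw data currently obtained by the distribution server belongs, and the attribute values it includes are among the at least one attribute value of the raw data. This solution aggregates raw data of the same type, so in the subsequent processing raw data of the same type is stored on the same computing server to facilitate aggregation.
Depending on the monitoring requirement, a technician may set the combination of attributes of the raw data needed for the statistics. For example, the long-term trend of any student's score in any subject in any class may be monitored. The raw data may be as shown in Table 1, where each row corresponds to one piece of raw data.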
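Table 1. Scores of students in each class of the school, by subject
Class    Name       Subject  Score
Class 1  Zhang San  Chinese  90
Class 2  Li Si      Chinese  85
Class 1  Zhang San  Math     100
Class 1  Wang Liu   Chinese  95
Class 2  Li Si      Math     90
In Table 1, class, name, and subject are attributes, and score is the parameter. Class 1 and Class 2 are attribute values of the class attribute; Zhang San, Li Si, and Wang Liu are attribute values of the name attribute; Chinese and Math are attribute values of the subject attribute; and 90, 85, 100, and so on are parameter values of the score parameter. Here (Class 1, Zhang San, Chinese) is one type, which may be called type 1; (Class 2, Li Si, Chinese) is another type, which may be called type 2; (Class 1, Zhang San, Math) is a further type, which may be called type 3; and so on. The table records the scores of only one exam. For each type, the scores of multiple exams may be collected and analyzed. For example, if Zhang San of Class 1 scored 76, 79, 82, 86, 88, and 90 in Chinese over several consecutive exams, the type-1 scores received during the statistics are 76, 79, 82, 86, 88, and 90 in that order. The type-1 data, that is, the Chinese scores of Zhang San of Class 1, can then be analyzed, and it can be seen that his Chinese is improving.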
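The notion of a type as a combination of attribute values can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name `group_by_type` and the English renderings of the Table 1 values are assumptions made for the example.

```python
from collections import defaultdict

# Rows mirror Table 1: (class, name, subject) are attributes, the score is the parameter.
records = [
    ("Class 1", "Zhang San", "Chinese", 90),
    ("Class 2", "Li Si", "Chinese", 85),
    ("Class 1", "Zhang San", "Math", 100),
    ("Class 1", "Wang Liu", "Chinese", 95),
    ("Class 2", "Li Si", "Math", 90),
]


def group_by_type(rows):
    """Map each type (the tuple of attribute values) to its list of parameter values."""
    by_type = defaultdict(list)
    for *attributes, score in rows:
        by_type[tuple(attributes)].append(score)
    return dict(by_type)


types = group_by_type(records)
```

Every row of Table 1 has a distinct attribute combination, so the five rows yield five types; later exams would append further scores to the same keys.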
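As another example, the long-term trend of the total score of any student in any class may be monitored. The raw data may be as shown in Table 2, where each row corresponds to one piece of raw data.
Table 2. Scores of students in each class of the school
Class    Name       Total score
Class 1  Zhang San  602
Class 2  Li Si      586
Class 1  Wang Liu   627
In Table 2, class and name are attributes, and total score is the parameter. Class 1 and Class 2 are attribute values of the class attribute; Zhang San, Li Si, and Wang Liu are attribute values of the name attribute; and 602, 586, and 627 are parameter values of the total score parameter. Here (Class 1, Zhang San) is one type, which may be called type 4; (Class 2, Li Si) is another type, which may be called type 5; (Class 1, Wang Liu) is a further type, which may be called type 6; and so on. The table records the scores of only one exam. For each type, the scores of multiple exams may be collected and analyzed. For example, if Zhang San of Class 1 had total scores of 580, 585, 610, 596, 572, and 602 over several consecutive exams, the type-4 total scores obtained during the statistics are 580, 585, 610, 596, 572, and 602 in that order. The type-4 data, that is, the total scores of Zhang San of Class 1, can then be analyzed, and it can be seen that he has a good chance of being admitted to a first-tier university in the college entrance examination.
As a further example, the long-term trend of the average Chinese score of any class may be monitored. The raw data may be as shown in Table 3, where each row corresponds to one piece of raw data.
Table 3. Average Chinese scores of the classes of the school
Class    Average score
Class 1  90
Class 2  85
In Table 3, class is the attribute and average score is the parameter. Class 1 and Class 2 are attribute values of the class attribute, and 90 and 85 are parameter values of the average score parameter. Here Class 1 is one type, which may be called type 7; Class 2 is another type, which may be called type 8; and so on. The table records the average score of only one Chinese exam. For each type, the average scores of multiple Chinese exams may be collected and analyzed. For example, if the average scores of Class 1 over several consecutive Chinese exams were 85, 80, 86, 90, 76, and 84, the type-7 average scores obtained during the statistics are 85, 80, 86, 90, 76, and 84 in that order. The type-7 data, that is, the average Chinese scores of Class 1, can then be analyzed, and it can be seen that the average Chinese score of Class 1 is at an excellent level.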
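In implementation, raw data may come from various sources. For example, when the monitored data is student scores, the raw data may come from data stored in a network-side cloud; when the monitored data is precipitation, the raw data may come from data sent by the monitoring devices of the monitoring stations; and when the monitored data is the CPU usage or memory usage of a server, the raw data may come from the distribution server itself. Raw data can therefore be of many different types. The embodiments of the present invention use raw data of one type (the target type) as an example; raw data of other types is processed in the same way, and details are not repeated.
For raw data of the target type, the distribution server may obtain the raw data periodically. For example, each server in an equipment room may sample its CPU usage every 10 seconds and send the sampled CPU usage to the distribution server as raw data, so that the distribution server obtains the CPU usage of each server.
The raw data obtained by the distribution server may be in a format such as text, RDD (Resilient Distributed Datasets), or JSON (JavaScript Object Notation). Taking the monitoring of server CPU usage as an example, the raw data may be "the CPU usage of server 1 is 54%", where "server 1" and "CPU usage" are both attribute values of the raw data and "54%" is its parameter value. To ensure that the same data aggregation processing can be applied to raw data in any format, a first data tuple in a fixed format may be preset: $data1 = (p_1, p_2, \ldots, p_s, d_1, \ldots, d_t)$, where $p_i$ is the i-th attribute value in the raw data and $d_j$ is the j-th parameter value in the raw data. The combination of all $p_i$ in data1 may be used to represent the type of the data.
When the distribution server receives a piece of raw data, it may proceed to step 402.
In step 402, the distribution server determines the target type to which the raw data belongs.
In implementation, according to the at least one required attribute that has been configured, the distribution server may extract the attribute values of the required attributes from the received raw data to obtain the target type to which the raw data belongs. It may then assign the extracted attribute values to the $p_i$ of the first data tuple and assign the extracted parameter values to the $d_j$. In other words, the raw data is converted into a first data tuple in a unified format. For example, the raw data in the foregoing example may be converted into data1 = (server 1, CPU usage, 54%).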
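The conversion into the first data tuple described in step 402 can be sketched as follows. This is a minimal sketch under stated assumptions: the raw record is modeled as a dictionary, and the function name `to_first_tuple` and the field names `host`, `metric`, `value`, and `rack` are hypothetical, since the text does not fix a concrete input schema.

```python
def to_first_tuple(record, attribute_names, parameter_names):
    """Build data1 = (p_1, ..., p_s, d_1, ..., d_t) from a dict-like raw record.

    Only the configured attributes and parameters are kept; other fields are ignored.
    """
    attributes = tuple(record[name] for name in attribute_names)
    parameters = tuple(record[name] for name in parameter_names)
    return attributes + parameters


# Hypothetical raw record; "rack" is not a configured attribute and is dropped.
raw = {"host": "server 1", "metric": "CPU usage", "value": "54%", "rack": "A3"}
data1 = to_first_tuple(raw, ["host", "metric"], ["value"])
```

Here `data1` matches the example tuple (server 1, CPU usage, 54%), and its first two elements together identify the target type.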
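In step 403, the distribution server determines, according to the target type, the target computing server to which the raw data belongs.
In implementation, each time the distribution server obtains a piece of raw data, it may determine, according to the target type of that raw data, the target computing server to which the raw data is to be distributed. With this processing, raw data of the same type is distributed to the same computing server. Network bandwidth is used only during distribution and is no longer needed during the statistics, which reduces network transmission overhead during computation and shortens the overall data aggregation flow.
Optionally, the raw data may be grouped so that a computing server can process the raw data of different groups in parallel. The corresponding processing may be as follows: determine the group number of the target group corresponding to the target type, and determine, according to the preset correspondence between groups and computing servers, the computing server corresponding to the target group as the target computing server to which the raw data belongs.
In implementation, the degree of parallelism k is the number of processes that can run simultaneously in the data aggregation system. It may be preset according to the total number of CPU cores of all computing servers; in general, k equals 2 to 3 times the total number of CPU cores. For example, with 3 computing servers whose CPUs each have 4 cores, k may be set to 24. The total number of data groups may then be k, numbered 0 to k-1, so that k processes each handle the data of one group. The group numbers that each computing server is to compute may be set randomly or according to some rule, which is not limited here. The group numbers and the identifiers of the computing servers may then be added to a correspondence table to establish the correspondence between groups and computing servers, and this correspondence is stored on the distribution server. For example, if computing server 2 is set to process the data of group 2 and group 3, the correspondence between group 2 and computing server 2 and the correspondence between group 3 and computing server 2 may be stored on the distribution server.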
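The choice of the degree of parallelism and the group-to-server correspondence table can be sketched as below. The text leaves the assignment rule open (random or rule-based), so the round-robin rule here is only one assumed possibility, and the names `parallelism`, `assign_groups`, and the server identifiers are hypothetical.

```python
def parallelism(num_servers, cores_per_server, factor=2):
    """Degree of parallelism k as a multiple (typically 2 to 3) of the total core count."""
    return num_servers * cores_per_server * factor


def assign_groups(k, server_ids):
    """Build the group -> computing server table using round-robin (an assumed rule)."""
    return {group: server_ids[group % len(server_ids)] for group in range(k)}


k = parallelism(num_servers=3, cores_per_server=4)  # 3 servers x 4 cores x 2
mapping = assign_groups(k, ["compute-1", "compute-2", "compute-3"])
```

With three 4-core servers and a factor of 2 this reproduces k = 24 from the example, and each server ends up responsible for 8 of the 24 groups.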
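Each time the distribution server receives raw data, it may compute the target group to which the raw data belongs according to the target type of the raw data. Optionally, the distribution server may compute the group number of the target group corresponding to the target type based on the attribute values included in the target type, as shown in FIG. 5. The specific processing may be as follows:
In step 4031, a code of a preset encoding type is determined for each character in the attribute values included in the target type.
The code of the preset encoding type may be an ASCII code, or a code obtained from a preset character-to-number mapping, for example a code obtained with SHA (Secure Hash Algorithm).
Optionally, when the code of the preset encoding type is an ASCII code, for raw data in the form of the first data tuple, the distribution server may convert each $p_i$ into a string, which yields the characters of the identifier string corresponding to the attribute values included in the target type. The distribution server may then convert each character into its ASCII code number.
In step 4032, a feature code corresponding to the target type is computed based on each determined code and a preset calculation function.
The ASCII code numbers of the characters determined in step 4031 are passed through the preset calculation function to compute the feature code corresponding to the target type, which represents that target type. Optionally, the preset calculation function may include one of the following functions or a combination of several of them: a summation function, a difference function, a product function, and a bitwise AND function. As shown in the group number computation diagram in FIG. 6, if the attributes of the raw data are "123" and "abc", each attribute may be converted into the strings "123" and "abc". The ASCII code number of "1" is 49, of "2" is 50, of "3" is 51, of "a" is 97, of "b" is 98, and of "c" is 99. Summing them gives the feature code S = 444 for the target type.
In step 4033, a remainder operation is performed on the feature code and the total number of groups, and the resulting remainder is determined as the group number of the target group corresponding to the target type.
Dividing the feature code by the total number of groups yields a remainder. As described above for the preset group numbers, the total number of groups is k and the groups are numbered 0 to k-1; when the total number of groups is the divisor, the remainder lies in the range 0 to k-1, which corresponds one-to-one to the group numbers. The remainder can therefore be used directly as the group number of the target group corresponding to raw data of the target type, which simplifies the correspondence between remainders and group numbers. As shown in the group number computation diagram in FIG. 6, the feature code S of the target type is 444 and the total number of groups k is 128, so |S| % k = 60; that is, the target group to which raw data of this target type belongs is group 60.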
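Steps 4031 to 4033 can be sketched in a few lines using the summation function and ASCII codes from the FIG. 6 example. The function names `feature_code` and `group_number` are assumptions for this sketch. Note that Python's `ord` returns a Unicode code point, which coincides with the ASCII code only for ASCII characters.

```python
def feature_code(attribute_values):
    """Sum the character codes of all attribute values (summation variant of step 4032)."""
    return sum(ord(ch) for value in attribute_values for ch in str(value))


def group_number(attribute_values, k):
    """Step 4033: the remainder of |S| divided by the total number of groups k."""
    return abs(feature_code(attribute_values)) % k


s = feature_code(["123", "abc"])       # 49 + 50 + 51 + 97 + 98 + 99
g = group_number(["123", "abc"], 128)  # 444 % 128
```

This reproduces S = 444 and group 60 from the example. Because a plain sum ignores character order, distinct types such as "abc" and "cba" land in the same group; that is harmless here, since the scheme only requires that a given type always maps to the same group.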
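The distribution server may then determine, according to the preset correspondence between groups and computing servers, the target computing server corresponding to the target group; this is the target computing server to which raw data of the target type belongs.
For each type of raw data, each time the distribution server receives raw data, it may determine the computing server to which that type of raw data belongs by following the foregoing process. Raw data of different types may belong to the same computing server or to different ones; either way, the amount of data that a single process needs to handle is effectively reduced, which improves processing efficiency.
In step 404, the distribution server sends a data storage request to the target computing server.
In implementation, after determining in the foregoing process the target computing server to which the raw data is to be distributed, the distribution server may send to that target computing server a data storage request for storing the raw data. The data storage request carries the raw data of the target type. The distribution server uses bandwidth only when distributing raw data; the data that later statistical processing depends on no longer needs to be transmitted over the network, which reduces network bandwidth usage.
Optionally, the data storage request may further carry the group number of the target group to which the raw data belongs. The raw data carried in the data storage request may also be the raw data that was converted into the first data tuple in the foregoing process, to facilitate subsequent processing.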
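A data storage request carrying the first data tuple and the group number might look like the sketch below. The text does not specify a wire format, so the JSON encoding, the field names `group` and `data`, and both function names are assumptions made purely for illustration.

```python
import json


def build_storage_request(data1, group_no):
    """Serialize a storage request: the first data tuple plus its group number."""
    return json.dumps({"group": group_no, "data": list(data1)})


def parse_storage_request(payload):
    """Recover (group number, first data tuple) on the computing server side."""
    message = json.loads(payload)
    return message["group"], tuple(message["data"])


payload = build_storage_request(("server 1", "CPU usage", "54%"), 60)
group_no, data1 = parse_storage_request(payload)
```

Sending the group number along with the tuple means the computing server does not have to recompute it, which matches the optional behavior described for steps 405 and 406.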
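In step 405, the target computing server receives the data storage request sent by the distribution server.
In implementation, the target computing server may receive the data storage request sent by the distribution server and then extract the raw data carried in it. Optionally, the target computing server may also extract the group number of the target group to which the raw data belongs.
In step 406, the target computing server stores the raw data of the target type.
In implementation, the target computing server may store the obtained raw data in memory for use in subsequent processing. Optionally, the target computing server may also store the group number of the target group corresponding to the target type; that is, the group number of the target group to which the raw data belongs is stored in memory in association with the raw data.
Once an aggregation period has started, the target computing server may receive data storage requests for raw data at any time. Steps 405 and 406 are repeated throughout the aggregation period, and step 407 is performed only when the aggregation period ends.
In step 407, each time the preset aggregation period is reached, the target computing server determines the aggregated data of the target type for the current aggregation period according to the raw data of each type received within the current aggregation period.
In implementation, Spark is a fast, general-purpose computing engine designed for large-scale data processing; Spark may be installed on the computing server, and data may be processed based on Spark. A technician may preset the aggregation period in Spark. Each time the aggregation period is reached, the target computing server may read from memory the raw data of the target type received within the current aggregation period, perform statistical processing on it, and compute the aggregated data of the target type for the current aggregation period. For example, the preset aggregation period may be 60 minutes: counting from the start of the data aggregation program, each time 60 minutes is reached, the maximum, minimum, average, sum, count, and so on of the CPU usage of server 1 within those 60 minutes can be obtained. The target computing server may receive raw data of more than one type; it may process each type of raw data in the same way to obtain the aggregated data of each type for the current aggregation period.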
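The statistics named in step 407 (maximum, minimum, average, sum, count) can be computed for one type over one aggregation period as sketched below. This is plain Python rather than Spark code, the function name `aggregate` is an assumption, and the sample CPU usage values are made up for illustration.

```python
def aggregate(values):
    """Reduce the parameter values of one type for one period to summary statistics.

    Assumes at least one value was received during the period.
    """
    total = sum(values)
    count = len(values)
    return {
        "max": max(values),
        "min": min(values),
        "sum": total,
        "count": count,
        "avg": total / count,
    }


# Hypothetical CPU usage samples (percent) for server 1 within one period.
stats = aggregate([54, 60, 48])
```

After this step only the small `stats` dictionary needs to be kept for the period, which is what allows the raw samples to be discarded later.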
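Optionally, the target computing server may process the raw data of each group in parallel according to the groups to which the stored raw data belongs. The corresponding processing may be as follows: each time the preset aggregation period is reached, for each group number, the aggregated data of the target type for the current aggregation period is determined according to the raw data of the target type that corresponds to the group number and was received within the current aggregation period.
In implementation, the target computing server may process data with multiple processes, each corresponding to one group. Each time raw data needs to be processed, the target computing server may, according to the group corresponding to a process, read from memory the raw data that corresponds to that group's group number and was stored within the current aggregation period. For raw data in the form of the first data tuple, the $p_i$ may be concatenated to obtain a second data tuple, in which the concatenated attributes form the single attribute of the tuple. For example, from the first data tuple data1 = (server 1, CPU usage, 54%), the second data tuple data2 = (server 1CPU usage, 54%) is obtained. A user-defined aggregate function is then applied to the second data tuples with the same attribute to obtain the aggregated data of each type for the current aggregation period. Afterwards, the computing server may delete the raw data that has already been aggregated, to save memory.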
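The second data tuple and the per-group aggregation by its single attribute can be sketched as follows. This is a plain-Python stand-in for the user-defined aggregate function, not Spark code; the function names are assumptions, and the sketch assumes exactly one numeric parameter per tuple.

```python
from collections import defaultdict


def to_second_tuple(data1, num_attributes):
    """Concatenate the attribute values of data1 into one key, keeping the parameters."""
    key = "".join(str(p) for p in data1[:num_attributes])
    return (key,) + tuple(data1[num_attributes:])


def aggregate_group(first_tuples, num_attributes):
    """Aggregate all tuples of one group, keyed by the concatenated attribute."""
    values_by_key = defaultdict(list)
    for data1 in first_tuples:
        key, value = to_second_tuple(data1, num_attributes)  # one parameter assumed
        values_by_key[key].append(value)
    return {
        key: {"max": max(v), "min": min(v), "sum": sum(v), "count": len(v)}
        for key, v in values_by_key.items()
    }


group_data = [
    ("server 1", "CPU usage", 54),
    ("server 1", "CPU usage", 60),
    ("server 2", "CPU usage", 30),
]
result = aggregate_group(group_data, num_attributes=2)
```

The key is built without a separator to mirror the example data2 = (server 1CPU usage, 54%). If the key must later be split back into the original attributes, a separator that cannot occur in attribute values would be one way to make that possible; that choice is an assumption outside the text.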
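When multiple processes handle the data of multiple groups, the processes are independent of one another; that is, all groups of data can be processed at the same time, which increases the parallelism of the statistical processing.
When raw data is converted into the first data tuple format, no extra schema information is added to form a DataFrame, so the aggregate functions built into Spark cannot be used directly and a user-defined one is needed. However, the schema information is not used in the statistical processing itself; it is needed only when Spark's built-in aggregate functions are called. Storing the raw data as first data tuples therefore avoids storing redundant schema information, which reduces memory overhead and improves memory utilization.
Optionally, the aggregation period may be further divided into multiple levels of sub-aggregation periods, and the aggregated data of a longer sub-aggregation period may be generated from the aggregated data of the shorter sub-aggregation periods. The aggregation period includes multiple level-1 sub-aggregation periods, and a level-i sub-aggregation period includes multiple level-(i+1) sub-aggregation periods, where i is any positive integer greater than 1 and less than n, and n is a preset positive integer. The sub-aggregation periods and the aggregation period may be arranged in ascending order to form an aggregation time series $\{t_0, t_1, \ldots, t_w\}$. As shown in the aggregation period division diagram in FIG. 7, a 600-second aggregation period may be divided into two 300-second level-1 sub-aggregation periods, and each 300-second level-1 sub-aggregation period may be divided into five 60-second level-2 sub-aggregation periods, so the aggregation time series may be {60, 300, 600}.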
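Deriving the aggregation time series from the period and the per-level division counts can be sketched as below. The function name `aggregation_series` and the `factors` parameter are assumptions for this sketch; `factors[0]` is the number of level-1 sub-periods per aggregation period, `factors[1]` the number of level-2 sub-periods per level-1 sub-period, and so on.

```python
def aggregation_series(period, factors):
    """Return the ascending series {t_0, ..., t_w} for a period split level by level."""
    series = [period]
    for factor in factors:
        if series[-1] % factor:
            raise ValueError("each period must divide evenly into its sub-periods")
        series.append(series[-1] // factor)
    return sorted(series)


series = aggregation_series(600, [2, 5])  # 600 s -> 2 x 300 s -> 5 x 60 s each
```

For the FIG. 7 example this yields 60, 300, and 600 seconds: the shortest period aggregates raw data, and each longer one aggregates the results of the level below it.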
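As shown in the parallel processing diagram in FIG. 8, the data of each group is processed independently without interference, and statistical processing may be repeated according to the aggregation time series $\{t_0, t_1, \ldots, t_w\}$. The statistical processing for the sub-aggregation periods and the aggregation period is described in detail below:
Each time a level-n sub-aggregation period is reached, the target computing server may separately obtain the raw data corresponding to each group number received within the current level-n sub-aggregation period, and for each group number, perform statistical processing on the raw data of the target type among the obtained raw data corresponding to the group number, to obtain the aggregated data of the target type for the current level-n sub-aggregation period, and store the group number corresponding to each piece of aggregated data.
In implementation, the level-n sub-aggregation period is the shortest period, and the computation depends on the raw data received within the current period. That is, each time a level-n sub-aggregation period is reached, statistical processing of the raw data is triggered. Each process then uses the aggregate function to automatically index all data in its current group, performs statistical processing on the parameter values of the second data tuples that have the same attribute, obtains the aggregated data of the target type for the current period, and stores the aggregated data and the corresponding group number in memory for subsequent processing. In the aggregation period division diagram in FIG. 7, the 60-second level-2 sub-aggregation period corresponds to the level-n sub-aggregation period here, and the computation depends on the raw data received within the current 60 seconds.
Optionally, each time the aggregated data of each type for the current level-n sub-aggregation period is obtained, the raw data corresponding to each group number received within the current level-n sub-aggregation period may be deleted; that is, the data that the current computation depended on is deleted to save memory. The resulting aggregated data may also be stored in a database or output to Kafka (a high-throughput distributed publish-subscribe messaging system) for users to query or use. The aggregated data obtained in the foregoing process may be in the second data tuple format; before it is stored in the database or output to Kafka, it may be converted into the first data tuple format, that is, the attribute of the second data tuple is split back into the attributes of the original first data tuple, which makes it easier for users to query by different attribute values.
Each time a level-i sub-aggregation period is reached, the target computing server may separately obtain the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period, and for each group number, perform statistical processing on the aggregated data of all level-(i+1) sub-aggregation periods corresponding to the group number, to obtain the aggregated data of the target type for the current level-i sub-aggregation period, and store the group number corresponding to each piece of aggregated data.
In implementation, the computation for a level-i sub-aggregation period depends on all level-(i+1) aggregated data obtained within the current period. That is, each time a level-i sub-aggregation period is reached, statistical processing of all level-(i+1) aggregated data within the current period is triggered, yielding the aggregated data of the target type for the current period for each group, and the aggregated data and the corresponding group number are stored in memory. The specific process is similar to the statistical processing described above for the level-n sub-aggregation period and is not repeated here. In the aggregation period division diagram in FIG. 7, the 300-second level-1 sub-aggregation period corresponds to the level-i sub-aggregation period here; the 300-second aggregated data may be computed from the aggregated data of the five 60-second periods it contains.
Optionally, after this, the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period may be deleted, and the resulting aggregated data may also be stored in a database or output to Kafka; details are not repeated here.
Each time the preset aggregation period is reached, the target computing server may separately obtain the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period, and for each group number, perform statistical processing on the aggregated data of all level-1 sub-aggregation periods corresponding to the group number, to obtain the aggregated data of the target type for the current aggregation period.
In implementation, the preset aggregation period is the longest period, and the computation depends on all level-1 aggregated data obtained within the current period. That is, each time the preset aggregation period is reached, statistical processing of all level-1 aggregated data within the current period is triggered, yielding the aggregated data of the target type for the current period for each group. The specific process is similar to the statistical processing described above for the level-n sub-aggregation period and is not repeated here. In the aggregation period division diagram in FIG. 7, the 600-second aggregation period corresponds to the preset aggregation period here; the 600-second aggregated data may be computed from the aggregated data of the two 300-second periods it contains.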
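Computing the aggregated data of a longer period from the aggregated data of its sub-periods can be sketched as below. The function names `merge` and `average` and the sample numbers are assumptions for illustration; the statistics carried are the maximum, minimum, sum, and count named in the text.

```python
def merge(aggregates):
    """Combine the aggregates of shorter periods into one for the enclosing period."""
    return {
        "max": max(a["max"] for a in aggregates),
        "min": min(a["min"] for a in aggregates),
        "sum": sum(a["sum"] for a in aggregates),
        "count": sum(a["count"] for a in aggregates),
    }


def average(aggregate):
    """Derive the mean from the carried sum and count."""
    return aggregate["sum"] / aggregate["count"]


# Two hypothetical 300-second aggregates merged into one 600-second aggregate.
first_half = {"max": 60, "min": 40, "sum": 500, "count": 10}
second_half = {"max": 70, "min": 45, "sum": 330, "count": 6}
whole = merge([first_half, second_half])
```

The mean is derived from the sum and count rather than stored and averaged directly: averaging the two sub-period means (50 and 55) would give 52.5, whereas the true mean over all 16 samples is 51.875, because the sub-periods contain different numbers of samples.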
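Optionally, after this, the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period may be deleted, and the resulting aggregated data may also be stored in a database or output to Kafka; details are not repeated here. Because the aggregation period is the preset longest period, no further statistical processing is performed across aggregation periods. Therefore, after the aggregated data of each type for the current aggregation period has been stored in the database or output to Kafka, the copy of that aggregated data cached on the computing server may be deleted.
At this point, statistical processing has been performed for every time in the aggregation time series, and step 407 may be repeated for the next aggregation period. If the raw data within the preset aggregation period were processed directly, the amount of data in a single computation could be large, which could make the processing time of the computing server long. Spreading the processing of the raw data within the preset aggregation period across the sub-aggregation periods reduces the amount of data in a single computation, which shortens the processing time of the computing server and improves the efficiency of statistical data processing.
Optionally, the aggregation period may include m level-1 sub-aggregation periods, and a level-i sub-aggregation period may also include m level-(i+1) sub-aggregation periods, where m is a preset positive integer. That is, the multiple between adjacent levels of aggregation periods is the same. As shown in the binary-tree aggregation period division diagram in FIG. 9, when m equals 2, the sub-aggregation periods and the preset aggregation period form a binary tree, and the sub-aggregation periods can be determined from the preset aggregation period, that is, $t_i = 2^i \cdot t_0$, where $t_i$ is any time in the aggregation time series $\{t_0, t_1, \ldots, t_w\}$. For example, if the preset aggregation period is 600 seconds, then $600 = 2^3 \times 75$, and the aggregation time series may be {75, 150, 300, 600}.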
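The uniform division with a fixed multiple m can be sketched as below; with m = 2 it gives the binary-tree series $t_i = 2^i \cdot t_0$. The function name `uniform_series` and the `levels` parameter (the number of sub-period levels, equal to w) are assumptions for this sketch.

```python
def uniform_series(period, m, levels):
    """Return [t_0, ..., t_w] with t_i = m**i * t_0 and t_w = period, where w = levels."""
    base, remainder = divmod(period, m ** levels)
    if remainder:
        raise ValueError("period must be divisible by m ** levels")
    return [base * m ** i for i in range(levels + 1)]


series = uniform_series(600, m=2, levels=3)  # t_0 = 600 / 2**3 = 75
```

For the 600-second example this yields 75, 150, 300, and 600 seconds. A fourth level is not possible with whole seconds, since 600 is not divisible by 16, and the function rejects that case.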
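The processing of step 407 may then be performed according to the determined aggregation time series; details are not repeated here. Because the multiple between adjacent levels of aggregation periods is the same, the amount of data used in each statistical computation is relatively even, so the computing efficiency and memory usage of each computing server are balanced during data aggregation, and the data aggregation system can run smoothly.
If the aggregated data obtained for each type of data is stored in a database or output to Kafka, a user may query or retrieve the aggregated data by the required attribute information to analyze the trend of the monitored object. For example, a user may query the database for the maximum, minimum, and average CPU usage of server 1 for each 10-minute interval over the past hour.
In the embodiments of the present invention, after obtaining raw data of a target type, the distribution server may determine, according to the target type, the target computing server to which the raw data belongs, and then send the raw data of the target type by sending a data storage request to the target computing server. The target computing server may then receive the data storage request sent by the distribution server and store the raw data of the target type; each time the preset aggregation period is reached, it determines the aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period. In this way, raw data of the same type is distributed to the same computing server. When the computing server performs statistical processing, all the data that the computation depends on is already stored on that computing server, and it no longer needs to wait for other servers to transmit data, which improves the efficiency of statistical data processing.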
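Based on the same technical concept, an embodiment of the present invention further provides a data processing apparatus. The apparatus may be the foregoing distribution server. As shown in FIG. 10, the apparatus includes:
an obtaining module 1010, configured to obtain raw data, where the raw data includes a parameter value and at least one attribute value; specifically, it may implement the obtaining function in step 401 and other implicit steps;
a first determining module 1020, configured to determine a target type to which the raw data belongs, where the attribute values included in the target type are among the at least one attribute value; specifically, it may implement the determining function in step 402 and other implicit steps; a second determining module 1030, configured to determine, according to the target type, a target computing server to which the raw data belongs; specifically, it may implement the determining function in step 403 and other implicit steps;
a sending module 1040, configured to send a data storage request to the target computing server, where the data storage request carries the raw data of the target type; specifically, it may implement the sending function in step 404 and other implicit steps.
Optionally, the second determining module 1030 is configured to:
determine a group number of a target group corresponding to the target type, and determine, according to a preset correspondence between groups and computing servers, the computing server corresponding to the target group as the target computing server to which the raw data belongs;
the data storage request further carries the group number of the target group.
Optionally, the second determining module 1030 is configured to:
compute, based on the attribute values included in the target type, the group number of the target group corresponding to raw data of the target type.
Optionally, the second determining module 1030 is configured to:
determine, for each character in the attribute values included in the target type, a code of a preset encoding type;
compute a feature code corresponding to the target type based on each determined code and a preset calculation function;
perform a remainder operation on the feature code and the total number of groups, and determine the resulting remainder as the group number of the target group corresponding to raw data of the target type.
Optionally, the preset calculation function includes one of the following functions or a combination of several of them:
a summation function, a difference function, a product function, and a bitwise AND function.
Optionally, the code of the preset encoding type is an ASCII (American Standard Code for Information Interchange) code.
It should be noted that the obtaining module 1010 may be implemented by a transceiver, the first determining module 1020 may be implemented by a processor, the second determining module 1030 may be implemented by a processor, and the sending module 1040 may be implemented by a transceiver.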
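Based on the same technical concept, an embodiment of the present invention further provides a data processing apparatus. The apparatus may be the foregoing computing server. As shown in FIG. 11, the apparatus includes:
a receiving module 1110, configured to receive a data storage request sent by a distribution server, where the data storage request carries raw data of a target type, the raw data includes a parameter value and at least one attribute value, the raw data belongs to the target type, and the attribute values included in the target type are among the at least one attribute value; specifically, it may implement the receiving function in step 405 and other implicit steps;
a storage module 1120, configured to store the raw data of the target type; specifically, it may implement the storing function in step 406 and other implicit steps;
a determining module 1130, configured to determine, each time a preset aggregation period is reached, the aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period; specifically, it may implement the determining function in step 407 and other implicit steps.
Optionally, the data storage request further carries a group number of a target group;
the storage module 1120 is further configured to store the group number of the target group corresponding to the target type;
the determining module 1130 is configured to: each time the preset aggregation period is reached, for each group number, determine the aggregated data of the target type for the current aggregation period according to the raw data of the target type that corresponds to the group number and was received within the current aggregation period.
Optionally, the aggregation period includes multiple level-1 sub-aggregation periods, and a level-i sub-aggregation period includes multiple level-(i+1) sub-aggregation periods, where i is any positive integer greater than 1 and less than n, and n is a preset positive integer; the determining module 1130 is configured to:
each time a level-n sub-aggregation period is reached, separately obtain the raw data corresponding to each group number received within the current level-n sub-aggregation period, and for each group number, perform statistical processing on the raw data of the target type among the obtained raw data corresponding to the group number, to obtain the aggregated data of the target type for the current level-n sub-aggregation period, and store the group number corresponding to each piece of aggregated data;
each time a level-i sub-aggregation period is reached, separately obtain the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period, and for each group number, perform statistical processing on the aggregated data of all level-(i+1) sub-aggregation periods corresponding to the group number, to obtain the aggregated data of the target type for the current level-i sub-aggregation period, and store the group number corresponding to each piece of aggregated data;
each time the preset aggregation period is reached, separately obtain the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period, and for each group number, perform statistical processing on the aggregated data of all level-1 sub-aggregation periods corresponding to the group number, to obtain the aggregated data of the target type for the current aggregation period.
Optionally, the aggregation period includes m level-1 sub-aggregation periods, and a level-i sub-aggregation period includes m level-(i+1) sub-aggregation periods, where m is a preset positive integer.
Optionally, as shown in FIG. 12, the apparatus further includes:
a deleting module 1140, configured to: after the aggregated data corresponding to the current level-n sub-aggregation period is obtained, delete the raw data corresponding to each group number received within the current level-n sub-aggregation period; after the aggregated data corresponding to the current level-i sub-aggregation period is obtained, delete the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period; and after the aggregated data corresponding to the current aggregation period is obtained, delete the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period.
It should be noted that the receiving module 1110 may be implemented by a transceiver, the storage module 1120 may be implemented by a memory, the determining module 1130 may be implemented by a processor, and the deleting module 1140 may be implemented by a processor together with a memory.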
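In the embodiments of the present invention, after obtaining raw data of a target type, the distribution server may determine, according to the target type, the target computing server to which the raw data belongs, and then send the raw data of the target type by sending a data storage request to the target computing server. The target computing server may then receive the data storage request sent by the distribution server and store the raw data of the target type; each time the preset aggregation period is reached, it determines the aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period. In this way, raw data of the same type is distributed to the same computing server. When the computing server performs statistical processing, all the data that the computation depends on is already stored on that computing server, and it no longer needs to wait for other servers to transmit data, which improves the efficiency of statistical data processing.
It should be noted that, when the data processing apparatus provided by the foregoing embodiments processes data, the division into the functional modules described above is only an example. In actual applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the distribution server and the computing server may be divided into different functional modules to perform all or part of the functions described above. In addition, the data processing apparatus provided by the foregoing embodiments and the data processing method embodiments are based on the same concept. For the specific implementation process, refer to the method embodiments; details are not described herein again.
Based on the same technical concept, an embodiment of the present invention further provides a data processing system. The system includes a distribution server and a computing server, where:
the distribution server is configured to: obtain raw data, where the raw data includes a parameter value and at least one attribute value; determine a target type to which the raw data belongs, where the attribute values included in the target type are among the at least one attribute value; determine, according to the target type, a target computing server to which the raw data belongs; and send a data storage request to the target computing server, where the data storage request carries the raw data;
the computing server is configured to: receive the data storage request sent by the distribution server, where the data storage request carries raw data of a target type, the raw data includes a parameter value and at least one attribute value, the raw data belongs to the target type, and the attribute values included in the target type are among the at least one attribute value; store the raw data of the target type; and each time a preset aggregation period is reached, determine the aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period.
In the embodiments of the present invention, after obtaining raw data of a target type, the distribution server may determine, according to the target type, the target computing server to which the raw data belongs, and then send the raw data of the target type by sending a data storage request to the target computing server. The target computing server may then receive the data storage request sent by the distribution server and store the raw data of the target type; each time the preset aggregation period is reached, it determines the aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period. In this way, raw data of the same type is distributed to the same computing server. When the computing server performs statistical processing, all the data that the computation depends on is already stored on that computing server, and it no longer needs to wait for other servers to transmit data, which improves the efficiency of statistical data processing.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a device, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line) or wireless means (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that the device can access, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a Digital Video Disk (DVD)), or a semiconductor medium (such as a solid-state drive).
A person of ordinary skill in the art may understand that all or part of the steps of the foregoing embodiments may be completed by hardware, or by a program instructing the related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.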

Claims (23)

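  1. A data processing method, characterized in that the method is applied to a distribution server, the distribution server has communication connections established with multiple computing servers, and the method comprises:
    obtaining raw data, wherein the raw data comprises a parameter value and at least one attribute value;
    determining a target type to which the raw data belongs, wherein the attribute values comprised in the target type are among the at least one attribute value;
    determining, according to the target type, a target computing server to which the raw data belongs; and
    sending a data storage request to the target computing server, wherein the data storage request carries the raw data.
  2. The method according to claim 1, characterized in that the determining, according to the target type, a target computing server to which the raw data belongs comprises:
    determining a group number of a target group corresponding to the target type, and determining, according to a preset correspondence between groups and computing servers, the computing server corresponding to the target group as the target computing server to which the raw data belongs;
    wherein the data storage request further carries the group number of the target group.
  3. The method according to claim 2, characterized in that the determining a group number of a target group corresponding to the target type comprises:
    computing, based on the attribute values comprised in the target type, the group number of the target group corresponding to the target type.
  4. The method according to claim 3, characterized in that the computing, based on the attribute values comprised in the target type, the group number of the target group corresponding to the target type comprises:
    determining, for each character in the attribute values comprised in the target type, a code of a preset encoding type;
    computing a feature code corresponding to the target type based on each determined code and a preset calculation function; and
    performing a remainder operation on the feature code and the total number of groups, and determining the resulting remainder as the group number of the target group corresponding to the target type.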
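  5. A data processing method, characterized in that the method is applied to a computing server, the computing server has a communication connection established with at least one distribution server, and the method comprises:
    receiving a data storage request sent by a distribution server, wherein the data storage request carries raw data, the raw data comprises a parameter value and at least one attribute value, the raw data belongs to a target type, and the attribute values comprised in the target type are among the at least one attribute value;
    storing the raw data of the target type; and
    each time a preset aggregation period is reached, determining aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period.
  6. The method according to claim 5, characterized in that the data storage request further carries a group number of a target group;
    the method further comprises: storing the group number of the target group corresponding to the target type; and
    the determining, each time a preset aggregation period is reached, aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period comprises: each time the preset aggregation period is reached, for each group number, determining the aggregated data of the target type for the current aggregation period according to the raw data of the target type that corresponds to the group number and was received within the current aggregation period.
  7. The method according to claim 6, characterized in that the aggregation period comprises multiple level-1 sub-aggregation periods, and a level-i sub-aggregation period comprises multiple level-(i+1) sub-aggregation periods, wherein i is any positive integer greater than 1 and less than n, and n is a preset positive integer; and the determining, each time the preset aggregation period is reached and for each group number, the aggregated data of the target type for the current aggregation period according to the raw data of the target type that corresponds to the group number and was received within the current aggregation period comprises:
    each time a level-n sub-aggregation period is reached, separately obtaining the raw data corresponding to each group number received within the current level-n sub-aggregation period, and for each group number, performing statistical processing on the raw data of the target type among the obtained raw data corresponding to the group number, to obtain the aggregated data of the target type for the current level-n sub-aggregation period, and storing the group number corresponding to each piece of aggregated data;
    each time a level-i sub-aggregation period is reached, separately obtaining the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period, and for each group number, performing statistical processing on the aggregated data of all level-(i+1) sub-aggregation periods corresponding to the group number, to obtain the aggregated data of the target type for the current level-i sub-aggregation period, and storing the group number corresponding to each piece of aggregated data; and
    each time the preset aggregation period is reached, separately obtaining the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period, and for each group number, performing statistical processing on the aggregated data of all level-1 sub-aggregation periods corresponding to the group number, to obtain the aggregated data of the target type for the current aggregation period.
  8. The method according to claim 7, characterized in that the aggregation period comprises m level-1 sub-aggregation periods, and a level-i sub-aggregation period comprises m level-(i+1) sub-aggregation periods, wherein m is a preset positive integer.
  9. The method according to claim 7, characterized in that after the aggregated data corresponding to the current level-n sub-aggregation period is obtained, the method further comprises: deleting the raw data corresponding to each group number received within the current level-n sub-aggregation period;
    after the aggregated data corresponding to the current level-i sub-aggregation period is obtained, the method further comprises: deleting the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period; and
    after the aggregated data corresponding to the current aggregation period is obtained, the method further comprises: deleting the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period.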
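  10. A distribution server, characterized in that the distribution server comprises:
    an obtaining module, configured to obtain raw data, wherein the raw data comprises a parameter value and at least one attribute value;
    a first determining module, configured to determine a target type to which the raw data belongs, wherein the attribute values comprised in the target type are among the at least one attribute value;
    a second determining module, configured to determine, according to the target type, a target computing server to which the raw data belongs; and
    a sending module, configured to send a data storage request to the target computing server, wherein the data storage request carries the raw data of the target type.
  11. The distribution server according to claim 10, characterized in that the second determining module is configured to:
    determine a group number of a target group corresponding to the target type, and determine, according to a preset correspondence between groups and computing servers, the computing server corresponding to the target group as the target computing server to which the raw data belongs;
    wherein the data storage request further carries the group number of the target group.
  12. The distribution server according to claim 11, characterized in that the second determining module is configured to:
    compute, based on the attribute values comprised in the target type, the group number of the target group corresponding to the target type.
  13. The distribution server according to claim 12, characterized in that the second determining module is configured to:
    determine, for each character in the attribute values comprised in the target type, a code of a preset encoding type;
    compute a feature code corresponding to the target type based on each determined code and a preset calculation function; and
    perform a remainder operation on the feature code and the total number of groups, and determine the resulting remainder as the group number of the target group corresponding to the target type.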
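  14. A computing server, characterized in that the computing server comprises:
    a receiving module, configured to receive a data storage request sent by a distribution server, wherein the data storage request carries raw data, the raw data comprises a parameter value and at least one attribute value, the raw data belongs to a target type, and the attribute values comprised in the target type are among the at least one attribute value;
    a storage module, configured to store the raw data of the target type; and
    a determining module, configured to determine, each time a preset aggregation period is reached, aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period.
  15. The computing server according to claim 14, characterized in that the data storage request further carries a group number of a target group;
    the storage module is further configured to store the group number of the target group corresponding to the target type; and
    the determining module is configured to: each time the preset aggregation period is reached, for each group number, determine the aggregated data of the target type for the current aggregation period according to the raw data of the target type that corresponds to the group number and was received within the current aggregation period.
  16. The computing server according to claim 15, characterized in that the aggregation period comprises multiple level-1 sub-aggregation periods, and a level-i sub-aggregation period comprises multiple level-(i+1) sub-aggregation periods, wherein i is any positive integer greater than 1 and less than n, and n is a preset positive integer; and the determining module is configured to:
    each time a level-n sub-aggregation period is reached, separately obtain the raw data corresponding to each group number received within the current level-n sub-aggregation period, and for each group number, perform statistical processing on the raw data of the target type among the obtained raw data corresponding to the group number, to obtain the aggregated data of the target type for the current level-n sub-aggregation period, and store the group number corresponding to each piece of aggregated data;
    each time a level-i sub-aggregation period is reached, separately obtain the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period, and for each group number, perform statistical processing on the aggregated data of all level-(i+1) sub-aggregation periods corresponding to the group number, to obtain the aggregated data of the target type for the current level-i sub-aggregation period, and store the group number corresponding to each piece of aggregated data; and
    each time the preset aggregation period is reached, separately obtain the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period, and for each group number, perform statistical processing on the aggregated data of all level-1 sub-aggregation periods corresponding to the group number, to obtain the aggregated data of the target type for the current aggregation period.
  17. The computing server according to claim 16, characterized in that the aggregation period comprises m level-1 sub-aggregation periods, and a level-i sub-aggregation period comprises m level-(i+1) sub-aggregation periods, wherein m is a preset positive integer.
  18. The computing server according to claim 16, characterized in that the computing server further comprises:
    a deleting module, configured to: after the aggregated data corresponding to the current level-n sub-aggregation period is obtained, delete the raw data corresponding to each group number received within the current level-n sub-aggregation period; after the aggregated data corresponding to the current level-i sub-aggregation period is obtained, delete the aggregated data of all level-(i+1) sub-aggregation periods corresponding to each group number obtained within the current level-i sub-aggregation period; and after the aggregated data corresponding to the current aggregation period is obtained, delete the aggregated data of all level-1 sub-aggregation periods corresponding to each group number obtained within the current aggregation period.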
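  19. A data processing system, characterized in that the system comprises a distribution server and a computing server, wherein:
    the distribution server is configured to: obtain raw data, wherein the raw data comprises a parameter value and at least one attribute value; determine a target type to which the raw data belongs, wherein the attribute values comprised in the target type are among the at least one attribute value; determine, according to the target type, a target computing server to which the raw data belongs; and send a data storage request to the target computing server, wherein the data storage request carries the raw data; and
    the computing server is configured to: receive the data storage request sent by the distribution server, wherein the data storage request carries raw data, the raw data comprises a parameter value and at least one attribute value, the raw data belongs to a target type, and the attribute values comprised in the target type are among the at least one attribute value; store the raw data of the target type; and each time a preset aggregation period is reached, determine aggregated data of the target type for the current aggregation period according to the raw data of the target type received within the current aggregation period.
  20. A distribution server, characterized in that the distribution server comprises a transceiver and a processor, wherein:
    the transceiver and the processor are configured to perform the method according to any one of claims 1 to 4.
  21. A computing server, characterized in that the computing server comprises a transceiver, a memory, and a processor, wherein:
    the transceiver, the memory, and the processor are configured to perform the method according to any one of claims 5 to 9.
  22. A computer-readable storage medium, characterized in that it comprises instructions, and when the computer-readable storage medium runs on a distribution server, the distribution server is caused to perform the method according to any one of claims 1 to 4.
  23. A computer-readable storage medium, characterized in that it comprises instructions, and when the computer-readable storage medium runs on a computing server, the computing server is caused to perform the method according to any one of claims 5 to 9.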
PCT/CN2018/104530 2018-02-11 2018-09-07 Data processing method, apparatus, and system WO2019153735A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/990,640 US20200372039A1 (en) 2018-02-11 2020-08-11 Data processing method, apparatus, and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810142085.5 2018-02-11
CN201810142085.5A CN108427725B (zh) 2018-02-11 2018-02-11 Data processing method, apparatus, and system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/990,640 Continuation US20200372039A1 (en) 2018-02-11 2020-08-11 Data processing method, apparatus, and system

Publications (1)

Publication Number Publication Date
WO2019153735A1 (zh) 2019-08-15

Family

ID=63156912

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/104530 WO2019153735A1 (zh) 2018-02-11 2018-09-07 Data processing method, apparatus, and system

Country Status (3)

Country Link
US (1) US20200372039A1 (zh)
CN (1) CN108427725B (zh)
WO (1) WO2019153735A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930531A (zh) * 2020-07-01 2020-11-13 北京奇艺世纪科技有限公司 数据处理、数据生产、数据消费方法、装置、设备及介质
CN112100661A (zh) * 2020-09-16 2020-12-18 深圳集智数字科技有限公司 一种数据处理方法及装置
CN113468385A (zh) * 2021-08-27 2021-10-01 国网浙江省电力有限公司 基于边缘处理端的能源梯度确定方法、装置及存储介质

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
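CN108427725B (zh) * 2018-02-11 2021-08-03 华为技术有限公司 Data processing method, apparatus, and system
CN109558403B (zh) * 2018-09-28 2024-02-02 中国平安人寿保险股份有限公司 Data aggregation method and apparatus, computer apparatus, and computer-readable storage medium
CN110046187B (zh) * 2018-12-25 2023-10-27 创新先进技术有限公司 Data processing system, method, and apparatus
CN111796916A (zh) * 2019-04-09 2020-10-20 Oppo广东移动通信有限公司 Data distribution method, apparatus, storage medium, and server
CN110175210A (zh) * 2019-04-26 2019-08-27 厦门市美亚柏科信息股份有限公司 Data distribution method, apparatus, system, and storage medium
CN110647543A (zh) * 2019-08-29 2020-01-03 凡普数字技术有限公司 Data aggregation method, apparatus, and storage medium
CN110839061B (zh) * 2019-10-16 2020-11-06 北京达佳互联信息技术有限公司 Data distribution method, apparatus, and storage medium
CN111369033B (zh) * 2020-01-02 2024-03-26 东软集团股份有限公司 Method and apparatus for predicting the value distribution of operation and maintenance indicators
CN111866082A (zh) * 2020-06-22 2020-10-30 远光软件股份有限公司 Data distribution method and apparatus based on target system configuration
CN112615773B (zh) * 2020-12-02 2023-02-28 海南车智易通信息技术有限公司 Message processing method and system
CN112799905A (zh) * 2021-01-05 2021-05-14 杭州涂鸦信息技术有限公司 Method, system, and related apparatus for monitoring software operation
CN113110803B (zh) * 2021-04-19 2022-10-21 浙江中控技术股份有限公司 Data storage method and apparatus
CN114969009A (zh) * 2022-06-09 2022-08-30 四川鲁尔物联科技有限公司 Rainfall data processing system, method, electronic device, and storage medium
CN114822540A (zh) * 2022-06-29 2022-07-29 广州小鹏汽车科技有限公司 Vehicle voice interaction method, server, and storage medium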

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102236657A (zh) * 2010-04-28 2011-11-09 阿里巴巴集团控股有限公司 Method and server for processing reported data
CN103678042A (zh) * 2013-12-25 2014-03-26 上海爱数软件有限公司 Method for generating backup policy information based on data analysis
CN106649890A (zh) * 2017-02-07 2017-05-10 税云网络科技服务有限公司 Data storage method and apparatus
CN107092439A (zh) * 2017-03-07 2017-08-25 华为技术有限公司 Data storage method and device
US20170358045A1 (en) * 2015-02-06 2017-12-14 Fronteo, Inc. Data analysis system, data analysis method, and data analysis program
CN108427725A (zh) * 2018-02-11 2018-08-21 华为技术有限公司 Data processing method, apparatus, and system

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
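CN101557316B (zh) * 2009-05-14 2011-07-27 阿里巴巴集团控股有限公司 Method and system for updating statistical data
CN102567396A (zh) * 2010-12-30 2012-07-11 中国移动通信集团公司 Cloud-computing-based data mining method, system, and apparatus
CN103067514B (zh) * 2012-12-29 2016-09-07 深圳先进技术研究院 Method and system for optimizing cloud computing resources for a video surveillance analysis system
CN103942253B (zh) * 2014-03-18 2017-07-14 深圳市房地产评估发展中心 Load-balanced spatial data processing system
CN105407119A (zh) * 2014-09-12 2016-03-16 北京计算机技术及应用研究所 Cloud computing system and method thereof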
US11222034B2 (en) * 2015-09-15 2022-01-11 Gamesys Ltd. Systems and methods for long-term data storage
US10353924B2 (en) * 2015-11-19 2019-07-16 International Business Machines Corporation Data warehouse single-row operation optimization
CN107026881B (zh) * 2016-02-02 2020-04-03 腾讯科技(深圳)有限公司 Service data processing method, apparatus, and system
CN107193839A (zh) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 Data aggregation method and apparatus
CN106484791B (zh) * 2016-09-21 2019-12-06 中国银联股份有限公司 Data statistics method and apparatus
US20180032612A1 (en) * 2017-09-12 2018-02-01 Secrom LLC Audio-aided data collection and retrieval

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
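CN111930531A (zh) * 2020-07-01 2020-11-13 北京奇艺世纪科技有限公司 Data processing, data production, and data consumption methods, apparatus, device, and medium
CN111930531B (zh) * 2020-07-01 2023-08-18 北京奇艺世纪科技有限公司 Data processing, data production, and data consumption methods, apparatus, device, and medium
CN112100661A (zh) * 2020-09-16 2020-12-18 深圳集智数字科技有限公司 Data processing method and apparatus
CN112100661B (zh) * 2020-09-16 2024-03-12 深圳集智数字科技有限公司 Data processing method and apparatus
CN113468385A (zh) * 2021-08-27 2021-10-01 国网浙江省电力有限公司 Method, apparatus, and storage medium for determining an energy gradient based on an edge processing end
CN113468385B (zh) * 2021-08-27 2023-09-19 国网浙江省电力有限公司 Method, apparatus, and storage medium for determining an energy gradient based on an edge processing end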

Also Published As

Publication number Publication date
CN108427725A (zh) 2018-08-21
US20200372039A1 (en) 2020-11-26
CN108427725B (zh) 2021-08-03

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18905063

Country of ref document: EP

Kind code of ref document: A1