CN112084219A

CN112084219A - Method, apparatus, electronic device, and medium for processing data

Info

Publication number: CN112084219A
Application number: CN202010974080.6A
Authority: CN
Inventors: 李丹枫; 程建波; 吕军; 王宇光; 孟祥涛; 刘红申; 廖艳丽
Original assignee: JD Digital Technology Holdings Co Ltd
Current assignee: JD Digital Technology Holdings Co Ltd
Priority date: 2020-09-16
Filing date: 2020-09-16
Publication date: 2020-12-15

Abstract

Embodiments of the present disclosure disclose a method, an apparatus, an electronic device, and a medium for processing data. One embodiment of the method comprises: acquiring service data to be processed; acquiring cache data matched with the service data to be processed, wherein the matched cache data represents the statistical characteristics of a data set associated with the service data to be processed; and generating processed service data based on the service data to be processed and the matched cache data, wherein the processed service data represents the statistical characteristics of the set formed by the service data to be processed and the data set associated with it. This embodiment effectively reduces the amount of data that must be stored and transmitted over the network, and can markedly improve cache performance in large-scale data processing scenarios.

Description

Method, apparatus, electronic device, and medium for processing data

Technical Field

Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a method, an apparatus, an electronic device, and a medium for processing data.

Background

With the rapid development of internet technology, data volumes are growing dramatically. How to process and store data efficiently has become an increasingly important issue.

One prior-art approach generally stores detail data in an SQL (Structured Query Language) database or in cache middleware based on a time-series data structure. When a statistical characteristic, such as the variance, of a data set of a certain scale is needed, it is obtained by running an SQL computation or an in-memory CPU computation over the stored detail data. However, when massive data is processed in this manner, the excessive data volume may cause SQL performance problems; loading a large amount of data into local memory may make the data transmission cost excessive; and the computation is limited by machine memory, so it may be slow or even unavailable.

Disclosure of Invention

Embodiments of the present disclosure propose methods, apparatuses, electronic devices, and media for processing data.

In a first aspect, an embodiment of the present disclosure provides a method for processing data, the method including: acquiring service data to be processed; obtaining cache data matched with the to-be-processed business data, wherein the matched cache data is used for representing the statistical characteristics of a data set associated with the to-be-processed business data; and generating processed service data based on the service data to be processed and the matched cache data, wherein the processed service data is used for representing the statistical characteristics of a set consisting of the service data to be processed and a data set associated with the service data to be processed.

In some embodiments, the method further comprises: determining the processed service data as updated cache data; and storing the updated cache data to update the cache data.

In some embodiments, the service data to be processed includes a time sequence field, and the cache data corresponds to a time window identifier; the obtaining of the cache data matched with the service data to be processed includes: and selecting cache data corresponding to the time window identifier matched with the time sequence field of the service data to be processed from a preset cache data set, wherein the cache data in the preset cache data set is used for representing the statistical characteristics of the data set in the time window indicated by the time window identifier.

In some embodiments, the service data to be processed includes streaming data.

In some embodiments, the generating the processed service data based on the service data to be processed and the matched cache data includes: generating intermediate data consistent with the statistical characteristics represented by the matched cache data according to the service data to be processed; and generating the processed service data according to the matched cache data and the intermediate data.

In some embodiments, generating the processed service data according to the matched cache data and the intermediate data includes: in response to determining that the matched cache data is empty, determining the intermediate data as the processed service data.

In some embodiments, generating the processed service data according to the matched cache data and the intermediate data includes: in response to determining that the matched cache data is not empty, generating the processed service data according to the intermediate data and the matched cache data.

In some embodiments, the statistical features include the number of data items and the sum of the data; and generating the processed service data according to the intermediate data and the matched cache data includes: determining the sum of the values representing the number of data items in the intermediate data and in the matched cache data as the value representing the number of data items in the processed service data; and determining the sum of the values representing the sum of the data in the intermediate data and in the matched cache data as the value representing the sum of the data in the processed service data.

In some embodiments, generating the processed service data according to the intermediate data and the matched cache data further includes: determining the ratio of the value representing the sum of the data in the processed service data to the value representing the number of data items in the processed service data as the value representing the mean of the data in the processed service data.

In some embodiments, the statistical features further include the mean of the data and the variance of the data; and generating the processed service data according to the intermediate data and the matched cache data includes: generating the value representing the variance of the data in the processed service data according to the value representing the sum of the data in the intermediate data; the values representing the number of data items, the mean, and the variance of the data in the matched cache data; and the values representing the number of data items and the mean of the data in the processed service data.

In a second aspect, an embodiment of the present disclosure provides a method for querying data, the method including: acquiring data query information, wherein the data query information comprises the range and the statistical characteristics of data to be queried; acquiring at least one piece of cache data matched with the range of the data to be queried, wherein the cache data is used for representing the statistical characteristics of a data set associated with the data query information; and processing the acquired data according to the statistical characteristics to generate query result information matched with the data query information, wherein the query result information is used for representing the statistical characteristics of the data in the range of the data to be queried.

In some embodiments, the range of the data to be queried includes a time range, the cached data corresponds to a time window identifier, and a time period indicated by the time window identifier corresponding to the at least one piece of cached data matching the range of the data to be queried is consistent with the time range.
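As a hedged illustration of this second aspect, the sketch below answers a mean query over a time range by combining the cached statistics of the time windows that cover the range, without reading any detail data. The daily window identifiers, the dict-based cache layout, and the function name `query_mean` are assumptions made for illustration; they do not come from the disclosure.

```python
from typing import Optional

# Hypothetical cache: one entry per daily time window, holding count and sum only.
cache = {
    "2020-08-01": {"count": 2, "sum": 300.0},
    "2020-08-02": {"count": 1, "sum": 500.0},
    "2020-08-03": {"count": 3, "sum": 600.0},
}

def query_mean(cache: dict, window_ids: list) -> Optional[float]:
    """Combine per-window counts and sums into a mean over the requested range."""
    hits = [cache[w] for w in window_ids if w in cache]
    count = sum(h["count"] for h in hits)
    total = sum(h["sum"] for h in hits)
    # No cached window in range: there is nothing to report.
    return total / count if count else None

result = query_mean(cache, ["2020-08-01", "2020-08-02"])
```

Because counts and sums add across windows, the query cost in this sketch depends on the number of windows in the range rather than on the number of underlying records.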

In a third aspect, an embodiment of the present disclosure provides an apparatus for processing data, the apparatus including: a first obtaining unit configured to obtain service data to be processed; the second acquisition unit is configured to acquire cache data matched with the to-be-processed business data, wherein the matched cache data is used for representing the statistical characteristics of a data set associated with the to-be-processed business data; and the first generating unit is configured to generate the processed service data based on the service data to be processed and the matched cache data, wherein the processed service data is used for representing the statistical characteristics of a set formed by the service data to be processed and a data set associated with the service data to be processed.

In some embodiments, the apparatus further comprises: a determining unit configured to determine the processed service data as updated cache data; and the updating unit is configured to store the updated cache data so as to update the cache data.

In some embodiments, the service data to be processed includes a time sequence field, and the cache data corresponds to a time window identifier; and the second acquiring unit may be further configured to: and selecting cache data corresponding to the time window identifier matched with the time sequence field of the service data to be processed from a preset cache data set, wherein the cache data in the preset cache data set is used for representing the statistical characteristics of the data set in the time window indicated by the time window identifier.

In some embodiments, the service data to be processed includes streaming data.

In some embodiments, the first generating unit includes: the first generation subunit is configured to generate intermediate data consistent with the statistical characteristics represented by the matched cache data according to the service data to be processed; and the second generation subunit is configured to generate the processed service data according to the matched cache data and the intermediate data.

In some embodiments, the second generating subunit is further configured to determine the intermediate data as the processed service data in response to determining that the matched cache data is empty.

In some embodiments, the second generating subunit is further configured to generate the processed service data according to the intermediate data and the matching cache data in response to determining that the matching cache data is not empty.

In some embodiments, the statistical features include the number of data items and the sum of the data; and the second generating subunit includes: a first determining module configured to determine the sum of the values representing the number of data items in the intermediate data and in the matched cache data as the value representing the number of data items in the processed service data; and a second determining module configured to determine the sum of the values representing the sum of the data in the intermediate data and in the matched cache data as the value representing the sum of the data in the processed service data.

In some embodiments, the second generating subunit further includes: a third determining module configured to determine the ratio of the value representing the sum of the data in the processed service data to the value representing the number of data items in the processed service data as the value representing the mean of the data in the processed service data.

In some embodiments, the statistical features further include the mean of the data and the variance of the data; and the second generating subunit further includes: a fourth determining module configured to generate the value representing the variance of the data in the processed service data according to the value representing the sum of the data in the intermediate data; the values representing the number of data items, the mean, and the variance of the data in the matched cache data; and the values representing the number of data items and the mean of the data in the processed service data.

In a fourth aspect, an embodiment of the present disclosure provides an apparatus for querying data, the apparatus including: the third acquisition unit is configured to acquire data query information, wherein the data query information comprises the range and the statistical characteristics of the data to be queried; the fourth acquisition unit is configured to acquire at least one piece of cache data matched with the range of the data to be queried, wherein the cache data is used for representing the statistical characteristics of the data set associated with the data query information; and the second generation unit is configured to process the acquired data according to the statistical characteristics and generate query result information matched with the data query information, wherein the query result information is used for representing the statistical characteristics of the data in the range of the data to be queried.

In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.

In a sixth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which when executed by a processor, implements the method as described in any of the implementations of the first aspect.

Embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a medium for processing data, which effectively reduce the amount of data that needs to be stored and transmitted over a network by using, as cache data, data that represents the statistical characteristics of a data set associated with the service data to be processed, rather than detail data. Moreover, because the processed service data is generated from the cache data, frequent reading and repeated calculation of the original detail data are avoided and incremental updating of the data is achieved, so cache performance can be markedly improved, particularly in large-scale data processing scenarios.

Drawings

Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;

FIG. 2 is a flow diagram for one embodiment of a method for processing data according to the present disclosure;

FIG. 3 is a schematic diagram of one application scenario of a method for processing data according to an embodiment of the present disclosure;

FIG. 4 is a flow diagram for one embodiment of a method for querying data, according to the present disclosure;

FIG. 5 is a schematic block diagram illustrating one embodiment of an apparatus for processing data according to the present disclosure;

FIG. 6 is a schematic block diagram illustrating one embodiment of an apparatus for querying data according to the present disclosure;

FIG. 7 is a schematic structural diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.

Detailed Description

The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the specific embodiments described herein merely illustrate the relevant invention and do not limit it. It should be noted that, for convenience of description, only the portions related to the relevant invention are shown in the drawings.

It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Fig. 1 illustrates an exemplary architecture 100 to which the method for processing data or the apparatus for processing data of the present disclosure may be applied.

As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and servers 105, 106. The network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the servers 105, 106. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.

The terminal devices 101, 102, 103 interact with the servers 105, 106 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a web browser application, a shopping-type application, a search-type application, an instant messaging tool, a mailbox client, a database application, and the like.

The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting database operations, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above. They may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. This is not particularly limited herein.

The servers 105, 106 may be servers providing various services. For example, the server 105 may be a background server providing support for various applications on the terminal devices 101, 102, 103; the server 106 may be a cache server that provides support for various applications on the terminal devices 101, 102, 103 and for databases used by the background server 105. The cache server can analyze and process the acquired service data to be processed, generate processed service data, and update the data in the cache according to the generated processed service data.

The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. This is not particularly limited herein.

It should be noted that the method for processing data provided by the embodiment of the present disclosure is generally performed by the server 106, and accordingly, the apparatus for processing data is generally disposed in the server 106.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to FIG. 2, a flow 200 of one embodiment of a method for processing data in accordance with the present disclosure is shown. The method for processing data comprises the following steps:

Step 201, acquiring service data to be processed.

In this embodiment, the execution body of the method for processing data (such as the server 106 shown in fig. 1) may acquire the service data to be processed through a wired or wireless connection. As an example, the execution body may acquire service data to be processed that is stored locally in advance, or may acquire it from an electronic device (for example, the terminal devices 101, 102, 103 or the server 105 shown in fig. 1) that is communicatively connected to the execution body. The service data may take various forms. For example, the service data may be transfer data of an account, personal health data collected by a wearable device, air temperature monitoring data, and the like.

In some optional implementations of this embodiment, the service data to be processed may include a time sequence field. The time sequence field may be used to determine the temporal order among multiple pieces of service data. As an example, the value of the time sequence field may represent the time of a transfer operation, the collection time of personal health data, the collection time of air temperature monitoring data, and the like.

Based on this optional implementation, the scheme can process time-series data.

In some optional implementations of this embodiment, the service data to be processed may include streaming data. The service data to be processed may include real-time data, carrying a time sequence field, from different streaming data sources.

Based on this optional implementation, the scheme can process streaming data.

Step 202, obtaining cache data matched with the service data to be processed.

In this embodiment, the execution body may acquire, in various ways, the cache data matched with the service data to be processed acquired in step 201. The matched cache data may be used to represent the statistical characteristics of a data set associated with the service data to be processed. The statistical characteristics may include, but are not limited to, at least one of: number of data items (record count), sum, mean, maximum, minimum, variance, standard deviation, k-th order central moment. The association generally includes a category match. Optionally, the association may also include a time range match. As an example, if the service data to be processed is transfer data of account A, the matched cache data may be the values of statistical characteristics of the historical transfer data of account A, or of the historical transfer data of the bank where account A is held. As another example, if the service data to be processed is transfer data of account A at 20:00:03 on a given day in August 2020, the matched cache data may be the values of statistical characteristics of the historical transfer data of account A in the period from 00:00:00 on August 1, 2020 to 20:00:02 on that day.

In some optional implementations of this embodiment, where the service data to be processed includes a time sequence field and the cache data corresponds to a time window identifier, the execution body may select, from a preset cache data set, the cache data corresponding to the time window identifier that matches the time sequence field of the service data to be processed. The cache data in the preset cache data set may be used to represent the statistical characteristics of the data set within the time window indicated by the time window identifier. As an example, the cache data may include the mean of the daily air temperatures recorded over the last 3 months. As yet another example, the cache data may represent the variance of the amounts involved in the transfer operations of account A over the last 30 days.

Based on this optional implementation, the matched cache data can be selected according to the time sequence field, so that time-series data can be processed.
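One way to picture the window-based matching is the minimal sketch below. It assumes, purely for illustration, that a time window identifier is a calendar-day string and that the preset cache data set is a dict keyed by account and window identifier; `window_id` and `select_cache` are hypothetical names.

```python
from datetime import datetime

def window_id(ts: datetime) -> str:
    """Hypothetical time window identifier: one window per calendar day."""
    return ts.strftime("%Y-%m-%d")

def select_cache(cache: dict, account: str, ts: datetime):
    """Return the cached statistics whose window matches the record's time, or None."""
    return cache.get((account, window_id(ts)))

# Hypothetical preset cache data set with a single populated window.
cache = {("A", "2020-08-08"): {"count": 3, "sum": 900.0}}
record = {"account": "A", "time": datetime(2020, 8, 8, 20, 0, 3), "amount": 500.0}

matched = select_cache(cache, record["account"], record["time"])
missing = select_cache(cache, "A", datetime(2020, 8, 9, 0, 0, 0))
```

Other window sizes (hourly, monthly, sliding) would only change how `window_id` maps a timestamp to an identifier.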

In some optional implementations of this embodiment, the statistical characteristic may include the number of data and the sum of the data.

In some optional implementations of this embodiment, the statistical characteristics may further include a mean of the data and a variance of the data.

Step 203, generating the processed service data based on the service data to be processed and the matched cache data.

In this embodiment, based on the service data to be processed acquired in step 201 and the matched cache data acquired in step 202, the execution body may generate the processed service data in various ways. The processed service data may be used to represent the statistical characteristics of the set formed by the service data to be processed and the data set associated with it. As an example, if the statistical characteristics include the number of data items, the execution body may add 1 to the value representing the number of data items in the matched cache data to generate the value representing the number of data items in the processed service data. As another example, if the statistical characteristics include the sum of the data, the execution body may add the value of the service data to be processed to the value representing the sum of the data in the matched cache data to generate the value representing the sum of the data in the processed service data.

In some optional implementation manners of this embodiment, based on the service data to be processed and the matched cache data, the execution main body may generate the processed service data through the following steps:

First, generating, according to the service data to be processed, intermediate data consistent with the statistical characteristics represented by the matched cache data.

In these implementations, according to the service data to be processed acquired in step 201, the execution body may generate, in various ways, intermediate data consistent with the statistical characteristics represented by the matched cache data acquired in step 202. As an example, the statistical characteristics may include at least one of: the number of data items, the sum of the data, the mean of the data, the variance of the data. The execution body may determine 1 as the value representing the number of data items (e.g., variable count) in the intermediate data; determine the value of the service data to be processed as both the value representing the sum of the data (e.g., variable sum) and the value representing the mean of the data (e.g., variable avg) in the intermediate data; and determine 0 as the value representing the variance of the data (e.g., variable varp) in the intermediate data.
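The example above can be sketched as a small function. The dict layout and the name `to_intermediate` are assumptions for illustration; the field names follow the variables count, sum, avg, and varp mentioned in the text.

```python
def to_intermediate(value: float) -> dict:
    """Wrap one incoming record in the same statistical shape as the cache data."""
    return {
        "count": 1,      # a single record
        "sum": value,    # the sum of one value is the value itself
        "avg": value,    # so is its mean
        "varp": 0.0,     # a single value has zero variance
    }

intermediate = to_intermediate(500.0)
```

Giving a single record the same shape as cached statistics lets the later merge step treat both inputs uniformly.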

Second, generating the processed service data according to the matched cache data and the intermediate data.

In these implementations, the execution main body may generate the processed service data in various ways according to the matched cache data obtained in step 202 and the intermediate data generated in the first step.

Optionally, in response to determining that the matched cache data is empty, the execution main body may determine the intermediate data as the processed service data.

Optionally, in response to determining that the matched cache data is not empty, the execution main body may generate the processed service data according to the intermediate data and the matched cache data.

Based on the optional implementation manner, the data can be processed in a manner of defining an intermediate variable, so that the processed service data is generated.

Optionally, where the statistical characteristics include the number of data items and the sum of the data, the execution body may generate the processed service data according to the intermediate data and the matched cache data through the following steps:

S1, determining the sum of the values representing the number of data items in the intermediate data and in the matched cache data as the value representing the number of data items in the processed service data.

In these optional implementations, the execution body may add the value representing the number of data items in the intermediate data generated in the first step to the value representing the number of data items in the matched cache data acquired in step 202, and determine the result as the value representing the number of data items in the processed service data. That is, expressed as a formula:

count_new = count_old + count

Here, count_new may be a variable representing the number of data items in the processed service data, count_old may be a variable representing the number of data items in the matched cache data, and count may be a variable representing the number of data items in the intermediate data.

S2, determining the sum of the values representing the sum of the data in the intermediate data and in the matched cache data as the value representing the sum of the data in the processed service data.

In these optional implementations, the execution body may add the value representing the sum of the data in the intermediate data generated in the first step to the value representing the sum of the data in the matched cache data acquired in step 202, and determine the result as the value representing the sum of the data in the processed service data. That is, expressed as a formula:

sum_new = sum_old + sum

Here, sum_new may be a variable representing the sum of the data in the processed service data, sum_old may be a variable representing the sum of the data in the matched cache data, and sum may be a variable representing the sum of the data in the intermediate data.

Based on the above alternative implementation, new data for characterizing the number of data and the sum of the data may be generated by means of incremental updating.
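Steps S1 and S2, together with the empty-cache case described earlier, can be sketched as follows. The function name `merge_count_sum` and the dict layout are assumptions for illustration only.

```python
from typing import Optional

def merge_count_sum(cached: Optional[dict], intermediate: dict) -> dict:
    """Return processed statistics from matched cache data and intermediate data."""
    if not cached:
        # Matched cache data is empty: the intermediate data is the result.
        return dict(intermediate)
    return {
        "count": cached["count"] + intermediate["count"],  # count_new = count_old + count
        "sum": cached["sum"] + intermediate["sum"],        # sum_new = sum_old + sum
    }

first = merge_count_sum(None, {"count": 1, "sum": 500.0})
second = merge_count_sum(first, {"count": 1, "sum": 200.0})
```

Each update in this sketch touches two numbers regardless of how many records the cached statistics already summarize.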

Optionally, the execution body may further determine the ratio of the value representing the sum of the data in the processed service data to the value representing the number of data items in the processed service data as the value representing the mean of the data in the processed service data. That is, expressed as a formula:

avg_new = sum_new / count_new

Here, avg_new may be a variable representing the mean of the data in the processed service data. The meanings of sum_new and count_new are the same as described above and are not repeated here.

Based on the above alternative implementation, new data for characterizing the mean value of the data may be generated in an incremental update manner.
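A short check, under the assumption that records arrive one at a time, shows that the incrementally maintained mean agrees with the mean computed over the detail data. The sample amounts are hypothetical.

```python
import statistics

values = [500.0, 200.0, 350.0]  # hypothetical transfer amounts arriving one by one

count_new, sum_new, avg_new = 0, 0.0, None
for v in values:
    count_new += 1                 # count_new = count_old + count, with count = 1
    sum_new += v                   # sum_new = sum_old + sum
    avg_new = sum_new / count_new  # avg_new = sum_new / count_new

reference = statistics.mean(values)  # mean over the full detail data, for comparison
```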

Optionally, where the statistical characteristics further include the mean of the data and the variance of the data, the execution body may further generate the value representing the variance of the data in the processed service data according to the value representing the sum of the data in the intermediate data; the values representing the number of data items, the mean, and the variance of the data in the matched cache data; and the values representing the number of data items and the mean of the data in the processed service data. As an example, the value of the variance varp_new of the data in the processed service data can be obtained by the following formula:

Here, varp_old may be a variable representing the variance of the data in the matched cache data, and avg_old may be a variable representing the mean of the data in the matched cache data. The meanings of count_old, avg_new, sum, and count_new are the same as described above and are not repeated here.
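The formula referred to above is not reproduced in this text. As a sketch only, and not necessarily the form used in the original publication, one standard identity for updating a population variance is consistent with the variables named here. It assumes the intermediate data describes a single record, so that its count is 1 and its value equals sum:

```latex
\[
\mathit{varp\_new} =
\frac{\mathit{count\_old}\,\bigl(\mathit{varp\_old} + (\mathit{avg\_old} - \mathit{avg\_new})^{2}\bigr)
      + (\mathit{sum} - \mathit{avg\_new})^{2}}
     {\mathit{count\_new}}
\]
```

The first term of the numerator re-centers the cached squared deviations on the new mean, and the second term is the squared deviation of the new record from the new mean.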

Based on the above alternative implementation, new data for characterizing the variance of the data may be generated by means of incremental update.
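An incremental variance update of this kind can be sketched as follows. The sketch uses a standard population-variance update and assumes that the intermediate data consists of a single new value x (so that sum equals x and the intermediate count is 1); it is not asserted to be the formula of the present disclosure. The function name `update_variance` and the sample values are hypothetical.

```python
from statistics import pvariance


def update_variance(count_old, avg_old, varp_old, x):
    """Return (count_new, avg_new, varp_new) after adding one value x.

    Assumption: exactly one new value is added, so the intermediate
    sum is x and the intermediate count is 1.
    """
    count_new = count_old + 1
    avg_new = (count_old * avg_old + x) / count_new
    # Squared deviations of the old values re-centred on avg_new,
    # plus the squared deviation of the new value.
    varp_new = (
        count_old * (varp_old + (avg_new - avg_old) ** 2) + (x - avg_new) ** 2
    ) / count_new
    return count_new, avg_new, varp_new


# Illustrative check against a full recomputation over the detail data.
old_values = [2.0, 4.0, 6.0]
count_new, avg_new, varp_new = update_variance(
    len(old_values), sum(old_values) / len(old_values), pvariance(old_values), 8.0
)
full_recompute = pvariance(old_values + [8.0])
```

The update touches only the cached count, mean, and variance, so the old detail values never have to be read again.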

Optionally, the execution body may further generate the value of the standard deviation of the processed service data according to the generated value of the variance of the processed service data.

In some optional implementations of this embodiment, the executing body may further continue to perform the following steps:

Step one, determining the processed service data as updated cache data.

In these implementations, the execution body may determine the processed service data generated in step 203 as the updated cache data.

Step two, storing the updated cache data to update the cache data.

In these implementations, the execution body may store the updated cache data determined in step one to update the cache data. In general, the execution body may overwrite the data stored in the storage space where the matched cache data is located with the processed service data, thereby realizing incremental updating of the cache data.
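The two steps above, together with the case in which no matched cache data exists yet, can be sketched as follows. This is a minimal sketch, assuming an in-memory dictionary stands in for the cache and that only the number and sum of data are tracked; the `process` function and the key format are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical in-memory cache: key -> (count, sum). A real deployment would
# likely use cache middleware; a dict keeps the sketch self-contained.
Cache = Dict[str, Tuple[int, float]]


def process(cache: Cache, key: str, value: float) -> Tuple[int, float]:
    """Generate processed service data and store it as updated cache data."""
    intermediate = (1, value)  # intermediate data built from the new record
    matched: Optional[Tuple[int, float]] = cache.get(key)
    if matched is None:
        # Matched cache data is empty: the intermediate data is the result.
        processed = intermediate
    else:
        processed = (matched[0] + intermediate[0], matched[1] + intermediate[1])
    # Overwrite the entry where the matched cache data was stored.
    cache[key] = processed
    return processed


cache: Cache = {}
first = process(cache, "account_A:202001", 500.0)
second = process(cache, "account_A:202001", 300.0)
```

Each call reads and writes a single small entry, which is what makes the update incremental.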

With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of a method for processing data according to an embodiment of the present disclosure. In the application scenario of fig. 3, a user 301 uses a terminal device 302 to perform a transfer operation. The terminal device 302 transmits a transfer request 304 to the server 303 providing financial services. Thereafter, the server 303 transmits the transfer data 306 corresponding to the transfer operation to the cache server 305. The transfer data 306 may be, for example, "account A transferred 500 on January 2". After acquiring the transfer data 306, the cache server 305 acquires cache data 307 matching the transfer data 306. The cache data 307 may be, for example, data recording the number, total amount, and average amount of the transfers of account A from January 1 to January 31. Based on the transfer data 306 and the cache data 307, the cache server 305 may generate processed service data 308. The processed service data 308 may be data obtained by updating the original cache data 307 based on the transfer data 306. Optionally, the cache server 305 may also generate new cache data 309 according to the generated processed service data 308.

At present, one prior-art approach stores detail data in cache middleware based on an SQL database or a time series data structure. When a statistical characteristic (such as variance) of a data set of a certain scale is required, it is usually obtained by performing SQL calculation, or memory-based CPU calculation, on the stored detail data. When massive data is processed in this way, the excessive data volume may cause SQL performance problems, loading a large amount of data into local memory may make data transmission too costly, and the amount of calculation is limited by machine memory, so that calculation becomes slow or even unavailable. The method provided by the above embodiment of the present disclosure uses data characterizing the statistical characteristics of the data set associated with the service data to be processed, rather than detail data, as cache data, which effectively reduces the amount of data that needs to be stored and transmitted over the network. Moreover, the processed service data is generated by processing the cache data, so that frequent reading and repeated calculation of the original detail data are avoided and incremental updating of the data is realized. The cache performance can therefore be remarkably improved, particularly in large-scale data processing scenarios.

With further reference to FIG. 4, a flow 400 of one embodiment of a method for querying data is shown. The process 400 of the method for querying data includes the steps of:

step 401, obtaining data query information.

In the present embodiment, an execution subject of the method for querying data (e.g., the terminals 101, 102, 103 or the server 105 shown in fig. 1) may acquire data query information in various ways. The data query information includes the range of the data to be queried and the statistical characteristics. As an example, the range of the data to be queried may include a range of record numbers (for example, the 1st to the 100th records) of a preset data table. The statistical characteristics may be consistent with the description of step 202 and its optional implementations in the foregoing embodiments, and are not described herein again.

In the present embodiment, as an example, an execution subject (e.g., the server 106 shown in fig. 1) of the method for querying data may acquire the above-described data query information from a communication-connected electronic device (e.g., the terminals 101, 102, 103 or the server 105 shown in fig. 1). As yet another example, an execution subject (e.g., the terminals 101, 102, 103 shown in fig. 1) of the method for querying data may acquire data query information input or selected by a user from an input device (e.g., a touch screen).

In some optional implementations of this embodiment, the range of the data to be queried may include a time range. As an example, the time range may be a specified date, the last 3 months, or the like.

Step 402, at least one piece of cache data matched with the range of the data to be queried is obtained.

In this embodiment, the execution body may acquire, in various ways, at least one piece of cache data matching the range of the data to be queried in the data query information acquired in step 401. The cache data may be used to characterize statistical characteristics of the data set associated with the data query information. The data range corresponding to a piece of cache data matching the range of the data to be queried may be consistent with the range of the data to be queried, or may be a subset of it. As an example, the range of the data to be queried may be items 1 to 100 of data table X. The cache data matching that range may be a single piece of cache data characterizing the statistical characteristics of the data set formed by items 1 to 100 of data table X. Alternatively, the at least one piece of cache data matching that range may include five pieces of cache data characterizing, respectively, the statistical characteristics of the data sets formed by items 1 to 20, items 21 to 40, items 41 to 60, items 61 to 80, and items 81 to 100 of data table X.

In some optional implementations of this embodiment, the cache data may correspond to a time window identifier. The time period indicated by the time window identifiers corresponding to the at least one piece of cache data matching the range of the data to be queried may be consistent with the time range. As an example, the data to be queried may be the total daily consumption amount of account A from January 1, 2020 to January 3, 2020. The cache data x, y, and z may correspond to the time window identifiers "20200101", "20200102", and "20200103", respectively, and may characterize the total amount that account A consumed on January 1, 2020, January 2, 2020, and January 3, 2020, respectively. Therefore, the execution body may determine that the cache data x, y, and z are the at least one piece of cache data matching the range of the data to be queried.
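Matching cache data to a queried time range through time window identifiers can be sketched as follows. This is a minimal sketch, assuming daily windows identified by strings of the form YYYYMMDD as in the example above; the functions `window_ids` and `match_cache` and the cached amounts are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List


def window_ids(start: date, end: date) -> List[str]:
    """Daily time window identifiers (YYYYMMDD) covering start..end inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]


def match_cache(cache: Dict[str, float], start: date, end: date) -> Dict[str, float]:
    """Select the cache entries whose window identifier falls in the range."""
    return {wid: cache[wid] for wid in window_ids(start, end) if wid in cache}


# Illustrative values only: daily consumption totals made up for the example.
cache = {"20200101": 120.0, "20200102": 80.0, "20200103": 45.5, "20200104": 10.0}
matched = match_cache(cache, date(2020, 1, 1), date(2020, 1, 3))
```

Here `matched` holds the three entries playing the role of cache data x, y, and z, and the entry for "20200104" is left out because it lies outside the queried range.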

In some optional implementations of this embodiment, the above-mentioned cache data may be consistent with the description of step 202 and its optional implementations in the foregoing embodiment. The cache data may also be updated in the manner described above in step 203 and its optional implementation.

Step 403, processing the acquired cache data according to the statistical characteristics to generate query result information matched with the data query information.

In this embodiment, the execution body may process the acquired cache data according to the statistical characteristics in the data query information acquired in step 401, so as to generate query result information matching the data query information. The query result information may be used to characterize statistical characteristics of the data within the range of the data to be queried. As an example, when a single piece of cache data is acquired and the data range of the statistical characteristics it represents is consistent with the range of the data to be queried, the execution body may directly determine the acquired cache data (e.g., an average value, a variance, etc.) as the query result information matching the data query information. As another example, when several pieces of cache data are acquired and the total data range of the statistical characteristics they represent is consistent with the range of the data to be queried, the execution body may perform operations on the acquired cache data according to the type of the statistical characteristics, so as to generate the query result information matching the data query information. For example, if the statistical characteristic is a sum or a count, the execution body may add up the values in the acquired pieces of cache data. As another example, if the statistical characteristic is a mean or a variance, the execution body may combine the acquired pieces of cache data in the manner of determining the mean or the variance described in step 203 of the foregoing embodiment and its optional implementations.
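Combining several pieces of cache data into one query result can be sketched as follows. This is a minimal sketch, assuming each piece of cache data stores the number, sum, and population variance of its window; the variance merge is the standard pairwise combination formula, not necessarily the one used in the present disclosure. The names `WindowStats`, `merge`, and `summarize` and the sample values are hypothetical.

```python
from functools import reduce
from statistics import pvariance
from typing import NamedTuple, Sequence


class WindowStats(NamedTuple):
    count: int    # number of data in the window
    total: float  # sum of data in the window
    varp: float   # population variance of data in the window

    @property
    def avg(self) -> float:
        return self.total / self.count


def merge(a: WindowStats, b: WindowStats) -> WindowStats:
    """Merge the statistics of two disjoint windows without detail data."""
    count = a.count + b.count   # counts add
    total = a.total + b.total   # sums add
    avg = total / count
    # Each window's squared deviations are re-centred on the merged mean.
    varp = (
        a.count * (a.varp + (a.avg - avg) ** 2)
        + b.count * (b.varp + (b.avg - avg) ** 2)
    ) / count
    return WindowStats(count, total, varp)


def summarize(values: Sequence[float]) -> WindowStats:
    """Build the cache entry for one window from its detail data."""
    return WindowStats(len(values), float(sum(values)), pvariance(values))


# Illustrative values only: three daily windows merged into one result.
day1, day2, day3 = [10.0, 20.0], [30.0], [40.0, 50.0, 60.0]
merged = reduce(merge, [summarize(d) for d in (day1, day2, day3)])
direct = summarize(day1 + day2 + day3)
```

The same merge can roll daily windows up into a weekly or monthly result, since it only needs the cached values of each smaller window.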

At present, one prior-art approach stores detail data in cache middleware based on an SQL database or a time series data structure. When a statistical characteristic (such as variance) of a data set of a certain scale is required, it is usually obtained by performing SQL calculation, or memory-based CPU calculation, on the stored detail data. When massive data is processed in this way, the excessive data volume may cause SQL performance problems, loading a large amount of data into local memory may make data transmission too costly, and the amount of calculation is limited by machine memory, so that calculation becomes slow or even unavailable. The flow 400 of the method for querying data provided by the embodiment shown in fig. 4 of the present disclosure may reduce the amount of calculation needed for the statistical characteristics by acquiring at least one piece of cache data that characterizes statistical characteristics and matches the range of the data to be queried. Moreover, the acquired cache data can be processed to generate query result information, so that the statistical characteristics of a time window with a larger granularity (e.g., 1 month or 1 year) can be obtained by merging the cache data of several time windows with a smaller granularity (e.g., 1 day or 1 week), thereby realizing more flexible data processing.

With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of an apparatus for processing data, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable in various electronic devices.

As shown in fig. 5, the apparatus 500 for processing data provided by the present embodiment includes a first obtaining unit 501, a second obtaining unit 502, and a first generating unit 503. The first obtaining unit 501 is configured to obtain service data to be processed. The second obtaining unit 502 is configured to obtain cache data matched with the service data to be processed, where the matched cache data is used to characterize statistical characteristics of a data set associated with the service data to be processed. The first generating unit 503 is configured to generate processed service data based on the service data to be processed and the matched cache data, where the processed service data is used to characterize statistical characteristics of a set composed of the service data to be processed and the data set associated with the service data to be processed.

In the present embodiment, in the apparatus 500 for processing data: the specific processing of the first obtaining unit 501, the second obtaining unit 502 and the first generating unit 503 and the technical effects thereof can refer to the related descriptions of step 201, step 202 and step 203 in the corresponding embodiment of fig. 2, which are not repeated herein.

In some optional implementations of this embodiment, the apparatus 500 for processing data may further include: a determination unit (not shown in the figure), an update unit (not shown in the figure). The determining unit may be configured to determine the processed service data as the updated cache data. The updating unit may be configured to store the updated cache data to update the cache data.

In some optional implementations of this embodiment, the service data to be processed may include a timing field. The cache data may correspond to a time window identifier. The second obtaining unit 502 may be further configured to select, from a preset cache data set, the cache data corresponding to the time window identifier that matches the timing field of the service data to be processed. The cache data in the preset cache data set may be used to characterize the statistical characteristics of the data set in the time window indicated by the time window identifier.
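Selecting cache data by matching a timing field to a time window identifier can be sketched as follows. This is a minimal sketch, assuming the timing field is an ISO 8601 timestamp string and that windows are identified by day (YYYYMMDD) or month (YYYYMM); these formats, the functions `window_id` and `select_cache`, and the sample entry are hypothetical.

```python
from datetime import datetime
from typing import Dict, Optional

# Hypothetical mapping from window granularity to identifier format.
_FORMATS = {"day": "%Y%m%d", "month": "%Y%m"}


def window_id(timing_field: str, granularity: str = "day") -> str:
    """Derive the time window identifier from a record's timing field."""
    return datetime.fromisoformat(timing_field).strftime(_FORMATS[granularity])


def select_cache(preset: Dict[str, dict], timing_field: str) -> Optional[dict]:
    """Return the cache data of the matching window, or None if it is empty."""
    return preset.get(window_id(timing_field))


# Illustrative preset cache data set with a single daily window.
preset = {"20200102": {"count": 3, "sum": 900.0}}
hit = select_cache(preset, "2020-01-02T09:30:00")
miss = select_cache(preset, "2020-01-05T18:00:00")
```

A `None` result corresponds to the case in which the matched cache data is empty, so the intermediate data would be used as the processed service data.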

In some optional implementations of this embodiment, the service data to be processed may include streaming data.

In some optional implementations of this embodiment, the first generating unit 503 may include: a first generating subunit (not shown in the figure), and a second generating subunit (not shown in the figure). The first generating subunit may be configured to generate, according to the service data to be processed, intermediate data consistent with the statistical characteristics represented by the matched cache data. The second generating subunit may be configured to generate the processed service data according to the matched cache data and the intermediate data.

In some optional implementation manners of this embodiment, the second generating subunit may be further configured to determine, in response to determining that the matched cache data is empty, the intermediate data as the processed service data.

In some optional implementation manners of this embodiment, the second generating subunit may be further configured to, in response to determining that the matched cache data is not empty, generate the processed service data according to the intermediate data and the matched cache data.

In some optional implementations of this embodiment, the statistical characteristics may include the number of data and the sum of data. The second generating subunit may include a first determining module (not shown in the figure) and a second determining module (not shown in the figure). The first determining module may be configured to determine the sum of the values characterizing the number of data in the intermediate data and in the matched cache data as the value characterizing the number of data in the processed service data. The second determining module may be configured to determine the sum of the values characterizing the sum of data in the intermediate data and in the matched cache data as the value characterizing the sum of data in the processed service data.

In some optional implementations of this embodiment, the second generating subunit may further include a third determining module (not shown in the figure) configured to determine the ratio of the value characterizing the sum of data in the processed service data to the value characterizing the number of data in the processed service data as the value characterizing the average of data in the processed service data.

In some optional implementations of this embodiment, the second generating subunit may further include a third determining module (not shown in the figure) configured to generate the value characterizing the variance of data in the processed service data according to the value characterizing the sum of data in the intermediate data; the values characterizing the number, the mean, and the variance of data in the matched cache data; and the values characterizing the number and the mean of data in the processed service data.

The apparatus provided by the foregoing embodiment of the present disclosure acquires, through the second obtaining unit 502, data characterizing statistical characteristics of a data set associated with the service data to be processed as cache data instead of detail data, thereby effectively reducing the amount of data that needs to be stored and transmitted over a network. Moreover, the first generating unit 503 generates the processed service data by processing the cache data, so that frequent reading and repeated calculation of the original detail data are avoided and incremental updating of the data is realized. The cache performance can therefore be remarkably improved, particularly in large-scale data processing scenarios.

As shown in fig. 6, the apparatus 600 for querying data provided by the present embodiment includes a third obtaining unit 601, a fourth obtaining unit 602, and a second generating unit 603. The third obtaining unit 601 is configured to obtain data query information, where the data query information includes a range and statistical characteristics of data to be queried; a fourth obtaining unit 602, configured to obtain at least one piece of cache data that matches a range of data to be queried, where the cache data is used to characterize statistical characteristics of a data set associated with the data query information; a second generating unit 603 configured to process the acquired data according to the statistical characteristics, and generate query result information matching the data query information, where the query result information is used to represent the statistical characteristics of the data in the range of the data to be queried.

In the present embodiment, in the apparatus 600 for querying data: for specific processing of the third obtaining unit 601, the fourth obtaining unit 602, and the second generating unit 603 and technical effects thereof, reference may be made to the related descriptions of step 401, step 402, and step 403 in the corresponding embodiment of fig. 4, which is not described herein again.

In some optional implementations of this embodiment, the range of the data to be queried may include a time range. The cached data may correspond to a time window identification. The time period indicated by the time window identifier corresponding to the at least one piece of cache data matched with the range of the data to be queried may be consistent with the time range.

In the apparatus provided by the foregoing embodiment of the present disclosure, the fourth obtaining unit 602 obtains at least one piece of cache data that characterizes statistical characteristics and matches the range of the data to be queried, so that the amount of calculation needed for the statistical characteristics may be reduced. Moreover, the second generating unit 603 may process the acquired cache data to generate query result information, so that the statistical characteristics of a time window with a larger granularity (e.g., 1 month or 1 year) may be obtained by merging the cache data of several time windows with a smaller granularity (e.g., 1 day or 1 week), thereby realizing more flexible data processing.

Referring now to fig. 7, shown is a schematic diagram of an electronic device (e.g., the servers 105, 106 or the terminal devices 101, 102, 103 in fig. 1) 700 suitable for use in implementing embodiments of the present application. The server shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 7, the electronic device 700 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 701 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage device 708 into a Random Access Memory (RAM) 703. The RAM 703 also stores various programs and data necessary for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Generally, the following devices may be connected to the I/O interface 705: input devices 706 including, for example, a touch screen, touch pad, keyboard, mouse, etc.; an output device 707 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 708 including, for example, magnetic tape, hard disk, etc.; and a communication device 709. The communication means 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device 700 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 7 may represent one device or may represent multiple devices as desired.

In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via the communication means 709, or may be installed from the storage means 708, or may be installed from the ROM 702. The computer program, when executed by the processing device 701, performs the above-described functions defined in the methods of the embodiments of the present application.

It should be noted that the computer readable medium described in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (Radio Frequency), etc., or any suitable combination of the foregoing.

The computer readable medium may be embodied in the electronic device, or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire service data to be processed; obtain cache data matched with the service data to be processed, wherein the matched cache data is used for representing the statistical characteristics of a data set associated with the service data to be processed; and generate processed service data based on the service data to be processed and the matched cache data, wherein the processed service data is used for representing the statistical characteristics of a set consisting of the service data to be processed and the data set associated with the service data to be processed.

Computer program code for carrying out operations for embodiments of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language, Python, or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor comprising a first obtaining unit, a second obtaining unit, and a first generating unit; or a processor comprising a third obtaining unit, a fourth obtaining unit, and a second generating unit. The names of these units do not in some cases limit the units themselves; for example, the first obtaining unit may also be described as a "unit that obtains service data to be processed".

The foregoing description covers only the preferred embodiments of the disclosure and illustrates the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept defined above. For example, a technical solution may be formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims

1. A method for processing data, comprising:

acquiring service data to be processed;

obtaining cache data matched with the service data to be processed, wherein the matched cache data is used for representing the statistical characteristics of a data set associated with the service data to be processed;

and generating processed service data based on the service data to be processed and the matched cache data, wherein the processed service data is used for representing the statistical characteristics of a set formed by the service data to be processed and a data set associated with the service data to be processed.

2. The method of claim 1, wherein the method further comprises:

determining the processed service data as updated cache data;

and storing the updated cache data to update the cache data.

3. The method according to claim 1, wherein the service data to be processed includes a timing field, and the cache data corresponds to a time window identifier; and

the obtaining of the cache data matched with the service data to be processed includes:

and selecting cache data corresponding to the time window identifier matched with the timing field of the service data to be processed from a preset cache data set, wherein the cache data in the preset cache data set is used for representing the statistical characteristics of the data set in the time window indicated by the time window identifier.

4. The method of claim 3, wherein the service data to be processed comprises streaming data.

5. The method according to one of claims 1 to 4, wherein the generating of the processed service data based on the service data to be processed and the matched cache data comprises:

generating intermediate data consistent with the statistical characteristics represented by the matched cache data according to the to-be-processed service data;

and generating the processed service data according to the matched cache data and the intermediate data.

6. The method of claim 5, wherein the generating the processed service data according to the matched cache data and the intermediate data comprises:

and in response to determining that the matched cache data is empty, determining the intermediate data as the processed service data.

7. The method of claim 5, wherein the generating the processed service data according to the matched cache data and the intermediate data comprises:

and in response to determining that the matched cache data is not empty, generating the processed service data according to the intermediate data and the matched cache data.

8. The method of claim 7, wherein the statistical characteristics include a number of data and a sum of data; and

the generating the processed service data according to the intermediate data and the matched cache data includes:

determining the sum of the value characterizing the number of data in the intermediate data and the value characterizing the number of data in the matched cache data as the value characterizing the number of data in the processed service data;

and determining the sum of the value characterizing the sum of data in the intermediate data and the value characterizing the sum of data in the matched cache data as the value characterizing the sum of data in the processed service data.

9. The method of claim 8, wherein the generating the processed service data according to the intermediate data and the matched cache data further comprises:

determining the ratio of the value characterizing the sum of data in the processed service data to the value characterizing the number of data in the processed service data as the value characterizing the mean of the data in the processed service data.
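By way of non-limiting illustration of claims 8 and 9, the merging of counts and sums and the derivation of the mean might be sketched as follows. The `Stats` class, its field names, and the `merge` function are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    count: int    # number of data
    total: float  # sum of data

    @property
    def mean(self) -> float:
        # Claim 9: mean = sum of data / number of data.
        # Raises ZeroDivisionError for an empty summary.
        return self.total / self.count


def merge(cached: Stats, intermediate: Stats) -> Stats:
    # Claim 8: counts add, and sums add.
    return Stats(
        count=cached.count + intermediate.count,
        total=cached.total + intermediate.total,
    )


cached = Stats(count=3, total=12.0)       # summary already in the cache
intermediate = Stats(count=2, total=8.0)  # summary of the new data
processed = merge(cached, intermediate)
```

With these inputs the processed summary has a count of 5, a sum of 20.0, and therefore a mean of 4.0.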

10. The method of claim 8, wherein the statistical characteristics further comprise a mean of the data and a variance of the data; and

the generating the processed service data according to the intermediate data and the matched cache data further includes:

generating a value characterizing the variance of the data in the processed service data according to: the value characterizing the sum of data in the intermediate data; the value characterizing the number of data, the value characterizing the mean of the data, and the value characterizing the variance of the data in the matched cache data; and the value characterizing the number of data and the value characterizing the mean of the data in the processed service data.
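By way of non-limiting illustration of claim 10, one way to obtain the merged variance without revisiting detail data is sketched below. The claim does not recite a formula; the sketch assumes the standard pairwise combination of population variances, and the `Stats` class, its field names, and the `summarize` and `merge` functions are hypothetical. When the intermediate data represents a single record, its variance is zero and its sum equals the record value, so the inputs reduce to those listed in the claim.

```python
from dataclasses import dataclass
from statistics import pvariance
from typing import Sequence


@dataclass(frozen=True)
class Stats:
    count: int
    total: float
    variance: float  # population variance (assumed convention)

    @property
    def mean(self) -> float:
        return self.total / self.count


def summarize(values: Sequence[float]) -> Stats:
    data = [float(v) for v in values]
    return Stats(len(data), sum(data), pvariance(data))


def merge(cached: Stats, inter: Stats) -> Stats:
    count = cached.count + inter.count
    total = cached.total + inter.total
    mean = total / count
    # Each side contributes its own spread plus the squared shift of
    # its mean relative to the merged mean, weighted by its count.
    m2 = (
        cached.count * (cached.variance + (cached.mean - mean) ** 2)
        + inter.count * (inter.variance + (inter.mean - mean) ** 2)
    )
    return Stats(count, total, m2 / count)


cached = summarize([2.0, 4.0, 6.0])  # stands in for the matched cache data
inter = summarize([8.0])             # a single new record
processed = merge(cached, inter)
direct = pvariance([2.0, 4.0, 6.0, 8.0])  # recomputed from detail data
```

The merged variance agrees with `direct`, up to floating-point rounding, even though `merge` never sees the original four values.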

11. A method for querying data, comprising:

acquiring data query information, wherein the data query information comprises the range and the statistical characteristics of data to be queried;

acquiring at least one piece of cache data matched with the range of the data to be queried, wherein the cache data is used for representing the statistical characteristics of a data set associated with the data query information;

and processing the acquired data according to the statistical characteristics to generate query result information matched with the data query information, wherein the query result information is used for representing the statistical characteristics of the data in the range of the data to be queried.

12. The method of claim 11, wherein the range of the data to be queried comprises a time range, the cache data corresponds to a time window identifier, and the time period indicated by the time window identifier corresponding to the at least one piece of cache data matched with the range of the data to be queried is consistent with the time range.
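By way of non-limiting illustration of claims 11 and 12, a query over a time range might be answered from per-window cache entries as sketched below. The dictionary keyed by window start (in seconds), the half-open range `[start, end)`, the `count` and `sum` keys, and the `query` function are assumptions made for the example.

```python
from typing import Dict, Optional, Union

Number = Union[int, float]


def query(
    cache: Dict[int, Dict[str, Number]],
    start: int,
    end: int,
    feature: str,
) -> Optional[Number]:
    """Answer a statistical query from window summaries in [start, end)."""
    # Claim 12: pick the cache entries whose windows fall in the range.
    selected = [s for w, s in cache.items() if start <= w < end]
    if not selected:
        return None
    # Claim 11: process the summaries according to the requested feature.
    count = sum(s["count"] for s in selected)
    total = sum(s["sum"] for s in selected)
    if feature == "count":
        return count
    if feature == "sum":
        return total
    if feature == "mean":
        return total / count
    raise ValueError(f"unsupported feature: {feature}")


# Hypothetical cache data: three 60-second windows.
cache = {
    0: {"count": 2, "sum": 10.0},
    60: {"count": 3, "sum": 30.0},
    120: {"count": 1, "sum": 5.0},
}

mean_first_two = query(cache, 0, 120, "mean")
sum_last_two = query(cache, 60, 180, "sum")
nothing = query(cache, 300, 360, "count")
```

The amount of data read depends on the number of windows in the range rather than on the number of underlying records.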

13. An apparatus for processing data, comprising:

a first obtaining unit configured to obtain service data to be processed;

a second obtaining unit configured to obtain cache data matched with the service data to be processed, wherein the matched cache data is used for representing the statistical characteristics of a data set associated with the service data to be processed;

and a first generating unit configured to generate processed service data based on the service data to be processed and the matched cache data, wherein the processed service data is used for representing the statistical characteristics of a set formed by the service data to be processed and a data set associated with the service data to be processed.

14. An apparatus for querying data, comprising:

a third acquisition unit configured to acquire data query information, wherein the data query information comprises the range and the statistical characteristics of the data to be queried;

a fourth acquisition unit configured to acquire at least one piece of cache data matched with the range of the data to be queried, wherein the cache data is used for representing the statistical characteristics of a data set associated with the data query information;

and a second generation unit configured to process the acquired data according to the statistical characteristics and generate query result information matched with the data query information, wherein the query result information is used for representing the statistical characteristics of the data in the range of the data to be queried.

15. An electronic device, comprising:

one or more processors;

a storage device having one or more programs stored thereon;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-12.

16. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-12.