WO2017016423A1 - Method and apparatus for updating real-time newly added data - Google Patents

Method and apparatus for updating real-time newly added data

Info

Publication number
WO2017016423A1
Authority
WO
WIPO (PCT)
Prior art keywords
bucket
data
real
instance
indicator factor
Prior art date
Application number
PCT/CN2016/090633
Other languages
English (en)
French (fr)
Inventor
宋军
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Publication of WO2017016423A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Definitions

  • The present application relates to the field of the Internet, and in particular to a method and apparatus for updating real-time newly added data based on big data.
  • One of the main requirements is to extract, from various types of data, the data elements in the dimensions that the data user cares about, so as to form a new and more streamlined data record.
  • The raw data obtained by a mobile application provider is the download, login and access data from each device and each user, which contains a large number of data elements. The provider, however, only wishes to determine from this data whether the user or device that generated it is a new user or a new mobile device, and to compute new-addition metrics such as the number of new users or new mobile devices.
  • New users and new mobile devices are users or mobile devices that have never used the application before; whether a user or a mobile device is a "new user" or a "new mobile device" must be judged against the stored historical data.
  • Concurrent loading puts great performance pressure on the file system that stores all historical data, and also puts pressure on the memory of each single server, resulting in high performance requirements on both the file system and the individual servers.
  • The present application provides a method and device for updating newly added data in real time, to solve the problems that existing big-data-based real-time update methods place high performance requirements on physical devices, have poor real-time performance, occupy many resources, and do not scale.
  • The real-time new data update method includes:
  • An indicator factor instance corresponding to the real-time added data is composed of the specific values of the data elements in the dimensions of interest of the real-time added data; a dimension of interest is called an indicator factor.
  • When the indicator factor instance is a new addition, the indicator factor instance corresponding to the real-time added data is added to the located bucket.
  • The indicator factor algorithm comprises:
  • Dimension allocation logic, which allocates the dimensions of interest of the real-time added data according to the specific content of that data.
  • An indicator factor instance generation algorithm, which obtains the values of the data elements in each dimension of interest from the real-time added data and combines the values to form the indicator factor instance.
  • The dimension allocation logic directly specifies fixed preset dimensions.
  • The dimension allocation logic includes a to-be-matched set, which contains different indicator factor subsets composed of different dimension information for different occasions.
  • The real-time added data from different occasions is matched against the to-be-matched set according to a predetermined rule to obtain the corresponding indicator factor subset, and the dimensions of interest of the specific real-time added data are obtained from that subset.
  • The values of the data elements in each dimension of interest are obtained from the real-time added data, and the values are combined to form the indicator factor instance; the specific step is:
  • Reading, for each dimension of interest, the value of the real-time added data in that dimension.
  • The bucket storage policy includes a hash bucket algorithm, which includes the following steps:
  • Each hash value can be uniquely assigned to a bucket number of a hash bucket; according to the bucket number, each indicator factor instance can be located in its corresponding bucket.
  • The indicator factor instances are stored in buckets in the indicator factor historical accumulation pool, which is obtained as follows:
  • The bucket-stored indicator factor historical accumulation pool is stored in a distributed file system, which is shared by the servers of the distributed cluster.
  • One or more pieces of real-time added data can be acquired at the same time. After the indicator factor instances are formed from the real-time added data, they are distributed to the servers of the distributed cluster. Each server locates its indicator factor instances in their corresponding buckets, reads the located buckets from the indicator factor historical accumulation pool stored in the distributed file system, and completes the subsequent judgment and the addition of new indicator factor instances. The servers process in parallel.
  • The indicator factor instances are distributed to the servers of the distributed cluster according to a predetermined indicator factor instance distribution algorithm, which includes the following steps:
  • The algorithm can balance the load of each server;
  • A variable is derived, according to a predetermined algorithm, from the values of the data elements in each dimension of interest of the indicator factor instance, and the number of the server corresponding to the indicator factor instance is calculated from it by one of the following methods:
  • After hexadecimal data is obtained from the variable using the md5 algorithm, that data is taken modulo the number of servers in the distributed cluster to obtain the server number; or, after the variable is converted to binary data via its ASCII codes, that data is taken modulo the number of servers in the distributed cluster to obtain the server number.
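  • The two server-number calculations described above can be sketched in Python as follows. This is a minimal sketch, not the application's implementation: the function names, the UTF-8 encoding of the variable before hashing, and the zero-based server numbers produced by the modulo are all assumptions made for illustration.

```python
import hashlib


def server_number_md5(variable: str, server_count: int) -> int:
    """Hash the variable with MD5, read the hexadecimal digest as an
    integer, and take it modulo the number of servers."""
    hex_digest = hashlib.md5(variable.encode("utf-8")).hexdigest()
    return int(hex_digest, 16) % server_count


def server_number_ascii(variable: str, server_count: int) -> int:
    """Treat the ASCII bytes of the variable as one large binary number
    and take it modulo the number of servers."""
    as_integer = int.from_bytes(variable.encode("ascii"), byteorder="big")
    return as_integer % server_count
```

  • Both functions are deterministic, so the same variable always maps to the same server; adding 1 to the result would give server numbers starting at 1, if that numbering is preferred.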
  • The number of the server corresponding to the indicator factor instance is calculated according to a predetermined algorithm, implemented by the following method:
  • The process of locating the indicator factor instance in its corresponding bucket is performed after the indicator factor instance has been distributed to the servers of the distributed cluster.
  • Each server locates the indicator factor instance in its corresponding bucket in the following manner:
  • Each server of the distributed cluster calculates, from the indicator factor instance distributed to it, the bucket number of the bucket to which the instance belongs, and uses the bucket number as the feature value with which the server looks up the bucket corresponding to the indicator factor instance.
  • Each server reads the bucket located by the indicator factor instance from the indicator factor historical accumulation pool stored in the distributed file system: using the bucket number determined for its assigned indicator factor instance, the server looks up the corresponding bucket in the pool and reads the data in the bucket, specifically as follows:
  • Adding the indicator factor instance corresponding to the real-time added data to the located bucket includes the following steps:
  • The indicator factor instance is written to the bucket corresponding to the bucket number, so that the indicator factor historical accumulation pool is synchronized to the latest data;
  • The new data indicator statistics table is updated; the table includes statistics on the number of new users and the number of new mobile devices.
  • The application further provides a real-time new data update device, including:
  • A real-time data acquisition unit, configured to obtain real-time new data, where the real-time new data includes data elements of different dimensions;
  • An indicator factor instance obtaining unit, configured to obtain the indicator factor instance corresponding to the real-time added data according to a predetermined indicator factor algorithm; the indicator factor instance is composed of the specific values of the data elements in the dimensions of interest of the real-time added data, and a dimension of interest is called an indicator factor;
  • An indicator factor instance locating unit, configured to locate the indicator factor instance in its corresponding bucket based on a preset bucket storage policy;
  • A real-time judging unit, configured to read the bucket located by the indicator factor instance, search the bucket for the indicator factor instance, and determine whether the existing data of the bucket includes the same indicator factor instance;
  • An indicator factor historical accumulation pool update unit, configured to perform the corresponding update according to the judgment result of the real-time judging unit: if the result is yes, the bucket data is not processed; if the result is no, the indicator factor instance is a new addition, and the indicator factor instance corresponding to the real-time added data is added to the located bucket.
  • The indicator factor instance obtaining unit includes:
  • A dimension allocation logic sub-unit, configured to allocate the dimensions of interest of the real-time added data according to the specific content of that data;
  • An indicator factor instance generating sub-unit, configured to obtain the values of the data elements in each dimension of interest from the real-time added data and combine the values to form the indicator factor instance.
  • The device includes a bucket storage algorithm unit, which includes:
  • A historical data acquisition sub-unit, configured to obtain the historical data that existed before the arrival of the real-time new data;
  • A bucket number setting sub-unit, configured to set a reasonable number of buckets N for the bucket storage policy according to the amount of historical data, where each bucket is a hash bucket and the maximum storage capacity threshold of each hash bucket can be adjusted through the number of buckets N;
  • A bucket number allocation sub-unit, configured to allocate a bucket number to each hash bucket;
  • An indicator factor instance pre-processing sub-unit, configured to obtain, for each indicator factor instance, a variable on which operations can be performed directly;
  • A bucketing sub-unit, configured to apply a hash algorithm to the variable to obtain a hash value, where each hash value can be assigned to a bucket number of a hash bucket; according to the bucket number, each indicator factor instance is located in its corresponding bucket.
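  • The text does not say how the bucket number setting sub-unit chooses N. One plausible reading, shown here only as a sketch under assumptions, is to pick the smallest N that keeps the average bucket under a chosen capacity threshold when instances spread evenly. The function name and the threshold values below are hypothetical.

```python
def choose_bucket_count(history_size: int, max_per_bucket: int) -> int:
    """Return the smallest N such that history_size / N does not exceed
    max_per_bucket, assuming a uniform spread across buckets.
    Always returns at least one bucket."""
    if max_per_bucket <= 0:
        raise ValueError("max_per_bucket must be positive")
    # Integer ceiling division avoids floating-point rounding on large counts.
    needed = (history_size + max_per_bucket - 1) // max_per_bucket
    return max(1, needed)


# Hypothetical sizing: 300 million historical instances, 1 million per bucket.
n_buckets = choose_bucket_count(300_000_000, 1_000_000)
```

  • Raising N lowers the expected bucket size, which matches the statement that the capacity threshold of each hash bucket can be adjusted through N.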
  • The device includes an indicator factor historical accumulation pool forming unit, which includes:
  • A historical data acquisition sub-unit, configured to obtain the historical data that existed before the arrival of the real-time new data;
  • An indicator factor instance sub-unit, configured to obtain, using the indicator factor algorithm, the indicator factor instance of each piece of historical data;
  • An indicator factor historical accumulation pool establishing sub-unit, configured to locate the indicator factor instances of the historical data in their corresponding buckets based on the preset bucket storage policy and to store indicator factor instances belonging to different buckets in different buckets; each bucket is assigned a bucket number, establishing the bucket-stored indicator factor historical accumulation pool.
  • An electronic device, including:
  • A memory configured to store real-time new data update means which, when executed by the processor, performs the following steps:
  • An indicator factor instance corresponding to the real-time added data is composed of the specific values of the data elements in the dimensions of interest of the real-time added data; a dimension of interest is called an indicator factor.
  • When the indicator factor instance is a new addition, the indicator factor instance corresponding to the real-time added data is added to the located bucket.
  • The real-time new data update method stores the historical data in different buckets, in the form of indicator factor instances, based on the preset bucket storage policy. The indicator factor instance corresponding to the real-time newly added data is located in its corresponding bucket by the same rules, and the located bucket is then searched to determine whether the instance is new and to update accordingly.
  • When the method is applied in a distributed system, a server in the cluster does not need to wait for all the historical data to be fully loaded before real-time calculation can begin. Each server only needs to load part of the bucket data in real time to complete the update accurately, which reduces the pressure on the file system during initialization and the load on each server, lowering the performance requirements for physical devices.
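  • The benefit described above, loading only the buckets that incoming instances touch, can be illustrated with the sketch below. It is an illustration under assumptions: the class and function names are hypothetical, an in-memory dictionary stands in for the distributed file system, and writing updated buckets back to storage is omitted.

```python
class BucketCache:
    """Loads a bucket from the backing store only when it is first needed."""

    def __init__(self, stored_pool: dict[int, set[str]]):
        # stored_pool stands in for the pool kept in the distributed file system.
        self._stored_pool = stored_pool
        self.loaded: dict[int, set[str]] = {}

    def bucket(self, number: int) -> set[str]:
        if number not in self.loaded:
            # Copy to mimic reading the bucket's data into server memory.
            self.loaded[number] = set(self._stored_pool.get(number, set()))
        return self.loaded[number]


def is_new_and_update(cache: BucketCache, bucket_no: int, instance: str) -> bool:
    """Return True if the instance was not yet in its bucket (and add it)."""
    bucket = cache.bucket(bucket_no)
    if instance in bucket:
        return False      # already recorded: the bucket is left untouched
    bucket.add(instance)  # new addition: record it in the located bucket
    return True


stored = {1: {"appId1_userA_dev1"}, 2: {"appId2_userB_dev2"}, 3: set()}
cache = BucketCache(stored)
first = is_new_and_update(cache, 1, "appId1_userA_dev1")   # seen before
second = is_new_and_update(cache, 1, "appId1_userC_dev3")  # new addition
```

  • After these two calls only bucket 1 has been loaded; buckets 2 and 3 were never read.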
  • The method of the present application improves real-time performance and reduces resource occupancy.
  • The accuracy and real-time performance of the newly added data update can be maintained by upgrading or expanding the physical devices, which makes the method provided by the application scalable.
  • FIG. 1 is a flow chart of an embodiment of a real-time new data update method of the present application.
  • FIG. 2 is a flow chart for establishing the indicator factor historical accumulation pool.
  • FIG. 3 is a flow diagram of an embodiment of an indicator factor instance distribution algorithm.
  • FIG. 4 is a schematic diagram of an embodiment of a real-time new data update device.
  • A real-time new data update method and a real-time new data update apparatus are provided, and are described in detail in the following embodiments.
  • This embodiment assumes an application scenario in which a mobile application provider on the mobile Internet derives new-user and new-mobile-device data from various access data.
  • The following description mainly refers to this application scenario, while also taking other application scenarios into account.
  • FIG. 1 is a flowchart of an embodiment of a method for updating real-time data in the present application.
  • the method includes the following steps:
  • Step 101: Acquire real-time added data, where the real-time added data includes data elements of different dimensions.
  • The new data stream is first read in real time, and all real-time new data at the current time is acquired from the stream.
  • Each piece of real-time new data can contain multiple data elements, each reflecting a different aspect of its content.
  • The different aspects reflected by the data elements are called dimensions; a dimension is roughly equivalent to one field in a data record made up of multiple fields.
  • A dimension is an abstraction of what a data element describes; a piece of new data typically includes data elements of multiple dimensions.
  • The real-time added data includes at least data elements in dimensions such as the application ID, the user ID, and the mobile device ID.
  • Step 102: Obtain the indicator factor instance corresponding to the real-time added data according to a predetermined indicator factor algorithm; the indicator factor instance is composed of the specific values of the data elements in the dimensions of interest of the real-time added data, and a dimension of interest is called an indicator factor.
  • The object processed in this application is real-time new data based on big data.
  • Each piece of data may have a complex structure, but only part of its information is of interest, and that part is enough to judge whether the real-time new data counts as a new addition to the statistical indicator. The application therefore generates an indicator factor instance for each piece of data; the instance is composed of the specific values of the data elements in the dimensions of interest of that piece of real-time newly added data, and a dimension of interest is called an indicator factor.
  • The application ID, the user ID, and the mobile device ID are the dimensions of interest, and the indicator factor instance corresponding to the real-time newly added data is composed of the values of the data elements in those dimensions; the specific composition method is implemented by a predetermined indicator factor algorithm.
  • The indicator factor algorithm in step 102 has two parts: dimension allocation logic and an indicator factor instance generation algorithm.
  • The dimension allocation logic allocates the dimensions of interest according to the specific content of the real-time newly added data; the indicator factor instance generation algorithm obtains the values of the data elements in each dimension of interest from the real-time added data and combines these values to form the indicator factor instance.
  • The present application provides dimension allocation logic to allocate the dimensions of interest of real-time added data.
  • The dimension allocation logic can directly specify fixed preset dimensions. For example, if only the relationship between the application and the user is of interest, only the application ID and user ID dimensions need to be obtained from the real-time newly added data, so the dimension allocation logic is set to directly specify the application ID and the user ID as the dimensions of interest, that is, the indicator factors, and the specific values of the data elements in these two dimensions constitute the indicator factor instance. Similarly, if the relationship between the application, the user, and the mobile device is of interest, the application ID, user ID, and mobile device ID dimensions need to be obtained from the newly added data, so the dimension allocation logic is set to directly specify the application ID, the user ID, and the mobile device ID as the dimensions of interest.
  • The dimension allocation logic can also allocate the dimensions of interest of newly added data by means of a to-be-matched set.
  • Indicator factor subsets composed of different dimension information are defined under the to-be-matched set. The real-time added data from different occasions is matched against the to-be-matched set according to a predetermined rule to obtain the corresponding indicator factor subset, and the dimensions of interest of the specific real-time added data are obtained from that subset. For example, a subset consisting of the application ID and the user ID, and a subset consisting of the application ID, the user ID, and the mobile device ID, are set under the to-be-matched set; the real-time added data is matched to one of the subsets according to its application ID category, and the matching result determines the dimensions of interest for that data.
  • The dimension allocation logic can thus flexibly allocate the dimensions of interest of newly added data, so that indicator factor instances with different dimensions are obtained for real-time newly added data from different occasions; or, for different purposes, indicator factor instances of different levels and granularity are obtained for real-time newly added data from the same occasion, so that the newly added data can be analyzed and updated as intended.
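  • The to-be-matched set described above can be pictured as a lookup from an application category to an indicator factor subset. The sketch below is illustrative only: the field names (app_id, user_id, device_id), the category labels, and the application-to-category table are hypothetical, since the text does not specify the matching rule in detail.

```python
# Hypothetical to-be-matched set: category -> indicator factor subset.
TO_BE_MATCHED = {
    "user_only": ("app_id", "user_id"),
    "user_and_device": ("app_id", "user_id", "device_id"),
}

# Hypothetical predetermined rule: the category each application ID falls into.
APP_CATEGORY = {
    "appId1": "user_and_device",
    "appId2": "user_only",
}


def allocate_dimensions(record: dict) -> tuple:
    """Return the dimensions of interest for one piece of real-time added data."""
    category = APP_CATEGORY[record["app_id"]]
    return TO_BE_MATCHED[category]


dims = allocate_dimensions(
    {"app_id": "appId2", "user_id": "userNick", "device_id": "deviceId1"}
)
```

  • The fixed-dimension variant of the logic corresponds to a function that ignores the record and always returns the same tuple.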
  • A specific description of the indicator factor instance generation algorithm is as follows.
  • The indicator factor instance generation algorithm obtains the values of the data elements in each dimension of interest from the real-time added data and combines the values to form the indicator factor instance.
  • The combination method can vary with the situation; for example, the data elements may be spliced directly in order, recorded as fields, or recorded in some other combination.
  • The values of the data elements in the dimensions of interest (the application ID, the user ID, and the mobile device ID) are spliced in the order application ID, user ID, mobile device ID. The spliced string constitutes the indicator factor instance corresponding to the real-time added data, and can be used directly as the variable in the bucket storage strategy of the present application.
  • For example, for one piece of real-time data, the application ID value appId1, the user ID value userNick, and the mobile device ID value deviceId1 are spliced to give the string appId1_userNick_deviceId1, which becomes the value of the indicator factor instance corresponding to that real-time added data.
  • The indicator factor instance contains only the information in the dimensions of interest of the real-time added data. That information is enough to determine whether the real-time new data counts as a new addition to the statistical indicator, so redundant information in the original data is filtered out; when the amount of data is very large, this significantly reduces the resources used by real-time calculation and saves a great deal of storage space.
  • The present application may also form the indicator factor instance corresponding to real-time added data in other ways, as long as a unique value can be obtained from the indicator factor instance and that value can be used directly as the variable in the bucket storage policy of the present application.
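  • The splicing described above can be sketched in a few lines of Python. The underscore separator follows the appId1_userNick_deviceId1 example in the text; the function name, the dictionary field names, and the extra timestamp field are assumptions added for illustration.

```python
def make_instance(record: dict, dimensions: tuple, sep: str = "_") -> str:
    """Splice the values of the dimensions of interest, in order, into one string."""
    return sep.join(str(record[dim]) for dim in dimensions)


record = {
    "app_id": "appId1",
    "user_id": "userNick",
    "device_id": "deviceId1",
    "timestamp": 1469000000,  # not a dimension of interest, so it is dropped
}
instance = make_instance(record, ("app_id", "user_id", "device_id"))
```

  • A separator that cannot occur inside the values avoids two different records splicing to the same string; whether the original system guards against this is not stated.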
  • Step 103: Locate the indicator factor instance in its corresponding bucket based on a preset bucket storage policy.
  • The focus of this application is real-time performance.
  • A large amount of real-time new data needs to be distributed to multiple servers in the distributed cluster.
  • Without a bucket storage strategy, each server must, during real-time processing and new-addition judgment, load into memory the data set consisting of the full historical data.
  • Historical data is huge, possibly hundreds of millions of records, and each record includes data elements of multiple dimensions.
  • The memory of multiple servers has to load the historical data set in parallel, and calculation can start only after the full load completes, which not only places high performance requirements on physical devices but also causes a large calculation delay.
  • This application uses a bucket storage strategy to divide the full historical data into multiple buckets for storage.
  • When making the real-time new-addition judgment, the indicator factor instance corresponding to the real-time new data is likewise located in its corresponding bucket, so in the distributed system each server only needs to load the required bucket data to complete the judgment accurately. This reduces the performance requirements on physical devices, improves real-time performance, reduces computational delay, and lowers resource utilization.
  • All the buckets are stored in the distributed file system, which is shared by the servers of the distributed cluster.
  • The following steps are all implemented on the distributed architecture.
  • In step 103, before the indicator factor instance is located in its corresponding bucket, the historical data existing at the current time is stored in buckets according to the preset bucket storage policy, establishing the indicator factor historical accumulation pool.
  • FIG. 2 is a flow chart for establishing the indicator factor historical accumulation pool, including the following steps:
  • Step 201: Obtain the historical data that existed before the arrival of the real-time new data.
  • The data set of original historical data from before the arrival of the current real-time added data is used as the existing historical data. It may include a large number of historical records, for example on the order of hundreds of millions.
  • Step 202: For each piece of historical data, obtain the corresponding indicator factor instance using the indicator factor algorithm of step 102.
  • The dimension allocation logic is first set to directly specify the application ID, the user ID, and the mobile device ID as the dimensions of interest; the indicator factor instance generation algorithm is then used to generate the indicator factor instance corresponding to the historical data.
  • Specifically, the values of the data elements in the dimensions of interest are obtained from the historical data and spliced as characters in the order application ID, user ID, mobile device ID. The spliced string constitutes the indicator factor instance corresponding to the historical data, and can be operated on directly as the variable in the bucket storage strategy of the present application.
  • Step 203: Establish the bucket-stored indicator factor historical accumulation pool based on the preset bucket storage policy.
  • The bucket storage policy is implemented with a hash algorithm; establishing the bucket-stored indicator factor historical accumulation pool mainly involves the following steps:
  • Each bucket is assigned a bucket number, which is used as the index value of the bucket.
  • The N buckets are numbered 1, 2, 3, ..., N, and each number indexes its corresponding bucket.
  • Determining an appropriate hash function means that the function must hash the values of the indicator factor instances uniformly across the bucket numbers; the maximum storage capacity threshold of each bucket can also be adjusted through the number of buckets N.
  • A multiplication hash is applied to the value of each indicator factor instance to obtain a series of hash values; each hash value can be uniquely assigned to the bucket number of a certain bucket, and the number of indicator factor instances corresponding to each bucket number is roughly equal.
  • That each hash value can be uniquely assigned to the bucket number of a certain bucket means that the hash value obtained by the hash operation must be mapped to a bucket number by some algorithm. Various mapping methods are possible, but each hash value must correspond to exactly one bucket number.
  • The hash value is taken modulo the number of buckets N, and the result is used as the bucket number to which the hash value belongs.
  • The bucket number is the feature value that indexes the bucket. From the bucket number, the bucket to which an indicator factor instance belongs can be retrieved, so that the indicator factor instance is stored in the corresponding bucket.
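  • A minimal Python sketch of this hash-and-modulo bucketing follows. The text names a multiplication hash without giving its form, so the multiplier 31, the 32-bit mask, and the function names are assumptions. The sketch also adds 1 to the modulo result so that bucket numbers run from 1 to N as in the numbering above; using the raw modulo result (0 to N-1) would work equally well.

```python
def multiplicative_hash(value: str) -> int:
    """A simple multiplicative string hash (multiplier 31, kept to 32 bits)."""
    h = 0
    for ch in value:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_number(instance: str, n_buckets: int) -> int:
    """Map an indicator factor instance to exactly one bucket number in 1..N."""
    return multiplicative_hash(instance) % n_buckets + 1


def build_pool(instances, n_buckets: int) -> dict[int, set[str]]:
    """Store each historical indicator factor instance in its bucket."""
    pool: dict[int, set[str]] = {}
    for instance in instances:
        pool.setdefault(bucket_number(instance, n_buckets), set()).add(instance)
    return pool


pool = build_pool(["a", "ab", "a"], n_buckets=10)
```

  • Because the mapping is deterministic, a real-time instance computed later lands in the same bucket as its historical copy, which is what makes the per-bucket lookup sufficient.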
  • This embodiment is implemented on a distributed architecture. The established indicator factor historical accumulation pool is therefore stored in a distributed file system, and the servers of the distributed cluster operate on the pool through the shared distributed file system.
  • In step 103, locating the indicator factor instance in its corresponding bucket means that, given that the indicator factor historical accumulation pool has already been established, the indicator factor instance corresponding to the real-time newly added data is located in its corresponding bucket according to the preset bucket storage policy.
  • This embodiment is implemented on a distributed architecture, so the step comprises two processes: the indicator factor instances are distributed to the servers of the distributed cluster according to a predetermined indicator factor instance distribution algorithm; each server of the distributed cluster then locates its indicator factor instances in their corresponding buckets according to the preset bucket storage policy.
  • The indicator factor instance distribution algorithm needs to keep the memory load of each server balanced; for example, the number of buckets loaded by each server is balanced, or the number of indicator factor instances assigned to each server is balanced over a period of time, or identical indicator factor instances are assigned to the same server, and so on.
  • FIG. 3 is a flow chart of an embodiment of the indicator factor instance distribution algorithm. The specific steps are:
  • Step 301: Number each server of the distributed cluster so that buckets can be allocated to the servers. For example, if the distributed cluster has M servers, they are numbered 1, 2, 3, ..., M in order of IP address.
  • Step 302: Map the bucket numbers of all buckets evenly onto the servers of the distributed cluster.
  • Specifically, the number of buckets loaded on each server is set to N/M, and the bucket numbers of all buckets are allocated evenly to the servers of the distributed cluster in bucket number order and server number order.
  • Each server accesses the buckets corresponding to its assigned bucket numbers.
  • Step 303: From the indicator factor instance corresponding to the real-time newly added data, calculate the bucket number of the bucket to which the instance belongs according to the bucket storage policy.
  • Specifically, the value of the indicator factor instance is hashed with the hash function of step 203 to obtain a unique value for the instance; that value is taken modulo the number of buckets N, and the result is used as the bucket number of the bucket to which the indicator factor instance belongs.
  • Step 304: Obtain the number of the server corresponding to the indicator factor instance from the mapping between bucket numbers and servers established in step 302.
  • Step 305: Distribute the indicator factor instance to the corresponding server in the distributed cluster according to the server number.
  • each server of the distributed cluster server so that it can be allocated to each server by bucket. For example, there are M servers in a distributed cluster, and the numbers of each server are set to 1, 2, 3, ..., M in order of the IP address.
  • the second specific implementation method of the indicator factor instance distribution algorithm so that each server is assigned an indicator
  • the number of child instances is balanced over a period of time, and the same metric factor instances are assigned to the same server, thereby balancing the load on each server's memory.
  • the distributed cluster server is configured by the server of the distributed cluster server to locate the indicator factor instance in the corresponding bucket according to the preset bucket storage policy.
  • Each server according to the distributed indicator factor instance, calculates a bucket number of the bucket to which the indicator factor instance belongs according to the bucket storage policy, and the bucket number is used as a proxy corresponding to the index factor instance distributed on each server.
  • the characteristic value of the bucket is used as a proxy corresponding to the index factor instance distributed on each server.
  • Step 104: Read the bucket to which the indicator factor instance is located, search the bucket on the basis of the indicator factor instance, and determine whether the existing data of the bucket contains the same indicator factor instance.
  • The specific embodiment of the present application is deployed on a distributed architecture. In step 104, each server reads the bucket to which the indicator factor instance is located from the indicator factor historical accumulation pool stored in the distributed file system into its own memory.
  • Each server then completes the subsequent determination and other operations in memory, and the servers process in parallel.
  • In this embodiment, reading the located bucket means that each server, using the bucket number obtained in step 103, finds the corresponding bucket in the indicator factor historical accumulation pool stored in the distributed file system and reads its data. This comprises: 1) checking whether the current server has already read the bucket with that bucket number, and if so using the already-read bucket data directly; 2) otherwise searching the pool for the bucket with that bucket number; 3) if the bucket does not exist, creating a new, empty bucket for the indicator factor instance; 4) loading the retrieved or newly created bucket data into the memory of the server to which the instance was distributed; 5) once the bucket is loaded, searching it on the basis of the indicator factor instance.
  • Step 105: If step 104 determines that the existing data of the bucket contains the same indicator factor instance, the bucket data is not processed.
  • In this embodiment, the indicator factor instance carries the dimension-of-interest information of the real-time newly added data, and the indicator factor instances stored in the buckets of the historical accumulation pool carry the dimension-of-interest information of the historical data. If the located bucket already contains the same indicator factor instance, the user ID or mobile device ID under the corresponding application ID has appeared before and is neither a new user nor a new mobile device, so neither the indicator factor historical accumulation pool nor the newly-added-data indicator statistics table needs to be updated.
  • Step 106: If step 104 determines that the existing data of the bucket does not contain the same indicator factor instance, the indicator factor instance is newly added, and the indicator factor instance corresponding to the real-time newly added data is added to the located bucket.
  • For this embodiment, this determination indicates that the user ID or mobile device ID under the corresponding application ID has never appeared before; it is a new user or a new mobile device, that is, a newly added indicator that needs to be updated and counted.
  • In this embodiment, the indicator factor instance is added to the located bucket as follows:
  • 1) the indicator factor instance is updated into the bucket to which it belongs in the memory of the server to which it was distributed;
  • 2) in the indicator factor historical accumulation pool, the indicator factor instance is updated into the bucket corresponding to the bucket number, so that the pool is synchronized to the latest data;
  • 3) the newly-added-data indicator statistics table, which records the number of new users and the number of new mobile devices, is updated.
  • In the above embodiments, the preset bucket storage policy allows the maximum capacity threshold of a bucket to be adjusted through the number of buckets N, which ensures load balancing across the servers to a certain extent. In addition, the distribution algorithm maps bucket numbers evenly onto the servers of the distributed cluster in advance, so that the number of buckets loaded on each server is balanced, which further ensures load balancing. Because this mapping stays fixed for a period of time, it also avoids frequent reloading of buckets from the historical accumulation pool by the servers. The update method of the present application therefore uses resources more reasonably, scales better, and achieves genuinely distributed processing.
  • FIG. 4 is a schematic diagram of an embodiment of an apparatus for updating real-time newly added data. Because the apparatus embodiment is substantially similar to the method embodiment, it is described only briefly; for related details, refer to the description of the method embodiment.
  • The apparatus embodiment described below is merely illustrative.
  • The apparatus of this embodiment includes: an indicator factor historical accumulation pool forming unit 1, a real-time newly added data acquiring unit 2, an indicator factor instance obtaining unit 3, an indicator factor instance locating unit 4, a new-addition real-time judging unit 5, and an indicator factor historical accumulation pool updating unit 6.
  • The indicator factor historical accumulation pool forming unit 1 is configured to establish, from historical data, an indicator factor historical accumulation pool stored according to the bucket storage policy.
  • The real-time newly added data acquiring unit 2 is configured to acquire one or more items of real-time newly added data at the same moment, each item including data elements of different dimensions.
  • The indicator factor instance obtaining unit 3 is configured to obtain, according to a predetermined indicator factor algorithm, the indicator factor instance corresponding to each item of real-time newly added data; the indicator factor instance is composed of the specific values of the data elements under the dimensions of interest of the real-time newly added data, and the dimensions of interest are referred to as indicator factors.
  • The indicator factor instance locating unit 4 is configured to locate the indicator factor instance to its corresponding bucket based on a preset bucket storage policy.
  • The new-addition real-time judging unit 5 is configured to read the bucket to which the indicator factor instance is located, search the bucket on the basis of the indicator factor instance, and determine whether the existing data of the bucket contains the same indicator factor instance.
  • The indicator factor historical accumulation pool updating unit 6 is configured to perform update processing according to the determination result of the new-addition real-time judging unit: if the result is yes, the bucket data is not processed; if the result is no, the indicator factor instance is newly added, and the indicator factor instance corresponding to the real-time newly added data is added to the located bucket.
  • In this embodiment, the real-time newly added data acquiring unit 2, the indicator factor instance obtaining unit 3, the indicator factor instance locating unit 4, the new-addition real-time judging unit 5, and the indicator factor historical accumulation pool updating unit 6 jointly complete the update processing of the real-time newly added data. These units run on the premise that the forming unit 1 has finished establishing the indicator factor historical accumulation pool; units 5 and 6 perform the new-addition determination and the update processing on the basis of that pool.
  • The indicator factor historical accumulation pool forming unit 1 includes a historical data acquiring sub-unit 1-1, an indicator factor instance sub-unit 1-2, and an indicator factor historical accumulation pool establishing sub-unit 1-3.
  • The historical data acquiring sub-unit 1-1 is configured to acquire the historical data that exists before the real-time newly added data arrives.
  • The indicator factor instance sub-unit 1-2 is configured to obtain, through the indicator factor algorithm, the indicator factor instance of each item of the historical data.
  • The indicator factor historical accumulation pool establishing sub-unit 1-3 is configured to locate, based on the preset bucket storage policy, the indicator factor instances of the historical data to their corresponding buckets and to store indicator factor instances that belong to different buckets in different buckets; each bucket is assigned a bucket number, thereby establishing the bucket-stored indicator factor historical accumulation pool.
  • Units 1 through 6 above all operate on the same indicator factor algorithm and the same bucket storage policy. Both are described in detail in the method embodiment and are not repeated here; for the specific steps, refer to the method embodiment.
  • Optionally, the apparatus is deployed in a distributed system: the bucket-stored indicator factor historical accumulation pool is stored in a distributed file system shared by the distributed cluster servers, and the new-addition real-time judging module is deployed on the distributed cluster servers.
  • The present application further provides an electronic device, which includes: a display; a processor; and a memory configured to store an apparatus for updating real-time newly added data which, when executed by the processor, performs the following steps:
  • acquiring real-time newly added data, the real-time newly added data comprising data elements of different dimensions;
  • obtaining, according to a predetermined indicator factor algorithm, an indicator factor instance corresponding to the real-time newly added data, the indicator factor instance being composed of the specific values of the data elements under the dimensions of interest of the real-time newly added data, the dimensions of interest being referred to as indicator factors;
  • locating the indicator factor instance to its corresponding bucket based on a preset bucket storage policy;
  • reading the bucket to which the indicator factor instance is located, searching the bucket on the basis of the indicator factor instance, and determining whether the existing data of the bucket contains the same indicator factor instance;
  • if the determination result is yes, leaving the bucket data unprocessed;
  • if the determination result is no, treating the indicator factor instance as newly added and adding the indicator factor instance corresponding to the real-time newly added data to the located bucket.
  • In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • The memory may include non-persistent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM).
  • Memory is an example of a computer-readable medium.
  • Computer-readable media include persistent and non-persistent, removable and non-removable media, and can store information by any method or technology.
  • The information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
  • As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
  • Those skilled in the art should understand that embodiments of the present application can be provided as a method, a system, or a computer program product.
  • Accordingly, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
  • Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

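A method and apparatus for updating real-time newly added data, wherein the method comprises: acquiring real-time newly added data, the real-time newly added data comprising data elements of different dimensions (S101); obtaining, according to a predetermined indicator factor algorithm, an indicator factor instance corresponding to the real-time newly added data, the indicator factor instance being composed of the specific values of the data elements under the dimensions of interest of the real-time newly added data, the dimensions of interest being referred to as indicator factors (S102); locating the indicator factor instance to its corresponding bucket based on a preset bucket storage policy (S103); reading the bucket to which the indicator factor instance is located, searching the bucket on the basis of the indicator factor instance, and determining whether the existing data of the bucket contains the same indicator factor instance (S104); and if not, treating the indicator factor instance as newly added and adding the indicator factor instance corresponding to the real-time newly added data to the located bucket (S106).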

Description

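Method and apparatus for updating real-time newly added data
This application claims priority to Chinese Patent Application No. 201510455177.5, filed on July 29, 2015 and entitled "Method and apparatus for updating real-time newly added data", which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of the Internet, and in particular to a big-data-based method and apparatus for updating real-time newly added data.
Background
In the era of big data, as data of all kinds accumulates rapidly, collecting, storing, and using data effectively has become an important way for Internet companies to gain a commercial advantage.
Big data is used in many ways. One major requirement is to extract, from various kinds of data, the data elements in the dimensions that the data user cares about, forming new and more compact data records.
For example, in the mobile Internet field, after a mobile application provider releases a new application or promotes an application through channel campaigns, it needs to keep track of the number of new users and new mobile devices, especially in real time. These figures make it possible to estimate possible traffic bursts over a period of time, the reach of the promotion, and the timeliness of that reach, which in turn helps the provider keep the application running normally, judge the value of the promotion promptly, and settle promotion costs. In this scenario, the raw data obtained by the provider is download, login, and access data from individual devices and users. This data contains a large number of data elements, but the provider only wants to determine from it whether the user or device that produced the data is a new user or a new mobile device, and to compile statistics on these newly added indicators.
A new user or new mobile device is a user or mobile device that has never used the application before. Whether a given user or mobile device is a "new user" or "new mobile device" must be determined from the stored historical data.
At present, as the number of applications grows and the user base of each application becomes ever larger, the historical users, historical devices, and other stored historical data often amount to hundreds of millions of records. This massive historical information is usually stored in a file system. To decide in real time whether incoming real-time data is newly added, a mobile application provider must have a server load all of the historical data from the file system into memory for searching and determination; only then can the determination be both timely and accurate. In addition, a large volume of real-time data, sometimes tens of millions of records, can arrive at the same moment. Distributed cluster servers are currently used to compute over this data in parallel, and the clusters are often large. Consequently, in this big-data setting with real-time requirements, when every server in the cluster concurrently initializes by fully loading the historical data from the file system into its own memory, the following problems arise:
1. Concurrent loading puts heavy performance pressure on the file system that stores all of the historical data and on the memory of each individual server, placing high demands on both file system performance and single-server performance.
2. Full loading takes a long time, delays real-time computation, and occupies resources for long periods, resulting in poor real-time performance and wasted resources.
3. The approach does not scale.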
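Summary
The present application provides a method and apparatus for updating real-time newly added data, to solve the problems that existing real-time update methods in a big-data setting place high performance demands on physical devices, have poor real-time computing performance, occupy many resources, and do not scale.
To solve the above technical problems, the method for updating real-time newly added data provided by the present application includes:
acquiring real-time newly added data, the real-time newly added data comprising data elements of different dimensions;
obtaining, according to a predetermined indicator factor algorithm, an indicator factor instance corresponding to the real-time newly added data, the indicator factor instance being composed of the specific values of the data elements under the dimensions of interest of the real-time newly added data, the dimensions of interest being referred to as indicator factors;
locating the indicator factor instance to its corresponding bucket based on a preset bucket storage policy;
reading the bucket to which the indicator factor instance is located, searching the bucket on the basis of the indicator factor instance, and determining whether the existing data of the bucket contains the same indicator factor instance;
if the determination result is yes, leaving the bucket data unprocessed;
if the determination result is no, treating the indicator factor instance as newly added and adding the indicator factor instance corresponding to the real-time newly added data to the located bucket.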
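Preferably, the indicator factor algorithm includes:
dimension allocation logic, which allocates the dimensions of interest of the real-time newly added data according to the specific content of the real-time newly added data;
an indicator factor instance generation algorithm, which, according to the dimensions of interest, obtains from the real-time newly added data the values of the data elements in each dimension of interest and combines these values to form the indicator factor instance.
Preferably, the dimension allocation logic directly specifies fixed preset dimensions.
Optionally, the dimension allocation logic includes a to-be-matched set that contains different indicator factor subsets, each composed of different dimension information for a different scenario; real-time newly added data from different scenarios is matched against the to-be-matched set according to a predetermined rule to obtain the corresponding indicator factor subset, and the dimensions of interest of that specific real-time newly added data are obtained from that subset.
Preferably, obtaining the values of the data elements in each dimension of interest from the real-time newly added data and combining them to form the indicator factor instance specifically comprises:
reading, for each dimension of interest, the value of the real-time newly added data in that dimension;
concatenating, as characters, the obtained values of the dimensions of interest of the real-time newly added data, the concatenated string constituting the indicator factor instance corresponding to the real-time newly added data, wherein the concatenated string can be operated on directly in the bucket storage policy.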
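Optionally, the bucket storage policy includes a hash bucket algorithm comprising the following steps:
acquiring the historical data that exists before the real-time newly added data arrives;
setting a reasonable number of buckets N for the bucket storage policy according to the volume of the historical data, each bucket being a hash bucket whose maximum storage capacity threshold can be adjusted through the number of buckets N;
assigning a bucket number to each hash bucket;
obtaining, for each indicator factor instance, a corresponding variable that can be operated on directly;
applying a hash algorithm to the variable to obtain a hash value, each hash value belonging uniquely to one bucket number of the hash buckets; according to the bucket number, each indicator factor instance can be located to its corresponding bucket.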
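Preferably, the indicator factor instances are stored in buckets in an indicator factor historical accumulation pool, which is obtained as follows:
acquiring the historical data that exists before the real-time newly added data arrives;
obtaining, through the indicator factor algorithm, the indicator factor instance of each item of the historical data;
locating, based on the preset bucket storage policy, the indicator factor instances of the historical data to their corresponding buckets, and storing indicator factor instances that belong to different buckets in different buckets, each bucket being assigned a bucket number, thereby establishing the bucket-stored indicator factor historical accumulation pool.
Preferably, the bucket-stored indicator factor historical accumulation pool is stored in a distributed file system that is shared by distributed cluster servers.
Preferably, one or more items of real-time newly added data can be acquired at the same moment; after indicator factor instances are formed from the real-time newly added data, the indicator factor instances are distributed to the servers of the distributed cluster; each server locates the indicator factor instances to their corresponding buckets, reads the located buckets from the indicator factor historical accumulation pool stored in the distributed file system, and completes the subsequent determination and the addition of new indicator factor instances; the servers process in parallel.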
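Preferably, the distribution of the indicator factor instances to the servers of the distributed cluster is implemented according to a predetermined indicator factor instance distribution algorithm, comprising the following steps:
calculating, according to a predetermined algorithm, the number of the server corresponding to the indicator factor instance from the values of the data elements in each dimension of interest of the indicator factor instance, wherein the algorithm balances the load across the servers;
distributing the indicator factor instance to the corresponding server of the cluster.
Optionally, calculating the number of the server corresponding to the indicator factor instance from the values of the data elements in each dimension of interest is implemented as follows:
numbering the servers of the distributed cluster;
obtaining, from the values of the data elements in each dimension of interest of the indicator factor instance, a corresponding variable that can be operated on directly;
applying the md5 algorithm to the variable to obtain hexadecimal data and taking that data modulo the number of servers in the distributed cluster to obtain the server number; or converting the variable to binary data using ASCII codes and taking that data modulo the number of servers in the distributed cluster to obtain the server number.
Preferably, calculating the number of the server corresponding to the indicator factor instance from the values of the data elements in each dimension of interest is implemented as follows:
numbering the servers of the distributed cluster;
mapping the bucket numbers of the buckets evenly onto the servers of the distributed cluster;
obtaining, from the values of the data elements in each dimension of interest of the indicator factor instance, a corresponding variable that can be operated on directly;
obtaining, from the variable and according to the bucket storage policy, the bucket number of the bucket to which the indicator factor instance belongs;
obtaining the number of the server corresponding to the indicator factor instance from the mapping between bucket numbers and servers.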
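Preferably, in the process of locating the indicator factor instance to its corresponding bucket, after the indicator factor instances have been distributed to the servers of the distributed cluster, the step in which each server locates the indicator factor instance to its corresponding bucket is performed as follows:
each server of the distributed cluster calculates, according to the bucket storage policy, the bucket number of the bucket to which each distributed indicator factor instance belongs, and uses the bucket number as the key for looking up the bucket corresponding to the indicator factor instance distributed to that server.
Optionally, each server reading the located bucket from the indicator factor historical accumulation pool stored in the distributed file system means that each server, using the bucket number of the bucket to which its assigned indicator factor instance belongs, finds the corresponding bucket in the pool and reads the data in the bucket, specifically as follows:
determining, from the bucket number of the bucket to which the indicator factor instance belongs, whether the current server has already read the bucket corresponding to that bucket number; if so, using the already-read bucket data directly; if not, proceeding to the next step;
searching the indicator factor historical accumulation pool for the bucket corresponding to the bucket number;
determining whether the bucket exists; if so, proceeding to the next step; if not, creating a new, empty bucket for the indicator factor instance and proceeding to the next step;
loading the bucket data into the memory of the server to which the indicator factor instance was distributed;
determining whether the bucket has been loaded into the memory of the server; if so, performing, in the memory of the server, the step of searching the bucket on the basis of the indicator factor instance and determining whether the existing data of the bucket contains the same indicator factor instance.
Optionally, adding the indicator factor instance corresponding to the real-time newly added data to the located bucket comprises the following steps:
updating the indicator factor instance into the bucket corresponding to the bucket number in the memory of the server to which it was distributed;
updating, in the indicator factor historical accumulation pool, the indicator factor instance into the bucket corresponding to the bucket number, so that the pool is synchronized to the latest data;
updating the newly-added-data indicator statistics table, which includes statistics on the number of new users and the number of new mobile devices.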
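Correspondingly, the present application also provides an apparatus for updating real-time newly added data, including:
a real-time newly added data acquiring unit, configured to acquire real-time newly added data, the real-time newly added data comprising data elements of different dimensions;
an indicator factor instance obtaining unit, configured to obtain, according to a predetermined indicator factor algorithm, an indicator factor instance corresponding to the real-time newly added data, the indicator factor instance being composed of the specific values of the data elements under the dimensions of interest of the real-time newly added data, the dimensions of interest being referred to as indicator factors;
an indicator factor instance locating unit, configured to locate the indicator factor instance to its corresponding bucket based on a preset bucket storage policy;
a new-addition real-time judging unit, configured to read the bucket to which the indicator factor instance is located, search the bucket on the basis of the indicator factor instance, and determine whether the existing data of the bucket contains the same indicator factor instance;
an indicator factor historical accumulation pool updating unit, configured to perform update processing according to the determination result of the new-addition real-time judging unit: if the result is yes, the bucket data is not processed; if the result is no, the indicator factor instance is newly added, and the indicator factor instance corresponding to the real-time newly added data is added to the located bucket.
Preferably, the indicator factor instance obtaining unit includes:
a dimension allocation logic sub-unit, configured to allocate the dimensions of interest of the real-time newly added data according to the specific content of the real-time newly added data;
an indicator factor instance generating sub-unit, configured to obtain, according to the dimensions of interest, the values of the data elements in each dimension of interest from the real-time newly added data and to combine these values to form the indicator factor instance.
Optionally, the apparatus includes a bucket storage algorithm unit, which includes:
a historical data acquiring sub-unit, configured to acquire the historical data that exists before the real-time newly added data arrives;
a bucket count setting sub-unit, configured to set a reasonable number of buckets N for the bucket storage policy according to the volume of the historical data, each bucket being a hash bucket whose maximum storage capacity threshold can be adjusted through the number of buckets N;
a bucket number assigning sub-unit, configured to assign a bucket number to each hash bucket;
an indicator factor instance preprocessing sub-unit, configured to obtain, for each indicator factor instance, a corresponding variable that can be operated on directly;
a bucketing sub-unit, configured to apply a hash algorithm to the variable to obtain a hash value, each hash value belonging to one bucket number of the hash buckets; according to the bucket number, each indicator factor instance can be located to its corresponding bucket.
Preferably, the apparatus includes an indicator factor historical accumulation pool forming unit, which includes:
a historical data acquiring sub-unit, configured to acquire the historical data that exists before the real-time newly added data arrives;
an indicator factor instance sub-unit, configured to obtain, through the indicator factor algorithm, the indicator factor instance of each item of the historical data;
an indicator factor historical accumulation pool establishing sub-unit, configured to locate, based on the preset bucket storage policy, the indicator factor instances of the historical data to their corresponding buckets and to store indicator factor instances that belong to different buckets in different buckets, each bucket being assigned a bucket number, thereby establishing the bucket-stored indicator factor historical accumulation pool.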
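Correspondingly, the present application also provides an electronic device, including:
a display;
a processor; and
a memory configured to store an apparatus for updating real-time newly added data which, when executed by the processor, performs the following steps:
acquiring real-time newly added data, the real-time newly added data comprising data elements of different dimensions;
obtaining, according to a predetermined indicator factor algorithm, an indicator factor instance corresponding to the real-time newly added data, the indicator factor instance being composed of the specific values of the data elements under the dimensions of interest of the real-time newly added data, the dimensions of interest being referred to as indicator factors;
locating the indicator factor instance to its corresponding bucket based on a preset bucket storage policy;
reading the bucket to which the indicator factor instance is located, searching the bucket on the basis of the indicator factor instance, and determining whether the existing data of the bucket contains the same indicator factor instance;
if the determination result is yes, leaving the bucket data unprocessed;
if the determination result is no, treating the indicator factor instance as newly added and adding the indicator factor instance corresponding to the real-time newly added data to the located bucket.
The method provided by the present application stores historical data, in the form of indicator factor instances, in different buckets based on a preset bucket storage policy. The indicator factor instance corresponding to real-time newly added data is located to its corresponding bucket by the same rule, and the located bucket is then searched to determine whether the instance is newly added, followed by update processing. Applied in a distributed system, this means that a server in the cluster does not have to wait for all historical data to be fully loaded before it can compute in real time; each server only needs to load the data of some buckets in real time to complete the update processing accurately. This reduces the pressure on the file system at initialization and the load on each server, and lowers the performance demands on physical devices. Because each bucket holds a relatively small amount of data, loading buckets in real time within a short period becomes feasible. The method therefore improves real-time performance and reduces resource occupancy. In addition, even as the historical indicator factor instances keep growing, the bucket storage policy allows accuracy and real-time performance to be maintained by upgrading or expanding the physical devices, which makes the method scalable.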
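Brief Description of the Drawings
FIG. 1 is a flowchart of an embodiment of the method for updating real-time newly added data of the present application.
FIG. 2 is a flowchart of establishing the indicator factor historical accumulation pool.
FIG. 3 is a flowchart of an embodiment of the indicator factor instance distribution algorithm.
FIG. 4 is a schematic diagram of an embodiment of an apparatus for updating real-time newly added data.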
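Detailed Description
Many specific details are set forth in the following description to provide a thorough understanding of the present application. The present application can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its substance; the present application is therefore not limited by the specific implementations disclosed below.
The present application provides a method and an apparatus for updating real-time newly added data, each described in detail in the embodiments below. This embodiment assumes an application scenario in which a mobile application provider in the mobile Internet obtains data on new users and new mobile devices from various access data. The following description focuses on this scenario while also taking other scenarios into account.
Refer to FIG. 1, which is a flowchart of an embodiment of the method for updating real-time newly added data of the present application.
The method includes the following steps:
Step 101: Acquire real-time newly added data, the real-time newly added data comprising data elements of different dimensions.
To acquire real-time newly added data, this embodiment first reads the new data stream in real time and obtains from it all real-time newly added data at the current moment.
The real-time newly added data can amount to tens of millions of items. Each item can contain multiple data elements, and each data element reflects a different aspect. For real-time newly added data of different types and sources, the data elements may reflect different aspects. Each such aspect of the data is called a dimension. A dimension is roughly equivalent to one field in a data record made up of multiple fields; however, because not all data is recorded as fields, the dimension should be determined from the actual form of the data. In short, a dimension is an abstract way of saying what characteristic a data element describes; an item of real-time newly added data generally includes data elements of multiple dimensions.
In the specific application scenario of this embodiment, the real-time newly added data includes at least data elements in the application ID, user ID, and mobile device ID dimensions.
Step 102: Obtain, according to a predetermined indicator factor algorithm, an indicator factor instance corresponding to the real-time newly added data; the indicator factor instance is composed of the specific values of the data elements under the dimensions of interest of the real-time newly added data, and the dimensions of interest are referred to as indicator factors.
The present application processes big-data-based real-time newly added data. Each item may have a complex data structure, but only part of its information is of interest, and that part is sufficient to determine whether the item is newly added with respect to the indicator to be counted. The present application therefore generates an indicator factor instance for each item; the instance is composed of the specific values of the data elements under the dimensions of interest of that item, and the dimensions of interest are referred to as indicator factors. In this embodiment, the application ID, user ID, and mobile device ID are the dimensions of interest, and the indicator factor instance corresponding to an item of real-time newly added data is composed of the values of the data elements in these dimensions; how they are composed is determined by the predetermined indicator factor algorithm.
The indicator factor algorithm in step 102 has two parts: dimension allocation logic and an indicator factor instance generation algorithm.
The dimension allocation logic allocates the dimensions of interest according to the specific content of the real-time newly added data. The indicator factor instance generation algorithm, according to the dimensions of interest, obtains from the real-time newly added data the values of the data elements in each dimension of interest and combines them to form the indicator factor instance.
The dimension allocation logic is described in detail below.
Different types of real-time newly added data have different dimensions of interest. For example, some scenarios require counting, separately for each application, the users of that application, while others require counting both the users and the terminal devices of one specific application. The present application therefore provides dimension allocation logic to allocate the dimensions of interest of the real-time newly added data.
Where only data in fixed dimensions is counted, the dimension allocation logic can directly specify fixed preset dimensions. For example, if only the relationship between applications and users is of interest, only the application ID and user ID dimensions need to be obtained from the real-time newly added data, so the dimension allocation logic is set to directly specify the application ID and user ID as the dimensions of interest, that is, the indicator factors, and the specific values of the data elements in these two dimensions form the indicator factor instance. Similarly, if the relationships among applications, users, and mobile devices are all of interest, the application ID, user ID, and mobile device ID dimensions must be obtained from the real-time newly added data, so the dimension allocation logic is set to directly specify these three as the dimensions of interest.
In other cases, different dimensions are of interest depending on the data, and the dimension allocation logic allocates the dimensions of interest of the real-time newly added data by means of a to-be-matched set.
When the dimension allocation logic uses a to-be-matched set, indicator factor subsets composed of different dimension information are defined within that set. Real-time newly added data from different scenarios is matched against the to-be-matched set according to a predetermined rule to obtain the corresponding indicator factor subset, and the dimensions of interest of that specific item are obtained from the subset. For example, the to-be-matched set can contain one subset consisting of the application ID and user ID and another consisting of the application ID, user ID, and mobile device ID; each item of real-time newly added data is matched against these subsets by application ID category, and the match result determines its dimensions of interest.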
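The to-be-matched set described above can be sketched as follows. This is a minimal illustration in Python, not the implementation disclosed in the application: the names `MATCH_SET`, `APP_CATEGORY`, and `dimensions_of_interest`, the field names, and the rule that maps an application ID to a category are all assumptions made for the example.

```python
# Hypothetical to-be-matched set: each entry is an indicator factor subset,
# i.e. an ordered tuple of dimensions of interest.
MATCH_SET = {
    "user_only": ("appId", "userId"),
    "user_and_device": ("appId", "userId", "deviceId"),
}

# Hypothetical matching rule: application ID -> category in the set.
APP_CATEGORY = {
    "appId1": "user_and_device",
    "appId2": "user_only",
}


def dimensions_of_interest(record, default_category="user_only"):
    """Return the dimensions of interest for one real-time record."""
    category = APP_CATEGORY.get(record["appId"], default_category)
    return MATCH_SET[category]


record = {"appId": "appId1", "userId": "userNick", "deviceId": "deviceId1"}
dims = dimensions_of_interest(record)  # dimensions selected for appId1
```

Keeping the subsets as ordered tuples means the same selection can later drive the order in which values are concatenated into an indicator factor instance.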
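This dimension allocation logic allocates the dimensions of interest flexibly, so that indicator factor instances of different dimensions can be obtained for real-time newly added data from different scenarios; alternatively, for different purposes, indicator factor instances of different levels and granularities can be obtained for data from the same scenario, allowing the intended analysis and update processing to be performed.
The indicator factor instance generation algorithm is described in detail below.
The indicator factor instance generation algorithm, according to the dimensions of interest, obtains from the real-time newly added data the values of the data elements in each dimension of interest and combines them to form the indicator factor instance. Various ways of combining can be used depending on the situation: for example, the data elements can be concatenated directly in order, recorded as fields, or recorded in some other combined form.
In this embodiment, for each item of real-time newly added data, the values of the data elements in the application ID, user ID, and mobile device ID dimensions are obtained and concatenated as characters in that dimension order. The concatenated string constitutes the indicator factor instance corresponding to the item and can be operated on directly as a variable in the bucket storage policy of the present application. For example, if the application ID of an item is appId1, the user ID is userNick, and the mobile device ID is deviceId1, concatenation yields the string appId1_userNick_deviceId1, which becomes the value of the indicator factor instance for that item. The indicator factor instance contains only the information in the dimensions of interest, which is sufficient to determine whether the item is newly added with respect to the indicator to be counted. Redundant information in the raw data is thereby filtered out; when the data volume is very large, this noticeably reduces the resources used by real-time computation and saves a large amount of storage space.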
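The concatenation in this example can be sketched in a few lines of Python. The function name `build_indicator_instance`, the record field names, and the extra `timestamp` field are assumptions for illustration; only the underscore-joined result `appId1_userNick_deviceId1` comes from the text.

```python
from typing import Mapping, Sequence


def build_indicator_instance(record: Mapping[str, str],
                             dimensions: Sequence[str],
                             sep: str = "_") -> str:
    """Concatenate the values of the dimensions of interest, in order."""
    return sep.join(str(record[d]) for d in dimensions)


record = {
    "appId": "appId1",
    "userId": "userNick",
    "deviceId": "deviceId1",
    "timestamp": "2015-07-29T10:00:00",  # not a dimension of interest, dropped
}
instance = build_indicator_instance(record, ["appId", "userId", "deviceId"])
print(instance)  # appId1_userNick_deviceId1
```

One caveat of plain concatenation: if a value can itself contain the separator, two different records could produce the same string, so a real deployment would likely need a separator that cannot occur in the values.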
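The present application can also form the indicator factor instance corresponding to real-time newly added data in other ways, provided that a unique value can be obtained from the instance and that this value can be operated on directly as a variable in the bucket storage policy.
Step 103: Locate the indicator factor instance to its corresponding bucket based on a preset bucket storage policy.
The focus of the present application is real-time performance. In the conventional approach, a large volume of real-time newly added data is distributed to multiple servers in a distributed cluster, and each server, while determining in real time whether data is newly added, must load the data set of all historical data into memory. The historical data is enormous, possibly hundreds of millions of records, each with data elements of multiple dimensions. The servers must therefore load the full historical data set into memory in parallel and can start computing only after loading completes, which both places high performance demands on physical devices and causes long computation delays. With the bucket storage policy, the present application splits the full historical data into multiple buckets for storage and, when determining whether data is newly added, locates the indicator factor instance corresponding to the real-time newly added data to its corresponding bucket. In a distributed system, each server then needs to load only the buckets it requires to make the determination accurately, which lowers the performance demands on physical devices, improves real-time performance, and reduces computation delay and therefore resource occupancy.
In this embodiment, all buckets are stored in a distributed file system shared by the distributed cluster servers, and the following steps are all implemented on this distributed architecture.
In step 103, before the indicator factor instance is located to its corresponding bucket, the historical data that exists at the current time must be stored in buckets according to the preset bucket storage policy to establish the indicator factor historical accumulation pool. FIG. 2 is a flowchart of establishing the indicator factor historical accumulation pool, which includes the following steps:
Step 201: Acquire the historical data that exists before the real-time newly added data arrives.
In this embodiment, the data set of historical data that existed before the current real-time newly added data arrived is taken as the existing historical data. It may contain a large number of historical records, for example on the order of hundreds of millions.
Step 202: For each item of the historical data, obtain the corresponding indicator factor instance through the indicator factor algorithm of step 102.
In this embodiment, for each item of historical data, the dimension allocation logic is first set to directly specify the application ID, user ID, and mobile device ID as the dimensions of interest. The indicator factor instance generation algorithm is then used to generate the indicator factor instance for that item: the values of the data elements in the dimensions of interest are obtained from the item and concatenated as characters in the order application ID, user ID, mobile device ID; the concatenated string constitutes the indicator factor instance for the item and can be operated on directly as a variable in the bucket storage policy.
Step 203: Establish the bucket-stored indicator factor historical accumulation pool based on the preset bucket storage policy.
In this embodiment, the bucket storage policy is based on a hash algorithm, and establishing the bucket-stored indicator factor historical accumulation pool mainly involves the following steps:
1) Set a reasonable number of buckets N for the bucket algorithm according to the volume of the historical data. In this embodiment, each bucket is assigned a bucket number, which serves as the key for indexing the bucket. For example, the N buckets are numbered 1, 2, 3, ..., N, and the corresponding bucket can be indexed by its number.
2) Choose a suitable hash function and hash the value of the indicator factor instance of each item of historical data to obtain a hash value. Each hash value belongs uniquely to one bucket number, so the indicator factor instance of the item can be located to its corresponding bucket by bucket number.
A suitable hash function is one that spreads the values of the indicator factor instances evenly over the bucket numbers, with the maximum storage capacity threshold of each bucket adjustable through the number of buckets N. One example is a multiplicative hash: applying it to the values of the indicator factor instances yields a series of hash values, each of which belongs uniquely to one bucket number, with an even number of indicator factor instances per bucket number.
That each hash value belongs uniquely to one bucket number means that the hash value obtained from the hash operation must be mapped to a bucket number by some algorithm. Many mappings are possible, provided that one hash value corresponds to exactly one bucket number. In this embodiment, the hash value is taken modulo the number of buckets N, and the result is the bucket number to which the hash value belongs.
3) Store indicator factor instances that belong to different buckets in different buckets, thereby establishing the indicator factor historical accumulation pool. Because the bucket number is the key for indexing a bucket, the bucket to which an indicator factor instance is located can be retrieved by bucket number, and the instance is stored in that bucket. This embodiment is implemented on a distributed architecture, so the established pool is stored in a distributed file system, and the distributed cluster servers operate on the pool through that shared file system.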
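Steps 1) to 3) can be sketched as below. This is only an illustration under stated assumptions: the text names a multiplicative hash but gives no constants, so the multiplier 31 and the 32-bit mask are choices made here; adding 1 after the modulo is also an assumption, made so that results fall in the 1..N numbering used in step 1); and an in-memory dictionary stands in for the pool that the embodiment keeps in a distributed file system.

```python
from collections import defaultdict


def multiplicative_hash(value: str, multiplier: int = 31) -> int:
    """Simple multiplicative string hash, kept within 32 bits."""
    h = 0
    for ch in value:
        h = (h * multiplier + ord(ch)) & 0xFFFFFFFF
    return h


def bucket_number(instance: str, n_buckets: int) -> int:
    """Map an indicator factor instance to a bucket number in 1..N."""
    return multiplicative_hash(instance) % n_buckets + 1


def build_history_pool(history_instances, n_buckets):
    """Group historical indicator factor instances by bucket number."""
    pool = defaultdict(set)
    for inst in history_instances:
        pool[bucket_number(inst, n_buckets)].add(inst)
    return pool


history = [
    "appId1_userNick_deviceId1",
    "appId1_userB_deviceId2",
    "appId2_userNick_deviceId1",
]
pool = build_history_pool(history, n_buckets=16)
```

A deterministic hash is used rather than Python's built-in `hash()`, because string hashing in Python is randomized per process by default and every server must compute the same bucket number for the same instance.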
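In step 103, locating the indicator factor instance to its corresponding bucket means that, once the indicator factor historical accumulation pool has been established, the indicator factor instance corresponding to the real-time newly added data is located to its corresponding bucket according to the preset bucket storage policy. Because this embodiment is implemented on a distributed architecture, this step comprises two processes: distributing the indicator factor instances to the servers of the distributed cluster according to a predetermined indicator factor instance distribution algorithm; and having each server locate the indicator factor instances to their corresponding buckets according to the preset bucket storage policy.
The predetermined indicator factor instance distribution algorithm can be implemented in several ways, but to keep the update method of the present application scalable it must balance the memory load of the servers. For example, it can balance the number of buckets loaded by each server, keep the number of indicator factor instances assigned to each server balanced over a period of time, or assign identical indicator factor instances to the same server.
FIG. 3 is a flowchart of an embodiment of the indicator factor instance distribution algorithm. The specific steps are:
Step 301: Number the servers of the distributed cluster so that buckets can be allocated to them. For example, if the cluster has M servers, the servers are numbered 1, 2, 3, ..., M in order of IP address.
Step 302: Map the bucket numbers of all buckets evenly onto the servers of the distributed cluster. Specifically, in this embodiment the number of buckets loaded on each server is set to N/M, and the bucket numbers are allocated evenly to the servers in bucket-number order and server-number order, so that each server accesses the buckets corresponding to the bucket numbers assigned to it.
Step 303: For the indicator factor instance corresponding to the real-time newly added data, calculate the bucket number of the bucket to which the instance belongs according to the bucket storage policy.
Specifically, in this embodiment the value of the indicator factor instance is hashed with the hash function of step 203 to obtain a unique value for the instance; that value is taken modulo the number of buckets N, and the result is the bucket number of the bucket to which the instance belongs.
Step 304: Obtain the number of the server corresponding to the indicator factor instance from the mapping between bucket numbers and servers established in step 302.
Step 305: Distribute the indicator factor instance to the corresponding server in the distributed cluster according to that server number.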
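The bucket-to-server mapping of steps 302 and 304 can be sketched as follows, assuming buckets numbered 1..N and servers numbered 1..M as in the text. The function name `server_for_bucket` and the use of a rounded-up block size when N is not a multiple of M are assumptions; the text only states that each server loads N/M buckets in bucket-number and server-number order.

```python
import math


def server_for_bucket(bucket_no: int, n_buckets: int, n_servers: int) -> int:
    """Assign consecutive blocks of bucket numbers to consecutive servers."""
    per_server = math.ceil(n_buckets / n_servers)  # N/M, rounded up
    return (bucket_no - 1) // per_server + 1


# With N = 8 buckets and M = 4 servers, each server owns two buckets.
mapping = [server_for_bucket(b, 8, 4) for b in range(1, 9)]
print(mapping)  # [1, 1, 2, 2, 3, 3, 4, 4]
```

Because the mapping depends only on the bucket number, all instances that fall into one bucket reach the same server, which is what lets a server keep its buckets in memory instead of reloading them.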
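Another technical option for implementing the indicator factor instance distribution algorithm is:
1) Number the servers of the distributed cluster so that buckets can be allocated to them. For example, if the cluster has M servers, the servers are numbered 1, 2, 3, ..., M in order of IP address.
2) Apply the md5 algorithm to the variable to obtain hexadecimal data and take that data modulo the number of servers in the distributed cluster to obtain the server number; or convert the variable to binary data using ASCII codes and take that data modulo the number of servers in the distributed cluster to obtain the server number.
3) Distribute the indicator factor instance to the corresponding server in the distributed cluster according to that server number.
This second implementation keeps the number of indicator factor instances assigned to each server balanced over a period of time and assigns identical indicator factor instances to the same server, thereby balancing the memory load of the servers.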
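A minimal sketch of both variants of step 2), using Python's standard `hashlib`. The function names are hypothetical, adding 1 after the modulo is an assumption made to match the 1..M server numbering, and reading the ASCII bytes as one big-endian integer is only one possible interpretation of "converting the variable to binary data using ASCII codes".

```python
import hashlib


def server_by_md5(instance: str, n_servers: int) -> int:
    """md5 variant: hex digest -> integer -> modulo the number of servers."""
    digest_hex = hashlib.md5(instance.encode("utf-8")).hexdigest()
    return int(digest_hex, 16) % n_servers + 1


def server_by_ascii(instance: str, n_servers: int) -> int:
    """ASCII variant: treat the ASCII bytes as one big-endian integer."""
    as_int = int.from_bytes(instance.encode("ascii"), "big")
    return as_int % n_servers + 1


server = server_by_md5("appId1_userNick_deviceId1", n_servers=4)
```

Both functions are deterministic, so the same indicator factor instance is always routed to the same server; the ASCII variant raises `UnicodeEncodeError` for non-ASCII input.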
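To locate the indicator factor instances to their corresponding buckets on the servers according to the preset bucket storage policy, this embodiment proceeds as follows: each server of the distributed cluster calculates, according to the bucket storage policy, the bucket number of the bucket to which each distributed indicator factor instance belongs, and that bucket number serves as the key for looking up the bucket corresponding to the instance on that server.
Step 104: Read the bucket to which the indicator factor instance is located, search the bucket on the basis of the indicator factor instance, and determine whether the existing data of the bucket contains the same indicator factor instance.
The specific embodiment of the present application is deployed on a distributed architecture. In step 104, each server reads the bucket to which the indicator factor instance is located from the indicator factor historical accumulation pool stored in the distributed file system into its own memory and completes the subsequent determination and other operations in memory; the servers process in parallel.
In this embodiment, reading the located bucket means that each server, using the bucket number obtained in step 103, finds the corresponding bucket in the indicator factor historical accumulation pool stored in the distributed file system and reads its data. This specifically includes the following steps:
1) From the bucket number of the bucket to which the indicator factor instance belongs, determine whether the current server has already read the bucket corresponding to that bucket number. If so, use the already-read bucket data directly; if not, proceed to the next step.
2) Search the indicator factor historical accumulation pool for the bucket corresponding to the bucket number.
3) Determine whether the bucket exists. If so, proceed to the next step; if not, create a new, empty bucket for the indicator factor instance and proceed to the next step.
4) Load the retrieved or newly created bucket data into the memory of the server to which the indicator factor instance was distributed.
5) Determine whether the bucket has been loaded into the memory of that server. If so, search the bucket on the basis of the indicator factor instance and determine whether its existing data contains the same indicator factor instance.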
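Steps 1) to 4) amount to a per-server cache in front of the shared pool, which the following sketch illustrates. The class name `BucketLoader` is hypothetical, and a plain dictionary stands in for the indicator factor historical accumulation pool on the distributed file system; real file-system access is outside the scope of this example.

```python
class BucketLoader:
    """Per-server loader that reads each bucket from the pool at most once."""

    def __init__(self, pool_storage):
        self.pool_storage = pool_storage  # stand-in for the pool on the DFS
        self.memory = {}                  # buckets already loaded on this server

    def load(self, bucket_no):
        # Step 1: reuse the bucket if this server has already read it.
        if bucket_no in self.memory:
            return self.memory[bucket_no]
        # Steps 2-3: look the bucket up in the pool; create it empty if absent.
        if bucket_no not in self.pool_storage:
            self.pool_storage[bucket_no] = set()
        # Step 4: load a copy of the bucket data into server memory.
        self.memory[bucket_no] = set(self.pool_storage[bucket_no])
        return self.memory[bucket_no]


storage = {1: {"appId1_userNick_deviceId1"}}
loader = BucketLoader(storage)
bucket_1 = loader.load(1)        # read from the pool
bucket_1_again = loader.load(1)  # served from memory
bucket_7 = loader.load(7)        # did not exist: created empty
```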
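Step 105: If step 104 determines that the existing data of the bucket contains the same indicator factor instance, the bucket data is not processed.
In this embodiment, the indicator factor instance carries the dimension-of-interest information of the real-time newly added data, and the indicator factor instances stored in the buckets of the indicator factor historical accumulation pool carry the dimension-of-interest information of the historical data. The server locates the indicator factor instance of the real-time newly added data to its bucket in the pool and loads that bucket into its memory. If searching the bucket on the basis of the instance shows that the bucket already contains the same indicator factor instance, the user ID or mobile device ID under the corresponding application ID has appeared before and is neither a new user nor a new mobile device, so neither the indicator factor historical accumulation pool nor the newly-added-data indicator statistics table needs to be updated.
Step 106: If step 104 determines that the existing data of the bucket does not contain the same indicator factor instance, the indicator factor instance is newly added, and the indicator factor instance corresponding to the real-time newly added data is added to the located bucket. For this embodiment, this determination indicates that the user ID or mobile device ID under the corresponding application ID has never appeared before; it is a new user or a new mobile device, that is, a newly added indicator that needs to be updated and counted. In this embodiment, the indicator factor instance is added to the located bucket as follows:
1) Update the indicator factor instance into the bucket to which it belongs in the memory of the server to which it was distributed.
2) In the indicator factor historical accumulation pool, update the indicator factor instance into the bucket corresponding to the bucket number, so that the pool is synchronized to the latest data.
3) Update the newly-added-data indicator statistics table, which includes statistics on the number of new users and the number of new mobile devices.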
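Steps 105 and 106 can be sketched together as a single check-and-update routine. Everything here is illustrative: the function name `update_if_new` is hypothetical, dictionaries stand in for server memory and for the pool on the distributed file system, and a single `new_instances` counter simplifies the statistics table, which in the embodiment tracks new users and new mobile devices separately.

```python
def update_if_new(instance, bucket_no, memory, pool, stats):
    """Return True if `instance` is newly added, updating all three stores."""
    # Load the located bucket into memory on first use.
    bucket = memory.setdefault(bucket_no, set(pool.get(bucket_no, ())))
    if instance in bucket:
        return False  # step 105: already seen, nothing to update
    bucket.add(instance)                             # 1) in-memory bucket
    pool.setdefault(bucket_no, set()).add(instance)  # 2) sync the pool
    stats["new_instances"] = stats.get("new_instances", 0) + 1  # 3) statistics
    return True


pool = {3: {"appId1_userNick_deviceId1"}}  # stand-in for the pool on the DFS
memory, stats = {}, {}
first = update_if_new("appId1_userNick_deviceId1", 3, memory, pool, stats)
second = update_if_new("appId1_newUser_deviceId9", 3, memory, pool, stats)
third = update_if_new("appId1_newUser_deviceId9", 3, memory, pool, stats)
```

Only the second call reports a new instance: the first value was already in the historical pool, and the third repeats a value that the second call had just added.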
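In the above embodiments, the preset bucket storage policy allows the maximum capacity threshold of a bucket to be adjusted through the number of buckets N, which ensures load balancing across the servers to a certain extent. In addition, the distribution algorithm maps bucket numbers evenly onto the servers of the distributed cluster in advance, so that the number of buckets loaded on each server is balanced, which further ensures load balancing. Because this mapping stays fixed for a period of time, it also avoids frequent reloading of buckets from the historical accumulation pool by the servers. The update method of the present application therefore uses resources more reasonably, scales better, and achieves genuinely distributed processing.
The above embodiments provide a method for updating real-time newly added data; correspondingly, the present application also provides an apparatus for updating real-time newly added data. Refer to FIG. 4, which is a schematic diagram of an embodiment of such an apparatus. Because the apparatus embodiment is substantially similar to the method embodiment, it is described only briefly; for related details, refer to the description of the method embodiment. The apparatus embodiment described below is merely illustrative.
The apparatus of this embodiment includes: an indicator factor historical accumulation pool forming unit 1, a real-time newly added data acquiring unit 2, an indicator factor instance obtaining unit 3, an indicator factor instance locating unit 4, a new-addition real-time judging unit 5, and an indicator factor historical accumulation pool updating unit 6.
The indicator factor historical accumulation pool forming unit 1 is configured to establish, from historical data, an indicator factor historical accumulation pool stored according to the bucket storage policy.
The real-time newly added data acquiring unit 2 is configured to acquire one or more items of real-time newly added data at the same moment, each item including data elements of different dimensions.
The indicator factor instance obtaining unit 3 is configured to obtain, according to a predetermined indicator factor algorithm, the indicator factor instance corresponding to each item of real-time newly added data; the indicator factor instance is composed of the specific values of the data elements under the dimensions of interest of the real-time newly added data, and the dimensions of interest are referred to as indicator factors.
The indicator factor instance locating unit 4 is configured to locate the indicator factor instance to its corresponding bucket based on a preset bucket storage policy.
The new-addition real-time judging unit 5 is configured to read the bucket to which the indicator factor instance is located, search the bucket on the basis of the indicator factor instance, and determine whether the existing data of the bucket contains the same indicator factor instance.
The indicator factor historical accumulation pool updating unit 6 is configured to perform update processing according to the determination result of the new-addition real-time judging unit: if the result is yes, the bucket data is not processed; if the result is no, the indicator factor instance is newly added, and the indicator factor instance corresponding to the real-time newly added data is added to the located bucket.
In this embodiment, the real-time newly added data acquiring unit 2, the indicator factor instance obtaining unit 3, the indicator factor instance locating unit 4, the new-addition real-time judging unit 5, and the indicator factor historical accumulation pool updating unit 6 jointly complete the update processing of the real-time newly added data. These units run on the premise that the forming unit 1 has finished establishing the indicator factor historical accumulation pool; units 5 and 6 perform the new-addition determination and the update processing on the basis of that pool.
The indicator factor historical accumulation pool forming unit 1 includes a historical data acquiring sub-unit 1-1, an indicator factor instance sub-unit 1-2, and an indicator factor historical accumulation pool establishing sub-unit 1-3.
The historical data acquiring sub-unit 1-1 is configured to acquire the historical data that exists before the real-time newly added data arrives.
The indicator factor instance sub-unit 1-2 is configured to obtain, through the indicator factor algorithm, the indicator factor instance of each item of the historical data.
The indicator factor historical accumulation pool establishing sub-unit 1-3 is configured to locate, based on the preset bucket storage policy, the indicator factor instances of the historical data to their corresponding buckets and to store indicator factor instances that belong to different buckets in different buckets; each bucket is assigned a bucket number, thereby establishing the bucket-stored indicator factor historical accumulation pool.
Units 1 through 6 above all operate on the same indicator factor algorithm and the same bucket storage policy. Both are described in detail in the method embodiment and are not repeated in the apparatus embodiment; for the specific steps, refer to the method embodiment.
Optionally, the apparatus is deployed in a distributed system: the bucket-stored indicator factor historical accumulation pool is stored in a distributed file system shared by the distributed cluster servers, and the new-addition real-time judging module is deployed on the distributed cluster servers.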
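The above provides embodiments of a method and an apparatus for updating real-time newly added data. Correspondingly, the present application also provides an electronic device, which includes:
a display;
a processor; and
a memory configured to store an apparatus for updating real-time newly added data which, when executed by the processor, performs the following steps:
acquiring real-time newly added data, the real-time newly added data comprising data elements of different dimensions;
obtaining, according to a predetermined indicator factor algorithm, an indicator factor instance corresponding to the real-time newly added data, the indicator factor instance being composed of the specific values of the data elements under the dimensions of interest of the real-time newly added data, the dimensions of interest being referred to as indicator factors;
locating the indicator factor instance to its corresponding bucket based on a preset bucket storage policy;
reading the bucket to which the indicator factor instance is located, searching the bucket on the basis of the indicator factor instance, and determining whether the existing data of the bucket contains the same indicator factor instance;
if the determination result is yes, leaving the bucket data unprocessed;
if the determination result is no, treating the indicator factor instance as newly added and adding the indicator factor instance corresponding to the real-time newly added data to the located bucket.
Although the present application is disclosed above by way of preferred embodiments, they are not intended to limit the present application. Any person skilled in the art can make possible changes and modifications without departing from the spirit and scope of the present application, and the scope of protection of the present application shall therefore be that defined by the claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include non-persistent memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include persistent and non-persistent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
2. Those skilled in the art should understand that embodiments of the present application can be provided as a method, a system, or a computer program product. Accordingly, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.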

Claims (20)

  1. 一种实时新增数据更新方法,其特征在于,包括:
    获取实时新增数据,所述实时新增数据包括不同维度的数据元素;
    根据预先确定的指标因子算法,获取对应所述实时新增数据的指标因子实例;该指标因子实例由所述实时新增数据的被关注维度下的数据元素的具体数值组成,所述被关注维度称为指标因子;
    基于预先设置的分桶存储策略,把所述指标因子实例定位到其相应的分桶中;
    读取被所述指标因子实例定位的分桶,以所述指标因子实例为依据从所述分桶中检索,判断所述分桶的现有数据中是否包含相同的指标因子实例;
    若判断结果为是,则不对分桶数据进行处理;
    若判断结果为否,则该指标因子实例为新增,在所述被定位的分桶中加入对应所述实时新增数据的指标因子实例。
  2. 根据权利要求1所述的实时新增数据更新方法,其特征在于,所述指标因子算法包括:
    维度分配逻辑,该逻辑根据所述实时新增数据的具体内容,对所述实时新增数据的被关注维度进行分配;
    指标因子实例生成算法,该算法根据所述被关注维度,从所述实时新增数据中获取各个被关注维度的数据元素的取值,并将这些取值组合,形成所述指标因子实例。
  3. 根据权利要求2所述的实时新增数据更新方法,其特征在于,所述维度分配逻辑是直接指定固定的预设维度。
  4. 根据权利要求2所述的实时新增数据更新方法,其特征在于,所述维度分配逻辑包括一个待匹配集合,所述待匹配集合包含了不同场合下由不同维度信息组成的不同指标因子子集,并且由不同场合下的实时新增数据与所述待匹配集合按预定的规则进行匹配,获取对应的指标因子子集,并根据该指标因子子集获取该具体的实时新增数据的被关注维度。
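  5. The real-time newly added data update method according to claim 2, characterized in that obtaining, according to the dimensions of interest, the values of the data elements of each dimension of interest from the real-time newly added data and combining these values to form the indicator factor instance comprises the following specific steps:
    reading, for each dimension of interest, the value of the real-time newly added data corresponding to that dimension of interest;
    concatenating, as characters, the obtained values of the dimensions of interest of the real-time newly added data, the concatenated characters constituting the indicator factor instance corresponding to the real-time newly added data, wherein the concatenated characters can be operated on directly in the bucket storage strategy.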
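  6. The real-time newly added data update method according to claim 1, characterized in that the bucket storage strategy comprises a Hash bucket algorithm comprising the following steps:
    acquiring the historical data that already existed before the arrival of the real-time newly added data;
    setting a reasonable number of buckets N for the bucket storage strategy according to the amount of information in the historical data, each bucket being a Hash bucket, wherein the maximum threshold of the storage capacity of the Hash buckets can be adjusted through the number of buckets N;
    assigning a bucket number to each Hash bucket;
    obtaining, from the indicator factor instances, a directly operable variable corresponding to each indicator factor instance;
    applying a Hash algorithm to the variable to obtain a hashed Hash value, each Hash value belonging uniquely to one bucket number of the Hash buckets; according to the bucket number, each indicator factor instance can be located to its corresponding bucket.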
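  7. The real-time newly added data update method according to claim 1, characterized in that the indicator factor instances are stored in bucketed form in an indicator factor history accumulation pool, the indicator factor history accumulation pool being obtained as follows:
    acquiring the historical data that already existed before the arrival of the real-time newly added data;
    obtaining, by the indicator factor algorithm, the indicator factor instance of each item of the historical data;
    locating, based on the preset bucket storage strategy, the indicator factor instances of the historical data to their corresponding buckets, and storing indicator factor instances belonging to different buckets in different buckets; each bucket is assigned a bucket number, thereby establishing the bucketed indicator factor history accumulation pool.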
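  8. The real-time newly added data update method according to claim 7, characterized in that the bucketed indicator factor history accumulation pool is stored in a distributed file system, the distributed file system being shared by the servers of a distributed cluster.
  9. The real-time newly added data update method according to claim 8, characterized in that one or more items of real-time newly added data can be acquired at the same moment; after indicator factor instances are formed from the real-time newly added data, the indicator factor instances are distributed to the servers of the distributed cluster; each server locates the indicator factor instances to their corresponding buckets, reads the buckets to which the indicator factor instances are located from the indicator factor history accumulation pool stored in the distributed file system, and completes the subsequent judgment and the addition of new indicator factor instances; the servers process in parallel.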
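  10. The real-time newly added data update method according to claim 9, characterized in that the distribution of the indicator factor instances to the servers of the distributed cluster is carried out according to a predetermined indicator factor instance distribution algorithm, comprising the following steps:
    computing, by a predetermined algorithm, the number of the server corresponding to the indicator factor instance from the values of the data elements of each dimension of interest in the indicator factor instance, the algorithm being capable of balancing the load across the servers;
    distributing the indicator factor instance to the corresponding server of the cluster.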
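  11. The real-time newly added data update method according to claim 10, characterized in that computing, by a predetermined algorithm, the number of the server corresponding to the indicator factor instance from the values of the data elements of each dimension of interest in the indicator factor instance is implemented as follows:
    numbering each server of the distributed cluster;
    obtaining, from the values of the data elements of each dimension of interest of the indicator factor instance, a corresponding directly operable variable;
    applying the md5 algorithm to the variable to obtain hexadecimal data, and taking that data modulo the number of servers in the distributed cluster to obtain the server number; or converting the variable via ASCII codes into binary data, and taking that data modulo the number of servers in the distributed cluster to obtain the server number.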
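  12. The real-time newly added data update method according to claim 10, characterized in that computing, by a predetermined algorithm, the number of the server corresponding to the indicator factor instance from the values of the data elements of each dimension of interest in the indicator factor instance is implemented as follows:
    numbering each server of the distributed cluster;
    mapping the bucket numbers of the buckets evenly onto the servers of the distributed cluster;
    obtaining, from the values of the data elements of each dimension of interest of the indicator factor instance, a corresponding directly operable variable;
    obtaining, from the variable and according to the bucket storage strategy, the bucket number of the bucket to which the indicator factor instance belongs;
    obtaining, from the mapping between bucket numbers and servers, the number of the server corresponding to the indicator factor instance.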
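  13. The real-time newly added data update method according to claim 10, characterized in that, in the process of locating the indicator factor instance to its corresponding bucket, after the step of distributing the indicator factor instances to the servers of the distributed cluster is completed, the step of locating, by each server, the indicator factor instances to their corresponding buckets is performed as follows:
    each server of the distributed cluster computes, for the indicator factor instances distributed to it and according to the bucket storage strategy, the bucket number of the bucket to which each indicator factor instance belongs, and uses the bucket number as the key for looking up the bucket corresponding to the indicator factor instance distributed to that server.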
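  14. The real-time newly added data update method according to claim 13, characterized in that the reading, by each server, of the bucket to which the indicator factor instance is located from the indicator factor history accumulation pool stored in the distributed file system means that each server, according to the bucket number of the bucket to which its assigned indicator factor instance belongs, looks up the corresponding bucket in the indicator factor history accumulation pool stored in the distributed file system and reads the data in the bucket, specifically as follows:
    determining, according to the bucket number of the bucket to which the indicator factor instance belongs, whether the current server has already read the bucket corresponding to the bucket number; if so, using the already-read bucket data directly; if not, proceeding to the next step;
    searching the indicator factor history accumulation pool for the bucket corresponding to the bucket number;
    determining whether the bucket exists; if so, proceeding to the next step; if not, creating a new bucket for the indicator factor instance, the bucket data being empty, and proceeding to the next step;
    loading the bucket data into the memory of the server to which the indicator factor instance was distributed;
    determining whether the bucket has been loaded into the memory of the server; if so, performing, in the memory of the server, the step of searching the bucket using the indicator factor instance and determining whether the existing data of the bucket contains an identical indicator factor instance.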
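  15. The real-time newly added data update method according to claim 14, characterized in that adding the indicator factor instance corresponding to the real-time newly added data to the located bucket comprises the following steps:
    updating the indicator factor instance into the bucket corresponding to the bucket number stored in the memory of the server to which it was distributed;
    updating, in the indicator factor history accumulation pool, the indicator factor instance into the bucket corresponding to the bucket number, so that the indicator factor history accumulation pool is synchronized to the latest data;
    updating a newly added data indicator statistics table, the table including statistics on the number of new users and the number of new mobile devices.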
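  16. A real-time newly added data update apparatus, characterized by comprising:
    a real-time newly added data acquisition unit, configured to acquire real-time newly added data, the real-time newly added data comprising data elements of different dimensions;
    an indicator factor instance acquisition unit, configured to obtain, according to a predetermined indicator factor algorithm, the indicator factor instance corresponding to the real-time newly added data, wherein the indicator factor instance consists of the specific values of the data elements of the real-time newly added data under the dimensions of interest, and the dimensions of interest are referred to as indicator factors;
    an indicator factor instance locating unit, configured to locate the indicator factor instance to its corresponding bucket based on a preset bucket storage strategy;
    a new-addition real-time judging unit, configured to read the bucket to which the indicator factor instance is located, search the bucket using the indicator factor instance, and determine whether the existing data of the bucket contains an identical indicator factor instance;
    an indicator factor history accumulation pool update unit, configured to perform the corresponding update according to the judgment result of the new-addition real-time judging unit: if so, the bucket data is left unprocessed; if not, the indicator factor instance is newly added, and the indicator factor instance corresponding to the real-time newly added data is added to the located bucket.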
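  17. The real-time newly added data update apparatus according to claim 16, characterized in that the indicator factor instance acquisition unit comprises:
    a dimension allocation logic sub-unit, configured to allocate the dimensions of interest of the real-time newly added data according to the specific content of the real-time newly added data;
    an indicator factor instance generation sub-unit, configured to obtain, according to the dimensions of interest, the values of the data elements of each dimension of interest from the real-time newly added data, and to combine these values to form the indicator factor instance.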
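  18. The real-time newly added data update apparatus according to claim 16, characterized by comprising a bucket storage algorithm unit, the bucket storage algorithm unit comprising:
    a historical data acquisition sub-unit, configured to acquire the historical data that already existed before the arrival of the real-time newly added data;
    a bucket count setting sub-unit, configured to set a reasonable number of buckets N for the bucket storage strategy according to the amount of information in the historical data, each bucket being a Hash bucket, wherein the maximum threshold of the storage capacity of the Hash buckets can be adjusted through the number of buckets N;
    a bucket number assignment sub-unit, configured to assign a bucket number to each Hash bucket;
    an indicator factor instance preprocessing sub-unit, configured to obtain, from the indicator factor instances, a directly operable variable corresponding to each indicator factor instance;
    a bucketing sub-unit, configured to apply a Hash algorithm to the variable to obtain a hashed Hash value, each Hash value belonging to one bucket number of the Hash buckets; according to the bucket number, each indicator factor instance can be located to its corresponding bucket.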
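  19. The real-time newly added data update apparatus according to claim 16, characterized by comprising an indicator factor history accumulation pool forming unit, the indicator factor history accumulation pool forming unit comprising:
    a historical data acquisition sub-unit, configured to acquire the historical data that already existed before the arrival of the real-time newly added data;
    an indicator factor instance sub-unit, configured to obtain, by the indicator factor algorithm, the indicator factor instance of each item of the historical data;
    an indicator factor history accumulation pool establishing sub-unit, configured to locate, based on the preset bucket storage strategy, the indicator factor instances of the historical data to their corresponding buckets, and to store indicator factor instances belonging to different buckets in different buckets; each bucket is assigned a bucket number, thereby establishing the bucketed indicator factor history accumulation pool.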
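  20. An electronic device, characterized by comprising:
    a display;
    a processor; and
    a memory configured to store a real-time newly added data update apparatus which, when executed by the processor, performs the following steps:
    acquiring real-time newly added data, the real-time newly added data comprising data elements of different dimensions;
    obtaining, according to a predetermined indicator factor algorithm, the indicator factor instance corresponding to the real-time newly added data, wherein the indicator factor instance consists of the specific values of the data elements of the real-time newly added data under the dimensions of interest, and the dimensions of interest are referred to as indicator factors;
    locating the indicator factor instance to its corresponding bucket based on a preset bucket storage strategy;
    reading the bucket to which the indicator factor instance is located, searching the bucket using the indicator factor instance, and determining whether the existing data of the bucket contains an identical indicator factor instance;
    if so, leaving the bucket data unprocessed;
    if not, the indicator factor instance is newly added, and the indicator factor instance corresponding to the real-time newly added data is added to the located bucket.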
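
As a hedged illustration of the distributed variants in claims 9 to 15, the sketch below shows the MD5-modulo routing of claim 11, a bucket-to-server mapping in the spirit of claim 12, and the load-once bucket handling of claims 14 and 15. The class `Worker`, the function names, the round-robin mapping, and the dictionary standing in for the distributed file system are assumptions made for this example only; they are not taken from the application.

```python
import hashlib


def server_by_md5(instance, num_servers):
    """Claim 11, first variant: MD5 hex digest modulo the number of servers."""
    digest = hashlib.md5(instance.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_servers


def server_by_bucket(bucket_no, num_servers):
    """Claim 12: spread bucket numbers evenly over servers (round-robin is an assumption)."""
    return bucket_no % num_servers


class Worker:
    """One cluster server that keeps in memory the buckets it has already read."""

    def __init__(self, shared_pool):
        self.shared_pool = shared_pool  # dict standing in for the distributed file system
        self.loaded = {}                # bucket number -> in-memory copy of the bucket

    def bucket(self, bucket_no):
        """Return the in-memory bucket, loading (or creating) it on first use."""
        if bucket_no not in self.loaded:
            # If the pool has no such bucket yet, create an empty one, then load it.
            stored = self.shared_pool.setdefault(bucket_no, set())
            self.loaded[bucket_no] = set(stored)
        return self.loaded[bucket_no]

    def add_if_new(self, bucket_no, instance):
        """Return True if the instance was new and has been written to both copies."""
        in_memory = self.bucket(bucket_no)
        if instance in in_memory:
            return False
        in_memory.add(instance)                    # update the server's in-memory bucket
        self.shared_pool[bucket_no].add(instance)  # synchronize the history pool
        return True
```

Routing by bucket number sends every instance of a given bucket to the same server, so that server's in-memory copy of the bucket stays current; this is an observation about the sketch rather than a statement from the application.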
PCT/CN2016/090633 2015-07-29 2016-07-20 Real-time newly added data update method and apparatus WO2017016423A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510455177.5 2015-07-29
CN201510455177.5A CN106407207B (zh) 2015-07-29 2015-07-29 Real-time newly added data update method and apparatus

Publications (1)

Publication Number Publication Date
WO2017016423A1 true WO2017016423A1 (zh) 2017-02-02

Family

ID=57884102

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/090633 WO2017016423A1 (zh) 2015-07-29 2016-07-20 Real-time newly added data update method and apparatus

Country Status (2)

Country Link
CN (1) CN106407207B (zh)
WO (1) WO2017016423A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
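CN110489405A (zh) * 2019-07-12 2019-11-22 平安科技(深圳)有限公司 Data processing method, apparatus, and server
CN110866151A (zh) * 2019-11-11 2020-03-06 腾讯科技(深圳)有限公司 Feature traversal method and related device
CN111435346A (zh) * 2019-01-14 2020-07-21 阿里巴巴集团控股有限公司 Offline data processing method, apparatus, and device
CN111680104A (zh) * 2020-05-29 2020-09-18 平安证券股份有限公司 Data synchronization method and apparatus, computer device, and readable storage medium
CN113742036A (zh) * 2020-05-28 2021-12-03 阿里巴巴集团控股有限公司 Indicator processing method and apparatus, and electronic device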

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
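CN109783523B (zh) * 2019-01-24 2022-02-25 广州虎牙信息科技有限公司 Data processing method, apparatus, device, and storage medium
CN110795440B (zh) * 2019-09-05 2021-02-19 连连银通电子支付有限公司 Method and apparatus for updating indicators
CN110727654B (zh) * 2019-10-24 2022-02-18 北京锐安科技有限公司 Data extraction method and apparatus for a distributed system, server, and storage medium
CN112817965B (zh) * 2019-11-18 2023-10-17 百度在线网络技术(北京)有限公司 Data splicing method and apparatus, electronic device, and storage medium
CN113051279B (zh) * 2021-03-05 2024-05-10 北京顺达同行科技有限公司 Data message storage method, storage apparatus, electronic device, and storage medium
CN113556797A (zh) * 2021-06-29 2021-10-26 深圳市闪联信息技术有限公司 Method and system for quickly establishing a connection between a mobile device and a large-screen device
CN113704262B (zh) * 2021-08-27 2022-11-15 深圳市路通网络技术有限公司 Service data storage method, apparatus, device, and readable storage medium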

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101655861A (zh) * 2009-09-08 2010-02-24 中国科学院计算技术研究所 Hashing method and hashing apparatus based on a double-counting Bloom filter
US20100161957A1 (en) * 2008-12-18 2010-06-24 Electronics And Telecommunications Research Institute Methods of storing and retrieving data in/from external server
CN102033938A (zh) * 2010-12-10 2011-04-27 天津神舟通用数据技术有限公司 Cluster dynamic expansion method based on two-level mapping
CN102169491A (zh) * 2011-03-25 2011-08-31 暨南大学 Method for dynamically detecting duplicate records across multiple data sets

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007143813A1 (en) * 2006-06-16 2007-12-21 Husky Injection Molding Systems Ltd. Preventative maintenance update system
US20110302214A1 (en) * 2010-06-03 2011-12-08 General Motors Llc Method for updating a database
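CN104424254B (zh) * 2013-08-28 2018-05-22 阿里巴巴集团控股有限公司 Method and apparatus for obtaining a set of similar objects and providing similar-object information
CN103810247A (zh) * 2014-01-10 2014-05-21 国网信通亿力科技有限责任公司 Disaster-recovery data comparison method based on a bucketing algorithm
CN104376047B (zh) * 2014-10-28 2017-06-30 浪潮电子信息产业股份有限公司 HBase-based large-table join method
CN104391957A (zh) * 2014-12-01 2015-03-04 浪潮电子信息产业股份有限公司 Data interaction analysis method for a hybrid big-data processing system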

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100161957A1 (en) * 2008-12-18 2010-06-24 Electronics And Telecommunications Research Institute Methods of storing and retrieving data in/from external server
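CN101655861A (zh) * 2009-09-08 2010-02-24 中国科学院计算技术研究所 Hashing method and hashing apparatus based on a double-counting Bloom filter
CN102033938A (zh) * 2010-12-10 2011-04-27 天津神舟通用数据技术有限公司 Cluster dynamic expansion method based on two-level mapping
CN102169491A (zh) * 2011-03-25 2011-08-31 暨南大学 Method for dynamically detecting duplicate records across multiple data sets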

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
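CN111435346A (zh) * 2019-01-14 2020-07-21 阿里巴巴集团控股有限公司 Offline data processing method, apparatus, and device
CN111435346B (zh) * 2019-01-14 2023-12-19 阿里巴巴集团控股有限公司 Offline data processing method, apparatus, and device
CN110489405A (zh) * 2019-07-12 2019-11-22 平安科技(深圳)有限公司 Data processing method, apparatus, and server
CN110489405B (zh) * 2019-07-12 2024-01-12 平安科技(深圳)有限公司 Data processing method, apparatus, and server
CN110866151A (zh) * 2019-11-11 2020-03-06 腾讯科技(深圳)有限公司 Feature traversal method and related device
CN110866151B (zh) * 2019-11-11 2023-09-19 腾讯科技(深圳)有限公司 Feature traversal method and related device
CN113742036A (zh) * 2020-05-28 2021-12-03 阿里巴巴集团控股有限公司 Indicator processing method and apparatus, and electronic device
CN113742036B (zh) * 2020-05-28 2024-01-30 阿里巴巴集团控股有限公司 Indicator processing method and apparatus, and electronic device
CN111680104A (zh) * 2020-05-29 2020-09-18 平安证券股份有限公司 Data synchronization method and apparatus, computer device, and readable storage medium
CN111680104B (zh) * 2020-05-29 2023-11-03 平安证券股份有限公司 Data synchronization method and apparatus, computer device, and readable storage medium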

Also Published As

Publication number Publication date
CN106407207B (zh) 2020-06-16
CN106407207A (zh) 2017-02-15

Similar Documents

Publication Publication Date Title
WO2017016423A1 (zh) Real-time newly added data update method and apparatus
US20150310045A1 (en) Managing an index of a table of a database
US8843632B2 (en) Allocation of resources between web services in a composite service
US9742860B2 (en) Bi-temporal key value cache system
US20150081908A1 (en) Computer-based, balanced provisioning and optimization of data transfer resources for products and services
CN108279974B (zh) Cloud resource allocation method and apparatus
US9940020B2 (en) Memory management method, apparatus, and system
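CN111966649A (zh) Lightweight online file storage method and apparatus with efficient deduplication
CN110830604B (zh) DNS scheduling method and apparatus
CN109981702B (zh) File storage method and system
TW201702870A (zh) Resource allocation method and apparatus
CN111666131A (zh) Load-balancing allocation method and apparatus, computer device, and storage medium
CN106657182B (zh) Cloud file processing method and apparatus
CN114528231A (zh) Dynamic data storage method and apparatus, electronic device, and storage medium
CN110178119B (zh) Method and apparatus for processing service requests, and storage system
CN109788013B (zh) Method, apparatus, and device for allocating job resources in a distributed system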
US11442632B2 (en) Rebalancing of user accounts among partitions of a storage service
CN108536759B (zh) Sample playback data access method and apparatus
US11120052B1 (en) Dynamic distributed data clustering using multi-level hash trees
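WO2019169998A1 (zh) Method and system for selecting a data node, and related device
CN110708361A (zh) System, method, apparatus, and server for determining the level of users who publish digital content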
US20130144838A1 (en) Transferring files
US11159530B2 (en) Direct upload and download to content management system backend
CN110719306B (zh) Network request limiting method, computer device, and storage medium
CN110874268A (zh) Data processing method, apparatus, and device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16829789

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16829789

Country of ref document: EP

Kind code of ref document: A1