CN111782645A - Data processing method and device - Google Patents
- Publication number
- CN111782645A (application CN201911200575.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- index
- dimension
- dimensional array
- processed
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2255—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2264—Multidimensional index structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present disclosure provides a data processing method and apparatus. The data processing apparatus processes an original event request to form data to be processed for statistics; calculates a corresponding hash value for the dimension data in the data to be processed; determines whether the hash value is included in a dimension index of a preset data structure, wherein the dimension index is a Map structure; if the hash value is not included in the dimension index, calculates a row index of the dimension data in a two-dimensional array of the preset data structure, stores the hash value and the row index in the dimension index, and inserts the data to be processed into the row of the two-dimensional array pointed to by the row index; and processes the related data in the two-dimensional array according to a preset time window to obtain a statistical result of the corresponding index. The method and apparatus can effectively reduce machine memory usage, save hardware resources, and reduce the probability of system errors.
Description
Technical Field
The present disclosure relates to the field of information processing, and in particular, to a data processing method and apparatus.
Background
As a pure stream-processing engine, Flink is widely used in data processing scenarios thanks to its real-time performance, strong consistency, and high availability. To ensure that online transactions complete successfully, protect the legitimate rights and interests of normal users, and prevent abusive behavior such as scalpers placing fake orders ("order brushing") and exploiting promotions ("wool pulling"), large internet enterprises are investing more and more in risk control. One of the core techniques of risk control is real-time index calculation, so introducing Flink into risk control to process data follows naturally.
For a specific service in a risk control scenario, multiple indexes generally need to be processed to support the service, and each index has its own dimensions, calculation key, and time window. For example, in the order-scenario index "the number of distinct accounts with the same receiving address, the same first 7 digits of the receiving mobile phone number, and the same SKUID within 24 hours", the receiving address, the first 7 digits of the receiving mobile phone number, and the SKUID form the dimensions of the index, the account is the calculation key of the index, and 24 hours is the time window of the index.
Disclosure of Invention
The inventor has found that the current Flink-based approach to risk control index calculation computes each index separately, where each index is uniquely determined by its dimensions, calculation key, and time window. In a risk control scenario, however, multiple indexes are usually processed over multiple time windows for the same dimensions and the same calculation key. Common time windows include 5 minutes, 10 minutes, 15 minutes, 30 minutes, 60 minutes, 12 hours, and 24 hours. That is, for the dimensions "receiving address + first 7 digits of the receiving mobile phone number + SKUID" and the calculation key "account" described above, 7 indexes corresponding to 5 minutes, 10 minutes, 15 minutes, 30 minutes, 60 minutes, 12 hours, and 24 hours are generally processed, and each of the 7 indexes is calculated separately.
Flink computes entirely in local memory. Machines in current Flink cluster deployments commonly have 8 GB or 16 GB of memory and 4-core or 8-core CPUs. A Flink cluster of 20 such machines is enough to meet the index calculation requirements of most risk control scenarios. In large internet enterprises such as e-commerce companies, however, such a hardware configuration falls far short of online business requirements. For common risk control scenarios such as orders, coupons, login, and registration, hundreds or even thousands of indexes usually have to be calculated in real time to support the business. The traffic in these scenarios is also enormous: peak throughput (TPS) often reaches the millions and stays there for long periods, and the daily data volume usually exceeds the TB level and reaches the PB level. In such scenarios, garbage collection (GC) and out-of-memory errors (OOM) occur very frequently in the Flink cluster, which seriously degrades Flink's performance and can even prevent Flink from providing a stable real-time computing service.
A common solution to the above problem is to add machines; the number of machines in a cluster often exceeds several hundred and can reach into the thousands. However, adding machines cannot fundamentally solve the problem. On the one hand, as the number of machines in the cluster grows, the node failure rate rises and the load on the cluster's JobManager increases; on the other hand, adding machines without limit also burdens the enterprise's technology investment.
Accordingly, the present disclosure provides a scheme for effectively saving memory usage.
According to a first aspect of the embodiments of the present disclosure, there is provided a data processing method, including: processing the original event request to form data to be processed for statistics; calculating corresponding hash values aiming at the dimension data in the data to be processed; judging whether the hash value is included in a dimension index of a preset data structure, wherein the dimension index is a Map structure, in the Map structure, key is the hash value, and value is a row index pointing to the two-dimensional array; if the hash value is not included in the dimension index, calculating a row index of the dimension data in the two-dimensional array; storing the hash value and the row index into a dimension index, and inserting the data to be processed into a row pointed by the row index in the two-dimensional array; and processing the related data in the two-dimensional array according to a preset time window to obtain a statistical result of the corresponding index.
In some embodiments, if the hash value is not included in the dimension index, further comprising: judging whether the free space of the two-dimensional array in the preset data structure is smaller than a preset threshold or not; and if the free space of the two-dimensional array is smaller than a preset threshold, carrying out capacity expansion on the two-dimensional array, and then executing the step of calculating the row index of the dimension data in the two-dimensional array.
In some embodiments, if the free space of the two-dimensional array is not less than a preset threshold, the step of calculating the row index of the dimensional data in the two-dimensional array is performed.
In some embodiments, calculating the row index of the dimension data in the two-dimensional array comprises: inquiring the current maximum row index value in the two-dimensional array; and adding 1 to the maximum row index value to serve as a row index of the dimension data in the two-dimensional array.
In some embodiments, if the hash value is included in the dimension index, querying the two-dimensional array for a row associated with the hash value; judging whether the current statistical mode is a specified statistical mode or not; if the current statistical mode is the designated statistical mode, updating the data in the row associated with the hash value in the two-dimensional array by using the data to be processed; and processing the related data in the two-dimensional array according to a preset time window.
In some embodiments, if the current statistical manner is not the designated statistical manner, a preset number of columns are added to the two-dimensional array; the data to be processed is inserted into the new columns of the row associated with the hash value in the two-dimensional array; and the related data in the two-dimensional array is processed according to a preset time window.
In some embodiments, before calculating the corresponding hash value for the dimension data in the data to be processed, the method further includes: judging whether the data to be processed comprises dimension data or not; if the data to be processed does not include dimension data, mapping the calculation key values in the data to be processed so as to generate corresponding virtual dimension data for the data to be processed.
In some embodiments, mapping the calculation key values in the data to be processed includes: converting the uppercase letters in the calculation key value to lowercase, extracting all characters in the calculation key value, and de-duplicating them to obtain the characters to be processed; and sorting the characters to be processed in ascending ASCII order to obtain the virtual dimension data.
In some embodiments, if the characters to be processed are all digits, the number of occurrences of each character in the calculation key value is counted; the characters to be processed are sorted in ascending ASCII order, and the corresponding number of occurrences is inserted after each character to obtain the virtual dimension data.
In some embodiments, multiple indexes are included in the same index domain, and the indexes in the same index domain have the following properties: they belong to the same service scenario, use the same statistical method, have the same statistical dimension and statistical key value, have different statistical time windows, and share the same preset data structure instance for data statistics.
According to a second aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including: a preprocessing module configured to process the original event request to form data to be processed for statistics; a hash value calculation module configured to calculate a corresponding hash value for the dimension data in the data to be processed; an identification module configured to determine whether the hash value is included in a dimension index of a preset data structure, wherein the dimension index is a Map structure; and a data processing module configured to, if the hash value is not included in the dimension index, calculate a row index of the dimension data in the two-dimensional array, store the hash value and the row index in the dimension index, insert the data to be processed into the row of the two-dimensional array pointed to by the row index, and process the related data in the two-dimensional array according to a preset time window to obtain a statistical result of the corresponding index.
According to a third aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including: a memory configured to store instructions; a processor coupled to the memory, the processor configured to perform a method implementing any of the embodiments described above based on instructions stored by the memory.
According to a fourth aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, in which computer instructions are stored; when executed by a processor, the instructions implement the method according to any of the embodiments described above.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, serve to explain the principles of the disclosure.
The present disclosure may be more clearly understood from the following detailed description, taken with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flow diagram of a data processing method according to one embodiment of the present disclosure;
FIG. 2 is a data structure diagram of one embodiment of the present disclosure;
FIG. 3 is a schematic diagram of data structure row expansion according to one embodiment of the present disclosure;
FIG. 4 is a schematic flow chart diagram of a data processing method according to another embodiment of the present disclosure;
FIGS. 5a-5c are schematic diagrams of virtual dimension mappings of some embodiments of the present disclosure;
FIG. 6 is a schematic diagram of data structure column expansion according to one embodiment of the present disclosure;
FIG. 7 is a schematic block diagram of a data processing apparatus according to one embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present disclosure.
It should be understood that the dimensions of the various parts shown in the figures are not drawn to scale. Further, the same or similar reference numerals denote the same or similar components.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. The description of the exemplary embodiments is merely illustrative and is in no way intended to limit the disclosure, its application, or uses. The present disclosure may be embodied in many different forms and is not limited to the embodiments described herein. These embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. It should be noted that: the relative arrangement of parts and steps, the composition of materials and values set forth in these embodiments are to be construed as illustrative only and not as limiting unless otherwise specifically stated.
The use of the word "comprising" or "comprises" and the like in this disclosure means that the elements listed before the word encompass the elements listed after the word and do not exclude the possibility that other elements may also be encompassed.
All terms (including technical or scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs unless specifically defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
FIG. 1 is a flow diagram of a data processing method according to one embodiment of the present disclosure. In some embodiments, the following data processing method steps are performed by a data processing apparatus.
In step 101, the raw event request is processed to form data to be processed for statistics.
In some embodiments, processing the raw event request may include filtering illegal request data to convert the raw event request into data that may be used for statistics.
In step 102, corresponding hash values are calculated for the dimension data in the data to be processed.
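Step 102 can be illustrated with a minimal sketch. The disclosure does not name a hash function, so the use of BLAKE2b, the separator character, and the function name `dimension_hash` are all assumptions made for illustration; the only property taken from the text is that the combined dimension values map to one 64-bit ("long") value.

```python
import hashlib


def dimension_hash(dimension_values):
    """Combine all dimension values and hash them to a signed 64-bit integer.

    Assumption: BLAKE2b with an 8-byte digest stands in for whatever hash
    the real system uses; the unit-separator character avoids ambiguity
    between ("ab", "c") and ("a", "bc").
    """
    combined = "\x1f".join(str(value) for value in dimension_values)
    digest = hashlib.blake2b(combined.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)


# Hypothetical dimension values: receiving address, phone prefix, SKUID.
h = dimension_hash(["1 Example Road", "1380000", "SKU-42"])
```

The same dimension values always produce the same hash value, which is what lets later events for the same dimension find the row that already exists.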
In step 103, it is determined whether the hash value is included in the dimension index of the preset data structure.
FIG. 2 is a data structure diagram of one embodiment of the present disclosure. As shown in FIG. 2, the preset data structure includes two parts, namely a dimension index and a two-dimensional array, where the dimension index is a Map structure. The key of the Map structure is a hash value of type long, obtained by combining all dimensions of the corresponding index and hashing the result, and the value is the corresponding row index. The first column of the two-dimensional array holds the row index, and the other columns store data details such as the event time and the calculation key.
The number of columns occupied by one data detail typically differs between index types; that is, different index types have different spans. For the index type "count the number of distinct calculation keys for the same dimension within a period of time" (key de-duplication statistics by dimension), the span is 2, and each data detail stores the event time and the calculation key. For the index type "count the number of occurrences of the calculation key for the same dimension within a period of time" (key increment statistics by dimension), the span is 3, and each data detail stores the event time, the calculation key, and the number of times the calculation key occurs in the current event (delta).
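The layout of FIG. 2 can be modeled roughly as follows. This is a sketch only: the class name `PresetStructure`, the use of a Python list of lists for the two-dimensional array, and zero-based row indexes are assumptions; the two span values come from the description above.

```python
from dataclasses import dataclass, field

# Spans taken from the description: 2 columns per data detail for key
# de-duplication statistics, 3 columns per detail for key increment statistics.
SPAN_DEDUP = 2      # (event_time, calc_key)
SPAN_INCREMENT = 3  # (event_time, calc_key, delta)


@dataclass
class PresetStructure:
    """Hypothetical model of the dimension index plus two-dimensional array."""
    span: int
    dimension_index: dict = field(default_factory=dict)  # hash value -> row index
    rows: list = field(default_factory=list)             # rows[i][0] is the row index

    def details(self, row_index):
        """Split the columns after the first into details of `span` columns each."""
        cells = self.rows[row_index][1:]
        return [tuple(cells[i:i + self.span]) for i in range(0, len(cells), self.span)]


structure = PresetStructure(span=SPAN_INCREMENT)
structure.dimension_index[123456] = 0
structure.rows.append([0, 1000, "acct-1", 2, 1060, "acct-2", 1])
print(structure.details(0))
```

Storing details as flat columns with a fixed span, rather than as one object per detail, is presumably what keeps per-detail memory overhead low.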
Returning to fig. 1. At step 104, if the hash value is not included in the dimension index, a row index of the dimension data in the two-dimensional array is calculated.
In some embodiments, the current maximum row index value in the two-dimensional array is queried, and that value plus 1 is used as the row index of the dimension data in the two-dimensional array.
In step 105, the hash value and the row index are stored in the dimension index, and the data to be processed is inserted into the row pointed by the row index in the two-dimensional array.
FIG. 3 is a schematic diagram of data structure row expansion according to an embodiment of the disclosure. If the calculated hash value 123456 is not included in the dimension index and the current maximum row index value in the two-dimensional array is 2, then 3 is used as the row index, in the two-dimensional array, of the dimension data in the data to be processed. Next, the hash value 123456 and the row index 3 are stored in the dimension index, and the items A, B, C, and D of the data to be processed are inserted into the row of the two-dimensional array pointed to by row index 3. A, B, C, and D may be placed together in one column of that row or separately in different columns of that row.
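The FIG. 3 example can be reproduced with a short sketch of steps 104 and 105. The function name `insert_new_dimension` and the list-of-lists representation are assumptions; the "maximum row index plus 1" rule and the values 123456, 2, and 3 are from the description.

```python
def insert_new_dimension(dimension_index, array, hash_value, detail):
    """Allocate a row for a hash value that is not yet in the dimension index.

    dimension_index: dict mapping hash value -> row index
    array: list of rows, where row[0] holds the row index
    detail: the data items to store in the new row
    """
    if hash_value in dimension_index:
        raise KeyError(f"hash value {hash_value} already has a row")
    # Step 104: new row index = current maximum row index + 1.
    max_row_index = max((row[0] for row in array), default=-1)
    row_index = max_row_index + 1
    # Step 105: record the mapping, then store the data in the new row.
    dimension_index[hash_value] = row_index
    array.append([row_index, *detail])
    return row_index


# Rows 0..2 already exist, as in FIG. 3.
index = {111: 0, 222: 1, 333: 2}
array = [[0, "x"], [1, "y"], [2, "z"]]
new_row = insert_new_dimension(index, array, 123456, ("A", "B", "C", "D"))
print(new_row, array[new_row])
```

Here each of A, B, C, and D is placed in its own column, which is one of the two placements the description allows.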
Returning to fig. 1. In step 106, the related data in the two-dimensional array is processed according to a preset time window to obtain a statistical result of the corresponding index.
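As an illustration of step 106, the sketch below computes a key de-duplication statistic for several time windows from the data details of a single row. The function name, the use of seconds for times, and the half-open window (now - window, now] are assumptions; the disclosure only states that the related data is processed according to a preset time window.

```python
def distinct_keys_per_window(details, now, windows):
    """Count distinct calculation keys in each time window.

    details: list of (event_time, calc_key) pairs from one row (span 2)
    now: current time, in the same unit as event_time
    windows: iterable of window lengths
    Returns a dict mapping window length -> number of distinct keys whose
    event time falls in (now - window, now].
    """
    result = {}
    for window in windows:
        keys = {key for event_time, key in details if now - window < event_time <= now}
        result[window] = len(keys)
    return result


row_details = [(0, "a"), (100, "b"), (290, "a"), (299, "c")]
print(distinct_keys_per_window(row_details, now=300, windows=[60, 300]))
```

Because every window reads the same stored details, the details are kept once per dimension rather than once per index.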
It should be noted here that, to achieve multiplexing of index calculation with the above data structure, the present disclosure uses an "index domain" to manage the indexes. Multiplexing here refers to sharing the calculation among multiple indexes that have the same statistical dimension and the same calculation key but different statistical time windows. The same index domain usually contains several indexes, and these indexes have the following characteristics:
1) they belong to one service scenario (usually, indexes of several service scenarios can be processed simultaneously in one Flink cluster);
2) they use the same statistical method (type), such as key de-duplication statistics by dimension, key increment statistics by dimension, or key increment statistics performed directly;
3) they have the same statistical dimension and statistical key;
4) they have different statistical time windows;
5) they share the same core data structure instance for data statistics.
Taken together, these characteristics show that an index domain is uniquely determined by four elements: service scenario, statistical method, statistical dimension, and calculation key, and that every index under the domain shares these four elements. The index domain manages the life cycle of its indexes and of the core data structure they use; that is, one index domain has exactly one core data structure instance. The index domain also clears expired data from its core data structure instance according to its own time window value, which is the maximum of the time windows of all indexes under the domain.
In some embodiments, the unified index processing method may be initialized as follows:
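The two rules in the preceding paragraph, that the domain window is the maximum of its indexes' windows and that data older than this window is cleared, can be sketched as follows. Function names and the representation of details as tuples whose first element is the event time are assumptions.

```python
def domain_window(index_windows):
    """The index domain's time window is the largest window among its indexes."""
    return max(index_windows)


def purge_expired(details, now, window):
    """Keep only the details whose event time is newer than now - window.

    details: list of tuples whose first element is the event time.
    """
    return [detail for detail in details if now - detail[0] < window]


# Hypothetical windows of 5 minutes, 10 minutes, and 24 hours, in seconds.
window = domain_window([300, 600, 86400])
kept = purge_expired([(0, "a"), (50, "b"), (100, "c")], now=100, window=60)
print(window, kept)
```

Purging against the largest window is safe for every index in the domain, since no index ever needs data older than that.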
1) acquiring all indexes in the system and processing definitions of the indexes;
2) in order to distinguish indexes of different service scenes, the indexes need to be classified according to the service scenes, such as a scene A index, a scene B index and a scene C index;
3) according to different service scenes, further dividing each index into corresponding index domains according to the statistical method, the statistical dimension and the calculation key of the index;
4) initializing each index domain, including initializing the life cycle of the index domain and of each index under it, calculating the time window value of the index domain from the indexes under it, and initializing the core data structure;
5) finishing the initialization process of the unified index processing method.
It should be noted here that Flink performs data statistics per index domain, and the calculation of each individual index is handled entirely by its index domain.
In the data processing method provided by the above embodiment of the present disclosure, designing a reusable in-memory data structure that supports the calculation of multiple indexes effectively reduces the memory usage of the machine.
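Steps 2) to 4) of the initialization above amount to grouping index definitions by four shared elements. The sketch below shows one possible way to do this; the `IndexDef` class, its field names, and the sample definitions are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexDef:
    """Hypothetical index definition; window is in seconds."""
    name: str
    scene: str
    method: str
    dimension: tuple
    calc_key: str
    window: int


def build_domains(index_defs):
    """Group indexes by (scene, method, dimension, calc_key).

    Each group corresponds to one index domain and would share a single
    core data structure instance.
    """
    domains = defaultdict(list)
    for index_def in index_defs:
        domain_key = (index_def.scene, index_def.method,
                      index_def.dimension, index_def.calc_key)
        domains[domain_key].append(index_def)
    return dict(domains)


dims = ("address", "phone_prefix", "skuid")
defs = [
    IndexDef("accounts_5m", "order", "dedup", dims, "account", 300),
    IndexDef("accounts_24h", "order", "dedup", dims, "account", 86400),
    IndexDef("logins_10m", "login", "increment", ("virtual",), "account", 600),
]
domains = build_domains(defs)
for key, members in domains.items():
    print(key, [m.name for m in members], max(m.window for m in members))
```

With this grouping, the two order-scenario indexes that differ only in their time window fall into one domain.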
Fig. 4 is a flow diagram of a data processing method according to another embodiment of the present disclosure. In some embodiments, the following data processing method steps are performed by a data processing apparatus.
In step 401, the raw event request is processed to form data to be processed for statistics.
In some embodiments, processing the raw event request may include filtering illegal request data to convert the raw event request into data that may be used for statistics.
In step 402, it is determined whether dimension data is included in the data to be processed.
If the data to be processed does not include dimension data, executing step 403; otherwise, step 404 is performed.
It should be noted that some specific indexes have only a calculation key and a time window, such as the index "count the number of logins of the same account within a period of time" (key increment statistics performed directly). Such an index is referred to as a "dimensionless index". For ease of handling, a virtual dimension needs to be added to such indexes.
In step 403, mapping processing is performed on the calculation key values in the data to be processed, so as to generate corresponding virtual dimension data for the data to be processed.
In some embodiments, mapping the calculation key values in the data to be processed includes: converting the uppercase letters in the calculation key value to lowercase, extracting all characters in the calculation key value, and de-duplicating them to obtain the characters to be processed; and sorting the characters to be processed in ascending ASCII order to obtain the virtual dimension data.
If the characters to be processed are all digits, the number of occurrences of each character in the calculation key value is counted. The characters to be processed are sorted in ascending ASCII order, and the corresponding number of occurrences is inserted after each character to obtain the virtual dimension data.
For example, as shown in FIG. 5a, the virtual dimension obtained by mapping the three purely alphabetic character strings Kawai, Wiaka, and Kiawa is aikw. As shown in FIG. 5b, the virtual dimension obtained by mapping the strings day1, 1day, and da1y, which contain a digit, is 1ady. As shown in FIG. 5c, the virtual dimension obtained by mapping the purely numeric strings 998011, 101899, and 190891 is 01128192.
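The mapping of step 403 is concrete enough to sketch directly. The function name `virtual_dimension` is an assumption; the rules (lowercase, de-duplicate, sort in ASCII order, append occurrence counts when every character is a digit) follow the description, and the sample inputs are those of FIGS. 5a to 5c.

```python
import string
from collections import Counter


def virtual_dimension(calc_key):
    """Map a calculation key to virtual dimension data."""
    lowered = calc_key.lower()
    # De-duplicate, then sort; for ASCII characters, sorting by code point
    # is the same as sorting by ASCII code.
    chars = sorted(set(lowered))
    if all(c in string.digits for c in chars):
        # Purely numeric key: insert each character's occurrence count after it.
        counts = Counter(lowered)
        return "".join(f"{c}{counts[c]}" for c in chars)
    return "".join(chars)


for key in ["Kawai", "Wiaka", "Kiawa", "day1", "1day", "da1y",
            "998011", "101899", "190891"]:
    print(key, "->", virtual_dimension(key))
```

Note that the mapping is many-to-one: keys that are permutations of each other share a virtual dimension and therefore land in the same row, where they are still distinguished by the calculation key stored in each data detail.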
In step 404, a corresponding hash value is calculated for the dimension data in the data to be processed.
In step 405, it is determined whether the hash value is included in the dimension index of the predetermined data structure.
If the hash value is not included in the dimension index, go to step 406; otherwise, step 411 is performed.
At step 406, it is determined whether row expansion is required.
If row expansion is needed, step 407 is executed; otherwise, step 408 is executed.
In some embodiments, the determination is made as to whether the free space of the two-dimensional array in the predetermined data structure is less than a predetermined threshold. And if the free space of the two-dimensional array is smaller than the preset threshold, expanding the capacity of the two-dimensional array.
In step 407, a predetermined number of rows are added to the two-dimensional array.
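The capacity check of steps 406 and 407 might look like the sketch below. It assumes the two-dimensional array is preallocated, with unused rows represented by None, and that free space is measured in rows; the function name and parameters are hypothetical.

```python
def ensure_row_capacity(array, used_rows, threshold, grow_by):
    """Expand the array when its free rows fall below the threshold.

    array: preallocated list of rows; unused slots hold None
    used_rows: number of rows currently in use
    threshold: minimum number of free rows to keep available
    grow_by: preset number of rows to add when expanding
    Returns True if the array was expanded.
    """
    free_rows = len(array) - used_rows
    if free_rows < threshold:
        array.extend([None] * grow_by)
        return True
    return False


rows = [[0, "x"], [1, "y"], [2, "z"], None]
expanded = ensure_row_capacity(rows, used_rows=3, threshold=2, grow_by=4)
print(expanded, len(rows))
```

Growing by a preset number of rows at a time avoids reallocating on every new dimension value.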
At step 408, a row index of the dimensional data in the two-dimensional array is calculated.
In some embodiments, the current maximum row index value in the two-dimensional array is queried, and that value plus 1 is used as the row index of the dimension data in the two-dimensional array.
In step 409, the hash value and the row index are stored in the dimension index, and the data to be processed is inserted into the row pointed by the row index in the two-dimensional array.
In step 410, the related data in the two-dimensional array is processed according to a preset time window to obtain a statistical result of the corresponding index.
At step 411, the row associated with the hash value is queried in the two-dimensional array.
In step 412, it is determined whether the current statistical manner is a specified statistical manner. For example, the specified statistical manner is de-duplication statistics in which the incoming data detail duplicates one already stored.
If the current statistical manner is the designated statistical manner, go to step 413; otherwise, step 414 is performed.
In step 413, the data in the row associated with the hash value in the two-dimensional array is updated with the data to be processed. Step 410 is then performed.
In step 414, a preset number of columns are added to the two-dimensional array.
In step 415, the data to be processed is inserted into the new columns of the row associated with the hash value in the two-dimensional array. Step 410 is then performed.
FIG. 6 is a schematic diagram of data structure column expansion according to one embodiment of the present disclosure. If the calculated hash value 123456 is included in the dimension index and the row index corresponding to the hash value is 3, a number of columns are added to the right side of the two-dimensional array so that the related information can be inserted into the new columns of row 3. The information may be inserted into one new column of row 3 or into several new columns of row 3, as required.
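Steps 411 to 415 can be sketched for a single row as follows. This is one reading of the description, not a confirmed implementation: it treats "specified statistical manner" as de-duplication statistics where the incoming calculation key is already stored in the row, and it models column expansion by extending one row of a ragged list rather than widening a true rectangular array. The function name and parameters are hypothetical.

```python
def apply_to_existing_row(row, detail, span, dedup):
    """Update or extend the row that an existing hash value points to.

    row: [row_index, cell, cell, ...] with details laid out `span` cells apart
    detail: tuple of length `span`, starting with (event_time, calc_key)
    dedup: True for key de-duplication statistics
    Returns "updated" if an existing detail was overwritten (step 413),
    or "appended" if new columns were added (steps 414 and 415).
    """
    if dedup:
        for start in range(1, len(row), span):
            if row[start + 1] == detail[1]:      # same calculation key
                row[start:start + span] = detail  # refresh the stored detail
                return "updated"
    row.extend(detail)  # new columns on the right side of the row
    return "appended"


row3 = [3, 100, "a", 200, "b"]
print(apply_to_existing_row(row3, (300, "a"), span=2, dedup=True), row3)
print(apply_to_existing_row(row3, (400, "c"), span=2, dedup=True), row3)
```

Updating in place for duplicate keys keeps the row from growing when the same key is seen repeatedly, which matters for de-duplication statistics on hot dimensions.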
Fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure. As shown in fig. 7, the data processing apparatus includes a preprocessing module 71, a hash value calculation module 72, a recognition module 73, and a data processing module 74.
The pre-processing module 71 is configured to process the raw event request to form the data to be processed for statistics.
In some embodiments, processing the raw event request may include filtering illegal request data to convert the raw event request into data that may be used for statistics.
The hash value calculation module 72 is configured to calculate respective hash values for the dimension data in the data to be processed.
In some embodiments, the hash value calculation module 72 determines whether dimension data is included in the data to be processed. If the data to be processed does not include dimension data, the calculation key value in the data to be processed is mapped so as to generate corresponding virtual dimension data for the data to be processed.
For example, uppercase letters in the calculation key value are converted to lowercase, all characters in the calculation key value are extracted and deduplicated to obtain the characters to be processed, and the characters to be processed are sorted in ascending order of their ASCII codes to obtain the virtual dimension data. If the characters to be processed are all digits, the number of occurrences of each character in the calculation key value is counted, the characters to be processed are sorted in ascending order of their ASCII codes, and the corresponding number of occurrences is inserted after each character to obtain the virtual dimension data.
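One possible reading of this mapping is sketched below in Python. The function name `virtual_dimension` and the handling of an empty key (returning an empty string) are assumptions for illustration; the disclosure does not specify them.

```python
from collections import Counter


def virtual_dimension(calc_key: str) -> str:
    """Map a calculation key value to virtual dimension data (illustrative)."""
    lowered = calc_key.lower()
    # Deduplicate, then sort; Python orders characters by code point,
    # which matches ASCII order for ASCII input.
    unique_chars = sorted(set(lowered))
    if unique_chars and all(ch in "0123456789" for ch in unique_chars):
        # All-digit case: append each digit's occurrence count after it.
        counts = Counter(lowered)
        return "".join(f"{ch}{counts[ch]}" for ch in unique_chars)
    return "".join(unique_chars)


letters = virtual_dimension("BaNaNa")  # unique lowercase letters, sorted
digits = virtual_dimension("3113")     # digit followed by its count
mixed = virtual_dimension("A1b1")      # digits sort before letters in ASCII
```

Under this reading, different key values that contain the same set of characters map to the same virtual dimension data, so they share one row of the two-dimensional array.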
The identification module 73 is configured to determine whether the hash value is included in the dimension index of the preset data structure.
Here, it should be noted that the preset data structure is the data structure shown in fig. 2.
The data processing module 74 is configured to calculate a row index of the dimension data in the two-dimensional array if the hash value is not included in the dimension index, store the hash value and the row index into the dimension index, insert the data to be processed into the row pointed to by the row index in the two-dimensional array, and process the related data in the two-dimensional array according to a preset time window to obtain a statistical result of the corresponding index.
In some embodiments, the data processing module 74 queries the current maximum row index value in the two-dimensional array and uses that maximum value plus 1 as the row index of the dimension data in the two-dimensional array.
In some embodiments, if the hash value is not included in the dimension index, the data processing module 74 determines whether a free space of a two-dimensional array in a preset data structure is smaller than a preset threshold, and if the free space of the two-dimensional array is smaller than the preset threshold, performs capacity expansion on the two-dimensional array, and then performs an operation of calculating a row index of the dimension data in the two-dimensional array. If the free space of the two-dimensional array is not less than the preset threshold, the data processing module 74 directly performs an operation of calculating the row index of the dimensional data in the two-dimensional array.
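The free-space check can be illustrated with the following Python sketch. The class `GrowableTable`, measuring free space in rows, and the doubling growth policy are all assumptions; the disclosure states only that the array is expanded when its free space falls below a preset threshold.

```python
class GrowableTable:
    """Hypothetical pre-allocated 2D array with threshold-based expansion."""

    def __init__(self, capacity=2, threshold=1, growth=2):
        self.capacity = capacity    # rows currently allocated
        self.used = 0               # rows already occupied
        self.threshold = threshold  # expand when free rows < threshold
        self.growth = growth        # assumed growth factor
        self.rows = [None] * capacity
        self.dimension_index = {}   # hash value -> row index

    def add_dimension(self, hash_value, data):
        free = self.capacity - self.used
        if free < self.threshold:
            # Capacity expansion happens before the row index is calculated.
            extra = self.capacity * (self.growth - 1)
            self.rows.extend([None] * extra)
            self.capacity += extra
        row_index = self.used       # current maximum used index + 1
        self.rows[row_index] = [data]
        self.dimension_index[hash_value] = row_index
        self.used += 1
        return row_index


table = GrowableTable(capacity=2, threshold=1)
indices = [table.add_dimension(h, f"event-{h}") for h in (11, 22, 33)]
```

With these assumed parameters the third insertion finds no free row, so the table grows from two rows to four before the new row index is computed.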
In some embodiments, if the hash value is included in the dimension index, the data processing module 74 queries the two-dimensional array for the row associated with the hash value and determines whether the current statistical manner is a specified statistical manner. If the current statistical manner is the specified statistical manner, the data in the row associated with the hash value in the two-dimensional array is updated with the data to be processed, and the related data in the two-dimensional array is processed according to a preset time window.
If the current statistical manner is not the specified statistical manner, the data processing module 74 adds a preset plurality of columns to the two-dimensional array, inserts the data to be processed into the newly added columns of the row associated with the hash value in the two-dimensional array, and processes the related data in the two-dimensional array according to a preset time window.
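The branch taken when the hash value already exists can be sketched as follows in Python. The mode labels `DEDUP` and `DETAIL`, the function name `handle_existing`, and the interpretation of "update" as overwriting the first cell of the row are assumptions made for this illustration.

```python
DEDUP = "dedup"    # hypothetical label for the specified statistical manner
DETAIL = "detail"  # hypothetical label for any other statistical manner


def handle_existing(array, dimension_index, hash_value, data, mode, new_columns=1):
    """Update the matched row in place, or widen the array and append."""
    row = array[dimension_index[hash_value]]
    if mode == DEDUP:
        # Specified manner: overwrite the stored data (assumed: first cell).
        row[0] = data
    else:
        # Other manner: add columns to every row, then use the first new cell.
        width = len(row)
        for r in array:
            r.extend([None] * new_columns)
        row[width] = data
    return row


array = [["a"], ["b"]]
dimension_index = {1: 0, 2: 1}
handle_existing(array, dimension_index, 1, "a2", DEDUP)
handle_existing(array, dimension_index, 2, "b2", DETAIL)
```

After either branch, the related data would then be processed according to the preset time window, which this sketch does not model.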
In some embodiments, the same metric domain includes multiple metrics, and the metrics in the same metric domain have the following properties: they belong to the same business scenario, use the same statistical method, have the same statistical dimension and statistical key value, each have a different statistical time window, and share the same instance of the preset data structure for data statistics.
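To illustrate how several metrics with different time windows might share one data structure instance, the following Python sketch counts the events of a single row over three windows. The metric names, the use of integer timestamps in seconds, and the half-open window `(now - span, now]` are assumptions for this example only.

```python
def count_in_windows(event_times, now, windows):
    """Return {metric name: number of events with now - span < t <= now}."""
    return {
        name: sum(1 for t in event_times if now - span < t <= now)
        for name, span in windows.items()
    }


# One shared row of event timestamps serves all metrics in the domain.
row_timestamps = [10, 940, 995, 1000]
windows = {"count_1m": 60, "count_10m": 600, "count_1h": 3600}
result = count_in_windows(row_timestamps, now=1000, windows=windows)
```

Because every metric reads the same stored row, the raw data is kept once regardless of how many time windows are defined for the metric domain.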
Fig. 8 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present disclosure. As shown in fig. 8, the apparatus includes a memory 81 and a processor 82.
The memory 81 is used to store instructions. The processor 82 is coupled to the memory 81. The processor 82 is configured to perform, based on the instructions stored in the memory, the method of any of the embodiments of fig. 1 or fig. 4.
As shown in fig. 8, the apparatus further includes a communication interface 83 for exchanging information with other devices. The apparatus also includes a bus 84, through which the processor 82, the communication interface 83 and the memory 81 communicate with one another.
The memory 81 may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk storage. The memory 81 may also be a memory array. The memory 81 may also be partitioned into blocks, and the blocks may be combined into virtual volumes according to certain rules.
Further, the processor 82 may be a central processing unit, an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when executed by a processor, implement the method of any of the embodiments of fig. 1 or fig. 4.
By implementing the present disclosure, the problem of metric calculation in current high-traffic risk-control scenarios can be effectively addressed, hardware resources can be substantially saved, and the probability of system errors can be reduced.
In some embodiments, the functional modules may be implemented as a general purpose Processor, a Programmable Logic Controller (PLC), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other Programmable Logic device, discrete Gate or transistor Logic device, discrete hardware components, or any suitable combination thereof for performing the functions described in this disclosure.
So far, embodiments of the present disclosure have been described in detail. Some details that are well known in the art have not been described in order to avoid obscuring the concepts of the present disclosure. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.
Although some specific embodiments of the present disclosure have been described in detail by way of example, it should be understood by those skilled in the art that the foregoing examples are for purposes of illustration only and are not intended to limit the scope of the present disclosure. It will be understood by those skilled in the art that various changes may be made in the above embodiments or equivalents may be substituted for elements thereof without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is defined by the appended claims.
Claims (13)
1. A method of data processing, comprising:
processing the original event request to form data to be processed for statistics;
calculating corresponding hash values aiming at the dimension data in the data to be processed;
judging whether the hash value is included in a dimension index of a preset data structure, wherein the dimension index is a Map structure, in the Map structure, key is the hash value, and value is a row index pointing to the two-dimensional array;
if the hash value is not included in the dimension index, calculating a row index of the dimension data in the two-dimensional array;
storing the hash value and the row index into a dimension index, and inserting the data to be processed into a row pointed by the row index in the two-dimensional array;
and processing the related data in the two-dimensional array according to a preset time window to obtain a statistical result of the corresponding index.
2. The method of claim 1, wherein if the hash value is not included in the dimension index, further comprising:
judging whether the free space of the two-dimensional array in the preset data structure is smaller than a preset threshold or not;
and if the free space of the two-dimensional array is smaller than a preset threshold, carrying out capacity expansion on the two-dimensional array, and then executing the step of calculating the row index of the dimension data in the two-dimensional array.
3. The method of claim 2, further comprising:
and if the free space of the two-dimensional array is not smaller than a preset threshold, executing a step of calculating the row index of the dimension data in the two-dimensional array.
4. The method of claim 1, wherein calculating a row index of the dimensional data in the two-dimensional array comprises:
inquiring the current maximum row index value in the two-dimensional array;
and adding 1 to the maximum row index value to serve as a row index of the dimension data in the two-dimensional array.
5. The method of claim 1, further comprising:
querying a row associated with the hash value in the two-dimensional array if the hash value is included in the dimension index;
judging whether the current statistical mode is a specified statistical mode or not;
if the current statistical mode is the designated statistical mode, updating the data in the row associated with the hash value in the two-dimensional array by using the data to be processed;
and processing the related data in the two-dimensional array according to a preset time window.
6. The method of claim 5, further comprising:
if the current statistical mode is not the designated statistical mode, newly adding a plurality of columns in the two-dimensional array;
inserting the data to be processed into the newly added columns of the row associated with the hash value in the two-dimensional array;
and processing the related data in the two-dimensional array according to a preset time window.
7. The method of claim 1, wherein before calculating the respective hash values for the dimension data in the data to be processed, further comprising:
judging whether the data to be processed comprises dimension data or not;
if the data to be processed does not include dimension data, mapping the calculation key values in the data to be processed so as to generate corresponding virtual dimension data for the data to be processed.
8. The method of claim 7, wherein mapping the computation key in the data to be processed comprises:
converting uppercase letters in the calculation key value to lowercase, extracting all characters in the calculation key value and performing de-duplication processing to obtain characters to be processed;
and sorting the characters to be processed in ascending order of their ASCII codes to obtain virtual dimension data.
9. The method of claim 8, further comprising:
if the characters to be processed are all digits, counting the number of occurrences of each character in the calculation key value;
and sorting the characters to be processed in ascending order of their ASCII codes, and inserting the corresponding number of occurrences after each character to obtain virtual dimension data.
10. The method of any one of claims 1-9, wherein:
the same index domain comprises a plurality of indexes, and the indexes in the same index domain have the following properties: they belong to the same business scenario, use the same statistical method, have the same statistical dimension and statistical key value, each have a different statistical time window, and share the same instance of the preset data structure for data statistics.
11. A data processing apparatus comprising:
the preprocessing module is configured to process the original event request to form data to be processed for statistics;
the hash value calculation module is configured to calculate corresponding hash values for the dimension data in the data to be processed;
the identification module is configured to judge whether the hash value is included in a dimension index of a preset data structure, wherein the dimension index is a Map structure, in the Map structure, key is the hash value, and value is a row index pointing to the two-dimensional array;
a data processing module configured to calculate a row index of the dimension data in the two-dimensional array if the hash value is not included in the dimension index; storing the hash value and the row index into a dimension index, and inserting the data to be processed into a row pointed by the row index in the two-dimensional array; and processing the related data in the two-dimensional array according to a preset time window to obtain a statistical result of the corresponding index.
12. A data processing apparatus comprising:
a memory configured to store instructions;
a processor coupled to the memory, the processor configured to perform the method of any of claims 1-10 based on instructions stored by the memory.
13. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions which, when executed by a processor, implement the method of any one of claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911200575.7A CN111782645B (en) | 2019-11-29 | 2019-11-29 | Data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111782645A (en) | 2020-10-16 |
CN111782645B (en) | 2024-07-16 |
Family
ID=72755740
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911200575.7A (Active) CN111782645B (en) | 2019-11-29 | 2019-11-29 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111782645B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113779022A (en) * | 2021-02-07 | 2021-12-10 | 北京沃东天骏信息技术有限公司 | Data backtracking output method and device, electronic equipment and storage medium |
US20220147503A1 (en) * | 2020-08-11 | 2022-05-12 | Massachusetts Mutual Life Insurance Company | Systems and methods to generate a database structure with a low-latency key architecture |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140208053A1 (en) * | 2013-01-18 | 2014-07-24 | International Business Machines Corporation | Re-aligning a compressed data array |
CN104063376A (en) * | 2013-03-18 | 2014-09-24 | 阿里巴巴集团控股有限公司 | Multi-dimensional grouping operation method and system |
WO2015109250A1 (en) * | 2014-01-20 | 2015-07-23 | Alibaba Group Holding Limited | CREATING NoSQL DATABASE INDEX FOR SEMI-STRUCTURED DATA |
CN105989076A (en) * | 2015-02-10 | 2016-10-05 | 腾讯科技(深圳)有限公司 | Data statistical method and device |
CN105989078A (en) * | 2015-02-11 | 2016-10-05 | 烟台中科网络技术研究所 | Index construction method for structured peer-to-peer network as well as retrieval method, apparatus and system |
CN109656923A (en) * | 2018-12-19 | 2019-04-19 | 北京字节跳动网络技术有限公司 | A kind of data processing method, device, electronic equipment and storage medium |
Non-Patent Citations (3)
Title |
---|
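Song Wei; Yang Bingru; Xu Zhangyan; Han Yanling: "An efficient frequent itemset mining algorithm based on index arrays", High Technology Letters, no. 03 * |
Cui Chen; Zheng Linjiang; Han Fengping; He Mujun: "Memory-based HBase secondary index design", Journal of Computer Applications, no. 06 * |
Zhang Jiamin: "Research and application of OLAP and data mining technology based on a data warehouse architecture", China Master's Theses Full-text Database * |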
Also Published As
Publication number | Publication date |
---|---|
CN111782645B (en) | 2024-07-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||