CN115470212A - Data sampling method and device based on distributed memory database

Data sampling method and device based on distributed memory database

Info

Publication number
CN115470212A
CN115470212A
Authority
CN
China
Prior art keywords
data
memory database
distributed memory
hash value
piece
Prior art date
Legal status
Pending
Application number
CN202211122878.3A
Other languages
Chinese (zh)
Inventor
温平
朱海勇
周成祖
邓立峰
Current Assignee
Xiamen Meiya Pico Information Co Ltd
Original Assignee
Xiamen Meiya Pico Information Co Ltd
Priority date
Filing date
Publication date
Application filed by Xiamen Meiya Pico Information Co Ltd
Priority to CN202211122878.3A
Publication of CN115470212A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 - Indexing structures
    • G06F 16/2255 - Hash tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 - Querying
    • G06F 16/245 - Query processing
    • G06F 16/2455 - Query execution
    • G06F 16/24552 - Database cache management
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/24 - Querying
    • G06F 16/245 - Query processing
    • G06F 16/2458 - Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2471 - Distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data sampling method based on a distributed memory database, which uses the distributed memory database as the filtering container and a data filtering rule as the filtering condition. The filter container attributes include the distributed cluster servers, the data cache size and the data cache policy. The filtering condition includes calculating a 128-bit HASH value with the MD5 algorithm according to the rule and organizing how data is stored in the memory database based on that HASH value, so that data can be extracted quickly and accurately from massive big data according to a user-defined rule. The method can achieve the desired data sampling effect at every major scale up to the PB level, and obtains the required sampling results in a short time without affecting the efficiency of the business services that use it. Sampling analysis of mass data across industries allows the overall situation to be grasped rapidly and early warnings and judgments to be made in advance, which is of great practical significance in fields such as daily life and production, the development of events, and disaster prediction.

Description

Data sampling method and device based on distributed memory database
Technical Field
The present application relates to the field of data sampling technologies, and in particular, to a data sampling method and apparatus based on a distributed memory database.
Background
Data sampling has a profound influence on people's lives: it reflects the development trend of things across dimensions such as time, space and quantity, so sampling methods are widely used across industries to observe how things develop and to predict future trends. Many sampling methods exist today, and each industry has formed methods suited to its own characteristics, including manual collection, questionnaire surveys, and collection of the log data generated by business systems. In the current big data era, however, sampling from mass data exposes problems of collection difficulty, accuracy and timeliness, so data sampling needs to be performed in a more optimized way.
With the mass data of the big data information era, manual collection and analysis or simple data extraction is no longer feasible; the accuracy of sampled data can only be guaranteed by extracting valuable data samples regularly and precisely. At present it is difficult to extract data quickly and accurately from mass data according to user-defined rules.
Disclosure of Invention
In order to solve the technical problems, the application provides a data sampling method and device based on a distributed memory database.
In a first aspect, the present application provides a data sampling method based on a distributed memory database, including the following steps:
S1: calculating the HASH value of the current data according to a preset rule when data arrives at the stream-processing program;
S2: constructing a storage filtering container: deploying the distributed memory database as a database cluster using 80% of the resources in the system's available resource pool and dividing it into a plurality of child nodes;
S3: writing the HASH value of the current data into the distributed memory database;
S4: when the next piece of data arrives, calculating the HASH value of the piece of data according to the preset rule and matching it against the distributed memory database; filtering the piece of data out if the same HASH value already exists in the distributed memory database, and storing the piece of data in the distributed memory database if it does not.
By adopting this technical scheme, the distributed memory database serves as the filtering container and the data filtering rule serves as the filtering condition. The filter container attributes include the distributed cluster servers, the data cache size and the data cache policy. The filtering condition includes calculating a 128-bit HASH value with the MD5 algorithm according to the rule and organizing how data is stored in the memory database based on that HASH value, so that data can be extracted quickly and accurately from massive big data according to a user-defined rule.
Preferably, S1 specifically includes:
S11: calculating the file size of the single record when data arrives at the stream-processing program, and distinguishing data larger than 10240 KB from data smaller than or equal to 10240 KB;
S12: for data larger than 10240 KB, performing the MD5 calculation on the first 1024 KB and the last 1024 KB of the file; for data smaller than or equal to 10240 KB, converting the data to bytes, performing a full-data reverse-order calculation, and then performing the MD5 calculation over the full data to obtain the HASH value of the current data.
Preferably, in S3, after the HASH value of the current data is written into the distributed memory database, the cache expiration time is assigned.
Preferably, in S4, if the same HASH value does not exist in the distributed memory database, the piece of data is stored in the distributed memory database, and the piece of data is loaded into the memory to assign the cache expiration time.
In a second aspect, the present application further provides a data sampling apparatus based on a distributed memory database, where the apparatus includes:
the HASH value calculating module, configured to calculate the HASH value of the current data according to a preset rule when data arrives at the stream-processing program;
the storage filtering container construction module, configured to deploy the distributed memory database as a database cluster using 80% of the resources in the system's available resource pool and divide it into a plurality of child nodes;
the HASH value storage module, configured to write the HASH value of the current data into the distributed memory database;
and the data filtering module, configured to calculate, when the next piece of data arrives, the HASH value of the piece of data according to the preset rule and match it against the distributed memory database; the piece of data is filtered out if the same HASH value already exists in the distributed memory database, and is stored in the distributed memory database if it does not.
Preferably, calculating the HASH value of the current data according to a preset rule when data arrives at the stream-processing program specifically includes:
calculating the file size of the single record when data arrives at the stream-processing program, and distinguishing data larger than 10240 KB from data smaller than or equal to 10240 KB;
for data larger than 10240 KB, performing the MD5 calculation on the first 1024 KB and the last 1024 KB of the file; for data smaller than or equal to 10240 KB, converting the data to bytes, performing a full-data reverse-order calculation, and then performing the MD5 calculation over the full data to obtain the HASH value of the current data.
Preferably, the HASH value storage module writes the HASH value of the current data into the distributed memory database and assigns a cache expiration time.
Preferably, in the data filtering module, if the distributed memory database does not have the same HASH value, the piece of data is stored in the distributed memory database, and the piece of data is loaded into the memory to assign the cache expiration time.
In a third aspect, the present application further provides an electronic device, including:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described in the first aspect.
In a fourth aspect, the present application also proposes a computer-readable storage medium, on which a computer program is stored, characterized in that said program, when executed by a processor, implements the method according to the first aspect.
The application at least comprises the following beneficial technical effects:
1. The invention uses the distributed memory database as the filtering container and the data filtering rule as the filtering condition. The filter container attributes include the distributed cluster servers, the data cache size and the data cache policy. The filtering condition includes calculating a 128-bit HASH value with the MD5 algorithm according to the rule and organizing how data is stored in the memory database based on that HASH value, so that data can be extracted quickly and accurately from massive big data according to a user-defined rule;
2. The invention mainly uses the distributed memory database as a container to rapidly filter and extract result information from massive data. It can achieve the desired data sampling effect at every major scale up to the PB level, and obtains the required sampling results in a short time without affecting the efficiency of the business services that use it. Sampling analysis of mass data across industries allows the overall situation to be grasped rapidly and early warnings and judgments to be made in advance, which is of great practical significance in fields such as daily life and production, the development of events, and disaster prediction.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain the principles of the application. Other embodiments and many of the intended advantages of embodiments will be readily appreciated as they become better understood by reference to the following detailed description. The elements of the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding similar parts.
Fig. 1 is a flowchart of a data sampling method based on a distributed memory database according to the present application.
Fig. 2 is a schematic diagram of an embodiment of a data sampling method based on a distributed memory database that can be applied to the present application.
Fig. 3 is a schematic flowchart of step S1 of the data sampling method based on the distributed memory database in an embodiment of the present application.
FIG. 4 is a schematic diagram of the MD5 algorithm in one embodiment of the present application.
Fig. 5 is a schematic block diagram of a data sampling apparatus based on a distributed memory database according to an embodiment of the present application.
FIG. 6 is a schematic block diagram of a computer system suitable for implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows a flowchart of the data sampling method based on a distributed memory database according to the present application, and Fig. 2 shows a schematic diagram of a specific embodiment of the method. With reference to Fig. 1 and Fig. 2, the method specifically includes the following steps:
S1: calculating the HASH value of the current data according to a preset rule when data arrives at the stream-processing program;
in an alternative embodiment, the execution of S1 may be completed according to the following steps:
S11: calculating the file size of the single record when data arrives at the stream-processing program, and distinguishing data larger than 10240 KB from data smaller than or equal to 10240 KB;
S12: for data larger than 10240 KB, performing the MD5 calculation on the first 1024 KB and the last 1024 KB of the file; for data smaller than or equal to 10240 KB, converting the data to bytes, performing a full-data reverse-order calculation, and then performing the MD5 calculation over the full data to obtain the HASH value of the current data.
In a specific embodiment, restoration and analysis of PB-level network traffic shows that 99.6% of the data larger than 10240 KB contains attachments, images, audio, video and the like, while data smaller than or equal to 10240 KB is mostly text, so placing the threshold at 10240 KB fits the actual usage scenario.
Referring to Fig. 3, for data larger than 10240 KB, the first 1024 KB and the last 1024 KB of the file are taken for the MD5 calculation. The MD5 calculation logic is shown in Fig. 4: MD5 defines four state values, computes the original text against these four values to obtain four new values, repeats this process a fixed number of times, and finally concatenates the four final values as strings to obtain the resulting digest.
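As a rough, hedged illustration of this step (not taken from the patent; the class, method and variable names are illustrative), the following Java sketch hashes only the head and tail slices of a large record using the standard MessageDigest MD5 implementation:
import java.security.MessageDigest;

public final class LargeRecordHash {
    private static final int SLICE = 1024 * 1024; // 1024 KB

    // Assumes record.length exceeds 10240 KB, so the two 1024 KB slices never overlap.
    static String hashHeadAndTail(byte[] record) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(record, 0, SLICE);                     // first 1024 KB of the file
        md5.update(record, record.length - SLICE, SLICE); // last 1024 KB of the file
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b & 0xFF));  // 128-bit HASH as 32 hex characters
        }
        return hex.toString();
    }
}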
Data smaller than or equal to 10240 KB is converted to bytes and then undergoes a full-data reverse-order calculation. An example of the reverse-order calculation is as follows:
First, group every 2 bits and swap the first half and the second half of each group; then group every 4 bits and swap the two halves; finally group every 8 bits and swap the two halves.
For example, to reverse 12345678 into 87654321:
Group by 2 and swap the halves of each group to get 21436587; then group by 4 and swap the halves to get 43218765; finally group by 8 and swap the halves to get 87654321.
The code is as follows:
static byte ReverseBits(byte c)
{
    // swap adjacent bits, then adjacent 2-bit pairs, then the two 4-bit halves
    c = (byte)(((c & 0x55) << 1) | ((c & 0xAA) >> 1));
    c = (byte)(((c & 0x33) << 2) | ((c & 0xCC) >> 2));
    c = (byte)(((c & 0x0F) << 4) | ((c & 0xF0) >> 4));
    return c;
}
After data smaller than or equal to 10240 KB has been converted to bytes and reversed in this way, the MD5 calculation is performed over the full reversed data. The reverse-order calculation makes the data distribution across the cluster more balanced, so that large volumes of stored data can be processed more quickly and efficiently.
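A corresponding hedged sketch for small records is shown below (again illustrative and not taken from the patent; it assumes that the full-data reverse-order calculation means applying the per-byte ReverseBits routine above to every byte of the record before hashing):
import java.security.MessageDigest;

public final class SmallRecordHash {
    // Per-byte bit reversal, as in the ReverseBits listing above.
    static byte reverseBits(byte c) {
        c = (byte)(((c & 0x55) << 1) | ((c & 0xAA) >> 1));
        c = (byte)(((c & 0x33) << 2) | ((c & 0xCC) >> 2));
        c = (byte)(((c & 0x0F) << 4) | ((c & 0xF0) >> 4));
        return c;
    }

    // Assumes record.length <= 10240 KB: reverse every byte, then MD5 the whole payload.
    static String hashReversed(byte[] record) throws Exception {
        byte[] reversed = new byte[record.length];
        for (int i = 0; i < record.length; i++) {
            reversed[i] = reverseBits(record[i]);
        }
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest(reversed)) {
            hex.append(String.format("%02x", b & 0xFF));
        }
        return hex.toString();
    }
}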
S2: constructing a storage filtering container: deploying the distributed memory database as a database cluster using 80% of the resources in the system's available resource pool and dividing it into a plurality of child nodes;
In a specific embodiment, the storage filtering container, i.e. the distributed memory database, is deployed as a database cluster using 80% of the resources in the system's available resource pool and divided into multiple child nodes that form one cluster. The advantage of a multi-node cluster is that data is distributed across the nodes by storage slots; data sharing among the nodes allows the data distribution to be adjusted dynamically, and the scalable, decentralized, highly available architecture reduces operation and maintenance cost while effectively improving the availability and scalability of the system.
S3: writing the HASH value of the current data into a distributed memory database;
in a specific embodiment, in S3, after writing the HASH value of the current data into the distributed memory database, assigning a cache expiration time, where the assigned cache expiration time is an assigned sampling period time and is effective in a sampling period range, and when the data is automatically invalidated and deleted in more than one sampling period, the data is no longer matched and hit after the expiration, so as to ensure the accuracy of the sampled data.
S4: when the next piece of data arrives, calculating the HASH value of the piece of data according to the preset rule and matching it against the distributed memory database; filtering the piece of data out if the same HASH value already exists in the distributed memory database, and storing the piece of data in the distributed memory database if it does not.
In a specific embodiment, in S4, if the same HASH value does not exist in the distributed memory database, the piece of data is stored in the distributed memory database and loaded into memory with the assigned cache expiration time. Once all data has passed through the filtering container, the sampled data set is obtained.
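The filtering step of S4 can be summarized by the following hedged sketch (illustrative only; DistributedCache, LocalCache and SamplingFilter are hypothetical names, and in a real deployment the interface would be backed by the distributed memory database cluster, for example through an atomic set-if-absent operation with a time-to-live):
import java.util.concurrent.ConcurrentHashMap;

// Stand-in abstraction for the distributed memory database used as the filtering container.
interface DistributedCache {
    // Stores the HASH with the given TTL if it is absent; returns true if it was newly stored.
    boolean putIfAbsent(String hash, long ttlSeconds);
}

// Minimal local stand-in so the sketch is runnable; a real cluster client would replace this.
final class LocalCache implements DistributedCache {
    private final ConcurrentHashMap<String, Long> expiry = new ConcurrentHashMap<>();

    public boolean putIfAbsent(String hash, long ttlSeconds) {
        long now = System.currentTimeMillis();
        // Drop an expired entry first so a new sampling period starts fresh.
        expiry.compute(hash, (k, deadline) -> (deadline != null && deadline > now) ? deadline : null);
        return expiry.putIfAbsent(hash, now + ttlSeconds * 1000) == null;
    }
}

final class SamplingFilter {
    private final DistributedCache cache;
    private final long samplingPeriodSeconds; // cache expiration time = sampling period

    SamplingFilter(DistributedCache cache, long samplingPeriodSeconds) {
        this.cache = cache;
        this.samplingPeriodSeconds = samplingPeriodSeconds;
    }

    // Returns true if the record is kept (first occurrence in the period), false if filtered out.
    boolean accept(String recordHash) {
        return cache.putIfAbsent(recordHash, samplingPeriodSeconds);
    }
}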
The method relies on four characteristics of HASH values: (1) input of arbitrary length produces output of fixed length; (2) the HASH value is fast to compute; (3) collision resistance, i.e. near-uniqueness; (4) hiding, also called one-wayness. These characteristics allow the values to be computed quickly and distributed evenly across the distributed memory database nodes, greatly improving efficiency.
The invention uses the distributed memory database as the filtering container and the data filtering rule as the filtering condition. The filter container attributes include the distributed cluster servers, the data cache size and the data cache policy. The filtering condition includes calculating a 128-bit HASH value with the MD5 algorithm according to the rule and organizing how data is stored in the memory database based on that HASH value, so that data can be extracted quickly and accurately from massive big data according to a user-defined rule. The invention mainly uses the distributed memory database as a container to rapidly filter and extract result information from massive data. It can achieve the desired data sampling effect at every major scale up to the PB level, and obtains the required sampling results in a short time without affecting the efficiency of the business services that use it. Sampling analysis of mass data across industries allows the overall situation to be grasped rapidly and early warnings and judgments to be made in advance, which is of great practical significance in fields such as daily life and production, the development of events, and disaster prediction.
With further reference to fig. 5, as an implementation of the method described above, the present application provides an embodiment of a data sampling apparatus based on a distributed memory database, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 1, and the apparatus may be specifically applied to various electronic devices.
Referring to fig. 5, a data sampling apparatus based on a distributed memory database includes:
the HASH value calculating module 101, configured to calculate the HASH value of the current data according to a preset rule when data arrives at the stream-processing program;
the storage filtering container constructing module 102, configured to deploy the distributed memory database as a database cluster using 80% of the resources in the system's available resource pool and divide it into a plurality of child nodes;
the HASH value storage module 103, configured to write the HASH value of the current data into the distributed memory database;
the data filtering module 104, configured to calculate, when the next piece of data arrives, the HASH value of the piece of data according to the preset rule and match it against the distributed memory database; the piece of data is filtered out if the same HASH value already exists in the distributed memory database, and is stored in the distributed memory database if it does not.
In a further embodiment, calculating the HASH value of the current data according to a preset rule when data arrives at the stream-processing program specifically includes:
calculating the file size of the single record when data arrives at the stream-processing program, and distinguishing data larger than 10240 KB from data smaller than or equal to 10240 KB;
for data larger than 10240 KB, performing the MD5 calculation on the first 1024 KB and the last 1024 KB of the file; for data smaller than or equal to 10240 KB, converting the data to bytes, performing a full-data reverse-order calculation, and then performing the MD5 calculation over the full data to obtain the HASH value of the current data.
In a further embodiment, the HASH value storage module assigns a cache expiration time after writing the HASH value of the current data to the distributed memory database.
In a further embodiment, in the data filtering module, if the distributed memory database does not have the same HASH value, the piece of data is stored in the distributed memory database, and the piece of data is loaded into the memory with the assigned cache expiration time.
Referring now to FIG. 6, shown is a block diagram of a computer system 200 suitable for use in implementing an electronic device of an embodiment of the present application. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the computer system 200 includes a Central Processing Unit (CPU) 201 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. In the RAM 203, various programs and data necessary for the operation of the system 200 are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, and the like; an output section 207 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card, a modem, or the like. The communication section 209 performs communication processing via a network such as the internet. The driver 220 is also connected to the I/O interface 205 as needed. A removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 220 as necessary, so that a computer program read out therefrom is mounted into the storage section 208 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209 and/or installed from the removable medium 211. The above-described functions defined in the method of the present application are performed when the computer program is executed by the Central Processing Unit (CPU) 201.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method shown in fig. 1.
It should be noted that the computer readable storage medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable storage medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
While the present invention has been described with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
In the description of the present application, it is to be understood that the terms "upper", "lower", "inner", "outer", and the like, indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are only for convenience in describing the present application and simplifying the description, and do not indicate or imply that the referred devices or elements must have a specific orientation, be constructed in a specific orientation, and be operated, and thus, should not be construed as limiting the present application. The word 'comprising' does not exclude the presence of elements or steps not listed in a claim. The word 'a' or 'an' preceding an element does not exclude the presence of a plurality of such elements. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. Any reference signs in the claims shall not be construed as limiting the scope.

Claims (10)

1. A data sampling method based on a distributed memory database is characterized in that: the method comprises the following steps:
S1: calculating the HASH value of the current data according to a preset rule when data arrives at the stream-processing program;
S2: constructing a storage filtering container: deploying the distributed memory database as a database cluster using 80% of the resources in the system's available resource pool and dividing it into a plurality of child nodes;
S3: writing the HASH value of the current data into the distributed memory database;
S4: when the next piece of data arrives, calculating the HASH value of the piece of data according to the preset rule and matching it against the distributed memory database; filtering the piece of data out if the same HASH value already exists in the distributed memory database, and storing the piece of data in the distributed memory database if it does not.
2. The data sampling method based on the distributed memory database according to claim 1, characterized in that: the S1 specifically comprises:
S11: calculating the file size of the single record when data arrives at the stream-processing program, and distinguishing data larger than 10240 KB from data smaller than or equal to 10240 KB;
S12: for data larger than 10240 KB, performing the MD5 calculation on the first 1024 KB and the last 1024 KB of the file; for data smaller than or equal to 10240 KB, converting the data to bytes, performing a full-data reverse-order calculation, and then performing the MD5 calculation over the full data to obtain the HASH value of the current data.
3. The data sampling method based on the distributed memory database according to claim 1, characterized in that: in S3, after the HASH value of the current data is written into the distributed memory database, the cache expiration time is assigned.
4. The data sampling method based on the distributed memory database according to claim 1, characterized in that: in S4, if the distributed memory database does not have the same HASH value, the piece of data is stored in the distributed memory database, and the piece of data is loaded into the memory to assign the cache expiration time.
5. A data sampling device based on distributed memory database is characterized in that: the device comprises:
the HASH value calculating module, configured to calculate the HASH value of the current data according to a preset rule when data arrives at the stream-processing program;
the storage filtering container constructing module, configured to deploy the distributed memory database as a database cluster using 80% of the resources in the system's available resource pool and divide it into a plurality of child nodes;
the HASH value storage module, configured to write the HASH value of the current data into the distributed memory database;
and the data filtering module, configured to calculate, when the next piece of data arrives, the HASH value of the piece of data according to the preset rule and match it against the distributed memory database; the piece of data is filtered out if the same HASH value already exists in the distributed memory database, and is stored in the distributed memory database if it does not.
6. The data sampling device based on the distributed memory database according to claim 5, characterized in that: calculating the HASH value of the current data according to a preset rule when data arrives at the stream-processing program specifically comprises:
calculating the file size of the single record when data arrives at the stream-processing program, and distinguishing data larger than 10240 KB from data smaller than or equal to 10240 KB;
for data larger than 10240 KB, performing the MD5 calculation on the first 1024 KB and the last 1024 KB of the file; for data smaller than or equal to 10240 KB, converting the data to bytes, performing a full-data reverse-order calculation, and then performing the MD5 calculation over the full data to obtain the HASH value of the current data.
7. The data sampling device based on the distributed memory database according to claim 5, characterized in that: the HASH value storage module assigns the cache expiration time after writing the HASH value of the current data into the distributed memory database.
8. The data sampling device based on the distributed memory database according to claim 5, characterized in that: in the data filtering module, if the same HASH value does not exist in the distributed memory database, the piece of data is stored in the distributed memory database and loaded into the memory with the assigned cache expiration time.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-4.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.
CN202211122878.3A 2022-09-15 2022-09-15 Data sampling method and device based on distributed memory database Pending CN115470212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211122878.3A CN115470212A (en) 2022-09-15 2022-09-15 Data sampling method and device based on distributed memory database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211122878.3A CN115470212A (en) 2022-09-15 2022-09-15 Data sampling method and device based on distributed memory database

Publications (1)

Publication Number Publication Date
CN115470212A 2022-12-13

Family

ID=84332869

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211122878.3A Pending CN115470212A (en) 2022-09-15 2022-09-15 Data sampling method and device based on distributed memory database

Country Status (1)

Country Link
CN (1) CN115470212A (en)

Similar Documents

Publication Publication Date Title
AU2020385264B2 (en) Fusing multimodal data using recurrent neural networks
CN108846753B (en) Method and apparatus for processing data
CN111427971B (en) Business modeling method, device, system and medium for computer system
US11200231B2 (en) Remote query optimization in multi data sources
US9940113B2 (en) Big data assistant
CN113190517B (en) Data integration method and device, electronic equipment and computer readable medium
CN113468344A (en) Entity relationship extraction method and device, electronic equipment and computer readable medium
CN115470212A (en) Data sampling method and device based on distributed memory database
US10902046B2 (en) Breaking down a high-level business problem statement in a natural language and generating a solution from a catalog of assets
CN109408716B (en) Method and device for pushing information
CN112711718A (en) Review information auditing method, device, medium and electronic equipment
CN113760240A (en) Method and device for generating data model
CN113448960A (en) Method and device for importing form file
US11599357B2 (en) Schema-based machine-learning model task deduction
CN111930704B (en) Service alarm equipment control method, device, equipment and computer readable medium
US20230342352A1 (en) System and Method for Matching into a Complex Data Set
US10896193B2 (en) Cache fetching of OLAP based data using client to client relationships and data encoding
US20240176803A1 (en) Simplified schema generation for data ingestion
US10417133B2 (en) Reference cache maintenance optimizer
US10936671B2 (en) Linked record declaration of related content items
CN117193788A (en) Data processing program generation method, device, equipment and storage medium
CN116821160A (en) Correlation updating method, device, equipment and medium based on user behavior track information
CN118114642A (en) Value data filling credential file generation method, device, equipment and readable medium
CN116821159A (en) Data processing method, device, equipment, medium and product
CN115587593A (en) Information extraction method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination