CN110750529B - Data processing method, device, equipment and storage medium - Google Patents

Data processing method, device, equipment and storage medium

Info

Publication number
CN110750529B
Authority
CN
China
Prior art keywords
data
stored
storage area
determining
mark value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810721912.6A
Other languages
Chinese (zh)
Other versions
CN110750529A (en
Inventor
余韬
吴名宇
叶峻
马宇峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810721912.6A
Publication of CN110750529A
Application granted
Publication of CN110750529B
Active (current legal status)
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method, apparatus, device, and storage medium. The method comprises: acquiring a data storage request, wherein the storage request includes data to be stored; parsing the data to be stored to determine a feature identifier corresponding to the data to be stored; performing a hash modulo operation on the feature identifier to determine a mark value corresponding to the feature identifier; and storing the data to be stored in a first storage area corresponding to the first mark value range to which that mark value belongs, wherein different storage areas correspond to different numbers of mark values. By storing data non-uniformly through hash modulo processing, the method keeps the sampled data uniform when the data is sampled, improves sampling precision, reduces storage consumption at the same sampling precision, improves data processing efficiency, and improves device performance.

Description

Data processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, device, and storage medium.
Background
Nowadays, with the wide popularization of internet technology, more and more enterprises begin to use electronic management, and a large amount of data is inevitably generated in the process, so how to store and process the data becomes one of the problems to be solved at present.
In the related art, when a large amount of generated data is stored and processed, the data is usually redundantly stored according to different sampling rates, and then when data analysis is performed, different data tables are selected for data query according to a required sampling rate. However, the inventors have found that, when the data sampling rate is required to be high, the data storage system consumes a large amount of storage resources.
Disclosure of Invention
The application provides a data processing method, apparatus, device, and storage medium to solve the problem in the related art that redundantly storing data at different sampling rates consumes a large amount of storage resources.
An embodiment of one aspect of the present application provides a data processing method, including: acquiring a data storage request, wherein the storage request includes data to be stored; parsing the data to be stored to determine a feature identifier corresponding to the data to be stored; performing a hash modulo operation on the feature identifier to determine a mark value corresponding to the feature identifier; and storing the data to be stored in a first storage area corresponding to a first mark value range, according to the first mark value range to which the mark value corresponding to the data to be stored belongs, wherein different storage areas correspond to different numbers of mark values.
An embodiment of another aspect of the present application provides a data processing apparatus, including: an acquisition module configured to acquire a data storage request, wherein the storage request includes data to be stored; a first determining module configured to parse the data to be stored and determine the feature identifier corresponding to the data to be stored; a second determining module configured to perform a hash modulo operation on the feature identifier and determine a mark value corresponding to the feature identifier; and a processing module configured to store the data to be stored in a first storage area corresponding to a first mark value range, according to the first mark value range to which the mark value corresponding to the data to be stored belongs, wherein different storage areas correspond to different numbers of mark values.
In another embodiment of the present application, a computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the data processing method in the embodiment of the first aspect.
The computer readable storage medium of an embodiment of a further aspect of the present application has a computer program stored thereon, and the computer program is executed by a processor to implement the data processing method of the embodiment of the first aspect.
The computer program of an embodiment of a further aspect of the present application, when executed by a processor, implements the data processing method of the embodiment of the first aspect.
The technical scheme disclosed in the application has the following beneficial effects:
A data storage request is acquired, the data to be stored is parsed to determine its corresponding feature identifier, a hash modulo operation is performed on the feature identifier to determine the corresponding mark value, and the data to be stored is stored in the first storage area corresponding to the first mark value range to which that mark value belongs. The data is thus stored non-uniformly by means of hash modulo processing, which keeps the sampled data uniform when the data is sampled, improves sampling precision, reduces storage consumption at the same sampling precision, improves data processing efficiency, and improves device performance.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow diagram illustrating a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic flow diagram of a data processing method according to another embodiment of the present application;
FIG. 3 is a diagram illustrating a result of partitioning a storage area according to one embodiment of the present application;
FIG. 4 is a schematic flow chart diagram of a data processing method according to yet another embodiment of the present application;
FIG. 5 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of a data processing apparatus according to another embodiment of the present application;
FIG. 7 is a schematic block diagram of a data processing apparatus according to yet another embodiment of the present application;
FIG. 8 is a schematic block diagram of a computer device according to one embodiment of the present application;
FIG. 9 is a schematic block diagram of a computer device according to another embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The embodiments of the present application provide a data processing method for solving the problem that, in the related art, when data is redundantly stored according to different sampling rates, a large storage resource needs to be consumed.
In the embodiments of the present application, a data storage request that includes data to be stored is acquired, the data to be stored is parsed to determine its corresponding feature identifier, and a hash modulo operation is performed on the feature identifier to determine the corresponding mark value. The data to be stored is then stored in the first storage area corresponding to the first mark value range to which that mark value belongs, where different storage areas correspond to different numbers of mark values. The data is thus stored non-uniformly by means of hash modulo processing, which keeps the sampled data uniform when the data is sampled, improves sampling precision, reduces storage consumption at the same sampling precision, improves data processing efficiency, and improves device performance.
A data processing method, an apparatus, a device, and a storage medium according to embodiments of the present application will be described in detail below with reference to the accompanying drawings.
First, a data processing method in the present application will be specifically described with reference to fig. 1.
As shown in fig. 1, the data processing method of the present application may include the steps of:
Step 101, acquiring a data storage request, wherein the storage request includes data to be stored.
Step 102, parsing the data to be stored, and determining the feature identifier corresponding to the data to be stored.
The data processing method provided by the embodiment of the present application can be executed by the computer device provided by the embodiment of the present application. A data processing apparatus is provided in the computer device to store the data to be stored. In this embodiment, the computer device may be any hardware device with a data processing function, such as a computer or a server.
Optionally, when the data storage request is acquired, the request may be parsed to obtain the data to be stored that it contains, and the obtained data is then parsed to determine the feature identifier corresponding to the data to be stored.
The feature identifier corresponding to the data to be stored is a unique identifier used to distinguish the data to be stored from other data, and may be determined according to the type of the data to be stored. For example, when the data to be stored is user behavior data, the feature identifier may be the user identifier corresponding to the data; when the data to be stored is device operation data, the feature identifier may be the identifier of the device. This is not specifically limited herein.
Step 103, performing a hash modulo operation on the feature identifier, and determining a mark value corresponding to the feature identifier.
The hash modulo operation is a distribution strategy that is simple in principle, easy to implement, and disperses values well. That is, in this embodiment, performing hash modulo processing on the determined feature identifier keeps the subsequently stored data more balanced and also reduces the complexity of data processing.
For example, if the determined feature identifier is a user ID, the computer device may hash the user ID to obtain a hash value, and then perform a modulo operation on the hash value to obtain the mark value m corresponding to the user ID.
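The user ID example can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the source does not name a hash function, so SHA-256 is an assumption chosen only because it gives the same result in every process (Python's built-in hash() is salted per process for strings). The function name mark_value and the sample ID "user-42" are hypothetical.

```python
import hashlib


def mark_value(feature_id: str, modulus: int) -> int:
    """Hash the feature identifier, then reduce the hash value modulo `modulus`."""
    if modulus <= 0:
        raise ValueError("modulus must be a positive integer")
    # Assumption: SHA-256 stands in for the unspecified hash function.
    digest = hashlib.sha256(feature_id.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest[:8], "big")
    # The mark value m always falls in the range 0 .. modulus - 1.
    return hash_value % modulus


m = mark_value("user-42", 10000)
```

Because the mark value depends only on the feature identifier, every record of the same user receives the same mark value and therefore lands in the same storage area.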
Step 104, storing the data to be stored in a first storage area corresponding to a first mark value range, according to the first mark value range to which the mark value corresponding to the data to be stored belongs, wherein different storage areas correspond to different numbers of mark values.
Optionally, in this embodiment, a correspondence between mark value ranges and storage areas may be established in advance, so that after the mark value corresponding to the feature identifier is determined, the computer device can look up, in that correspondence, the first mark value range to which the mark value belongs.
After the first mark value range to which the mark value corresponding to the data to be stored belongs is determined, the data to be stored can be stored in the first storage area corresponding to the first mark value range.
It can be understood that, because different storage areas correspond to different numbers of mark values, the data to be stored is stored non-uniformly, and the sampling precision and the number of storage area partitions are related logarithmically. Depending on the modulus value chosen, the sampling precision can reach one ten-thousandth or even finer, so that when a small amount of data is queried later, the amount of data read drops substantially. This improves not only sampling precision but also sampling speed, data processing efficiency, and device performance.
It should be noted that the pre-established correspondence between mark value ranges and storage areas is described in detail in the following embodiments and is not elaborated here.
In order to more clearly explain the embodiments of the present application, differences between the related art and the present application will be specifically described below.
In the related art, an analysis system generally stores data uniformly in randomly filled data blocks of fixed size. When a query is analyzed, part of the data is read at random according to the sampling rate selected by the user, and the metric computed on the sample is scaled back up according to the sampling rate.
For example, assume that there are 10 data blocks, each holding 10000 pieces of data, and that there are 1000 users, each producing 100 pieces of data per day. Each data block may then contain part of the data of every user, for example 10 pieces of data per user. If the number of users is estimated at a sampling rate of 10%, one data block is selected at random and its user data is deduplicated to estimate the number of users.
For another example, under the same assumptions (10 data blocks of 10000 pieces of data each, 1000 users with 100 pieces of data per day), if the data is distributed across the data blocks by user, one data block may contain all the data of 100 users. If the number of users is estimated at a sampling rate of 10%, the users in a randomly selected data block are deduplicated and the result is multiplied by 10.
That is, with random data blocks the real distribution of the data cannot be predicted, so the sample may not be uniform in some dimensions. When individual dimensions are analyzed, the computed metric then deviates from the true overall value. In particular, when data is deduplicated, for example when counting distinct users, the metric is often amplified, and an accurate analysis result cannot be obtained.
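The two related-art examples can be checked with a short simulation. It is only a sketch under stated assumptions: the first block is taken instead of a random one so the result is reproducible, the estimate is scaled by dividing by the sampling rate, and all names (estimate_users, spread_blocks, grouped_blocks) are hypothetical.

```python
NUM_BLOCKS = 10
NUM_USERS = 1000
RECORDS_PER_USER = 100
SAMPLING_RATE = 0.10


def estimate_users(blocks, sampling_rate):
    """Deduplicate users in the sampled block(s) and scale by 1 / sampling_rate."""
    # Assumption: take the first block(s) rather than random ones, for determinism.
    sampled = blocks[: max(1, round(len(blocks) * sampling_rate))]
    distinct = {user for block in sampled for user in block}
    return round(len(distinct) / sampling_rate)


# Case 1: every block holds 10 records of every user (records spread over all blocks).
spread_blocks = [
    [user for user in range(NUM_USERS) for _ in range(RECORDS_PER_USER // NUM_BLOCKS)]
    for _ in range(NUM_BLOCKS)
]

# Case 2: each block holds all 100 records of 100 users (records grouped by user).
users_per_block = NUM_USERS // NUM_BLOCKS
grouped_blocks = [
    [
        user
        for user in range(b * users_per_block, (b + 1) * users_per_block)
        for _ in range(RECORDS_PER_USER)
    ]
    for b in range(NUM_BLOCKS)
]

spread_estimate = estimate_users(spread_blocks, SAMPLING_RATE)    # 10000 users
grouped_estimate = estimate_users(grouped_blocks, SAMPLING_RATE)  # 1000 users
```

In the first case the sampled block already contains every user, so scaling the deduplicated count by 10 yields 10000 instead of 1000, which is the amplification described above. Only when all records of a user sit in the same block does the scaled estimate match the true count.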
In the present application, by contrast, a hash modulo operation is performed on the feature identifier corresponding to the data to be stored to obtain the corresponding mark value, the first mark value range to which the mark value belongs is determined, and the data to be stored is stored in the first storage area corresponding to that range. The number of mark values differs from one storage area to another, that is, the storage areas are non-uniformly distributed. In other words, in the present application the distribution of data across storage areas is known from the mark value range of each storage area, so multiple pieces of data with the same feature identifier can be deduplicated and the data finally counted corresponds to distinct feature identifiers. The sampled data is therefore more uniform, which effectively avoids the one-sided, inaccurate analysis results that arise when randomly stored data is sampled.
According to the data processing method provided by the embodiment of the application, a data storage request is acquired, the data to be stored is parsed to determine its corresponding feature identifier, a hash modulo operation is performed on the feature identifier to determine the corresponding mark value, and the data to be stored is stored in the first storage area corresponding to the first mark value range to which that mark value belongs. The data is thus stored non-uniformly by means of hash modulo processing, which keeps the sampled data uniform when the data is sampled, improves sampling precision, reduces storage consumption at the same sampling precision, improves data processing efficiency, and improves device performance.
As can be seen from the above analysis, in the embodiment of the present application hash modulo processing is performed for the data to be stored to determine its corresponding mark value, the corresponding first mark value range is determined from that mark value, and the data to be stored is then stored in the first storage area corresponding to the first mark value range.
In the related art, data to be stored is usually stored in uniform partitions. With uniform partitions, however, the number of partitions constrains the sampling rate available when the stored data is queried, and because the number of partitions cannot be too large, adjustment of the sampling rate is often limited in practice; the expected sampling rate cannot be obtained, which affects the analysis result. Therefore, in this embodiment, the query requirement of the data to be stored is determined according to the service type to which the data belongs, the storage area division rule is determined according to the query requirement, and the storage area is then partitioned non-uniformly according to that rule, so that different sampling rates can be used when the data is later queried and analyzed. This improves the accuracy of the analysis result and can optimize sampling efficiency at low sampling rates. The data processing method of the present application is further described below with reference to fig. 2.
Fig. 2 is a schematic flow chart diagram of a data processing method according to another embodiment of the present application.
As shown in fig. 2, the data processing method of the embodiment of the present application may include the following steps:
Step 201, acquiring a data storage request, wherein the storage request includes data to be stored.
Step 202, analyzing the data to be stored, and determining the feature identifier corresponding to the data to be stored.
The detailed implementation process and principle of the steps 201 to 202 may refer to the detailed description of the above embodiments, and are not described herein again.
Step 203, determining a target sampling range according to the query service type corresponding to the data to be stored.
Step 204, determining a target modulus value according to the target sampling range.
Step 205, according to the target modulus value, performing hash modulus operation on the feature identifier, and determining a mark value corresponding to the feature identifier.
In this embodiment, there may be multiple service types, such as employee attendance, video playback, and file downloads.
Optionally, in practical applications, different data may correspond to different query service types, and different query service types may require different sampling ranges. Therefore, so that the hash modulo operation that produces the mark value is targeted to the data to be stored, this embodiment determines the query service type corresponding to the data to be stored, determines the target sampling range according to the query service type, determines the target modulus value according to the target sampling range, and performs the hash modulo operation on the feature identifier corresponding to the data to be stored according to the determined target modulus value.
For example, if the query service type corresponding to the data to be stored is statistical analysis of employee attendance, the target sampling range may be determined to be 0 to 1000. The computer device then determines the target modulus value to be 1000 according to the target sampling range 0 to 1000, performs the hash modulo operation on the feature identifier corresponding to the data to be stored according to the target modulus value 1000, and determines the mark value corresponding to the feature identifier.
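Steps 203 to 205 can be sketched as a lookup followed by the hash modulo operation. The sketch rests on assumptions: the mapping table, its keys, and the value for video playback are hypothetical (only the employee attendance range of 0 to 1000 comes from the example above), and SHA-256 stands in for the unspecified hash function.

```python
import hashlib

# Hypothetical mapping from query service type to the upper end of the target
# sampling range. Only the employee attendance value is taken from the text.
TARGET_SAMPLING_RANGE = {
    "employee_attendance": 1000,
    "video_playback": 10000,  # assumed value, for illustration only
}


def target_modulus(query_service_type: str) -> int:
    """Step 203 and step 204: the target modulus equals the sampling range's upper end."""
    return TARGET_SAMPLING_RANGE[query_service_type]


def mark_value(feature_id: str, modulus: int) -> int:
    """Step 205: hash the feature identifier and reduce it modulo the target modulus."""
    digest = hashlib.sha256(feature_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % modulus


modulus = target_modulus("employee_attendance")
mark = mark_value("employee-007", modulus)
```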
Furthermore, when stored data is later queried, the related art traverses the stored data according to the user's query request to find the data that meets the user's requirement. This takes a lot of time and also consumes processing unit resources, which affects other operations.
To reduce the time spent on data queries, reduce the amount of data the processing unit has to handle, and improve query speed, in this embodiment, before the data to be stored is stored, the corresponding query requirement is determined according to the service type to which the data belongs, the storage area division rule is determined according to the query requirement, the storage area is divided according to that rule, and the data to be stored is then stored in the different storage areas, so that subsequent data queries run faster. For the specific implementation, see steps 206 to 209.
Step 206, determining a query requirement corresponding to the data to be stored according to the service type to which the data to be stored belongs.
Step 207, determining a preset storage area division rule according to the query requirement.
System performance differs from device to device. For example, in some systems the overall query performance is best when the storage area division rule is 2: while in other systems it is best when the storage area division rule is 1: Therefore, to adapt to various sampling situations, this embodiment may determine the preset storage area division rule according to both the query requirement corresponding to the data to be stored and the system performance of the device.
For example, if the query requirement is that 20% of the data needs to be acquired, the computer device may determine that the preset storage area division rule is division according to a ratio of 2:8.
For another example, if the query requirement is that 10% of the data needs to be acquired, the computer device may determine that the preset storage area division rule is division according to a ratio of 1:9.
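The two ratio examples follow a simple pattern that can be sketched as below. This is an assumption-laden illustration: the function name division_ratio is hypothetical, the ratio is expressed in tenths only because both examples are, and the influence of system performance mentioned above is not modeled.

```python
from fractions import Fraction


def division_ratio(required_fraction: Fraction) -> tuple[int, int]:
    """Map the fraction of data a query needs to a (small, large) division ratio in tenths."""
    if not 0 < required_fraction < 1:
        raise ValueError("required fraction must be strictly between 0 and 1")
    small = required_fraction * 10
    if small.denominator != 1:
        # Assumption: this sketch only handles whole tenths such as 10% or 20%.
        raise ValueError("only multiples of 10% are supported in this sketch")
    return int(small), 10 - int(small)


ratio_20_percent = division_ratio(Fraction(20, 100))  # (2, 8)
ratio_10_percent = division_ratio(Fraction(10, 100))  # (1, 9)
```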
That is, in the present application the storage area is divided in different proportions according to the query requirement, so that a data query only needs to read a small portion of the data instead of parsing and filtering all of it, which improves the efficiency of data sampling.
Step 208, determining the mark value ranges respectively corresponding to the storage areas according to the storage area division rule.
Optionally, after determining the storage area division rule, the computer device may determine the mark value range corresponding to each storage area.
For example, if the storage area division rule is division according to a ratio of 1:9, and the target sampling range determined from the query service type corresponding to the data to be stored is 10000, the computer device may divide the storage area into a 10% part and a 90% part, then divide the 10% part into a 1% part and a 9% part, and then divide the 1% part into a 0.1% part and a 0.9% part. The division result is shown in fig. 3. When the storage areas are divided, the mark value range corresponding to each storage area is also determined, as shown in Table 1 below.
Table 1:
Mark value range    Storage area name
0-9                 10
10-99               100
100-999             1000
1000-9999           10000
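The ranges in Table 1 can be reproduced by repeatedly splitting off the larger part and subdividing the smaller one. The following is a sketch under assumptions: the function name build_mark_ranges and its parameters are hypothetical, and naming each storage area after the exclusive upper bound of its range is inferred from Table 1 rather than stated in the text.

```python
def build_mark_ranges(modulus: int, small: int = 1, large: int = 9, levels: int = 3):
    """Split [0, modulus) by small:large, then keep splitting the small part.

    Returns (low, high_inclusive, area_name) tuples sorted by `low`.
    """
    ranges = []
    upper = modulus
    for _ in range(levels):
        # The small part [0, boundary) is split again; the large part becomes an area.
        boundary = upper * small // (small + large)
        ranges.append((boundary, upper - 1, str(upper)))
        upper = boundary
    # Whatever remains after the last split is the smallest storage area.
    ranges.append((0, upper - 1, str(upper)))
    return sorted(ranges)


table_1 = build_mark_ranges(10000)
```

With a modulus of 10000 and three 1:9 splits this yields four storage areas holding 0.1%, 0.9%, 9% and 90% of the mark values. Each further split adds one storage area while refining the smallest sampling fraction by a factor of ten.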
Step 209, storing the data to be stored in a first storage area corresponding to a first mark value range, according to the first mark value range to which the mark value corresponding to the data to be stored belongs, wherein different storage areas correspond to different numbers of mark values.
Optionally, after the storage areas are divided and the mark value range corresponding to each storage area is determined, the computer device may match the mark value corresponding to the data to be stored against the mark value ranges of the storage areas. If the mark value matches the first mark value range, the data to be stored is stored in the first storage area corresponding to the first mark value range.
For example, if the mark value corresponding to the data to be stored is 99, it can be determined from Table 1 above that the mark value matches the mark value range 10-99, and the computer device can store the data to be stored in the storage area named 100.
For another example, if the mark value corresponding to the data to be stored is 102, it can be determined from Table 1 above that the mark value matches the mark value range 100-999, and the computer device can store the data to be stored in the storage area named 1000.
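The matching in step 209 amounts to a range lookup, sketched here with a binary search over the upper bounds from Table 1. The in-memory dictionary standing in for the storage areas and the names storage_area_for and store are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

# Exclusive upper bounds of the mark value ranges in Table 1, with matching area names.
UPPER_BOUNDS = [10, 100, 1000, 10000]
AREA_NAMES = ["10", "100", "1000", "10000"]

# Assumption: a dict of lists stands in for the real storage areas.
storage_areas = defaultdict(list)


def storage_area_for(mark: int) -> str:
    """Return the name of the storage area whose mark value range contains `mark`."""
    if not 0 <= mark < UPPER_BOUNDS[-1]:
        raise ValueError("mark value outside the configured ranges")
    return AREA_NAMES[bisect_right(UPPER_BOUNDS, mark)]


def store(record, mark: int) -> str:
    """Append the record to the matching storage area and return that area's name."""
    area = storage_area_for(mark)
    storage_areas[area].append(record)
    return area


area_for_99 = store({"user": "a"}, 99)    # "100"
area_for_102 = store({"user": "b"}, 102)  # "1000"
```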
That is, in this embodiment the storage areas are divided non-uniformly, so the data to be stored is distributed non-uniformly across the storage areas. When data is queried later, the required data can be obtained by processing only some of the storage areas, which significantly improves the data sampling speed. In other words, with the non-uniform partition strategy the amount of data read when querying a small amount of data is greatly reduced, so the query performance of the device is significantly improved.
In the data processing method of this embodiment, a target sampling range is determined according to the query service type corresponding to the data to be stored, a target modulus value is determined according to the target sampling range, and a hash modulo operation is performed on the feature identifier according to the target modulus value to determine the corresponding mark value. Different modulus values can thus be determined for the query service types of different data, so that the hash modulo operation is targeted to the data. In addition, the query requirement corresponding to the data to be stored is determined according to the service type to which the data belongs, a storage area division rule is determined according to the query requirement, the storage areas are divided according to that rule, and the mark value range corresponding to each storage area is determined. The mark value corresponding to the data to be stored is then compared with these ranges to determine the first mark value range to which it belongs, and the data is stored in the corresponding storage area. Because the storage areas are divided non-uniformly and the data is stored non-uniformly, data can be acquired faster in subsequent queries, and data can be sampled at different sampling rates and analyzed, which helps ensure the accuracy of the analysis result.
According to the analysis, the query requirement of the data to be stored is determined according to the business type of the data to be stored, the storage area division rule is determined according to the query requirement, and then the storage area is partitioned according to the storage area division rule, so that the data to be stored is stored in different storage areas.
In specific implementation, after the data to be stored is stored in the corresponding storage area, the stored data can be queried to obtain the required data, and the obtained data is subjected to statistical analysis to determine whether the performance corresponding to the data meets the requirement. The above-described data processing method of the present application will be specifically described with reference to fig. 4.
Fig. 4 is a schematic flow chart diagram of a data processing method according to another embodiment of the present application.
As shown in fig. 4, the data processing method according to the embodiment of the present application may include the following steps:
step 401, a data query request is obtained, where the query request includes a sampling rate.
In this embodiment, the sampling rate in the query request may be set according to an actual requirement, which is not specifically limited herein. E.g., 8%, 9%, etc.
Optionally, when the data query request is obtained, the computer device may perform analysis processing on the data query request to obtain a sampling rate included in the data query request, so as to perform subsequent query operation according to the sampling rate.
Step 402, determining a second tag value range to be queried according to the sampling rate.
Optionally, since the sampling rate is usually a percentage, for example, 8% or 9%, and the tag value range corresponding to each storage area is a natural number, for convenience of query, the measurement unit of the sampling rate may be first converted into the measurement unit that is the same as the tag value range corresponding to each storage area, and then a matching operation is performed to determine the second tag value range corresponding to the sampling rate.
As an optional implementation form of the present application, in this embodiment, when determining the second token value range to be queried, the sampling token value corresponding to the sampling rate may be determined according to a modulus value adopted when performing hash modulus operation on the data feature identifier; and determining a second mark value range to be queried according to the sampling mark value.
For example, if the modulus value used in the hash modulo operation on the data feature identifier is 10000 and the sampling rate is 9%, the sampling rate of 9% may be converted into nine hundred per ten thousand (900/10000), and then 900 is compared with the mark value range corresponding to each storage area, so that the second mark value range corresponding to the sampling rate of 9% may be determined to be 100-999.
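The conversion in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function names and the `PARTITION_BOUNDS` table are hypothetical, and the last boundary (10000) is extrapolated from the 10/100/1000 pattern used in the examples.

```python
from fractions import Fraction

# Hypothetical partition boundaries (exclusive upper bounds of each mark
# value range): 0-9, 10-99, 100-999, 1000-9999. The last range is an
# extrapolation of the pattern in the example, not taken from the source.
PARTITION_BOUNDS = [10, 100, 1000, 10000]


def sampling_mark_value(rate_percent, modulus=10000):
    """Convert a percentage sampling rate into a count of mark values."""
    # Fraction(str(...)) avoids binary floating-point error for inputs like 0.1.
    return int(Fraction(str(rate_percent)) * modulus / 100)


def second_mark_value_range(rate_percent, modulus=10000):
    """Return the (low, high) mark value range matched by the sampling rate."""
    mark = sampling_mark_value(rate_percent, modulus)
    if mark < 1:
        raise ValueError("sampling rate is finer than the modulus allows")
    low = 0
    for bound in PARTITION_BOUNDS:
        if mark <= bound:
            return (low, bound - 1)
        low = bound
    raise ValueError("sampling rate exceeds the partitioned range")
```

Under these assumptions, a sampling rate of 9% with modulus 10000 gives a sampling mark value of 900 and the range (100, 999), which agrees with the example.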
Step 403, determining a second storage area corresponding to the second tag value range according to the corresponding relationship between the preset tag value range and the storage area.
In this embodiment, the correspondence between the preset mark value range and the storage area may be referred to the above embodiments, which are not described in detail herein.
For example, when the converted sampling rate p ≤ 10 and the corresponding second mark value range is 0-9, the second storage area may be determined to be 10 according to the correspondence between the preset mark value ranges and the storage areas; when p ≤ 100 and the corresponding second mark value range is 10-99, the second storage area may be determined to be 100; when p ≤ 1000 and the corresponding second mark value range is 100-999, the second storage area may be determined to be 1000; and so on.
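The lookup of the second storage area can be sketched as a binary search over the storage area names. This sketch assumes that each storage area serves converted sampling rates up to and including its name; `AREA_NAMES` and `second_storage_area` are hypothetical names introduced only for illustration.

```python
import bisect

# Storage areas named after the largest converted sampling rate they serve,
# following the 10 / 100 / 1000 pattern of the example.
AREA_NAMES = [10, 100, 1000]


def second_storage_area(p):
    """Return the smallest storage area name that is >= p."""
    if p < 1:
        raise ValueError("converted sampling rate must be at least 1")
    index = bisect.bisect_left(AREA_NAMES, p)
    if index == len(AREA_NAMES):
        raise ValueError("no storage area covers this sampling rate")
    return AREA_NAMES[index]
```

With this table, a converted sampling rate of 900 resolves to storage area 1000.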
In step 404, the target data is read from the second storage area according to the sampling rate.
Optionally, after determining the second storage area corresponding to the second marker value range, the computer device may read the target data from the second storage area according to the sampling rate.
Continuing with the above example, if the corresponding second marker value range is determined to be 100-999 according to the sampling rate of 9%, the second storage area corresponding to the second marker value range can be determined to be 1000 through table 1, and then 900 target data can be read from the second storage area 1000.
In another implementation form of the present application, when the target data is read from the second storage area according to the sampling rate, each data with a flag value less than or equal to the sampling rate may also be read from the second storage area.
For example, if the modulus value used in the hash modulo operation on the data feature identifier is 10000 and the sampling rate is 9%, the computer device may, according to the converted rate of nine hundred per ten thousand, obtain data from the partitions named 10, 100, and 1000 respectively to obtain the target data.
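A minimal sketch of this read path is given below. It assumes an in-memory `store` that maps each partition name to a list of `(mark_value, record)` pairs, and it assumes mark values start at 0, so a strict comparison (`<`) selects exactly 900 of the 10000 possible mark values; with mark values starting at 1, the "less than or equal to" comparison in the text would apply instead. All names are hypothetical.

```python
def read_sampled(store, sampling_mark):
    """Read every record whose mark value is below `sampling_mark`.

    `store` maps a partition name (the exclusive upper bound of its mark
    value range) to a list of (mark_value, record) pairs.
    """
    result = []
    low = 0  # inclusive lower bound of the partition being visited
    for area in sorted(store):
        if low >= sampling_mark:
            break  # this and all later partitions hold no sampled mark values
        # Partitions entirely below the threshold pass every row; only the
        # last partition visited actually filters anything out.
        result.extend(rec for mark, rec in store[area] if mark < sampling_mark)
        low = area
    return result
```

Only partitions 10, 100, and 1000 are touched for a sampling mark value of 900; the largest partition is never scanned, which is the source of the reduced read volume.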
According to the data processing method of this embodiment, a data query request is obtained, the second mark value range to be queried is determined according to the sampling rate included in the data query request, the second storage area corresponding to the second mark value range is determined according to the correspondence between the preset mark value ranges and the storage areas, and the target data is read from the second storage area according to the sampling rate. Therefore, a query does not need to filter all of the data, which would slow down data acquisition; the amount of data read when a small amount of data is queried is greatly reduced, and device performance is significantly improved.
In order to implement the foregoing embodiments, the present application further provides a data processing apparatus.
Fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
As shown in fig. 5, the data processing apparatus of the present application includes: an acquisition module 110, a first determination module 120, a second determination module 130, and a processing module 140.
The obtaining module 110 is configured to obtain a data storage request, where the storage request includes data to be stored;
the first determining module 120 is configured to analyze the data to be stored, and determine a feature identifier corresponding to the data to be stored;
the second determining module 130 is configured to perform hash modulo operation on the feature identifier, and determine a mark value corresponding to the feature identifier;
the processing module 140 is configured to store the data to be stored into a first storage area corresponding to a first tag value range according to the first tag value range to which the tag value corresponding to the data to be stored belongs, where the tag values corresponding to different storage areas are different in number.
It should be noted that the foregoing explanation of the embodiment of the data processing method is also applicable to the data processing apparatus of the embodiment, and the implementation principle thereof is similar and will not be described herein again.
The data processing apparatus provided in this embodiment obtains a data storage request, analyzes the data to be stored, determines a feature identifier corresponding to the data to be stored, performs a hash modulo operation on the feature identifier, determines a mark value corresponding to the feature identifier, and then stores the data to be stored in a first storage area corresponding to a first mark value range according to the first mark value range to which the mark value corresponding to the data to be stored belongs. In this way, data is stored non-uniformly by means of hash modulo processing, which not only ensures the uniformity of the sampled data when data is sampled but also improves the data sampling precision; at the same sampling precision, it effectively reduces the consumption of storage resources, improves data processing efficiency, and improves the performance of the device.
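The storage path performed by these modules can be sketched as follows. This is an illustration under assumptions rather than the implementation of the application: SHA-256 is chosen here only as a convenient stable hash (the application does not name a hash function), the modulus 10000 and the power-of-ten boundaries follow the examples, and all function names are hypothetical.

```python
import bisect
import hashlib
from collections import defaultdict

MODULUS = 10000
# Exclusive upper bounds of the mark value ranges: 0-9, 10-99, 100-999,
# 1000-9999. Each storage area therefore covers a different number of
# mark values (10, 90, 900, 9000).
PARTITION_BOUNDS = [10, 100, 1000, 10000]


def mark_value(feature_id, modulus=MODULUS):
    """Hash the feature identifier and reduce it modulo `modulus`."""
    digest = hashlib.sha256(feature_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus


def storage_area(mark):
    """Return the storage area whose mark value range contains `mark`."""
    return PARTITION_BOUNDS[bisect.bisect_right(PARTITION_BOUNDS, mark)]


def store_record(store, feature_id, record):
    """Store `record` in the area selected by its mark value; return the area."""
    mark = mark_value(feature_id)
    area = storage_area(mark)
    store[area].append((mark, record))
    return area
```

A caller would create the store with `defaultdict(list)` and call `store_record(store, "user-42", payload)`; because the hash is deterministic, the same feature identifier always lands in the same storage area.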
Fig. 6 is a schematic structural diagram of a data processing apparatus according to another embodiment of the present application.
As shown in fig. 6, the data processing apparatus includes the obtaining module 110, the first determining module 120, and the processing module 140.
The obtaining module 110 is configured to obtain a data storage request, where the storage request includes data to be stored;
the first determining module 120 is configured to analyze the data to be stored, and determine a feature identifier corresponding to the data to be stored;
as an optional implementation manner of the present application, the data processing apparatus further includes: a third determination module 150, a fourth determination module 160, and a second processing module 170.
The third determining module 150 is configured to determine a target sampling range according to the query service type corresponding to the data to be stored;
the fourth determining module 160 is configured to determine a target modulus value according to the target sampling range;
the second processing module 170 is configured to perform a hash modulo operation on the feature identifier according to the target modulo value, and determine a flag value corresponding to the feature identifier.
The processing module 140 is configured to store the to-be-stored data into a first storage area corresponding to a first tag value range according to the first tag value range to which the tag value corresponding to the to-be-stored data belongs, where the tag values corresponding to different storage areas are different in number.
As an optional implementation manner of the present application, the data processing apparatus further includes: the device comprises a fifth determination module, a sixth determination module and a seventh determination module.
The fifth determining module is configured to determine, according to the service type to which the data to be stored belongs, a query requirement corresponding to the data to be stored;
and the sixth determining module is used for determining the preset storage area division rule according to the query requirement.
And the seventh determining module is used for determining the mark value ranges respectively corresponding to the storage areas according to the preset storage area division rule.
It should be noted that, for the implementation process and the technical principle of the data processing apparatus of this embodiment, reference is made to the foregoing explanation of the data processing method of the first embodiment, and details are not described herein again.
The data processing device provided by this embodiment determines a target sampling range according to the query service type corresponding to the data to be stored, determines a target modulus value according to the target sampling range, performs a hash modulo operation on the feature identifier according to the target modulus value, and determines the mark value corresponding to the feature identifier. Different modulus values are thus determined for the query service types of different data, so that a targeted hash modulo operation is performed on each kind of data. The device further determines the query requirement corresponding to the data to be stored according to the service type to which the data belongs, determines a storage area division rule according to the query requirement, divides the storage areas according to that rule, and determines the mark value range corresponding to each storage area. It then compares the mark value of the data to be stored with the mark value ranges of the storage areas to determine the first mark value range to which the data belongs, and stores the data in the corresponding storage area. Because the storage areas are divided with a non-uniform partition strategy and the data is stored non-uniformly, data can be acquired faster in subsequent queries, and data can be acquired at different sampling rates and analyzed, which helps ensure the accuracy of the data analysis result.
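One way to derive a modulus value and a non-uniform division rule from a target sampling range is sketched below. The application does not spell out this rule; the sketch assumes that the modulus is the smallest power of ten able to represent the finest sampling rate a service needs, and that storage areas are bounded by successive powers of ten, which generalizes the modulus 10000 and the 10/100/1000 partitions of the examples. The function names are hypothetical.

```python
from fractions import Fraction


def target_modulus(finest_rate):
    """Smallest power of ten m (at least 10) with finest_rate * m >= 1."""
    rate = Fraction(finest_rate)
    if not 0 < rate <= 1:
        raise ValueError("sampling rate must be in (0, 1]")
    modulus = 10
    while rate * modulus < 1:
        modulus *= 10
    return modulus


def partition_bounds(modulus):
    """Powers of ten from 10 up to `modulus`, used as exclusive range bounds."""
    bounds = []
    bound = 10
    while bound <= modulus:
        bounds.append(bound)
        bound *= 10
    return bounds
```

For a service whose finest sampling rate is one in ten thousand, this yields a modulus of 10000 and the bounds 10, 100, 1000, and 10000, so that the smallest storage area holds the data needed by the lowest sampling rates.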
Fig. 7 is a schematic structural diagram of a server according to another embodiment of the present application.
As shown in fig. 7, the data processing apparatus of the present application further includes: a second obtaining module 180, an eighth determining module 190, a tenth determining module 1100, and a reading module 1120.
The second obtaining module 180 is configured to obtain a data query request, where the query request includes a sampling rate;
the eighth determining module 190 is configured to determine a second token value range to be queried according to the sampling rate;
the tenth determining module 1100 is configured to determine, according to a correspondence between a preset tag value range and a storage area, a second storage area corresponding to the second tag value range;
the reading module 1120 is configured to read the target data from the second storage area according to the sampling rate.
As an optional implementation manner of the present application, the eighth determining module 190 is specifically configured to:
determining a sampling mark value corresponding to the sampling rate according to a modulus value adopted when the data characteristic identification is subjected to Hash modulus operation;
and determining a second mark value range to be inquired according to the sampling mark value.
As an optional implementation manner of the present application, the reading module 1120 is specifically configured to:
reading each data having a tag value less than or equal to the sampling rate from the second storage area.
It should be noted that the foregoing explanation of the embodiment of the data processing method is also applicable to the data processing apparatus of the embodiment, and the implementation principle is similar, and is not described herein again.
The data processing apparatus of this embodiment obtains a data query request, determines the second mark value range to be queried according to the sampling rate included in the data query request, and determines the second storage area corresponding to the second mark value range according to the correspondence between the preset mark value ranges and the storage areas, so as to read the target data from the second storage area according to the sampling rate. Therefore, a query does not need to filter all of the data, which would slow down data acquisition; the amount of data read when a small amount of data is queried is greatly reduced, and device performance is significantly improved.
In order to implement the foregoing embodiments, the present application further provides a computer device.
Fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 8, the computer device 200 includes: a memory 210, a processor 220, and a computer program stored on the memory 210 and executable on the processor 220, wherein the processor 220, when executing the program, implements the data processing method according to the first aspect.
In an alternative implementation form, as shown in fig. 9, the computer device 200 may further include: a memory 210 and a processor 220, a bus 230 connecting different components (including the memory 210 and the processor 220), wherein the memory 210 stores a computer program, and when the processor 220 executes the program, the data processing method according to the embodiment of the present application is implemented.
Bus 230 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 200 typically includes a variety of computer device readable media. Such media may be any available media that is accessible by computer device 200 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 210 may also include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM)240 and/or cache memory 250. The computer device 200 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 260 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 9, and commonly referred to as a "hard drive"). Although not shown in FIG. 9, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 230 by one or more data media interfaces. Memory 210 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the application.
A program/utility 280 having a set (at least one) of program modules 270, including but not limited to an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination thereof may include an implementation of a network environment, may be stored in, for example, the memory 210. The program modules 270 generally perform the functions and/or methodologies of the embodiments described herein.
The computer device 200 may also communicate with one or more external devices 290 (e.g., keyboard, pointing device, display 291, etc.), with one or more devices that enable a user to interact with the computer device 200, and/or with any devices (e.g., network card, modem, etc.) that enable the computer device 200 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 292. Also, computer device 200 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) through network adapter 293. As shown, network adapter 293 communicates with the other modules of computer device 200 via bus 230. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the computer device 200, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, to name a few.
It should be noted that, for the implementation process and the technical principle of the computer device of this embodiment, reference is made to the foregoing explanation of the data processing method of the first embodiment, and details are not described here.
The computer device provided by this embodiment obtains a data storage request, analyzes the data to be stored, determines the feature identifier corresponding to the data to be stored, performs a hash modulo operation on the feature identifier, determines the mark value corresponding to the feature identifier, and then stores the data to be stored in the first storage area corresponding to the first mark value range according to the first mark value range to which the mark value corresponding to the data to be stored belongs. In this way, data is stored non-uniformly by means of hash modulo processing, which not only ensures the uniformity of the sampled data when data is sampled but also improves the data sampling precision; at the same sampling precision, it effectively reduces the consumption of storage resources, improves data processing efficiency, and improves the performance of the device.
To achieve the above object, the present application also proposes a computer-readable storage medium.
Wherein the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the data processing method of the first aspect.
In an alternative implementation, the embodiments may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
To achieve the above object, the present application also proposes a computer program. Wherein the computer program when executed by the processor is adapted to carry out the data processing method of the first aspect.
In this application, unless expressly stated or limited otherwise, the terms "disposed," "connected," and the like are to be construed broadly and include, for example, mechanical and electrical connections; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes additional implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are well known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried out in the method of implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and the program, when executed, includes one or a combination of the steps of the method embodiments.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (6)

1. A data processing method, comprising:
acquiring a data storage request, wherein the storage request comprises data to be stored;
analyzing the data to be stored, and determining a characteristic identifier corresponding to the data to be stored;
determining a target sampling range according to the query service type corresponding to the data to be stored;
determining a target module value according to the target sampling range;
performing Hash modular operation on the feature identifier according to the target modular value, and determining a mark value corresponding to the feature identifier;
storing the data to be stored into a first storage area corresponding to a first marking value range according to the first marking value range to which the marking value corresponding to the data to be stored belongs, wherein the marking values corresponding to different storage areas are different in quantity;
before the storing the data to be stored into the first storage area corresponding to the first marker value range, the method further includes:
determining the mark value range corresponding to each storage area according to a preset storage area division rule;
before determining the mark value ranges respectively corresponding to the storage areas, the method further includes:
determining a query requirement corresponding to the data to be stored according to the service type of the data to be stored;
determining the preset storage area division rule according to the sampling rate corresponding to the query requirement, wherein the preset storage area division rule comprises the step of dividing the storage areas in different proportions;
after the storing the data to be stored into the first storage area corresponding to the first mark value range, the method further includes:
acquiring a data query request, wherein the query request comprises a sampling rate;
determining a second mark value range to be queried according to the sampling rate;
determining a second storage area corresponding to the second mark value range according to the corresponding relation between the preset mark value range and the storage area;
and reading target data from the second storage area according to the sampling rate.
2. The method of claim 1, wherein determining a second range of token values to query based on the sampling rate comprises:
determining a sampling mark value corresponding to the sampling rate according to a modulus value adopted when the data characteristic identification is subjected to Hash modulus operation;
and determining a second mark value range to be queried according to the sampling mark value.
3. The method of claim 1, wherein reading target data from the second storage area according to the sampling rate comprises:
reading from the second storage area each data having a tag value less than or equal to the sampling rate.
4. A data processing apparatus, comprising:
the acquisition module is used for acquiring a data storage request, wherein the storage request comprises data to be stored;
the first determining module is used for analyzing the data to be stored and determining the characteristic identifier corresponding to the data to be stored;
the second determining module is used for determining a target sampling range according to the query service type corresponding to the data to be stored; determining a target modulus value according to the target sampling range; according to the target modulus value, performing Hash modulus operation on the feature identifier, and determining a mark value corresponding to the feature identifier;
the processing module is used for storing the data to be stored into a first storage area corresponding to a first mark value range according to the first mark value range to which the mark value corresponding to the data to be stored belongs, wherein the number of the mark values corresponding to different storage areas is different;
the apparatus is further configured to:
before storing the data to be stored into the first storage area corresponding to the first mark value range:
determine the mark value range corresponding to each storage area according to a preset storage area division rule;
before determining the mark value ranges respectively corresponding to the storage areas:
determine a query requirement corresponding to the data to be stored according to the service type of the data to be stored;
determine the preset storage area division rule according to the sampling rate corresponding to the query requirement, wherein the preset storage area division rule comprises dividing the storage areas in different proportions;
after storing the data to be stored into the first storage area corresponding to the first mark value range:
acquire a data query request, wherein the query request comprises a sampling rate;
determine a second mark value range to be queried according to the sampling rate;
determine a second storage area corresponding to the second mark value range according to the preset correspondence between mark value ranges and storage areas;
and read target data from the second storage area according to the sampling rate.
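The store and query paths recited above can be tied together in a small in-memory sketch. Everything concrete here is an assumption made for illustration: SHA-256 stands in for the unspecified hash function, the modulus is fixed at 100 rather than derived from a target sampling range, and the division rule splits the mark values into three areas of unequal size (1, 9 and 90 mark values). The class and function names are hypothetical.

```python
import hashlib
from bisect import bisect_right
from collections import defaultdict

MODULUS = 100             # assumed modulus; the claims derive it from the sampling range
AREA_STARTS = [0, 1, 10]  # assumed division rule: marks [0,1), [1,10), [10,100)


def mark_value(identifier: str, modulus: int = MODULUS) -> int:
    """Hash the characteristic identifier and reduce it modulo `modulus`."""
    digest = hashlib.sha256(identifier.encode("utf-8")).hexdigest()
    return int(digest, 16) % modulus


def area_for(mark: int) -> int:
    """Index of the storage area whose mark value range contains `mark`."""
    return bisect_right(AREA_STARTS, mark) - 1


class SampledStore:
    def __init__(self) -> None:
        self.areas = defaultdict(list)  # area index -> list of (mark, payload)

    def put(self, identifier: str, payload) -> int:
        """Store the payload in the area matching its mark value; return that area."""
        mark = mark_value(identifier)
        area = area_for(mark)
        self.areas[area].append((mark, payload))
        return area

    def sample(self, sampling_mark: int) -> list:
        """Read payloads with mark <= sampling_mark, scanning only the areas needed."""
        result = []
        for area in range(area_for(sampling_mark) + 1):
            result.extend(p for m, p in self.areas.get(area, []) if m <= sampling_mark)
        return result
```

With this layout, a query for roughly 1% of the data (`sample(0)`) scans only the smallest area, while a full read (`sample(99)`) scans all three. That is the point of giving the storage areas different numbers of mark values.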
5. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the data processing method of any one of claims 1 to 3.
6. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the data processing method of any one of claims 1 to 3.
CN201810721912.6A 2018-07-04 2018-07-04 Data processing method, device, equipment and storage medium Active CN110750529B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810721912.6A CN110750529B (en) 2018-07-04 2018-07-04 Data processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110750529A CN110750529A (en) 2020-02-04
CN110750529B true CN110750529B (en) 2022-09-23

Family

ID=69274697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810721912.6A Active CN110750529B (en) 2018-07-04 2018-07-04 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110750529B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666421B (en) * 2020-06-03 2024-05-10 北京声智科技有限公司 Data processing method and device and electronic equipment
CN114143279B (en) * 2020-08-13 2023-10-24 北京有限元科技有限公司 Interactive recording sampling method and device and storage medium
CN112988629A (en) * 2021-03-11 2021-06-18 北京信息科技大学 Data recording device and method, storage medium
CN113361683B (en) * 2021-05-18 2023-01-10 山东师范大学 Biological brain-imitation storage method and system
CN114661711B (en) * 2022-03-11 2023-08-29 上海原能细胞生物低温设备有限公司 Sample storage position allocation method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101119246A (en) * 2007-09-20 2008-02-06 杭州华三通信技术有限公司 Data packet sampling statistic method and apparatus
CN102402394A (en) * 2010-09-13 2012-04-04 腾讯科技(深圳)有限公司 Hash algorithm-based data storage method and device
CN102799486A (en) * 2012-06-18 2012-11-28 北京大学 Data sampling and partitioning method for MapReduce system
CN102968498A (en) * 2012-12-05 2013-03-13 华为技术有限公司 Method and device for processing data
CN103530322A (en) * 2013-09-18 2014-01-22 深圳市华为技术软件有限公司 Method and device for processing data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10169409B2 (en) * 2015-10-01 2019-01-01 International Business Machines Corporation System and method for transferring data between RDBMS and big data platform

Also Published As

Publication number Publication date
CN110750529A (en) 2020-02-04

Similar Documents

Publication Publication Date Title
CN110750529B (en) Data processing method, device, equipment and storage medium
US9858327B2 (en) Inferring application type based on input-output characteristics of application storage resources
CN111813804B (en) Data query method and device, electronic equipment and storage medium
US10198455B2 (en) Sampling-based deduplication estimation
CN111061740B (en) Data synchronization method, device and storage medium
US10963374B2 (en) Memory allocation analysis
CN114281663A (en) Test processing method, test processing device, electronic equipment and storage medium
CN110737727B (en) Data processing method and system
CN111226201B (en) Method for managing memory in computer and computer system
CN112035159B (en) Configuration method, device, equipment and storage medium of audit model
CN110674165A (en) Method and device for adjusting sampling rate, storage medium and terminal equipment
CN113760950B (en) Index data query method, device, electronic equipment and storage medium
CN113094415B (en) Data extraction method, data extraction device, computer readable medium and electronic equipment
CN112131257B (en) Data query method and device
US11023226B2 (en) Dynamic data ingestion
CN113760176A (en) Data storage method and device
CN111158994A (en) Pressure testing performance testing method and device
CN110399298A (en) A kind of test method and device
CN111782588A (en) File reading method, device, equipment and medium
CN117171140B (en) Data migration method and device
CN110134691B (en) Data verification method, device, equipment and medium
CN109710673B (en) Work processing method, device, equipment and medium
US20150006431A1 (en) Providing resource access
CN114064642A (en) Data processing method and device, computer equipment and storage medium
CN115934644A (en) File processing method based on Elasticsearch

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant