CN117632985A - Database sample updating method, device, equipment and medium

Info

Publication number: CN117632985A
Application number: CN202311657931.4A
Authority: CN
Inventors: 吴涵
Original assignee: China Construction Bank Corp; CCB Finetech Co Ltd
Current assignee: China Construction Bank Corp; CCB Finetech Co Ltd
Priority date: 2023-12-05
Filing date: 2023-12-05
Publication date: 2024-03-01

Abstract

The disclosure relates to a database sample updating method, device, equipment and medium. It addresses the technical problem that, as data volume grows massively and queries become more complex, exact queries incur huge computation and cost, lengthen query response time, and degrade the query experience. The database sample updating method comprises the following steps: acquiring tuple set data for updating a database, wherein the tuple set data comprises at least one tuple data; determining a test statistic after adding the tuple data by using the tuple set data and sample tuple data in the database; and when the test statistic is greater than a preset threshold, adding the tuple data in the tuple set data into the acquired sample of the database.

Description

Database sample updating method, device, equipment and medium

Technical Field

The present disclosure relates to the field of database sample updating technologies and data processing technologies, and in particular, to a database sample updating method, device, equipment, and medium.

Background

With the rapid development of network communication technology, current computer clusters possess huge data processing capability, but obtaining the results of ad hoc queries from large-scale data sets is still challenging. In order to solve this problem, a trend of promoting approximate calculation in a big data analysis framework has appeared in recent years.

However, when an SQL query is executed against a database, traversing all tuples yields an exact answer; but as data volume grows massively and queries become more complex, exact queries incur huge computation and cost, lengthen query response time, and degrade the query experience.

Disclosure of Invention

In order to solve the above technical problems, the disclosure provides a database sample updating method, device, equipment and medium, which address the problem that, as data volume grows massively and queries become more complex, exact queries incur huge computation and cost, lengthen query response time, and degrade the query experience.

In a first aspect, an embodiment of the present disclosure provides a database sample updating method, including:

acquiring tuple set data for updating a database, wherein the tuple set data comprises at least one tuple data;

determining test statistics after adding the tuple data by using the tuple set data and sample tuple data in a database;

and adding the tuple data in the tuple set data into the acquired sample of the database when the test statistic is greater than a preset threshold.

In one possible implementation manner, in the method provided by the embodiment of the present invention, test statistics when tuple data is added are determined by using tuple set data and sample tuple data in a database, including:

determining a plurality of sample tuple data as acquisition samples according to a preset sampling probability in a database;

and determining the test statistic after the tuple data is added by using the collected sample and the tuple set data.

In one possible implementation manner, in the method provided by the embodiment of the present invention, by using the collected sample and the tuple set data, the test statistic after the tuple data is added is determined, including:

dividing the collected sample into a plurality of data packets of fixed data size;

generating a target packet with the same data volume as the data packet by using the database;

updating the target packet with the tuple set data, and determining the test statistic by using the plurality of data packets and the target packet.

In one possible implementation manner, in the method provided by the embodiment of the present invention, the target packet is updated by using the tuple set data, and the test statistics are determined by using a plurality of data packets and the target packet, including:

sequentially adding the tuple data in the tuple set data into the target packet, and correspondingly deleting a target number of tuple data from the target packet, wherein the target number is the number of tuple data in the tuple set data;

sequentially deleting the target number of sample tuple data from each data packet, and adding the target number of sample tuple data into a database;

test statistics are determined using the plurality of data packets and the target packet.

In a possible implementation manner, in the method provided by the embodiment of the present invention, when the test statistic is greater than a preset threshold, adding the tuple data in the tuple set data to the collection sample of the database, including:

when the test statistic is larger than a preset threshold value, determining the newly distributed target data volume;

and when the target data volume is smaller than the fixed data volume, or smaller than the number of tuple data contained in each page of the database, adding the tuple data corresponding to the target data volume into the acquired samples.

In one possible implementation manner, in the method provided by the embodiment of the present invention, the method further includes:

when the target data volume is larger than the fixed data volume and larger than the number of tuple data contained in each page of the database, determining corresponding sample tuple data from the database according to a preset rule, and adding the corresponding sample tuple data and the tuple data to the acquired samples;

when the test statistic is smaller than a preset threshold value, adding tuple set data into a database, and determining that the distribution of the added database is unchanged;

and taking the acquired sample as the acquired sample after adding the tuple set data, and performing approximate query.

In a second aspect, embodiments of the present disclosure provide a database sample updating apparatus, the apparatus including:

an acquisition unit configured to acquire tuple set data for updating a database, the tuple set data including at least one tuple data;

a determining unit for determining a test statistic after adding the tuple data by using the tuple set data and the sample tuple data in the database;

and the processing unit is used for adding the tuple data in the tuple set data into the acquired sample of the database when the test statistic is greater than a preset threshold value.

In a possible implementation manner, in the device provided by the embodiment of the present invention, the determining unit is specifically configured to:

In a possible implementation manner, in the device provided by the embodiment of the present invention, the processing unit is specifically configured to:

In a possible implementation manner, in the apparatus provided by the embodiment of the present invention, the processing unit is further configured to:

In a third aspect, an embodiment of the present disclosure provides an electronic device, including:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in a memory and configured to be executed by a processor to implement a database sample updating method as described above.

In a fourth aspect, embodiments of the present disclosure provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a database sample updating method as described above.

The embodiment of the disclosure provides a database sample updating method, which comprises the following steps:

firstly, acquiring tuple set data for updating a database sample, wherein the tuple set data comprises at least one tuple data; then determining a test statistic after the tuple data are added by using the tuple set data and the sample tuple data in the database; and finally adding the tuple data in the tuple set data into the acquired sample of the database when the test statistic is greater than a preset threshold. With the database sample updating method provided by the disclosure, the test statistic is used to judge whether the data distribution of the database changes after the tuple data are added, so that the existing samples in the database can be maintained and adjusted incrementally. Resampling after every database update is thereby avoided, the original samples are used effectively, query efficiency is improved, the high maintenance cost of traditional database samples is reduced, and the statistical accuracy of incremental sampling is ensured.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.

In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

Fig. 1 is a schematic flow chart of a database sample updating method according to an embodiment of the disclosure;

fig. 2 is a specific flowchart of a database sample updating method according to an embodiment of the present disclosure;

fig. 3 is a schematic structural diagram of a database sample updating apparatus according to an embodiment of the present disclosure;

fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.

Detailed Description

In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.

In the embodiment of the invention, the term "and/or" describes an association relation between associated objects and indicates that three relations can exist; for example, A and/or B can mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the objects before and after it are in an "or" relationship.

In the field of database approximate query, the following methods are commonly adopted to collect samples to provide approximate results and error estimation of the query:

(1) Reservoir sampling: assume that the original data table T contains N tuples in total, and n of them are to be drawn as a sample. First, a reservoir capable of holding n elements is constructed, and the first n tuples in T are all placed into it. Second, for each subsequent tuple i (n < i ≤ N), tuple i is chosen as a sample with probability n/i; if chosen, it replaces a randomly selected element in the reservoir, each element being replaced with probability 1/n. This process is repeated until all tuples have been traversed, at which point every tuple in T appears in the reservoir with the same probability, n/N. The n tuples remaining in the reservoir are the collected sample, which achieves uniform sampling.

(2) Bernoulli sampling: each tuple is sampled independently with a probability of p being sampled and a probability of 1-p not being sampled, which is a Bernoulli process.

(3) Hash sampling: a uniform hash function is applied to each tuple, mapping it to a real number in [0,1], and this number is compared with the sampling probability p to decide whether the tuple is taken as a sample.

(4) Hierarchical (stratified) sampling: the original data table is divided into a plurality of layers according to the unique values on a column set C, the number of tuples contained in each layer is counted, and random sampling is then carried out independently in each layer (a random sampling method such as Bernoulli sampling can be used within a layer). For example, BlinkDB builds offline samples to answer queries based on prior knowledge of the data set and the historical query workload, and creates and caches hierarchical samples for commonly queried column sets, which incurs a high storage cost. For queries that match a commonly queried column set, approximate results can be computed efficiently on the offline sample; if no column set matches the query, however, answering approximately from the offline sample can lead to large errors.
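Methods (1) and (3) above can be sketched as follows. This is a minimal illustration rather than the implementation of any particular database: the function names are hypothetical, and SHA-256 over `repr(t)` is only an assumed stand-in for the "uniform hash function" mentioned in the text.

```python
import hashlib
import random


def reservoir_sample(tuples, n, rng=None):
    """Uniformly sample up to n items from an iterable in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, t in enumerate(tuples, start=1):
        if i <= n:
            reservoir.append(t)                 # fill the reservoir first
        elif rng.random() < n / i:              # keep tuple i with probability n/i
            reservoir[rng.randrange(n)] = t     # each slot is hit with probability 1/n
    return reservoir


def hash_to_unit(t):
    """Map a tuple to a reproducible real number in [0, 1)."""
    digest = hashlib.sha256(repr(t).encode("utf-8")).digest()
    # Keep 53 bits so the quotient is exactly representable and below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2 ** 53


def hash_sample(tuples, p):
    """Keep every tuple whose hash value falls below the sampling probability p."""
    return [t for t in tuples if hash_to_unit(t) < p]
```

Unlike reservoir sampling, the hash-based variant always makes the same decision for the same tuple, so repeated scans select the same rows.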

The common sampling algorithms in existing databases have the following main defects. Reservoir, Bernoulli and hash sampling are all simple random sampling, so the samples cannot represent the underlying data distribution well, and the sampling effect on data sets whose underlying distribution is skewed is not ideal. The main disadvantages of hierarchical sampling are that the database table must be traversed twice, so the access cost is high, and that resampling is required when the underlying data set grows, which hinders sample maintenance.
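The two table traversals that hierarchical sampling requires can be seen in a short sketch. The per-layer rate used here (an expected `target_per_layer` rows from each layer) is an assumption made for illustration, as are all names; the text does not prescribe how the rate of each layer is chosen.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, target_per_layer, rng=None):
    """Sample each layer independently; rows must be re-iterable (e.g. a list)."""
    rng = rng or random.Random()
    # Pass 1: count the tuples in each layer (unique value of the column set).
    counts = Counter(key(row) for row in rows)
    # Pass 2: Bernoulli-sample inside each layer with a layer-specific rate.
    sample = defaultdict(list)
    for row in rows:
        layer = key(row)
        rate = min(1.0, target_per_layer / counts[layer])
        if rng.random() < rate:
            sample[layer].append(row)
    return dict(sample)
```

Because the rates depend on the layer counts gathered in the first pass, newly inserted rows change those counts, which is why a plain stratified sample has to be rebuilt as the table grows.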

Fig. 1 is a flowchart of a database sample updating method according to an embodiment of the present disclosure, which specifically includes the following steps S101 to S103 shown in fig. 1:

S101, acquiring tuple set data for updating a database sample.

In particular, tuple set data for updating a sample in a database is obtained, wherein the tuple set data comprises a plurality of tuple data.

S102, determining test statistics after adding the tuple data by using the tuple set data and sample tuple data in a database.

In a specific implementation, a plurality of sample tuple data are first determined in the database according to a preset sampling probability and taken as the acquisition samples, and the test statistic after the tuple data is added is then determined by using the acquisition samples and the tuple set data. Specifically, to determine the test statistic, the acquisition samples are first divided into a plurality of data packets of fixed data size; a target packet with the same data size as a data packet is then generated from the database; the tuple data in the tuple set data are then added to the target packet in sequence, and a target number of tuple data are correspondingly deleted from the target packet, the target number being the number of tuple data in the tuple set data; the target number of sample tuple data are then deleted from each data packet in sequence, and the target number of sample tuple data are added into the database; and finally the test statistic is determined by using the plurality of data packets and the target packet.

S103, adding the tuple data in the tuple set data into the acquisition sample of the database when the test statistic is greater than a preset threshold.

When the test statistic is greater than the preset threshold, the target data volume obeying the new distribution is determined. When the target data volume is smaller than the fixed data volume, or smaller than the number of tuple data contained in each page of the database, the tuple data corresponding to the target data volume are added to the acquired samples; when the target data volume is larger than the fixed data volume and larger than the number of tuple data contained in each page of the database, corresponding sample tuple data are determined from the database according to a preset rule, and the corresponding sample tuple data and the tuple data are added to the acquired samples.

Of course, in this step, when the test statistic is smaller than the preset threshold, the tuple set data is added to the database, and it is determined that the distribution of the added database is unchanged, and then the collected sample is used as a collected sample after the tuple set data is added, for performing the approximate query.
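The control flow of steps S101 to S103 can be summarized in a few lines. This is only a skeleton under stated assumptions: `statistic` and `draw_increment` are hypothetical callables standing in for the test statistic and the incremental sampling rule, and `mean_shift` is a deliberately simple placeholder, not the kernel-based statistic of the embodiment.

```python
def update_sample(sample, new_tuples, statistic, threshold, draw_increment):
    """Return (sample, changed) after a batch of new tuples arrives."""
    value = statistic(sample, new_tuples)          # S102: test statistic
    if value > threshold:                          # S103: distribution changed
        return list(sample) + list(draw_increment(new_tuples)), True
    return list(sample), False                     # unchanged: reuse the old sample


def mean_shift(sample, new_tuples):
    """Placeholder statistic: absolute difference of the two means."""
    return abs(sum(new_tuples) / len(new_tuples) - sum(sample) / len(sample))
```

When the statistic stays below the threshold the existing sample is returned untouched, which is the case where no access to the original table is needed.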

Fig. 2 is a specific flowchart of a database sample updating method according to an embodiment of the present disclosure, which specifically includes the following steps S201 to S206 shown in fig. 2:

S201, acquiring tuple set data of an acquisition sample for updating a database.

S202, determining a plurality of sample tuple data in a database according to a preset sampling probability as acquisition samples.

In a specific implementation, a plurality of tuple data are extracted from the database as acquisition samples according to the set sampling probability; specifically, a uniform sample is obtained from the database. There are many ways to construct a uniform sample of a data set; the most common in databases is Bernoulli sampling, in which each tuple data in the database is sampled with probability p and not sampled with probability 1-p, and whether one tuple data is sampled is independent of the others. The specific implementation is as follows: the whole data table is scanned and a random number in [0,1] is generated for each tuple data; if the random number is smaller than the set sampling probability p, the tuple data is extracted, and otherwise it is not. In the present embodiment, it is assumed that n tuple data have been extracted by the Bernoulli method, as the original sample S, from an original relation table containing N tuples.

S203, dividing the acquired sample into a plurality of data packets with fixed data volume.

In one example, the acquired samples are equally divided into S blocks X of size B. The value of B can be set according to the actual requirement, which is not limited in the embodiment of the disclosure.

S204, generating a target packet with the same data volume as the data packet by utilizing the database.

Still using the above example, a Block Y of the same size as Block X is created to store newly added tuple data, and Y is initialized with the last B tuple data of the original relation table.
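Steps S202 to S204 can be sketched as follows. The function names are hypothetical, the table is modeled as an in-memory list, and dropping a final partial block smaller than B is an assumption, since the text only speaks of blocks of equal size.

```python
import random


def bernoulli_sample(table, p, rng):
    """S202: keep each row independently with probability p."""
    return [row for row in table if rng.random() < p]


def split_into_blocks(sample, block_size):
    """S203: cut the sample into full blocks X of size B (leftover rows are dropped)."""
    full = len(sample) // block_size
    return [sample[i * block_size:(i + 1) * block_size] for i in range(full)]


def init_target_block(table, block_size):
    """S204: initialize Block Y with the last B rows of the original table."""
    return list(table[-block_size:])
```

For a table of 10,000 rows and p = 0.1 the sample holds about 1,000 rows, which gives roughly 20 blocks when B = 50.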

S205, updating the target packet by using the tuple set data, and determining the test statistic by using the plurality of data packets and the target packet.

In a specific implementation, the tuple data in the tuple set data are added to the target packet in sequence, and a target number of tuple data are correspondingly deleted from the target packet, the target number being the number of tuple data in the tuple set data; the target number of sample tuple data are then deleted from each data packet in sequence, and the target number of sample tuple data are added from the database; and finally the test statistic is determined by using the plurality of data packets and the target packet.

Still using the above example, each time a new tuple data flows into Block Y, one tuple data is removed from the bottom of Y and put back into the original relation table. Correspondingly, each Block X removes one tuple data from its bottom and puts it back into the original relation table, and then randomly selects one tuple data from the original relation table and places it on top of the block, which is equivalent to maintaining a fixed-size sliding window over the data stream. The maximum mean discrepancy (Maximum Mean Discrepancy, MMD) value between each Block X and Block Y is then calculated, and the values are averaged and normalized to obtain the kernel-based test statistic M_static.
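A minimal sketch of the sliding-window update and the MMD computation follows. Several choices here are assumptions not fixed by the text: tuples are reduced to single numeric values, a Gaussian kernel with bandwidth 1.0 is used, the biased estimator of the squared MMD is taken, the original table is treated as a read-only list (so "putting back" a tuple is simply evicting it from the block), and the final normalization step is left out because its form is not specified.

```python
import math
import random
from collections import deque


def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of the squared MMD between two 1-D samples."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def slide(blocks, target, new_value, table, rng):
    """Advance every window by one slot when new_value streams in."""
    target.popleft()                      # oldest value leaves the bottom of Y
    target.append(new_value)              # new value enters Y
    for block in blocks:
        block.popleft()                   # oldest value leaves Block X
        block.append(rng.choice(table))   # random refill from the original table


def kernel_statistic(blocks, target, bandwidth=1.0):
    """Mean squared MMD between each Block X and Block Y (not normalized)."""
    values = [mmd_squared(block, target, bandwidth) for block in blocks]
    return sum(values) / len(values)
```

Using `deque` keeps every window at a fixed size B, and the statistic is close to zero when the blocks and Y are drawn from the same distribution.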

S206, when the test statistic is larger than a preset threshold, adding the tuple data in the tuple set data into the collection sample of the database.

In a specific implementation, the test statistic is compared with a preset threshold. For data with the same distribution, the value of the statistic remains basically the same. If the value of the test statistic is smaller than the preset threshold, the distribution of the database has not changed after the new tuple set data arrived, and the original sample can continue to be used as the sample of the new database for approximate query calculation. If the value of the test statistic is greater than the preset threshold, a change point has been detected and the data distribution has changed, and incremental sampling should then be performed. The detection algorithm is used to count the amount of data obeying the new distribution, denoted M, and the sampling strategy is determined by the size of M. If M is less than B, or M is less than the number of tuple data contained in each Page, all M tuples are taken into the sample. Otherwise, each data distribution is regarded as a layer and, following the idea of hierarchical sampling, the number of tuple data that should be extracted from the new distribution is derived from a proportion formula and denoted m; pages sufficient to reach the required sample number are then randomly drawn from the new distribution, and all the tuple data in each selected Page are taken as samples, which applies the idea of cluster sampling in statistics.
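The incremental sampling rule applied after a change point is detected can be sketched as below. The proportion formula is not given in the text, so this sketch assumes that the new layer keeps the old sampling fraction n/N; the function and parameter names are likewise hypothetical, and pages are modeled as lists of tuples.

```python
import math
import random


def incremental_sample(new_pages, block_size, page_size,
                       old_sample_size, old_table_size, rng):
    """Choose the tuples of the new distribution that join the sample."""
    new_rows = [row for page in new_pages for row in page]
    m_new = len(new_rows)                          # M in the text
    if m_new < block_size or m_new < page_size:
        return new_rows                            # small change: take all M tuples
    # Assumed proportion formula: keep the old sampling fraction n/N.
    target = math.ceil(m_new * old_sample_size / old_table_size)   # m in the text
    picked = []
    for page in rng.sample(new_pages, len(new_pages)):   # pages in random order
        if len(picked) >= target:
            break
        picked.extend(page)                        # whole page: cluster sampling
    return picked
```

Taking whole pages means the result can overshoot m by up to one page, which trades a slightly larger sample for fewer page reads.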

Furthermore, when tuple data is deleted from the database, a tracking counter c(i) is maintained for each unique item i in the sample S, since multiple copies of a tuple data may exist. When one item i is deleted, if c(i) = 0, the sample S remains unchanged; if c(i) > 0, one copy of item i in the sample is randomly deleted with the original sampling probability. When tuple data in the original database is updated, the update can be regarded as a deletion from the relation table followed by the insertion of a new tuple, so the update case can be handled by combining the insertion and deletion operations in the embodiments of the present disclosure.
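The counter-based handling of deletions and updates can be sketched as follows. The class and method names are hypothetical, and reading "with the original sampling probability" as removing one copy with probability p is an interpretation of the text, not a confirmed detail; inserts are modeled here as plain Bernoulli inserts for simplicity.

```python
import random
from collections import Counter


class SampleWithCounts:
    """Sample stored as counters c(i), one per unique item i."""

    def __init__(self, sample, p, rng=None):
        self.p = p                          # original sampling probability
        self.rng = rng or random.Random()
        self.counts = Counter(sample)       # c(i) for every unique item

    def delete(self, item):
        """Handle deletion of one copy of item from the database."""
        if self.counts[item] == 0:          # c(i) = 0: sample unchanged
            return False
        if self.rng.random() < self.p:      # c(i) > 0: drop one copy with prob. p
            self.counts[item] -= 1
            if self.counts[item] == 0:
                del self.counts[item]
            return True
        return False

    def insert(self, item):
        """Handle insertion of item into the database (Bernoulli insert)."""
        if self.rng.random() < self.p:
            self.counts[item] += 1
            return True
        return False

    def update(self, old_item, new_item):
        """An update is a deletion followed by an insertion."""
        self.delete(old_item)
        self.insert(new_item)
```

Items must be hashable, which holds for rows represented as Python tuples.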

Compared with traditional simple sampling, the embodiment of the disclosure adopts a database sample maintenance method based on data distribution, so that the collected samples can represent the data distribution of the underlying database table. The method makes no assumption about the distribution of the underlying data set and can quickly detect whether the data distribution changes after data is newly added. If the distribution is unchanged, the existing sample can continue to be used for approximate calculation; once a change in the data distribution is detected, quick incremental sampling is performed, which maintains a good balance between accuracy and sampling efficiency. The method provided by the embodiment of the disclosure realizes incremental maintenance and adjustment of the samples in the database, avoids rescanning and resampling all tuples after each insertion or deletion of data, uses the original samples effectively, reduces the cost of accessing the original relation table, and improves query efficiency.

Fig. 3 is a schematic structural diagram of a database sample updating apparatus according to an embodiment of the disclosure. The database sample updating apparatus 300 provided in the embodiments of the present disclosure may execute the processing flow provided in the embodiments of the database sample updating method, as shown in fig. 3, where the database sample updating apparatus 300 includes an obtaining unit 301, a determining unit 302, and a processing unit 303, where:

an obtaining unit 301, configured to obtain tuple set data for updating a database, where the tuple set data includes at least one tuple data;

a determining unit 302, configured to determine test statistics after adding the tuple data by using the tuple set data and the sample tuple data in the database;

and the processing unit 303 is configured to add the tuple data in the tuple set data to the acquired sample of the database when the test statistic is greater than a preset threshold.

In a possible implementation manner, in the apparatus provided by the embodiment of the present invention, the determining unit 302 is specifically configured to:

In a possible implementation manner, in the apparatus provided by the embodiment of the present invention, the processing unit 303 is specifically configured to:

In a possible implementation manner, in the apparatus provided by the embodiment of the present invention, the processing unit 303 is further configured to:

The database sample updating apparatus of the embodiment shown in fig. 3 may be used to implement the technical solution of the above-mentioned method embodiment, and its implementation principle and technical effects are similar, and will not be described herein again.

In addition, the database sample updating method and apparatus of the embodiments of the present application described in connection with fig. 1-3 may be implemented by an electronic device. Fig. 4 shows a schematic hardware structure of an electronic device according to an embodiment of the present application.

As shown in fig. 4, the electronic device 400 may include a processing means (e.g., a central processor, a graphics processor, etc.) 401 that may perform various suitable actions and processes to implement the database sample updating method of the embodiments as described in the present disclosure according to a program stored in a Read Only Memory (ROM) 402 or a program loaded from a storage means 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the electronic device 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.

In general, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 407 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 408 including, for example, magnetic tape, hard disk, etc.; and a communication device 409. The communication means 409 may allow the electronic device 400 to communicate with other devices wirelessly or by wire to exchange data. While fig. 4 shows an electronic device 400 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts, thereby implementing the database sample updating method as described above. In such an embodiment, the computer program may be downloaded and installed from a network via communications device 409, or from storage 408, or from ROM 402. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 401.

It should be noted that the computer readable medium described in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.

In some implementations, clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.

The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to:

Alternatively, the electronic device may perform other steps described in the above embodiments when the above one or more programs are executed by the electronic device.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including, but not limited to, object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware. In some cases, the names of the units do not constitute a limitation of the units themselves.

The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

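1. A method of updating a database sample, the method comprising:

acquiring tuple set data for updating a database, wherein the tuple set data comprises at least one tuple data;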

determining test statistics after adding the tuple data by using the tuple set data and sample tuple data in the database;

and when the test statistic is greater than a preset threshold, adding the tuple data in the tuple set data into the acquired sample of the database.

2. The method of claim 1, wherein the determining test statistics after adding the tuple data by using the tuple set data and sample tuple data in the database comprises:

determining, in the database, a plurality of sample tuple data as the acquired sample according to a preset sampling probability;

and determining test statistics after adding the tuple data by using the acquired sample and the tuple set data.

3. The method of claim 2, wherein the determining test statistics after adding the tuple data by using the acquired sample and the tuple set data comprises:

dividing the acquired sample into a plurality of data packets of a fixed data volume;

updating a target packet with the tuple set data, and determining the test statistic with the plurality of data packets and the target packet.

4. The method of claim 3, wherein the updating the target packet with the tuple set data and determining the test statistic with the plurality of data packets and the target packet comprises:

sequentially adding the tuple data in the tuple set data into the target packet, and correspondingly deleting a target number of tuple data, wherein the target number is the number of tuple data in the tuple set data;

sequentially deleting the target number of sample tuple data from each data packet, and adding the target number of sample tuple data into the database;

and determining the test statistic using the plurality of data packets and the target packet.

5. The method of claim 4, wherein the adding the tuple data in the tuple set data into the acquired sample of the database when the test statistic is greater than a preset threshold comprises:

when the test statistic is larger than a preset threshold value, determining a newly distributed target data volume;

and when the target data volume is smaller than the fixed data volume or the target data volume is smaller than the number of sample tuple data of each page in the database, adding the tuple data corresponding to the target data volume into the acquired sample.

6. The method of claim 5, wherein the method further comprises:

and when the target data volume is larger than the fixed data volume and the target data volume is larger than the number of sample tuple data of each page in the database, determining corresponding sample tuple data from the database according to a preset rule, and adding the corresponding sample tuple data and the tuple data to the acquired sample.

7. The method according to claim 1, wherein the method further comprises:

when the test statistic is smaller than the preset threshold, adding the tuple set data into the database, and determining that the distribution of the database after the addition is unchanged;

and using the existing acquired sample as the acquired sample after the tuple set data is added, and performing an approximate query.

8. A database sample updating apparatus, the apparatus comprising:

an acquisition unit configured to acquire tuple set data for updating a database, where the tuple set data includes at least one tuple data;

a determining unit configured to determine test statistics after adding the tuple data using the tuple set data and sample tuple data in the database;

9. An electronic device, comprising:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and configured to be executed by the processor to implement the database sample updating method of any of claims 1-7.

10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the database sample updating method of any of claims 1-7.