CN116662401A - Data retrieval method, device, equipment and computer readable storage medium

Info

Publication number: CN116662401A
Application number: CN202210156810.0A
Authority: CN
Inventors: 吴先斌; 周领良; 林兆祥; 王建; 孔维; 周润耘
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-02-21
Filing date: 2022-02-21
Publication date: 2023-08-29

Abstract

The application provides a data retrieval method, a device, equipment and a computer readable storage medium, which can be applied to various scenes such as cloud technology, artificial intelligence, intelligent traffic, auxiliary driving and the like; the method comprises the following steps: responding to a data retrieval request carrying retrieval information, and acquiring source data and a sampling rate corresponding to the source data; wherein the source data is divided into at least two data units for data storage; determining unit sampling rates corresponding to at least two data units and data sampling rates corresponding to data in the data units based on the sampling rates corresponding to the source data; and carrying out data sampling on the source data by combining the unit sampling rate, the data sampling rate and the retrieval information to obtain the target data as a retrieval result of the data retrieval request. The application can effectively reduce the number of target data and improve the efficiency of data retrieval.

Description

Data retrieval method, device, equipment and computer readable storage medium

Technical Field

The present application relates to big data and data retrieval technology, and more particularly, to a data retrieval method, apparatus, electronic device, computer readable storage medium, and computer program product.

Background

With the rapid development of social informatization and networking, data has been increasing explosively. The demands of users for searching and statistical analysis of mass data are increasing.

In the related art, retrieval over mass data consumes a large amount of system resources because the data volume is too large, so responses to user retrieval requests time out. Alternatively, only part of the result data is returned within the response time, which affects the accuracy of the data retrieval process.

Disclosure of Invention

The embodiment of the application provides a data retrieval method, a device, electronic equipment, a computer readable storage medium and a computer program product, which can effectively reduce the number of target data and improve the data retrieval efficiency.

The technical scheme of the embodiment of the application is realized as follows:

the embodiment of the application provides a data retrieval method, which comprises the following steps:

responding to a data retrieval request carrying retrieval information, and acquiring source data and a sampling rate corresponding to the source data;

wherein the source data is divided into at least two data units for data storage;

determining unit sampling rates corresponding to the at least two data units and data sampling rates corresponding to data in the data units based on the sampling rates corresponding to the source data;

combining the unit sampling rate, the data sampling rate and the retrieval information, and performing data sampling on the source data to obtain target data;

and returning a retrieval result comprising the target data.

The present embodiment provides a data retrieval device, including:

the acquisition module is used for responding to a data retrieval request carrying retrieval information and acquiring source data and a sampling rate corresponding to the source data; wherein the source data is divided into at least two data units for data storage;

a determining module, configured to determine a unit sampling rate corresponding to the at least two data units and a data sampling rate corresponding to data in the data units based on the sampling rate corresponding to the source data;

the sampling module is used for carrying out data sampling on the source data by combining the unit sampling rate, the data sampling rate and the retrieval information to obtain target data;

and the return module is used for returning the retrieval result comprising the target data.

In the above scheme, the acquiring module is further configured to parse the data retrieval request to determine a sampling rate mode for the source data;

and when the sampling rate mode is a specified sampling rate mode, taking the sampling rate carried in the data retrieval request as the sampling rate corresponding to the source data;

when the sampling rate mode is an intelligent sampling rate mode, obtaining the estimated data quantity corresponding to the source data and a processing quantity threshold value when the source data is processed;

and determining the sampling rate corresponding to the source data by combining the processing amount threshold and the estimated data amount.

In the above scheme, when the data unit is a data slice, the determining module is further configured to obtain a slice number of the data slice corresponding to the source data;

and combining the sampling rate corresponding to the source data and the number of fragments, determining the fragment sampling rate corresponding to the at least two data fragments as the unit sampling rate, and taking the ratio of the sampling rate corresponding to the source data to the fragment sampling rate as the data sampling rate corresponding to the data in the data fragments.

In the above scheme, when the data unit is a data block, the source data is divided into at least two data slices, each data slice is divided into at least two data blocks, and the determining module is further configured to obtain a slice number of the data slices corresponding to the source data, and determine a slice sampling rate corresponding to the at least two data slices by combining a sampling rate corresponding to the source data and the slice number;

Acquiring sampling rate thresholds for the at least two data blocks;

determining an intermediate sampling rate for the at least two data blocks in combination with the sampling rate corresponding to the source data and the fragmentation sampling rate;

and determining unit sampling rates corresponding to the at least two data units and data sampling rates corresponding to data in the data units based on the sampling rate threshold and the intermediate sampling rate.

In the above scheme, the determining module is further configured to determine a product of the sampling rate corresponding to the source data and the number of slices;

performing upward rounding treatment on the product to obtain the number of sampling fragments;

and determining the ratio of the number of the sampling slices to the number of the slices, and taking the ratio as the slicing sampling rate corresponding to the at least two data slices.
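The rounding scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the function names `slice_sampling_rate` and `data_sampling_rate` and the input validation are assumptions introduced here. It shows the product of the sampling rate and the slice count being rounded up to a number of sampled slices, and the data sampling rate being taken as the ratio of the overall sampling rate to the resulting slice sampling rate.

```python
import math
from typing import Tuple


def slice_sampling_rate(sampling_rate: float, slice_count: int) -> Tuple[int, float]:
    """Return (number of sampled slices, slice sampling rate).

    Hypothetical helper: the product of the sampling rate and the slice
    count is rounded up, then divided by the slice count.
    """
    if not 0 < sampling_rate <= 1:
        raise ValueError("sampling_rate must be in (0, 1]")
    if slice_count < 1:
        raise ValueError("slice_count must be at least 1")
    sampled_slices = math.ceil(sampling_rate * slice_count)
    return sampled_slices, sampled_slices / slice_count


def data_sampling_rate(sampling_rate: float, slice_rate: float) -> float:
    """Ratio of the overall sampling rate to the slice sampling rate."""
    return sampling_rate / slice_rate


# Example with assumed values: 25% overall sampling over 10 slices.
n_sampled, s_rate = slice_sampling_rate(0.25, 10)
d_rate = data_sampling_rate(0.25, s_rate)
```

Because of the round-up, the slice sampling rate is never lower than the overall sampling rate, so the resulting data sampling rate never exceeds 1.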

In the above scheme, the sampling module is further configured to sample the at least two data units based on the unit sampling rate to obtain at least one target data unit;

based on the data sampling rate, respectively sampling the data in each target data unit to obtain sampling data;

and carrying out data retrieval in the sampling data based on the retrieval information to obtain the target data.
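The three stages above (sample the units, sample the data inside each target unit, then retrieve within the sampled data) can be sketched as follows. This is a sketch under stated assumptions: the source does not say how units or records are chosen at this point, so uniform random selection is assumed; the function name `sample_and_retrieve`, the dictionary layout of `units`, and the `matches` predicate standing in for the retrieval information are all hypothetical.

```python
import random
from typing import Callable, Dict, List


def sample_and_retrieve(
    units: Dict[str, List[dict]],
    unit_rate: float,
    data_rate: float,
    matches: Callable[[dict], bool],
    seed: int = 0,
) -> List[dict]:
    """Two-stage sampling followed by retrieval (illustrative only)."""
    rng = random.Random(seed)

    # Stage 1: sample the data units; at least one target unit is kept.
    unit_ids = sorted(units)
    n_units = max(1, round(unit_rate * len(unit_ids)))
    target_units = rng.sample(unit_ids, n_units)

    # Stage 2: sample the data inside each target unit.
    sampled: List[dict] = []
    for unit_id in target_units:
        records = units[unit_id]
        n_records = round(data_rate * len(records))
        sampled.extend(rng.sample(records, n_records))

    # Stage 3: run the retrieval condition over the sampled data only.
    return [record for record in sampled if matches(record)]


# Example with assumed data: 4 units of 100 records each.
units = {
    f"slice-{s}": [{"id": s * 100 + i} for i in range(100)] for s in range(4)
}
result = sample_and_retrieve(units, 0.5, 0.5, lambda r: r["id"] % 2 == 0)
```

Only the target units are read in the second stage, which is where the reduction in read load comes from in this sketch.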

In the above scheme, the sampling module is further configured to determine a data sampling operator corresponding to data in the data unit based on the data sampling rate;

sampling the data in the data unit based on the data sampling operator to obtain sampling data;

wherein the ratio of the data volume of the sampled data to the data volume of the data in the data unit is equal to the data sampling rate.

In the above scheme, when the data sampling operator is a modulo operation, the sampling module is further configured to obtain an index value corresponding to each data in the data unit;

taking the modulus of each index value to obtain a modulus value corresponding to each data;

and when the modulus value is matched with a preset modulus value, taking the data indicated by the corresponding index value as sampling data.
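A minimal sketch of modulo-based sampling follows. Several details are assumptions made for illustration: the position of a record within the unit is used as its index value, the modulus defaults to 100, and the "preset modulus values" are taken to be the remainders below `data_rate * modulus`, so that the kept fraction approximates the data sampling rate. The name `modulo_sample` is hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def modulo_sample(records: Sequence[T], data_rate: float, modulus: int = 100) -> List[T]:
    """Keep records whose index modulo `modulus` matches a preset remainder."""
    # Assumed preset remainders: 0, 1, ..., keep_below - 1.
    keep_below = round(data_rate * modulus)
    return [
        record
        for index, record in enumerate(records)
        if index % modulus < keep_below
    ]


# Example with assumed values: keep about 25% of 1000 records.
sampled = modulo_sample(list(range(1000)), 0.25)
```

The selection is deterministic for a given index, so repeated queries over the same unit sample the same records.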

In the above scheme, when the data sampling operator is hash operation, the sampling module is further configured to obtain an index value corresponding to each data in the data unit;

hashing the index value to obtain a hash value corresponding to each data;

and when the hash value does not reach the hash value threshold, taking the data indicated by the corresponding index value as sampling data.
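Hash-based sampling can be sketched in the same way. The source does not name a hash function or say how the threshold is derived, so this example assumes SHA-256 truncated to 64 bits and a threshold equal to the data sampling rate times the size of the hash space; the record position again stands in for the index value. The names `hash_value` and `hash_sample` are hypothetical.

```python
import hashlib
from typing import List, Sequence, TypeVar

T = TypeVar("T")

HASH_SPACE = 2 ** 64  # size of the truncated hash space (assumption)


def hash_value(index: int) -> int:
    """Map an index value to a stable 64-bit integer via SHA-256."""
    digest = hashlib.sha256(str(index).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_sample(records: Sequence[T], data_rate: float) -> List[T]:
    """Keep records whose hash value does not reach the threshold."""
    threshold = int(data_rate * HASH_SPACE)
    return [
        record
        for index, record in enumerate(records)
        if hash_value(index) < threshold
    ]


# Example with assumed values: keep roughly half of 10,000 records.
sampled = hash_sample(list(range(10_000)), 0.5)
```

Unlike the modulo variant, the kept fraction here is only approximately equal to the data sampling rate, but the selected records are spread pseudo-randomly across the index range.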

In the above scheme, the search information includes a search operator and a query statement, and the sampling module is further configured to combine the unit sampling rate, the data sampling rate and the search operator to perform data sampling on the source data to obtain initial sampling data;

and based on the query statement, carrying out data retrieval in the initial sampling data to obtain the target data.

In the above scheme, when the search result further includes a statistical result corresponding to the target data, the return module is further configured to obtain a data statistical manner corresponding to the target data;

and carrying out statistical analysis on the target data based on the data statistical mode to obtain the statistical result.

An embodiment of the present application provides an electronic device, including:

a memory for storing executable instructions;

and the processor is used for realizing the data retrieval method provided by the embodiment of the application when executing the executable instructions stored in the memory.

The embodiment of the application provides a computer readable storage medium which stores executable instructions for causing a processor to execute, thereby realizing the data retrieval method provided by the embodiment of the application.

The embodiment of the application provides a computer program product, which comprises a computer program or instructions for realizing the data retrieval method provided by the embodiment of the application when the computer program or instructions are executed by a processor.

The embodiment of the application has the following beneficial effects:

according to the embodiment of the application, the received data retrieval request is parsed to obtain the retrieval information and the sampling rate for the source data, so that the unit sampling rate for the at least two data units and the data sampling rate for the data in the data units can be determined based on that sampling rate; the source data is then sampled by combining the unit sampling rate, the data sampling rate and the retrieval information to obtain target data, so that the amount of target data can be greatly reduced, the occupation of computing resources and the response time are reduced, and the data retrieval efficiency is improved.

Drawings

FIG. 1 is a schematic diagram of a data retrieval system according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an electronic device 500 implementing the data retrieval method provided by an embodiment of the present application;

FIG. 3 is a flow chart of a method for retrieving data according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a visual interface of a data retrieval request provided by an embodiment of the present application;

FIG. 5 is a flowchart of a method for acquiring a sampling rate in an intelligent sampling rate mode according to an embodiment of the present application;

FIG. 6 is a flowchart of a method for determining a slice sampling rate according to an embodiment of the present application;

FIG. 7 is a flowchart of a method for determining a unit sampling rate and a data sampling rate according to an embodiment of the present application;

FIG. 8 is a flow chart of a data sampling method according to an embodiment of the present application;

FIG. 9 is a flowchart of a data sampling method based on modulo arithmetic according to an embodiment of the application;

FIG. 10 is a flowchart of a data sampling method based on hash operation according to an embodiment of the present application;

FIG. 11 is a flowchart of a data sampling method according to an embodiment of the present application;

FIG. 12 is a flow chart of a statistical analysis of data determined by a data-based sampling method provided by an embodiment of the present application;

FIG. 13A is a schematic diagram of data slicing sampling provided by an embodiment of the present application;

FIG. 13B is a schematic diagram of a fragment sampling rate according to an embodiment of the present application;

FIG. 14 is a block sample rate correction schematic diagram provided by an embodiment of the present application;

FIG. 15 is a schematic diagram of other sample rates provided by embodiments of the present application;

FIG. 16 is an analysis chart of experimental results provided by an embodiment of the present application.

Detailed Description

The present application will be further described in detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present application more apparent, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.

Where descriptions such as "first/second" appear in this application, the following applies: the terms "first/second/third" merely distinguish similar objects and do not represent a particular ordering of the objects. It is to be understood that "first/second/third" may be interchanged in a particular order or precedence where permitted, so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.

Before the embodiments of the present application are described in further detail, the terms involved in the embodiments are explained; the following explanations apply to these terms.

1) Distributed storage system: data is stored in a distributed manner on a plurality of independent devices. A traditional network storage system uses a centralized storage server to store all data; the storage server becomes the bottleneck of system performance as well as the focal point of reliability and security concerns, and cannot meet the requirements of large-scale storage applications. A distributed network storage system adopts a scalable architecture, uses a plurality of storage servers to share the storage load, and uses location servers to locate the stored information, thereby improving the reliability, availability and access efficiency of the system while remaining easy to expand.

2) Search formula: a search formula expresses the search intention of the searcher; it is an instruction issued by the searcher to the computer and is also a language of human-machine dialogue. A search formula is typically composed of search terms, logical operators, wildcards, and the like.

3) Data slicing: the data in a distributed database may be replicated across the physical databases at the various network sites.

Based on the above explanation of the terms involved in the embodiments of the present application, the data retrieval system provided by the embodiments of the present application is described below. Referring to FIG. 1, FIG. 1 is a schematic architecture diagram of a data retrieval system according to an embodiment of the present application. To support an exemplary application, terminals (terminal 400-1 and terminal 400-2 are shown as examples) are connected to a server 200 through a network 300. The network 300 may be a wide area network, a local area network, or a combination of the two, and uses wireless or wired links for data transmission.

A data retrieval client is installed and runs on the terminals (such as the terminal 400-1 and the terminal 400-2) and provides a graphical operation interface through which a user can input retrieval information for retrieving the source data. The terminal receives a retrieval instruction for the source data and sends a data retrieval request carrying the retrieval information to the server 200.

The server 200 is configured to obtain source data and a sampling rate corresponding to the source data in response to a data retrieval request carrying retrieval information; wherein the source data is divided into at least two data units for data storage; determining unit sampling rates corresponding to at least two data units and data sampling rates corresponding to data in the data units based on the sampling rates corresponding to the source data; further, combining the unit sampling rate, the data sampling rate and the retrieval information, and performing data sampling on the source data to obtain target data; and finally, returning the retrieval result including the target data to the terminals (such as the terminal 400-1 and the terminal 400-2).

In some embodiments, the server 200 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), and big data and artificial intelligence platforms. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.

The embodiment of the application can be realized by means of cloud technology (Cloud Technology), which refers to a hosting technology that unifies a series of resources such as hardware, software and networks in a wide area network or a local area network to realize the computation, storage, processing and sharing of data.

Cloud technology is a generic term for the network technology, information technology, integration technology, management platform technology, application technology and the like that are applied based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each item may have its own identification mark in the future, which needs to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data need strong backend system support, which can be realized through cloud computing.

Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device 500 according to a data retrieval method according to an embodiment of the present application. In practical applications, the electronic device 500 may be a server or a terminal shown in fig. 1, and the electronic device 500 is taken as an example of a domain name resolution node shown in fig. 1, to describe an electronic device implementing a data retrieval method according to an embodiment of the present application, where the electronic device 500 provided in the embodiment of the present application includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in electronic device 500 are coupled together by bus system 540. It is appreciated that the bus system 540 is used to enable connected communications between these components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to the data bus. The various buses are labeled as bus system 540 in fig. 2 for clarity of illustration.

The processor 510 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP, Digital Signal Processor), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component, where the general-purpose processor may be a microprocessor or any conventional processor.

The user interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 550 may optionally include one or more storage devices physically located remote from processor 510.

Memory 550 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM, Read Only Memory) and the volatile memory may be a random access memory (RAM, Random Access Memory). The memory 550 described in embodiments of the present application is intended to comprise any suitable type of memory.

In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.

An operating system 551 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;

a network communication module 552 for reaching other computing devices via one or more (wired or wireless) network interfaces 520, where exemplary network interfaces 520 include Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;

a presentation module 553 for enabling presentation of information (e.g., a user interface for operating a peripheral device and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;

the input processing module 554 is configured to detect one or more user inputs or interactions from one of the one or more input devices 532 and translate the detected inputs or interactions.

In some embodiments, the data retrieval device provided in the embodiments of the present application may be implemented in software. FIG. 2 shows a data retrieval device 555 stored in the memory 550, which may be software in the form of a program or a plug-in and includes the following software modules: the acquisition module 5551, the determination module 5552, the sampling module 5553 and the return module 5554. These modules are logical and thus may be arbitrarily combined or further split according to the functions implemented. The functions of each module are described below.

In other embodiments, the data retrieval device provided by the embodiments of the present application may be implemented by combining software and hardware. By way of example, the data retrieval device may be a processor in the form of a hardware decoding processor that is programmed to perform the data retrieval method provided by the embodiments of the present application; for example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASIC, Application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, Programmable Logic Device), complex programmable logic devices (CPLD, Complex Programmable Logic Device), field programmable gate arrays (FPGA, Field-Programmable Gate Array), or other electronic components.

Based on the above description of the data retrieval system and the electronic device provided by the embodiment of the present application, the data retrieval method provided by the embodiment of the present application is described below. In some embodiments, the method for retrieving data provided in the embodiments of the present application may be implemented by a server or a terminal alone or in conjunction with the server and the terminal, and the method for retrieving data provided in the embodiments of the present application is described below with reference to the implementation of the server and the terminal.

Referring to fig. 3, fig. 3 is a flow chart of a data retrieval method provided by an embodiment of the present application, where the data retrieval method provided by the embodiment of the present application includes:

in step 101, a server responds to a data retrieval request carrying retrieval information to obtain source data and a sampling rate corresponding to the source data.

In practical implementation, the server may acquire the sampling rate X for the source data in one of two modes, namely a specified sampling rate mode and an intelligent sampling rate mode; the specified sampling rate mode yields a specified sampling rate, and the intelligent sampling rate mode yields an intelligent sampling rate. Here, the source data is divided into at least two data units for data storage.

In some embodiments, the server may obtain the sampling rate corresponding to the source data by: the server analyzes the received data retrieval request to determine a sampling rate pattern for the source data; when the sampling rate mode is the appointed sampling rate mode, the sampling rate carried in the data retrieval request is used as the sampling rate corresponding to the source data.

In practical implementation, after receiving a data retrieval request for source data, a server analyzes the data retrieval request to obtain a corresponding analysis result, and when a specific numerical value of a sampling rate is included in the analysis result, the sampling rate mode for the data is represented as a specified sampling rate mode, and the sampling rate in the analysis result is the specified sampling rate.

Referring to FIG. 4, FIG. 4 is a schematic diagram of a visual interface of a data retrieval request according to an embodiment of the present application. The figure shows an input interface through which a user performs data retrieval. Retrieval information is entered in the input box marked with the number 1. When the sampling rate mode for the source data is the specified sampling rate mode, a specified sampling rate can be entered in the control marked with the number 2; when the sampling rate mode for the source data is the intelligent sampling rate mode, the control marked with the number 2 displays "intelligent" and does not accept user input. Clicking the "execute" control then generates a data retrieval request. After receiving the data retrieval request, the server parses the corresponding information, executes the corresponding data retrieval operation, and returns the retrieval result, which is displayed in the area marked with the number 3 in the figure.

In some embodiments, referring to fig. 5, fig. 5 is a flowchart of a method for acquiring a sampling rate in an intelligent sampling rate mode according to an embodiment of the present application, and the steps shown in fig. 5 are described.

In step 1011, the server parses the data retrieval request to determine the sample rate pattern for the source data.

In actual implementation, after receiving a data retrieval request for source data, the server parses the data retrieval request to obtain a corresponding parsing result; when the parsing result does not include a specific numerical value of the sampling rate, the sampling rate mode for the source data is the intelligent sampling rate mode, and the sampling rate to be used is the intelligent sampling rate determined by the server.

In step 1012, when the sampling rate mode is the intelligent sampling rate mode, the estimated data amount corresponding to the source data and the throughput threshold value when the source data is processed are obtained.

In actual implementation, when the sampling rate mode for the source data is the intelligent sampling rate mode, the server first obtains the throughput threshold $D_0$ for processing the source data (i.e., the amount of data that the system can process within a reasonable amount of time), where $D_0$ is a positive integer, then determines the estimated data amount $D$ corresponding to the source data, and determines the sampling rate $X$ according to the intelligent sampling rate calculation formula:

$$X = \min\left(1, \frac{D_0}{D} \cdot c\right) \tag{1}$$

where $c$ is a reserved quantity parameter, and $c \le 1$.

Illustratively, within a time frame acceptable to the user (e.g., 30 seconds), the throughput threshold of the server is $D_0$ = 10 million and the estimated data amount is $D$ = 20 million, so the sampling rate $X$ can be set to at most 0.5. At the sampling rate $X = 0.5$, the computing capacity of the server for sampling is fully used. Because $D_0$ and $D$ are both estimates and may contain errors, the reserved quantity parameter $c$ can be set to ensure the accuracy of data sampling, so that the server maintains a balanced data sampling state.

Step 1013, determining a sampling rate corresponding to the source data by combining the throughput threshold and the estimated data amount.

In practical implementation, the server may determine the sampling rate X according to the above-mentioned intelligent sampling rate calculation formula, in combination with the throughput threshold and the estimated data amount.
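The mode selection and formula (1) can be sketched together as follows. This is an illustrative sketch, not the applicant's code: the function name `resolve_sampling_rate`, the use of `None` to signal that the request carries no sampling rate, and the fallback to full sampling when the estimated data amount is not positive are assumptions introduced here.

```python
from typing import Optional


def resolve_sampling_rate(
    requested_rate: Optional[float],
    throughput_threshold: int,
    estimated_amount: int,
    reserve: float = 1.0,
) -> float:
    """Return the sampling rate X for the source data.

    requested_rate       -- rate carried in the request, or None (assumed signal
                            for the intelligent sampling rate mode)
    throughput_threshold -- D0, amount of data processable in acceptable time
    estimated_amount     -- D, estimated amount of source data
    reserve              -- c, reserved quantity parameter, c <= 1
    """
    if requested_rate is not None:
        # Specified sampling rate mode: use the rate from the request.
        return requested_rate
    if estimated_amount <= 0:
        # Assumption: nothing to thin out, so sample everything.
        return 1.0
    # Intelligent sampling rate mode: X = min(1, D0 / D * c).
    return min(1.0, throughput_threshold / estimated_amount * reserve)


# Values from the example above: D0 = 10 million, D = 20 million.
rate = resolve_sampling_rate(None, 10_000_000, 20_000_000)
```

With `reserve` below 1, the resulting rate leaves headroom for estimation errors in the two amounts.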

In step 102, a unit sampling rate corresponding to at least two data units and a data sampling rate corresponding to data within the data units are determined based on the sampling rate corresponding to the source data.

In practical implementations, in a distributed storage system, source data is often divided into at least two data units for decentralized storage, and multiple data slices may be scattered across multiple devices. Because the data in a data unit is managed by that data unit, when the source data is sampled at the determined sampling rate X, the at least two data units corresponding to the source data can be sampled first, and the data in the sampled data units can be sampled afterwards. That is, the server first samples the at least two data units corresponding to the source data based on the unit sampling rate to determine the sampled data units. Then, the server samples the data in the sampled data units based on the data sampling rate to obtain target data. In this way, the reading pressure is spread across different data unit nodes, and IO resource consumption is reduced.

Describing the unit sampling rate, in some embodiments, when the data unit is a data slice, the server may determine the unit sampling rate by: the server acquires the number of data fragments corresponding to the source data; determining the slicing sampling rate corresponding to at least two data slices as a unit sampling rate by combining the sampling rate corresponding to the source data and the number of the slices; after determining the slicing sampling rate, the server determines the ratio of the sampling rate corresponding to the source data to the slicing sampling rate, and takes the obtained ratio as the data sampling rate of the data in the corresponding data slicing.

In practical implementations, the data unit for storing the source data may be a data slice, that is, the server manages the data by means of data slices (shards). Sharding ensures that, within the same time period, the data is uniformly distributed across the different data slices. The server samples the data slices storing the source data based on the slice sampling rate. The slice sampling rate is expressed as ns/S, where S is the number of data slices storing the source data (S ≥ 1) and ns is the number of sampled data slices (1 ≤ ns ≤ S). It should be noted that the following relationship exists between the sampling rate and the slice sampling rate: sampling rate X = slice sampling rate × other sampling rate, where the other sampling rate is denoted by P, i.e., X = ns/S × P. When the source data is divided only into data slices for storage, P characterizes the data sampling rate for the data within a data slice.

In some embodiments, referring to fig. 6, fig. 6 is a flowchart of a method for determining a slice sampling rate according to an embodiment of the present application, and a method shown in fig. 6 is combined to describe a manner of determining a slice sampling rate for data slices.

In step 201, the server determines the product of the sampling rate corresponding to the source data and the number of slices.

In actual implementation, since the slice sampling rate for data slices is a coarse-grained sampling rate, the values of the slice sampling rates that can be supported are 1/S, 2/S, … …, (S-1)/S, 1. Namely, ns has values of 1, 2, … …, (S-1) and S. After the server analyzes the data retrieval request, the sampling rate X is obtained, and in order to determine the fragmentation sampling rate, the server may determine the number of fragments of the sampled data fragments, i.e. the size of ns.

Step 202, performing upward rounding processing on the product to obtain the number of sampling fragments.

In practical implementations, the server may determine the number of sampled data slices within the same time period based on the following formula: ns = ceil(X × S). That is, the server first determines the sampling rate X and the number S of data slices corresponding to the source data, multiplies X by S, and rounds the product up with the ceil function to obtain the number ns of sampled slices.

For example, with the sampling rate for the source data set to X = 30% and the number of data slices S = 5, ns = ceil(30% × 5) = ceil(1.5) = 2 according to ns = ceil(X × S). That is, the number of sampled slices is 2, and the server samples 2 data slices from the 5 data slices storing the source data.

Step 203, determining the ratio of the number of the sampled slices to the number of the slices, and taking the ratio as the slice sampling rate of at least two corresponding data slices.

In the above example, the server samples 2 data slices from 5 data slices storing the source data, and determines that the ratio of the number of sampled slices to the number of slices is 2/5, and takes 2/5=40% as the slice sampling rate for the data slices.
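Steps 201 to 203 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation of the embodiment: the function name slice_sampling is hypothetical, and exact fractions are used (an assumption made here) so that the round-up is not disturbed by floating-point error. The third return value is the data sampling rate P = X / (ns/S) that applies when the source data is divided only into data slices.

```python
from fractions import Fraction
from math import ceil


def slice_sampling(sampling_rate: Fraction, num_slices: int):
    """Return (ns, slice sampling rate ns/S, data sampling rate P)."""
    if num_slices < 1 or not 0 < sampling_rate <= 1:
        raise ValueError("expected S >= 1 and 0 < X <= 1")
    # Steps 201 and 202: ns = ceil(X * S); Fraction keeps the product exact.
    ns = ceil(sampling_rate * num_slices)
    # Step 203: slice sampling rate = ns / S.
    slice_rate = Fraction(ns, num_slices)
    # Rate left for the data inside each sampled slice: P = X / (ns / S).
    data_rate = sampling_rate / slice_rate
    return ns, slice_rate, data_rate


ns, slice_rate, data_rate = slice_sampling(Fraction(30, 100), 5)
print(ns, slice_rate, data_rate)  # 2 2/5 3/4
```

With X = 30% and S = 5, the slice sampling rate is rounded up to 40%, so the data inside the two sampled slices is sampled at 75% to keep the overall rate at 30%.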

In some embodiments, referring to fig. 7, fig. 7 is a flowchart of a method for determining a unit sampling rate and a data sampling rate according to an embodiment of the present application. In this embodiment, the data unit is a data block: the source data is still divided into at least two data slices for storage, each data slice is further divided into at least two data blocks, and the source data is ultimately stored in the data blocks. Based on fig. 3, the implementation of step 102 is described in connection with the steps shown in fig. 7.

Step 1021, the server acquires the number of data slices corresponding to the source data, and determines the slice sampling rate corresponding to the at least two data slices by combining the sampling rate corresponding to the source data and the number of slices.

It should be noted that, in this embodiment, the data unit is a data block, the source data is divided into at least two data slices, and each data slice is divided into at least two data blocks.

In actual practice, in a distributed data storage system, the server may also manage the source data in the form of blocks, including encoding and compressing the data within each block, and disk I/O itself is also performed block by block. When the source data is divided into at least two data slices and each data slice comprises at least two data blocks, the block sampling rate for a data block and the data sampling rate for data within a data block may be determined from the sampling rate X and the slice sampling rate ns/S. That is, the sampling rate X, the slice sampling rate ns/S, the block sampling rate P1, and the intra-block data sampling rate P2 satisfy: X = ns/S × P1 × P2. Note that when ns/S = 1, that is, when the slice sampling rate is 1, X = P1 × P2. The case ns/S = 1 covers two situations: either the number of data slices of the source data is S = 1, so that the single data slice must be accessed; or S > 1 and every data slice is accessed.

In practical implementation, when the number of data slices of the source data is S > 1 and each data slice is divided into at least two data blocks, sampling rate X = slice sampling rate ns/S × block sampling rate P1 × intra-block data sampling rate P2. The slice sampling rate ns/S is determined based on the slice sampling rate determination method shown in fig. 7, and the product P1 × P2 is then determined from X = ns/S × P1 × P2.

When the server accesses a piece of data in a data block, it also accesses other data near that piece of data, and may even read the entire block. (For instance, the goal is to access one piece of data, but 100 pieces are actually read in order to reach it; this is I/O amplification.) In a sampling scenario the sampled data is accessed in a uniformly distributed manner, so the actual I/O amplification is pronounced and degrades the efficiency of sampled reads. Based on this, in practical implementation, the other sampling rate P may be split into two parts, P = P1 × P2. P1 is the block sampling rate (the block-level sampling rate): if a block is not hit by sampling, none of the data in that block is sampled. P2 is the sampling rate for the data within a data block (the intra-block sampling rate): once a block is selected for sampling, each document within it is checked against P2 to decide whether it is sampled. The block sampling rate P1 and the data sampling rate P2 can therefore be adjusted according to the current I/O count and I/O amplification of the server, that is, the values of P1 and P2 are determined by a preset adjustment policy. Whatever the adjustment policy, P1, P2, ns/S, and the sampling rate X must satisfy X = ns/S × P1 × P2.

Step 1022, obtaining a sampling rate threshold for at least two data blocks.

In practical implementation, the sampling rate threshold value for at least two data blocks may be preset, where the setting of the sampling rate threshold value is related to the number of IOs of the system and the I/O amplification condition during data access.

Step 1023, determining an intermediate sampling rate for at least two data blocks in combination with the sampling rate corresponding to the source data and the fragmentation sampling rate.

In practical implementation, since the sampling rate X = ns/S × P1 × P2, the server can determine the value of P1 × P2 from the sampling rate X and the slice sampling rate ns/S. This value may be regarded as the intermediate sampling rate for the data blocks, i.e., it is the product of the block sampling rate for the at least two data blocks within a data slice and the intra-block data sampling rate.

For example, with the sampling rate X = 0.002 and the slice sampling rate ns/S = 1/10, i.e., one data slice is sampled out of 10 data slices, the intermediate sampling rate is P1 × P2 = 0.002 / (1/10) = 0.02.

Step 1024, determining a unit sampling rate corresponding to at least two data units and a data sampling rate corresponding to data within the data units based on the sampling rate threshold and the intermediate sampling rate.

In actual implementation, the server determines the block sampling rate P1 and the intra-block data sampling rate P2 according to the sampling rate threshold and the intermediate sampling rate determined in step 1023.

In some embodiments, the server may determine the unit sampling rate and the data sampling rate of the data within the corresponding data unit by: when the intermediate sampling rate reaches a sampling rate threshold, the server determines that the value of the block sampling rate corresponding to at least two data blocks is equal to 1, and takes the intermediate sampling rate as the data sampling rate of the data in the corresponding data blocks; when the intermediate sampling rate does not reach the sampling rate threshold value, determining the ratio of the intermediate sampling rate to the sampling rate threshold value; and taking the ratio as the block sampling rate corresponding to at least two data blocks, and taking the sampling rate threshold as the data sampling rate corresponding to the data in the data blocks.

In practical implementation, the server compares the intermediate sampling rate with the sampling rate threshold. When the intermediate sampling rate reaches the sampling rate threshold, this indicates that the data distribution in the data block is concentrated, and every data block in the sampled data slice can be sampled; that is, the block sampling rate is P1 = 1, and the intra-block data sampling rate is set directly to the intermediate sampling rate, P2 = X/(ns/S). When the intermediate sampling rate does not reach the sampling rate threshold, this indicates that the data distribution in the data block is more dispersed; to turn dispersed access into concentrated access and thereby reduce the I/O amplification effect, the sampling rate threshold is used directly as the intra-block data sampling rate P2, and the ratio of the intermediate sampling rate to the sampling rate threshold is used as the block sampling rate P1.

For example, with the sampling rate X = 0.002 and the slice sampling rate ns/S = 1/10, that is, one data slice is sampled from 10 data slices in the same time period, the intermediate sampling rate is P = P1 × P2 = 0.02. If every data block in the data slice is treated as a sampled data block (P1 = 1), 2% of the data is sampled from each data block; when the number of data blocks is large, server access is scattered and I/O amplification is pronounced. To make access more concentrated, a data sampling rate threshold of 20% is set, i.e., at least 20% of the data is sampled in each sampled data block. Because the current intermediate sampling rate P = 0.02 = 2% is smaller than the sampling rate threshold, P2 = 20%, i.e., 20% of the data is sampled from each sampled data block, and the block sampling rate is P1 = 2% / 20% = 10%, i.e., 1/10 of the data blocks in the sampled data slice are accessed. Sampling at the data-block level thus turns part of the scattered access into concentrated access, greatly reducing the I/O amplification effect.
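The rule of steps 1023 and 1024 can be sketched as follows. This is an illustrative sketch only: the function name split_block_rates is hypothetical, and exact fractions are an assumption made here so that the identity X = ns/S × P1 × P2 can be checked without rounding error.

```python
from fractions import Fraction


def split_block_rates(sampling_rate, slice_rate, threshold):
    """Return (P1, P2) such that sampling_rate == slice_rate * P1 * P2."""
    # Step 1023: intermediate sampling rate P1 * P2 = X / (ns / S).
    intermediate = sampling_rate / slice_rate
    if intermediate >= threshold:
        # Threshold reached: sample every block, P2 carries the whole rate.
        return Fraction(1), intermediate
    # Threshold not reached: read fewer blocks, each at the threshold rate.
    return intermediate / threshold, threshold


p1, p2 = split_block_rates(Fraction(2, 1000), Fraction(1, 10), Fraction(20, 100))
print(p1, p2)  # 1/10 1/5
```

With X = 0.002, ns/S = 1/10, and a threshold of 20%, the sketch returns P1 = 10% and P2 = 20%, matching the example above.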

In step 103, the source data is data-sampled in combination with the unit sampling rate, the data sampling rate and the search information, so as to obtain the target data as the search result of the data search request.

In some embodiments, referring to fig. 8, fig. 8 is a flowchart of a data sampling method according to an embodiment of the present application, and the steps shown in fig. 8 are described.

In step 1031, the server samples at least two data units based on the unit sampling rate to obtain at least one target data unit.

In actual implementation, the server samples the data units according to the determined unit sampling rate, determining at least one data unit to be accessed, i.e. at least one target data unit. When the data unit is a data slice, the unit sampling rate is the slice sampling rate, and at this time, the server samples at least two data slices of the storage source data according to the slice sampling rate to obtain a target data slice (i.e. the data slice to be accessed by the server). When the data unit is a data block and the system for storing the source data does not support data slicing, the unit sampling rate is the block sampling rate; when the data fragments are multiple and each data fragment is divided into at least two data blocks, the unit sampling rate comprises a fragment sampling rate and a block sampling rate, at this time, the server samples the at least two data fragments of the stored source data according to the fragment sampling rate to obtain at least one target data fragment, and then samples the data blocks in the target data fragment according to the block sampling rate to obtain at least one target data block (i.e. the data block to be accessed in the target data fragment).

By way of example, the number of data slices is set to 10, the slice sampling rate is set to 0.1, that is, one data slice (target data slice) is determined from among 10 data slices to access, the target data slice includes 20 data blocks therein, and according to the block sampling rate of 0.1, the server samples 2 data blocks from among the 20 data blocks as target data blocks to be accessed by the server, that is, the server reads data from the two target data blocks.
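The two-level selection in this example can be sketched as below. The description does not state how the target data units are chosen, so uniform random selection without replacement, the helper name pick_units, the slice and block labels, and the fixed seed are all assumptions made for illustration.

```python
import random


def pick_units(units, rate, rng):
    """Pick round(rate * len(units)) units at random, but at least one."""
    count = max(1, round(rate * len(units)))
    return rng.sample(units, count)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
# 10 hypothetical data slices, each holding 20 hypothetical data blocks.
slices = {f"slice-{i}": [f"slice-{i}/block-{j}" for j in range(20)]
          for i in range(10)}

# Slice sampling rate 0.1: one target data slice out of 10.
target_slices = pick_units(list(slices), 0.1, rng)
# Block sampling rate 0.1: two target data blocks out of the slice's 20.
target_blocks = [block for s in target_slices
                 for block in pick_units(slices[s], 0.1, rng)]
print(target_slices, target_blocks)
```

Only the two selected blocks are read afterwards; the other 198 blocks are never touched.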

Step 1032, based on the data sampling rate, samples the data in each target data unit, respectively, to obtain sampled data.

In practical implementations, after the server determines the target data unit (data slice or data block) to be accessed, the data in the data unit may be sampled according to the data sampling rate to determine sampled data.

In some embodiments, the server may sample the data within the data unit by: the server determines a data sampling operator corresponding to the data in the data unit based on the data sampling rate; based on a data sampling operator, sampling data in the data unit to obtain sampling data; wherein the ratio of the data amount of the sampled data to the data amount of the data in the data unit is equal to the data sampling rate.

In practical implementations, after determining the data sampling rate for the data units, the server may determine which data in the data units may be used as sampled data according to the data sampling operator, that is, the data sampling operator may be understood as a sampling function or a filtering condition for sampling the data in the data units. And finally, the ratio of the data volume of the sampled data obtained by the server through sampling by the data sampling operator to the total data volume in the current target data unit is equal to the data sampling rate. Data sampling operators include, but are not limited to: modulo operation (mod operation), hash operation (hash operation), random probability selection, etc.

In some embodiments, the server may sample data from within the target data unit by random probability selection, provided that the ratio of the amount of sampled data to the amount of data within the target data unit equals the data sampling rate.

Illustratively, if the data sampling rate is set to 20%, the server may randomly sample 20% of the data from within the target data unit.

In some embodiments, referring to fig. 9, fig. 9 is a flowchart of a data sampling method based on modulo arithmetic according to an embodiment of the application. The data sampling process when the data sampling operator is a modulo operation is described in connection with the steps shown in fig. 9.

In step 301a, the server obtains index values corresponding to each data in the data unit.

In practical implementations, when the data sampling operator is a modulo operation, the server obtains an index value for indicating the data. The index value may be a data identifier or an address pointer of the data store, provided that the index value is of a numeric type.

In step 302a, the modulus of each index value is obtained, so as to obtain the modulus value of each data.

In actual implementation, the server performs a mod operation on each index value to obtain the corresponding modulus value. For example, the server takes the index value of the data modulo 10, so that the modulus value may be 0, 1, 2, ..., 9.

In step 303a, when the modulus value matches the preset modulus value, the data indicated by the corresponding index value is used as sampling data.

In practical implementation, the server matches the modulus value obtained in step 302a against a preset modulus value; if they match, the data indicated by the corresponding index value is used as sampled data. It should be noted that there may be multiple preset modulus values: preset modulus value 1, preset modulus value 2, ..., preset modulus value N. In one application scenario, the modulus values are first matched against preset modulus value 1; when the ratio of the amount of sampled data obtained to the total amount of data in the data unit is smaller than the data sampling rate, the server uses preset modulus value 2 to sample more data, that is, the data whose modulus value matches preset modulus value 2 is also used as sampled data, and so on until the ratio of the amount of sampled data to the total amount of data in the data unit equals the data sampling rate. In practical application, the server may assign a priority to each preset modulus value and, according to these priorities, select the target preset modulus value to match against the modulus value obtained in step 302a.

For example, 10,000 pieces of data are stored in a data unit, the data sampling rate is 20% (i.e., 2,000 pieces of sampled data are required), and the preset modulus values are set to 1, 2, and 3. The index values of the 10,000 pieces of data are taken modulo 8, and the data whose modulus value is 1 is used as sampled data. If this data does not reach the 2,000 pieces required by the data sampling rate, the server also takes data whose modulus value is 2 as sampled data. When 1,500 pieces of data have the modulus value 2, only as many of them as are needed to reach 2,000 pieces are taken, and the server finishes sampling the data in the data unit. When only 500 pieces of data have the modulus value 2, the amount of sampled data still does not reach 2,000, so the server continues to take data whose modulus value is 3 as sampled data until the amount of sampled data reaches 2,000, at which point the sampling of the data in the data unit is finished.
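Steps 301a to 303a, together with the priority-ordered preset modulus values, can be sketched as follows. The function name modulo_sample and the consecutive index values 0 to 9,999 are assumptions made for illustration; with such evenly spread index values each modulus value covers 1,250 records, so the counts differ from those in the example above.

```python
def modulo_sample(index_values, modulus, preset_values, data_rate):
    """Collect records by modulus value, in priority order, up to the target."""
    target = round(len(index_values) * data_rate)
    sampled = []
    for preset in preset_values:          # earlier presets have higher priority
        for idx in index_values:
            if len(sampled) == target:    # data sampling rate reached
                return sampled
            if idx % modulus == preset:   # step 303a: modulus value matches
                sampled.append(idx)
    return sampled                        # may fall short if presets run out


picked = modulo_sample(range(10_000), modulus=8,
                       preset_values=[1, 2, 3], data_rate=0.2)
print(len(picked))  # 2000
```

Here preset modulus value 1 supplies 1,250 records and preset modulus value 2 supplies the remaining 750, so preset modulus value 3 is never used.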

In some embodiments, referring to fig. 10, fig. 10 is a flowchart of a data sampling method based on hash operation according to an embodiment of the present application. The data sampling process when the data sampling operator is a hash operation is described in connection with the steps shown in fig. 10.

In step 301b, the server obtains index values corresponding to each data in the data unit.

In practical implementations, when the data sampling operator is a hash operation, the server obtains an index value for indicating the data. The index value may be a data identifier or an address pointer of the data store, provided that the index value is of a numeric type.

Step 302b, hash the index value to obtain the hash value of each data.

In actual implementation, the server performs hash operation on each index value to obtain a corresponding hash value.

In step 303b, when the hash value does not reach the hash value threshold, the data indicated by the corresponding index value is used as sampling data.

Illustratively, a hash function hashFunc is set, and the index value of each piece of data is used as the input of hashFunc to obtain the corresponding hash value; when the hash value is smaller than the hash value threshold, the data indicated by the corresponding index value is used as sampled data.
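A minimal sketch of steps 301b to 303b follows. The description does not fix the hash function or say how the hash value threshold is derived, so SHA-256 truncated to 64 bits and a threshold equal to the data sampling rate times the size of the hash space are assumptions made here; hash_func stands in for hashFunc.

```python
import hashlib

HASH_SPACE = 2 ** 64  # hash values are taken from [0, 2**64)


def hash_func(index_value: int) -> int:
    """Map an index value to a 64-bit integer (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(str(index_value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_sample(index_values, data_rate):
    """Keep the records whose hash value is below the hash value threshold."""
    threshold = int(data_rate * HASH_SPACE)
    return [idx for idx in index_values if hash_func(idx) < threshold]


picked = hash_sample(range(20_000), data_rate=0.2)
print(len(picked) / 20_000)  # close to 0.2, but generally not exactly 0.2
```

Unlike the modulo sketch, this selection needs no counter and always picks the same records for the same index values, but it only approximates the data sampling rate.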

Step 1033, based on the search information, performing data search in the sampled data to obtain target data.

In actual implementation, the server retrieves the sampled data based on the search information carried in the data retrieval request to obtain the target data. The search information may include search operators, query statements, custom query conditions, search formulas, and the like.

In some embodiments, referring to fig. 11, fig. 11 is a flowchart of a data sampling method according to an embodiment of the present application, in which the search information includes a search operator and a query statement; the method is described in connection with the steps shown in fig. 11.

In step 401, the server performs data sampling on the source data by combining the unit sampling rate, the data sampling rate and the search operator, so as to obtain initial sampling data.

In practical implementation, the server samples the data units according to the unit sampling rate, determines at least one target data unit to be accessed, and then samples the target data units according to the data sampling rate to obtain intermediate sampling data. Then, a search operator carried in the data search request is obtained, and the intermediate sampling data is sampled to obtain initial sampling data.

For example, the server samples user log data, and the search information carried in the data retrieval request is "msg:error | select avg(rate) group by date". Here the search operator is msg:error, a search operator in a custom key/value form, which means that error logs, i.e., logs whose type is error, are sampled from the user log data. Through the search operator msg:error, the server obtains the error log information, which is also a way of screening the source data. It should be noted that the specific form of the search operator may be determined according to specific sampling requirements.

Step 402, based on the query statement, data retrieval is performed in the initial sampling data to obtain target data.

In connection with the above example, the server obtains the query statement "select avg(rate) group by date" from the search information "msg:error | select avg(rate) group by date", and queries the sampled data obtained based on the sampling rate X and the search operator to obtain the target data.

In some embodiments, the server may perform statistical analysis on the target data by: when the search result also comprises a statistical result corresponding to the target data, the server acquires a data statistical mode corresponding to the target data; and carrying out statistical analysis on the target data based on a data statistical mode to obtain a statistical result.

In actual implementation, the server may perform a statistical analysis on the sampled data based on the sampling rate X and the search operator, the specific statistical manner may be determined according to actual requirements, and the statistical manner includes, but is not limited to, averaging (avg), number (count), top few (topN), etc.

As an example, the statistical mode of "select avg(rate) group by date" for the sampled data is averaging (avg), i.e., the rate field is averaged by date.
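Steps 401 and 402 and the avg statistic can be illustrated with a toy sketch. Several things here are assumptions: the log records and their field names, the reading of msg:error as "field msg equals error", and the function name run_retrieval. The query statement is not actually parsed; the averaging is hard-coded to mirror "select avg(rate) group by date".

```python
from collections import defaultdict


def run_retrieval(sampled_logs, search_info):
    """Apply a 'key:value' search operator, then average `rate` per `date`."""
    operator, _query = (part.strip() for part in search_info.split("|", 1))
    key, value = (part.strip() for part in operator.split(":", 1))
    # Step 401: the search operator screens the sampled records.
    initial = [log for log in sampled_logs if log.get(key) == value]
    # Step 402: hard-coded equivalent of "select avg(rate) group by date".
    rates_by_date = defaultdict(list)
    for log in initial:
        rates_by_date[log["date"]].append(log["rate"])
    return {date: sum(r) / len(r) for date, r in rates_by_date.items()}


logs = [  # hypothetical sampled log records
    {"date": "2022-02-20", "msg": "error", "rate": 0.5},
    {"date": "2022-02-20", "msg": "error", "rate": 1.5},
    {"date": "2022-02-20", "msg": "info", "rate": 9.0},
    {"date": "2022-02-21", "msg": "error", "rate": 2.0},
]
print(run_retrieval(logs, "msg:error | select avg(rate) group by date"))
# {'2022-02-20': 1.0, '2022-02-21': 2.0}
```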

By applying the embodiment of the application, the server samples the source data by determining the unit sampling rate of the data units corresponding to the source data and the data sampling rate within each data unit, which greatly reduces the amount of data participating in query analysis and reduces system resource usage and response time while ensuring that the query result meets the client's correctness requirements. Meanwhile, during data sampling, the I/O count is reduced by shard-level (data-slice-level) sampling and I/O amplification is reduced by block-level (data-block-level) sampling, achieving efficient data sampling and retrieval.

In the following, an exemplary application of the embodiment of the present application in a practical application scenario will be described.

In related mass-data retrieval and analysis systems, services such as data writing, data query, and data analysis are provided, and further aggregation analysis, such as avg, count, histogram, and topN, can be performed on the data set obtained by a query. However, when the data is massive, the following problems often occur: the amount of data read is too large and I/O resource consumption is high; the amount of computation for analysis is too large and computing resource consumption is high; the overall query analysis times out and no result can be obtained. The main cause of these problems is that the retrieval result set is too large, so that the amount of data for subsequent analysis is too large. A retrieval and analysis system that does not provide a data sampling function, when facing mass data, either fails to return a result because of a timeout or runs for an excessively long time. Some retrieval and analysis systems provide basic sampling capability, but that sampling depends heavily on the user's data fields, is prone to I/O amplification, and performs poorly in practice. Other data detection services, when the amount of data to be analyzed is large, return the result of operating on only part of the data, which differs greatly from the actual result.

Based on this, the embodiment of the application provides a data retrieval method, which can be understood as a retrieval-based method. In this data retrieval method, the data is sampled in the retrieval stage using a specified sampling rate or an intelligent sampling rate, so that the amount of data participating in analysis is reduced and the resource and timeout problems of mass data analysis are avoided. Meanwhile, during data sampling, the I/O count is reduced by shard-level (data-slice-level) sampling and I/O amplification is reduced by block-level (data-block-level) sampling, achieving efficient data sampling and retrieval.

In practical implementation, the data retrieval method provided by the embodiment of the application can be used for a retrieval analysis system of mass data, including but not limited to a log retrieval system, a user data analysis system and the like. Referring to fig. 4, the visual interface interaction flow is as follows: firstly, a user inputs a query analysis statement through a query statement input box in an interface, then, a sampling rate is set, wherein the setting mode of the sampling rate can comprise a specified sampling rate and an intelligent sampling rate, and finally, the user clicks an 'execute analysis' function key in the interface, sends a data retrieval analysis request to a server and receives a data analysis result returned by the server.

Next, a method for retrieving data provided by the embodiment of the present application will be described from a technical implementation point of view. In practical implementation, the manner in which the server obtains the sampling rate X for the mass data may be two modes, namely, a specified sampling rate mode and an intelligent sampling rate mode, where the specified sampling rate mode corresponds to the specified sampling rate and the intelligent sampling rate corresponds to the intelligent sampling rate. After receiving a data retrieval request aiming at mass data, a server analyzes the data retrieval request to obtain a corresponding analysis result, and when the analysis result comprises a specific numerical value of a sampling rate, the sampling rate mode aiming at the data is represented as a specified sampling rate mode, and the sampling rate in the analysis result is the specified sampling rate; and when the analysis result does not comprise a specific numerical value of the sampling rate, the sampling rate mode of the data is represented as an intelligent sampling rate mode, and the sampling rate in the analysis result is the intelligent sampling rate.

In actual implementation, when the sampling rate mode for data is a specified sampling rate mode, the sampling rate carried by the received data retrieval request is read as the sampling rate X.

In practical implementation, when the sampling rate mode for data is the intelligent sampling rate mode, the server may determine a reasonable sampling rate X as follows: first, the upper limit supported by the system is established as L (i.e., the amount of data the system can process in a reasonable time), and the amount of query result data of the client is estimated as N; the recommended sampling rate is then X = min(1, L/N × c), where c is a system reservation parameter and c ≤ 1. For example, suppose that within a time range acceptable to the user (for example, 30 seconds) L is 10 million pieces and N is 20 million pieces. In theory the sampling rate X may be set to 0.5 (10 million / 20 million), but the computing capacity of the server for sampling is then exactly fully used. Since L and N are estimates and may contain a certain error, in practical application the system reservation parameter c may be set so that the server maintains a better sampling state.
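The recommended sampling rate X = min(1, L/N × c) can be computed as in the sketch below. The function and parameter names are hypothetical, and the value c = 0.8 in the second call is an assumed illustration; the description only requires c ≤ 1.

```python
def intelligent_sampling_rate(system_limit, estimated_count, reserve=1.0):
    """X = min(1, L / N * c), with system reservation parameter c <= 1."""
    if system_limit <= 0 or estimated_count <= 0 or not 0 < reserve <= 1:
        raise ValueError("expected L > 0, N > 0 and 0 < c <= 1")
    return min(1.0, system_limit / estimated_count * reserve)


# L = 10 million, N = 20 million, no headroom reserved (c = 1).
print(intelligent_sampling_rate(10_000_000, 20_000_000))       # 0.5
# Same L and N with an assumed reservation parameter c = 0.8.
print(intelligent_sampling_rate(10_000_000, 20_000_000, 0.8))  # 0.4
```

When the estimated result set fits within the system limit, the min clamps the rate to 1 and no sampling takes place.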

In some embodiments, referring to fig. 12, fig. 12 is a flowchart of data statistics analysis determined by the data-based retrieval method according to an embodiment of the present application, and the steps shown in fig. 12 are described.

In step 501, the server performs time slicing on at least two data slices storing data to be sampled, and determines a slice sampling rate.

In practical implementation, in a distributed storage system, data to be sampled are often stored in a plurality of devices in a scattered manner, and are managed in a data slicing (shard) manner, so that data in the same time period can be ensured to be uniformly distributed on different data slices, and load balancing is realized. Because the data of different fragments are uniformly distributed in the same time period, when data statistics analysis is carried out, the data in partial data fragments can be selected for reading, and the purpose of data sampling can be achieved. Thus, the reading pressure can be shared to different data slicing nodes, and meanwhile IO resource consumption is reduced.

The above sampling manner for data slices may be referred to as slice sampling, i.e., slice-level (shard-level) sampling. The server performs slice sampling according to a slice sampling rate, which may be expressed as ns/S, where S is the number of data slices (S ≥ 1) and ns is the number of data slices accessed (1 ≤ ns ≤ S). It should be noted that the sampling rate and the slice sampling rate are related as follows: sampling rate X = slice sampling rate × other sampling rate, where the other sampling rate is denoted by P; that is, X = ns/S × P.

For example, referring to fig. 13A, fig. 13A is a schematic diagram of data slice sampling provided in an embodiment of the present application. In the figure, the number of data slices corresponding to the data to be sampled is S = 5 and the slice sampling rate is 40% = 2/5: within each time period (t1, t2, etc.), 2 different slices are sampled from the 5 slices, which achieves the slice sampling rate of 40%. Slice sampling is completed by reading only the data in the data slices shown as gray shaded parts in the figure, so the number of IO operations is reduced.
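One way to choose which slices to read in each time period is sketched below. The rotation scheme and the function name are assumptions made for illustration only; the text states that 2 different slices out of 5 are read per time period but does not say how they are chosen.

```python
from collections import Counter


def slices_for_period(period_index: int, slice_count: int, ns: int) -> list[int]:
    """Pick ns distinct slice ids for one time period by simple rotation."""
    if not 1 <= ns <= slice_count:
        raise ValueError("ns must satisfy 1 <= ns <= slice_count")
    start = (period_index * ns) % slice_count
    return [(start + i) % slice_count for i in range(ns)]


# S = 5 slices, ns = 2 slices per time period (slice sampling rate 40%).
load = Counter()
for t in range(5):
    chosen = slices_for_period(t, slice_count=5, ns=2)
    load.update(chosen)
    print(f"t{t + 1}: slices {chosen}")
print(dict(load))  # how often each slice was read over the five periods
```

Rotating the starting slice spreads the reads evenly, so that over S consecutive periods every slice is read the same number of times.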

As can be seen from the above example, slice sampling only supports coarse-grained slice sampling rates of 1/S, 2/S, ..., S/S. Referring to fig. 13B, fig. 13B is a schematic diagram of the slice sampling rate provided in an embodiment of the present application. Taking S = 5 as an example, the slice sampling rate may be 20% = 1/5, 40% = 2/5, 60% = 3/5, 80% = 4/5, or 100% = 5/5. The number ns of data slices accessed in a time period may be determined by the formula ns = ceil(X × S), where X is the obtained intelligent sampling rate or the specified sampling rate, and S is the number of data slices corresponding to the data to be sampled.
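The rounding rule can be sketched as follows. The function name is hypothetical, and exact fractions are used here (an implementation choice, not something the text prescribes) so that products such as X × S are not disturbed by floating-point rounding before the ceiling is taken.

```python
from fractions import Fraction
from math import ceil


def slices_to_access(rate: Fraction, slice_count: int) -> int:
    """Return ns = ceil(X * S), the number of slices read per time period."""
    if slice_count < 1:
        raise ValueError("slice_count (S) must be at least 1")
    if not 0 < rate <= 1:
        raise ValueError("rate (X) must satisfy 0 < X <= 1")
    return ceil(rate * slice_count)


S = 5
# The only slice sampling rates available with S slices: 1/S, 2/S, ..., S/S.
print([str(Fraction(n, S)) for n in range(1, S + 1)])
for x in (Fraction(2, 5), Fraction(1, 2), Fraction(1, 1000)):
    print(x, "->", slices_to_access(x, S), "of", S, "slices")
```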

For example, when the sampling rate X = 40% and ns/S = 40%, 2 of the 5 data slices can be sampled directly and 100% of the data in the sampled slices is taken, which meets the requirement of sampling 40% of the data to be sampled.

In practical application, referring to fig. 14, fig. 14 is a schematic diagram of slice sampling rate correction provided by an embodiment of the present application. In the figure, slice sampling supports only the coarse-grained slice sampling rates 1/S, 2/S, ..., S/S. Assume that the slice sampling rate is 3/S while the actually required sampling rate X is less than 3/S and greater than 2/S. In this case, the difference can be corrected by the other sampling rate P, which may also be referred to as the correction coefficient for the slice sampling rate.

For example, taking S = 5, when the sampling rate X = 50%, the slice sampling rate 2/5 = 40% cannot reach X. Therefore ns = ceil(0.5 × 5) = 3 slices are accessed (ns/S = 60%), and P is set so that ns/S × P = 50%, i.e., P = 5/6.
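The correction can be worked through in a short sketch, assuming that ns is chosen by the ceil rule given earlier so that ns/S is never below X and P never exceeds 1. The function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def split_rate(rate: Fraction, slice_count: int) -> tuple[int, Fraction, Fraction]:
    """Return (ns, ns/S, P) with ns = ceil(X * S) and ns/S * P == X."""
    if slice_count < 1 or not 0 < rate <= 1:
        raise ValueError("need S >= 1 and 0 < X <= 1")
    ns = ceil(rate * slice_count)
    slice_rate = Fraction(ns, slice_count)
    return ns, slice_rate, rate / slice_rate


ns, slice_rate, p = split_rate(Fraction(1, 2), 5)
print(ns, slice_rate, p)  # 3 3/5 5/6
```

For X = 50% and S = 5 this gives three slices and a correction coefficient of 5/6; for X = 40% the coefficient is exactly 1, matching the case where slice sampling alone suffices.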

When the search analysis system itself does not support data slicing, the number of slices may be taken as S = 1 with ns = 1, so that the slice sampling rate at the data slice level is 1.

In step 502, the server parses the query statement and combines it with the sampling operator condition to sample the data.

In actual implementation, the server parses the query statement carried in the data retrieval request and converts it into a logical query tree, which contains logical combinations such as AND, OR, and NOT. The sampling operator condition is then added to the logical query tree, so that the query conditions and the sampling operator are combined. Here, the sampling operator can be understood as a sampling function or a filtering condition.

Illustratively, as in fig. 4, the search information "msg: error|select avg (rate) group by date" is entered at the search interface. Here, "msg: error" can be regarded as a sampling operator expressed in key: value form; for a message log system, it screens out the messages whose type is error. "select avg (rate) group by date" is the corresponding query statement.
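How a filtering condition and a sampling condition might be combined under one AND node is sketched below. This is a simplified illustration rather than the parser of the embodiment: the helper names, the splitting on the first "|", and the even-doc_id sampling condition are all assumptions.

```python
from typing import Callable, Mapping

Doc = Mapping[str, object]
Predicate = Callable[[Doc], bool]


def split_retrieval_info(text: str) -> tuple[str, str]:
    """Split 'operator|statement' at the first '|' into its two parts."""
    operator_part, _, statement = text.partition("|")
    return operator_part.strip(), statement.strip()


def key_value_filter(expr: str) -> Predicate:
    """Turn 'key: value' into a predicate that tests doc[key] == value."""
    key, _, value = expr.partition(":")
    key, value = key.strip(), value.strip()
    return lambda doc: str(doc.get(key)) == value


def all_of(*predicates: Predicate) -> Predicate:
    """AND node of a logical query tree: every child must accept the doc."""
    return lambda doc: all(p(doc) for p in predicates)


operator_part, statement = split_retrieval_info(
    "msg: error|select avg (rate) group by date")
# Hypothetical sampling condition: keep documents with an even doc_id.
condition = all_of(key_value_filter(operator_part),
                   lambda doc: doc["doc_id"] % 2 == 0)

docs = [
    {"doc_id": 0, "msg": "error", "rate": 1.5},
    {"doc_id": 1, "msg": "error", "rate": 2.0},
    {"doc_id": 2, "msg": "info", "rate": 0.5},
    {"doc_id": 4, "msg": "error", "rate": 1.0},
]
print([d["doc_id"] for d in docs if condition(d)])  # [0, 4]
```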

In step 503, the data block is sampled in combination with the block sampling rate and the sampling operator for the data block.

Here, sampling with the block sampling rate for data blocks is a block-level sampling process. In a storage system, data is generally managed in blocks: data encoding and compression are performed per block, and disk IO itself is performed in units of blocks. Thus, when a document in a data block is accessed, data near that document in the block is also accessed, and the whole block may even be read. (For example, the target is to access one piece of data, but 100 pieces are actually read in order to reach it; this is IO amplification.) In a sampling scenario, the sampled data is uniformly distributed, so the IO amplification is pronounced and degrades the read efficiency of sampling. Based on this, in practical implementation the other sampling rate P may be split into two parts, P = P1 × P2. P1 is the block sampling rate (block-level sampling rate) for data blocks: if a block is not hit by sampling, none of the data in that block is sampled. P2 is the sampling rate for data within a data block (intra-block sampling rate): when a block is selected for sampling, each document in it is then checked against P2 to decide whether it is sampled. That is, X = ns/S × P1 × P2: the product of the slice sampling rate ns/S for data slices, the block sampling rate P1 for data blocks in a data slice, and the data sampling rate P2 for data in a data block equals the sampling rate X.

For example, referring to fig. 15, fig. 15 is a schematic diagram of another sampling rate provided by an embodiment of the present application. The sampling rate is X = 0.002 and the slice sampling rate is 1/10 = 0.1, so the other sampling rate is P = 2%. If every data block in a data slice is accessed (block sampling rate P1 = 1) and 2% of the data is taken from each block (P2 = 2%), accesses are scattered across all blocks and the IO amplification is pronounced. If instead the rates are adjusted to P1 = 10% and P2 = 20%, only 1/10 of the data blocks in the data slice are accessed and 20% of the data is sampled in each accessed block, so the IO becomes more concentrated and only 1/10 of the blocks are read. Block-level sampling thus turns scattered access into concentrated access and greatly reduces the IO amplification effect.
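The effect of the two settings on block reads can be checked with a small counting sketch. The slice layout (100 blocks of 1000 documents) and the deterministic "every k-th" selection are assumptions chosen so that the counts are exact; they are not taken from the embodiment.

```python
def read_cost(blocks_per_slice: int, docs_per_block: int,
              block_every: int, doc_every: int) -> tuple[int, int]:
    """Count blocks read and documents sampled within one data slice.

    Every `block_every`-th block is read (P1 = 1 / block_every) and, inside
    a read block, every `doc_every`-th document is kept (P2 = 1 / doc_every).
    """
    blocks_read = 0
    docs_sampled = 0
    for block_id in range(blocks_per_slice):
        if block_id % block_every != 0:
            continue  # block not hit: none of its data is read
        blocks_read += 1
        docs_sampled += sum(1 for i in range(docs_per_block)
                            if i % doc_every == 0)
    return blocks_read, docs_sampled


# P1 = 1, P2 = 2%: every block is touched.
print(read_cost(100, 1000, block_every=1, doc_every=50))
# P1 = 10%, P2 = 20%: same P = 2%, but a tenth of the blocks.
print(read_cost(100, 1000, block_every=10, doc_every=5))
```

Both settings sample the same number of documents, while the second reads ten times fewer blocks.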

In step 504, the data within the data block is sampled in combination with the data sampling rate and the sampling operator for the data within the data block.

Here, the data sampling rate, i.e., the sampling rate applied to the data in a data block during intra-block sampling, may be represented by P2. The server may generate a data sampling operator for retrieval based on the data sampling rate. The data sampling operator must satisfy the following: after all document ids in the data block are traversed, the number of documents selected by the operator divided by the total number of documents equals the intra-block sampling rate (i.e., the data sampling rate for the data in the data block).

In actual implementation, available data sampling operators include, but are not limited to: modulo operation (mod), hash operation (hash), random probability selection, and the like.
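A modulo-based operator can be sketched as follows, assuming integer document ids that are dense within a block; the function name and the preset remainder of 0 are hypothetical. With modulus m the operator selects a fraction 1/m of the ids, so it fits intra-block sampling rates of that form.

```python
from typing import Callable


def mod_sampler(modulus: int, remainder: int = 0) -> Callable[[int], bool]:
    """Return an operator that keeps ids with doc_id % modulus == remainder."""
    if modulus < 1 or not 0 <= remainder < modulus:
        raise ValueError("need modulus >= 1 and 0 <= remainder < modulus")
    return lambda doc_id: doc_id % modulus == remainder


keep = mod_sampler(5)  # intra-block sampling rate P2 = 1/5
block_ids = range(1000)
sampled = [doc_id for doc_id in block_ids if keep(doc_id)]
print(len(sampled) / len(block_ids))  # 0.2
```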

Taking a hash operation as an example of the data sampling operator, a hash function hashfunc is defined whose input parameter is a document identification (doc_id) and whose output is the hash value corresponding to that document id. The hash value is compared with a preset threshold, and when the hash value is smaller than the threshold, the data corresponding to the document id is taken as sampled data. The threshold depends on the data sampling rate P2, that is:

hashfunc(doc_id) < int.max × P2 => the doc_id is sampled

hashfunc(doc_id) ≥ int.max × P2 => the doc_id is not sampled
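The threshold comparison can be sketched as below. The hash function is not specified in the text, so SHA-256 truncated to 64 bits is swapped in here as an assumption, with 2^64 playing the role of int.max; the names HASH_RANGE and is_sampled are likewise hypothetical.

```python
import hashlib

HASH_RANGE = 2 ** 64  # assumed stand-in for int.max in the comparison above


def hashfunc(doc_id: str) -> int:
    """Map a document id to an integer in [0, 2**64) via SHA-256."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def is_sampled(doc_id: str, p2: float) -> bool:
    """Keep the document when its hash falls below the P2 threshold."""
    if not 0 <= p2 <= 1:
        raise ValueError("p2 must satisfy 0 <= p2 <= 1")
    return hashfunc(doc_id) < HASH_RANGE * p2


doc_ids = [f"doc-{i}" for i in range(10_000)]
kept = [d for d in doc_ids if is_sampled(d, 0.2)]
print(len(kept) / len(doc_ids))  # close to 0.2, not exactly 0.2
```

Because the decision depends only on the document id, the same document is selected on every run, which keeps repeated queries consistent. The achieved fraction only approximates P2 for a finite block.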

In step 505, the query is executed in the retrieval system.

Here, the query operation performed in the retrieval system includes steps such as indexing, document scoring, and document sorting. Note that the query condition at this point already includes the sampling operator.

In step 506, a list of sampled data is obtained.

In actual implementation, based on the foregoing sampling of data slices, data blocks, and data within data blocks, the server retrieves the sampled data that satisfies both the query condition and the sampling condition; this data may be document ids.

In step 507, the relevant field data of the sampled data is pulled and statistically analyzed.

Here, the data fields are analyzed according to the document id list obtained in step 506, and the final statistical analysis result is returned to the user. Because the amount of data has already been reduced according to the sampling rate, the time consumed is greatly reduced.
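The final analysis over the sampled document id list can be sketched for a query of the form "select avg (rate) group by date". The in-memory document store, the field values, and the function name are hypothetical and serve only to illustrate the grouping.

```python
from collections import defaultdict


def avg_rate_by_date(docs: dict, sampled_ids: list) -> dict:
    """Average the 'rate' field per 'date' over the sampled documents only."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for doc_id in sampled_ids:
        doc = docs[doc_id]  # pull only the fields of sampled documents
        sums[doc["date"]] += doc["rate"]
        counts[doc["date"]] += 1
    return {date: sums[date] / counts[date] for date in sums}


docs = {
    1: {"date": "2022-02-20", "rate": 1.0},
    2: {"date": "2022-02-20", "rate": 3.0},
    3: {"date": "2022-02-21", "rate": 2.0},
    4: {"date": "2022-02-21", "rate": 8.0},
}
print(avg_rate_by_date(docs, [1, 2, 3]))
```

Only the documents in the id list are touched, so the cost of this stage scales with the sample size rather than with the full data.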

Referring to fig. 16, fig. 16 is an analysis chart of experimental results provided by an embodiment of the present application. When the sampling rate X = 1 (full retrieval), the server searches the full data and returns 150000000 entries in 13423 milliseconds, and the calculated statistical result (i.e., avg) is approximately 1.66. Setting smaller sampling rates X reduces the number of entries from which the statistical result is determined: at X = 0.3, the query takes 4635 milliseconds, returns 45000006 entries, and the statistical result is approximately 1.64; at X = 0.1, it takes 1621 milliseconds, returns 15000001 entries, and the result is approximately 1.64; at X = 0.001, it takes 102 milliseconds, returns 149990 entries, and the result is approximately 1.64. The avg determined by sampling thus differs only slightly from the avg computed directly over the full data. Provided that the avg result meets the client's requirements, the analysis time falls sharply as the sampling rate decreases, which improves the availability of the whole query system.
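The reported figures can be cross-checked with a few lines of arithmetic. The numbers below are the ones quoted above; the derived columns (scaled-up total, speedup, relative deviation of avg) and the function name are illustrative additions, and the avg values are the rounded ones given in the text.

```python
# (sampling rate X, entries returned, elapsed ms, avg) as quoted for fig. 16
RESULTS = [
    (1.0, 150_000_000, 13423, 1.66),
    (0.3, 45_000_006, 4635, 1.64),
    (0.1, 15_000_001, 1621, 1.64),
    (0.001, 149_990, 102, 1.64),
]


def summarize(results):
    """Compare each sampled run against the full run in the first row."""
    _, _, full_ms, full_avg = results[0]
    rows = []
    for rate, count, ms, avg in results[1:]:
        rows.append({
            "rate": rate,
            "estimated_total": round(count / rate),  # scale count back up by 1/X
            "speedup": full_ms / ms,
            "avg_deviation": abs(avg - full_avg) / full_avg,
        })
    return rows


for row in summarize(RESULTS):
    print(row)
```

Dividing each returned count by X recovers a total close to the 150000000 entries of the full search, which indicates that the achieved sampling fraction tracks the requested rate.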

According to the embodiment of the application, retrieval sampling greatly reduces the number of documents returned in the retrieval stage and hence the number of documents participating in query analysis. While the correctness of the client's query result is ensured, the occupation of system resources and the time consumed by query analysis are greatly reduced. As a result: the time consumed by user query analysis is reduced, which improves the product experience; the occupation of system resources is reduced, which provides more query concurrency capacity; and the usability of the system is effectively improved.

It will be appreciated that in the embodiments of the present application, related data such as user information is involved, and when the embodiments of the present application are applied to specific products or technologies, user permissions or agreements need to be obtained, and the collection, use and processing of related data need to comply with relevant laws and regulations and standards of relevant countries and regions.

The following continues with an exemplary structure of the data retrieval apparatus 555 provided by an embodiment of the present application and implemented as software modules. In some embodiments, as shown in fig. 2, the software modules of the data retrieval apparatus 555 stored in the memory 540 may include:

an obtaining module 5551, configured to obtain source data and a sampling rate corresponding to the source data in response to a data retrieval request carrying retrieval information; wherein the source data is divided into at least two data units for data storage;

a determining module 5552, configured to determine a unit sampling rate corresponding to the at least two data units and a data sampling rate corresponding to data in the data units based on the sampling rate corresponding to the source data;

the sampling module 5553 is configured to combine the unit sampling rate, the data sampling rate, and the search information to perform data sampling on the source data to obtain target data;

a return module 5554, configured to return a search result including the target data.

In some embodiments, the obtaining module is further configured to parse the data retrieval request to determine a sampling rate mode for the source data; and when the sampling rate mode is a specified sampling rate mode, take the sampling rate carried in the data retrieval request as the sampling rate corresponding to the source data.

In some embodiments, the obtaining module is further configured to parse the data retrieval request to determine a sampling rate mode for the source data; when the sampling rate mode is an intelligent sampling rate mode, obtain the estimated data amount corresponding to the source data and a processing amount threshold for processing the source data; and determine the sampling rate corresponding to the source data by combining the processing amount threshold and the estimated data amount.

In some embodiments, when the data unit is a data slice, the determining module is further configured to obtain a slice number of the data slice corresponding to the source data; and combining the sampling rate corresponding to the source data and the number of fragments, determining the fragment sampling rate corresponding to the at least two data fragments as the unit sampling rate, and taking the ratio of the sampling rate corresponding to the source data to the fragment sampling rate as the data sampling rate corresponding to the data in the data fragments.

In some embodiments, when the data unit is a data block, the source data is divided into at least two data slices, each data slice is divided into at least two data blocks, and the determining module is further configured to obtain a slice number of the data slices corresponding to the source data, and determine a slice sampling rate corresponding to the at least two data slices in combination with a sampling rate corresponding to the source data and the slice number; acquiring sampling rate thresholds for the at least two data blocks; determining an intermediate sampling rate for the at least two data blocks in combination with the sampling rate corresponding to the source data and the fragmentation sampling rate; and determining unit sampling rates corresponding to the at least two data units and data sampling rates corresponding to data in the data units based on the sampling rate threshold and the intermediate sampling rate.

In some embodiments, the determining module is further configured to determine a product of the sampling rate corresponding to the source data and the number of slices; performing upward rounding treatment on the product to obtain the number of sampling fragments; and determining the ratio of the number of the sampling slices to the number of the slices, and taking the ratio as the slicing sampling rate corresponding to the at least two data slices.

In some embodiments, the sampling module is further configured to sample the at least two data units based on the unit sampling rate to obtain at least one target data unit; based on the data sampling rate, respectively sampling the data in each target data unit to obtain sampling data; and carrying out data retrieval in the sampling data based on the retrieval information to obtain the target data.
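The three stages handled by the sampling module (unit sampling, sampling within each target unit, then retrieval within the sampled data) can be sketched as one function. The "every k-th" selection rule, the function name, and the toy data units are assumptions for illustration only.

```python
def retrieve(units, unit_every, doc_every, matches):
    """Sample units, then data within each target unit, then apply retrieval."""
    # Stage 1: unit sampling rate 1 / unit_every gives the target data units.
    target_units = [u for i, u in enumerate(units) if i % unit_every == 0]
    # Stage 2: data sampling rate 1 / doc_every inside each target unit.
    sampled = [doc for unit in target_units
               for j, doc in enumerate(unit) if j % doc_every == 0]
    # Stage 3: retrieval over the sampled data only.
    return [doc for doc in sampled if matches(doc)]


# Four hypothetical data units of six documents each.
units = [[{"id": u * 6 + j, "msg": "error" if j % 2 == 0 else "info"}
          for j in range(6)] for u in range(4)]
target = retrieve(units, unit_every=2, doc_every=3,
                  matches=lambda d: d["msg"] == "error")
print([d["id"] for d in target])  # [0, 12]
```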

In some embodiments, the sampling module is further configured to determine a data sampling operator corresponding to data within the data unit based on the data sampling rate; sampling the data in the data unit based on the data sampling operator to obtain sampling data; wherein the ratio of the data volume of the sampled data to the data volume of the data in the data unit is equal to the data sampling rate.

In some embodiments, when the data sampling operator is a modulo operation, the sampling module is further configured to obtain an index value corresponding to each data in the data unit; taking the modulus of each index value to obtain a modulus value corresponding to each data; and when the modulus value is matched with a preset modulus value, taking the data indicated by the corresponding index value as sampling data.

In some embodiments, when the data sampling operator is a hash operation, the sampling module is further configured to obtain an index value corresponding to each data in the data unit; hashing the index value to obtain a hash value corresponding to each data; and when the hash value does not reach the hash value threshold, taking the data indicated by the corresponding index value as sampling data.

In some embodiments, the search information includes a search operator and a query statement, and the sampling module is further configured to combine the unit sampling rate, the data sampling rate, and the search operator to perform data sampling on the source data to obtain initial sampling data; and based on the query statement, carrying out data retrieval in the initial sampling data to obtain the target data.

In some embodiments, when the search result further includes a statistical result corresponding to the target data, the return module is further configured to obtain a data statistical manner corresponding to the target data; and carrying out statistical analysis on the target data based on the data statistical mode to obtain the statistical result.

Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the data retrieval method according to the embodiment of the present application.

An embodiment of the present application provides a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform a method for retrieving data provided by an embodiment of the present application, for example, a method for retrieving data as shown in fig. 3.

In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; it may also be any of various devices including one of, or any combination of, the above memories.

In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.

As an example, the executable instructions may, but need not, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a Hyper Text Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).

As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, distributed across multiple sites and interconnected by a communication network.

In summary, the embodiment of the application samples massive data to reduce the amount of data participating in analysis, which effectively avoids the resource and response timeout problems that massive data causes in retrieval analysis. Meanwhile, during data sampling, slice (shard) level sampling reduces the amount of IO and block level sampling reduces IO amplification, so that efficient data sampling and retrieval are achieved.

The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims

1. A method of retrieving data, the method comprising:

And returning a retrieval result comprising the target data.

2. The method of claim 1, wherein the obtaining the sampling rate corresponding to the source data comprises:

parsing the data retrieval request to determine a sampling rate pattern for the source data;

3. The method of claim 1, wherein the obtaining the sampling rate corresponding to the source data comprises:

4. The method of claim 1, wherein when the data unit is a data slice, the determining the unit sample rate for the at least two data units and the data sample rate for the data within the data unit based on the sample rate for the source data comprises:

Acquiring the number of the data fragments corresponding to the source data;

5. The method of claim 4, wherein the determining the fractional sample rate for the at least two data slices in combination with the corresponding sample rate for the source data and the number of slices comprises:

determining the product of the sampling rate corresponding to the source data and the number of fragments;

6. The method of claim 1, wherein when the data unit is a data block, the source data is divided into at least two data slices, each of the data slices is divided into at least two data blocks, the determining the unit sample rate corresponding to the at least two data units and the data sample rate corresponding to the data within the data unit based on the sample rate corresponding to the source data comprises:

Acquiring the number of fragments of the data fragments corresponding to the source data, and determining the sampling rates of fragments corresponding to the at least two data fragments by combining the sampling rates corresponding to the source data and the number of fragments;

acquiring sampling rate thresholds for the at least two data blocks;

7. The method of claim 1, wherein the data sampling the source data in combination with the unit sampling rate, the data sampling rate, and the search information to obtain target data comprises:

sampling the at least two data units based on the unit sampling rate to obtain at least one target data unit;

8. The method of claim 7, wherein the sampling data in each of the target data units based on the data sampling rate to obtain sampled data comprises:

determining a data sampling operator corresponding to data in the data unit based on the data sampling rate;

9. The method of claim 7, wherein when the data sampling operator is a modulo operation, the sampling the data in the data unit based on the data sampling operator to obtain sampled data comprises:

acquiring index values corresponding to all data in the data unit;

10. The method of claim 7, wherein when the data sampling operator is a hash operation, the sampling the data in the data unit based on the data sampling operator to obtain sampled data comprises:

Acquiring index values corresponding to all data in the data unit;

hashing the index value to obtain a hash value corresponding to each data;

11. The method of claim 1, wherein the search information includes a search operator and a query statement, and wherein the combining the unit sampling rate, the data sampling rate, and the search information, data sampling the source data to obtain target data comprises:

combining the unit sampling rate, the data sampling rate and the search operator, and performing data sampling on the source data to obtain initial sampling data;

12. The method of claim 1, wherein when the search result further includes a statistical result corresponding to the target data, the method further includes:

acquiring a data statistics mode corresponding to the target data;

13. A data retrieval device, the device comprising:

14. An electronic device, the electronic device comprising:

a memory for storing executable instructions;

a processor for implementing the method of retrieving data according to any one of claims 1 to 12 when executing executable instructions stored in said memory.

15. A computer readable storage medium storing executable instructions which when executed by a processor implement the method of retrieving data according to any one of claims 1 to 12.

16. A computer program product comprising a computer program or instructions which, when executed by a processor, carries out the method of retrieving data according to any one of claims 1 to 12.