CN110795473A

CN110795473A - Bootstrap-method-based accelerated search method

Info

Publication number: CN110795473A
Application number: CN201911106961.XA
Authority: CN
Inventors: 张宏莉; 周志刚; 王晓萌; 于海宁; 张羽; 叶麟; 方滨兴; 邱彪; 郭新凯
Original assignee: Guangdong Institute Of Electronic And Information Engineering University Of Electronic Science And Technology Of China; Harbin Institute of Technology
Current assignee: Guangdong Institute Of Electronic And Information Engineering University Of Electronic Science And Technology Of China; Harbin Institute of Technology
Priority date: 2019-11-13
Filing date: 2019-11-13
Publication date: 2020-02-14

Abstract

The invention belongs to the technical field of retrieval, and in particular relates to an accelerated search method based on the bootstrap method, comprising the following steps: S1, expressing a user search request under the Hadoop architecture as a triple Q(Op, D, ρ), where Op is the search operation the user performs on the target data set D and ρ is the lower limit of search precision set by the user; S2, extracting an initial sample S from the data set D, and then sampling with replacement m times with S as the domain of discourse to obtain {S_1, ..., S_m}; S3, performing an approximate calculation on the m results {Op(S_1), ..., Op(S_m)} produced by applying the operation Op to the samples of step S2, to obtain the relative error of the coefficient of variation; and S4, evaluating the relative error from step S3 to obtain a search result that meets the user's approximate precision. Compared with the prior art, the method samples by the bootstrap method, which effectively reduces the number of samples needed during sampling; and because only one small uniform random sample has to be extracted from the original data set, the disk cost of sampling is also significantly reduced.

Description

Bootstrap-method-based accelerated search method

Technical Field

The invention belongs to the technical field of retrieval, and particularly relates to an accelerated search method based on a bootstrap method.

Background

With the rapid development of information technology, the data volume in various fields is increasing explosively, and the world has entered the big data era. Big data refers to data sets that cannot be captured, managed, and processed within a given time frame using conventional software tools. Big data has four prominent characteristics: large volume, varied data types, low value density, and high processing speed.

Research on big data retrieval is currently limited. When searching, a user usually wants to obtain the needed items quickly from all the available material, which raises questions of both speed and accuracy. In current practice, sampling techniques such as Bernoulli sampling and the jackknife method have been introduced on the Hadoop architecture to improve search performance. Hadoop is an open-source, Java-based programming framework designed to process large data sets across computer clusters. The Mapper plug-in is commonly used in the Hadoop architecture; it lets a user conveniently add, delete, modify, and query records in a single table, and the key/value pairs finally processed by the Mapper plug-in are sent to the Reducer component for merging. However, these sampling techniques still have a drawback: even if the error margin is set relatively small, the number of samples required during a search remains very large, which limits the improvement in search performance.

In view of the above, it is necessary to improve the sampling technique by using a new scheme to overcome the defect of the excessive sampling sample size and increase the search speed.

Disclosure of Invention

The object of the invention is to address the defects of the prior art by providing an accelerated search method based on the bootstrap method. Sampling by the bootstrap method effectively reduces the number of samples needed during sampling; because only one small uniform random sample has to be extracted from the original data set, the disk cost of sampling is significantly reduced; and the method is insensitive to the specific upper-layer operation.

In order to achieve the purpose, the invention adopts the following technical scheme:

An accelerated search method based on the bootstrap method comprises the following steps:

S1, expressing a user search request under the Hadoop architecture as a triple Q(Op, D, ρ), where Op is the search operation the user performs on the target data set D and ρ is the lower limit of search precision set by the user;

S2, extracting an initial sample S from the data set D, and then sampling with replacement m times with S as the domain of discourse to obtain {S_1, ..., S_m};

S3, performing an approximate calculation on the m results {Op(S_1), ..., Op(S_m)} produced by applying the operation Op to the samples of step S2, to obtain the relative error of the coefficient of variation;

S4, evaluating the relative error of the coefficient of variation from step S3 to obtain a search result that meets the user's approximate precision.
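Steps S1 to S3 can be sketched on a single machine in a few lines of Python. This is a minimal illustration under stated assumptions, not the Hadoop implementation described here: the names `Query` and `resample_results`, the fixed resample size `n0`, and the seeded random generator are all hypothetical choices made for the example. Step S4 would then inspect the returned list.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Query:
    """S1: the search request as a triple (Op, D, rho)."""
    op: Callable[[Sequence[float]], float]  # Op: the user's search operation
    data: Sequence[float]                   # D: the target data set
    rho: float                              # rho: lower limit of search precision


def resample_results(query, n0, m, seed=0):
    """Return [Op(S_1), ..., Op(S_m)] for m bootstrap resamples.

    Assumption: every resample has the same size n0 as the initial sample.
    """
    rng = random.Random(seed)
    # S2: draw one small uniform sample S from D (without replacement).
    s = rng.sample(list(query.data), n0)
    # S2 (continued): resample S with replacement m times.
    resamples = [rng.choices(s, k=n0) for _ in range(m)]
    # S3: apply Op to each resample; D itself is never scanned again.
    return [query.op(s_i) for s_i in resamples]


def mean(xs):
    return sum(xs) / len(xs)


query = Query(op=mean, data=list(range(1, 1001)), rho=0.95)
results = resample_results(query, n0=50, m=20)
```

Only the initial draw touches the full data set; every later round works on the small sample S, which is where the reduced disk cost comes from.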

As an improvement of the bootstrap-based accelerated search method of the present invention, in the sampling process of step S2, the state that the Mapper plug-in of the Hadoop framework enters when its sampling process ends is set to a paused state by default, and the Mapper plug-in is set to the terminated state only once the calculated approximation accuracy meets the user's requirement. In the original Hadoop system, when the sampling process on a Mapper plug-in completes, the node is marked as terminated and the resources it occupies are released. In the invention, the default state of the Mapper plug-in at the end of its sampling process is changed to paused, and its state identifier is not set to terminated until the calculated approximation accuracy meets the user's requirement. This avoids restarting the Mapper plug-in on every resampling round, which effectively reduces the total execution time and improves search efficiency.

As an improvement of the bootstrap-based accelerated search method of the present invention, the Reducer component in the Hadoop framework is configured to process key/value pairs while the Mapper plug-in is in the paused state. In the prior art, the Mapper plug-in sends its finally processed key/value pairs to the Reducer component for merging only in the terminated state. Because the Reducer component here is configured to process key/value pairs while the Mapper plug-in is paused, the number of times the Mapper plug-in has to send key/value pairs is reduced, which effectively improves search efficiency.
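The effect of the paused state on restarts can be illustrated with a toy state machine. The sketch below is an assumption-laden model, not Hadoop code: `MapperState`, `MapperTask`, and `count_starts` are hypothetical names, and no real Hadoop API is used. It only counts how often a task would have to be launched under the two behaviours.

```python
from enum import Enum, auto


class MapperState(Enum):
    RUNNING = auto()
    PAUSED = auto()
    TERMINATED = auto()


class MapperTask:
    """Toy stand-in for a Mapper; not part of any Hadoop API."""

    def __init__(self, keep_alive):
        self.keep_alive = keep_alive  # True: pause after sampling instead of terminating
        self.state = MapperState.RUNNING
        self.starts = 1               # number of times the task has been launched

    def finish_sampling(self):
        # Original behaviour: terminate and release resources.
        # Modified behaviour: pause and keep resources for the next round.
        self.state = MapperState.PAUSED if self.keep_alive else MapperState.TERMINATED

    def resample(self):
        if self.state is MapperState.TERMINATED:
            self.starts += 1          # a terminated task must be relaunched
        self.state = MapperState.RUNNING

    def terminate(self):
        self.state = MapperState.TERMINATED


def count_starts(keep_alive, rounds):
    """Run `rounds` sampling rounds and report how many launches were needed."""
    task = MapperTask(keep_alive)
    for r in range(rounds):
        if r > 0:
            task.resample()
        task.finish_sampling()
    task.terminate()                  # precision requirement met
    return task.starts
```

With five rounds, the paused variant needs one launch while the terminate-every-time variant needs five, which is the restart overhead the scheme is meant to avoid.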

As an improvement to the bootstrap-based accelerated search method described in the present invention, the relative error value of the coefficient of variation in step S3 satisfies the following formula:

where cv is the coefficient of variation, sm is the sample mean, sd is the sample standard deviation, v is the relative error of the coefficient of variation, t is the search time limit, and i is a variable with i < m.
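For orientation, the coefficient of variation is conventionally the ratio of the sample standard deviation to the sample mean. The block below writes this out for the first i results and adds one plausible form of the relative error between consecutive rounds. The running definitions of sm_i and sd_i and the expression for v_{i-1,i} are assumptions made for illustration, not the expression from the original filing, and the role of t is not modelled.

```latex
\[
  sm_i = \frac{1}{i}\sum_{k=1}^{i} \mathrm{Op}(S_k), \qquad
  sd_i = \sqrt{\frac{1}{i-1}\sum_{k=1}^{i}\bigl(\mathrm{Op}(S_k) - sm_i\bigr)^2}, \qquad
  cv_i = \frac{sd_i}{sm_i}
\]
\[
  v_{i-1,i} = \frac{\lvert cv_i - cv_{i-1} \rvert}{cv_{i-1}}
  \quad \text{(assumed form)}
\]
```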

As an improvement of the bootstrap-based accelerated search method of the present invention, the evaluation process in step S4 compares the relative error of the coefficient of variation with ε: if v_{i-1,i} > ε, the next round of iterative calculation is carried out and v_{i,i+1} is computed, until in the j-th calculation v_{j-1,j} ≤ ε and the iterative process terminates. At that point the result of the corresponding operation is the search result that meets the user's approximate precision, where j is a variable with j < m and ε is the relative error parameter with ε ≥ 0.
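The stopping rule can be sketched as a short Python function over a sequence of per-round coefficients of variation. This is a minimal sketch: the name `stopping_round` is hypothetical, and the relative error is assumed to be |cv_j - cv_{j-1}| / cv_{j-1}, which is not confirmed by the text.

```python
def stopping_round(cvs, epsilon):
    """Return the 1-based round j at which v_{j-1,j} <= epsilon, or None.

    cvs[0] is cv_1, cvs[1] is cv_2, and so on.
    ASSUMPTION: v_{j-1,j} = |cv_j - cv_{j-1}| / |cv_{j-1}|.
    """
    if epsilon < 0:
        raise ValueError("epsilon must be >= 0")
    for j in range(2, len(cvs) + 1):
        prev, cur = cvs[j - 2], cvs[j - 1]
        if prev == 0:
            continue  # relative change undefined; move on to the next round
        v = abs(cur - prev) / abs(prev)
        if v <= epsilon:
            return j  # iteration terminates; Op(S_j) is taken as the answer
    return None       # never converged within the available rounds


cvs = [0.50, 0.30, 0.27, 0.268]
j = stopping_round(cvs, epsilon=0.05)
```

For the example values, the relative changes are roughly 0.4, 0.1, and 0.007, so with ε = 0.05 the iteration stops at the fourth round.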

As an improvement of the bootstrap-based accelerated search method of the present invention, the evaluation process in step S4 further includes optimizing the input parameters of the initial sample S_0, where |S_0| = n_0 = q_B × N, q_B is the proportion of the data set D extracted as the initial sample, N is the number of records in the original data set, and the number of samples drawn the first time S_0 is resampled as the domain of discourse is a further input parameter.

As an improvement of the bootstrap-based accelerated search method of the present invention, the optimization process includes: when i ∈ [2, τ], |S_i| = n_i = 2n_{i-1}; when i > τ, |S_i| = n_i = (i - τ + 1)n_τ; where τ is a value set by the data platform, with 2 ≤ τ ≤ 10.
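The schedule doubles the resample size up to round τ and then grows it linearly in multiples of n_τ. A minimal Python sketch follows; the function name `sample_sizes` is hypothetical, and the first resample size n_1 is taken as an input because its value is not given in the text.

```python
def sample_sizes(n1, tau, rounds):
    """Return [n_1, ..., n_rounds] under the doubling-then-linear schedule.

    n1 is an assumed starting size; tau must lie in [2, 10].
    """
    if not 2 <= tau <= 10:
        raise ValueError("tau must satisfy 2 <= tau <= 10")
    if rounds < 1:
        raise ValueError("rounds must be >= 1")
    sizes = [n1]
    for i in range(2, rounds + 1):
        if i <= tau:
            sizes.append(2 * sizes[-1])          # n_i = 2 * n_{i-1}
        else:
            n_tau = sizes[tau - 1]               # size reached at round tau
            sizes.append((i - tau + 1) * n_tau)  # n_i = (i - tau + 1) * n_tau
    return sizes


schedule = sample_sizes(n1=100, tau=5, rounds=8)
```

With n_1 = 100 and τ = 5 (an example value inside the allowed range), the sizes are 100, 200, 400, 800, 1600, 3200, 4800, 6400: exponential growth while the samples are small, then a gentler linear increase.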

As an improvement of the bootstrap-based accelerated search method of the present invention, the method further includes arranging a monitor on the ApplicationMaster node of the Hadoop architecture, where the monitor estimates the sample error from the output of the Reducer component and checks the termination condition of the resampling iteration process.

The beneficial effects of the invention are as follows: compared with the prior art, sampling by the bootstrap method effectively reduces the number of samples needed during sampling and improves search efficiency; because only one small uniform random sample has to be extracted from the original data set, the disk cost of sampling is significantly reduced; and the invention is insensitive to the specific upper-layer operation.

Detailed Description

As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. The present specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to." "Substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem and substantially achieve the technical effect.

In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to specific situations.

The present invention will be described in further detail with reference to specific examples, but the present invention is not limited thereto.

S2, extracting an initial sample S from the data set D, and then sampling with replacement m times with S as the domain of discourse to obtain {S_1, ..., S_m};

Preferably, in the sampling process of step S2, the state that the Mapper plug-in of the Hadoop framework enters when its sampling process ends is set to a paused state by default, and the Mapper plug-in is set to the terminated state only once the calculated approximation accuracy meets the user's requirement. In the original Hadoop system, when the sampling process on a Mapper plug-in completes, the node is marked as terminated and the resources it occupies are released. In the invention, the default state of the Mapper plug-in at the end of its sampling process is changed to paused, and its state identifier is not set to terminated until the calculated approximation accuracy meets the user's requirement. This avoids restarting the Mapper plug-in on every resampling round, which effectively reduces the total execution time and improves search efficiency.

Preferably, the Reducer component in the Hadoop architecture is configured to process key/value pairs while the Mapper plug-in is in the paused state. In the prior art, the Mapper plug-in sends its finally processed key/value pairs to the Reducer component for merging only in the terminated state. Because the Reducer component here is configured to process key/value pairs while the Mapper plug-in is paused, the number of times the Mapper plug-in has to send key/value pairs is reduced, which effectively improves search efficiency.

Preferably, the relative error value of the coefficient of variation in step S3 satisfies the following formula:

Preferably, the evaluation process in step S4 compares the relative error of the coefficient of variation with ε: if v_{i-1,i} > ε, the next round of iterative calculation is carried out and v_{i,i+1} is computed, until in the j-th calculation v_{j-1,j} ≤ ε and the iterative process terminates. At that point the result of the corresponding operation is the search result that meets the user's approximate precision, where j is a variable with j < m and ε is the relative error parameter with ε ≥ 0.

Preferably, the evaluation process in step S4 further includes optimizing the input parameters of the initial sample S_0, where |S_0| = n_0 = q_B × N, q_B is the proportion of the data set D extracted as the initial sample, N is the number of records in the original data set, and the number of samples drawn the first time S_0 is resampled as the domain of discourse is a further input parameter.

Preferably, the optimization process includes: when i ∈ [2, τ], |S_i| = n_i = 2n_{i-1}; when i > τ, |S_i| = n_i = (i - τ + 1)n_τ; where τ is a value set by the data platform, with 2 ≤ τ ≤ 10. In this embodiment τ = 5, although the invention is not limited to this value. Given N = |D| and the initial sample S_0, the Op operation is then performed iteratively until the end condition is met.

It should be noted that the end condition of resampling depends on the relationship between the relative error v of the sample coefficient of variation and the relative error parameter ε. If the coefficient of variation cv still fails to converge after many rounds of resampling, the time overhead of resampling approaches or exceeds the theoretical overhead of sampling directly from the original data set D. A threshold parameter η may therefore also be set as the final termination condition of the sampling iteration, to reduce the time overhead when the bootstrap fails, where the threshold parameter η satisfies

In the present invention, η is 5.

Preferably, a monitor is arranged on an ApplicationMaster node of the Hadoop architecture, and the monitor is used for estimating a sample error according to the output of the Reducer component and checking an end condition of the resampling iteration process.

The foregoing description shows and describes several preferred embodiments of the invention. As noted above, however, the invention is not limited to the forms disclosed herein, which are not to be construed as excluding other embodiments; it is capable of use in various other combinations, modifications, and environments, and of changes within the scope of the inventive concept expressed herein, commensurate with the above teachings or with the skill or knowledge of the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A bootstrap-based accelerated search method is characterized by comprising the following steps:

S2, extracting an initial sample S from the data set D, and then sampling with replacement m times with S as the domain of discourse to obtain {S_1, ..., S_m};

2. The bootstrap-based accelerated search method as recited in claim 1, wherein: in the sampling process of step S2, the state that the Mapper plug-in of the Hadoop framework enters when its sampling process ends is set to a paused state by default, and the Mapper plug-in is set to the terminated state only once the calculated approximation accuracy meets the user's requirement.

3. The bootstrap-based accelerated search method as recited in claim 2, wherein: the Reducer component in the Hadoop architecture is configured to process key/value pairs while the Mapper plug-in is in the paused state.

4. The bootstrap-based accelerated search method as recited in claim 3, wherein the relative error value of the coefficient of variation in said step S3 satisfies the following formula:

5. The bootstrap-based accelerated search method as recited in claim 4, wherein: the evaluation process in step S4 compares the relative error of the coefficient of variation with ε: if v_{i-1,i} > ε, the next round of iterative calculation is carried out and v_{i,i+1} is computed, until in the j-th calculation v_{j-1,j} ≤ ε and the iterative process terminates; at that point the result of the corresponding operation is the search result that meets the user's approximate precision, where j is a variable with j < m and ε is the relative error parameter with ε ≥ 0.

6. The bootstrap-based accelerated search method as recited in claim 5, wherein: the evaluation process in step S4 further includes optimizing the input parameters of the initial sample S_0, where |S_0| = n_0 = q_B × N and q_B is the proportion of the data set D extracted as the initial sample.

7. The bootstrap-based accelerated search method as recited in claim 6, wherein said optimization procedure comprises:

when i ∈ [2, τ], |S_i| = n_i = 2n_{i-1};

when i > τ, |S_i| = n_i = (i - τ + 1)n_τ;

where τ is a value set by the data platform, with 2 ≤ τ ≤ 10.

8. The bootstrap-based accelerated search method as recited in claim 5, wherein: the method further comprises arranging a monitor on the ApplicationMaster node of the Hadoop architecture, where the monitor estimates the sample error from the output of the Reducer component and checks the termination condition of the resampling iteration process.