CN110795473A - Bootstrap-method-based accelerated search method - Google Patents

Info

Publication number
CN110795473A
Authority
CN
China
Prior art keywords
search
user
bootstrap
relative error
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911106961.XA
Other languages
Chinese (zh)
Inventor
张宏莉
周志刚
王晓萌
于海宁
张羽
叶麟
方滨兴
邱彪
郭新凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Institute Of Electronic And Information Engineering University Of Electronic Science And Technology Of China
Harbin Institute of Technology
Original Assignee
Guangdong Institute Of Electronic And Information Engineering University Of Electronic Science And Technology Of China
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Institute Of Electronic And Information Engineering University Of Electronic Science And Technology Of China, Harbin Institute of Technology filed Critical Guangdong Institute Of Electronic And Information Engineering University Of Electronic Science And Technology Of China
Priority to CN201911106961.XA priority Critical patent/CN110795473A/en
Publication of CN110795473A publication Critical patent/CN110795473A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/2448 Query languages for particular applications; for extensibility, e.g. user defined types
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G06F9/44526 Plug-ins; Add-ons

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of retrieval and relates in particular to an accelerated search method based on the bootstrap method, comprising the following steps: S1, expressing a user search request under a Hadoop architecture as a triple Q = (Op, D, ρ), where Op denotes the search operation the user performs on a target data set D and ρ is the lower bound on search precision set by the user; S2, extracting an initial sample S from the data set D and then, taking S as the universe, sampling with replacement m times to obtain {S_1, ..., S_m}; S3, performing an approximate calculation on the m results {Op(S_1), ..., Op(S_m)} produced by applying the operation Op to the samples from step S2, to obtain the relative error value of the coefficient of variation; and S4, evaluating the relative error from step S3 to obtain a search result that meets the user's approximate precision. Compared with the prior art, the method samples by the bootstrap method, which effectively reduces the number of samples required; and because only one small uniform random sample needs to be extracted from the original data set, it also significantly reduces the disk cost of sampling.

Description

Bootstrap-method-based accelerated search method
Technical Field
The invention belongs to the technical field of retrieval, and particularly relates to an accelerated search method based on a bootstrap method.
Background
With the rapid development of information technology, data volumes in every field are growing explosively, and the world has entered the big data era. Big data refers to a collection of data that cannot be captured, managed and processed within a given time frame using conventional software tools. It has four notable characteristics: large data volume, varied data types, low value density and high processing speed.
Research on big data retrieval remains limited. When searching, a user usually wants to obtain the needed items quickly from the full body of material, which raises questions of both speed and accuracy. In current practice, sampling techniques such as Bernoulli sampling and the jackknife method have been introduced on the Hadoop architecture to improve search performance; Hadoop is an open-source, Java-based programming framework designed to process large data sets across computer clusters. The Mapper plug-in is commonly used in the Hadoop architecture and lets a user conveniently perform create, read, update and delete operations on a single table; the key-value pairs finally produced by the Mapper plug-in are sent to the Reducer component for merging. However, these sampling techniques still have drawbacks: even when the error margin is set relatively small, the number of samples required during the search remains very large, which limits the improvement in search performance.
In view of the above, a new scheme is needed to improve the sampling technique, so as to overcome the excessive sample size and increase search speed.
Disclosure of Invention
The invention aims, in view of the defects of the prior art, to provide an accelerated search method based on the bootstrap method. Sampling by the bootstrap method effectively reduces the number of samples required; because only one small uniform random sample needs to be extracted from the original data set, the disk cost of sampling is also significantly reduced; in addition, the method is insensitive to the specific upper-layer operation.
In order to achieve the purpose, the invention adopts the following technical scheme:
An accelerated search method based on the bootstrap method comprises the following steps:
S1, expressing a user search request under a Hadoop architecture as a triple Q = (Op, D, ρ), where Op denotes the search operation the user performs on a target data set D and ρ is the lower bound on search precision set by the user;
S2, extracting an initial sample S from the data set D and then, taking S as the universe, sampling with replacement m times to obtain {S_1, ..., S_m};
S3, performing an approximate calculation on the m results {Op(S_1), ..., Op(S_m)} produced by applying the operation Op to the samples from step S2, to obtain the relative error value of the coefficient of variation;
and S4, evaluating the relative error of the coefficient of variation from step S3 to obtain a search result that meets the user's approximate precision.
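Steps S1 and S2 can be illustrated with a minimal Python sketch that runs on a single machine rather than on Hadoop. The function name bootstrap_results, the sample_fraction parameter and the use of the mean as Op are assumptions made for illustration only; they do not come from the specification.

```python
import random
import statistics


def bootstrap_results(data, op, sample_fraction, m, seed=0):
    """Draw one uniform sample S from data, then apply op to m resamples of S.

    Hypothetical helper: names and parameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    n0 = max(1, int(sample_fraction * len(data)))
    s = rng.sample(data, n0)  # S: drawn once from D, without replacement
    results = []
    for _ in range(m):
        resample = rng.choices(s, k=len(s))  # S_i: drawn from S with replacement
        results.append(op(resample))  # Op(S_i)
    return results


# Example request: Op is the mean, D is a list of numbers.
data = list(range(1, 10001))
results = bootstrap_results(data, statistics.fmean, sample_fraction=0.01, m=50)
print(len(results))
```

Only the initial sample S touches the full data set; every later resample is drawn from S, which is the property the specification credits for the reduced disk cost.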
As an improvement of the bootstrap-based accelerated search method of the present invention, the sampling process of step S2 includes setting the state of the Mapper plug-in of the Hadoop framework at the end of its sampling process to a paused state by default, and setting the Mapper plug-in to the terminated state only once the calculated approximate precision meets the user's requirement. In the original Hadoop system, when the sampling process on a Mapper plug-in completes, the node is marked as terminated and the resources it occupies are released. In the invention, the default state of the Mapper plug-in at the end of its sampling process is changed to paused, and its state identifier is not set to terminated until the calculated approximate precision meets the user's requirement. This avoids restarting the Mapper plug-in on every resampling, which effectively reduces total execution time and improves search efficiency.
As an improvement of the bootstrap-based accelerated search method of the present invention, the Reducer component in the Hadoop framework is configured so that it can process key/value pairs while the Mapper plug-in is in the paused state. In the prior art, the Mapper plug-in sends the finally processed key-value pairs to the Reducer component for merging only in the terminated state. Because the Reducer component here can process key/value pairs while the Mapper plug-in is paused, the number of times the Mapper plug-in sends key-value pairs is reduced and search efficiency is effectively improved.
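The pause-instead-of-terminate idea can be pictured with a toy state machine in Python. This is a simulation only: it does not use or reproduce any Hadoop API, and the names State, ToyMapper and run are hypothetical.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    PAUSED = "paused"
    TERMINATED = "terminated"


class ToyMapper:
    """Toy stand-in for a mapper task; not a Hadoop class."""

    def __init__(self):
        self.state = State.PAUSED
        self.starts = 1  # number of times the task had to be launched
        self.rounds = 0  # number of sampling rounds served

    def sample_round(self):
        if self.state is State.TERMINATED:
            raise RuntimeError("a terminated mapper cannot be reused")
        self.state = State.RUNNING
        self.rounds += 1
        self.state = State.PAUSED  # pause after sampling instead of terminating

    def finish(self):
        # Called only once the precision requirement is met.
        self.state = State.TERMINATED


def run(mapper, precision_ok_after):
    """Serve sampling rounds until the (simulated) precision check passes."""
    while mapper.rounds < precision_ok_after:
        mapper.sample_round()
    mapper.finish()
    return mapper


m = run(ToyMapper(), precision_ok_after=3)
print(m.rounds, m.starts, m.state)
```

In this toy model the task serves three resampling rounds with a single launch, which mirrors the restart cost the paragraph above says is avoided.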
As an improvement to the bootstrap-based accelerated search method described in the present invention, the relative error value of the coefficient of variation in step S3 satisfies the following formula:
Figure BDA0002271603690000031
Figure BDA0002271603690000032
where cv is the coefficient of variation, sm is the sample mean, sd is the sample standard deviation, v is the relative error value of the coefficient of variation, t is the search time limit, and i is a variable with i < m.
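The formulas themselves appear only as images in this text, so the following Python sketch rests on two assumptions: that cv follows the standard definition sd / sm, and that v compares the coefficients of variation of two consecutive rounds as a relative change. Neither form is confirmed by the source, and the function names are hypothetical.

```python
import statistics


def coefficient_of_variation(values):
    """cv = sd / sm (standard definition; assumed to match the patent's formula)."""
    sm = statistics.fmean(values)  # sample mean
    sd = statistics.stdev(values)  # sample standard deviation
    return sd / sm  # undefined when the mean is zero


def relative_error(cv_prev, cv_curr):
    """v_{i-1,i}: relative change in cv between consecutive rounds (assumed form)."""
    return abs(cv_curr - cv_prev) / cv_prev


cv = coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9])
print(cv, relative_error(0.5, 0.4))
```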
As an improvement of the bootstrap-based accelerated search method of the present invention, the evaluation process in step S4 compares the relative error value of the coefficient of variation with ε: if v_{i-1,i} > ε, the next round of iterative calculation is performed and v_{i,i+1} is computed, and so on until v_{j-1,j} ≤ ε in the j-th calculation, at which point the iterative process terminates and the corresponding operation result
Figure BDA0002271603690000041
is taken as the search result that meets the user's approximate precision, where j is a variable with j < m and ε is the relative error parameter with ε ≥ 0.
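The stopping rule can be sketched in Python as a scan over the per-round coefficients of variation. The sketch assumes that v_{j-1,j} is the relative change between consecutive cv values, since the exact formula is not reproduced in this text; the function name first_converged_round is hypothetical.

```python
def first_converged_round(cv_values, epsilon):
    """Return the first j with v_{j-1,j} <= epsilon, or None if no round qualifies.

    cv_values[j] is the coefficient of variation observed in round j.
    The form of v used here is an assumption, not the patent's stated formula.
    """
    for j in range(1, len(cv_values)):
        v = abs(cv_values[j] - cv_values[j - 1]) / cv_values[j - 1]
        if v <= epsilon:
            return j  # stop iterating; round j supplies the search result
    return None  # never converged within the rounds supplied


print(first_converged_round([0.50, 0.40, 0.39, 0.389], epsilon=0.05))
```

With ε = 0.05 the first comparison gives a relative change of about 0.2, so iteration continues; the second gives about 0.025, so the scan stops at j = 2.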
As an improvement of the bootstrap-based accelerated search method of the present invention, the evaluation process in step S4 further includes optimizing the input parameters of the initial sample S_0, where |S_0| = n_0 = q_B × N, q_B is the proportion of the data set D extracted as the initial sample, N is the number of records in the original data set, and the remaining input parameter is the number of samples drawn in the first resampling that takes S_0 as the universe.
As an improvement of the bootstrap-based accelerated search method of the present invention, the optimization process includes: when i ∈ [2, τ], |S_i| = n_i = 2·n_{i-1}; when i > τ, |S_i| = n_i = (i − τ + 1)·n_τ; where τ is a value set by the data platform, with 2 ≤ τ ≤ 10.
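The growth schedule for the resample size can be written out as a short Python sketch. It takes the first resample size n_1 as a given input and applies the two stated rules: doubling up to round τ, then linear growth in multiples of n_τ. The function name sample_size is hypothetical.

```python
def sample_size(i, n1, tau):
    """Size n_i of the i-th resample under the stated schedule.

    n_i = 2 * n_{i-1}             for 2 <= i <= tau
    n_i = (i - tau + 1) * n_tau   for i > tau
    n1 is the size of the first resample, taken as given.
    """
    if i < 1:
        raise ValueError("round index i must be at least 1")
    if not 2 <= tau <= 10:
        raise ValueError("tau must satisfy 2 <= tau <= 10")
    if i <= tau:
        return n1 * 2 ** (i - 1)  # closed form of repeated doubling
    n_tau = n1 * 2 ** (tau - 1)
    return (i - tau + 1) * n_tau


print([sample_size(i, n1=100, tau=5) for i in range(1, 8)])
# [100, 200, 400, 800, 1600, 3200, 4800]
```

Doubling reaches a useful sample size quickly in the early rounds, while the linear phase after τ keeps later rounds from growing exponentially.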
As an improvement of the bootstrap-based accelerated search method in the present invention, the method further includes setting a listener on an ApplicationMaster node of the Hadoop architecture, where the listener is configured to estimate a sample error according to an output of the Reducer component and check a termination condition of a resampling iteration process.
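The listener's role can be sketched in Python as an object that receives the Reducer outputs of each round and reports whether iteration may stop. The class ConvergenceListener and its method on_round are hypothetical, no Hadoop or YARN API is used, and the error estimate assumes cv = sd / sm with v taken as the relative change between consecutive rounds.

```python
import statistics


class ConvergenceListener:
    """Toy listener: tracks cv across rounds and checks the termination condition."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.prev_cv = None  # cv of the previous round, if any

    def on_round(self, reducer_outputs):
        """Return True when resampling may stop, False otherwise."""
        cv = statistics.stdev(reducer_outputs) / statistics.fmean(reducer_outputs)
        prev, self.prev_cv = self.prev_cv, cv
        if prev is None:
            return False  # nothing to compare against in the first round
        if prev == 0:
            return cv == 0  # avoid dividing by zero
        return abs(cv - prev) / prev <= self.epsilon


listener = ConvergenceListener(epsilon=0.05)
print(listener.on_round([9.0, 10.0, 11.0, 10.0]))
```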
The beneficial effects of the invention are as follows: compared with the prior art, sampling by the bootstrap method effectively reduces the number of samples required and improves search efficiency; because only one small uniform random sample needs to be extracted from the original data set, the disk cost of sampling is significantly reduced; in addition, the method is insensitive to the specific upper-layer operation.
Detailed Description
As used in the specification and in the claims, certain terms are used to refer to particular components. As one skilled in the art will appreciate, manufacturers may refer to a component by different names. The present specification and claims do not intend to distinguish between components that differ in name but not function. In the following description and in the claims, the terms "include" and "comprise" are used in an open-ended fashion, and thus should be interpreted to mean "include, but not limited to". "Substantially" means within an acceptable error range, within which a person skilled in the art can solve the technical problem and substantially achieve the technical result.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally connected; can be mechanically or electrically connected; they may be connected directly or indirectly through intervening media, or they may be interconnected between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to specific situations.
The present invention will be described in further detail with reference to specific examples, but the present invention is not limited thereto.
An accelerated search method based on the bootstrap method comprises the following steps:
S1, expressing a user search request under a Hadoop architecture as a triple Q = (Op, D, ρ), where Op denotes the search operation the user performs on a target data set D and ρ is the lower bound on search precision set by the user;
S2, extracting an initial sample S from the data set D and then, taking S as the universe, sampling with replacement m times to obtain {S_1, ..., S_m};
S3, performing an approximate calculation on the m results {Op(S_1), ..., Op(S_m)} produced by applying the operation Op to the samples from step S2, to obtain the relative error value of the coefficient of variation;
and S4, evaluating the relative error of the coefficient of variation from step S3 to obtain a search result that meets the user's approximate precision.
Preferably, the sampling process of step S2 includes setting the state of the Mapper plug-in of the Hadoop framework at the end of its sampling process to a paused state by default, and setting the Mapper plug-in to the terminated state only once the calculated approximate precision meets the user's requirement. In the original Hadoop system, when the sampling process on a Mapper plug-in completes, the node is marked as terminated and the resources it occupies are released. In the invention, the default state of the Mapper plug-in at the end of its sampling process is changed to paused, and its state identifier is not set to terminated until the calculated approximate precision meets the user's requirement. This avoids restarting the Mapper plug-in on every resampling, which effectively reduces total execution time and improves search efficiency.
Preferably, the Reducer component in the Hadoop architecture is configured so that it can process key/value pairs while the Mapper plug-in is in the paused state. In the prior art, the Mapper plug-in sends the finally processed key-value pairs to the Reducer component for merging only in the terminated state. Because the Reducer component here can process key/value pairs while the Mapper plug-in is paused, the number of times the Mapper plug-in sends key-value pairs is reduced and search efficiency is effectively improved.
Preferably, the relative error value of the coefficient of variation in step S3 satisfies the following formula:
Figure BDA0002271603690000071
where cv is the coefficient of variation, sm is the sample mean, sd is the sample standard deviation, v is the relative error value of the coefficient of variation, t is the search time limit, and i is a variable with i < m.
Preferably, the evaluation process in step S4 compares the relative error value of the coefficient of variation with ε: if v_{i-1,i} > ε, the next round of iterative calculation is performed and v_{i,i+1} is computed, and so on until v_{j-1,j} ≤ ε in the j-th calculation, at which point the iterative process terminates and the result of the corresponding operation is taken as the search result that meets the user's approximate precision, where j is a variable with j < m and ε is the relative error parameter with ε ≥ 0.
Preferably, the evaluation process in step S4 further includes optimizing the input parameters of the initial sample S_0, where |S_0| = n_0 = q_B × N, q_B is the proportion of the data set D extracted as the initial sample, the quantity
Figure BDA0002271603690000073
is the number of samples in the first resampling that takes S_0 as the universe, and N is the number of records in the original data set.
Preferably, the optimization process includes: when i ∈ [2, τ], |S_i| = n_i = 2·n_{i-1}; when i > τ, |S_i| = n_i = (i − τ + 1)·n_τ; where τ is a value set by the data platform, with 2 ≤ τ ≤ 10. In this embodiment τ = 5, although the invention is not limited to this value. Given N = |D| and the initial sample S_0, the Op operation is performed iteratively until
Figure BDA0002271603690000074
It should be noted that the end condition of the resampling depends on the relationship between the relative error v of the sample coefficient of variation and the relative error parameter ε. If the coefficient of variation cv still fails to converge after many rounds of resampling, the time overhead of resampling approaches or exceeds the theoretical overhead of sampling directly from the original data set D. A threshold parameter η may therefore also be set as the final termination condition of the sampling iteration, to reduce the time overhead when the bootstrap fails, where the threshold parameter η satisfies
Figure BDA0002271603690000081
In the present invention, η is 5.
Preferably, a listener is arranged on an ApplicationMaster node of the Hadoop architecture; the listener estimates the sample error from the output of the Reducer component and checks the termination condition of the resampling iteration process.
The foregoing description shows and describes several preferred embodiments of the invention, but as aforementioned, it is to be understood that the invention is not limited to the forms disclosed herein, but is not to be construed as excluding other embodiments, and is capable of use in various other combinations, modifications, and environments and is capable of changes within the scope of the inventive concept as expressed herein, commensurate with the above teachings, or the skill or knowledge of the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (8)

1. A bootstrap-based accelerated search method, characterized by comprising the following steps:
S1, expressing a user search request under a Hadoop architecture as a triple Q = (Op, D, ρ), where Op denotes the search operation the user performs on a target data set D and ρ is the lower bound on search precision set by the user;
S2, extracting an initial sample S from the data set D and then, taking S as the universe, sampling with replacement m times to obtain {S_1, ..., S_m};
S3, performing an approximate calculation on the m results {Op(S_1), ..., Op(S_m)} produced by applying the operation Op to the samples from step S2, to obtain the relative error value of the coefficient of variation;
and S4, evaluating the relative error of the coefficient of variation from step S3 to obtain a search result that meets the user's approximate precision.
2. The bootstrap-based accelerated search method as recited in claim 1, wherein: the sampling process of step S2 includes setting the state of the Mapper plug-in of the Hadoop framework at the end of its sampling process to a paused state by default, and setting the Mapper plug-in to the terminated state only once the calculated approximate precision meets the user's requirement.
3. The bootstrap-based accelerated search method as recited in claim 2, wherein: the Reducer component in the Hadoop architecture is configured to process key/value pairs while the Mapper plug-in is in the paused state.
4. The bootstrap-based accelerated search method as recited in claim 3, wherein the relative error value of the coefficient of variation in said step S3 satisfies the following formula:
where cv is the coefficient of variation, sm is the sample mean, sd is the sample standard deviation, v is the relative error value of the coefficient of variation, t is the search time limit, and i is a variable with i < m.
5. The bootstrap-based accelerated search method as recited in claim 4, wherein: the evaluation process in step S4 compares the relative error value of the coefficient of variation with ε: if v_{i-1,i} > ε, the next round of iterative calculation is performed and v_{i,i+1} is computed, and so on until v_{j-1,j} ≤ ε in the j-th calculation, at which point the iterative process terminates and the corresponding operation result
Figure FDA0002271603680000022
is taken as the search result that meets the user's approximate precision, where j is a variable with j < m and ε is the relative error parameter with ε ≥ 0.
6. The bootstrap-based accelerated search method as recited in claim 5, wherein: the evaluation process in step S4 further includes optimizing the input parameters of the initial sample S_0, where |S_0| = n_0 = q_B × N, q_B is the proportion of the data set D extracted as the initial sample, the quantity
Figure FDA0002271603680000023
is the number of samples in the first resampling that takes S_0 as the universe, and N is the number of records in the original data set.
7. The bootstrap-based accelerated search method as recited in claim 6, wherein said optimization procedure comprises:
when i ∈ [2, τ], |S_i| = n_i = 2·n_{i-1};
when i > τ, |S_i| = n_i = (i − τ + 1)·n_τ;
where τ is a value set by the data platform, with 2 ≤ τ ≤ 10.
8. The bootstrap-based accelerated search method as recited in claim 5, wherein: the method further comprises arranging a listener on an ApplicationMaster node of the Hadoop architecture, the listener being used to estimate the sample error from the output of the Reducer component and to check the termination condition of the resampling iteration process.
CN201911106961.XA 2019-11-13 2019-11-13 Bootstrap-method-based accelerated search method Pending CN110795473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911106961.XA CN110795473A (en) 2019-11-13 2019-11-13 Bootstrap-method-based accelerated search method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911106961.XA CN110795473A (en) 2019-11-13 2019-11-13 Bootstrap-method-based accelerated search method

Publications (1)

Publication Number Publication Date
CN110795473A true CN110795473A (en) 2020-02-14

Family

ID=69444462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911106961.XA Pending CN110795473A (en) 2019-11-13 2019-11-13 Bootstrap-method-based accelerated search method

Country Status (1)

Country Link
CN (1) CN110795473A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100325134A1 (en) * 2009-06-23 2010-12-23 International Business Machines Corporation Accuracy measurement of database search algorithms
CN102063524A (en) * 2010-12-13 2011-05-18 北京航空航天大学 Performance reliability simulation method based on improved self-adaption selective sampling
CN104199870A (en) * 2014-08-19 2014-12-10 桂林电子科技大学 Method for building LS-SVM prediction model based on chaotic search

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
周志刚: ""云环境下数据隐私保护与安全搜索技术研究"" *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112215299A (en) * 2020-10-26 2021-01-12 中山大学 Block bootstrap method for mean value estimation of hydrological meteorological space data
CN112215299B (en) * 2020-10-26 2023-08-15 中山大学 Block bootstrap method for hydrological space data mean value estimation

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200214)