CN106649723B - Large data set multi-pass random sampling method based on improved pond sampling - Google Patents


Publication number
CN106649723B
Authority
CN
China
Prior art keywords
data set
set file
pond
data
sampled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611203570.6A
Other languages
Chinese (zh)
Other versions
CN106649723A (en)
Inventor
许卓明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN201611203570.6A
Publication of CN106649723A
Application granted
Publication of CN106649723B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A multi-pass random sampling method for large data sets based on improved pond sampling is provided, comprising the following steps: opening a large data set file containing n data records so that data records can be read from it, inputting the size k of the pond used for random sampling, allocating memory for the pond that holds exactly k data records, and inputting the number m of random sampling passes, where k × m < n must hold; creating and opening a small data set file so that sampled data records can be written to it; initially marking all data records in the large data set file as "non-sampled"; repeating random sampling m times, where each pass uses the pond to randomly extract k "non-sampled" data records from the large data set file, appends them to the small data set file, and marks them as "sampled"; and closing the large and small data set files to complete the multi-pass random sampling process. The method is simple and easy to implement and has broad application prospects.

Description

Large data set multi-pass random sampling method based on improved pond sampling
Technical Field
The invention belongs to the technical field of random sampling, relates to a random sampling method and technology of a large data set, and particularly relates to a multi-pass random sampling method of the large data set based on improved pond sampling.
Background
Random sampling is a fundamental technique for many application problems in computer science, statistics and engineering, and is particularly useful for the statistical analysis of large data sets (big datasets). The field of random sampling offers many sampling algorithms (see the authoritative book: Yves Tillé. Sampling Algorithms. Springer Series in Statistics, Springer, New York, 2006), and pond sampling, known in English as reservoir sampling, is one of the classic ones (see Section 4.4.5, page 48, and Algorithm 4.4, page 49, of that book). The "pond" is a dedicated storage area allocated in computer memory that holds data records during and after the random sampling process. Pond sampling and its related concepts are common knowledge in the technical field.
The traditional pond sampling algorithm was proposed in the 1980s against a typical "small data" background. Big data applications have since become an urgent need, and the traditional algorithm, which performs random sampling only once, cannot meet it. Assume that a large data set file contains n data records and that the pond has size k (i.e., it holds exactly k data records). In practice n is very large while k is comparatively very small, because limited memory resources prevent a running program from allocating much memory to the pond. The traditional pond sampling algorithm can therefore extract only a very limited number k of data records from a large data set file. With so few samples, the result can hardly reflect the information content (such as statistical characteristics) of the original large data set file, which undermines the soundness of any subsequent data analysis.
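For orientation, the traditional single-pass algorithm referred to above is commonly presented in the following textbook form. This is a minimal Python sketch and is not taken from the patent; the function name `reservoir_sample` and the `seed` parameter are illustrative assumptions.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records drawn from an iterable in a single pass.

    Textbook single-pass pond (reservoir) sampling; names are illustrative.
    """
    rng = random.Random(seed)
    pond = []
    for i, record in enumerate(records, start=1):
        if i <= k:
            # The first k records fill the pond directly.
            pond.append(record)
        else:
            # Record i replaces a pond slot with probability k / i.
            j = rng.randint(1, i)
            if j <= k:
                pond[j - 1] = record
    return pond
```

However large the input is, the result never holds more than k records, which is the limitation the paragraph above describes.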
Therefore, the traditional pond sampling algorithm must be substantially improved (for details, see the distinguishing technical features of the invention described below) so that a single run can perform multiple random sampling passes over a large data set and obtain more random sample data, while maintaining the "random sampling" characteristic and without increasing the time complexity of the algorithm. The invention also aims to remedy the lack of a complete and practicable (multi-pass) random sampling method for large data sets.
Disclosure of Invention
The technical problem to be solved by the invention is to substantially improve the pond sampling algorithm and to provide a large data set multi-pass random sampling method based on improved pond sampling, thereby overcoming the inability of the traditional pond sampling algorithm to perform (multi-pass) random sampling of large data sets.
To solve this technical problem, the invention adopts the following technical solution.
The invention provides a large data set multi-pass random sampling method based on improved pond sampling, which comprises the following steps:
Step S1: opening a large data set file containing n data records so that data records can be read from it; inputting the size k of the memory area used for random sampling, namely the pond; allocating memory for the pond that holds exactly k data records; and inputting the number m of random sampling passes, where the product of k and m must be less than n, i.e. k × m < n;
Step S2: creating and opening a small data set file so that sampled data records can be written to it;
Step S3: initially marking all data records in the large data set file as "non-sampled";
Step S4: repeating random sampling m times, where each pass uses the pond to randomly extract k "non-sampled" data records from the large data set file, appends them to the small data set file, and marks them as "sampled";
Step S5: closing the large and small data set files to complete the multi-pass random sampling process.
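The set-up work in steps S1 and S3 (checking k × m < n, allocating the pond, and marking every record "non-sampled") might be sketched as follows. This is an illustrative Python sketch under assumptions: the function name `init_sampling_state`, the use of a `bytearray` for the marks, and the `source` list that later holds file positions are not prescribed by the text.

```python
def init_sampling_state(n, k, m):
    """Validate the parameters and build the bookkeeping structures.

    Returns (sampled, pond, source), where:
      sampled[p] is 0 for "non-sampled" and 1 for "sampled" (index 0 unused,
      so that positions are 1-based as in the method description);
      pond has room for exactly k records;
      source[i] will hold the file position of the record in pond slot i.
    """
    if k < 1 or m < 1:
        raise ValueError("k and m must be positive")
    if k * m >= n:
        raise ValueError("the method requires k * m < n")
    sampled = bytearray(n + 1)   # step S3: every record starts "non-sampled"
    pond = [None] * k            # step S1: room for exactly k records
    source = [0] * k
    return sampled, pond, source
```

One byte per record keeps the marks far smaller than the records themselves, although the marks still grow linearly with n.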
In the method, step S4 further includes:
using the random sampling pass number as the loop variable of a count-controlled loop with initial value 1, end value m and increment 1, the following steps are executed m times:
Step S4-1: copying the k "non-sampled" data records at the front of the large data set file into the pond, and marking those data records as "sampled";
Step S4-2: randomly replacing, with decreasing probability, data records in the pond with "non-sampled" data records from the back of the large data set file, restoring the mark of each replaced data record to "non-sampled" and marking each replacing data record as "sampled";
Step S4-3: appending the k data records in the pond, which no longer change, to the end of the small data set file.
In the method, step S4-1 further includes:
using the current record position in the pond as the loop variable of a condition-controlled loop with initial value 1 and the loop condition that the loop variable is less than or equal to k, and starting from current record position 1 in the large data set file, the following steps are executed repeatedly:
Step S4-1-1: if the current data record of the large data set file is "non-sampled", the following processing is performed:
copying the current data record of the large data set file to the current record position in the pond;
recording, for the copied data record in the pond, the position of its source data record in the large data set file;
marking the copied data record in the large data set file as "sampled";
incrementing the current record position of the pond by 1;
Step S4-1-2: incrementing the current record position of the large data set file by 1.
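Step S4-1 can be sketched in Python as below. The sketch assumes the large data set is available as an in-memory list for brevity, whereas the method itself reads from a file; the function name `fill_pond`, the `sampled` mark array and the guard against running past the end are illustrative assumptions.

```python
def fill_pond(records, sampled, k):
    """Step S4-1: copy the first k "non-sampled" records into the pond.

    records: list of data records (position p is records[p - 1]).
    sampled: bytearray of length len(records) + 1; 1 means "sampled".
    Returns (pond, source, r), where source[i] is the file position of
    pond[i] and r is the file position at which step S4-2 starts.
    """
    pond, source = [], []
    pos = 1                               # current position in the large file
    while len(pond) < k:                  # pond position is at most k
        if pos > len(records):
            # Guard added for this sketch; k * m < n prevents this case.
            raise ValueError("not enough non-sampled records")
        if not sampled[pos]:              # step S4-1-1
            pond.append(records[pos - 1])
            source.append(pos)
            sampled[pos] = 1
        pos += 1                          # step S4-1-2
    return pond, source, pos
```

Records already marked "sampled" by an earlier pass are skipped, so "the front of the file" shifts further back with every pass.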
In the method, step S4-2 further includes:
using the current record position in the large data set file as the loop variable of a count-controlled loop with initial value r (the current record position of the large data set file), end value n and increment 1, the following steps are executed (n - r + 1) times:
Step S4-2-1: if the current data record of the large data set file is "non-sampled", the following steps are performed:
Step S4-2-1-1: generating a random integer j between 1 and the current record position of the large data set file;
Step S4-2-1-2: if the random integer j is less than or equal to the pond size k, the following processing is performed:
using the recorded source positions, restoring the mark of the data record in the large data set file that corresponds to the data record at position j in the pond to "non-sampled";
replacing the data record at position j in the pond with the current data record of the large data set file;
recording, for the replacing data record in the pond, the position of its source data record in the large data set file;
marking the data record in the large data set file that was used for the replacement as "sampled".
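The replacement loop of step S4-2 might look as follows in Python. As with the previous steps this is a sketch over an in-memory list rather than a file; the name `replace_in_pond` and the `seed` parameter are illustrative assumptions, and the random integer j is drawn exactly as the text states, from 1 to the current file position.

```python
import random


def replace_in_pond(records, sampled, pond, source, r, seed=None):
    """Step S4-2: scan positions r..n and randomly replace pond entries.

    pond, source and sampled are updated in place; source[i] is the
    file position of the record currently held in pond[i].
    """
    rng = random.Random(seed)
    k = len(pond)
    for pos in range(r, len(records) + 1):
        if sampled[pos]:                  # step S4-2-1: skip "sampled" records
            continue
        j = rng.randint(1, pos)           # step S4-2-1-1
        if j <= k:                        # step S4-2-1-2
            sampled[source[j - 1]] = 0    # evicted record is "non-sampled" again
            pond[j - 1] = records[pos - 1]
            source[j - 1] = pos
            sampled[pos] = 1              # replacing record is "sampled"
```

Each replacement clears one mark and sets another, so exactly k records stay marked by the current pass when the loop ends.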
The beneficial technical effects of the invention fall into four aspects: (1) it overcomes the inability of the traditional pond sampling algorithm to perform (multi-pass) random sampling of large data sets; (2) it provides an effective random sampling method for large data sets; (3) the method is simple and easy to implement; (4) the method has broad application prospects in fields such as big data analysis.
The following further describes embodiments of the present invention with reference to the accompanying drawings. Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
Fig. 1 is a flow chart of the steps of the large data set multi-pass random sampling method based on improved pond sampling according to the technical solution of the invention;
Fig. 2 is a schematic diagram of the multi-pass random sampling procedure in the large data set multi-pass random sampling method based on improved pond sampling according to the technical solution of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar concepts, objects, elements, etc., or concepts, objects, elements, etc., having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs and in the relevant art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
In order to solve the technical problems, the invention is realized by the following technical scheme:
according to the present invention, there is provided a method for multi-pass random sampling of large data sets based on improved pond sampling, as shown in fig. 1, comprising the following steps S1 to S5:
Step S1: a large data set file containing n data records is opened so that data records can be read from it; the size k of the memory area used for random sampling, namely the pond, is input; memory that holds exactly k data records is allocated for the pond; and the number m of random sampling passes is input, where the product of k and m must be less than n, i.e. k × m < n.
Step S2: a small data set file is created and opened so that sampled data records can be written to it.
Step S3: all data records in the large data set file are initially marked as "non-sampled".
Step S4: random sampling is repeated m times; each pass uses the pond to randomly extract k "non-sampled" data records from the large data set file, appends them to the small data set file, and marks them as "sampled". Step S4 further includes:
using the random sampling pass number as the loop variable of a count-controlled loop with initial value 1, end value m and increment 1, the following steps S4-1, S4-2 and S4-3 are executed m times:
Step S4-1: the k "non-sampled" data records at the front of the large data set file are copied into the pond, and those data records are marked as "sampled". Step S4-1 further includes:
using the current record position in the pond as the loop variable of a condition-controlled loop with initial value 1 and the loop condition that the loop variable is less than or equal to k, and starting from current record position 1 in the large data set file, the following steps S4-1-1 and S4-1-2 are executed repeatedly:
Step S4-1-1: if the current data record of the large data set file is "non-sampled", the following processing is performed:
copying the current data record of the large data set file to the current record position in the pond;
recording, for the copied data record in the pond, the position of its source data record in the large data set file;
marking the copied data record in the large data set file as "sampled";
incrementing the current record position of the pond by 1;
Step S4-1-2: the current record position of the large data set file is incremented by 1.
Step S4-2: data records in the pond are randomly replaced, with decreasing probability, by "non-sampled" data records from the back of the large data set file; the mark of each replaced data record is restored to "non-sampled", and each replacing data record is marked as "sampled". Step S4-2 further includes:
using the current record position in the large data set file as the loop variable of a count-controlled loop with initial value r (the current record position of the large data set file), end value n and increment 1, the following step S4-2-1 is executed (n - r + 1) times:
Step S4-2-1: if the current data record of the large data set file is "non-sampled", the following steps S4-2-1-1 and S4-2-1-2 are performed:
Step S4-2-1-1: generating a random integer j between 1 and the current record position of the large data set file;
Step S4-2-1-2: if the random integer j is less than or equal to the pond size k, the following processing is performed:
using the recorded source positions, restoring the mark of the data record in the large data set file that corresponds to the data record at position j in the pond to "non-sampled";
replacing the data record at position j in the pond with the current data record of the large data set file;
recording, for the replacing data record in the pond, the position of its source data record in the large data set file;
marking the data record in the large data set file that was used for the replacement as "sampled".
Step S4-3: the k data records in the pond, which no longer change, are appended to the end of the small data set file.
Step S5: the large and small data set files are closed to complete the multi-pass random sampling process.
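Steps S1 to S5 can be put together in a file-based sketch such as the one below. It is a hedged illustration, not the patent's own listing (which is only available as an image): it assumes a text file with one record per line, counts n by reading the file once, rewinds the file at the start of every pass, and keeps the "sampled" marks in memory rather than in the file. The name `multi_pass_sample` and its parameters are hypothetical.

```python
import random


def multi_pass_sample(large_path, small_path, k, m, seed=None):
    """Write k * m sampled records to small_path and return that count."""
    rng = random.Random(seed)
    # Steps S1 and S2: open the large file for reading, create the small file.
    with open(large_path, encoding="utf-8") as large, \
            open(small_path, "w", encoding="utf-8") as small:
        n = sum(1 for _ in large)           # number of records (assumed: lines)
        if k < 1 or m < 1 or k * m >= n:
            raise ValueError("the method requires k * m < n")
        sampled = bytearray(n + 1)          # step S3; index 0 is unused
        for _ in range(m):                  # step S4: m passes
            large.seek(0)
            pond, source = [], []
            for pos, line in enumerate(large, start=1):
                if sampled[pos]:
                    continue                # only "non-sampled" records count
                record = line.rstrip("\n")
                if len(pond) < k:           # step S4-1: fill the pond
                    pond.append(record)
                    source.append(pos)
                    sampled[pos] = 1
                else:                       # step S4-2: random replacement
                    j = rng.randint(1, pos)
                    if j <= k:
                        sampled[source[j - 1]] = 0
                        pond[j - 1] = record
                        source[j - 1] = pos
                        sampled[pos] = 1
            # Step S4-3: append the final pond contents to the small file.
            small.writelines(rec + "\n" for rec in pond)
    # Step S5: both files are closed when the with-block exits.
    return k * m
```

Because records written in one pass stay marked "sampled", later passes cannot select them again, so the small file ends up with k × m distinct record positions.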
The specific implementation of the technical solution is further described below through a preferred embodiment, which also demonstrates the beneficial technical effects of the invention. The following pseudo-code gives a program-level implementation of the technical solution:
[Figure BDA0001189542760000061: pseudo-code of the method]
In this pseudo-code description, the bold-faced statements have their conventional meanings in computer science and related fields. For example, create/open/close file creates, opens or closes a file; for ... do ... end for is a count-controlled loop; while ... do ... end while is a condition-controlled loop; and if ... then ... end if is a conditional statement. The variables and symbols used in the pseudo-code and their meanings are listed in Table 1.
Table 1. Variables and symbols in the pseudo-code and their meanings
[Figure BDA0001189542760000071: Table 1]
Further, the following visually explains the processing procedure of multi-pass random sampling (i.e. step S4) of the large data set, which is a core technical feature in the technical solution of the present invention, with reference to fig. 2. As shown in fig. 2, step S4 requires repeating m times of random sampling under the premise that k × m < n is satisfied (in practical applications, k × m is usually much smaller than n). In each random sampling pass, k "non-sampled" data records are randomly extracted from the large data set file (see left part of fig. 2) using a pond (see middle part of fig. 2) and added to the small data set file (see right part of fig. 2) while they are marked as "sampled". As shown by the arrows, explanatory words and legend in fig. 2, the specific processing procedure of step S4 is as follows:
taking the random sampling pass as a loop variable of the counting control loop, wherein the initial value is 1, the end value is m, the increment step size is 1, and the following steps S4-1, S4-2 and S4-3 are repeatedly executed for m times:
Step S4-1: the k "non-sampled" data records at the front of the large data set file (for the meaning of "front", see the description at point 2 in Fig. 2) are copied into the pond, and those data records are marked as "sampled" (as indicated by the solid arrows and text labels below the left-middle portion of Fig. 2).
Step S4-2: data records in the pond are randomly replaced, with decreasing probability, by "non-sampled" data records from the back of the large data set file (for the meaning of "back", see the description at point 2 in Fig. 2); each replaced data record is marked as "non-sampled" again, and each replacing data record is marked as "sampled" (as indicated by the solid arrows and text labels above the left-middle portion of Fig. 2).
Step S4-3: the k data records in the pond, which no longer change, are appended to the end of the small data set file (as indicated by the solid arrows and text labels below the middle-right portion of Fig. 2).
It should further be noted that, as the text annotations in Fig. 2 indicate, j in step S4-2 is a random integer between 1 and the current record position of the large data set file. Only when j <= k is the data record at position j in the pond replaced by a "non-sampled" data record from the back of the large data set file (see step S4-2-1-2, or the condition in step 16 of the pseudo-code). Because the current record position of the large data set file keeps increasing during processing, the probability that the condition holds and a replacement occurs gradually decreases. These features are the core by which the improved pond sampling algorithm of the invention retains the "random sampling" characteristic.
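The decreasing replacement probability can be made concrete with a small calculation. If j is drawn uniformly from 1 to the current file position, the chance that j is at most k is k divided by that position (capped at 1). The Python helper below is an illustrative sketch; the name `replacement_probability` is an assumption, and uniformity of j is assumed because the text only says "random integer".

```python
from fractions import Fraction


def replacement_probability(position, k):
    """Probability that a uniform integer j in 1..position satisfies j <= k."""
    if position < 1 or k < 1:
        raise ValueError("position and k must be positive")
    return min(Fraction(1), Fraction(k, position))
```

With k = 5, for example, the probability is 1/2 at file position 10 and 1/20 at file position 100.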
After the above procedure has been executed m times in a loop, m random sampling passes over the large data set file are complete, and the small data set file holds k × m data records drawn at random from the large data set file as the sampling result.
The traditional pond (reservoir) sampling algorithm has linear time complexity O(n). The improved pond sampling algorithm of the invention has time complexity O(m × n) = O(n), since m is a constant much smaller than n, and is therefore still linear.
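The bound can be spelled out by counting record inspections per pass. The notation r_p, for the file position at which step S4-2 starts in pass p, is introduced here for illustration only: step S4-1 inspects positions 1 to r_p - 1 and step S4-2 inspects positions r_p to n, each with constant work.

```latex
T(n) \;=\; \sum_{p=1}^{m} \bigl[(r_p - 1) + (n - r_p + 1)\bigr]
     \;=\; \sum_{p=1}^{m} n
     \;=\; m\,n
     \;=\; O(n) \quad \text{for constant } m .
```

If m is allowed to grow with n, the same count gives O(m × n) rather than O(n).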
In summary, as the foregoing technical solution and specific embodiments (including the preferred embodiment) show, the beneficial technical effects of the invention fall into four aspects: (1) it overcomes the inability of the traditional pond sampling algorithm to perform (multi-pass) random sampling of large data sets; (2) it provides an effective random sampling method for large data sets; (3) the method is simple and easy to implement; (4) the method has broad application prospects in fields such as big data analysis.
The foregoing describes only some embodiments of the invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the invention, and such modifications and refinements should also be regarded as falling within the protection scope of the invention.

Claims (1)

1. A large data set multi-pass random sampling method based on improved pond sampling, comprising the following steps:
step S1: opening a large data set file containing n data records so that data records can be read from it; inputting the size k of the memory area used for random sampling, namely the pond; allocating memory for the pond that holds exactly k data records; and inputting the number m of random sampling passes, where the product of k and m must be less than n, i.e. k × m < n;
step S2: creating and opening a small data set file so that sampled data records can be written to it;
step S3: initially marking all data records in the large data set file as "non-sampled";
step S4: repeating random sampling m times, where each pass uses the pond to randomly extract k "non-sampled" data records from the large data set file, appends them to the small data set file, and marks them as "sampled";
step S5: closing the large and small data set files to complete the multi-pass random sampling process;
the step S4 further includes:
using the random sampling pass number as the loop variable of a count-controlled loop with initial value 1, end value m and increment 1, executing the following steps m times:
step S4-1: copying the k "non-sampled" data records at the front of the large data set file into the pond, and marking those data records as "sampled";
step S4-2: randomly replacing, with decreasing probability, data records in the pond with "non-sampled" data records from the back of the large data set file, restoring the mark of each replaced data record to "non-sampled" and marking each replacing data record as "sampled";
step S4-3: appending the k data records in the pond, which no longer change, to the end of the small data set file;
the step S4-1 further includes:
using the current record position in the pond as the loop variable of a condition-controlled loop with initial value 1 and the loop condition that the loop variable is less than or equal to k, and starting from current record position 1 in the large data set file, executing the following steps repeatedly:
step S4-1-1: if the current data record of the large data set file is "non-sampled", performing the following processing:
copying the current data record of the large data set file to the current record position in the pond;
recording, for the copied data record in the pond, the position of its source data record in the large data set file;
marking the copied data record in the large data set file as "sampled";
incrementing the current record position of the pond by 1;
step S4-1-2: incrementing the current record position of the large data set file by 1;
the step S4-2 further includes:
using the current record position in the large data set file as the loop variable of a count-controlled loop with initial value r (the current record position of the large data set file), end value n and increment 1, executing the following steps (n - r + 1) times:
step S4-2-1: if the current data record of the large data set file is "non-sampled", performing the following steps:
step S4-2-1-1: generating a random integer j between 1 and the current record position of the large data set file;
step S4-2-1-2: if the random integer j is less than or equal to the pond size k, performing the following processing:
using the recorded source positions, restoring the mark of the data record in the large data set file that corresponds to the data record at position j in the pond to "non-sampled";
replacing the data record at position j in the pond with the current data record of the large data set file;
recording, for the replacing data record in the pond, the position of its source data record in the large data set file;
marking the data record in the large data set file that was used for the replacement as "sampled".
CN201611203570.6A 2016-12-23 2016-12-23 Large data set multi-pass random sampling method based on improved pond sampling Active CN106649723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611203570.6A CN106649723B (en) 2016-12-23 2016-12-23 Large data set multi-pass random sampling method based on improved pond sampling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611203570.6A CN106649723B (en) 2016-12-23 2016-12-23 Large data set multi-pass random sampling method based on improved pond sampling

Publications (2)

Publication Number Publication Date
CN106649723A CN106649723A (en) 2017-05-10
CN106649723B true CN106649723B (en) 2020-09-18

Family

ID=58826638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611203570.6A Active CN106649723B (en) 2016-12-23 2016-12-23 Large data set multi-pass random sampling method based on improved pond sampling

Country Status (1)

Country Link
CN (1) CN106649723B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108984152B (en) * 2018-08-21 2021-01-29 北京睦合达信息技术股份有限公司 Data processing method, system and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8332449B2 (en) * 2003-09-23 2012-12-11 The Directv Group, Inc. Sample generation method and system for digital simulation processes
CN105005586A (en) * 2015-06-24 2015-10-28 华中科技大学 Degree feature replacement policy based stream type graph sampling method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102262678A (en) * 2011-08-16 2011-11-30 郑毅 System for sampling mass data and managing sampled data
CN102393839B (en) * 2011-11-30 2014-05-07 中国工商银行股份有限公司 Parallel data processing system and method
CN106203534A (en) * 2016-07-26 2016-12-07 南京航空航天大学 A kind of cost-sensitive Software Defects Predict Methods based on Boosting


Also Published As

Publication number Publication date
CN106649723A (en) 2017-05-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant