CN103577487A - Method and device of testing index function of search engine - Google Patents


Info

Publication number
CN103577487A
Authority
CN
China
Prior art keywords
interval
index file
data
search engine
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210279847.9A
Other languages
Chinese (zh)
Inventor
罗峰
黄苏支
李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IZP (BEIJING) TECHNOLOGIES Co Ltd
Original Assignee
IZP (BEIJING) TECHNOLOGIES Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by IZP (BEIJING) TECHNOLOGIES Co Ltd
Priority to CN201210279847.9A
Publication of CN103577487A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for testing the index function of a search engine. The method comprises the following steps: indexing reference data with a new version of the search engine to obtain a corresponding new-version index file, wherein the new version of the search engine is the search engine under test; and comparing the contents of an old-version index file and the new-version index file as file streams, the test passing if they are identical and failing if they are not, wherein the old-version index file is obtained by indexing the reference data with an old version of the search engine. The method and the device improve the efficiency of testing the index function of a search engine.

Description

Method and device for testing the index function of a search engine
Technical field
The present application relates to the field of Internet technology, and in particular to a method and a device for testing the index function of a search engine.
Background
A search engine is a system that, following a certain strategy, uses specific computer programs to gather information from the Internet, organizes and processes that information, provides a retrieval service to users, and displays the information relevant to a user's search. In short, its functions are data preprocessing, index building, and accepting query requests and returning results. Data preprocessing means converting data obtained through various channels into structured data according to the search engine's requirements; index building means building corresponding indexes on the various fields of the structured data; accepting query requests and returning results means searching the built indexes with query keywords and returning the structured data that the retrieved indexes point to.
In general, search engine versions are updated very frequently, and after each update the updated search engine must undergo function and performance testing to determine whether it meets the function and performance requirements. (Relative to the old-version search engine before the update, the updated search engine may be called the new-version search engine.)
One prior-art method tests the index function of a search engine indirectly, through retrieval. A retrieval result and the corresponding search condition are preset; a retrieval is run with that search condition over the indexed structured data; and the actual retrieval result is compared with the preset retrieval result. If the two are identical the test passes; otherwise the test fails.
To guarantee test coverage, the indirect retrieval of the prior art must traverse all indexes. However, a search engine has a great many indexes, and the number of data records under a single index is also huge. In practice, indirect retrieval has to retrieve hundreds of millions of data records one by one, which increases test time and reduces test efficiency. Moreover, agile development has shortened the update cycle of search engine versions to 3 days or even less, so the prior art can hardly keep up with the update speed of search engine versions in terms of test efficiency.
In short, a technical problem that those skilled in the art urgently need to solve is how to improve the efficiency of testing the index function of a search engine.
Summary of the invention
The technical problem to be solved by the present application is to provide a method and a device for testing the index function of a search engine that can improve the efficiency of testing that function.
To solve the above problem, the present application discloses a method for testing the index function of a search engine, comprising:
Indexing reference data with a new-version search engine to obtain a corresponding new-version index file, wherein the new-version search engine is the search engine under test;
Comparing the contents of an old-version index file and the new-version index file as file streams; if they are identical the test passes, and if they differ the test fails; wherein the old-version index file is obtained by indexing the reference data with an old-version search engine.
Preferably, the step of comparing the contents of the old-version index file and the new-version index file as file streams further comprises:
Comparing all or part of the contents of the old-version index file and the new-version index file as file streams.
Preferably, the step of comparing part of the contents of the old-version index file and the new-version index file as file streams further comprises:
Extracting data from several same-position intervals of the old-version index file stream and of the new-version index file stream respectively; "same-position" means that the interval positions used when extracting data from the old-version index file stream and from the new-version index file stream are identical;
Comparing the data of the several same-position intervals of the old-version index file stream and the new-version index file stream respectively; the test passes when the data comparison results of all same-position intervals are identical, and fails when the data comparison result of any same-position interval differs.
Preferably, the data of an interval of a file stream is extracted through the following steps:
Obtaining the file pointer of the file stream;
Locating the start position of the interval by offsetting the file pointer, and reading the corresponding data of the file stream from the start position of the interval according to the file pointer and the interval length.
Preferably, the start positions of the intervals can be obtained through the following step:
Obtaining the start position of each interval according to a preset number of intervals, interval length and gap length; the obtaining process comprises: taking the sum of the start position of the preceding interval, the interval length of the preceding interval, and the gap length between the preceding interval and the following interval as the start position of the following interval.
Preferably, the start positions of the intervals can be obtained through the following steps:
Determining the file stream length and the number of intervals;
Generating random numbers whose values lie within the range of the file stream length and whose count is twice the number of intervals;
Obtaining the start position of each interval from the random numbers.
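The random scheme above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how the random numbers are turned into intervals, so the sketch assumes the sorted numbers are paired into (start, end) positions. The helper name `random_intervals` and the fixed `seed` (used so that the old and new streams are sampled at the same positions) are hypothetical.

```python
import random


def random_intervals(stream_length, interval_count, seed=0):
    """Hypothetical helper: derive (start, end) pairs from 2 * n random offsets."""
    # A fixed seed reproduces the same intervals for the old and the new stream.
    rng = random.Random(seed)
    # Draw 2 * n distinct offsets inside [0, stream_length - 1] and sort them.
    points = sorted(rng.sample(range(stream_length), 2 * interval_count))
    # Assumption: consecutive offsets are paired, so the intervals never overlap.
    return [(points[i], points[i + 1]) for i in range(0, len(points), 2)]


intervals = random_intervals(stream_length=100, interval_count=5, seed=42)
```

The sketch requires the stream length to be at least twice the number of intervals; otherwise `random.sample` raises `ValueError`.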
Preferably, the file stream is a text stream or a binary stream.
In another aspect, the present application also discloses a device for testing the index function of a search engine, comprising:
An index module, configured to index reference data with a new-version search engine to obtain a corresponding new-version index file, wherein the new-version search engine is the search engine under test;
A comparison module, configured to compare the contents of an old-version index file and the new-version index file as file streams; if they are identical the test passes, and if they differ the test fails; wherein the old-version index file is obtained by indexing the reference data with an old-version search engine.
Preferably, the comparison module further comprises:
A full comparison submodule, configured to compare the entire contents of the old-version index file and the new-version index file as file streams; or
A partial comparison submodule, configured to compare part of the contents of the old-version index file and the new-version index file as file streams.
Preferably, the partial comparison submodule further comprises:
An interval extraction unit, configured to extract data from several same-position intervals of the old-version index file stream and of the new-version index file stream respectively; "same-position" means that the interval positions used when extracting data from the old-version index file stream and from the new-version index file stream are identical; and
A comparing unit, configured to compare the data of the several same-position intervals of the old-version index file stream and the new-version index file stream respectively; the test passes when the data comparison results of all same-position intervals are identical, and fails when the data comparison result of any same-position interval differs.
Preferably, the device further comprises a file stream interval data extraction module for extracting the data of an interval of a file stream;
The file stream interval data extraction module further comprises:
A pointer obtaining submodule, configured to obtain the file pointer of the file stream; and
A reading submodule, configured to locate the start position of the interval by offsetting the file pointer, and to read the corresponding data of the file stream from the start position of the interval according to the file pointer and the interval length.
Preferably, the device further comprises a start position obtaining module for obtaining the start positions of the intervals;
The start position obtaining module further comprises:
A preset obtaining submodule, configured to obtain the start position of each interval according to a preset number of intervals, interval length and gap length; the obtaining process comprises: taking the sum of the start position of the preceding interval, the interval length of the preceding interval, and the gap length between the preceding interval and the following interval as the start position of the following interval.
Preferably, the device further comprises a start position obtaining module for obtaining the start positions of the intervals;
The start position obtaining module further comprises:
A determining submodule, configured to determine the file stream length and the number of intervals;
A random number submodule, configured to generate random numbers whose values lie within the range of the file stream length and whose count is twice the number of intervals;
A random obtaining submodule, configured to obtain the start position of each interval from the random numbers.
Preferably, the file stream is a text stream or a binary stream.
Compared with the prior art, the present application has the following advantages:
The present application indexes reference data with the new-version search engine to obtain a corresponding new-version index file, and tests the old-version index file against the new-version index file by file stream comparison to obtain the test result. Comparing the contents of the old-version index file and the new-version index file as file streams is all that is needed to test the index function of the new-version search engine. Because this avoids the prior art's one-by-one retrieval of hundreds of millions of data records, it can effectively improve the efficiency of testing the index function of a search engine.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of a method for testing the index function of a search engine according to the present application;
Fig. 2 is a structural diagram of an embodiment of a device for testing the index function of a search engine according to the present application.
Detailed description of the embodiments
To make the above objects, features and advantages of the present application more apparent, the application is described in further detail below with reference to the drawings and specific embodiments.
Referring to Fig. 1, a flowchart of an embodiment of a method for testing the index function of a search engine according to the present application is shown; the method may specifically comprise:
Step 101: index reference data with the new-version search engine to obtain a corresponding new-version index file, wherein the new-version search engine is the search engine under test;
As is well known in the art, indexing is one of the core technologies of a search engine. A search engine organizes, classifies and indexes the information it has collected to produce an index database. The index database usually stores the indexes built on the various fields of the structured data, and an index file stores the links from an index to the indexed raw data.
The present application uses reference data as the indexed raw data; it may be any kind of structured or unstructured data. For example, one example of structured data is a row of a database, that is, a data record; one example of unstructured data is an Internet article. Other examples of structured or unstructured data may be handled by analogy.
In one application example of the present application, the process of indexing the reference data with a search engine to obtain the corresponding index file may further comprise: segmenting the reference data into words; and searching the segmentation result for words that match an index, and recording in the index file the link information from the index to the reference data.
Word segmentation uses certain rules and a dictionary to split a sentence into individual words in preparation for automatic indexing; it is not described further here. The link information from an index to the reference data usually includes identification information of the matched word, the position of that word in the reference data, and so on.
Taking structured data as an example, suppose the reference data is stored in a database, the index is "China", and the content of a certain data record contains "Communist Party of China". That content can be segmented into "China" and "Communist Party", and the field name of the data record, the number of the data record, and the position of "China" within the data record are recorded in the index file of "China".
Taking unstructured data as an example, suppose the index is "real estate" and the reference data comprises articles A, B and C, where articles B and C both contain "real estate" and article A contains "China". The index file of "real estate" can then record information such as article B, the position of "real estate" in article B, article C, and the position of "real estate" in article C.
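The indexing process described above (segment the reference data, then record where each indexed word occurs) can be illustrated with a toy Python sketch. It is not the search engine's actual index format: the function name `build_index` is hypothetical, whitespace splitting stands in for a real word segmenter, and "real-estate" is hyphenated only so that it survives that splitting as one word.

```python
from collections import defaultdict


def build_index(documents):
    """Map each word to the (document id, word position) pairs where it occurs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        # Assumption: whitespace splitting stands in for a real word segmenter.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)


documents = {
    "A": "China economy",
    "B": "real-estate prices rise",
    "C": "new real-estate policy",
}
index = build_index(documents)
# index["real-estate"] lists articles B and C; article A does not appear.
```

As in the example in the text, the entry for "real-estate" records article B and article C together with the word's position in each, while article A is only reachable through "china".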
Step 102: compare the contents of the old-version index file and the new-version index file as file streams; if they are identical the test passes, and if they differ the test fails; wherein the old-version index file is obtained by indexing the reference data with the old-version search engine.
In the embodiments of the present application, if the version number of the new-version search engine is N, the version number of the old-version search engine may be any natural number in [1, N-1].
In one embodiment of the present application, the old-version index file can be obtained by indexing the reference data with the old-version search engine and saved to a corresponding data structure; then, each time the index function of the search engine is tested, the old-version index file only needs to be fetched from that data structure.
In another embodiment of the present application, the old-version index file can be obtained by indexing the reference data with the old-version search engine each time the index function of the search engine is tested.
A file stream is a unified interface for operating on files: a file is converted into a logical device, the stream. The stream hides the specific characteristics of the device, so an application can operate on all kinds of files in the same way.
A file stream usually has two modes: text stream and binary stream. A text stream treats the file as a stream of characters; in some languages certain conversions are performed when operating on a text stream, that is, the text stream hides the actual content of the file. A binary stream is a sequence of bytes in one-to-one correspondence with the content of the file. The two stream modes can be mixed when accessing a file, but a file is usually accessed in a single stream mode.
In a specific implementation, the index file can be converted into a file stream with a language's open method.
For example, Python's open method can be used to obtain a binary stream or a text stream, where indexdata is the file name:
1. Open as a binary stream: filestream = open('/indexdata', 'rb')
2. Open as a text stream: filestream = open('/indexdata', 'r')
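The difference between the two stream modes can be seen in a small Python 3 sketch. It assumes Python 3's default newline handling for text streams and uses a temporary file in place of a real index file; the variable names are illustrative only.

```python
import os
import tempfile

# Write a small file that contains a Windows-style line ending.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\r\nb")
    path = tmp.name

# Binary stream: the bytes come back exactly as stored.
with open(path, "rb") as stream:
    binary_content = stream.read()

# Text stream: Python 3 translates "\r\n" to "\n" while reading.
with open(path, "r", encoding="ascii") as stream:
    text_content = stream.read()

os.remove(path)
```

Because the text stream alters what is read, a byte-for-byte comparison of index files is more naturally done on binary streams.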
As another example, the C++ fstream class has a member function open() for opening a file, whose prototype is:
void open(const char* filename, int mode, int access);
Parameters:
filename: the name of the file to open
mode: the mode in which to open the file
access: the attribute of the opened file
The file open modes are defined in class ios; a commonly used value is ios::binary, which opens the file in binary mode; the default mode is text mode.
To open the file c:config.sys for binary input, the following statements can be executed:
fstream file1;
file1.open("c:config.sys", ios::binary | ios::in, 0);
It should be noted that Python and C++ above serve only as examples and do not limit the application; in fact, converting the index file into a file stream with the open method of languages such as Java, PHP or VB is also feasible.
In the embodiments of the present application, because the indexed reference data set is known and fixed, indexing the same reference data set with the two versions of the search engine should in theory produce the same index file, and converting the same index file into a file stream should produce the same file stream. Therefore, testing the old-version index file against the new-version index file by file stream comparison meets the accuracy and fine-granularity requirements of the test. More importantly, comparing the contents of the two index files by file stream comparison is all that is needed to test the index function of the new-version search engine; because this avoids the prior art's one-by-one retrieval of hundreds of millions of data records, it can effectively improve the efficiency of testing the index function of a search engine.
In the embodiments of the present application, step 102 may further comprise: comparing all or part of the contents of the old-version index file and the new-version index file as file streams.
For the scheme of comparing the entire contents of the old-version index file and the new-version index file as file streams, in a specific implementation all the data can be read from the old-version index file and from the new-version index file as file streams, and the two read results can then be compared. Languages such as Python, C++, Java, PHP and VB all provide functions for reading a specified amount of data and functions for comparing whether two pieces of data are equal.
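A minimal Python sketch of the full-content comparison follows. It assumes both index files are regular files opened as binary streams; the function name `streams_identical` and the `chunk_size` parameter are hypothetical, and reading in chunks rather than all at once is a choice made here to bound memory use, not something the text prescribes.

```python
def streams_identical(old_path, new_path, chunk_size=1 << 20):
    """Return True when both files hold exactly the same bytes."""
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        while True:
            old_chunk = old.read(chunk_size)
            new_chunk = new.read(chunk_size)
            if old_chunk != new_chunk:
                return False  # content or length differs: the test fails
            if not old_chunk:
                return True  # both streams ended together: the test passes
```

Under these assumptions, a test harness would pass the old-version and new-version index file paths and treat a `True` result as a passed test.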
The scheme of comparing part of the contents of the old-version index file and the new-version index file as file streams is described in detail below.
In a preferred embodiment of the present application, the step of comparing all or part of the contents of the old-version index file and the new-version index file as file streams may further comprise:
Sub-step A1: extract data from several same-position intervals of the old-version index file stream and of the new-version index file stream respectively; "same-position" means that the interval positions used when extracting data from the old-version index file stream and from the new-version index file stream are identical;
Sub-step A2: compare the data of the several same-position intervals of the old-version index file stream and the new-version index file stream respectively; the test passes when the data comparison results of all same-position intervals are identical, and fails when the data comparison result of any same-position interval differs.
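Sub-steps A1 and A2 can be sketched in Python as follows. The sketch assumes binary streams and represents each same-position interval as a hypothetical (start, length) pair; the function name `compare_intervals` is likewise an illustration, not part of the described method.

```python
def compare_intervals(old_path, new_path, intervals):
    """Compare the same (start, length) intervals of two files."""
    with open(old_path, "rb") as old, open(new_path, "rb") as new:
        for start, length in intervals:
            # A1: extract the data at the same position from both streams.
            old.seek(start)
            new.seek(start)
            # A2: a single differing interval is enough to fail the test.
            if old.read(length) != new.read(length):
                return False
    return True
```

Only the listed intervals are read, so a difference that lies outside every interval goes undetected; that is the trade-off the partial comparison accepts in exchange for speed.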
In the embodiments of the present application, the language-based open method described above can be used to convert the old-version index file and the new-version index file into the old-version index file stream and the new-version index file stream.
In a specific implementation, the data of same-position intervals can be compared with a language's comparison method.
For example, in C++: extern int memcmp(void *buf1, void *buf2, unsigned int count);
Usage: #include <string.h>
Function: compares the first count bytes of memory regions buf1 and buf2.
Description: when buf1 < buf2, the return value is < 0
When buf1 = buf2, the return value is = 0
When buf1 > buf2, the return value is > 0
As another example, Python's cmp(filestream1, filestream2) function (a Python 2 built-in, removed in Python 3) can also compare the two blocks of data filestream1 and filestream2, returning 0 if they are equal and a non-zero value if they are not. The corresponding functions of languages such as Java, PHP and VB may be used by analogy.
It should be noted that when there is more than one same-position interval, the test can be considered passed only when the data comparison results of all intervals (whose sizes are integer multiples of the minimum memory unit) are 0; when a non-zero value exists among the data comparison results of the same-position intervals, the test can be considered failed.
If a file stream is regarded as an ordered set T containing the entire content of the file stream, an interval in the present application can be understood as a subset S of that ordered set, containing part of the content of the file stream.
In a specific implementation, suppose the length of the file stream is L (the unit of L may be bytes, bits, and so on); the ordered set T can then be represented by the closed interval [0, L-1], and an interval in the present application can be a subset of [0, L-1].
Because only data from same-position intervals is comparable, the present application uses the technical term "same-position interval" to emphasize that the data must be extracted from intervals at the same positions of the old-version index file stream and the new-version index file stream.
In a preferred embodiment of the present application, the data of an interval of a file stream is extracted through the following steps:
Sub-step B1: obtain the file pointer of the file stream;
Sub-step B2: locate the start position of the interval by offsetting the file pointer, and read the corresponding data of the file stream from the start position of the interval according to the file pointer and the interval length.
Here, offsetting the file pointer means moving the file pointer to the start position of the interval.
Taking the start of the file stream as position 0, the start position of an interval can be determined as the current file pointer plus an offset; that is, the file pointer offset moves the file pointer from its current position to the start position of the interval.
For example, the function int fseek(FILE *stream, long offset, int origin) provided by C++ can move the file pointer from one position to another, where:
The first parameter, stream, is the file pointer
The second parameter, offset, is the offset; a positive number means a forward offset and a negative number means a backward offset
The third parameter, origin, specifies where in the file the offset starts from
As another example, the function size_t fread(void *ptr, size_t size, size_t count, FILE *stream) provided by C++ can read count elements of size bytes each, starting from the file pointer position.
Similarly, Python's seek() and read() functions provide functionality similar to fseek() and fread(): seek(offset, whence) moves the file pointer to offset relative to the reference point selected by whence (0 for the start of the stream, 1 for the current position, 2 for the end of the stream), and read(length) reads up to length bytes of data from the file pointer position. The corresponding functions of languages such as Java, PHP and VB may be used by analogy and are not described further here.
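Sub-steps B1 and B2 can be sketched in Python with a seek followed by a read. The helper name `read_interval` is hypothetical, and an in-memory `io.BytesIO` stream stands in for an opened index file so that the example is self-contained.

```python
import io


def read_interval(stream, start, length):
    """Read `length` bytes beginning at absolute offset `start`."""
    stream.seek(start, io.SEEK_SET)  # B2: offset counted from the stream start
    return stream.read(length)       # read the interval length from there


stream = io.BytesIO(b"0123456789")   # B1: the stream object carries the pointer
data = read_interval(stream, start=3, length=4)
```

Seeking with an absolute offset makes the helper independent of where the pointer was left by an earlier read.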
In practical applications, an interval in the embodiments of the present application has the following attributes:
Attribute 1: number of intervals;
There may be one interval or more than one.
When there is only one interval, the file pointer offset moves the file pointer from the start of the file stream to the start position of the interval. Taking the start of the file stream as position 0, the file pointer can be offset to A to locate the start position of the interval, and data of length B can then be read from position A; this interval can be expressed as [0+A, 0+A+B]. Unless otherwise specified, A, B, C, D, etc. in the embodiments of the present application denote offsets and are positive integers.
When there is more than one interval, the start positions of the several intervals need to be determined; the file pointer offset moves the file pointer from its current position to the start position of the first interval, and then from the start position of the first interval to the start position of the second interval.
Suppose that for the first interval the start position is 0+A and the preset interval length is B; the first interval can be expressed as [0+A, 0+A+B], and the data of length B in [0+A, 0+A+B] can be read. For the second interval, the start position is 0+A+B+C and the preset interval length is D; the second interval can be expressed as [0+A+B+C, 0+A+B+C+D], and the data of length D in [0+A+B+C, 0+A+B+C+D] can be read.
Assuming the file pointer starts at 0, the corresponding reading process for the two intervals above comprises:
Step S1: offset the file pointer to A with seek(A, 0);
Step S2: read data of length B from position A with read(B);
Step S3: offset the file pointer to A+B+C with seek(A+B+C, 0);
Step S4: read data of length D from position A+B+C with read(D).
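Steps S1 to S4 can be traced with a short Python sketch. The values chosen for A, B, C and D are hypothetical, and an in-memory binary stream whose bytes equal their own offsets stands in for an index file. Because the second argument of Python's seek selects a reference point (0, 1 or 2) rather than a position, the sketch uses absolute offsets from the stream start in both seeks.

```python
import io

A, B, C, D = 2, 3, 4, 5                 # hypothetical offsets and lengths
stream = io.BytesIO(bytes(range(20)))   # byte i of the stream has value i

stream.seek(A, 0)                       # S1: pointer moves to A
first = stream.read(B)                  # S2: read B bytes, pointer is now A + B
stream.seek(A + B + C, 0)               # S3: pointer moves to A + B + C
second = stream.read(D)                 # S4: read D bytes
```

After S2 the pointer already sits at A+B, so for a binary stream a relative `seek(C, 1)` would reach the same position in S3.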
Attribute 2: interval length;
For an interval, the interval length can be expressed as the absolute value of the difference between its start position and its end position, and data of the interval length can be read directly from the start position of the interval. For example, the interval lengths of [0+A, 0+A+B], [0+A+B, 0+A+B+C] and [0+A+B+C, 0+A+B+C+D] are B, C and D respectively.
It should be noted that the lengths of the several intervals in a file stream may be equal or unequal; that is, those skilled in the art can, as actually needed, set B, C and D to be equal or unequal.
In addition, those skilled in the art can, as actually needed, set the interval length to an integer multiple of the minimum memory unit.
Attribute 3: gap.
The gap in the embodiments of the present application is the spacing between adjacent intervals; specifically, it can be expressed as the difference between the start position of the following interval and the end position of the preceding interval. For example, the gap between [0+A, 0+A+B] and [0+A+B, 0+A+B+C] is 0, and the gap between [0+A, 0+A+B] and [0+A+B+C, 0+A+B+C+D] is C.
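The interval length and gap definitions above reduce to simple differences, as the following Python sketch shows. The helper names `interval_lengths` and `interval_gaps` and the values chosen for A, B, C and D are hypothetical; intervals are represented as (start, end) pairs in the same notation as the text.

```python
def interval_lengths(intervals):
    """Interval length: end position minus start position."""
    return [end - start for start, end in intervals]


def interval_gaps(intervals):
    """Gap: start of the following interval minus end of the preceding one."""
    return [nxt[0] - cur[1] for cur, nxt in zip(intervals, intervals[1:])]


A, B, C, D = 2, 3, 4, 5  # hypothetical values
intervals = [(A, A + B), (A + B + C, A + B + C + D)]
lengths = interval_lengths(intervals)   # [B, D]
gaps = interval_gaps(intervals)         # [C]
```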
With respect to the contrast of full content, the contrast tool of partial content has the following advantages:
1, the length that is certainly less than full content due to the length of partial content, in the situation that the content-length of contrast is little, can improve specific efficiency, and the reduced time can reduce, thereby can further improve the testing efficiency of search engine index function;
2, full content is that fix, unalterable, and arbitrary changing all can cause the variation of partial content in interval number, burst length and interval, therefore partial content is flexibly: interval number is the amount doesn't matter, burst length is changeable, interval layout can be dredged can be close etc.; In a word, those skilled in the art can be according to the actual requirements, adjusts one or more in interval number, burst length and interval, to reach the flexible enforcement of the contrast of partial content.
It should be noted that, because indexed benchmark dataset is known definite, adopt in theory the same benchmark dataset of search engine index of two versions, the index file obtaining should be also same, same index file is carried out to document flow processing, its document flow forming accordingly should be also the same, therefore the mode of the application based on the contrast of document flow partial content tested old edition index file and new edition index file, still can reach accuracy and the fine-grained requirement of test;
The more important thing is, because the length of partial content is less than the length of full content certainly, in the situation that the content-length of contrast is little, to specific efficiency, can improve, the reduced time can reduce, thereby can further improve the testing efficiency of search engine index function.
It should be noted that those skilled in the art can, according to actual needs, set the gaps between the several intervals in one file stream to be equal or unequal.
The present application can provide the following schemes for obtaining the start positions of the intervals:
Scheme 1,
In a preferred embodiment of the present application, the start position of each interval can be obtained according to a preset number of intervals, interval length, and gap length. Specifically, the sum of the start position of the preceding interval, the length of the preceding interval, and the length of the gap between the preceding interval and the following interval can be used as the start position of the following interval.
In one application example of the present application, the lengths of all intervals can be preset to be equal, and the gaps between all adjacent intervals can be preset to be equal. Suppose the length of the file stream is 100, that is, the continuous ordered set T of the file stream can be represented by the closed interval [0,99]. The number of intervals can then be preset to 10, the interval length to 2, and the gap to 8. The start positions can be, for example, 0+2, 0+2+2+8, 0+2+2+8+2+8, ..., so that the 10 intervals formed can be expressed as [2,4], [12,14], [22,24], ..., [92,94].
In another application example of the present application, the interval lengths can be preset to form an arithmetic progression, and the gaps between adjacent intervals can likewise be preset to form an arithmetic progression. Suppose the length of the file stream is 100, that is, the continuous ordered set T of the file stream can be represented by the closed interval [0,99]. The number of intervals can then be preset to 8; the length of the first interval is preset to 2, and subsequent interval lengths increase with a common difference of 1; the gap between the first and second intervals is preset to 2, and subsequent gaps increase with a common difference of 2. The start positions can be, for example, 0+2, 0+2+2+2, 0+2+2+2+2+1+2+2, 0+2+2+2+2+1+2+2+2+1+1+..., so that the 8 same-position intervals formed can be expressed as [2,4], [6,9], [13,17], [23,28], [36,42], [52,59], [70,78], [90,99].
Of course, the above are only examples; those skilled in the art can preset various numbers of intervals, interval lengths, and gap lengths according to actual requirements. For example, to improve test accuracy, one can preset as many intervals as possible, as large an interval length as possible, and as small a gap length as possible. To improve test efficiency, one can preset as few intervals as possible, as small an interval length as possible, and as large a gap length as possible. Alternatively, one can preset a suitable number of intervals, interval length, and gap length to achieve the best balance between test accuracy and test efficiency. The present application does not limit the specific way in which the number of intervals, the interval length, and the gap length are preset.
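Scheme 1 can be sketched in a few lines. The function name `preset_intervals`, its parameters, and the choice to pass the first start position explicitly (2, as in the worked examples) are assumptions made for illustration.

```python
def preset_intervals(first_start, lengths, gaps):
    """Build (start, end) pairs from preset interval lengths and gaps.

    Rule from Scheme 1: next start = previous start + previous length + gap.
    lengths: one length per interval; gaps: one gap per adjacent pair.
    """
    if len(gaps) != len(lengths) - 1:
        raise ValueError("need exactly one gap between each pair of intervals")
    intervals = []
    start = first_start
    for i, length in enumerate(lengths):
        intervals.append((start, start + length))
        if i < len(gaps):
            start = start + length + gaps[i]
    return intervals


# Equal lengths and equal gaps: 10 intervals, length 2, gap 8.
uniform = preset_intervals(2, [2] * 10, [8] * 9)
print(uniform[:3], uniform[-1])  # [(2, 4), (12, 14), (22, 24)] (92, 94)

# Arithmetic progressions: lengths 2, 3, 4, ... and gaps 2, 4, 6, ...
arithmetic = preset_intervals(2, [2, 3, 4, 5, 6, 7], [2, 4, 6, 8, 10])
print(arithmetic)
```

The uniform call reproduces the ten intervals of the first example, and the arithmetic call reproduces the first six intervals listed in the second example.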
Scheme 2,
In another preferred embodiment of the present application, the step of obtaining the start positions of the intervals may further include:
Sub-step C1: determine the file stream length and the number of intervals;
Sub-step C2: generate random numbers whose values lie within the range of the file stream length and whose quantity is twice the number of intervals;
Sub-step C3: obtain the start position of each interval according to the random numbers.
In one application example of the present application, sub-step C3 may further include: numbering the random numbers as odd and even, taking the difference between each even-numbered random number and the adjacent odd-numbered random number as the interval length, and taking the odd-numbered random number as the start position of the interval.
Suppose the file stream length is L and the number of intervals is M. A random algorithm can be used to generate 2M random numbers in [0, L-1]. Suppose the 2M random numbers are numbered 1, 2, 3, ..., 2M-1, 2M. Then the difference between random numbers 2 and 1 is an interval length, the difference between random numbers 4 and 3 is an interval length, and so on, and the random numbers numbered 1, 3, 5, ..., 2M-3, 2M-1 are used as the start positions of the intervals.
Of course, the above is only an example; the even-numbered random numbers can also be used as the start positions of the intervals.
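A minimal sketch of Scheme 2 follows. The text does not say how the 2M random numbers are ordered or whether they may repeat; this sketch assumes they are distinct and sorted in ascending order so that every interval length is positive and intervals do not overlap. The function name and the `seed` parameter are likewise assumptions.

```python
import random


def random_intervals(stream_length, interval_count, seed=None):
    """Return (start, length) pairs derived from 2M random numbers in [0, L-1].

    Assumption: the 2M numbers are distinct and sorted, so numbers 1, 3, 5, ...
    are start positions and (number 2k) - (number 2k-1) is the k-th length.
    Requires 2 * interval_count <= stream_length.
    """
    rng = random.Random(seed)
    values = sorted(rng.sample(range(stream_length), 2 * interval_count))
    return [(values[i], values[i + 1] - values[i])
            for i in range(0, len(values), 2)]


sampled = random_intervals(stream_length=100, interval_count=10, seed=1)
for start, length in sampled:
    print(start, length)
```

Passing a fixed seed makes the same intervals reproducible, so the old-version and new-version streams are sampled at identical positions.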
It should be noted that the file stream operations of the present application have been illustrated above using a language's open, read, and compare methods as examples. In fact, the present application is not limited to a specific language or to specific functions in a language; any other function, method, or technical means that can implement file stream operations is feasible.
Corresponding to the foregoing method embodiments, the present application also discloses an embodiment of a device for testing the search engine index function. Referring to the structural diagram shown in Fig. 2, the device may specifically include:
An index module 201, configured to index benchmark data with a new-version search engine to obtain a corresponding new-version index file, where the new-version search engine is the search engine under test; and
A comparison module 202, configured to compare the contents of an old-version index file and the new-version index file as file streams, where the test passes if they are identical and fails if they differ, and where the old-version index file is obtained by indexing the benchmark data with an old-version search engine.
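The interaction of the two modules can be sketched as below, here with a full-content comparison. Everything named in the code is hypothetical: `toy_indexer` merely stands in for a search engine's indexing step, and `run_index_test` and `streams_equal` are illustrative names rather than parts of the device.

```python
import io


def streams_equal(old_stream, new_stream, chunk_size=8192):
    """Full-content comparison of two binary streams, chunk by chunk."""
    while True:
        old_chunk = old_stream.read(chunk_size)
        new_chunk = new_stream.read(chunk_size)
        if old_chunk != new_chunk:
            return False
        if not old_chunk:  # both streams ended at the same point
            return True


def toy_indexer(docs):
    """Hypothetical stand-in for an engine: builds a tiny inverted index as bytes."""
    postings = {}
    for doc_id, text in enumerate(docs):
        for word in text.split():
            postings.setdefault(word, []).append(doc_id)
    lines = [f"{word}:{','.join(map(str, ids))}"
             for word, ids in sorted(postings.items())]
    return "\n".join(lines).encode("utf-8")


def run_index_test(index_with_new_engine, benchmark_data, old_index_bytes):
    """Index module: build the new index. Comparison module: compare as streams."""
    new_index_bytes = index_with_new_engine(benchmark_data)
    same = streams_equal(io.BytesIO(old_index_bytes), io.BytesIO(new_index_bytes))
    return "pass" if same else "fail"


docs = ["a b", "b c"]
old_index = toy_indexer(docs)  # produced once by the "old version"
print(run_index_test(toy_indexer, docs, old_index))                        # pass
print(run_index_test(lambda d: toy_indexer(d) + b"x", docs, old_index))    # fail
```

Because the old-version index only has to be built once for a fixed benchmark data set, each new engine version needs a single indexing run followed by the stream comparison.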
In the embodiments of the present application, the file stream is preferably a text stream or a binary stream.
In a preferred embodiment of the present application, the comparison module 202 may further include:
A full comparison submodule, configured to treat the old-version index file and the new-version index file as file streams and compare their full content; or
A partial comparison submodule, configured to treat the old-version index file and the new-version index file as file streams and compare partial content.
In another preferred embodiment of the present application, the partial comparison submodule may further include:
An interval extraction unit, configured to extract the data in several same-position intervals from the old-version index file stream and from the new-version index file stream respectively, where "same-position intervals" means that the interval positions used when extracting data from the old-version index file stream and from the new-version index file stream are identical; and
A comparison unit, configured to compare the data in the several same-position intervals of the old-version index file stream and the new-version index file stream respectively, where the test passes when the data in all same-position intervals are identical and fails when the data in any same-position interval differ.
In another preferred embodiment of the present application, the device further includes a file stream interval data extraction module, configured to extract the data in an interval of a file stream;
The file stream interval data extraction module may further include:
A pointer acquisition submodule, configured to obtain the file pointer of the file stream;
A reading submodule, configured to locate the start position of the interval by offsetting the file pointer, and to read the corresponding data of the file stream from the start position of the interval according to the file pointer and the interval length.
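The pointer acquisition and reading steps map naturally onto ordinary file operations. The sketch below uses Python's built-in `open`, `seek`, and `read`; the function name `read_interval`, the temporary file, and its 100-byte contents are assumptions used only to make the example self-contained.

```python
import os
import tempfile


def read_interval(path, start, length):
    """Read `length` bytes of the file at `path`, beginning at offset `start`."""
    with open(path, "rb") as stream:   # obtain the file pointer
        stream.seek(start)             # offset the pointer to the interval start
        return stream.read(length)     # read the interval's data


# Create a throwaway 100-byte file to stand in for an index file.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, "wb") as out:
        out.write(bytes(range(100)))
    data = read_interval(path, 12, 2)
finally:
    os.remove(path)

print(list(data))  # [12, 13]
```

Seeking directly to each start position means only the sampled bytes are read, rather than the whole index file.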
In a preferred embodiment of the present application, the device may further include a start position acquisition module, configured to obtain the start positions of the intervals;
The start position acquisition module may further include:
A preset acquisition submodule, configured to obtain the start position of each interval according to a preset number of intervals, interval length, and gap length, where the acquisition process includes: using the sum of the start position of the preceding interval, the length of the preceding interval, and the length of the gap between the preceding interval and the following interval as the start position of the following interval.
In another preferred embodiment of the present application, the device may further include a start position acquisition module, configured to obtain the start positions of the intervals;
The start position acquisition module may further include:
A determination submodule, configured to determine the file stream length and the number of intervals;
A random number submodule, configured to generate random numbers whose values lie within the range of the file stream length and whose quantity is twice the number of intervals;
A random acquisition submodule, configured to obtain the start position of each interval according to the random numbers.
Because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.
Those skilled in the art should understand that embodiments of the present application can be provided as a method, a system, or a computer program product. Accordingly, the present application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, and the instruction means implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments can be referred to one another.
The method and device for testing the search engine index function provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, the specific implementations and the scope of application may vary according to the idea of the present application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (14)

1. A method for testing the index function of a search engine, characterized by comprising:
Indexing benchmark data with a new-version search engine to obtain a corresponding new-version index file, wherein the new-version search engine is the search engine under test;
Comparing the contents of an old-version index file and the new-version index file as file streams, wherein the test passes if they are identical and fails if they differ, and wherein the old-version index file is obtained by indexing the benchmark data with an old-version search engine.
2. The method of claim 1, characterized in that the step of comparing the contents of the old-version index file and the new-version index file as file streams further comprises:
Treating the old-version index file and the new-version index file as file streams and comparing all or part of their content.
3. The method of claim 2, characterized in that the step of treating the old-version index file and the new-version index file as file streams and comparing partial content further comprises:
Extracting the data in several same-position intervals from the old-version index file stream and from the new-version index file stream respectively, wherein "same-position intervals" means that the interval positions used when extracting data from the old-version index file stream and from the new-version index file stream are identical;
Comparing the data in the several same-position intervals of the old-version index file stream and the new-version index file stream respectively, wherein the test passes when the data in all same-position intervals are identical and fails when the data in any same-position interval differ.
4. The method of claim 3, characterized in that the data in an interval of a file stream are extracted through the following steps:
Obtaining the file pointer of the file stream;
Locating the start position of the interval by offsetting the file pointer, and reading the corresponding data of the file stream from the start position of the interval according to the file pointer and the interval length.
5. The method of claim 4, characterized in that the start positions of the intervals are obtained through the following step:
Obtaining the start position of each interval according to a preset number of intervals, interval length, and gap length, wherein the acquisition process comprises: using the sum of the start position of the preceding interval, the length of the preceding interval, and the length of the gap between the preceding interval and the following interval as the start position of the following interval.
6. The method of claim 4, characterized in that the start positions of the intervals are obtained through the following steps:
Determining the file stream length and the number of intervals;
Generating random numbers whose values lie within the range of the file stream length and whose quantity is twice the number of intervals;
Obtaining the start position of each interval according to the random numbers.
7. The method of any one of claims 1 to 6, characterized in that the file stream is a text stream or a binary stream.
8. A device for testing the index function of a search engine, characterized by comprising:
An index module, configured to index benchmark data with a new-version search engine to obtain a corresponding new-version index file, wherein the new-version search engine is the search engine under test;
A comparison module, configured to compare the contents of an old-version index file and the new-version index file as file streams, wherein the test passes if they are identical and fails if they differ, and wherein the old-version index file is obtained by indexing the benchmark data with an old-version search engine.
9. The device of claim 8, characterized in that the comparison module further comprises:
A full comparison submodule, configured to treat the old-version index file and the new-version index file as file streams and compare their full content; or
A partial comparison submodule, configured to treat the old-version index file and the new-version index file as file streams and compare partial content.
10. The device of claim 9, characterized in that the partial comparison submodule further comprises:
An interval extraction unit, configured to extract the data in several same-position intervals from the old-version index file stream and from the new-version index file stream respectively, wherein "same-position intervals" means that the interval positions used when extracting data from the old-version index file stream and from the new-version index file stream are identical; and
A comparison unit, configured to compare the data in the several same-position intervals of the old-version index file stream and the new-version index file stream respectively, wherein the test passes when the data in all same-position intervals are identical and fails when the data in any same-position interval differ.
11. The device of claim 10, characterized by further comprising: a file stream interval data extraction module, configured to extract the data in an interval of a file stream;
The file stream interval data extraction module further comprises:
A pointer acquisition submodule, configured to obtain the file pointer of the file stream; and
A reading submodule, configured to locate the start position of the interval by offsetting the file pointer, and to read the corresponding data of the file stream from the start position of the interval according to the file pointer and the interval length.
12. The device of claim 11, characterized by further comprising: a start position acquisition module, configured to obtain the start positions of the intervals;
The start position acquisition module further comprises:
A preset acquisition submodule, configured to obtain the start position of each interval according to a preset number of intervals, interval length, and gap length, wherein the acquisition process comprises: using the sum of the start position of the preceding interval, the length of the preceding interval, and the length of the gap between the preceding interval and the following interval as the start position of the following interval.
13. The device of claim 11, characterized by further comprising: a start position acquisition module, configured to obtain the start positions of the intervals;
The start position acquisition module further comprises:
A determination submodule, configured to determine the file stream length and the number of intervals;
A random number submodule, configured to generate random numbers whose values lie within the range of the file stream length and whose quantity is twice the number of intervals;
A random acquisition submodule, configured to obtain the start position of each interval according to the random numbers.
14. The device of any one of claims 8 to 13, characterized in that the file stream is a text stream or a binary stream.
CN201210279847.9A 2012-08-07 2012-08-07 Method and device of testing index function of search engine Pending CN103577487A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210279847.9A CN103577487A (en) 2012-08-07 2012-08-07 Method and device of testing index function of search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210279847.9A CN103577487A (en) 2012-08-07 2012-08-07 Method and device of testing index function of search engine

Publications (1)

Publication Number Publication Date
CN103577487A true CN103577487A (en) 2014-02-12

Family

ID=50049285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210279847.9A Pending CN103577487A (en) 2012-08-07 2012-08-07 Method and device of testing index function of search engine

Country Status (1)

Country Link
CN (1) CN103577487A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291607A (en) * 2016-03-31 2017-10-24 高德信息技术有限公司 A kind of evaluating method and device for search engine
CN107453960A (en) * 2017-09-26 2017-12-08 聚好看科技股份有限公司 A kind of methods, devices and systems that test data is handled in service testing
WO2018202174A1 (en) * 2017-05-05 2018-11-08 平安科技(深圳)有限公司 Version comparison testing method and system
CN110221971A (en) * 2019-05-21 2019-09-10 口口相传(北京)网络技术有限公司 The test method and device of search engine, electronic equipment, storage medium
CN115576946A (en) * 2022-10-18 2023-01-06 北京火山引擎科技有限公司 Data processing method and device in Iceberg, storage medium and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101295303A (en) * 2007-04-28 2008-10-29 李树德 Knowledge search engine based on intelligent noumenon and implementing method thereof
CN101493819A (en) * 2008-01-24 2009-07-29 中国科学院自动化研究所 Method for optimizing detection of search engine cheat
CN202033748U (en) * 2011-04-22 2011-11-09 阿里巴巴集团控股有限公司 Search engine performance test system

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291607A (en) * 2016-03-31 2017-10-24 高德信息技术有限公司 A kind of evaluating method and device for search engine
WO2018202174A1 (en) * 2017-05-05 2018-11-08 平安科技(深圳)有限公司 Version comparison testing method and system
CN107453960A (en) * 2017-09-26 2017-12-08 聚好看科技股份有限公司 A kind of methods, devices and systems that test data is handled in service testing
CN107453960B (en) * 2017-09-26 2020-08-25 青岛聚看云科技有限公司 Method, device and system for processing test data in service test
CN110221971A (en) * 2019-05-21 2019-09-10 口口相传(北京)网络技术有限公司 The test method and device of search engine, electronic equipment, storage medium
CN110221971B (en) * 2019-05-21 2023-01-24 口口相传(北京)网络技术有限公司 Search engine testing method and device, electronic equipment and storage medium
CN115576946A (en) * 2022-10-18 2023-01-06 北京火山引擎科技有限公司 Data processing method and device in Iceberg, storage medium and equipment

Similar Documents

Publication Publication Date Title
CN103577487A (en) Method and device of testing index function of search engine
CN105989089A (en) Data comparison method and device
Holzmann et al. Archivespark: Efficient web archive access, extraction and derivation
CN109448793B (en) Method and system for labeling, searching and information labeling of right range of gene sequence
CN105528149A (en) Application icon display method and device
CN107391101A (en) A kind of information processing method and device
Mehmood et al. Performance analysis of not only SQL semi-stream join using MongoDB for real-time data warehousing
CN103631623A (en) Method and device for allocating application software in trunking system
CN107436911A (en) Fuzzy query method, device and inquiry system
CN103455471A (en) Method and device for analyzing text to key value pairs
US20140081982A1 (en) Method and Computer for Indexing and Searching Structures
CN113535977B (en) Knowledge graph fusion method, device and equipment
CN110069523A (en) A kind of data query method, apparatus and inquiry system
CN109213477B (en) Method and device for realizing automatic comparison of software line difference
Consoli et al. A quartet method based on variable neighborhood search for biomedical literature extraction and clustering
CN117252183B (en) Semantic-based multi-source table automatic matching method, device and storage medium
CN104424399A (en) Knowledge navigation method, device and system based on virus protein body
CN111125216A (en) Method and device for importing data into Phoenix
CN109829051A (en) A kind of method and apparatus of database similar sentence screening
CN112100132B (en) Deleted file type identification method and device, electronic equipment and storage medium
Pokorný et al. Graph pattern index for Neo4j graph databases
CN107818126B (en) Full-text information retrieval method oriented to Mongo database
Consoli et al. A VNS-based quartet algorithm for biomedical literature clustering
Jiang et al. Gvos: a general system for near-duplicate video-related applications on storm
CN110728150B (en) Named entity screening method, named entity screening device, named entity screening equipment and readable medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20170829