CN109388690A - Text searching method, inverted list generation method and system for text retrieval - Google Patents

Publication number
CN109388690A
Authority
CN
China
Prior art keywords
document
retrieved
inverted list
participle
retrieval
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710681027.5A
Other languages
Chinese (zh)
Inventor
王朝阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201710681027.5A priority Critical patent/CN109388690A/en
Publication of CN109388690A publication Critical patent/CN109388690A/en

Abstract

The application provides a retrieval method comprising: receiving a query request; performing retrieval-oriented preprocessing on the query text and query parameters provided by the query request to obtain a preprocessing result; performing inverted list lookup and merge operations according to the retrieval-oriented preprocessing result to obtain a predetermined number of recalled documents, wherein each record of the inverted list identifies its associated document by a score-based document identifier, and the records under each keyword entry are sorted by the score-based document identifiers of their associated documents; calculating a priority score for each recalled document; and outputting the recalled documents sorted by priority score. The application also provides a retrieval device, a retrieval system, and an inverted list generation method for retrieval. Because it uses this specially built inverted list, the text retrieval method provided by the application can retrieve highly important documents first.

Description

Text searching method, inverted list generation method and system for text retrieval
Technical field
The application relates to retrieval technology, and in particular to a text retrieval method. The application also provides a text retrieval device; an inverted list generation method for text retrieval, whose generated inverted list is used in the aforementioned text retrieval method; an inverted list generation device for text retrieval; and a text retrieval system. The application further provides an electronic device for running the text retrieval method, and another electronic device for running the inverted list generation method for text retrieval.
Background
A search engine is a system that collects information from the Internet using specific computer programs according to a certain strategy, organizes and processes that information, provides a retrieval service to users, and presents the information relevant to a user's search to that user.
Text retrieval with a search engine has become something people may do at any time. As search engines are used ever more frequently, the time spent obtaining the desired results in each search adds up to a very large social time cost. Reducing the time taken by text retrieval with a search engine can therefore effectively improve the efficiency of society as a whole.
In text retrieval with a search engine, time is consumed mainly in two places: the time the search engine takes to produce the results page, and the time the user who issued the retrieval request takes to find the information needed.
The first is the time the search engine spends retrieving the relevant results (the recalled documents) according to the text and related parameters contained in the retrieval request and displaying them as a page. In this process the search engine first retrieves from the database to obtain the recalled documents; it must then rank them to determine their priority order. The ranking step is especially important when the recalled documents fill more than one results page. In the prior art, the time and real-time computing resources consumed by document ranking account for most of the entire search process.
The second is the time the search engine user spends finding the result actually needed in the results page that the search engine finally provides. This time depends on the ordering of the recalled documents. A reasonable ordering shortens the time the user needs to reach the required document; an unreasonable ordering costs the user excessive time.
For example, when the recalled documents must be shown across multiple pages, the user's total search time differs markedly depending on whether the required document appears on the first page or the second. When the number of recalled documents is so large that many pages are needed and the required result happens to lie on a late page, the user spends longer, the search experience degrades noticeably, and the user may even run out of patience and give up on obtaining the result.
In the prior art, to deliver results pages more effectively so that users find the required result faster, a relatively high upper limit on the number of recalled documents can be set, and the recalled documents are ranked by priority before being displayed, so that important documents are not missed and the more important ones are presented to the user first.
This approach, however, has major shortcomings. The most important is that when the searched content is popular and the number of recalled documents is very large, the ranking operation is expensive in both computation and time. Moreover, as the number of recalled documents grows, the computation required for ranking grows significantly, so that with too many recalled documents the results page is generated much more slowly, which harms the user experience.
Because of the above problems, obtaining a text retrieval scheme that ranks recalled documents faster while producing rankings that meet user requirements has become key to improving the working efficiency of search engines.
Summary of the invention
The application provides a retrieval method that uses a specially generated inverted list and can therefore select recalled documents that meet the retrieval requirements more effectively. The application also provides a retrieval device.
The retrieval method provided by the application comprises:
Receiving a query request;
Performing retrieval-oriented preprocessing on the query text and query parameters provided by the query request to obtain a preprocessing result;
According to the terms to be retrieved and the merge relations among them, both provided by the retrieval-oriented preprocessing result, performing inverted list lookup and merge operations on the terms to be retrieved to obtain a predetermined number of recalled documents. The inverted list has the following features: each record identifies its associated document by a score-based document identifier, and the records under each keyword entry are sorted by the score-based document identifiers of their associated documents;
Calculating a priority score for each recalled document obtained;
Outputting the recalled documents sorted by priority score.
Preferably, in the inverted list lookup and merge operations, the ordering under each keyword entry of the inverted list is used as the basis for preferentially selecting, among the documents that meet the requirements, those with high document scores as recalled documents.
Preferably, the step of performing retrieval-oriented preprocessing on the query text provided by the query request comprises: segmenting the query text to obtain segmented terms and determining the terms to be retrieved from them; and obtaining the merge relations among the terms to be retrieved according to the query text and query parameters. The terms to be retrieved are a subset of the segmented terms.
Preferably, the step of performing retrieval-oriented preprocessing on the query text provided by the query request further comprises: after the segmented terms are obtained, performing weight analysis on each segmented term to obtain its weight. In subsequent steps, each segmented term is processed according to its weight.
Preferably, in the step of performing inverted list lookup and merge operations on the terms to be retrieved, the merge operation comprises at least one of the following: intersection, union, and difference.
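The three merge operations can be sketched over posting lists as follows. This is a minimal illustration, not the application's implementation: it assumes each posting list is a Python list of integer score-based document identifiers already sorted in descending order, and the function names and sample lists are hypothetical.

```python
from heapq import merge


def intersect(a, b):
    """Keep the IDs of `a` that also occur in `b`, preserving descending order."""
    in_b = set(b)
    return [doc_id for doc_id in a if doc_id in in_b]


def union(a, b):
    """Merge two descending lists into one descending list without duplicates."""
    out = []
    for doc_id in merge(a, b, reverse=True):
        if not out or out[-1] != doc_id:
            out.append(doc_id)
    return out


def difference(a, b):
    """Keep the IDs of `a` that do not occur in `b`, preserving descending order."""
    in_b = set(b)
    return [doc_id for doc_id in a if doc_id not in in_b]


# Hypothetical posting lists for three terms, highest identifier first.
love = [95, 80, 72, 60]
rain = [95, 72, 41]
live = [80, 41]

both = intersect(love, rain)
either = union(love, rain)
studio_only = difference(love, live)
```

Because every operation preserves the descending order of its inputs, the results can be fed into further merge operations without re-sorting.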
Preferably, the records under each keyword entry of the inverted list are sorted by the score-based document identifiers of their associated documents, specifically in descending order.
Preferably, in the step of performing inverted list lookup and merge operations on the terms to be retrieved, the merge operation comprises an intersection operation; candidate recalled documents are determined in the intersection operation as follows:
In the inverted list entries of the terms to be retrieved that take part in the intersection, search from front to back for records that satisfy the following condition:
The score-based document identifier associated with the record has a corresponding record in the inverted list entry of every term to be retrieved that takes part in the intersection.
Preferably, candidate recalled documents are determined in the intersection operation by the following steps:
From the set of terms to be retrieved that take part in the intersection, select one term as the current term to be retrieved; this term is the starting point for traversing the elements of the set. Arrange the terms in the set in a fixed order, which serves as the order for cycling through the terms of the set;
In the inverted list entry of the current term to be retrieved, obtain the score-based document identifier in the record at the front-most position, take it as the current document identifier, and set the term counter to 1;
In the set of terms to be retrieved that take part in the intersection, make the term that follows the current term to be retrieved the new current term to be retrieved;
Look up the keyword entry of the current term to be retrieved in the inverted list and find the first record whose score-based document identifier is less than or equal to the current document identifier; take the score-based document identifier in that record as the document identifier to be judged;
Determine whether the document identifier to be judged equals the current document identifier. If so, go to the next step. If not, set the term counter to 1, take the document identifier to be judged as the current document identifier, and return to the step of making the term that follows the current term to be retrieved the new current term to be retrieved;
Increment the term counter by 1;
Determine whether the term counter equals the total number of elements in the set of terms to be retrieved. If so, go to the next step. If not, return to the step of making the term that follows the current term to be retrieved the new current term to be retrieved;
Determine the document corresponding to the current document identifier to be a candidate recalled document;
In the inverted list entry of the current term to be retrieved, take the record that follows the record containing the current document identifier as the record at the front-most position; then return to the step of obtaining, in the inverted list entry of the current term to be retrieved, the score-based document identifier in the record at the front-most position, taking it as the current document identifier, and setting the term counter to 1.
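The round-robin intersection steps above can be sketched as follows. This is an illustrative reading of the steps rather than the application's implementation: it assumes each posting list holds unique integer score-based document identifiers sorted in descending order, uses a simple linear scan to find the first record less than or equal to the current identifier, and adds two stopping conditions that the steps leave implicit (a posting list is exhausted, or `limit` candidates have been found). The function name and parameters are hypothetical.

```python
def intersect_descending(postings, limit):
    """Return up to `limit` IDs that occur in every posting list.

    Each list in `postings` is assumed to be sorted in descending order
    with no duplicates, so candidates come out highest identifier first.
    """
    n = len(postings)
    if n == 0 or limit <= 0 or not postings[0]:
        return []
    cursors = [0] * n          # front-most unread position in each list
    results = []
    term = 0                   # index of the current term to be retrieved
    current = postings[0][0]   # current document identifier
    matched = 1                # the term counter described in the text
    while True:
        if matched == n:
            # Every term contains `current`: it is a candidate.
            results.append(current)
            if len(results) == limit:
                return results
            # Restart from the record after `current` in the current list.
            cursors[term] += 1
            if cursors[term] == len(postings[term]):
                return results
            current = postings[term][cursors[term]]
            matched = 1
            continue
        # Move on to the next term, cycling through the set.
        term = (term + 1) % n
        plist = postings[term]
        i = cursors[term]
        # Find the first record whose identifier is <= current.
        while i < len(plist) and plist[i] > current:
            i += 1
        cursors[term] = i
        if i == len(plist):
            return results     # assumed stop: this list is exhausted
        if plist[i] == current:
            matched += 1
        else:
            current = plist[i]
            matched = 1


example = intersect_descending(
    [[95, 80, 72, 60, 41], [95, 72, 60, 33], [99, 95, 60, 41]], limit=10
)
```

Since the current identifier only ever decreases, each cursor moves forward monotonically and no record is examined twice; a binary search could replace the linear scan on long lists.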
Preferably, in the step of performing retrieval-oriented preprocessing on the query text provided by the query request, the retrieval-oriented preprocessing result obtained includes matching-degree calculation parameters;
Calculating the priority score of the recalled documents obtained comprises the following steps:
Calculating the matching-degree score of each recalled document according to the matching-degree calculation parameters and a preset matching-degree algorithm;
Calculating the priority score of each recalled document from its matching-degree score and its document score, using preset weights.
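One way to combine the two scores is a linear weighted sum, sketched below. The text only says that preset weights are used, so the linear form, the weight values, and all names here are assumptions made for illustration.

```python
def priority_score(match_score, doc_score, w_match=0.7, w_doc=0.3):
    """Assumed linear combination of matching-degree score and document score."""
    return w_match * match_score + w_doc * doc_score


def rank_recalled(recalled, w_match=0.7, w_doc=0.3):
    """Sort (doc_id, match_score, doc_score) tuples by priority, highest first."""
    return sorted(
        recalled,
        key=lambda rec: priority_score(rec[1], rec[2], w_match, w_doc),
        reverse=True,
    )


# Hypothetical recalled documents: (doc_id, matching-degree score, document score).
recalled = [("a", 0.9, 2.0), ("b", 0.5, 9.0), ("c", 0.8, 6.0)]
ranked_ids = [doc_id for doc_id, _, _ in rank_recalled(recalled)]
```

With these sample weights a popular document with a moderate match can outrank a close match that is unpopular; changing the weights shifts that balance.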
Preferably, the text to be queried by the query request is song lyrics.
Preferably, the score on which the score-based document identifier is based is derived from at least one of the following attributes of the associated song: number of plays, number of downloads, number of favorites, number of comments, and number of fans of the associated artist.
The retrieval device provided by the application comprises:
A query request receiving unit, for receiving a query request;
A retrieval-oriented preprocessing unit, for performing retrieval-oriented preprocessing on the query text and query parameters provided by the query request to obtain a preprocessing result;
A lookup and merge unit, for performing inverted list lookup and merge operations on the terms to be retrieved, according to the terms to be retrieved and the merge relations among them provided by the retrieval-oriented preprocessing result, to obtain a predetermined number of recalled documents. The inverted list has the following features: each record identifies its associated document by a score-based document identifier, and the records under each keyword entry are sorted by the score-based document identifiers of their associated documents;
A priority score calculation unit, for calculating a priority score for each recalled document obtained;
An output unit, for outputting the recalled documents sorted by priority score.
The application also provides a retrieval system, comprising:
A database, for storing the documents available for querying;
An offline computation server, for collecting the popularity data of each document and generating document scores from the historical data accordingly;
An index server, for building the index tables of the documents in the database, including the inverted list. In generating the inverted list, the index server assigns each document a score-based document identifier based on the document score provided by the offline computation server, and uses that identifier in the records of each inverted list entry to record the identity of each document; the records under each keyword entry are sorted by the score-based document identifiers of their associated documents;
A retrieval server, for receiving a query request and performing retrieval-oriented preprocessing on the query text and query parameters it provides to obtain a retrieval-oriented preprocessing result; performing, according to the terms to be retrieved and the merge relations among them provided by that result and based on the inverted list corresponding to each term to be retrieved, inverted list lookup and merge operations on the terms to be retrieved to obtain the required number of recalled documents; calculating a priority score for each candidate recalled document obtained; and finally outputting the recalled documents sorted by priority score.
The application also provides an electronic device, comprising:
A processor; and
A memory, for storing a program for the retrieval method; after the device is powered on and the processor runs the program, the following steps are performed:
Receiving a query request;
Performing retrieval-oriented preprocessing on the query text and query parameters provided by the query request to obtain a retrieval-oriented preprocessing result;
According to the terms to be retrieved and the merge relations among them provided by the retrieval-oriented preprocessing result, and based on the inverted list corresponding to each term to be retrieved, performing inverted list lookup and merge operations on the terms to be retrieved to obtain a predetermined number of recalled documents. The inverted list corresponding to each term to be retrieved has the following features: each record identifies its associated document by a score-based document identifier, and the records under each keyword entry are sorted by the score-based document identifiers of their associated documents. In the inverted list lookup and merge operations, the ordering of the inverted list is used to preferentially select, among the documents that meet the requirements, those with high document scores as recalled documents;
Calculating a priority score for each recalled document obtained;
Outputting the recalled documents sorted by priority score.
The application also provides an inverted list generation method for retrieval, comprising:
Calculating, from relevant historical data, the document score of each document that serves as a retrieval candidate;
Assigning each document a score-based document identifier based on its document score;
During inverted list generation, using the score-based document identifier of the document associated with each record as the basis for sorting that record under its keyword entry.
Preferably, the historical data includes one or more attribute values related to document popularity, and the step of calculating the document score of each retrieval candidate is based on those attribute values.
Preferably, if the historical data includes multiple attribute values, a weight is assigned to each attribute, and the document score of each retrieval candidate is calculated from the attribute values and the corresponding weights.
Preferably, the document score is normalized to a fixed score range.
Preferably, the score-based document identifier has the structure: document score + original document identifier.
Preferably, the data type of the score-based document identifier is Long or String.
Preferably, in the step of determining, during inverted list generation, the ordering of the records from the score-based document identifiers they contain and hence the position of each record in the inverted list, the position of each record is determined by the reverse of the ordering given by the score-based document identifiers.
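The generation method can be sketched as follows, using the String variant of the identifier. This is a minimal illustration under stated assumptions: scores are integers in the range 0 to 10, original identifiers are integers of at most eight digits, and the identifier is the zero-padded score followed by the zero-padded original identifier so that string comparison matches score order. The field widths, function names, and sample data are hypothetical.

```python
from collections import defaultdict

SCORE_WIDTH = 2   # hypothetical: enough digits for scores 0..10
ID_WIDTH = 8      # hypothetical: enough digits for the original IDs


def scored_id(score, original_id):
    """Build 'document score + original document ID' as a fixed-width string."""
    return f"{score:0{SCORE_WIDTH}d}{original_id:0{ID_WIDTH}d}"


def build_inverted_list(docs):
    """docs maps original_id -> (document score, list of terms in the document)."""
    index = defaultdict(list)
    for original_id, (score, terms) in docs.items():
        sid = scored_id(score, original_id)
        for term in set(terms):          # one record per term per document
            index[term].append(sid)
    for postings in index.values():
        postings.sort(reverse=True)      # reverse order: highest score first
    return dict(index)


docs = {
    1: (3, ["rain", "love"]),
    2: (9, ["love"]),
    3: (5, ["rain", "love", "rain"]),
}
index = build_inverted_list(docs)
```

Fixed-width zero padding is what makes lexicographic order agree with numeric order here; without it, a score of 10 would sort below a score of 9.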
The application provides an inverted list generation device for retrieval, comprising:
A document score calculation unit, for calculating, from relevant historical data, the document score of each document that serves as a retrieval candidate;
A score-based document identifier assignment unit, for assigning each document a score-based document identifier based on its document score;
An inverted list generation unit, for generating the inverted list; during generation, the score-based document identifier of the document associated with each record is used as the basis for sorting that record under its keyword entry.
The application provides an electronic device, comprising:
A processor; and
A memory, for storing an inverted list generation program for retrieval; after the device is powered on and the processor runs the program, the following steps are performed:
Calculating, from relevant historical data, the document score of each document that serves as a retrieval candidate;
Assigning each document a score-based document identifier based on its document score;
During inverted list generation, using the score-based document identifier of the document associated with each record as the basis for sorting that record under its keyword entry.
The application also provides a retrieval method, comprising:
Receiving a query request;
Performing retrieval-oriented preprocessing on the query text and query parameters provided by the query request to obtain a retrieval-oriented preprocessing result, which includes the terms to be retrieved;
Performing inverted list lookup on each term to be retrieved to obtain multiple recalled documents. The inverted list has the following features: each record identifies its associated document by a score-based document identifier, and the records under each keyword entry are sorted by the score-based document identifiers of their associated documents;
Ranking and outputting the multiple recalled documents obtained.
Preferably, in the step of performing inverted list lookup on each term to be retrieved, the ordering under each keyword entry of the inverted list is used as the basis for preferentially selecting, among the documents that meet the requirements, those with high document scores as recalled documents.
Preferably, in the step of performing retrieval-oriented preprocessing on the query text provided by the query request, the retrieval-oriented preprocessing result obtained includes matching-degree calculation parameters;
In the step of ranking and outputting the multiple recalled documents obtained, the ranking comprises the following steps:
Calculating the matching-degree score of each recalled document according to the matching-degree calculation parameters and a preset matching-degree algorithm;
Calculating the priority score of each recalled document from its matching-degree score and its document score, using preset weights; the priority score serves as the basis for ranking.
Compared with the prior art, the text retrieval method provided by the application uses a special inverted list in retrieval, in which the records under each keyword entry are sorted by the score-based document identifiers of their associated documents. This makes it possible to select high-scoring records first simply by their position in the list. Because the score on which the score-based document identifier is based reflects the overall importance of the document, the text retrieval method provided by the application can retrieve highly important documents first.
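The benefit described above can be illustrated with a toy comparison. In a posting list ordered by plain document ID, finding the N highest-scoring documents requires reading every record and sorting; in a list ordered by score-based identifier, the first N records are already the answer. The identifier layout (score times 1000 plus original ID), the scores, and the function names below are hypothetical and chosen only to keep the example small.

```python
def top_n_conventional(doc_ids, score_of, n):
    """Plain-ID posting list: every record must be scored and sorted."""
    return sorted(doc_ids, key=lambda d: score_of[d], reverse=True)[:n]


def top_n_score_sorted(postings, n):
    """Score-sorted posting list: the first n records are the top n."""
    return postings[:n]


score_of = {1: 3, 2: 9, 3: 5, 4: 7, 5: 1}          # hypothetical document scores
plain_postings = [1, 2, 3, 4, 5]                   # ordered by original ID

# Toy score-based identifier: score * 1000 + original ID, sorted descending.
scored_postings = sorted((score_of[d] * 1000 + d for d in plain_postings), reverse=True)

conventional = top_n_conventional(plain_postings, score_of, 2)
prefix = [sid % 1000 for sid in top_n_score_sorted(scored_postings, 2)]
```

Both approaches select the same documents, but the second does its sorting once at index-build time instead of on every query.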
Brief description of the drawings
Fig. 1 is a flowchart of an inverted list generation method for retrieval provided by the first embodiment of the application;
Fig. 2 is a schematic diagram of the structure of an ordinary inverted list not generated by the first embodiment of the application;
Fig. 3 is a schematic diagram of the structure of an inverted list generated by the first embodiment of the application;
Fig. 4 is a schematic diagram of the order in which records are placed under a keyword entry of the inverted list generated by the first embodiment of the application;
Fig. 5 is a unit block diagram of an inverted list generation device for retrieval provided by the second embodiment of the application;
Fig. 6 is a flowchart of a retrieval method provided by the third embodiment of the application;
Fig. 7 is a schematic diagram of the merge relations among the segmented terms obtained by the preprocessing of step S302 of the third embodiment of the application;
Fig. 8 is a flowchart of step S303 in the retrieval method provided by the third embodiment of the application when the merge operation is an intersection;
Fig. 9 is a unit block diagram of a retrieval device provided by the fourth embodiment of the application;
Fig. 10 is a schematic diagram of a retrieval system provided by the fifth embodiment of the application;
Fig. 11 is a flowchart of a retrieval method provided by the eighth embodiment of the application.
Detailed description
Many specific details are set forth below to give a full understanding of the application. The application can, however, be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its substance; the application is therefore not limited by the specific implementations disclosed below.
The first embodiment of the application provides an inverted list generation method for retrieval. Because the inverted list generated by this method underlies the retrieval method provided by the application, it is introduced first.
Please refer to Fig. 1, which is a flowchart of the first embodiment of the application, and to Fig. 2, which is an example of an inverted list formed by the method provided by the first embodiment. The inverted list generation method for retrieval provided by the first embodiment is described in detail below with reference to Fig. 1 and Fig. 2.
Step S101: calculate, from relevant historical data, the document score of each document that serves as a retrieval candidate.
The purpose of this step is to obtain a document score for each document that is a retrieval object.
The historical data includes one or more attribute values related to document popularity, and the step of calculating the document score of each retrieval candidate is based on those attribute values.
The historical data may include only one attribute value, for example only the number of downloads of a song. In general, though, it needs to include several attribute values so that document popularity is assessed from different angles. In that case, to obtain a single final score for the document, a weight can be assigned to each attribute, and the document score of each retrieval candidate computed as a weighted sum of the attribute values and their weights. To make the results comparable, the raw data can be discretized to a fixed score range, such as 0 to 10, to give the attribute value of each attribute; the weighted sum of attribute values discretized to a fixed range then yields a document score that falls within a definite score range. Alternatively, the weighted sum can be computed directly on the raw data of each attribute and the result then discretized to a fixed score range.
For example, when the documents are song lyrics, the one or more attribute values related to document popularity mainly include the song's number of plays, number of downloads, number of favorites, number of comments, and the number of fans of the associated artist. The raw data for each attribute can be obtained from the network or other information channels. For ease of comparison, the raw data can be normalized to a fixed score range, such as 0 to 10, and the normalized value used as the attribute value of each attribute; the weight of each attribute can be set in advance as required. The document score of each song can then be calculated as a weighted sum.
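The weighted-sum scoring for songs can be sketched as follows. The text does not specify how raw data is normalized or what the weights are, so this sketch assumes min-max scaling to the range 0 to 10 and illustrative weights that sum to 1; the attribute names, ranges, and sample values are hypothetical.

```python
def normalize(raw, lo, hi, top=10.0):
    """Assumed min-max scaling of a raw value into [0, top], clipping outliers."""
    if hi == lo:
        return 0.0
    clipped = min(max(raw, lo), hi)
    return top * (clipped - lo) / (hi - lo)


# Hypothetical weights; they sum to 1 so the final score also lies in [0, 10].
WEIGHTS = {
    "plays": 0.4,
    "downloads": 0.2,
    "favorites": 0.2,
    "comments": 0.1,
    "artist_fans": 0.1,
}


def document_score(attrs, ranges, weights=WEIGHTS):
    """Weighted sum of normalized attribute values for one song."""
    return sum(
        weight * normalize(attrs[name], *ranges[name])
        for name, weight in weights.items()
    )


# Hypothetical raw data and (min, max) ranges observed across all songs.
ranges = {name: (0, 1000) for name in WEIGHTS}
song = {"plays": 500, "downloads": 1000, "favorites": 0, "comments": 250, "artist_fans": 1000}
score = document_score(song, ranges)
```

Normalizing each attribute before weighting keeps an attribute with large raw counts, such as plays, from dominating one with small counts, such as comments.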
In general, these score values can be supplied to offline meter by collecting attribute value data in advance, and by these data Server is calculated, is calculated and is obtained using offline mode by off-line calculation server, dynamic is not needed and updates;It is this to be collected using prior Attribute value data calculate obtain document score be properly termed as document static state score value.Certainly, these attribute values might as well also use At any time by network collect related data, and at any time dynamic update method obtain document score, this document score due to When changed according to the data of collection, be properly termed as document dynamic score value.Since network data is excessively huge, and document Temperature was not in great fluctuation process under normal circumstances, therefore, generally can satisfy requirement using document static state score value.Certainly, i.e., Make to be document static state score value, the historical data collected should also be as regularly updating, for example, according to daily, weekly or monthly Frequency updates data relevant to document temperature, and recalculates document static state score value.
The documents referred to in this application include files of various forms and formats, for example text files, audio files, video files, and computer-executable files. Text retrieval in this application means that the search condition takes textual form, while the documents serving as retrieval candidates may be video, audio, or other file types. Whatever the file type, each file has some text that describes it; for an audio file, for example, the title, author, singer or performer, and creation date are described in text form. The file can therefore be retrieved by means of text retrieval.
The documents serving as retrieval candidates are typically stored in a database, and the documents themselves can be obtained in various ways, for example collected from the network by a crawler or by other means. Each document has a document identity mark that uniquely corresponds to it.
Step S102: using the document score of each document as the basis, assign a fractionation document identity mark to each document.
This step generates the fractionation document identity mark according to the document score.
A document identity mark, that is, a document ID, is a code uniquely corresponding to each document; documents with the same document ID are the same document. Different methods of generating document identity marks can be used in different situations. The fractionation document identity mark of this step is one special method of generating a document identity mark. It must nevertheless retain the defining property of a document identity mark, namely that the fractionation document identity mark of each document does not duplicate that of any other document. In addition, it must satisfy the following condition: the fractionation document identity mark assigned according to each document's score can determine the ordering relationship between that document and other documents.
The document score has been obtained in step S101. The purpose of assigning a fractionation document identity mark to each document on the basis of its document score is that, in the records of the inverted list, the sorting position associated with a document's score can be conveniently obtained from the fractionation document identity mark.
There are many possible methods of generating a fractionation document identity mark based on the document score. In every case, it must be ensured that each file has a different fractionation document identity mark, that is, the marks generated for different files must not repeat. At the same time, the document score must play its proper role within the fractionation document identity mark; most importantly, the document identity mark in each record of the inverted list should make it possible to determine the order of the records under their respective keyword entries, so that valuable documents can be retrieved quickly.
To achieve the above purpose, the fractionation document identity mark can be constructed using a variety of feasible schemes. For example, it can use the following structure: document score + original document identity mark. The document score at the front serves for sorting, and the original document identity mark that follows ensures that different documents have different fractionation document identity marks. The specific data type can be the Long type or the String type. For example:
Scorer-name = (long)(scorer * 10^(6+n) + doc_id)
Here, Scorer-name is the fractionation document identity mark; scorer is the document score, a 6-digit floating-point number; doc_id is the original identity mark of the document, with n digits; and long indicates that the result is defined as long-type data.
In the above way, the document score scorer determines the leading digits of the fractionation document identity mark, which makes it easy to determine the ordering relationship between different documents from their document scores. The original document identity mark appended in the trailing digits prevents different documents from having the same fractionation document identity mark, ensuring a one-to-one correspondence between fractionation document identity marks and documents. In the above definition, if doc_id has too many digits for long-type data to hold the result, the fractionation document identity mark can be defined as string-type data instead.
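The structure "document score + original document identity mark" can be sketched as below. The function names are hypothetical, and the sketch assumes that the score is kept to six decimal places and that the original ID has at most id_digits digits; Python integers are unbounded, so the overflow limit of a 64-bit long is not modeled here.

```python
def make_scored_id(score, doc_id, id_digits):
    """Pack a score and an original ID into one integer: score digits first, ID digits last."""
    if not 0 <= doc_id < 10 ** id_digits:
        raise ValueError("doc_id does not fit in id_digits")
    score_part = round(score * 10 ** 6)  # assumption: six decimal places of the score are kept
    return score_part * 10 ** id_digits + doc_id


def split_scored_id(scored_id, id_digits):
    """Recover (score, doc_id) from a packed identifier."""
    score_part, doc_id = divmod(scored_id, 10 ** id_digits)
    return score_part / 10 ** 6, doc_id


high = make_scored_id(9.5, 42, id_digits=8)         # popular document, small original ID
low = make_scored_id(7.25, 99999999, id_digits=8)   # less popular document, large original ID
```

Comparing the two packed values as plain integers orders the documents by score, and two documents with equal scores still receive different identifiers because their original IDs differ.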
The above example is a relatively simple way of defining the fractionation document identity mark; other ways can also be used. For example, the document score part may be processed further, such as by discretizing the document score and using the discretized score as the front part of the fractionation document identity mark. Discretization converts continuous document scores into a number of discrete score levels, which makes it easier to classify documents.
Step S103: in the inverted list generation process, use the fractionation document identity mark of the document associated with each record as the sort basis of the record under the keyword entry where it is located.
This step specifies the role of the fractionation document identity mark in the final generation of the inverted list. There are many ways of actually generating an inverted list; since that is not the focus of this application, they are not described in detail here.
The so-called inverted list is an inverted index; because a table form is generally used, the inverted index is often referred to as an inverted list. This application understands the inverted list as the file that records the inverted index and is not concerned with whether a table form is actually used. In the following, inverted list, inverted file, and inverted index all have the same meaning.
The meaning of inverted list is explained in detail below, while introducing other related notions.
An index is a structure that sorts the values of one or more columns in a database table; with an index, specific information in the database table can be accessed quickly. The file that records the index information is the index file. For a search engine, because the search space is huge, the index file is extremely important, and a search engine generally builds an index over the documents it collects. Because an index is recorded in a file, and the file mostly takes table form, the terms index, index file, and index table often have the same meaning in practice.
Index file used in search engine includes forward index and inverted index.
The forward index views words from the perspective of the document: for each document (identified by its document identity mark), it indicates which words the document contains, how many times each word occurs (term frequency, abbreviated TF), and where each word occurs (generally recorded as the offset of the word relative to the start of the document). The forward index uses the document identity mark as its sort basis.
The inverted index (also called inverted files) is the opposite of the forward index: it views documents from the perspective of the word. For each word (keyword), it records which documents the word occurs in (recorded by document ID), how many times the word occurs in each of those documents (the term frequency, abbreviated TF), and where it occurs (the position offset, abbreviated offset).
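A minimal sketch of such an inverted index is given below. It assumes the documents have already been segmented into lists of terms, and it records token positions as offsets rather than character offsets; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: {"tf": count, "offsets": [positions]}}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for offset, term in enumerate(terms):
            record = index[term].setdefault(doc_id, {"tf": 0, "offsets": []})
            record["tf"] += 1
            record["offsets"].append(offset)
    return index


# Hypothetical, already-segmented documents keyed by document ID.
docs = {
    1: ["love", "song", "love"],
    2: ["song", "mp3"],
}
index = build_inverted_index(docs)
```

Looking up index["love"] returns the records of every document containing that term, which is the word-to-document direction described above.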
A search engine usually receives a text query request as the search condition. The text query request is preprocessed to form several participles (Terms) and the logical relationships between them, and this information is used to perform retrieval against the index table.
The word participle (word segmentation) has both a verb sense and a noun sense. Each participle is a word or phrase, that is, a minimum semantic unit with a definite meaning. For a received text query request, the minimum semantic units it contains must be divided out; this process is called participle, so the word can refer to the process of dividing out the minimum semantic units. On the other hand, the minimum semantic units obtained after division are also often called participles, that is, the words obtained once the operation has been executed. To distinguish the two senses, the minimum semantic unit denoted by the latter sense is called a participle object (Term), and this application uses that name. A participle object corresponds to a keyword that serves as the indexing basis in the inverted list. In Chinese, a word serving as a minimum semantic unit is often made up of a varying number of characters, and there are no natural delimiters between words such as the spaces found in alphabetic scripts; for Chinese, therefore, accurate segmentation to obtain reasonable participle objects is an important step.
When a search engine is in use, the actual task is to look up documents by participle object, and in this usage scenario an inverted index retrieves best. For a search engine that has to deal with massive numbers of documents, building the inverted index is therefore important work.
This step places a restriction on the inverted list generation process: however the inverted list is generated, the ordering relationship of the records must be determined by the fractionation document identity mark in each record, which in turn determines the position of each record in the inverted list.
Refer to Fig. 2, Fig. 3, and Fig. 4. Fig. 2 is a structural schematic diagram of an ordinary inverted list without the above restriction; Fig. 3 is a structural schematic diagram of an inverted list restricted by the above step; Fig. 4 illustrates the order in which the records under each keyword entry are placed in an inverted list restricted by the above step. The features of the inverted list generated under the restriction of step S103 are described below with reference to these figures. In table form, an entry can be intuitively understood as a row of the table, and the content of each cell in the row can be regarded as a record.
As shown in Fig. 2 and Fig. 3, the index structure of the inverted list uses keywords as the sort basis. The keywords in the inverted list correspond to the participle objects (Terms) obtained by segmenting a text query request, that is, "mp3", "Xihai sea", "love song", "drop centre Drolma", and so on in Fig. 2 and Fig. 3. In the inverted list, each keyword corresponds to one entry, and the entries are ordered by these keywords. Under each keyword entry, information about the documents containing that keyword is stored, with the information about each such document stored as one record under the entry. During retrieval, the corresponding keyword is looked up in the inverted list according to the participle object, and the main goal is to obtain the document identity marks recorded in the records under that keyword's entry. Besides the document identity mark, each record of the inverted list generally also contains information such as the term frequency and the position of the keyword in the document. Each record is in effect associated with one specific document through the document identity mark; in this specification, the file corresponding to the document identity mark contained in a record is called the document associated with that record.
Fig. 2 shows a schematic diagram of an inverted list obtained without the restriction of step S103. As shown in Fig. 2, the document identity mark with which each record identifies its associated document is an ordinary document identity mark, called the original document identity mark in this application to distinguish it from the fractionation document identity mark. The original document identity mark is obtained when the document is generated, or assigned when the document is put into the search engine's database; it is entirely unrelated to document temperature and is merely a mark generated at random according to some rule to distinguish different documents. Therefore, if the records under each entry are sorted by this document identity mark, the position of each record is unrelated to document temperature.
After the restriction of step S103 is applied, the inverted list formed is as shown in Fig. 3. In this inverted list, each record under an entry is one complete data record for a document containing the entry's participle object, comprising the document identity mark (document ID), the term frequency (TF), and the occurrence position (offset); the document identity mark in the record is a fractionation document identity mark. Because the fractionation document identity mark is used and, as described above, the document score reflecting document temperature occupies its front part, using the fractionation document identity mark as the sort basis of the records under a keyword entry means that the position of each record actually reflects document temperature.
For the records under a keyword entry, the ordering relationship between documents can be determined from the fractionation document identity marks in the records themselves. For example, when the data type of the fractionation document identity mark is Long, the ordering of the records under the same entry can be determined by numeric size; for String data, lexicographic order can be used.
To make it easy to retrieve documents with higher document scores first, the records of associated documents with higher document scores should be placed toward the front of each entry. To this end, the position of each record under its keyword entry can be determined by the reverse (descending) order of the ordering relationship determined by the document identity marks.
Refer to Fig. 4, which illustrates the positions of the records in three keyword entries of an inverted list generated under the restriction of step S103. The figure does not show record elements such as the term frequency and occurrence position contained in each record; it shows only the document identity marks, which here are fractionation document identity marks.
As shown in Fig. 4, the records under each entry are arranged in reverse order by the numeric size of the Long-type fractionation document identity mark. This arrangement lets the inverted list provide two kinds of information. One is the retrieval order given by sorting the entries on their keywords (participle objects); in Fig. 4 the keywords are sorted by character count and stroke count (left to right indicates sequence). The other is the temperature ranking of the documents associated with the records under each entry.
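The reverse-order arrangement of records under each entry can be sketched as follows. The record layout and the identifier values are hypothetical; the sketch assumes integer (Long-style) identifiers whose leading digits encode the document score.

```python
def order_entries(inverted_list):
    """Sort the records under every keyword entry by scored ID, largest first."""
    return {
        keyword: sorted(records, key=lambda rec: rec["id"], reverse=True)
        for keyword, records in inverted_list.items()
    }


# Hypothetical entries; each record holds a scored ID and a term frequency.
inverted_list = {
    "love song": [
        {"id": 10000001, "tf": 3},
        {"id": 95000007, "tf": 1},
        {"id": 72500003, "tf": 2},
    ],
    "mp3": [
        {"id": 72500003, "tf": 1},
        {"id": 95000007, "tf": 4},
    ],
}
ordered = order_entries(inverted_list)
```

Since this sorting depends only on the identifiers, it can be carried out when the inverted list is built rather than at query time.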
By setting up the inverted list in this way, document temperature is provided neatly within the index, and the additional computation can be completed on an offline server rather than while the search engine is in use. Using this inverted list can significantly shorten the time needed to obtain the required documents and significantly improve the experience of using the search engine.
The second embodiment of this application provides an inverted list generating device for retrieval, comprising a document score computing unit 201, a fractionation document identity mark granting unit 202, and an inverted list generation unit 203. Refer to Fig. 5, which provides a unit block diagram of the inverted list generating device for text retrieval.
The document score computing unit 201 is configured to calculate, from the relevant historical data, the document score of each document serving as a retrieved item.
The historical data includes one or more attribute values related to document temperature, and the document score of a document serving as a retrieved item is calculated on the basis of those attribute values.
If the historical data includes multiple attribute values, a weight is assigned to each attribute, and the document score of the document serving as a retrieved item is calculated from the attribute value of each attribute and its corresponding weight.
For ease of use, the document score can generally be normalized to a definite range of values.
The fractionation document identity mark granting unit 202 is configured to assign a fractionation document identity mark to each document on the basis of that document's score.
The fractionation document identity mark can be implemented in various ways; a commonly used structure is: document score + original document identity mark. Different data types can be chosen for the fractionation document identity mark depending on the situation, generally the Long type or the String type.
The inverted list generation unit 203 is configured to generate the inverted list; in the inverted list generation process, the fractionation document identity mark of the document associated with each record is used as the sort basis of the record under the entry where it is located.
In this unit, the position of each record in the inverted list is generally determined by the reverse order of the ordering relationship determined by the fractionation document identity marks in the records.
The third embodiment of this application provides a method of performing retrieval using an inverted list generated by the above method. Fig. 6 is a flow chart of this embodiment, which is explained below with reference to Fig. 6.
Step S301: receive an inquiry request.
This step obtains the retrieval information needed to start a search.
The inquiry request is a query requirement entered by the user in text form. Its content can be divided into query text and query parameters.
The query text is text composed of words (as in English) or characters (as in Chinese). The query parameters usually take the form of retrieval symbols, or of agreed characters followed by specific parameters, which restrict the retrieval.
The query text in the inquiry request generally needs to be segmented to obtain participle objects, on which the subsequent retrieval is based. Segmentation relies on a segmentation dictionary or on delimiters that occur naturally in the language. For example, for the Chinese search "Xihai sea love song drop centre Drolma mp3", the front part can be decomposed by the segmentation dictionary into "Xihai sea", "love song", and "drop centre Drolma", and "mp3" can be separated out by the space character and the segmentation dictionary.
The retrieval symbols in the query parameters are generally used to express the relationship between participle objects in the search. For example, a "-" symbol placed before a word or character indicates that the word or character following the symbol must be excluded, that is, a set difference operation is performed; an "OR" symbol connecting two words indicates an "or" relationship between them, that is, a union operation is performed on the parts before and after it. A query parameter can also be an agreed character string followed by a specific parameter; for example, "site:www.taobao.com" indicates that this text query request only needs to search the content of one specific website, namely www.taobao.com.
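How such retrieval symbols might be separated from the query text can be sketched as below. This is a simplified illustration, not the parser of any particular search engine: the function name and result layout are hypothetical, the query is assumed to be whitespace-delimited, and handling of the "OR" symbol is omitted for brevity.

```python
def parse_query(query):
    """Split a whitespace-delimited query into required terms, excluded terms, and a site filter."""
    parsed = {"all_of": [], "none_of": [], "site": None}
    for token in query.split():
        if token.startswith("site:"):
            parsed["site"] = token[len("site:"):]   # agreed characters plus a specific parameter
        elif token.startswith("-") and len(token) > 1:
            parsed["none_of"].append(token[1:])     # "-" marks a term to exclude
        else:
            parsed["all_of"].append(token)          # ordinary query text
    return parsed


parsed = parse_query("love song -remix site:www.taobao.com")
```

The excluded terms would later be applied as a set difference, and the site value as a restriction on where the search is carried out.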
Query parameters can provide the Merger (merge relationship) between different participle objects of the text. However, not every Merger has to be obtained from retrieval symbols; the Merger between different participle objects can sometimes be obtained by semantic analysis.
The text query request is usually submitted remotely, through a client, by the user making the inquiry.
Step S302: perform retrieval guiding pretreatment on the query text and query parameters provided by the inquiry request to obtain pre-processed results.
The purpose of this step is to preprocess the query text provided by the searcher, before the formal query, so that it meets the needs of retrieval and the query target of the user who submitted the text query request is made explicit.
If the inquiry request is a language form designed to suit the natural speech habits of search engine users, the pre-processed results formed by retrieval guiding pretreatment provide the search task with a language form that is easy for a machine to understand. Specifically, they include the keywords to be used in this search (called participle objects to be retrieved in this application), the logical relationships between those keywords (called the Merger in this application), and the importance of each keyword (embodied as the weight of the participle object); matching degree calculation parameters can also be obtained.
The main task of retrieval guiding pretreatment is to segment the query text and obtain the keywords to be retrieved. The words obtained by segmenting a sentence are generally called participles (Terms); as mentioned above, to distinguish the verb and noun senses of participle, this application calls the words obtained after segmentation participle objects. Not all participle objects obtained by segmenting the query text are worth searching for; words that carry no meaning for the search need to be excluded. The participle objects selected as search keywords are called participle objects to be retrieved.
The participle process in pretreatment is described in detail below.
The query text provided in the inquiry request is usually described in natural language, and there is a gap between natural-language expression and the needs of a search engine query. A search engine retrieves text content by using the inverted list to obtain documents that contain a keyword, whereas a query requirement described in natural language does not directly determine the keywords. Chinese is written in units of characters, while the minimum semantic unit that actually carries meaning is the word. Because there is no separator between Chinese words like the space between English words, it is not known in advance which characters in a text form words, so segmenting Chinese text is important work. Moreover, a query text contains some elements that are valuable only for natural-language understanding; to find relevant content, a search engine must determine which elements are genuinely valuable as a retrieval basis.
For example, take the text query request "Xihai sea love song drop centre Drolma mp3" from the first embodiment. If no segmentation is performed on the query text and the two parts "Xihai sea love song drop centre Drolma" and "mp3" are used as the retrieval basis, suitable content is unlikely to be found, because very little content actually contains the exact string "Xihai sea love song drop centre Drolma", so no useful results are obtained. As an audio format, "mp3" has limited significance: many audio documents that meet the requirement are not in mp3 format, and conversely many documents in mp3 format do not state in their text that they are mp3 files. Using "mp3" as a keyword therefore gives poor retrieval results.
During retrieval guiding pretreatment, the query text is segmented to obtain participle objects. Segmentation relies on a relevant dictionary, from which it can be determined which parts of a text can serve as search keywords, and which also provides clues about the Merger between search keywords.
For example, following participle object can be obtained as retrieval pass by analyzing above-mentioned " love song drop in the Xihai sea entreats Drolma mp3 " Keyword " Xihai sea " " love song " " drop centre Drolma " and " mp3 ".
When the query text is segmented, not every participle object obtained can serve as a search keyword; those that can are called participle objects to be retrieved. For example, if an auxiliary word appears in a Chinese query text, then although it is an independent word from the segmentation point of view, it has no discriminating power for retrieval, and such participle objects are not used as participle objects to be retrieved. Therefore, only some of the participle objects obtained by segmenting the query text can serve as search terms; the participle objects to be retrieved are a subset of the participle objects obtained. During pretreatment, the participle objects to be retrieved must be screened out from the participle objects obtained, and the screening can be based on the dictionary used by the search engine.
Another task of retrieval guiding pretreatment is to obtain the Merger of the participle objects to be retrieved.
The so-called Merger concerns how the information retrieved for the individual participle objects to be retrieved relates to one another. Since these relationships are mainly expressed with Boolean operations, they may be called Boolean query relationships. The basic Mergers are intersection, union, and set difference; some other Mergers can also be included, for example that certain keywords may or may not be satisfied.
In an intersection operation, several participle objects must all be present in a document for it to be a recalled document. In a union operation, a document that satisfies any one of several participle objects can be a recalled document. In a set difference operation, a document must satisfy some of the participle objects and must not contain the others. These relationships generally have to be determined from the retrieval symbols contained in the text query request, but where there are no retrieval symbols they are often determined by predefined rules. For example, in the above text query request "Xihai sea love song drop centre Drolma mp3", the participle objects "Xihai sea", "love song", "drop centre Drolma", and "mp3" are obtained, and "Xihai sea", "love song", and "drop centre Drolma" are naturally treated as all required, that is, an intersection operation is used.
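The three basic operations can be illustrated with Python sets. The document IDs below are hypothetical, and the extra keyword "remix" is introduced only to demonstrate the difference operation; a real engine would typically merge sorted record lists instead of building sets.

```python
def intersection(postings, terms):
    """Documents that contain every one of the given terms."""
    sets = [postings.get(term, set()) for term in terms]
    return set.intersection(*sets) if sets else set()


def union(postings, terms):
    """Documents that contain at least one of the given terms."""
    return set().union(*(postings.get(term, set()) for term in terms))


def difference(postings, keep, drop):
    """Documents that contain all terms in keep and none of the terms in drop."""
    return intersection(postings, keep) - union(postings, drop)


# Hypothetical document IDs under each keyword entry.
postings = {
    "Xihai sea": {1, 2, 3, 5},
    "love song": {1, 2, 5, 8},
    "drop centre Drolma": {1, 2, 4},
    "remix": {2},
}
all_three = intersection(postings, ["Xihai sea", "love song", "drop centre Drolma"])
either = union(postings, ["drop centre Drolma", "remix"])
without_remix = difference(postings, ["Xihai sea", "love song"], ["remix"])
```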
After the participle objects are obtained, the search algorithm of the search engine may also, according to semantics and word order, treat certain words as key words and ignore, or largely ignore, certain others. For example, in the above text query request "Xihai sea love song drop centre Drolma mp3", pretreatment can determine that "mp3" merely means, for the search engine user, that related audio files are wanted; "mp3" can be used as a bonus item, but documents without the word mp3 can also be recalled documents.
Taking the above text query request "Xihai sea love song drop centre Drolma mp3" as an example, preprocessing the request yields the Merger shown in Fig. 7. As can be seen, the participle objects to be retrieved "Xihai sea", "love song", and "drop centre Drolma" must all be satisfied, which is indicated in the figure by the MUST relationship, and these words are combined with one another by intersection. "mp3", being a participle object of little significance to the query, is indicated by the SHOULD relationship: a document that satisfies it can receive bonus points.
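The MUST and SHOULD relationships can be sketched as follows. The function name, the document IDs, and the rule of one bonus point per satisfied SHOULD term are assumptions made for illustration; the application does not specify how the bonus is computed.

```python
def recall_with_bonus(postings, must, should):
    """Return {doc_id: bonus} for documents matching all MUST terms.

    The bonus counts how many SHOULD terms the document also contains.
    """
    if not must:
        raise ValueError("at least one MUST term is required")
    candidates = set.intersection(*(postings.get(term, set()) for term in must))
    return {
        doc: sum(doc in postings.get(term, set()) for term in should)
        for doc in sorted(candidates)
    }


# Hypothetical document IDs under each keyword entry.
postings = {
    "Xihai sea": {1, 2, 3},
    "love song": {1, 2},
    "drop centre Drolma": {1, 2, 4},
    "mp3": {2, 9},
}
bonus = recall_with_bonus(
    postings,
    must=["Xihai sea", "love song", "drop centre Drolma"],
    should=["mp3"],
)
```

Document 9 contains "mp3" but is not recalled because it fails the MUST terms, while document 1 is recalled without any bonus.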
Retrieval guiding pretreatment also includes weight analysis of the participle objects: after the participle objects are obtained, each is analyzed to obtain its weight, and in subsequent steps each participle object is processed according to its weight.
For example, in the pre-processed results shown in Fig. 7, the participle object "mp3" is given a relatively low weight. The other words must all be satisfied without exception, whereas "mp3" need not be; but if there are many candidate recalled documents, whether a document contains "mp3" can be used to distinguish the priority of the candidates.
In many cases the weight of each keyword can also be determined by the order in which the keywords appear in the text query request, and documents in which higher-weight keywords occur more often are output preferentially. For example, for the query "college entrance examination paper Chinese language 2016", "college entrance examination" is treated as the higher-weight keyword; when documents are provided to the search engine user in a subsequent step, documents in which the keyword "college entrance examination" occurs more often (as determined from the term frequency in the records) can be given priority. Weight analysis of participle objects in pretreatment is fairly complex and may play a role in each subsequent step; since it is not the focus of this application, it is not elaborated here.
Retrieval guiding pretreatment can also obtain matching degree calculation parameters.
Matching degree calculation is performed in a subsequent step. A matching degree calculation method can be preset in the search engine, and it may use certain related parameters; in the calculation, these parameters help the search engine determine the matching degree between a recalled document and the text query request, and hence the order in which the recalled documents are presented. Matching degree calculation is discussed in detail in the relevant step.
Step S303: according to the participle objects to be retrieved provided by the pre-processed results and the Merger between them, perform inverted list queries and Merging on the participle objects to be retrieved to obtain a predetermined number of recalled documents. The inverted list has the following feature: each of its records uses a fractionation document identity mark as the document identifier of the associated document, and the records are sorted under each keyword entry by the fractionation document identity mark of the associated document.
The task of this step is to perform retrieval according to the pre-processed results obtained in step S302. This is the core step in the execution of the entire search method.
This step uses the participle objects to be retrieved (the search keywords) obtained in step S302 and determines how to use them according to the Merger between them. The participle objects to be retrieved are looked up through an inverted list, which is generated by the inverted list generation method provided in the first embodiment of this application. The inverted list has the following feature: each of its records uses a fractionation document identity mark as the document identifier of the associated document, and the records are sorted under each keyword entry by the fractionation document identity mark of the associated document. With such an inverted list, the order of the records in the list itself causes associated documents with high document scores to be chosen first as the list is read. The following requirement can also be stated explicitly for the query process: in inverted list queries and Merging, using the order of records under the keyword entries of the inverted list as the basis, preferentially select the documents with high document scores among those meeting the requirements as recalled documents.
In this application, a recall document is a document that satisfies the retrieval requirement. Search engines usually display the total number of such documents in their search results. For a search term that is not rare, there may be hundreds of thousands or even millions of documents that satisfy the search condition. These qualifying documents are not all actually recalled; otherwise the workload of recalling them and of sorting them would be enormous. In practice, a certain number of documents are selected from among them for actual recall.
In the present embodiment, the records under each keyword entry of the inverted list are arranged in descending order of the fractionation document identity marks of their associated documents; that is, the larger the fractionation document identity mark in a record, the nearer the front the record is placed. With this ordering, simply choosing qualifying records from front to back is enough to pick out the documents with relatively high document scores. How the inverted list is obtained is described in detail in the first embodiment of the application and is not repeated here. In short, the fractionation document identity mark in an inverted list record also characterizes the document score of the document it identifies; the higher the document score, the more popular the document, and the more likely it is to be the result the search engine user wants.
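As a non-limiting illustration, the following Python sketch shows one way such an inverted list could be held in memory, with the records of every keyword entry kept in descending order of fractionation document identity mark. The record layout and the names `build_inverted_list` and `top_records` are assumptions made for this example and are not taken from the application.

```python
from collections import defaultdict


def build_inverted_list(postings):
    """Build a keyword -> records mapping (hypothetical in-memory layout).

    `postings` is an iterable of (keyword, frac_doc_id, term_freq) tuples,
    where frac_doc_id stands for the fractionation document identity mark.
    """
    index = defaultdict(list)
    for keyword, frac_doc_id, term_freq in postings:
        index[keyword].append((frac_doc_id, term_freq))
    # Keep every keyword entry in descending order of the identity mark,
    # so that reading front to back visits high-scoring documents first.
    for records in index.values():
        records.sort(key=lambda record: record[0], reverse=True)
    return dict(index)


def top_records(index, keyword, k):
    """Return the identity marks of the first k records of a keyword entry."""
    return [record[0] for record in index.get(keyword, [])[:k]]
```

Because the ordering is fixed when the list is built, choosing high-scoring documents at query time needs no extra sort; `top_records` simply reads the front of the entry.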
As described for step S302 above, in the step of performing inverted list query and merging operations on each participle object to be retrieved, the merging operation includes at least one of the following: intersection operation, union operation, and difference operation. These are only the main kinds of merging operation; in practice other merging operations may also be included.
Compared with text searching methods in the prior art, the use of the above inverted list based on fractionation document identity marks is the main feature of the text searching method provided by the application. Using this inverted list gives the retrieval process its own characteristics; as noted above, these mainly lie in the fact that documents with higher document scores are retrieved first as recall documents. The specific way of retrieving with this inverted list differs according to the merge relationship. Under the intersection operation, use of this kind of inverted list has particularly notable characteristics, so the method of determining candidate recall documents when the merging operation is an intersection operation is described in detail below.
Please refer to Fig. 6, which is a flow chart of the method, in the text searching method provided by the application, for determining candidate recall documents in the intersection operation when the merging operation performed during inverted list query on each participle object to be retrieved includes an intersection operation. It should be noted that the merging operation may also include other operations at the same time; only the steps of the intersection operation are described here.
The premise for entering this procedure is that the participle objects to be retrieved have been obtained in step S302 above, and that, according to the merge relationships among them, an intersection relationship exists among at least some of the participle objects to be retrieved. The following description continues to use the earlier text query request "Xihai sea love song drop centre Drolma mp3" as the retrieval example. As established in step S302, after pre-processing, retrieval is to be carried out according to the merge relationship given in Fig. 6, in which "Xihai sea", "love song" and "drop centre Drolma" are in an intersection relationship. The inverted list shown in Fig. 4 is used as the basis for retrieval below. The feature of the inverted list of Fig. 4 used in the intersection operation is that the records under each keyword entry are arranged in descending order, with the fractionation document identity mark of each record's associated document as the sorting key.
In this situation, the method by which the intersection operation determines candidate recall documents can be summarized as follows: within the keyword entries of the participle objects to be retrieved that take part in the intersection operation, search from front to back for records that satisfy the following condition: the fractionation document identity mark of the record has an associated record in the keyword entry of every participle object to be retrieved that takes part in the intersection operation.
The above condition only states the specific requirement for retrieving participle objects to be retrieved that are in an intersection relationship. Based on that requirement and on the inverted list described above, a concrete scheme for retrieving the participle objects to be retrieved that take part in the intersection operation is given below. It is described in detail with reference to Fig. 7; please also refer to Fig. 4.
In actual retrieval, the recall documents obtained in step S303 often have to satisfy not only the set of participle objects to be retrieved in the intersection operation but possibly other conditions as well; those conditions can be satisfied through other forms of retrieval and merging. For convenience of description, a document that satisfies the intersection condition is called a candidate recall document, to distinguish it from a recall document that satisfies all the conditions.
Before the retrieval for this intersection operation is carried out, several parameters need to be set, including:
Pre-candidate set recall document number: needHitDoc = page * pageSize * n, where page starts from 1 and n is taken as 3 here, i.e. three times as many documents are obtained for the pre-candidate set.
needHitDoc denotes the pre-candidate set recall document number; the pre-candidate set consists of the documents that actually need to be recalled and sorted. pageSize denotes how many recall documents one page needs to show. The needHitDoc value keeps the number of documents actually recalled and sorted from becoming excessive, which reduces the time spent on recalling and sorting. Only if the user has turned through three pages and still has not found the required document is the next round of document recall started. This amount of data satisfies the requirements of most information search operations.
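The arithmetic of the pre-candidate set recall document number can be sketched in Python as follows. The function name `need_hit_doc` and the input check are illustrative assumptions; only the formula page * pageSize * n and the value n = 3 come from the description above.

```python
def need_hit_doc(page, page_size, n=3):
    """Number of documents to actually recall and sort for a request.

    `page` starts from 1, `page_size` is the number of recall documents
    shown on one page, and `n` is the multiplier (3 in this embodiment).
    """
    if page < 1 or page_size < 1 or n < 1:
        raise ValueError("page, page_size and n must all be at least 1")
    return page * page_size * n
```

For example, a request for the first page with 10 results per page would recall and sort 30 documents under this formula.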
Step S303-1: in the set of participle objects to be retrieved that take part in the intersection operation, determine one participle object to be retrieved as the current participle object to be retrieved; this current participle object to be retrieved is the starting point for traversing each element of the set.
This step determines the first participle object to be retrieved.
An intersection operation involves at least two participle objects to be retrieved, which can be regarded as a set, called the set of participle objects to be retrieved. For example, pre-processing the text query request "Xihai sea love song drop centre Drolma mp3" yields the set of participle objects to be retrieved that take part in the intersection operation: (Xihai sea, love song, drop centre Drolma).
In the method of determining candidate recall documents in the intersection operation shown in Fig. 8, the documents that contain every element of the set of participle objects to be retrieved must be selected as candidate recall documents, so the elements of the set are traversed cyclically in a fixed order. Any participle object to be retrieved in the set may serve as the starting point of the traversal; in the present embodiment it makes little difference which one is chosen first. Here it is assumed that "Xihai sea" is the first participle object to be retrieved, i.e. the starting point of the first round of traversal, and that the fixed traversal order 1. Xihai sea, 2. love song, 3. drop centre Drolma is followed throughout, round after round.
The participle object to be retrieved that is being searched at a given moment is the current participle object to be retrieved; the traversal is a process of continually updating the current participle object to be retrieved.
Step S303-2: in the inverted list entry of the current participle object to be retrieved, obtain the fractionation document identity mark in the record at the front-most position, take it as the current document identity mark, and set the value of the participle object counter to 1.
Each participle object has a corresponding entry in the inverted list. The entry may be called the keyword entry of that keyword, the inverted list entry corresponding to that keyword, or the inverted list keyword entry corresponding to that keyword. In this application, "keyword entry", "inverted list keyword entry" and "inverted list entry" have the same meaning: the entry in the inverted list that corresponds to a particular keyword.
Fig. 4 shows the entries corresponding to "Xihai sea", "love song" and "drop centre Drolma", with the records stored under each entry. Each record contains three items of data: the document identity mark, the word frequency, and the positions of occurrence. In the present embodiment the document identity mark is a fractionation document identity mark.
This step retrieves a record from the inverted list entry of the current participle object to be retrieved.
As noted above, the records under each keyword entry of the inverted list are arranged in descending order of the fractionation document identity marks of their associated documents, i.e. the larger the mark, the nearer the front the record is placed, as can be seen in Fig. 4. Because of this ordering, the document associated with the front-most record is the one that currently has the highest document score and should be selected first. Accordingly, this step obtains the fractionation document identity mark in the front-most record and takes it as the current document identity mark, which serves as the basis for comparison in later steps. In the example of Fig. 4, if this step is in the first round of the traversal loop, the fractionation document identity mark 1233112 in the front-most record is the current document identity mark.
This step also sets the value of the participle object counter to 1. The counter records how many of the participle objects to be retrieved in the intersection operation are satisfied by the document identified by the current document identity mark. When the counter value equals the number of elements in the set of participle objects to be retrieved that take part in the intersection operation, the document identified by the current document identity mark satisfies all the conditions of the intersection operation.
Step S303-3: in the set of participle objects to be retrieved that take part in the intersection operation, update the current participle object to be retrieved to the next participle object after it.
This step updates the current participle object to be retrieved, which determines the next object to traverse. The current participle object to be retrieved, first set in step S303-1, records which word the traversal loop is pointing at; updating it changes the object currently being traversed. In the retrieval example of this embodiment, if this step is entered from step S303-2, the current participle object to be retrieved becomes "love song" after the update.
The traversal uses a fixed order, so the cyclic order of the participle objects to be retrieved must be arranged in advance. The successor of each participle object to be retrieved is fixed, and the objects are read in sequence, round after round.
Step S303-4: query the keyword entry of the current participle object to be retrieved in the inverted list, and find the first record whose fractionation document identity mark is less than or equal to the current document identity mark; take the fractionation document identity mark in that record as the document identity mark to be judged.
Using the current document identity mark as the basis, this step searches the keyword entry of the current participle object to be retrieved (just updated in the previous step) for the first record whose fractionation document identity mark is less than or equal to the current document identity mark. Since the records of the inverted list are arranged in descending order of the fractionation document identity marks of their associated documents, a record that satisfies the intersection requirement, that is, one that satisfies both the participle object queried previously and the current participle object to be retrieved, can only be found among the records whose marks are less than or equal to the current document identity mark. Retrieval always proceeds from front to back by record position, so taking the first record whose mark is less than or equal to the current document identity mark cannot miss any record that might qualify.
The fractionation document identity mark in the record found is taken as the document identity mark to be judged, which is used in the following steps.
In the example of this embodiment, the first record of "love song" is examined first. Its document identity mark (a fractionation document identity mark) is 9343223, which does not satisfy the condition of being less than or equal to the current document identity mark (1233112). The document identity mark in the second record is 78009, which satisfies the condition, so 78009 becomes the document identity mark to be judged.
Step S303-5: judge whether the document identity mark to be judged is equal to the current document identity mark. If so, go to the next step. If not, go to step S303-6' and do the following: set the value of the participle object counter to 1, take the document identity mark to be judged as the current document identity mark, and then return to step S303-3, i.e. the step of updating the current participle object to be retrieved to the next participle object in the set of participle objects to be retrieved that take part in the intersection operation (starting again from the first element if the end of the set has been reached).
This step compares the document identity mark to be judged with the current document identity mark. If they are equal, the two marks identify the same document, so the document identified by the document identity mark to be judged also contains the current participle object to be retrieved, and the next step can be entered.
If they are not equal, the two marks identify different documents, and the participle object counter must start counting again, i.e. its value is set to 1. The document identity mark to be judged is then taken as the current document identity mark and the procedure returns to step S303-3, repeating steps S303-3 through S303-5. In essence, this checks whether the document identified by the new current document identity mark is also associated with a record under the inverted list keyword entry of the next participle object.
In the example of this embodiment, in the first round of traversal, the current document identity mark obtained by querying the first participle object "Xihai sea" is 1233112 and the document identity mark to be judged is 78009. They are not equal, so step S303-6' is entered: the participle object counter is set to 1, the document identity mark to be judged, 78009, becomes the current document identity mark, and the procedure returns to step S303-3. Following the traversal order, the entry of the next word, "drop centre Drolma", is read, and step S303-4 searches it for a record whose document identity mark is less than or equal to 78009, obtaining the next document identity mark to be judged, 9200. Back in this step, 9200 is compared with the current document identity mark 78009; they are not equal, so step S303-6' is entered again: the participle object counter is set to 1, 9200 becomes the current document identity mark, and the procedure returns to step S303-3. In step S303-3, "Xihai sea" is chosen as the updated current participle object to be retrieved. Step S303-4 then queries its inverted list entry for the first record whose fractionation document identity mark is less than or equal to the current document identity mark 9200; this is the third record, whose document identity mark is 9200. This time the judgment in this step is yes, and the next step can be entered.
Step S303-6: add 1 to the value of the participle object counter.
Since the judgment in step S303-5 was yes, one more participle object is satisfied, so the participle object counter is incremented by 1.
Step S303-7: judge whether the value of the participle object counter equals the total number of elements in the set of participle objects to be retrieved. If so, the document corresponding to the current document identity mark is determined to be a candidate recall document, and the next step is entered. If not, return to step S303-3.
This step judges whether the document identified by the current document identity mark satisfies all the participle objects to be retrieved. The participle object counter is set to 1 when the first participle object is satisfied and is incremented each time the current document identity mark satisfies a further participle object. When it reaches the total number of elements in the set of participle objects to be retrieved, the document identified by the current document identity mark satisfies every participle object of the intersection operation and can serve as a candidate recall document. If the judgment is no, the procedure returns to step S303-3 and continues with the next participle object.
In the example of this embodiment, after the record whose fractionation document identity mark equals 9200 is found in the "Xihai sea" entry, the participle object counter equals 2, which is still short of the requirement. The judgment in this step is therefore no, and the procedure returns to step S303-3, where the current participle object to be retrieved is updated to "love song" as the next object to traverse. Since the "love song" entry also contains a record whose document identity mark is 9200, when this step is reached again the counter value is 3, equal to the number of elements in the set of participle objects to be retrieved. The judgment is now yes: the document identified by the fractionation document identity mark 9200 becomes a candidate recall document, and the next step can be entered.
Step S303-8: determine the document corresponding to the current document identity mark to be a candidate recall document.
This step formally determines a candidate recall document. It is called a candidate because the series of steps in Fig. 7 covers only the intersection operation part of the processing of the participle objects to be retrieved; other judgments may still be needed before a formal recall document can be determined.
Step S303-9: in the inverted list entry of the current participle object to be retrieved, take the record that follows the record containing the current document identity mark as the record at the front-most position, and return to step S303-2.
After a candidate recall document has been obtained, this step returns to step S303-2 to search for the next candidate recall document. Designating the record after the one containing the current document identity mark as the front-most record fixes the initial condition for step S303-2: the next round of traversal starts from the current participle object to be retrieved and looks for qualifying records beginning with the next record of its entry. In the example, the search for the next candidate recall document starts from the participle object "love song", at the record that follows the record containing document identity mark 9200.
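Steps S303-1 to S303-9 can be sketched in Python as shown below. This is an illustration under stated assumptions, not the implementation of the application: each keyword entry is reduced to a list of fractionation document identity marks in descending order (word frequency and positions are omitted), the function name `intersect_candidates` and the per-entry cursor array are hypothetical, and the sample entries reuse only the marks 1233112, 9343223, 78009 and 9200 from the walkthrough, with every other value invented for the example.

```python
def intersect_candidates(entries, limit):
    """Return up to `limit` marks that occur in every entry.

    `entries` holds one list of identity marks per participle object, each
    in descending order, given in the fixed traversal order.
    """
    k = len(entries)
    if k == 0 or any(not entry for entry in entries):
        return []
    cursors = [0] * k      # front-most unread record of each entry
    candidates = []
    cur = 0                # S303-1: start from the first participle object
    while len(candidates) < limit:
        # S303-2: the front-most record becomes the current identity mark.
        if cursors[cur] >= len(entries[cur]):
            break
        current_id = entries[cur][cursors[cur]]
        count = 1
        while count < k:
            cur = (cur + 1) % k                 # S303-3: next object, cyclic
            entry, pos = entries[cur], cursors[cur]
            # S303-4: first record whose mark is <= the current mark.
            while pos < len(entry) and entry[pos] > current_id:
                pos += 1
            cursors[cur] = pos
            if pos == len(entry):
                return candidates               # entry exhausted
            judged_id = entry[pos]
            if judged_id == current_id:         # S303-5
                count += 1                      # S303-6
            else:                               # S303-6'
                current_id, count = judged_id, 1
        candidates.append(current_id)           # S303-7 and S303-8
        cursors[cur] += 1                       # S303-9
    return candidates


xihai_sea = [1233112, 45000, 9200, 300, 120]
love_song = [9343223, 78009, 9200, 150, 120]
drolma = [560000, 9200, 300, 120]
print(intersect_candidates([xihai_sea, love_song, drolma], limit=10))
# -> [9200, 120]
```

Because the current document identity mark never increases during the traversal, each cursor only moves forward, so every entry is read at most once from front to back. A binary search could replace the linear scan in S303-4; the linear form is kept here because it mirrors the front-to-back reading described in the text.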
The procedure of Fig. 7 described above handles only the case where the merging operation is an intersection operation. The pre-processed results obtained in step S302 may contain not only participle objects to be retrieved that take part in an intersection operation but also ones that take part in a subtraction (difference) operation or an OR (union) operation. For a participle object that takes part in a subtraction operation, the documents satisfying the other conditions can be retrieved first, and those that contain the participle object to be subtracted are then screened out; the screening can be done through the forward index of the documents that satisfy the other conditions. An OR operation on participle objects is equivalent to performing two or more independent inverted list retrievals and merging their results.
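The subtraction and OR handling described above might look like the following Python sketch. The forward index is modelled, purely as an assumption, as a dictionary from identity mark to the set of terms in that document; the function names `apply_subtraction` and `apply_or` are likewise hypothetical.

```python
def apply_subtraction(candidate_ids, forward_index, subtracted_terms):
    """Drop candidates whose forward-index terms include a subtracted term.

    `forward_index` maps an identity mark to the set of terms contained in
    that document (an assumed representation of the forward index).
    """
    excluded = set(subtracted_terms)
    return [
        doc_id
        for doc_id in candidate_ids
        if excluded.isdisjoint(forward_index.get(doc_id, ()))
    ]


def apply_or(*result_lists):
    """Merge independent retrieval results, highest identity mark first."""
    merged = set().union(*result_lists)
    return sorted(merged, reverse=True)
```

Screening through the forward index touches only the documents that already passed the other conditions, which is why the text suggests it rather than reading the full inverted list entry of the subtracted term.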
After processing according to these pre-processed results, the documents that finally satisfy the conditions become the formal recall documents. Their number can be accumulated in a selected recall document counter. A limit, called the pre-candidate set recall document number, is generally set on the total number of recall documents; comparing the selected recall document counter with that limit determines whether the upper limit has been reached. Once it has, the selected documents obtained so far become the formal recall documents, and the following steps sort them. Meanwhile, the search and merging may continue, with a preset total hit result counter incremented for every additional qualifying document; the final total number of qualifying documents serves as an indicator of the popularity of this search. That total is approximate and does not need to be exact.
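The interplay of the two counters can be illustrated with the short Python sketch below. The function name `collect_recalls` is an assumption, and the sketch counts every qualifying document exactly, whereas the text notes that an approximate total is sufficient.

```python
def collect_recalls(qualifying_ids, need_hit_doc):
    """Keep the first `need_hit_doc` qualifying documents and count all hits.

    `qualifying_ids` is any iterable of identity marks that passed the
    merging operations, in the order they were found.
    """
    selected = []      # formal recall documents, passed on for sorting
    total_hits = 0     # total hit result counter
    for doc_id in qualifying_ids:
        total_hits += 1
        # The selected recall document counter is simply len(selected).
        if len(selected) < need_hit_doc:
            selected.append(doc_id)
    return selected, total_hits
```

Only `selected` is scored and sorted in the later steps; `total_hits` is reported alongside the results as the overall hit count.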
Step S304: perform priority score calculation on the recall documents obtained, to obtain the priority score of each recall document.
This step prioritizes the recall documents obtained in step S303, judging their relevance to the text query request and their importance, so that documents can be placed nearer the front according to user demand and importance. The process includes sorting the recall documents by priority score.
Calculating the priority of the recall documents and sorting them is the most time-consuming part of a search engine.
In the present embodiment, the inverted list merging process naturally retrieves the more important documents first, so that documents of high importance are obtained even when fewer recall documents are taken. Using the inverted list provided by the application, sorted by fractionation document identity mark, therefore allows the pre-candidate set recall document number to be set lower, saving the computing resources and time spent on priority calculation and sorting of the recall documents.
In step S302, retrieval guiding pre-processing was performed on the text retrieval request, and the retrieval guiding pre-processed results obtained may include matching degree calculation parameters. A matching degree calculation parameter here is a factor that influences the matching degree. Taking the retrieval example used in the embodiments of the application, "Xihai sea love song drop centre Drolma mp3", analysis yields the merge relationship shown in Fig. 6, in which "mp3" is not a participle object to be retrieved in the intersection operation; however, if a document containing the terms "Xihai sea", "love song" and "drop centre Drolma" also contains the word "mp3", the document can be given bonus points to raise its matching degree.
When step S302 provides matching degree calculation parameters, the priority score calculation on the recall documents obtained includes the following steps:
Calculate the matching degree score of each recall document according to the matching degree calculation parameters and the configured matching degree algorithm;
Calculate the priority score of each recall document from its matching degree score and its document static score, using configured weights.
The document static score is the document score mentioned earlier; "static" indicates that the score does not change dynamically. The document static score is in fact calculated from collected historical data about the document, and that data does change, but the application does not place particular value on tracking the change in real time: it is enough to collect the historical data periodically and update the score regularly.
A concrete scheme for the above priority score calculation is given below; it considers both the matching degree and the importance of each recall document. The general formula for the priority score is:
scorer = a * matchScorer + b * ((doc_scorer / 10^{n}) * 10^{-6})
Here scorer is the priority score. matchScorer is the matching degree score, which reflects how well the document matches the requirements of the text query request. doc_scorer is the fractionation document identity mark, and (doc_scorer / 10^{n}) * 10^{-6} is the formula that converts the fractionation document identity mark into the document score, which reflects the popularity, or importance, of the document. a and b are weights. With this formula, the matching degree and the importance of a document can be weighed together as required, so that the priority of the document is determined reasonably.
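A direct transcription of the priority score formula into Python might read as follows. The function names are hypothetical, the value of n depends on how the fractionation document identity mark was built in the first embodiment and is passed in as a parameter here, and plain division is assumed because this section does not say whether integer division is intended.

```python
def document_score(doc_scorer, n):
    """Convert a fractionation document identity mark to a document score.

    Follows (doc_scorer / 10^n) * 10^-6 literally, using plain division
    (an assumption; the rounding behaviour is not stated in this section).
    """
    return (doc_scorer / 10 ** n) * 1e-6


def priority_score(match_scorer, doc_scorer, n, a, b):
    """scorer = a * matchScorer + b * document score."""
    return a * match_scorer + b * document_score(doc_scorer, n)
```

Raising the weight b relative to a favours popular documents, while raising a favours documents that match the query text more closely.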
In the prior art, the matching degree score is usually calculated with the vector space model or the BM25 algorithm. Since this is not the core content of the application, and many feasible matching degree calculation methods can be found in the prior art, it is not described in detail here.
Step S305: output the recall documents, sorted by priority score.
After the sorting of step S304, the recall documents can be output in order. The recall documents may amount to several pages of content; the rank of each recall document determines which page it is shown on.
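Sorting by priority score and cutting the result into pages can be sketched in Python as below. The function name `page_of_results` and the (identifier, score) pair format are assumptions made for illustration.

```python
def page_of_results(scored_docs, page, page_size):
    """Return the document identifiers shown on one result page.

    `scored_docs` is a list of (doc_id, priority_score) pairs; `page`
    starts from 1.
    """
    # Highest priority score first; sorted() is stable, so documents with
    # equal scores keep their retrieval order.
    ranked = sorted(scored_docs, key=lambda item: item[1], reverse=True)
    start = (page - 1) * page_size
    return [doc_id for doc_id, _ in ranked[start:start + page_size]]
```

A page beyond the end of the ranked list comes back empty, which under the scheme described earlier would be the point at which a further round of recalling could be started.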
In the text searching method provided in this embodiment, the position of each record under an entry of the inverted list used already reflects the importance of the document. With the corresponding search method of this embodiment, documents of high importance are retrieved more easily, which improves retrieval efficiency.
The fourth embodiment of the application provides a retrieval device, which performs text retrieval using the inverted list generated by the method of the first embodiment of the application. Please refer to Fig. 9, which illustrates the fourth embodiment of the application.
The retrieval device comprises: an inquiry request receiving unit 401, a retrieval guiding pre-processing unit 402, a query and merging operation unit 403, a priority score computing unit 404, and an output unit 405.
The inquiry request receiving unit 401 is configured to receive an inquiry request.
The retrieval guiding pre-processing unit 402 is configured to perform retrieval guiding pre-processing on the query text and query parameters provided by the inquiry request, obtaining retrieval guiding pre-processed results.
The query and merging operation unit 403 is configured to perform inverted list query and merging operations on each participle object to be retrieved, according to the participle objects to be retrieved provided by the retrieval guiding pre-processed results and the merge relationships among them, and to obtain a predetermined quantity of recall documents. The inverted list has the following feature: each record uses a fractionation document identity mark as the document identification of its associated document, and the records under each keyword entry are sorted by the fractionation document identity marks of their associated documents.
The priority score computing unit 404 is configured to perform priority score calculation on the recall documents obtained, obtaining the priority score of each recall document.
The output unit 405 is configured to output the recall documents, sorted by priority score.
The fifth embodiment of the application provides a searching system for implementing the search method provided by the third embodiment above. Please refer to Fig. 9, which shows a schematic diagram of the searching system provided by the fifth embodiment. The searching system is described in detail below with reference to Fig. 9.
The text retrieval system includes a database 501, an off-line calculation server 502, an index server 503, and a retrieval server 504.
The database 501 is used to store the documents available for query.
The database 501 can be implemented on various storage media, and its form may be any of the database structures available in the prior art, provided it is convenient to access. It stores the documents used for text retrieval. These documents come from various sources, for example periodic crawling by a web crawler, and may take various forms, including text, audio, video, hypertext, or executable computer files. Whatever its form, a document should carry at least some descriptive text. For example, an audio file whose content is a song sung by drop centre Drolma, stored as MP3, carries an annotation stating that it is drop centre Drolma singing "Xihai sea love song". These explanatory notes make it possible to retrieve the file with a text searching method.
The offline computation server 502 is configured to compile the popularity data of each document from historical data and to generate a document score accordingly.
The word "offline" in the offline computation server 502 does not mean that the server is necessarily in an offline state; it means that the computation the server provides is offline computation. That is, the popularity data of each document is computed offline from historical data, without dynamic updating or dynamic calculation in response to real-time changes. The historical data may be collected over a certain period. For songs, for example, the listen count, download count, favorite count, and comment count of each song, together with the fan count of its artist, can be collected weekly; the popularity data of each song is then calculated from these data by a predetermined algorithm, and the document score is generated accordingly. During the calculation, the data undergo various discretization or normalization steps so that the raw data are organized into a form convenient for calculation and classification, and the resulting document score is a value within a certain numeric interval. For example, in the first embodiment above, the document score is finally normalized to a continuous floating-point value between 0 and 10.
The index server 503 is configured to build the index tables for the documents in the database, including the inverted lists. When generating an inverted list, it assigns a score-based document ID to each document on the basis of the document score provided by the offline computation server, and uses the score-based document ID to record the identity of each document in the inverted list records. The records under each keyword entry are sorted by the score-based document IDs of their associated documents.
The work of the index server 503 is to build index tables. The index server can build various index tables; what matters for the present application is the inverted list it builds, which is organized by term. This inverted list uses score-based document IDs: the index server receives the document score of each document from the offline computation server 502, generates a score-based document ID for each document from that score according to a predetermined rule, and identifies each document in every record of the inverted list by its score-based document ID. From the score-based document ID, both the document ID that the document uses in the database 501 and the document score can be readily obtained, which makes the importance of each document easy to determine. For convenient retrieval, in a further preferred scheme, the records under each keyword entry of the inverted list are placed according to the score-based document IDs they contain; more preferably, the records are placed in reverse (descending) order of score-based document ID. The structure and generation method of the inverted list are described in detail in the first embodiment of the present application and are not repeated here. The index server 503 reads data from the database 501 and builds indexes for it whenever the documents in the database change.
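The index-building work described above can be sketched as follows, under stated assumptions. Claim 19 gives the structure of the score-based document ID as document score plus original document ID; packing both into a single integer, the two-decimal score precision, the `ID_SPACE` bound, and every function name below are hypothetical choices made for illustration, not the predetermined rule of the application.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed upper bound (exclusive) on original document IDs; hypothetical.
ID_SPACE = 10**9


def make_scored_id(score: float, doc_id: int) -> int:
    """Pack a 0-10 score (kept to two decimals) and the original ID into one integer.

    The score occupies the high-order digits, so sorting scored IDs
    also sorts documents by score.
    """
    if not 0 <= doc_id < ID_SPACE:
        raise ValueError("doc_id outside the assumed ID space")
    return int(round(score * 100)) * ID_SPACE + doc_id


def split_scored_id(scored_id: int) -> Tuple[float, int]:
    """Recover (document score, original document ID) from a scored ID."""
    scaled_score, doc_id = divmod(scored_id, ID_SPACE)
    return scaled_score / 100, doc_id


def build_inverted_index(
    docs: Dict[int, Iterable[str]], scores: Dict[int, float]
) -> Dict[str, List[int]]:
    """Map each term to its scored document IDs, highest score first."""
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        scored_id = make_scored_id(scores[doc_id], doc_id)
        for term in terms:
            postings[term].add(scored_id)
    # Descending order places the most important documents at the front.
    return {term: sorted(ids, reverse=True) for term, ids in postings.items()}
```

Because the score sits in the high-order part of the ID, a plain descending sort of the IDs is enough to put high-score documents first, and both the score and the original ID can be read back from a record without a separate lookup.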
The retrieval server 504 is configured to receive a query request and preprocess the query text and query parameters it provides, obtaining a preprocessing result; to perform, according to the terms to be retrieved and the merge relationships among them provided by the preprocessing result, and on the basis of the inverted lists provided by the index server 503, inverted list queries and merge operations on the terms to be retrieved, obtaining the required number of recalled documents; to perform priority score calculation on the recalled documents obtained, obtaining a priority score for each recalled document; and finally to output the recalled documents sorted by the priority score. The specific retrieval method implemented by the retrieval server 504 has been described in detail in the third embodiment above; its core is to retrieve on the basis of an inverted list in which the position of each record is determined by its score-based document ID. The specific implementation is not repeated here.
As this embodiment shows, because the text retrieval system has an offline computation server, document popularity can be evaluated before a text retrieval request is issued and embodied in the identity of the document. At retrieval time, recalled documents that meet the requirements can be chosen preferentially from among the highly popular documents, so that even when the number of recalled documents is small, the user has a greater probability of obtaining the desired search results. In effect, the system completes part of the computation in advance on the offline computation server, which reduces the time consumed when a user searches and saves real-time computation.
The sixth embodiment of the present application provides an electronic device for running the inverted list generation method for retrieval provided by the first embodiment.
The electronic device includes a processor; and
a memory for storing a program of the inverted list generation method for text retrieval. After the device is powered on and the processor runs the program, the following steps are executed:
calculating, according to relevant historical data, the document score of each document serving as a retrieved item;
assigning a score-based document ID to each document on the basis of its document score;
during inverted list generation, using the score-based document ID of the document associated with each record as the basis for sorting that record within its keyword entry.
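The first of the steps above, calculating a document score from historical data, can be sketched as follows. The attributes follow the song example given for the offline computation server; the weight values, the max-based normalization, and the names `WEIGHTS` and `document_scores` are assumptions for illustration, since the application only states that a predetermined algorithm and normalization are used.

```python
from typing import Dict

# Hypothetical attribute weights; they sum to 1 so the result stays within 0-10.
WEIGHTS = {
    "listens": 0.4,
    "downloads": 0.2,
    "favorites": 0.2,
    "comments": 0.1,
    "artist_fans": 0.1,
}


def document_scores(stats: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Map each document ID to a popularity score in the interval [0, 10].

    `stats` holds the raw counts collected for each document over a period.
    """
    # Largest observed value per attribute, used to scale counts into [0, 1].
    maxima = {
        attr: max((s.get(attr, 0.0) for s in stats.values()), default=0.0)
        for attr in WEIGHTS
    }
    scores = {}
    for doc_id, counts in stats.items():
        total = 0.0
        for attr, weight in WEIGHTS.items():
            if maxima[attr] > 0:
                total += weight * counts.get(attr, 0.0) / maxima[attr]
        # Scale the weighted sum to 0-10 and keep two decimals.
        scores[doc_id] = round(10.0 * total, 2)
    return scores
```

Keeping the score to a fixed number of decimals makes it straightforward to embed in a score-based document ID in the second step.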
The seventh embodiment of the present application provides an electronic device for running the retrieval method provided by the third embodiment.
The electronic device includes a processor; and
a memory for storing a program of the retrieval method. After the device is powered on and the processor runs the program, the following steps are executed:
receiving a query request;
preprocessing the query text and query parameters provided by the query request to obtain a retrieval-oriented preprocessing result;
performing, according to the terms to be retrieved and the merge relationships among them provided by the retrieval-oriented preprocessing result, and on the basis of the inverted list corresponding to each term to be retrieved, inverted list queries and merge operations on the terms to be retrieved, obtaining a predetermined number of recalled documents; the inverted list corresponding to each term to be retrieved has the following features: each of its records identifies its associated document by a score-based document ID, and the records under each keyword entry are sorted by the score-based document IDs of their associated documents; in the inverted list queries and merge operations, following the order of the inverted list, documents with high document scores are chosen preferentially as recalled documents from among the documents that meet the requirements;
performing priority score calculation on the recalled documents obtained to obtain a priority score for each recalled document;
outputting the recalled documents sorted by the priority score.
The eighth embodiment of the present application provides a retrieval method. It is a simplified version of the retrieval method provided by the third embodiment and does not consider the need for merge operations among the terms to be retrieved; in other words, the terms to be retrieved are merged in a default manner. FIG. 11 is a flow chart of this embodiment, which is explained below with reference to FIG. 11.
Step S801: receive a query request.
Step S802: perform retrieval-oriented preprocessing on the query text and query parameters provided by the query request to obtain a retrieval-oriented preprocessing result, which includes the terms to be retrieved.
In this step, the text provided by the query request is segmented to obtain the terms to be retrieved. The relationships among the terms to be retrieved are omitted or defaulted here, and a default merge relationship, for example intersection, is used during retrieval.
In this step, the matching degree calculation parameters can also be obtained.
Step S803: according to the terms to be retrieved, perform an inverted list query for each term to be retrieved and obtain multiple recalled documents. The inverted list has the following features: each of its records identifies its associated document by a score-based document ID, and each record is sorted within its keyword entry by the score-based document ID of its associated document.
This step is essentially the same as the corresponding step in the third embodiment. In the inverted list query for each term to be retrieved, following the order of the records under each keyword entry of the inverted list, documents with high document scores are chosen preferentially as recalled documents from among the documents that meet the requirements. Compared with the corresponding step S303 of the third embodiment, this step simply performs no special merge operation; it can equally be regarded as applying a default merge operation to the terms to be retrieved.
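Step S803 with the default intersection relationship can be sketched as follows. The sketch assumes that each posting list holds integer score-based document IDs in descending order, and it visits the terms in round-robin fashion in the manner recited in claim 8, stopping once `k` documents have been found, so that the highest-scored matches are recalled first. The function names and the use of binary search are illustrative assumptions, not part of the application.

```python
from typing import List, Sequence


def first_le(postings: Sequence[int], target: int, start: int) -> int:
    """Index of the first entry at or after `start` whose ID is <= target.

    `postings` must be sorted in descending order. Returns len(postings)
    if no such entry exists.
    """
    lo, hi = start, len(postings)
    while lo < hi:
        mid = (lo + hi) // 2
        if postings[mid] > target:
            lo = mid + 1
        else:
            hi = mid
    return lo


def intersect_top_k(lists: Sequence[Sequence[int]], k: int) -> List[int]:
    """Return up to k IDs present in every list, highest IDs first."""
    n = len(lists)
    if n == 0 or k <= 0 or any(len(p) == 0 for p in lists):
        return []
    pos = [0] * n          # search start position in each list
    results: List[int] = []
    i = 0                  # index of the current term's list
    candidate, matched = lists[0][0], 1
    while True:
        if matched == n:
            # Every term's list contains the candidate: recall it.
            results.append(candidate)
            if len(results) == k:
                break
            # Restart from the record after the candidate in the current list.
            pos[i] += 1
            if pos[i] == len(lists[i]):
                break
            candidate, matched = lists[i][pos[i]], 1
            continue
        # Move to the next term and look for the candidate there.
        i = (i + 1) % n
        j = first_le(lists[i], candidate, pos[i])
        if j == len(lists[i]):
            break          # this list is exhausted; no further matches exist
        pos[i] = j
        if lists[i][j] == candidate:
            matched += 1
        else:
            # A smaller ID was found; it becomes the new candidate.
            candidate, matched = lists[i][j], 1
    return results
```

Because the lists are ordered by score-based document ID, the loop can stop as soon as `k` matches are found without scanning the remaining low-score records.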
Step S804: sort and output the multiple recalled documents obtained.
This step outputs the recalled documents. Because the number of recalled documents obtained in step S803 is large, they cannot all be output at once and must be sorted before output. In general, because step S803 retrieves with the special inverted list, the recalled documents obtained by this method already reflect priority. This greatly reduces the influence that the choice of sorting method in this step has on the quality of the output documents, which is also an advantage of this retrieval method.
In step S802 above, the retrieval-oriented preprocessing result obtained includes the matching degree calculation parameters.
In this step, the sorting includes the following steps:
calculating the matching degree score of each recalled document according to the matching degree calculation parameters and the set matching degree algorithm;
calculating, with set weights, the priority score of each recalled document from its matching degree score and its document score; the priority score serves as the basis for sorting.
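The weighted combination in the sorting steps above can be sketched as follows. The linear form, the default weight of 0.7, and the assumption that the matching degree score and the document score are on comparable scales are all hypothetical; the application only states that set weights are used.

```python
from typing import List, Tuple

# (document ID, matching degree score, document score)
Recalled = Tuple[str, float, float]


def priority_score(match_score: float, doc_score: float, match_weight: float = 0.7) -> float:
    """Weighted sum of the matching degree score and the document score."""
    return match_weight * match_score + (1.0 - match_weight) * doc_score


def rank_recalled(recalled: List[Recalled], match_weight: float = 0.7) -> List[Recalled]:
    """Sort recalled documents by priority score, highest first."""
    return sorted(
        recalled,
        key=lambda r: priority_score(r[1], r[2], match_weight),
        reverse=True,
    )
```

Raising `match_weight` favors textual relevance, while lowering it favors popular documents.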
Some technical details of this embodiment have already been explained in the third embodiment of the present application; refer to that description, which is not repeated here.
Although the present application is disclosed above by way of preferred embodiments, they are not intended to limit the application. Any person skilled in the art may make possible changes and modifications without departing from the spirit and scope of the application; the scope of protection of the application shall therefore be that defined by its claims.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
1. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
2. Those skilled in the art will understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

Claims (26)

1. A retrieval method, characterized by comprising:
receiving a query request;
performing retrieval-oriented preprocessing on the query text and query parameters provided by the query request to obtain a preprocessing result;
performing, according to the terms to be retrieved and the merge relationships among them provided by the retrieval-oriented preprocessing result, inverted list queries and merge operations on the terms to be retrieved to obtain a predetermined number of recalled documents; wherein the inverted list has the following features: each of its records identifies its associated document by a score-based document ID, and each record is sorted within its keyword entry by the score-based document ID of its associated document;
performing priority score calculation on the recalled documents obtained to obtain a priority score for each recalled document;
outputting the recalled documents sorted by the priority score.
2. The retrieval method according to claim 1, characterized in that, in the inverted list queries and merge operations, following the order of the records under each keyword entry of the inverted list, documents with high document scores are chosen preferentially as recalled documents from among the documents that meet the requirements.
3. The retrieval method according to claim 1, characterized in that the step of performing retrieval-oriented preprocessing on the query text provided by the text query request comprises: segmenting the query text to obtain terms, and determining the terms to be retrieved from among the terms; and obtaining the merge relationships of the terms to be retrieved according to the query text and the query parameters; wherein the terms to be retrieved are a subset of the terms.
4. The retrieval method according to claim 3, characterized in that the step of performing retrieval-oriented preprocessing on the query text provided by the text query request further comprises: after the terms are obtained, performing weight analysis on each term to obtain the weight of each term; and, in subsequent steps, processing each term according to its weight.
5. The retrieval method according to claim 1, characterized in that, in the step of performing inverted list queries and merge operations on the terms to be retrieved, the merge operation comprises at least one of the following operations: intersection, union, and difference.
6. The retrieval method according to claim 1, characterized in that the records under each keyword entry of the inverted list are sorted by the score-based document IDs of their associated documents, specifically in reverse order.
7. The retrieval method according to claim 6, characterized in that, in the step of performing inverted list queries and merge operations on the terms to be retrieved, the merge operation includes an intersection operation; and the method of determining candidate recalled documents in the intersection operation is:
searching, from front to back in the respective inverted list entries of the terms to be retrieved on which the intersection operation is to be performed, for records that meet the following condition:
the score-based document ID associated with the record has an associated record in the inverted list entry of every term to be retrieved on which the intersection operation is to be performed.
8. The retrieval method according to claim 7, characterized in that the method of determining candidate recalled documents in the intersection operation is implemented by the following steps:
determining, in the set of terms to be retrieved on which the intersection operation is to be performed, one term to be retrieved as the current term to be retrieved, the current term to be retrieved being the starting point for traversing the elements of the set; and arranging the terms to be retrieved in the set in a fixed order, the fixed order serving as the order in which the terms to be retrieved in the set are traversed cyclically;
obtaining, in the inverted list entry of the current term to be retrieved, the score-based document ID in the record at the frontmost position, taking that score-based document ID as the current document ID, and setting the value of the term counter to 1;
updating, in the set of terms to be retrieved on which the intersection operation is to be performed, the next term after the current term to be retrieved as the new current term to be retrieved;
querying the keyword entry of the current term to be retrieved in the inverted list, and retrieving the first record whose score-based document ID is less than or equal to the current document ID; taking the score-based document ID in that record as the document ID to be judged;
judging whether the document ID to be judged is equal to the current document ID; if so, proceeding to the next step; if not, setting the value of the term counter to 1, taking the document ID to be judged as the current document ID, and returning to the step of updating, in the set of terms to be retrieved on which the intersection operation is to be performed, the next term after the current term to be retrieved as the new current term to be retrieved;
incrementing the value of the term counter by 1;
judging whether the value of the term counter is equal to the total number of elements in the set of terms to be retrieved; if so, proceeding to the next step; if not, returning to the step of updating, in the set of terms to be retrieved on which the intersection operation is to be performed, the next term after the current term to be retrieved as the new current term to be retrieved;
determining the document corresponding to the current document ID as a candidate recalled document;
taking, in the inverted list of the current term to be retrieved, the record following the record that contains the current document ID as the record at the frontmost position; and returning to the step of obtaining, in the inverted list entry of the current term to be retrieved, the score-based document ID in the record at the frontmost position, taking that score-based document ID as the current document ID, and setting the value of the term counter to 1.
9. The retrieval method according to claim 1, characterized in that, in the step of performing retrieval-oriented preprocessing on the query text provided by the query request, the retrieval-oriented preprocessing result obtained includes matching degree calculation parameters;
the priority score calculation performed on the recalled documents obtained includes the following steps:
calculating the matching degree score of each recalled document according to the matching degree calculation parameters and the set matching degree algorithm;
performing a weighted calculation, with set weights, on the matching degree score and the document score of each recalled document to obtain the priority score of each recalled document.
10. The retrieval method according to claim 1, characterized in that the text to be queried by the text query request is song lyrics.
11. The retrieval method according to claim 10, characterized in that the score on which the score-based document ID relies is based on at least one of the following attributes of the associated song: listen count, download count, favorite count, comment count, and the fan count of the related artist.
12. A retrieval device, characterized by comprising:
a query request receiving unit, configured to receive a query request;
a retrieval-oriented preprocessing unit, configured to perform retrieval-oriented preprocessing on the query text and query parameters provided by the query request to obtain a preprocessing result;
a query and merge operation unit, configured to perform, according to the terms to be retrieved and the merge relationships among them provided by the retrieval-oriented preprocessing result, inverted list queries and merge operations on the terms to be retrieved to obtain a predetermined number of recalled documents; wherein the inverted list has the following features: each of its records identifies its associated document by a score-based document ID, and each record is sorted within its keyword entry by the score-based document ID of its associated document;
a priority score calculation unit, configured to perform priority score calculation on the recalled documents obtained to obtain a priority score for each recalled document;
an output unit, configured to output the recalled documents sorted by the priority score.
13. A retrieval system, characterized by comprising:
a database, configured to store the documents available for query;
an offline computation server, configured to compile the popularity data of each document from historical data and to generate a document score accordingly;
an index server, configured to build the index tables for the documents in the database, including the inverted list; wherein, when generating the inverted list, the index server assigns a score-based document ID to each document on the basis of the document score provided by the offline computation server, and uses the score-based document ID to record the identity of each document in the records of each entry of the inverted list; and the records under each keyword entry are sorted by the score-based document IDs of their associated documents;
a retrieval server, configured to receive a query request and perform retrieval-oriented preprocessing on the query text and query parameters it provides to obtain a retrieval-oriented preprocessing result; to perform, according to the terms to be retrieved and the merge relationships among them provided by the retrieval-oriented preprocessing result, and on the basis of the inverted list corresponding to each term to be retrieved, inverted list queries and merge operations on the terms to be retrieved to obtain the required number of recalled documents; to perform priority score calculation on the recalled documents obtained to obtain a priority score for each candidate recalled document; and finally to output the recalled documents sorted by the priority score.
14. An electronic device, characterized by comprising:
a processor; and
a memory for storing a program of a retrieval method, wherein after the device is powered on and the processor runs the program of the retrieval method, the following steps are executed:
receiving a query request;
performing retrieval-oriented preprocessing on the query text and query parameters provided by the query request to obtain a retrieval-oriented preprocessing result;
performing, according to the terms to be retrieved and the merge relationships among them provided by the retrieval-oriented preprocessing result, and on the basis of the inverted list corresponding to each term to be retrieved, inverted list queries and merge operations on the terms to be retrieved to obtain a predetermined number of recalled documents; wherein the inverted list corresponding to each term to be retrieved has the following features: each of its records identifies its associated document by a score-based document ID, and the records under each keyword entry are sorted by the score-based document IDs of their associated documents; and, in the inverted list queries and merge operations, following the order of the inverted list, documents with high document scores are chosen preferentially as recalled documents from among the documents that meet the requirements;
performing priority score calculation on the recalled documents obtained to obtain a priority score for each recalled document;
outputting the recalled documents sorted by the priority score.
15. An inverted list generation method for retrieval, characterized by comprising:
calculating, according to relevant historical data, the document score of each document serving as a retrieved item;
assigning a score-based document ID to each document on the basis of its document score;
during inverted list generation, using the score-based document ID of the document associated with each record as the basis for sorting that record within its keyword entry.
16. The inverted list generation method for retrieval according to claim 15, characterized in that the historical data includes one or more attribute values related to document popularity, and the step of calculating the document score of the document serving as a retrieved item is based on the attribute values.
17. The inverted list generation method for retrieval according to claim 16, characterized in that, if the historical data includes multiple attribute values, a weight is assigned to each attribute, and the document score of the document serving as a retrieved item is calculated according to the attribute value of each attribute and its corresponding weight.
18. The inverted list generation method for retrieval according to claim 15, characterized in that the document score is normalized to a determined numeric interval.
19. The inverted list generation method for retrieval according to claim 15, characterized in that the structure of the score-based document ID is: document score + original document ID.
20. The inverted list generation method for retrieval according to claim 19, characterized in that the data type of the score-based document ID is Long or String.
21. The inverted list generation method for retrieval according to claim 15, characterized in that, in the step, during inverted list generation, of determining the ordering of the records according to the score-based document IDs in the records and thereby determining the position of each record in the inverted list, the position of each record in the inverted list is determined according to the reverse of the ordering determined by the score-based document IDs in the records.
22. An inverted list generation device for retrieval, characterized by comprising:
a document score calculation unit, configured to calculate, according to relevant historical data, the document score of each document serving as a retrieved item;
a score-based document ID assignment unit, configured to assign a score-based document ID to each document on the basis of its document score;
an inverted list generation unit, configured to generate the inverted list, wherein, during inverted list generation, the score-based document ID of the document associated with each record is used as the basis for sorting that record within its entry.
23. An electronic device, characterized by comprising:
a processor; and
a memory for storing an inverted list generation program for retrieval, wherein after the device is powered on and the processor runs the inverted list generation program for retrieval, the following steps are executed:
calculating, according to relevant historical data, the document score of each document serving as a retrieved item;
assigning a score-based document ID to each document on the basis of its document score;
during inverted list generation, using the score-based document ID of the document associated with each record as the basis for sorting that record within its keyword entry.
24. A search method, characterized by comprising:
Receiving an inquiry request;
Performing retrieval-oriented preprocessing on the query text and query parameters provided by the inquiry request to obtain retrieval-oriented preprocessing results, including participle objects to be retrieved;
Performing, according to the participle objects to be retrieved, an inverted list query for each participle object to be retrieved to obtain multiple recalled documents; the inverted list has the following feature: each of its records uses a fractionation document identity mark as the document identification of the associated document, and each record is sorted within its corresponding keyword entry according to the fractionation document identity mark of the document associated with that record;
Sorting and outputting the multiple recalled documents obtained.
25. The search method according to claim 24, characterized in that, in the step of performing an inverted list query for each participle object to be retrieved, the ordering under each keyword entry of the inverted list is used as the basis, and among the documents that meet the requirements, documents with a high document score are preferentially selected as recalled documents.
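One way to read this preference for high-score documents is as an early-terminating merge. The sketch below assumes that every keyword entry is sorted with the largest fractionation identity first, that "meeting the requirements" means appearing under every queried participle (an AND query), and that recall stops after `limit` documents. The function name and these semantics are illustrative assumptions.

```python
from typing import Dict, List


def recall(inverted: Dict[str, List[int]], participles: List[str], limit: int) -> List[int]:
    """Intersect descending-sorted keyword entries and stop after `limit` hits."""
    lists = [inverted.get(p, []) for p in participles]
    if not lists or any(not lst for lst in lists):
        return []
    positions = [0] * len(lists)
    recalled: List[int] = []
    while len(recalled) < limit:
        if any(pos >= len(lst) for pos, lst in zip(positions, lists)):
            break  # one entry is exhausted, no further common documents exist
        heads = [lst[pos] for pos, lst in zip(positions, lists)]
        target = min(heads)  # entries are descending, so larger heads cannot match everywhere
        if all(head == target for head in heads):
            recalled.append(target)
            positions = [pos + 1 for pos in positions]
        else:
            # skip past heads that are larger than the smallest current head
            positions = [pos + 1 if head > target else pos
                         for pos, head in zip(positions, heads)]
    return recalled
```

Because the entries are already ordered by score, the first `limit` matches are the highest-scoring ones, and the rest of each entry never has to be read.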
26. The search method according to claim 24, characterized in that, in the step of performing retrieval-oriented preprocessing on the query text provided by the inquiry request, the retrieval-oriented preprocessing results obtained include matching degree calculation parameters;
In the step of sorting and outputting the multiple recalled documents obtained, the sorting comprises the following steps:
Calculating the matching degree score of each recalled document according to the matching degree calculation parameters and a set matching degree algorithm;
Obtaining the priority score of each recalled document by weighted calculation, with set weights, from the matching degree score and the document score of that recalled document; the priority score serves as the basis for sorting.
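The weighted combination could be sketched as below. The matching degree scores are taken as given inputs, since the claim leaves the matching degree algorithm open. The default weights, the assumption that both scores are normalised to the range 0 to 1, and the function name `rank` are illustrative only.

```python
from typing import Dict, List, Tuple


def rank(recalled: Dict[int, Tuple[float, float]],
         w_match: float = 0.7, w_doc: float = 0.3) -> List[Tuple[int, float]]:
    """recalled maps doc_id -> (matching degree score, document score).

    Returns (doc_id, priority score) pairs, highest priority first.
    """
    scored = [(doc, w_match * match + w_doc * doc_score)
              for doc, (match, doc_score) in recalled.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Shifting weight from `w_match` to `w_doc` moves the output from relevance-driven ordering toward importance-driven ordering.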
CN201710681027.5A 2017-08-10 2017-08-10 Text searching method, inverted list generation method and system for text retrieval Pending CN109388690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710681027.5A CN109388690A (en) 2017-08-10 2017-08-10 Text searching method, inverted list generation method and system for text retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710681027.5A CN109388690A (en) 2017-08-10 2017-08-10 Text searching method, inverted list generation method and system for text retrieval

Publications (1)

Publication Number Publication Date
CN109388690A true CN109388690A (en) 2019-02-26

Family

ID=65414199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710681027.5A Pending CN109388690A (en) 2017-08-10 2017-08-10 Text searching method, inverted list generation method and system for text retrieval

Country Status (1)

Country Link
CN (1) CN109388690A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100010989A1 (en) * 2008-07-03 2010-01-14 The Regents Of The University Of California Method for Efficiently Supporting Interactive, Fuzzy Search on Structured Data
CN102201001A (en) * 2011-04-29 2011-09-28 西安交通大学 Fast retrieval method based on inverted technology
CN103186650A (en) * 2011-12-30 2013-07-03 中国移动通信集团四川有限公司 Searching method and device
CN102880722A (en) * 2012-10-17 2013-01-16 深圳市宜搜科技发展有限公司 Method and device for searching authoritative site
CN105488068A (en) * 2014-09-19 2016-04-13 阿里巴巴集团控股有限公司 Methods and apparatuses for searching music and establishing index, and search result judgment method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083679A (en) * 2019-03-18 2019-08-02 北京三快在线科技有限公司 Processing method, device, electronic equipment and the storage medium of searching request
CN110413738A (en) * 2019-07-31 2019-11-05 腾讯科技(深圳)有限公司 A kind of information processing method, device, server and storage medium
CN114169945A (en) * 2022-02-08 2022-03-11 北京金堤科技有限公司 Method and device for determining hot supply and demand products in field of object
CN114169945B (en) * 2022-02-08 2022-04-22 北京金堤科技有限公司 Method and device for determining hot supply and demand products in field of object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190226)