CN101599078B - Method and device for text retrieval - Google Patents

Method and device for text retrieval

Info

Publication number
CN101599078B
CN101599078B CN2009100887508A CN200910088750A CN101599078B CN 101599078 B CN101599078 B CN 101599078B CN 2009100887508 A CN2009100887508 A CN 2009100887508A CN 200910088750 A CN200910088750 A CN 200910088750A CN 101599078 B CN101599078 B CN 101599078B
Authority
CN
China
Prior art keywords
text
morpheme
full
retrieval
bitmap
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN2009100887508A
Other languages
Chinese (zh)
Other versions
CN101599078A (en)
Inventor
袁哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN2009100887508A
Publication of CN101599078A
Application granted
Publication of CN101599078B
Current legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method and a device for text retrieval. The method comprises the following steps: firstly, generating coding information, and confirming a text address according to text weight in a text library; then, establishing an index entry according to the generated coding information and the confirmed text address; wherein, the index entry comprises a title index, a full-text index, a high-frequency word title bitmap, a high-frequency word full-text bitmap and an ultrahigh-frequency word text offset address bitmap; finally, retrieving the corresponding text first by using the title index according to the searched morpheme and then by filtering according to the ultrahigh-frequency word text offset address bitmap, finishing the retrieval if the retrieval result meets the pre-determined requirements, and if not, retrieving the corresponding text first by using the full-text index according to the searched morpheme and then by filtering according to the high-frequency word title bitmap and the high-frequency word full-text bitmap. The technical scheme provided by the implementation manner of the invention can accelerate the retrieval, thereby improving the accuracy of the retrieval and the retrieval performance of the system.

Description

Method and device for text retrieval
Technical field
The present invention relates to a method and device for text retrieval, and belongs to the field of network communication technology.
Background art
Commonly used text retrieval methods at present include merge search and bitmap marking. Both methods determine text addresses from the coding order of the texts, build only a title index and a full-text index, and search the two indexes simultaneously. Because the text address is determined solely by the coding order, some important texts are not retrieved when the texts are read incompletely, which reduces retrieval accuracy. Searching the title index and the full-text index together also makes retrieval slow, which degrades the retrieval performance of the system.
Summary of the invention
The invention provides a method and device for text retrieval to solve two problems in the prior art: some important texts cannot be retrieved when the texts are read incompletely, and searching the title index and the full-text index together makes retrieval slow. Both problems affect retrieval accuracy and the retrieval performance of the system. To this end, the invention adopts the following technical scheme:
An embodiment of the invention provides a method for text retrieval, comprising:
Generating coding information, and determining text addresses according to the text weights in a text library, where the coding information indicates the order of the texts;
Establishing index entries according to the generated coding information and the determined text addresses, where the index entries comprise a title index, a full-text index, a high-frequency word title bitmap, a high-frequency word full-text bitmap and an ultrahigh-frequency word text offset address bitmap;
Retrieving the corresponding texts according to the queried morphemes by using the title index and then filtering according to the ultrahigh-frequency word text offset address bitmap, and finishing the retrieval if the retrieval result satisfies a predetermined requirement; if the retrieval result does not satisfy the predetermined requirement, retrieving the corresponding texts according to the queried morphemes by using the full-text index and then filtering according to the high-frequency word title bitmap and the high-frequency word full-text bitmap.
An embodiment of the invention also provides a device for text retrieval, comprising:
A text address determination module, configured to generate coding information and determine text addresses according to the text weights in a text library, where the coding information indicates the order of the texts;
An index entry determination module, configured to establish index entries according to the coding information generated and the text addresses determined by the text address determination module, where the index entries comprise a title index, a full-text index, a high-frequency word title bitmap, a high-frequency word full-text bitmap and an ultrahigh-frequency word text offset address bitmap;
A retrieval module, configured to retrieve the corresponding texts according to the queried morphemes by using the title index of the index entry determination module and then filtering according to the ultrahigh-frequency word text offset address bitmap, and to finish the retrieval if the retrieval result satisfies a predetermined requirement; and, if the retrieval result does not satisfy the predetermined requirement, to retrieve the corresponding texts according to the queried morphemes by using the full-text index of the index entry determination module and then filtering according to the high-frequency word title bitmap and the high-frequency word full-text bitmap.
In the method and device for text retrieval described above, the title index records the coding information and text addresses of the texts corresponding to titles; the full-text index records the coding information and text addresses of the texts corresponding to full texts, where a full text comprises the title and the content; the high-frequency word title bitmap records the coding information and text addresses of the texts in whose titles each high-frequency word appears; the high-frequency word full-text bitmap records the coding information and text addresses of the texts in whose full texts each high-frequency word appears; and the ultrahigh-frequency word text offset address bitmap records, for each ultrahigh-frequency word, the offset address, within the text addresses of the title index, of the text that occupies the largest text address among the texts in whose titles the word appears. A high-frequency word is a word whose text coverage rate lies within a predetermined interval, and an ultrahigh-frequency word is a word whose text coverage rate exceeds the maximum of that interval.
The technical scheme provided by the embodiments of the invention adds a high-frequency word title bitmap, a high-frequency word full-text bitmap and an ultrahigh-frequency word text offset address bitmap to the title index and the full-text index. During retrieval, the title index and the ultrahigh-frequency word text offset address bitmap are searched first; only when the result does not satisfy the predetermined requirement are the full-text index, the high-frequency word title bitmap and the high-frequency word full-text bitmap searched. This speeds up retrieval and thereby improves retrieval accuracy and the retrieval performance of the system.
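The two-stage control flow summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `two_stage_retrieve` and its three callable parameters are hypothetical, and each stage is assumed to return a list of text addresses.

```python
from typing import Callable, List


def two_stage_retrieve(
    query: List[str],
    title_stage: Callable[[List[str]], List[int]],
    full_text_stage: Callable[[List[str]], List[int]],
    meets_requirement: Callable[[List[int]], bool],
) -> List[int]:
    """Run the cheaper title stage first; fall back to the full-text stage.

    title_stage:      hypothetical callable standing in for "title index, then
                      filter by the ultrahigh-frequency word offset bitmap".
    full_text_stage:  hypothetical callable standing in for "full-text index,
                      then filter by the two high-frequency word bitmaps".
    meets_requirement: hypothetical predicate for the predetermined requirement.
    """
    results = title_stage(query)
    if meets_requirement(results):
        return results  # stage one was sufficient, so retrieval ends here
    return full_text_stage(query)
```

The full-text stage runs only when the first result fails the check, which is where the claimed speed-up comes from.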
Brief description of the drawings
To illustrate the technical scheme of the embodiments of the invention more clearly, the drawings used in describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a text retrieval method according to an embodiment of the invention;
Fig. 2 is a schematic flowchart of determining text addresses according to the text weights in the text library according to an embodiment of the invention;
Fig. 3 is a schematic flowchart of retrieving the corresponding texts according to the queried morphemes by using the title index and then filtering according to the ultrahigh-frequency word text offset address bitmap, according to an embodiment of the invention;
Fig. 4 is a schematic flowchart of retrieving the corresponding texts according to the queried morphemes by using the full-text index and then filtering according to the high-frequency word title bitmap and the high-frequency word full-text bitmap, according to an embodiment of the invention;
Fig. 5 is a schematic flowchart of the retrieval process of specific embodiment one according to an embodiment of the invention;
Fig. 6 is a schematic structural diagram of a text retrieval device according to an embodiment of the invention;
Fig. 7 is a detailed structural diagram of the text address determination module 1 according to an embodiment of the invention;
Fig. 8 is a detailed structural diagram of the retrieval module 3 according to an embodiment of the invention;
Fig. 9 is a detailed structural diagram of the retrieval module 3 for retrieval when the predetermined requirement has not been met, according to an embodiment of the invention.
Embodiments
The specific embodiments of the invention are described below with reference to the accompanying drawings. This specification mainly takes the application of the invention in an instant messaging service as the preferred embodiment; in practice, the invention can also be used in other Internet service systems, such as webmail service systems and network collaboration service systems.
In the technical scheme of the text retrieval method provided by the embodiments of the invention, as shown in Fig. 1, coding information is first generated and text addresses are determined according to the text weights in the text library, where the coding information indicates the order of the texts. Index entries are then established according to the generated coding information and the determined text addresses. The index entries comprise a title index, a full-text index, a high-frequency word title bitmap, a high-frequency word full-text bitmap and an ultrahigh-frequency word text offset address bitmap.
The title index records the coding information and text addresses of the texts corresponding to titles. The full-text index records the coding information and text addresses of the texts corresponding to full texts, where a full text comprises the title and the content. The high-frequency word title bitmap records the coding information and text addresses of the texts in whose titles each high-frequency word appears. The high-frequency word full-text bitmap records the coding information and text addresses of the texts in whose full texts each high-frequency word appears. The ultrahigh-frequency word text offset address bitmap records, for each ultrahigh-frequency word, the offset address, within the text addresses of the title index, of the text that occupies the largest text address among the texts in whose titles the word appears.
A high-frequency word is a word whose text coverage rate lies within a predetermined interval, which may be [50%, 80%]: if the text coverage rate of a queried morpheme is between 50% and 80%, the morpheme can be regarded as a high-frequency word. An ultrahigh-frequency word is a word whose text coverage rate exceeds the maximum of the predetermined interval. The maximum may be 80%, so a queried morpheme whose text coverage rate exceeds 80% can be regarded as an ultrahigh-frequency word.
Finally, the corresponding texts are retrieved according to the queried morphemes by using the title index and then filtering according to the ultrahigh-frequency word text offset address bitmap. If the retrieval result satisfies the predetermined requirement, the retrieval ends. If it does not, the corresponding texts are retrieved according to the queried morphemes by using the full-text index and then filtering according to the high-frequency word title bitmap and the high-frequency word full-text bitmap.
The predetermined requirement is determined by what is required of the retrieval result. It may mean that the retrieved texts include the texts needed, or that the number of retrieved texts is greater than a predetermined value. If the querier only wants a general result and has no high requirement, the predetermined value can be small, such as 30 or 50; if the querier requires high accuracy and wants a comprehensive query, the predetermined value can be larger, such as 300 or 500. The predetermined requirement may also mean that the number of texts satisfying a text quality grade is greater than a certain threshold. The text quality grade is determined by the degree of matching between the queried morphemes and the retrieved texts, and the threshold can be set according to the querier's specific requirements, for example 50 or 100. If the querier only wants a general result, the matching degree can be set lower, such as 30% or 50%; if the querier requires high accuracy and wants a comprehensive query, the matching degree can be set higher, such as 80% or 90%.
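The frequency classes used throughout the description can be made concrete with a small sketch. It assumes the example interval [50%, 80%] given above and assumes that the text coverage rate is the fraction of texts in the library that contain the word; the names `coverage_rate` and `classify_word` are hypothetical.

```python
def coverage_rate(texts_containing_word: int, total_texts: int) -> float:
    """Fraction of texts in the library that contain the word (assumed definition)."""
    if total_texts <= 0:
        raise ValueError("total_texts must be positive")
    return texts_containing_word / total_texts


def classify_word(coverage: float, interval=(0.50, 0.80)) -> str:
    """Classify a word by text coverage rate against the predetermined interval."""
    low_bound, high_bound = interval
    if coverage < low_bound:
        return "low"        # below the minimum of the interval
    if coverage <= high_bound:
        return "high"       # inside the interval, bounds included
    return "ultrahigh"      # above the maximum of the interval
```

For instance, a word found in 900 of 1000 texts has a coverage rate of 0.9 and falls into the ultrahigh-frequency class.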
Further, as shown in Fig. 2, the process of determining text addresses according to the text weights in the text library may be as follows. First, a text is read from the text library, the fields in the text are parsed, and a word segmentation result is generated. Then, according to the generated word segmentation result, the combination of the fields is determined, the weight of the current text is determined, and the relevance weight of the text is updated. The relevance weight may be updated through text relevance weighting during retrieval; text relevance weighting obtains the relevance weight of a text according to the importance of weighting factors, which may include the text modification time, the number of morphemes, the morpheme ranking, the web page ranking technique (PageRank), the number of external links and the number of internal links. Finally, the above process is repeated until all texts in the text library for which coding information has been generated have been read, and the text addresses are determined according to the determined text weights and the updated relevance weights.
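One way to picture the last step is to sort the texts by a combined score and hand out addresses in that order. The sketch below is illustrative only: the `Text` record, the simple sum of weight and relevance weight, and the choice to give heavier texts smaller addresses are all assumptions; the description does not specify how the two weights are combined or how addresses are laid out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Text:
    code: int          # coding information: order in which the text was encoded
    weight: float      # weight determined from the parsed fields
    relevance: float   # relevance weight updated during retrieval


def assign_addresses(texts: List[Text]) -> Dict[int, int]:
    """Map each text code to an address, heaviest text first (assumed layout)."""
    ordered = sorted(texts, key=lambda t: (-(t.weight + t.relevance), t.code))
    return {t.code: address for address, t in enumerate(ordered)}
```

Under this layout, a reader that stops early still sees the most heavily weighted texts, which is the problem the background section attributes to purely code-ordered addresses.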
In the above technical scheme, as shown in Fig. 3, the process of retrieving the corresponding texts according to the queried morphemes by using the title index and then filtering according to the ultrahigh-frequency word text offset address bitmap may be as follows. First, it is determined whether the query is a single morpheme or multiple morphemes. A morpheme is a grammatical unit that carries a certain meaning. A single morpheme can be understood as a single character or a single foreign word, and multiple morphemes as several characters or several foreign words. For example, "I" is a single morpheme and "you are good" is multiple morphemes; "hello" is a single morpheme and "happy birthday" is multiple morphemes.
If the query is a single morpheme, the coding information and text addresses of the titles covered by the morpheme are determined according to the title index, and the corresponding texts are thereby retrieved.
If the query consists of multiple morphemes and all of them are center words, a merge search is first performed on the low-frequency words among the morphemes to retrieve the texts corresponding to those low-frequency words. The merge search may proceed as follows: the text addresses covered by each queried morpheme are first determined according to the title index and the full-text index respectively, and these text addresses are then merged. For example, to perform a merge search on morphemes A, B and C, the text addresses covered by A, B and C are first determined according to the title index and the full-text index, and these addresses are then merged; the text addresses covered by morpheme A form one merge segment, those covered by morpheme B form another, and those covered by morpheme C form a third.
Next, for each ultrahigh-frequency word among the morphemes, it is judged according to the title index whether the ultrahigh-frequency word hits the titles of the texts retrieved for the low-frequency words by the merge search. If there is no hit, the retrieval ends. If there is a hit, the texts retrieved by the low-frequency-word merge search that lie, within the same merge segment, between the text at the ultrahigh-frequency word's text offset address and the current text are skipped, and the texts after the skip are read until the ultrahigh-frequency word no longer hits a title, which yields the retrieved texts.
A low-frequency word is a word whose text coverage rate is below the minimum of the predetermined interval. A center word, also called a positioning word, is the word that the multiple morphemes center on. For example, in "you are a clever person", "person" is the center word and "clever" is a non-center word.
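The merge of the segments for morphemes A, B and C can be sketched as a k-way merge of sorted address lists. This is an illustration under assumptions rather than the patented procedure: the name `merge_search` and the dictionary input are hypothetical, each segment is assumed to be sorted in ascending order without duplicates, and, because the description does not say whether the merged result is a union or an intersection, the sketch reports how many segments contain each address and leaves that choice to the caller.

```python
import heapq
from itertools import groupby
from typing import Dict, List, Tuple


def merge_search(segments: Dict[str, List[int]]) -> List[Tuple[int, int]]:
    """Merge sorted per-morpheme address lists (merge segments).

    Returns (address, number_of_segments_containing_it) pairs in address order.
    """
    merged = heapq.merge(*segments.values())  # lazy k-way merge of sorted lists
    return [(address, sum(1 for _ in group)) for address, group in groupby(merged)]


segments = {"A": [1, 3, 5], "B": [3, 5, 7], "C": [5, 9]}
hits = merge_search(segments)
# Addresses covered by every morpheme, if an intersection is wanted:
in_all = [address for address, count in hits if count == len(segments)]
```

With the sample segments, only address 5 is covered by all three morphemes, while addresses 1, 7 and 9 are each covered by one.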
If the query consists of multiple morphemes that include non-center words, the retrieval method for the case in which all morphemes are center words is applied first. If the retrieval result satisfies the predetermined requirement, the retrieval ends. If it does not, a merge search is performed on the center words among the morphemes. If the texts obtained by this merge search overlap with the texts already found, the overlapping texts are the retrieved texts and are given relevance weighting; otherwise, the retrieved texts are displayed according to text weight.
In the above scheme, as shown in Fig. 4, the process of retrieving the corresponding texts according to the queried morphemes by using the full-text index and then filtering according to the high-frequency word title bitmap and the high-frequency word full-text bitmap may be as follows. First, it is determined whether the queried morphemes are all low-frequency words, a mix of high-frequency and low-frequency words, or all high-frequency words.
If the queried morphemes are all low-frequency words, the corresponding texts are retrieved for the low-frequency words by bitmap marking. In bitmap marking, the texts that each low-frequency word retrieves via the title index and the full-text index are marked in turn; when the number of marks made equals the number of characters of the low-frequency word, retrieval for that word is finished. The process is repeated until all low-frequency words among the morphemes have been retrieved, and the texts retrieved for the individual low-frequency words together form the retrieved texts.
If the queried morphemes are a mix of high-frequency and low-frequency words, the corresponding texts are retrieved for the low-frequency words by bitmap marking, and the texts whose titles are covered by the high-frequency words are determined according to the high-frequency word title bitmap. If the texts obtained through the low-frequency words overlap with the texts obtained through the high-frequency words, the overlapping texts are the retrieved texts and are given relevance weighting; otherwise, the retrieved texts are displayed according to text weight.
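The mixed case above can be illustrated with Python integers used as bitmaps, where bit `n` stands for the text at address `n`. This is a simplified sketch under assumptions: the helper names `mark`, `addresses_of` and `mixed_query` are hypothetical, the stopping rule based on the number of marks is omitted, and the real bitmaps also carry coding information, which is left out here.

```python
from typing import Iterable, List, Tuple


def mark(bitmap: int, addresses: Iterable[int]) -> int:
    """Set one bit per text address (a simplified form of bitmap marking)."""
    for address in addresses:
        bitmap |= 1 << address
    return bitmap


def addresses_of(bitmap: int) -> List[int]:
    """List the addresses whose bits are set, in ascending order."""
    result, address = [], 0
    while bitmap:
        if bitmap & 1:
            result.append(address)
        bitmap >>= 1
        address += 1
    return result


def mixed_query(low_postings: List[List[int]],
                high_title_bitmaps: List[int]) -> Tuple[List[int], bool]:
    """Return (addresses, overlapped) for a mix of low- and high-frequency words."""
    low_bits = 0
    for addresses in low_postings:       # mark the texts hit by each low-frequency word
        low_bits = mark(low_bits, addresses)
    high_bits = 0
    for bitmap in high_title_bitmaps:    # combine the high-frequency word title bitmaps
        high_bits |= bitmap
    overlap = low_bits & high_bits
    if overlap:
        return addresses_of(overlap), True            # overlapping texts are the result
    return addresses_of(low_bits | high_bits), False  # no overlap: return everything found
```

A caller would apply relevance weighting to the result when `overlapped` is true, and otherwise order the returned texts by text weight.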
If the queried morphemes are all high-frequency words, the texts whose titles are covered by the high-frequency words are first determined according to the high-frequency word title bitmap. If the retrieval result satisfies the predetermined requirement, the retrieval ends. If it does not, the texts whose full texts are covered by the high-frequency words are then determined according to the high-frequency word full-text bitmap.
Specific embodiment one. In this embodiment, the queried morphemes are "people of the world", "world peace" and "world's future", and retrieval is performed on the index entries already built. The specific steps, shown in Fig. 5, are as follows:
Step 1: determine that the query consists of multiple morphemes containing the center word "world" and the non-center words "people", "peace" and "future";
Step 2: if, among the queried morphemes, "world's future" is a low-frequency word, "people of the world" is a high-frequency word and "world peace" is an ultrahigh-frequency word, perform a merge search on "world's future" to retrieve the corresponding texts;
Step 3: judge according to the title index whether "world peace" hits the titles of the retrieved texts; if there is no hit, the retrieval ends and the texts retrieved for "world's future" are the retrieval result; if there is a hit, execute step 4;
Step 4: skip the texts retrieved for "world's future" that lie, within the same merge segment, between the text at the text offset address of "world peace" and the current text, where the text offset address of "world peace" is obtained from the ultrahigh-frequency word text offset address bitmap; read the texts after the skip until "world peace" no longer hits the title of a retrieved text, thereby obtaining the corresponding texts;
Step 5: judge whether the texts obtained satisfy the predetermined requirement; if the retrieval result satisfies the predetermined requirement, the retrieval ends; if it does not, execute step 6;
Step 6: determine that the queried morphemes are a mix of high-frequency and low-frequency words; retrieve the corresponding texts for "world's future" by bitmap marking, and retrieve the coding information and text addresses of the texts corresponding to "people of the world" according to the high-frequency word title bitmap, thereby retrieving the corresponding texts;
Step 7: judge whether the texts retrieved for "world's future" overlap with the texts retrieved for "people of the world"; if they overlap, execute step 8; if they do not, the retrieval ends and the retrieved texts are displayed according to text weight;
Step 8: the overlapping texts are the final retrieved texts; relevance weighting is applied to the overlapping texts, thereby updating the relevance weights of the texts.
The technical scheme of the text retrieval device provided by the embodiments of the invention, as shown in Fig. 6, comprises:
A text address determination module 1, configured to generate coding information and determine text addresses according to the text weights in the text library, where the coding information indicates the order of the texts;
An index entry determination module 2, configured to establish index entries according to the coding information generated and the text addresses determined by the text address determination module 1, where the index entries comprise a title index, a full-text index, a high-frequency word title bitmap, a high-frequency word full-text bitmap and an ultrahigh-frequency word text offset address bitmap. The title index records the coding information and text addresses of the texts corresponding to titles; the full-text index records the coding information and text addresses of the texts corresponding to full texts, where a full text comprises the title and the content; the high-frequency word title bitmap records the coding information and text addresses of the texts in whose titles each high-frequency word appears; the high-frequency word full-text bitmap records the coding information and text addresses of the texts in whose full texts each high-frequency word appears; and the ultrahigh-frequency word text offset address bitmap records, for each ultrahigh-frequency word, the offset address, within the text addresses of the title index, of the text that occupies the largest text address among the texts in whose titles the word appears. A high-frequency word is a word whose text coverage rate lies within a predetermined interval, and an ultrahigh-frequency word is a word whose text coverage rate exceeds the maximum of that interval;
A retrieval module 3, configured to retrieve the corresponding texts according to the queried morphemes by using the title index of the index entry determination module 2 and then filtering according to the ultrahigh-frequency word text offset address bitmap, and to finish the retrieval if the retrieval result satisfies the predetermined requirement; and, if the retrieval result does not satisfy the predetermined requirement, to retrieve the corresponding texts according to the queried morphemes by using the full-text index of the index entry determination module 2 and then filtering according to the high-frequency word title bitmap and the high-frequency word full-text bitmap.
In the above technical scheme, as shown in Fig. 7, the text address determination module 1 specifically comprises:
A word segmentation submodule 11, configured to read a text from the text library, parse the fields in the text, and generate a word segmentation result;
A weight determination submodule 12, configured to determine the combination of the fields according to the word segmentation result generated by the word segmentation submodule 11, determine the weight of the current text, and update the relevance weight of the text;
A text address generation submodule 13, configured to read all texts in the text library for which coding information has been generated and to determine the text addresses according to the text weights determined and the relevance weights updated by the weight determination submodule 12.
In the above technical scheme, as shown in Fig. 8, the retrieval module 3 comprises:
A first morpheme determination submodule 31, configured to determine whether the query is a single morpheme or multiple morphemes; a single morpheme is passed to the single-morpheme retrieval submodule 32, and multiple morphemes are passed to the multi-morpheme retrieval submodule 33;
The single-morpheme retrieval submodule 32, configured to determine, according to the title index, the coding information and text addresses of the titles covered by the single morpheme, thereby retrieving the corresponding texts;
The multi-morpheme retrieval submodule 33, comprising:
A judging submodule 331, configured to judge whether the multiple morphemes are all center words; if so, they are passed to the center word retrieval submodule 332; if not, they are passed to the non-center word retrieval submodule 333;
The center word retrieval submodule 332, configured to first perform a merge search on the low-frequency words among the morphemes to retrieve the texts corresponding to the low-frequency words, and then, for each ultrahigh-frequency word among the morphemes, to judge according to the title index whether the ultrahigh-frequency word hits the titles of the texts retrieved for the low-frequency words; if there is no hit, the retrieval ends; if there is a hit, the texts retrieved by the low-frequency-word merge search that lie, within the same merge segment, between the text at the ultrahigh-frequency word's text offset address and the current text are skipped, and the texts after the skip are read until the ultrahigh-frequency word no longer hits a title. A low-frequency word is a word whose text coverage rate is below the minimum of the predetermined interval;
The non-center word retrieval submodule 333, configured to first retrieve by means of the center word retrieval submodule 332; if the retrieval result satisfies the predetermined requirement, the retrieval ends; if it does not, a merge search is performed on the center words among the multiple morphemes; if the texts obtained by the merge search overlap with the texts already found, the overlapping texts are the retrieved texts and are given relevance weighting; otherwise, the retrieved texts are displayed according to text weight.
In the above technical scheme, as shown in Figure 9, the retrieval module 3 further comprises:
The second morpheme determination submodule 34, configured to determine whether the query morphemes are all low-frequency words, a mix of high-frequency and low-frequency words, or all high-frequency words. If they are all low-frequency words, they are sent to the all-low-frequency word retrieval submodule 35; if they are a mix of high-frequency and low-frequency words, they are sent to the mixed-frequency word retrieval submodule 36; if they are all high-frequency words, they are sent to the all-high-frequency word retrieval submodule 37;
The all-low-frequency word retrieval submodule 35, configured to retrieve the corresponding texts by bitmap marking for the low-frequency words among the morphemes;
The mixed-frequency word retrieval submodule 36, configured to retrieve the corresponding texts by bitmap marking for the low-frequency words among the morphemes, and to determine, according to the high-frequency word title bitmap, the texts whose titles are covered by the high-frequency words among the morphemes; where the texts obtained in these two steps overlap, the relevance of the overlapping texts is weighted;
The all-high-frequency word retrieval submodule 37, configured to first determine, according to the high-frequency word title bitmap, the texts whose titles are covered by the high-frequency words among the morphemes. If the retrieval result satisfies the predetermined requirement, retrieval ends; if it does not, the submodule continues to determine, according to the high-frequency word full-text bitmap, the texts whose full text is covered by the high-frequency words among the morphemes.
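The dispatch performed by submodule 34 and the bitmap operations used by submodules 35 to 37 might look roughly like the sketch below. It is only an illustration under stated assumptions: a Python integer stands in for a bitmap (bit i set means text i is marked), every word that is not low-frequency is treated as high-frequency, and the names `query_class`, `mark_bitmap` and `bitmap_to_ids` are hypothetical.

```python
from typing import List


def query_class(coverages: List[float], interval_min: float) -> str:
    """Classify a query by the text coverage rates of its morphemes.

    A word is low-frequency when its coverage is below the minimum of the
    predetermined interval; anything else counts as high-frequency here (assumption).
    """
    lows = [c < interval_min for c in coverages]
    if all(lows):
        return "all_low"
    if not any(lows):
        return "all_high"
    return "mixed"


def mark_bitmap(text_ids: List[int]) -> int:
    """Bitmap marking: set one bit per text in which a word occurs."""
    bits = 0
    for text_id in text_ids:
        bits |= 1 << text_id
    return bits


def bitmap_to_ids(bits: int) -> List[int]:
    """List the positions of the set bits, i.e. the marked texts."""
    ids, pos = [], 0
    while bits:
        if bits & 1:
            ids.append(pos)
        bits >>= 1
        pos += 1
    return ids
```

For a mixed query, the texts marked for the low-frequency words can be ANDed with the high-frequency word title bitmap, for instance `bitmap_to_ids(mark_bitmap(low_ids) & title_bitmap)`, to obtain the overlapping texts whose relevance is then weighted.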
In the above technical scheme, the predetermined requirement is determined according to the requirement on the retrieval result. It may mean that the retrieved texts include the needed text, or that the number of retrieved texts is greater than a predetermined value. If the querier only wants a general result and has no high requirement, the predetermined value can be relatively small, such as 30 or 50; if the querier requires high accuracy and wants a comprehensive query, the predetermined value can be relatively large, such as 300 or 500. The predetermined requirement may also mean that the number of texts satisfying a text quality grade is greater than a certain threshold, where the text quality grade is determined by the degree of matching between the query morphemes and the retrieved texts. The threshold can be set according to the querier's specific requirement, for example 50 or 100. If the querier only wants a general result and has no high requirement, the required matching degree between the query morphemes and the retrieved texts can be set relatively low, such as 30% or 50%; if the querier requires high accuracy and wants a comprehensive query, the required matching degree can be set relatively high, such as 80% or 90%.
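Two of the forms of the predetermined requirement described above can be expressed as simple checks. This is a minimal sketch: the function names and the representation of matching degrees as fractions between 0 and 1 are assumptions, and the default values are just the examples given in the text.

```python
from typing import List


def enough_results(num_texts: int, predetermined_value: int = 30) -> bool:
    """True when the number of retrieved texts exceeds the predetermined value."""
    return num_texts > predetermined_value


def enough_quality_results(match_degrees: List[float],
                           match_level: float = 0.5,
                           threshold: int = 50) -> bool:
    """True when more than `threshold` texts reach the required matching degree.

    `match_degrees` holds one matching degree per retrieved text (assumed
    representation of the text quality grade).
    """
    qualified = sum(1 for degree in match_degrees if degree >= match_level)
    return qualified > threshold
```

A querier who wants a comprehensive query would call these with larger values, for example `enough_results(n, 300)` or `enough_quality_results(degrees, match_level=0.8, threshold=100)`.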
The specific implementation of the processing functions of the modules included in the above device has been described in the preceding method embodiments and is not repeated here.
With the text retrieval method and device described in the embodiments of the invention, a high-frequency word title bitmap, a high-frequency word full-text bitmap and an ultrahigh-frequency word text offset address bitmap are added on the basis of the title index and the full-text index. During retrieval, the title index and the ultrahigh-frequency word text offset address bitmap are used first, which achieves fast merging and improves system performance. When the retrieval result does not meet the predetermined requirement, retrieval continues with the full-text index, the high-frequency word title bitmap and the high-frequency word full-text bitmap. This substantially improves system performance while also improving the recall of good results, thereby improving the accuracy of retrieval, the accuracy and recall of the system, and the user experience.
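The overall two-stage flow summarized above can be sketched as follows. The stage functions and the `satisfied` predicate are passed in as parameters because their internals are described elsewhere; the name `two_stage_retrieve` is hypothetical. Appending new full-text hits after the title hits is an assumption, since the text does not state how the two result sets are combined.

```python
from typing import Callable, List

Stage = Callable[[List[str]], List[int]]


def two_stage_retrieve(query: List[str],
                       title_stage: Stage,
                       full_text_stage: Stage,
                       satisfied: Callable[[List[int]], bool]) -> List[int]:
    """Title-first retrieval with a full-text fallback.

    Stage 1 stands for the title index plus the ultrahigh-frequency word text
    offset address bitmap; stage 2 stands for the full-text index plus the
    high-frequency word title and full-text bitmaps.
    """
    results = list(title_stage(query))
    if satisfied(results):
        return results  # predetermined requirement met, stop early
    seen = set(results)
    for addr in full_text_stage(query):
        if addr not in seen:  # assumption: keep title hits first, append new ones
            results.append(addr)
            seen.add(addr)
    return results
```

The early return is what gives the speed-up described in the text: the more expensive full-text stage runs only when the title stage does not satisfy the predetermined requirement.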
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims (10)

1. A text retrieval method, characterized by comprising:
generating coding information, and determining text addresses according to the text weights in a text library, the coding information being used to indicate the order of the texts;
establishing index entries according to the generated coding information and the determined text addresses, the index entries comprising a title index, a full-text index, a high-frequency word title bitmap, a high-frequency word full-text bitmap and an ultrahigh-frequency word text offset address bitmap;
retrieving the corresponding texts according to the query morphemes by means of the title index followed by filtering according to the ultrahigh-frequency word text offset address bitmap, and, if the retrieval result does not satisfy a predetermined requirement, retrieving the corresponding texts according to the query morphemes by means of the full-text index followed by filtering according to the high-frequency word title bitmap and the high-frequency word full-text bitmap, the predetermined requirement being determined according to the requirement on the retrieval result;
wherein the title index is used to record the coding information and text addresses of the texts corresponding to titles; the full-text index is used to record the coding information and text addresses of the texts corresponding to full text, the full text comprising title and content; the high-frequency word title bitmap is used to record the coding information and text addresses of the texts in whose titles each high-frequency word appears; the high-frequency word full-text bitmap is used to record the coding information and text addresses of the texts in whose full text each high-frequency word appears; the ultrahigh-frequency word text offset address bitmap is used to record, for each ultrahigh-frequency word, the offset address within the text addresses of the title index of the text having the largest text address among the texts in whose titles that word appears; a high-frequency word is a word whose text coverage rate lies within a predetermined interval, and an ultrahigh-frequency word is a word whose text coverage rate exceeds the maximum of the predetermined interval.
2. The method according to claim 1, characterized in that the process of determining text addresses according to the text weights in the text library specifically comprises:
reading one text in the text library, parsing the fields in the text, and generating a word segmentation result;
determining the combination mode of the fields according to the generated word segmentation result, determining the weight of the current text, and updating the relevance weight of the text;
repeating the above process until all texts in the text library for which coding information is generated have been read, and determining the text addresses according to the determined text weights and the updated relevance weights of the texts.
3. The method according to claim 1, characterized in that the process of retrieving the corresponding texts according to the query morphemes by means of the title index followed by filtering according to the ultrahigh-frequency word text offset address bitmap specifically comprises:
determining whether the query consists of a single morpheme or multiple morphemes;
if the query is a single morpheme, determining according to the title index the coding information and text addresses corresponding to the titles covered by the morpheme, thereby retrieving the corresponding texts;
if the query consists of multiple morphemes that are all central words, first performing a merge search on the low-frequency words among the morphemes and retrieving the texts corresponding to the low-frequency words, and then, for an ultrahigh-frequency word among the morphemes, judging according to the title index whether the ultrahigh-frequency word hits the title of a text retrieved for the low-frequency words; if it does not hit, ending retrieval; if it hits, skipping the texts, retrieved by the low-frequency-word merge search, that lie in the same merge segment between the text at the ultrahigh-frequency word text offset address and the current text, and reading texts after the skip until the ultrahigh-frequency word no longer hits a title, thereby obtaining the retrieved texts, a low-frequency word being a word whose text coverage rate is below the minimum of the predetermined interval;
if the query consists of multiple morphemes that include a non-central word, first retrieving with the retrieval method used when the multiple morphemes are all central words; if the retrieval result satisfies the predetermined requirement, ending retrieval; if it does not, performing a merge search on the central words among the multiple morphemes, and, if the texts obtained by the merge search overlap with the texts already found, taking the overlapping texts as the retrieved texts and weighting their relevance, and otherwise displaying the retrieved texts according to text weight.
4. The method according to claim 1, characterized in that the process of retrieving the corresponding texts according to the query morphemes by means of the full-text index followed by filtering according to the high-frequency word title bitmap and the high-frequency word full-text bitmap is specifically:
determining whether the query morphemes are all low-frequency words, mixed-frequency words, or all high-frequency words, mixed-frequency words meaning that the words include both high-frequency words and low-frequency words;
if the query morphemes are all low-frequency words, retrieving the corresponding texts by bitmap marking for the low-frequency words among the morphemes;
if the query morphemes are mixed-frequency words, retrieving the corresponding texts by bitmap marking for the low-frequency words among the morphemes, and determining according to the high-frequency word title bitmap the texts whose titles are covered by the high-frequency words among the morphemes; if the texts obtained through the low-frequency words overlap with the texts obtained through the high-frequency words, taking the overlapping texts as the retrieved texts and weighting their relevance, and otherwise displaying the retrieved texts according to text weight;
if the query morphemes are all high-frequency words, first determining according to the high-frequency word title bitmap the texts whose titles are covered by the high-frequency words among the morphemes; if the retrieval result satisfies the predetermined requirement, ending retrieval; if it does not, continuing to determine according to the high-frequency word full-text bitmap the texts whose full text is covered by the high-frequency words among the morphemes.
5. The method according to any one of claims 1 to 4, characterized in that the predetermined requirement specifically comprises: the retrieved texts include the needed text, or the number of retrieved texts is greater than a predetermined value, or the number of texts satisfying a text quality grade is greater than a certain threshold, the text quality grade being determined by the degree of matching between the query morphemes and the retrieved texts.
6. A text retrieval device, characterized by comprising:
a text address determination module, configured to generate coding information and to determine text addresses according to the text weights in a text library, the coding information being used to indicate the order of the texts;
an index entry determination module, configured to establish index entries according to the coding information generated and the text addresses determined by the text address determination module, the index entries comprising a title index, a full-text index, a high-frequency word title bitmap, a high-frequency word full-text bitmap and an ultrahigh-frequency word text offset address bitmap; the title index is used to record the coding information and text addresses of the texts corresponding to titles; the full-text index is used to record the coding information and text addresses of the texts corresponding to full text, the full text comprising title and content; the high-frequency word title bitmap is used to record the coding information and text addresses of the texts in whose titles each high-frequency word appears; the high-frequency word full-text bitmap is used to record the coding information and text addresses of the texts in whose full text each high-frequency word appears; the ultrahigh-frequency word text offset address bitmap is used to record, for each ultrahigh-frequency word, the offset address within the text addresses of the title index of the text having the largest text address among the texts in whose titles that word appears; a high-frequency word is a word whose text coverage rate lies within a predetermined interval, and an ultrahigh-frequency word is a word whose text coverage rate exceeds the maximum of the predetermined interval;
a retrieval module, configured to retrieve the corresponding texts according to the query morphemes by means of the title index in the index entry determination module followed by filtering according to the ultrahigh-frequency word text offset address bitmap, and, if the retrieval result does not satisfy a predetermined requirement, to retrieve the corresponding texts according to the query morphemes by means of the full-text index in the index entry determination module followed by filtering according to the high-frequency word title bitmap and the high-frequency word full-text bitmap, the predetermined requirement being determined according to the requirement on the retrieval result.
7. The device according to claim 6, characterized in that the text address determination module specifically comprises:
a word segmentation submodule, configured to read one text in the text library, parse the fields in the text, and generate a word segmentation result;
a weight determination submodule, configured to determine the combination mode of the fields according to the word segmentation result generated by the word segmentation submodule, determine the weight of the current text, and update the relevance weight of the text;
a text address generation submodule, configured to determine the text addresses, once all texts in the text library for which coding information is generated have been read, according to the text weights determined by the weight determination submodule and the updated relevance weights of the texts.
8. The device according to claim 6, characterized in that the retrieval module comprises:
a first morpheme determination submodule, configured to determine whether the query consists of a single morpheme or multiple morphemes, a single morpheme being sent to a single-morpheme retrieval submodule and multiple morphemes being sent to a multi-morpheme retrieval submodule;
the single-morpheme retrieval submodule, configured to determine according to the title index the coding information and text addresses corresponding to the titles covered by the single morpheme, thereby retrieving the corresponding texts;
the multi-morpheme retrieval submodule, comprising:
a judgment submodule, configured to judge whether the multiple morphemes are all central words; if so, they are sent to a central-word retrieval submodule, and if not, they are sent to a non-central-word retrieval submodule;
the central-word retrieval submodule, configured to first perform a merge search on the low-frequency words among the morphemes and retrieve the texts corresponding to the low-frequency words, and then, for an ultrahigh-frequency word among the morphemes, to judge according to the title index whether the ultrahigh-frequency word hits the title of a text retrieved for the low-frequency words; if it does not hit, retrieval ends; if it hits, the texts, retrieved by the low-frequency-word merge search, that lie in the same merge segment between the text at the ultrahigh-frequency word text offset address and the current text are skipped, and texts after the skip are read until the ultrahigh-frequency word no longer hits a title, thereby obtaining the retrieved texts, a low-frequency word being a word whose text coverage rate is below the minimum of the predetermined interval;
the non-central-word retrieval submodule, configured to first retrieve using the central-word retrieval submodule; if the retrieval result satisfies the predetermined requirement, retrieval ends; if it does not, a merge search is performed on the central words among the multiple morphemes, and, if the texts obtained by the merge search overlap with the texts already found, the overlapping texts are the retrieved texts and their relevance is weighted, and otherwise the retrieved texts are displayed according to text weight.
9. The device according to claim 6, characterized in that the retrieval module further comprises:
a second morpheme determination submodule, configured to determine whether the query morphemes are all low-frequency words, mixed-frequency words, or all high-frequency words, mixed-frequency words meaning that the words include both high-frequency words and low-frequency words; all low-frequency words are sent to an all-low-frequency word retrieval submodule; mixed-frequency words are sent to a mixed-frequency word retrieval submodule; all high-frequency words are sent to an all-high-frequency word retrieval submodule;
the all-low-frequency word retrieval submodule, configured to retrieve the corresponding texts by bitmap marking for the low-frequency words among the morphemes;
the mixed-frequency word retrieval submodule, configured to retrieve the corresponding texts by bitmap marking for the low-frequency words among the morphemes, and to determine according to the high-frequency word title bitmap the texts whose titles are covered by the high-frequency words among the morphemes; if the texts obtained through the low-frequency words overlap with the texts obtained through the high-frequency words, the overlapping texts are the retrieved texts and their relevance is weighted, and otherwise the retrieved texts are displayed according to text weight;
the all-high-frequency word retrieval submodule, configured to first determine according to the high-frequency word title bitmap the texts whose titles are covered by the high-frequency words among the morphemes; if the retrieval result satisfies the predetermined requirement, retrieval ends; if it does not, the submodule continues to determine according to the high-frequency word full-text bitmap the texts whose full text is covered by the high-frequency words among the morphemes.
10. The device according to any one of claims 6 to 9, characterized in that the predetermined requirement specifically comprises: the retrieved texts include the needed text, or the number of retrieved texts is greater than a predetermined value, or the number of texts satisfying a text quality grade is greater than a certain threshold, the text quality grade being determined by the degree of matching between the query morphemes and the retrieved texts.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2009100887508A CN101599078B (en) 2009-07-10 2009-07-10 Method and device for text retrieval

Publications (2)

Publication Number Publication Date
CN101599078A CN101599078A (en) 2009-12-09
CN101599078B true CN101599078B (en) 2011-04-20

Family

ID=41420526

Country Status (1)

Country Link
CN (1) CN101599078B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102024027B (en) * 2010-11-17 2013-03-20 北京健康在线网络技术有限公司 Method for establishing medical database
CN109086285B (en) * 2017-06-14 2021-10-15 佛山辞荟源信息科技有限公司 Intelligent Chinese processing method, system and device based on morphemes
CN108197249A (en) * 2017-12-29 2018-06-22 廖赟 A kind of big data search method based on recommendation weights
CN109977227B (en) * 2019-03-19 2021-06-22 中国科学院自动化研究所 Text feature extraction method, system and device based on feature coding
CN110413735B (en) * 2019-07-25 2022-04-29 深圳供电局有限公司 Question and answer retrieval method and system, computer equipment and readable storage medium
CN112686717B (en) * 2021-03-11 2021-07-02 腾讯科技(深圳)有限公司 Data processing method and system for advertisement recall
CN113093661A (en) * 2021-04-08 2021-07-09 四川轻化工大学 Embedded machine tool alarm text processing device and control method thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20160115

Address after: Floors 5-10, Fiyta Building, South Road, High-tech Zone, Nanshan District, Shenzhen, Guangdong Province 518057

Patentee after: Shenzhen Tencent Computer System Co., Ltd.

Address before: Room 403, East Building 2, SEG Science and Technology Park, Futian District, Shenzhen, Guangdong 518028, China

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.