CN103823799A - New-generation industry knowledge full-text search method - Google Patents


Info

Publication number
CN103823799A
CN103823799A (application CN201210461748.2A)
Authority
CN
China
Prior art keywords
word
index
participle
document
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210461748.2A
Other languages
Chinese (zh)
Inventor
王卫民
符建辉
王石
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Original Assignee
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority to CN201210461748.2A
Publication of CN103823799A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 — Details of database functions independent of the retrieved data types
    • G06F16/93 — Document management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A new-generation industry knowledge full-text search method includes: (1) building a word-segmentation dictionary and storing the dictionary information in a database; (2) building a full index: reading, segmenting, and analyzing the existing full-text documents (called "knowledge-point documents") to create an index file; (3) building an incremental index: processing newly added documents and updating the index file on disk; (4) building an in-memory index, including an in-memory segmentation dictionary: reading the segmentation dictionary data into memory to build the in-memory dictionary data structure; (5) full-text search: normalizing the user's question, segmenting it into words, performing semantic understanding and semantic expansion, obtaining candidate documents, and ranking the candidates. The segmentation dictionary is built at system initialization; the full index is built by reading all knowledge-point documents into a disk index file ("index file" for short); the incremental index is built whenever full-text documents are newly added. These three activities are independent of the full-text retrieval module and run on their own.

Description

New-generation industry knowledge full-text search method
Technical field
The present invention relates to the field of full-text search, especially full-text search over industry knowledge, and proposes a new industry-knowledge full-text retrieval system and method.
Background art
Full-text search means that an indexing program scans every word in a document and builds an index entry for each word, recording how often and where the word occurs. When a user issues a query, the search program looks the query up in the pre-built index and returns the matching results to the user, much as one looks up a word through the index table of a dictionary. Full-text search is a retrieval method that matches the query terms against all of the text stored in the files, and a full-text retrieval system is a software system, built on full-text search theory, that provides retrieval over whole documents. It can retrieve arbitrary content stored in the database from an entire book or article, extract the relevant chapters, paragraphs, sentences, or words on demand, and support various statistics and analyses. For example, it can answer a question such as "How many times does 'Lin Daiyu' appear in Dream of the Red Chamber?"
Traditional full-text retrieval systems match only on keywords; they lack multi-faceted semantic recognition (English, pinyin, typos, synonyms, near-synonyms) and the ability to correct errors. As users demand ever more intelligence, traditional full-text retrieval systems look increasingly dated.
To solve these problems, a new full-text retrieval system is urgently needed, one that makes retrieval more intelligent. Concretely: it should let pinyin, Chinese characters, and English express one another, so that when the user types "shengka" the system understands that the intended query is probably "sound card"; it should correct typos; and it should perform semantic understanding and semantic expansion. For example, the inputs "Business Navigation", the typo variant "morning navigation", "shangwulinghang", and "shwlh" should all achieve the search effect of "Business Navigation"; and colloquial queries of similar intent, such as "how do I apply for broadband", "how is broadband installed", "get me a broadband line", or "I want to get broadband", should all correctly return answers about "broadband application".
Summary of the invention
To address the problems above, the present invention starts from a traditional retrieval system based on keyword and word matching and adds multi-faceted semantic recognition and error correction (English, pinyin, typos, synonyms, near-synonyms), together with semantic-expansion capabilities such as hypernym/hyponym and attribute recognition. The present invention is a full-text retrieval system with semantic understanding and semantic expansion.
Technical scheme: to overcome the problems above, the invention provides a new-generation industry knowledge full-text search method, characterized by the following steps:
Step 1, build the segmentation dictionary: construct the word-segmentation dictionary and store the dictionary information in a database;
Step 2, build the full index: read, segment, and analyze the existing full-text documents (also called "knowledge-point documents") and create an index file;
Step 3, build the incremental index: process newly added documents and update the index file on disk;
Step 4, build the in-memory index, comprising:
Step 4-1, build the in-memory segmentation dictionary: read the segmentation dictionary data into memory and build the in-memory dictionary data structure;
Step 4-2, build the full in-memory index: read the index file from disk and build the complete in-memory index;
Step 4-3, build the incremental in-memory index: process newly added documents and update the in-memory index incrementally;
Step 5, full-text search, comprising:
Step 5-1, normalize the user's question: take the question the user asks and normalize it (also called "normalization processing"): remove redundant words, remove useless punctuation that carries no semantic information, correct typos, and normalize aliases;
Step 5-2, word segmentation: segment the normalized question into words;
Step 5-3, semantic understanding: process the segmentation result and extract, for each segmented word in the question, the word class it belongs to or its standard word, yielding the word's semantic information;
Step 5-4, semantic expansion: expand the words' semantic information semantically; the expanded semantic information is represented by a set of words or word classes;
Step 5-5, obtain candidate documents: using the words or word classes obtained by semantic expansion (which represent the expanded semantics), search the in-memory index for the corresponding full-text documents as candidates;
Step 5-6, rank candidate documents: score the candidates from multiple angles; the higher the score, the higher the rank, and the sorted candidates become the final full-text search result;
Wherein the segmentation dictionary is built at system initialization; the full index is built by reading all knowledge-point documents into a disk index file ("index file" for short); the incremental index is built whenever full-text documents are newly added. These three activities are independent of the full-text retrieval module and run on their own.
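The five retrieval steps above can be sketched end to end in a few lines. This is a minimal illustration only: the synonym table, index, and scoring are invented stand-ins, and whitespace splitting stands in for real Chinese word segmentation.

```python
# Minimal end-to-end sketch of retrieval steps 5-1 through 5-6. The tiny
# synonym table, index, and scoring below are illustrative stand-ins, not
# the patent's actual data structures; whitespace splitting stands in for
# real Chinese word segmentation.

SYNONYMS = {"handle": "apply", "install": "apply"}        # toy word classes
INDEX = {"apply": {"doc1"}, "broadband": {"doc1", "doc2"}}

def normalize(q):                          # step 5-1: drop punctuation noise
    return "".join(ch for ch in q.lower() if ch.isalnum() or ch == " ")

def segment(q):                            # step 5-2: stand-in segmentation
    return q.split()

def understand(tokens):                    # steps 5-3/5-4: map to word classes
    return [SYNONYMS.get(t, t) for t in tokens]

def candidates(terms):                     # step 5-5: union of postings
    docs = set()
    for t in terms:
        docs |= INDEX.get(t, set())
    return docs

def rank(docs, terms):                     # step 5-6: score = matched terms
    return sorted(docs, key=lambda d: (-sum(d in INDEX.get(t, set()) for t in terms), d))

def search(query):
    terms = understand(segment(normalize(query)))
    return rank(candidates(terms), terms)
```

With this toy data, "How to handle broadband?" maps "handle" onto the "apply" class and ranks doc1 (two matched terms) above doc2 (one).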
Building the segmentation dictionary in step 1 mainly means constructing the dictionary itself; the dictionary built here is a "two-level segmentation dictionary", constructed as follows:
Step 1-1, form the first-level words from a "general segmentation vocabulary" plus a "business vocabulary";
Wherein the "general segmentation vocabulary" uses the vocabulary of the Institute of Computing Technology, Chinese Academy of Sciences, and the "business vocabulary" contains industry-specific proper nouns and can be built by importing the industry's business names;
Step 1-2, automatically sub-segment the first-level words to form candidate second-level words;
Step 1-3, manually screen the candidate second-level words;
The resulting two-level dictionary has the format: first-level word, followed by the array of its second-level segmentations (separated by '|').
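The two-level dictionary format just described can be sketched with plain dicts. The entries below are invented English stand-ins for the Chinese examples; `refine` shows how a first-level segmentation might be expanded into second-level words.

```python
# A sketch of the two-level dictionary format: each first-level word stores
# its second-level segmentation as a '|'-separated string. The entries are
# invented English stand-ins for the Chinese examples.

RAW_ENTRIES = [
    ("business navigation", "business|navigation"),
    ("cannot access net", "cannot|access net"),
]

def load_secondary_dict(raw):
    return {head: variants.split("|") for head, variants in raw}

def refine(tokens, secondary):
    """Replace each first-level token by its second-level segmentation."""
    out = []
    for t in tokens:
        out.extend(secondary.get(t, [t]))
    return out
```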
Building the full index in step 2 realizes the full construction of the data index file. The main process is:
Step 2-1, read each knowledge-point document and segment it. During segmentation, the ordinary segmentation dictionary is combined with a hypernym/hyponym dictionary that carries semantic relations, producing several alternative segmentations, which are sorted by the number of words in each result and by word length. Segmentation reads the document line by line; each line is then cut at certain punctuation marks into short fragments, and each fragment is segmented against the segmentation dictionary and the hypernym/hyponym word-class dictionary (a word class is a collective name for a group of words with the same or a similar meaning). Whether to index by word class or by word is decided by the following rules:
1. If a word has exactly one word class (and it is not a redundant class), index it under the class name;
2. If a word has more than one word class, index it under each of its classes (excluding redundant classes);
3. If a word is in the dictionary but has no word class, index it under the word itself;
4. If a word is in the dictionary but its class is redundant, do not index it;
Step 2-2, build the index: create an index structure for each word/word class.
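The four indexing rules above can be rendered as a small decision function. `WORD_CLASSES` and `REDUNDANT` are invented toy data; the upper-case names stand in for word classes.

```python
# Hedged sketch of the four indexing rules. WORD_CLASSES and REDUNDANT are
# invented toy data, not the patent's dictionary.

WORD_CLASSES = {
    "handle": ["APPLY"],            # rule 1: one class, index the class
    "open":   ["APPLY", "START"],   # rule 2: several classes, index each
    "modem":  [],                   # rule 3: in dictionary, no class
    "please": ["POLITE"],           # rule 4: only a redundant class
}
REDUNDANT = {"POLITE"}

def index_terms(word):
    classes = WORD_CLASSES.get(word)
    if classes is None:
        return []                          # not in the dictionary at all
    kept = [c for c in classes if c not in REDUNDANT]
    if kept:
        return kept                        # rules 1 and 2
    return [word] if not classes else []   # rule 3 vs. rule 4
```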
Step 2-2 further comprises the following steps:
Step 2-2-1, build the index file: an index is a method for finding the corresponding documents from index terms. In English text, words are separated by whitespace and can be segmented directly; Chinese text is segmented with the ICTCLAS segmenter of the Institute of Computing Technology, Chinese Academy of Sciences, and the words produced by segmentation serve directly as index terms for word-level or class-level indexing;
The index is built as an inverted file: for each document in turn, record the occurrence positions of every word it contains, together with the word's class. Each word occurring in each document thus yields a triple <DocID (document ID), TermID (word ID) | WordClassID (class ID), Positions (list of positions)>, where Positions records the positions at which index term TermID occurs in DocID.
The index structure is:
<TermID | WordClassID (word ID | class ID), <DocID (document ID), <start position of the word in the document and its line number, <index object>>>>
This is a three-level nested structure, described from the inside out:
The innermost index object:
int StartIndex;   // start position at which the word or word class occurs in the indexed document
short int Length; // length of the word or word class
Third layer: NkiInt2Ptr type <start position of the word in the document and its line number, <index object>>
Second layer: NkiString2Ptr type <document ID, third-layer structure>
First layer: NkiString2Ptr type <word class/word, second-layer structure>
The meaning of each field in the index:
Word class/word: obtained by segmenting the document content;
Document ID: the unique identifier of the document (here the document name), since one word/word class may appear in many documents;
Start position of the word in the document: the same word may occur several times in one document, and this information is also needed later, in the scoring stage, to compute word-to-word distances;
Line number of the word in the document: by the knowledge-point document format, the first line is the business the knowledge point belongs to, the second line is the knowledge point's abstract, and the third line is its detailed content. With this index structure, an index of all the knowledge-point documents to be retrieved can be built;
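The three-level structure above (term, then document ID, then occurrences with start position, length, and line number) can be sketched with nested dicts. Whitespace tokenization is a stand-in for the real segmenter, and the layout is illustrative, not the patent's NkiString2Ptr implementation.

```python
# A plain-dict sketch of the three-level index: term -> document ID -> list
# of occurrences, each occurrence holding the (StartIndex, Length, line
# number) that the innermost index object records. Whitespace tokenization
# stands in for the real segmenter.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, lines in docs.items():
        for line_no, line in enumerate(lines, start=1):
            offset = 0
            for word in line.split():
                start = line.index(word, offset)   # character offset in line
                index[word][doc_id].append((start, len(word), line_no))
                offset = start + len(word)
    return index
```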
Step 2-2-2, compress the index file: each run file, ordered by TermID (WordClassID), stores a large number of <TermID, DocID, Freq, pos1, pos2, ..., posFreq> structures; entries with the same TermID are ordered by ascending DocID, and within each structure the position list pos is also ascending. Compression proceeds in two steps. First, gap coding: each ascending integer sequence is transformed into its difference sequence, the increments between originally adjacent integers. Second, the resulting small integers are encoded with a suitable integer code. All the run files are then merged into the final inverted file; each entry of the final inverted file uses the same coding as when the run files were generated, only without the TermID. The Elias delta code is adopted: an integer x >= 1 is encoded as the gamma code of 1 + floor(log2 x), followed by floor(log2 x) bits giving the binary representation of x - 2^floor(log2 x). For the probability model: let the <term, doc> pairs occurring in the document collection total f; dividing by the number n of distinct index terms and again by the total number of documents N gives p = f / (N*n), the probability that a randomly chosen document contains a randomly chosen index term. Each occurrence of an index term in a document contributes one DocID gap to the inverted file index. Treating the f <term, doc> pairs of an inverted entry as drawn at random from all N*n possible <term, doc> pairs of the collection (a Bernoulli process), the probability that a DocID gap equals x is the probability of x - 1 documents without the particular index term followed by one document containing it:

    P(x) = (1 - p)^(x-1) * p,

which shows that x follows a geometric distribution; the implicit assumption here is that occurrences of <term, doc> pairs are independent and identically distributed (Bernoulli).
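The two compression steps can be sketched briefly. The Elias gamma code is shown as a representative integer code; the delta code described in the text additionally gamma-codes the bit length first.

```python
# Sketch of the two compression steps: an ascending DocID/position list is
# turned into a gap (difference) sequence, and each small positive integer
# is then bit-encoded. Elias gamma is shown as one such code; the patent's
# delta code prefixes the gamma code of the bit length instead.

def gaps(sorted_ids):
    return [b - a for a, b in zip([0] + sorted_ids, sorted_ids)]

def elias_gamma(x):
    assert x >= 1
    b = bin(x)[2:]                    # binary without the '0b' prefix
    return "0" * (len(b) - 1) + b     # unary length prefix, then the value
```

For example, the DocID list [3, 7, 12] becomes the gaps [3, 4, 5], and gamma encodes 5 as "00101".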
Building the in-memory index in step 4 completes the construction of the incremental in-memory index, the full in-memory index, and the in-memory segmentation dictionary. The main process comprises:
Step 4-1, build the in-memory segmentation dictionary: read the segmentation dictionary data into memory and build the in-memory dictionary data structure;
Step 4-2, build the full in-memory index:
For every segmented word W_i in Token_i:
  if W_i is not a single character:
    if W_i has a word class C_i, use C_i to index the document set D = {d_1, d_2, ...};
    otherwise, index with W_i;
  otherwise, index with W_i;
Step 4-3, build the incremental in-memory index:
Step 4-3-1, segment the incremental documents;
Step 4-3-2, incrementally update the existing in-memory segmentation dictionary;
Step 4-3-3, incrementally update the existing in-memory index.
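The step 4-2 pseudocode above can be rendered as an executable sketch. `WORD_CLASS` is an invented toy table, not the patent's dictionary.

```python
# Executable rendering of the step 4-2 pseudocode: every token that is not a
# single character and has a word class is indexed via that class; everything
# else is indexed as the word itself. WORD_CLASS is toy data.

WORD_CLASS = {"handle": "APPLY"}

def memory_index_terms(tokens):
    terms = []
    for w in tokens:
        if len(w) > 1 and w in WORD_CLASS:
            terms.append(WORD_CLASS[w])   # index via word class C_i
        else:
            terms.append(w)               # single character or classless word
    return terms
```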
Full-text search in step 5 proceeds as follows:
Step 5-1, normalize the user's question (also called "normalization processing"): remove redundant words, remove useless punctuation that carries no semantic information, correct typos, and normalize aliases;
Step 5-2, word segmentation: segment the normalized question:
Step 5-2-1, segment the user query;
Step 5-2-2, the operator enters the key phrase, denoted query;
Step 5-2-3, segment query with the first-level and second-level dictionaries, obtaining several segmentations, called the pre-correction segmentation Seg1;
Step 5-2-4, apply error correction to the pre-correction segmentation, obtaining further segmentations, called the post-correction segmentation Seg2; let Seg = Seg1 ∪ Seg2 = {{W_1, W_2, ...}, ...}, with segmentations sorted by increasing number of words. The error-correction procedure:
(1) first correct with the error-correction dictionary;
(2) then correct with statistical information;
Precondition for correction:
(1) for each word in query, record the historical search frequency T of the uncorrected word and the historical search frequency T' of the corrected keyword;
a) if T' >> T, retrieve with the corrected word and show a prompt;
b) otherwise, retrieve with the uncorrected word;
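The T' >> T decision rule can be sketched as follows. The threshold factor of 10 and the guard against a zero frequency are my assumptions; the patent does not fix a concrete value for ">>".

```python
# Sketch of the T' >> T decision: retrieve with the corrected word only when
# its historical search frequency clearly dominates the original's. The
# factor of 10 and the max(t, 1) guard are assumptions, not from the patent.

def choose_query_word(word, corrected, history, factor=10):
    t = history.get(word, 0)            # frequency T of the uncorrected word
    t_corr = history.get(corrected, 0)  # frequency T' of the corrected word
    if t_corr > factor * max(t, 1):
        return corrected, True          # retrieve corrected, show a prompt
    return word, False
```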
Step 5-3, semantic understanding: process the segmentation result and extract, for each segmented word in the question, the word class it belongs to or its standard word, yielding the word's semantic information:
Step 5-3-1, attach word classes to the segmentation result, giving Token = {{W_1(C_1), W_2(C_2), ...}, ...}; a word's meaning is represented by its word class, each word may have several meanings and thus belong to several classes, and Token_i denotes the i-th segmentation;
Step 5-3-2, look up the consulting history:
Find in the consulting history library the past consultation query' most similar to query; if found, return query''s retrieval result as the Top-1 document;
Consulting similarity is defined as:
Sim(Token_i(query), Token_j(query')) = Sim({W_1(C_1), W_2(C_2), ...}, {W_1'(C_1'), W_2'(C_2'), ...})
                                     = avg(a * sem_sim(C_i, C_i') + (1 - a) * syn_sim(W_i, W_i'))
(1) where avg() is the mean function;
(2) sem_sim(C_i, C_i') = 1 if C_i ∩ C_i' ≠ φ, else 0;
(3) syn_sim(W_i, W_i') = the number of characters W_i and W_i' have in common / the number of characters in which W_i and W_i' differ;
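The consulting-similarity formula can be sketched as below. The weight a, the character-set reading of "identical/different characters", and the guard against a zero denominator in syn_sim are my assumptions.

```python
# Sketch of the consulting-similarity formula. The weight a, the set reading
# of "identical/different characters", and the zero-denominator guard in
# syn_sim are assumptions, not fixed by the patent.

def sem_sim(c1, c2):
    return 1.0 if set(c1) & set(c2) else 0.0

def syn_sim(w1, w2):
    same = len(set(w1) & set(w2))          # characters the words share
    diff = len(set(w1) ^ set(w2))          # characters where they differ
    return same / diff if diff else float(same > 0)

def consult_sim(pairs, a=0.5):
    """pairs: [((W_i, classes_i), (W_i_prime, classes_i_prime)), ...]"""
    scores = [a * sem_sim(c1, c2) + (1 - a) * syn_sim(w1, w2)
              for (w1, c1), (w2, c2) in pairs]
    return sum(scores) / len(scores)
```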
Step 5-4, semantic expansion: expand the words' semantic information semantically; the expanded semantic information is represented by a set of words or word classes. Expand Token_i with synonyms; the expanded segmentation is denoted EToken_i;
Step 5-5, obtain candidate documents: using the words or word classes obtained by semantic expansion (which represent the expanded semantics), search the in-memory index for the corresponding full-text documents as candidates;
Step 5-6, rank candidate documents: score the candidates from multiple angles; the higher the score, the higher the rank, and the sorted candidates become the final full-text search result;
Score every document in D. Scoring considers the following factors:
(1) SegNum: the number of segmented words in query;
(2) SegWordWgt: the word's own weight (words in the title, business, or abstract weigh more);
(3) DocWordWgt: the word's weight within the document;
(4) DocHits: the document's retrieval click count;
(5) DocTime: the document's retrieval timestamp;
(6) HitWordWgts: the total weight of the query words that occur in the document;
(7) MissedWordWgts: the total weight of the query words that do not occur in the document;
(8) WordSpan(W_1, W_2, ..., d): the pairwise distances, within document d, of the query's words;
Step 5-6-1:
Credit(d) = HitWordWgts / (HitWordWgts + MissedWordWgts)
WordWgt(w_i, d) = doc_word_wgt(w_i, d) * PosiWgt(w_i)
doc_word_wgt(w_i, d) = tfidf(w_i, d)
PosiWgt(w_i) = 2.0 (tunable), if the word appears in the title or business line
PosiWgt(w_i) = 1.5 (tunable), if the word appears in the abstract
PosiWgt(w_i) = 1.0, otherwise
Step 5-6-2:
Credit(d) *= 1 / log2(SegNum + 1)
Take the Top N documents ranked by Credit(d), breaking ties by time;
Step 5-6-3:
Credit(d) /= (WordSpan(w_1, ..., w_n, d) + 1)
WordSpan(w_1, ..., w_n, d) = Sum over 1 <= i < j <= n of a function of the number of words between w_i and w_j in d
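The three scoring steps combine into a single score as sketched below. The position weights 2.0 / 1.5 / 1.0 follow the text, which itself marks the first two as tunable.

```python
# Sketch combining scoring steps 5-6-1 through 5-6-3. The position weights
# 2.0 / 1.5 / 1.0 follow the text, which marks them as tunable.
import math

def posi_wgt(position):
    if position in ("title", "business"):
        return 2.0       # word appears in the title or business line
    if position == "abstract":
        return 1.5       # word appears in the abstract
    return 1.0

def credit(hit_wgts, missed_wgts, seg_num, word_span):
    c = hit_wgts / (hit_wgts + missed_wgts)   # step 5-6-1
    c *= 1.0 / math.log2(seg_num + 1)         # step 5-6-2
    c /= word_span + 1                        # step 5-6-3
    return c
```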
Step 5-7, save the consulting history:
The system's first Top-1 result: Token_1 = W_1(C_1) W_2(C_2) ... W_n(C_n), document d_k
The user's selection: Token_2 = W_1'(C_1') W_2'(C_2') ... W_n'(C_n'), document d_j
If k ≠ j, and (HistoryTop1(Token_2) = φ, or HistoryTop1(Token_2) ≠ d_k),
prompt the user for feedback, implemented as follows:
Step 5-7-1, when j > 2, after the operator has viewed and closed the document, pop up a feedback dialog to confirm whether the user is satisfied with the query result;
Step 5-7-2, if the operator chooses "yes", save HistoryTop1(Token_2) = d_j in the following format:
<Query, Token_2, doc_type (document type), doc_id (document ID), doc_id_value (document ID value)>.
The retrieval granularities of the full-text search are "Service", "Topic", "Abstract", and "Mix": Service denotes the business classification information of the full-text documents; Topic denotes the first-level knowledge point of a business; Abstract denotes the finest-grained knowledge point, i.e. the most detailed query; and Mix means that every knowledge granularity is returned.
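The four granularities can be held in a small lookup, e.g. to validate a requested granularity; the descriptions paraphrase the text and the validation helper is an invented convenience.

```python
# The four retrieval granularities as a small lookup sketch. Descriptions
# paraphrase the text; valid_granularity is an invented helper.

GRANULARITIES = {
    "Service":  "business classification information of the full-text documents",
    "Topic":    "first-level knowledge point of a business",
    "Abstract": "finest-grained knowledge point (the most detailed query)",
    "Mix":      "results at every knowledge granularity",
}

def valid_granularity(name):
    return name in GRANULARITIES
```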
Beneficial effects:
1. The present invention has stronger semantic understanding and semantic expansion, and is more intelligent.
2. The full-text search provided by the invention is based on an in-memory index, giving higher retrieval efficiency.
3. When ranking the retrieved candidate documents, the invention takes retrieval information, normalization information, historical retrieval information, and the user's input into account, giving a more principled ranking.
4. Before full-text search, the invention normalizes the user's question and applies semantic understanding and semantic expansion, so that the retrieval results better match the user's needs.
5. When ranking candidate documents, the invention uses multiple ranking signals, moving the documents the user needs higher in the list and increasing Top-N recall.
Brief description of the drawings
Fig. 1 is the document data model of the system in the present invention;
Fig. 2 is the use-case diagram of the system in the present invention;
Fig. 3 is the overall activity diagram of the system in the present invention;
Fig. 4 is the segmentation-dictionary management page in the present invention;
Fig. 5 is the index structure used by the system in the present invention;
Fig. 6 is an example retrieval statement of the system of the present invention.
Detailed description of the embodiments
In the present invention, the full-text document collection over which full-text search is performed can be regarded as a set that changes over time. Its document data model is shown in Fig. 1: full-text documents have spatial, temporal, and content characteristics. The document identifier represents the document's spatial characteristic, (document vector, feature vector, topic) represents its content characteristic, and (day, month, year) represents its temporal characteristic.
As shown in Fig. 2 and Fig. 3, a new-generation industry knowledge full-text search method comprises the following steps:
Step 1, build the segmentation dictionary: construct the word-segmentation dictionary and store the dictionary information in a database.
Step 2, build the full index: read, segment, and analyze the existing full-text documents (in the present invention also called knowledge-point documents) and create an index file.
Step 3, build the incremental index: process newly added documents and update the index file on disk.
Step 4, build the in-memory index, comprising:
Step 4-1, build the in-memory segmentation dictionary: read the segmentation dictionary data into memory and build the in-memory dictionary data structure.
Step 4-2, build the full in-memory index: read the index file from disk and build the complete in-memory index.
Step 4-3, build the incremental in-memory index: process newly added documents and update the system's in-memory index incrementally.
Step 5, full-text search, comprising:
Step 5-1, normalize the user's question (also called "normalization processing"): e.g. remove redundant words, remove useless punctuation that carries no semantic information, correct typos, normalize aliases, and so on.
Step 5-2, word segmentation: segment the normalized question.
Step 5-3, semantic understanding: process the segmentation result and extract, for each segmented word in the question, the word class it belongs to or its standard word, yielding the word's semantic information.
Step 5-4, semantic expansion: expand the words' semantic information semantically; the expanded semantic information is represented by a set of words or word classes.
Step 5-5, obtain candidate documents: using the words or word classes obtained by semantic expansion (which represent the expanded semantics), search the in-memory index for the corresponding full-text documents as candidates.
Step 5-6, rank candidate documents: score the candidates from multiple angles; the higher the score, the higher the rank. The sorted candidates become the final full-text search result.
Wherein the segmentation dictionary is built at system initialization; the full index is built by reading all knowledge-point documents into a disk index file (index file for short); the incremental index is built whenever full-text documents are newly added. These three activities are independent of the full-text retrieval module and can run on their own.
Building the segmentation dictionary in step 1 mainly means constructing the dictionary itself (in the present invention, the dictionary built is a "two-level segmentation dictionary"). It is constructed as follows:
Step 1-1, form the first-level words from a "general segmentation vocabulary" plus a "business vocabulary";
Wherein the "general segmentation vocabulary" contains common words; the present invention adopts the published vocabulary of the Institute of Computing Technology, Chinese Academy of Sciences, as the general segmentation vocabulary. The "business vocabulary" contains industry-specific proper nouns and can be built by importing the industry's business names.
The "business vocabulary" is managed at three levels: word class, entry, and alias. Its management page is shown in Fig. 4; through this page the user can manage word classes, entries, and aliases. A word class here is a set of synonyms or near-synonyms; for example, a "family-like" class contains several entries that all mean "family". A standard word marks whether an entry is the commonly used, typo-free form: "family" (家庭) is a standard entry, and its aliases include typo variants of it (rendered in the machine translation as "adding front yard" and "family the court of a feudal ruler"), both misspellings of "family".
Step 1-2, automatically sub-segment the first-level words to form the (possibly multiple) candidate second-level words;
Step 1-3, manually screen the candidate second-level words;
The resulting two-level dictionary has the following format:
first-level word, followed by the array of its second-level segmentations (separated by '|').
For example: "cannot access the internet" maps to its candidate sub-segmentations, separated by '|';
"Business Navigation" maps to "Business" | "Navigation".
structure full dose index described in step 2,major function is that the full dose that realizes data directory file builds.Its main process is as follows:
Step 2-1, read each knowledge point document, knowledge point document is carried out to participle: in participle process, common dictionary for word segmentation and the upper the next dictionary with semantic relation are combined, produce polycomponent word result, and sort according to the length of the number of the word comprising in every group of result and word, also introduce in addition this key concept of part of speech, (institute's predicate classes are exactly a general designation with one group of word of same or the close meaning, for example: handle, one of lane, doing these three words all belongs to and handles part of speech, express " handling " this meaning) when participle, read by row, then every a line is being intercepted according to some punctuation marks, obtain the word of segment, carry out participle according to dictionary for word segmentation and upper the next part of speech dictionary.For being to set up index with part of speech or word on earth, native system has done following regulation,
1. If a word has exactly one word class (and that class is not a redundant word class), it is indexed under the word-class name.
2. If a word has more than one word class, an index entry is built for each of its word classes (excluding redundant word classes).
3. If a word is in the dictionary but has no word class, it is indexed under the word itself.
4. If a word is in the dictionary but carries only a redundant word class, no index entry is built for it.
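The four indexing rules above can be sketched as a small decision function (the function name and data shapes are illustrative assumptions):

```python
def index_keys(word, dictionary, word_classes, redundant):
    """Apply rules 1-4: return the keys under which `word` should be indexed."""
    if word not in dictionary:
        return []
    classes = word_classes.get(word, [])
    useful = [c for c in classes if c not in redundant]
    if useful:          # rules 1 and 2: one entry per non-redundant word class
        return useful
    if classes:         # rule 4: only redundant classes -> build no index entry
        return []
    return [word]       # rule 3: in the dictionary but class-less -> index the word

# hypothetical dictionary and word-class data
dictionary = {"handle", "lane", "do", "navigator", "hmm"}
word_classes = {"handle": ["HANDLE"], "lane": ["HANDLE"], "hmm": ["FILLER"]}
redundant = {"FILLER"}
```

For instance, "handle" is indexed under its word class, "navigator" (class-less) under itself, and "hmm" (redundant class only) not at all.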
Step 2-2: build the index; an index structure is created for every word/word class.
Step 2-2-1: build the index file. In essence, an index is a means of finding the corresponding documents from index terms. Indexes can be implemented with different techniques: inverted files, signature files, bitmaps, and so on. In typical applications the inverted index outperforms the other two in both index space and query-processing speed, and the inverted file is currently the most widely used index technique. An important step in index construction is the generation of index terms. In English text, words are directly separated by spaces, so extracting them is comparatively simple; in Chinese text there are no delimiters between words, so Chinese word segmentation must be performed with natural language processing (NLP) techniques. The quality of segmentation directly affects the retrieval quality of a Chinese full-text search system. The present invention uses the segmentation tool ICTClass of the Institute of Computing Technology, Chinese Academy of Sciences, for Chinese word segmentation. After segmentation, the index terms must be determined and the positions at which they occur in each document computed; in the present invention the words produced by segmentation serve directly as index terms for word-level or word-class-level indexing. The biggest challenge in index construction is the process of building the inverted file. The procedure is: each document is processed in turn, recording the occurrence position of every word it contains together with the word class the word belongs to; thus each word occurring in each document yields a triple <DocID (document ID), TermID (word ID) | WordClassID (word-class ID), Positions (a list of positions)>, where Positions records the positions at which index term TermID occurs in DocID. Because documents are processed in order, the set of triples is initially sorted by DocID; the inversion step then gathers together all DocIDs with identical TermID or WordClassID. There are many ways to build the index: based on linked lists, on sorting, on dictionary separation, or on text separation. The index structure of the present invention is shown in Figure 5: a "word/word class" indexes multiple "document IDs", each "document ID" indexes multiple "start positions", and each "start position" indexes the corresponding "document index objects". Specifically:
<TermID | WordClassID (word ID | word-class ID), <DocID (document ID), <start position and line number of the word in the document, <index object>>>>
This structure is a three-level nested object, described below from the inside out:
The innermost index object:
int StartIndex;   // the start position at which the word or word class occurs
short int Length; // the length of the word or word class
Although the object currently holds only these attributes, it is designed as an object to ease later expansion, so that more information can be stored without changing the surrounding structure.
The third layer: NkiInt2Ptr type <start position and line number of the word in the document, <index object>>
The second layer: NkiString2Ptr type <document ID, third-layer structure>
The first layer: NkiString2Ptr type <word class/word, second-layer structure>
The meaning of each field in the index:
Word class/word: obtained by segmenting the document content.
Document ID: the unique identifier of a document; because one word class/word may appear in several documents, the document name is used here.
Start position of the word in the document: since the same word may occur several times within one document, this information must be recorded; it is also needed in the later scoring stage to compute the distance between words.
Line number of the word in the document: this is added because of the document format, since words occurring on different lines have different importance. By the format of a knowledge-point document, the first line is the business the knowledge point belongs to, the second line is its summary, and the third line is its detailed content; clearly the business line matters most, then the summary, then the detailed content. With this index structure, an index of all knowledge-point documents to be retrieved can be built.
Whether the index is built well directly determines future retrieval quality, and the present invention takes this fully into account. First, the segmentation results are screened: redundant word classes that contribute nothing to retrieval, or even reduce its accuracy, are not indexed. Second, not every group of the multiple segmentation results is indexed (otherwise the index would grow very large, and some segmented fragments are so short that they nearly lose the original meaning; for example, splitting "set meal" into "set" and "meal" changes the meaning entirely); instead an appropriate selection is made according to the retrieval granularity. The three-level index structure itself is a highlight of the present invention: the innermost document-index object is designed as an object precisely so that the system can be extended later by adding attribute fields inside the object, without changing the overall index structure, which improves the compatibility and extensibility of the system. The field "start position and line number of the word in the document" packs two positional facts from different angles into one value, start position * 1000 + line number, which conveniently keeps the two pieces of information together and avoids a traversal across separate index levels.
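A minimal Python sketch of this three-level structure and of the start-position*1000+line-number packing (the class and function names are assumptions; the patent's own types are the C-style NkiString2Ptr/NkiInt2Ptr):

```python
class IndexObject:
    """Innermost index object; kept as a class so that later fields can be
    added without changing the surrounding index structure."""
    def __init__(self, start, length):
        self.start = start      # start position of the word/word class
        self.length = length    # length of the word/word class

def pack_position(start, line):
    """Pack the start position and the line number into one key."""
    return start * 1000 + line

def add_occurrence(index, key, doc_id, start, line, length):
    # layer 1: word class/word -> layer 2: document ID -> layer 3: packed position
    doc_map = index.setdefault(key, {})
    pos_map = doc_map.setdefault(doc_id, {})
    pos_map[pack_position(start, line)] = IndexObject(start, length)

idx = {}
add_occurrence(idx, "handle", "doc7", 12, 2, 2)   # word class seen in line 2 (summary)
```

Both facts remain recoverable from the packed key (integer division and remainder by 1000), so no extra index level is needed for the line number.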
Step 2-2-2: compress the index file. For each run file ordered by TermID (or WordClassID), a large number of <TermID, DocID, Freq, pos1, pos2, pos3, ..., posFreq> structures must be stored; entries with the same TermID are ordered by ascending DocID, and within each entry the position information pos is likewise in ascending order. Compression proceeds in two steps. First, each ascending integer sequence is transformed into its difference sequence (the gaps between originally adjacent integers). No information is lost, because processing always starts from the first element of the sequence, so iterated addition recovers the original sequence; by itself this step only turns large integers into small ones and does not reduce storage. Second, the small integers are encoded with a suitable coding method to achieve the actual compression. All run files are then merged to obtain the final inverted-file index. Each entry in the final inverted file uses the same coding method as when the run files were generated, only without the TermID. Several coding methods, described below, can be adopted.
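The first compression step (ascending sequence to difference sequence) and its lossless inverse can be sketched as follows; the function names are assumptions:

```python
def to_gaps(ascending):
    """Replace each integer after the first by its gap from the previous one."""
    return [ascending[0]] + [b - a for a, b in zip(ascending, ascending[1:])]

def from_gaps(gaps):
    """Recover the original sequence by iterative addition from the first element."""
    out, running = [], 0
    for g in gaps:
        running += g
        out.append(running)
    return out
```

The gaps are small positive integers, which is exactly what the variable-length codes described next compress well.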
Unary Code: an integer x >= 1 is encoded as (x-1) 1-bits followed by a single 0, occupying x binary digits. For example, 3 is encoded as 110. It assumes that the next symbol to occur is x with probability Pr[x] = 2^(-x).
Gamma Code: an integer x >= 1 is encoded as the unary code of floor(log2 x) + 1, followed by the floor(log2 x)-bit binary representation of x - 2^floor(log2 x). The unary part records how many bits the remainder occupies, so in total x is encoded in 2*floor(log2 x) + 1 bits. For example, 3 is encoded as 101. It assumes that the next symbol to occur is x with probability Pr[x] ≈ 1 / (2 x^2).
Delta Code: an integer x >= 1 is encoded as the Gamma code of floor(log2 x) + 1, followed by the floor(log2 x)-bit binary representation of x - 2^floor(log2 x). As before, the Gamma part records how many bits the remainder occupies; since Gamma-coding floor(log2 x) + 1 requires 2*floor(log2(floor(log2 x) + 1)) + 1 bits, in total x is encoded in floor(log2 x) + 2*floor(log2(floor(log2 x) + 1)) + 1 bits. For example, 3 is encoded as 1001. Delta Code assumes that the next symbol to occur is x with probability Pr[x] ≈ 1 / (2 x (log2 x)^2).
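The three codes above can be sketched directly from their definitions (bit strings are returned as Python strings for clarity; function names are assumptions):

```python
def unary(x):
    """x >= 1 -> (x-1) ones followed by a zero."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """x >= 1 -> unary(floor(log2 x) + 1), then the floor(log2 x) low-order bits of x."""
    n = x.bit_length() - 1                     # n = floor(log2 x)
    body = format(x - (1 << n), "b").zfill(n) if n else ""
    return unary(n + 1) + body

def delta(x):
    """x >= 1 -> gamma(floor(log2 x) + 1), then the floor(log2 x) low-order bits of x."""
    n = x.bit_length() - 1
    body = format(x - (1 << n), "b").zfill(n) if n else ""
    return gamma(n + 1) + body
```

These reproduce the worked examples in the text: 3 encodes to 110 (unary), 101 (gamma), and 1001 (delta).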
Suppose the <term, doc> pairs occurring in a document set total f. Dividing f by the number n of distinct index terms, and again by the total number of documents N, gives p = f / (N*n): the probability that a randomly chosen document contains a randomly chosen index term. Each occurrence of an index term in a document requires recording one DocID gap in the inverted-file index. Suppose the f <term, doc> pairs appearing in the inverted entries are drawn at random from all N*n possible <term, doc> pairs of the document set; this process can be regarded as a Bernoulli process. Under this assumption, the probability that a DocID gap equals x is the probability of seeing x-1 consecutive documents without the particular index term, followed by one that contains it:
Pr[x] = (1 - p)^(x-1) * p,
which shows that x follows a geometric distribution. The implicit condition here is that the occurrences of <term, doc> pairs are independent and identically distributed (Bernoulli distributed).
Since the DocID gaps follow the geometric distribution above, they can be encoded with a Golomb Code. For a parameter b, an integer x >= 1 is encoded in two parts: the first part is the unary code of q + 1, where q = floor((x - 1) / b), occupying q + 1 bits; the second part is the binary representation of the remainder r = x - q*b - 1, occupying floor(log2 b) or ceil(log2 b) bits. It can be proved that, for the given geometric distribution Pr[x] = (1 - p)^(x-1) * p, the Golomb Code yields an optimal prefix code when
b = ceil( log2(2 - p) / (-log2(1 - p)) ).
Building the memory index described in step 4: complete the construction of the incremental memory index, the full memory index, and the in-memory segmentation dictionary. The main process comprises:
Step 4-1, builds internal memory dictionary for word segmentation: dictionary for word segmentation data are read in to internal memory, build internal memory dictionary for word segmentation data structure;
Step 4-2, builds the full memory index:
for every word W_i in Token_i:
    if W_i is not a single character:
        if W_i has a word class C_i:
            use C_i to index the document set D = {d_1, d_2, ...};
        else:
            use W_i for indexing;
    else:
        use W_i for indexing;
Step 4-3, builds the incremental memory index;
Step 4-3-1, segments the incremental documents;
Step 4-3-2, incrementally updates the existing in-memory segmentation-dictionary structure;
Step 4-3-3, incrementally updates the existing memory index.
The full-text retrieval described in step 5 proceeds as follows:
Step 5-1, standardize the user question: the question submitted by the user is accepted and standardized (also called "standardization processing"): redundant words are removed, useless punctuation marks that do not affect the semantics are removed, known errors are corrected, and aliases are standardized;
Step 5-2, segmentation: the standardized question is segmented;
Step 5-2-1, segment the user query;
Step 5-2-2, the operator enters the key phrase, denoted query;
Step 5-2-3, segment query according to the one-level and secondary segmentation dictionaries, obtaining multiple segmentation results, called the pre-correction segmentation Seg1;
Step 5-2-4, apply error correction to the pre-correction segmentation, obtaining multiple segmentation results, called the post-correction segmentation Seg2; let Seg = Seg1 ∪ Seg2 = {{W_1, W_2, ...}, ...}, with the segmentation results sorted by ascending number of words. Error correction procedure:
(1) first correct with the error-correction dictionary;
(2) then correct with statistical information.
For example, a pre-correction segmentation "morning / navigator / xx / xxx / xxx /" (5 words)
becomes, after correction, "business / navigator / xx / xxx / xxx /" (5 words).
Precondition for error correction:
(1) for each word in query, record the historical search frequency T of the uncorrected word and the historical search frequency T' of the corrected keyword;
a) if T' >> T, retrieve with the corrected word and show a prompt;
b) otherwise, retrieve with the uncorrected word.
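Rule (1) above (retrieve with the corrected word only when T' >> T) can be sketched as follows; the numeric threshold interpreting ">>" and all names are assumptions:

```python
def pick_term(raw, corrected, history_freq, ratio=10.0):
    """Return (the term to retrieve with, whether to show a correction prompt)."""
    t = history_freq.get(raw, 0)              # T : frequency of the uncorrected word
    t_corr = history_freq.get(corrected, 0)   # T': frequency of the corrected keyword
    if t_corr > ratio * max(t, 1):            # T' >> T -> retrieve after correction
        return corrected, True
    return raw, False                         # otherwise keep the original word

# hypothetical search-history frequencies
history = {"business navigator": 500, "morning navigator": 3}
```

With these frequencies, the mistyped "morning navigator" is replaced by "business navigator" and the user is prompted; the reverse substitution is refused.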
For example:
Customer question Query = "morning navigator cannot get online" (a mistyping of "business navigator cannot get online")
Pre-correction segmentation Seg1 = {
"morning navigator / cannot-get-online /" 3 words
"morning navigator / get on / not / net /" 5 words
}
Post-correction segmentation Seg2 = {
"business navigator / cannot-get-online /" 3 words
"business navigator / get on / not / net /" 5 words
}
Segmentation result Seg = {
"morning navigator / cannot-get-online /" 3 words
"morning navigator / get on / not / net /" 5 words
"business navigator / cannot-get-online /" 3 words
"business navigator / get on / not / net /" 5 words
}
Step 5-3, semantic understanding: the segmentation results are processed, and the word class or standard word of each word occurring in the question is extracted, obtaining the semantic information of the segmentation;
Step 5-3-1, assign word classes to the segmentation result, giving Token = {{W_1(C_1), W_2(C_2), ...}, ...}. In the present invention the word class represents the meaning of a word; each word can have several meanings and thus belong to several word classes. Token_i denotes the i-th group of segmentation.
Step 5-3-2, query the consulting history:
find in the consulting-history library the historical query' most similar to query; once found, return the retrieval result of query' as the Top-1 document.
Consulting similarity is defined as follows:
Sim(Token_i(query), Token_j(query')) = Sim({W_1(C_1), W_2(C_2), ...}, {W_1'(C_1'), W_2'(C_2'), ...})
= avg( a * sem_sim(C_i, C_i') + (1 - a) * syn_sim(W_i, W_i') ), where a is a weighting coefficient;
(1) where avg() is the mean-value function;
(2) sem_sim(C_i, C_i') = 1 if C_i ∩ C_i' ≠ φ, and 0 otherwise;
(3) syn_sim(W_i, W_i') = (the number of characters common to W_i and W_i') / (the number of distinct characters appearing in W_i and W_i' together);
for example, syn_sim("cannot surf the Net", "how to surf the Net") = 2/6 (in the original Chinese, the two phrases share 2 of 6 distinct characters).
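One reading of syn_sim that reproduces the 2/6 example above is character-set overlap (shared characters over distinct characters of both words); this reading is an assumption, since the original wording is ambiguous:

```python
def syn_sim(w1, w2):
    """Characters common to both words / distinct characters across both words."""
    a, b = set(w1), set(w2)
    return len(a & b) / len(a | b)

# stand-in for the Chinese example: two 4-character phrases sharing 2 characters,
# with 6 distinct characters between them, give 2/6
```

Under this reading the measure is the Jaccard similarity of the two character sets.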
Step 5-4, semantic expansion: the segmentation semantic information is expanded semantically to obtain the expanded semantic information, which is expressed with certain words or word classes. Token_i is expanded with synonyms, and the expanded segmentation is denoted EToken_i.
For example: the synonym of "business navigator" is "vip navigator", and the synonym of "cannot get online" is "cannot surf the Net";
so "business navigator / cannot get online /" can be expanded to "vip navigator / cannot surf the Net /".
Step 5-5, obtain candidate documents: using the words or word classes obtained after semantic expansion (the expanded semantic information these words or word classes represent), search the corresponding full-text documents according to the memory-index information; these become the candidate documents;
Step 5-6, rank the candidate documents: the candidate documents are scored and ranked from several angles; the higher the score, the earlier the rank, and the ranked candidate documents become the final full-text retrieval result.
All documents in D are scored. The scoring considers the following factors:
(1) SegNum: the number of words in the query segmentation;
(2) SegWordWgt: the weight of the word itself (words in the title, business line, or summary weigh more);
(3) DocWordWgt: the weight of the word within the document;
(4) DocHits: the click count of the document;
(5) DocTime: the timestamp of the document;
(6) HitWordWgts: the weights of the query words that occur in the document;
(7) MissedWordWgts: the weights of the query words that do not occur in the document;
(8) WordSpan(W_1, W_2, ..., d): the pairwise distances in the document between the query words;
Step 5-6-1:
Credit(d) = HitWordWgts / (HitWordWgts + MissedWordWgts)
WordWgt(w_i, d) = doc_word_wgt(w_i, d) * PosiWgt(w_i)
doc_word_wgt(w_i, d) = tfidf(w_i, d)
PosiWgt(w_i) = 2.0 (to be tuned), if the word appears in the title or business line;
             = 1.5 (to be tuned), if the word appears in the summary;
             = 1.0, otherwise.
Step 5-6-2:
Credit(d) *= 1 / log2(SegNum + 1)
The Top N documents by Credit(d) (N to be determined) are then ordered by time;
Step 5-6-3:
Credit(d) /= (WordSpan(w_1 ... w_n, d) + 1)
WordSpan(w_1 ... w_n, d) = the sum, over all 1 <= i < j <= n, of a function of the number of words between w_i and w_j in the document.
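Steps 5-6-1 through 5-6-3 combine into one document score roughly as follows (a sketch: the tunable position weights are taken from step 5-6-1, while the word-span value is passed in precomputed, since its exact function is left open in the text):

```python
from math import log2

def posi_wgt(in_title_or_business, in_summary):
    """Position weight of a word, per step 5-6-1 (values marked 'to be tuned')."""
    if in_title_or_business:
        return 2.0
    if in_summary:
        return 1.5
    return 1.0

def credit(hit_wgts, missed_wgts, seg_num, word_span):
    """Combine steps 5-6-1 .. 5-6-3 into a single score for one document."""
    c = sum(hit_wgts) / (sum(hit_wgts) + sum(missed_wgts))  # step 5-6-1
    c *= 1 / log2(seg_num + 1)                              # step 5-6-2
    c /= word_span + 1                                      # step 5-6-3
    return c
```

A document that hits every query word (no missed weight), from a one-word query, with zero span, scores exactly 1.0; missed words, longer queries, and scattered hits all pull the score down.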
Step 5-7, saving the consulting history:
the system's first Top-1 answer gives: Token_1 = W_1(C_1) W_2(C_2) ... W_n(C_n) → d_k
the user selects: Token_2 = W_1'(C_1') W_2'(C_2') ... W_n'(C_n') → d_j
If k ≠ j, and (HistoryTop1(Token_2) = φ) or (HistoryTop1(Token_2) ≠ d_k),
the user is prompted for feedback, implemented as follows:
Step 5-7-1: when j > 2, after the operator has viewed the document and closes it, a feedback dialog pops up to confirm whether the user is satisfied with the query result;
Step 5-7-2: if the operator chooses Yes, HistoryTop1(Token_2) = d_j is saved, in the following format:
<Query, Token_2, doc_type (document type), doc_id (document ID), doc_id_value (document ID value)>.
The present invention can thus provide retrieval services to users. A user's query requirements are described by a query statement. An example query statement is shown in Figure 6, where NeedAssociateSearch="true" means that related-search results are returned; ShowCountPerPage="10" means that 10 search records are shown per page; Granularity="Abstract" means that the retrieval granularity is "Abstract". Four retrieval granularities are provided in the present invention: "Service", "Topic", "Abstract" and "Mix", where Service means the business classification information of the full-text documents; Topic means the first-level knowledge points of a business; Abstract means the finest-grained knowledge points, i.e. the most detailed query; and Mix means that every knowledge granularity is returned. TopN="100" means that only the first 100 retrieval results are returned, while TopN=-1 means that all retrieval results are returned. "business navigator" is the retrieval content entered by the user.
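Based on the attributes named above, the Figure 6 query statement plausibly looks like the following; the element name and exact layout are assumptions, since only the attributes and the retrieval content are described in the text:

```xml
<Query NeedAssociateSearch="true"
       ShowCountPerPage="10"
       Granularity="Abstract"
       TopN="100">business navigator</Query>
```

Setting Granularity to "Service", "Topic", or "Mix" instead would select the other retrieval granularities described above.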

Claims (7)

1. A new-generation domain-knowledge full-text search method, characterized in that it comprises the following steps:
Step 1, build the segmentation dictionary: build the segmentation dictionary and store the dictionary information in a database;
Step 2, build the full index: read, segment, and analyze the existing full-text documents (also called knowledge-point documents), and build the index file;
Step 3, build the incremental index: process newly added documents and update the index file on the hard disk;
Step 4, build the memory index, comprising:
Step 4-1, build the in-memory segmentation dictionary: read the segmentation-dictionary data into memory and build the in-memory segmentation-dictionary data structure;
Step 4-2, build the full memory index: read the index file from the hard disk and fully build the memory index;
Step 4-3, build the incremental memory index: process newly added documents and update the memory index incrementally;
Step 5, full-text retrieval, comprising:
Step 5-1, standardize the user question: the question submitted by the user is accepted and standardized (also called "standardization processing"): redundant words are removed, useless punctuation marks that do not affect the semantics are removed, known errors are corrected, and aliases are standardized;
Step 5-2, segmentation: the standardized question is segmented;
Step 5-3, semantic understanding: the segmentation results are processed, and the word class or standard word of each word occurring in the question is extracted, obtaining the semantic information of the segmentation;
Step 5-4, semantic expansion: the segmentation semantic information is expanded semantically to obtain the expanded semantic information, which is expressed with certain words or word classes;
Step 5-5, obtaining candidate documents: using the words or word classes obtained after semantic expansion (the expanded semantic information these words or word classes represent), the corresponding full-text documents are searched according to the memory-index information and taken as candidate documents;
Step 5-6, ranking the candidate documents: the candidate documents are scored and ranked from several angles; the higher the score, the earlier the rank, and the ranked candidate documents become the final full-text retrieval result;
wherein the segmentation dictionary is built at system initialization; the full index is built by reading all knowledge-point documents and fully building the hard-disk index file (the "index file" for short); the incremental index is built when new full-text documents are added; and these three activities are independent of the full-text retrieval module and run separately.
2. The new-generation domain-knowledge full-text search method according to claim 1, characterized in that the building of the segmentation dictionary described in step 1 mainly realizes the construction of the segmentation dictionary, the dictionary built being a "secondary segmentation dictionary", whose building method is as follows:
Step 1-1, form the one-level segmentation from the "general segmentation vocabulary" plus the "business vocabulary";
wherein the "general segmentation vocabulary" adopts the vocabulary of the Institute of Computing Technology, Chinese Academy of Sciences, as the general segmentation vocabulary, and the "business vocabulary" contains the industry-specific proper nouns and can be built from the business names imported for the industry;
Step 1-2, automatically segment each one-level word to form candidate secondary segmentations;
Step 1-3, manually screen the candidate secondary segmentations;
the secondary segmentation dictionary built in this way has the format: one-level word, followed by the array of its secondary segmentations (separated by |).
3. The new-generation domain-knowledge full-text search method according to claim 1, characterized in that the building of the full index described in step 2 mainly realizes the full build of the data index file, its main process being as follows:
Step 2-1, read each knowledge-point document and segment it: during segmentation, the ordinary segmentation dictionary is combined with the hypernym/hyponym dictionary carrying semantic relations, producing multiple groups of segmentation results sorted by the number of words in each group and by word length; segmentation reads line by line, cuts each line at certain punctuation marks into short fragments, and segments each fragment using the segmentation dictionary and the hypernym/hyponym word-class dictionary (a word class being a collective name for a group of words with the same or similar meaning); to decide whether to index by word class or by word, the following rules apply:
1. if a word has exactly one word class (and that class is not a redundant word class), it is indexed under the word-class name;
2. if a word has more than one word class, an index entry is built for each of its word classes (excluding redundant word classes);
3. if a word is in the dictionary but has no word class, it is indexed under the word itself;
4. if a word is in the dictionary but carries only a redundant word class, no index entry is built for it;
Step 2-2, build the index: an index structure is created for every word/word class.
4. The new-generation domain-knowledge full-text search method according to claim 3, characterized in that said step 2-2 further comprises the following steps:
Step 2-2-1, build the index file: an index is a means of finding the corresponding documents from index terms; in English text, words are directly separated by spaces, while Chinese text is segmented with the segmentation tool ICTClass of the Institute of Computing Technology, Chinese Academy of Sciences, and the words produced by segmentation serve directly as index terms for word-level or word-class-level indexing;
the index is built as an inverted file; the procedure is: each document is processed in turn, recording the occurrence position of every word it contains together with the word class the word belongs to, so that each word occurring in each document yields a triple <DocID (document ID), TermID (word ID) | WordClassID (word-class ID), Positions (a list of positions)>, where Positions records the positions at which index term TermID occurs in DocID;
the index structure comprises:
<TermID | WordClassID (word ID | word-class ID), <DocID (document ID), <start position and line number of the word in the document, <index object>>>>
This structure is a three-level nested object, described below from the inside out:
the innermost index object:
int StartIndex;   // the start position at which the word or word class occurs
short int Length; // the length of the word or word class
the third layer: NkiInt2Ptr type <start position and line number of the word in the document, <index object>>
the second layer: NkiString2Ptr type <document ID, third-layer structure>
the first layer: NkiString2Ptr type <word class/word, second-layer structure>
the meaning of each field in the index:
word class/word: obtained by segmenting the document content;
document ID: the unique identifier of a document; because one word class/word may appear in several documents, the document name is used here;
start position of the word in the document: since the same word may occur several times within one document, this information must be recorded, and it is also needed in the later scoring stage to compute the distance between words;
line number of the word in the document: by the format of a knowledge-point document, the first line is the business the knowledge point belongs to, the second line is its summary, and the third line is its detailed content; with this index structure, the index of all knowledge-point documents to be retrieved can be built;
Step 2-2-2, compress the index file: for each run file ordered by TermID (or WordClassID), a large number of <TermID, DocID, Freq, pos1, pos2, pos3, ..., posFreq> structures must be stored; entries with the same TermID are ordered by ascending DocID, and within each entry the position information pos is likewise in ascending order; during compression, the first step transforms the ascending integer sequences into their difference sequences (the gaps between originally adjacent integers); the second step encodes the resulting small integers with a coding method to achieve compression; all run files are then merged to obtain the final inverted-file index, each entry of which uses the same coding method as when the run files were generated, only without the TermID; the Delta Code coding method is adopted: an integer x >= 1 is encoded as the Gamma code of floor(log2 x) + 1, followed by the floor(log2 x)-bit binary representation of x - 2^floor(log2 x); suppose the <term, doc> pairs occurring in a document set total f; dividing f by the number n of distinct index terms, and again by the total number of documents N, gives p = f / (N*n), the probability that a randomly chosen document contains a randomly chosen index term; each occurrence of an index term in a document requires recording one DocID gap in the inverted-file index; taking the f <term, doc> pairs occurring in the inverted entries as drawn at random from all N*n possible <term, doc> pairs of the document set, this process is regarded as a Bernoulli process; under this assumption, the probability that a DocID gap equals x is the probability of x-1 consecutive documents without the particular index term followed by one that contains it, namely Pr[x] = (1 - p)^(x-1) * p, showing that x follows a geometric distribution; the implicit condition here is that the occurrences of <term, doc> pairs are independent and identically distributed (Bernoulli distributed).
5. The new-generation domain-knowledge full-text search method according to claim 1, characterized in that the building of the memory index described in step 4 completes the construction of the incremental memory index, the full memory index, and the in-memory segmentation dictionary, the main process comprising:
Step 4-1, builds internal memory dictionary for word segmentation: dictionary for word segmentation data are read in to internal memory, build internal memory dictionary for word segmentation data structure;
Step 4-2, builds the full memory index:
for every word W_i in Token_i:
    if W_i is not a single character:
        if W_i has a word class C_i:
            use C_i to index the document set D = {d_1, d_2, ...};
        else:
            use W_i for indexing;
    else:
        use W_i for indexing;
Step 4-3, builds the incremental memory index;
Step 4-3-1, segments the incremental documents;
Step 4-3-2, incrementally updates the existing in-memory segmentation-dictionary structure;
Step 4-3-3, incrementally updates the existing memory index.
6. domain knowledge text searching method of new generation according to claim 1, is characterized in that: the full-text search described in step 5: full-text search process is as follows:
Step 5-1, normalize the user question: the question submitted by the consulting user is normalized (also called standardization processing): redundant words are removed, punctuation marks that carry no semantic information are removed, known errors are corrected, and aliases are mapped to standard names;
Step 5-2, segmentation: segment the normalized question;
Step 5-2-1, segment the user query;
Step 5-2-2, the operator enters a key phrase, denoted query;
Step 5-2-3, segment query with the first-level and second-level segmenters to obtain several segmentation results, called the pre-correction segmentation Seg1;
Step 5-2-4, apply error correction to the pre-correction segmentation to obtain further segmentation results, called the post-correction segmentation Seg2; let Seg = Seg1 ∪ Seg2 = {{W_1, W_2, ...}, ...}; the segmentation results are sorted in ascending order of segment count. The correction procedure is:
(1) first correct with the error-correction dictionary;
(2) then correct with statistical information;
Precondition for applying a correction:
(1) for each word in query, record the historical search frequency T of the uncorrected word and the historical search frequency T' of the corrected keyword;
a) if T' >> T, search with the corrected word and show a hint to the user;
b) otherwise, search with the uncorrected word;
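The frequency precondition of step 5-2-4 can be sketched as below (illustrative Python, not part of the claim; the history counts and the ratio used to interpret T' >> T are invented assumptions):

```python
def choose_query_form(word, corrected, history, ratio=10):
    """Decide whether to search with the corrected word (step 5-2-4 precondition).

    history: {word: historical search frequency}
    ratio:   how much larger T' must be than T to count as "T' >> T" (assumed)
    """
    t = history.get(word, 0)                  # frequency T of the uncorrected word
    t_corrected = history.get(corrected, 0)   # frequency T' after correction
    if t_corrected > ratio * max(t, 1):
        return corrected, True                # search corrected form, show a hint
    return word, False                        # search the original form, no hint

# Hypothetical history counts: the misspelling is rare, the corrected form common.
history = {"宽代": 2, "宽带": 500}
term, hint = choose_query_form("宽代", "宽带", history)
print(term, hint)
```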
Step 5-3, semantic understanding: process the segmentation results and extract the word class or standard word of each segment occurring in the question, obtaining the segment semantic information;
Step 5-3-1, tag the segmentation results with word classes; the result is Token = {{W_1(C_1), W_2(C_2), ...}, ...}; the word class represents the meaning of a segment; each segment can have several meanings and therefore belong to several word classes; Token_i denotes the i-th segmentation result;
Step 5-3-2, consult the query history:
find in the consultation history library the historical consultation query' most similar to query; if one is found, return the retrieval result of query' as the Top1 document;
The consultation similarity is defined as follows:
Sim(Token_i(query), Token_j(query')) = Sim({W_1(C_1), W_2(C_2), ...}, {W_1'(C_1'), W_2'(C_2'), ...})
= avg(a · sem_sim(C_i, C_i') + (1 - a) · syn_sim(W_i, W_i')), where:
(1) avg() is the mean-value function and a is a weighting coefficient;
(2) sem_sim(C_i, C_i') = 1 if C_i ∩ C_i' ≠ φ, and 0 otherwise;
(3) syn_sim(W_i, W_i') = the number of characters that W_i and W_i' have in common, divided by the number of characters in which W_i and W_i' differ;
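A runnable sketch of the consultation-similarity measure defined above (illustrative Python, not part of the claim; the weight a, the handling in syn_sim of segments with no differing characters, and the sample tokens are assumptions; word classes are modeled as sets so that sem_sim tests for a shared class):

```python
def sem_sim(ci, cj):
    # 1 if the two segments share at least one word class, else 0
    return 1 if set(ci) & set(cj) else 0

def syn_sim(wi, wj):
    # characters the segments share, divided by the characters they differ in
    shared = len(set(wi) & set(wj))
    different = len(set(wi) ^ set(wj))
    return shared / different if different else float(shared > 0)

def consult_sim(tokens1, tokens2, a=0.5):
    """avg(a*sem_sim(Ci, Ci') + (1-a)*syn_sim(Wi, Wi')) over aligned segments."""
    scores = [a * sem_sim(c1, c2) + (1 - a) * syn_sim(w1, w2)
              for (w1, c1), (w2, c2) in zip(tokens1, tokens2)]
    return sum(scores) / len(scores)

# Hypothetical tokens: (segment, set of word classes)
q1 = [("宽带", {"C_net"}), ("费用", {"C_fee"})]
q2 = [("宽带", {"C_net"}), ("资费", {"C_fee"})]
print(consult_sim(q1, q2))
```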
Step 5-4, semantic expansion: expand the segment semantic information to obtain the expanded semantic information, represented by a set of words or word classes; Token_i is expanded with synonyms, and the expanded segmentation is denoted EToken_i;
Step 5-5, obtain candidate documents: using the words or word classes obtained by semantic expansion (the semantic information these words or word classes represent), search the in-memory index for the corresponding full-text documents, which become the candidate documents;
Step 5-6, rank the candidate documents: score the candidates from multiple angles; the higher the score, the higher the rank; the ranked candidates form the final full-text search result;
all documents in D are scored, and the scoring considers the following factors:
(1) SegNum: the number of segments in query;
(2) SegWordWgt: the segment's own weight (words in the title, service, or abstract receive a higher weight);
(3) DocWordWgt: the weight of the segment within the document;
(4) DocHits: the click count of the indexed document;
(5) DocTime: the indexing time of the document;
(6) HitWordWgts: the weights of the query words that appear in the document;
(7) MissedWordWgts: the weights of the query words that do not appear in the document;
(8) WordSpan(W_1, W_2, ..., d): the pairwise distances in the document between the segments of query;
Step 5-6-1:
Credit(d) = HitWordWgts / (HitWordWgts + MissedWordWgts)
WordWgt(w_i, d) = doc_word_wgt(w_i, d) · PosiWgt(w_i)
doc_word_wgt(w_i, d) = tfidf(w_i, d)
PosiWgt(w_i) = 2.0 (to be tuned), if the word appears in the title or the service;
             = 1.5 (to be tuned), if the word appears in the abstract;
             = 1.0, otherwise;
Step 5-6-2:
Credit(d) *= 1 / log2(SegNum + 1)
select the Top_N documents by Credit(d) (N to be determined), ordered by time;
Step 5-6-3:
Credit(d) /= (WordSpan(w_1, ..., w_n, d) + 1)
WordSpan(w_1, ..., w_n, d) = Σ_{1<=i<j<=n} f(number of characters between w_i and w_j in d), where f is a function of the pairwise interval;
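The three adjustments of steps 5-6-1 to 5-6-3 can be combined into one scoring sketch (illustrative Python, not part of the claim; the tf-idf values, document fields, and WordSpan value are invented stand-ins for the real statistics):

```python
import math

def posi_wgt(word, doc):
    # tunable positional weights from step 5-6-1
    if word in doc.get("title", "") or word in doc.get("service", ""):
        return 2.0
    if word in doc.get("abstract", ""):
        return 1.5
    return 1.0

def credit(doc, query_words, tfidf, word_span, seg_num):
    """Score one candidate document d (steps 5-6-1 to 5-6-3)."""
    hit = sum(tfidf.get(w, 0.0) * posi_wgt(w, doc)
              for w in query_words if w in doc["text"])
    missed = sum(tfidf.get(w, 0.0)
                 for w in query_words if w not in doc["text"])
    score = hit / (hit + missed) if hit + missed else 0.0   # step 5-6-1
    score *= 1 / math.log2(seg_num + 1)                     # step 5-6-2
    score /= word_span + 1                                  # step 5-6-3
    return score

# Hypothetical document and statistics
doc = {"title": "宽带费用", "service": "", "abstract": "", "text": "宽带费用说明"}
s = credit(doc, ["宽带", "费用"], {"宽带": 1.2, "费用": 0.8},
           word_span=0, seg_num=2)
print(round(s, 4))
```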
Step 5-7, save the consultation history:
the system's first Top1 result: Token_1 = W_1(C_1) W_2(C_2) ... W_n(C_n) → d_k;
the user's selection: Token_2 = W_1'(C_1') W_2'(C_2') ... W_n'(C_n') → d_j;
if k ≠ j, and (HistoryTop1(Token_2) = φ) or (HistoryTop1(Token_2) ≠ d_k),
prompt the user for feedback; the feedback is implemented as follows:
Step 5-7-1, when j > 2, after the operator has viewed the document and closes it, a feedback dialog box pops up to confirm whether the user is satisfied with the query result;
Step 5-7-2, if the operator chooses yes, save HistoryTop1(Token_2) = d_j in the following form:
<Query, Token_2, doc_type (document type), doc_id (document id), doc_id_value (document id value)>.
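The record saved in step 5-7-2 can be sketched as a simple tuple store (illustrative Python, not part of the claim; the field values are invented):

```python
# In-memory consultation history, keyed by the user-selected token sequence.
history_top1 = {}

def save_feedback(query, token2, doc_type, doc_id, doc_id_value):
    """Persist <Query, Token2, doc_type, doc_id, doc_id_value> per step 5-7-2."""
    record = (query, token2, doc_type, doc_id, doc_id_value)
    history_top1[token2] = record
    return record

# Hypothetical feedback: the user picked document A17 at "Abstract" granularity.
rec = save_feedback("宽带资费", ("宽带", "资费"), "Abstract", "A17", "D1024")
print(history_top1[("宽带", "资费")][2])
```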
7. The new-generation industry knowledge full-text search method according to claim 6, characterized in that the retrieval granularity of the full-text search is one of "Service", "Topic", "Abstract", and "Mix", where Service represents the service classification information of the full-text document; Topic represents the first-level knowledge point of the service; Abstract represents the finest-grained knowledge point, i.e. the most detailed query; and Mix means that results at every knowledge granularity are returned.
CN201210461748.2A 2012-11-16 2012-11-16 New-generation industry knowledge full-text search method Pending CN103823799A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210461748.2A CN103823799A (en) 2012-11-16 2012-11-16 New-generation industry knowledge full-text search method


Publications (1)

Publication Number Publication Date
CN103823799A true CN103823799A (en) 2014-05-28

Family

ID=50758872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210461748.2A Pending CN103823799A (en) 2012-11-16 2012-11-16 New-generation industry knowledge full-text search method

Country Status (1)

Country Link
CN (1) CN103823799A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6169999B1 (en) * 1997-05-30 2001-01-02 Matsushita Electric Industrial Co., Ltd. Dictionary and index creating system and document retrieval system
JP2006178599A (en) * 2004-12-21 2006-07-06 Fuji Xerox Co Ltd Document retrieval device and method
CN101620607A (en) * 2008-07-01 2010-01-06 全国组织机构代码管理中心 Full-text retrieval method and full-text retrieval system

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361009A (en) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Real-time indexing method based on reverse index
CN104361009B (en) * 2014-10-11 2017-10-31 北京中搜网络技术股份有限公司 A kind of real time indexing method based on inverted index
WO2016112832A1 (en) * 2015-01-12 2016-07-21 杏树林信息技术(北京)有限公司 Medical information search engine system and search method
CN105988996A (en) * 2015-01-27 2016-10-05 腾讯科技(深圳)有限公司 Index file generation method and device
CN105989088A (en) * 2015-02-12 2016-10-05 马正方 Learning device under digital environment
CN105989088B (en) * 2015-02-12 2019-05-14 马正方 Learning device under digitized environment
CN104699831A (en) * 2015-03-31 2015-06-10 佛山市金蓝领教育科技有限公司 Atomic word knowledge management system
CN105528411B (en) * 2015-12-03 2019-08-20 中国人民解放军海军工程大学 Apparel interactive electronic technical manual full-text search device and method
CN105528411A (en) * 2015-12-03 2016-04-27 中国人民解放军海军工程大学 Full-text retrieval device and method for interactive electronic technical manual of shipping equipment
CN105677636A (en) * 2015-12-30 2016-06-15 上海智臻智能网络科技股份有限公司 Information processing method and device for intelligent question-answering system
CN105808678A (en) * 2016-03-03 2016-07-27 黄川东 Construction method of standard retrieval and application system
CN105955982A (en) * 2016-04-18 2016-09-21 上海泥娃通信科技有限公司 Method and system for information sequence feature encoding and retrieval
CN108205578A (en) * 2016-12-20 2018-06-26 北大方正集团有限公司 Index generation method and device
CN106874402A (en) * 2017-01-16 2017-06-20 腾讯科技(深圳)有限公司 Searching method and device
CN107506473A (en) * 2017-09-05 2017-12-22 郑州升达经贸管理学院 A kind of big data search method based on cloud computing
CN108595529A (en) * 2018-03-30 2018-09-28 苏州风中智能科技有限公司 A kind of device of retrieval software function
CN109583744A (en) * 2018-11-26 2019-04-05 安徽继远软件有限公司 A kind of cross-system account matching system and method based on Chinese word segmentation
CN109885641A (en) * 2019-01-21 2019-06-14 瀚高基础软件股份有限公司 A kind of method and system of database Chinese Full Text Retrieval
CN110297829A (en) * 2019-06-26 2019-10-01 重庆紫光华山智安科技有限公司 A kind of text searching method and system towards specific industry structuring business datum
CN110851559A (en) * 2019-10-14 2020-02-28 中科曙光南京研究院有限公司 Automatic data element identification method and identification system
CN110908998A (en) * 2019-11-13 2020-03-24 广联达科技股份有限公司 Data storage and search method, system and computer readable storage medium
CN111767378A (en) * 2020-06-24 2020-10-13 北京墨丘科技有限公司 Method and device for intelligently recommending scientific and technical literature
CN111881328A (en) * 2020-07-30 2020-11-03 百度在线网络技术(北京)有限公司 Information pushing method and device, electronic equipment and storage medium
CN114298055A (en) * 2021-12-24 2022-04-08 浙江大学 Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN114298055B (en) * 2021-12-24 2022-08-09 浙江大学 Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN115455147A (en) * 2022-09-09 2022-12-09 浪潮卓数大数据产业发展有限公司 Full-text retrieval method and system
CN116431930A (en) * 2023-06-13 2023-07-14 天津联创科技发展有限公司 Technological achievement conversion data query method, system, terminal and storage medium
CN117953875A (en) * 2024-03-27 2024-04-30 成都启英泰伦科技有限公司 Offline voice command word storage method based on semantic understanding

Similar Documents

Publication Publication Date Title
CN103823799A (en) New-generation industry knowledge full-text search method
CN109492077B (en) Knowledge graph-based petrochemical field question-answering method and system
Oliveira et al. Assessing shallow sentence scoring techniques and combinations for single and multi-document summarization
Cohen et al. Learning to match and cluster large high-dimensional data sets for data integration
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN106446148A (en) Cluster-based text duplicate checking method
US20060129843A1 (en) Method and apparatus for electronically extracting application specific multidimensional information from documents selected from a set of documents electronically extracted from a library of electronically searchable documents
Sarawagi et al. Open-domain quantity queries on web tables: annotation, response, and consensus models
CN111143479A (en) Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm
US20030115188A1 (en) Method and apparatus for electronically extracting application specific multidimensional information from a library of searchable documents and for providing the application specific information to a user application
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN110674252A (en) High-precision semantic search system for judicial domain
CN102193939A (en) Realization method of information navigation, information navigation server and information processing system
CN101398814A (en) Method and system for simultaneously abstracting document summarization and key words
CN103646032A (en) Database query method based on body and restricted natural language processing
CN111177591A (en) Knowledge graph-based Web data optimization method facing visualization demand
CN107145545A (en) Top k zone users text data recommends method in a kind of location-based social networks
CN106484797A (en) Accident summary abstracting method based on sparse study
US9626401B1 (en) Systems and methods for high-speed searching and filtering of large datasets
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
CN115905489B (en) Method for providing bidding information search service
CN101088082A (en) Full text query and search systems and methods of use
CN102467544B (en) Information smart searching method and system based on space fuzzy coding
CN105404677A (en) Tree structure based retrieval method
Yadav et al. Wavelet tree based dual indexing technique for geographical search.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140528

WD01 Invention patent application deemed withdrawn after publication