CN103823799A - New-generation industry knowledge full-text search method - Google Patents


Info

Publication number
CN103823799A
CN103823799A (application CN201210461748.2A)
Authority
CN
China
Prior art keywords
word
index
participle
document
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210461748.2A
Other languages
Chinese (zh)
Inventor
王卫民
符建辉
王石
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Original Assignee
KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by KNOWOLOGY INTELLIGENT TECHNOLOGY Co Ltd
Priority to CN201210461748.2A
Publication of CN103823799A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 — Details of database functions independent of the retrieved data types
    • G06F16/93 — Document management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A new-generation industry knowledge full-text search method includes: (1) building a word-segmentation dictionary and storing the dictionary information in a database; (2) building a full index: reading, segmenting, and analyzing the existing full-text documents (called "knowledge-point documents") to create an index file; (3) building an incremental index: processing newly added documents and updating the index file on disk; (4) building an in-memory index, including an in-memory segmentation dictionary: reading the segmentation dictionary data into memory to build the in-memory dictionary data structure; (5) full-text search: normalizing the user's question, segmenting it into words, performing semantic understanding and semantic expansion, obtaining candidate documents, and ranking the candidates. The segmentation dictionary is built at system initialization; the full index is built by reading all knowledge-point documents into a disk index file ("index file" for short); the incremental index is built whenever full-text documents are newly added. These three activities are independent of the full-text retrieval module and run on their own.

Description

New-generation industry knowledge full-text search method
Technical field
The present invention relates to the field of full-text search, especially full-text search over industry knowledge, and proposes a new industry-knowledge full-text retrieval system and method.
Background art
Full-text search means that an indexing program scans every word in a document and builds an index entry for each word, recording how often and where the word occurs. When a user issues a query, the search program looks the query up in the pre-built index and returns the matching results to the user, much as one looks up a word through the index table of a dictionary. Full-text search is a retrieval method that matches the query terms against all of the text stored in the files, and a full-text retrieval system is a software system, built on full-text search theory, that provides retrieval over whole documents. It can retrieve arbitrary content stored in the database from an entire book or article, extract the relevant chapters, paragraphs, sentences, or words on demand, and support various statistics and analyses. For example, it can answer a question such as "How many times does 'Lin Daiyu' appear in Dream of the Red Chamber?"
Traditional full-text retrieval systems match only on keywords; they lack multi-faceted semantic recognition (English, pinyin, typos, synonyms, near-synonyms) and the ability to correct errors. As users demand ever more intelligence, traditional full-text retrieval systems look increasingly dated.
To solve these problems, a new full-text retrieval system is urgently needed, one that makes retrieval more intelligent. Concretely: it should let pinyin, Chinese characters, and English express one another, so that when the user types "shengka" the system understands that the intended query is probably "sound card"; it should correct typos; and it should perform semantic understanding and semantic expansion. For example, the inputs "Business Navigation", the typo variant "morning navigation", "shangwulinghang", and "shwlh" should all achieve the search effect of "Business Navigation"; and colloquial queries of similar intent, such as "how do I apply for broadband", "how is broadband installed", "get me a broadband line", or "I want to get broadband", should all correctly return answers about "broadband application".
Summary of the invention
To address the problems above, the present invention starts from a traditional retrieval system based on keyword and word matching and adds multi-faceted semantic recognition and error correction (English, pinyin, typos, synonyms, near-synonyms), together with semantic-expansion capabilities such as hypernym/hyponym and attribute recognition. The present invention is a full-text retrieval system with semantic understanding and semantic expansion.
Technical scheme: to overcome the problems above, the invention provides a new-generation industry knowledge full-text search method, characterized by the following steps:
Step 1, build the segmentation dictionary: construct the word-segmentation dictionary and store the dictionary information in a database;
Step 2, build the full index: read, segment, and analyze the existing full-text documents (also called "knowledge-point documents") and create an index file;
Step 3, build the incremental index: process newly added documents and update the index file on disk;
Step 4, build the in-memory index, comprising:
Step 4-1, build the in-memory segmentation dictionary: read the segmentation dictionary data into memory and build the in-memory dictionary data structure;
Step 4-2, build the full in-memory index: read the index file from disk and build the complete in-memory index;
Step 4-3, build the incremental in-memory index: process newly added documents and update the in-memory index incrementally;
Step 5, full-text search, comprising:
Step 5-1, normalize the user's question: take the question the user asks and normalize it (also called "normalization processing"): remove redundant words, remove useless punctuation that carries no semantic information, correct typos, and normalize aliases;
Step 5-2, word segmentation: segment the normalized question into words;
Step 5-3, semantic understanding: process the segmentation result and extract, for each segmented word in the question, the word class it belongs to or its standard word, yielding the word's semantic information;
Step 5-4, semantic expansion: expand the words' semantic information semantically; the expanded semantic information is represented by a set of words or word classes;
Step 5-5, obtain candidate documents: using the words or word classes obtained by semantic expansion (which represent the expanded semantics), search the in-memory index for the corresponding full-text documents as candidates;
Step 5-6, rank candidate documents: score the candidates from multiple angles; the higher the score, the higher the rank, and the sorted candidates become the final full-text search result;
Wherein the segmentation dictionary is built at system initialization; the full index is built by reading all knowledge-point documents into a disk index file ("index file" for short); the incremental index is built whenever full-text documents are newly added. These three activities are independent of the full-text retrieval module and run on their own.
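The five retrieval steps above can be sketched end to end in a few lines. This is a minimal illustration only: the synonym table, index, and scoring are invented stand-ins, and whitespace splitting stands in for real Chinese word segmentation.

```python
# Minimal end-to-end sketch of retrieval steps 5-1 through 5-6. The tiny
# synonym table, index, and scoring below are illustrative stand-ins, not
# the patent's actual data structures; whitespace splitting stands in for
# real Chinese word segmentation.

SYNONYMS = {"handle": "apply", "install": "apply"}        # toy word classes
INDEX = {"apply": {"doc1"}, "broadband": {"doc1", "doc2"}}

def normalize(q):                          # step 5-1: drop punctuation noise
    return "".join(ch for ch in q.lower() if ch.isalnum() or ch == " ")

def segment(q):                            # step 5-2: stand-in segmentation
    return q.split()

def understand(tokens):                    # steps 5-3/5-4: map to word classes
    return [SYNONYMS.get(t, t) for t in tokens]

def candidates(terms):                     # step 5-5: union of postings
    docs = set()
    for t in terms:
        docs |= INDEX.get(t, set())
    return docs

def rank(docs, terms):                     # step 5-6: score = matched terms
    return sorted(docs, key=lambda d: (-sum(d in INDEX.get(t, set()) for t in terms), d))

def search(query):
    terms = understand(segment(normalize(query)))
    return rank(candidates(terms), terms)
```

With this toy data, "How to handle broadband?" maps "handle" onto the "apply" class and ranks doc1 (two matched terms) above doc2 (one).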
Building the segmentation dictionary in step 1 mainly means constructing the dictionary itself; the dictionary built here is a "two-level segmentation dictionary", constructed as follows:
Step 1-1, form the first-level words from a "general segmentation vocabulary" plus a "business vocabulary";
Wherein the "general segmentation vocabulary" uses the vocabulary of the Institute of Computing Technology, Chinese Academy of Sciences, and the "business vocabulary" contains industry-specific proper nouns and can be built by importing the industry's business names;
Step 1-2, automatically sub-segment the first-level words to form candidate second-level words;
Step 1-3, manually screen the candidate second-level words;
The resulting two-level dictionary has the format: first-level word, followed by the array of its second-level segmentations (separated by '|').
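The two-level dictionary format just described can be sketched with plain dicts. The entries below are invented English stand-ins for the Chinese examples; `refine` shows how a first-level segmentation might be expanded into second-level words.

```python
# A sketch of the two-level dictionary format: each first-level word stores
# its second-level segmentation as a '|'-separated string. The entries are
# invented English stand-ins for the Chinese examples.

RAW_ENTRIES = [
    ("business navigation", "business|navigation"),
    ("cannot access net", "cannot|access net"),
]

def load_secondary_dict(raw):
    return {head: variants.split("|") for head, variants in raw}

def refine(tokens, secondary):
    """Replace each first-level token by its second-level segmentation."""
    out = []
    for t in tokens:
        out.extend(secondary.get(t, [t]))
    return out
```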
Building the full index in step 2 realizes the full construction of the data index file. The main process is:
Step 2-1, read each knowledge-point document and segment it. During segmentation, the ordinary segmentation dictionary is combined with a hypernym/hyponym dictionary that carries semantic relations, producing several alternative segmentations, which are sorted by the number of words in each result and by word length. Segmentation reads the document line by line; each line is then cut at certain punctuation marks into short fragments, and each fragment is segmented against the segmentation dictionary and the hypernym/hyponym word-class dictionary (a word class is a collective name for a group of words with the same or a similar meaning). Whether to index by word class or by word is decided by the following rules:
1. If a word has exactly one word class (and it is not a redundant class), index it under the class name;
2. If a word has more than one word class, index it under each of its classes (excluding redundant classes);
3. If a word is in the dictionary but has no word class, index it under the word itself;
4. If a word is in the dictionary but its class is redundant, do not index it;
Step 2-2, build the index: create an index structure for each word/word class.
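The four indexing rules above can be rendered as a small decision function. `WORD_CLASSES` and `REDUNDANT` are invented toy data; the upper-case names stand in for word classes.

```python
# Hedged sketch of the four indexing rules. WORD_CLASSES and REDUNDANT are
# invented toy data, not the patent's dictionary.

WORD_CLASSES = {
    "handle": ["APPLY"],            # rule 1: one class, index the class
    "open":   ["APPLY", "START"],   # rule 2: several classes, index each
    "modem":  [],                   # rule 3: in dictionary, no class
    "please": ["POLITE"],           # rule 4: only a redundant class
}
REDUNDANT = {"POLITE"}

def index_terms(word):
    classes = WORD_CLASSES.get(word)
    if classes is None:
        return []                          # not in the dictionary at all
    kept = [c for c in classes if c not in REDUNDANT]
    if kept:
        return kept                        # rules 1 and 2
    return [word] if not classes else []   # rule 3 vs. rule 4
```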
Step 2-2 further comprises the following steps:
Step 2-2-1, build the index file: an index is a method for finding the corresponding documents from index terms. In English text, words are separated by whitespace and can be segmented directly; Chinese text is segmented with the ICTCLAS segmenter of the Institute of Computing Technology, Chinese Academy of Sciences, and the words produced by segmentation serve directly as index terms for word-level or class-level indexing;
The index is built as an inverted file: for each document in turn, record the occurrence positions of every word it contains, together with the word's class. Each word occurring in each document thus yields a triple <DocID (document ID), TermID (word ID) | WordClassID (class ID), Positions (list of positions)>, where Positions records the positions at which index term TermID occurs in DocID.
The index structure is:
<TermID | WordClassID (word ID | class ID), <DocID (document ID), <start position of the word in the document and its line number, <index object>>>>
This is a three-level nested structure, described from the inside out:
The innermost index object:
int StartIndex;   // start position at which the word or word class occurs in the indexed document
short int Length; // length of the word or word class
Third layer: NkiInt2Ptr type <start position of the word in the document and its line number, <index object>>
Second layer: NkiString2Ptr type <document ID, third-layer structure>
First layer: NkiString2Ptr type <word class/word, second-layer structure>
The meaning of each field in the index:
Word class/word: obtained by segmenting the document content;
Document ID: the unique identifier of the document (here the document name), since one word/word class may appear in many documents;
Start position of the word in the document: the same word may occur several times in one document, and this information is also needed later, in the scoring stage, to compute word-to-word distances;
Line number of the word in the document: by the knowledge-point document format, the first line is the business the knowledge point belongs to, the second line is the knowledge point's abstract, and the third line is its detailed content. With this index structure, an index of all the knowledge-point documents to be retrieved can be built;
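The three-level structure above (term, then document ID, then occurrences with start position, length, and line number) can be sketched with nested dicts. Whitespace tokenization is a stand-in for the real segmenter, and the layout is illustrative, not the patent's NkiString2Ptr implementation.

```python
# A plain-dict sketch of the three-level index: term -> document ID -> list
# of occurrences, each occurrence holding the (StartIndex, Length, line
# number) that the innermost index object records. Whitespace tokenization
# stands in for the real segmenter.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, lines in docs.items():
        for line_no, line in enumerate(lines, start=1):
            offset = 0
            for word in line.split():
                start = line.index(word, offset)   # character offset in line
                index[word][doc_id].append((start, len(word), line_no))
                offset = start + len(word)
    return index
```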
Step 2-2-2, compress the index file: each run file, ordered by TermID (WordClassID), stores a large number of <TermID, DocID, Freq, pos1, pos2, ..., posFreq> structures; entries with the same TermID are ordered by ascending DocID, and within each structure the position list pos is also ascending. Compression proceeds in two steps. First, gap coding: each ascending integer sequence is transformed into its difference sequence, the increments between originally adjacent integers. Second, the resulting small integers are encoded with a suitable integer code. All the run files are then merged into the final inverted file; each entry of the final inverted file uses the same coding as when the run files were generated, only without the TermID. The Elias delta code is adopted: an integer x >= 1 is encoded as the gamma code of 1 + floor(log2 x), followed by floor(log2 x) bits giving the binary representation of x - 2^floor(log2 x). For the probability model: let the <term, doc> pairs occurring in the document collection total f; dividing by the number n of distinct index terms and again by the total number of documents N gives p = f / (N*n), the probability that a randomly chosen document contains a randomly chosen index term. Each occurrence of an index term in a document contributes one DocID gap to the inverted file index. Treating the f <term, doc> pairs of an inverted entry as drawn at random from all N*n possible <term, doc> pairs of the collection (a Bernoulli process), the probability that a DocID gap equals x is the probability of x - 1 documents without the particular index term followed by one document containing it:

    P(x) = (1 - p)^(x-1) * p,

which shows that x follows a geometric distribution; the implicit assumption here is that occurrences of <term, doc> pairs are independent and identically distributed (Bernoulli).
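The two compression steps can be sketched briefly. The Elias gamma code is shown as a representative integer code; the delta code described in the text additionally gamma-codes the bit length first.

```python
# Sketch of the two compression steps: an ascending DocID/position list is
# turned into a gap (difference) sequence, and each small positive integer
# is then bit-encoded. Elias gamma is shown as one such code; the patent's
# delta code prefixes the gamma code of the bit length instead.

def gaps(sorted_ids):
    return [b - a for a, b in zip([0] + sorted_ids, sorted_ids)]

def elias_gamma(x):
    assert x >= 1
    b = bin(x)[2:]                    # binary without the '0b' prefix
    return "0" * (len(b) - 1) + b     # unary length prefix, then the value
```

For example, the DocID list [3, 7, 12] becomes the gaps [3, 4, 5], and gamma encodes 5 as "00101".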
Building the in-memory index in step 4 completes the construction of the incremental in-memory index, the full in-memory index, and the in-memory segmentation dictionary. The main process comprises:
Step 4-1, build the in-memory segmentation dictionary: read the segmentation dictionary data into memory and build the in-memory dictionary data structure;
Step 4-2, build the full in-memory index:
For every segmented word W_i in Token_i:
  if W_i is not a single character:
    if W_i has a word class C_i, use C_i to index the document set D = {d_1, d_2, ...};
    otherwise, index with W_i;
  otherwise, index with W_i;
Step 4-3, build the incremental in-memory index:
Step 4-3-1, segment the incremental documents;
Step 4-3-2, incrementally update the existing in-memory segmentation dictionary;
Step 4-3-3, incrementally update the existing in-memory index.
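The step 4-2 pseudocode above can be rendered as an executable sketch. `WORD_CLASS` is an invented toy table, not the patent's dictionary.

```python
# Executable rendering of the step 4-2 pseudocode: every token that is not a
# single character and has a word class is indexed via that class; everything
# else is indexed as the word itself. WORD_CLASS is toy data.

WORD_CLASS = {"handle": "APPLY"}

def memory_index_terms(tokens):
    terms = []
    for w in tokens:
        if len(w) > 1 and w in WORD_CLASS:
            terms.append(WORD_CLASS[w])   # index via word class C_i
        else:
            terms.append(w)               # single character or classless word
    return terms
```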
Full-text search in step 5 proceeds as follows:
Step 5-1, normalize the user's question (also called "normalization processing"): remove redundant words, remove useless punctuation that carries no semantic information, correct typos, and normalize aliases;
Step 5-2, word segmentation: segment the normalized question:
Step 5-2-1, segment the user query;
Step 5-2-2, the operator enters the key phrase, denoted query;
Step 5-2-3, segment query with the first-level and second-level dictionaries, obtaining several segmentations, called the pre-correction segmentation Seg1;
Step 5-2-4, apply error correction to the pre-correction segmentation, obtaining further segmentations, called the post-correction segmentation Seg2; let Seg = Seg1 ∪ Seg2 = {{W_1, W_2, ...}, ...}, with segmentations sorted by increasing number of words. The error-correction procedure:
(1) first correct with the error-correction dictionary;
(2) then correct with statistical information;
Precondition for correction:
(1) for each word in query, record the historical search frequency T of the uncorrected word and the historical search frequency T' of the corrected keyword;
a) if T' >> T, retrieve with the corrected word and show a prompt;
b) otherwise, retrieve with the uncorrected word;
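The T' >> T decision rule can be sketched as follows. The threshold factor of 10 and the guard against a zero frequency are my assumptions; the patent does not fix a concrete value for ">>".

```python
# Sketch of the T' >> T decision: retrieve with the corrected word only when
# its historical search frequency clearly dominates the original's. The
# factor of 10 and the max(t, 1) guard are assumptions, not from the patent.

def choose_query_word(word, corrected, history, factor=10):
    t = history.get(word, 0)            # frequency T of the uncorrected word
    t_corr = history.get(corrected, 0)  # frequency T' of the corrected word
    if t_corr > factor * max(t, 1):
        return corrected, True          # retrieve corrected, show a prompt
    return word, False
```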
Step 5-3, semantic understanding: process the segmentation result and extract, for each segmented word in the question, the word class it belongs to or its standard word, yielding the word's semantic information:
Step 5-3-1, attach word classes to the segmentation result, giving Token = {{W_1(C_1), W_2(C_2), ...}, ...}; a word's meaning is represented by its word class, each word may have several meanings and thus belong to several classes, and Token_i denotes the i-th segmentation;
Step 5-3-2, look up the consulting history:
Find in the consulting history library the past consultation query' most similar to query; if found, return query''s retrieval result as the Top-1 document;
Consulting similarity is defined as:
Sim(Token_i(query), Token_j(query')) = Sim({W_1(C_1), W_2(C_2), ...}, {W_1'(C_1'), W_2'(C_2'), ...})
                                     = avg(a * sem_sim(C_i, C_i') + (1 - a) * syn_sim(W_i, W_i'))
(1) where avg() is the mean function;
(2) sem_sim(C_i, C_i') = 1 if C_i ∩ C_i' ≠ φ, else 0;
(3) syn_sim(W_i, W_i') = the number of characters W_i and W_i' have in common / the number of characters in which W_i and W_i' differ;
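The consulting-similarity formula can be sketched as below. The weight a, the character-set reading of "identical/different characters", and the guard against a zero denominator in syn_sim are my assumptions.

```python
# Sketch of the consulting-similarity formula. The weight a, the set reading
# of "identical/different characters", and the zero-denominator guard in
# syn_sim are assumptions, not fixed by the patent.

def sem_sim(c1, c2):
    return 1.0 if set(c1) & set(c2) else 0.0

def syn_sim(w1, w2):
    same = len(set(w1) & set(w2))          # characters the words share
    diff = len(set(w1) ^ set(w2))          # characters where they differ
    return same / diff if diff else float(same > 0)

def consult_sim(pairs, a=0.5):
    """pairs: [((W_i, classes_i), (W_i_prime, classes_i_prime)), ...]"""
    scores = [a * sem_sim(c1, c2) + (1 - a) * syn_sim(w1, w2)
              for (w1, c1), (w2, c2) in pairs]
    return sum(scores) / len(scores)
```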
Step 5-4, semantic expansion: expand the words' semantic information semantically; the expanded semantic information is represented by a set of words or word classes. Expand Token_i with synonyms; the expanded segmentation is denoted EToken_i;
Step 5-5, obtain candidate documents: using the words or word classes obtained by semantic expansion (which represent the expanded semantics), search the in-memory index for the corresponding full-text documents as candidates;
Step 5-6, rank candidate documents: score the candidates from multiple angles; the higher the score, the higher the rank, and the sorted candidates become the final full-text search result;
Score every document in D. Scoring considers the following factors:
(1) SegNum: the number of segmented words in query;
(2) SegWordWgt: the word's own weight (words in the title, business, or abstract weigh more);
(3) DocWordWgt: the word's weight within the document;
(4) DocHits: the document's retrieval click count;
(5) DocTime: the document's retrieval timestamp;
(6) HitWordWgts: the total weight of the query words that occur in the document;
(7) MissedWordWgts: the total weight of the query words that do not occur in the document;
(8) WordSpan(W_1, W_2, ..., d): the pairwise distances, within document d, of the query's words;
Step 5-6-1:
Credit(d) = HitWordWgts / (HitWordWgts + MissedWordWgts)
WordWgt(w_i, d) = doc_word_wgt(w_i, d) * PosiWgt(w_i)
doc_word_wgt(w_i, d) = tfidf(w_i, d)
PosiWgt(w_i) = 2.0 (tunable), if the word appears in the title or business line
PosiWgt(w_i) = 1.5 (tunable), if the word appears in the abstract
PosiWgt(w_i) = 1.0, otherwise
Step 5-6-2:
Credit(d) *= 1 / log2(SegNum + 1)
Take the Top N documents ranked by Credit(d), breaking ties by time;
Step 5-6-3:
Credit(d) /= (WordSpan(w_1, ..., w_n, d) + 1)
WordSpan(w_1, ..., w_n, d) = Sum over 1 <= i < j <= n of a function of the number of words between w_i and w_j in d
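The three scoring steps combine into a single score as sketched below. The position weights 2.0 / 1.5 / 1.0 follow the text, which itself marks the first two as tunable.

```python
# Sketch combining scoring steps 5-6-1 through 5-6-3. The position weights
# 2.0 / 1.5 / 1.0 follow the text, which marks them as tunable.
import math

def posi_wgt(position):
    if position in ("title", "business"):
        return 2.0       # word appears in the title or business line
    if position == "abstract":
        return 1.5       # word appears in the abstract
    return 1.0

def credit(hit_wgts, missed_wgts, seg_num, word_span):
    c = hit_wgts / (hit_wgts + missed_wgts)   # step 5-6-1
    c *= 1.0 / math.log2(seg_num + 1)         # step 5-6-2
    c /= word_span + 1                        # step 5-6-3
    return c
```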
Step 5-7, save the consulting history:
The system's first Top-1 result: Token_1 = W_1(C_1) W_2(C_2) ... W_n(C_n), document d_k
The user's selection: Token_2 = W_1'(C_1') W_2'(C_2') ... W_n'(C_n'), document d_j
If k ≠ j, and (HistoryTop1(Token_2) = φ, or HistoryTop1(Token_2) ≠ d_k),
prompt the user for feedback, implemented as follows:
Step 5-7-1, when j > 2, after the operator has viewed and closed the document, pop up a feedback dialog to confirm whether the user is satisfied with the query result;
Step 5-7-2, if the operator chooses "yes", save HistoryTop1(Token_2) = d_j in the following format:
<Query, Token_2, doc_type (document type), doc_id (document ID), doc_id_value (document ID value)>.
The retrieval granularities of the full-text search are "Service", "Topic", "Abstract", and "Mix": Service denotes the business classification information of the full-text documents; Topic denotes the first-level knowledge point of a business; Abstract denotes the finest-grained knowledge point, i.e. the most detailed query; and Mix means that every knowledge granularity is returned.
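The four granularities can be held in a small lookup, e.g. to validate a requested granularity; the descriptions paraphrase the text and the validation helper is an invented convenience.

```python
# The four retrieval granularities as a small lookup sketch. Descriptions
# paraphrase the text; valid_granularity is an invented helper.

GRANULARITIES = {
    "Service":  "business classification information of the full-text documents",
    "Topic":    "first-level knowledge point of a business",
    "Abstract": "finest-grained knowledge point (the most detailed query)",
    "Mix":      "results at every knowledge granularity",
}

def valid_granularity(name):
    return name in GRANULARITIES
```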
Beneficial effects:
1. The present invention has stronger semantic understanding and semantic expansion, and is more intelligent.
2. The full-text search provided by the invention is based on an in-memory index, giving higher retrieval efficiency.
3. When ranking the retrieved candidate documents, the invention takes retrieval information, normalization information, historical retrieval information, and the user's input into account, giving a more principled ranking.
4. Before full-text search, the invention normalizes the user's question and applies semantic understanding and semantic expansion, so that the retrieval results better match the user's needs.
5. When ranking candidate documents, the invention uses multiple ranking signals, moving the documents the user needs higher in the list and increasing Top-N recall.
Brief description of the drawings
Fig. 1 is the document data model of the system in the present invention;
Fig. 2 is the use-case diagram of the system in the present invention;
Fig. 3 is the overall activity diagram of the system in the present invention;
Fig. 4 is the segmentation-dictionary management page in the present invention;
Fig. 5 is the index structure used by the system in the present invention;
Fig. 6 is an example retrieval statement of the system of the present invention.
Detailed description of the embodiments
In the present invention, the full-text document collection over which full-text search is performed can be regarded as a set that changes over time. Its document data model is shown in Fig. 1: full-text documents have spatial, temporal, and content characteristics. The document identifier represents the document's spatial characteristic, (document vector, feature vector, topic) represents its content characteristic, and (day, month, year) represents its temporal characteristic.
As shown in Fig. 2 and Fig. 3, a new-generation industry knowledge full-text search method comprises the following steps:
Step 1, build the segmentation dictionary: construct the word-segmentation dictionary and store the dictionary information in a database.
Step 2, build the full index: read, segment, and analyze the existing full-text documents (in the present invention also called knowledge-point documents) and create an index file.
Step 3, build the incremental index: process newly added documents and update the index file on disk.
Step 4, build the in-memory index, comprising:
Step 4-1, build the in-memory segmentation dictionary: read the segmentation dictionary data into memory and build the in-memory dictionary data structure.
Step 4-2, build the full in-memory index: read the index file from disk and build the complete in-memory index.
Step 4-3, build the incremental in-memory index: process newly added documents and update the system's in-memory index incrementally.
Step 5, full-text search, comprising:
Step 5-1, normalize the user's question (also called "normalization processing"): e.g. remove redundant words, remove useless punctuation that carries no semantic information, correct typos, normalize aliases, and so on.
Step 5-2, word segmentation: segment the normalized question.
Step 5-3, semantic understanding: process the segmentation result and extract, for each segmented word in the question, the word class it belongs to or its standard word, yielding the word's semantic information.
Step 5-4, semantic expansion: expand the words' semantic information semantically; the expanded semantic information is represented by a set of words or word classes.
Step 5-5, obtain candidate documents: using the words or word classes obtained by semantic expansion (which represent the expanded semantics), search the in-memory index for the corresponding full-text documents as candidates.
Step 5-6, rank candidate documents: score the candidates from multiple angles; the higher the score, the higher the rank. The sorted candidates become the final full-text search result.
Wherein the segmentation dictionary is built at system initialization; the full index is built by reading all knowledge-point documents into a disk index file (index file for short); the incremental index is built whenever full-text documents are newly added. These three activities are independent of the full-text retrieval module and can run on their own.
Building the segmentation dictionary in step 1 mainly means constructing the dictionary itself (in the present invention, the dictionary built is a "two-level segmentation dictionary"). It is constructed as follows:
Step 1-1, form the first-level words from a "general segmentation vocabulary" plus a "business vocabulary";
Wherein the "general segmentation vocabulary" contains common words; the present invention adopts the published vocabulary of the Institute of Computing Technology, Chinese Academy of Sciences, as the general segmentation vocabulary. The "business vocabulary" contains industry-specific proper nouns and can be built by importing the industry's business names.
The "business vocabulary" is managed at three levels: word class, entry, and alias. Its management page is shown in Fig. 4; through this page the user can manage word classes, entries, and aliases. A word class here is a set of synonyms or near-synonyms; for example, a "family-like" class contains several entries that all mean "family". A standard word marks whether an entry is the commonly used, typo-free form: "family" (家庭) is a standard entry, and its aliases include typo variants of it (rendered in the machine translation as "adding front yard" and "family the court of a feudal ruler"), both misspellings of "family".
Step 1-2, automatically sub-segment the first-level words to form the (possibly multiple) candidate second-level words;
Step 1-3, manually screen the candidate second-level words;
The resulting two-level dictionary has the following format:
first-level word, followed by the array of its second-level segmentations (separated by '|').
For example: "cannot access the internet" maps to its candidate sub-segmentations, separated by '|';
"Business Navigation" maps to "Business" | "Navigation".
structure full dose index described in step 2,major function is that the full dose that realizes data directory file builds.Its main process is as follows:
Step 2-1, read each knowledge point document, knowledge point document is carried out to participle: in participle process, common dictionary for word segmentation and the upper the next dictionary with semantic relation are combined, produce polycomponent word result, and sort according to the length of the number of the word comprising in every group of result and word, also introduce in addition this key concept of part of speech, (institute's predicate classes are exactly a general designation with one group of word of same or the close meaning, for example: handle, one of lane, doing these three words all belongs to and handles part of speech, express " handling " this meaning) when participle, read by row, then every a line is being intercepted according to some punctuation marks, obtain the word of segment, carry out participle according to dictionary for word segmentation and upper the next part of speech dictionary.For being to set up index with part of speech or word on earth, native system has done following regulation,
1. If a word has exactly one word class (and that class is not a redundant word class), it is indexed under the word-class name.
2. If a word has more than one word class, an index entry is built for each of its word classes (excluding redundant word classes).
3. If a word is in the dictionary but has no word class, it is indexed under the word itself.
4. If a word is in the dictionary but carries only a redundant word class, no index entry is built for it.
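The four indexing rules above can be sketched as a small decision function (the function name and data shapes are illustrative assumptions):

```python
def index_keys(word, dictionary, word_classes, redundant):
    """Apply rules 1-4: return the keys under which `word` should be indexed."""
    if word not in dictionary:
        return []
    classes = word_classes.get(word, [])
    useful = [c for c in classes if c not in redundant]
    if useful:          # rules 1 and 2: one entry per non-redundant word class
        return useful
    if classes:         # rule 4: only redundant classes -> build no index entry
        return []
    return [word]       # rule 3: in the dictionary but class-less -> index the word

# hypothetical dictionary and word-class data
dictionary = {"handle", "lane", "do", "navigator", "hmm"}
word_classes = {"handle": ["HANDLE"], "lane": ["HANDLE"], "hmm": ["FILLER"]}
redundant = {"FILLER"}
```

For instance, "handle" is indexed under its word class, "navigator" (class-less) under itself, and "hmm" (redundant class only) not at all.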
Step 2-2: build the index; an index structure is created for every word/word class.
Step 2-2-1: build the index file. In essence, an index is a means of finding the corresponding documents from index terms. Indexes can be implemented with different techniques: inverted files, signature files, bitmaps, and so on. In typical applications the inverted index outperforms the other two in both index space and query-processing speed, and the inverted file is currently the most widely used index technique. An important step in index construction is the generation of index terms. In English text, words are directly separated by spaces, so extracting them is comparatively simple; in Chinese text there are no delimiters between words, so Chinese word segmentation must be performed with natural language processing (NLP) techniques. The quality of segmentation directly affects the retrieval quality of a Chinese full-text search system. The present invention uses the segmentation tool ICTClass of the Institute of Computing Technology, Chinese Academy of Sciences, for Chinese word segmentation. After segmentation, the index terms must be determined and the positions at which they occur in each document computed; in the present invention the words produced by segmentation serve directly as index terms for word-level or word-class-level indexing. The biggest challenge in index construction is the process of building the inverted file. The procedure is: each document is processed in turn, recording the occurrence position of every word it contains together with the word class the word belongs to; thus each word occurring in each document yields a triple <DocID (document ID), TermID (word ID) | WordClassID (word-class ID), Positions (a list of positions)>, where Positions records the positions at which index term TermID occurs in DocID. Because documents are processed in order, the set of triples is initially sorted by DocID; the inversion step then gathers together all DocIDs with identical TermID or WordClassID. There are many ways to build the index: based on linked lists, on sorting, on dictionary separation, or on text separation. The index structure of the present invention is shown in Figure 5: a "word/word class" indexes multiple "document IDs", each "document ID" indexes multiple "start positions", and each "start position" indexes the corresponding "document index objects". Specifically:
<TermID | WordClassID (word ID | word-class ID), <DocID (document ID), <start position and line number of the word in the document, <index object>>>>
This structure is a three-level nested object, described below from the inside out:
The innermost index object:
int StartIndex;   // the start position at which the word or word class occurs
short int Length; // the length of the word or word class
Although the object currently holds only these attributes, it is designed as an object to ease later expansion, so that more information can be stored without changing the surrounding structure.
The third layer: NkiInt2Ptr type <start position and line number of the word in the document, <index object>>
The second layer: NkiString2Ptr type <document ID, third-layer structure>
The first layer: NkiString2Ptr type <word class/word, second-layer structure>
The meaning of each field in the index:
Word class/word: obtained by segmenting the document content.
Document ID: the unique identifier of a document; because one word class/word may appear in several documents, the document name is used here.
Start position of the word in the document: since the same word may occur several times within one document, this information must be recorded; it is also needed in the later scoring stage to compute the distance between words.
Line number of the word in the document: this is added because of the document format, since words occurring on different lines have different importance. By the format of a knowledge-point document, the first line is the business the knowledge point belongs to, the second line is its summary, and the third line is its detailed content; clearly the business line matters most, then the summary, then the detailed content. With this index structure, an index of all knowledge-point documents to be retrieved can be built.
Whether the index is built well directly determines future retrieval quality, and the present invention takes this fully into account. First, the segmentation results are screened: redundant word classes that contribute nothing to retrieval, or even reduce its accuracy, are not indexed. Second, not every group of the multiple segmentation results is indexed (otherwise the index would grow very large, and some segmented fragments are so short that they nearly lose the original meaning; for example, splitting "set meal" into "set" and "meal" changes the meaning entirely); instead an appropriate selection is made according to the retrieval granularity. The three-level index structure itself is a highlight of the present invention: the innermost document-index object is designed as an object precisely so that the system can be extended later by adding attribute fields inside the object, without changing the overall index structure, which improves the compatibility and extensibility of the system. The field "start position and line number of the word in the document" packs two positional facts from different angles into one value, start position * 1000 + line number, which conveniently keeps the two pieces of information together and avoids a traversal across separate index levels.
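A minimal Python sketch of this three-level structure and of the start-position*1000+line-number packing (the class and function names are assumptions; the patent's own types are the C-style NkiString2Ptr/NkiInt2Ptr):

```python
class IndexObject:
    """Innermost index object; kept as a class so that later fields can be
    added without changing the surrounding index structure."""
    def __init__(self, start, length):
        self.start = start      # start position of the word/word class
        self.length = length    # length of the word/word class

def pack_position(start, line):
    """Pack the start position and the line number into one key."""
    return start * 1000 + line

def add_occurrence(index, key, doc_id, start, line, length):
    # layer 1: word class/word -> layer 2: document ID -> layer 3: packed position
    doc_map = index.setdefault(key, {})
    pos_map = doc_map.setdefault(doc_id, {})
    pos_map[pack_position(start, line)] = IndexObject(start, length)

idx = {}
add_occurrence(idx, "handle", "doc7", 12, 2, 2)   # word class seen in line 2 (summary)
```

Both facts remain recoverable from the packed key (integer division and remainder by 1000), so no extra index level is needed for the line number.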
Step 2-2-2: compress the index file. For each run file ordered by TermID (or WordClassID), a large number of <TermID, DocID, Freq, pos1, pos2, pos3, ..., posFreq> structures must be stored; entries with the same TermID are ordered by ascending DocID, and within each entry the position information pos is likewise in ascending order. Compression proceeds in two steps. First, each ascending integer sequence is transformed into its difference sequence (the gaps between originally adjacent integers). No information is lost, because processing always starts from the first element of the sequence, so iterated addition recovers the original sequence; by itself this step only turns large integers into small ones and does not reduce storage. Second, the small integers are encoded with a suitable coding method to achieve the actual compression. All run files are then merged to obtain the final inverted-file index. Each entry in the final inverted file uses the same coding method as when the run files were generated, only without the TermID. Several coding methods, described below, can be adopted.
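The first compression step (ascending sequence to difference sequence) and its lossless inverse can be sketched as follows; the function names are assumptions:

```python
def to_gaps(ascending):
    """Replace each integer after the first by its gap from the previous one."""
    return [ascending[0]] + [b - a for a, b in zip(ascending, ascending[1:])]

def from_gaps(gaps):
    """Recover the original sequence by iterative addition from the first element."""
    out, running = [], 0
    for g in gaps:
        running += g
        out.append(running)
    return out
```

The gaps are small positive integers, which is exactly what the variable-length codes described next compress well.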
Unary Code: an integer x >= 1 is encoded as (x-1) 1-bits followed by a single 0, occupying x binary digits. For example, 3 is encoded as 110. It assumes that the next symbol to occur is x with probability Pr[x] = 2^(-x).
Gamma Code: an integer x >= 1 is encoded as the unary code of floor(log2 x) + 1, followed by the floor(log2 x)-bit binary representation of x - 2^floor(log2 x). The unary part records how many bits the remainder occupies, so in total x is encoded in 2*floor(log2 x) + 1 bits. For example, 3 is encoded as 101. It assumes that the next symbol to occur is x with probability Pr[x] ≈ 1 / (2 x^2).
Delta Code: an integer x >= 1 is encoded as the Gamma code of floor(log2 x) + 1, followed by the floor(log2 x)-bit binary representation of x - 2^floor(log2 x). As before, the Gamma part records how many bits the remainder occupies; since Gamma-coding floor(log2 x) + 1 requires 2*floor(log2(floor(log2 x) + 1)) + 1 bits, in total x is encoded in floor(log2 x) + 2*floor(log2(floor(log2 x) + 1)) + 1 bits. For example, 3 is encoded as 1001. Delta Code assumes that the next symbol to occur is x with probability Pr[x] ≈ 1 / (2 x (log2 x)^2).
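The three codes above can be sketched directly from their definitions (bit strings are returned as Python strings for clarity; function names are assumptions):

```python
def unary(x):
    """x >= 1 -> (x-1) ones followed by a zero."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """x >= 1 -> unary(floor(log2 x) + 1), then the floor(log2 x) low-order bits of x."""
    n = x.bit_length() - 1                     # n = floor(log2 x)
    body = format(x - (1 << n), "b").zfill(n) if n else ""
    return unary(n + 1) + body

def delta(x):
    """x >= 1 -> gamma(floor(log2 x) + 1), then the floor(log2 x) low-order bits of x."""
    n = x.bit_length() - 1
    body = format(x - (1 << n), "b").zfill(n) if n else ""
    return gamma(n + 1) + body
```

These reproduce the worked examples in the text: 3 encodes to 110 (unary), 101 (gamma), and 1001 (delta).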
Suppose the <term, doc> pairs occurring in a document set total f. Dividing f by the number n of distinct index terms, and again by the total number of documents N, gives p = f / (N*n): the probability that a randomly chosen document contains a randomly chosen index term. Each occurrence of an index term in a document requires recording one DocID gap in the inverted-file index. Suppose the f <term, doc> pairs appearing in the inverted entries are drawn at random from all N*n possible <term, doc> pairs of the document set; this process can be regarded as a Bernoulli process. Under this assumption, the probability that a DocID gap equals x is the probability of seeing x-1 consecutive documents without the particular index term, followed by one that contains it:
Pr[x] = (1 - p)^(x-1) * p,
which shows that x follows a geometric distribution. The implicit condition here is that the occurrences of <term, doc> pairs are independent and identically distributed (Bernoulli distributed).
Since the DocID gaps follow the geometric distribution above, they can be encoded with a Golomb Code. For a parameter b, an integer x >= 1 is encoded in two parts: the first part is the unary code of q + 1, where q = floor((x - 1) / b), occupying q + 1 bits; the second part is the binary representation of the remainder r = x - q*b - 1, occupying floor(log2 b) or ceil(log2 b) bits. It can be proved that, for the given geometric distribution Pr[x] = (1 - p)^(x-1) * p, the Golomb Code yields an optimal prefix code when
b = ceil( log2(2 - p) / (-log2(1 - p)) ).
Building the memory index described in step 4: complete the construction of the incremental memory index, the full memory index, and the in-memory segmentation dictionary. The main process comprises:
Step 4-1, builds internal memory dictionary for word segmentation: dictionary for word segmentation data are read in to internal memory, build internal memory dictionary for word segmentation data structure;
Step 4-2, builds the full memory index:
for every word W_i in Token_i:
    if W_i is not a single character:
        if W_i has a word class C_i:
            use C_i to index the document set D = {d_1, d_2, ...};
        else:
            use W_i for indexing;
    else:
        use W_i for indexing;
Step 4-3, builds the incremental memory index;
Step 4-3-1, segments the incremental documents;
Step 4-3-2, incrementally updates the existing in-memory segmentation-dictionary structure;
Step 4-3-3, incrementally updates the existing memory index.
The full-text retrieval described in step 5 proceeds as follows:
Step 5-1, standardize the user question: the question submitted by the user is accepted and standardized (also called "standardization processing"): redundant words are removed, useless punctuation marks that do not affect the semantics are removed, known errors are corrected, and aliases are standardized;
Step 5-2, segmentation: the standardized question is segmented;
Step 5-2-1, segment the user query;
Step 5-2-2, the operator enters the key phrase, denoted query;
Step 5-2-3, segment query according to the one-level and secondary segmentation dictionaries, obtaining multiple segmentation results, called the pre-correction segmentation Seg1;
Step 5-2-4, apply error correction to the pre-correction segmentation, obtaining multiple segmentation results, called the post-correction segmentation Seg2; let Seg = Seg1 ∪ Seg2 = {{W_1, W_2, ...}, ...}, with the segmentation results sorted by ascending number of words. Error correction procedure:
(1) first correct with the error-correction dictionary;
(2) then correct with statistical information.
For example, a pre-correction segmentation "morning / navigator / xx / xxx / xxx /" (5 words)
becomes, after correction, "business / navigator / xx / xxx / xxx /" (5 words).
Precondition for error correction:
(1) for each word in query, record the historical search frequency T of the uncorrected word and the historical search frequency T' of the corrected keyword;
a) if T' >> T, retrieve with the corrected word and show a prompt;
b) otherwise, retrieve with the uncorrected word.
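Rule (1) above (retrieve with the corrected word only when T' >> T) can be sketched as follows; the numeric threshold interpreting ">>" and all names are assumptions:

```python
def pick_term(raw, corrected, history_freq, ratio=10.0):
    """Return (the term to retrieve with, whether to show a correction prompt)."""
    t = history_freq.get(raw, 0)              # T : frequency of the uncorrected word
    t_corr = history_freq.get(corrected, 0)   # T': frequency of the corrected keyword
    if t_corr > ratio * max(t, 1):            # T' >> T -> retrieve after correction
        return corrected, True
    return raw, False                         # otherwise keep the original word

# hypothetical search-history frequencies
history = {"business navigator": 500, "morning navigator": 3}
```

With these frequencies, the mistyped "morning navigator" is replaced by "business navigator" and the user is prompted; the reverse substitution is refused.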
For example:
Customer question Query = "morning navigator cannot get online" (a mistyping of "business navigator cannot get online")
Pre-correction segmentation Seg1 = {
"morning navigator / cannot-get-online /" 3 words
"morning navigator / get on / not / net /" 5 words
}
Post-correction segmentation Seg2 = {
"business navigator / cannot-get-online /" 3 words
"business navigator / get on / not / net /" 5 words
}
Segmentation result Seg = {
"morning navigator / cannot-get-online /" 3 words
"morning navigator / get on / not / net /" 5 words
"business navigator / cannot-get-online /" 3 words
"business navigator / get on / not / net /" 5 words
}
Step 5-3, semantic understanding: the segmentation results are processed, and the word class or standard word of each word occurring in the question is extracted, obtaining the semantic information of the segmentation;
Step 5-3-1, assign word classes to the segmentation result, giving Token = {{W_1(C_1), W_2(C_2), ...}, ...}. In the present invention the word class represents the meaning of a word; each word can have several meanings and thus belong to several word classes. Token_i denotes the i-th group of segmentation.
Step 5-3-2, query the consulting history:
find in the consulting-history library the historical query' most similar to query; once found, return the retrieval result of query' as the Top-1 document.
Consulting similarity is defined as follows:
Sim(Token_i(query), Token_j(query')) = Sim({W_1(C_1), W_2(C_2), ...}, {W_1'(C_1'), W_2'(C_2'), ...})
= avg( a * sem_sim(C_i, C_i') + (1 - a) * syn_sim(W_i, W_i') ), where a is a weighting coefficient;
(1) where avg() is the mean-value function;
(2) sem_sim(C_i, C_i') = 1 if C_i ∩ C_i' ≠ φ, and 0 otherwise;
(3) syn_sim(W_i, W_i') = (the number of characters common to W_i and W_i') / (the number of distinct characters appearing in W_i and W_i' together);
for example, syn_sim("cannot surf the Net", "how to surf the Net") = 2/6 (in the original Chinese, the two phrases share 2 of 6 distinct characters).
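One reading of syn_sim that reproduces the 2/6 example above is character-set overlap (shared characters over distinct characters of both words); this reading is an assumption, since the original wording is ambiguous:

```python
def syn_sim(w1, w2):
    """Characters common to both words / distinct characters across both words."""
    a, b = set(w1), set(w2)
    return len(a & b) / len(a | b)

# stand-in for the Chinese example: two 4-character phrases sharing 2 characters,
# with 6 distinct characters between them, give 2/6
```

Under this reading the measure is the Jaccard similarity of the two character sets.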
Step 5-4, semantic expansion: the segmentation semantic information is expanded semantically to obtain the expanded semantic information, which is expressed with certain words or word classes. Token_i is expanded with synonyms, and the expanded segmentation is denoted EToken_i.
For example: the synonym of "business navigator" is "vip navigator", and the synonym of "cannot get online" is "cannot surf the Net";
so "business navigator / cannot get online /" can be expanded to "vip navigator / cannot surf the Net /".
Step 5-5, obtain candidate documents: using the words or word classes obtained after semantic expansion (the expanded semantic information these words or word classes represent), search the corresponding full-text documents according to the memory-index information; these become the candidate documents;
Step 5-6, rank the candidate documents: the candidate documents are scored and ranked from several angles; the higher the score, the earlier the rank, and the ranked candidate documents become the final full-text retrieval result.
All documents in D are scored. The scoring considers the following factors:
(1) SegNum: the number of words in the query segmentation;
(2) SegWordWgt: the weight of the word itself (words in the title, business line, or summary weigh more);
(3) DocWordWgt: the weight of the word within the document;
(4) DocHits: the click count of the document;
(5) DocTime: the timestamp of the document;
(6) HitWordWgts: the weights of the query words that occur in the document;
(7) MissedWordWgts: the weights of the query words that do not occur in the document;
(8) WordSpan(W_1, W_2, ..., d): the pairwise distances in the document between the query words;
Step 5-6-1:
Credit(d) = HitWordWgts / (HitWordWgts + MissedWordWgts)
WordWgt(w_i, d) = doc_word_wgt(w_i, d) * PosiWgt(w_i)
doc_word_wgt(w_i, d) = tfidf(w_i, d)
PosiWgt(w_i) = 2.0 (to be tuned), if the word appears in the title or business line;
             = 1.5 (to be tuned), if the word appears in the summary;
             = 1.0, otherwise.
Step 5-6-2:
Credit(d) *= 1 / log2(SegNum + 1)
The Top N documents by Credit(d) (N to be determined) are then ordered by time;
Step 5-6-3:
Credit(d) /= (WordSpan(w_1 ... w_n, d) + 1)
WordSpan(w_1 ... w_n, d) = the sum, over all 1 <= i < j <= n, of a function of the number of words between w_i and w_j in the document.
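Steps 5-6-1 through 5-6-3 combine into one document score roughly as follows (a sketch: the tunable position weights are taken from step 5-6-1, while the word-span value is passed in precomputed, since its exact function is left open in the text):

```python
from math import log2

def posi_wgt(in_title_or_business, in_summary):
    """Position weight of a word, per step 5-6-1 (values marked 'to be tuned')."""
    if in_title_or_business:
        return 2.0
    if in_summary:
        return 1.5
    return 1.0

def credit(hit_wgts, missed_wgts, seg_num, word_span):
    """Combine steps 5-6-1 .. 5-6-3 into a single score for one document."""
    c = sum(hit_wgts) / (sum(hit_wgts) + sum(missed_wgts))  # step 5-6-1
    c *= 1 / log2(seg_num + 1)                              # step 5-6-2
    c /= word_span + 1                                      # step 5-6-3
    return c
```

A document that hits every query word (no missed weight), from a one-word query, with zero span, scores exactly 1.0; missed words, longer queries, and scattered hits all pull the score down.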
Step 5-7, saving the consulting history:
the system's first Top-1 answer gives: Token_1 = W_1(C_1) W_2(C_2) ... W_n(C_n) → d_k
the user selects: Token_2 = W_1'(C_1') W_2'(C_2') ... W_n'(C_n') → d_j
If k ≠ j, and (HistoryTop1(Token_2) = φ) or (HistoryTop1(Token_2) ≠ d_k),
the user is prompted for feedback, implemented as follows:
Step 5-7-1: when j > 2, after the operator has viewed the document and closes it, a feedback dialog pops up to confirm whether the user is satisfied with the query result;
Step 5-7-2: if the operator chooses Yes, HistoryTop1(Token_2) = d_j is saved, in the following format:
<Query, Token_2, doc_type (document type), doc_id (document ID), doc_id_value (document ID value)>.
The present invention can thus provide retrieval services to users. A user's query requirements are described by a query statement. An example query statement is shown in Figure 6, where NeedAssociateSearch="true" means that related-search results are returned; ShowCountPerPage="10" means that 10 search records are shown per page; Granularity="Abstract" means that the retrieval granularity is "Abstract". Four retrieval granularities are provided in the present invention: "Service", "Topic", "Abstract" and "Mix", where Service means the business classification information of the full-text documents; Topic means the first-level knowledge points of a business; Abstract means the finest-grained knowledge points, i.e. the most detailed query; and Mix means that every knowledge granularity is returned. TopN="100" means that only the first 100 retrieval results are returned, while TopN=-1 means that all retrieval results are returned. "business navigator" is the retrieval content entered by the user.
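Based on the attributes named above, the Figure 6 query statement plausibly looks like the following; the element name and exact layout are assumptions, since only the attributes and the retrieval content are described in the text:

```xml
<Query NeedAssociateSearch="true"
       ShowCountPerPage="10"
       Granularity="Abstract"
       TopN="100">business navigator</Query>
```

Setting Granularity to "Service", "Topic", or "Mix" instead would select the other retrieval granularities described above.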

Claims (7)

1. A new-generation domain-knowledge full-text search method, characterized in that it comprises the following steps:
Step 1, build the segmentation dictionary: build the segmentation dictionary and store the dictionary information in a database;
Step 2, build the full index: read, segment, and analyze the existing full-text documents (also called knowledge-point documents), and build the index file;
Step 3, build the incremental index: process newly added documents and update the index file on the hard disk;
Step 4, build the memory index, comprising:
Step 4-1, build the in-memory segmentation dictionary: read the segmentation-dictionary data into memory and build the in-memory segmentation-dictionary data structure;
Step 4-2, build the full memory index: read the index file from the hard disk and fully build the memory index;
Step 4-3, build the incremental memory index: process newly added documents and update the memory index incrementally;
Step 5, full-text retrieval, comprising:
Step 5-1, standardize the user question: the question submitted by the user is accepted and standardized (also called "standardization processing"): redundant words are removed, useless punctuation marks that do not affect the semantics are removed, known errors are corrected, and aliases are standardized;
Step 5-2, segmentation: the standardized question is segmented;
Step 5-3, semantic understanding: the segmentation results are processed, and the word class or standard word of each word occurring in the question is extracted, obtaining the semantic information of the segmentation;
Step 5-4, semantic expansion: the segmentation semantic information is expanded semantically to obtain the expanded semantic information, which is expressed with certain words or word classes;
Step 5-5, obtaining candidate documents: using the words or word classes obtained after semantic expansion (the expanded semantic information these words or word classes represent), the corresponding full-text documents are searched according to the memory-index information and taken as candidate documents;
Step 5-6, ranking the candidate documents: the candidate documents are scored and ranked from several angles; the higher the score, the earlier the rank, and the ranked candidate documents become the final full-text retrieval result;
wherein the segmentation dictionary is built at system initialization; the full index is built by reading all knowledge-point documents and fully building the hard-disk index file (the "index file" for short); the incremental index is built when new full-text documents are added; and these three activities are independent of the full-text retrieval module and run separately.
2. The new-generation domain-knowledge full-text search method according to claim 1, characterized in that the building of the segmentation dictionary described in step 1 mainly realizes the construction of the segmentation dictionary, the dictionary built being a "secondary segmentation dictionary", whose building method is as follows:
Step 1-1, form the one-level segmentation from the "general segmentation vocabulary" plus the "business vocabulary";
wherein the "general segmentation vocabulary" adopts the vocabulary of the Institute of Computing Technology, Chinese Academy of Sciences, as the general segmentation vocabulary, and the "business vocabulary" contains the industry-specific proper nouns and can be built from the business names imported for the industry;
Step 1-2, automatically segment each one-level word to form candidate secondary segmentations;
Step 1-3, manually screen the candidate secondary segmentations;
the secondary segmentation dictionary built in this way has the format: one-level word, followed by the array of its secondary segmentations (separated by |).
3. The new-generation domain-knowledge full-text search method according to claim 1, characterized in that the building of the full index described in step 2 mainly realizes the full build of the data index file, its main process being as follows:
Step 2-1, read each knowledge-point document and segment it: during segmentation, the ordinary segmentation dictionary is combined with the hypernym/hyponym dictionary carrying semantic relations, producing multiple groups of segmentation results sorted by the number of words in each group and by word length; segmentation reads line by line, cuts each line at certain punctuation marks into short fragments, and segments each fragment using the segmentation dictionary and the hypernym/hyponym word-class dictionary (a word class being a collective name for a group of words with the same or similar meaning); to decide whether to index by word class or by word, the following rules apply:
1. if a word has exactly one word class (and that class is not a redundant word class), it is indexed under the word-class name;
2. if a word has more than one word class, an index entry is built for each of its word classes (excluding redundant word classes);
3. if a word is in the dictionary but has no word class, it is indexed under the word itself;
4. if a word is in the dictionary but carries only a redundant word class, no index entry is built for it;
Step 2-2, build the index: an index structure is created for every word/word class.
4. The new-generation domain-knowledge full-text search method according to claim 3, characterized in that said step 2-2 further comprises the following steps:
Step 2-2-1, build the index file: an index is a means of finding the corresponding documents from index terms; in English text, words are directly separated by spaces, while Chinese text is segmented with the segmentation tool ICTClass of the Institute of Computing Technology, Chinese Academy of Sciences, and the words produced by segmentation serve directly as index terms for word-level or word-class-level indexing;
the index is built as an inverted file; the procedure is: each document is processed in turn, recording the occurrence position of every word it contains together with the word class the word belongs to, so that each word occurring in each document yields a triple <DocID (document ID), TermID (word ID) | WordClassID (word-class ID), Positions (a list of positions)>, where Positions records the positions at which index term TermID occurs in DocID;
the index structure comprises:
<TermID | WordClassID (word ID | word-class ID), <DocID (document ID), <start position and line number of the word in the document, <index object>>>>
This structure is a three-level nested object, described below from the inside out:
the innermost index object:
int StartIndex;   // the start position at which the word or word class occurs
short int Length; // the length of the word or word class
the third layer: NkiInt2Ptr type <start position and line number of the word in the document, <index object>>
the second layer: NkiString2Ptr type <document ID, third-layer structure>
the first layer: NkiString2Ptr type <word class/word, second-layer structure>
the meaning of each field in the index:
word class/word: obtained by segmenting the document content;
document ID: the unique identifier of a document; because one word class/word may appear in several documents, the document name is used here;
start position of the word in the document: since the same word may occur several times within one document, this information must be recorded, and it is also needed in the later scoring stage to compute the distance between words;
line number of the word in the document: by the format of a knowledge-point document, the first line is the business the knowledge point belongs to, the second line is its summary, and the third line is its detailed content; with this index structure, the index of all knowledge-point documents to be retrieved can be built;
Step 2-2-2, compress the index file: for each run file ordered by TermID (or WordClassID), a large number of <TermID, DocID, Freq, pos1, pos2, pos3, ..., posFreq> structures must be stored; entries with the same TermID are ordered by ascending DocID, and within each entry the position information pos is likewise in ascending order; during compression, the first step transforms the ascending integer sequences into their difference sequences (the gaps between originally adjacent integers); the second step encodes the resulting small integers with a coding method to achieve compression; all run files are then merged to obtain the final inverted-file index, each entry of which uses the same coding method as when the run files were generated, only without the TermID; the Delta Code coding method is adopted: an integer x >= 1 is encoded as the Gamma code of floor(log2 x) + 1, followed by the floor(log2 x)-bit binary representation of x - 2^floor(log2 x); suppose the <term, doc> pairs occurring in a document set total f; dividing f by the number n of distinct index terms, and again by the total number of documents N, gives p = f / (N*n), the probability that a randomly chosen document contains a randomly chosen index term; each occurrence of an index term in a document requires recording one DocID gap in the inverted-file index; taking the f <term, doc> pairs occurring in the inverted entries as drawn at random from all N*n possible <term, doc> pairs of the document set, this process is regarded as a Bernoulli process; under this assumption, the probability that a DocID gap equals x is the probability of x-1 consecutive documents without the particular index term followed by one that contains it, namely Pr[x] = (1 - p)^(x-1) * p, showing that x follows a geometric distribution; the implicit condition here is that the occurrences of <term, doc> pairs are independent and identically distributed (Bernoulli distributed).
5. The new-generation domain-knowledge full-text search method according to claim 1, characterized in that the building of the memory index described in step 4 completes the construction of the incremental memory index, the full memory index, and the in-memory segmentation dictionary, the main process comprising:
Step 4-1, builds internal memory dictionary for word segmentation: dictionary for word segmentation data are read in to internal memory, build internal memory dictionary for word segmentation data structure;
Step 4-2, builds the full memory index:
for every word W_i in Token_i:
    if W_i is not a single character:
        if W_i has a word class C_i:
            use C_i to index the document set D = {d_1, d_2, ...};
        else:
            use W_i for indexing;
    else:
        use W_i for indexing;
Step 4-3, builds the incremental memory index;
Step 4-3-1, segments the incremental documents;
Step 4-3-2, incrementally updates the existing in-memory segmentation-dictionary structure;
Step 4-3-3, incrementally updates the existing memory index.
6. domain knowledge text searching method of new generation according to claim 1, is characterized in that: the full-text search described in step 5: full-text search process is as follows:
Step 5-1, normalize the user question: the question submitted by the consulting user is normalized (also called standardization processing): redundant words are removed, punctuation marks that carry no semantic information are removed, known errors are corrected, and aliases are mapped to standard names;
Step 5-2, segmentation: segment the normalized question;
Step 5-2-1, segment the user query;
Step 5-2-2, the operator enters a key phrase, denoted query;
Step 5-2-3, segment query with the first-level and second-level segmenters to obtain several segmentation results, called the pre-correction segmentation Seg1;
Step 5-2-4, apply error correction to the pre-correction segmentation to obtain further segmentation results, called the post-correction segmentation Seg2; let Seg = Seg1 ∪ Seg2 = {{W_1, W_2, ...}, ...}; the segmentation results are sorted in ascending order of segment count. The correction procedure is:
(1) first correct with the error-correction dictionary;
(2) then correct with statistical information;
Precondition for applying a correction:
(1) for each word in query, record the historical search frequency T of the uncorrected word and the historical search frequency T' of the corrected keyword;
a) if T' >> T, search with the corrected word and show a hint to the user;
b) otherwise, search with the uncorrected word;
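The frequency precondition of step 5-2-4 can be sketched as below (illustrative Python, not part of the claim; the history counts and the ratio used to interpret T' >> T are invented assumptions):

```python
def choose_query_form(word, corrected, history, ratio=10):
    """Decide whether to search with the corrected word (step 5-2-4 precondition).

    history: {word: historical search frequency}
    ratio:   how much larger T' must be than T to count as "T' >> T" (assumed)
    """
    t = history.get(word, 0)                  # frequency T of the uncorrected word
    t_corrected = history.get(corrected, 0)   # frequency T' after correction
    if t_corrected > ratio * max(t, 1):
        return corrected, True                # search corrected form, show a hint
    return word, False                        # search the original form, no hint

# Hypothetical history counts: the misspelling is rare, the corrected form common.
history = {"宽代": 2, "宽带": 500}
term, hint = choose_query_form("宽代", "宽带", history)
print(term, hint)
```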
Step 5-3, semantic understanding: process the segmentation results and extract the word class or standard word of each segment occurring in the question, obtaining the segment semantic information;
Step 5-3-1, tag the segmentation results with word classes; the result is Token = {{W_1(C_1), W_2(C_2), ...}, ...}; the word class represents the meaning of a segment; each segment can have several meanings and therefore belong to several word classes; Token_i denotes the i-th segmentation result;
Step 5-3-2, consult the query history:
find in the consultation history library the historical consultation query' most similar to query; if one is found, return the retrieval result of query' as the Top1 document;
The consultation similarity is defined as follows:
Sim(Token_i(query), Token_j(query')) = Sim({W_1(C_1), W_2(C_2), ...}, {W_1'(C_1'), W_2'(C_2'), ...})
= avg(a · sem_sim(C_i, C_i') + (1 - a) · syn_sim(W_i, W_i')), where:
(1) avg() is the mean-value function and a is a weighting coefficient;
(2) sem_sim(C_i, C_i') = 1 if C_i ∩ C_i' ≠ φ, and 0 otherwise;
(3) syn_sim(W_i, W_i') = the number of characters that W_i and W_i' have in common, divided by the number of characters in which W_i and W_i' differ;
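A runnable sketch of the consultation-similarity measure defined above (illustrative Python, not part of the claim; the weight a, the handling in syn_sim of segments with no differing characters, and the sample tokens are assumptions; word classes are modeled as sets so that sem_sim tests for a shared class):

```python
def sem_sim(ci, cj):
    # 1 if the two segments share at least one word class, else 0
    return 1 if set(ci) & set(cj) else 0

def syn_sim(wi, wj):
    # characters the segments share, divided by the characters they differ in
    shared = len(set(wi) & set(wj))
    different = len(set(wi) ^ set(wj))
    return shared / different if different else float(shared > 0)

def consult_sim(tokens1, tokens2, a=0.5):
    """avg(a*sem_sim(Ci, Ci') + (1-a)*syn_sim(Wi, Wi')) over aligned segments."""
    scores = [a * sem_sim(c1, c2) + (1 - a) * syn_sim(w1, w2)
              for (w1, c1), (w2, c2) in zip(tokens1, tokens2)]
    return sum(scores) / len(scores)

# Hypothetical tokens: (segment, set of word classes)
q1 = [("宽带", {"C_net"}), ("费用", {"C_fee"})]
q2 = [("宽带", {"C_net"}), ("资费", {"C_fee"})]
print(consult_sim(q1, q2))
```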
Step 5-4, semantic expansion: expand the segment semantic information to obtain the expanded semantic information, represented by a set of words or word classes; Token_i is expanded with synonyms, and the expanded segmentation is denoted EToken_i;
Step 5-5, obtain candidate documents: using the words or word classes obtained by semantic expansion (the semantic information these words or word classes represent), search the in-memory index for the corresponding full-text documents, which become the candidate documents;
Step 5-6, rank the candidate documents: score the candidates from multiple angles; the higher the score, the higher the rank; the ranked candidates form the final full-text search result;
all documents in D are scored, and the scoring considers the following factors:
(1) SegNum: the number of segments in query;
(2) SegWordWgt: the segment's own weight (words in the title, service, or abstract receive a higher weight);
(3) DocWordWgt: the weight of the segment within the document;
(4) DocHits: the click count of the indexed document;
(5) DocTime: the indexing time of the document;
(6) HitWordWgts: the weights of the query words that appear in the document;
(7) MissedWordWgts: the weights of the query words that do not appear in the document;
(8) WordSpan(W_1, W_2, ..., d): the pairwise distances in the document between the segments of query;
Step 5-6-1:
Credit(d) = HitWordWgts / (HitWordWgts + MissedWordWgts)
WordWgt(w_i, d) = doc_word_wgt(w_i, d) · PosiWgt(w_i)
doc_word_wgt(w_i, d) = tfidf(w_i, d)
PosiWgt(w_i) = 2.0 (to be tuned), if the word appears in the title or the service;
             = 1.5 (to be tuned), if the word appears in the abstract;
             = 1.0, otherwise;
Step 5-6-2:
Credit(d) *= 1 / log2(SegNum + 1)
select the Top_N documents by Credit(d) (N to be determined), ordered by time;
Step 5-6-3:
Credit(d) /= (WordSpan(w_1, ..., w_n, d) + 1)
WordSpan(w_1, ..., w_n, d) = Σ_{1<=i<j<=n} f(number of characters between w_i and w_j in d), where f is a function of the pairwise interval;
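The three adjustments of steps 5-6-1 to 5-6-3 can be combined into one scoring sketch (illustrative Python, not part of the claim; the tf-idf values, document fields, and WordSpan value are invented stand-ins for the real statistics):

```python
import math

def posi_wgt(word, doc):
    # tunable positional weights from step 5-6-1
    if word in doc.get("title", "") or word in doc.get("service", ""):
        return 2.0
    if word in doc.get("abstract", ""):
        return 1.5
    return 1.0

def credit(doc, query_words, tfidf, word_span, seg_num):
    """Score one candidate document d (steps 5-6-1 to 5-6-3)."""
    hit = sum(tfidf.get(w, 0.0) * posi_wgt(w, doc)
              for w in query_words if w in doc["text"])
    missed = sum(tfidf.get(w, 0.0)
                 for w in query_words if w not in doc["text"])
    score = hit / (hit + missed) if hit + missed else 0.0   # step 5-6-1
    score *= 1 / math.log2(seg_num + 1)                     # step 5-6-2
    score /= word_span + 1                                  # step 5-6-3
    return score

# Hypothetical document and statistics
doc = {"title": "宽带费用", "service": "", "abstract": "", "text": "宽带费用说明"}
s = credit(doc, ["宽带", "费用"], {"宽带": 1.2, "费用": 0.8},
           word_span=0, seg_num=2)
print(round(s, 4))
```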
Step 5-7, save the consultation history:
the system's first Top1 result: Token_1 = W_1(C_1) W_2(C_2) ... W_n(C_n) → d_k;
the user's selection: Token_2 = W_1'(C_1') W_2'(C_2') ... W_n'(C_n') → d_j;
if k ≠ j, and (HistoryTop1(Token_2) = φ) or (HistoryTop1(Token_2) ≠ d_k),
prompt the user for feedback; the feedback is implemented as follows:
Step 5-7-1, when j > 2, after the operator has viewed the document and closes it, a feedback dialog box pops up to confirm whether the user is satisfied with the query result;
Step 5-7-2, if the operator chooses yes, save HistoryTop1(Token_2) = d_j in the following form:
<Query, Token_2, doc_type (document type), doc_id (document id), doc_id_value (document id value)>.
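The record saved in step 5-7-2 can be sketched as a simple tuple store (illustrative Python, not part of the claim; the field values are invented):

```python
# In-memory consultation history, keyed by the user-selected token sequence.
history_top1 = {}

def save_feedback(query, token2, doc_type, doc_id, doc_id_value):
    """Persist <Query, Token2, doc_type, doc_id, doc_id_value> per step 5-7-2."""
    record = (query, token2, doc_type, doc_id, doc_id_value)
    history_top1[token2] = record
    return record

# Hypothetical feedback: the user picked document A17 at "Abstract" granularity.
rec = save_feedback("宽带资费", ("宽带", "资费"), "Abstract", "A17", "D1024")
print(history_top1[("宽带", "资费")][2])
```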
7. The new-generation industry knowledge full-text search method according to claim 6, characterized in that the retrieval granularity of the full-text search is one of "Service", "Topic", "Abstract", and "Mix", where Service represents the service classification information of the full-text document; Topic represents the first-level knowledge point of the service; Abstract represents the finest-grained knowledge point, i.e. the most detailed query; and Mix means that results at every knowledge granularity are returned.
CN201210461748.2A 2012-11-16 2012-11-16 New-generation industry knowledge full-text search method Pending CN103823799A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210461748.2A CN103823799A (en) 2012-11-16 2012-11-16 New-generation industry knowledge full-text search method


Publications (1)

Publication Number Publication Date
CN103823799A true CN103823799A (en) 2014-05-28

Family

ID=50758872

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210461748.2A Pending CN103823799A (en) 2012-11-16 2012-11-16 New-generation industry knowledge full-text search method

Country Status (1)

Country Link
CN (1) CN103823799A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6169999B1 (en) * 1997-05-30 2001-01-02 Matsushita Electric Industrial Co., Ltd. Dictionary and index creating system and document retrieval system
JP2006178599A (en) * 2004-12-21 2006-07-06 Fuji Xerox Co Ltd Document retrieval device and method
CN101620607A (en) * 2008-07-01 2010-01-06 全国组织机构代码管理中心 Full-text retrieval method and full-text retrieval system

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104361009A (en) * 2014-10-11 2015-02-18 北京中搜网络技术股份有限公司 Real-time indexing method based on reverse index
CN104361009B (en) * 2014-10-11 2017-10-31 北京中搜网络技术股份有限公司 A kind of real time indexing method based on inverted index
WO2016112832A1 (en) * 2015-01-12 2016-07-21 杏树林信息技术(北京)有限公司 Medical information search engine system and search method
CN105988996A (en) * 2015-01-27 2016-10-05 腾讯科技(深圳)有限公司 Index file generation method and device
CN105989088A (en) * 2015-02-12 2016-10-05 马正方 Learning device under digital environment
CN105989088B (en) * 2015-02-12 2019-05-14 马正方 Learning device under digitized environment
CN104699831A (en) * 2015-03-31 2015-06-10 佛山市金蓝领教育科技有限公司 Atomic word knowledge management system
CN105528411B (en) * 2015-12-03 2019-08-20 中国人民解放军海军工程大学 Apparel interactive electronic technical manual full-text search device and method
CN105528411A (en) * 2015-12-03 2016-04-27 中国人民解放军海军工程大学 Full-text retrieval device and method for interactive electronic technical manual of shipping equipment
CN105677636A (en) * 2015-12-30 2016-06-15 上海智臻智能网络科技股份有限公司 Information processing method and device for intelligent question-answering system
CN105808678A (en) * 2016-03-03 2016-07-27 黄川东 Construction method of standard retrieval and application system
CN105955982A (en) * 2016-04-18 2016-09-21 上海泥娃通信科技有限公司 Method and system for information sequence feature encoding and retrieval
CN108205578A (en) * 2016-12-20 2018-06-26 北大方正集团有限公司 Index generation method and device
CN106874402A (en) * 2017-01-16 2017-06-20 腾讯科技(深圳)有限公司 Searching method and device
CN107506473A (en) * 2017-09-05 2017-12-22 郑州升达经贸管理学院 A kind of big data search method based on cloud computing
CN108595529A (en) * 2018-03-30 2018-09-28 苏州风中智能科技有限公司 A kind of device of retrieval software function
CN109583744A (en) * 2018-11-26 2019-04-05 安徽继远软件有限公司 A kind of cross-system account matching system and method based on Chinese word segmentation
CN109885641A (en) * 2019-01-21 2019-06-14 瀚高基础软件股份有限公司 A kind of method and system of database Chinese Full Text Retrieval
CN110297829A (en) * 2019-06-26 2019-10-01 重庆紫光华山智安科技有限公司 A kind of text searching method and system towards specific industry structuring business datum
CN110851559A (en) * 2019-10-14 2020-02-28 中科曙光南京研究院有限公司 Automatic data element identification method and identification system
CN110908998A (en) * 2019-11-13 2020-03-24 广联达科技股份有限公司 Data storage and search method, system and computer readable storage medium
CN111767378A (en) * 2020-06-24 2020-10-13 北京墨丘科技有限公司 Method and device for intelligently recommending scientific and technical literature
CN111881328A (en) * 2020-07-30 2020-11-03 百度在线网络技术(北京)有限公司 Information pushing method and device, electronic equipment and storage medium
CN114298055A (en) * 2021-12-24 2022-04-08 浙江大学 Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN114298055B (en) * 2021-12-24 2022-08-09 浙江大学 Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium
CN115455147A (en) * 2022-09-09 2022-12-09 浪潮卓数大数据产业发展有限公司 Full-text retrieval method and system
CN116431930A (en) * 2023-06-13 2023-07-14 天津联创科技发展有限公司 Technological achievement conversion data query method, system, terminal and storage medium
CN117953875A (en) * 2024-03-27 2024-04-30 成都启英泰伦科技有限公司 Offline voice command word storage method based on semantic understanding

Similar Documents

Publication Publication Date Title
CN103823799A (en) New-generation industry knowledge full-text search method
CN109492077B (en) Knowledge graph-based petrochemical field question-answering method and system
Oliveira et al. Assessing shallow sentence scoring techniques and combinations for single and multi-document summarization
Cohen et al. Learning to match and cluster large high-dimensional data sets for data integration
CN107122413A (en) A kind of keyword extracting method and device based on graph model
CN106446148A (en) Cluster-based text duplicate checking method
US20060129843A1 (en) Method and apparatus for electronically extracting application specific multidimensional information from documents selected from a set of documents electronically extracted from a library of electronically searchable documents
Sarawagi et al. Open-domain quantity queries on web tables: annotation, response, and consensus models
CN111143479A (en) Knowledge graph relation extraction and REST service visualization fusion method based on DBSCAN clustering algorithm
US20030115188A1 (en) Method and apparatus for electronically extracting application specific multidimensional information from a library of searchable documents and for providing the application specific information to a user application
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
CN110674252A (en) High-precision semantic search system for judicial domain
CN102193939A (en) Realization method of information navigation, information navigation server and information processing system
CN101398814A (en) Method and system for simultaneously abstracting document summarization and key words
CN103646032A (en) Database query method based on body and restricted natural language processing
CN111177591A (en) Knowledge graph-based Web data optimization method facing visualization demand
CN107145545A (en) Top k zone users text data recommends method in a kind of location-based social networks
CN106484797A (en) Accident summary abstracting method based on sparse study
US9626401B1 (en) Systems and methods for high-speed searching and filtering of large datasets
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
CN115905489B (en) Method for providing bidding information search service
CN101088082A (en) Full text query and search systems and methods of use
CN102467544B (en) Information smart searching method and system based on space fuzzy coding
CN105404677A (en) Tree structure based retrieval method
Yadav et al. Wavelet tree based dual indexing technique for geographical search.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140528

WD01 Invention patent application deemed withdrawn after publication