CN107402960A

CN107402960A - Inverted index optimization algorithm based on semantic and tone weighting

Info

Publication number: CN107402960A
Application number: CN201710453251.9A
Authority: CN
Inventors: 夏珺峥; 傅玉生
Original assignee: Chengdu Gifted Data Co Ltd
Current assignee: Chengdu Gifted Data Co Ltd
Priority date: 2017-06-15
Filing date: 2017-06-15
Publication date: 2017-11-28
Anticipated expiration: 2037-06-15
Also published as: CN107402960B

Abstract

The present invention discloses an inverted index optimization algorithm based on semantic and tone weighting, and relates to the technical field of document information processing. It addresses the technical problem in the prior art that indexing by the raw term frequency of specific words alone yields low inverted-index accuracy and makes indexing difficult. It also addresses the problem that, when the keyword sequence overlaps with the semantic weighting lexicon, semantic weighting becomes ineffective or cannot substantively change the inverted index. The invention combines the semantic and tone features of a document to construct a new definition of weighted term frequency, and sorts inverted-index entries by that weighted term frequency. The weighted term frequency reflects not only the frequency density of a keyword in a document but also the strength with which the keyword is expressed, so it better helps a searching user find the desired document first.

Description

An inverted index optimization algorithm based on semantic and tone weighting

Technical field

The present invention relates to the technical field of document information processing, and in particular to an inverted index optimization algorithm based on semantic and tone weighting.

Background art

Search engines have become the most commonly used internet tool, and data organization and indexing is a research focus in information science. The inverted index model looks up associated information in reverse by term frequency, which suits the way search engines work. However, term frequency alone, and weight-ranking strategies computed from term frequency, cannot fully reflect how strongly a keyword is expressed in a document.

By applying semantic and tone weighting, the present invention quantifies more completely the importance of a keyword to what a document expresses. An inverted-index method built on this weighted term frequency can better help users find the corresponding documents and information.

Summary of the invention

In view of the above prior art, the object of the present invention is to provide an inverted index optimization algorithm based on semantic and tone weighting. It addresses the technical problem that indexing by the raw term frequency of specific words alone yields low inverted-index accuracy and makes indexing difficult. It also addresses the problem that, when the keyword sequence overlaps with the semantic weighting lexicon, semantic weighting becomes ineffective or cannot substantively change the inverted index.

To achieve the above object, the present invention adopts the following technical solution:

An inverted index optimization algorithm based on semantic and tone weighting, comprising the following steps:

Step 1: preset a semantic stop-word set, then set a semantic-enhancing word set and a semantic-weakening word set with different semantic weights, both as subsets of the semantic stop-word set;

Step 2: perform word segmentation on each input document to obtain an ordered word sequence;

Step 3: match the ordered word sequence against the semantic stop-word set, filtering out during matching every word that appears in the semantic stop-word set, to obtain the keyword sequence of the input document;

Step 4: traverse the keyword sequence. After obtaining the tone weight of the current keyword, search the document's words, within the range from the current keyword's position to the position of its previous occurrence, for words that match the semantic-enhancing or semantic-weakening word set; compute the weighted term frequency of the current keyword from the semantic weights of the matched words combined with the tone weight; once the traversal is complete, the weighted term frequency of the document is obtained;

Step 5: sort by document weighted term frequency to obtain the document order of the optimized index.
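Steps 1 to 3 above can be sketched in Python as follows. This is only an illustrative sketch: the word lists, the weight values, and the whitespace-based `segment` helper are assumptions made for the example (the source names no segmenter and gives no concrete weights).

```python
# Hypothetical word sets and weights; the source specifies no concrete values.
S_POS = {"very": 1.5, "really": 1.25}         # semantic-enhancing words
S_NEG = {"possibly": 0.5, "seemingly": 0.75}  # semantic-weakening words

# S(pos) and S(neg) are subsets of S(stop), so they never become keywords.
S_STOP = {"the", "is", "a"} | set(S_POS) | set(S_NEG)


def segment(text):
    """Toy segmentation: lowercase and split on whitespace (stand-in for a real segmenter)."""
    return text.lower().split()


def keyword_sequence(l_org):
    """Return (position, word) pairs for the words of L(org) that are not in S(stop)."""
    return [(pos, word) for pos, word in enumerate(l_org) if word not in S_STOP]


l_org = segment("the service is very good")  # ordered word sequence L(org)
l_key = keyword_sequence(l_org)              # keyword sequence L(key)
```

Keeping the original positions in `l_key` is a design choice of this sketch: step 4 needs to look back into L(org) between keyword positions, so the positions must survive the filtering.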

In the above scheme, in step 1, the semantic-enhancing word set and the semantic-weakening word set are built from degree adverbs.

In the above scheme, in step 4, the tone weight of the current keyword is determined from the sentence-ending feature of the original sentence in which it occurs.

In the above scheme, in step 4, obtaining the tone weight comprises:

Step ①: define a preset tone weight associated with each sentence-ending punctuation mark;

Step ②: obtain the tone weight of the current keyword from the ending punctuation of the original sentence in which it occurs.
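A minimal sketch of steps ① and ②, assuming a simple punctuation table. The numeric weights are hypothetical; the source fixes only that statements weigh 1, exclamations more than 1, and questions less than 1. The helper names are likewise illustrative.

```python
import re

# Step ①: hypothetical tone weights keyed by sentence-ending punctuation.
TONE_WEIGHT = {".": 1.0, "。": 1.0, "!": 1.5, "！": 1.5, "?": 0.5, "？": 0.5}


def split_sentences(text):
    """Split text into sentences, keeping each sentence's ending punctuation."""
    parts = re.findall(r"[^.!?。！？]+[.!?。！？]?", text)
    return [p.strip() for p in parts if p.strip()]


def tone_weight(sentence):
    """Step ②: look up the weight of the final character; default to 1.0 (statement)."""
    return TONE_WEIGHT.get(sentence[-1], 1.0) if sentence else 1.0


sentences = split_sentences("It works. Is it fast? Great!")
weights = [tone_weight(s) for s in sentences]
```

Every keyword in a sentence would then inherit that sentence's weight as its tone weight.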

In the above scheme, in step 4, the weighted term frequency f_key of keyword key, whose occurrence at the current index lies in original sentence j_index, is defined as:

$$f_{key} = \sum_{index=1}^{n} \prod_{i=1}^{m} W_i \, Key_{index} \, W_j$$

where W_i is a semantic weight applied to keyword key, n is the number of occurrences of keyword key in the document, m is the number of words between keyword key and the preceding keyword that match the semantic-enhancing or semantic-weakening word set, and W_j is the tone weight.
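The definition can be read as a sum over occurrences of a product of weights. The sketch below assumes that each occurrence Key_index contributes a raw count of 1, which is an interpretation rather than something the source states, and the input layout (one tuple per occurrence) is hypothetical.

```python
from math import prod


def weighted_tf(occurrences):
    """Sum, over the n occurrences of a keyword, of prod(W_i) * 1 * W_j.

    occurrences: list of (semantic_weights, tone_weight) tuples, one per
    occurrence; semantic_weights holds the m matched weights W_i (possibly
    empty) and tone_weight is W_j for the containing sentence.
    """
    # prod([]) == 1, so an occurrence with no semantic word gets semantic weight 1.
    return sum(prod(sem) * tone for sem, tone in occurrences)


# Three occurrences: enhanced statement, plain exclamation, weakened question.
f_key = weighted_tf([([1.5], 1.0), ([], 1.5), ([0.5], 0.5)])
```

With these illustrative numbers the three occurrences contribute 1.5, 1.5 and 0.25, so the weighted term frequency is 3.25 where the raw term frequency would be 3.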

A method for determining the weighted term frequency of a document, comprising the following steps:

Step 1: set a lexicon whose entries carry different semantic weights;

Step 2: match the candidate keywords of the document against the lexicon, and take all candidate keywords that are not matched as the keyword sequence;

Step 3: quantify the sentence-ending features of original sentences by category, determining the tone weight corresponding to each kind of sentence-ending feature; then determine the tone weight of each keyword in the keyword sequence from the sentence-ending feature of the original sentence in which it occurs;

Step 4: within the range from the current keyword's position to the position of its previous occurrence, search the document's words for words that match the lexicon; obtain the semantic weight of the current keyword from the matched words; compute the weighted term frequency of the current keyword as a weight product with the tone weight; then traverse the keyword sequence and obtain the weighted term frequency of the document by summation.

In the above scheme, in step 1, a semantic stop-word set is preset, and the semantic-enhancing word set and the semantic-weakening word set are then set as subsets of the semantic stop-word set.

A method for determining the semantic weight of a keyword, comprising the following steps:

Step 1: set a semantic-enhancing word set and a semantic-weakening word set whose intersection with the keyword set is empty, the two sets carrying different semantic weights;

Step 2: within the range from the keyword's position to the position of its previous occurrence, search the document's words for words that match the semantic-enhancing or semantic-weakening word set;

Step 3: from the semantic weights of the matched words, compute the semantic weight of the keyword as their product.
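The three steps above might look like this in Python. The lexicon contents and weights are hypothetical, and the caller supplies the search range explicitly, so the sketch takes no position on exactly where that range begins.

```python
from math import prod

# Hypothetical lexicon: enhancing words weigh more than 1, weakening words less than 1.
SEMANTIC_WEIGHTS = {"very": 1.5, "really": 1.25, "possibly": 0.5}


def semantic_weight(l_org, start, end):
    """Product of the weights of lexicon words found in l_org[start:end].

    Returns 1 when no word in the range matches, i.e. the keyword is unmodified.
    """
    return prod(SEMANTIC_WEIGHTS[w] for w in l_org[start:end] if w in SEMANTIC_WEIGHTS)


l_org = ["service", "is", "really", "very", "good"]
w_good = semantic_weight(l_org, 1, 4)     # words between "service" and "good"
w_service = semantic_weight(l_org, 0, 0)  # nothing precedes "service"
```

Because the weights multiply, stacked modifiers such as "really very" compound rather than add.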

Compared with the prior art, the beneficial effects of the present invention are as follows:

The inverted-index entries of the present invention are sorted by weighted term frequency, which reflects not only the frequency density of a keyword in a document but also the strength with which the keyword is expressed, so it better helps a searching user find the desired document first. The invention also overcomes technical obstacles present in the prior art: how to construct the stop-word set and define the enhancing/weakening word sets as its subsets (so that these words are first filtered out and then reused), how to weight accurately and quantitatively using tone and semantics, and how to prevent keywords that duplicate semantic words from rendering the semantic weighting ineffective.

Brief description of the drawings

Fig. 1 is a schematic diagram of the main processing flow of the present invention;

Fig. 2 is a schematic diagram of the processing flow of an embodiment of the present invention.

Detailed description of the embodiments

All features disclosed in this specification, and all steps of any method or process disclosed, may be combined in any way, except for mutually exclusive features and/or steps.

The present invention is further described below with reference to the accompanying drawings:

An inverted index optimization algorithm based on semantic and tone weighting comprises the following steps:

S0: preset the semantic-enhancing word set S(pos), the semantic-weakening word set S(neg), and the semantic stop-word set S(stop), where S(pos) and S(neg) are subsets of S(stop);

S1: perform word segmentation on each input document, representing the document as an ordered word sequence L(org);

S2: process the stop words in L(org): scan the words of L(org) one by one and filter out those that appear in S(stop), obtaining the document keyword sequence L(key);

S3: compute the weighted term frequency of each keyword, checking the tone of the original sentence in which the keyword occurs, and sum:

$$f_{key} = \sum_{index=1}^{n} \prod_{i=1}^{m} W_i \, Key_{index} \, W_j$$

where W_i is a semantic weight applied to keyword key, n is the number of occurrences of keyword key in the document, m is the number of enhancing/weakening semantic words between keyword key and the preceding keyword, and W_j is the tone weight;

S4: sort by document weighted term frequency: within the document collection, documents are indexed and ordered by weighted term frequency.

Further, the semantic-enhancing word set is preset. It contains adverbs, particles and the like that positively reinforce the meaning of a word, including but not limited to, for example: "very", "really", "especially", "quite".

Further, the semantic-weakening word set is preset. It contains adverbs, particles and the like that weaken semantic expression and reduce the certainty of a statement, including but not limited to, for example: "possibly", "probably", "a little", "vaguely", "seemingly", "whether".

Further, both word sets are preset resources.

Further, as the weighted term frequency formula shows, the semantic weight of each keyword is positively correlated with the product of the weights of the semantic words preceding it. If no semantic word precedes the keyword, the weight is 1.

Further, W_j is the tone weight. The tone weight derives from the tone of the sentence and is applied to every keyword in that sentence. A sentence ending in a period is a statement, with a weight of 1. A sentence ending in an exclamation mark is exclamatory or imperative; its tone is strong, and its weight is greater than 1. A sentence ending in a question mark expresses a question or a rhetorical question, indicating uncertainty on the part of the speaker or the other party; its tone weakens the meaning, and its weight is less than 1.
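As a worked illustration of how the two weights combine, assume (hypothetically, since the source gives no numeric values) that "very" carries a semantic weight of 1.5 and "possibly" a semantic weight of 0.5, and that exclamatory and interrogative sentences carry tone weights of 1.5 and 0.5. In the document "The food is very good! Is the service possibly good?", the keyword "good" occurs twice (n = 2), each time preceded by one semantic word (m = 1):

```latex
\begin{aligned}
f_{good} &= \underbrace{1.5 \times 1.5}_{\text{very, exclamation}}
          + \underbrace{0.5 \times 0.5}_{\text{possibly, question}} \\
         &= 2.25 + 0.25 = 2.5
\end{aligned}
```

The raw term frequency of "good" would be 2 in this document. The emphatic occurrence dominates the weighted value, while the hedged, questioned occurrence contributes little.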

Embodiment 1

As shown in Fig. 2, one implementation of the present invention proceeds as follows:

S01: load the preset semantic-enhancing lexicon S(pos) and semantic-weakening lexicon S(neg), together with their weights;

S02: read any one document from the document library;

S03: segment the document to obtain its ordered word sequence L(org); filter stop words out of L(org) to obtain the keyword sequence L(key);

S04: for each key in L(key), obtain its tone weight Wj(key) from the sentence in which it occurs;

S05: traverse L(key), recording each keyword key and its adjacent keyword key-1 on the left;

S06: traverse L(org), checking whether each word between key-1 and key is present in S(pos) or S(neg);

S07: when a semantic enhancing/weakening word is found, multiply its weight into the weighted term frequency of keyword key; locate the ending punctuation of the sentence containing keyword key, and multiply the tone weight determined by that punctuation into the weighted term frequency of keyword key;

S08: during the traversal, add the weighted term frequency of key computed in the current context to the weighted term frequency of keyword key accumulated over previous contexts;

S09: if L(key) still contains unprocessed keywords, jump to step S05 and continue;

S10: if the document collection still contains unprocessed documents, jump to step S02 and continue;

S11: index a list of documents under each keyword key; the position of each document in the list is determined by sorting in descending order of the weighted term frequency of keyword key in that document.
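Steps S01 to S11 can be put together in a compact sketch. It is not the patented implementation: the lexicons, weights, whitespace segmentation and English sample documents are all assumptions made for illustration, and the names `doc_weighted_tf` and `build_index` are hypothetical.

```python
import re
from collections import defaultdict

# S01: hypothetical lexicons and weights (placeholders, not values from the source).
SEM = {"very": 1.5, "really": 1.25, "possibly": 0.5, "seemingly": 0.75}
S_STOP = {"the", "is", "a", "it"} | set(SEM)
TONE = {".": 1.0, "!": 1.5, "?": 0.5}


def doc_weighted_tf(text):
    """Return {keyword: weighted term frequency} for one document (S03-S09)."""
    scores = defaultdict(float)
    sem = 1.0  # product of semantic weights seen since the previous keyword
    # S04: split into (sentence body, ending punctuation) pairs.
    for body, end in re.findall(r"([^.!?]+)([.!?]?)", text):
        tone = TONE.get(end, 1.0)
        for word in body.lower().split():  # S03: toy segmentation
            if word in SEM:
                sem *= SEM[word]           # S06-S07: semantic word before the next keyword
            elif word not in S_STOP:
                scores[word] += sem * tone  # S07-S08: weight this occurrence, accumulate
                sem = 1.0                   # start a fresh range after each keyword
    return dict(scores)


def build_index(docs):
    """S10-S11: map each keyword to (doc_id, score) pairs in descending score order."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for key, score in doc_weighted_tf(text).items():
            index[key].append((doc_id, score))
    for postings in index.values():
        postings.sort(key=lambda p: p[1], reverse=True)
    return dict(index)


docs = {
    "d1": "The food is good.",
    "d2": "The food is very good!",
    "d3": "Is the food possibly good?",
}
index = build_index(docs)
```

Each sample document contains "good" exactly once, so a raw term-frequency ranking could not separate them. Under the assumed weights the emphatic document d2 ranks first for "good" and the hedged question d3 ranks last.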

The above is only a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims

1. An inverted index optimization algorithm based on semantic and tone weighting, characterized by comprising the following steps:

Step 1: preset a semantic stop-word set S(stop), then set a semantic-enhancing word set S(pos) and a semantic-weakening word set S(neg) with different semantic weights, both as subsets of the semantic stop-word set S(stop);

Step 2: perform word segmentation on each input document to obtain an ordered word sequence L(org);

Step 3: match the ordered word sequence L(org) against the semantic stop-word set S(stop), filtering out during matching every word that appears in S(stop), to obtain the keyword sequence L(key) of the input document;

Step 4: traverse the keyword sequence L(key). After obtaining the tone weight of the current keyword, search the document's words, within the range from the current keyword's position to the position of its previous occurrence, for words that match the semantic-enhancing word set S(pos) or the semantic-weakening word set S(neg); compute the weighted term frequency of the current keyword from the semantic weights of the matched words combined with the tone weight; once the traversal is complete, the weighted term frequency of the document is obtained;

2. The inverted index optimization algorithm based on semantic and tone weighting according to claim 1, characterized in that, in step 1, the semantic-enhancing word set S(pos) and the semantic-weakening word set S(neg) are built from degree adverbs.

3. The inverted index optimization algorithm based on semantic and tone weighting according to claim 1, characterized in that, in step 4, the tone weight of the current keyword is determined from the sentence-ending feature of the original sentence in which it occurs.

4. The inverted index optimization algorithm based on semantic and tone weighting according to claim 3, characterized in that, in step 4, obtaining the tone weight comprises:

5. The inverted index optimization algorithm based on semantic and tone weighting according to claim 1, characterized in that, in step 4, the weighted term frequency f_key of the current keyword key in original sentence j_index is defined as:

$$f_{key} = \sum_{index=1}^{n} \prod_{i=1}^{m} W_i \, Key_{index} \, W_j$$

where W_i is a semantic weight applied to keyword key, n is the number of occurrences of keyword key in the document, m is the number of words between keyword key and the preceding keyword that match the semantic-enhancing word set S(pos) or the semantic-weakening word set S(neg), and W_j is the tone weight.

6. A method for determining the weighted term frequency of a document, characterized by comprising the following steps:

Step 1: set a lexicon whose entries carry different semantic weights;

Step 2: match the candidate keywords of the document against the lexicon, and take all candidate keywords that are not matched as the keyword sequence L(key);

Step 3: quantify the sentence-ending features of original sentences by category, determining the tone weight corresponding to each kind of sentence-ending feature; then determine the tone weight of each keyword in the keyword sequence L(key) from the sentence-ending feature of the original sentence in which it occurs;

Step 4: within the range from the current keyword's position to the position of its previous occurrence, search the document's words for words that match the lexicon; obtain the semantic weight of the current keyword from the matched words; compute the weighted term frequency of the current keyword as a weight product with the tone weight; then traverse the keyword sequence L(key) and obtain the weighted term frequency of the document by summation.

7. The method for determining the weighted term frequency of a document according to claim 6, characterized in that, in step 1, a semantic stop-word set S(stop) is preset, and the semantic-enhancing word set S(pos) and the semantic-weakening word set S(neg) are then set as subsets of the semantic stop-word set S(stop).

8. A method for determining the semantic weight of a keyword, characterized by comprising the following steps:

Step 1: set a semantic-enhancing word set S(pos) and a semantic-weakening word set S(neg) whose intersection with the keyword set is empty, the two sets carrying different semantic weights;

Step 2: within the range from the keyword's position to the position of its previous occurrence, search the document's words for words that match the semantic-enhancing word set S(pos) or the semantic-weakening word set S(neg);

Step 3: from the semantic weights of the matched words, compute the semantic weight of the keyword as their product.