CN106372956A - Method and system for intention entity recognition based on user query log

Info

Publication number: CN106372956A
Application number: CN201510440013.5A
Authority: CN
Inventors: 孙鹏飞; 李春生; 金阳春
Original assignee: Suning Commerce Group Co Ltd
Current assignee: NANJING SUNING ELECTRONIC INFORMATION TECHNOLOGY Co.,Ltd.
Priority date: 2015-07-23
Filing date: 2015-07-23
Publication date: 2017-02-01
Anticipated expiration: 2035-07-23
Also published as: CN106372956B

Abstract

The invention relates to the field of e-commerce and discloses a method for intention entity recognition based on a user query log. The method comprises the following steps: an original log is parsed and extracted to form a session query set, a common click query set and a common query commodity set; mutually disjoint subsets (I, Q) are obtained through processing, a bipartite graph is built from the click relation, and word clustering is carried out; according to the word clustering result, each query word is segmented into multiple segments; and a weight is calculated for each segment to obtain its score, and the segment with the highest score serves as the entity corresponding to the query word. By analyzing the original log and applying word clustering and weight analysis to the query words, intention entity recognition is realized, user context and semantic background are incorporated, query accuracy is improved, a better intention recognition effect is achieved, and online computation overhead is saved.

Description

A method and system for intention entity recognition based on user search logs

Technical field

The present invention relates to the field of e-commerce, and more particularly to a method for intention entity recognition based on user search logs.

Background

In an era when the Internet is pervasive, people's way of shopping has gradually shifted from physical stores to e-commerce websites. This shift gives people not only a larger range of choice but also a more convenient shopping experience. At present, because different users are influenced by factors such as education, culture and region, the query words they enter to describe the same commodity may differ greatly. It is therefore necessary to perform intention entity recognition on query words.

Existing intention entity recognition models mainly rely on machine learning: learning samples are labeled manually, an entity recognition model is trained, query words are labeled, and then, according to priority rules, the label with the highest priority is taken as the core word. Such models can solve the intention entity recognition problem to a certain extent. However, some existing methods lack analysis of user context or semantic background, which leads to errors in search results. For example, one query word is "Samsung mobile phone" and another is "Xiaomi mobile phone". Both contain the word "mobile phone", but analysis of user context and semantic background in the search log shows that these two query words are more concerned with brand information, which existing intention entity recognition models cannot capture.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for intention entity recognition based on user search logs, so as to solve the problem that search results contain errors because analysis of user context or semantic background is missing.

To solve the above technical problem, the present invention provides a method for intention entity recognition based on user search logs, the method comprising the steps of:

S1: parse and extract the original search log to obtain the query word of each click and its corresponding commodity information;

S2: form a session query set s_session, a common click query set s_query and a common query commodity set s_item;

S3: represent the common click query set s_query and the common query commodity set s_item as a bipartite graph, and process the vertices of the bipartite graph to obtain out-degree sets a1 and a2;

S4: merge the sets a1 and a2 to obtain a word clustering result a;

S5: segment each query word in the word clustering result a using a word segmentation technique, and calculate the weight of each segment;

S6: select the segment with the highest score as the entity corresponding to the query word.
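Step S1 can be sketched as follows. The patent does not specify the log layout, so this is a minimal sketch assuming a hypothetical tab-separated format (session ID, query word, clicked commodity ID); the names `Click` and `parse_click_log` are illustrative only.

```python
from typing import Iterable, List, NamedTuple


class Click(NamedTuple):
    """One click event: which query led to a click on which commodity."""
    session_id: str
    query: str
    item_id: str


def parse_click_log(lines: Iterable[str]) -> List[Click]:
    """Parse raw log lines into click records.

    Assumed (hypothetical) format: session_id<TAB>query<TAB>item_id.
    Lines that do not have exactly three non-empty fields are skipped.
    """
    clicks = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # malformed line
        session_id, query, item_id = (p.strip() for p in parts)
        if session_id and query and item_id:
            clicks.append(Click(session_id, query, item_id))
    return clicks


raw_log = [
    "s1\tsamsung mobile phone\titem_1\n",
    "s1\tsamsung galaxy\titem_1\n",
    "s2\txiaomi mobile phone\titem_2\n",
    "broken line without tabs\n",
]
clicks = parse_click_log(raw_log)
```

Each resulting record pairs a clicked query word with its commodity, which is the input the later steps work from.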

Preferably, step S2 comprises the steps of:

S201: process each session unit of the user to obtain the query word set corresponding to each session unit;

S202: obtain the set of query words with which users clicked the same commodity;

S203: obtain the set of different commodities that users clicked under the same query word;

S204: merge and de-duplicate the query word sets of the session units, the query word sets of commonly clicked commodities and the commodity click data sets of common query words, to obtain the session query set s_session, the common click query set s_query and the common query commodity set s_item.
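Steps S201 to S204 can be sketched as a single pass over click records. This is a sketch under assumptions: the record layout `(session_id, query, item_id)` and the function name `build_query_sets` are hypothetical, and de-duplication is delegated to Python sets.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record layout: (session_id, query, item_id)
ClickRecord = Tuple[str, str, str]


def build_query_sets(
    clicks: Iterable[ClickRecord],
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build s_session, s_query and s_item from click records."""
    s_session: Dict[str, Set[str]] = defaultdict(set)  # session -> query words
    s_query: Dict[str, Set[str]] = defaultdict(set)    # commodity -> query words that clicked it
    s_item: Dict[str, Set[str]] = defaultdict(set)     # query word -> commodities clicked
    for session_id, query, item_id in clicks:
        s_session[session_id].add(query)  # S201
        s_query[item_id].add(query)       # S202
        s_item[query].add(item_id)        # S203
    # S204: using sets means repeated clicks are de-duplicated automatically.
    return dict(s_session), dict(s_query), dict(s_item)


records = [
    ("s1", "samsung mobile phone", "item_1"),
    ("s1", "samsung galaxy", "item_1"),
    ("s1", "samsung galaxy", "item_1"),  # duplicate click
    ("s2", "samsung mobile phone", "item_3"),
]
s_session, s_query, s_item = build_query_sets(records)
```

Here `s_query["item_1"]` groups the two query words that led to the same commodity, and `s_item["samsung mobile phone"]` groups the two commodities clicked under one query word.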

Preferably, step S3 comprises the steps of:

S301: represent the common click query set s_query and the common query commodity set s_item as a bipartite graph g = (v, e), where the vertex set v can be divided into two mutually disjoint subsets (i, q), i and q being the commodity information set and the query word set respectively, and each edge in e represents a click relation between a commodity and a query word;

S302: classify the vertices v = (i, q) of the bipartite graph g = (v, e), and calculate respectively the out-degree set a1, with the commodity information set i as arc head and the query word set q as arc tail, and the out-degree set a2, with the query word set q as arc head and the commodity information set i as arc tail.
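The patent does not say how the out-degree sets are represented. The sketch below takes the most literal reading, which is an assumption: an arc runs from its tail to its head, so a1 records the out-degree of each query word (arcs q to i) and a2 records the out-degree of each commodity (arcs i to q). The function name `out_degree_sets` and the `(query, item)` edge layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def out_degree_sets(
    edges: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Compute out-degrees on both sides of a click bipartite graph.

    edges: (query, item) click pairs; duplicates count once.
    Returns (a1, a2):
      a1[query] = number of distinct commodities reached from that query
                  (arc tail in q, arc head in i)
      a2[item]  = number of distinct query words reached from that commodity
                  (arc tail in i, arc head in q)
    """
    q_to_i: Dict[str, Set[str]] = defaultdict(set)
    i_to_q: Dict[str, Set[str]] = defaultdict(set)
    for query, item in edges:
        q_to_i[query].add(item)
        i_to_q[item].add(query)
    a1 = {query: len(items) for query, items in q_to_i.items()}
    a2 = {item: len(queries) for item, queries in i_to_q.items()}
    return a1, a2


click_edges = [
    ("samsung mobile phone", "item_1"),
    ("samsung galaxy", "item_1"),
    ("samsung mobile phone", "item_3"),
]
a1, a2 = out_degree_sets(click_edges)
```

Other representations, such as keeping the neighbor sets themselves rather than their sizes, would be equally consistent with the text.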

Preferably, in step S4, the sets a1 and a2 are merged as a' = (a1 ∪ a2) - (a1 ∩ a2), and the word clustering result is then obtained as a = a' ∪ a1.
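The merge in step S4 is plain set algebra and can be sketched independently of what the elements of a1 and a2 are. In the sketch below the elements are frozensets of query words, which is an illustrative assumption, and the function name `merge_out_degree_sets` is hypothetical.

```python
from typing import Hashable, Set, TypeVar

T = TypeVar("T", bound=Hashable)


def merge_out_degree_sets(a1: Set[T], a2: Set[T]) -> Set[T]:
    """Apply a' = (a1 ∪ a2) - (a1 ∩ a2), then a = a' ∪ a1."""
    a_prime = (a1 | a2) - (a1 & a2)  # symmetric difference of a1 and a2
    return a_prime | a1


# Illustrative elements: groups of query words (assumed representation).
a1 = {
    frozenset({"samsung mobile phone", "samsung galaxy"}),
    frozenset({"xiaomi mobile phone"}),
}
a2 = {
    frozenset({"xiaomi mobile phone"}),
    frozenset({"mobile phone case"}),
}
a = merge_out_degree_sets(a1, a2)
```

Because a' contains everything that is in exactly one of the two sets, adding a1 back restores the shared elements, so for ordinary sets the result equals a1 ∪ a2.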

Preferably, in step S5, the weight of each segment is calculated from the number of times the segment occurs:

tf(t_i) = \frac{n_i}{\max(n_i)} \quad (i = 1, 2, 3, \ldots)

where t_i is the i-th segment of the query word and n_i is the number of times t_i occurs in the subsets of the set a.
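The frequency weight can be sketched as a count normalized by the largest count. This is a minimal sketch: whitespace splitting stands in for a real word segmenter (Chinese queries would need a dedicated one), every occurrence is counted, and the function name `term_weights` is hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List


def term_weights(
    queries: Iterable[str],
    segment: Callable[[str], List[str]] = str.split,
) -> Dict[str, float]:
    """Return tf(t) = n_t / max(n) over all segments in one subset.

    queries: the query words in one subset of the clustering result.
    segment: placeholder segmenter (assumption); defaults to whitespace split.
    """
    counts: Counter = Counter()
    for query in queries:
        counts.update(segment(query))
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: n / max_count for term, n in counts.items()}


subset = ["samsung mobile phone", "samsung galaxy", "samsung phone case"]
weights = term_weights(subset)
```

In this example "samsung" occurs three times and therefore gets weight 1.0, while "phone" occurs twice and gets 2/3.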

Preferably, the set a comprises m subsets, and the segments in each subset of the set a are weighted and merged:

score(t_i) = \sum tf(t_i) \times (\alpha_1 \times \beta_1 + \alpha_2 \times \beta_2 + \cdots + \alpha_i \times \beta_i + \cdots + \alpha_m \times \beta_m)

where α_i is the similarity of the i-th subset of the set a, whose value is the reciprocal of the path length in the bipartite graph, and β_i is the weight of that subset of the set a.
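The formula does not state whether tf is computed once or per subset. The sketch below adopts one possible reading, stated here as an assumption: each subset j contributes tf_j(t) × α_j × β_j, with α_j taken as the reciprocal of a path length supplied by the caller. The function name `merged_scores` and all input values are hypothetical.

```python
from typing import Dict, Sequence


def merged_scores(
    tf_per_subset: Sequence[Dict[str, float]],
    path_lengths: Sequence[float],
    betas: Sequence[float],
) -> Dict[str, float]:
    """Weighted merge of per-subset segment weights.

    Assumed reading: score(t) = sum over subsets j of tf_j(t) * alpha_j * beta_j,
    where alpha_j = 1 / path_lengths[j].
    """
    if not (len(tf_per_subset) == len(path_lengths) == len(betas)):
        raise ValueError("one path length and one beta are required per subset")
    scores: Dict[str, float] = {}
    for tf, length, beta in zip(tf_per_subset, path_lengths, betas):
        alpha = 1.0 / length  # similarity as reciprocal of path length
        for term, weight in tf.items():
            scores[term] = scores.get(term, 0.0) + weight * alpha * beta
    return scores


tf_per_subset = [
    {"samsung": 1.0, "phone": 0.5},
    {"phone": 1.0, "case": 1.0},
]
scores = merged_scores(tf_per_subset, path_lengths=[1, 2], betas=[1.0, 0.5])
```

With these inputs "phone" collects 0.5 from the first subset and 0.25 from the second. Under the other literal reading, where the bracketed sum is a single constant factor, the ranking would depend only on the summed tf values.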

In another aspect, the present invention provides a system for intention entity recognition based on user search logs, characterized in that the system comprises:

a parsing and extraction unit, configured to parse and extract the original search log to obtain the query word of each click and its corresponding commodity information;

a query word forming unit, configured to form the session query set s_session, the common click query set s_query and the common query commodity set s_item;

a word clustering unit, configured to obtain the word clustering result a from the bipartite graph formed from the common click query set s_query and the common query commodity set s_item;

a weight unit, configured to segment each query word in the word clustering result a using a word segmentation technique and to calculate the weight of each segment;

a comparison unit, configured to compare the weights of the segments and select the segment with the highest score as the entity corresponding to the query word.

Preferably, the word clustering unit comprises a bipartite graph forming unit, an out-degree unit and a merging unit, wherein:

the bipartite graph forming unit is configured to represent the common click query set s_query and the common query commodity set s_item as a bipartite graph g = (v, e), where the vertex set v can be divided into two mutually disjoint subsets (i, q), i and q being the commodity information set and the query word set respectively, and each edge in e represents a click relation between a commodity and a query word;

the out-degree unit is configured to classify the vertices v = (i, q) of the bipartite graph g = (v, e), and to calculate respectively the out-degree set a1, with the commodity information set i as arc head and the query word set q as arc tail, and the out-degree set a2, with the query word set q as arc head and the commodity information set i as arc tail;

the merging unit is configured to merge the sets a1 and a2 as a' = (a1 ∪ a2) - (a1 ∩ a2), and then to obtain the word clustering result a = a' ∪ a1.

Preferably, the weight unit comprises a weight calculation unit and a weighted merging unit, wherein:

the weight calculation unit is configured to calculate the weight of each segment from the number of times the segment occurs, using the formula tf(t_i) = \frac{n_i}{\max(n_i)} (i = 1, 2, 3, \ldots), where t_i is the i-th segment of the query word and n_i is the number of times t_i occurs in the subsets of the set a;

the weighted merging unit is configured to weight and merge the segments in each subset of the set a according to the formula score(t_i) = \sum tf(t_i) \times (\alpha_1 \times \beta_1 + \alpha_2 \times \beta_2 + \cdots + \alpha_i \times \beta_i + \cdots + \alpha_m \times \beta_m), where α_i is the similarity of the i-th subset of the set a, whose value is the reciprocal of the path length in the bipartite graph, and β_i is the weight of that subset of the set a.

In the present invention, clicked commodity information and query words are obtained from the original search log to form candidate sets; a bipartite graph of the candidate sets is built and clustered, and similarity is calculated; the word clustering result is segmented, the weight of each segment is calculated and the weights are merged, and the highest-scoring segment is chosen as the entity corresponding to the query word. The method addresses the lack of context and semantic background analysis, improves the accuracy of query word search, reduces the query errors caused by differences in users' education, culture and region, and improves the user experience of shopping on e-commerce websites.

Brief description of the drawings

Fig. 1 is a flowchart of the method for intention entity recognition based on user search logs in a preferred embodiment of the present invention;

Fig. 2 is a structural diagram of the system for intention entity recognition based on user search logs in a preferred embodiment of the present invention.

Detailed description

The following embodiments are only used to illustrate the technical solution of the present invention more clearly and cannot be used to limit its scope of protection. The description that follows sets out preferred embodiments for implementing the invention; it is given to illustrate the general principles of the invention and is not intended to limit its scope. The scope of protection of the present invention is defined by the appended claims.

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, a preferred embodiment of the present invention discloses a method for intention entity recognition based on user search logs, the method comprising the steps of:

S3: represent the common click query set s_query and the common query commodity set s_item as a bipartite graph, and process the vertices of the bipartite graph to obtain out-degree sets a1 and a2;

S4: merge the sets a1 and a2 to obtain a word clustering result a;

S6: select the segment with the highest score as the entity corresponding to the query word.

In this embodiment, the word clustering result is obtained from the bipartite graph formed from the common click query set s_query and the common query commodity set s_item, a weight is calculated for each segment in the word clustering result, and the highest-scoring segment is chosen as the entity corresponding to the query word. To a certain extent this avoids the problem in the prior art that recognition ignores context and semantic background, and thereby improves the accuracy of search results.

Further, step S2 comprises the steps of:

S204: merge and de-duplicate the query word sets of the session units, the query word sets of commonly clicked commodities and the commodity click data sets of common query words, to obtain the session query set s_session, the common click query set s_query and the common query commodity set s_item.

In this embodiment, the original log is preprocessed to obtain the session query set s_session, the common click query set s_query and the common query commodity set s_item; query words are determined from click volume and query volume and core words are extracted. This avoids the problem that users influenced by factors such as education, culture and region enter very different query words when describing the same thing.

Further, step S3 comprises the steps of:

S301: represent the common click query set s_query and the common query commodity set s_item as a bipartite graph g = (v, e), where the vertex set v can be divided into two mutually disjoint subsets (i, q), i and q being the commodity information set and the query word set respectively, and each edge in e represents a click relation between a commodity and a query word;

S302: classify the vertices v = (i, q) of the bipartite graph g = (v, e), and calculate respectively the out-degree set a1, with the commodity information set i as arc head and the query word set q as arc tail, and the out-degree set a2, with the query word set q as arc head and the commodity information set i as arc tail.

Further, in step S4, the sets a1 and a2 are merged as a' = (a1 ∪ a2) - (a1 ∩ a2), and the word clustering result is then obtained as a = a' ∪ a1.

In this embodiment, the bipartite graph is used to obtain the out-degree set a1, with the commodity information set i as arc head and the query word set q as arc tail, and the out-degree set a2, with the query word set q as arc head and the commodity information set i as arc tail, and a1 and a2 are then merged by word clustering. This ensures that the resulting word clustering result incorporates user context information and semantic background analysis, which improves the accuracy of query words.

Further, in step S5, the weight of each segment is calculated from the number of times the segment occurs:

tf(t_i) = \frac{n_i}{\max(n_i)} \quad (i = 1, 2, 3, \ldots)

where t_i is the i-th segment of the query word and n_i is the number of times t_i occurs in the subsets of the set a.

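Further, the set a comprises m subsets, and the segments in each subset of the set a are weighted and merged:

score(t_i) = \sum tf(t_i) \times (\alpha_1 \times \beta_1 + \alpha_2 \times \beta_2 + \cdots + \alpha_i \times \beta_i + \cdots + \alpha_m \times \beta_m)

where α_i is the similarity of the i-th subset of the set a, whose value is the reciprocal of the path length in the bipartite graph, and β_i is the weight of that subset of the set a.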

In this embodiment, the word clustering result is segmented to obtain multiple segments, weight calculation and weighted merging are carried out for each segment to obtain its score, the scores are compared, and the highest-scoring segment is taken as the entity corresponding to the query word. Because the calculation is performed offline, the overhead of online computation is saved.
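The final selection and the offline/online split can be sketched as a precomputed lookup table. This is a sketch under assumptions: the scores are made-up illustrative values, and the names `pick_entity` and `build_entity_table` are hypothetical.

```python
from typing import Dict, Optional


def pick_entity(scores: Dict[str, float]) -> Optional[str]:
    """Return the highest-scoring segment, or None if there are no segments.

    On ties, the segment that appears first in the dict wins.
    """
    if not scores:
        return None
    return max(scores, key=lambda segment: scores[segment])


def build_entity_table(
    scores_by_query: Dict[str, Dict[str, float]],
) -> Dict[str, Optional[str]]:
    """Offline step: map every known query word to its entity."""
    return {query: pick_entity(scores) for query, scores in scores_by_query.items()}


# Illustrative scores only; not taken from the patent.
offline_scores = {
    "samsung mobile phone": {"samsung": 1.0, "mobile": 0.4, "phone": 0.75},
    "xiaomi mobile phone": {"xiaomi": 0.9, "mobile": 0.4, "phone": 0.75},
}
entity_table = build_entity_table(offline_scores)
```

At query time only a dictionary lookup such as `entity_table.get(query)` is needed, which is where the saving in online computation would come from.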

Those of ordinary skill in the art will appreciate that all or part of the steps of the method in the above embodiment can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method of the above embodiment; the storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card, or the like. Accordingly, those skilled in the art will understand that, corresponding to the method of the present invention, the present invention also includes a system for intention entity recognition based on user search logs. Referring to Fig. 2, and corresponding to the above method steps, the system comprises:

a word clustering unit, configured to obtain the word clustering result a from the bipartite graph formed from the common click query set s_query and the common query commodity set s_item;

a comparison unit, configured to compare the weights of the segments and select the segment with the highest score as the entity corresponding to the query word.

In this embodiment, the word clustering unit obtains the most accurate query words, taking context and semantic background into account; the weighted merging unit calculates the weight and score of each segment of the query word; and the comparison unit takes the segment with the highest score as the entity corresponding to the query word. This incorporates context and semantic background, improves search accuracy and achieves a better intention recognition effect, while also saving computation overhead and obtaining an accurate segment quickly.

Further, the word clustering unit comprises a bipartite graph forming unit, an out-degree unit and a merging unit, wherein:

the bipartite graph forming unit is configured to represent the common click query set s_query and the common query commodity set s_item as a bipartite graph g = (v, e), where the vertex set v can be divided into two mutually disjoint subsets (i, q), i and q being the commodity information set and the query word set respectively, and each edge in e represents a click relation between a commodity and a query word;

the out-degree unit is configured to classify the vertices v = (i, q) of the bipartite graph g = (v, e), and to calculate respectively the out-degree set a1, with the commodity information set i as arc head and the query word set q as arc tail, and the out-degree set a2, with the query word set q as arc head and the commodity information set i as arc tail;

the merging unit is configured to merge the sets a1 and a2 as a' = (a1 ∪ a2) - (a1 ∩ a2), and then to obtain the word clustering result a = a' ∪ a1.

In this embodiment, the word clustering unit obtains the word clustering result through the bipartite graph forming unit, the out-degree unit and the merging unit. The bipartite graph and the out-degrees establish the association between commodities and query words, and the merging step then makes the word clustering result more accurate while taking semantic background and context into account.

Further, the weight unit comprises a weight calculation unit and a weighted merging unit, wherein:

the weight calculation unit is configured to calculate the weight of each segment from the number of times the segment occurs, using the formula tf(t_i) = \frac{n_i}{\max(n_i)} (i = 1, 2, 3, \ldots), where t_i is the i-th segment of the query word and n_i is the number of times t_i occurs in the subsets of the set a;

the weighted merging unit is configured to weight and merge the segments in each subset of the set a according to the formula score(t_i) = \sum tf(t_i) \times (\alpha_1 \times \beta_1 + \alpha_2 \times \beta_2 + \cdots + \alpha_i \times \beta_i + \cdots + \alpha_m \times \beta_m), where α_i is the similarity of the i-th subset of the set a, whose value is the reciprocal of the path length in the bipartite graph, and β_i is the weight of that subset of the set a.

In this embodiment, the weight calculation unit and the weighted merging unit calculate the weight and score of each segment of a query word offline, which saves the overhead of online computation; the segment corresponding to the entity can also be obtained directly by finding the highest score.

Compared with the prior art, the present invention provides a method for intention entity recognition based on user search logs: clicked commodity information and query words are obtained from the original search log to form candidate sets; a bipartite graph of the candidate sets is built and clustered, and similarity is calculated; the word clustering result is segmented, the weight of each segment is calculated and the weights are merged, and the highest-scoring segment is chosen as the entity corresponding to the query word. The method addresses the lack of context and semantic background analysis, improves the accuracy of query word search, reduces the query errors caused by differences in users' education, culture and region, and improves the user experience of shopping on e-commerce websites.

It should be noted that the above are only preferred embodiments of the present invention and do not thereby limit its scope of patent protection. The present invention may also be improved in the materials and structure of the components described above, or have them replaced by technical equivalents. Therefore, all equivalent structural changes made using the description and drawings of the present invention, or direct or indirect applications in other related technical fields, are likewise included within the scope covered by the present invention.

Claims

1. A method for intention entity recognition based on user search logs, characterized in that the method comprises the steps of:

2. The method as claimed in claim 1, characterized in that step S2 comprises the steps of:

3. The method as claimed in claim 1, characterized in that step S3 comprises the steps of:

4. The method as claimed in claim 1, characterized in that, in step S4, the sets a1 and a2 are merged as a' = (a1 ∪ a2) - (a1 ∩ a2), and the word clustering result is then obtained as a = a' ∪ a1.

5. The method as claimed in claim 1, characterized in that, in step S5, the weight of each segment is calculated from the number of times the segment occurs:

tf(t_i) = \frac{n_i}{\max(n_i)} \quad (i = 1, 2, 3, \ldots)

6. The method as claimed in claim 5, characterized in that the set a comprises m subsets, and the segments in each subset of the set a are weighted and merged:

score(t_i) = \sum tf(t_i) \times (\alpha_1 \times \beta_1 + \alpha_2 \times \beta_2 + \cdots + \alpha_i \times \beta_i + \cdots + \alpha_m \times \beta_m)

7. A system for intention entity recognition based on user search logs, characterized in that the system comprises:

a weight calculation unit, configured to segment each query word in the word clustering result a using a word segmentation technique and to calculate the weight of each segment;

8. The system as claimed in claim 7, characterized in that the word clustering unit comprises a bipartite graph forming unit, an out-degree unit and a merging unit, wherein:

9. The system as claimed in claim 7, characterized in that the weight unit comprises a weight calculation unit and a weighted merging unit, wherein:

the weighted merging unit is configured to weight and merge the segments in each subset of the set a according to the formula score(t_i) = \sum tf(t_i) \times (\alpha_1 \times \beta_1 + \alpha_2 \times \beta_2 + \cdots + \alpha_i \times \beta_i + \cdots + \alpha_m \times \beta_m), where α_i is the similarity of the i-th subset of the set a, whose value is the reciprocal of the path length in the bipartite graph, and β_i is the weight of that subset of the set a.