Summary of the invention
The technical problem to be solved by the present invention is to provide a method for intent entity recognition based on user search logs, so as to solve the problem that search results contain errors because the user's context or semantic background is not analyzed.
To solve the above technical problem, the present invention adopts the following technical solution: a method for intent entity recognition based on user search logs, the method comprising the steps of:
S1, parsing and extracting the original search log to obtain, for each click, the query word and its corresponding commodity information;
S2, forming a session query set s_session, a co-click query set s_query and a co-query commodity set s_item;
S3, representing the co-click query set s_query and the co-query commodity set s_item as a bipartite graph, and processing the vertices of the bipartite graph to obtain out-degree sets a1 and a2;
S4, merging the sets a1 and a2 to obtain a term clustering result a;
S5, segmenting each query word in the term clustering result a using a word segmentation technique, and calculating the weight of each segmented word;
S6, selecting the segmented word with the highest score as the entity corresponding to the query word.
Preferably, step S2 comprises the steps of:
S201, processing each session unit of a user to obtain the set of query words corresponding to each session unit;
S202, obtaining the set of query words corresponding to commodities co-clicked by users;
S203, obtaining the set of different commodity data clicked by users under the same query word;
S204, merging and de-duplicating the query word sets of the session units, the query word units of the co-clicked commodities and the commodity click data sets of the co-queried query words, to obtain the session query set s_session, the co-click query set s_query and the co-query commodity set s_item.
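Steps S201 to S204 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the record layout `(session_id, query, item_id)`, the sample data and the function name `build_sets` are assumptions introduced for the example, and de-duplication is delegated to Python set semantics.

```python
from collections import defaultdict

# Hypothetical parsed click records: (session_id, query, item_id).
records = [
    ("s1", "red dress", "item1"),
    ("s1", "summer dress", "item1"),
    ("s2", "red dress", "item2"),
    ("s2", "red dress", "item2"),   # duplicate click, removed by set semantics
    ("s3", "evening gown", "item2"),
]

def build_sets(records):
    """Group click records into the three candidate sets of step S2."""
    by_session = defaultdict(set)   # session -> query words issued in it (S201)
    by_item = defaultdict(set)      # commodity -> query words that led to a click on it (S202)
    by_query = defaultdict(set)     # query word -> commodities clicked under it (S203)
    for session_id, query, item_id in records:
        by_session[session_id].add(query)
        by_item[item_id].add(query)
        by_query[query].add(item_id)
    # S204: merge and de-duplicate; frozensets make each group hashable.
    s_session = {frozenset(q) for q in by_session.values()}
    s_query = {frozenset(q) for q in by_item.values()}
    s_item = {frozenset(i) for i in by_query.values()}
    return s_session, s_query, s_item

s_session, s_query, s_item = build_sets(records)
```

In this sample, the query words "red dress" and "evening gown" land in the same group of s_query because both led to a click on item2.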
Preferably, step S3 comprises the steps of:
S301, representing the co-click query set s_query and the co-query commodity set s_item as a bipartite graph g=(v, e), wherein the vertex set v can be divided into two mutually disjoint subsets (i, q), i and q being the commodity information set and the query word set respectively, and an edge e represents a click relation between a commodity and a query word;
S302, classifying the vertices v=(i, q) of the bipartite graph g=(v, e), and calculating respectively the out-degree set a1, for which the commodity information set i is the arc head and the query word set q is the arc tail, and the out-degree set a2, for which the query word set q is the arc head and the commodity information set i is the arc tail.
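The two orientations used in step S302 can be illustrated with a short sketch. The text does not state the element type of a1 and a2, so the example only computes out-neighbours and out-degrees for each orientation; the edge list and the names `out_neighbours`, `q_to_i` and `i_to_q` are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical click edges between query words (q) and commodities (i).
edges = [
    ("red dress", "item1"),
    ("summer dress", "item1"),
    ("red dress", "item2"),
    ("evening gown", "item2"),
]

def out_neighbours(edges):
    """Return the out-neighbour maps for both orientations of the click edges."""
    q_to_i = defaultdict(set)  # arcs whose tail is in q and whose head is in i
    i_to_q = defaultdict(set)  # arcs whose tail is in i and whose head is in q
    for query, item in edges:
        q_to_i[query].add(item)
        i_to_q[item].add(query)
    return dict(q_to_i), dict(i_to_q)

q_to_i, i_to_q = out_neighbours(edges)

# The out-degree of a vertex is the number of arcs leaving it.
out_degree_q = {q: len(items) for q, items in q_to_i.items()}
out_degree_i = {i: len(queries) for i, queries in i_to_q.items()}
```

Because the graph is bipartite, every arc leaving a query word ends at a commodity and vice versa, so the two maps never mix vertex types.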
Preferably, in step S4, the sets a1 and a2 are merged as a' = (a1 ∪ a2) - (a1 ∩ a2), and the term clustering result is then obtained as a = a' ∪ a1.
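The merge of step S4 can be written directly with set operators. The sketch below assumes that a1 and a2 are ordinary sets whose elements are groups of query words; the sample groups and the function name `merge_clusters` are hypothetical.

```python
def merge_clusters(a1, a2):
    """Merge two sets as a' = (a1 ∪ a2) - (a1 ∩ a2), then a = a' ∪ a1."""
    a_prime = (a1 | a2) - (a1 & a2)   # elements found in exactly one of a1, a2
    return a_prime | a1

# Hypothetical groups of query words, stored as frozensets so they can be set elements.
a1 = {frozenset({"red dress", "summer dress"}), frozenset({"evening gown"})}
a2 = {frozenset({"evening gown"}), frozenset({"red dress", "evening gown"})}

a = merge_clusters(a1, a2)
```

Under plain set semantics the result coincides with a1 ∪ a2: a' drops the elements common to both sets, and the final union with a1 restores them.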
Preferably, in step S5, the weight tf(t_i) of each segmented word is calculated from the number of times that segmented word occurs, wherein t_i is the i-th segmented word of the query word, and n_i is the number of times t_i occurs in the subsets of the set a.
Preferably, the set a comprises m subsets, and the segmented words in each subset of the set a are weighted and merged according to the formula score(t_i) = Σ tf(t_i) × (α_1×β_1 + α_2×β_2 + … + α_i×β_i + … + α_m×β_m), wherein α_i is the similarity of the i-th subset of the set a, its value being the reciprocal of the path length in the bipartite graph, and β_i is the weight of that subset of the set a.
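The weighting and merging can be sketched under explicit assumptions. The weight is taken here to be the relative frequency n_i divided by the total count within a subset, which is an assumed form and not one confirmed by the text, and the merge is read as summing tf_j(t) × α_j × β_j over the m subsets. The sample subsets, path lengths and equal β values are hypothetical.

```python
from collections import Counter

def tf(tokens):
    """Relative frequency of each token: n_i / sum of all counts (an assumed form)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def score(subsets, alphas, betas):
    """Sum tf_j(t) * alpha_j * beta_j over the m subsets (one reading of the merge)."""
    scores = Counter()
    for tokens, alpha, beta in zip(subsets, alphas, betas):
        for token, weight in tf(tokens).items():
            scores[token] += weight * alpha * beta
    return dict(scores)

# Hypothetical, already-segmented subsets of the clustering result a.
subsets = [["red", "dress", "summer", "dress"], ["evening", "dress"]]
alphas = [1 / 2, 1 / 4]   # reciprocals of assumed path lengths 2 and 4
betas = [1.0, 1.0]        # assumed equal subset weights

scores = score(subsets, alphas, betas)
entity = max(scores, key=scores.get)   # step S6: highest score wins
```

In this sample "dress" occurs in both subsets and accumulates the highest score, so it would be chosen as the entity.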
In another aspect, the present invention provides a system for intent entity recognition based on user search logs, characterized in that the system comprises:
a parsing and extraction unit, configured to parse and extract the original search log to obtain, for each click, the query word and its corresponding commodity information;
a query word forming unit, configured to form a session query set s_session, a co-click query set s_query and a co-query commodity set s_item;
a term clustering unit, configured to obtain a term clustering result a from the bipartite graph formed by the co-click query set s_query and the co-query commodity set s_item;
a weight unit, configured to segment each query word in the term clustering result a using a word segmentation technique and to calculate the weight of each segmented word;
a comparison unit, configured to compare the weights of the segmented words and to select the segmented word with the highest score as the entity corresponding to the query word.
Preferably, the term clustering unit comprises a bipartite graph forming unit, an out-degree unit and a merging unit, wherein:
the bipartite graph forming unit is configured to represent the co-click query set s_query and the co-query commodity set s_item as a bipartite graph g=(v, e), wherein the vertex set v can be divided into two mutually disjoint subsets (i, q), i and q being the commodity information set and the query word set respectively, and an edge e represents a click relation between a commodity and a query word;
the out-degree unit is configured to classify the vertices v=(i, q) of the bipartite graph g=(v, e), and to calculate respectively the out-degree set a1, for which the commodity information set i is the arc head and the query word set q is the arc tail, and the out-degree set a2, for which the query word set q is the arc head and the commodity information set i is the arc tail;
the merging unit is configured to merge the sets a1 and a2 as a' = (a1 ∪ a2) - (a1 ∩ a2), and then to obtain the term clustering result a = a' ∪ a1.
Preferably, the weight unit comprises a weight calculation unit and a weighted merging unit, wherein:
the weight calculation unit is configured to calculate the weight tf(t_i) of each segmented word from the number of times that segmented word occurs, wherein t_i is the i-th segmented word of the query word, and n_i is the number of times t_i occurs in the subsets of the set a;
the weighted merging unit is configured to weight and merge the segmented words in each subset of the set a according to the formula score(t_i) = Σ tf(t_i) × (α_1×β_1 + α_2×β_2 + … + α_i×β_i + … + α_m×β_m), wherein α_i is the similarity of the i-th subset of the set a, its value being the reciprocal of the path length in the bipartite graph, and β_i is the weight of that subset of the set a.
In the present invention, clicked commodity information and query words are obtained from the original search log to form candidate sets; a bipartite graph of the candidate sets is built and clustered, and similarities are calculated; the term clustering result is segmented, the weight of each segmented word is calculated and the weights are merged, and the segmented word with the highest score is chosen as the entity corresponding to the query word. The method addresses the lack of context and semantic background analysis, improves the accuracy of query word search, reduces query errors caused by differences among users in education, culture, region and similar factors, and improves the user experience when shopping on e-commerce websites.
Detailed description of the embodiments
The following embodiments are intended only to illustrate the technical solution of the present invention more clearly and shall not be used to limit its scope of protection. The description that follows sets out preferred embodiments for implementing the invention; it is given for the purpose of illustrating the general principles of the invention and is not intended to limit its scope. The scope of protection of the invention shall be as defined by the appended claims.
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, a preferred embodiment of the present invention discloses a method for intent entity recognition based on user search logs, the method comprising the steps of:
S1, parsing and extracting the original search log to obtain, for each click, the query word and its corresponding commodity information;
S2, forming a session query set s_session, a co-click query set s_query and a co-query commodity set s_item;
S3, representing the co-click query set s_query and the co-query commodity set s_item as a bipartite graph, and processing the vertices of the bipartite graph to obtain out-degree sets a1 and a2;
S4, merging the sets a1 and a2 to obtain a term clustering result a;
S5, segmenting each query word in the term clustering result a using a word segmentation technique, and calculating the weight of each segmented word;
S6, selecting the segmented word with the highest score as the entity corresponding to the query word.
In this embodiment, the term clustering result is obtained from the bipartite graph formed by the co-click query set s_query and the co-query commodity set s_item, the weight of each segmented word in the term clustering result is calculated, and the segmented word with the highest score is chosen as the entity corresponding to the query word. This mitigates, to a certain extent, the prior-art problem that recognition ignores context or semantic background, and thereby improves the accuracy of search results.
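The parsing of step S1 can be sketched as follows. The log layout is not specified in the text, so the example assumes a hypothetical tab-separated format of timestamp, session id, query word and clicked commodity id; the sample lines and the function name `parse_log` are likewise illustrative.

```python
import csv
import io

# Hypothetical raw log: timestamp, session id, query word, clicked item id (tab-separated).
RAW_LOG = (
    "2015-01-01T10:00:00\ts1\tRed Dress\titem1\n"
    "2015-01-01T10:00:05\ts1\tsummer dress\titem1\n"
    "malformed line without enough fields\n"
    "2015-01-01T10:02:00\ts2\tred dress\t\n"   # search without a click
)

def parse_log(text):
    """Yield (session_id, query, item_id) for every well-formed click record."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for row in reader:
        if len(row) != 4:
            continue            # skip malformed lines
        _, session_id, query, item_id = row
        if not query or not item_id:
            continue            # keep only searches that ended in a click
        yield session_id, query.strip().lower(), item_id

clicks = list(parse_log(RAW_LOG))
```

Lower-casing the query word is a design choice of the sketch, made so that trivially different spellings of the same query fall into the same group in later steps.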
Further, step S2 comprises the steps of:
S201, processing each session unit of a user to obtain the set of query words corresponding to each session unit;
S202, obtaining the set of query words corresponding to commodities co-clicked by users;
S203, obtaining the set of different commodity data clicked by users under the same query word;
S204, merging and de-duplicating the query word sets of the session units, the query word units of the co-clicked commodities and the commodity click data sets of the co-queried query words, to obtain the session query set s_session, the co-click query set s_query and the co-query commodity set s_item.
In this embodiment, the original log is pre-processed to obtain the session query set s_session, the co-click query set s_query and the co-query commodity set s_item; query words are determined by click volume and query volume, and core words are extracted. This avoids the problem that, because different users are influenced by factors such as education, culture and region, the query words they enter to express the same need differ too much.
Further, step S3 comprises the steps of:
S301, representing the co-click query set s_query and the co-query commodity set s_item as a bipartite graph g=(v, e), wherein the vertex set v can be divided into two mutually disjoint subsets (i, q), i and q being the commodity information set and the query word set respectively, and an edge e represents a click relation between a commodity and a query word;
S302, classifying the vertices v=(i, q) of the bipartite graph g=(v, e), and calculating respectively the out-degree set a1, for which the commodity information set i is the arc head and the query word set q is the arc tail, and the out-degree set a2, for which the query word set q is the arc head and the commodity information set i is the arc tail.
Further, in step S4, the sets a1 and a2 are merged as a' = (a1 ∪ a2) - (a1 ∩ a2), and the term clustering result is then obtained as a = a' ∪ a1.
In this embodiment, the bipartite graph is used to obtain the out-degree set a1, for which the commodity information set i is the arc head and the query word set q is the arc tail, and the out-degree set a2, for which the query word set q is the arc head and the commodity information set i is the arc tail, and a1 and a2 are then merged by term clustering. This ensures that the resulting term clustering result reflects the user's context information and semantic background, and improves the accuracy of the query word.
Further, in step S5, the weight tf(t_i) of each segmented word is calculated from the number of times that segmented word occurs, wherein t_i is the i-th segmented word of the query word, and n_i is the number of times t_i occurs in the subsets of the set a.
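Further, the set a comprises m subsets, and the segmented words in each subset of the set a are weighted and merged according to the formula score(t_i) = Σ tf(t_i) × (α_1×β_1 + α_2×β_2 + … + α_i×β_i + … + α_m×β_m), wherein α_i is the similarity of the i-th subset of the set a, its value being the reciprocal of the path length in the bipartite graph, and β_i is the weight of that subset of the set a.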
In this embodiment, the term clustering result is segmented to obtain multiple segmented words; weight calculation and weighted merging are performed on each segmented word to obtain its score; the scores are compared, and the segmented word with the highest score is taken as the entity corresponding to the query word. Because the calculation is performed offline, the cost of online computation is saved.
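The split between offline calculation and online use can be sketched as a precomputed lookup table. This is an illustration under assumptions: the score values, the JSON serialization and the function names `build_entity_table` and `lookup_entity` are hypothetical and are not taken from the text.

```python
import json

def build_entity_table(scores_by_query):
    """Offline step: keep, for each query word, the segmented word with the highest score."""
    return {
        query: max(scores, key=scores.get)
        for query, scores in scores_by_query.items()
        if scores   # skip query words that produced no segmented words
    }

# Hypothetical scores produced by the weighting and merging step.
scores_by_query = {
    "red summer dress": {"red": 0.125, "summer": 0.125, "dress": 0.375},
    "evening gown": {"evening": 0.2, "gown": 0.6},
}

table = build_entity_table(scores_by_query)
serialized = json.dumps(table, sort_keys=True)   # what would be persisted offline

def lookup_entity(query, table):
    """Online step: a dictionary lookup, with no scoring at request time."""
    return table.get(query)
```

At request time only `lookup_entity` runs, so the online cost is a single dictionary access regardless of how expensive the scoring was.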
Those of ordinary skill in the art will appreciate that all or part of the steps of the method in the above embodiment can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method of the above embodiment; the storage medium may be ROM/RAM, a magnetic disk, an optical disc, a memory card or the like. Accordingly, those skilled in the relevant art will understand that, corresponding to the method of the present invention, the invention also includes a system for intent entity recognition based on user search logs. Referring to Fig. 2, and corresponding to the above method steps, the system comprises:
a parsing and extraction unit, configured to parse and extract the original search log to obtain, for each click, the query word and its corresponding commodity information;
a query word forming unit, configured to form a session query set s_session, a co-click query set s_query and a co-query commodity set s_item;
a term clustering unit, configured to obtain a term clustering result a from the bipartite graph formed by the co-click query set s_query and the co-query commodity set s_item;
a weight unit, configured to segment each query word in the term clustering result a using a word segmentation technique and to calculate the weight of each segmented word;
a comparison unit, configured to compare the weights of the segmented words and to select the segmented word with the highest score as the entity corresponding to the query word.
In this embodiment, the term clustering unit obtains the most accurate query words, which take context and semantic background into account; the weighted merging unit calculates the weight and score of each segmented word of the query word; and the comparison unit takes the segmented word with the highest score as the entity corresponding to the query word. This incorporates the context and semantic background, improves search accuracy and achieves better intent recognition, while also saving computing cost so that accurate segmented words are obtained quickly.
Further, the term clustering unit comprises a bipartite graph forming unit, an out-degree unit and a merging unit, wherein:
the bipartite graph forming unit is configured to represent the co-click query set s_query and the co-query commodity set s_item as a bipartite graph g=(v, e), wherein the vertex set v can be divided into two mutually disjoint subsets (i, q), i and q being the commodity information set and the query word set respectively, and an edge e represents a click relation between a commodity and a query word;
the out-degree unit is configured to classify the vertices v=(i, q) of the bipartite graph g=(v, e), and to calculate respectively the out-degree set a1, for which the commodity information set i is the arc head and the query word set q is the arc tail, and the out-degree set a2, for which the query word set q is the arc head and the commodity information set i is the arc tail;
the merging unit is configured to merge the sets a1 and a2 as a' = (a1 ∪ a2) - (a1 ∩ a2), and then to obtain the term clustering result a = a' ∪ a1.
In this embodiment, the term clustering unit obtains the term clustering result by means of the bipartite graph forming unit, the out-degree unit and the merging unit. The bipartite graph and the out-degrees establish the association between commodities and query words, and the subsequent merging ensures that the term clustering result is more accurate and takes semantic background and contextual relations into account.
Further, the weight unit comprises a weight calculation unit and a weighted merging unit, wherein:
the weight calculation unit is configured to calculate the weight tf(t_i) of each segmented word from the number of times that segmented word occurs, wherein t_i is the i-th segmented word of the query word, and n_i is the number of times t_i occurs in the subsets of the set a;
the weighted merging unit is configured to weight and merge the segmented words in each subset of the set a according to the formula score(t_i) = Σ tf(t_i) × (α_1×β_1 + α_2×β_2 + … + α_i×β_i + … + α_m×β_m), wherein α_i is the similarity of the i-th subset of the set a, its value being the reciprocal of the path length in the bipartite graph, and β_i is the weight of that subset of the set a.
In this embodiment, the weight calculation unit and the weighted merging unit calculate the weight and score of each segmented word of the query word offline, which saves the cost of online computation; at the same time, the segmented word corresponding to the entity can be obtained directly by identifying the highest score.
Compared with the prior art, the present invention provides a method for intent entity recognition based on user search logs, in which clicked commodity information and query words are obtained from the original search log to form candidate sets; a bipartite graph of the candidate sets is built and clustered, and similarities are calculated; the term clustering result is segmented, the weight of each segmented word is calculated and the weights are merged, and the segmented word with the highest score is chosen as the entity corresponding to the query word. The method addresses the lack of context and semantic background analysis, improves the accuracy of query word search, reduces query errors caused by differences among users in education, culture, region and similar factors, and improves the user experience when shopping on e-commerce websites.
It should be noted that the foregoing describes only preferred embodiments of the present invention and does not thereby limit the scope of patent protection of the invention. Improvements in the materials and structure of the above components may also be made, or technical equivalents may be substituted. Therefore, all equivalent structural changes made using the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, are likewise included within the scope of the present invention.