CN105488209A - Method and device for analyzing word weight - Google Patents


Info

Publication number
CN105488209A
Authority
CN
China
Prior art keywords
word
fragment
inquiry
title
same words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510921247.1A
Other languages
Chinese (zh)
Other versions
CN105488209B (en)
Inventor
陈进平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201510921247.1A
Publication of CN105488209A
Application granted
Publication of CN105488209B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Abstract

The invention discloses a method and a device for analyzing word weight. It relates to the field of Internet technology and addresses the problem that existing methods for determining term weight cannot accurately determine the weight of a term in a query in an Internet search engine environment. The method comprises the following steps: obtaining (query, title) pairs; collecting occurrence information for each word in the word fragments of the query in each (query, title) pair; calculating the occurrence probability of each word in a same word fragment according to the occurrence information; and determining the weight of each word in the same word fragment according to the occurrence probability of each word in that word fragment. The method and the device are used to determine the term weights of a query in a search engine and to improve the search quality of the search engine.

Description

Method and device for analyzing word weight
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for analyzing word weight.
Background
With the development of the Internet, the total amount of data stored on the Internet has become enormous. To let users find the data they need quickly and accurately, providers of Internet search services need to optimize the search quality of their search engines. A weight is an evaluation value that a search engine assigns to a web page. It reflects the importance of the page, and a higher weight indicates that the page has earned more trust and recognition from search engines. When using a search engine, a user submits query content in the search box. This content is usually called a query, and the search engine must retrieve useful information from massive data according to the query. A query contains different words (terms), and each term differs in how important it is for obtaining useful results. Therefore, to obtain the target results accurately from a query, the importance of each term in the query must be taken into account; that is, the weights of the terms in the query are needed when querying for the target results.
Existing methods for determining term weight usually rely on ordinary clicks, part of speech, and named entities. These methods, however, are not derived from content generated by users using a search engine in the Internet environment, so the term weights they produce have limited reference value in the field of Internet search. How to determine term weights in an Internet search engine environment is therefore a problem that urgently needs to be solved.
Summary of the invention
In view of this, the present invention proposes a method and device for analyzing word weight, whose main purpose is to solve the problem that existing methods for determining term weight cannot accurately determine the weights of the terms in a query in an Internet search engine environment.
According to a first aspect of the present invention, a method for analyzing word weight is provided, comprising:
Obtaining <query, title> pairs;
Collecting occurrence information for each word in the word fragments of the query in each <query, title> pair;
Calculating, according to the occurrence information, the occurrence probability of each word in a same word fragment;
Determining the weight of each word in the same word fragment according to the occurrence probability of each word in that word fragment.
Further, obtaining the <query, title> pairs comprises:
Obtaining user click logs, the click logs comprising all queries submitted by users and all titles obtained;
Organizing the click logs so that each query submitted by a user is placed in one-to-one correspondence with the title obtained from a url clicked for that query, forming <query, title> pairs.
Further, collecting occurrence information for each word in the word fragments of the query in each <query, title> pair comprises:
Obtaining all word fragments of the query in the <query, title> pair, the word fragments comprising each word in the query and each phrase composed of two or more adjacent words;
Collecting the occurrence information of each word in all word fragments of the query.
Further, collecting the occurrence information of each word in all word fragments of the query comprises:
Determining whether each word in a word fragment of the query occurs in the title corresponding to the query in its <query, title> pair;
Recording the occurrence information of each word in the word fragments of the query according to the determination result, the occurrence information being represented by a preset occurrence symbol and a preset non-occurrence symbol.
Further, calculating the occurrence probability of each word in a same word fragment according to the occurrence information comprises:
Obtaining the total number of all titles corresponding to the same word fragment;
Obtaining the number of times each word in the same word fragment occurs in all the corresponding titles;
Dividing the number of times by the total number of all the corresponding titles to obtain the occurrence probability of each word of the same word fragment in all the corresponding titles.
Further, determining the weight of each word in the same word fragment according to the occurrence probability of each word in that word fragment comprises:
Using the occurrence probability of each word of the same word fragment in all the corresponding titles as the weight of that word in the same word fragment.
According to a second aspect of the present invention, a device for analyzing word weight is provided, comprising:
An acquiring unit, configured to obtain <query, title> pairs;
A statistic unit, configured to collect occurrence information for each word in the word fragments of the query in each <query, title> pair obtained by the acquiring unit;
A computing unit, configured to calculate, according to the occurrence information, the occurrence probability of each word in a same word fragment;
A determining unit, configured to determine the weight of each word in the same word fragment according to the occurrence probability of each word in that word fragment calculated by the computing unit.
Further, the acquiring unit comprises:
An acquisition module, configured to obtain user click logs, the click logs comprising all queries submitted by users and all titles obtained;
A sorting module, configured to organize the click logs obtained by the acquisition module so that each query submitted by a user is placed in one-to-one correspondence with the title obtained from a url clicked for that query, forming <query, title> pairs.
Further, the statistic unit comprises:
A cutting module, configured to obtain all word fragments of the query in the <query, title> pair, the word fragments comprising each word in the query and each phrase composed of two or more adjacent words;
A statistical module, configured to collect the occurrence information of each word in all word fragments of the query obtained by the cutting module.
Further, the statistic unit is also configured to determine whether each word in a word fragment of the query occurs in the title corresponding to the query in its <query, title> pair, and to record the occurrence information of each word in the word fragments of the query according to the determination result, the occurrence information being represented by a preset occurrence symbol and a preset non-occurrence symbol.
Further, the computing unit comprises:
A counting module, configured to obtain the total number of all titles corresponding to the same word fragment;
The counting module is also configured to obtain the number of times each word in the same word fragment occurs in all the corresponding titles;
A computing module, configured to divide the number of times by the total number of all the corresponding titles to obtain the occurrence probability of each word of the same word fragment in all the corresponding titles.
Further, the determining unit is configured to use the occurrence probability of each word of the same word fragment in all the corresponding titles as the weight of that word in the same word fragment.
With the above technical solution, the method and device for analyzing word weight provided by the embodiments of the present invention can obtain <query, title> pairs on a large scale as users use an Internet search engine, collect the occurrence information of each word in the word fragments of the queries, calculate the occurrence probability of each word in a same word fragment according to that occurrence information, and determine the weight of each word in the same word fragment according to its occurrence probability. In the prior art, the weights of the words in a search query cannot be determined from content generated by users using a search engine in the Internet environment, so the word weights are determined inaccurately, which in turn affects the accuracy of the search results. Compared with this defect of the prior art, the present invention relies on the click logs formed by large numbers of users using a search engine and can accurately determine the weights of the words in a search query in an Internet search engine environment, thereby effectively improving the accuracy of the search results.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 shows a flowchart of a method for analyzing word weight provided by an embodiment of the present invention;
Fig. 2 shows a block diagram of a device for analyzing word weight provided by an embodiment of the present invention;
Fig. 3 shows a block diagram of another device for analyzing word weight provided by an embodiment of the present invention;
Fig. 4 shows a block diagram of another device for analyzing word weight provided by an embodiment of the present invention;
Fig. 5 shows a block diagram of another device for analyzing word weight provided by an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
A user must submit a query when using a search engine. A query contains different words (terms), and each term differs in how important it is for obtaining useful results. Therefore, to obtain the target results accurately from a query, the importance of each term in the query must be taken into account; that is, the weights of the terms in the query are needed when querying for the target results. Existing methods for determining term weight usually rely on ordinary clicks, part of speech, and named entities. These methods, however, are not derived from content generated by users using a search engine in the Internet environment, so the term weights they produce have limited reference value in the field of Internet search.
To solve the above problem, an embodiment of the present invention provides a method for analyzing word weight that can accurately determine, in an Internet search engine environment, the weight of each keyword term in the query submitted by a user. As shown in Fig. 1, the method comprises:
101. Obtain <query, title> pairs.
When a user uses a search engine to look up content, the user submits a query containing keyword terms. The search engine matches a number of relevant titles according to the submitted query for the user to click and view. After the user clicks a relevant title, the embodiment of the present invention combines the submitted query with the clicked title to form a <query, title> pair.
102. Collect occurrence information for each word in the word fragments of the query in each <query, title> pair.
When a search engine searches the Internet for content matching the query submitted by a user, it needs to adjust its search strategy according to the importance of each term in the query. The more often a term of a query occurs in the titles corresponding to that query, the more important the term is within the query. The embodiment of the present invention therefore performs step 102 to collect, over large-scale <query, title> pairs, the occurrence information of each word in the word fragments of the queries, and determines the importance of each word in a word fragment from that occurrence information.
103. Calculate, according to the occurrence information, the occurrence probability of each word in a same word fragment.
Because the embodiment of the present invention collects statistics over large-scale <query, title> pairs, the queries covered contain a large number of identical word fragments. Take the word fragment ABC as an example. Among the titles corresponding to all queries that contain the word fragment ABC, some titles contain term-A and some do not; some titles contain term-B and some do not; some titles contain term-C and some do not. In other words, the words of a same word fragment have different probabilities of occurring in the titles corresponding to all queries that contain the fragment, and so the words of the fragment differ in importance. The embodiment of the present invention therefore performs step 103 to calculate the occurrence probability of each word in the same word fragment from the occurrence information of each word in the titles corresponding to all queries that contain the fragment.
104. Determine the weight of each word in the same word fragment according to the occurrence probability of each word in that word fragment.
Weight is a relative concept: for a given indicator, its weight is its relative importance in the overall evaluation. In the embodiment of the present invention, the weight of a term is the relative importance of that term within the word fragment of the query in which it appears. A more important term also has a higher probability of occurring in the titles corresponding to the queries that contain its word fragment. Therefore, once step 103 has calculated the occurrence probability of each word of a same word fragment in the titles corresponding to all queries that contain the fragment, the weight of each word in the fragment can be determined from those probabilities, so that the search engine can adjust its search strategy according to term weights determined from large-scale <query, title> statistics and improve the accuracy of the search results.
The method for analyzing word weight provided by the embodiment of the present invention can obtain <query, title> pairs on a large scale as users use an Internet search engine, collect the occurrence information of each word in the word fragments of the queries, calculate the occurrence probability of each word in a same word fragment according to that occurrence information, and determine the weight of each word in the same word fragment according to its occurrence probability. In the prior art, the weights of the words in a search query cannot be determined from content generated by users using a search engine in the Internet environment, so the word weights are determined inaccurately, which in turn affects the accuracy of the search results. Compared with this defect of the prior art, the present invention relies on the click logs formed by large numbers of users using a search engine and can accurately determine the weights of the words in a search query in an Internet search engine environment, thereby effectively improving the accuracy of the search results.
For a better understanding of the method shown in Fig. 1, and as a refinement and extension of the above embodiment, the steps in Fig. 1 are described in detail below.
Users typically generate a large number of click logs while using the Internet. The click log information includes data such as the query submitted to the search engine, the uniform resource locator (url) clicked for that query, and the title corresponding to the url. Because the query submitted by a user and the title obtained from the url clicked for that query usually correspond to each other, large-scale statistics over click log information provide the data basis for determining the weights of search keyword terms in an Internet search engine environment. After submitting one query, a user sometimes clicks several urls and obtains several related titles, and these titles vary in quality, that is, in how well they match the query. The embodiment of the present invention therefore organizes the obtained click logs, placing each query in one-to-one correspondence with a title in the click logs to obtain <query, title> pairs. Because a user may click several urls for one query and obtain several corresponding titles, the same query may appear in several of the large-scale <query, title> pairs.
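As an illustration only, the organizing of click logs into pairs might be sketched as follows. The `ClickRecord` layout, the `build_pairs` name, and the example urls are assumptions made for this sketch; the description only states that a click log holds the query, the clicked url, and the title of that url.

```python
from typing import Iterable, List, NamedTuple, Tuple


class ClickRecord(NamedTuple):
    # Hypothetical log schema, not defined by the source document.
    query: str
    url: str
    title: str


def build_pairs(log: Iterable[ClickRecord]) -> List[Tuple[str, str]]:
    """Emit one (query, title) pair per clicked url, skipping incomplete records."""
    return [(rec.query, rec.title) for rec in log if rec.query and rec.title]


log = [
    ClickRecord("ABC", "http://example.com/1", "CDEF"),
    ClickRecord("ABC", "http://example.com/2", "ABXY"),
    ClickRecord("ABCDE", "http://example.com/3", "FGACDHJ"),
]
pairs = build_pairs(log)
```

The query "ABC" was clicked through to two urls here, so it contributes two pairs, which matches the remark that one query may appear in several pairs.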
After a user submits a query to a search engine, the search engine needs to adjust its search strategy according to the relative importance, that is, the weight, of each term (keyword) in the query in order to return accurate search results. The importance of each term in a query can be expressed by how the term occurs in the titles corresponding to the query: if a term that appears in a large number of queries occurs many times in the corresponding titles, the term is more important. Each query can contain a variety of word fragments. The word fragments comprise each term in the query and each phrase composed of two or more adjacent terms, and different queries can contain the same word fragment. For a given word fragment, the more often a term of the fragment occurs in the titles corresponding to all queries that contain the fragment, the more important the term is within the fragment. The embodiment of the present invention therefore collects the occurrence information of each term in the word fragments of all queries. To do so, all queries are segmented: all <query, title> pairs are processed, each query is segmented to obtain each term in the query and each phrase composed of two or more adjacent terms (the word fragments described above), and the occurrence information of each term of a word fragment in the corresponding title is collected.
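The word fragments described above are the contiguous runs of terms in a query. A minimal sketch of their enumeration, assuming the query has already been segmented into a list of terms (the segmenter itself is not specified in the description, and `word_fragments` is a hypothetical name):

```python
from typing import List, Sequence, Tuple


def word_fragments(terms: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return every run of adjacent terms, shortest fragments first."""
    n = len(terms)
    return [
        tuple(terms[start:start + size])
        for size in range(1, n + 1)          # fragment length in terms
        for start in range(n - size + 1)     # leftmost position of the fragment
    ]


frags = word_fragments(["A", "B", "C"])
```

A query of n terms yields n(n+1)/2 fragments, so a five-term query such as ABCDE produces 15.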
When the occurrence information of each term in the word fragments of each query is collected, it can be represented by a preset occurrence symbol and a preset non-occurrence symbol. That is, for each term in a word fragment of a query, it is determined whether the term occurs in the title corresponding to the query in its <query, title> pair. If it occurs, it is represented by the preset occurrence symbol; if not, by the preset non-occurrence symbol. Take <query:ABCD, title:CDEFG> as an example, where ABC is one word fragment of the query. Term-A of the word fragment ABC does not occur in title:CDEFG, so it is represented by the non-occurrence symbol 0. Term-B does not occur in title:CDEFG, so it is also represented by 0. Term-C occurs in title:CDEFG, so it is represented by the occurrence symbol 1. The occurrence information of the terms in the word fragment ABC can therefore be written as ABC:001.
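The marking step can be sketched as below. This is an illustrative sketch in which the title is assumed to be already segmented into terms; `occurrence_code` is a hypothetical helper name, and the symbols 1 and 0 follow the example in the text.

```python
from typing import Iterable, Sequence


def occurrence_code(fragment: Sequence[str], title_terms: Iterable[str]) -> str:
    """Mark each term of the fragment with '1' if it occurs in the title, else '0'."""
    present = set(title_terms)
    return "".join("1" if term in present else "0" for term in fragment)


code = occurrence_code(["A", "B", "C"], ["C", "D", "E", "F", "G"])
```

For the fragment ABC and title CDEFG from the text, `code` is the string "001".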
Once the occurrence information of each term in the word fragments of the queries in the <query, title> pairs has been determined in this way, the occurrence probability of each term in a same word fragment can be calculated. The calculation first requires the total number of all titles corresponding to the word fragment. For a given word fragment, this is the total number of <query, title> pairs whose query contains the fragment. Among these pairs, some titles contain a given term of the fragment and some do not. After the total number of titles corresponding to the fragment has been obtained, it is therefore also necessary to obtain the number of times each term of the fragment occurs in those titles, that is, the number of titles that contain the term. Dividing the number of times a term of the fragment occurs in the titles by the total number of corresponding titles gives the occurrence probability of that term of the fragment in all the corresponding titles.
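The division described above can be sketched as follows, assuming the occurrence information for one word fragment has been gathered as a list of equal-length 0/1 strings, one string per corresponding title. The function name and the sample codes are hypothetical and serve only to show the arithmetic.

```python
from typing import List, Sequence


def occurrence_probabilities(codes: Sequence[str]) -> List[float]:
    """Per-term probability: titles containing the term / total titles for the fragment."""
    total = len(codes)
    if total == 0:
        raise ValueError("a fragment needs at least one corresponding title")
    width = len(codes[0])
    return [sum(code[i] == "1" for code in codes) / total for i in range(width)]


# Four hypothetical titles for the fragment ABC.
probs = occurrence_probabilities(["001", "101", "111", "101"])
```

Here term-A occurs in 3 of 4 titles, term-B in 1 of 4, and term-C in all 4, giving 0.75, 0.25, and 1.0.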
Within a given word fragment, the more frequently a term occurs in the titles corresponding to the queries that contain the fragment, the more important the term is. The weight of each term in a same word fragment can therefore be determined from the calculated occurrence probability of each term in the fragment. As an optional implementation, the embodiment of the present invention can use the occurrence probability of each term of the fragment in all the corresponding titles as the weight of that term in the fragment.
For a better understanding of the above method, the process is described in detail below using two <query, title> pairs as an example. The two pairs are <query:ABC, title:CDEF> and <query:ABCDE, title:FGACDHJ>. If a term of the query occurs in the corresponding title, it is represented by the occurrence symbol 1; if it does not, it is represented by the non-occurrence symbol 0.
To collect the occurrence of each term in the word fragments of the query in a <query, title> pair, the query is first segmented to obtain all of its word fragments, and the occurrence of each term in each fragment is then recorded. That is, each word fragment in the query is used as the key, and the occurrence of the fragment's terms in the corresponding title is output as the value. The results are as follows:
1) For the pair <query:ABC, title:CDEF>:
Fragments of 1 term: A:0, B:0, C:1
Fragments of 2 terms: AB:00, BC:01
Fragments of 3 terms: ABC:001
2) For the pair <query:ABCDE, title:FGACDHJ>:
Fragments of 1 term: A:1, B:0, C:1, D:1, E:0
Fragments of 2 terms: AB:10, BC:01, CD:11, DE:10
Fragments of 3 terms: ABC:101, BCD:011, CDE:110
Fragments of 4 terms: ABCD:1011, BCDE:0110
Fragments of 5 terms: ABCDE:10110
After all <query, title> pairs have been processed, identical word fragments are merged according to the occurrence information of each term in the fragment; that is, the occurrence probability of each term in a same word fragment is calculated. Take the word fragment ABC as an example. In the pair <query:ABC, title:CDEF>, the value of the word fragment ABC is 001; in the pair <query:ABCDE, title:FGACDHJ>, the value of the word fragment ABC is 101. Term-A does not occur in the title of <query:ABC, title:CDEF> but does occur in the title of <query:ABCDE, title:FGACDHJ>, so the probability that term-A occurs in a title is 0.5. Likewise, the probability that term-B occurs in a title is 0, and the probability that term-C occurs in a title is 1. For the word fragment ABC, the probabilities that its terms occur in the search results are therefore ABC:0.5,0,1. According to these statistics, when a user submits a query containing ABC to the search engine, the terms to be considered during the search rank in importance as term-C > term-A > term-B.
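The whole worked example can be reproduced with a short end-to-end sketch. It assumes, purely for illustration, that every term is a single character, so plain strings stand in for segmented queries and titles; `fragment_weights` is a hypothetical name, and the sketch is not presented as the patented implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def fragment_weights(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[float]]:
    """Map each word fragment to the per-term probability of occurring in a title."""
    hits: Dict[str, List[int]] = {}              # fragment -> per-term occurrence counts
    totals: Dict[str, int] = defaultdict(int)    # fragment -> number of corresponding titles
    for query, title in pairs:
        title_terms = set(title)                 # single-character terms (assumption)
        n = len(query)
        for size in range(1, n + 1):
            for start in range(n - size + 1):
                frag = query[start:start + size]
                row = hits.setdefault(frag, [0] * size)
                for j, term in enumerate(frag):
                    row[j] += term in title_terms    # True counts as 1
                totals[frag] += 1
    return {frag: [h / totals[frag] for h in row] for frag, row in hits.items()}


weights = fragment_weights([("ABC", "CDEF"), ("ABCDE", "FGACDHJ")])
print(weights["ABC"])  # [0.5, 0.0, 1.0]
```

The result for the fragment ABC matches the figures derived in the text: 0.5 for term-A, 0 for term-B, and 1 for term-C.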
Of course, the above explanation uses only two <query, title> pairs, so the resulting probabilities are not representative; the example serves only to make the analysis process clear. In actual analysis, statistics must be collected over large-scale <query, title> pairs in the manner described above to obtain reliable occurrence probabilities for the terms of a word fragment. For example, suppose that statistics over a large number of <query, title> pairs yield data such as word fragment ABC:0.7,0.3,0.9. This means that, among all queries containing the word fragment ABC, the probability that a clicked title contains term-A is 0.7, the probability that it contains term-B is 0.3, and the probability that it contains term-C is 0.9. Term-A and term-C can therefore be considered relatively important, and term-B relatively unimportant.
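When the occurrence probabilities are used directly as weights, ordering the terms of a fragment by importance reduces to a sort. A small sketch using the illustrative figures above (`rank_terms` is a hypothetical helper, not something named in the source):

```python
from typing import List, Sequence


def rank_terms(fragment: Sequence[str], probabilities: Sequence[float]) -> List[str]:
    """Order the terms of a fragment from highest to lowest weight."""
    weights = dict(zip(fragment, probabilities))   # weight = occurrence probability
    return sorted(weights, key=weights.get, reverse=True)


ranking = rank_terms(["A", "B", "C"], [0.7, 0.3, 0.9])
```

With these figures the ranking is C, A, B, consistent with the reading that term-A and term-C matter more than term-B.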
With the method for analyzing word weight described in the embodiment of the present invention, word fragments used in the Internet search environment, together with the occurrence probabilities in titles of the terms they contain, can be mined on a large scale. Consider the following two word fragments: a) tomato fish soup: 0.75, 0.82; b) fish soup is OK: 0.78, 0.51. Fragment a) indicates that, among the titles clicked for all queries containing "tomato fish soup", 75% contain "tomato" and 82% contain "fish soup". Because "tomato" has a synonym (rendered identically as "tomato" in this translation), the proportion of titles that actually refer to tomato is even higher. Fragment b) indicates that, among the titles clicked for all queries containing "fish soup is OK", "fish soup" occurs more often than "OK", so "fish soup" is more important than "OK".
The embodiment of the present invention uses <query, title> pairs to determine whether each term of a query occurs in the title, outputs the occurrence information as the value of each word fragment, and then calculates from the values of each word fragment the occurrence probability in titles of each term of a same word fragment, thereby obtaining the weight information of each term in the fragment. Because the weight information of these terms is determined from click log information in a large-scale Internet search environment, it can effectively improve the search quality of the search engine.
Further, as an implementation of the method shown in Fig. 1 above, an embodiment of the present invention provides a word weight analysis device. As shown in Fig. 2, the device comprises an acquiring unit 21, a statistic unit 22, a computing unit 23 and a determining unit 24, wherein:
The acquiring unit 21 is configured to obtain <query, title> pairs;
The statistic unit 22 is configured to count, for the <query, title> pairs obtained by the acquiring unit 21, the appearance information of each word in the word fragments of the query;
The computing unit 23 is configured to calculate the probability of occurrence of each word in the same word fragment from the appearance information counted by the statistic unit 22;
The determining unit 24 is configured to determine the weight of each word in the same word fragment from the probability of occurrence of each word in that word fragment calculated by the computing unit 23.
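The arrangement of the four units can be sketched as a simple pipeline. This is only an illustrative skeleton under stated assumptions: the class name `WordWeightAnalyzer`, its field names and the type aliases are hypothetical, and the lambdas are trivial stand-ins rather than the actual logic of units 21 to 24.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[str, str]                      # (query, title)
Info = Tuple[Tuple[str, ...], str]          # (word fragment, appearance flags)
Table = Dict[Tuple[str, ...], List[float]]  # word fragment -> one value per word


@dataclass
class WordWeightAnalyzer:
    acquire: Callable[[], Iterable[Pair]]           # acquiring unit 21
    count: Callable[[Iterable[Pair]], List[Info]]   # statistic unit 22
    compute: Callable[[List[Info]], Table]          # computing unit 23
    determine: Callable[[Table], Table]             # determining unit 24

    def run(self) -> Table:
        # Each unit consumes the output of the previous one.
        pairs = self.acquire()
        info = self.count(pairs)
        probabilities = self.compute(info)
        return self.determine(probabilities)


# Trivial stand-ins, only to show how the units are wired together.
analyzer = WordWeightAnalyzer(
    acquire=lambda: [("a b", "a")],
    count=lambda pairs: [(("a", "b"), "10") for _ in pairs],
    compute=lambda info: {frag: [float(c) for c in flags] for frag, flags in info},
    determine=lambda probabilities: probabilities,
)
weights = analyzer.run()
```

Keeping each unit behind its own callable mirrors the statement later in this description that modules may be combined or split without changing the overall flow.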
Further, as shown in Fig. 3, the acquiring unit 21 comprises:
Acquisition module 211, configured to obtain a user click log, the click log comprising all queries submitted by users and all titles obtained;
Sorting module 212, configured to organize the click log obtained by the acquisition module 211 so that each query submitted by a user is put in one-to-one correspondence with the title of the URL clicked for that query, to form <query, title> pairs.
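The organizing step performed by the sorting module can be sketched as follows. The log format is an assumption made for illustration: each record is taken to be a dictionary with the hypothetical keys `query`, `url` and `title`, and the URLs shown are placeholders.

```python
def build_pairs(click_log):
    """Turn click-log records into <query, title> pairs.

    Each record is assumed to be a dict with the hypothetical keys
    "query", "url" and "title"; records lacking a query or a title are skipped.
    """
    pairs = []
    for record in click_log:
        query = (record.get("query") or "").strip()
        title = (record.get("title") or "").strip()
        if query and title:
            # One pair per clicked URL: the query and the title of that URL.
            pairs.append((query, title))
    return pairs


log = [
    {"query": "tomato fish soup", "url": "http://example.com/1",
     "title": "How to cook tomato fish soup"},
    {"query": "tomato fish soup", "url": "http://example.com/2",
     "title": "Fish soup recipes"},
    {"query": "fish soup", "url": "http://example.com/3", "title": None},
]
pairs = build_pairs(log)
```

The same query may appear in several pairs, once for each clicked title, which is what later allows counts to be accumulated per word fragment.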
Further, as shown in Fig. 4, the statistic unit 22 comprises:
Cutting module 221, configured to obtain all word fragments of the query in a <query, title> pair, the word fragments comprising each word in the query and every phrase formed by two or more adjacent words;
Statistical module 222, configured to count the appearance information of each word in all word fragments of the query obtained by the cutting module 221.
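The cutting step can be illustrated with a short sketch. It assumes the query has already been segmented into a list of words (segmentation itself is outside the sketch), and the function name `word_fragments` is hypothetical.

```python
def word_fragments(terms):
    """Return every word fragment of a segmented query.

    A fragment is either a single word or a run of two or more adjacent
    words, so a query of n words yields n * (n + 1) / 2 fragments.
    """
    n = len(terms)
    return [tuple(terms[i:j]) for i in range(n) for j in range(i + 1, n + 1)]


fragments = word_fragments(["A", "B", "C"])
```

For a three-word query A B C this gives A, AB, ABC, B, BC and C. The non-adjacent combination AC is not a fragment.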
Further, the statistic unit 22 is also configured to judge whether each word in the word fragments of the query appears in the title corresponding to that query in its <query, title> pair, and to count the appearance information of each word in the word fragments of the query according to the judgment result, the appearance information being represented by a preset appearance symbol and a preset non-appearance symbol.
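A minimal sketch of this judgment follows. The symbols "1" and "0" are assumed here as the preset appearance and non-appearance symbols, and exact matching against an already segmented title is likewise an assumption; the function name `appearance_info` is hypothetical.

```python
APPEAR = "1"      # hypothetical preset symbol: the word appears in the title
NOT_APPEAR = "0"  # hypothetical preset symbol: the word does not appear


def appearance_info(fragment, title_terms):
    """Return one symbol per word of the fragment, in fragment order."""
    title = set(title_terms)
    return "".join(APPEAR if word in title else NOT_APPEAR for word in fragment)


info = appearance_info(("tomato", "fish soup"), ["how", "to", "cook", "fish soup"])
```

Here `info` is "01": "tomato" is absent from the title and "fish soup" is present. The string can then be output as the value associated with the word fragment.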
Further, as shown in Fig. 5, the computing unit 23 comprises:
Counting module 231, configured to obtain the total number of all titles corresponding to the same word fragment;
The counting module 231 is also configured to obtain the number of times each word in the same word fragment appears in all the corresponding titles;
Computing module 232, configured to divide the number of times by the total number of all the corresponding titles to obtain the probability of occurrence, in all the corresponding titles, of each word in the same word fragment.
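The counting and division performed by modules 231 and 232 can be sketched as below. The input format is an assumption: each record is a (word fragment, appearance string) tuple in which "1" marks a word that appeared in the title, and the function name `occurrence_probabilities` is hypothetical.

```python
from collections import defaultdict


def occurrence_probabilities(records):
    """Compute, per word fragment, the share of titles containing each word.

    `records` is an iterable of (fragment, info) tuples, where `info` holds
    one "1"/"0" flag per word of the fragment (an assumed encoding).
    """
    totals = defaultdict(int)  # fragment -> number of corresponding titles
    hits = {}                  # fragment -> per-word appearance counts
    for fragment, info in records:
        if len(info) != len(fragment):
            raise ValueError("one flag is required per word")
        totals[fragment] += 1
        counts = hits.setdefault(fragment, [0] * len(fragment))
        for k, flag in enumerate(info):
            if flag == "1":
                counts[k] += 1
    # Number of appearances divided by the total number of titles.
    return {f: [c / totals[f] for c in hits[f]] for f in totals}


records = [
    (("tomato", "fish soup"), "11"),
    (("tomato", "fish soup"), "01"),
    (("tomato", "fish soup"), "11"),
    (("tomato", "fish soup"), "11"),
]
probs = occurrence_probabilities(records)
```

With these four illustrative records, "tomato" appears in three of four titles and "fish soup" in all four. As noted earlier in this description, reliable values require counting over a large number of pairs.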
Further, the determining unit 24 is configured to take the probability of occurrence, in all the corresponding titles, of each word in the same word fragment as the weight of that word in the word fragment.
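The relationship between count, probability and weight can be written compactly. The symbols below are introduced here for illustration only and do not appear elsewhere in this description: f denotes a word fragment, t a word in f, N(f) the total number of titles corresponding to f, and c(t, f) the number of those titles in which t appears.

```latex
w(t, f) \;=\; P(t \mid f) \;=\; \frac{c(t, f)}{N(f)}, \qquad 0 \le w(t, f) \le 1
```

For instance, a word that appears in 70 of 100 titles corresponding to its word fragment would receive a weight of 0.7.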
The word weight analysis device provided by the embodiment of the present invention can obtain <query, title> pairs on a large scale while users use an Internet search engine, count the appearance information of each word in the word fragments of the queries, calculate the probability of occurrence of each word in the same word fragment from that appearance information, and determine the weight of each word in the word fragment from its probability of occurrence. In the prior art, by contrast, the weight of a word in a search query cannot be determined from how the search engine is actually used in the Internet environment, so the word weights of search terms are determined inaccurately, which in turn affects the accuracy of search results. Compared with this defect of the prior art, the present invention can accurately determine the weight of words in search queries in an Internet search engine environment, on the basis of the click logs formed when users use the search engine on a large scale, thereby effectively improving the accuracy of search results.
In addition, the embodiment of the present invention uses <query, title> pairs to count whether each term of a query appears in the corresponding title, and outputs this appearance information as the value of the word fragment. The probability that each term of the same word fragment appears in titles is then computed from the values of that word fragment, which yields the weight of each term in the word fragment. Because these term weights are determined from click log information gathered in a large-scale Internet search environment, they can effectively improve the search quality of a search engine.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
It will be understood that the related features of the above method and device may be referred to one another. In addition, "first", "second" and the like in the above embodiments are used to distinguish the embodiments and do not indicate that any embodiment is better or worse than another.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may be understood by reference to the corresponding processes in the foregoing method embodiments, and are not repeated here.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described herein may be implemented in a variety of programming languages, and that the description above of a specific language is given to disclose the best mode of the present invention.
Numerous specific details are set forth in the description provided herein. It will be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the present invention are sometimes grouped together in a single embodiment, figure or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment may be combined into one module, unit or component, and may furthermore be divided into a plurality of sub-modules, sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
Furthermore, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the title of the invention (for example, a device for determining the level of links within a website) according to the embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.

Claims (12)

1. A method for analyzing word weight, characterized in that the method comprises:
Obtaining <query, title> pairs;
Counting the appearance information of each word in the word fragments of the query in the <query, title> pairs;
Calculating the probability of occurrence of each word in the same word fragment according to the appearance information;
Determining the weight of each word in the same word fragment according to the probability of occurrence of each word in the same word fragment.
2. The method according to claim 1, characterized in that obtaining the <query, title> pairs comprises:
Obtaining a user click log, the click log comprising all queries submitted by users and all titles obtained;
Organizing the click log so that each query submitted by a user is put in one-to-one correspondence with the title of the URL clicked for that query, to form <query, title> pairs.
3. The method according to claim 1, characterized in that counting the appearance information of each word in the word fragments of the query in the <query, title> pairs comprises:
Obtaining all word fragments of the query in a <query, title> pair, the word fragments comprising each word in the query and every phrase formed by two or more adjacent words;
Counting the appearance information of each word in all word fragments of the query.
4. The method according to claim 3, characterized in that counting the appearance information of each word in all word fragments of the query comprises:
Judging whether each word in the word fragments of the query appears in the title corresponding to that query in its <query, title> pair;
Counting the appearance information of each word in the word fragments of the query according to the judgment result, the appearance information being represented by a preset appearance symbol and a preset non-appearance symbol.
5. The method according to claim 4, characterized in that calculating the probability of occurrence of each word in the same word fragment according to the appearance information comprises:
Obtaining the total number of all titles corresponding to the same word fragment;
Obtaining the number of times each word in the same word fragment appears in all the corresponding titles;
Dividing the number of times by the total number of all the corresponding titles to obtain the probability of occurrence, in all the corresponding titles, of each word in the same word fragment.
6. The method according to claim 5, characterized in that determining the weight of each word in the same word fragment according to the probability of occurrence of each word in the same word fragment comprises:
Taking the probability of occurrence, in all the corresponding titles, of each word in the same word fragment as the weight of that word in the word fragment.
7. A device for analyzing word weight, characterized in that the device comprises:
An acquiring unit, configured to obtain <query, title> pairs;
A statistic unit, configured to count, for the <query, title> pairs obtained by the acquiring unit, the appearance information of each word in the word fragments of the query;
A computing unit, configured to calculate the probability of occurrence of each word in the same word fragment from the appearance information counted by the statistic unit;
A determining unit, configured to determine the weight of each word in the same word fragment from the probability of occurrence of each word in that word fragment calculated by the computing unit.
8. The device according to claim 7, characterized in that the acquiring unit comprises:
An acquisition module, configured to obtain a user click log, the click log comprising all queries submitted by users and all titles obtained;
A sorting module, configured to organize the click log obtained by the acquisition module so that each query submitted by a user is put in one-to-one correspondence with the title of the URL clicked for that query, to form <query, title> pairs.
9. The device according to claim 7, characterized in that the statistic unit comprises:
A cutting module, configured to obtain all word fragments of the query in a <query, title> pair, the word fragments comprising each word in the query and every phrase formed by two or more adjacent words;
A statistical module, configured to count the appearance information of each word in all word fragments of the query obtained by the cutting module.
10. The device according to claim 9, characterized in that the statistic unit is also configured to judge whether each word in the word fragments of the query appears in the title corresponding to that query in its <query, title> pair, and to count the appearance information of each word in the word fragments of the query according to the judgment result, the appearance information being represented by a preset appearance symbol and a preset non-appearance symbol.
11. The device according to claim 10, characterized in that the computing unit comprises:
A counting module, configured to obtain the total number of all titles corresponding to the same word fragment;
The counting module is also configured to obtain the number of times each word in the same word fragment appears in all the corresponding titles;
A computing module, configured to divide the number of times by the total number of all the corresponding titles to obtain the probability of occurrence, in all the corresponding titles, of each word in the same word fragment.
12. The device according to claim 11, characterized in that the determining unit is configured to take the probability of occurrence, in all the corresponding titles, of each word in the same word fragment as the weight of that word in the word fragment.
CN201510921247.1A 2015-12-11 2015-12-11 A kind of analysis method and device of word weight Active CN105488209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510921247.1A CN105488209B (en) 2015-12-11 2015-12-11 A kind of analysis method and device of word weight

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510921247.1A CN105488209B (en) 2015-12-11 2015-12-11 A kind of analysis method and device of word weight

Publications (2)

Publication Number Publication Date
CN105488209A true CN105488209A (en) 2016-04-13
CN105488209B CN105488209B (en) 2019-06-07

Family

ID=55675184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510921247.1A Active CN105488209B (en) 2015-12-11 2015-12-11 A kind of analysis method and device of word weight

Country Status (1)

Country Link
CN (1) CN105488209B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5278980A (en) * 1991-08-16 1994-01-11 Xerox Corporation Iterative technique for phrase query formation and an information retrieval system employing same
CN101785000A (en) * 2007-06-25 2010-07-21 谷歌股份有限公司 Word probability determination
CN102289436A (en) * 2010-06-18 2011-12-21 阿里巴巴集团控股有限公司 Method and device for determining weighted value of search term and method and device for generating search results
CN104361115A (en) * 2014-12-01 2015-02-18 北京奇虎科技有限公司 Entry weight definition method and device based on co-clicking
CN104376115A (en) * 2014-12-01 2015-02-25 北京奇虎科技有限公司 Fuzzy word determining method and device based on global search
CN104615723A (en) * 2015-02-06 2015-05-13 百度在线网络技术(北京)有限公司 Determining method and device of search term weight value
CN105095381A (en) * 2015-06-30 2015-11-25 北京奇虎科技有限公司 Method and device for new word identification

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408794A (en) * 2017-08-17 2019-03-01 阿里巴巴集团控股有限公司 A kind of frequency dictionary method for building up, segmenting method, server and client side's equipment
CN111367592A (en) * 2018-12-07 2020-07-03 北京字节跳动网络技术有限公司 Information processing method and device
CN111367592B (en) * 2018-12-07 2023-07-11 北京字节跳动网络技术有限公司 Information processing method and device
CN109815396A (en) * 2019-01-16 2019-05-28 北京搜狗科技发展有限公司 Search term Weight Determination and device
CN109815396B (en) * 2019-01-16 2021-09-21 北京搜狗科技发展有限公司 Search term weight determination method and device

Also Published As

Publication number Publication date
CN105488209B (en) 2019-06-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220726

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.