CN106484671B - Method for identifying time-sensitive query content - Google Patents

Method for identifying time-sensitive query content

Info

Publication number
CN106484671B
Authority
CN
China
Prior art keywords
timeliness
inquiry
content
document
period
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510526945.1A
Other languages
Chinese (zh)
Other versions
CN106484671A (en)
Inventor
吴尉林
许欢庆
郭永福
陈沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY CO., LTD.
Original Assignee
Beijing Zhongsou Cloud Business Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Cloud Business Network Technology Co Ltd
Priority to CN201510526945.1A
Publication of CN106484671A
Application granted
Publication of CN106484671B
Legal status: Active
Anticipated expiration

Abstract

The present invention provides a method for identifying time-sensitive query content. The method builds an index over a resource of time-sensitive documents, counts how often the query content occurs in that resource, and judges from those counts whether the query is time-sensitive. The method identifies time-sensitive queries quickly and comprehensively. It has low resource requirements and works for both common and long-tail queries; it improves recall; it still identifies time-sensitive queries whose burst is already decaying; and it reports a timeliness intensity for each query, so that downstream modules can apply different strategies according to that intensity. Together these properties support accurate and reliable identification.

Description

Method for identifying time-sensitive query content
Technical field
The present invention relates to the field of query content recognition, and in particular to a method for identifying time-sensitive query content.
Background art
In the current big-data era of information explosion, search engines have become an indispensable means of obtaining information. A user enters a query into a search engine, receives search results, and looks for the required information among them. In some cases a user query is strongly time-sensitive. For example, during the 2014 World Cup in Brazil, a user who entered "world cup" was mainly interested in content about that tournament rather than about earlier World Cups. In such a case the search engine should first determine that "world cup" was a time-sensitive query at that moment, and then give priority to newer relevant results. According to statistics, queries with a timeliness requirement account for up to about 30% of all queries. Identifying time-sensitive queries is therefore very important for improving search result quality.
Existing methods for identifying time-sensitive queries are typically based on search engine query logs: they compare the volume of a given query in two consecutive time periods, and treat the query as time-sensitive if the volume has clearly increased. Existing criteria include:
(1) Increase in query volume between the two periods
If the increase in query volume from the earlier period to the later period exceeds a threshold, the query is considered time-sensitive. The drawback of this method is that it is insensitive to long-tail queries: for example, if the query volume rises from 100 to 200, the volume has doubled but the difference is only 100.
(2) Relative increase in query volume between the two periods
If the ratio of the increase in query volume to the query volume in the first period exceeds a certain threshold, the query is considered time-sensitive. This method avoids the drawback of the first, but it is too sensitive to long-tail queries: for example, if the query volume rises from 5 to 10, the volume has doubled although the difference is only 5.
(3) Angle of the trend lines of the two periods
This method was proposed in a Chinese invention patent (patent No. CN201410211458.1), in which the second time period is defined as a part of the first. The method holds that a query is time-sensitive if its volume grows slowly over the first period and rapidly over the second.
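The first two criteria above can be sketched as follows. This is a minimal illustration only: the function names and the threshold values are assumptions chosen for the example and do not come from the cited methods.

```python
def is_burst_by_difference(prev_count: int, curr_count: int, min_increase: int) -> bool:
    """Criterion (1): flag a query when the absolute increase exceeds a threshold."""
    return (curr_count - prev_count) > min_increase


def is_burst_by_ratio(prev_count: int, curr_count: int, min_ratio: float) -> bool:
    """Criterion (2): flag a query when increase / earlier volume exceeds a threshold."""
    if prev_count <= 0:
        # Assumed convention: any occurrence after zero volume counts as a burst.
        return curr_count > 0
    return (curr_count - prev_count) / prev_count > min_ratio


# Hypothetical thresholds: an absolute threshold tuned for head queries misses a
# long-tail query that doubles, while a ratio threshold fires on a tiny query.
long_tail_missed = not is_burst_by_difference(100, 200, min_increase=1000)
tiny_query_flagged = is_burst_by_ratio(5, 10, min_ratio=0.5)
print(long_tail_missed, tiny_query_flagged)
```

The two calls reuse the numbers from the text (100 to 200, and 5 to 10) to show why neither criterion alone handles long-tail queries well.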
The existing methods have the following drawbacks:
(1) They rely on search engine logs to determine whether a query shows a burst trend. Search engine logs are an expensive resource that usually only a few large search engine vendors possess, which greatly limits the applicability of these methods.
(2) They usually count query volume over the entire query string. A query that is similar to a burst query in the search log, but that does not itself appear as a whole in the log, therefore cannot be identified, which lowers recall. For example, around May 27, 2015, "Huang Xiaoming Baby marriage certificate" was a popular search in the search log, but "Baby marriage certificate" may not have been. If counts are taken over the entire query string, "Baby marriage certificate" is not identified.
(3) Trend-based methods identify time-sensitive queries fairly easily while the query volume is rising (from trough to peak), but tend to miss them while the volume is falling (coming down from the peak). In general the query is still time-sensitive at that stage, because a hot topic always has some persistence. For example, under the method proposed in patent No. CN201410211458.1, if the query volume grows quickly in the first period and then grows slowly or even declines in the second, the criterion is not met.
Summary of the invention
In view of this, the present invention provides a method for identifying time-sensitive query content. The method has low resource requirements and works for both common and long-tail queries; it improves recall; it still identifies time-sensitive queries whose burst is decaying; and it reports the timeliness intensity of a query, so that downstream modules can apply different strategies according to that intensity. It thereby supports comprehensive, accurate, and reliable identification.
The object of the present invention is achieved through the following technical solution:
A method for identifying time-sensitive query content, the method comprising:
Step 1. Building an index of a time-sensitive document resource;
Step 2. Counting the number of times the query content occurs in the time-sensitive document resource, and computing the mean and variance statistics of those occurrence counts;
Step 3. Judging the timeliness of the query content, and thereby identifying time-sensitive query content.
Preferably, the time-sensitive document resource in step 1 is a collection of time-sensitive documents;
each time-sensitive document is a search engine query log or a news document.
Preferably, step 1 comprises:
1-1. Adding new time-sensitive documents to the time-sensitive document resource in real time, and recording the time at which each time-sensitive document is added to the resource;
1-2. Performing Chinese word segmentation on the current time-sensitive document to obtain a Chinese word segmentation result;
1-3. According to the Chinese word segmentation result, adding the time-sensitive document to the index of the time-sensitive document resource in real time.
Preferably, step 2 comprises:
2-1. Performing Chinese word segmentation on the query content to obtain query terms;
2-2. Searching the time-sensitive document resource through the index to obtain the time-sensitive documents that contain all of the query terms;
2-3. Counting the number of times the query content occurs in the time-sensitive document resource, and computing the mean and variance statistics of those counts.
Preferably, step 2-3 comprises:
A. Taking the current time slot as the end point, marking off one cycle backward in time, the cycle being divided into multiple time slots of equal length; the total number of time slots, including the one current time slot, is N+1;
B. For each time slot $T_i$ ($-N \le i \le 0$), including the current time slot, counting the number of times $C_i$ that the query content occurs in the time-sensitive documents;
C. Over the history slots that exclude the current time slot ($-N \le i \le -1$), computing the mean $\bar{C}$ and the standard deviation $SD$ of the counts $C_i$, where $\bar{C} = \frac{1}{N}\sum_{i=-N}^{-1} C_i$ and $SD$ is the standard deviation of those $N$ counts about $\bar{C}$.
Preferably, step 3 comprises:
3-1. Judging whether the occurrence count $C_0$ of the query content in the current time slot is greater than a threshold, the threshold being determined according to the size of the resource collection;
If so, proceeding to 3-2;
If not, identifying the query content as non-time-sensitive query content;
3-2. Judging whether the occurrence count $C_0$ of the query content in the current time slot, the mean $\bar{C}$, and the standard deviation $SD$ satisfy the burst condition [formula missing], where α is an empirical coefficient greater than 1;
If so, identifying the query content as time-sensitive query content, and determining the timeliness intensity of the query content from the ratio of $C_0$ to $\bar{C}$;
If not, proceeding to 3-3;
3-3. Within an interval window that lies before the current time slot and inside the cycle, counting the occurrence count $C_j$ of the query content in each time slot of the interval window, where the interval window contains M time slots, $M < N$, and $-M \le j \le -1$;
3-4. Judging whether any time slot in the interval window satisfies the corresponding condition on $C_j$ [formula missing];
If so, identifying the query content as time-sensitive query content, and determining the timeliness intensity of the query content from the ratio of $C_j$ to $\bar{C}$;
If not, identifying the query content as non-time-sensitive query content.
As can be seen from the above technical solution, the present invention provides a method for identifying time-sensitive query content: it builds an index over a time-sensitive document resource, counts the number of times the query content occurs in that resource, and judges the timeliness of the query content, thereby identifying time-sensitive query content. The proposed method identifies time-sensitive query content quickly and comprehensively; it has low resource requirements and works for both common and long-tail queries; it improves recall; it still identifies time-sensitive queries in the decay phase of a burst; and it provides the timeliness intensity of a query, so that downstream modules can apply different strategies according to that intensity. It thereby supports accurate and reliable identification.
Compared with the closest prior art, the technical solution provided by the present invention has the following advantages:
1. The technical solution builds an index over a time-sensitive document resource, counts the occurrences of the query content in that resource, and judges the timeliness of the query content, so that time-sensitive query content is identified quickly, comprehensively, accurately, and reliably.
2. The technical solution places low demands on resources: the resource can be search engine logs or a collection of news documents, and the latter is easier to obtain than the former.
3. The technical solution counts the occurrences of a query by retrieval rather than by whole-string matching, which improves recall.
4. The technical solution is insensitive to the absolute volume of a query, and works for both common and long-tail queries.
5. The technical solution still identifies time-sensitive queries in the decay phase of a burst.
6. The technical solution provides the timeliness intensity of a query, so that downstream modules can apply different strategies according to that intensity.
7. The technical solution is widely applicable and has significant social and economic benefits.
Brief description of the drawings
Fig. 1 is a flow diagram of the method for identifying time-sensitive query content according to the present invention;
Fig. 2 is a flow diagram of step 1 of the method;
Fig. 3 is a flow diagram of step 2 of the method;
Fig. 4 is a flow diagram of step 3 of the method.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
As shown in Fig. 1, the present invention provides a method for identifying time-sensitive query content, comprising:
Step 1. Building an index of a time-sensitive document resource;
Step 2. Counting the number of times the query content occurs in the time-sensitive document resource, and computing the mean and variance statistics of those occurrence counts;
Step 3. Judging the timeliness of the query content, and thereby identifying time-sensitive query content.
Preferably, the time-sensitive document resource in step 1 is a collection of time-sensitive documents;
each time-sensitive document is a search engine query log or a news document.
As shown in Fig. 2, step 1 comprises:
1-1. Adding new time-sensitive documents to the time-sensitive document resource in real time, and recording the time at which each time-sensitive document is added to the resource;
1-2. Performing Chinese word segmentation on the current time-sensitive document to obtain a Chinese word segmentation result;
1-3. According to the Chinese word segmentation result, adding the time-sensitive document to the index of the time-sensitive document resource in real time.
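Step 1 can be illustrated with a small in-memory inverted index. This is a sketch under stated assumptions, not the patent's implementation: `TimelyIndex`, `segment`, and `match_times` are hypothetical names, and `segment` is a whitespace stand-in for whatever Chinese word segmenter a real system would use. The lookup returns only documents that contain every query term, which is the retrieval behaviour the method relies on instead of whole-string matching.

```python
from __future__ import annotations

import time
from collections import defaultdict


def segment(text: str) -> list[str]:
    """Stand-in tokenizer (assumption): splits on whitespace.
    A real system would call a Chinese word segmenter here."""
    return text.lower().split()


class TimelyIndex:
    """Inverted index over time-sensitive documents, with per-document add times."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)  # term -> doc ids
        self.added_at: list[float] = []                         # doc id -> add time

    def add(self, text: str, added_at: float | None = None) -> int:
        """Steps 1-1 to 1-3: record the add time, segment, and index the document."""
        doc_id = len(self.added_at)
        self.added_at.append(time.time() if added_at is None else added_at)
        for term in segment(text):
            self.postings[term].add(doc_id)
        return doc_id

    def match_times(self, query: str) -> list[float]:
        """Add times of the documents that contain every term of the query."""
        terms = segment(query)
        if not terms:
            return []
        ids = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(self.added_at[i] for i in ids)


index = TimelyIndex()
index.add("huang xiaoming baby marriage certificate", added_at=100.0)
index.add("baby marriage certificate photos", added_at=200.0)
index.add("world cup final", added_at=300.0)
print(index.match_times("baby marriage certificate"))
```

Because matching is per term, the shorter query is found in both of the first two documents even though it never appears as a complete logged string.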
As shown in Fig. 3, step 2 comprises:
2-1. Performing Chinese word segmentation on the query content to obtain query terms;
2-2. Searching the time-sensitive document resource through the index to obtain the time-sensitive documents that contain all of the query terms;
2-3. Counting the number of times the query content occurs in the time-sensitive documents.
Step 2-3 comprises:
A. Taking the current time slot as the end point, marking off one cycle backward in time, the cycle being divided into multiple time slots of equal length; the total number of time slots, including the current time slot, is N+1; the length of a time slot is chosen according to application requirements, on the order of hours or days;
B. For each time slot $T_i$ ($-N \le i \le 0$), including the current time slot, counting the number of times $C_i$ that the query content occurs in the time-sensitive document resource;
C. Over the history slots that exclude the current time slot ($-N \le i \le -1$), computing the mean $\bar{C}$ and the standard deviation $SD$ of the counts $C_i$, where $\bar{C} = \frac{1}{N}\sum_{i=-N}^{-1} C_i$ and $SD$ is the standard deviation of those $N$ counts about $\bar{C}$.
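The per-slot counting and the history statistics of step 2-3 can be sketched as below. The function names are hypothetical, the slot boundaries (slot 0 ends at `now`) are an assumed convention, and the population standard deviation (`statistics.pstdev`) is an assumption, since the text does not say whether the divisor is N or N-1.

```python
from statistics import mean, pstdev


def slot_counts(match_times, now, slot_seconds, n):
    """Return [C_-N, ..., C_-1, C_0]: matching documents per equal-length slot.

    Slot 0 is the current slot and ends at `now`; slot -k ends k slots earlier.
    """
    counts = [0] * (n + 1)
    for t in match_times:
        age = now - t
        if age < 0:
            continue  # added after `now`; ignored
        k = int(age // slot_seconds)  # how many slots back this document falls
        if k <= n:
            counts[n - k] += 1
    return counts


def history_stats(counts):
    """Mean and standard deviation over the history slots C_-N..C_-1 (C_0 excluded).

    Requires at least one history slot (N >= 1).
    """
    history = counts[:-1]
    return mean(history), pstdev(history)


# Five matching documents, slots of 10 time units, N = 2 history slots.
counts = slot_counts([0, 10, 25, 26, 29], now=29, slot_seconds=10, n=2)
avg, sd = history_stats(counts)
print(counts, avg, sd)
```

The timestamps would typically be the add times recorded in step 1-1 for the documents returned by the retrieval in step 2-2.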
As shown in Fig. 4, step 3 comprises:
3-1. Judging whether the occurrence count $C_0$ of the query content in the current time slot is greater than a threshold, the threshold being determined according to the size of the resource collection, for example 10, 20, or 50; this prevents queries with very few occurrences from being misidentified;
If so, proceeding to 3-2;
If not, identifying the query content as non-time-sensitive query content, because it occurs too rarely;
3-2. Judging whether the occurrence count $C_0$ of the query content in the current time slot, the mean $\bar{C}$, and the standard deviation $SD$ satisfy the burst condition [formula missing], where α is an empirical coefficient greater than 1, for example 1.5, 2, or 2.5;
If so, identifying the query content as time-sensitive query content, and determining the timeliness intensity of the query content from the ratio of $C_0$ to $\bar{C}$; this condition indicates that the occurrence count in the current time slot is far above the average, and it mainly serves to identify time-sensitive queries that have just broken out;
If not, proceeding to 3-3;
3-3. Within an interval window that lies before the current time slot and inside the cycle, counting the occurrence count $C_j$ of the query content in each time slot of the interval window, where the interval window contains M time slots, $M < N$, and $-M \le j \le -1$;
For example: the cycle before the current time slot is 1 month long and one time slot is 1 day, so N = 30; the interval window is a stretch of time within that month before the current time slot and contains M = 10 time slots, i.e. the interval window is 10 days;
3-4. Judging whether any time slot in the interval window satisfies the corresponding condition on $C_j$ [formula missing];
If so, identifying the query content as time-sensitive query content, and determining the timeliness intensity of the query content from the ratio of $C_j$ to $\bar{C}$; this condition indicates that the query broke out during the interval window, and since time-sensitive queries usually have some persistence, the query is considered still time-sensitive in the current time slot;
If not, identifying the query content as non-time-sensitive query content.
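The decision flow of step 3 can be sketched as follows. Important assumptions: the exact inequalities of 3-2 and 3-4 are not preserved in this text, so the test `c > mean + alpha * SD` used here is only one plausible reading; the population standard deviation is assumed; and when several slots in the interval window qualify, the largest $C_j$ is used for the intensity, which the text does not specify. The name `classify` and the default parameter values are hypothetical, picked from the example ranges given above.

```python
from statistics import mean, pstdev


def classify(counts, min_count=10, alpha=2.0, m=10):
    """Classify a query from its per-slot counts [C_-N, ..., C_-1, C_0].

    Returns (is_time_sensitive, intensity), where intensity is count / mean.
    The burst test below is an ASSUMED form of the source's missing formula.
    """
    *history, c0 = counts
    if m >= len(history):
        raise ValueError("the interval window must be shorter than the cycle (M < N)")
    avg, sd = mean(history), pstdev(history)

    def bursts(c):
        return c > avg + alpha * sd  # assumed inequality

    def intensity(c):
        return c / avg if avg > 0 else float("inf")

    # 3-1: too few occurrences in the current slot, so not time-sensitive.
    if c0 <= min_count:
        return False, 0.0
    # 3-2: the current slot itself is a burst (a query that has just broken out).
    if bursts(c0):
        return True, intensity(c0)
    # 3-3 / 3-4: a burst within the last M history slots (decay phase).
    recent_bursts = [c for c in history[-m:] if bursts(c)]
    if recent_bursts:
        return True, intensity(max(recent_bursts))
    return False, 0.0


# N = 30 daily slots, M = 10: a spike three days ago that is now fading.
fading = [5] * 27 + [200, 80, 50] + [40]
print(classify(fading))
```

In the fading example the current count no longer passes the burst test on its own, but the spike inside the 10-slot interval window still marks the query as time-sensitive, which is the decay-phase behaviour the embodiment describes.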
The above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, a person of ordinary skill in the art may still modify the specific embodiments of the invention or make equivalent replacements, and any such modification or equivalent replacement that does not depart from the spirit and scope of the present invention falls within the scope of the pending claims of the present invention.

Claims (1)

1. A method for identifying time-sensitive query content, characterized in that the method comprises:
Step 1. Building an index of a time-sensitive document resource;
Step 2. Counting the number of times the query content occurs in the time-sensitive document resource, and computing the mean and variance statistics of those occurrence counts;
Step 3. Judging the timeliness of the query content, and thereby identifying time-sensitive query content;
the time-sensitive document resource in step 1 is a collection of time-sensitive documents;
each time-sensitive document is a search engine query log or a news document;
step 1 comprises:
1-1. Adding new time-sensitive documents to the time-sensitive document resource in real time, and recording the time at which each time-sensitive document is added to the resource;
1-2. Performing Chinese word segmentation on the current time-sensitive document to obtain a Chinese word segmentation result;
1-3. According to the Chinese word segmentation result, adding the time-sensitive document to the index of the time-sensitive document resource in real time;
step 2 comprises:
2-1. Performing Chinese word segmentation on the query content to obtain query terms;
2-2. Searching the time-sensitive document resource through the index to obtain the time-sensitive documents that contain all of the query terms;
2-3. Counting the number of times the query content occurs in the time-sensitive document resource, and computing the mean and variance statistics of those counts;
step 2-3 comprises:
A. Taking the current time slot as the end point, marking off one cycle backward in time, the cycle being divided into multiple time slots of equal length; the total number of time slots, including the one current time slot, is N+1;
B. For each time slot $T_i$ ($-N \le i \le 0$), including the current time slot, counting the number of times $C_i$ that the query content occurs in the time-sensitive documents;
C. Over the history slots that exclude the current time slot ($-N \le i \le -1$), computing the mean $\bar{C}$ and the standard deviation $SD$ of the counts $C_i$, where $\bar{C} = \frac{1}{N}\sum_{i=-N}^{-1} C_i$ and $SD$ is the standard deviation of those $N$ counts about $\bar{C}$;
step 3 comprises:
3-1. Judging whether the occurrence count $C_0$ of the query content in the current time slot is greater than a threshold, the threshold being determined according to the size of the resource collection;
If so, proceeding to 3-2;
If not, identifying the query content as non-time-sensitive query content;
3-2. Judging whether the occurrence count $C_0$ of the query content in the current time slot, the mean $\bar{C}$, and the standard deviation $SD$ satisfy the burst condition [formula missing], where α is an empirical coefficient greater than 1;
If so, identifying the query content as time-sensitive query content, and determining the timeliness intensity of the query content from the ratio of $C_0$ to $\bar{C}$;
If not, proceeding to 3-3;
3-3. Within an interval window that lies before the current time slot and inside the cycle, counting the occurrence count $C_j$ of the query content in each time slot of the interval window, where the interval window contains M time slots, $M < N$, and $-M \le j \le -1$;
3-4. Judging whether any time slot in the interval window satisfies the corresponding condition on $C_j$ [formula missing];
If so, identifying the query content as time-sensitive query content, and determining the timeliness intensity of the query content from the ratio of $C_j$ to $\bar{C}$;
If not, identifying the query content as non-time-sensitive query content.
CN201510526945.1A 2015-08-25 2015-08-25 A kind of recognition methods of timeliness inquiry content Active CN106484671B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510526945.1A CN106484671B (en) 2015-08-25 2015-08-25 A kind of recognition methods of timeliness inquiry content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510526945.1A CN106484671B (en) 2015-08-25 2015-08-25 A kind of recognition methods of timeliness inquiry content

Publications (2)

Publication Number Publication Date
CN106484671A CN106484671A (en) 2017-03-08
CN106484671B true CN106484671B (en) 2019-05-28

Family

ID=58233171

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510526945.1A Active CN106484671B (en) 2015-08-25 2015-08-25 A kind of recognition methods of timeliness inquiry content

Country Status (1)

Country Link
CN (1) CN106484671B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106933809A (en) * 2017-03-27 2017-07-07 三角兽(北京)科技有限公司 Information processor and information processing method
CN107180093B (en) * 2017-05-15 2020-05-19 北京奇艺世纪科技有限公司 Information searching method and device and timeliness query word identification method and device
CN111324805B (en) * 2018-12-13 2024-02-13 北京搜狗科技发展有限公司 Query intention determining method and device, searching method and searching engine

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609445A (en) * 2009-07-16 2009-12-23 复旦大学 Crucial sub-method for extracting topic based on temporal information
CN101645066A (en) * 2008-08-05 2010-02-10 北京大学 Method for monitoring novel words on Internet
CN103049443A (en) * 2011-10-12 2013-04-17 腾讯科技(深圳)有限公司 Method and device for mining hot-spot words
CN103942265A (en) * 2014-03-26 2014-07-23 北京奇虎科技有限公司 Method and device for pushing webpages containing news information

Also Published As

Publication number Publication date
CN106484671A (en) 2017-03-08

Similar Documents

Publication Publication Date Title
US11003726B2 (en) Method, apparatus, and system for recommending real-time information
US10810499B2 (en) Method and apparatus for recommending social media information
US10423648B2 (en) Method, system, and computer readable medium for interest tag recommendation
CN105765573B (en) Improvements in website traffic optimization
US9619564B2 (en) Method and system for providing recommended terms
TW201541267A (en) Method and device of selecting promotion keywords
WO2015196793A1 (en) Hotspot information analysis method and device and computer storage medium
CN106250513A (en) A kind of event personalization sorting technique based on event modeling and system
CN103336766A (en) Short text garbage identification and modeling method and device
TW201428513A (en) Query word fusion method, commodity information release method and search method and system
CN104298719A (en) Method and system for conducting user category classification and advertisement putting based on social behavior
CN106484671B (en) A kind of recognition methods of timeliness inquiry content
CN104424308A (en) Web page classification standard acquisition method and device and web page classification method and device
CN102193936A (en) Data classification method and device
CN105654201B (en) Advertisement traffic prediction method and device
CN105068991A (en) Big data based public sentiment discovery method
WO2017012222A1 (en) Time-sensitivity processing requirement identification method, device, apparatus and non-volatile computer storage medium
CN104317784A (en) Cross-platform user identification method and cross-platform user identification system
CN104933191A (en) Spam comment recognition method and system based on Bayesian algorithm and terminal
CN107609192A (en) The supplement searching method and device of a kind of search engine
US20140250116A1 (en) Identifying time sensitive ambiguous queries
CN107766446A (en) Method for pushing, device, storage medium and the processor of information
CN105183765A (en) Big data-based topic extraction method
CN103279483B (en) A kind of topic Epidemic Scope appraisal procedure towards micro-blog and system
CN103530796A (en) Active period detection method and active period detection system of application program

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20170426

Address after: 100086 Beijing, Haidian District, North Third Ring Road West, No. 43, building 5, floor 08-09, No. 2

Applicant after: BEIJING ZHONGSOU CLOUD BUSINESS NETWORK TECHNOLOGY CO., LTD.

Address before: Shou Heng Technology Building No. 51 Beijing 100191 Haidian District Xueyuan Road room 0902

Applicant before: Beijing Zhongsou Network Technology Co,Ltd

GR01 Patent grant