CN105574199A - Identification method and device for false search behavior of search engine - Google Patents

Info

Publication number
CN105574199A
Authority
CN
China
Prior art keywords
multimedia resource
user
behavior
word
current queries
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201511001301.7A
Other languages
Chinese (zh)
Other versions
CN105574199B (en)
Inventor
魏博
齐志兵
李力行
魏强
马堰夫
姚键
顾思斌
潘柏宇
王冀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Youku Network Technology Beijing Co Ltd
Original Assignee
1Verge Internet Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 1Verge Internet Technology Beijing Co Ltd
Priority to CN201511001301.7A
Publication of CN105574199A
Application granted
Publication of CN105574199B
Status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an identification method and device for a false search behavior of a search engine. The search engine is used for searching for a multimedia resource. The identification method comprises the steps of obtaining user watching behavior data of a single query word and user transformation behavior data of the single query word from a user log; according to the user watching behavior data and/or the user transformation behavior data, determining identification data used for identifying the false search behavior, wherein the identification data includes at least one of an independent multimedia resource playing amount, a multimedia resource average playing completion percentage, multimedia resource clicking divergence degree and multimedia resource set playing residue degree; and according to the identification data, identifying the false search behavior. According to the identification method and device, the accuracy of identifying the false search behavior can be improved and the false search behavior of a total-amount query word can be automatically identified.

Description

Identification method and device for false search behavior of a search engine
Technical field
The present invention relates to the field of information search and retrieval, and in particular to a method and a device for identifying false search behavior of a search engine.
Background
At present, there is no unified, mature method for identifying false search behavior on a search engine used for searching multimedia resources. Generally, a search engine carries out identification of false search behavior according to its own business needs, and only when such identification becomes necessary. As business systems mature and the processing capacity and robustness of search engines improve, false search behavior can largely be tolerated; that is, it largely does not need to be identified. For example, only when individual false search behavior affects the service quality of the search engine do engineers carry out targeted identification work.
Furthermore, identifying false search behavior of a search engine is difficult, for the following reasons:
(1) In the prior art, false search behavior of a search engine has no strict definition, only the following simple one: false search behavior refers to search behavior whose purpose is not genuinely to search for and watch multimedia resources. That is, if the user's search intention is not to search for and watch multimedia resources, the search behavior for that query word may be false search behavior. This makes false search behavior difficult to identify. For example, the user's search intention can only be judged from subjective understanding, and whether the search behavior for the query word is false can then only be inferred from whether that intention is to search for and watch multimedia resources.
(2) False search behavior of a search engine is generally hidden. Specifically, the user is at the front end of the search engine and the engineer is at the back end, and the only actual point of interaction between the user and the search engine is the query word. The engineer therefore cannot, and should not, confirm search intention face to face and one to one with each user, which makes false search behavior difficult to identify.
(3) False search behavior of a search engine is highly variable. Specifically, its sources are diverse, for example active input by users, links from external websites (which reach a high-traffic search engine by imitating or embedding its search pattern), spoofed IP addresses, and so on. False search behavior may therefore be unable to maintain stable characteristics over time and space. For example, for the same query word, key indicators such as clicks, plays, and IP addresses on the first day may differ considerably from the same indicators on the second day. This also makes false search behavior harder to identify.
(4) Usually, identification of false search behavior of a search engine is delayed and passive. On the one hand, because Internet users are diverse and long-tail demand exists, a single search behavior cannot be judged to be false on its own. Normally, only when false search behavior needs to be identified is it judged by analyzing the requests within a specific time period and IP address range, and this judgment is still delayed. In fact, techniques for spoofing random IP addresses are now very mature, so identifying false search behavior by analyzing IP addresses may not be appropriate. On the other hand, because identifying false search behavior in aggregated data may require the full log, which may only be available on the following day, manually analyzing false search behavior for all query words is impractical.
In addition, false search behavior for multimedia resources such as video and audio mainly takes two forms: (1) searching for multimedia resources without clicking any, which mainly appears as a large number of search inputs with no corresponding clicks or hits on multimedia resources; and (2) clicking multimedia resources without playing them, which mainly appears as clicks on multimedia resources with no subsequent viewing.
Existing identification of false search behavior basically determines whether the search behavior for a query word includes false search behavior from the query word's short-term burst characteristics and its IP address distribution. This approach may be effective for false search behavior that consists of searching without clicking, but it has no effect on false search behavior that consists of clicking without playing. Moreover, with the development of crawler technology, crawlers that spoof IP addresses make false search behavior harder to identify. In addition, false search behavior cannot currently be identified automatically for all query words.
Summary of the invention
Technical problem
In view of this, the technical problem to be solved by the present invention is how to identify false search behavior of a search engine.
Solution
To solve the above technical problem, in a first aspect, the present invention provides a method for identifying false search behavior of a search engine, the search engine being used for searching multimedia resources, the method comprising:
Obtaining, from a user log, user viewing behavior data of a single query word and user conversion behavior data of the single query word, wherein the user viewing behavior data of the single query word comprises: the query word, a clicked multimedia resource set, a multimedia resource play completion ratio set, and a mapping function from the clicked multimedia resource set to the multimedia resource play completion ratio set; and the user conversion behavior data of the single query word comprises the query word and further comprises at least one of: the number of queries, a through district hit rate, a through district conversion ratio, a user generated content (UGC) district hit rate, a UGC district conversion ratio, and an overall conversion ratio;
Determining, according to the user viewing behavior data and/or the user conversion behavior data, identification data for identifying the false search behavior, the identification data comprising at least one of: an independent multimedia resource playback volume, a multimedia resource average play completion ratio, a multimedia resource click divergence, and a multimedia resource set play residue degree; and
Identifying the false search behavior according to the identification data.
With reference to the first aspect, in a first possible implementation, when the user conversion behavior data comprises the through district conversion ratio and the identification data comprises the multimedia resource click divergence, identifying the false search behavior according to the identification data comprises:
Judging whether the through district conversion ratio of the current query word is less than a first threshold;
When the through district conversion ratio of the current query word is less than the first threshold, judging whether the multimedia resource click divergence of the current query word is less than a second threshold; and
When the multimedia resource click divergence of the current query word is less than the second threshold, identifying the search behavior of the current query word as the false search behavior.
With reference to the first aspect, in a second possible implementation, when the user conversion behavior data comprises the through district conversion ratio and the identification data comprises the multimedia resource average play completion ratio, identifying the false search behavior according to the identification data comprises:
Judging whether the through district conversion ratio of the current query word is less than a first threshold;
When the through district conversion ratio of the current query word is not less than the first threshold, judging whether the multimedia resource average play completion ratio of the current query word is less than a third threshold; and
When the multimedia resource average play completion ratio of the current query word is less than the third threshold, identifying the search behavior of the current query word as the false search behavior.
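The first and second implementations can be read together as two branches of one decision procedure keyed on the through district conversion ratio. The following is a minimal Python sketch under that reading; the function name, the parameter names, and the threshold values in the usage lines are illustrative assumptions, since the text does not give concrete thresholds.

```python
def is_false_search(dtra, vcr, app, first_threshold, second_threshold, third_threshold):
    """Return True when the current query word's search behavior is flagged as false.

    dtra: through district conversion ratio of the current query word
    vcr:  multimedia resource click divergence of the current query word
    app:  multimedia resource average play completion ratio of the current query word
    The three thresholds are hypothetical tuning parameters.
    """
    if dtra < first_threshold:
        # First implementation: conversion is low, so check click divergence.
        return vcr < second_threshold
    # Second implementation: conversion is not below the first threshold,
    # so check the average play completion ratio instead.
    return app < third_threshold


# Usage with made-up thresholds (assumptions, not values from the text).
flagged = is_false_search(dtra=0.01, vcr=0.02, app=0.9,
                          first_threshold=0.05, second_threshold=0.1,
                          third_threshold=0.2)
print(flagged)
```

Combining both branches in one function is a design choice of this sketch; each implementation is described separately in the text and could equally be applied on its own.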
With reference to the first aspect or its first or second possible implementation, in a third possible implementation, determining, according to the user viewing behavior data and/or the user conversion behavior data, the identification data for identifying the false search behavior comprises: when the identification data comprises the independent multimedia resource playback volume, determining the independent multimedia resource playback volume according to the clicked multimedia resource set in the user viewing behavior data.
With reference to the third possible implementation of the first aspect, in a fourth possible implementation, determining, according to the user viewing behavior data and/or the user conversion behavior data, the identification data for identifying the false search behavior comprises at least one of the following steps:
When the identification data comprises the multimedia resource average play completion ratio, determining the multimedia resource average play completion ratio from the play completion ratio set in the user viewing behavior data and the independent multimedia resource playback volume, using the formula [formula omitted in the source text], where query is the current query word, APP(query) is the multimedia resource average play completion ratio of the current query word, IVC(query) is the independent multimedia resource playback volume of the current query word, n_i is the number of times the i-th independent multimedia resource of the current query word was played, and perc_i is the play completion ratio of the i-th independent multimedia resource of the current query word;
When the user conversion behavior data comprises the number of queries and the identification data comprises the multimedia resource click divergence, determining the multimedia resource click divergence from the number of queries and the independent multimedia resource playback volume, using the formula [formula omitted in the source text], where VCR(query) is the multimedia resource click divergence of the current query word and sqv is the number of queries;
When the user conversion behavior data comprises the number of queries and the identification data comprises the multimedia resource set play residue degree, determining the multimedia resource set play residue degree from the play completion ratio set in the user viewing behavior data and the number of queries, using the formula $VSPR(query) = \max\left(\frac{n_i}{sqv} \cdot perc_i\right),\ i \in [1, IVC(query)]$, where VSPR(query) is the multimedia resource set play residue degree of the current query word and max() takes the maximum value.
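As an illustration of the VSPR formula, the following Python sketch computes max((n_i / sqv) * perc_i) over the independent multimedia resources of one query word. The dictionary layout, the function names, and the sample values are hypothetical. Taking IVC(query) as the number of distinct played resources is an assumption of this sketch, and returning 0.0 when nothing was played is likewise an assumption.

```python
def independent_play_count(plays):
    """IVC(query), assumed here to be the number of distinct resources played."""
    return len(plays)


def vspr(plays, sqv):
    """Multimedia resource set play residue degree: max over i of (n_i / sqv) * perc_i.

    plays: dict mapping a resource id to (n_i, perc_i), where n_i is the number
           of plays and perc_i is the play completion ratio in [0, 1].
    sqv:   number of queries for the current query word (must be positive).
    """
    if sqv <= 0:
        raise ValueError("sqv must be positive")
    if not plays:
        return 0.0  # assumption: no plays yields a residue degree of zero
    return max(n / sqv * perc for n, perc in plays.values())


# Hypothetical data: two resources played after 100 searches of one query word.
plays = {"vid_a": (10, 0.5), "vid_b": (2, 1.0)}
print(independent_play_count(plays), vspr(plays, sqv=100))
```

The average play completion ratio and the click divergence are left out of the sketch because their formulas are not reproduced in the text.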
In a second aspect, the present invention provides a device for identifying false search behavior of a search engine, the search engine being used for searching multimedia resources, the device comprising:
An acquiring unit, configured to obtain, from a user log, user viewing behavior data of a single query word and user conversion behavior data of the single query word, wherein the user viewing behavior data of the single query word comprises: the query word, a clicked multimedia resource set, a multimedia resource play completion ratio set, and a mapping function from the clicked multimedia resource set to the multimedia resource play completion ratio set; and the user conversion behavior data of the single query word comprises the query word and further comprises at least one of: the number of queries, a through district hit rate, a through district conversion ratio, a user generated content (UGC) district hit rate, a UGC district conversion ratio, and an overall conversion ratio;
A determining unit, connected to the acquiring unit and configured to determine, according to the user viewing behavior data and/or the user conversion behavior data, identification data for identifying the false search behavior, the identification data comprising at least one of: an independent multimedia resource playback volume, a multimedia resource average play completion ratio, a multimedia resource click divergence, and a multimedia resource set play residue degree; and
A processing unit, configured to identify the false search behavior according to the identification data.
With reference to the second aspect, in a first possible implementation, when the user conversion behavior data comprises the through district conversion ratio and the identification data comprises the multimedia resource click divergence, the processing unit specifically comprises:
A first judging unit, configured to judge whether the through district conversion ratio of the current query word is less than a first threshold;
A second judging unit, connected to the first judging unit and configured to judge, when the first judging unit determines that the through district conversion ratio of the current query word is less than the first threshold, whether the multimedia resource click divergence of the current query word is less than a second threshold; and
A recognition unit, connected to the second judging unit and configured to identify the search behavior of the current query word as the false search behavior when the second judging unit determines that the multimedia resource click divergence of the current query word is less than the second threshold.
With reference to the second aspect, in a second possible implementation, when the user conversion behavior data comprises the through district conversion ratio and the identification data comprises the multimedia resource average play completion ratio, the processing unit specifically comprises:
A first judging unit, configured to judge whether the through district conversion ratio of the current query word is less than a first threshold;
A second judging unit, connected to the first judging unit and configured to judge, when the first judging unit determines that the through district conversion ratio of the current query word is not less than the first threshold, whether the multimedia resource average play completion ratio of the current query word is less than a third threshold; and
A recognition unit, connected to the second judging unit and configured to identify the search behavior of the current query word as the false search behavior when the second judging unit determines that the multimedia resource average play completion ratio of the current query word is less than the third threshold.
With reference to the second aspect or its first or second possible implementation, in a third possible implementation, the determining unit is specifically configured to: when the identification data comprises the independent multimedia resource playback volume, determine the independent multimedia resource playback volume according to the clicked multimedia resource set in the user viewing behavior data.
With reference to the third possible implementation of the second aspect, in a fourth possible implementation, the determining unit is specifically configured to perform at least one of the following steps:
When the identification data comprises the multimedia resource average play completion ratio, determining the multimedia resource average play completion ratio from the play completion ratio set in the user viewing behavior data and the independent multimedia resource playback volume, using the formula [formula omitted in the source text], where query is the current query word, APP(query) is the multimedia resource average play completion ratio of the current query word, IVC(query) is the independent multimedia resource playback volume of the current query word, n_i is the number of times the i-th independent multimedia resource of the current query word was played, and perc_i is the play completion ratio of the i-th independent multimedia resource of the current query word;
When the user conversion behavior data comprises the number of queries and the identification data comprises the multimedia resource click divergence, determining the multimedia resource click divergence from the number of queries and the independent multimedia resource playback volume, using the formula [formula omitted in the source text], where VCR(query) is the multimedia resource click divergence of the current query word and sqv is the number of queries;
When the user conversion behavior data comprises the number of queries and the identification data comprises the multimedia resource set play residue degree, determining the multimedia resource set play residue degree from the play completion ratio set in the user viewing behavior data and the number of queries, using the formula $VSPR(query) = \max\left(\frac{n_i}{sqv} \cdot perc_i\right),\ i \in [1, IVC(query)]$, where VSPR(query) is the multimedia resource set play residue degree of the current query word and max() takes the maximum value.
Beneficial effects
The method and device for identifying false search behavior of a search engine according to the embodiments of the present invention determine identification data for identifying false search behavior from the user viewing behavior data and/or the user conversion behavior data of a single query word obtained from the user log, and identify false search behavior according to the determined identification data. This can improve the accuracy of identifying false search behavior and also allows false search behavior to be identified automatically for all query words.
Further features and aspects of the present invention will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Brief description of the drawings
The accompanying drawings, which are included in and form part of the specification, illustrate exemplary embodiments, features, and aspects of the present invention together with the specification, and serve to explain the principles of the present invention.
Fig. 1 is a flowchart of the method for identifying false search behavior of a search engine according to Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the method for identifying false search behavior of a search engine according to Embodiment 2 of the present invention;
Fig. 3 shows an example of a decision tree model applied in the present invention;
Fig. 4 is a flowchart of the method for identifying false search behavior of a search engine according to Embodiment 3 of the present invention;
Fig. 5 is a structural block diagram of the device for identifying false search behavior of a search engine according to Embodiment 4 of the present invention;
Fig. 6 is a structural block diagram of the device for identifying false search behavior of a search engine according to Embodiment 5 of the present invention; and
Fig. 7 is a structural block diagram of the device for identifying false search behavior of a search engine according to Embodiment 6 of the present invention.
Detailed description
Various exemplary embodiments, features, and aspects of the present invention are described in detail below with reference to the accompanying drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" as used here means "serving as an example, embodiment, or illustration". Any embodiment described here as "exemplary" is not necessarily to be construed as preferred over or better than other embodiments.
In addition, numerous specific details are given in the embodiments below to better explain the present invention. Those skilled in the art will understand that the present invention can also be practiced without some of these details. In some instances, methods, means, elements, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present invention.
Embodiment 1
Fig. 1 is a flowchart of the method for identifying false search behavior of a search engine according to an embodiment of the present invention. As shown in Fig. 1, the method can mainly comprise:
Step S100: obtaining, from a user log, user viewing behavior data of a single query word and user conversion behavior data of the single query word.
Specifically, a four-tuple {query, vids, percs, δ} can be used to characterize the user viewing behavior for each query word. This process can include preprocessing and denoising the user log data, whose noise may come from many sources such as illegal input, system exceptions, and logging errors.
Here, query is the query word, that is, what the user enters in each search on the search engine; for example, the query word query can be obtained from the user log of the search engine.
vids is the clicked multimedia resource set, that is, the set of multimedia resources the user clicked on the search results page after searching for the query word; for example, vids can be obtained from the multimedia resource viewing log within the user log by restricting the source of the multimedia resource views.
percs is the multimedia resource play completion ratio set, that is, the set of play completion ratios of the clicked multimedia resources; for example, percs can be obtained from the multimedia resource viewing log within the user log by post-processing the multimedia resource playback data. Note that because the total durations of multimedia resources may differ greatly, characterizing user viewing behavior by play completion ratio is more objective than characterizing it by playback duration alone. If, for the same query word, a clicked multimedia resource is played more than once, its play completion ratio is a combined score; for example, it can be the mean of all its play completion ratios for that query word, or their median.
δ is the mapping function from the clicked multimedia resource set to the multimedia resource play completion ratio set; for example, this mapping function can be defined in advance when the play completion ratio set is obtained.
That is, the user viewing behavior data of a single query word can comprise: the query word (query), the clicked multimedia resource set (vids), the multimedia resource play completion ratio set (percs), and the mapping function (δ) from the clicked multimedia resource set to the multimedia resource play completion ratio set.
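The four-tuple can be sketched as a small record built from play events. The following Python example is a minimal sketch, assuming the denoised viewing log yields (resource id, play completion ratio) pairs for one query word. The class name, the function name, and the event format are hypothetical, and δ is represented as a plain dictionary. Repeated plays of one resource are combined with the mean or the median, the two options the text mentions.

```python
from dataclasses import dataclass, field
from statistics import mean, median


@dataclass
class ViewingRecord:
    """Four-tuple {query, vids, percs, delta} for one query word."""
    query: str
    vids: set = field(default_factory=set)      # clicked multimedia resource set
    percs: list = field(default_factory=list)   # play completion ratio set
    delta: dict = field(default_factory=dict)   # mapping: resource id -> ratio


def build_viewing_record(query, play_events, aggregate=median):
    """Build a ViewingRecord from (vid, completion_ratio) pairs.

    aggregate combines the ratios of a resource that was played several times.
    """
    per_vid = {}
    for vid, ratio in play_events:
        per_vid.setdefault(vid, []).append(ratio)
    delta = {vid: aggregate(ratios) for vid, ratios in per_vid.items()}
    return ViewingRecord(query=query, vids=set(delta),
                         percs=list(delta.values()), delta=delta)


# Hypothetical events: "v1" was played twice, "v2" once.
events = [("v1", 0.2), ("v1", 0.8), ("v2", 1.0)]
record = build_viewing_record("some query", events, aggregate=mean)
print(record.vids, record.delta)
```

Passing the aggregation function as a parameter keeps the choice between mean and median open, since the text presents both only as examples.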
Specifically, the user conversion behavior data of a single query word can comprise the query word (query), and can further comprise at least one of: the number of queries (sqv), the through district hit rate (Dhit), the through district conversion ratio (Dtra), the user generated content (UGC) district hit rate (Uhit), the UGC district conversion ratio (Utra), and the overall conversion ratio (Wtra).
The number of queries can be the number of times the query word is searched within a certain time period. For example, if query word A1 is searched 25 times in one day, the number of queries of A1 is 25.
The through district hit rate can be the proportion of search results in the through district that are clicked by users. The through district is the display area, on the search results page of a multimedia resource search engine, in which editors manually arrange multimedia resources in response to the entered query word; for example, the through district can consist of copyrighted multimedia resources or multimedia resources from high-quality accounts, to help users find results quickly. For example, if a query word is searched 100 times and 40 of those searches hit a multimedia resource in the through district, the through district hit rate is 40/100 = 40%. Note that even if the same multimedia resource in the through district is hit repeatedly, it is counted as a single hit on that resource.
The through district conversion ratio can be the proportion of search results in the through district that are converted into visits to a multimedia resource play page. For example, suppose a query word is searched 100 times, 40 of those searches hit a multimedia resource in the through district, and there are 60 conversions to a multimedia resource play page; the through district conversion ratio is then [value omitted in the source text]. Note that every conversion is counted; that is, each conversion to a multimedia resource play page increases the conversion count by one.
The user generated content (UGC) district hit rate can be the proportion of search results in the UGC district that are clicked by users. The UGC district is the display area, on the results page of a multimedia resource search engine, made up of multimedia resources uploaded by ordinary users; UGC emerged with the Web 2.0 concept, whose main feature is the promotion of personalization. For example, if a query word is searched 100 times and 40 of those searches hit a multimedia resource in the UGC district, the UGC district hit rate is 40/100 = 40%. Note that even if the same multimedia resource in the UGC district is hit repeatedly, it is counted as a single hit on that resource.
The UGC district conversion ratio can be the proportion of search results in the UGC district that are converted into visits to a multimedia resource play page. For example, suppose a query word is searched 100 times, 40 of those searches hit a multimedia resource in the UGC district, and there are 60 conversions to a multimedia resource play page; the UGC district conversion ratio is then [value omitted in the source text]. Note that every conversion is counted; that is, each conversion to a multimedia resource play page increases the conversion count by one.
The overall conversion rate can refer to the ratio of the number of times the overall search results are converted into multimedia resource playback pages. For example, suppose a query word is searched 100 times and there are 60 conversions into a multimedia resource playback page; the overall conversion rate is then 60/100 = 60%. Note that every conversion is counted: each conversion into a multimedia resource playback page increases the conversion count by one.
In one possible implementation, a seven-tuple {query, sqv, Dhit, Dtra, Uhit, Utra, Wtra} can be used to characterize the user conversion behavior for each query word. For example, Table 1 below shows an example of the conversion and playback raw data fields of a query word.
Table 1: Conversion and playback raw data fields of a query word
No. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Field name | query | vids | percs | sqv | Dhit | Dtra | Uhit | Utra | Wtra
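The nine fields of Table 1 can be held in one record per query word. The following is only a minimal sketch: the field names follow Table 1, while the class name `QueryRecord`, the Python types, and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QueryRecord:
    """One per-query row; field names follow Table 1, types are assumed."""
    query: str
    vids: List[str] = field(default_factory=list)     # clicked multimedia resources
    percs: List[float] = field(default_factory=list)  # playback completion ratios
    sqv: int = 0        # query volume
    Dhit: float = 0.0   # through district hit rate
    Dtra: float = 0.0   # through district conversion ratio
    Uhit: float = 0.0   # UGC district hit rate
    Utra: float = 0.0   # UGC district conversion ratio
    Wtra: float = 0.0   # overall conversion rate

    def conversion_tuple(self) -> Tuple[str, int, float, float, float, float, float]:
        """Return the seven-tuple {query, sqv, Dhit, Dtra, Uhit, Utra, Wtra}."""
        return (self.query, self.sqv, self.Dhit, self.Dtra,
                self.Uhit, self.Utra, self.Wtra)


# Illustrative values only: 100 searches, 40 of which hit the through district.
record = QueryRecord(query="example query", sqv=100, Dhit=0.4)
print(record.conversion_tuple())
```

Keeping the viewing fields (vids, percs) and the conversion fields in one record mirrors the later step in which the playback log and the click log are aggregated with query as the key.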
Step S120: identification data for identifying false search behavior can be determined according to the user viewing behavior data and/or the user conversion behavior data. The identification data can include at least one of: the independent multimedia resource playback volume, the average multimedia resource playback completion ratio, the multimedia resource click divergence, and the multimedia resource set playing residue.
The inventors recognized that false search behavior on a search engine can have the following characteristics:
First, a query word involving false search behavior has a large query volume sqv but few hits. Examples include false search behavior aimed at inflating the play count of multimedia resources, and false search behavior imported from external sites (that is, search behavior directed at a high-traffic search engine by imitating or embedding its search pattern). Such specific query words are therefore searched many times, but they do not reflect real user demand, and searches for them rarely convert into clicking or playing multimedia resources.
Second, the clicks for a query word involving false search behavior are usually fixed on one or a few specific multimedia resources, in order to inflate their play counts, and the playback completion ratio of these multimedia resources is very low. This characteristic is commonly seen in cheating that promotes a multimedia resource.
Third, the average playback completion ratio of a query word involving false search behavior is low. When a large number of the multimedia resources played for a query word have very low completion ratios, the average completion ratio of that query word over the whole search results page is low. This characteristic is especially obvious when the query volume is large.
Finally, the number of independent multimedia resources clicked for a query word involving false search behavior is usually small. Because the clicks concentrate on one or a few multimedia resources and other multimedia resources are rarely clicked, the overall number of independently clicked multimedia resources is small.
Based on the above characteristics, the inventors conceived that identification data for identifying false search behavior can be extracted from user log data, for example from user viewing behavior data and/or user conversion behavior data.
In one possible implementation, determining the identification data for identifying false search behavior according to the user viewing behavior data and/or the user conversion behavior data can include: when the identification data include the independent multimedia resource playback volume, determining the independent multimedia resource playback volume according to the set of clicked multimedia resources in the user viewing behavior data.
Here, the independent multimedia resource playback volume (Independent Video Count, IVC for short) describes how widely the clicks for a single query word are spread over multimedia resources. The more distinct multimedia resources are clicked for a query word, the larger its independent multimedia resource playback volume; conversely, the fewer distinct multimedia resources are clicked, the smaller it is. For example, the distinct clicked multimedia resources can be determined from the query word query and the clicked multimedia resource set vids in the user viewing behavior data. Formula (1) below can therefore be used to determine the independent multimedia resource playback volume from query and vids:
$$\mathrm{IVC}(query) = \text{number of distinct multimedia resources clicked for } query \tag{1}$$
In general, for normal search results and normal search behavior, the independent multimedia resource playback volume of a query word is not small, which is consistent with the diversity of user demand and the randomness of click behavior. However, if a single query word involves false search behavior, its independent multimedia resource playback volume is generally not large, because users may not click any multimedia resource, or the clicks may be confined to specific multimedia resources. Note that when the returned results include through district multimedia resources, the independent multimedia resource playback volume may also be relatively small.
In one possible implementation, determining the identification data for identifying false search behavior according to the user viewing behavior data and/or the user conversion behavior data can include at least one of the following steps:
When the identification data include the average multimedia resource playback completion ratio, that ratio can be determined using formula (2), based on the playback completion ratio set in the user viewing behavior data and the independent multimedia resource playback volume determined above, where query is the current query word, APP(query) is the average multimedia resource playback completion ratio of the current query word, IVC(query) is the independent multimedia resource playback volume of the current query word, n_i is the number of times the i-th independent multimedia resource of the current query word was played, and perc_i is the playback completion ratio of the i-th independent multimedia resource of the current query word;
When the user conversion behavior data include the query volume and the identification data include the multimedia resource click divergence, the multimedia resource click divergence can be determined using formula (3), based on the query volume and the independent multimedia resource playback volume determined above, where VCR(query) is the multimedia resource click divergence of the current query word and sqv is the query volume;
When the user conversion behavior data include the query volume and the identification data include the multimedia resource set playing residue, the multimedia resource set playing residue can be determined using formula (4), based on the playback completion ratio set in the user viewing behavior data and the query volume:
$$\mathrm{VSPR}(query) = \max_{i \in [1,\ \mathrm{IVC}(query)]} \left( \frac{n_i}{sqv \cdot perc_i} \right) \tag{4}$$
where VSPR(query) is the multimedia resource set playing residue of the current query word and max() takes the maximum value.
Here, the average multimedia resource playback completion ratio (Average Playing Percentage, APP for short) describes the average degree to which playback is completed for a single query word over its own set of multimedia resource search results. The larger the average completion ratio, the more completely the multimedia resources under the query word are watched; conversely, the smaller the average completion ratio, the less completely they are watched. As described above, formula (2) can be used to determine the average multimedia resource playback completion ratio.
In general, the average playback completion ratio of a single query word over its whole set of multimedia resource search results is not very low, unless every playback has an extremely low completion ratio. For example, the mean of the average playback completion ratios of all single query words on a certain search engine is about 44%. If the average playback completion ratio of a single query word is very low, that query word very likely involves false search behavior.
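Formula (2) itself is not reproduced in this text, so the sketch below does not claim to be the patent's formula. It assumes one plausible reading of the stated variables: a mean of perc_i over the independent multimedia resources of the query word, weighted by the play counts n_i. The function name and the sample numbers are hypothetical.

```python
from typing import Iterable, Tuple


def average_playing_percentage(plays: Iterable[Tuple[int, float]]) -> float:
    """ASSUMED form of APP: sum(n_i * perc_i) / sum(n_i).

    plays holds one (n_i, perc_i) pair per independent multimedia resource,
    where n_i is the play count and perc_i the playback completion ratio.
    """
    plays = list(plays)
    total_plays = sum(n for n, _ in plays)
    if total_plays == 0:
        return 0.0  # no playback at all for this query word
    return sum(n * perc for n, perc in plays) / total_plays


# Hypothetical query word: one play at 50% and three plays at 10%.
app_example = average_playing_percentage([(1, 0.5), (3, 0.1)])
print(round(app_example, 4))
```

Under this assumed weighting, a resource that is played many times at a low completion ratio pulls the average down sharply, which matches the cheating pattern described in the text.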
Here, the multimedia resource click divergence (Video Clicking Range, VCR for short) describes how dispersed the clicks on multimedia resources are, for a query word, on the multimedia resource search results page. For a given query volume, the more independent multimedia resources are clicked, the larger the click divergence; conversely, the fewer independent multimedia resources are clicked, the smaller the click divergence. As described above, formula (3) can be used to determine the multimedia resource click divergence.
In general, the multimedia resource click divergence varies with the exposure and conversion degree of the through district. If the query word is also a time-sensitive word (a search word that attracts more than a certain level of user attention within a specific period), then its search volume sqv within that period (for example, the current day) is large and the clicks concentrate on the multimedia resources of the topic's originator. According to formula (3), the click divergence therefore will not be high; that is, users' clicks concentrate on a few new multimedia resources.
Here, the multimedia resource set playing residue (Video Set Playing Residue, VSPR for short) describes the extent to which playback is left unfinished for a query word on the multimedia resource search results page. Each multimedia resource has a click share, that is, the ratio of the number of times the multimedia resource is clicked to the number of times it is searched. For example, if multimedia resource B1 is searched 100 times and clicked 20 times, the click share of B1 is 20/100 = 20%; in other words, the click share of a multimedia resource can be calculated as n_i/sqv in formula (4). Each multimedia resource also has a playback completion ratio perc_i. These two parameters, the click share n_i/sqv and the completion ratio perc_i, can be used to determine how completely the query word is played over the whole multimedia resource set. If the click share n_i/sqv of a multimedia resource is larger and its completion ratio perc_i is smaller (that is, 1/perc_i is larger), the multimedia resource set is played less completely and the playing residue VSPR(query) is larger. Accordingly, formula (4) can be used to determine the multimedia resource set playing residue VSPR(query).
According to formula (4), the multimedia resource set playing residue VSPR(query) uses the worst performance of a single multimedia resource: if the click share n_i/sqv of some multimedia resource is large and its playback completion ratio perc_i is low, then VSPR(query) is large.
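The worst-case behavior of formula (4) can be illustrated with a short sketch. It reads the formula as the maximum of n_i / (sqv * perc_i) over the independent multimedia resources, which is the reading consistent with the description that a larger click share and a lower completion ratio give a larger residue. The function name, the handling of perc_i = 0, and the sample numbers are assumptions.

```python
import math
from typing import Iterable, Tuple


def video_set_playing_residue(plays: Iterable[Tuple[int, float]], sqv: int) -> float:
    """VSPR(query) read as max over i of n_i / (sqv * perc_i).

    plays holds one (n_i, perc_i) pair per independent multimedia resource.
    """
    if sqv <= 0:
        raise ValueError("sqv must be positive")
    residues = []
    for n, perc in plays:
        if perc <= 0:
            # Assumed convention: a clicked resource that is never played
            # through at all has unbounded residue.
            residues.append(math.inf if n > 0 else 0.0)
        else:
            residues.append(n / (sqv * perc))
    return max(residues, default=0.0)


# Hypothetical query word searched 100 times:
# resource 1: 20 plays at 50% completion -> 20 / (100 * 0.5)  = 0.4
# resource 2:  5 plays at  5% completion ->  5 / (100 * 0.05) = 1.0
vspr_example = video_set_playing_residue([(20, 0.5), (5, 0.05)], sqv=100)
print(vspr_example)
```

Because only the maximum is kept, a single heavily clicked but barely watched resource is enough to raise the residue, regardless of how the other resources perform.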
Note that, in the present invention, the method for determining the multimedia resource set playing residue VSPR(query) is not limited to the above. Based on the content disclosed in this application and their general technical knowledge, those skilled in the art will understand that VSPR(query) can also be determined in other ways; for example, it can be determined according to the overall performance of the multimedia resource set.
Step S140: false search behavior can be identified according to the identification data.
For example, the classical decision tree (Decision Tree) algorithm can be used to identify false search behavior of the search engine according to the identification data.
First, a decision tree model is trained on a training set (which can contain training data) to obtain an initial decision tree model of false search behavior. The initial data of the training set are manual labels indicating whether the search behavior of each given query word is false search behavior; the manual labeling is based on a small number of query words that can be clearly judged.
Specifically, a decision tree is a tree structure similar to a flowchart, in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or a class distribution. The topmost node of the tree is the root node. Decision tree algorithms are suitable for high-quality classification when the number of attributes (features) is small. The core problem of a decision tree algorithm is choosing the attribute to be tested at each node of the tree, with the aim of selecting the attribute that is most helpful for classifying instances. To solve this problem, the ID3 algorithm introduces the concept of information gain and uses the magnitude of the information gain to decide the nodes at each level of the decision tree, that is, the attributes that are important for classification. To define information gain precisely, the ID3 algorithm uses the concept of entropy from information theory to describe the purity of an arbitrary sample set. Given a sample set S containing positive and negative samples of some target concept, the entropy of S relative to this Boolean classification is:
$$\mathrm{Entropy}(S) = -P_{+}\log_2 P_{+} - P_{-}\log_2 P_{-} \tag{5}$$
where P_+ is the proportion of positive samples in S, P_- is the proportion of negative samples in S, and 0 log 0 is defined as 0. Using entropy, the ID3 algorithm defines information gain. Formula (6) below defines the information gain of attribute A relative to the sample set S:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \tag{6}$$
where V(A) is the value range of attribute A, S is the sample set, S_v is the subset of S whose value on attribute A equals v, and |S_v| and |S| are the numbers of samples in S_v and S.
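Formulas (5) and (6) can be checked numerically with a small sketch. The function names and the toy data are illustrative assumptions; samples are represented as dictionaries mapping an attribute name to a discrete value.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Formula (5): -sum(p * log2(p)) over the classes present in labels.

    Classes with zero samples never appear in the Counter, which matches
    the convention that 0 log 0 is 0.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for count in Counter(labels).values():
        p = count / total
        result -= p * math.log2(p)
    return result


def information_gain(samples: List[Dict[str, Hashable]],
                     labels: Sequence[Hashable],
                     attribute: str) -> float:
    """Formula (6): Entropy(S) - sum(|S_v| / |S| * Entropy(S_v))."""
    subsets: Dict[Hashable, list] = {}
    for sample, label in zip(samples, labels):
        subsets.setdefault(sample[attribute], []).append(label)
    total = len(labels)
    remainder = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
    return entropy(labels) - remainder


# Toy data: the hypothetical attribute "vcr_low" separates the classes perfectly.
toy_samples = [{"vcr_low": True}, {"vcr_low": True},
               {"vcr_low": False}, {"vcr_low": False}]
toy_labels = [True, True, False, False]   # True = false search behavior
print(entropy(toy_labels), information_gain(toy_samples, toy_labels, "vcr_low"))
```

A balanced two-class set has entropy 1 bit, and an attribute that splits it into two pure subsets recovers all of that as information gain.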
The flow of the ID3 algorithm is as follows:
Input: sample set S, attribute set A;
Output: ID3 decision tree.
1) If all attributes have been processed, return; otherwise, perform 2);
2) Compute the attribute a with the largest information gain Gain(S, A) and use it as a node; if the samples can be classified using attribute a alone, return; otherwise, perform 3);
3) For each possible value v of attribute a, perform the following operations: i. take all samples whose value of attribute a is v as a subset S_v of the sample set S; ii. generate the attribute set A_T = A - {a}; iii. recursively run the ID3 algorithm with the subset S_v and the attribute set A_T as input.
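The recursive flow above can be sketched compactly. This is a minimal illustration rather than the patent's implementation: it assumes attributes that have already been discretized (the source does not describe how continuous features such as Dtra or VCR are split), it stops when a subset is pure or no attributes remain, and all names such as `id3`, `classify`, `dtra_low`, and `vcr_low` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a label list, as in formula (5)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def gain(samples, labels, attribute):
    """Information gain of one attribute, as in formula (6)."""
    subsets = {}
    for sample, label in zip(samples, labels):
        subsets.setdefault(sample[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def id3(samples, labels, attributes):
    """Return a leaf label or a {'attribute', 'branches'} node."""
    if len(set(labels)) == 1:          # subset is pure: make a leaf
        return labels[0]
    if not attributes:                 # step 1: every attribute already used
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(samples, labels, a))  # step 2
    remaining = [a for a in attributes if a != best]                # A_T = A - {a}
    node = {"attribute": best, "branches": {}}
    for value in set(s[best] for s in samples):                     # step 3
        idx = [i for i, s in enumerate(samples) if s[best] == value]
        node["branches"][value] = id3([samples[i] for i in idx],
                                      [labels[i] for i in idx], remaining)
    return node


def classify(tree, sample):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][sample[tree["attribute"]]]
    return tree


# Toy training set: false search only when both hypothetical flags are set.
train = [{"dtra_low": True, "vcr_low": True},
         {"dtra_low": True, "vcr_low": False},
         {"dtra_low": False, "vcr_low": True},
         {"dtra_low": False, "vcr_low": False}]
train_labels = [True, False, False, False]
tree = id3(train, train_labels, ["dtra_low", "vcr_low"])
print(classify(tree, {"dtra_low": True, "vcr_low": True}))
```

A sample with an attribute value never seen in training would raise a KeyError in `classify`; a fuller implementation would fall back to the majority class at that node.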
From the identification data determined from the training data, the labels of the training set, and the ID3 algorithm, the initial decision tree model of false search behavior can be obtained.
Second, after the initial decision tree model of false search behavior is obtained, the model may need to be optimized. The reason is that a decision tree generated by the basic ID3 algorithm, that is, the initial decision tree model, often overfits. When the initial decision tree model is applied to the training data, the error rate in identifying whether the search behavior in the training data involves false search behavior is low; but when it is applied to test data, the error rate in identifying whether the search behavior in the test data involves false search behavior may be high. In other words, the accuracy of identifying false search behavior directly with the initial decision tree model may be low.
For example, a pruning strategy can be used to optimize the initial decision tree model. More specifically, the following two pruning strategies can be used:
Pre-pruning, that is, stopping early while the decision tree is being built. However, this strategy requires the condition for cutting nodes to be set very strictly, which leads to a very short decision tree that cannot reach the optimum. Pre-pruning may therefore have difficulty producing good judgment results.
Post-pruning, that is, pruning after the decision tree has been fully built. For example, the following two methods can be used: replacing a whole subtree with a single leaf node whose class is the majority class in the subtree; and replacing one subtree entirely with another subtree.
Specifically for the present invention, the initial decision tree model can be used to predict known search behavior, that is, to identify whether known search behavior involves false search behavior, so that the accuracy of the model can be judged and the initial decision tree model optimized.
Finally, after the initial decision tree model has been optimized, the optimized decision tree model and the identification data can be used to identify false search behavior of the search engine. For example, if the identification data are determined to be below a predetermined threshold in the optimized decision tree model, the search behavior of the current query word is identified as false search behavior.
Note that this embodiment of the present invention uses the decision tree classification algorithm only as an example of how to identify false search behavior according to the identification data. Those skilled in the art will understand that the focus of the present invention is not on which specific classification algorithm is used, and that the classification algorithms usable in the present invention are not limited to the decision tree classification algorithm. For example, other classification algorithms such as Bayesian inference can also be used to identify false search behavior according to the identification data.
The method for identifying false search behavior of a search engine according to this embodiment of the present invention determines identification data for identifying false search behavior according to the user viewing behavior data and/or user conversion behavior data of a single query word obtained from the user log, and identifies false search behavior according to the determined identification data. This can improve the accuracy of identifying false search behavior and also allows false search behavior to be identified automatically across the full set of query words.
Embodiment 2
Fig. 2 shows a flowchart of the method for identifying false search behavior of a search engine according to Embodiment 2 of the present invention. Steps in Fig. 2 with the same reference numerals as in Fig. 1 have the same functions; for brevity, their detailed description is omitted.
As shown in Fig. 2, the main difference between the method shown in Fig. 2 and the method shown in Fig. 1 is that, in addition to steps S100 and S120 of Embodiment 1 above, when the user conversion behavior data include the through district conversion ratio and the identification data include the multimedia resource click divergence, step S140 can specifically include:
Step S200: it can be judged whether the through district conversion ratio of the current query word is less than a first threshold;
Step S220: when the through district conversion ratio of the current query word is less than the first threshold, it can be judged whether the multimedia resource click divergence of the current query word is less than a second threshold; and
Step S240: when the multimedia resource click divergence of the current query word is less than the second threshold, the search behavior of the current query word can be identified as false search behavior.
For example, first, user viewing behavior data can be obtained from the multimedia resource playback log in the user log, and user conversion behavior data can be obtained from the query word click log in the user log. Specifically, Table 2 below shows an example of the multimedia resource playback log in one day's user log of a certain search engine, which contains 2,329,980 playback log records in total.
Table 2: Multimedia resource playback log in one day's user log of a certain search engine
query | vids | percs
C1 | 235949485 | 0.1338
C2 | 209907159 | 0.0442
C2 | 213535395 | 0.0587
C2 | 217417432 | 0.0980
C2 | 217417432 | 0.1960
According to Table 2, query word C2 had 4 playback events in total on that day, but two of them were on multimedia resource 217417432, so its independent multimedia resource playback volume IVC is 3.
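The IVC count for the rows of Table 2 can be reproduced with a short sketch. The rows are copied from Table 2; the tuple layout and the function name `independent_video_count` are assumptions made for illustration.

```python
from collections import defaultdict

# Rows of Table 2 as (query, vid, perc).
play_log = [
    ("C1", "235949485", 0.1338),
    ("C2", "209907159", 0.0442),
    ("C2", "213535395", 0.0587),
    ("C2", "217417432", 0.0980),
    ("C2", "217417432", 0.1960),
]


def independent_video_count(log):
    """Count the distinct clicked multimedia resources per query word."""
    clicked = defaultdict(set)
    for query, vid, _perc in log:
        clicked[query].add(vid)   # a set keeps each resource only once
    return {query: len(vids) for query, vids in clicked.items()}


ivc = independent_video_count(play_log)
print(ivc)
```

Using a set per query word means repeated plays of resource 217417432 contribute a single entry, which is how the four playback events of C2 reduce to an IVC of 3.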
In addition, Table 3 below shows an example of the query word click log in the same day's user log of the search engine, which contains 185,966 valid query word click log records in total.
Table 3: Query word click log in the same day's user log of a certain search engine
query | sqv | Dhit | Dtra | Uhit | Utra | Wtra
B1 | 1793 | 0.4822 | 0.6599 | 0.1422 | 0.2811 | 0.9426
B2 | 2491 | 0.3760 | 0.7001 | 0.3308 | 0.8210 | 1.5303
B3 | 3511 | 0.3896 | 0.4475 | 0.0615 | 0.0880 | 0.5377
By aggregating Table 2 and Table 3 with query as the primary key, the conversion and playback raw data fields of a single query word can be obtained.
Second, 518 query words can be randomly selected as the initial data for manual labeling, and for each selected query word it is judged whether its search behavior involves false search behavior. The results show that, of the 518 selected query words, 66 were labeled as false search behavior and the remaining 452 were labeled as normal search behavior. Table 4 below shows an example of the manually labeled false search behavior of a certain search engine.
Table 4: Example of manually labeled false search behavior of a certain search engine
query | A1 | A2 | A3 | A4 | A5 | B1
False search | Yes | Yes | Yes | No | No | No
According to Table 4, query words A1 to A3 of the search engine are manually labeled as false search behavior, that is, the search behavior of query words A1 to A3 involves false search behavior, whereas query words A4, A5 and B1 are manually labeled as not being false search behavior, that is, the search behavior of query words A4, A5 and B1 is normal search behavior.
Then, features can be extracted from the manually labeled query words to obtain the input data of the decision tree algorithm. Table 5 below shows an example of the input data of the decision tree algorithm of a certain search engine. As shown in Table 5, these input data can include the user conversion behavior data and the identification data of Embodiment 1 above.
Table 5: Example of input data of the decision tree algorithm of a certain search engine
Finally, according to the decision tree algorithm and the model optimization strategy, the decision tree model for identifying false search behavior shown in Fig. 3 can be obtained. According to Fig. 3, the decision tree model determines that the first threshold is 0.26 and the second threshold is 0.14; that is, the search behavior of a query word whose through district conversion ratio Dtra is less than 0.26 and whose multimedia resource click divergence VCR is less than 0.14 can be identified as false search behavior. Mapped onto the user's experience of the website, the decision tree model shown in Fig. 3 can be understood as follows: if a query word has no through district, or its through district has no actual effect, and at the same time users' multimedia resource clicks in the UGC district are too concentrated, the search behavior of this query word is false search behavior. This is consistent with the intuitive understanding of those skilled in the art.
For example, in the example shown in Table 5, the through district conversion ratio Dtra of query word A1 is 0 and its multimedia resource click divergence VCR is 0.0072. That is, the Dtra of query word A1 is less than the first threshold 0.26 and its VCR is less than the second threshold 0.14, so the search behavior of query word A1 can be identified as false search behavior.
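Steps S200 to S240 amount to a two-condition rule, which can be sketched as follows. The thresholds 0.26 and 0.14 and the values for query word A1 come from the text; the function name and the parameterization are assumptions, and the sketch covers only this branch of the decision tree.

```python
FIRST_THRESHOLD = 0.26    # through district conversion ratio (Dtra), per Fig. 3
SECOND_THRESHOLD = 0.14   # multimedia resource click divergence (VCR), per Fig. 3


def is_false_search_by_vcr(dtra: float, vcr: float,
                           first_threshold: float = FIRST_THRESHOLD,
                           second_threshold: float = SECOND_THRESHOLD) -> bool:
    """Steps S200 to S240: flag only if Dtra and VCR are both below threshold."""
    if dtra >= first_threshold:
        # Step S200 fails; this branch makes no "false search" decision.
        return False
    return vcr < second_threshold   # steps S220 and S240


# Query word A1 from the text: Dtra = 0, VCR = 0.0072.
a1_is_false = is_false_search_by_vcr(dtra=0.0, vcr=0.0072)
print(a1_is_false)
```

Keeping the thresholds as parameters reflects that they are learned from the labeled data of one search engine and would differ for another training set.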
The method for identifying false search behavior of a search engine according to this embodiment of the present invention determines identification data for identifying false search behavior according to the user viewing behavior data and/or user conversion behavior data of a single query word obtained from the user log. When the user conversion behavior data include the through district conversion ratio and the identification data include the multimedia resource click divergence, the search behavior of a query word whose through district conversion ratio is less than the first threshold and whose multimedia resource click divergence is less than the second threshold can be identified as false search behavior. This can improve the accuracy of identifying false search behavior and also allows false search behavior to be identified automatically across the full set of query words.
Embodiment 3
Fig. 4 shows a flowchart of the method for identifying false search behavior of a search engine according to Embodiment 3 of the present invention. Steps in Fig. 4 with the same reference numerals as in Fig. 1 have the same functions; for brevity, their detailed description is omitted.
As shown in Fig. 4, the main difference between the method shown in Fig. 4 and the method shown in Fig. 1 is that, in addition to steps S100 and S120 of Embodiment 1 above, when the user conversion behavior data include the through district conversion ratio and the identification data include the average multimedia resource playback completion ratio, step S140 can specifically include:
Step S300: it can be judged whether the through district conversion ratio of the current query word is less than the first threshold;
Step S320: when the through district conversion ratio of the current query word is not less than the first threshold, it can be judged whether the average multimedia resource playback completion ratio of the current query word is less than a third threshold; and
Step S340: when the average multimedia resource playback completion ratio of the current query word is less than the third threshold, the search behavior of the current query word can be identified as false search behavior.
For an example of this embodiment, see the description of Embodiment 2 above. The example of Embodiment 3 differs from that of Embodiment 2 in that the decision tree model shown in Fig. 3 determines that the first threshold is 0.26 and the third threshold is 0.25; that is, the search behavior of a query word whose through district conversion ratio Dtra is greater than or equal to 0.26 and whose average multimedia resource playback completion ratio APP is less than 0.25 can be identified as false search behavior. Mapped onto the user's experience of the website, the decision tree model shown in Fig. 3 can be understood as follows: if the through district of a query word has a good conversion effect (that is, it is a high-quality through district) but the viewing completeness of users (that is, the average multimedia resource playback completion ratio) is low, the search behavior of this query word is false search behavior. In other words, if the through district of a query word produces many conversions (leads to playback) but the playback completion ratio is low, the search behavior of this query word is false search behavior. This is consistent with the intuitive understanding of those skilled in the art.
For example, in the example shown in Table 5, the through district conversion ratio Dtra of query word B1 is 0.66 and its average multimedia resource playback completion ratio APP is 0.6816. That is, the Dtra of query word B1 is not less than the first threshold 0.26 and its APP is not less than the third threshold 0.25, so the search behavior of query word B1 can be identified as normal search behavior, that is, not as false search behavior.
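Steps S300 to S340 can be sketched in the same way as a two-condition rule. The thresholds 0.26 and 0.25 and the values for query word B1 come from the text; the function name is an assumption, and the second call uses a made-up APP value purely to show the other outcome.

```python
def is_false_search_by_app(dtra: float, app: float,
                           first_threshold: float = 0.26,
                           third_threshold: float = 0.25) -> bool:
    """Steps S300 to S340: flag if Dtra >= first threshold and APP < third."""
    if dtra < first_threshold:
        # Step S300 succeeds; this branch makes no "false search" decision.
        return False
    return app < third_threshold   # steps S320 and S340


# Query word B1 from the text: Dtra = 0.66, APP = 0.6816.
b1_is_false = is_false_search_by_app(dtra=0.66, app=0.6816)
# Hypothetical query word with the same Dtra but a very low APP.
low_app_is_false = is_false_search_by_app(dtra=0.66, app=0.10)
print(b1_is_false, low_app_is_false)
```

Together with the Dtra and VCR rule of Embodiment 2, this covers the two branches of the tree that the text describes as leading to a false search decision.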
The method for identifying false search behavior of a search engine according to this embodiment of the present invention determines identification data for identifying false search behavior according to the user viewing behavior data and/or user conversion behavior data of a single query word obtained from the user log. When the user conversion behavior data include the through district conversion ratio and the identification data include the average multimedia resource playback completion ratio, the search behavior of a query word whose through district conversion ratio is not less than the first threshold and whose average multimedia resource playback completion ratio is less than the third threshold can be identified as false search behavior. This can improve the accuracy of identifying false search behavior and also allows false search behavior to be identified automatically across the full set of query words.
Embodiment 4
Fig. 5 shows a structural block diagram of a device for identifying false search behavior of a search engine according to Embodiment 4 of the present invention. The device 500 provided by this embodiment is used to implement the method for identifying false search behavior of a search engine provided by the embodiment shown in Fig. 1. As shown in Fig. 5, the device 500 can include:
Acquiring unit 510, which can be used to obtain, from the user log, user viewing behavior data of a single query word and user conversion behavior data of the single query word. The user viewing behavior data of a single query word include: the query word, the set of clicked multimedia resources, the set of multimedia resource playback completion ratios, and a mapping function from the set of clicked multimedia resources to the set of playback completion ratios. The user conversion behavior data of a single query word include the query word, and also include at least one of: the query volume, the through district hit rate, the through district conversion ratio, the user-generated content (UGC) district hit rate, the UGC district conversion ratio, and the overall conversion rate. For details, see the description of step S100 in Embodiment 1 above.
A determining unit 530, which may be used to determine, from the user viewing behavior data and/or the user conversion behavior data, the identification data used to identify false search behavior. The identification data may include at least one of: the independent multimedia resource playback volume, the average multimedia resource playback completion ratio, the multimedia resource click divergence, and the multimedia resource set playback residue. For details, see the description of step S120 in Embodiment 1 above.
In one possible implementation, when the identification data includes the independent multimedia resource playback volume, the determining unit 530 may specifically determine it from the set of clicked multimedia resources in the user viewing behavior data.
The independent multimedia resource playback volume (IndependentVideoCount, IVC for short) describes how widely the clicks for a single query word are spread across multimedia resources. The more distinct multimedia resources are clicked for the query word, the larger the IVC; conversely, the fewer distinct multimedia resources are clicked, the smaller the IVC. For example, the distinct clicked multimedia resources can be determined from the query word query and the clicked multimedia resource set vids in the user viewing behavior data. The IVC can therefore be determined from query and vids using formula (1):
$$\mathrm{IVC}(query) = \text{number of distinct multimedia resources clicked for } query \tag{1}$$
In general, for query words with normal search results and normal search behavior, the IVC is not small, which is consistent with the diversity of user needs and the randomness of click behavior. If a single query word involves false search behavior, however, its IVC is usually not large, because the user may not click any multimedia resource at all, or the clicks may be confined to specific multimedia resources. Note that when the returned results include through-district multimedia resources, the IVC may also be relatively small.
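Formula (1) amounts to counting distinct clicked resources per query word. The following is a minimal sketch, assuming a hypothetical click log made of (query, vid) pairs; the record layout and the function name are illustrative and not taken from the patent.

```python
from typing import Iterable, Tuple


def independent_video_count(click_log: Iterable[Tuple[str, str]], query: str) -> int:
    """Return IVC(query): the number of distinct resources clicked for `query`.

    `click_log` is an assumed format: one (query, vid) pair per click.
    """
    # A set keeps each clicked resource once, however often it was clicked.
    return len({vid for q, vid in click_log if q == query})


clicks = [("cat", "v1"), ("cat", "v1"), ("cat", "v2"), ("dog", "v3")]
ivc_cat = independent_video_count(clicks, "cat")
```

Repeated clicks on the same resource do not raise the count, which is why a query word whose clicks are confined to a few specific resources ends up with a small IVC.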
In one possible implementation, the determining unit 530 may specifically perform at least one of the following steps:
When the identification data includes the average multimedia resource playback completion ratio, determine it using formula (2) from the playback completion ratio set in the user viewing behavior data and the independent multimedia resource playback volume determined above, where query is the current query word, APP(query) is the average multimedia resource playback completion ratio of the current query word, IVC(query) is the independent multimedia resource playback volume of the current query word, n_i is the number of times the i-th independent multimedia resource of the current query word was played, and perc_i is the playback completion ratio of the i-th independent multimedia resource of the current query word;
When the user conversion behavior data includes the query volume and the identification data includes the multimedia resource click divergence, determine the click divergence using formula (3) from the query volume and the independent multimedia resource playback volume determined above, where VCR(query) is the multimedia resource click divergence of the current query word and sqv is the query volume;
When the user conversion behavior data includes the query volume and the identification data includes the multimedia resource set playback residue, determine the playback residue from the playback completion ratio set in the user viewing behavior data and the query volume using formula (4), $$\mathrm{VSPR}(query) = \max_{i \in [1,\ \mathrm{IVC}(query)]}\left(\frac{n_i}{sqv} \cdot perc_i\right) \tag{4}$$ where VSPR(query) is the multimedia resource set playback residue of the current query word and max() takes the maximum value.
The average multimedia resource playback completion ratio (AveragePlayingPercentage, APP for short) describes the average degree to which playback is completed over the multimedia resource search result set of a single query word. The larger the APP, the more completely the multimedia resources under the query word are watched; conversely, the smaller the APP, the less completely they are watched. As stated above, formula (2) can be used to determine the APP.
In general, the APP of a single query word over its whole multimedia resource search result set is not very low unless the completion ratio of every playback is extremely low. For example, the mean APP over all single query words of a certain search engine is about 44%. If the APP of a single query word is very low, that query word very likely involves false search behavior.
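The expression for formula (2) is not legible in this text, so the following is only a sketch under a stated assumption: APP(query) is taken to be the play-count-weighted mean of perc_i over the IVC(query) independent resources, which matches the symbols n_i and perc_i defined for the formula. The function name and the input layout are hypothetical.

```python
from typing import Dict, Tuple


def average_playing_percentage(plays: Dict[str, Tuple[int, float]]) -> float:
    """Assumed form of APP: sum(n_i * perc_i) / sum(n_i).

    `plays` maps each independent resource to (n_i, perc_i), where n_i is its
    play count and perc_i its playback completion ratio in [0, 1].
    """
    total_plays = sum(n for n, _ in plays.values())
    if total_plays == 0:
        return 0.0  # no plays recorded for this query word
    return sum(n * perc for n, perc in plays.values()) / total_plays


app = average_playing_percentage({"v1": (3, 0.5), "v2": (1, 0.9)})
```

Under this assumption a query word whose plays are almost all abandoned early gets an APP far below the roughly 44% average mentioned in the text.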
The multimedia resource click divergence (VideoClickingRange, VCR for short) describes how dispersed the clicks on multimedia resources are on the multimedia resource search results page for a query word. For a given query volume, the more independent multimedia resources are clicked, the larger the click divergence; conversely, the fewer independent multimedia resources are clicked, the smaller the click divergence. As stated above, formula (3) can be used to determine the click divergence.
In general, the click divergence varies with the exposure and conversion of the through district. If the query word is also a time-sensitive word (a search word whose user attention exceeds a certain level within a specific time period), its search volume sqv within that period (for example, the same day) is relatively large, and the clicks are concentrated on the multimedia resources of the topic's originator. By formula (3), the click divergence is therefore not high; that is, user clicks concentrate on a few new multimedia resources.
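The expression for formula (3) is not legible in this text. The sketch below assumes the simplest form consistent with the description, VCR(query) = IVC(query) / sqv, which grows with the number of independent clicked resources and shrinks as the query volume grows. Both that form and the function name are assumptions.

```python
def video_clicking_range(ivc: int, sqv: int) -> float:
    """Assumed form of VCR: independent clicked resources per query issued."""
    if sqv <= 0:
        return 0.0  # no queries recorded, so no divergence can be measured
    return ivc / sqv


# A time-sensitive word: many searches, clicks concentrated on few resources.
vcr_hot = video_clicking_range(ivc=5, sqv=1000)
# An ordinary word: fewer searches, clicks spread over many resources.
vcr_normal = video_clicking_range(ivc=40, sqv=100)
```

With these illustrative numbers the time-sensitive word has a much lower divergence than the ordinary word, matching the behavior described for a large sqv.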
The multimedia resource set playback residue (VideoSetPlayingResidue, VSPR for short) describes the extent to which playback is not completed on the multimedia resource search results page for a query word. Each multimedia resource has a click share, that is, the ratio of the number of times the resource was clicked to the number of times it was returned by a search. For example, if multimedia resource B1 was returned by 100 searches and clicked 20 times, its click share is 20/100 = 20%. The click share can be computed as n_i/sqv in formula (4). Each multimedia resource also has a playback completion ratio perc_i. Together, the click share n_i/sqv and the completion ratio perc_i determine how completely the query word is played over the whole multimedia resource set. The larger the click share of a multimedia resource and the smaller its completion ratio perc_i, the less complete the playback of the multimedia resource set and the larger VSPR(query). Formula (4) above can be used to determine VSPR(query).
As formula (4) shows, the playback residue uses the worst-performing single multimedia resource: if some multimedia resource has a large click share n_i/sqv and a low completion ratio perc_i, VSPR(query) is large. Note that the method of determining VSPR(query) in the present invention is not limited to this. Based on the content disclosed in this application and their general technical knowledge, those skilled in the art will understand that VSPR(query) can also be determined in other ways, for example from the overall performance of the multimedia resource set.
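The following sketch evaluates formula (4) exactly as it is printed, taking the maximum of (n_i / sqv) * perc_i over the independent resources of the query word. The input layout and the function name are hypothetical, and the example reuses the click share of 20 clicks in 100 searches from the text.

```python
from typing import Dict, Tuple


def video_set_playing_residue(plays: Dict[str, Tuple[int, float]], sqv: int) -> float:
    """VSPR(query) = max over i of (n_i / sqv) * perc_i, as printed in formula (4).

    `plays` maps each independent resource to (n_i, perc_i); `sqv` is the
    query volume of the current query word.
    """
    if sqv <= 0 or not plays:
        return 0.0  # nothing to evaluate
    return max((n / sqv) * perc for n, perc in plays.values())


# B1: click share 20/100 with completion ratio 0.5; B2: 5/100 with 0.9.
vspr = video_set_playing_residue({"B1": (20, 0.5), "B2": (5, 0.9)}, sqv=100)
```

Because max() keeps a single term, the result is driven by one resource alone, which is the sense in which the text says the measure relies on the performance of a single multimedia resource.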
A processing unit 550, which may be used to identify false search behavior according to the identification data. For details, see the description of step S140 in Embodiment 1 above.
For example, the processing unit 550 may use the classical decision tree (DecisionTree) algorithm to identify false search behavior of the search engine from the identification data.
In the device for identifying false search behavior of a search engine according to this embodiment of the present invention, the determining unit determines the identification data used to identify false search behavior from the user viewing behavior data and/or the user conversion behavior data of a single query word obtained from the user log, and the processing unit identifies false search behavior according to the identification data determined by the determining unit. This improves the accuracy of identifying false search behavior and also allows false search behavior to be identified automatically across the full set of query words.
Embodiment 5
Fig. 6 is a structural block diagram of a device for identifying false search behavior of a search engine according to Embodiment 5 of the present invention. The identification device 600 provided in this embodiment is used to implement the identification method of the embodiment shown in Fig. 2. Components in Fig. 6 with the same reference numerals as in Fig. 5, namely the acquiring unit 510, the determining unit 530, and the processing unit 550, have substantially the same functions as described above; for brevity, their detailed description is omitted.
Comparing Fig. 5 and Fig. 6 shows that the main difference between the embodiment of Fig. 6 and that of Fig. 5 is as follows: building on the embodiment shown in Fig. 5, when the user conversion behavior data includes the through-district conversion rate and the identification data includes the multimedia resource click divergence, the processing unit 550 may specifically include:
A first judging unit 651, for judging whether the through-district conversion rate of the current query word is less than a first threshold.
A second judging unit 653, connected to the first judging unit 651, for judging, when the first judging unit 651 determines that the through-district conversion rate of the current query word is less than the first threshold, whether the multimedia resource click divergence of the current query word is less than a second threshold.
A recognition unit 655, connected to the second judging unit 653, for identifying the search behavior of the current query word as false search behavior when the second judging unit 653 determines that the multimedia resource click divergence of the current query word is less than the second threshold.
For a specific example of this embodiment, see the description in Embodiment 2 above.
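The chain of the first judging unit, second judging unit, and recognition unit can be sketched as a two-step rule. The function name and the threshold values used in the example are hypothetical; the text does not state concrete thresholds.

```python
def is_false_search_low_conversion(through_conversion_rate: float,
                                   click_divergence: float,
                                   first_threshold: float,
                                   second_threshold: float) -> bool:
    """Rule of this embodiment: low through-district conversion AND low VCR."""
    # First judging unit: is the through-district conversion rate below threshold 1?
    if through_conversion_rate < first_threshold:
        # Second judging unit: is the click divergence below threshold 2?
        # Recognition unit: if so, flag the query word's search behavior as false.
        return click_divergence < second_threshold
    # This rule does not flag query words that fail the first check.
    return False


# Illustrative thresholds only.
flagged = is_false_search_low_conversion(0.02, 0.001, first_threshold=0.1,
                                         second_threshold=0.01)
```

A return value of False only means that this particular rule did not flag the query word; it is not a positive finding of normal behavior.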
In the device for identifying false search behavior of a search engine according to this embodiment of the present invention, the determining unit determines the identification data used to identify false search behavior from the user viewing behavior data and/or the user conversion behavior data of a single query word obtained from the user log. When the user conversion behavior data includes the through-district conversion rate and the identification data includes the multimedia resource click divergence, the processing unit can identify as false search behavior the search behavior of a query word whose through-district conversion rate is less than the first threshold and whose multimedia resource click divergence is less than the second threshold. This improves the accuracy of identifying false search behavior and also allows false search behavior to be identified automatically across the full set of query words.
Embodiment 6
Fig. 7 is a structural block diagram of a device for identifying false search behavior of a search engine according to Embodiment 6 of the present invention. The identification device 700 provided in this embodiment is used to implement the identification method of the embodiment shown in Fig. 3. Components in Fig. 7 with the same reference numerals as in Fig. 5, namely the acquiring unit 510, the determining unit 530, and the processing unit 550, have substantially the same functions as described above; for brevity, their detailed description is omitted.
Comparing Fig. 5 and Fig. 7 shows that the main difference between the embodiment of Fig. 7 and that of Fig. 5 is as follows: building on the embodiment shown in Fig. 5, when the user conversion behavior data includes the through-district conversion rate and the identification data includes the average multimedia resource playback completion ratio, the processing unit 550 may specifically include:
A first judging unit 751, for judging whether the through-district conversion rate of the current query word is less than a first threshold;
A second judging unit 753, connected to the first judging unit 751, for judging, when the first judging unit 751 determines that the through-district conversion rate of the current query word is not less than the first threshold, whether the average multimedia resource playback completion ratio of the current query word is less than a third threshold; and
A recognition unit 755, connected to the second judging unit 753, for identifying the search behavior of the current query word as false search behavior when the second judging unit 753 determines that the average multimedia resource playback completion ratio of the current query word is less than the third threshold.
For a specific example of this embodiment, see the description in Embodiment 3 above.
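The units of this embodiment can likewise be sketched as a two-step rule, this time for query words whose through-district conversion rate is not less than the first threshold. The function name and the threshold values in the example are hypothetical.

```python
def is_false_search_low_completion(through_conversion_rate: float,
                                   average_completion_ratio: float,
                                   first_threshold: float,
                                   third_threshold: float) -> bool:
    """Rule of this embodiment: conversion not below threshold 1 AND low APP."""
    # First judging unit: the rule applies only when the through-district
    # conversion rate is NOT less than the first threshold.
    if through_conversion_rate >= first_threshold:
        # Second judging unit: is the average completion ratio below threshold 3?
        # Recognition unit: if so, flag the query word's search behavior as false.
        return average_completion_ratio < third_threshold
    return False


# Illustrative thresholds only: high conversion, but playback barely started.
flagged = is_false_search_low_completion(0.6, 0.03, first_threshold=0.1,
                                         third_threshold=0.1)
```

This rule covers the branch that the low-conversion rule of the previous embodiment leaves open, so the two can be read as the two sides of the same first comparison.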
In the device for identifying false search behavior of a search engine according to this embodiment of the present invention, the determining unit determines the identification data used to identify false search behavior from the user viewing behavior data and/or the user conversion behavior data of a single query word obtained from the user log. When the user conversion behavior data includes the through-district conversion rate and the identification data includes the average multimedia resource playback completion ratio, the processing unit can identify as false search behavior the search behavior of a query word whose through-district conversion rate is not less than the first threshold and whose average multimedia resource playback completion ratio is less than the third threshold. This improves the accuracy of identifying false search behavior and also allows false search behavior to be identified automatically across the full set of query words.
The above is only a specific implementation of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims (10)

1. A method for identifying false search behavior of a search engine, the search engine being used to search for multimedia resources, characterized in that the identification method comprises:
obtaining, from a user log, user viewing behavior data of a single query word and user conversion behavior data of the single query word, wherein the user viewing behavior data of the single query word comprises: the query word, a set of clicked multimedia resources, a set of multimedia resource playback completion ratios, and a mapping function from the set of clicked multimedia resources to the set of multimedia resource playback completion ratios; and the user conversion behavior data of the single query word comprises the query word and further comprises at least one of: a query volume, a through-district hit rate, a through-district conversion rate, a user-generated content (UGC) district hit rate, a UGC district conversion rate, and an overall conversion rate;
determining, according to the user viewing behavior data and/or the user conversion behavior data, identification data for identifying the false search behavior, the identification data comprising at least one of: an independent multimedia resource playback volume, an average multimedia resource playback completion ratio, a multimedia resource click divergence, and a multimedia resource set playback residue; and
identifying the false search behavior according to the identification data.
2. The identification method according to claim 1, characterized in that, when the user conversion behavior data comprises the through-district conversion rate and the identification data comprises the multimedia resource click divergence, identifying the false search behavior according to the identification data comprises:
judging whether the through-district conversion rate of a current query word is less than a first threshold;
when the through-district conversion rate of the current query word is less than the first threshold, judging whether the multimedia resource click divergence of the current query word is less than a second threshold; and
when the multimedia resource click divergence of the current query word is less than the second threshold, identifying the search behavior of the current query word as the false search behavior.
3. The identification method according to claim 1, characterized in that, when the user conversion behavior data comprises the through-district conversion rate and the identification data comprises the average multimedia resource playback completion ratio, identifying the false search behavior according to the identification data comprises:
judging whether the through-district conversion rate of a current query word is less than a first threshold;
when the through-district conversion rate of the current query word is not less than the first threshold, judging whether the average multimedia resource playback completion ratio of the current query word is less than a third threshold; and
when the average multimedia resource playback completion ratio of the current query word is less than the third threshold, identifying the search behavior of the current query word as the false search behavior.
4. The identification method according to any one of claims 1 to 3, characterized in that determining, according to the user viewing behavior data and/or the user conversion behavior data, the identification data for identifying the false search behavior comprises: when the identification data comprises the independent multimedia resource playback volume, determining the independent multimedia resource playback volume according to the set of clicked multimedia resources in the user viewing behavior data.
5. The identification method according to claim 4, characterized in that determining, according to the user viewing behavior data and/or the user conversion behavior data, the identification data for identifying the false search behavior comprises at least one of the following steps:
when the identification data comprises the average multimedia resource playback completion ratio, determining the average multimedia resource playback completion ratio by a formula from the playback completion ratio set in the user viewing behavior data and the independent multimedia resource playback volume, wherein query is the current query word, APP(query) is the average multimedia resource playback completion ratio of the current query word, IVC(query) is the independent multimedia resource playback volume of the current query word, n_i is the number of times the i-th independent multimedia resource of the current query word was played, and perc_i is the playback completion ratio of the i-th independent multimedia resource of the current query word;
when the user conversion behavior data comprises the query volume and the identification data comprises the multimedia resource click divergence, determining the multimedia resource click divergence by a formula from the query volume and the independent multimedia resource playback volume, wherein VCR(query) is the multimedia resource click divergence of the current query word and sqv is the query volume;
when the user conversion behavior data comprises the query volume and the identification data comprises the multimedia resource set playback residue, determining the multimedia resource set playback residue from the playback completion ratio set in the user viewing behavior data and the query volume by the formula $$\mathrm{VSPR}(query) = \max_{i \in [1,\ \mathrm{IVC}(query)]}\left(\frac{n_i}{sqv} \cdot perc_i\right)$$ wherein VSPR(query) is the multimedia resource set playback residue of the current query word and max() takes the maximum value.
6. A device for identifying false search behavior of a search engine, the search engine being used to search for multimedia resources, characterized in that the identification device comprises:
an acquiring unit, for obtaining, from a user log, user viewing behavior data of a single query word and user conversion behavior data of the single query word, wherein the user viewing behavior data of the single query word comprises: the query word, a set of clicked multimedia resources, a set of multimedia resource playback completion ratios, and a mapping function from the set of clicked multimedia resources to the set of multimedia resource playback completion ratios; and the user conversion behavior data of the single query word comprises the query word and further comprises at least one of: a query volume, a through-district hit rate, a through-district conversion rate, a user-generated content (UGC) district hit rate, a UGC district conversion rate, and an overall conversion rate;
a determining unit, connected to the acquiring unit, for determining, according to the user viewing behavior data and/or the user conversion behavior data, identification data for identifying the false search behavior, the identification data comprising at least one of: an independent multimedia resource playback volume, an average multimedia resource playback completion ratio, a multimedia resource click divergence, and a multimedia resource set playback residue; and
a processing unit, for identifying the false search behavior according to the identification data.
7. The identification device according to claim 6, characterized in that, when the user conversion behavior data comprises the through-district conversion rate and the identification data comprises the multimedia resource click divergence, the processing unit specifically comprises:
a first judging unit, for judging whether the through-district conversion rate of a current query word is less than a first threshold;
a second judging unit, connected to the first judging unit, for judging, when the first judging unit determines that the through-district conversion rate of the current query word is less than the first threshold, whether the multimedia resource click divergence of the current query word is less than a second threshold; and
a recognition unit, connected to the second judging unit, for identifying the search behavior of the current query word as the false search behavior when the second judging unit determines that the multimedia resource click divergence of the current query word is less than the second threshold.
8. The identification device according to claim 6, characterized in that, when the user conversion behavior data comprises the through-district conversion rate and the identification data comprises the multimedia resource click divergence, the processing unit specifically comprises:
a first judging unit, for judging whether the through-district conversion rate of a current query word is less than a first threshold;
a second judging unit, connected to the first judging unit, for judging, when the first judging unit determines that the through-district conversion rate of the current query word is not less than the first threshold, whether the average multimedia resource playback completion ratio of the current query word is less than a third threshold; and
a recognition unit, connected to the second judging unit, for identifying the search behavior of the current query word as the false search behavior when the second judging unit determines that the average multimedia resource playback completion ratio of the current query word is less than the third threshold.
9. The identification device according to any one of claims 6 to 8, characterized in that the determining unit is specifically configured to, when the identification data comprises the independent multimedia resource playback volume, determine the independent multimedia resource playback volume according to the set of clicked multimedia resources in the user viewing behavior data.
10. The identification device according to claim 9, characterized in that the determining unit is specifically configured to perform at least one of the following steps:
when the identification data comprises the average multimedia resource playback completion ratio, determining the average multimedia resource playback completion ratio by a formula from the playback completion ratio set in the user viewing behavior data and the independent multimedia resource playback volume, wherein query is the current query word, APP(query) is the average multimedia resource playback completion ratio of the current query word, IVC(query) is the independent multimedia resource playback volume of the current query word, n_i is the number of times the i-th independent multimedia resource of the current query word was played, and perc_i is the playback completion ratio of the i-th independent multimedia resource of the current query word;
when the user conversion behavior data comprises the query volume and the identification data comprises the multimedia resource click divergence, determining the multimedia resource click divergence by a formula from the query volume and the independent multimedia resource playback volume, wherein VCR(query) is the multimedia resource click divergence of the current query word and sqv is the query volume;
when the user conversion behavior data comprises the query volume and the identification data comprises the multimedia resource set playback residue, determining the multimedia resource set playback residue from the playback completion ratio set in the user viewing behavior data and the query volume by the formula $$\mathrm{VSPR}(query) = \max_{i \in [1,\ \mathrm{IVC}(query)]}\left(\frac{n_i}{sqv} \cdot perc_i\right)$$ wherein VSPR(query) is the multimedia resource set playback residue of the current query word and max() takes the maximum value.
CN201511001301.7A 2015-12-28 2015-12-28 Method and device for identifying false search behavior of search engine Active CN105574199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511001301.7A CN105574199B (en) 2015-12-28 2015-12-28 Method and device for identifying false search behavior of search engine

Publications (2)

Publication Number Publication Date
CN105574199A true CN105574199A (en) 2016-05-11
CN105574199B CN105574199B (en) 2020-04-21

Family

ID=55884330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511001301.7A Active CN105574199B (en) 2015-12-28 2015-12-28 Method and device for identifying false search behavior of search engine

Country Status (1)

Country Link
CN (1) CN105574199B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326497A (en) * 2016-10-10 2017-01-11 合网络技术(北京)有限公司 Cheating video user identification method and device
CN106326498A (en) * 2016-10-13 2017-01-11 合网络技术(北京)有限公司 Cheat video identification method and device
CN106777303A (en) * 2016-12-30 2017-05-31 中国民航信息网络股份有限公司 Passenger flight User behavior sorting technique and system
CN107529093A (en) * 2017-09-05 2017-12-29 北京奇艺世纪科技有限公司 A kind of detection method and system of video file playback volume
CN108090100A (en) * 2016-11-23 2018-05-29 百度在线网络技术(北京)有限公司 A kind of data identification method and device
CN110188262A (en) * 2019-07-23 2019-08-30 武汉斗鱼网络科技有限公司 A kind of abnormal object determines method, apparatus, equipment and medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070143278A1 (en) * 2005-12-15 2007-06-21 Microsoft Corporation Context-based key phrase discovery and similarity measurement utilizing search engine query logs
CN104021162A (en) * 2014-05-28 2014-09-03 小米科技有限责任公司 Method and device for grading multimedia resource
CN104035982A (en) * 2014-05-28 2014-09-10 小米科技有限责任公司 Multimedia resource recommendation method and device
CN104504059A (en) * 2014-12-22 2015-04-08 合一网络技术(北京)有限公司 Multimedia resource recommending method
CN104506894A (en) * 2014-12-22 2015-04-08 合一网络技术(北京)有限公司 Method and device for evaluating multi-media resources

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326497A (en) * 2016-10-10 2017-01-11 合一网络技术(北京)有限公司 Cheating video user identification method and device
CN106326498A (en) * 2016-10-13 2017-01-11 合一网络技术(北京)有限公司 Cheating video identification method and device
CN108090100A (en) * 2016-11-23 2018-05-29 百度在线网络技术(北京)有限公司 Data identification method and device
CN106777303A (en) * 2016-12-30 2017-05-31 中国民航信息网络股份有限公司 Passenger flight inquiry behavior classification method and system
CN106777303B (en) * 2016-12-30 2020-11-06 中国民航信息网络股份有限公司 Passenger flight inquiry behavior classification method and system
CN107529093A (en) * 2017-09-05 2017-12-29 北京奇艺世纪科技有限公司 Detection method and system for video file playback volume
CN107529093B (en) * 2017-09-05 2020-05-22 北京奇艺世纪科技有限公司 Method and system for detecting playing amount of video file
CN110188262A (en) * 2019-07-23 2019-08-30 武汉斗鱼网络科技有限公司 Abnormal object determination method, apparatus, device and medium

Also Published As

Publication number Publication date
CN105574199B (en) 2020-04-21

Similar Documents

Publication Publication Date Title
CN105574199A (en) Identification method and device for false search behavior of search engine
CN106202211B (en) Integrated microblog rumor identification method based on microblog types
Das Sarma et al. Dynamic relationship and event discovery
US9317550B2 (en) Query expansion
CN101320375B (en) Digital book search method based on user click action
Bouaziz et al. Short text classification using semantic random forest
CN106682194A (en) Answer positioning method and device based on deep questions and answers
CN102737042B (en) Method and device for establishing question generation model, and question generation method and device
CN105975596A (en) Query expansion method and system of search engine
CN104994424B (en) A kind of method and apparatus for building audio and video standard data set
CN103678668A (en) Prompting method of relevant search result, server and system
CN102929873A (en) Method and device for extracting searching value terms based on context search
CN103365910A (en) Method and system for information retrieval
CN106326497A (en) Cheating video user identification method and device
WO2020135642A1 (en) Model training method and apparatus employing generative adversarial network
Chen et al. Generating ontologies with basic level concepts from folksonomies
CN103020289B (en) A kind of search engine user individual demand supplying method based on Web log mining
CN104331493A (en) Method and device for generating trend interpretation data by virtue of computer
Santos et al. Integrating proximity to subjective sentences for blog opinion retrieval
CN113032557A (en) Microblog hot topic discovery method based on frequent word set and BERT semantics
Da Silva et al. Measuring quality of similarity functions in approximate data matching
Hao et al. Modeling positive and negative feedback for improving document retrieval
CN110929509B (en) Domain event trigger word clustering method based on louvain community discovery algorithm
CN112286799A (en) Software defect positioning method combining sentence embedding and particle swarm optimization algorithm
CN107480130B (en) Method for judging attribute value identity of relational data based on WEB information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100080 Floor 5 (A, C), Block A, Sinosteel International Plaza, No. 8 Haidian Street, Haidian District, Beijing

Patentee after: Youku Network Technology (Beijing) Co., Ltd.

Address before: 100080 Floor 5 (A, C), Block A, Sinosteel International Plaza, No. 8 Haidian Street, Haidian District, Beijing

Patentee before: 1Verge Internet Technology (Beijing) Co., Ltd.

TR01 Transfer of patent right

Effective date of registration: 20200522

Address after: 310052 Room 508, Floor 5, Building 4, No. 699 Wangshang Road, Changhe Street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co., Ltd.

Address before: 100080 Floor 5 (A, C), Block A, Sinosteel International Plaza, No. 8 Haidian Street, Haidian District, Beijing

Patentee before: Youku Network Technology (Beijing) Co., Ltd.