CN105574199B - Method and device for identifying false search behavior of search engine - Google Patents


Info

Publication number
CN105574199B
Authority
CN
China
Prior art keywords
multimedia resource
behavior
query
multimedia
playing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201511001301.7A
Other languages
Chinese (zh)
Other versions
CN105574199A (en)
Inventor
魏博
齐志兵
李力行
魏强
马堰夫
姚键
顾思斌
潘柏宇
王冀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Youku Network Technology Beijing Co Ltd
Original Assignee
Youku Network Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Youku Network Technology Beijing Co Ltd filed Critical Youku Network Technology Beijing Co Ltd
Priority to CN201511001301.7A priority Critical patent/CN105574199B/en
Publication of CN105574199A publication Critical patent/CN105574199A/en
Application granted granted Critical
Publication of CN105574199B publication Critical patent/CN105574199B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for identifying false search behavior of a search engine, wherein the search engine is used for searching multimedia resources. The identification method comprises the following steps: acquiring user viewing behavior data of a single query term and user conversion behavior data of the single query term from a user log; determining identification data for identifying false search behavior according to the user viewing behavior data and/or the user conversion behavior data, wherein the identification data comprises at least one of an independent multimedia resource play amount, a multimedia resource average play completion ratio, a multimedia resource click divergence, and a multimedia resource set play residual; and identifying false search behavior based on the identification data. The method and the device can improve the accuracy of identifying false search behavior and can also automatically identify false search behavior across the full set of query terms.

Description

Method and device for identifying false search behavior of search engine
Technical Field
The invention relates to the field of information search and retrieval, in particular to a method and a device for identifying false search behaviors of a search engine.
Background
Currently, there is no unified, mature method for identifying false search behavior of a search engine used for searching multimedia resources. Generally, a search engine carries out such identification only when its own business requirements call for it. As the business systems of search engines mature and their processing power and robustness increase, false search behavior can largely be tolerated, i.e., there is substantially no need to identify it. For example, engineers deliberately set out to identify false search behavior only when individual false search behavior affects the service quality of the search engine.
Also, it is difficult to identify false search behavior of a search engine because:
(1) In the prior art, the false search behavior of a search engine is not strictly defined and has only the following simple definition: false search behavior refers to search behavior in which the user does not actually aim to search for and view multimedia resources. That is, if the user's search intent is not to search for and view multimedia resources, the search for that query term may be a false search behavior. This makes false search behavior difficult to identify. For example, the user's search intent can only be judged by subjective understanding, and whether the search behavior for a query term is false is then decided based on whether that intent is to search for and view multimedia resources.
(2) False search behavior of search engines is typically hidden. In particular, since the user is located at the front end of the search engine and the engineer at the back end, and the only actual point of interaction between the user and the search engine is the query term, it is neither feasible nor appropriate for the engineer to confirm search intent face to face, one by one, with each user. This makes false search behavior difficult to identify.
(3) False search behavior of search engines is flexible. In particular, since the sources of false search behavior are diverse, e.g., a user actively entering an external website link (linked to a high-traffic search engine through an impersonation or nested search pattern), impersonation of an IP address, etc., false search behavior may not maintain stable characteristics in time and space. For example, for the same query term, key metrics such as clicks, plays, and IP addresses on the first day may differ significantly from the same metrics on the second day. This also makes false search behavior difficult to identify.
(4) Typically, the identification of false search behavior by a search engine is lagging and passive. On one hand, due to the diversity of internet users and the existence of long-tail demand, it is impossible to judge from a single search whether that search is false. Typically, only when false search behavior needs to be identified is a search behavior determined to be false by analyzing the requests within a particular time period and IP address segment, and such a determination still comes late. In fact, techniques for emulating random IP addresses are now well established, so analyzing IP addresses may not be an appropriate way to identify false search behavior. On the other hand, manual analysis of false search behavior across the full set of query terms is impractical, because identifying false search behavior from aggregated data may require the complete log, which is available only the next day.
In addition, false search behavior for multimedia resources such as video and audio is mainly reflected in the following two aspects: (1) searching without clicking, in which there is a large amount of search input and the searches hit multimedia resources, but the multimedia resources are not clicked correspondingly; and (2) clicking without playing, in which multimedia resources are clicked but no viewing of the multimedia resources follows.
Existing identification of false search behavior basically determines whether the search behavior for a query term contains false search behavior based on the burst characteristics and the IP address distribution of the query term over a short time. This identification method may be effective for false search behavior of the search-without-click kind, but it may not be effective for false search behavior of the click-without-play kind. Also, with the development of current crawler technologies, crawlers that forge IP addresses make false search behavior even harder to identify. In addition, false search behavior across the full set of query terms cannot currently be identified automatically.
Disclosure of Invention
Technical problem
In view of the above, the technical problem to be solved by the present invention is how to identify the false search behavior of a search engine.
Solution scheme
In order to solve the above technical problem, in a first aspect, the present invention provides a method for identifying false search behavior of a search engine, the search engine being used for searching multimedia resources, the method comprising:
acquiring user viewing behavior data of a single query term and user conversion behavior data of the single query term from a user log, wherein the user viewing behavior data of the single query term comprises: the query term, a clicked multimedia resource set, a multimedia resource play completion ratio set, and a mapping function from the clicked multimedia resource set to the multimedia resource play completion ratio set; and wherein the user conversion behavior data of the single query term comprises the query term and further comprises at least one of a query amount, a direct region hit rate, a direct region conversion rate, a user-generated content (UGC) region hit rate, a UGC region conversion rate, and an overall conversion rate;
determining identification data for identifying the false search behavior according to the user viewing behavior data and/or the user conversion behavior data, wherein the identification data comprises at least one of an independent multimedia resource play amount, a multimedia resource average play completion ratio, a multimedia resource click divergence, and a multimedia resource set play residual; and
identifying the false search behavior based on the identification data.
With reference to the first aspect, in a first possible implementation manner, in a case where the user conversion behavior data includes a direct region conversion rate and the identification data includes a multimedia resource click divergence, identifying the false search behavior according to the identification data includes:
judging whether the conversion rate of the direct region of the current query word is smaller than a first threshold value;
under the condition that the conversion rate of the direct region of the current query word is smaller than the first threshold value, judging whether the click divergence of the multimedia resource of the current query word is smaller than a second threshold value; and
in the case that the multimedia resource click divergence of the current query word is smaller than the second threshold value, identifying the search behavior of the current query word as the false search behavior.
With reference to the first aspect, in a second possible implementation manner, in a case that the user conversion behavior data includes a direct region conversion rate and the identification data includes an average multimedia resource playing completion ratio, identifying the false search behavior according to the identification data includes:
judging whether the conversion rate of the direct region of the current query word is smaller than a first threshold value;
under the condition that the conversion rate of the direct region of the current query word is not less than the first threshold value, judging whether the average playing completion ratio of the multimedia resources of the current query word is less than a third threshold value; and
in the case that the average playing completion ratio of the multimedia resources of the current query word is smaller than the third threshold value, identifying the search behavior of the current query word as the false search behavior.
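The two implementations above can be read together as a small two-branch decision rule. The following Python sketch illustrates that reading only; the function name, the parameter names, and the threshold values are hypothetical and do not come from the patent text, which does not state concrete thresholds.

```python
from typing import Optional


def is_false_search(dtra: float,
                    vcr: Optional[float],
                    app: Optional[float],
                    t1: float, t2: float, t3: float) -> bool:
    """Return True if the search behavior of a query term looks false.

    dtra: direct region conversion rate of the current query term
    vcr:  multimedia resource click divergence (None if not available)
    app:  multimedia resource average play completion ratio (None if not available)
    t1, t2, t3: first, second and third thresholds (values are assumptions)
    """
    if dtra < t1:
        # First implementation: low direct region conversion rate
        # AND low click divergence.
        return vcr is not None and vcr < t2
    # Second implementation: conversion rate not below the first threshold,
    # but low average play completion ratio.
    return app is not None and app < t3


# Hypothetical thresholds, chosen only for illustration.
T1, T2, T3 = 0.1, 0.05, 0.2
print(is_false_search(dtra=0.02, vcr=0.01, app=None, t1=T1, t2=T2, t3=T3))
print(is_false_search(dtra=0.30, vcr=None, app=0.90, t1=T1, t2=T2, t3=T3))
```

Treating a missing metric as "not false" is a design choice of this sketch; a real system might instead fall back to other identification data.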
With reference to the first aspect and the first or second possible implementation manner of the first aspect, in a third possible implementation manner, determining, according to the user viewing behavior data and/or the user conversion behavior data, identification data for identifying the false search behavior includes: in the case that the identification data comprises the independent multimedia resource playing amount, determining the independent multimedia resource playing amount according to the clicked multimedia resource set in the user viewing behavior data.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner, determining, according to the user viewing behavior data and/or the user conversion behavior data, identification data for identifying the false search behavior includes at least one of the following steps:
under the condition that the identification data comprises the average playing completion ratio of the multimedia resources, adopting a formula according to the playing completion ratio set and the independent multimedia resource playing amount in the user watching behavior data
Figure BDA0000893080410000051
to determine the average playing completion ratio of the multimedia resources, wherein query is the current query term, APP(query) is the average playing completion ratio of the multimedia resources of the current query term, IVC(query) is the independent multimedia resource playing amount of the current query term, n_i is the number of times the i-th independent multimedia resource of the current query term is played, and perc_i is the play completion ratio of the i-th independent multimedia resource of the current query term;
in the case that the user conversion behavior comprises the query amount and the identification data comprises the click divergence of the multimedia resource, adopting a formula according to the query amount and the independent multimedia resource playing amount
Figure BDA0000893080410000052
Determining the multimedia resource click divergence, wherein VCR (query) is the multimedia resource click divergence of the current query word, and sqv is the query quantity;
under the condition that the user conversion behavior comprises the query quantity and the identification data comprises the multimedia resource set playing residual degree, adopting a formula according to the playing completion ratio set and the query quantity in the user viewing behavior data
Figure BDA0000893080410000053
Determining the multimedia resource set playing residual, wherein the VSPR (query) is the multimedia resource set playing residual of the current query word, and max () takes the maximum value.
In a second aspect, the present invention provides an apparatus for identifying false search behavior of a search engine, the search engine being used for searching multimedia resources, the apparatus comprising:
an obtaining unit, configured to obtain, from a user log, user viewing behavior data of a single query term and user conversion behavior data of the single query term, wherein the user viewing behavior data of the single query term comprises: the query term, a clicked multimedia resource set, a multimedia resource play completion ratio set, and a mapping function from the clicked multimedia resource set to the multimedia resource play completion ratio set; and wherein the user conversion behavior data of the single query term comprises the query term and further comprises at least one of a query amount, a direct region hit rate, a direct region conversion rate, a user-generated content (UGC) region hit rate, a UGC region conversion rate, and an overall conversion rate;
a determining unit, connected to the obtaining unit and configured to determine identification data for identifying the false search behavior according to the user viewing behavior data and/or the user conversion behavior data, wherein the identification data comprises at least one of an independent multimedia resource play amount, a multimedia resource average play completion ratio, a multimedia resource click divergence, and a multimedia resource set play residual; and
a processing unit for identifying the false search behavior according to the identification data.
With reference to the second aspect, in a first possible implementation manner, when the user conversion behavior data includes a direct region conversion rate and the identification data includes a multimedia resource click divergence, the processing unit specifically includes:
a first judging unit, configured to judge whether the direct region conversion rate of the current query term is smaller than a first threshold value;
a second judging unit, connected to the first judging unit and configured to judge whether the multimedia resource click divergence of the current query term is smaller than a second threshold value in the case that the first judging unit judges that the direct region conversion rate of the current query term is smaller than the first threshold value; and
an identification unit, connected to the second judging unit and configured to identify the search behavior of the current query term as the false search behavior in the case that the second judging unit judges that the multimedia resource click divergence of the current query term is smaller than the second threshold value.
With reference to the second aspect, in a second possible implementation manner, when the user conversion behavior data includes a direct region conversion rate and the identification data includes a multimedia resource average play completion ratio, the processing unit specifically includes:
a first judging unit, configured to judge whether the direct region conversion rate of the current query term is smaller than a first threshold value;
a second judging unit, connected to the first judging unit and configured to judge whether the average play completion ratio of the multimedia resources of the current query term is smaller than a third threshold value in the case that the first judging unit judges that the direct region conversion rate of the current query term is not smaller than the first threshold value; and
an identification unit, connected to the second judging unit and configured to identify the search behavior of the current query term as the false search behavior in the case that the second judging unit judges that the average play completion ratio of the multimedia resources of the current query term is smaller than the third threshold value.
With reference to the second aspect and the first or second possible implementation manner of the second aspect, in a third possible implementation manner, the determining unit is specifically configured to, in a case that the identification data includes the independent multimedia asset playing amount, determine the independent multimedia asset playing amount according to a clicked multimedia asset set in the user viewing behavior data.
With reference to the third possible implementation manner of the second aspect, in a fourth possible implementation manner, the determining unit is specifically configured to perform at least one of the following steps:
under the condition that the identification data comprises the average playing completion ratio of the multimedia resources, adopting a formula according to the playing completion ratio set and the independent multimedia resource playing amount in the user viewing behavior data
Figure BDA0000893080410000071
to determine the average playing completion ratio of the multimedia resources, wherein query is the current query term, APP(query) is the average playing completion ratio of the multimedia resources of the current query term, IVC(query) is the independent multimedia resource playing amount of the current query term, n_i is the number of times the i-th independent multimedia resource of the current query term is played, and perc_i is the play completion ratio of the i-th independent multimedia resource of the current query term;
in the case that the user conversion behavior comprises the query amount and the identification data comprises the click divergence of the multimedia resource, adopting a formula according to the query amount and the independent multimedia resource playing amount
Figure BDA0000893080410000072
Determining the multimedia resource click divergence, wherein VCR (query) is the multimedia resource click divergence of the current query word, and sqv is the query quantity;
under the condition that the user conversion behavior comprises the query quantity and the identification data comprises the multimedia resource set playing residual degree, adopting a formula according to the playing completion ratio set and the query quantity in the user viewing behavior data
Figure BDA0000893080410000081
Determining the multimedia resource set playing residual, wherein the VSPR (query) is the multimedia resource set playing residual of the current query word, and max () takes the maximum value.
Advantageous effects
According to the method and the device for identifying false search behavior of a search engine, identification data for identifying false search behavior is determined according to the user viewing behavior data of a single query term and/or the user conversion behavior data of the single query term acquired from the user log, and false search behavior is identified according to the determined identification data. The accuracy of identifying false search behavior can thereby be improved, and false search behavior across the full set of query terms can be identified automatically.
Other features and aspects of the present invention will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the invention and, together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a method for identifying false search behavior of a search engine according to a first embodiment of the invention;
FIG. 2 is a flow chart of a method for identifying false search behavior of a search engine according to a second embodiment of the invention;
FIG. 3 illustrates an example of a decision tree model applied to the present invention;
FIG. 4 is a flow chart of a method for identifying false search behavior of a search engine according to a third embodiment of the present invention;
FIG. 5 is a block diagram of an apparatus for identifying false search behavior of a search engine according to a fourth embodiment of the present invention;
FIG. 6 is a block diagram of an apparatus for identifying false search behavior of a search engine according to a fifth embodiment of the present invention; and
FIG. 7 is a block diagram of an apparatus for identifying false search behavior of a search engine according to a sixth embodiment of the present invention.
Detailed Description
Various exemplary embodiments, features and aspects of the present invention will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, methods, procedures, components, and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present invention.
Example 1
FIG. 1 illustrates a flow diagram of a method for identifying false search behavior of a search engine according to an embodiment of the invention. As shown in FIG. 1, the identification method may mainly include:
Step S100, user viewing behavior data of a single query term and user conversion behavior data of the single query term may be obtained from the user log.
In particular, the user viewing behavior of each query term may be characterized using the quadruple { query, videos, percs, δ }. The process may include pre-processing and noise-removal processing of user log data, which may be noisy from various aspects such as illegal inputs, system anomalies, logging anomalies, and so forth.
Here, query is the query term; for example, the user's query term can be obtained from a user log of the search engine.
videos is the clicked multimedia resource set, that is, the set of multimedia resources that users clicked on the search result page after searching for the query term. For example, videos can be obtained from the multimedia resource viewing log in the user logs by restricting the source from which the multimedia resource was viewed.
percs is the multimedia resource play completion ratio set, that is, the set of play completion ratios of the clicked multimedia resources. For example, percs can be obtained from the multimedia resource viewing log in the user logs by performing secondary processing on the multimedia resource play data. It should be noted that, because the total durations of multimedia resources may differ greatly, characterizing user viewing behavior by the play completion ratio is more objective than characterizing it simply by the play duration. For the same query term, if a clicked multimedia resource is played multiple times, its play completion ratio should be a composite score; for example, the average of all of its play completion ratios under the query term may be taken, or alternatively the median.
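As an illustration of the composite score described above, the following Python sketch collapses several plays of the same clicked multimedia resource into one play completion ratio using either the mean or the median. The function name and the `method` parameter are hypothetical; the text only says that an average or a median may be used.

```python
from statistics import mean, median
from typing import Sequence


def composite_completion_ratio(ratios: Sequence[float], method: str = "mean") -> float:
    """Collapse the completion ratios of repeated plays into one score."""
    if not ratios:
        raise ValueError("at least one play completion ratio is required")
    if method == "mean":
        return mean(ratios)
    if method == "median":
        return median(ratios)
    raise ValueError(f"unknown method: {method!r}")


# Three plays of the same resource under one query term (made-up values).
plays = [0.25, 0.5, 1.0]
print(composite_completion_ratio(plays, "mean"))
print(composite_completion_ratio(plays, "median"))
```

The median is less sensitive to a single very short or very long play, which may matter when a resource is played only a few times.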
δ is a mapping function of the clicked multimedia resource set to the multimedia resource play completion ratio set, which may be predefined, for example, when the multimedia resource play completion ratio set is obtained.
That is, the user viewing behavior data of the single query term may include: query word (query), clicked multimedia resource collection (videos), multimedia resource playing completion ratio collection (percs), and mapping function (delta) from clicked multimedia resource collection to multimedia resource playing completion ratio collection.
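A minimal Python sketch of the quadruple { query, videos, percs, δ } follows. The class name, the helper function, and the choice to represent δ as a dictionary from resource identifier to play completion ratio are assumptions made for illustration; the text does not prescribe a concrete representation.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ViewingBehavior:
    query: str               # the query term
    videos: List[str]        # clicked multimedia resource set (resource IDs)
    percs: List[float]       # play completion ratios of the clicked resources
    delta: Dict[str, float]  # mapping from each clicked resource to its ratio


def build_viewing_behavior(query: str,
                           completion_by_video: Dict[str, float]) -> ViewingBehavior:
    """Build the quadruple from one composite completion ratio per resource."""
    videos = list(completion_by_video)
    percs = [completion_by_video[v] for v in videos]
    return ViewingBehavior(query, videos, percs, dict(completion_by_video))


# Made-up resource IDs and ratios.
vb = build_viewing_behavior("example query", {"v1": 0.5, "v2": 1.0})
print(vb.videos, vb.percs, vb.delta["v2"])
```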
Specifically, the user conversion behavior data of the single query term may include the query term (query), and the user conversion behavior data may further include at least one of a query amount (sqv), a direct region hit rate (Dhit), a direct region conversion rate (Dtra), a user-generated content (UGC) region hit rate (Uhit), a UGC region conversion rate (Utra), and an overall conversion rate (Wtra).
The query amount may be the number of times the query term is searched in a certain time period, for example, assuming that the query term a1 is searched 25 times in a day, the query amount of the query term a1 is 25.
The direct region hit rate may be the proportion of searches in which a search result of the direct region is hit. The direct region is a display area, in the search result page of a multimedia resource search engine, of multimedia resources manually curated by editors in response to the input query term; for example, the direct region may consist of copyrighted multimedia resources or multimedia resources from high-quality accounts, so that the user can quickly find a result. For example, assuming that the query term is searched 100 times and 40 of those searches hit a multimedia resource in the direct region, the direct region hit rate is
$$\mathrm{Dhit} = \frac{40}{100} = 40\%$$
It should be noted that even if the same multimedia resource in the direct region is hit multiple times, that multimedia resource is counted only once.
The direct region conversion rate may refer to the proportion of times that search results of the direct region are converted into the multimedia resource play page. For example, assuming that the query term is searched 100 times, 40 of those searches hit a multimedia resource in the direct region, and there are 60 conversions into the multimedia resource play page, the direct region conversion rate is
Figure BDA0000893080410000112
It should be noted that each conversion is counted, that is, each conversion into the multimedia resource play page increases the conversion count by one.
The User Generated Content (UGC) region hit rate may be the proportion of searches in which a search result of the UGC region is hit. The UGC region is a display area, in the result page of a multimedia resource search engine, consisting of multimedia resources uploaded by ordinary users; UGC emerged along with the Web 2.0 concept, whose main characteristic is personalization. For example, assuming that the query term is searched 100 times and 40 of those searches hit a multimedia resource in the UGC region, the UGC region hit rate is
$$\mathrm{Uhit} = \frac{40}{100} = 40\%$$
It should be noted that even if the same multimedia resource in the UGC region is hit multiple times, that multimedia resource is counted only once.
The UGC region conversion rate may refer to the proportion of times that search results of the UGC region are converted into the multimedia resource play page. For example, assuming that the query term is searched 100 times, 40 of those searches hit a multimedia resource in the UGC region, and there are 60 conversions into the multimedia resource play page, the UGC region conversion rate is
Figure BDA0000893080410000121
It should be noted that each conversion is counted, that is, each conversion into the multimedia resource play page increases the conversion count by one.
The overall conversion rate may refer to the proportion of the overall search results that are converted into multimedia resource play pages. For example, assuming that the query term is searched 100 times and 60 conversions into a multimedia resource play page occur, the overall conversion rate is
$\frac{60}{100} = 60\%$
It should be noted that every conversion is counted: each time a conversion into the multimedia resource playing page occurs, the conversion count increases by one.
In one possible implementation, the user conversion behavior of each query term may be characterized by the seven-tuple {query, sqv, Dhit, Dtra, Uhit, Utra, Wtra}. For example, Table 1 below shows an example of the raw data fields for query-term conversion and playback.
TABLE 1 Raw data fields for query-term conversion and playback
Column number   1      2     3      4    5     6     7     8     9
Field name      query  vids  percs  sqv  Dhit  Dtra  Uhit  Utra  Wtra
Step S120, determining identification data for identifying the false search behavior according to the user viewing behavior data and/or the user conversion behavior data, where the identification data may include at least one of an independent multimedia resource playing amount, a multimedia resource average playing completion ratio, a multimedia resource click divergence, and a multimedia resource set playing residual.
The inventors of the present application have realized that spurious search behavior by a search engine may have the following characteristics:
First, query terms containing false search behavior have a large query volume sqv but few hits. Examples include false searches intended to inflate multimedia resource counts, and false searches imported from external sites (i.e., search behavior that links to the search engine in large volume through impersonated or nested search patterns). Such specific query terms are numerous but do not reflect real user needs, so few of these searches actually convert into clicking or playing multimedia resources.
Secondly, the clicks of a query term containing false search behavior are usually fixed on one or a few specific multimedia resources in order to inflate the counts of those resources, and the playing completion ratio of those resources is low. This feature often appears in cheating promotion of multimedia resources.
Furthermore, the average playing completion ratio of query terms containing false search behavior is low. When many multimedia resources under a query term have a low playing completion ratio, the average playing completion ratio of that query term over the entire search result page is also low. This feature is particularly evident when the query volume is relatively large.
Finally, the number of independent multimedia resources clicked under a query term containing false search behavior is typically small. The clicks are concentrated on one or a few multimedia resources and clicks on other multimedia resources are rare, so the total number of independently clicked multimedia resources is small.
The inventors of the present application have conceived based on the above feature that identification data for identifying false search behavior can be extracted from user log data, for example, user viewing behavior data and/or user conversion behavior data.
In one possible implementation, determining identification data for identifying a false search behavior according to the user viewing behavior data and/or the user conversion behavior data may include: in the case where the identification data includes the independent multimedia resource play amount, the independent multimedia resource play amount may be determined according to the clicked multimedia resource set in the user viewing behavior data.
The independent multimedia resource playing amount (Independent Video Count, IVC for short) describes the breadth of the multimedia resources clicked under a single query term. The more distinct multimedia resources are clicked under the query term, the larger the independent multimedia resource playing amount; conversely, the fewer distinct multimedia resources are clicked, the smaller it is. For example, the distinct clicked multimedia resources can be determined from the query term query and the clicked multimedia resource set vids in the user viewing behavior data. Therefore, the independent multimedia resource playing amount can be determined from the query term query and the clicked multimedia resource set vids by using the following formula (1):
$\mathrm{IVC}(query) = \left|\{\text{distinct multimedia resources clicked for } query\}\right|$    formula (1)
Generally speaking, for query terms with normal search results and normal search behavior, the independent multimedia resource playing amount is not small, which is consistent with the diversity of user needs and the randomness of click behavior. However, if a single query term contains false search behavior, its independent multimedia resource playing amount will generally not be large, because the user may not click any multimedia resource at all, or the clicks may be limited to one particular multimedia resource. It should be noted that when the returned result contains a direct zone multimedia resource, the independent multimedia resource playing amount may also be smaller.
In one possible implementation, determining identification data for identifying false search behavior based on the user viewing behavior data and/or the user conversion behavior data may include at least one of:
in the case where the identification data includes the multimedia resource average playing completion ratio, formula (2) (Figure BDA0000893080410000141) may be applied to the playing completion ratio set in the user viewing behavior data and the determined independent multimedia resource playing amount to determine the multimedia resource average playing completion ratio, where query is the current query term, APP(query) is the multimedia resource average playing completion ratio of the current query term, IVC(query) is the independent multimedia resource playing amount of the current query term, n_i is the number of times the i-th independent multimedia resource of the current query term is played, and perc_i is the play completion ratio of the i-th independent multimedia resource of the current query term;
in the case where the user conversion behavior data includes a query amount and the identification data includes the multimedia resource click divergence, formula (3) (Figure BDA0000893080410000142) may be applied to the query amount and the determined independent multimedia resource playing amount to determine the multimedia resource click divergence, where VCR(query) is the multimedia resource click divergence of the current query term and sqv is the query amount;
in the case where the user conversion behavior data includes a query amount and the identification data includes the multimedia resource set playing residual, formula (4) (Figure BDA0000893080410000151) may be applied to the playing completion ratio set in the user viewing behavior data and the query amount to determine the multimedia resource set playing residual, where VSPR(query) is the multimedia resource set playing residual of the current query term and max() takes the maximum value.
The multimedia resource average playing completion ratio (Average Playing Percentage, APP for short) describes the average degree to which the multimedia resources in the search result set of a single query term are played to completion. The larger the average playing completion ratio, the more completely the multimedia resources under the query term are viewed; conversely, the smaller it is, the less completely they are viewed. As described above, the multimedia resource average playing completion ratio can be determined using formula (2) above.
Generally, the average playing completion ratio of a single query term over the entire multimedia resource search result set will not be very low unless the completion ratio of every play is very low. For example, the average playing completion ratio over all single query terms of the search engine is about 44%. If the average playing completion ratio of a single query term is low, that query term is likely to contain false search behavior.
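Formula (2) is available in this text only as a figure reference, so its exact form cannot be confirmed here. As a minimal sketch, assuming APP is the play-count-weighted mean of the per-resource completion ratios (an assumption that merely fits the definitions of n_i and perc_i given above), the computation might look like the following. The function name and the sample numbers are hypothetical.

```python
def average_play_completion(plays):
    """Return an APP-style value for one query term.

    plays: list of (n_i, perc_i) pairs, one pair per independent resource,
    where n_i is the play count and perc_i the play completion ratio.

    ASSUMPTION: APP = sum(n_i * perc_i) / sum(n_i). The source's formula (2)
    is an image and may differ from this weighted mean.
    """
    total_plays = sum(n for n, _ in plays)
    if total_plays == 0:
        return 0.0  # no plays at all: treat completion as zero
    return sum(n * perc for n, perc in plays) / total_plays


# Hypothetical query term: one resource played once at 50% completion,
# another played three times at 10% completion.
app = average_play_completion([(1, 0.5), (3, 0.1)])
print(round(app, 4))
```

Under this assumption, a resource that is played many times at a low completion ratio pulls the average down sharply, which matches the behavior the description attributes to false searches.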
The multimedia resource click divergence (Video Click Ratio, VCR for short) describes how widely the clicks under a query term are spread over the multimedia resources on the search result page. Relative to the query amount, the more independent multimedia resources are clicked, the larger the multimedia resource click divergence; conversely, the fewer independent multimedia resources are clicked, the smaller it is. As described above, the multimedia resource click divergence can be determined using formula (3) above.
Generally, the multimedia resource click divergence varies with the exposure and conversion of the direct region. If the query term is a time-sensitive term (i.e., a search term whose user attention exceeds a certain level during a specific time period), its search amount sqv in that period (for example, the current day) is large and the clicks concentrate on the multimedia resources of the topic finder; by formula (3) above, the multimedia resource click divergence is therefore not high, that is, the users' click behavior concentrates on a new multimedia resource.
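Formula (3) is likewise only referenced as an image. A minimal sketch, assuming the click divergence is simply the independent multimedia resource playing amount divided by the query amount (an assumption suggested by the phrase "compared with the query quantity", not a confirmed reading), is shown below. The function name and inputs are hypothetical.

```python
def click_divergence(ivc, sqv):
    """Return a VCR-style value for one query term.

    ASSUMPTION: VCR(query) = IVC(query) / sqv. The source's formula (3)
    is an image and may differ.
    """
    if sqv <= 0:
        raise ValueError("sqv (query amount) must be positive")
    return ivc / sqv


# Hypothetical numbers: a query whose clicks spread over many resources,
# and a high-volume query whose clicks land on only two resources.
vcr_diverse = click_divergence(ivc=40, sqv=100)
vcr_concentrated = click_divergence(ivc=2, sqv=1000)
print(vcr_diverse, vcr_concentrated)
```

With this reading, a large sqv combined with clicks on very few distinct resources yields a value close to zero, which is the pattern the description associates with both time-sensitive terms and false searches.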
The multimedia resource set playing residual (Video Set Playing Residual, VSPR for short) describes the extent to which the results of a query term go unplayed on the multimedia resource search result page. Each multimedia resource has a click ratio, i.e., the ratio of the number of times the multimedia resource is clicked to the number of times it is searched. For example, assuming that multimedia resource B1 is searched 100 times and clicked 20 times, the click ratio of B1 is $\frac{20}{100} = 20\%$. That is, the click ratio of a multimedia resource can be calculated by the term of formula (4) shown in Figure BDA0000893080410000162. Each multimedia resource also has a play completion ratio perc_i. The click ratio (Figure BDA0000893080410000163) and the play completion ratio perc_i together determine how completely the query term is played over the entire multimedia resource set. The larger the click ratio of a multimedia resource (Figure BDA0000893080410000164) and the smaller its play completion ratio perc_i (i.e., the larger the quantity shown in Figure BDA0000893080410000165), the less complete the playing of the multimedia resource set and the larger the multimedia resource set playing residual VSPR(query). In other words, VSPR(query) can be determined using formula (4) above.
According to formula (4) above, the multimedia resource set playing residual VSPR(query) is driven by the worst-performing single multimedia resource: the larger the click ratio of a multimedia resource (Figure BDA0000893080410000166) and the lower its play completion ratio perc_i, the larger the multimedia resource set playing residual VSPR(query).
It should be noted that, in the present invention, the method of determining the multimedia resource set playing residual VSPR(query) is not limited to the above. Based on the disclosure of the present application and common technical knowledge, those skilled in the art will understand that VSPR(query) may also be determined in other ways; for example, it may be determined according to the overall performance of the multimedia resource set.
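The exact form of formula (4) cannot be recovered from the text, which only states that it takes a maximum over resources and grows with the click ratio and shrinks with perc_i. One possible form that satisfies those properties is sketched below, assuming the click ratio is n_i / sqv and the per-resource residual is the click ratio multiplied by (1 - perc_i). Both choices are assumptions made for illustration, and the function name and data are hypothetical.

```python
def set_play_residual(plays, sqv):
    """Return a VSPR-style value for one query term.

    plays: list of (n_i, perc_i) pairs, one per independent resource.
    sqv:   query amount of the query term.

    ASSUMPTION: residual_i = (n_i / sqv) * (1 - perc_i) and
    VSPR = max_i residual_i. The source's formula (4) is an image and
    may combine the click ratio and perc_i differently.
    """
    if sqv <= 0:
        raise ValueError("sqv (query amount) must be positive")
    return max(((n / sqv) * (1.0 - perc) for n, perc in plays), default=0.0)


# Hypothetical query term searched 100 times: one resource takes 80 clicks
# but is watched to only 5%; another takes 5 clicks and is watched to 90%.
vspr = set_play_residual([(80, 0.05), (5, 0.9)], sqv=100)
print(round(vspr, 4))
```

Because of the maximum, a single heavily clicked but barely watched resource dominates the result, which is the "worst performance of a single multimedia resource" behavior the description attributes to formula (4).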
Step S140, identifying false search behavior according to the identification data.
For example, the identification of spurious search behavior by a search engine may be done from the identification data and using classical Decision Tree (Decision Tree) algorithms.
First, a decision tree model is trained using a training set (which may include training data) to obtain a decision tree initial model of false search behavior, wherein the training set is an initial data set of whether the search behavior of each given query term given by manual annotation is false search behavior, and the manual annotation is based on a small number of explicitly identified query terms.
In particular, a decision tree is a tree structure similar to a flowchart, in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or class distribution. The topmost node of the tree is the root node. The decision tree algorithm is suited to high-quality classification with a small number of attributes (features). Its core problem is choosing the attribute to test at each node of the tree, aiming for the attribute that most helps classify the instances. To solve this problem, the ID3 algorithm introduces the concept of information gain and uses the magnitude of the information gain to decide the nodes at each level of the decision tree, i.e., the attributes important for classification. To define information gain precisely, the ID3 algorithm uses the information-theoretic concept of entropy to describe the purity of an arbitrary sample set. Given a sample set S containing positive and negative examples of a target concept, the entropy of S with respect to this Boolean classification is:
$\mathrm{Entropy}(S) = -P_{+}\log_2 P_{+} - P_{-}\log_2 P_{-}$    formula (5)
where $P_{+}$ denotes the proportion of positive examples in S, $P_{-}$ denotes the proportion of negative examples in S, and $0\log 0$ is defined as 0. Using entropy, the ID3 algorithm defines the information gain. The information gain of attribute A relative to sample set S is defined by the following formula (6):
$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$    formula (6)
where V(A) is the value range of attribute A, S is the sample set, and $S_v$ is the subset of S whose value on attribute A equals v.
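The entropy and information gain described above can be evaluated directly. The sketch below assumes the standard ID3 definition of information gain (entropy of S minus the size-weighted entropy of each subset S_v) and uses a hypothetical four-sample set with a single made-up Boolean attribute `low_dtra`; none of these names or values come from the source.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a label list; classes with zero count contribute nothing,
    which matches the convention that 0*log(0) is defined as 0."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(samples, labels, attribute):
    """Gain(S, A): entropy of S minus the weighted entropy of each S_v.

    samples: list of dicts mapping attribute name -> discrete value.
    labels:  class labels, parallel to samples.
    """
    subsets = {}
    for sample, label in zip(samples, labels):
        subsets.setdefault(sample[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy set: the attribute separates the two classes perfectly.
samples = [{"low_dtra": True}, {"low_dtra": True},
           {"low_dtra": False}, {"low_dtra": False}]
labels = ["false", "false", "normal", "normal"]
gain = information_gain(samples, labels, "low_dtra")
print(entropy(labels), gain)
```

In this toy set the classes are evenly split, so the entropy is 1 bit, and because the attribute separates them completely the gain equals the full entropy.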
The flow of the ID3 algorithm is as follows:
Input: a sample set S and an attribute set A;
Output: an ID3 decision tree.
1) If all attributes have been processed, return; otherwise, execute 2);
2) find the attribute a with the largest information gain Gain(S, a) and take it as a node; if the samples can be classified by attribute a alone, return; otherwise, execute 3);
3) for each possible value v of attribute a: i. take all samples whose value on attribute a is v as the subset S_v of the sample set S; ii. generate the attribute set AT = A - {a}; iii. recursively execute the ID3 algorithm on the subset S_v and the attribute set AT.
An initial model of a decision tree of false search behavior may be obtained based on the identification data determined from the training data, the labeling results of the training set, and the ID3 algorithm.
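The recursive flow above can be sketched as follows. This is an illustrative implementation of textbook ID3 on discrete attributes, not the applicant's code; it assumes the continuous identification data have already been discretized into Boolean attributes (the names `dtra_low` and `vcr_low`, the labels, and the four training rows are all hypothetical).

```python
import math
from collections import Counter


def _entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def _gain(rows, labels, attr):
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * _entropy(g) for g in groups.values())
    return _entropy(labels) - remainder


def id3(rows, labels, attributes):
    """Return a class label (leaf) or {"attr": name, "children": {value: subtree}}."""
    if len(set(labels)) == 1:            # samples already classified: leaf
        return labels[0]
    if not attributes:                   # step 1: no attributes left, use majority
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the attribute with the largest information gain.
    best = max(attributes, key=lambda a: _gain(rows, labels, a))
    node = {"attr": best, "children": {}}
    remaining = [a for a in attributes if a != best]   # AT = A - {a}
    # Step 3: recurse on the subset S_v for each observed value v.
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = id3([rows[i] for i in idx],
                                      [labels[i] for i in idx], remaining)
    return node


def classify(tree, row):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree


# Hypothetical, manually labeled training rows.
rows = [{"dtra_low": True, "vcr_low": True},
        {"dtra_low": True, "vcr_low": False},
        {"dtra_low": False, "vcr_low": True},
        {"dtra_low": False, "vcr_low": False}]
labels = ["false", "normal", "normal", "normal"]
tree = id3(rows, labels, ["dtra_low", "vcr_low"])
print(classify(tree, {"dtra_low": True, "vcr_low": True}))
```

The sketch only handles attribute values seen during training; a value absent from the training subset would raise a `KeyError` in `classify`, which a production implementation would need to handle.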
Secondly, after the decision tree initial model of false search behavior is obtained, it may need to be optimized. The reason is that a decision tree initially generated by the ID3 algorithm tends to overfit: the error rate when the initial model is applied to the training data (i.e., when it is used to identify whether the search behavior in the training data contains false search behavior) is low, but the error rate when it is applied to test data may be high. In other words, the accuracy of identifying false search behavior directly with the decision tree initial model may be low.
For example, pruning strategies may be used to optimize the decision tree initial model described above. More specifically, the following two pruning strategies can be used:
Pre-pruning, i.e., stopping early during construction of the decision tree. However, this strategy may set the node-splitting conditions so strictly that the resulting decision tree is very short and cannot be optimal. Pre-pruning may therefore have difficulty producing a good judgment result.
Post-pruning, i.e., pruning after construction of the decision tree is complete. For example, either of the following two methods can be adopted: replacing a whole subtree with a single leaf node whose class is the dominant class in that subtree; or replacing one subtree entirely with another.
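The first post-pruning method (replacing a subtree with a leaf of its dominant class) can be sketched as below. The text does not say how to decide when a replacement is acceptable, so this sketch assumes a reduced-error criterion: a subtree is replaced when doing so does not increase the error count on a held-out labeled set. The `majority` field, the tree, and the validation rows are hypothetical.

```python
def classify(tree, row):
    """Follow the tree to a leaf; fall back to the node's majority class
    when the row's attribute value has no branch."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["majority"])
    return tree


def errors(tree, rows, labels):
    """Number of rows the (sub)tree misclassifies."""
    return sum(classify(tree, r) != y for r, y in zip(rows, labels))


def prune(tree, rows, labels):
    """Bottom-up post-pruning with an ASSUMED reduced-error criterion:
    replace a subtree by its majority-class leaf when that leaf makes no
    more errors on the validation rows than the subtree does."""
    if not isinstance(tree, dict):
        return tree
    for value, child in list(tree["children"].items()):
        idx = [i for i, r in enumerate(rows) if r[tree["attr"]] == value]
        tree["children"][value] = prune(child, [rows[i] for i in idx],
                                        [labels[i] for i in idx])
    leaf = tree["majority"]
    return leaf if errors(leaf, rows, labels) <= errors(tree, rows, labels) else tree


# Hypothetical unpruned tree; "majority" is the dominant class under each node.
tree = {"attr": "dtra_low", "majority": "normal", "children": {
    True: {"attr": "vcr_low", "majority": "false",
           "children": {True: "false", False: "normal"}},
    False: {"attr": "app_low", "majority": "normal",
            "children": {True: "normal", False: "normal"}}}}
val_rows = [{"dtra_low": True, "vcr_low": True, "app_low": False},
            {"dtra_low": True, "vcr_low": False, "app_low": False},
            {"dtra_low": False, "vcr_low": True, "app_low": True},
            {"dtra_low": False, "vcr_low": False, "app_low": False}]
val_labels = ["false", "normal", "normal", "normal"]
pruned = prune(tree, val_rows, val_labels)
print(pruned["children"][False])
```

Here the `app_low` subtree predicts the same class on every branch, so it collapses into a single leaf, while the `vcr_low` subtree is kept because replacing it would misclassify one validation row.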
Specifically, the present invention may predict the known search behavior by using the decision tree initial model, that is, may identify whether the known search behavior includes a false search behavior by using the decision tree initial model, so as to determine and optimize the accuracy of the decision tree initial model.
Finally, after the decision tree initial model is optimized, the false search behavior of the search engine can be identified by using the optimized decision tree model and the identification data. For example, suppose the optimized decision tree model determines that the search behavior of the current query term is identified as false search behavior if the identification data is below a predetermined threshold.
It should be noted that the embodiment of the present invention only takes the decision tree classification algorithm as an example to illustrate how to identify the false search behavior according to the identification data, and those skilled in the art should understand that the present invention is not focused on what kind of classification algorithm is specifically used, and the classification algorithm that can be used by the present invention is not limited to the decision tree classification algorithm, for example, other classification algorithms such as bayesian inference and the like can also be used to identify the false search behavior according to the identification data.
According to the method for identifying false search behavior of a search engine of this embodiment, identification data for identifying false search behavior is determined from the user viewing behavior data and/or the user conversion behavior data of a single query term acquired from the user log, and false search behavior is identified according to the determined identification data. The accuracy of identifying false search behavior can thereby be improved, and false search behavior can be identified automatically across the full set of query terms.
Example 2
Fig. 2 shows a flowchart of a method for identifying false search behavior of a search engine according to a second embodiment of the present invention. The steps in fig. 2, which are numbered the same as those in fig. 1, have the same functions, and detailed descriptions of the steps are omitted for the sake of brevity.
As shown in fig. 2, the method for identifying the false search behavior of the search engine shown in fig. 2 is mainly different from the method for identifying the false search behavior of the search engine shown in fig. 1 in that, in addition to including step S100 and step S120 in the first embodiment, in the case that the user conversion behavior data includes a direct region conversion rate and the identification data includes a multimedia resource click divergence, step S140 may specifically include:
step S200, judging whether the conversion rate of the direct region of the current query word is less than a first threshold value;
step S220, under the condition that the conversion rate of the direct region of the current query word is smaller than a first threshold value, whether the click divergence of the multimedia resource of the current query word is smaller than a second threshold value can be judged; and
step S240, in the case that the multimedia resource click divergence of the current query word is smaller than the second threshold, the search behavior of the current query word may be identified as a false search behavior.
For example, first, the user viewing behavior data may be obtained from the multimedia resource play log in the user log, and the user conversion behavior data may be obtained from the query term click log in the user log. Specifically, Table 2 below shows an example of the multimedia resource play log in the user log of a certain search engine on a certain day; the play log contained 2329980 records.
TABLE 2 Multimedia resource play log in the user log of a certain search engine on a certain day
query  vids       percs
C1     235949485  0.1338
C2     209907159  0.0442
C2     213535395  0.0587
C2     217417432  0.0980
C2     217417432  0.1960
As can be seen from table 2, the query word C2 has 4 play actions in total, but there are two play actions on the multimedia asset 217417432, so the independent multimedia asset play amount IVC is 3.
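The count described above can be reproduced from the rows of Table 2. The sketch below assumes each log row is one play action recorded as a (query, vid, perc) triple; the function name is hypothetical.

```python
from collections import defaultdict

# Rows copied from Table 2: (query, vid, play completion ratio).
play_log = [
    ("C1", "235949485", 0.1338),
    ("C2", "209907159", 0.0442),
    ("C2", "213535395", 0.0587),
    ("C2", "217417432", 0.0980),
    ("C2", "217417432", 0.1960),
]


def independent_play_counts(log):
    """IVC per query term: the number of distinct resources played."""
    clicked = defaultdict(set)
    for query, vid, _perc in log:
        clicked[query].add(vid)      # repeated plays of one vid count once
    return {query: len(vids) for query, vids in clicked.items()}


ivc = independent_play_counts(play_log)
print(ivc["C2"])
```

Using a set per query term means the two plays of resource 217417432 contribute a single entry, so C2 has four play actions but three independent resources.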
In addition, Table 3 below shows an example of the query term click log in the user log of the same search engine on the same day; the click log contained 185966 valid records.
TABLE 3 Query term click log in the user log of a certain search engine on the same day
query  sqv   Dhit    Dtra    Uhit    Utra    Wtra
B1     1793  0.4822  0.6599  0.1422  0.2811  0.9426
B2     2491  0.3760  0.7001  0.3308  0.8210  1.5303
B3     3511  0.3896  0.4475  0.0615  0.0880  0.5377
The raw data fields for the conversion and playback of a single query term can be obtained by joining Tables 2 and 3 above with query as the primary key.
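A minimal sketch of joining the two logs on the query key is given below. The click statistics for B1 are copied from Table 3; because the Table 2 excerpt contains no play rows for B1, the two play records used here are hypothetical, as are the function name and the in-memory data layout.

```python
from collections import defaultdict


def merge_logs(play_log, click_log):
    """Join play and click logs on the query term.

    play_log:  iterable of (query, vid, perc) play records.
    click_log: dict mapping query -> {"sqv": ..., "Dhit": ..., ...}.
    Returns one record per query with the nine fields of Table 1.
    """
    plays = defaultdict(lambda: {"vids": [], "percs": []})
    for query, vid, perc in play_log:
        plays[query]["vids"].append(vid)
        plays[query]["percs"].append(perc)
    merged = {}
    for query, stats in click_log.items():
        viewing = plays.get(query, {"vids": [], "percs": []})
        merged[query] = {"query": query, **viewing, **stats}
    return merged


click_log = {"B1": {"sqv": 1793, "Dhit": 0.4822, "Dtra": 0.6599,
                    "Uhit": 0.1422, "Utra": 0.2811, "Wtra": 0.9426}}
play_log = [("B1", "vid_1", 0.50), ("B1", "vid_2", 0.75)]  # hypothetical plays
records = merge_logs(play_log, click_log)
print(list(records["B1"].keys()))
```

Keeping vids and percs as parallel lists preserves the mapping from each clicked resource to its play completion ratio, which the later identification data rely on.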
Secondly, 518 query terms can be randomly selected as initial data of manual labeling, and whether false search behaviors exist in the search behaviors of each selected query term is judged. The results show that 66 of the 518 selected query terms are labeled as false search behavior and the remaining 452 query terms are labeled as normal search behavior, and table 4 below shows an example of a manually labeled false search behavior for a search engine.
TABLE 4 Example of manually labeled false search behavior for a search engine
query         A1   A2   A3   A4  A5  B1
False search  Yes  Yes  Yes  No  No  No
As can be seen from Table 4, the query terms A1 through A3 of the search engine are manually labeled as false search behavior, i.e., the search behavior of query terms A1 through A3 contains false search behavior, while the query terms A4, A5, and B1 are manually labeled as not being false search behavior, i.e., their search behavior is normal search behavior.
Then, the input data of the decision tree algorithm may be obtained by performing feature extraction on the above-mentioned manually labeled query term, and table 5 below shows an example of the input data of the decision tree algorithm of a certain search engine. As shown in table 5, the input data may include the user conversion behavior data and the identification data in the first embodiment.
TABLE 5 example of input data for decision Tree Algorithm for a search Engine
Figure BDA0000893080410000221
Finally, a decision tree model for identifying false search behavior as shown in fig. 3 can be obtained according to a decision tree algorithm and a model optimization strategy. As can be seen from fig. 3, it can be determined through the decision tree model that the first threshold is 0.26 and the second threshold is 0.14, that is, the search behavior of the query word with the direct region conversion Dtra less than 0.26 and the multimedia resource click divergence VCR less than 0.14 can be identified as the false search behavior. Moreover, mapping the decision tree model shown in fig. 3 to the website experience of the user can be understood that if a certain query term does not have a direct region or has no actual effect, and simultaneously, multimedia clicks of the user in the UGC region are too concentrated, the query term is a false search behavior. This is in accordance with the natural understanding of those skilled in the art.
For example, since the direct-region conversion Dtra of the query word a1 is 0 and the multimedia resource click divergence VCR of the query word a1 is 0.0072 in the example shown in table 5 above, i.e., the direct-region conversion Dtra of the query word a1 is less than the first threshold 0.26 and the multimedia resource click divergence VCR of the query word a1 is less than the second threshold 0.14, the search behavior of the query word a1 may be identified as false search behavior.
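Steps S200 to S240 with the thresholds read from fig. 3 can be sketched as a simple check. The function name is hypothetical, and the function only reports whether these particular steps flag a query term; it makes no statement about query terms decided by other branches of the decision tree.

```python
FIRST_THRESHOLD = 0.26   # direct region conversion rate Dtra, per fig. 3
SECOND_THRESHOLD = 0.14  # multimedia resource click divergence VCR, per fig. 3


def flagged_by_dtra_and_vcr(dtra, vcr,
                            first=FIRST_THRESHOLD, second=SECOND_THRESHOLD):
    """Return True when steps S200-S240 identify the query term as false search."""
    if dtra >= first:        # S200: direct region conversion rate not below threshold
        return False         # not flagged by this branch
    return vcr < second      # S220/S240: clicks too concentrated


# Query term A1 from the example: Dtra = 0, VCR = 0.0072.
print(flagged_by_dtra_and_vcr(dtra=0.0, vcr=0.0072))
```

Passing the thresholds as parameters reflects that they are learned from the labeled data of one search engine and would differ for another data set.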
According to the method for identifying false search behavior of a search engine of this embodiment, identification data for identifying false search behavior is determined from the user viewing behavior data and/or the user conversion behavior data of a single query term acquired from the user log. In the case where the user conversion behavior data includes the direct region conversion rate and the identification data includes the multimedia resource click divergence, the search behavior of a query term whose direct region conversion rate is less than the first threshold and whose multimedia resource click divergence is less than the second threshold can be identified as false search behavior. The accuracy of identifying false search behavior can thereby be improved, and false search behavior can be identified automatically across the full set of query terms.
Example 3
Fig. 4 shows a flowchart of a method for identifying false search behavior of a search engine according to a third embodiment of the present invention. The steps in fig. 4, which are numbered the same as those in fig. 1, have the same functions, and detailed descriptions of the steps are omitted for the sake of brevity.
As shown in fig. 4, the main difference between the identification method of the false search behavior of the search engine shown in fig. 4 and the identification method of the false search behavior of the search engine shown in fig. 1 is that, in addition to including step S100 and step S120 in the first embodiment, in the case that the user conversion behavior data includes the direct region conversion rate and the identification data includes the average playing completion ratio of the multimedia resource, step S140 may specifically include:
step S300, whether the conversion rate of the direct region of the current query word is smaller than a first threshold value or not can be judged;
step S320, under the condition that the direct region conversion rate of the current query term is not less than the first threshold, determining whether the average playing completion ratio of the multimedia resource of the current query term is less than a third threshold; and
step S340, in the case that the average playing completion ratio of the multimedia resource of the current query term is smaller than the third threshold, the search behavior of the current query term may be identified as a false search behavior.
For an example of this embodiment, refer to the description of the second embodiment above. The example of the third embodiment differs from that of the second embodiment in that the decision tree model shown in fig. 3 gives a first threshold of 0.26 and a third threshold of 0.25; that is, the search behavior of a query term whose direct region conversion rate Dtra is greater than or equal to 0.26 and whose multimedia resource average playing completion ratio APP is less than 0.25 can be identified as false search behavior. Mapping the decision tree model shown in fig. 3 to the user's website experience, if the direct region of a query term already has a high conversion effect (i.e., it is a good-quality direct region) but the users' viewing completion (i.e., the multimedia resource average playing completion ratio) is low, the search behavior of the query term is false search behavior. In other words, if the direct region of a query term yields many conversions (derived plays) but the play completion is low, the search behavior of the query term is false search behavior. This is consistent with the natural understanding of those skilled in the art.
For example, since the direct region conversion rate Dtra of the query word B1 is 0.66 and the multimedia resource average playing completion ratio APP of the query word B1 is 0.6816 in the example shown in table 5 above, that is, the direct region conversion rate Dtra of the query word B1 is not less than the first threshold value 0.26 and the multimedia resource average playing completion ratio APP of the query word B1 is not less than the third threshold value 0.25, the search behavior of the query word B1 may be identified as normal search behavior, that is, the search behavior of the query word B1 may be identified as not false search behavior.
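Steps S300 to S340 can be sketched in the same way, using the first and third thresholds read from fig. 3. The function name is hypothetical, and the second call uses made-up values purely to show the flagged case.

```python
FIRST_THRESHOLD = 0.26  # direct region conversion rate Dtra, per fig. 3
THIRD_THRESHOLD = 0.25  # multimedia resource average playing completion ratio APP, per fig. 3


def flagged_by_dtra_and_app(dtra, app,
                            first=FIRST_THRESHOLD, third=THIRD_THRESHOLD):
    """Return True when steps S300-S340 identify the query term as false search."""
    if dtra < first:         # S300: low-conversion branch is handled elsewhere
        return False         # not flagged by this branch
    return app < third       # S320/S340: many conversions but little actual viewing


# Query term B1 from the example: Dtra = 0.66, APP = 0.6816.
print(flagged_by_dtra_and_app(dtra=0.66, app=0.6816))
# Hypothetical query term with the same Dtra but very low completion.
print(flagged_by_dtra_and_app(dtra=0.66, app=0.10))
```

B1 clears both thresholds and is therefore not flagged, matching the conclusion in the example that its search behavior is normal.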
According to the method for identifying false search behavior of a search engine of this embodiment, identification data for identifying false search behavior is determined from the user viewing behavior data and/or the user conversion behavior data of a single query term acquired from the user log. In the case where the user conversion behavior data includes the direct region conversion rate and the identification data includes the multimedia resource average playing completion ratio, the search behavior of a query term whose direct region conversion rate is not less than the first threshold and whose multimedia resource average playing completion ratio is less than the third threshold can be identified as false search behavior. The accuracy of identifying false search behavior can thereby be improved, and false search behavior can be identified automatically across the full set of query terms.
Example 4
Fig. 5 is a block diagram showing the structure of an apparatus for identifying false search behavior of a search engine according to a fourth embodiment of the present invention. The device 500 for identifying false search behavior of a search engine provided by the embodiment is used for implementing the method for identifying false search behavior of a search engine provided by the embodiment shown in fig. 1. As shown in fig. 5, the identifying means 500 of false search behavior of the search engine may include:
An obtaining unit 510, which may be configured to obtain, from the user log, user viewing behavior data of a single query term and user conversion behavior data of the single query term. The user viewing behavior data of the single query term includes: a query term, a clicked multimedia resource set, a multimedia resource playing completion ratio set, and a mapping function from the clicked multimedia resource set to the multimedia resource playing completion ratio set. The user conversion behavior data of the single query term includes the query term and further includes at least one of a query amount, a direct region hit rate, a direct region conversion rate, a user-generated content (UGC) region hit rate, a UGC region conversion rate, and an overall conversion rate. For details, reference may be made to the related description of step S100 in the first embodiment.
A determining unit 530, which may be configured to determine, according to the user viewing behavior data and/or the user conversion behavior data, identification data for identifying the false search behavior, where the identification data may include at least one of an independent multimedia resource playing amount, a multimedia resource average playing completion ratio, a multimedia resource click divergence, and a multimedia resource set playing residual. For details, reference may be made to the related description of step S120 in the first embodiment.
In a possible implementation manner, the determining unit 530 is specifically configured to, in a case that the identification data includes the independent multimedia resource playing amount, determine the independent multimedia resource playing amount according to a clicked multimedia resource set in the user viewing behavior data.
The independent multimedia resource playing amount (IVC for short) describes the breadth of the multimedia resources clicked under a single query word. The more distinct multimedia resources are clicked under the query word, the larger the independent multimedia resource playing amount is; conversely, the fewer distinct multimedia resources are clicked under the query word, the smaller the independent multimedia resource playing amount is. For example, the distinct clicked multimedia resources can be determined from the query word query and the clicked multimedia resource set videos in the user viewing behavior data. Therefore, the independent multimedia resource playing amount can be determined from the query word query and the clicked multimedia resource set videos using the following formula (1):
IVC(query) = the number of distinct multimedia resources clicked    formula (1)
Generally speaking, for query words with normal search results and normal search behavior, the independent multimedia resource playing amount is not small, which is consistent with the diversity of user needs and the randomness of click behavior. However, if a single query word involves false search behavior, the independent multimedia resource playing amount is generally not large, since the users may not click any multimedia resource at all, or their clicks may be confined to one particular multimedia resource. It should be noted that when the returned results contain a direct region multimedia resource, the independent multimedia resource playing amount may also be smaller.
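Formula (1) is a distinct count, which can be sketched in Python as below. The function name and the representation of the clicked multimedia resource set as a list of resource IDs (one entry per click) are assumptions made for illustration only.

```python
def independent_play_amount(clicked_videos):
    """IVC(query): number of distinct multimedia resources clicked for one query word.

    clicked_videos: iterable of resource IDs, one entry per click (hypothetical layout).
    """
    return len(set(clicked_videos))


# Five clicks spread over three distinct resources.
clicks = ["v1", "v2", "v1", "v3", "v1"]
print(independent_play_amount(clicks))  # 3
```

A query word whose clicks all land on one resource, or that has no clicks at all, yields an IVC of 1 or 0 in this sketch, matching the low values described above for false search behavior.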
In a possible implementation manner, the determining unit 530 may specifically be configured to perform at least one of the following steps:
under the condition that the identification data comprises the average playing completion ratio of the multimedia resources, the formula (2) can be adopted according to the playing completion ratio set in the user watching behavior data and the determined independent multimedia resource playing amount
Figure BDA0000893080410000261
Determining the average playing completion ratio of the multimedia resources, wherein query is the current query term, APP(query) is the average playing completion ratio of the multimedia resources of the current query term, IVC(query) is the independent multimedia resource playing amount of the current query term, n_i is the number of times the i-th independent multimedia resource of the current query term is played, and perc_i is the playing completion ratio of the i-th independent multimedia resource of the current query term;
in the case where the user conversion behavior includes a query amount and the identification data includes a multimedia resource click divergence, formula (3) may be applied based on the query amount and the determined independent multimedia resource play amount
Figure BDA0000893080410000262
Determining the click divergence of the multimedia resource, wherein VCR (query) is the click divergence of the multimedia resource of the current query word, and sqv is the query quantity;
in the case where the user conversion behavior includes a query amount and the identification data includes a multimedia resource set playing residual, formula (4) may be adopted based on the playing completion ratio set in the user viewing behavior data and the query amount
Figure BDA0000893080410000263
And determining the multimedia resource set playing residual, wherein VSPR (query) is the multimedia resource set playing residual of the current query word, and max () takes the maximum value.
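Formulas (2) to (4) appear here only as figure references. A reading that is consistent with the symbol definitions above is given below; the exact forms are an assumption inferred from those definitions (a play-count-weighted mean for APP, a ratio of IVC to sqv for VCR, and a maximum over per-resource terms for VSPR), not a transcription of the figures.

```latex
\begin{align}
\mathrm{APP}(query)  &= \frac{\sum_{i=1}^{\mathrm{IVC}(query)} n_i \cdot perc_i}
                             {\sum_{i=1}^{\mathrm{IVC}(query)} n_i}
                        && \text{assumed form of (2)} \\
\mathrm{VCR}(query)  &= \frac{\mathrm{IVC}(query)}{sqv}
                        && \text{assumed form of (3)} \\
\mathrm{VSPR}(query) &= \max_{1 \le i \le \mathrm{IVC}(query)}
                        \left( \frac{n_i}{sqv} \cdot \left(1 - perc_i\right) \right)
                        && \text{assumed form of (4)}
\end{align}
```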
The multimedia resource average playing completion ratio (Average Playing Percentage, APP for short) describes the average degree to which the multimedia resources in the search result set of a single query term are played to completion. The larger the average playing completion ratio of the multimedia resources is, the more completely the multimedia resources under the query term are viewed; conversely, the smaller the average playing completion ratio of the multimedia resources is, the less completely the multimedia resources under the query term are viewed. As described above, the average playing completion ratio of the multimedia resources can be determined using the above formula (2).
Generally, the average playing completion ratio APP of a single query term over its whole multimedia resource search result set is not very low unless the completion ratio of every play is very low. For example, the mean of the average playing completion ratio over all single query terms of the search engine is about 44%. If the average playing completion ratio of a single query term is low, that query term is likely to involve false search behavior.
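As a rough illustration, APP can be computed as a play-count-weighted mean of the per-resource completion ratios. The weighting by n_i is an assumption drawn from the symbol definitions of formula (2), and the function name, the input layout as (n_i, perc_i) pairs, and the value returned for a query word with no plays are all hypothetical choices of this sketch.

```python
def average_play_completion(plays):
    """APP(query) as a play-count-weighted mean of completion ratios.

    plays: iterable of (n_i, perc_i) pairs, one per distinct clicked resource,
           where n_i is the play count and perc_i is in [0, 1].
    """
    plays = list(plays)
    total_plays = sum(n for n, _ in plays)
    if total_plays == 0:
        # No plays recorded; this sketch treats the ratio as 0.
        return 0.0
    return sum(n * perc for n, perc in plays) / total_plays


# One resource played 3 times at 90% completion, another once at 50%.
print(average_play_completion([(3, 0.9), (1, 0.5)]))  # roughly 0.8
```

Under this reading, a single play with a very low completion ratio has little effect when other resources are played often and watched through, which matches the observation that APP is rarely very low for normal query terms.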
The multimedia resource click divergence (VCR for short) describes how widely the clicks under a query word are spread over the multimedia resources on the multimedia resource search result page. Relative to the query amount, the more independent multimedia resources are clicked, the larger the multimedia resource click divergence is; conversely, the fewer independent multimedia resources are clicked, the smaller the multimedia resource click divergence is. As described above, the multimedia resource click divergence can be determined using the above formula (3).
Generally, the multimedia resource click divergence varies with the exposure and conversion of the direct region. If the query word is a time-sensitive word (a query word that attracts more than a certain level of user attention within a specific time period), the query amount sqv within that period (for example, the current day) is large while the clicks are concentrated on the multimedia resources of the topic in question. It follows from the above formula (3) that the multimedia resource click divergence is not high, that is, the click behavior of the users is concentrated on the new multimedia resources.
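The effect described for time-sensitive words can be illustrated with a short Python sketch. It assumes that formula (3) is the ratio of the independent multimedia resource playing amount to the query amount sqv, which is inferred from the description rather than confirmed by it; the function name and the sample numbers are hypothetical.

```python
def click_divergence(clicked_videos, sqv):
    """VCR(query), assumed here to be IVC(query) / sqv.

    clicked_videos: iterable of clicked resource IDs, one entry per click.
    sqv: query amount of the query word in the period considered.
    """
    if sqv <= 0:
        raise ValueError("sqv must be positive")
    return len(set(clicked_videos)) / sqv


# Time-sensitive word: 1000 queries, clicks concentrated on 2 resources.
print(click_divergence(["v1"] * 600 + ["v2"] * 300, 1000))  # 0.002

# Ordinary word: 100 queries, clicks spread over 40 distinct resources.
print(click_divergence([f"v{i}" for i in range(40)], 100))  # 0.4
```

The first case shows why a low click divergence alone does not prove false search behavior: a legitimate trending query also concentrates clicks on few resources.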
The multimedia resource set playing residual (Video Set Playing Residual, VSPR for short) describes the extent to which the multimedia resources on the search result page of a query word are left unplayed. Each multimedia resource has a certain click ratio, that is, the ratio of the number of times the multimedia resource is clicked to the number of times it is searched. For example, assuming that the multimedia resource B1 is searched 100 times and clicked 20 times, the click ratio of the multimedia resource B1 is 20/100 = 0.2. In terms of the above formula (4), the click ratio of the multimedia resource is calculated as
Figure BDA0000893080410000282
and the multimedia resource also has a certain playing completion ratio perc_i. The click ratio and the playing completion ratio perc_i together determine how completely the whole multimedia resource set of the query word is played. The larger the click ratio of a multimedia resource is and the smaller its playing completion ratio perc_i is (that is, the larger
Figure BDA0000893080410000285
is), the less completely the multimedia resource set is played and the larger the multimedia resource set playing residual VSPR(query) is. In other words, the multimedia resource set playing residual VSPR(query) can be determined using the above formula (4).
According to the above formula (4), the multimedia resource set playing residual is based on the worst-performing single multimedia resource: the larger the click ratio of a multimedia resource is and the lower its playing completion ratio perc_i is, the larger the multimedia resource set playing residual VSPR(query) is. It should be noted that the present invention does not limit the manner of determining the multimedia resource set playing residual VSPR(query). Based on the disclosure of the present application and common technical knowledge, those skilled in the art will understand that VSPR(query) may also be determined in other ways, for example according to the overall performance of the multimedia resource set.
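A worst-case reading of the playing residual can be sketched in Python as follows. The per-resource term (n_i / sqv) * (1 - perc_i) is an assumption chosen to match the description (larger click ratio and lower completion ratio give a larger residual, and max() picks the worst single resource); the function name and input layout are hypothetical.

```python
def set_play_residual(plays, sqv):
    """VSPR(query), assumed to be max over i of (n_i / sqv) * (1 - perc_i).

    plays: iterable of (n_i, perc_i) pairs, one per distinct clicked resource.
    sqv: query amount of the query word.
    """
    if sqv <= 0:
        raise ValueError("sqv must be positive")
    # default=0.0 covers a query word with no clicked resources.
    return max(((n / sqv) * (1.0 - perc) for n, perc in plays), default=0.0)


# Resource 2 is clicked in half of the queries but only 10% is watched,
# so it dominates the residual.
plays = [(20, 0.9), (50, 0.1), (5, 0.0)]
print(set_play_residual(plays, 100))  # roughly 0.45
```

Replacing max with a click-weighted sum would give one possible measure of the overall performance of the set, which is the kind of alternative the description leaves open.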
A processing unit 550 may be configured to identify a false search behavior based on the identification data. Specifically, reference may be made to the related description of step S140 in the first embodiment.
For example, the processing unit 550 may perform the identification of false search behavior of the search engine based on the identification data and using a classical Decision Tree (Decision Tree) algorithm.
According to the device for identifying false search behavior of a search engine in this embodiment, the determining unit determines the identification data for identifying the false search behavior according to the user viewing behavior data of a single query word and/or the user conversion behavior data of the single query word acquired from the user log, and the processing unit identifies the false search behavior according to the identification data determined by the determining unit. The accuracy of identifying false search behavior can thereby be improved, and false search behavior can be identified automatically across the full set of query words.
Example 5
Fig. 6 is a block diagram showing the structure of an apparatus for identifying false search behavior of a search engine according to a fifth embodiment of the present invention. The device 600 for identifying false search behavior of a search engine provided by this embodiment is used for implementing the method for identifying false search behavior of a search engine provided by the embodiment shown in fig. 2. Components in fig. 6 that have the same reference numerals as in fig. 5, namely the obtaining unit 510, the determining unit 530, and the processing unit 550, have substantially the same functions as described above, and their detailed description is omitted for brevity.
Furthermore, as can be seen from comparing fig. 5 and fig. 6, the main difference between the embodiment shown in fig. 6 and the embodiment shown in fig. 5 is that, on the basis of the embodiment shown in fig. 5, in the case that the user conversion behavior data includes a direct region conversion rate and the identification data includes a multimedia resource click divergence, the processing unit 550 may specifically include:
a first judging unit 651, configured to judge whether the direct region conversion rate of the current query term is smaller than a first threshold;
a second judging unit 653, connected to the first judging unit 651 and configured to judge whether the multimedia resource click divergence of the current query term is smaller than a second threshold when the first judging unit 651 judges that the direct region conversion rate of the current query term is smaller than the first threshold; and
an identifying unit 655, connected to the second judging unit 653 and configured to identify the search behavior of the current query term as false search behavior when the second judging unit 653 judges that the multimedia resource click divergence of the current query term is smaller than the second threshold.
For a specific example of this embodiment, reference may be made to the description of the second embodiment above.
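The chain formed by units 651, 653, and 655 can be sketched in Python as below. The function name is hypothetical, the thresholds are passed in by the caller because their values are not fixed here, and returning False when the direct region conversion rate is not below the first threshold is an assumption, since this embodiment does not decide that branch.

```python
def is_false_search_by_vcr(dtra: float, vcr: float,
                           first_threshold: float,
                           second_threshold: float) -> bool:
    """Return True when the rule flags the query term as false search behavior."""
    # Role of unit 651: is the direct region conversion rate below the first threshold?
    if dtra >= first_threshold:
        return False  # branch not covered by this rule (assumption of the sketch)
    # Roles of units 653 and 655: low click divergence on this branch means false search.
    return vcr < second_threshold


# Illustrative thresholds only; 0.05 is a made-up second threshold.
print(is_false_search_by_vcr(0.10, 0.002, 0.26, 0.05))  # True
print(is_false_search_by_vcr(0.10, 0.40, 0.26, 0.05))   # False
```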
According to the device for identifying false search behavior of a search engine in this embodiment, the determining unit determines the identification data for identifying the false search behavior according to the user viewing behavior data of a single query word and/or the user conversion behavior data of the single query word acquired from the user log. In the case where the user conversion behavior data includes the direct region conversion rate and the identification data includes the multimedia resource click divergence, the processing unit can identify as false search behavior the search behavior of a query word whose direct region conversion rate is smaller than the first threshold and whose multimedia resource click divergence is smaller than the second threshold. The accuracy of identifying false search behavior can thereby be improved, and false search behavior can be identified automatically across the full set of query words.
Example 6
Fig. 7 is a block diagram showing the structure of an apparatus for identifying false search behavior of a search engine according to a sixth embodiment of the present invention. The device 700 for identifying false search behavior of a search engine provided by this embodiment is used for implementing the method for identifying false search behavior of a search engine provided by the embodiment shown in fig. 3. Components in fig. 7 that have the same reference numerals as in fig. 5, namely the obtaining unit 510, the determining unit 530, and the processing unit 550, have substantially the same functions as described above, and their detailed description is omitted for brevity.
Furthermore, as can be seen from comparing fig. 5 and fig. 7, the main difference between the embodiment shown in fig. 7 and the embodiment shown in fig. 5 is that, on the basis of the embodiment shown in fig. 5, in the case that the user conversion behavior data includes a direct region conversion rate and the identification data includes a multimedia resource average playing completion ratio, the processing unit 550 may specifically include:
a first judging unit 751, configured to judge whether the direct region conversion rate of the current query term is smaller than a first threshold;
a second judging unit 753, connected to the first judging unit 751 and configured to judge whether the multimedia resource average playing completion ratio of the current query term is smaller than a third threshold when the first judging unit 751 judges that the direct region conversion rate of the current query term is not smaller than the first threshold; and
an identifying unit 755, connected to the second judging unit 753 and configured to identify the search behavior of the current query term as false search behavior when the second judging unit 753 judges that the multimedia resource average playing completion ratio of the current query term is smaller than the third threshold.
For a specific example of this embodiment, reference may be made to the description of the third embodiment above.
According to the device for identifying false search behavior of a search engine in this embodiment, the determining unit determines the identification data for identifying the false search behavior according to the user viewing behavior data of a single query word and/or the user conversion behavior data of the single query word acquired from the user log. In the case where the user conversion behavior data includes the direct region conversion rate and the identification data includes the multimedia resource average playing completion ratio, the processing unit can identify as false search behavior the search behavior of a query word whose direct region conversion rate is not less than the first threshold and whose multimedia resource average playing completion ratio is less than the third threshold. The accuracy of identifying false search behavior can thereby be improved, and false search behavior can be identified automatically across the full set of query words.
The above description covers only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. A method for identifying false search behavior of a search engine, the search engine being used for searching multimedia resources, the method comprising:
acquiring user viewing behavior data of a single query term and user conversion behavior data of the single query term from a user log, wherein the user viewing behavior data of the single query term comprises: a query word, a clicked multimedia resource set, a multimedia resource playing completion ratio set, and a mapping function from the clicked multimedia resource set to the multimedia resource playing completion ratio set, and wherein the user conversion behavior data of the single query term comprises the query word and further comprises at least one of a query quantity, a direct region hit rate, a direct region conversion rate, a user-generated content (UGC) region hit rate, a UGC region conversion rate, and an overall conversion rate;
determining identification data for identifying the false search behavior according to the user viewing behavior data and/or the user conversion behavior data, wherein the identification data comprises at least one of independent multimedia resource playing amount, multimedia resource average playing completion ratio, multimedia resource click divergence and multimedia resource set playing residue; and
identifying the false search behavior according to the identification data;
wherein, relative to the query quantity, the more independent multimedia resources are clicked, the larger the multimedia resource click divergence is, and the fewer independent multimedia resources are clicked, the smaller the multimedia resource click divergence is; and
the larger the click ratio of a multimedia resource is and the smaller the playing completion ratio of the multimedia resource is, the larger the multimedia resource set playing residual is.
2. The method of claim 1, wherein in the case where the user conversion behavior data includes a direct region conversion rate and the identification data includes the multimedia resource click divergence, identifying the false search behavior according to the identification data comprises:
judging whether the conversion rate of the direct region of the current query word is smaller than a first threshold value;
under the condition that the conversion rate of the direct region of the current query word is smaller than the first threshold value, judging whether the click divergence of the multimedia resource of the current query word is smaller than a second threshold value; and
under the condition that the multimedia resource click divergence of the current query word is smaller than the second threshold value, identifying the search behavior of the current query word as the false search behavior.
3. The method of claim 1, wherein in the case where the user conversion behavior data includes a direct region conversion rate and the identification data includes the multimedia resource average playing completion ratio, identifying the false search behavior according to the identification data comprises:
judging whether the conversion rate of the direct region of the current query word is smaller than a first threshold value;
under the condition that the conversion rate of the direct region of the current query word is not less than the first threshold value, judging whether the average playing completion ratio of the multimedia resources of the current query word is less than a third threshold value; and
under the condition that the average playing completion ratio of the multimedia resources of the current query word is smaller than the third threshold value, identifying the search behavior of the current query word as the false search behavior.
4. The identification method according to any one of claims 1 to 3, wherein determining identification data for identifying the false search behavior according to the user viewing behavior data and/or the user conversion behavior data comprises: under the condition that the identification data comprises the independent multimedia resource playing amount, determining the independent multimedia resource playing amount according to the clicked multimedia resource set in the user viewing behavior data.
5. The identification method according to claim 4, wherein determining identification data for identifying the false search behavior according to the user viewing behavior data and/or the user conversion behavior data comprises at least one of:
under the condition that the identification data comprises the average playing completion ratio of the multimedia resources, adopting a formula according to the playing completion ratio set and the independent multimedia resource playing amount in the user watching behavior data
Figure FDA0002125136140000021
Determining the average playing completion ratio of the multimedia resources, wherein the query is the current query term, the APP(query) is the average playing completion ratio of the multimedia resources of the current query term, the IVC(query) is the independent multimedia resource playing amount of the current query term, the n_i is the number of times the i-th independent multimedia resource of the current query term is played, and the perc_i is the playing completion ratio of the i-th independent multimedia resource of the current query term;
in the case that the user conversion behavior comprises the query amount and the identification data comprises the click divergence of the multimedia resource, adopting a formula according to the query amount and the independent multimedia resource playing amount
Figure FDA0002125136140000031
Determining the multimedia resource click divergence, wherein VCR(query) is the multimedia resource click divergence of the current query term, and sqv is the query quantity;
under the condition that the user conversion behavior comprises the query quantity and the identification data comprises the multimedia resource set playing residual, adopting a formula according to the playing completion ratio set in the user viewing behavior data and the query quantity
Figure FDA0002125136140000032
Determining the multimedia resource set playing residual, wherein the VSPR (query) is the multimedia resource set playing residual of the current query word, and max () takes the maximum value.
6. An apparatus for identifying false search behavior of a search engine, the search engine being configured to search a multimedia resource, the apparatus comprising:
an obtaining unit, configured to obtain, from a user log, user viewing behavior data of a single query term and user conversion behavior data of the single query term, wherein the user viewing behavior data of the single query term comprises: a query word, a clicked multimedia resource set, a multimedia resource playing completion ratio set, and a mapping function from the clicked multimedia resource set to the multimedia resource playing completion ratio set, and wherein the user conversion behavior data of the single query term comprises the query word and further comprises at least one of a query quantity, a direct region hit rate, a direct region conversion rate, a user-generated content (UGC) region hit rate, a UGC region conversion rate, and an overall conversion rate;
a determining unit, connected with the obtaining unit and configured to determine identification data for identifying the false search behavior according to the user viewing behavior data and/or the user conversion behavior data, wherein the identification data comprises at least one of independent multimedia resource playing amount, multimedia resource average playing completion ratio, multimedia resource click divergence and multimedia resource set playing residual; and
a processing unit for identifying the false search behavior based on the identification data;
wherein, relative to the query quantity, the more independent multimedia resources are clicked, the larger the multimedia resource click divergence is, and the fewer independent multimedia resources are clicked, the smaller the multimedia resource click divergence is; and
the larger the click ratio of a multimedia resource is and the smaller the playing completion ratio of the multimedia resource is, the larger the multimedia resource set playing residual is.
7. The identification device according to claim 6, wherein in a case where the user conversion behavior data includes a direct region conversion rate and the identification data includes the multimedia resource click divergence, the processing unit specifically includes:
a first judging unit, configured to judge whether the direct region conversion rate of the current query word is smaller than a first threshold value;
a second judging unit, connected with the first judging unit and configured to judge whether the multimedia resource click divergence of the current query word is smaller than a second threshold value under the condition that the first judging unit judges that the direct region conversion rate of the current query word is smaller than the first threshold value; and
an identification unit, connected with the second judging unit and configured to identify the search behavior of the current query word as the false search behavior under the condition that the second judging unit judges that the multimedia resource click divergence of the current query word is smaller than the second threshold value.
8. The identification device according to claim 6, wherein in a case where the user conversion behavior data includes a direct region conversion rate and the identification data includes the multimedia resource average playing completion ratio, the processing unit specifically includes:
a first judging unit, configured to judge whether the direct region conversion rate of the current query word is smaller than a first threshold value;
a second judging unit, connected with the first judging unit and configured to judge whether the average playing completion ratio of the multimedia resources of the current query word is smaller than a third threshold value under the condition that the first judging unit judges that the direct region conversion rate of the current query word is not smaller than the first threshold value; and
an identification unit, connected with the second judging unit and configured to identify the search behavior of the current query word as the false search behavior under the condition that the second judging unit judges that the average playing completion ratio of the multimedia resources of the current query word is smaller than the third threshold value.
9. The identification device according to any one of claims 6 to 8, wherein the determining unit is specifically configured to determine the independent multimedia asset playback amount according to the clicked multimedia asset set in the user viewing behavior data, if the identification data includes the independent multimedia asset playback amount.
10. The identification device according to claim 9, wherein the determination unit is specifically configured to perform at least one of the following steps:
under the condition that the identification data comprises the average playing completion ratio of the multimedia resources, adopting a formula according to the playing completion ratio set and the independent multimedia resource playing amount in the user watching behavior data
Figure FDA0002125136140000051
Determining the average playing completion ratio of the multimedia resources, wherein the query is the current query term, the APP(query) is the average playing completion ratio of the multimedia resources of the current query term, the IVC(query) is the independent multimedia resource playing amount of the current query term, the n_i is the number of times the i-th independent multimedia resource of the current query term is played, and the perc_i is the playing completion ratio of the i-th independent multimedia resource of the current query term;
under the condition that the user conversion behavior comprises the query amount and the identification data comprises the multimedia resource click divergence, adopting a formula according to the query amount and the independent multimedia resource playing amount
Figure FDA0002125136140000052
Determining the multimedia resource click divergence, wherein VCR (query) is the multimedia resource click divergence of the current query word, and sqv is the query quantity;
under the condition that the user conversion behavior comprises the query quantity and the identification data comprises the multimedia resource set playing residual, adopting a formula according to the playing completion ratio set in the user viewing behavior data and the query quantity
Figure FDA0002125136140000053
Determining the multimedia resource set playing residual, wherein the VSPR (query) is the multimedia resource set playing residual of the current query word, and max () takes the maximum value.
CN201511001301.7A 2015-12-28 2015-12-28 Method and device for identifying false search behavior of search engine Active CN105574199B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511001301.7A CN105574199B (en) 2015-12-28 2015-12-28 Method and device for identifying false search behavior of search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511001301.7A CN105574199B (en) 2015-12-28 2015-12-28 Method and device for identifying false search behavior of search engine

Publications (2)

Publication Number Publication Date
CN105574199A CN105574199A (en) 2016-05-11
CN105574199B true CN105574199B (en) 2020-04-21

Family

ID=55884330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511001301.7A Active CN105574199B (en) 2015-12-28 2015-12-28 Method and device for identifying false search behavior of search engine

Country Status (1)

Country Link
CN (1) CN105574199B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106326497A (en) * 2016-10-10 2017-01-11 合一网络技术(北京)有限公司 Cheating video user identification method and device
CN106326498A (en) * 2016-10-13 2017-01-11 合一网络技术(北京)有限公司 Cheat video identification method and device
CN108090100B (en) * 2016-11-23 2022-02-18 百度在线网络技术(北京)有限公司 Data identification method and device
CN106777303B (en) * 2016-12-30 2020-11-06 中国民航信息网络股份有限公司 Passenger flight inquiry behavior classification method and system
CN107529093B (en) * 2017-09-05 2020-05-22 北京奇艺世纪科技有限公司 Method and system for detecting playing amount of video file
CN110188262B (en) * 2019-07-23 2019-10-29 武汉斗鱼网络科技有限公司 A kind of abnormal object determines method, apparatus, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021162A (en) * 2014-05-28 2014-09-03 小米科技有限责任公司 Method and device for grading multimedia resource
CN104035982A (en) * 2014-05-28 2014-09-10 小米科技有限责任公司 Multimedia resource recommendation method and device
CN104504059A (en) * 2014-12-22 2015-04-08 合一网络技术(北京)有限公司 Multimedia resource recommending method
CN104506894A (en) * 2014-12-22 2015-04-08 合一网络技术(北京)有限公司 Method and device for evaluating multi-media resources

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7627559B2 (en) * 2005-12-15 2009-12-01 Microsoft Corporation Context-based key phrase discovery and similarity measurement utilizing search engine query logs

Also Published As

Publication number Publication date
CN105574199A (en) 2016-05-11

Similar Documents

Publication Publication Date Title
CN105574199B (en) Method and device for identifying false search behavior of search engine
US9317550B2 (en) Query expansion
WO2017096877A1 (en) Recommendation method and device
JP5513624B2 (en) Retrieving information based on general query attributes
JP5736469B2 (en) Search keyword recommendation based on user intention
CN104391999B (en) Information recommendation method and device
Chen et al. Velda: Relating an image tweet’s text and images
KR20160057475A (en) System and method for actively obtaining social data
CN103258025B (en) Generate the method for co-occurrence keyword, the method that association search word is provided and system
US20100306214A1 (en) Identifying modifiers in web queries over structured data
CN107967280B (en) Method and system for recommending songs by tag
Bellogín et al. The magic barrier of recommender systems–no magic, just ratings
CN103365910A (en) Method and system for information retrieval
CN107203640A (en) The method and system of physical model are set up by database log
CN109033286B (en) Data statistical method and device
CN105447131B (en) Internet resources relatedness determines method and apparatus
US20170132294A1 (en) App store searching
CN109815337B (en) Method and device for determining article categories
JP2020129377A (en) Content retrieval method, apparatus, device, and storage medium
Chen et al. User intent-oriented video QoE with emotion detection networking
JP2019040605A (en) Feeling interactive method based on humor creation and robot system
US20140095411A1 (en) Establishing "is a" relationships for a taxonomy
Hao et al. Modeling positive and negative feedback for improving document retrieval
CN106407254B (en) Method and device for processing user click behavior chain
Rao et al. Taxonomy based personalized news recommendation: Novelty and diversity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee after: Youku network technology (Beijing) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee before: 1VERGE INTERNET TECHNOLOGY (BEIJING) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20200522

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee after: Alibaba (China) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Patentee before: Youku network technology (Beijing) Co.,Ltd.