CN106326498A - Cheat video identification method and device - Google Patents


Info

Publication number
CN106326498A
Authority
CN
China
Prior art keywords
video
title
cheating
averagely
rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610892400.7A
Other languages
Chinese (zh)
Inventor
魏博
齐志兵
尹玉宗
姚键
潘柏宇
王冀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
1Verge Internet Technology Beijing Co Ltd
Original Assignee
1Verge Internet Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 1Verge Internet Technology Beijing Co Ltd
Priority to CN201610892400.7A
Publication of CN106326498A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention mainly aims to provide a cheat video identification method and device, to solve the prior-art problem that cheat videos affect the display of normal videos. The method comprises the following steps: video information and video log data are acquired from a video website; each video in the acquired video information and log data is identified according to initial index items to determine whether it is a cheat video, the initial index items defining the judgment indices for cheat videos; the identified sample data are used to train a decision tree algorithm and a decision tree model is generated; the decision tree model is then used to identify whether a video is a cheat video. The method avoids the effect of cheat videos on normal video display, so that normal videos get a reasonable chance of being displayed.

Description

A cheating video identification method and device
Technical field
The present invention relates to the technical field of video search engines, and in particular to a cheating video identification method and device.
Background
Today, video is an important online streaming media product and occupies an important place in everyday entertainment. Encouraging users to make and upload videos, and to gain exposure for them, is a basic principle of video websites. Every video website displays video results in its search results or recommendation system. The underlying algorithms typically use the video's title and description together with data such as its play count and uploader information. A normal video usually has a reasonable title, description and play count, along with genuine user interaction. However, a large number of cheating videos exist on current internet video websites.
Cheating videos have an unfair impact on normal videos. Neither industry nor academia has strictly defined cheating videos, but common cheating videos have the following characteristics:
The video title is stuffed with a large number of words, for example "Day Day Up Happy Camp He Jiong Xie Na video" or "Ma Yun Ma Huateng Wang Jianlin Li Yanhong Lei Jun Chen Anzhi entrepreneurship handbook". The video content has little connection with the title, or carries agent or promotional messages; for example, the content of the video titled "Day Day Up Happy Camp He Jiong Xie Na video" is about starting a business. Cheating videos have large play counts, whereas videos that do not involve popular programs or celebrities do not reach play counts in the millions.
Cheating videos are detrimental to normal business. Because of their false play counts and titles, cheating videos usually gain an advantage in ranking algorithms, so they are placed near the top of the video results and are readily exposed in search and recommendation. As a result, non-cheating videos get no chance of exposure.
A preliminary analysis of the motives behind cheating videos gives the following:
Promoting personal information: cheating videos usually include QQ numbers, WeChat IDs, mobile phone numbers and the like, and the uploader hopes that users will make contact after watching the video and then do business offline. Building a self-serving climate of opinion: for example, entrepreneurship videos usually tell users that there are now many opportunities to make a fortune by starting a business and that many people have already succeeded. Attracting attention from others: for example, the video title contains many popular words in the hope of getting more chances to be watched.
Exploiting the limitations of traditional algorithms: traditional search and ranking algorithms impose some requirement on the diversity of videos and users, that is, the results should contain more distinct videos and users. Cheating videos and cheating users usually create large numbers of identical videos and users to gain an advantage. For non-cheating videos, i.e. normal videos, this is unfair competition, and it seriously affects the display of normal videos.
Summary of the invention
The main purpose of the present invention is to provide a cheating video identification method and device, to solve the prior-art problem that cheating videos affect the display of normal videos.
A cheating video identification method, comprising:
obtaining video information and video log data from a video website;
identifying each video in the obtained video information and video log data according to initial index items to determine whether the video is a cheating video, the initial index items defining the judgment indices for cheating videos;
training on the identified sample data using a decision tree algorithm to generate a decision tree model;
using the decision tree model to identify whether a video is a cheating video.
Preferably, the initial index items include at least one of the following:
the form of the video title, the play count of the video within a preset time period, the number of user interactions with the video within the preset time period, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
Preferably, using the decision tree model to identify whether a video is a cheating video comprises:
making at least one of the following judgments on the video information and/or the video log data according to the target value of each index item obtained by training:
judging whether the title of the video meets the target value for the video title; judging whether the play count of the video within the preset time period meets the play count target value; judging whether the number of popular keywords contained in the video title meets the popular keyword count target value; judging whether the number of user interactions with the video meets the user interaction count target value; and judging whether the average play completion rate of the video meets the play completion rate target value, where the average play completion rate is the ratio of the watched portion of a played video to the whole video. A video that meets at least one target value is determined to be a cheating video.
Preferably, identifying each video in the obtained video information and video log data according to the initial index items to determine whether the video is a cheating video comprises:
where the video was not played within a log cycle, identifying the video by the following initial index items to determine whether it is a cheating video:
the form of the video title, the play count of the video within a preset time period, the number of user interactions with the video, and the number of popular keywords contained in the video title.
Preferably, identifying each video in the obtained video information and video log data according to the initial index items to determine whether the video is a cheating video comprises:
where the video was played at least once within a log cycle, identifying the video by the following initial index items to determine whether it is a cheating video:
the form of the video title, the play count of the video within a preset time period, the number of user interactions with the video, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
A cheating video identification device, comprising:
an acquisition module, for obtaining video information and video log data from a video website;
a determination module, for identifying each video in the obtained video information and video log data according to initial index items to determine whether the video is a cheating video, the initial index items defining the judgment indices for cheating videos;
a training module, for training on the identified sample data using a decision tree algorithm to generate a decision tree model;
an identification module, for using the decision tree model to identify whether a video is a cheating video.
7. The device according to claim 6, wherein the initial index items include at least one of the following:
the form of the video title, the play count of the video within a preset time period, the number of user interactions with the video within the preset time period, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
Preferably, the identification module is specifically configured to:
make at least one of the following judgments on the video information and/or the video log data according to the target value of each index item obtained by training:
judging whether the title of the video meets the target value for the video title; judging whether the play count of the video within the preset time period meets the play count target value; judging whether the number of popular keywords contained in the video title meets the popular keyword count target value; judging whether the number of user interactions with the video meets the user interaction count target value; and judging whether the average play completion rate of the video meets the play completion rate target value, where the average play completion rate is the ratio of the watched portion of a played video to the whole video. A video that meets at least one target value is determined to be a cheating video.
Preferably, the determination module is specifically configured to:
where the video was not played within a log cycle, identify the video by the following initial index items to determine whether it is a cheating video:
the form of the video title, the play count of the video within a preset time period, the number of user interactions with the video, and the number of popular keywords contained in the video title.
Preferably, the determination module is specifically configured to:
where the video was played at least once within a log cycle, identify the video by the following initial index items to determine whether it is a cheating video:
the form of the video title, the play count of the video within a preset time period, the number of user interactions with the video, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
The beneficial effects of the present invention are as follows:
The scheme provided by the embodiments of the present invention trains on the obtained video information and video log data by means of the initial index items to generate a decision tree model, and then uses the decision tree model to identify cheating videos, so that cheating videos can be effectively identified. This avoids the impact of cheating videos on normal video display, so that normal videos get a reasonable chance of being displayed.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The exemplary embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of the cheating video identification method provided in Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the paths by which a decision tree identifies cheating videos in Embodiment 2 of the present invention;
Fig. 3 is a structural block diagram of the cheating video identification device provided in Embodiment 3 of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Embodiment 1
This embodiment provides a cheating video identification method. Fig. 1 is a flow chart of the method. As shown in Fig. 1, the method includes the following processing:
Step 101: obtain video information and video log data from a video website.
The video information may include attribute information such as the title of the video, its uploader and its text description. The video log data may include log data such as the upload date, play time and play count of the video.
Step 102: identify each video in the obtained video information and video log data according to initial index items to determine whether the video is a cheating video, the initial index items defining the judgment indices for cheating videos.
Step 103: train on the identified sample data using a decision tree algorithm to generate a decision tree model.
In this step, generating the decision tree model yields the target value corresponding to each initial index item. For example, the initial index items in this embodiment may specifically include: the form of the video title, the play count of the video within a preset time period, the number of user interactions with the video within the preset time period, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video. On this basis, the target value corresponding to each initial index item is the threshold for judging whether a video is a cheating video. In a specific implementation, a single one of the above initial index items may be selected, or several may be selected at once.
Step 104: use the decision tree model to identify whether a video is a cheating video.
In this embodiment, using the decision tree model to identify whether a video is a cheating video may specifically include making at least one of the following judgments on the video information and/or the video log data according to the target value of each index item obtained by training: judging whether the title of the video meets the target value for the video title; judging whether the play count of the video within the preset time period meets the play count target value; judging whether the number of popular keywords contained in the video title meets the popular keyword count target value; judging whether the number of user interactions with the video meets the user interaction count target value; and judging whether the average play completion rate of the video meets the play completion rate target value, where the average play completion rate is the ratio of the watched portion of a played video to the whole video. A video that meets at least one target value is determined to be a cheating video.
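The "meets at least one target value" rule of step 104 can be sketched as follows. This is a minimal illustration, not the patented implementation: the feature names are borrowed from Embodiment 2, the direction of each comparison is an assumption, and every threshold is a hypothetical placeholder for a value that training would supply.

```python
from typing import Callable, Dict

# Hypothetical target values. In the method these thresholds come from the
# trained decision tree; the numbers and comparison directions here are
# assumptions chosen only for illustration.
TARGETS: Dict[str, Callable[[float], bool]] = {
    "wordCount": lambda v: v >= 3,            # many popular keywords in the title
    "firstDayVV": lambda v: v > 10000,        # abnormally high play count
    "interactRatio": lambda v: v < 0.0001,    # almost no user interaction
    "avgPP": lambda v: v < 0.05,              # very low average completion
}


def is_cheating(features: Dict[str, float]) -> bool:
    """Return True if any index item present in `features` meets its target."""
    return any(
        check(features[name])
        for name, check in TARGETS.items()
        if name in features  # unplayed videos may have no avgPP
    )
```

Index items missing from the input are simply skipped, which matches the case where a video was not played in the log cycle and has no completion rate.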
Identifying each video in the obtained video information and video log data according to the initial index items to determine whether it is a cheating video may specifically include: where the video was not played within a log cycle, identifying the video by the following initial index items to determine whether it is a cheating video: the form of the video title, the play count of the video within a preset time period, the number of user interactions with the video, and the number of popular keywords contained in the video title.
Optionally, identifying each video in the obtained video information and video log data according to the initial index items to determine whether it is a cheating video may specifically include:
where the video was played at least once within a log cycle, identifying the video by the following initial index items to determine whether it is a cheating video:
the form of the video title, the play count of the video within a preset time period, the number of user interactions with the video, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
Embodiment 2
This embodiment designs a cheating video identification algorithm aimed at SEO techniques. Features are extracted separately for cheating videos with play behavior and for cheating videos without play behavior, and a decision tree algorithm is used to judge cheating videos. Cheating videos usually aim for higher exposure and attention on the video website platform. For example, in a search engine, a cheating video aims to appear on the first page of results, ideally in the top few positions; in a recommendation system, it aims to be recommended more often; it also aims to be favorited or reposted by more users, so that on third-party platforms it has a chance of being found by more video users. A statistical analysis of SEO techniques and videos shows that cheating videos usually have the following characteristics:
The title of a cheating video usually contains several popular words, or popular words from a related field, such as the names of programs and celebrities in hit TV dramas, variety shows and finance, or in entrepreneurship and direct selling. Examples: Happy Comedians, Descendants of the Sun, Youth's Eye Finance, Ma Yun, Chen Anzhi, Amway.
The play count of a cheating video usually reaches an abnormally high value within a short time. Cheating videos use dedicated SEO tools to inflate the play count. Statistically, the video of an ordinary user usually gets no more than 10,000 plays in a day, whereas a cheating video can reach hundreds of thousands or even millions of plays within a few hours.
A cheating video has almost no upvotes, downvotes, favorites or similar user actions. Given a very high play count, user interactions such as upvotes, downvotes and favorites would normally reach a certain level, but cheating videos usually lack them. This shows that although the play count has been inflated abnormally, no real users are interacting with the video.
The user names of cheating videos follow certain rules. Because SEO now mostly relies on software automation, the user name is not set by hand before a video is uploaded; the software simply generates it according to some rule. Common forms are game_XXXXXX and QQYYYYYYYY, where X represents a letter or digit and Y represents a digit.
Summarizing the above basic understanding of cheating videos, this embodiment obtains the following basic features:
The number of hot words contained in the video title, wordCount. This requires a hot-word vocabulary based on frequency statistics. A hot word that is repeated is counted each time it occurs.
The single-day play count of the video, firstDayVV. The single day here is the release date of the video, i.e. the date the video went online.
The single-day interaction conversion rate of the video, interactRatio. Based on the available log results, the counts of upvotes, downvotes, quotes, favorites and so on are divided by the single-day play count of the video. Depending on the resulting value range, some normalization may be needed.
The user name format of the video, accountName. This requires a list of common cheating user name formats based on frequency statistics.
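The three derived features above can be sketched in a few lines. This is an illustrative sketch only: the hot-word list is a hypothetical stand-in for the frequency-based vocabulary, the helper names are invented here, and reading game_XXXXXX and QQYYYYYYYY literally as six and eight characters is an assumption.

```python
import re

# Hypothetical hot-word vocabulary; the text assumes one built from
# frequency statistics.
HOT_WORDS = ["Ma Yun", "Amway"]

# The two user-name formats quoted in the text, read literally (assumption):
# game_ + 6 letters/digits, and QQ + 8 digits.
ACCOUNT_PATTERNS = [
    re.compile(r"game_[A-Za-z0-9]{6}"),
    re.compile(r"QQ[0-9]{8}"),
]


def word_count(title: str) -> int:
    """Count hot-word occurrences in a title; repeats are counted each time."""
    return sum(title.count(word) for word in HOT_WORDS)


def interact_ratio(interactions: int, first_day_vv: int) -> float:
    """Interactions (upvotes, downvotes, quotes, favorites) per first-day play."""
    if first_day_vv == 0:
        return 0.0  # no plays: define the ratio as 0 (assumption)
    return interactions / first_day_vv


def matches_cheat_account(name: str) -> bool:
    """True if the whole user name matches a known machine-generated format."""
    return any(p.fullmatch(name) for p in ACCOUNT_PATTERNS)
```

For example, `word_count("Ma Yun Ma Yun Amway talk")` counts "Ma Yun" twice and "Amway" once.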
For cheating videos with play behavior, the play completion of the video can also be used as a further feature. In this embodiment, the average play completion ratio (Average Playing Percentage, avgPP) characterizes the average degree to which a video is played to completion. The larger the ratio, the more completely the video is watched; the smaller, the less completely. The average play completion ratio is defined as follows:
$$avgPP = \frac{\sum_{i=1}^{n} watchingLength_i}{n \cdot videoLength}$$
where watchingLength_i is the duration of the i-th viewing of the video, videoLength is the total duration of the current video, and n is the number of plays.
In general, the avgPP of an ordinary video will not be very low unless the completion ratio of every play is extremely low. According to statistics, the average play completion ratio across all videos is about 40%, so if the average play completion ratio of a video is very low, it is very likely a cheating video.
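The avgPP definition translates directly into code. The sketch below assumes viewing durations and the video length are given in the same unit (for example seconds); the function name and the error handling are choices made here, not part of the source.

```python
from typing import Sequence


def avg_pp(watching_lengths: Sequence[float], video_length: float) -> float:
    """Average play completion ratio: sum of viewing durations / (n * videoLength)."""
    n = len(watching_lengths)
    if n == 0 or video_length <= 0:
        # avgPP is only defined for videos with play behavior and positive length
        raise ValueError("need at least one play and a positive video length")
    return sum(watching_lengths) / (n * video_length)
```

For instance, three viewings of 30, 60 and 90 seconds of a 120-second video give 180 / 360, i.e. half the video watched on average.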
In summary, for cheating videos with play behavior, this embodiment suggests using the following index items to identify cheating videos on a video website:
the number of hot words contained in the video title, wordCount; the single-day play count of the video, firstDayVV; the single-day interaction conversion rate of the video, interactRatio; the user name format of the video, accountName; and the average play completion ratio of the video, avgPP.
The data format is:
vid|wordCount|firstDayVV|interactRatio|accountName|avgPP
In this embodiment, no particular order of the above data fields is required.
For cheating videos without play behavior (meaning no play behavior in the previous log cycle, not no play behavior since going online), fewer data are available, so only the basic features can be used to characterize them, namely:
the number of hot words contained in the video title, wordCount; the single-day play count of the video, firstDayVV; the single-day interaction conversion rate of the video, interactRatio; and the user name format of the video, accountName.
The data format is:
vid|wordCount|firstDayVV|interactRatio|accountName
Likewise, in this embodiment, no particular order of the above data fields is required.
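A record in either of the two pipe-delimited formats could be parsed as sketched below. The text leaves the field order open, so this sketch assumes the order shown above; the `VideoRecord` type, the `parse_record` name and the field types are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoRecord:
    vid: str
    wordCount: int
    firstDayVV: int
    interactRatio: float
    accountName: str
    avgPP: Optional[float] = None  # absent for videos without play behavior


def parse_record(line: str) -> VideoRecord:
    """Parse a 5-field (no play behavior) or 6-field (with avgPP) record."""
    fields = line.strip().split("|")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}")
    vid, word_count, first_day_vv, ratio, account = fields[:5]
    avg_pp = float(fields[5]) if len(fields) == 6 else None
    return VideoRecord(vid, int(word_count), int(first_day_vv),
                       float(ratio), account, avg_pp)
```

Keeping avgPP optional lets one record type serve both cases, with `None` marking a video that had no plays in the previous log cycle.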
In general, depending on the algorithm design of business systems such as search and recommendation, a cheating video will not be entirely without play behavior. Many current business algorithms encourage users to upload and put more weight on timeliness. On the first day after release, a cheating video necessarily has play behavior because of timeliness, but as time passes it may no longer have any. If the business algorithm places more weight on how classic a video is (typically play count, interaction data and so on), cheating videos without play behavior are even more common.
Identification algorithm
Once all the characteristic data of cheating videos are known, a model can be trained on a small sample of data, that is, the various comparison thresholds used to identify the full sample are computed. The model should also be easy to implement in engineering terms; in this embodiment, a decision tree model can be used to identify cheating videos.
This embodiment uses the classical decision tree (Decision Tree) algorithm to identify false search behavior in a video search engine. First, a decision tree model is trained on a training set. The training set can be an initial data set in which manual labeling states whether each search word is false search behavior. Manual labeling is based on a small number of explicitly specified search words; the decision tree model is then used to predict known search behavior, and from that the accuracy of the model is judged and optimized. A decision tree is a tree structure similar to a flow chart, in which each internal node represents a test on an attribute, each branch represents a test outcome, and each leaf node represents a class or class distribution. The topmost node of the tree is the root node. By its nature, the decision tree algorithm is suited to high-quality classification when the number of attributes (features) is small.
The core problem of the decision tree algorithm is choosing the attribute to test at each node of the tree, aiming to select the attribute most helpful for classifying instances. To solve this problem, the ID3 algorithm introduces the concept of information gain and uses the magnitude of the information gain to decide the nodes at each level of the decision tree, i.e. the attributes important for classification. To define information gain precisely, the ID3 algorithm (one way of implementing a decision tree algorithm; this embodiment uses it only as an example and is not limited to it) uses the concept of entropy from information theory, which characterizes the purity of an arbitrary sample set. Given a sample set S containing positive and negative samples of some target concept, the entropy of S relative to this Boolean classification is:
$$Entropy(S) = -P_{+}\log_2 P_{+} - P_{-}\log_2 P_{-}$$
In the above formula, P_+ is the proportion of positive samples in S and P_- is the proportion of negative samples (in all entropy calculations, 0 log 0 is defined as 0). Using entropy, ID3 defines information gain. Briefly, the information gain of an attribute is the expected reduction in entropy caused by partitioning the samples on that attribute. More precisely, the information gain of attribute A relative to sample set S is defined as:
$$Gain(S, A) = Entropy(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$
where V(A) is the value range of attribute A, S is the sample set, and S_v is the subset of S whose value on attribute A equals v.
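The two definitions can be checked numerically with a short sketch. It assumes each sample is a dict mapping attribute names to discrete values, with a parallel list of class labels; the function names are chosen here for illustration. The entropy function sums over whatever classes are present, which reduces to the Boolean formula above when there are two.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Entropy(S) = -sum(p * log2(p)) over the classes present in S."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for count in Counter(labels).values():
        p = count / total  # classes with zero count never appear, so 0*log0 is skipped
        result -= p * math.log2(p)
    return result


def information_gain(rows: List[Dict], labels: Sequence[Hashable], attr: str) -> float:
    """Gain(S, A) = Entropy(S) - sum(|S_v| / |S| * Entropy(S_v))."""
    subsets: Dict[Hashable, list] = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    total = len(labels)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder
```

An evenly split set has entropy 1, a pure set has entropy 0, and an attribute that separates the classes perfectly has a gain equal to the entropy of the whole set.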
ID3 algorithm flow is as follows: input: sample set S, community set A export: ID3 decision tree.
If the attribute of 1 all kinds is all disposed, return;Otherwise perform 2;
2, information gain Gain (S, A) maximum attribute a is calculated, using this attribute as a node;If only with attribute a Just sample classification then can be returned;Otherwise perform 3;
3, each possible value v to attribute a, the following operation of execution:
4, it is sample subset S as S of v using the value of all properties av
5, community set AT=A-{a} is generated;
6, with sample set SvBeing input with community set AT, recurrence performs ID3 algorithm;
From the extracted feature data, the annotation results of the training set, and the ID3 decision tree algorithm, an initial decision tree model of false search behavior can be obtained.
The model can be optimized using a pruning strategy. There are two main pruning strategies:
Pre-pruning: stop early while the decision tree is being built. To do so, the conditions for splitting a node may be set very strictly, which makes the decision tree very short and small; as a result, the decision tree cannot reach the optimum.
Post-pruning: begin pruning only after the decision tree has been built. This pruning uses two methods:
Replace a whole subtree with a single leaf node, whose class is the most common class in the subtree;
Replace one subtree entirely with another subtree.
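The first post-pruning method, replacing a subtree with a single leaf that takes the subtree's dominant class, can be sketched as follows. The tree layout (inner nodes as dicts, leaves as `Counter` objects holding training label counts) and the counts themselves are hypothetical; the source does not say how the decision to prune a given subtree is made, so the sketch only shows the replacement operation.

```python
from collections import Counter

def leaf_counts(node):
    """Total training-label counts under a node; leaves are Counter objects."""
    if isinstance(node, Counter):
        return node
    total = Counter()
    for child in node["children"].values():
        total += leaf_counts(child)
    return total

def replace_with_leaf(node):
    """Subtree replacement: collapse a subtree into one leaf with merged counts."""
    return leaf_counts(node)

def predict(node, row):
    while not isinstance(node, Counter):
        node = node["children"][row[node["attr"]]]
    return node.most_common(1)[0][0]   # dominant class of the leaf

# Hypothetical tree with made-up training counts at the leaves.
tree = {"attr": "accountName", "children": {
    1: Counter(cheat=40),
    0: {"attr": "manyHotWords", "children": {
        1: Counter(cheat=3, normal=1),
        0: Counter(normal=55, cheat=1)}}}}

# Collapse the accountName = 0 branch into a single leaf.
tree["children"][0] = replace_with_leaf(tree["children"][0])
print(predict(tree, {"accountName": 0, "manyHotWords": 1}))  # normal
```

After the replacement the small `cheat` pocket under `manyHotWords = 1` is absorbed into the dominant `normal` class, which is exactly the trade-off pruning makes between tree size and fit to the training data.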
In the present embodiment, the decision tree can judge from the features of a video whether the video is a cheating video. The basic flow is as follows:
Obtain the data of the videos (i.e. the information of each video, which may include, for example, the title of the video, the uploader of the video, and the text description of the video) and the log data of the videos (for example the upload date, play time, and number of plays);
From the video data and the video log data, randomly draw a certain number of video samples and manually label whether each is a cheating video (that is, a person judges from the various index data of a video whether it is a cheating video and then labels the videos judged to be cheating);
Using the labeled video sample data, train a cheating video decision tree with the ID3 algorithm to obtain a decision tree model;
Identify the videos to be detected on the video website according to the generated decision tree model, and judge whether each is a cheating video.
First, a hot word vocabulary for cheating videos is prepared. Then the play data of these videos is obtained from the back-end play logs, and data such as the play count, likes and dislikes, favorites, citations, and user name of each video is obtained from the video's static information.
The hot word vocabulary that the present embodiment uses is as follows:
' Ma Yun ', ' Ma Huateng ', ' Li Yanhong ', ' ', ' success ', ' the Chen An it ', ' that starts an undertaking pursue a goal with determination ', ' Wang Jianlin ', ' Liu Qiang East ', ' Lei Jun ', ' Qiao Busi ', ' Luo Yonghao ', ' Zhang Chaoyang ', ' Zhou Hong ', ' Bill Gates ', ' Zhao Benshan ', ' Song little Bao ', ' White hundred what ', ' plumage springs ', ' Huang Xiaoming ', ' Guo Degang ', ' Yue Yunpeng ', ' Cheng Long ', ' Liu Dehua ', ' Liu Jialing ', ' Liang Chaowei ', ' Guo Fucheng ', ' Zeng Shiqiang ', ' Liang Kaien ', ' Yu Lingxiong ', ' Zhai Hong ', ' Amway ', ' unlimited pole ', ' Avon ', ' sky lion ', ' rose Lin Kai ', ' Long Liqi ', ' Zhao Liying ', ' deer break ', ' Liu Yifei ', ' Li Yifeng ', ' Liu Shishi ', ' Du Yunsheng ', ' Xu Hening ', ' Li Jiacheng ', ' Niu Gensheng ', ' Yang Yuanqing ', ' Li Kaifu ', ' Ren Zhengfei ', ' Tang Jun ', ' fourth ', ' of heap of stone Shi Yuzhu ', ' Yu Minhong ', ' Liu Chuanzhi ', ' cloud business ', ' as newly ', ' the Internet ', ' silk ', ' Liu Yimiao ', ' China's dream ', ' opportunity ', ' business ', ' battalion Pin ', ' today's tops ', ' Zhejiang business ', ' Tao Yang's ring ', ' investment ', ' marketing ', ' destiny ', ' make progress every day ', ' happy comedy people ', ' Success ', ' superman ', ' Anthony guest sieve ', ' Zheng Shuan ', ' Wu Qilong ', ' trend ', ' Ji Zhongzhan ', ' state treasure ', ' Deng Chao ', ' suddenly Jian Hua ', ' poplar power ', ' finance and economics youth's eye ', ' Zhao Wei ', ' the Negotiator ', ' hero alliance ', ' match in spring ', ' I be singer ', ' Happy base camp '.
With the hot word vocabulary, the degree to which hot words are piled up in a video can be judged.
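One simple way to measure hot word pile-up is to count how many vocabulary entries occur in a video title. The sketch below assumes plain substring matching and uses a five-entry subset of the vocabulary in its translated form; the function name `word_count` and the matching rule are illustrative assumptions, and the original system would match against the Chinese terms.

```python
# Hypothetical subset of the hot word vocabulary (translated forms).
HOT_WORDS = ["Ma Yun", "Ma Huateng", "Wang Jianlin", "investment", "marketing"]

def word_count(title, hot_words=HOT_WORDS):
    """Number of distinct hot words that appear in the title (substring match)."""
    lowered = title.lower()
    return sum(1 for word in hot_words if word.lower() in lowered)

print(word_count("Ma Yun and Wang Jianlin talk investment and marketing"))  # 4
print(word_count("Cooking at home"))                                        # 0
```

A title that strings together several unrelated celebrity and business terms scores high, which is the signal the wordCount feature is meant to capture.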
Generating the decision rules:
From the obtained video data, a table of the various features of each video can be built. For example, for videos, there are data fragments as shown in Table 1 below:
Table 1
Wherein, an accountName field of 1 indicates that the user name has the format game_XXXXXX.
The random video samples are labeled, and the decision rules generated with the decision tree algorithm may be as shown in Fig. 2. Each comparison threshold shown in Fig. 2 is obtained by training the decision tree model. It can be seen that there are 4 paths for identifying a cheating video:
accountName = 1;
accountName <> 1 and wordCount > 4;
accountName <> 1 and wordCount > 2 and avgPP < 0.5;
accountName <> 1 and wordCount < 2 and avgPP < 0.01.
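The four paths can be written directly as a rule function. This is a sketch under assumptions: `avg_pp` is taken to be the average playback completion rate (the source does not expand the name avgPP), the thresholds are the ones listed above, and the function name is hypothetical.

```python
def is_cheating(account_name, word_count, avg_pp):
    """Apply the four identification paths; any match marks the video as cheating."""
    if account_name == 1:                     # path 1: game_XXXXXX style user name
        return True
    if word_count > 4:                        # path 2
        return True
    if word_count > 2 and avg_pp < 0.5:       # path 3
        return True
    if word_count < 2 and avg_pp < 0.01:      # path 4
        return True
    return False

print(is_cheating(1, 0, 1.0))    # True  (path 1)
print(is_cheating(0, 3, 0.4))    # True  (path 3)
print(is_cheating(0, 3, 0.6))    # False
```

As listed, no path covers a video with exactly two hot words and an ordinary user name, so such a video is never flagged by these rules.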
With the decision tree and decision rules learned above, the data of cheating videos is generated by an offline calculation every day and is used by the video search engine to demote these videos, so that they are at a great disadvantage in the ranking.
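The demotion step can be pictured as a score penalty applied at ranking time. The source does not describe how the search engine demotes a video, so the multiplicative `penalty` factor, the function name, and the sample scores below are all assumptions.

```python
def rerank(results, cheating_ids, penalty=0.1):
    """results: list of (video_id, relevance_score) pairs.

    Scores of videos flagged by the daily offline job are multiplied by
    `penalty` (an assumed mechanism), then the list is sorted by score.
    """
    adjusted = [(vid, score * penalty if vid in cheating_ids else score)
                for vid, score in results]
    return sorted(adjusted, key=lambda item: item[1], reverse=True)

ranked = rerank([("a", 0.9), ("b", 0.8), ("c", 0.5)], cheating_ids={"a"})
print([vid for vid, _ in ranked])  # ['b', 'c', 'a']
```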
Embodiment 3
The present embodiment provides a cheating video identification device. Fig. 3 is a structural block diagram of the device. As shown in Fig. 3, the device includes the following components:
Acquisition module 31, configured to obtain the information of videos and the log data of videos from a video website;
Determining module 32, configured to identify each video in the obtained video information and video log data according to original index items, and to determine whether the video is a cheating video, where the original index items define the criteria for judging a cheating video;
Training module 33, configured to train on the identified sample data using a decision tree algorithm to generate a decision tree model;
Identification module 34, configured to use the decision tree model to identify whether a video is a cheating video.
Optionally, in the present embodiment, the original index items may specifically include: the format of the video title name, the play count of the video within a preset time period, the number of user interactions corresponding to the video within the preset time period, the number of popular keywords contained in the video title, and the average playback completion rate of the video, where the average playback completion rate is the ratio of the watched part of a played video to the whole video. In a specific implementation, one of the above original index items may be selected, or several may be selected at the same time.
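The average playback completion rate can be computed from per-play watch times. The sketch below assumes the play log yields the number of seconds watched in each play and that the video duration is known; the function name, the clamping of over-long watch times, and the `None` result for videos without plays are illustrative choices rather than details from the source.

```python
def average_completion_rate(watched_seconds, duration_seconds):
    """Mean over plays of (watched part / whole video); None if there are no plays."""
    if not watched_seconds or duration_seconds <= 0:
        return None
    # Clamp each play at the video length so replays or seeks cannot exceed 100%.
    rates = [min(w, duration_seconds) / duration_seconds for w in watched_seconds]
    return sum(rates) / len(rates)

# Three hypothetical plays of a 120-second video: 25%, 50% and 100% watched.
print(average_completion_rate([30, 60, 120], 120))
print(average_completion_rate([], 120))  # None
```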
Optionally, the above identification module 34 is specifically configured to make at least one of the following judgments on the information of the video and/or the log data of the video, according to the target value of each index item obtained by training: judge whether the title of the video meets the target value corresponding to the video name; judge whether the play count of the video within the preset time period meets the play count in the target value; judge whether the number of popular keywords contained in the title of the video meets the number of popular keywords in the target value; judge whether the number of user interactions corresponding to the video meets the number of user interactions in the target value; and judge whether the average playback completion rate of the video meets the playback completion rate in the target value, where the average playback completion rate is the ratio of the watched part of a played video to the whole video. A video that meets at least one target value is determined to be a cheating video.
Wherein, the above determining module 32 is specifically configured to: in the case where the video was not played within one log cycle, identify the video by the following original index items to determine whether the video is a cheating video: the format of the video title name, the play count of the video within the preset time period, the number of user interactions corresponding to the video, and the number of popular keywords contained in the video title.
Optionally, the above determining module may specifically be configured to: in the case where the video was played at least once within one log cycle, identify the video by the following original index items to determine whether the video is a cheating video: the format of the video title name, the play count of the video within the preset time period, the number of user interactions corresponding to the video, the number of popular keywords contained in the video title, and the average playback completion rate of the video, where the average playback completion rate is the ratio of the watched part of a played video to the whole video.
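The choice of index items by play status can be sketched as a small selector: the average playback completion rate is only meaningful once a video has been played at least once in the log cycle, so it is added only in that case. The item names and the function name below are hypothetical labels for the index items listed in the text.

```python
# Hypothetical names for the index items that apply to every video.
BASE_ITEMS = ["title_format", "play_count", "interaction_count", "hot_word_count"]

def index_items(plays_in_log_cycle):
    """Return the index items to evaluate for a video, given its plays in one log cycle."""
    if plays_in_log_cycle >= 1:
        # A completion rate exists only when there is at least one play.
        return BASE_ITEMS + ["avg_completion_rate"]
    return list(BASE_ITEMS)

print(index_items(0))
print(index_items(3))
```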
The foregoing is only embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.

Claims (10)

1. A cheating video identification method, characterized by comprising:
Obtaining information of videos and log data of videos from a video website;
Identifying each video in the obtained video information and video log data according to original index items, and determining whether the video is a cheating video, wherein the original index items define the criteria for judging a cheating video;
Training on the identified sample data using a decision tree algorithm to generate a decision tree model;
Using the decision tree model to identify whether a video is a cheating video.
2. The method according to claim 1, characterized in that the original index items include at least one of the following:
The format of the video title name, the play count of the video within a preset time period, the number of user interactions corresponding to the video within the preset time period, the number of popular keywords contained in the video title, and the average playback completion rate of the video, wherein the average playback completion rate is the ratio of the watched part of a played video to the whole video.
3. The method according to claim 2, characterized in that using the decision tree model to identify whether a video is a cheating video comprises:
Making at least one of the following judgments on the information of the video and/or the log data of the video, according to the target value of each index item obtained by training:
Judging whether the title of the video meets the target value corresponding to the video name; judging whether the play count of the video within the preset time period meets the play count in the target value; judging whether the number of popular keywords contained in the title of the video meets the number of popular keywords in the target value; judging whether the number of user interactions corresponding to the video meets the number of user interactions in the target value; and judging whether the average playback completion rate of the video meets the playback completion rate in the target value, wherein the average playback completion rate is the ratio of the watched part of a played video to the whole video; and determining a video that meets at least one target value to be a cheating video.
4. The method according to claim 1, characterized in that identifying each video in the obtained video information and video log data according to the original index items, and determining whether the video is a cheating video, comprises:
In the case where the video was not played within one log cycle, identifying the video by the following original index items to determine whether the video is a cheating video:
The format of the video title name, the play count of the video within a preset time period, the number of user interactions corresponding to the video, and the number of popular keywords contained in the video title.
5. The method according to claim 1, characterized in that identifying each video in the obtained video information and video log data according to the original index items, and determining whether the video is a cheating video, comprises:
In the case where the video was played at least once within one log cycle, identifying the video by the following original index items to determine whether the video is a cheating video:
The format of the video title name, the play count of the video within a preset time period, the number of user interactions corresponding to the video, the number of popular keywords contained in the video title, and the average playback completion rate of the video, wherein the average playback completion rate is the ratio of the watched part of a played video to the whole video.
6. A cheating video identification device, characterized by comprising:
An acquisition module, configured to obtain information of videos and log data of videos from a video website;
A determining module, configured to identify each video in the obtained video information and video log data according to original index items, and to determine whether the video is a cheating video, wherein the original index items define the criteria for judging a cheating video;
A training module, configured to train on the identified sample data using a decision tree algorithm to generate a decision tree model;
An identification module, configured to use the decision tree model to identify whether a video is a cheating video.
7. The device according to claim 6, characterized in that the original index items include at least one of the following:
The format of the video title name, the play count of the video within a preset time period, the number of user interactions corresponding to the video within the preset time period, the number of popular keywords contained in the video title, and the average playback completion rate of the video, wherein the average playback completion rate is the ratio of the watched part of a played video to the whole video.
8. The device according to claim 7, characterized in that the identification module is specifically configured to:
Make at least one of the following judgments on the information of the video and/or the log data of the video, according to the target value of each index item obtained by training:
Judge whether the title of the video meets the target value corresponding to the video name; judge whether the play count of the video within the preset time period meets the play count in the target value; judge whether the number of popular keywords contained in the title of the video meets the number of popular keywords in the target value; judge whether the number of user interactions corresponding to the video meets the number of user interactions in the target value; and judge whether the average playback completion rate of the video meets the playback completion rate in the target value, wherein the average playback completion rate is the ratio of the watched part of a played video to the whole video; and determine a video that meets at least one target value to be a cheating video.
9. The device according to claim 1, characterized in that the determining module is specifically configured to:
In the case where the video was not played within one log cycle, identify the video by the following original index items to determine whether the video is a cheating video:
The format of the video title name, the play count of the video within a preset time period, the number of user interactions corresponding to the video, and the number of popular keywords contained in the video title.
10. The device according to claim 6, characterized in that the determining module is specifically configured to:
In the case where the video was played at least once within one log cycle, identify the video by the following original index items to determine whether the video is a cheating video:
The format of the video title name, the play count of the video within a preset time period, the number of user interactions corresponding to the video, the number of popular keywords contained in the video title, and the average playback completion rate of the video, wherein the average playback completion rate is the ratio of the watched part of a played video to the whole video.
CN201610892400.7A 2016-10-13 2016-10-13 Cheat video identification method and device Pending CN106326498A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610892400.7A CN106326498A (en) 2016-10-13 2016-10-13 Cheat video identification method and device

Publications (1)

Publication Number Publication Date
CN106326498A true CN106326498A (en) 2017-01-11

Family

ID=57820301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610892400.7A Pending CN106326498A (en) 2016-10-13 2016-10-13 Cheat video identification method and device

Country Status (1)

Country Link
CN (1) CN106326498A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108764021A (en) * 2018-04-04 2018-11-06 北京奇艺世纪科技有限公司 A kind of cheating video frequency identifying method and device
CN109165691A (en) * 2018-09-05 2019-01-08 北京奇艺世纪科技有限公司 Training method, device and the electronic equipment of the model of cheating user for identification
CN109840445A (en) * 2017-11-24 2019-06-04 优酷网络技术(北京)有限公司 A kind of recognition methods and system of video of practising fraud
CN110147472A (en) * 2017-07-14 2019-08-20 北京搜狗科技发展有限公司 Detection method, device and the detection device for website of practising fraud of cheating website
CN110290400A (en) * 2019-07-29 2019-09-27 北京奇艺世纪科技有限公司 The recognition methods of suspicious brush amount video, true playback volume predictor method and device
CN110381375A (en) * 2018-04-13 2019-10-25 武汉斗鱼网络科技有限公司 A kind of determining method, client and server for stealing brush data

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2563014A2 (en) * 2007-02-21 2013-02-27 Nds Limited Method for content presentation
CN103064850A (en) * 2011-10-20 2013-04-24 腾讯科技(深圳)有限公司 Method and system of digging cheating data
CN105183897A (en) * 2015-09-29 2015-12-23 北京奇艺世纪科技有限公司 Method and system for ranking video retrieval
CN105574199A (en) * 2015-12-28 2016-05-11 合一网络技术(北京)有限公司 Identification method and device for false search behavior of search engine

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147472A (en) * 2017-07-14 2019-08-20 北京搜狗科技发展有限公司 Detection method, device and the detection device for website of practising fraud of cheating website
CN110147472B (en) * 2017-07-14 2021-10-15 北京搜狗科技发展有限公司 Detection method and device for cheating sites and detection device for cheating sites
CN109840445A (en) * 2017-11-24 2019-06-04 优酷网络技术(北京)有限公司 A kind of recognition methods and system of video of practising fraud
CN109840445B (en) * 2017-11-24 2021-10-01 阿里巴巴(中国)有限公司 Method and system for identifying cheating videos
CN108764021A (en) * 2018-04-04 2018-11-06 北京奇艺世纪科技有限公司 A kind of cheating video frequency identifying method and device
CN108764021B (en) * 2018-04-04 2021-03-26 北京奇艺世纪科技有限公司 Cheating video identification method and device
CN110381375A (en) * 2018-04-13 2019-10-25 武汉斗鱼网络科技有限公司 A kind of determining method, client and server for stealing brush data
CN109165691A (en) * 2018-09-05 2019-01-08 北京奇艺世纪科技有限公司 Training method, device and the electronic equipment of the model of cheating user for identification
CN109165691B (en) * 2018-09-05 2022-04-22 北京奇艺世纪科技有限公司 Training method and device for model for identifying cheating users and electronic equipment
CN110290400A (en) * 2019-07-29 2019-09-27 北京奇艺世纪科技有限公司 The recognition methods of suspicious brush amount video, true playback volume predictor method and device
CN110290400B (en) * 2019-07-29 2022-06-03 北京奇艺世纪科技有限公司 Suspicious brushing amount video identification method, real playing amount estimation method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 A 5 C, block A, China International Steel Plaza, 8 Haidian Avenue, Haidian District, Beijing.

Applicant after: Youku network technology (Beijing) Co., Ltd.

Address before: 100080 A 5 C, block A, China International Steel Plaza, 8 Haidian Avenue, Haidian District, Beijing.

Applicant before: 1Verge Inc.

CB02 Change of applicant information
RJ01 Rejection of invention patent application after publication

Application publication date: 20170111

RJ01 Rejection of invention patent application after publication