CN106326498A - Cheat video identification method and device - Google Patents
- Publication number
- CN106326498A (application number CN201610892400.7A)
- Authority
- CN
- China
- Prior art keywords
- video
- title
- cheating
- averagely
- rate
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention mainly aims to provide a cheat video identification method and device, to solve the prior-art problem that cheat videos interfere with the normal display of videos. The method comprises the following steps: video information and video log data are acquired from a video website; each video in the acquired information and log data is identified according to initial index items, which define the judgment indexes for cheat videos, to determine whether it is a cheat video; the identified sample data are trained with a decision tree algorithm to generate a decision tree model; and the decision tree model is used to identify whether a video is a cheat video. The method prevents cheat videos from affecting the display of normal videos, so that normal videos obtain a reasonable chance of being displayed.
Description
Technical field
The present invention relates to the technical field of video search engines, and in particular to a cheat video identification method and device.
Background art
Video is now an important online streaming media product and occupies an important place in everyday entertainment. Encouraging users to make and upload videos, and giving those videos exposure, is a basic principle of video websites. Every video website displays video results in its search results or recommendation system. The underlying algorithms typically use the video's title and description together with data such as its play count and uploader information. A normal video generally has a reasonable title, description, play count and user interaction behavior. However, current internet video websites contain a large number of cheat videos, and these have an unfair effect on normal videos. Neither industry nor academia has given a strict definition of a cheat video, but common cheat videos share the following characteristics:
The video title is stuffed with keywords, for example "Day Day Up Happy Camp He Jiong Xie Na video" or "Ma Yun Ma Huateng Wang Jianlin Li Yanhong Lei Jun Chen Anzhi entrepreneurship handbook". The video content has little to do with the title, or carries agent promotion information; for example, the content of "Day Day Up Happy Camp He Jiong Xie Na video" is about starting a business. The cheat video has a large play count, even though videos that do not feature popular programs or celebrities do not normally reach play counts in the millions.
Cheat videos harm normal business. Because of their false play counts and titles, cheat videos usually gain every advantage in ranking algorithms, so they are ranked ahead of other video results and are readily exposed in search and recommendation. As a result, non-cheat videos get no chance of exposure.
A preliminary analysis of why cheat videos exist is as follows:
Promoting personal information: cheat videos usually include a QQ number, WeChat ID, mobile phone number and the like; the uploader hopes that users will make contact after watching the video and do business offline. Creating a self-serving climate of opinion: for example, entrepreneurship videos typically tell users that there are now plenty of opportunities to make big money by starting a business and that many people have already succeeded. Seeking attention from others: for example, the video title contains many popular words in the hope of getting more views.
Exploiting the limitations of traditional algorithms: traditional search and ranking algorithms impose some diversity requirements on videos and users, that is, the results should contain more distinct videos and users. Cheat videos and cheating users usually create large numbers of identical videos and user accounts to gain an advantage. For non-cheat videos, that is, normal videos, this is unfair competition and seriously affects their display.
Summary of the invention
The main purpose of the present invention is to provide a cheat video identification method and device, to solve the prior-art problem that cheat videos affect the display of normal videos.
A cheat video identification method, comprising:
acquiring video information and video log data from a video website;
identifying each video in the acquired video information and video log data according to initial index items, to determine whether the video is a cheat video, wherein the initial index items define the judgment indexes for cheat videos;
training the identified sample data with a decision tree algorithm to generate a decision tree model; and
using the decision tree model to identify whether a video is a cheat video.
Preferably, the initial index items include at least one of the following:
the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video within the preset time period, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
Preferably, using the decision tree model to identify whether a video is a cheat video comprises:
performing at least one of the following judgments on the video information and/or the video log data, according to the target value of each index item obtained by training:
judging whether the title of the video meets the target value for video titles; judging whether the play count of the video within the preset time period meets the play count target value; judging whether the number of popular keywords contained in the title of the video meets the popular keyword count target value; judging whether the number of user interactions with the video meets the user interaction target value; and judging whether the average play completion rate of the video meets the play completion rate target value, where the average play completion rate is the ratio of the watched portion of a played video to the whole video;
a video that meets at least one of the target values is determined to be a cheat video.
Preferably, identifying each video in the acquired video information and video log data according to the initial index items, to determine whether the video is a cheat video, comprises:
when the video has not been played within one log cycle, identifying the video by the following initial index items to determine whether it is a cheat video:
the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video, and the number of popular keywords contained in the video title.
Preferably, identifying each video in the acquired video information and video log data according to the initial index items, to determine whether the video is a cheat video, comprises:
when the video has been played at least once within one log cycle, identifying the video by the following initial index items to determine whether it is a cheat video:
the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
A cheat video identification device, comprising:
an acquisition module, configured to acquire video information and video log data from a video website;
a determination module, configured to identify each video in the acquired video information and video log data according to initial index items, to determine whether the video is a cheat video, wherein the initial index items define the judgment indexes for cheat videos;
a training module, configured to train the identified sample data with a decision tree algorithm to generate a decision tree model; and
an identification module, configured to use the decision tree model to identify whether a video is a cheat video.
Preferably, the initial index items include at least one of the following:
the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video within the preset time period, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
Preferably, the identification module is specifically configured to:
perform at least one of the following judgments on the video information and/or the video log data, according to the target value of each index item obtained by training:
judging whether the title of the video meets the target value for video titles; judging whether the play count of the video within the preset time period meets the play count target value; judging whether the number of popular keywords contained in the title of the video meets the popular keyword count target value; judging whether the number of user interactions with the video meets the user interaction target value; and judging whether the average play completion rate of the video meets the play completion rate target value, where the average play completion rate is the ratio of the watched portion of a played video to the whole video;
a video that meets at least one of the target values is determined to be a cheat video.
Preferably, the determination module is specifically configured to:
when the video has not been played within one log cycle, identify the video by the following initial index items to determine whether it is a cheat video:
the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video, and the number of popular keywords contained in the video title.
Preferably, the determination module is specifically configured to:
when the video has been played at least once within one log cycle, identify the video by the following initial index items to determine whether it is a cheat video:
the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
The beneficial effects of the present invention are as follows:
The scheme provided by the embodiments of the present invention uses the initial index items to train on the acquired video information and video log data, generates a decision tree model, and then uses the decision tree model to identify cheat videos. Cheat videos can thus be identified effectively and their effect on the display of normal videos avoided, so that normal videos obtain a reasonable chance of being displayed.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of the cheat video identification method provided in Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the paths by which a decision tree identifies cheat videos in Embodiment 2 of the present invention;
Fig. 3 is a structural block diagram of the cheat video identification device provided in Embodiment 3 of the present invention.
Detailed description of the invention
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
Embodiment 1
This embodiment provides a cheat video identification method. Fig. 1 is a flow chart of the method. As shown in Fig. 1, the method includes the following processing:
Step 101: acquire video information and video log data from a video website.
The video information may include attribute information such as the title of the video, its uploader and its text description. The video log data may include log data such as the upload date, play time and play count of the video.
Step 102: identify each video in the acquired video information and video log data according to initial index items, to determine whether the video is a cheat video; the initial index items define the judgment indexes for cheat videos.
Step 103: train the identified sample data with a decision tree algorithm to generate a decision tree model.
In this step, generating the decision tree model yields the target value corresponding to each initial index item. For example, the initial index items in this embodiment may specifically include: the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video within the preset time period, the number of popular keywords contained in the video title, and the average play completion rate of the video, which is the ratio of the watched portion of a played video to the whole video. On this basis, the target value corresponding to each initial index item is the threshold used to judge whether a video is a cheat video. In a specific implementation, a single one of the above initial index items may be selected, or several may be selected at once.
Step 104: use the decision tree model to identify whether a video is a cheat video.
In this embodiment, using the decision tree model to identify whether a video is a cheat video may specifically include performing at least one of the following judgments on the video information and/or the video log data, according to the target value of each index item obtained by training: judging whether the title of the video meets the target value for video titles; judging whether the play count of the video within the preset time period meets the play count target value; judging whether the number of popular keywords contained in the title of the video meets the popular keyword count target value; judging whether the number of user interactions with the video meets the user interaction target value; and judging whether the average play completion rate of the video meets the play completion rate target value, where the average play completion rate is the ratio of the watched portion of a played video to the whole video. A video that meets at least one of the target values is determined to be a cheat video.
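The "at least one target value is met" rule of step 104 can be illustrated with a minimal Python sketch. The field names and every threshold below are hypothetical placeholders chosen for illustration; in the method described here the target values come from the trained decision tree, and the title format check is left out for brevity.

```python
# Hypothetical target values (thresholds). In the patent these are produced
# by training the decision tree; the numbers here are placeholders only.
CHECKS = {
    "hot_keyword_count": lambda n: n >= 3,      # many popular keywords in title
    "play_count": lambda n: n > 10_000,         # abnormally high play count
    "interaction_count": lambda n: n == 0,      # no user interaction at all
    "avg_completion_rate": lambda r: r < 0.05,  # viewers leave almost at once
}


def is_cheat_video(video, checks=CHECKS):
    """Return True if the video meets at least one target value.

    `video` maps index item names to observed values; index items that are
    missing from the record are simply skipped.
    """
    return any(
        check(video[name]) for name, check in checks.items() if name in video
    )
```

A record such as `{"play_count": 500_000, "interaction_count": 0}` is flagged by this sketch, while a record that meets none of the placeholder thresholds is not.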
Identifying each video in the acquired video information and video log data according to the initial index items, to determine whether the video is a cheat video, may specifically include: when the video has not been played within one log cycle, identifying the video by the following initial index items to determine whether it is a cheat video: the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video, and the number of popular keywords contained in the video title.
Optionally, identifying each video in the acquired video information and video log data according to the initial index items, to determine whether the video is a cheat video, may specifically include:
when the video has been played at least once within one log cycle, identifying the video by the following initial index items to determine whether it is a cheat video:
the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
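The choice between the two sets of initial index items can be sketched as follows. This is only an illustration: the item names and the function `select_index_items` are hypothetical, and the sketch assumes that the number of plays within the log cycle is already known.

```python
# Index items used for every video (names are hypothetical labels).
BASE_ITEMS = (
    "title_format",
    "play_count",
    "interaction_count",
    "hot_keyword_count",
)


def select_index_items(plays_in_log_cycle):
    """Pick the initial index items for a video.

    A video played at least once in the log cycle also gets the average
    play completion rate; an unplayed video has no viewing data for it.
    """
    if plays_in_log_cycle >= 1:
        return BASE_ITEMS + ("avg_completion_rate",)
    return BASE_ITEMS
```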
Embodiment 2
This embodiment designs a cheat video identification algorithm aimed at SEO techniques: features are extracted separately for cheat videos with play behavior and for cheat videos without play behavior, and a decision tree algorithm is used to judge whether a video is a cheat video. Cheat videos generally aim to gain more exposure and attention on a video website platform. For example, in a search engine, a cheat video aims to appear on the first page of results, ideally in the top few positions; in a recommendation system, it aims to be recommended more often. Cheat videos also aim to be favorited or reposted by more users, so that they have a chance of being found by more video users on third-party platforms. Statistical analysis of SEO techniques and videos shows that cheat videos generally have the following characteristics. The title of a cheat video usually contains several popular words, or popular words from a related field: for example, the names of hit TV dramas and variety shows, program names and celebrity names from the finance field, or program names and celebrity names from the entrepreneurship or direct-selling fields. Examples include Happy Comedians, Descendants of the Sun, Lang Yan Finance, Ma Yun, Chen Anzhi and Amway.
The play count of a cheat video usually reaches an abnormally high value within a short time. Cheat videos use dedicated SEO tools to inflate their play counts abnormally. Statistically, a video from an ordinary user generally does not exceed 10,000 plays in a single day, whereas the play count of a cheat video can reach hundreds of thousands or even millions within a few hours. Cheat videos have almost no user interactions such as upvotes, downvotes or favorites. Given a very high play count, user interactions such as upvotes, downvotes and favorites would normally reach a certain level, but cheat videos usually show none of them. This indicates that although the play count has been abnormally inflated, no real users are interacting with the video. The usernames of cheat video uploaders follow certain patterns. Because SEO is now mostly automated by software, the uploader does not set a username by hand before uploading; the software simply generates usernames according to some rule. Common patterns are game_XXXXXX and QQYYYYYYYY, where X stands for a letter or digit and Y stands for a digit. Combining the above basic understanding of cheat videos, this embodiment obtains the following basic features:
The number of hot words contained in the video title, wordCount. This requires a hot word vocabulary built from frequency statistics. A hot word that occurs more than once is counted each time.
The single-day play count of the video, firstDayVV. The single day here is the release date of the video, that is, the date on which the video went online.
The single-day interaction conversion rate of the video, interactRatio. Based on the available log results, the counts of upvotes and downvotes, quotes, favorites and so on are divided by the single-day play count of the video. Depending on the resulting value range, some normalization may be needed.
The username format of the video, accountName. This requires a list of common cheating username formats built from frequency statistics.
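A minimal Python sketch of how wordCount, interactRatio and the accountName format check might be computed is given below. The vocabulary, the regular expressions and the function names are illustrative assumptions: the patent builds its hot word vocabulary and username format list from frequency statistics, which this sketch does not reproduce, and returning 0.0 for a video with no plays is a choice made here, not something the text specifies.

```python
import re

# Hypothetical hot word vocabulary and username formats (placeholders).
HOT_WORDS = ["Ma Yun", "Amway", "Chen Anzhi"]
CHEAT_NAME_PATTERNS = [
    re.compile(r"game_[A-Za-z0-9]{6}"),  # game_XXXXXX, X = letter or digit
    re.compile(r"QQ[0-9]{8}"),           # QQYYYYYYYY, Y = digit
]


def word_count(title, hot_words=HOT_WORDS):
    """Count hot word occurrences in a title; repeats are counted each time."""
    return sum(title.count(word) for word in hot_words)


def interact_ratio(votes, quotes, favorites, first_day_vv):
    """Interaction count divided by the single-day play count.

    Returns 0.0 when the video has no plays (an assumption of this sketch).
    """
    if first_day_vv <= 0:
        return 0.0
    return (votes + quotes + favorites) / first_day_vv


def account_name_is_suspicious(name, patterns=CHEAT_NAME_PATTERNS):
    """True if the whole username matches one of the listed formats."""
    return any(p.fullmatch(name) is not None for p in patterns)
```

For instance, `word_count("Ma Yun Ma Yun Amway talk")` counts "Ma Yun" twice and "Amway" once.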
For cheat videos with play behavior, the play completion of the video can additionally be used as a feature. In this embodiment, the average play completion ratio (Average Playing Percentage, avgPP) is used to describe the average degree to which a video is played to completion. The larger the average play completion ratio, the more completely the video is watched; the smaller it is, the less completely. The average play completion ratio is defined as follows:
$$\mathrm{avgPP} = \frac{1}{n}\sum_{i=1}^{n}\frac{\mathrm{watchingLength}_i}{\mathrm{videoLength}}$$
where watchingLength_i is the duration of the i-th viewing of the video, videoLength is the total duration of the current video, and n is the number of plays.
In general, the avgPP of an ordinary video is not a very low value unless every individual play has an extremely low completion ratio. According to statistics, the average play completion ratio across all videos is about 40%, so if the average play completion ratio of a video is very low, it is very likely a cheat video.
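The following Python sketch computes avgPP under the assumption, suggested by the variable definitions above, that it is the mean over the n plays of watchingLength_i divided by videoLength. The function name and the error handling are illustrative choices.

```python
def avg_pp(watching_lengths, video_length):
    """Average play completion ratio of one video.

    watching_lengths: viewing duration of each play, in the same unit as
    video_length (for example seconds).
    """
    if video_length <= 0:
        raise ValueError("video_length must be positive")
    if not watching_lengths:
        raise ValueError("avgPP is undefined for a video with no plays")
    ratios = [length / video_length for length in watching_lengths]
    return sum(ratios) / len(ratios)
```

For a 120-second video watched for 30, 60 and 90 seconds, the per-play ratios are 0.25, 0.5 and 0.75, so the result is 0.5.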
In summary, for cheat videos with play behavior, this embodiment suggests using the following index items to identify cheat videos on a video website:
the number of hot words contained in the video title, wordCount; the single-day play count of the video, firstDayVV; the single-day interaction conversion rate of the video, interactRatio; the username format of the video, accountName; and the average play completion ratio of the video, avgPP.
The data format is:
vid|wordCount|firstDayVV|interactRatio|accountName|avgPP
In this embodiment, the order of the above data fields is not mandatory.
For cheat videos without play behavior (meaning no play behavior in the previous log cycle, rather than none at all since going online), less data is available, so they can only be described by the basic features, namely:
the number of hot words contained in the video title, wordCount; the single-day play count of the video, firstDayVV; the single-day interaction conversion rate of the video, interactRatio; and the username format of the video, accountName.
The data format is:
vid|wordCount|firstDayVV|interactRatio|accountName
Likewise, in this embodiment the order of the above data fields is not mandatory.
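Both pipe-delimited record formats can be read with a small parser such as the sketch below. It assumes the field order shown above (which the text says is not mandatory), assumes accountName is stored as the raw username string, and uses hypothetical names (`VideoRecord`, `parse_record`).

```python
from typing import NamedTuple, Optional


class VideoRecord(NamedTuple):
    vid: str
    word_count: int
    first_day_vv: int
    interact_ratio: float
    account_name: str
    avg_pp: Optional[float]  # None for videos without play behavior


def parse_record(line):
    """Parse a 5-field (no play behavior) or 6-field (with avgPP) record."""
    fields = line.strip().split("|")
    if len(fields) not in (5, 6):
        raise ValueError(f"expected 5 or 6 fields, got {len(fields)}")
    vid, words, plays, ratio, name = fields[:5]
    avg = float(fields[5]) if len(fields) == 6 else None
    return VideoRecord(vid, int(words), int(plays), float(ratio), name, avg)
```

Keeping `avg_pp` as `None` for the shorter format lets later code tell the two kinds of video apart without a separate flag.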
In general, depending on the algorithm design of business systems such as search and recommendation, a cheat video will not necessarily have had no play behavior at all. Many current business algorithms encourage users to upload and place more weight on timeliness, so a cheat video will certainly have play behavior on its first day after release thanks to that timeliness; as time passes, however, the cheat video may no longer have any play behavior. If the business algorithm places more weight on how classic a video is (typically play count, interaction data and so on), cheat videos without play behavior are also fairly common.
Identification algorithm
Once the characteristics of cheat videos are known, a model can be trained on a small sample of data, that is, the various comparison thresholds used to identify the full sample are computed. The model should also be easy to implement in engineering terms. This embodiment can use a decision tree model to identify cheat videos.
This embodiment uses the classical decision tree (Decision Tree) algorithm to identify false search behavior in a video search engine. The decision tree model is first trained with a training set. The training set can be an initial data set in which each search word has been manually labeled as false search behavior or not. Manual labeling is based on a small number of clearly identified search words; the decision tree model is then used to predict known search behavior, so as to judge and optimize the accuracy of the model. A decision tree is a tree structure similar to a flow chart, in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class or class distribution; the topmost node of the tree is the root node. By its nature, the decision tree algorithm is well suited to high-quality classification when the number of attributes (features) is small.
The core problem of the decision tree algorithm is choosing the attribute to test at each node of the tree, aiming to select the attribute that is most helpful for classifying instances. To solve this problem, the ID3 algorithm introduces the concept of information gain and uses the amount of information gain to decide the nodes at each level of the decision tree, that is, the attributes that matter most for classification. To define information gain precisely, the ID3 algorithm (one way of implementing a decision tree algorithm; this embodiment uses it only as an example and is not limited to it) uses a concept from information theory called entropy, which characterizes the purity of an arbitrary sample set. Given a sample set S containing positive and negative samples of some target concept, the entropy of S relative to this Boolean classification is:
$$\mathrm{Entropy}(S) = -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-}$$
where $p_{+}$ is the proportion of positive samples in S and $p_{-}$ is the proportion of negative samples (in all entropy calculations, $0 \log 0$ is defined as 0). Using entropy, ID3 defines information gain. Simply put, the information gain of an attribute is the expected reduction in entropy caused by partitioning the samples on that attribute. More precisely, the information gain of attribute A relative to sample set S is defined as:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$
where V(A) is the value domain of attribute A, S is the sample set, and $S_v$ is the subset of S whose value of attribute A equals v.
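Entropy and information gain, as they are usually defined for ID3, can be sketched in Python as follows. The representation of samples as dictionaries and the attribute name used in the example are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels, in bits.

    Only classes that actually occur are summed, so the 0*log(0) case
    never arises.
    """
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(rows, labels, attribute):
    """Expected entropy reduction from splitting the samples on `attribute`.

    rows: list of dicts mapping attribute names to discrete values.
    """
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(
        len(subset) / len(labels) * entropy(subset)
        for subset in subsets.values()
    )
    return entropy(labels) - remainder
```

With two cheat and two normal samples the entropy is 1 bit; an attribute that separates them perfectly has a gain of 1, and one that splits each class evenly has a gain of 0.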
The ID3 algorithm proceeds as follows. Input: sample set S, attribute set A. Output: ID3 decision tree.
1. If all attributes have been processed, return; otherwise go to step 2.
2. Compute the attribute a with the largest information gain Gain(S, A) and make this attribute a node. If attribute a alone is enough to classify the samples, return; otherwise go to step 3.
3. For each possible value v of attribute a, perform the following operations:
4. Take all samples whose value of attribute a is v as the subset S_v of S;
5. Generate the attribute set AT = A - {a};
6. With sample set S_v and attribute set AT as input, run the ID3 algorithm recursively.
With the extracted feature data, the labels of the training set and the ID3 decision tree algorithm, an initial decision tree model of false search behavior can be obtained.
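A compact Python sketch of the recursive ID3 procedure is shown below. It follows the common textbook formulation, which stops when a subset is pure or no attributes remain, and it assumes that continuous features such as firstDayVV have already been discretized into values like "high" and "low". The dictionary-based tree layout and all names are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())


def gain(rows, labels, attr):
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    rest = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - rest


def id3(rows, labels, attributes):
    """Build a tree: a class label at a leaf, a dict at an internal node."""
    if len(set(labels)) == 1:          # pure subset: nothing left to split
        return labels[0]
    if not attributes:                 # attributes exhausted: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]   # AT = A - {a}
    node = {"attribute": best, "branches": {}}
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["branches"][value] = id3(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return node


def classify(tree, row, default=None):
    """Walk the tree; unseen attribute values fall back to `default`."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], default)
    return tree
```

On four samples where only the hot word attribute separates cheat from normal videos, the sketch produces a single split on that attribute.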
The model can be optimized with a pruning strategy. There are two main pruning strategies:
Pre-pruning: stop early while the decision tree is being built. The conditions for splitting a node may then be set too strictly, making the decision tree very short, with the result that the tree cannot reach the optimum.
Post-pruning: start pruning only after the decision tree has been built. This kind of pruning uses two methods:
replacing a whole subtree with a single leaf node, whose class is the dominant class in the subtree;
replacing one subtree entirely with another subtree.
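The first post-pruning method, replacing a subtree with a leaf that carries the subtree's dominant class, might be sketched as below. The text does not say how to decide when a subtree should be replaced; this sketch assumes reduced-error pruning against a held-out labeled set, which is a substitution made here for illustration. The node layout (leaves store the class counts of the training samples that reached them) is also hypothetical.

```python
from collections import Counter


def class_counts(node):
    """Sum the training class counts over all leaves below `node`."""
    if "branches" not in node:
        return Counter(node["counts"])
    total = Counter()
    for child in node["branches"].values():
        total += class_counts(child)
    return total


def predict(node, row):
    while "branches" in node:
        child = node["branches"].get(row[node["attribute"]])
        if child is None:              # unseen value: stop at this subtree
            break
        node = child
    return class_counts(node).most_common(1)[0][0]


def errors(node, rows, labels):
    return sum(predict(node, r) != y for r, y in zip(rows, labels))


def prune(node, rows, labels):
    """Bottom-up post-pruning using held-out samples (rows, labels)."""
    if "branches" not in node:
        return node
    for value, child in list(node["branches"].items()):
        idx = [i for i, r in enumerate(rows) if r[node["attribute"]] == value]
        node["branches"][value] = prune(
            child, [rows[i] for i in idx], [labels[i] for i in idx])
    leaf = {"counts": class_counts(node)}   # dominant class of the subtree
    if errors(leaf, rows, labels) <= errors(node, rows, labels):
        return leaf
    return node
```

Note that a subtree reached by no held-out sample is collapsed by this rule, since a leaf then makes no more errors than the subtree.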
In this embodiment, a decision tree can judge from the features of a video whether it is a cheat video. The basic flow is as follows:
Acquire the data of the videos (that is, the video information, which may include for example the title of the video, its uploader and its text description) and the video log data (for example the upload date, play time and play count of the video);
Randomly draw a certain number of video samples from the video data and video log data and manually label whether each is a cheat video (that is, a person judges from the index data of a video whether it is a cheat video and then marks the videos judged to be cheat videos);
Train a cheat video decision tree on the labeled video sample data using the ID3 algorithm to obtain a decision tree model;
Identify the videos to be checked on the video website with the generated decision tree model to judge whether each is a cheat video.
First, a vocabulary of hot words used by cheating videos is prepared. Then the play data of these videos is obtained from the back-end play logs, and data such as the play count, likes and dislikes, favorites, citations, and user name is obtained from the static information of the videos.
The hot word vocabulary used in this embodiment is as follows:
'Ma Yun', 'Ma Huateng', 'Li Yanhong', 'success', 'the Chen An it', 'that starts an undertaking pursue a goal with determination', 'Wang Jianlin', 'Liu Qiang East', 'Lei Jun', 'Qiao Busi', 'Luo Yonghao', 'Zhang Chaoyang', 'Zhou Hong', 'Bill Gates', 'Zhao Benshan', 'Song little Bao', 'White hundred what', 'plumage springs', 'Huang Xiaoming', 'Guo Degang', 'Yue Yunpeng', 'Cheng Long', 'Liu Dehua', 'Liu Jialing', 'Liang Chaowei', 'Guo Fucheng', 'Zeng Shiqiang', 'Liang Kaien', 'Yu Lingxiong', 'Zhai Hong', 'Amway', 'unlimited pole', 'Avon', 'sky lion', 'rose Lin Kai', 'Long Liqi', 'Zhao Liying', 'deer break', 'Liu Yifei', 'Li Yifeng', 'Liu Shishi', 'Du Yunsheng', 'Xu Hening', 'Li Jiacheng', 'Niu Gensheng', 'Yang Yuanqing', 'Li Kaifu', 'Ren Zhengfei', 'Tang Jun', 'fourth', 'of heap of stone Shi Yuzhu', 'Yu Minhong', 'Liu Chuanzhi', 'cloud business', 'as newly', 'the Internet', 'silk', 'Liu Yimiao', 'China's dream', 'opportunity', 'business', 'battalion Pin', 'today's tops', 'Zhejiang business', 'Tao Yang's ring', 'investment', 'marketing', 'destiny', 'make progress every day', 'happy comedy people', 'Success', 'superman', 'Anthony guest sieve', 'Zheng Shuan', 'Wu Qilong', 'trend', 'Ji Zhongzhan', 'state treasure', 'Deng Chao', 'suddenly Jian Hua', 'poplar power', 'finance and economics youth's eye', 'Zhao Wei', 'the Negotiator', 'hero alliance', 'match in spring', 'I be singer', 'Happy base camp'.
With the hot word vocabulary, the degree to which hot words are piled up in a video can be judged.
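One simple way to measure hot-word stuffing is to count how many vocabulary entries occur in a video title. The sketch below is an assumption about how such a count could be computed; the patent does not state whether distinct words or total occurrences are counted, and the function name and the shortened vocabulary are hypothetical.

```python
# A few entries taken from the vocabulary above; a real list would hold all of them.
HOT_WORDS = ["Ma Yun", "Lei Jun", "Bill Gates", "marketing", "investment"]

def hot_word_count(title, hot_words=HOT_WORDS):
    """Number of distinct hot words that occur in the title (case-insensitive)."""
    lowered = title.lower()
    return sum(1 for word in hot_words if word.lower() in lowered)

stuffed = hot_word_count("Ma Yun and Lei Jun talk marketing and investment")
plain = hot_word_count("How to cook noodles at home")
```

Substring matching suits Chinese titles, which have no word separators, but it can over-count in languages where one hot word is a substring of an unrelated word.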
Generating the decision rules:
From the video data obtained, the various feature tables of the videos can be built. For example, a data fragment for videos is shown in Table 1 below:
Table 1
In the table, an accountName value of 1 indicates that the user name has the format game_XXXXXX.
The random video samples are labeled, and the decision rules generated with the decision tree algorithm can be as shown in Fig. 2. Each comparison threshold shown in Fig. 2 is obtained from the decision tree training. As can be seen, there are 4 paths that identify a cheating video:
accountName = 1;
accountName <> 1 and wordCount > 4;
accountName <> 1 and wordCount > 2 and avgPP < 0.5;
accountName <> 1 and wordCount < 2 and avgPP < 0.01.
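The four identification paths can be written directly as a rule function. The sketch below encodes them exactly as listed; the function and parameter names are hypothetical, and reading avgPP as the average play completion rate is an assumption, since the text does not expand the abbreviation.

```python
def is_cheating(account_name, word_count, avg_pp):
    """Return True if the video matches any of the four cheating paths."""
    if account_name == 1:                       # path 1: user name like game_XXXXXX
        return True
    if word_count > 4:                          # path 2: many hot words
        return True
    if word_count > 2 and avg_pp < 0.5:         # path 3: some hot words, low avgPP
        return True
    if word_count < 2 and avg_pp < 0.01:        # path 4: few hot words, very low avgPP
        return True
    return False

flagged = is_cheating(account_name=0, word_count=3, avg_pp=0.4)
```

As listed, the paths leave wordCount = 2 uncovered when accountName is not 1, so such a video is never flagged by this function regardless of avgPP.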
Using the decision tree and decision rules learned above, the data of cheating videos is generated by an offline computation every day, and the video search engine demotes these videos so that they are at a great disadvantage in the ranking.
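The patent does not specify how the demotion is applied. One possible form, shown here purely as an assumption, multiplies the relevance score of every flagged video by a penalty factor before sorting; the function name, the score format, and the factor 0.1 are hypothetical.

```python
def rerank(results, cheating_ids, penalty=0.1):
    """results: list of (video_id, score). Flagged videos have their score scaled down."""
    adjusted = [
        (vid, score * penalty if vid in cheating_ids else score)
        for vid, score in results
    ]
    # Higher adjusted score ranks first.
    return sorted(adjusted, key=lambda item: item[1], reverse=True)

# Hypothetical daily output of the offline job: the set of flagged video ids.
cheating_ids = {"v1"}
ranking = rerank([("v1", 0.9), ("v2", 0.5), ("v3", 0.2)], cheating_ids)
```

A multiplicative penalty keeps flagged videos retrievable while pushing them below comparable normal videos, which matches the stated goal of placing them at a disadvantage rather than removing them.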
Embodiment 3
This embodiment provides a cheating video identification device. Fig. 3 is a structural block diagram of the device. As shown in Fig. 3, the device includes the following components:
Acquisition module 31, configured to obtain video information and video log data from a video website;
Determining module 32, configured to identify each video in the obtained video information and video log data according to initial index items and determine whether the video is a cheating video, where the initial index items define the judgment indexes for cheating videos;
Training module 33, configured to train on the identified sample data with a decision tree algorithm and generate a decision tree model;
Identification module 34, configured to use the decision tree model to identify whether a video is a cheating video.
Optionally, in this embodiment, the initial index items may specifically include: the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video within the preset time period, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video. In a specific implementation, a single one of the above initial index items may be selected, or several may be selected at the same time.
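The average play completion rate can be computed from per-play watch times, for example as in the sketch below. Averaging the per-play ratio over all plays and capping each ratio at 1.0 are assumptions; the patent only defines the rate as the watched portion of a played video relative to the whole video. All names are hypothetical.

```python
def average_completion_rate(watched_seconds, duration_seconds):
    """Mean of (watched time / video length) over all plays of one video."""
    if duration_seconds <= 0 or not watched_seconds:
        return 0.0
    # Cap at 1.0 so that replays or seeking cannot push a single play above 100%.
    ratios = [min(w / duration_seconds, 1.0) for w in watched_seconds]
    return sum(ratios) / len(ratios)

# Three plays of a hypothetical 120-second video.
rate = average_completion_rate([30, 60, 120], 120)
```

A video whose title attracts clicks but whose viewers leave almost immediately ends up with a rate close to zero, which is the behavior the decision rules test for.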
Optionally, the identification module 34 is specifically configured to: perform at least one of the following judgments on the video information and/or the video log data according to the target value of each index item obtained by training: judge whether the title of the video matches the target value for the video title; judge whether the play count of the video within the preset time period matches the play count in the target values; judge whether the number of popular keywords contained in the video title matches the number of popular keywords in the target values; judge whether the number of user interactions with the video matches the number of user interactions in the target values; and judge whether the average play completion rate of the video matches the play completion rate in the target values, where the average play completion rate is the ratio of the watched portion of a played video to the whole video. A video that matches at least one target value is determined to be a cheating video.
The determining module 32 is specifically configured to: when the video was not played within one log cycle, identify the video by the following initial index items to determine whether it is a cheating video: the format of the video title, the play count of the video within the preset time period, the number of user interactions with the video, and the number of popular keywords contained in the video title.
Optionally, the determining module may be specifically configured to: when the video was played at least once within one log cycle, identify the video by the following initial index items to determine whether it is a cheating video: the format of the video title, the play count of the video within the preset time period, the number of user interactions with the video, the number of popular keywords contained in the video title, and the average play completion rate of the video, where the average play completion rate is the ratio of the watched portion of a played video to the whole video.
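The two cases handled by the determining module differ only in whether the average play completion rate is available. Following the not-played case as stated in the claims, the selection of index items could be sketched as below; the item names and the function are hypothetical labels for the index items named in the text.

```python
# Index items usable for every video, played or not (hypothetical identifiers).
BASE_ITEMS = ["title_format", "play_count", "interaction_count", "hot_word_count"]

def index_items_for(plays_in_log_cycle):
    """Choose the initial index items for a video from its plays in one log cycle."""
    items = list(BASE_ITEMS)
    if plays_in_log_cycle >= 1:
        # A completion rate only exists once the video has been played at least once.
        items.append("avg_completion_rate")
    return items

unplayed_items = index_items_for(0)
played_items = index_items_for(3)
```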
The above are only embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of the claims of the present invention.
Claims (10)
1. A cheating video identification method, characterized by comprising:
obtaining video information and video log data from a video website;
identifying each video in the obtained video information and video log data according to initial index items, and determining whether said video is a cheating video, wherein said initial index items define the judgment indexes for cheating videos;
training on the identified sample data with a decision tree algorithm to generate a decision tree model;
using said decision tree model to identify whether a video is a cheating video.
2. The method according to claim 1, characterized in that said initial index items include at least one of the following:
the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video within the preset time period, the number of popular keywords contained in the video title, and the average play completion rate of the video, wherein said average play completion rate is the ratio of the watched portion of a played video to the whole video.
3. The method according to claim 2, characterized in that said using said decision tree model to identify whether a video is a cheating video comprises:
performing at least one of the following judgments on said video information and/or said video log data according to the target value of each index item obtained by training:
judging whether the title of said video matches the target value for the video title; judging whether the play count of said video within the preset time period matches the play count in said target values; judging whether the number of popular keywords contained in the title of said video matches the number of popular keywords in said target values; judging whether the number of user interactions with the video matches the number of user interactions in said target values; and judging whether the average play completion rate of said video matches the play completion rate in said target values, wherein said average play completion rate is the ratio of the watched portion of a played video to the whole video;
determining a video that matches at least one of said target values to be a cheating video.
4. The method according to claim 1, characterized in that said identifying each video in the obtained video information and video log data according to initial index items, and determining whether said video is a cheating video, comprises:
when said video was not played within one log cycle, identifying said video by the following initial index items to determine whether said video is a cheating video:
the format of the video title, the play count of the video within the preset time period, the number of user interactions with the video, and the number of popular keywords contained in the video title.
5. The method according to claim 1, characterized in that said identifying each video in the obtained video information and video log data according to initial index items, and determining whether said video is a cheating video, comprises:
when said video was played at least once within one log cycle, identifying said video by the following initial index items to determine whether said video is a cheating video:
the format of the video title, the play count of the video within the preset time period, the number of user interactions with the video, the number of popular keywords contained in the video title, and the average play completion rate of the video, wherein said average play completion rate is the ratio of the watched portion of a played video to the whole video.
6. A cheating video identification device, characterized by comprising:
an acquisition module, configured to obtain video information and video log data from a video website;
a determining module, configured to identify each video in the obtained video information and video log data according to initial index items and determine whether said video is a cheating video, wherein said initial index items define the judgment indexes for cheating videos;
a training module, configured to train on the identified sample data with a decision tree algorithm and generate a decision tree model;
an identification module, configured to use said decision tree model to identify whether a video is a cheating video.
7. The device according to claim 6, characterized in that said initial index items include at least one of the following:
the format of the video title, the play count of the video within a preset time period, the number of user interactions with the video within the preset time period, the number of popular keywords contained in the video title, and the average play completion rate of the video, wherein said average play completion rate is the ratio of the watched portion of a played video to the whole video.
8. The device according to claim 7, characterized in that said identification module is specifically configured to:
perform at least one of the following judgments on said video information and/or said video log data according to the target value of each index item obtained by training:
judge whether the title of said video matches the target value for the video title; judge whether the play count of said video within the preset time period matches the play count in said target values; judge whether the number of popular keywords contained in the title of said video matches the number of popular keywords in said target values; judge whether the number of user interactions with the video matches the number of user interactions in said target values; and judge whether the average play completion rate of said video matches the play completion rate in said target values, wherein said average play completion rate is the ratio of the watched portion of a played video to the whole video;
determine a video that matches at least one of said target values to be a cheating video.
9. The device according to claim 1, characterized in that said determining module is specifically configured to:
when said video was not played within one log cycle, identify said video by the following initial index items to determine whether said video is a cheating video:
the format of the video title, the play count of the video within the preset time period, the number of user interactions with the video, and the number of popular keywords contained in the video title.
10. The device according to claim 6, characterized in that said determining module is specifically configured to:
when said video was played at least once within one log cycle, identify said video by the following initial index items to determine whether said video is a cheating video:
the format of the video title, the play count of the video within the preset time period, the number of user interactions with the video, the number of popular keywords contained in the video title, and the average play completion rate of the video, wherein said average play completion rate is the ratio of the watched portion of a played video to the whole video.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610892400.7A CN106326498A (en) | 2016-10-13 | 2016-10-13 | Cheat video identification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610892400.7A CN106326498A (en) | 2016-10-13 | 2016-10-13 | Cheat video identification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106326498A true CN106326498A (en) | 2017-01-11 |
Family
ID=57820301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610892400.7A Pending CN106326498A (en) | 2016-10-13 | 2016-10-13 | Cheat video identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106326498A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108764021A (en) * | 2018-04-04 | 2018-11-06 | 北京奇艺世纪科技有限公司 | A kind of cheating video frequency identifying method and device |
CN109165691A (en) * | 2018-09-05 | 2019-01-08 | 北京奇艺世纪科技有限公司 | Training method, device and the electronic equipment of the model of cheating user for identification |
CN109840445A (en) * | 2017-11-24 | 2019-06-04 | 优酷网络技术(北京)有限公司 | A kind of recognition methods and system of video of practising fraud |
CN110147472A (en) * | 2017-07-14 | 2019-08-20 | 北京搜狗科技发展有限公司 | Detection method, device and the detection device for website of practising fraud of cheating website |
CN110290400A (en) * | 2019-07-29 | 2019-09-27 | 北京奇艺世纪科技有限公司 | The recognition methods of suspicious brush amount video, true playback volume predictor method and device |
CN110381375A (en) * | 2018-04-13 | 2019-10-25 | 武汉斗鱼网络科技有限公司 | A kind of determining method, client and server for stealing brush data |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2563014A2 (en) * | 2007-02-21 | 2013-02-27 | Nds Limited | Method for content presentation |
CN103064850A (en) * | 2011-10-20 | 2013-04-24 | 腾讯科技(深圳)有限公司 | Method and system of digging cheating data |
CN105183897A (en) * | 2015-09-29 | 2015-12-23 | 北京奇艺世纪科技有限公司 | Method and system for ranking video retrieval |
CN105574199A (en) * | 2015-12-28 | 2016-05-11 | 合一网络技术(北京)有限公司 | Identification method and device for false search behavior of search engine |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110147472A (en) * | 2017-07-14 | 2019-08-20 | 北京搜狗科技发展有限公司 | Detection method, device and the detection device for website of practising fraud of cheating website |
CN110147472B (en) * | 2017-07-14 | 2021-10-15 | 北京搜狗科技发展有限公司 | Detection method and device for cheating sites and detection device for cheating sites |
CN109840445A (en) * | 2017-11-24 | 2019-06-04 | 优酷网络技术(北京)有限公司 | A kind of recognition methods and system of video of practising fraud |
CN109840445B (en) * | 2017-11-24 | 2021-10-01 | 阿里巴巴(中国)有限公司 | Method and system for identifying cheating videos |
CN108764021A (en) * | 2018-04-04 | 2018-11-06 | 北京奇艺世纪科技有限公司 | A kind of cheating video frequency identifying method and device |
CN108764021B (en) * | 2018-04-04 | 2021-03-26 | 北京奇艺世纪科技有限公司 | Cheating video identification method and device |
CN110381375A (en) * | 2018-04-13 | 2019-10-25 | 武汉斗鱼网络科技有限公司 | A kind of determining method, client and server for stealing brush data |
CN109165691A (en) * | 2018-09-05 | 2019-01-08 | 北京奇艺世纪科技有限公司 | Training method, device and the electronic equipment of the model of cheating user for identification |
CN109165691B (en) * | 2018-09-05 | 2022-04-22 | 北京奇艺世纪科技有限公司 | Training method and device for model for identifying cheating users and electronic equipment |
CN110290400A (en) * | 2019-07-29 | 2019-09-27 | 北京奇艺世纪科技有限公司 | The recognition methods of suspicious brush amount video, true playback volume predictor method and device |
CN110290400B (en) * | 2019-07-29 | 2022-06-03 | 北京奇艺世纪科技有限公司 | Suspicious brushing amount video identification method, real playing amount estimation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106326498A (en) | Cheat video identification method and device | |
Xue et al. | Detecting fake news by exploring the consistency of multimodal data | |
CN106326497A (en) | Cheating video user identification method and device | |
CN104317959B (en) | Data digging method based on social platform and device | |
Sharifi et al. | Summarizing microblogs automatically | |
CN102929873B (en) | Method and device for extracting searching value terms based on context search | |
KR101536520B1 (en) | Method and server for extracting topic and evaluating compatibility of the extracted topic | |
Firan et al. | Bringing order to your photos: event-driven classification of flickr images based on social knowledge | |
Abel et al. | Twitcident: fighting fire with information from social web streams | |
US10372717B2 (en) | Systems and methods for identifying documents based on citation history | |
Jakob et al. | Beyond the stars: exploiting free-text user reviews to improve the accuracy of movie recommendations | |
CN104994424B (en) | A kind of method and apparatus for building audio and video standard data set | |
Saeed et al. | Crowdsourced fact-checking at Twitter: How does the crowd compare with experts? | |
CN104516986A (en) | Method and device for recognizing sentence | |
CN101520802A (en) | Question-answer pair quality evaluation method and system | |
CN101609459A (en) | A kind of extraction system of affective characteristic words | |
Tran et al. | Leveraging learning to rank in an optimization framework for timeline summarization | |
CN103279504B (en) | A kind of searching method and device based on ambiguity resolution | |
Theisen et al. | Automatic discovery of political meme genres with diverse appearances | |
CN106357416A (en) | Group information recommendation method, device and terminal | |
CN105574199B (en) | Method and device for identifying false search behavior of search engine | |
TW201405341A (en) | Information Classification Based on Product Recognition | |
CN103123624A (en) | Method of confirming head word, device of confirming head word, searching method and device | |
CN101894129B (en) | Video topic finding method based on online video-sharing website structure and video description text information | |
CN109033286B (en) | Data statistical method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | C10 | Entry into substantive examination |
 | SE01 | Entry into force of request for substantive examination |
 | CB02 | Change of applicant information | Address after: 100080 A 5 C, block A, China International Steel Plaza, 8 Haidian Avenue, Haidian District, Beijing. Applicant after: Youku network technology (Beijing) Co., Ltd. Address before: 100080 A 5 C, block A, China International Steel Plaza, 8 Haidian Avenue, Haidian District, Beijing. Applicant before: 1Verge Inc.
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20170111