CN109840445B - Method and system for identifying cheating videos - Google Patents


Info

Publication number: CN109840445B
Authority: CN (China)
Prior art keywords: vocabulary, characteristic, feature, cheating, vocabulary set
Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201711188045.6A
Other languages: Chinese (zh)
Other versions: CN109840445A (en)
Inventor: 张深源
Current Assignee: Youku Culture Technology Beijing Co ltd (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Alibaba China Co Ltd

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a method and a system for identifying a cheating video. The method comprises: acquiring title information of a target video, and extracting feature words from the title information; dividing the feature words into at least one feature vocabulary set according to the category to which each feature word belongs, wherein the feature words in the same feature vocabulary set belong to the same category; acquiring a recognition threshold associated with the current feature vocabulary set, and judging, based on the recognition threshold, whether the current feature vocabulary set is an abnormal vocabulary set; and if the current feature vocabulary set is an abnormal vocabulary set, determining that the target video is a cheating video. The technical solution provided by the present application can improve the accuracy of identifying cheating videos.

Description

Method and system for identifying cheating videos
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method and a system for identifying a cheating video.
Background
With the continuous development of internet technology, more and more video playing platforms emerge. At present, a video playing platform usually counts the click rate of each video. Therefore, the user can judge the popularity of the video content according to the click rate of the video, so as to selectively watch the video.
At present, to inflate click rates, some uploaders give their cheating videos false titles. A false title may be unrelated to the actual content of the video and merely piles up hot search words, so that when a user searches for a currently popular video, the false title appears in the search results and fraudulently attracts the user's click. For example, if a false video title is "Jinxing Show Happy Boy The Voice of China Running Man latest collection", the title will appear in the search results when a user searches for "Jinxing Show" or "The Voice of China".
To identify cheating videos among a large number of videos, the number of hot search words appearing in a single video title can currently be limited. For example, the upper limit may be set to 3, so that a video whose title contains 4 or more hot search words is determined to be a cheating video. However, this existing method causes many normal videos to be judged as cheating videos. For example, suppose a video is titled "Deng Chao Zheng Kai Bao Bei'er Li Chen Happy Camp collection". This title contains 5 hot search words, so the existing method would determine the video to be a cheating video. In fact, the stars named in the title all appear in the same variety show, so their names are not a simple pile of hot search words but a normal list of participants, and the video is not a cheating video. As can be seen, the prior-art method cannot accurately identify cheating videos.
Disclosure of Invention
The embodiment of the application aims to provide a method and a system for identifying a cheating video, which can improve the identification accuracy of the cheating video.
In order to achieve the above object, an embodiment of the present application provides a method for identifying a cheating video, the method including: acquiring title information of a target video, and extracting feature words from the title information; dividing the feature words into at least one feature vocabulary set according to the category to which each feature word belongs, wherein the feature words in the same feature vocabulary set belong to the same category; acquiring a recognition threshold associated with the current feature vocabulary set, and judging, based on the recognition threshold, whether the current feature vocabulary set is an abnormal vocabulary set; and if the current feature vocabulary set is an abnormal vocabulary set, determining that the target video is a cheating video.
To achieve the above object, the present application further provides a system for identifying a cheating video, the system including a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the following steps: acquiring title information of a target video, and extracting feature words from the title information; dividing the feature words into at least one feature vocabulary set according to the category to which each feature word belongs, wherein the feature words in the same feature vocabulary set belong to the same category; acquiring a recognition threshold associated with the current feature vocabulary set, and judging, based on the recognition threshold, whether the current feature vocabulary set is an abnormal vocabulary set; and if the current feature vocabulary set is an abnormal vocabulary set, determining that the target video is a cheating video.
Thus, according to the technical solution provided by the present application, when the title information of the target video is examined, the feature words in the title information are extracted first. In practical applications, the feature words may be the current hot search words. The extracted feature words may then be classified to obtain at least one feature vocabulary set. Feature vocabulary sets of different categories may be associated with different recognition thresholds, and each recognition threshold serves as an upper limit on the number of feature words that a feature vocabulary set of that category may contain. If the number of feature words in a feature vocabulary set exceeds the associated recognition threshold, the set is considered an abnormal vocabulary set, and the target video can be determined to be a cheating video. The decision criteria for different feature vocabulary sets may therefore differ. For example, the recognition threshold for a feature vocabulary set of the entertainment star category may be slightly larger, while the recognition threshold for a feature vocabulary set of the program name category may be slightly smaller. The value of each recognition threshold may be obtained by counting the number of feature words contained in the titles of normal videos. In this way, different criteria are applied to different categories of hot search words, the misjudgments caused by a single uniform criterion are avoided, and the accuracy of identifying cheating videos is improved.
Drawings
To illustrate the embodiments of the present application or the prior-art technical solutions more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a diagram illustrating the steps of a method for identifying a cheating video according to an embodiment of the present application;
Fig. 2 is a flowchart of a method for identifying a cheating video according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a system for identifying a cheating video according to an embodiment of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application shall fall within the scope of protection of the present application.
The application provides a cheating video identification method which can be applied to a server of a video playing website. Referring to fig. 1 and 2, the method may include the following steps.
S1: Acquire the title information of the target video, and extract the feature words from the title information.
In this embodiment, the target video may be a video to be identified. The target video may have title information, which may be text set for the target video by the video uploader. For example, the title information of the target video may be "Jinxing Show Happy Boy The Voice of China Running Man latest collection".
In this embodiment, whether the target video is a cheating video may be determined by examining its title information. In the server, uploaded video data can be stored in association with video information. The video information may include items such as the duration, name, and type of the video and the user name of the uploader. Thus, to obtain the title information of the target video, the character string representing the video name can be read from the video information associated with the target video.
In the present embodiment, after the title information of the target video is acquired, the content of the title information can be analyzed. Specifically, the feature words in the title information may be extracted. A feature word may be a word that is searched frequently in the current video playing website. In practical applications, the video playing website may count the number of searches for each word within a specified time period and then sort the searched words in descending order of search count. The top-ranked words can then be used as the feature words of the video playing website. For example, the video playing website may take the 100 top-ranked hot search words of the past week as its feature words.
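The ranking step described above can be sketched as follows. This is a minimal illustration, assuming the search log for the chosen time period is already available as a list of query strings; the function name `top_hot_words` and the sample log are hypothetical and do not come from the application.

```python
from collections import Counter


def top_hot_words(search_log, top_n=100):
    """Return the top_n most frequently searched terms, most searched first."""
    counts = Counter(search_log)
    return [word for word, _ in counts.most_common(top_n)]


# Hypothetical search log for the specified time period.
log = ["Running Man", "Li Chen", "Running Man", "Happy Camp", "Li Chen", "Running Man"]
hot = top_hot_words(log, top_n=2)
```

Here `top_n` plays the role of the "top 100" cutoff in the example; the actual cutoff and time window are choices left to the video playing website.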
In the present embodiment, when extracting the feature words, the title information may first be segmented to obtain the words it contains. A preset lexicon can be used during segmentation to recognize the words in the title information. In practical applications, various word segmenters can be used, for example the friso, Jcseg, or MMSEG4J word segmenter. Furthermore, to improve the accuracy of segmenting video titles, the segmenter's lexicon can be built from words commonly used in the video playing website, so that the segmenter's output better matches the wording habits of that website.
In the present embodiment, after segmentation yields a plurality of words, those words that appear in a hot search word set may be used as the feature words of the title information. The hot search words in the set can be determined according to their search counts within a specified time period. For example, the video playing website may take the 100 top-ranked hot search words of the past week and form them into a hot search word set. After the title information of the target video has been segmented, the words that appear in the hot search word set can be extracted as feature words. Feature words are extracted because a cheating video aims to fraudulently obtain clicks by piling up several currently hot search words in its title information. The extracted feature words can therefore be analyzed subsequently to judge whether the target video is a cheating video.
S3: Divide the feature words into at least one feature vocabulary set according to the category to which each feature word belongs, wherein the feature words in the same feature vocabulary set belong to the same category.
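The segmentation and filtering described above can be sketched as follows. The application names existing word segmenters; this sketch does not use them and instead substitutes a simple greedy forward maximum matching against a lexicon, purely for illustration. The function names, the lexicon, and the Chinese sample strings are all assumptions and are not taken from the application.

```python
def segment(title, lexicon):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry; fall back to a single character when nothing matches."""
    max_len = max((len(word) for word in lexicon), default=1)
    words, i = [], 0
    while i < len(title):
        for size in range(min(max_len, len(title) - i), 0, -1):
            piece = title[i:i + size]
            if piece in lexicon or size == 1:
                words.append(piece)
                i += size
                break
    return words


def extract_feature_words(title, lexicon, hot_words):
    """Keep only the segmented words that belong to the hot search word set."""
    return [word for word in segment(title, lexicon) if word in hot_words]


# Illustrative lexicon and hot search word set (assumed, not from the source).
lexicon = {"金星秀", "中国好声音", "最新", "合集"}
hot_words = {"金星秀", "中国好声音"}
features = extract_feature_words("金星秀中国好声音最新合集", lexicon, hot_words)
```

A production system would more likely rely on a dedicated segmenter with a site-specific lexicon, as the text suggests; the filtering step against the hot search word set stays the same either way.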
In the present embodiment, the feature words may be classified according to the category to which they belong, and the categories may be defined according to the user's search intent. Specifically, the categories may include a program name category, a person category, a self-media category, and a sensitive word category. The program name category may cover the names of variety shows or abbreviations of those names, for example feature words such as "Running Man", "Jinxing Show", and "The Voice of China". The person category may cover the names of public figures or their aliases, for example feature words such as "Li Chen", "Ma Yun", and "Buffett". The self-media category may cover the names of PGC (Professionally Generated Content) in the video playing website or the names of uploaders, for example feature words such as "League of Legends" and "Evening Maple". The sensitive word category may cover feature words with inappropriate connotations.
It should be noted that, for the categories of the feature vocabulary, in an actual application scenario, a certain category may be divided more finely, so as to obtain multiple sub-categories in one category. For example, the character class may include a plurality of sub-classes such as entertainment class characters, finance class characters, politics class characters, and the like.
In the present embodiment, after the feature words are extracted from the title information of the target video, they may be classified according to the category to which they belong. Feature words belonging to the same category are placed in the same feature vocabulary set. In this way at least one feature vocabulary set is obtained, and the feature words within each set share the same category. For example, for the title information "Jinxing Show Happy Boy The Voice of China Bao Bei'er Li Chen latest collection", two feature vocabulary sets can be obtained: "Jinxing Show, Happy Boy, The Voice of China" and "Bao Bei'er, Li Chen".
S5: Acquire the recognition threshold associated with the current feature vocabulary set, and judge, based on the recognition threshold, whether the current feature vocabulary set is an abnormal vocabulary set.
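The division into feature vocabulary sets can be sketched as a simple grouping by category. This is a minimal illustration, assuming every feature word has already been assigned a category in a lookup table; the table `CATEGORY_OF`, the category labels, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical word-to-category lookup; a real system would maintain this
# alongside the hot search word set.
CATEGORY_OF = {
    "Jinxing Show": "program_name",
    "Happy Boy": "program_name",
    "The Voice of China": "program_name",
    "Bao Bei'er": "person",
    "Li Chen": "person",
}


def group_by_category(feature_words, category_of):
    """Place feature words of the same category into the same set (list)."""
    sets = defaultdict(list)
    for word in feature_words:
        sets[category_of[word]].append(word)
    return dict(sets)


words = ["Jinxing Show", "Happy Boy", "The Voice of China", "Bao Bei'er", "Li Chen"]
groups = group_by_category(words, CATEGORY_OF)
```

Each key of `groups` corresponds to one feature vocabulary set, so the later steps can look up a recognition threshold per key.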
In general, the number of feature words that the title information of a normal video contains differs from category to category. For example, feature words of the program name category generally appear no more than three times in the same title information, while feature words for entertainment figures generally appear no more than five times. Therefore, to avoid misjudging a normal video as a cheating video, different identification strategies may be formulated for different categories in the present embodiment.
In the present embodiment, a recognition threshold may be determined in advance for each category of feature vocabulary set and used to judge whether the number of feature words in a set is normal. The recognition threshold may serve as an upper limit on the number of feature words the set may contain. If the number of feature words in a feature vocabulary set is larger than the recognition threshold, hot search words have been piled up in the corresponding title information. Since different feature vocabulary sets may be associated with different recognition thresholds, the recognition threshold associated with the current feature vocabulary set is obtained first when judging that set. Each recognition threshold may be stored in association with its category in a server of the video playing website. The category may be used as the key and the associated recognition threshold as the value, so that the recognition thresholds are stored as key-value pairs. Once the category of the current feature vocabulary set is determined, the associated recognition threshold can be read.
In this embodiment, the recognition threshold may be obtained by statistical analysis of the title information of normal videos. Specifically, the title information of a preset number of non-cheating videos can be obtained in advance, and the maximum number of feature words of a specified category contained in any single piece of non-cheating title information can be counted. For example, the title information of 5000 non-cheating videos may be acquired, and the number of feature words of the program name category contained in each of the 5000 titles may be counted. Comparing these counts yields their maximum. The maximum may be taken as the upper limit on the number of feature words of the specified category in a non-cheating video, and therefore as the recognition threshold associated with the feature vocabulary set of that category. For example, if analysis of a large amount of normal title information shows that the title of a normal video generally mentions at most 2 program names, the recognition threshold for the program name category may be set to 2.
In the present embodiment, after the recognition threshold associated with the current feature vocabulary set is obtained, whether the current feature vocabulary set is an abnormal vocabulary set can be judged based on that threshold. Specifically, if the number of feature words in the current feature vocabulary set is greater than the associated recognition threshold, the current feature vocabulary set may be determined to be an abnormal vocabulary set. For example, if the recognition threshold associated with the program name category is 2 and the feature vocabulary set of the program name category contains more than 2 feature words, that set may be determined to be an abnormal vocabulary set. Conversely, if the number of feature words in the current feature vocabulary set is less than or equal to the associated recognition threshold, the current feature vocabulary set may be determined not to be an abnormal vocabulary set.
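The statistical derivation of a recognition threshold can be sketched as taking a maximum over a sample of non-cheating titles. This is a minimal sketch, assuming each non-cheating title has already been reduced to a mapping from category to its feature words; the function name and the three-title sample are hypothetical (the text's example uses 5000 titles).

```python
def derive_recognition_threshold(normal_titles, category):
    """Maximum number of feature words of `category` seen in any one
    non-cheating title; 0 if the sample is empty."""
    counts = [len(title_sets.get(category, [])) for title_sets in normal_titles]
    return max(counts, default=0)


# Hypothetical sample of non-cheating titles, already grouped by category.
normal_titles = [
    {"program_name": ["Running Man"], "person": ["Li Chen", "Bao Bei'er"]},
    {"program_name": ["Jinxing Show", "The Voice of China"]},
    {"person": ["Deng Chao", "Zheng Kai", "Li Chen"]},
]
program_threshold = derive_recognition_threshold(normal_titles, "program_name")
person_threshold = derive_recognition_threshold(normal_titles, "person")
```

Because the threshold is a raw maximum, a single mislabeled cheating title in the sample would inflate it, so the quality of the non-cheating sample matters.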
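The key-value lookup and the comparison can be sketched together. This is a minimal sketch in which an in-memory dictionary stands in for the server-side key-value store; the threshold values follow the examples in the text (2 for program names, 5 for entertainment figures), while the dictionary name, category labels, and function name are assumptions.

```python
# Category -> recognition threshold, standing in for the server's key-value store.
RECOGNITION_THRESHOLDS = {"program_name": 2, "entertainment_person": 5}


def is_abnormal_set(category, words, thresholds=RECOGNITION_THRESHOLDS):
    """A set is abnormal only when it holds strictly more feature words
    than the threshold associated with its category."""
    return len(words) > thresholds[category]


over = is_abnormal_set("program_name", ["Jinxing Show", "Happy Boy", "The Voice of China"])
at_limit = is_abnormal_set("program_name", ["Jinxing Show", "Happy Boy"])
```

Note that a set sitting exactly at the threshold is still normal, matching the "less than or equal to" condition in the text.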
S7: If the current feature vocabulary set is an abnormal vocabulary set, determine that the target video is a cheating video.
In this embodiment, if the current feature vocabulary set is an abnormal vocabulary set, the feature words in that set are suspected of being piled-up hot search words. The title information of the target video may correspond to several feature vocabulary sets, and if any one of them is abnormal, the target video can be determined to be a cheating video. For example, for the title information "Jinxing Show Happy Boy The Voice of China latest episode Bao Bei'er Li Chen chat about ideals", although the feature vocabulary set "Bao Bei'er, Li Chen" is a normal vocabulary set, the feature vocabulary set "Jinxing Show, Happy Boy, The Voice of China" is an abnormal vocabulary set, so the video corresponding to this title information can be determined to be a cheating video.
In one embodiment, if all the feature vocabulary sets obtained by the division are normal vocabulary sets, a further overall judgment can be made. Specifically, the total number of feature vocabulary sets obtained from the title information of the target video may be counted. For example, the title information "Running Man latest episode Bao Bei'er Li Chen chat about ideals" yields two feature vocabulary sets, so the total number of feature vocabulary sets for this title information is 2. If the counted total is greater than a specified number threshold, the target video can be determined to be a cheating video. The specified number threshold defines an upper limit on the number of feature vocabulary sets of different categories that may appear in the same title information. In some cases, title information should be determined to be cheating title information even though none of its feature vocabulary sets exceeds the associated recognition threshold, because it contains feature vocabulary sets of too many different categories. For example, a title such as "Running Man latest episode Bao Bei'er Li Chen chat about ideals League of Legends Ma Yun Buffett talk about finance" contains four feature vocabulary sets (the person category being divided into entertainment figures and finance figures). The number of feature words in each set is normal, but because the total number of feature vocabulary sets is too large, the video corresponding to this title information can be determined to be a cheating video.
In the present embodiment, the specified number threshold may be obtained by statistical analysis of the title information of non-cheating videos. Specifically, the title information of a preset number of non-cheating videos can be obtained, and the maximum number of feature word categories contained in any single piece of non-cheating title information can be counted. The counted maximum may then be used as the specified number threshold.
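The decision of step S7 combined with the check on the total number of sets can be sketched as follows. This is a minimal sketch: the function names, category labels, per-category thresholds, and the sample of non-cheating titles are illustrative assumptions, not values given in the application.

```python
def derive_max_sets(normal_titles):
    """Specified number threshold: the largest number of distinct categories
    seen in any one non-cheating title."""
    return max((len(title_sets) for title_sets in normal_titles), default=0)


def is_cheating_title(sets, thresholds, max_sets):
    """Cheating if any set exceeds its own threshold, or, when every set is
    normal, if the title mixes more categories than max_sets allows."""
    if any(len(words) > thresholds[cat] for cat, words in sets.items()):
        return True
    return len(sets) > max_sets


# Assumed thresholds and a small assumed sample of non-cheating titles.
thresholds = {"program_name": 2, "entertainment_person": 5,
              "finance_person": 5, "self_media": 2}
max_sets = derive_max_sets([
    {"program_name": ["Running Man"], "entertainment_person": ["Li Chen"]},
    {"program_name": ["Jinxing Show"], "entertainment_person": ["Bao Bei'er"],
     "self_media": ["League of Legends"]},
])

two_sets = {"program_name": ["Running Man"],
            "entertainment_person": ["Bao Bei'er", "Li Chen"]}
four_sets = {**two_sets,
             "self_media": ["League of Legends"],
             "finance_person": ["Ma Yun", "Buffett"]}
```

With these assumed values, `two_sets` stays under both limits, while `four_sets` is flagged only because of the number of categories it mixes.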
In one embodiment, a more refined partitioning may be performed for a certain category therein, resulting in multiple sub-categories within a category. In this way, the feature words in the current feature word set can be divided into a plurality of sub-categories. For example, the character class may include a plurality of sub-classes such as entertainment class characters, finance class characters, politics class characters, and the like. Then in obtaining the recognition thresholds associated with the current feature vocabulary sets, recognition thresholds associated with respective sub-categories in the current feature vocabulary sets may be obtained. Subsequently in determining the abnormal vocabulary set, it may be determined whether the sub-category belongs to an abnormal sub-category based on an identification threshold associated with the sub-category. Specifically, the manner of determining whether the sub-category belongs to the abnormal sub-category is similar to the manner of determining the abnormal vocabulary set described in the above embodiment, and will not be further described here. If at least one abnormal sub-category exists in the current characteristic vocabulary set, it can be determined that the current characteristic vocabulary set belongs to an abnormal vocabulary set.
In an embodiment, if all the sub-categories in the current feature vocabulary set are normal sub-categories, whether the current feature vocabulary set is an abnormal vocabulary set may be further judged from the total number of sub-categories. Specifically, the total number of sub-categories contained in the current feature vocabulary set may be counted, and if that total is greater than a specified category threshold, the current feature vocabulary set may be determined to be an abnormal vocabulary set. The specified category threshold may also be obtained by statistical analysis of the title information of non-cheating videos. For example, if the current feature vocabulary set contains a sub-category of entertainment figures, a sub-category of finance figures, and a sub-category of political figures, and this exceeds the specified category threshold, the current feature vocabulary set can be determined to be an abnormal vocabulary set.
In an embodiment, if the category of the current feature vocabulary set is the self-media category, the recognition threshold associated with the self-media category may be obtained by statistical analysis of the title information of videos uploaded by key PGC users. Specifically, a plurality of non-cheating videos uploaded by users in a specified user group may be acquired, and the title information of each of these videos may be extracted. The specified user group may be the key PGC users mentioned above, that is, PGC users whose video upload volume reaches a specified number, or PGC users certified by the video playing website in the self-media category. Videos uploaded by key PGC users are usually non-cheating videos, so the recognition threshold for the self-media category can be obtained by statistical analysis of the title information of those videos. Specifically, as in the embodiment described above, the maximum number of feature words belonging to the self-media category in any single piece of non-cheating title information may be counted, and the counted maximum may be used as the recognition threshold associated with the current feature vocabulary set.
In one embodiment, a further judgment may be made for feature words of the sensitive word category. Specifically, if all the feature vocabulary sets obtained by the division are normal vocabulary sets, it can be determined whether they include a first feature vocabulary set representing sensitive words. If the first feature vocabulary set exists, it can further be determined whether the remaining feature vocabulary sets include a second feature vocabulary set representing program names. If the second feature vocabulary set exists, the target video can be determined to be a cheating video. The reason is as follows. If title information contains only a feature vocabulary set of the sensitive word category whose word count meets the requirement, it is not appropriate to determine the title information to be cheating title information, because the video may actually present content related to the sensitive words, in which case the title information involves no violation. However, if sensitive words and program names are edited into the same title information, the uploader is suspected of attracting clicks by combining program names with sensitive words. For example, if a piece of title information includes both a program name and a sensitive word, the video corresponding to that title information can be determined to be a cheating video.
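The sub-category checks can be sketched in the same style. This is a minimal sketch, assuming the current feature vocabulary set has already been split into sub-categories; the function name, the sub-category thresholds, the category threshold of 2, and the placeholder name "Politician A" are all hypothetical.

```python
def is_abnormal_by_subcategory(subsets, sub_thresholds, max_subcategories):
    """Abnormal if any sub-category exceeds its own threshold, or, when all
    sub-categories are normal, if there are too many sub-categories."""
    for sub, words in subsets.items():
        if len(words) > sub_thresholds[sub]:
            return True
    return len(subsets) > max_subcategories


# Assumed per-sub-category thresholds for the person category.
sub_thresholds = {"entertainment": 5, "finance": 3, "politics": 3}

three_subs = {"entertainment": ["Li Chen"], "finance": ["Ma Yun"],
              "politics": ["Politician A"]}
two_subs = {"entertainment": ["Li Chen"], "finance": ["Ma Yun"]}
abnormal = is_abnormal_by_subcategory(three_subs, sub_thresholds, max_subcategories=2)
normal = is_abnormal_by_subcategory(two_subs, sub_thresholds, max_subcategories=2)
```

The example with three sub-categories is flagged only under the assumed category threshold of 2; a different threshold derived from non-cheating titles could change that outcome.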
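The self-media threshold derivation can be sketched by restricting the statistics to videos from the specified user group. This is a minimal sketch; the record layout (`uploader`, `title_words`), the user identifiers, and the channel names other than "League of Legends" are hypothetical.

```python
def self_media_threshold(videos, key_pgc_users, self_media_words):
    """Maximum number of self-media feature words in any one title among
    videos uploaded by the specified (key PGC) user group."""
    counts = [
        sum(1 for word in video["title_words"] if word in self_media_words)
        for video in videos
        if video["uploader"] in key_pgc_users
    ]
    return max(counts, default=0)


# Hypothetical data: only pgc_a and pgc_b belong to the specified user group.
self_media_words = {"League of Legends", "Channel A", "Channel B", "Channel C"}
videos = [
    {"uploader": "pgc_a", "title_words": ["League of Legends", "highlights"]},
    {"uploader": "pgc_b", "title_words": ["League of Legends", "Channel A", "review"]},
    {"uploader": "anon", "title_words": ["League of Legends", "Channel A",
                                         "Channel B", "Channel C"]},
]
threshold = self_media_threshold(videos, {"pgc_a", "pgc_b"}, self_media_words)
```

The title from the uploader outside the group contains four self-media words but does not influence the result, which is the point of restricting the sample to key PGC users.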
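The sensitive-word rule can be sketched as a check applied only when every set is otherwise normal. This is a minimal sketch; the function name, category labels, thresholds, and the placeholder sensitive word are assumptions.

```python
def is_sensitive_program_combo(sets, thresholds):
    """True when all sets are within their thresholds and the title contains
    both a sensitive-word set and a program-name set."""
    all_normal = all(len(words) <= thresholds[cat] for cat, words in sets.items())
    has_sensitive = bool(sets.get("sensitive"))
    has_program = bool(sets.get("program_name"))
    return all_normal and has_sensitive and has_program


# Assumed thresholds; "<sensitive word>" is a placeholder, not a real term.
thresholds = {"sensitive": 1, "program_name": 2}
combo = {"sensitive": ["<sensitive word>"], "program_name": ["Running Man"]}
sensitive_only = {"sensitive": ["<sensitive word>"]}
flagged = is_sensitive_program_combo(combo, thresholds)
not_flagged = is_sensitive_program_combo(sensitive_only, thresholds)
```

A title with only a within-threshold sensitive word set is left alone, reflecting the reasoning that the video may genuinely be about that topic.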
Referring to fig. 3, the present application further provides a system for identifying a cheating video, where the system includes a memory and a processor, the memory stores a computer program, and the computer program, when executed by the processor, implements the following steps.
S1: Acquire the title information of the target video, and extract the feature words from the title information.
S3: Divide the feature words into at least one feature vocabulary set according to the category to which each feature word belongs, wherein the feature words in the same feature vocabulary set belong to the same category.
S5: Acquire the recognition threshold associated with the current feature vocabulary set, and judge, based on the recognition threshold, whether the current feature vocabulary set is an abnormal vocabulary set.
S7: If the current feature vocabulary set is an abnormal vocabulary set, determine that the target video is a cheating video.
In this embodiment, the category of the current feature vocabulary set is a self-media category; accordingly, the computer program, when executed by the processor, further implements the steps of:
acquiring a plurality of non-cheating videos uploaded by users in a designated user group, and extracting respective title information of the non-cheating videos;
counting the maximum number of feature words belonging to the self-media category in the same non-cheating title information;
and taking the counted maximum number as an identification threshold value associated with the current feature vocabulary set.
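The threshold derivation in the three steps above can be sketched as follows. The function name, the whitespace tokenization, and the choice to count distinct words per title are assumptions made for illustration; the application only states that the maximum count observed in a single non-cheating title is used as the threshold.

```python
from typing import Iterable, Set


def identification_threshold(non_cheating_titles: Iterable[str],
                             category_words: Set[str]) -> int:
    """Return the largest number of category words seen in any one title.

    `non_cheating_titles` stands for titles of videos uploaded by the
    designated (trusted) user group; `category_words` is a hypothetical
    lexicon for one category, for example the self-media category.
    """
    best = 0
    for title in non_cheating_titles:
        # Count distinct category words in this single title.
        count = len({w for w in title.lower().split() if w in category_words})
        best = max(best, count)
    return best
```

A title whose count for that category exceeds the returned value would then be treated as containing an abnormal vocabulary set.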
In this embodiment, the computer program, when executed by the processor, further implements the steps of:
if all of the feature vocabulary sets obtained by the division are normal vocabulary sets, judging whether the feature vocabulary sets obtained by the division include a first feature vocabulary set representing sensitive vocabulary;
if the first feature vocabulary set exists, judging whether the feature vocabulary sets obtained by the division, other than the first feature vocabulary set, include a second feature vocabulary set representing a program name;
and if the second feature vocabulary set exists, judging that the target video is a cheating video.
In this embodiment, the memory may include a physical device for storing information; typically, the information is digitized and then stored in a medium using an electrical, magnetic, or optical method. The memory according to this embodiment may include: devices that store information using electrical energy, such as RAM, ROM, and USB flash drives; devices that store information using magnetic energy, such as hard disks, floppy disks, magnetic tapes, core memories, and bubble memories; and devices that store information optically, such as CDs or DVDs. Of course, there are other types of memory, such as quantum memory and graphene memory.
In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.
The specific functions implemented by the memory and the processor of the identification system for cheating videos provided in the embodiments of this specification can be understood with reference to the foregoing embodiments, and achieve the same technical effects as those embodiments, so they are not described again here.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). However, as technology has advanced, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented with hardware modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly done with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); the most commonly used at present are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can be readily obtained simply by describing the method flow in one of the hardware description languages above and programming it into an integrated circuit.
Those skilled in the art also know that, besides implementing the identification system for cheating videos purely as computer-readable program code, the same functions can be implemented by logically programming the method steps so that the system takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such an identification system can therefore be regarded as a hardware component, and the means it includes for performing various functions can be regarded as structures within the hardware component. Alternatively, the means for performing the functions can be regarded as both software modules implementing the method and structures within the hardware component.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the identification system of a cheating video, reference may be made to the introduction of the embodiments of the method described above for a comparative explanation.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Although the present application has been described in terms of embodiments, those of ordinary skill in the art will recognize that there are numerous variations and permutations of the present application without departing from the spirit of the application, and it is intended that the appended claims encompass such variations and permutations without departing from the spirit of the application.

Claims (12)

1. A method for identifying a cheating video, the method comprising:
acquiring title information of a target video, and extracting feature words in the title information;
dividing the characteristic vocabulary into at least one characteristic vocabulary set according to the category to which the characteristic vocabulary belongs; the characteristic vocabularies in the same characteristic vocabulary set belong to the same category;
acquiring an identification threshold value associated with the current feature vocabulary set, and judging whether the current feature vocabulary set belongs to an abnormal vocabulary set based on the identification threshold value; wherein the feature vocabulary in the current feature vocabulary set is further divided into a plurality of sub-categories; accordingly, acquiring the identification threshold value associated with the current feature vocabulary set comprises: acquiring identification threshold values respectively associated with the sub-categories in the current feature vocabulary set;
and if the current feature vocabulary set belongs to an abnormal vocabulary set, judging that the target video is a cheating video.
2. The method of claim 1, wherein extracting feature words from the title information comprises:
performing word segmentation on the title information to obtain a plurality of words contained in the title information;
taking, among the plurality of words, the words that are in a hot-search vocabulary set as the feature words of the title information; wherein the hot-search words in the hot-search vocabulary set are determined according to their search counts within a specified time period.
3. The method of claim 1, wherein the recognition threshold is determined as follows:
acquiring non-cheating title information of a preset number of non-cheating videos, and counting the maximum number of feature words of a specified category contained in the same non-cheating title information;
and taking the counted maximum number as an identification threshold value associated with the feature vocabulary set of the specified category.
4. The method of claim 3, wherein determining whether the current feature vocabulary set belongs to an abnormal vocabulary set comprises:
if the number of the characteristic words contained in the current characteristic word set is larger than the recognition threshold value associated with the current characteristic word set, judging that the current characteristic word set belongs to an abnormal word set;
and if the number of the characteristic words contained in the current characteristic word set is less than or equal to the recognition threshold value associated with the current characteristic word set, judging that the current characteristic word set does not belong to an abnormal word set.
5. The method of claim 1, wherein if the feature vocabulary sets obtained by dividing all belong to normal vocabulary sets, the method further comprises:
counting the total number of the feature vocabulary sets obtained by the division, and if the counted total number is larger than a specified number threshold, judging that the target video is a cheating video;
wherein the specified number threshold is determined in the following manner:
acquiring non-cheating title information of a preset number of non-cheating videos, and counting the maximum number of characteristic vocabulary categories contained in the same non-cheating title information;
and taking the counted maximum number as the specified number threshold.
6. The method of claim 5, further comprising:
determining whether the sub-category belongs to an abnormal sub-category based on an identification threshold associated with the sub-category;
and if at least one abnormal sub-category exists in the current characteristic vocabulary set, judging that the current characteristic vocabulary set belongs to an abnormal vocabulary set.
7. The method of claim 5, wherein if all the sub-categories in the current feature vocabulary set are normal sub-categories, the method further comprises:
and counting the total number of the sub-categories contained in the current characteristic vocabulary set, and if the counted total number of the sub-categories is greater than a specified category threshold, judging that the current characteristic vocabulary set belongs to an abnormal vocabulary set.
8. The method of claim 1, wherein if the category of the current feature vocabulary set is a self media category, the recognition threshold associated with the current feature vocabulary set is determined as follows:
acquiring a plurality of non-cheating videos uploaded by users in a designated user group, and extracting respective title information of the non-cheating videos;
counting the maximum number of feature words belonging to the self-media category in the same non-cheating title information;
and taking the counted maximum number as an identification threshold value associated with the current feature vocabulary set.
9. The method of claim 1, wherein if the feature vocabulary sets obtained by dividing all belong to normal vocabulary sets, the method further comprises:
judging whether a first feature vocabulary set representing sensitive vocabulary exists in the feature vocabulary sets obtained by the division;
if the first characteristic vocabulary set exists, judging whether a second characteristic vocabulary set representing the program name exists in the divided characteristic vocabulary set except the first characteristic vocabulary set;
and if the second feature vocabulary set exists, judging that the target video is a cheating video.
10. A system for identification of a cheating video, said system comprising a memory and a processor, said memory having stored therein a computer program that, when executed by said processor, performs the steps of:
acquiring title information of a target video, and extracting feature words in the title information;
dividing the characteristic vocabulary into at least one characteristic vocabulary set according to the category to which the characteristic vocabulary belongs; the characteristic vocabularies in the same characteristic vocabulary set belong to the same category;
acquiring an identification threshold value associated with the current feature vocabulary set, and judging whether the current feature vocabulary set belongs to an abnormal vocabulary set based on the identification threshold value; wherein the feature vocabulary in the current feature vocabulary set is further divided into a plurality of sub-categories; accordingly, acquiring the identification threshold value associated with the current feature vocabulary set comprises: acquiring identification threshold values respectively associated with the sub-categories in the current feature vocabulary set;
and if the current feature vocabulary set belongs to an abnormal vocabulary set, judging that the target video is a cheating video.
11. The system of claim 10, wherein the category of the current feature vocabulary set is a self-media category; accordingly, the computer program, when executed by the processor, further implements the steps of:
acquiring a plurality of non-cheating videos uploaded by users in a designated user group, and extracting respective title information of the non-cheating videos;
counting the maximum number of feature words belonging to the self-media category in the same non-cheating title information;
and taking the counted maximum number as an identification threshold value associated with the current feature vocabulary set.
12. The system of claim 10, wherein the computer program, when executed by the processor, further performs the steps of:
if the feature vocabulary sets obtained by the division belong to normal vocabulary sets, judging whether a first feature vocabulary set for representing sensitive vocabularies exists in the feature vocabulary sets obtained by the division;
if the first characteristic vocabulary set exists, judging whether a second characteristic vocabulary set representing the program name exists in the divided characteristic vocabulary set except the first characteristic vocabulary set;
and if the second feature vocabulary set exists, judging that the target video is a cheating video.
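The sub-category checks and the category-count check recited in the claims above can be sketched as follows. This is a minimal illustration only: the function names, the dictionary-based data layout, and the example limits are assumptions, and the claims do not prescribe any particular implementation.

```python
from typing import Dict, Set


def set_is_abnormal(words_by_subcategory: Dict[str, Set[str]],
                    sub_thresholds: Dict[str, int],
                    max_subcategories: int) -> bool:
    """Judge one feature vocabulary set using its sub-categories.

    The set is abnormal if any sub-category holds more words than its
    own threshold, or, when all sub-categories are normal, if the number
    of sub-categories exceeds the (hypothetical) category threshold.
    """
    for sub, words in words_by_subcategory.items():
        if len(words) > sub_thresholds[sub]:
            return True
    return len(words_by_subcategory) > max_subcategories


def too_many_categories(sets_by_category: Dict[str, Set[str]],
                        max_categories: int) -> bool:
    """When every set is normal, flag the title if it spans too many categories."""
    return len(sets_by_category) > max_categories
```

Both limits would, per the claims, be derived from non-cheating titles in the same way as the per-category thresholds, namely as the maximum observed in any single non-cheating title.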
CN201711188045.6A 2017-11-24 2017-11-24 Method and system for identifying cheating videos Active CN109840445B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711188045.6A CN109840445B (en) 2017-11-24 2017-11-24 Method and system for identifying cheating videos

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711188045.6A CN109840445B (en) 2017-11-24 2017-11-24 Method and system for identifying cheating videos

Publications (2)

Publication Number Publication Date
CN109840445A CN109840445A (en) 2019-06-04
CN109840445B true CN109840445B (en) 2021-10-01

Family

ID=66876321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711188045.6A Active CN109840445B (en) 2017-11-24 2017-11-24 Method and system for identifying cheating videos

Country Status (1)

Country Link
CN (1) CN109840445B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111950360B (en) * 2020-07-06 2023-08-18 北京奇艺世纪科技有限公司 Method and device for identifying infringement user

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077172A (en) * 2011-10-26 2013-05-01 腾讯科技(深圳)有限公司 Method and device for mining cheating user
US8745056B1 (en) * 2008-03-31 2014-06-03 Google Inc. Spam detection for user-generated multimedia items based on concept clustering
US8752184B1 (en) * 2008-01-17 2014-06-10 Google Inc. Spam detection for user-generated multimedia items based on keyword stuffing
CN106202049A (en) * 2016-07-18 2016-12-07 合网络技术(北京)有限公司 A kind of hot word determines method and device
CN106326497A (en) * 2016-10-10 2017-01-11 合网络技术(北京)有限公司 Cheating video user identification method and device
CN106326498A (en) * 2016-10-13 2017-01-11 合网络技术(北京)有限公司 Cheat video identification method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"搜索引擎反作弊方法研究";王庆福等;《电脑知识与技术》;20160531;第12卷(第15期);202-203页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200514

Address after: 310052 room 508, floor 5, building 4, No. 699, Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Applicant after: Alibaba (China) Co.,Ltd.

Address before: 100080 Beijing Haidian District city Haidian street A Sinosteel International Plaza No. 8 block 5 layer A, C

Applicant before: Youku network technology (Beijing) Co.,Ltd.

GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240618

Address after: 101400 Room 201, 9 Fengxiang East Street, Yangsong Town, Huairou District, Beijing

Patentee after: Youku Culture Technology (Beijing) Co.,Ltd.

Country or region after: China

Address before: 310052 room 508, 5th floor, building 4, No. 699 Wangshang Road, Changhe street, Binjiang District, Hangzhou City, Zhejiang Province

Patentee before: Alibaba (China) Co.,Ltd.

Country or region before: China