Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application shall fall within the scope of protection of the present application.
The application provides a cheating video identification method that can be applied to a server of a video playing website. Referring to FIGS. 1 and 2, the method may include the following steps.
S1: Acquire the title information of a target video, and extract the feature words in the title information.
In this embodiment, the target video may be a video to be identified, the target video may have title information, and the title information may be text information set for the target video by a video uploader. For example, the title information of the target video may be "gold star show happy man's voice china best voice runner recent album".
In this embodiment, when determining whether the target video is a cheating video, the title information of the target video may be analyzed. In the server, the uploaded video data can be stored in association with the video information. The video information can include items such as the duration, name, and type of the video and the user name of the uploader. In this way, when the title information of the target video is to be obtained, the character string representing the video name can be read from the video information associated with the target video.
In the present embodiment, after the title information of the target video is acquired, the content of the title information can be identified. Specifically, the feature words in the title information may be extracted. A feature word can be a word that is currently searched frequently on the video playing website. In practical application, the video playing website may count the number of searches of each word within a specified time period, and then sort the searched words in descending order of search count. The top-ranked words can then be used as the feature words of the video playing website. For example, the video playing website may take the 100 most-searched words of the past week as its hot-search words, and these hot-search words may be used as the feature words of the video playing website.
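The ranking described above can be sketched as follows. This is a minimal illustration, assuming the site's search log is available as a flat list of searched words; the function name `build_hot_word_set` and the sample log are hypothetical and not part of the application.

```python
from collections import Counter

def build_hot_word_set(search_log, top_n=100):
    """Return the top_n most-searched words found in search_log.

    search_log is assumed to be an iterable of searched words collected
    over the specified time period (for example, the past week).
    """
    counts = Counter(search_log)
    # most_common sorts by search count in descending order.
    return {word for word, _ in counts.most_common(top_n)}

# Hypothetical search log for illustration only.
log = ["voice", "runner", "voice", "album", "voice", "runner"]
hot_words = build_hot_word_set(log, top_n=2)
```

Here `hot_words` holds the two most frequently searched words from the sample log.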
In the present embodiment, when extracting the feature words in the title information, the title information may be segmented to obtain the words it contains. When the title information is segmented, a preset vocabulary library can be used to recognize the words in the title information. In practical applications, various word segmenters can be used to segment the title information, for example the friso word segmenter, the Jcseg word segmenter, or the MMSEG4J word segmenter. Furthermore, to improve the accuracy of segmenting video title information, the word bank of the segmenter can be built from the words commonly used on the video playing website, so that the output of the segmenter better matches the language habits of that website.
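To illustrate segmentation against a preset vocabulary library, the sketch below uses a simple greedy longest-match strategy over whitespace-separated tokens. It is a stand-in written for this explanation, not the algorithm of friso, Jcseg, or MMSEG4J; the function name `segment` and the sample vocabulary are assumptions.

```python
def segment(title, vocabulary):
    """Greedy longest-match segmentation of a title.

    vocabulary is a set of known words or multi-token phrases.
    Tokens that match no vocabulary entry are emitted one at a time.
    """
    tokens = title.split()
    # Longest vocabulary entry, measured in tokens.
    max_len = max((len(entry.split()) for entry in vocabulary), default=1)
    words = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink down to one token.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical vocabulary built from words common on the site.
vocabulary = {"jinxing show", "chinese good voice"}
words = segment("jinxing show chinese good voice latest album", vocabulary)
```

A production segmenter for Chinese titles would match on characters rather than whitespace tokens; the token-level version is used here only to keep the example readable.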
In the present embodiment, after word segmentation yields a plurality of words, those words that also appear in a hot-search word set may be used as the feature words of the title information. The hot-search words in the hot-search word set can be determined according to their search counts within a specified time period. For example, the video playing website may take the 100 most-searched words of the past week and compose them into a hot-search word set. After the title information of the target video is segmented into a plurality of words, the words that are in the hot-search word set can be extracted as feature words. In the present embodiment, feature words are extracted because a cheating video typically tries to win clicks from users fraudulently by piling up several currently hot-search words in its title information. Therefore, the extracted feature words can be analyzed subsequently to judge whether the target video is a cheating video.
S3: Divide the feature words into at least one feature vocabulary set according to the category to which each feature word belongs, wherein the feature words in the same feature vocabulary set belong to the same category.
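The filtering step can be sketched as a membership test against the hot-search word set. The function name `extract_feature_words` and the sample data are hypothetical; whether repeated words should be counted once or several times is not stated in the text, so this sketch simply keeps every occurrence.

```python
def extract_feature_words(segmented_words, hot_word_set):
    """Keep only the segmented words found in the hot-search word set.

    The original order of the words in the title is preserved.
    """
    return [word for word in segmented_words if word in hot_word_set]

# Hypothetical segmentation result and hot-search word set.
segmented = ["jinxing show", "chinese good voice", "latest", "album"]
hot_word_set = {"jinxing show", "chinese good voice", "runner"}
feature_words = extract_feature_words(segmented, hot_word_set)
```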
In the present embodiment, the feature words may be classified according to the category to which they belong. The categories of the feature words may be defined according to the search intention of the user. Specifically, the categories may include a program name category, a character category, a self-media category, and a sensitive word category. The program name category may cover the names of variety programs or abbreviations of those names. For example, the program name category may include feature words such as "brother of running bar", "jinxing show", "chinese good voice", and the like. The character category may cover the names of public figures or their aliases. For example, the character category may include feature words such as "plum morning", "maryun", "baffert", and the like. The self-media category may cover the name of a PGC (Professionally Generated Content) producer on the video playing website or the name of an uploader. For example, the self-media category may include feature words such as "hero alliance", "evening maple", and the like. The sensitive word category may cover feature words with a negative guiding influence.
It should be noted that, for the categories of the feature vocabulary, in an actual application scenario, a certain category may be divided more finely, so as to obtain multiple sub-categories in one category. For example, the character class may include a plurality of sub-classes such as entertainment class characters, finance class characters, politics class characters, and the like.
In the present embodiment, after the feature words are extracted from the title information of the target video, they may be classified according to the category to which they belong. The feature words belonging to the same category may be divided into one feature vocabulary set. Thus, at least one feature vocabulary set can be obtained, and the feature words in the same feature vocabulary set belong to the same category. For example, for the title information "good sound package for golden star show for men a latest collection of good sound packages for looking at Baebel plum morning", two feature vocabulary sets, "good sound package for golden star show for men" and "Baebel plum morning", can be obtained by the division.
S5: Acquire the recognition threshold associated with the current feature vocabulary set, and judge, based on the recognition threshold, whether the current feature vocabulary set is an abnormal vocabulary set.
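The division into feature vocabulary sets amounts to grouping feature words by category. The sketch below assumes a word-to-category lookup table is available; the table `CATEGORY_OF`, its category labels, and the function name `divide_into_sets` are hypothetical.

```python
from collections import defaultdict

# Hypothetical word-to-category table; a real site would maintain its own.
CATEGORY_OF = {
    "jinxing show": "program_name",
    "chinese good voice": "program_name",
    "maryun": "person",
    "hero alliance": "self_media",
}

def divide_into_sets(feature_words, category_of=CATEGORY_OF):
    """Group feature words so that each set holds words of one category."""
    sets = defaultdict(list)
    for word in feature_words:
        category = category_of.get(word)
        if category is not None:  # skip words with no known category
            sets[category].append(word)
    return dict(sets)

feature_sets = divide_into_sets(["jinxing show", "maryun", "chinese good voice"])
```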
In general, the number of feature words that the title information of a normal video contains may differ from one category of feature words to another. For example, the number of program-name feature words appearing in the same title information generally does not exceed three, while the number of feature words for entertainment-type characters appearing in the same title information generally does not exceed five. Therefore, to avoid misjudging a normal video as a cheating video, different recognition strategies may be formulated for different categories in the present embodiment.
In the present embodiment, a recognition threshold may be determined in advance for each category of feature vocabulary set, for judging whether the number of feature words contained in a feature vocabulary set is normal. The recognition threshold may serve as an upper limit on the number of feature words contained in the feature vocabulary set. If the number of feature words contained in the feature vocabulary set is greater than the recognition threshold, hot-search words have been piled up in the corresponding title information. Specifically, since different feature vocabulary sets may be associated with different recognition thresholds, when judging the current feature vocabulary set, the recognition threshold associated with it may be obtained first. Each recognition threshold may be stored in association with its category in a server of the video playing website. The category of the feature words may be used as the key and the recognition threshold associated with that category as the value, so that the recognition thresholds are stored in key-value form. After the category corresponding to the current feature vocabulary set is determined, the associated recognition threshold may be read.
In this embodiment, the recognition threshold may be obtained by statistical analysis of the title information of normal videos. Specifically, the title information of a preset number of non-cheating videos can be obtained in advance, and the maximum number of feature words of a specified category contained in any single piece of non-cheating title information is counted. For example, the title information of 5000 non-cheating videos may be acquired, and then, for each piece of title information, the number of feature words of the specified category that it contains may be counted, e.g. the number of program-name feature words in each of the 5000 pieces of title information. Finally, by comparing the counts, the maximum can be obtained. This maximum may be used as the upper limit on the number of feature words of the specified category in a non-cheating video, and thus as the recognition threshold associated with the feature vocabulary set of the specified category. For example, if analysis of a large amount of normal title information shows that the title information of a normal video generally mentions at most 2 program names, the recognition threshold for the program name category may be set to 2.
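The statistical derivation of a recognition threshold can be sketched as taking a maximum over per-title counts. The function name `derive_recognition_threshold`, the category labels, and the sample titles below are hypothetical, and each title is assumed to have already been reduced to its list of feature words.

```python
def derive_recognition_threshold(titles_feature_words, category_of, category):
    """Return the largest number of feature words of `category` that
    appears in any single non-cheating title.

    titles_feature_words: one list of feature words per non-cheating title.
    category_of: mapping from feature word to category label.
    """
    counts = (
        sum(1 for word in words if category_of.get(word) == category)
        for words in titles_feature_words
    )
    return max(counts, default=0)

# Hypothetical sample of non-cheating titles.
category_of = {"show a": "program_name", "show b": "program_name", "star x": "person"}
normal_titles = [["show a", "star x"], ["show a", "show b"], ["star x"]]
threshold = derive_recognition_threshold(normal_titles, category_of, "program_name")
```

Because the threshold is a maximum, a single mislabeled cheating title in the sample would inflate it, so the quality of the non-cheating sample matters.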
In the present embodiment, after the recognition threshold associated with the current feature vocabulary set is obtained, whether the current feature vocabulary set is an abnormal vocabulary set can be judged based on the recognition threshold. Specifically, if the number of feature words contained in the current feature vocabulary set is greater than the recognition threshold associated with it, the current feature vocabulary set may be determined to be an abnormal vocabulary set. For example, the recognition threshold associated with the feature vocabulary set of the program name category may be 2; if that feature vocabulary set contains more than 2 feature words, it may be determined to be an abnormal vocabulary set. Conversely, if the number of feature words contained in the current feature vocabulary set is less than or equal to the associated recognition threshold, the current feature vocabulary set may be determined not to be an abnormal vocabulary set.
S7: If the current feature vocabulary set is an abnormal vocabulary set, determine that the target video is a cheating video.
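The threshold lookup and comparison can be sketched with a dictionary that uses the category as the key and the recognition threshold as the value. The threshold values, category labels, and the name `is_abnormal_set` are illustrative assumptions only.

```python
# Category is the key, recognition threshold is the value (illustrative values).
RECOGNITION_THRESHOLDS = {"program_name": 2, "person": 5}

def is_abnormal_set(category, words, thresholds=RECOGNITION_THRESHOLDS):
    """A set is abnormal when it holds more words than its category allows."""
    return len(words) > thresholds[category]

three_programs = ["show a", "show b", "show c"]
two_programs = ["show a", "show b"]
```

With these values, `is_abnormal_set("program_name", three_programs)` is true, while a set of exactly two program names is still treated as normal because the comparison is strict.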
In this embodiment, if the current feature vocabulary set is an abnormal vocabulary set, the feature words in it are suspected of being piled-up hot-search words. The title information of the target video can correspond to a plurality of feature vocabulary sets, and if any one of them is an abnormal vocabulary set, the target video can be judged to be a cheating video. For example, for title information such as "good voice package for god show for men looking at baebelk chat ideal at the latest", although the feature vocabulary set "baebelk morning" is a normal vocabulary set, the feature vocabulary set "good voice package for god show for men" is an abnormal vocabulary set, so the video corresponding to the title information can be determined to be a cheating video.
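The rule that one abnormal set is enough can be sketched as follows. The function name `is_cheating_video`, the category labels, and the threshold values are hypothetical, and every category present in the title is assumed to have a configured threshold.

```python
def is_cheating_video(feature_sets, thresholds):
    """Judge a video as cheating if any of its feature vocabulary sets
    holds more words than the recognition threshold of its category.

    feature_sets: mapping from category to the list of feature words.
    thresholds: mapping from category to recognition threshold.
    """
    return any(
        len(words) > thresholds[category]
        for category, words in feature_sets.items()
    )

# Illustrative thresholds and title data.
thresholds = {"program_name": 2, "person": 5}
stacked_title = {"program_name": ["show a", "show b", "show c"], "person": ["x", "y"]}
normal_title = {"program_name": ["show a"], "person": ["x", "y"]}
```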
In one embodiment, if the feature vocabulary sets obtained by the division are all normal vocabulary sets, whether the target video is a cheating video can be judged further. Specifically, the total number of feature vocabulary sets obtained by dividing the title information of the target video may be counted. For example, the title information "runner a recent set of bayberry-juncheng ideal" includes two feature vocabulary sets, so the total number of feature vocabulary sets corresponding to this title information is 2. If the counted total is greater than a specified number threshold, the target video can be determined to be a cheating video. The specified number threshold defines an upper limit on the number of feature vocabulary sets of different categories that may appear together in the same title information. In some cases, title information should be determined to be cheating title information even though none of its feature vocabulary sets contains more feature words than the associated recognition threshold, because it contains feature vocabulary sets of many different categories. For example, a title such as "a new season Mayunbafte who runs a new group of people who views Baebel Li Cheng chat ideal hero alliance and gives financial resources" contains four feature vocabulary sets (the character category being divided into entertainment-type characters and finance-type characters). The number of feature words in each feature vocabulary set is normal, but because the total number of feature vocabulary sets is too large, the video corresponding to the title information can be judged to be a cheating video.
In the present embodiment, the specified number threshold may be obtained by statistical analysis of the title information of non-cheating videos. Specifically, the title information of a preset number of non-cheating videos can be obtained, and the maximum number of feature word categories contained in any single piece of non-cheating title information is counted. The counted maximum may then be used as the specified number threshold.
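The set-count check and the derivation of its threshold can be sketched together. Both function names and the sample data are hypothetical, and each title is assumed to be represented by its mapping from category to feature words.

```python
def derive_number_threshold(normal_titles_sets):
    """Largest number of distinct feature word categories seen in any
    single non-cheating title."""
    return max((len(sets) for sets in normal_titles_sets), default=0)

def exceeds_number_threshold(feature_sets, number_threshold):
    """True when a title mixes more categories than normal titles do."""
    return len(feature_sets) > number_threshold

# Hypothetical non-cheating sample: at most two categories per title.
normal_sample = [
    {"program_name": ["show a"]},
    {"program_name": ["show a"], "person": ["x"]},
]
number_threshold = derive_number_threshold(normal_sample)

# Title mixing four categories, each individually within its own limit.
mixed_title = {
    "program_name": ["show a"],
    "entertainment_person": ["x"],
    "finance_person": ["y"],
    "self_media": ["z"],
}
```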
In one embodiment, a more refined partitioning may be performed for a certain category therein, resulting in multiple sub-categories within a category. In this way, the feature words in the current feature word set can be divided into a plurality of sub-categories. For example, the character class may include a plurality of sub-classes such as entertainment class characters, finance class characters, politics class characters, and the like. Then in obtaining the recognition thresholds associated with the current feature vocabulary sets, recognition thresholds associated with respective sub-categories in the current feature vocabulary sets may be obtained. Subsequently in determining the abnormal vocabulary set, it may be determined whether the sub-category belongs to an abnormal sub-category based on an identification threshold associated with the sub-category. Specifically, the manner of determining whether the sub-category belongs to the abnormal sub-category is similar to the manner of determining the abnormal vocabulary set described in the above embodiment, and will not be further described here. If at least one abnormal sub-category exists in the current characteristic vocabulary set, it can be determined that the current characteristic vocabulary set belongs to an abnormal vocabulary set.
In an embodiment, if all the sub-categories in the current feature vocabulary set are normal sub-categories, whether the current feature vocabulary set is an abnormal vocabulary set may be judged further from the total number of sub-categories. Specifically, the total number of sub-categories contained in the current feature vocabulary set may be counted, and if the counted total is greater than a specified category threshold, the current feature vocabulary set may be determined to be an abnormal vocabulary set. The specified category threshold may also be obtained by statistical analysis of the title information of non-cheating videos. For example, if the current feature vocabulary set includes a sub-category of entertainment-type characters, a sub-category of finance-type characters, and a sub-category of politics-type characters, and this total of three sub-categories exceeds the specified category threshold, the current feature vocabulary set can be determined to be an abnormal vocabulary set.
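The two sub-category checks can be sketched in one function: first the per-sub-category word count, then the total number of sub-categories. The function name `is_abnormal_by_subcategory`, the sub-category labels, and all threshold values are illustrative assumptions.

```python
def is_abnormal_by_subcategory(subcategory_sets, sub_thresholds, category_threshold):
    """Judge one feature vocabulary set using its sub-categories.

    subcategory_sets: mapping from sub-category to its feature words.
    sub_thresholds: recognition threshold per sub-category.
    category_threshold: upper limit on the number of sub-categories.
    """
    # Check 1: any single sub-category holding too many words.
    if any(
        len(words) > sub_thresholds[sub]
        for sub, words in subcategory_sets.items()
    ):
        return True
    # Check 2: too many different sub-categories in one set.
    return len(subcategory_sets) > category_threshold

# Character set split into three sub-categories (illustrative data).
person_set = {"entertainment": ["a", "b"], "finance": ["c"], "politics": ["d"]}
limits = {"entertainment": 5, "finance": 5, "politics": 5}
```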
In an embodiment, if the category of the current feature vocabulary set is the self-media category, the recognition threshold associated with the self-media category may be obtained by statistical analysis of the title information of videos uploaded by key PGC users. Specifically, a plurality of non-cheating videos uploaded by users in a specified user group may be acquired, and the title information of each of these videos may be extracted. The specified user group may consist of the above-mentioned key PGC users, which may be PGC users whose video upload volume reaches a specified number, or PGC users certified by the video playing website in the self-media category. The videos uploaded by key PGC users are usually non-cheating videos, so the recognition threshold corresponding to the self-media category can be obtained by statistical analysis of the title information of those videos. Specifically, similar to the embodiment described above, the maximum number of feature words belonging to the self-media category in any single piece of non-cheating title information may be counted, and the counted maximum may then be used as the recognition threshold associated with the current feature vocabulary set.
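The self-media threshold differs from the general case only in where the sample comes from. The sketch below filters videos by uploader before taking the maximum; the record layout (`uploader`, `feature_words`), the category label `self_media`, and the function name are hypothetical.

```python
def self_media_threshold(videos, key_pgc_users, category_of):
    """Largest number of self-media feature words in any single title
    among videos uploaded by the specified user group.

    videos: list of dicts with "uploader" and "feature_words" keys.
    key_pgc_users: set of uploader names in the specified user group.
    """
    counts = (
        sum(1 for word in video["feature_words"]
            if category_of.get(word) == "self_media")
        for video in videos
        if video["uploader"] in key_pgc_users
    )
    return max(counts, default=0)

# Illustrative data: only videos from key PGC users are counted.
category_of = {"channel a": "self_media", "channel b": "self_media", "show a": "program_name"}
videos = [
    {"uploader": "pgc_1", "feature_words": ["channel a", "show a"]},
    {"uploader": "pgc_2", "feature_words": ["channel a", "channel b"]},
    {"uploader": "other", "feature_words": ["channel a", "channel b", "channel a"]},
]
limit = self_media_threshold(videos, {"pgc_1", "pgc_2"}, category_of)
```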
In one embodiment, a further judgment may be made for feature words of the sensitive word category. Specifically, if the feature vocabulary sets obtained by the division are all normal vocabulary sets, it can be judged whether they include a first feature vocabulary set representing sensitive words. If the first feature vocabulary set exists, it can further be judged whether the divided feature vocabulary sets other than the first feature vocabulary set include a second feature vocabulary set representing program names. If the second feature vocabulary set exists, the target video can be judged to be a cheating video. The reason is as follows. If the same title information contains only a feature vocabulary set of the sensitive word category whose word count is within the limit, it is not appropriate to determine the title information to be cheating title information, because the video may actually present content corresponding to the sensitive words, in which case its title information involves no violation. However, if sensitive words and program names are both written into the title information, the uploader may be suspected of using the combination of program names and sensitive words to attract clicks. For example, if a piece of title information contains both a program name and a sensitive word, the video corresponding to that title information can be determined to be a cheating video.
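The co-occurrence rule can be sketched as a check for two non-empty sets. The category labels `sensitive_word` and `program_name` and the function name are hypothetical, and the function is assumed to be called only after every set has already been judged normal.

```python
def is_cheating_by_sensitive_rule(feature_sets):
    """Flag a title that combines sensitive words with program names.

    feature_sets: mapping from category to its list of feature words,
    all of which are assumed to have passed the per-set threshold check.
    """
    has_sensitive = bool(feature_sets.get("sensitive_word"))
    has_program = bool(feature_sets.get("program_name"))
    return has_sensitive and has_program

# Illustrative titles.
sensitive_only = {"sensitive_word": ["s1"]}
sensitive_with_program = {"sensitive_word": ["s1"], "program_name": ["show a"]}
```

A title holding only sensitive words passes this rule, which matches the reasoning that the video may genuinely cover that topic.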
Referring to fig. 3, the present application further provides a system for identifying a cheating video, where the system includes a memory and a processor, the memory stores a computer program, and the computer program, when executed by the processor, implements the following steps.
S1: Acquire the title information of a target video, and extract the feature words in the title information.
S3: Divide the feature words into at least one feature vocabulary set according to the category to which each feature word belongs, wherein the feature words in the same feature vocabulary set belong to the same category.
S5: Acquire the recognition threshold associated with the current feature vocabulary set, and judge, based on the recognition threshold, whether the current feature vocabulary set is an abnormal vocabulary set.
S7: If the current feature vocabulary set is an abnormal vocabulary set, determine that the target video is a cheating video.
In this embodiment, the category of the current feature vocabulary set is a self-media category; accordingly, the computer program, when executed by the processor, further implements the steps of:
acquiring a plurality of non-cheating videos uploaded by users in a designated user group, and extracting respective title information of the non-cheating videos;
counting the maximum number of feature words belonging to the self-media category in the same non-cheating title information;
and taking the counted maximum number as an identification threshold value associated with the current feature vocabulary set.
In this embodiment, the computer program, when executed by the processor, further implements the steps of:
if the feature vocabulary sets obtained by the division all belong to normal vocabulary sets, judging whether a first feature vocabulary set representing sensitive words exists among the feature vocabulary sets obtained by the division;
if the first characteristic vocabulary set exists, judging whether a second characteristic vocabulary set representing the program name exists in the divided characteristic vocabulary set except the first characteristic vocabulary set;
and if the second feature vocabulary set exists, judging that the target video is a cheating video.
In this embodiment, the memory may include a physical device for storing information; typically, the information is digitized and then stored in a medium by electrical, magnetic, or optical means. The memory according to this embodiment may include: devices that store information using electrical energy, such as RAM, ROM, and USB flash drives; devices that store information using magnetic energy, such as hard disks, floppy disks, magnetic tapes, magnetic core memories, and bubble memories; and devices that store information optically, such as CDs or DVDs. Of course, there are also other types of memory, such as quantum memory and graphene memory.
In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.
The specific functions implemented by the memory and the processor of the cheating video identification system provided in the embodiments of this specification can be understood by reference to the foregoing embodiments and can achieve the technical effects of the foregoing embodiments; they are not described again here.
In the 1990s, an improvement to a technology could be clearly distinguished as either a hardware improvement (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or a software improvement (an improvement to a method flow). However, as technology has advanced, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be realized by a hardware entity module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate a dedicated integrated circuit chip. Furthermore, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the source code to be compiled must also be written in a specific programming language, called a Hardware Description Language (HDL). There is not one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present.
It will also be apparent to those skilled in the art that hardware circuitry that implements the logical method flows can be readily obtained by merely slightly programming the method flows into an integrated circuit using the hardware description languages described above.
It is also known to those skilled in the art that instead of implementing the identification system of the cheating video in pure computer readable program code, the identification system of the cheating video could be implemented with the same functionality in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like, all by logically programming the method steps. Therefore, the identification system of the cheating video can be regarded as a hardware component, and the devices included in the hardware component for realizing various functions can also be regarded as structures in the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.
From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the identification system of a cheating video, reference may be made to the introduction of the embodiments of the method described above for a comparative explanation.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Although the present application has been described by way of embodiments, those of ordinary skill in the art will recognize that the present application admits numerous variations and permutations that do not depart from its spirit, and it is intended that the appended claims encompass such variations and permutations.