CN113408470A - Data processing method, data processing apparatus, electronic device, storage medium, and program product


Info

Publication number
CN113408470A
Authority
CN
China
Prior art keywords
video
packet
misjudgment
videos
preset
Prior art date
Legal status
Granted
Application number
CN202110743077.8A
Other languages
Chinese (zh)
Other versions
CN113408470B (en
Inventor
秦楚晴
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN202110743077.8A
Publication of CN113408470A
Application granted
Publication of CN113408470B
Legal status: Active

Classifications

    • G06N 3/04 Neural networks; Architecture, e.g. interconnection topology
    • G06N 3/08 Neural networks; Learning methods
    • G06Q 10/0635 Risk analysis of enterprise or organisation activities


Abstract

The present disclosure relates to a data processing method, an apparatus, an electronic device, a storage medium, and a program product. The data processing method includes: acquiring risk information of a plurality of first videos within a first preset time; dividing the plurality of first videos into a preset number of groups based on the risk information of each first video; determining misjudgment information of the first videos in a first group of the preset number of groups based on a preset sampling evaluation mode; and determining misjudgment information of the first videos in a second group of the preset number of groups based on the misjudgment information of the first videos in the first group. The method addresses the problems that evaluating missed content by a conventional sampling method consumes a large amount of review manpower and is inefficient.

Description

Data processing method, data processing apparatus, electronic device, storage medium, and program product
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a data processing method and apparatus, an electronic device, a storage medium, and a program product.
Background
With the rapid development of the mobile internet and the growing popularity of electronic devices, users can share and communicate by publishing video content on a video publishing platform, live streaming, and the like. At the same time, however, illegal content is often maliciously embedded in videos, and such content negatively affects viewers. Detecting whether video content is illegal is therefore an important task for a video publishing platform.
In the related art, a platform devotes substantial resources to reviewing and supervising video content uploaded by users, yet some illegal content is still judged "normal". The video content misjudged as "normal" (the "missed content") is therefore usually evaluated by a conventional sampling method. However, the volume of video content published on a platform is huge, and the proportion of illegal and missed content is extremely low, so evaluating missed content by conventional sampling consumes a large amount of review manpower and is inefficient.
Disclosure of Invention
The disclosure provides a data processing method, a data processing apparatus, an electronic device, a storage medium, and a program product, to at least solve the problems in the related art that, because the volume of video published on a platform is huge and the proportion of illegal and missed content is extremely low, evaluating missed content by a conventional sampling method consumes a large amount of review manpower and is inefficient.
The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a data processing method, including: acquiring risk information of a plurality of first videos within a first preset time, where the risk information represents the violation risk of video content; dividing the plurality of first videos into a preset number of groups based on the risk information of each first video; determining misjudgment information of the first videos in a first group of the preset number of groups based on a preset sampling evaluation mode; and determining misjudgment information of the first videos in a second group of the preset number of groups based on the misjudgment information of the first videos in the first group.
In some implementations of the first aspect, determining the misjudgment information of the first videos in the second group based on the misjudgment information of the first videos in the first group includes: calculating the misjudgment rate of the first videos in the second packet based on the misjudgment rate of the first videos in the first packet and the misjudgment rate ratio of the first packet to the second packet.
In some implementations of the first aspect, before calculating the misjudgment rate of the first videos in the second packet, the method further includes: acquiring risk information of a plurality of second videos within a second preset time; dividing the plurality of second videos into the preset number of groups based on the risk information of each second video; determining the misjudgment rate of the second videos in each of the preset number of groups, including the first packet and the second packet, based on the preset sampling evaluation mode; and calculating the misjudgment rate ratio of the first packet to the second packet based on the misjudgment rates of the second videos in the first packet and the second packet.
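The calibration step described above can be sketched in Python. This is one illustrative reading of the patent, not its literal implementation; the function names and example rates are hypothetical.

```python
def misjudgment_rate_ratio(rate_first_packet: float, rate_second_packet: float) -> float:
    """Ratio of the second packet's misjudgment rate to the first packet's,
    measured on the calibration-period ("second") videos, where every packet
    was evaluated by the preset sampling mode."""
    if rate_first_packet <= 0:
        raise ValueError("first-packet rate must be positive to form a ratio")
    return rate_second_packet / rate_first_packet


def estimate_second_packet_rate(sampled_first_packet_rate: float, ratio: float) -> float:
    """Apply the calibration ratio to the first-packet rate sampled in the
    current ("first") period to estimate the unsampled second packet's rate."""
    return sampled_first_packet_rate * ratio
```

For example, if the calibration period gave misjudgment rates of 0.2 and 0.05 for the two packets, the ratio is 0.25; a sampled first-packet rate of 0.1 in the current period then yields an estimated second-packet rate of 0.025, with no sampling of the second packet required.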
In some implementations of the first aspect, the risk information includes a violation probability value of the video content, and the preset sampling evaluation mode includes calculating the misjudgment rate of the video content in a preset packet based on the number of sampled videos carrying a violation tag in the preset packet, where the preset packet is the first packet or the second packet; the number of sampled videos for the preset packet is calculated based on the number of videos in the preset packet and the violation probability value of each video.
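The patent does not give the exact formula for the sample count. One plausible reading, sketched below under that assumption, is a count proportional to the expected number of violations in the packet (the sum of the violation probabilities); the `scale` parameter, the ceiling, and the minimum are all hypothetical.

```python
import math

def sample_count(violation_probs, scale=1.0, min_samples=1):
    """Number of videos to sample from a preset packet.

    Assumed heuristic: packets whose videos carry higher violation probability
    values, or which simply contain more videos, get more samples."""
    expected_violations = sum(violation_probs)  # expected violations in packet
    return max(min_samples, math.ceil(scale * expected_violations))
```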
In some implementations of the first aspect, acquiring the risk information of the plurality of first videos within the first preset time includes: extracting content features of the plurality of first videos and user features of the users corresponding to the plurality of first videos; inputting the content features and the user features into a risk assessment model, which outputs a violation probability value for each first video; and taking the violation probability value of each first video as its risk information.
In some implementations of the first aspect, prior to inputting the content features and the user features to the risk assessment model, the method further comprises: acquiring an input training sample and an output training sample, wherein the input training sample comprises the content features of a sample video and the user features of the user corresponding to the sample video, and the output training sample comprises the label information of the sample video; and training a preset neural network model based on the input training sample and the output training sample to obtain the risk assessment model.
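The training step can be sketched with a minimal stand-in model. The patent specifies a "preset neural network model"; for brevity this sketch substitutes plain logistic regression over concatenated content and user features, and the function names, hyperparameters, and data shapes are all assumptions.

```python
import math
import random

def train_risk_model(features, labels, lr=0.1, epochs=200, seed=0):
    """Train a stand-in risk assessment model.

    features: input training samples, one numeric vector per sample video
              (content features concatenated with user features).
    labels:   output training samples; 1 marks a violating sample video, 0 a
              normal one.
    Returns a callable mapping a feature vector to a violation probability."""
    rng = random.Random(seed)
    dim = len(features[0])
    w = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid output
            g = p - y                        # gradient of log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g

    def predict(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))    # violation probability value
    return predict
```

A real deployment would use a proper neural network over richer features, but the training-sample interface (feature vectors in, violation labels out, probability out at inference) matches the text above.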
According to a second aspect of the embodiments of the present disclosure, there is provided a data processing apparatus including: the acquisition module is configured to acquire risk information of a plurality of first videos in a first preset time, wherein the risk information is used for representing the violation risk of the video content; a grouping module configured to perform grouping of the plurality of first videos into a preset number of groups based on the risk information of each first video; the determining module is configured to determine misjudgment information of a first video in a first group in a preset number of groups based on a preset sampling evaluation mode; the determining module is further configured to determine misjudgment information of the first video in the second group in the preset number of groups based on the misjudgment information of the first video in the first group.
In some implementations of the second aspect, the misjudgment information includes the misjudgment rate of the videos, and the determining module is specifically configured to calculate the misjudgment rate of the first videos in the second packet based on the misjudgment rate of the first videos in the first packet and the misjudgment rate ratio of the first packet to the second packet.
In some implementations of the second aspect, the apparatus further comprises: the obtaining module is further configured to obtain risk information of a plurality of second videos within a second preset time before the misjudgment rate of the first videos in the second group is calculated; the grouping module is further configured to divide the plurality of second videos into the preset number of groups based on the risk information of each second video; the determining module is further configured to determine the misjudgment rate of the second videos in each of the preset number of groups, including the first group and the second group, based on the preset sampling evaluation mode; and a calculating module is configured to calculate the misjudgment rate ratio of the first packet to the second packet based on the misjudgment rates of the second videos in the first packet and the second packet.
In some implementations of the second aspect, the risk information includes a violation probability value of the video content, and the preset sampling evaluation mode includes calculating the misjudgment rate of the video content in a preset packet based on the number of sampled videos carrying a violation tag in the preset packet, where the preset packet is the first packet or the second packet; the number of sampled videos for the preset packet is calculated based on the number of videos in the preset packet and the violation probability value of each video.
In some implementations of the second aspect, the obtaining module includes: an extraction unit configured to extract content features of the plurality of first videos and user features of the users corresponding to the plurality of first videos; an input-output unit configured to input the content features and the user features into the risk assessment model and obtain violation probability values of the first videos; and a determination unit configured to take the violation probability value of each first video as its risk information.
In some implementations of the second aspect, the apparatus further comprises: the obtaining module is further configured to obtain an input training sample and an output training sample before inputting the content features and the user features to the risk assessment model, wherein the input training sample comprises the content features of the sample video and the user features of the user corresponding to the sample video, and the output training sample comprises the label information of the sample video; and the model training module is configured to perform training on a preset neural network model based on the input training sample and the output training sample to obtain a risk assessment model.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the data processing method as in the first aspect or some realizations of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, in which instructions, which, when executed by a processor of an electronic device, enable the electronic device to perform a data processing method as in the first aspect or some realizations of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement a data processing method as in the first aspect or some of the realizations of the first aspect.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
in the embodiments of the disclosure, all video content published by the platform within the first preset time period is grouped based on the violation risk of each video, so that the violation risks of the video content in different groups are differentiated. Only the video content in one or several groups is then reviewed and evaluated to obtain misjudgment information, from which the misjudgment information of the video content in the other groups can be derived. In this way, the video content misjudged as normal among all content published within the first preset time period is evaluated effectively, large-scale sampling is avoided, evaluation efficiency is improved, and review manpower and resources are saved. The embodiments can therefore solve the problems that, because the volume of video published on the platform is huge and the proportion of illegal and missed content is extremely low, evaluating missed content by a conventional sampling method consumes a large amount of review manpower and is inefficient.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is an architectural diagram illustrating one type of data processing according to an exemplary embodiment.
FIG. 2 is a flow chart illustrating a method of data processing according to an exemplary embodiment.
FIG. 3 is a flow chart illustrating another method of data processing according to an example embodiment.
FIG. 4 is a flow chart illustrating yet another method of data processing according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating a data processing apparatus according to an example embodiment.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Fig. 7 is a block diagram illustrating an apparatus for a data processing method according to an example embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
First, technical terms related to technical solutions provided by embodiments of the present disclosure are introduced:
the live broadcast room is a virtual space constructed by using software and hardware resources in the main broadcast equipment, the audience equipment and the live broadcast platform equipment during network live broadcast. The anchor user can create or log in the live broadcast room through the anchor device and release the live broadcast content to the live broadcast room. The audience user can watch the live broadcast content released by the anchor user through the audience equipment which logs in the live broadcast room.
The data processing method provided by the embodiment of the present disclosure may be applied to the architecture shown in fig. 1, and is specifically described in detail with reference to fig. 1.
FIG. 1 is an architectural diagram illustrating one type of data processing according to an exemplary embodiment.
As shown in FIG. 1, the server 100 is communicatively coupled to the client 200 via a network 300 for data communications or interactions. The server 100 may be one server, or may be a server cluster of a video distribution platform composed of at least two servers. The client 200 may be, but is not limited to, a Personal Computer (PC), a smart phone, a tablet PC, a Personal Digital Assistant (PDA), and the like. The network 300 may be a wired or wireless network. It should be noted that fig. 1 is only an example, and the number of clients 200 in practical application can be set according to practical requirements.
In one example, the client 200 may be a user device through which a user uploads short video content to the video distribution platform or broadcasts live in a live broadcast room. The server 100 may review the uploaded or live video content to determine whether it is illegal, for example, whether the video contains vulgar content or content that negatively affects users, such as embedded advertisements steering users toward purchases, and distribute the video content after the review passes.
In some examples, the client 200 may be configured with or connected to a camera device to perform live video via the camera device.
In the related art, in addition to defining behavior specifications for users, a platform devotes substantial resources to reviewing and supervising video content uploaded by users, in the hope of filtering out illegal video content. However, some illegal video content may still be misjudged as "normal" and released by the platform; such videos are called "missed content". For a video distribution platform, finding and cleaning up missed content is very important, so the related art usually evaluates a platform's missed content with a conventional sampling method. However, the volume of video content released by the platform is huge, and the proportion of illegal and missed content is extremely low, so evaluating missed content by conventional sampling requires a large amount of review manpower and is inefficient.
To solve these problems, the embodiments of the present disclosure provide a data processing method, an apparatus, an electronic device, a storage medium, and a program product that avoid consuming a large amount of review manpower on missed-content evaluation and improve evaluation efficiency.
The data processing method provided by the embodiment of the present disclosure will be described in detail below.
FIG. 2 is a flow chart illustrating a method of data processing according to an exemplary embodiment. The data processing method provided by the embodiment of the present disclosure may be applied to the server 100 in fig. 1, and it should be understood that the above-mentioned execution subject does not constitute a limitation to the present disclosure.
As shown in fig. 2, the data processing method may include S210-S240.
S210, acquiring risk information of a plurality of first videos in first preset time.
The first video may be all videos published by the video publishing platform within a first preset time, and the first preset time may be a specific time period and may be specifically set according to requirements, which is not specifically limited in this disclosure. The risk information is used for representing the violation risk size of the video content, that is, the risk information of each first video can represent the violation risk size of the corresponding first video.
For example, if the first preset time is 8:00-23:00, the first videos may be all videos published on the video distribution platform between 8:00 and 23:00.
And S220, dividing the plurality of first videos into a preset number of groups based on the risk information of each first video.
The preset number of packets may be set according to specific requirements, for example, the plurality of first videos may be divided into 3, 4, or 5 packets.
In some embodiments of the present disclosure, all the first videos may be sorted in order of the violation risk from large to small, or from small to large, based on the violation risk size of each first video, and then grouped based on the sorting.
For example, if the preset number of groups is 3 and there are 100 first videos, sorting the 100 first videos in descending order of violation risk yields videos A1 to A100. The top 5% (the 5 videos A1-A5) form group 1, the next 5% (A6-A10) form group 2, and the remaining 90 videos (A11-A100) form group 3.
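A minimal sketch of this rank-based grouping, assuming the 5%/5%/90% split from the example (the split fractions and function name are illustrative, not from the patent):

```python
def group_by_risk_rank(videos, risks, fractions=(0.05, 0.05)):
    """Sort videos by violation risk (descending), then peel off one group per
    fraction; everything left over becomes the final group."""
    order = sorted(range(len(videos)), key=lambda i: risks[i], reverse=True)
    ranked = [videos[i] for i in order]
    n = len(ranked)
    groups, start = [], 0
    for frac in fractions:
        end = start + round(n * frac)
        groups.append(ranked[start:end])
        start = end
    groups.append(ranked[start:])
    return groups
```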
In other embodiments of the present disclosure, the plurality of first videos may be divided into a preset number of groups based on the violation risk size of each first video and the preset risk interval corresponding to each group. The preset grouping interval can also be set according to specific requirements.
For example, suppose there are 100 first videos and the preset number of groups is 4, with preset risk intervals [0.8, 1] for group 1, [0.6, 0.8) for group 2, [0.3, 0.6) for group 3, and [0, 0.3) for group 4. All first videos whose violation risk is at least 0.8 and at most 1 are placed in group 1, and by analogy the 100 first videos are grouped.
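The interval-based variant can be sketched as follows, using the example boundaries above (the boundary values and names are illustrative):

```python
def group_by_risk_interval(videos, risks, lower_bounds=(0.8, 0.6, 0.3)):
    """Place each video in the first interval whose lower bound its risk meets:
    [0.8, 1], [0.6, 0.8), [0.3, 0.6), and a catch-all [0, 0.3)."""
    groups = [[] for _ in range(len(lower_bounds) + 1)]
    for video, risk in zip(videos, risks):
        for idx, low in enumerate(lower_bounds):
            if risk >= low:
                groups[idx].append(video)
                break
        else:  # risk below every lower bound
            groups[-1].append(video)
    return groups
```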
It should be noted that the number of the first videos corresponding to different packets may be the same or different. The above are only a few examples of grouping videos according to violation risk sizes of the videos, and the videos may also be divided by other methods based on the violation risk sizes of the videos, which is not described herein again for brevity.
In the embodiments of the present disclosure, the higher the violation risk of a video, the more likely it is missed violating content, that is, the more likely it was misjudged as "normal" in an earlier review. Grouping all the first videos by violation risk therefore ensures that the violation risks, and hence the misjudgment information, of the first videos in different groups differ: the group containing first videos with higher violation risk holds relatively more missed content misjudged as normal, while the group containing first videos with lower violation risk holds relatively less. As a result, once the misjudgment information of the first videos in one or several groups is determined, the misjudgment information of the first videos in the other groups can be inferred rigorously and with high accuracy.
And S230, determining misjudgment information of the first video in the first group in the preset number of groups based on a preset sampling evaluation mode.
The first packet may be any packet in a preset number of packets, and it should be noted that the number of the first packets may be at least one, and may be specifically set according to a requirement.
In some embodiments of the present disclosure, prior to S230, the method may further include: and determining the group of the first video with the highest violation risk and/or the group of the first video with the lowest violation risk as the first group.
In some embodiments of the present disclosure, because the first videos published by the video publishing platform may contain missed content, that is, illegal video content that was successfully published because an earlier review misjudged it as "normal", a preset sampling evaluation mode may be adopted to sample and evaluate the first videos in the first group, thereby obtaining the misjudgment information of the first videos in the first group.
In some embodiments of the present disclosure, the misjudgment information may include the misjudgment rate of the videos: the first videos in the first packet are sampled, and the misjudgment rate is the ratio of the number of illegal videos found in the sample to the total number of sampled videos.
For example, suppose the first packet is packet 1, which contains 1000 first videos. If 20 videos are sampled and evaluated from these 1000 videos and 4 of them are illegal video content, that is, 4 of the 20 were misjudged as "normal" in the earlier review, then the misjudgment rate of the 1000 first videos in packet 1 may be 4/20 = 0.2.
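The sampled misjudgment rate in this example is simply the violating fraction of the sample; a short sketch (the 0/1 label convention is an assumption):

```python
def misjudgment_rate(sample_labels):
    """Misjudgment rate of a packet from its sampled videos.

    sample_labels: 1 if a sampled video turned out to be illegal (i.e. it was
    misjudged as 'normal' in the earlier review), 0 if it is genuinely normal."""
    if not sample_labels:
        raise ValueError("need at least one sampled video")
    return sum(sample_labels) / len(sample_labels)
```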
S240, based on the misjudgment information of the first video in the first group, determining the misjudgment information of the first video in the second group in the preset number of groups.
The second packet may be a packet other than the first packet in the preset number of packets, and the number of the second packets may be at least one.
For example, the preset number of packets includes packet 1, packet 2, packet 3, and packet 4, where packet 1 is the first packet, and the second packet includes packet 2, packet 3, and packet 4.
For another example, the predetermined number of packets are packet 1, packet 2, packet 3, and packet 4, where packet 1 and packet 2 are the first packets, and the second packets include packet 3 and packet 4.
As described above, grouping the plurality of first videos based on the risk information of each first video ensures that the misjudgment information of the first videos differs between groups. After the misjudgment information of the first videos in the first group is determined, the misjudgment information of the first videos in the second group can therefore be determined more accurately; if the misjudgment information were similar across groups, it would be difficult to infer the misjudgment information of the other groups from that of the first group.
According to the data processing method provided by the embodiment of the disclosure, risk information of a plurality of first videos within a first preset time is obtained, where the risk information represents the violation risk of the video content, and the plurality of first videos are divided into a preset number of packets based on the risk information of each first video. On this basis, the misjudgment information of the first videos in the first packet is determined by a preset sampling evaluation manner, and the misjudgment information of the first videos in the second packet is then determined based on the misjudgment information of the first videos in the first packet. All video content issued by the platform within the first preset time is thus grouped by the violation risk of each video, so that the violation risks of the video content differ between packets. Only the video content in one or some packets then needs to be audited and evaluated to obtain misjudgment information, from which the misjudgment information of the video content in the other packets can be derived. In this way, the video content misjudged as "normal" among all the video content issued within the first preset time is evaluated effectively, a large sampling workload is avoided, the evaluation efficiency is improved, and auditing manpower and resources are saved.
The above-mentioned steps S210-S240 are described in detail below with reference to specific embodiments.
Referring first to S210, risk information of a plurality of first videos within a first preset time is obtained.
In some embodiments of the present disclosure, S210 may specifically include the following steps: extracting content characteristics of the plurality of first videos and user characteristics of users corresponding to the plurality of first videos; inputting content characteristics and user characteristics into the risk evaluation model, and outputting violation probability values of the first video; and taking the violation probability value of the first video as the risk information of the first video.
The violation probability value of the first video may be used to represent the violation risk size of the video content of the first video, that is, the larger the violation probability value of the first video is, the larger the violation risk of the first video is; the smaller the violation probability value of the first video, the smaller the violation risk of the first video.
In this way, violation risk assessment is performed on a video more comprehensively, based on both the content characteristics of the video and the user characteristics of the corresponding user; the resulting violation probability value represents the violation risk of the video more accurately, and grouping videos by this violation risk differentiates the violation risks of the video content in different packets. Therefore, when the misjudgment information of the video content in one or some packets is used to evaluate the video content in other packets, the accuracy of the misjudgment evaluation can be ensured.
In some embodiments of the present disclosure, the content characteristics of the first video may include, but are not limited to: scene recognition characteristics, face recognition characteristics, pornographic risk probability of video content and audio vocabulary characteristics in the first video. The user characteristics of the user corresponding to the first video may include historical violation information of a user account issuing the first video, a historical issuing video type, and the like.
In some embodiments of the present disclosure, before S210, the method may further include the steps of: acquiring an input training sample and an output training sample, wherein the input training sample comprises the content characteristics of a sample video and the user characteristics of a user corresponding to the sample video, and the output training sample comprises the label information of the sample video; and training a preset neural network model based on the input training sample and the output training sample to obtain a risk assessment model.
Illustratively, the tag information may include normal (or non-violating), verbal violations, behavioral violations, and the like.
In the embodiment of the disclosure, the preset neural network model is trained based on the content characteristics of the sample video, the user characteristics of the user corresponding to the sample video, and the label information of the sample video, so that a risk assessment model capable of performing risk assessment on the video content can be obtained, and thus the accurate assessment of the violation risk of the first video is realized.
Referring then to S220, the plurality of first videos are divided into a preset number of groups based on the risk information of each first video.
In some embodiments of the present disclosure, S220 may include: and based on the violation probability value of each first video, sequencing all the first videos according to the violation probability value from large to small or from small to large, and then grouping based on the sequencing.
In other embodiments of the present disclosure, the plurality of first videos may be divided into a preset number of packets based on the violation probability value of each first video and the preset probability value interval corresponding to each packet.
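The interval-based grouping described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation; the function name and the interval boundaries are assumptions.

```python
# Sketch of S220: dividing videos into a preset number of packets by
# violation probability interval. The boundaries below are illustrative
# assumptions, not values fixed by the disclosure.
def group_by_probability(videos, boundaries=(0.25, 0.5, 0.75)):
    """videos: list of (video_id, violation_probability) pairs.
    Returns len(boundaries) + 1 packets ordered by increasing risk."""
    packets = [[] for _ in range(len(boundaries) + 1)]
    for video_id, p in videos:
        # Count how many boundaries the probability reaches or exceeds;
        # that count is the packet index.
        idx = sum(p >= b for b in boundaries)
        packets[idx].append(video_id)
    return packets

videos = [("v1", 0.05), ("v2", 0.30), ("v3", 0.60), ("v4", 0.90)]
group_by_probability(videos)  # → [["v1"], ["v2"], ["v3"], ["v4"]]
```

Sorting all videos by violation probability value and then cutting the sorted list into equal parts, as in the first embodiment above, is an equivalent alternative when fixed probability intervals are not given.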
Referring then to S230, the misjudgment information of the first video in the first packet of the predetermined number of packets is determined based on a predetermined sampling evaluation manner.
The misjudgment information may include a misjudgment rate of the video.
In some embodiments of the present disclosure, the risk information may include an illegal probability value of the video content, and the preset sampling evaluation manner may include calculating a misjudgment rate of the video content in a preset packet based on the number of sampled videos corresponding to the illegal tag in the preset packet, where the preset packet may be a first packet or a second packet; the number of the sampling videos corresponding to the preset group can be calculated based on the number of the video contents in the preset group and the violation probability value corresponding to each video content.
The preset group can be any group in a preset number of groups.
In this way, when sampling and evaluating the video content in a certain packet of the preset number of packets, the number of sampled videos for that packet can be calculated based on the number of video contents in the packet and their violation probability values. That is, when a packet is sampled, the number of videos extracted from it is chosen with the violation risk of the video content in the packet taken into account, so that the resulting number of sampled videos better fits the degree of violation within the packet.
In some embodiments of the present disclosure, the preset packet may be a first packet, fig. 3 is a flowchart illustrating another data processing method according to an exemplary embodiment, and as shown in fig. 3, S230 may specifically include S310-S340.
S310, obtain the number of the first videos in the first packet.
And S320, calculating the number of the sampling videos corresponding to the first group according to the number of the first videos in the first group and the violation probability value.
In some embodiments of the present disclosure, S320 may specifically include the following steps: calculating a target violation probability value corresponding to the first packet based on the violation probability value of each first video in the first packet; the number of sampled videos of the first packet is calculated based on the number of first videos in the first packet and the target violation probability value.
In an example of the present disclosure, the first packet is a packet 1, the packet 1 includes 5 first videos, and each first video corresponds to one violation probability value, and then an average value of the violation probability values of the 5 videos may be used as a target violation probability value corresponding to the first packet.
In another example of the disclosure, the first packet is packet 1, and packet 1 includes 5 first videos A1-A5 sorted from large to small by violation probability value; the median of the violation probability values of the 5 videos, that is, the violation probability value of video A3, may be used as the target violation probability value corresponding to the first packet.
It should be noted that, the above are only some examples of calculating the target violation probability value according to the violation probability value of each video in the group, and the present application may also perform calculation in other manners based on the violation probability value of each video in the group, and for brevity, details are not described here again.
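The mean and median options in the two examples above can be sketched as follows; the function name and the sample probability values are illustrative assumptions.

```python
# Sketch of computing a packet's target violation probability value from the
# per-video violation probability values, using either the mean (first
# example above) or the median (second example above).
def target_probability(probs, method="mean"):
    if method == "mean":
        return sum(probs) / len(probs)
    ordered = sorted(probs)
    mid = len(ordered) // 2
    if len(ordered) % 2:  # odd count: the middle element
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle elements

probs = [0.875, 0.75, 0.5, 0.25, 0.125]  # videos A1-A5, sorted from large to small
target_probability(probs, "mean")    # → 0.5
target_probability(probs, "median")  # → 0.5 (the value of video A3)
```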
In some embodiments of the disclosure, calculating the number of sampled videos of the first packet based on the number of first videos in the first packet and the target violation probability value may include: calculating the number n of sampled videos of the first packet based on the number N of first videos in the first packet, the target violation probability value P, and formula (1):

n0 = t² × P × (1 − P) / d²    (1)

where t is the standard score corresponding to the chosen confidence level, d is the allowable absolute error, and n0 is the preliminarily calculated number of sampled videos. If n0/N is less than 0.05, then n = n0; if n0/N is not less than 0.05, then n = n0 / (1 + n0/N).
In the above embodiment, the standard score t corresponding to the chosen confidence level may be looked up in the standard normal distribution table.
Illustratively, for a confidence level of 0.90, t is 1.64; for 0.95, t is 1.96; for 0.99, t is 2.58.
It should be noted that the allowable absolute error d can be set according to specific requirements.
In the embodiment of the present disclosure, it can be seen from formula (1) that the larger the target violation probability value corresponding to the first packet is (for probability values below 0.5), the larger n0 is. Therefore, when the violation risk of the first videos in the first packet is larger, the first packet may contain more missed violating video content; in that case, more sampled videos are selected for evaluation, which improves the accuracy of the misjudgment evaluation of the first videos in the first packet and the effectiveness of the missed-violation evaluation of all video content of the platform.
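Formula (1) and its small-sample correction can be sketched as follows; the function name, default parameter values, and example numbers are assumptions for illustration.

```python
import math

# Sketch of formula (1): preliminary sample size n0 = t^2 * P * (1 - P) / d^2,
# with the correction n = n0 / (1 + n0 / N) applied when n0 / N >= 0.05.
def sample_size(N, P, t=1.96, d=0.05):
    """N: number of videos in the packet; P: target violation probability
    value; t: standard score for the confidence level (1.96 for 0.95);
    d: allowable absolute error."""
    n0 = t * t * P * (1 - P) / (d * d)
    if n0 / N >= 0.05:
        # The preliminary sample is a large share of the packet, so apply
        # the finite-population correction.
        n0 = n0 / (1 + n0 / N)
    return math.ceil(n0)

sample_size(1000, 0.2)    # → 198 samples from a 1000-video packet
sample_size(100000, 0.2)  # → 246; the correction no longer applies
```

Consistent with the remark above, for P below 0.5 the term P × (1 − P) grows with P, so riskier packets yield larger sample counts.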
And S330, acquiring label information corresponding to the sampled video, wherein the label information can comprise a normal label and an illegal label.
In some embodiments of the present disclosure, an auditor may view the sampled video and mark the sampled video with a normal tag or an illegal tag according to a given judgment rule.
And S340, calculating the misjudgment rate of the first video in the first group based on the number of the sampling videos corresponding to the violation labels.
Illustratively, the first packet is packet 1, packet 1 includes 1000 first videos, and 20 videos are sampled and evaluated from the 1000 first videos, wherein 4 videos correspond to the violation tags, that is, 4 violation video contents in the 20 videos are misjudged as "normal" at the time of the previous review, so that the misjudgment rate of the 1000 first videos in packet 1 may be 0.2.
In this way, based on the number of first videos in the first group and the risk information, the number of sampled videos with higher suitability for the violation risk size of all first videos in the first group can be calculated. Based on the above, the misjudgment rate of the first video in the first packet can be calculated relatively accurately according to the number of the sampled videos corresponding to the violation labels in all the sampled videos.
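The rate computation in S330-S340, applied to the 20-sample example above, can be sketched as follows; the function name is an assumption.

```python
# Sketch of S330-S340: the misjudgment rate of a packet is the fraction of
# sampled videos that auditors labeled as violations, i.e. videos that the
# earlier review wrongly passed as "normal".
def misjudgment_rate(labels):
    """labels: one audit label per sampled video, "normal" or "violation"."""
    return labels.count("violation") / len(labels)

labels = ["violation"] * 4 + ["normal"] * 16  # 4 of the 20 samples violate
misjudgment_rate(labels)  # → 0.2
```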
Finally, referring to S240, the misjudgment information of the first video in the second group in the preset number of groups is determined based on the misjudgment information of the first video in the first group.
The misjudgment information may include a misjudgment rate of the video.
In some embodiments of the present disclosure, prior to S240, the method may further include determining a misjudgment rate ratio of the first packet to the second packet. Fig. 4 is a flowchart illustrating yet another data processing method according to an exemplary embodiment; as shown in fig. 4, the method may specifically include S410-S440.
And S410, acquiring risk information of a plurality of second videos in second preset time.
The second preset time is different from the first preset time.
In some embodiments of the present disclosure, the first preset time may include at least a part of the second preset time, and the plurality of first videos may include at least a part of the second video.
For example, the second predetermined time is 8:00 to 12:00, and the first predetermined time may be 8:00 to 23:00, where the first video includes all the second videos.
For another example, the second preset time is 8:00-12:00, the first preset time may be 11:00-23:00, and the first video may include the second video published at 11:00-12: 00.
In some embodiments of the present disclosure, the second preset time may be a previous period of the first preset time.
For example, the second predetermined time may be 8:00-24:00 on Monday, and the first predetermined time may be 8:00-24:00 on Tuesday.
In some embodiments of the present disclosure, S410 may specifically include the following steps: extracting the content characteristics of the plurality of second videos and the user characteristics of the users corresponding to the plurality of second videos; inputting content characteristics and user characteristics into a preset risk assessment model, and outputting violation probability values of a second video; and taking the violation probability value of the second video as the risk information of the second video.
And S420, dividing the plurality of second videos into a preset number of groups based on the risk information of each second video.
It should be noted that the manner of dividing the plurality of second videos into the groups with the preset number in S420 is the same as the manner of dividing the plurality of first videos into the groups with the preset number in S220, and for brevity, no further description is given here.
And S430, determining the misjudgment rate of the second video in each of the preset number of groups based on a preset sampling evaluation mode, wherein the preset number of groups comprises a first group and a second group.
In some embodiments of the present disclosure, determining the misjudgment rate of the second videos in the first packet of the preset number of packets based on the preset sampling evaluation manner may include: calculating the number of sampled videos of the first packet based on the number of second videos in the first packet, the violation probability values of the second videos, and formula (1); acquiring label information corresponding to the sampled videos, where the label information may include a normal label and a violation label; and calculating the misjudgment rate of the second videos in the first packet based on the number of sampled videos corresponding to the violation label.
In some embodiments of the present disclosure, determining the misjudgment rate of the second videos in the second packet of the preset number of packets based on the preset sampling evaluation manner may include: calculating the number of sampled videos of the second packet based on the number of second videos in the second packet, the violation probability values of the second videos, and formula (1); acquiring label information corresponding to the sampled videos, where the label information may include a normal label and a violation label; and calculating the misjudgment rate of the second videos in the second packet based on the number of sampled videos corresponding to the violation label.
It should be noted that, in the present disclosure, a manner of calculating a misjudgment rate of the second video in each packet is the same as a manner of calculating a misjudgment rate of the first video in the first packet, and details are not repeated here.
And S440, calculating the misjudgment rate ratio of the first packet to the second packet based on the misjudgment rates of the second videos in the first packet and the second packet.
In some embodiments of the present disclosure, the misjudgment rate ratio may be the ratio of the misjudgment rate of the first packet to the misjudgment rate of the second packet.
Illustratively, the first packet is packet 1, and the second packet includes packet 2 and packet 3, where packet 1 corresponds to a misjudgment rate of 0.2, packet 2 corresponds to a misjudgment rate of 0.2, and packet 3 corresponds to a misjudgment rate of 0.1. The misjudgment rate ratio of packet 1 to packet 2 is 1:1, and the misjudgment rate ratio of packet 1 to packet 3 is 2:1.
In this way, all video content issued by the platform within the second preset time is grouped by the violation risk of each video, so that the violation risks of the video content differ between packets. On this basis, the misjudgment rate of the second videos in each packet can be calculated, and the misjudgment rate ratios between packets obtained from those rates. Although the video content of the video distribution platform is constantly updated, the misjudgment rate ratios between packets are relatively stable, so the ratios still apply after all the first videos distributed by the platform within the first preset time are grouped in the same manner. That is to say, a single comprehensive sampling evaluation of the video content issued by the platform reveals the distribution characteristics of missed violating content; once the misjudgment rate ratios are obtained, the misjudgment rates of the other packets can be calculated from the ratios and the misjudgment rate of one packet, and large-scale evaluation of missed illegal video content is achieved without another comprehensive sampling evaluation.
In some embodiments of the present disclosure, S240 may specifically include: and calculating the misjudgment rate of the first video in the second packet based on the misjudgment rate of the first video in the first packet and the misjudgment rate ratio of the first packet to the second packet.
Illustratively, the first packet is packet 1, the second packet includes packet 2 and packet 3, the misjudgment rate ratio of the 3 packets is 2:2:1, and the calculated misjudgment rate of packet 1 is 0.24. Therefore, based on these two parameters, the misjudgment rate of packet 2 is calculated to be 0.24, and the misjudgment rate of packet 3 is calculated to be 0.12.
In some embodiments of the present disclosure, in the case that the number of the first packets is at least two, the misjudgment rate of the first video in the at least two second packets may be calculated based on S240, and at this time, an average value of the misjudgment rates of the first video in the at least two second packets may be calculated, and the average value is taken as a final misjudgment rate.
Illustratively, the first packets include packet 1 and packet 4, the second packets include packet 2 and packet 3, the misjudgment rate ratio of the 4 packets is 2:2:1:3, the calculated misjudgment rate of packet 1 is 0.24, and the calculated misjudgment rate of packet 4 is 0.39. From the ratio 2:2:1:3 and the rate 0.24, the misjudgment rate of packet 2 is 0.24 and that of packet 3 is 0.12; from the ratio 2:2:1:3 and the rate 0.39, the misjudgment rate of packet 2 is 0.26 and that of packet 3 is 0.13. In this case, the average of the two estimates may be taken as the final misjudgment rate, that is, 0.25 for packet 2 and 0.125 for packet 3.
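The inference and averaging steps in the two examples above can be sketched as follows; the packet names, ratio terms, and audited rates come from the example, while the function name and data layout are assumptions.

```python
# Sketch of S240: infer each second packet's misjudgment rate from every
# audited first packet via the stable misjudgment rate ratio, then average
# the per-first-packet estimates.
def infer_rates(ratios, audited):
    """ratios: ratio term per packet, e.g. {"p1": 2, "p2": 2, "p3": 1, "p4": 3}.
    audited: audited misjudgment rates of the first packets."""
    inferred = {}
    for packet, term in ratios.items():
        if packet in audited:
            continue  # first packets keep their audited rates
        # One estimate per audited first packet, rescaled by the ratio terms.
        estimates = [rate * term / ratios[src] for src, rate in audited.items()]
        inferred[packet] = sum(estimates) / len(estimates)
    return inferred

ratios = {"p1": 2, "p2": 2, "p3": 1, "p4": 3}
rates = infer_rates(ratios, audited={"p1": 0.24, "p4": 0.39})
# rates["p2"] ≈ 0.25 and rates["p3"] ≈ 0.125, matching the example above
```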
In this way, based on the calculated misjudgment rate ratio of the first packet to the second packet, the misjudgment rates of the first videos in the other packets can be determined by calculating the misjudgment rate of the first videos in only one or some packets. All video content issued by the platform within the first preset time can thus be evaluated effectively, the sample amount to be evaluated is greatly reduced, and auditing manpower is saved. On this basis, the illegal video content among the first videos can be effectively deleted and cleaned up, which improves the overall quality of the video content issued by the platform and the viewing experience of users.
Based on the data processing method, the embodiment of the disclosure also provides a data processing device. This is explained with reference to fig. 5.
FIG. 5 is a block diagram illustrating a data processing apparatus according to an example embodiment. Referring to fig. 5, the data processing apparatus 500 may include an acquisition module 510, a grouping module 520, and a determination module 530.
The obtaining module 510 is configured to perform obtaining risk information of a plurality of first videos within a first preset time, where the risk information is used to represent violation risk size of video content; a grouping module 520 configured to perform grouping of the plurality of first videos into a preset number of groups based on the risk information of each first video; a determining module 530 configured to perform determining misjudgment information of the first video in the first packet of the predetermined number of packets based on a predetermined sampling evaluation manner; the determining module 530 is further configured to perform determining the misjudgment information of the first video in the second packet of the preset number of packets based on the misjudgment information of the first video in the first packet.
With the data processing apparatus provided by the embodiment of the disclosure, all video content issued by the platform within the first preset time is grouped by the violation risk of each video, so that the violation risks of the video content differ between packets. Only the video content in one or some packets then needs to be audited and evaluated to obtain misjudgment information, from which the misjudgment information of the video content in the other packets can be derived. In this way, the video content misjudged as "normal" among all the video content issued within the first preset time is evaluated effectively, a large sampling workload is avoided, the evaluation efficiency is improved, and auditing manpower and resources are saved. The apparatus can therefore address the problems that, because the volume of video content published by the platform is huge and the proportion of illegal content that slips through review is extremely low, evaluating missed content by the traditional sampling method consumes a large amount of auditing labor and is inefficient.
In some embodiments of the present disclosure, the misjudgment information includes a misjudgment rate of the video, and the determining module 530 is specifically configured to perform: and calculating the misjudgment rate of the first video in the second packet based on the misjudgment rate of the first video in the first packet and the misjudgment rate ratio of the first packet to the second packet.
In this way, based on the calculated misjudgment rate ratio of the first packet to the second packet, the misjudgment rates of the first videos in the other packets can be determined by calculating the misjudgment rate of the first videos in only one or some packets. All video content issued by the platform within the first preset time can thus be evaluated effectively, the sample amount to be evaluated is greatly reduced, and auditing manpower is saved. On this basis, the illegal video content among the first videos can be effectively deleted and cleaned up, which improves the overall quality of the video content issued by the platform and the viewing experience of users.
In some embodiments of the present disclosure, the apparatus further comprises: the obtaining module 510 is further configured to perform obtaining risk information of a plurality of second videos in a second preset time before calculating a misjudgment rate of the first video in the second packet based on the misjudgment rate of the first video in the first packet and a misjudgment rate ratio of the first packet to the second packet; a grouping module 520 further configured to perform grouping the plurality of second videos into a preset number of groups based on the risk information of each second video; a determining module 530, further configured to perform determining a misjudgment rate of the second video in each of a preset number of packets based on a preset sampling evaluation manner, where the preset number of packets includes a first packet and a second packet; the calculation module is further configured to calculate a misjudgment ratio of the first packet to the second packet based on the misjudgment ratio of the second video in the first packet and the second packet.
In this way, all video content issued by the platform within the second preset time is grouped by the violation risk of each video, so that the violation risks of the video content differ between packets. On this basis, the misjudgment rate of the second videos in each packet can be calculated, and the misjudgment rate ratios between packets obtained from those rates. Although the video content of the video distribution platform is constantly updated, the misjudgment rate ratios between packets are relatively stable, so the ratios still apply after all the first videos distributed by the platform within the first preset time are grouped in the same manner. That is to say, a single comprehensive sampling evaluation of the video content issued by the platform reveals the distribution characteristics of missed violating content; once the misjudgment rate ratios are obtained, the misjudgment rates of the other packets can be calculated from the ratios and the misjudgment rate of one packet, and large-scale evaluation of missed illegal video content is achieved without another comprehensive sampling evaluation.
In some embodiments of the present disclosure, the risk information includes an illegal probability value of the video content, and the preset sampling evaluation manner includes calculating a misjudgment rate of the video content in a preset group based on the number of sampled videos corresponding to the illegal tag in the preset group, where the preset group is a first group or a second group; the number of the sampling videos corresponding to the preset groups is calculated based on the number of the video contents in the preset groups and the violation probability value corresponding to each video content.
In this way, when sampling and evaluating the video content in a certain packet in the preset number of packets, the number of the sampled videos of the packet can be calculated based on the number of the video content in the packet and the violation probability value. That is, when a packet is sampled, the number of sampled videos extracted from the packet needs to be considered in consideration of the size of the violation risk of the video content in the packet, so that the number of sampled videos obtained can be more suitable for the video violation in the packet.
In some embodiments of the present disclosure, the obtaining module 510 includes: the extraction unit is configured to extract content characteristics of a plurality of first videos and user characteristics of users corresponding to the plurality of first videos; the input and output unit is configured to input content characteristics and user characteristics to a preset risk assessment model and output violation probability values of the first video; a determination unit configured to perform a violation probability value of the first video as risk information of the first video.
Therefore, violation risk assessment can be performed on the video more comprehensively based on the content characteristics of the video and the user characteristics of the user corresponding to the video, the obtained violation probability value can represent the violation risk size of the video more accurately, video grouping is performed based on the violation risk size, and violation risks of video content in different groups can be differentiated. Therefore, when the misjudgment information of the video content in one or some of the packets is used for carrying out misjudgment evaluation on the video content in other packets, the accuracy of the misjudgment evaluation can be ensured.
In some embodiments of the present disclosure, the obtaining module 510 is further configured to perform, before inputting the content features and the user features to the risk assessment model, obtaining an input training sample and an output training sample, where the input training sample includes the content features of the sample video and the user features of the user corresponding to the sample video, and the output training sample includes label information of the sample video; and the model training module is configured to perform training on a preset neural network model based on the input training sample and the output training sample to obtain a risk assessment model.
In the embodiment of the disclosure, the preset neural network model is trained based on the content characteristics of the sample video, the user characteristics of the user corresponding to the sample video, and the label information of the sample video, so that a risk assessment model capable of performing risk assessment on the video content can be obtained, and thus the accurate assessment of the violation risk of the first video is realized.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 6 is a block diagram illustrating an electronic device in accordance with an example embodiment. Referring to fig. 6, an embodiment of the present disclosure further provides an electronic device, which includes a processor 610, a communication interface 620, a memory 630, and a communication bus 640, where the processor 610, the communication interface 620, and the memory 630 complete communication with each other through the communication bus 640.
The memory 630 is used for storing instructions executable by the processor 610.
The processor 610, when executing the instructions stored in the memory 630, implements the following steps:
acquiring risk information of a plurality of first videos within a first preset time period, wherein the risk information is used for representing a violation risk of video content;
dividing the plurality of first videos into a preset number of groups based on the risk information of each first video;
determining misjudgment information of a first video in a first group of the preset number of groups based on a preset sampling evaluation mode;
and determining misjudgment information of a first video in a second group of the preset number of groups based on the misjudgment information of the first video in the first group.
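The four steps above can be sketched as follows. This is a minimal illustration in which equal-width probability buckets implement the grouping and a historically derived rate ratio drives the final extrapolation; the function names and the bucketing scheme are assumptions for illustration, not taken from the disclosure.

```python
from typing import List, Sequence

def split_into_groups(probs: Sequence[float], n_groups: int) -> List[List[int]]:
    """Step 2: divide videos into n_groups by violation probability.

    Uses equal-width buckets over [0, 1]; returns lists of video indices.
    """
    groups: List[List[int]] = [[] for _ in range(n_groups)]
    for idx, p in enumerate(probs):
        bucket = min(int(p * n_groups), n_groups - 1)  # clamp p == 1.0 into the last bucket
        groups[bucket].append(idx)
    return groups

def sampled_misjudgment_rate(violating_in_sample: int, sample_size: int) -> float:
    """Step 3: rate observed by manually reviewing a sample drawn from one group."""
    return violating_in_sample / sample_size

def extrapolate_rate(first_group_rate: float, rate_ratio: float) -> float:
    """Step 4: estimate the second group's rate from the first group's sampled
    rate and a ratio between the two groups' rates measured on an earlier
    (second preset time period) batch of videos."""
    return first_group_rate * rate_ratio
```

For example, six videos with violation probabilities `[0.05, 0.2, 0.5, 0.55, 0.9, 0.99]` split into three groups as `[[0, 1], [2, 3], [4, 5]]`; if manual review of a 40-video sample from the first group finds 2 misjudged videos (rate 0.05) and the historical first-to-second rate ratio is 0.5, the second group's rate is estimated at 0.025 without sampling it.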
It can be seen that, by applying the embodiments of the present disclosure, all video content published on a platform within the first preset time period is grouped based on the violation risk of each video, so that the violation risks of the video content in different groups are differentiated. Only the video content in one group or in some groups is then audited and evaluated to obtain misjudgment information, and the misjudgment information of the video content in the other groups is derived from it. In this way, the video content misjudged as "normal" among all the video content published within the first preset time period can be evaluated effectively without a large number of samples, which improves evaluation efficiency and saves auditing manpower and resources. The method therefore addresses the problems that, because the amount of video content published on a platform is huge and the proportion of violating content missed by review is extremely low, evaluating such missed content with a traditional sampling method consumes a large amount of auditing labor and is inefficient.
Fig. 7 is a block diagram illustrating a device for a data processing method according to an example embodiment. For example, the device 700 may be provided as a server. Referring to fig. 7, the device 700 includes a processing component 722, which further includes one or more processors, and memory resources, represented by a memory 732, for storing instructions (for example, application programs) executable by the processing component 722. The application programs stored in the memory 732 may include one or more modules, each corresponding to a set of instructions. The processing component 722 is configured to execute the instructions to perform the data processing method according to any of the embodiments described above.
The device 700 may also include a power component 727 configured to perform power management of the device 700, a wired or wireless network interface 750 configured to connect the device 700 to a network, and an input/output (I/O) interface 758. The device 700 may operate based on an operating system stored in the memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In some embodiments of the present disclosure, a computer-readable storage medium is further provided. When the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the data processing method of any one of the above embodiments.
Optionally, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In some embodiments of the present disclosure, a computer program product is further provided, which includes computer instructions, and the computer instructions, when executed by a processor, implement the data processing method according to any of the above embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A data processing method, comprising:
acquiring risk information of a plurality of first videos within a first preset time period, wherein the risk information is used for representing a violation risk of video content;
dividing the plurality of first videos into a preset number of groups based on the risk information of each first video;
determining misjudgment information of a first video in a first group of the preset number of groups based on a preset sampling evaluation mode;
and determining misjudgment information of a first video in a second group of the preset number of groups based on the misjudgment information of the first video in the first group.
2. The method according to claim 1, wherein the misjudgment information comprises a misjudgment rate of videos, and the determining misjudgment information of the first video in the second group of the preset number of groups based on the misjudgment information of the first video in the first group comprises:
calculating the misjudgment rate of the first video in the second group based on the misjudgment rate of the first video in the first group and a misjudgment rate ratio of the first group to the second group.
3. The method of claim 2, wherein before the calculating the misjudgment rate of the first video in the second group based on the misjudgment rate of the first video in the first group and the misjudgment rate ratio of the first group to the second group, the method further comprises:
acquiring risk information of a plurality of second videos within a second preset time period;
dividing the plurality of second videos into the preset number of groups based on the risk information of each second video;
determining the misjudgment rate of the second videos in each of the preset number of groups based on the preset sampling evaluation mode, wherein the preset number of groups comprises the first group and the second group;
and calculating the misjudgment rate ratio of the first group to the second group based on the misjudgment rates of the second videos in the first group and the second group.
4. The method according to claim 2 or 3, wherein the risk information comprises a violation probability value of the video content, and the preset sampling evaluation mode comprises: calculating the misjudgment rate of the video content in a preset group based on the number of sampled videos corresponding to violation tags in the preset group, wherein the preset group is the first group or the second group;
and calculating the number of sampled videos corresponding to the preset group based on the amount of video content in the preset group and the violation probability value corresponding to each piece of video content.
5. The method of claim 1, wherein the obtaining the risk information of the plurality of first videos in the first preset time comprises:
extracting content features of the plurality of first videos and user features of users corresponding to the first videos;
inputting the content features and the user features into a risk assessment model, and outputting a violation probability value of each first video;
and taking the violation probability value of each first video as the risk information of that first video.
6. The method of claim 5, wherein before the inputting the content features and the user features into the risk assessment model, the method further comprises:
acquiring an input training sample and an output training sample, wherein the input training sample comprises content features of a sample video and user features of a user corresponding to the sample video, and the output training sample comprises label information of the sample video;
and training a preset neural network model based on the input training sample and the output training sample to obtain the risk assessment model.
7. A data processing apparatus, comprising:
the acquisition module is configured to acquire risk information of a plurality of first videos within a first preset time period, wherein the risk information is used for representing a violation risk of video content;
a grouping module configured to divide the plurality of first videos into a preset number of groups based on the risk information of each first video;
a determining module configured to determine misjudgment information of a first video in a first group of the preset number of groups based on a preset sampling evaluation mode;
wherein the determining module is further configured to determine misjudgment information of a first video in a second group of the preset number of groups based on the misjudgment information of the first video in the first group.
8. An electronic device, comprising:
a processor;
a memory for storing instructions executable by the processor;
wherein the processor is configured to execute the instructions to implement the data processing method of any one of claims 1-6.
9. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the data processing method of any of claims 1-6.
10. A computer program product comprising computer instructions, characterized in that the computer instructions, when executed by a processor, implement the data processing method of any of claims 1-6.
CN202110743077.8A 2021-06-30 2021-06-30 Data processing method, device, electronic equipment, storage medium and program product Active CN113408470B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110743077.8A CN113408470B (en) 2021-06-30 2021-06-30 Data processing method, device, electronic equipment, storage medium and program product

Publications (2)

Publication Number Publication Date
CN113408470A true CN113408470A (en) 2021-09-17
CN113408470B CN113408470B (en) 2024-03-08

Family

ID=77680776

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113923464A (en) * 2021-09-26 2022-01-11 北京达佳互联信息技术有限公司 Video violation rate determination method, device, equipment, medium and program product

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111090776A (en) * 2019-12-20 2020-05-01 广州市百果园信息技术有限公司 Video auditing method, device, auditing server and storage medium
CN111598641A (en) * 2019-02-21 2020-08-28 北京嘀嘀无限科技发展有限公司 Order risk verification method and system
CN113055666A (en) * 2019-12-26 2021-06-29 武汉Tcl集团工业研究院有限公司 Video quality evaluation method and device


Also Published As

Publication number Publication date
CN113408470B (en) 2024-03-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant