CN117668296A - Video auditing method, electronic device and computer readable storage medium - Google Patents

Video auditing method, electronic device and computer readable storage medium

Info

Publication number
CN117668296A
CN117668296A
Authority
CN
China
Prior art keywords
target
text
video
auditing
audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311653014.9A
Other languages
Chinese (zh)
Inventor
李涛
贾欣阳
张旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Apas Digital Cloud Information Technology Co ltd
Original Assignee
Zhengzhou Apas Digital Cloud Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Apas Digital Cloud Information Technology Co ltd filed Critical Zhengzhou Apas Digital Cloud Information Technology Co ltd
Priority to CN202311653014.9A
Publication of CN117668296A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a video auditing method, electronic equipment and a computer readable storage medium, and belongs to the field of computers. The method comprises the following steps: acquiring a target video to be audited; acquiring a target text corresponding to the target video, wherein the target text comprises a text label of a target image frame in the target video and/or an audio text of target audio in the target video; and auditing the target text through a target auditing model according to a preset rule to obtain a target auditing result of the target video.

Description

Video auditing method, electronic device and computer readable storage medium
Technical Field
The application belongs to the field of computers, and particularly relates to a video auditing method, an electronic device, and a computer readable storage medium.
Background
With the development of artificial intelligence technology, large models such as ChatGPT have emerged. A large model can gradually complete various tasks much as a human would, handling multiple tasks such as writing emails, creating text, translating, and even programming in the course of a conversation, and is therefore widely used in many fields.
In the related art, a large model is generally used for simple question-answering assistance; however, when faced with relatively complex data such as video (which contains multiple kinds of data, including images and audio), auditing the video with a large model is difficult.
Disclosure of Invention
The embodiment of the application provides a video auditing method, an electronic device, and a computer readable storage medium, which can solve the problem in the related art that auditing video with a large model is difficult.
In a first aspect, an embodiment of the present application provides a method for video auditing, including: acquiring a target video to be audited; acquiring a target text corresponding to the target video, wherein the target text comprises a text label of a target image frame in the target video and/or an audio text of target audio in the target video; and auditing the target text through a target auditing model according to a preset rule to obtain a target auditing result of the target video.
In a second aspect, embodiments of the present application provide an electronic device comprising a processor and a memory storing a program or instructions that, when executed by the processor, implement the steps of the method according to the first aspect.
In a third aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a program or instructions which when executed implement the steps of the method according to the first aspect.
In the embodiment of the application, a target video to be audited is obtained; a target text corresponding to the target video is acquired, wherein the target text comprises a text label of a target image frame in the target video and/or an audio text of target audio in the target video; and the target text is audited through a target auditing model according to a preset rule to obtain a target auditing result of the target video. Because the target text corresponding to the target video is acquired, the content to be audited is converted from complex data such as images and/or audio (the target video) into text data that is easy to process (the target text corresponding to the target video). The target auditing model can audit, according to the preset rule, the target text that represents the content of the target video, and thereby obtain the target auditing result of the target video. This reduces the difficulty of video auditing with the target auditing model and, to a certain extent, solves the problem in the related art that auditing video with a large model is difficult.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a method for video auditing according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for video auditing provided in an embodiment of the present application;
FIG. 3 is a conceptual diagram of a method for video auditing provided in an embodiment of the present application;
fig. 4 is a block diagram of a video auditing apparatus according to an embodiment of the present application;
fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
The terms first, second and the like in the description and in the claims, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type and not limited to the number of objects, e.g., the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/", generally means that the associated object is an "or" relationship.
Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As described in the background art, with the development of artificial intelligence technology, a large model can gradually complete various tasks much as a human would, handling multiple tasks such as writing emails, creating text, translating, and even programming in the course of a conversation, and is therefore widely used in many fields. In the related art, a large model is generally used for simple question-answering assistance; however, when faced with relatively complex data such as video (which contains multiple kinds of data, including images and audio), auditing the video with a large model is difficult. The embodiment of the application provides a video auditing method, which can, to a certain extent, solve the problem in the related art that auditing video with a large model is difficult.
Specifically, in the video auditing method provided by the embodiment of the present application, a target video to be audited may be obtained; a target text corresponding to the target video is acquired, wherein the target text comprises a text label of a target image frame in the target video and/or an audio text of target audio in the target video; and the target text is audited through a target auditing model according to a preset rule to obtain a target auditing result of the target video. Because the target text corresponding to the target video is acquired, the content to be audited is converted from complex data such as images and/or audio (the target video) into text data that is easy to process (the target text corresponding to the target video). The target auditing model can audit, according to the preset rule, the target text that represents the content of the target video, and thereby obtain the target auditing result of the target video. This reduces the difficulty of video auditing with the target auditing model and, to a certain extent, solves the problem in the related art that auditing video with a large model is difficult.
In the embodiment of the application, in order to make the target text corresponding to the target video more accurate, an image-to-text model is also introduced. In the process of obtaining the text label corresponding to the target video by utilizing the image-to-text model, the target image frame in the target video can be extracted according to the target frequency, and image recognition processing is performed on the target image frame to obtain the text label of the target image frame. Because the target image frame comes from the target video, it can accurately represent the content of the target video; and because the text label is obtained from the target image frame, it can represent the content of the target video even more accurately. This reduces the difficulty of video auditing while further improving the accuracy of the target audit result. In addition, extracting image frames from the target video at the target frequency helps avoid overlooking non-compliant image frames in the target video.
In the embodiment of the application, not only an image-to-text model but also a voice-to-text model are introduced. The audio data (target audio) in the target video is converted into text data (audio text) through the voice conversion text model, so that the audio text can accurately represent the content of the target video. The video auditing difficulty is reduced, and meanwhile, the target auditing result is more accurate.
In the embodiment of the application, the target audit result includes that the audit is not passed, and if the target audit result is that the audit is not passed, a complaint request of a target user for the target video which is not passed by the audit can be obtained; and responding to the complaint request, and re-checking the target text of the target video through the target checking model to obtain a secondary checking result of the target video, so that the target checking result is more accurate.
It should be appreciated that a method for video auditing provided by embodiments of the present application may be performed by a target device. The target device may be one electronic device, or may be a plurality of electronic devices that are executed in cooperation with each other. The electronic device may be, for example, a terminal device such as a notebook computer or a tablet, or may be a server, such as an independent physical server, a server cluster formed by a plurality of servers, and a cloud server capable of performing cloud computing.
The method provided by the embodiment of the application is described in detail below through specific embodiments and application scenes thereof with reference to the accompanying drawings.
Fig. 1 is a flowchart of a method for video auditing according to an embodiment of the present application. As shown in fig. 1, the method for video auditing provided in the embodiment of the present application includes the following steps:
step 110, obtaining a target video to be audited.
In this embodiment of the present application, the target video may be a video of any content to be audited by the target device; for example, the target video may be a video shot by the user through a camera lens, or a video that the user has post-processed through editing and compositing.
Accordingly, the target device can receive the target video from the outside; for example, the target device acquires the target video by receiving it from a server having a clipping function. The target device may also generate the target video internally; for example, the target device may receive the video shot by the user through the camera lens and then perform secondary processing on it to obtain the target video. When the target device obtains the target video from an external device (e.g., a server with a clipping function), the connection between the target device and the external device may be wired, such as an optical fiber connection, or wireless, such as Bluetooth, which is not specifically limited in the embodiments of the present application.
The file format of the target video may be any video encapsulation format, such as an audio video interleave format (Audio Video Interleaved, AVI for short) and a moving picture expert group format (Moving Picture Experts Group, MPEG for short), which is not particularly limited in the embodiments of the present application. Accordingly, the embodiment of the application does not limit the file size, refresh rate, pixel and other video data attributes of the target video.
After receiving the target video to be audited, the target device can acquire a target text corresponding to the target video, and audit the target text by the target audit model, so that the difficulty of auditing the video by the target audit model is reduced.
Step 120, obtaining a target text corresponding to the target video, where the target text includes a text tag of a target image frame in the target video, and/or an audio text of a target audio in the target video.
In the embodiment of the present application, part of the image frames may be extracted from the complete set of image frames of the target video at a fixed, static frequency as the target image frames, or part of the image frames may be randomly extracted from the complete set of image frames of the target video at a dynamic frequency as the target image frames.
In the embodiment of the present application, the target image frame may be any one video image frame or a plurality of video image frames in the target video. The content of the target image frame may be any content; for example, it may be content consistent with the subject matter of the target video: when the target video is a scenic video photographed by a user, the content of the target image frame may be the scenic content contained in the target video. The content of the target image frame may also be content unrelated to the target video, such as when a user manually inserts image frames of unrelated content (e.g., non-compliant promotional content) into the target video. The embodiment of the application does not limit image data attributes of the target image frame such as file size, file format, and pixels.
Accordingly, the text label of the target image frame corresponds to the content of the target image frame. The text label may come directly from the text content on the target image frame, for example, when the target image frame is a movie shot with subtitles, the text label may be the subtitle content on the target image frame. The text label may also be indirectly obtained from the target image frame, for example, the text label may be the content of the target image frame identified by using the image identification algorithm, and when the content of the target image frame is a mountain forest photograph of a certain scenic spot, the image identification algorithm may identify that a mountain forest element exists on the target image frame, and then the text label of the target image frame may be the mountain forest.
In the embodiment of the present application, the text label may be a word summarizing the content of the target image frame, or a sentence describing the content of the target image frame, which is not specifically limited in the embodiment of the present application. For example, when the background of the target image frame is grassland and there are one person and one dog in the picture, the text labels of the target image frame may be "one person", "one dog" and "grassland", or the text label may be "on the grassland, a person is walking a dog". The format of the text label may be a text file format (e.g., TXT format), which is not particularly limited by the embodiments of the present application. In the embodiment of the application, in order to better display the text label, separators may also be added before and after the text label, such as "#mountain forest#".
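As a purely illustrative sketch (not part of the patent), wrapping text labels in separators as in the "#mountain forest#" example might look as follows; the function name `format_labels` and the exact delimiter style are assumptions for illustration.

```python
def format_labels(labels):
    """Wrap each text label in '#' separators for clearer display,
    as in the '#mountain forest#' example."""
    return " ".join(f"#{label}#" for label in labels)

# Labels recognized from a target image frame: a person, a dog, grassland.
print(format_labels(["one person", "one dog", "grassland"]))
# → #one person# #one dog# #grassland#
```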
In the embodiment of the application, the text label comes from the target image frame in the target video and can represent the content of the image data in the target video. In order to represent the content of the target video completely, the audio text of the target audio in the target video may also be obtained, so that the content of the audio data in the target video is represented through the audio text.
In this embodiment of the present application, the target audio may be a complete audio corresponding to the target video, or may be a part of audio in the complete audio corresponding to the target video. Accordingly, the target audio can be obtained by extracting the complete audio in the target video, and part of the audio in the complete audio corresponding to the target video can be cut off, and the part of the audio is used as the target audio. When the target audio is part of the audio in the complete audio corresponding to the target video, the target audio may be a plurality of pieces of audio cut randomly, and the embodiment of the present application does not specifically limit the duration, the file size, and the like of the target audio.
In the embodiment of the present application, the audio text is a text corresponding to the target audio, that is, the content of the target audio is represented by a text form. For example, when the target video is a scenic video with background music added to the user, the target audio may be the background music added to the user, and accordingly, the content of the audio text may be lyrics of the background music added to the user. The embodiment of the application does not limit the data attribute of the audio class such as the file size, the duration and the like of the target audio.
In the embodiment of the application, after the target text corresponding to the target video is obtained (comprising the text label of the target image frame in the target video and/or the audio text of the target audio in the target video), the target text can be audited through the target audit model. Because the content of the target text corresponds to the content of the target video, the audit result of the target text can serve as the target audit result of the target video.
And 130, auditing the target text according to a preset rule through a target auditing model to obtain a target auditing result of the target video.
In the embodiment of the application, the target audit result includes audit passing and/or audit failing. The target audit model may be a large model with text recognition and audit capabilities, such as ChatGPT or the Apas large model. The target auditing model can analyze and understand the target text and the preset rule respectively, so as to audit the target text according to the preset rule. The preset rule may be a rule consistent with prevailing law, such as "minor smoking is illegal" or "pornographic, violent, and horrific content in pictures and videos is prohibited". The target auditing result may be "target video audit passed" or "target video audit not passed". After the target audit model obtains the target audit result, the target audit result can be fed back to the user.
In the process of the target auditing model auditing the target text according to the preset rule, a video auditing prompt may be preset. The preset video auditing prompt may be: please act as a video auditor to audit the video; you need to determine whether the target video passes the audit, answering [yes] if so and [no] otherwise; in addition, you need to follow the preset rules when auditing the video:
{ preset rules that need to be followed by the target audit model };
the following are image and audio information of the target video:
the target video image contains the following information: "{ text label of target image frame }";
the content of the target video voice is as follows: "{ audio text of target audio }";
Please determine whether this video can pass the audit based on the audit rules.
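The prompt template above can be sketched as a string-assembly step. This is a hypothetical illustration of the described template; the function name `build_audit_prompt` and the exact English wording are assumptions, not the patent's implementation.

```python
def build_audit_prompt(rules, image_labels, audio_text):
    """Assemble the video-audit prompt for the target auditing model,
    following the template in the description: role, preset rules,
    image labels, audio text, and the final [yes]/[no] instruction."""
    return (
        "Please act as a video auditor to audit the video. You need to "
        "determine whether the target video passes the audit; if so, answer "
        "[yes], otherwise answer [no]. You must follow the preset rules:\n"
        f"{rules}\n"
        "The following are image and audio information of the target video:\n"
        f'The target video image contains the following information: "{image_labels}";\n'
        f'The content of the target video voice is as follows: "{audio_text}";\n'
        "Please determine whether this video can pass the audit based on the audit rules."
    )

prompt = build_audit_prompt(
    "Minors smoking is illegal; pornographic, violent or horrific content is prohibited.",
    "#mountain# #white cloud#",
    "lyrics of the background music",
)
print(prompt)
```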
In order to better understand the method for video auditing provided by the embodiments of the present application, an example will now be described. It should be understood that the example is not limiting. Suppose the preset rules are "minor smoking is illegal" and "pornographic, violent, and horrific content in pictures and videos is prohibited", and the target video is a mountain scenery video to which the user has added background music. Correspondingly, the target image frame may be a randomly extracted frame whose content is mountains and white clouds, so the text labels obtained from the target image frame may be "mountain" and "white cloud", or "a mountain rising into the clouds". The target audio may be the complete audio of the target video, that is, audio containing both the background music added by the user and the ambient noise of the shooting scene, and the audio text of the target audio may be the lyrics of the background music added by the user. The target text (the text label of the target image frame and/or the audio text of the target audio) conforms to the preset rules, so the target auditing model considers that the target video passes the audit and can feed the audit result back to the user.
In the embodiment of the application, a target video to be audited is obtained; acquiring a target text corresponding to the target video, wherein the target text comprises a text label of a target image frame in the target video and/or an audio text of target audio in the target video; and auditing the target text through a target auditing model according to a preset rule to obtain a target auditing result of the target video. Therefore, as the target text corresponding to the target video is acquired, the content to be audited is converted into text data (target text corresponding to the target video) which is easy to process from original image and/or audio and other complex data (target video). The target auditing model can audit the target text capable of representing the target video content according to a preset rule, and then a target auditing result of the target video can be obtained. The difficulty of video auditing by the target auditing model is reduced, and the problem of higher difficulty of video auditing by using a large model in the related technology is solved to a certain extent.
In step 120 provided in the embodiment of the present application, when the target text includes the text label of the target image frame in the target video, a specific implementation manner of obtaining the text label of the target image frame in the target video may be: extracting the target image frame in the target video according to the target frequency; and performing image recognition processing on the target image frame to obtain the text label of the target image frame. The unit of the target frequency can be any duration or number of frames; for example, the target frequency may be selecting one image frame out of every five image frames as a target image frame, or selecting one image frame per second as a target image frame. After the target image frame is obtained, image recognition processing can be performed on the target image frame; the text label of the target image frame can be obtained from the output of the image recognition processing, together with the confidence of the text label. The higher the confidence, the more certain the image-to-text model is that the label matches the image. For example, the text labels may be "boy" 0.8 and "lawn" 0.6, indicating that the image-to-text model is 80% certain that the content of the target image frame includes a "boy" and 60% certain that there is a "lawn" in the picture.
Because the text label is obtained by image recognition processing of the target image frame, its content is more accurate, and if a non-compliant image frame exists in the target video, it can be found in time, making the audit result of the target video more accurate.
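Selecting target image frames at a fixed target frequency can be sketched as simple index sampling. This is an illustrative sketch, not the patent's implementation; a real system would decode frames with a tool such as ffmpeg or OpenCV, while here only the sampling logic is shown.

```python
def sample_frame_indices(total_frames, every_n):
    """Select target image frames at a fixed target frequency:
    one frame out of every `every_n` frames."""
    if every_n <= 0:
        raise ValueError("every_n must be positive")
    return list(range(0, total_frames, every_n))

# A 30-frame clip sampled every 5 frames yields 6 target frames.
print(sample_frame_indices(30, 5))  # → [0, 5, 10, 15, 20, 25]
```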
In the embodiment of the application, in the process of performing image recognition processing on the target image frame by the target device to obtain the text label of the target image frame, image recognition processing can be performed on the target image frame through an image-to-text model to obtain a recognition processing result; and deduplication processing is performed on the recognition processing result to obtain the text label of the target image frame. The image-to-text model can be any large model with the function of recognizing an image and outputting corresponding content, such as a multimodal model based on contrastive learning (a CLIP model) or a YOLOv8 model. An image-to-text model can identify objects and features in a picture by learning from a large number of tagged pictures. For example, the image-to-text model may identify whether a person is in the image, along with the person's gender, colors, expression, clothing, and so on. The image-to-text model may also identify the style of the image, such as cartoon or realistic, two-dimensional or three-dimensional. When a picture is uploaded to the image-to-text model, the model analyzes the characteristics of the picture and gives some possible labels; these labels are the content that the image-to-text model infers from the picture. The image-to-text model selects the most relevant labels for output based on the confidence of each label: the higher the confidence, the more certain the image-to-text model is that the label matches the picture.
In this embodiment of the present application, if the target image frame comprises a plurality of video image frames, the image-to-text model may sequentially recognize and analyze the target image frames, outputting a recognition processing result for each video image frame until all recognition processing is completed. After the recognition processing results corresponding to all the target image frames are obtained, the target device performs deduplication processing on the recognition processing results, clearing and sorting the repeated content, so as to obtain the text label of the target image frames. Through the deduplication processing, the recognition processing result output by the image-to-text model is more concise, the amount of computation needed for the target auditing model to audit the target text is reduced, and the auditing speed of the target auditing model is improved.
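The deduplication of recognition processing results can be sketched as merging per-frame label/confidence pairs, keeping each label once with its highest confidence. The function name `deduplicate_labels` and the merge-by-maximum rule are assumptions for illustration, not the patent's implementation.

```python
def deduplicate_labels(results):
    """Merge per-frame recognition results, keeping each label once
    with its highest confidence (e.g. 'boy' 0.8, 'lawn' 0.6)."""
    merged = {}
    for label, confidence in results:
        if label not in merged or confidence > merged[label]:
            merged[label] = confidence
    return merged

# Results from several target image frames, with repeated content.
frames_output = [("boy", 0.8), ("lawn", 0.6), ("boy", 0.7)]
print(deduplicate_labels(frames_output))  # → {'boy': 0.8, 'lawn': 0.6}
```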
In step 120 provided in the embodiment of the present application, when the target text includes the audio text of the target audio in the target video, a specific implementation manner of obtaining the audio text of the target audio in the target video may be: separating the target audio from the target video; and performing audio recognition processing on the target audio to obtain the audio text of the target audio. The target audio may be obtained from the target video through separation processing; for example, the target video may be input into a separation tool (e.g., Audio Extractor) to obtain the target audio. Audio recognition processing is then performed on the target audio to convert the audio information of the target audio into corresponding text data (i.e., the audio text). Separating the audio in the target video to obtain the target audio, and obtaining the audio text from the target audio, can accurately represent the content of the target video while converting audio data into text data that is easy to process.
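One common way to separate the audio track is an ffmpeg invocation. The sketch below only builds the command rather than running it; `-vn` (drop the video stream) and `-acodec copy` (keep the original audio codec) are standard ffmpeg options, but the surrounding function is an assumption, not the patent's implementation.

```python
def build_audio_extract_cmd(video_path, audio_path):
    """Build an ffmpeg command that separates the audio track of the
    target video into a standalone audio file: -vn drops the video
    stream, -acodec copy keeps the original audio codec."""
    return ["ffmpeg", "-i", video_path, "-vn", "-acodec", "copy", audio_path]

cmd = build_audio_extract_cmd("target_video.mp4", "target_audio.aac")
print(" ".join(cmd))
# subprocess.run(cmd, check=True) would execute the separation.
```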
In the embodiment of the application, in the process of performing audio recognition processing on the target audio to obtain the audio text of the target audio, sound characteristics can be extracted from the target audio through a speech-to-text model, and the sound characteristics are converted through the speech-to-text model to obtain the audio text of the target audio. The speech-to-text model may be any large model capable of converting speech data into corresponding text data, for example a speech recognition model such as Whisper, and may include a recurrent neural network (Recurrent Neural Network, abbreviated as RNN): a speech file stores data in time sequence, and the neural network can convert the speech data into text data through computation. In the process of the speech-to-text model converting speech data into text data, two units may exist in the model, one an acoustic unit and the other a language unit. The acoustic unit may include a neural network that extracts important features from the sound, such as pitch, volume, and timbre. These features may be represented as a string of numbers in the form of a matrix, resembling an internal encoding of the sound. The language unit may include another neural network that generates corresponding text based on the numbers provided by the acoustic unit. The neural network within the language unit can learn the regularity and order between words, such as which letters combine into pinyin, which pinyin combine into Chinese characters, and which characters combine into sentences.
The two units can be trained jointly: for example, speech paired with its transcript is given to the speech-to-text model as a training example, so that the model continuously learns and improves, finally yielding a speech-to-text model that can automatically generate a text description for speech. Converting the target audio into the audio text through the speech-to-text model makes the audio text more accurate, which in turn makes the target audit result of the target audit model more accurate.
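Since the text names Whisper as one possible speech-to-text model, a small usage sketch follows. The `openai-whisper` package, the "base" model size, and the whitespace clean-up helper are illustrative assumptions; the import is deliberately lazy so the pure helper works without the package installed.

```python
def clean_transcript(text: str) -> str:
    """Normalize whitespace in a raw transcript before it is passed
    on to the audit model."""
    return " ".join(text.split())

def transcribe(audio_path: str, model_size: str = "base") -> str:
    """Transcribe the target audio into audio text.

    Requires the `openai-whisper` package (pip install openai-whisper);
    it bundles the acoustic and language units described above into
    one model. Imported lazily so clean_transcript stays usable
    without it.
    """
    import whisper
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path)
    return clean_transcript(result["text"])
```

A larger model size (e.g., "medium") generally trades speed for accuracy; which trade-off is right depends on the audit workload.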
In the embodiment of the application, the target text may include both the text label of the target image frame in the target video and the audio text of the target audio in the target video. In that case, the target video can be processed in parallel, obtaining the text label of the target image frame and the audio text of the target audio at the same time. For example, the video file may be decomposed into pictures frame by frame by a tool program (e.g., ffmpeg) while the audio track in the video file is separately extracted into an audio file. Alternatively, the target video may be processed sequentially, obtaining the text label of the target image frame and then the audio text of the target audio; the embodiment of the present application does not specifically limit this. Parallel processing improves the efficiency of acquiring the target text, which further reduces the time needed to obtain the target audit result and improves the efficiency of video auditing.
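The simultaneous extraction described above can be sketched with a small thread pool. The two worker callables are injected stand-ins for the frame-labelling and transcription pipelines, so the sketch stays independent of any concrete model.

```python
from concurrent.futures import ThreadPoolExecutor

def get_target_text(video_path, label_frames, transcribe_audio):
    """Run frame labelling and audio transcription concurrently.

    label_frames(video_path)     -> text labels of the target image frames
    transcribe_audio(video_path) -> audio text of the target audio

    Both callables stand in for the image-to-text and speech-to-text
    pipelines described in the text.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        labels_future = pool.submit(label_frames, video_path)
        audio_future = pool.submit(transcribe_audio, video_path)
        # .result() blocks until each branch finishes
        return {
            "text_labels": labels_future.result(),
            "audio_text": audio_future.result(),
        }
```

Threads suffice here because both branches are dominated by I/O and external tool calls; a process pool would be the alternative for CPU-bound work.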
In step 130 provided in the embodiment of the present application, the target audit result includes audit passed and/or audit failed. A specific implementation of auditing the target text through the target audit model according to the preset rule to obtain the target audit result of the target video may be: performing semantic understanding processing on the target text through the target audit model to obtain a semantic understanding result; if the semantic understanding result of the target text conforms to the preset rule, determining that the target video passes the audit; and if the semantic understanding result of the target text does not conform to the preset rule, determining that the target video fails the audit. In the process of semantic understanding, the target audit model computes over the target text, judges its possible contents, marks each judged content with a probability, and takes the content with the highest probability as the semantic understanding result.
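The pass/fail decision against the preset rules can be sketched as follows. The keyword matching here is a deliberately simple stand-in for the target audit model's judgment, and the rule structure (rule name mapped to violation phrases) is an assumption for illustration.

```python
def audit_decision(semantic_result: str, preset_rules: dict) -> dict:
    """Map a semantic understanding result onto the preset rules.

    preset_rules maps a rule name to a list of phrases that signal a
    violation of that rule. Returns the audit verdict plus the first
    rule that was violated, or None if the video passes.
    """
    for rule, violation_phrases in preset_rules.items():
        for phrase in violation_phrases:
            if phrase in semantic_result:
                return {"passed": False, "rule": rule}
    return {"passed": True, "rule": None}
```

In practice the matching itself would be done by the audit model; this helper only shows how a structured verdict can be derived from the semantic understanding result.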
In order to better understand the video auditing method provided by the embodiments of the present application, an example is now described; it should be understood that the example is not limiting. Suppose the preset rule includes "prohibit minor smoking", and the audio text in the target text is "he is 14 years old, I sent him a box of cigarettes". Semantic understanding processing is performed on the target text through the target audit model, yielding the semantic understanding result "related to promoting smoking to a minor". Based on this result, the target audit model can determine that the content is related to promoting smoking to a minor and does not conform to the preset rule prohibiting minor smoking, so the target video fails the audit, and the target audit result is audit failed. Performing semantic understanding on the target text through the target audit model makes it possible to judge more accurately whether the target text corresponding to the target video conforms to the rules, and thus to obtain a more accurate target audit result.
When the target audit result is "audit failed", a complaint request of a target user for the target video that failed the audit can be obtained; in response to the complaint request, the target text of the target video is re-audited through the target audit model to obtain a secondary audit result of the target video. For example, if the target audit result of a target video is "audit failed", the target user may submit a complaint request for that video, and when the target device receives the complaint request, it can re-audit the target text of the target video through the target audit model. During the re-audit, the target audit model performs semantic understanding processing on the target text of the target video again; if the semantic understanding result is unchanged, the judgment remains audit failed. In this way, complaint requests for videos that failed the audit give the target audit model feedback on its target audit results; the model can be trained on this feedback over multiple rounds, improving its accuracy and making the target audit results more accurate.
In the process of re-auditing the target text of the target video through the target audit model, a prompt for the re-audit may be held on the target device, for example:
please "you" as a video auditor to process the complaint request of the video producer for recheck, "you" need to follow the following rules for video audit:
{ preset rules that need to be followed by the target audit model };
the following are image and audio information of the target video:
the target video image contains the following information: "{ text label of target image frame }";
the content of the target video voice is as follows: "{ audio text of target audio }";
please "you" again review the video based on the review rules, judging whether this video can pass the review.
In the embodiment of the application, in the process of re-auditing based on a user's complaint request, after the target audit result of the target video is obtained, the target audit result is recorded, and if the target audit result is audit failed, the reason for the failure is recorded as well; likewise, after the secondary audit result of the target video is obtained, the secondary audit result is recorded, and if the secondary audit result is audit failed, the reason for that failure is recorded. In this way, during the secondary audit of the target video, the reasons the first audit failed can be retrieved, so that the target audit model can learn from them and provide a more accurate target audit result. Meanwhile, the records of why videos failed auditing can be stored in a database, so that other technicians can later organize and analyze the data.
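The record-keeping described above can be sketched with SQLite. The table layout (video id, audit stage, verdict, optional failure reason) is an assumption chosen to match the text, not a schema from the patent.

```python
import sqlite3

def init_audit_log(conn: sqlite3.Connection) -> None:
    """Create the audit-log table used to store both initial and
    secondary audit results."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS audit_log (
               video_id TEXT,
               stage    TEXT,     -- 'initial' or 'secondary'
               passed   INTEGER,  -- 1 = audit passed, 0 = failed
               reason   TEXT      -- NULL when the audit passed
           )"""
    )

def record_result(conn, video_id, stage, passed, reason=None):
    """Record an audit result; per the text, the failure reason is
    stored only when the audit did not pass."""
    conn.execute(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?)",
        (video_id, stage, int(passed), None if passed else reason),
    )
```

Storing both stages in one table lets the secondary audit look up the first-pass failure reason with a single query.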
Fig. 2 is a flowchart of a method for video auditing according to an embodiment of the present application. As shown in fig. 2, the method for video auditing provided in the embodiment of the present application includes the following steps:
step 210, obtaining a target video to be audited.
In this embodiment of the present application, after the target video to be audited is obtained, the text label of the target image frame may be obtained by performing steps 215 to 225, and the audio text of the target audio may be obtained by performing steps 230 to 240; steps 215 to 225 and steps 230 to 240 may be performed simultaneously.
Step 215, extracting the target image frame in the target video according to the target frequency.
Step 220, performing image recognition processing on the target image frame through the image-to-text model to obtain a recognition processing result.
Step 225, performing de-duplication processing on the recognition processing result to obtain the text label of the target image frame.
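The de-duplication of step 225 can be sketched in a few lines; keeping the order in which labels first appear is a design assumption, chosen so the labels still roughly follow the video's timeline.

```python
def dedupe_labels(labels):
    """Remove duplicate text labels collected across frames while
    keeping the order in which each label first appeared."""
    seen = set()
    out = []
    for label in labels:
        if label not in seen:
            seen.add(label)
            out.append(label)
    return out
```

This turns per-frame recognition results into a compact text label for the whole target video.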
Step 230, separating the target audio from the target video.
Step 235, extracting sound features from the target audio through the speech-to-text model.
Step 240, converting the sound features through the speech-to-text model to obtain the audio text of the target audio.
After the text label of the target image frame and/or the audio text of the target audio corresponding to the target video is obtained, step 245 may be executed, in which the target audit model performs the next stage of processing.
Step 245, carrying out semantic understanding processing on the target text through a target audit model to obtain a semantic understanding result, wherein the target text comprises the text label of the target image frame and/or the audio text of the target audio.
If the semantic understanding result of the target text accords with the preset rule, steps 250 to 255 can be executed; if the semantic understanding result of the target text does not conform to the preset rule, steps 260 to 285 may be performed.
Step 250, if the semantic understanding result of the target text conforms to the preset rule, determining that the target video passes the audit.
Step 255, recording the target audit result.
Step 260, if the semantic understanding result of the target text does not conform to the preset rule, determining that the target video fails the audit.
Step 265, recording the target audit result and the reasons the audit failed.
If the target auditing result of the target video is "auditing failed", steps 270 to 285 may be executed to perform secondary auditing on the target video.
Step 270, obtaining a complaint request of the target user for the target video that failed the audit.
Step 275, in response to the complaint request, re-auditing the target text of the target video through the target audit model to obtain a secondary audit result of the target video.
If the secondary review result of the target video is "review pass", step 280 may be executed; if the secondary review result of the target video is "review failed", step 285 may be performed.
Step 280, recording the secondary audit result.
Step 285, recording the secondary audit result and the reasons the secondary audit failed.
In this way, because the target text corresponding to the target video is acquired, the content to be audited is converted from complex raw data such as images and/or audio (the target video) into text data that is easy to process (the target text corresponding to the target video). The target audit model can audit the target text, which represents the content of the target video, according to the preset rule, and thereby obtain the target audit result of the target video. This reduces the difficulty of video auditing by the target audit model and, to a certain extent, solves the problem in the related art that video auditing with a large model is difficult.
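The flow of Fig. 2 can be condensed into one orchestration function. The four callables are injected stand-ins (their names and signatures are assumptions), which keeps the sketch independent of any concrete extraction tool or audit model.

```python
def audit_video(video_path, get_text, run_model, check_rules, log):
    """One-pass audit following steps 210-265 of Fig. 2.

    get_text(video)   -> target text (text labels and/or audio text)
    run_model(text)   -> semantic understanding result (str)
    check_rules(sem)  -> (passed: bool, reason: str or None)
    log(video, passed, reason) -> records the result (steps 255/265)
    """
    target_text = get_text(video_path)       # steps 215-240
    semantic = run_model(target_text)        # step 245
    passed, reason = check_rules(semantic)   # steps 250/260
    log(video_path, passed, reason)          # steps 255/265
    return {"passed": passed, "reason": reason}
```

The secondary audit of steps 270-285 would call the same function again with a re-review prompt that additionally carries the first-pass failure reason.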
For a better understanding of the video auditing method provided by embodiments of the present application, an example is now presented, with the understanding that the example is not limiting. Fig. 3 is a conceptual diagram of a video auditing method provided by an embodiment of the present application. In this method, the target video is first acquired to start the video audit. After the target video is obtained, frame extraction may be performed on the target video according to the target frequency to obtain the target image frames, and audio separation may be performed on the target video to obtain the target audio. Each target image frame may be input into the image-to-text model to derive keywords, for example: "scene, no_human, pair, window, info, tree". This is repeated until all target image frames have been converted into text labels, and the labels are then de-duplicated to obtain the text labels of the whole target video. While the target image frames are input into the image-to-text model, the target audio can be input into the speech-to-text model to obtain the audio text corresponding to the target audio. After the text labels and/or the audio text of the target video are obtained, they can be input into the target audit model, which performs the video audit according to the preset rule. During the audit, an audit prompt may be input to the target audit model, for example: please act as a video auditor ("you") and audit the video; "you" need to determine whether the target video passes the audit, answering [yes] if it does and [no] otherwise; in addition, "you" need to follow the rules below when auditing the video:
{ audit rules (i.e., preset rules) that need to be followed by the target audit model };
The following is the image and audio information of the target video:
The target video image contains the following information: "{text label}";
The content of the target video speech is as follows: "{audio text}";
Please determine, based on the audit rules, whether this video can pass the audit, and give the reasons if it does not.
If the target audit model judges that the target video passes the audit, the target audit result can be recorded. If the target audit model judges that the target video fails the audit, the target audit result can be recorded, whether a complaint request exists can be checked, and a secondary audit can be performed according to the user's complaint request.
During the secondary audit by the target audit model, an audit prompt can be input into the target audit model, for example: please act as a video auditor ("you") and handle the video producer's complaint request for re-review; "you" need to follow the rules below when auditing the video:
{ audit rules (i.e., preset rules) that need to be followed by the target audit model };
The reason the previous audit was not passed is:
{ reason for failed last audit };
The following is the image and audio information of the target video:
The target video image contains the following information: "{text label}";
The content of the target video speech is as follows: "{audio text}";
Please review the video again based on the audit rules, determine whether the video passes the audit, and give the reasons if it does not.
In the embodiment of the application, the user can repeatedly modify the video according to the reasons the audit was not passed and submit complaints again, until the video passes the audit.
The video auditing method provided by the embodiment of the application covers a complete video auditing flow. It converts the content to be audited from complex raw data such as images and/or audio (the target video) into text data that is easy to process (the target text corresponding to the target video). The target audit model can audit the target text, which represents the content of the target video, according to the preset rule, and thereby obtain the target audit result of the target video. This reduces the difficulty of video auditing by the target audit model and, to a certain extent, solves the problem in the related art that video auditing with a large model is difficult.
Fig. 4 is a block diagram of a video auditing apparatus according to an embodiment of the present application. As shown in fig. 4, the video auditing apparatus 400 according to an embodiment of the present application includes an obtaining module 410 and a processing module 420.
An acquisition module 410, configured to acquire a target video to be audited; acquiring a target text corresponding to the target video, wherein the target text comprises a text label of a target image frame in the target video and/or an audio text of target audio in the target video;
A processing module 420, configured to audit the target text through a target audit model according to a preset rule to obtain a target audit result of the target video.
The video auditing apparatus provided by the embodiment of the application can acquire a target video to be audited; acquire a target text corresponding to the target video, wherein the target text comprises a text label of a target image frame in the target video and/or an audio text of target audio in the target video; and audit the target text through a target audit model according to a preset rule to obtain a target audit result of the target video. In this way, because the target text corresponding to the target video is acquired, the content to be audited is converted from complex raw data such as images and/or audio (the target video) into text data that is easy to process (the target text corresponding to the target video). The target audit model can audit the target text, which represents the content of the target video, according to the preset rule, and thereby obtain the target audit result of the target video. This reduces the difficulty of video auditing by the target audit model and, to a certain extent, solves the problem in the related art that video auditing with a large model is difficult.
It should be noted that the embodiment of the video auditing apparatus in this specification and the embodiment of the video auditing method in this specification are based on the same inventive concept; for the specific implementation of this embodiment, reference may therefore be made to the implementation of the corresponding video auditing method, and repeated descriptions are omitted.
Fig. 5 is a schematic diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, an electronic device 500 provided by an embodiment of the present application may include a processor 510 and a memory 520. The memory stores a computer program that, when executed, performs the steps of any of the video auditing methods provided by the embodiments of the present application (e.g., the video auditing methods shown in fig. 1 and fig. 2).
The memory is used to store programs or data, and may be, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The embodiment of the present application further provides a computer readable storage medium, where a program or an instruction is stored on the readable storage medium, and when the program or the instruction is executed by a processor, the program or the instruction implements each process of the above method embodiment, and the same technical effects can be achieved, so that repetition is avoided, and no further description is given here.
Wherein the processor is a processor in the electronic device described in the above embodiment. The readable storage medium includes a computer readable storage medium such as a computer Read Only Memory (ROM), random Access Memory (RAM), magnetic or optical disk, etc.
The embodiment of the application further provides a chip, the chip includes a processor and a communication interface, the communication interface is coupled with the processor, and the processor is used for running a program or an instruction, implementing each process of the above method embodiment, and achieving the same technical effect, so as to avoid repetition, and not repeated here.
The embodiments of the present application provide a computer program product, which is stored in a storage medium, and the program product is executed by at least one processor to implement the respective processes of the above method embodiments, and achieve the same technical effects, and are not repeated herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may also be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general hardware platform, or by means of hardware, though in many cases the former is the preferred implementation. Based on such understanding, the technical solutions of the present application may be embodied, in essence or in the part contributing to the prior art, in the form of a computer software product stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disk), comprising several instructions for causing a terminal (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims (10)

1. A method of video auditing, comprising:
acquiring a target video to be audited;
acquiring a target text corresponding to the target video, wherein the target text comprises a text label of a target image frame in the target video and/or an audio text of target audio in the target video;
and auditing the target text through a target auditing model according to a preset rule to obtain a target auditing result of the target video.
2. The method of claim 1, wherein the target text comprises a text label of a target image frame in the target video; the obtaining the target text corresponding to the target video includes:
extracting target image frames in a target video according to the target frequency;
And carrying out image recognition processing on the target image frame to obtain a text label of the target image frame.
3. The method according to claim 2, wherein the performing image recognition processing on the target image frame to obtain a text label of the target image frame includes:
performing image recognition processing on the target image frame through an image-to-text model to obtain a recognition processing result;
and carrying out de-duplication processing on the identification processing result to obtain the text label of the target image frame.
4. The method of claim 1, wherein the target text comprises audio text of target audio in the target video; the obtaining the target text corresponding to the target video includes:
separating target audio from the target video;
and carrying out audio recognition processing on the target audio to obtain an audio text of the target audio.
5. The method of claim 4, wherein the performing the audio recognition process on the target audio to obtain the audio text of the target audio comprises:
extracting sound characteristics from the target audio through a voice-to-text model;
And converting the sound characteristics through the voice-to-text model to obtain the audio text of the target audio.
6. The method of any one of claims 1-5, wherein the target audit result includes audit pass and/or audit fail; the auditing processing is carried out on the target text through a target auditing model according to a preset rule to obtain a target auditing result of the target video, which comprises the following steps:
carrying out semantic understanding processing on the target text through a target audit model to obtain a semantic understanding result;
if the semantic understanding result of the target text accords with the preset rule, determining that the target video auditing is passed;
and if the semantic understanding result of the target text does not accord with the preset rule, determining that the target video audit is not passed.
7. The method of claim 6, wherein the target audit result includes audit failed; the method further comprises the steps of:
acquiring a complaint request of a target user for a target video which is not passed by the auditing;
and responding to the complaint request, and re-checking the target text of the target video through the target checking model to obtain a secondary checking result of the target video.
8. The method of claim 7, wherein the method further comprises:
after a target auditing result of the target video is obtained, recording the target auditing result, and recording reasons for failing auditing under the condition that the target auditing result is failing auditing;
and after the secondary auditing result of the target video is obtained, recording the secondary auditing result, and recording the reason that the secondary auditing is not passed under the condition that the secondary auditing result is that the auditing is not passed.
9. An electronic device comprising a processor and a memory storing a program or instructions that when executed by the processor perform the steps of the method of any of claims 1-8.
10. A computer readable storage medium, characterized in that the medium has stored thereon a program or instructions which, when executed, implement the steps of the method according to any of claims 1-8.
CN202311653014.9A 2023-12-04 2023-12-04 Video auditing method, electronic device and computer readable storage medium Pending CN117668296A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311653014.9A CN117668296A (en) 2023-12-04 2023-12-04 Video auditing method, electronic device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN117668296A 2024-03-08

Family

ID=90074607



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination