CN115269910A - Audio and video auditing method and system

Audio and video auditing method and system

Info

Publication number
CN115269910A
CN115269910A (application CN202210924829.5A)
Authority
CN
China
Prior art keywords
audio
video
manuscript
source file
reference set
Prior art date
Legal status (assumed; not a legal conclusion)
Pending
Application number
CN202210924829.5A
Other languages
Chinese (zh)
Inventor
李晓宇
杨朝
郭俊
Current Assignee
Shanghai Bilibili Technology Co Ltd
Original Assignee
Shanghai Bilibili Technology Co Ltd
Application filed by Shanghai Bilibili Technology Co Ltd
Priority to CN202210924829.5A
Publication of CN115269910A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods


Abstract

The application discloses an audio and video auditing method, which comprises the following steps: constructing a feature retrieval reference set based on a first information base and a second information base, wherein the first information base comprises information on audio/video source files whose copyright has been acquired, and the second information base comprises information on audio/video source files whose copyright has not been acquired; performing audio and picture content identification on an audio/video manuscript uploaded by a user, and retrieving the reference set to obtain a matching result between the manuscript and the audio/video source files in the reference set; performing authoring type identification on the manuscript; and determining the auditing result of the manuscript by combining the matching result and the authoring type. The application also discloses an audio and video auditing system, an electronic device, and a computer-readable storage medium. Efficient and accurate automated audio and video auditing can thereby be achieved.

Description

Audio and video auditing method and system
Technical Field
The present application relates to the field of audio and video technologies, and in particular, to an audio and video auditing method, system, electronic device, and computer-readable storage medium.
Background
With the popularization and development of computer technology, audio and video sharing platforms are proliferating. These platforms generally allow users to upload audio and video content and are known as UGC (User Generated Content) platforms, hosting, for example, series of videos uploaded by an uploader (UP master). On a UGC platform, the problem of user-uploaded content involving copyright is frequently encountered: a user may repost content produced by another copyright party, or edit such a video and upload it as a manuscript, thereby causing an infringement problem.
Conventional UGC platforms typically employ large numbers of auditors to review audio and video content, including for copyright issues. However, because the volume of user-produced content is enormous, manual copyright identification and management consumes substantial human resources and suffers from poor accuracy and timeliness. A scheme capable of automatically auditing massive amounts of audio/video content is therefore needed.
Disclosure of Invention
The application mainly aims to provide an audio and video auditing method, an audio and video auditing system, an electronic device and a computer readable storage medium, and aims to solve the problem of how to automatically audit massive audio and video contents.
In order to achieve the above object, an embodiment of the present application provides an audio/video auditing method, where the method includes:
constructing a feature retrieval reference set based on a first information base and a second information base, wherein the first information base comprises audio and video source file information with acquired copyright, and the second information base comprises audio and video source file information without acquired copyright;
carrying out audio and picture content identification on the audio and video manuscript uploaded by a user, and retrieving the reference set to obtain a matching result of the audio and video manuscript and an audio and video source file in the reference set;
performing authoring type recognition on the audio and video manuscript;
and determining the auditing result of the audio and video manuscript by combining the matching result and the authoring type.
Optionally, the method further comprises:
and generating processing opinions and actions on the audio and video manuscript according to the auditing result and the processing rule.
Optionally, the constructing a feature search reference set based on the first information base and the second information base includes:
establishing the first information base and the second information base;
acquiring a feature description file of each audio/video source file corresponding to the first information base and the second information base through a feature extraction service;
and constructing a retrieval reference set according to the picture characteristic description files and the audio characteristic description files of all the audio and video source files.
Optionally, the establishing the first information base and the second information base includes:
inputting information of the audio and video source file with the copyright acquired into the first information base, wherein the information comprises meta information, a source file address and an authorization condition;
and inputting the information of the audio and video source file without obtaining the copyright into the second information base, wherein the information comprises meta information, a source file address and the current copyright side condition.
Optionally, the obtaining, by the feature extraction service, the feature description file of each audio/video source file includes:
extracting a picture image of the audio/video source file according to a first preset time period to obtain a plurality of sub-pictures;
performing feature extraction on the extracted sub-pictures by using a first neural network model to generate corresponding descriptors;
forming a picture characteristic description subset of the audio/video source file according to the descriptor of each sub-picture, and packaging the picture characteristic description subset into a picture characteristic description file;
extracting the audio track data of the audio and video source file according to a second preset time interval to obtain a plurality of audio segments;
carrying out frequency domain conversion on the audio segments, and then using a second neural network model to carry out feature extraction to generate corresponding descriptors;
and forming an audio characteristic description subset of the audio and video source file according to the descriptor of each audio segment, and packaging the audio characteristic description subset into an audio characteristic description file.
Optionally, the identifying the audio and the picture content of the audio and video manuscript uploaded by the user, and retrieving the reference set to obtain the matching result between the audio and video manuscript and the audio and video source file in the reference set includes:
acquiring an audio characteristic description file and a picture characteristic description file of the audio and video manuscript through a characteristic extraction service;
and retrieving the reference set according to the audio characteristic description file and the picture characteristic description file of the audio and video manuscript, and comparing the audio characteristic description file and the picture characteristic description file of each audio and video source file corresponding to the reference set to obtain the matching result.
Optionally, the comparing the audio feature description file and the picture feature description file of each audio/video source file corresponding to the reference set to obtain the matching result includes:
according to the picture characteristic description subset of the audio/video manuscript and the picture characteristic description subset of each audio/video source file in the reference set, obtaining a first vector result closest to the reference set through a distance measurement model, and obtaining a first matching result with a similarity parameter through a similarity model;
according to the audio feature description subset of the audio and video manuscript and the audio feature description subset of each audio and video source file in the reference set, obtaining a second vector result closest to the reference set through a distance measurement model, and obtaining a second matching result with a similarity parameter through a similarity model;
and synthesizing the first matching result and the second matching result to obtain a matching result of the audio/video manuscript and the audio/video source file closest to the audio/video manuscript.
Optionally, the performing authoring type recognition on the audio and video manuscript includes:
and identifying the audio and video manuscript as a clip or secondary creation of the audio and video source file in the reference set.
Optionally, the performing authoring type recognition on the audio and video manuscript includes:
and inputting the audio and picture matching result set of a single file dimension into a trained classifier model to obtain the authoring type result of the audio and video manuscript on the audio and video assets.
Optionally, the determining, by combining the matching result and the authoring type, an auditing result of the audio/video manuscript includes:
configuring a judgment rule through a condition rule matcher;
determining whether the audio and video manuscript relates to a copyright problem according to the judgment rule;
and when the audio and video manuscript relates to a copyright problem, inquiring the first information base and the second information base to generate an audit result report.
Optionally, the generating of the processing opinions and actions on the audio and video manuscript according to the auditing results and the processing rules includes:
and generating a processing opinion on the audio/video manuscript according to the audit result report of the audio/video manuscript by configuring a processing rule, and intervening the state and play limitation of the audio/video manuscript.
Optionally, the auditing result and the processing opinion are obtained by performing comprehensive judgment according to a matching result of the audio/video manuscript and the audio/video source file, an authoring type of the audio/video manuscript, an authorization range of the audio/video source file, and an authorization region.
In addition, to achieve the above object, an embodiment of the present application further provides an audio/video auditing system, where the system includes:
the construction module, which is used for constructing a feature retrieval reference set based on a first information base and a second information base, wherein the first information base comprises copyrighted audio and video source file information, and the second information base comprises non-copyrighted audio and video source file information;
the retrieval module is used for identifying the audio and picture contents of the audio and video manuscript uploaded by a user and retrieving the reference set to obtain a matching result of the audio and video manuscript and an audio and video source file in the reference set;
the recognition module is used for recognizing the creation type of the audio and video manuscript;
and the judging module is used for determining the auditing result of the audio and video manuscript by combining the matching result and the authoring type.
To achieve the above object, an embodiment of the present application further provides an electronic device, where the electronic device includes: a memory, a processor, and an audio/video auditing program stored on the memory and runnable on the processor, wherein the audio/video auditing program, when executed by the processor, implements the audio/video auditing method described above.
In order to achieve the above object, an embodiment of the present application further provides a computer-readable storage medium on which an audio/video auditing program is stored, the audio/video auditing program, when executed by a processor, implementing the audio/video auditing method described above.
According to the audio and video auditing method, system, electronic device and computer-readable storage medium, a feature description file is generated for each audio/video source file and a retrieval reference set is constructed; when the query set (a manuscript uploaded by a user) contains feature descriptors close in distance to those of the reference set, it can be judged whether pictures or audio from the reference set were used. Based on audio/video retrieval technology, high-precision, near-real-time content identification of audio/video manuscripts can be achieved, automatically identifying cases where a user manuscript uses copyrighted assets. Meanwhile, by associating the infringement and authorization information bases, infringement and authorization judgments on the user's audio/video manuscript can be made, achieving efficient and accurate automated audio and video auditing.
Drawings
FIG. 1 is a diagram of an application environment architecture in which various embodiments of the present application may be implemented;
fig. 2 is a flowchart of an audio/video auditing method according to a first embodiment of the present application;
FIG. 3 is a detailed flowchart of step S20 in FIG. 2;
FIG. 4 is a diagram illustrating key fields of an alternative authorization information base according to the present application;
FIG. 5 is a detailed flowchart of step S202 in FIG. 3;
FIG. 6 is a detailed flowchart of step S22 in FIG. 2;
FIG. 7 is a detailed flowchart of step S26 in FIG. 2;
fig. 8 is a flowchart of an audio/video auditing method according to a second embodiment of the present application;
fig. 9 is a schematic flowchart of another form of the audio/video auditing method according to the second embodiment of the present application;
fig. 10 is a schematic hardware architecture of an electronic device according to a third embodiment of the present application;
fig. 11 is a schematic block diagram of an audio/video auditing system according to a fourth embodiment of the present application;
fig. 12 is a schematic block diagram of an audio/video auditing system according to a fifth embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the descriptions relating to "first", "second", etc. in the embodiments of the present application are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of the technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with each other, provided such a combination can be realized by a person skilled in the art; where a combination of technical solutions is contradictory or cannot be realized, it should be considered not to exist, and it falls outside the protection scope claimed in the present application.
Referring to fig. 1, fig. 1 is a diagram illustrating an application environment architecture for implementing various embodiments of the present application. The present application is applicable in application environments including, but not limited to, client 2, server 4, network 6.
The client 2 is used for displaying the current application interface to a user and receiving operations such as the user's uploading of audio and video manuscripts. The client 2 may be a terminal device such as a PC (Personal Computer), a mobile phone, a tablet computer, a portable computer, or a wearable device.
The server 4 is used for constructing a characteristic retrieval reference set based on an infringement and authorization library, and identifying and auditing the content of the audio and video manuscript uploaded by the client 2. The server 4 may be a rack server, a blade server, a tower server, a cabinet server, or other computing devices, may be an independent server, or may be a server cluster formed by a plurality of servers.
The network 6 may be a wireless or wired network such as an intranet, the Internet, a Global System for Mobile Communications (GSM) network, Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like. The server 4 and one or more clients 2 are connected through the network 6 for data transmission and interaction.
Example one
As shown in fig. 2, a flowchart of an audio/video auditing method provided in a first embodiment of the present application is shown. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. Some steps in the flowchart may be added or deleted as desired. This method will be described below with the server 4 as an execution subject.
The method comprises the following steps:
and S20, constructing a characteristic retrieval reference set based on the first information base and the second information base.
In this embodiment, the first information base is an authorization information base, and includes copyrighted audio/video source file information; the second information base is an infringement information base and comprises audio and video source file information which does not obtain copyright. In order to verify the copyright of the audio and video content, an infringement and authorization information base needs to be established, and audio and video feature extraction service and audio and video feature retrieval service are provided.
Specifically, further refer to fig. 3, which is a schematic view of the detailed flow of step S20. It is to be understood that the flow chart is not intended to limit the order in which the steps are performed. Some steps in the flowchart may be added or deleted as desired. In this embodiment, the step S20 specifically includes:
s200, establishing the first information base and the second information base.
First, information on audio/video source files whose copyright has been purchased (by the platform) is entered into the first information base. Fig. 4 is a schematic diagram of the key fields of the first information base, which mainly comprises the meta information, source file address, authorization conditions, and other key fields of the audio/video assets (source files).
Unauthorized audio/video source file information is entered into the second information base, which mainly comprises the meta information, source file address, current copyright-holder situation, and other key fields of the audio/video assets (source files). The unauthorized data mainly comes from the copyright declaration platforms of other copyright parties or from the platform's own infringement-complaint warehouse.
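The patent fixes only these key fields, not a schema. A minimal sketch of what such records might look like, with every field name being an illustrative assumption:

```python
# Illustrative record shapes for the two information bases; the patent
# only names the key fields, so all concrete names here are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AuthorizedAsset:              # entry in the first (authorization) base
    asset_id: str
    meta: dict                      # title, duration, episode info, ...
    source_file_url: str
    licensed_use_types: list[str]   # e.g. ["secondary_creation"]
    licensed_regions: list[str]     # e.g. ["CN"]
    license_expiry: Optional[str] = None

@dataclass
class UnauthorizedAsset:            # entry in the second (infringement) base
    asset_id: str
    meta: dict
    source_file_url: str
    rights_holder: str              # current copyright-holder situation
    claim_source: str               # declaration platform or complaint warehouse
```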
S202, obtaining a feature description file of each audio and video source file through a feature extraction service.
In the present embodiment, the feature extraction is divided into picture feature extraction and audio feature extraction. Specifically, further refer to fig. 5, which is a schematic view of the detailed flow of step S202. It is to be understood that the flow chart is not intended to limit the order in which the steps are performed. Some steps in the flowchart may be added or deleted as desired. In this embodiment, the step S202 specifically includes:
and S2020, extracting the picture image of the audio/video source file according to a first preset time period to obtain a plurality of sub-pictures.
In this embodiment, the picture images of the audio/video source file are extracted at one-second intervals, that is, the first preset time period is one second. Extracting the picture images by the second yields a number of sub-pictures (single-frame pictures).
S2021, performing feature extraction on the extracted sub-pictures by using a neural network model to generate corresponding descriptors.
After extracting the sub-pictures (single-frame pictures) from the audio/video source file, feature extraction is performed on each extracted sub-picture using a neural network model to generate a corresponding descriptor Desc_i. The neural network model (the first neural network model) may be any existing neural network model capable of picture feature extraction, and is not limited here.
S2022, forming a picture feature description subset of the audio/video source file according to the descriptor of each sub-picture, and packaging the picture feature description subset into a picture feature description file.
Picture feature extraction on each sub-picture (single-frame picture) generates a descriptor Desc_i, and all the pictures of the whole audio/video source file form {Desc_1, …, Desc_n}, the picture feature description subset corresponding to the pictures of the audio/video source file, which can be packed into a file called the picture feature description file.
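A minimal sketch of steps S2020-S2022, assuming OpenCV for frame extraction and a generic embedding callable standing in for the first neural network model (the patent names neither libraries nor architectures):

```python
# Sketch of picture feature extraction: sample one sub-picture per second,
# embed each into a descriptor Desc_i, and return the description subset
# {Desc_1, ..., Desc_n} (packed into a picture feature description file
# in the actual service). `embed` is an assumed callable.
import cv2
import numpy as np

def extract_picture_descriptors(path: str, embed, period_s: float = 1.0) -> np.ndarray:
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, round(fps * period_s))    # frames per sampling period
    descriptors, frame_idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:           # one single-frame sub-picture
            descriptors.append(embed(frame))
        frame_idx += 1
    cap.release()
    return np.stack(descriptors)            # row i is Desc_i
```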
And S2023, extracting the audio track data of the audio and video source file according to a second preset time period to obtain a plurality of audio segments.
In this embodiment, the audio track data of the audio/video source file is extracted as consecutive x-second segments, that is, the second preset time period is x seconds (where the value of x may be adjusted according to network performance). From the audio track data of the audio/video source file, a number of consecutive x-second audio segments are obtained.
S2024, performing frequency domain conversion on the audio segments, and then performing feature extraction by using a neural network model to generate corresponding descriptors.
The difference from sub-picture processing is that feature extraction is performed after the audio segment has been converted to the frequency domain. The neural network model used (the second neural network model) may be any existing neural network model capable of audio feature extraction, and is not limited here. After audio feature extraction is finished, a descriptor Desc_i is obtained for each audio segment.
S2025, forming an audio feature description subset of the audio and video source file according to the descriptor of each audio segment, and packaging the audio feature description subset into an audio feature description file.
Audio feature extraction on each audio segment generates a descriptor Desc_i, and the whole audio track of the audio/video source file forms {Desc_1, …, Desc_n}, the audio feature description subset of the audio data, which is packed into a file called the audio feature description file.
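Steps S2023-S2025 can be sketched the same way. The frequency-domain conversion is shown here as a plain magnitude FFT and `embed_audio` stands in for the unspecified second neural network model; both are assumptions, not the patent's prescription:

```python
# Sketch of audio feature extraction: cut the track into consecutive
# x-second segments, convert each to the frequency domain, then embed it.
import numpy as np

def extract_audio_descriptors(samples: np.ndarray, sr: int, embed_audio,
                              segment_s: float = 5.0) -> np.ndarray:
    seg_len = int(sr * segment_s)           # x seconds of samples (x adjustable)
    descriptors = []
    for start in range(0, len(samples) - seg_len + 1, seg_len):
        segment = samples[start:start + seg_len]
        spectrum = np.abs(np.fft.rfft(segment))   # frequency-domain conversion
        descriptors.append(embed_audio(spectrum)) # Desc_i for this segment
    return np.stack(descriptors)                  # {Desc_1, ..., Desc_n}
```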
Returning to fig. 3, S204, a retrieval reference set is constructed according to the picture feature description file and the audio feature description file of all the audio/video source files.
In this embodiment, the audio and picture feature description files of each audio/video source file are loaded into the audio and picture feature retrieval service, so that a retrieval reference set can be constructed. At this point, the feature retrieval service holds a reference set for retrieving the audio/video assets. Subsequently, when the description file of an audio/video contribution is retrieved, its matches (asset usage) in the reference set can be generated.
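A minimal sketch of such a reference set. A production retrieval service would use an approximate-nearest-neighbour index; brute-force L2 search over stacked descriptors is shown only for clarity, and the (asset_id, timestamp) bookkeeping is an assumed design:

```python
# Sketch of the retrieval reference set: descriptors from every source
# file are stacked into one matrix, with a parallel table recording which
# asset and timestamp each row came from.
import numpy as np

class ReferenceSet:
    def __init__(self):
        self.vectors = []       # one descriptor per row
        self.origins = []       # parallel (asset_id, t_seconds) entries

    def add_asset(self, asset_id: str, descriptors: np.ndarray, period_s: float):
        for i, d in enumerate(descriptors):
            self.vectors.append(d)
            self.origins.append((asset_id, i * period_s))

    def search(self, query: np.ndarray):
        refs = np.stack(self.vectors)
        dists = np.linalg.norm(refs - query, axis=1)   # distance measurement
        j = int(np.argmin(dists))
        return self.origins[j], float(dists[j])        # closest vector result
```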
Returning to fig. 2, in S22, performing audio and picture content identification on the audio and video manuscript uploaded by the user, and retrieving the reference set to obtain a matching result between the audio and video manuscript and the audio and video source file in the reference set.
Specifically, further refer to fig. 6, which is a schematic view of the detailed flow of step S22. It is to be understood that the flow chart is not intended to limit the order in which the steps are performed. Some steps in the flowchart may be added or deleted as desired. In this embodiment, the step S22 specifically includes:
and S220, acquiring the audio and picture feature description files of the audio and video manuscript through a feature extraction service.
Firstly, the audio and video manuscript uploaded by the user needs to obtain a feature description file of the manuscript through the feature extraction service. The feature extraction here is similar to that for the audio/video asset source files, likewise processing the audio track data and picture images separately.
Wherein, the picture processing is: extracting the picture images of the audio/video manuscript at one-second intervals to obtain a number of sub-pictures; performing feature extraction on each sub-picture using a neural network model to generate a descriptor Desc_i. All the frames of the whole audio/video manuscript form {Desc_1, …, Desc_n}, the picture description subset, which is packed into a picture feature description file.
The audio track is processed as: performing feature extraction on consecutive x-second audio segments (x adjusted according to network performance) to generate descriptors Desc_i. The whole audio track of the audio/video manuscript forms {Desc_1, …, Desc_n}, the audio description subset, which is packed into an audio feature description file.
S222, retrieving the reference set according to the audio and picture feature description files of the audio and video manuscript, and comparing the reference set with the audio and picture feature description files of each audio and video source file corresponding to the reference set to obtain the matching result.
The audio and picture feature description files of the audio/video manuscript are sent to the feature retrieval service to generate a matching result (audio/video asset usage) in the reference set. The matching result is then used to look up the authorization and infringement information of the corresponding audio/video source files, from which it can be determined whether the manuscript is authorized or involved in infringement, and infringement/authorization information can be generated (such as the time segments of the asset used, the manuscript's type of use of the asset, the asset's authorized regions, the authorization validity period, and the like).
By performing distance measurement between the picture feature description subset of the query set (the contribution) and the picture feature description subsets of the reference set, the vector results closest in distance can be produced. A matching result with a similarity parameter can be obtained through a trained machine learning model (including but not limited to a neural network). The matching result reveals the usage relationship between the two video files. For example: the query video Image_query uses a picture segment of Image_refer in the reference set over the time range: query 5-10 seconds, refer 20-25 seconds (i.e., seconds 5-10 of the contribution's pictures use seconds 20-25 of the audio/video asset in the reference set).
The processing logic for audio tracks is similar: after audio feature extraction, matching results for the audio segments are generated using the same combination of distance measurement and similarity model, indicating the contribution's usage of audio assets. For example: the query video Audio_query uses the audio of Audio_refer in the reference set over the time range: query 5-10 seconds, refer 20-25 seconds (i.e., seconds 5-10 of the contribution's audio use seconds 20-25 of the audio/video asset in the reference set).
And combining the two matching results (the matching result of the picture and the matching result of the audio) to obtain the matching result of the audio/video manuscript and the audio/video source file closest to the audio/video manuscript.
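A minimal sketch of this comparison, building on the ReferenceSet sketch above. The similarity model is reduced here to a simple distance-to-similarity mapping with a threshold (an assumption; the patent allows any trained machine learning model), and per-second hits that are consecutive in both query and reference time are merged into segments of the "query 5-10 s, refer 20-25 s" form:

```python
# Sketch of the comparison step: per-descriptor nearest-neighbour search,
# an assumed distance-to-similarity mapping, and merging of consecutive
# hits into matched time segments.
def match_contribution(ref_set, query_descriptors, period_s=1.0, sim_threshold=0.8):
    hits = []
    for i, desc in enumerate(query_descriptors):
        (asset_id, ref_t), dist = ref_set.search(desc)
        similarity = 1.0 / (1.0 + dist)     # assumed similarity parameter
        if similarity >= sim_threshold:
            hits.append((asset_id, i * period_s, ref_t))
    segments, run = [], None
    for asset_id, q_t, r_t in hits:
        contiguous = (run is not None and run["asset"] == asset_id
                      and q_t == run["q_end"] + period_s
                      and abs(r_t - (run["r_end"] + period_s)) < 1e-6)
        if contiguous:                      # extend the current matched segment
            run["q_end"], run["r_end"] = q_t, r_t
        else:
            if run:
                segments.append(run)
            run = {"asset": asset_id, "q_start": q_t, "q_end": q_t,
                   "r_start": r_t, "r_end": r_t}
    if run:
        segments.append(run)
    return segments                         # e.g. query 5-10 s <-> refer 20-25 s
```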
Returning to fig. 2, S24, performing authoring type recognition on the audio/video manuscript.
After the manuscript's usage of the audio and pictures of the audio/video assets is obtained, the authoring type of the manuscript needs to be identified. The authoring type is the type of editing the manuscript applies to the audio/video assets (source files) in the reference set, and can be divided, for example, into secondary creation and clipping.
The identification of the authoring type is mainly a binary classification combining the usage data of the audio track and the pictures. Specifically, a classifier model (including pattern recognition and neural network classifiers) may be trained on an authoring-type dataset of manually annotated contributions. For example, an SVM classifier may be used to determine the authoring type of a contribution. By aggregating the results for different assets, a sound-and-picture usage result set organized by single-asset (file) dimension is obtained. For example, for the series Friends (老友记): {{Friends_Image_1, Friends_Image_2, …, Friends_Image_n}, {Friends_Audio_1, Friends_Audio_2, …, Friends_Audio_n}}. The sound-and-picture usage result set of each asset dimension is input to the classifier to obtain the authoring type result for each asset (source file).
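A minimal sketch of this classification, assuming scikit-learn's SVC and hand-picked aggregate features (matched durations, coverage ratios, segment counts); the patent only requires some classifier trained on manually annotated contributions, and the training rows below are toy data for illustration only:

```python
# Sketch of authoring-type classification with an SVM over aggregate
# sound-and-picture usage features; training rows and labels are toy data.
from sklearn.svm import SVC
import numpy as np

def usage_features(pic_segs, aud_segs, duration_s):
    pic = sum(s["q_end"] - s["q_start"] for s in pic_segs)   # matched picture time
    aud = sum(s["q_end"] - s["q_start"] for s in aud_segs)   # matched audio time
    return [pic, aud, pic / duration_s, aud / duration_s,
            len(pic_segs), len(aud_segs)]

X_train = np.array([[600.0, 610.0, 0.95, 0.97, 2, 2],   # label 0: clip
                    [ 45.0,  30.0, 0.07, 0.05, 9, 6]])  # label 1: secondary creation
y_train = np.array([0, 1])
clf = SVC(kernel="rbf").fit(X_train, y_train)

pic_segs = [{"q_start": 5.0, "q_end": 10.0}]
aud_segs = [{"q_start": 5.0, "q_end": 10.0}]
authoring_type = clf.predict([usage_features(pic_segs, aud_segs, 630.0)])[0]
```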
And S26, determining the auditing result of the audio and video manuscript by combining the matching result and the authoring type.
And after acquiring a matching result and an authoring type of the manuscript and the audio and video asset (source file), infringement and authorization judgment are required to be carried out on a single asset (source file). Specifically, further refer to fig. 7, which is a schematic view of the detailed flow of step S26. It is to be understood that the flow chart is not intended to limit the order in which the steps are performed. Some steps in the flowchart may be added or deleted as desired. In this embodiment, the step S26 specifically includes:
s260, configuring the judgment rule through the condition rule matcher.
The different assets (source files) are different in authorization, use and authoring types. There is a need for flexible configuration of authorization and infringement determination policies for individual assets. In this embodiment, a configurable condition rule matcher is used to flexibly configure any combination of conditions such as manuscript attributes, asset usage, authoring type, asset authorization attributes, and the like.
Of course, this step does not need to be performed anew for each audited audio/video manuscript; the rules can be configured in advance or modified at any time.
And S262, determining whether the audio and video manuscript relates to a copyright problem according to the judgment rule.
For example, if the authoring type is a clip and the picture usage duration exceeds 30 seconds while the audio usage duration exceeds 120 seconds, it is determined that the contribution's usage of the asset involves a copyright issue.
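A minimal sketch of such a configurable condition rule matcher: each rule is a named predicate over the judgment context, configurable in advance or modifiable at any time, as noted above. The 30-second and 120-second thresholds simply mirror the example and are not fixed by the patent:

```python
# Sketch of a configurable condition rule matcher: rules are predicates
# over a judgment context and can be edited without touching audit code.
RULES = [
    {"name": "clip_overuse",
     "when": lambda ctx: (ctx["authoring_type"] == "clip"
                          and ctx["picture_use_s"] > 30
                          and ctx["audio_use_s"] > 120),
     "verdict": "copyright_issue"},
]

def judge(ctx: dict) -> str:
    for rule in RULES:
        if rule["when"](ctx):
            return rule["verdict"]
    return "no_issue"

print(judge({"authoring_type": "clip", "picture_use_s": 45, "audio_use_s": 150}))
# -> copyright_issue
```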
And S264, when the audio and video manuscript relates to copyright problems, inquiring the first information base and the second information base, and generating an audit result report, including an authorization and infringement result report.
If a copyright issue is involved, the first information base is checked for a purchase authorization; if one exists, whether the specific conditions of the authorization are met is judged (for example, the authorized use case is secondary creation and the manuscript's playback region is China). An authorization result report for the contribution's use of the asset is generated against all the authorization terms. Meanwhile, if the manuscript's playback regions include regions outside China, the second information base must be matched against the authorization situation of the corresponding asset held by other copyright parties, and an infringement result report is generated.
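A minimal sketch of this report generation under assumed dict-shaped information bases (all lookups and field names are illustrative, not the patent's):

```python
# Sketch of audit-report generation; auth_base maps asset_id -> grant dict,
# infringe_base is a set of (asset_id, region) claims. All shapes assumed.
def build_audit_report(asset_id, authoring_type, play_regions, auth_base, infringe_base):
    grant = auth_base.get(asset_id)              # purchase authorization, if any
    authorized_regions = []
    if grant and authoring_type in grant["licensed_use_types"]:
        authorized_regions = [r for r in play_regions if r in grant["licensed_regions"]]
    # regions outside the licensed scope are checked against the infringement base
    infringing = [r for r in play_regions
                  if r not in authorized_regions and (asset_id, r) in infringe_base]
    return {"asset": asset_id,
            "authorization_report": authorized_regions,
            "infringement_report": infringing}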
In the audio/video auditing method provided by this embodiment, feature description information is generated for each frame of picture or each audio segment through a deep-learning neural network; by extracting frames and audio segments from the whole audio/video file and generating their feature description information, a feature description file can be formed. A reference set can be constructed from the feature description files, and when the query set contains feature descriptors close in distance to the reference set, it can be judged whether pictures or audio from the reference set were used. As the picture and audio retrieval capabilities of deep learning mature, high-precision, near-real-time content identification of audio/video manuscripts can be achieved based on audio/video retrieval technology, automatically identifying cases where a user manuscript uses copyrighted assets. Meanwhile, by associating the infringement and authorization information bases, infringement and authorization judgments on the user's audio/video manuscript can be made, achieving efficient and accurate automated audio/video auditing.
Example two
Fig. 8 is a flowchart of an audio/video auditing method according to a second embodiment of the present application. In a second embodiment, the audio/video auditing method further includes step S38 on the basis of the first embodiment. It is to be understood that the flow charts in the embodiments of the present method are not intended to limit the order in which the steps are performed. Some steps in the flowchart may be added or deleted as desired.
The method comprises the following steps:
and S30, constructing a feature retrieval reference set based on the first information base and the second information base.
In this embodiment, the first information base is an authorization information base, and includes copyrighted audio/video source file information; the second information base is an infringement information base and comprises audio and video source file information which does not obtain copyright. In order to verify the copyright of the audio and video content, an infringement and authorization information base needs to be established, and audio and video feature extraction service and audio and video feature retrieval service are provided.
Firstly, information on audio/video source files whose copyright has been purchased (by the platform) is entered into the first information base, which mainly comprises the meta information, source file address, authorization conditions, and other key fields of the audio/video assets (source files). Unauthorized audio/video source file information is entered into the second information base, which mainly comprises the meta information, source file address, current copyright-holder situation, and other key fields of the audio/video assets (source files). The unauthorized data mainly comes from the copyright declaration platforms of other copyright parties or from the platform's own infringement-complaint warehouse.
And then, acquiring a feature description file of each audio and video source file through a feature extraction service.
In the present embodiment, the feature extraction is divided into picture feature extraction and audio feature extraction.
The feature extraction for pictures includes: extracting the picture images of the audio/video source file at one-second intervals to obtain a number of sub-pictures (single-frame pictures); performing feature extraction on the extracted sub-pictures using a neural network model to generate corresponding descriptors; and forming the picture feature description subset of the audio/video source file from the descriptors of the sub-pictures, packed into a picture feature description file.
Feature extraction for audio includes: extracting a plurality of audio segments of x seconds in succession; carrying out frequency domain conversion on the audio segments, and then carrying out feature extraction by using a neural network model to generate corresponding descriptors; and forming an audio characteristic description subset of the audio and video source file according to the descriptor of each audio segment, and packaging the audio characteristic description subset into an audio characteristic description file.
And finally, constructing a retrieval reference set according to the picture characteristic description files and the audio characteristic description files of all the audio and video source files.
And S32, identifying the audio and picture contents of the audio and video manuscript uploaded by the user, and retrieving the reference set to obtain a matching result of the audio and video manuscript and the audio and video source file in the reference set.
Firstly, the audio and video manuscript uploaded by a user needs to obtain a feature description file of the manuscript through the feature extraction service. The feature extraction here is similar to that described above for the audio/video source files, likewise processing the audio track data and picture images separately.
And retrieving the reference set according to the audio and picture characteristic description files of the audio and video manuscript, and comparing the reference set with the audio and picture characteristic description files of each audio and video source file corresponding to the reference set to obtain the matching result.
A vector result closest in distance can be generated by measuring the distance between the picture feature description subset of the query set (the contribution) and the picture feature description subsets of the reference set. A matching result with a similarity parameter can be obtained through a trained machine learning model (including but not limited to a neural network). The matching result reveals the usage relationship between the two video files. For example: the query video Image_query uses a picture segment of Image_refer in the reference set over the time range: query 5-10 seconds, refer 20-25 seconds (i.e., seconds 5-10 of the contribution's pictures use seconds 20-25 of the audio/video asset in the reference set).
The processing logic for audio tracks is similar: after audio feature extraction, matching results for the audio segments are generated using the same combination of distance measurement and similarity model, indicating the contribution's usage of audio assets. For example: the query video Audio_query uses the audio of Audio_refer in the reference set over the time range: query 5-10 seconds, refer 20-25 seconds (i.e., seconds 5-10 of the contribution's audio use seconds 20-25 of the audio/video asset in the reference set).
And S34, performing creation type identification on the audio and video manuscript.
After the manuscript's usage of the audio and pictures of the audio/video assets is obtained, the authoring type of the manuscript needs to be identified. The authoring type is the type of editing the manuscript applies to the audio/video assets (source files) in the reference set, and can be divided, for example, into secondary creation and clipping.
The recognition of the authoring type is mainly a binary classification combining the usage data of the audio track and the pictures. Specifically, a classifier model (including pattern recognition and neural network classifiers) may be trained on an authoring-type dataset of manually annotated contributions. For example, an SVM classifier may be used to determine the authoring type of a contribution. By aggregating the results for different assets, a sound-and-picture usage result set organized by single-asset (file) dimension is obtained. For example, for the series Friends (老友记): {{Friends_Image_1, Friends_Image_2, …, Friends_Image_n}, {Friends_Audio_1, Friends_Audio_2, …, Friends_Audio_n}}. The sound-and-picture usage result set of each asset dimension is input to the classifier to obtain the authoring type result for each asset (source file).
And S36, determining the auditing result of the audio and video manuscript by combining the matching result and the authoring type.
After the matching result and the authoring type of the manuscript against the audio/video assets (source files) are obtained, infringement and authorization judgments must be made per single asset (source file). Different assets (source files) differ in authorization situation, usage situation, and authoring type, so the authorization and infringement judgment policies of individual assets need flexible configuration. In this embodiment, a configurable condition rule matcher is used to flexibly configure any combination of conditions such as contribution attributes, asset usage, authoring type, and asset authorization attributes. For example, if the authoring type is a clip and the picture usage duration exceeds 30 seconds while the audio usage duration exceeds 120 seconds, it is determined that the contribution's usage of the asset involves a copyright issue.
If a copyright issue is involved, the first information base is checked for a purchase authorization; if one exists, whether the specific conditions of the authorization are met is judged (for example, the authorized use case is secondary creation and the manuscript's playback region is China). An authorization result report for the contribution's use of the asset is generated against all the authorization terms. Meanwhile, if the manuscript's playback regions include regions outside China, the second information base must be matched against the authorization situation of the corresponding asset held by other copyright parties, and an infringement result report is generated.
And S38, generating manuscript processing opinions and actions according to the auditing results and the infringement processing rules.
By configuring infringement processing rules (such as the use duration range, the authoring type and the manuscript playing region of a certain asset), processing opinions on the manuscript can be generated according to auditing result reports (including an infringement result report and an authorization result report) of the audio and video asset (source file) used by the manuscript, and the state and the playing limit of the manuscript can be directly interfered according to the configured infringement processing rules.
For example, if the authorization result report of a certain manuscript indicates that the manuscript meets the authorized authoring type and usage duration with an authorized region of China, while the infringement result report indicates that the manuscript's infringing region is Australia, then the manuscript's playback region needs to be set to China and its playback-prohibited region to Australia.
It should be noted that, for judging whether a manuscript infringes and for the subsequent processing, the judgment and the processing scheme must be made comprehensively according to the manuscript's usage of the assets in the reference set (that is, the matching result), the authoring type, the authorization scope of the assets (audio/video source files), the authorized regions, and so on. For example, suppose a contribution Video_query continuously uses a 20-minute picture segment of the asset Video_refer in the reference set; whether this infringes is judged according to the authorization information of the purchased asset copyright. If no authorization exists, it is directly judged as infringement. The contribution is likewise judged infringing if the usage duration exceeds the authorized scope of the asset, or the playback exceeds the asset's authorized regions, and so on. A manuscript judged infringing can be processed differently according to the actual situation, for example taking the audio/video off the shelf, prohibiting playback in certain regions, or withdrawing it for modification.
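A minimal sketch of turning an audit result report into a processing action, following the worked example above (authorized region China, infringing region Australia); the report keys and action fields are illustrative assumptions:

```python
# Sketch of applying processing rules to an audit result report.
def apply_processing_rules(report: dict) -> dict:
    action = {"status": "published",
              "allowed_regions": report.get("authorization_report", []),
              "banned_regions": report.get("infringement_report", [])}
    if not action["allowed_regions"] and action["banned_regions"]:
        action["status"] = "taken_down"   # e.g. off-shelf or withdraw for edits
    return action

print(apply_processing_rules({"authorization_report": ["CN"],
                              "infringement_report": ["AU"]}))
# {'status': 'published', 'allowed_regions': ['CN'], 'banned_regions': ['AU']}
```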
Fig. 9 is a schematic flow chart of another form of the audio/video auditing method according to this embodiment. The specific processes of the steps in fig. 8 have been described above, and are not described again here.
In the audio/video auditing method provided by this embodiment, feature description information is generated for each frame of picture or each audio segment through a deep-learning neural network; by extracting frames and audio segments from the whole audio/video file and generating their feature description information, a feature description file can be formed. A reference set can be constructed from the feature description files, and when the query set contains feature descriptors close in distance to the reference set, it can be judged whether pictures or audio from the reference set were used. As the picture and audio retrieval capabilities of deep learning mature, high-precision, near-real-time content identification of audio/video manuscripts can be achieved based on audio/video retrieval technology, automatically identifying cases where a user manuscript uses copyrighted assets. Meanwhile, associating the infringement and authorization information bases enables infringement and authorization judgments on the user's audio/video manuscript and, combined with the infringement processing rules, automatic intervention on user manuscripts, achieving efficient and accurate automated audio/video auditing.
Example three
Fig. 10 is a schematic diagram of a hardware architecture of an electronic device 20 according to a third embodiment of the present application. In the present embodiment, the electronic device 20 may include, but is not limited to, a memory 21, a processor 22, and a network interface 23, which are communicatively connected to each other through a system bus. It is noted that fig. 10 only shows the electronic device 20 with components 21-23, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. In this embodiment, the electronic device 20 may be the server.
The memory 21 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 21 may be an internal storage unit of the electronic device 20, such as a hard disk or a memory of the electronic device 20. In other embodiments, the memory 21 may also be an external storage device of the electronic apparatus 20, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the electronic apparatus 20. Of course, the memory 21 may also include both an internal storage unit and an external storage device of the electronic apparatus 20. In this embodiment, the memory 21 is generally used to store an operating system installed in the electronic device 20 and various application software, such as program codes of the audio/video auditing system 60. Further, the memory 21 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 22 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 22 is generally used to control the overall operation of the electronic device 20. In this embodiment, the processor 22 is configured to execute the program codes stored in the memory 21 or process data, for example, execute the audio/video auditing system 60.
The network interface 23 may include a wireless network interface or a wired network interface, and the network interface 23 is generally used for establishing a communication connection between the electronic apparatus 20 and other electronic devices.
Example four
As shown in fig. 11, a schematic block diagram of an audio/video auditing system 60 is provided for a fourth embodiment of the present application. The audiovisual review system 60 may be partitioned into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement embodiments of the present application. The program modules referred to in the embodiments of the present application refer to a series of computer program instruction segments that can perform specific functions, and the following description will specifically describe the functions of each program module in the embodiments.
In this embodiment, the audio/video auditing system 60 includes:
a construction module 600, configured to construct a feature retrieval reference set based on the first information base and the second information base.
In this embodiment, the first information base is an authorization information base, and includes copyrighted audio/video source file information; the second information base is an infringement information base and comprises audio and video source file information which does not obtain copyright. Firstly, the audio and video source file information of which the copyright is purchased (on the platform) is recorded into a first information base. The first information base mainly comprises key fields of meta-information of audio and video assets (source files), source file addresses, authorization conditions and the like. And recording the unauthorized audio/video source file information into a second information base. The second information base mainly comprises key fields of meta-information of audio and video assets (source files), source file addresses, current copyright side conditions and the like. Wherein, the unauthorized data mainly comes from the copyright declaration platform of other copyright parties or the complaint infringement warehouse of the platform.
And then, acquiring a feature description file of each audio and video source file through a feature extraction service. In the present embodiment, the feature extraction is divided into picture feature extraction and audio feature extraction.
The feature extraction for pictures includes: extracting the picture images of the audio/video source file at one-second intervals to obtain a number of sub-pictures (single-frame pictures); performing feature extraction on the extracted sub-pictures using a neural network model to generate corresponding descriptors; and forming the picture feature description subset of the audio/video source file from the descriptors of the sub-pictures, packed into a picture feature description file.
Feature extraction for audio includes: extracting a plurality of audio segments of x seconds in succession; carrying out frequency domain conversion on the audio segments, and then carrying out feature extraction by using a neural network model to generate corresponding descriptors; and forming an audio characteristic description subset of the audio and video source file according to the descriptor of each audio segment, and packaging the audio characteristic description subset into an audio characteristic description file.
And finally, constructing a retrieval reference set according to the picture characteristic description files and the audio characteristic description files of all the audio and video source files.
The retrieval module 602 is configured to perform audio and picture content identification on an audio and video manuscript uploaded by a user, and retrieve the reference set to obtain a matching result between the audio and video manuscript and an audio and video source file in the reference set.
First, a feature description file of the audio/video manuscript uploaded by the user is obtained through the feature extraction service. Feature extraction here is similar to that for the audio/video source files described above: the audio track data and the picture images are processed separately.
The reference set is then retrieved using the audio and picture feature description files of the audio/video manuscript, which are compared against the audio and picture feature description files of each audio/video source file in the reference set to obtain the matching result.
The closest vector results are generated by measuring the distance between the picture feature description subset of the query set (the manuscript) and the picture feature description subsets of the reference set. A matching result carrying a similarity parameter can then be obtained with a trained machine learning model (including but not limited to a neural network). The matching result shows the usage relationship between the two video files. For example: the manuscript's picture Image_query uses the reference set's picture Image_refer over the time ranges query 5-10 seconds and refer 20-25 seconds (i.e., seconds 5-10 of the manuscript's pictures use seconds 20-25 of the audio/video asset in the reference set).
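The query side of the same idea, continuing the FAISS sketch above; a fixed distance threshold stands in for the trained similarity model, and consecutive per-second hits form the matched time ranges described in the example:

```python
import numpy as np

def match_contribution(index, ids, query_descriptors, max_dist: float):
    """Report (query_second, asset_id, reference_second, distance) tuples."""
    q = np.stack(query_descriptors).astype("float32")
    dists, nbrs = index.search(q, 1)          # nearest reference vector per query second
    hits = []
    for q_sec, (dist, nbr) in enumerate(zip(dists[:, 0], nbrs[:, 0])):
        if dist <= max_dist:                  # assumed similarity threshold
            asset_id, ref_sec = ids[nbr]
            hits.append((q_sec, asset_id, ref_sec, float(dist)))
    return hits   # query seconds 5-10 matching refer seconds 20-25 appear as consecutive hits
```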
The processing logic for the audio tracks is similar: after the audio features are extracted, the same combination of distance measurement and a similarity model generates the matching result for the audio segments, which indicates the manuscript's usage of the audio assets. For example: the manuscript's audio Audio_query uses the reference set's audio Audio_refer over the time ranges query 5-10 seconds and refer 20-25 seconds (i.e., seconds 5-10 of the manuscript's audio use seconds 20-25 of the audio/video asset in the reference set).
The identification module 604 is configured to perform authoring type identification on the audio/video manuscript.
After the manuscript's usage of the audio and pictures of the audio/video assets has been obtained, the authoring type of the manuscript needs to be identified. The authoring type refers to the way the manuscript clips the audio/video assets (source files) in the reference set, and can be divided into, for example, secondary creation and clipping.
Authoring type recognition is essentially a binary classification combining the usage data of the audio tracks and the pictures. Specifically, a classifier model (including pattern recognition and neural network classifiers) can be trained on a manually annotated dataset of manuscript authoring types; for example, an SVM classifier can be used to determine the authoring type of a manuscript. By aggregating the results for different assets, a sound-and-picture usage result set organized along a single asset (file) dimension is obtained. For example: { { Laoyouji_Image_1, Laoyouji_Image_2, …, Laoyouji_Image_n }, { Laoyouji_Audio_1, Laoyouji_Audio_2, …, Laoyouji_Audio_n } }, where "Laoyouji" is the title of a single asset. The sound-and-picture usage result set of each asset dimension is input to the classifier to obtain the authoring type result for each asset (source file).
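A toy version of the SVM route with scikit-learn; the three per-asset features (matched picture seconds, matched audio seconds, coverage fraction) and the labeled training rows are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

# Labels: 0 = clip, 1 = secondary creation (illustrative data only).
X_train = np.array([[300.0, 280.0, 0.90],
                    [240.0, 250.0, 0.85],
                    [ 25.0,  40.0, 0.10],
                    [ 60.0,  15.0, 0.20]])
y_train = np.array([0, 0, 1, 1])

clf = SVC(kernel="rbf").fit(X_train, y_train)
authoring_type = clf.predict([[120.0, 90.0, 0.40]])[0]   # 0 -> clip, 1 -> secondary creation
```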
The judgment module 606 is configured to determine the auditing result of the audio/video manuscript by combining the matching result and the authoring type.
After the matching result and the authoring type of the manuscript with respect to the audio/video asset (source file) have been obtained, an infringement and authorization judgment must be made for each individual asset (source file). Because the authorization conditions, usage, and authoring types differ from asset to asset, the authorization and infringement judgment policy for an individual asset needs to be flexibly configurable. In this embodiment, a configurable condition rule matcher is used to flexibly combine conditions such as manuscript attributes, asset usage, authoring type, and asset authorization attributes. For example, when the authoring type is a clip, the picture usage exceeds 30 seconds, and the audio usage exceeds 120 seconds, the manuscript's use of the asset is judged to involve copyright issues.
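Such a matcher could be as simple as a list of declarative rules evaluated in order; the field names and thresholds below are assumptions echoing the example above:

```python
# One rule: a clip using more than 30 s of picture and 120 s of audio is flagged.
RULES = [
    {"authoring_type": "clip", "min_picture_seconds": 30,
     "min_audio_seconds": 120, "verdict": "copyright_issue"},
]

def evaluate(usage: dict):
    """Return the verdict of the first rule whose conditions all hold, else None."""
    for rule in RULES:
        if (usage["authoring_type"] == rule["authoring_type"]
                and usage["picture_seconds"] > rule["min_picture_seconds"]
                and usage["audio_seconds"] > rule["min_audio_seconds"]):
            return rule["verdict"]
    return None

print(evaluate({"authoring_type": "clip", "picture_seconds": 45,
                "audio_seconds": 150}))   # -> "copyright_issue"
```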
If copyright issues are involved, the first information base is queried to check whether a purchase authorization exists; if so, it is determined whether the specific terms of the authorization are satisfied, for example that the authorized use case is secondary creation and the manuscript's playback region is China. An authorization result report for the manuscript's use of the asset is generated from all the authorization terms. Meanwhile, if the manuscript's playback region extends outside China, the second information base must be matched against the asset's authorization status with other copyright holders, and an infringement result report is generated.
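The lookup against the two information bases might then be sketched as follows, with both bases reduced to dictionaries keyed by asset ID and the layout of the authorization terms assumed:

```python
def audit_asset(usage: dict, authorized_db: dict, infringing_db: dict) -> dict:
    """Check purchased authorization terms first, else produce an infringement report."""
    auth = authorized_db.get(usage["asset_id"])
    if auth and usage["authoring_type"] in auth["allowed_types"] \
            and set(usage["regions"]) <= set(auth["regions"]):
        return {"kind": "authorization_report", "asset": usage["asset_id"], "ok": True}
    rights = infringing_db.get(usage["asset_id"], {})
    return {"kind": "infringement_report", "asset": usage["asset_id"],
            "rights_holder": rights.get("rights_holder")}
```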
EXAMPLE five
As shown in fig. 12, a schematic block diagram of an audio/video auditing system 60 is provided for a fifth embodiment of the present application. In this embodiment, the audio/video auditing system 60 includes, in addition to the construction module 600, the retrieval module 602, the identification module 604, and the judgment module 606 of the fourth embodiment, a processing module 608.
The processing module 608 is configured to generate processing opinions and actions for the manuscript according to the audit result and the infringement processing rules.
By configuring infringement processing rules (such as the usage duration range of a given asset, the authoring type, and the manuscript's playback region), processing opinions on the manuscript can be generated from the audit result reports (including the infringement result report and the authorization result report) for the audio/video assets (source files) used by the manuscript, and the manuscript's status and playback restrictions can be directly intervened in according to the configured rules.
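A possible shape for this rule-driven post-processing, consuming the report produced by the judgment step above; the opinion and action vocabulary is invented for illustration:

```python
def handle_contribution(report: dict) -> dict:
    """Map an audit result report to a processing opinion and an action (sketch)."""
    if report["kind"] == "infringement_report":
        return {"opinion": "suspected infringement", "action": "block_playback"}
    if report["kind"] == "authorization_report" and report.get("ok"):
        return {"opinion": "authorized use", "action": "publish"}
    return {"opinion": "outside authorization terms", "action": "restrict_playback_region"}
```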
EXAMPLE six
The present application further provides another embodiment: a computer-readable storage medium storing an audio/video auditing program, the audio/video auditing program being executable by at least one processor to cause the at least one processor to execute the steps of the audio/video auditing method described above.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of additional like elements in the process, method, article, or apparatus that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the present application described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed across a network of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by a computing device, and in some cases the steps shown or described may be performed in an order different from that described herein. They may also be separately fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present application and is not intended to limit its scope; all equivalent-structure or equivalent-process modifications made using the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, are likewise included in the scope of protection of the present application.

Claims (15)

1. An audio and video auditing method, the method comprising:
constructing a feature retrieval reference set based on a first information base and a second information base, wherein the first information base comprises information on audio and video source files for which copyright has been obtained, and the second information base comprises information on audio and video source files for which copyright has not been obtained;
performing audio and picture content identification on an audio and video manuscript uploaded by a user, and retrieving the reference set to obtain a matching result of the audio and video manuscript and an audio and video source file in the reference set;
performing authoring type identification on the audio and video manuscript;
and determining the auditing result of the audio and video manuscript by combining the matching result and the authoring type.
2. The audio and video auditing method according to claim 1, wherein the method further comprises:
and generating processing opinions and actions on the audio and video manuscript according to the auditing result and the processing rule.
3. The audio and video auditing method according to claim 1 or 2, wherein the constructing a feature retrieval reference set based on a first information base and a second information base comprises:
establishing the first information base and the second information base;
acquiring a feature description file of each audio/video source file corresponding to the first information base and the second information base through a feature extraction service;
and constructing a retrieval reference set according to the picture feature description files and the audio feature description files of all the audio and video source files.
4. The audio and video auditing method according to claim 3, wherein the establishing the first information base and the second information base comprises:
inputting information of the copyrighted audio and video source files into the first information base, wherein the information comprises meta-information, a source file address, and authorization conditions;
and inputting information of the audio and video source files for which copyright has not been obtained into the second information base, wherein the information comprises meta-information, a source file address, and the current copyright-holder situation.
5. The audio and video auditing method according to claim 3 or 4, wherein the acquiring a feature description file of each audio and video source file through the feature extraction service comprises:
extracting a picture image of the audio/video source file according to a first preset time period to obtain a plurality of sub-pictures;
performing feature extraction on each extracted sub-picture by using a first neural network model to generate a corresponding descriptor;
forming a picture characteristic description subset of the audio/video source file according to the descriptor of each sub-picture, and packaging the picture characteristic description subset into a picture characteristic description file;
extracting the audio track data of the audio and video source file according to a second preset time interval to obtain a plurality of audio segments;
carrying out frequency domain conversion on the audio segments, and then using a second neural network model to carry out feature extraction to generate corresponding descriptors;
and forming an audio characteristic description subset of the audio and video source file according to the descriptor of each audio segment, and packaging the audio characteristic description subset into an audio characteristic description file.
6. The audio and video auditing method according to any one of claims 1 to 5, wherein the performing audio and picture content identification on an audio and video manuscript uploaded by a user and retrieving the reference set to obtain a matching result between the audio and video manuscript and an audio and video source file in the reference set comprises:
acquiring an audio feature description file and a picture feature description file of the audio and video manuscript through a feature extraction service;
and retrieving the reference set according to the audio feature description file and the picture feature description file of the audio and video manuscript, and comparing them with the audio feature description file and the picture feature description file of each audio and video source file corresponding to the reference set to obtain the matching result.
7. The audio and video auditing method according to claim 6, wherein the comparing with the audio feature description file and the picture feature description file of each audio and video source file corresponding to the reference set to obtain the matching result comprises:
obtaining a first vector result closest to the reference source file through a distance measurement model according to the picture feature description subset of the audio/video manuscript and the picture feature description subset of each audio/video source file in the reference set, and obtaining a first matching result with a similarity parameter through a similarity model;
according to the audio feature description subset of the audio and video manuscript and the audio feature description subset of each audio and video source file in the reference set, obtaining a second vector result closest to the reference set through a distance measurement model, and obtaining a second matching result with a similarity parameter through a similarity model;
and synthesizing the first matching result and the second matching result to obtain a matching result of the audio/video manuscript and the audio/video source file closest to the audio/video manuscript.
8. The audio and video auditing method according to any one of claims 1 to 7, wherein the performing authoring type identification on the audio and video manuscript comprises:
and identifying the audio and video manuscript as a clip or secondary creation of the audio and video source file in the reference set.
9. The audio and video auditing method according to claim 8, wherein the performing authoring type identification on the audio and video manuscript comprises:
and inputting the audio and picture matching result set of a single file dimension into a trained classifier model to obtain the authoring type result of the audio and video manuscript on the audio and video assets.
10. The audio and video auditing method according to any one of claims 1 to 9, wherein the determining the auditing result of the audio and video manuscript by combining the matching result and the authoring type comprises:
configuring a judgment rule through a condition rule matcher;
determining whether the audio and video manuscript relates to a copyright problem according to the judgment rule;
and when the audio and video manuscript relates to the copyright problem, inquiring the first information base and the second information base to generate an audit result report.
11. The audio and video auditing method according to claim 2, wherein the generating processing opinions and actions on the audio and video manuscript according to the auditing result and the processing rule comprises:
generating, through configured processing rules, a processing opinion on the audio and video manuscript according to the audit result report of the audio and video manuscript, and intervening in the status and playback restrictions of the audio and video manuscript.
12. The audio and video auditing method according to claim 2 or 11, wherein the auditing result and the processing opinion are obtained by comprehensive judgment according to the matching result between the audio and video manuscript and the audio and video source file, the authoring type of the audio and video manuscript, the authorization scope of the audio and video source file, and the authorization region.
13. An audio and video auditing system, the system comprising:
a construction module, configured to construct a feature retrieval reference set based on a first information base and a second information base, wherein the first information base comprises information on audio and video source files for which copyright has been obtained, and the second information base comprises information on audio and video source files for which copyright has not been obtained;
a retrieval module, configured to perform audio and picture content identification on an audio and video manuscript uploaded by a user and retrieve the reference set to obtain a matching result between the audio and video manuscript and an audio and video source file in the reference set;
an identification module, configured to perform authoring type identification on the audio and video manuscript;
and a judgment module, configured to determine the auditing result of the audio and video manuscript by combining the matching result and the authoring type.
14. An electronic device, comprising: a memory, a processor, and an audio and video auditing program stored in the memory and executable on the processor, wherein the audio and video auditing program, when executed by the processor, implements the audio and video auditing method according to any one of claims 1 to 12.
15. A computer-readable storage medium, wherein the computer-readable storage medium stores an audio and video auditing program which, when executed by a processor, implements the audio and video auditing method according to any one of claims 1 to 12.
CN202210924829.5A 2022-08-02 2022-08-02 Audio and video auditing method and system Pending CN115269910A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210924829.5A CN115269910A (en) 2022-08-02 2022-08-02 Audio and video auditing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210924829.5A CN115269910A (en) 2022-08-02 2022-08-02 Audio and video auditing method and system

Publications (1)

Publication Number Publication Date
CN115269910A true CN115269910A (en) 2022-11-01

Family

ID=83746357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210924829.5A Pending CN115269910A (en) 2022-08-02 2022-08-02 Audio and video auditing method and system

Country Status (1)

Country Link
CN (1) CN115269910A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116471452A (en) * 2023-05-10 2023-07-21 武汉亿臻科技有限公司 Video editing platform based on intelligent AI
CN116471452B (en) * 2023-05-10 2024-01-19 武汉亿臻科技有限公司 Video editing platform based on intelligent AI

Similar Documents

Publication Publication Date Title
EP3734489B1 (en) Evidence collection method and system based on blockchain evidence storage
US10007723B2 (en) Methods for identifying audio or video content
CN110851879B (en) Method, device and equipment for infringement and evidence preservation based on evidence preservation block chain
US20200058088A1 (en) Method and system for determining content treatment
US8156132B1 (en) Systems for comparing image fingerprints
US20090259623A1 (en) Systems and Methods for Associating Metadata with Media
EP2608107A2 (en) System and method for fingerprinting video
US20170185675A1 (en) Fingerprinting and matching of content of a multi-media file
CN110149529B (en) Media information processing method, server and storage medium
US10216761B2 (en) Generating congruous metadata for multimedia
US7774385B1 (en) Techniques for providing a surrogate heuristic identification interface
CN109492152B (en) Method, device, computer equipment and storage medium for pushing custom content
US20170161292A1 (en) Digital Content Item Collection Management Boxes (CCMBoxes) - Virtual digital content item collection, characterization, filtering, sorting, management and presentation systems, methods, devices and associated processing logic
TW200939067A (en) A method and a system for identifying elementary content portions from an edited content
CN104050217A (en) MEDIA CONTENT SUBSTITUTION method and system
López et al. Digital video source identification based on container’s structure analysis
US11520806B1 (en) Tokenized voice authenticated narrated video descriptions
CN113435391B (en) Method and device for identifying infringement video
KR20150077492A (en) System and method for protecting personal contents right using context-based search engine
CN1326008C (en) License creation apparatus, license creation method, and computer program
CN111966909A (en) Video recommendation method and device, electronic equipment and computer-readable storage medium
CN112732949A (en) Service data labeling method and device, computer equipment and storage medium
CN115269910A (en) Audio and video auditing method and system
CN113822138A (en) Similar video determination method and device
CN117115718A (en) Government affair video data processing method, system and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination