WO2020022956A1 - Method and apparatus for video content validation - Google Patents

Method and apparatus for video content validation

Info

Publication number
WO2020022956A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
videos
pair
frame
distance
Prior art date
Application number
PCT/SG2018/050379
Other languages
English (en)
Inventor
Sang Nguyen
Quang Tran
Erman TJIPUTRA
Original Assignee
Aioz Pte Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aioz Pte Ltd filed Critical Aioz Pte Ltd
Priority to PCT/SG2018/050379 priority Critical patent/WO2020022956A1/fr
Publication of WO2020022956A1 publication Critical patent/WO2020022956A1/fr

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3236Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using cryptographic hash functions
    • H04L9/3239Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using cryptographic hash functions involving non-keyed hash functions, e.g. modification detection codes [MDCs], MD5, SHA or RIPEMD
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/50Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols using hash chains, e.g. blockchains or hash trees
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234309Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4 or from Quicktime to Realvideo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams

Definitions

  • the present invention relates to a method and apparatus for video content validation, in particular to checking that the contents of more than one video are similar to one another.
  • SSIM is typically for predicting perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos.
  • SSIM is a perception-based model that considers image degradation as perceived change in structural information, while also incorporating important perceptual phenomena, including both luminance masking and contrast masking terms.
  • Structural information of SSIM considers an assumption that pixels of videos have strong inter-dependencies especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene.
  • the SSIM index is calculated on various windows of an image.
  • methods based on SSIM are computationally intensive.
  • Image matching via a heuristics solution involving frame feature matching typically relies on a fast and simple algorithm for difference comparison between two images.
  • An example of such an algorithm is proposed in "H. C. Wong, M. Bern, and D. Goldberg. An image signature for any kind of image. In Image Processing. 2002. Proceedings. 2002 International Conference on, volume 1, pages I-I. IEEE, 2002."
  • the fast and simple algorithm calculates the difference between two images much faster than measures based on conventional techniques such as PSNR, MSE, SSIM, etc.
  • Figure 1 illustrates a triplet-based network architecture of an example of the present disclosure.
  • Figure 2 illustrates a process flow for video similarity computation according to an example of the present disclosure.
  • Figure 3 illustrates a process flow proposed for evaluation of two baseline methods against a video similarity computation method according to an example of the present disclosure.
  • Figure 4 shows a bar chart providing information about percentages of Similarities Benchmark Performance through 25 test cases.
  • Figure 5 shows a bar chart providing information about computation speed of video similarity computation according to an example of the present disclosure against two other methods.
  • Figure 6 is a schematic diagram showing components of a processor according to an example of the present disclosure.
  • Figure 7 shows examples of filter configurations for generating audio fingerprints.
  • One example of the present disclosure is concerned with an application of Proof of Transcoding using Artificial Intelligence (A.I.) that involves a neural network on a proposed blockchain-based network (hereinafter known as the "BCB network") operating on a proposed blockchain-based platform (hereinafter known as the "BCB platform").
  • the BCB network can be said to be a decentralized video streaming platform.
  • the BCB Platform operates in a proposed Blockchain-based network whereby users are incentivized to share redundant memory, storage, and bandwidth resources to address today’s video streaming challenges.
  • the BCB network may be configured to employ three A.I.-powered lightweight proofs for video streaming processing tasks and services, which include Proof of Transcoding using Artificial Intelligence.
  • On the proposed BCB network, there are four stakeholders, which are the Miner, the Content Creator, the Viewer, and the Advertiser. More details will be provided for the Miner, the Content Creator, and their relationship, which are relevant to the examples in the present disclosure.
  • Miners: people who have unused computing, storage, or networking resources and who, for instance, would like to earn money from these assets can participate on the BCB platform to become a miner of the platform. "Participating" covers the following actions:
  • Miners get paid when they perform a task of the BCB network correctly. The payment is deducted from a wallet of an owner of the task, that is, the one that posted the task.
  • the BCB platform introduces three types of tasks for video streaming and process services which are Transcoding, Storing, and Delivering. A complete flow of a task, starting with task posting and ending with task payment, is as follows:
  • a Content Creator submits a task to the BCB network and deposits funds to cover the cost of the task. The fund can be refilled at any point, but Miners might stop work if the deposit runs out as they gradually cash in for work done. Note that the fund is kept by the BCB network and is released if and only if it is verified that the task is done correctly by a Miner.
  • Miners send their offers to the Content Creator. Offers might contain various information including Miners’ offered price for the task.
  • the Content Creator picks the most suitable offer according to their needs.
  • the Content Creator sends an accept message to the chosen miner.
  • N elected witnesses (also called "validators") are involved in verifying the task.
  • These witnesses use the mentioned proofs, that is, Proof of Storing, Proof of Delivering, and, in the present example, Proof of Transcoding using Artificial Intelligence, to verify the task and reach an agreement on its validity by way of voting. In other words, if it is confirmed by more than 66% of the number of witnesses that the task is performed accurately, the task is recorded as verified and the fund will be released to the Miner's wallet.
  • Consensus Model: On the BCB platform, Witnesses possess the most critical privilege, which is the ability to verify transactions. Without a mechanism to prevent fraudulent actions from witnesses, such as intentionally placing wrong verification information onto the network to benefit themselves in some way, the BCB platform might not function properly. Such a mechanism is referred to as a Consensus Model.
  • the BCB network employs the Delegated Byzantine Fault Tolerance protocol as its consensus model. Details of the protocol are beyond the scope of this disclosure. Basically, the Delegated Byzantine Fault Tolerance protocol offers a reliable low-cost approach to reach agreements between geographically distributed network nodes of a decentralized video streaming platform by way of a voting mechanism, given that an insignificant number of nodes are dishonest parties.
  • each witness will pull or obtain the input video and transcoding requirements from the Content Creator that submitted the task T.
  • the witnesses are supposed to use the Proof of Transcoding using A.I. of the present example to assert whether T is executed accurately.
  • the witnesses will share their results and the network will follow the majority opinion, in which a decision will be determined by the seven honest witnesses.
  • a focus of the examples in the present disclosure relates to Proof of Transcoding using A.I.
  • Proof of Transcoding using A.I. (specifically, video transcoding) can, for example, be used by the witnesses in the BCB network during a validation process of a transcoding task.
  • the Proofing process of examples of the present disclosure serves as a fast, reliable, and low-cost tool to check whether a transcoded video TV is transcoded correctly, given an original video OV and transcoding parameters TP, for example, from 720p-H264 to 360p-WebM.
  • the Proofing process of examples of the present disclosure can be a substitute for a traditional method that is described as follows.
  • the original video OV needs to be re-transcoded using the same parameters TP to produce a video V.
  • the output V is compared with TV frame-by-frame and pixel-by-pixel. If V and TV match, then there is confidence that TV is transcoded accurately.
  • the traditional method has some major drawbacks. Firstly, the OV is required to be transcoded again and again. Say the cost to transcode a video is C. Then the total cost added by the traditional method would be C multiplied by n, where n is the number of witnesses. As explained above, the Consensus Model requires the proofing process to be repeated at each and every Witness. A lot of cost would thus be involved.
  • the Proof of Transcoding using A.I. aims to solve problems relating to the traditional methods of video comparison that have performance issues.
  • while Proof of Video Transcoding is rather generic in nature, the technology discussed herein is applicable to other image/video processing applications requiring proof of image/video transcoding and is not limited to Proof of Transcoding utilised for verifying a video transcoding transaction of a Blockchain-based network such as the one described in the present disclosure, or of other blockchain networks such as Ethereum and the like.
  • the examples of the present disclosure address a problem in Proof of Transcoding, especially, how to explicitly validate content of any transcoded video to check whether it matches with the original video or not.
  • the transcoded video is in a different format from the format of original video.
  • an A.I. solution that leverages breakthrough Deep Learning technology, typically a deep Residual Neural Network (ResNet) in combination with Metric Learning, to significantly improve both the accuracy and the execution time of the proofing process. It will be demonstrated later that the proposed approach in the examples described herein achieves outstanding performance against the traditional baseline methods, by a margin of 5-20 times faster, while preserving robustness of accuracy over various kinds of test cases.
  • ResNet: Residual Neural Network
  • PoT: Proof of Transcoding
  • the proofing problem is exacerbated in the case of video due to its considerably larger volume (compared to text/images), which makes it a great challenge for any decentralized video applications that analyze, transcode and index large amounts of video content.
  • an efficient proof mechanism is nowadays an indispensable component in numerous decentralized video applications to prove that the process of transcoding was truly executed, and completed by, for instance, a miner in the BCB network.
  • Transcoded videos are identical or approximately identical to the original video, but typically different in file formats (video containers), encoding parameters, photometric variations (color, lighting changes), editing operations (caption, logo, and border insertion), different lengths and certain modifications (frames add/remove). A normal user would clearly identify the transcoded videos as essentially the same, compared to the original ones.
  • the differences of a transcoded video can be mainly categorized into two classes of factors:
  • the frame rate can be computed and verified by sampling the number of images that consecutively appear on a display in a second, with a few lines of code.
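  • Purely as an illustration (not part of the claimed method), such a frame-rate check could be written in a few lines of Python with OpenCV; the file names below are placeholders.

```python
# Illustrative sketch only: reading the frame rate reported by a video container/decoder
# with OpenCV. The file paths are placeholders, not part of the present disclosure.
import cv2

def frame_rate(path: str) -> float:
    capture = cv2.VideoCapture(path)
    if not capture.isOpened():
        raise IOError(f"Cannot open video: {path}")
    fps = capture.get(cv2.CAP_PROP_FPS)  # frames per second declared by the stream
    capture.release()
    return fps

# Example usage: compare the frame rates of an original and a transcoded video.
# matches = abs(frame_rate("original.mp4") - frame_rate("transcoded.webm")) < 0.5
```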
  • the transcoded video does not have to be pixel identical to be considered the same with the original because there are factors such as photometric variations, editing and content modification as explained earlier.
  • a challenge is to ensure that a transcoded video has the same content as the original video while, at the same time, dealing with possible video content deformations, such as brightness shift or equalization, hard noise, different resolution (aspect ratio change), compression artifacts and black borders.
  • the resolutions of the transcoded videos should be limited by some threshold parameters.
  • an A.I. solution involving a computer neural network is applied for video-level proof of transcoding, and it incorporates deep learning in two steps.
  • a proposed method for the example includes two main steps. In a first stage, features of a deep Residual Neural Network (ResNet) (Reference: K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770-778. IEEE Computer Society, 2016.) are extracted from intermediate convolutional layers based on a scheme called Maximum Activation of Convolutions (Reference: F. Radenovic, G. Tolias, and O. Chum. CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In ECCV (1), volume 9905 of Lecture Notes in Computer Science, pages 3-20. Springer, 2016.).
  • a ResNet is trained to learn an embedding function that maps a video to a feature space where videos with similar content should have smaller distances between each other compared to other different videos.
  • a feature fusion scheme is proposed for the generation of the video representation, which ensures that the final representation is compact enough to facilitate the development of a high-throughput, low-latency and high-accuracy proofing process.
  • a video codec comprises an encoder function that converts video into an encoded (compressed) form and a decoder function that reverses the encoding for playback or editing. Codecs are designed to emphasize certain aspects of the media to be encoded (see Table 1 below).
  • a video container is a metafile format whose specification describes how different elements of data and metadata coexist in a computer file. Video is almost always stored in compressed form to reduce the file size.
  • Video bitrate is a speed at which a video is transmitted, measured in bits-per-second (bps). The higher the bit rate of a video, the more data gets transferred at a time.
  • Framerate (frames per second) is the frequency (rate) at which consecutive images called frames appear on a display.
  • the proposed approach leverages features produced by the intermediate convolutional layers of a deep Residual Neural Network (ResNet) architecture to generate compact global video representation (such as feature vectors). Additionally, in order to accurately verify the similarity between two candidate videos (that is, the transcoded videos and the original one), another ResNet is trained to approximate an embedding function for distance calculation, also known as metric learning.
  • the machine learning model is trained by iterating on batches of generated video triplets that are extracted from a development dataset. A triplet contains a set of samples including query, positive, negative videos, where the positive video is more similar to the query video than the negative one.
  • the proposed approach involves a method and can be performed by an apparatus with processing capabilities that is operated by software to perform the method. An example of the apparatus is described with reference to Figure 6 later.
  • Image features need to be extracted to represent video content in a compact form of information that a machine can understand, and then execute the machine learning process. Such image features extraction has to take place for training an untrained machine or untrained computer neural network to derive a model that is capable of calculating a similarity value between two (or more) videos. Such image features extraction also has to take place during use of a trained machine or trained computer neural network to extract features from two (or more) videos that are to be compared.
  • features are extracted from activations of convolution layers of a pre-trained ResNet.
  • MAC: Maximum Activation of Convolutions
  • the features to be extracted to determine the video frame descriptors can be predetermined through other suitable means and are not restricted to feature extraction from activations of convolution layers of a pre-trained ResNet and derivation of a compact representation by Maximum Activation of Convolutions (MAC).
  • a pre-trained ResNet is not used but another suitable computer neural network is used instead.
  • suitable deep learning neural networks include AlexNet, VGG16, GoogleNet, etc.
  • a pre-trained ResNet is employed, with a total number of l convolution layers, denoted as L_1, L_2, ..., L_l.
  • forward propagating a video frame through the network generates a feature map M_i for every convolution layer L_i (i = 1, ..., l), where n_i x n_i is the dimension of every channel for convolution layer L_i (which depends on the frame size) and c is the total number of channels.
  • max pooling is applied on every channel of each feature map M to extract a single value.
  • the extraction process can be formulated by Equation 1 below.
  • the layer vector v_i is a c-dimensional vector that is derived from max pooling on every channel of each feature map M.
  • all layer vectors are concatenated to obtain a video frame descriptor for the video frame.
  • each video frame descriptor obtained in the same manner for each video frame of the input video is normalized by applying zero-mean and ℓ2-normalization. Thereafter, the video frame descriptors are ready for further processing to generate input of video samples to be inputted for training the model, or as input of a pair of videos to the trained model for comparison of the pair of videos.
  • Global video descriptors can be generated by initially applying uniform sampling to select P frames per second of an input video, and extracting the respective video frame descriptor for each of them. The Global video descriptors are then derived by averaging and normalizing (zero-mean and ℓ2-normalization) these extracted video frame descriptors.
  • a Global video descriptor describes a video as a whole. In the present example, Global video descriptors are generated as input of video samples to be inputted for training the model or as input of a pair of videos to the trained model for comparison of the pair of videos.
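  • As an illustration only, a minimal Python/PyTorch sketch of the frame-descriptor and global-descriptor computation described above is given below. The choice of ResNet-50, the four hooked layers, and the preprocessing are assumptions, not the exact configuration of the present example.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Pre-trained ResNet whose intermediate convolutional blocks are hooked to capture activations.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
hooked_layers = [resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4]  # assumed choice
activations = []
for layer in hooked_layers:
    layer.register_forward_hook(lambda module, inputs, output: activations.append(output))

preprocess = T.Compose([T.ToPILImage(), T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

def frame_descriptor(frame) -> torch.Tensor:
    """MAC descriptor of one frame: max pooling per channel of each feature map (Equation 1),
    concatenation of the layer vectors, then zero-mean and l2-normalization."""
    activations.clear()
    with torch.no_grad():
        resnet(preprocess(frame).unsqueeze(0))
    layer_vectors = [a.amax(dim=(2, 3)).squeeze(0) for a in activations]
    v = torch.cat(layer_vectors)
    v = v - v.mean()
    return torch.nn.functional.normalize(v, dim=0)

def global_descriptor(frames) -> torch.Tensor:
    """Average the descriptors of uniformly sampled frames and normalize again."""
    d = torch.stack([frame_descriptor(f) for f in frames]).mean(dim=0)
    d = d - d.mean()
    return torch.nn.functional.normalize(d, dim=0)
```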
  • An optimization problem of learning a video similarity function for the present example of proof of transcoding, especially for similarity comparison of videos content, is addressed in the present example by utilizing relative information of triplet-wise video relations.
  • a goal is to compute a similarity score of content between the original video and the transcoded videos in the hope that the same videos (i.e. those transcoded videos found to be the same) are ranked at the top.
  • similarity between two arbitrary videos q and p can be defined as a squared Euclidean distance in a video embedding space. This can be represented by Equation 2 below.
  • f_θ(·) is an embedding function that maps a video to a point in a Euclidean space.
  • θ is a system parameter.
  • D(·,·) is the squared Euclidean distance in this space.
  • r(·,·) specifies whether a pair of videos are similar in content.
  • An objective is to learn an embedding function f_θ(·) that assigns smaller distances to similar pairs compared to non-similar ones.
  • the embedding function f_θ(·) should map video representations to a common space ℝ^d, where d is the dimension of the feature embedding, in which a distance between the original v and the positive v+ is always smaller than a distance between the original v and the negative v−. This can be represented by Equation 3 below.
  • a distance vector indicative of the distance between the original v and the positive v+ and the distance between the original v and the negative v− would be output from the trained model for further calculation to determine the similarity between a pair of compared videos.
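  • A possible reconstruction of Equations 2 and 3 in LaTeX, consistent with the notation above; the exact published forms may differ.

```latex
% Equation 2 (assumed form): video similarity as a squared Euclidean distance
% in the embedding space, with f_\theta mapping a video to a point in R^d.
D(q, p) = \left\lVert f_{\theta}(q) - f_{\theta}(p) \right\rVert_2^{2}

% Equation 3 (assumed form): ranking constraint enforced by the learned embedding.
D\bigl(f_{\theta}(v), f_{\theta}(v^{+})\bigr) < D\bigl(f_{\theta}(v), f_{\theta}(v^{-})\bigr)
```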
  • v_i, v_i^+, v_i^- are the respective feature vectors (e.g. global video descriptors derived according to section 3.1 above) of (i) a query video (i.e. the original video), (ii) a positive video (i.e. a truly transcoded video; also known as a "similar video"), and (iii) a negative video (i.e. a dissimilar/fake video; also known as a "dissimilar video").
  • Each triplet represents a set of samples for metric learning that contains these three types (i.e. (i), (ii) and (iii)) of videos.
  • each triplet expresses or characterizes a relative similarity ranking order among the three videos, i.e., v_i is more similar to v_i^+ than to v_i^-.
  • a hinge loss function is defined for a given triplet and it is called a "triplet loss" in the present disclosure. In machine learning, the hinge loss is a loss function used for training classifiers. Equation 4 below represents this hinge loss function.
  • γ is a margin parameter that regularizes the gap between the distances of the two video pairs, which are (v_i, v_i^+) and (v_i, v_i^-), to ensure a sufficiently large difference between the positive-query distance (i.e. the distance between the original v_i and the positive v_i^+) and the negative-query distance (i.e. the distance between the original v_i and the negative v_i^-). If the video distances are calculated correctly within the margin γ, then the triplet will not be penalised. Otherwise, the loss is a convex approximation of the ranking loss that measures the degree of violation of the desired distance between the video pairs specified by the triplet.
  • a triplet is penalized if it does not satisfy the requirements on the similarity scores between v_i, v_i^+ and v_i^-. If a triplet is penalized, the value of the hinge loss function provided by Equation (4) will be large and a batch gradient descent calculation (provided below) is required to minimize the loss. If a triplet satisfies the requirements, the value of the hinge loss function provided by Equation (4) will be small and the batch gradient descent calculation (provided below) is not required to minimize the loss.
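  • A possible reconstruction of the triplet ("hinge") loss of Equation 4 in LaTeX, using the margin parameter γ described above; this is an assumed form.

```latex
% Equation 4 (assumed form): hinge ("triplet") loss for a triplet (v_i, v_i^+, v_i^-)
% with margin parameter \gamma.
\mathcal{L}_{\theta}(v_i, v_i^{+}, v_i^{-}) =
  \max\Bigl\{0,\; \gamma
      + D\bigl(f_{\theta}(v_i), f_{\theta}(v_i^{+})\bigr)
      - D\bigl(f_{\theta}(v_i), f_{\theta}(v_i^{-})\bigr)\Bigr\}
```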
  • in Equation 5, λ is a regularization parameter that controls the margin of the learned ranker in the machine learning process, to prevent overfitting of the resulting metric learning model while improving its generalization.
  • m is the total size of a triplet mini-batch.
  • θ is the parameter of the embedding function f_θ(·). Equation 5 above can be converted to an unconstrained optimization by relaxing its constraints into the objective, giving Equation 6.
  • Minimizing the loss function defined by Equation (6) will narrow the query-positive distance while widening the query-negative distance, and thus lead to a representation satisfying the desirable ranking order, i.e. the transcoded videos found to be the same are ranked at the top. With an appropriate triplet generation strategy as described above in place, the resulting metric learning model will eventually learn a video representation that improves the effectiveness of video content similarity comparison for proof of transcoding.
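  • A possible reconstruction of the mini-batch objective of Equations 5 and 6 in LaTeX; the regularization term and its placement are assumptions consistent with the description above.

```latex
% Equations 5-6 (assumed forms): regularized objective over a mini-batch of m triplets,
% relaxed into an unconstrained minimization over the embedding parameters \theta,
% with regularization parameter \lambda and margin \gamma.
\min_{\theta}\;
  \frac{\lambda}{2}\,\lVert \theta \rVert_2^{2}
  + \frac{1}{m}\sum_{i=1}^{m}
      \max\Bigl\{0,\; \gamma
          + D\bigl(f_{\theta}(v_i), f_{\theta}(v_i^{+})\bigr)
          - D\bigl(f_{\theta}(v_i), f_{\theta}(v_i^{-})\bigr)\Bigr\}
```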
  • for training the metric learning model, a triplet-based network architecture is proposed for the present example.
  • This architecture optimizes the triplet loss function of Equation 5.
  • a computer neural network to be trained is provided or inputted with one or more sets of triplets 104 extracted from before-training samples 102 of a development database of similar and dissimilar videos. This input step is known as triplet sampling.
  • Each set of triplets 104 comprises a ground truth (original video), a positive video (truly transcoded video), and a negative video (fake/dissimilar video) with feature vectors v_i, v_i^+, v_i^-, respectively.
  • the 3 videos are fed independently into three respective deep Residual Neural Networks (ResNet) 106 with identical architecture and parameters.
  • the ResNets 106 compute the embedding function v → f_θ(v) ∈ ℝ^d.
  • the architecture of the deployed ResNet 106 is based on three dense fully-connected layers, a 1x1 convolutional layer for reducing the number of channels, and a normalization layer at the end. This leads to vectors that lie in a d-dimensional space; the dimension of the mapping function f_θ(v) depends on the dimensionality of the input vectors, which is in turn dictated by the employed ResNet architecture 106.
  • the video embedding functions computed from a batch of triplets are then given to a triplet loss layer 108 to calculate accumulated cost based on Equation 5.
  • after-training samples 110 outputted by the trained metric learning model would be organized in a desired manner, i.e. the most similar samples are ranked at the top.
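  • For illustration only, a hedged PyTorch sketch of the triplet-based training set-up of Figure 1 is given below: three weight-shared branches of an embedding network (dense layers, a 1x1 convolution and a final normalization layer) trained with a margin-based triplet loss. All layer sizes, the standard TripletMarginLoss, and the optimizer settings are assumptions rather than the exact architecture 106.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """One branch of the triplet network: 1x1 convolution for channel reduction,
    three dense fully-connected layers, and a final normalization layer."""
    def __init__(self, in_channels: int = 3840, reduced: int = 1024, emb_dim: int = 500):
        super().__init__()
        self.reduce = nn.Conv1d(in_channels, reduced, kernel_size=1)
        self.fc = nn.Sequential(nn.Linear(reduced, 1024), nn.ReLU(),
                                nn.Linear(1024, 1024), nn.ReLU(),
                                nn.Linear(1024, emb_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, in_channels)
        x = self.reduce(x.unsqueeze(-1)).squeeze(-1)
        x = self.fc(x)
        return F.normalize(x, dim=1)                       # embeddings on the unit sphere

net = EmbeddingNet()                                       # shared by all three branches
triplet_loss = nn.TripletMarginLoss(margin=1.0)            # stand-in for the hinge loss of Eq. 4/5
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, weight_decay=1e-4)  # weight decay ~ lambda

def train_step(query, positive, negative) -> float:
    """One mini-batch update on a batch of triplets of global video descriptors."""
    optimizer.zero_grad()
    loss = triplet_loss(net(query), net(positive), net(negative))
    loss.backward()
    optimizer.step()
    return loss.item()
```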
  • ResNet is not used but another suitable computer neural network is used instead.
  • the learned embedding function f_θ(·) is used for computing similarities between videos.
  • a feature fusion method or scheme is proposed for fusing similarity computation across video frames and it is described as follows.
  • Feature fusion method: frame or video descriptors are averaged and normalized into a global video descriptor before they are forward propagated to the neural network.
  • the global video descriptor or signature is an output of the embedding function f_θ(·).
  • the similarity value between a pair of videos is determined by Equation 7 as follows.
  • A process flow for Video-level Similarity Computation of the present example is illustrated in Figure 2.
  • Video 1 (202) is made up of frames (204) and video 2 (212) is made up of frames (214).
  • Feature vectors (206) of the frames (204) of Video 1 (202) may be extracted according to Equation 1.
  • Feature vectors (216) of the frames (214) of Video 2 (212) may be extracted according to Equation 1.
  • the feature vectors (206) and (216) are then subject to a matching stage that involves inputting the feature vectors (206) and (216) to the trained metric learning model (208) that is trained according to the earlier description.
  • the trained metric learning model (208) outputs a distance vector (210) that can be used for calculation of similarities percentage (218; also known herein as similarity value or similarities value) according to Equation 7.
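  • As an illustrative sketch only, the matching stage of Figure 2 could look as follows in Python; the exact form of Equation 7 is not reproduced here, and the mapping from distance to similarity percentage used below is an assumption.

```python
import torch

def video_similarity_percentage(model, descriptor_1, descriptor_2) -> float:
    """Map two global video descriptors to a similarity percentage via the trained model."""
    with torch.no_grad():
        e1 = model(descriptor_1.unsqueeze(0))          # embedding of video 1
        e2 = model(descriptor_2.unsqueeze(0))          # embedding of video 2
    distance = torch.sum((e1 - e2) ** 2).item()        # squared Euclidean distance (Equation 2)
    return 100.0 * (1.0 / (1.0 + distance))            # assumed stand-in for Equation 7

# Example usage with the (hypothetical) EmbeddingNet and global_descriptor sketches above:
# similarity = video_similarity_percentage(net, global_descriptor(frames_1), global_descriptor(frames_2))
```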
  • ResNet or other kinds of advanced Deep Learning Neural Networks such as AlexNet, VGG16, GoogleNet, etc. mentioned above can be used with an FAemb algorithm.
  • An advantage of this FAemb algorithm is that this FAemb algorithm is expected to relax the requirements of triplet preparation compared to the requirements for triplet preparation for the proposed method described with reference to sections 3.2.1 to 3.2.4 above, while preserving the same or higher accuracy and performance.
  • the advantage of the FAemb algorithm is based on a solid mathematical proof of feature embedding using linear approximation of a non-linear function (i.e. Taylor expansion) in high dimensional space.
  • a timer is started.
  • two videos for comparison are loaded.
  • One video is an original video and the other video is a transcoded video that is converted into a different format.
  • width and height of the two loaded videos are checked. If the width and height of the two loaded videos are the same, a flag is set to false at a step 312. If the width and height of the two loaded videos are not the same, the flag is set to true at a step 310.
  • similarity calculation or computation begins for the two videos. A frame of the first video (video 1 ) and a frame of the second video (video 2) are read in for comparison sequentially at steps 314 and 316 respectively.
  • at a step 318, depending on whether the flag is set to false or true, different steps are taken. If it is determined at the step 318 that the flag is true, the bigger video frames of the two videos are resized to the width and height of the smaller video frames of the two videos. After resizing, frame-by-frame comparison takes place at a step 326 using both the SSIM metric method 322 and the Image Matching (heuristics solution) method 324. If it is determined at the step 318 that the flag is false, frame-by-frame comparison takes place at the step 326 using both the SSIM metric method 322 and the Image Matching (heuristics solution) method 324.
  • the comparison results of each frame compared using the SSIM metric method 322 and the Image Matching (heuristics solution) method 324 are stored at a step 328.
  • a check to determine whether the compared videos have ended is performed. If the compared videos have not ended, steps 314 and 316 are performed for the next frame of the 2 videos to be compared. If the compared videos have ended, a step 332 is performed to calculate an average output for each of the methods 322 and 324 from the output that was stored for each pair of frames compared by each method. A similarity percentage can be worked out from the average output.
  • the timer is stopped.
  • the time determined by the timer is displayed along with similarity percentage of the 2 inputted videos for each method on a display.
  • After the step 336, the process flow ends at a step 338.
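  • For illustration, a hedged Python sketch of this baseline evaluation loop (frame-by-frame SSIM, with the image-signature heuristic left as a placeholder call) is given below; OpenCV and scikit-image are assumed to be available, and the file paths are placeholders.

```python
import time
import cv2
from skimage.metrics import structural_similarity as ssim

def compare_videos(path_1: str, path_2: str):
    """Frame-by-frame SSIM comparison of two videos, returning (similarity %, elapsed seconds)."""
    start = time.time()
    cap_1, cap_2 = cv2.VideoCapture(path_1), cv2.VideoCapture(path_2)
    ssim_scores = []
    while True:
        ok_1, frame_1 = cap_1.read()
        ok_2, frame_2 = cap_2.read()
        if not (ok_1 and ok_2):
            break                                             # one of the videos has ended
        if frame_1.shape != frame_2.shape:                    # flag true: resize to a common size
            frame_1 = cv2.resize(frame_1, (frame_2.shape[1], frame_2.shape[0]))
        gray_1 = cv2.cvtColor(frame_1, cv2.COLOR_BGR2GRAY)
        gray_2 = cv2.cvtColor(frame_2, cv2.COLOR_BGR2GRAY)
        ssim_scores.append(ssim(gray_1, gray_2))              # SSIM metric method
        # image_signature_scores.append(...)                  # heuristic method (placeholder)
    cap_1.release()
    cap_2.release()
    elapsed = time.time() - start
    average = sum(ssim_scores) / max(len(ssim_scores), 1)
    return average * 100.0, elapsed
```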
  • the structural similarity (SSIM) index approach is a method for predicting the perceived quality of digital television and cinematic pictures, as well as other kinds of digital images and videos.
  • SSIM is a perception-based model that considers image degradation as perceived change in structural information, while also incorporating important perceptual phenomena, including both luminance masking and contrast masking terms.
  • Structural information is the idea that the pixels have strong inter-dependencies especially when they are spatially close. These dependencies carry important information about the structure of the objects in the visual scene.
  • the SSIM index is calculated on various windows of an image. The measure between two windows x and y of common size N x N is: SSIM(x, y) = ((2·μ_x·μ_y + c_1)·(2·σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)·(σ_x² + σ_y² + c_2)), where
  • μ_x, μ_y are respectively the average of x and y, and σ_x², σ_y² are respectively the variance of x and y,
  • σ_xy is the covariance of x and y, and c_1, c_2 are small constants that stabilise the division.
  • the final output is the mean structural similarity over both images.
  • Step 1 If the image is colored, convert it to 8-bit grayscale.
  • Step 2 Impose a 9 x 9 grid of points on the image.
  • Step 3 At each grid point, compute the average gray level of the P x P square centered at the grid point, where P = max(2, ⌊0.5 + min(n, m)/20⌋) and n, m are the dimensions of the image.
  • Step 4 For each grid point, compute an 8-element array whose elements give a comparison of the average gray level of the grid point square with those of its eight neighbors.
  • the result of a comparison can be "much darker", "darker", "same", "lighter", or "much lighter", represented numerically as -2, -1, 0, 1, 2.
  • the signature of an image is simply the concatenation of the 8-element arrays corresponding to the grid points, ordered left-to-right, top-to-bottom.
  • Step 5 Calculate the difference between the two images by a normalized distance:
  • the algorithm calculates the difference between two images based on their signatures much faster than baseline measures such as PSNR, MSE, SSIM, etc.
  • the main limitation of this algorithm is that it is not designed to handle large amounts of cropping or rotation.
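  • A hedged, simplified Python sketch of Steps 1-5 above is given below; the neighbour-difference quantization threshold and the exact normalized distance are assumptions, not the published algorithm's precise constants.

```python
import numpy as np

def image_signature(gray: np.ndarray) -> np.ndarray:
    """Steps 1-4: grid of average gray levels compared with the 8 neighbours of each point."""
    n, m = gray.shape
    p = max(2, int(0.5 + min(n, m) / 20))                      # Step 3: P x P averaging square
    ys = np.linspace(p, n - p - 1, 9).astype(int)              # Step 2: 9 x 9 grid of points
    xs = np.linspace(p, m - p - 1, 9).astype(int)
    half = p // 2
    means = np.array([[gray[y - half:y + half + 1, x - half:x + half + 1].mean()
                       for x in xs] for y in ys])
    signature = []
    for i in range(9):
        for j in range(9):
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    if di == 0 and dj == 0:
                        continue
                    ii, jj = i + di, j + dj
                    if 0 <= ii < 9 and 0 <= jj < 9:            # Step 4: neighbour comparison
                        diff = means[ii, jj] - means[i, j]
                        level = int(np.clip(round(diff / 25.0), -2, 2))  # assumed threshold
                    else:
                        level = 0
                    signature.append(level)
    return np.array(signature)

def signature_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Step 5 (assumed form): L2 distance normalized by the signature norms."""
    return float(np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b) + 1e-9))
```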
  • Table 2 describes properties of some videos that are collected from a source before being transcoded.
  • MoviePy is a Python package that can read and write all of the most common audio and video formats and runs cross-platform. From Table 2 above, the differences in duration, FPS and bitrate among the sample videos can be seen. In order to create comprehensive test cases, each sample video in the dataset is transcoded by carefully changing various parameters like video codec, bitrate, framerate, etc., one by one and in combination.
  • the processing time for featurizing a frame depends on the duration of the video (number of frames) and the number of computing threads. For instance, a video with a total of 5900 frames (about 4-5 minutes) would take 110 seconds (18 ms/frame) using a single thread, while taking only approximately 40 seconds (6 ms/frame) for feature extraction using parallelization with a mini-batch source.
  • Figure 4 shows a bar chart providing information about percentages of Similarities Benchmark Performance through 25 test cases from the dataset defined in section 4.1 above.
  • the leftmost bar represents the Baseline 1
  • the center bar represents Heuristic 2
  • the rightmost bar represents Deep Feature 3 respectively.
  • Deep Feature 3 delivers dramatically higher results than Baseline (SSIM) 1 and Heuristic (Image Matching) 2.
  • Deep Feature 3 is the most stable and has the highest performance in the Video Similarities Measurement, followed by Baseline 1 and the Heuristic method 2.
  • Table 3 compares distinctiveness performance between the three methods Baseline 1 , Heuristic 2, and Deep Feature 3.
  • Deep Feature 3 performs with absolute accuracy when all the test cases return "Not Match" results, whereas Baseline 1 and Heuristic 2 return a Similarities Percentage variance in a range of 18% to 41%. Furthermore, the Heuristic 2 results seem to be better than Baseline 1, which is shown by the Average Similarities of 26.95% and 27.55% respectively (the lower, the better).
  • Figure 5 shows information about how long each of the methods Baseline 1, Heuristic 2 and Deep Feature 3 takes to match two videos.
  • Baseline 1 is the slowest one, with the last test case taking up to 4000 seconds to match the two videos.
  • Heuristic 2 and Deep Feature 3 show their better performance by alternately being the fastest one for each test case.
  • Deep Feature 3 is the best among the three methods tested. On average, it has the Highest Accuracy in the Similarities Benchmark, the Highest Distinctiveness in the Dissimilarities Benchmark and the Lowest Execution Time.
  • the proposed A.I. architecture based on ResNet in combination with Deep Metric Learning facilitates a compact video-level representation for checking similarities between the content of any pair of videos. As described above, the proposed approach is comprehensively tested on various kinds of test cases and exhibits highly competitive accuracy. Furthermore, the method outperforms all compared approaches, from the traditional baseline (SSIM) approach to the heuristics solution, by a clear margin.
  • a video may not only contain many frames of images but may also contain sound or audio content.
  • a video file can be said to be a container containing video content (i.e. images only) and an audio file.
  • Such audio content may be embedded in the video content.
  • Such audio content also requires validation during video content validation if the video includes audio content. Examples of how the audio content can be validated are described as follows.
  • Audiodiff is a Python library for comparing audio files or streams.
  • one of the audio files to be compared may be an audio file contained in an original video file and the other audio file is a transcoded audio file contained in a video file transcoded from the original video file.
  • PCM: Pulse Code Modulation
  • the checksums can be obtained by using FFmpeg to get information of the audio and then using hashlib to compute the SHA-1 checksums.
  • while Python is featured here, it is appreciated that other suitable tools apart from Python (e.g. MATLAB) with similar operations to perform similar checksum checks can also be used.
  • FFmpeg refers to a vast software suite of libraries and programs for handling video, audio, and other multimedia files and streams. FFmpeg is designed for command-line-based processing of video and audio files, and is widely used for format transcoding, basic editing (trimming and concatenation), video scaling, video post-production effects, and standards compliance (SMPTE, ITU). "hashlib" is a module included in the Python Standard Library containing an interface to hashing algorithms. "SHA-1" (Secure Hash Algorithm 1) is a cryptographic hash function which takes an input and produces a 160-bit (20-byte) hash value known as a message digest, typically rendered as a 40-digit hexadecimal number.
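  • A hedged sketch of this checksum technique in Python is given below: FFmpeg decodes the audio stream to raw PCM and hashlib computes a SHA-1 digest of it. The FFmpeg options and file names are illustrative placeholders, not the exact invocation of the present example.

```python
import hashlib
import subprocess

def audio_sha1(video_path: str) -> str:
    """Decode the audio stream to raw 16-bit PCM with FFmpeg and return its SHA-1 digest."""
    cmd = ["ffmpeg", "-i", video_path, "-vn",            # drop the video stream
           "-f", "s16le", "-acodec", "pcm_s16le",        # raw signed 16-bit little-endian PCM
           "-ar", "44100", "-ac", "2", "pipe:1"]         # fixed sample rate/channels, to stdout
    pcm = subprocess.run(cmd, capture_output=True, check=True).stdout
    return hashlib.sha1(pcm).hexdigest()

# Two audio streams decode to identical PCM if their checksums match:
# same_audio = audio_sha1("original.mp4") == audio_sha1("transcoded.webm")
```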
  • the second technique involves first obtaining Mel-frequency cepstral coefficients (MFCCs) from each of two audio clips or files to be compared. Thereafter, a dynamic time warping (DTW) algorithm is applied to measure the similarity between the sequences of MFCCs derived from the two audio clips or files.
  • MFCCs: Mel-frequency cepstral coefficients
  • DTW: dynamic time warping
  • M(f) = 1125 * ln(1 + f/700)
  • dynamic time warping relates to an algorithm for measuring similarity between two temporal sequences.
  • DTW is a method that calculates an optimal match between two given sequences with certain restrictions.
  • DTW can be represented by a formula:
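  • A hedged Python sketch of the MFCC-plus-DTW comparison is given below, assuming the librosa library; the length normalization of the accumulated DTW cost is an assumption, not the exact measure of the present example.

```python
import librosa
from librosa.sequence import dtw

def mfcc_dtw_distance(path_1: str, path_2: str) -> float:
    """Smaller values indicate more similar audio content."""
    y1, sr1 = librosa.load(path_1, sr=None)
    y2, sr2 = librosa.load(path_2, sr=None)
    mfcc_1 = librosa.feature.mfcc(y=y1, sr=sr1, n_mfcc=20)    # (n_mfcc, frames)
    mfcc_2 = librosa.feature.mfcc(y=y2, sr=sr2, n_mfcc=20)
    cost, _ = dtw(X=mfcc_1, Y=mfcc_2, metric="euclidean")     # accumulated cost matrix
    # Length-normalized accumulated cost of the optimal warping path (assumed normalization).
    return float(cost[-1, -1] / (mfcc_1.shape[1] + mfcc_2.shape[1]))
```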
  • Audio signals of an audio file or clip can be represented by waveforms and spectrograms.
  • a useful representation is in a form of a spectrogram, which shows how intensity on specific frequencies of the audio signals changes over times.
  • an audio signal may be split into many overlapping frames and the Fourier transform is applied on them (“Short-time Fourier transform”).
  • in the case of Chromaprint, an input audio signal or stream of an audio file is first converted, for example, into an audio representation having a sampling rate of 11025 Hz with a frame size of 4096 (0.371 s) and 2/3 overlap. An audio image of the audio representation can be generated and image comparison can be made. Many fingerprinting algorithms work with this kind of audio representation. Some algorithms compare differences across time and frequency in the generated audio image, whereas some algorithms look for peaks in the generated audio image. However, Chromaprint processes the audio representation further by transforming frequencies into musical notes before generating an audio image for the processed audio representation. It is noted that the interest is in notes, not octaves, so a result would have 12 bins, one for each note.
  • this information is called "chroma features" ("Audio Thumbnailing of Popular Music Using Chroma-Based Representations" by Mark A. Bartsch and Gregory H. Wakefield, IEEE Transactions on Multimedia, Volume 7, Issue 1, March 2005, pages 96-104).
  • fpcalc library in Python can be used for extracting Chromaprint features from an audio signal.
  • the resulting Chromaprint audio representation is quite robust to changes caused by lossy codecs and the like.
  • the generated Chromaprint audio image of the resulting Chromaprint audio representation can still be quite hard to compare through image comparison. Also, in order to be able to search for such audio images in a database, a more compact form of the audio image is preferred.
  • These sub-images may be grayscale sub-images.
  • a pre-defined set of, for example, 16 filters can be applied to capture intensity differences across musical notes and time. What the predefined set of 16 filters does is calculate a sum over each specific area or region of each grayscale sub-image.
  • the sum is a convolution operator calculated after overlaying each of the 16 filters over each specific area or region of each grayscale sub-image.
  • the specific area or region can be pre-defined.
  • the image filtering configuration of each of the 16 filters may be selected from, for example, 6 types of image filtering configurations as shown in Figure 7.
  • the 16 filters can be made different from one another by adjusting values of elements of a filter matrix of the selected image filtering configuration.
  • the objective of the 16 filters is such that when applying each filter over any region or area in each grayscale sub-image, a sum for representation as part of the audio fingerprint to be obtained is calculated by calculating a convolution (weighted sum) operator. Such sum is considered to be a compact number useful for representing the audio fingerprint.
  • such a filtering technique is also known as a form of feature extraction using multiple types of predefined filters (kernels) to extract information that is robust to noise, scale invariance, rotation, etc.
  • the combined results would thus be a 32-bit integer. If the aforementioned steps are performed for every sub-image generated by the sliding window i.e. the 16x12 pixel window, a full audio fingerprint can be obtained. Performing such filtering advantageously achieves an audio fingerprint that is more compact and easier to search in a database and is easier for comparing similarity of two audio signals.
  • comparison to determine whether the two audio files are similar can be done based on matching the number of bits present in the two audio fingerprints.
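  • A hedged Python sketch of such a bit-level fingerprint comparison is given below; the representation of a fingerprint as a list of 32-bit integers follows the description above, and the decision threshold is an assumption.

```python
def fingerprint_similarity(fp_1: list[int], fp_2: list[int]) -> float:
    """Fraction of matching bits between two fingerprints (sequences of 32-bit integers)."""
    n = min(len(fp_1), len(fp_2))
    matching_bits = sum(32 - bin(a ^ b).count("1") for a, b in zip(fp_1[:n], fp_2[:n]))
    return matching_bits / (32 * n)                      # 1.0 means identical fingerprints

# Example: treat the two audio tracks as similar above a chosen (assumed) threshold.
# similar = fingerprint_similarity(fp_original, fp_transcoded) > 0.95
```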
  • generation of the aforementioned audio fingerprints would end up with some unwanted errors that would cause some flips in the bits compared. In fact, the probability of occurrence of a 1-bit error is rather high, at 98% of the time. So, in one example, if the difference between two strings of audio fingerprint bits that are compared is just 1 bit out of all the bits in the two audio fingerprints, it is safe to assume that the two audio fingerprints are similar. To decrease the chances of errors in the comparison, in other examples, a threshold similarity level or confidence level can be calculated.
  • Such threshold similarity level or confidence level can be compared with a similarity value calculated for 2 audio files being compared to determine whether the 2 audio files are similar. It can be determined that the 2 audio files are similar if the similarity value is higher or lower than the calculated threshold similarity level or confidence level.
  • similarity determination through the Chromaprint-based technique is most efficient and accurate.
  • Examples of the present disclosure may have the following features.
  • a method for video content validation comprising: inputting a pair of videos (e.g. 202 and 212 in Figure 2) for comparison; extracting a video frame descriptor (e.g. 206 and 216 in Figure 2) from each frame of the pair of videos, wherein the video frame descriptors to be extracted are predetermined; processing the extracted video frame descriptors from each frame of the pair of videos to generate input to a model derived from a trained computer neural network (e.g. 208 in Figure 2); and outputting from the model, a distance vector (e.g. 210 in Figure 2) to be used for calculation of a similarity value (e.g. 218 in Figure 2) indicative of similarity between the pair of videos.
  • the trained computer neural network may be deep Residual Neural Network (ResNet).
  • the method may comprise: extracting features from activations of convolution layers of a pre-trained computer neural network having convolution layers; and deriving a compact representation by Maximum Activation of Convolutions (MAC) to extract the video frame descriptors from each frame of the pair of videos.
  • the method may comprise: forward propagating each frame of the pair of videos through the pre-trained computer neural network comprising a plurality of convolution layers to generate feature maps; extracting a single descriptor from each convolution layer by applying max pooling on every channel of each feature map to extract a single value; concatenating vectors of each convolution layer to obtain the video frame descriptor for each frame; and normalizing the obtained video frame descriptor for each frame.
  • the method may comprise: calculating a global video descriptor from the obtained video frame descriptors for each frame of each video; and inputting the global video descriptor of each of the pair of videos to the model.
  • the method may comprise: defining for the model, similarity between the pair of videos as a squared Euclidean distance in a video embedding space, wherein each video is mapped to a point in a Euclidean space; and defining an embedding function for the model, that maps an original video, a similar video similar to the original video, and a dissimilar video dissimilar to the original video, to a common space and assigns smaller distance when the pair of videos are similar compared to when the pair of videos are non-similar such that a distance between the original video and the similar video is smaller than a distance between the original video and the dissimilar video, wherein the distance vector is indicative of the distance between the original video and the similar video and the distance between the original video and the dissimilar video.
  • the method may comprise: inputting video samples to train the model, wherein the video samples are in form of triplets (e.g. 104 in Figure 1 ) and each triplet comprises an original video, a similar video similar to the original video and a dissimilar video dissimilar to the original video.
  • the method may comprise: defining a hinge loss function (e.g. 108 in Figure 1 ) for each triplet that includes a margin parameter to ensure a difference between a distance between the original video and the similar video and a distance between the original video and the dissimilar video so that if these two distances are calculated to fall within the margin parameter, the triplet is not penalised for inaccuracy.
  • the method may comprise: feeding each video of each triplet independently into a neural network architecture (e.g. 106 in Figure 1) comprising three respective deep Residual Neural Networks and based on three dense fully-connected layers, a 1x1 convolutional layer for reducing the number of channels, and a normalization layer.
  • the method may comprise using deep Residual Neural Network (ResNet) with FAemb algorithm for feature embedding.
  • the method may be used in a verification process of a transcoding task in a Blockchain-based platform.
  • the method may comprise: obtaining a Chromaprint audio image for audio content of each of the pair of videos; filtering each Chromaprint audio image to obtain an audio fingerprint; and comparing the audio fingerprints of the audio content of each of the pair of videos to determine similarity between the audio content of each of the pair of videos.
  • An apparatus for video content validation may comprise: a memory; and a processor configured to execute instructions stored in the memory to operate the apparatus to: input a pair of videos for comparison; extract video frame descriptors from each frame of the pair of videos, wherein the video frame descriptors to be extracted are predetermined; process the video frame descriptors to generate input to a model derived from a trained computer neural network; and output from the model, a distance vector to be used for calculation of a similarity value indicative of similarity between the pair of videos.
  • the trained computer neural network may be deep Residual Neural Network (ResNet).
  • the apparatus may be operable by the processor to: extract features from activations of convolution layers of a pre-trained computer neural network having convolution layers; and derive a compact representation by Maximum Activation of Convolutions (MAC) to extract the video frame descriptors from each frame of the pair of videos.
  • the apparatus may be operable by the processor to: forward propagate each frame of the pair of videos through the pre-trained computer neural network comprising a plurality of convolution layers to generate feature maps; extract a single descriptor from each convolution layer by applying max pooling on every channel of each feature map to extract a single value; concatenate vectors of each convolution layer to obtain the video frame descriptor for each frame; and normalize the obtained video frame descriptor for each frame.
  • the apparatus may be operable by the processor to: calculate a global video descriptor from the obtained video frame descriptors for each frame of each video; and input the global video descriptor of each of the pair of videos to the model.
  • the apparatus may be operable by the processor to: define for the model, similarity between the pair of videos as a squared Euclidean distance in a video embedding space, wherein each video is mapped to a point in an Euclidean space; and define an embedding function for the model, that maps an original video, a similar video similar to the original video, and a dissimilar video dissimilar to the original video, to a common space and assigns smaller distance when the pair of videos are similar compared to when the pair of videos are non-similar such that a distance between the original video and the similar video is smaller than a distance between the original video and the dissimilar video, wherein the distance vector is indicative of the distance between the original video and the similar video and the distance between the original video and the dissimilar video.
  • the apparatus may be operable by the processor to: input video samples to train the model, wherein the video samples are in form of triplets and each triplet comprises an original video, a similar video similar to the original video and a dissimilar video dissimilar to the original video.
  • the apparatus may be operable by the processor to: define a hinge loss function for each triplet that includes a margin parameter to enforce a difference between the distance between the original video and the similar video and the distance between the original video and the dissimilar video, so that if these two distances fall within the margin parameter, the triplet is not penalised for inaccuracy (see the loss formula after this list).
  • the apparatus may be operable by the processor to: feed each video of each triplet independently into a neural network architecture comprising three respective deep Residual Neural Networks and based on three dense fully-connected layers, a 1×1 convolutional layer for reducing the number of channels, and a normalization layer (a rough sketch follows this list).
  • the apparatus may be operable by the processor to: use a deep Residual Neural Network (ResNet) with the FAemb algorithm for feature embedding.
  • the apparatus may be used in a verification process of a transcoding task in a Blockchain-based platform.
  • the apparatus may be operable by the processor to: obtain a Chromaprint audio image for audio content of each of the pair of videos; filter each Chromaprint audio image to obtain an audio fingerprint; and compare the audio fingerprints of the audio content of each of the pair of videos to determine similarity between the audio content of each of the pair of videos (a simple fingerprint-comparison sketch is given after this list).
  • the Artificial Intelligence algorithms set out in sections 3.2.1 to 3.2.5 above can be used, for example, to build a proof of transcoding system for video content validation in the Blockchain-based platform or other similar platforms.
  • FIG. 6 shows in detail an example of a processor 600 suitable for executing instructions or a program to perform a method of the proposed approach described with reference to section 3 above and/or to operate an apparatus to perform the method of the proposed approach described with reference to section 3 above.
  • the processor 600 may be a mobile device such as a smartphone, tablet device or the like. Alternatively, instead of one processor, more than one processor 600 may be deployed for the same purpose.
  • the processor 600 may comprise a processing unit 602 for processing software including one or more computer programs for running one or more computer/server applications to enable a backend logic flow, the method or the methods for carrying out the proposed approach described with reference to section 3 above.
  • the processing unit 602 may include user input modules such as a computer mouse 636, keyboard/keypad 604, and/or a plurality of output devices such as a display device 608.
  • the display of the display device 608 may be a touch screen capable of receiving user input as well.
  • the processing unit 602 may be connected to a computer network 612 via a suitable transceiver device 614 (i.e. a network interface), to enable access to e.g. the Internet or other network systems such as a wired Local Area Network (LAN) or Wide Area Network (WAN).
  • the processing unit 602 may also be connected to one or more external wireless communication enabled devices 634 via a suitable wireless transceiver device 632 e.g. a WiFi transceiver, Bluetooth module, Mobile telecommunication transceiver suitable for Global System for Mobile Communication (GSM), 3G, 3.5G, 4G telecommunication systems, and the like.
  • the processing unit 602 can gain access to one or more storages i.e. data storages, databases, data servers and the like connectable to the computer network 612 to retrieve and/or store data in the one or more storages.
  • the processing unit 602 may include an arithmetic logic unit (ALU) 618, a Random Access Memory (RAM) 620 and a Read Only Memory (ROM) 622.
  • the processing unit 602 may also include a number of Input/Output (I/O) interfaces, for example I/O interface 638 to the computer mouse 636, a memory card and/or a Subscriber Identity Module (SIM) card slot or slots 616, I/O interface 624 to the display device 608, and I/O interface 626 to the keyboard/keypad 604.
  • the components of the processing unit 602 typically communicate via an interconnected bus 628 and in a manner known to the person skilled in the relevant art.
  • the computer programs may be supplied to the user of the processor 600, or the processor (not shown) of one of the one or more external wireless communication enabled devices 634, encoded on a data storage medium such as a CD-ROM, on a flash memory carrier, a Solid State Drive or a Hard Disk Drive, and are to be read using a corresponding data storage medium drive of a data storage device 630.
  • Such computer or application programs may also be downloaded from the computer network 612.
  • the application programs are read, and their execution is controlled, by the processor 618.
  • Intermediate storage of program data may be accomplished using RAM 620.
  • one or more of the computer or application programs may be stored on any non-transitory machine- or computer- readable medium.
  • the machine- or computer- readable medium may include storage devices such as magnetic or optical disks, memory chips, or other storage devices suitable for interfacing with a general purpose computer.
  • the machine- or computer- readable medium may also include a hard-wired medium such as that exemplified in the Internet system, or wireless medium such as that exemplified in the Wireless LAN (WLAN) system and the like.
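A minimal sketch, not the patented implementation, of how the frame-descriptor items above could be realised: activations are taken from the residual stages of a pre-trained ResNet, max-pooled over every channel (MAC), concatenated, and L2-normalised, and the per-frame descriptors are then averaged into a global video descriptor. The torchvision ResNet-50 backbone, the choice of hooked layers, and mean pooling as the aggregation step are assumptions, not details quoted from the application.

```python
# MAC (Maximum Activation of Convolutions) frame descriptors, illustrative only.
# Assumptions: torchvision >= 0.13 ResNet-50, hooks on the four residual stages,
# mean pooling of frame descriptors into the global video descriptor.
import torch
import torch.nn.functional as F
import torchvision.models as models

class MACExtractor(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the "pre-trained computer neural network having convolution layers".
        self.backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.backbone.eval()
        self._maps = []
        for stage in (self.backbone.layer1, self.backbone.layer2,
                      self.backbone.layer3, self.backbone.layer4):
            stage.register_forward_hook(lambda _m, _i, out: self._maps.append(out))

    @torch.no_grad()
    def frame_descriptor(self, frame):           # frame: float tensor of shape (3, H, W)
        self._maps.clear()
        self.backbone(frame.unsqueeze(0))        # forward propagate the frame
        descs = []
        for fmap in self._maps:                  # fmap: (1, C, h, w) feature map
            # Max pooling on every channel extracts a single value per channel,
            # i.e. one descriptor per convolution layer.
            descs.append(fmap.amax(dim=(2, 3)).squeeze(0))
        desc = torch.cat(descs)                  # concatenate the layer vectors
        return F.normalize(desc, dim=0)          # normalise the frame descriptor

    @torch.no_grad()
    def video_descriptor(self, frames):          # frames: iterable of (3, H, W) tensors
        # Assumed aggregation: average per-frame descriptors into a global descriptor.
        per_frame = torch.stack([self.frame_descriptor(f) for f in frames])
        return F.normalize(per_frame.mean(dim=0), dim=0)
```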
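Written out explicitly, the squared-Euclidean similarity definition above takes the standard form below, with f the learned embedding function and o, s, d the original, similar and dissimilar videos; the notation is chosen here for illustration.

```latex
D\big(f(x), f(y)\big) = \lVert f(x) - f(y) \rVert_2^2,
\qquad
D\big(f(o), f(s)\big) < D\big(f(o), f(d)\big)
```

The distance vector output by the model can then be read as the pair of values D(f(o), f(s)) and D(f(o), f(d)).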
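A common form of the triplet hinge loss with margin parameter m, consistent with the item above and assumed here for illustration, is:

```latex
\mathcal{L}(o, s, d) = \max\!\Big(0,\; m + D\big(f(o), f(s)\big) - D\big(f(o), f(d)\big)\Big),
\qquad
\mathcal{L}_{\text{total}} = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}(o_i, s_i, d_i)
```

Under this formulation, a triplet whose dissimilar-video distance already exceeds the similar-video distance by at least the margin contributes zero loss, so that triplet is not penalised.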
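As a rough sketch of the three-branch arrangement above (three ResNet branches, three dense fully-connected layers, a 1×1 convolution for channel reduction, and a normalisation layer), the module below is one possible reading. The layer widths, the 128-dimensional output, the pooling choice, the use of three independent branch copies, and the margin value are assumptions; pre-trained weights would be loaded in practice.

```python
# Triplet embedding network sketch: three ResNet branches, a 1x1 convolution to
# reduce channels, three fully-connected layers and an L2-normalisation layer.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class EmbeddingBranch(nn.Module):
    def __init__(self, out_dim=128):
        super().__init__()
        resnet = models.resnet50()                      # load pre-trained weights in practice
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])  # convolutional feature maps
        self.reduce = nn.Conv2d(2048, 256, kernel_size=1)  # 1x1 conv reduces channel count
        self.fc = nn.Sequential(                        # three dense fully-connected layers
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x):                               # x: (B, 3, H, W)
        fmap = self.reduce(self.trunk(x))               # (B, 256, h, w)
        pooled = fmap.amax(dim=(2, 3))                  # MAC-style max pooling per channel
        return F.normalize(self.fc(pooled), dim=1)      # normalisation layer

class TripletNet(nn.Module):
    def __init__(self):
        super().__init__()
        base = EmbeddingBranch()
        # One branch per video in the triplet (independent copies here).
        self.branch_o = base
        self.branch_s = copy.deepcopy(base)
        self.branch_d = copy.deepcopy(base)

    def forward(self, o, s, d):
        eo, es, ed = self.branch_o(o), self.branch_s(s), self.branch_d(d)
        d_pos = (eo - es).pow(2).sum(dim=1)             # squared distance, original vs similar
        d_neg = (eo - ed).pow(2).sum(dim=1)             # squared distance, original vs dissimilar
        return d_pos, d_neg                             # the "distance vector"

def triplet_hinge_loss(d_pos, d_neg, margin=1.0):
    # Hinge loss with a margin parameter, averaged over the batch of triplets.
    return F.relu(margin + d_pos - d_neg).mean()
```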
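For the audio fingerprint comparison above, one simple sketch treats each Chromaprint fingerprint as a sequence of 32-bit integers and scores the pair by bit-error rate. Obtaining the raw fingerprints (for example with the fpcalc tool or a Chromaprint binding) and the 0.15 acceptance threshold are assumptions for illustration.

```python
# Compare two Chromaprint-style fingerprints (sequences of 32-bit integers)
# by the bit-error rate over their overlapping portion.

def bit_error_rate(fp_a, fp_b):
    """Fraction of differing bits over the overlapping part of two fingerprints."""
    n = min(len(fp_a), len(fp_b))
    if n == 0:
        return 1.0
    diff_bits = sum(bin(a ^ b).count("1") for a, b in zip(fp_a[:n], fp_b[:n]))
    return diff_bits / (32 * n)

def audio_matches(fp_a, fp_b, threshold=0.15):
    """Treat the two audio tracks as similar when the bit-error rate is small."""
    return bit_error_rate(fp_a, fp_b) <= threshold
```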

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The invention concerns a method and an apparatus for video content validation, the method comprising: inputting a pair of videos for comparison; extracting video frame descriptors from each frame of the pair of videos, wherein the video frame descriptors to be extracted are predetermined; processing the video frame descriptors to generate input to a model derived from a trained computer neural network; and outputting from the model a distance vector to be used for calculating a similarity value indicative of similarity between the pair of videos.
PCT/SG2018/050379 2018-07-27 2018-07-27 Procédé et appareil de validation de contenu vidéo WO2020022956A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/SG2018/050379 WO2020022956A1 (fr) 2018-07-27 2018-07-27 Procédé et appareil de validation de contenu vidéo

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG2018/050379 WO2020022956A1 (fr) 2018-07-27 2018-07-27 Procédé et appareil de validation de contenu vidéo

Publications (1)

Publication Number Publication Date
WO2020022956A1 true WO2020022956A1 (fr) 2020-01-30

Family

ID=69182397

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SG2018/050379 WO2020022956A1 (fr) 2018-07-27 2018-07-27 Procédé et appareil de validation de contenu vidéo

Country Status (1)

Country Link
WO (1) WO2020022956A1 (fr)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111277902A (zh) * 2020-02-17 2020-06-12 北京达佳互联信息技术有限公司 一种视频匹配方法和装置及设备
CN111667050A (zh) * 2020-04-21 2020-09-15 佳都新太科技股份有限公司 度量学习方法、装置、设备及存储介质
CN112418191A (zh) * 2021-01-21 2021-02-26 深圳阜时科技有限公司 指纹识别模型构建方法、存储介质及计算机设备
CN112949431A (zh) * 2021-02-08 2021-06-11 证通股份有限公司 视频篡改检测方法和系统、存储介质
CN113505680A (zh) * 2021-07-02 2021-10-15 兰州理工大学 基于内容的高时长复杂场景视频不良内容检测方法
US11153358B2 (en) 2019-07-31 2021-10-19 Theta Labs, Inc. Methods and systems for data caching and delivery over a decentralized edge network
CN113596467A (zh) * 2020-04-30 2021-11-02 北京达佳互联信息技术有限公司 一种转码服务的检测方法、装置、电子设备及存储介质
CN114550272A (zh) * 2022-03-14 2022-05-27 东南大学 基于视频时域动态注意力模型的微表情识别方法及装置
US11659015B2 (en) 2019-10-11 2023-05-23 Theta Labs, Inc. Tracker server in decentralized data streaming and delivery network
US11763332B2 (en) 2020-11-16 2023-09-19 Theta Labs, Inc. Edge computing platform supported by smart contract enabled blockchain network
US11778167B1 (en) 2022-07-26 2023-10-03 Insight Direct Usa, Inc. Method and system for preprocessing optimization of streaming video data
US11849241B2 (en) 2021-12-29 2023-12-19 Insight Direct Usa, Inc. Dynamically configured processing of a region of interest dependent upon published video data selected by a runtime configuration file
US11961273B2 (en) 2021-12-29 2024-04-16 Insight Direct Usa, Inc. Dynamically configured extraction, preprocessing, and publishing of a region of interest that is a subset of streaming video data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040221237A1 (en) * 1999-03-11 2004-11-04 Fuji Xerox Co., Ltd. Methods and apparatuses for interactive similarity searching, retrieval and browsing of video
CN106778604A (zh) * 2015-12-15 2017-05-31 西安电子科技大学 基于匹配卷积神经网络的行人再识别方法

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040221237A1 (en) * 1999-03-11 2004-11-04 Fuji Xerox Co., Ltd. Methods and apparatuses for interactive similarity searching, retrieval and browsing of video
CN106778604A (zh) * 2015-12-15 2017-05-31 西安电子科技大学 基于匹配卷积神经网络的行人再识别方法

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
DO T.-T. ET AL.: "FAemb: A function approximation-based embedding method for image retrieval", 2015 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 12 June 2015 (2015-06-12), pages 3556 - 3564, XP032793806, [retrieved on 20180903], DOI: 10.1109/CVPR.2015.7298978 *
HE K. ET AL.: "Deep Residual Learning for Image Recognition", 2016 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 30 June 2016 (2016-06-30), pages 770 - 778, XP055536240, [retrieved on 20180903], DOI: 10.1109/CVPR.2016.90 *
RADENOVIC F. ET AL.: "CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples", COMPUTER SCIENCE , COMPUTER VISION AND PATTERN RECOGNITION (CS.CV, 7 September 2016 (2016-09-07), pages 1 - 17, XP047355245, DOI: 10.1007/978-3-319-46448-0_1 *
TOLIAS G. ET AL.: "Particular object retrieval with integral max-pooling of CNN activations", COMPUTER SCIENCE , COMPUTER VISION AND PATTERN RECOGNITION (CS.CV, 24 February 2016 (2016-02-24), pages 1 - 12, XP055464086 *
WANG J. ET AL.: "Learning Fine-Grained Image Similarity with Deep Ranking", 2014 IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, 28 June 2014 (2014-06-28), pages 1386 - 1393, XP032649351, [retrieved on 20180903], DOI: 10.1109/CVPR.2014.180 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11153358B2 (en) 2019-07-31 2021-10-19 Theta Labs, Inc. Methods and systems for data caching and delivery over a decentralized edge network
US11659015B2 (en) 2019-10-11 2023-05-23 Theta Labs, Inc. Tracker server in decentralized data streaming and delivery network
CN111277902A (zh) * 2020-02-17 2020-06-12 北京达佳互联信息技术有限公司 一种视频匹配方法和装置及设备
CN111667050A (zh) * 2020-04-21 2020-09-15 佳都新太科技股份有限公司 度量学习方法、装置、设备及存储介质
CN113596467A (zh) * 2020-04-30 2021-11-02 北京达佳互联信息技术有限公司 一种转码服务的检测方法、装置、电子设备及存储介质
CN113596467B (zh) * 2020-04-30 2024-03-12 北京达佳互联信息技术有限公司 一种转码服务的检测方法、装置、电子设备及存储介质
US11763332B2 (en) 2020-11-16 2023-09-19 Theta Labs, Inc. Edge computing platform supported by smart contract enabled blockchain network
CN112418191A (zh) * 2021-01-21 2021-02-26 深圳阜时科技有限公司 指纹识别模型构建方法、存储介质及计算机设备
CN112418191B (zh) * 2021-01-21 2021-04-20 深圳阜时科技有限公司 指纹识别模型构建方法、存储介质及计算机设备
CN112949431A (zh) * 2021-02-08 2021-06-11 证通股份有限公司 视频篡改检测方法和系统、存储介质
CN113505680A (zh) * 2021-07-02 2021-10-15 兰州理工大学 基于内容的高时长复杂场景视频不良内容检测方法
US11849241B2 (en) 2021-12-29 2023-12-19 Insight Direct Usa, Inc. Dynamically configured processing of a region of interest dependent upon published video data selected by a runtime configuration file
US11849240B2 (en) 2021-12-29 2023-12-19 Insight Direct Usa, Inc. Dynamically configured processing of a region of interest dependent upon published video data selected by a runtime configuration file
US11849242B2 (en) 2021-12-29 2023-12-19 Insight Direct Usa, Inc. Dynamically configured processing of a region of interest dependent upon published video data selected by a runtime configuration file
US11961273B2 (en) 2021-12-29 2024-04-16 Insight Direct Usa, Inc. Dynamically configured extraction, preprocessing, and publishing of a region of interest that is a subset of streaming video data
CN114550272A (zh) * 2022-03-14 2022-05-27 东南大学 基于视频时域动态注意力模型的微表情识别方法及装置
CN114550272B (zh) * 2022-03-14 2024-04-09 东南大学 基于视频时域动态注意力模型的微表情识别方法及装置
US11778167B1 (en) 2022-07-26 2023-10-03 Insight Direct Usa, Inc. Method and system for preprocessing optimization of streaming video data

Similar Documents

Publication Publication Date Title
WO2020022956A1 (fr) Procédé et appareil de validation de contenu vidéo
CN109359636B (zh) 视频分类方法、装置及服务器
Dolhansky et al. The deepfake detection challenge (dfdc) dataset
US10007723B2 (en) Methods for identifying audio or video content
Bondi et al. Training strategies and data augmentations in cnn-based deepfake video detection
CN109871490B (zh) 媒体资源匹配方法、装置、存储介质和计算机设备
WO2020022958A1 (fr) Procédé et appareil de vérification de transactions dans un réseau de chaînes de blocs
CN110866958A (zh) 一种文本到图像的方法
EP3508986A1 (fr) Identification de reprise de musique pour recherche, conformité et octroi de licences
KR20170018042A (ko) 규칙 기반 비디오 중요도 분석
EP3168754B1 (fr) Procédé d'identification d'une entité
CN111683285B (zh) 文件内容识别方法、装置、计算机设备及存储介质
US20110219076A1 (en) System and method for integrating user generated content
US20220318349A1 (en) Liveness detection using audio-visual inconsistencies
US11983195B2 (en) Tokenized voice authenticated narrated video descriptions
CN112866776B (zh) 视频生成方法和装置
US11711363B2 (en) Systems for authenticating digital contents
US10924637B2 (en) Playback method, playback device and computer-readable storage medium
Altinisik et al. Video source characterization using encoding and encapsulation characteristics
US20210209256A1 (en) Peceptual video fingerprinting
CN113689527A (zh) 一种人脸转换模型的训练方法、人脸图像转换方法
Güera Media forensics using machine learning approaches
CN112559975A (zh) 一种基于区块链的数字媒体版权实现方法、设备及介质
US11954676B1 (en) Apparatus and method for minting NFTs from user-specific events
JP5275376B2 (ja) 情報処理装置及び情報処理プログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18927935

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18927935

Country of ref document: EP

Kind code of ref document: A1