WO2021007846A1 - Method, apparatus and device for video similarity detection

Method, apparatus and device for video similarity detection

Info

Publication number
WO2021007846A1
Authority: WIPO (PCT)
Prior art keywords: video, key frame, similar, editing, feature extraction
Application number: PCT/CN2019/096515
Other languages: English (en), French (fr)
Inventors: 李军, 刘昊淼, 公维蒙, 涂丹丹
Original Assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to EP19937682.3A (EP3989158A4)
Priority to PCT/CN2019/096515 (WO2021007846A1)
Priority to CN201980098001.5A (CN114041165A)
Publication of WO2021007846A1
Priority to US17/568,705 (US20220172476A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/732 Query formulation
    • G06F16/7328 Query by example, e.g. a complete video frame or video sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/757 Matching configurations of points or features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/48 Matching video sequences
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/2628 Alteration of picture size, shape, position or orientation, e.g. zooming, rotation, rolling, perspective, translation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278 Subtitling

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method for video similarity detection and to an apparatus and device for performing the method.
  • Editing existing videos to produce videos in a variety of styles not only enriches and diversifies video content and adds entertainment value, but also poses greater challenges to video information security.
  • For a video obtained by applying various types of editing operations to one or more existing videos, detecting its similarity to the existing videos to obtain the detailed overlapping regions and identifying the types of editing used are crucial for video-related services such as copyright authentication and advertisement recognition for similar videos.
  • Existing solutions for video similarity detection mainly include: 1. setting an adaptive threshold based on the correlation between the pattern-noise distributions of the video to be detected and its similar video, and performing similarity detection and localization of the video based on this adaptive threshold; 2. calculating the similarity between the video to be detected and existing videos based on a hash algorithm; 3. comparing the global features and local features of frames sampled at fixed intervals in the video to obtain the positions of edited or tampered regions.
  • However, the prior art can only judge whether a video was obtained by editing one or more existing videos; it cannot determine which specific types of editing operations were applied to the one or more existing videos to obtain the video to be detected. Therefore, how to determine a video that is similar to the video to be detected, and how to detect the editing types existing between the video to be detected and the similar video, are technical problems to be solved urgently in video similarity detection.
  • This application provides a method for video similarity detection, which can determine a video that is similar to the video to be detected and further determine the editing type used between the video to be detected and the similar video.
  • In a first aspect, the present application provides a method for video similarity detection.
  • The method includes: receiving a first video, and determining a key frame of the first video according to the first video; inputting the key frame into a feature extraction model to obtain a feature of the key frame; determining a similar key frame and a second video according to the feature of the key frame, where the second video is the video in which the similar key frame is located and the second video is similar to the first video; and inputting the key frame and the similar key frame into an editing type recognition model to obtain an editing type, where the editing type indicates the type of editing used between the first video and the second video.
  • The video similarity detection method provided in this application not only determines the similar video corresponding to the video to be detected, but also, on the basis of determining the similar video, obtains the editing type between the video to be detected and the similar video. This makes the method more advantageous in applications such as video copyright authentication and advertisement recognition.
  • The method further includes: outputting the second video or information about the second video to a display module, where the information about the second video includes the name of the second video; and outputting the editing type to the display module.
  • The display module intuitively displays the video similar to the video to be detected and the editing type between the two videos, so that the user can obtain this information at a glance.
  • The editing type recognition model includes a first feature extraction branch, a second feature extraction branch, and a predictor. Inputting the key frame and the similar key frame into the editing type recognition model to obtain the editing type specifically includes: inputting the key frame into the first feature extraction branch and inputting the similar key frame into the second feature extraction branch; the first feature extraction branch performs feature extraction on the key frame and outputs an editing feature of the key frame, and the second feature extraction branch performs feature extraction on the similar key frame and outputs an editing feature of the similar key frame; and the editing feature of the key frame and the editing feature of the similar key frame are input into the predictor, which outputs the editing type.
  • Because the editing type recognition model includes feature extraction branches and a predictor, the accuracy of the obtained editing type is high.
  • the method further includes: calculating a similarity between the first video and the second video; and outputting the similarity to a display module.
  • This method provides similarity information, which enriches the detection results of the video to be detected, and these results can be used by users or other modules.
  • Determining the similar key frame and the second video according to the feature of the key frame specifically includes: querying a video library according to the feature of the key frame and obtaining the similar key frame from the video library, where the feature of the similar key frame is similar to the feature of the key frame; and determining the second video according to the similar key frame. Determining similar videos by first determining similar key frames improves the accuracy of video similarity detection.
  • the feature extraction model and the edit type recognition model respectively adopt different neural network models.
  • Both the feature extraction model and the editing type recognition model adopt a trained neural network model, so that the video similarity detection efficiency of this application is high, and the accuracy of the detection result obtained is high.
  • The editing type includes one or more of the following operations: cropping, splicing, rotating, mirroring, blurring, adding text, adding icons, changing color, changing brightness, and changing contrast.
  • The method further includes: determining a similar shot in the similar video according to the similar key frame, where the similar shot is a shot similar to the shot in which the key frame is located; and outputting the correspondence between the similar shot and the shot in which the key frame is located to the display module.
  • In this way, the method can also accurately output the correspondence between similar shots and the shots in which the key frames are located, making the video similarity detection results richer and helping users make further plans based on the detection results.
  • the editing type may also be an editing type used for editing between the similar shot and the shot where the key frame is located.
  • the similarity between the video and the similar video further includes the similarity between a shot of the video and a similar shot in the corresponding similar video.
  • Determining the key frame of the video according to the video specifically includes: performing structural analysis on the video according to its content to obtain a shot of the video, where the shot is a collection of video frames in the video that express a piece of continuous screen content; and determining the key frame within the shot, where the key frame is a video frame representing the main screen content of the shot.
  • In a second aspect, the present application provides a detection device, including: a structure analysis module configured to receive a first video and determine a key frame of the first video according to the first video; a feature extraction model configured to obtain a feature of the key frame according to the key frame; a comparative analysis module configured to determine a similar key frame and a second video according to the feature of the key frame, where the second video is the video in which the similar key frame is located and the second video is similar to the first video; and an editing type recognition model configured to obtain an editing type according to the key frame and the similar key frame, where the editing type indicates the type of editing used between the first video and the second video.
  • The detection device further includes an output module configured to output the second video or information about the second video to a display module, where the information about the second video includes the name of the second video, and further configured to output the editing type to the display module.
  • The editing type recognition model includes a first feature extraction branch, a second feature extraction branch, and a predictor. The first feature extraction branch is configured to receive the key frame, perform feature extraction on the key frame, and output an editing feature of the key frame; the second feature extraction branch is configured to receive the similar key frame, perform feature extraction on the similar key frame, and output an editing feature of the similar key frame; and the predictor is configured to obtain the editing type according to the editing feature of the key frame and the editing feature of the similar key frame.
  • The comparative analysis module is further configured to calculate the similarity between the first video and the second video, and the output module is further configured to output the similarity to the display module.
  • The comparative analysis module is specifically configured to: query a video library according to the feature of the key frame and obtain the similar key frame from the video library, where the feature of the similar key frame is similar to the feature of the key frame; and determine the second video according to the similar key frame.
  • the feature extraction model and the edit type recognition model respectively adopt different neural network models.
  • The editing type includes one or more of the following operations: cropping, splicing, rotating, mirroring, blurring, adding text, adding icons, changing color, changing brightness, and changing contrast.
  • The comparative analysis module is further configured to determine a similar shot in the similar video according to the similar key frame, where the similar shot is a shot similar to the shot in which the key frame is located; and the output module is further configured to output the correspondence between the similar shot and the shot in which the key frame is located to the display module.
  • the editing type may also be an editing type used for editing between the similar shot and the shot where the key frame is located.
  • the similarity between the video and the similar video further includes the similarity between the shot of the video and the similar shot in the corresponding similar video.
  • The structure analysis module is specifically configured to: perform structural analysis on the video according to the content of the video to obtain a shot of the video, where the shot is a collection of video frames in the video that express a piece of screen content with a continuous background; and determine the key frame within the shot, where the key frame is a video frame representing the main screen content of the shot.
  • In a third aspect, the present application provides a computing device system including at least one computing device, where each computing device includes a memory and a processor, the memory of the at least one computing device is configured to store computer instructions, and the processor of the at least one computing device executes the computer instructions stored in the memory to perform the method provided in the first aspect or in any possible implementation of the first aspect.
  • In a fourth aspect, the present application provides a non-transitory readable storage medium that stores a program. When the program is executed by a computing device, the computing device performs the method provided in the foregoing first aspect or in any possible implementation of the first aspect.
  • The storage medium includes but is not limited to volatile memory, such as random access memory, and non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
  • In a fifth aspect, the present application provides a computer program product. The computer program product includes computer instructions that, when executed by a computing device, cause the computing device to perform the method provided in the foregoing first aspect or in any possible implementation of the first aspect.
  • The computer program product may be a software installation package, and it may be downloaded to and executed on a computing device.
  • FIG. 1 is a schematic diagram of the relationship between videos, video clips, shots, and key frames provided by an embodiment of the application;
  • FIG. 2 is a schematic diagram of the deployment of a detection device provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of the deployment of another detection device provided by an embodiment of the application.
  • FIG. 4 is a schematic structural diagram of a computing device 100 equipped with a detection device provided by an embodiment of the application;
  • FIG. 5 is a schematic structural diagram of a training device 200 and a detection device 300 provided by an embodiment of this application;
  • FIG. 6 is a schematic flowchart of a method for video similarity detection provided by an embodiment of the application.
  • FIG. 7 is a schematic structural diagram of a feature extraction model provided by an embodiment of this application.
  • FIG. 8 is a schematic structural diagram of an editing type recognition model provided by an embodiment of the application.
  • FIG. 9 is a schematic diagram of displaying information output by the detection device in the form of text according to an embodiment of the application.
  • FIG. 10 is a schematic diagram of displaying information output by a detection device in the form of a visual interface provided by an embodiment of the application;
  • FIG. 11 is a schematic flowchart of a method for determining shots and key frames according to an embodiment of the application.
  • FIG. 12 is a schematic diagram of a computing device system provided by an embodiment of this application.
  • A video is an electrical signal that records continuously occurring pictures of the real world.
  • Figure 1 is a schematic diagram of the relationship between videos, video clips, shots, and key frames.
  • a video can be divided into multiple video clips according to the content of the screen. Each video clip records a relatively complete plot of video content.
  • a video clip can be divided into multiple shots, and the video content in each shot is a single shot of the camera with a continuous background.
  • a shot contains one or more video frames, and each video frame is an independent image.
  • the video frame that can describe the main content of the current shot is called the key frame of the shot.
  • a shot can have one or more key frames.
  • Various methods can be used to determine the key frame in a shot, so that the content of a shot can be characterized by the content in the key frame.
  • A video can thus be divided into multiple levels, and its smallest unit is a video frame. In other words, one or more video frames (including key frames) form a shot, shots of different scenes form a video clip, and one or more video clips form a complete video.
  • the relationship between a video and its corresponding video segments, shots, and key frames is called the multi-level structure of a video, and the multi-level structure of the video can be analyzed according to the content of a video.
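  • As an illustration of the multi-level structure described above, the following minimal sketch (not part of the patent; the class and field names are illustrative assumptions) shows one way the video / clip / shot / key-frame hierarchy could be represented in Python:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    start_frame: int                                      # index of the shot's first frame in the whole video
    end_frame: int                                        # index of the shot's last frame in the whole video
    key_frames: List[int] = field(default_factory=list)  # frame indices of the shot's key frames

@dataclass
class VideoClip:
    shots: List[Shot] = field(default_factory=list)      # one clip groups several shots

@dataclass
class Video:
    clips: List[VideoClip] = field(default_factory=list) # a video is made of one or more clips

# Smallest unit: a video frame (here just an index); one or more frames form a shot,
# shots form a clip, and one or more clips form the complete video.
video = Video(clips=[VideoClip(shots=[Shot(start_frame=0, end_frame=120, key_frames=[10, 60])])])
```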
  • This application provides a method for video similarity detection.
  • the method analyzes and detects the video based on the multi-level structure of the video, and obtains the result of whether the video to be detected is similar to the video in the video library.
  • When the video to be detected is similar to a video in the video library, the position information of the similar parts and the editing types of the shots of the video to be detected relative to the similar video can be further obtained, such as cropping, splicing, rotation, mirroring, blurring, adding text, adding icons, changing color, changing brightness, and changing contrast.
  • In this application, two videos being similar means that the two videos include one or more similar key frames; that is, one or more key frames contained in one of the two videos are obtained by editing one or more key frames contained in the other video through one or more types of editing operations.
  • For example, if all the key frames in the first video clip of video A are obtained by adding subtitles to key frames of video B, and the key frames in the 2nd to Nth video clips of video A (N is a positive integer greater than or equal to 1) are obtained by removing icons from some key frames of video C, then video B and video C are considered similar videos of video A; that is, video A is similar to video B, and video A is also similar to video C.
  • the embodiment of the present application provides a method for video similarity detection, which is executed by a detection device.
  • The detection device can be deployed flexibly.
  • the detection device can be deployed in a cloud environment, which is an entity that uses basic resources to provide cloud services to users in a cloud computing mode.
  • the cloud environment includes a cloud data center and a cloud service platform.
  • the cloud data center includes a large number of basic resources (including computing resources, storage resources, and network resources) owned by a cloud service provider.
  • The computing resources included in the cloud data center may be a large number of computing devices (for example, servers).
  • The detection device may be a server used for video detection in the cloud data center, or a virtual machine created in the cloud data center for video detection, or a software device deployed on a server or virtual machine in the cloud data center and used to detect videos. The software device may be deployed in a distributed manner on multiple servers, on multiple virtual machines, or across both virtual machines and servers. As shown in Figure 2, the detection device is abstracted by the cloud service provider on the cloud service platform into a cloud service for video similarity detection and provided to users.
  • The cloud environment uses the detection device to provide users with a video similarity detection cloud service. A user can upload the video to be detected to the cloud environment through an application program interface (API) or through the web interface provided by the cloud service platform. The detection device receives and detects the video to be detected, and the detection result is returned by the detection device to the user's terminal, or the detection result is stored in the cloud environment, for example, presented on the web interface of the cloud service platform for the user to view.
  • When the detection device is a software device, it can be logically divided into multiple parts, each with a different function (for example, the detection device includes a structure analysis module, a feature extraction model, a comparative analysis module, an editing type recognition model, and an output module). Several parts of the detection device can be deployed in different environments or on different devices.
  • One part of the detection device may be deployed on a terminal computing device (such as a terminal server, smartphone, laptop, tablet computer, personal desktop computer, or smart camera), and the other part may be deployed in a data center (specifically, on a server or virtual machine in the data center). The data center can be a cloud data center or an edge data center, where an edge data center is a collection of edge computing devices deployed closer to the terminal computing device.
  • For example, a structure analysis module of the detection device is deployed on a smartphone, and the smartphone acquires a video.
  • the structure analysis module is used to analyze the structure of the video.
  • the smart phone sends the data after the structure analysis to the data center through the network.
  • The data center is deployed with a feature extraction model, a comparative analysis module, an editing type recognition model, and an output module. These modules and models further process the data after structure analysis and finally obtain the detection result.
  • the data center sends the detection result to the smart phone, so that the user who uses the smart phone can obtain the video detection result.
  • This application does not restrict which parts of the detection device are deployed on the terminal computing device and which parts are deployed in the data center; in actual applications, the deployment can be adapted to the computing capabilities of the terminal computing device or to specific application requirements. It is worth noting that, in an embodiment, the detection device can also be deployed in three parts, where one part is deployed on the terminal computing device, one part is deployed in an edge data center, and the other part is deployed in a cloud data center.
  • The detection device can also be deployed entirely on a computing device in any single environment (for example, entirely on a terminal computing device, or entirely on a computing device in a data center), as shown in Figure 4.
  • the computing device 100 includes a bus 101, a processor 102, a communication interface 103, and a memory 104.
  • the processor 102, the memory 104, and the communication interface 103 communicate through a bus 101.
  • The processor 102 may be a central processing unit (CPU).
  • The memory 104 may include volatile memory, such as random access memory (RAM).
  • The memory 104 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, an HDD, or an SSD.
  • the memory 104 stores executable code included in the detection device, and the processor 102 reads the executable code in the memory 104 to execute the video similarity detection method.
  • the memory 104 may also include an operating system and other software modules required for running processes.
  • The operating system may be LINUX™, UNIX™, WINDOWS™, etc.
  • A neural network model is a type of mathematical computation model that imitates the structure and function of a biological neural network (the central nervous system of animals).
  • A neural network model can include multiple neural network layers with different functions, and each layer includes parameters and calculation formulas. Different layers in the neural network model have different names according to their calculation formulas or functions; for example, a layer that performs convolution calculations is called a convolutional layer, and convolutional layers are often used to perform feature extraction on an input signal (for example, an image).
  • a neural network model can also be composed of a combination of multiple existing neural network models.
  • Neural network models with different structures can be used in different scenarios (for example: classification, recognition) or provide different effects when used in the same scenario.
  • Differences between neural network model structures include one or more of the following: the number of network layers in the models differs, the order of the network layers differs, and the weights, parameters, or calculation formulas in each network layer differ.
  • The method for video similarity detection provided in this application requires two different neural network models: one is a neural network model used for feature extraction on the video to be detected, called the feature extraction model; the other is a model used to identify the editing types between two similar videos, called the editing type recognition model.
  • the feature extraction model and the editing type recognition model can be trained by the training device before being used to detect the editing type of the video.
  • The training device uses different training sets to train the feature extraction model and the editing type recognition model. After training is completed by the training device, the trained feature extraction model and editing type recognition model are deployed in the detection device, and the detection device is used to detect the editing type of a video.
  • FIG. 5 provides a schematic diagram of the structure of a training device 200 and a detection device 300.
  • The structure and function of the training device 200 and the detection device 300 are introduced below in conjunction with FIG. 5. It should be understood that the division of the structure and function modules of the training device 200 and the detection device 300 in the embodiment of the present application is only exemplary, and this application does not restrict their specific division.
  • the training device 200 is used to train the feature extraction model 203 and the edit type recognition model 204 respectively.
  • Training the feature extraction model 203 and the editing type recognition model 204 requires two different training sets, called the feature extraction training set and the editing type recognition training set. The obtained feature extraction training set and editing type recognition training set are stored in a database.
  • a collection device can collect multiple training videos or training images, and the collected multiple training videos or training images are processed and annotated manually or by the collection device to form a training set.
  • The acquisition device divides each training video into shots, determines key frames in the divided shots, uses the determined key frames as training images, and then processes and labels the training images to form a training set.
  • When the training device 200 starts training the feature extraction model 203, the initialization module 201 first initializes the parameters of each layer in the feature extraction model 203 (that is, assigns an initial value to each parameter), and then the training module 202 reads the training images in the feature extraction training set in the database to train the feature extraction model 203 until the loss function of the feature extraction model 203 converges or all training images in the feature extraction training set have been used for training; the training of the feature extraction model 203 is then complete.
  • Similarly, when the training device 200 starts training the editing type recognition model 204, the initialization module 201 first initializes the parameters of each layer in the editing type recognition model 204 (that is, assigns an initial value to each parameter), and then the training module 202 reads the training images in the editing type recognition training set in the database to train the editing type recognition model 204 until the loss function of the editing type recognition model 204 converges or all training images in the editing type recognition training set have been used for training; the training of the editing type recognition model 204 is then complete.
  • The feature extraction model 203 and the editing type recognition model 204 can also be trained by two separate training devices, and the feature extraction model 203 and/or the editing type recognition model 204 may not need to be trained by the training device 200 at all; for example, the feature extraction model 203 and/or the editing type recognition model 204 may be a neural network model that has already been trained by a third party and has good accuracy for feature extraction and/or editing type recognition.
  • The basic feature extraction part of any neural network model such as AlexNet, ResNet, MobileNet, or DenseNet can be used as the feature extraction model 203, and the loss function used to train the feature extraction model 203 may be a triplet loss function.
  • The feature extraction training set includes two types of image groups: one type is labeled as similar image groups, that is, the label of the image group is set to similar, and the other type is labeled as dissimilar image groups, that is, the label of the image group is set to dissimilar. A feature extraction training set includes multiple similar image groups and multiple dissimilar image groups.
  • A similar image group includes two images: one is an original image, and the other is a generated image obtained by editing the original image through one or more editing types (for example, the original image and the image generated by rotating and cropping the original image constitute a similar image group). Since the generated image is obtained from the original image through one or more editing types, the original image and the generated image share some similar features.
  • A dissimilar image group also includes two (or more) images, but the images have no before-and-after editing relationship with each other.
  • The feature extraction training set is used to train the feature extraction model: a similar image group is used as a positive sample, so that the feature extraction model learns the similar features of the two images in the group, and a dissimilar image group is used as a negative sample, so that the feature extraction model has a stronger ability to distinguish similar image groups from dissimilar image groups.
  • A feature extraction model trained on the feature extraction training set can extract the features of a detected key frame more accurately.
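  • To make the triplet-loss training described above concrete, the following is a hedged PyTorch sketch, not the patent's implementation: the original image acts as the anchor, its edited version (from a similar image group) as the positive, and an image from a dissimilar image group as the negative. The backbone architecture and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical backbone: any CNN that maps an image to an embedding vector would do.
backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 128),
)

triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

def train_step(original, edited, dissimilar):
    """original/edited come from a similar image group (positive pair);
    dissimilar comes from a dissimilar image group (negative sample)."""
    anchor = backbone(original)      # feature of the original image
    positive = backbone(edited)      # feature of the edited (generated) image
    negative = backbone(dissimilar)  # feature of an unrelated image
    loss = triplet_loss(anchor, positive, negative)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example with random tensors standing in for batches of RGB key frames.
loss = train_step(torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224), torch.randn(4, 3, 224, 224))
```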
  • the edit type recognition training set also includes multiple image groups, and each image group is composed of an original image and a generated image corresponding to the original image.
  • The generated image corresponding to an original image is an image obtained from the original image through one or more types of editing operations.
  • Each image group is given one or more labels, and the labels of an image group are the editing types used to produce the generated image in the group. For example, if an image group includes an original image and a generated image obtained by rotating and cropping the original image, the image group has two labels, namely rotation and cropping (or other label representations corresponding to rotation and cropping).
  • The images in the image groups of the editing type recognition training set can be the same as the images in the similar image groups of the feature extraction training set, that is, the collected training images can be reused by the two training sets; however, the labels of the image groups in the editing type recognition training set differ from those of the similar image groups in the feature extraction training set.
  • The editing type recognition training set includes image groups with many types of labels, and there are multiple image groups for each type of label. The image groups for a label can be single-label image groups (for example, for multiple image groups labeled only with rotation, the generated image in each group is obtained by rotating the original image) or multi-label image groups (for example, for multiple image groups with the three labels rotation, cropping, and adding icons, the generated image in each group is obtained by rotating, cropping, and adding icons to the original image).
  • the edit type recognition model 204 may use multi-label cross entropy as a loss function when being trained.
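  • The multi-label cross-entropy loss mentioned above could, for example, be realized as a sigmoid binary cross entropy over independent editing-type labels; the following is a hedged PyTorch sketch (the label names and batch contents are illustrative assumptions):

```python
import torch
import torch.nn as nn

EDIT_TYPES = ["crop", "splice", "rotate", "mirror", "blur",
              "add_text", "add_icon", "color", "brightness", "contrast"]

# Predictor output: one logit per editing type; an image group may carry several labels.
logits = torch.randn(2, len(EDIT_TYPES))          # batch of 2 image groups
targets = torch.zeros(2, len(EDIT_TYPES))
targets[0, EDIT_TYPES.index("rotate")] = 1.0      # first group labeled (rotate, crop)
targets[0, EDIT_TYPES.index("crop")] = 1.0
targets[1, EDIT_TYPES.index("add_icon")] = 1.0    # second group labeled (add icon)

# Multi-label cross entropy: sigmoid + binary cross entropy over all labels.
loss = nn.BCEWithLogitsLoss()(logits, targets)
print(loss.item())
```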
  • the feature extraction model 203 and the editing type recognition model 204 trained by the training device 200 can be used for feature extraction and video/image editing type recognition, respectively.
  • the trained feature extraction model 203 and the edit type recognition model 204 are deployed to the detection device 300.
  • In the detection device 300, the trained feature extraction model 203 is called the feature extraction model 302, and the trained editing type recognition model 204 is called the editing type recognition model 304.
  • the detection device 300 includes a structure analysis module 301, a feature extraction model 302, a comparison analysis module 303, an edit type recognition model 304, and an output module 305.
  • The structure analysis module 301 is used to receive a video to be detected (which can be a complete video file or a piece of video, such as a video stream obtained in real time), analyze the structure of the video, and divide the video into one or more shots (or decompose the video into one or more video clips and then decompose each video clip into one or more shots). The structure analysis module 301 is also used to determine, within each shot, one or more key frames that represent the content of the shot.
  • the structure analysis module 301 outputs a piece of structure data corresponding to the video structure.
  • the structure data indicates the position of each shot of the video in the video and the position of the key frame of each shot.
  • the feature extraction model 302 is connected with the structure analysis module 301 through a communication path, and is used to read the key frame of each shot in the video according to the structure data, perform feature extraction on each key frame, and output the feature of each key frame.
  • the comparative analysis module 303 is connected with the feature extraction model 302 through a communication path, and is used to query the video library according to the feature corresponding to each key frame, and obtain one or more videos similar to the video to be detected in the video library.
  • the comparative analysis module 303 is also used to determine the similarity between the similar video and the video to be detected, and the correspondence and similarity of key frames or shots in the similar video that are similar to the video to be detected.
  • the editing type recognition model 304 is connected to the comparison analysis module 303.
  • For the similar key frames in the similar videos obtained by the comparative analysis module 303, each similar key frame and the corresponding key frame in the video to be detected are input into the editing type recognition model 304, and the editing type recognition model 304 obtains the editing type between the similar key frame in the similar video and the corresponding key frame in the video to be detected.
  • The output module 305 is connected to the comparative analysis module 303 and the editing type recognition model 304 through communication channels. The output module 305 outputs the one or more videos similar to the video to be detected that were obtained by the comparative analysis module 303, and the editing types between the similar key frames in those similar videos and the corresponding key frames in the video to be detected.
  • the output module 305 outputs the similarity between the similar video and the video to be detected and the position information of the key frames or shots in the similar video that are similar to the video to be detected.
  • Both the training device 200 and the detection device 300 can be software devices.
  • The training device 200 and the detection device 300 can be deployed on the same computing device (for example, on the same server, or on two different virtual machines in the same server), or they can be deployed on different computing devices (for example, the training device 200 is deployed on one or more servers in a cloud environment while the detection device 300 is deployed on one or more servers in an edge environment).
  • The deployment of the training device 200 is also relatively flexible. As with the deployment of the detection device described above, it can be deployed entirely on one computing device, or its parts can be deployed on different computing devices that operate in coordination so that the various parts together implement all the functions of the training device 200.
  • S401 Receive a video to be detected, and determine a shot and a key frame according to the content of the video to be detected.
  • The detection device 300 obtains a video whose editing type is to be detected (for example, the detection device 300 receives a video to be detected that is uploaded by a user or an administrator, or receives a video shot in real time by another device), performs structural analysis of the video to be detected according to its content, and determines the shots and the key frames in each shot.
  • A video to be detected can include multiple shots.
  • a shot is a single shot of a camera with a continuous background.
  • a shot is usually a scene, and the content of a shot can be represented by key frames.
  • The method for determining the shots may adopt a sliding window, and the boundaries of the shots may be determined based on the grayscale histogram difference between adjacent video frames in the video to be detected.
  • The video to be detected is segmented at the shot boundaries, and a key frame is selected according to the content of the video frames in each segmented shot.
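  • As a concrete illustration of the sliding-window shot boundary detection based on grayscale histogram differences described above, here is a minimal OpenCV sketch; the threshold value, histogram size, and function name are illustrative assumptions rather than the patent's prescribed implementation:

```python
import cv2
import numpy as np

def shot_boundaries(video_path, threshold=0.5):
    """Return frame indices where a new shot is assumed to start, based on the
    normalized grayscale-histogram difference between consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None:
            # A large histogram distance between adjacent frames suggests a shot boundary.
            diff = np.abs(hist - prev_hist).sum()
            if diff > threshold:
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```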
  • a piece of structure data corresponding to the video is obtained through structural analysis of the video to be detected.
  • The structure data can be expressed as {[s1, e1, k10, k11, ...], [s2, e2, k20, k21, ...], ..., [sn, en, kn0, kn1, ...]}, where each element of the structure data, such as [s1, e1, k10, k11, ...], represents one shot: s1 is the frame index of the shot's starting video frame in the entire video, e1 is the frame index of the shot's ending video frame in the entire video, and k10 is the offset, in number of frames, of a key frame of the shot from the shot's starting video frame.
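  • For example, such structure data could be held in memory as a plain nested list (a hedged sketch whose field order follows the notation above):

```python
# One inner list per shot: [start_frame, end_frame, key_frame_offset_0, key_frame_offset_1, ...]
structure_data = [
    [0,   120, 10, 60],   # shot 1: frames 0..120, key frames at offsets 10 and 60
    [121, 300, 15],       # shot 2: frames 121..300, one key frame at offset 15
]

# Absolute frame index of the first key frame of shot 1: s1 + k10.
first_key_frame = structure_data[0][0] + structure_data[0][2]
```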
  • this application does not limit the specific implementation of determining shots and key frames based on the content of the video to be detected.
  • Different specific schemes for determining shots and key frames based on the content of the video to be detected can be used. Through this step, all the key frames in the video to be detected are obtained.
  • the shots included in the video to be detected are determined first, and then the key frames in the shots are further determined.
  • Alternatively, the key frames may be obtained directly without first determining the shots.
  • S402: The feature extraction model performs feature extraction on the key frame.
  • the key frame of each shot obtained in the foregoing step S401 is a two-dimensional image.
  • each key frame is input to the trained feature extraction model, and the feature extraction model outputs the feature of each key frame.
  • the feature of the key frame can be a multi-dimensional matrix, and the feature of the key frame represents the hidden characteristics of the screen content of the key frame.
  • This application does not specifically limit the structure of the neural network model used in the feature extraction model.
  • The feature extraction model can use a neural network model commonly used in the industry for image classification, or the backbone of a neural network model for image recognition, or an improved neural network model.
  • the feature extraction model shown in Figure 7 uses a convolutional neural network model.
  • the convolutional neural network model includes multiple convolutional layers.
  • Each convolutional layer includes one or more convolution kernels, and each convolution kernel includes multiple parameters.
  • The sizes of the convolution kernels can be the same or different (for example, the first convolutional layer of the feature extraction model can have 16 convolution kernels of size 7*7). When a key frame (or a tensor) is input to a convolutional layer, the convolutional layer outputs a tensor. The tensor output by a convolutional layer is a three-dimensional array containing multiple values, for example a tensor of scale W*H*L, where W represents the width of the tensor, H represents the height of the tensor, L represents the number of channels of the tensor, and W, H, and L are all natural numbers greater than 0. The number of convolution kernels included in a convolutional layer determines the number of channels of the tensor output by that layer; for example, a convolutional layer that includes 16 convolution kernels outputs a tensor with 16 channels.
  • the size and number of convolution kernels in different convolution layers can be the same or different.
  • The scale of the tensor output by each convolutional layer is jointly determined by the key frame (or tensor) input to the convolutional layer, the size and number of convolution kernels in the convolutional layer, and the convolution calculation method.
  • the tensor output by the last convolutional layer is used as the feature of the key frame and output by the feature extraction model.
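  • A convolutional feature extractor of the kind described above could look like the following minimal PyTorch sketch (an illustrative assumption, not the model of Figure 7): a stack of convolutional layers whose last output tensor, of scale W*H*L, serves as the key-frame feature.

```python
import torch
import torch.nn as nn

class KeyFrameFeatureExtractor(nn.Module):
    """Stack of convolutional layers; the tensor produced by the last layer
    (scale W*H*L) is used as the feature of the input key frame."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            # First convolutional layer: 16 kernels of size 7*7 -> 16 output channels.
            nn.Conv2d(3, 16, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, key_frame):
        return self.layers(key_frame)            # feature tensor of the key frame

extractor = KeyFrameFeatureExtractor()
key_frame = torch.randn(1, 3, 224, 224)          # one RGB key frame
feature = extractor(key_frame)                   # shape [1, 64, 28, 28], i.e. L = 64 channels
```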
  • All the key frames in the video to be detected obtained in step S401 are subjected to the operation of step S402; therefore, the features of all key frames in the video to be detected are obtained in step S402.
  • S403 Determine a video similar to the video to be detected according to the characteristics of the key frame.
  • The features of all key frames in the video to be detected are obtained in step S402. The feature of each key frame is compared with the features of the key frames of all videos in the video library to determine similar key frames that are similar to the key frame, and the videos to which the similar key frames belong are then determined; a video to which a similar key frame belongs is called a similar video, and the similar video is similar to the video to be detected.
  • The degree of similarity between the video to be detected and the similar video is further determined, as well as the correspondence and positions of the shots in which the similar key frames are located and the shots in which the corresponding key frames are located.
  • the video library used in step S403 is a pre-organized and calculated video library.
  • the video library includes multiple videos.
  • The same video structure analysis operation as in the foregoing step S401 is performed on each video to determine the shots and key frames in each video; that is, each video in the video library corresponds to a piece of structure data, and the structure data indicates the start and end frames of each shot in the video and the key frames in each shot.
  • This application also executes the same method as the aforementioned step S402 on each key frame in the video library, that is, also performs feature extraction on the key frame of each video in the video library to obtain the feature of each key frame. Therefore, the video library in this application stores multiple videos, structural data corresponding to each video, and features of key frames in each video.
  • This application does not limit the source of the videos in the video library.
  • Videos can be collected according to the specific application scenario of the video similarity detection method provided in this application; for example, when the video similarity detection method is used by a film and television works protection department, the videos in the video library can be the original film and television works that can be collected.
  • The richer the video library, the greater the probability of obtaining a similar video with high similarity to the video to be detected. It is worth noting that the operation of performing video structure analysis on the videos in the video library to determine their shots and key frames can be performed at any time before step S403 is performed.
  • The specific process of step S403 is described as follows:
  • S4031 Determine similar key frames in the video library according to the acquired feature of each key frame in the video to be detected.
  • The feature of each key frame can be compared with the features of the key frames in the video library one by one to calculate their similarity, and a key frame of a video in the video library whose similarity is greater than a preset threshold is determined to be a similar key frame of the compared key frame; the specific method of similarity calculation is not limited in this application. If the video library contains no key frame similar to any key frame in the video to be detected, the editing type detection of the video ends. If the video library contains similar key frames that are similar to key frames in the video to be detected, the subsequent steps are performed. It should be understood that each key frame of the video to be detected may have one or more similar key frames in the video library.
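  • One plausible way to compare a key-frame feature against the pre-computed features in the video library is cosine similarity with a preset threshold; the sketch below is a hedged illustration (the patent does not prescribe a specific similarity measure), and the data layout and threshold are assumptions:

```python
import numpy as np

def find_similar_key_frames(query_feature, library_features, threshold=0.85):
    """library_features: list of (video_id, frame_id, feature_vector) tuples.
    Returns the library key frames whose cosine similarity exceeds the threshold."""
    q = query_feature.ravel()
    q = q / (np.linalg.norm(q) + 1e-12)
    matches = []
    for video_id, frame_id, feat in library_features:
        f = feat.ravel()
        f = f / (np.linalg.norm(f) + 1e-12)
        similarity = float(np.dot(q, f))          # cosine similarity in [-1, 1]
        if similarity > threshold:
            matches.append((video_id, frame_id, similarity))
    return matches
```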
  • S4032 Determine a similar video similar to the video to be detected according to the similar key frame.
  • A graph search method may be used to determine the similar videos that are similar to the video to be detected: the key frames in the video to be detected and their corresponding similar key frames can form a graph, in which the key frames and their corresponding similar key frames are regarded as nodes.
  • Paths are constructed from the similar-key-frame nodes in the graph, and each path includes multiple nodes and the edges connecting the nodes.
  • Nodes are determined in turn according to the temporal order of the key frames of the video to be detected: for each key frame, a similar key frame corresponding to that key frame is found and used as a node on a path.
  • The found similar key frame must belong to the same video as the similar key frames already on the path. Therefore, the nodes on each of the resulting paths satisfy the condition that the similar key frames on the same path are in the same video (if none of the similar key frames corresponding to a certain key frame belongs to the same video as the similar key frames already on a path, that key frame is skipped for that path). Each path therefore corresponds to one video, which is called a similar video. According to the similar key frames on each path and the structure data, stored in the video library, of the video corresponding to the path, the shot in which each similar key frame is located in the similar video is determined; this shot is called a similar shot of the shot in which the corresponding key frame is located, and the similar shot together with the shot in which the corresponding key frame of the video to be detected is located is called a similar shot pair.
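  • The path-building idea above can be illustrated with a simplified sketch that walks the key frames of the video to be detected in temporal order and extends one path per library video; the data layout and names are assumptions, and the full graph search of the embodiment is reduced here to grouping matches by video:

```python
from collections import defaultdict

def build_paths(key_frames, similar_map):
    """key_frames: key-frame ids of the video to be detected, in temporal order.
    similar_map: key_frame_id -> list of (video_id, similar_key_frame_id).
    Returns one path per similar video: a list of (key_frame, similar_key_frame) pairs."""
    paths = defaultdict(list)                    # video_id -> path of node pairs
    for kf in key_frames:
        for video_id, similar_kf in similar_map.get(kf, []):
            # Each library video gets its own path; a key frame with no match
            # in that video simply contributes no node to it (it is skipped).
            paths[video_id].append((kf, similar_kf))
    return dict(paths)

# Each resulting path corresponds to one similar video.
paths = build_paths([0, 1, 2], {0: [("B", 7)], 2: [("B", 9), ("C", 3)]})
# -> {'B': [(0, 7), (2, 9)], 'C': [(2, 3)]}
```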
  • For a similar shot pair, the similarity between the key frame included in one shot and the similar key frame included in the other shot may be taken as the similarity of the similar shot pair.
  • The similarity between a similar video and the video to be detected can be obtained by a weighted average of the similarities between the shots of the similar video and the corresponding shots of the video to be detected; alternatively, the ratio of the total duration of the shots in which the similar key frames are located in the similar video to the total duration of the video can be used as the similarity between the similar video and the video to be detected.
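  • Both ways of scoring video-level similarity mentioned above can be sketched as follows (a hedged illustration; the duration-based weighting is an assumption):

```python
def weighted_average_similarity(shot_pairs):
    """shot_pairs: list of (shot_duration_seconds, pair_similarity) for each similar shot pair.
    Weighted average of shot-pair similarities, weighted here by shot duration."""
    total = sum(d for d, _ in shot_pairs)
    return sum(d * s for d, s in shot_pairs) / total if total else 0.0

def duration_ratio_similarity(similar_shot_durations, video_duration):
    """Ratio of the summed duration of shots containing similar key frames
    to the total duration of the video."""
    return sum(similar_shot_durations) / video_duration

print(weighted_average_similarity([(10.0, 0.9), (5.0, 0.6)]))   # 0.8
print(duration_ratio_similarity([10.0, 5.0], 60.0))             # 0.25
```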
  • Through step S403, one or more similar videos that are similar to the video to be detected are obtained, and the similar key frames in those similar videos are obtained.
  • the similar key frames in each similar video and the corresponding key frames in the video to be detected are formed into a key frame group, and then each similar video corresponds to one or more key frame groups.
  • The editing type recognition model extracts editing features from the key frame and the similar key frame in a key frame group, performs prediction, and outputs the one or more editing types existing between the key frame and the similar key frame in the group. The one or more editing types existing between the key frame and the similar key frame indicate the types of editing between them; through editing operations of these one or more editing types, the key frame and the similar key frame can be converted into each other.
  • Each key frame group of each similar video is processed by the editing type recognition model in turn, and the editing type between each similar key frame in the similar video and the corresponding key frame in the video to be detected is obtained. It is worth noting that, because the key frame and the similar key frame are video frames that represent the content of a shot in the video to be detected and the content of the similar shot in the similar video, the one or more editing types obtained between the key frame and the similar key frame also represent the one or more editing types existing between the shot in which the key frame is located and the similar shot in the corresponding similar video.
  • The editing type recognition model is a pre-trained neural network model. Figure 8 shows an exemplary editing type recognition model. The model shown in Figure 8 includes two feature extraction branches, called the first feature extraction branch and the second feature extraction branch, and a predictor; the outputs of the two feature extraction branches are the inputs of the predictor, and the predictor outputs the predicted editing type or types.
  • The two feature extraction branches of the editing type recognition model are composed of the same multiple convolutional layers (the number of convolutional layers is the same and the parameters of the convolutional layers are also the same). The key frame and the similar key frame of a key frame group are input to the first feature extraction branch and the second feature extraction branch respectively. The first feature extraction branch performs convolution on the key frame, and its last convolutional layer outputs the editing features of the key frame; the second feature extraction branch performs convolution on the similar key frame, and its last convolutional layer outputs the editing features of the similar key frame. The editing features of the key frame and of the similar key frame then serve together as the input of the predictor, which performs its computation and prediction and outputs the editing types existing between the key frame and the similar key frame. Depending on the relationship between the key frame and the similar key frame, the predictor can output one or more editing types. A minimal sketch of such a two-branch model follows below.
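  • The following PyTorch-style sketch illustrates the described structure: two convolutional branches with identical layers and parameters feeding a predictor that can report several editing types at once. The layer sizes, the global pooling, the concatenation of the two editing features and the 0.5 decision threshold are assumptions of this example and are not specified in the patent; the multi-label sigmoid output merely mirrors the possibility, mentioned above, of outputting more than one editing type per key frame group.

```python
# Illustrative sketch only (PyTorch): architecture details are assumed, not prescribed.
import torch
import torch.nn as nn

class EditTypeRecognizer(nn.Module):
    def __init__(self, num_edit_types: int = 10):
        super().__init__()
        # one convolutional stack; applying it to both frames gives two branches
        # with the same layers and the same parameters
        self.branch = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # -> (N, 128, 1, 1) editing features
        )
        # predictor: takes the editing features of both frames, one score per editing type
        self.predictor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 128, 256), nn.ReLU(),
            nn.Linear(256, num_edit_types),
        )

    def forward(self, key_frame: torch.Tensor, similar_key_frame: torch.Tensor):
        f1 = self.branch(key_frame)                  # editing features of the key frame
        f2 = self.branch(similar_key_frame)          # editing features of the similar key frame
        logits = self.predictor(torch.cat([f1, f2], dim=1))
        return torch.sigmoid(logits)                 # independent probability per editing type

# usage sketch: frames as (N, 3, H, W) tensors; types whose score exceeds a
# threshold are reported as the editing types of this key frame group
model = EditTypeRecognizer(num_edit_types=10)
scores = model(torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224))
predicted_types = (scores > 0.5).nonzero(as_tuple=True)[1].tolist()
```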
  • It is worth noting that the one or more editing types output by the predictor of the editing type recognition model represent the one or more editing types used for editing between the input key frame and similar key frame, and therefore also the editing types used for editing between the shot containing the key frame and the similar shot containing the similar key frame. If all key frame groups corresponding to a similar video yield the same editing type or types from the editing type recognition model, then that one or those several editing types are the editing types between the similar video and the video to be detected. If the key frame groups corresponding to a similar video yield different kinds of editing types, the editing types between the similar video and the video to be detected are all of the different kinds obtained.
  • For example, if a similar video and the video to be detected form three key frame groups, i.e. the similar video contains three similar key frames corresponding to key frames of the video to be detected, and the three groups passed through the editing type recognition model yield the outputs (rotate, crop), (add icon) and (mirror, crop), then the editing types existing between the similar video and the video to be detected are (rotate, crop, add icon, mirror).
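  • Under the assumption that the per-group predictions are simply merged, the aggregation of editing types over all key frame groups of one similar video can be sketched as a set union:

```python
# Illustrative sketch only: aggregation as a plain set union of per-group predictions.
def aggregate_edit_types(group_predictions):
    """group_predictions: iterable of iterables of editing-type names,
    one per key frame group of the same similar video."""
    video_types = set()
    for types in group_predictions:
        video_types.update(types)
    return sorted(video_types)

# matches the example above
print(aggregate_edit_types([("rotate", "crop"), ("add icon",), ("mirror", "crop")]))
# -> ['add icon', 'crop', 'mirror', 'rotate']
```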
  • It is worth noting that the one or more editing types used for editing between the video to be detected and a similar video can correspond to three situations: 1. the video to be detected can be obtained from the similar video through editing operations of the one or more editing types; 2. the video to be detected can be obtained from the similar video through the opposite editing operations corresponding to the one or more editing types, that is, the similar video can be obtained from the video to be detected through editing operations of the one or more editing types; 3. when multiple editing types are obtained, the video to be detected can be obtained from the similar video through editing operations of one or more of the editing types together with the opposite editing operations of another one or more of the editing types.
  • This application does not limit the specific classification and naming of editing types. In one way of classifying editing types, two editing operations that are the inverse of each other, for example adding or removing an icon, or adding or removing a filter, can be divided into two editing types; adding or removing an icon would then be set as the "add icon" editing type and the "remove icon" editing type respectively. With this classification, the editing type recognition model can output (1) the editing type by which the video to be detected is obtained from the similar video, or (2) the editing type by which the similar video is obtained from the video to be detected, or (3) both of these two mutually opposite editing types. For editing operations that have no opposite operation, there is only one editing type, and when the editing type recognition model outputs it, this indicates that the similar video has been edited by this editing type to obtain the video to be detected.
  • Some examples may help understanding. Suppose a first video has a first key frame and a third key frame, and a second video has a second key frame and a fourth key frame. If the difference between the first key frame and the second key frame is that the second key frame has one more icon than the first key frame, the result can be read in two ways. One reading is that the first video is the source video and the second video is the video generated after the first video (the source video) underwent an editing operation, so the editing type used for editing between the first video and the second video is "add icon". The other reading is that the second video is the source video and the first video is the video generated after the second video (the source video) underwent an editing operation, so the editing type used for editing between the first video and the second video is "delete icon". In other words, because "add icon" and "delete icon" are opposite operations, there may be two detection results, and the role of "source video" is reversed between them; in practical applications, either one of the two detection results, or both, can be output to the user. Besides adding (or deleting) an icon, editing operations such as adding text or adding a filter have the same property.
  • As another example, suppose the fourth key frame can be obtained by applying a "mosaic" operation to the third key frame. Unlike the previous example, the "mosaic" operation has no opposite operation, so the editing type in this example is unique. In this example, the first video is the source video, the second video is the video generated after the first video (the source video) was edited, the editing type used for editing between the first video and the second video is "mosaic", and the editing type recognition model simply outputs the editing type "mosaic".
  • In another way of classifying editing types, two editing operations that are the inverse of each other can also be combined into a single editing type. The name of such an editing type may simply be the name of one of the two opposite editing operations, or the name may reflect both opposite operations. With this classification, when the editing type recognition model outputs such an editing type, it indicates either that the video to be detected was obtained by the forward editing operation corresponding to this editing type, or that the video to be detected was obtained by the opposite editing operation corresponding to this editing type. For example, the two opposite editing operations of adding an icon and deleting an icon are combined into one editing type, and the name of this editing type is "add icon". If the key frames of the first video and the second video are input to the editing type recognition model and the model outputs the editing type "add icon", then the editing type used for editing between the first video and the second video is "add icon"; the concrete relationship between the two videos has two possibilities: the second video was obtained from the first video by adding an icon, or the second video was obtained from the first video by deleting an icon.
  • S405: Output the similar videos and the editing types.
  • Step S403 yields one or more similar videos, and step S404 yields the editing types existing between each similar video and the video to be detected. Each similar video, or information about each similar video (for example, the name of the similar video), together with the editing types existing between it and the video to be detected, can be output to a display module. The display module may be a module of the detection device, or a module of another apparatus or device; the display module can display each similar video (or its information) and the editing types corresponding to each similar video through a visual interface or as text.
  • Optionally, the similarity between each similar video and the video to be detected can also be output to the display module.
  • Optionally, the correspondence between the similar shots in each similar video and the corresponding shots in the video to be detected can also be output to the display module, and the display module can display this correspondence between the similar video and the video to be detected in various forms.
  • Optionally, the similarity between each shot of the video to be detected and the corresponding similar shot in the similar video can also be output to the display module.
  • The similar videos or their information, the editing types between the similar videos and the video to be detected, the similarity between the similar videos and the video to be detected, the correspondence between the similar shots in the similar videos and the corresponding shots in the video to be detected, and the similarity between the shots of the video to be detected and the corresponding similar shots are collectively referred to as the related information corresponding to the similar videos.
  • It is worth noting that, for the multiple similar videos obtained in step S403, the similar videos can be further screened according to their similarity, so that only the related information corresponding to the screened similar videos is output to the display module. For example, the similarity between each similar video and the video to be detected is compared with a preset screening threshold, and only the related information corresponding to similar videos whose similarity is greater than or equal to the preset screening threshold is output to the display module; a minimal sketch of this screening step follows below.
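  • A minimal sketch of the screening step, assuming the related information is kept as a simple list of records, could look as follows (the record layout and field names are assumptions of this example):

```python
# Illustrative sketch only: field names are assumptions, not identifiers from the patent.
def screen_similar_videos(related_info, screening_threshold):
    """related_info: list of dicts, one per similar video, each containing at least
    a 'similarity' entry; only videos whose similarity is greater than or equal to
    the preset screening threshold are passed on to the display module."""
    return [info for info in related_info if info["similarity"] >= screening_threshold]

selected = screen_similar_videos(
    [{"name": "video_A", "similarity": 0.92}, {"name": "video_B", "similarity": 0.40}],
    screening_threshold=0.8,
)   # -> only video_A is output to the display module
```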
  • Based on the information it receives, the display module can display that information in many different forms.
  • FIG. 9 is a schematic diagram, provided by an embodiment of this application, of a display module displaying the information output by the detection device in the form of text. As shown in FIG. 9, the text contains the related information of the top K similar videos in the video library with the highest similarity to the video Q to be detected, including: the name of each similar video, the similarity between each similar video and video Q, a list of the similar shots in each similar video, and an overall list of editing types between each similar video and video Q. The list of similar shots of a similar video contains the start and end frame numbers (or times) of the shots in video Q, the start and end frame numbers (or times) of the corresponding similar shots in the similar video, the shot similarity, the shot editing type, and other information; the entries of the similar shot list thus express the correspondence between the similar shots and the corresponding shots in video Q.
  • FIG. 10 is a schematic diagram, provided by another embodiment of this application, of a display module displaying the information output by the detection device in the form of a visual interface. As shown in FIG. 10, the visual interface displays the video Q to be detected, the similar videos that are similar to video Q, the corresponding similar shots and their similarities, and the editing type corresponding to each similar shot together with the similarity of each shot.
  • Optionally, the related information corresponding to the similar videos obtained in the foregoing steps S403 and S404 can also be output to a processing module. The processing module can be a functional module of the detection device or a functional module of another apparatus or device, and it can further process the related information corresponding to the similar videos.
  • The method described in the above steps S401-S405 completes the detection of editing types for the video to be detected. It should be understood that the specific implementations described for each step are only exemplary and do not limit the video similarity detection method provided by this application.
  • The following describes, with reference to FIG. 11, an exemplary implementation of determining the shots and key frames from the content of the video to be detected in step S401 (a simplified sketch of the whole loop is given after these steps):
  • S4011: Read video frames of the video to be detected and compare them for similarity. Specifically, the first and second video frames are read in temporal order, and an image hash algorithm or another similarity comparison algorithm is used to compare the later of the two frames with the earlier one. If the two consecutive frames are similar, a new video frame is read and compared with the later of the previous two frames; this continues until a new frame is not similar to the later of the previous two frames, at which point the two dissimilar video frames are stored in a buffer area.
  • S4012: Compute the gray-scale color histogram difference of two adjacent video frames in the buffer area. Specifically, the gray-scale color histograms of the two adjacent frames are computed and subtracted element by element to obtain the gray-scale color histogram difference of the two frames, which is stored in the cache.
  • S4013: Compare the number of video frames buffered in the buffer area with a preset minimum shot frame count. If the number of buffered frames is greater than the preset minimum shot frame count, step S4014 is executed; otherwise, step S4011 is executed.
  • S4014: Judge the shot boundary from the maximum and average of all gray-scale color histogram differences in the buffer area. Specifically, the maximum M and the average S of all histogram differences in the buffer are computed; if M is greater than n times S (n being a preset coefficient that can be set according to the application requirements), the later of the two video frames corresponding to M is determined to be a shot boundary and step S4015 is executed; otherwise, step S4016 is executed.
  • S4015: In the video to be detected, the frame following the previously determined shot boundary (or the first frame of the video to be detected), the currently determined shot boundary, and all video frames between them are determined to be one shot, and key frames are determined within this shot. The key frames are determined by computing the gray-scale color histogram differences of adjacent frames within the shot, selecting the frames whose difference is greater than a preset difference threshold, and then screening the selected frames to choose clear frames of moderate brightness as key frames. After step S4015 is completed, the buffer area is cleared and step S4011 is executed again.
  • S4016: Compute the gradient values of all video frames in the buffer area and compare the maximum gradient value with a preset gradient threshold. If it is greater than the preset threshold, the video frame corresponding to the maximum gradient is determined to be a shot boundary and step S4015 is executed; if it is less than or equal to the preset gradient threshold, step S4017 is executed.
  • S4017: Compare the number of video frames buffered in the buffer area with a preset maximum shot frame count. If the number of buffered frames is greater than the preset maximum shot frame count, step S4015 is executed; otherwise, step S4011 is executed.
  • These steps are executed according to their respective conditions until all video frames of the video to be detected have been processed.
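  • The loop of steps S4011-S4017 can be sketched in Python as follows. This is a simplified, hypothetical version: the similarity test of S4011 is replaced by a plain mean-pixel-difference check standing in for the image hash, the gradient fallback is only consulted once the maximum shot length is exceeded, key frame selection inside each shot is omitted, and every numeric parameter (minimum and maximum shot frame counts, the coefficient n, the histogram bin count) is an assumed value rather than one given in the patent.

```python
# Illustrative sketch only: all numeric parameters are assumed values.
import numpy as np

MIN_SHOT_FRAMES, MAX_SHOT_FRAMES = 15, 300   # preset minimum / maximum shot frame counts
N_COEFF = 3.0                                # coefficient n in the M > n * S test (S4014)
SIMILAR_TOL = 8.0                            # mean-pixel-difference tolerance (S4011 stand-in)

def gray_hist(frame):                        # frame: (H, W, 3) uint8 array
    hist, _ = np.histogram(frame.mean(axis=2), bins=64, range=(0, 255))
    return hist.astype(np.float64)

def frames_similar(f1, f2):
    return np.abs(f1.astype(np.float64) - f2.astype(np.float64)).mean() < SIMILAR_TOL

def detect_shot_boundaries(frames):
    """Return the indices of the frames chosen as shot boundaries."""
    boundaries, buf, diffs = [], [], []      # buf holds indices of buffered frames
    prev = 0
    for cur in range(1, len(frames)):
        if frames_similar(frames[prev], frames[cur]):        # S4011
            prev = cur
            continue
        buf += [prev, cur]                                   # store the dissimilar pair
        diffs.append(np.abs(gray_hist(frames[prev]) - gray_hist(frames[cur])).sum())  # S4012
        prev = cur
        if len(buf) <= MIN_SHOT_FRAMES:                      # S4013
            continue
        m, s = max(diffs), sum(diffs) / len(diffs)           # S4014
        if m > N_COEFF * s:
            boundaries.append(buf[2 * diffs.index(m) + 1])   # later frame of the max-difference pair
            buf, diffs = [], []                              # S4015: shot cut, clear the buffer
        elif len(buf) > MAX_SHOT_FRAMES:                     # simplified S4016/S4017 fallback
            grads = [np.abs(np.gradient(frames[i].mean(axis=2))).mean() for i in buf]
            boundaries.append(buf[int(np.argmax(grads))])    # frame with the largest gradient
            buf, diffs = [], []
    return boundaries
```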
  • It is worth noting that, in this application, different methods can be used to determine the shots and key frames depending on the type of content of the video to be detected. For example, when the video to be detected is a lecture, variety show or other content with a constant background, the video can be split into segments of fixed duration, each fixed-duration segment being treated as one shot, and key frames are then determined within each shot. There are also many ways to determine the key frames: for example, several key frames can be selected within a shot at a fixed video frame interval, or edge detection can be performed on each video frame of the shot and the frames whose edges differ most from those of the adjacent frames can be selected as key frames. A minimal sketch of the fixed-duration variant follows below.
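  • For the fixed-duration variant, a minimal sketch could look as follows; the 5-second shot length and the 2-second key frame spacing are assumptions of this example, not values given in the patent.

```python
# Illustrative sketch only: shot length and key frame spacing are assumed values.
def fixed_duration_structure(num_frames, fps, shot_seconds=5.0, key_seconds=2.0):
    """Split a constant-background video into fixed-duration shots and pick key
    frames at a fixed frame interval inside each shot. Returns structure data of
    the form [start, end, keyframe offsets...], as described for step S401."""
    shot_len = int(shot_seconds * fps)
    key_step = max(1, int(key_seconds * fps))
    structure = []
    for start in range(0, num_frames, shot_len):
        end = min(start + shot_len - 1, num_frames - 1)
        offsets = list(range(0, end - start + 1, key_step))   # offsets relative to shot start
        structure.append([start, end] + offsets)
    return structure

# e.g. a 30 fps video with 900 frames -> six 5-second shots
print(fixed_duration_structure(900, 30))
```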
  • In another embodiment of this application, the video similarity detection method differs slightly from the method described in steps S401-S405. The trained editing type recognition model can be split into two parts: one part includes the first feature extraction branch and the second feature extraction branch, the other part includes the predictor, and the two parts of the editing type recognition model can be stored in different locations (for example, different virtual machines or different physical computing devices). The feature extraction of key frames and similar key frames by the first and second feature extraction branches can then be completed before the editing type recognition model performs editing type recognition. For example, in the aforementioned step S403 the first feature extraction branch and the second feature extraction branch can be used to extract the editing features of the key frames and of the similar key frames respectively, and the obtained editing features are temporarily stored in a storage module; in step S404 the editing features of the key frames and of the similar key frames are read from the storage module and input to the predictor, which outputs the editing types existing between the key frames and the similar key frames. A minimal sketch of this split deployment is given below.
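  • The split deployment can be sketched as follows, reusing the hypothetical EditTypeRecognizer class sketched earlier; the dictionary standing in for the storage module and the function names are assumptions of this example.

```python
# Illustrative sketch only: the feature part and the predictor part of the
# (hypothetical) EditTypeRecognizer are run separately, with editing features
# cached between step S403 and step S404.
import torch

model = EditTypeRecognizer(num_edit_types=10)
feature_part = model.branch          # deployed where key frames are compared (step S403)
predictor_part = model.predictor     # deployed where editing types are predicted (step S404)

feature_store = {}                   # stand-in for the storage module

def extract_and_store(frame_id: str, frame: torch.Tensor):
    # run during step S403, once per key frame or similar key frame
    with torch.no_grad():
        feature_store[frame_id] = feature_part(frame.unsqueeze(0))

def predict_edit_types(key_id: str, similar_id: str, threshold: float = 0.5):
    # run during step S404, using only the cached editing features
    with torch.no_grad():
        feats = torch.cat([feature_store[key_id], feature_store[similar_id]], dim=1)
        scores = torch.sigmoid(predictor_part(feats))
    return (scores > threshold).nonzero(as_tuple=True)[1].tolist()
```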
  • This application provides a detection device 300 as shown in FIG. 5; the modules and functions included in the detection device 300 are as described above and are not repeated here. In one embodiment, the structure analysis module 301 of the detection device 300 is specifically used to perform the method described in the foregoing step S401; the feature extraction model 302 is specifically used to perform the method described in step S402; the comparative analysis module 303 is specifically used to perform the method described in step S403; the editing type recognition model 304 is specifically used to perform the method described in step S404; and the output module 305 is specifically used to perform the method described in step S405.
  • This application also provides a computing device 100 as shown in FIG. 4. The processor 102 of the computing device 100 reads the executable code, included in the detection device 300 and stored in the memory 104, to perform the video similarity detection method described above.
  • Because the modules of the detection device 300 of this application can be deployed separately on multiple computing devices, this application also provides a computing device system as shown in FIG. 12. The computing device system includes multiple computing devices 500, each of which includes a bus 501, a processor 502, a communication interface 503 and a memory 504; the processor 502, the memory 504 and the communication interface 503 communicate through the bus 501.
  • The processor 502 may be a CPU. The memory 504 may include a volatile memory, such as RAM, and may also include a non-volatile memory, such as ROM, flash memory, an HDD or an SSD. Executable code is stored in the memory 504, and the processor 502 executes the executable code to perform part of the video similarity detection method. The memory 504 may also contain an operating system and other software modules required by running processes; the operating system can be LINUX™, UNIX™, WINDOWS™, etc.
  • The computing devices 500 establish communication paths with one another through a communication network. Each computing device 500 runs any one or more of the structure analysis module 301, the feature extraction model 302, the comparative analysis module 303, the editing type recognition model 304 and the output module 305. Any computing device 500 may be a computing device in a cloud data center, a computing device in an edge data center, or a terminal computing device.
  • The descriptions of the flows corresponding to the drawings above each have their own emphasis; for a part that is not described in detail in one flow, reference may be made to the related descriptions of the other flows.
  • The above embodiments may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented by software, they may be implemented in whole or in part in the form of a computer program product. The computer program product for video similarity detection includes one or more computer instructions for video similarity detection; when these computer program instructions are loaded and executed on a computer, the processes or functions described in FIG. 6 according to the embodiments of the present invention are produced in whole or in part.
  • The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber or digital subscriber line) or wireless means (such as infrared, radio or microwave). The computer-readable storage medium is a readable storage medium that stores the computer program instructions for video similarity detection; it may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (for example, floppy disks, hard disks or magnetic tapes), optical media (for example, DVDs) or semiconductor media (for example, SSDs).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Mathematical Physics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Library & Information Science (AREA)
  • Biophysics (AREA)
  • Television Signal Processing For Recording (AREA)
  • Image Analysis (AREA)

Abstract

This application provides a video similarity detection method, which relates to the field of artificial intelligence and in particular to computer vision. The method includes: a detection device receives a first video and determines key frames of the first video according to the first video; inputs the key frames into a feature extraction model to obtain features of the key frames; determines similar key frames and a second video according to the features of the key frames, where the first video is the video to be detected and the second video is a video in a video library that is similar to the first video; and further inputs the key frames and the similar key frames into an editing type recognition model to obtain an editing type, where the editing type indicates the editing type used for editing between the first video and the second video. The feature extraction model and the editing type recognition model use different neural network models. This method can more accurately determine the similar videos corresponding to the video to be detected and further obtain the editing types between the video to be detected and the similar videos.

Description

一种视频相似检测的方法、装置及设备 技术领域
本申请涉及人工智能领域,尤其涉及一种视频相似检测的方法、执行该方法的装置及设备。
背景技术
随着移动互联网的快速发展和智能终端的普及,视频的生产、传播和消费在我们的生活中随处可见。基于视频的各种生产、消费和学习的应用和平台层出不穷,比如金融传播、在线教育、短视频、娱乐综艺等。传统的数字视频处理软件(比如会声会影等)、视频转码工具(比如格式工厂等)及当前发展火热的人工智能技术(比如生成对抗网络等)使得视频编辑变得简单快捷。对视频进行编辑有多种编辑类型,包括:裁剪、拼接、旋转、镜像、模糊、添加/去除文字、添加/去除图标、变换色彩/亮度/对比度、尺寸缩放、添加/去除边框、添加滤镜等,经过编辑后的视频可再次进入传播链路。
对已有的视频进行编辑获得多种不同风格的视频,在带来了视频内容的多样性和娱乐性的同时也给视频信息安全带来了更大的挑战。一个或多个已有的视频经过各种类型的编辑操作后得到的视频,怎么与已有的视频进行相似检测得到详细的重叠区域,以及如何识别出采用的编辑类型,对视频相似查询版权认证、广告识别等视频相关业务至关重要。
现有技术中,存在对视频进行相似检测的技术,例如:1、基于待检测的视频与与其相似的视频的模式噪声分布的相关性设定自适应阈值,根据自适应阈值对视频进行相似检测和定位;2、基于哈希算法计算待检测的视频与已有的视频的相似度;3、对视频中固定间隔的帧进行全局特征和局部特征的比对,获得被编辑或篡改的区域的位置。然而,现有技术仅能对一个视频是否是根据一个或多个已有的视频进行编辑获得的进行判断,并不能确定待检测的视频具体是由一个或多个已有的视频进行了何种类型的编辑操作获得的。因此,如何确定与待检测的视频相似的视频,且检测待检测的视频与相似视频之间存在的编辑类型是进行视频相似检测亟待解决的技术问题。
发明内容
本申请提供了一种视频相似检测的方法,该方法可确定与待检测的视频相似的视频,进一步地确定待检测的视频与相似视频之间采用的编辑类型。
第一方面,本申请提供一种视频相似检测的方法,该方法包括:接收第一视频,根据所述第一视频确定所述第一视频的关键帧;输入所述关键帧至特征提取模型,获得所述关键帧的特征;根据所述关键帧的特征确定相似关键帧和第二视频,其中,所述第二视频为所述相似关键帧所在的视频,所述第二视频与所述第一视频相似;输入所述关键帧和所述相似关键帧至编辑类型识别模型,获得编辑类型,其中,所述编辑类型指示所述第一视频与所述第二视频之间进行编辑采用的编辑类型。该方法提供的一种视频相似检测的方法不仅确定了待检测的视频对应的相似视频,还在确定了相似视频的基础上进 一步地获得了待检测的视频与相似视频之间的编辑类型,这使得该方法在用于视频版权认证、广告识别等应用时更具优势。
在第一方面的一种可能实现方式中,所述方法还包括:输出所述第二视频或所述第二视频的信息至显示模块,其中,所述第二视频的信息包括所述第二视频的名称;输出所述编辑类型至所述显示模块。通过显示模块直观地显示与待检测的视频相似的视频和视频之间的编辑类型,使得用户可以直观地获得这些信息。
在第一方面的一种可能实现方式中,所述编辑类型识别模型包括第一特征提取分支、第二特征提取分支和预测器;输入所述关键帧和所述相似关键帧至编辑类型识别模型,获得编辑类型具体包括:输入所述关键帧至所述第一特征提取分支,输入所述相似关键帧至所述第二特征提取分支;所述第一特征提取分支对所述关键帧进行特征提取,输出所述关键帧的编辑特征,所述第二特征提取分支对所述相似关键帧进行特征提取,输出所述相似关键帧的编辑特征;输入所述关键帧的编辑特征和所述相似关键帧的编辑特征至所述预测器,所述预测器输出所述编辑类型。该方法中采用包括特征提取分支和预测器的编辑类型识别模块,使得获得的编辑类型准确率高。
在第一方面的一种可能实现方式中,所述方法还包括:计算所述第一视频与所述第二视频之间的相似度;输出所述相似度至显示模块。该方法提供了相似度这一信息,更加丰富了关于待检测的视频的检测结果,这些结果可被用户或其他模块采用。
在第一方面的一种可能实现方式中,根据所述关键帧的特征确定相似关键帧和第二视频具体包括:根据所述关键帧的特征查询视频库,在所述视频库中获取所述相似关键帧,所述相似关键帧的特征与所述关键帧的特征相似;根据所述相似关键帧确定所述第二视频。这种从确定相似关键帧的角度确定相似视频的方式提高了视频相似检测的准确率。
在第一方面的一种可能实现方式中,所述特征提取模型和所述编辑类型识别模型分别采用不同的神经网络模型。特征提取模型和编辑类型识别模型均采用训练好的神经网络模型使得本申请视频相似检测的效率高,且获得的检测结果的准确率高。
在第一方面的一种可能实现方式中,所述编辑类型包括下述操作中的一种或多种:裁剪、拼接、旋转、镜像、模糊、添加文字、添加图标、变换色彩、变化亮度和变换对比度。
在第一方面的一种可能实现方式中,所述方法还包括:根据所述相似关键帧确定所述相似视频中的相似镜头,其中,所述相似镜头为与所述关键帧所在的镜头相似的镜头;输出所述相似镜头和所述关键帧所在的镜头的对应关系至显示模块。该方法还可以准确输出相似镜头和所述关键帧所在的镜头的对应关系,使得视频相似检测的结果更丰富、便于用户根据检测结果作进一步计划。
在第一方面的一种可能实现方式中,所述编辑类型还可以为所述相似镜头与所述关键帧所在的镜头之间进行编辑采用的编辑类型。
在第一方面的一种可能实现方式中,所述视频与所述相似视频之间的相似度还包括所述视频的镜头与对应的所述相似视频中的相似镜头之间的相似度。
在第一方面的一种可能实现方式中,根据所述视频确定所述视频的关键帧具体包括:根据所述视频的内容对所述视频进行结构分析,获得所述视频的镜头,所述镜头为所述 视频中表述一段背景连续的画面内容的视频帧的集合;在所述镜头中确定所述关键帧,所述关键帧为表示所述镜头的主要画面内容的视频帧。
第二方面,本申请提供一种检测装置,包括:结构分析模块,用于接收第一视频,根据所述第一视频确定所述第一视频的关键帧;特征提取模型,用于根据所述关键帧获得所述关键帧的特征;对比分析模块,用于根据所述关键帧的特征确定相似关键帧和第二视频,其中,所述第二视频为所述相似关键帧所在的视频,所述第二视频与所述第一视频相似;编辑类型识别模型,用于根据所述关键帧和所述相似关键帧获得编辑类型,其中,所述编辑类型指示所述第一视频与所述第二视频之间进行编辑采用的编辑类型。
在第二方面的一种可能实现方式中,所述检测装置还包括:输出模块,用于输出所述第二视频或所述第二视频的信息至显示模块,其中,所述第二视频的信息包括所述第二视频的名称;还用于输出所述编辑类型至所述显示模块。
在第二方面的一种可能实现方式中,所述编辑类型识别模型包括第一特征提取分支、第二特征提取分支和预测器;所述第一特征提取分支用于接收所述关键帧,对所述关键帧进行特征提取,输出所述关键帧的编辑特征;所述第二特征提取分支用于接收所述相似关键帧,对所述相似关键帧进行特征提取,输出所述相似关键帧的编辑特征;所述预测器用于根据所述关键帧的编辑特征和所述相似关键帧的编辑特征获得所述编辑类型。
在第二方面的一种可能实现方式中,所述对比分析模块还用于计算所述第一视频与所述第二视频之间的相似度;所述输出模块还用于输出所述相似度至显示模块。
在第二方面的一种可能实现方式中,所述结构分析模块具体用于:根据所述关键帧的特征查询视频库,在所述视频库中获取所述相似关键帧,所述相似关键帧的特征与所述关键帧的特征相似;根据所述相似关键帧确定所述第二视频。
在第二方面的一种可能实现方式中,所述特征提取模型和所述编辑类型识别模型分别采用不同的神经网络模型。
在第二方面的一种可能实现方式中,所述编辑类型包括下述操作中的一种或多种:裁剪、拼接、旋转、镜像、模糊、添加文字、添加图标、变换色彩、变化亮度和变换对比度。
在第二方面的一种可能实现方式中,所述对比分析模块还用于根据所述相似关键帧确定所述相似视频中的相似镜头,其中,所述相似镜头为与所述关键帧所在的镜头相似的镜头;所述输出模块还用于输出所述相似镜头和所述关键帧所在的镜头的对应关系至显示模块。
在第二方面的一种可能实现方式中,所述编辑类型还可以为所述相似镜头与所述关键帧所在的镜头之间进行编辑采用的编辑类型。
在第二方面的一种可能实现方式中,所述视频与所述相似视频之间的相似度还包括所述视频的镜头与对应的所述相似视频中的相似镜头之间的相似度。
在第二方面的一种可能实现方式中,所述结构分析模块具体用于:根据所述视频的内容对所述视频进行结构分析,获得所述视频的镜头,所述镜头为所述视频中表述一段背景连续的画面内容的视频帧的集合;在所述镜头中确定所述关键帧,所述关键帧为表示所述镜头的主要画面内容的视频帧。
第三方面,本申请提供一种计算设备系统,包括至少一台计算设备,每台计算设备包括存储器和处理器,所述至少一台计算设备的存储器,用于存储计算机指令;所述至少一台计算设备的处理器执行所述存储器存储的计算机指令,以执行第一方面或第一方面的任意一种可能的实现方式提供的方法。
第四方面,本申请提供一种非瞬态的可读存储介质,所述非瞬态的可读存储介质被计算设备执行时,所述计算设备执行前述第一方面或第一方面的任意一种可能的实现方式中提供的方法。该存储介质中存储了程序。该存储介质包括但不限于易失性存储器,例如随机访问存储器,非易失性存储器,例如快闪存储器、硬盘(英文:hard disk drive,缩写:HDD)、固态硬盘(英文:solid state drive,缩写:SSD)。
第五方面,本申请提供一种计算机程序产品,所述计算机程序产品包括计算机指令,在被计算设备执行时,所述计算设备执行前述第一方面或第一方面的任意可能的实现方式中提供的方法。该计算机程序产品可以为一个软件安装包,在需要使用前述第一方面或第一方面的任意可能的实现方式中提供的方法的情况下,可以下载该计算机程序产品并在计算设备上执行该计算机程序产品。
附图说明
为了更清楚地说明本申请实施例的技术方法,下面将对实施例中所需使用的附图作以简单地介绍。
图1为本申请实施例提供的视频、视频片段、镜头和关键帧之间的关系示意图;
图2为本申请实施例提供的一种检测装置的部署示意图;
图3为本申请实施例提供的另一种检测装置的部署示意图;
图4为本申请实施例提供的一种部署有检测装置的计算设备100的结构示意图;
图5为本申请实施例提供的一种训练装置200和一种检测装置300的结构示意图;
图6为本申请实施例提供的一种视频相似检测的方法的流程示意图;
图7为本申请实施例提供的一种特征提取模型的结构示意图;
图8为本申请实施例提供的一种编辑类型识别模型的结构示意图;
图9为本申请实施例提供的一种以文本的形式显示检测装置输出的信息的示意图;
图10为本申请实施例提供的一种以可视化界面的形式显示检测装置输出的信息的示意图;
图11为本申请实施例提供的一种确定镜头和关键帧的方法的流程示意图;
图12为本申请实施例提供的一种计算设备系统的示意图。
具体实施方式
下面将结合本申请中的附图,对本申请提供的实施例中的方案进行描述。
视频是对现实世界中连续发生的画面进行存储的一种电信号。图1为视频、视频片段、镜头、关键帧之间的关系示意图,一个视频根据其画面的内容可分为多个视频片段,每个视频片段记录的是一段相对完整情节的视频内容。一个视频片段又可分为多个镜头,每个镜头中的视频内容是摄像机单次拍摄的、背景连续的画面。一个镜头中包含一个或多个视频帧,每一个视频帧是一幅独立的图像。在一个镜头中,能够描述当前镜头的主 要内容的视频帧称为该镜头的关键帧,一个镜头可以有一个或多个关键帧。可采用多种方法确定一个镜头中的关键帧,使得一个镜头的内容可由关键帧中的内容来表征。由图1可知,一个视频可被多级划分,其最小单位是视频帧。因此,换而言之,一个或多个视频帧(包括关键帧)形成一个镜头,不同场景的镜头形成视频片段,一个或多个视频片段形成一个完整的视频。一个视频与其对应的视频片段、镜头、关键帧的关系称为一个视频的多级结构,可根据一个视频的内容对视频进行多级结构分析。
本申请提供一种视频相似检测的方法,该方法基于视频的多级结构对视频进行分析和检测,获得待检测的视频与视频库中的视频是否相似的结果。对于该视频与其他的一个或多个视频相似的情况,还可以进一步获得该视频与相似视频相似的位置信息和该视频的镜头相对于相似视频的编辑类型,例如:裁剪、拼接、旋转、镜像、模糊、添加文字、添加图标、变换色彩、变化亮度和变换对比度等。
值得注意的是,在本申请实施例中,两个视频相似是指两个视频之间包括一个或多个相似的关键帧,即:两个视频中的其中一个视频包含的一个或多个关键帧是由另一个视频包含的一个或多个关键帧经过一种或多种类型的编辑手段编辑获得的。例如:视频A中的第1个视频片段中的所有关键帧是由视频B中的关键帧经过添加字幕获得的,视频A中的第2-N(N是大于或等于1的正整数)个视频片段中的关键帧是由视频C中的部分关键帧经过去除图标获得的,则认为视频B和视频C是视频A的相似视频,即:视频A与视频B相似,视频A与视频C也相似。
本申请实施例提供一种视频相似检测的方法,该方法由检测装置执行。检测装置的部署较为灵活。
图2是本申请实施例提供的一种检测装置的部署示意图,检测装置可部署在云环境中,云环境是云计算模式下利用基础资源向用户提供云服务的实体。云环境包括云数据中心和云服务平台,所述云数据中心包括云服务提供商拥有的大量基础资源(包括计算资源、存储资源和网络资源),云数据中心包括的计算资源可以是大量的计算设备(例如服务器)。检测装置可以是云数据中心中用于对视频进行检测的服务器;检测装置也可以是创建在云数据中心中的用于对视频进行检测的虚拟机;检测装置还可以是部署在云数据中心中的服务器或者虚拟机上的软件装置,该软件装置用于对视频进行检测,该软件装置可以分布式地部署在多个服务器上、或者分布式地部署在多个虚拟机上、或者分布式地部署在虚拟机和服务器上。如图2所示,检测装置由云服务提供商在云服务平台抽象成一种视频相似检测的云服务提供给用户,用户在云服务平台购买该云服务后,云环境利用检测装置向用户提供视频相似检测云服务,用户可以通过应用程序接口(application program interface,API)或者通过云服务平台提供的网页界面上传待检测的视频至云环境,由检测装置接收待检测的视频,对待检测的视频进行检测,检测结果由检测装置返回至用户所在的终端,或者检测结果存储在云环境,例如:呈现在云服务平台的网页界面上供用户查看。
当检测装置为软件装置时,检测装置可以在逻辑上分成多个部分,每个部分具有不同的功能(不同部分例如:检测装置包括结构分析模块、特征提取模型、对比分析模块、编辑类型识别模型、输出模块)。检测装置的几个部分可以分别部署在不同的环境或设备中,例如:如图3所示,检测装置中的一部分部署在终端计算设备(如:终端服务器、 智能手机、笔记本电脑、平板电脑、个人台式电脑、智能摄相机),另一部分部署在数据中心(具体部署在数据中心中的服务器或虚拟机上),数据中心可以是云数据中心,数据中心也可以是边缘数据中心,边缘数据中心是部署在距离终端计算设备较近的边缘计算设备的集合。
部署在不同环境或设备的检测装置的各个部分之间协同实现视频编辑类型检测的功能,例如,在一种场景下,智能手机中部署有检测装置中的结构分析模块,智能手机获取一段视频,利用结构分析模块对该视频进行结构分析,智能手机通过网络将结构分析后的数据发送至数据中心,数据中心上部署有特征提取模型、对比分析模块、编辑类型识别模型、输出模块,这些模块/模型进一步地对结构分析后的数据进行处理,最终获得检测结果,数据中心将检测结果发送至智能手机,由此,使用智能手机的用户可获得视频的检测结果。应理解,本申请不对检测装置的哪些部分部署在终端计算设备和哪些部分部署在数据中心进行限制性的划分,实际应用时可根据终端计算设备的计算能力或具体应用需求进行适应性的部署。值得注意的是,在一种实施例中,检测装置还可以分三部分部署,其中,一部分部署在终端计算设备,一部分部署在边缘数据中心,一部分部署在云数据中心。
当检测装置为软件装置时,检测装置也可以单独部署在任意环境的一个计算设备上(例如:单独部署在一个终端计算设备上或者单独部署在数据中心中的一个计算设备上),如图4所示,计算设备100包括总线101、处理器102、通信接口103和存储器104。处理器102、存储器104和通信接口103之间通过总线101通信。其中,处理器102可以为中央处理器(英文:central processing unit,缩写:CPU)。存储器104可以包括易失性存储器(英文:volatile memory),例如随机存取存储器(英文:random access memory,缩写:RAM)。存储器104还可以包括非易失性存储器(英文:non-volatile memory,缩写:NVM),例如只读存储器(英文:read-only memory,缩写:ROM),快闪存储器,HDD或SSD。存储器104中存储有检测装置所包括的可执行代码,处理器102读取存储器104中的该可执行代码以执行视频相似检测的方法。存储器104中还可以包括操作系统等其他运行进程所需的软件模块。操作系统可以为LINUX TM,UNIX TM,WINDOWS TM等。
检测装置在执行本申请实施例提供的视频相似检测的方法时,需要采用神经网络模型,神经网络模型是一类模仿生物神经网络(动物的中枢神经系统)的结构和功能的数学计算模型,一个神经网络模型可以包括多中不同功能的神经网络层,每层包括参数和计算公式。根据计算公式的不同或功能的不同,神经网络模型中不同的层有不同的名称,例如:进行卷积计算的层称为卷积层,所述卷积层常用于对输入信号(例如:图像)进行特征提取。一个神经网络模型也可以由多个已有的神经网络模型组合构成。不同结构的神经网络模型可用于不同的场景(例如:分类、识别)或在用于同一场景时提供不同的效果,神经网络模型结构不同具体包括以下一项或多项:神经网络模型中网络层的层数不同、各个网络层的顺序不同、每个网络层中的权重、参数或计算公式不同。业界已存在多种不同的用于识别或分类等应用场景的具有较高准确率的 神经网络模型,其中,一些神经网络模型可以被特定的训练集进行训练后单独完成一项任务或与其他神经网络模型(或其他功能模块)组合完成一项任务。一些神经网络模型也可以被直接用于单独完成一项任务或与其他神经网络模型(或其他功能模块)组合完成一项任务。
在本申请的一个实施例中,执行视频相似检测的方法需要用到两种不同的神经网络模型,一种是用于对待检测的视频进行特征提取的神经网络模型,称为特征提取模型;另一种是用于对两个相似的视频之间的编辑类型进行识别的模型,称为编辑类型识别模型。特征提取模型和编辑类型识别模型在被用于进行视频的编辑类型检测之前可由训练装置进行训练,训练装置分别采用不同的训练集对特征提取模型和编辑类型识别模型进行训练,经训练装置训练完成的特征提取模型和编辑类型识别模型被部署于检测装置,由检测装置用于进行视频的编辑类型检测。
图5提供了一种训练装置200和检测装置300的结构示意图。下面结合图5对训练装置200和检测装置300的结构和功能进行介绍,应理解,本申请实施例仅是对训练装置200和检测装置300的结构和功能模块进行的示例性划分,本申请并不对其具体划分做任何限定。
训练装置200用于对特征提取模型203和编辑类型识别模型204分别进行训练,对特征提取模型203和编辑类型识别模型204进行训练需要两个不同的训练集,分别称为特征提取训练集和编辑类型识别训练集。获得的特征提取训练集和编辑类型识别训练集被保存在数据库中。可由采集装置采集多个训练视频或训练图像,采集到的多个训练视频或训练图像由人工或采集装置进行处理和标注后构成一个训练集。当采集装置采集的是多个训练视频时,采集装置对每个训练视频进行镜头划分,在划分的镜头中确定关键帧,将确定的关键帧作为训练图像,进而对训练图像进行处理和标注构建训练集。训练装置200在启动对特征提取模型203进行训练时,初始化模块201首先对特征提取模型203中的每层的参数进行初始化(即,为每个参数赋予一个初始值),进而训练模块202读取数据库中的特征提取训练集中的训练图像对特征提取模型203进行训练,直到特征提取模型203中的损失函数收敛或者特征提取训练集中所有的训练图像被用于训练,则特征提取模型203训练完成。同理,训练装置200在启动对编辑类型识别模型204进行训练时,初始化模块201首先对编辑类型识别模型204中的每层的参数进行初始化(即,为每个参数赋予一个初始值),进而训练模块202读取数据库中的编辑类型识别训练集中的训练图像对编辑类型识别模型204进行训练,直到编辑类型识别模型204中的损失函数收敛或者编辑类型识别训练集中所有的训练图像被用于训练,则编辑类型识别模型204训练完成。值得注意的是,特征提取模型203和编辑类型识别模型204也可由两个训练装置分别进行训练,特征提取模型203和/或编辑类型识别模型204还可以不需要由训练装置200进行训练,例如:特征提取模型203和/或编辑类型识别模型204采 用的是第三方已训练好的,且对特征提取和/或类型识别具有较好精确度的神经网络模型。在本申请的一个实施例中,也可不需要采集装置采集训练图像或训练视频以及构建特征提取训练集和/或编辑类型识别训练集,例如:特征提取训练集和/或编辑类型识别训练集从第三方直接获得。
值得注意的是,在本申请实施例中,可以采用Alexnet、Resnet、Mobilenet、Densenet等神经网络模型中的任意一个的基础特征提取部分作为特征提取模型203,用于对特征提取模型203进行训练的损失函数可以是tripletloss函数。特征提取训练集包括两种类型的图像组,一种被标注为相似图像组,即相似图像组的标签被设置为相似,另一种被标注为不相似图像组,即不相似图像组的标签被设置为不相似,一个特征提取训练集中包括多个相似图像组和多个不相似图像组。相似图像组中包括两个图像,其中一个为原始图像,另一个为由原始图像经过一种或多种编辑类型进行编辑获得的生成图像(例如:原始图像与由原始图像旋转和裁剪后的生成图像构成一个相似图像组)。由于原始图像与生成图像是经过某种或多种编辑类型变化获得的,因此原始图像与生成图像存在某些相似的特征。不相似图像组也包括两个(或更多个)图像,两个图像之间不存在编辑前与编辑后的关系。特征提取训练集被用于对特征提取模型进行训练,相似图像组被当做正样本用于特征提取模型的训练,使特征提取模型学习到相似图像组中的两个图像相似的特征;不相似图像组被当做负样本用于特征提取模型的训练,使特征提取模型区分相似图像组与不相似图像组的能力更强。经过特征提取训练集进行训练的特征提取模型可较精确地提取出被检测的关键帧中的特征。
值得注意的是,在本申请实施例中,编辑类型识别训练集也包括多个图像组,每个图像组由一个原始图像和这个原始图像对应的生成图像构成。原始图像对应的生成图像为原始图像经过一种或多种类型的编辑操作获得的,每个图像组都被设置有一个或多个标签,每个图像组的标签为这个图像组中生成图像的编辑类型。例如:一个图像组包括一个原始图像和一个由原始图像经过旋转和裁剪而获得的生成图像,则该图像组的标签有两个,分别为旋转和裁剪(或与旋转和裁剪对应的其他表示方式的标签)。应理解,编辑类型识别训练集中的图像组中的图像可以和特征提取训练集中的相似图像组中的图像相同,即:获得的训练图像可以由两个训练集复用,但是编辑类型识别训练集中的图像组与特征提取训练集中的相似图像组的标签不相同。编辑类型识别训练集中包括多类标签的图像组,每类标签的图像组也有多个,每类标签的图像组可以是单标签图像组(例如:标签为旋转的多个图像组,则这多个图像组中的生成图像均是由原始图像旋转获得的),也可以是多标签图像组(例如:有旋转、裁剪和添加图标三个标签的多个图像组,则这多个图像组中的生成图像均是由原始图像经过旋转、裁剪和添加图标获得的)。编辑类型识别模型204在被训练时可采用多标签交叉熵作为损失函数。
经过训练装置200训练完成的特征提取模型203和编辑类型识别模型204可分别被用于进行特征提取和视频/图像的编辑类型识别。在本申请的一个实施例中,如图5所示,训练完成的特征提取模型203和编辑类型识别模型204被部署至检测装置300,在检测装置300中,训练完成的特征提取模型203被称为特征提取模型302,训练完成的编辑类型识别模型204被称为编辑类型识别模型304。
如图5所示,检测装置300包括结构分析模块301、特征提取模型302、对比分析模块303、编辑类型识别模型304、输出模块305。
结构分析模块301用于接收一段待检测的视频(可以是一个完整的视频文件,也可以是一段视频片段,例如:实时获取的视频流),对该视频进行结构分析,将一段视频分解成一个或多个镜头(或将一段视频分解成一个或多个视频片段,再将每个视频片段分解成一个或多个镜头)、且结构分析模块301还用于在每个镜头中确定可以表征该镜头内容的一个或多个关键帧。结构分析模块301输出视频结构对应的一条结构数据,结构数据表示视频的每个镜头在视频中的位置和每个镜头的关键帧所在的位置。
特征提取模型302与结构分析模块301通过通信通路连接,用于根据结构数据读取视频中每个镜头的关键帧,对每个关键帧进行特征提取,输出每个关键帧的特征。
对比分析模块303与特征提取模型302通过通信通路连接,用于根据每个关键帧对应的特征查询视频库,在视频库中获取一个或多个与待检测的视频相似的视频。对比分析模块303还用于确定相似视频与待检测的视频的相似度以及相似视频中与待检测的视频相似的关键帧或镜头的对应关系和相似度。
编辑类型识别模型304与对比分析模块303连接,编辑类型识别模型304根据对比分析模块303获取的相似视频中相似关键帧,将每个相似关键帧与被检测的视频中对应的关键帧一同输入至编辑类型识别模型304,根据编辑类型识别模型304获得相似视频中的相似关键帧与待检测的视频中对应的关键帧之间的编辑类型。
输出模块305分别与对比分析模块303和编辑类型识别模型304通过通信通路连接,输出模型305输出对比分析模块303获得的一个或多个与待检测的视频相似的视频及相似视频中的相似关键帧与待检测的视频中对应的关键帧之间的编辑类型。可选的,输出模块305输出相似视频与待检测的视频的相似度以及相似视频中与待检测的视频相似的关键帧或镜头的位置信息。
训练装置200和检测装置300均可以是软件装置,当训练装置200和检测装置300都为软件装置时,训练装置200可与检测装置300部署在同一台计算设备上(例如:部署在同一台服务器上、或者部署在同一台服务器中的两个不同的虚拟机上)、训练装置200也可以与检测装置300部署在不同的计算设备上(例如:训练装置200部署在云环境中的一个或多个服务器上,检测装置300部署在边缘环境中的一个或多个服务器上)。训练装置200的部署也较为灵活,与前述描述的检测装置的部署 方式一样,其可以整个部署在同一计算设备上,也可以各部分分别部署在不同的计算设备上,不同的计算设备协同运行训练装置200中的各部分以实现训练装置200的全部功能。
下面结合图6具体描述本申请实施例提供的一种视频相似检测的方法。
S401:接收待检测的视频,根据待检测的视频的内容确定镜头和关键帧。
具体地,检测装置300获取待进行编辑类型检测的一段视频(例如:检测装置300接收用户或管理员上传的一段待检测的视频,或者检测装置300实时接收另一设备拍摄的一段视频),根据待检测的视频的内容对待检测的视频进行结构分析,确定镜头和每个镜头内的关键帧。一段待检测的视频可以包括多个镜头,镜头是摄像机单次拍摄的、背景连续的画面,一个镜头通常是一个场景,在一个镜头内的画面内容可由关键帧表示。
在该步骤中确定镜头的方法可采用滑动窗口的方式,基于待检测的视频中的前后时刻的视频帧之间的灰度直方图差值确定镜头的边界。以镜头的边界对待检测的视频进行镜头分割,在分割后的每个镜头内根据镜头内的视频帧画面内容进行关键帧选取。在该步骤中,经过对待检测的视频进行结构分析获得该视频对应的一条结构数据,结构数据可被表示为{[s 1,e 1,k 10,k 11,…],[s 2,e 2,k 20,k 21,…],…,[s n,e n,k n0,k n1,…]},其中结构数据中的[s 1,e 1,k 10,k 11,…]表示一个镜头,s 1为该镜头的起始视频帧在整个视频中的帧序,e 1为该镜头的结束视频帧在整个视频中的帧序,k 10为该镜头内的一个关键帧相对于起始视频帧的偏移帧数。
应理解,本申请中不限定根据待检测的视频内容确定镜头和关键帧的具体实现方式,对于不同的待检测的视频可采用不同的方法确定镜头和关键帧,在后续将介绍一种根据待检测的视频内容确定镜头和关键帧的具体方案。本步骤可以获得待检测的视频中的所有关键帧。
需要说明的是,本步骤中先确定待检测的视频所包含的镜头,然后再进一步确定镜头中的关键帧。可选的,在其他实施例中也可以不确定镜头而直接获得关键帧。
S402:特征提取模型对关键帧进行特征提取。
具体地,前述步骤S401获得的每个镜头的关键帧为二维图像,如图7所示,将每个关键帧输入至已训练完成的特征提取模型,特征提取模型输出每个关键帧的特征,关键帧的特征可以是一个多维矩阵,关键帧的特征表示该关键帧的画面内容隐含的特点。本申请不对特征提取模型采用的神经网络模型的结构进行具体的限制,特征提取模型可以使用业界通用的用于图像分类的神经网络模型或者用于图像识别的神经网络模型的主干部分,也可以是经过改进后的一个神经网络模型。图7所示的特征提取模型采用一个卷积神经网络模型,卷积神经网络模型包括多个卷积层,每个卷积层包括一个或多个卷积 核,每个卷积核包括多个参数。每个卷积核的大小可以相同也可以不同(例如:特征提取模型的第一个卷积层中可以有16个大小均为7*7的卷积核),关键帧(或张量)输入至一个卷积层中与该卷积层中的各卷积核进行卷积操作后,该卷积层输出一个张量,卷积层输出的张量是一个三维的数组,包括多个数值,例如:尺度为W*H*L的张量(其中,W表示张量的宽度,H表示张量的高度,L表示张量的通道数,W、H和L均为大于0的自然数)包括W*H*L个数值,卷积层中包括的卷积核的个数决定了该卷积层输出的张量的通道数,例如:尺度为W*H*L的张量(其中,W表示张量的宽度,H表示张量的高度,L表示张量的通道数,W、H和L均为大于0的自然数),输入至包含J个尺寸为1*1卷积核的卷积层后,与卷积层中J个1*1的卷积核进行卷积,该卷积层输出的张量尺度为W*H*J(J为大于0的自然数)。不同卷积层中的卷积核的大小和个数可以相同也可以不相同,每个卷积层输出的张量的尺度由输入至该卷积层的关键帧(或张量)和该卷积层中的卷积核的大小和个数以及卷积计算的方式共同决定。关键帧输入至特征提取网络后,将最后一个卷积层输出的张量作为该关键帧的特征,由特征提取模型输出。
值得注意的是,步骤S401获得的待检测的视频中所有的关键帧都进行步骤S402的操作,因此,步骤S402获得的是待检测的视频中所有关键帧的特征。
S403:根据关键帧的特征确定与待检测的视频相似的视频。
具体地,由步骤S402获得了待检测的视频中所有关键帧的特征,将每个关键帧的特征与视频库中所有视频的关键帧的特征进行比对,确定与待检测的视频中的关键帧相似的相似关键帧,确定所述相似关键帧所属的视频,其中,所述相似关键帧所属的视频称为相似视频,相似视频与待检测的视频相似。进一步确定所述待检测的视频与相似视频的相似度以及确定所述相似关键帧所在的镜头与对应的关键帧所在的镜头的对应关系和位置。
值得注意的是,在步骤S403中所用到的视频库为预先整理和计算好的视频库,视频库中包括多个视频,对每个视频执行与前述步骤S401相同的视频结构分析操作以确定每个视频中的镜头和关键帧,即:视频库中的每个视频对应一条结构数据,该结构数据指示了视频中的每个镜头的起止帧以及每个镜头内的关键帧。本申请还对视频库中的每个关键帧执行与前述步骤S402相同的方法,即:还对视频库中每个视频的关键帧进行特征提取,获得每个关键帧的特征。因此,本申请中的视频库存储了多个视频、每个视频对应的结构数据以及每个视频中的关键帧的特征。
本申请不限定视频库中视频的来源,创建视频库时可根据本申请提供的视频相似检测的方法的具体应用场景进行适应性地收集视频,例如:对于影视作品保护部门用视频相似检测的方法甄别盗版视频,则视频库中的视频可为现有的所能收集到的原创影视作品。视频库越丰富则获得与待检测的视频相似度高的相似视频的概率越大。值得注意的是,对视频库中的视频进行视频结构分析以确定视频的镜头和关键 帧的操作可在执行步骤S403之前的任意时间执行。
步骤S403的具体流程描述如下:
S4031:根据获取的待检测的视频中每个关键帧的特征确定视频库中的相似关键帧。
将待检测的视频中的每个关键帧的特征与视频库中所有视频的关键帧的特征进行比对,比对方法可采用逐个计算相似度,将相似度大于预设定的阈值的视频库中的视频的关键帧确定为被比对的关键帧的相似关键帧,本申请不限定相似度计算的具体方式。若视频库中不存在与待检测的视频中的任何一个关键帧相似的相似关键帧,则结束对视频进行编辑类型的检测。若视频库中存在与待检测的视频中的关键帧相似的相似关键帧,则进行后续步骤。应理解,待检测的视频的每个关键帧在视频库中可能存在一个或多个相似关键帧。
S4032:根据相似关键帧确定与待检测的视频相似的相似视频。
具体地,在一种实施方式中,可采用图搜索方法确定与待检测的视频相似的相似视频:
将待检测的视频中的关键帧按时序进行排列,将每个关键帧与视频库中与该关键帧相似的所有相似关键帧进行对应,待检测的视频中的所有关键帧和与关键帧对应的相似关键帧可以构成一个图,将待检测的视频中的关键帧及其对应的相似关键帧看作是图中的节点,按照关键帧的时序以图中每个关键帧对应的相似关键帧为节点构建每个相似关键帧的路径,每条路径上包括多个节点和连接节点的边。在确定路径时,根据待检测的视频的关键帧的时序依次进行节点确定,对每个关键帧而言,把寻找到的与该关键帧对应的一个相似关键帧作为一条路径上的一个节点,确定的相似关键帧与该路径上已有的相似关键帧属于同一个视频。因此获得的多条路径中的每一条路径上的节点满足条件:同一条路径上的相似关键帧在同一个视频中(若某一个关键帧对应的一个或多个相似关键帧不与任一路径上已有的相似关键帧属于同一视频,则跳过该关键帧),由此,每一条路径对应一个视频,该视频称为相似视频。根据每一条路径上的相似关键帧和视频库中存储的这条路径对应的视频的结构数据,确定每个相似关键帧在相似视频中的镜头,这个镜头称为与对应的关键帧所在的镜头的相似镜头,相似镜头与待检测的视频中对应的关键帧所在的镜头称为一个相似镜头对。可选的,可以将一个相似镜头对中:一个镜头包含的关键帧与另一个镜头包含的相似关键帧之间的相似度确定为该相似镜头对的相似度。
计算每个相似视频与待检测的视频的相似度,一个相似视频与待检测的视频的相似度可采用该相似视频中的镜头与待检测的视频的镜头构成的每个相似镜头对之间的相似度加权平均获得,也可以用相似视频中相似关键帧所在的镜头的时长之和占视频的总时长的比例作为该相似视频与待检测的视频的相似度。
S404,将待检测的视频中的关键帧和对应的相似视频中的相似关键帧输入至编辑类型识别模型,由编辑类型识别模型输出编辑类型。
具体地,经过步骤S403获得了与待检测的视频相似的一个或多个相似视频,且获得了相似视频中的相似关键帧。将每个相似视频中的相似关键帧和与其对应的待检测的视频中的关键帧组成一个关键帧组,则每个相似视频对应有一个或多个关键帧组。将每个关键帧组输入至编辑类型识别模型,编辑类型识别模型经过对关键帧组中的相似关键帧和关键帧之间进行编辑特征提取和预测,输出该关键帧组中的关键帧与相似关键帧之间存在的一种或多种编辑类型,关键帧与相似关键帧之间存在的一种或多种编辑类型表示该关键帧和相似关键帧之间进行编辑采用的一种或多种编辑类型,通过这一种或多种编辑类型的编辑可实现关键帧和相似关键帧之间的转换。每个相似视频中的每个关键帧组依次经过编辑类型识别模型进行编辑特征提取和预测,获得了每个相似视频中的相似关键帧与待检测的视频中对应的关键帧之间的编辑类型。值得注意的是,由于关键帧和相似关键帧分别是表示待检测的视频中的镜头的内容和相似视频中的相似镜头的内容的视频帧,因此,可用关键帧与相似关键帧之间存在的一种或多种编辑类型表示该关键帧所在的镜头与对应的相似视频中的相似镜头之间存在的一种或多种编辑类型。
编辑类型识别模型采用预先训练好的一种神经网络模型,图8为一种示例性的编辑类型识别模型。图8所示的编辑类型识别模型包括两个特征提取分支,称为第一特征提取分支和第二特征提取分支,以及一个预测器,两个特征提取分支的输出为该预测器的输入,预测器用于输出预测的一种或多种编辑类型。编辑类型识别模型中的两个特征提取分支采用相同的多个卷积层构成(卷积层数相同、卷积层的参数也相同),关键帧组中的关键帧和相似关键帧分别输入第一特征提取分支和第二特征提取分支,第一特征提取分支对关键帧进行卷积计算,第一特征提取分支的最后一层卷积层输出关键帧的编辑特征,第二特征提取分支对相似关键帧进行卷积计算,第二特征提取分支的最后一层卷积层输出相似关键帧的编辑特征,相似关键帧的编辑特征和关键帧的编辑特征共同作为预测器的输入,由预测器进行计算和预测,输出关键帧和相似关键帧之间存在的编辑类型。根据关键帧和相似关键帧之间的关系,预测器可输出一种或多种编辑类型。
值得注意的是,编辑类型识别模型的预测器输出的一种或多种编辑类型即表示输入的关键帧和相似关键帧之间进行编辑采用的一种或多种编辑类型,也表示该关键帧所在的镜头和该相似关键帧所在的相似镜头之间进行编辑采用的一种或多种编辑类型。若一个相似视频对应的全部关键帧组经过编辑类型识别模型获得的都是同一种或多种编辑类型,则该相似视频与待检测的视频之间的编辑类型即为该一种或多种编辑类型。若一个相似视频对应的多个关键帧组经过编辑类型识别模型获得的编辑类型的种类不同,则该相似视频相对于待检测的视频之间的编辑类型为所有不同 种类的编辑类型。例如:相似视频和待检测的视频之间存在3个关键帧组,即该相似视频中存在3个与待检测的视频的关键帧相似的相似关键帧,3个关键帧组分别经过编辑类型识别模型,获得3个输出,分别是(旋转,裁剪)、(添加图标)、(镜像,裁剪),则该相似视频和待检测的视频之间存在的编辑类型为(旋转,裁剪、添加图标、镜像)。
值得注意的是,待检测的视频与相似视频之间进行编辑采用的一种或多种编辑类型可以有三种情况:1、待检测的视频可以由相似视频经过这一种或多种编辑类型的编辑操作获得;2、待检测的视频可以由相似视频经过这一种或多种编辑类型对应的相反的编辑操作获得,即待检测的视频经过这一种或多种编辑类型的编辑操作可得到相似视频;3、当获得的编辑类型包括多个时,待检测的视频可以由相似视频经过其中一个或多个编辑类型的编辑操作以及其中另一个或多个编辑类型的相反的编辑操作获得。
本申请不限定编辑类型的具体分类和名称。在一种编辑类型的划分方式中,对于一些互为反操作的编辑操作,例如:添加或去除图标,添加或去除滤镜,可以将互为反操作的两个编辑操作划分为两种编辑类型。例如:添加或去除图标被分别设置为添加图标编辑类型,以及去除图标编辑类型。对于编辑类型为这种划分方式的情况,编辑类型识别模型可以输出(1)由相似视频获得待检测的视频采用的编辑类型,也可以输出(2)由待检测的视频获得相似视频采用的编辑类型,还可以输出(3)这两个互为相反操作的编辑类型。对于一些不存在互为相反操作的编辑操作,则只有一种编辑类型,编辑类型识别模型输出这种编辑类型,表示相似视频经过这种编辑类型的编辑获得待检测的视频。
下面进行一些举例以方便理解:
假设第一视频拥有第一关键帧、第三关键帧,第二视频拥有第二关键帧、第四关键帧。如果第一关键帧和第二关键帧的区别在于:第二关键帧比第一关键帧多了一个图标。那么,可以认为是:第一视频是源视频,第二视频是第一视频(源视频)经过编辑操作后生成的视频,所述第一视频与所述第二视频之间进行编辑采用的编辑类型是“添加图标”。也可以认为是:第二视频是源视频,第一视频是在第二视频(源视频)经过编辑操作后生成的视频,因此所述第一视频与所述第二视频之间进行编辑采用的编辑类型是“删除图标”。换句话说,由于“添加图标”和“删除图标”操作是相反操作,因此检测结果可能有两个,而且在这两个检测结果中,“源视频”的角色刚好相反。在实际应用中,可以输出其中一种检测结果给用户,也可以两个结果都输出给用户。除了添加图标(或者删除图标)这种编辑操作之外,添加文字、添加滤镜等编辑操作都存在这样的情况。
再举另外一个例子,假设第三关键帧和第四关键帧的区别在于:对第三关键帧进行“马赛克”操作后可以得到第四关键帧。和上面例子不同的是,“马赛克”操 作不存在相反操作,因此本例中编辑类型是唯一的。因此本例中:第一视频是源视频,第二视频是第一视频(源视频)经过编辑操作后生成的视频,所述第一视频与所述第二视频之间进行编辑采用的编辑类型是“马赛克”,编辑类型识别模型输出“马赛克”这种编辑类型即可。
在另一种编辑类型的划分方式中,也可以将互为反操作的两个编辑操作合称为一种编辑类型,该种编辑类型的名称可以仅用互为反操作的其中一个编辑操作的名称表示,或者编辑类型的名称可体现互为反操作的两个编辑操作。对于编辑类型为这种划分方式的情况,编辑类型识别模型输出这种编辑类型则表示待检测的视频由这种编辑类型对应的正编辑操作获得,或者表示待检测的视频由这种编辑类型对应的相反的编辑操作获得。例如:将添加图标或者删除图标这两个互为相反的编辑操作合称为一种编辑类型,这一种编辑类型的名称就为“添加图标”。将第一视频的关键帧和第二视频的关键帧输入至编辑类型识别模型,若编辑类型识别模型输出的编辑类型为“添加图标”,则第一视频和第二视频之间进行编辑采用的编辑类型为“添加图标”,第一视频和第二视频之间具体的关系有两种可能:可以是第一视频经过添加图标后获得第二视频,或者是第一视频经过删除图标后获得第二视频。
S405:输出相似视频和编辑类型。
经过前述步骤S403获得了一个或多个相似视频,经过步骤S404获得了每个相似视频与待检测的视频之间存在的编辑类型,可将每个相似视频或者每个相似视频的信息(例如:相似视频的名称),以及每个相似视频与待检测的视频之间存在的编辑类型输出至显示模块,显示模块可以是检测装置中的一个模块,也可以是检测装置以外的其他装置或设备的模块,显示装置可通过可视化界面或者文本的方式显示每个相似视频或者每个相似视频的信息以及每个相似视频对应的编辑类型。
可选的,还可以输出每个相似视频和待检测的视频之间的相似度至显示模块。
可选的,还可以输出每个相似视频中的相似镜头和待检测的视频中对应的镜头之间的对应关系至显示模块,显示模块可通过多种形式显示相似视频和待检测的视频之间的该对应关系。
可选的,还可以输出待检测的视频的镜头与对应的相似视频中的相似镜头之间的相似度至显示模块。
上述相似视频或相似视频的信息、相似视频与待检测的视频之间的编辑类型、相似视频与待检测的视频之间的相似度、相似视频中的相似镜头和待检测的视频中对应的镜头之间的对应关系、待检测的视频的镜头与对应的相似视频中的相似镜头之间的相似度统称为相似视频对应的相关信息。
值得注意的是,对于前述步骤S403获得的多个相似视频,可以进一步根据相似视频的相似度大小对相似视频进行筛选,仅将筛选后的相似视频对应的相关的信息输出至显 示模块。例如,将相似视频与待检测的视频之间的相似度与预设定的筛选阈值进行比较,仅输出大于或等于预设定的筛选阈值的相似视频对应的相关信息至显示模块。
显示模块根据获得的上述信息可以对这些信息进行多种不同形式的显示。
图9为本申请的一种实施例提供的显示模块以文本的形式显示检测装置输出的信息的示意图。如图9所示,在文本中包含了在视频库中查询到的与待检测的视频Q相似度最高的前K个相似视频对应的相关信息,包括:每个相似视频的名称、每个相似视频与视频Q的相似度、每个相似视频中的相似镜头列表以及相似视频与视频Q之间的整体编辑类型列表,其中,每个相似视频中的相似镜头列表中包含了视频Q中的镜头的起始和结束的帧序或时间、相似视频中对应的相似镜头的起始和结束的帧序或时间、镜头相似度、镜头编辑类型等信息,相似镜头列表中的信息即表示了相似镜头和对应视频Q中的镜头的对应关系。
图10为本申请的另一种实施例提供的显示模块以可视化界面的形式显示检测装置输出的信息的示意图。如图10所示,在可视化界面中显示了待检测的视频Q,以及与视频Q相似的相似视频,以及对应的相似镜头及其相似度,还有每个相似镜头对应的编辑类型和每个镜头的相似度。
可选的,由前述步骤S403和S404获取的相似视频对应的相关信息也可以输出至处理模块,处理模块可以是检测装置中的功能模块也可以是其他装置或设备中的功能模块,处理模块可对相似视频对应的相关信息进行进一步地处理。
由上述步骤S401-S405所述的方法即可对待检测的视频完成编辑类型的检测。应理解,上述方法中各个步骤描述的具体实现方式仅仅是示例性的描述,不对本申请提供的一种视频相似检测的方法造成任何限定。
下面结合图11示例性地描述前述步骤S401中根据待检测的视频的内容确定镜头和关键帧的具体实现方式:
S4011:读取待检测的视频中的视频帧,对待检测的视频中的视频帧进行相似比对。
具体地,首先按照时序读取待检测的视频中的第一个视频帧和第二个视频帧,利用图像哈希算法或其他相似比对算法将两帧视频帧中后一视频帧与前一视频帧进行相似比对,若前后两帧为相似的,则继续读取新的一个视频帧,将新的一个视频帧与前两个视频帧中的后一视频帧进行相似比对,直到新的视频帧与前两个视频帧的后一视频帧不相似,则将两个不相似的视频帧存入缓存区。
S4012:对缓存区中的两个相邻视频帧进行灰度颜色直方图差值计算。
具体地,分别计算两个相邻的视频帧的灰度颜色直方图,将两个视频帧的灰度颜色直方图进行对应相减,获得两个视频帧的灰度颜色直方图差值,在缓存中存储该灰度颜色直方图差值。
S4013:判断缓存区中缓存的视频帧数量与预设定的最小镜头帧数的关系,当缓存区中缓存的视频帧数量大于预设定的最小镜头帧数,则执行步骤S4014;否则,执行步骤S4011。
S4014:根据缓存区内所有灰度颜色直方图差值的最大值和平均值判断镜头边界。
具体地,计算缓存区内所有灰度颜色直方图差值的最大值M和平均值S(其中,M和S是大于0的实数),若M>n*S(其中0>n>=1),则将灰度颜色直方图差值的最大值M对应的两个视频帧中后一帧视频帧确定为镜头边界,执行步骤S4015;若M<=n*S(其中0>n>=1),则执行步骤S4016。值得注意的是,n的取值可根据不同的应用需求预先进行设置。
S4015:在待检测的视频中将已确定的前一个镜头边界的后一帧(或者待检测的视频的第一帧)与当前确定的镜头边界及其之间的所有视频帧的集合确定为一个镜头,在该镜头内进行关键帧确定。确定关键帧的方法为计算镜头内相邻视频帧的灰度颜色直方图差值,将差值大于预设定的差值阈值的视频帧选出,并对选出的视频帧进行画面筛选,选择清晰、亮度适中的视频帧作为关键帧。
值得注意的是,步骤S4015执行完成之后,清空缓存区缓存。继续执行前述步骤S4011。
S4016:计算缓存区中的所有视频帧的梯度值,将最大梯度值与预设定的梯度阈值进行比较,若大于预设定的阈值,则将该最大梯度值对应的一个视频帧确定为镜头边界,执行步骤S4015。若小于或等于预设定的梯度阈值,则执行步骤S4017。
S4017:判断缓存区中缓存的视频帧数量与预设定的最大镜头帧数的关系,当缓存区中缓存的视频帧数量大于预设定的最大镜头帧数,则执行步骤S4015;否则,执行步骤S4011。
上述步骤按照各自的执行条件执行,直到待检测的视频中所有的视频帧都被处理,则结束执行。
值得注意的是,在本申请中,根据待检测的视频的内容确定镜头和关键帧可以根据待检测的视频内容的类型不同而采用不同的方法。例如:待检测的视频为同一背景的讲座、综艺等内容的视频时,可采用固定时长对待检测的视频进行分割,分割后的每个固定时长的视频片段为一个镜头,再在分割后的每个镜头内进行关键帧确定,关键帧确定的方法也多种多样,例如:可根据固定视频帧间隔选择镜头内几个关键帧,也可以对镜头内每个视频帧进行边缘检测,选择边缘与相邻视频帧的边缘相差较大的视频帧作为关键帧。
在本申请的另一个实施例中,视频相似检测的方法与前述步骤S401-S405描述的方法稍有不同,经过训练好的编辑类型识别模型可以拆分成两部分,一部分包括第一特征提取分支和第二特征提取分支,另一部分包括预测器,编辑类型识别模型的 两部分可以存储在不同的位置(例如:不同的虚拟机、不同的物理计算设备)。采用第一和第二特征提取分支对关键帧和相似关键帧进行特征提取的操作可在编辑类型识别模型进行编辑类型识别之前完成,例如:可以在前述步骤S403中采用第一特征提取分支和第二特征提取分支分别对关键帧和相似关键帧进行编辑特征的提取,所获得的关键帧的编辑特征和相似关键帧的编辑特征暂存至存储模块中。在步骤S404中再将存储模块中的关键帧的编辑特征和相似关键帧的编辑特征输入至预测器,由预测器输出关键帧和相似关键帧之间的存在的编辑类型。
本申请提供一种如图5所示的检测装置300,检测装置300包括的模块和功能如前文中的描述,在此不再赘述。在一种实施例中,检测装置300中的结构分析模块301具体用于执行前述步骤S401所描述的方法;特征提取模型302具体用于执行前述步骤S402所描述的方法;对比分析模块303具体用于执行前述步骤S403所描述的方法;编辑类型识别模型304具体用于执行前述步骤S404所描述的方法;输出模块305具体用于执行前述步骤S405所描述的方法。
本申请还提供一种如图4所示的计算设备100,计算设备100中的处理器102读取存储器104存储的检测装置300包括的可执行代码以执行前述视频相似检测的方法。
由于本申请的检测装置300中的各个模块可以分别部署在多个计算设备上,因此,本申请还提供一种如图12所示的计算设备系统,该计算设备系统包括多个计算设备500,每个计算设备500包括总线501、处理器502、通信接口503和存储器504。处理器502、存储器504和通信接口503之间通过总线501通信。
其中,处理器502可以为CPU。存储器504可以包括易失性存储器(英文:volatile memory),例如RAM。存储器504还可以包括非易失性存储器,例如ROM,快闪存储器,HDD或SSD。存储器504中存储有可执行代码,处理器502执行该可执行代码以执行视频相似检测的部分方法。存储器504中还可以包括操作系统等其他运行进程所需的软件模块。操作系统可以为LINUX TM,UNIX TM,WINDOWS TM等。
每个计算设备500间通过通信网络建立通信通路。每个计算设备500上运行结构分析模块301、特征提取模型302、对比分析模块303、编辑类型识别模型304、输出模块305中的任意一个或多个。任一计算设备500可以为云数据中心中的计算设备,或边缘数据中心中的计算设备,或终端计算设备。
上述各个附图对应的流程的描述各有侧重,某个流程中没有详述的部分,可以参见其他流程的相关描述。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。视频相似检测的计算机程序产品包括一个或多个视频相似检测的计算机指令,在计算机上 加载和执行这些计算机程序指令时,全部或部分地产生按照本发明实施例图6所述的流程或功能。
所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质存储有视频相似检测的计算机程序指令的可读存储介质。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如SSD)。

Claims (17)

  1. 一种视频相似检测的方法,其特征在于,包括:
    接收第一视频,根据所述第一视频确定所述第一视频的关键帧;
    输入所述关键帧至特征提取模型,获得所述关键帧的特征;
    根据所述关键帧的特征确定相似关键帧和第二视频,其中,所述第二视频为所述相似关键帧所在的视频,所述第二视频与所述第一视频相似;
    输入所述关键帧和所述相似关键帧至编辑类型识别模型,获得编辑类型,其中,所述编辑类型指示所述第一视频与所述第二视频之间进行编辑采用的编辑类型。
  2. 如权利要求1所述的方法,其特征在于,所述方法还包括:
    输出所述第二视频或所述第二视频的信息至显示模块,其中,所述第二视频的信息包括所述第二视频的名称;
    输出所述编辑类型至所述显示模块。
  3. 如权利要求1或2所述的方法,其特征在于,所述编辑类型识别模型包括第一特征提取分支、第二特征提取分支和预测器;
    输入所述关键帧和所述相似关键帧至编辑类型识别模型,获得编辑类型具体包括:
    输入所述关键帧至所述第一特征提取分支,输入所述相似关键帧至所述第二特征提取分支;
    所述第一特征提取分支对所述关键帧进行特征提取,输出所述关键帧的编辑特征,所述第二特征提取分支对所述相似关键帧进行特征提取,输出所述相似关键帧的编辑特征;
    输入所述关键帧的编辑特征和所述相似关键帧的编辑特征至所述预测器,所述预测器输出所述编辑类型。
  4. 如权利要求1-3任一项所述的方法,其特征在于,所述方法还包括:
    计算所述第一视频与所述第二视频之间的相似度;
    输出所述相似度至显示模块。
  5. 如权利要求1-4任一项所述的方法,其特征在于,根据所述关键帧的特征确定相似关键帧和第二视频具体包括:
    根据所述关键帧的特征查询视频库,在所述视频库中获取所述相似关键帧,所述相似关键帧的特征与所述关键帧的特征相似;
    根据所述相似关键帧确定所述第二视频。
  6. 如权利要求1-5任一项所述的方法,其特征在于,所述特征提取模型和所述编辑类型识别模型分别采用不同的神经网络模型。
  7. 如权利要求1-6任一项所述的方法,其特征在于,所述编辑类型包括下述操作中的一种或多种:
    裁剪、拼接、旋转、镜像、模糊、添加文字、添加图标、变换色彩、变化亮度和变换对比度。
  8. 一种检测装置,其特征在于,包括:
    结构分析模块,用于接收第一视频,根据所述第一视频确定所述第一视频的关键帧;
    特征提取模型,用于根据所述关键帧获得所述关键帧的特征;
    对比分析模块,用于根据所述关键帧的特征确定相似关键帧和第二视频,其中,所述第二视频为所述相似关键帧所在的视频,所述第二视频与所述第一视频相似;
    编辑类型识别模型,用于根据所述关键帧和所述相似关键帧获得编辑类型,其中,所述编辑类型指示所述第一视频与所述第二视频之间进行编辑采用的编辑类型。
  9. 如权利要求8所述的装置,其特征在于,所述检测装置还包括:
    输出模块,用于输出所述第二视频或所述第二视频的信息至显示模块,其中,所述第二视频的信息包括所述第二视频的名称;还用于输出所述编辑类型至所述显示模块。
  10. 如权利要求8或9所述的装置,其特征在于,所述编辑类型识别模型包括第一特征提取分支、第二特征提取分支和预测器;
    所述第一特征提取分支用于接收所述关键帧,对所述关键帧进行特征提取,输出所述关键帧的编辑特征;
    所述第二特征提取分支用于接收所述相似关键帧,对所述相似关键帧进行特征提取,输出所述相似关键帧的编辑特征;
    所述预测器用于根据所述关键帧的编辑特征和所述相似关键帧的编辑特征获得所述编辑类型。
  11. 如权利要求8-10任一项所述的装置,其特征在于,
    所述对比分析模块还用于计算所述第一视频与所述第二视频之间的相似度;
    所述输出模块还用于输出所述相似度至显示模块。
  12. 如权利要求8-11任一项所述的装置,其特征在于,
    所述结构分析模块具体用于:根据所述关键帧的特征查询视频库,在所述视频库中获取所述相似关键帧,所述相似关键帧的特征与所述关键帧的特征相似;根据所述相似关键帧确定所述第二视频。
  13. 如权利要求8-12任一项所述的装置,其特征在于,所述特征提取模型和所述编辑类型识别模型分别采用不同的神经网络模型。
  14. 如权利要求8-13任一项所述的装置,其特征在于,所述编辑类型包括下述操作中的一种或多种:
    裁剪、拼接、旋转、镜像、模糊、添加文字、添加图标、变换色彩、变化亮度和变换对比度。
  15. 一种计算设备系统,包括至少一台计算设备,其特征在于,每台计算设备包括存储器和处理器,所述至少一台计算设备的存储器,用于存储计算机指令;
    所述至少一台计算设备的处理器执行所述存储器存储的计算机指令,以执行上述权利要求1至7中任一项所述的方法。
  16. 一种非瞬态的可读存储介质,其特征在于,所述非瞬态的可读存储介质被计算设备执行时,所述计算设备执行上述权利要求1至7中任一项所述的方法。
  17. 一种计算机程序产品,其特征在于,所述计算机程序产品被计算设备执行时,所述计算设备执行上述权利要求1至7中任一项所述的方法。
PCT/CN2019/096515 2019-07-18 2019-07-18 一种视频相似检测的方法、装置及设备 WO2021007846A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP19937682.3A EP3989158A4 (en) 2019-07-18 2019-07-18 Method, apparatus and device for video similarity detection
PCT/CN2019/096515 WO2021007846A1 (zh) 2019-07-18 2019-07-18 一种视频相似检测的方法、装置及设备
CN201980098001.5A CN114041165A (zh) 2019-07-18 2019-07-18 一种视频相似检测的方法、装置及设备
US17/568,705 US20220172476A1 (en) 2019-07-18 2022-01-04 Video similarity detection method, apparatus, and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/096515 WO2021007846A1 (zh) 2019-07-18 2019-07-18 一种视频相似检测的方法、装置及设备

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/568,705 Continuation US20220172476A1 (en) 2019-07-18 2022-01-04 Video similarity detection method, apparatus, and device

Publications (1)

Publication Number Publication Date
WO2021007846A1 true WO2021007846A1 (zh) 2021-01-21

Family

ID=74209611

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096515 WO2021007846A1 (zh) 2019-07-18 2019-07-18 一种视频相似检测的方法、装置及设备

Country Status (4)

Country Link
US (1) US20220172476A1 (zh)
EP (1) EP3989158A4 (zh)
CN (1) CN114041165A (zh)
WO (1) WO2021007846A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609316A (zh) * 2021-07-27 2021-11-05 支付宝(杭州)信息技术有限公司 媒体内容相似度的检测方法和装置
CN115205765A (zh) * 2022-09-15 2022-10-18 成都中轨轨道设备有限公司 一种基于fpga的视频分析方法及系统

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419525A (zh) * 2022-03-30 2022-04-29 成都考拉悠然科技有限公司 一种有害视频的检测方法及其系统
CN114697761B (zh) * 2022-04-07 2024-02-13 脸萌有限公司 一种处理方法、装置、终端设备及介质
CN114626024A (zh) * 2022-05-12 2022-06-14 北京吉道尔科技有限公司 一种基于区块链的互联网侵权视频低耗检测方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105809174A (zh) * 2016-03-29 2016-07-27 北京小米移动软件有限公司 识别图像的方法及装置
US20170140541A1 (en) * 2015-11-18 2017-05-18 Yi-Chih Lu Method for Identifying a Target Object in a Video File
CN109189991A (zh) * 2018-08-17 2019-01-11 百度在线网络技术(北京)有限公司 重复视频识别方法、装置、终端及计算机可读存储介质
CN109492129A (zh) * 2018-10-26 2019-03-19 武汉理工大学 一种基于双流神经网络的相似视频搜索方法和系统

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090263014A1 (en) * 2008-04-17 2009-10-22 Yahoo! Inc. Content fingerprinting for video and/or image
US9436876B1 (en) * 2014-12-19 2016-09-06 Amazon Technologies, Inc. Video segmentation techniques
US9805255B2 (en) * 2016-01-29 2017-10-31 Conduent Business Services, Llc Temporal fusion of multimodal data from multiple data acquisition systems to automatically recognize and classify an action
US10319412B2 (en) * 2016-11-16 2019-06-11 Adobe Inc. Robust tracking of objects in videos
CN106682108B (zh) * 2016-12-06 2022-07-12 浙江大学 一种基于多模态卷积神经网络的视频检索方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170140541A1 (en) * 2015-11-18 2017-05-18 Yi-Chih Lu Method for Identifying a Target Object in a Video File
CN105809174A (zh) * 2016-03-29 2016-07-27 北京小米移动软件有限公司 识别图像的方法及装置
CN109189991A (zh) * 2018-08-17 2019-01-11 百度在线网络技术(北京)有限公司 重复视频识别方法、装置、终端及计算机可读存储介质
CN109492129A (zh) * 2018-10-26 2019-03-19 武汉理工大学 一种基于双流神经网络的相似视频搜索方法和系统

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3989158A4 *


Also Published As

Publication number Publication date
CN114041165A (zh) 2022-02-11
EP3989158A4 (en) 2022-06-29
US20220172476A1 (en) 2022-06-02
EP3989158A1 (en) 2022-04-27

Similar Documents

Publication Publication Date Title
WO2021007846A1 (zh) 一种视频相似检测的方法、装置及设备
WO2022116888A1 (zh) 一种视频数据处理方法、装置、设备以及介质
US9047376B2 (en) Augmenting video with facial recognition
WO2019218824A1 (zh) 一种移动轨迹获取方法及其设备、存储介质、终端
CN111062871B (zh) 一种图像处理方法、装置、计算机设备及可读存储介质
US10621755B1 (en) Image file compression using dummy data for non-salient portions of images
KR102354692B1 (ko) 규칙 기반 비디오 중요도 분석
CN111651636B (zh) 视频相似片段搜索方法及装置
CN110688524B (zh) 视频检索方法、装置、电子设备及存储介质
CN111553362B (zh) 一种视频处理方法、电子设备和计算机可读存储介质
CN112954450B (zh) 视频处理方法、装置、电子设备和存储介质
CN113010703A (zh) 一种信息推荐方法、装置、电子设备和存储介质
US9665773B2 (en) Searching for events by attendants
CN112989116B (zh) 一种视频推荐方法、系统及装置
CN111783712A (zh) 一种视频处理方法、装置、设备及介质
US9081801B2 (en) Metadata supersets for matching images
CN111491187A (zh) 视频的推荐方法、装置、设备及存储介质
CN111209897A (zh) 视频处理的方法、装置和存储介质
CN112084812A (zh) 图像处理方法、装置、计算机设备及存储介质
US11537636B2 (en) System and method for using multimedia content as search queries
US11995889B2 (en) Cognitive generation of HTML pages based on video content
US10853417B2 (en) Generating a platform-based representative image for a digital video
KR20200115017A (ko) 영상 검색 장치 및 방법
Jin et al. Network video summarization based on key frame extraction via superpixel segmentation
US20160323627A1 (en) Method for annotating an object in a multimedia asset

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19937682

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019937682

Country of ref document: EP

Effective date: 20220120