CN114329050A - Visual media data deduplication processing method, device, equipment and storage medium

Visual media data deduplication processing method, device, equipment and storage medium

Info

Publication number
CN114329050A
Authority
CN
China
Prior art keywords
media data
visual media
visual
similarity
video
Prior art date
Legal status
Pending
Application number
CN202111541971.3A
Other languages
Chinese (zh)
Inventor
汪翔 (Wang Xiang)
黄珊 (Huang Shan)
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111541971.3A priority Critical patent/CN114329050A/en
Publication of CN114329050A publication Critical patent/CN114329050A/en

Landscapes

  • Image Analysis (AREA)

Abstract

The application relates to a visual media data deduplication processing method, apparatus, device, and storage medium. The method involves artificial intelligence and comprises: performing visual feature extraction on at least two pieces of visual media data to obtain the visual features of each piece, where the visual features include image features and text region features; extracting text information from each piece of visual media data to obtain its text content features; performing similarity analysis on the at least two pieces of visual media data based on the visual features and the text content features to obtain the similarity between them; and deduplicating the at least two pieces of visual media data according to that similarity. By combining these multiple perspectives, the method improves the accuracy of the computed similarity between videos, avoids missing duplicate videos that should be deduplicated, and improves the video deduplication rate and deduplication processing efficiency of a video platform.

Description

Visual media data deduplication processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for visual media data deduplication processing.
Background
With the development of computer technology and the wide application of the internet in daily life and work, more and more people acquire and transmit information through the internet. The information may take the form of text, web pages, audio data, visual media data, and the like; among these, visual media data is the most content-rich form and is the means of information transmission preferred by most users.
Because the threshold for producing visual media is now low, users can quickly generate visual media with production tools, and massive amounts of visual media are published on the network at all times. However, visual media data published by different users may duplicate or copy other visual media data, and large amounts of duplicated visual media data easily occupy publishing channels and display interface resources. A visual media platform therefore needs to monitor the visual media data published on it in real time and deduplicate that data, so as to improve the quality of the published visual media data and attract more users.
Conventionally, a deduplication method based on MD5 values is adopted: the MD5 value (i.e., MD5 message digest) of each piece of visual media data is first calculated, whether two pieces of visual media data are identical is then determined from their MD5 values, and deduplication is performed accordingly. However, the conventional MD5-based deduplication method is sensitive to interference introduced during upload, such as compression, cropping, and watermarking. That is, the same visual media data is easily recognized as non-duplicate after compression or a light editing operation (such as cropping or watermarking). The deduplication rate of the conventional MD5-based method therefore still needs to be improved.
Disclosure of Invention
In view of the above technical problems, it is desirable to provide a visual media data deduplication processing method, apparatus, device, and storage medium capable of improving the visual media data deduplication rate of a visual media platform.
A visual media data deduplication processing method, the method comprising:
respectively extracting visual features of at least two visual media data to obtain the visual features of the visual media data, wherein the visual features comprise image features and character region features;
extracting character information of each visual media data to obtain character content characteristics of each visual media data;
based on the visual features and the character content features, carrying out similarity analysis on the at least two visual media data to obtain the similarity between the visual media data;
and according to the similarity between the visual media data, carrying out duplicate removal processing on the at least two visual media data.
A visual media data deduplication processing apparatus, the apparatus comprising:
the visual feature extraction module is used for respectively extracting visual features of at least two pieces of visual media data to obtain the visual features of the visual media data, and the visual features comprise image features and character region features;
the text content characteristic extraction module is used for extracting text information of each visual media data to obtain text content characteristics of each visual media data;
the similarity analysis module is used for carrying out similarity analysis on the at least two visual media data based on the visual characteristics and the character content characteristics to obtain the similarity between the visual media data;
and the duplication elimination processing module is used for carrying out duplication elimination processing on the at least two visual media data according to the similarity between the visual media data.
A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
respectively extracting visual features of at least two visual media data to obtain the visual features of the visual media data, wherein the visual features comprise image features and character region features;
extracting character information of each visual media data to obtain character content characteristics of each visual media data;
based on the visual features and the character content features, carrying out similarity analysis on the at least two visual media data to obtain the similarity between the visual media data;
and according to the similarity between the visual media data, carrying out duplicate removal processing on the at least two visual media data.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:
respectively extracting visual features of at least two visual media data to obtain the visual features of the visual media data, wherein the visual features comprise image features and character region features;
extracting character information of each visual media data to obtain character content characteristics of each visual media data;
based on the visual features and the character content features, carrying out similarity analysis on the at least two visual media data to obtain the similarity between the visual media data;
and according to the similarity between the visual media data, carrying out duplicate removal processing on the at least two visual media data.
A computer program product comprising a computer program which when executed by a processor performs the steps of:
respectively extracting visual features of at least two visual media data to obtain the visual features of the visual media data, wherein the visual features comprise image features and character region features;
extracting character information of each visual media data to obtain character content characteristics of each visual media data;
based on the visual features and the character content features, carrying out similarity analysis on the at least two visual media data to obtain the similarity between the visual media data;
and according to the similarity between the visual media data, carrying out duplicate removal processing on the at least two visual media data.
In the above visual media data deduplication processing method, apparatus, device, and storage medium, visual feature extraction is performed on at least two pieces of visual media data to obtain the visual features of each piece, where the visual features include image features and text region features; by also considering the text region features on the visual media data, pieces of visual media data with the same image features but different text region features are prevented from being classified as the same visual media data. Text information is extracted from each piece of visual media data to obtain its text content features; similarity analysis can then be performed on the at least two pieces of visual media data based on the visual features and the text content features to obtain the similarity between them, and the at least two pieces can be deduplicated according to that similarity. Considering multiple perspectives in combination improves the accuracy of the computed similarity between videos and avoids missing duplicate videos that should be deduplicated; at the same time, the multi-perspective approach avoids the conventional MD5-based method's over-sensitivity to interference factors in the video, improving the video deduplication rate and deduplication processing efficiency of the video platform.
Drawings
FIG. 1 is a diagram of an exemplary implementation of a visual media data deduplication process;
FIG. 2 is a flow diagram that illustrates a visual media data deduplication process in one embodiment;
FIG. 3 is a flow diagram that illustrates the deduplication processing of at least two visual media data in one embodiment;
FIG. 4 is a schematic flow chart diagram illustrating a trained feature extraction network in one embodiment;
FIG. 5 is a flowchart illustrating a visual media data deduplication processing method according to another embodiment;
FIG. 6 is a flowchart illustrating a visual media data deduplication processing method according to yet another embodiment;
FIG. 7 is a block diagram of a visual media data deduplication processing apparatus in one embodiment;
FIG. 8 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The application provides a video deduplication processing method that involves artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. AI software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
Computer Vision (CV) is the science of how to make machines "see": using cameras and computers in place of human eyes to recognize, track, and measure targets, and further performing image processing so that the result is an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies theories and techniques for building artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, and intelligent transportation, as well as common biometric technologies such as face recognition and fingerprint recognition.
With the research and progress of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, the Internet of Vehicles, and intelligent transportation.
The video deduplication processing method provided by the application involves artificial intelligence technologies such as computer vision and can be applied in the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process; it may be integrated on the server 104 or placed on a cloud or other network server. The server 104 performs visual feature extraction on at least two pieces of visual media data to obtain the visual features of each piece, and extracts text information from each piece to obtain its text content features. The visual features include image features and text region features, and the server 104 may receive local visual media data sent by the terminal 102 or obtain visual media data from its own cloud database. Further, the server 104 may perform similarity analysis on the at least two pieces of visual media data based on the visual features and the text content features to obtain the similarity between them, deduplicate the at least two pieces of visual media data according to that similarity, and store the deduplicated visual media data. The server 104 may store the deduplicated visual media data in its cloud database or send it to the terminal 102 for display and storage. The terminal 102 may be, but is not limited to, a smartphone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch, vehicle-mounted terminal, or smart television. The server 104 may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing cloud computing services. The terminal 102 and the server 104 may be connected directly or indirectly through wired or wireless communication, which is not limited in this application.
In one embodiment, as shown in fig. 2, a visual media data deduplication processing method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
step S202, visual characteristic extraction is respectively carried out on at least two visual media data to obtain the visual characteristics of each visual media data, and the visual characteristics comprise image characteristics and character region characteristics.
Specifically, a visual media database to be deduplicated is obtained, where the database contains at least two pieces of visual media data to be deduplicated. Visual feature extraction is then performed on the at least two pieces of visual media data in the database to obtain the visual features of each piece.
The visual features of the visual media data specifically include image features and text region features: the image features characterize the image content presented by the visual media data, and the text region features characterize the specific regions where text is located in visual media data that carries text.
Further, visual media data may include video data and image data: the video data may include videos of different durations or from different video platforms, and the image data may include image data of different types and uses, such as pictures and emoticons.
In an embodiment, taking visual media data as video data as an example, before performing visual feature extraction on at least two pieces of visual media data respectively to obtain a visual feature of each piece of visual media data, the method further includes: and performing video frame extraction on at least two pieces of video data to obtain video frames.
Specifically, the purpose of video frame extraction is to select representative image frames from a video for measuring video similarity, thereby reducing the amount of computation during similarity analysis. Various frame extraction methods may be used, such as extraction at fixed intervals or key frame extraction.
Key frame extraction extracts specific key frames of interest from the video according to the actual situation; for example, an object detection and recognition algorithm may be used to extract the image frames containing an object of interest. Specifically, target video information, i.e., the video information the user is interested in, is obtained, and key frames are extracted from the at least two pieces of video data according to that target video information to obtain the video key frames.
Frame extraction at fixed intervals means extracting one frame every fixed time interval; specifically, frames may be extracted from the at least two pieces of video data at a preset frame extraction time interval to obtain the corresponding fixed video frames. The preset frame extraction time interval can be adjusted according to user requirements or the actual application scenario and is not specifically limited.
Further, the frame extraction algorithm must have no randomness; that is, for the same video, the frames extracted each time should be identical, which ensures that two identical videos yield identical frames and facilitates the similarity calculation. The number of frames extracted per video should be fixed (e.g., 10 frames): excess frames are discarded, and if a video yields too few frames, the last frame is repeated to fill the quota.
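For illustration, the deterministic fixed-count frame extraction described above can be sketched as follows in Python with OpenCV; the function name, the even-spacing rule, and the default of 10 frames are illustrative choices, not details mandated by this application.

```python
import cv2


def extract_frames(video_path, num_frames=10):
    """Extract exactly num_frames frames at evenly spaced, deterministic
    positions, so the same video always yields the same frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced indices depend only on the video length (no randomness).
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    # Pad with the last frame if decoding yielded too few frames.
    while frames and len(frames) < num_frames:
        frames.append(frames[-1])
    return frames
```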
Step S204, extracting the character information of each visual media data to obtain the character content characteristics of each visual media data.
Specifically, image preprocessing is performed on each piece of visual media data to obtain preprocessed image frames to be recognized; character segmentation and character recognition are performed on the image frames to be recognized to obtain the segmented characters in sequence; dimension reduction and feature extraction are then performed on the segmented characters to obtain character features; and feature classification and content recognition can further be performed on the character features to obtain the text content features of the visual media data.
The preprocessed image frames to be recognized are obtained by performing image preprocessing, such as grayscale conversion, binarization, and noise reduction, on each piece of visual media data. Character segmentation and character recognition are then performed on the image frames to be recognized: character segmentation splits the text in an image frame into individual characters, and character recognition is performed on the segmented characters in sequence. If a line of text is skewed, skew correction is applied first, and the segmented characters are normalized, i.e., the individual character images are resized to the same size specification before character recognition is performed one by one.
Furthermore, the character features are obtained by performing dimension reduction and feature extraction on the segmented characters. The characters may include different types such as digits, letters, symbols, and Chinese characters, and the character type can be determined by extracting the character features. Because digits, letters, and symbols are few in number, they belong to a small character set, and character recognition can be achieved with no dimension reduction or only simple dimension reduction.
For Chinese characters, which are numerous (a large character set), structurally complex, and often similar in shape, character recognition is more difficult. To improve recognition efficiency, dimension reduction is needed to reduce the feature dimensionality. Meanwhile, to ensure that the reduced feature vector retains enough character information to distinguish different characters, the strength of the dimension reduction needs to be tuned to actual requirements.
In one embodiment, after the character features are obtained by performing dimension reduction processing and feature extraction on the segmented characters, feature classification and content recognition are further performed based on the character features to obtain the character content features of the visual media data.
Specifically, the character features are classified and their content recognized by a trained classifier, which determines the character class to which each character feature belongs; once the class is determined, the character content can be obtained, yielding the text content features of the visual media data.
Specifically, a character library can be obtained through random acquisition, and the initial classifier is trained according to the character library to obtain a trained classifier.
In one embodiment, a text-information-extraction algorithm or network model, such as an OCR (Optical Character Recognition) algorithm, may be used to extract text information from the visual media data to obtain the text content features of each piece of visual media data. This embodiment does not specifically limit the algorithm or network model used, as long as it meets the requirements of text information extraction.
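As a concrete illustration of this step, the following minimal Python sketch preprocesses a frame (grayscale conversion, Otsu binarization, noise reduction) and delegates character segmentation and recognition to an off-the-shelf OCR engine; pytesseract and the chi_sim language pack are assumptions standing in for whichever OCR algorithm or network model is actually used.

```python
import cv2
import pytesseract  # assumed OCR engine; any OCR model meeting the
                    # text-extraction requirement could be substituted


def extract_text(frame):
    """Preprocess an image frame and return the recognized text content."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)                    # grayscale
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)    # binarize
    denoised = cv2.medianBlur(binary, 3)                              # denoise
    # Tesseract performs character segmentation and recognition internally.
    return pytesseract.image_to_string(denoised, lang='chi_sim')
```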
Step S206, based on the visual characteristics and the character content characteristics, similarity analysis is carried out on at least two visual media data to obtain the similarity between the visual media data.
Specifically, visual similarity calculation is performed based on the image features and the text region features to generate the visual similarity value of the at least two pieces of visual media data; text similarity calculation is performed based on the text content features to generate their text similarity value; and the visual similarity value and the text similarity value are then combined to obtain the similarity between the pieces of visual media data.
Specifically, the Euclidean distance between two pieces of visual media data may be calculated based on the image features and the text region features, and the visual similarity of the two pieces judged from that distance: the smaller the Euclidean distance, the higher the visual similarity.
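A minimal sketch of this distance-based comparison, assuming the feature extraction network outputs one fixed-length vector per piece of visual media data; the mapping from distance to a (0, 1] similarity score is an illustrative choice.

```python
import numpy as np


def visual_similarity(feat_a, feat_b):
    """Smaller Euclidean distance between feature vectors means higher
    visual similarity; map the distance into (0, 1] for comparison."""
    dist = np.linalg.norm(np.asarray(feat_a) - np.asarray(feat_b))
    return 1.0 / (1.0 + dist)
```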
In addition, when the visual media data are pictures, the visual similarity of two pieces of visual media data may also be calculated with a perceptual hash algorithm, which generates a fingerprint string for each picture and compares the fingerprint strings of different pictures: the closer the fingerprint strings, the more similar the pictures, i.e., the higher the visual similarity of the two pieces of visual media data.
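For the picture case, a minimal DCT-based perceptual-hash sketch is given below; the 32x32 resize and 8x8 low-frequency block follow a common pHash convention and are assumptions rather than details fixed by this application.

```python
import cv2
import numpy as np


def phash(picture, hash_size=8):
    """Generate a 64-bit fingerprint from the low-frequency DCT block."""
    gray = cv2.cvtColor(picture, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (32, 32)).astype(np.float32)
    dct = cv2.dct(small)                       # 2-D discrete cosine transform
    low = dct[:hash_size, :hash_size]          # keep the low frequencies
    return (low > np.median(low)).flatten()    # boolean fingerprint


def fingerprint_similarity(h1, h2):
    """Fewer differing bits means the pictures are more similar."""
    return 1.0 - np.count_nonzero(h1 != h2) / h1.size
```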
In one embodiment, the text similarity calculation may be performed according to the text content features and the edit distance between the corresponding text contents, so as to generate the text similarity values of the at least two pieces of visual media data.
The edit distance is the minimum number of delete, insert, and replace operations required to change one piece of text into another; it can be understood that the smaller the edit distance between two pieces of text, the higher their similarity.
In addition, because computing the edit distance is relatively expensive, the faster Jaccard similarity can be used in scenarios with strict latency requirements. The Jaccard similarity (i.e., Jaccard similarity coefficient) is mainly used to compare the similarity and difference between finite sample sets and can measure similarity between samples with symbolic or Boolean-valued attributes. For two pieces of text, the Jaccard similarity is the number of elements in their intersection divided by the number of elements in their union.
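Both text-similarity options can be sketched in a few lines of Python; the dynamic-programming edit distance and the character-set Jaccard similarity below are standard formulations, with character-level tokenization as an illustrative assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of delete/insert/replace operations turning a into b."""
    dp = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,                      # delete
                                     dp[j - 1] + 1,                  # insert
                                     prev + (a[i - 1] != b[j - 1]))  # replace
    return dp[len(b)]


def jaccard_similarity(a: str, b: str) -> float:
    """Intersection size over union size of the two character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0
```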
Furthermore, the similarity between pieces of visual media data can be obtained by summing the visual similarity value and the text similarity value.
Step S208, according to the similarity between the visual media data, at least two visual media data are subjected to duplication elimination processing.
Specifically, a preset similarity threshold is obtained and compared with the similarity between the pieces of visual media data to judge whether that similarity exceeds the threshold. When the similarity between two pieces of visual media data is greater than the preset similarity threshold, the two pieces currently compared contain duplicate data, deduplication is required, and the deduplicated visual media data are retained.
In one embodiment, the visual media data includes video data. Before visual feature extraction, video frames are first extracted from the at least two pieces of video data; visual feature extraction is then performed on the video frames to obtain the visual features of each frame, including image features and text region features, and text information is extracted from each frame to obtain its text content features.
Specifically, based on the visual features and the character content features, similarity analysis is performed on at least two video frames to obtain similarity between the two video frames, a preset similarity threshold is obtained, and whether the similarity between the video frames is greater than the preset similarity threshold is judged.
When the similarity between two video frames is determined to be greater than the preset similarity threshold, the two current video frames are determined to be a similar video frame pair, and the number of similar video frame pairs among the preset number of video frame pairs extracted from any two pieces of video data is obtained.
Further, whether duplicate video exists between the two current pieces of video data is determined by obtaining a preset similar-video-frame-pair threshold and comparing it with the number of similar video frame pairs. When duplicate video exists, video deduplication is performed on the two pieces of video data.
In the above visual media data deduplication processing method, visual feature extraction is performed on at least two pieces of visual media data to obtain the visual features of each piece, where the visual features include image features and text region features; by also considering the text region features on the visual media data, pieces of visual media data with the same image features but different text region features are prevented from being classified as the same visual media data. Text information is extracted from each piece of visual media data to obtain its text content features; similarity analysis can then be performed on the at least two pieces of visual media data based on the visual features and the text content features to obtain the similarity between them, and the at least two pieces can be deduplicated according to that similarity. Considering multiple perspectives in combination improves the accuracy of the computed similarity between videos and avoids missing duplicate videos that should be deduplicated; at the same time, the multi-perspective approach avoids the conventional MD5-based method's over-sensitivity to interference factors in the video, improving the video deduplication rate and deduplication processing efficiency of the video platform.
In one embodiment, the step of extracting visual features of at least two pieces of visual media data to obtain the visual features of each piece of visual media data includes:
when the visual media data are detected to have characters, respectively carrying out character region position detection on at least two visual media data based on the trained feature extraction network to generate character region features of the visual media data;
and respectively carrying out image feature extraction on the at least two visual media data based on the trained feature extraction network to obtain the image features of the at least two visual media data.
Specifically, the text information in visual media data is highly important: pairing different text with the same visual picture can greatly change the meaning of the visual media data, so the position of the text in the picture, i.e., the position of the text region, needs to be considered when extracting visual features.
In this embodiment, specifically, the position of the text region in the visual media data may be detected based on the trained feature extraction network to generate the text region feature. The text region feature may be a text region mask: a matrix of 0s and 1s in which 1 marks a text region and 0 a non-text region.
Further, after the text region mask is obtained, it is used as an additional channel alongside the image's RGB channels to form a four-channel input to the trained feature extraction network. It can be understood that when the trained feature extraction network performs feature extraction on input data composed of the text region mask and the image's RGB channel data, it can output the text region features and image features of the visual media data.
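A minimal sketch of assembling this four-channel input, assuming the frame is an (H, W, 3) RGB array and the mask an (H, W) matrix of 0s and 1s as described above:

```python
import numpy as np


def build_network_input(rgb_frame, text_mask):
    """Stack the text region mask as a fourth channel next to RGB,
    giving the (H, W, 4) input of the trained feature extraction network."""
    mask = text_mask.astype(rgb_frame.dtype)[..., np.newaxis]  # (H, W, 1)
    return np.concatenate([rgb_frame, mask], axis=-1)          # (H, W, 4)
```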
In this embodiment, when text is detected in the visual media data, text region position detection is performed on the at least two pieces of visual media data based on the trained feature extraction network to generate the text region features, and image feature extraction is performed on the at least two pieces based on the same network to obtain their image features. When extracting the visual features, both the visual picture of the media data and the position of the text region on that picture are considered, which prevents pieces of visual media data with the same image features but different text region features from being classified as the same visual media data, improves the subsequent deduplication rate, reduces redundant deduplication operations, and further improves deduplication processing efficiency.
In one embodiment, as shown in fig. 3, when the visual media data is video data and the similarity between the visual media data is the similarity between video frames, the step of performing deduplication processing on the at least two pieces of visual media data according to the similarity between them specifically includes:
step S302, a preset similarity threshold is obtained, and whether the similarity between video frames is greater than the preset similarity threshold is judged.
Specifically, whether the similarity between video frames is greater than a preset similarity threshold is judged by obtaining the preset similarity threshold and comparing the preset similarity threshold with the similarity between the video frames. The preset similarity threshold can be adjusted and modified according to user requirements and actual application scenes, and is not limited to certain specific values.
Step S304, when the similarity between the video frames is determined to be greater than the preset similarity threshold, determining that the current two video frames are a similar video frame pair.
Specifically, a preset similarity threshold value is compared with the similarity between video frames, and when the similarity between the video frames is determined to be greater than the preset similarity threshold value, it is determined that the two currently compared video frames are a similar video frame pair. When the similarity between the video frames is not greater than the preset similarity threshold, the two video frames which are compared currently do not belong to the similar video frames.
Step S306, acquiring the number of similar video frame pairs among the preset number of video frame pairs extracted from any two pieces of video data.
Specifically, similarity analysis is performed on video frames extracted from any two pieces of video data, all pairs of similar video frames are determined, and the number of the pairs of similar video frames is counted.
In one embodiment, assuming 10 frames are extracted from each video, for videos A and B the similarities between the 10 frames extracted from video A and the 10 frames extracted from video B are first calculated, and whether each similarity exceeds the preset similarity threshold is determined. The number of frames extracted per video is not specifically limited and can be adjusted for actual requirements and different application scenarios.
Specifically, for each frame extracted from video A, the most similar frame is matched from video B in turn, giving 10 video frame pairs, and the number of similar video frame pairs among those 10 pairs is counted. A similar video frame pair is a pair whose similarity is greater than the preset similarity threshold.
Step S308, determining whether repeated videos exist in the current two video data or not according to a preset similar video frame pair threshold value and the number of similar video frame pairs.
Specifically, the number of similar video frame pairs between the two pieces of video data is counted, a preset similar-video-frame-pair threshold is obtained, and the counted number is compared with the threshold to judge whether it exceeds the threshold.
When the number of similar video frame pairs is greater than the preset similar-video-frame-pair threshold, it is determined that duplicate video exists between the two current pieces of video data.
For example, for each frame extracted from video A, the most similar frame is matched from video B in turn, giving 10 video frame pairs; the number of similar pairs among the 10 is counted, and if it exceeds the preset similar-video-frame-pair threshold (for example, 9 pairs), the videos are determined to be similar, i.e., duplicate video exists between the two current pieces of video data.
The preset similar-video-frame-pair threshold can be adjusted according to user requirements or the actual application scenario and is not limited to a specific value; the only constraint is that it must be less than or equal to the number of video frame pairs extracted from any two videos.
And step S310, when the repeated videos exist, performing video duplicate removal processing on the current two video data.
Specifically, when the number of similar video frame pairs is determined to be greater than the preset similar-video-frame-pair threshold, a duplicate is determined to exist between the two compared videos, and video deduplication is performed: one of the videos is deleted and the other is retained and stored.
When multiple videos need to be deduplicated, two videos can be selected at random for comparison and one of them retained; the retained video is then compared against the remaining videos in turn, and the current deduplication operation is complete when no duplicate videos remain.
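Putting steps S302 to S310 together, the video-level decision can be sketched as follows; frame_similarity stands for the combined visual and text similarity of two frames, and the default thresholds simply echo the 10-frame / 9-pair example above rather than prescribed values.

```python
def videos_are_duplicates(frames_a, frames_b, frame_similarity,
                          sim_threshold=0.8, pair_threshold=9):
    """Match each frame of A to its most similar frame in B, count the
    pairs above the similarity threshold, and flag a duplicate when the
    count exceeds the similar-video-frame-pair threshold."""
    similar_pairs = sum(
        1 for fa in frames_a
        if max(frame_similarity(fa, fb) for fb in frames_b) > sim_threshold)
    return similar_pairs > pair_threshold


def deduplicate(videos, frames_of, frame_similarity):
    """Keep one representative of each group of mutually duplicate videos."""
    kept = []
    for video in videos:
        if not any(videos_are_duplicates(frames_of(video), frames_of(k),
                                         frame_similarity) for k in kept):
            kept.append(video)
    return kept
```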
In this embodiment, a preset similarity threshold is obtained, whether the similarity between video frames exceeds it is judged, and when it does, the two current video frames are determined to be a similar video frame pair. The number of similar pairs among the preset number of video frame pairs extracted from any two pieces of video data is then obtained, whether duplicate video exists between the two pieces is determined from the preset similar-video-frame-pair threshold and the number of similar pairs, and when duplicate video exists, video deduplication is performed on the two pieces. Computing similarity from the video frames, determining the corresponding similar frame pairs, and only then judging from the number of similar pairs whether the two compared videos are the same improves the accuracy of the video similarity comparison, avoids missing duplicate videos, and improves the video deduplication processing efficiency of the video platform.
In an embodiment, as shown in fig. 4, the step of obtaining a trained feature extraction network specifically includes:
step S402, randomly collecting a visual media data sample set.
Specifically, a sample set of visual media data, such as video data, picture data, and emoticon data, is randomly collected.
Step S404, acquiring different area images of visual media data in a visual media data sample set as a training visual media data positive sample, wherein the visual media data positive sample comprises area images with character area characteristics.
Specifically, for the same piece of visual media data in the sample set, images of different regions of that piece are collected as positive samples of training visual media data. Collecting images of different regions of the same piece establishes whether the piece contains text and where on the piece the text region is located, so the positive samples include region images bearing text region features.
Step S406, obtaining the same area image of different visual media data in the visual media data sample set as a training visual media data negative sample.
Specifically, for different pieces of visual media data in the sample set, images of the same region of those different pieces are collected, for example an image of one region of the current piece together with images of the corresponding regions of other pieces, and used as negative samples of training visual media data.
And step S408, training the initial feature extraction network according to the training visual media data positive sample and the training visual media data negative sample to obtain the trained feature extraction network.
Specifically, a training visual media data sample set can be obtained according to a training visual media data positive sample and a training visual media data negative sample, and then the initial feature extraction network is trained according to the training visual media data sample set to obtain a trained feature extraction network.
To reduce the amount of labeling, a self-supervised learning method can be used for training: different regions of the same image are randomly taken as positive samples and different parts of different images as negative samples, and a similarity network is trained to finally obtain the trained feature extraction network for extracting the image features and text region features of video frames. The initial feature extraction network can be a network model of any type or structure, as long as it can perform the required feature extraction; its specific type is not limited.
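The self-supervised scheme described above can be sketched with a contrastive (InfoNCE-style) objective in PyTorch; the encoder architecture, temperature, and loss form are illustrative assumptions, since this application does not fix the network type.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(encoder, crops_a, crops_b, temperature=0.1):
    """crops_a[i] and crops_b[i] are two regions of the same image
    (positive pair); regions of different images act as negatives."""
    za = F.normalize(encoder(crops_a), dim=1)   # (N, D) embeddings
    zb = F.normalize(encoder(crops_b), dim=1)
    logits = za @ zb.t() / temperature          # (N, N) pairwise similarities
    # Diagonal entries are the positive pairs; off-diagonals are negatives.
    targets = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, targets)
```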
In this embodiment, a visual media data sample set is randomly collected; images of different regions of the same piece of visual media data in the set are collected as positive training samples, which include region images bearing text region features, and images of the same region of different pieces are collected as negative training samples. The initial feature extraction network is then trained on these positive and negative samples to obtain the trained feature extraction network. Training on negative samples together with positive samples whose region images include text region features enables the trained network to extract image features and text region features simultaneously, preventing pieces of visual media data with the same image features but different text region features from being classified as the same visual media data, improving the subsequent deduplication rate, reducing redundant deduplication operations, and further improving the deduplication processing efficiency of the visual media data.
In one embodiment, as shown in fig. 5, a visual media data deduplication processing method is provided, which specifically includes the following steps:
step S501, a visual media data sample set is randomly collected.
Step S502, acquiring different area images of visual media data in a visual media data sample set as a training visual media data positive sample, wherein the visual media data positive sample comprises area images with character area characteristics.
Step S503, acquiring the same area image of different visual media data in the visual media data sample set as the negative sample of the training visual media data.
And step S504, training the initial feature extraction network according to the training visual media data positive sample and the training visual media data negative sample to obtain a trained feature extraction network.
And step S505, when the visual media data are detected to have characters, respectively carrying out character area position detection on at least two visual media data based on the trained feature extraction network to generate character area features of the visual media data.
Step S506, based on the trained feature extraction network, image feature extraction is respectively carried out on the at least two visual media data, and image features of the at least two visual media data are obtained.
And step S507, carrying out image preprocessing on each visual media data to obtain a preprocessed image frame to be recognized.
And step S508, performing character segmentation and character recognition on the image frame to be recognized, and sequentially obtaining segmented characters.
Step S509, performing dimension reduction processing and feature extraction on the segmented character to obtain character features.
Step S510, performing feature classification and content identification based on the character features to obtain the text-content features of the visual media data.
Step S511, based on the image characteristics and the character region characteristics, performing visual similarity calculation to generate visual similarity values of at least two visual media data.
Step S512, performing character similarity calculation based on character content characteristics to generate character similarity values of at least two visual media data.
Step S513, the visual similarity value and the character similarity value are integrated to obtain the similarity between the visual media data.
Step S514, according to the similarity between the visual media data, at least two visual media data are processed with duplicate removal.
In the above visual media data deduplication processing method, the text region features on the visual media data are considered as well, preventing pieces of visual media data with the same image features but different text region features from being classified as the same visual media data. Considering multiple perspectives in combination improves the accuracy of the computed similarity between videos and avoids missing duplicate videos that should be deduplicated; at the same time, the multi-perspective approach avoids the conventional MD5-based method's over-sensitivity to interference factors in the video, improving the video deduplication rate and deduplication processing efficiency of the video platform.
In one embodiment, as shown in fig. 6, a visual media data deduplication processing method is provided, where when the visual media data is video data, the method specifically includes the following steps:
step S601, performing video frame extraction on at least two pieces of video data to obtain video frames.
Step S602, when detecting that the video frames have characters, respectively performing character area position detection on at least two video frames based on the trained feature extraction network to generate character area features of the video frames.
Step S603, performing image feature extraction on the at least two video frames based on the trained feature extraction network, respectively, to obtain image features of the at least two video frames.
Step S604, image preprocessing is carried out on each video frame to obtain a preprocessed image frame to be recognized.
And step S605, performing character segmentation and character recognition based on the image frame to be recognized, and sequentially obtaining segmented characters.
And step S606, performing dimension reduction processing and feature extraction on the segmented characters to obtain character features.
Step S607, the character feature is used for feature classification and content identification to obtain the character content feature of the video frame.
Step S608, based on the image features and the character region features, performing a visual similarity calculation to generate visual similarity values of at least two video frames.
Step S609, performing character similarity calculation based on character content characteristics to generate character similarity values of at least two video frames.
And step S610, integrating the visual similarity value and the character similarity value to obtain the similarity between video frames.
Step S611, obtain a preset similarity threshold, and determine whether the similarity between video frames is greater than the preset similarity threshold.
Step S612, when it is determined that the similarity between the video frames is greater than the preset similarity threshold, determining that the current two video frames are a similar video frame pair.
Step S613, the number of similar video frame pairs among the preset number of video frame pairs extracted from any two pieces of video data is obtained.
And step S614, determining whether repeated videos exist in the current two video data or not according to the preset similar video frame pair threshold and the number of the similar video frame pairs.
And step S615, when the repeated videos exist, video duplicate removal processing is carried out on the current two video data.
In the above visual media data deduplication processing method, the text region features on the visual media data are considered as well, preventing pieces of visual media data with the same image features but different text region features from being classified as the same visual media data. Considering multiple perspectives in combination improves the accuracy of the computed similarity between videos and avoids missing duplicate videos that should be deduplicated; at the same time, the multi-perspective approach avoids the conventional MD5-based method's over-sensitivity to interference factors in the video, improving the video deduplication rate and deduplication processing efficiency of the video platform.
The application also provides an application scene, and the application scene applies the visual media data deduplication processing method. Specifically, the application of the visual media data deduplication processing method in the application scenario is as follows:
and when the visual media data are pictures, directly extracting visual features of at least two pictures respectively to obtain the visual features of each picture, wherein the visual features comprise image features and character region features. Meanwhile, text information extraction is carried out on each picture to obtain text content characteristics of each picture, similarity analysis is carried out on at least two pieces of visual media data based on the visual characteristics and the text content characteristics to obtain similarity between the pictures, and de-duplication processing is carried out on at least two pictures according to the similarity between the pictures.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated otherwise, there is no strict order restriction on these steps, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may comprise multiple sub-steps or stages, which are not necessarily executed at the same moment but may be executed at different moments, and not necessarily sequentially; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 7, there is provided a visual media data deduplication processing apparatus, which may be implemented as part of a computer device using software modules, hardware modules, or a combination of the two. The apparatus specifically includes: a visual feature extraction module 702, a text content feature extraction module 704, a similarity analysis module 706, and a deduplication processing module 708, where:
the visual feature extraction module 702 is configured to perform visual feature extraction on at least two pieces of visual media data to obtain visual features of each piece of visual media data, where the visual features include image features and text region features.
The text content feature extraction module 704 is configured to perform text information extraction on each piece of visual media data to obtain the text content features of each piece of visual media data.
The similarity analysis module 706 is configured to perform similarity analysis on at least two pieces of visual media data based on the visual characteristics and the text content characteristics to obtain a similarity between the visual media data.
The deduplication processing module 708 is configured to perform deduplication processing on at least two pieces of visual media data according to similarity between the visual media data.
In this visual media data deduplication processing device, visual feature extraction is performed on at least two pieces of visual media data to obtain the visual features of each, where the visual features include image features and character region features; considering the character region features alongside the image features prevents visual media data with identical image features but different character region features from being classified as the same item. Text information is then extracted from each piece of visual media data to obtain its text content features, similarity analysis is performed on the visual media data based on the visual features and the text content features, and deduplication is carried out according to the resulting similarities. Combining multiple angles in this way improves the accuracy of the computed similarity between videos, avoids missing duplicate videos, sidesteps the excessive sensitivity to interfering factors of traditional md5-based deduplication, and improves the video deduplication rate and deduplication efficiency of the video platform.
In one embodiment, the visual feature extraction module is further configured to:
When characters are detected in the visual media data, perform character region position detection on the at least two pieces of visual media data based on the trained feature extraction network to generate the character region features of each; and perform image feature extraction on the at least two pieces of visual media data based on the trained feature extraction network to obtain their image features.
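The patent does not fix a network architecture; the PyTorch sketch below shows one plausible shape for the trained feature extraction network, with a shared backbone, one head producing the global image features, and one head scoring character (text) region positions. The ResNet-18 backbone and head dimensions are assumptions for illustration only.

```python
# One plausible shape for the trained feature extraction network (PyTorch).
import torch
import torch.nn as nn
from torchvision import models

class DedupFeatureNet(nn.Module):
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        resnet = models.resnet18(weights=None)
        # Shared backbone: all layers up to the final 512-channel feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        # Head 1: global image features.
        self.image_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, embed_dim))
        # Head 2: per-location score for character (text) region positions.
        self.text_region_head = nn.Conv2d(512, 1, kernel_size=1)

    def forward(self, x: torch.Tensor):
        fmap = self.backbone(x)
        return self.image_head(fmap), torch.sigmoid(self.text_region_head(fmap))
```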
In one embodiment, the similarity analysis module is further configured to:
Perform visual similarity calculation based on the image features and the character region features to generate visual similarity values for the at least two pieces of visual media data; perform character similarity calculation based on the character content features to generate character similarity values for the at least two pieces of visual media data; and integrate the visual similarity value and the character similarity value to obtain the similarity between the visual media data.
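One way to realize this integration is a weighted sum of per-channel cosine similarities, as in the sketch below; the feature layout and the weight values are illustrative assumptions, not prescribed by the patent.

```python
# Weighted fusion of per-channel cosine similarities.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def media_similarity(feat_a: dict, feat_b: dict,
                     w_image: float = 0.4, w_region: float = 0.3,
                     w_text: float = 0.3) -> float:
    # feat_*: {"image": ..., "text_region": ..., "text_content": ...} vectors.
    visual = (w_image * cosine(feat_a["image"], feat_b["image"])
              + w_region * cosine(feat_a["text_region"], feat_b["text_region"]))
    textual = w_text * cosine(feat_a["text_content"], feat_b["text_content"])
    return visual + textual   # integrated similarity between the two items
```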
In one embodiment, the visual media data deduplication processing apparatus further comprises a video frame extraction module configured to extract video frames from at least two pieces of video data.
The similarity between the visual media data is then the similarity between video frames, and the deduplication processing module is further configured to:
Acquire a preset similarity threshold and determine whether the similarity between video frames exceeds it; when it does, mark the current two video frames as a similar video frame pair; count the similar video frame pairs among the preset number of video frame pairs extracted from any two pieces of video data; determine whether the two videos are duplicates by comparing that count against a preset similar-video-frame-pair threshold; and, when duplicates exist, perform video deduplication processing on the two video data.
In one embodiment, a visual media data deduplication processing apparatus is provided, further comprising a feature extraction network training module configured to:
Randomly acquire a visual media data sample set; acquire images of different regions of the same visual media data in the sample set as positive training samples, where each positive sample includes a region image with character region features; acquire images of the same region of different visual media data in the sample set as negative training samples; and train the initial feature extraction network on these positive and negative samples to obtain the trained feature extraction network.
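This sampling scheme can be paired with a standard contrastive objective: two crops from different regions of the same media item form a positive pair, while crops of the same region taken from different media items form a negative pair. The margin loss below is a common choice rather than one mandated by the patent, and `model` is assumed to return the embedding as its first output, as in the earlier network sketch.

```python
# Contrastive training on the region-crop pairs described above.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     label: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    # label = 1 for positive pairs (different regions, same media item),
    # label = 0 for negative pairs (same region, different media items).
    dist = F.pairwise_distance(emb_a, emb_b)
    pos = label * dist.pow(2)
    neg = (1 - label) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()

def train_step(model, optimiser, crop_a, crop_b, label) -> float:
    optimiser.zero_grad()
    emb_a, _ = model(crop_a)   # embedding head of the network sketched earlier
    emb_b, _ = model(crop_b)
    loss = contrastive_loss(emb_a, emb_b, label)
    loss.backward()
    optimiser.step()
    return loss.item()
```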
In one embodiment, the text feature extraction module is further configured to:
Perform image preprocessing on each piece of visual media data to obtain a preprocessed image frame to be recognized; perform character segmentation and character recognition on the image frame to obtain the segmented characters in sequence; perform dimensionality reduction and feature extraction on the segmented characters to obtain character features; and perform feature classification and content identification based on the character features to obtain the character content features of the visual media data.
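A compact sketch of this pipeline follows, using pytesseract for segmentation and recognition and a character-level hashing vectorizer for the fixed-length content feature; both tool choices are assumptions, since the patent does not name specific components.

```python
# Preprocess -> recognize -> fixed-length character content feature.
import cv2
import pytesseract
from sklearn.feature_extraction.text import HashingVectorizer

def text_content_features(image_path: str, n_features: int = 256):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)                 # preprocessing
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    text = pytesseract.image_to_string(binary)                   # segmentation + recognition
    # Character-level hashing gives a fixed-length, low-dimensional feature.
    vec = HashingVectorizer(n_features=n_features, analyzer="char")
    return vec.transform([text]).toarray()[0]
```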
In one embodiment, the video frame extraction module is further configured to: acquire target video information and perform key frame extraction on the at least two pieces of video data according to the target video information to obtain video key frames; or perform timed video frame extraction on the at least two pieces of video data at a preset frame-extraction time interval to obtain corresponding fixed video frames.
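A sketch of the timed (fixed-interval) branch using OpenCV follows; the two-second interval is an illustrative assumption, and key-frame extraction driven by target video information would replace the sampling condition.

```python
# Timed frame extraction at a preset interval.
import cv2

def extract_frames(video_path: str, interval_sec: float = 2.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0     # fall back if FPS unreported
    step = max(1, int(round(fps * interval_sec)))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                     # keep one frame per interval
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```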
For specific limitations of the visual media data deduplication processing apparatus, reference may be made to the limitations of the visual media data deduplication processing method above, which are not repeated here. Each module in the apparatus may be implemented wholly or partially in software, hardware, or a combination of the two. The modules may be embedded in hardware form in, or independent of, a processor in the computer device, or stored in software form in a memory of the computer device, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 8. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing visual media data, image characteristics, character region characteristics, character content characteristics, similarity among the visual media data and other data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a visual media data deduplication processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 8 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or arrange components differently.
In one embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like, without limitation.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination of them that contains no contradiction should be considered to fall within the scope of this specification.
The above examples express only several embodiments of the present application, and their descriptions are specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that several variations and improvements can be made by those of ordinary skill in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method for visual media data deduplication processing, the method comprising:
respectively extracting visual features of at least two visual media data to obtain the visual features of the visual media data, wherein the visual features comprise image features and character region features;
extracting character information of each visual media data to obtain character content characteristics of each visual media data;
based on the visual features and the character content features, carrying out similarity analysis on the at least two visual media data to obtain the similarity between the visual media data;
and according to the similarity between the visual media data, carrying out duplicate removal processing on the at least two visual media data.
2. The method of claim 1, wherein the performing visual feature extraction on at least two visual media data respectively to obtain the visual feature of each visual media data comprises:
when the visual media data are detected to have characters, respectively carrying out character region position detection on at least two visual media data based on a trained feature extraction network to generate character region features of the visual media data;
and respectively extracting the image features of at least two visual media data based on the trained feature extraction network to obtain the image features of the at least two visual media data.
3. The method of claim 1, wherein the analyzing similarity of the at least two visual media data based on the visual characteristics and the text content characteristics to obtain the similarity between the visual media data comprises:
performing visual similarity calculation based on the image features and the character region features to generate visual similarity values of the at least two visual media data;
performing character similarity calculation based on the character content characteristics to generate character similarity values of the at least two visual media data;
and integrating the visual similarity value and the character similarity value to obtain the similarity between the visual media data.
4. A method according to any of claims 1 to 3, wherein the visual media data comprises video data; before the performing visual feature extraction on at least two pieces of visual media data respectively to obtain the visual feature of each piece of visual media data, the method further includes: performing video frame extraction on at least two pieces of video data to obtain video frames;
the similarity between the visual media data is the similarity between video frames; the performing, according to the similarity between the visual media data, a deduplication process on the at least two visual media data includes:
acquiring a preset similarity threshold, and judging whether the similarity between the video frames is greater than the preset similarity threshold;
when the similarity between the video frames is determined to be larger than the preset similarity threshold, determining that the current two video frames are a similar video frame pair;
acquiring the number of similar video frame pairs in a preset number of video frame pairs extracted from any two pieces of video data;
determining whether repeated videos exist in the two current video data or not according to a preset similar video frame pair threshold value and the number of the similar video frame pairs;
and when determining that the repeated videos exist, performing video deduplication processing on the current two video data.
5. A method according to claim 2 or 3, wherein obtaining a trained feature extraction network comprises:
randomly acquiring a visual media data sample set;
acquiring images of different areas of visual media data in the visual media data sample set as a training visual media data positive sample; the visual media data positive sample comprises a regional image with character regional characteristics;
acquiring the same area images of different visual media data in the visual media data sample set as training visual media data negative samples;
and training the initial feature extraction network according to the training visual media data positive sample and the training visual media data negative sample to obtain a trained feature extraction network.
6. The method according to any one of claims 1 to 3, wherein the extracting text information from each of the visual media data to obtain text content characteristics of each of the visual media data comprises:
carrying out image preprocessing on each visual media data to obtain a preprocessed image frame to be recognized;
performing character segmentation and character recognition on the image frame to be recognized to sequentially obtain segmented characters;
performing dimension reduction processing and feature extraction on the segmented characters to obtain character features;
and performing feature classification and content identification based on the character features to obtain the character content features of the visual media data.
7. The method of claim 4, wherein the performing video frame extraction on at least two pieces of video data to obtain video frames comprises:
acquiring target video information, and performing key frame extraction on the at least two pieces of video data according to the target video information to obtain video key frames; or
And performing timing video frame extraction on the at least two video data according to a preset frame extraction time interval to obtain corresponding fixed video frames.
8. A visual media data deduplication processing apparatus, the apparatus comprising:
the visual feature extraction module is used for respectively extracting visual features of at least two pieces of visual media data to obtain the visual features of the visual media data, and the visual features comprise image features and character region features;
the text content characteristic extraction module is used for extracting text information of each visual media data to obtain text content characteristics of each visual media data;
the similarity analysis module is used for carrying out similarity analysis on the at least two visual media data based on the visual characteristics and the character content characteristics to obtain the similarity between the visual media data;
and the duplication elimination processing module is used for carrying out duplication elimination processing on the at least two visual media data according to the similarity between the visual media data.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202111541971.3A 2021-12-16 2021-12-16 Visual media data deduplication processing method, device, equipment and storage medium Pending CN114329050A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111541971.3A CN114329050A (en) 2021-12-16 2021-12-16 Visual media data deduplication processing method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111541971.3A CN114329050A (en) 2021-12-16 2021-12-16 Visual media data deduplication processing method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114329050A true CN114329050A (en) 2022-04-12

Family

ID=81052508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111541971.3A Pending CN114329050A (en) 2021-12-16 2021-12-16 Visual media data deduplication processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114329050A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116887010A (en) * 2023-05-08 2023-10-13 武汉精阅数字传媒科技有限公司 Self-media short video material processing control system
CN116887010B (en) * 2023-05-08 2024-02-02 杭州元媒科技有限公司 Self-media short video material processing control system

Similar Documents

Publication Publication Date Title
US20200117906A1 (en) Space-time memory network for locating target object in video content
CN111062871B (en) Image processing method and device, computer equipment and readable storage medium
CN111738244B (en) Image detection method, image detection device, computer equipment and storage medium
CN111754541B (en) Target tracking method, device, equipment and readable storage medium
CN111209897B (en) Video processing method, device and storage medium
CN111275784A (en) Method and device for generating image
CN110796204A (en) Video tag determination method and device and server
CN112257665A (en) Image content recognition method, image recognition model training method, and medium
CN111680678A (en) Target area identification method, device, equipment and readable storage medium
CN113761253A (en) Video tag determination method, device, equipment and storage medium
CN114282059A (en) Video retrieval method, device, equipment and storage medium
CN113762326A (en) Data identification method, device and equipment and readable storage medium
CN112668638A (en) Image aesthetic quality evaluation and semantic recognition combined classification method and system
CN110457523B (en) Cover picture selection method, model training method, device and medium
CN114329050A (en) Visual media data deduplication processing method, device, equipment and storage medium
CN113822134A (en) Instance tracking method, device, equipment and storage medium based on video
Cheng et al. Activity guided multi-scales collaboration based on scaled-CNN for saliency prediction
Chowdhury et al. Review on deep fake: A looming technological threat
CN113590854B (en) Data processing method, data processing equipment and computer readable storage medium
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
Ernawati et al. Image Splicing Forgery Approachs: A Review and Future Direction
CN114332599A (en) Image recognition method, image recognition device, computer equipment, storage medium and product
Du et al. A Low Overhead Progressive Transmission for Visual Descriptor Based on Image Saliency.
CN111680722B (en) Content identification method, device, equipment and readable storage medium
CN116450882A (en) Video retrieval method, apparatus, device, storage medium and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination