WO2023024749A1 - Video retrieval method, apparatus, device and storage medium - Google Patents

Video retrieval method, apparatus, device and storage medium Download PDF

Info

Publication number
WO2023024749A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
quantization
texture
model
feature
Prior art date
Application number
PCT/CN2022/105871
Other languages
English (en)
French (fr)
Inventor
郭卉
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP22860095.3A (published as EP4390725A1)
Publication of WO2023024749A1
Priority to US18/136,538 (published as US20230297617A1)

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G06F16/7857Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content using texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/732Query formulation
    • G06F16/7328Query by example, e.g. a complete video frame or video sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/54Extraction of image or video features relating to texture
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/778Active pattern-learning, e.g. online learning of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences

Definitions

  • the present application relates to the field of computers, in particular to the field of artificial intelligence, and provides a video retrieval method, apparatus, device and storage medium.
  • Method 1 obtains the corresponding quantization features based on the k-means clustering (K-means) algorithm; however, when clustering large-scale sample data, a large amount of resources is required to obtain enough quantization features in order to guarantee the accuracy of index retrieval.
  • Method 2 obtains the corresponding quantization features based on product quantization (PQ); however, the quantization features obtained in this way suffer losses during generation, which reduces the generation accuracy of the quantization features and in turn affects the recall (match) performance of video retrieval.
  • Method 3 obtains the corresponding quantization features based on a deep learning neural network.
  • However, this neural network first extracts the embedding features of the video image and then performs feature extraction on the embedding features to obtain the corresponding quantization features; the losses incurred during this generation process reduce the generation accuracy of the quantization features, which in turn affects the recall performance of video retrieval.
  • Embodiments of the present application provide a video retrieval method, apparatus, device, and storage medium to solve the problems of low quantization efficiency and low accuracy.
  • the embodiment of the present application provides a method for video retrieval, including:
  • performing feature extraction on a video to be retrieved by using a target image processing sub-model of a trained target video retrieval model, to obtain corresponding image features;
  • performing feature extraction on the image features by using a target quantization processing sub-model of the target video retrieval model, to obtain a corresponding first quantization feature, and screening out, from first candidate videos based on the first quantization feature, at least one second candidate video whose category similarity with the video to be retrieved meets a set category similarity requirement; wherein the quantization control parameters of the target quantization processing sub-model are adjusted during training based on a texture feature loss value corresponding to each training sample, and the texture feature loss value is determined, during parameter adjustment of the texture processing sub-model to be trained, based on texture control parameters preset for the texture processing sub-model to be trained;
  • based on content similarity between the video to be retrieved and the at least one second candidate video, outputting a second candidate video whose content similarity meets a set content similarity requirement as a corresponding target video.
  • the embodiment of the present application also provides a video retrieval device, including:
  • the image processing unit is used to adopt the target image processing sub-model of the trained target video retrieval model to perform feature extraction on the video to be retrieved to obtain corresponding image features;
  • the quantization processing unit is configured to use the target quantization processing sub-model of the target video retrieval model to perform feature extraction on the image features to obtain a corresponding first quantization feature, and, based on the first quantization feature, to screen out from the first candidate videos at least one second candidate video whose category similarity with the video to be retrieved meets the set category similarity requirement; wherein the quantization control parameters of the target quantization processing sub-model are adjusted during training based on the texture feature loss value corresponding to each training sample, and the texture feature loss value is determined, during parameter adjustment of the texture processing sub-model to be trained, based on the texture control parameters preset for the texture processing sub-model to be trained;
  • the retrieval unit is configured to output, as a corresponding target video, a second candidate video whose content similarity meets a set content similarity requirement based on the content similarity between the video to be retrieved and the at least one second candidate video.
  • the embodiment of the present application also provides a computer device, including a processor and a memory, wherein the memory stores program code, and when the program code is executed by the processor, the processor is caused to execute the steps of any one of the above video retrieval methods.
  • the embodiment of the present application also provides a computer-readable storage medium, which includes program code; when the program code runs on a computer device, the program code is used to make the computer device execute the steps of any one of the above video retrieval methods.
  • the embodiment of the present application further provides a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps of any one of the above video retrieval methods.
  • Embodiments of the present application provide a video retrieval method, apparatus, device, and storage medium, the method including: performing feature extraction on the image features of the video to be retrieved to obtain a first quantization feature; then, based on the first quantization feature, obtaining second candidate videos with a high category similarity to the video to be retrieved; and finally outputting second candidate videos with a high content similarity to the video to be retrieved as target videos.
  • Since the quantization control parameters of the target quantization processing sub-model are adjusted according to the texture feature loss value corresponding to each training sample, the target quantization processing sub-model can learn the sorting ability of the target texture feature sub-model, ensuring that the sorting effects of the two sub-models tend to be consistent and avoiding random sorting of the target quantization processing sub-model caused by fixed quantization control parameters. Owing to the end-to-end model architecture, the target quantization processing sub-model trained in the above way can obtain corresponding quantization features based on image features, which reduces the losses in the process of generating quantization features and improves the generation accuracy of quantization features. In addition, the embodiment of the present application also optimizes the sorting ability of the target quantization processing sub-model, further improving the recall performance of video retrieval.
  • Figure 1a is a schematic diagram of an application scenario in an embodiment of the present application.
  • Figure 1b is a schematic diagram of the first display interface provided by the embodiment of the present application.
  • Figure 1c is a schematic diagram of the second display interface provided by the embodiment of the present application.
  • Figure 1d is a schematic diagram of the structure of the target video retrieval model provided by the embodiment of the present application.
  • Figure 1e is a schematic diagram of the architecture of the quantization processing model used in the related art
  • Figure 2a is a schematic flow diagram of the training target video retrieval model provided by the embodiment of the present application.
  • Fig. 2b is a schematic flow diagram of mining multiple sample triplets provided by the embodiment of the present application.
  • Fig. 2c is a schematic flow diagram of the first method for generating quantized feature loss values provided by the embodiment of the present application.
  • Fig. 2d is a schematic flow diagram of the second method for generating quantized feature loss values provided by the embodiment of the present application.
  • FIG. 3a is a schematic flow diagram of establishing an index table and a mapping table provided by the embodiment of the present application
  • FIG. 3b is a logical schematic diagram of establishing an index table and a mapping table provided by the embodiment of the present application;
  • Fig. 4a is a schematic flow chart of the video retrieval method provided by the embodiment of the present application.
  • Fig. 4b is a logical schematic diagram of a specific embodiment of the application video retrieval method provided by the embodiment of the present application;
  • FIG. 5 is a schematic structural diagram of a device for video retrieval provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the composition and structure of a computer device provided in the embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a computing device in an embodiment of the present application.
  • the embodiment of the present application relates to the field of artificial intelligence (AI), and is designed based on machine learning (Machine Learning, ML) and computer vision (Computer Vision, CV) technologies.
  • the solutions provided in the embodiments of the present application involve deep learning of artificial intelligence, augmented reality and other technologies, and are further described in detail through the following embodiments.
  • Embodiments of the present application provide a video retrieval method, apparatus, device, and storage medium to solve the problems of low quantization efficiency and low accuracy.
  • The embodiments of the present application can be applied to various video retrieval scenarios. For example, in a video infringement scenario, the video retrieval method provided by the embodiments of the present application is used to recall a batch of videos with high content similarity to the video to be retrieved, and the recalled videos are determined to be infringing videos.
  • The method includes: performing feature extraction on the image features of the video to be retrieved to obtain a first quantization feature; then, based on the first quantization feature, obtaining second candidate videos with a high category similarity to the video to be retrieved; and finally outputting second candidate videos with a high content similarity to the video to be retrieved as target videos. Since the quantization control parameters of the target quantization processing sub-model are adjusted according to the texture feature loss value corresponding to each training sample, the target quantization processing sub-model can learn the sorting ability of the target texture feature sub-model, ensuring that the sorting effects of the two sub-models tend to be consistent and avoiding random sorting of the target quantization processing sub-model caused by fixed quantization control parameters.
  • the target quantization processing sub-model trained in the above way can obtain corresponding quantization features based on image features, which reduces the loss in the process of generating quantization features and improves the generation accuracy of quantization features.
  • the embodiment of the present application also optimizes the sorting ability of the target quantization processing sub-model, and further improves the recall performance of the video retrieval.
  • As shown in FIG. 1a and FIG. 1b, the application scenario of the embodiment of the present application includes two physical terminal devices 110 and one target server 130.
  • The target object (for example, a user) can log in to the video retrieval client through the physical terminal device 110, and a retrieval interface is presented on the display screen 120 of the physical terminal device 110. After that, the target object inputs the image of the video to be retrieved in the retrieval interface, so that the target video retrieval model running on the target server 130 obtains, based on the image of the video to be retrieved, target videos with a high content similarity to the video to be retrieved from the huge video library connected to the background port. After the physical terminal device 110 receives all the target videos returned by the target server 130, each target video is presented on the display interface of the display screen 120. At the same time, the user can also check the video details of a selected target video through gesture operations such as clicking on the page, and similar segments or duplicate segments of the target video and the video to be retrieved will be marked.
  • the display interface shown in Figure 1b presents an excerpt of a TV series.
  • For the played segment, the corresponding progress bar color is white; for the unplayed segment, the corresponding progress bar color is black; and for similar segments or repeated segments, the color of the corresponding progress bar is gray.
  • the user can roughly estimate the similarity between the target video and the video to be retrieved through the color of the progress bar, which is convenient for the user to judge the infringement of the video creation.
  • the display interface shown in Figure 1c presents an excerpt of a TV series.
  • For the played segment, the corresponding progress bar color is white; for the unplayed segment, the corresponding progress bar color is black; and for similar segments or repeated segments, triangular markers or markers of other shapes are used on the progress bar to mark the start and end points of these segments, so that users can jump directly to the corresponding plot by clicking the markers.
  • Users can also roughly estimate the similarity between the target video and the video to be retrieved through the number of marked points on the progress bar.
  • the physical terminal device 110 is an electronic device used by a user, and the electronic device may be a computer device such as a personal computer, a mobile phone, a tablet computer, a notebook computer, an e-book reader, or a smart home device.
  • Each physical terminal device 110 communicates with the target server 130 through the communication network.
  • the communication network is a wired network or a wireless network. Therefore, each physical terminal device 110 can directly or indirectly establish a communication connection with the target server 130 through a wired network or a wireless network, which is not limited in this application.
  • the target server 130 can be an independent physical server, or a server cluster or a distributed system composed of multiple physical servers, and can also provide cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication , middleware service, domain name service, security service, content delivery network (Content Delivery Network, CDN), big data and artificial intelligence platform and other cloud servers for basic cloud computing services, this application is not limited here.
  • the target video retrieval model is deployed on the target server 130.
  • the target video retrieval model includes a target image processing sub-model, a target texture processing sub-model and a target quantization processing sub-model.
  • The target image processing sub-model and the target texture processing sub-model are both deep learning network models constructed with the ResNet_101 network architecture, and model pre-training is performed based on ImageNet.
  • ImageNet is a large-scale general-purpose object recognition open-source dataset that contains a large amount of pre-labeled image data covering about 1,000 categories. Therefore, a deep learning network model pre-trained on ImageNet has stable model parameters, and the overall model has better versatility.
  • the target quantization processing sub-model is used to perform binary quantization processing on complex high-dimensional image features, and the high-dimensional image features are compressed into binary codes of specified bits (ie, quantized features).
  • the quantitative feature is used as an index to recall the corresponding target video, which greatly reduces the calculation time and complexity, is more conducive to calculation, and is very beneficial to the retrieval of massive data.
  • each bit is 0 or 1, such as compressing 128-dimensional image features into 4-bit binary coding 0100.
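  • As an illustration only, the following Python snippet sketches this kind of binary quantization under the simple assumption that each output dimension of the quantization sub-model is thresholded by its sign (bit 1 for non-negative values, 0 otherwise); the dimensionalities and the thresholding rule are assumptions for illustration, not the patented implementation.

```python
import numpy as np

def binarize(code: np.ndarray) -> str:
    """Compress a real-valued quantization output into a binary string.

    Assumption for illustration: bit i is '1' when code[i] >= 0, else '0',
    mirroring the sign-based quantization described for b_i later in the text.
    """
    return "".join("1" if v >= 0 else "0" for v in code)

# Example: a hypothetical 4-dimensional quantization output for a
# 128-dimensional image feature, compressed to the 4-bit code "0100".
print(binarize(np.array([-0.7, 0.3, -0.1, -0.9])))  # -> "0100"
```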
  • the target texture processing sub-model and the target quantization processing sub-model in the embodiment of the present application are two sub-models placed in parallel.
  • The advantage of this deployment is that, in the training phase, the quantization control parameters of the target quantization processing sub-model are adjusted according to the texture feature loss value corresponding to each training sample, so that the target quantization processing sub-model can learn the sorting ability of the target texture feature sub-model, ensuring that the sorting effects of the two sub-models tend to be consistent and avoiding random sorting of the target quantization processing sub-model caused by fixed quantization control parameters.
  • The embodiment of the present application adopts an end-to-end model architecture, which can obtain corresponding quantization features based on image features, reducing the loss in the process of generating quantization features and improving the generation accuracy of quantization features.
  • the embodiment of the present application also optimizes the sorting ability of the target quantization processing sub-model, and further improves the recall performance of the video retrieval.
  • When processing large-scale retrieval videos, the target video retrieval model built based on artificial intelligence technology in the embodiment of the present application has better processing speed and recall performance than the traditional k-means clustering algorithm, and also consumes fewer resources.
  • The training process of the target video retrieval model is specifically divided into a pre-training stage and a fine-tuning joint learning stage.
  • The training data used in the two training stages is the same.
  • The difference lies in that the network parameters to be learned and the resulting loss values are different in the two training stages.
  • In the pre-training stage, the image processing texture feature loss value is used to adjust the parameters of the image processing sub-model to be trained and the texture processing sub-model to be trained, to obtain a candidate image processing sub-model and a candidate texture processing sub-model;
  • In the joint learning stage, the texture feature loss value is used to adjust the parameters of the candidate texture processing sub-model, and the quantization feature loss value is used to adjust the parameters of the candidate image processing sub-model and the quantization processing sub-model to be trained, to obtain the target image processing sub-model, the target texture processing sub-model, and the target quantization processing sub-model.
  • each sample triplet includes a sample video, and a positive label and a negative label associated with the sample video.
  • the embodiment of the present application uses labeled training data, so that the quantization processing sub-model to be trained can learn positive labels and negative labels at the same time, thereby improving the target quantization processing sub-model recall effect.
  • the positive label refers to a sample video with high content similarity to the sample video
  • the negative label refers to a sample video with only a small amount of the same or similar content as the sample video.
  • the embodiment of the present application can obtain multiple sample triplets by performing the following operations:
  • S2011 Obtain a similar sample set including a plurality of similar sample pairs, each similar sample pair includes a sample video and a forward label associated with the sample video.
  • the image processing sub-model to be trained and the texture processing sub-model to be trained are all deep learning network models based on the pre-training of the large-scale general object recognition open source dataset ImageNet.
  • Each similar sample pair is sequentially input into the above two sub-models, and the texture feature groups corresponding to each similar sample pair can be obtained.
  • Each texture feature group includes the texture features of the sample video and the texture features of the forward label.
  • S2014 For each other similar sample pair in the similar sample set, respectively perform the following operations: based on the texture feature of the sample video c, and the texture feature of any other sample video in one other similar sample pair, obtain the corresponding texture feature distance.
  • the Euclidean distance between two texture features is used as the corresponding texture feature distance.
  • S2015 Arrange each other sample video in the order of the distance of the texture feature.
  • The other sample videos are arranged in order of texture feature distance from near to far. According to the above description of the texture feature distance, the content similarity between the top k% of other sample videos and the sample video is very high.
  • Since the negative label to be mined is a sample video that has only a small amount of content identical or similar to the sample video, the other sample videos in the top k% obviously do not meet the definition of a negative label and are removed as interference noise; here, k is a controllable value, and the greater the interference noise, the greater the corresponding value of k.
  • For example, assuming that the similar sample pair is (sample video 1, sample video 2), Table 1 shows the texture feature distances between the other sample videos and sample video 1. The top k% of other sample videos, such as sample video 3 and sample video 6, are removed, and sample video 8 and sample video 9, which rank extremely low, are also removed, finally screening out the following multiple sample triplets: (sample video 1, sample video 2, other sample video 7), (sample video 1, sample video 2, other sample video 4), (sample video 1, sample video 2, other sample video 5).
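  • A minimal sketch of this mining step is shown below; it assumes Euclidean texture feature distances, and the drop ratio for the top-ranked noise (k%) and the number of lowest-ranked videos to discard are illustrative, tunable values rather than fixed by the patent.

```python
import numpy as np

def mine_triplets(anchor_id, anchor_feat, positive_id, others, k_percent=20, drop_tail=2):
    """Build (anchor, positive, negative) triplets from one similar sample pair.

    anchor_id   : identifier of the sample video
    anchor_feat : texture feature of the sample video (1-D array)
    positive_id : identifier of the positive-label sample video
    others      : list of (video_id, texture_feature) for other sample videos
    k_percent   : top share treated as interference noise (assumed value)
    drop_tail   : number of lowest-ranked videos to discard (assumed value)
    """
    # Rank other sample videos by Euclidean texture feature distance (near -> far).
    ranked = sorted(others, key=lambda it: np.linalg.norm(anchor_feat - np.asarray(it[1])))
    head = int(len(ranked) * k_percent / 100)          # too similar -> interference noise
    kept = ranked[head:len(ranked) - drop_tail]        # drop extremely low-ranked videos
    return [(anchor_id, positive_id, vid) for vid, _ in kept]
```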
  • step S2017 Determine whether all similar sample pairs in the similar sample set have been read, and if so, perform step S2018; otherwise, return to step S2013.
  • S202: Read a sample triplet d, use the sample triplet d as training data, and sequentially input the sample triplet d into the image processing sub-model to be trained and the texture processing sub-model to be trained, to obtain a corresponding first texture set.
  • The following Formula 1 is used to generate the corresponding image processing texture feature loss value; then, based on the image processing texture feature loss value, the stochastic gradient descent (SGD) method is used to adjust the parameters of the image processing sub-model to be trained and the texture processing sub-model to be trained.
  • In Formula 1, L_em is the image processing texture feature loss value, x_a is the first sample texture feature of the sample video, x_p is the first sample texture feature of the positive label, x_n is the first sample texture feature of the negative label, ||x_a - x_p|| represents the texture feature distance between the positive sample pair, ||x_a - x_n|| represents the texture feature distance between the negative sample pair, and margin_em represents the texture control parameter.
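  • Although the formula image is not reproduced here, the variable definitions above describe a triplet margin loss over texture features; the PyTorch-style sketch below is one possible reading, with the max(., 0) hinge form being an assumption for illustration.

```python
import torch

def texture_triplet_loss(x_a, x_p, x_n, margin_em: float):
    """Assumed form of Formula 1: hinge loss over texture feature distances.

    x_a, x_p, x_n : first sample texture features of the sample video,
                    positive label and negative label (tensors of equal shape)
    margin_em     : preset texture control parameter
    """
    d_pos = torch.norm(x_a - x_p, p=2, dim=-1)   # distance of the positive pair
    d_neg = torch.norm(x_a - x_n, p=2, dim=-1)   # distance of the negative pair
    return torch.clamp(d_pos - d_neg + margin_em, min=0.0).mean()
```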
  • step S204 Determine whether the image processing texture feature loss value is higher than the preset image processing texture feature loss threshold value, if yes, return to step S202; otherwise, execute step S205.
  • S205 Stop the iterative training sub-model, and output the candidate image processing sub-model and the candidate texture processing sub-model obtained in the last iteration.
  • S207 Generate corresponding texture feature loss values based on multiple second sample texture features included in the second texture set, and adjust parameters of candidate texture processing sub-models based on the texture feature loss values.
  • In Formula 2, L_em' is the texture feature loss value, x_a' is the second sample texture feature of the sample video, x_p' is the second sample texture feature of the positive label, x_n' is the second sample texture feature of the negative label, ||x_a' - x_p'|| represents the texture feature distance between the positive sample pair, ||x_a' - x_n'|| represents the texture feature distance between the negative sample pair, and margin_em represents the texture control parameter.
  • the first method of generating a quantized feature loss value is introduced.
  • Formula 3: margin_i = margin0 × L_em_i / Mem, where margin_i represents the quantization control parameter of the i-th sample triplet, margin0 is the preset Hamming distance, Mem is the ratio between the texture feature distance and the Hamming distance, and L_em_i is the texture feature loss value of the i-th sample triplet.
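  • A one-line sketch of Formula 3 follows; the names margin0 and Mem are taken from the definitions above, and treating them as plain scalars is an assumption for illustration.

```python
def adaptive_margin(l_em_i: float, margin0: float, mem: float) -> float:
    """Formula 3: per-triplet quantization control parameter.

    l_em_i  : texture feature loss value of the i-th sample triplet
    margin0 : preset Hamming distance
    mem     : ratio between the texture feature distance and the Hamming distance
    """
    return margin0 * l_em_i / mem
```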
  • S2082 Based on a plurality of sample quantization features and quantization control parameters included in a quantization feature group, respectively determine a training sample loss value and a symbol quantization loss value of the quantization processing sub-model to be trained.
  • In the formula, L_triplet is the training sample loss value; denoting the sample quantization features of the sample video, the positive label and the negative label as u_a, u_p and u_n respectively, ||u_a - u_p|| represents the quantized feature distance between the positive sample pair, ||u_a - u_n|| represents the quantized feature distance between the negative sample pair, and margin_i represents the quantization control parameter of the i-th sample triplet.
  • L_coding is the symbol quantization loss value, u_i represents the i-th bit in the sample quantization feature, and b_i represents the i-th bit of the symbol quantization feature; if u_i is negative, the value of b_i is -1, otherwise the value of b_i is 1.
  • In Formula 7, L_q is the quantization feature loss value, L_triplet is the training sample loss value of the quantization processing sub-model to be trained, w_21 is the weight assigned to the training sample loss value, L_coding is the symbol quantization loss value of the quantization processing sub-model to be trained, and w_22 is the weight assigned to the symbol quantization loss value.
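  • The sketch below combines the pieces defined above into one quantization feature loss; the exact forms of L_triplet (a hinge over quantized feature distances) and L_coding (a squared distance between the real-valued code and its sign) are assumptions consistent with the variable descriptions, not a verbatim copy of the patent's formulas, and the weight values are illustrative.

```python
import torch

def quantization_loss(u_a, u_p, u_n, margin_i, w21=1.0, w22=0.1):
    """Assumed composition of the quantization feature loss (first way).

    u_a, u_p, u_n : sample quantization features of the sample video,
                    positive label and negative label
    margin_i      : quantization control parameter of the i-th sample triplet
    w21, w22      : weights of the two loss terms (values are illustrative)
    """
    # L_triplet: hinge loss over quantized feature distances.
    d_pos = torch.norm(u_a - u_p, p=2, dim=-1)
    d_neg = torch.norm(u_a - u_n, p=2, dim=-1)
    l_triplet = torch.clamp(d_pos - d_neg + margin_i, min=0.0).mean()

    # L_coding: keep each real-valued bit u_i close to its sign b_i (+1 / -1).
    b_a = torch.where(u_a < 0, -torch.ones_like(u_a), torch.ones_like(u_a))
    l_coding = torch.mean((u_a - b_a) ** 2)

    return w21 * l_triplet + w22 * l_coding   # assumed weighted sum (Formula 7)
```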
  • the embodiment of the present application also provides a second way of generating a quantized feature loss value.
  • S2081' Determine the training sample loss value and the symbol quantization loss value of the quantization processing sub-model to be trained based on multiple sample quantization features included in a quantization feature group.
  • In the formula, L_q is the quantization feature loss value, L_triplet is the training sample loss value of the quantization processing sub-model to be trained, w_21 and the texture feature loss value L_em' together serve as the weights of the training sample loss value, L_coding is the symbol quantization loss value of the quantization processing sub-model to be trained, and w_22 is the weight assigned to the symbol quantization loss value.
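  • Under the same assumptions as the previous sketch, the second way appears to differ only in that the training sample loss is additionally weighted by the texture feature loss value L_em', e.g.:

```python
def quantization_loss_v2(l_triplet, l_coding, l_em_prime, w21=1.0, w22=0.1):
    """Assumed second way: L_triplet is further weighted by the texture loss L_em'."""
    return w21 * l_em_prime * l_triplet + w22 * l_coding
```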
  • step S209 Determine whether the texture feature loss value and the quantization feature loss value are higher than the preset image processing texture feature loss threshold, if so, return to step S206; otherwise, execute step S210.
  • S210 Stop the iterative training sub-model, and output the target image processing sub-model, the target texture processing sub-model and the target quantization processing sub-model obtained in the last iteration.
  • an index table and a mapping table of the video database are established by using the trained target video retrieval model.
  • S302 Input the first candidate video s into the target video retrieval model, and obtain corresponding initial quantization features and second texture features;
  • S303 Add the second texture feature to the mapping table, and respectively determine the quantized feature distance between the initial quantized feature and each second quantized feature recorded in the index table;
  • step S305 Determine whether all the first candidate videos in the video database have been read, if so, execute step S306; otherwise, return to step S301;
  • step S303 when step S303 is executed, if the index table is empty, the initial quantized feature of the first candidate video s is added to the index table as the second quantized feature.
  • For example, the index table is L_index: [q1: [img1, img2, img6], q2: [img3], q3: [img4]].
  • The index table includes multiple second quantization features, and each quantization feature corresponds to at least one first candidate video; therefore, each second quantization feature characterizes the video category to which at least one first candidate video corresponds. The mapping table is, for example, T: [[img1, embedding1], [img2, embedding2], ..., [img6, embedding6]], and includes multiple first candidate videos and their corresponding second texture features.
  • the process shown in FIG. 3a may also be executed to establish corresponding index relationships and mapping relationships.
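  • As a rough illustration, the index table and mapping table from the example above can be represented as plain dictionaries; the helper name add_candidate below is hypothetical and only mirrors the flow of FIG. 3a.

```python
from collections import defaultdict

index_table = defaultdict(list)   # second quantization feature -> first candidate videos
mapping_table = {}                # first candidate video -> second texture feature

def add_candidate(video_id, quant_feature, texture_feature):
    """Register one first candidate video in both tables (hypothetical helper)."""
    index_table[quant_feature].append(video_id)   # e.g. index_table["q1"] -> ["img1", "img2", "img6"]
    mapping_table[video_id] = texture_feature     # e.g. mapping_table["img1"] -> embedding of img1

add_candidate("img1", "q1", [0.12, 0.05])          # illustrative values only
add_candidate("img3", "q2", [0.33, 0.78])
```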
  • the video retrieval method provided by the embodiment of the present application is applied to the trained target video retrieval model.
  • S401 Using the target image processing sub-model of the trained target video retrieval model, perform feature extraction on the video to be retrieved to obtain corresponding image features.
  • The complete video of the video to be retrieved can be input into the target image processing sub-model to obtain a corresponding image feature; alternatively, key frames can first be extracted from the video to be retrieved, and the obtained multiple key frames are then input into the target image processing sub-model to obtain corresponding multiple image features.
  • S402: Use the target quantization processing sub-model of the target video retrieval model to perform feature extraction on the image features to obtain a corresponding first quantization feature, and based on the first quantization feature, screen out from the first candidate videos at least one second candidate video whose category similarity with the video to be retrieved meets the set category similarity requirement; wherein the quantization control parameters of the target quantization processing sub-model are adjusted based on the texture feature loss value corresponding to each training sample during the training process, and the texture feature loss value is determined, during parameter adjustment of the texture processing sub-model to be trained, based on the texture control parameters preset for the texture processing sub-model to be trained.
  • the quantization control parameters of the target quantization processing sub-model are adjusted according to the texture feature loss value corresponding to each training sample, so that the target quantization processing sub-model learns the sorting ability of the target texture feature sub-model, ensuring that the sorting effects of the two sub-models tend to be consistent , to avoid random ordering of target quantization processing sub-models due to fixed quantization control parameters.
  • the end-to-end model architecture enables the target quantization processing sub-model to obtain corresponding quantization features based on image features, which reduces the loss in the process of generating quantization features and improves the accuracy of quantization features generation.
  • The embodiment of the present application also optimizes the ranking ability of the target quantization processing sub-model, and further improves the recall performance of video retrieval.
  • The index table contains a plurality of second quantization features, and each quantization feature corresponds to at least one first candidate video. Therefore, when step S402 is performed, the quantized feature distance between the first quantization feature and each second quantization feature is respectively determined, and a first candidate video whose quantized feature distance is lower than the preset quantized feature distance threshold is determined as a second candidate video.
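  • A minimal recall sketch under these definitions: it assumes the quantization features are equal-length binary strings compared by Hamming distance, which is one of the distance options mentioned below, and the threshold value is illustrative.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length binary codes."""
    return sum(x != y for x, y in zip(a, b))

def recall_second_candidates(first_quant, index_table, dist_threshold=1):
    """Return first candidate videos whose quantized feature distance to the
    query code is below the preset threshold (threshold value is illustrative)."""
    hits = []
    for second_quant, videos in index_table.items():
        if hamming(first_quant, second_quant) < dist_threshold:
            hits.extend(videos)
    return hits
```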
  • S403 Based on the content similarity between the video to be retrieved and at least one second candidate video, output the second candidate video whose content similarity meets the set content similarity requirement as a corresponding target video.
  • the complete video of the video to be retrieved can be used as the model input, and multiple key frames obtained can also be used as the model input. Therefore, for different model inputs, the following methods for obtaining the target video are provided. Way.
  • Method 1 is applicable to both of the above model inputs, and the target video is obtained by screening according to the texture feature distance.
  • the target texture processing sub-model is used to perform feature extraction on image features to obtain corresponding first texture features; and then for at least one second candidate video, the following operations are respectively performed: determine the first texture feature and a second The texture feature distance between the second texture features of the candidate video, if the texture feature distance is lower than the preset texture feature distance threshold value, it is determined that the content similarity between the video to be retrieved and the second candidate video conforms to the set content similar requirements, and determine the second candidate video as the target video output; wherein, the second texture feature represents the texture information of the second candidate video.
  • multiple distance calculation methods such as Euclidean distance and Hamming distance can be used to calculate the quantitative feature distance and texture feature distance.
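  • A sketch of Method 1 under the assumption of Euclidean texture feature distances and an illustrative threshold value; the mapping table is assumed to be the dictionary built earlier.

```python
import numpy as np

def method1_targets(query_texture, candidates, mapping_table, texture_threshold=0.5):
    """Keep second candidate videos whose texture feature distance to the query
    is below the preset texture feature distance threshold (value illustrative).

    candidates    : iterable of second candidate video ids
    mapping_table : video id -> second texture feature
    """
    targets = []
    for vid in candidates:
        dist = np.linalg.norm(np.asarray(query_texture) - np.asarray(mapping_table[vid]))
        if dist < texture_threshold:
            targets.append(vid)
    return targets
```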
  • Method 2 is aimed at the case where the complete video is used as the model input, and the target video is obtained by screening according to the content repetition degree.
  • The ratio between the total matching duration and the comparison duration is determined as the content repetition degree between the video to be retrieved and a second candidate video; wherein the total matching duration is obtained based on the matching durations between the at least one second candidate video and the video to be retrieved, and the comparison duration is the shorter of the video durations of the video to be retrieved and the second candidate video;
  • If the content repetition degree exceeds the set content repetition threshold, it is determined that the content similarity between the video to be retrieved and the second candidate video meets the set content similarity requirement, and the second candidate video is determined as the target video to be output.
  • For example, assuming that the video duration of the video to be retrieved is 30s and the matching duration between each second candidate video and the video to be retrieved is as shown in Table 2, Method 2 is used to obtain the content repetition degree between the video to be retrieved and each second candidate video, and finally second candidate videos 1-3 are returned to the user as target videos.
  • Table 2:
    | Second candidate video   | Video duration | Matching duration | Content repetition degree |
    | Second candidate video 1 | 15s            | 5s                | 6                         |
    | Second candidate video 2 | 20s            | 10s               | 4.5                       |
    | Second candidate video 3 | 25s            | 20s               | 3.6                       |
    | Second candidate video 4 | 60s            | 35s               | 3                         |
    | Second candidate video 5 | 120s           | 20s               | 3                         |
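  • The duration-based screening of Method 2 can be sketched as follows; the threshold value and the exact definition of the total matching duration are assumptions for illustration.

```python
def method2_repetition(total_match_s: float, query_dur_s: float, cand_dur_s: float) -> float:
    """Content repetition degree = total matching duration / comparison duration,
    where the comparison duration is the shorter of the two video durations."""
    comparison = min(query_dur_s, cand_dur_s)
    return total_match_s / comparison

# Illustrative check against an assumed threshold value:
print(method2_repetition(total_match_s=5, query_dur_s=30, cand_dur_s=15) > 0.3)
```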
  • Method 3 Aiming at using multiple key frames as model input, the target video is obtained by screening according to the repetition degree of the content.
  • Each key frame corresponds to a first quantized feature
  • Each first quantized feature can recall second candidate videos with the same feature. Therefore, the ratio between the number of identical quantized features and the comparison duration can be determined as the content repetition degree between the video to be retrieved and the second candidate video.
  • The ratio between the number of identical quantized features and the comparison duration is determined as the content repetition degree between the video to be retrieved and the second candidate video; wherein the comparison duration is the shorter of the video durations of the video to be retrieved and the second candidate video;
  • If the content repetition degree exceeds the set content repetition threshold, it is determined that the content similarity between the video to be retrieved and the second candidate video meets the set content similarity requirement, and the second candidate video is determined as the target video to be output.
  • For example, assuming that the video duration of the video to be retrieved is 30s, a total of 10 key frames are extracted, and the number of identical quantized features between each second candidate video and the video to be retrieved is as shown in Table 3, Method 3 is used to obtain the content repetition degree between the video to be retrieved and each second candidate video, and finally second candidate videos 1-2 are returned to the user as target videos.
  • Table 3:
    | Second candidate video   | Video duration | Number of identical quantized features | Content repetition degree |
    | Second candidate video 1 | 15s            | 5                                      | 0.33                      |
    | Second candidate video 2 | 20s            | 8                                      | 0.4                       |
    | Second candidate video 3 | 25s            | 2                                      | 0.08                      |
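  • Likewise, a sketch of the key-frame based repetition degree of Method 3, matching the figures in the example above (e.g. 5 identical features over a 15 s comparison duration gives 0.33):

```python
def method3_repetition(n_same_features: int, query_dur_s: float, cand_dur_s: float) -> float:
    """Content repetition degree = number of identical quantized features /
    comparison duration (the shorter of the two video durations)."""
    comparison = min(query_dur_s, cand_dur_s)
    return n_same_features / comparison

print(round(method3_repetition(5, 30, 15), 2))   # 0.33, as for second candidate video 1
```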
  • the embodiment of the present application also provides a video retrieval device.
  • the device 500 may include:
  • the image processing unit 501 is configured to use the target image processing sub-model of the trained target video retrieval model to perform feature extraction on the video to be retrieved to obtain corresponding image features;
  • the quantization processing unit 502 is configured to use the target quantization processing sub-model of the target video retrieval model to perform feature extraction on image features to obtain corresponding first quantization features, and based on the first quantization features, to filter out from each first candidate video At least one second candidate video whose category similarity with the video to be retrieved meets the set category similarity requirements; wherein, the quantization control parameter of the target quantization processing sub-model is based on the texture feature loss value corresponding to each training sample during the training process For adjustment, the texture feature loss value is determined based on the texture control parameters preset for the texture processing sub-model to be trained during the parameter adjustment process of the texture processing sub-model to be trained;
  • the retrieval unit 503 is configured to, based on the content similarity between the video to be retrieved and at least one second candidate video, output the second candidate video whose content similarity meets the set content similarity requirement as a corresponding target video.
  • the target video retrieval model further includes a target texture processing sub-model, and the retrieval unit 503 is used for:
  • For at least one second candidate video perform the following operations respectively: determine the texture feature distance between the first texture feature and the second texture feature of a second candidate video, if the texture feature distance is lower than the preset texture feature distance threshold value , it is determined that the content similarity between the video to be retrieved and a second candidate video meets the set content similarity requirements, and a second candidate video is determined as the target video output; wherein, the second texture feature represents a corresponding second Texture information of candidate videos.
  • the retrieval unit 503 is used to:
  • For at least one second candidate video perform the following operations respectively:
  • the ratio between the total matching duration and the comparison duration is determined as the content repetition degree between the video to be retrieved and a second candidate video; wherein, the total matching duration is based on at least one second candidate video and the video to be retrieved respectively The matching duration is obtained, and the comparison duration is the shorter duration value of the video to be retrieved and a second candidate video;
  • the content repetition exceeds the set content repetition threshold, it is determined that the content similarity between the video to be retrieved and a second candidate video meets the set content similarity requirements, and a second candidate video is determined as the target video output.
  • the retrieval unit 503 is used to:
  • For at least one second candidate video perform the following operations respectively:
  • the ratio between the quantity of the same quantitative feature and the comparison duration is determined as the content repetition degree between the video to be retrieved and a second candidate video; wherein, the comparison duration is the video duration ratio between the video to be retrieved and a second candidate video Short duration value;
  • the content repetition exceeds the set content repetition threshold, it is determined that the content similarity between the video to be retrieved and a second candidate video meets the set content similarity requirements, and a second candidate video is determined as the target video output .
  • the quantization processing unit 502 is used to:
  • each second quantization feature represents a video category to which at least one corresponding first candidate video belongs.
  • the device 500 further includes a model training unit 504, and the model training unit 504 obtains a trained target video retrieval model by performing the following methods:
  • each sample triplet includes a sample video, and a positive label and a negative label associated with the sample video;
  • Each sample triplet is used as training data, which is sequentially input into the image processing sub-model to be trained and the texture processing sub-model to be processed to obtain the corresponding first texture set; wherein, each time a first texture set is obtained, based on a A plurality of first sample texture features contained in the first texture set generates corresponding image processing texture feature loss values, and based on the image processing texture feature loss values, the image processing sub-model to be trained and the texture processing sub-model to be trained are performed Parameter adjustment until the image processing texture feature loss value is not higher than the preset image processing texture feature loss threshold value, and the candidate image processing sub-model and the candidate texture processing sub-model are obtained;
  • Each sample triplet is used as training data and sequentially input into the candidate image processing sub-model, the candidate texture processing sub-model and the quantization processing sub-model to be trained, to obtain a corresponding second texture set and quantization feature group; wherein, each time a second texture set is obtained, a corresponding texture feature loss value is generated based on a plurality of second sample texture features contained in the second texture set, and the parameters of the candidate texture processing sub-model are adjusted based on the texture feature loss value; and each time a quantization feature group is obtained, a corresponding quantization feature loss value is generated based on multiple sample quantization features contained in the quantization feature group and the texture feature loss value, and the parameters of the candidate image processing sub-model and the quantization processing sub-model to be trained are adjusted based on the quantization feature loss value, until the texture feature loss value and the quantization feature loss value are not higher than the preset feature loss threshold value, thereby obtaining the target image processing sub-model, the target texture processing sub-model, and the target quantization processing sub-model.
  • model training unit 504 is used for:
  • a corresponding quantization feature loss value is generated.
  • model training unit 504 is used for:
  • a corresponding quantization feature loss value is generated.
  • When implementing the present application, the functions of each module or unit can be implemented in one or more pieces of software or hardware.
  • the embodiment of the present application also provides a computer device.
  • the computer device 600 may include at least a processor 601 and a memory 602 .
  • the memory 602 stores program codes, and when the program codes are executed by the processor 601, the processor 601 is made to execute the steps of any one of the video retrieval methods described above.
  • a computing device may include at least one processor, and at least one memory.
  • the memory stores program codes, and when the program codes are executed by the processor, the processor is made to execute the steps in the video retrieval method described above in this specification according to various exemplary embodiments of the present application.
  • the processor may perform the steps as shown in FIG. 4 .
  • a computing device 700 according to this embodiment of the present application is described below with reference to FIG. 7 .
  • the computing device 700 in FIG. 7 is only an example, and should not limit the functions and scope of use of this embodiment of the present application.
  • computing device 700 takes the form of a general-purpose computing device.
  • Components of the computing device 700 may include but not limited to: at least one processing unit 701 , at least one storage unit 702 , and a bus 703 connecting different system components (including the storage unit 702 and the processing unit 701 ).
  • Bus 703 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus structures.
  • the storage unit 702 may include a computer-readable storage medium in the form of a volatile or non-volatile memory, such as a random access memory (RAM) 7021 and/or a cache storage unit 7022, and may further include a read-only memory (ROM) 7023.
  • the computer-readable storage medium includes program code, and when the program code is run on the computer device, the program code is used to make the computer device execute the steps of any one of the above video retrieval methods.
  • the storage unit 702 may also include a program/utility 7025 having a set (at least one) of program modules 7024, such program modules 7024 including but not limited to: an operating system, one or more application programs, other program modules, and program data, Implementations of networked environments may be included in each or some combination of these examples.
  • Computing device 700 may also communicate with one or more external devices 704 (e.g., keyboards, pointing devices, etc.), with one or more devices that enable a user to interact with computing device 700, and/or with any device (e.g., router, modem, etc.) that enables the computing device 700 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 705. Also, the computing device 700 can communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 706. As shown, the network adapter 706 communicates with the other modules of the computing device 700 over the bus 703. It should be understood that, although not shown, other hardware and/or software modules may be used in conjunction with the computing device 700, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • Various aspects of the video retrieval method provided by this application can also be implemented in the form of a program product, which includes program code. When the program product runs on a computer device, the program code is used to make the computer device execute the steps in the video retrieval method according to the various exemplary embodiments of the present application described above in this specification; for example, the computer device may execute the steps shown in FIG. 4.
  • a program product may take the form of any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or device, or any combination thereof. More specific examples (non-exhaustive list) of readable storage media include: electrical connection with one or more conductors, portable disk, hard disk, random access memory (RAM), read only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • an embodiment of the present application provides a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the steps of any one of the above video retrieval methods.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application relates to the field of computers, and in particular to the field of artificial intelligence, and provides a video retrieval method, apparatus, device and storage medium, so as to solve the problems of low quantization efficiency and low accuracy. The method includes: performing feature extraction on the image feature of a video to be retrieved to obtain a first quantization feature; then, based on the first quantization feature, obtaining second candidate videos whose category similarity to the video to be retrieved is high; and finally outputting, as target videos, the second candidate videos whose content similarity to the video to be retrieved is high. The quantization control parameter is adjusted according to the texture feature loss value corresponding to each training sample, so that the target quantization processing sub-model learns the ranking capability of the target texture feature sub-model and the ranking effects of the two sub-models tend to be consistent; the end-to-end model architecture enables the target quantization processing sub-model to obtain the corresponding quantization feature directly from the image feature, which improves the generation accuracy of quantization features and the recall performance of video retrieval.

Description

视频检索的方法、装置、设备及存储介质
本申请要求于2021年8月24日提交中国专利局、申请号为202110973390.0、名称为“视频检索的方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及计算机领域,特别涉及人工智能领域,提供了一种视频检索的方法、装置、设备及存储介质。
背景技术
在相关技术中,通常使用量化特征作为视频的索引标签,检索获得相应的视频。其中,一般采用以下任意一种方法,获得相应的量化特征:
方法一,基于k均值聚类(K-means)算法获得相应的量化特征,但针对大规模样本数据聚类时,为了保证索引检索的准确率,需要耗费大量资源,才能获得足够多的量化特征;
方法二,基于乘积量化(Product Quantization,PQ)获得相应的量化特征,但采用这种方法获得的量化特征,会因生成过程中的损失,降低量化特征的生成准确率,进而影响到视频检索的召回(match)性能;
方法三,基于深度学习神经网络获得相应的量化特征,但该神经网络是先提取视频图像的embedding特征,再对embedding特征进行特征提取处理,获得相应的量化特征,会因生成过程中的损失,降低量化特征的生成准确率,进而影响到视频检索的召回性能。
发明内容
本申请实施例提供了一种视频检索的方法、装置、设备及存储介质,以解决量化效率低和准确率低的问题。
第一方面,本申请实施例提供了一种视频检索的方法,包括:
采用已训练的目标视频检索模型的目标图像处理子模型,对待检索视频进行特征提取,获得对应的图像特征;
采用所述目标视频检索模型的目标量化处理子模型,对所述图像特征进行特征提取,获得对应的第一量化特征,并基于所述第一量化特征,从各个第一候选视频中筛选出与所述待检索视频的类别相似度符合设定类别相似要求的至少一个第二候选视频;其中,所述目标量化处理子模型的量化控制参数是在训练过程中,基于每个训练样本对应的纹理特征损失值进行调整的,所述纹理特征损失值是基于对待训练的纹理处理子模型进行参数调整过程中,针对所述待训练的纹理处理子模型预设的纹理控制参数确定的;
基于所述待检索视频与所述至少一个第二候选视频之间的内容相似度,将内容相似度符合设定内容相似要求的第二候选视频,作为对应的目标视频输出。
第二方面,本申请实施例还提供了一种视频检索的装置,包括:
图像处理单元,用于采用已训练的目标视频检索模型的目标图像处理子模型,对待检索视频进行特征提取,获得对应的图像特征;
量化处理单元,用于采用所述目标视频检索模型的目标量化处理子模型,对所述图像特征进行特征提取,获得对应的第一量化特征,并基于所述第一量化特征,从各个第一候选视频中筛选出与所述待检索视频的类别相似度符合设定类别相似要求的至少一个第二候选视频;其中,所述目标量化处理子模型的量化控制参数是在训练过程中,基于每个训练样本对应的纹理特征损失值进行调整的,所述纹理特征损失值是基于对待训练的纹理处理子模型进行参数调整过程中,针对所述待训练的纹理处理子模型预设的纹理控制参数确定的;
检索单元,用于基于所述待检索视频与所述至少一个第二候选视频之间的内容相似度,将内容相似度符合设定内容相似要求的第二候选视频,作为对应的目标视频输出。
第三方面,本申请实施例还提供了一种计算机设备,包括处理器和存储器,其中,所述存储器存储有程序代码,当所述程序代码被所述处理器执行时,使得所述处理器执行上述任意一种视频检索的方法的步骤。
第四方面,本申请实施例还提供了一种计算机可读存储介质,其包括程序代码,当程序代码在计算机设备上运行时,所述程序代码用于使所述计算机设备执行上述任意一种视频检索的方法的步骤。
第五方面,本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述任意一种视频检索的方法的步骤。
本申请有益效果如下:
本申请实施例提供了一种视频检索的方法、装置、设备及存储介质,该方法包括:对待检索视频的图像特征进行特征提取,获得第一量化特征,再基于第一量化特征,获得与待检索视频的类别相似度高的第二候选视频,最后将与待检索视频的内容相似度高的第二候选视频,作为目标视频输出。由于目标量化处理子模型的量化控制参数,会根据每个训练样本对应的纹理特征损失值进行调整,使得目标量化处理子模型学习到目标纹理特征子模型的排序能力,确保两个子模型的排序效果趋于一致,避免因固定的量化控制参数,导致目标量化处理子模型存在随机排序的情况。因端到端的模型架构,使得采用上述方式训练得到的目标量化处理子模型,可基于图像特征获得对应的量化特征,减少了在生成量化特征过程中的损失,提高了量化特征的生成准确率,再加上本申请实施例还优化了目标量化处理子模型的排序能力,又进一步提高了视频检索的召回性能。
本申请的其它特征和优点将在随后的说明书中阐述,并且,部分地从说明书中变得显而易见,或者通过实施本申请而了解。本申请的目的和其他优点可通过在所写的说明书、权利要求书、以及附图中所特别指出的结构来实现和获得。
附图简要说明
此处所说明的附图用来提供对本申请的进一步理解,构成本申请的一部分,本申请的示意性实施例及其说明用于解释本申请,并不构成对本申请的不当限定。在附图中:
图1a为本申请实施例中一种应用场景的示意图;
图1b为本申请实施例提供的第一种展示界面示意图;
图1c为本申请实施例提供的第二种展示界面示意图;
图1d为本申请实施例提供的目标视频检索模型的架构示意图;
图1e为相关技术中使用的量化处理模型的架构示意图;
图2a为本申请实施例提供的训练目标视频检索模型的流程示意图;
图2b为本申请实施例提供的挖掘多个样本三元组的流程示意图;
图2c为本申请实施例提供的第一种生成量化特征损失值的流程示意图;
图2d为本申请实施例提供的第二种生成量化特征损失值的流程示意图;
图3a为本申请实施例提供的建立索引表、映射表的流程示意图;
图3b为本申请实施例提供的建立索引表、映射表的逻辑示意图;
图4a为本申请实施例提供的视频检索方法的流程示意图;
图4b为本申请实施例提供的应用视频检索方法的具体实施例的逻辑示意图;
图5为本申请实施例提供的一种视频检索的装置的结构示意图;
图6为本申请实施例中提供的一种计算机设备的组成结构示意图;
图7为本申请实施例中的一个计算装置的结构示意图。
具体实施方式
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请技术方案的一部分实施例,而不是全部的实施例。基于本申请文件中记载的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请技术方案保护的范围。
以下对本申请实施例中的部分用语进行解释说明,以便于本领域技术人员理解。
本申请实施例涉及人工智能(ArtificialIntelligence,AI)领域,是基于机器学习(MachineLearning,ML)和计算机视觉(Computer Vision,CV)技术设计的。本申请实施例提供的方案,涉及人工智能的深度学习、增强现实等技术,具体通过如下实施例进一步说明。
下面对本申请实施例进行简要介绍。
本申请实施例提供了一种视频检索的方法、装置、设备及存储介质,以解决量化效率低和准确率低的问题。本申请实施例可应用于各类视频检索场景下,如在视频侵权场景中,使用本申请实施例提供的视频检索方法,召回一批与待检索视频的内容相似度较高的视频,并将召回的视频判定为侵权视频。
该方法包括:对待检索视频的图像特征进行特征提取,获得第一量化特征,再基于第一量化特征,获得与待检索视频的类别相似度高的第二候选视频,最后将与待检索视频的内容相似度高的第二候选视频,作为目标视频输出。由于目标量化处理子模型的量化控制参数,会根据每个训练样本对应的纹理特征损失值进行调整,使得目标量化处理子模型学习到目标纹理特征子模型的排序能力,确保两个子模型的排序效果趋于一致,避免因固定的量化控制参数,导致目标量化处理子模型存在随机排序的情况。因端到端的模型架构,使得采用上述方式训练得到的目标量化处理子模型,可基于图像特征获得对应的量化特征,减少了在生成量化特征过程中的损失,提高了量化特征的生成准确率,再加上本申请实施例还优化了目标量化处理子模型的排序能力,又进一步提高了视频检索的召回性能。
以下结合说明书附图对本申请的实施例进行说明,应当理解,此处所描述的实施例仅用于说明和解释本申请,并不用于限定本申请,并且在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。
参阅图1a和图1b示出的示意图,在本申请实施例的应用场景中,包括两个物理终端设备110和一个服务器130。
目标对象(例如,用户)可通过物理终端设备110登录视频检索客户端,并且物理终端设备110的显示屏120上呈现检索界面;之后,目标对象在检索界面中输入待检索视频的图像,以使运行在目标服务器130上的目标视频检索模型基于待检索视频的图像,从后台端口连接的庞大视频库中,获取与待检索视频的内容相似度较高的目标视频;物理终端设备110在接收到目标服务器130返回的全部目标视频之后,在显示屏120的展示界面上呈现各个目标视频,同时用户还可以通过点击页面等手势操作,查看被选中的目标视频的视频详情,而且,进度条上还会标记出目标视频与待检索视频的相似片段或重复片段。
如图1b所示的展示界面呈现了某部电视剧的节选片段,针对已播放片段,其对应的进度条颜色为白色;针对未播放片段,其对应的进度条颜色为黑色;而针对相似片段或重复片段,其对应的进度条颜色为灰色,这样,用户可以通过进度条颜色,粗略估计出目标视频与待检索视频的相似程度,便于用户进行视频创作的侵权判定。
如图1c所示的展示界面呈现了某部电视剧的节选片段,针对已播放片段,其对应的进度条颜色为白色;针对未播放片段,其对应的进度条颜色为黑色;而针对相似片段或重复片段,会在进度条上用三角形标记点或其他形状的标记点,标记出这些片段的起始点、终止点,这样,用户可以通过点击标记点,直接跳转到相应的剧情,同样地,用户也可以通过进度条上的标记点数量,粗略估计出目标视频与待检索视频的相似程度。
在本申请实施例中,物理终端设备110是用户使用的电子设备,电子设备可以是个人计算机、手机、平板电脑、笔记本电脑、电子书阅读器、智能家居等计算机设备。
各物理终端设备110通过通信网络与目标服务器130进行通信。在一种实施方式中,通信网络为有线网络或者无线网络,因此,各物理终端设备110可通过有线网络或者无线网络,直接或间接地与目标服务器130建立通信连接,本申请在此不做限制。
目标服务器130可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络(Content Delivery Network,CDN)、大数据以及人工智能平台等基础云计算服务的云服务器,本申请在此不做限制。
其中,目标服务器130上部署了目标视频检索模型,如图1d所示,目标视频检索模型包括目标图像处理子模型、目标纹理处理子模型和目标量化处理子模型。
目标图像处理子模型、目标纹理处理子模型,均是采用ResNet_101的网络架构构建的深度学习网络模型,并基于imageNet进行模型预训练。imageNet是大型通用物体识别开源数据集,在imageNet中有大量事先标注好的图像数据,且imageNet大概含有1000类的图像数据,因此,基于imageNet预训练获得的深度学习网络模型,其模型参数的稳定性、整体模型的通用性更优。
另外,还可以采用除ResNet_101以外的网络架构构建深度学习网络模型,以及基于其他大规模数据集,对深度学习网络模型进行预训练,如基于openimage预训练获得的深度学习网络模型。
采用目标量化处理子模型,对复杂的高维图像特征进行二值量化处理,将高维图像特征压缩为指定位数的二进制编码(即量化特征)。在进行视频检索时,以量化特征为索引,召回相应的目标视频,大大降低了计算时间和计算复杂度,更加有利于计算,对海量数据的检索是非常有利的。
另外,对于二进制编码来说,每一位的取值为0或1,如将128维的图像特征压缩到4比特(bit)的二进制编码0100。
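To make the binarization step above concrete, the following minimal sketch (an illustration, not the patent's learned quantization sub-model) sign-quantizes a real-valued feature into a fixed-length binary code and compares two codes with the Hamming distance; the 4-bit code 0100 mirrors the example in the text. All names and values are illustrative.

```python
# Minimal sketch: sign-quantize a real-valued feature into a binary code and
# compare codes by Hamming distance. Illustrative only; in the patent the
# quantization sub-model is a trained network rather than this fixed rule.

def sign_quantize(feature):
    """Map each dimension to one bit: 1 if the value is non-negative, else 0."""
    return [1 if v >= 0 else 0 for v in feature]

def hamming_distance(code_a, code_b):
    """Count the bit positions in which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

# A 4-dimensional projection quantized to the 4-bit code 0100,
# matching the example in the description.
projected = [-0.7, 0.3, -0.1, -2.4]
code = sign_quantize(projected)                     # [0, 1, 0, 0]
print(code, hamming_distance(code, [0, 1, 1, 0]))   # -> [0, 1, 0, 0] 1
```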
与图1e所示的传统的量化处理模型不同,本申请实施例中的目标纹理处理子模型和目 标量化处理子模型是并行放置的两个子模型,这样部署的好处在于,在训练阶段,目标量化处理子模型的量化控制参数,会根据每个训练样本对应的纹理特征损失值进行调整,使得目标量化处理子模型学习到目标纹理特征子模型的排序能力,确保两个子模型的排序效果趋于一致,避免因固定的量化控制参数,导致目标量化处理子模型存在随机排序的情况。
在应用阶段,相较于图1e所示的非端到端的模型架构而言,本申请实施例采用端到端的模型架构,可基于图像特征获得对应的量化特征,减少了在生成量化特征过程中的损失,提高了量化特征的生成准确率,再加上本申请实施例还优化了目标量化处理子模型的排序能力,又进一步提高了视频检索的召回性能。
而且,本申请实施例基于人工智能技术搭建的目标视频检索模型,在处理大规模的检索视频时,相较于传统的k均值聚类算法而言,其处理速度、召回性能更优,耗费的资源也更少。
在目标视频检索模型的训练过程中,具体分为预训练阶段和微调联合学习阶段,但是,两个训练阶段使用的训练数据是相同的,区别在于,两个训练阶段中需要学习的网络参数和生成的损失值是不同的。
其中,在预训练阶段中,使用图像处理纹理特征损失值,对待训练的图像处理子模型、待训练的纹理处理子模型进行参数调整,获得候选图像处理子模型和候选纹理处理子模型;在微调联合学习阶段中,使用纹理特征损失值,对候选纹理处理子模型进行参数调整,使用量化特征损失值,对候选图像处理子模型和待训练的量化处理子模型进行参数调整,获得目标图像处理子模型、目标纹理处理子模型,以及目标量化处理子模型。
为了便于理解,参阅图2a所示的流程示意图,介绍目标视频检索模型的训练过程。
S201:获得多个样本三元组,每个样本三元组包含样本视频、以及样本视频关联的正向标签和负向标签。
相较于传统的量化处理方法来说,本申请实施例采用了带标签的训练数据,使得待训练的量化处理子模型能够同时学习到正向标签、负向标签,进而提高目标量化处理子模型的召回效果。
其中,正向标签指的是,与样本视频的内容相似度较高的样本视频,而负向标签则指的是,与样本视频仅存在少量内容相同或相似的样本视频。
参阅图2b示出的流程示意图,本申请实施例可通过执行以下操作,获得多个样本三元组:
S2011:获取一个包含了多个相似样本对的相似样本集合,每个相似样本对包含样本视频、以及样本视频关联的正向标签。
S2012:将相似样本集合中的各个相似样本对,依次输入到待训练的图像处理子模型、待训练的纹理处理子模型中,获得对应的纹理特征组。
待训练的图像处理子模型、待训练的纹理处理子模型,均是基于大型通用物体识别开源数据集ImageNet预训练获得的深度学习网络模型。将各个相似样本对依次输入到上述两个子模型中,可获得各个相似样本对各自对应的纹理特征组,每个纹理特征组包括样本视频的纹理特征、以及正向标签的纹理特征。
S2013:读取一个相似样本对的样本视频c;
S2014:针对相似样本集合中的各个其他相似样本对,分别执行以下操作:基于样本视频c的纹理特征、与一个其他相似样本对中任意一个其他样本视频的纹理特征,获得相应的纹理特征距离。
在一种实施方式中,将两个纹理特征之间的欧式距离作为对应的纹理特征距离。欧式 距离的取值越小,表征两个样本视频的内容相似度越高;反之,欧式距离的取值越大,表征两个样本视频的内容相似度越低。
S2015:按纹理特征距离的远近顺序,排列各个其他样本视频。
S2016:在剔除前k%的其他样本视频之后,将排列在前m个的其他样本视频确定为样本视频的负向标签。
按照纹理特征距离从近至远的顺序,排列各个其他样本视频,再根据前文对纹理特征距离的介绍可知,前k%的其他样本视频与样本视频的内容相似度非常高,而本申请实施例中需要挖掘的负向标签是,与样本视频仅存在少量内容相同或相似的样本视频,很明显前k%的其他样本视频不符合负向标签的定义,被作为干扰噪声剔除掉。其中,k为可控值,干扰噪声越大,其对应的k值也越大。
而排名极其靠后的其他样本视频,因与样本视频几乎不存在相同或相似的内容,也不符合负向标签的定义,因此,本申请实施例是将剔除前k%的其他样本视频之后排列在前m个的其他样本视频,确定为样本视频的负向标签。
For example, assume the similar sample pair is (sample video 1, sample video 2). Table 1 shows the texture feature distances between the other sample videos and sample video 1. The top k% of other sample videos, for example sample video 3 and sample video 6, are removed as noise, and the other sample videos ranked far down the list, namely sample video 8 and sample video 9, are also discarded, finally yielding the following sample triplets: (sample video 1, sample video 2, other sample video 7), (sample video 1, sample video 2, other sample video 4), (sample video 1, sample video 2, other sample video 5).
Table 1
Other sample video      Texture feature distance
Other sample video 3    0.5
Other sample video 6    0.55
Other sample video 7    0.8
Other sample video 4    0.83
Other sample video 5    0.9
Other sample video 8    1.2
Other sample video 9    1.5
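Following the example above, the sketch below (illustrative names and parameters, assuming texture feature distances have already been computed as in steps S2014–S2015) ranks the other sample videos by distance, discards the closest k% as noise, and keeps the next m as negative labels, reproducing the triplets listed in the text.

```python
# Sketch of negative mining (S2014-S2016), assuming precomputed texture
# feature distances. The closest k% of other videos are skipped as noise;
# the next m videos are kept as negative labels.

def mine_negatives(distances, k_percent, m):
    """distances: dict mapping other-video name -> texture feature distance."""
    ranked = sorted(distances, key=distances.get)    # nearest first
    skip = int(len(ranked) * k_percent / 100)        # top k% removed as noise
    return ranked[skip:skip + m]

# Distances from Table 1 (other sample videos vs. sample video 1).
table1 = {"other sample video 3": 0.5, "other sample video 6": 0.55,
          "other sample video 7": 0.8, "other sample video 4": 0.83,
          "other sample video 5": 0.9, "other sample video 8": 1.2,
          "other sample video 9": 1.5}
print(mine_negatives(table1, k_percent=30, m=3))
# -> ['other sample video 7', 'other sample video 4', 'other sample video 5']
```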
S2017:判断相似样本集合中的全部相似样本对,是否均读取完毕,若是,执行步骤S2018;否则,返回步骤S2013。
S2018:判断所有相似样本集合是否均读取完毕,若是,输出所有样本三元组;否则,返回步骤S2011。
S202:读取一个样本三元组d,将样本三元组d作为训练数据,依次输入到待训练的图像处理子模型、待处理的纹理处理子模型中,获得对应的第一纹理集合。
S203:基于第一纹理集合包含的多个第一样本纹理特征,生成对应的图像处理纹理特征损失值,并基于图像处理纹理特征损失值,对待训练的图像处理子模型、待训练的纹理处理子模型进行参数调整。
Formula 1 below is used to generate the corresponding image-processing texture feature loss value; then, based on the image-processing texture feature loss value, the stochastic gradient descent (SGD) method is used to adjust the parameters of the image processing sub-model to be trained and the texture processing sub-model to be trained.
In Formula 1, L_em is the image-processing texture feature loss value, x_a is the first sample texture feature of the sample video, x_p is the first sample texture feature of the positive label, x_n is the first sample texture feature of the negative label, ‖x_a − x_p‖ denotes the texture feature distance between the positive sample pair, ‖x_a − x_n‖ denotes the texture feature distance between the negative sample pair, and margin_em denotes the texture control parameter.
L_em = max(‖x_a − x_p‖ − ‖x_a − x_n‖ + margin_em)     Formula 1
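A minimal sketch of the triplet texture loss in Formula 1. The Euclidean distance is used as the texture feature distance, as stated earlier in the description; the clamp at zero is an assumption (the formula is written as max(·) with a single argument), consistent with the standard triplet loss, and the margin_em value is illustrative.

```python
import math

def euclidean(u, v):
    """Euclidean distance, used here as the texture feature distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def texture_triplet_loss(x_a, x_p, x_n, margin_em=0.2):
    """Formula 1: pull the positive pair together and push the negative pair
    apart by at least margin_em. The zero clamp is assumed (standard hinge)."""
    return max(euclidean(x_a, x_p) - euclidean(x_a, x_n) + margin_em, 0.0)

anchor, positive, negative = [0.1, 0.9], [0.2, 0.8], [0.9, 0.1]
print(texture_triplet_loss(anchor, positive, negative))
```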
S204:判断图像处理纹理特征损失值是否高于预设的图像处理纹理特征损失门限值,若是,返回步骤S202;否则,执行步骤S205。
S205:停止迭代训练子模型,输出最后一轮迭代获得的候选图像处理子模型和候选纹理处理子模型。
S206:读取一个样本三元组e,将样本三元组e作为训练数据,依次输入到候选图像处理子模型、候选纹理处理子模型和待训练的量化处理子模型中,获得对应的第二纹理集合和量化特征组。
S207:基于第二纹理集合包含的多个第二样本纹理特征,生成对应的纹理特征损失值,并基于纹理特征损失值,对候选纹理处理子模型进行参数调整。
Formula 2 below is used to generate the corresponding texture feature loss value; then, based on the texture feature loss value, the SGD method is used to adjust the parameters of the candidate texture processing sub-model.
In Formula 2, L_em′ is the texture feature loss value, x_a′ is the second sample texture feature of the sample video, x_p′ is the second sample texture feature of the positive label, x_n′ is the second sample texture feature of the negative label, ‖x_a′ − x_p′‖ denotes the texture feature distance between the positive sample pair, ‖x_a′ − x_n′‖ denotes the texture feature distance between the negative sample pair, and margin_em denotes the texture control parameter.
L_em′ = max(‖x_a′ − x_p′‖ − ‖x_a′ − x_n′‖ + margin_em)     Formula 2
S208:基于量化特征组包含的多个样本量化特征、纹理特征损失值,生成对应的量化特征损失值,并基于量化特征损失值,对候选图像处理子模型和待训练的量化处理子模型进行参数调整。
参阅图2c示出的流程示意图,对第一种生成量化特征损失值的方式进行介绍。
S2081:基于纹理特征损失值,调整待训练的量化处理子模型的量化控制参数。
Formula 3 below is used to compute the quantization control parameter of the i-th sample triplet. Specifically, margin_i denotes the quantization control parameter of the i-th sample triplet, margin0 is the preset Hamming distance, Mem is the ratio between the texture feature distance and the Hamming distance, and L_em_i is the texture feature loss value of the i-th sample triplet.
margin_i = margin0 * L_em_i / Mem        Formula 3
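A one-line sketch of Formula 3, assuming margin0 (the preset Hamming distance), Mem (the preset texture-distance-to-Hamming-distance ratio) and the per-triplet texture loss are already available; the default values are illustrative.

```python
def quantization_margin(l_em_i, margin0=12.0, mem=4.0):
    """Formula 3: scale the preset Hamming margin by the triplet's texture
    feature loss, so triplets the texture branch finds harder also receive a
    larger quantization margin. The margin0 and mem values are illustrative."""
    return margin0 * l_em_i / mem

print(quantization_margin(l_em_i=0.8))   # -> 2.4
```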
S2082:基于一个量化特征组包含的多个样本量化特征和量化控制参数,分别确定待训练的量化处理子模型的训练样本损失值和符号量化损失值。
Formula 4 below is the calculation formula of the training sample loss value, where L_triplet is the training sample loss value, u_a, u_p and u_n denote the sample quantization features of the sample video, the positive label and the negative label respectively (these symbols appear only as images in the source), ‖u_a − u_p‖ denotes the quantization feature distance between the positive sample pair, ‖u_a − u_n‖ denotes the quantization feature distance between the negative sample pair, and margin_i denotes the quantization control parameter of the i-th sample triplet.
L_triplet = max(‖u_a − u_p‖ − ‖u_a − u_n‖ + margin_i)     Formula 4 (reconstructed from the definitions above; the source shows the formula as an image)
First, Formula 5 below is used to sign-quantize each bit of the sample quantization feature to obtain the sign quantization feature; then Formula 6 is used to generate the corresponding sign quantization loss value based on the sample quantization feature and the sign quantization feature.
Here, L_coding is the sign quantization loss value, u_i denotes the i-th bit of the sample quantization feature, and b_i denotes the i-th bit of the sign quantization feature; if u_i is negative, b_i takes the value −1, otherwise b_i takes the value 1.
b_i = −1 if u_i < 0, otherwise b_i = 1     Formula 5 (restated from the definition above; the source shows the formula as an image)
Formula 6, shown only as an image in the source, computes L_coding from the sample quantization feature u and the sign quantization feature b.
S2083:基于待训练的量化处理子模型的训练样本损失值、符号量化损失值,生成对应的量化特征损失值。
Formula 7 below is used to generate the corresponding quantization feature loss value; then, based on the quantization feature loss value, the SGD method is used to adjust the parameters of the candidate image processing sub-model and the quantization processing sub-model to be trained.
In Formula 7, L_q is the quantization feature loss value, L_triplet is the training sample loss value of the quantization processing sub-model to be trained, w_21 is the weight assigned to the training sample loss value, L_coding is the sign quantization loss value of the quantization processing sub-model to be trained, and w_22 is the weight assigned to the sign quantization loss value.
L_q = w_21 * L_triplet + w_22 * L_coding       Formula 7
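The sketch below strings Formulas 4–7 together under stated assumptions: the quantization triplet loss reuses the hinge form with the per-triplet margin from Formula 3 and a zero clamp; Formula 5 is the sign rule quoted above; the exact form of Formula 6 is shown only as an image in the source, so a mean squared gap between the feature and its signs is used here as a common stand-in; averaging the coding loss over the triplet and the weight values are likewise illustrative choices.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def quantization_triplet_loss(u_a, u_p, u_n, margin_i):
    """Formula 4 with the adaptive margin from Formula 3 (zero clamp assumed)."""
    return max(euclidean(u_a, u_p) - euclidean(u_a, u_n) + margin_i, 0.0)

def sign_quantize(u):
    """Formula 5: b_i = -1 if u_i is negative, otherwise 1."""
    return [-1.0 if x < 0 else 1.0 for x in u]

def coding_loss(u):
    """Stand-in for Formula 6 (shown only as an image in the source):
    penalize the gap between the real-valued feature and its sign code."""
    b = sign_quantize(u)
    return sum((x - s) ** 2 for x, s in zip(u, b)) / len(u)

def quantization_feature_loss(u_a, u_p, u_n, margin_i, w21=1.0, w22=0.1):
    """Formula 7: weighted sum of the triplet loss and the coding loss.
    Averaging the coding loss over the triplet is an assumption."""
    l_triplet = quantization_triplet_loss(u_a, u_p, u_n, margin_i)
    l_coding = (coding_loss(u_a) + coding_loss(u_p) + coding_loss(u_n)) / 3
    return w21 * l_triplet + w22 * l_coding

u_a, u_p, u_n = [0.9, -0.2, 0.7], [0.8, -0.1, 0.6], [-0.9, 0.4, -0.5]
print(quantization_feature_loss(u_a, u_p, u_n, margin_i=1.0))
```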
参阅图2d示出的流程示意图,本申请实施例还提供了第二种生成量化特征损失值的方式。
S2081':基于一个量化特征组包含的多个样本量化特征,分别确定待训练的量化处理子模型的训练样本损失值和符号量化损失值。
将多个样本量化特征代入公式4中,生成对应的训练样本损失值,但此时公式4中的margin_i的取值与margin_em的取值相同。
将多个样本量化特征,依次代入公式5~6中,获得对应的符号量化损失值。在前文中已经介绍过相关公式,在此将不再赘述。
S2082':基于待训练的量化处理子模型的训练样本损失值、符号量化损失值和纹理特征损失值,生成对应的量化特征损失值。
Formula 8 below is used to generate the corresponding quantization feature loss value. Here, L_q is the quantization feature loss value, L_triplet is the training sample loss value of the quantization processing sub-model to be trained, w_21 and L_em′ together act as the weight on the training sample loss value, L_coding is the sign quantization loss value of the quantization processing sub-model to be trained, and w_22 is the weight assigned to the sign quantization loss value.
L_q = w_21 * (L_em′ * L_triplet) + w_22 * L_coding        Formula 8
S209:判断纹理特征损失值、量化特征损失值是否均高于预设的图像处理纹理特征损失门限值,若是,返回步骤S206;否则,执行步骤S210。
S210:停止迭代训练子模型,输出最后一轮迭代获得的目标图像处理子模型、目标纹理处理子模型和目标量化处理子模型。
接下来,参阅图3a示出的流程示意图、图3b示出的逻辑示意图,采用已训练的目标视频检索模型,建立视频数据库的索引表、映射表。
S301:读取一个第一候选视频s;
S302:将第一候选视频s输入到目标视频检索模型中,获得对应的初始量化特征、第二纹理特征;
S303:将第二纹理特征添加到映射表中,以及分别确定初始量化特征与索引表中记载的各个第二量化特征之间的量化特征距离;
S304:将第一候选视频s添加到最小量化特征距离对应的第二量化特征中;
S305:判断视频数据库中的全部第一候选视频,是否均读取完毕,若是,执行步骤S306;否则,返回步骤S301;
S306:输出最后一轮迭代获得的映射表、索引表。
其中,在执行步骤S303时,若索引表为空值,则将第一候选视频s的初始量化特征作为第二量化特征添加到索引表中。索引表如Lindex:[q1:[img1,img2,img6],q2:[img3],q3:[img4]]所示,表中包括多个第二量化特征,每个量化特征对应至少一个第一候选视频,因此,每个第二量化特征表征对应的至少一个第一候选视频所归属的视频类别;映射表如T:[[img1,embedding1],[img2,embedding2],……,[img6,embedding6]],表中包括多个第一候选视频、及对应的第二纹理特征。
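A compact sketch of the index/mapping construction in steps S301–S306, mirroring the Lindex and T structures above. Assumptions: the Hamming distance serves as the quantization feature distance, and a new index entry is opened whenever no existing second quantization feature matches exactly (the text only states how an empty index is seeded); all names are illustrative.

```python
# Sketch of building the index table (quant code -> candidate videos) and the
# mapping table (video -> texture embedding), following S301-S306.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_tables(candidates):
    """candidates: iterable of (video_id, quant_code, texture_embedding)."""
    index = {}     # like Lindex: quantization code -> list of first candidate videos
    mapping = {}   # like T: first candidate video -> second texture feature
    for video_id, code, embedding in candidates:
        mapping[video_id] = embedding
        if not index:
            index[tuple(code)] = [video_id]     # seed an empty index (S303)
            continue
        nearest = min(index, key=lambda q: hamming(q, code))
        if hamming(nearest, code) == 0:
            index[nearest].append(video_id)     # attach to the closest code (S304)
        else:
            index[tuple(code)] = [video_id]     # assumption: unseen codes open a new entry
    return index, mapping

videos = [("img1", [0, 1, 0, 0], [0.1, 0.9]),
          ("img2", [0, 1, 0, 0], [0.2, 0.8]),
          ("img3", [1, 0, 1, 0], [0.7, 0.2])]
index, mapping = build_tables(videos)
print(index)   # {(0, 1, 0, 0): ['img1', 'img2'], (1, 0, 1, 0): ['img3']}
```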
另外,针对新加入视频数据库中的第一候选视频,也可以执行如图3a所示的流程,建立相应的索引关系、映射关系。
接下来,参阅图4a示出的流程示意图,在已训练的目标视频检索模型上,应用本申请实施例提供的视频检索方法。
S401:采用已训练的目标视频检索模型的目标图像处理子模型,对待检索视频进行特征提取,获得对应的图像特征。
在执行步骤S401时,可将待检索视频的完整视频输入到目标图像处理子模型中,获得相应的一个图像特征;也可以先从待检索视频中提取关键帧,再将获得的多个关键帧输入到目标图像处理子模型中,获得相应的多个图像特征。
S402:采用目标视频检索模型的目标量化处理子模型,对图像特征进行特征提取,获得对应的第一量化特征,并基于第一量化特征,从各个第一候选视频中筛选出与待检索视频的类别相似度符合设定类别相似要求的至少一个第二候选视频;其中,目标量化处理子模型的量化控制参数是在训练过程中,基于每个训练样本对应的纹理特征损失值进行调整的,纹理特征损失值是基于对待训练的纹理处理子模型进行参数调整过程中,针对待训练的纹理处理子模型预设的纹理控制参数确定的。
目标量化处理子模型的量化控制参数根据每个训练样本对应的纹理特征损失值进行调整,使得目标量化处理子模型学习到目标纹理特征子模型的排序能力,确保两个子模型的排序效果趋于一致,避免因固定的量化控制参数,导致目标量化处理子模型存在随机排序的情况。而端到端的模型架构,使得目标量化处理子模型能够基于图像特征,获得对应的量化特征,减少了在生成量化特征过程中的损失,提高了量化特征的生成准确率,再加上本申请实施例还优化了目标量化处理子模型的排序能力,又进一步提高了视频检索的召回性能。
根据前文的介绍可知,索引表中包含多个第二量化特征,每个量化特征对应至少一个第一候选视频,因此,在执行步骤S402时,分别确定第一量化特征与各个第一候选视频各自的第二量化特征之间的量化特征距离,再将量化特征距离低于预设量化特征距离门限值的第一候选视频,确定为第二候选视频。
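A sketch of the screening in step S402, assuming the Hamming distance between binary codes as the quantization feature distance and an illustrative threshold; every index entry whose code is close enough contributes all of its first candidate videos as second candidate videos.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def screen_candidates(query_code, index, max_distance=1):
    """Return the second candidate videos whose second quantization feature is
    within the preset quantization feature distance threshold (value illustrative)."""
    hits = []
    for code, video_ids in index.items():
        if hamming(code, query_code) <= max_distance:
            hits.extend(video_ids)
    return hits

index = {(0, 1, 0, 0): ["img1", "img2", "img6"],
         (1, 1, 0, 0): ["img3"],
         (1, 0, 1, 1): ["img4"]}
print(screen_candidates((0, 1, 0, 0), index))   # -> ['img1', 'img2', 'img6', 'img3']
```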
S403:基于待检索视频与至少一个第二候选视频之间的内容相似度,将内容相似度符合设定内容相似要求的第二候选视频,作为对应的目标视频输出。
在本申请实施例中,既可以将待检索视频的完整视频作为模型输入,也可以将获得的多个关键帧作为模型输入,因此,针对不同的模型输入,提供了以下几种获得目标视频的方式。
方式1:针对上述两种模型输入均适用,是按照纹理特征距离,筛选获得目标视频。
在实施例中,采用目标纹理处理子模型,对图像特征进行特征提取,获得对应的第一纹理特征;再针对至少一个第二候选视频,分别执行以下操作:确定第一纹理特征与一个第二候选视频的第二纹理特征之间的纹理特征距离,若纹理特征距离低于预设纹理特征距离门限值,则判定待检索视频与该第二候选视频之间的内容相似度符合设定内容相似要求, 并将该第二候选视频确定为目标视频输出;其中,第二纹理特征表征该第二候选视频的纹理信息。在本申请实施例中,可采用欧式距离、汉明距离等多种距离计算方式,计算量化特征距离、纹理特征距离,无论采用哪种距离计算方式,若距离的取值较小,则表示两个视频的内容相似度高;反之,若距离的取值较大,则表示两个视频的内容相似度低,后续将不再赘述。
方式2:针对将完整视频作为模型输入,按照内容重复度,筛选获得目标视频。
在本实施例中,针对至少一个第二候选视频,分别执行以下操作:
将总匹配时长与比较时长之间的比值确定为待检索视频与一个第二候选视频之间的内容重复度;其中,总匹配时长是基于至少一个第二候选视频各自与待检索视频之间的匹配时长获得的,比较时长是待检索视频与该第二候选视频中视频时长较短的时长取值;
若内容重复度超过设定的内容重复度门限值,则判定待检索视频与该第二候选视频之间的内容相似度符合设定内容相似要求,并将该第二候选视频确定为目标视频输出。
For example, assume the duration of the video to be retrieved is 30 s and the matched durations between each second candidate video and the video to be retrieved are as shown in Table 2. Using Method 2, the content repetition degree between the video to be retrieved and each second candidate video is obtained, and second candidate videos 1–3 are finally returned to the user as target videos.
Table 2
Second candidate video      Video duration    Matched duration    Content repetition degree
Second candidate video 1    15 s              5 s                 6
Second candidate video 2    20 s              10 s                4.5
Second candidate video 3    25 s              20 s                3.6
Second candidate video 4    60 s              35 s                3
Second candidate video 5    120 s             20 s                3
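A sketch of Method 2's content repetition degree that reproduces the Table 2 numbers: the total matched duration is summed over all second candidate videos, and each candidate's comparison duration is the shorter of its own duration and the query video's duration; the selection threshold is illustrative.

```python
def repetition_degree_by_duration(query_duration, candidates):
    """candidates: dict name -> (video_duration_s, matched_duration_s).
    Returns dict name -> content repetition degree (Method 2)."""
    total_matched = sum(matched for _, matched in candidates.values())
    return {name: total_matched / min(query_duration, duration)
            for name, (duration, _) in candidates.items()}

table2 = {"candidate 1": (15, 5), "candidate 2": (20, 10), "candidate 3": (25, 20),
          "candidate 4": (60, 35), "candidate 5": (120, 20)}
degrees = repetition_degree_by_duration(30, table2)
print({k: round(v, 2) for k, v in degrees.items()})
# -> {'candidate 1': 6.0, 'candidate 2': 4.5, 'candidate 3': 3.6,
#     'candidate 4': 3.0, 'candidate 5': 3.0}
print([k for k, v in degrees.items() if v > 3.0])   # illustrative threshold -> candidates 1-3
```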
方式3:针对将多个关键帧作为模型输入,按照内容重复度,筛选获得目标视频。
每个关键帧各自对应一个第一量化特征,每个第一量化特征可召回特征相同的第二候选视频,因此,可将相同量化特征的数量与比较时长之间的比值,确定为待检索视频与该第二候选视频之间的内容重复度。
在实施例中,针对至少一个第二候选视频,分别执行以下操作:
确定待检索视频与一个第二候选视频之间的相同量化特征的数量;
将相同量化特征的数量与比较时长之间的比值,确定为待检索视频与该第二候选视频之间的内容重复度;其中,比较时长是待检索视频与该第二候选视频中视频时长较短的时长取值;
若内容重复度超过设定内容重复度门限值,则判定待检索视频与该第二候选视频之间的内容相似度符合设定内容相似要求,并将该第二候选视频确定为目标视频输出。
For example, assume the duration of the video to be retrieved is 30 s and 10 key frames are extracted in total. The numbers of identical quantization features between each second candidate video and the video to be retrieved are as shown in Table 3. Using Method 3, the content repetition degree between the video to be retrieved and each second candidate video is obtained, and second candidate videos 1–2 are finally returned to the user as target videos.
Table 3
Second candidate video      Video duration    Number of identical quantization features    Content repetition degree
Second candidate video 1    15 s              5                                            0.33
Second candidate video 2    20 s              8                                            0.4
Second candidate video 3    25 s              2                                            0.08
Second candidate video 4    60 s              3                                            0.1
Second candidate video 5    120 s             1                                            0.03
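A sketch of Method 3 that reproduces the Table 3 numbers under the same comparison-duration convention: the number of identical quantization features recalled for each candidate is divided by the shorter of the two video durations; the selection threshold is illustrative.

```python
def repetition_degree_by_codes(query_duration, candidates):
    """candidates: dict name -> (video_duration_s, identical_code_count).
    Returns dict name -> content repetition degree (Method 3)."""
    return {name: count / min(query_duration, duration)
            for name, (duration, count) in candidates.items()}

table3 = {"candidate 1": (15, 5), "candidate 2": (20, 8), "candidate 3": (25, 2),
          "candidate 4": (60, 3), "candidate 5": (120, 1)}
degrees = repetition_degree_by_codes(30, table3)
print({k: round(v, 2) for k, v in degrees.items()})
# -> {'candidate 1': 0.33, 'candidate 2': 0.4, 'candidate 3': 0.08,
#     'candidate 4': 0.1, 'candidate 5': 0.03}
print([k for k, v in degrees.items() if v > 0.3])   # illustrative threshold -> candidates 1-2
```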
为了便于理解,参阅图4b示出的逻辑示意图,介绍在具体实施例上应用视频检索方法的过程。
将待检索视频的完整视频输入已训练的目标视频检索模型中,获得对应的第一纹理特征、第一量化特征;根据第一量化特征与索引表中各个第二量化特征之间的量化特征距离,获得与待检索视频的类别相似度较高的多个候选视频;再根据第一纹理特征与上一轮被召回的各个候选视频的第二纹理特征之间的纹理特征距离,将排列在前N个的候选视频作为与待检索视频的内容相似度较高的目标视频,并返回给用户。
与上述方法实施例基于同一发明构思,本申请实施例还提供了一种视频检索的装置,如图5所示,装置500可以包括:
图像处理单元501,用于采用已训练的目标视频检索模型的目标图像处理子模型,对待检索视频进行特征提取,获得对应的图像特征;
量化处理单元502,用于采用目标视频检索模型的目标量化处理子模型,对图像特征进行特征提取,获得对应的第一量化特征,并基于第一量化特征,从各个第一候选视频中筛选出与待检索视频的类别相似度符合设定类别相似要求的至少一个第二候选视频;其中,目标量化处理子模型的量化控制参数是在训练过程中,基于每个训练样本对应的纹理特征损失值进行调整的,纹理特征损失值是基于对待训练的纹理处理子模型进行参数调整过程中,针对待训练的纹理处理子模型预设的纹理控制参数确定的;
检索单元503,用于基于待检索视频与至少一个第二候选视频之间的内容相似度,将内容相似度符合设定内容相似要求的第二候选视频,作为对应的目标视频输出。
在实施例中,目标视频检索模型还包括目标纹理处理子模型,检索单元503用于:
采用目标纹理处理子模型,对图像特征进行特征提取,获得对应的第一纹理特征;
针对至少一个第二候选视频,分别执行以下操作:确定第一纹理特征与一个第二候选视频的第二纹理特征之间的纹理特征距离,若纹理特征距离低于预设纹理特征距离门限值,则判定待检索视频与一个第二候选视频之间的内容相似度符合设定内容相似要求,并将一个第二候选视频确定为目标视频输出;其中,第二纹理特征表征对应的一个第二候选视频的纹理信息。
在实施例中,检索单元503用于:
针对至少一个第二候选视频,分别执行以下操作:
将总匹配时长与比较时长之间的比值,确定为待检索视频与一个第二候选视频之间的内容重复度;其中,总匹配时长是基于至少一个第二候选视频各自与待检索视频之间的匹配时长获得的,比较时长是待检索视频与一个第二候选视频中视频时长较短的时长取值;
若内容重复度超过设定的内容重复度门限值,则判定待检索视频与一个第二候选视频之间的内容相似度符合设定内容相似要求,并将一个第二候选视频确定为目标视频输出。
在实施例中,检索单元503用于:
针对至少一个第二候选视频,分别执行以下操作:
确定待检索视频与一个第二候选视频之间的相同量化特征的数量;
将相同量化特征的数量与比较时长之间的比值,确定为待检索视频与一个第二候选视频之间的内容重复度;其中,比较时长是待检索视频与一个第二候选视频中视频时长较短的时长取值;
若内容重复度超过设定内容重复度门限值,则判定待检索视频与一个第二候选视频之间的内容相似度符合设定内容相似要求,并将一个第二候选视频确定为目标视频输出。
在实施例中,量化处理单元502用于:
分别确定第一量化特征与各个第一候选视频各自的第二量化特征之间的量化特征距离;
将量化特征距离低于预设量化特征距离门限值的第一候选视频,确定为第二候选视频;其中,每个第二量化特征表征对应的至少一个第一候选视频所归属的视频类别。
在实施例中,装置500还包括模型训练单元504,模型训练单元504通过执行以下方式,获得已训练的目标视频检索模型:
获得多个样本三元组,每个样本三元组包含样本视频、以及样本视频关联的正向标签和负向标签;
将各个样本三元组作为训练数据,依次输入到待训练的图像处理子模型、待处理的纹理处理子模型中,获得对应的第一纹理集合;其中,每获得一个第一纹理集合,基于一个第一纹理集合包含的多个第一样本纹理特征,生成对应的图像处理纹理特征损失值,并基于图像处理纹理特征损失值,对待训练的图像处理子模型、待训练的纹理处理子模型进行参数调整,直至图像处理纹理特征损失值不高于预设的图像处理纹理特征损失门限值时,获得候选图像处理子模型和候选纹理处理子模型;
将各个样本三元组作为训练数据,依次输入到候选图像处理子模型、候选纹理处理子模型和待训练的量化处理子模型中,获得对应的第二纹理集合和量化特征组;其中,每获得一个第二纹理集合,基于一个第二纹理集合包含的多个第二样本纹理特征,生成对应的纹理特征损失值,并基于纹理特征损失值,对候选纹理处理子模型进行参数调整,以及每获得一个量化特征组,基于一个量化特征组包含的多个样本量化特征、纹理特征损失值,生成对应的量化特征损失值,并基于量化特征损失值,对候选图像处理子模型和待训练的量化处理子模型进行参数调整,直至纹理特征损失值、量化特征损失值均不高于预设的特征损失门限值时,获得目标图像处理子模型、目标纹理处理子模型,以及目标量化处理子模型。
在实施例中,模型训练单元504用于:
基于纹理特征损失值,调整待训练的量化处理子模型的量化控制参数;
基于一个量化特征组包含的多个样本量化特征和量化控制参数,分别确定待训练的量化处理子模型的训练样本损失值和符号量化损失值;
基于待训练的量化处理子模型的训练样本损失值、符号量化损失值,生成对应的量化特征损失值。
在实施例中,模型训练单元504用于:
基于一个量化特征组包含的多个样本量化特征,分别确定待训练的量化处理子模型的训练样本损失值和符号量化损失值;
基于待训练的量化处理子模型的训练样本损失值、符号量化损失值和纹理特征损失值,生成对应的量化特征损失值。
为了描述的方便,以上各部分按照功能划分为各模块(或单元)分别描述。当然,在实施本申请时可以把各模块(或单元)的功能在同一个或多个软件或硬件中实现。
在介绍了本申请示例性实施方式的服务平台的访问方法和装置之后,接下来,介绍根据本申请的另一示例性实施方式的计算机设备。
所属技术领域的技术人员能够理解,本申请的各个方面可以实现为系统、方法或程序 产品。因此,本申请的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。
与上述方法实施例基于同一发明构思,本申请实施例中还提供了一种计算机设备,参阅图6所示,计算机设备600可以至少包括处理器601、以及存储器602。其中,存储器602存储有程序代码,当程序代码被处理器601执行时,使得处理器601执行上述任意一种视频检索的方法的步骤。
在一些可能的实施方式中,根据本申请的计算装置可以包括至少一个处理器、以及至少一个存储器。其中,存储器存储有程序代码,当程序代码被处理器执行时,使得处理器执行本说明书上述描述的根据本申请各种示例性实施方式的视频检索的方法中的步骤。例如,处理器可以执行如图4中所示的步骤。
下面参照图7来描述根据本申请的这种实施方式的计算装置700。图7的计算装置700仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。
如图7所示,计算装置700以通用计算装置的形式表现。计算装置700的组件可以包括但不限于:上述至少一个处理单元701、上述至少一个存储单元702、连接不同系统组件(包括存储单元702和处理单元701)的总线703。
总线703表示几类总线结构中的一种或多种,包括存储器总线或者存储器控制器、外围总线、处理器或者使用多种总线结构中的任意总线结构的局域总线。
存储单元702可以包括易失性或非易失性存储器形式的计算机可读存储介质,例如随机存取存储器(RAM)7021和/或高速缓存存储单元7022,还可以进一步包括只读存储器(ROM)7023。该计算机可读存储介质包括程序代码,当程序代码在计算机设备上运行时,该程序代码用于使该计算机设备执行上述任意一种视频检索的方法的步骤。
存储单元702还可以包括具有一组(至少一个)程序模块7024的程序/实用工具7025,这样的程序模块7024包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。
计算装置700也可以与一个或多个外部设备704(例如键盘、指向设备等)通信,还可与一个或者多个使得用户能与计算装置700交互的设备通信,和/或与使得该计算装置700能与一个或多个其它计算装置进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口705进行。并且,计算装置700还可以通过网络适配器706与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器706通过总线703与用于计算装置700的其它模块通信。应当理解,尽管图中未示出,可以结合计算装置700使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理器、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。
与上述方法实施例基于同一发明构思,本申请提供的视频检索的方法的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当程序产品在计算机设备上运行时,程序代码用于使计算机设备执行本说明书上述描述的根据本申请各种示例性实施方式的视频检索的方法中的步骤,例如,电子设备可以执行如图4中所示的步骤。
程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以是但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、 只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
具体地,本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行上述任意一种视频检索的方法的步骤。
尽管已描述了本申请的实施例,但本领域内的技术人员一旦得知了基本创造性概念,则可对这些实施例做出另外的变更和修改。所以,所附权利要求意欲解释为包括实施例以及落入本申请范围的所有变更和修改。
显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的精神和范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (16)

  1. 一种视频检索的方法,由计算机设备执行,包括:
    采用已训练的目标视频检索模型的目标图像处理子模型,对待检索视频进行特征提取,获得对应的图像特征;
    采用所述目标视频检索模型的目标量化处理子模型,对所述图像特征进行特征提取,获得对应的第一量化特征,并基于所述第一量化特征,从各个第一候选视频中筛选出与所述待检索视频的类别相似度符合设定类别相似要求的至少一个第二候选视频;其中,所述目标量化处理子模型的量化控制参数是在训练过程中,基于每个训练样本对应的纹理特征损失值进行调整的,所述纹理特征损失值是基于对待训练的纹理处理子模型进行参数调整过程中,针对所述待训练的纹理处理子模型预设的纹理控制参数确定的;
    基于所述待检索视频与所述至少一个第二候选视频之间的内容相似度,将内容相似度符合设定内容相似要求的第二候选视频,作为对应的目标视频输出。
  2. 如权利要求1所述的方法,其中,所述目标视频检索模型还包括目标纹理处理子模型;
    所述基于所述待检索视频与所述至少一个第二候选视频之间的内容相似度,将内容相似度符合设定内容相似要求的第二候选视频,作为对应的目标视频输出,包括:
    采用所述目标纹理处理子模型,对所述图像特征进行特征提取,获得对应的第一纹理特征;
    针对所述至少一个第二候选视频,分别执行以下操作:确定所述第一纹理特征与一个第二候选视频的第二纹理特征之间的纹理特征距离,若所述纹理特征距离低于预设纹理特征距离门限值,则判定所述待检索视频与所述一个第二候选视频之间的内容相似度符合所述设定内容相似要求,并将所述一个第二候选视频确定为所述目标视频输出;其中,所述第二纹理特征表征对应的一个第二候选视频的纹理信息。
  3. 如权利要求1所述的方法,其中,所述基于所述待检索视频与所述至少一个第二候选视频之间的内容相似度,将内容相似度符合设定内容相似要求的第二候选视频,作为对应的目标视频输出,包括:
    针对所述至少一个第二候选视频,分别执行以下操作:
    将总匹配时长与比较时长之间的比值,确定为所述待检索视频与一个第二候选视频之间的内容重复度;其中,所述总匹配时长是基于所述至少一个第二候选视频各自与所述待检索视频之间的匹配时长获得的,所述比较时长是所述待检索视频与所述一个第二候选视频中视频时长较短的时长取值;
    若所述内容重复度超过设定的内容重复度门限值,则判定所述待检索视频与所述一个第二候选视频之间的内容相似度符合所述设定内容相似要求,并将所述一个第二候选视频确定为所述目标视频输出。
  4. 如权利要求1所述的方法,其中,所述基于所述待检索视频与所述至少一个第二候选视频之间的内容相似度,将内容相似度符合设定内容相似要求的第二候选视频,作为对应的目标视频输出,包括:
    针对所述至少一个第二候选视频,分别执行以下操作:
    确定所述待检索视频与一个第二候选视频之间的相同量化特征的数量;
    将所述相同量化特征的数量与比较时长之间的比值,确定为所述待检索视频与一个第二候选视频之间的内容重复度;其中,所述比较时长是所述待检索视频与所述一个第二候 选视频中视频时长较短的时长取值;
    若所述内容重复度超过设定内容重复度门限值,则判定所述待检索视频与所述一个第二候选视频之间的内容相似度符合所述设定内容相似要求,并将所述一个第二候选视频确定为所述目标视频输出。
  5. 如权利要求1-4任一项所述的方法,其中,所述基于所述第一量化特征,从各个第一候选视频中筛选出与所述待检索视频的类别相似度符合设定类别相似要求的至少一个第二候选视频,包括:
    分别确定所述第一量化特征与所述各个第一候选视频各自的第二量化特征之间的量化特征距离;
    将量化特征距离低于预设量化特征距离门限值的第一候选视频,确定为第二候选视频;其中,每个第二量化特征表征对应的至少一个第一候选视频所归属的视频类别。
  6. 如权利要求1-4任一项所述的方法,其中,通过执行以下方式,获得所述已训练的目标视频检索模型:
    获得多个样本三元组,每个样本三元组包含样本视频、以及所述样本视频关联的正向标签和负向标签;
    将各个样本三元组作为训练数据,依次输入到待训练的图像处理子模型、待处理的纹理处理子模型中,获得对应的第一纹理集合;其中,每获得一个第一纹理集合,基于所述一个第一纹理集合包含的多个第一样本纹理特征,生成对应的图像处理纹理特征损失值,并基于所述图像处理纹理特征损失值,对所述待训练的图像处理子模型、所述待训练的纹理处理子模型进行参数调整,直至所述图像处理纹理特征损失值不高于预设的图像处理纹理特征损失门限值时,获得候选图像处理子模型和候选纹理处理子模型;
    将所述各个样本三元组作为训练数据，依次输入到所述候选图像处理子模型、候选纹理处理子模型和待训练的量化处理子模型中，获得对应的第二纹理集合和量化特征组；其中，每获得一个第二纹理集合，基于所述一个第二纹理集合包含的多个第二样本纹理特征，生成对应的纹理特征损失值，并基于所述纹理特征损失值，对所述候选纹理处理子模型进行参数调整，以及每获得一个量化特征组，基于所述一个量化特征组包含的多个样本量化特征、所述纹理特征损失值，生成对应的量化特征损失值，并基于所述量化特征损失值，对所述候选图像处理子模型和所述待训练的量化处理子模型进行参数调整，直至所述纹理特征损失值、所述量化特征损失值均不高于预设的特征损失门限值时，获得所述目标图像处理子模型、所述目标纹理处理子模型，以及所述目标量化处理子模型。
  7. 如权利要求6所述的方法,其中,基于所述一个量化特征组包含的多个样本量化特征、所述纹理特征损失值,生成对应的量化特征损失值,包括:
    基于所述纹理特征损失值,调整所述待训练的量化处理子模型的量化控制参数;
    基于所述一个量化特征组包含的多个样本量化特征和所述量化控制参数,分别确定所述待训练的量化处理子模型的训练样本损失值和符号量化损失值;
    基于所述待训练的量化处理子模型的训练样本损失值、符号量化损失值,生成对应的量化特征损失值。
  8. 如权利要求6所述的方法,其中,基于所述一个量化特征组包含的多个样本量化特征、所述纹理特征损失值,生成对应的量化特征损失值,包括:
    基于所述一个量化特征组包含的多个样本量化特征,分别确定所述待训练的量化处理子模型的训练样本损失值和符号量化损失值;
    基于所述待训练的量化处理子模型的训练样本损失值、符号量化损失值和所述纹理特 征损失值,生成对应的量化特征损失值。
  9. 一种视频检索的装置,包括:
    图像处理单元,用于采用已训练的目标视频检索模型的目标图像处理子模型,对待检索视频进行特征提取,获得对应的图像特征;
    量化处理单元,用于采用所述目标视频检索模型的目标量化处理子模型,对所述图像特征进行特征提取,获得对应的第一量化特征,并基于所述第一量化特征,从各个第一候选视频中筛选出与所述待检索视频的类别相似度符合设定类别相似要求的至少一个第二候选视频;其中,所述目标量化处理子模型的量化控制参数是在训练过程中,基于每个训练样本对应的纹理特征损失值进行调整的,所述纹理特征损失值是基于对待训练的纹理处理子模型进行参数调整过程中,针对所述待训练的纹理处理子模型预设的纹理控制参数确定的;
    检索单元,用于基于所述待检索视频与所述至少一个第二候选视频之间的内容相似度,将内容相似度符合设定内容相似要求的第二候选视频,作为对应的目标视频输出。
  10. 如权利要求9所述的装置,其中,所述目标视频检索模型还包括目标纹理处理子模型,所述检索单元用于:
    采用所述目标纹理处理子模型,对所述图像特征进行特征提取,获得对应的第一纹理特征;
    针对所述至少一个第二候选视频,分别执行以下操作:确定所述第一纹理特征与一个第二候选视频的第二纹理特征之间的纹理特征距离,若所述纹理特征距离低于预设纹理特征距离门限值,则判定所述待检索视频与所述一个第二候选视频之间的内容相似度符合所述设定内容相似要求,并将所述一个第二候选视频确定为所述目标视频输出;其中,所述第二纹理特征表征对应的一个第二候选视频的纹理信息。
  11. 如权利要求9所述的装置,其中,所述检索单元用于:
    针对所述至少一个第二候选视频,分别执行以下操作:
    将总匹配时长与比较时长之间的比值,确定为所述待检索视频与一个第二候选视频之间的内容重复度;其中,所述总匹配时长是基于所述至少一个第二候选视频各自与所述待检索视频之间的匹配时长获得的,所述比较时长是所述待检索视频与所述一个第二候选视频中视频时长较短的时长取值;
    若所述内容重复度超过设定的内容重复度门限值,则判定所述待检索视频与所述一个第二候选视频之间的内容相似度符合所述设定内容相似要求,并将所述一个第二候选视频确定为所述目标视频输出。
  12. 如权利要求9所述的装置,其中,所述检索单元用于:
    针对所述至少一个第二候选视频,分别执行以下操作:
    确定所述待检索视频与一个第二候选视频之间的相同量化特征的数量;
    将所述相同量化特征的数量与比较时长之间的比值,确定为所述待检索视频与一个第二候选视频之间的内容重复度;其中,所述比较时长是所述待检索视频与所述一个第二候选视频中视频时长较短的时长取值;
    若所述内容重复度超过设定内容重复度门限值,则判定所述待检索视频与所述一个第二候选视频之间的内容相似度符合所述设定内容相似要求,并将所述一个第二候选视频确定为所述目标视频输出。
  13. 如权利要求9-12任一项所述的装置,其中,所述量化处理单元用于:
    分别确定所述第一量化特征与所述各个第一候选视频各自的第二量化特征之间的量 化特征距离;
    将量化特征距离低于预设量化特征距离门限值的第一候选视频,确定为第二候选视频;其中,每个第二量化特征表征对应的至少一个第一候选视频所归属的视频类别。
  14. 一种计算机设备,其包括处理器和存储器,其中,所述存储器存储有程序代码,当所述程序代码被所述处理器执行时,使得所述处理器执行权利要求1~8中任一项所述方法的步骤。
  15. 一种计算机可读存储介质，其包括程序代码，当程序代码在计算机设备上运行时，所述程序代码用于使所述计算机设备执行权利要求1~8中任一项所述方法的步骤。
  16. 一种计算机程序产品或计算机程序,所述计算机程序产品或计算机程序包括计算机指令,所述计算机指令存储在计算机可读存储介质中;计算机设备的处理器从所述计算机可读存储介质读取并执行所述计算机指令,使得所述计算机设备执行权利要求1~8中任一项所述方法的步骤。
PCT/CN2022/105871 2021-08-24 2022-07-15 视频检索的方法、装置、设备及存储介质 WO2023024749A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22860095.3A EP4390725A1 (en) 2021-08-24 2022-07-15 Video retrieval method and apparatus, device, and storage medium
US18/136,538 US20230297617A1 (en) 2021-08-24 2023-04-19 Video retrieval method and apparatus, device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110973390.0A CN114282059A (zh) 2021-08-24 2021-08-24 视频检索的方法、装置、设备及存储介质
CN202110973390.0 2021-08-24

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/136,538 Continuation US20230297617A1 (en) 2021-08-24 2023-04-19 Video retrieval method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023024749A1 true WO2023024749A1 (zh) 2023-03-02

Family

ID=80868419

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/105871 WO2023024749A1 (zh) 2021-08-24 2022-07-15 视频检索的方法、装置、设备及存储介质

Country Status (4)

Country Link
US (1) US20230297617A1 (zh)
EP (1) EP4390725A1 (zh)
CN (1) CN114282059A (zh)
WO (1) WO2023024749A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282059A (zh) * 2021-08-24 2022-04-05 腾讯科技(深圳)有限公司 视频检索的方法、装置、设备及存储介质
CN115098732B (zh) * 2022-08-11 2022-11-11 腾讯科技(深圳)有限公司 数据处理方法及相关装置
CN117670689A (zh) * 2024-01-31 2024-03-08 四川辰宇微视科技有限公司 一种通过ai算法控制提高紫外像增强器图像质量的方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190332867A1 (en) * 2017-05-11 2019-10-31 Tencent Technology (Shenzhen) Company Limited Method and apparatus for retrieving similar video and storage medium
WO2021017289A1 (zh) * 2019-08-01 2021-02-04 平安科技(深圳)有限公司 在视频中定位对象的方法、装置、计算机设备及存储介质
CN113254687A (zh) * 2021-06-28 2021-08-13 腾讯科技(深圳)有限公司 图像检索、图像量化模型训练方法、装置和存储介质
CN113255625A (zh) * 2021-07-14 2021-08-13 腾讯科技(深圳)有限公司 一种视频检测方法、装置、电子设备和存储介质
CN114282059A (zh) * 2021-08-24 2022-04-05 腾讯科技(深圳)有限公司 视频检索的方法、装置、设备及存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190332867A1 (en) * 2017-05-11 2019-10-31 Tencent Technology (Shenzhen) Company Limited Method and apparatus for retrieving similar video and storage medium
WO2021017289A1 (zh) * 2019-08-01 2021-02-04 平安科技(深圳)有限公司 在视频中定位对象的方法、装置、计算机设备及存储介质
CN113254687A (zh) * 2021-06-28 2021-08-13 腾讯科技(深圳)有限公司 图像检索、图像量化模型训练方法、装置和存储介质
CN113255625A (zh) * 2021-07-14 2021-08-13 腾讯科技(深圳)有限公司 一种视频检测方法、装置、电子设备和存储介质
CN114282059A (zh) * 2021-08-24 2022-04-05 腾讯科技(深圳)有限公司 视频检索的方法、装置、设备及存储介质

Also Published As

Publication number Publication date
EP4390725A1 (en) 2024-06-26
US20230297617A1 (en) 2023-09-21
CN114282059A (zh) 2022-04-05

Similar Documents

Publication Publication Date Title
WO2023024749A1 (zh) 视频检索的方法、装置、设备及存储介质
WO2022068196A1 (zh) 跨模态的数据处理方法、装置、存储介质以及电子装置
WO2020140386A1 (zh) 基于TextCNN知识抽取方法、装置、计算机设备及存储介质
WO2022126971A1 (zh) 基于密度的文本聚类方法、装置、设备及存储介质
WO2021155713A1 (zh) 基于权重嫁接的模型融合的人脸识别方法及相关设备
WO2022042123A1 (zh) 图像识别模型生成方法、装置、计算机设备和存储介质
WO2022095356A1 (zh) 用于图像分类的迁移学习方法、相关装置及存储介质
Xu et al. Rethinking data collection for person re-identification: active redundancy reduction
WO2022048363A1 (zh) 网站分类方法、装置、计算机设备及存储介质
WO2020147409A1 (zh) 一种文本分类方法、装置、计算机设备及存储介质
US20220270384A1 (en) Method for training adversarial network model, method for building character library, electronic device, and storage medium
WO2023138188A1 (zh) 特征融合模型训练及样本检索方法、装置和计算机设备
WO2023108995A1 (zh) 向量相似度计算方法、装置、设备及存储介质
CN111538859B (zh) 一种动态更新视频标签的方法、装置及电子设备
WO2023024408A1 (zh) 用户特征向量确定方法、相关设备及介质
CN108090117A (zh) 一种图像检索方法及装置,电子设备
WO2022001233A1 (zh) 基于层次化迁移学习的预标注方法及其相关设备
CN108764258A (zh) 一种用于群体图像插入的最优图像集选取方法
CN113360683A (zh) 训练跨模态检索模型的方法以及跨模态检索方法和装置
CN113190696A (zh) 一种用户筛选模型的训练、用户推送方法和相关装置
CN110532448B (zh) 基于神经网络的文档分类方法、装置、设备及存储介质
CN111651660A (zh) 一种跨媒体检索困难样本的方法
WO2023065640A1 (zh) 一种模型参数调整方法、装置、电子设备和存储介质
CN113139490B (zh) 一种图像特征匹配方法、装置、计算机设备及存储介质
WO2022142032A1 (zh) 手写签名校验方法、装置、计算机设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22860095

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022860095

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2022860095

Country of ref document: EP

Effective date: 20240322