CN114595357A - Video searching method and device, electronic equipment and storage medium

Video searching method and device, electronic equipment and storage medium

Info

Publication number
CN114595357A
Authority
CN
China
Prior art keywords
data
key frame
image
video
text
Prior art date
Legal status
Pending
Application number
CN202210163976.5A
Other languages
Chinese (zh)
Inventor
唐小初
舒畅
陈又新
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210163976.5A
Priority to PCT/CN2022/090736 (published as WO2023159765A1)
Publication of CN114595357A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval using metadata automatically derived from the content
    • G06F16/7834 Retrieval using metadata automatically derived from the content, using audio features
    • G06F16/7844 Retrieval using metadata automatically derived from the content, using original textual content or text extracted from visual content or transcript of audio data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a video searching method and device, electronic equipment and a storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring original search data, wherein the original search data comprises text data and original video data; performing frame extraction processing on the original video data to obtain candidate key frame data; standardizing the candidate key frame data through a pre-trained data processing model to obtain standard key frame data; encoding the text data through an encoding layer of the data processing model to obtain a text vector, and encoding the standard key frame data through the encoding layer to obtain a plurality of key frame image vectors; calculating a first similarity value between the text vector and each key frame image vector; and screening the standard key frame data according to the first similarity value to obtain a target video clip. The method and device can improve the accuracy of video searching.

Description

Video searching method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a video search method and apparatus, an electronic device, and a storage medium.
Background
Because the amount of video data is huge, video searching is needed to locate videos rapidly. At present, video searching is mostly performed by matching key information, for example searching by text based on speech recognition, which often suffers from low accuracy. How to improve the accuracy of video searching has therefore become an urgent technical problem.
Disclosure of Invention
The embodiment of the application mainly aims to provide a video searching method and device, electronic equipment and a storage medium, and aims to improve the accuracy of video searching.
In order to achieve the above object, a first aspect of an embodiment of the present application provides a video search method, where the method includes:
acquiring original search data, wherein the original search data comprises text data and original video data;
performing frame extraction processing on the original video data to obtain candidate key frame data;
standardizing the candidate key frame data through a pre-trained data processing model to obtain standard key frame data;
coding the text data through a coding layer of the data processing model to obtain a text vector, and coding the standard key frame data through the coding layer to obtain a plurality of key frame image vectors;
calculating a first similarity value of the text vector and each key frame image vector;
and screening the standard key frame data according to the first similarity value to obtain a target video clip.
In some embodiments, the performing frame extraction on the original video data to obtain candidate key frame data includes:
analyzing the original video data to obtain an original video image;
vectorizing the original video image through an online network of a pre-trained key frame extraction model to obtain a first image feature vector;
performing image enhancement processing on the original video image through a target network of a pre-trained key frame extraction model to obtain a second image feature vector;
and calculating a second similarity value of the first image feature vector and the second image feature vector, and obtaining the candidate key frame data according to the second similarity value.
In some embodiments, the vectorizing the original video image through an online network of a pre-trained key frame extraction model to obtain a first image feature vector includes:
performing feature extraction on the original video image through the online network to obtain a first video feature map;
and mapping the first video feature map to a preset first high-dimensional vector space through the online network to obtain the first image feature vector.
In some embodiments, the image enhancement processing on the original video image through a target network of a pre-trained key frame extraction model to obtain a second image feature vector includes:
performing image enhancement processing on the original video image through the target network, and performing feature extraction on the original video image subjected to the image enhancement processing to obtain a second video feature map;
and mapping the second video feature map to a preset second high-dimensional vector space through the target network to obtain the second image feature vector.
In some embodiments, the normalizing the candidate keyframe data by a pre-trained data processing model to obtain normalized keyframe data includes:
extracting the characteristics of the candidate key frame data to obtain candidate text characteristics, candidate audio characteristics and candidate key frame images;
performing semantic analysis on the candidate audio features to obtain standard audio data;
performing text recognition processing on the candidate text features to obtain character text data;
and performing fusion processing on the standard audio data, the character text data and the candidate key frame image to obtain standard key frame data.
In some embodiments, the encoding layer includes a text encoder and an image encoder, and the encoding processing on the text data by the encoding layer of the data processing model to obtain a text vector and the encoding processing on the plurality of standard key frame images in the standard key frame data by the encoding layer to obtain a plurality of key frame image vectors includes:
performing text coding on the text data through a text coder to obtain the text vector;
and carrying out image coding on the plurality of standard key frame images through an image coder to obtain a plurality of key frame image vectors.
In some embodiments, the screening the standard key frame data according to the first similarity value to obtain a target video clip includes:
screening the standard key frame image in the standard key frame data according to the first similarity value to obtain a target key frame image;
and splicing the target key frame images to obtain the target video clip.
To achieve the above object, a second aspect of the embodiments of the present application provides a video search apparatus, including:
the data acquisition module is used for acquiring original search data, wherein the original search data comprises text data and original video data;
the frame extraction processing module is used for carrying out frame extraction processing on the original video data to obtain candidate key frame data;
the standardization processing module is used for carrying out standardization processing on the candidate key frame data through a pre-trained data processing model to obtain standard key frame data;
the encoding module is used for encoding the text data through an encoding layer of the data processing model to obtain text vectors, and encoding the standard key frame data through the encoding layer to obtain a plurality of key frame image vectors;
a calculating module, configured to calculate a first similarity value between the text vector and each of the key frame image vectors;
and the screening module is used for screening the standard key frame data according to the first similarity value to obtain a target video clip.
In order to achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, which includes a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for implementing connection communication between the processor and the memory, wherein the program, when executed by the processor, implements the method of the first aspect.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium for computer-readable storage, and stores one or more programs, which are executable by one or more processors to implement the method of the first aspect.
The video searching method and device, the electronic equipment and the storage medium provided by the application can be used for obtaining original searching data, wherein the original searching data comprises text data and original video data. And then, frame extraction processing is carried out on the original video data to obtain candidate key frame data, and standardization processing is carried out on the candidate key frame data through a pre-trained data processing model to obtain standard key frame data, so that the obtained standard key frame data can better meet the requirement of video search, and the calculation amount of the video search is reduced. And coding the text data and the standard key frame data through a coding layer of the data processing model to obtain a text vector and a plurality of key frame image vectors. Meanwhile, the first similarity value of the text vector and each key frame image vector is calculated, so that the correlation between the text vector and each key frame image vector can be accurately determined. And finally, screening the standard key frame data according to the first similarity value to obtain a target video clip, so that the accuracy of video searching can be improved.
Drawings
Fig. 1 is a flowchart of a video search method provided in an embodiment of the present application;
FIG. 2 is a flowchart of step S102 in FIG. 1;
FIG. 3 is a flowchart of step S202 in FIG. 2;
fig. 4 is a flowchart of step S203 in fig. 2;
fig. 5 is a flowchart of step S103 in fig. 1;
fig. 6 is a flowchart of step S104 in fig. 1;
FIG. 7 is a flowchart of step S106 in FIG. 1;
fig. 8 is a schematic structural diagram of a video search apparatus according to an embodiment of the present application;
fig. 9 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
First, several terms referred to in this application are explained:
artificial Intelligence (AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding human intelligence; artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produces a new intelligent machine that can react in a manner similar to human intelligence, and research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems, among others. The artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computers to process, understand and apply human languages (such as Chinese and English). It is a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often called computational linguistics. Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. It is commonly used in machine translation, character recognition of handwriting and print, speech recognition and text-to-speech conversion, information intention recognition, information extraction and filtering, text classification and clustering, and public opinion analysis and viewpoint mining, and involves language-processing-related data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to language computation.
Information Extraction (IE): a text processing technique that extracts specified types of factual information, such as entities, relations and events, from natural language text and outputs it as structured data. Text data is composed of specific units, such as sentences, paragraphs and chapters, and text information is composed of smaller specific units, such as words, phrases, sentences and paragraphs, or combinations of these units. Extracting noun phrases, names of people, names of places, and the like from text data is text information extraction; of course, the information extracted by text information extraction techniques can be of various types.
Self-supervised learning: self-supervised learning can be regarded as an ideal state of machine learning, in which the model learns directly from unlabeled data without manual annotation. Its core question is how to automatically generate labels for the data. For example, an input picture is randomly rotated by an angle; the rotated picture is then used as the input and the rotation angle as the label. As another example, an input picture is evenly divided into a 3 × 3 grid, the content of each cell is treated as a small block (patch), the arrangement order of the patches is randomly shuffled, and the shuffled patches are used as the input with the correct arrangement order as the label. Labels generated automatically in this way require no human involvement at all. The performance of self-supervised learning is mainly evaluated by the quality of the learned features, which is in turn assessed by transferring the features to other visual tasks (such as classification, segmentation and object detection) and evaluating the quality of the results on those tasks.
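The rotation example above can be made concrete with a short sketch; this is an illustration, not code from the application, and assumes images are numpy arrays:

```python
import numpy as np

def make_rotation_sample(image: np.ndarray) -> tuple[np.ndarray, int]:
    """Self-supervised pretext task: rotate the picture by a random
    multiple of 90 degrees and use the rotation index as the
    automatically generated label."""
    k = np.random.randint(4)               # label: 0/1/2/3 quarter turns
    return np.rot90(image, k=k).copy(), k  # (rotated input, label)
```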
Web crawler (also known as a web spider or web robot; in the FOAF community, more often called a web chaser): a program or script that automatically crawls the World Wide Web according to certain rules.
Frame extraction: extracting a number of frames from a video at certain intervals. It mirrors time-lapse shooting, in which pictures taken at a fixed time interval are combined to form a video.
Key frame: a computer animation term for the frame in which a key action in the motion of a character or object occurs, equivalent to the original drawing in two-dimensional animation. Frames between key frames, called transition frames or intermediate frames, can be generated by software. A "frame" is the smallest-unit single picture in a motion picture, corresponding to a single shot on motion picture film; it appears as one mark on the timeline of animation software.
Convolutional neural network: a feedforward neural network whose artificial neurons respond to surrounding units and which can perform large-scale image processing. A convolutional neural network may comprise convolutional layers, pooling layers and fully-connected layers. The convolutional layer performs feature extraction on the image: a convolution kernel is applied to the image to obtain preliminary image features, which are then pooled by the pooling layer and processed by the fully-connected layer to finally obtain the image convolution features. Pooling can be understood as compression: it aggregates features at different locations, for example computing the average of a specific feature over a region of the image as the value of that region, so that dimensionality is reduced while results improve and overfitting becomes less likely. Pooling includes average pooling and max pooling: taking the average value of a feature over a region as the region's value is called average pooling, and taking the maximum value is called max pooling.
Image enhancement: enhancing the useful information in an image, which may be a distortion-introducing process, with the aim of improving the image's visual effect for a given application. It emphasizes the overall or local characteristics of an image, turns an originally unclear image into a clear one or emphasizes certain features of interest, enlarges the differences between the features of different objects in the image, and suppresses features of no interest, thereby improving image quality, enriching information content, strengthening image interpretation and recognition, and meeting the needs of certain special analyses.
Image enhancement methods fall into two broad categories: frequency domain methods and spatial domain methods. The former treats the image as a two-dimensional signal and enhances it based on the two-dimensional Fourier transform: a low-pass filter (passing only low-frequency signals) removes noise in the image, while a high-pass filter (passing only high-frequency signals) enhances high-frequency components such as edges, making a blurred picture clear. Typical spatial domain methods include local averaging and median filtering (taking the middle pixel value in a local neighborhood), which are used to remove or reduce noise.
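The frequency-domain low-pass filtering described above can be sketched in a few lines of numpy; the cutoff fraction is an illustrative assumption, and grayscale images are assumed:

```python
import numpy as np

def lowpass_denoise(image: np.ndarray, cutoff: float = 0.1) -> np.ndarray:
    """Two-dimensional Fourier transform, keep only frequencies near the
    centre of the spectrum (low frequencies), then transform back."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = dist <= cutoff * min(h, w)   # low-pass: pass only low frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum * mask)))
```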
Encoder: an encoder converts an input sequence into a fixed-length vector.
Automatic Speech Recognition (ASR): a technology that converts human speech into text. The input to speech recognition is typically a speech signal in the time domain, mathematically represented as a sequence of vectors of length T and fixed dimension; the output is text, represented as a sequence of N tokens drawn from a set of distinct tokens.
With the development of internet technology, the amount of data people face is growing rapidly, and video data is an important part of it: videos bring fun to and enrich people's lives, and watching different videos can broaden people's horizons and knowledge.
In general, people are more inclined to watch video than to read text. In some scenarios, videos that have already been viewed are difficult to find again through text alone without viewing history. Moreover, because the amount of video data is huge, video searching is required in order to locate videos rapidly. At present, video searching is mostly performed by matching key information, for example searching by text based on speech recognition, which often suffers from low accuracy. How to improve the accuracy of video searching has therefore become an urgent technical problem.
Based on this, embodiments of the present application provide a video search method and apparatus, an electronic device, and a storage medium, which aim to improve accuracy of video search.
The video search method and apparatus, the electronic device, and the storage medium provided in the embodiments of the present application are specifically described with reference to the following embodiments, and first, the video search method in the embodiments of the present application is described.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the application provides a video searching method, relating to the technical field of artificial intelligence. The video searching method provided by the embodiment of the application can be applied to a terminal, a server, or software running on a terminal or server. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server may be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (content delivery network), and big data and artificial intelligence platforms; the software may be an application implementing the video search method, but is not limited to the above forms.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an alternative flowchart of a video search method provided in an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S101 to S106.
Step S101, acquiring original search data, wherein the original search data comprises text data and original video data;
step S102, frame extraction processing is carried out on original video data to obtain candidate key frame data;
step S103, carrying out standardization processing on candidate key frame data through a pre-trained data processing model to obtain standard key frame data;
step S104, carrying out coding processing on the text data through a coding layer of the data processing model to obtain a text vector, and carrying out coding processing on the standard key frame data through the coding layer to obtain a plurality of key frame image vectors;
step S105, calculating a first similarity value of the text vector and each key frame image vector;
and S106, screening the standard key frame data according to the first similarity value to obtain a target video clip.
In steps S101 to S106 of the embodiment of the present application, candidate key frame data are obtained by performing frame extraction processing on original video data, and standard key frame data are obtained by performing standardization processing on the candidate key frame data through a pre-trained data processing model, so that the obtained standard key frame data can better meet the requirement of video search, and the amount of calculation of video search is reduced. And respectively coding the text data and the standard key frame data through a coding layer of the data processing model to obtain a text vector and a plurality of key frame image vectors. Meanwhile, the first similarity value of the text vector and each key frame image vector is calculated, so that the correlation between the text vector and each key frame image vector can be accurately determined. And finally, screening the standard key frame data according to the first similarity value to obtain a target video clip, so that the accuracy of video search can be improved.
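Taken together, steps S101 to S106 amount to the control flow sketched below. This is a hedged outline rather than the application's implementation: every collaborator (frame extractor, normalizer, encoders, similarity function, splicer, threshold) is injected as a callable, since the text does not fix an API for them.

```python
def search_video(text_query, raw_video, extract_keyframes, normalize,
                 encode_text, encode_image, similarity, splice,
                 threshold: float):
    """Control flow of steps S101-S106; all collaborators are callables."""
    candidates = extract_keyframes(raw_video)                   # S102
    frames = normalize(candidates)                              # S103
    text_vec = encode_text(text_query)                          # S104
    frame_vecs = [encode_image(f) for f in frames]              # S104
    sims = [similarity(text_vec, v) for v in frame_vecs]        # S105
    kept = [f for f, s in zip(frames, sims) if s > threshold]   # S106
    return splice(kept)                                         # S106
```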
In step S101 of some embodiments, the original search data may be obtained by writing a web crawler and performing targeted crawling after setting a data source. It should be noted that the original search data includes text data and original video data. The text data refers to search text input by a user, such as search terms, query sentences and search characters; the original video data may be video data stored in an original video library, for example online short videos, movies, learning videos, and the like.
Before step S102 in some embodiments, the video search method further includes a pre-training key frame extraction model, which specifically includes:
step a, obtaining a sample video image;
b, inputting the sample video image pair into the initial model;
c, performing image processing on the sample video image through the initial model to obtain a first sample characteristic image vector and a second sample characteristic image vector;
d, calculating the similarity of the first sample characteristic image vector and the second sample characteristic image vector through a loss function of the initial model;
and e, optimizing the loss function of the initial model according to the similarity so as to update the initial model and obtain a key frame extraction model.
In step a of some embodiments, the sample video data may be obtained by writing a web crawler and performing targeted crawling after setting a data source. The sample video data may also be acquired in other ways, without limitation. Video analysis is then performed on the obtained sample video data to obtain a sample video image.
In step b of some embodiments, the sample video image pair is input into an initial model, which is a BYOL model, wherein the initial model comprises the online network and the target network.
In step c of some embodiments, the sample video image pair is subjected to image enhancement, feature extraction and mapping processing through the initial model, so as to obtain a first sample feature image vector and a second sample feature image vector. Specifically, image enhancement, feature extraction and mapping processing are carried out on a sample video image through an online network to obtain a first sample feature image, image enhancement, feature extraction and mapping processing are carried out on the sample video image through a target network to obtain a second sample feature image, wherein different image enhancement strategies are adopted in the image enhancement processes of the online network and the target network.
In step d of some embodiments, when the similarity of the first sample feature image vector and the second sample feature image vector is calculated by the loss function of the initial model, a cosine similarity algorithm may be used to calculate the similarity of the first sample feature image vector and the second sample feature image vector.
In step e of some embodiments, the loss function of the initial model is optimized according to the similarity: the similarity is compared with a preset similarity threshold, and the model parameters are continuously adjusted until the loss value of the loss function meets a preset update condition, at which point updating of the initial model stops and the key frame extraction model is obtained. The update condition may be that the loss value is smaller than a preset loss threshold, or that the number of iterations reaches a preset threshold, and the like, but is not limited thereto.
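Steps b through e can be pictured with the following PyTorch-style sketch. The predictor head and the exponential-moving-average target update are standard BYOL ingredients assumed here for completeness; the text itself only states that the two branches use different image enhancement strategies and a cosine-similarity-based loss.

```python
import torch
import torch.nn.functional as F

def training_step(online, predictor, target, optimizer, images,
                  augment_a, augment_b, tau=0.99):
    """One update of the initial (BYOL-style) model, steps b-e."""
    v1, v2 = augment_a(images), augment_b(images)  # two enhancement strategies
    p = predictor(online(v1))                      # online-branch vector (step c)
    with torch.no_grad():
        z = target(v2)                             # target-branch vector (step c)
    # Step d: cosine similarity; maximising it == minimising (2 - 2*cos).
    loss = 2.0 - 2.0 * F.cosine_similarity(p, z, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # step e: update online network
    for po, pt in zip(online.parameters(), target.parameters()):
        pt.data.mul_(tau).add_(po.data, alpha=1.0 - tau)  # EMA target update
    return loss.item()
```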
Referring to fig. 2, in some embodiments, step S102 may include, but is not limited to, step S201 to step S204:
step S201, analyzing original video data to obtain an original video image;
step S202, vectorizing an original video image through an online network of a pre-trained key frame extraction model to obtain a first image feature vector;
step S203, performing image enhancement processing on the original video image through a target network of a pre-trained key frame extraction model to obtain a second image feature vector;
step S204, calculating a second similarity value of the first image feature vector and the second image feature vector, and obtaining candidate key frame data according to the second similarity value.
In step S201 of some embodiments, the original video data is subjected to tag classification according to a preset video category tag to obtain a plurality of different types of tag video data, and a video code stream in the tag video data is analyzed to filter a background image in each tag video data to obtain an original video image.
Referring to fig. 3, in some embodiments, step S202 may include, but is not limited to, step S301 to step S302:
step S301, extracting the characteristics of an original video image through an online network to obtain a first video characteristic diagram;
step S302, the first video feature map is mapped to a preset first high-dimensional vector space through an online network, and a first image feature vector is obtained.
In step S301 of some embodiments, the original video image is subjected to feature extraction by the convolution layer of the online network, and representative features of the original video image are captured, so as to generate a first video feature map.
In step S302 of some embodiments, the first video feature map is mapped to a higher-dimensional potential vector space, i.e., a first high-dimensional vector space, through an online network to obtain a first image feature vector.
Referring to fig. 4, in some embodiments, step S203 may include, but is not limited to, step S401 to step S402:
step S401, performing image enhancement processing on an original video image through a target network, and performing feature extraction on the original video image after the image enhancement processing to obtain a second video feature map;
and step S402, mapping the second video characteristic diagram to a preset second high-dimensional vector space through a target network to obtain a second image characteristic vector.
In step S401 of some embodiments, the original video image is first subjected to image enhancement through an image enhancement policy preset in the target network. For example, the original video image is regarded as a two-dimensional signal and enhanced based on the two-dimensional Fourier transform: a low-pass filter passes only the low-frequency signal, removing noise from the original video image and improving its quality. Feature extraction is then performed on the enhanced image to capture its representative features, generating the second video feature map.
In step S402 of some embodiments, the second video feature map is mapped to a higher-dimensional potential vector space, i.e., a second high-dimensional vector space, by the target network, so as to obtain a second image feature vector.
It should be noted that the network structures of the online network and the target network are basically the same, but the parameters of the online network and the target network are not the same.
In step S204 of some embodiments, a second similarity value between the first image feature vector and the second image feature vector may be calculated by a preset similarity algorithm, and candidate keyframe data is obtained according to the second similarity value. The similarity algorithm may be a cosine similarity algorithm. For example, the first image feature vector is denoted as BYOL (img1), the second image feature vector is denoted as BYOL (img2), and the second similarity value cos (BYOL (img1), BYOL (img2)) is calculated; and extracting the candidate key frame images with the second similarity value smaller than or equal to the similarity threshold value to serve as a set, wherein all the candidate key frame images in the set are the candidate key frame data.
In some embodiments, the similarity threshold may be 0.9. If the second similarity value between the first image feature vector and the second image feature vector is less than 0.9, the difference between the original video image and its image-enhanced version is large, and the current frame corresponding to the original video image is a video key frame.
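Under the 0.9 threshold just described, the selection rule of steps S201 to S204 can be sketched as follows; byol_online and byol_target are assumed callables that return 1-D feature vectors for a frame:

```python
import numpy as np

def select_candidate_keyframes(frames, byol_online, byol_target,
                               threshold: float = 0.9):
    """Keep frames whose second similarity value is <= threshold,
    i.e. frames that change noticeably under image enhancement."""
    selected = []
    for img in frames:
        v1, v2 = byol_online(img), byol_target(img)  # BYOL(img1), BYOL(img2)
        sim = float(np.dot(v1, v2) /
                    (np.linalg.norm(v1) * np.linalg.norm(v2)))
        if sim <= threshold:                         # video key frame
            selected.append(img)
    return selected
```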
Before step S103 in some embodiments, the video search method further includes pre-training a data processing model. The data processing model may be constructed from a convolutional neural network and a CLIP model, and includes a convolutional layer, a pooling layer, a coding layer and a fully-connected layer. The feature result obtained when feature extraction is performed through the convolutional and pooling layers is a multi-dimensional feature, which can be understood as a set of feature maps. For example, processing image A through the convolutional and pooling layers of the data processing model yields features of the form 512 × 28 × 28, which can be read as 512 feature maps of size 28 × 28, or equivalently as 28 × 28 positions each holding a 512-dimensional vector, that is, 512 elements per vector. The multi-dimensional features produced by the convolutional and pooling layers are then passed through the fully-connected layer to obtain the image convolution feature, i.e., a one-dimensional feature vector derived from the multi-dimensional features.
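The 512 × 28 × 28 example can be checked with a few lines of PyTorch; the 1024-dimensional output size is an illustrative assumption, not a value from the application:

```python
import torch
import torch.nn as nn

features = torch.randn(1, 512, 28, 28)  # conv + pooling output: 512 maps of 28x28
flat = nn.Flatten()(features)           # 512 * 28 * 28 = 401_408 elements
fc = nn.Linear(512 * 28 * 28, 1024)     # fully-connected layer
conv_feature = fc(flat)                 # one-dimensional feature vector
print(conv_feature.shape)               # torch.Size([1, 1024])
```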
Referring to fig. 5, in some embodiments, step S103 may further include, but is not limited to, step S501 to step S504:
step S501, feature extraction is carried out on candidate key frame data to obtain candidate text features, candidate audio features and candidate key frame images;
step S502, semantic analysis is carried out on the candidate audio features to obtain standard audio data;
step S503, performing text recognition processing on the candidate text features to obtain character text data;
step S504, the standard audio data, the character text data and the candidate key frame image are subjected to fusion processing to obtain standard key frame data.
In step S501 of some embodiments, tag classification is performed on the candidate key frame data according to a data category tag preset in a convolutional layer of the data processing model to obtain a plurality of different types of tag key frame data, and then convolution processing is performed on the tag key frame data through the convolutional layer to extract different types of tag key frame data, so as to obtain a candidate text feature, a candidate audio feature, and a candidate key frame image, respectively. Note that the data category label includes a text label, an audio label, and an image label.
In step S502 of some embodiments, an ASR speech recognizer performs semantic error correction and filtering on the candidate audio features, removing candidate audio features whose meaning is unclear or whose audio is fuzzy, to obtain a candidate audio text. Incomplete audio data in the candidate audio text is then completed to obtain the standard audio data.
In step S503 of some embodiments, the character recognition software is used to perform text cleaning on the candidate text features, eliminate fuzzy text data in the candidate text, and perform sentence expansion on the incomplete text data in the candidate text, for example, performing sentence expansion by synonym replacement, part-of-speech modification, and the like, so as to obtain character text data.
In step S504 of some embodiments, first, word vectorization processing is performed on the standard audio data, the character text data, and the candidate key frame image, respectively, to obtain a standard audio vector, a character text vector, and a candidate key frame vector, vector addition processing is performed on the standard audio vector, the character text vector, and the candidate key frame vector, to obtain a standard video key frame vector, and finally, decoding processing is performed on the standard video key frame vector, to obtain standard key frame data.
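Step S504's fusion by vector addition can be sketched as follows; the decode callable is a hypothetical stand-in for the decoding stage mentioned in the text:

```python
import numpy as np

def fuse(audio_vec: np.ndarray, text_vec: np.ndarray,
         image_vec: np.ndarray, decode):
    """Fuse the three modality vectors by element-wise addition (S504),
    then decode the result into standard key frame data."""
    assert audio_vec.shape == text_vec.shape == image_vec.shape
    standard_vec = audio_vec + text_vec + image_vec  # vector addition
    return decode(standard_vec)
```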
Referring to fig. 6, in some embodiments, the encoding layer of the data processing model includes a text encoder and an image encoder, and step S104 may further include, but is not limited to, step S601 to step S602:
step S601, text encoding is carried out on the text data through a text encoder to obtain a text vector;
step S602, an image encoder performs image encoding on a plurality of standard key frame images to obtain a plurality of key frame image vectors.
In step S601 of some embodiments, the text data is text-encoded by the Transformer of the text encoder to obtain the text vector.
In step S602 of some embodiments, the image encoder performs image encoding on the plurality of standard key frame images through its residual network. The residual network is composed of a plurality of residual dense blocks, each of which may be connected by skip connections; this effectively reduces gradient loss and improves the quality of the obtained key frame image vectors.
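As an illustration of why skip connections reduce gradient loss, here is a minimal skip-connected block in PyTorch; the exact residual-dense layout of the image encoder is not specified in the text, so this structure is an assumption:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal skip-connected block: the input bypasses the conv body,
    giving gradients a direct path back through the network."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection
```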
In step S105 of some embodiments, a first similarity value of the text vector and each key frame image vector is calculated by using a cosine similarity algorithm.
Specifically, when the first similarity value between the text vector and each key frame image vector is calculated, assuming that the text vector is u and one of the key frame image vectors is v, the first similarity value between the text vector and the key frame image vector is calculated according to the formula of the cosine similarity algorithm, as shown in formula 1:
cos(u, v) = (u · v) / (‖u‖ ‖v‖)     (Formula 1)
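In Python, Formula 1 reads:

```python
import numpy as np

def first_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between the text vector u and one key frame
    image vector v (Formula 1)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```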
referring to fig. 7, in some embodiments, step S106 may further include, but is not limited to, step S701 to step S702:
step S701, screening the standard key frame image in the standard key frame data according to the first similarity value to obtain a target key frame image;
step S702, splicing the plurality of target key frame images to obtain a target video clip.
In step S701 of some embodiments, the first similarity value is compared with a preset similarity threshold, a standard key frame image in the standard key frame data is screened according to the first similarity value and the similarity threshold, and the standard key frame image with the similarity value greater than the similarity threshold is selected as the target key frame image.
In step S702 of some embodiments, segment splicing is performed on the plurality of target key frame images according to sequence numbers or a time order labeled in advance on the standard key frame images, so as to obtain the target video segment. For example, the sequence numbers of the standard key frame images are labeled with Arabic numerals; if the target key frame images obtained by screening are standard key frame images 1, 3, 7, 21 and 43, they are spliced in ascending order of sequence number to obtain the target video segment. Alternatively, since each standard key frame image is taken from the original video data, each corresponds to a playing time. For example, if standard key frame image A in the target key frame images corresponds to a playing time of 50 seconds, image B to 74 seconds, image C to 21 seconds, and image D to 123 seconds, then images C, A, B and D are spliced in ascending order of playing time to obtain the target video clip.
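Both ordering rules reduce to a sort followed by concatenation. A minimal sketch using the playing-time example above, with frame contents abbreviated to letters:

```python
def splice_by_time(keyframes):
    """Order target key frames by their playing time in the original
    video (seconds) and concatenate them into the target clip."""
    return [frame for _, frame in sorted(keyframes, key=lambda kv: kv[0])]

# The example from the text: A at 50 s, B at 74 s, C at 21 s, D at 123 s.
clip = splice_by_time([(50, "A"), (74, "B"), (21, "C"), (123, "D")])
print(clip)  # ['C', 'A', 'B', 'D']
```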
Through the steps S701 to S702, a plurality of target key frame images meeting the requirements can be spliced in sequence conveniently to obtain a corresponding target video clip, and the target video clip can be quickly found when the user inputs the same search word segment again, so that the video search efficiency is improved.
The method and the device for searching the video data acquire original searching data, wherein the original searching data comprise text data and original video data. And then, frame extraction processing is carried out on the original video data to obtain candidate key frame data, and standardization processing is carried out on the candidate key frame data through a pre-trained data processing model to obtain standard key frame data, so that the obtained standard key frame data can better meet the requirement of video search, and the calculation amount of the video search is reduced. And then, respectively encoding the text data and the standard key frame data through an encoding layer of the data processing model to obtain a text vector and a plurality of key frame image vectors. Meanwhile, the first similarity value of the text vector and each key frame image vector is calculated, so that the correlation between the text vector and each key frame image vector can be accurately determined. And finally, screening the standard key frame data according to the first similarity value to obtain a target video clip, so that the accuracy of video search can be improved.
Referring to fig. 8, an embodiment of the present application further provides a video search apparatus, which can implement the video search method, and the apparatus includes:
a data obtaining module 801, configured to obtain original search data, where the original search data includes text data and original video data;
a frame extraction processing module 802, configured to perform frame extraction processing on original video data to obtain candidate key frame data;
a normalization processing module 803, configured to perform normalization processing on the candidate key frame data through a pre-trained data processing model to obtain standard key frame data;
the encoding module 804 is configured to perform encoding processing on the text data through an encoding layer of the data processing model to obtain a text vector, and perform encoding processing on the standard key frame data through the encoding layer to obtain a plurality of key frame image vectors;
a calculating module 805, configured to calculate a first similarity value between the text vector and each key frame image vector;
and a screening module 806, configured to perform screening processing on the standard key frame data according to the first similarity value to obtain a target video clip.
The specific implementation of the video search apparatus is substantially the same as the specific implementation of the video search method, and is not described herein again.
An embodiment of the present application further provides an electronic device, where the electronic device includes: the video search system comprises a memory, a processor, a program stored on the memory and capable of running on the processor, and a data bus for realizing connection communication between the processor and the memory, wherein the program realizes the video search method when being executed by the processor. The electronic equipment can be any intelligent terminal including a tablet computer, a vehicle-mounted computer and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided in the embodiment of the present application;
the memory 902 may be implemented in the form of a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present disclosure is implemented by software or firmware, the relevant program codes are stored in the memory 902 and called by the processor 901 to execute the video search method according to the embodiments of the present disclosure;
an input/output interface 903 for implementing information input and output;
a communication interface 904, configured to implement communication interaction between the device and another device, where communication may be implemented in a wired manner (e.g., USB, network cable, etc.), or in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
a bus 905 that transfers information between various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 enable a communication connection within the device with each other through a bus 905.
The embodiment of the present application further provides a storage medium, which is a computer-readable storage medium for a computer-readable storage, and the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the video search method.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the video searching method, the video searching device, the electronic equipment and the storage medium, original searching data are obtained, wherein the original searching data comprise text data and original video data. And then, frame extraction processing is carried out on the original video data to obtain candidate key frame data, and standardization processing is carried out on the candidate key frame data through a pre-trained data processing model to obtain standard key frame data, so that the obtained standard key frame data can better meet the requirement of video search, and the calculation amount of the video search is reduced. And then, respectively encoding the text data and the standard key frame data through an encoding layer of the data processing model to obtain a text vector and a plurality of key frame image vectors. Meanwhile, the first similarity value of the text vector and each key frame image vector is calculated, so that the correlation between the text vector and each key frame image vector can be accurately determined. And finally, screening the standard key frame data according to the first similarity value to obtain a target video clip, so that the accuracy of video search can be improved.
The embodiments described in the embodiments of the present application are for more clearly illustrating the technical solutions of the embodiments of the present application, and do not constitute a limitation to the technical solutions provided in the embodiments of the present application, and it is obvious to those skilled in the art that the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems with the evolution of technology and the emergence of new application scenarios.
It will be appreciated by those skilled in the art that the embodiments shown in fig. 1-7 are not limiting of the embodiments of the present application and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, and the systems and functional modules/units in the devices, disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in practice; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of that solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes a number of instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and the scope of the claims of the embodiments of the present application is not limited thereto. Any modifications, equivalents and improvements that may occur to those skilled in the art without departing from the scope and spirit of the embodiments of the present application are intended to be within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method for video search, the method comprising:
acquiring original search data, wherein the original search data comprises text data and original video data;
performing frame extraction processing on the original video data to obtain candidate key frame data;
standardizing the candidate key frame data through a pre-trained data processing model to obtain standard key frame data;
coding the text data through a coding layer of the data processing model to obtain a text vector, and coding the standard key frame data through the coding layer to obtain a plurality of key frame image vectors;
calculating a first similarity value of the text vector and each key frame image vector;
and screening the standard key frame data according to the first similarity value to obtain a target video clip.
2. The video search method of claim 1, wherein the step of performing frame extraction on the original video data to obtain candidate key frame data comprises:
analyzing the original video data to obtain an original video image;
vectorizing the original video image through an online network of a pre-trained key frame extraction model to obtain a first image feature vector;
performing image enhancement processing on the original video image through a target network of a pre-trained key frame extraction model to obtain a second image feature vector;
and calculating a second similarity value of the first image feature vector and the second image feature vector, and obtaining the candidate key frame data according to the second similarity value.
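A minimal sketch of the online/target arrangement recited in claim 2, for illustration only: the tiny convolutional backbone, the noise-based stand-in for image enhancement, and the top-k selection rule are all assumptions, not details fixed by the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Toy backbone plus projection head standing in for either network."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())   # video feature map -> pooled features
        self.project = nn.Linear(16, dim)            # map into the vector space

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.project(self.backbone(x)), dim=-1)

online, target = Encoder(), Encoder()

frames = torch.randn(8, 3, 224, 224)                 # parsed original video images
enhanced = frames + 0.1 * torch.randn_like(frames)   # stand-in image enhancement

first_vecs = online(frames)                          # first image feature vectors
second_vecs = target(enhanced)                       # second image feature vectors
second_sim = F.cosine_similarity(first_vecs, second_vecs, dim=-1)  # per frame

# Assumed selection rule: keep the frames whose representations are most
# stable under enhancement as the candidate key frames.
candidates = second_sim.topk(4).indices
```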
3. The video search method of claim 2, wherein the vectorizing the original video image through the online network of the pre-trained key frame extraction model to obtain the first image feature vector comprises:
performing feature extraction on the original video image through the online network to obtain a first video feature map;
and mapping the first video feature map to a preset first high-dimensional vector space through the online network to obtain the first image feature vector.
4. The video search method of claim 2, wherein the performing image enhancement processing on the original video image through the target network of the pre-trained key frame extraction model to obtain the second image feature vector comprises:
performing image enhancement processing on the original video image through the target network, and performing feature extraction on the original video image subjected to the image enhancement processing to obtain a second video feature map;
and mapping the second video feature map to a preset second high-dimensional vector space through the target network to obtain the second image feature vector.
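In symbols (notation chosen here for exposition only, not taken from the application), with online network $f_o$, target network $f_t$, and image enhancement $\mathrm{aug}$, claims 2 to 4 amount to:

$$z_1 = f_o(x), \qquad z_2 = f_t(\mathrm{aug}(x)), \qquad s_2 = \frac{z_1 \cdot z_2}{\lVert z_1 \rVert \, \lVert z_2 \rVert},$$

where $z_1$ and $z_2$ lie in the first and second high-dimensional vector spaces and $s_2$ is the second similarity value used to select the candidate key frame data.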
5. The video search method of claim 1, wherein the standardizing the candidate key frame data through the pre-trained data processing model to obtain standard key frame data comprises:
performing feature extraction on the candidate key frame data to obtain candidate text features, candidate audio features, and candidate key frame images;
performing semantic analysis on the candidate audio features to obtain standard audio data;
performing text recognition processing on the candidate text features to obtain character text data;
and performing fusion processing on the standard audio data, the character text data and the candidate key frame image to obtain standard key frame data.
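The standardization of claim 5 can be pictured as below. The `run_asr` and `run_ocr` callables are hypothetical stand-ins for whatever speech-recognition and text-recognition components perform the semantic analysis and text recognition; the application does not name specific models.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class StandardKeyFrame:
    image: Any          # the candidate key frame image
    audio_text: str     # standard audio data from semantic analysis of the audio features
    ocr_text: str       # character text data from text recognition

def standardize(frame_image: Any, audio_clip: Any,
                run_asr: Callable[[Any], str],
                run_ocr: Callable[[Any], str]) -> StandardKeyFrame:
    # Fusion here simply carries the three modalities together; a real
    # system might concatenate or jointly embed them instead.
    return StandardKeyFrame(image=frame_image,
                            audio_text=run_asr(audio_clip),
                            ocr_text=run_ocr(frame_image))

frame = standardize("frame_0001.jpg", "frame_0001.wav",
                    run_asr=lambda a: "narrator describes the product",
                    run_ocr=lambda i: "LIMITED OFFER")
print(frame)
```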
6. The video search method of claim 1, wherein the coding layer comprises a text encoder and an image encoder, and the coding the text data through the coding layer of the data processing model to obtain a text vector and coding a plurality of standard key frame images in the standard key frame data to obtain a plurality of key frame image vectors comprises:
performing text encoding on the text data through the text encoder to obtain the text vector;
and performing image encoding on the plurality of standard key frame images through the image encoder to obtain the plurality of key frame image vectors.
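One plausible realization of the text encoder and image encoder of claim 6 is a CLIP-style dual encoder, sketched below with the Hugging Face `transformers` library. The application does not name CLIP or any particular checkpoint; `openai/clip-vit-base-patch32` is an assumption for the example.

```python
# pip install torch transformers pillow
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def encode(text: str, key_frames: list):
    inputs = processor(text=[text], images=key_frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        text_vec = model.get_text_features(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"])      # (1, d) text vector
        frame_vecs = model.get_image_features(
            pixel_values=inputs["pixel_values"])          # (n, d) key frame image vectors
    return text_vec, frame_vecs

# Example: two blank frames stand in for standard key frame images.
frames = [Image.new("RGB", (224, 224)) for _ in range(2)]
text_vec, frame_vecs = encode("a dog catching a frisbee", frames)
```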
7. The video search method according to any one of claims 1 to 6, wherein the screening the standard key frame data according to the first similarity value to obtain a target video clip comprises:
screening the standard key frame images in the standard key frame data according to the first similarity value to obtain target key frame images;
and splicing the target key frame images to obtain the target video clip.
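The screening-and-splicing of claim 7 can be sketched as below; the one-frame-per-second sampling rate and the gap tolerance are assumptions made for illustration.

```python
def splice(kept_indices: list, fps: float = 1.0, max_gap: int = 1) -> list:
    """Group screened key frame indices into contiguous runs and return
    (start_seconds, end_seconds) spans, one per segment of the target clip."""
    spans, start, prev = [], None, None
    for i in sorted(kept_indices):
        if start is None:
            start = prev = i
        elif i - prev <= max_gap:
            prev = i
        else:                                    # gap too large: close the run
            spans.append((start / fps, (prev + 1) / fps))
            start = prev = i
    if start is not None:
        spans.append((start / fps, (prev + 1) / fps))
    return spans

print(splice([3, 4, 5, 9, 10]))  # [(3.0, 6.0), (9.0, 11.0)]
```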
8. A video search apparatus, characterized in that the apparatus comprises:
the data acquisition module is used for acquiring original search data, wherein the original search data comprises text data and original video data;
the frame extraction processing module is used for carrying out frame extraction processing on the original video data to obtain candidate key frame data;
the standardization processing module is used for carrying out standardization processing on the candidate key frame data through a pre-trained data processing model to obtain standard key frame data;
the encoding module is used for encoding the text data through an encoding layer of the data processing model to obtain text vectors, and encoding the standard key frame data through the encoding layer to obtain a plurality of key frame image vectors;
a calculating module, configured to calculate a first similarity value between the text vector and each of the key frame image vectors;
and the screening module is used for screening the standard key frame data according to the first similarity value to obtain a target video clip.
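For illustration, the modules of claim 8 could be composed as below; the class, method, and parameter names are assumptions, not an API defined by the application, and the five modules are injected as plain callables.

```python
from math import sqrt

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

class VideoSearchApparatus:
    def __init__(self, frame_extractor, data_processor,
                 text_encoder, image_encoder, threshold=0.3):
        self.frame_extractor = frame_extractor   # frame extraction processing module
        self.data_processor = data_processor     # standardization processing module
        self.text_encoder = text_encoder         # encoding module (text side)
        self.image_encoder = image_encoder       # encoding module (image side)
        self.threshold = threshold               # screening module parameter

    def search(self, text, video):
        frames = self.data_processor(self.frame_extractor(video))
        t = self.text_encoder(text)
        sims = [cosine(t, self.image_encoder(f)) for f in frames]  # calculating module
        return [i for i, s in enumerate(sims) if s > self.threshold]
```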
9. An electronic device comprising a memory, a processor, a program stored on the memory and executable on the processor, and a data bus for enabling connection communication between the processor and the memory, the program, when executed by the processor, implementing a video search method according to any one of claims 1 to 7.
10. A storage medium, the storage medium being a computer-readable storage medium, characterized in that the storage medium stores one or more programs, and the one or more programs are executable by one or more processors to implement the video search method of any one of claims 1 to 7.
CN202210163976.5A 2022-02-22 2022-02-22 Video searching method and device, electronic equipment and storage medium Pending CN114595357A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210163976.5A CN114595357A (en) 2022-02-22 2022-02-22 Video searching method and device, electronic equipment and storage medium
PCT/CN2022/090736 WO2023159765A1 (en) 2022-02-22 2022-04-29 Video search method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210163976.5A CN114595357A (en) 2022-02-22 2022-02-22 Video searching method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114595357A true CN114595357A (en) 2022-06-07

Family

ID=81804336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210163976.5A Pending CN114595357A (en) 2022-02-22 2022-02-22 Video searching method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114595357A (en)
WO (1) WO2023159765A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117909540A (en) * 2023-12-29 2024-04-19 广东智媒云图科技股份有限公司 Video data storage and search method, apparatus, device and storage medium
CN118038336A (en) * 2024-03-26 2024-05-14 杭州纳视文化创意有限公司 Extraction method of key frames for AI animation
CN118075552B (en) * 2024-04-22 2024-06-28 黑龙江省邦盾科技有限公司 Studio video feature image enhancement processing method
CN118354164A (en) * 2024-06-17 2024-07-16 阿里巴巴(中国)有限公司 Video generation method, electronic device and computer readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111836111A (en) * 2019-04-17 2020-10-27 微软技术许可有限责任公司 Technique for generating barrage
CN111191078B (en) * 2020-01-08 2024-05-07 深圳市雅阅科技有限公司 Video information processing method and device based on video information processing model
CN112115299B (en) * 2020-09-17 2024-08-13 北京百度网讯科技有限公司 Video searching method, video searching device, video recommending method, electronic equipment and storage medium
CN112784110A (en) * 2021-01-26 2021-05-11 北京嘀嘀无限科技发展有限公司 Key frame determination method and device, electronic equipment and readable storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115661727A (en) * 2022-12-27 2023-01-31 苏州浪潮智能科技有限公司 Video behavior positioning method and device, electronic equipment and storage medium
CN117851640A (en) * 2024-03-04 2024-04-09 广东智媒云图科技股份有限公司 Video data processing method, device, equipment and medium based on composite characteristics
CN117851640B (en) * 2024-03-04 2024-05-31 广东智媒云图科技股份有限公司 Video data processing method, device, equipment and medium based on composite characteristics

Also Published As

Publication number Publication date
WO2023159765A1 (en) 2023-08-31

Similar Documents

Publication Publication Date Title
US20220309762A1 (en) Generating scene graphs from digital images using external knowledge and image reconstruction
CN113792818B (en) Intention classification method and device, electronic equipment and computer readable storage medium
CN114595357A (en) Video searching method and device, electronic equipment and storage medium
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN113887215A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN114626097A (en) Desensitization method, desensitization device, electronic apparatus, and storage medium
CN114723996A (en) Model training method, image description generation method and device, equipment and medium
CN114298121A (en) Multi-mode-based text generation method, model training method and device
CN114638960A (en) Model training method, image description generation method and device, equipment and medium
CN114416995A (en) Information recommendation method, device and equipment
CN114359810A (en) Video abstract generation method and device, electronic equipment and storage medium
CN117251551B (en) Natural language processing system and method based on large language model
CN115544303A (en) Method, apparatus, device and medium for determining label of video
CN114064894A (en) Text processing method and device, electronic equipment and storage medium
CN114841146A (en) Text abstract generation method and device, electronic equipment and storage medium
CN113408282B (en) Method, device, equipment and storage medium for topic model training and topic prediction
CN114490949A (en) Document retrieval method, device, equipment and medium based on BM25 algorithm
CN117079310A (en) Pedestrian re-identification method based on image-text multi-mode fusion
CN114625877A (en) Text classification method and device, electronic equipment and storage medium
CN114897053A (en) Subspace clustering method, subspace clustering device, subspace clustering equipment and storage medium
CN114722774A (en) Data compression method and device, electronic equipment and storage medium
CN114090778A (en) Retrieval method and device based on knowledge anchor point, electronic equipment and storage medium
CN113868417A (en) Sensitive comment identification method and device, terminal equipment and storage medium
CN111666437A (en) Image-text retrieval method and device based on local matching
Inayathulla Image Caption Generation using Deep Learning For Video Summarization Applications.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination