WO2020177673A1 - Method for video sequence selection, computer device and storage medium - Google Patents
Method for video sequence selection, computer device and storage medium
- Publication number
- WO2020177673A1 (PCT application PCT/CN2020/077481, CN2020077481W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- video
- spatiotemporal
- trained
- feature
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
- G06V10/443—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
- G06V10/449—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
- G06V10/451—Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
- G06V10/454—Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/74—Image or video pattern matching; Proximity measures in feature spaces
- G06V10/75—Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
- G06V10/758—Involving statistics of pixels or of feature values, e.g. histogram matching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Definitions
- This application relates to the field of artificial intelligence technology, and in particular to a method for video sequence selection, computer equipment and storage medium.
- Artificial intelligence is a branch of computer science whose main goal is to enable machines to perform complex tasks that usually require human intelligence.
- With the dramatic increase in the amount of video data, how to process it effectively so that users can quickly obtain the information they want has become a key issue in current research and applications.
- Although the video sequence generated by the above method is relevant to the text, the method only considers the matching relationship between single-frame images and the text, resulting in a low matching degree between the output video sequence and the text, which is not conducive to understanding the video content.
- The embodiments of the present application provide a method for selecting a video sequence, a computer device, and a storage medium. Because a spatiotemporal candidate region captures the relationship between images in both time and space, the temporal relevance between the video and the text is taken into account during matching; that is, the influence of the video's timing information on the video sequence and the text is considered, which improves the matching degree between the output video sequence and the text and thereby helps to better understand the video content.
- the first aspect of the present application provides a method for selecting a video sequence.
- the method is applied to a computer device and includes:
- the computer device receives a video to be matched and a text to be matched, wherein the video to be matched includes multiple frames of images, the text to be matched includes at least one word, and the text to be matched corresponds to a feature sequence of the text to be matched;
- the computer device invokes a spatiotemporal candidate region generator to extract a set of spatiotemporal candidate regions from the video to be matched, wherein the set of spatiotemporal candidate regions includes N spatiotemporal candidate regions, N is an integer greater than or equal to 1, and a spatiotemporal candidate region is a video sequence;
- the computer device performs feature extraction on each spatiotemporal candidate region in the set of spatiotemporal candidate regions through a convolutional neural network to obtain N video feature sequences to be matched, wherein the video feature sequences to be matched correspond to the spatiotemporal candidate regions;
- the computer device invokes an attention-based interactor to obtain the matching score corresponding to each spatiotemporal candidate region, where the interactor is used to process the video feature sequence to be matched and the text feature sequence to be matched, and the matching score is used to indicate the matching relationship between the spatiotemporal candidate region and the text to be matched;
- the computer device selects a target spatiotemporal candidate area from the spatiotemporal candidate area set according to the matching score corresponding to each spatiotemporal candidate area output by the interactor, and outputs the target spatiotemporal candidate area.
- a second aspect of the present application provides a video sequence selection device, including:
- An acquisition module configured to receive a video to be matched and a text to be matched, wherein the video to be matched includes multiple frames of images, the text to be matched includes at least one word, and the text to be matched corresponds to a feature sequence of the text to be matched;
- the generating module is configured to call a spatiotemporal candidate region generator to extract a set of spatiotemporal candidate regions from the video to be matched, wherein the set of spatiotemporal candidate regions includes N spatiotemporal candidate regions, N is an integer greater than or equal to 1, and a spatiotemporal candidate area is a video sequence;
- the encoding module is used for feature extraction of each spatiotemporal candidate region in the set of spatiotemporal candidate regions through a convolutional neural network to obtain N video feature sequences to be matched, wherein the video feature sequences to be matched correspond to the spatiotemporal candidate regions;
- the acquisition module is further configured to call an attention-based interactor to obtain the matching score corresponding to each spatiotemporal candidate region, wherein the interactor is used to process the video feature sequence to be matched and the text feature sequence to be matched, and the matching score is used to represent the matching relationship between the spatiotemporal candidate region and the text to be matched;
- the selection module is configured to select a target spatiotemporal candidate area from the spatiotemporal candidate area set according to the matching score corresponding to each spatiotemporal candidate area output by the interactor, and output the target spatiotemporal candidate area.
- the generating module is configured to call the spatiotemporal candidate region generator to obtain the candidate region and confidence score of each frame of image in the video to be matched, wherein each candidate region corresponds to a confidence score; to obtain the degree of coincidence between two adjacent frames of images in the video to be matched; and to generate the set of spatiotemporal candidate regions according to the candidate region of each frame of image, the confidence score, and the degree of coincidence.
- the acquisition module is configured to, for each spatiotemporal candidate region, call the encoder of the interactor to encode the video feature sequence to be matched corresponding to the spatiotemporal candidate region to obtain a visual feature set, wherein the visual feature set includes at least one visual feature; call the encoder of the interactor to encode the text feature sequence to be matched to obtain a text feature set, wherein the text feature set includes at least one text feature; determine a visual text feature set according to the visual feature set and the text feature set, wherein the visual text feature set includes at least one visual text feature, and a visual text feature represents a text feature based on the visual feature; and determine the matching score corresponding to the spatiotemporal candidate area according to the visual text feature set and the visual feature set.
- the acquisition module is configured to calculate the visual feature set in the following manner: h_t^p = LSTM^p(f_t^p, h_{t-1}^p), with H^p = {h_1^p, h_2^p, ..., h_{t_p}^p}, where H^p represents the set of visual features, t_p represents the number of time steps in the spatiotemporal candidate region, LSTM^p() represents the first long short-term memory network (LSTM) encoder, and f_t^p represents the t-th row feature in the video feature sequence to be matched;
- the text feature set is calculated as follows: h_t^q = LSTM^q(f_t^q, h_{t-1}^q), with H^q = {h_1^q, h_2^q, ..., h_{t_q}^q}, where H^q represents the text feature set, t_q represents the number of words in the text to be matched, LSTM^q() represents the second LSTM encoder, and f_t^q represents the t-th row feature in the text feature sequence to be matched.
- the acquisition module is configured to call the interactor to calculate the attention weight of the text feature corresponding to the visual feature according to the visual feature set and the text feature set; calculate the normalized attention weight of the text feature corresponding to the visual feature according to the attention weight; and calculate the visual text feature set according to the normalized attention weight and the text feature.
- the acquisition module is configured to calculate the attention weight in the following manner: e_{i,j} = w^T tanh(W_q h_j^q + W_p h_i^p + b_1) + b_2, where e_{i,j} represents the attention weight of the i-th visual feature corresponding to the j-th text feature, h_j^q represents the j-th text feature, h_i^p represents the i-th visual feature, w^T represents the first model parameter, W_q represents the second model parameter, W_p represents the third model parameter, b_1 represents the fourth model parameter, b_2 represents the fifth model parameter, and tanh() represents the hyperbolic tangent function;
- the normalized attention weight is calculated as follows: a_{i,j} = exp(e_{i,j}) / Σ_{k=1}^{t_q} exp(e_{i,k}), where a_{i,j} represents the normalized attention weight of the i-th visual feature corresponding to the j-th text feature, t_q represents the number of words in the text to be matched, k represents the k-th word in the text to be matched, k is an integer greater than or equal to 1 and less than or equal to t_q, and exp() represents an exponential function;
- the visual text feature set is calculated as follows: h_i^{qp} = Σ_{j=1}^{t_q} a_{i,j} h_j^q, with H^{qp} = {h_1^{qp}, h_2^{qp}, ..., h_{t_p}^{qp}}, where H^{qp} represents the visual text feature set, t_p represents the number of time steps of the spatiotemporal candidate area, and h^{qp} represents a visual text feature;
- the obtaining module is configured to calculate the matching score in the following manner: s(q,p) = Σ_{i=1}^{t_p} φ(h_i^{qp}, h_i^p), where s(q,p) represents the matching score corresponding to the spatiotemporal candidate region, h_i^{qp} represents the visual text feature corresponding to the i-th time step, h_i^p represents the visual feature corresponding to the i-th time step, and φ() represents a similarity calculation function.
- the third aspect of the present application provides a model training device, including:
- An acquisition module for acquiring a first video to be trained, a second video to be trained, a first text to be trained, and a second text to be trained, wherein the first video to be trained has a matching relationship with the first text to be trained, the first video to be trained does not have a matching relationship with the second text to be trained, the second video to be trained has a matching relationship with the second text to be trained, and the second video to be trained does not have a matching relationship with the first text to be trained;
- the determining module is configured to determine the permutation loss function according to the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained acquired by the acquiring module, where the permutation loss function is used to process the first video to be trained and the second text to be trained, and to process the second video to be trained and the first text to be trained;
- the determining module is further configured to determine a diversity loss function according to the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained acquired by the acquiring module, wherein the diversity loss function is used to process the first video to be trained and the first text to be trained, and to process the second video to be trained and the second text to be trained;
- the determining module is further configured to determine a target loss function according to the permutation loss function and the diversity loss function;
- the training module is configured to use the target loss function determined by the determining module to train the interactor to be trained to obtain an attention-based interactor, wherein the interactor is used to output a matching score between the video to be matched and the text to be matched.
- the determining module is configured to obtain a first set of spatiotemporal candidate regions in the first video to be trained and a second set of spatiotemporal candidate regions in the second video to be trained, wherein the first set of spatiotemporal candidate regions includes at least one first spatiotemporal candidate region, the first spatiotemporal candidate region is a video sequence, the second set of spatiotemporal candidate regions includes at least one second spatiotemporal candidate region, and the second spatiotemporal candidate region is a video sequence; calculate a first matching score according to the first text to be trained and the second set of spatiotemporal candidate regions; calculate a second matching score according to the second text to be trained and the first set of spatiotemporal candidate regions; calculate a third matching score according to the first text to be trained and the first set of spatiotemporal candidate regions; and determine the permutation loss function according to the first matching score, the second matching score, and the third matching score.
- the determining module is configured to determine a matching behavior distribution according to a first set of spatiotemporal candidate regions and the first text to be trained, wherein the first set of spatiotemporal candidate regions is generated according to the first video to be trained,
- the matching behavior distribution represents the matching relationship between each first spatiotemporal candidate region in the first spatiotemporal candidate region set and the first text to be trained; the matching behavior distribution is normalized to obtain a target Matching behavior distribution; determining the diversity loss function according to the target matching behavior distribution.
- the determining module is configured to obtain a control coefficient, and determine the target loss function according to the control coefficient, the permutation loss function, and the diversity loss function.
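- The concrete formulas of the permutation loss and the diversity loss are not reproduced in this excerpt, so the following Python sketch is only an illustration of how a control coefficient might combine them: a margin-based ranking term that pushes the matched pair (the third matching score) above the mismatched pairs (the first and second matching scores), plus an entropy-style diversity term over the normalized matching behavior distribution. The margin, the entropy form, and all function names are assumptions, not the application's actual definitions.

```python
# Illustrative sketch only: a margin ranking "permutation" term plus an
# entropy-style "diversity" term, combined by a control coefficient. The
# actual loss definitions are not given in this excerpt.
import numpy as np

def permutation_loss(s_matched, s_mismatched_1, s_mismatched_2, margin=0.1):
    # s_matched:      third matching score, e.g. first text vs. first video's regions
    # s_mismatched_1: first matching score, e.g. first text vs. second video's regions
    # s_mismatched_2: second matching score, e.g. second text vs. first video's regions
    return (max(0.0, margin - s_matched + s_mismatched_1)
            + max(0.0, margin - s_matched + s_mismatched_2))

def diversity_loss(region_scores):
    # region_scores: matching scores between one text and each spatiotemporal
    # candidate region of its matched video (the "matching behavior")
    scores = np.asarray(region_scores, dtype=float)
    p = np.exp(scores - scores.max())
    p /= p.sum()                                  # normalized matching behavior distribution
    return float((p * np.log(p + 1e-12)).sum())   # negative entropy favours a peaked distribution

def target_loss(s_matched, s_mis1, s_mis2, region_scores, control_coefficient=0.5):
    return (permutation_loss(s_matched, s_mis1, s_mis2)
            + control_coefficient * diversity_loss(region_scores))
```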
- a fourth aspect of the present application provides a computer device, including: a memory, a transceiver, a processor, and a bus system;
- the memory is used to store programs
- the processor is used to execute the program in the memory and includes the following steps:
- the spatio-temporal candidate region generator is invoked to extract a set of spatio-temporal candidate regions from the video to be matched, wherein the set of spatio-temporal candidate regions includes N spatio-temporal candidate regions, where N is an integer greater than or equal to 1, and one spatio-temporal candidate region Is a video sequence;
- the attention-based interactor is called to obtain the matching score corresponding to each spatiotemporal candidate region, where the interactor is used to process the video feature sequence to be matched and the text feature sequence to be matched, and the matching score is used to indicate the matching relationship between the spatiotemporal candidate region and the text to be matched;
- the bus system is used to connect the memory and the processor, so that the memory and the processor communicate.
- the fifth aspect of the present application provides a computer-readable storage medium having instructions stored therein, which, when run on a computer device, cause the computer device to execute the video sequence selection method described in the first aspect above.
- the embodiment of the present application provides a method for selecting a video sequence.
- a video to be matched and a text to be matched are received.
- the video to be matched includes multiple frames of images
- the text to be matched includes at least one word
- the text to be matched corresponds to a text feature sequence to be matched, and a set of spatiotemporal candidate regions is then extracted from the video to be matched.
- feature extraction is performed on each of the spatiotemporal candidate regions in the set of spatiotemporal candidate regions to obtain N video feature sequences to be matched.
- each video feature sequence to be matched corresponds to a spatiotemporal candidate region; the attention-based interactor can then be called to obtain the matching score corresponding to each spatiotemporal candidate region, and finally, according to the matching score corresponding to each spatiotemporal candidate region, a target spatiotemporal candidate region is selected from the set of spatiotemporal candidate regions.
- In this way, the spatiotemporal candidate regions in the video are matched with the text, instead of matching each frame of the video with the text. The advantage of this operation is that, because a spatiotemporal candidate region captures the relationship between images in both time and space, the temporal relevance between the video and the text is considered during matching; that is, the influence of the video's timing information on the video sequence and the text is considered, which improves the matching degree between the output video sequence and the text and helps to better understand the video content.
- FIG. 1 is a schematic diagram of an architecture of a video sequence selection system in an embodiment of the application;
- FIG. 2 is a schematic diagram of a framework and flow of a video sequence selection system in an embodiment of the application;
- FIG. 3 is a schematic diagram of an embodiment of a method for selecting a video sequence in an embodiment of the application;
- FIG. 4 is a schematic diagram of an embodiment of extracting a spatiotemporal candidate region in an embodiment of the application;
- FIG. 5 is a schematic structural diagram of an interactor based on an attention mechanism in an embodiment of the application;
- FIG. 6 is a schematic diagram of an embodiment of a model training method in an embodiment of the application;
- FIG. 7 is a schematic diagram of an embodiment of a video sequence selection device in an embodiment of the application;
- FIG. 8 is a schematic diagram of an embodiment of a model training device in an embodiment of the application;
- FIG. 9 is a schematic diagram of a structure of a server in an embodiment of the application.
- Artificial intelligence (AI) uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
- artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
- Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
- Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
- Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
- Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
- Computer vision is a science that studies how to make machines "see"; it refers to using cameras and computers instead of human eyes to identify, track, and measure targets, and to further process the resulting images so that they become more suitable for human observation or for transmission to instruments for detection.
- Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also includes common biometric recognition technologies such as facial recognition and fingerprint recognition.
- Machine learning is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. It studies how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
- Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
- Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning techniques.
- Artificial intelligence technology has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robotics, intelligent medical care, and intelligent customer service. With the development of technology, artificial intelligence technology is expected to be applied in more fields and deliver increasingly important value.
- The embodiments of the present application provide a method for selecting a video sequence, a computer device, and a storage medium. Because a spatiotemporal candidate region captures the relationship between images in both time and space, the temporal relevance between the video and the text is taken into account during matching; that is, the influence of the video's timing information on the video sequence and the text is considered, which improves the matching degree between the output video sequence and the text and is conducive to a better understanding of the video content.
- this application can be applied to scenes of understanding and positioning of video content, including but not limited to scenes of video classification, scenes of fast retrieval on video websites, and scenes of fast positioning in videos.
- the matching relationship between text and video can be measured, so as to achieve the purpose of outputting a video sequence given a sentence and a video.
- a video sequence related to player Curry needs to be extracted from a video of the National Basketball Association (NBA) to make a video collection.
- multiple spatio-temporal candidate regions will be generated first. These spatio-temporal candidate regions are also video sequences.
- the spatiotemporal candidate area with the highest degree of matching with the text is selected as the target spatiotemporal candidate area, and this target spatiotemporal candidate area can be recorded as video sequence 1.
- For another sentence, the selected target spatiotemporal candidate area can likewise be recorded as video sequence 2. If a video collection is needed, video sequence 1 and video sequence 2 can be stitched together to obtain the final video.
- FIG. 1 is an architecture of the video sequence selection system in an embodiment of the application.
- the video sequence selection method provided in this application is usually applied to a computer device.
- the computer device may be a server 100 or a client.
- This application will take the application to a server as an example.
- FIG. 2 is a schematic diagram of a framework and flow of the video sequence selection system in an embodiment of the application.
- the server 100 obtains the video input from the client; it can be understood that the video may also be data pre-stored in the server 100, which is not limited here.
- the server 100 extracts multiple spatiotemporal candidate areas from the video through the spatiotemporal candidate area generator, such as the spatiotemporal candidate area A, the spatiotemporal candidate area B, and the spatiotemporal candidate area C in FIG. 2.
- the user can input a sentence through the client, such as "A brown squirrel is playing with a blue ball on the floor", and the attention-based interactor is used to let the sentence interact separately with spatiotemporal candidate area A, spatiotemporal candidate area B, and spatiotemporal candidate area C.
- Suppose the matching value of the sentence and spatiotemporal candidate area A is 60,
- the matching value of the sentence and spatiotemporal candidate area B is 50,
- and the matching value of the sentence and spatiotemporal candidate area C is 90.
- Since spatiotemporal candidate area C has the highest matching value, it is output as the target spatiotemporal candidate area.
- the temporal and spatial candidate area C is represented as a video sequence.
- the attention-based interactor is obtained through optimization of a loss function.
- the loss function may include a permutation loss function and a diversity loss function, which is not limited here.
- the aforementioned attention-based interactor is also referred to as a video text interaction model in the following.
- the client is deployed on the terminal device 110.
- the terminal device 110 includes, but is not limited to, a tablet computer, a notebook computer, a palmtop computer, a mobile phone, a voice interaction device, and a personal computer (PC), which is not limited here.
- voice interaction devices include but are not limited to smart speakers and smart home appliances.
- An embodiment of the method for selecting a video sequence in this application includes:
- a computer device receives a video to be matched and a text to be matched, where the video to be matched includes multiple frames of images, the text to be matched includes at least one word, and the text to be matched corresponds to a feature sequence of the text to be matched;
- the video sequence selection device first needs to obtain the video to be matched and the text to be matched, where the video sequence selection device can be deployed on the server 100 shown in FIG. 1 or on a terminal device 110 with strong computing capability as shown in FIG. 1, which is not limited here.
- the video to be matched includes multiple frames of images, and the text to be matched includes at least one word.
- a word vector model can be used to process the text to be matched so as to convert the language into a mathematical representation, thereby obtaining the text feature sequence to be matched.
- At least one word vector is included in the text feature sequence to be matched, and the word vector is a vector representation of the word.
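- As a minimal illustration of this step (the actual word vector model is not named in this excerpt), the sketch below maps the text to be matched to a feature sequence of word vectors using a placeholder embedding table; the table `word_vectors`, the 300-dimensional size, and the whitespace tokenization are assumptions.

```python
# Minimal sketch: turning the text to be matched into a feature sequence of
# word vectors. `word_vectors` is a placeholder embedding table; unknown
# words fall back to a zero vector.
import numpy as np

def text_to_feature_sequence(text, word_vectors, dim=300):
    words = text.lower().split()                            # the text contains at least one word
    rows = [word_vectors.get(w, np.zeros(dim)) for w in words]
    return np.stack(rows)                                   # F_q with shape (t_q, dim)

word_vectors = {"squirrel": np.random.randn(300), "ball": np.random.randn(300)}
F_q = text_to_feature_sequence("A brown squirrel is playing with a blue ball", word_vectors)
```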
- the video format of the video to be matched includes, but is not limited to, the Moving Picture Experts Group (MPEG) format, the Audio Video Interleaved (AVI) format, the Advanced Streaming Format (ASF), the Windows Media Video (WMV) format, the 3rd Generation Partnership Project file format (3GP), the Matroska multimedia container format (MKV), the Flash Video (FLV) streaming format, and the RealMedia Variable Bitrate (RMVB) file format, etc.
- the language types of the text to be matched include but are not limited to Chinese, English, Japanese, French, German, and Arabic.
- the computer device invokes the spatiotemporal candidate area generator to extract a spatiotemporal candidate area set from the video to be matched, where the spatiotemporal candidate area set includes N spatiotemporal candidate areas, and N is an integer greater than or equal to 1;
- the video sequence selection device processes the video to be matched to obtain a series of spatiotemporal candidate regions, and this series of spatiotemporal candidate regions is called the set of spatiotemporal candidate regions.
- the set of spatiotemporal candidate regions includes N spatiotemporal candidate regions, where N represents the total number of spatiotemporal candidate areas in the set, and each spatiotemporal candidate area is a video sequence.
- the computer device uses a convolutional neural network to perform feature extraction on each of the spatiotemporal candidate regions in the set of spatiotemporal candidate regions to obtain N video feature sequences to be matched, where the video feature sequences to be matched have a corresponding relationship with the spatiotemporal candidate regions;
- the video sequence selection device encodes each spatiotemporal candidate area in the set of spatiotemporal candidate areas separately.
- Taking one spatiotemporal candidate area as an example, suppose the spatiotemporal candidate area is denoted as p; a convolutional neural network is used to extract its corresponding sequence feature, that is, the video feature sequence to be matched F^p, with dimensions t_p × d_p. Here t_p represents the number of time steps of the video (the fixed length of the video after compression), and d_p represents the feature length of the video, such as 6048 or 4096, which is not limited here.
- Each spatiotemporal candidate region corresponds to one video feature sequence to be matched.
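- The application only specifies that a convolutional neural network extracts the video feature sequence F^p. The sketch below therefore uses a ResNet-50 backbone from torchvision purely as an assumed stand-in; the cropping of each per-frame candidate box, the preprocessing, and the resulting feature length (2048 here rather than 4096 or 6048) are likewise assumptions.

```python
# Hedged sketch: per-time-step CNN features for one spatiotemporal candidate
# region. The backbone, preprocessing and feature length are assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()          # keep the pooled feature vector
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def candidate_region_features(cropped_frames):
    """cropped_frames: list of t_p HxWx3 uint8 crops, one per time step."""
    with torch.no_grad():
        batch = torch.stack([preprocess(f) for f in cropped_frames])
        return backbone(batch)             # F_p with shape (t_p, d_p), d_p = 2048 here
```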
- the computer device calls the attention-based interactor to obtain the matching score corresponding to each spatio-temporal candidate region, where the interactor is used to process the video feature sequence to be matched and the text feature sequence to be matched, and the matching score is used to represent the time and space The matching relationship between the candidate area and the text to be matched;
- the video sequence selection device inputs each to-be-matched video feature sequence and the to-be-matched text feature sequence into the video text interaction model, and the video text interaction model outputs a corresponding matching score.
- the spatio-temporal candidate region set includes 3 spatio-temporal candidate regions, each of which corresponds to a video feature sequence to be matched.
- the spatiotemporal candidate region A corresponds to the video feature sequence A to be matched,
- the spatiotemporal candidate region B corresponds to the video feature sequence B to be matched,
- the spatio-temporal candidate area C corresponds to the video feature sequence C to be matched.
- the video feature sequence A to be matched and the text feature sequence to be matched are input to the video text interaction model, and the video text interaction model outputs a matching score A.
- the video feature sequence B to be matched and the text feature sequence to be matched are input to the video text interaction model, and the video text interaction model outputs a matching score B.
- the video feature sequence C to be matched and the text feature sequence to be matched are input to the video text interaction model, and the video text interaction model outputs a matching score C.
- the higher the matching score, the stronger the matching relationship.
- the computer device selects the target spatiotemporal candidate area from the spatiotemporal candidate area set according to the matching score corresponding to each spatiotemporal candidate area output by the interactor, and outputs the target spatiotemporal candidate area.
- the video sequence selection device selects the spatiotemporal candidate area with the highest matching score as the target spatiotemporal candidate area according to the matching score corresponding to each spatiotemporal candidate area, and outputs the target spatiotemporal candidate area, where the target spatiotemporal candidate The area is a video sequence.
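- A minimal sketch of this selection step is given below: every spatiotemporal candidate region is scored against the text by the interactor, and the region with the highest matching score is returned as the target. The function `interactor_score` is a hypothetical stand-in for the attention-based interactor described later.

```python
# Minimal sketch of step 305: return the spatiotemporal candidate region with
# the highest matching score. `interactor_score` is a hypothetical stand-in.
def select_target_region(candidate_regions, video_feature_sequences, text_features,
                         interactor_score):
    scores = [interactor_score(F_p, text_features) for F_p in video_feature_sequences]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return candidate_regions[best], scores[best]
```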
- An embodiment of the application provides a method for selecting a video sequence.
- a video to be matched and a text to be matched are received.
- the video to be matched includes multiple frames of images
- the text to be matched includes at least one word
- the text to be matched corresponds to a text feature sequence to be matched, and a set of spatiotemporal candidate regions is then extracted from the video to be matched.
- feature extraction is performed on each spatiotemporal candidate region in the set of spatiotemporal candidate regions to obtain N video feature sequences to be matched.
- each video feature sequence to be matched corresponds to a spatiotemporal candidate region; the matching score corresponding to each spatiotemporal candidate area can then be obtained through the video text interaction model, and finally, according to the matching score corresponding to each spatiotemporal candidate area, the target spatiotemporal candidate region is selected from the set of spatiotemporal candidate areas, where a spatiotemporal candidate region is a video sequence.
- In this way, the spatiotemporal candidate areas in the video are matched with the text instead of matching each frame of the video with the text.
- the advantage of this operation is that, because a spatiotemporal candidate area captures the relationship between images in both time and space, the temporal relevance between the video and the text is considered during matching; that is, the influence of the video's timing information on the video sequence and the text is considered, which improves the matching degree between the output video sequence and the text and is conducive to a better understanding of the video content.
- the method for selecting a video sequence provided in the embodiment of the present application further includes an optional embodiment.
- in the above step 302, the computer device calling the spatiotemporal candidate region generator to extract a set of spatiotemporal candidate regions from the video to be matched may include:
- the computer device calls the spatiotemporal candidate region generator to obtain the candidate region and the confidence score of each frame of image in the video to be matched, wherein each candidate region corresponds to a confidence score; obtains the degree of coincidence between two adjacent frames of images in the video to be matched; and generates the set of spatiotemporal candidate regions according to the candidate region of each frame of image, the confidence score, and the degree of coincidence.
- a method of generating a set of spatiotemporal candidate regions based on the video to be matched is introduced.
- the video sequence selection device uses a single-frame candidate region generation method to detect each frame in the video to be matched, thereby obtaining the candidate regions and confidence scores in each frame of image. For ease of understanding, please refer to FIG. 4.
- FIG. 4 is a schematic diagram of an embodiment of extracting spatiotemporal candidate regions in an embodiment of the application. As shown in FIG. 4, at least one candidate area can be extracted from one frame of image. It should be noted that a single frame of image does not carry timing information, while multiple frames of images do.
- the candidate area S1 is an image of a watermelon
- the candidate area S2 is an image of a pineapple.
- the spatiotemporal candidate area consists of a series of bounding boxes, that is, p = {b_1, b_2, ..., b_T}, where b_i represents a candidate area in the i-th frame of the video to be matched, and T represents the total number of image frames of the video to be matched.
- the embodiment of the application also provides a way to determine the set of spatiotemporal candidate regions.
- the candidate region and confidence score of each frame of image in the video to be matched can be obtained, with each candidate region corresponding to a confidence score; the degree of overlap between two adjacent frames of images in the video to be matched can then be obtained; and finally the set of spatiotemporal candidate regions is generated according to the candidate region, confidence score, and degree of overlap of each frame of image.
- In this way, the temporal and spatial changes of the video are taken into account, and the appearance signal and the motion signal in the video are combined to generate the spatiotemporal candidate areas, thereby improving the accuracy of spatiotemporal candidate area generation.
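- The exact rule for combining confidence scores and the degree of coincidence is not reproduced in this excerpt. The sketch below therefore assumes a simple greedy linking rule in which, frame by frame, the candidate box maximizing its own confidence plus its IoU overlap with the previously linked box is appended to the spatiotemporal candidate region.

```python
# Hedged sketch: greedily link per-frame candidate boxes into one
# spatiotemporal candidate region using confidence and adjacent-frame IoU.
def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def link_candidates(frames):
    """frames: per-frame lists of (box, confidence) candidates; returns one
    linked region {b_1, ..., b_T}, one box per frame."""
    box, _ = max(frames[0], key=lambda c: c[1])
    region = [box]
    for candidates in frames[1:]:
        box, _ = max(candidates, key=lambda c: c[1] + iou(c[0], region[-1]))
        region.append(box)
    return region
```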
- the video sequence selection method provided in this embodiment of the present application further includes another optional embodiment.
- in the above step 304, the computer device invoking the attention-based interactor to obtain the matching score corresponding to each spatiotemporal candidate region may include:
- the computer device invokes the encoder of the interactor to encode the to-be-matched video feature sequence corresponding to the spatiotemporal candidate area to obtain a visual feature set, where the visual feature set includes at least one visual feature;
- the computer device calls the encoder of the interactor to encode the text feature sequence to be matched to obtain a text feature set, where the text feature set includes at least one text feature;
- the computer device calls the interactor to determine a visual text feature set according to the visual feature set and the text feature set, where the visual text feature set includes at least one visual text feature, and a visual text feature represents a text feature based on the visual feature; and determines the matching score corresponding to the spatiotemporal candidate area according to the visual text feature set and the visual feature set.
- FIG. 5 is a schematic structural diagram of an interactor based on the attention mechanism in an embodiment of this application.
- the LSTM encoder is a component of the video text interaction model.
- the model training device inputs the video feature sequence F p to be matched to an LSTM encoder, and the LSTM encoder encodes the video feature sequence F p to be matched to obtain a visual feature set H p , where the visual feature set H p includes t p visual features h p .
- the model training device uses the attention mechanism to perform a targeted weighted summation over the text feature set H^q to obtain the visual text feature set H^qp, where the visual text feature set H^qp includes t_p visual text features h^qp; a visual text feature is a visually oriented text feature.
- the model training device then calculates a similarity value between each visual text feature h^qp in the visual text feature set H^qp and the corresponding visual feature h^p in the visual feature set H^p, obtaining t_p values; summing these t_p values yields the matching score corresponding to the spatiotemporal candidate region.
- Each spatiotemporal candidate area in the set of spatiotemporal candidate areas is processed as described above, so the matching score of every spatiotemporal candidate area in the set can be obtained; the processing of the other spatiotemporal candidate areas is not repeated here.
- the embodiment of the present application also provides a way to obtain the matching score corresponding to each spatiotemporal candidate region: the encoder of the video text interaction model encodes the video feature sequence to be matched to obtain the visual feature set and encodes the text feature sequence to be matched to obtain the text feature set; the visual text feature set is then determined according to the visual feature set and the text feature set; and finally, the matching score corresponding to the spatiotemporal candidate region is determined according to the visual text feature set and the visual feature set.
- the method for selecting a video sequence provided in the embodiment of the present application further includes another optional embodiment.
- the encoder of the interactor encoding the video feature sequence to be matched to obtain the visual feature set may include:
- the visual feature set is calculated as follows: h_t^p = LSTM^p(f_t^p, h_{t-1}^p), with H^p = {h_1^p, h_2^p, ..., h_{t_p}^p}, where H^p represents the set of visual features, t_p represents the number of time steps in the spatiotemporal candidate region, LSTM^p() represents the first LSTM encoder, and f_t^p represents the t-th row feature in the video feature sequence to be matched;
- the encoder of the video text interaction model encodes the text feature sequence to be matched to obtain the text feature set, which may include:
- the text feature set is calculated as follows: h_t^q = LSTM^q(f_t^q, h_{t-1}^q), with H^q = {h_1^q, h_2^q, ..., h_{t_q}^q}, where H^q represents the set of text features, t_q represents the number of words in the text to be matched, LSTM^q() represents the second LSTM encoder, and f_t^q represents the t-th row feature in the text feature sequence to be matched.
- an implementation manner of generating a visual feature set and generating a text feature set will be introduced.
- the way to generate the visual feature set is as follows: the (t-1)-th visual feature h_{t-1}^p in the visual feature set H^p and the t-th row feature f_t^p of the video feature sequence to be matched F^p are input to the first LSTM encoder, and the first LSTM encoder outputs the t-th visual feature h_t^p; when t_p visual features have been output, the visual feature set H^p is obtained.
- the text feature set is generated in the same way: the (t-1)-th text feature h_{t-1}^q in the text feature set H^q and the t-th row feature f_t^q of the text feature sequence to be matched F^q are input to the second LSTM encoder, which outputs the t-th text feature h_t^q; when t_q text features have been output, the text feature set H^q is obtained.
- the LSTM encoder adds filtering of past states, so that it can choose which states have more influence on the current state, instead of simply selecting the most recent state.
- the structure of an LSTM encoder includes a forget gate, a learning gate, a memory gate, and a use gate. Long-term memory enters the forget gate, which discards what it considers useless. Short-term memory and the current event are merged in the learning gate, which removes unnecessary information and keeps the newly learned information. The retained long-term memory and the newly learned information are merged in the memory gate; because it is called the memory gate, it outputs the updated long-term memory. Finally, the use gate decides what to use from the previously known information and the newly learned information to make a prediction, so it also accepts the long-term memory and the new information as inputs, merges them, and decides what to output.
- the embodiment of the present application also provides an implementation manner of using an encoder to encode a video feature sequence to be matched to obtain a visual feature set, and an implementation manner of using an encoder to encode a text feature sequence to be matched to obtain a text feature set.
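- The two encoders described above can be sketched in PyTorch as follows; only the recurrence h_t = LSTM(f_t, h_{t-1}) comes from the description, while the hidden size of 512 and the input dimensions are assumptions.

```python
# Minimal PyTorch sketch of the first LSTM encoder (over the video feature
# sequence F^p) and the second LSTM encoder (over the text feature sequence F^q).
import torch
import torch.nn as nn

d_p, d_q, hidden = 4096, 300, 512                 # feature and hidden sizes (assumed)
lstm_p = nn.LSTM(input_size=d_p, hidden_size=hidden, batch_first=True)
lstm_q = nn.LSTM(input_size=d_q, hidden_size=hidden, batch_first=True)

F_p = torch.randn(1, 20, d_p)                     # (batch, t_p, d_p)
F_q = torch.randn(1, 9, d_q)                      # (batch, t_q, d_q)

H_p, _ = lstm_p(F_p)                              # visual feature set, (1, t_p, hidden)
H_q, _ = lstm_q(F_q)                              # text feature set,   (1, t_q, hidden)
```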
- the method for selecting a video sequence provided in this embodiment of the application further includes another optional embodiment.
- the computer device calling the interactor to determine the visual text feature set according to the visual feature set and the text feature set may include:
- the computer device invokes the interactor to calculate the attention weight of the text feature corresponding to the visual feature according to the visual feature set and the text feature set; and calculate the normalized attention weight of the text feature corresponding to the visual feature according to the attention weight; According to the normalized attention weight and text features, the visual text feature set is calculated.
- a method of generating a visual text feature set is introduced.
- the attention mechanism is the main means to solve the problem of information overload. It is a resource allocation scheme that allocates computing resources to more important tasks.
- the attention mechanism adopted in this application can be multi-head attention, hard attention, key-value pair attention, or structured attention.
- Multi-head attention is the use of multiple queries to calculate and select multiple information from input information in parallel. Each attention focuses on a different part of the input information.
- In contrast to soft attention, which takes the expectation of all input information under the attention distribution, hard attention can be implemented in two ways: one is to select the input information with the highest probability;
- the other is to randomly sample according to the attention distribution.
- the key-value pair attention can be expressed in a key-value pair format, where the "key” is used to calculate the attention distribution, and the "value” is used to generate the selected information.
- Structured attention needs to select task-related information from the input information.
- The attention mechanisms above apply a single distribution over all input information, which is a flat structure. If the input information itself has a hierarchical structure, for example, text can be divided into different levels of granularity such as words, sentences, paragraphs, and chapters, hierarchical attention can be used to make a better information selection.
- the embodiment of the present application also provides a way to determine the visual text feature set: first obtain the attention weight of the text feature corresponding to the visual feature according to the visual feature set and the text feature set, then obtain the normalized attention weight of the text feature corresponding to the visual feature based on the attention weight, and finally obtain the visual text feature set according to the normalized attention weight and the text feature.
- the method for selecting a video sequence provided in the embodiment of the present application further includes another optional embodiment.
- the computer device calling the interactor to calculate, according to the visual feature set and the text feature set, the attention weight of the text feature corresponding to the visual feature can include:
- the attention weight is calculated as follows: e_{i,j} = w^T tanh(W_q h_j^q + W_p h_i^p + b_1) + b_2, where e_{i,j} represents the attention weight of the i-th visual feature corresponding to the j-th text feature, h_j^q represents the j-th text feature, h_i^p represents the i-th visual feature, w^T represents the first model parameter, W_q represents the second model parameter, W_p represents the third model parameter, b_1 represents the fourth model parameter, b_2 represents the fifth model parameter, and tanh() represents the hyperbolic tangent function;
- calculating the normalized attention weight of the text feature corresponding to the visual feature can include:
- the normalized attention weight is calculated as follows: a_{i,j} = exp(e_{i,j}) / Σ_{k=1}^{t_q} exp(e_{i,k}), where a_{i,j} represents the normalized attention weight of the i-th visual feature corresponding to the j-th text feature, t_q represents the number of words in the text to be matched, k represents the k-th word in the text to be matched, k is an integer greater than or equal to 1 and less than or equal to t_q, and exp() represents an exponential function;
- calculating the visual text feature set includes:
- the visual text feature set is calculated as follows:
- h_i^{qp} = Σ_{j=1}^{t_q} a_{i,j} · h_j^q, and H^{qp} = {h_i^{qp} | i = 1, ..., t_p};
- where H^{qp} represents the visual text feature set, t_p represents the number of time steps in the spatiotemporal candidate region, and h^{qp} represents a visual text feature.
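- For illustration, the following is a minimal NumPy sketch of this attention interaction; the function name, tensor shapes, and hidden dimension are assumptions rather than part of the claims, and the additive form of e_{i,j} follows the reconstruction above.

```python
import numpy as np

def visual_text_features(H_p, H_q, w, W_q, W_p, b1, b2):
    """Illustrative sketch of the attention interaction described above.

    H_p: visual features, shape (t_p, d), one row per time step.
    H_q: text features,   shape (t_q, d), one row per word.
    w, W_q, W_p, b1, b2: the five model parameters of the additive attention.
    Returns H_qp, the visually guided text features, shape (t_p, d).
    """
    t_p, t_q = H_p.shape[0], H_q.shape[0]
    # e[i, j]: attention weight of the i-th visual feature for the j-th text feature
    e = np.zeros((t_p, t_q))
    for i in range(t_p):
        for j in range(t_q):
            e[i, j] = w @ np.tanh(W_q @ H_q[j] + W_p @ H_p[i] + b1) + b2
    # a[i, j]: softmax-normalized attention weight over the t_q words
    a = np.exp(e) / np.exp(e).sum(axis=1, keepdims=True)
    # h_i^{qp}: attention-weighted sum of the text features for each visual time step
    H_qp = a @ H_q
    return H_qp
```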
- an implementation for computing the visual text feature set is introduced: the j-th text feature h_j^q is taken from the text feature set H^q and the i-th visual feature h_i^p is taken from the visual feature set H^p, the attention weight e_{i,j} is calculated as above, the attention weight is then normalized with the softmax function to obtain a_{i,j}, and the visual text features are obtained as the attention-weighted sum of the text features.
- the embodiment of the present application thus provides an implementation for calculating the attention weight of the text feature corresponding to the visual feature, an implementation for calculating the corresponding normalized attention weight, and a way to calculate the visual text feature set.
- the video sequence selection method provided in the embodiments of the present application further includes another optional embodiment.
- the computer device invokes the interactor to determine, according to the visual text feature set and the visual feature set, the matching score corresponding to the spatiotemporal candidate region, which may include:
- the matching score is calculated as follows:
- s(q,p) = Σ_{i=1}^{t_p} φ(h_i^{qp}, h_i^p);
- where s(q,p) represents the matching score corresponding to the spatiotemporal candidate region, φ(h_i^{qp}, h_i^p) represents the matching sub-score between the visual feature and the visual text feature at the i-th time step, h_i^{qp} represents the visual text feature at the i-th time step, h_i^p represents the visual feature at the i-th time step, and φ() represents the similarity calculation function.
- an implementation for calculating the matching score corresponding to the spatiotemporal candidate region is introduced. Since a spatiotemporal candidate region is a video sequence, encoding it yields a corresponding video feature sequence to be matched, and that sequence corresponds to multiple visual features; a matching sub-score is therefore calculated for each time step, and the sub-scores are summed to obtain the matching score of the entire spatiotemporal candidate region.
- the embodiment of the present application also provides an implementation manner for determining the matching score corresponding to the spatiotemporal candidate region according to the visual text feature set and the visual feature set.
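- As a worked example of this summation, the sketch below computes s(q,p) for one candidate region; cosine similarity is used as an illustrative choice of φ(), which the embodiment leaves unspecified.

```python
import numpy as np

def matching_score(H_qp, H_p):
    """Sketch of the matching score s(q, p) for one spatiotemporal candidate region.

    H_qp: visual text features, shape (t_p, d).
    H_p:  visual features,      shape (t_p, d).
    Cosine similarity stands in for the unspecified similarity function phi().
    """
    def phi(x, y):
        return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y) + 1e-8))
    # one matching sub-score per time step, summed over the candidate region
    return sum(phi(H_qp[i], H_p[i]) for i in range(H_p.shape[0]))
```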
- An embodiment of the method of model training in this application includes:
- the computer device obtains a first video to be trained, a second video to be trained, a first text to be trained, and a second text to be trained, where the first video to be trained has a matching relationship with the first text to be trained, the first video to be trained does not have a matching relationship with the second text to be trained, the second video to be trained has a matching relationship with the second text to be trained, and the second video to be trained does not have a matching relationship with the first text to be trained;
- the model training device first obtains the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained. Among them there are two matched pairs of training objects and two unmatched pairs, that is, the first video to be trained matches the first text to be trained, the second video to be trained matches the second text to be trained, the first video to be trained does not match the second text to be trained, and the second video to be trained does not match the first text to be trained.
- the computer device determines a permutation loss function according to the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained, where the permutation loss function is used to process the first video to be trained and the second text to be trained, and to process the second video to be trained and the first text to be trained;
- the model training device uses the first video to be trained and the first text to be trained as a positive sample, the second video to be trained and the second text to be trained as a positive sample, the first video to be trained and the second text to be trained as a negative sample, and the second video to be trained and the first text to be trained as a negative sample, and then obtains the matching score of each positive sample and each negative sample.
- the permutation loss function is constructed according to the relationship between the matching scores of the positive samples and the matching scores of the negative samples.
- the computer device determines a diversity loss function according to the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained, where the diversity loss function is used to process the first video to be trained and the first text to be trained, and to process the second video to be trained and the second text to be trained;
- the model training device uses the first video to be trained and the first text to be trained as a positive sample, and the second video to be trained and the second text to be trained as a positive sample; the data corresponding to either positive sample can be selected to construct the diversity loss function.
- within a positive sample, different spatiotemporal candidate regions should have different matching scores, that is, the score distribution over the spatiotemporal candidate regions should be diverse rather than uniform, so as to achieve a more accurate matching effect.
- the computer device determines the target loss function according to the permutation loss function and the diversity loss function
- the model training device combines the permutation loss function and the diversity loss function to generate the target loss function.
- the computer device uses the target loss function to train the interactive device to be trained to obtain an attention-based interactive device, where the interactive device is used to output matching scores between the video to be matched and the text to be matched.
- the model training device of the computer equipment uses the target loss function that has been constructed to train the video text interaction model to be trained to obtain the video text interaction model. After the video to be matched and the text to be matched are characterized, they are input into the video text interaction model to obtain the matching scores of each spatio-temporal candidate area in the video, and finally the spatio-temporal candidate area with the largest matching score is selected as the target spatio-temporal candidate area .
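- A minimal sketch of this inference step is shown below; the interactor interface (a score() method) is hypothetical and only illustrates scoring every candidate region against the query text and keeping the highest-scoring one.

```python
def select_target_region(candidate_features, text_features, interactor):
    """Score each spatiotemporal candidate region with a trained interactor and
    return the index and score of the best match. `interactor.score(...)` is an
    assumed interface, not the claimed implementation."""
    scores = [interactor.score(feats, text_features) for feats in candidate_features]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best, scores[best]
```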
- the loss function is usually used as a learning criterion and is connected with the optimization problem, that is, solving and evaluating the model by minimizing the loss function.
- the embodiment of the application provides a method for model training.
- a first video to be trained, a second video to be trained, a first text to be trained, and a second text to be trained are acquired, the first video to be trained and the first text to be trained Matching, the first video to be trained does not match the second text to be trained, the second video to be trained matches the second text to be trained, and the second video to be trained does not match the first text to be trained.
- then the permutation loss function and the diversity loss function are determined according to the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained.
- finally, the permutation loss function and the diversity loss function are combined to train the model and obtain a video text interaction model.
- using the permutation loss function and the diversity loss function to train the model at the same time can improve the matching accuracy between the text and the different temporal and spatial candidate regions, thereby helping to improve the accuracy of model training.
- the method for model training provided in the embodiment of the present application further includes an optional embodiment.
- the above-mentioned step 602, determining the permutation loss function according to the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained, may include:
- obtaining a first set of spatiotemporal candidate regions in the first video to be trained and a second set of spatiotemporal candidate regions in the second video to be trained, where the first set includes at least one first spatiotemporal candidate region, each first spatiotemporal candidate region is a video sequence, the second set includes at least one second spatiotemporal candidate region, and each second spatiotemporal candidate region is a video sequence;
- calculating a first matching score according to the first text to be trained and the second set of spatiotemporal candidate regions;
- calculating a second matching score according to the second text to be trained and the first set of spatiotemporal candidate regions;
- calculating a third matching score according to the first text to be trained and the first set of spatiotemporal candidate regions;
- determining the permutation loss function according to the first matching score, the second matching score, and the third matching score.
- the model training device extracts the spatiotemporal candidate regions for the first video to be trained and the second video to be trained respectively, and the extraction method is as described in the first optional embodiment corresponding to FIG. 3.
- the first spatiotemporal candidate region set corresponding to the first to-be-trained video can be obtained, and the first spatiotemporal candidate region set includes at least one first spatiotemporal candidate region.
- a second spatiotemporal candidate area set corresponding to the second video to be trained is obtained, and the second spatiotemporal candidate area set includes at least one second spatiotemporal candidate area.
- assuming that the first video to be trained is v and the first text to be trained is q, the matching score between the two is defined as:
- S(v,q) = max_i s(q, p_i), i = 1, ..., N;
- where p_i represents the i-th first spatiotemporal candidate region in the first set of spatiotemporal candidate regions, N represents the total number of first spatiotemporal candidate regions in the first video to be trained, max_i() represents taking the maximum value, and s(q, p_i) characterizes the matching behavior between the i-th first spatiotemporal candidate region p_i and the input first text to be trained q.
- S(v,q) can be denoted as the third matching score.
- similarly, assuming that the second video to be trained is v' and the first text to be trained is q, the matching score between the two is defined as:
- S(v',q) = max_i s(q, p'_i), i = 1, ..., N;
- where p'_i represents the i-th second spatiotemporal candidate region in the second set of spatiotemporal candidate regions, N represents the total number of second spatiotemporal candidate regions in the second video to be trained, max_i() represents taking the maximum value, and s(q, p'_i) characterizes the matching behavior between the i-th second spatiotemporal candidate region p'_i and the input first text to be trained q.
- S(v',q) can be denoted as the first matching score.
- similarly, assuming that the first video to be trained is v and the second text to be trained is q', the matching score between the two is defined as:
- S(v,q') = max_i s(q', p_i), i = 1, ..., N;
- where p_i represents the i-th first spatiotemporal candidate region in the first set of spatiotemporal candidate regions, N represents the total number of first spatiotemporal candidate regions in the first video to be trained, max_i() represents taking the maximum value, and s(q', p_i) characterizes the matching behavior between the i-th first spatiotemporal candidate region p_i and the input second text to be trained q'.
- S(v,q') can be denoted as the second matching score.
- based on the first matching score, the second matching score, and the third matching score calculated above, the following permutation loss function is obtained:
- L_rank = Σ_{v≠v'} Σ_{q≠q'} [max(0, S(v,q') - S(v,q) + Δ) + max(0, S(v',q) - S(v,q) + Δ)];
- where Δ represents a constant margin.
- the ranking loss function L_rank directly encourages the third matching score S(v,q) of the positive sample to be larger than the second matching score S(v,q') and the first matching score S(v',q) of the negative samples; therefore, it helps to generate a strong matching behavior s(q, p*) between the target spatiotemporal candidate region p* and the first text to be trained q.
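- The following sketch evaluates the ranking loss for a single (v, v', q, q') tuple under the margin form above; the margin value and function name are assumptions, and the full loss sums this term over all pairs with v ≠ v' and q ≠ q'.

```python
def ranking_loss(S_vq, S_vq_neg, S_vpq_neg, delta=0.1):
    """Permutation (ranking) loss for one training tuple.

    S_vq:      S(v, q),  matched video with matched text
    S_vq_neg:  S(v, q'), matched video with unmatched text
    S_vpq_neg: S(v', q), unmatched video with matched text
    delta: the margin; 0.1 is only an assumed value.
    """
    return max(0.0, S_vq_neg - S_vq + delta) + max(0.0, S_vpq_neg - S_vq + delta)
```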
- the embodiment of the present application also provides a way to determine the permutation loss function: the model training device first obtains the first set of spatiotemporal candidate regions in the first video to be trained and the second set of spatiotemporal candidate regions in the second video to be trained, where the first set includes at least one first spatiotemporal candidate region, each first spatiotemporal candidate region is a video sequence, the second set includes at least one second spatiotemporal candidate region, and each second spatiotemporal candidate region is a video sequence; it then calculates the first matching score according to the first text to be trained and the second set of spatiotemporal candidate regions, the second matching score according to the second text to be trained and the first set of spatiotemporal candidate regions, and the third matching score according to the first text to be trained and the first set of spatiotemporal candidate regions, and determines the permutation loss function according to the first, second, and third matching scores.
- the designed permutation loss function encourages the matching score of matched data to be larger than the matching score of unmatched data, so that a strong matching relationship is generated between the target spatiotemporal candidate region and the text; the permutation loss function thus contributes to distinguishing whether a video and a text match.
- the method for model training provided in the embodiment of the present application further includes another optional embodiment.
- the above-mentioned step 603, determining the diversity loss function according to the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained, may include:
- determining a matching behavior distribution according to the first set of spatiotemporal candidate regions and the first text to be trained, where the first set of spatiotemporal candidate regions is generated based on the first video to be trained, and the matching behavior distribution represents the matching relationship between each first spatiotemporal candidate region in the first set and the first text to be trained;
- normalizing the matching behavior distribution to obtain a target matching behavior distribution;
- determining the diversity loss function according to the target matching behavior distribution.
- the content of the diversity loss function will be introduced.
- when a sentence expressed in natural language is localized in a video, only a small part of the spatiotemporal candidate regions is semantically matched with the input sentence, because a reasonable matching behavior distribution should be diverse; that is, only a small part of the matching behaviors between the spatiotemporal candidate regions and the text are strong, and the matching behaviors of the other spatiotemporal candidate regions should be weak.
- the model training device extracts the spatiotemporal candidate regions of the first video to be trained, and the extraction method is as described in the first optional embodiment corresponding to FIG. 3. Based on the above method, a set of spatiotemporal candidate regions corresponding to the first video to be trained can be obtained, and the set of spatiotemporal candidate regions includes at least one spatiotemporal candidate region.
- the first video to be trained is v
- the first text to be trained is q.
- in order to make the generated matching behavior distribution diverse, the diversity loss function is introduced. This scheme first normalizes the matching behavior distribution {s(q, p_n)} through the softmax function, that is, the calculation is carried out as follows:
- ŝ(q, p_k) = exp(s(q, p_k)) / Σ_{n=1}^{N} exp(s(q, p_n));
- where p_k refers to any one of the p_n, and p_k represents the k-th spatiotemporal candidate region. The information entropy of the normalized distribution is then penalized, which strengthens the matching relationship between high-confidence spatiotemporal candidate regions and the text and weakens the matching relationship between low-confidence spatiotemporal candidate regions and the text, finally giving the diversity loss function:
- L_div = -Σ_{k=1}^{N} ŝ(q, p_k) · log ŝ(q, p_k);
- where L_div represents the diversity loss function.
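- A sketch of the diversity loss under the reconstruction above is given below: the candidate scores are softmax-normalized and the entropy of the normalized distribution is used as the penalty. The exact form of the entropy term is an assumption inferred from the description, not quoted from the claims.

```python
import numpy as np

def diversity_loss(candidate_scores):
    """Diversity loss for one matched (video, text) pair.

    candidate_scores: the scores s(q, p_n) over the N spatiotemporal candidate
    regions of the video. Minimizing the entropy of the softmax-normalized
    scores sharpens the distribution: few strong matches, the rest weak.
    """
    s = np.asarray(candidate_scores, dtype=float)
    s_hat = np.exp(s) / np.exp(s).sum()            # softmax normalization
    return float(-(s_hat * np.log(s_hat + 1e-12)).sum())  # entropy of s_hat
```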
- the embodiment of the present application also provides a way to determine the diversity loss function: the model training device determines the matching behavior distribution according to the first set of spatiotemporal candidate regions and the first text to be trained, then normalizes the matching behavior distribution to obtain the target matching behavior distribution, and finally determines the diversity loss function according to the target matching behavior distribution.
- the designed diversity loss function can not only enhance the matching relationship between the spatial-temporal candidate region with higher confidence and the text, but also weaken the matching relationship between the spatial-temporal candidate region with lower confidence and the text. As a result, it is closer to the actual matching relationship between the temporal and spatial candidate regions and the text, which in turn facilitates training to obtain a more accurate network model.
- the model training method provided in the embodiment of the present application further includes another optional embodiment.
- the foregoing step 604 determines the target loss function according to the permutation loss function and the diversity loss function, which may include:
- the target loss function is determined as follows:
- L = L_rank + β · L_div;
- where L represents the target loss function, L_rank represents the ranking loss function, L_div represents the diversity loss function, and β represents the control coefficient.
- an implementation method for generating a target loss function is introduced. After the model training device obtains the permutation loss function and the diversity loss function, they are added together, and at the same time, a coefficient is added to the diversity loss function. The details are as follows:
- L = L_rank + β · L_div;
- ⁇ can be set to 0.5 or other reasonable values, which is not limited here.
- the computer device obtains the control coefficient, and determines the target loss function based on the control coefficient, the permutation loss function, and the diversity loss function.
- the embodiment of the present application also provides an implementation manner for determining the target loss function, that is, combining the permutation loss function and the diversity loss function that have been designed.
- the designed target loss function can distinguish matched video and sentence pairs from unmatched ones, enhance the matching relationship between high-confidence spatiotemporal candidate regions and the sentence within matched pairs, and weaken the matching relationship between low-confidence spatiotemporal candidate regions and the sentence, thereby improving the reliability of model training and yielding a more accurate network model.
- FIG. 7 is a schematic diagram of an embodiment of the video sequence selection device in an embodiment of the application.
- the video sequence selection device 70 includes:
- the acquiring module 701 is configured to receive a video to be matched and a text to be matched, where the video to be matched includes multiple frames of images, the text to be matched includes at least one word, and the text to be matched corresponds to a feature sequence of the text to be matched;
- the generating module 702 is configured to call a spatiotemporal candidate region generator to extract a set of spatiotemporal candidate regions from the video to be matched, where the set of spatiotemporal candidate regions includes N spatiotemporal candidate regions, N is an integer greater than or equal to 1, and one spatiotemporal candidate region is one video sequence;
- the encoding module 703 is configured to perform feature extraction on each spatiotemporal candidate region in the set of spatiotemporal candidate regions through a convolutional neural network to obtain N video feature sequences to be matched, where each video feature sequence to be matched has a corresponding relationship with a spatiotemporal candidate region;
- the acquiring module 701 is further configured to call the attention-based interactor to obtain the matching score corresponding to each spatiotemporal candidate region, where the interactor is used to process the video feature sequence to be matched and the text feature sequence to be matched, and the matching score is used to represent the matching relationship between the spatiotemporal candidate region and the text to be matched;
- the selection module 704 is configured to select a target spatiotemporal candidate area from the spatiotemporal candidate area set according to the matching score corresponding to each spatiotemporal candidate area output by the interactor, and output the target spatiotemporal candidate area.
- An embodiment of the application provides a video sequence selection device, which first receives a video to be matched and a text to be matched, where the video to be matched includes multiple frames of images, the text to be matched includes at least one word, and the text to be matched corresponds to a text feature sequence to be matched. It then extracts the set of spatiotemporal candidate regions from the video to be matched and performs feature extraction on each spatiotemporal candidate region in the set to obtain N video feature sequences to be matched, where each video feature sequence to be matched corresponds to a spatiotemporal candidate region. The attention-based interactor is then called to obtain the matching score corresponding to each spatiotemporal candidate region, and finally the target spatiotemporal candidate region is selected from the set according to those matching scores.
- in this way, the spatiotemporal candidate regions in the video are matched with the text instead of matching each frame of the video with the text. The advantage is that, because a spatiotemporal candidate region captures the relationship of the images in both time and space, the temporal correlation between video and text is taken into account during matching, that is, the influence of video temporal information on the video sequence and the text is considered, which improves the matching degree between the output video sequence and the text and is conducive to a better understanding of the video content.
- the generating module 702 is configured to call the spatiotemporal candidate region generator to obtain the candidate regions and confidence scores of each frame of the video to be matched, where each candidate region corresponds to a confidence score; obtain the degree of coincidence between two adjacent frames of images in the video to be matched; and generate the set of spatiotemporal candidate regions according to the candidate region of each frame of image, the confidence score, and the degree of coincidence.
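- As an illustration of this linking step, the sketch below greedily connects per-frame candidate boxes across adjacent frames using their overlap (IoU) and confidence scores; the greedy strategy, the scoring rule, and the 0.5 threshold are assumptions for illustration, not the claimed generator.

```python
def link_spatiotemporal_regions(frame_boxes, frame_scores, iou_threshold=0.5):
    """Build spatiotemporal candidate regions (tubes) by linking per-frame boxes.

    frame_boxes:  per frame, a list of boxes (x1, y1, x2, y2).
    frame_scores: per frame, the confidence score of each box.
    Returns a list of tubes, each a list of (frame index, box index) pairs.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-8)

    # start one tube per box in the first frame, then extend frame by frame
    tubes = [[(0, i)] for i in range(len(frame_boxes[0]))]
    for t in range(1, len(frame_boxes)):
        for tube in tubes:
            last_t, last_i = tube[-1]
            if last_t != t - 1:
                continue  # tube already terminated in an earlier frame
            scored = [
                (iou(frame_boxes[last_t][last_i], box) + frame_scores[t][i], i)
                for i, box in enumerate(frame_boxes[t])
            ]
            if not scored:
                continue
            _, best_i = max(scored)
            if iou(frame_boxes[last_t][last_i], frame_boxes[t][best_i]) >= iou_threshold:
                tube.append((t, best_i))
    return tubes
```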
- the acquiring module 701 is configured to, for each of the spatiotemporal candidate regions, call the encoder of the interactor to encode the to-be-matched video feature sequence corresponding to the spatiotemporal candidate region to obtain a set of visual features, where the The visual feature set includes at least one visual feature; the encoder of the interactive device is called to encode the text feature sequence to be matched to obtain a text feature set, wherein the text feature set includes at least one text feature; The feature set and the text feature set determine a visual text feature set, wherein the visual text feature set includes at least one visual text feature, and the visual text feature represents a text feature based on the visual feature; according to the visual text feature set And the set of visual features to determine the matching score corresponding to the spatiotemporal candidate region.
- the acquiring module 701 is configured to calculate the visual feature set in the following manner:
- h_t^p = LSTM_p(f_t^p, h_{t-1}^p), H^p = {h_t^p | t = 1, ..., t_p};
- where H^p represents the visual feature set, h_t^p represents the t-th visual feature in the visual feature set, t_p represents the number of time steps of the spatiotemporal candidate region, h_{t-1}^p represents the (t-1)-th visual feature in the visual feature set, LSTM_p() represents the first long short-term memory (LSTM) encoder, and f_t^p represents the t-th row of features in the video feature sequence to be matched;
- and to calculate the text feature set in the following manner:
- h_t^q = LSTM_q(f_t^q, h_{t-1}^q), H^q = {h_t^q | t = 1, ..., t_q};
- where H^q represents the text feature set, h_t^q represents the t-th text feature in the text feature set, t_q represents the number of words in the text to be matched, h_{t-1}^q represents the (t-1)-th text feature in the text feature set, LSTM_q() represents the second LSTM encoder, and f_t^q represents the t-th row of features in the text feature sequence to be matched.
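- A sketch of the two encoders using a standard LSTM implementation is shown below; the feature dimensions and hidden size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SequenceEncoders(nn.Module):
    """Two LSTM encoders: one for the candidate-region video feature sequence F_p
    and one for the text feature sequence F_q. Dimensions are assumed values."""

    def __init__(self, d_p=4096, d_q=300, hidden=512):
        super().__init__()
        self.lstm_p = nn.LSTM(d_p, hidden, batch_first=True)  # visual encoder LSTM_p
        self.lstm_q = nn.LSTM(d_q, hidden, batch_first=True)  # text encoder LSTM_q

    def forward(self, F_p, F_q):
        # F_p: (batch, t_p, d_p), F_q: (batch, t_q, d_q)
        H_p, _ = self.lstm_p(F_p)  # (batch, t_p, hidden): visual feature set
        H_q, _ = self.lstm_q(F_q)  # (batch, t_q, hidden): text feature set
        return H_p, H_q
```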
- the acquisition module 701 is configured to call the interactor to obtain, according to the visual feature set and the text feature set, the attention weight of the text feature corresponding to the visual feature; obtain, according to the attention weight, the normalized attention weight of the text feature corresponding to the visual feature; and obtain the visual text feature set according to the normalized attention weight and the text feature.
- the obtaining module 701 is configured to calculate the attention weight in the following manner:
- e_{i,j} = w^T · tanh(W_q · h_j^q + W_p · h_i^p + b_1) + b_2;
- where e_{i,j} represents the attention weight of the i-th visual feature corresponding to the j-th text feature, h_j^q represents the j-th text feature, h_i^p represents the i-th visual feature, w^T represents the first model parameter, W_q represents the second model parameter, W_p represents the third model parameter, b_1 represents the fourth model parameter, b_2 represents the fifth model parameter, and tanh() represents the hyperbolic tangent function;
- to calculate the normalized attention weight in the following manner:
- a_{i,j} = exp(e_{i,j}) / Σ_{k=1}^{t_q} exp(e_{i,k});
- where a_{i,j} represents the normalized attention weight of the i-th visual feature corresponding to the j-th text feature, t_q represents the number of words in the text to be matched, k represents the k-th word in the text to be matched, k is an integer greater than or equal to 1 and less than or equal to t_q, and exp() represents an exponential function;
- and to calculate the visual text feature set in the following manner:
- h_i^{qp} = Σ_{j=1}^{t_q} a_{i,j} · h_j^q, H^{qp} = {h_i^{qp} | i = 1, ..., t_p};
- where H^{qp} represents the visual text feature set, t_p represents the number of time steps of the spatiotemporal candidate region, and h^{qp} represents the visual text feature.
- the obtaining module 701 is configured to calculate the matching score in the following manner:
- s(q,p) = Σ_{i=1}^{t_p} φ(h_i^{qp}, h_i^p);
- where s(q,p) represents the matching score corresponding to the spatiotemporal candidate region, h_i^{qp} represents the visual text feature corresponding to the i-th time step, h_i^p represents the visual feature corresponding to the i-th time step, and φ() represents a similarity calculation function.
- FIG. 8 is a schematic diagram of an embodiment of the model training device in an embodiment of this application.
- the model training device 80 includes:
- the acquiring module 801 is configured to acquire a first video to be trained, a second video to be trained, a first text to be trained, and a second text to be trained, wherein the first video to be trained and the first text to be trained have a match Relationship, and the first video to be trained does not have a matching relationship with the second text to be trained, the second video to be trained has a matching relationship with the second text to be trained, and the second video to be trained Does not have a matching relationship with the first text to be trained;
- the determining module 802 is configured to determine the permutation loss function according to the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained acquired by the acquiring module 801, where the permutation loss function is used to process the first video to be trained and the second text to be trained, and to process the second video to be trained and the first text to be trained;
- the determining module 802 is further configured to determine a diversity loss function according to the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained acquired by the acquiring module 801, where the diversity loss function is used to process the first video to be trained and the first text to be trained, and to process the second video to be trained and the second text to be trained;
- the determining module 802 is further configured to determine a target loss function according to the permutation loss function and the diversity loss function;
- the training module 803 is configured to use the target loss function determined by the determining module 802 to train the interactor to be trained to obtain an attention-based interactor, where the interactor is used to output the matching score between the video to be matched and the text to be matched.
- a model training device is provided. First, a first video to be trained, a second video to be trained, a first text to be trained, and a second text to be trained are acquired, where the first video to be trained matches the first text to be trained, the first video to be trained does not match the second text to be trained, the second video to be trained matches the second text to be trained, and the second video to be trained does not match the first text to be trained. Then the permutation loss function and the diversity loss function are determined according to the first video to be trained, the second video to be trained, the first text to be trained, and the second text to be trained. Finally, the permutation loss function and the diversity loss function are combined to train the model and obtain an attention-based interactor.
- using the permutation loss function and the diversity loss function to train the model at the same time can improve the matching accuracy between the text and the different spatiotemporal candidate regions, which is beneficial to improving the accuracy of model training.
- the determining module 802 is configured to obtain a first set of spatiotemporal candidate regions in the first video to be trained and a second set of spatiotemporal candidate regions in the second video to be trained, where the first set of spatiotemporal candidate regions includes at least one first spatiotemporal candidate region, the first spatiotemporal candidate region is a video sequence, the second set of spatiotemporal candidate regions includes at least one second spatiotemporal candidate region, and the second spatiotemporal candidate region is a video sequence; calculate a first matching score according to the first text to be trained and the second set of spatiotemporal candidate regions; calculate a second matching score according to the second text to be trained and the first set of spatiotemporal candidate regions; calculate a third matching score according to the first text to be trained and the first set of spatiotemporal candidate regions; and determine the permutation loss function according to the first matching score, the second matching score, and the third matching score.
- the determining module 802 is configured to determine a matching behavior distribution according to the first set of spatiotemporal candidate regions and the first text to be trained, where the first set of spatiotemporal candidate regions is generated according to the first video to be trained, and the matching behavior distribution represents the matching relationship between each first spatiotemporal candidate region in the first set and the first text to be trained; normalize the matching behavior distribution to obtain a target matching behavior distribution; and determine the diversity loss function according to the target matching behavior distribution.
- the determining module 802 is configured to obtain the control coefficient, and determine the target loss function according to the control coefficient, the permutation loss function, and the diversity loss function.
- FIG. 9 is a schematic diagram of a server structure provided by an embodiment of the present invention.
- the server 900 may vary considerably with configuration or performance, and may include one or more central processing units (CPU) 922 (for example, one or more processors), memory 932, and one or more storage media 930 (for example, one or more mass storage devices) for storing application programs 942 or data 944.
- the memory 932 and the storage medium 930 may be short-term storage or persistent storage.
- the program stored in the storage medium 930 may include one or more modules (not shown in the figure), and each module may include a series of command operations on the server.
- the central processing unit 922 may be configured to communicate with the storage medium 930 and execute the series of instruction operations in the storage medium 930 on the server 900.
- the server 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input and output interfaces 958, and/or one or more operating systems 941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
- the steps performed by the server in the foregoing embodiment may be based on the server structure shown in FIG. 9.
- the CPU 922 in the server is used to execute the method of video sequence selection or the method of model training provided in the foregoing embodiment.
- the disclosed system, device, and method may be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
- the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
- each unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
- the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer readable storage medium.
- the technical solution of this application essentially or the part that contributes to the existing technology or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , Including several instructions to make a computer device (which can be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the method described in each embodiment of the present application.
- the aforementioned storage media include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and other media that can store program code.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Medical Informatics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Biodiversity & Conservation Biology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Image Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application discloses a method for video sequence selection, applied to a computer device, including: receiving a video to be matched and a text to be matched, the text to be matched corresponding to a text feature sequence to be matched; calling a spatiotemporal candidate region generator to extract a set of spatiotemporal candidate regions from the video to be matched, the set including N spatiotemporal candidate regions; performing feature extraction on each spatiotemporal candidate region through a convolutional neural network to obtain N video feature sequences to be matched; calling an attention-based interactor to obtain the matching score corresponding to each spatiotemporal candidate region, the matching score representing the matching relationship between the spatiotemporal candidate region and the text to be matched; and selecting a target spatiotemporal candidate region from the set of spatiotemporal candidate regions according to the matching score corresponding to each spatiotemporal candidate region, and outputting the target spatiotemporal candidate region. The present application takes the temporal correlation between video and text into account during matching, thereby improving the matching degree between the video sequence and the text.
Description
本申请要求于2019年03月05日提交的申请号为201910165102.1、发明名称为“一种视频序列选择的方法、模型训练的方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本申请涉及人工智能技术领域,尤其涉及一种视频序列选择的方法、计算机设备及存储介质。
人工智能属于计算机科学的一个分支,主要目标是使机器能够胜任一些通常需要人类智能才能完成的复杂工作。在人工智能得到愈加广泛重视的背景下,随着计算机网络技术、多媒体技术以及数字传输技术的不断发展,以及摄像机、手机以及平板电脑等数码设备的不断普及,视频的数据量急剧增长。面对海量的视频,如何有效地对其进行处理,从而使用户能够迅速获取想要的信息,是当前研究和应用的关键问题。
目前,在提取视频中所需的内容时,通常是对视频中单帧图像和文本分别进行编码,然后将文本与每帧图像进行匹配,从而得到每帧图像与文本的匹配结果,再根据匹配结果得到单帧图像在视频中的空间位置,最后将这些空间位置串联起来得到一个与文本关联的视频序列。
然而,采用上述方式生成的视频序列虽然与文本之间具有关联性,但是仅仅考虑到单帧图像与文本之间的匹配关系,导致输出的视频序列与文本的匹配度较低,不利于对视频内容的理解。
发明内容
本申请实施例提供了一种视频序列选择的方法、、计算机设备及存储介质,由于时空候选区域包括了图像在时间和空间上的关系,因此,在匹配的时候考虑到了视频与文本在时序上的关联性,即考虑了视频时序信息对视频序列以及文本的影响,从而提升了输出的视频序列与文本的匹配度,进而有利于更好地理解视频内容。
有鉴于此,本申请第一方面提供一种视频序列选择的方法,所述方法应用于计算机设备,包括:
所述计算机设备接收待匹配视频以及待匹配文本,其中,所述待匹配视频包括多帧图像,所述待匹配文本包括至少一个词语,所述待匹配文本对应于待匹配文本特征序列;
所述计算机设备调用时空候选区域生成器从所述待匹配视频中提取时空候选区域集合,其中,所述时空候选区域集合中包括N个时空候选区域,所述N 为大于或等于1的整数,一个时空候选区域为一个视频序列;
所述计算机设备通过卷积神经网络对所述时空候选区域集合中的每个时空候选区域进行特征提取,得到N个待匹配视频特征序列,其中,所述待匹配视频特征序列与所述时空候选区域具有对应关系;
所述计算机设备调用基于注意力的交互器获取所述每个时空候选区域对应的匹配分值,其中,所述交互器用于对所述待匹配视频特征序列与所述待匹配文本特征序列进行处理,所述匹配分值用于表示所述时空候选区域与所述待匹配文本之间的匹配关系;
所述计算机设备根据所述交互器输出的所述每个时空候选区域对应的匹配分值,从所述时空候选区域集合中选择目标时空候选区域,输出所述目标时空候选区域。
本申请第二方面提供一种视频序列选择装置,包括:
获取模块,用于接收待匹配视频以及待匹配文本,其中,所述待匹配视频包括多帧图像,所述待匹配文本包括至少一个词语,所述待匹配文本对应于待匹配文本特征序列;
生成模块,用于调用时空候选区域生成器从所述待匹配视频中提取时空候选区域集合,其中,所述时空候选区域集合中包括N个时空候选区域,所述N为大于或等于1的整数,一个时空候选区域为一个视频序列;
编码模块,用于通过卷积神经网络对所述时空候选区域集合中的每个时空候选区域进行特征提取,得到N个待匹配视频特征序列,其中,所述待匹配视频特征序列与所述时空候选区域具有对应关系;
所述获取模块,还用于调用基于注意力的交互器获取所述每个时空候选区域对应的匹配分值,其中,所述交互器用于对所述待匹配视频特征序列与所述待匹配文本特征序列进行处理,所述匹配分值用于表示所述时空候选区域与所述待匹配文本之间的匹配关系;
选择模块,用于根据所述交互器输出的所述每个时空候选区域对应的匹配分值,从所述时空候选区域集合中选择目标时空候选区域,输出所述目标时空候选区域。
在一种可能的设计中,在本申请实施例的第二方面的第一种实现方式中,
所述生成模块,用于调用所述时空候选区域生成器获取所述待匹配视频中每帧图像的候选区域以及置信度得分,其中,每个候选区域对应一个置信度得分;获取所述待匹配视频中相邻两帧图像之间的重合度;根据所述每帧图像的候选区域、所述置信度得分以及所述重合度,生成所述时空候选区域集合。
在一种可能的设计中,在本申请实施例的第二方面的第二种实现方式中,
所述获取模块,用于对于所述每个时空候选区域,调用所述交互器的编码器对所述时空候选区域对应的待匹配视频特征序列进行编码,得到视觉特征集合,其中,所述视觉特征集合包括至少一个视觉特征;调用所述交互器的编码器对所述待匹配文本特征序列进行编码,得到文本特征集合,其中,所述文本特征集合包括至少一个文本特征;根据所述视觉特征集合以及所述文本特征集 合,确定视觉文本特征集合,其中,所述视觉文本特征集合包括至少一个视觉文本特征,所述视觉文本特征表示基于视觉特征的文本特征;根据所述视觉文本特征集合以及所述视觉特征集合,确定所述时空候选区域对应的匹配分值。
在一种可能的设计中,在本申请实施例的第二方面的第三种实现方式中,
所述获取模块,用于采用如下方式计算所述视觉特征集合:
其中,所述H
p表示所述视觉特征集合,所述
表示所述视觉特征集合中的第t个视觉特征,所述t
p表示所述时空候选区域的时间步数,所述
表示所述视觉特征集合中的第(t-1)个视觉特征,所述LSTM
p()表示第一长短期记忆网络LSTM编码器,所述f
t
p表示所述待匹配视频特征序列中的第t行特征;
采用如下方式计算所述文本特征集合:
其中,所述H
q表示所述文本特征集合,所述
表示所述文本特征集合中的第t个文本特征,所述t
q表示所述待匹配文本的词语数量,所述
表示所述文本特征集合中的第(t-1)个文本特征,所述LSTM
q()表示第二LSTM编码器,所述f
t
q表示所述待匹配文本特征序列中的第t行特征。
在一种可能的设计中,在本申请实施例的第二方面的第四种实现方式中,
所述获取模块,用于调用所述交互器执行根据所述视觉特征集合以及所述文本特征集合,计算视觉特征对应文本特征的注意力权重;根据所述注意力权重,计算所述视觉特征对应所述文本特征的归一化注意力权重;根据所述归一化注意力权重以及所述文本特征,计算视觉文本特征集合。
在一种可能的设计中,在本申请实施例的第二方面的第五种实现方式中,
所述获取模块,用于采用如下方式计算所述注意力权重:
其中,所述e
i,j表示第i个视觉特征对应第j个文本特征的注意力权重,所述
表示所述第j个文本特征,所述
表示所述第i个视觉特征,所述w
T表示第一模型参数,所述W
q表示第二模型参数,所述W
p表示第三模型参数,所述b
1表示第四模型参数,所述b
2表示第五模型参数,所述tanh()表示双曲正切函数;
采用如下方式计算所述归一化注意力权重:
其中,所述a
i,j表示所述第i个视觉特征对应所述第j个文本特征的归一化注意力权重,所述t
q表示所述待匹配文本的词语数量,所述k表示所述待匹配文 本中的第k个词语,所述k为大于或等于1,且小于或等于所述t
q的整数,所述exp()表示指数函数;
采用如下方式计算所述视觉文本特征集合:
其中,所述H
qp表示所述视觉文本特征集合,所述t
p表示所述时空候选区域的时间步数,所述h
qp表示视觉文本特征。
在一种可能的设计中,在本申请实施例的第二方面的第六种实现方式中,
所述获取模块,用于采用如下方式计算所述匹配分值:
其中,所述s(q,p)表示所述时空候选区域对应的匹配分值,所述
表示第i个时间步数对应的视觉特征和视觉文本特征之间的匹配子分值,所述
表示所述第i个时间步数对应的视觉文本特征,所述h
i
p表示所述第i个时间步数对应的视觉特征,所述φ()表示相似度计算函数。
本申请第三方面提供一种模型训练装置,包括:
获取模块,用于获取第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,其中,所述第一待训练视频与所述第一待训练文本具有匹配关系,且所述第一待训练视频与所述第二待训练文本不具有匹配关系,所述第二待训练视频与所述第二待训练文本具有匹配关系,且所述第二待训练视频与所述第一待训练文本不具有匹配关系;
确定模块,用于根据所述获取模块获取的所述第一待训练视频、所述第二待训练视频、所述第一待训练文本以及所述第二待训练文本,确定排列损失函数,其中,所述排列损失函数用于对所述第一待训练视频以及所述第二待训练文本进行处理,并对所述第二待训练视频以及所述第一待训练文本进行处理;
所述确定模块,还用于根据所述获取模块获取的所述第一待训练视频、所述第二待训练视频、所述第一待训练文本以及所述第二待训练文本,确定多样性损失函数,其中,所述多样性损失函数用于对所述第一待训练视频以及所述第一待训练文本进行处理,并对所述第二待训练视频以及所述第二待训练文本进行处理;
所述确定模块,还用于根据所述排列损失函数以及所述多样性损失函数,确定目标损失函数;
训练模块,用于采用所述确定模块确定的所述目标损失函数对待训练的交互器进行训练,得到基于注意力的交互器,其中,所述交互器用于输出待匹配 视频与待匹配文本的匹配分值。
在一种可能的设计中,在本申请实施例的第三方面的第一种实现方式中,
所述确定模块,用于获取所述第一待训练视频中的第一时空候选区域集合,以及获取所述第二待训练视频中的第二时空候选区域集合,其中,所述第一时空候选区域集合包括至少一个第一时空候选区域,所述第一时空候选区域为视频序列,所述第二时空候选区域集合包括至少一个第二时空候选区域,所述第二时空候选区域为视频序列;根据所述第一待训练文本以及所述第二时空候选区域集合,计算第一匹配分值;根据所述第二待训练文本以及所述第一时空候选区域集合,计算第二匹配分值;根据所述第一待训练文本以及所述第一时空候选区域集合,计算第三匹配分值;根据所述第一匹配分值、所述第二匹配分值以及所述第三匹配分值,确定所述排列损失函数。
在一种可能的设计中,在本申请实施例的第三方面的第二种实现方式中,
所述确定模块,用于根据第一时空候选区域集合以及所述第一待训练文本,确定匹配行为分布,其中,所述第一时空候选区域集合是根据所述第一待训练视频生成的,所述匹配行为分布表示所述第一时空候选区域集合中每个第一时空候选区域与所述第一待训练文本之间的匹配关系;对所述匹配行为分布进行归一化处理,得到目标匹配行为分布;根据所述目标匹配行为分布确定所述多样性损失函数。
在一种可能的设计中,在本申请实施例的第三方面的第三种实现方式中,
所述确定模块,用于获取控制系数,根据所述控制系数、所述排列损失函数以及所述多样性损失函数,确定所述目标损失函数。
本申请第四方面提供一种计算机设备,包括:存储器、收发器、处理器以及总线系统;
其中,所述存储器用于存储程序;
所述处理器用于执行所述存储器中的程序,包括如下步骤:
接收待匹配视频以及待匹配文本,其中,所述待匹配视频包括多帧图像,所述待匹配文本包括至少一个词语,所述待匹配文本对应于待匹配文本特征序列;
调用时空候选区域生成器从所述待匹配视频中提取时空候选区域集合,其中,所述时空候选区域集合中包括N个时空候选区域,所述N为大于或等于1的整数,一个时空候选区域为一个视频序列;
通过卷积神经网络对所述时空候选区域集合中的每个时空候选区域进行特征提取,得到N个待匹配视频特征序列,其中,所述待匹配视频特征序列与所述时空候选区域具有对应关系;
调用基于注意力的交互器获取所述每个时空候选区域对应的匹配分值,其中,所述交互器用于对所述待匹配视频特征序列与所述待匹配文本特征序列进行处理,所述匹配分值用于表示所述时空候选区域与所述待匹配文本之间的匹配关系;
根据所述交互器输出的所述每个时空候选区域对应的匹配分值,从所述时 空候选区域集合中选择目标时空候选区域,输出所述目标时空候选区域;
所述总线系统用于连接所述存储器以及所述处理器,以使所述存储器以及所述处理器进行通信。
本申请的第五方面提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有指令,当其在计算机设备上运行时,使得计算机设备执行上述第一方面所述的视频序列选择的方法。
从以上技术方案可以看出,本申请实施例具有以下优点:
本申请实施例提供了一种视频序列选择的方法,首先接收待匹配视频以及待匹配文本,其中,待匹配视频包括多帧图像,待匹配文本包括至少一个词语,待匹配文本对应于待匹配文本特征序列,然后从待匹配视频中提取时空候选区域集合,接下来需要对时空候选区域集合中的每个时空候选区域进行特征提取,得到N个待匹配视频特征序列,其中,待匹配视频特征序列与时空候选区域具有对应关系,然后可以调用基于注意力的交互器获取每个时空候选区域对应的匹配分值,最后根据每个时空候选区域对应的匹配分值,从时空候选区域集合中选择目标时空候选区域,其中,一个时空候选区域为一个视频序列。通过上述方式,对视频中的时空候选区域和文本进行匹配,而不再是对视频中的每帧图像与文本进行匹配,这样操作的好处是,由于时空候选区域包括了图像在时间和空间上的关系,因此,在匹配的时候考虑到了视频与文本在时序上的关联性,即考虑了视频时序信息对视频序列以及文本的影响,从而提升了输出的视频序列与文本的匹配度,进而有利于更好地理解视频内容。
图1为本申请实施例中视频序列选择系统的一个架构示意图;
图2为本申请实施例中视频序列选择系统的一个框架与流程示意图;
图3为本申请实施例中视频序列选择的方法一个实施例示意图;
图4为本申请实施例中提取时空候选区域的一个实施例示意图;
图5为本申请实施例中基于注意力机制的一个交互器结构示意图;
图6为本申请实施例中模型训练的方法一个实施例示意图;
图7为本申请实施例中视频序列选择装置一个实施例示意图;
图8为本申请实施例中模型训练装置一个实施例示意图;
图9为本申请实施例中服务器一个结构示意图。
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原 理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
计算机视觉技术(Computer Vision,CV)计算机视觉是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能系统。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、光学字符识别(Optical Character Recognition,OCR)、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、三维技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等生物特征识别技术。
机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、式教学习等技术。
随着人工智能技术研究和进步,人工智能技术在多个领域展开研究和应用,例如常见的智能家居、智能穿戴设备、虚拟助理、智能音箱、智能营销、无人驾驶、自动驾驶、无人机、机器人、智能医疗、智能客服等,相信随着技术的发展,人工智能技术将在更多的领域得到应用,并发挥越来越重要的价值。
本申请实施例提供的方案涉及人工智能的计算机视觉和机器学习等技术,通过如下实施例进行说明:
本申请实施例提供了一种视频序列选择的方法、计算机设备及存储介质,由于时空候选区域包括了图像在时间和空间上的关系,因此,在匹配的时候考虑到了视频与文本在时序上的关联性,即考虑了视频时序信息对视频序列以及文本的影响,从而提升了输出的视频序列与文本的匹配度,进而有利于更好地理解视频内容。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解的是,这样使用的数据在适当情况下可以互换,以便这里描述的本申请的实施例,能够以除了在这里图示或描述的那些以外的顺序实施。此外,术语“包括”和“对应于”以及他们的任何变形,意图在于覆 盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备,不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
应该理解的是,本申请可以应用于视频内容理解和定位的场景,包含但不仅限于视频分类的场景,在视频类网站上进行快速检索的场景,以及在视频中快速定位的场景。采用本申请所提供的视频序列选择的方法,能够衡量文本和视频之间的匹配关系,从而实现给定一个句子和一段视频,即可输出一段视频序列的目的。比如,需要在一段美国职业篮球联赛(National Basketball Association,NBA)视频中提取与球员库里相关的视频序列,以此制成一段视频集锦。首先,通过本申请所提供的方法,会先生成多个时空候选区域,这些时空候选区域也就是视频序列,然后根据文本“Curie's three-point shot(库里的三分球)”,从这些时空候选区域中选择与该文本匹配度最高的时空候选区域作为目标时空候选区域,可以将目标时空候选区域记为视频序列1。类似地,如果需要提取多个时空候选区域,则可以再输入不同的文本,比如“Harden drives the ball(哈登带球过人)”,然后从这些时空候选区域中选择与该文本匹配度最高的时空候选区域作为目标时空候选区域,可以将目标时空候选区域记为视频序列2。如果需要制成视频集锦,那么可以对视频1和视频2进行拼接,得到最终的视频。
为了便于理解,本申请提出了一种视频序列选择的方法,该方法应用于图1所示的视频序列选择系统,请参阅图1,图1为本申请实施例中视频序列选择系统的一个架构示意图,如图1所示,本申请中所提供的视频序列选择方法通常应用于计算机设备,该计算机设备可以是服务器100,也可以是客户端,本申请将以应用于服务器为例介绍。
在结合图1的基础上,请参阅图2,图2为本申请实施例中视频序列选择系统的一个框架与流程示意图,如图2所示,服务器100获取从客户端输入的视频,可以理解的是,该视频也可以是预先存储于服务器100中的数据,此处不做限定。接下来,服务器100通过时空候选区域生成器从视频中提取多个时空候选区域,如图2中的时空候选区域A、时空候选区域B和时空候选区域C。用户可通过客户端输入一段句子,比如“A brown squirrel is playing with a blue ball on the floor(一只棕色松鼠正在地板上玩蓝色的球)”,采用基于注意力的交互器将该句子分别与时空候选区域A、时空候选区域B和时空候选区域C进行交互。得到句子与时空候选区域A的匹配值为60,句子与时空候选区域B的匹配值为50,句子与时空候选区域C的匹配值为90,于是将时空候选区域C作为目标时空候选区域输出,其中,时空候选区域C表现为一段视频序列。
此外,基于注意力的交互器是通过损失函数优化得到的,该损失函数可以包括一个排列损失函数和一个多样性函数,此处不做限定。其中,上述基于注意力的交互器在后文中也被称之为视频文本交互模型。
需要说明的是,客户端部署于终端设备110上,其中,终端设备110包含但不仅限于平板电脑、笔记本电脑、掌上电脑、手机、语音交互设备及个人电脑(personal computer,PC),此处不做限定。其中,语音交互设备包含但不 仅限于智能音响以及智能家电。
结合上述介绍,下面将对本申请中视频序列选择的方法进行介绍,请参阅图3,本申请中视频序列选择的方法的一个实施例包括:
301、计算机设备接收待匹配视频以及待匹配文本,其中,待匹配视频包括多帧图像,待匹配文本包括至少一个词语,待匹配文本对应于待匹配文本特征序列;
本实施例中,视频序列选择装置首先需要获取待匹配视频以及待匹配文本,其中,该视频序列选择装置既可以部署于图1中所示的服务器100上,也可以部署于图1中所示的计算能力较强的终端设备110上,此处不做限定。待匹配视频包括多帧图像,待匹配文本包括至少一个词语,在获取到待匹配文本之后,可以采用词向量模型对该待匹配文本进行处理,实现将语言数学化,从而得到待匹配文本特征序列。待匹配文本特征序列中包括了至少一个词向量,词向量即为词语的向量表示。假设待匹配文本为句子q,对句子q进行编码,得到待匹配文本特征序列F
q,且
其中,t
q表示句子q中的词语数量,d
q表示词向量的特征长度。
可以理解的是,待匹配视频的视频格式包含但不仅限于运动图像专家组(motion picture experts group,MPEG)格式、音频视频交错(audio video interleaved,AVI)、格式、高级流格式(advanced streaming format,ASF)、微软媒体视频(Windows media video,WMV)格式、第三代合作伙伴项目计划文件格式(3rd generation partnership project file format,3GP)、多媒体容器文件格式(multimedia container file format,MKV)、流媒体格式(flash video)以及视频容器可变比特率文件格式(RealMedia variable bitrate file format,RMVB)等。
可以理解的是,待匹配文本的语言类型包含但不仅限于中文、英文、日文、法文、德文以及阿拉伯语等。
302、计算机设备调用时空候选区域生成器从待匹配视频提取时空候选区域集合,其中,时空候选区域集合中包括N个时空候选区域,N为大于或等于1的整数;
本实施例中,视频序列选择装置对待匹配视频进行处理,得到一系列时空候选区域,这一系列的时空候选区域称为时空候选区域集合,时空候选区域集合包括N个时空候选区域,可以表示为
N即表示时空候选区域集合中的时空候选区域总数,一个时空候选区域即为一个视频序列。
303、计算机设备通过卷积神经网络对时空候选区域集合中的每个时空候选区域进行特征提取,得到N个待匹配视频特征序列,其中,待匹配视频特征序列与时空候选区域具有对应关系;
本实施例中,视频序列选择装置分别对时空候选区域集合中的每个时空候选区域进行编,以一个时空候选区域为例,假设该时空候选区域表示为p,使用卷积神经网络对其提取相应的序列特征,即得到待匹配视频特征序列F
p,且
其中,t
p表示视频的时间步数,时间步数表示视频被压缩后的固定 长度,d
p表示视频的特征长度,比如可以是6048或4096等,此处不做限定。每个时空候选区域对应一个待匹配视频特征序列。
304、计算机设备调用基于注意力的交互器获取每个时空候选区域对应的匹配分值,其中,该交互器用于对待匹配视频特征序列与待匹配文本特征序列进行处理,匹配分值用于表示时空候选区域与待匹配文本之间的匹配关系;
本实施例中,视频序列选择装置将每个待匹配视频特征序列以及待匹配文本特征序列输入至视频文本交互模型,由该视频文本交互模型输出相应的匹配分值。比如,时空候选区域集合包括3个时空候选区域,每个时空候选区域分别对应一个待匹配视频特征序列,比如时空候选区域A对应待匹配视频特征序列A,时空候选区域B对应待匹配视频特征序列B,时空候选区域C对应待匹配视频特征序列C。此时,将待匹配视频特征序列A与待匹配文本特征序列输入至视频文本交互模型,由视频文本交互模型输出匹配分值A。将待匹配视频特征序列B与待匹配文本特征序列输入至视频文本交互模型,由视频文本交互模型输出匹配分值B。将待匹配视频特征序列C与待匹配文本特征序列输入至视频文本交互模型,由视频文本交互模型输出匹配分值C。通常情况下,匹配分值越高,表示匹配关系越强。
305、计算机设备根据该交互器输出的每个时空候选区域对应的匹配分值,从时空候选区域集合中选择目标时空候选区域,输出目标时空候选区域。
本实施例中,视频序列选择装置根据每个时空候选区域对应的匹配分值,从中选择匹配分值最高的时空候选区域作为目标时空候选区域,并输出该目标时空候选区域,其中,目标时空候选区域即为一个视频序列。
本申请实施例提供了一种视频序列选择的方法,首先接收待匹配视频以及待匹配文本,其中,待匹配视频包括多帧图像,待匹配文本包括至少一个词语,待匹配文本对应于待匹配文本特征序列,然后从待匹配视频中提取时空候选区域集合,接下来需要对时空候选区域集合中的每个时空候选区域进行特征提取,得到N个待匹配视频特征序列,其中,待匹配视频特征序列与时空候选区域具有对应关系,然后可以通过视频文本交互模型获取每个时空候选区域对应的匹配分值,最后根据每个时空候选区域对应的匹配分值,从时空候选区域集合中选择目标时空候选区域,其中,一个时空候选区域为一个视频序列。通过上述方式,将视频中的时空候选区域和文本进行匹配,而不再是将视频中的每帧图像与文本进行匹配,这样操作的好处是,由于时空候选区域包括了图像在时间和空间上的关系,因此,在匹配的时候考虑到了视频与文本在时序上的关联性,即考虑了视频时序信息对视频序列以及文本的影响,从而提升了输出的视频序列与文本的匹配度,进而有利于更好地理解视频内容。
可选地,在上述图3对应的实施例的基础上,本申请实施例提供的视频序列选择的方法还包括一个可选实施例,在该可选实施例中,上述步骤302计算机设备调用时空候选区域生成器从待匹配视频提取时空候选区域集合,可以包括:
所述计算机设备调用所述时空候选区域生成器获取待匹配视频中每帧图像 的候选区域以及置信度得分,其中,每个候选区域对应一个置信度得分;获取待匹配视频中相邻两帧图像之间的重合度;根据每帧图像的候选区域、置信度得分以及重合度,生成时空候选区域集合。
本实施例中,介绍了一种根据待匹配视频生成时空候选区域集合的方式,首先,视频序列选择装置采用单帧的候选区域生成方法检测待匹配视频中的每一帧,由此得到每帧图像中的候选区域以及置信度得分,为了便于理解,请参阅图4,图4为本申请实施例中提取时空候选区域的一个实施例示意图,如图4所示,在一帧图像中可以提取至少一个候选区域,需要注意的是,一帧图像是没有时序信息的,多帧图像就具有时序信息。候选区域S1为一个西瓜的图像,候选区域S2为一个菠萝的图像,由此可见,不同的候选区域其对应的边框大小也不同。以西瓜对应的候选区域为例,假设有10帧图像的置信度得分相近,且重合度高,则这10帧图像构成“西瓜”对应的时空候选区域。
其次,本申请实施例还提供了一种确定时空候选区域集合的方式,首先可以获取待匹配视频中每帧图像的候选区域以及置信度得分,每个候选区域对应一个置信度得分,然后获取待匹配视频中相邻两帧之间的重合度,最后根据每帧图像的候选区域、置信度得分以及重合度,生成时空候选区域集合。通过上述方式,结合了视频在时间和空间上的变化,同时结合了视频中外观信号和运动信号生成时空候选区域,由此,可以提升时空候选区域生成的准确性。
可选地,在上述图3对应的实施例的基础上,本申请实施例提供的视频序列选择的方法还包括另一个可选实施例,在该可选实施例中,上述步骤304所述计算机设备调用基于注意力的交互器获取所述每个时空候选区域对应的匹配分值,可以包括:
对于每个时空候选区域,所述计算机设备调用所述交互器的编码器对该时空候选区域对应的待匹配视频特征序列进行编码,得到视觉特征集合,其中,视觉特征集合包括至少一个视觉特征;所述计算机设备调用所述交互器的编码器对待匹配文本特征序列进行编码,得到文本特征集合,其中,文本特征集合包括至少一个文本特征;所述计算机设备调用所述交互器执行根据视觉特征集合以及文本特征集合,确定视觉文本特征集合,其中,视觉文本特征集合包括至少一个视觉文本特征,视觉文本特征表示基于视觉特征的文本特征;根据视觉文本特征集合以及视觉特征集合,确定该时空候选区域对应的匹配分值。
本实施例中,介绍了一种通过视频文本交互模型获取时空候选区域对应的匹配分值的实现方式。为了便于理解,请参阅图5,图5为本申请实施例中基于注意力机制的一个交互器结构示意图,如图5所示,为了便于说明,下面将以一个时空候选区域为例进行说明,首先获取该时空候选区域对应的待匹配视频特征序列F
p以及待匹配文本特征序列F
q,然后使用两个基于长短期记忆网络(Long Short-Term Memory,LSTM)的循环神经网络作为编码器。其中,LSTM 编码器属于视频文本交互模型的组成部分。
然后模型训练装置将待匹配视频特征序列F
p输入至一个LSTM编码器,由这个LSTM编码器对待匹配视频特征序列F
p进行编码,得到视觉特征集合H
p,其中,视觉特征集合H
p包括t
p个视觉特征h
p。将待匹配文本特征序列F
q输入至另一个LSTM编码器,由另一个LSTM编码器对待匹配文本特征序列F
q进行编码,得到文本特征集合H
q,其中,文本特征集合H
q包括t
q个文本特征h
q。
接下来,模型训练装置采用注意力机制对文本特征集合H
q进行有针对性的加权求和,从而得到视觉文本特征集合H
qp,其中,视觉文本特征集合H
qp包括t
p个视觉文本特征h
qp,视觉文本特征即为视觉导向的文本特征。最后,模型训练装置对视觉文本特征集合H
qp中的视觉文本特征h
qp以及视觉特征集合H
p中视觉特征h
p进行计算,得到t
p个分值,对这t
p个分值求和即可得到该时空候选区域对应的匹配分值。
可以理解的是,时空候选区域集合中每个时空候选区域的处理方式均如上所述,由此,得到时空候选区域集合中各个时空候选区域的匹配分值,此处不再赘述其他时空候选区域的处理方式。
其次,本申请实施例还提供了一种获取时空候选区域对应的匹配分值的方式,分别通过视频文本交互模型的编码器对待匹配视频特征序列进行编码,得到视觉特征集合,并对待匹配文本特征序列进行编码,得到文本特征集合,然后根据视觉特征集合以及文本特征集合,确定视觉文本特征集合,最后,根据视觉文本特征集合以及视觉特征集合,确定时空候选区域对应的匹配分值。通过上述方式,利用交互机制将视频和文本进行特征融合,且能够刻画视频中每一个时空候选区域与文本之间的匹配关系,由此,在时间和空间上都能够实现与文本的匹配,进而提高对视频内容理解能力。
可选地,在上述图3对应的另一个可选实施例的基础上,本申请实施例提供的视频序列选择的方法还包括再一个可选实施例,在该可选实施例中,调用所述交互器的编码器对待匹配视频特征序列进行编码,得到视觉特征集合,可以包括:
采用如下方式计算视觉特征集合:
其中,H
p表示视觉特征集合,
表示视觉特征集合中的第t个视觉特征,t
p表示时空候选区域的时间步数,
表示视觉特征集合中的第(t-1)个视觉特征,LSTM
p()表示第一LSTM编码器,f
t
p表示待匹配视频特征序列中的第t行特征;
通过视频文本交互模型的编码器对待匹配文本特征序列进行编码,得到文本特征集合,可以包括:
采用如下方式计算文本特征集合:
其中,H
q表示文本特征集合,
表示文本特征集合中的第t个文本特征,t
q表示待匹配文本的词语数量,
表示文本特征集合中的第(t-1)个文本特征,LSTM
q()表示第二LSTM编码器,
表示待匹配文本特征序列中的第t行特征。
本实施例中,将介绍一种生成视觉特征集合以及生成文本特征集合的实现方式。下面介绍生成视觉特征集合的方式,即采用LSTM编码器对视觉特征集合H
p中的第(t-1)个视觉特征
以及待匹配视频特征序列F
p中的第t行特征f
t
p输入至第一LSTM编码器,由第一LSTM编码器输出第t个视觉特征
当输出t
p个视觉特征
之后,即可得到视觉特征集合H
p。
下面介绍生成文本特征集合的方式,即采用LSTM编码器对文本特征集合H
q中的第(t-1)个文本特征
以及待匹配文本特征序列F
q中的第t行特征f
t
q输入至第二LSTM编码器,由第二LSTM编码器输出第t个文本特征
当输出t
q个文本特征
之后,即可得到文本特征集合H
q。
其中,LSTM编码器增加了对过去状态的过滤,从而可以选择哪些状态对当前更有影响,而不是简单的选择最近的状态。LSTM编码器结构包含了遗忘门、学习门、记忆门以及使用门,长期记忆进入遗忘门,忘记它认为没有用处的内容。短期记忆和事件在学习门里合并到一起,并移除掉不必要的信息,作为学到的新信息。还没遗忘的长期记忆和刚学到的新信息会在记忆门里合并到一起,这个门将这两者放到一起,由于它叫记忆门,所以它会输出更新后的长期记忆。最后,使用门会决定要从之前知道的信息以及刚学到的信息中挑出什么来使用,从而作出预测,所以它也接受长期记忆和新信息的输入,将它们合并到一起并决定要输出什么。
再次,本申请实施例还提供了一种利用编码器对待匹配视频特征序列进行编码,得到视觉特征集合的实现方式,以及利用编码器对待匹配文本特征序列进行编码,得到文本特征集合的实现方式。通过上述方式,实现对特征序列的编码处理,并且采用LSTM编码器进行编码,由此可以处理和预测时间序列中间隔和延迟相对较长的重要事件,从而提升方案的可行性和可操作性。
可选地,在上述图3对应的另一个可选实施例的基础上,本申请实施例提供的视频序列选择的方法还包括又一个可选实施例,在该可选实施例中,计算机设备调用所述交互器执行根据视觉特征集合以及文本特征集合,确定视觉文本特征集合,可以包括:
所述计算机设备调用所述交互器执行根据视觉特征集合以及文本特征集合,计算视觉特征对应文本特征的注意力权重;根据该注意力权重,计算视觉特征对应文本特征的归一化注意力权重;根据归一化注意力权重以及文本特征,计算视觉文本特征集合。
本实施例中,介绍了一种生成视觉文本特征集合的方式。
可以理解的是,注意力机制(attention mechanism)是解决信息超载问题的主要手段,属于一种资源分配方案,将计算资源分配给更重要的任务,本申请采取的注意力机制可以是多头注意力、硬性注意力、键值对注意力或者结构化注意力。
多头注意力是利用多个查询,来平行地计算从输入信息中选取多个信息。每个注意力关注输入信息的不同部分。硬注意力,即基于注意力分布的所有输入信息的期望。还有一种注意力是只关注到一个位置上,叫做硬性注意力。硬性注意力有两种实现方式,一种是选取最高概率的输入信息。另一种硬性注意力可以通过在注意力分布式上随机采样的方式实现。键值对注意力可以用键值对格式来表示输入信息,其中“键”用来计算注意力分布,“值”用来生成选择的信息。结构化注意力要从输入信息中选取出和任务相关的信息,主动注意力是在所有输入信息上的多项分布,是一种扁平结构。如果输入信息本身具有层次结构,比如文本可以分为词、句子、段落以及篇章等不同粒度的层次,我们可以使用层次化的注意力来进行更好的信息选择。
再次,本申请实施例还提供了一种确定视觉文本特征集合的方式,即首先根据视觉特征集合以及文本特征集合,获取视觉特征对应文本特征的注意力权重,然后根据该注意力权重,获取视觉特征对应文本特征的归一化注意力权重,最后根据该归一化注意力权重以及文本特征,获取视觉文本特征集合。通过上述方式,充分利用注意力机制生成视觉文本特征,以此获取更多需要关注目标的视觉信息,而抑制其他无用信息,从而极大地提高了视觉信息处理的效率与准确性,能够从众多信息中选择出对当前任务目标更关键的信息。
可选地,在上述又一个可选实施例的基础上,本申请实施例提供的视频序列选择的方法还包括另一个可选实施例,在该可选实施例中,计算机设备调用所述交互器执行根据视觉特征集合以及文本特征集合,计算视觉特征对应文本特征的注意力权重,可以包括:
采用如下方式获取注意力权重:
其中,e
i,j表示第i个视觉特征对应第j个文本特征的注意力权重,
表示第j个文本特征,
表示第i个视觉特征,w
T表示第一模型参数,W
q表示第二模型参数,W
p表示第三模型参数,b
1表示第四模型参数,b
2表示第五模型参数,tanh()表示双曲正切函数;
根据视觉特征对应文本特征的注意力权重,计算视觉特征对应文本特征的归一化注意力权重,可以包括:
采用如下方式计算归一化注意力权重:
其中,a
i,j表示第i个视觉特征对应第j个文本特征的归一化注意力权重, t
q表示待匹配文本的词语数量,k表示待匹配文本中的第k个词语,k为大于或等于1,且小于或等于t
q的整数,exp()表示指数函数;
根据该归一化注意力权重以及文本特征,计算视觉文本特征集合,包括:
采用如下方式计算视觉文本特征集合:
其中,H
qp表示视觉文本特征集合,t
p表示时空候选区域的时间步数,h
qp表示视觉文本特征。
接下来,对注意力权重进行归一化处理,即采用如下方式计算归一化注意力权重:
最后采用如下方式计算视觉文本特征集合:
进一步地,本申请实施例还提供了一种计算视觉特征对应文本特征的注意力权重的实现方式,同时,提供了一种计算视觉特征对应文本特征的归一化注意力权重的实现方式,以及提供了一种计算视觉文本特征集合的实现方式。通过上述方式,为方案的实现提供了具体且可行的方式,由此,提升方案的实用性和可行性。
可选地,在上述任一个可选实施例的基础上,本申请实施例提供的视频序列选择的方法还包括另一个可选实施例,在该可选实施例中,计算机设备调用所述交互器执行根据视觉文本特征集合以及视觉特征集合,确定时空候选区域对应的匹配分值,可以包括:
采用如下方式计算匹配分值:
其中,s(q,p)表示时空候选区域对应的匹配分值,
表示第i个时间步数对应的视觉特征和视觉文本特征之间的匹配子分值,
表示第i个时间步数对应的视觉文本特征,
表示第i个时间步数对应的视觉特征,φ()表示相似度计算函数。
本实施例中,介绍了一种计算时空候选区域对应的匹配分值的实现方式。由于一个时空候选区域是由视频序列组成的,一个时空候选区域经过编码处理后会得到一个对应的待匹配视频特征序列,而一个待匹配视频特征序列会对应多个视觉特征,因此,需要针对每个视频特征进行计算,得到匹配子分值,最后每个匹配子分值相加,即可得到整个时空候选区域对应的匹配分值。
为了便于理解,请参阅如下公式:
这是计算匹配子分值的方式,即计算第i个视频特征对应的匹配子分值。类似地,对每个视频特征进行上述计算,即可得到t
p个视频特征的匹配子分值,最后采用如下公式进行计算,即可得到整个时空候选区域的匹配分值:
更进一步地,本申请实施例还提供了一种根据视觉文本特征集合以及视觉特征集合,确定时空候选区域对应的匹配分值的实现方式。通过上述方式,为方案的实现提供了具体且可行的方式,由此,提升方案的实用性和可行性。
结合上述介绍,下面将对本申请中模型训练的方法进行介绍,请参阅图6,本申请中模型训练的方法的一个实施例包括:
601、计算机设备获取第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,其中,第一待训练视频与第一待训练文本具有匹配关系,且第一待训练视频与第二待训练文本不具有匹配关系,第二待训练视频与第二待训练文本具有匹配关系,且第二待训练视频与第一待训练文本不具有匹配关系;
本实施例中,模型训练装置首先获取第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,其中,有两对匹配的训练对象以及两对不匹配的训练对象,即第一待训练视频与第一待训练文本匹配,第二待训练视频与第二待训练文本匹配,第一待训练视频与第二待训练文本不匹配,第二待训练视频与第一待训练文本不匹配。
602、计算机设备根据第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,确定排列损失函数,其中,排列损失函数用于对第一待训练视频以及第二待训练文本进行处理,并对第二待训练视频以及第一待训练文本进行处理;
本实施例中,模型训练装置将第一待训练视频以及第一待训练文本作为正样本,将第二待训练视频以及第二待训练文本作为正样本,将第一待训练视频 以及第二待训练文本作为负样本,将第二待训练视频以及第一待训练文本作为负样本,进而获取正样本的匹配分值以及负样本的匹配分值。根据正样本的匹配分值以及负样本的匹配分值之间的大小关系构建排列损失函数。
603、计算机设备根据第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,确定多样性损失函数,其中,多样性损失函数用于对第一待训练视频以及第一待训练文本进行处理,并对第二待训练视频以及第二待训练文本进行处理;
本实施例中,模型训练装置将第一待训练视频以及第一待训练文本作为正样本,将第二待训练视频以及第二待训练文本作为正样本,可以选择任意一个正样本对应的数据构建多样性损失函数。在正样本中,需要使得不同的时空候选区域具有不同的匹配分值,也就是整个时空分布区域的分值分布应该是多样的,而非相等的情况,这样才能实现更为准确的匹配效果。
604、计算机设备根据排列损失函数以及多样性损失函数,确定目标损失函数;
本实施例中,模型训练装置结合排列损失函数以及多样性损失函数,生成目标损失函数。
605、计算机设备采用目标损失函数对待训练的交互器进行训练,得到基于注意力的交互器,其中,该交互器用于输出待匹配视频与待匹配文本的匹配分值。
本实施例中,计算机设备的模型训练装置利用已经构建完成的目标损失函数,对待训练视频文本交互模型进行训练,进而得到视频文本交互模型。将待匹配视频与待匹配文本经过特征化处理之后,输入至视频文本交互模型,由此得到视频中各个时空候选区域的匹配分值,最后选择匹配分值最大的时空候选区域作为目标时空候选区域。
损失函数通常作为学习准则,并且与优化问题相联系,即通过最小化损失函数求解和评估模型。
本申请实施例提供了一种模型训练的方法,首先获取第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,第一待训练视频与第一待训练文本匹配,第一待训练视频与第二待训练文本不匹配,第二待训练视频与第二待训练文本匹配,第二待训练视频与第一待训练文本不匹配。然后根据第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,确定排列损失函数以及多样性损失函数,最后,结合排列损失函数以及多样性损失函数对模型进行训练,得到视频文本交互模型。通过上述方式,同时利用排列损失函数以及多样性损失函数训练模型,能够提升文本与不同时空候选区域之间的匹配准确性,从而有利于提升模型训练的精度。
可选地,在上述图6对应的实施例的基础上,本申请实施例提供的模型训练的方法还包括一个可选实施例,在该可选实施例中,上述步骤602根据第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,确定排列损失函数,可以包括:
获取第一待训练视频中的第一时空候选区域集合,以及获取第二待训练视频中的第二时空候选区域集合,其中,第一时空候选区域集合包括至少一个第一时空候选区域,第一时空候选区域为视频序列,第二时空候选区域集合包括至少一个第二时空候选区域,第二时空候选区域为视频序列;
根据第一待训练文本以及第二时空候选区域集合,计算第一匹配分值;
根据第二待训练文本以及第一时空候选区域集合,计算第二匹配分值;
根据第一待训练文本以及第一时空候选区域集合,计算第三匹配分值;
根据第一匹配分值、第二匹配分值以及第三匹配分值,确定排列损失函数。
This embodiment describes the ranking loss function. First, the model training apparatus extracts spatiotemporal candidate regions from the first video and the second video, in the manner described in the first optional embodiment corresponding to FIG. 3 above. This yields the first set of spatiotemporal candidate regions of the first video, containing at least one first candidate region, and the second set of spatiotemporal candidate regions of the second video, containing at least one second candidate region. Let the first video be $v$ and the first text be $q$; their matching score $S(v,q)$ is defined as
$$S(v,q)=\max_{i}\, s(q,p_i),\qquad i=1,\ldots,N;$$
where $p_i$ denotes the i-th first spatiotemporal candidate region in the first set, $N$ denotes the total number of first candidate regions in the first video, $\max_i(\cdot)$ takes the maximum, and $s(q,p_i)$ characterizes the matching behavior between the i-th first candidate region $p_i$ and the input first text $q$. $S(v,q)$ may be taken as the third matching score.
Similarly, let the second video be $v'$ and the first text be $q$; their matching score $S(v',q)$ is defined as
$$S(v',q)=\max_{i}\, s(q,p'_i),\qquad i=1,\ldots,N;$$
where $p'_i$ denotes the i-th second spatiotemporal candidate region in the second set, $N$ denotes the total number of second candidate regions in the second video, $\max_i(\cdot)$ takes the maximum, and $s(q,p'_i)$ characterizes the matching behavior between the i-th second candidate region $p'_i$ and the input first text $q$. $S(v',q)$ may be taken as the first matching score.
Similarly, let the first video be $v$ and the second text be $q'$; their matching score $S(v,q')$ is defined as
$$S(v,q')=\max_{i}\, s(q',p_i),\qquad i=1,\ldots,N;$$
where $p_i$ denotes the i-th first spatiotemporal candidate region in the first set, $N$ denotes the total number of first candidate regions in the first video, $\max_i(\cdot)$ takes the maximum, and $s(q',p_i)$ characterizes the matching behavior between the i-th first candidate region $p_i$ and the input second text $q'$. $S(v,q')$ may be taken as the second matching score.
From the first, second and third matching scores computed above, the ranking loss function is
$$L_{\mathrm{rank}}=\sum_{v\neq v'}\sum_{q\neq q'}\Big[\max\big(0,\,S(v,q')-S(v,q)+\Delta\big)+\max\big(0,\,S(v',q)-S(v,q)+\Delta\big)\Big];$$
where $\Delta$ is a constant margin. The ranking loss $L_{\mathrm{rank}}$ directly pushes the third matching score $S(v,q)$ of the positive sample to be larger than the second matching score $S(v,q')$ and the first matching score $S(v',q)$ of the negative samples; it therefore helps produce a strong matching behavior $s(q,p^{*})$ between the target spatiotemporal candidate region $p^{*}$ and the first text $q$.
Further, this embodiment of the present application provides a way of determining the ranking loss function: the model training apparatus first obtains the first set of spatiotemporal candidate regions from the first video and the second set from the second video, each set containing at least one candidate region and each candidate region being a video sequence; it then computes the first matching score from the first text and the second set, the second matching score from the second text and the first set, and the third matching score from the first text and the first set, and finally determines the ranking loss function from the three matching scores. A ranking loss designed this way pushes the matching scores of matched data above those of unmatched data, producing a strong matching relationship between the target spatiotemporal candidate region and the text; the ranking loss thus contributes to distinguishing whether a video and a text match.
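The ranking loss above can be expressed compactly in code. The sketch below is an assumption-laden illustration: the margin value 0.1 and the function names are chosen for the example, and only a single (v, q, v', q') tuple is handled rather than the full double sum over a batch.

```python
import torch

def video_level_score(proposal_scores: torch.Tensor) -> torch.Tensor:
    """S(v, q) = max_i s(q, p_i) over the N proposals of one video."""
    return proposal_scores.max()

def ranking_loss(s_vq: torch.Tensor, s_vq_neg: torch.Tensor,
                 s_vneg_q: torch.Tensor, delta: float = 0.1) -> torch.Tensor:
    """Hinge terms for one (v, q, v', q') tuple.

    s_vq:     S(v, q)   score of the matched video/text pair (third score)
    s_vq_neg: S(v, q')  matched video with the unmatched text (second score)
    s_vneg_q: S(v', q)  unmatched video with the matched text (first score)
    delta:    margin constant; 0.1 is an assumed value, not from the patent.
    """
    zero = torch.zeros_like(s_vq)
    return (torch.maximum(zero, s_vq_neg - s_vq + delta)
            + torch.maximum(zero, s_vneg_q - s_vq + delta))
```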
Optionally, on the basis of the embodiment corresponding to FIG. 6 above, this embodiment of the present application further provides another optional embodiment in which step 603, determining the diversity loss function from the first video, the second video, the first text and the second text to be trained, may include:
determining a matching behavior distribution from the first set of spatiotemporal candidate regions and the first text, where the first set is generated from the first video and the distribution represents the matching relationship between every first spatiotemporal candidate region in the set and the first text; normalizing the matching behavior distribution to obtain a target matching behavior distribution; and determining the diversity loss function from the target matching behavior distribution.
This embodiment describes the diversity loss function. Prior experience shows that when a sentence expressed in natural language is localized in a video, only a small fraction of the spatiotemporal candidate regions are semantically paired with the input sentence; a reasonable matching behavior distribution $\{s(q,p_n)\}_{n=1}^{N}$ should therefore be diverse. That is, only a few candidate regions should exhibit strong matching behavior with the text, while the matching behavior of the remaining candidate regions should be weak.
To make the resulting matching behavior distribution diverse, a diversity loss function is introduced. First, the model training apparatus extracts spatiotemporal candidate regions from the first video, in the manner described in the first optional embodiment corresponding to FIG. 3 above, obtaining the set of spatiotemporal candidate regions of the first video, which contains at least one candidate region. Let the first video be $v$ and the first text be $q$. The scheme first normalizes the matching behavior distribution with a softmax function, computed as
$$\tilde{s}(q,p_k)=\frac{\exp\big(s(q,p_k)\big)}{\sum_{n=1}^{N}\exp\big(s(q,p_n)\big)};$$
where $p_k$ is any one of the $p_n$, i.e. the k-th spatiotemporal candidate region. The information entropy of the normalized distribution is then penalized, which strengthens the matching relationship between high-confidence candidate regions and the text while weakening that of low-confidence candidate regions. The resulting diversity loss function is
$$L_{\mathrm{div}}=-\sum_{k=1}^{N}\tilde{s}(q,p_k)\,\log\tilde{s}(q,p_k);$$
where $L_{\mathrm{div}}$ denotes the diversity loss function.
Further, this embodiment of the present application provides a way of determining the diversity loss function: the model training apparatus determines the matching behavior distribution from the first set of spatiotemporal candidate regions and the first text, normalizes it to obtain the target matching behavior distribution, and then determines the diversity loss function from the target distribution. A diversity loss designed this way strengthens the matching relationship between high-confidence candidate regions and the text while weakening that of low-confidence candidate regions, which better reflects the actual matching relationship between candidate regions and text and thus helps train a more accurate network model.
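Under the same assumptions, the diversity loss can be sketched as a softmax over one video's proposal scores followed by an entropy penalty. The entropy form follows the reconstruction above; treat it as an illustration rather than the exact patented formula.

```python
import torch
import torch.nn.functional as F

def diversity_loss(proposal_scores: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Entropy penalty on the softmax-normalized matching distribution.

    proposal_scores: s(q, p_1..N) for the N proposals of a matched video, shape (N,).
    Minimizing the entropy sharpens the distribution, strengthening the few
    high-confidence proposal/text matches and weakening the rest.
    """
    probs = F.softmax(proposal_scores, dim=0)        # normalized distribution
    return -(probs * torch.log(probs + eps)).sum()   # L_div = entropy of the distribution
```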
Optionally, on the basis of FIG. 6 and the two optional embodiments corresponding to FIG. 6 above, this embodiment of the present application further provides yet another optional embodiment in which step 604, determining the target loss function from the ranking loss function and the diversity loss function, may include:
determining the target loss function as
$$L=L_{\mathrm{rank}}+\beta L_{\mathrm{div}};$$
where $L$ denotes the target loss function, $L_{\mathrm{rank}}$ denotes the ranking loss function, $L_{\mathrm{div}}$ denotes the diversity loss function, and $\beta$ denotes a control coefficient.
This embodiment describes one way of generating the target loss function. After obtaining the ranking loss function and the diversity loss function, the model training apparatus adds the two together, applying a coefficient to the diversity loss:
$$L=L_{\mathrm{rank}}+\beta L_{\mathrm{div}};$$
where $\beta$ may be set to 0.5 or to another reasonable value, which is not limited here.
That is, the computer device obtains the control coefficient and determines the target loss function from the control coefficient, the ranking loss function and the diversity loss function.
Again, this embodiment of the present application provides a way of determining the target loss function, namely combining the ranking loss function and the diversity loss function already designed. A target loss function designed this way both distinguishes matched video-sentence pairs from unmatched ones and, within a matched pair, strengthens the matching relationship between high-confidence spatiotemporal candidate regions and the sentence while weakening that of low-confidence candidate regions, thereby improving the reliability of model training and yielding a more accurate network model.
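Combining the two losses is then a one-liner; β = 0.5 follows the value suggested in the text, and the commented usage lines assume a hypothetical optimizer together with the loss sketches given earlier.

```python
import torch

def target_loss(l_rank: torch.Tensor, l_div: torch.Tensor,
                beta: float = 0.5) -> torch.Tensor:
    """L = L_rank + beta * L_div (beta = 0.5 as suggested in the text, tunable)."""
    return l_rank + beta * l_div

# Hypothetical usage inside one training step:
#   loss = target_loss(ranking_loss(s_vq, s_vq_neg, s_vneg_q),
#                      diversity_loss(proposal_scores))
#   loss.backward()
#   optimizer.step()
#   optimizer.zero_grad()
```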
下面对本申请中的视频序列选择装置进行详细描述,请参阅图7,图7为本申请实施例中视频序列选择装置一个实施例示意图,视频序列选择装置70包括:
获取模块701,用于接收待匹配视频以及待匹配文本,其中,所述待匹配视频包括多帧图像,所述待匹配文本包括至少一个词语,所述待匹配文本对应于待匹配文本特征序列;
生成模块702,用于调用时空候选区域生成器从所述待匹配视频中提取时空候选区域集合,其中,所述时空候选区域集合中包括N个时空候选区域,所述N为大于或等于1的整数,一个时空候选区域为一个视频序列;
编码模块703,用于通过卷积神经网络对所述时空候选区域集合中的每个时空候选区域进行特征提取,得到N个待匹配视频特征序列,其中,所述待匹配视频特征序列与所述时空候选区域具有对应关系;
所述获取模块701，还用于调用基于注意力的交互器获取所述每个时空候选区域对应的匹配分值，其中，所述交互器用于对所述待匹配视频特征序列与所述待匹配文本特征序列进行处理，所述匹配分值用于表示所述时空候选区域与所述待匹配文本之间的匹配关系；
选择模块704，用于根据所述交互器输出的所述每个时空候选区域对应的匹配分值，从所述时空候选区域集合中选择目标时空候选区域，输出所述目标时空候选区域。
本申请实施例提供了一种视频序列选择装置,首先接收待匹配视频以及待匹配文本,其中,待匹配视频包括多帧图像,待匹配文本包括至少一个词语,待匹配文本对应于待匹配文本特征序列,然后从待匹配视频中提取时空候选区域集合,接下来需要对时空候选区域集合中的每个时空候选区域进行特征提取,得到N个待匹配视频特征序列,其中,待匹配视频特征序列与时空候选区域具有对应关系,然后可以调用基于注意力的交互器获取每个时空候选区域对应的匹配分值,最后根据每个时空候选区域对应的匹配分值,从时空候选区域集合中选择目标时空候选区域,其中,一个时空候选区域为一个视频序列。通过上述方式,对视频中的时空候选区域和文本进行匹配,而不再是对视频中的每帧图像与文本进行匹配,这样操作的好处是,由于时空候选区域包括了图像在时间和空间上的关系,因此,在匹配的时候考虑到了视频与文本在时序上的关联性,即考虑了视频时序信息对视频序列以及文本的影响,从而提升了输出的视频序列与文本的匹配度,进而有利于更好地理解视频内容。
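The end-to-end selection flow of the apparatus can be summarized in a short sketch: score every spatiotemporal candidate region with the interactor and keep the best one. The scorer is passed in as a callable because the attention-based interactor is only sketched further below; all names and the dummy scorer are illustrative assumptions, not the patented modules.

```python
import torch

def select_target_proposal(proposal_features: list, score_fn) -> int:
    """Pick the spatiotemporal proposal whose interactor score is highest.

    proposal_features: one feature sequence of shape (t_p, d) per proposal.
    score_fn: callable mapping a feature sequence to a scalar matching score.
    Returns the index of the target proposal.
    """
    scores = torch.stack([score_fn(f) for f in proposal_features])
    return int(torch.argmax(scores).item())

if __name__ == "__main__":
    dummy = [torch.randn(8, 256) for _ in range(3)]            # 3 fake proposals
    best = select_target_proposal(dummy, lambda f: f.mean())   # stand-in scorer
    print("target proposal index:", best)
```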
可选地,在上述图7对应的实施例的基础上,本申请实施例提供的视频序列选择装置70的另一个可选实施例中,
所述生成模块702,用于调用所述时空候选区域生成器获取所述待匹配视频中每帧图像的候选区域以及置信度得分,其中,每个候选区域对应一个置信度得分;获取所述待匹配视频中相邻两帧图像之间的重合度;根据所述每帧图像的候选区域、所述置信度得分以及所述重合度,生成所述时空候选区域集合。
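A plain-Python sketch of what the generation module 702 might do is given below: per-frame boxes with confidence scores are greedily linked across adjacent frames using their overlap (IoU). The greedy criterion (confidence plus overlap with the previous box) and the 0.5 threshold are assumptions; the description only states that per-frame candidate regions, their confidence scores and the overlap between adjacent frames are combined.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-12)

def link_proposal(frames, iou_threshold=0.5):
    """Greedily link per-frame boxes into one spatiotemporal proposal (a tube).

    frames: list over time; each entry is a list of (box, confidence) pairs.
    Returns the linked list of (box, confidence) pairs, one per covered frame.
    """
    if not frames or not frames[0]:
        return []
    tube = [max(frames[0], key=lambda bc: bc[1])]        # most confident box in frame 0
    for candidates in frames[1:]:
        if not candidates:
            break
        prev_box = tube[-1][0]
        best = max(candidates, key=lambda bc: bc[1] + iou(prev_box, bc[0]))
        if iou(prev_box, best[0]) < iou_threshold:
            break                                        # tube ends when overlap drops
        tube.append(best)
    return tube
```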
可选地,在上述图7对应的实施例的基础上,本申请实施例提供的视频序列选择装置70的另一个可选实施例中,
所述获取模块701,用于对于所述每个时空候选区域,调用所述交互器的编码器对所述时空候选区域对应的待匹配视频特征序列进行编码,得到视觉特征集合,其中,所述视觉特征集合包括至少一个视觉特征;调用所述交互器的编码器对所述待匹配文本特征序列进行编码,得到文本特征集合,其中,所述文本特征集合包括至少一个文本特征;根据所述视觉特征集合以及所述文本特征集合,确定视觉文本特征集合,其中,所述视觉文本特征集合包括至少一个视觉文本特征,所述视觉文本特征表示基于视觉特征的文本特征;根据所述视觉文本特征集合以及所述视觉特征集合,确定所述时空候选区域对应的匹配分值。
可选地,在上述图7对应的实施例的基础上,本申请实施例提供的视频序列选择装置70的另一个可选实施例中,
The obtaining module 701 is configured to compute the visual feature set as follows:
$$h_t^{p}=\mathrm{LSTM}_p\!\left(f_t^{p},\,h_{t-1}^{p}\right),\qquad H_p=\left\{h_1^{p},\ldots,h_{t_p}^{p}\right\},\qquad t=1,\ldots,t_p;$$
where $H_p$ denotes the visual feature set, $h_t^{p}$ denotes the t-th visual feature in the visual feature set, $t_p$ denotes the number of time steps of the spatiotemporal candidate region, $h_{t-1}^{p}$ denotes the (t-1)-th visual feature in the visual feature set, $\mathrm{LSTM}_p(\cdot)$ denotes the first long short-term memory (LSTM) encoder, and $f_t^{p}$ denotes the t-th row of features in the video feature sequence to be matched;
and to compute the text feature set as follows:
$$h_t^{q}=\mathrm{LSTM}_q\!\left(f_t^{q},\,h_{t-1}^{q}\right),\qquad H_q=\left\{h_1^{q},\ldots,h_{t_q}^{q}\right\},\qquad t=1,\ldots,t_q;$$
where $H_q$ denotes the text feature set, $h_t^{q}$ denotes the t-th text feature in the text feature set, $t_q$ denotes the number of words in the text to be matched, $h_{t-1}^{q}$ denotes the (t-1)-th text feature in the text feature set, $\mathrm{LSTM}_q(\cdot)$ denotes the second LSTM encoder, and $f_t^{q}$ denotes the t-th row of features in the text feature sequence to be matched.
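A minimal PyTorch counterpart of the two encoders is sketched below. The feature dimensions (2048 for visual features, 300 for word features) and the hidden size are assumptions; the description only requires two separate LSTM encoders producing H_p and H_q.

```python
import torch
import torch.nn as nn

class SequenceEncoders(nn.Module):
    """Two independent LSTM encoders: one for the proposal's visual feature
    sequence, one for the query's word feature sequence."""

    def __init__(self, visual_dim: int = 2048, text_dim: int = 300, hidden_dim: int = 512):
        super().__init__()
        self.lstm_p = nn.LSTM(visual_dim, hidden_dim, batch_first=True)  # LSTM_p
        self.lstm_q = nn.LSTM(text_dim, hidden_dim, batch_first=True)    # LSTM_q

    def forward(self, f_p: torch.Tensor, f_q: torch.Tensor):
        # f_p: (1, t_p, visual_dim) video feature sequence to be matched
        # f_q: (1, t_q, text_dim)   text feature sequence to be matched
        h_p, _ = self.lstm_p(f_p)   # H_p: (1, t_p, hidden_dim)
        h_q, _ = self.lstm_q(f_q)   # H_q: (1, t_q, hidden_dim)
        return h_p, h_q
```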
可选地,在上述图7对应的实施例的基础上,本申请实施例提供的视频序列选择装置70的另一个可选实施例中,
所述获取模块701,用于调用所述交互器执行根据所述视觉特征集合以及所述文本特征集合,获取得到视觉特征对应文本特征的注意力权重;根据所述注意力权重,获取所述视觉特征对应所述文本特征的归一化注意力权重;根据所述归一化注意力权重以及所述文本特征,获取视觉文本特征集合。
可选地,在上述图7对应的实施例的基础上,本申请实施例提供的视频序列选择装置70的另一个可选实施例中,
The obtaining module 701 is configured to compute the attention weights as follows:
$$e_{i,j}=w^{T}\tanh\!\left(W_{q}h_{j}^{q}+W_{p}h_{i}^{p}+b_{1}\right)+b_{2};$$
where $e_{i,j}$ denotes the attention weight of the i-th visual feature with respect to the j-th text feature, $h_j^{q}$ denotes the j-th text feature, $h_i^{p}$ denotes the i-th visual feature, $w^{T}$ denotes a first model parameter, $W_q$ denotes a second model parameter, $W_p$ denotes a third model parameter, $b_1$ denotes a fourth model parameter, $b_2$ denotes a fifth model parameter, and $\tanh(\cdot)$ denotes the hyperbolic tangent function;
to compute the normalized attention weights as
$$a_{i,j}=\frac{\exp(e_{i,j})}{\sum_{k=1}^{t_q}\exp(e_{i,k})};$$
where $a_{i,j}$ denotes the normalized attention weight of the i-th visual feature with respect to the j-th text feature, $t_q$ denotes the number of words in the text to be matched, $k$ denotes the k-th word in the text to be matched, $k$ being an integer greater than or equal to 1 and less than or equal to $t_q$, and $\exp(\cdot)$ denotes the exponential function;
and to compute the visual-text feature set as
$$h_{i}^{qp}=\sum_{j=1}^{t_q}a_{i,j}\,h_{j}^{q},\qquad H_{qp}=\left\{h_{1}^{qp},\ldots,h_{t_p}^{qp}\right\};$$
where $H_{qp}$ denotes the visual-text feature set, $t_p$ denotes the number of time steps of the spatiotemporal candidate region, and $h^{qp}$ denotes a visual-text feature.
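The attention step can be sketched as a small PyTorch module. The way the five parameters are combined mirrors the additive-attention reconstruction above and should be read as an assumption; shapes and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualTextAttention(nn.Module):
    """Additive attention that aggregates text features for every visual
    time step, yielding the visual-text features H_qp."""

    def __init__(self, hidden_dim: int = 512):
        super().__init__()
        self.W_q = nn.Linear(hidden_dim, hidden_dim, bias=True)    # W_q and b_1
        self.W_p = nn.Linear(hidden_dim, hidden_dim, bias=False)   # W_p
        self.w = nn.Linear(hidden_dim, 1, bias=True)               # w^T and b_2

    def forward(self, h_p: torch.Tensor, h_q: torch.Tensor) -> torch.Tensor:
        # h_p: (t_p, d) visual features; h_q: (t_q, d) text features
        e = self.w(torch.tanh(self.W_q(h_q).unsqueeze(0)            # (1, t_q, d)
                              + self.W_p(h_p).unsqueeze(1)))        # (t_p, 1, d)
        e = e.squeeze(-1)                                           # e_{i,j}: (t_p, t_q)
        a = F.softmax(e, dim=-1)                                    # normalized weights a_{i,j}
        return a @ h_q                                              # H_qp: (t_p, d)
```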
可选地,在上述图7对应的实施例的基础上,本申请实施例提供的视频序列选择装置70的另一个可选实施例中,
The obtaining module 701 is configured to compute the matching score as follows:
$$s(q,p)=\sum_{i=1}^{t_p}s_i,\qquad s_i=\phi\!\left(h_i^{qp},\,h_i^{p}\right);$$
where $s(q,p)$ denotes the matching score of the spatiotemporal candidate region, $s_i$ denotes the matching sub-score between the visual feature and the visual-text feature at the i-th time step, $h_i^{qp}$ denotes the visual-text feature at the i-th time step, $h_i^{p}$ denotes the visual feature at the i-th time step, and $\phi(\cdot)$ denotes a similarity function.
下面对本申请中的模型训练装置进行详细描述,请参阅图8,图8为本申请实施例中模型训练装置一个实施例示意图,模型训练装置80包括:
获取模块801,用于获取第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,其中,所述第一待训练视频与所述第一待训练文本具 有匹配关系,且所述第一待训练视频与所述第二待训练文本不具有匹配关系,所述第二待训练视频与所述第二待训练文本具有匹配关系,且所述第二待训练视频与所述第一待训练文本不具有匹配关系;
确定模块802,用于根据所述获取模块801获取的所述第一待训练视频、所述第二待训练视频、所述第一待训练文本以及所述第二待训练文本,确定排列损失函数,其中,所述排列损失函数用于对所述第一待训练视频以及所述第二待训练文本进行处理,并对所述第二待训练视频以及所述第一待训练文本进行处理;
所述确定模块802,还用于根据所述获取模块801获取的所述第一待训练视频、所述第二待训练视频、所述第一待训练文本以及所述第二待训练文本,确定多样性损失函数,其中,所述多样性损失函数用于对所述第一待训练视频以及所述第一待训练文本进行处理,并对所述第二待训练视频以及所述第二待训练文本进行处理;
所述确定模块802,还用于根据所述排列损失函数以及所述多样性损失函数,确定目标损失函数;
训练模块803,用于采用所述确定模块802确定的所述目标损失函数对待训练的交互器进行训练,得到基于注意力的交互器,其中,所述交互器用于输出待匹配视频与待匹配文本的匹配分值。
本申请实施例中,提供了一种模型训练装置,首先获取第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,第一待训练视频与第一待训练文本匹配,第一待训练视频与第二待训练文本不匹配,第二待训练视频与第二待训练文本匹配,第二待训练视频与第一待训练文本不匹配。然后根据第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,确定排列损失函数以及多样性损失函数,最后,结合排列损失函数以及多样性损失函数对模型进行训练,得到基于注意力的交互器。通过上述方式,同时利用排列损失函数以及多样性损失函数训练模型,不仅能够提升文本与不同时空候选区域之间的匹配准确性,还可以提升文本与候选区域之间的匹配准确性,从而有利于提升模型训练的精度。
可选地,在上述图8对应的实施例的基础上,本申请实施例提供的模型训练装置80的另一个可选实施例中,
所述确定模块802,用于获取所述第一待训练视频中的第一时空候选区域集合,以及获取所述第二待训练视频中的第二时空候选区域集合,其中,所述第一时空候选区域集合包括至少一个第一时空候选区域,所述第一时空候选区域为视频序列,所述第二时空候选区域集合包括至少一个第二时空候选区域,所述第二时空候选区域为视频序列;根据所述第一待训练文本以及所述第二时空候选区域集合,计算第一匹配分值;根据所述第二待训练文本以及所述第一时空候选区域集合,计算第二匹配分值;根据所述第一待训练文本以及所述第一时空候选区域集合,计算第三匹配分值;根据所述第一匹配分值、所述第二匹配分值以及所述第三匹配分值,确定所述排列损失函数。
可选地,在上述图8对应的实施例的基础上,本申请实施例提供的模型训 练装置80的另一个可选实施例中,
所述确定模块802,用于根据第一时空候选区域集合以及所述第一待训练文本,确定匹配行为分布,其中,所述第一时空候选区域集合是根据所述第一待训练视频生成的,所述匹配行为分布表示所述第一时空候选区域集合中每个第一时空候选区域与所述第一待训练文本之间的匹配关系;对所述匹配行为分布进行归一化处理,得到目标匹配行为分布;根据所述目标匹配行为分布确定所述多样性损失函数。
可选地,在上述图8对应的实施例的基础上,本申请实施例提供的模型训练装置80的另一个可选实施例中,所述确定模块802,用于获取控制系数,根据所述控制系数、所述排列损失函数以及所述多样性损失函数,确定所述目标损失函数。
图9是本发明实施例提供的一种服务器结构示意图，该服务器900可因配置或性能不同而产生比较大的差异，可以包括一个或一个以上中央处理器(central processing units，CPU)922(例如，一个或一个以上处理器)和存储器932，一个或一个以上存储应用程序942或数据944的存储介质930(例如一个或一个以上海量存储设备)。其中，存储器932和存储介质930可以是短暂存储或持久存储。存储在存储介质930的程序可以包括一个或一个以上模块(图示没标出)，每个模块可以包括对服务器中的一系列指令操作。更进一步地，中央处理器922可以设置为与存储介质930通信，在服务器900上执行存储介质930中的一系列指令操作。
服务器900还可以包括一个或一个以上电源926,一个或一个以上有线或无线网络接口950,一个或一个以上输入输出接口958,和/或,一个或一个以上操作系统941,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。
上述实施例中由服务器所执行的步骤可以基于该图9所示的服务器结构。
本申请实施例中,服务器中的CPU 922用于执行上述实施例中提供的视频序列选择的方法或模型训练的方法。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者 也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。
Claims (13)
- 一种视频序列选择的方法,其特征在于,所述方法应用于计算机设备,包括:所述计算机设备接收待匹配视频以及待匹配文本,其中,所述待匹配视频包括多帧图像,所述待匹配文本包括至少一个词语,所述待匹配文本对应于待匹配文本特征序列;所述计算机设备调用时空候选区域生成器从所述待匹配视频中提取时空候选区域集合,其中,所述时空候选区域集合中包括N个时空候选区域,所述N为大于或等于1的整数,一个时空候选区域为一个视频序列;所述计算机设备通过卷积神经网络对所述时空候选区域集合中的每个时空候选区域进行特征提取,得到N个待匹配视频特征序列,其中,所述待匹配视频特征序列与所述时空候选区域具有对应关系;所述计算机设备调用基于注意力的交互器获取所述每个时空候选区域对应的匹配分值,其中,所述交互器用于对所述待匹配视频特征序列与所述待匹配文本特征序列进行处理,所述匹配分值用于表示所述时空候选区域与所述待匹配文本之间的匹配关系;所述计算机设备根据所述交互器输出的所述每个时空候选区域对应的匹配分值,从所述时空候选区域集合中选择目标时空候选区域,输出所述目标时空候选区域。
- 根据权利要求1所述的方法,其特征在于,所述计算机设备调用时空候选区域生成器从所述待匹配视频中提取时空候选区域集合,包括:所述计算机设备调用所述时空候选区域生成器获取所述待匹配视频中每帧图像的候选区域以及置信度得分,其中,每个候选区域对应一个置信度得分;所述计算机设备调用所述时空候选区域生成器获取所述待匹配视频中相邻两帧图像之间的重合度;所述计算机设备调用所述时空候选区域生成器执行根据所述每帧图像的候选区域、所述置信度得分以及所述重合度,生成所述时空候选区域集合。
- 根据权利要求1所述的方法,其特征在于,所述计算机设备调用基于注意力的交互器获取所述每个时空候选区域对应的匹配分值,包括:对于所述每个时空候选区域,所述计算机设备调用所述交互器的编码器对所述时空候选区域对应的待匹配视频特征序列进行编码,得到视觉特征集合,其中,所述视觉特征集合包括至少一个视觉特征;所述计算机设备调用所述交互器的编码器对所述待匹配文本特征序列进行编码,得到文本特征集合,其中,所述文本特征集合包括至少一个文本特征;所述计算机设备调用所述交互器执行根据所述视觉特征集合以及所述文本 特征集合,确定视觉文本特征集合,其中,所述视觉文本特征集合包括至少一个视觉文本特征,所述视觉文本特征表示基于视觉特征的文本特征;所述计算机设备调用所述交互器执行根据所述视觉文本特征集合以及所述视觉特征集合,确定所述时空候选区域对应的匹配分值。
- 根据权利要求3所述的方法,其特征在于,所述计算机设备调用所述交互器的编码器对所述时空候选区域对应的待匹配视频特征序列进行编码,得到视觉特征集合,包括:采用如下方式计算所述视觉特征集合:其中,所述H p表示所述视觉特征集合,所述 表示所述视觉特征集合中的第t个视觉特征,所述t p表示所述时空候选区域的时间步数,所述 表示所述视觉特征集合中的第(t-1)个视觉特征,所述LSTM p()表示第一长短期记忆网络LSTM编码器,所述f t p表示所述待匹配视频特征序列中的第t行特征;所述计算机设备调用所述交互器的编码器对所述待匹配文本特征序列进行编码,得到文本特征集合,包括:采用如下方式计算所述文本特征集合:
- 根据权利要求3所述的方法,其特征在于,所述计算机设备调用所述交互器执行根据所述视觉特征集合以及所述文本特征集合,确定视觉文本特征集合,包括:所述计算机设备调用所述交互器执行根据所述视觉特征集合以及所述文本特征集合,计算视觉特征对应文本特征的注意力权重;所述计算机设备调用所述交互器执行根据所述注意力权重,计算所述视觉特征对应所述文本特征的归一化注意力权重;所述计算机设备调用所述交互器执行根据所述归一化注意力权重以及所述文本特征,计算视觉文本特征集合。
- 根据权利要求5所述的方法,其特征在于,所述计算机设备调用所述交互器执行根据所述视觉特征集合以及所述文本特征集合,计算视觉特征对应文本特征的注意力权重,包括:采用如下方式计算所述注意力权重:其中,所述e i,j表示第i个视觉特征对应第j个文本特征的注意力权重,所述 表示所述第j个文本特征,所述 表示所述第i个视觉特征,所述w T表示第一模型参数,所述W q表示第二模型参数,所述W p表示第三模型参数,所述b 1表示第四模型参数,所述b 2表示第五模型参数,所述tanh()表示双曲正切函数;所述计算机设备调用所述交互器执行根据所述注意力权重,计算所述视觉特征对应所述文本特征的归一化注意力权重,包括:采用如下方式计算所述归一化注意力权重:其中,所述a i,j表示所述第i个视觉特征对应所述第j个文本特征的归一化注意力权重,所述t q表示所述待匹配文本的词语数量,所述k表示所述待匹配文本中的第k个词语,所述k为大于或等于1,且小于或等于所述t q的整数,所述exp()表示指数函数;所述计算机设备调用所述交互器执行根据所述归一化注意力权重以及所述文本特征,计算视觉文本特征集合,包括:采用如下方式计算所述视觉文本特征集合:其中,所述H qp表示所述视觉文本特征集合,所述t p表示所述时空候选区域的时间步数,所述h qp表示视觉文本特征。
- 根据权利要求1所述的方法,其特征在于,还包括:所述计算机设备获取第一待训练视频、第二待训练视频、第一待训练文本以及第二待训练文本,其中,所述第一待训练视频与所述第一待训练文本具有匹配关系,且所述第一待训练视频与所述第二待训练文本不具有匹配关系,所述第二待训练视频与所述第二待训练文本具有匹配关系,且所述第二待训练视频与所述第一待训练文本不具有匹配关系;所述计算机设备根据所述第一待训练视频、所述第二待训练视频、所述第一待训练文本以及所述第二待训练文本,确定排列损失函数,其中,所述排列损失函数用于对所述第一待训练视频以及所述第二待训练文本进行处理,并对所述第二待训练视频以及所述第一待训练文本进行处理;所述计算机设备根据所述第一待训练视频、所述第二待训练视频、所述第一待训练文本以及所述第二待训练文本,确定多样性损失函数,其中,所述多样性损失函数用于对所述第一待训练视频以及所述第一待训练文本进行处理,并对所述第二待训练视频以及所述第二待训练文本进行处理;所述计算机设备根据所述排列损失函数以及所述多样性损失函数,确定目标损失函数;所述计算机设备采用所述目标损失函数对待训练的交互器进行训练,得到所述交互器,其中,所述交互器用于输出待匹配视频与待匹配文本的匹配分值。
- 根据权利要求8所述的方法,其特征在于,所述计算机设备根据所述第一待训练视频、所述第二待训练视频、所述第一待训练文本以及所述第二待训练文本,确定排列损失函数,包括:所述计算机设备获取所述第一待训练视频中的第一时空候选区域集合,以及获取所述第二待训练视频中的第二时空候选区域集合,其中,所述第一时空候选区域集合包括至少一个第一时空候选区域,所述第一时空候选区域为视频序列,所述第二时空候选区域集合包括至少一个第二时空候选区域,所述第二时空候选区域为视频序列;所述计算机设备根据所述第一待训练文本以及所述第二时空候选区域集合,计算第一匹配分值;所述计算机设备根据所述第二待训练文本以及所述第一时空候选区域集合,计算第二匹配分值;所述计算机设备根据所述第一待训练文本以及所述第一时空候选区域集合,计算第三匹配分值;所述计算机设备根据所述第一匹配分值、所述第二匹配分值以及所述第三匹配分值,确定所述排列损失函数。
- 根据权利要求8所述的方法,其特征在于,所述计算机设备根据所述第一待训练视频、所述第二待训练视频、所述第一待训练文本以及所述第二待训练文本,确定多样性损失函数,包括:所述计算机设备根据第一时空候选区域集合以及所述第一待训练文本,确定匹配行为分布,其中,所述第一时空候选区域集合是根据所述第一待训练视频生成的,所述匹配行为分布表示所述第一时空候选区域集合中每个第一时空候选区域与所述第一待训练文本之间的匹配关系;所述计算机设备对所述匹配行为分布进行归一化处理,得到目标匹配行为分布;所述计算机设备根据所述目标匹配行为分布确定所述多样性损失函数。
- 根据权利要求8至10中任一项所述的方法,其特征在于,所述计算机设备根据所述排列损失函数以及所述多样性损失函数,确定目标损失函数,包括:所述计算机设备获取控制系数,根据所述控制系数、所述排列损失函数以及所述多样性损失函数,确定所述目标损失函数。
- 一种计算机设备,其特征在于,包括:存储器、收发器、处理器以及总线系统;其中,所述存储器用于存储程序;所述处理器用于执行所述存储器中的程序,包括如下步骤:接收待匹配视频以及待匹配文本,其中,所述待匹配视频包括多帧图像,所述待匹配文本包括至少一个词语,所述待匹配文本对应于待匹配文本特征序列;调用时空候选区域生成器从所述待匹配视频中提取时空候选区域集合,其中,所述时空候选区域集合中包括N个时空候选区域,所述N为大于或等于1的整数,一个时空候选区域为一个视频序列;通过卷积神经网络对所述时空候选区域集合中的每个时空候选区域进行特征提取,得到N个待匹配视频特征序列,其中,所述待匹配视频特征序列与所述时空候选区域具有对应关系;调用基于注意力的交互器获取所述每个时空候选区域对应的匹配分值,其中,所述交互器用于对所述待匹配视频特征序列与所述待匹配文本特征序列进行处理,所述匹配分值用于表示所述时空候选区域与所述待匹配文本之间的匹配关系;根据所述交互器输出的所述每个时空候选区域对应的匹配分值,从所述时空候选区域集合中选择目标时空候选区域,输出所述目标时空候选区域;所述总线系统用于连接所述存储器以及所述处理器,以使所述存储器以及所述处理器进行通信。
- 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有指令,当其在计算机设备上运行时,使得计算机设备执行如权利要求1至11中任一项所述的视频序列选择的方法。
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP20767030.8A EP3937072A4 (en) | 2019-03-05 | 2020-03-02 | VIDEO SEQUENCE SELECTION METHOD, COMPUTER DEVICE AND STORAGE MEDIUM |
US17/225,969 US12008810B2 (en) | 2019-03-05 | 2021-04-08 | Video sequence selection method, computer device, and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910165102.1 | 2019-03-05 | ||
CN201910165102.1A CN109919078B (zh) | 2019-03-05 | 2019-03-05 | 一种视频序列选择的方法、模型训练的方法及装置 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US17/225,969 Continuation US12008810B2 (en) | 2019-03-05 | 2021-04-08 | Video sequence selection method, computer device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020177673A1 true WO2020177673A1 (zh) | 2020-09-10 |
Family
ID=66963309
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/077481 WO2020177673A1 (zh) | 2019-03-05 | 2020-03-02 | 一种视频序列选择的方法、计算机设备及存储介质 |
Country Status (4)
Country | Link |
---|---|
US (1) | US12008810B2 (zh) |
EP (1) | EP3937072A4 (zh) |
CN (1) | CN109919078B (zh) |
WO (1) | WO2020177673A1 (zh) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11298062B2 (en) * | 2017-02-01 | 2022-04-12 | Conflu3Nce Ltd | Multi-purpose interactive cognitive platform |
US11158060B2 (en) | 2017-02-01 | 2021-10-26 | Conflu3Nce Ltd | System and method for creating an image and/or automatically interpreting images |
US11176675B2 (en) | 2017-02-01 | 2021-11-16 | Conflu3Nce Ltd | System and method for creating an image and/or automatically interpreting images |
CN109919078B (zh) | 2019-03-05 | 2024-08-09 | 腾讯科技(深圳)有限公司 | 一种视频序列选择的方法、模型训练的方法及装置 |
US20200394289A1 (en) * | 2019-06-14 | 2020-12-17 | Microsoft Technology Licensing, Llc | Biometric verification framework that utilizes a convolutional neural network for feature matching |
US11829150B2 (en) * | 2020-06-10 | 2023-11-28 | Toyota Research Institute, Inc. | Systems and methods for using a joint feature space to identify driving behaviors |
CN112199994B (zh) * | 2020-09-03 | 2023-05-12 | 中国科学院信息工程研究所 | 一种实时检测rgb视频中的3d手与未知物体交互的方法和装置 |
KR102591314B1 (ko) * | 2021-02-15 | 2023-10-20 | 한국전자통신연구원 | 비디오 의미 구간 검출 장치 및 이를 이용한 방법 |
CN113128431B (zh) * | 2021-04-25 | 2022-08-05 | 北京亮亮视野科技有限公司 | 视频片段检索方法、装置、介质与电子设备 |
CN115114480A (zh) * | 2022-04-26 | 2022-09-27 | 腾讯科技(深圳)有限公司 | 数据处理方法、装置、设备、可读存储介质及程序产品 |
CN115223086B (zh) * | 2022-09-20 | 2022-12-06 | 之江实验室 | 基于交互注意力引导与修正的跨模态动作定位方法与系统 |
CN115249062B (zh) * | 2022-09-22 | 2023-02-03 | 武汉大学 | 一种文本生成视频的网络模型、方法及装置 |
CN115661727B (zh) * | 2022-12-27 | 2023-04-28 | 苏州浪潮智能科技有限公司 | 视频的行为定位方法、装置、电子设备及存储介质 |
Family Cites Families (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070065840A1 (en) * | 2005-03-23 | 2007-03-22 | Irena Naguibneva | Novel oligonucleotide compositions and probe sequences useful for detection and analysis of microRNAS and their target mRNAS |
US8294763B2 (en) * | 2007-12-14 | 2012-10-23 | Sri International | Method for building and extracting entity networks from video |
US20090326049A1 (en) * | 2008-04-04 | 2009-12-31 | Alexander Aristarkhov | Blocking oligos for inhibition of microrna and sirna activity and uses thereof |
US20090296989A1 (en) * | 2008-06-03 | 2009-12-03 | Siemens Corporate Research, Inc. | Method for Automatic Detection and Tracking of Multiple Objects |
WO2014093935A1 (en) * | 2012-12-16 | 2014-06-19 | Cloud 9 Llc | Vital text analytics system for the enhancement of requirements engineering documents and other documents |
JP6483717B2 (ja) * | 2014-02-25 | 2019-03-13 | セント・ジュード・メディカル,カーディオロジー・ディヴィジョン,インコーポレイテッド | 不整脈源の分類のために電気生理学的性質を利用するためのシステムおよび方法 |
US9621917B2 (en) * | 2014-03-10 | 2017-04-11 | Euclid Discoveries, Llc | Continuous block tracking for temporal prediction in video encoding |
IL239191A0 (en) * | 2015-06-03 | 2015-11-30 | Amir B Geva | Image sorting system |
US9779774B1 (en) * | 2016-07-22 | 2017-10-03 | Microsoft Technology Licensing, Llc | Generating semantically meaningful video loops in a cinemagraph |
CN106529996A (zh) * | 2016-10-24 | 2017-03-22 | 北京百度网讯科技有限公司 | 基于深度学习的广告展示方法和装置 |
WO2018176017A1 (en) * | 2017-03-24 | 2018-09-27 | Revealit Corporation | Method, system, and apparatus for identifying and revealing selected objects from video |
US20210079366A1 (en) * | 2017-12-22 | 2021-03-18 | The Broad Institute, Inc. | Cas12a systems, methods, and compositions for targeted rna base editing |
US10963750B2 (en) * | 2018-01-04 | 2021-03-30 | IAS Machine, LLC | Procedural language and content generation environment for use in augmented reality/mixed reality systems to support laboratory and related operations |
CN108932304B (zh) * | 2018-06-12 | 2019-06-18 | 山东大学 | 基于跨模态的视频时刻定位方法、系统及存储介质 |
CN109325148A (zh) * | 2018-08-03 | 2019-02-12 | 百度在线网络技术(北京)有限公司 | 生成信息的方法和装置 |
CN108986186B (zh) * | 2018-08-14 | 2023-05-05 | 山东师范大学 | 文字转化视频的方法和系统 |
- 2019-03-05 CN CN201910165102.1A patent/CN109919078B/zh active Active
- 2020-03-02 WO PCT/CN2020/077481 patent/WO2020177673A1/zh unknown
- 2020-03-02 EP EP20767030.8A patent/EP3937072A4/en active Pending
- 2021-04-08 US US17/225,969 patent/US12008810B2/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102740127A (zh) * | 2011-03-29 | 2012-10-17 | 索尼公司 | 方法、装置和系统 |
CN102427507A (zh) * | 2011-09-30 | 2012-04-25 | 北京航空航天大学 | 一种基于事件模型的足球视频集锦自动合成方法 |
CN108229285A (zh) * | 2017-05-27 | 2018-06-29 | 北京市商汤科技开发有限公司 | 物体分类方法、物体分类器的训练方法、装置和电子设备 |
CN109919078A (zh) * | 2019-03-05 | 2019-06-21 | 腾讯科技(深圳)有限公司 | 一种视频序列选择的方法、模型训练的方法及装置 |
Non-Patent Citations (1)
Title |
---|
See also references of EP3937072A4 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113641854A (zh) * | 2021-07-28 | 2021-11-12 | 上海影谱科技有限公司 | 一种将文字转化为视频的方法及系统 |
CN113641854B (zh) * | 2021-07-28 | 2023-09-26 | 上海影谱科技有限公司 | 一种将文字转化为视频的方法及系统 |
CN113569755A (zh) * | 2021-07-29 | 2021-10-29 | 西安交通大学 | 基于对偶关系网络的时序动作定位方法、系统、设备及介质 |
CN113569755B (zh) * | 2021-07-29 | 2023-08-22 | 西安交通大学 | 基于对偶关系网络的时序动作定位方法、系统、设备及介质 |
Also Published As
Publication number | Publication date |
---|---|
US12008810B2 (en) | 2024-06-11 |
CN109919078B (zh) | 2024-08-09 |
CN109919078A (zh) | 2019-06-21 |
EP3937072A1 (en) | 2022-01-12 |
EP3937072A4 (en) | 2022-04-20 |
US20210224601A1 (en) | 2021-07-22 |
Legal Events

Code | Title | Description
---|---|---
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20767030; Country of ref document: EP; Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
ENP | Entry into the national phase | Ref document number: 2020767030; Country of ref document: EP; Effective date: 2021-10-05