WO2022022152A1 - Video segment positioning method and apparatus, computer device, and storage medium - Google Patents

Video segment positioning method and apparatus, computer device, and storage medium

Info

Publication number
WO2022022152A1
WO2022022152A1 (PCT/CN2021/100860)
Authority
WO
WIPO (PCT)
Prior art keywords
video
segment
features
feature
target
Prior art date
Application number
PCT/CN2021/100860
Other languages
English (en)
French (fr)
Inventor
王景文
宋怡君
马林
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2022022152A1
Priority to US 17/949,984 (published as US20230024382A1)

Classifications

    • G06V — Image or video recognition or understanding:
      • G06V 20/40, 20/46, 20/49 — Scenes and scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames; segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
      • G06V 10/70, 10/74, 10/77, 10/806 — Recognition or understanding using pattern recognition or machine learning; image or video pattern matching and proximity measures in feature spaces; processing image or video features in feature spaces, e.g. PCA, ICA or SOM; fusion of extracted features
    • G06F — Electric digital data processing:
      • G06F 18/21, 18/22, 18/25, 18/253 — Pattern recognition: design or setup of recognition systems or techniques and extraction of features in feature space; matching criteria, e.g. proximity measures; fusion techniques, including fusion of extracted features
    • G06N — Computing arrangements based on specific computational models (neural networks):
      • G06N 3/044, 3/0442, 3/045, 3/0455, 3/0464, 3/084 — Recurrent networks, including those characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]; combinations of networks and auto-encoder / encoder-decoder networks; convolutional networks [CNN, ConvNet]; learning by backpropagation, e.g. using gradient descent
    • G11B — Information storage based on relative movement between record carrier and transducer:
      • G11B 27/031, 27/102, 27/34 — Electronic editing of digitised analogue information signals, e.g. audio or video signals; programmed access in sequence to addressed parts of tracks of operating record carriers; indicating arrangements
    • H04N — Pictorial communication, e.g. television:
      • H04N 21/845, 21/8455, 21/8456 — Selective content distribution, e.g. VOD: structuring of content into time segments, including pointers to the content (e.g. to the I-frames of the video stream) and decomposition of the content in the time domain

Definitions

  • the present application relates to the technical field of video processing, and in particular, to video segment positioning.
  • the video recognition model extracts the frame features of each video frame in the video and the text features of the text information.
  • the frame features and the text features are used to match each video frame against the text information, so as to determine the degree of matching between each video frame and the text information, and then locate the video segment in the video that best matches the text information.
  • Embodiments of the present application provide a video segment location method, apparatus, computer device, and storage medium, which can improve the accuracy of video segment location results.
  • the technical solution is as follows:
  • a video segment positioning method comprising:
  • a first attention weight of the at least two video segments is obtained, where the first attention weight is used to indicate a degree of matching between the video segment and the target text;
  • a video clip whose matching degree with the target text satisfies the reference condition is obtained from the at least two video clips as a target video clip in the video related to the target text.
  • a video segment positioning device comprising:
  • a first acquisition module configured to perform feature extraction on video units included in at least two video clips in the video to obtain unit features of the video units
  • a second acquiring module configured to acquire segment features of the at least two video segments based on unit features of video units included in the at least two video segments;
  • a feature fusion module which is used for feature fusion of the segment features of the at least two video segments and the text features of the target text respectively to obtain the fused segment features of the at least two video segments;
  • a third obtaining module configured to obtain a first attention weight of the at least two video clips based on the fused clip features of the at least two video clips, where the first attention weight is used to indicate the degree of matching between the video clip and the target text;
  • a fourth acquisition module configured to, according to the first attention weight, acquire, from the at least two video clips, a video clip whose matching degree with the target text satisfies the reference condition, as the target video clip in the video associated with the target text.
  • a computer device comprising one or more processors and one or more memories, the one or more memories storing at least one piece of program code, the at least one piece of program code being loaded and executed by the one or more processors to implement the operations performed by the video segment positioning method.
  • a computer-readable storage medium is provided, and a computer program is stored in the computer-readable storage medium, and the computer program is used to execute the video segment positioning method of the above aspect.
  • a computer program product comprising at least one piece of program code stored in a computer-readable storage medium.
  • the processor of the computer device reads the at least one piece of program code from the computer-readable storage medium, and the processor executes the at least one piece of program code, so that the computer device implements the operations performed by the video segment positioning method.
  • In the embodiments of the present application, unit features are obtained at the video-unit dimension and the segment features of the video segments are determined from these unit features, so that the obtained segment features integrate the features of multiple video units as well as the temporal association between the video units. The segment features of the video segments are then fused with the text features of the target text; because the feature fusion makes full use of the segment-dimension features and the temporal correlation between video segments, a more accurate attention weight can be obtained from the fused features. The attention weight represents the degree of matching between a video segment and the target text, so when a video segment is located based on the attention weight, the target video segment that matches the target text can be located more accurately.
  • FIG. 1 is a schematic diagram of an implementation environment of a method for locating a video segment provided by an embodiment of the present application
  • FIG. 2 is a flowchart of a method for locating a video segment provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a video segment and a video unit provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a video recognition model provided by an embodiment of the present application.
  • FIG. 5 is a specific flowchart of a method for locating a video segment provided by an embodiment of the present application
  • FIG. 6 is a schematic diagram of a sampling method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a method for acquiring segment features provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a first attention weight adjustment method provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a display mode of a target video segment provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another display mode of a target video segment provided by an embodiment of the present application.
  • FIG. 11 is a flowchart of a video recognition model training method provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a data processing process of a video recognition model provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a device for locating a video segment provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes several major directions such as computer vision technology, speech processing technology, natural language processing technology and machine learning. This application relates to computer vision technology in artificial intelligence technology.
  • Video recognition models are used to perform semantic understanding of videos: based on a text description, video clips that match the description can be accurately located in a video without requiring users to manually screen a large number of videos.
  • FIG. 1 is a schematic diagram of an implementation environment of a method for locating a video segment provided by an embodiment of the present application.
  • the implementation environment includes: a terminal 110 and a video recognition platform 140 .
  • the terminal 110 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc., but is not limited thereto.
  • the terminal 110 installs and runs an application program that supports video recognition and video segment positioning.
  • the application may be a video retrieval application or the like.
  • the terminal 110 is a terminal used by a user, and an application program running in the terminal 110 is logged in with a user account.
  • the terminal 110 may generally refer to one of multiple terminals, and this embodiment only takes the terminal 110 as an example for illustration.
  • the video recognition platform 140 is used to provide background services for applications that support the location of video clips.
  • the video recognition platform 140 undertakes the main video recognition work and the terminal 110 undertakes the secondary video recognition work; or the video recognition platform 140 undertakes the secondary video recognition work and the terminal 110 undertakes the main video recognition work; or the video recognition platform 140 or the terminal 110 independently undertakes the video recognition work.
  • the video recognition platform 140 includes: an access server, a video recognition server and a database.
  • the access server is used to provide access services for the terminal 110 .
  • the video recognition server is used to provide background services related to video recognition and video clip positioning. There can be one or more video recognition servers.
  • a video recognition model may be set in the video recognition server, and the video recognition server provides support for the training and application process of the model.
  • the above server may be an independent physical server, a server cluster or a distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
  • the foregoing terminal 110 and the video recognition platform 140 may be directly or indirectly connected through wired or wireless communication, which is not limited in this embodiment of the present application.
  • the number of the above-mentioned terminals may be more or less.
  • the above-mentioned terminal may be only one, or the above-mentioned terminal may be dozens or hundreds, or more.
  • the embodiments of the present application do not limit the number of terminals and device types.
  • the embodiment of the present application provides a video segment location method based on weakly supervised learning, which locates a video segment through a description in a natural language.
  • the technical solutions provided in this application can be applied to various types of application programs and combined with various application scenarios. For example, in a video application, when a user searches for a certain video clip, the user can provide a piece of text information describing the video clip and send it to the server corresponding to the application; the server then determines, based on the text feature of the text information and the segment features of each video segment, the target video segment that matches the text information, without requiring the user to manually filter a large number of videos.
  • In this way, the video clips that the user is interested in can be located quickly and accurately; since features of the video-segment dimension are used to locate the video clips, the association between the video segments can be fused during the operation process, improving the efficiency of video clip positioning.
  • FIG. 2 is a flowchart of a method for locating a video segment provided by an embodiment of the present application. This method can be applied to the above-mentioned implementation environment.
  • the server is used as the execution body to introduce the video segment positioning method. Referring to FIG. 2 , this embodiment may specifically include the following steps:
  • the server performs feature extraction on video units included in at least two video segments in the video to obtain unit features of the video units.
  • the video may be a video stored in the server, or may be a video obtained by the server from other devices, and the embodiment of the present application does not limit which kind of video is specifically used.
  • a segment of a unit duration in a video may be used as a video unit, the video includes multiple consecutive video units, and each video unit includes multiple video frames.
  • the unit duration may be set by a developer, which is not limited in this embodiment of the present application. For example, if the unit duration is set to 1 second, a segment of every 1 second in the video can be used as a video unit.
  • FIG. 3 is a schematic diagram of a video segment and a video unit provided by an embodiment of the present application.
  • a video 301 includes a plurality of continuous video units, for example, video units 302, 303, 304, 305 and 306, wherein video segment 307 includes video units 302, 303 and 304, and video segment 308 includes video units 304 and 305.
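  • As an illustration of this unit/segment structure (not part of the patent text), the following Python sketch enumerates candidate video segments as (start, duration) pairs over a video of consecutive one-second units; the limit on segment duration is an assumption for the example.

```python
# Illustrative sketch: enumerate candidate video segments as
# (start_unit, duration_in_units) pairs, mirroring the unit/segment
# relationship shown in FIG. 3.
def enumerate_segments(num_units: int, max_duration: int):
    """Return every candidate segment as (start, duration) in video units."""
    segments = []
    for start in range(num_units):
        for duration in range(1, max_duration + 1):
            if start + duration <= num_units:
                segments.append((start, duration))
    return segments

# Example: a 5-unit video (units 302..306 in FIG. 3), segments up to 3 units long.
print(enumerate_segments(num_units=5, max_duration=3))
```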
  • the server may perform feature extraction on the video through a three-dimensional convolutional layer in response to a video segment positioning instruction of the video to obtain unit features of each video unit.
  • the computer device may also acquire the unit characteristics of each video unit by other methods, which are not limited in this embodiment of the present application.
  • acquiring features at the video-unit dimension can reduce data redundancy and the amount of data of the acquired features, so that the amount of data in the subsequent operation process is reduced and the operation complexity is lowered.
  • the server acquires segment features of at least two video segments based on unit features of video units included in the at least two video segments.
  • the segment feature may be used to represent the color feature, texture feature, etc. of the video frame image in the video segment, and may also include the time-series correlation between each video frame. Different video segments correspond to different segment features.
  • the server determines the initial segment feature of each video segment based on the video units included in the video segment and the unit features of these video units, then samples the initial segment feature of each video segment, and determines the features extracted during the sampling process as the segment feature of the video segment.
  • the subsequent video segment location step is performed based on the feature of the video segment dimension, so that the time sequence correlation between the video segments can be fused in the operation process, thereby improving the accuracy of the video segment location result.
  • the server performs feature fusion of the segment features of the at least two video segments with the text features of the target text, respectively, to obtain the fused segment features of the at least two video segments.
  • the target text is used to describe a video segment, the target text may be provided by the user, and the specific content of the target text is not limited in this embodiment of the present application.
  • the server may perform feature extraction on the target text to obtain text features of the target text.
  • the embodiments of the present application do not limit the specific method of text feature extraction.
  • the server can perform cross-modal feature fusion of each segment feature and the text features respectively to obtain the segment fusion features of each video clip.
  • the acquired segment fusion feature fully integrates the features of the two modalities and has a better representation effect; applying the segment fusion feature for subsequent video segment location can improve the accuracy of the video segment location result.
  • the server obtains, based on the fused segment features of the at least two video segments, a first attention weight of the at least two video segments, where the first attention weight is used to indicate a degree of matching between the video segment and the target text.
  • the server performs a convolution operation on the segment fusion features of each video segment through at least one convolution layer to obtain the first attention weight of each video segment.
  • the first attention weight is positively correlated with the matching degree between the video segment and the target text, that is, a higher attention weight is assigned to the video segment with a high degree of matching with the target text.
  • the server obtains, from at least two video segments, a video segment whose matching degree with the target text satisfies the reference condition according to the first attention weight, as a target video segment in the video related to the target text.
  • the reference condition may be set by a developer, which is not limited in this embodiment of the present application.
  • the reference condition may be set to take the video segment with the highest attention weight as the target video segment.
  • In the embodiments of the present application, unit features are obtained at the video-unit dimension and the segment features of the video segments are determined from these unit features, so that the obtained segment features integrate the features of multiple video units as well as the temporal association between the video units. The segment features of the video segments are then fused with the text features of the target text; because the feature fusion makes full use of the segment-dimension features and the temporal correlation between video segments, a more accurate attention weight can be obtained from the fused features. The attention weight represents the degree of matching between a video segment and the target text, so when a video segment is located based on the attention weight, the target video segment that matches the target text can be located more accurately.
  • the server is equipped with a video recognition model, and the video recognition model is used to provide a video segment positioning function.
  • the server can call the video recognition model to execute. each step in the above embodiment.
  • FIG. 4 is a schematic structural diagram of a video recognition model provided by an embodiment of the present application.
  • the video recognition model may be a model constructed based on a deep neural network.
  • the deep neural network may be an RNN (Recurrent Neural Network), a CNN (Convolutional Neural Network), or the like. As shown in FIG. 4, the video recognition model may include a feature extraction unit 401, a sampling unit 402, a three-dimensional convolution layer 403, a feature fusion unit 404 and at least one two-dimensional convolution layer 405.
  • the feature extraction unit 401 may be composed of at least one three-dimensional convolution layer and at least one one-dimensional convolution layer, and extracts the features of each video unit in the video by performing at least one convolution operation on the digital matrix corresponding to the video;
  • the sampling unit 402 can perform feature sampling based on the video units included in each video segment and the unit features of each video unit; the three-dimensional convolution layer 403 performs a convolution operation on the output result of the sampling unit to obtain the segment features of each video segment;
  • the feature fusion unit 404 is used to fuse the segment features of the video segments with the text feature of the target text; the at least one two-dimensional convolution layer 405 obtains the attention weight of each video segment by performing at least one convolution operation on the fused features. It should be noted that the embodiments of the present application do not limit the specific numbers and connection methods of the feature extraction units, sampling units, three-dimensional convolutional layers, feature fusion units and two-dimensional convolutional layers in the video recognition model.
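  • For orientation, the following PyTorch skeleton mirrors the components of FIG. 4; the sub-module bodies are placeholders and the wiring shown is only one possible arrangement, since the embodiments do not fix the numbers or connections of these components.

```python
import torch.nn as nn

# Hedged sketch of the overall structure in FIG. 4: the attributes mirror the
# feature extraction unit 401, sampling unit 402, 3D convolution layer 403,
# feature fusion unit 404 and 2D convolution layers 405. The concrete modules
# are supplied by the caller; the data flow below is an illustrative assumption.
class VideoRecognitionModel(nn.Module):
    def __init__(self, feature_extractor, sampler, conv3d, fusion, attention_convs):
        super().__init__()
        self.feature_extractor = feature_extractor   # unit 401: 3D + 1D convolutions
        self.sampler = sampler                       # unit 402: per-segment sampling
        self.conv3d = conv3d                         # layer 403: segment features
        self.fusion = fusion                         # unit 404: cross-modal fusion
        self.attention_convs = attention_convs       # layers 405: attention weights

    def forward(self, video_frames, text_feature):
        unit_feats = self.feature_extractor(video_frames)
        sampled = self.sampler(unit_feats)
        segment_feats = self.conv3d(sampled)
        fused = self.fusion(segment_feats, text_feature)
        return self.attention_convs(fused)           # first attention weights
```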
  • FIG. 5 is a specific flowchart of a method for locating a video clip provided by an embodiment of the present application. The following describes the method for locating a video clip with reference to FIG. 4 and FIG. 5 , taking the server as the execution subject:
  • the server performs feature extraction on video units in the video to obtain unit features of the video units.
  • the server receives the video segment location request sent by the terminal, invokes the video recognition model, and extracts the unit features of each video unit through the feature extraction unit in the video recognition model.
  • the terminal may be a terminal used by any user, and the user may send a video clip location request to the server through the terminal to query video clips of interest. It should be noted that the specific triggering manner of the video segment location request is not limited in this embodiment of the present application.
  • the video segment location request may include target text for describing a video segment and a video identifier, where the video identifier is used to uniquely indicate a video; in response to the video segment location request, the server can obtain the video indicated by the video identifier and perform the subsequent video segment positioning steps based on the video and the target text.
  • the video segment location request may include target text.
  • in response to the video segment location request, the server may first obtain at least one video that matches the target text, and perform the subsequent video segment localization steps based on the at least one video and the target text.
  • the specific information included in the video segment location request is not limited in this embodiment of the present application. In the embodiments of the present application, only one video is taken as an example for description.
  • the process of acquiring unit features is described by taking the feature extraction unit of the video recognition model including a three-dimensional convolution layer and a one-dimensional convolution layer as an example.
  • the server converts each video frame in the video into a digital matrix composed of a set of pixel values.
  • the server can also perform size transformation and noise reduction processing on each video frame, which is not limited in the embodiments of the present application.
  • the server inputs the digital matrix corresponding to each video frame into the video recognition model, and firstly, the three-dimensional convolution layer in the feature extraction unit performs convolution operation on the digital matrix corresponding to each video frame to obtain the initial unit feature of each video unit.
  • the dimensionality reduction process is performed on the initial unit feature through a one-dimensional convolution layer to obtain the unit feature of the video unit.
  • For example, if a video unit includes 25 video frames, the convolution kernel of the three-dimensional convolution layer performs a convolution operation on the digital matrices corresponding to these 25 video frames to obtain the initial unit feature of that video unit, and the dimension-reduced feature output by the one-dimensional convolution layer is the unit feature of the video unit.
  • the one-dimensional convolution process can be expressed as the following formula (1):
  • F′_v = Conv1d(F_v)  (1)
  • where F_v denotes the initial unit features and F′_v the resulting unit feature sequence, which includes T unit features (one per video unit); each element in F′_v is a dimension-reduced unit feature, and r represents the attenuation multiple of the dimension, i.e. the dimension of each unit feature is reduced by a factor of r. Conv1d() represents a one-dimensional convolution operation, and the size of the convolution kernel applied by the one-dimensional convolution operation can be set by the developer; for example, the size of the convolution kernel can be set to 3 to capture the temporal correlation information at the video-unit dimension.
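  • The following PyTorch sketch mirrors this feature-extraction unit: a 3D convolution summarises the frames of each one-second video unit into an initial unit feature, and a 1D convolution with kernel size 3 reduces the dimension by an attenuation factor r, as in formula (1). The assumption of 25 frames per unit follows the example above; all layer sizes and the pooling used to obtain per-unit features are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnitFeatureExtractor(nn.Module):
    """Hedged sketch of the feature extraction unit (3D conv + 1D conv)."""
    def __init__(self, dim: int = 512, r: int = 4):
        super().__init__()
        # 3D convolution over the frames of the video units
        self.conv3d = nn.Conv3d(3, dim, kernel_size=(3, 3, 3), padding=1)
        # F'_v = Conv1d(F_v): reduce channel dimension by a factor of r, kernel size 3
        self.conv1d = nn.Conv1d(dim, dim // r, kernel_size=3, padding=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, T_units * 25, H, W), assuming 25 frames per video unit
        x = self.conv3d(frames)                              # (batch, dim, T*25, H, W)
        x = x.mean(dim=(-2, -1))                             # spatial pooling
        x = x.view(x.size(0), x.size(1), -1, 25).mean(-1)    # pool frames per unit -> (batch, dim, T)
        return self.conv1d(x)                                # unit features F'_v, (batch, dim // r, T)
```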
  • the server determines initial segment characteristics of the at least two video segments based on unit characteristics of video units included in the at least two video segments.
  • the server may obtain the unit features of the video units included in a video clip and splice these unit features based on the time sequence of the video units, for example by sequentially concatenating the unit features; the spliced unit features are used as the initial segment feature of the video segment. It should be noted that the above description of the method for obtaining the initial segment feature is only an exemplary description, and the embodiment of the present application does not limit which method is specifically used to obtain the initial segment feature.
  • the server samples initial segment features of at least two video segments to obtain segment features of the at least two video segments.
  • the server determines the sampling moments corresponding to a video clip based on the duration of the video clip; for a video clip, the server may sample the initial segment feature of the video clip based on the corresponding sampling moments to obtain the segment feature of the video clip.
  • the number of sampling moments corresponding to each video clip is the same, and this number can be set by the developer, which is not limited in this embodiment of the present application. By sampling based on the same number of sampling moments, video clips of different durations can be sampled to a fixed length, so that each video clip corresponds to features of the same dimension, which is convenient for the subsequent operations of the video recognition model.
  • FIG. 6 is a schematic diagram of a sampling method provided by an embodiment of the present application. In conjunction with FIG. 6, the sampling of the initial segment feature of one video segment is described as an example.
  • the start time of the video segment 601 in the video is the second second, and the duration is 3 seconds.
  • the initial segment feature 602 of the video segment includes unit features 603, 604 and 605,
  • the video segment may correspond to two sampling instants, for example, sampling instant 606 and sampling instant 607, respectively.
  • the sampling moment 606 falls between two video units, so the unit feature 603 and the unit feature 604 need to be weighted to obtain the sampling feature, with the weights of the two unit features summing to 1. For example, the server can add the elements at the same position in the two unit features and take the average to obtain the sampled feature. As shown in (b) of FIG. 6, the weight corresponding to the unit feature 609 is 1 − dec(t_n), where dec() denotes taking the decimal (fractional) part and t_n denotes the sampling moment; that is, the weight corresponding to the unit feature 609 is 0.7, and the weight corresponding to the unit feature 610 is dec(t_n), i.e. 0.3. The server multiplies the unit features 609 and 610 by their respective weights and adds the two weighted features to obtain the sampling feature.
  • the server performs sampling by constructing a sampling matrix.
  • the server may construct a sampling matrix based on the sampling moments corresponding to the at least two video clips and the position information of the at least two video clips in the video; the sampling matrix is multiplied by the initial segment features of the at least two video clips to obtain a sampling feature matrix, where one feature in the sampling feature matrix represents the sampling feature of one video segment.
  • the above sampling process can be expressed as the following formula (2), and each element in the sampling matrix can be determined based on the following formula (3):
  • F″_v = W_1 · F′_v  (2)
  • W_1(n, u) = 1 − dec(t_n) if u = ⌊t_n⌋; dec(t_n) if u = ⌊t_n⌋ + 1; 0 otherwise  (3)
  • where W_1 is the sampling matrix, t_n denotes the n-th sampling moment and u indexes the video units. Based on the position of each video segment in the video, the sampling matrix W_1 determines the unit features included in the video segment, i.e. it selects the initial unit features of the corresponding video units, and samples the initial segment feature of each video segment to obtain the sampling feature matrix F″_v.
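  • A minimal sketch of this sampling, under the assumption that each sampling moment t_n is linearly interpolated between the two adjacent unit features with weights 1 − dec(t_n) and dec(t_n) as described for FIG. 6; the sampling moments and feature sizes are illustrative.

```python
import torch

def build_sampling_matrix(sample_times: torch.Tensor, num_units: int) -> torch.Tensor:
    """W1[n, u] holds the weight of unit u for sampling moment t_n (formula (3))."""
    W1 = torch.zeros(len(sample_times), num_units)
    for n, t in enumerate(sample_times.tolist()):
        lo = int(t)                    # floor(t_n)
        frac = t - lo                  # dec(t_n), the fractional part
        W1[n, min(lo, num_units - 1)] += 1.0 - frac
        if frac > 0 and lo + 1 < num_units:
            W1[n, lo + 1] += frac
    return W1

# Example: weights 0.7 / 0.3 around a sampling moment with fractional part 0.3.
times = torch.tensor([2.0, 3.3])       # illustrative sampling moments (seconds)
W1 = build_sampling_matrix(times, num_units=6)
F_v = torch.randn(6, 128)              # unit feature sequence F'_v, one row per unit
F_v_sampled = W1 @ F_v                 # formula (2): sampled feature matrix F''_v
```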
  • the server may perform dimension reduction processing on the sampling features of the at least two video clips to obtain the clip features of the at least two video clips.
  • the server may convolve the sampling feature matrix through a three-dimensional convolution layer, so as to perform dimensionality reduction processing on the sampling features of each video segment in the sampling sequence dimension.
  • the above dimensionality reduction process can be expressed as the following formula (4):
  • F_vp = Conv3d(F″_v)  (4)
  • where F″_v represents the sampling feature matrix, Conv3d() represents a three-dimensional convolution operation, and F_vp is the segment feature matrix; one feature in F_vp represents the segment feature of one video segment.
  • FIG. 7 is a schematic diagram of a method for acquiring segment features provided by an embodiment of the present application. The foregoing method for acquiring segment features is described with reference to FIG. 7 .
  • for the video segment 701, its initial segment feature 702 includes unit features 703, 704, 705 and 706, and the initial segment feature 702 corresponds to sampling moments 707, 708 and 709.
  • the unit features 704 and 705 can be added and averaged to obtain the sampling feature 710 corresponding to the sampling moment 708; the sampling feature 711 of the video segment 701 is then obtained based on the sampling features corresponding to the sampling moments.
  • the server constructs a feature map 712 based on the position information of each video clip in the video and the sampling features of each video clip. The horizontal direction of the feature map is the start time of the video clip, the vertical direction is the duration of the video clip, and each location stores the sampling feature of one video clip; for example, position 713 stores the sampling feature of the video clip whose start time is 0 seconds and whose duration is 4 seconds.
  • When every position in the feature map 712 stores the sampling feature of a video clip, the sampling feature matrix F″_v is obtained; the sampling feature matrix F″_v is then subjected to dimensionality reduction through the three-dimensional convolution layer to obtain the segment feature matrix F_vp, namely matrix 714, where a feature 715 in the matrix 714 represents the segment feature of one video segment.
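  • The following sketch illustrates the 2D feature map of FIG. 7 and the dimensionality reduction of formula (4): position (duration, start) stores the sampled feature sequence of one candidate segment, and a 3D convolution collapses the sampling-sequence dimension into a single segment feature. The sizes N, C, S, T and the kernel shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

T, S, N, C = 16, 8, 4, 128                    # starts, durations, samples per segment, channels
feature_map = torch.zeros(1, C, N, S, T)      # (batch, channels, samples, duration, start)

# ... fill feature_map[0, :, :, s, t] with the sampled features of the segment
#     starting at unit t with duration index s (zeros where no valid segment exists) ...

conv3d = nn.Conv3d(C, C, kernel_size=(N, 1, 1))   # collapse the sampling-sequence dimension
F_vp = conv3d(feature_map).squeeze(2)              # segment feature matrix F_vp, (1, C, S, T)
```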
  • the server obtains the segment features of the at least two video segments based on the unit features of the video units included in the at least two video segments.
  • feature extraction is performed on the unit feature to obtain the segment feature.
  • the unit feature of each video unit and the time sequence relationship between the unit features can be fused in the segment feature.
  • In addition, through sampling, video clips of different durations all correspond to segment features of the same dimension, which is convenient for the model to perform subsequent operations based on the segment features.
  • the server acquires the text feature of the target text.
  • the target text is a piece of text used to describe a video clip, for example, a piece of text input by a user when retrieving a video clip.
  • the server obtains the one-hot encoding of each word in the target text, and maps the one-hot encoding of each word into a word vector through the Embed (word embedding) layer.
  • the Embed layer can be expressed as a fully connected layer, and the server obtains the word vector of each word by multiplying the one-hot encoding of each word with the coefficient matrix of the fully connected layer, thereby obtaining the vector representation of the target text.
  • the server may input the vector representation of the target text into a GRU (Gated Recurrent Unit) recurrent neural network, and the recurrent neural network extracts the text features of the target text based on the vector representation of the target text.
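  • A minimal sketch of this text branch: each word id (equivalent to multiplying a one-hot vector by a fully connected layer) is mapped to a word vector by an embedding layer, and a GRU produces the text feature F_q. Vocabulary size and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Hedged sketch: Embed layer followed by a GRU to obtain the text feature."""
    def __init__(self, vocab_size: int = 10000, embed_dim: int = 300, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # the Embed (word embedding) layer
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        vectors = self.embed(word_ids)          # (batch, num_words, embed_dim)
        _, last_hidden = self.gru(vectors)      # last hidden state summarises the text
        return last_hidden.squeeze(0)           # text feature F_q, (batch, hidden)
```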
  • In the embodiments of the present application, the description follows the execution order of first obtaining the segment features of the video clips and then obtaining the text features of the target text; the step of obtaining the text features may also be executed first and the step of acquiring the segment features performed afterwards, or the two steps may be performed simultaneously, which is not limited in this embodiment of the present application.
  • the server performs feature fusion of the segment features of the at least two video segments with the text features of the target text, respectively, to obtain fused segment features of the at least two video segments.
  • the server may perform cross-modal feature fusion on segment features and text features through a feature fusion unit in the video recognition model.
  • the server constructs a first feature matrix corresponding to the video based on the segment features of the at least two video segments and the position information of the at least two video segments in the video, i.e. the segment feature matrix F_vp in step 503. In step 503, the segment feature matrix F_vp can be obtained directly through the matrix convolution, so it does not need to be constructed again here; if the segment features are obtained by other methods, the segment feature matrix F_vp needs to be constructed at this point.
  • the server performs dimension expansion on the text feature based on the dimension of the first feature matrix to obtain an expanded matrix, wherein the dimension of the expanded matrix is the same as the dimension of the first feature matrix, so as to facilitate feature fusion.
  • the server then performs feature fusion on the first feature matrix and the extended matrix to obtain the fused segment features of the at least two video segments. For example, the server multiplies the first feature matrix by the elements at the same positions in the extended matrix to obtain an intermediate feature matrix, and performs pooling processing on the intermediate feature matrix to obtain a second feature matrix; one feature in the second feature matrix represents the fused segment feature of one video segment.
  • In one possible implementation, the server can first input the features of the two modalities into linear layers, i.e. fully connected layers, multiply the elements at the same positions in the linearly transformed features of the two modalities to obtain the intermediate feature matrix, and perform pooling processing on the intermediate feature matrix to obtain the second feature matrix.
  • the above bilinear-pooling feature fusion method can be expressed as the following formula (5):
  • F_ap = SumPool( F_vp ∘ Tile(F_q), K )  (5)
  • where F_vp represents the first feature matrix corresponding to the video; F_q represents the text feature of the target text; Tile(F_q) means copying the text feature F_q along the T dimension and the S dimension respectively; ∘ means multiplying the elements at the same position in the two matrices; SumPool(x, K) means sum-pooling x with a sliding window of size K; and F_ap represents the second feature matrix.
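  • A sketch of this bilinear-pooling fusion in PyTorch, under the assumption (stated explicitly for formula (7)) that the learnable parameters are two fully connected layers applied before the element-wise product; the dimensions and the pooling window K are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse(F_vp, F_q, W_v, W_q, K=2):
    """F_ap = SumPool((W_v F_vp) o Tile(W_q F_q), K) -- hedged reading of formula (5)."""
    C, S, T = F_vp.shape                            # F_vp: (C, S, T) segment feature matrix
    v = W_v(F_vp.permute(1, 2, 0))                  # (S, T, D) linearly mapped video features
    q = W_q(F_q)                                    # (D,) linearly mapped text feature
    q = q.view(1, 1, -1).expand(S, T, -1)           # Tile(F_q) along the S and T dimensions
    prod = v * q                                    # o : element-wise multiplication
    # SumPool with window K over the feature dimension (avg_pool * K equals sum pooling)
    pooled = F.avg_pool1d(prod.reshape(S * T, 1, -1), K) * K
    return pooled.reshape(S, T, -1).permute(2, 0, 1)  # fused segment features F_ap, (D/K, S, T)

W_v, W_q = nn.Linear(128, 512), nn.Linear(256, 512)   # illustrative learnable fully connected layers
F_ap = fuse(torch.randn(128, 8, 16), torch.randn(256), W_v, W_q)
```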
  • the server obtains the first attention weight of the at least two video segments based on the fused segment features of the at least two video segments.
  • the first attention weight is used to indicate the degree of matching between the video clip and the target text.
  • the value of the first attention weight is positively correlated with the matching degree between the video clip and the target text.
  • the server performs at least one convolution operation on the second feature matrix obtained after feature fusion through at least one two-dimensional convolution layer in the video recognition model to obtain the first attention matrix.
  • the server may further normalize the result of the convolution operation and use the normalized matrix as the first attention matrix, where one element of the first attention matrix represents the first attention weight of one video segment.
  • the acquisition method of the above first attention matrix can be expressed as the following formula (6):
  • Att_p = Softmax( Conv2d(F_ap) )  (6)
  • where F_ap represents the second feature matrix, Conv2d() represents the two-dimensional convolution operation, Softmax() represents the normalization processing function, and Att_p represents the first attention matrix.
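  • A minimal sketch of formula (6): at least one 2D convolution over the fused segment features followed by a softmax over all candidate segments yields the first attention matrix. The number of convolution layers and channel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

conv2d = nn.Sequential(                        # "at least one" 2D convolution layer
    nn.Conv2d(256, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 1, kernel_size=1),
)
F_ap = torch.randn(1, 256, 8, 16)              # fused segment features (batch, C, S, T)
logits = conv2d(F_ap).flatten(1)               # one score per candidate segment
Att_p = torch.softmax(logits, dim=1).view(1, 8, 16)   # first attention weights Att_p
```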
  • the server obtains, from at least two video segments, a video segment whose matching degree with the target text satisfies the reference condition according to the first attention weight, as a target video segment in the video related to the target text.
  • the reference condition may be set by a developer, which is not limited in this embodiment of the present application.
  • the reference condition may be set to determine the video segment with the highest first attention weight as the target video segment, or may be set to determine the video segment with the first attention weight greater than the weight threshold as the target video segment.
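  • A small illustrative snippet of applying such a reference condition to a first-attention map, either taking the segment with the highest weight or thresholding; the map shape and threshold value are assumptions for the example.

```python
import torch

# Stand-in first attention map over (duration index, start unit) positions.
Att_p = torch.softmax(torch.randn(8, 16).flatten(), dim=0).view(8, 16)

# Reference condition 1: the segment with the highest first attention weight.
s, t = divmod(Att_p.flatten().argmax().item(), Att_p.size(-1))
print(f"target segment: duration index {s}, start unit {t}")

# Reference condition 2: all segments whose weight exceeds a threshold.
weight_threshold = 0.05                             # illustrative threshold
candidates = (Att_p > weight_threshold).nonzero()   # (duration index, start unit) pairs
```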
  • In the embodiments of the present application, unit features are obtained at the video-unit dimension and the segment features of the video segments are determined from these unit features, so that the obtained segment features integrate the features of multiple video units as well as the temporal association between the video units. The segment features of the video segments are then fused with the text features of the target text; because the feature fusion makes full use of the segment-dimension features and the temporal correlation between video segments, a more accurate attention weight can be obtained from the fused features. The attention weight represents the degree of matching between a video segment and the target text, so when a video segment is located based on the attention weight, the target video segment that matches the target text can be located more accurately.
  • FIG. 8 is a schematic diagram of a first attention weight adjustment method provided by an embodiment of the present application. Referring to FIG. 8 , the method may include the following steps:
  • the server fuses the unit feature of the video unit with the text feature of the target text respectively to obtain the fused unit feature of the video unit.
  • the server may perform sampling and dimension reduction processing on each unit feature, so that the unit features are easier to be understood by the video recognition model.
  • the server may multiply a sampling matrix W_2 by the unit feature sequence F′_v corresponding to the video to sample the unit features; the sampling matrix W_2 is constructed based on the fact that the duration of each video unit is 1 second.
  • the server inputs the sampling results into the three-dimensional convolution layer in the video recognition model, and the three-dimensional convolution layer performs dimensionality reduction processing on the sampling results to obtain the processed unit feature sequence F_vc; one feature in the unit feature sequence F_vc is one processed unit feature.
  • the three-dimensional convolutional layer is the same as the three-dimensional convolutional layer applied when performing dimension reduction processing on the segment features in step 503 .
  • the server may extend the dimension of the text feature based on the dimension of the unit feature sequence F_vc, and perform feature fusion on the expanded text feature and the unit feature sequence F_vc to obtain the fused unit feature sequence F_ac; one feature in the fused unit feature sequence F_ac is the fused unit feature of one video unit.
  • the acquisition method of the fusion unit feature is the same as the acquisition method of the fusion segment feature in the above step 505, and will not be repeated here.
  • the acquisition method of the above fused unit features can be expressed as the following formula (7):
  • F_ac = SumPool( (W · F_vc) ∘ Tile(W′ · F_q), K )  (7)
  • where W and W′ are learnable parameters that can be expressed as two fully connected layers, and the parameter values in each fully connected layer are determined during the model training process; F_vc represents the processed unit feature sequence of the video; F_q represents the text feature of the target text; Tile(F_q) means copying the text feature F_q along the T dimension and the S dimension respectively; ∘ means multiplying the elements at the same position in the two matrices; SumPool(x, K) means sum-pooling x with a sliding window of size K; and F_ac represents the fused unit feature sequence.
  • the server obtains a second attention weight of the video unit based on the fusion unit feature of the video unit.
  • the server may perform a two-dimensional convolution on the fused unit feature sequence F_ac, normalize the convolution result, and then multiply the normalized result matrix with the global feature matrix of the video to obtain the second attention matrix, where one element of the second attention matrix represents the second attention weight of one video unit.
  • the global feature matrix of the video can be obtained based on the segment feature matrix obtained in step 503 and the first attention matrix obtained in step 506, and the second attention matrix is then obtained from the fused unit feature sequence and the global feature matrix, which can be expressed as the following formulas (8) and (9):
  • R_e = F_vp ∘ Att_p  (8)
  • Att_c = Softmax( Conv2d(F_ac) ) · R_e  (9)
  • where F_vp represents the segment feature matrix; Att_p represents the first attention matrix; R_e represents the global feature matrix; F_ac represents the fused unit feature sequence; Conv2d() represents the two-dimensional convolution operation; Softmax() represents the normalization processing function; and Att_c represents the second attention matrix.
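  • A minimal sketch of one possible reading of the global-feature and second-attention computation above: the global video feature is the attention-weighted sum of segment features, and each fused unit feature (after a 2D convolution and normalization) is compared against it. The softmax dimension, kernel size and feature sizes are assumptions not fixed by the description.

```python
import torch
import torch.nn as nn

C, S, T = 256, 8, 16
F_vp = torch.randn(C, S, T)                                # segment feature matrix
Att_p = torch.softmax(torch.randn(S * T), 0).view(S, T)    # first attention matrix
R_e = (F_vp * Att_p).sum(dim=(1, 2))                       # global feature of the video, (C,)

F_ac = torch.randn(C, T)                                   # fused unit feature sequence
conv = nn.Conv2d(C, C, kernel_size=1)
unit_feats = conv(F_ac.view(1, C, 1, T)).view(C, T)        # 2D convolution over the unit sequence
unit_feats = torch.softmax(unit_feats, dim=0)              # normalization
Att_c = unit_feats.t() @ R_e                               # one second attention weight per video unit
```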
  • steps 801 and 802 are steps of acquiring the second attention weights of the at least two video units.
  • the attention weight at the video unit level is obtained, and the subsequent video segment location is performed based on the multi-level attention weight, which can improve the accuracy of the video segment location result.
  • the server adjusts the first attention weight of the at least two video segments based on the second attention weight of the video units included in the at least two video segments.
  • the server determines, from the video units included in the target video clip, the target video unit corresponding to the center moment of the target video clip, and adjusts the first attention weight of the target video segment based on the second attention weight of the target video unit.
  • the above process of adjusting the first attention weight can be expressed as the following formula (10):
  • Att′_p(i) = Att_p(i) + λ · Att_c(j)  (10)
  • where i denotes the i-th video segment and Att_p(i) represents the first attention weight of the i-th video segment; j denotes the j-th video unit, whose specific value corresponds to the center moment of the i-th video segment, i.e. T_i + S_i/2, where T_i is the start time of the i-th video segment and S_i is the duration of the i-th video segment; Att_c(j) represents the second attention weight of the j-th video unit; Att′_p(i) represents the adjusted first attention weight; and λ is a hyperparameter whose specific value can be set by a developer, which is not limited in this embodiment of the present application.
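  • A short sketch of formula (10): each segment's first attention weight is boosted by the second attention weight of the unit at its center moment. The value of the hyperparameter λ and the example weights are illustrative.

```python
import torch

def adjust_attention(Att_p, Att_c, starts, durations, lam=0.5):
    """Att'_p(i) = Att_p(i) + lam * Att_c(j), j = centre unit of segment i."""
    Att_p_adj = Att_p.clone()
    for i, (T_i, S_i) in enumerate(zip(starts, durations)):
        j = int(T_i + S_i / 2)          # index of the video unit at the centre moment
        Att_p_adj[i] = Att_p[i] + lam * Att_c[j]
    return Att_p_adj

Att_p = torch.tensor([0.2, 0.5, 0.3])           # per-segment first attention weights
Att_c = torch.tensor([0.1, 0.4, 0.2, 0.3])      # per-unit second attention weights
adjusted = adjust_attention(Att_p, Att_c, starts=[0, 1, 2], durations=[2, 2, 2], lam=0.5)
```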
  • the technical solutions provided by the embodiments of the present application extend the video recognition model into a multi-level structure, that is, one including a video-segment-level data processing branch and a video-unit-level data processing branch; the second attention weight at the video-unit dimension is obtained and applied to adjust the first attention weight at the video-segment dimension, so as to improve the accuracy of the first attention weight and thereby the accuracy of the video segment positioning result.
  • the above embodiments describe the process of locating video clips based on natural language description.
  • After the target video clip is determined, the target video clip can be displayed.
  • the server may send the video clip positioning result to the terminal used by the user, and the terminal will display label information on the video playback interface, where the label information is used to indicate the start time and end time of the target video clip.
  • In one possible implementation, in response to detecting the user's triggering operation on the search control, the terminal generates a video segment location request, where the video segment location request includes the video identifier of the video and the target text.
  • the terminal may also generate the video segment location request in other manners, which is not limited in this embodiment of the present application.
  • the terminal sends the video segment location request to the server, and the server locates the target video segment in the video that matches the target text.
  • the server can send the start time and duration of the target video segment to the terminal.
  • the terminal may mark the start time and end time of the target video segment in the playback progress bar of the playback interface based on the start time and duration of the target video segment.
  • FIG. 9 is a schematic diagram of a display mode of a target video clip provided by an embodiment of the present application.
  • the playback interface includes a video playback area 901 and a video playback progress bar 902; the terminal may display label information in the video playback progress bar 902, where the label information is used to indicate the start time and end time of the target video segment.
  • the terminal may also jump to the target video segment for playback, that is, jump from the current playback time to the start time of the target video segment, and start playing the video from the start time.
  • the server may also cut the target video clip out of the video, generate a play link of the target video clip, and send the play link to the terminal; the terminal then displays the link or hyperlink of the target video clip on the play interface of the video, where the link or hyperlink is used to provide the function of playing the target video clip.
  • FIG. 10 is a schematic diagram of another target video clip display mode provided by an embodiment of the present application, and the play interface includes a video play area 1001 and a video clip display area 1002 .
  • the embodiment of the present application does not limit the position of the video clip display area 1002 in the play interface.
  • In the following, the video clip display area 1002 located below the video play area 1001 is taken as an example.
  • the terminal may display the play entry 1003 of the target video clip in the video clip display area 1002 in the form of a hyperlink; in response to the user clicking the play entry 1003, the terminal jumps to the play interface corresponding to the target video clip and plays the target video clip.
  • if the video segment location request does not include a video ID, that is, the location is not restricted to a particular video, the server matches the target text with the video segments in multiple videos and obtains a target video clip from each of the multiple videos.
  • the server may generate a play link for each target video clip, the terminal displays the play link of each video clip, and the user clicks a play link to play the corresponding video clip.
  • the server may generate a movie set based on multiple target video clips and send the link or hyperlink of the movie set to the terminal for display; the user may watch multiple target video clips of interest in the movie set, and the movie set can also be stored in the terminal.
  • generating a movie set in this way can make video viewing more engaging and improve the user experience.
  • the server may be equipped with a reconstruction module; the reconstruction module predicts a first candidate text based on the segment features of the video clips and adjusts the parameters of the video recognition model based on the error between the first candidate text and the target text.
  • FIG. 11 is a flowchart of a video recognition model training method provided by an embodiment of the present application. Referring to FIG. 11 , the process may specifically include the following steps:
  • the server initializes each parameter in the video recognition model.
  • the server implements parameter initialization by randomly assigning parameters of each convolution layer, pooling layer, and fully connected layer in the video recognition model.
  • the server may use a Gaussian distribution with a mean of 0 and a variance of 0.01 to initialize the parameters of the video recognition model; it should be noted that the embodiments of the present application do not limit the specific method of initializing the model parameters (a minimal sketch of such an initialization is given below).
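For illustration, a minimal PyTorch-style sketch of this initialization (mean 0, variance 0.01, i.e. standard deviation 0.1); it assumes the video recognition model is an ordinary `nn.Module` and is not the actual model definition.

```python
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # A variance of 0.01 corresponds to a standard deviation of 0.1.
    if isinstance(module, (nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.Linear)):
        nn.init.normal_(module.weight, mean=0.0, std=0.1)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# Assuming `model` is the video recognition model:
# model.apply(init_weights)
```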
  • the server inputs the training data set into the video recognition model.
  • the training data set may include multiple sample videos, the multiple sample videos are marked sample videos, and each sample video is marked with its corresponding text information.
  • the model training is performed in a weakly supervised manner, without fine-grained temporal labeling, that is, without labeling the start time, end time and corresponding text information of each video segment, which reduces the difficulty of obtaining training data sets.
  • the server inputs multiple labeled sample videos into the video recognition model, and the video recognition model outputs the target video clip located by the text information, based on the matching degree between the video clips in the sample video and the text information.
  • the acquisition method of the target video segment is the same as the process of locating the video segment in the above steps 501 to 507 , and details are not described here.
  • the server determines the first candidate text based on the first attention weight and segment features output by the video recognition model, and obtains a first error value between the first candidate text and the target text.
  • the server performs a weighted operation on the segment features of the at least two video segments based on the first attention weights of the at least two video segments to obtain the weighted segment features of the at least two video segments.
  • the server may multiply the segment feature matrix by the first attention matrix to obtain a global feature matrix; a feature in the global feature matrix is the weighted segment feature of a video segment (a sketch of this weighting follows).
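A minimal sketch of this weighting, assuming the segment features are flattened from their T×S layout into a (num_segments, d) array and the first attention weights into a (num_segments,) vector; the names are illustrative.

```python
import numpy as np

def weighted_segment_features(segment_feats: np.ndarray, attn: np.ndarray) -> np.ndarray:
    """segment_feats: (num_segments, d) segment features; attn: (num_segments,)
    first attention weights. Row i of the result is the weighted segment feature
    of segment i, i.e. the segment feature scaled by its attention weight."""
    return attn[:, None] * segment_feats
```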
  • the server performs feature extraction on the weighted segment features of the at least two video segments through a long-short-term memory network, and determines the first candidate text based on the extracted features.
  • when predicting the mth word of the first candidate text, the server may splice the CloVe word vector of the (m-1)th word, the LSTM (Long Short-Term Memory network) hidden layer feature of the (m-1)th word, and the global feature matrix; the long short-term memory network determines the hidden layer feature of the mth word based on the splicing result, and the mth word is determined based on the acquired hidden layer feature.
  • the above method for obtaining the hidden layer feature of the mth word can be expressed as the following formula (11), whose symbols are listed below:
  • h_{m-1} represents the hidden layer feature of the (m-1)th word;
  • e_{m-1} represents the CloVe word vector of the (m-1)th word;
  • h_m represents the hidden layer feature of the mth word.
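Formula (11) itself is not reproduced above, so the following is only a hedged sketch of the described step: the global feature, the previous hidden layer feature and the CloVe vector of word m-1 are spliced and fed to an LSTM cell to obtain h_m. The dimensions, the output projection and the variable names are assumptions.

```python
import torch
import torch.nn as nn

word_dim, hid_dim, feat_dim, vocab_size = 300, 512, 512, 10000  # assumed sizes

lstm_cell = nn.LSTMCell(input_size=feat_dim + hid_dim + word_dim, hidden_size=hid_dim)
to_vocab = nn.Linear(hid_dim, vocab_size)  # assumed projection from h_m to word scores

def decode_step(global_feat, h_prev, c_prev, e_prev):
    """global_feat: (B, feat_dim); h_prev, c_prev: (B, hid_dim); e_prev: (B, word_dim)
    CloVe vector of word m-1. Returns h_m, c_m and the scores used to pick word m."""
    x = torch.cat([global_feat, h_prev, e_prev], dim=-1)  # the splicing described above
    h_m, c_m = lstm_cell(x, (h_prev, c_prev))
    return h_m, c_m, to_vocab(h_m)
```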
  • the server obtains the error value between the first candidate text and the target text.
  • the first error value can be obtained through a generation loss function, which can be expressed as the following formula (12):
  • M represents the number of words in the first candidate text;
  • m represents the word index;
  • h_{m-1} represents the hidden layer feature of the (m-1)th word;
  • w_{m-1} represents the encoded representation of the (m-1)th word.
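Formula (12) is likewise not reproduced; as a hedged sketch, a common form of such a generation loss is the negative log-likelihood of the target words given the decoder outputs, which is what the snippet below computes. The exact expression in the patent may differ.

```python
import torch
import torch.nn.functional as F

def generation_loss(word_scores: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """word_scores: (M, vocab_size) decoder outputs for the M words; target_ids: (M,)
    indices of the words of the target text. Mean negative log-likelihood over the words."""
    return F.cross_entropy(word_scores, target_ids)
```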
  • the server adjusts the parameters of each operation layer in the video recognition model based on the first error value, and obtains a trained video recognition model when the target condition is met.
  • the server may compare the acquired first error value with an error threshold; when the first error value is greater than the error threshold, the server back-propagates the first error value into the video recognition model and solves for each parameter of the model based on the first error value, where the parameters include the parameters corresponding to the convolution kernels, the parameters corresponding to the pooling layers, the parameters corresponding to each fully connected layer, and the like.
  • the error threshold can be set by the developer.
  • the target condition may be set by the developer; in one possible implementation, the target condition is that the number of correct output results reaches a target number, where the target number may also be set by the developer.
  • when the first error value is less than the error threshold, the output result obtained by the server is considered correct, and the server continues to read the next sample video and executes step 1103; when the number of correct output results obtained by the server reaches the target number, that is, when the target condition is met, the training of the video recognition model is considered complete (a compressed sketch of this loop is given below).
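A compressed sketch of the loop described in steps 1101 to 1104; `model.reconstruction_loss` and the numeric values are assumptions standing in for the first-error-value computation and the developer-set thresholds.

```python
def train(model, samples, optimizer, error_threshold=0.5, target_correct=1000):
    correct = 0
    for video, target_text in samples:                         # weakly labeled sample videos
        loss = model.reconstruction_loss(video, target_text)   # first error value (assumed API)
        if loss.item() > error_threshold:
            optimizer.zero_grad()
            loss.backward()                                    # back-propagate the error value
            optimizer.step()                                   # adjust every operation layer
        else:
            correct += 1                                       # output result considered correct
            if correct >= target_correct:                      # target condition met
                break
    return model
```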
  • when the video recognition model includes both a video-segment-level data processing branch and a video-unit-level data processing branch, the server may further predict a second candidate text based on the video-unit-level features and determine a second error value based on the second candidate text; the method for obtaining the second error value can be expressed as the following formula (13):
  • a total error value L_cap may be obtained based on the first error value and the second error value, and the parameters in the video recognition model are adjusted based on the total error value.
  • the total error value L_cap can be expressed as the following formula (14); a hedged reconstruction is sketched below:
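Since formula (14) is shown only as an image in the source, the following is a hedged reconstruction from the surrounding description (a segment-level first error value, a unit-level second error value, and a developer-set weight λ); the superscripts p and c are assumed labels and the exact weighting may differ.

```latex
% Assumed reconstruction of formula (14):
% total loss = segment-level generation loss + \lambda * unit-level generation loss
L_{cap} = L_{cap}^{p} + \lambda \, L_{cap}^{c}
```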
  • FIG. 12 is a schematic diagram of a data processing process of a video recognition model provided by an embodiment of the present application. The above process will be described with reference to FIG. 12 .
  • the data processing process of the video recognition model may include a feature extraction stage 1201 , an attention stage 1202 and a reconstruction stage 1203 .
  • in the feature extraction stage 1201, the feature 1204 of the video segment dimension, the feature 1205 of the video unit dimension and the text feature 1206 can be obtained through at least one convolutional layer; in the attention stage 1202, the text feature and the video features are fused, and a convolution operation is performed on the fused features through at least one two-dimensional convolutional layer to obtain the first attention weight 1207 of the video segment dimension and the second attention weight 1208 of the video unit dimension.
  • during testing, the first attention weight 1207 can be adjusted based on the second attention weight 1208, and the target video segment 1209 is predicted based on the adjusted first attention weight.
  • during model training, the first global feature 1209 of the video segment dimension can be obtained based on the feature 1204 of the video segment dimension and the first attention weight 1207, and the second global feature 1210 of the video unit dimension can be obtained based on the feature 1205 of the video unit dimension and the second attention weight 1208; an LSTM network with parameter sharing is applied to predict candidate texts based on the first global feature 1209 and the second global feature 1210 respectively, and the error between each candidate text and the target text is determined through the loss function (a high-level sketch of this data flow follows).
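A high-level sketch of the FIG. 12 data flow described above; every function name here is a placeholder for the corresponding stage, not an API of the embodiments.

```python
def forward(video, target_text, model, training=False):
    # Feature extraction stage 1201: segment-level, unit-level and text features.
    seg_feats, unit_feats = model.extract_video_features(video)
    text_feat = model.extract_text_feature(target_text)

    # Attention stage 1202: fuse text with video features, then 2D convolutions
    # yield the first (segment-level) and second (unit-level) attention weights.
    attn_seg = model.segment_attention(seg_feats, text_feat)
    attn_unit = model.unit_attention(unit_feats, text_feat)

    if not training:
        # Test time: adjust the first weights with the second ones and pick the clip.
        return model.locate(attn_seg, attn_unit)

    # Reconstruction stage 1203: global features feed a parameter-shared LSTM that
    # reconstructs candidate texts; the loss compares them with the target text.
    return model.reconstruction_loss(seg_feats, unit_feats, attn_seg, attn_unit, target_text)
```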
  • by performing model training based on data at the two levels of video clips and video units, a video recognition model with better performance can be obtained.
  • FIG. 13 is a schematic structural diagram of a video segment positioning apparatus provided by an embodiment of the present application. Referring to FIG. 13 , the apparatus includes:
  • the first acquisition module 1301 is configured to perform feature extraction on video units included in at least two video clips in the video to obtain unit features of the video units;
  • a second acquiring module 1302, configured to acquire segment features of the at least two video segments based on unit features of video units included in the at least two video segments;
  • the feature fusion module 1303 is used for feature fusion of the segment features of the at least two video segments and the text features of the target text respectively to obtain the fused segment features of the at least two video segments;
  • the third obtaining module 1304 is configured to obtain the first attention weight of the at least two video clips based on the fused clip features of the at least two video clips, where the first attention weight is used to indicate the difference between the video clip and the target text match between;
  • the fourth obtaining module 1305 is configured to obtain, according to the first attention weight, from the at least two video clips, the video clip whose matching degree with the target text satisfies the reference condition, as the target video segment in the video that is related to the target text.
  • the second obtaining module 1302 includes:
  • an initial segment feature acquisition submodule configured to determine initial segment features of the at least two video segments based on unit features of the video units included in the at least two video segments;
  • the sampling sub-module is configured to sample the initial segment features of the at least two video segments to obtain segment features of the at least two video segments.
  • the sampling submodule includes:
  • a sampling moment determination unit configured to determine the sampling moment corresponding to the video clip based on the duration of the video clip, and the number of sampling moments corresponding to each video clip is the same;
  • the sampling unit is configured to sample the initial segment feature of the video segment based on the sampling moment corresponding to the video segment to obtain the segment feature of the video segment.
  • the sampling unit is used to: construct a sampling matrix based on the sampling moments corresponding to the at least two video segments and the position information of the at least two video segments in the video; multiply the sampling matrix by the initial segment features of the at least two video segments to obtain a sampling feature matrix, where a feature in the sampling feature matrix represents the sampling feature of one video segment; and perform dimensionality reduction processing on the sampling features of the at least two video segments to obtain the segment features of the at least two video segments (a sketch of this sampling follows).
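A sketch of the sampling the unit performs, computed directly as the linear interpolation that the sampling matrix encodes; it assumes the unit features of the whole video are a (num_units, d) array of 1-second units, and the exact placement of the sampling moments within a segment is an assumption.

```python
import numpy as np

def sample_segment_feature(unit_feats: np.ndarray, start_s: float, duration_s: float,
                           num_samples: int = 2) -> np.ndarray:
    """unit_feats: (num_units, d) features of the video's 1-second units.
    Samples the segment [start_s, start_s + duration_s) at num_samples moments,
    weighting the two neighbouring units by 1 - dec(t) and dec(t) as described earlier."""
    num_units, d = unit_feats.shape
    out = np.zeros((num_samples, d))
    ts = start_s + (np.arange(num_samples) + 0.5) * duration_s / num_samples  # assumed spacing
    for i, t in enumerate(ts):
        lo = min(int(np.floor(t)), num_units - 1)
        hi = min(lo + 1, num_units - 1)
        frac = t - np.floor(t)                     # dec(t): weight of the next unit
        out[i] = (1.0 - frac) * unit_feats[lo] + frac * unit_feats[hi]
    return out
```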
  • the feature fusion module 1303 includes:
  • the text feature acquisition submodule is used to acquire the text features of the target text
  • a matrix construction submodule configured to construct a first feature matrix corresponding to the video based on the segment features of the at least two video segments and the position information of the at least two video segments in the video;
  • An expansion submodule for performing dimension expansion on the text feature based on the dimension of the first feature matrix, to obtain an expanded matrix, and the dimension of the expanded matrix is the same as the dimension of the first feature matrix;
  • a feature fusion sub-module configured to perform feature fusion on the first feature matrix and the extended matrix to obtain fused segment features of the at least two video segments.
  • the feature fusion submodule is used to: multiply the elements at the same positions of the first feature matrix and the expanded matrix to obtain an intermediate feature matrix; and pool the intermediate feature matrix to obtain a second feature matrix, where a feature in the second feature matrix is used to represent the fused segment feature of one video segment (a sketch of this fusion follows).
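A sketch of this fusion with everything flattened to (num_segments, d); the two linear layers stand in for the learnable parameters mentioned in the description, and the sum-pooling window K and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

d, k_dim, K = 512, 1024, 4           # assumed feature size, projected size, pooling window
proj_v = nn.Linear(d, k_dim)         # learnable projection of the segment features
proj_q = nn.Linear(d, k_dim)         # learnable projection of the text feature

def fuse(seg_feats: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
    """seg_feats: (num_segments, d); text_feat: (d,). Returns (num_segments, k_dim // K)."""
    v = proj_v(seg_feats)
    q = proj_q(text_feat).expand_as(v)            # dimension expansion of the text feature
    joint = v * q                                 # same-position elementwise multiplication
    return joint.view(v.shape[0], -1, K).sum(-1)  # sum pooling over windows of size K
```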
  • the third obtaining module 1304 is used for: performing at least one convolution operation on the second feature matrix to obtain a first attention matrix, where an element in the first attention matrix is used to represent the first attention weight of one video segment.
  • the device further includes:
  • the fifth acquisition module for acquiring the second attention weight of the video unit, the second attention weight is used to indicate the degree of matching between the video unit and the target text;
  • An adjustment module configured to adjust the first attention weight of the at least two video segments based on the second attention weight of the video units included in the at least two video segments.
  • the fifth acquisition module is used for:
  • the unit features of the video unit are respectively fused with the text features of the target text to obtain the fusion unit features of the video unit;
  • the second attention weight of the video unit is obtained based on the fusion unit features of the video unit.
  • the adjustment module is configured to, for a target video segment among the at least two video segments: determine, from the video units included in the target video segment, the target video unit corresponding to the center moment of the target video segment; and adjust the first attention weight of the target video segment based on the second attention weight of that target video unit (a sketch of this adjustment follows).
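A sketch of this adjustment, following the rule Att'_p(i) = Att_p(i) + α·Att_c(j) given in the description, where j indexes the video unit at the segment's center moment; the rounding used to pick the center unit and the value of α are assumptions.

```python
import numpy as np

def adjust_first_attention(attn_seg: np.ndarray, attn_unit: np.ndarray,
                           starts: np.ndarray, durations: np.ndarray,
                           alpha: float = 0.5) -> np.ndarray:
    """attn_seg: (num_segments,) first attention weights; attn_unit: (num_units,) second
    attention weights; starts, durations: per-segment start time and duration in seconds.
    Adds alpha times the center unit's second weight to each segment's first weight."""
    centers = np.floor(starts + durations / 2.0).astype(int)   # assumed center-unit index
    centers = np.clip(centers, 0, attn_unit.shape[0] - 1)
    return attn_seg + alpha * attn_unit[centers]
```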
  • the apparatus further includes a display module configured to perform either of the following: displaying label information on the playing interface of the video, where the label information is used to indicate the start time and end time of the target video clip; or displaying a link of the target video clip on the playing interface of the video, where the link is used to provide the function of playing the target video clip.
  • the device further includes:
  • a sixth acquiring module configured to perform a weighted operation on the segment features of the at least two video segments based on the first attention weights of the at least two video segments, to obtain the weighted segment features of the at least two video segments;
  • a seventh acquisition module configured to perform feature extraction on the weighted segment features of the at least two video segments through a long-short-term memory network, and determine a first candidate text based on the extracted features
  • the eighth obtaining module is configured to obtain the first error value between the first candidate text and the target text.
  • the apparatus provided by the embodiments of the present application obtains the unit features of the video unit dimension and determines the segment features of the video segments from the unit features, so that the acquired segment features integrate the features of multiple video units and the temporal correlation between the video units; the segment features of the video segments are then fused with the text features of the target text, and the feature fusion process fully applies the features of the video segment dimension and the temporal correlation between the video segments, so that more accurate attention weights can be obtained based on the fused features.
  • the attention weight represents the matching degree between a video segment and the target text, so when video segments are located based on the attention weights, the target video segment that matches the target text can be located more accurately.
  • when the video clip positioning apparatus provided by the above embodiments positions video clips, the division into the above functional modules is only used as an example for illustration; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the video segment positioning apparatus provided by the above embodiments and the video segment positioning method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
  • FIG. 14 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • the terminal 1400 may be: a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • Terminal 1400 may also be called user equipment, portable terminal, laptop terminal, desktop terminal, and the like by other names.
  • the terminal 1400 includes: one or more processors 1401 and one or more memories 1402 .
  • Memory 1402 may include one or more computer-readable storage media, which may be non-transitory. Memory 1402 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 1402 is used to store at least one piece of program code, and the at least one piece of program code is executed by the processor 1401 to implement the video clip positioning method provided by the method embodiments in this application.
  • the terminal 1400 may optionally further include: a peripheral device interface 1403 and at least one peripheral device.
  • the processor 1401, the memory 1402 and the peripheral device interface 1403 may be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 1403 through a bus, a signal line or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 1404 , a display screen 1405 , a camera assembly 1406 , an audio circuit 1407 , a positioning assembly 1408 and a power supply 1409 .
  • the terminal 1400 also includes one or more sensors 1410 .
  • the one or more sensors 1410 include, but are not limited to, an acceleration sensor 1411 , a gyro sensor 1412 , a pressure sensor 1413 , a fingerprint sensor 1414 , an optical sensor 1415 , and a proximity sensor 1416 .
  • the structure shown in FIG. 14 does not constitute a limitation on the terminal 1400; the terminal may include more or fewer components than shown, combine some components, or adopt a different component arrangement.
  • FIG. 15 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the server 1500 may vary greatly due to differences in configuration or performance, and may include one or more processors (Central Processing Units, CPU) 1501 and one or more memories 1502, where the one or more memories 1502 store at least one piece of program code, and the at least one piece of program code is loaded and executed by the one or more processors 1501 to implement the methods provided by the above method embodiments.
  • the server 1500 may also have components such as wired or wireless network interfaces, keyboards, and input/output interfaces for input and output, and the server 1500 may also include other components for implementing device functions, which will not be repeated here.
  • a computer-readable storage medium is also provided, where the storage medium is used to store a computer program, and the computer program is executed to complete the video segment positioning method in the above embodiments.
  • the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • the embodiments of the present application also provide a computer program product including instructions, which, when executed on a computer, cause the computer to execute the methods provided by the above embodiments.
  • a computer program product comprising at least one piece of program code stored in a computer-readable storage medium.
  • the processor of the computer device reads the at least one piece of program code from the computer-readable storage medium, and the processor executes the at least one piece of program code, so that the computer device implements the operations performed by the video segment positioning method.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本申请公开了一种视频片段定位方法、装置、计算机设备及存储介质,属于视频处理技术领域。本申请通过获取视频单元维度的单元特征,根据单元特征确定视频片段的片段特征,获取到的片段特征中融合了多个视频单元的特征和视频单元之间的时序关联性;再将视频片段的片段特征与目标文本的文本特征进行融合,特征融合过程中充分应用了视频片段维度的特征以及各个视频片段之间的时序关联性,从而基于融合后的特征可以获取到更准确的注意力权重,由注意力权重来表示视频片段和目标文本之间的匹配度,进而在基于注意力权重进行视频片段定位时,可以更准确的定位出与目标文本相匹配的目标视频片段。

Description

视频片段定位方法、装置、计算机设备及存储介质
本申请要求于2020年07月30日提交中国专利局、申请号为202010753184.4、申请名称为“视频片段定位方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及视频处理技术领域,特别涉及视频片段定位。
背景技术
随着视频应用的普及,网络中的视频数量越来越多,在视频观看时,基于一段文本信息快速、准确地定位到一段视频片段的需求也越来越大。
目前,在基于一段文本信息进行视频片段定位时,通常是需要将文本信息和视频输入视频识别模型,由视频识别模型提取视频中各个视频帧的帧特征以及文本信息的文本特征,基于帧特征与文本特征,进行视频帧与文本信息的匹配,从而确定出各个视频帧与文本信息的之间匹配度,进而在视频中定位出与文本信息最匹配的视频片段。
发明内容
本申请实施例提供了一种视频片段定位方法、装置、计算机设备及存储介质,可以提高视频片段定位结果的准确率。该技术方案如下:
一方面,提供了一种视频片段定位方法,该方法包括:
对视频中至少两个视频片段包括的视频单元进行特征提取,得到该视频单元的单元特征;
基于该至少两个视频片段所包括视频单元的单元特征,获取该至少两个视频片段的片段特征;
将该至少两个视频片段的片段特征分别与目标文本的文本特征进行特征融合,得到该至少两个视频片段的融合片段特征;
基于该至少两个视频片段的融合片段特征,得到该至少两个视频片段的第一注意力权重,该第一注意力权重用于指示视频片段与该目标文本之间的匹配度;
根据所述第一注意力权重,从该至少两个视频片段中获取与该目标文本之间的匹配度满足参考条件的视频片段,作为所述视频中与所述目标文本相关的目标视频片段。
一方面,提供了一种视频片段定位装置,该装置包括:
第一获取模块,用于对视频中至少两个视频片段包括的视频单元进行特征提取,得到该视频单元的单元特征;
第二获取模块,用于基于该至少两个视频片段所包括视频单元的单元特征,获取该至少两个视频片段的片段特征;
特征融合模块,用于将该至少两个视频片段的片段特征分别与目标文本的文本特征进行特征融合,得到该至少两个视频片段的融合片段特征;
第三获取模块,用于基于该至少两个视频片段的融合片段特征,得到该至少两个视频片段的第一注意力权重,该第一注意力权重用于指示视频片段与该目标文本之间的匹配度;
第四获取模块,用于根据所述第一注意力权重,从该至少两个视频片段中,获取与该 目标文本之间的匹配度满足参考条件的视频片段,作为所述视频中与所述目标文本相关的目标视频片段。
一方面,提供了一种计算机设备,该计算机设备包括一个或多个处理器和一个或多个存储器,该一个或多个存储器中存储有至少一条程序代码,该至少一条程序代码由该一个或多个处理器加载并执行以实现该视频片段定位方法所执行的操作。
一方面,提供了一种计算机可读存储介质,该计算机可读存储介质中存储计算机程序,所述计算机程序用于执行以上方面的视频片段定位方法。
一方面,提供了一种计算机程序产品,该计算机程序产品包括至少一条程序代码,该至少一条程序代码存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该至少一条程序代码,处理器执行该至少一条程序代码,使得该计算机设备实现该视频片段定位方法所执行的操作。
本申请实施例提供的技术方案,通过获取视频单元维度的单元特征,根据单元特征确定视频片段的片段特征,获取到的片段特征中融合了多个视频单元的特征和视频单元之间的时序关联性;再将视频片段的片段特征与目标文本的文本特征进行融合,特征融合过程中充分应用了视频片段维度的特征以及各个视频片段之间的时序关联性,从而基于融合后的特征可以获取到更准确的注意力权重,由注意力权重来表示视频片段和目标文本之间的匹配度,进而在基于注意力权重进行视频片段定位时,可以更准确的定位出与目标文本相匹配的目标视频片段。
附图说明
图1是本申请实施例提供的一种视频片段定位方法的实施环境示意图;
图2是本申请实施例提供的一种视频片段定位方法的流程图;
图3是本申请实施例提供的一种视频片段、视频单元示意图;
图4是本申请实施例提供的一种视频识别模型的结构示意图;
图5是本申请实施例提供的一种视频片段定位方法的具体流程图;
图6是本申请实施例提供的一种采样方法示意图;
图7是本申请实施例提供的一种片段特征获取方法的示意图;
图8是本申请实施例提供的一种第一注意力权重调整方法的示意图;
图9是本申请实施例提供的一种目标视频片段的显示方式示意图;
图10是本申请实施例提供的另一种目标视频片段的显示方式示意图;
图11是本申请实施例提供的一种视频识别模型训练方法的流程图;
图12是本申请实施例提供的一种视频识别模型数据处理过程的示意图;
图13是本申请实施例提供的一种视频片段定位装置的结构示意图;
图14是本申请实施例提供的一种终端的结构示意图;
图15是本申请实施例提供的一种服务器的结构示意图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施方式作进一步地详细描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。 基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请中术语“第一”“第二”等字样用于对作用和功能基本相同的相同项或相似项进行区分,应理解,“第一”、“第二”、“第n”之间不具有逻辑或时序上的依赖关系,也不对数量和执行顺序进行限定。
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互系统、机电一体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习等几大方向。本申请涉及人工智能技术中的计算机视觉技术,应用视频识别模型对视频进行语义理解,基于一段文本描述,从视频中准确定位出与该文本描述相匹配的视频片段,而无需用户手动筛选大量视频。
图1是本申请实施例提供的一种视频片段定位方法的实施环境示意图。该实施环境包括:终端110和视频识别平台140。
终端110可以是智能手机、平板电脑、笔记本电脑、台式计算机、智能音箱、智能手表等,但并不局限于此。终端110安装和运行有支持视频识别、视频片段定位的应用程序。该应用程序可以是视频检索类应用程序等。示例性的,终端110是用户使用的终端,终端110中运行的应用程序内登录有用户账号。终端110可以泛指多个终端中的一个,本实施例仅以终端110来举例说明。
视频识别平台140用于为支持视频片段定位的应用程序提供后台服务。可选地,视频识别平台140承担主要视频识别工作,终端110承担次要视频识别工作;或者,视频识别平台140承担次要视频识别工作,终端110承担主要视频识别工作;或者,视频识别平台140或终端110分别可以单独承担视频识别工作。可选地,视频识别平台140包括:接入服务器、视频识别服务器和数据库。接入服务器用于为终端110提供接入服务。视频识别服务器用于提供视频识别、视频片段定位有关的后台服务。视频识别服务器可以是一台或多台。当视频识别服务器是多台时,存在至少两台视频识别服务器用于提供不同的服务,和/或,存在至少两台视频识别服务器用于提供相同的服务,比如以负载均衡方式提供同一种服务,本申请实施例对此不加以限定。视频识别服务器中可以设置有视频识别模型,该视频识别服务器为该模型的训练和应用过程提供支撑。其中,上述服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN(Content Delivery Network,内容分发网络)、以及大数据和人工智能平台等基础云计算服务的云服务器。
上述终端110与视频识别平台140可以通过有线或无线通信方式进行直接或间接地连接,本申请实施例对此不作限定。
本领域技术人员可以知晓,上述终端的数量可以更多或更少。比如上述终端可以仅为一个,或者上述终端为几十个或几百个,或者更多数量。本申请实施例对终端的数量和设备类型不加以限定。
本申请实施例提供了一种基于弱监督学习的视频片段定位方法,通过一段自然语言的描述定位出一个视频片段。本申请提供的技术方案可以应用于多种类型的应用程序中,与多种应用场景相结合。例如,在视频类应用程序中,用户在查找某一视频片段时,可以提供一段用于描述视频片段的文本信息,将该文本信息发送至应用程序对应的服务器,由服务器基于该文本信息的文本特征以及各个视频片段的片段特征,确定出与该文本信息相匹配的目标视频片段,而无需用户手动筛选大量的视频。应用本申请实施例提供的技术方案,可以快速、准确地定位出用户感兴趣的视频片段,且应用视频片段维度的特征进行视频片段定位,从而在运算过程中可以融合各个视频片段之间的关联性,提高视频片段定位的效率。
图2是本申请实施例提供的一种视频片段定位方法的流程图。该方法可以应用于上述实施环境,在本申请实施例中,以服务器作为执行主体,对视频片段定位方法进行介绍,参见图2,该实施例具体可以包括以下步骤:
201、服务器对视频中至少两个视频片段包括的视频单元进行特征提取,得到该视频单元的单元特征。
其中,该视频可以为存储在服务器中的视频,也可以为服务器从其他设备获取的视频,本申请实施例对具体采用哪种视频不作限定。在本申请实施例中,可以将视频中单位时长的一个片段作为一个视频单元,该视频包括多个连续的视频单元,每个视频单元包括多个视频帧。其中,该单位时长可以由开发人员进行设置,本申请实施例对此不作限定,例如,该单位时长设置为1秒,则视频中每1秒的片段均可以作为一个视频单元。
在本申请实施例中,视频包括多个不同时长的视频片段。在一种可能实现方式中,可以通过多个不同尺度的滑动窗口,在视频中确定多个不同时长的视频片段,当然,也可以通过其他方法确定该视频片段,本申请实施例对此不作限定。图3是本申请实施例提供的一种视频片段、视频单元示意图,参见图3,视频301包括多个连续的视频单元,例如,包括视频单元302、303、304、305、306等,其中,视频片段307包括视频单元302、303、304,视频片段308包括视频片段304、305。
在一种可能实现方式中,服务器响应于对视频的视频片段定位指令,可以通过三维卷积层对视频进行特征提取,得到各个视频单元的单元特征。当然,该计算机设备也可以通过其他方法获取各个视频单元的单元特征,本申请实施例对此不作限定。
在视频中,相邻的视频帧之间会具有较高的相似性,在本申请实施例中,获取视频单元维度的特征,可以降低数据冗余,降低获取到的特征的数据量,从而可以降低后续运算过程的数据量,降低运算复杂度。
202、服务器基于至少两个视频片段所包括视频单元的单元特征,获取至少两个视频片 段的片段特征。
其中,该片段特征可以用于表示视频片段中视频帧图像的颜色特征、纹理特征等,还可以包括各个视频帧之间的时序关联性。不同视频片段对应于不同的片段特征。
在一种可能实现方式中,该服务器基于各个视频片段所包括的视频单元以及各个视频单元的单元特征,确定各个视频片段的初始片段特征,再对各个视频片段的初始片段特征进行采样,将采样过程中提取出的特征确定为该视频片段的片段特征。需要说明的是,上述对片段特征获取方法的说明,仅是一种示例性说明,本申请实施例对具体采用哪种方法获取片段特征不作限定。
在本申请实施例中,基于视频片段维度的特征执行后续的视频片段定位步骤,从而在运算过程中可以融合视频片段之间的时序关联性,进而可以提高视频片段定位结果的准确率。
203、服务器将至少两个视频片段的片段特征分别与目标文本的文本特征进行特征融合,得到至少两个视频片段的融合片段特征。
其中,该目标文本用于描述一个视频片段,该目标文本可以由用户提供,本申请实施例对该目标文本的具体内容不作限定。
在一种可能实现方式中,服务器获取到目标文本后,可以对该目标文本进行特征提取,得到目标文本的文本特征,需要说明的是,本申请实施例对文本特征提取的具体方法不作限定。该服务器获取到视频片段的片段特征以及文本特征之后,可以将各个片段特征分别与文本特征进行跨模态特征融合,得到各个视频片段的片段融合特征。在本申请实施例中,获取到的片段融合特征充分融合了两种模态的特征,片段融合特征具有更好的表征效果,应用片段融合特征进行后续的视频片段定位,可以提高视频片段定位结果的准确率。
204、服务器基于至少两个视频片段的融合片段特征,得到至少两个视频片段的第一注意力权重,该第一注意力权重用于指示视频片段与目标文本之间的匹配度。
在一种可能实现方式中,该服务器通过至少一个卷积层,对各个视频片段的片段融合特征进行卷积运算,得到各个视频片段的第一注意力权重。其中,该第一注意力权重和视频片段与目标文本之间的匹配度正相关,即为与目标文本匹配度高的视频片段分配较高的注意力权重。
205、服务器根据该第一注意力权重从至少两个视频片段中,获取与该目标文本之间的匹配度满足参考条件的视频片段,作为该视频中与该目标文本相关的目标视频片段。
其中,该参考条件可以由开发人员进行设置,本申请实施例对此不作限定。例如,该参考条件可以设置为将注意力权重最高的视频片段作为该目标视频片段。
本申请实施例提供的技术方案,通过获取视频单元维度的单元特征,根据单元特征确定视频片段的片段特征,获取到的片段特征中融合了多个视频单元的特征和视频单元之间的时序关联性;再将视频片段的片段特征与目标文本的文本特征进行融合,特征融合过程中充分应用了视频片段维度的特征以及各个视频片段之间的时序关联性,从而基于融合后的特征可以获取到更准确的注意力权重,由注意力权重来表示视频片段和目标文本之间的匹配度,进而在基于注意力权重进行视频片段定位时,可以更准确的定位出与目标文本相 匹配的目标视频片段。
上述实施例是对本申请实施方式的一个简要介绍,在一种可能实现方式中,服务器中搭载有视频识别模型,该视频识别模型用于提供视频片段定位功能,服务器可以调用该视频识别模型来执行上述实施例中的各个步骤。图4是本申请实施例提供的一种视频识别模型的结构示意图,该视频识别模型可以为基于深度神经网络构建的模型,例如,该深度神经网络可以为RNN(Recurrent Neural Network,循环神经网络)、CNN(Convolutional Neural Networks,卷积神经网络)等。如图4所示,在一种可能实现方式中,该视频识别模型可以包括特征提取单元401、采样单元402、三维卷积层403、特征融合单元404以及至少一个二维卷积层405。其中,该特征提取单元401可以由至少一个三维卷积层和至少一个一维卷积层构成,通过对视频对应的数字矩阵进行至少一次卷积运算,来提取视频中各个视频单元的特征;该采样单元402可以基于各个视频片段所包括的视频单元以及各个视频单元的单元特征进行特征采样;该三维卷积层403对采样单元的输出结果进行卷积运算,得到各个视频片段的片段特征;该特征融合单元404用于对视频片段的片段特征和目标文本的文本特征进行融合;该至少一个二维卷积层405通过对融合后的特征进行至少一次卷积运算,得到各个视频片段的注意力权重。需要说明的是,本申请实施例对该视频识别模型中特征提取单元、采样单元、三维卷积层、特征融合单元以及至少一个二维卷积层的具体数目和连接方式不作限定。
图5是本申请实施例提供的一种视频片段定位方法的具体流程图,以下结合图4和图5,以服务器为执行主体,对上述视频片段定位方法进行说明:
501、服务器对视频中视频单元进行特征提取,得到视频单元的单元特征。
在一种可能实现方式中,服务器接收到终端发送的视频片段定位请求,调用视频识别模型,通过该视频识别模型中的特征提取单元,来提取各个视频单元的单元特征。其中,该终端可以为任一用户使用的终端,用户可以通过终端向服务器发送视频片段定位请求,来查询感兴趣的视频片段。需要说明的是,本申请实施例对该视频片段定位请求的具体触发方式不作限定。
在一种可能实现方式中,该视频片段定位请求可以包括用于描述一个视频片段的目标文本以及视频标识,其中,该视频标识可以用于唯一地指示一个视频片段,该服务器响应于该视频片段定位请求,可以获取该视频标识所指示的视频,基于该视频以及目标文本执行后续的视频片段定位步骤。
在一种可能实现方式中,该视频片段定位请求可以包括目标文本,在这种情况下,服务器响应于该视频片段定位请求,可以先获取与该目标文本匹配的至少一个视频,基于该至少一个视频和该目标文本执行后续的视频片段定位步骤。需要说明的是,本申请实施例对该视频片段定位请求所包括的具体信息不作限定。在本申请实施例中,仅以对一个视频进行视频片段为例进行说明。
在本申请实施例中,以视频识别模型的特征提取单元包括一个三维卷积层和一个一维卷积层为例,对单元特征的获取过程进行说明。在一种可能实现方式中,该服务器将该视频中的各个视频帧转换为由一组像素值组成的数字矩阵,当然,该服务器还可以对各个视 频帧进行尺寸变换、降噪处理等,本申请实施例对此不作限定。服务器将各个视频帧对应的数字矩阵输入视频识别模型,先由特征提取单元中的三维卷积层对各个视频帧对应的数字矩阵进行卷积运算,得到各个视频单元的初始单元特征。再通过一维卷积层对初始单元特征进行降维处理,得到视频单元的单元特征。以视频单元的时长为1秒,包括25个视频帧为例,对于每一个视频单元,三维卷积层的卷积核对这25个视频帧对应的数字矩阵进行卷积运算,得到初始单元特征;将各个视频单元的初始单元特征按照视频单元的时序顺序进行排列,得到特征F v;通过一维卷积层对特征F v进行卷积运算,得到特征F′ v,特征F′ v中的一个特征即为一个视频单元的单元特征。具体地,该一维卷积的过程可以表示为下述公式(1):
F′ v=Conv1d(F v)         (1)
其中,
Figure PCTCN2021100860-appb-000001
F′ v中的每个元素,即每个单元特征的维度是
Figure PCTCN2021100860-appb-000002
F′ v包括T个单元特征;r表示维度的衰减倍数;Conv1d()表示一维卷积运算,该一维卷积运算所应用的卷积核大小可以由开发人员进行设置,本申请实施例对此不作具体限定。例如,可以将卷积核设置为3,以获取到视频单元维度的时序关联信息。
需要说明的是,上述对获取视频单元的单元特征的说明,仅是一种示例性说明,本申请实施例对具体采用哪种方法获取单元特征不作限定。
502、服务器基于至少两个视频片段所包括视频单元的单元特征,确定该至少两个视频片段的初始片段特征。
在一种可能实现方式中,对于一个视频片段,服务器可以获取视频片段所包括视频单元的单元特征,基于各个视频单元的时序顺序,对其单元特征进行拼接,例如,可以将各个视频单元的单元特征顺序连接。将拼接后的单元特征作为视频片段的初始片段特征。需要说明的是,上述对初始片段特征获取方法的说明,仅是一种示例性说明,本申请实施例对具体采用哪种方法获取该初始片段特征不作限定。
503、服务器对至少两个视频片段的初始片段特征进行采样,得到该至少两个视频片段的片段特征。
在本申请实施例中,服务器基于该视频片段的时长,确定该视频片段对应的采样时刻。对于一个视频片段,服务器可以基于该视频片段对应的采样时刻,对该视频片段的初始片段特征进行采样,得到该视频片段的片段特征。其中,每个视频片段对应的采样时刻的数目相同,该采样时刻的数目可以由开发人员进行设置,本申请实施例对此不作限定。基于相同数目的采样时刻进行采样,可以将不同时长的视频片段采样到固定时长,每个视频片段可以对应于相同维度的特征,以便视频识别模型进行后续的运算过程。图6是本申请实施例提供的一种采样方法示意图,结合图6,以对一个视频片段的初始片段特征进行采样为例进行说明。如图6中的(a)图所示,视频片段601在视频中的起始时刻为第2秒,持续时长为3秒,该视频片段的初始片段特征602包括单元特征603、604和605,该视频片段可以对应于两个采样时刻,例如分别为采样时刻606和采样时刻607。以采样时刻606为例,该采样时刻606为两个视频单元之间的时刻,在该时刻进行采样时,需要对单元特征603和单元特征604进行加权运算,得到采样特征。例如,两个特征单元的总权重为1,由于采样时刻606 为两个视频单元之间的时刻,则单元特征603和单元特征604的权重均为0.5,也即是,服务器可以对两个单元特征中相同位置的元素相加再取平均,得到采样特征。如图6中的(b)图所示,若采样时刻608所指示的时刻为6.3秒,则在该时刻进行采样时,单元特征609对应的权重为1-dec(t n),其中,dec()表示取小数,t n表示采样时刻,即单元特征609对应的权重为0.7,单元特征610对应的权重为dec(t n),即0.3,服务器分别将单元特征609和特征单元610与其对应的特征相乘,再将加权后的两个特征相加,得到采样特征。
在一种可能实现方式中,服务器通过构造采样矩阵,来进行采样。例如,服务器可以基于该至少两个视频片段对应的采样时刻以及该至少两个视频片段在该视频中的位置信息,构造采样矩阵;将该采样矩阵与该至少两个视频片段的初始片段特征相乘,得到采样特征矩阵,该采样特征矩阵中的一个特征用于表示一个视频片段的采样特征。具体地,上述采样过程可以表示为下述公式(2),采样矩阵中各个元素可以基于下述公式(3)确定:
Figure PCTCN2021100860-appb-000003
Figure PCTCN2021100860-appb-000004
其中,F′ v表示视频对应的单元特征序列;
Figure PCTCN2021100860-appb-000005
表示矩阵乘法;W 1表示采样矩阵,
Figure PCTCN2021100860-appb-000006
T表示视频片段的起始时刻,S表示视频片段的持续时长,则(T×S)表示视频片段在视频中的位置;N表示采样时刻的数目;t n表示采样时刻;dec(t n)表示对t n取小数;
Figure PCTCN2021100860-appb-000007
表示向下取整,即取t n的整数部分。在卷积运算时,采样矩阵W 1可以基于各个视频片段在视频中的位置,确定该视频片段所包括的单元特征,即确定出视频单元的初始单元特征,基于各个视频片段的初始单元特征进行采样,得到采样特征矩阵F″ v
在本申请实施例中,服务器可以对该至少两个视频片段的采样特征进行降维处理,得到该至少两个视频片段的片段特征。在一种可能实现方式中,服务器可以通过三维卷积层对采样特征矩阵进行卷积,以在采样时序维度上对各个视频片段的采样特征进行降维处理。上述降维处理的过程可以表示为下述公式(4):
F vp=Conv3d(F″ v)          (4)
其中,F″ v表示采样特征矩阵;Conv3d()表示三维卷积运算;F vp是片段特征矩阵,F vp中的一个特征用于表示一个视频片段的片段特征。
图7是本申请实施例提供的一种片段特征获取方法的示意图,结合图7,对上述片段特征获取方法进行说明。在一种可能实现方式中,对于视频片段701,其初始片段特征702包括单元特征703、704、705和706,该初始单元特征对应702对应采样时刻707、708和709。以在采样时刻708进行采样为例,可以对单元特征704和705求和取平均,得到采样时刻708对应的采样特征710,再基于各个采样时刻对应的采样特征,得到视频片段701的采样特征711。服务器基于各个视频片段在视频中的位置信息以及各个视频片段的采样特征,构造特征图712,该特征图的横向为视频片段的起始时间,纵向为视频片段的持续时长,一个位置 用于存储一个视频片段的采样特征,例如,其中713位置表示起始时刻为0秒,持续时长为4秒的视频片段的采样特征。
该特征图712中各个位置均存储视频片段的采样特征,即得到采样特征矩阵F″ v,通过三维卷积层对该采样特征矩阵F″ v进行降维处理,得到片段特征矩阵F vp,即矩阵714,该矩阵714中的一个特征715表示一个视频片段的片段特征。
需要说明的是,上述步骤502和步骤503,是服务器基于至少两个视频片段所包括视频单元的单元特征,获取该至少两个视频片段的片段特征。在本申请实施例中,对单元特征进行特征提取,得到片段特征,一方面,可以在片段特征中融合各个视频单元的单元特征以及单元特征之间的时序关系,另一方面,通过采样使不同时长的视频片段均对应于相同维度的片段特征,便于模型基于片段特征进行后续的运算。
504、服务器获取目标文本的文本特征。
其中,该目标文本为用于描述一个视频片段的一段文本,例如,用户在进行视频片段检索时输入的一段文本。
在一种可能实现方式中,服务器获取目标文本中的各个单词的one-hot(独热)编码,通过Embed(词嵌入)层将各个单词的one-hot编码映射为词向量。其中,该Embed层可以为表现为一个全连接层,服务器通过将各个单词的one-hot编码与该全连接层的系数矩阵相乘,得到各个单词的词向量,从而得到目标文本的向量表示。服务器可以将目标文本的向量表示输入GRU(Gate Recurrent Unit,循环神经网络)由该循环神经网络基于目标文本的向量表示来提取目标文本的文本特征。需要说明的是,上述对目标文本的文本特征获取方法的说明,仅是一种示例性说明,本申请实施例对具体采用哪种方法获取该文本特征不作限定。
需要说明的是,在本申请实施例中,采用先获取视频片段的片段特征,再获取目标文本的文本特征的执行顺序进行描述,在一些实施例中,也可以先执行获取文本特征的步骤,再执行获取片段特征的步骤,或者两个步骤同时执行,本申请实施例对此不作限定。
505、服务器将至少两个视频片段的片段特征分别与目标文本的文本特征进行特征融合,得到至少两个视频片段的融合片段特征。
在一种可能实现方式中,服务器可以通过视频识别模型中的特征融合单元对片段特征和文本特征进行跨模态特征融合。首先,服务器基于该至少两个视频片段的片段特征以及该至少两个视频片段在该视频中的位置信息,构造该视频对应的第一特征矩阵,也即是步骤503中的片段特征矩阵F vp,在步骤503中,通过矩阵卷积可以直接得到该片段特征矩阵F vp,则此处可以无需再次构造该片段特征矩阵F vp,若片段特征是基于其他方式获取的,则此处需要构造出片段特征矩阵F vp。然后,服务器基于该第一特征矩阵的维度,对该文本特征进行维度扩展,得到扩展矩阵,其中,该扩展矩阵的维度与该第一特征矩阵的维度相同,以便于进行特征融合。最后,服务器将该第一特征矩阵与该扩展矩阵进行特征融合,得到该至少两个视频片段的融合片段特征。例如,服务器将该第一特征矩阵与该扩展矩阵中相同位置的元素相乘,得到中间特征矩阵;对该中间特征矩阵进行池化处理,得到第二特征矩 阵,该第二特征矩阵中的一个特征用于表示一个视频片段的融合片段特征。具体地,以采用双线性池化方法进行特征融合为例,对上述特征融合过程进行说明,在一种可能实现方式中,服务器可以将两种模态的特征输入线性层,即全连接层,将经过线性变换的两种模态的特征中相同位置元素相乘,得到中间特征矩阵,对该中间特征矩阵进行池化处理,得到第二特征矩阵。上述双线性池化特征融合方法可以表示为下述公式(5):
Figure PCTCN2021100860-appb-000008
其中,
Figure PCTCN2021100860-appb-000009
为可学习参数,可以表示为两个全连接层,各个全连接层中的参数数值可以在模型训练过程中确定;F vp表示视频对应的第一特征矩阵;F q表示目标文本的文本特征,Tile(F q)表示将文本特征F q沿着T维度和S维度分别进行复制;°表示两个矩阵中相同位置的元素相乘;SumPool(x,K)表示使用大小为K的滑动窗口对x进行加和池化;F ap表示第二特征矩阵。
506、服务器基于至少两个视频片段的融合片段特征,得到该至少两个视频片段的第一注意力权重。
其中,第一注意力权重用于指示视频片段与该目标文本之间的匹配度,在本申请实施例中,第一注意力权重的取值与视频片段和目标文本之间的匹配度正相关。
在一种可能实现方式中,服务器通过视频识别模型中的至少一个二维卷积层,对特征融合后得到的第二特征矩阵进行至少一次卷积运算,得到第一注意力矩阵,当然,该服务器还可以对卷积运算的结果进行归一化处理,将归一化处理后的矩阵作为第一注意力矩阵,该第一注意力矩阵中的一个元素用于表示一个视频片段的该第一注意力权重。上述第一注意力矩阵的获取方法可以表示为下述公式(6):
Att p=Softmax(Conv2d(F ap))       (6)
其中,F ap表示第二特征矩阵;Conv2d()表示二维卷积运算;Softmax()表示归一化处理函数。
507、服务器根据该第一注意力权重,从至少两个视频片段中获取与该目标文本之间的匹配度满足参考条件的视频片段,作为该视频中与该目标文本相关的目标视频片段。
其中,该参考条件可以由开发人员进行设置,本申请实施例对此不作限定。例如,该参考条件可以设置为将第一注意力权重最高的视频片段确定为目标视频片段,也可以设置为将第一注意力权重大于权重阈值的视频片段确定为目标视频片段。
本申请实施例提供的技术方案,通过获取视频单元维度的单元特征,根据单元特征确定视频片段的片段特征,获取到的片段特征中融合了多个视频单元的特征和视频单元之间的时序关联性;再将视频片段的片段特征与目标文本的文本特征进行融合,特征融合过程中充分应用了视频片段维度的特征以及各个视频片段之间的时序关联性,从而基于融合后的特征可以获取到更准确的注意力权重,由注意力权重来表示视频片段和目标文本之间的 匹配度,进而在基于注意力权重进行视频片段定位时,可以更准确的定位出与目标文本相匹配的目标视频片段。
上述实施例主要介绍了基于视频片段维度的特征进行视频片段定位的过程,在本申请实施例中,还可以获取各个视频单元与目标文本之间的匹配度,基于各个视频单元与目标文本之间的匹配度,对各个视频片段的第一注意力权重进行调整,基于调整后的第一注意力权重进行视频片段定位。图8是本申请实施例提供的一种第一注意力权重调整方法的示意图,参见图8,该方法可以包括以下步骤:
801、服务器将视频单元的单元特征分别与目标文本的文本特征进行融合,得到该视频单元的融合单元特征。
在一种可能实现方式中,服务器获取到各个视频单元的单元特征后,可以对各个单元特征进行采样和降维处理,使单元特征更容易被视频识别模型理解。以对时长为1秒的视频单元的单元特征进行处理为例,服务器可以将采样矩阵W 2与视频对应的单元特征序列F′ v相乘,以对单元特征进行采样。其中,由于各个视频单元的时长为1秒,则采样矩阵
Figure PCTCN2021100860-appb-000010
服务器将采样结果输入视频识别模型中的三维卷积层,由三维卷积层对采样结果进行降维处理,得到处理后的单元特征序列F vc,单元特征序列F vc中的一个特征即为一个处理后的单元特征。其中,该三维卷积层与步骤503中对片段特征进行降维处理时应用的三维卷积层相同。
在一种可能实现方式中,服务器可以基于单元特征序列F vc的维度,对文本特征的维度进行扩展,将扩展后的文本特征与单元特征序列F vc进行特征融合,得到融合单元特征序列F ac,融合单元特征序列F ac中的一个特征基于一个视频单元的融合单元特征。需要说明的是,该融合单元特征的获取方法与上述步骤505中融合片段特征的获取方法同理,在此不作赘述。上述融合单元特征的获取方法可以表示为下述公式(7):
Figure PCTCN2021100860-appb-000011
其中,
Figure PCTCN2021100860-appb-000012
为可学习参数,可以表示为两个全连接层,各个全连接层中的参数数值可以在模型训练过程中确定;F vc表示视频对应的第一特征矩阵;F q表示目标文本的文本特征,Tile(F q)表示将文本特征F q沿着T维度和S维度分别进行复制;°表示两个矩阵中相同位置的元素相乘;SumPool(x,K)表示使用大小为K的滑动窗口对x进行加和池化;F ac表示融合单元特征序列。
802、服务器基于视频单元的融合单元特征,得到视频单元的第二注意力权重。
在一种可能实现方式中,服务器可以对融合单元特征序列F ac进行二维卷积,对卷积结果进行归一化处理,再将归一化处理后的到矩阵与视频的全局特征矩阵相乘,得到第二注意力矩阵,该第二注意力矩阵中的一个元素用于表示一个视频单元的第二注意力权重。其中,该视频的全局特征矩阵可以基于步骤503中获取的片段特征矩阵和步骤506中获取的第一注意力矩阵得到,具体可以表示为下述公式(8):
Figure PCTCN2021100860-appb-000013
其中,
Figure PCTCN2021100860-appb-000014
表示全局特征矩阵;F vp表示片段特征矩阵;Att p表示第一注意力矩阵;表示矩阵乘法。
上述第二注意力矩阵的获取方法可以表示为下述公式(9):
Figure PCTCN2021100860-appb-000015
其中,Att c表示第二注意力矩阵;F ac表示融合单元特征序列;Conv2d()表示二维卷积运算;
Figure PCTCN2021100860-appb-000016
表示全局特征矩阵;Softmax()表示归一化处理函数。
需要说明的是,上述步骤801和步骤802,是获取该至少两个视频单元的第二注意力权重的步骤。在本申请实施例中,获取视频单元级别的注意力权重,基于多级别的注意力权重进行后续的视频片段定位,可以提高视频片段定位结果的准确性。
803、服务器基于至少两个视频片段所包括视频单元的第二注意力权重,对至少两个视频片段的第一注意力权重进行调整。
在一种可能实现方式中,对于至少两个视频片段中的任一视频片段:目标视频片段,服务器从目标视频片段包括的视频单元中,确定该目标视频片段的中心时刻对应的目标视频单元;基于该目标视频单元的第二注意力权重,对该目标视频片段的第一注意力权重进行调整。上述对第一注意力权重进行调整的过程可以表示为下述公式(10):
Att′ p(i)=Att p(i)+αAtt c(j)        (10)
其中,i表示第i个视频片段,Att p(i)表示第i个视频片段的第一注意力权重;j表示第j个视频单元,j的具体数值为
Figure PCTCN2021100860-appb-000017
T i为第i个视频片段的起始时刻,S i为第i个视频片段的持续时长,Att c(j)表示第j个视频单元的第二注意力权重;Att′ p(i)表示调整后的第一注意力权重;α表示超参数,其具体数值可以由开发人员进行设置,本申请实施例对此不作限定。
本申请实施例提供的技术方案,将视频识别模型扩展为多级别结构,即包括视频片段级的数据处理分支和视频单元级的数据处理分支,获取到视频单元维度的第二注意力权重,应用第二注意力权重对视频片段维度的第一注意力权重进行调整,以提高第一注意力权重的准确性,进而可以提高视频片段定位结果的准确性。
上述实施例介绍了基于自然语言描述进行视频片段定位的过程,在本申请实施例中,确定出目标视频片段后,可以对目标视频片段进行显示。
在一种可能实现方式中,服务器可以将视频片段定位结果发送到用户使用的终端,由终端在视频的播放界面显示标注信息,该标注信息用于指示该目标视频片段的起始时刻和结束时刻。例如,用户在终端上观看视频,有视频片段搜索需求时,可以在该视频的播放界面中搜索区域输入目标文本,点击搜索控件,终端响应于将测到用户对搜索控件的触发操作,生成视频片段定位请求,该视频片段定位请求包括该视频的视频标识和目标文本。 当然,该终端也可以通过其他方式生成该视频片段定位请求,本申请实施例对此不作限定。终端将该视频片段定位请求发送到服务器,由服务器在视频中定位出与该目标文本相匹配的目标视频片段,该服务器可以将该目标视频片段的起始时刻和持续时长发送至终端。终端可以基于该目标视频片段的起始时刻和持续时长,在该播放界面的播放进度条中,对该目标视频片段的起始时刻和结束时刻进行标注。
参见图9,图9是本申请实施例提供的一种目标视频片段的显示方式示意图,该播放界面包括视频播放区域901和视频播放进度栏902,终端可以在该视频播放进度栏902中显示标注信息,该标注信息用于指示该目标视频片段的起始时刻和结束时刻。在一种可能实现方式中,终端还可以跳转至目标视频片段进行播放,即从当前播放时刻跳转至目标视频片段的起始时刻,从该起始时刻开始播放视频。在一种可能实现方式中,该服务器还可以从视频中截取出该目标视频片段,生成该目标视频片段的播放链接,将该播放链接发送至终端,由终端在该视频的该播放界面显示该目标视频片段的链接或超链接,该链接或超链接用于提供对该目标视频片段进行播放的功能。
参见图10,图10是本申请实施例提供的另一种目标视频片段显示方式示意图,该播放界面包括视频播放区域1001和视频片段显示区域1002。需要说明的是,本申请实施例对该视频片段显示区域1002在播放界面中的位置不作限定,在本申请实施例中,以该视频片段显示区域1002在视频播放区域1001的下方为例。终端可以以超链接的形式,在该视频片段显示区域1002显示该目标视频片段的播放入口1003,终端响应于用户点击该播放入口1003,跳转至该目标视频片段对应的播放界面,播放该目标视频片段。
在本申请实施例中,若该视频片段定位请求中不包括视频标识,即不是对某一视频进行视频片段定位,则服务器将目标文本与多个视频中的视频片段进行匹配,获取到来自多个视频的目标视频片段。在一种可能实现方式中,服务器可以为每个目标视频片段生成播放链接,在终端分别显示各个视频片段的播放链接,由用户点击各个播放链接进行视频片段播放。在一种可能实现方式中,服务器可以基于多个目标视频片段生成一个影片集,将该影片集的链接或超链接发送至终端进行显示,用户可以在该影片集中观看到多个感兴趣的目标视频片段,还可以将该影片集存储至终端。在本申请实施例中,通过生成影片集可以提高视频观看的趣味性,提升用户体验。
在本申请实施例中,用户在进行视频片段定位时,只需提供一段用于描述视频片段的文本即可,无需人工对大量的视频进行检索,服务器对视频片段定位完成后,再由终端对服务器的视频片段定位结果进行显示,用户可以快速获取到感兴趣的视频片段,提高了视频片段定位效率。
上述实施例主要介绍了应用视频识别模型进行视频片段定位,显示视频片段定位结果的过程,而在视频片段定位之前,需对该视频识别模型进行训练,以调整视频识别模型中各个运算层的参数,在本申请实施例中,服务器可以搭载有重构模块,由该重构模块基于视频片段的片段特征预测出第一候选文本,基于第一候选文本与目标文本之间的误差,调整视频识别模型的各个参数。图11是本申请实施例提供的一种视频识别模型训练方法的流程图,参见图11,该过程具体可以包括以下步骤:
1101、服务器初始化视频识别模型中的各个参数。
在一种可能实现方式中,服务器通过对该视频识别模型中各个卷积层、池化层、全连接层的参数进行随机赋值,来实现参数初始化。例如,服务器可以采用方差为0.01,均值为0的高斯分布对该视频识别模型进行参数初始化,需要说明的是,本申请实施例对模型参数初始化的具体方法不作限定。
1102、服务器将训练数据集输入视频识别模型。
其中,该训练数据集可以包括多个样本视频,该多个样本视频为已标注的样本视频,每个样本视频均标注出其对应的文本信息。在本申请实施例中,通过弱监督的方式进行模型训练,无需时序上细粒度的标注,即无需标注各个视频片段的起始时刻、结束时刻以及对应的文本信息,降低训练数据集的获取难度。
在一种可能实现方式中,服务器将多个已标注的样本视频输入视频识别模型,该视频识别模型基于样本视频中视频片段与文本信息之间的匹配度,输出由该文本信息定位到的目标视频片段。需要说明的是,该目标视频片段的获取方法与上述步骤501至507中视频片段定位的过程同理,在此不作赘述。
1103、服务器基于视频识别模型输出的第一注意力权重以及片段特征,确定第一候选文本,获取第一候选文本与目标文本之间的第一误差值。
在一种可能实现方式中,首先,服务器基于至少两个视频片段的第一注意力权重对该至少两个视频片段的片段特征进行加权运算,得到该至少两个视频片段的加权片段特征。具体地,该服务器可以将片段特征矩阵与第一注意力矩阵相乘,得到全局特征矩阵
Figure PCTCN2021100860-appb-000018
该全局特征矩阵
Figure PCTCN2021100860-appb-000019
中的一个特征即为一个视频片段的加权片段特征。
然后,服务器通过长短时记忆网络对该至少两个视频片段的加权片段特征进行特征提取,基于提取到的特征确定第一候选文本。在一种可能实现方式中,对第m个第一候选文本中的第m个词进行预测时,服务器可以将第m-1个词的CloVe词向量、第m-1个词的LSTM(Long Short-Term Memory,长短时记忆网路)隐层特征以及全局特征矩阵进行拼接,由长短时记忆网络基于拼接结果确定第m个词的隐层特征,基于获取到的隐层特征,确定该第m个词。上述获取第m个词的隐层特征的方法可以表示为下述公式(11):
Figure PCTCN2021100860-appb-000020
其中,
Figure PCTCN2021100860-appb-000021
表示全局特征矩阵;h m-1表示第m-1个词的隐层特征;e m-1表示第m-1个词的CloVe词向量;
Figure PCTCN2021100860-appb-000022
表示将
Figure PCTCN2021100860-appb-000023
h m-1和e m-1进行拼接,例如,将
Figure PCTCN2021100860-appb-000024
h m-1和e m-1首尾相连进行拼接;h m表示第m个词的隐层特征。
最后,服务器获取该第一候选文本与该目标文本之间的误差值。在一种可能实现方式中,可以通过生成损失函数来获取该第一误差值,具体可以表示为下述公式(12):
Figure PCTCN2021100860-appb-000025
其中,M表示第一候选文本中的单词数量,m表示单词序号;
Figure PCTCN2021100860-appb-000026
表示全局特征矩阵;h m-1表示第m-1个词的隐层特征;w m-1表示第m-1个词的编码表示。
1104、服务器基于该第一误差值对视频识别模型中各个运算层的参数进行调整,直到符合目标条件时,得到训练好的视频识别模型。
在一种可能实现方式中,该服务器可以将获取的第一误差值与误差阈值进行比较,当第一误差值大于误差阈值时,该计算机设备将该第一误差值反向传播到该视频识别模型,基于第一误差值求解该视频识别模型中的各个参数,该各个参数包括多个卷积核对应的参数、池化层对应的参数、各个全连接层的对应的参数等。其中,该误差阈值均可以由开发人员设置。
在本申请实施例中,该目标条件可以由开发人员进行设置,在一种可能实现方式中,该目标条件可以设置为获取到的输出结果正确的个数到达目标数目,其中,该目标数目可以由开发人员进行设置。当该第一误差值小于误差阈值时,则认为该服务器获取的目标识别结果正确,该服务器继续读取下一个样本视频,执行步骤1103,若该服务器获取到的输出结果正确的个数到达目标数目时,也即是符合该目标条件时,则认为该视频识别模型训练完毕。
需要说明的是,上述对视频识别模型训练方法的说明,仅是一种示例性说明,本申请实施例对具体采用哪种方法训练视频识别模型不作限定。
在一种可能实现方式中,当视频识别模型包括视频片段级的数据处理分支和视频单元级的数据处理分支时,服务器还可以基于视频单元级别的特征预测第二候选文本,基于该第二候选文本确定第二误差值,该第二误差值的获取方法可以表示为下述公式(13):
Figure PCTCN2021100860-appb-000027
其中,
Figure PCTCN2021100860-appb-000028
表示第二误差值;
Figure PCTCN2021100860-appb-000029
表示基于视频单元级的特征获取的全局特征矩阵;h m-1表示第m-1个词的隐层特征;w m-1表示第m-1个词的编码表示。需要说明的是,上述获取第二误差值的过程与步骤1103中获取第一误差值的过程同理,在此不做赘述。
在一种可能实现方式中,可以基于第一误差值
Figure PCTCN2021100860-appb-000030
和第二误差值
Figure PCTCN2021100860-appb-000031
得到一个总误差值L cap,基于该总误差值对视频识别模型中的参数进行调整。其中,该总误差值L cap可以表示为下述公式(14):
Figure PCTCN2021100860-appb-000032
其中,
Figure PCTCN2021100860-appb-000033
表示第一误差值,
Figure PCTCN2021100860-appb-000034
表示第二误差值,λ的数值可以由开发人员进行设置,本申请实施例对此不作限定。
图12是本申请实施例提供的一种视频识别模型数据处理过程的示意图,结合图12,对上述过程进行说明。在一种可能实现方式中,视频识别模型的数据处理过程可以包括特征提取阶段1201、注意力阶段1202以及重构阶段1203。在特征提取阶段1201,可以通过至少 一个卷积层获取视频片段维度的特征1204、视频单元维度的特征1205以及文本特征1206;在注意力阶段1202,对文本特征和视频特征进行特征融合,通过至少一个二维卷积层,对融合后的特征进行卷积运算,得到视频片段维度的第一注意力权重1207以及视频单元维度的第二注意力权重1208。在测试过程中,可以基于第二注意力权重1208对第一注意力权重1207进行调整,基于调整后的第二注意力权重预测出目标视频片段1209,。在模型训练过程中,可以基于视频片段维度的特征1204和第一注意力权重1207,得到视频片段维度的第一全局特征1209,基于视频单元维度的特征1205和第二注意力权重1208,得到视频单元维度的第二全局特征1210,应用参数共享的LSTM网络,分别基于第一全局特征1209和第二全局特征1210,进行候选文本预测,通过损失函数,确定出候选文本与目标文本之间的误差。在本申请实施例中,基于视频片段和视频单元两个级别的数据进行模型训练,可以获取到模型表现更好的视频识别模型。
上述所有可选技术方案,可以采用任意结合形成本申请的可选实施例,在此不再一一赘述。
图13是本申请实施例提供的一种视频片段定位装置的结构示意图,参见图13,该装置包括:
第一获取模块1301,用于对视频中至少两个视频片段包括的视频单元进行特征提取,得到该视频单元的单元特征;
第二获取模块1302,用于基于至少两个视频片段所包括视频单元的单元特征,获取该至少两个视频片段的片段特征;
特征融合模块1303,用于将该至少两个视频片段的片段特征分别与目标文本的文本特征进行特征融合,得到该至少两个视频片段的融合片段特征;
第三获取模块1304,用于基于该至少两个视频片段的融合片段特征,得到该至少两个视频片段的第一注意力权重,该第一注意力权重用于指示视频片段与该目标文本之间的匹配度;
第四获取模块1305,用于根据所述第一注意力权重,从该至少两个视频片段中,获取与该目标文本之间的匹配度满足参考条件的视频片段,作为所述视频中与所述目标文本相关的目标视频片段。
在一种可能实现方式中,该第二获取模块1302包括:
初始片段特征获取子模块,用于基于该至少两个视频片段所包括视频单元的单元特征,确定该至少两个视频片段的初始片段特征;
采样子模块,用于对该至少两个视频片段的初始片段特征进行采样,得到该至少两个视频片段的片段特征。
在一种可能实现方式中,该采样子模块包括:
采样时刻确定单元,用于基于该视频片段的时长,确定该视频片段对应的采样时刻,每个视频片段对应的采样时刻的数目相同;
采样单元,用于基于该视频片段对应的采样时刻,对该视频片段的初始片段特征进行采样,得到该视频片段的片段特征。
在一种可能实现方式中,该采样单元用于:
基于该至少两个视频片段对应的采样时刻以及该至少两个视频片段在该视频中的位置信息,构造采样矩阵;
将该采样矩阵与该至少两个视频片段的初始片段特征相乘,得到采样特征矩阵,该采样特征矩阵中的一个特征用于表示一个视频片段的采样特征;
对该至少两个视频片段的采样特征进行降维处理,得到该至少两个视频片段的片段特征。
在一种可能实现方式中,该特征融合模块1303包括:
文本特征获取子模块,用于获取该目标文本的文本特征;
矩阵构造子模块,用于基于该至少两个视频片段的片段特征以及该至少两个视频片段在该视频中的位置信息,构造该视频对应的第一特征矩阵;
扩展子模块,用于基于该第一特征矩阵的维度,对该文本特征进行维度扩展,得到扩展矩阵,该扩展矩阵的维度与该第一特征矩阵的维度相同;
特征融合子模块,用于将该第一特征矩阵与该扩展矩阵进行特征融合,得到该至少两个视频片段的融合片段特征。
在一种可能实现方式中,该特征融合子模块用于:
将该第一特征矩阵与该扩展矩阵中相同位置的元素相乘,得到中间特征矩阵;
对该中间特征矩阵进行池化处理,得到第二特征矩阵,该第二特征矩阵中的一个特征用于表示一个视频片段的融合片段特征。
在一种可能实现方式中,该第三获取模块1304用于:
对该第二特征矩阵进行至少一次卷积运算,得到第一注意力矩阵,该第一注意力矩阵中的一个元素用于表示一个视频片段的该第一注意力权重。
在一种可能实现方式中,该装置还包括:
第五获取模块,用于获取该视频单元的第二注意力权重,该第二注意力权重用于指示视频单元与该目标文本之间的匹配度;
调整模块,用于基于该至少两个视频片段所包括视频单元的第二注意力权重,对该至少两个视频片段的第一注意力权重进行调整。
在一种可能实现方式中,该第五获取模块用于:
将该视频单元的单元特征分别与该目标文本的文本特征进行融合,得到该视频单元的融合单元特征;
基于该视频单元的融合单元特征,得到该视频单元的第二注意力权重。
在一种可能实现方式中,针对所述至少两个视频片段中的目标视频片段,该调整模块用于:
从该目标视频片段包括的视频单元中,确定该目标视频片段的中心时刻对应的目标视频单元;
基于该目标视频单元的第二注意力权重,对该目标视频片段的第一注意力权重进行调整。
在一种可能实现方式中,该装置还包括显示模块,用于执行下述任一项:
在该视频的播放界面显示标注信息,该标注信息用于指示该目标视频片段的起始时刻和结束时刻;或者
在该视频的该播放界面显示该目标视频片段的链接,该链接用于提供对该目标视频片段进行播放的功能。
在一种可能实现方式中,该装置还包括:
第六获取模块,用于基于该至少两个视频片段的第一注意力权重对该至少两个视频片段的片段特征进行加权运算,得到该至少两个视频片段的加权片段特征;
第七获取模块,用于通过长短时记忆网络对该至少两个视频片段的加权片段特征进行特征提取,基于提取到的特征确定第一候选文本;
第八获取模块,用于获取该第一候选文本与该目标文本之间的第一误差值。
本申请实施例提供的装置,通过获取视频单元维度的单元特征,根据单元特征确定视频片段的片段特征,获取到的片段特征中融合了多个视频单元的特征和视频单元之间的时序关联性;再将视频片段的片段特征与目标文本的文本特征进行融合,特征融合过程中充分应用了视频片段维度的特征以及各个视频片段之间的时序关联性,从而基于融合后的特征可以获取到更准确的注意力权重,由注意力权重来表示视频片段和目标文本之间的匹配度,进而在基于注意力权重进行视频片段定位时,可以更准确的定位出与目标文本相匹配的目标视频片段。
需要说明的是:上述实施例提供的视频片段定位装置在视频片段定位时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将装置的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的视频片段定位装置与视频片段定位方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
上述技术方案所提供的计算机设备可以实现为终端或服务器,例如,图14是本申请实施例提供的一种终端的结构示意图。该终端1400可以是:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。终端1400还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,终端1400包括有:一个或多个处理器1401和一个或多个存储器1402。
存储器1402可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器1402还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器1402中的非暂态的计算机可读存储介质用于存储至少一条程序代码,该至少一条程序代码用于被处理器1401所执行以实现本申请中方法实施例提供的视频片段定位方法。
在一些实施例中,终端1400还可选包括有:外围设备接口1403和至少一个外围设备。处理器1401、存储器1402和外围设备接口1403之间可以通过总线或信号线相连。各个外围 设备可以通过总线、信号线或电路板与外围设备接口1403相连。具体地,外围设备包括:射频电路1404、显示屏1405、摄像头组件1406、音频电路1407、定位组件1408和电源1409中的至少一种。
在一些实施例中,终端1400还包括有一个或多个传感器1410。该一个或多个传感器1410包括但不限于:加速度传感器1411、陀螺仪传感器1412、压力传感器1413、指纹传感器1414、光学传感器1415以及接近传感器1416。
本领域技术人员可以理解,图14中示出的结构并不构成对终端1400的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
图15是本申请实施例提供的一种服务器的结构示意图,该服务器1500可因配置或性能不同而产生比较大的差异,可以包括一个或多个处理器(Central Processing Units,CPU)1501和一个或多个的存储器1502,其中,该一个或多个存储器1502中存储有至少一条程序代码,该至少一条程序代码由该一个或多个处理器1501加载并执行以实现上述各个方法实施例提供的方法。当然,该服务器1500还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该服务器1500还可以包括其他用于实现设备功能的部件,在此不做赘述。
在示例性实施例中,还提供了一种计算机可读存储介质,所述存储介质用于存储计算机程序,所述计算机程序用于执行以完成上述实施例中的视频片段定位方法。例如,该计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。
本申请实施例还提供了一种包括指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述实施例提供的方法。
在示例性实施例中,还提供了一种计算机程序产品,该计算机程序产品包括至少一条程序代码,该至少一条程序代码存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该至少一条程序代码,处理器执行该至少一条程序代码,使得该计算机设备实现该视频片段定位方法所执行的操作。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来至少一条程序代码相关的硬件完成,该程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
上述仅为本申请的可选实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (16)

  1. 一种视频片段定位方法,所述方法由视频识别平台执行,所述方法包括:
    对视频中至少两个视频片段包括的视频单元进行特征提取,得到所述视频单元的单元特征;
    基于所述至少两个视频片段所包括视频单元的单元特征,获取所述至少两个视频片段的片段特征;
    将所述至少两个视频片段的片段特征分别与目标文本的文本特征进行特征融合,得到所述至少两个视频片段的融合片段特征;
    基于所述至少两个视频片段的融合片段特征,得到所述至少两个视频片段的第一注意力权重,所述第一注意力权重用于指示视频片段与所述目标文本之间的匹配度;
    根据所述第一注意力权重,从所述至少两个视频片段中获取与所述目标文本之间的匹配度满足参考条件的视频片段,作为所述视频中与所述目标文本相关的目标视频片段。
  2. 根据权利要求1所述的方法,所述基于所述至少两个视频片段所包括视频单元的单元特征,获取所述至少两个视频片段的片段特征,包括:
    基于所述至少两个视频片段所包括视频单元的单元特征,确定所述至少两个视频片段的初始片段特征;
    对所述至少两个视频片段的初始片段特征进行采样,得到所述至少两个视频片段的片段特征。
  3. 根据权利要求2所述的方法,所述对所述至少两个视频片段的初始片段特征进行采样,得到所述至少两个视频片段的片段特征,包括:
    基于所述视频片段的时长,确定所述视频片段对应的采样时刻,每个视频片段对应的采样时刻的数目相同;
    基于所述视频片段对应的采样时刻,对所述视频片段的初始片段特征进行采样,得到所述视频片段的片段特征。
  4. 根据权利要求3所述的方法,所述基于所述视频片段对应的采样时刻,对所述视频片段的初始片段特征进行采样,得到所述视频片段的片段特征,包括:
    基于所述至少两个视频片段对应的采样时刻以及所述至少两个视频片段在所述视频中的位置信息,构造采样矩阵;
    将所述采样矩阵与所述至少两个视频片段的初始片段特征相乘,得到采样特征矩阵,所述采样特征矩阵中的一个特征用于表示一个视频片段的采样特征;
    对所述至少两个视频片段的采样特征进行降维处理,得到所述至少两个视频片段的片段特征。
  5. 根据权利要求1所述的方法,所述将所述至少两个视频片段的片段特征分别与目标文本的文本特征进行特征融合,得到所述至少两个视频片段的融合片段特征,包括:
    获取所述目标文本的文本特征;
    基于所述至少两个视频片段的片段特征以及所述至少两个视频片段在所述视频中的位置信息,构造所述视频对应的第一特征矩阵;
    基于所述第一特征矩阵的维度,对所述文本特征进行维度扩展,得到扩展矩阵,所述扩展矩阵的维度与所述第一特征矩阵的维度相同;
    将所述第一特征矩阵与所述扩展矩阵进行特征融合,得到所述至少两个视频片段的融合片段特征。
  6. 根据权利要求5所述的方法,所述将所述第一特征矩阵与所述扩展矩阵进行特征融合,得到所述至少两个视频片段的融合片段特征,包括:
    将所述第一特征矩阵与所述扩展矩阵中相同位置的元素相乘,得到中间特征矩阵;
    对所述中间特征矩阵进行池化处理,得到第二特征矩阵,所述第二特征矩阵中的一个特征用于表示一个视频片段的融合片段特征。
  7. 根据权利要求6所述的方法,所述基于所述至少两个视频片段的融合片段特征,得到所述至少两个视频片段的第一注意力权重,包括:
    对所述第二特征矩阵进行至少一次卷积运算,得到第一注意力矩阵,所述第一注意力矩阵中的一个元素用于表示一个视频片段的所述第一注意力权重。
  8. 根据权利要求1所述的方法,所述根据所述第一注意力权重,从所述至少两个视频片段中获取与所述目标文本之间的匹配度满足参考条件的视频片段,作为所述视频中与所述目标文本相关的目标视频片段之前,所述方法还包括:
    获取所述视频单元的第二注意力权重,所述第二注意力权重用于指示视频单元与所述目标文本之间的匹配度;
    基于所述至少两个视频片段所包括视频单元的第二注意力权重,对所述至少两个视频片段的第一注意力权重进行调整。
  9. 根据权利要求8所述的方法,所述获取所述视频单元的第二注意力权重,包括:
    将所述视频单元的单元特征分别与所述目标文本的文本特征进行融合,得到所述视频单元的融合单元特征;
    基于所述视频单元的融合单元特征,得到所述视频单元的第二注意力权重。
  10. 根据权利要求8所述的方法,针对所述至少两个视频片段中的目标视频片段,所述基于所述至少两个视频片段所包括视频单元的第二注意力权重,对所述至少两个视频片段的第一注意力权重进行调整,包括:
    从所述目标视频片段包括的视频单元中,确定所述目标视频片段的中心时刻对应的目标视频单元;
    基于所述目标视频单元的第二注意力权重,对所述目标视频片段的第一注意力权重进行调整。
  11. 根据权利要求1所述的方法,所述根据所述第一注意力权重,从所述至少两个视频片段中,获取与所述目标文本之间的匹配度满足参考条件的视频片段,作为所述视频中与所述目标文本相关的目标视频片段之后,所述方法还包括下述任一项:
    在所述视频的播放界面显示标注信息,所述标注信息用于指示所述目标视频片段的起始时刻和结束时刻;或者,
    在所述视频的所述播放界面显示所述目标视频片段的链接,所述链接用于提供对所述 目标视频片段进行播放的功能。
  12. 根据权利要求1所述的方法,所述根据所述第一注意力权重,从所述至少两个视频片段中,获取与所述目标文本之间的匹配度满足参考条件的视频片段,作为所述视频中与所述目标文本相关的目标视频片段之后,所述方法还包括:
    基于所述至少两个视频片段的第一注意力权重对所述至少两个视频片段的片段特征进行加权运算,得到所述至少两个视频片段的加权片段特征;
    通过长短时记忆网络对所述至少两个视频片段的加权片段特征进行特征提取,基于提取到的特征确定第一候选文本;
    获取所述第一候选文本与所述目标文本之间的第一误差值。
  13. 一种视频片段定位装置,所述装置包括:
    第一获取模块,用于对视频中至少两个视频片段包括的视频单元进行特征提取,得到所述视频单元的单元特征;
    第二获取模块,用于基于所述至少两个视频片段所包括视频单元的单元特征,获取所述至少两个视频片段的片段特征;
    特征融合模块,用于将所述至少两个视频片段的片段特征分别与目标文本的文本特征进行特征融合,得到所述至少两个视频片段的融合片段特征;
    第三获取模块,用于基于所述至少两个视频片段的融合片段特征,得到所述至少两个视频片段的第一注意力权重,所述第一注意力权重用于指示视频片段与所述目标文本之间的匹配度;
    第四获取模块,用于根据所述第一注意力权重,从所述至少两个视频片段中获取与所述目标文本之间的匹配度满足参考条件的视频片段,作为所述视频中与所述目标文本相关的目标视频片段。
  14. 一种计算机设备,所述计算机设备包括一个或多个处理器和一个或多个存储器,所述一个或多个存储器中存储有至少一条程序代码,所述至少一条程序代码由所述一个或多个处理器加载并执行以实现如权利要求1至权利要求12任一项所述的视频片段定位方法所执行的操作。
  15. 一种计算机可读存储介质,所述计算机可读存储介质中存储计算机程序,所述计算机程序用于执行权利要求1至权利要求12任一项所述的视频片段定位方法。
  16. 一种包括指令的计算机程序产品,当其在计算机上运行时,使得所述计算机执行权利要求1至权利要求12任一项所述的视频片段定位方法。
PCT/CN2021/100860 2020-07-30 2021-06-18 视频片段定位方法、装置、计算机设备及存储介质 WO2022022152A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/949,984 US20230024382A1 (en) 2020-07-30 2022-09-21 Video clip positioning method and apparatus, computer device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010753184.4 2020-07-30
CN202010753184.4A CN111866607B (zh) 2020-07-30 2020-07-30 视频片段定位方法、装置、计算机设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/949,984 Continuation US20230024382A1 (en) 2020-07-30 2022-09-21 Video clip positioning method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022022152A1 true WO2022022152A1 (zh) 2022-02-03

Family

ID=72945235

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/100860 WO2022022152A1 (zh) 2020-07-30 2021-06-18 视频片段定位方法、装置、计算机设备及存储介质

Country Status (3)

Country Link
US (1) US20230024382A1 (zh)
CN (1) CN111866607B (zh)
WO (1) WO2022022152A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114387567A (zh) * 2022-03-23 2022-04-22 长视科技股份有限公司 一种视频数据的处理方法、装置、电子设备及存储介质
CN115168643A (zh) * 2022-09-07 2022-10-11 腾讯科技(深圳)有限公司 音频处理方法、装置、设备及计算机可读存储介质
CN115223086A (zh) * 2022-09-20 2022-10-21 之江实验室 基于交互注意力引导与修正的跨模态动作定位方法与系统
US20230007365A1 (en) * 2021-07-02 2023-01-05 Disney Enterprises, Inc. Automated Content Segmentation and Identification of Fungible Content
CN116471452A (zh) * 2023-05-10 2023-07-21 武汉亿臻科技有限公司 一种基于智能ai的视频剪辑平台

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3923183A1 (en) * 2020-06-11 2021-12-15 Tata Consultancy Services Limited Method and system for video analysis
CN111866607B (zh) * 2020-07-30 2022-03-11 腾讯科技(深圳)有限公司 视频片段定位方法、装置、计算机设备及存储介质
CN113032679B (zh) * 2021-04-19 2023-12-29 北京新三优秀科技有限公司 一种短视频处理方法、电子设备和计算机可读存储介质
CN113361376B (zh) * 2021-06-02 2023-01-17 北京三快在线科技有限公司 获取视频封面的方法、装置、计算机设备及可读存储介质
CN115734016A (zh) * 2021-08-31 2023-03-03 腾讯科技(深圳)有限公司 评论信息的显示方法、装置、计算机设备及存储介质
CN114390365B (zh) * 2022-01-04 2024-04-26 京东科技信息技术有限公司 用于生成视频信息的方法和装置
CN114419527B (zh) * 2022-04-01 2022-06-14 腾讯科技(深圳)有限公司 一种数据处理方法、设备以及计算机可读存储介质
CN114780789A (zh) * 2022-06-22 2022-07-22 山东建筑大学 基于自然语言查询的装配式构件施工监控视频定位方法
CN115278378B (zh) * 2022-07-27 2024-06-21 维沃移动通信有限公司 信息显示方法、信息显示装置、电子设备和存储介质
CN115661727B (zh) * 2022-12-27 2023-04-28 苏州浪潮智能科技有限公司 视频的行为定位方法、装置、电子设备及存储介质
CN117152669B (zh) * 2023-10-30 2024-02-06 华中科技大学 一种跨模态时域视频定位方法及系统

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN110225368A (zh) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 一种视频定位方法、装置及电子设备
CN111866607A (zh) * 2020-07-30 2020-10-30 腾讯科技(深圳)有限公司 视频片段定位方法、装置、计算机设备及存储介质

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070106646A1 (en) * 2005-11-09 2007-05-10 Bbnt Solutions Llc User-directed navigation of multimedia search results
CN107027060A (zh) * 2017-04-18 2017-08-08 腾讯科技(深圳)有限公司 视频片段的确定方法和装置

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170357720A1 (en) * 2016-06-10 2017-12-14 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN110225368A (zh) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 一种视频定位方法、装置及电子设备
CN111866607A (zh) * 2020-07-30 2020-10-30 腾讯科技(深圳)有限公司 视频片段定位方法、装置、计算机设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YIJUN SONG; JINGWEN WANG; LIN MA; ZHOU YU; JUN YU: "Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 16 March 2020 (2020-03-16), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081621935 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230007365A1 (en) * 2021-07-02 2023-01-05 Disney Enterprises, Inc. Automated Content Segmentation and Identification of Fungible Content
US12003831B2 (en) * 2021-07-02 2024-06-04 Disney Enterprises, Inc. Automated content segmentation and identification of fungible content
CN114387567A (zh) * 2022-03-23 2022-04-22 长视科技股份有限公司 一种视频数据的处理方法、装置、电子设备及存储介质
CN115168643A (zh) * 2022-09-07 2022-10-11 腾讯科技(深圳)有限公司 音频处理方法、装置、设备及计算机可读存储介质
CN115223086A (zh) * 2022-09-20 2022-10-21 之江实验室 基于交互注意力引导与修正的跨模态动作定位方法与系统
CN115223086B (zh) * 2022-09-20 2022-12-06 之江实验室 基于交互注意力引导与修正的跨模态动作定位方法与系统
CN116471452A (zh) * 2023-05-10 2023-07-21 武汉亿臻科技有限公司 一种基于智能ai的视频剪辑平台
CN116471452B (zh) * 2023-05-10 2024-01-19 武汉亿臻科技有限公司 一种基于智能ai的视频剪辑平台

Also Published As

Publication number Publication date
CN111866607A (zh) 2020-10-30
CN111866607B (zh) 2022-03-11
US20230024382A1 (en) 2023-01-26

Similar Documents

Publication Publication Date Title
WO2022022152A1 (zh) 视频片段定位方法、装置、计算机设备及存储介质
CN111062871B (zh) 一种图像处理方法、装置、计算机设备及可读存储介质
CN109104620B (zh) 一种短视频推荐方法、装置和可读介质
CN110458107B (zh) 用于图像识别的方法和装置
CN109460514B (zh) 用于推送信息的方法和装置
CN110234018B (zh) 多媒体内容描述生成方法、训练方法、装置、设备及介质
CN111738010B (zh) 用于生成语义匹配模型的方法和装置
CN111783712A (zh) 一种视频处理方法、装置、设备及介质
CN112989212B (zh) 媒体内容推荐方法、装置和设备及计算机存储介质
CN112149699B (zh) 用于生成模型的方法、装置和用于识别图像的方法、装置
CN112765387A (zh) 图像检索方法、图像检索装置和电子设备
CN112149604A (zh) 视频特征提取模型的训练方法、视频推荐方法及装置
CN110674349A (zh) 视频poi识别方法、装置及电子设备
CN113641835B (zh) 多媒体资源推荐方法、装置、电子设备及介质
CN117688204A (zh) 视频推荐模型的训练方法、装置、电子设备和存储介质
CN113128526B (zh) 图像识别方法、装置、电子设备和计算机可读存储介质
CN113918738A (zh) 多媒体资源推荐方法、装置、电子设备及存储介质
CN114187486A (zh) 模型训练方法及相关设备
JP7504192B2 (ja) 画像を検索するための方法及び装置
CN111597361B (zh) 多媒体数据处理方法、装置、存储介质及设备
CN113420203A (zh) 对象推荐方法、装置、电子设备及存储介质
CN116958852A (zh) 视频与文本的匹配方法、装置、电子设备和存储介质
CN112417260B (zh) 本地化推荐方法、装置及存储介质
CN115878839A (zh) 一种视频推荐方法、装置、计算机设备和计算机程序产品
CN113901330B (zh) 视频搜索方法、装置、电子设备以及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21850871

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28/06/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21850871

Country of ref document: EP

Kind code of ref document: A1