CN115661727A - Video behavior positioning method and device, electronic equipment and storage medium - Google Patents

Video behavior positioning method and device, electronic equipment and storage medium

Info

Publication number
CN115661727A
Authority
CN
China
Prior art keywords
frame
video
feature
extraction
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211680368.8A
Other languages
Chinese (zh)
Other versions
CN115661727B (en)
Inventor
李晓川
郭振华
赵雅倩
李仁刚
范宝余
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202211680368.8A priority Critical patent/CN115661727B/en
Publication of CN115661727A publication Critical patent/CN115661727A/en
Application granted granted Critical
Publication of CN115661727B publication Critical patent/CN115661727B/en
Priority to PCT/CN2023/101687 priority patent/WO2024139091A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F18/00 Pattern recognition
                    • G06F18/20 Analysing
                        • G06F18/25 Fusion techniques
                • G06F40/00 Handling natural language data
                    • G06F40/20 Natural language analysis
                        • G06F40/205 Parsing
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V10/00 Arrangements for image or video recognition or understanding
                    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
                        • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
                            • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
                        • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
                • G06V20/00 Scenes; Scene-specific elements
                    • G06V20/40 Scenes; Scene-specific elements in video content
    • H ELECTRICITY
        • H04 ELECTRIC COMMUNICATION TECHNIQUE
            • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
                • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
                    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
                        • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
                            • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
                                • H04N21/4402 Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
        • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
            • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
                • Y02T10/00 Road transport of goods or passengers
                    • Y02T10/10 Internal combustion engine [ICE] based vehicles
                        • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

An embodiment of the invention provides a video behavior positioning method and apparatus, an electronic device and a storage medium. A target video is frame-extracted to obtain a corresponding compressed video, and multiple rounds of feature extraction are performed to obtain video coding features. In parallel, text feature extraction and a broadcast copy operation are applied to the description text to obtain corresponding expanded sentence features. Visual-text feature fusion then yields the visual-text splicing feature that joins the video and the text, and a visual-text correlation operation on this splicing feature produces a visual-text moment sequence, so that fine-grained matching between video and text is realized through a variable-length video-text matching algorithm. Segment positioning refinement is then applied to the start and end frames to obtain optimized start and end frames, and the video segment between them is clipped as the final output segment. An accurate start-stop position can thus be determined, and the positioning error of video behaviors is greatly reduced.

Description

Video behavior positioning method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular to a video behavior positioning method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In recent years, multimodal artificial intelligence has become one of the important research directions in the AI (Artificial Intelligence) field. Multimodal research aims to integrate inputs from multiple modalities such as images, video, audio, text and sensor signals, and to understand or generate information usable by humans; fields such as Visual Question Answering (VQA) and visual grounding, for example, involve relationships between modalities such as images and text. Within multimodal artificial intelligence, video behavior positioning based on text descriptions is a promising branch of video algorithm research, where video positioning or behavior positioning specifically refers to the algorithmic task of clipping, from a video, a segment that conforms to a certain type of behavior.
At present, existing video behavior positioning methods generally implement video positioning and output the corresponding segments based on technical means such as feature extraction and video segmentation. However, these methods tend to produce large errors in classifying video segments, and especially when the target segment is unrelated to picture cues of the video itself, such as shot splitting and shot switching, the start and end positions of the final output segment also tend to be insufficiently accurate.
Disclosure of Invention
Embodiments of the invention provide a video behavior positioning method and apparatus, an electronic device and a computer-readable storage medium, so as to solve or partially solve the problems that existing video behavior positioning methods have large positioning errors and insufficiently accurate start and end positions of the output segments.
An embodiment of the invention discloses a video behavior positioning method, which includes the following steps:
acquiring a target video, performing frame extraction on the target video to obtain a corresponding compressed video, and performing feature extraction on the compressed video multiple times to obtain corresponding video coding features;
acquiring a description text corresponding to the target video, performing feature extraction on the description text, and outputting corresponding sentence features;
performing a broadcast copy operation on the sentence features to obtain corresponding expanded sentence features, and performing feature splicing on the expanded sentence features and the video coding features to obtain a visual-text splicing feature;
performing a visual-text correlation operation on the visual-text splicing feature and outputting a corresponding visual-text moment sequence, where the visual-text moment sequence includes positioning video frames;
selecting a coarse start frame and a coarse end frame from the positioning video frames, and performing segment positioning refinement on the coarse start frame and the coarse end frame respectively to obtain an optimized start frame corresponding to the coarse start frame and an optimized end frame corresponding to the coarse end frame;
and taking the video segment between the optimized start frame and the optimized end frame in the target video as the target video segment corresponding to the description text, and outputting the target video segment.
Optionally, performing frame extraction on the target video to obtain the corresponding compressed video includes:
performing frame extraction on the target video according to a preset frame extraction interval to obtain a corresponding extracted frame sequence, and taking the extracted frame sequence as the compressed video corresponding to the target video.
Optionally, performing frame extraction on the target video according to a preset frame extraction interval to obtain the corresponding extracted frame sequence includes:
determining a start extraction frame for the target video, and performing frame extraction on the target video from the start extraction frame according to the preset frame extraction interval to obtain a start extraction frame sequence corresponding to the start extraction frame;
determining a pre-extraction frame and a post-extraction frame corresponding to the start extraction frame, where the pre-extraction frame is the frame immediately before the start extraction frame and the post-extraction frame is the frame immediately after it;
performing frame extraction on the target video from the pre-extraction frame according to the preset frame extraction interval to obtain a pre-extraction frame sequence corresponding to the start extraction frame;
performing frame extraction on the target video from the post-extraction frame according to the preset frame extraction interval to obtain a post-extraction frame sequence corresponding to the start extraction frame;
and taking the start extraction frame sequence, the pre-extraction frame sequence and the post-extraction frame sequence as the extracted frame sequences corresponding to the target video.
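A minimal NumPy sketch of the three-way interval sampling described above; the array layout, the function name extract_frame_sequences and the choice to clamp the pre-extraction index at 0 are illustrative assumptions, not details from the patent.

```python
import numpy as np

def extract_frame_sequences(frames: np.ndarray, start_idx: int, interval: int):
    """Sample the video every `interval` frames from the start extraction frame,
    from the frame before it, and from the frame after it, yielding the start,
    pre and post extraction frame sequences."""
    def sample(from_idx: int) -> np.ndarray:
        return frames[from_idx::interval]

    start_seq = sample(start_idx)              # start extraction frame sequence
    pre_seq = sample(max(start_idx - 1, 0))    # pre-extraction frame sequence (clamped at frame 0)
    post_seq = sample(start_idx + 1)           # post-extraction frame sequence
    return start_seq, pre_seq, post_seq

# Example: a dummy 100-frame video of 8x8 RGB images, sampled every 5 frames.
video = np.zeros((100, 8, 8, 3), dtype=np.uint8)
start, pre, post = extract_frame_sequences(video, start_idx=1, interval=5)
print(start.shape, pre.shape, post.shape)
```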
Optionally, performing feature extraction on the compressed video multiple times to obtain the corresponding video coding features includes:
performing feature extraction on the compressed video to obtain a corresponding video feature map, and performing temporal-spatial convolution on the video feature map to obtain a corresponding extracted-frame temporal feature map;
and performing feature extraction on the extracted-frame temporal feature map to obtain the video coding features corresponding to the compressed video.
Optionally, performing feature extraction on the compressed video to obtain the corresponding video feature map includes:
inputting the extracted frame sequence into a convolutional neural network for feature extraction to obtain an extracted-frame feature map corresponding to the compressed video.
Optionally, inputting the extracted frame sequence into a convolutional neural network for feature extraction to obtain the extracted-frame feature map corresponding to the compressed video includes:
inputting the start extraction frame sequence, the pre-extraction frame sequence and the post-extraction frame sequence into the convolutional neural network separately for feature extraction to obtain a start extracted-frame feature map corresponding to the start extraction frame sequence, a pre extracted-frame feature map corresponding to the pre-extraction frame sequence, and a post extracted-frame feature map corresponding to the post-extraction frame sequence.
Optionally, performing temporal-spatial convolution on the video feature map to obtain the corresponding extracted-frame temporal feature map includes:
inputting the start extracted-frame feature map into a three-dimensional convolution for temporal-spatial convolution to obtain the extracted-frame temporal feature map corresponding to the video feature map.
Optionally, performing feature extraction on the extracted-frame temporal feature map to obtain the video coding features corresponding to the compressed video includes:
fusing the extracted-frame temporal feature map with the pre extracted-frame feature map and the post extracted-frame feature map to obtain a fused feature map, and inputting the fused feature map into a feature detection network for feature extraction to obtain the video coding features corresponding to the compressed video.
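The multi-stage extraction above (a per-frame CNN, a three-dimensional convolution for the temporal-spatial step, fusion with the pre and post extracted-frame feature maps, then a detection-style head) could be wired up roughly as follows. This is a sketch under assumptions: the concrete backbones, the additive fusion, the layer sizes and every class or method name here are placeholders rather than the patent's actual networks.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Sketch of the multi-stage video feature extraction; all layers are stand-ins."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.frame_cnn = nn.Sequential(                 # stands in for the per-frame CNN
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        self.temporal_conv = nn.Conv3d(feat_dim, feat_dim, kernel_size=(3, 3, 3), padding=1)
        self.detect_head = nn.Sequential(               # stands in for the feature detection network
            nn.Conv3d(feat_dim, feat_dim, 1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
        )

    def per_frame(self, seq):                           # seq: (T, 3, H, W) -> (feat_dim, T, 8, 8)
        return self.frame_cnn(seq).permute(1, 0, 2, 3)

    def forward(self, start_seq, pre_seq, post_seq):
        f_start = self.per_frame(start_seq)
        f_pre, f_post = self.per_frame(pre_seq), self.per_frame(post_seq)
        f_temporal = self.temporal_conv(f_start.unsqueeze(0))          # extracted-frame temporal feature map
        fused = f_temporal + f_pre.unsqueeze(0) + f_post.unsqueeze(0)  # assumed additive fusion
        # (T, feat_dim) video coding features, one vector per extracted frame
        return self.detect_head(fused).squeeze(0).squeeze(-1).squeeze(-1).T

enc = VideoEncoder()
T = 20
codes = enc(torch.rand(T, 3, 32, 32), torch.rand(T, 3, 32, 32), torch.rand(T, 3, 32, 32))
print(codes.shape)  # torch.Size([20, 64])
```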
Optionally, performing the broadcast copy operation on the sentence features to obtain the corresponding expanded sentence features includes:
performing the broadcast copy operation on the sentence features according to the number of frames in the extracted frame sequence, expanding the sentence features into expanded sentence features matching the extracted frame sequence.
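A small NumPy illustration of the broadcast copy and splicing steps, assuming the sentence feature is a single vector and the splice is a per-frame concatenation; the variable names and dimensions are illustrative only.

```python
import numpy as np

T_f, N, M = 20, 64, 32                       # frames, video feature dim, text feature dim
video_codes = np.random.rand(T_f, N)         # video coding features, one row per extracted frame
sentence = np.random.rand(M)                 # sentence feature of the description text

# Broadcast copy: tile the single sentence feature once per extracted frame.
expanded_sentence = np.tile(sentence, (T_f, 1))                                 # (T_f, M)

# Feature splicing: concatenate video and text features frame by frame.
visual_text_splice = np.concatenate([video_codes, expanded_sentence], axis=1)   # (T_f, N + M)
print(visual_text_splice.shape)
```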
Optionally, performing the visual-text correlation operation on the visual-text splicing feature and outputting the corresponding visual-text moment sequence includes:
performing a visual-text attention calculation on the visual-text splicing feature to obtain a corresponding global attention matrix, and calculating a first visual-text correlation weight from the global attention matrix;
performing a distance similarity operation with the video coding features and the expanded sentence features to obtain a second visual-text correlation weight, and obtaining a total visual-text correlation weight from the first and second visual-text correlation weights;
summing the total visual-text correlation weights to obtain a frame correlation list, where the frame correlation list contains conformity scores, each conformity score characterizing the degree to which a video frame in the compressed video conforms to the text description;
and screening out the target conformity scores that exceed a preset conformity threshold, determining the video frames corresponding to the target conformity scores as positioning video frames, gathering all positioning video frames into the visual-text moment sequence corresponding to the visual-text splicing feature, and outputting the visual-text moment sequence.
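The final screening step reduces to a simple threshold over the per-frame conformity scores; a minimal sketch with an assumed score array and threshold is shown below.

```python
import numpy as np

def positioning_frames(conformity: np.ndarray, threshold: float) -> np.ndarray:
    """Keep the indices of frames whose conformity score exceeds the preset
    threshold; together they form the visual-text moment sequence."""
    return np.flatnonzero(conformity > threshold)

scores = np.array([0.1, 0.2, 0.7, 0.9, 0.8, 0.3])   # one conformity score per compressed-video frame
print(positioning_frames(scores, threshold=0.5))      # [2 3 4]
```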
Optionally, performing the visual-text attention calculation on the visual-text splicing feature to obtain the corresponding global attention matrix includes:
computing the global attention matrix from the visual-text splicing feature according to a formula (given in the original only as an image), where Mat denotes the global attention matrix, the visual-text splicing feature is the input, and the two transfer matrices used for feature-space transformation both have size d×d.
Optionally, calculating the first visual-text correlation weight from the global attention matrix includes:
calculating the first visual-text correlation weight according to a formula (given in the original only as an image); the first visual-text correlation weight is a matrix whose size is determined by the number of frames of the compressed video, where N denotes the number of features of a single video frame in the compressed video, M denotes the number of features of the description text, and the formula involves a transpose of the matrix in the non-time dimension.
Optionally, performing the distance similarity operation with the video coding features and the expanded sentence features to obtain the second visual-text correlation weight includes:
performing the distance similarity calculation according to a formula (given in the original only as an image) to obtain the second visual-text correlation weight, where the inputs are the video coding features and the expanded sentence features.
Optionally, obtaining the total visual-text correlation weight from the first visual-text correlation weight and the second visual-text correlation weight includes:
calculating the total visual-text correlation weight A according to a formula (given in the original only as an image), where A is the total visual-text correlation weight characterizing the degree of correlation between the compressed video and the description text, and the inputs are the first and second visual-text correlation weights.
Optionally, summing the total visual-text correlation weights to obtain the frame correlation list includes:
summing the total visual-text correlation weights according to a formula (given in the original only as an image) to obtain the frame correlation list S, where S denotes the resulting frame correlation list.
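The patent gives the correlation formulas only as images, so the sketch below is one hypothetical instantiation for illustration: a dot-product attention over the visual-text splicing feature for the first weight, a negative L2 distance for the second weight, an additive combination for the total weight, and a per-frame sum for the frame correlation list. None of these concrete forms should be read as the patent's actual equations.

```python
import numpy as np

def frame_correlation_list(splice, video_codes, expanded_sentence):
    """Hypothetical reading of the correlation step; the specific operations
    (dot-product attention, negative L2 distance, additive combination) are
    assumptions made for illustration only."""
    d = splice.shape[1]
    rng = np.random.default_rng(0)
    w1, w2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))   # d x d transfer matrices

    # Global attention matrix computed from the visual-text splicing feature.
    mat = (splice @ w1) @ (splice @ w2).T                       # (T_f, T_f)
    a1 = mat.sum(axis=1, keepdims=True)                         # first visual-text correlation weight (assumed pooling)

    # Distance similarity between video coding features and the expanded sentence
    # feature (assumes both share one feature dimension).
    a2 = -np.linalg.norm(video_codes - expanded_sentence, axis=1, keepdims=True)

    total = a1 + a2                                             # total visual-text correlation weight (assumed sum)
    return total.sum(axis=1)                                    # frame correlation list S: one conformity score per frame

T_f, N = 20, 64
codes = np.random.rand(T_f, N)
sent = np.tile(np.random.rand(N), (T_f, 1))                     # expanded sentence feature, same dimension for simplicity
S = frame_correlation_list(np.concatenate([codes, sent], axis=1), codes, sent)
print(S.shape)                                                  # (20,)
```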
Optionally, selecting the coarse start frame and the coarse end frame from the positioning video frames, and performing segment positioning refinement on them respectively to obtain the optimized start frame corresponding to the coarse start frame and the optimized end frame corresponding to the coarse end frame, includes:
selecting the coarse start frame and the coarse end frame from the positioning video frames, matching the coarse start frame and the coarse end frame against the target video, and determining a target start frame corresponding to the coarse start frame and a target end frame corresponding to the coarse end frame;
and performing segment positioning refinement on the target start frame and the target end frame respectively to obtain an optimized start frame corresponding to the target start frame and an optimized end frame corresponding to the target end frame.
Optionally, performing segment positioning refinement on the target start frame and the target end frame respectively to obtain the optimized start frame corresponding to the target start frame and the optimized end frame corresponding to the target end frame includes:
expanding the target start frame and the target end frame forwards and backwards by a preset frame-expansion number to obtain a start-frame candidate image set corresponding to the target start frame and an end-frame candidate image set corresponding to the target end frame;
performing start-frame positioning refinement on the start-frame candidate image set to obtain the optimized start frame corresponding to the target start frame;
and performing end-frame positioning refinement on the end-frame candidate image set to obtain the optimized end frame corresponding to the target end frame.
Optionally, performing start-frame positioning refinement on the start-frame candidate image set to obtain the optimized start frame corresponding to the target start frame includes:
taking each frame in the start-frame candidate image set as a candidate start frame, and clipping the target video from each candidate start frame with a preset clip length to obtain a plurality of first clipped video segments;
performing feature extraction on each first clipped video segment to obtain a start frame feature corresponding to each first clipped video segment, and computing a first similarity sequence from the start frame features and the expanded sentence features, where the first similarity sequence covers the candidate optimized start frames corresponding to the start frame features;
performing feature extraction on each candidate start frame separately to obtain candidate start frame features, and computing a first similarity supplement sequence from the candidate start frame features and the expanded sentence features;
performing adjacent-value subtraction on the first similarity supplement sequence to obtain a corresponding first similarity step signal set, where the first similarity step signal set contains the first step signals between adjacent candidate start frames;
and screening out the first target step signals that exceed a preset step threshold, selecting the corresponding target candidate optimized start frames from the first similarity sequence according to the first target step signals, and selecting the frame with the highest similarity among the target candidate optimized start frames as the optimized start frame.
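The selection logic of the last two steps (adjacent-value differences, step threshold, then highest similarity) can be sketched as follows; how the two similarity sequences themselves are produced is left abstract, and all names and numbers here are illustrative assumptions.

```python
import numpy as np

def refine_start_frame(frame_sims, clip_sims, candidate_ids, step_threshold):
    """Sketch of the start-frame refinement decision. `clip_sims[i]` plays the role of
    the first similarity sequence (clip starting at candidate i vs. the expanded
    sentence features) and `frame_sims[i]` the first similarity supplement sequence
    (per-frame similarity)."""
    steps = np.diff(frame_sims)                                  # first similarity step signal set
    target_ids = [i + 1 for i, s in enumerate(steps) if s > step_threshold]
    if not target_ids:
        target_ids = list(range(len(candidate_ids)))             # fall back to all candidates (sketch-only choice)
    best = max(target_ids, key=lambda i: clip_sims[i])           # highest similarity among screened candidates
    return candidate_ids[best]

# Candidate start frames obtained by expanding the target start frame by +/- 3 frames.
target_start, k = 120, 3
candidates = list(range(target_start - k, target_start + k + 1))
frame_sims = np.array([0.20, 0.22, 0.21, 0.23, 0.80, 0.82, 0.83])   # similarity jumps at the true boundary
clip_sims  = np.array([0.30, 0.31, 0.33, 0.35, 0.85, 0.84, 0.82])
print(refine_start_frame(frame_sims, clip_sims, candidates, step_threshold=0.3))  # 121, one frame after the coarse guess
```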
Optionally, performing feature extraction on each first clipped video segment to obtain the start frame feature corresponding to each first clipped video segment, and computing the first similarity sequence from the start frame features and the expanded sentence features, includes:
applying temporal adaptive convolution to each first clipped video segment while performing feature extraction to obtain the start frame feature corresponding to each first clipped video segment, and performing a cosine similarity operation between the start frame features and the expanded sentence features to obtain the first similarity sequence.
Optionally, performing feature extraction on each candidate start frame to obtain the candidate start frame features, and computing the first similarity supplement sequence from the candidate start frame features and the expanded sentence features, includes:
performing feature extraction on each candidate start frame with a convolutional neural network to obtain the candidate start frame features, and performing a cosine similarity operation between the candidate start frame features and the expanded sentence features to obtain the first similarity supplement sequence.
Optionally, performing end-frame positioning refinement on the end-frame candidate image set to obtain the optimized end frame corresponding to the target end frame includes:
taking each frame in the end-frame candidate image set as a candidate end frame, and clipping the target video from each candidate end frame with a preset clip length to obtain a plurality of second clipped video segments;
performing feature extraction on each second clipped video segment to obtain an end frame feature corresponding to each second clipped video segment, and computing a second similarity sequence from the end frame features and the expanded sentence features, where the second similarity sequence covers the candidate optimized end frames corresponding to the end frame features;
performing feature extraction on each candidate end frame separately to obtain candidate end frame features, and computing a second similarity supplement sequence from the candidate end frame features and the expanded sentence features;
performing adjacent-value subtraction on the second similarity supplement sequence to obtain a corresponding second similarity step signal set, where the second similarity step signal set contains the second step signals between adjacent candidate end frames;
and screening out the second target step signals that exceed a preset step threshold, selecting the corresponding target candidate optimized end frames from the second similarity sequence according to the second target step signals, and selecting the frame with the highest similarity among the target candidate optimized end frames as the optimized end frame.
Optionally, performing feature extraction on each second clipped video segment to obtain the end frame feature corresponding to each second clipped video segment, and computing the second similarity sequence from the end frame features and the expanded sentence features, includes:
applying temporal adaptive convolution to each second clipped video segment while performing feature extraction to obtain the end frame feature corresponding to each second clipped video segment, and performing a cosine similarity operation between the end frame features and the expanded sentence features to obtain the second similarity sequence.
Optionally, performing feature extraction on each candidate end frame to obtain the candidate end frame features, and computing the second similarity supplement sequence from the candidate end frame features and the expanded sentence features, includes:
performing feature extraction on each candidate end frame with a convolutional neural network to obtain the candidate end frame features, and performing a cosine similarity operation between the candidate end frame features and the expanded sentence features to obtain the second similarity supplement sequence.
Optionally, performing adjacent-value subtraction on the second similarity supplement sequence to obtain the corresponding second similarity step signal set includes:
performing adjacent-value subtraction on the second similarity supplement sequence and then applying sign inversion (negation) to obtain the corresponding second similarity step signal set.
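A tiny illustration of why the end-frame branch adds this inversion step, assuming it amounts to negating the adjacent differences: the per-frame similarity drops once the behavior has ended, and negation turns that drop into a positive step that can be screened with the same threshold as the start-frame branch.

```python
import numpy as np

# Per-frame similarity falls once the described action is over.
frame_sims = np.array([0.81, 0.83, 0.80, 0.25, 0.22, 0.24, 0.21])
steps = -np.diff(frame_sims)            # second similarity step signal set (negated differences)
print(steps > 0.3)                      # True only at the candidate just before the drop
```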
An embodiment of the invention also discloses a video behavior positioning apparatus, which includes:
a video coding feature generation module, configured to acquire a target video, perform frame extraction on the target video to obtain a corresponding compressed video, and perform feature extraction on the compressed video multiple times to obtain corresponding video coding features;
a sentence feature output module, configured to acquire a description text corresponding to the target video, perform feature extraction on the description text, and output corresponding sentence features;
a broadcast copy operation module, configured to perform a broadcast copy operation on the sentence features to obtain corresponding expanded sentence features, and perform feature splicing on the expanded sentence features and the video coding features to obtain a visual-text splicing feature;
a visual-text correlation operation module, configured to perform a visual-text correlation operation on the visual-text splicing feature and output a corresponding visual-text moment sequence, where the visual-text moment sequence includes positioning video frames;
a segment positioning refinement module, configured to select a coarse start frame and a coarse end frame from the positioning video frames, and perform segment positioning refinement on the coarse start frame and the coarse end frame respectively to obtain an optimized start frame corresponding to the coarse start frame and an optimized end frame corresponding to the coarse end frame;
and a target video segment output module, configured to take the video segment between the optimized start frame and the optimized end frame in the target video as the target video segment corresponding to the description text, and output the target video segment.
Optionally, the video coding feature generation module is specifically configured to:
perform frame extraction on the target video according to a preset frame extraction interval to obtain a corresponding extracted frame sequence, and take the extracted frame sequence as the compressed video corresponding to the target video.
Optionally, the video coding feature generation module includes:
a start frame sequence determining module, configured to determine a start extraction frame for the target video, and perform frame extraction on the target video from the start extraction frame according to the preset frame extraction interval to obtain a start extraction frame sequence corresponding to the start extraction frame;
a pre-extraction and post-extraction frame determining module, configured to determine a pre-extraction frame and a post-extraction frame corresponding to the start extraction frame, where the pre-extraction frame is the frame immediately before the start extraction frame and the post-extraction frame is the frame immediately after it;
a pre-extraction frame sequence determining module, configured to perform frame extraction on the target video from the pre-extraction frame according to the preset frame extraction interval to obtain a pre-extraction frame sequence corresponding to the start extraction frame;
a post-extraction frame sequence determining module, configured to perform frame extraction on the target video from the post-extraction frame according to the preset frame extraction interval to obtain a post-extraction frame sequence corresponding to the start extraction frame;
and an extracted frame sequence determining module, configured to take the start extraction frame sequence, the pre-extraction frame sequence and the post-extraction frame sequence as the extracted frame sequences corresponding to the target video.
Optionally, the video coding feature generation module includes:
an extracted-frame temporal feature map generation module, configured to perform feature extraction on the compressed video to obtain a corresponding video feature map, and perform temporal-spatial convolution on the video feature map to obtain a corresponding extracted-frame temporal feature map;
and a video coding feature generation submodule, configured to perform feature extraction on the extracted-frame temporal feature map to obtain the video coding features corresponding to the compressed video.
Optionally, the extracted-frame temporal feature map generation module is specifically configured to:
input the extracted frame sequence into a convolutional neural network for feature extraction to obtain an extracted-frame feature map corresponding to the compressed video.
Optionally, the extracted-frame temporal feature map generation module is specifically configured to:
input the start extraction frame sequence, the pre-extraction frame sequence and the post-extraction frame sequence into the convolutional neural network separately for feature extraction to obtain a start extracted-frame feature map corresponding to the start extraction frame sequence, a pre extracted-frame feature map corresponding to the pre-extraction frame sequence, and a post extracted-frame feature map corresponding to the post-extraction frame sequence.
Optionally, the extracted-frame temporal feature map generation module is specifically configured to:
input the start extracted-frame feature map into a three-dimensional convolution for temporal-spatial convolution to obtain the extracted-frame temporal feature map corresponding to the video feature map.
Optionally, the video coding feature generation submodule is specifically configured to:
fuse the extracted-frame temporal feature map with the pre extracted-frame feature map and the post extracted-frame feature map to obtain a fused feature map, and input the fused feature map into a feature detection network for feature extraction to obtain the video coding features corresponding to the compressed video.
Optionally, the broadcast copy operation module is specifically configured to:
perform the broadcast copy operation on the sentence features according to the number of frames in the extracted frame sequence, expanding the sentence features into expanded sentence features matching the extracted frame sequence.
Optionally, the visual-text correlation operation module includes:
a first visual-text correlation weight calculation module, configured to perform a visual-text attention calculation on the visual-text splicing feature to obtain a corresponding global attention matrix, and calculate a first visual-text correlation weight from the global attention matrix;
a total visual-text correlation weight calculation module, configured to perform a distance similarity operation with the video coding features and the expanded sentence features to obtain a second visual-text correlation weight, and obtain a total visual-text correlation weight from the first and second visual-text correlation weights;
a frame correlation list generation module, configured to sum the total visual-text correlation weights to obtain a frame correlation list, where the frame correlation list contains conformity scores, each conformity score characterizing the degree to which a video frame in the compressed video conforms to the text description;
and a visual-text moment sequence generation module, configured to screen out the target conformity scores that exceed a preset conformity threshold, determine the video frames corresponding to the target conformity scores as positioning video frames, gather all positioning video frames into the visual-text moment sequence corresponding to the visual-text splicing feature, and output the visual-text moment sequence.
Optionally, the first visual-text correlation weight calculation module includes:
a visual-text attention calculation module, configured to compute the global attention matrix from the visual-text splicing feature according to a formula (given in the original only as an image), where Mat denotes the global attention matrix, the visual-text splicing feature is the input, and the two transfer matrices used for feature-space transformation both have size d×d.
Optionally, the first visual-text correlation weight calculation module includes:
a first visual-text correlation weight calculation submodule, configured to calculate the first visual-text correlation weight according to a formula (given in the original only as an image); the first visual-text correlation weight is a matrix whose size is determined by the number of frames of the compressed video, where N denotes the number of features of a single video frame in the compressed video, M denotes the number of features of the description text, and the formula involves a transpose of the matrix in the non-time dimension.
Optionally, the total visual-text correlation weight calculation module includes:
a second visual-text correlation weight calculation module, configured to perform the distance similarity calculation according to a formula (given in the original only as an image) to obtain the second visual-text correlation weight, where the inputs are the video coding features and the expanded sentence features.
Optionally, the total visual-text correlation weight calculation module includes:
a total visual-text correlation weight calculation submodule, configured to calculate the total visual-text correlation weight A according to a formula (given in the original only as an image), where A is the total visual-text correlation weight characterizing the degree of correlation between the compressed video and the description text, and the inputs are the first and second visual-text correlation weights.
Optionally, the frame correlation list generation module is specifically configured to:
sum the total visual-text correlation weights according to a formula (given in the original only as an image) to obtain the frame correlation list S, where S denotes the resulting frame correlation list.
Optionally, the segment positioning refinement module includes:
a target video matching module, configured to select the coarse start frame and the coarse end frame from the positioning video frames, match the coarse start frame and the coarse end frame against the target video, and determine a target start frame corresponding to the coarse start frame and a target end frame corresponding to the coarse end frame;
and a segment positioning refinement submodule, configured to perform segment positioning refinement on the target start frame and the target end frame respectively to obtain an optimized start frame corresponding to the target start frame and an optimized end frame corresponding to the target end frame.
Optionally, the segment positioning refinement submodule includes:
a forward-backward frame expansion module, configured to expand the target start frame and the target end frame forwards and backwards by a preset frame-expansion number to obtain a start-frame candidate image set corresponding to the target start frame and an end-frame candidate image set corresponding to the target end frame;
a start-frame positioning refinement module, configured to perform start-frame positioning refinement on the start-frame candidate image set to obtain the optimized start frame corresponding to the target start frame;
and an end-frame positioning refinement module, configured to perform end-frame positioning refinement on the end-frame candidate image set to obtain the optimized end frame corresponding to the target end frame.
Optionally, the start-frame positioning refinement module includes:
a first clipped video segment generation module, configured to take each frame in the start-frame candidate image set as a candidate start frame, and clip the target video from each candidate start frame with a preset clip length to obtain a plurality of first clipped video segments;
a first similarity sequence generation module, configured to perform feature extraction on each first clipped video segment to obtain a start frame feature corresponding to each first clipped video segment, and compute a first similarity sequence from the start frame features and the expanded sentence features, where the first similarity sequence covers the candidate optimized start frames corresponding to the start frame features;
a first similarity supplement sequence generation module, configured to perform feature extraction on each candidate start frame separately to obtain candidate start frame features, and compute a first similarity supplement sequence from the candidate start frame features and the expanded sentence features;
a first similarity step signal set generation module, configured to perform adjacent-value subtraction on the first similarity supplement sequence to obtain a corresponding first similarity step signal set, where the first similarity step signal set contains the first step signals between adjacent candidate start frames;
and an optimized start frame selection module, configured to screen out the first target step signals that exceed a preset step threshold, select the corresponding target candidate optimized start frames from the first similarity sequence according to the first target step signals, and select the frame with the highest similarity among the target candidate optimized start frames as the optimized start frame.
Optionally, the first similarity sequence generation module is specifically configured to:
apply temporal adaptive convolution to each first clipped video segment while performing feature extraction to obtain the start frame feature corresponding to each first clipped video segment, and perform a cosine similarity operation between the start frame features and the expanded sentence features to obtain the first similarity sequence.
Optionally, the first similarity supplement sequence generation module is specifically configured to:
perform feature extraction on each candidate start frame with a convolutional neural network to obtain the candidate start frame features, and perform a cosine similarity operation between the candidate start frame features and the expanded sentence features to obtain the first similarity supplement sequence.
Optionally, the end-frame positioning refinement module includes:
a second clipped video segment generation module, configured to take each frame in the end-frame candidate image set as a candidate end frame, and clip the target video from each candidate end frame with a preset clip length to obtain a plurality of second clipped video segments;
a second similarity sequence generation module, configured to perform feature extraction on each second clipped video segment to obtain an end frame feature corresponding to each second clipped video segment, and compute a second similarity sequence from the end frame features and the expanded sentence features, where the second similarity sequence covers the candidate optimized end frames corresponding to the end frame features;
a second similarity supplement sequence generation module, configured to perform feature extraction on each candidate end frame separately to obtain candidate end frame features, and compute a second similarity supplement sequence from the candidate end frame features and the expanded sentence features;
a second similarity step signal set generation module, configured to perform adjacent-value subtraction on the second similarity supplement sequence to obtain a corresponding second similarity step signal set, where the second similarity step signal set contains the second step signals between adjacent candidate end frames;
and an optimized end frame selection module, configured to screen out the second target step signals that exceed a preset step threshold, select the corresponding target candidate optimized end frames from the second similarity sequence according to the second target step signals, and select the frame with the highest similarity among the target candidate optimized end frames as the optimized end frame.
Optionally, the second similarity sequence generation module is specifically configured to:
apply temporal adaptive convolution to each second clipped video segment while performing feature extraction to obtain the end frame feature corresponding to each second clipped video segment, and perform a cosine similarity operation between the end frame features and the expanded sentence features to obtain the second similarity sequence.
Optionally, the second similarity supplement sequence generation module is specifically configured to:
perform feature extraction on each candidate end frame with a convolutional neural network to obtain the candidate end frame features, and perform a cosine similarity operation between the candidate end frame features and the expanded sentence features to obtain the second similarity supplement sequence.
Optionally, the second similarity step signal set generation module is specifically configured to:
perform adjacent-value subtraction on the second similarity supplement sequence and then apply sign inversion (negation) to obtain the corresponding second similarity step signal set.
An embodiment of the invention also discloses an electronic device, which includes a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method according to the embodiment of the invention when executing the program stored in the memory.
Also disclosed is a computer-readable storage medium having instructions stored thereon, which, when executed by one or more processors, cause the processors to perform a method according to an embodiment of the invention.
Embodiments of the invention have the following advantages:
Embodiments of the invention provide a systematic pipeline built on a video-text matching algorithm and a prediction-interval optimization algorithm. The target video is first frame-extracted to obtain a corresponding compressed video, and multiple rounds of feature extraction yield the video coding features; in parallel, text feature extraction and the broadcast copy operation are applied to the description text to obtain the corresponding expanded sentence features. Visual-text feature fusion then produces the visual-text splicing feature that joins the video and the text, and the visual-text correlation operation on this splicing feature yields the visual-text moment sequence, so that fine-grained matching between video and text is realized through a variable-length video-text matching algorithm. Segment positioning refinement is then applied to the start and end frames to obtain optimized start and end frames, and the video segment between the optimized start and end frames is clipped as the final output segment. In this way an accurate start-stop position can be determined, and the positioning error of video behaviors is greatly reduced.
Drawings
FIG. 1 is a schematic diagram of conventional video behavior positioning;
FIG. 2 is a schematic diagram of video behavior positioning based on a text description;
FIG. 3 is an overall flow diagram of a two-stage text-video behavior positioning system according to an embodiment of the present invention;
FIG. 4 is a flowchart of the steps of a video behavior positioning method according to an embodiment of the present invention;
FIG. 5 is a flowchart of segment positioning refinement for a start frame according to an embodiment of the present invention;
FIG. 6 is a block diagram of a video behavior positioning apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer-readable medium according to an embodiment of the present invention;
FIG. 8 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the above objects, features and advantages of the present invention easier to understand, embodiments are described in further detail below with reference to the accompanying figures.
As an example, video positioning or behavior positioning specifically refers to the algorithmic task of clipping, from a video, a segment that conforms to a certain type of behavior. On the one hand, a service-oriented artificial intelligence robot needs to locate specific behaviors according to the user's language and the video captured by its camera while serving the user; on the other hand, for the many video-related tasks in internet technology such as video retrieval and video classification, text-based video behavior positioning is also highly necessary.
At present, existing video behavior positioning methods generally implement video positioning and output the corresponding segments based on technical means such as feature extraction and video segmentation. However, these methods tend to produce large errors in classifying video segments, and especially when the target segment is unrelated to picture cues of the video itself, such as shot splitting and shot switching, the start and end positions of the final output segment also tend to be insufficiently accurate.
For better illustration, referring to FIG. 1, a schematic diagram of conventional video behavior positioning is shown, using a video clip of a football match. For ordinary video behavior positioning, such as positioning the category "shooting", the task is to find the start and end moments in the video that match the "shooting" behavior and to output the segment between those moments.
Referring to FIG. 2, a schematic diagram of video behavior positioning based on a text description is shown, again using the video clip of FIG. 1. FIG. 2 differs from FIG. 1 in that it illustrates a video behavior positioning task driven by a text description: the specific category "shooting" is replaced by an action or event described in more complex text. If the input "shooting" category is turned into the sentence description "Player A shoots directly without being offside, and the ball goes in", then when positioning the behavior in the video, not only must the "shooting" behavior be found, but the conditions that the shooting player is "Player A", that there is "no offside" and that "the ball goes in" must be satisfied at the same time. Compared with general video behavior positioning, the text-description-based task therefore additionally requires the model to understand the text content and imposes finer-grained positioning requirements.
In view of the above, one of the core inventive points of the embodiments of the invention is as follows: the target video is first frame-extracted to obtain a corresponding compressed video, and multiple rounds of feature extraction yield the video coding features; at the same time, text feature extraction and a broadcast copy operation are applied to the description text to obtain the corresponding expanded sentence features; visual-text feature fusion produces the visual-text splicing feature that joins the video and the text, and a visual-text correlation operation on this splicing feature yields the visual-text moment sequence, so that fine-grained matching between video and text is realized through a variable-length matching algorithm; segment positioning refinement is then applied to the start and end frames to obtain optimized start and end frames, and the video segment between the optimized start and end frames is clipped as the final output segment, so that an accurate start-stop position can be determined and the positioning error of video behaviors is greatly reduced.
Based on this core inventive point, referring to FIG. 3, an overall flow diagram of the two-stage text-video behavior positioning system provided in an embodiment of the present invention is shown:
the whole process is described by taking the video clip and the corresponding description text of fig. 2 as an example, wherein a video clip of a football match is required to be positioned by video behavior, and the corresponding input text is that a player a directly gets a goal without offside and a ball enters, the two-stage text video behavior positioning system provided by the invention is mainly divided into two processing modules, one is a clip rough positioning module 301 which is mainly used for respectively extracting features of the video clip and text content, realizing feature fusion and performing video correlation operation and outputting a video time sequence, wherein the video time sequence can comprise a plurality of positioning video frames obtained after preliminarily screening video frames of a target video, and can also be called rough video frames, and meanwhile, an initial frame in the rough video frame can be taken as a rough initial frame, and a termination frame in the rough video frame is taken as a rough termination frame. The other is a segment positioning optimization module 302, which mainly performs segment tuning processing on a rough-arranged starting frame and a rough-arranged ending frame in a video time sequence, such as head-to-tail frame scoring and segment interval optimization, and after being processed by the segment rough positioning module 301 and the segment positioning optimization module 302, an optimized starting frame corresponding to the rough-arranged starting frame and an optimized ending frame corresponding to the rough-arranged ending frame can be output, so that the optimized starting frame can be used as a starting frame of video capture, the optimized ending frame can be used as an ending frame of video capture, a video segment between two frames in an original video segment is captured and used as an input text, namely 'player A directly hits the goal without offside, and the ball enters' a corresponding target video segment, thereby completing a video behavior positioning process based on text description.
Specifically, the complete target video is first input into the video frame extraction submodule, and frame extraction is performed on the target video according to a set frame extraction interval, i.e., one frame is extracted every frame-extraction-interval frames, and the extracted video frames are used as the compressed video, so that the original video segment is compressed for subsequent processing. The compressed video is then input into the video feature extraction submodule: feature extraction is performed on each frame of the compressed video by a convolutional neural network to obtain the corresponding extracted-frame feature maps, temporal-spatial convolution is performed on the extracted-frame feature maps to obtain the frame-extraction temporal feature map, and the frame-extraction temporal feature map is input into a classical feature detection network for feature extraction and screening, outputting the corresponding video coding features and completing the feature extraction for the video.
Meanwhile, the input text, i.e., the description text "player A shoots directly at the goal without being offside and the ball goes in", is input into the text feature extraction submodule for feature extraction, and the corresponding sentence features are output. The sentence features are then subjected to a broadcast copy operation and expanded into expanded sentence features matching the size of the video coding features, so that, through the text feature extraction operation and the broadcast copy operation, the features corresponding to the text are placed in one-to-one correspondence with the video coding features.
Then, the video coding features and the expanded sentence features are input into the video-text feature fusion submodule for feature fusion to obtain the corresponding video-text splicing feature; the video-text correlation operation is performed on the splicing feature in the video-text correlation operation submodule, and the corresponding video-text time sequence, which may include positioning video frames, is output. At this point the segment coarse positioning module 301 outputs a video-text time sequence. It should be noted that the module may output a plurality of video-text time sequences (because there may be several segments in the video that match the description), and may also output zero sequences (there may be no segment in the video that matches the description).
For convenience of description, assume that the target video, after being processed by the segment coarse positioning module 301, outputs one video-text time sequence to the head-and-tail frame scoring submodule of the segment positioning optimization module 302. The first positioning video frame of the sequence may be selected as the coarse start frame and the last positioning video frame as the coarse end frame, and the segment interval optimization submodule performs segment tuning on the coarse start frame and the coarse end frame respectively, outputting an optimized start frame corresponding to the coarse start frame and an optimized end frame corresponding to the coarse end frame. Finally, the video segment from the optimized start frame to the optimized end frame in the target video is clipped as the target video segment, i.e., the accurate segment corresponding to the input text "player A shoots directly at the goal without being offside and the ball goes in". In this way, the coarse start and end frames are positioned and optimized by the relevant submodules in the segment positioning optimization module 302, accurate optimized start and end frames are obtained, and an accurate target video segment can be output.
It should be noted that, for ease of explanation, the present embodiment uses the text-description-based video behavior localization example of fig. 2 for illustration, and the above overall description of the two-stage text-video behavior localization system is kept brief as a simple explanation of the implementation principle; more specific implementation steps are given in the following detailed explanation of fig. 4. It is understood that the present invention is not limited thereto.
It should be noted that the embodiment of the present invention includes but is not limited to the above examples, and it is understood that, under the guidance of the idea of the embodiment of the present invention, a person skilled in the art may also set the method according to actual requirements, and the present invention is not limited to this.
In the embodiment of the invention, a systematic flow based on a video-text matching algorithm and a prediction interval optimization algorithm is provided: frame extraction is first performed on a target video to obtain a corresponding compressed video, and multiple rounds of feature extraction are performed to obtain video coding features; meanwhile, text feature extraction and a broadcast copy operation are performed on the description text to obtain corresponding expanded sentence features; video-text feature fusion processing yields the video-text splicing feature relating the video to the text, and a video-text correlation operation on the splicing feature yields a video-text time sequence, so that a fine-grained matching process between the video and the text is realized through a variable-length matching algorithm between them; segment positioning and tuning processing is then performed on the start and end frames respectively to obtain an optimized start frame and an optimized end frame, and the video segment between them is clipped as the final output segment. In this way, not only can an accurate start-stop position be determined, but the localization error of video behaviors is also greatly reduced.
Referring to fig. 4, a flowchart illustrating the steps of a video behavior localization method according to an embodiment of the present invention is shown, which may specifically include the following steps:
step 401, acquiring a target video, performing frame extraction on the target video to obtain a corresponding compressed video, and performing multiple feature extraction on the compressed video to obtain corresponding video coding features;
specifically, the obtaining of the target video, the frame extraction of the target video, and the obtaining of the corresponding compressed video may be: the method comprises the steps of obtaining a target video needing video behavior positioning, performing frame extraction processing on the target video according to a preset frame extraction interval value to obtain a corresponding extracted frame sequence, and using the extracted frame sequence as a compressed video corresponding to the target video, so that the target video is compressed in a video frame extraction mode, and the calculation amount can be reduced in the subsequent related algorithm process on the premise of not influencing fragment positioning.
As an optional embodiment, in order to make subsequent processing more accurate and reliable, during frame extraction the same processing may be applied to the frames immediately before and after the start extraction frame, in addition to extracting the start extracted-frame sequence, yielding a pre-extracted-frame sequence and a post-extracted-frame sequence; the three sequences are then processed correspondingly to obtain a more accurate feature extraction result. The extracted-frame sequence may be obtained as follows: first, a start extraction frame is determined for the target video, and frames are extracted from the start extraction frame according to the preset frame extraction interval value to obtain the start extracted-frame sequence corresponding to the start extraction frame. Then a pre-extraction frame and a post-extraction frame corresponding to the start extraction frame are determined, where the pre-extraction frame is the frame immediately before the start extraction frame and the post-extraction frame is the frame immediately after it; frames are extracted from the pre-extraction frame according to the same preset interval value to obtain the pre-extracted-frame sequence, and from the post-extraction frame to obtain the post-extracted-frame sequence. The start extracted-frame sequence, the pre-extracted-frame sequence and the post-extracted-frame sequence together serve as the extracted-frame sequence corresponding to the target video.
Exemplarily, assume that the target video has 20 video frames and 1 frame is extracted every 4 frames. If the sequence number of the start extraction frame is determined to be 2, the start extracted-frame sequence [2, 6, 10, 14, 18] is obtained after frame extraction; meanwhile, the sequence number of the pre-extraction frame is 1 and that of the post-extraction frame is 3, and after frame extraction is performed on them respectively, the pre-extracted-frame sequence [1, 5, 9, 13, 17] and the post-extracted-frame sequence [3, 7, 11, 15, 19] are obtained.
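A minimal sketch of this frame-extraction step is given below; the function name, the 1-based frame indexing and the use of Python are illustrative choices rather than part of the patent.

# Minimal sketch of the frame-extraction step described above; names are illustrative.
def build_extracted_sequences(num_frames, start_frame, interval):
    """Return the (pre, start, post) extracted-frame sequences of 1-based frame indices."""
    def extract(first):
        # take one frame every `interval` frames, starting from `first`
        return list(range(first, num_frames + 1, interval))
    start_seq = extract(start_frame)      # e.g. [2, 6, 10, 14, 18]
    pre_seq = extract(start_frame - 1)    # e.g. [1, 5, 9, 13, 17]
    post_seq = extract(start_frame + 1)   # e.g. [3, 7, 11, 15, 19]
    return pre_seq, start_seq, post_seq

print(build_extracted_sequences(num_frames=20, start_frame=2, interval=4))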
Further, the feature extraction is performed on the compressed video for multiple times to obtain corresponding video coding features, which may specifically be: firstly, extracting the features of a compressed video to obtain a corresponding video feature map, carrying out time sequence space convolution processing on the video feature map to obtain a corresponding frame extraction time sequence feature map, and then carrying out feature extraction on the frame extraction time sequence feature map to obtain the video coding features corresponding to the compressed video.
In an alternative embodiment, performing feature extraction on the compressed video to obtain the corresponding video feature map includes: inputting each frame in the extracted-frame sequence into a convolutional neural network for feature extraction to obtain the extracted-frame feature maps corresponding to the compressed video. Concretely, the start extracted-frame sequence, the pre-extracted-frame sequence and the post-extracted-frame sequence are respectively input into the convolutional neural network for feature extraction, yielding a start extracted-frame feature map corresponding to the start extracted-frame sequence, a pre extracted-frame feature map corresponding to the pre-extracted-frame sequence, and a post extracted-frame feature map corresponding to the post-extracted-frame sequence. The start, pre and post extracted-frame feature maps all have the same size, where h denotes the height of the feature map, w denotes its width, and c denotes the number of channels of the feature map.
Further, performing temporal-spatial convolution processing on the video feature map to obtain the corresponding frame-extraction temporal feature map may include: inputting the start extracted-frame feature map into a three-dimensional convolution for temporal-spatial convolution processing, so that temporal information is extracted and the frame-extraction temporal feature map corresponding to the video feature map is obtained, where d denotes the dimension of each feature, i.e., each feature consists of d numbers.
Still further, performing feature extraction on the frame-extraction temporal feature map to obtain the video coding features corresponding to the compressed video may specifically include: fusing the frame-extraction temporal feature map with the pre and post extracted-frame feature maps by concatenating them along the lowest dimension, so that the extracted frames carry richer visual information, and inputting the fused feature map into a classical feature detection network (Region Proposal Network, RPN) for feature extraction and screening, obtaining the video coding features corresponding to the compressed video. In this way, by processing the frames before and after the start extraction frames together and applying feature fusion, video coding features with richer visual information are obtained.
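A rough PyTorch sketch of this multi-stage visual feature extraction is given below. The backbone, the three-dimensional convolution and the final projection are simplified stand-ins (the patent specifies a convolutional backbone, temporal-spatial convolution and an RPN-style feature detection network without fixing their architectures), and every size, layer and variable name here is hypothetical.

import torch
import torch.nn as nn

T, H, W = 5, 64, 64   # number of extracted frames and frame resolution (illustrative)
c, d, N = 16, 32, 8   # backbone channels, feature dimension, features kept per frame (illustrative)

backbone = nn.Sequential(                                  # per-frame 2-D CNN
    nn.Conv2d(3, c, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(c, c, 3, stride=2, padding=1), nn.ReLU(),
)
temporal_conv = nn.Conv3d(c, d, kernel_size=3, padding=1)  # temporal-spatial convolution
project = nn.Linear(d + 2 * c, d)                          # stand-in for the detection network

start = backbone(torch.randn(T, 3, H, W))                  # start extracted-frame feature maps [T, c, h, w]
pre = backbone(torch.randn(T, 3, H, W))                    # pre extracted-frame feature maps
post = backbone(torch.randn(T, 3, H, W))                   # post extracted-frame feature maps

# Three-dimensional convolution over the start-sequence maps injects temporal information.
timing = temporal_conv(start.permute(1, 0, 2, 3).unsqueeze(0))   # [1, d, T, h, w]
timing = timing.squeeze(0).permute(1, 0, 2, 3)                   # frame-extraction temporal feature map [T, d, h, w]

# Fuse with the pre/post feature maps by concatenating along the channel dimension.
fused = torch.cat([timing, pre, post], dim=1)                    # [T, d + 2c, h, w]

# The patent feeds the fused map to an RPN-style network for extraction and screening;
# here every spatial position is projected to d dimensions and the first N are kept.
tokens = fused.flatten(2).transpose(1, 2)                        # [T, h*w, d + 2c]
video_coding_features = project(tokens)[:, :N, :]                # [T, N, d]
print(video_coding_features.shape)                               # torch.Size([5, 8, 32])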
Step 402, obtaining a description text corresponding to the target video, performing feature extraction on the description text, and outputting corresponding sentence features;
meanwhile, description text corresponding to the target video may be acquired and input to a text feature extractor, illustratively, feature extraction is performed in BERT (Bidirectional Encoder retrieval from transforms), the output size being [, ]Md]For the feature extraction of text content, those skilled in the art may use other similar encoders or text models to perform feature extraction, which is not limited by the present invention.
Step 403, performing a broadcast copy operation on the sentence features to obtain corresponding expanded sentence features, and feature-splicing the expanded sentence features with the video coding features to obtain the video-text splicing feature;
after obtaining the sentence features corresponding to the description text, then according to the frame number of the extracted frame sequence, performing broadcast copy operation on the sentence features, and expanding the sentence features into expanded sentence features corresponding to the extracted frame sequence
Figure 354085DEST_PATH_IMAGE055
Wherein the extended sentence is characterized by a size of
Figure 408628DEST_PATH_IMAGE056
For example, the single sentence characteristic may be subjected to copy expansion according to the starting decimated frame sequence in the preceding decimated frame sequence, and if the frame number corresponding to the starting decimated frame sequence is 5 frames, 1 is set to the size [2 ]Md]The sentence characteristics of (1) are expanded to have 5 sentence characteristics, so that the video characteristics of each frame can be enabled to correspond to the sentence characteristics by carrying out broadcast copy operation on the sentence characteristics, and the aim is to copy the sentence characteristics into the sentence characteristics
Figure 244997DEST_PATH_IMAGE057
So as to correspond one-to-one to the video coding features.
The expanded sentence features can then be feature-spliced with the video coding features to obtain the video-text splicing feature, thereby completing the feature fusion processing of video and text and obtaining a video-text splicing feature in which the video and the text are preliminarily correlated.
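A short sketch of the broadcast copy and splicing step follows, assuming the video coding features have shape [T, N, d] (T extracted frames, N features per frame) and that splicing concatenates the two feature sets along the feature-count dimension; all sizes are illustrative.

import torch

T, N, M, d = 5, 8, 12, 32                      # illustrative sizes
video_coding_features = torch.randn(T, N, d)   # from the visual branch
sentence_features = torch.randn(M, d)          # from the text branch

# Broadcast-copy the single [M, d] sentence feature once per extracted frame -> [T, M, d].
expanded_sentence_features = sentence_features.unsqueeze(0).expand(T, M, d)

# Splice along the feature-count dimension to form the video-text splicing feature [T, N + M, d].
splicing_features = torch.cat([video_coding_features, expanded_sentence_features], dim=1)
print(splicing_features.shape)                 # torch.Size([5, 20, 32])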
Step 404, performing a video-text correlation operation on the video-text splicing feature, and outputting a corresponding video-text time sequence, wherein the video-text time sequence comprises positioning video frames;
then, the video character splicing characteristics can be subjected to video character correlation operation and outputCorresponding sight text time sequence
Figure 123326DEST_PATH_IMAGE060
The temporal sequence of views may include positioning video frames, and it should be noted that after performing the temporal correlation operation, a plurality of temporal sequences of views may be output (because there may be a plurality of segments conforming to the description in the video), or 0 sequences may be output (that is, there is no segment conforming to the description in the video).
Specifically, performing the video-text correlation operation on the video-text splicing feature and outputting the corresponding video-text time sequence may proceed as follows: first, video-text attention calculation is performed on the video-text splicing feature to obtain the corresponding global attention matrix Mat, and a first video-text correlation weight is calculated from the global attention matrix; then, a distance similarity calculation between the video coding features and the expanded sentence features yields a second video-text correlation weight, and a total video-text correlation weight is obtained from the first and second video-text correlation weights; the total video-text correlation weight is then summed to obtain a frame correlation list, which contains conformity values expressing how well each video frame of the compressed video conforms to the text description; finally, target conformity values higher than a preset conformity threshold are screened out, the video frames corresponding to these target conformity values are determined as positioning video frames, and all positioning video frames are assembled into the video-text time sequence corresponding to the video-text splicing feature, which is output.
As an optional embodiment, performing the video-text attention calculation on the video-text splicing feature to obtain the corresponding global attention matrix may use an attention formula of the form Mat = softmax((E·Wq)(E·Wk)^T / size), where Mat is the global attention matrix, E is the video-text splicing feature, Wq and Wk are both transformation matrices of size d×d used for feature space transformation, T denotes the transpose, and size is the size of the video-text splicing feature. The subscripts of the parameter symbols in the formulas, such as q and k, are merely used to distinguish the parameters and carry no special meaning; those skilled in the art may set them according to actual situations or needs, and the present invention is not limited thereto.
The softmax function, also called the normalized exponential function, is the generalization of the binary sigmoid function to the multi-class case, and its purpose is to present multi-class results in the form of probabilities.
As an alternative embodiment, the first video-text correlation weight A1 is calculated from the global attention matrix, with a transpose of the matrix applied in the non-temporal dimension. A1 is a matrix of size [T, N, M], where T denotes the number of frames of the compressed video, N denotes the number of features corresponding to a single frame of video in the compressed video, and M denotes the number of features corresponding to the description text.
In addition, for the features of the two modalities (i.e., the video and the text in the embodiment of the present invention), namely the video coding features and the expanded sentence features, a matrix of the same size [T, N, M] can similarly be obtained by traversing all frames. That is, the distance similarity calculation between the video coding features and the expanded sentence features to obtain the second video-text correlation weight may specifically be: for every frame, computing the distance similarity between each of the N video coding features and each of the M expanded sentence features, giving the second video-text correlation weight A2 of size [T, N, M].
Further, obtaining the total video-text correlation weight from the first and second video-text correlation weights may specifically be: fusing the first video-text correlation weight A1 and the second video-text correlation weight A2 into the total video-text correlation weight A, where A is used to characterize the degree of correlation between the compressed video and the description text.
Further, summing the total video-text correlation weight to obtain the frame correlation list may specifically be: summing A over range(N), the feature indices of a single frame of the compressed video, and over range(M), the feature indices of the description text, i.e. S[t] = Σ_{i∈range(N)} Σ_{j∈range(M)} A[t, i, j], so that S is the frame correlation list of size [T, 1].
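A sketch of turning the correlation weights into the frame correlation list is given below; since the exact fusion of the first and second weights is not spelled out here, element-wise addition is used purely for illustration, and all tensors are random stand-ins.

import torch

T, N, M = 5, 8, 12
A1 = torch.rand(T, N, M)              # first video-text correlation weight (stand-in)
A2 = torch.rand(T, N, M)              # second video-text correlation weight (stand-in)
A = A1 + A2                           # illustrative fusion into the total video-text correlation weight

# Sum over the N video features and M text features of every frame -> frame correlation list [T, 1].
S = A.sum(dim=(1, 2)).unsqueeze(1)
print(S.shape)                        # torch.Size([5, 1])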
Finally, all conformity values in S that are higher than the preset conformity threshold thresh are screened out, and the corresponding frames form the video-text time sequence that meets the text description. The preset conformity threshold may be set according to actual requirements. In addition, it should be noted that if the resulting video-text time sequence is not continuous, it may be split into several sequences at the discontinuities. For example, suppose the extracted frames are the 5 frames [2, 6, 10, 14, 18]; if the frames meeting the requirement are [2, 6, 14, 18], the sequence can be split into 2 segments at the break at "10", namely [2, 6] and [14, 18]. For ease of description, the embodiment of the present invention uses an unbroken video-text time sequence as an example; in practical applications the video-text time sequence may be processed according to the actual situation, and the present invention is not limited thereto.
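The thresholding and splitting logic can be sketched as follows, reusing the example above; the scores and the threshold value are made up for illustration.

def locate_sequences(frame_ids, scores, thresh):
    # keep frames whose conformity value exceeds the threshold
    kept = {f for f, s in zip(frame_ids, scores) if s > thresh}
    sequences, current = [], []
    for f in frame_ids:
        if f in kept:
            current.append(f)
        elif current:                 # a break in the kept frames closes the current sequence
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences

print(locate_sequences([2, 6, 10, 14, 18], [0.9, 0.8, 0.1, 0.7, 0.95], thresh=0.5))
# [[2, 6], [14, 18]]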
It should be noted that, for convenience of explanation, the setting of each parameter in the above embodiments, or the setting of the video frame example and the like are only used as an example, and it should be understood that the present invention is not limited thereto.
Step 405, selecting a coarse start frame and a coarse end frame from the positioning video frames, and performing segment positioning and tuning processing on the coarse start frame and the coarse end frame respectively to obtain an optimized start frame corresponding to the coarse start frame and an optimized end frame corresponding to the coarse end frame;
for convenience of subsequent description, if only one sight text time sequence is output after sight text correlation operation, the coarse-row initial frame in the sight text time sequence
Figure 36629DEST_PATH_IMAGE080
And a coarse line termination frame
Figure 133898DEST_PATH_IMAGE081
The specific process of performing the fragment positioning and tuning processing respectively may be as follows: firstly, selecting a rough-arrangement starting frame and a rough-arrangement ending frame from a positioning video frame, respectively matching the rough-arrangement starting frame and the rough-arrangement ending frame with a target video, and determining a target starting frame corresponding to the rough-arrangement starting frame
Figure 576512DEST_PATH_IMAGE082
And a target termination frame corresponding to the coarse termination frame
Figure 733824DEST_PATH_IMAGE083
Then, the target start frame and the target end frame are respectively subjected to segment positioning and adjustingAnd processing to obtain an optimized initial frame corresponding to the target initial frame and an optimized end frame corresponding to the target end frame, so that segment positioning and adjusting optimization processing is respectively carried out on the start-stop frames in a frame matching mode, more accurate segment start-stop frames can be obtained, and video behavior positioning is more accurate.
Further, performing segment positioning and tuning processing on the target start frame and the target end frame to obtain the optimized start frame and the optimized end frame may be: expanding the target start frame and the target end frame forwards and backwards by a preset frame expansion number k, obtaining a start frame candidate image set corresponding to the target start frame and an end frame candidate image set corresponding to the target end frame; then performing start frame positioning and tuning processing on the start frame candidate image set to obtain the optimized start frame corresponding to the target start frame, and performing end frame positioning and tuning processing on the end frame candidate image set to obtain the optimized end frame corresponding to the target end frame. The preset frame expansion number k is larger than the preset frame extraction interval value used in the frame extraction processing.
For better explanation, referring to fig. 5, a schematic flow chart of segment positioning and tuning of the start frame provided in the embodiment of the present invention is shown, and the start frame tuning flow is further explained with reference to fig. 5. It should be noted that, in order to make the description of segment positioning and tuning of the start frame clearer and more complete, the related processes from the foregoing steps, such as matching the coarse start frame with the target video, determining the target start frame corresponding to the coarse start frame, performing frame expansion on the target start frame according to the preset frame expansion number, and obtaining the start frame candidate image set corresponding to the target start frame, are also incorporated into this processing flow.
In a specific implementation, performing start frame positioning and optimizing processing on the start frame candidate image set to obtain an optimized start frame corresponding to the target start frame may include the following sub-steps:
step S1, taking each frame in a starting frame candidate image set as a candidate starting frame, and carrying out video interception on a target video according to each candidate starting frame and a preset interception length to obtain a plurality of first intercepted video segments;
that is, the starting frame candidate image set
Figure 69121DEST_PATH_IMAGE088
Each frame in the video sequence is used as a candidate initial frame, and the target video is intercepted according to a preset interception lengthlVideo interception is carried out to obtain a plurality of lengthslThe first truncated video segment of (a), wherein,lfor settable parameters, the number of the first cut video segments may be 2k+1。
Step S2, extracting the characteristics of each first cut-off video segment to obtain the initial frame characteristics corresponding to each first cut-off video segment, and calculating by adopting the initial frame characteristics and the expansion sentence characteristics to obtain a first similarity sequence, wherein the first similarity sequence comprises each candidate optimization initial frame corresponding to the initial frame characteristics;
then can be paired with 2k+1 pieces of lengthlThe first cut-off video segments are subjected to feature extraction to obtain start frame features corresponding to the first cut-off video segments, and the start frame features and the expanded sentence features are adopted for calculation to obtain a first similarity sequence, which specifically can be: to 2k+1 pieces of lengthl ofPerforming feature extraction on the first cut video segment by adopting time sequence Adaptive convolution (TADACONv), and finally performing feature extraction on the first cut video segment to 2k+1 first truncated video segments, the corresponding size of [2 ] can be obtainedk+1,d]Start frame feature of
Figure 773772DEST_PATH_IMAGE089
And cosine similarity operation is carried out by adopting the characteristics of the initial frame and the characteristics of the expanded sentences to obtain the size of [2 ]k+1,1]First similarity sequence of (2)
Figure 190978DEST_PATH_IMAGE090
Wherein "1" in the first similarity sequence indicates that the dimension of the matrix in the last dimension is 1.
S3, respectively extracting the characteristics of each candidate initial frame to obtain the characteristics of each candidate initial frame, and calculating by adopting the characteristics of each candidate initial frame and the characteristics of the expanded sentences to obtain a first similarity supplement sequence;
meanwhile, for each candidate start frame in the start frame candidate picture set, feature extraction can be performed on the candidate start frame respectively to obtain each candidate start frame feature, and each candidate start frame feature and the extension statement feature are adopted for calculation to obtain a first similarity supplement sequence, specifically, feature extraction can be performed on each candidate start frame in the start frame candidate picture set respectively by adopting a Convolutional Neural Network (CNN) to obtain each candidate start frame feature, and cosine similarity calculation is performed by adopting each candidate start frame feature and the extension statement feature to obtain the size of [2 ]k+1,1]First similarity complementing sequence of
Figure 690093DEST_PATH_IMAGE091
S4, performing adjacent value subtraction calculation on the first similarity supplement sequence to obtain a corresponding first similarity step signal set, wherein the first similarity step signal set comprises corresponding first step signals between adjacent candidate start frames;
the first similarity-complementing sequence may then be subjected to a subtraction of neighboring values, i.e. using the formula
Figure 731430DEST_PATH_IMAGE092
Performing adjacent value subtraction to obtain the value of [2 ]k,1]First similarity step information ofNumber set
Figure 583980DEST_PATH_IMAGE093
It may also be referred to as a set of similarity step signals corresponding to the starting frame.
And S5, screening first target step signals larger than a preset step threshold value from the first step signals, selecting corresponding target candidate optimization starting frames from the first similarity sequence according to the first target step signals, and selecting a frame with the highest similarity from the target candidate optimization starting frames as an optimization starting frame.
Finally, the first target step signals that are larger than the preset step threshold can be screened out from all the first step signals; according to the first target step signals, the corresponding target candidate optimized start frames are selected from the first similarity sequence, and the frame with the highest similarity among the target candidate optimized start frames is selected as the optimized start frame, that is, the time point with the highest value in the first similarity sequence among the target candidates is output. A more accurate optimized start frame is thus obtained through the segment tuning processing of the start frame.
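A condensed sketch of steps S1 to S5 is given below. The TAdaConv and CNN feature extractors are replaced by random tensors, the pooled sentence feature and the threshold are made up, and the mapping from a large step signal to a candidate frame index is one reasonable reading of the description above rather than the patent's exact rule.

import torch

k, d = 5, 32
expanded_sentence = torch.randn(d)              # pooled sentence feature (illustrative)
clip_features = torch.randn(2 * k + 1, d)       # start frame features from the clipped segments (S2)
frame_features = torch.randn(2 * k + 1, d)      # candidate start frame features from a CNN (S3)

sim_seq = torch.cosine_similarity(clip_features, expanded_sentence.unsqueeze(0), dim=1)   # [2k+1]
sim_sup = torch.cosine_similarity(frame_features, expanded_sentence.unsqueeze(0), dim=1)  # [2k+1]

step = sim_sup[1:] - sim_sup[:-1]               # adjacent-value subtraction (S4), size [2k]
step_thresh = 0.1                               # preset step threshold (illustrative)
candidates = (step > step_thresh).nonzero(as_tuple=True)[0] + 1   # frames right after a positive jump (S5)

if len(candidates) > 0:
    best = candidates[sim_seq[candidates].argmax()]   # highest first-similarity value among candidates
else:
    best = sim_seq.argmax()                           # fallback: overall best frame (see the note further below)
print("optimized start frame index within the candidate set:", int(best))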
The above steps S1 to S5 constitute the segment positioning and tuning flow for the start frame. The corresponding flow for the end frame is described in detail below; since it closely mirrors the flow for the start frame, no separate drawing is provided, and those skilled in the art can derive the corresponding schematic from the end frame tuning flow. It can be understood that the present invention is not limited thereto. For simplicity, the following directly describes the end frame positioning and tuning processing performed on the end frame candidate image set.
In a specific implementation, performing termination frame positioning and tuning processing on the termination frame candidate image set to obtain an optimized termination frame corresponding to the target termination frame may include the following sub-steps:
step S11, taking each frame in the termination frame candidate image set as a candidate termination frame, and carrying out video interception on the target video according to each candidate termination frame and a preset interception length to obtain a plurality of second intercepted video segments;
Specifically, each frame in the end frame candidate image set is taken as a candidate end frame, and the target video is clipped from each candidate end frame with the preset clip length l, yielding a number of second clipped video segments of length l, where l is a settable parameter and the number of second clipped video segments is 2k+1.
Step S22, extracting the characteristics of each second intercepted video segment to obtain the termination frame characteristics corresponding to each second intercepted video segment, and calculating by adopting the termination frame characteristics and the expansion statement characteristics to obtain a second similarity sequence, wherein the second similarity sequence comprises each candidate optimized termination frame corresponding to the termination frame characteristics;
then can be paired with 2k+1 pieces of lengthlThe second intercepted video segments are subjected to feature extraction to obtain termination frame features corresponding to each second intercepted video segment, and the termination frame features and the expanded sentence features are adopted for calculation to obtain a second similarity sequence, which specifically can be as follows: to 2k+1 pieces of lengthl ofPerforming feature extraction on the second cut video segment by adopting time sequence Adaptive convolution (TADACONv), and finally performing feature extraction on the second cut video segment to obtain a second cut video segment with respect to 2k+1 second truncated video segments, the corresponding size of [2 ] can be obtainedk+1,d]Start frame feature of
Figure 545803DEST_PATH_IMAGE097
And cosine similarity operation is carried out by adopting the characteristics of the termination frame and the extended sentence to obtain the value of [2 ]k+1,1]Second similarity degree sequence of
Figure 485071DEST_PATH_IMAGE098
Wherein "1" in the second similarity sequence indicates that the dimension of the matrix in the last dimension is 1.
Step S33, respectively extracting the characteristics of each candidate termination frame to obtain the characteristics of each candidate termination frame, and calculating by adopting the characteristics of each candidate termination frame and the characteristics of the expansion sentences to obtain a second similarity supplement sequence;
Meanwhile, feature extraction can be performed on each candidate end frame in the end frame candidate image set, and the resulting candidate end frame features are used together with the expanded sentence features to obtain the second similarity supplement sequence. Specifically, a convolutional neural network (CNN) is used to extract features from each candidate end frame in the end frame candidate image set, and a cosine similarity operation between each candidate end frame feature and the expanded sentence features gives the second similarity supplement sequence of size [2k+1, 1].
Step S44, carrying out adjacent value subtraction calculation on the second similarity supplement sequence to obtain a corresponding second similarity step signal set, wherein the second similarity step signal set comprises corresponding second step signals between adjacent candidate termination frames;
the second similarity-supplementing sequence may then be subjected to a subtraction of adjacent values, i.e. using the formula
Figure 803237DEST_PATH_IMAGE100
Firstly, carrying out subtraction calculation on adjacent values, and then carrying out phase inversion processing to obtain the value of [2 ]k,1]Second similarity step signal set
Figure 387802DEST_PATH_IMAGE101
And may also be referred to as a set of similarity step signals corresponding to the termination frame.
It should be noted that, when segment positioning and tuning is performed on the end frame, the difference from the processing of the start frame is that the second similarity supplement sequence is first subjected to adjacent-value subtraction and the result is then negated to obtain the corresponding second similarity step signal set. The negation is introduced so that, when judging the start and end frames, the trend of the similarity change reveals which position is a start frame and which is an end frame: similarity rises at a start frame and drops after an end frame, and negating the differences turns that drop into a positive step.
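A tiny self-contained sketch of this sign convention follows; the similarity values are made up.

import torch

sim_sup_end = torch.tensor([0.82, 0.80, 0.79, 0.20, 0.18])   # per-frame similarities around the end frame
step_end = -(sim_sup_end[1:] - sim_sup_end[:-1])             # subtract adjacent values, then negate
print(step_end)                                              # the large positive value marks the similarity drop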
And step S55, screening second target step signals larger than a preset step threshold value from the second step signals, selecting corresponding target candidate optimization termination frames from the second similarity sequence according to the second target step signals, and selecting one frame with the highest similarity from the target candidate optimization termination frames as the optimization termination frame.
Finally, the second target step signals that are larger than the preset step threshold can be screened out from all the second step signals; according to the second target step signals, the corresponding target candidate optimized end frames are selected from the second similarity sequence, and the frame with the highest similarity among the target candidate optimized end frames is selected as the optimized end frame, that is, the time point with the highest value in the second similarity sequence among the target candidates is output. A more accurate optimized end frame is thus obtained through the segment tuning processing of the end frame.
The above steps S11 to S55 constitute the segment positioning and tuning flow for the end frame. For the tuning process of either the start frame or the end frame, although the possibility is extremely low, it may happen that all step signals in the similarity step signal set are less than or equal to the preset step threshold. For this case, corresponding countermeasures can be set in the actual application scenario: on the one hand, the preset step threshold may be lowered; on the other hand, the frame with the highest similarity may be selected directly from the corresponding similarity sequence as the final output.
Step 406, taking a video segment between the optimization starting frame and the optimization ending frame in the target video as a target video segment corresponding to the description text, and outputting the target video segment.
After the optimization starting frame and the optimization ending frame are determined, a video segment between the optimization starting frame and the optimization ending frame in the target video can be intercepted and used as a target video segment corresponding to the description text, and the target video segment is output, so that a video behavior positioning process based on text description is completed, and an accurate video positioning result is obtained.
It should be noted that the embodiment of the present invention includes but is not limited to the above examples, and it is understood that, under the guidance of the idea of the embodiment of the present invention, a person skilled in the art may also set the method according to actual requirements, and the present invention is not limited to this.
In the embodiment of the invention, a systematic flow based on a video-text matching algorithm and a prediction interval optimization algorithm is provided: frame extraction is first performed on the target video to obtain the corresponding compressed video, multiple rounds of feature extraction are performed to obtain the video coding features, text feature extraction and the broadcast copy operation are performed on the description text to obtain the corresponding expanded sentence features, and the video-text splicing feature relating the video and the text is obtained through video-text feature fusion processing.
It should be noted that for simplicity of description, the method embodiments are shown as a series of combinations of acts, but those skilled in the art will recognize that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Referring to fig. 6, a block diagram of a structure of a video behavior localization apparatus provided in the embodiment of the present invention is shown, which may specifically include the following modules:
the video coding feature generation module 601 is configured to obtain a target video, perform frame extraction on the target video to obtain a corresponding compressed video, and perform multiple feature extraction on the compressed video to obtain corresponding video coding features;
a sentence characteristic output module 602, configured to obtain a description text corresponding to the target video, perform characteristic extraction on the description text, and output a corresponding sentence characteristic;
a broadcast copy operation module 603, configured to perform the broadcast copy operation on the sentence features to obtain corresponding expanded sentence features, and to feature-splice the expanded sentence features with the video coding features to obtain the video-text splicing feature;
a video-text correlation operation module 604, configured to perform the video-text correlation operation on the video-text splicing feature and output a corresponding video-text time sequence, wherein the video-text time sequence includes positioning video frames;
a segment positioning and adjusting module 605, configured to select a coarse starting frame and a coarse ending frame from the positioning video frame, and perform segment positioning and adjusting processing on the coarse starting frame and the coarse ending frame respectively to obtain an optimized starting frame corresponding to the coarse starting frame and an optimized ending frame corresponding to the coarse ending frame;
and a target video segment output module 606, configured to take a video segment between the optimization start frame and the optimization end frame in the target video as a target video segment corresponding to the description text, and output the target video segment.
In some optional embodiments, the video coding feature generation module 601 is specifically configured to:
and performing frame extraction processing on the target video according to a preset frame extraction interval value to obtain a corresponding extracted frame sequence, and taking the extracted frame sequence as a compressed video corresponding to the target video.
In some optional embodiments, the video coding feature generation module 601 includes:
an initial frame sequence extraction determining module, configured to determine an initial extraction frame for the target video, and start to extract frames from the initial extraction frame according to a preset frame extraction interval value to obtain an initial extraction frame sequence corresponding to the initial extraction frame;
a pre-extraction frame and post-extraction frame determining module, configured to determine a pre-extraction frame and a post-extraction frame that correspond to the initial extraction frame, where the pre-extraction frame is a previous frame of the initial extraction frame, and the post-extraction frame is a next frame of the initial extraction frame;
a pre-decimation frame sequence determining module, configured to perform frame decimation on the target video according to a preset frame decimation interval value from the pre-decimation frame, so as to obtain a pre-decimation frame sequence corresponding to the initial decimation frame;
a post extraction frame sequence determining module, configured to perform frame extraction on the target video according to a preset frame extraction interval value from the post extraction frame to obtain a post extraction frame sequence corresponding to the initial extraction frame;
and the extracted frame sequence determining module is used for taking the starting extracted frame sequence, the front extracted frame sequence and the rear extracted frame sequence as the extracted frame sequence corresponding to the target video.
In some optional embodiments, the video coding feature generation module 601 includes:
the frame-extracting time sequence feature map generation module is used for extracting features of the compressed video to obtain a corresponding video feature map, and performing time sequence space convolution processing on the video feature map to obtain a corresponding frame-extracting time sequence feature map;
and the video coding feature generation submodule is used for extracting the features of the frame extraction time sequence feature graph to obtain the video coding features corresponding to the compressed video.
In some optional embodiments, the frame-extracting timing feature map generating module is specifically configured to:
and inputting the extracted frame sequence into a convolutional neural network for feature extraction to obtain a frame extraction feature image corresponding to the compressed video.
In some optional embodiments, the frame-extracting timing feature map generating module is specifically configured to:
and respectively inputting the initial extraction frame sequence, the preposed extraction frame sequence and the postposition extraction frame sequence into a convolutional neural network for feature extraction to obtain an initial extraction frame feature map corresponding to the initial extraction frame sequence, a preposed extraction frame feature map corresponding to the preposed extraction frame sequence and a postposition extraction frame feature map corresponding to the postposition extraction frame sequence.
In some optional embodiments, the frame-extracting timing feature map generating module is specifically configured to:
and inputting the initial frame-extracting feature map into a three-dimensional convolution for time sequence space convolution processing to obtain a frame-extracting time sequence feature map corresponding to the video feature map.
In some optional embodiments, the video coding feature generation sub-module is specifically configured to:
and fusing the frame-extracting time sequence feature map, the front frame-extracting feature map and the rear frame-extracting feature map to obtain a fused feature map, inputting the fused feature map into a feature detection network for feature extraction, and obtaining video coding features corresponding to the compressed video.
In some optional embodiments, the broadcast copy operation module 603 is specifically configured to:
and performing broadcast copy operation on the statement features according to the frame number of the extracted frame sequence, and expanding the statement features into expanded statement features corresponding to the extracted frame sequence.
In some optional embodiments, the video-text correlation operation module 604 comprises:
a first video-text correlation weight calculation module, configured to perform the video-text attention calculation on the video-text splicing feature to obtain the corresponding global attention matrix and to calculate the first video-text correlation weight from the global attention matrix;
a total video-text correlation weight calculation module, configured to perform the distance similarity calculation between the video coding features and the expanded sentence features to obtain the second video-text correlation weight, and to obtain the total video-text correlation weight from the first and second video-text correlation weights;
a frame correlation list generation module, configured to sum the total video-text correlation weight to obtain the frame correlation list, where the frame correlation list includes conformity values used to represent the degree to which each video frame in the compressed video conforms to the text description;
a video-text time sequence generation module, configured to screen out target conformity values higher than the preset conformity threshold from the conformity values, determine the video frames corresponding to the target conformity values as positioning video frames, assemble all the positioning video frames into the video-text time sequence corresponding to the video-text splicing feature, and output the video-text time sequence.
In some optional embodiments, the first video-text correlation weight calculation module comprises:
a video-text attention calculation module, configured to perform the video-text attention calculation on the video-text splicing feature using a formula of the form Mat = softmax((E·Wq)(E·Wk)^T / size) to obtain the corresponding global attention matrix, where Mat is the global attention matrix, E is the video-text splicing feature, and Wq and Wk are both transformation matrices of size d×d used for feature space transformation.
In some optional embodiments, the first video-text correlation weight calculation module comprises:
a first video-text correlation weight calculation submodule, configured to calculate the first video-text correlation weight A1 from the global attention matrix, where A1 is a matrix of size [T, N, M], T denotes the number of frames of the compressed video, N denotes the number of features corresponding to a single frame of video in the compressed video, M denotes the number of features corresponding to the description text, and a transpose of the matrix in the non-temporal dimension is applied in the calculation.
In some optional embodiments, the total video-text correlation weight calculation module comprises:
a second video-text correlation weight calculation module, configured to perform the distance similarity calculation between the video coding features and the expanded sentence features to obtain the second video-text correlation weight A2.
In some optional embodiments, the total video-text correlation weight calculation module comprises:
a total video-text correlation weight calculation submodule, configured to obtain the total video-text correlation weight A from the first video-text correlation weight A1 and the second video-text correlation weight A2, where A is used to characterize the degree of correlation between the compressed video and the description text.
In some optional embodiments, the frame correlation list generation module is specifically configured to:
sum the total video-text correlation weight over the N video features and the M text features of each frame to obtain the frame correlation list S, where S is a list of size [T, 1].
In some optional embodiments, the segment position and tuning processing module 605 includes:
a target video matching module, configured to select a rough-layout start frame and a rough-layout end frame from the positioning video frames, match the rough-layout start frame and the rough-layout end frame with the target video, and determine a target start frame corresponding to the rough-layout start frame and a target end frame corresponding to the rough-layout end frame;
and the segment positioning and adjusting module is used for respectively performing segment positioning and adjusting on the target starting frame and the target ending frame to obtain an optimized starting frame corresponding to the target starting frame and an optimized ending frame corresponding to the target ending frame.
In some optional embodiments, the segment position adjustment and optimization processing sub-module comprises:
a forward and backward frame expansion module, configured to perform forward and backward frame expansion on the target start frame and the target end frame according to preset frame expansion numbers, respectively, to obtain a start frame candidate image set corresponding to the target start frame and an end frame candidate image set corresponding to the target end frame;
the initial frame positioning and adjusting module is used for performing initial frame positioning and adjusting processing on the initial frame candidate image set to obtain an optimized initial frame corresponding to the target initial frame;
and the termination frame positioning and tuning processing module is used for performing termination frame positioning and tuning processing on the termination frame candidate image set to obtain an optimized termination frame corresponding to the target termination frame.
In some optional embodiments, the start frame positioning and tuning processing module comprises:
a first intercepted video segment generation module, configured to take each frame in the start frame candidate image set as a candidate start frame, and intercept the target video according to each candidate start frame and a preset interception length to obtain a plurality of first intercepted video segments;
a first similarity sequence generating module, configured to perform feature extraction on each first intercepted video segment to obtain a start frame feature corresponding to each first intercepted video segment, and perform a calculation using the start frame features and the expanded sentence feature to obtain a first similarity sequence, wherein the first similarity sequence comprises the candidate optimized start frames corresponding to the start frame features;
a first similarity supplementing sequence generating module, configured to perform feature extraction on each candidate start frame to obtain candidate start frame features, and perform a calculation using the candidate start frame features and the expanded sentence feature to obtain a first similarity supplement sequence;
a first similarity step signal set generating module, configured to perform an adjacent-value subtraction calculation on the first similarity supplement sequence to obtain a corresponding first similarity step signal set, wherein the first similarity step signal set comprises the first step signals between adjacent candidate start frames;
and an optimized start frame selection module, configured to screen out, from the first step signals, first target step signals greater than a preset step threshold, select corresponding target candidate optimized start frames from the first similarity sequence according to the first target step signals, and select the frame with the highest similarity among the target candidate optimized start frames as the optimized start frame.
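Taken together, the start frame modules amount to the following pipeline. This is a hedged sketch: the feature extractors are replaced by simple mean pooling and raw per-frame features, cosine similarity stands in for the calculation between features, and the interception length and step threshold are assumed values rather than ones the description prescribes.

```python
import numpy as np

def cosine(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def tune_start_frame(frames, candidates, sent_feat, clip_len=16, step_thresh=0.05):
    # frames:     (F, d) per-frame features of the target video (stand-in for an encoder)
    # candidates: sorted candidate start frame indices from the candidate image set
    # sent_feat:  (d,) expanded sentence feature for one frame
    seg_sim = [cosine(frames[c:c + clip_len].mean(axis=0), sent_feat) for c in candidates]  # first similarity sequence
    sup_sim = [cosine(frames[c], sent_feat) for c in candidates]                            # supplement sequence
    steps = np.diff(sup_sim)                                                                # step signals
    picked = [i + 1 for i, s in enumerate(steps) if s > step_thresh]                        # frames right after a large step
    if not picked:                                   # fallback: best segment similarity overall
        return candidates[int(np.argmax(seg_sim))]
    best = max(picked, key=lambda i: seg_sim[i])     # highest similarity among the picked frames
    return candidates[best]
```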
In some optional embodiments, the first similarity sequence generating module is specifically configured to:
perform time-sequence adaptive convolution and feature extraction on each first intercepted video segment to obtain the start frame feature corresponding to each first intercepted video segment, and perform a cosine similarity operation using the start frame features and the expanded sentence feature to obtain the first similarity sequence.
In some optional embodiments, the first similarity supplementing sequence generating module is specifically configured to:
perform feature extraction on each candidate start frame by using a convolutional neural network to obtain the candidate start frame features, and perform a cosine similarity operation using the candidate start frame features and the expanded sentence feature to obtain the first similarity supplement sequence.
In some optional embodiments, the end frame positioning and tuning processing module comprises:
a second intercepted video segment generation module, configured to take each frame in the end frame candidate image set as a candidate end frame, and intercept the target video according to each candidate end frame and a preset interception length to obtain a plurality of second intercepted video segments;
a second similarity sequence generating module, configured to perform feature extraction on each second intercepted video segment to obtain an end frame feature corresponding to each second intercepted video segment, and perform a calculation using the end frame features and the expanded sentence feature to obtain a second similarity sequence, wherein the second similarity sequence comprises the candidate optimized end frames corresponding to the end frame features;
a second similarity supplementing sequence generating module, configured to perform feature extraction on each candidate end frame to obtain candidate end frame features, and perform a calculation using the candidate end frame features and the expanded sentence feature to obtain a second similarity supplement sequence;
a second similarity step signal set generating module, configured to perform an adjacent-value subtraction calculation on the second similarity supplement sequence to obtain a corresponding second similarity step signal set, wherein the second similarity step signal set comprises the second step signals between adjacent candidate end frames;
and an optimized end frame selection module, configured to screen out, from the second step signals, second target step signals greater than a preset step threshold, select corresponding target candidate optimized end frames from the second similarity sequence according to the second target step signals, and select the frame with the highest similarity among the target candidate optimized end frames as the optimized end frame.
In some optional embodiments, the second similarity sequence generating module is specifically configured to:
perform time-sequence adaptive convolution and feature extraction on each second intercepted video segment to obtain the end frame feature corresponding to each second intercepted video segment, and perform a cosine similarity operation using the end frame features and the expanded sentence feature to obtain the second similarity sequence.
In some optional embodiments, the second similarity supplementing sequence generating module is specifically configured to:
perform feature extraction on each candidate end frame by using a convolutional neural network to obtain the candidate end frame features, and perform a cosine similarity operation using the candidate end frame features and the expanded sentence feature to obtain the second similarity supplement sequence.
In some optional embodiments, the second similarity step signal set generating module is specifically configured to:
perform an adjacent-value subtraction calculation on the second similarity supplement sequence, and then perform phase inversion to obtain the corresponding second similarity step signal set.
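For the end frame, the only difference from the start frame case is the phase inversion: negating the adjacent-value differences turns a drop in similarity after the true end of the action into a positive step that can pass the same step threshold. A one-line sketch:

```python
import numpy as np

def end_frame_step_signals(supplement_sequence):
    # Negate the adjacent differences so that falling similarity becomes a positive step.
    return -np.diff(np.asarray(supplement_sequence, dtype=float))

print(end_frame_step_signals([0.90, 0.88, 0.40, 0.35]))  # [0.02 0.48 0.05]
```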
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
In addition, an embodiment of the present invention further provides an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor. When executed by the processor, the computer program implements each process of the above video behavior positioning method embodiment and achieves the same technical effect, which is not repeated here.
As shown in fig. 7, an embodiment of the present invention further provides a computer-readable storage medium 701, where the computer-readable storage medium 701 stores a computer program, and when the computer program is executed by a processor, the computer program implements each process of the behavior localization method for video, and can achieve the same technical effect, and details are not repeated here to avoid repetition. The computer-readable storage medium 701 is, for example, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device implementing various embodiments of the present invention.
The electronic device 800 includes, but is not limited to: a radio frequency unit 801, a network module 802, an audio output unit 803, an input unit 804, a sensor 805, a display unit 806, a user input unit 807, an interface unit 808, a memory 809, a processor 810, and a power supply 811. It will be understood by those skilled in the art that the electronic device configurations involved in the embodiments of the present invention are not intended to be limiting, and that an electronic device may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used. In the embodiment of the present invention, the electronic device includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palm computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
It should be understood that, in the embodiment of the present invention, the radio frequency unit 801 may be used for receiving and sending signals during message transmission or a call; specifically, it receives downlink data from a base station and forwards it to the processor 810 for processing, and sends uplink data to the base station. In general, the radio frequency unit 801 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. Further, the radio frequency unit 801 can also communicate with a network and other devices through a wireless communication system.
The electronic device provides wireless broadband internet access to the user via the network module 802, such as to assist the user in sending and receiving e-mails, browsing web pages, and accessing streaming media.
The audio output unit 803 may convert audio data received by the radio frequency unit 801 or the network module 802 or stored in the memory 809 into an audio signal and output as sound. Also, the audio output unit 803 may also provide audio output related to a specific function performed by the electronic apparatus 800 (e.g., a call signal reception sound, a message reception sound, etc.). The audio output unit 803 includes a speaker, a buzzer, a receiver, and the like.
The input unit 804 is used for receiving an audio or video signal. The input unit 804 may include a graphics processing unit (GPU) 8041 and a microphone 8042. The graphics processor 8041 processes image data of still pictures or video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The processed image frames may be displayed on the display unit 806. The image frames processed by the graphics processor 8041 may be stored in the memory 809 (or other storage medium) or transmitted via the radio frequency unit 801 or the network module 802. The microphone 8042 can receive sound and process it into audio data. In a phone call mode, the processed audio data may be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 801 and then output.
The electronic device 800 also includes at least one sensor 805, such as light sensors, motion sensors, and other sensors. Specifically, the light sensor includes an ambient light sensor that can adjust the brightness of the display panel 8061 according to the brightness of ambient light and a proximity sensor that can turn off the display panel 8061 and/or the backlight when the electronic device 800 is moved to the ear. As one type of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of an electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), and vibration identification related functions (such as pedometer, tapping); the sensors 805 may also include fingerprint sensors, pressure sensors, iris sensors, molecular sensors, gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc., which are not described in detail herein.
The display unit 806 is used to display information input by the user or information provided to the user. The Display unit 806 may include a Display panel 8061, and the Display panel 8061 may be configured by a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like.
The user input unit 807 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic apparatus. Specifically, the user input unit 807 includes a touch panel 8071 and other input devices 8072. The touch panel 8071, also referred to as a touch screen, may collect touch operations by a user on or near the touch panel 8071 (e.g., operations by a user on or near the touch panel 8071 using a finger, a stylus, or any other suitable object or accessory). The touch panel 8071 may include two portions of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 810, receives a command from the processor 810, and executes the command. In addition, the touch panel 8071 can be implemented by various types such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. In addition to the touch panel 8071, the user input unit 807 can include other input devices 8072. In particular, other input devices 8072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail herein.
Further, the touch panel 8071 can be overlaid on the display panel 8061, and when the touch panel 8071 detects a touch operation on or near the touch panel 8071, the touch operation is transmitted to the processor 810 to determine the type of the touch event, and then the processor 810 provides a corresponding visual output on the display panel 8061 according to the type of the touch event. It is understood that in an embodiment, the touch panel 8071 and the display panel 8061 are two independent components to implement the input and output functions of the electronic device, but in some embodiments, the touch panel 8071 and the display panel 8061 may be integrated to implement the input and output functions of the electronic device, which is not limited herein.
The interface unit 808 is an interface for connecting an external device to the electronic apparatus 800. For example, the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 808 may be used to receive input (e.g., data information, power, etc.) from external devices and transmit the received input to one or more elements within the electronic device 800 or may be used to transmit data between the electronic device 800 and external devices.
The memory 809 may be used to store software programs as well as various data. The memory 809 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 809 can include high speed random access memory, and can also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The processor 810 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 809 and calling data stored in the memory 809, thereby monitoring the whole electronic device. Processor 810 may include one or more processing units; preferably, the processor 810 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 810.
The electronic device 800 may also include a power supply 811 (e.g., a battery) for powering the various components, and preferably, the power supply 811 may be logically coupled to the processor 810 via a power management system to manage charging, discharging, and power consumption management functions via the power management system.
In addition, the electronic device 800 includes some functional modules that are not shown, and are not described in detail herein.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of another identical element in the process, method, article, or apparatus that comprises the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (27)

1. A method for behavioral targeting of a video, the method comprising:
acquiring a target video, performing frame extraction on the target video to obtain a corresponding compressed video, and performing multiple feature extraction on the compressed video to obtain corresponding video coding features;
obtaining a description text corresponding to the target video, extracting features of the description text, and outputting corresponding sentence features;
performing broadcast replication operation on the sentence characteristics to obtain corresponding expanded sentence characteristics, and performing characteristic splicing on the expanded sentence characteristics and the video coding characteristics to obtain video splicing characteristics;
performing video correlation operation on the video splicing characteristics, and outputting a corresponding video time sequence, wherein the video time sequence comprises a positioning video frame;
selecting a coarse start frame and a coarse end frame from the positioning video frames, and performing segment positioning and tuning on the coarse start frame and the coarse end frame respectively, to obtain an optimized start frame corresponding to the coarse start frame and an optimized end frame corresponding to the coarse end frame;
and taking a video segment between the optimized start frame and the optimized end frame in the target video as a target video segment corresponding to the description text, and outputting the target video segment.
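Read as a pipeline, claim 1 can be pictured with the skeleton below. Every helper is a deliberately trivial stand-in (random text encoder, slice-based frame extraction, quantile-based correlation, clamping as "tuning"); the real components are the networks defined in the dependent claims and the description, so nothing here should be taken as the claimed implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the claimed sub-steps; each is a neural network or a larger
# procedure in the description, so these exist only to make the data flow concrete.
def encode_video(frames):                 # "multiple feature extraction"
    return frames.reshape(frames.shape[0], -1)[:, :32]

def encode_text(text, d=32):              # "sentence feature"
    return rng.normal(size=d)

def correlate(spliced, keep=0.5):         # "video correlation operation" -> positioning frames
    score = spliced.sum(axis=1)
    return np.where(score >= np.quantile(score, keep))[0]

def tune(frames, idx):                    # "segment positioning and tuning"
    return int(np.clip(idx, 0, len(frames) - 1))

def locate_behavior(frames, text, frame_interval=8):
    compressed = frames[::frame_interval]                        # frame extraction
    video_feat = encode_video(compressed)
    sent_feat = encode_text(text, d=video_feat.shape[1])
    expanded = np.broadcast_to(sent_feat, video_feat.shape)      # broadcast copy
    spliced = np.concatenate([video_feat, expanded], axis=1)     # feature splicing
    pos = correlate(spliced)                                     # positioning video frames
    start = tune(frames, int(pos[0]) * frame_interval)           # coarse -> optimized start frame
    end = tune(frames, int(pos[-1]) * frame_interval)            # coarse -> optimized end frame
    return frames[start:end + 1]                                 # target video segment

clip = locate_behavior(rng.normal(size=(240, 8, 8, 3)), "a person opens a door")
print(clip.shape)
```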
2. The method according to claim 1, wherein the performing frame extraction on the target video to obtain a corresponding compressed video comprises:
performing frame extraction processing on the target video according to a preset frame extraction interval value to obtain a corresponding extracted frame sequence, and taking the extracted frame sequence as the compressed video corresponding to the target video.
3. The method according to claim 2, wherein the performing frame extraction processing on the target video according to a preset frame extraction interval value to obtain a corresponding extracted frame sequence comprises:
determining an initial extraction frame for the target video, and performing frame extraction on the target video from the initial extraction frame according to the preset frame extraction interval value to obtain an initial extraction frame sequence corresponding to the initial extraction frame;
determining a pre-extraction frame and a post-extraction frame corresponding to the initial extraction frame, wherein the pre-extraction frame is the frame immediately before the initial extraction frame, and the post-extraction frame is the frame immediately after the initial extraction frame;
performing frame extraction on the target video from the pre-extraction frame according to the preset frame extraction interval value to obtain a pre-extraction frame sequence corresponding to the initial extraction frame;
performing frame extraction on the target video from the post-extraction frame according to the preset frame extraction interval value to obtain a post-extraction frame sequence corresponding to the initial extraction frame;
and taking the initial extraction frame sequence, the pre-extraction frame sequence and the post-extraction frame sequence as the extracted frame sequence corresponding to the target video.
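A hedged sketch of the three interleaved extraction sequences of claim 3: one sequence sampled at the preset interval from the chosen initial extraction frame, and two more sampled from the frames immediately before and after it. The boundary clamping is an assumption the claim does not spell out.

```python
def extraction_sequences(num_frames, initial_index, interval):
    # Returns (initial, pre, post) lists of frame indices into the target video.
    initial = list(range(initial_index, num_frames, interval))
    pre = list(range(max(initial_index - 1, 0), num_frames, interval))                # previous frame
    post = list(range(min(initial_index + 1, num_frames - 1), num_frames, interval))  # next frame
    return initial, pre, post

print(extraction_sequences(20, 4, 5))
# ([4, 9, 14, 19], [3, 8, 13, 18], [5, 10, 15])
```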
4. The method according to claim 3, wherein said performing multiple feature extractions on the compressed video to obtain corresponding video coding features comprises:
extracting the features of the compressed video to obtain a corresponding video feature map, and performing time sequence space convolution processing on the video feature map to obtain a corresponding frame extraction time sequence feature map;
and extracting features of the frame extraction time sequence feature map to obtain the video coding features corresponding to the compressed video.
5. The method according to claim 4, wherein said extracting features of the compressed video to obtain a corresponding video feature map comprises:
inputting the extracted frame sequence into a convolutional neural network for feature extraction to obtain a frame extraction feature map corresponding to the compressed video.
6. The method according to claim 5, wherein the inputting the extracted frame sequence into a convolutional neural network for feature extraction to obtain a frame extraction feature map corresponding to the compressed video comprises:
respectively inputting the initial extraction frame sequence, the pre-extraction frame sequence and the post-extraction frame sequence into the convolutional neural network for feature extraction, to obtain an initial extraction frame feature map corresponding to the initial extraction frame sequence, a pre-extraction frame feature map corresponding to the pre-extraction frame sequence, and a post-extraction frame feature map corresponding to the post-extraction frame sequence.
7. The method of claim 6, wherein the performing time sequence space convolution processing on the video feature map to obtain a corresponding frame extraction time sequence feature map comprises:
inputting the initial extraction frame feature map into a three-dimensional convolution for time sequence space convolution processing to obtain the frame extraction time sequence feature map corresponding to the video feature map.
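The time sequence space convolution of claim 7 can be pictured as a single 3-D convolution applied to the stacked initial extraction frame feature map; the PyTorch sketch below is only an illustration, and the channel count, number of frames and kernel size are assumptions.

```python
import torch
import torch.nn as nn

# Assumed layout: (batch, channels, frames, height, width) built from the initial
# extraction frame feature map; a 3x3x3 kernel mixes the time and space dimensions.
temporal_spatial_conv = nn.Conv3d(in_channels=256, out_channels=256, kernel_size=3, padding=1)

feature_map = torch.randn(1, 256, 16, 14, 14)              # 16 extracted frames
frame_extraction_time_sequence_feature_map = temporal_spatial_conv(feature_map)
print(frame_extraction_time_sequence_feature_map.shape)    # torch.Size([1, 256, 16, 14, 14])
```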
8. The method according to claim 7, wherein the extracting features of the frame extraction time sequence feature map to obtain the video coding features corresponding to the compressed video comprises:
fusing the frame extraction time sequence feature map, the pre-extraction frame feature map and the post-extraction frame feature map to obtain a fused feature map, and inputting the fused feature map into a feature detection network for feature extraction to obtain the video coding features corresponding to the compressed video.
9. The method of claim 2, wherein the performing a broadcast copy operation on the sentence features to obtain corresponding expanded sentence features comprises:
performing the broadcast copy operation on the sentence feature according to the number of frames of the extracted frame sequence, so as to expand the sentence feature into the expanded sentence feature corresponding to the extracted frame sequence.
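The broadcast copy of claim 9 simply repeats the single sentence feature once per extracted frame so it can be spliced with the per-frame video coding features; a short NumPy sketch with assumed shapes:

```python
import numpy as np

sentence_feature = np.random.rand(512)      # (d,) one feature vector for the whole sentence
num_extracted_frames = 30                   # frames in the extracted (compressed) video

expanded_sentence_feature = np.broadcast_to(
    sentence_feature, (num_extracted_frames, sentence_feature.shape[0]))
print(expanded_sentence_feature.shape)      # (30, 512)
```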
10. The method according to claim 1, wherein the performing video correlation operation on the video splicing features and outputting a corresponding video time sequence comprises:
performing visual-text attention calculation on the video splicing features to obtain a corresponding global attention matrix, and calculating a first visual-text correlation weight according to the global attention matrix;
performing a distance similarity operation using the video coding features and the expanded sentence features to obtain a second visual-text correlation weight, and obtaining a total visual-text correlation weight according to the first visual-text correlation weight and the second visual-text correlation weight;
summing the total visual-text correlation weight to obtain a frame correlation list, wherein the frame correlation list comprises conformity degree values, each conformity degree value representing the degree to which a video frame in the compressed video conforms to the description text;
and screening out target conformity degree values higher than a preset conformity degree threshold from the conformity degree values, determining the video frames corresponding to the target conformity degree values as positioning video frames, gathering all the positioning video frames into the video time sequence corresponding to the video splicing features, and outputting the video time sequence.
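The last two steps of claim 10 reduce to thresholding the frame correlation list and collecting the indices of the surviving frames; a minimal sketch with an assumed conformity threshold:

```python
import numpy as np

def positioning_frames(frame_correlation_list, conformity_threshold=0.5):
    # Indices of compressed-video frames whose conformity degree value clears the threshold.
    scores = np.asarray(frame_correlation_list, dtype=float)
    return np.where(scores > conformity_threshold)[0]

print(positioning_frames([0.1, 0.7, 0.9, 0.4, 0.8]))  # [1 2 4]
```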
11. The method according to claim 10, wherein the performing visual-text attention calculation on the video splicing features to obtain a corresponding global attention matrix comprises:
performing visual-text attention calculation on the video splicing features according to a formula [formula omitted] to obtain the corresponding global attention matrix; wherein Mat is the global attention matrix, [symbol omitted] is the video splicing feature, and [symbols omitted] are transfer matrices of size d×d used for feature space transformation.
12. The method of claim 11, wherein the calculating a first visual-text correlation weight according to the global attention matrix comprises:
calculating the first visual-text correlation weight according to a formula [formula omitted]; wherein [symbol omitted] is the first visual-text correlation weight, specifically a matrix of size [size omitted]; [symbol omitted] denotes the number of frames of the compressed video, N denotes the number of features of a single frame of the compressed video, M denotes the number of features of the description text, and [symbol omitted] denotes a transpose of the matrix along the non-time dimension.
13. The method of claim 10, wherein the performing a distance similarity operation using the video coding features and the expanded sentence features to obtain a second visual-text correlation weight comprises:
performing a distance similarity calculation according to a formula [formula omitted] to obtain the second visual-text correlation weight; wherein [symbol omitted] is the second visual-text correlation weight, [symbol omitted] is the video coding feature, and [symbol omitted] is the expanded sentence feature.
14. The method of claim 10, wherein the obtaining a total visual-text correlation weight according to the first visual-text correlation weight and the second visual-text correlation weight comprises:
calculating the total visual-text correlation weight according to a formula [formula omitted]; wherein A is the total visual-text correlation weight, which characterizes the degree of correlation between the compressed video and the description text, [symbol omitted] is the first visual-text correlation weight, and [symbol omitted] is the second visual-text correlation weight.
15. The method of claim 14, wherein the summing the total visual-text correlation weight to obtain a frame correlation list comprises:
summing the total visual-text correlation weight according to a formula [formula omitted] to obtain the frame correlation list; wherein S denotes the frame correlation list, a list of size [size omitted].
16. The method according to claim 1, wherein the selecting a coarse start frame and a coarse end frame from the positioning video frames, and performing segment positioning and tuning on the coarse start frame and the coarse end frame respectively to obtain an optimized start frame corresponding to the coarse start frame and an optimized end frame corresponding to the coarse end frame, comprises:
selecting the coarse start frame and the coarse end frame from the positioning video frames, matching the coarse start frame and the coarse end frame with the target video respectively, and determining a target start frame corresponding to the coarse start frame and a target end frame corresponding to the coarse end frame;
and performing segment positioning and tuning on the target start frame and the target end frame respectively, to obtain the optimized start frame corresponding to the target start frame and the optimized end frame corresponding to the target end frame.
17. The method according to claim 16, wherein the performing segment positioning and tuning on the target start frame and the target end frame respectively to obtain the optimized start frame corresponding to the target start frame and the optimized end frame corresponding to the target end frame comprises:
expanding the target start frame and the target end frame forward and backward by a preset number of frames respectively, to obtain a start frame candidate image set corresponding to the target start frame and an end frame candidate image set corresponding to the target end frame;
performing start frame positioning and tuning on the start frame candidate image set to obtain the optimized start frame corresponding to the target start frame;
and performing end frame positioning and tuning on the end frame candidate image set to obtain the optimized end frame corresponding to the target end frame.
18. The method of claim 17, wherein the performing start frame positioning and tuning on the start frame candidate image set to obtain the optimized start frame corresponding to the target start frame comprises:
taking each frame in the start frame candidate image set as a candidate start frame, and intercepting the target video according to each candidate start frame and a preset interception length to obtain a plurality of first intercepted video segments;
performing feature extraction on each first intercepted video segment to obtain a start frame feature corresponding to each first intercepted video segment, and performing a calculation using the start frame features and the expanded sentence feature to obtain a first similarity sequence, wherein the first similarity sequence comprises the candidate optimized start frames corresponding to the start frame features;
performing feature extraction on each candidate start frame to obtain candidate start frame features, and performing a calculation using the candidate start frame features and the expanded sentence feature to obtain a first similarity supplement sequence;
performing an adjacent-value subtraction calculation on the first similarity supplement sequence to obtain a corresponding first similarity step signal set, wherein the first similarity step signal set comprises the first step signals between adjacent candidate start frames;
and screening out, from the first step signals, first target step signals greater than a preset step threshold, selecting corresponding target candidate optimized start frames from the first similarity sequence according to the first target step signals, and selecting the frame with the highest similarity among the target candidate optimized start frames as the optimized start frame.
19. The method according to claim 18, wherein the performing feature extraction on each first intercepted video segment to obtain a start frame feature corresponding to each first intercepted video segment, and performing a calculation using the start frame features and the expanded sentence feature to obtain a first similarity sequence comprises:
performing time-sequence adaptive convolution and feature extraction on each first intercepted video segment to obtain the start frame feature corresponding to each first intercepted video segment, and performing a cosine similarity operation using the start frame features and the expanded sentence feature to obtain the first similarity sequence.
20. The method of claim 18, wherein the performing feature extraction on each candidate start frame to obtain each candidate start frame feature, and performing calculation using each candidate start frame feature and the expanded sentence feature to obtain a first similarity supplementary sequence comprises:
performing feature extraction on each candidate start frame by using a convolutional neural network to obtain the candidate start frame features, and performing a cosine similarity operation using the candidate start frame features and the expanded sentence feature to obtain the first similarity supplement sequence.
21. The method of claim 17, wherein the performing end frame positioning and tuning on the end frame candidate image set to obtain the optimized end frame corresponding to the target end frame comprises:
taking each frame in the end frame candidate image set as a candidate end frame, and intercepting the target video according to each candidate end frame and a preset interception length to obtain a plurality of second intercepted video segments;
performing feature extraction on each second intercepted video segment to obtain an end frame feature corresponding to each second intercepted video segment, and performing a calculation using the end frame features and the expanded sentence feature to obtain a second similarity sequence, wherein the second similarity sequence comprises the candidate optimized end frames corresponding to the end frame features;
performing feature extraction on each candidate end frame to obtain candidate end frame features, and performing a calculation using the candidate end frame features and the expanded sentence feature to obtain a second similarity supplement sequence;
performing an adjacent-value subtraction calculation on the second similarity supplement sequence to obtain a corresponding second similarity step signal set, wherein the second similarity step signal set comprises the second step signals between adjacent candidate end frames;
and screening out, from the second step signals, second target step signals greater than a preset step threshold, selecting corresponding target candidate optimized end frames from the second similarity sequence according to the second target step signals, and selecting the frame with the highest similarity among the target candidate optimized end frames as the optimized end frame.
22. The method according to claim 21, wherein the performing feature extraction on each second intercepted video segment to obtain an end frame feature corresponding to each second intercepted video segment, and performing a calculation using the end frame features and the expanded sentence feature to obtain a second similarity sequence comprises:
performing time-sequence adaptive convolution and feature extraction on each second intercepted video segment to obtain the end frame feature corresponding to each second intercepted video segment, and performing a cosine similarity operation using the end frame features and the expanded sentence feature to obtain the second similarity sequence.
23. The method according to claim 21, wherein the performing feature extraction on each candidate end frame to obtain candidate end frame features, and performing a calculation using the candidate end frame features and the expanded sentence feature to obtain a second similarity supplement sequence comprises:
performing feature extraction on each candidate end frame by using a convolutional neural network to obtain the candidate end frame features, and performing a cosine similarity operation using the candidate end frame features and the expanded sentence feature to obtain the second similarity supplement sequence.
24. The method of claim 21, wherein the performing an adjacent-value subtraction calculation on the second similarity supplement sequence to obtain a corresponding second similarity step signal set comprises:
performing an adjacent-value subtraction calculation on the second similarity supplement sequence, and then performing phase inversion to obtain the corresponding second similarity step signal set.
25. A video behavior localization apparatus, the apparatus comprising:
the video coding feature generation module is used for acquiring a target video, performing frame extraction on the target video to acquire a corresponding compressed video, and performing feature extraction on the compressed video for multiple times to acquire corresponding video coding features;
the sentence characteristic output module is used for acquiring a description text corresponding to the target video, extracting the characteristics of the description text and outputting the corresponding sentence characteristics;
the broadcast replication operation module is used for performing broadcast replication operation on the sentence characteristics to obtain corresponding expanded sentence characteristics, and performing characteristic splicing on the expanded sentence characteristics and the video coding characteristics to obtain video splicing characteristics;
the video correlation operation module is used for carrying out video correlation operation on the video splicing characteristics and outputting a corresponding video time sequence, and the video time sequence comprises positioning video frames;
a segment positioning and tuning processing module, configured to select a coarse start frame and a coarse end frame from the positioning video frames, and perform segment positioning and tuning on the coarse start frame and the coarse end frame respectively, to obtain an optimized start frame corresponding to the coarse start frame and an optimized end frame corresponding to the coarse end frame;
and the target video segment output module is used for taking a video segment between the optimized start frame and the optimized end frame in the target video as a target video segment corresponding to the description text, and outputting the target video segment.
26. An electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is used for storing a computer program;
the processor, when executing a program stored on the memory, implementing the method of any of claims 1-24.
27. A computer-readable storage medium having stored thereon instructions, which when executed by one or more processors, cause the processors to perform the method of any one of claims 1-24.
CN202211680368.8A 2022-12-27 2022-12-27 Video behavior positioning method and device, electronic equipment and storage medium Active CN115661727B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211680368.8A CN115661727B (en) 2022-12-27 2022-12-27 Video behavior positioning method and device, electronic equipment and storage medium
PCT/CN2023/101687 WO2024139091A1 (en) 2022-12-27 2023-06-21 Video behavior positioning method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211680368.8A CN115661727B (en) 2022-12-27 2022-12-27 Video behavior positioning method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN115661727A true CN115661727A (en) 2023-01-31
CN115661727B CN115661727B (en) 2023-04-28

Family

ID=85023387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211680368.8A Active CN115661727B (en) 2022-12-27 2022-12-27 Video behavior positioning method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115661727B (en)
WO (1) WO2024139091A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024139091A1 (en) * 2022-12-27 2024-07-04 苏州元脑智能科技有限公司 Video behavior positioning method and apparatus, electronic device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103297851A (en) * 2013-05-16 2013-09-11 中国科学院自动化研究所 Method and device for quickly counting and automatically examining and verifying target contents in long video
CN111866607A (en) * 2020-07-30 2020-10-30 腾讯科技(深圳)有限公司 Video clip positioning method and device, computer equipment and storage medium
US20210349940A1 (en) * 2019-06-17 2021-11-11 Tencent Technology (Shenzhen) Company Limited Video clip positioning method and apparatus, computer device, and storage medium
CN114511472A (en) * 2022-04-19 2022-05-17 苏州浪潮智能科技有限公司 Visual positioning method, device, equipment and medium
CN114595357A (en) * 2022-02-22 2022-06-07 平安科技(深圳)有限公司 Video searching method and device, electronic equipment and storage medium
CN115309939A (en) * 2022-07-22 2022-11-08 复旦大学 Video clip positioning system based on space-time semantic decomposition

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509465B (en) * 2017-02-28 2022-03-15 阿里巴巴集团控股有限公司 Video data recommendation method and device and server
CN109858555B (en) * 2019-02-12 2022-05-17 北京百度网讯科技有限公司 Image-based data processing method, device, equipment and readable storage medium
CN109919078B (en) * 2019-03-05 2024-08-09 腾讯科技(深圳)有限公司 Video sequence selection method, model training method and device
CN112380394B (en) * 2020-10-27 2022-05-10 浙江工商大学 Progressive positioning method for positioning from text to video clip
CN112101329B (en) * 2020-11-19 2021-03-30 腾讯科技(深圳)有限公司 Video-based text recognition method, model training method and model training device
CN115049950A (en) * 2021-02-26 2022-09-13 阿里巴巴集团控股有限公司 Video processing method and device
CN114627402B (en) * 2021-12-30 2024-08-13 湖南大学 Cross-modal video moment positioning method and system based on space-time diagram
CN115495615B (en) * 2022-11-15 2023-02-28 浪潮电子信息产业股份有限公司 Method, device, equipment, storage medium and terminal for mutual detection of video and text
CN115661727B (en) * 2022-12-27 2023-04-28 苏州浪潮智能科技有限公司 Video behavior positioning method and device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103297851A (en) * 2013-05-16 2013-09-11 中国科学院自动化研究所 Method and device for quickly counting and automatically examining and verifying target contents in long video
US20210349940A1 (en) * 2019-06-17 2021-11-11 Tencent Technology (Shenzhen) Company Limited Video clip positioning method and apparatus, computer device, and storage medium
CN111866607A (en) * 2020-07-30 2020-10-30 腾讯科技(深圳)有限公司 Video clip positioning method and device, computer equipment and storage medium
CN114595357A (en) * 2022-02-22 2022-06-07 平安科技(深圳)有限公司 Video searching method and device, electronic equipment and storage medium
CN114511472A (en) * 2022-04-19 2022-05-17 苏州浪潮智能科技有限公司 Visual positioning method, device, equipment and medium
CN115309939A (en) * 2022-07-22 2022-11-08 复旦大学 Video clip positioning system based on space-time semantic decomposition

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024139091A1 (en) * 2022-12-27 2024-07-04 苏州元脑智能科技有限公司 Video behavior positioning method and apparatus, electronic device and storage medium

Also Published As

Publication number Publication date
WO2024139091A1 (en) 2024-07-04
CN115661727B (en) 2023-04-28

Similar Documents

Publication Publication Date Title
US12106491B2 (en) Target tracking method and apparatus, medium, and device
CN109005336B (en) Image shooting method and terminal equipment
CN112689201B (en) Barrage information identification method, barrage information display method, server and electronic equipment
CN110706179A (en) Image processing method and electronic equipment
CN108763317B (en) Method for assisting in selecting picture and terminal equipment
CN112820299B (en) Voiceprint recognition model training method and device and related equipment
CN110602389B (en) Display method and electronic equipment
CN110855893A (en) Video shooting method and electronic equipment
CN109495616B (en) Photographing method and terminal equipment
CN111401463B (en) Method for outputting detection result, electronic equipment and medium
CN111031234B (en) Image processing method and electronic equipment
CN109951889B (en) Internet of things network distribution method and mobile terminal
CN115659959A (en) Image text error correction method and device, electronic equipment and storage medium
CN109858447B (en) Information processing method and terminal
CN115661727B (en) Video behavior positioning method and device, electronic equipment and storage medium
CN112464831B (en) Video classification method, training method of video classification model and related equipment
CN108924413B (en) Shooting method and mobile terminal
CN114399813A (en) Face shielding detection method, model training method and device and electronic equipment
CN110674294A (en) Similarity determination method and electronic equipment
CN108960097B (en) Method and device for obtaining face depth information
CN111401283A (en) Face recognition method and device, electronic equipment and storage medium
CN115240250A (en) Model training method and device, computer equipment and readable storage medium
CN111145083B (en) Image processing method, electronic equipment and computer readable storage medium
CN110012225B (en) Image processing method and device and mobile terminal
CN109379531B (en) Shooting method and mobile terminal

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant