CN112040313A - Video content structuring method, device, terminal equipment and medium - Google Patents

Video content structuring method, device, terminal equipment and medium

Info

Publication number
CN112040313A
CN112040313A (application CN202011217518.2A)
Authority
CN
China
Prior art keywords
boundary
video
voice
scene
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011217518.2A
Other languages
Chinese (zh)
Other versions
CN112040313B (en)
Inventor
周凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Research Institute of Sun Yat Sen University
Original Assignee
Shenzhen Research Institute of Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Research Institute of Sun Yat Sen University filed Critical Shenzhen Research Institute of Sun Yat Sen University
Priority to CN202011217518.2A priority Critical patent/CN112040313B/en
Publication of CN112040313A publication Critical patent/CN112040313A/en
Application granted granted Critical
Publication of CN112040313B publication Critical patent/CN112040313B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/04Segmentation; Word boundary detection
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/57Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for processing of video signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23412Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs for generating or manipulating the scene composition of objects, e.g. MPEG-4 objects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Television Signal Processing For Recording (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application is applicable to the technical field of video processing, and provides a video content structuring method, a device, terminal equipment and a medium, wherein the method comprises the following steps: acquiring visual channel information of a target video, and dividing the target video into a plurality of scene units based on the visual channel information, wherein the plurality of scene units comprise a plurality of scene boundaries; converting the voice of the target video into a voice text, and dividing the voice text into a plurality of text blocks; dividing the target video into a plurality of speech units based on the plurality of text blocks, the plurality of speech units comprising a plurality of speech boundaries; determining a video subject boundary of the target video according to the scene boundaries and the voice boundaries; and dividing the target video into a plurality of subject units according to the video subject boundary. By the method, the accuracy of structuring the video content can be improved.

Description

Video content structuring method, device, terminal equipment and medium
Technical Field
The present application belongs to the technical field of video processing, and in particular, to a method, an apparatus, a terminal device, and a medium for structuring video content.
Background
Video content structuring refers to the process of hierarchically decomposing a video file into a plurality of semantic subunits and establishing association relations among the subunits. Its aim is to convert complex, abstract video data into a format that is easy for a computer to process, so that the video content can be further extracted and analyzed.
A traditional video content structuring method divides a video into scenes based only on its visual cues. However, video types vary widely, and the scenes of some videos are either too uniform or too varied, so dividing a video into scene units using visual cues alone can introduce large errors.
Disclosure of Invention
The embodiment of the application provides a method, a device, terminal equipment and a medium for structuring video content, which can improve the accuracy of structuring the video content.
In a first aspect, an embodiment of the present application provides a video content structuring method, including:
acquiring visual channel information of a target video, and dividing the target video into a plurality of scene units based on the visual channel information, wherein the plurality of scene units comprise a plurality of scene boundaries;
converting the voice of the target video into a voice text, and dividing the voice text into a plurality of text blocks;
dividing the target video into a plurality of speech units based on the plurality of text blocks, the plurality of speech units comprising a plurality of speech boundaries;
determining a video subject boundary of the target video according to the scene boundaries and the voice boundaries;
and dividing the target video into a plurality of subject units according to the video subject boundary.
In a second aspect, an embodiment of the present application provides a video content structuring apparatus, including:
the scene boundary dividing module is used for acquiring visual channel information of a target video, and dividing the target video into a plurality of scene units based on the visual channel information, wherein the plurality of scene units comprise a plurality of scene boundaries;
the text block segmentation module is used for converting the voice of the target video into a voice text and segmenting the voice text into a plurality of text blocks;
a speech unit segmentation module for dividing the target video into a plurality of speech units based on the plurality of text blocks, the plurality of speech units including a plurality of speech boundaries;
a video theme boundary determining module, configured to determine a video theme boundary of the target video according to the plurality of scene boundaries and the plurality of voice boundaries;
and the theme unit dividing module is used for dividing the target video into a plurality of theme units according to the video theme boundary.
In a third aspect, an embodiment of the present application provides a terminal device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor, when executing the computer program, implements the method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method of the first aspect.
In a fifth aspect, the present application provides a computer program product, which when run on a terminal device, causes the terminal device to execute the method of any one of the above first aspects.
Compared with the prior art, the embodiment of the application has the advantages that: in the embodiment of the application, when a target video is structured, the visual channel information of the target video is obtained, the target video is divided into a plurality of scene units based on the visual channel information, and each scene unit comprises a corresponding scene boundary; then converting the voice in the target video into a voice text, and dividing the voice text into a plurality of text blocks; dividing the target video into a plurality of voice units based on the plurality of text blocks, wherein each voice unit comprises a corresponding voice boundary; determining a video theme boundary of the target video according to each scene boundary and each voice boundary; and dividing the target video into a plurality of theme units according to the video theme boundary. In the embodiment of the application, content mining is respectively carried out on a visual channel and a voice channel of a target video, then multi-clue information of the visual content and the voice content of the target video is fused, theme boundary detection is carried out on the target video, the target video is divided into a plurality of theme units with independent semantics and coherent content, and a structured index is established for the target video content. By the method, the accuracy of structuring the video content can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the embodiments or the prior art description are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained by those skilled in the art from these drawings without creative effort.
Fig. 1 is a schematic flowchart of a video content structuring method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of video visual channel information provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a scene unit partitioning method according to an embodiment of the present application;
fig. 4 is a schematic diagram for determining a boundary of a video topic according to an embodiment of the present application;
fig. 5 is a schematic diagram of another method for determining a boundary of a video topic according to an embodiment of the present application;
fig. 6 is a schematic flowchart of a video content structuring method according to the second embodiment of the present application;
FIG. 7 is a graph showing the test experiment results provided in the second embodiment of the present application;
fig. 8 is a schematic structural diagram of a video content structuring apparatus according to a third embodiment of the present application;
fig. 9 is a schematic structural diagram of a terminal device according to a fourth embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon", "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [ a described condition or event ] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [ the described condition or event ]" or "in response to detecting [ the described condition or event ]".
Furthermore, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used for distinguishing between descriptions and not necessarily for describing or implying relative importance.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
Fig. 1 is a schematic flowchart of a video content structuring method provided in the first embodiment of the present application. As shown in Fig. 1, the method includes:
s101, acquiring visual channel information of a target video, and dividing the target video into a plurality of scene units based on the visual channel information, wherein the plurality of scene units comprise a plurality of scene boundaries;
the target video is a video with the need of structuring video content. Video is a comprehensive media composed of visual and voice channels. The video visual channel is composed of a series of continuously changed frame pictures, the images are visual presentation of video visual contents, the visual characteristics of the video frames have great repeatability, video data with the duration of one second usually comprises 25 to 40 frame pictures, the video data with the duration of one hour comprises more than 7 ten thousand frame pictures, and if the video frame pictures with the great quantity and the extremely high repeatability are directly analyzed, the time is consumed, and valuable information is difficult to mine. Therefore, the shot segmentation can be performed on the target video firstly.
Shot segmentation divides frames whose content has changed into different shots according to the difference in visual content between video frames; only one frame of each shot is retained as a representative frame, so most redundant video frames are filtered out. In this embodiment, the similarity of the visual features of video frames is measured with color histogram features. The Euclidean distance between the color histogram features of each pair of adjacent frames is then calculated, and if the distance between the color histograms of two adjacent frames is greater than the average distance, a shot change is determined to occur at that position. Finally, the middle frame of each shot's frame sequence is taken as the representative frame of that video shot.
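As a concrete illustration of this step, the following is a minimal sketch of histogram-based shot segmentation, written in Python with OpenCV and NumPy. The use of the mean adjacent-frame distance as the threshold and the middle frame as the representative frame follow the description above, while the remaining details (bin count, per-frame processing, L2 normalization) are illustrative assumptions rather than the patent's exact implementation.

```python
import cv2
import numpy as np

def detect_shots(video_path, bins=16):
    """Split a video into shots by thresholding the Euclidean distance between
    colour histograms of adjacent frames at the average distance."""
    cap = cv2.VideoCapture(video_path)
    hists, frames = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # 3-D colour histogram of the frame, flattened and L2-normalised.
        hist = cv2.calcHist([frame], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
        hists.append(cv2.normalize(hist, None).flatten())
        frames.append(frame)
    cap.release()

    # Euclidean distance between the histograms of each pair of adjacent frames.
    dists = [np.linalg.norm(hists[i] - hists[i - 1]) for i in range(1, len(hists))]
    threshold = float(np.mean(dists))  # shot change where the distance exceeds the mean

    boundaries = [i for i, d in enumerate(dists, start=1) if d > threshold]
    starts = [0] + boundaries
    ends = boundaries + [len(frames)]
    # The middle frame of each shot is kept as its representative frame.
    representatives = [frames[(s + e) // 2] for s, e in zip(starts, ends)]
    return boundaries, representatives
```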
Fig. 2 is a schematic diagram of video visual channel information provided in an embodiment of the present application. As shown in Fig. 2, the bottom layer of a video is composed of a series of consecutive video frames. Within a consecutive frame sequence, a shot corresponds to the frames captured by one camera take during video shooting; the image features of video frames within the same shot are similar and their content changes little. A scene is composed of a set of semantically related shots, and multiple scenes make up the entire video.
After shot segmentation is completed, the scene units of the target video can be divided according to the shots. Fig. 3 is a schematic flowchart of a scene unit dividing method provided in an embodiment of the present application. As shown in Fig. 3, visual features of the video frame sequence are extracted, where the visual features may include color features and motion features; the target video is segmented into shots based on the visual features, and the segmented shots are grouped by spectral clustering. The clustering result, however, cannot guarantee that the video frames of each subclass are continuous on the time axis, whereas a scene unit is a series of consecutive video frames, so the discontinuous shots need to be merged into final scene units. Each class cluster obtained by spectral clustering is split into a plurality of continuous shot segments {s1, s2, ..., sn}, and the distance between two shot segments on the time axis of the target video is calculated. [The calculation formula is given as an image in the original document: the distance dc(si, sj) is an exponential function of the gap between the center positions dc(si) and dc(sj) of the two shot segments on the video timeline, scaled by w.]
Here si and sj denote two continuous shot segments, dc(si) and dc(sj) denote the center positions of the segments on the video timeline, dc(si, sj) denotes the distance between segments si and sj, e is the exponential function, and w is the average of the Euclidean distances between all the class clusters. Finally, k-means clustering is performed on the shot segments based on their temporal distances, and the video shot sequence formed by each cluster after clustering is the scene unit division result.
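The two-stage clustering just described can be sketched as follows. scikit-learn's SpectralClustering and KMeans stand in for the two clustering steps, the splitting of each visual cluster into contiguous shot segments is simplified, and the cluster counts are placeholder parameters; this is an illustrative outline under those assumptions, not the patent's exact procedure.

```python
import numpy as np
from sklearn.cluster import SpectralClustering, KMeans

def divide_scene_units(shot_features, shot_centers, n_visual_clusters, n_scenes):
    """shot_features: (num_shots, dim) visual features of the representative frames.
    shot_centers: centre time (in seconds) of each shot on the video timeline."""
    shot_features = np.asarray(shot_features)
    centers = np.asarray(shot_centers, dtype=float)

    # 1. Spectral clustering groups visually similar shots; the resulting
    #    clusters are not guaranteed to be contiguous on the time axis.
    visual_labels = SpectralClustering(
        n_clusters=n_visual_clusters, assign_labels="kmeans", random_state=0
    ).fit_predict(shot_features)

    # 2. Split each visual cluster into runs of consecutive shots (shot segments)
    #    and record the centre time of each segment on the timeline.
    segment_centers = []
    for label in np.unique(visual_labels):
        idx = np.flatnonzero(visual_labels == label)
        run = [idx[0]]
        for i in idx[1:]:
            if i == run[-1] + 1:
                run.append(i)
            else:
                segment_centers.append(centers[run].mean())
                run = [i]
        segment_centers.append(centers[run].mean())

    # 3. k-means on the segments' timeline positions merges visually related but
    #    temporally discontinuous segments into contiguous scene units.
    segment_centers = np.asarray(segment_centers).reshape(-1, 1)
    scene_labels = KMeans(n_clusters=n_scenes, n_init=10, random_state=0).fit_predict(segment_centers)
    return visual_labels, scene_labels
```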
S102, converting the voice of the target video into a voice text, and dividing the voice text into a plurality of text blocks;
the video voice channel contains a large amount of voice information, and the voice information directly reflects video semantics. Therefore, the voice in the video can be converted into the voice text, and particularly, the voice feature can be converted into the voice text which is easy to process by a computer by adopting an Automatic Speech Recognition (ASR) technology.
And then, the voice text is divided into sentences, and each sentence is taken as a text block. The speech may then be processed based on the text blocks.
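As a small illustration, speech-to-text conversion followed by sentence-level splitting might look like the sketch below. The ASR step is left as a placeholder function, since the patent only states that an ASR technique is used, and the punctuation-based sentence splitter is an assumption for illustration.

```python
import re

def asr_transcribe(audio_path: str) -> str:
    """Placeholder for the ASR step: any speech-recognition engine that returns
    the spoken content of the audio track as plain text can be used here."""
    raise NotImplementedError

def split_into_text_blocks(transcript: str):
    """Split the recognised voice text into sentences; each sentence is one text block."""
    sentences = re.split(r"(?<=[。！？.!?])\s*", transcript)
    return [s.strip() for s in sentences if s.strip()]
```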
S103, dividing the target video into a plurality of voice units based on the text blocks, wherein the voice units comprise a plurality of voice boundaries;
specifically, the context association degree of each text block can be calculated respectively; respectively calculating the depth score of each text block according to the context association degree of each text block; and dividing the target video into a plurality of voice units according to the depth score of each text block, wherein the number of the voice units is equal to a preset multiple of the number of the scene units.
The context association degree represents the semantic similarity between the current text block and its adjacent previous and next text blocks. Specifically, a cosine similarity measure can be adopted, and the calculation formula is as follows:

s(c) = \frac{1}{2}\left(\frac{\sum_t w_{t,c}\,w_{t,p}}{\sqrt{\sum_t w_{t,c}^2}\sqrt{\sum_t w_{t,p}^2}} + \frac{\sum_t w_{t,c}\,w_{t,f}}{\sqrt{\sum_t w_{t,c}^2}\sqrt{\sum_t w_{t,f}^2}}\right)

where c denotes each text block, p denotes the previous text block adjacent to it, f denotes the next text block adjacent to it, w_{t,x} denotes the value of the t-th dimension of the text feature of text block x (x = c, p or f), and s(c) denotes the context association degree of the text block.
Specifically, the TopicTiling algorithm may be adopted to extract the text features of each text block; the extracted text feature of each block is a multidimensional vector.
The depth score represents the difference between a text block's context association degree and those of its neighbors, reflecting how sharply the semantics change on both sides of the text block. A text block with a high depth score forms a 'deep valley' with the text blocks on its two sides in the text relevancy curve: the higher the depth score, the lower the semantic cohesion of the current text block with respect to its context, and the more likely this position is to be a semantic boundary of the text. The depth score may be calculated as follows:

depth(c) = \left(h_l(c) - s(c)\right) + \left(h_r(c) - s(c)\right)

where c denotes the current text block, h_l(c) denotes the first peak with the highest association score found to the left of text block c, and h_r(c) denotes the first peak with the highest association score found to the right of text block c.
The number of speech units may be determined according to the number of scene units; for example, the number of speech units may be set to a preset multiple of the number of scene units. The depth score of each text block is calculated with the above formula, all depth scores are then sorted from high to low, and the first b text blocks are selected as target text blocks according to the sorting result, where b is the number of speech units determined according to the number of scene units.
The target video is divided into a plurality of voice units using the target text blocks, where the target text blocks serve as the voice boundaries of the voice units.
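Putting S103 together, the sketch below computes the context association degree of each text block, derives TextTiling-style depth scores, and selects the b highest-scoring blocks as voice boundaries. The averaged-cosine and peak-difference forms mirror the formulas above; the text-feature vectors (e.g. TopicTiling topic distributions) are assumed to be supplied by the caller.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def context_relevance(features):
    """s(c): mean cosine similarity of each text block with its previous and next blocks."""
    s = np.zeros(len(features))
    for i in range(1, len(features) - 1):
        s[i] = 0.5 * (cosine(features[i], features[i - 1]) +
                      cosine(features[i], features[i + 1]))
    return s

def depth_scores(s):
    """depth(c) = (hl(c) - s(c)) + (hr(c) - s(c)), with hl/hr the nearest score peaks."""
    depth = np.zeros_like(s)
    for i in range(1, len(s) - 1):
        hl = s[i]
        for j in range(i - 1, -1, -1):       # climb left while the score keeps rising
            if s[j] < hl:
                break
            hl = s[j]
        hr = s[i]
        for j in range(i + 1, len(s)):       # climb right while the score keeps rising
            if s[j] < hr:
                break
            hr = s[j]
        depth[i] = (hl - s[i]) + (hr - s[i])
    return depth

def select_voice_boundaries(features, num_units):
    """Return the indices of the num_units text blocks with the highest depth scores."""
    depth = depth_scores(context_relevance(features))
    return sorted(int(i) for i in np.argsort(depth)[::-1][:num_units])
```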
S104, determining a video theme boundary of the target video according to the scene boundaries and the voice boundaries;
specifically, a plurality of voice boundaries and a plurality of scene boundaries are respectively mapped onto a time axis of a target video; respectively calculating a plurality of time intervals between a plurality of scene boundaries and each voice boundary; determining a plurality of boundary pairs according to a plurality of time intervals, wherein each boundary pair comprises a voice boundary and a scene boundary; and determining a plurality of time points according to the plurality of boundary pairs, and taking the plurality of time points as a plurality of video subject boundaries of the target video.
Specifically, the number of video topics may be determined first; before detecting the positions of the video topic boundaries, the primary task is to determine how many topics the video contains. This embodiment considers three aspects when determining the number of video topics. First, the number of video scenes: generally speaking, the different story units of a video tend to be distributed across different scenes. Second, the number of video topic boundaries is related to the video duration: longer videos can tell more content and contain more topics. For most videos, the duration of one topic unit is between 3 and 5 minutes, with an average of about 4 minutes, so the number of topics can be estimated by dividing the video duration by the average duration of a story unit. Third, from a statistical point of view, a threshold on the depth scores is used to determine the number of topics, which accounts for differences between videos. Combining these three factors, the number of video topics can be calculated as follows:
[The calculation is given as two formula images in the original document: the number of video topics, TopicCount, is obtained by weighting the scene-unit count n, the duration-based estimate t/st, and a depth-score statistic based on σ and the per-block depth scores depthScore[i], with the weights α, β and θ.]
Here n is the number of scene units, t is the total duration of the video, st is the average duration of a story unit of the video (240 seconds), α, β and θ are empirical parameters (set to 0.6, 0.1 and 0.3, respectively, in this embodiment), σ is the standard deviation of the depth scores of the text blocks, depthScore[i] denotes the depth score of the i-th text block, count(depthScore[i]) denotes the number of such text blocks, and TopicCount denotes the number of video topics.
Therefore, boundary pairs equal in number to the number of video topics are determined with the above method, and a time point is then selected from each boundary pair as the time point corresponding to a video topic boundary. The calculation formula can be as follows:

p_k = \lambda \, x_i + (1 - \lambda)\, y_i

where p_k denotes the time point corresponding to a video topic boundary, x_i denotes the time point corresponding to the scene boundary in the i-th boundary pair, y_i denotes the time point corresponding to the voice boundary in the i-th boundary pair, and λ is a weight value that determines whether p_k lies closer to x_i or to y_i. In this embodiment the visual content is considered more intuitive, so λ is set to favor the scene boundary; of course, the weight can be adjusted toward the voice boundary if the voice content is considered more informative. Fig. 4 is a schematic diagram of determining a video topic boundary provided in an embodiment of the present application.
There may also be another situation, in which the calculated number of video topics is greater than the number of scene units; in this case the number of boundary pairs is less than the number of video topics, and the video topic boundaries may be determined with the method shown in Fig. 5, which is a schematic diagram of another method for determining video topic boundaries provided in an embodiment of the present application. After the video topic boundaries are determined from the boundary pairs, the remaining voice boundaries are taken as additional video topic boundaries in descending order of depth score, until the number of video topic boundaries reaches the calculated number of video topics. A sketch of this fusion step is given below.
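The boundary-fusion step of S104, including the fallback for extra topic boundaries just described, can be outlined as follows. The weight lam and the assumption that the leftover voice boundaries are passed in descending order of depth score are illustrative choices consistent with the description, not values fixed by the patent.

```python
def fuse_boundaries(scene_times, voice_times, topic_count, lam=0.7):
    """scene_times / voice_times: boundary positions (seconds) on the video timeline;
    voice_times is assumed to be ordered by decreasing depth score for the fallback."""
    # All pairwise time intervals between scene boundaries and voice boundaries.
    intervals = sorted((abs(x - y), x, y) for x in scene_times for y in voice_times)

    pairs, used_scene, used_voice = [], set(), set()
    for _, x, y in intervals:                # greedily take the closest remaining pair
        if x in used_scene or y in used_voice:
            continue                         # skip intervals touching an already-used boundary
        pairs.append((x, y))
        used_scene.add(x)
        used_voice.add(y)
        if len(pairs) == topic_count:
            break

    # Each pair yields one topic-boundary time point as a weighted combination.
    topic_boundaries = [lam * x + (1 - lam) * y for x, y in pairs]

    # Fallback: if there are fewer pairs than topics, the remaining voice
    # boundaries are added in their given (depth-score) order.
    for y in voice_times:
        if len(topic_boundaries) >= topic_count:
            break
        if y not in used_voice:
            topic_boundaries.append(y)
            used_voice.add(y)

    return sorted(topic_boundaries)
```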
And S105, dividing the target video into a plurality of theme units according to the video theme boundary.
Specifically, according to the selected video theme boundary, the target video is divided into a plurality of theme units, and the video content is organized in the manner of theme units, wherein each theme unit comprises corresponding image content and text content.
In the embodiment, in the process of determining the video theme boundary, the contents of the visual channel and the voice channel of the video are fused, the visual content and the voice content can be mutually supplemented, the video unit can be more accurately divided, and the accuracy of the video content structuring is higher.
Fig. 6 is a schematic flowchart of a video content structuring method provided in the second embodiment of the present application, and as shown in fig. 6, the method includes: and respectively extracting visual clues and voice clues of the target video, then carrying out multi-clue fusion on the visual clues and the voice clues, and dividing the target video into a plurality of subject units.
When extracting the visual cue of the target video, the video frame can be used as a processing object to perform scene boundary detection, and the target video is divided into a plurality of scene units. When scene boundary detection is carried out, visual features of video frames can be extracted, shot segmentation is carried out based on the visual features, and then all shots obtained through segmentation are clustered, so that a scene boundary is finally obtained.
When extracting the voice cue of the target video, the voice signal is used as the processing object: the speech is recognized as text, a topic model is trained on the video corpus, the text is divided into a plurality of text blocks, the text features of the text blocks are extracted and measured for similarity, and the target video is divided into a plurality of voice units according to the similarity measurement results.
To verify the effectiveness of the video structuring method, 50 videos from the YouTube website were selected as the experimental data set, with durations between 12 minutes and 55 minutes. Because evaluating video boundaries is somewhat subjective, the selected experimental videos are mainly news and education videos, which have obvious topic boundaries that are easy to judge manually. First, the data set was divided into 10 groups of 5 videos each. Then 20 students were invited to participate in the experiment and were also divided into 10 groups of two people; the experimenters were asked to watch the videos carefully, understand the video content, and then mark the time nodes at which the video topic switches, which serve as the reference for evaluating the experimental results. The effectiveness of the algorithm is evaluated by comparing the topic boundaries detected by the algorithm with the manually annotated topic boundaries: if the difference between the timestamps of two boundaries is within 5 seconds, the detected boundary is considered accurate and effective.
Three performance indexes commonly used in the field of information retrieval, namely precision, recall and F-measure, are adopted to evaluate the quality of the algorithm. The calculation formulas of these three indexes are as follows:
Precision = \frac{\text{number of correctly detected topic boundaries}}{\text{total number of detected topic boundaries}}

Recall = \frac{\text{number of correctly detected topic boundaries}}{\text{total number of manually annotated topic boundaries}}

F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}
the precision rate reflects the accuracy of the experimental result, the recall rate reflects the comprehensiveness of the experimental result, and the F-measure is a harmonic mean value of the precision rate and the recall rate and comprehensively reflects a good circle of the experimental result. The values of the performance indexes are between 0 and 1, and the larger the value, the better the algorithm effect is.
Because manually annotated video boundaries are somewhat subjective, the two experimenters in each group were required to annotate the same group of videos independently, and the average index of each group of videos was calculated.
Fig. 7 is a test experiment result diagram provided in the second embodiment of the present application, and referring to fig. 7, the precision ratio, the recall ratio, and the harmonic mean value of the structured experiment based on the visual scene are respectively: 0.72, 0.53, 0.59; the precision ratio, the recall ratio and the harmonic mean value of the structured experiment based on the voice text are respectively as follows: 0.70, 0.53, 0.60; the precision ratio, the recall ratio and the harmonic mean value based on the method in the embodiment are respectively as follows: 0.83, 0.68, 0.74. The method in the embodiment has better performance, and can basically accurately and effectively detect the boundary of the video theme and divide the video theme units.
Fig. 8 is a schematic structural diagram of a video content structuring apparatus according to a third embodiment of the present application, where as shown in fig. 8, the apparatus includes:
a scene boundary dividing module 81, configured to obtain visual channel information of a target video, and divide the target video into a plurality of scene units based on the visual channel information, where the plurality of scene units include a plurality of scene boundaries;
a text block segmentation module 82, configured to convert the voice of the target video into a voice text, and segment the voice text into a plurality of text blocks;
a speech unit segmentation module 83, configured to divide the target video into a plurality of speech units based on the plurality of text blocks, where the plurality of speech units include a plurality of speech boundaries;
a video topic boundary determination module 84, configured to determine a video topic boundary of the target video according to the plurality of scene boundaries and the plurality of voice boundaries;
a theme unit dividing module 85, configured to divide the target video into a plurality of theme units according to the video theme boundary.
The voice unit segmentation module 83 includes:
the context relevance degree calculation operator module is used for calculating the context relevance degree of each text block respectively;
the depth score calculating submodule is used for respectively calculating the depth score of each text block according to the context association degree;
and the dividing submodule is used for dividing the target video into a plurality of voice units according to the depth score of each text block, wherein the number of the voice units is equal to a preset multiple of the number of the scene boundaries.
The context relevance calculator operator module includes:
the characteristic extraction unit is used for respectively extracting the text characteristic of each text block;
a calculating unit, configured to calculate, based on the text feature, a context association degree of each text block by using the following formula:
s(c) = \frac{1}{2}\left(\frac{\sum_t w_{t,c}\,w_{t,p}}{\sqrt{\sum_t w_{t,c}^2}\sqrt{\sum_t w_{t,p}^2}} + \frac{\sum_t w_{t,c}\,w_{t,f}}{\sqrt{\sum_t w_{t,c}^2}\sqrt{\sum_t w_{t,f}^2}}\right)

where c denotes each text block, p denotes the previous text block adjacent to it, f denotes the next text block adjacent to it, w_{t,x} denotes the value of the t-th dimension of the text feature of text block x (x = c, p or f), and s(c) denotes the context association degree of the text block.
The division submodule includes:
the sorting unit is used for sorting the text blocks according to the depth score of each text block;
the target text block determining unit is used for determining a plurality of target text blocks according to the sequencing result, wherein the number of the target text blocks is equal to a preset multiple of the number of the scene boundaries;
and the dividing unit is used for dividing the target video into a plurality of voice units by adopting the plurality of target text blocks.
The video theme boundary determining module 84 includes:
a mapping sub-module, configured to map the multiple voice boundaries and the multiple scene boundaries onto a time axis of the target video, respectively;
a time interval calculation submodule for calculating a plurality of time intervals between the plurality of scene boundaries and each of the speech boundaries, respectively;
a boundary pair determining submodule, configured to determine a plurality of boundary pairs according to the plurality of time intervals, where each boundary pair includes a speech boundary and a scene boundary;
and the video theme boundary determining submodule is used for determining a plurality of time points according to the plurality of boundary pairs and taking the plurality of time points as a plurality of video theme boundaries of the target video.
The boundary pair determination submodule includes:
a first boundary pair determining unit, configured to determine a first scene boundary and a first voice boundary corresponding to a minimum time interval of the multiple time intervals, and use the first scene boundary and the first voice boundary as a first boundary pair;
a deleting unit configured to delete a time interval related to the first scene boundary and/or the first speech boundary among the plurality of time intervals;
and the second boundary pair determining unit is used for determining a second scene boundary and a second voice boundary corresponding to the minimum time interval in the remaining time intervals, and taking the second scene boundary and the second voice boundary as a second boundary pair.
Fig. 9 is a schematic structural diagram of a terminal device according to an embodiment of the present application. As shown in fig. 9, the terminal device 9 of this embodiment includes: at least one processor 90 (only one shown in fig. 9), a memory 91, and a computer program 92 stored in the memory 91 and executable on the at least one processor 90, the processor 90 implementing the steps in any of the various method embodiments described above when executing the computer program 92.
The terminal device 9 may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The terminal device may include, but is not limited to, a processor 90, a memory 91. Those skilled in the art will appreciate that fig. 9 is only an example of the terminal device 9, and does not constitute a limitation to the terminal device 9, and may include more or less components than those shown, or combine some components, or different components, for example, and may further include an input/output device, a network access device, and the like.
The processor 90 may be a Central Processing Unit (CPU), and the processor 90 may be other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 91 may in some embodiments be an internal storage unit of the terminal device 9, such as a hard disk or a memory of the terminal device 9. The memory 91 may also be an external storage device of the terminal device 9 in other embodiments, such as a plug-in hard disk, a smart card (SMC), a Secure Digital (SD) card, a flash card (FlashCard), and the like, which are provided on the terminal device 9. Further, the memory 91 may also include both an internal storage unit and an external storage device of the terminal device 9. The memory 91 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 91 may also be used to temporarily store data that has been output or is to be output.
It should be noted that, for the information interaction, execution process, and other contents between the above-mentioned devices/units, the specific functions and technical effects thereof are based on the same concept as those of the embodiment of the method of the present application, and specific reference may be made to the part of the embodiment of the method, which is not described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in the above-mentioned method embodiments.
The embodiments of the present application provide a computer program product, which when running on a terminal device, enables the terminal device to implement the steps in the above method embodiments when executed.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the processes in the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and can implement the steps of the embodiments of the methods described above when the computer program is executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing apparatus/terminal apparatus, a recording medium, computer memory, Read-only memory (ROM), random-access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium. Such as a usb-disk, a removable hard disk, a magnetic or optical disk, etc. In certain jurisdictions, computer-readable media may not be an electrical carrier signal or a telecommunications signal in accordance with legislative and patent practice.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the above-described apparatus/network device embodiments are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implementing, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A method for structuring video content, comprising:
acquiring visual channel information of a target video, and dividing the target video into a plurality of scene units based on the visual channel information, wherein the plurality of scene units comprise a plurality of scene boundaries;
converting the voice of the target video into a voice text, and dividing the voice text into a plurality of text blocks;
dividing the target video into a plurality of speech units based on the plurality of text blocks, the plurality of speech units comprising a plurality of speech boundaries;
determining a video subject boundary of the target video according to the scene boundaries and the voice boundaries;
and dividing the target video into a plurality of subject units according to the video subject boundary.
2. The method of claim 1, wherein said dividing the target video into a plurality of speech units based on the plurality of text blocks comprises:
respectively calculating the context association degree of each text block;
respectively calculating the depth score of each text block according to the context association degree;
and dividing the target video into a plurality of voice units according to the depth score of each text block, wherein the number of the voice units is equal to a preset multiple of the number of the scene boundaries.
3. The method of claim 2, wherein said separately calculating a contextual relevance of each text block comprises:
respectively extracting text features of each text block;
based on the text features, calculating the context association degree of each text block by adopting the following formula:
s(c) = \frac{1}{2}\left(\frac{\sum_t w_{t,c}\,w_{t,p}}{\sqrt{\sum_t w_{t,c}^2}\sqrt{\sum_t w_{t,p}^2}} + \frac{\sum_t w_{t,c}\,w_{t,f}}{\sqrt{\sum_t w_{t,c}^2}\sqrt{\sum_t w_{t,f}^2}}\right)

wherein c represents said each text block, p represents a previous text block adjacent to said each text block, f represents a next text block adjacent to said each text block, w_{t,x} represents the value of the t-th dimension of the text feature of text block x, x = c, p or f, and s(c) represents the contextual relevance of each said text block.
4. The method of claim 2, wherein said dividing the target video into a plurality of phonetic units based on the depth score of each text block comprises:
sequencing the text blocks according to the depth fraction of each text block;
determining a plurality of target text blocks according to the sequencing result, wherein the number of the target text blocks is equal to a preset multiple of the number of the scene boundaries;
and dividing the target video into a plurality of voice units by adopting the plurality of target text blocks.
5. The method of claim 1, wherein determining the video subject boundary of the target video based on the plurality of scene boundaries, the plurality of speech boundaries comprises:
mapping the plurality of voice boundaries and the plurality of scene boundaries onto a time axis of the target video respectively;
respectively calculating a plurality of time intervals between the plurality of scene boundaries and each of the voice boundaries;
determining a plurality of boundary pairs according to the plurality of time intervals, wherein each boundary pair comprises a voice boundary and a scene boundary;
and determining a plurality of time points according to the plurality of boundary pairs, and taking the plurality of time points as a plurality of video subject boundaries of the target video.
6. The method of claim 5, wherein the plurality of boundary pairs includes a first boundary pair and a second boundary pair, and wherein selecting the plurality of boundary pairs based on the plurality of time intervals comprises:
determining a first scene boundary and a first voice boundary corresponding to the minimum time interval in the plurality of time intervals, and taking the first scene boundary and the first voice boundary as a first boundary pair;
deleting time intervals of the plurality of time intervals that are associated with the first scene boundary and/or the first speech boundary;
and determining a second scene boundary and a second voice boundary corresponding to the minimum time interval in the remaining time intervals, and taking the second scene boundary and the second voice boundary as a second boundary pair.
7. The method of claim 5, wherein the plurality of time points are determined from the plurality of boundary pairs using the following equation:
p_k = \lambda \, x_i + (1 - \lambda)\, y_i

wherein p_k represents the time point corresponding to a video topic boundary, x_i represents the time point corresponding to the scene boundary in the i-th boundary pair, y_i represents the time point corresponding to the voice boundary in the i-th boundary pair, and λ is a weight value.
8. A video content structuring apparatus, comprising:
the scene boundary dividing module is used for acquiring visual channel information of a target video, and dividing the target video into a plurality of scene units based on the visual channel information, wherein the plurality of scene units comprise a plurality of scene boundaries;
the text block segmentation module is used for converting the voice of the target video into a voice text and segmenting the voice text into a plurality of text blocks;
a speech unit segmentation module for dividing the target video into a plurality of speech units based on the plurality of text blocks, the plurality of speech units including a plurality of speech boundaries;
a video theme boundary determining module, configured to determine a video theme boundary of the target video according to the plurality of scene boundaries and the plurality of voice boundaries;
and the theme unit dividing module is used for dividing the target video into a plurality of theme units according to the video theme boundary.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202011217518.2A 2020-11-04 2020-11-04 Video content structuring method, device, terminal equipment and medium Active CN112040313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011217518.2A CN112040313B (en) 2020-11-04 2020-11-04 Video content structuring method, device, terminal equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011217518.2A CN112040313B (en) 2020-11-04 2020-11-04 Video content structuring method, device, terminal equipment and medium

Publications (2)

Publication Number Publication Date
CN112040313A true CN112040313A (en) 2020-12-04
CN112040313B CN112040313B (en) 2021-04-09

Family

ID=73572860

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011217518.2A Active CN112040313B (en) 2020-11-04 2020-11-04 Video content structuring method, device, terminal equipment and medium

Country Status (1)

Country Link
CN (1) CN112040313B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112929744A (en) * 2021-01-22 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for segmenting video clips
CN113096687A (en) * 2021-03-30 2021-07-09 中国建设银行股份有限公司 Audio and video processing method and device, computer equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102547139A (en) * 2010-12-30 2012-07-04 北京新岸线网络技术有限公司 Method for splitting news video program, and method and system for cataloging news videos
CN106649713A (en) * 2016-12-21 2017-05-10 中山大学 Movie visualization processing method and system based on content
CN109145152A (en) * 2018-06-28 2019-01-04 中山大学 A kind of self-adapting intelligent generation image-text video breviary drawing method based on query word
CN110197135A (en) * 2019-05-13 2019-09-03 北京邮电大学 A kind of video structural method based on multidimensional segmentation

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102547139A (en) * 2010-12-30 2012-07-04 北京新岸线网络技术有限公司 Method for splitting news video program, and method and system for cataloging news videos
CN106649713A (en) * 2016-12-21 2017-05-10 中山大学 Movie visualization processing method and system based on content
CN109145152A (en) * 2018-06-28 2019-01-04 中山大学 A kind of self-adapting intelligent generation image-text video breviary drawing method based on query word
CN110197135A (en) * 2019-05-13 2019-09-03 北京邮电大学 A kind of video structural method based on multidimensional segmentation

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112929744A (en) * 2021-01-22 2021-06-08 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for segmenting video clips
CN112929744B (en) * 2021-01-22 2023-04-07 北京百度网讯科技有限公司 Method, apparatus, device, medium and program product for segmenting video clips
CN113096687A (en) * 2021-03-30 2021-07-09 中国建设银行股份有限公司 Audio and video processing method and device, computer equipment and storage medium
CN113096687B (en) * 2021-03-30 2024-04-26 中国建设银行股份有限公司 Audio and video processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112040313B (en) 2021-04-09

Similar Documents

Publication Publication Date Title
US9805270B2 (en) Video segmentation techniques
Rui et al. Constructing table-of-content for videos
CN110083741B (en) Character-oriented video abstract extraction method based on text and image combined modeling
CN111814770B (en) Content keyword extraction method of news video, terminal device and medium
WO2020232796A1 (en) Multimedia data matching method and device, and storage medium
CN108563655B (en) Text-based event recognition method and device
CN112287914B (en) PPT video segment extraction method, device, equipment and medium
Shah et al. TRACE: linguistic-based approach for automatic lecture video segmentation leveraging Wikipedia texts
CN108460098B (en) Information recommendation method and device and computer equipment
CN112040313B (en) Video content structuring method, device, terminal equipment and medium
CN103150373A (en) Generation method of high-satisfaction video summary
CN114297439B (en) Short video tag determining method, system, device and storage medium
US20230057010A1 (en) Term weight generation method, apparatus, device and medium
US20110122137A1 (en) Video summarization method based on mining story structure and semantic relations among concept entities thereof
CN111291177A (en) Information processing method and device and computer storage medium
Dumont et al. Automatic story segmentation for tv news video using multiple modalities
Jou et al. Structured exploration of who, what, when, and where in heterogeneous multimedia news sources
CN110619284B (en) Video scene division method, device, equipment and medium
CN111767393A (en) Text core content extraction method and device
CN108170845B (en) Multimedia data processing method, device and storage medium
CN116361510A (en) Method and device for automatically extracting and retrieving scenario segment video established by utilizing film and television works and scenario
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
CN113963303A (en) Image processing method, video recognition method, device, equipment and storage medium
US11990131B2 (en) Method for processing a video file comprising audio content and visual content comprising text content
CN113407775B (en) Video searching method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant