WO2021172674A1 - Apparatus and method for generating a video summary through recursive graph modeling - Google Patents

Apparatus and method for generating a video summary through recursive graph modeling

Info

Publication number
WO2021172674A1
Authority
WO
WIPO (PCT)
Prior art keywords
graph
initial
adjacency matrix
feature
frames
Prior art date
Application number
PCT/KR2020/010755
Other languages
English (en)
Korean (ko)
Inventor
손광훈
박정인
Original Assignee
연세대학교 산학협력단
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 연세대학교 산학협력단
Publication of WO2021172674A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/738Presentation of query results
    • G06F16/739Presentation of query results in form of a video summary, e.g. the video summary being a video sequence, a composite still image or having synthesized frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/47Detecting features for summarising video content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8549Creating video summaries, e.g. movie trailer

Definitions

  • The present invention relates to an apparatus and method for generating a video summary, and more particularly, to an apparatus and method for generating a video summary through recursive graph modeling.
  • Video summarization is difficult in that useful frames must be selected in consideration of the interrelationship among all frames of a video. Accordingly, various studies are being conducted on video summary methods that extract key frames using artificial neural networks.
  • CNNs: convolutional neural networks
  • RNNs: recurrent neural networks
  • An object of the present invention is to provide an apparatus and method for generating a video summary that can accurately extract key frames for a video summary by considering the plurality of frames of a video as nodes of a relationship graph and applying them to a graph convolutional neural network.
  • Another object of the present invention is to provide an apparatus and method for generating a video summary capable of recursively inferring semantic similarity between a plurality of frames and extracting a key frame in consideration of the global and long-term interrelationship between the plurality of frames.
  • According to an embodiment of the present invention, an apparatus for generating a video summary includes: an initial graph generating unit that receives an input video composed of a plurality of frames, generates a plurality of feature maps each extracted from the plurality of frames according to a pre-learned pattern estimation method, generates an initial feature graph by considering each of the feature maps as a node vector, and generates an initial adjacency matrix by calculating the similarity between the node vectors;
  • a graph correlation unit that extracts a correction graph by applying the feature graph and the adjacency matrix to a plurality of graph convolution networks in which the pattern estimation method has been learned in advance;
  • a recursive graph acquisition unit that obtains an adjacency matrix and a feature graph for extracting the next correction graph based on the previous adjacency matrix and the correction similarity indicating the degree of similarity between two weighted correction graphs;
  • and a key frame extracting unit that, using a separate graph convolution network in which the pattern estimation method has been learned in advance, selects a plurality of key frames by estimating the probability that each frame becomes a key frame according to the semantic similarity between the nodes of the final correction graph and the final adjacency matrix obtained by repeating the recursion a predetermined number of times.
  • the initial graph generating unit may include: a feature map obtaining unit configured to generate the plurality of feature maps by extracting features of each of the plurality of frames of the input video according to a pre-learned pattern estimation method; an initial graph obtaining unit for obtaining the initial feature graph by considering each of the plurality of feature maps as a node of a graph consisting of a plurality of nodes and a plurality of edges connecting the plurality of nodes to each other; and an initial adjacency matrix obtaining unit configured to obtain an initial adjacency matrix by calculating a degree of similarity between a plurality of nodes of the initial feature graph.
  • the initial graph obtaining unit may obtain the initial feature graph by converting each of the plurality of feature maps into a one-dimensional node vector.
  • The initial adjacency matrix obtaining unit may calculate the degree of similarity between the node vectors (x_1, ..., x_T) of the initial feature graph (X_0) to obtain the initial adjacency matrix (A_0).
  • the initial graph generator may further include a frame selector that extracts frames from a plurality of frames of the input video in units of a predetermined time interval and transmits the extracted frames to the feature map acquirer.
  • The plurality of graph correlation modules are sequentially connected in series; the graph correlation module of the initial stage among the plurality of graph correlation modules receives an initial feature graph or a recursive feature graph together with an initial adjacency matrix or a recursive adjacency matrix and extracts a first intermediate feature graph by estimating the pattern, and each of the remaining graph correlation modules receives the intermediate feature graph extracted in the previous stage and the adjacency matrix applied to the graph correlation module of the initial stage, and extracts an intermediate feature graph or the correction graph.
  • The recursive graph acquisition unit may include: first and second projection units that each receive the previously obtained correction graph and weight it with different predetermined weighting functions so that the correction graph is projected onto different linear embedding spaces, thereby obtaining two projection correction graphs; an adjacency correction value obtaining unit that calculates the similarity between the two projection correction graphs to obtain the correction similarity; and an adjacency matrix obtaining unit configured to obtain the adjacency matrix by adding the obtained correction similarity to the previous adjacency matrix.
  • The adjacency correction value obtaining unit may calculate the correction similarity (dA_k) according to an equation in which W_φ Z_k and W_ψ Z_k denote the projection correction graphs obtained by weighting the correction graph (Z_k) with the weighting functions (W_φ, W_ψ), T denotes the transpose, and ‖·‖_2 denotes the L2 norm function.
  • The key frame extractor may include: a correlation estimation unit that applies the final correction graph and the final adjacency matrix to a separate graph convolution network in which a pattern estimation method has been learned in advance, and extracts, according to the semantic similarity pattern between the nodes of the final correction graph, a key frame probability map indicating the probability that each of the plurality of frames of the input video is a key frame; and a key frame selector configured to select a plurality of key frames by analyzing, from the key frame probability map, the probability that each of the plurality of frames of the input video is a key frame.
  • According to another embodiment of the present invention, a method for generating a video summary includes: receiving an input video composed of a plurality of frames, generating a plurality of feature maps respectively extracted from the plurality of frames according to a pre-learned pattern estimation method, generating an initial feature graph by considering each of the feature maps as a node vector, and generating an initial adjacency matrix by calculating the degree of similarity between the node vectors;
  • extracting a correction graph by applying the feature graph and the adjacency matrix to a plurality of graph convolution networks in which the pattern estimation method has been learned in advance;
  • and obtaining an adjacency matrix and a feature graph for extracting the next correction graph based on the previous adjacency matrix and the correction similarity indicating the degree of similarity between two weighted correction graphs, and selecting a plurality of key frames from the final correction graph and the final adjacency matrix obtained by repeating the recursion a predetermined number of times.
  • The apparatus and method for generating a video summary according to embodiments of the present invention regard the multiple frames of a video image as nodes of a relationship graph and infer the semantic similarity between the multiple frames using a graph convolutional neural network having a recursive reasoning structure, so that key frames can be accurately extracted in consideration of the global and long-term interrelationship between the multiple frames, making it possible to generate video summaries on various video platforms with a relatively small number of parameters and high accuracy.
  • FIG. 1 shows a schematic structure of an apparatus for generating a video summary according to an embodiment of the present invention.
  • FIG. 2 illustrates a detailed configuration of an initial graph generator of the apparatus for generating a video summary of FIG. 1 .
  • FIG. 3 shows detailed configurations of a graph correlation unit and a recursive graph acquiring unit of the apparatus for generating a video summary of FIG. 1 .
  • FIG. 4 shows a video summary generation method according to an embodiment of the present invention.
  • FIG. 1 shows a schematic structure of an apparatus for generating a video summary according to an embodiment of the present invention.
  • The video summary generating apparatus may include an initial graph generating unit 100, a graph correlation unit 200, a recursive graph obtaining unit 300, and a key frame extracting unit 400.
  • the initial adjacency matrix (A 0 ) may be obtained with a size of T ⁇ T.
  • The graph correlation unit 200 receives the initial feature graph (X_0) and the initial adjacency matrix (A_0) generated by the initial graph generating unit 100, or the previously acquired feature graph (X_k-1) and the adjacency matrix (A_k-1) obtained by the recursive graph obtaining unit 300. According to the pattern estimation method using a weight matrix (W) obtained in advance by learning, it extracts the correction graph (Z_k) by correcting the pattern between the nodes of the initial feature graph (X_0) or the feature graph (X_k-1), estimated from the initial feature graph (X_0) and the initial adjacency matrix (A_0) or from the feature graph (X_k-1) and the adjacency matrix (A_k-1).
  • The key frame extraction unit 400 applies the final correction graph (Z_K) and the final adjacency matrix (A_K), obtained after the graph correlation unit 200 and the recursive graph acquiring unit 300 repeat the recursion a predetermined number of times, to a separate pre-trained graph convolution network, extracts a key frame probability map (Y) indicating the probability that each of the plurality of frames of the input video (I) is a key frame (F_key), and selects key frames (F_key) according to the extracted key frame probability map (Y).
  • That is, the video summary generating apparatus considers the feature map extracted from each of the plurality of frames of the input video (I) as a node, generates the initial feature graph (X_0) composed of a plurality of nodes together with the initial adjacency matrix (A_0) indicating the similarity between the nodes, recursively extracts the correction feature graph (Z_k) and the adjacency matrix (A_k) a predetermined number of times, and estimates the probability that each of the plurality of frames of the input video (I) is a key frame (F_key) according to the semantic similarity pattern from the final correction feature graph (Z_K) and the final adjacency matrix (A_K) thus obtained.
  • FIG. 2 illustrates a detailed configuration of an initial graph generator of the apparatus for generating a video summary of FIG. 1 .
  • The initial graph generating unit 100 may include a frame selecting unit 110, a feature map obtaining unit 120, an initial graph obtaining unit 130, and an initial adjacency matrix obtaining unit 140.
  • the frame selector 110 receives the input video I including a plurality of frames, selects a frame in a predetermined time interval unit (eg, 1 second), and transmits it to the feature map obtainer 120 .
  • a video image consists of 30 to 60 frames per second, and most frames are very similar to each other within a short time period. Therefore, it is inefficient to analyze the semantic similarity between all frames of the input video (I) to select key frames for generating a video summary.
  • the frame selector 110 selects a frame from the applied input video I in units of a predetermined time period and transmits it to the feature map obtainer 120 .
  • The feature map acquisition unit 120 may be implemented with various artificial neural networks; since various artificial neural networks for extracting feature maps (x_1, ..., x_T) from each frame image are already publicly available, it may be implemented using one of these public artificial neural networks.
  • the feature map acquisition unit 120 may be implemented, for example, as a pre-trained convolutional neural network (CNN).
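  • For illustration only (this sketch is not part of the patent text), frame selection in units of one second and feature map extraction could look as follows, assuming PyTorch with torchvision ≥ 0.13 and a pretrained ResNet-18 whose classification layer is removed; the patent only requires some pre-learned pattern estimation network, so the backbone choice is an assumption:

        import torch
        import torchvision.transforms as T
        from torchvision.models import resnet18

        # Assumes `video` is a float tensor [num_frames, 3, 224, 224] with values in [0, 1].
        def sample_frames(video: torch.Tensor, fps: int) -> torch.Tensor:
            # Keep one frame per second (the predetermined time interval of the frame selector).
            return video[::fps]

        # Pretrained CNN used as a generic feature extractor (ResNet-18 is an assumption here).
        backbone = resnet18(weights="IMAGENET1K_V1")
        backbone.fc = torch.nn.Identity()   # keep the 512-dimensional pooled feature
        backbone.eval()

        normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

        @torch.no_grad()
        def extract_feature_maps(frames: torch.Tensor) -> torch.Tensor:
            x = normalize(frames)           # [T, 3, 224, 224]
            return backbone(x)              # [T, D]: one node vector per selected frame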
  • the initial graph obtaining unit 130 acquires the initial feature graph (X 0 ) by considering each of the plurality of feature maps (X) as a node vector.
  • the initial graph obtaining unit 130 converts, for example, a plurality of feature maps (X) obtained with a predetermined two-dimensional or three-dimensional size by the feature map obtaining unit 120 into a one-dimensional node vector, and the initial feature graph (X 0 ) can be obtained.
  • the initial graph acquisition unit 130 regards each of the plurality of feature maps X as a node of the graph consisting of a plurality of nodes and a plurality of edges connecting the plurality of nodes to each other, so that thereafter, the graph convolutional network (Graph Convolutional Network) : It is obtained as an initial feature graph (X 0 ) suitable for the graph correlation unit 200 composed of GCN).
  • the initial graph obtaining unit 130 may be configured to obtain the initial feature graph (X 0 ) by converting each of the plurality of feature maps (X) into a node vector having a predetermined format. For example, when each of the plurality of feature maps X is obtained as a two-dimensional vector of size T ⁇ D, the initial graph acquisition unit 130 may convert it into a form of a one-dimensional vector of length T ⁇ D.
  • The initial adjacency matrix obtaining unit 140 calculates the similarity between the node vectors (x_1, ..., x_T) of the initial feature graph (X_0) obtained by the initial graph obtaining unit 130, that is, between the plurality of feature maps (X), according to Equation 1 to obtain the initial adjacency matrix (A_0).
  • [Equation 1] a_ij = x_i^T x_j / (‖x_i‖_2 ‖x_j‖_2)
  • Here, a_ij represents an element of the initial adjacency matrix (A_0), x_i and x_j represent the i-th and j-th node vectors among the node vectors (x_1, ..., x_T) of the initial feature graph (X_0), T denotes the transpose, and ‖·‖_2 denotes the L2 norm function.
  • the initial adjacency matrix (A 0 ) can be viewed as a weight matrix of edges connecting between the node vectors (x 1 , ..., x T ) of the initial feature graph (X 0 ) according to the degree of similarity.
  • The initial graph generating unit 100 transmits the initial feature graph (X_0) and the initial adjacency matrix (A_0), generated by the initial graph obtaining unit 130 and the initial adjacency matrix obtaining unit 140, to the graph correlation unit 200.
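  • As an illustrative sketch only (not part of the patent text), the cosine-similarity adjacency matrix of Equation 1 can be computed in a few lines, assuming PyTorch:

        import torch
        import torch.nn.functional as F

        def initial_adjacency(X0: torch.Tensor) -> torch.Tensor:
            """X0: [T, D] initial feature graph, one node vector per selected frame.
            Returns A0: [T, T] with a_ij = x_i^T x_j / (||x_i||_2 ||x_j||_2)."""
            Xn = F.normalize(X0, p=2, dim=1)   # divide each node vector by its L2 norm
            return Xn @ Xn.t()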
  • FIG. 3 shows detailed configurations of a graph correlation unit and a recursive graph acquiring unit of the apparatus for generating a video summary of FIG. 1 .
  • the graph correlation unit 200 includes a plurality of graph correlation modules 210 to 230 sequentially connected in series.
  • Each of the plurality of graph correlation modules 210 to 230 is implemented as a graph convolution network (GCN) in which a pattern estimation method is learned in advance, and extracts a correction graph (Z_1) by correcting the pattern estimated from the initial feature graph (X_0) and the initial adjacency matrix (A_0) applied from the initial graph generator 100.
  • The graph correlation unit 200 transmits the extracted correction graph (Z_1) to the recursive graph acquisition unit 300, receives the adjacency matrix (A_1) and the feature graph (X_1) transmitted back from the recursive graph acquisition unit 300, and repeats the process of extracting the next correction graph (Z_2).
  • That is, the graph correlation unit 200 receives the already acquired feature graph (X_k-1) and the adjacency matrix (A_k-1) and extracts the correction graph (Z_k) according to the pattern estimation method of a graph convolution network (GCN), which can be expressed by Equation 2.
  • [Equation 2] X_out = σ(A_k-1 X W)
  • The graph correlation module 210 of the initial stage among the plurality of graph correlation modules 210 to 230 receives the feature graph (X_k-1) and the adjacency matrix (A_k-1) and estimates the pattern to extract the first intermediate feature graph (X'), while the remaining graph correlation modules 220 and 230 respectively receive the intermediate feature graph (X', X'') extracted in the previous stage together with the adjacency matrix (A_k-1) and estimate the pattern to extract the second intermediate feature graph (X'') or the correction graph (Z_k).
  • Here, W denotes the weight of the learned graph convolution network (GCN), and σ denotes a nonlinear activation function such as a Rectified Linear Unit (ReLU).
  • In FIG. 3, the graph correlation unit 200 is illustrated as including three graph correlation modules 210 to 230 sequentially connected in series, but the number of graph correlation modules included in the graph correlation unit 200 can be variously adjusted by experimentally analyzing the performance of extracting the correction graph (Z_k). That is, the performance of extracting the correction graph (Z_k) depends on the number of graph correlation modules, and here, based on experimental results, a configuration with three graph correlation modules 210 to 230 is shown.
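  • A minimal sketch of such a graph correlation unit, assuming PyTorch (not part of the patent text); the hidden size, the absence of adjacency normalization, and the use of ReLU after every module are assumptions made only for illustration:

        import torch
        import torch.nn as nn

        class GraphConvModule(nn.Module):
            """One graph correlation module: X_out = ReLU(A @ X @ W), cf. Equation 2."""
            def __init__(self, dim_in: int, dim_out: int):
                super().__init__()
                self.W = nn.Linear(dim_in, dim_out, bias=False)

            def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
                return torch.relu(A @ self.W(X))

        class GraphCorrelationUnit(nn.Module):
            """Three modules in series; every module reuses the same adjacency matrix A_k-1."""
            def __init__(self, dim: int):
                super().__init__()
                self.layers = nn.ModuleList([GraphConvModule(dim, dim) for _ in range(3)])

            def forward(self, X: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
                for layer in self.layers:
                    X = layer(X, A)
                return X   # correction graph Z_k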
  • the recursive graph obtaining unit 300 may include first and second projection units 310 and 320 , an adjacency correction value obtaining unit 330 , and an adjacency matrix obtaining unit 340 .
  • The first and second projection units 310 and 320 each receive the correction graph (Z_k) obtained from the graph correlation unit 200 and weight the applied correction graph (Z_k) with different predetermined weighting functions (W_φ, W_ψ) so that the correction graph (Z_k) is projected onto different linear embedding spaces, thereby obtaining the projection correction graphs (W_φ Z_k, W_ψ Z_k).
  • The adjacency correction value acquisition unit 330 receives the projection correction graphs (W_φ Z_k, W_ψ Z_k) obtained from the first and second projection units 310 and 320, and obtains the correction similarity (dA_k) according to Equation 3.
  • [Equation 3] dA_k(i, j) = (W_φ z_i)^T (W_ψ z_j) / (‖W_φ z_i‖_2 ‖W_ψ z_j‖_2), where z_i and z_j are node vectors of the correction graph (Z_k)
  • Here, T denotes the transpose and ‖·‖_2 denotes the L2 norm function.
  • The adjacency matrix obtaining unit 340 adds the previous adjacency matrix (A_k-1) and the correction similarity (dA_k) obtained by the adjacency correction value obtaining unit 330, thereby obtaining the adjacency matrix (A_k) with which the graph correlation unit 200 estimates the next correction graph (Z_k+1).
  • That is, the adjacency matrix obtaining unit 340 acquires the adjacency matrix (A_k) for estimating the next correction graph (Z_k+1) by cumulatively adding the correction similarities (dA_k) obtained thereafter to the initial adjacency matrix (A_0). Therefore, if the recursive graph obtaining unit 300 repeats the recursion a predetermined number of times K to obtain the adjacency matrix (A_k), the finally obtained final adjacency matrix (A_K) can be calculated as in Equation 4.
  • [Equation 4] A_K = A_0 + Σ_{k=1..K} dA_k
  • the adjacency matrix obtaining unit 340 may transmit the already obtained correction graph Z k as the next feature graph X k together with the adjacency matrix A k to the graph correlation unit 200 .
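  • A minimal sketch of the recursive graph obtaining unit, assuming PyTorch (not part of the patent text); the embedding dimension and the row-wise L2 normalization used to realize the similarity of Equation 3 are assumptions:

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class RecursiveGraphUnit(nn.Module):
            """Projects Z_k into two linear embedding spaces, measures their similarity dA_k
            (cf. Equation 3) and accumulates it onto the previous adjacency matrix (Equation 4)."""
            def __init__(self, dim: int, emb: int):
                super().__init__()
                self.W_phi = nn.Linear(dim, emb, bias=False)   # first projection unit 310
                self.W_psi = nn.Linear(dim, emb, bias=False)   # second projection unit 320

            def forward(self, Z_k: torch.Tensor, A_prev: torch.Tensor):
                p = F.normalize(self.W_phi(Z_k), dim=1)        # W_phi Z_k, L2-normalized rows
                q = F.normalize(self.W_psi(Z_k), dim=1)        # W_psi Z_k
                dA = p @ q.t()                                 # correction similarity dA_k
                A_k = A_prev + dA                              # cumulative update
                return A_k, Z_k                                # Z_k is reused as the next feature graph X_k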
  • the key frame extractor 400 may include a correlation estimator 410 and a key frame selector 420 .
  • The correlation estimator 410 is implemented as a graph convolution network (GCN) in which the pattern estimation method is learned in advance. It receives the final correction graph (Z_K) and the final adjacency matrix (A_K), estimates the correlation between the node vectors of the final correction graph (Z_K), and extracts, according to the semantic similarity, a key frame probability map (Y) consisting of the probabilities that the frames of the input video (I) represented by the respective node vectors are key frames (F_key).
  • The correlation estimation unit 410 may extract the key frame probability map (Y) according to the pattern estimation method of the graph convolution network (GCN) expressed by Equation 5.
  • [Equation 5] Y = σ(A_K Z_K W_c)
  • Here, W_c represents the weight of the learned graph convolution network (GCN), and σ represents the activation function.
  • The key frame probability map (Y) may be obtained with a size of T × 2, where T is the number of frames, so that it represents, for the frame corresponding to each node vector, the probability that the frame is a key frame and the probability that it is not a key frame.
  • The key frame selector 420 selects, in a predetermined manner from the key frame probability map (Y), the frames among the plurality of frames of the input video (I) that have a high probability of being a key frame and a low probability of not being a key frame, as key frames (F_key).
  • For example, the key frame selection unit 420 may select a predetermined number of key frames (F_key), or may select as key frames (F_key) the frames whose probability of being a key frame is greater than or equal to a predetermined first threshold value or whose probability of not being a key frame is less than or equal to a predetermined second threshold value.
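  • A minimal sketch of such a key frame extraction head, assuming PyTorch (not part of the patent text); the softmax output and the 0.5 threshold are assumptions made only for illustration:

        import torch
        import torch.nn as nn

        class KeyFrameExtractor(nn.Module):
            """Output GCN head: estimates a T x 2 key frame probability map Y from (Z_K, A_K)
            (cf. Equation 5) and selects frames whose key-frame probability exceeds a threshold."""
            def __init__(self, dim: int, threshold: float = 0.5):
                super().__init__()
                self.W_c = nn.Linear(dim, 2, bias=False)       # weight of the output GCN
                self.threshold = threshold

            def forward(self, Z_K: torch.Tensor, A_K: torch.Tensor):
                Y = torch.softmax(A_K @ self.W_c(Z_K), dim=1)  # [T, 2]: P(key frame), P(not key frame)
                key_idx = (Y[:, 0] >= self.threshold).nonzero(as_tuple=True)[0]
                return Y, key_idx                              # probability map and indices of F_key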
  • the video summary generating apparatus may further include a learning unit (not shown) for learning the graph correlation unit 200 and the key frame extracting unit 400 .
  • the learning unit may train the graph correlation unit 200 and the key frame extracting unit 400 to minimize losses calculated according to a plurality of predetermined loss functions.
  • the learning unit may perform learning in one of a supervised learning method and an unsupervised learning method.
  • When performing supervised learning, the learning unit acquires, as training data, videos for which a ground-truth key frame probability map (Y*) has been obtained in advance based on importance scores normalized and averaged over a large number of users, inputs a training video as the input video (I), compares the ground-truth key frame probability map (Y*) of the training data with the key frame probability map (Y) obtained from the video summary generating apparatus, calculates the supervised learning loss (L_sup), and performs learning by backpropagating the loss.
  • the learning unit calculates a classification loss (L c ), a sparsity loss (L s ), a restoration loss (L r ), and a diversity loss (L d ), respectively.
  • The classification loss (L_c) is a binary cross-entropy loss between the key frame probability map (Y) generated by the video summary generating apparatus and the ground-truth key frame probability map (Y*), and can be calculated according to Equation 6.
  • Here, y*_t is the ground-truth label of the t-th frame, freq(s) is a value obtained by dividing the number of key frames by the total number of frames, and median_freq is the median value of the key frame frequency.
  • The sparsity loss (L_s) is a loss derived from the assumption that key frames should be rare among the frames of the input video (I), and can be obtained according to Equation 7 by applying the L1 norm function to the elements (a_ij ∈ A_K) of the final adjacency matrix.
  • The restoration loss (L_r) is a loss added based on the assumption that key frames must exist in visually various forms, and can be obtained according to Equation 8 by applying the key frame probability map (Y) to a separately provided graph convolution module implemented with a pre-trained graph convolution network.
  • The diversity loss (L_d) can be obtained according to Equation 9 by applying a repelling regularizer between the nodes of the additional correction feature graph that are selected as key frames.
  • The learning unit calculates the total loss for supervised learning (L_sup) from the classification loss (L_c), the sparsity loss (L_s), the restoration loss (L_r), and the diversity loss (L_d) calculated according to Equations 6 to 9, as in Equation 10, and backpropagates it.
  • [Equation 10] L_sup = L_c + α·L_s + β·L_r + γ·L_d
  • Here, α, β, and γ are weights for adjusting the importance of each loss.
  • When the learning unit performs unsupervised learning, since there is no training data, the total loss for unsupervised learning (L_unsup) is calculated as in Equation 11 from the sparsity loss (L_s), the restoration loss (L_r), and the diversity loss (L_d), excluding the classification loss (L_c) from Equation 10, and is backpropagated.
  • [Equation 11] L_unsup = α·L_s + β·L_r + γ·L_d
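  • For illustration only (not part of the patent text), the sparsity loss and the weighted combination of Equations 10 and 11 could be sketched as follows, assuming PyTorch; averaging over the matrix elements and the default weights of 1.0 are assumptions:

        import torch

        def sparsity_loss(A_K: torch.Tensor) -> torch.Tensor:
            """L1-norm-based sparsity term over the elements of the final adjacency matrix (Equation 7)."""
            return A_K.abs().mean()

        def total_loss(L_c: torch.Tensor, L_s: torch.Tensor, L_r: torch.Tensor, L_d: torch.Tensor,
                       alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0,
                       supervised: bool = True) -> torch.Tensor:
            """Weighted sum of the losses (Equations 10 and 11); the classification loss
            is dropped in the unsupervised setting."""
            loss = alpha * L_s + beta * L_r + gamma * L_d
            return loss + L_c if supervised else loss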
  • FIG. 4 shows a video summary generation method according to an embodiment of the present invention.
  • Describing the video summary generation method of FIG. 4 with reference to FIGS. 1 to 3, first, the input video (I) composed of the plurality of frames for which a video summary is to be generated is applied, the feature maps (X) extracted from the respective frames are regarded as node vectors to generate an initial feature graph (X_0), and an initial adjacency matrix (A_0) is generated by calculating the similarity between the node vectors (S10).
  • the input video I composed of a plurality of frames is first obtained ( S11 ). Then, a plurality of feature maps X are generated by extracting features for each of a plurality of frames of the input video I according to a pre-learned pattern estimation method (S12).
  • each of the plurality of feature maps (X) is regarded as a node vector representing a node of a graph consisting of a plurality of nodes and a plurality of edges connecting the plurality of nodes to each other and the initial feature graph (X 0 ) is obtained (S13).
  • When the initial feature graph (X_0) and the initial adjacency matrix (A_0) are obtained, they are applied to a plurality of graph convolution networks in which the pattern estimation method has been learned in advance, so that the pattern between the plurality of nodes of the initial feature graph (X_0) is corrected and the correction graph (Z_k) is extracted (S20).
  • Then, the correction graph (Z_k) is projected into different linear embedding spaces, and an adjacency matrix (A_k) for obtaining the next correction graph (Z_k+1) is obtained by calculating the correction similarity (dA_k) between the two projected correction graphs (W_φ Z_k, W_ψ Z_k) (S40).
  • In detail, if the recursion count (k) is less than a predetermined number of times (K), the correction graph (Z_k) is weighted with different predetermined weighting functions (W_φ, W_ψ) so that it is projected into different linear embedding spaces (S41). Then, the similarity between the two projection correction graphs (W_φ Z_k, W_ψ Z_k) is calculated to obtain the correction similarity (dA_k) (S42).
  • The next adjacency matrix (A_k) is obtained by adding the previous adjacency matrix (A_k-1) and the correction similarity (dA_k), and the correction graph (Z_k) is applied as the next feature graph (X_k) (S43).
  • The obtained next adjacency matrix (A_k) and next feature graph (X_k) are recursively applied to the plurality of graph convolution networks in which the pattern estimation method has been learned in advance, so that the next correction graph (Z_k+1) is extracted (S20).
  • When the recursion count (k) reaches the predetermined number of times (K), key frames (F_key) are selected by estimating the semantic similarity between the nodes of the final correction graph (Z_K) from the final adjacency matrix (A_K) and the final correction graph (Z_K), according to the pattern estimation method learned in advance (S50).
  • a key frame probability map Y indicating the probability that each of a plurality of frames of the input video I is a key frame F key is extracted according to the semantic similarity pattern between each node of the correction graph Z K (S51) .
  • A plurality of key frames (F_key) are selected by analyzing, from the extracted key frame probability map (Y), the probability that each of the plurality of frames of the input video (I) is a key frame (S52).
  • a plurality of key frames F key selected here are frames with high semantic similarity among a plurality of frames of the input video I, and may be viewed as a video summary.
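  • Putting steps S10 to S52 together, an end-to-end sketch could look as follows (illustrative only, not part of the patent text); it reuses the helper functions and modules sketched above and assumes K recursions:

        import torch

        def summarize(frames: torch.Tensor, K: int,
                      gcu: "GraphCorrelationUnit",
                      rgu: "RecursiveGraphUnit",
                      extractor: "KeyFrameExtractor"):
            """frames: [T, D] node vectors obtained from the selected video frames."""
            X = frames
            A = initial_adjacency(X)       # S10: initial feature graph and adjacency matrix
            for _ in range(K):             # S20 to S43: recursive graph modeling
                Z = gcu(X, A)              # correction graph Z_k
                A, X = rgu(Z, A)           # next adjacency matrix A_k, next feature graph X_k
            Y, key_idx = extractor(X, A)   # S50 to S52: after the loop, X is Z_K and A is A_K
            return Y, key_idx              # key frame probability map and selected key frames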
  • the method according to the present invention may be implemented as a computer program stored in a medium for execution by a computer.
  • the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media.
  • Computer storage media include both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data, and include ROM (Read Only Memory), RAM (Random Access Memory), CD (Compact Disk)-ROM, DVD (Digital Video Disk)-ROM, magnetic tape, floppy disks, optical data storage, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Image Analysis (AREA)

Abstract

The present invention may relate to an apparatus and method for generating a video summary in which, by considering a plurality of frames of video images as nodes of a relationship graph and inferring the semantic similarity between the plurality of frames by means of a graph convolutional neural network having a recursive inference structure, an accurate key frame can be extracted in consideration of the global long-term correlation between the plurality of frames, thereby making it possible to generate a video summary on various video platforms with a relatively small number of parameters and high accuracy.
PCT/KR2020/010755 2020-02-28 2020-08-13 Apparatus and method for generating a video summary through recursive graph modeling WO2021172674A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020200024805A KR102198480B1 (ko) 2020-02-28 2020-02-28 재귀 그래프 모델링을 통한 비디오 요약 생성 장치 및 방법
KR10-2020-0024805 2020-02-28

Publications (1)

Publication Number Publication Date
WO2021172674A1 true WO2021172674A1 (fr) 2021-09-02

Family

ID=74140905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/010755 WO2021172674A1 (fr) 2020-02-28 2020-08-13 Apparatus and method for generating a video summary through recursive graph modeling

Country Status (2)

Country Link
KR (1) KR102198480B1 (fr)
WO (1) WO2021172674A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102588389B1 (ko) * 2021-08-25 2023-10-11 연세대학교 산학협력단 그래프 인공 신경망 기반 엣지리스 네트워크 임베딩 장치 및 방법
CN113688814B (zh) * 2021-10-27 2022-02-11 武汉邦拓信息科技有限公司 图像识别方法及装置

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6735927B2 (ja) 2017-05-05 2020-08-05 グーグル エルエルシー ビデオコンテンツの要約処理

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190100320A (ko) * 2017-03-08 2019-08-28 텐센트 테크놀로지(센젠) 컴퍼니 리미티드 이미지 처리를 위한 신경망 모델 훈련 방법, 장치 및 저장 매체
KR20180126362A (ko) * 2017-05-17 2018-11-27 삼성전자주식회사 동영상의 초해상 처리 방법 및 이를 위한 영상 처리 장치
US20190012559A1 (en) * 2017-07-06 2019-01-10 Texas Instruments Incorporated Dynamic quantization for deep neural network inference system and method
KR20200015611A (ko) * 2017-08-01 2020-02-12 베이징 센스타임 테크놀로지 디벨롭먼트 컴퍼니 리미티드 시맨틱 분할 모델을 위한 훈련 방법 및 장치, 전자 기기, 저장 매체
US20190303725A1 (en) * 2018-03-30 2019-10-03 Fringefy Ltd. Neural network training system

Also Published As

Publication number Publication date
KR102198480B1 (ko) 2021-01-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20922401

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20922401

Country of ref document: EP

Kind code of ref document: A1