WO2022239239A1 - Dispositif, procédé et programme d'apprentissage - Google Patents

Dispositif, procédé et programme d'apprentissage

Info

Publication number
WO2022239239A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
audio
speech
feature
feature quantity
Prior art date
Application number
PCT/JP2021/018443
Other languages
English (en)
Japanese (ja)
Inventor
康智 大石
邦夫 柏野
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 filed Critical 日本電信電話株式会社
Priority to PCT/JP2021/018443 priority Critical patent/WO2022239239A1/fr
Priority to JP2023520725A priority patent/JPWO2022239239A1/ja
Publication of WO2022239239A1 publication Critical patent/WO2022239239A1/fr

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present invention relates to a learning device, a learning method, and a learning program.
  • With image recognition technology, it is possible to identify various objects in images. Furthermore, there is a known technique for associating visual information with linguistic information by labeling images.
  • There is a known technique that prepares a large amount of paired data consisting of an image and speech describing the content of the image (hereinafter referred to as an audio caption), and associates image regions with sections of the audio caption (hereinafter referred to as speech sections) (see, for example, Non-Patent Document 1).
  • There is also a known technique for acquiring translation knowledge between languages by preparing audio captions in multiple languages that describe the same images (see, for example, Non-Patent Document 2).
  • Furthermore, there is a known technique that multimodally associates objects and events by focusing on the co-occurrence of video and audio streams, using broadcast data and video data from posting sites (see, for example, Non-Patent Document 3).
  • Here, an object is, for example, a thing that appears in the video.
  • An event is, for example, an action of a person or the like appearing in the video.
  • A learning device according to the present invention includes: a video feature quantity calculation unit that calculates a video feature quantity, which is a feature quantity of the videos included in a data set of video and audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space; an audio feature quantity calculation unit that calculates an audio feature quantity, which is a feature quantity of the audio included in the data set, using a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space; and an updating unit that updates the parameters of the models used by the video feature quantity calculation unit and the audio feature quantity calculation unit so as to increase the similarity between the video feature quantity and the audio feature quantity, the similarity being calculated by emphasizing the temporal proximity of predetermined events appearing in the video and the audio.
  • the cost of labeling objects and events in video can be reduced.
  • FIG. 1 is a diagram showing a configuration example of a learning device according to the first embodiment.
  • FIG. 2 is an explanatory diagram illustrating audio captions.
  • FIG. 3 is an explanatory diagram illustrating audio captions.
  • FIG. 4 is a diagram illustrating the temporal proximity of objects and events.
  • FIG. 5 is a network diagram of the entire model.
  • FIG. 6 is a flow chart showing the flow of processing of the learning device according to the first embodiment.
  • FIG. 7 is a diagram illustrating a configuration example of a search device according to the second embodiment.
  • FIG. 8 is a flow chart showing the flow of processing of the search device according to the second embodiment.
  • FIG. 9 is a diagram showing a data set.
  • FIG. 10 is a diagram showing experimental results.
  • FIG. 11 is a diagram showing experimental results.
  • FIG. 12 is a diagram showing the correspondence between video and audio.
  • FIG. 13 is a diagram showing the correspondence between video and audio.
  • FIG. 14 is a diagram showing the correspondence between video and audio.
  • FIG. 15 is a diagram showing the correspondence between video and audio.
  • FIG. 16 is a diagram illustrating an example of a computer that executes a learning program;
  • the learning device uses input learning data to train a video encoder and an audio encoder.
  • the learning device then outputs each trained encoder.
  • the learning device outputs parameters for each encoder.
  • the video in this embodiment means a moving image.
  • the learning data in this embodiment is data that includes video and audio associated with the video.
  • the learning data may be video data for broadcasting, video data from a video posting site, or the like.
  • a video encoder is a model that takes video as input and outputs video feature values.
  • a speech encoder is a model that receives speech as an input and outputs speech features. The learning device optimizes the video encoder and the audio encoder based on the output video feature quantity and audio feature quantity.
  • FIG. 1 is a diagram showing a configuration example of a learning device according to the first embodiment.
  • the learning device 10 has a video feature quantity calculator 111 , an audio feature quantity calculator 121 , a loss function generator 131 and an updater 132 .
  • the learning device 10 also stores video encoder information 112 and audio encoder information 122 .
  • a video 151 and an audio caption 152 are input to the learning device 10 . Also, the learning device 10 can output updated video encoder information 112 and audio encoder information 122 .
  • the audio caption is the audio corresponding to the video.
  • video 151 and audio captions 152 are image data and audio data extracted from the same video data.
  • the audio caption 152 may be a signal in which a person watches the video 151 and records the audio uttered to explain the content.
  • the audio captions 152 may be obtained by using crowdsourcing to show a video 151 to a speaker and record the audio spoken by the speaker to describe the video 151 .
  • FIGS. 2 and 3 are explanatory diagrams explaining audio captions (for details of FIGS. 2 and 3, see Non-Patent Document 1 and Non-Patent Document 2, respectively).
  • As shown in Fig. 2, image regions and audio caption sections are associated with each other. For example, a "mountain" in the image and the word "mountain" in the audio caption are associated, as indicated by the same hatching pattern.
  • the learning device of the present embodiment associates video and audio.
  • modality can be said to be a way of expressing ideas.
  • For the concept of a dog, for example, the modalities include an "image" of a dog, a "voice" uttering the word "dog", and "texts" such as "dog", "Dog", and "DOG".
  • Examples of modalities include images, sounds, videos, and predetermined sensing data.
  • The video feature quantity calculation unit 111 uses a video encoder, which is a model that receives a video as input and outputs a feature quantity obtained by mapping the video into the first embedding space, to calculate a video feature quantity, which is a feature quantity of the videos included in the data set of video and audio pairs.
  • the video encoder information 112 is parameters for constructing a video encoder.
  • The audio feature quantity calculation unit 121 uses an audio encoder, which is a model that receives audio as input and outputs a feature quantity obtained by mapping the audio into the second embedding space, to calculate an audio feature quantity, which is a feature quantity of the audio included in the data set.
  • the speech encoder information 122 is parameters for constructing a speech encoder.
  • The loss function construction unit 131 constructs a loss function based on the similarity between the video feature quantity and the audio feature quantity, the similarity being calculated by emphasizing the temporal proximity of predetermined events (happenings or objects) appearing in the video and the audio.
  • The update unit 132 uses the loss function constructed by the loss function construction unit 131 to update the parameters of the video encoder and the audio encoder.
  • Specifically, the update unit 132 updates the parameters of the models used by the video feature quantity calculation unit 111 and the audio feature quantity calculation unit 121 so as to increase the similarity between the video feature quantity and the audio feature quantity, the similarity being calculated by emphasizing the temporal proximity of predetermined events appearing in the video and the audio.
  • The processing of each part of the learning device 10 will be described in detail with reference to FIGS. 4 and 5.
  • FIG. 4 is a diagram explaining the temporal proximity of objects and events.
  • FIG. 4 shows data for learning based on commentary videos of sumo wrestling.
  • In FIG. 4, the video frames of the video 151 and the Mel spectrogram of the audio caption 152 are arranged in chronological order.
  • In the example of FIG. 4, events occur at frames f102, f103, f105, and f106, and each event has corresponding audio.
  • For example, at frame f102, the event of the start of the bout occurs, and frame f102 corresponds to the utterance "hakkeyoi nokotta" signifying the start of the bout.
  • At frame f106, the event that decides the outcome occurs, and frame f106 corresponds to the utterance "push out" explaining the winning move that decided the bout.
  • FIG. 5 is a network diagram of the entire model. As shown in FIG. 5, the video feature quantity calculation unit 111 extracts, for example, 32 video frames cut out at equal intervals from a 10-second video, and inputs them to the video encoder (Video network) 112a.
  • The video encoder 112a is assumed to be a trained ECO (Non-Patent Document 4). ECO is conventionally used for action recognition. In this embodiment, the last layer of ECO, which is its classifier, is removed, and a 3D ResNet is used in which the feature dimension is set to d without changing the spatial size.
  • The video feature quantity can be understood as the 32 video frames compressed to 8 along the time axis and to 7 × 7 in the spatial size of each frame. Therefore, the video feature quantity can be regarded as having a d-dimensional feature vector at each point of an 8 × 7 × 7 space.
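  • As an illustration of the shapes involved, the following is a minimal sketch in PyTorch, assuming a generic 3D-CNN backbone in place of the trained ECO / 3D ResNet of Non-Patent Document 4; the module name VideoEncoder, the layer sizes, and the 30 fps input clip are illustrative assumptions, not the actual network of this embodiment.

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Illustrative stand-in for the trained ECO / 3D ResNet backbone:
    input is 32 RGB frames of size 224x224, output is a d-dimensional
    feature vector at each point of an 8 x 7 x 7 spatio-temporal grid."""
    def __init__(self, d=1024):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(2, 2, 2), padding=1),
            nn.ReLU(),
            nn.Conv3d(64, 256, kernel_size=3, stride=(2, 4, 4), padding=1),
            nn.ReLU(),
            nn.Conv3d(256, d, kernel_size=3, stride=(1, 4, 4), padding=1),
        )

    def forward(self, frames):           # frames: (batch, 3, 32, 224, 224)
        return self.backbone(frames)     # -> (batch, d, 8, 7, 7)

def sample_frames(clip, num_frames=32):
    """Cut out num_frames frames at equal intervals from a decoded clip
    of shape (3, T, H, W) covering the 10-second segment."""
    idx = torch.linspace(0, clip.shape[1] - 1, num_frames).long()
    return clip[:, idx]

clip = torch.rand(3, 300, 224, 224)            # e.g. 10 s of video at 30 fps
frames = sample_frames(clip).unsqueeze(0)      # (1, 3, 32, 224, 224)
video_feature = VideoEncoder(d=1024)(frames)   # (1, 1024, 8, 7, 7)
```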
  • The audio feature quantity calculation unit 121 performs frequency analysis on a 10-second audio signal with a frame shift of 10 ms and a frame length of 25 ms, applies a 40-channel Mel filter bank, and inputs the resulting Mel filter bank sequence (Mel spectrogram) to the audio encoder (Audio network) 122a.
  • The audio encoder 122a is assumed to be the CNN-based DAVEnet (Non-Patent Document 7). Alternatively, the audio encoder 122a may be ResDAVEnet or a speech feature extractor that introduces a self-attention mechanism (Non-Patent Document 7).
  • a speech feature quantity can be regarded as a d-dimensional feature vector sequence with a compressed time axis.
  • d corresponding to the number of channels is 1024.
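  • The following sketch illustrates this audio front end under stated assumptions: librosa is used for the 40-band Mel spectrogram with a 25 ms window and 10 ms shift, and a small DAVEnet-style CNN stands in for the actual audio encoder of Non-Patent Document 7; the class name AudioEncoder, the layer sizes, the 16 kHz sampling rate, and the file name caption.wav are placeholders for illustration.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def log_mel(wav_path, sr=16000):
    """40-band log-Mel spectrogram with a 25 ms frame length and a
    10 ms frame shift (10 s of audio gives roughly 1000 frames)."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=512, win_length=int(0.025 * sr),
        hop_length=int(0.010 * sr), n_mels=40)
    return np.log(mel + 1e-6)                            # (40, n_frames)

class AudioEncoder(nn.Module):
    """Illustrative DAVEnet-style CNN: the first layer collapses the 40
    Mel bands, later layers downsample the time axis, yielding a
    d-dimensional feature vector sequence."""
    def __init__(self, d=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=(40, 5), stride=(1, 2), padding=(0, 2)),
            nn.ReLU(),
            nn.Conv2d(128, 512, kernel_size=(1, 5), stride=(1, 2), padding=(0, 2)),
            nn.ReLU(),
            nn.Conv2d(512, d, kernel_size=(1, 5), stride=(1, 2), padding=(0, 2)),
        )

    def forward(self, mel):              # mel: (batch, 1, 40, n_frames)
        out = self.net(mel)              # -> (batch, d, 1, n_frames / 8)
        return out.squeeze(2)            # -> (batch, d, T_a)

mel = torch.tensor(log_mel("caption.wav"), dtype=torch.float32)
audio_feature = AudioEncoder(d=1024)(mel[None, None])    # (1, 1024, T_a)
```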
  • the video encoder 112a and the audio encoder 122a are not limited to those described above.
  • the i-th pair data of input video and audio is expressed as in formula (1).
  • the video feature amount and the audio feature amount corresponding to the i-th paired data are expressed as in formula (2).
  • The loss function construction unit 131 constructs a loss function such that each model is trained so that the pair of video feature quantity and audio feature quantity shown in equation (2) is placed close together in a shared latent space (embedding space), while the features of different (unpaired) videos and audios are placed far apart.
  • The loss function construction unit 131 can construct a loss function using the Triplet loss (Non-Patent Document 3) or the Masked Margin Softmax loss (MMS loss) (Non-Patent Document 5). In this embodiment, the loss function construction unit 131 constructs a loss function using the MMS loss.
  • the loss function using MMS loss is as shown in formulas (3), (4) and (5).
  • B in equations (4) and (5) is the batch size. Also, the similarity Z m,n is equal to the left side of equation (6).
  • The loss function of equation (3) calculates the similarities of all video-audio combinations in a batch, and enables the parameters of the video encoder and the audio encoder to be learned so that the similarity of each true pair becomes high.
  • The numerators on the right-hand sides of equations (4) and (5) are terms related to the similarity Z m,m between a paired video feature quantity and audio feature quantity.
  • The denominators include a term related to the similarity Z m,m and terms related to the similarities Z m,n (m ≠ n) between unpaired video feature quantities and audio feature quantities.
  • equation (4) is the loss when considering the degree of similarity based on the video.
  • equation (5) is the loss when the degree of similarity is considered based on speech.
  • Minimizing equations (4) and (5) corresponds to increasing the numerator (higher similarity when paired) and decreasing the denominator (lower similarity when not paired).
  • The margin is a hyperparameter, and it constrains the similarity of each pair to be larger by the amount of the margin.
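  • The following is a minimal sketch of an MMS-style loss over a batch, written under the assumption that a B × B similarity matrix Z (with the true pairs on the diagonal) has already been computed; the margin value and the exact masking in the patent's equations (3) to (5) may differ from this simplified form.

```python
import torch
import torch.nn.functional as F

def mms_loss(Z, delta=1.0):
    """Masked-Margin-Softmax-style loss: Z is a (B, B) matrix whose
    (m, n) entry is the similarity Z_{m,n} between video m and audio n,
    with true pairs on the diagonal; delta is the margin hyperparameter."""
    B = Z.size(0)
    # subtract the margin from the paired (diagonal) similarities only
    Z_margin = Z - delta * torch.eye(B, device=Z.device)
    targets = torch.arange(B, device=Z.device)
    loss_v = F.cross_entropy(Z_margin, targets)        # video-anchored direction
    loss_a = F.cross_entropy(Z_margin.t(), targets)    # audio-anchored direction
    return loss_v + loss_a                             # total loss
```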
  • S i,j (m, n) and G i,j included in equation (6) for calculating the similarity are calculated by equations (7) and (8), respectively.
  • Equation (8) is the guided attention, and has the effect of emphasizing the temporal proximity of objects and events appearing in the video and the audio. Also, σ g is a hyperparameter.
  • Specifically, the loss function construction unit 131 calculates the dot product of the video feature quantity and the audio feature quantity, and obtains a tensor (Audio-Visual Affinity Tensor) indicating the degree of similarity between the video and the audio.
  • The loss function construction unit 131 then obtains the similarity by applying spatial mean pooling to this tensor, weighting the result with the guided attention, and averaging.
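  • A minimal sketch of this similarity computation is shown below, assuming one video feature map of shape (d, 8, 7, 7) and one audio feature sequence of shape (d, T_a); the Gaussian form of the guided attention, the value of sigma_g, and the optional shift parameter (corresponding to equation (9) introduced below) are assumptions chosen for illustration.

```python
import torch

def guided_attention(t_v, t_a, sigma_g=0.3, shift=0.0, device="cpu"):
    """Guided-attention weights: large when the relative positions i/t_v
    (video) and j/t_a (audio) are close in time; `shift` models the lag
    between an event and the commentary describing it."""
    i = torch.arange(t_v, device=device, dtype=torch.float32).unsqueeze(1) / t_v
    j = torch.arange(t_a, device=device, dtype=torch.float32).unsqueeze(0) / t_a
    return torch.exp(-((i - j - shift) ** 2) / (2 * sigma_g ** 2))

def similarity(video_feat, audio_feat, sigma_g=0.3, shift=0.0):
    """Guided-attention-weighted similarity between a video feature map
    (d, 8, 7, 7) and an audio feature sequence (d, T_a)."""
    # audio-visual affinity tensor: dot product over the channel axis d
    affinity = torch.einsum("dthw,da->thwa", video_feat, audio_feat)  # (8, 7, 7, T_a)
    pooled = affinity.mean(dim=(1, 2))                 # spatial mean pooling -> (8, T_a)
    weights = guided_attention(pooled.size(0), pooled.size(1),
                               sigma_g, shift, video_feat.device)
    return (weights * pooled).mean()                   # weighted average -> scalar
```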
  • In commentary video, an event first appears in the video and the announcer then explains the situation by speech, so a time lag arises between the video and the audio.
  • To deal with this, the loss function construction unit 131 may calculate the guided attention using an equation that includes hyperparameters i' and j', such as equation (9).
  • Finally, the loss function construction unit 131 constructs the loss function as shown in equation (10).
  • s(x, y) is the degree of similarity between the video feature quantity x and the audio feature quantity y.
  • the update unit 132 updates the parameters of the video encoder and the audio encoder while decreasing the loss function using the optimization algorithm Adam (Non-Patent Document 6).
  • the updating unit 132 sets the following hyperparameters when executing Adam.
  • Weight Decay: 4 × 10⁻⁵
  • the optimization algorithm used by the updating unit 132 is not limited to Adam, and may be the so-called stochastic gradient descent (SGD), RMSProp, or the like.
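  • For reference, the following fragment shows such an optimizer configuration in PyTorch, reusing the VideoEncoder and AudioEncoder sketches above; only the weight decay value of 4 × 10⁻⁵ comes from the text, while the learning rate and the alternative optimizer settings are assumed values.

```python
import torch

video_encoder, audio_encoder = VideoEncoder(d=1024), AudioEncoder(d=1024)
params = list(video_encoder.parameters()) + list(audio_encoder.parameters())

# weight decay 4e-5 as stated above; the learning rate is an assumption
optimizer = torch.optim.Adam(params, lr=1e-4, weight_decay=4e-5)

# alternatives mentioned above (hyperparameters again assumed):
# optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)
# optimizer = torch.optim.RMSprop(params, lr=1e-4)
```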
  • the updating unit 132 updates the parameters of each model so that the degree of similarity between the video feature amount and the audio feature amount, which is the degree of similarity with the proximity emphasized by the guided attention, increases.
  • the weight of the guided attention increases as the corresponding time between the elements of the video feature amount and the audio feature amount becomes closer.
  • That is, the update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the values derived from the products of elements of the video feature quantity and the audio feature quantity by weights that become larger as the times corresponding to those elements become closer.
  • the updating unit 132 may update the parameters based on the loss function using guided attention calculated by Equation (9).
  • In that case, the update unit 132 multiplies the values obtained from the products of elements of the video feature quantity and the audio feature quantity by weights that become larger as the times corresponding to those elements, with a predetermined shift applied, become closer.
  • FIG. 6 is a flow chart showing the flow of processing of the learning device according to the first embodiment. As shown in FIG. 6, first, a data set of a video and a pair of audio captions corresponding to the video is input to the learning device 10 (step S101).
  • the learning device 10 uses a video encoder to calculate a d-dimensional video feature vector from the video (step S102).
  • the learning device 10 uses a speech encoder to calculate a d-dimensional speech feature vector from the speech caption (step S103).
  • the learning device 10 constructs a loss function based on the degree of similarity between the video feature vector and the audio feature vector considering the guided attention (step S104).
  • the learning device 10 updates the parameters of each encoder so that the loss function is optimized (step S105).
  • When the termination condition is satisfied (step S106, Yes), the learning device 10 ends the process.
  • When the termination condition is not satisfied (step S106, No), the learning device 10 returns to step S102 and repeats the process.
  • Examples of the termination condition include the amount of parameter update becoming equal to or less than a threshold, and the number of parameter updates exceeding a certain number.
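  • A compact sketch of this training loop (steps S101 to S106) is given below; it reuses the VideoEncoder, AudioEncoder, similarity, and mms_loss sketches above, and the data loader, batch construction, and fixed update limit are assumptions rather than details of this embodiment.

```python
import torch

def train(video_encoder, audio_encoder, loader, optimizer, max_updates=100000):
    """Steps S101-S106: encode each batch, build the pairwise similarity
    matrix with the guided attention, minimize the MMS-style loss, and
    stop after a fixed number of parameter updates."""
    for step, (frames, mels) in enumerate(loader):          # S101: paired batch
        v = video_encoder(frames)                           # S102: (B, d, 8, 7, 7)
        a = audio_encoder(mels)                             # S103: (B, d, T_a)
        B = v.size(0)
        Z = torch.stack([torch.stack([similarity(v[m], a[n]) for n in range(B)])
                         for m in range(B)])                # (B, B) similarities
        loss = mms_loss(Z)                                  # S104: loss function
        optimizer.zero_grad()
        loss.backward()                                     # S105: update parameters
        optimizer.step()
        if step + 1 >= max_updates:                         # S106: termination
            break
```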
  • As described above, the video feature quantity calculation unit 111 uses a model that takes a video as input and outputs a feature quantity obtained by mapping the video into the first embedding space, to calculate a video feature quantity that is a feature quantity of the videos included in the data set of video and audio pairs.
  • The audio feature quantity calculation unit 121 uses an audio encoder, which is a model that receives audio as input and outputs a feature quantity obtained by mapping the audio into the second embedding space, to calculate an audio feature quantity, which is a feature quantity of the audio included in the data set.
  • The update unit 132 updates the parameters of the models used by the video feature quantity calculation unit 111 and the audio feature quantity calculation unit 121 so as to increase the similarity between the video feature quantity and the audio feature quantity, the similarity being calculated by emphasizing the temporal proximity of predetermined events appearing in the video and the audio.
  • the event frame in the video and the audio corresponding to the event have the property of being close in time.
  • the learning device 10 can effectively learn a model by emphasizing such properties. As a result, according to this embodiment, the cost for labeling objects and events in video can be reduced.
  • the updating unit 132 updates the parameters of each model so that the degree of similarity between the video feature amount and the audio feature amount, which is the degree of similarity with the proximity emphasized by the guided attention, increases.
  • With the guided attention, it is possible to weight each element of the feature quantity represented by the tensor.
  • The update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the values derived from the products of elements of the video feature quantity and the audio feature quantity by weights that become larger as the times corresponding to those elements become closer. This makes it possible to emphasize the temporal proximity between the audio and events in the video, such as happenings and objects.
  • The update unit 132 may also multiply the values obtained from the products of elements of the video feature quantity and the audio feature quantity by weights that become larger as the times corresponding to those elements, with a predetermined shift applied, become closer, and update the parameters of each model so that the resulting similarity increases. This makes it possible to cope with the time lag between an event or object and the corresponding audio.
  • Cross-modal search is a search across different forms (modalities) of data.
  • Examples of cross-modal search include searching for video from audio, searching for audio from video, and searching for audio in one language from audio in another language.
  • the same reference numerals are given to the parts having the same functions as those of the already described embodiments, and the description thereof will be omitted as appropriate.
  • FIG. 7 is a diagram illustrating a configuration example of a search device according to the second embodiment.
  • the search device 20 has a video feature quantity calculator 211 , an audio feature quantity calculator 221 , and a searcher 232 .
  • the search device 20 also stores video encoder information 212 , audio encoder information 222 and video feature amount information 231 .
  • Video and audio captions are input to the search device 20 .
  • the audio caption input to search device 20 is the query for the search.
  • the searching device 20 outputs a video obtained by searching as a search result.
  • The video feature quantity calculation unit 211 receives a video as input and calculates a video feature quantity, in the same manner as the video feature quantity calculation unit 111 of the learning device 10.
  • The video encoder information 212 is information of a video encoder that has already been trained by the method described in the first embodiment.
  • The calculated video feature quantities are accumulated as the video feature quantity information 231.
  • The search unit 232 searches the accumulated video feature quantity information 231 for a video feature quantity similar to the audio feature quantity calculated by the audio feature quantity calculation unit 221.
  • The similarity calculation method is the same as in the first embodiment.
  • FIG. 8 is a flow chart showing the flow of processing of the search device according to the second embodiment. As shown in FIG. 8, first, a plurality of videos and a query voice are input to the searching device 20 (step S201).
  • the search device 20 uses a video encoder to calculate a d-dimensional video feature vector from each video (step S202).
  • the search device 20 calculates a d-dimensional audio feature vector from the audio caption using the audio encoder (step S203).
  • the search device 20 searches for a video feature vector similar to the audio feature vector (step S204).
  • the search device 20 outputs a video corresponding to the video feature vector obtained by the search (step S205).
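  • The following sketch illustrates this search procedure (steps S202 to S205), reusing the trained encoders and the similarity sketch from the first embodiment; pre-computing and accumulating the video feature quantities corresponds to the video feature quantity information 231, and the top_k value is an assumption.

```python
import torch

def build_index(video_encoder, videos):
    """Accumulate video feature quantities (video feature information 231)."""
    with torch.no_grad():
        return [video_encoder(v.unsqueeze(0))[0] for v in videos]   # each (d, 8, 7, 7)

def search(audio_encoder, query_mel, video_index, top_k=5):
    """Steps S202-S205: encode the query speech and rank the accumulated
    video feature quantities by the guided-attention similarity."""
    with torch.no_grad():
        a = audio_encoder(query_mel.unsqueeze(0))[0]                 # (d, T_a)
        scores = torch.stack([similarity(f, a) for f in video_index])
        ranked = scores.topk(min(top_k, len(video_index))).indices
    return ranked          # indices of retrieved videos, best match first
```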
  • FIG. 9 is a diagram showing a data set.
  • Video and audio streams of 10 seconds in total, 5 seconds before and 5 seconds after the moment the bout was decided, were used.
  • Recall@N was used as the evaluation metric. For example, when searching the 90 items of evaluation data for the audio paired with the feature vector of a certain video query, the similarity with each audio feature vector is calculated and ranked, and the top five are determined. If the audio paired with the query is included in these top five, the search is successful. The same process is performed for all 90 videos of the evaluation data, and the proportion for which the paired audio is included in the top five is defined as Recall@5. Recall@3 and Recall@1 were calculated in the same way.
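  • A sketch of this evaluation is shown below; it assumes a square similarity matrix over the 90 evaluation pairs, built with the similarity sketch from the first embodiment, with each query's true pair on the diagonal.

```python
import torch

def recall_at_n(Z, n=5):
    """Recall@N: Z is the (num_queries, num_items) similarity matrix with
    the true pair of each query on the diagonal. Returns the fraction of
    queries whose paired item appears among the top-N ranked candidates."""
    topn = Z.topk(n, dim=1).indices                       # (num_queries, n)
    targets = torch.arange(Z.size(0)).unsqueeze(1)
    hits = (topn == targets).any(dim=1).float()
    return hits.mean().item()

# usage on the 90-pair evaluation set described above:
# print(recall_at_n(Z, 5), recall_at_n(Z, 3), recall_at_n(Z, 1))
```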
  • FIG. 11 shows the results re-tabulated for each winning move.
  • FIG. 11 is a diagram showing experimental results.
  • FIGS. 12 and 13 are diagrams showing the correspondence between video and audio.
  • the graphs of FIGS. 12 and 13 visualize the equation (7).
  • FIG. 12 corresponds to this embodiment (Proposed). Also, FIG. 13 corresponds to the prior art (Baseline).
  • FIGS. 14 and 15 are diagrams showing the correspondence between video and audio in a scene different from that of FIGS. 12 and 13.
  • FIGS. 14 and 15 also show the same tendency as FIGS. 12 and 13.
  • each component of each device illustrated is functionally conceptual, and does not necessarily need to be physically configured as illustrated.
  • The specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units.
  • all or any part of each processing function performed by each device can be realized by a CPU and a program analyzed and executed by the CPU, or realized as hardware by wired logic.
  • the learning device 10 and the searching device 20 can be implemented by installing a program for executing the above-described learning processing or searching processing as package software or online software on a desired computer.
  • the information processing device can function as the learning device 10 or the searching device 20 by causing the information processing device to execute the above program.
  • the information processing apparatus referred to here includes a desktop or notebook personal computer.
  • information processing devices include smart phones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone Systems), and slate terminals such as PDAs (Personal Digital Assistants).
  • the learning device 10 and the searching device 20 can be implemented as a server device that uses a terminal device used by a user as a client and provides the client with services related to the above-described learning processing or search processing.
  • For example, the server device is implemented as a server device that provides a service in which learning data is received as input and information on the trained encoders is provided as output.
  • the server device may be implemented as a web server, or may be implemented as a cloud that provides services related to the above processing by outsourcing.
  • FIG. 16 is a diagram showing an example of a computer that executes a learning program. Note that the search program may also be executed by a similar computer.
  • the computer 1000 has a memory 1010 and a CPU 1020, for example.
  • Computer 1000 also has hard disk drive interface 1030 , disk drive interface 1040 , serial port interface 1050 , video adapter 1060 and network interface 1070 . These units are connected by a bus 1080 .
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1090 .
  • a disk drive interface 1040 is connected to the disk drive 1100 .
  • a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100 .
  • Serial port interface 1050 is connected to mouse 1110 and keyboard 1120, for example.
  • Video adapter 1060 is connected to display 1130, for example.
  • the hard disk drive 1090 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. That is, a program that defines each process of the learning device 10 is implemented as a program module 1093 in which computer-executable code is described. Program modules 1093 are stored, for example, on hard disk drive 1090 .
  • the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functional configuration of the learning device 10 .
  • the hard disk drive 1090 may be replaced by an SSD.
  • the setting data used in the processing of the above-described embodiment is stored as program data 1094 in the memory 1010 or the hard disk drive 1090, for example. Then, the CPU 1020 reads the program modules 1093 and program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary, and executes the processes of the above-described embodiments.
  • CPU 1020 may be programmed to perform the processes of the above embodiments in conjunction with the memory.
  • program modules 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090.
  • the program modules 1093 and program data 1094 are stored in a detachable non-temporary computer-readable storage medium and stored via the disk drive 1100 or the like. It may be read by CPU 1020 .
  • the program modules 1093 and program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Program modules 1093 and program data 1094 may then be read by CPU 1020 through network interface 1070 from other computers.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A video feature quantity calculation unit (111) calculates a video feature quantity, that is, a feature quantity of the videos included in a data set of video and speech pairs, using a model that receives a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space. A speech feature quantity calculation unit (121) calculates a speech feature quantity, that is, a feature quantity of the speech included in the data set, using a speech encoder, which is a model that receives speech as input and outputs a feature quantity obtained by mapping the speech into a second embedding space. An update unit (132) updates the parameters of the models used by the video feature quantity calculation unit (111) and the speech feature quantity calculation unit (121) so as to increase the similarity between the video feature quantity and the speech feature quantity, the similarity being calculated by emphasizing the temporal proximity of predetermined events appearing in the video and the speech.
PCT/JP2021/018443 2021-05-14 2021-05-14 Dispositif, procédé et programme d'apprentissage WO2022239239A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/018443 WO2022239239A1 (fr) 2021-05-14 2021-05-14 Dispositif, procédé et programme d'apprentissage
JP2023520725A JPWO2022239239A1 (fr) 2021-05-14 2021-05-14

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/018443 WO2022239239A1 (fr) 2021-05-14 2021-05-14 Dispositif, procédé et programme d'apprentissage

Publications (1)

Publication Number Publication Date
WO2022239239A1 true WO2022239239A1 (fr) 2022-11-17

Family

ID=84028969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/018443 WO2022239239A1 (fr) 2021-05-14 2021-05-14 Dispositif, procédé et programme d'apprentissage

Country Status (2)

Country Link
JP (1) JPWO2022239239A1 (fr)
WO (1) WO2022239239A1 (fr)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HARWATH, DAVID; TORRALBA, ANTONIO; GLASS, JAMES R.: "Unsupervised Learning of Spoken Language with Visual Context", Advances in Neural Information Processing Systems 29 (NIPS 2016), 1 December 2016, pages 1-10, XP093003973. Retrieved from the Internet: <URL:https://dspace.mit.edu/bitstream/handle/1721.1/124455/6186-unsupervised-learning-of-spoken-language-with-visual-context.pdf?sequence=2&isAllowed=y> [retrieved on 2022-12-01] *

Also Published As

Publication number Publication date
JPWO2022239239A1 (fr) 2022-11-17

Similar Documents

Publication Publication Date Title
CN109840287B (zh) 一种基于神经网络的跨模态信息检索方法和装置
CN108009228B (zh) 一种内容标签的设置方法、装置及存储介质
CN108986186B (zh) 文字转化视频的方法和系统
CN108305643B (zh) 情感信息的确定方法和装置
WO2021037113A1 (fr) Procédé et appareil de description d&#39;image, dispositif informatique, et support de stockage
US9704480B2 (en) Information processing apparatus, method for processing information, and program
CN110264991A (zh) 语音合成模型的训练方法、语音合成方法、装置、设备及存储介质
WO2021129439A1 (fr) Procédé de reconnaissance vocale et produit associé
JP2020174342A (ja) 映像を生成するための方法、装置、サーバ、コンピュータ可読記憶媒体およびコンピュータプログラム
JP2022158735A (ja) 学習装置、学習方法、学習プログラム、探索装置、探索方法及び探索プログラム
CN114339450B (zh) 视频评论生成方法、系统、设备及存储介质
CN105224581A (zh) 在播放音乐时呈现图片的方法和装置
CN114143479B (zh) 视频摘要的生成方法、装置、设备以及存储介质
CN111767431A (zh) 用于视频配乐的方法和装置
CN113051468A (zh) 一种基于知识图谱和强化学习的电影推荐方法及系统
JP6917210B2 (ja) 要約映像生成装置およびそのプログラム
CN106021413B (zh) 基于主题模型的自展式特征选择方法及系统
JP7226320B2 (ja) 情報処理装置、情報処理方法及びプログラム
CN109543041B (zh) 一种语言模型得分的生成方法及装置
CN113297387B (zh) 一种基于nkd-gnn的图文不匹配新闻检测方法
JP2022158736A (ja) 学習装置、学習方法及び学習プログラム
JP2018084627A (ja) 言語モデル学習装置およびそのプログラム
CN112199954B (zh) 基于语音语义的疾病实体匹配方法、装置及计算机设备
US10978076B2 (en) Speaker retrieval device, speaker retrieval method, and computer program product
WO2022239239A1 (fr) Dispositif, procédé et programme d'apprentissage

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21941969

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023520725

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21941969

Country of ref document: EP

Kind code of ref document: A1