WO2022239239A1 - Learning device, learning method, and learning program - Google Patents

Learning device, learning method, and learning program Download PDF

Info

Publication number
WO2022239239A1
WO2022239239A1 (PCT/JP2021/018443)
Authority
WO
WIPO (PCT)
Prior art keywords
video
audio
speech
feature
feature quantity
Prior art date
Application number
PCT/JP2021/018443
Other languages
French (fr)
Japanese (ja)
Inventor
康智 大石
邦夫 柏野
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to JP2023520725A priority Critical patent/JPWO2022239239A1/ja
Priority to PCT/JP2021/018443 priority patent/WO2022239239A1/en
Publication of WO2022239239A1 publication Critical patent/WO2022239239A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present invention relates to a learning device, a learning method, and a learning program.
  • according to image recognition technology, it is possible to identify various objects in images. Furthermore, there is a known technique for associating visual information with linguistic information by labeling images.
  • there is a known technique that prepares a large amount of paired data consisting of an image and speech describing the content of the image (hereinafter referred to as a speech caption) and associates image regions with speech caption sections (hereinafter referred to as speech sections) (see, for example, Non-Patent Document 1).
  • there is also a known technique for acquiring translation knowledge between languages by preparing speech captions in multiple languages that describe the same images (see, for example, Non-Patent Document 2).
  • furthermore, there is a known technique that multimodally associates objects and events by focusing on the co-occurrence of video and audio streams in broadcast data and video data from posting sites (see, for example, Non-Patent Document 3).
  • an object is, for example, a physical object that appears in the video.
  • an event is, for example, an action of a person or the like in the video.
  • a learning device according to the present invention includes: a video feature quantity calculation unit that calculates a video feature quantity, which is a feature quantity of each video included in a data set of video-audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space; an audio feature quantity calculation unit that calculates an audio feature quantity, which is a feature quantity of each audio sample included in the data set, using a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space; and an update unit that updates the parameters of the models used by the video feature quantity calculation unit and the audio feature quantity calculation unit so that the similarity between the video feature quantity and the audio feature quantity, calculated by emphasizing the temporal proximity of predetermined events appearing in the video and the audio, becomes larger.
  • the cost of labeling objects and events in video can be reduced.
  • FIG. 1 is a diagram showing a configuration example of a learning device according to the first embodiment.
  • FIG. 2 is an explanatory diagram illustrating audio captions.
  • FIG. 3 is an explanatory diagram illustrating audio captions.
  • FIG. 4 is a diagram illustrating the temporal proximity of objects and events.
  • FIG. 5 is a network diagram of the entire model.
  • FIG. 6 is a flow chart showing the flow of processing of the learning device according to the first embodiment.
  • FIG. 7 is a diagram illustrating a configuration example of a search device according to the second embodiment.
  • FIG. 8 is a flow chart showing the flow of processing of the search device according to the second embodiment.
  • FIG. 9 is a diagram showing a data set.
  • FIG. 10 is a diagram showing experimental results.
  • FIG. 11 is a diagram showing experimental results.
  • FIG. 12 is a diagram showing the correspondence between video and audio.
  • FIG. 13 is a diagram showing the correspondence between video and audio.
  • FIG. 14 is a diagram showing the correspondence between video and audio.
  • FIG. 15 is a diagram showing the correspondence between video and audio.
  • FIG. 16 is a diagram illustrating an example of a computer that executes a learning program.
  • the learning device uses input learning data to train a video encoder and an audio encoder.
  • the learning device then outputs each trained encoder.
  • the learning device outputs parameters for each encoder.
  • the video in this embodiment means a moving image.
  • the learning data in this embodiment is data that includes video and audio associated with the video.
  • the learning data may be video data for broadcasting, video data from a video posting site, or the like.
  • a video encoder is a model that takes video as input and outputs video feature values.
  • a speech encoder is a model that receives speech as an input and outputs speech features. The learning device optimizes the video encoder and the audio encoder based on the output video feature quantity and audio feature quantity.
  • FIG. 1 is a diagram showing a configuration example of a learning device according to the first embodiment.
  • the learning device 10 has a video feature quantity calculator 111 , an audio feature quantity calculator 121 , a loss function generator 131 and an updater 132 .
  • the learning device 10 also stores video encoder information 112 and audio encoder information 122 .
  • a video 151 and an audio caption 152 are input to the learning device 10 . Also, the learning device 10 can output updated video encoder information 112 and audio encoder information 122 .
  • the audio caption is the audio corresponding to the video.
  • video 151 and audio captions 152 are image data and audio data extracted from the same video data.
  • the audio caption 152 may be a signal in which a person watches the video 151 and records the audio uttered to explain the content.
  • the audio captions 152 may be obtained by using crowdsourcing to show a video 151 to a speaker and record the audio spoken by the speaker to describe the video 151 .
  • FIGS. 2 and 3 are explanatory diagrams explaining audio captions (for details of FIGS. 2 and 3, see Non-Patent Document 1 and Non-Patent Document 2, respectively).
  • in Fig. 2, image regions and audio caption sections are associated. For example, a "mountain" in the image and "mountain" in the audio caption are associated by the same hatching pattern.
  • the learning device of the present embodiment associates video and audio.
  • modality can be said to be a way of expressing ideas.
  • for example, for the concept of a dog, modalities include an "image" showing a dog, "speech" uttering the word for dog, and "text" written in different forms (in Japanese, "いぬ", "犬", and "イヌ").
  • in the case of English, "dog", "Dog", and "DOG" correspond to modalities.
  • Examples of modalities include images, sounds, videos, and predetermined sensing data.
  • the video feature quantity calculation unit 111 uses a video encoder, which is a model that receives a video as input and outputs a feature quantity obtained by mapping the video into the first embedding space, to calculate a video feature quantity, which is a feature quantity of each video included in the data set of video-audio pairs.
  • the video encoder information 112 is parameters for constructing a video encoder.
  • the speech feature amount calculation unit 121 uses a speech encoder, which is a model that receives speech as input and outputs a feature quantity obtained by mapping the speech into the second embedding space, to calculate a speech feature quantity, which is a feature quantity of each speech sample included in the data set.
  • the speech encoder information 122 is parameters for constructing a speech encoder.
  • the loss function constructing unit 131 constructs a loss function based on the similarity between the video feature quantity and the audio feature quantity, where the similarity is calculated by emphasizing the temporal proximity of predetermined phenomena (events or objects) appearing in the video and the audio.
  • the update unit 132 uses the loss function configured by the loss function configuration unit 131 to update the parameters of the video encoder and the audio encoder.
  • that is, the update unit 132 updates the parameters of the models used by the video feature amount calculation unit 111 and the audio feature amount calculation unit 121 so that the similarity between the video feature quantity and the audio feature quantity, calculated by emphasizing the temporal proximity of predetermined phenomena appearing in the video and the audio, becomes larger.
  • the processing of each part of the learning device 10 will be described in detail with reference to FIGS. 4 and 5.
  • FIG. 4 is a diagram explaining the temporal proximity of objects and events.
  • FIG. 4 shows data for learning based on commentary videos of sumo wrestling.
  • in FIG. 4, the frames of the video 151 (video frames) and the Mel spectrogram of the audio caption 152 are arranged in chronological order.
  • for example, events occur in frames f102, f103, f105, and f106, and there is audio corresponding to each event that occurred.
  • for example, at frame f102, an event corresponding to the start of the match occurs, and frame f102 corresponds to the speech "Hakkeyoi nokotta" signifying the start of the match.
  • at frame f106, an event determining the outcome of the match occurs, and frame f106 corresponds to the speech "push out" describing the winning technique that decided the match.
  • FIG. 5 is a network diagram of the entire model. As shown in FIG. 5, the video feature amount calculation unit 111 inputs, for example, a sequence of 32 video frames cut out at equal intervals from a 10-second video into the video encoder (Video network) 112a.
  • the video encoder 112a is assumed to be a trained ECO (Non-Patent Document 4). ECO is conventionally used for action recognition. In this embodiment, the final layer of ECO, which serves as its classifier, is removed, and a 3D ResNet that sets the number of feature dimensions to d without changing the spatial size is appended.
  • the video feature amount can be said to be 32 video frames compressed to 8 on the time axis and 7 ⁇ 7 on the spatial size of the frame. Therefore, the video feature amount can be regarded as a feature amount having a d-dimensional feature vector at each point in an 8 ⁇ 7 ⁇ 7 space.
  • the audio feature amount calculation unit 121 performs frequency analysis with a frame shift of 10 ms and a frame length of 25 ms on a 10-second audio file, applies 40 Mel filter banks, and inputs the resulting Mel filter bank sequence (Mel spectrogram) into the audio encoder (Audio network) 122a.
  • the speech encoder 122a is assumed to be the CNN-based DAVEnet (Non-Patent Document 7). The speech encoder 122a may also be ResDAVEnet or a speech feature extractor that introduces a self-attention mechanism (Non-Patent Document 7).
  • a speech feature quantity can be regarded as a d-dimensional feature vector sequence with a compressed time axis.
  • d corresponding to the number of channels is 1024.
  • the video encoder 112a and the audio encoder 122a are not limited to those described above.
  • the i-th pair data of input video and audio is expressed as in formula (1).
  • the video feature amount and the audio feature amount corresponding to the i-th paired data are expressed as in formula (2).
  • the loss function constructing unit 131 constructs a loss function such that each model is trained so that the pair of video feature quantity and audio feature quantity shown in equation (2) are placed close to each other in a shared latent space (embedding space) and far from non-corresponding videos and audio.
  • the loss function configuration unit 131 can configure a loss function using Triplet loss (Non-Patent Document 3) or Masked Margin Softmax (MMS loss) (Non-Patent Document 5). In this embodiment, the loss function configuration unit 131 configures a loss function using MMS loss.
  • the loss function using MMS loss is as shown in formulas (3), (4) and (5).
  • B in formulas (4) and (5) is the batch size. Also, the similarity Z m,n is equal to the left-hand side of equation (6).
  • the loss function of equation (3) can be said to be a loss function that calculates the similarity of every video-audio pair in a batch and trains the parameters of the video encoder and the audio encoder so that the similarity of each true pair becomes high.
  • the numerators on the right-hand sides of equations (4) and (5) are terms related to the similarity Z m,m of a paired video feature quantity and audio feature quantity.
  • the denominators include a term related to the similarity Z m,m and terms related to the similarities Z m,n (m ≠ n) of non-corresponding (unpaired) video and audio feature quantities.
  • equation (4) is the loss when considering the degree of similarity based on the video.
  • equation (5) is the loss when the degree of similarity is considered based on speech.
  • Minimizing equations (4) and (5) corresponds to increasing the numerator (higher similarity when paired) and decreasing the denominator (lower similarity when not paired).
  • δ is a hyperparameter representing a margin, and it constrains the similarity of a true pair to be larger by an additional δ.
  • S i,j (m,n) and G i,j , which are included in equation (6) for calculating the similarity, are calculated by equations (7) and (8), respectively.
  • Equation (8) is guided attention, and has the effect of emphasizing the temporal proximity of objects and events appearing in video and audio. Also, ⁇ g is a hyperparameter.
  • the loss function constructing unit 131 calculates the dot product of the video feature amount and the audio feature amount, and obtains a tensor (Audio Visual Affinity Tensor) indicating the degree of similarity between the video and the audio.
  • furthermore, the loss function constructing unit 131 obtains the similarity by applying spatial mean pooling to the tensor, weighting the result with guided attention, and then averaging.
  • for example, in a live sports broadcast, the announcer describes an object or event with audio only after it has occurred in the video, resulting in a time lag.
  • the loss function configuration unit 131 may calculate guided attention using a formula including hyperparameters i' and j', such as formula (9).
  • the loss function configuration unit 131 configures the loss function as shown in equation (10).
  • s(x, y) is the degree of similarity between the video feature quantity x and the audio feature quantity y.
  • the update unit 132 updates the parameters of the video encoder and the audio encoder while decreasing the loss function using the optimization algorithm Adam (Non-Patent Document 6).
  • the updating unit 132 sets the following hyperparameters when executing Adam.
  • Weight decay: 4×10⁻⁵, initial learning rate: 0.001, β1 = 0.95, β2 = 0.99
  • the optimization algorithm used by the updating unit 132 is not limited to Adam, and may be the so-called stochastic gradient descent (SGD), RMSProp, or the like.
  • the updating unit 132 updates the parameters of each model so that the degree of similarity between the video feature amount and the audio feature amount, which is the degree of similarity with the proximity emphasized by the guided attention, increases.
  • the weight of the guided attention increases as the corresponding time between the elements of the video feature amount and the audio feature amount becomes closer.
  • the update unit 132 multiplies the value obtained from the product of the elements of the video feature amount and the audio feature amount by a weight that increases as the time corresponding to each element approaches, so that the degree of similarity obtained increases. , to update the parameters of each model.
  • the updating unit 132 may update the parameters based on the loss function using guided attention calculated by Equation (9).
  • in this case, the update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the values obtained from the products of elements of the video feature amount and the audio feature amount by weights that become larger as the times corresponding to the elements, shifted by a predetermined offset, become closer.
  • FIG. 6 is a flow chart showing the flow of processing of the learning device according to the first embodiment. As shown in FIG. 6, first, a data set of a video and a pair of audio captions corresponding to the video is input to the learning device 10 (step S101).
  • the learning device 10 uses a video encoder to calculate a d-dimensional video feature vector from the video (step S102).
  • the learning device 10 uses a speech encoder to calculate a d-dimensional speech feature vector from the speech caption (step S103).
  • the learning device 10 constructs a loss function based on the degree of similarity between the video feature vector and the audio feature vector considering the guided attention (step S104).
  • the learning device 10 updates the parameters of each encoder so that the loss function is optimized (step S105).
  • at this time, if the termination condition is satisfied (step S106, Yes), the learning device 10 terminates the processing; if the termination condition is not satisfied (step S106, No), the learning device 10 returns to step S102 and repeats the processing.
  • for example, the termination condition is that the amount of parameter update has become equal to or less than a threshold, or that the number of parameter updates has reached a certain number.
  • as described above, the video feature quantity calculation unit 111 uses a model that takes a video as input and outputs a feature quantity obtained by mapping the video into the first embedding space to calculate a video feature quantity, which is a feature quantity of each video included in the data set of video-audio pairs.
  • the speech feature amount calculation unit 121 uses a speech encoder, which is a model that receives speech as input and outputs a feature quantity obtained by mapping the speech into the second embedding space, to calculate a speech feature quantity, which is a feature quantity of each speech sample included in the data set.
  • the update unit 132 updates the parameters of the models used by the video feature quantity calculation unit 111 and the speech feature amount calculation unit 121 so that the similarity between the video feature quantity and the speech feature quantity, calculated by emphasizing the temporal proximity of predetermined phenomena appearing in the video and the audio, becomes larger.
  • the event frame in the video and the audio corresponding to the event have the property of being close in time.
  • the learning device 10 can effectively learn a model by emphasizing such properties. As a result, according to this embodiment, the cost for labeling objects and events in video can be reduced.
  • the updating unit 132 updates the parameters of each model so that the degree of similarity between the video feature amount and the audio feature amount, which is the degree of similarity with the proximity emphasized by the guided attention, increases.
  • with guided attention, it is possible to weight each element of the feature quantity represented as a tensor.
  • the update unit 132 multiplies the value obtained from the product of the elements of the video feature amount and the audio feature amount by a weight that increases as the time corresponding to each element is closer, so that the degree of similarity obtained increases. Update model parameters. This makes it possible to emphasize the temporal proximity between audio and events in video such as events and objects.
  • furthermore, the updating unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the values obtained from the products of elements of the video feature amount and the audio feature amount by weights that become larger as the times corresponding to the elements, shifted by a predetermined offset, become closer. This makes it possible to cope with the time lag between events or objects and the corresponding audio.
  • cross-modal search refers to searching across data of different forms.
  • examples of cross-modal search include searching for video from audio, searching for audio from video, and searching for audio in one language from audio in another language.
  • the same reference numerals are given to the parts having the same functions as those of the already described embodiments, and the description thereof will be omitted as appropriate.
  • FIG. 7 is a diagram illustrating a configuration example of a search device according to the second embodiment.
  • the search device 20 has a video feature quantity calculator 211 , an audio feature quantity calculator 221 , and a searcher 232 .
  • the search device 20 also stores video encoder information 212 , audio encoder information 222 and video feature amount information 231 .
  • Video and audio captions are input to the search device 20 .
  • the audio caption input to search device 20 is the query for the search.
  • the searching device 20 outputs a video obtained by searching as a search result.
  • the video feature amount calculation unit 211 receives a video as input and calculates video feature amounts, similar to the video feature amount calculation unit 111 of the learning device 10.
  • the video encoder information 212 has already been trained by the method described in the first embodiment.
  • the search device 20 accumulates the calculated video feature amounts as the video feature amount information 231.
  • the search unit 232 searches the accumulated video feature amount information 231 for a video feature amount similar to the audio feature amount calculated by the audio feature amount calculation unit 221.
  • the similarity calculation method is the same as in the first embodiment.
  • FIG. 8 is a flow chart showing the flow of processing of the search device according to the second embodiment. As shown in FIG. 8, first, a plurality of videos and a query voice are input to the searching device 20 (step S201).
  • the search device 20 uses a video encoder to calculate a d-dimensional video feature vector from each video (step S202).
  • the search device 20 calculates a d-dimensional audio feature vector from the audio caption using the audio encoder (step S203).
  • the search device 20 searches for a video feature vector similar to the audio feature vector (step S204).
  • the search device 20 outputs a video corresponding to the video feature vector obtained by the search (step S205).
  • FIG. 9 is a diagram showing a data set.
  • video and audio streams of 10 seconds in total, 5 seconds before and after the moment the match was decided, were used.
  • Recall@N was used as the evaluation metric. For example, when searching the 90 evaluation samples for the audio that forms a pair with a given video query, the similarity between the video feature vector and each audio feature vector is calculated, the audio samples are ranked, and the top five are determined. If the audio paired with the query is included in these top five, the search is considered successful. The same process is performed for all 90 videos in the evaluation data, and the proportion of queries for which the paired audio is included in the top five is defined as Recall@5. Recall@3 and Recall@1 were calculated similarly.
  • FIG. 11 shows the results re-aggregated for each winning technique.
  • FIG. 11 is a diagram showing experimental results.
  • FIGS. 12 and 13 are diagrams showing the correspondence between video and audio.
  • the graphs of FIGS. 12 and 13 visualize the equation (7).
  • FIG. 12 corresponds to this embodiment (Proposed). Also, FIG. 13 corresponds to the prior art (Baseline).
  • FIGS. 14 and 15 are diagrams showing the correspondence between video and audio in scenes different from those in FIGS. 12 and 13.
  • FIGS. 14 and 15 also show the same tendency as FIGS. 12 and 13.
  • each component of each device illustrated is functionally conceptual, and does not necessarily need to be physically configured as illustrated.
  • the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated.
  • all or any part of each processing function performed by each device can be realized by a CPU and a program analyzed and executed by the CPU, or realized as hardware by wired logic.
  • the learning device 10 and the searching device 20 can be implemented by installing a program for executing the above-described learning processing or searching processing as package software or online software on a desired computer.
  • the information processing device can function as the learning device 10 or the searching device 20 by causing the information processing device to execute the above program.
  • the information processing apparatus referred to here includes a desktop or notebook personal computer.
  • information processing devices include smart phones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone Systems), and slate terminals such as PDAs (Personal Digital Assistants).
  • the learning device 10 and the searching device 20 can be implemented as a server device that uses a terminal device used by a user as a client and provides the client with services related to the above-described learning processing or search processing.
  • for example, the server device is implemented as one that provides a service in which data for learning is input and information on the trained encoders is output.
  • the server device may be implemented as a web server, or may be implemented as a cloud that provides services related to the above processing by outsourcing.
  • FIG. 16 is a diagram showing an example of a computer that executes a learning program. Note that the search program may also be executed by a similar computer.
  • the computer 1000 has a memory 1010 and a CPU 1020, for example.
  • Computer 1000 also has hard disk drive interface 1030 , disk drive interface 1040 , serial port interface 1050 , video adapter 1060 and network interface 1070 . These units are connected by a bus 1080 .
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1090 .
  • a disk drive interface 1040 is connected to the disk drive 1100 .
  • a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100 .
  • Serial port interface 1050 is connected to mouse 1110 and keyboard 1120, for example.
  • Video adapter 1060 is connected to display 1130, for example.
  • the hard disk drive 1090 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. That is, a program that defines each process of the learning device 10 is implemented as a program module 1093 in which computer-executable code is described. Program modules 1093 are stored, for example, on hard disk drive 1090 .
  • the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functional configuration of the learning device 10 .
  • the hard disk drive 1090 may be replaced by an SSD.
  • the setting data used in the processing of the above-described embodiment is stored as program data 1094 in the memory 1010 or the hard disk drive 1090, for example. Then, the CPU 1020 reads the program modules 1093 and program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary, and executes the processes of the above-described embodiments.
  • CPU 1020 may be programmed to perform the processes of the above embodiments in conjunction with the memory.
  • program modules 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090.
  • the program modules 1093 and program data 1094 are stored in a detachable non-temporary computer-readable storage medium and stored via the disk drive 1100 or the like. It may be read by CPU 1020 .
  • the program modules 1093 and program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Program modules 1093 and program data 1094 may then be read by CPU 1020 through network interface 1070 from other computers.
  • LAN Local Area Network
  • WAN Wide Area Network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

By using a model for receiving a video as input and outputting a feature quantity obtained by mapping the video to a first embedded space, a video feature quantity calculation unit (111) calculates a video feature quantity that is a feature quantity of a video included in a dataset of pairs of videos and speech. By using a speech encoder that is a model for receiving speech as input and outputting a feature quantity obtained by mapping the speech to a second embedded space, a speech feature quantity calculation unit (121) calculates a speech feature quantity that is a feature quantity of speech included in the dataset. An update unit (132) updates parameters of the models respectively used by the video feature quantity calculation unit (111) and the speech feature quantity calculation unit (121) such that a similarity which is between the video feature quantity and the speech feature quantity and which is calculated by emphasizing a temporal proximity between predetermined phenomena appearing in the video and the speech becomes greater.

Description

LEARNING DEVICE, LEARNING METHOD, AND LEARNING PROGRAM
The present invention relates to a learning device, a learning method, and a learning program.
According to image recognition technology, it is possible to identify various objects in images. Furthermore, there is a known technique for associating visual information with linguistic information by labeling images.
For example, there is a known technique that prepares a large amount of paired data consisting of an image and speech describing the content of the image (hereinafter referred to as a speech caption) and associates image regions with speech caption sections (hereinafter referred to as speech sections) (see, for example, Non-Patent Document 1).
In addition, there is a known technique for acquiring translation knowledge between languages by preparing speech captions in multiple languages that describe the same images (see, for example, Non-Patent Document 2).
Furthermore, there is a known technique that multimodally associates objects and events by focusing on the co-occurrence of video and audio streams in broadcast data and video data from posting sites (see, for example, Non-Patent Document 3).
However, conventional techniques have the problem that labeling objects and events in video is costly.
Here, an object is, for example, a physical object that appears in the video, and an event is, for example, an action of a person or the like in the video.
For example, in order to generate a model for associating objects and events in a video, it is necessary to label the objects and events. Performing this labeling manually incurs a large cost.
On the other hand, automating the labeling by action recognition and speech recognition is conceivable, but training models for action recognition and speech recognition also requires labeling of regions and sections as well as transcription, which is costly.
In order to solve the above problems and achieve the object, a learning device includes: a video feature quantity calculation unit that calculates a video feature quantity, which is a feature quantity of each video included in a data set of video-audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space; an audio feature quantity calculation unit that calculates an audio feature quantity, which is a feature quantity of each audio sample included in the data set, using a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space; and an update unit that updates the parameters of the models used by the video feature quantity calculation unit and the audio feature quantity calculation unit so that the similarity between the video feature quantity and the audio feature quantity, calculated by emphasizing the temporal proximity of predetermined phenomena appearing in the video and the audio, becomes larger.
According to the present invention, the cost of labeling objects and events in video can be reduced.
FIG. 1 is a diagram showing a configuration example of a learning device according to the first embodiment.
FIG. 2 is an explanatory diagram illustrating audio captions.
FIG. 3 is an explanatory diagram illustrating audio captions.
FIG. 4 is a diagram illustrating the temporal proximity of objects and events.
FIG. 5 is a network diagram of the entire model.
FIG. 6 is a flowchart showing the flow of processing of the learning device according to the first embodiment.
FIG. 7 is a diagram showing a configuration example of a search device according to the second embodiment.
FIG. 8 is a flowchart showing the flow of processing of the search device according to the second embodiment.
FIG. 9 is a diagram showing a data set.
FIG. 10 is a diagram showing experimental results.
FIG. 11 is a diagram showing experimental results.
FIG. 12 is a diagram showing the correspondence between video and audio.
FIG. 13 is a diagram showing the correspondence between video and audio.
FIG. 14 is a diagram showing the correspondence between video and audio.
FIG. 15 is a diagram showing the correspondence between video and audio.
FIG. 16 is a diagram illustrating an example of a computer that executes a learning program.
Hereinafter, embodiments of a learning device, a learning method, and a learning program according to the present application will be described in detail with reference to the drawings. Note that the present invention is not limited to the embodiments described below.
[First Embodiment]
The learning device according to the first embodiment trains a video encoder and an audio encoder using input learning data. The learning device then outputs each trained encoder. For example, the learning device outputs the parameters of each encoder. Note that video in this embodiment means a moving image.
The learning data in this embodiment is data that includes video and audio associated with the video. For example, the learning data may be video data for broadcasting, video data from a video posting site, or the like.
The video encoder is a model that takes a video as input and outputs a video feature quantity. The audio encoder is a model that takes audio as input and outputs an audio feature quantity. The learning device optimizes the video encoder and the audio encoder based on the output video feature quantity and audio feature quantity.
[Configuration of the First Embodiment]
FIG. 1 is a diagram showing a configuration example of the learning device according to the first embodiment. As shown in FIG. 1, the learning device 10 has a video feature quantity calculation unit 111, an audio feature quantity calculation unit 121, a loss function construction unit 131, and an update unit 132. The learning device 10 also stores video encoder information 112 and audio encoder information 122.
A video 151 and an audio caption 152 are input to the learning device 10. The learning device 10 can also output the updated video encoder information 112 and audio encoder information 122.
Here, the audio caption is audio corresponding to the video. For example, the video 151 and the audio caption 152 are image data and audio data extracted from the same video data.
Note that the audio caption 152 may be a signal recording the speech uttered by a person who watches the video 151 to explain its content. For example, the audio caption 152 may be obtained by using crowdsourcing to show the video 151 to a speaker and recording the speech the speaker utters to describe the video 151.
FIGS. 2 and 3 are explanatory diagrams illustrating audio captions (for details of FIGS. 2 and 3, see Non-Patent Document 1 and Non-Patent Document 2, respectively).
In FIG. 2, image regions and audio caption sections are associated. For example, a "mountain" in the image and "mountain" in the audio caption are associated by the same hatching pattern.
In FIG. 3, associations are made between audio captions describing the same image in different languages. For example, when the similarity between feature vectors extracted from English and Japanese audio captions is calculated, the similarity becomes high in corresponding parts, and translation knowledge between words is acquired.
In this way, it is known that data of different modalities can be associated. Using this fact, the learning device of the present embodiment associates video and audio.
Note that a modality can be regarded as a way of expressing ideas. For example, for the concept of a dog, an "image" showing a dog, "speech" uttering the word for dog, and "text" such as "いぬ", "犬", and "イヌ" correspond to modalities. In the case of English, "dog", "Dog", and "DOG" correspond to modalities. Examples of modalities include images, audio, video, and predetermined sensing data.
Returning to FIG. 1, the video feature quantity calculation unit 111 calculates a video feature quantity, which is a feature quantity of each video included in the data set of video-audio pairs, using the video encoder, which is a model that takes a video as input and outputs a feature quantity obtained by mapping the video into the first embedding space.
The video encoder information 112 is the parameters for constructing the video encoder.
The audio feature quantity calculation unit 121 calculates an audio feature quantity, which is a feature quantity of each audio sample included in the data set, using the audio encoder, which is a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into the second embedding space.
The audio encoder information 122 is the parameters for constructing the audio encoder.
The loss function construction unit 131 constructs a loss function based on the similarity between the video feature quantity and the audio feature quantity, where the similarity is calculated by emphasizing the temporal proximity of predetermined phenomena (events or objects) appearing in the video and the audio.
The update unit 132 updates the parameters of the video encoder and the audio encoder using the loss function constructed by the loss function construction unit 131.
That is, the update unit 132 updates the parameters of the models used by the video feature quantity calculation unit 111 and the audio feature quantity calculation unit 121 so that the similarity between the video feature quantity and the audio feature quantity, calculated by emphasizing the temporal proximity of predetermined phenomena appearing in the video and the audio, becomes larger.
The processing of each part of the learning device 10 will be described in detail with reference to FIGS. 4 and 5.
FIG. 4 is a diagram illustrating the temporal proximity of objects and events. FIG. 4 shows learning data based on a commentary video of a grand sumo tournament. In FIG. 4, the frames of the video 151 (video frames) and the Mel spectrogram of the audio caption 152 are arranged in chronological order.
For example, events occur in frames f102, f103, f105, and f106, and there is audio corresponding to each event that occurred.
For example, at frame f102, an event corresponding to the start of the match occurs, and frame f102 corresponds to the speech "Hakkeyoi nokotta" signifying the start of the match.
Also, for example, at frame f106, an event determining the outcome of the match occurs, and frame f106 corresponds to the speech "push out" describing the winning technique that decided the match.
In this way, it can be said that an event frame and the corresponding audio exist close to each other in time.
FIG. 5 is a network diagram of the entire model. As shown in FIG. 5, the video feature quantity calculation unit 111 inputs, for example, a sequence of 32 video frames cut out at equal intervals from a 10-second video into the video encoder (Video network) 112a.
The video encoder 112a is assumed to be a trained ECO (Non-Patent Document 4). ECO is conventionally used for action recognition. In this embodiment, the final layer of ECO, which serves as its classifier, is removed, and a 3D ResNet that sets the number of feature dimensions to d without changing the spatial size is appended.
For example, the video feature quantity calculation unit 111 inputs a tensor obtained by resizing the 32 video frames to a frame size of 224×224 into the video encoder 112a. The video encoder 112a then outputs a video feature quantity (Visual feature), which is a tensor of (time, channel, height, width) = (8, d, 7, 7).
The video feature quantity can be regarded as the 32 video frames compressed to 8 on the time axis and to 7×7 in the spatial size of each frame. Therefore, the video feature quantity can be regarded as a feature quantity having a d-dimensional feature vector at each point in an 8×7×7 space.
As shown in FIG. 5, the audio feature quantity calculation unit 121 performs frequency analysis with a frame shift of 10 ms and a frame length of 25 ms on a 10-second audio file, applies 40 Mel filter banks, and inputs the resulting Mel filter bank sequence (Mel spectrogram) into the audio encoder (Audio network) 122a.
The audio encoder 122a is assumed to be the CNN-based DAVEnet (Non-Patent Document 7). The audio encoder 122a may also be ResDAVEnet or a speech feature extractor that introduces a self-attention mechanism (Non-Patent Document 7).
The audio feature quantity calculation unit 121 inputs the Mel filter bank sequence into the audio encoder 122a. The audio encoder 122a then outputs an audio feature quantity (Audio feature) of (time, channel) = (64, d).
The audio feature quantity can be regarded as a d-dimensional feature vector sequence with a compressed time axis.
In the example of FIG. 5, d, which corresponds to the number of channels, is 1024. The video encoder 112a and the audio encoder 122a are not limited to those described above.
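As a concrete illustration of the input preparation described above, the following is a minimal Python sketch using PyTorch and torchaudio. It is not the authors' implementation: the function names are placeholders, and only the stated design points are taken from the description (32 frames sampled at equal intervals and resized to 224×224, and a 40-bin Mel spectrogram computed with a 25 ms frame length and a 10 ms frame shift).

```python
# Sketch (not the authors' code): preparing inputs for the two encoders.
import torch
import torchaudio

def sample_video_frames(frames: torch.Tensor, num_frames: int = 32) -> torch.Tensor:
    """frames: (T, 3, H, W) tensor; returns (num_frames, 3, 224, 224) resized frames."""
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).long()  # equal intervals
    sampled = frames[idx].float()
    return torch.nn.functional.interpolate(
        sampled, size=(224, 224), mode="bilinear", align_corners=False)

def log_mel_spectrogram(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """waveform: (1, num_samples); returns (40, num_frames) log-Mel features."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=int(0.025 * sample_rate),       # 25 ms frame length
        hop_length=int(0.010 * sample_rate),  # 10 ms frame shift
        n_mels=40,                            # 40 Mel filter banks
    )(waveform)
    return torch.log(mel + 1e-10).squeeze(0)

# Expected encoder outputs, following the description (shapes per clip, d = 1024):
#   video_encoder(frames) -> (8, d, 7, 7)  visual feature
#   audio_encoder(mel)    -> (64, d)       audio feature
```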
Here, the i-th pair of input video and audio is expressed as in formula (1).
[Formula (1)]
The video feature quantity and the audio feature quantity corresponding to the i-th pair are expressed as in formula (2).
[Formula (2)]
The loss function construction unit 131 constructs a loss function such that each model is trained so that the pair of video feature quantity and audio feature quantity shown in formula (2) are placed close to each other in a shared latent space (embedding space) and far from non-corresponding videos and audio.
For example, the loss function construction unit 131 can construct the loss function using Triplet loss (Non-Patent Document 3) or Masked Margin Softmax loss (MMS loss) (Non-Patent Document 5). In this embodiment, the loss function construction unit 131 constructs the loss function using MMS loss.
The loss function using MMS loss is as shown in formulas (3), (4), and (5).
[Formula (3)]
[Formula (4)]
[Formula (5)]
Here, B in formulas (4) and (5) is the batch size. The similarity Z m,n is equal to the left-hand side of formula (6).
[Formula (6)]
The loss function of formula (3) can be said to be a loss function that calculates the similarity of every video-audio pair in a batch and trains the parameters of the video encoder and the audio encoder so that the similarity of each true pair becomes high.
Furthermore, the numerators on the right-hand sides of formulas (4) and (5) are terms related to the similarity Z m,m of a paired video feature quantity and audio feature quantity. The denominators, on the other hand, include a term related to the similarity Z m,m and terms related to the similarities Z m,n (m ≠ n) of non-corresponding (unpaired) video and audio feature quantities.
Formula (4) is the loss when the similarity is considered with the video as the base, while formula (5) is the loss when the similarity is considered with the audio as the base.
Minimizing formulas (4) and (5) corresponds to increasing the numerator (higher similarity for true pairs) and decreasing the denominator (lower similarity for non-pairs).
Note that δ is a hyperparameter representing a margin, and it constrains the similarity of a true pair to be larger by an additional δ.
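Formulas (3) to (5) appear only as images in the publication. The following PyTorch sketch shows one common way to implement an MMS-style loss, assuming Z is the B×B similarity matrix whose (m, n) entry is Z m,n as described above; the masking of potential false negatives in the full MMS loss is omitted, so this is an illustration rather than the authors' exact loss.

```python
# Sketch of an MMS-style loss over a batch similarity matrix Z (assumed layout:
# Z[m, n] is the similarity between video m and audio n, true pairs on the diagonal).
import torch
import torch.nn.functional as F

def mms_loss(Z: torch.Tensor, delta: float = 0.001) -> torch.Tensor:
    B = Z.shape[0]
    targets = torch.arange(B, device=Z.device)
    # Subtract the margin delta from the paired (diagonal) similarities so that
    # a true pair has to beat every non-pair by at least delta.
    Z_margin = Z - delta * torch.eye(B, device=Z.device)
    loss_v2a = F.cross_entropy(Z_margin, targets)      # video-to-audio direction, formula (4)-like
    loss_a2v = F.cross_entropy(Z_margin.t(), targets)  # audio-to-video direction, formula (5)-like
    return loss_v2a + loss_a2v
```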
Here, S i,j (m,n) and G i,j , which are included in formula (6) for calculating the similarity, are calculated by formulas (7) and (8), respectively.
[Formula (7)]
[Formula (8)]
Formula (8) is guided attention, and it has the effect of emphasizing the temporal proximity of objects and events appearing in the video and the audio. Here, σg is a hyperparameter.
Returning to FIG. 5, the loss function construction unit 131 calculates the dot product of the video feature quantity and the audio feature quantity to obtain a tensor indicating the similarity between the video and the audio (Audio-Visual Affinity Tensor).
Furthermore, the loss function construction unit 131 obtains the similarity by applying spatial mean pooling to this tensor, weighting the result with guided attention, and then averaging.
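A minimal sketch of this similarity computation is given below, assuming a visual feature of shape (8, d, 7, 7) and an audio feature of shape (64, d) as in FIG. 5. Since formulas (6) to (8) are reproduced only as images, the Gaussian-shaped guided attention here (including the optional temporal shift of formula (9)) is an illustrative reconstruction, not the exact formulation.

```python
# Sketch: guided-attention-weighted audio-visual similarity for one video-audio pair.
import torch

def guided_similarity(visual: torch.Tensor, audio: torch.Tensor,
                      sigma_g: float = 0.1, shift: float = 0.0) -> torch.Tensor:
    """visual: (Tv, d, H, W) = (8, d, 7, 7); audio: (Ta, d) = (64, d). Returns a scalar."""
    Tv, Ta = visual.shape[0], audio.shape[0]
    # Affinity tensor: dot product between every (video time, position) and audio frame.
    affinity = torch.einsum("tdhw,ad->tahw", visual, audio)  # (Tv, Ta, H, W)
    S = affinity.mean(dim=(2, 3))                            # spatial mean pooling -> (Tv, Ta)
    # Guided attention: Gaussian weight that grows as the (normalized, optionally
    # shifted) video time i and audio time j get closer.
    i = torch.linspace(0, 1, Tv).unsqueeze(1)                # (Tv, 1)
    j = torch.linspace(0, 1, Ta).unsqueeze(0)                # (1, Ta)
    G = torch.exp(-((i - j - shift) ** 2) / (2 * sigma_g ** 2))
    return (S * G).mean()                                    # weighted average -> similarity Z
```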
The guided attention shown in FIG. 5 and formula (8) assumes that objects and events appear in the video and the audio at roughly the same time, and gives a large weight to the diagonal components. On the other hand, the appearance in the video and in the audio may also be misaligned.
For example, in a live sports broadcast, an announcer describes an object or event with speech after it has occurred in the video, so a time lag arises.
Also, in an instructional video such as a cooking video, the cook may first explain an upcoming step by speech, after which the objects and events related to that step appear in the video.
For example, the speech "hitting from the front" in FIG. 4 appears a certain amount of time after the event of hitting from the front occurs in the video.
Therefore, the loss function construction unit 131 may calculate the guided attention using a formula that includes hyperparameters i′ and j′, such as formula (9).
[Formula (9)]
Note that the above time lag can also be accommodated by adjusting σg.
When Triplet loss is adopted as the loss function, the loss function construction unit 131 constructs the loss function as in formula (10).
[Formula (10)]
Here, s(x, y) is the similarity between a video feature quantity x and an audio feature quantity y.
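Formula (10) is likewise shown only as an image. The sketch below uses a standard triplet (hinge) formulation with imposters sampled within a batch and a similarity function s(x, y) such as the guided_similarity sketch above; the imposter sampling and the margin value are assumptions for illustration.

```python
# Sketch of a triplet-style loss with sampled imposters (assumed formulation).
import random
import torch

def triplet_loss(videos, audios, s, margin: float = 1.0) -> torch.Tensor:
    """videos, audios: lists of paired features; s(video, audio) returns a scalar similarity."""
    B = len(videos)
    loss = torch.zeros(())
    for i in range(B):
        j = random.choice([k for k in range(B) if k != i])  # sampled imposter index
        pos = s(videos[i], audios[i])                        # similarity of the true pair
        loss = loss + torch.clamp(s(videos[i], audios[j]) - pos + margin, min=0.0)
        loss = loss + torch.clamp(s(videos[j], audios[i]) - pos + margin, min=0.0)
    return loss / B
```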
The update unit 132 updates the parameters of the video encoder and the audio encoder while decreasing the loss function using the Adam optimization algorithm (Non-Patent Document 6).
For example, the update unit 132 sets the following hyperparameters when executing Adam.
Weight decay: 4×10⁻⁵
Initial learning rate: 0.001
β1 = 0.95, β2 = 0.99
The optimization algorithm used by the update unit 132 is not limited to Adam, and may be, for example, stochastic gradient descent (SGD), RMSProp, or the like.
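For illustration, the hyperparameters listed above map directly onto a PyTorch optimizer configuration; the encoder objects and the use of torch.optim.Adam are assumptions about the implementation.

```python
# Sketch: configuring the optimizer with the hyperparameters stated in the description.
import torch

def build_optimizer(video_encoder: torch.nn.Module, audio_encoder: torch.nn.Module):
    params = list(video_encoder.parameters()) + list(audio_encoder.parameters())
    return torch.optim.Adam(params, lr=0.001, betas=(0.95, 0.99), weight_decay=4e-5)
    # Alternatives mentioned in the text:
    #   torch.optim.SGD(params, lr=0.001)
    #   torch.optim.RMSprop(params, lr=0.001)
```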
 このように、更新部132は、映像特徴量と音声特徴量との類似度であって、ガイド付き注意によって近接性を強調された類似度が大きくなるように、各モデルのパラメータを更新する。 In this way, the updating unit 132 updates the parameters of each model so that the degree of similarity between the video feature amount and the audio feature amount, which is the degree of similarity with the proximity emphasized by the guided attention, increases.
 ガイド付き注意は、映像特徴量と音声特徴量との要素同士の対応する時間が近いほど大きくなる重みということができる。 It can be said that the weight of the guided attention increases as the corresponding time between the elements of the video feature amount and the audio feature amount becomes closer.
 That is, the update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the value derived from the element-wise product of the video feature quantity and the audio feature quantity by a weight that increases as the times corresponding to those elements become closer.
 The update unit 132 may also update the parameters based on a loss function that uses the guided attention computed by equation (9).
 In this case, the update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the value derived from the element-wise product of the video feature quantity and the audio feature quantity by a weight that increases as the times corresponding to those elements, shifted by a predetermined offset, become closer.
[Processing of the first embodiment]
 FIG. 6 is a flowchart showing the flow of processing of the learning device according to the first embodiment. As shown in FIG. 6, first, a data set of pairs of videos and their corresponding audio captions is input to the learning device 10 (step S101).
 Next, the learning device 10 calculates a d-dimensional video feature vector from the video using the video encoder (step S102).
 The learning device 10 then calculates a d-dimensional audio feature vector from the audio caption using the audio encoder (step S103).
 The learning device 10 then constructs a loss function based on the similarity between the video feature vector and the audio feature vector, taking guided attention into account (step S104).
 The learning device 10 then updates the parameters of each encoder so that the loss function is optimized (step S105).
 If the termination condition is satisfied at this point (step S106, Yes), the learning device 10 ends the process. Otherwise (step S106, No), the learning device 10 returns to step S102 and repeats the process.
 The termination condition is, for example, that the amount of parameter update has fallen to or below a threshold, or that the number of parameter updates has reached a predetermined count.
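 A minimal sketch of the loop of Fig. 6 is shown below; the data loader, encoder architectures, and loss implementation are placeholders, and the concrete termination test (update magnitude or update count) is one possible reading of step S106 rather than the specification's definitive procedure.

# Sketch of the training loop in Fig. 6 (steps S102-S106). The loader,
# encoders, and compute_loss are placeholders; only the control flow
# follows the flowchart.
import torch

def train(video_encoder, audio_encoder, loader, compute_loss,
          max_updates=100_000, update_threshold=1e-6):
    params = list(video_encoder.parameters()) + list(audio_encoder.parameters())
    optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.95, 0.99),
                                 weight_decay=4e-5)
    n_updates = 0
    while True:
        for video, audio in loader:              # S101: video / audio-caption pairs
            v = video_encoder(video)             # S102: video feature vector
            a = audio_encoder(audio)             # S103: audio feature vector
            loss = compute_loss(v, a)            # S104: guided-attention-based loss
            optimizer.zero_grad()
            loss.backward()
            before = [p.detach().clone() for p in params]
            optimizer.step()                     # S105: parameter update
            n_updates += 1
            # S106: stop when updates become small or a maximum count is reached
            delta = sum((p - b).abs().sum().item() for p, b in zip(params, before))
            if delta <= update_threshold or n_updates >= max_updates:
                return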
 The video feature quantity calculation unit 111 calculates a video feature quantity, which is the feature quantity of a video included in a data set of video-audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space. The audio feature quantity calculation unit 121 calculates an audio feature quantity, which is the feature quantity of the audio included in the data set, using an audio encoder, i.e., a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space. The update unit 132 updates the parameters of the models used by the video feature quantity calculation unit 111 and the audio feature quantity calculation unit 121 so as to increase the similarity between the video feature quantity and the audio feature quantity, calculated with emphasis on the temporal proximity of a predetermined event appearing in the video and the audio.
 The frames of an event in a video and the audio corresponding to that event tend to occur close to each other in time. The learning device 10 can exploit this property to train the models effectively. As a result, according to this embodiment, the cost of labeling objects and events in video can be reduced.
 The update unit 132 updates the parameters of each model so as to increase the similarity between the video feature quantity and the audio feature quantity in which temporal proximity is emphasized by guided attention. Using guided attention in this way makes it possible to weight the individual elements of the feature quantities represented as tensors.
 The update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the value derived from the element-wise product of the video and audio feature quantities by a weight that increases as the times corresponding to those elements become closer. This emphasizes the temporal proximity between the audio and phenomena in the video such as events and objects.
 The update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the value derived from the element-wise product of the video and audio feature quantities by a weight that increases as the times corresponding to those elements, shifted by a predetermined offset, become closer. This makes it possible to cope with a time lag between events or objects and the audio.
[Second embodiment]
 The second embodiment describes a process of performing actual inference using the models trained in the first embodiment. The trained video encoder and audio encoder enable cross-modal search.
 Cross-modal search means searching across data of different modalities. For example, cross-modal search includes searching for video from audio, searching for audio from video, and searching for speech in one language from speech in another language. In the description of each embodiment, parts having the same functions as those of an already described embodiment are denoted by the same reference numerals, and their description is omitted as appropriate.
[Configuration of the second embodiment]
 FIG. 7 is a diagram showing a configuration example of the search device according to the second embodiment. As shown in FIG. 7, the search device 20 has a video feature quantity calculation unit 211, an audio feature quantity calculation unit 221, and a search unit 232. The search device 20 also stores video encoder information 212, audio encoder information 222, and video feature quantity information 231.
 Videos and an audio caption are input to the search device 20. The audio caption input to the search device 20 serves as the query for the search. The search device 20 outputs, for example, the video obtained by the search as the search result.
 Like the video feature quantity calculation unit 111 of the learning device 10, the video feature quantity calculation unit 211 receives a video as input and calculates its video feature quantity. Here, however, the video encoder represented by the video encoder information 212 has already been trained by the method described in the first embodiment.
 The video feature quantity calculation unit 211 accumulates the calculated video feature quantities as the video feature quantity information 231.
 The search unit 232 searches the accumulated video feature quantity information 231 for a video feature quantity similar to the audio feature quantity calculated by the audio feature quantity calculation unit 221. The similarity is calculated in the same way as in the first embodiment.
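 A minimal sketch of this ranking step is shown below; the storage format of the video feature quantity information 231 (a list of (video_id, feature) pairs) and the similarity function are assumptions.

# Sketch of the search performed by the search unit 232: rank the stored
# video features by similarity to the query audio feature. The store
# format and sim_fn (e.g., guided_similarity above) are assumptions.
def search_videos(audio_query_feat, video_feature_store, sim_fn, top_k=5):
    scored = [(video_id, sim_fn(video_feat, audio_query_feat))
              for video_id, video_feat in video_feature_store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]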
[Processing of the second embodiment]
 FIG. 8 is a flowchart showing the flow of processing of the search device according to the second embodiment. As shown in FIG. 8, first, a plurality of videos and a query audio are input to the search device 20 (step S201).
 Next, the search device 20 calculates a d-dimensional video feature vector from each video using the video encoder (step S202).
 The search device 20 then calculates a d-dimensional audio feature vector from the audio caption using the audio encoder (step S203).
 The search device 20 then searches for a video feature vector similar to the audio feature vector (step S204).
 Finally, the search device 20 outputs the video corresponding to the video feature vector obtained by the search (step S205).
[Effects of the second embodiment]
 As described above, according to the second embodiment, cross-modal search for retrieving video from audio can be performed.
[Experimental results]
 An experiment conducted using the search device of the second embodiment will now be described. In the experiment, the search device of the second embodiment performed retrieval using encoders trained by the learning device of the first embodiment.
 First, the correspondence between video and audio was learned using grand sumo broadcast video such as that shown in Fig. 4. Specifically, 170 hours of sumo footage were recorded, and the moment each bout was decided and the winning technique were labeled manually.
 Furthermore, nine winning techniques with high frequency of occurrence were selected, and the data set shown in Fig. 9 was created. Fig. 9 is a diagram showing the data set. The experiment used the video and audio streams for five seconds before and after the moment each bout was decided, ten seconds in total.
 To increase the amount of training data, video and audio streams shifted by ±5, ±10, ±15, ±20, and ±25 seconds from the moment the bout was decided were also used as training data (data augmentation).
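 A minimal sketch of this windowing is shown below; the frame rate, audio sample rate, and data layout are assumptions, not values taken from the experiment.

# Sketch of the data augmentation described above: in addition to the
# 10-second window centered on the decision moment, windows shifted by
# ±5, ±10, ±15, ±20 and ±25 seconds are extracted. fps and sr are assumed.
def extract_windows(video_frames, audio_samples, decision_sec,
                    fps=30, sr=16000, half_window=5,
                    shifts=(0, 5, -5, 10, -10, 15, -15, 20, -20, 25, -25)):
    clips = []
    for shift in shifts:
        center = decision_sec + shift
        v0, v1 = int((center - half_window) * fps), int((center + half_window) * fps)
        a0, a1 = int((center - half_window) * sr), int((center + half_window) * sr)
        if v0 >= 0 and v1 <= len(video_frames) and a0 >= 0 and a1 <= len(audio_samples):
            clips.append((video_frames[v0:v1], audio_samples[a0:a1]))
    return clips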
 Recall@N was used as the evaluation metric. For example, when searching the 90 evaluation samples for the audio paired with a given video query, the similarity between the query's feature vector and each audio feature vector is computed, the audio samples are ranked, and the top five are selected. If the audio paired with the query is among these five, the search is counted as successful. The same is done for all 90 evaluation videos, and the proportion for which the paired audio is ranked within the top five is Recall@5. Recall@3 and Recall@1 were computed in the same way.
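 A minimal sketch of the Recall@N computation is shown below; cosine similarity is assumed for the ranking, and query i is taken to be paired with candidate i.

# Sketch of the Recall@N evaluation described above: for each query feature,
# rank the candidate features of the other modality by similarity and check
# whether the paired item (same index) appears in the top N.
import numpy as np

def recall_at_n(query_feats, candidate_feats, n=5):
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    sims = q @ c.T                       # query i is paired with candidate i
    hits = 0
    for i, row in enumerate(sims):
        top = np.argsort(-row)[:n]
        hits += int(i in top)
    return hits / len(query_feats)

# Example with a 90-sample evaluation set of 512-dimensional features:
# recall_at_n(np.random.randn(90, 512), np.random.randn(90, 512), n=5)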
 Fig. 10 shows the experimental results. "Baseline" corresponds to the conventional technique, i.e., the case where guided attention is not used and the guide weight G_{i,j} is fixed to 1.
 As Fig. 10 shows, the proposed method ("Proposed") of this embodiment improves the recall rate substantially. Increasing the training data through data augmentation improves recall further.
 The results re-aggregated for each winning technique are shown in Fig. 11. Fig. 11 is a diagram showing the experimental results.
 For Recall@5 in this case, a result is counted as correct if audio with the same winning technique as the video query is included in the top five. Conversely, a result is counted as correct if video with the same winning technique as the audio query is included in the top five.
 The results in Fig. 11 show higher recall than those in Fig. 10. Moreover, the degree of improvement in recall suggests that the winning techniques were learned as concepts.
 Figs. 12 and 13 show the correspondence between video and audio. The graphs in Figs. 12 and 13 visualize equation (7).
 Fig. 12 corresponds to this embodiment (Proposed), and Fig. 13 corresponds to the conventional technique (Baseline).
 As Figs. 12 and 13 show, events (actions) in the video are associated with the audio more clearly in this embodiment than in the conventional technique.
 Figs. 14 and 15 show the correspondence between video and audio in a scene different from that of Figs. 12 and 13. The same tendency as in Figs. 12 and 13 can be seen in Figs. 14 and 15.
[System configuration, etc.]
 The components of each illustrated device are functionally conceptual and do not necessarily have to be physically configured as shown. That is, the specific form of distribution and integration of each device is not limited to the illustrated one; all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device can be realized by a CPU and a program analyzed and executed by the CPU, or as hardware based on wired logic.
 Of the processes described in the present embodiments, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.
[Program]
 In one embodiment, the learning device 10 and the search device 20 can be implemented by installing, on a desired computer, a program that executes the above learning processing or search processing as packaged software or online software. For example, by causing an information processing device to execute the above program, the information processing device can be made to function as the learning device 10 or the search device 20. The information processing device here includes desktop and notebook personal computers. The category of information processing devices also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).
 The learning device 10 and the search device 20 can also be implemented as a server device that provides services related to the above learning processing or search processing to a client, the client being a terminal device used by a user. For example, the server device is implemented as a server that takes training data as input and outputs information on the trained encoders. In this case, the server device may be implemented as a web server, or as a cloud that provides services related to the above processing by outsourcing.
 FIG. 16 is a diagram showing an example of a computer that executes the learning program. The search program may also be executed by a similar computer. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
 The hard disk drive 1090 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. That is, the program that defines each process of the learning device 10 is implemented as a program module 1093 in which computer-executable code is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, a program module 1093 for executing processing similar to the functional configuration of the learning device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD.
 The setting data used in the processing of the above-described embodiments is stored as program data 1094, for example, in the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary, and executes the processing of the above-described embodiments. The CPU 1020 may be programmed, in conjunction with the memory, to execute the processing of the above embodiments.
 The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; for example, they may be stored in a removable, non-transitory, computer-readable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.) and read by the CPU 1020 via the network interface 1070.
 10 Learning device
 20 Search device
 111, 211 Video feature quantity calculation unit
 112, 212 Video encoder information
 112a Video encoder
 121, 221 Audio feature quantity calculation unit
 122, 222 Audio encoder information
 122a Audio encoder
 131 Loss function construction unit
 132 Update unit
 151 Video
 152 Audio caption
 231 Video feature quantity information
 232 Search unit

Claims (6)

  1.  A learning device comprising:
     a video feature quantity calculation unit that calculates a video feature quantity, which is a feature quantity of a video included in a data set of video-audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space;
     an audio feature quantity calculation unit that calculates an audio feature quantity, which is a feature quantity of each audio included in the data set, using a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space; and
     an update unit that updates parameters of the models used by the video feature quantity calculation unit and the audio feature quantity calculation unit so as to increase a similarity between the video feature quantity and the audio feature quantity, the similarity being calculated with emphasis on temporal proximity of a predetermined event appearing in the video and the audio.
  2.  The learning device according to claim 1, wherein the update unit updates the parameters of each model so as to increase the similarity between the video feature quantity and the audio feature quantity in which the proximity is emphasized by guided attention.
  3.  The learning device according to claim 1, wherein the update unit updates the parameters of each model so as to increase the similarity obtained by multiplying a value derived from the element-wise product of the video feature quantity and the audio feature quantity by a weight that increases as the times corresponding to the elements become closer.
  4.  The learning device according to claim 3, wherein the update unit updates the parameters of each model so as to increase the similarity obtained by multiplying a value derived from the element-wise product of the video feature quantity and the audio feature quantity by a weight that increases as the times corresponding to the elements, shifted by a predetermined offset, become closer.
  5.  A learning method executed by a learning device, comprising:
     a video feature quantity calculation step of calculating a video feature quantity, which is a feature quantity of a video included in a data set of video-audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space;
     an audio feature quantity calculation step of calculating an audio feature quantity, which is a feature quantity of each audio included in the data set, using a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space; and
     an update step of updating parameters of the models used in the video feature quantity calculation step and the audio feature quantity calculation step so as to increase a similarity between the video feature quantity and the audio feature quantity, the similarity being calculated with emphasis on temporal proximity of a predetermined event appearing in the video and the audio.
  6.  A learning program for causing a computer to function as the learning device according to any one of claims 1 to 4.
PCT/JP2021/018443 2021-05-14 2021-05-14 Learning device, learning method, and learning program WO2022239239A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023520725A JPWO2022239239A1 (en) 2021-05-14 2021-05-14
PCT/JP2021/018443 WO2022239239A1 (en) 2021-05-14 2021-05-14 Learning device, learning method, and learning program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/018443 WO2022239239A1 (en) 2021-05-14 2021-05-14 Learning device, learning method, and learning program

Publications (1)

Publication Number Publication Date
WO2022239239A1 true WO2022239239A1 (en) 2022-11-17

Family

ID=84028969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/018443 WO2022239239A1 (en) 2021-05-14 2021-05-14 Learning device, learning method, and learning program

Country Status (2)

Country Link
JP (1) JPWO2022239239A1 (en)
WO (1) WO2022239239A1 (en)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HARWATH, DAVID; TORRALBA, ANTONIO; GLASS, JAMES R.: "Unsupervised Learning of Spoken Language with Visual Context", Advances in Neural Information Processing Systems 29 (NIPS 2016), 1 December 2016 (2016-12-01), pages 1-10, XP093003973, Retrieved from the Internet <URL:https://dspace.mit.edu/bitstream/handle/1721.1/124455/6186-unsupervised-learning-of-spoken-language-with-visual-context.pdf?sequence=2&isAllowed=y> [retrieved on 20221201] *

Also Published As

Publication number Publication date
JPWO2022239239A1 (en) 2022-11-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21941969

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023520725

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21941969

Country of ref document: EP

Kind code of ref document: A1