WO2022239239A1 - Learning device, learning method, and learning program - Google Patents

Learning device, learning method, and learning program Download PDF

Info

Publication number
WO2022239239A1
WO2022239239A1 (PCT/JP2021/018443)
Authority
WO
WIPO (PCT)
Prior art keywords
video
audio
speech
feature
feature quantity
Prior art date
Application number
PCT/JP2021/018443
Other languages
French (fr)
Japanese (ja)
Inventor
康智 大石
邦夫 柏野
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to JP2023520725A priority Critical patent/JPWO2022239239A1/ja
Priority to PCT/JP2021/018443 priority patent/WO2022239239A1/en
Publication of WO2022239239A1 publication Critical patent/WO2022239239A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present invention relates to a learning device, a learning method, and a learning program.
  • according to image recognition technology, it is possible to identify various objects in images. Furthermore, there is a known technique for associating visual information with linguistic information by labeling images.
  • there is a known technique that prepares a large amount of paired data consisting of an image and speech describing the content of the image (hereinafter referred to as a speech caption) and associates image regions with speech caption sections (hereinafter referred to as speech sections) (see, for example, Non-Patent Document 1).
  • there is also a known technique for acquiring translation knowledge between languages by preparing speech captions in multiple languages that describe the same images (see, for example, Non-Patent Document 2).
  • furthermore, there is a known technique that multimodally associates objects and events by focusing on the co-occurrence of video and audio streams in broadcast data and video data from posting sites (see, for example, Non-Patent Document 3).
  • an object is, for example, a physical object that appears in the video.
  • an event is, for example, an action of a person or the like in the video.
  • a learning device according to the present invention includes: a video feature quantity calculation unit that calculates a video feature quantity, which is a feature quantity of each video included in a data set of video-audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space; an audio feature quantity calculation unit that calculates an audio feature quantity, which is a feature quantity of each audio sample included in the data set, using a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space; and an update unit that updates the parameters of the models used by the video feature quantity calculation unit and the audio feature quantity calculation unit so that the similarity between the video feature quantity and the audio feature quantity, calculated by emphasizing the temporal proximity of predetermined events appearing in the video and the audio, becomes larger.
  • the cost of labeling objects and events in video can be reduced.
  • FIG. 1 is a diagram showing a configuration example of a learning device according to the first embodiment.
  • FIG. 2 is an explanatory diagram illustrating audio captions.
  • FIG. 3 is an explanatory diagram illustrating audio captions.
  • FIG. 4 is a diagram illustrating the temporal proximity of objects and events.
  • FIG. 5 is a network diagram of the entire model.
  • FIG. 6 is a flow chart showing the flow of processing of the learning device according to the first embodiment.
  • FIG. 7 is a diagram illustrating a configuration example of a search device according to the second embodiment.
  • FIG. 8 is a flow chart showing the flow of processing of the search device according to the second embodiment.
  • FIG. 9 is a diagram showing a data set.
  • FIG. 10 is a diagram showing experimental results.
  • FIG. 11 is a diagram showing experimental results.
  • FIG. 12 is a diagram showing the correspondence between video and audio.
  • FIG. 13 is a diagram showing the correspondence between video and audio.
  • FIG. 14 is a diagram showing the correspondence between video and audio.
  • FIG. 15 is a diagram showing the correspondence between video and audio.
  • FIG. 16 is a diagram illustrating an example of a computer that executes a learning program.
  • the learning device uses input learning data to train a video encoder and an audio encoder.
  • the learning device then outputs each trained encoder.
  • the learning device outputs parameters for each encoder.
  • the video in this embodiment means a moving image.
  • the learning data in this embodiment is data that includes video and audio associated with the video.
  • the learning data may be video data for broadcasting, video data from a video posting site, or the like.
  • a video encoder is a model that takes video as input and outputs video feature values.
  • a speech encoder is a model that receives speech as an input and outputs speech features. The learning device optimizes the video encoder and the audio encoder based on the output video feature quantity and audio feature quantity.
  • FIG. 1 is a diagram showing a configuration example of a learning device according to the first embodiment.
  • the learning device 10 has a video feature quantity calculator 111 , an audio feature quantity calculator 121 , a loss function generator 131 and an updater 132 .
  • the learning device 10 also stores video encoder information 112 and audio encoder information 122 .
  • a video 151 and an audio caption 152 are input to the learning device 10 . Also, the learning device 10 can output updated video encoder information 112 and audio encoder information 122 .
  • the audio caption is the audio corresponding to the video.
  • video 151 and audio captions 152 are image data and audio data extracted from the same video data.
  • the audio caption 152 may be a signal in which a person watches the video 151 and records the audio uttered to explain the content.
  • the audio captions 152 may be obtained by using crowdsourcing to show a video 151 to a speaker and record the audio spoken by the speaker to describe the video 151 .
  • FIGS. 2 and 3 are explanatory diagrams explaining audio captions (for details of FIGS. 2 and 3, see Non-Patent Document 1 and Non-Patent Document 2, respectively).
  • in Fig. 2, image regions and audio caption sections are associated. For example, a "mountain" in the image and "mountain" in the audio caption are associated by the same hatching pattern.
  • the learning device of the present embodiment associates video and audio.
  • modality can be said to be a way of expressing ideas.
  • for example, for the concept of a dog, modalities include an "image" showing a dog, "speech" uttering the word for dog, and "text" written in different forms (in Japanese, "いぬ", "犬", and "イヌ").
  • in the case of English, "dog", "Dog", and "DOG" correspond to modalities.
  • Examples of modalities include images, sounds, videos, and predetermined sensing data.
  • the video feature quantity calculation unit 111 uses a video encoder, which is a model that receives a video as input and outputs a feature quantity obtained by mapping the video into the first embedding space, to calculate a video feature quantity, which is a feature quantity of each video included in the data set of video-audio pairs.
  • the video encoder information 112 is parameters for constructing a video encoder.
  • the speech feature amount calculation unit 121 uses a speech encoder, which is a model that receives speech as input and outputs a feature quantity obtained by mapping the speech into the second embedding space, to calculate a speech feature quantity, which is a feature quantity of each speech sample included in the data set.
  • the speech encoder information 122 is parameters for constructing a speech encoder.
  • the loss function constructing unit 131 constructs a loss function based on the similarity between the video feature quantity and the audio feature quantity, where the similarity is calculated by emphasizing the temporal proximity of predetermined phenomena (events or objects) appearing in the video and the audio.
  • the update unit 132 uses the loss function configured by the loss function configuration unit 131 to update the parameters of the video encoder and the audio encoder.
  • that is, the update unit 132 updates the parameters of the models used by the video feature amount calculation unit 111 and the audio feature amount calculation unit 121 so that the similarity between the video feature quantity and the audio feature quantity, calculated by emphasizing the temporal proximity of predetermined phenomena appearing in the video and the audio, becomes larger.
  • the processing of each part of the learning device 10 will be described in detail with reference to FIGS. 4 and 5.
  • FIG. 4 is a diagram explaining the temporal proximity of objects and events.
  • FIG. 4 shows data for learning based on commentary videos of sumo wrestling.
  • in FIG. 4, the frames of the video 151 (video frames) and the Mel spectrogram of the audio caption 152 are arranged in chronological order.
  • for example, events occur in frames f102, f103, f105, and f106, and there is audio corresponding to each event that occurred.
  • for example, at frame f102, an event corresponding to the start of the match occurs, and frame f102 corresponds to the speech "Hakkeyoi nokotta" signifying the start of the match.
  • at frame f106, an event determining the outcome of the match occurs, and frame f106 corresponds to the speech "push out" describing the winning technique that decided the match.
  • FIG. 5 is a network diagram of the entire model. As shown in FIG. 5, the video feature amount calculation unit 111 inputs, for example, a sequence of 32 video frames cut out at equal intervals from a 10-second video into the video encoder (Video network) 112a.
  • the video encoder 112a is assumed to be a trained ECO (Non-Patent Document 4). ECO is conventionally used for action recognition. In this embodiment, the final layer of ECO, which serves as its classifier, is removed, and a 3D ResNet that sets the number of feature dimensions to d without changing the spatial size is appended.
  • the video feature amount can be said to be 32 video frames compressed to 8 on the time axis and 7 ⁇ 7 on the spatial size of the frame. Therefore, the video feature amount can be regarded as a feature amount having a d-dimensional feature vector at each point in an 8 ⁇ 7 ⁇ 7 space.
  • the audio feature amount calculation unit 121 performs frequency analysis with a frame shift of 10 ms and a frame length of 25 ms on a 10-second audio file, applies 40 Mel filter banks, and inputs the resulting Mel filter bank sequence (Mel spectrogram) into the audio encoder (Audio network) 122a.
  • the speech encoder 122a is assumed to be the CNN-based DAVEnet (Non-Patent Document 7). The speech encoder 122a may also be ResDAVEnet or a speech feature extractor that introduces a self-attention mechanism (Non-Patent Document 7).
  • a speech feature quantity can be regarded as a d-dimensional feature vector sequence with a compressed time axis.
  • d corresponding to the number of channels is 1024.
  • the video encoder 112a and the audio encoder 122a are not limited to those described above.
  • the i-th pair data of input video and audio is expressed as in formula (1).
  • the video feature amount and the audio feature amount corresponding to the i-th paired data are expressed as in formula (2).
  • the loss function constructing unit 131 constructs a loss function such that each model is trained so that the pair of video feature quantity and audio feature quantity shown in equation (2) are placed close to each other in a shared latent space (embedding space) and far from non-corresponding videos and audio.
  • the loss function configuration unit 131 can configure a loss function using Triplet loss (Non-Patent Document 3) or Masked Margin Softmax (MMS loss) (Non-Patent Document 5). In this embodiment, the loss function configuration unit 131 configures a loss function using MMS loss.
  • the loss function using MMS loss is as shown in formulas (3), (4) and (5).
  • B in formulas (4) and (5) is the batch size. Also, the similarity Z m,n is equal to the left-hand side of equation (6).
  • the loss function of equation (3) can be said to be a loss function that calculates the similarity of every video-audio pair in a batch and trains the parameters of the video encoder and the audio encoder so that the similarity of each true pair becomes high.
  • the numerators on the right-hand sides of equations (4) and (5) are terms related to the similarity Z m,m of a paired video feature quantity and audio feature quantity.
  • the denominators include a term related to the similarity Z m,m and terms related to the similarities Z m,n (m ≠ n) of non-corresponding (unpaired) video and audio feature quantities.
  • equation (4) is the loss when considering the degree of similarity based on the video.
  • equation (5) is the loss when the degree of similarity is considered based on speech.
  • Minimizing equations (4) and (5) corresponds to increasing the numerator (higher similarity when paired) and decreasing the denominator (lower similarity when not paired).
  • δ is a hyperparameter representing a margin, and it constrains the similarity of a true pair to be larger by an additional δ.
  • S i,j (m,n) and G i,j , which are included in equation (6) for calculating the similarity, are calculated by equations (7) and (8), respectively.
  • Equation (8) is guided attention, and has the effect of emphasizing the temporal proximity of objects and events appearing in video and audio. Also, ⁇ g is a hyperparameter.
  • the loss function constructing unit 131 calculates the dot product of the video feature amount and the audio feature amount, and obtains a tensor (Audio Visual Affinity Tensor) indicating the degree of similarity between the video and the audio.
  • furthermore, the loss function constructing unit 131 obtains the similarity by applying spatial mean pooling to the tensor, weighting the result with guided attention, and then averaging.
  • for example, in a live sports broadcast, the announcer describes an object or event with audio only after it has occurred in the video, resulting in a time lag.
  • the loss function configuration unit 131 may calculate guided attention using a formula including hyperparameters i' and j', such as formula (9).
  • the loss function configuration unit 131 configures the loss function as shown in equation (10).
  • s(x, y) is the degree of similarity between the video feature quantity x and the audio feature quantity y.
  • the update unit 132 updates the parameters of the video encoder and the audio encoder while decreasing the loss function using the optimization algorithm Adam (Non-Patent Document 6).
  • the updating unit 132 sets the following hyperparameters when executing Adam.
  • Weight decay: 4×10⁻⁵, initial learning rate: 0.001, β1 = 0.95, β2 = 0.99
  • the optimization algorithm used by the updating unit 132 is not limited to Adam, and may be the so-called stochastic gradient descent (SGD), RMSProp, or the like.
  • the updating unit 132 updates the parameters of each model so that the degree of similarity between the video feature amount and the audio feature amount, which is the degree of similarity with the proximity emphasized by the guided attention, increases.
  • the weight of the guided attention increases as the corresponding time between the elements of the video feature amount and the audio feature amount becomes closer.
  • the update unit 132 multiplies the value obtained from the product of the elements of the video feature amount and the audio feature amount by a weight that increases as the time corresponding to each element approaches, so that the degree of similarity obtained increases. , to update the parameters of each model.
  • the updating unit 132 may update the parameters based on the loss function using guided attention calculated by Equation (9).
  • in this case, the update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the values obtained from the products of elements of the video feature amount and the audio feature amount by weights that become larger as the times corresponding to the elements, shifted by a predetermined offset, become closer.
  • FIG. 6 is a flow chart showing the flow of processing of the learning device according to the first embodiment. As shown in FIG. 6, first, a data set of a video and a pair of audio captions corresponding to the video is input to the learning device 10 (step S101).
  • the learning device 10 uses a video encoder to calculate a d-dimensional video feature vector from the video (step S102).
  • the learning device 10 uses a speech encoder to calculate a d-dimensional speech feature vector from the speech caption (step S103).
  • the learning device 10 constructs a loss function based on the degree of similarity between the video feature vector and the audio feature vector considering the guided attention (step S104).
  • the learning device 10 updates the parameters of each encoder so that the loss function is optimized (step S105).
  • at this time, if the termination condition is satisfied (step S106, Yes), the learning device 10 terminates the processing; if the termination condition is not satisfied (step S106, No), the learning device 10 returns to step S102 and repeats the processing.
  • for example, the termination condition is that the amount of parameter update has become equal to or less than a threshold, or that the number of parameter updates has reached a certain number.
  • as described above, the video feature quantity calculation unit 111 uses a model that takes a video as input and outputs a feature quantity obtained by mapping the video into the first embedding space to calculate a video feature quantity, which is a feature quantity of each video included in the data set of video-audio pairs.
  • the speech feature amount calculation unit 121 uses a speech encoder, which is a model that receives speech as input and outputs a feature quantity obtained by mapping the speech into the second embedding space, to calculate a speech feature quantity, which is a feature quantity of each speech sample included in the data set.
  • the update unit 132 updates the parameters of the models used by the video feature quantity calculation unit 111 and the speech feature amount calculation unit 121 so that the similarity between the video feature quantity and the speech feature quantity, calculated by emphasizing the temporal proximity of predetermined phenomena appearing in the video and the audio, becomes larger.
  • the event frame in the video and the audio corresponding to the event have the property of being close in time.
  • the learning device 10 can effectively learn a model by emphasizing such properties. As a result, according to this embodiment, the cost for labeling objects and events in video can be reduced.
  • the updating unit 132 updates the parameters of each model so that the degree of similarity between the video feature amount and the audio feature amount, which is the degree of similarity with the proximity emphasized by the guided attention, increases.
  • with guided attention, it is possible to weight each element of the feature quantity represented as a tensor.
  • the update unit 132 multiplies the value obtained from the product of the elements of the video feature amount and the audio feature amount by a weight that increases as the time corresponding to each element is closer, so that the degree of similarity obtained increases. Update model parameters. This makes it possible to emphasize the temporal proximity between audio and events in video such as events and objects.
  • furthermore, the updating unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the values obtained from the products of elements of the video feature amount and the audio feature amount by weights that become larger as the times corresponding to the elements, shifted by a predetermined offset, become closer. This makes it possible to cope with the time lag between events or objects and the corresponding audio.
  • cross-modal search refers to searching across data of different forms.
  • examples of cross-modal search include searching for video from audio, searching for audio from video, and searching for audio in one language from audio in another language.
  • the same reference numerals are given to the parts having the same functions as those of the already described embodiments, and the description thereof will be omitted as appropriate.
  • FIG. 7 is a diagram illustrating a configuration example of a search device according to the second embodiment.
  • the search device 20 has a video feature quantity calculator 211 , an audio feature quantity calculator 221 , and a searcher 232 .
  • the search device 20 also stores video encoder information 212 , audio encoder information 222 and video feature amount information 231 .
  • Video and audio captions are input to the search device 20 .
  • the audio caption input to search device 20 is the query for the search.
  • the searching device 20 outputs a video obtained by searching as a search result.
  • the video feature amount calculation unit 211 receives a video as input and calculates video feature amounts, similar to the video feature amount calculation unit 111 of the learning device 10.
  • the video encoder information 212 has already been trained by the method described in the first embodiment.
  • the search device 20 accumulates the calculated video feature amounts as the video feature amount information 231.
  • the search unit 232 searches the accumulated video feature amount information 231 for a video feature amount similar to the audio feature amount calculated by the audio feature amount calculation unit 221.
  • the similarity calculation method is the same as in the first embodiment.
  • FIG. 8 is a flow chart showing the flow of processing of the search device according to the second embodiment. As shown in FIG. 8, first, a plurality of videos and a query voice are input to the searching device 20 (step S201).
  • the search device 20 uses a video encoder to calculate a d-dimensional video feature vector from each video (step S202).
  • the search device 20 calculates a d-dimensional audio feature vector from the audio caption using the audio encoder (step S203).
  • the search device 20 searches for a video feature vector similar to the audio feature vector (step S204).
  • the search device 20 outputs a video corresponding to the video feature vector obtained by the search (step S205).
  • FIG. 9 is a diagram showing a data set.
  • video and audio streams of 10 seconds in total, 5 seconds before and after the moment the match was decided, were used.
  • Recall@N was used as the evaluation metric. For example, when searching the 90 evaluation samples for the audio that forms a pair with a given video query, the similarity between the video feature vector and each audio feature vector is calculated, the audio samples are ranked, and the top five are determined. If the audio paired with the query is included in these top five, the search is considered successful. The same process is performed for all 90 videos in the evaluation data, and the proportion of queries for which the paired audio is included in the top five is defined as Recall@5. Recall@3 and Recall@1 were calculated similarly.
  • FIG. 11 shows the results re-aggregated for each winning technique.
  • FIG. 11 is a diagram showing experimental results.
  • FIGS. 12 and 13 are diagrams showing the correspondence between video and audio.
  • the graphs of FIGS. 12 and 13 visualize the equation (7).
  • FIG. 12 corresponds to this embodiment (Proposed). Also, FIG. 13 corresponds to the prior art (Baseline).
  • FIGS. 14 and 15 are diagrams showing the correspondence between video and audio in scenes different from those in FIGS. 12 and 13.
  • FIGS. 14 and 15 also show the same tendency as FIGS. 12 and 13.
  • each component of each device illustrated is functionally conceptual, and does not necessarily need to be physically configured as illustrated.
  • the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated.
  • all or any part of each processing function performed by each device can be realized by a CPU and a program analyzed and executed by the CPU, or realized as hardware by wired logic.
  • the learning device 10 and the searching device 20 can be implemented by installing a program for executing the above-described learning processing or searching processing as package software or online software on a desired computer.
  • the information processing device can function as the learning device 10 or the searching device 20 by causing the information processing device to execute the above program.
  • the information processing apparatus referred to here includes a desktop or notebook personal computer.
  • information processing devices include smart phones, mobile communication terminals such as mobile phones and PHS (Personal Handyphone Systems), and slate terminals such as PDAs (Personal Digital Assistants).
  • the learning device 10 and the searching device 20 can be implemented as a server device that uses a terminal device used by a user as a client and provides the client with services related to the above-described learning processing or search processing.
  • for example, the server device is implemented as one that provides a service in which data for learning is input and information on the trained encoders is output.
  • the server device may be implemented as a web server, or may be implemented as a cloud that provides services related to the above processing by outsourcing.
  • FIG. 16 is a diagram showing an example of a computer that executes a learning program. Note that the search program may also be executed by a similar computer.
  • the computer 1000 has a memory 1010 and a CPU 1020, for example.
  • Computer 1000 also has hard disk drive interface 1030 , disk drive interface 1040 , serial port interface 1050 , video adapter 1060 and network interface 1070 . These units are connected by a bus 1080 .
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1090 .
  • a disk drive interface 1040 is connected to the disk drive 1100 .
  • a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100 .
  • Serial port interface 1050 is connected to mouse 1110 and keyboard 1120, for example.
  • Video adapter 1060 is connected to display 1130, for example.
  • the hard disk drive 1090 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. That is, a program that defines each process of the learning device 10 is implemented as a program module 1093 in which computer-executable code is described. Program modules 1093 are stored, for example, on hard disk drive 1090 .
  • the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functional configuration of the learning device 10 .
  • the hard disk drive 1090 may be replaced by an SSD.
  • the setting data used in the processing of the above-described embodiment is stored as program data 1094 in the memory 1010 or the hard disk drive 1090, for example. Then, the CPU 1020 reads the program modules 1093 and program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary, and executes the processes of the above-described embodiments.
  • CPU 1020 may be programmed to perform the processes of the above embodiments in conjunction with the memory.
  • program modules 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090.
  • the program modules 1093 and program data 1094 are stored in a detachable non-temporary computer-readable storage medium and stored via the disk drive 1100 or the like. It may be read by CPU 1020 .
  • the program modules 1093 and program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Program modules 1093 and program data 1094 may then be read by CPU 1020 through network interface 1070 from other computers.
  • LAN Local Area Network
  • WAN Wide Area Network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

By using a model for receiving a video as input and outputting a feature quantity obtained by mapping the video to a first embedded space, a video feature quantity calculation unit (111) calculates a video feature quantity that is a feature quantity of a video included in a dataset of pairs of videos and speech. By using a speech encoder that is a model for receiving speech as input and outputting a feature quantity obtained by mapping the speech to a second embedded space, a speech feature quantity calculation unit (121) calculates a speech feature quantity that is a feature quantity of speech included in the dataset. An update unit (132) updates parameters of the models respectively used by the video feature quantity calculation unit (111) and the speech feature quantity calculation unit (121) such that a similarity which is between the video feature quantity and the speech feature quantity and which is calculated by emphasizing a temporal proximity between predetermined phenomena appearing in the video and the speech becomes greater.

Description

LEARNING DEVICE, LEARNING METHOD, AND LEARNING PROGRAM
The present invention relates to a learning device, a learning method, and a learning program.
According to image recognition technology, it is possible to identify various objects in images. Furthermore, there is a known technique for associating visual information with linguistic information by labeling images.
For example, there is a known technique that prepares a large amount of paired data consisting of an image and speech describing the content of the image (hereinafter referred to as a speech caption) and associates image regions with speech caption sections (hereinafter referred to as speech sections) (see, for example, Non-Patent Document 1).
In addition, there is a known technique for acquiring translation knowledge between languages by preparing speech captions in multiple languages that describe the same images (see, for example, Non-Patent Document 2).
Furthermore, there is a known technique that multimodally associates objects and events by focusing on the co-occurrence of video and audio streams in broadcast data and video data from posting sites (see, for example, Non-Patent Document 3).
However, conventional techniques have the problem that labeling objects and events in video is costly.
Here, an object is, for example, a physical object that appears in the video, and an event is, for example, an action of a person or the like in the video.
For example, in order to generate a model for associating objects and events in a video, it is necessary to label the objects and events. Performing this labeling manually incurs a large cost.
On the other hand, automating the labeling by action recognition and speech recognition is conceivable, but training models for action recognition and speech recognition also requires labeling of regions and sections as well as transcription, which is costly.
In order to solve the above problems and achieve the object, a learning device includes: a video feature quantity calculation unit that calculates a video feature quantity, which is a feature quantity of each video included in a data set of video-audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space; an audio feature quantity calculation unit that calculates an audio feature quantity, which is a feature quantity of each audio sample included in the data set, using a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space; and an update unit that updates the parameters of the models used by the video feature quantity calculation unit and the audio feature quantity calculation unit so that the similarity between the video feature quantity and the audio feature quantity, calculated by emphasizing the temporal proximity of predetermined phenomena appearing in the video and the audio, becomes larger.
According to the present invention, the cost of labeling objects and events in video can be reduced.
FIG. 1 is a diagram showing a configuration example of a learning device according to the first embodiment.
FIG. 2 is an explanatory diagram illustrating audio captions.
FIG. 3 is an explanatory diagram illustrating audio captions.
FIG. 4 is a diagram illustrating the temporal proximity of objects and events.
FIG. 5 is a network diagram of the entire model.
FIG. 6 is a flowchart showing the flow of processing of the learning device according to the first embodiment.
FIG. 7 is a diagram showing a configuration example of a search device according to the second embodiment.
FIG. 8 is a flowchart showing the flow of processing of the search device according to the second embodiment.
FIG. 9 is a diagram showing a data set.
FIG. 10 is a diagram showing experimental results.
FIG. 11 is a diagram showing experimental results.
FIG. 12 is a diagram showing the correspondence between video and audio.
FIG. 13 is a diagram showing the correspondence between video and audio.
FIG. 14 is a diagram showing the correspondence between video and audio.
FIG. 15 is a diagram showing the correspondence between video and audio.
FIG. 16 is a diagram illustrating an example of a computer that executes a learning program.
Hereinafter, embodiments of a learning device, a learning method, and a learning program according to the present application will be described in detail with reference to the drawings. Note that the present invention is not limited to the embodiments described below.
[First Embodiment]
The learning device according to the first embodiment trains a video encoder and an audio encoder using input learning data. The learning device then outputs each trained encoder. For example, the learning device outputs the parameters of each encoder. Note that video in this embodiment means a moving image.
The learning data in this embodiment is data that includes video and audio associated with the video. For example, the learning data may be video data for broadcasting, video data from a video posting site, or the like.
The video encoder is a model that takes a video as input and outputs a video feature quantity. The audio encoder is a model that takes audio as input and outputs an audio feature quantity. The learning device optimizes the video encoder and the audio encoder based on the output video feature quantity and audio feature quantity.
[Configuration of the First Embodiment]
FIG. 1 is a diagram showing a configuration example of the learning device according to the first embodiment. As shown in FIG. 1, the learning device 10 has a video feature quantity calculation unit 111, an audio feature quantity calculation unit 121, a loss function construction unit 131, and an update unit 132. The learning device 10 also stores video encoder information 112 and audio encoder information 122.
A video 151 and an audio caption 152 are input to the learning device 10. The learning device 10 can also output the updated video encoder information 112 and audio encoder information 122.
Here, the audio caption is audio corresponding to the video. For example, the video 151 and the audio caption 152 are image data and audio data extracted from the same video data.
Note that the audio caption 152 may be a signal recording the speech uttered by a person who watches the video 151 to explain its content. For example, the audio caption 152 may be obtained by using crowdsourcing to show the video 151 to a speaker and recording the speech the speaker utters to describe the video 151.
FIGS. 2 and 3 are explanatory diagrams illustrating audio captions (for details of FIGS. 2 and 3, see Non-Patent Document 1 and Non-Patent Document 2, respectively).
In FIG. 2, image regions and audio caption sections are associated. For example, a "mountain" in the image and "mountain" in the audio caption are associated by the same hatching pattern.
In FIG. 3, associations are made between audio captions describing the same image in different languages. For example, when the similarity between feature vectors extracted from English and Japanese audio captions is calculated, the similarity becomes high in corresponding parts, and translation knowledge between words is acquired.
In this way, it is known that data of different modalities can be associated. Using this fact, the learning device of the present embodiment associates video and audio.
Note that a modality can be regarded as a way of expressing ideas. For example, for the concept of a dog, an "image" showing a dog, "speech" uttering the word for dog, and "text" such as "いぬ", "犬", and "イヌ" correspond to modalities. In the case of English, "dog", "Dog", and "DOG" correspond to modalities. Examples of modalities include images, audio, video, and predetermined sensing data.
Returning to FIG. 1, the video feature quantity calculation unit 111 calculates a video feature quantity, which is a feature quantity of each video included in the data set of video-audio pairs, using the video encoder, which is a model that takes a video as input and outputs a feature quantity obtained by mapping the video into the first embedding space.
The video encoder information 112 is the parameters for constructing the video encoder.
The audio feature quantity calculation unit 121 calculates an audio feature quantity, which is a feature quantity of each audio sample included in the data set, using the audio encoder, which is a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into the second embedding space.
The audio encoder information 122 is the parameters for constructing the audio encoder.
The loss function construction unit 131 constructs a loss function based on the similarity between the video feature quantity and the audio feature quantity, where the similarity is calculated by emphasizing the temporal proximity of predetermined phenomena (events or objects) appearing in the video and the audio.
The update unit 132 updates the parameters of the video encoder and the audio encoder using the loss function constructed by the loss function construction unit 131.
That is, the update unit 132 updates the parameters of the models used by the video feature quantity calculation unit 111 and the audio feature quantity calculation unit 121 so that the similarity between the video feature quantity and the audio feature quantity, calculated by emphasizing the temporal proximity of predetermined phenomena appearing in the video and the audio, becomes larger.
The processing of each part of the learning device 10 will be described in detail with reference to FIGS. 4 and 5.
FIG. 4 is a diagram illustrating the temporal proximity of objects and events. FIG. 4 shows learning data based on a commentary video of a grand sumo tournament. In FIG. 4, the frames of the video 151 (video frames) and the Mel spectrogram of the audio caption 152 are arranged in chronological order.
For example, events occur in frames f102, f103, f105, and f106, and there is audio corresponding to each event that occurred.
For example, at frame f102, an event corresponding to the start of the match occurs, and frame f102 corresponds to the speech "Hakkeyoi nokotta" signifying the start of the match.
Also, for example, at frame f106, an event determining the outcome of the match occurs, and frame f106 corresponds to the speech "push out" describing the winning technique that decided the match.
In this way, it can be said that an event frame and the corresponding audio exist close to each other in time.
FIG. 5 is a network diagram of the entire model. As shown in FIG. 5, the video feature quantity calculation unit 111 inputs, for example, a sequence of 32 video frames cut out at equal intervals from a 10-second video into the video encoder (Video network) 112a.
The video encoder 112a is assumed to be a trained ECO (Non-Patent Document 4). ECO is conventionally used for action recognition. In this embodiment, the final layer of ECO, which serves as its classifier, is removed, and a 3D ResNet that sets the number of feature dimensions to d without changing the spatial size is appended.
For example, the video feature quantity calculation unit 111 inputs a tensor obtained by resizing the 32 video frames to a frame size of 224×224 into the video encoder 112a. The video encoder 112a then outputs a video feature quantity (Visual feature), which is a tensor of (time, channel, height, width) = (8, d, 7, 7).
The video feature quantity can be regarded as the 32 video frames compressed to 8 on the time axis and to 7×7 in the spatial size of each frame. Therefore, the video feature quantity can be regarded as a feature quantity having a d-dimensional feature vector at each point in an 8×7×7 space.
As shown in FIG. 5, the audio feature quantity calculation unit 121 performs frequency analysis with a frame shift of 10 ms and a frame length of 25 ms on a 10-second audio file, applies 40 Mel filter banks, and inputs the resulting Mel filter bank sequence (Mel spectrogram) into the audio encoder (Audio network) 122a.
The audio encoder 122a is assumed to be the CNN-based DAVEnet (Non-Patent Document 7). The audio encoder 122a may also be ResDAVEnet or a speech feature extractor that introduces a self-attention mechanism (Non-Patent Document 7).
The audio feature quantity calculation unit 121 inputs the Mel filter bank sequence into the audio encoder 122a. The audio encoder 122a then outputs an audio feature quantity (Audio feature) of (time, channel) = (64, d).
The audio feature quantity can be regarded as a d-dimensional feature vector sequence with a compressed time axis.
In the example of FIG. 5, d, which corresponds to the number of channels, is 1024. The video encoder 112a and the audio encoder 122a are not limited to those described above.
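As a concrete illustration of the input preparation described above, the following is a minimal Python sketch using PyTorch and torchaudio. It is not the authors' implementation: the function names are placeholders, and only the stated design points are taken from the description (32 frames sampled at equal intervals and resized to 224×224, and a 40-bin Mel spectrogram computed with a 25 ms frame length and a 10 ms frame shift).

```python
# Sketch (not the authors' code): preparing inputs for the two encoders.
import torch
import torchaudio

def sample_video_frames(frames: torch.Tensor, num_frames: int = 32) -> torch.Tensor:
    """frames: (T, 3, H, W) tensor; returns (num_frames, 3, 224, 224) resized frames."""
    idx = torch.linspace(0, frames.shape[0] - 1, num_frames).long()  # equal intervals
    sampled = frames[idx].float()
    return torch.nn.functional.interpolate(
        sampled, size=(224, 224), mode="bilinear", align_corners=False)

def log_mel_spectrogram(waveform: torch.Tensor, sample_rate: int) -> torch.Tensor:
    """waveform: (1, num_samples); returns (40, num_frames) log-Mel features."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=int(0.025 * sample_rate),       # 25 ms frame length
        hop_length=int(0.010 * sample_rate),  # 10 ms frame shift
        n_mels=40,                            # 40 Mel filter banks
    )(waveform)
    return torch.log(mel + 1e-10).squeeze(0)

# Expected encoder outputs, following the description (shapes per clip, d = 1024):
#   video_encoder(frames) -> (8, d, 7, 7)  visual feature
#   audio_encoder(mel)    -> (64, d)       audio feature
```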
Here, the i-th pair of input video and audio is expressed as in formula (1).
[Formula (1)]
The video feature quantity and the audio feature quantity corresponding to the i-th pair are expressed as in formula (2).
[Formula (2)]
The loss function construction unit 131 constructs a loss function such that each model is trained so that the pair of video feature quantity and audio feature quantity shown in formula (2) are placed close to each other in a shared latent space (embedding space) and far from non-corresponding videos and audio.
For example, the loss function construction unit 131 can construct the loss function using Triplet loss (Non-Patent Document 3) or Masked Margin Softmax loss (MMS loss) (Non-Patent Document 5). In this embodiment, the loss function construction unit 131 constructs the loss function using MMS loss.
The loss function using MMS loss is as shown in formulas (3), (4), and (5).
[Formula (3)]
[Formula (4)]
[Formula (5)]
Here, B in formulas (4) and (5) is the batch size. The similarity Z m,n is equal to the left-hand side of formula (6).
[Formula (6)]
The loss function of formula (3) can be said to be a loss function that calculates the similarity of every video-audio pair in a batch and trains the parameters of the video encoder and the audio encoder so that the similarity of each true pair becomes high.
Furthermore, the numerators on the right-hand sides of formulas (4) and (5) are terms related to the similarity Z m,m of a paired video feature quantity and audio feature quantity. The denominators, on the other hand, include a term related to the similarity Z m,m and terms related to the similarities Z m,n (m ≠ n) of non-corresponding (unpaired) video and audio feature quantities.
Formula (4) is the loss when the similarity is considered with the video as the base, while formula (5) is the loss when the similarity is considered with the audio as the base.
Minimizing formulas (4) and (5) corresponds to increasing the numerator (higher similarity for true pairs) and decreasing the denominator (lower similarity for non-pairs).
Note that δ is a hyperparameter representing a margin, and it constrains the similarity of a true pair to be larger by an additional δ.
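Formulas (3) to (5) appear only as images in the publication. The following PyTorch sketch shows one common way to implement an MMS-style loss, assuming Z is the B×B similarity matrix whose (m, n) entry is Z m,n as described above; the masking of potential false negatives in the full MMS loss is omitted, so this is an illustration rather than the authors' exact loss.

```python
# Sketch of an MMS-style loss over a batch similarity matrix Z (assumed layout:
# Z[m, n] is the similarity between video m and audio n, true pairs on the diagonal).
import torch
import torch.nn.functional as F

def mms_loss(Z: torch.Tensor, delta: float = 0.001) -> torch.Tensor:
    B = Z.shape[0]
    targets = torch.arange(B, device=Z.device)
    # Subtract the margin delta from the paired (diagonal) similarities so that
    # a true pair has to beat every non-pair by at least delta.
    Z_margin = Z - delta * torch.eye(B, device=Z.device)
    loss_v2a = F.cross_entropy(Z_margin, targets)      # video-to-audio direction, formula (4)-like
    loss_a2v = F.cross_entropy(Z_margin.t(), targets)  # audio-to-video direction, formula (5)-like
    return loss_v2a + loss_a2v
```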
Here, S i,j (m,n) and G i,j , which are included in formula (6) for calculating the similarity, are calculated by formulas (7) and (8), respectively.
[Formula (7)]
[Formula (8)]
Formula (8) is guided attention, and it has the effect of emphasizing the temporal proximity of objects and events appearing in the video and the audio. Here, σg is a hyperparameter.
Returning to FIG. 5, the loss function construction unit 131 calculates the dot product of the video feature quantity and the audio feature quantity to obtain a tensor indicating the similarity between the video and the audio (Audio-Visual Affinity Tensor).
Furthermore, the loss function construction unit 131 obtains the similarity by applying spatial mean pooling to this tensor, weighting the result with guided attention, and then averaging.
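A minimal sketch of this similarity computation is given below, assuming a visual feature of shape (8, d, 7, 7) and an audio feature of shape (64, d) as in FIG. 5. Since formulas (6) to (8) are reproduced only as images, the Gaussian-shaped guided attention here (including the optional temporal shift of formula (9)) is an illustrative reconstruction, not the exact formulation.

```python
# Sketch: guided-attention-weighted audio-visual similarity for one video-audio pair.
import torch

def guided_similarity(visual: torch.Tensor, audio: torch.Tensor,
                      sigma_g: float = 0.1, shift: float = 0.0) -> torch.Tensor:
    """visual: (Tv, d, H, W) = (8, d, 7, 7); audio: (Ta, d) = (64, d). Returns a scalar."""
    Tv, Ta = visual.shape[0], audio.shape[0]
    # Affinity tensor: dot product between every (video time, position) and audio frame.
    affinity = torch.einsum("tdhw,ad->tahw", visual, audio)  # (Tv, Ta, H, W)
    S = affinity.mean(dim=(2, 3))                            # spatial mean pooling -> (Tv, Ta)
    # Guided attention: Gaussian weight that grows as the (normalized, optionally
    # shifted) video time i and audio time j get closer.
    i = torch.linspace(0, 1, Tv).unsqueeze(1)                # (Tv, 1)
    j = torch.linspace(0, 1, Ta).unsqueeze(0)                # (1, Ta)
    G = torch.exp(-((i - j - shift) ** 2) / (2 * sigma_g ** 2))
    return (S * G).mean()                                    # weighted average -> similarity Z
```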
The guided attention shown in FIG. 5 and formula (8) assumes that objects and events appear in the video and the audio at roughly the same time, and gives a large weight to the diagonal components. On the other hand, the appearance in the video and in the audio may also be misaligned.
For example, in a live sports broadcast, an announcer describes an object or event with speech after it has occurred in the video, so a time lag arises.
Also, in an instructional video such as a cooking video, the cook may first explain an upcoming step by speech, after which the objects and events related to that step appear in the video.
For example, the speech "hitting from the front" in FIG. 4 appears a certain amount of time after the event of hitting from the front occurs in the video.
Therefore, the loss function construction unit 131 may calculate the guided attention using a formula that includes hyperparameters i′ and j′, such as formula (9).
[Formula (9)]
Note that the above time lag can also be accommodated by adjusting σg.
When Triplet loss is adopted as the loss function, the loss function construction unit 131 constructs the loss function as in formula (10).
[Formula (10)]
Here, s(x, y) is the similarity between a video feature quantity x and an audio feature quantity y.
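Formula (10) is likewise shown only as an image. The sketch below uses a standard triplet (hinge) formulation with imposters sampled within a batch and a similarity function s(x, y) such as the guided_similarity sketch above; the imposter sampling and the margin value are assumptions for illustration.

```python
# Sketch of a triplet-style loss with sampled imposters (assumed formulation).
import random
import torch

def triplet_loss(videos, audios, s, margin: float = 1.0) -> torch.Tensor:
    """videos, audios: lists of paired features; s(video, audio) returns a scalar similarity."""
    B = len(videos)
    loss = torch.zeros(())
    for i in range(B):
        j = random.choice([k for k in range(B) if k != i])  # sampled imposter index
        pos = s(videos[i], audios[i])                        # similarity of the true pair
        loss = loss + torch.clamp(s(videos[i], audios[j]) - pos + margin, min=0.0)
        loss = loss + torch.clamp(s(videos[j], audios[i]) - pos + margin, min=0.0)
    return loss / B
```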
The update unit 132 updates the parameters of the video encoder and the audio encoder while decreasing the loss function using the Adam optimization algorithm (Non-Patent Document 6).
For example, the update unit 132 sets the following hyperparameters when executing Adam.
Weight decay: 4×10⁻⁵
Initial learning rate: 0.001
β1 = 0.95, β2 = 0.99
The optimization algorithm used by the update unit 132 is not limited to Adam, and may be, for example, stochastic gradient descent (SGD), RMSProp, or the like.
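For illustration, the hyperparameters listed above map directly onto a PyTorch optimizer configuration; the encoder objects and the use of torch.optim.Adam are assumptions about the implementation.

```python
# Sketch: configuring the optimizer with the hyperparameters stated in the description.
import torch

def build_optimizer(video_encoder: torch.nn.Module, audio_encoder: torch.nn.Module):
    params = list(video_encoder.parameters()) + list(audio_encoder.parameters())
    return torch.optim.Adam(params, lr=0.001, betas=(0.95, 0.99), weight_decay=4e-5)
    # Alternatives mentioned in the text:
    #   torch.optim.SGD(params, lr=0.001)
    #   torch.optim.RMSprop(params, lr=0.001)
```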
 このように、更新部132は、映像特徴量と音声特徴量との類似度であって、ガイド付き注意によって近接性を強調された類似度が大きくなるように、各モデルのパラメータを更新する。 In this way, the updating unit 132 updates the parameters of each model so that the degree of similarity between the video feature amount and the audio feature amount, which is the degree of similarity with the proximity emphasized by the guided attention, increases.
 ガイド付き注意は、映像特徴量と音声特徴量との要素同士の対応する時間が近いほど大きくなる重みということができる。 It can be said that the weight of the guided attention increases as the corresponding time between the elements of the video feature amount and the audio feature amount becomes closer.
 That is, the update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the value derived from the element-wise product of the video feature quantity and the audio feature quantity by a weight that increases as the times corresponding to those elements become closer.
 The update unit 132 may also update the parameters based on a loss function that uses the guided attention computed by equation (9).
 In this case, the update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the value derived from the element-wise product of the video feature quantity and the audio feature quantity by a weight that increases as the times corresponding to those elements, shifted by a predetermined offset, become closer.
[Processing of the first embodiment]
 FIG. 6 is a flowchart showing the flow of processing of the learning device according to the first embodiment. As shown in FIG. 6, first, a data set of pairs of videos and their corresponding audio captions is input to the learning device 10 (step S101).
 Next, the learning device 10 calculates a d-dimensional video feature vector from the video using the video encoder (step S102).
 The learning device 10 then calculates a d-dimensional audio feature vector from the audio caption using the audio encoder (step S103).
 The learning device 10 then constructs a loss function based on the similarity between the video feature vector and the audio feature vector, taking guided attention into account (step S104).
 The learning device 10 then updates the parameters of each encoder so that the loss function is optimized (step S105).
 If the termination condition is satisfied at this point (step S106, Yes), the learning device 10 ends the process. Otherwise (step S106, No), the learning device 10 returns to step S102 and repeats the process.
 The termination condition is, for example, that the amount of parameter update has fallen to or below a threshold, or that the number of parameter updates has reached a predetermined count.
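 A minimal sketch of the loop of Fig. 6 is shown below; the data loader, encoder architectures, and loss implementation are placeholders, and the concrete termination test (update magnitude or update count) is one possible reading of step S106 rather than the specification's definitive procedure.

# Sketch of the training loop in Fig. 6 (steps S102-S106). The loader,
# encoders, and compute_loss are placeholders; only the control flow
# follows the flowchart.
import torch

def train(video_encoder, audio_encoder, loader, compute_loss,
          max_updates=100_000, update_threshold=1e-6):
    params = list(video_encoder.parameters()) + list(audio_encoder.parameters())
    optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.95, 0.99),
                                 weight_decay=4e-5)
    n_updates = 0
    while True:
        for video, audio in loader:              # S101: video / audio-caption pairs
            v = video_encoder(video)             # S102: video feature vector
            a = audio_encoder(audio)             # S103: audio feature vector
            loss = compute_loss(v, a)            # S104: guided-attention-based loss
            optimizer.zero_grad()
            loss.backward()
            before = [p.detach().clone() for p in params]
            optimizer.step()                     # S105: parameter update
            n_updates += 1
            # S106: stop when updates become small or a maximum count is reached
            delta = sum((p - b).abs().sum().item() for p, b in zip(params, before))
            if delta <= update_threshold or n_updates >= max_updates:
                return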
 The video feature quantity calculation unit 111 calculates a video feature quantity, which is the feature quantity of a video included in a data set of video-audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space. The audio feature quantity calculation unit 121 calculates an audio feature quantity, which is the feature quantity of the audio included in the data set, using an audio encoder, i.e., a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space. The update unit 132 updates the parameters of the models used by the video feature quantity calculation unit 111 and the audio feature quantity calculation unit 121 so as to increase the similarity between the video feature quantity and the audio feature quantity, calculated with emphasis on the temporal proximity of a predetermined event appearing in the video and the audio.
 The frames of an event in a video and the audio corresponding to that event tend to occur close to each other in time. The learning device 10 can exploit this property to train the models effectively. As a result, according to this embodiment, the cost of labeling objects and events in video can be reduced.
 The update unit 132 updates the parameters of each model so as to increase the similarity between the video feature quantity and the audio feature quantity in which temporal proximity is emphasized by guided attention. Using guided attention in this way makes it possible to weight the individual elements of the feature quantities represented as tensors.
 The update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the value derived from the element-wise product of the video and audio feature quantities by a weight that increases as the times corresponding to those elements become closer. This emphasizes the temporal proximity between the audio and phenomena in the video such as events and objects.
 The update unit 132 updates the parameters of each model so as to increase the similarity obtained by multiplying the value derived from the element-wise product of the video and audio feature quantities by a weight that increases as the times corresponding to those elements, shifted by a predetermined offset, become closer. This makes it possible to cope with a time lag between events or objects and the audio.
[Second embodiment]
 The second embodiment describes a process of performing actual inference using the models trained in the first embodiment. The trained video encoder and audio encoder enable cross-modal search.
 Cross-modal search means searching across data of different modalities. For example, cross-modal search includes searching for video from audio, searching for audio from video, and searching for speech in one language from speech in another language. In the description of each embodiment, parts having the same functions as those of an already described embodiment are denoted by the same reference numerals, and their description is omitted as appropriate.
[Configuration of the second embodiment]
 FIG. 7 is a diagram showing a configuration example of the search device according to the second embodiment. As shown in FIG. 7, the search device 20 has a video feature quantity calculation unit 211, an audio feature quantity calculation unit 221, and a search unit 232. The search device 20 also stores video encoder information 212, audio encoder information 222, and video feature quantity information 231.
 Videos and an audio caption are input to the search device 20. The audio caption input to the search device 20 serves as the query for the search. The search device 20 outputs, for example, the video obtained by the search as the search result.
 Like the video feature quantity calculation unit 111 of the learning device 10, the video feature quantity calculation unit 211 receives a video as input and calculates its video feature quantity. Here, however, the video encoder represented by the video encoder information 212 has already been trained by the method described in the first embodiment.
 The video feature quantity calculation unit 211 accumulates the calculated video feature quantities as the video feature quantity information 231.
 The search unit 232 searches the accumulated video feature quantity information 231 for a video feature quantity similar to the audio feature quantity calculated by the audio feature quantity calculation unit 221. The similarity is calculated in the same way as in the first embodiment.
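 A minimal sketch of this ranking step is shown below; the storage format of the video feature quantity information 231 (a list of (video_id, feature) pairs) and the similarity function are assumptions.

# Sketch of the search performed by the search unit 232: rank the stored
# video features by similarity to the query audio feature. The store
# format and sim_fn (e.g., guided_similarity above) are assumptions.
def search_videos(audio_query_feat, video_feature_store, sim_fn, top_k=5):
    scored = [(video_id, sim_fn(video_feat, audio_query_feat))
              for video_id, video_feat in video_feature_store]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]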
[Processing of the second embodiment]
 FIG. 8 is a flowchart showing the flow of processing of the search device according to the second embodiment. As shown in FIG. 8, first, a plurality of videos and a query audio are input to the search device 20 (step S201).
 Next, the search device 20 calculates a d-dimensional video feature vector from each video using the video encoder (step S202).
 The search device 20 then calculates a d-dimensional audio feature vector from the audio caption using the audio encoder (step S203).
 The search device 20 then searches for a video feature vector similar to the audio feature vector (step S204).
 Finally, the search device 20 outputs the video corresponding to the video feature vector obtained by the search (step S205).
[Effects of the second embodiment]
 As described above, according to the second embodiment, cross-modal search for retrieving video from audio can be performed.
[Experimental results]
 An experiment conducted using the search device of the second embodiment will now be described. In the experiment, the search device of the second embodiment performed retrieval using encoders trained by the learning device of the first embodiment.
 First, the correspondence between video and audio was learned using grand sumo broadcast video such as that shown in Fig. 4. Specifically, 170 hours of sumo footage were recorded, and the moment each bout was decided and the winning technique were labeled manually.
 Furthermore, nine winning techniques with high frequency of occurrence were selected, and the data set shown in Fig. 9 was created. Fig. 9 is a diagram showing the data set. The experiment used the video and audio streams for five seconds before and after the moment each bout was decided, ten seconds in total.
 To increase the amount of training data, video and audio streams shifted by ±5, ±10, ±15, ±20, and ±25 seconds from the moment the bout was decided were also used as training data (data augmentation).
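 A minimal sketch of this windowing is shown below; the frame rate, audio sample rate, and data layout are assumptions, not values taken from the experiment.

# Sketch of the data augmentation described above: in addition to the
# 10-second window centered on the decision moment, windows shifted by
# ±5, ±10, ±15, ±20 and ±25 seconds are extracted. fps and sr are assumed.
def extract_windows(video_frames, audio_samples, decision_sec,
                    fps=30, sr=16000, half_window=5,
                    shifts=(0, 5, -5, 10, -10, 15, -15, 20, -20, 25, -25)):
    clips = []
    for shift in shifts:
        center = decision_sec + shift
        v0, v1 = int((center - half_window) * fps), int((center + half_window) * fps)
        a0, a1 = int((center - half_window) * sr), int((center + half_window) * sr)
        if v0 >= 0 and v1 <= len(video_frames) and a0 >= 0 and a1 <= len(audio_samples):
            clips.append((video_frames[v0:v1], audio_samples[a0:a1]))
    return clips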
 Recall@N was used as the evaluation metric. For example, when searching the 90 evaluation samples for the audio paired with a given video query, the similarity between the query's feature vector and each audio feature vector is computed, the audio samples are ranked, and the top five are selected. If the audio paired with the query is among these five, the search is counted as successful. The same is done for all 90 evaluation videos, and the proportion for which the paired audio is ranked within the top five is Recall@5. Recall@3 and Recall@1 were computed in the same way.
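 A minimal sketch of the Recall@N computation is shown below; cosine similarity is assumed for the ranking, and query i is taken to be paired with candidate i.

# Sketch of the Recall@N evaluation described above: for each query feature,
# rank the candidate features of the other modality by similarity and check
# whether the paired item (same index) appears in the top N.
import numpy as np

def recall_at_n(query_feats, candidate_feats, n=5):
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    sims = q @ c.T                       # query i is paired with candidate i
    hits = 0
    for i, row in enumerate(sims):
        top = np.argsort(-row)[:n]
        hits += int(i in top)
    return hits / len(query_feats)

# Example with a 90-sample evaluation set of 512-dimensional features:
# recall_at_n(np.random.randn(90, 512), np.random.randn(90, 512), n=5)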
 Fig. 10 shows the experimental results. "Baseline" corresponds to the conventional technique, i.e., the case where guided attention is not used and the guide weight G_{i,j} is fixed to 1.
 As Fig. 10 shows, the proposed method ("Proposed") of this embodiment improves the recall rate substantially. Increasing the training data through data augmentation improves recall further.
 The results re-aggregated for each winning technique are shown in Fig. 11. Fig. 11 is a diagram showing the experimental results.
 For Recall@5 in this case, a result is counted as correct if audio with the same winning technique as the video query is included in the top five. Conversely, a result is counted as correct if video with the same winning technique as the audio query is included in the top five.
 The results in Fig. 11 show higher recall than those in Fig. 10. Moreover, the degree of improvement in recall suggests that the winning techniques were learned as concepts.
 Figs. 12 and 13 show the correspondence between video and audio. The graphs in Figs. 12 and 13 visualize equation (7).
 Fig. 12 corresponds to this embodiment (Proposed), and Fig. 13 corresponds to the conventional technique (Baseline).
 As Figs. 12 and 13 show, events (actions) in the video are associated with the audio more clearly in this embodiment than in the conventional technique.
 Figs. 14 and 15 show the correspondence between video and audio in a scene different from that of Figs. 12 and 13. The same tendency as in Figs. 12 and 13 can be seen in Figs. 14 and 15.
[System configuration, etc.]
 The components of each illustrated device are functionally conceptual and do not necessarily have to be physically configured as shown. That is, the specific form of distribution and integration of each device is not limited to the illustrated one; all or part of it can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Furthermore, all or any part of the processing functions performed by each device can be realized by a CPU and a program analyzed and executed by the CPU, or as hardware based on wired logic.
 Of the processes described in the present embodiments, all or part of the processes described as being performed automatically can also be performed manually, and all or part of the processes described as being performed manually can be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above description and drawings can be changed arbitrarily unless otherwise specified.
[Program]
 In one embodiment, the learning device 10 and the search device 20 can be implemented by installing, on a desired computer, a program that executes the above learning processing or search processing as packaged software or online software. For example, by causing an information processing device to execute the above program, the information processing device can be made to function as the learning device 10 or the search device 20. The information processing device here includes desktop and notebook personal computers. The category of information processing devices also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) terminals, as well as slate terminals such as PDAs (Personal Digital Assistants).
 The learning device 10 and the search device 20 can also be implemented as a server device that provides services related to the above learning processing or search processing to a client, the client being a terminal device used by a user. For example, the server device is implemented as a server that takes training data as input and outputs information on the trained encoders. In this case, the server device may be implemented as a web server, or as a cloud that provides services related to the above processing by outsourcing.
 FIG. 16 is a diagram showing an example of a computer that executes the learning program. The search program may also be executed by a similar computer. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
 The hard disk drive 1090 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. That is, the program that defines each process of the learning device 10 is implemented as a program module 1093 in which computer-executable code is described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, a program module 1093 for executing processing similar to the functional configuration of the learning device 10 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD.
 The setting data used in the processing of the above-described embodiments is stored as program data 1094, for example, in the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary, and executes the processing of the above-described embodiments. The CPU 1020 may be programmed, in conjunction with the memory, to execute the processing of the above embodiments.
 The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; for example, they may be stored in a removable, non-transitory, computer-readable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.) and read by the CPU 1020 via the network interface 1070.
 10 Learning device
 20 Search device
 111, 211 Video feature quantity calculation unit
 112, 212 Video encoder information
 112a Video encoder
 121, 221 Audio feature quantity calculation unit
 122, 222 Audio encoder information
 122a Audio encoder
 131 Loss function construction unit
 132 Update unit
 151 Video
 152 Audio caption
 231 Video feature quantity information
 232 Search unit

Claims (6)

  1.  A learning device comprising:
     a video feature quantity calculation unit that calculates a video feature quantity, which is a feature quantity of a video included in a data set of video-audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space;
     an audio feature quantity calculation unit that calculates an audio feature quantity, which is a feature quantity of each audio included in the data set, using a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space; and
     an update unit that updates parameters of the models used by the video feature quantity calculation unit and the audio feature quantity calculation unit so as to increase a similarity between the video feature quantity and the audio feature quantity, the similarity being calculated with emphasis on temporal proximity of a predetermined event appearing in the video and the audio.
  2.  The learning device according to claim 1, wherein the update unit updates the parameters of each model so as to increase the similarity between the video feature quantity and the audio feature quantity in which the proximity is emphasized by guided attention.
  3.  The learning device according to claim 1, wherein the update unit updates the parameters of each model so as to increase the similarity obtained by multiplying a value derived from the element-wise product of the video feature quantity and the audio feature quantity by a weight that increases as the times corresponding to the elements become closer.
  4.  The learning device according to claim 3, wherein the update unit updates the parameters of each model so as to increase the similarity obtained by multiplying a value derived from the element-wise product of the video feature quantity and the audio feature quantity by a weight that increases as the times corresponding to the elements, shifted by a predetermined offset, become closer.
  5.  A learning method executed by a learning device, comprising:
     a video feature quantity calculation step of calculating a video feature quantity, which is a feature quantity of a video included in a data set of video-audio pairs, using a model that takes a video as input and outputs a feature quantity obtained by mapping the video into a first embedding space;
     an audio feature quantity calculation step of calculating an audio feature quantity, which is a feature quantity of each audio included in the data set, using a model that takes audio as input and outputs a feature quantity obtained by mapping the audio into a second embedding space; and
     an update step of updating parameters of the models used in the video feature quantity calculation step and the audio feature quantity calculation step so as to increase a similarity between the video feature quantity and the audio feature quantity, the similarity being calculated with emphasis on temporal proximity of a predetermined event appearing in the video and the audio.
  6.  A learning program for causing a computer to function as the learning device according to any one of claims 1 to 4.
PCT/JP2021/018443 2021-05-14 2021-05-14 Learning device, learning method, and learning program WO2022239239A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2023520725A JPWO2022239239A1 (en) 2021-05-14 2021-05-14
PCT/JP2021/018443 WO2022239239A1 (en) 2021-05-14 2021-05-14 Learning device, learning method, and learning program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/018443 WO2022239239A1 (en) 2021-05-14 2021-05-14 Learning device, learning method, and learning program

Publications (1)

Publication Number Publication Date
WO2022239239A1 true WO2022239239A1 (en) 2022-11-17

Family

ID=84028969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/018443 WO2022239239A1 (en) 2021-05-14 2021-05-14 Learning device, learning method, and learning program

Country Status (2)

Country Link
JP (1) JPWO2022239239A1 (en)
WO (1) WO2022239239A1 (en)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HARWATH, DAVID; TORRALBA, ANTONIO; GLASS, JAMES R.: "Unsupervised Learning of Spoken Language with Visual Context", Advances in Neural Information Processing Systems 29 (NIPS 2016), 1 December 2016 (2016-12-01), pages 1-10, XP093003973, Retrieved from the Internet <URL:https://dspace.mit.edu/bitstream/handle/1721.1/124455/6186-unsupervised-learning-of-spoken-language-with-visual-context.pdf?sequence=2&isAllowed=y> [retrieved on 20221201] *

Also Published As

Publication number Publication date
JPWO2022239239A1 (en) 2022-11-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21941969

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023520725

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21941969

Country of ref document: EP

Kind code of ref document: A1