WO2022044100A1 - Learning device, search device, learning method, and program

Info

Publication number: WO2022044100A1
Application number: PCT/JP2020/031933
Authority: WO
Inventors: 昌弘安田; 康智大石; 登原田; 悠馬小泉
Original assignee: 日本電信電話株式会社
Priority date: 2020-08-25
Filing date: 2020-08-25
Publication date: 2022-03-03
Also published as: JPWO2022044100A1

Abstract

In this learning device, which performs learning so as to maximize or minimize a loss value calculated by means of an integrated function in which multiple loss functions are integrated, the multiple loss functions include a first loss function in which the magnitude of the value changes with progress of the learning, and at least one second loss function which is different from the first loss function, wherein the integrated function calculates the aforementioned loss value on the basis of a value obtained by normalizing the value of the first loss function and on the basis of the value of the second loss function.

Description

Learning device, search device, learning method, and program

The present invention relates to a learning device, a search device, a learning method, and a program.

Visual and acoustic events are often related and occur at the same time. For example, lip movements and voices, passing cars and engine sounds, movie images and sound effects, etc. are events that are likely to occur at the same time. From the relationship between such visual events and acoustic events, it is possible to estimate the engine sound of a car, for example, when watching an image of the car traveling. Cross-modal search is a technology that utilizes such "co-occurrence" of video and sound.

Cross-modal search is a technology that can estimate and search for an appropriate sound for an image, or an appropriate image for a sound, by using, for example, the "co-occurrence" between an image and a sound. Traditionally, studies on cross-modal search of video and sound have mainly focused on a limited (single) aspect of the co-occurrence of video and sound. The limited aspect referred to here is, for example, a physical aspect or a linguistic aspect.

For example, Non-Patent Document 1 describes, as a technique for cross-modal search focusing on the linguistic aspect of the co-occurrence of video and sound, a learning model for associating an object appearing in an image with spoken language. Further, Non-Patent Document 2 describes, as a technique for cross-modal search focusing on the physical aspect of the co-occurrence of video and sound, a machine learning model for estimating, from video, the sound generated by the physical interaction between objects. This makes it possible to estimate and search for the corresponding sound from, for example, a video of an object being hit with a stick.

Conventionally, in the technical field related to cross-modal search, various studies have been conducted on techniques for improving the accuracy of estimation and search using machine learning as described above. Generally, in order to improve the accuracy of estimation and retrieval by using machine learning, it is necessary to prepare learning data with detailed labels in advance.

In videos used in the real world, video and sound are often mixed in a complicated manner. For example, in a moving image in a television broadcast, a plurality of sounds, such as a newscaster's voice, BGM, and sound effects, are often used at the same time together with the video. Therefore, in moving images used in the real world, the co-occurrence relationship between video and sound is more complicated. Larger-scale learning data is required for estimation and retrieval based on such complicated co-occurrence relationships. However, when detailed labeling is performed on such large-scale learning data, there is a problem that the cost and labor required for creating the learning data increase.

In view of the above circumstances, an object of the present invention is to provide a learning device, a search device, a learning method, and a program that can reduce the cost and labor required for creating learning data.

One aspect of the present invention is a learning device that performs learning so as to maximize or minimize a loss value calculated by an integrated function in which a plurality of loss functions are integrated, wherein the plurality of loss functions include a first loss function whose value magnitude changes as the learning progresses and at least one second loss function different from the first loss function, and the integrated function calculates the loss value based on a value obtained by normalizing the value of the first loss function and on the value of the second loss function.

Further, one aspect of the present invention is a learning device including an input unit that accepts input of video data and acoustic data included in the same moving image, and a learning unit that learns the co-occurrence relationship between the video data and a specific sound based on the video data, the acoustic data, and weak-label label data indicating whether or not the moving image including the video data contains the specific sound.

Further, one aspect of the present invention is a learning device including an input unit that accepts input of video data and acoustic data included in the same moving image, and a learning unit that performs multi-task learning of the co-occurrence relationship between the video data and the sound and of the type of the sound that co-occurs with the video data, based on label data that classifies the sound included in the moving image including the video data.

Further, one aspect of the present invention is a search device that searches for audio corresponding to video data using the learning results of the above learning device.

Further, one aspect of the present invention is a learning method in which learning is performed so as to maximize or minimize a loss value calculated by an integrated function in which a plurality of loss functions are integrated, wherein the plurality of loss functions include a first loss function whose value magnitude changes as the learning progresses and at least one second loss function different from the first loss function, and the integrated function calculates the loss value based on a value obtained by normalizing the value of the first loss function and on the value of the second loss function.

Further, one aspect of the present invention is a learning method having an input step of accepting input of video data and acoustic data included in the same moving image, and a learning step of learning a co-occurrence relationship between the video data and the acoustic data based on weak-label data indicating whether or not the moving image containing the video data contains specific acoustic data.

Further, one aspect of the present invention is a learning method having an input step of accepting input of video data and acoustic data included in the same moving image, and a learning step of performing multi-task learning of the co-occurrence relationship between the video data and the acoustic data and of the type of the acoustic data co-occurring with the video data, based on label data that classifies specific acoustic data included in the moving image including the video data into one of a plurality of types.

Further, one aspect of the present invention is a program for operating a computer as the above-mentioned learning device.

Further, one aspect of the present invention is a program for operating a computer as the above-mentioned search device.

According to the present invention, the cost and labor required to create learning data can be reduced.

FIG. 1 is a schematic diagram of the configuration of the search system in the first embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the search system 1 in the first embodiment of the present invention.
FIG. 3 is a block diagram showing the functional configuration of the learning device 10 in the first embodiment of the present invention.
FIG. 4 is a block diagram showing the functional configuration of the search device 20 in the first embodiment of the present invention.
FIG. 5 is a flowchart showing the operation of the learning device 10 in the first embodiment of the present invention.
FIG. 6 is a flowchart showing the operation of the search device 20 in the first embodiment of the present invention.
FIG. 7 is a diagram showing the result of the objective evaluation experiment in the example of the first embodiment of the present invention.
FIG. 8 is a diagram showing the result of the subjective evaluation experiment in the example of the first embodiment of the present invention.

The present invention is intended for cross-modal search that, by utilizing the connection (relationship) between video and sound, estimates and searches for a sound corresponding to a video from the video, or a video corresponding to a sound from the sound. Hereinafter, the learning device, the search device, the learning method, and the program of the embodiment will be described with reference to the drawings.

<First Embodiment>
Hereinafter, the search system according to the first embodiment will be described. The search system in the present embodiment targets moving images in which video and a plurality of sounds co-occur, such as moving images in the real world. The search system in the present embodiment focuses on a specific target in the video and performs a cross-modal search for an appropriate sound corresponding to the video.

Hereinafter, as an example, a case will be described in which the search system estimates, from the video of a television broadcast, an appropriate "sound effect" corresponding to that video. Generally, in television broadcasting, sound effects appear mixed with other acoustic elements such as "speaking voice" and "music (e.g., BGM)". Therefore, the co-occurrence relationship between video and sound is complicated in moving images of television broadcasting.

To perform a cross-modal search focusing on the co-occurrence between the video and a specific sound (a sound effect in the present embodiment) in such a moving image, in which the co-occurrence relationship between video and sound is complicated, it is first necessary to identify the video and sound that are the targets of the cross-modal search.

In this embodiment, machine learning is used to identify the video and sound that are the targets of the cross-modal search. Large-scale learning data is required in order to perform highly accurate estimation and retrieval of moving images having a complicated co-occurrence relationship between video and sound. However, detailed labeling of such large-scale training data increases the cost and labor required to create the training data.

Therefore, machine learning using learning data based on weak labels (weak label learning) is desirable for the classifier that identifies the video and sound to be targeted by the cross-modal search. The learning data based on weak labels referred to here is learning data that has no detailed labels and is labeled only with, for example, metadata or tags.
In the present embodiment, the learning data based on the weak label is weakly labeled learning data indicating only whether or not the input video and the input sound include sound effects.

Note that the "voice" in the present embodiment is not limited to the human voice, but refers to all acoustic signals including sounds other than the human voice. For example, the "voice" in the present embodiment also includes BGM, sound effects, and the like included in moving images and the like in television broadcasting.

The search system in the present embodiment is a system that combines a learning device that performs representation learning based on a triplet loss with target detection (Awareness mechanism), in order to enable cross-modal search for moving images in which video and a plurality of sounds co-occur, such as moving images in the real world. The target target detection unit performs machine learning based on the above-mentioned weak label indicating only whether or not the input video and the input sound include sound effects.

In recent years, cross-modal search has been developed mainly by using representation learning with a DNN (Deep Neural Network). In representation learning, the DNN is trained using a cost function, such as a triplet loss, that associates video and audio based on similarity. The most basic form of the triplet loss is expressed by equation (1) below.

In the following description, subscripts in mathematical formulas and functions are written using the underscore "_". For example, a character string in which "m" is added as a subscript to the character "X" is written as "X_m". In addition, superscripts in mathematical formulas and functions are written using the caret "^". For example, a character string in which "n" is added as a superscript to the character "X" is written as "X^n". Further, a character string in which "m" is added as a subscript and "n" is added as a superscript to the character "X" is written as "X^n_m".

Here, N_b indicates the number of samples in a mini-batch in mini-batch learning. Further, A_m, P_m, and N_m represent the embedded representations corresponding to the reference input (Anchor), the similar pair (Positive), and the dissimilar pair (Negative), respectively, of the m-th sample among the sample data included in the mini-batch. Further, D(a, b) indicates the L2 distance (Euclidean distance) between the vector a and the vector b. Further, δ is called a margin parameter and is a positive constant.
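The quantities defined above can be illustrated with a minimal Python sketch. It assumes the common hinge form of the triplet loss, the sum over m = 1..N_b of max(0, D(A_m, P_m) - D(A_m, N_m) + δ); whether equation (1) sums or averages over the mini-batch is an assumption here, and the function names and sample vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """D(a, b): Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchors, positives, negatives, delta):
    """Hinge-style triplet loss summed over a mini-batch of N_b samples.

    Assumed form: sum_m max(0, D(A_m, P_m) - D(A_m, N_m) + delta).
    """
    return sum(
        max(0.0, euclidean(a, p) - euclidean(a, n) + delta)
        for a, p, n in zip(anchors, positives, negatives)
    )


# Two samples: the first already satisfies the margin, the second violates it.
anchors = [[0.0, 0.0], [0.0, 0.0]]
positives = [[1.0, 0.0], [3.0, 0.0]]
negatives = [[5.0, 0.0], [2.0, 0.0]]
loss = triplet_loss(anchors, positives, negatives, delta=1.0)
```

In this sketch only the second sample contributes to the loss, because its negative is closer to the anchor than its positive.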

For the purpose of stabilizing machine learning, the margin parameter δ may change as machine learning progresses. For example, the technique described in Non-Patent Document 3 increases the value of the margin parameter δ as the machine learning progresses, as shown in the following equation (2), for the purpose of stabilizing machine learning.

Here, δ_0 and δ_max indicate the initial value and the maximum value of δ. Further, γ is a value obtained by dividing the current number of learning epochs by the maximum number of epochs.
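As a rough illustration of a margin that grows with training progress, the following Python sketch assumes a simple linear schedule from δ_0 to δ_max in γ. The linear form is an assumption made only for illustration; the actual schedule is the one defined by equation (2) and Non-Patent Document 3, and the function name is hypothetical.

```python
def margin_schedule(epoch, max_epochs, delta_0, delta_max):
    """Return the margin for the given epoch.

    gamma is the current epoch divided by the maximum number of epochs,
    as in the text. The linear interpolation below is an assumed schedule.
    """
    gamma = epoch / max_epochs
    return delta_0 + (delta_max - delta_0) * gamma


# Margin at the start, midpoint, and end of a 10-epoch run.
margins = [margin_schedule(e, 10, delta_0=0.1, delta_max=1.0) for e in (0, 5, 10)]
```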

Further, the technique described in Non-Patent Document 1 performs a cross-modal search between video and audio in which the audio includes only sentences explaining the video. Therefore, in the technique described in Non-Patent Document 1, video and audio are always connected by a relationship in the linguistic aspect. For such moving images containing only a limited (single) aspect of co-occurrence between video and audio, representation learning using a cost function based on the similarity of the input samples, such as equation (1) above, is effective.

On the other hand, in moving images used in the real world, video and multiple sounds exist at the same time, and the aspect in which video and audio co-occur differs from sound to sound. Therefore, for moving images used in the real world, it may be difficult to obtain a sufficient learning effect by conventional representation learning using a cost function such as the above-mentioned equation (1).

As described above, the search system of the present embodiment enables cross-modal search between video and sound, even when video and a plurality of sounds are mixed as in a moving image in the real world, through machine learning using learning data based on weak labels. Moving images in the real world include co-occurrence of video and sound from various perspectives. For example, in a TV program, short sound effects (for example, about 1 to 5 seconds long) are added in time with various changes in the video, such as visual effects like subtitles, changes (switching) of the video scene, and movements of a person.

However, such sound effects are generally shorter in duration than other acoustic elements such as speech and BGM. Therefore, sound effects tend to be buried in other acoustic elements. Nevertheless, even when multiple types of co-occurrence between video and audio are mixed, humans can properly recognize sound effects that tend to be buried in other sounds, and visual effects that tend to be buried in other visual elements, and can associate them with each other. This is because humans have the ability to recognize and pay attention to these video and audio effects.

In order to imitate the human ability described above, the search system in the present embodiment combines target object detection with the representation learning based on the co-occurrence relationship between video and sound used in the above-mentioned prior art.

[Overview of search system]
Hereinafter, the outline of the search system in this embodiment will be described. The search system in the present embodiment performs multi-task learning that combines cross-modal representation learning of video and sound with target detection based on weak labels.

FIG. 1 is a schematic diagram of the configuration of the search system 1 according to the first embodiment of the present invention. The search system 1 in the present embodiment includes a DNN including a video query extraction unit (Video encoder), a voice dictionary extraction unit (Audio encoder), and a target target detection unit (Awareness mechanism).

The video query extraction unit (Video encoder) encodes the video, and the voice dictionary extraction unit (Audio encoder) encodes the audio, each into a vector in a common embedding space. Embedding is the transformation of a high-dimensional vector into a low-dimensional space. At the time of search, the embedding produced by the video query extraction unit serves as the query, and the embedding produced by the voice dictionary extraction unit serves as the dictionary.
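The query and dictionary roles described above can be sketched as a nearest-neighbor lookup in the common embedding space. This is a minimal Python sketch, assuming that candidates are ranked by Euclidean distance; the clip names, the two-dimensional embeddings, and the function names are hypothetical and are not taken from the embodiment.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search(query_embedding, dictionary):
    """Return dictionary keys ordered by distance to the query, nearest first."""
    return sorted(dictionary, key=lambda name: euclidean(query_embedding, dictionary[name]))


# Hypothetical audio embeddings (the "dictionary") and one video embedding (the "query").
audio_dictionary = {"whoosh": [0.9, 0.1], "chime": [0.1, 0.8], "thud": [0.5, 0.5]}
video_query = [0.8, 0.2]
ranking = search(video_query, audio_dictionary)
```

Because both encoders map into the same space, the audio dictionary can be computed once and reused for every video query.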

The target target detection unit (Awareness mechanism) is used only during learning. The target target detection unit performs machine learning so as to identify whether or not the input video and the input sound include the target object.

[Margin normalization]
The target target detection unit performs multi-task learning based on the weak label. The margin normalization by the target target detection unit will be described below.
The loss function of multi-task learning consisting of N tasks can be expressed by the following equation (3) as a weighted sum of the loss function L_i of each task.

Here, λ_i is a positive constant weight for the i-th task. λ_i adjusts the magnitude of the value of the loss function for each task.

However, when a loss function whose value range changes as the learning progresses is integrated with a loss function whose value range does not change as the learning progresses, using a fixed positive constant weight λ_i as described above causes an imbalance in the magnitudes of the loss function values across tasks. As a result, only the loss function having the larger value is effectively evaluated, and learning may not be performed correctly.

In order to solve such a problem, it is necessary to normalize the value of the loss function whose range of values changes as the learning progresses. Then, the value of the normalized loss function needs to be integrated with the value of the loss function whose range of values does not change as the learning progresses.

The value range of the triplet-loss cost function represented by the above-mentioned equation (1) changes greatly with the progress of learning because the margin parameter δ changes according to the above-mentioned equation (2). To address this, the search system 1 in the present embodiment performs margin normalization. Margin normalization here is a method of dividing the triplet-loss cost function by the margin parameter δ, thereby canceling the change in the value range of the loss function as the learning progresses.

When the above margin normalization is applied to multi-task learning with N triplet-loss cost functions L^triplet_i given by the above equation (1) and M cost functions L^table_j that do not depend on the margin parameter δ, the total cost function is expressed by the following equation (4).

Here, δ_i is the margin parameter for the i-th task. δ_i is either a positive constant or a positive real number that changes as learning progresses. By performing the margin normalization, both the stabilization of learning by increasing the value of the margin parameter δ described above and the elimination of the imbalance between loss functions described above are achieved.
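The effect of margin normalization can be checked numerically with a minimal Python sketch. It assumes unit weights and compares a plain weighted sum in the style of equation (3) with a sum in which each triplet loss is divided by its margin δ_i; whether equation (4) also carries per-task weights is an assumption left open here, and all names and numbers are illustrative.

```python
def weighted_sum(losses, weights):
    """Weighted sum of per-task losses, as described for equation (3)."""
    return sum(w * l for w, l in zip(weights, losses))


def margin_normalized_sum(triplet_losses, margins, other_losses):
    """Divide each triplet loss by its margin, then add the margin-independent losses."""
    return sum(l / d for l, d in zip(triplet_losses, margins)) + sum(other_losses)


# The same relative margin violation (50% of delta) early and late in training.
early_triplet, early_delta = 0.05, 0.1
late_triplet, late_delta = 0.5, 1.0
other = 0.3  # a loss whose value range does not depend on delta

# Share of the triplet term in the un-normalized total drifts as delta grows.
share_early = early_triplet / weighted_sum([early_triplet, other], [1.0, 1.0])
share_late = late_triplet / weighted_sum([late_triplet, other], [1.0, 1.0])

# With margin normalization the total is the same in both situations.
norm_early = margin_normalized_sum([early_triplet], [early_delta], [other])
norm_late = margin_normalized_sum([late_triplet], [late_delta], [other])
```

In this example the triplet term accounts for roughly 14% of the un-normalized total early on and over 60% later, whereas the normalized totals coincide.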

[Search system configuration]
Hereinafter, the configuration of the search system 1 according to the first embodiment will be described.
The object of the search process by the search system 1 in the present embodiment is, for example, a sound effect in a television broadcast and an accompanying video effect.

FIG. 2 is a block diagram showing a configuration of the search system 1 according to the first embodiment of the present invention. As shown in FIG. 2, the search system 1 in the present embodiment has a learning device 10 and a search device 20.
The learning device 10 performs machine learning and generates a learning model which is a learning result. The learning device 10 outputs the generated learning model to the search device 20.
The search device 20 acquires the learning model output from the learning device 10. The search device 20 executes a search process using the acquired learning model.

[Configuration of learning device]
Hereinafter, the configuration of the learning device 10 will be described.
FIG. 3 is a block diagram showing a functional configuration of the learning device 10 according to the first embodiment of the present invention.

As shown in FIG. 3, the learning device 10 includes a video input unit 101, a voice input unit 102, a spectrogram extraction unit 103, a video query extraction unit 104, a voice dictionary extraction unit 105, a target target detection unit 106, a first detection result output unit 107, a second detection result output unit 108, a target target label storage unit 109, a detection cost calculation unit 110, an embedding cost calculation unit 111, and a learning cost calculation unit 112.

The video input unit 101 acquires input data. In the present embodiment, the input data is a video (moving image) with audio. The video with audio is divided into video clips every short time (for example, every few seconds). The video input unit 101 takes out video data from the input data. The video input unit 101 compresses the acquired video by performing a resolution lowering process and a frame thinning process. The video input unit 101 performs the above-mentioned image compression so that the acquired video can be input to the video query extraction unit 104 in the subsequent stage. The video input unit 101 outputs the compressed video to the video query extraction unit 104.
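The two compression steps named above, resolution lowering and frame thinning, can be sketched in Python as follows. The sketch assumes plain subsampling (keeping every k-th frame, row, and column) on frames stored as nested lists; the embodiment does not specify the actual method, and the function names and the step and factor values are hypothetical.

```python
def thin_frames(frames, step):
    """Frame thinning: keep every `step`-th frame of the clip."""
    return frames[::step]


def lower_resolution(frame, factor):
    """Resolution lowering: keep every `factor`-th row and column of a 2-D frame."""
    return [row[::factor] for row in frame[::factor]]


def compress_clip(frames, frame_step, scale_factor):
    """Apply frame thinning, then resolution lowering, to a video clip."""
    return [lower_resolution(f, scale_factor) for f in thin_frames(frames, frame_step)]


# Eight 4x4 grayscale frames; frame t is filled with the value t.
clip = [[[t] * 4 for _ in range(4)] for t in range(8)]
compressed = compress_clip(clip, frame_step=2, scale_factor=2)
```

Here the clip shrinks from eight 4x4 frames to four 2x2 frames, which is the kind of size reduction needed before the clip is passed to the encoder.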

The voice input unit 102 acquires input data. In the present embodiment, the input data is a video (moving image) with audio. The input data input to the voice input unit 102 is the same data as the input data input to the video input unit 101 described above. The voice input unit 102 extracts acoustic data from the input data. The voice input unit 102 outputs the acquired acoustic data to the spectrogram extraction unit 103.

Alternatively, the video with audio, which is the input data, may be separated into video data and acoustic data in advance, and the separated video data and acoustic data may be input to the video input unit 101 and the voice input unit 102, respectively. In this case, the functional unit that separates the input data into the video data and the acoustic data may be provided inside the learning device 10 or may be provided outside the learning device 10.

The spectrogram extraction unit 103 acquires the acoustic data output from the voice input unit 102. The spectrogram extraction unit 103 performs a short-time Fourier transform (STFT) on the acquired acoustic data. The spectrogram extraction unit 103 outputs the acoustic data subjected to the short-time Fourier transform to the voice dictionary extraction unit 105.
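The short-time Fourier transform mentioned above can be illustrated with a naive, standard-library-only Python sketch. The Hann window, the frame length, the hop size, and the use of magnitudes are assumptions chosen for illustration; the embodiment does not state these settings, and a practical implementation would use an FFT library rather than a direct DFT.

```python
import cmath
import math


def stft(signal, frame_len, hop):
    """Naive STFT with a Hann window.

    Returns one list per frame, each holding frame_len // 2 + 1 complex
    DFT coefficients (non-negative frequencies only).
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(frame_len)]
        spectrum = [
            sum(segment[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


def magnitude_spectrogram(signal, frame_len, hop):
    """Magnitude of each STFT coefficient (time frames x frequency bins)."""
    return [[abs(c) for c in frame] for frame in stft(signal, frame_len, hop)]


# A sinusoid with exactly 4 cycles per 16-sample frame peaks in frequency bin 4.
tone = [math.sin(2 * math.pi * 4 * n / 16) for n in range(64)]
spec = magnitude_spectrogram(tone, frame_len=16, hop=8)
peak_bins = [frame.index(max(frame)) for frame in spec]
```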

The video query extraction unit 104 (Video encoder) acquires the video data output from the video input unit 101. The video query extraction unit 104 extracts an embedded vector (embedding) for use as a search query for the target audio by using a DNN model that inputs the acquired video data. The video query extraction unit 104 outputs the extracted embedded vector to the target target detection unit 106 and the embedding cost calculation unit 111.

The voice dictionary extraction unit 105 (Audio encoder) acquires the acoustic data output from the spectrogram extraction unit 103. The voice dictionary extraction unit 105 extracts an embedded vector (embedding) for use as a search query for sound effects by using a DNN model that inputs acquired sound data. The voice dictionary extraction unit 105 outputs the extracted embedded vector to the target target detection unit 106 and the embedding cost calculation unit 111.

The target target detection unit 106 (Awareness mechanism) acquires the embedded vector output from the video query extraction unit 104 and the embedded vector output from the voice dictionary extraction unit 105. The target target detection unit 106 detects whether or not the input data includes the target object by using a DNN model that takes as input the embedded vector output from the video query extraction unit 104 or the embedded vector output from the voice dictionary extraction unit 105.

The target target detection unit 106 uses the same DNN model for detection both when the input is the embedded vector output from the video query extraction unit 104 and when the input is the embedded vector output from the voice dictionary extraction unit 105. However, the calculation using the embedded vector output from the video query extraction unit 104 as the input and the calculation using the embedded vector output from the voice dictionary extraction unit 105 as the input are performed independently.
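The idea of one shared model applied independently to each embedding can be sketched in Python as follows. A single logistic unit stands in for the DNN model purely for illustration; the weights, bias, and embeddings are hypothetical values, not parameters of the embodiment.

```python
import math


def make_detector(weights, bias):
    """Return a tiny logistic detector that stands in for the shared DNN model."""
    def detect(embedding):
        z = sum(w * x for w, x in zip(weights, embedding)) + bias
        return 1.0 / (1.0 + math.exp(-z))  # probability that the target is present
    return detect


# One set of parameters shared by both modalities (hypothetical values).
detector = make_detector(weights=[2.0, -1.0], bias=0.0)

video_embedding = [1.0, 0.5]
audio_embedding = [-0.5, 1.0]

# Two independent forward passes through the same model.
p_video = detector(video_embedding)
p_audio = detector(audio_embedding)
```

Sharing the parameters works because both embeddings live in the same space, while the independent passes yield one detection result per modality.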

The target target detection unit 106 outputs the detection result calculated by inputting the embedded vector output from the video query extraction unit 104 to the first detection result output unit 107. Further, the target target detection unit 106 outputs the detection result calculated by inputting the embedded vector output from the voice dictionary extraction unit 105 to the second detection result output unit 108.

The first detection result output unit 107 acquires the detection result output from the target target detection unit 106. Here, the detection result acquired by the first detection result output unit 107 is information indicating whether or not the input data includes the target object, as detected in the target target detection unit 106 by the DNN model that takes the embedded vector output from the video query extraction unit 104 as input. The first detection result output unit 107 outputs information indicating the acquired detection result to the detection cost calculation unit 110.

The second detection result output unit 108 acquires the detection result output from the target target detection unit 106. Here, the detection result acquired by the second detection result output unit 108 is information indicating whether or not the input data includes the target object, as detected in the target target detection unit 106 by the DNN model that takes the embedded vector output from the voice dictionary extraction unit 105 as input. The second detection result output unit 108 outputs information indicating the acquired detection result to the detection cost calculation unit 110.

The target target label storage unit 109 stores label data in advance. In the present embodiment, the label data is information indicating whether or not each video clip of the input data input to the video input unit 101 and the voice input unit 102 includes the target of the search. That is, the label data stored in the target target label storage unit 109 is weak-label data.

The target target label storage unit 109 is composed of, for example, a storage medium such as a RAM (Random Access Memory), a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), or an HDD (Hard Disk Drive), or any combination of these storage media.

The detection cost calculation unit 110 acquires information indicating the detection result output from the first detection result output unit 107 and information indicating the detection result output from the second detection result output unit 108. Further, the detection cost calculation unit 110 acquires the target target label which is the label data of the target target stored in the target target label storage unit 109.

The detection cost calculation unit 110 calculates the detection cost based on the information indicating the detection result output from the first detection result output unit 107, the information indicating the detection result output from the second detection result output unit 108, and the target target label acquired from the target target label storage unit 109.

Here, the BCE (Binary Cross Entropy) between the detection results (the information indicating the detection result output from the first detection result output unit 107 and the information indicating the detection result output from the second detection result output unit 108) and the target target label acquired from the target target label storage unit 109 is defined as the detection cost function L_aware.
The detection cost calculation unit 110 outputs the detection cost function L_aware to the learning cost calculation unit 112.
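A minimal Python sketch of a binary cross entropy detection cost follows. It assumes that L_aware is the sum of one BCE term for the video-side detection result and one for the audio-side detection result, both compared against the same weak label; how the two terms are actually combined (sum, mean, or weighted) is an assumption, and the function names and probabilities are hypothetical.

```python
import math


def bce(prediction, label, eps=1e-12):
    """Binary cross entropy for one predicted probability and a 0/1 label."""
    p = min(max(prediction, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def detection_cost(p_video, p_audio, label):
    """Assumed L_aware: BCE of the video-side output plus BCE of the audio-side output."""
    return bce(p_video, label) + bce(p_audio, label)


# Weak label 1: the clip is known to contain the target (e.g. a sound effect).
cost_good = detection_cost(0.9, 0.8, label=1)  # both detectors agree with the label
cost_bad = detection_cost(0.2, 0.1, label=1)   # both detectors miss the target
```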

The embedding cost calculation unit 111 acquires the embedded vector output from the video query extraction unit 104 and the embedded vector output from the voice dictionary extraction unit 105. The embedding cost calculation unit 111 calculates the embedding cost L_triplet based on the embedded vector output from the video query extraction unit 104, the embedded vector output from the voice dictionary extraction unit 105, and the cost function represented by the following equation (5).

The two terms on the right side of equation (5) are both triplet loss functions of the form expressed by equation (1). In the first term L_inter, similar pairs and dissimilar pairs are composed based on the similarity between video and audio; in the second term L_intra, they are composed based on the similarity within video and the similarity within audio. Further, λ_1 is a positive constant.

The embedding cost calculation unit 111 outputs the calculated embedding cost L_triplet to the learning cost calculation unit 112.

The learning cost calculation unit 112 acquires the detection cost function L_aware output from the detection cost calculation unit 110. Further, the learning cost calculation unit 112 acquires the embedding cost L_triplet output from the embedding cost calculation unit 111. The learning cost calculation unit 112 calculates the total learning cost L by the following equation (6) based on the acquired detection cost function L_aware and the embedding cost L_triplet.

Here, δ is the margin parameter in the triplet loss function expressed by equation (1). The learning cost calculation unit 112 performs the above-mentioned margin normalization by dividing the embedding cost L_triplet by δ in the first term on the right side of equation (6).
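How the outputs of the embedding cost calculation unit 111 and the detection cost calculation unit 110 might be combined can be sketched in Python as follows. The sketch assumes that equation (5) has the form L_triplet = L_inter + λ_1 * L_intra and that equation (6) has the form L = L_triplet / δ + L_aware; the placement of λ_1 and the absence of a weight on L_aware are assumptions, and the numeric values are illustrative only.

```python
def embedding_cost(l_inter, l_intra, lambda_1):
    """Assumed form of equation (5): L_triplet = L_inter + lambda_1 * L_intra."""
    return l_inter + lambda_1 * l_intra


def learning_cost(l_triplet, l_aware, delta):
    """Assumed form of equation (6): margin-normalized L_triplet plus L_aware."""
    return l_triplet / delta + l_aware


# Illustrative values for one training step.
l_triplet = embedding_cost(l_inter=0.4, l_intra=0.2, lambda_1=0.5)
total = learning_cost(l_triplet, l_aware=0.6, delta=0.5)
```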

The learning device 10 updates the parameters of the DNN model of the video query extraction unit 104, the parameters of the DNN model of the voice dictionary extraction unit 105, and the parameters of the DNN model of the target object detection unit 106 so as to minimize the overall cost function L represented by equation (6).

In the present embodiment, a cost function that is optimized by minimizing its value is used, but a cost function that is optimized by maximizing its value may be used instead.

When the machine learning is completed, the learning device 10 outputs information indicating the updated DNN model of the video query extraction unit 104 and information indicating the updated DNN model of the voice dictionary extraction unit 105 to the search device 20 described later.

[Search device configuration]
Hereinafter, the configuration of the search device 20 will be described.
FIG. 4 is a block diagram showing a functional configuration of the search device 20 according to the first embodiment of the present invention.

As shown in FIG. 4, the search device 20 includes a video input unit 201, a voice input unit 202, a spectrogram extraction unit 203, a video query extraction unit 204, a voice dictionary extraction unit 205, a voice dictionary storage unit 206, an embedded distance calculation unit 207, a voice search result output unit 208, and a video output unit 209.

The video input unit 201 acquires input data. In the present embodiment, the input data is a video (moving image) with audio. The video with audio is divided into video clips every short time (for example, every few seconds). The video input unit 201 takes out video data from the input data. The video input unit 201 performs image compression on the acquired video by performing a resolution lowering process and a frame thinning process. The video input unit 201 performs the above-mentioned image compression so that the acquired video can be input to the video query extraction unit 204 in the subsequent stage. The video input unit 201 outputs the compressed video to the video query extraction unit 204.
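
As a non-limiting illustration, the resolution lowering process and the frame thinning process may be sketched in Python as follows. The function names, the source frame rate of 30 fps used in the usage example, and the use of nearest-neighbor sampling for the resolution lowering are assumptions of this sketch; the embodiment does not specify the resizing method.

```python
def thin_frames(frames, src_fps, dst_fps):
    """Frame thinning: keep evenly spaced frames so that the rate drops to dst_fps."""
    step = src_fps / dst_fps
    count = int(len(frames) * dst_fps / src_fps)
    return [frames[int(i * step)] for i in range(count)]


def resize_nearest(frame, out_h, out_w):
    """Resolution lowering by nearest-neighbor sampling (an assumed method)."""
    in_h, in_w = len(frame), len(frame[0])
    return [
        [frame[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def compress_clip(frames, src_fps, dst_fps, out_h, out_w):
    """Apply both processes to one video clip given as a list of 2-D frames."""
    return [resize_nearest(f, out_h, out_w) for f in thin_frames(frames, src_fps, dst_fps)]


# Usage: one second of hypothetical 30 fps video with 4 x 4 frames,
# reduced to 5 frames of 2 x 2 pixels.
clip = [[[t * 100 + r * 4 + c for c in range(4)] for r in range(4)] for t in range(30)]
small = compress_clip(clip, 30, 5, 2, 2)
```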

The voice input unit 202 acquires input data. In the present embodiment, the input data is a video (moving image) with audio. The input data input to the audio input unit 202 is the same data as the input data input to the video input unit 201 described above. The voice input unit 202 extracts acoustic data from the input data. The voice input unit 202 outputs the acquired acoustic data to the spectrogram extraction unit 203.

The search device 20 may be configured such that the video with audio, which is the input data, is separated into video data and acoustic data in advance, and the separated video data and acoustic data are input to the video input unit 201 and the audio input unit 202, respectively. In this case, the functional unit that separates the input data into the video data and the acoustic data may be provided inside the search device 20 or may be provided outside the search device 20.

As described above, in the present embodiment, the audio input unit 202 extracts acoustic data from the same data as the input data input to the video input unit 201, similarly to the audio input unit 102 of the learning device 10 described above. However, it is not limited to this. The audio input unit 202 may acquire acoustic data different from the acoustic data included in the input data input to the video input unit 201.

The spectrogram extraction unit 203 acquires the acoustic data output from the voice input unit 202. The spectrogram extraction unit 203 performs a short-time Fourier transform (STFT) on the acquired acoustic data. The spectrogram extraction unit 203 outputs the acoustic data subjected to the short-time Fourier transform to the voice dictionary extraction unit 205.

The video query extraction unit 204 (Video encoder) acquires the video data output from the video input unit 201. The video query extraction unit 204 extracts an embedded vector (embedding) for use as a search query for the target audio by using a DNN model that inputs the acquired video data. The video query extraction unit 204 outputs the extracted embedded vector to the embedding distance calculation unit 207.

The voice dictionary extraction unit 205 (Audio encoder) acquires the acoustic data output from the spectrogram extraction unit 203. The voice dictionary extraction unit 205 extracts an embedded vector (embedding) to be used as a search query for sound effects by a DNN model in which the acquired sound data is input. The voice dictionary extraction unit 205 records the extracted embedded vector in the voice dictionary storage unit 206.

The voice dictionary storage unit 206 stores a voice dictionary consisting of the embedded vectors recorded by the voice dictionary extraction unit 205. The voice dictionary is obtained by having the voice dictionary extraction unit 205 convert all of the audio inputs to be searched into embedded vectors.
The voice dictionary storage unit 206 is configured to include storage media such as RAM, flash memory, EEPROM, and HDD storage media, or any combination of these storage media.

The embedding distance calculation unit 207 acquires the embedding vector output from the video query extraction unit 204. Further, the embedded distance calculation unit 207 acquires an embedded vector (voice dictionary) stored in the voice dictionary storage unit 206. The embedded distance calculation unit 207 calculates the distance (for example, the Euclidean distance) between the embedded vector output from the video query extraction unit 204 and the embedded vector stored in the voice dictionary storage unit 206. The embedded distance calculation unit 207 outputs information indicating the calculated distance to the voice search result output unit 208.

The voice search result output unit 208 acquires the information indicating the distance between the embedding vectors output from the embedding distance calculation unit 207. The voice search result output unit 208 identifies an embedded vector (voice dictionary) corresponding to the leading distance (that is, the shortest distance) when the distances based on the acquired information are sorted in ascending order. The voice search result output unit 208 outputs the acoustic data corresponding to the specified embedded vector to the video output unit 209.
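
As a non-limiting illustration, the distance calculation and the selection of the shortest distance may be sketched in Python as follows. The function names and the representation of the voice dictionary as a list of (clip identifier, embedded vector) pairs are assumptions of this sketch.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedded vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_sound(query, voice_dictionary):
    """Sort dictionary entries by distance to the video query in ascending order.

    Returns the identifier of the nearest entry and the full ranking.
    """
    ranked = sorted(voice_dictionary, key=lambda entry: euclidean(query, entry[1]))
    ranking = [clip_id for clip_id, _ in ranked]
    return ranking[0], ranking


# Hypothetical two-dimensional embeddings, for illustration only.
dictionary = [("clip_a", [3.0, 4.0]), ("clip_b", [1.0, 0.0]), ("clip_c", [0.0, 2.0])]
best, ranking = search_sound([0.0, 0.0], dictionary)
```

In this usage example, the entry at the head of the ascending order is the one at distance 1 from the query, and the acoustic data associated with that entry would be passed on for output.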

The video output unit 209 acquires the acoustic data output from the voice search result output unit 208. The video output unit 209 combines the acoustic data with the video data included in the input data acquired by the video input unit 201. The video output unit 209 outputs a video with audio.

[Operation of learning device]
Hereinafter, an example of the operation of the learning device 10 will be described.
FIG. 5 is a flowchart showing the operation of the learning device 10 according to the first embodiment of the present invention.

The video input unit 101 and the audio input unit 102 of the learning device 10 acquire input data (step S001). In the present embodiment, the input data is a video (moving image) with audio. The video with audio is divided into video clips every short time (for example, every few seconds).

Next, the video input unit 101 takes out video data from the input data. The video input unit 101 executes image compression by performing a resolution lowering process and a frame thinning process on the acquired video (step S002).

Next, the video query extraction unit 104 extracts an embedded vector to be used as a search query for the target audio by using a DNN model that inputs video data acquired from the video input unit 101 (step S003).

Next, the voice input unit 102 extracts acoustic data from the input data. The spectrogram extraction unit 103 executes a short-time Fourier transform on the acoustic data acquired from the voice input unit 102 (step S004).

Next, the voice dictionary extraction unit 105 extracts an embedded vector to be used as a search query for acoustic effects by a DNN model that inputs acoustic data acquired from the spectrogram extraction unit 103 (step S005).

Next, the target object detection unit 106 determines whether or not the target object is included in the input data, using a DNN model that takes as input the embedded vector acquired from the video query extraction unit 104 and a DNN model that takes as input the embedded vector acquired from the voice dictionary extraction unit 105 (step S006).

The detection cost calculation unit 110 calculates the detection cost based on the determination result of the DNN model that takes as input the embedded vector acquired from the video query extraction unit 104, the determination result of the DNN model that takes as input the embedded vector acquired from the voice dictionary extraction unit 105, and the target object label stored in advance in the target object label storage unit 109 (step S007).

The embedding cost calculation unit 111 calculates the embedding cost based on the embedding vector acquired from the video query extraction unit 104, the embedding vector acquired from the voice dictionary extraction unit 105, and the cost function (step S008).

The learning cost calculation unit 112 calculates the total learning cost based on the detection cost acquired from the detection cost calculation unit 110 and the embedding cost acquired from the embedding cost calculation unit 111 (step S009).
This completes the operation of the learning device 10 shown in the flowchart of FIG. 5.

[Operation of search device]
Hereinafter, an example of the operation of the search device 20 will be described.
FIG. 6 is a flowchart showing the operation of the search device 20 according to the first embodiment of the present invention.

The search device 20 acquires a learning model from the learning device 10 (step S101). The learning model referred to here includes, for example, the parameters of the DNN model of the video query extraction unit 104 and the parameters of the DNN model of the voice dictionary extraction unit 105 of the learning device 10 after machine learning has been performed by the learning device 10. The search device 20 sets the parameters of the DNN model of the video query extraction unit 104 and the parameters of the DNN model of the voice dictionary extraction unit 105 of the learning device 10 in the DNN model of the video query extraction unit 204 and the DNN model of the voice dictionary extraction unit 205 of the search device 20, respectively.

The video input unit 201 and the audio input unit 202 of the search device 20 acquire input data (step S102). In the present embodiment, the input data is a video (moving image) with audio. The video with audio is divided into video clips every short time (for example, every few seconds).

Next, the video input unit 201 takes out video data from the input data. The video input unit 201 executes image compression by performing a resolution lowering process and a frame thinning process on the acquired video (step S103).

Next, the video query extraction unit 204 extracts an embedded vector to be used as a search query for the target audio by using a DNN model that inputs video data acquired from the video input unit 201 (step S104).

Next, the voice input unit 202 extracts acoustic data from the input data. The spectrogram extraction unit 203 executes a short-time Fourier transform on the acoustic data acquired from the voice input unit 202 (step S105).

Next, the voice dictionary extraction unit 205 extracts an embedded vector to be used as a search query for acoustic effects by a DNN model that inputs acoustic data acquired from the spectrogram extraction unit 203 (step S106).

Next, the embedded distance calculation unit 207 calculates the distance (for example, the Euclidean distance) between the embedded vector output from the video query extraction unit 204 and the embedded vector output from the voice dictionary extraction unit 205 (step S107).

Next, the voice search result output unit 208 identifies the embedded vector (voice dictionary) corresponding to the leading distance (that is, the shortest distance) when the distances calculated by the embedded distance calculation unit 207 are sorted in ascending order (step S108). The voice search result output unit 208 outputs the acoustic data corresponding to the identified embedded vector to the video output unit 209.

Next, the video output unit 209 combines the audio data acquired from the voice search result output unit 208 with the video data included in the input data acquired by the video input unit 201. Then, the video output unit 209 outputs a video with audio.
This completes the operation of the search device 20 shown in the flowchart of FIG. 6.

[Example]
Hereinafter, a working example of the search system 1 according to the first embodiment described above will be described.

The implementation conditions of this example are as follows.
-The search targets were sound effects in television broadcasting and the video effects accompanying them.
-The input data input to the video input unit and the audio input unit was video with audio equivalent to 10 days (240 hours) of television broadcasting. The input data was obtained by dividing the video with audio into clips of 6.4 [seconds] each.
-The video input unit converts the resolution of the video data acquired from the input data to 224 × 224 [pixels] and thins out the frames to 5 [fps].
-The voice input unit samples the audio contained in the input data at a sampling rate of 48 [kHz].
-The spectrogram extraction unit 103 performs a short-time Fourier transform (STFT) using a Hamming window as the window function, with a window length of 2048 [points] and a shift width of 1024 [points].
-The target object label storage unit stores, for each item of input data (each 6.4-second video clip) input to the video input unit and the audio input unit, a label indicating whether or not the search target is included.
-The embedded distance calculation unit calculates the Euclidean distance between the embedded vector output from the video query extraction unit and the embedded vector stored in the voice dictionary storage unit.
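
As a non-limiting illustration, the short-time Fourier transform under the conditions listed above (Hamming window, window length of 2048 points, shift width of 1024 points) may be sketched in Python as follows. The function names, the use of a symmetric Hamming window, the absence of padding at the clip boundaries, and the output of magnitudes only are assumptions of this sketch. A direct DFT is used only to keep the sketch free of external dependencies; a practical implementation would typically use an FFT library.

```python
import cmath
import math


def hamming(n):
    """Symmetric Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def num_frames(n_samples, win_length=2048, hop=1024):
    """Number of full frames when no padding is applied (an assumption)."""
    if n_samples < win_length:
        return 0
    return (n_samples - win_length) // hop + 1


def stft_magnitude(signal, win_length=2048, hop=1024):
    """Magnitude spectrogram: one list of win_length // 2 + 1 bins per frame."""
    window = hamming(win_length)
    n_bins = win_length // 2 + 1
    frames = []
    for start in range(0, len(signal) - win_length + 1, hop):
        segment = [signal[start + i] * window[i] for i in range(win_length)]
        spectrum = []
        for k in range(n_bins):
            # Direct DFT of the windowed segment at frequency bin k.
            acc = sum(
                segment[n] * cmath.exp(-2j * math.pi * k * n / win_length)
                for n in range(win_length)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# A 6.4-second clip sampled at 48 kHz contains 307200 samples.
frames_per_clip = num_frames(int(48000 * 6.4))
```

Under the no-padding assumption of this sketch, each 6.4-second clip yields 299 frames of 1025 frequency bins.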

The experimental results of the evaluation experiment conducted according to the above-mentioned implementation conditions will be described below. In the evaluation experiment, an evaluation experiment based on objective evaluation (hereinafter referred to as "objective evaluation experiment") and an evaluation experiment based on subjective evaluation (hereinafter referred to as "subjective evaluation experiment") were conducted. In the objective evaluation experiment and the subjective evaluation experiment, in order to show the effectiveness of the proposed method according to the present invention, the experimental results for each of the following methods were compared with each other.

(A) Comparison target method 1 (Triplet 1)
The target object detection unit of the learning device is not used, and only sample data including the target object (that is, the sound effect) is used as the input data (learning data) input to the learning device.
(B) Comparison target method 2 (Triplet 2)
The target object detection unit of the learning device is not used, and sample data including the target object (that is, the sound effect) and sample data not including the target object are used in equal proportions (half each) as the input data (learning data) input to the learning device.
(C) Proposed method
The target object detection unit of the learning device is used, and sample data including the target object (that is, the sound effect) and sample data not including the target object are used in equal proportions (half each) as the input data (learning data) input to the learning device.

Below, as the experimental results, the experimental results based on the objective evaluation and the experimental results based on the subjective evaluation are shown.

First, the experimental results of the objective evaluation experiment will be described.
In the objective evaluation experiment, the ranking accuracy, which is an evaluation index of the accuracy of the search results, was calculated for each method.

FIG. 7 is a diagram showing the results of the objective evaluation experiment in the example of the first embodiment of the present invention. In FIG. 7, "(A) Triplet 1", "(B) Triplet 2", and "(C) Proposed" show the ranking accuracy of the search results by the above-mentioned (A) comparison target method 1, (B) comparison target method 2, and (C) proposed method, respectively. Further, "Random" represents the ranking accuracy of the search results by random search.

As shown in FIG. 7, in this objective evaluation experiment, the ranking accuracies of the random search, (A) comparison target method 1, (B) comparison target method 2, and (C) the proposed method were 0.500, 0.625, 0.685, and 0.716, respectively. As described above, in this objective evaluation experiment, the proposed method (C) surpassed the other methods and achieved the highest ranking accuracy.

Next, the experimental results of the subjective evaluation experiment will be described.
In the subjective evaluation experiment, verification was performed with 19 subjects by investigating the impressions that the subjects received from each method.

Specifically, in the subjective evaluation experiment, the 19 subjects watched videos with audio (sound effects) output by (A) comparison target method 1, (B) comparison target method 2, (C) the proposed method, and the random search, and each subject evaluated the appropriateness of the audio (sound effect) given to the video by each method on a five-point scale. The evaluation value (subjective evaluation score) in this evaluation was the average value (MOS score) of the five-point evaluations by the 19 subjects.

FIG. 8 is a diagram showing the results of the subjective evaluation experiment in the example of the first embodiment of the present invention. FIG. 8 shows the subjective evaluation scores for the random search, (A) comparison target method 1, (B) comparison target method 2, and (C) the proposed method. As shown in FIG. 8, in the subjective evaluation experiment, the proposed method (C) surpassed the other methods and achieved the highest subjective evaluation score.

Further, a one-sided Mann-Whitney U test on the evaluation results of the subjective evaluation experiment shown in FIG. 8 confirmed that there is a statistically significant difference (p < 0.05) between the proposed method and the other methods. That is, the evaluation results of the above subjective evaluation experiment show that the search system 1 in the present embodiment operates effectively.

As described above, the objective evaluation experiment and the subjective evaluation experiment showed that, in estimating and searching for sound effects for moving images (video with audio) of television broadcasts, the search system 1 in the present embodiment can estimate and search for more appropriate sound effects than a conventional search system that does not use target object detection.

By providing the above configuration, the search system 1 in the first embodiment of the present invention performs machine learning using a weak label indicating whether or not the input video and audio include sound effects. Therefore, machine learning can be performed effectively while reducing the cost and labor required to create training data.

Note that the search system 1 in the first embodiment described above has a configuration in which a video is input and a sound effect corresponding to the video is estimated, but the present invention is not limited to this. For example, the search system may be configured to use the sound effect as an input and estimate the video corresponding to the sound effect.

The search system 1 in the first embodiment described above uses a weak label indicating whether or not a sound effect is included as label data, but the label data is not limited to this. For example, when a weak label indicating whether or not a human voice is included is used, it is possible to realize a system that outputs the voice of a specific speaker with respect to the input video.

Note that the search system 1 in the first embodiment described above is configured to perform a cross-modal search based on the co-occurrence relationship between video and audio, but is not limited to this. For example, the search system may be configured to perform a cross-modal search using a co-occurrence relationship between video and text information (for example, subtitles in subtitle broadcasting) or a co-occurrence relationship between sound effects and text information.

Note that FIG. 1 shows the video and sound effects at the time when the news starts in a television broadcast, but the present invention is not limited to cross-modal search based on the co-occurrence relationship between video and audio when such video content is switched. The present invention can also be applied to, for example, cross-modal search based on a co-occurrence relationship between video and audio when switching between display and non-display of a panel, a speaker, a telop, or the like.

As described in the first embodiment above, the overall cost function L_multi to which the margin normalization used in the learning cost calculation unit 112 is applied is generalized as in the above equation (4). This margin normalization may be modified as in the second to fifth embodiments described below.

<Second embodiment>
[Margin normalization for triplet loss function using inner product distance]
In the first embodiment, in the triplet loss function represented by the above equation (1), the L2 norm D(a, b) is used to measure the distance between the reference input (anchor) and each of the similar pair (positive) and the dissimilar pair (negative). Margin normalization can also be applied when this distance is measured by other means. For example, an inner product distance D(a, b) as shown by the following equation (7) may be used to measure this distance.

<Third embodiment>
[Margin normalization for multiple triplet loss functions with different margins]
Margin normalization can also be applied when, as with the embedding cost L_triplet expressed by the above equation (5) in the first embodiment, the embedding cost L_triplet consists of the sum of a plurality of triplet loss functions and the values of the margin parameter δ of the respective triplet loss functions differ.

For example, the embedding cost L_triplet represented by the above equation (5) consists of the sum of the two triplet loss terms L_inter and L_intra. When the margin parameters δ of these terms are δ_inter and δ_intra, respectively, the total learning cost L to which the margin normalization is applied is expressed by the following equation (8).
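
As a non-limiting illustration, margin normalization with a separate margin for each term may be sketched in Python as follows. The sketch assumes that each triplet loss term is divided by its own margin, that the constant λ_1 weights the normalized L_intra term, and that the detection cost L_aware is added without further weighting. These choices and the function name are assumptions of this sketch and may differ from the exact form of equation (8).

```python
def total_cost_with_term_margins(l_inter, l_intra, l_aware,
                                 delta_inter, delta_intra, lambda_1):
    """Hypothetical total learning cost with per-term margin normalization."""
    # Each triplet loss term is normalized by the margin used to compute it.
    normalized_inter = l_inter / delta_inter
    normalized_intra = l_intra / delta_intra
    return normalized_inter + lambda_1 * normalized_intra + l_aware


# Usage with hypothetical values: L_inter = 2.0 (margin 0.5),
# L_intra = 1.0 (margin 0.25), L_aware = 0.5, lambda_1 = 0.5.
cost = total_cost_with_term_margins(2.0, 1.0, 0.5, 0.5, 0.25, 0.5)
```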

<Fourth Embodiment>
[Margin normalization in multi-task learning with multi-class classification problems]
Margin normalization can be applied to multi-task learning combined with learning using any margin-independent loss function.

For example, the target object detection unit 106, the first detection result output unit 107, the second detection result output unit 108, the target object label storage unit 109, the detection cost calculation unit 110, and the learning cost calculation unit 112 of the learning device 10 according to the first embodiment can be replaced with the sound effect identification unit 106b, the first detection result output unit 107b, the second detection result output unit 108b, the sound effect identification label storage unit 109b, the detection cost calculation unit 110b, and the learning cost calculation unit 112b, respectively, which are described below. This makes it possible to perform margin normalization for multi-task learning with a multi-class classification problem.

The sound effect identification label storage unit 109b stores the label data in advance. In the present embodiment, the label data is one of "1. Program start sound", "2. Subtitle synchronization sound", "3. Item-enhanced synchronized sound", "4. Scene-enhanced sound", and "5. Other", given to each video clip of the input data input to the video input unit 101 and the audio input unit 102. The label data is, for example, in the form of a one-hot vector.

The sound effect identification label storage unit 109b is configured to include, for example, a storage medium such as a RAM, a flash memory, an EEPROM, and an HDD, or any combination of these storage media.

The sound effect identification unit 106b acquires the embedded vector output from the video query extraction unit 104 and the embedded vector output from the audio dictionary extraction unit 105. The sound effect identification unit 106b identifies the sound effect included in the input data by the DNN model that inputs the embedded vector output from the video query extraction unit 104 or the embedded vector output from the voice dictionary extraction unit 105.

The sound effect identification unit 106b performs identification using the same DNN model for both inputs, namely the embedded vector output from the video query extraction unit 104 and the embedded vector output from the voice dictionary extraction unit 105. However, the calculation using the embedded vector output from the video query extraction unit 104 as the input and the calculation using the embedded vector output from the voice dictionary extraction unit 105 as the input are performed independently.

Here, the DNN model is composed of fully connected layers having a 64-dimensional input, a 128-dimensional hidden layer, and a 5-dimensional softmax output.
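
As a non-limiting illustration, a model with these dimensions may be sketched in Python as follows. The ReLU activation in the hidden layer, the random initial weights, and all function and variable names are assumptions of this sketch; the embodiment specifies only the layer dimensions and the softmax output.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Fully connected layer with placeholder random weights and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, hidden_layer, output_layer):
    """64-dimensional embedding -> 128 hidden units (ReLU assumed) -> 5 class probabilities."""
    hidden = [max(0.0, v) for v in dense(embedding, hidden_layer)]
    return softmax(dense(hidden, output_layer))


rng = random.Random(0)
hidden_layer = make_layer(64, 128, rng)
output_layer = make_layer(128, 5, rng)

# The same layers are applied independently to the video-side embedding and
# the audio-side embedding (both hypothetical here).
video_probs = classify([0.1] * 64, hidden_layer, output_layer)
audio_probs = classify([-0.2] * 64, hidden_layer, output_layer)
```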

The sound effect identification unit 106b outputs the detection result calculated by inputting the embedded vector output from the video query extraction unit 104 to the first detection result output unit 107b. Further, the sound effect identification unit 106b outputs the detection result calculated by inputting the embedded vector output from the voice dictionary extraction unit 105 to the second detection result output unit 108b.

The first detection result output unit 107b acquires the detection result output from the sound effect identification unit 106b. Here, the detection result acquired by the first detection result output unit 107b is information identifying the sound effect included in the input data, detected by the calculation of the DNN model in the sound effect identification unit 106b that takes as input the embedded vector output from the video query extraction unit 104. The first detection result output unit 107b outputs information indicating the acquired detection result to the detection cost calculation unit 110b.

The second detection result output unit 108b acquires the detection result output from the sound effect identification unit 106b. Here, the detection result acquired by the second detection result output unit 108b is information identifying the sound effect included in the input data, detected by the calculation of the DNN model in the sound effect identification unit 106b that takes as input the embedded vector output from the voice dictionary extraction unit 105. The second detection result output unit 108b outputs information indicating the acquired detection result to the detection cost calculation unit 110b.

The detection cost calculation unit 110b acquires information indicating the detection result output from the first detection result output unit 107b and information indicating the detection result output from the second detection result output unit 108b. Further, the detection cost calculation unit 110b acquires the label data stored in the sound effect identification label storage unit 109b.

The detection cost calculation unit 110b calculates the detection cost based on the information indicating the detection result output from the first detection result output unit 107b, the information indicating the detection result output from the second detection result output unit 108b, and the label data acquired from the sound effect identification label storage unit 109b.

Here, the cross-entropy between each of the detection results, namely the information indicating the detection result output from the first detection result output unit 107b and the information indicating the detection result output from the second detection result output unit 108b, and the label data acquired from the sound effect identification label storage unit 109b is defined as the detection cost function L_class.
The detection cost calculation unit 110b outputs the detection cost function L_class to the learning cost calculation unit 112b.
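
As a non-limiting illustration, the detection cost function L_class may be sketched in Python as follows. The function names, the clipping constant `eps`, and the choice to sum the video-side and audio-side cross-entropy terms and average them over the clips are assumptions of this sketch.

```python
import math


def cross_entropy(probs, one_hot, eps=1e-7):
    """Cross-entropy between softmax outputs and a one-hot label vector."""
    # Clip each probability so that log() never receives 0 (eps is an assumption).
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, one_hot))


def classification_cost(video_probs, audio_probs, labels):
    """Hypothetical L_class: both detection results scored against the same label."""
    total = 0.0
    for p_video, p_audio, y in zip(video_probs, audio_probs, labels):
        total += cross_entropy(p_video, y) + cross_entropy(p_audio, y)
    # Averaging over the clips is an assumption of this sketch.
    return total / len(labels)


# One clip labeled with the third of the five classes; both detection results are uniform.
uniform = [0.2] * 5
l_class = classification_cost([uniform], [uniform], [[0, 0, 1, 0, 0]])
```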

The learning cost calculation unit 112b acquires the detection cost function L_class output from the detection cost calculation unit 110b. Further, the learning cost calculation unit 112b acquires the embedding cost L_triplet output from the embedding cost calculation unit 111. The learning cost calculation unit 112b calculates the total learning cost L by the following equation (9) based on the acquired detection cost function L_class and the embedding cost L_triplet.

<Fifth Embodiment>
[Margin normalization for margin increase / decrease adaptive to sample selection]
It is also possible to make the margin parameter δ dependent on the input sample data. In this case, the triplet loss function T_δ can be expressed by, for example, the following equation (10).

At this time, the value of the margin parameter δ is not constant in the process of summing the distances between vectors within a mini-batch in mini-batch learning. Therefore, the value of the margin parameter δ cannot be used as the normalization coefficient as it is. However, margin normalization can be applied by using, as the normalization coefficient, the margin parameter δ̄ (δ with an overbar) represented by the following equation (11), which is the average value of the margin over the sets of sample data.
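
As a non-limiting illustration, normalization by the average margin may be sketched in Python as follows. The sketch assumes that the triplet loss has the hinge form max(0, d_pos - d_neg + δ) with one margin per triplet, that the losses are summed over the mini-batch, and that the normalization coefficient is the arithmetic mean of the margins in the mini-batch. The function name and these forms are assumptions of this sketch.

```python
def normalized_adaptive_triplet_cost(distances, margins):
    """Sum of per-sample triplet losses divided by the mean margin of the mini-batch.

    distances: list of (d_pos, d_neg) pairs, one per triplet.
    margins:   list of sample-dependent margins, one per triplet.
    """
    total = sum(
        max(0.0, d_pos - d_neg + margin)
        for (d_pos, d_neg), margin in zip(distances, margins)
    )
    # Average margin over the mini-batch, used as the normalization coefficient.
    mean_margin = sum(margins) / len(margins)
    return total / mean_margin


# Usage with two hypothetical triplets whose margins are 0.5 and 1.5 (mean 1.0).
cost = normalized_adaptive_triplet_cost([(1.0, 1.0), (2.0, 1.0)], [0.5, 1.5])
```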

As described above, the search system in each embodiment of the present invention is a system for performing machine learning and search in cross-modal search using the co-occurrence relationship between video and audio. The co-occurrence relationship between video and audio includes, for example, the co-occurrence relationship between video and sound effects in television broadcasting as described above, the co-occurrence relationship between video of a moving vehicle and the engine sound of the vehicle, and various other relationships.

In cross-modal search, various approaches are taken, especially in terms of search technology using machine learning. However, when the training data is made large-scale and detailed in order to improve the search accuracy, the cost required for labeling increases, which poses a practical problem. On the other hand, the search system in each of the above-described embodiments performs machine learning using weak labels such as label data indicating whether or not the input video and audio include sound effects, and thus is used for creating learning data. Machine learning can be performed effectively while reducing the cost and labor required.

According to the above-described embodiment, the learning device performs learning so as to maximize or minimize the loss value calculated by the integrated function in which a plurality of loss functions are integrated. For example, the learning device is the learning device 10 in the embodiment, the plurality of loss functions are the embedding cost function L_triplet represented by equation (5) and the detection cost function L_aware in the embodiment, the integrated function is the learning cost function L represented by equation (6) in the embodiment, and the loss value is the total learning cost in the embodiment.

Further, according to the above-described embodiment, the plurality of loss functions include a first loss function whose value changes in magnitude as learning progresses, and at least one second loss function different from the first loss function. The integrated function calculates the loss value based on the normalized value of the first loss function and the value of the second loss function. For example, the first loss function is the embedded cost function L_triplet shown in equation (5) in the embodiment, the second loss function is the detection cost function L_aware, and the normalized value of the first loss function is L_triplet / δ represented by equation (6) in the embodiment.

The first loss function may be a triplet loss function, and the normalization of the first loss function may be performed by dividing the triplet loss function by a parameter whose value increases as the learning progresses. For example, the parameter is the margin parameter δ represented by equation (2) in the embodiment.
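The effect of dividing the first loss function by a growing margin can be sketched as follows. This is an illustration under stated assumptions, not the form of equations (2), (5), or (6): the linear margin schedule `margin_at`, the hinge form of `triplet_cost`, and the unweighted sum in `learning_cost` are all hypothetical choices made for the example.

```python
def margin_at(step, total_steps, delta_start=0.1, delta_end=1.0):
    """Hypothetical schedule: the margin grows linearly as learning progresses."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step / total_steps, 0.0), 1.0)
    return delta_start + (delta_end - delta_start) * progress


def triplet_cost(d_pos, d_neg, delta):
    """Hinge-type triplet cost for one set of sample data (assumed form)."""
    return max(0.0, d_pos - d_neg + delta)


def learning_cost(l_triplet, l_aware, delta):
    """Integrated cost: normalized first loss plus second loss.

    Assumption: the two terms are combined as an unweighted sum.
    """
    if delta <= 0:
        raise ValueError("the margin must be positive to normalize by it")
    return l_triplet / delta + l_aware


# With equal positive and negative distances, the raw triplet cost equals the
# margin and therefore grows during learning, while the normalized term stays 1.
for step in (0, 50, 100):
    delta = margin_at(step, 100)
    raw = triplet_cost(0.3, 0.3, delta)
    print(step, round(raw, 3), round(learning_cost(raw, 0.5, delta), 3))
```

Without the division, the first term would increasingly dominate the second as the margin grows; after normalization the two terms remain on a comparable scale throughout learning.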

Further, according to the above-described embodiment, the learning device includes an input unit that accepts input of video data and acoustic data included in the same moving image, and a learning unit that learns the co-occurrence relationship between the video data and a specific audio based on the video data, the acoustic data, and weak-label label data indicating whether or not the moving image including the video data includes the specific audio. For example, the input unit is the video input unit 101 and the audio input unit 102 in the embodiment, the learning unit is the video query extraction unit 104, the audio dictionary extraction unit 105, and the target target detection unit 106 in the embodiment, and the specific audio is a sound effect in the embodiment.

The learning unit may perform learning so as to maximize or minimize the loss value calculated by an integrated function in which a plurality of loss functions including a triplet loss function are integrated, and, when the triplet loss function is a loss function whose value changes in magnitude as the learning progresses, may normalize the value of the triplet loss function.

Further, according to the above-described embodiment, the learning device includes an input unit that accepts input of video data and acoustic data included in the same moving image, and a learning unit that performs multi-task learning of the co-occurrence relationship between the video data and audio and of the type of audio co-occurring with the video data, based on the video data, the acoustic data, and label data that classifies the audio included in the moving image including the video data. For example, the label data is the label data stored in the sound effect identification label storage unit 109b in the embodiment, and the learning unit is the sound effect identification unit 106b in the embodiment.
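One way to picture the multi-task learning described above is a single cost that adds a classification term for the audio type to the normalized embedded cost. The sketch below is only an assumption about how such a cost could be composed: the use of softmax cross-entropy for the identification task, the unweighted sum, and the names `cross_entropy` and `multitask_cost` are hypothetical and are not taken from the embodiment.

```python
import math


def cross_entropy(logits, label_index):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick for stability."""
    if not 0 <= label_index < len(logits):
        raise ValueError("label_index must select one of the classes")
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[label_index]


def multitask_cost(l_triplet, delta, logits, label_index):
    """Normalized embedded cost plus an identification cost (assumed unweighted sum).

    l_triplet:   embedded (triplet) cost for the sample
    delta:       margin used as the normalization coefficient
    logits:      unnormalized class scores for the audio type
    label_index: index of the labeled audio type
    """
    if delta <= 0:
        raise ValueError("the margin must be positive to normalize by it")
    return l_triplet / delta + cross_entropy(logits, label_index)
```

Because both tasks contribute to one scalar cost, a single optimization step updates the shared model toward both the co-occurrence relationship and the audio type.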

Further, according to the above-described embodiment, the search device is a search device that searches for audio corresponding to video data using the learning results of the above-mentioned learning device. For example, the search device is the search device 20 in the embodiment, and the learning result is the DNN model of the learned video query extraction unit 104 and the DNN model of the learned voice dictionary extraction unit 105 in the embodiment.
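The search step can be illustrated with a small nearest-neighbor lookup over embedding vectors. This is a sketch under assumptions: the Euclidean distance, the dictionary layout (audio name mapped to its embedding vector), and the names `euclidean` and `search_audio` are hypothetical, and the vectors stand in for outputs of the learned models.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def search_audio(video_query, audio_dictionary, top_k=1):
    """Return the names of the top_k audio entries closest to the video query."""
    ranked = sorted(
        audio_dictionary.items(),
        key=lambda item: euclidean(video_query, item[1]),
    )
    return [name for name, _ in ranked[:top_k]]


# Toy 2-dimensional embeddings, made up for illustration only.
audio_dictionary = {
    "engine": [1.0, 0.0],
    "applause": [0.0, 1.0],
    "rain": [0.7, 0.7],
}
print(search_audio([0.9, 0.1], audio_dictionary, top_k=2))  # ['engine', 'rain']
```

Since the audio embeddings do not depend on the query, they can be computed once and stored, so that only the video query has to be embedded at search time.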

A part or all of the configuration of the search system in each of the above-described embodiments may be realized by a computer. In that case, a program for realizing these functions may be recorded on a computer-readable recording medium, and the program recorded on the recording medium may be read and executed by a computer system. The term "computer system" as used herein includes an OS and hardware such as peripheral devices. The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or to a storage device such as a hard disk built into a computer system. The "computer-readable recording medium" may also include a medium that dynamically holds the program for a short period of time, such as a communication line used when the program is transmitted via a network such as the Internet or a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as a volatile memory inside a computer system serving as a server or a client in that case. Further, the above program may realize only a part of the above-mentioned functions, or may realize the above-mentioned functions in combination with a program already recorded in the computer system. The configuration may also be realized using a programmable logic device such as an FPGA (Field Programmable Gate Array).

Although the embodiments of the present invention have been described in detail above with reference to the drawings, the specific configuration is not limited to these embodiments, and designs and the like within a range not departing from the gist of the present invention are also included.

1 ... Search system, 10 ... Learning device, 20 ... Search device, 101 ... Video input unit, 102 ... Audio input unit, 103 ... Spectrogram extraction unit, 104 ... Video query extraction unit, 105 ... Audio dictionary extraction unit, 106 ... Target target detection unit, 106b ... Sound effect identification unit, 107, 107b ... First detection result output unit, 108, 108b ... Second detection result output unit, 109 ... Target target label storage unit, 109b ... Sound effect identification label storage unit, 110, 110b ... Detection cost calculation unit, 111 ... Embedded cost calculation unit, 112, 112b ... Learning cost calculation unit, 201 ... Video input unit, 202 ... Audio input unit, 203 ... Spectrogram extraction unit, 204 ... Video query extraction unit, 205 ... Audio dictionary extraction unit, 206 ... Audio dictionary storage unit, 207 ... Embedded distance calculation unit, 208 ... Audio search result output unit, 209 ... Video output unit

Claims

A learning device that performs learning so as to maximize or minimize a loss value calculated by an integrated function in which a plurality of loss functions are integrated, wherein
the plurality of loss functions include a first loss function whose value changes in magnitude as the learning progresses, and at least one second loss function different from the first loss function, and
the integrated function calculates the loss value based on a value obtained by normalizing the value of the first loss function and on the value of the second loss function.
The learning device according to claim 1, wherein the first loss function is a triplet loss function, and
the normalization of the first loss function is performed by dividing the triplet loss function by a parameter whose value increases as the learning progresses.
A learning device comprising:
an input unit that accepts input of video data and acoustic data included in the same moving image; and
a learning unit that learns a co-occurrence relationship between the video data and a specific audio based on the video data, the acoustic data, and weak-label label data indicating whether or not the moving image including the video data includes the specific audio.
The learning device according to claim 3, wherein
the learning unit performs learning so as to maximize or minimize a loss value calculated by an integrated function in which a plurality of loss functions including a triplet loss function are integrated, and
when the triplet loss function is a loss function whose value changes in magnitude as the learning progresses, the learning unit normalizes the value of the triplet loss function.
A learning device comprising:
an input unit that accepts input of video data and acoustic data included in the same moving image; and
a learning unit that performs multi-task learning of a co-occurrence relationship between the video data and audio and of the type of the audio co-occurring with the video data, based on the video data, the acoustic data, and label data that classifies the audio included in the moving image including the video data.
A search device that searches for audio corresponding to video data using the learning results of the learning device according to any one of claims 1 to 5.
A learning method in which learning is performed so as to maximize or minimize a loss value calculated by an integrated function in which a plurality of loss functions are integrated, wherein
the plurality of loss functions include a first loss function whose value changes in magnitude as the learning progresses, and at least one second loss function different from the first loss function, and
the integrated function calculates the loss value based on a value obtained by normalizing the value of the first loss function and on the value of the second loss function.
A learning method comprising:
an input step of accepting input of video data and acoustic data included in the same moving image; and
a learning step of learning a co-occurrence relationship between the video data and a specific audio based on the video data, the acoustic data, and weak-label label data indicating whether or not the moving image including the video data includes the specific audio.
A learning method comprising:
an input step of accepting input of video data and acoustic data included in the same moving image; and
a learning step of performing multi-task learning of a co-occurrence relationship between the video data and audio and of the type of the audio co-occurring with the video data, based on the video data, the acoustic data, and label data that classifies the audio included in the moving image including the video data into any of a plurality of types.
A program for causing a computer to function as the learning device according to any one of claims 1 to 5.
A program for causing a computer to function as the search device according to claim 6.