WO2024052996A1 - Learning device, conversion device, learning method, conversion method, and program - Google Patents


Info

Publication number
WO2024052996A1
WO2024052996A1 (PCT/JP2022/033441; JP2022033441W)
Authority
WO
WIPO (PCT)
Prior art keywords
expression
data
target
conversion process
mask
Prior art date
Application number
PCT/JP2022/033441
Other languages
French (fr)
Japanese (ja)
Inventor
大輔 仁泉
大起 竹内
康智 大石
登 原田
邦夫 柏野
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2022/033441 priority Critical patent/WO2024052996A1/en
Publication of WO2024052996A1 publication Critical patent/WO2024052996A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Definitions

  • the present invention relates to a learning device, a conversion device, a learning method, a conversion method, and a program.
  • a technique uses machine learning to generate a mathematical model that converts the expression of data to be converted into a predetermined expression.
  • the predetermined expression is, for example, an expression required by a downstream task.
  • converting the expression of the data to be converted into a predetermined expression means an encoding process of converting the data to be converted into data expressed in a predetermined expression.
  • an object of the present invention is to provide a technique that improves the accuracy of converting the expression of data to be converted into a predetermined expression.
  • One aspect of the present invention is a learning device including a control unit that obtains, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression. The control unit executes: a first expression conversion process that converts the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, a part of 0th data expressed by a tensor, from the 0th data; a mask data expression prediction process that predicts the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process that converts the expression of second data, which is part or all of the mask data, into the target expression; and an update process that updates the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process. The first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied is the target expression conversion process.
  • One aspect of the present invention is a conversion device including: a conversion target acquisition unit that acquires data whose expression is to be converted into a target expression that is a predetermined expression; a control unit that obtains, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into the target expression; and an expression conversion unit that converts, using the target expression conversion process, the expression of the data acquired by the conversion target acquisition unit. The control unit executes a first expression conversion process that converts the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, a part of 0th data expressed by a tensor, from the 0th data, and a mask data expression prediction process that predicts the target expression of the mask data based on the result of the first expression conversion process.
  • One aspect of the present invention is a learning method including a control step of obtaining, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression. The control step executes: a first expression conversion process that converts the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, a part of 0th data expressed by a tensor, from the 0th data; a mask data expression prediction process that predicts the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process that converts the expression of second data into the target expression; and an update process that updates the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process. The first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied is the target expression conversion process.
  • One aspect of the present invention is a conversion method including: a conversion target acquisition step of acquiring data whose expression is to be converted into a target expression that is a predetermined expression; a control step of obtaining, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into the target expression; and an expression conversion step of converting, using the target expression conversion process, the expression of the data acquired in the conversion target acquisition step. The control step executes a first expression conversion process that converts the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, a part of 0th data expressed by a tensor, from the 0th data, and a mask data expression prediction process that predicts the target expression of the mask data based on the result of the first expression conversion process.
  • One aspect of the present invention is a program for causing a computer to function as either the above learning device or the above converting device.
  • FIG. 1 is a diagram illustrating an example of the configuration of an expression conversion system according to an embodiment.
  • FIG. 2 is an explanatory diagram illustrating a patch in an embodiment.
  • FIG. 3 is an explanatory diagram illustrating an overview of the flow of processing executed by the learning unit in the embodiment.
  • FIG. 4 is a flowchart illustrating an example of the flow of learning processing executed by the learning unit in the embodiment.
  • FIG. 1 is a diagram showing an example of the configuration of an expression conversion system 100 according to an embodiment.
  • the expression conversion system 100 includes a learning device 1 and a conversion device 2.
  • the learning device 1 acquires, through learning, a process for converting the expression of data to be converted into a predetermined expression (hereinafter referred to as the "target expression"); this process is hereinafter referred to as the "target expression conversion process".
  • converting the representation of data into the target representation means an encoding process that converts that data into data expressed in the target representation.
  • the target expression is, for example, expression embedding.
  • the target representation is, for example, a vector of 768 floating-point values; it may instead be, for example, 1024 or 2048 floating-point values.
  • Representation conversion processing is a type of learning model.
  • the data to be converted may be any data expressed as a tensor, such as image data, acoustic signal data, natural language data, or general time series data. For example, it may be image data, a spectrogram of acoustic signal data, or a sequence whose samples are words, each word being expressed as an M-dimensional vector (M is a natural number of 1 or more) (hereinafter referred to as a "natural language sequence").
  • the data to be converted may be, for example, general time series data in which symbols other than characters and numbers are expressed as vectors.
  • the spectrogram of the acoustic signal data is obtained based on the acoustic signal data.
  • Natural language sequences are obtained based on natural language data.
  • the learning device 1 includes a learning section 10.
  • the learning unit 10 executes a first expression conversion process, a mask data expression prediction process, a second expression conversion process, and an update process. Note that the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process may be updated in advance by, for example, transfer learning or the like.
  • the first expression conversion process is a process of converting the expression to be processed into the target expression.
  • the processing target of the first expression conversion process when the learning unit 10 executes the first expression conversion process is the first data.
  • the first data is data obtained by removing mask data, which is a part of the 0th data, from the 0th data.
  • the 0th data is data expressed by a tensor.
  • the 0th data is, for example, a tensor whose elements are tensors, expressing the original data divided into patches together with data position information, which is information indicating the position of each patch.
  • the 0th data is, for example, a matrix whose elements are vectors of each patch resulting from patch division of the tensor to be divided. Patch division is a process of dividing the target of division into parts called patches.
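The patch division described above can be sketched as follows; this is a hypothetical NumPy illustration (the function name, patch sizes, and toy data are assumptions, not taken from the publication):

```python
import numpy as np

def patch_divide(spectrogram: np.ndarray, ph: int, pw: int) -> np.ndarray:
    # Divide a (frequency, time) array into non-overlapping ph x pw patches
    # and flatten each patch into a vector.  The returned matrix, whose rows
    # are patch vectors, corresponds to the 0th data before data position
    # information is added.
    F, T = spectrogram.shape
    assert F % ph == 0 and T % pw == 0, "patch size must divide the array"
    rows = []
    for i in range(0, F, ph):
        for j in range(0, T, pw):
            rows.append(spectrogram[i:i + ph, j:j + pw].reshape(-1))
    return np.stack(rows)

spec = np.arange(64.0).reshape(8, 8)   # toy 8x8 "spectrogram"
patches = patch_divide(spec, 4, 4)
print(patches.shape)                   # (4, 16): 4 patches of 16 values each
```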
  • the original data is data that corresponds to the conversion target and is expressed as a tensor. Therefore, when the conversion target is image data, the original data is image data.
  • the original data is data obtained based on acoustic signal data and expressed as a tensor.
  • such data is, for example, a spectrogram of the acoustic signal data.
  • the original data is data obtained based on natural language data and expressed as a tensor.
  • Such data is, for example, a natural language sequence.
  • when the data to be converted is data obtained based on general time series data, the original data is data obtained based on general time series data and expressed as a tensor.
  • Such data is, for example, a time series of numerical values such as stock prices and temperature.
  • the original data is, in other words, one of the following: image data; data obtained based on acoustic signal data and expressed as a tensor; data obtained based on natural language data and expressed as a tensor; or data obtained based on general time series data and expressed as a tensor.
  • One element of the tensor of the 0th data indicates information regarding one patch.
  • One patch includes one or more elements of a tensor representing the original data.
  • One element is data expressed by a 768-dimensional vector, for example. Note that the position of a patch is the position within the 0th data of the element corresponding to each patch.
  • the patch division and the assignment of data position information may be performed by the learning device 1. That is, the process of obtaining the 0th data based on the original data (hereinafter referred to as the "0th data generation process") may be executed by the learning device 1, or may be executed by another device different from the learning device 1.
  • the 0th data may be, for example, the original data itself. In such a case, there is no need to execute the 0th data generation process.
  • FIG. 2 is an explanatory diagram illustrating a patch in the embodiment. Specifically, FIG. 2 is an explanatory diagram illustrating a patch using an example in which the conversion target is acoustic signal data. More specifically, the example in FIG. 2 is an explanatory diagram illustrating a patch in a case where the acoustic signal data is a spectrogram composed of a frequency axis and a time axis.
  • FIG. 2 shows a spectrogram divided into a plurality of rectangular sections having the same size on both the frequency axis and the time axis.
  • a patch is data that expresses each divided section as a vector. For example, data representing one section of area D1 in FIG. 2 is one patch.
  • a patch is, for example, a vector indicating the pixel value of each pixel included in the corresponding section.
  • the number of elements of a vector expressing a patch is the same for every patch, and may be, for example, proportional to the number of pixels included in each section.
  • the proportionality coefficient may or may not be 1. When the proportionality coefficient is not 1, the value of each element may be a value obtained by interpolation, for example. For example, if the number of pixels included in one patch is 256 and the coefficient is 3, the patch is a 768-dimensional vector.
  • the data position information is, for example, a vector with the same number of dimensions as the number of dimensions of the vector representing the patch.
  • Each element of the 0th data is, for example, a vector expressed as a vector sum of a vector expressing a patch and a vector indicating data position information.
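The vector-sum construction of each element of the 0th data can be illustrated with a minimal sketch; the random patch vectors and the positional table below are illustrative assumptions, not from the publication:

```python
import numpy as np

def embed_patches(patch_vecs: np.ndarray, pos_info: np.ndarray) -> np.ndarray:
    # Each element of the 0th data is the vector sum of a vector expressing
    # a patch and a vector indicating its data position information; the two
    # vectors therefore must have the same number of dimensions.
    assert patch_vecs.shape == pos_info.shape
    return patch_vecs + pos_info

rng = np.random.default_rng(0)
patch_vecs = rng.normal(size=(4, 16))   # 4 patches, 16-dimensional vectors
pos_info = rng.normal(size=(4, 16))     # one positional vector per position
zeroth_data = embed_patches(patch_vecs, pos_info)   # matrix whose elements are vectors
```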
  • the 0th data is a matrix whose elements are vectors.
  • when the patch is a vector whose number of elements is proportional to the number of pixels, the 0th data is a matrix whose elements are first-order tensors. Note that, needless to say, a first-order tensor is a vector and a second-order tensor is a matrix; incidentally, a 0th-order tensor is a scalar.
  • when the proportionality coefficient is greater than 1, it is possible to suppress the increase in missing information when data is encoded in various processes such as the first representation conversion process.
  • the 0th data is obtained based on the original data.
  • the value of one element of the tensor representing the 0th data expresses, as a vector, one partition obtained when the tensor representing the original data is divided according to a predetermined rule. The dimension of the vector representing a partition may, for example, be larger than the number of elements of the tensor representing the original data that are included in that partition.
  • the mask data expression prediction process is a process of predicting the target expression of the mask data based on the result of the first expression conversion process.
  • the second expression conversion process is a process of converting the expression of the second data into a target expression based on the second data that is part or all of the mask data.
  • the update process updates the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process (hereinafter referred to as the "conversion error").
  • the conversion error may be the MSE (mean squared error) between the result of the mask data expression prediction process and the result of the second expression conversion process, or the L1 loss, which is the mean absolute value of the difference.
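The two conversion-error choices can be written as a short sketch (a hypothetical helper, not from the publication):

```python
import numpy as np

def conversion_error(pred: np.ndarray, target: np.ndarray, kind: str = "mse") -> float:
    # pred: result of the mask data expression prediction process
    # target: result of the second expression conversion process
    d = pred - target
    if kind == "mse":
        return float(np.mean(d ** 2))   # mean squared error
    return float(np.mean(np.abs(d)))    # L1: mean absolute value of the difference
```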
  • in this way, the learning unit 10 learns the first expression conversion process and the mask data expression prediction process so as to reduce the difference between the target expression of the mask data obtained based on the data excluding the mask data and the target expression of the mask data obtained based on part or all of the mask data itself.
  • the conversion device 2 converts the representation of the data to be processed using the first representation conversion process at the time when a predetermined condition regarding the end of the update (hereinafter referred to as the "update end condition") is satisfied.
  • FIG. 3 is an explanatory diagram illustrating an overview of the flow of processing executed by the learning unit 10 in the embodiment.
  • data x is an example of 0th data.
  • the learning unit 10 generates first data and second data based on the 0th data x.
  • Data D101 in the example of FIG. 3 is an example of first data, and data D102 is an example of second data.
  • Data D101 is data in which part of the 0th data is masked.
  • masking means a process in which data to be processed is not subject to processing by other predetermined processes.
  • masking means a process of restricting information access.
  • the predetermined other processing is, for example, mask data expression prediction processing.
  • An example of the result of mask processing is how masked data is handled in the next stage. For example, if the input data is the series "010101010101" and masking converts it to the series "****0101****", the data marked with "*" is not subject to processing. This is the result of the mask: "*" is not subject to the next stage of processing, that is, "*" is an example of data whose information access is restricted.
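The series example can be reproduced with a small sketch (a 12-character series is assumed here so that the input and the masked output have consistent lengths):

```python
def mask_series(series: str, mask_idx: set) -> str:
    # Positions in mask_idx become "*", i.e. data whose information access is
    # restricted and which the next stage of processing will skip.
    return "".join("*" if i in mask_idx else c for i, c in enumerate(series))

masked = mask_series("010101010101", set(range(0, 4)) | set(range(8, 12)))
print(masked)   # ****0101****
```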
  • in FIG. 3, the plain (solid-color) patches represent masked patches.
  • the patches that are not plain represent unmasked patches.
  • Patch P1 is an example of a plain patch.
  • Patch P2 is an example of a patch that is not plain.
  • a set of masked patches is an example of mask data.
  • the mask data in the first data is data that is not mask data in the second data. Therefore, the 0th data is covered by the union of the data that is not mask data in the first data and the data that is not mask data in the second data.
  • not all of the mask data in the first data necessarily needs to be data that is not mask data in the second data.
  • the data that is not mask data in the second data may be part of the mask data in the first data.
  • the determination of which data among the 0th data is to be used as the mask data in the 1st data may be performed in any manner, for example, at random.
  • the process of determining which data of the 0th data should be used as the mask data in the first data will be referred to as a first mask data determination process.
  • the second data is, for example, data determined as mask data in the first data.
  • all of the data determined as mask data in the first data does not necessarily have to be the second data.
  • an example will now be given of how to determine the second data when some, but not all, of the data determined as mask data in the first data is the second data.
  • the determination of which data, among the data determined as mask data in the first data, is to be used as the second data may be performed in any manner; for example, it may be determined randomly.
  • the process of determining which of the data determined as mask data in the first data is to be used as the second data is referred to as the second mask data determination process. Note that, as the explanation so far makes clear, when the second data is all of the data determined as mask data in the first data, the second mask data determination process does not necessarily need to be executed, because all of the data determined as mask data in the first data is then the second data.
  • the first mask data determination process and the second mask data determination process are executed by the learning unit 10 in the example of FIG.
  • the learning unit 10 does not necessarily need to execute the first mask data determination process and the second mask data determination process.
  • the generation of the first data and the second data may be executed by a device other than the learning device 1, and the generated first data and second data may be acquired by the learning unit 10. In such a case, the learning unit 10 does not perform the first mask data determination process or the second mask data determination process.
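The first and second mask data determination processes can be sketched as random index selection; the ratios, seed, and function name below are illustrative assumptions:

```python
import numpy as np

def split_masks(num_patches: int, mask_ratio: float = 0.5,
                second_ratio: float = 1.0, seed: int = 0):
    # First mask data determination process: randomly choose which patches of
    # the 0th data become mask data (removed to form the first data).
    # Second mask data determination process: randomly choose which of those
    # masked patches form the second data (here, a fraction second_ratio).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_patches)
    n_mask = int(num_patches * mask_ratio)
    mask_idx = np.sort(idx[:n_mask])    # mask data in the first data
    keep_idx = np.sort(idx[n_mask:])    # the first data
    n_second = int(len(mask_idx) * second_ratio)
    second_idx = np.sort(rng.permutation(mask_idx)[:n_second])  # second data
    return keep_idx, mask_idx, second_idx

keep_idx, mask_idx, second_idx = split_masks(10, mask_ratio=0.4, second_ratio=0.5)
```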
  • the learning unit 10 performs the first conversion process on the first data.
  • the function fθ(·) in FIG. 3 represents the first conversion process.
  • z shown in FIG. 3 represents the result of the addition process.
  • the addition process is a process of adding the same number of mask tokens as the number of elements belonging to the mask data to the result of the first conversion process on the first data.
  • the mask token is information indicating whether or not an element in the 0th data belongs to mask data.
  • the element belonging to the mask data is an element belonging to the mask data in the first data among the elements of the tensor expressing the 0th data. Therefore, an element belonging to mask data is, for example, a patch determined to be mask data in the first mask data determination process.
  • in the case of patch P1, for example, the addition process adds, to the result of the first conversion process, a vector that is the sum of information indicating the position of patch P1 in the 0th data and information indicating that it is a mask token.
  • the result of the additional process is an example of a result based on the result of the first conversion process.
  • the learning unit 10 executes the additional processing, for example.
  • the learning unit 10 executes mask data expression prediction processing on the result of the additional processing.
  • the additional process is executed after the first expression conversion process and before the mask data expression prediction process.
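The addition process can be sketched as follows; the mask token, positional table, and shapes are illustrative assumptions, not from the publication:

```python
import numpy as np

def add_mask_tokens(encoded_visible: np.ndarray, mask_idx, pos_info: np.ndarray,
                    mask_token: np.ndarray) -> np.ndarray:
    # Append to the first-conversion result one vector per masked element,
    # each being the sum of the mask token and the vector indicating that
    # element's position in the 0th data.
    appended = np.stack([mask_token + pos_info[i] for i in mask_idx])
    return np.concatenate([encoded_visible, appended], axis=0)

encoded = np.zeros((2, 3))                 # first conversion result (2 visible patches)
pos_info = np.arange(12.0).reshape(4, 3)   # one positional vector per patch position
out = add_mask_tokens(encoded, [1, 3], pos_info, np.ones(3))
```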
  • the function qφ(·) in FIG. 3 represents the mask data expression prediction process.
  • y shown in FIG. 3 represents the result of the mask data expression prediction process.
  • the learning unit 10 performs the second conversion process on the second data.
  • the function fξ(·) in FIG. 3 represents the second conversion process.
  • the function fθ representing the first conversion process and the function fξ representing the second conversion process are both parameterized functions, and their parameter values are updated by the update process.
  • the symbol fθ means a function f whose parameter value is θ, and the symbol fξ means a function f whose parameter value is ξ. Therefore, the symbols "fθ" and "fξ" indicate that the first conversion process and the second conversion process are the same function except for the difference in parameters.
  • z′ shown in FIG. 3 represents the result of the second conversion process.
  • the result of the mask data expression prediction process is the target expression of the mask data.
  • the result of the second transformation process is the target representation of all or part of the mask data. Therefore, by updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process based on the conversion error so as to reduce the conversion error, the accuracy of converting the expression of the data to be converted into the predetermined expression improves.
  • “Maximize agreement” in FIG. 3 indicates that the contents of the first expression conversion process, mask data expression prediction process, and second expression conversion process are updated so as to reduce the conversion error. That is, “Maximize agreement” in FIG. 3 indicates that the learning unit 10 executes the update process.
  • “Stop gradient” in FIG. 3 indicates that error backpropagation is not executed when updating the contents of the second representation conversion process. Note that updating the contents of a process specifically means updating the values of the parameters included in the functions executed in that process. Therefore, in the example of the second conversion process expressed by the function fξ, updating the value of the parameter ξ is an update of the contents of the second conversion process. In the example of FIG. 3, the contents of the second representation conversion process are updated not by error backpropagation but by another update process.
  • the other update process is, for example, a predetermined exponential moving average based on the contents of the first representation conversion process. With the function fθ representing the first transformation process and the function fξ representing the second transformation process, the parameters are updated as ξ ← τξ + (1 - τ)θ, where τ is a predetermined constant, for example 0.99.
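A parameter-wise sketch of this exponential moving average update; plain Python lists stand in for the parameters of the first and second conversion processes (an illustrative assumption):

```python
def ema_update(theta, xi, tau=0.99):
    # xi <- tau * xi + (1 - tau) * theta, applied element-wise; tau is the
    # predetermined constant (for example 0.99), so the target parameters xi
    # track theta slowly instead of being updated by error backpropagation.
    return [tau * x + (1.0 - tau) * t for t, x in zip(theta, xi)]

xi = ema_update(theta=[1.0, 2.0], xi=[0.0, 0.0], tau=0.9)
```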
  • FIG. 4 is a flowchart illustrating an example of the flow of learning processing executed by the learning unit 10 in the embodiment.
  • the learning unit 10 acquires first data and second data (step S101).
  • the learning unit 10 executes a first conversion process (step S102).
  • the learning unit 10 executes mask data expression prediction processing (step S103).
  • the learning unit 10 executes a second conversion process (step S104).
  • in step S105, the learning unit 10 executes the update process.
  • in step S106, the learning unit 10 determines whether the update end condition is satisfied. If the update end condition is satisfied (step S106: YES), the process ends. On the other hand, if the update end condition is not satisfied (step S106: NO), the process returns to step S101.
  • the first expression conversion process at the time when the update end condition is satisfied is the target expression conversion process.
  • during learning, the processing target of the first representation conversion process was the first data. However, the mask data is not predetermined and is not the same every time learning is performed. Therefore, the first expression conversion process at the time when the update end condition is satisfied (i.e., the target expression conversion process) can convert the expression into the target expression with high accuracy even for a processing target that does not include mask data.
  • the processing in steps S102 to S103 and the processing in step S104 may be executed in parallel.
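Steps S101 to S106 can be put together in a toy end-to-end sketch. Everything below is an illustrative assumption: the conversion processes and the predictor are linear maps, the prediction is made from a mean-pooled summary of the visible patches rather than per-patch mask tokens, and the update end condition is a fixed step count. It shows only the control flow of the learning process, not the claimed model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
theta = rng.normal(size=(D, D)) * 0.1   # first expression conversion (linear sketch)
phi = rng.normal(size=(D, D)) * 0.1     # mask data expression prediction
xi = theta.copy()                       # second expression conversion (EMA target)
tau, lr = 0.99, 0.01

for step in range(200):                           # S106: fixed update end condition
    x = rng.normal(size=(8, D))                   # S101: 0th data (8 patch vectors)
    mask = rng.permutation(8)[:4]                 # mask data indices
    keep = np.setdiff1d(np.arange(8), mask)
    mx = x[keep].mean(0)                          # summary of the first data
    h = mx @ theta                                # S102: first conversion
    y = h @ phi                                   # S103: predicted target expression
    t = (x[mask] @ xi).mean(0)                    # S104: second conversion (no backprop)
    err = y - t                                   # S105: gradient step on 0.5*||err||^2
    g_phi = np.outer(h, err)
    g_theta = np.outer(mx, phi @ err)
    phi -= lr * g_phi
    theta -= lr * g_theta
    xi = tau * xi + (1.0 - tau) * theta           # target updated by EMA, not backprop
# theta at loop end plays the role of the target expression conversion process
```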
  • FIG. 5 is a diagram showing an example of the hardware configuration of the learning device 1 of the embodiment.
  • the learning device 1 includes a control unit 11 including a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected via a bus, and executes a program.
  • the processor 91 reads a program stored in the storage unit 14 and stores the read program in the memory 92. By the processor 91 executing the read program, the learning device 1 functions as a device including a control section 11, an input section 12, a communication section 13, a storage section 14, and an output section 15.
  • the control unit 11 controls the operations of various functional units included in the learning device 1.
  • the control section 11 includes a learning section 10. Therefore, the control unit 11 executes, for example, a first expression conversion process, a mask data expression prediction process, a second expression conversion process, and an update process.
  • the control unit 11 may further execute a first mask data determination process, a second mask data determination process, or an additional process.
  • the control unit 11 controls the operation of the output unit 15, for example.
  • the control unit 11 records, in the storage unit 14, various types of information generated by executing various processes such as, for example, the first expression conversion process, the mask data expression prediction process, the second expression conversion process, and the update process.
  • the input unit 12 includes input devices such as a mouse, a keyboard, and a touch panel.
  • the input unit 12 may be configured as an interface that connects these input devices to the learning device 1.
  • the input unit 12 receives input of various information to the learning device 1.
  • the communication unit 13 includes a communication interface for connecting the learning device 1 to an external device.
  • the communication unit 13 communicates with an external device via wire or wireless.
  • the external device is, for example, a device that is the source of the 0th data.
  • the communication unit 13 acquires the 0th data by communicating with the device that is the source of the 0th data.
  • the external device is, for example, the source device of the original data.
  • the communication unit 13 acquires the original data by communicating with the device that is the source of the original data.
  • the external device is, for example, a device that transmits the acoustic signal data.
  • the communication unit 13 acquires the acoustic signal data by communicating with the device that is the source of the acoustic signal data.
  • the external device is, for example, a device that is a source of natural language data.
  • the communication unit 13 acquires the natural language data by communicating with the device that is the source of the natural language data.
  • the external device is, for example, a device that is a source of general time series data in which symbols other than characters and numbers are expressed as vectors. If the external device is a source of general time series data, the communication unit 13 acquires the general time series data by communicating with that device.
  • the external device is, for example, the conversion device 2.
  • the communication unit 13 transmits to the conversion device 2 information indicating the content of the first expression conversion process (that is, the target expression conversion process) at the time when the update end condition is satisfied.
  • the storage unit 14 is configured using a non-transitory computer-readable recording medium such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 14 stores various information regarding the learning device 1.
  • the storage unit 14 stores information input via the input unit 12 or the communication unit 13, for example.
  • the storage unit 14 stores various information generated by the operation of the control unit 11, for example.
  • the output unit 15 outputs various information.
  • the output unit 15 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, and an organic EL (Electro-Luminescence) display.
  • the output unit 15 may be configured as an interface that connects these display devices to the learning device 1.
  • the output unit 15 outputs, for example, information input to the input unit 12.
  • the output unit 15 may display, for example, the results of the processing by the control unit 11.
  • FIG. 6 is a diagram showing an example of the configuration of the control unit 11 in the embodiment.
  • the control unit 11 includes a learning unit 10, a data acquisition unit 110, a storage control unit 120, a communication control unit 130, and an output control unit 140.
  • the data acquisition unit 110 acquires data to be sent to the learning unit 10.
  • the data transmitted to the learning unit 10 may be the 0th data or may be a set of the first data and the second data.
  • when the data acquisition unit 110 acquires the 0th data, a mask data determination process and a second mask data determination process are executed, and the data acquisition unit 110 transmits the resulting set of first data and second data to the learning unit 10.
  • the data acquisition unit 110 acquires, as the original data, data expressed as a tensor that is obtained based on the acoustic signal data acquired by the input unit 12 or the communication unit 13. Furthermore, the data acquisition unit 110 acquires the 0th data based on the obtained original data by executing the 0th data generation process.
  • the data acquisition unit 110 acquires, as the original data, data expressed as a tensor that is obtained based on the natural language data acquired by the input unit 12 or the communication unit 13. Furthermore, the data acquisition unit 110 acquires the 0th data based on the obtained original data by executing the 0th data generation process.
  • the data acquisition unit 110 acquires, as the original data, data expressed as a tensor that is obtained based on the general time-series data acquired by the input unit 12 or the communication unit 13. Furthermore, the data acquisition unit 110 acquires the 0th data based on the obtained original data by executing the 0th data generation process.
  • the data acquisition unit 110 acquires the 0th data acquired by the input unit 12 or the communication unit 13.
  • the data acquisition unit 110 acquires the set of first data and second data acquired by the input unit 12 or the communication unit 13.
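The 0th data generation handled by the data acquisition unit 110 (dividing the original data, e.g. a spectrogram tensor, into patches while keeping each patch's position as the data position information) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; the function name, the patch size, and the spectrogram shape are assumptions for illustration.

```python
import numpy as np

def make_0th_data(spectrogram, patch_h, patch_w):
    """Split a 2-D spectrogram (freq x time) into non-overlapping patches.

    Returns a matrix whose rows are flattened patch vectors (the 0th data)
    and each patch's (row, col) grid position (the data position information).
    """
    F, T = spectrogram.shape
    rows, cols = F // patch_h, T // patch_w
    patches, positions = [], []
    for r in range(rows):
        for c in range(cols):
            block = spectrogram[r * patch_h:(r + 1) * patch_h,
                                c * patch_w:(c + 1) * patch_w]
            patches.append(block.reshape(-1))   # one patch -> one vector
            positions.append((r, c))
    return np.stack(patches), positions

spec = np.random.rand(80, 208)          # e.g. 80 mel bins x 208 frames (assumed)
x0, positions = make_0th_data(spec, 16, 16)
# x0.shape == (65, 256): 5 x 13 = 65 patches, each flattened to 256 values
```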
  • the storage control unit 120 records various information in the storage unit 14.
  • the communication control unit 130 controls the operation of the communication unit 13.
  • the output control section 140 controls the operation of the output section 15.
  • FIG. 7 is a diagram showing an example of the hardware configuration of the conversion device 2 in the embodiment.
  • the conversion device 2 includes a control unit 21 including a processor 93 such as a CPU and a memory 94 connected via a bus, and executes a program.
  • the conversion device 2 functions as a device including a control section 21, an input section 22, a communication section 23, a storage section 24, and an output section 25 by executing a program.
  • the processor 93 reads the program stored in the storage unit 24 and stores the read program in the memory 94.
  • the conversion device 2 functions as a device including a control section 21, an input section 22, a communication section 23, a storage section 24, and an output section 25.
  • the control unit 21 controls the operations of various functional units included in the conversion device 2.
  • the control unit 21 acquires, for example, information obtained by the learning device 1 that indicates the content of the first expression conversion process (i.e., the target expression conversion process) at the time when the update end condition is satisfied, and records it in the storage unit 24.
  • the control unit 21 executes the target expression conversion process.
  • the execution of the target expression conversion process by the control unit 21 is performed, for example, by the control unit 21 reading and executing information indicating the content of the target expression conversion process recorded in the storage unit 24.
  • the control unit 21 controls the operation of the output unit 25, for example.
  • the control unit 21 records, for example, various information generated by executing the target expression conversion process in the storage unit 24.
  • the input unit 22 includes input devices such as a mouse, a keyboard, and a touch panel.
  • the input unit 22 may be configured as an interface that connects these input devices to the conversion device 2.
  • the input unit 22 receives input of various information to the conversion device 2.
  • the communication unit 23 includes a communication interface for connecting the conversion device 2 to an external device.
  • the communication unit 23 communicates with an external device via wire or wireless.
  • the external device is, for example, a device from which data to be converted into a target representation is sent.
  • the communication unit 23 acquires data to be converted into a target expression through communication with such an external device.
  • the external device is, for example, the learning device 1.
  • the communication unit 23 acquires information indicating the content of the target expression conversion process through communication with the learning device 1.
  • the storage unit 24 is configured using a non-transitory computer-readable recording medium such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 24 stores various information regarding the conversion device 2.
  • the storage unit 24 stores information input via the input unit 22 or the communication unit 23, for example.
  • the storage unit 24 stores, for example, various information generated by the operation of the control unit 21.
  • the storage unit 24 stores, for example, the contents of the target expression conversion process.
  • the output unit 25 outputs various information.
  • the output unit 25 is, for example, a communication interface communicably connected to a device that executes a downstream task.
  • the output unit 25 may include a display device such as a CRT display, a liquid crystal display, or an organic EL display.
  • the output unit 25 may be configured as an interface that connects these display devices to the conversion device 2.
  • the output unit 25 outputs the information input to the input unit 22, for example.
  • the output unit 25 may output, for example, the execution result of the target expression conversion process.
  • FIG. 8 is a diagram showing an example of the configuration of the control section 21 in the embodiment.
  • the control unit 21 includes a conversion target acquisition unit 210, an expression conversion unit 220, a storage control unit 230, a communication control unit 240, and an output control unit 250.
  • the conversion target acquisition unit 210 acquires data that is input to the communication unit 23 and is the target of expression conversion into a target expression.
  • the expression conversion unit 220 converts the expression of the data acquired by the conversion target acquisition unit 210 using target expression conversion processing.
  • the expression conversion unit 220 may execute the various processes executed by the data acquisition unit 110, such as the 0th data generation process, so that the target expression conversion process can be executed according to the data acquired by the conversion target acquisition unit 210.
  • the storage control unit 230 records various information in the storage unit 24.
  • the communication control unit 240 controls the operation of the communication unit 23.
  • the output control section 250 controls the operation of the output section 25.
  • FIG. 9 is a flowchart showing an example of the flow of processing executed by the conversion device 2 in the embodiment.
  • the conversion target acquisition unit 210 acquires data that is input to the communication unit 23 and is the target of expression conversion into a target expression (step S201).
  • the expression conversion unit 220 converts the expression of the data acquired in step S201 using target expression conversion processing (step S202).
  • the output control unit 250 controls the operation of the output unit 25 to output the result of step S202 to the output unit 25 (step S203).
  • the output destination of the output unit 25 may be, for example, a device that executes a downstream task.
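The flow of steps S201 to S203 can be illustrated with a minimal sketch. The linear map standing in for the learned target expression conversion process, and all shapes, are assumptions for illustration, not the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the learned target expression conversion process
# received from the learning device 1 (a single linear map here).
W_learned = rng.normal(size=(256, 64)) * 0.01

def target_expression_conversion(x0):
    """Convert each patch of the acquired data into the target expression."""
    return x0 @ W_learned

x0 = rng.normal(size=(65, 256))        # S201: acquire the conversion-target data
z = target_expression_conversion(x0)   # S202: convert to the target expression
print(z.shape)                         # S203: output, e.g. to a downstream task
# prints (65, 64)
```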
  • FIG. 10 is a diagram showing an example of the results of an experiment in the embodiment.
  • “ESC50” and “US8K” both indicate the task of classifying environmental sounds.
  • SPCV2 indicates the task of voice command word identification.
  • VC1 indicates a speaker identification task.
  • VF indicates the task of vocal language classification.
  • CRM-D indicates a task of classifying emotions contained in speech.
  • GTZAN indicates a music genre classification task.
  • “NSynth” indicates a task of classifying musical instruments.
  • “Surge” indicates a task of classifying musical tones.
  • MAE indicates MAE (Masked Autoencoders), which is an existing technology.
  • MABL indicates the target expression conversion process. The results in FIG. 10 show that, for all downstream tasks, the accuracy of a downstream task that uses the result of expression conversion by the target expression conversion process is higher than the accuracy of the same task using the result of expression conversion by MAE.
  • for example, the accuracy of the downstream task “ESC50” using the result of expression conversion by MAE is 87.35%, whereas the accuracy of the downstream task “ESC50” using the result of expression conversion by the target expression conversion process is 89.03%.
  • the learning device 1 configured in this way trains the first expression conversion process so as to reduce the difference between the expression of the mask data predicted based on the data excluding the mask data and the expression of the mask data obtained based on part or all of the mask data.
  • MAE restores the patch image of the masked part from the representation of the non-masked patches output by the model, and calculates the loss using the difference between the input signal and the restored signal.
  • in restoration (that is, decoding), errors may occur in the restored information. Therefore, unlike MAE, the learning device 1, which does not perform restoration during learning, can improve the accuracy of converting the expression of data to be converted into a predetermined expression (i.e., the target expression).
  • learning by the learning device 1 is different from data2vec.
  • in data2vec, all patches are input to obtain the target representation from the moving-average model, and only the masked part is used as a teacher signal, so the target representation of the masked part contains information from the non-masked part.
  • the mask portion refers to mask data
  • the non-mask portion refers to data other than mask data among the 0th data.
  • in learning by the learning device 1, a target representation of part or all of the mask data is obtained in the second representation conversion process without using the data of the 0th data that is not mask data. The result of the second representation conversion process is then compared with the result of the mask data representation prediction process. That is, in learning by the learning device 1, the result of the second representation conversion process is used as the teacher signal. As described above, the result of the second representation conversion process does not include information on the non-masked portion. Therefore, unlike data2vec, the learning device 1 can improve the accuracy of converting the expression of data to be converted into a predetermined expression.
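The learning scheme described above can be sketched as follows: only the non-masked patches are encoded by the first expression conversion, a predictor estimates the target expression of the masked part, and the teacher signal comes from a second expression conversion that sees only the masked patches, so it contains no information from the non-masked part. This is a deliberately simplified numpy sketch under assumed shapes: single linear maps stand in for the actual networks, mean pooling stands in for per-patch prediction, and the exponential-moving-average teacher update is an assumption not specified in the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D = 256, 64                    # patch dimension, target-expression dimension

W_online = rng.normal(size=(D_in, D)) * 0.01   # 1st expression conversion
W_pred   = rng.normal(size=(D, D)) * 0.01      # mask data expression prediction
W_target = W_online.copy()                     # 2nd expression conversion (teacher)

def training_step(x0, mask_ratio=0.5, lr=1e-2, ema=0.99):
    """One update: compare the predicted target expression of the masked part
    with a teacher signal computed from ONLY the masked patches."""
    global W_online, W_pred, W_target
    n = x0.shape[0]
    idx = rng.permutation(n)
    n_mask = int(n * mask_ratio)
    masked, visible = idx[:n_mask], idx[n_mask:]

    v = (x0[visible] @ W_online).mean(axis=0)      # encode visible patches only
    z_pred = v @ W_pred                            # predict masked-part expression
    target = (x0[masked] @ W_target).mean(axis=0)  # teacher: masked patches only

    diff = z_pred - target
    loss = float((diff ** 2).mean())

    # gradient step on the online encoder and predictor (teacher gets no gradient)
    m = x0[visible].mean(axis=0)
    g_pred = (2.0 / D) * np.outer(v, diff)
    g_online = (2.0 / D) * np.outer(m, W_pred @ diff)
    W_pred -= lr * g_pred
    W_online -= lr * g_online
    # teacher follows the online encoder by exponential moving average (assumed)
    W_target = ema * W_target + (1.0 - ema) * W_online
    return loss

x0 = rng.normal(size=(64, 256))   # 64 patches of the 0th data (assumed shape)
loss = training_step(x0)          # one update; the teacher saw only masked patches
```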
  • the ratio of mask data may be 50% of the data included in the 0th data.
  • in this case, the accuracy of converting the expression of the data to be converted into the predetermined expression is improved compared to when the ratio is not 50%, for the following reason (hereinafter referred to as the "first reason").
  • when the ratio of mask data is 50%, the proportion of data that is not mask data is 50% or less of the data included in the 0th data, and the proportion of data that is part of the mask data and used in the second expression conversion process may also be 50% or less of the data included in the 0th data. Since the difficulty of modeling is higher in such a case, the accuracy of converting the expression of the data to be converted into the predetermined expression is improved compared to other cases.
  • the learning device 1 and the conversion device 2 may each be implemented using a plurality of information processing devices that are communicably connected via a network.
  • each functional unit included in each of the learning device 1 and the conversion device 2 may be distributed and implemented in a plurality of information processing devices.
  • learning device 1 and the conversion device 2 do not necessarily need to be implemented as different devices.
  • the learning device 1 and the conversion device 2 may be implemented, for example, as one device that has both functions.
  • all or part of each function of the expression conversion system 100, the learning device 1, and the conversion device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array).
  • the program may be recorded on a computer-readable recording medium.
  • the computer-readable recording medium is, for example, a portable medium such as a flexible disk, magneto-optical disk, ROM, or CD-ROM, or a storage device such as a hard disk built into a computer system.
  • the program may be transmitted via a telecommunications line.

Abstract

An embodiment of the present invention is a learning device provided with a control unit that obtains, through learning, a target representation conversion process for converting the representation of data to be converted into a target representation, which is a predetermined specific representation, wherein the control unit performs: a first representation conversion process in which the representation of a processing target, which is first data obtained by removing mask data from 0th data that is data represented by a tensor, is converted into a target representation; a mask data representation prediction process in which the target representation of the mask data is predicted on the basis of the result of the first representation conversion process; a second representation conversion process in which the representation of second data, which is part or all of the mask data, is converted into a target representation; and an update process in which the content of the first representation conversion process is updated so as to reduce the difference between the result of the mask data representation prediction process and the result of the second representation conversion process. The first representation conversion process at the point in time when a prescribed condition for ending the updating is satisfied is said target representation conversion process.

Description

Learning device, conversion device, learning method, conversion method, and program

The present invention relates to a learning device, a conversion device, a learning method, a conversion method, and a program.
A technique is known that uses machine learning to generate a mathematical model that converts the expression of data to be converted into a predetermined expression. The predetermined expression is, for example, an expression required by a downstream task. Note that converting the expression of the data to be converted into a predetermined expression means an encoding process of converting the data to be converted into data expressed in the predetermined expression.
So far, MAE (Masked Autoencoders) and data2vec have been proposed as such techniques. Both of these mask part of the input information and perform learning using the result. However, both techniques sometimes have poor conversion accuracy.
In view of the above circumstances, an object of the present invention is to provide a technique that improves the accuracy of converting the expression of data to be converted into a predetermined expression.
One aspect of the present invention is a learning device including a control unit that obtains, through learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression, wherein the control unit executes: a first expression conversion process of converting the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, which is part of 0th data that is data expressed by a tensor, from the 0th data; a mask data expression prediction process of predicting the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process of converting, based on second data that is part or all of the mask data, the expression of the second data into the target expression; and an update process of updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process, and the first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied is the target expression conversion process.
One aspect of the present invention is a conversion device including: a conversion target acquisition unit that acquires data to be subjected to expression conversion into a target expression that is a predetermined expression; and an expression conversion unit that converts the expression of the data acquired by the conversion target acquisition unit using the target expression conversion process obtained by a learning device including a control unit that obtains the target expression conversion process through learning, wherein the control unit executes: a first expression conversion process of converting the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, which is part of 0th data that is data expressed by a tensor, from the 0th data; a mask data expression prediction process of predicting the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process of converting, based on second data that is part or all of the mask data, the expression of the second data into the target expression; and an update process of updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process, the first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied being the target expression conversion process.
One aspect of the present invention is a learning method including a control step of obtaining, through learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression, wherein the control step executes: a first expression conversion process of converting the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, which is part of 0th data that is data expressed by a tensor, from the 0th data; a mask data expression prediction process of predicting the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process of converting, based on second data that is part or all of the mask data, the expression of the second data into the target expression; and an update process of updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process, and the first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied is the target expression conversion process.
One aspect of the present invention is a conversion method including: a conversion target acquisition step of acquiring data to be subjected to expression conversion into a target expression that is a predetermined expression; and an expression conversion step of converting the expression of the data acquired in the conversion target acquisition step using the target expression conversion process obtained by a learning method including a control step of obtaining the target expression conversion process through learning, wherein the control step executes: a first expression conversion process of converting the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, which is part of 0th data that is data expressed by a tensor, from the 0th data; a mask data expression prediction process of predicting the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process of converting, based on second data that is part or all of the mask data, the expression of the second data into the target expression; and an update process of updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process, the first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied being the target expression conversion process.
One aspect of the present invention is a program for causing a computer to function as either the above learning device or the above conversion device.
According to the present invention, it is possible to improve the accuracy of converting the expression of data to be converted into a predetermined expression.
FIG. 1 is a diagram showing an example of the configuration of the expression conversion system of the embodiment.
FIG. 2 is an explanatory diagram illustrating patches in the embodiment.
FIG. 3 is an explanatory diagram outlining the flow of processing executed by the learning unit in the embodiment.
FIG. 4 is a flowchart showing an example of the flow of learning processing executed by the learning unit in the embodiment.
FIG. 5 is a diagram showing an example of the hardware configuration of the learning device of the embodiment.
FIG. 6 is a diagram showing an example of the configuration of the control unit in the embodiment.
FIG. 7 is a diagram showing an example of the hardware configuration of the conversion device in the embodiment.
FIG. 8 is a diagram showing an example of the configuration of the control unit in the embodiment.
FIG. 9 is a flowchart showing an example of the flow of processing executed by the conversion device in the embodiment.
FIG. 10 is a diagram showing an example of the results of an experiment in the embodiment.
(Embodiment)

FIG. 1 is a diagram showing an example of the configuration of an expression conversion system 100 according to the embodiment. The expression conversion system 100 includes a learning device 1 and a conversion device 2. The learning device 1 obtains, through learning, a process of converting the expression of data to be converted into a predetermined expression (hereinafter referred to as the "target expression"); this process is hereinafter referred to as the "target expression conversion process".
Note that converting the expression of data into the target expression means an encoding process of converting the data to be encoded into data expressed in the target expression. Therefore, converting the expression of the data to be converted into the target expression means an encoding process of converting the data to be converted into data expressed in the target expression.
The target expression is, for example, an expression embedding. The target expression is, for example, a representation as 768 floating-point values. The target expression may instead be a representation as 1024 floating-point values, or as 2048 floating-point values. The expression conversion process is a type of learning model.
The data to be converted may be any data that is obtained based on image data, acoustic signal data, natural language data, or general time-series data and that is expressed as a tensor. Therefore, the data to be converted may be, for example, image data, a spectrogram of acoustic signal data, or a sequence whose samples are words, each word being expressed as an M-dimensional vector (M is a natural number of 1 or more) (hereinafter referred to as a "natural language sequence"). The data to be converted may also be, for example, general time-series data in which symbols and numbers other than letters are expressed as vectors.
The spectrogram of the acoustic signal data is obtained based on the acoustic signal data. The natural language sequence is obtained based on the natural language data.
The learning device 1 includes a learning unit 10. The learning unit 10 executes the first expression conversion process, the mask data expression prediction process, the second expression conversion process, and the update process. Note that the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process may have been updated in advance by, for example, transfer learning.
The first expression conversion process is a process of converting the expression of a processing target into the target expression. When the learning unit 10 executes the first expression conversion process, the processing target of the first expression conversion process is the first data.
The first data is data obtained by removing the mask data, which is part of the 0th data, from the 0th data. The 0th data is data expressed by a tensor.
The 0th data is, for example, a tensor that expresses the original data in a state where it has been divided into patches, that also has data position information, which is information indicating the position of each patch, and whose elements are tensors. The 0th data is, for example, a matrix whose elements are the vectors of the patches resulting from patch division of the tensor to be divided. Patch division is a process of dividing the target of division into parts called patches.
 The original data is data that corresponds to the conversion target and is expressed as a tensor. Therefore, when the conversion target is image data, the original data is image data.
 When the data to be converted is data obtained based on acoustic signal data, the original data is data that is obtained based on the acoustic signal data and is expressed as a tensor. Such data is, for example, a spectrogram of the acoustic signal data.
 When the data to be converted is data obtained based on natural language data, the original data is data that is obtained based on the natural language data and is expressed as a tensor. Such data is, for example, a natural language sequence. When the data to be converted is data obtained based on general time series data, the original data is data that is obtained based on the general time series data and is expressed as a tensor. Such data is, for example, a time series of numerical values such as stock prices or temperatures.
 In this way, the original data is any one of the following: image data; data obtained based on acoustic signal data and expressed as a tensor; data obtained based on natural language data and expressed as a tensor; and data obtained based on general time series data and expressed as a tensor.
 One element of the tensor of the 0th data indicates information regarding one patch. One patch includes one or more elements of the tensor expressing the original data. One element is, for example, data expressed by a 768-dimensional vector. Note that the position of a patch is the position, within the 0th data, of the element corresponding to that patch.
 Note that the division into patches and the assignment of data position information may be executed by the learning device 1. That is, the process of obtaining the 0th data based on the original data (hereinafter referred to as the "0th data generation process") may be executed by the learning device 1, or may be executed by another device different from the learning device 1. The 0th data may also be, for example, the original data itself. In such a case, the 0th data generation process need not be executed.
 FIG. 2 is an explanatory diagram illustrating patches in the embodiment. Specifically, FIG. 2 illustrates patches using an example in which the conversion target is acoustic signal data. More specifically, the example of FIG. 2 illustrates patches in a case where the acoustic signal data is a spectrogram composed of a frequency axis and a time axis.
 To explain patches, FIG. 2 shows the spectrogram divided into a plurality of rectangular sections of the same size along both the frequency axis and the time axis. A patch is data that expresses each of these divided sections as a vector. For example, the data expressing one section of the region D1 in FIG. 2 is one patch.
 A patch is, for example, a vector indicating the pixel value of each pixel included in the corresponding section. The number of elements of a vector expressing a patch is the same for all such vectors; it may, for example, be proportional to the number of pixels included in a section. The coefficient of proportionality may or may not be 1. When the coefficient of proportionality is not 1, the value of each element may be, for example, a value obtained by interpolation. For example, if the number of pixels included in one patch is 256 and the coefficient is 3, the patch is a 768-dimensional vector.
 The data position information is, for example, a vector with the same number of dimensions as the vector expressing a patch. Each element of the 0th data is, for example, a vector expressed as the vector sum of a vector expressing a patch and a vector indicating the data position information.
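 The construction of one element of the 0th data described above can be sketched as follows. The toy 3-dimensional integer vectors and the function name are illustrative only (in practice a patch vector might be, for example, 768-dimensional):

```python
def embed_patch(patch_vec, pos_vec):
    """One element of the 0th data: the vector sum of a patch vector
    and a data-position-information vector of the same dimension."""
    assert len(patch_vec) == len(pos_vec)
    return [p + q for p, q in zip(patch_vec, pos_vec)]

patch = [5, -10, 20]   # vector expressing a patch (toy 3-dim example)
pos   = [1, 2, 3]      # data position information, same dimension
element = embed_patch(patch, pos)
# element == [6, -8, 23]
```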
 In the example of FIG. 2, the position of each patch is given by its position in the frequency axis direction and in the time axis direction. Therefore, in the example of FIG. 2, the 0th data is a matrix whose elements are vectors. In the example of FIG. 2, when a patch is a vector whose number of elements is proportional to the number of pixels, the 0th data is a matrix whose elements are first-order tensors. Note that it goes without saying that a first-order tensor is a vector and a second-order tensor is a matrix. Incidentally, a zeroth-order tensor is a scalar.
 Note that when the coefficient of proportionality is greater than 1, it is possible to suppress an increase in the loss of information when the data is encoded in various processes such as the first expression conversion process.
 In this way, the 0th data is obtained based on the original data. The value of one element of the tensor expressing the 0th data expresses, as a vector, one of the sections obtained when the tensor expressing the original data is divided according to a predetermined rule. The dimension of the vector expressing a section is, for example, larger than the number of elements of the tensor expressing the original data that are included in that section.
 Returning to the explanation of FIG. 1, the mask data expression prediction process is a process of predicting the target expression of the mask data based on the result of the first expression conversion process. The second expression conversion process is a process of converting, based on second data that is a part or all of the mask data, the expression of the second data into the target expression. The update process is a process of updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process (hereinafter referred to as the "conversion error").
 The conversion error may be the MSE (mean square error) between the result of the mask data expression prediction process and the result of the second expression conversion process, or may be the L1 error, that is, the mean absolute value of their difference.
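 The two candidate conversion errors can be written out directly. This is a generic sketch, not code from the embodiment; the vectors stand in for the flattened results of the mask data expression prediction process and the second expression conversion process:

```python
def mse(y_pred, y_true):
    """Mean square error between predicted and target representations."""
    return sum((a - b) ** 2 for a, b in zip(y_pred, y_true)) / len(y_true)

def l1(y_pred, y_true):
    """Mean absolute value of the difference (L1 error)."""
    return sum(abs(a - b) for a, b in zip(y_pred, y_true)) / len(y_true)

pred   = [1.0, 2.0, 4.0]   # e.g. result of mask data expression prediction
target = [1.0, 3.0, 2.0]   # e.g. result of second expression conversion
# mse -> (0 + 1 + 4) / 3, l1 -> (0 + 1 + 2) / 3
```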
 In this way, the learning unit 10 trains the first expression conversion process and the mask data expression prediction process so as to reduce the difference between the target expression of the mask data obtained based on the data excluding the mask data and the target expression of the mask data obtained based on a part or all of the mask data.
 The conversion device 2 converts the expression of data to be processed using the first expression conversion process as updated up to the point at which a predetermined condition regarding the end of updating (hereinafter referred to as the "update end condition") is satisfied.
 Here, an example of the flow of processing executed by the learning unit 10 will be explained with reference to the drawings.
 FIG. 3 is an explanatory diagram illustrating an overview of the flow of processing executed by the learning unit 10 in the embodiment. In FIG. 3, the data x is an example of the 0th data. In the example of FIG. 3, the learning unit 10 generates the first data and the second data based on the 0th data x. In the example of FIG. 3, the data D101 is an example of the first data, and the data D102 is an example of the second data.
 The data D101 is data in which a part of the 0th data is masked. Note that masking means a process by which data to be processed is excluded from processing by a predetermined other process. In other words, masking means a process of restricting information access. The predetermined other process is, for example, the mask data expression prediction process. An example of the result of masking can be explained by showing how masked data is handled in the next stage. For example, when the input data is the sequence "010101010101" and masking converts it into the sequence "****0101****", the data marked "*" is not subject to the next stage of processing. This is the result of masking. In this example, "*" is not subject to the next stage of processing; that is, "*" is an example of data to which information access is restricted.
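 The masking example above can be reproduced in a short sketch. The mask symbol "*" and the function name are illustrative only:

```python
def mask_sequence(seq, masked_positions, mask_symbol="*"):
    """Mask the elements at the given positions; masked elements are
    the ones excluded from the next stage of processing."""
    return "".join(mask_symbol if i in masked_positions else ch
                   for i, ch in enumerate(seq))

masked = mask_sequence("010101010101", set(range(0, 4)) | set(range(8, 12)))
# masked == "****0101****"
visible = [ch for ch in masked if ch != "*"]  # only these reach the next stage
```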
 In FIG. 3, plain patches represent masked patches, and patches that are not plain represent unmasked patches. The patch P1 is an example of a plain patch, and the patch P2 is an example of a patch that is not plain. The set of masked patches is an example of the mask data.
 In the example of FIG. 3, the mask data in the first data is data that is not mask data in the second data. Therefore, the data including the data that is not mask data in the first data and the data that is not mask data in the second data includes the 0th data.
 However, as described in the explanation of the second expression conversion process, it is not necessarily the case that all of the mask data in the first data is data that is not mask data in the second data. The data that is not mask data in the second data may be a part of the mask data in the first data.
 The determination of which data of the 0th data becomes the mask data in the first data may be performed in any manner, for example at random. Hereinafter, the process of determining which data of the 0th data becomes the mask data in the first data is referred to as the first mask data determination process.
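 A random first mask data determination process might be sketched as follows, under the assumption that patch indices are selected uniformly at random at a fixed mask ratio; the ratio, seed, and function name are arbitrary choices for illustration:

```python
import random

def first_mask_determination(num_patches, mask_ratio, seed=0):
    """Randomly choose which patch indices of the 0th data become
    the mask data of the first data."""
    rng = random.Random(seed)          # deterministic for reproducibility
    k = int(num_patches * mask_ratio)  # number of patches to mask
    return sorted(rng.sample(range(num_patches), k))

masked_idx = first_mask_determination(num_patches=12, mask_ratio=0.5)
# 6 of the 12 patch indices are selected as mask data
```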
 As described above, the second data is, for example, the data determined as the mask data in the first data. However, as described above, not all of the data determined as mask data in the first data necessarily has to be the second data. An example will be given of how the determination is made when the second data is a part, rather than all, of the data determined as mask data in the first data. In such a case, the determination of which of the data determined as mask data in the first data is used as the mask data may likewise be performed in any manner, for example at random.
 Hereinafter, the process of determining which of the data determined as mask data in the first data is used as mask data is referred to as the second mask data determination process. Note that, as can be seen from the explanation so far, when the second data is all of the data determined as mask data in the first data, the second mask data determination process does not necessarily have to be executed, because in that case the data determined as mask data in the first data is itself the second data.
 In the example of FIG. 3, the first mask data determination process and the second mask data determination process are executed by the learning unit 10. However, they do not necessarily have to be executed by the learning unit 10. For example, a device other than the learning device 1 may carry out the processing up to the generation of the first data and the second data, and the generated first data and second data may be input to the learning unit 10. In such a case, the learning unit 10 does not execute the first mask data determination process or the second mask data determination process. Note that when the second data is all of the mask data in the first data, the second mask data determination process need not be executed.
 In the example of FIG. 3, the learning unit 10 executes the first conversion process on the first data. The function fθ(·) in FIG. 3 represents the first conversion process. The symbol zθ in FIG. 3 represents the result of the addition process. The addition process is a process of adding, to the result of the first conversion process on the first data, as many mask tokens as there are elements belonging to the mask data. A mask token is information indicating, for an element in the 0th data, whether or not that element belongs to the mask data.
 An element belonging to the mask data is an element, among the elements of the tensor expressing the 0th data, that belongs to the mask data in the first data. Therefore, an element belonging to the mask data is, for example, a patch determined to be mask data in the first mask data determination process.
 That is, the addition process is a process of adding, to the result of the first conversion process, a vector obtained by adding together information indicating the position of the patch P1 within the 0th data and information indicating that it is a mask token. The result of the addition process is thus an example of a result based on the result of the first conversion process. When the addition process is executed, as in the example of FIG. 3, it is executed by, for example, the learning unit 10.
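 Under this description, the addition process might be sketched as follows: for each masked position, a vector that is the sum of a mask-token vector and the corresponding position-information vector is placed into the sequence of first-conversion-process outputs. The toy 2-dimensional vectors and all names are illustrative only, not the claimed implementation:

```python
def add_mask_tokens(encoded_visible, masked_positions, total_len,
                    mask_token, pos_info):
    """Rebuild a full-length sequence: first-conversion-process outputs
    at visible positions, and (mask token + position information)
    vectors at masked positions."""
    out, it = [], iter(encoded_visible)
    for i in range(total_len):
        if i in masked_positions:
            out.append([m + p for m, p in zip(mask_token, pos_info[i])])
        else:
            out.append(next(it))
    return out

# toy 2-dim vectors; a real model might use e.g. 768-dim embeddings
pos_info = [[i, i] for i in range(4)]       # position-information vectors
z = add_mask_tokens([[9, 9], [8, 8]],       # outputs for the visible patches
                    masked_positions={1, 3},
                    total_len=4,
                    mask_token=[100, 100],
                    pos_info=pos_info)
# z == [[9, 9], [101, 101], [8, 8], [103, 103]]
```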
 In the example of FIG. 3, the learning unit 10 executes the mask data expression prediction process on the result of the addition process. The addition process is thus executed after the first expression conversion process and before the mask data expression prediction process. The function qθ(·) in FIG. 3 represents the mask data expression prediction process, and yθ in FIG. 3 represents the result of the mask data expression prediction process.
 In the example of FIG. 3, the learning unit 10 executes the second conversion process on the second data. The function fξ(·) in FIG. 3 represents the second conversion process. The function fθ representing the first conversion process and the function fξ representing the second conversion process are both parameterized functions whose parameter values are updated by the update process.
 The symbol fθ means the function f whose parameter values are θ, and the symbol fξ means the function f whose parameter values are ξ. Therefore, the symbols fθ and fξ indicate that, apart from the difference in parameters, the function of the first conversion process and the function of the second conversion process are the same. The symbol z′ξ in FIG. 3 represents the result of the second conversion process.
 The result of the mask data expression prediction process is the target expression of the mask data. The result of the second conversion process is the target expression of all or part of the mask data. Therefore, by updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process based on the conversion error so as to reduce it, the accuracy of converting the expression of the data to be converted into the predetermined expression improves.
 "Maximize agreement" in FIG. 3 indicates that the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process are updated so as to reduce the conversion error. That is, "Maximize agreement" in FIG. 3 indicates that the learning unit 10 executes the update process.
 "Stop gradient" in FIG. 3 indicates that backpropagation is not executed when the content of the second expression conversion process is updated. Note that updating the content of a process specifically means updating the values of the parameters included in the function executed by that process. Therefore, in the example of the second conversion process expressed by the function fξ, updating the value of the parameter ξ constitutes updating the content of the second conversion process. In the example of FIG. 3, the content of the second expression conversion process is updated not by backpropagation but by another update process, for example a predetermined exponential moving average process based on the content of the first expression conversion process.
 An example of the exponential moving average process will now be described. The contents of the first conversion process and of the second conversion process change, specifically, when the values of the parameters of the parameterized functions expressing each process are changed. The parameterized function expressing the first conversion process is, for example, the function fθ, and the parameterized function expressing the second conversion process is, for example, the function fξ. In this case, the value of the parameter ξ after the t-th update obtained by the exponential moving average process is, for example, ξ[t] = βξ[t-1] + (1-β)θ[t-1], where β is a predetermined constant, for example 0.99.
 In fact, ξ[t] = βξ[t-1] + (1-β)θ[t-1] is a quantity obtained by accumulating the value of θ at each update with predetermined weights; the exponential moving average process therefore yields a weighted average of the values of θ over the updates. This average converges as the updates progress and the accuracy increases. The exponential moving average process therefore makes it possible to update ξ toward values with high conversion accuracy. Once the value of ξ converges to a value with high conversion accuracy, the value of θ also converges to a value with high conversion accuracy as a result of the update process that reduces the conversion error.
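 The update ξ[t] = βξ[t-1] + (1-β)θ[t-1] can be written out directly. This is a generic exponential-moving-average step applied elementwise, not code from the embodiment:

```python
def ema_update(xi, theta, beta=0.99):
    """One exponential-moving-average step, applied elementwise:
    xi[t] = beta * xi[t-1] + (1 - beta) * theta[t-1]."""
    return [beta * x + (1.0 - beta) * t for x, t in zip(xi, theta)]

xi = [0.0, 0.0]         # parameters of the second conversion process
theta = [1.0, 2.0]      # parameters of the first conversion process
xi = ema_update(xi, theta, beta=0.9)
# xi is now approximately [0.1, 0.2]
```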
 FIG. 4 is a flowchart illustrating an example of the flow of the learning process executed by the learning unit 10 in the embodiment. The learning unit 10 acquires the first data and the second data (step S101). Next, the learning unit 10 executes the first conversion process (step S102). Next, the learning unit 10 executes the mask data expression prediction process (step S103). Next, the learning unit 10 executes the second conversion process (step S104).
 Next, the learning unit 10 executes the update process (step S105). The learning unit 10 then determines whether the update end condition is satisfied (step S106). If the update end condition is satisfied (step S106: YES), the process ends. If the update end condition is not satisfied (step S106: NO), the process returns to step S101.
 The first expression conversion process at the point at which the update end condition is satisfied is the target expression conversion process. During learning by the learning unit 10, the processing target of the first expression conversion process is the first data. However, the mask data is not predetermined and is not the same at every learning iteration. Therefore, the first expression conversion process at the point at which the update end condition is satisfied (that is, the target expression conversion process) can convert, with high accuracy, the expression of a processing target that contains no mask data into the target expression. Note that the processing of steps S102 to S103 and the processing of step S104 may be executed in parallel.
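 The flow of steps S101 to S106 can be sketched as a loop over placeholder callables. Nothing here reflects the internal structure of the actual processes; the toy instantiation below only exercises the control flow:

```python
def train(get_batch, f_theta, q_theta, f_xi, update, done):
    """Sketch of the learning flow of FIG. 4 (steps S101-S106).
    All callables are placeholders for the processes in the text."""
    while True:
        d1, d2 = get_batch()   # S101: acquire first data and second data
        z = f_theta(d1)        # S102: first conversion process
        y = q_theta(z)         # S103: mask data expression prediction
        z2 = f_xi(d2)          # S104: second conversion process
        update(y, z2)          # S105: update to reduce the conversion error
        if done():             # S106: update end condition
            break

# toy instantiation: count calls and stop after three update steps
calls = []
state = {"step": 0}
def get_batch(): return ("d1", "d2")
def f_theta(x): calls.append("f_theta"); return x
def q_theta(z): calls.append("q_theta"); return z
def f_xi(x): calls.append("f_xi"); return x
def update(y, z2): state["step"] += 1
def done(): return state["step"] >= 3

train(get_batch, f_theta, q_theta, f_xi, update, done)
# three iterations, i.e. three update steps, were executed
```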
 FIG. 5 is a diagram showing an example of the hardware configuration of the learning device 1 of the embodiment. The learning device 1 includes a control unit 11 that includes a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected by a bus, and executes a program. By executing the program, the learning device 1 functions as a device including the control unit 11, an input unit 12, a communication unit 13, a storage unit 14, and an output unit 15.
 More specifically, the processor 91 reads the program stored in the storage unit 14 and stores the read program in the memory 92. When the processor 91 executes the program stored in the memory 92, the learning device 1 functions as a device including the control unit 11, the input unit 12, the communication unit 13, the storage unit 14, and the output unit 15.
 The control unit 11 controls the operation of the various functional units included in the learning device 1. The control unit 11 includes the learning unit 10. The control unit 11 therefore executes, for example, the first expression conversion process, the mask data expression prediction process, the second expression conversion process, and the update process. The control unit 11 may further execute the first mask data determination process, the second mask data determination process, or the addition process.
 The control unit 11 controls, for example, the operation of the output unit 15. The control unit 11 records, in the storage unit 14, various kinds of information generated by the execution of various processes such as the first expression conversion process, the mask data expression prediction process, the second expression conversion process, and the update process.
 The input unit 12 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 12 may be configured as an interface that connects these input devices to the learning device 1. The input unit 12 receives input of various kinds of information to the learning device 1.
 The communication unit 13 includes a communication interface for connecting the learning device 1 to an external device. The communication unit 13 communicates with the external device by wire or wirelessly. The external device is, for example, the device from which the 0th data is transmitted. When the external device is the device from which the 0th data is transmitted, the communication unit 13 acquires the 0th data by communicating with that device.
 The external device is, for example, the device from which the original data is transmitted. When the external device is the device from which the original data is transmitted, the communication unit 13 acquires the original data by communicating with that device.
 The external device is, for example, the device from which the acoustic signal data is transmitted. When the external device is the device from which the acoustic signal data is transmitted, the communication unit 13 acquires the acoustic signal data by communicating with that device.
 The external device is, for example, the device from which the natural language data is transmitted. When the external device is the device from which the natural language data is transmitted, the communication unit 13 acquires the natural language data by communicating with that device.
 The external device is, for example, the device from which general time series data, such as vector representations of symbols and numerical values other than characters, is transmitted. When the external device is the device from which the general time series data is transmitted, the communication unit 13 acquires the general time series data by communicating with that device.
 Note that the various kinds of information input to the communication unit 13 may be input to the input unit 12 instead of the communication unit 13.
 The external device is, for example, the conversion device 2. By communicating with the conversion device 2, the communication unit 13 transmits to the conversion device 2 information indicating the content of the first expression conversion process (that is, the target expression conversion process) at the point at which the update end condition is satisfied.
 The storage unit 14 is configured using a non-transitory computer-readable recording medium such as a magnetic hard disk device or a semiconductor storage device. The storage unit 14 stores various kinds of information regarding the learning device 1. The storage unit 14 stores, for example, information input via the input unit 12 or the communication unit 13, and various kinds of information generated by the operation of the control unit 11.
 The output unit 15 outputs various information. The output unit 15 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro-Luminescence) display. The output unit 15 may be configured as an interface that connects these display devices to the learning device 1. The output unit 15 outputs, for example, information input to the input unit 12. The output unit 15 may display, for example, the results of processing by the control unit 11.
 FIG. 6 is a diagram showing an example of the configuration of the control unit 11 in the embodiment. The control unit 11 includes a learning unit 10, a data acquisition unit 110, a storage control unit 120, a communication control unit 130, and an output control unit 140.
 The data acquisition unit 110 acquires data to be sent to the learning unit 10. The data sent to the learning unit 10 may be the 0th data, or may be a set of first data and second data. When the data acquisition unit 110 sends a set of first data and second data to the learning unit 10 and the data acquisition unit 110 acquires the 0th data, the data acquisition unit 110 executes a first mask data determination process and a second mask data determination process.
 When the input unit 12 or the communication unit 13 acquires acoustic signal data, the data acquisition unit 110 acquires, as original data, data that is obtained based on the acoustic signal data acquired by the input unit 12 or the communication unit 13 and that is expressed as a tensor. The data acquisition unit 110 then acquires the 0th data based on the obtained original data by executing a 0th data generation process.
 When the input unit 12 or the communication unit 13 acquires natural language data, the data acquisition unit 110 acquires, as original data, data that is obtained based on the natural language data acquired by the input unit 12 or the communication unit 13 and that is expressed as a tensor. The data acquisition unit 110 then acquires the 0th data based on the obtained original data by executing the 0th data generation process.
 When the input unit 12 or the communication unit 13 acquires general time-series data in which symbols and numbers other than characters are expressed as vectors, the data acquisition unit 110 acquires, as original data, data that is obtained based on the general time-series data acquired by the input unit 12 or the communication unit 13 and that is expressed as a tensor. The data acquisition unit 110 then acquires the 0th data based on the obtained original data by executing the 0th data generation process.
 When the input unit 12 or the communication unit 13 acquires the 0th data, the data acquisition unit 110 acquires the 0th data acquired by the input unit 12 or the communication unit 13. When the input unit 12 or the communication unit 13 acquires a set of first data and second data, the data acquisition unit 110 acquires the set of first data and second data acquired by the input unit 12 or the communication unit 13.
 The storage control unit 120 records various information in the storage unit 14. The communication control unit 130 controls the operation of the communication unit 13. The output control unit 140 controls the operation of the output unit 15.
 FIG. 7 is a diagram showing an example of the hardware configuration of the conversion device 2 in the embodiment. The conversion device 2 includes a control unit 21 including a processor 93 such as a CPU and a memory 94 connected via a bus, and executes a program. By executing the program, the conversion device 2 functions as a device including the control unit 21, an input unit 22, a communication unit 23, a storage unit 24, and an output unit 25.
 More specifically, the processor 93 reads the program stored in the storage unit 24 and stores the read program in the memory 94. When the processor 93 executes the program stored in the memory 94, the conversion device 2 functions as a device including the control unit 21, the input unit 22, the communication unit 23, the storage unit 24, and the output unit 25.
 The control unit 21 controls the operations of the various functional units included in the conversion device 2. The control unit 21 acquires, for example, information obtained by the learning device 1 that indicates the content of the first expression conversion process (that is, the target expression conversion process) at the time when the update end condition is satisfied, and records it in the storage unit 24.
 The control unit 21 executes the target expression conversion process. The control unit 21 executes the target expression conversion process by, for example, reading and executing the information indicating the content of the target expression conversion process recorded in the storage unit 24. The control unit 21 controls, for example, the operation of the output unit 25. The control unit 21 records, for example, various information generated by executing the target expression conversion process in the storage unit 24.
 The input unit 22 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 22 may be configured as an interface that connects these input devices to the conversion device 2. The input unit 22 receives input of various information to the conversion device 2.
 The communication unit 23 includes a communication interface for connecting the conversion device 2 to an external device. The communication unit 23 communicates with the external device via a wired or wireless connection. The external device is, for example, the device that is the source of the data whose expression is to be converted into the target expression. The communication unit 23 acquires the data to be converted into the target expression through communication with such an external device. The external device is, for example, the learning device 1. The communication unit 23 acquires information indicating the content of the target expression conversion process through communication with the learning device 1.
 Note that the various information input to the communication unit 23 may be input to the input unit 22 instead of the communication unit 23.
 The storage unit 24 is configured using a non-transitory computer-readable recording medium such as a magnetic hard disk device or a semiconductor storage device. The storage unit 24 stores various information regarding the conversion device 2. The storage unit 24 stores, for example, information input via the input unit 22 or the communication unit 23. The storage unit 24 stores, for example, various information generated by the operation of the control unit 21. The storage unit 24 stores, for example, the content of the target expression conversion process.
 The output unit 25 outputs various information. The output unit 25 is, for example, a communication interface communicably connected to a device that executes a downstream task. The output unit 25 may include a display device such as a CRT display, a liquid crystal display, or an organic EL display. The output unit 25 may be configured as an interface that connects these display devices to the conversion device 2. The output unit 25 outputs, for example, information input to the input unit 22. The output unit 25 may output, for example, the execution result of the target expression conversion process.
 FIG. 8 is a diagram showing an example of the configuration of the control unit 21 in the embodiment. The control unit 21 includes a conversion target acquisition unit 210, an expression conversion unit 220, a storage control unit 230, a communication control unit 240, and an output control unit 250. The conversion target acquisition unit 210 acquires the data, input to the communication unit 23, whose expression is to be converted into the target expression.
 The expression conversion unit 220 converts the expression of the data acquired by the conversion target acquisition unit 210 using the target expression conversion process. The expression conversion unit 220 may execute the various processes executed by the data acquisition unit 110, such as the 0th data generation process, so that the target expression conversion process can be executed in accordance with the data acquired by the conversion target acquisition unit 210.
 The storage control unit 230 records various information in the storage unit 24. The communication control unit 240 controls the operation of the communication unit 23. The output control unit 250 controls the operation of the output unit 25.
 FIG. 9 is a flowchart showing an example of the flow of processing executed by the conversion device 2 in the embodiment. The conversion target acquisition unit 210 acquires the data, input to the communication unit 23, whose expression is to be converted into the target expression (step S201). Next, the expression conversion unit 220 converts the expression of the data acquired in step S201 using the target expression conversion process (step S202). Next, the output control unit 250 controls the operation of the output unit 25 to cause the output unit 25 to output the result of step S202 (step S203). Note that the output destination of the output unit 25 may be, for example, a device that executes a downstream task.
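The three steps S201 to S203 can be sketched as a simple pipeline. This is a minimal illustration only: the patent does not fix a concrete architecture for the learned target expression conversion process, so `target_expression_conversion` below is a hypothetical stand-in (a fixed linear projection over patch vectors), and all names and dimensions are assumptions.

```python
import numpy as np

# Hypothetical stand-in for the learned target expression conversion
# process; a real implementation would be the trained first expression
# conversion model obtained from the learning device 1.
def target_expression_conversion(x: np.ndarray) -> np.ndarray:
    # Toy example: project each patch vector into an 8-dimensional
    # representation space with a fixed random matrix.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((x.shape[-1], 8))
    return x @ w

def convert(data: np.ndarray) -> np.ndarray:
    # S201: acquire the data whose expression is to be converted
    # (here it is simply passed in by the caller).
    acquired = data
    # S202: convert its expression into the target expression.
    representation = target_expression_conversion(acquired)
    # S203: output the result, e.g. to a downstream-task device
    # (here: return it to the caller).
    return representation

features = convert(np.ones((4, 16)))  # 4 patches, 16 dimensions each
print(features.shape)                 # (4, 8)
```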
<Experiment results>
 The results of an experiment in which downstream tasks were executed on data whose expression had been converted using the target expression conversion process will be explained with reference to FIG. 10. In the experiment, the downstream tasks were classification of environmental sounds, voice command word identification, speaker identification, spoken language classification, classification of emotions contained in speech, music genre classification, instrument classification of musical tones, and pitch classification of musical tones.
 FIG. 10 is a diagram showing an example of the results of the experiment in the embodiment. "ESC50" and "US8K" both indicate the task of classifying environmental sounds. "SPCV2" indicates the task of voice command word identification. "VC1" indicates the task of speaker identification. "VF" indicates the task of spoken language classification. "CRM-D" indicates the task of classifying emotions contained in speech. "GTZAN" indicates the task of music genre classification. "NSynth" indicates the task of instrument classification of musical tones. "Surge" indicates the task of pitch classification of musical tones.
 "MAE" indicates MAE (Masked Autoencoders), which is an existing technique. "MABL" indicates the target expression conversion process. The results in FIG. 10 show that, for every downstream task, the accuracy of the downstream task using the result of expression conversion by the target expression conversion process is higher than the accuracy of the downstream task using the result of expression conversion by MAE.
 For example, for "ESC50", the accuracy of the downstream task "ESC50" using the result of expression conversion by MAE is 87.35%, whereas the accuracy of the downstream task "ESC50" using the result of expression conversion by the target expression conversion process is 89.03%.
 For example, for "VC1", the accuracy of the downstream task "VC1" using the result of expression conversion by MAE is 54.64%, whereas the accuracy of the downstream task "VC1" using the result of expression conversion by the target expression conversion process is 58.96%.
 The learning device 1 configured in this way trains the first expression conversion process so as to reduce the difference between the representation of the mask data obtained based on the data excluding the mask data and the representation of the mask data obtained based on part or all of the mask data.
 This differs from MAE, which restores the patch images of the masked portions from the representations of the non-masked patches output by the model and computes the loss using the difference between the input signal and the restored signal. In the case of MAE, restoration (that is, decoding) is performed after the expression conversion, so errors may arise in the information as a result of the restoration. Therefore, unlike MAE, the learning device 1, which performs no restoration during learning, can improve the accuracy of converting the expression of data to be converted into the predetermined expression (that is, the target expression).
 Learning by the learning device 1 also differs from data2vec. In the case of data2vec, all patches are input to obtain the target representations of a moving-average model, and only the masked portion of those representations is used as the teacher signal, so the target representations of the masked portion contain information from the non-masked portion. Note that the masked portion refers to the mask data, and the non-masked portion refers to the data in the 0th data other than the mask data. As a result, in data2vec, learning comes to depend on the information of the non-masked portion, and learning that improves the accuracy of the expression conversion may not take place.
 In contrast, in learning by the learning device 1, the second expression conversion process obtains the target representations of part or all of the mask data without using the data in the 0th data that is not mask data. The result of the second expression conversion process is then compared with the result of the mask data expression prediction process. That is, in learning by the learning device 1, the result of the second expression conversion process is used as the teacher signal. As described above, the result of the second expression conversion process contains no information from the non-masked portion. Therefore, unlike data2vec, the learning device 1 can improve the accuracy of converting the expression of data to be converted into the predetermined expression.
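The loss described above can be sketched numerically. This is a deliberately crude illustration, not the claimed method: the encoders below are plain linear maps, the mask-representation predictor is reduced to pooling plus an identity matrix, and all variable names are assumptions. What it does show is the key structural point: the teacher signal is computed from the masked patches alone, and the loss is taken directly in representation space, with no signal reconstruction as in MAE.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 8                                   # patch dim, representation dim

W_student = rng.standard_normal((D, H)) * 0.1  # first expression conversion (toy)
W_teacher = W_student.copy()                   # second expression conversion (toy)
W_pred = np.eye(H)                             # mask data expression predictor (toy)

x0 = rng.standard_normal((10, D))              # 0th data: 10 patch vectors
mask = rng.permutation(10) < 5                 # 50% of the patches become mask data

x1 = x0[~mask]                                 # first data: non-masked patches
x2 = x0[mask]                                  # second data: the masked patches

# First expression conversion + mask data expression prediction:
# predict the target representations of the masked patches from the
# non-masked patches only (here, crudely, via their mean representation).
h1 = x1 @ W_student
pred = np.repeat((h1.mean(axis=0) @ W_pred)[None, :], x2.shape[0], axis=0)

# Second expression conversion: target representations computed from the
# masked patches ALONE, so no non-masked information leaks into the teacher.
target = x2 @ W_teacher

# Update process would reduce this difference; unlike MAE, the loss is
# defined in representation space, with no decoding back to the signal.
loss = float(np.mean((pred - target) ** 2))
print(loss >= 0.0)  # True
```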
(Modification)
 Note that, in the first data, the proportion of mask data may be 50% of the data included in the 0th data. When the proportion of mask data is 50%, the accuracy of converting the expression of the data to be converted into the predetermined expression is improved compared with the case where it is not 50%, for the following reason (hereinafter referred to as the "first reason").
<First reason>
 For the target representations output by the first expression conversion process and the second expression conversion process, the conversion error becomes small by modeling the whole of the data including the respective mask data. Modeling is easy when there is little mask data and difficult when there is much. Target representations processed at different levels of difficulty differ in the amount of information available for modeling and are therefore heterogeneous. By setting the proportion of mask data to 50%, the first data and the second data become homogeneous target representations processed from the same amount of information, and their conversion error is optimal for learning better modeling.
 Note that, in the first data, the proportion of data that is not mask data may be 50% or less of the data included in the 0th data, and the proportion of data that is a part of the mask data and is used in the second expression conversion process may also be 50% or less of the data included in the 0th data. In such a case the difficulty of modeling is higher, so the accuracy of converting the expression of the data to be converted into the predetermined expression is improved compared with the case where this is not so.
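The mask-ratio bookkeeping described above amounts to a random partition of the 0th data's patches. The helper below is a minimal sketch under assumed names (the patent does not specify how the first and second mask data determination processes select indices); it splits tensor-represented patches into the first data (non-masked patches) and the mask data, from which the second data can then be drawn.

```python
import numpy as np

def split_0th_data(x0: np.ndarray, mask_ratio: float, seed: int = 0):
    """Randomly split the 0th data (rows = patches) into first data
    (non-masked patches) and mask data according to mask_ratio."""
    rng = np.random.default_rng(seed)
    n = x0.shape[0]
    n_mask = int(round(n * mask_ratio))
    idx = rng.permutation(n)
    mask_idx, keep_idx = idx[:n_mask], idx[n_mask:]
    return x0[keep_idx], x0[mask_idx]

x0 = np.zeros((100, 16))  # 100 patch vectors of dimension 16
first_data, mask_data = split_0th_data(x0, mask_ratio=0.5)
print(first_data.shape[0], mask_data.shape[0])  # 50 50
```

With `mask_ratio=0.5` the two halves carry the same number of patches, matching the homogeneity argument of the first reason; ratios above 0.5 leave fewer non-masked patches and make the modeling correspondingly harder, as in the variant above.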
 The learning device 1 and the conversion device 2 may each be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the learning device 1 and of the conversion device 2 may each be distributed across and implemented in the plurality of information processing devices.
 Note that the learning device 1 and the conversion device 2 do not necessarily need to be implemented as separate devices. The learning device 1 and the conversion device 2 may be implemented, for example, as one device having the functions of both.
 Note that all or part of each function of the expression conversion system 100, the learning device 1, and the conversion device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may be transmitted via a telecommunication line.
 Although an embodiment of the present invention has been described above in detail with reference to the drawings, the specific configuration is not limited to this embodiment, and designs and the like within a scope not departing from the gist of the present invention are also included.
 100…expression conversion system, 1…learning device, 2…conversion device, 10…learning unit, 11…control unit, 12…input unit, 13…communication unit, 14…storage unit, 15…output unit, 110…data acquisition unit, 120…storage control unit, 130…communication control unit, 140…output control unit, 21…control unit, 22…input unit, 23…communication unit, 24…storage unit, 25…output unit, 210…conversion target acquisition unit, 220…expression conversion unit, 230…storage control unit, 240…communication control unit, 250…output control unit, 91…processor, 92…memory, 93…processor, 94…memory

Claims (8)

  1.  A learning device comprising:
     a control unit that obtains, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression,
     wherein the control unit executes:
     a first expression conversion process that takes as a processing target first data, which is data obtained by removing, from 0th data that is data expressed by a tensor, mask data that is a part of the 0th data, and converts the expression of the processing target into the target expression;
     a mask data expression prediction process that predicts the target expression of the mask data based on a result of the first expression conversion process;
     a second expression conversion process that converts, based on second data that is a part or all of the mask data, the expression of the second data into the target expression; and
     an update process that updates contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce a difference between a result of the mask data expression prediction process and a result of the second expression conversion process, and
     wherein the first expression conversion process at a time when a predetermined condition regarding the end of updating is satisfied is the target expression conversion process.
  2.  The learning device according to claim 1, wherein the control unit executes, after the first expression conversion process and before the mask data expression prediction process, an addition process that adds, to the result of the first expression conversion process, mask tokens indicating whether elements in the 0th data belong to the mask data, in a number equal to the number of elements belonging to the mask data, and
     the processing target of the mask data expression prediction process is a result of the addition process.
  3.  The learning device according to claim 1, wherein, in the first data, the proportion of mask data is 50% of the data included in the 0th data.
  4.  The learning device according to claim 1, wherein, in the first data, the proportion of data that is not mask data is 50% or less of the data included in the 0th data, and the proportion of data that is a part of the mask data and is used in the second expression conversion process is also 50% or less of the data included in the 0th data.
  5.  A conversion device comprising:
     a conversion target acquisition unit that acquires data whose expression is to be converted into a target expression that is a predetermined expression; and
     an expression conversion unit that converts the expression of the data acquired by the conversion target acquisition unit using the target expression conversion process obtained by a learning device,
     the learning device comprising a control unit that obtains, by learning, the target expression conversion process, which is a process of converting the expression of data to be converted into the target expression that is a predetermined expression, the control unit executing: a first expression conversion process that takes as a processing target first data, which is data obtained by removing, from 0th data that is data expressed by a tensor, mask data that is a part of the 0th data, and converts the expression of the processing target into the target expression; a mask data expression prediction process that predicts the target expression of the mask data based on a result of the first expression conversion process; a second expression conversion process that converts, based on second data that is a part or all of the mask data, the expression of the second data into the target expression; and an update process that updates contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce a difference between a result of the mask data expression prediction process and a result of the second expression conversion process, the first expression conversion process at a time when a predetermined condition regarding the end of updating is satisfied being the target expression conversion process.
  6.  A learning method comprising:
     a control step of obtaining, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression,
     wherein the control step executes:
     a first expression conversion process that takes as a processing target first data, which is data obtained by removing, from 0th data that is data expressed by a tensor, mask data that is a part of the 0th data, and converts the expression of the processing target into the target expression;
     a mask data expression prediction process that predicts the target expression of the mask data based on a result of the first expression conversion process;
     a second expression conversion process that converts, based on second data that is a part or all of the mask data, the expression of the second data into the target expression; and
     an update process that updates contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce a difference between a result of the mask data expression prediction process and a result of the second expression conversion process, and
     wherein the first expression conversion process at a time when a predetermined condition regarding the end of updating is satisfied is the target expression conversion process.
7.  A conversion method comprising:
    a conversion target acquisition step of acquiring data that is a target of conversion of its expression into a target expression that is a predetermined expression; and
    an expression conversion step of converting the expression of the data acquired in the conversion target acquisition step by using the target expression conversion process obtained by a learning method, the learning method having a control step of obtaining, by learning, the target expression conversion process, which is a process of converting the expression of data to be converted into the target expression, wherein the control step executes: a first expression conversion process that takes as a processing target first data, which is data obtained by removing mask data, being a part of zeroth data expressed by a tensor, from the zeroth data, and converts the expression of the processing target into the target expression; a mask data expression prediction process that predicts the target expression of the mask data based on a result of the first expression conversion process; a second expression conversion process that converts the expression of second data, which is part or all of the mask data, into the target expression; and an update process that updates the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce a difference between a result of the mask data expression prediction process and a result of the second expression conversion process, and wherein the first expression conversion process at a time when a predetermined condition regarding the end of the updating is satisfied is the target expression conversion process.
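The conversion method of claim 7 amounts to applying the learned target expression conversion process to newly acquired data. A minimal, self-contained sketch under the same illustrative linear-encoder assumption (the learned weights are stood in by a random matrix here; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the target expression conversion process obtained by learning.
W_learned = rng.normal(scale=0.1, size=(16, 8))

def expression_conversion_step(data, W):
    """Convert the expression of acquired data into the target expression."""
    return data @ W

# Conversion target acquisition step: acquire data to be converted.
acquired = rng.normal(size=(4, 16))
target_expression = expression_conversion_step(acquired, W_learned)
```

The downstream task would then consume `target_expression` directly.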
8.  A program for causing a computer to function as either the learning device according to any one of claims 1 to 4 or the conversion device according to claim 5.
PCT/JP2022/033441 2022-09-06 2022-09-06 Learning device, conversion device, learning method, conversion method, and program WO2024052996A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/033441 WO2024052996A1 (en) 2022-09-06 2022-09-06 Learning device, conversion device, learning method, conversion method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/033441 WO2024052996A1 (en) 2022-09-06 2022-09-06 Learning device, conversion device, learning method, conversion method, and program

Publications (1)

Publication Number Publication Date
WO2024052996A1 true WO2024052996A1 (en) 2024-03-14

Family

ID=90192407

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/033441 WO2024052996A1 (en) 2022-09-06 2022-09-06 Learning device, conversion device, learning method, conversion method, and program

Country Status (1)

Country Link
WO (1) WO2024052996A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAADE ALAN, PENG PUYUAN, HARWATH DAVID: "MAE-AST: Masked Autoencoding Audio Spectrogram Transformer", ARXIV, 30 March 2022 (2022-03-30), XP093148285, ISSN: 2331-8422, DOI: 10.48550/arxiv.2203.16691 *
NIIZUMI DAISUKE, TAKEUCHI DAIKI, OHISHI YASUNORI, HARADA NOBORU, KASHINO KUNIO: "Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation", ARXIV, 26 April 2022 (2022-04-26), XP093148283, ISSN: 2331-8422, DOI: 10.48550/arxiv.2204.12260 *

Similar Documents

Publication Publication Date Title
EP3816873A1 (en) Neural network circuit device, neural network processing method, and neural network execution program
US20230196202A1 (en) System and method for automatic building of learning machines using learning machines
US20210224447A1 (en) Grouping of pauli strings using entangled measurements
CN108701253A (en) The target output training neural network of operating specification
US20220180198A1 (en) Training method, storage medium, and training device
CN108280513B (en) Model generation method and device
US11809995B2 (en) Information processing device and method, and recording medium for determining a variable data type for a neural network
WO2024052996A1 (en) Learning device, conversion device, learning method, conversion method, and program
US20200202212A1 (en) Learning device, learning method, and computer-readable recording medium
CN113868368A (en) Method, electronic device and computer program product for information processing
JP7109071B2 (en) Learning device, learning method, speech synthesizer, speech synthesis method and program
JP6633556B2 (en) Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program
JP7349811B2 (en) Training device, generation device, and graph generation method
WO2021117089A1 (en) Model learning device, voice recognition device, method for same, and program
JP2021135683A (en) Learning device, deduction device, method for learning, and method for deduction
CN110929033A (en) Long text classification method and device, computer equipment and storage medium
JP7274441B2 (en) LEARNING DEVICE, LEARNING METHOD AND LEARNING PROGRAM
CN111767204A (en) Overflow risk detection method, device and equipment
JP2019133627A (en) Information processing method and information processing system
CN113792784B (en) Method, electronic device and storage medium for user clustering
JP7419615B2 (en) Learning device, estimation device, learning method, estimation method and program
WO2023105596A1 (en) Language processing device, image processing method, and program
CN111767980B (en) Model optimization method, device and equipment
CN114764620B (en) Quantum convolution operator
US20220180197A1 (en) Training method, storage medium, and training device