WO2024052996A1 - Learning device, conversion device, learning method, conversion method, and program - Google Patents


Info

Publication number
WO2024052996A1
WO2024052996A1 (PCT/JP2022/033441; JP2022033441W)
Authority
WO
WIPO (PCT)
Prior art keywords
expression
data
target
conversion process
mask
Prior art date
Application number
PCT/JP2022/033441
Other languages
French (fr)
Japanese (ja)
Inventor
大輔 仁泉
大起 竹内
康智 大石
登 原田
邦夫 柏野
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2022/033441 priority Critical patent/WO2024052996A1/en
Publication of WO2024052996A1 publication Critical patent/WO2024052996A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Definitions

  • the present invention relates to a learning device, a conversion device, a learning method, a conversion method, and a program.
  • a technique uses machine learning to generate a mathematical model that converts the expression of data to be converted into a predetermined expression.
  • the predetermined expression is, for example, an expression required by a downstream task.
  • converting the expression of the data to be converted into a predetermined expression means an encoding process of converting the data to be converted into data expressed in a predetermined expression.
  • an object of the present invention is to provide a technique that improves the accuracy of converting the expression of data to be converted into a predetermined expression.
  • One aspect of the present invention is a learning device including a control unit that obtains, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression. The control unit executes: a first expression conversion process that converts the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, a part of 0th data expressed by a tensor, from the 0th data; a mask data expression prediction process that predicts the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process that converts the expression of second data, which is part or all of the mask data, into the target expression; and an update process that updates the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process. The first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied is the target expression conversion process.
  • One aspect of the present invention is a conversion device including: a conversion target acquisition unit that acquires data whose expression is to be converted into a target expression that is a predetermined expression; a control unit that obtains, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into the target expression; and an expression conversion unit that converts, using the target expression conversion process, the expression of the data acquired by the conversion target acquisition unit. The control unit executes a first expression conversion process that converts the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, a part of 0th data expressed by a tensor, from the 0th data, and a mask data expression prediction process that predicts the target expression of the mask data based on the result of the first expression conversion process.
  • One aspect of the present invention is a learning method including a control step of obtaining, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression. The control step executes: a first expression conversion process that converts the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, a part of 0th data expressed by a tensor, from the 0th data; a mask data expression prediction process that predicts the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process that converts the expression of second data into the target expression; and an update process that updates the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process. The first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied is the target expression conversion process.
  • One aspect of the present invention is a conversion method including: a conversion target acquisition step of acquiring data whose expression is to be converted into a target expression that is a predetermined expression; a control step of obtaining, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into the target expression; and an expression conversion step of converting, using the target expression conversion process, the expression of the data acquired in the conversion target acquisition step. The control step executes a first expression conversion process that converts the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, a part of 0th data expressed by a tensor, from the 0th data, and a mask data expression prediction process that predicts the target expression of the mask data based on the result of the first expression conversion process.
  • One aspect of the present invention is a program for causing a computer to function as either the above learning device or the above converting device.
  • FIG. 1 is a diagram illustrating an example of the configuration of an expression conversion system according to an embodiment.
  • FIG. 2 is an explanatory diagram illustrating a patch in an embodiment.
  • FIG. 3 is an explanatory diagram illustrating an overview of the flow of processing executed by the learning unit in the embodiment.
  • FIG. 4 is a flowchart illustrating an example of the flow of learning processing executed by the learning unit in the embodiment.
  • FIG. 1 is a diagram showing an example of the configuration of an expression conversion system 100 according to an embodiment.
  • the expression conversion system 100 includes a learning device 1 and a conversion device 2.
  • the learning device 1 acquires, through learning, a process for converting the expression of data to be converted into a predetermined expression (hereinafter referred to as the "target expression"); this process is hereinafter referred to as the "target expression conversion process".
  • converting the representation of data into the target representation means an encoding process that converts that data into data expressed in the target representation.
  • the target expression is, for example, expression embedding.
  • the target representation is, for example, a vector of 768 floating-point values; it may instead be, for example, 1024 or 2048 floating-point values.
  • Representation conversion processing is a type of learning model.
  • the data to be converted may be any data expressed as a tensor, such as image data, acoustic signal data, natural language data, or general time series data. For example, it may be image data, a spectrogram of acoustic signal data, or a sequence whose samples are words, each word being expressed as an M-dimensional vector (M is a natural number of 1 or more) (hereinafter referred to as a "natural language sequence").
  • the data to be converted may be, for example, general time series data in which symbols other than characters and numbers are expressed as vectors.
  • the spectrogram of the acoustic signal data is obtained based on the acoustic signal data.
  • Natural language sequences are obtained based on natural language data.
  • the learning device 1 includes a learning section 10.
  • the learning unit 10 executes a first expression conversion process, a mask data expression prediction process, a second expression conversion process, and an update process. Note that the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process may be updated in advance by, for example, transfer learning or the like.
  • the first expression conversion process is a process of converting the expression to be processed into the target expression.
  • the processing target of the first expression conversion process when the learning unit 10 executes the first expression conversion process is the first data.
  • the first data is data obtained by removing mask data, which is a part of the 0th data, from the 0th data.
  • the 0th data is data expressed by a tensor.
  • the 0th data is, for example, a tensor whose elements are tensors, expressing the original data divided into patches together with data position information, which is information indicating the position of each patch.
  • the 0th data is, for example, a matrix whose elements are vectors of each patch resulting from patch division of the tensor to be divided. Patch division is a process of dividing the target of division into parts called patches.
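The patch division described above can be sketched as follows; this is a hypothetical NumPy illustration (the function name, patch sizes, and toy data are assumptions, not taken from the publication):

```python
import numpy as np

def patch_divide(spectrogram: np.ndarray, ph: int, pw: int) -> np.ndarray:
    # Divide a (frequency, time) array into non-overlapping ph x pw patches
    # and flatten each patch into a vector.  The returned matrix, whose rows
    # are patch vectors, corresponds to the 0th data before data position
    # information is added.
    F, T = spectrogram.shape
    assert F % ph == 0 and T % pw == 0, "patch size must divide the array"
    rows = []
    for i in range(0, F, ph):
        for j in range(0, T, pw):
            rows.append(spectrogram[i:i + ph, j:j + pw].reshape(-1))
    return np.stack(rows)

spec = np.arange(64.0).reshape(8, 8)   # toy 8x8 "spectrogram"
patches = patch_divide(spec, 4, 4)
print(patches.shape)                   # (4, 16): 4 patches of 16 values each
```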
  • the original data is data that corresponds to the conversion target and is expressed as a tensor. Therefore, when the conversion target is image data, the original data is image data.
  • the original data is data obtained based on acoustic signal data and expressed as a tensor.
  • such data is, for example, a spectrogram of the acoustic signal data.
  • the original data is data obtained based on natural language data and expressed as a tensor.
  • Such data is, for example, a natural language sequence.
  • when the data to be converted is data obtained based on general time series data, the original data is data obtained based on general time series data and expressed as a tensor.
  • Such data is, for example, a time series of numerical values such as stock prices and temperature.
  • the original data is, in other words, one of the following: image data; data obtained based on acoustic signal data and expressed as a tensor; data obtained based on natural language data and expressed as a tensor; or data obtained based on general time series data and expressed as a tensor.
  • One element of the tensor of the 0th data indicates information regarding one patch.
  • One patch includes one or more elements of a tensor representing the original data.
  • One element is data expressed by a 768-dimensional vector, for example. Note that the position of a patch is the position within the 0th data of the element corresponding to each patch.
  • the patch division and the assignment of data position information may be performed by the learning device 1. That is, the process of obtaining the 0th data based on the original data (hereinafter referred to as the "0th data generation process") may be executed by the learning device 1, or may be executed by another device different from the learning device 1.
  • the 0th data may be, for example, the original data itself. In such a case, there is no need to execute the 0th data generation process.
  • FIG. 2 is an explanatory diagram illustrating a patch in the embodiment. Specifically, FIG. 2 is an explanatory diagram illustrating a patch using an example in which the conversion target is acoustic signal data. More specifically, the example in FIG. 2 is an explanatory diagram illustrating a patch in a case where the acoustic signal data is a spectrogram composed of a frequency axis and a time axis.
  • FIG. 2 shows a spectrogram divided into a plurality of rectangular sections having the same size on both the frequency axis and the time axis.
  • a patch is data that expresses each divided section as a vector. For example, data representing one section of area D1 in FIG. 2 is one patch.
  • a patch is, for example, a vector indicating the pixel value of each pixel included in the corresponding section.
  • the number of elements of a vector expressing a patch is the same for every patch, and may be, for example, proportional to the number of pixels included in each section.
  • the proportionality coefficient may or may not be 1. When the proportionality coefficient is not 1, the value of each element may be a value obtained by interpolation, for example. For example, if the number of pixels included in one patch is 256 and the coefficient is 3, the patch is a 768-dimensional vector.
  • the data position information is, for example, a vector with the same number of dimensions as the number of dimensions of the vector representing the patch.
  • Each element of the 0th data is, for example, a vector expressed as a vector sum of a vector expressing a patch and a vector indicating data position information.
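The vector-sum construction of each element of the 0th data can be illustrated with a minimal sketch; the random patch vectors and the positional table below are illustrative assumptions, not from the publication:

```python
import numpy as np

def embed_patches(patch_vecs: np.ndarray, pos_info: np.ndarray) -> np.ndarray:
    # Each element of the 0th data is the vector sum of a vector expressing
    # a patch and a vector indicating its data position information; the two
    # vectors therefore must have the same number of dimensions.
    assert patch_vecs.shape == pos_info.shape
    return patch_vecs + pos_info

rng = np.random.default_rng(0)
patch_vecs = rng.normal(size=(4, 16))   # 4 patches, 16-dimensional vectors
pos_info = rng.normal(size=(4, 16))     # one positional vector per position
zeroth_data = embed_patches(patch_vecs, pos_info)   # matrix whose elements are vectors
```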
  • the 0th data is a matrix whose elements are vectors.
  • when the patch is a vector whose number of elements is proportional to the number of pixels, the 0th data is a matrix whose elements are first-order tensors. Note that, needless to say, a first-order tensor is a vector and a second-order tensor is a matrix; incidentally, a 0th-order tensor is a scalar.
  • when the proportionality coefficient is greater than 1, it is possible to suppress the increase in missing information when data is encoded in various processes such as the first representation conversion process.
  • the 0th data is obtained based on the original data.
  • the value of one element of the tensor representing the 0th data expresses, as a vector, one partition obtained when the tensor representing the original data is divided according to a predetermined rule. The dimension of the vector representing a partition may, for example, be larger than the number of elements of the tensor representing the original data that are included in that partition.
  • the mask data expression prediction process is a process of predicting the target expression of the mask data based on the result of the first expression conversion process.
  • the second expression conversion process is a process of converting the expression of the second data into a target expression based on the second data that is part or all of the mask data.
  • the update process updates the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process (hereinafter referred to as the "conversion error").
  • the conversion error may be the MSE (mean squared error) between the result of the mask data expression prediction process and the result of the second expression conversion process, or the L1 loss, which is the mean absolute value of the difference.
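The two conversion-error choices can be written as a short sketch (a hypothetical helper, not from the publication):

```python
import numpy as np

def conversion_error(pred: np.ndarray, target: np.ndarray, kind: str = "mse") -> float:
    # pred: result of the mask data expression prediction process
    # target: result of the second expression conversion process
    d = pred - target
    if kind == "mse":
        return float(np.mean(d ** 2))   # mean squared error
    return float(np.mean(np.abs(d)))    # L1: mean absolute value of the difference
```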
  • in this way, the learning unit 10 learns the first expression conversion process and the mask data expression prediction process so as to reduce the difference between the target expression of the mask data obtained based on the data excluding the mask data and the target expression of the mask data obtained based on part or all of the mask data itself.
  • the conversion device 2 converts the representation of the data to be processed using the first representation conversion process at the time when a predetermined condition regarding the end of the update (hereinafter referred to as the "update end condition") is satisfied.
  • FIG. 3 is an explanatory diagram illustrating an overview of the flow of processing executed by the learning unit 10 in the embodiment.
  • data x is an example of 0th data.
  • the learning unit 10 generates first data and second data based on the 0th data x.
  • Data D101 in the example of FIG. 3 is an example of first data, and data D102 is an example of second data.
  • Data D101 is data in which part of the 0th data is masked.
  • masking means a process in which data to be processed is not subject to processing by other predetermined processes.
  • masking means a process of restricting information access.
  • the predetermined other processing is, for example, mask data expression prediction processing.
  • An example of the result of mask processing is how masked data is handled in the next stage. For example, if the input data is the series "010101010101" and masking converts it to the series "****0101****", the data marked with "*" is not subject to processing. This is the result of the mask: "*" is not subject to the next stage of processing, that is, "*" is an example of data whose information access is restricted.
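The series example can be reproduced with a small sketch (a 12-character series is assumed here so that the input and the masked output have consistent lengths):

```python
def mask_series(series: str, mask_idx: set) -> str:
    # Positions in mask_idx become "*", i.e. data whose information access is
    # restricted and which the next stage of processing will skip.
    return "".join("*" if i in mask_idx else c for i, c in enumerate(series))

masked = mask_series("010101010101", set(range(0, 4)) | set(range(8, 12)))
print(masked)   # ****0101****
```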
  • in FIG. 3, the plain (solid-color) patches represent masked patches.
  • the patches that are not plain represent unmasked patches.
  • Patch P1 is an example of a plain patch.
  • Patch P2 is an example of a patch that is not plain.
  • a set of masked patches is an example of mask data.
  • the mask data in the first data is data that is not mask data in the second data. Therefore, the 0th data is covered by the union of the data that is not mask data in the first data and the data that is not mask data in the second data.
  • not all of the mask data in the first data necessarily needs to be data that is not mask data in the second data.
  • the data that is not mask data in the second data may be part of the mask data in the first data.
  • the determination of which data among the 0th data is to be used as the mask data in the 1st data may be performed in any manner, for example, at random.
  • the process of determining which data of the 0th data should be used as the mask data in the first data will be referred to as a first mask data determination process.
  • the second data is, for example, data determined as mask data in the first data.
  • all of the data determined as mask data in the first data does not necessarily have to be the second data.
  • an example will now be given of how to determine the second data when some, but not all, of the data determined as mask data in the first data is the second data.
  • the determination of which data, among the data determined as mask data in the first data, is to be used as the second data may be performed in any manner; for example, it may be determined randomly.
  • the process of determining which of the data determined as mask data in the first data is to be used as the second data is referred to as the second mask data determination process. Note that, as the explanation so far makes clear, when the second data is all of the data determined as mask data in the first data, the second mask data determination process does not necessarily need to be executed, because all of the data determined as mask data in the first data is then the second data.
  • the first mask data determination process and the second mask data determination process are executed by the learning unit 10 in the example of FIG.
  • the learning unit 10 does not necessarily need to execute the first mask data determination process and the second mask data determination process.
  • the generation of the first data and the second data may be executed by a device other than the learning device 1, and the generated first data and second data may be acquired by the learning unit 10. In such a case, the learning unit 10 does not perform the first mask data determination process or the second mask data determination process.
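The first and second mask data determination processes can be sketched as random index selection; the ratios, seed, and function name below are illustrative assumptions:

```python
import numpy as np

def split_masks(num_patches: int, mask_ratio: float = 0.5,
                second_ratio: float = 1.0, seed: int = 0):
    # First mask data determination process: randomly choose which patches of
    # the 0th data become mask data (removed to form the first data).
    # Second mask data determination process: randomly choose which of those
    # masked patches form the second data (here, a fraction second_ratio).
    rng = np.random.default_rng(seed)
    idx = rng.permutation(num_patches)
    n_mask = int(num_patches * mask_ratio)
    mask_idx = np.sort(idx[:n_mask])    # mask data in the first data
    keep_idx = np.sort(idx[n_mask:])    # the first data
    n_second = int(len(mask_idx) * second_ratio)
    second_idx = np.sort(rng.permutation(mask_idx)[:n_second])  # second data
    return keep_idx, mask_idx, second_idx

keep_idx, mask_idx, second_idx = split_masks(10, mask_ratio=0.4, second_ratio=0.5)
```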
  • the learning unit 10 performs the first conversion process on the first data.
  • the function fθ(·) in FIG. 3 represents the first conversion process.
  • z shown in FIG. 3 represents the result of the addition process.
  • the addition process is a process of adding the same number of mask tokens as the number of elements belonging to the mask data to the result of the first conversion process on the first data.
  • the mask token is information indicating whether or not an element in the 0th data belongs to mask data.
  • the element belonging to the mask data is an element belonging to the mask data in the first data among the elements of the tensor expressing the 0th data. Therefore, an element belonging to mask data is, for example, a patch determined to be mask data in the first mask data determination process.
  • in the case of patch P1, for example, the addition process adds, to the result of the first conversion process, a vector that is the sum of information indicating the position of patch P1 in the 0th data and information indicating that it is a mask token.
  • the result of the additional process is an example of a result based on the result of the first conversion process.
  • the learning unit 10 executes the additional processing, for example.
  • the learning unit 10 executes mask data expression prediction processing on the result of the additional processing.
  • the additional process is executed after the first expression conversion process and before the mask data expression prediction process.
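The addition process can be sketched as follows; the mask token, positional table, and shapes are illustrative assumptions, not from the publication:

```python
import numpy as np

def add_mask_tokens(encoded_visible: np.ndarray, mask_idx, pos_info: np.ndarray,
                    mask_token: np.ndarray) -> np.ndarray:
    # Append to the first-conversion result one vector per masked element,
    # each being the sum of the mask token and the vector indicating that
    # element's position in the 0th data.
    appended = np.stack([mask_token + pos_info[i] for i in mask_idx])
    return np.concatenate([encoded_visible, appended], axis=0)

encoded = np.zeros((2, 3))                 # first conversion result (2 visible patches)
pos_info = np.arange(12.0).reshape(4, 3)   # one positional vector per patch position
out = add_mask_tokens(encoded, [1, 3], pos_info, np.ones(3))
```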
  • the function qφ(·) in FIG. 3 represents the mask data expression prediction process.
  • y shown in FIG. 3 represents the result of the mask data expression prediction process.
  • the learning unit 10 performs the second conversion process on the second data.
  • the function fξ(·) in FIG. 3 represents the second conversion process.
  • the function fθ representing the first conversion process and the function fξ representing the second conversion process are both parameterized functions, and their parameter values are updated by the update process.
  • the symbol fθ means a function f whose parameter value is θ, and the symbol fξ means a function f whose parameter value is ξ. Therefore, the symbols "fθ" and "fξ" indicate that the first conversion process and the second conversion process are the same function except for the difference in parameters.
  • z′ shown in FIG. 3 represents the result of the second conversion process.
  • the result of the mask data expression prediction process is the target expression of the mask data.
  • the result of the second transformation process is the target representation of all or part of the mask data. Therefore, by updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process based on the conversion error so as to reduce the conversion error, the accuracy of converting the expression of the data to be converted into the predetermined expression improves.
  • “Maximize agreement” in FIG. 3 indicates that the contents of the first expression conversion process, mask data expression prediction process, and second expression conversion process are updated so as to reduce the conversion error. That is, “Maximize agreement” in FIG. 3 indicates that the learning unit 10 executes the update process.
  • “Stop gradient” in FIG. 3 indicates that error backpropagation is not executed when updating the contents of the second representation conversion process. Note that updating the contents of a process specifically means updating the values of the parameters included in the functions executed in that process. Therefore, in the example of the second conversion process expressed by the function fξ, updating the value of the parameter ξ is an update of the contents of the second conversion process. In the example of FIG. 3, the contents of the second representation conversion process are updated not by error backpropagation but by another update process.
  • the other update process is, for example, a predetermined exponential moving average based on the contents of the first representation conversion process. With the function fθ representing the first transformation process and the function fξ representing the second transformation process, the parameters are updated as ξ ← τξ + (1 - τ)θ, where τ is a predetermined constant, for example 0.99.
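A parameter-wise sketch of this exponential moving average update; plain Python lists stand in for the parameters of the first and second conversion processes (an illustrative assumption):

```python
def ema_update(theta, xi, tau=0.99):
    # xi <- tau * xi + (1 - tau) * theta, applied element-wise; tau is the
    # predetermined constant (for example 0.99), so the target parameters xi
    # track theta slowly instead of being updated by error backpropagation.
    return [tau * x + (1.0 - tau) * t for t, x in zip(theta, xi)]

xi = ema_update(theta=[1.0, 2.0], xi=[0.0, 0.0], tau=0.9)
```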
  • FIG. 4 is a flowchart illustrating an example of the flow of learning processing executed by the learning unit 10 in the embodiment.
  • the learning unit 10 acquires first data and second data (step S101).
  • the learning unit 10 executes a first conversion process (step S102).
  • the learning unit 10 executes mask data expression prediction processing (step S103).
  • the learning unit 10 executes a second conversion process (step S104).
  • in step S105, the learning unit 10 executes the update process.
  • in step S106, the learning unit 10 determines whether the update end condition is satisfied. If the update end condition is satisfied (step S106: YES), the process ends. On the other hand, if the update end condition is not satisfied (step S106: NO), the process returns to step S101.
  • the first expression conversion process at the time when the update end condition is satisfied is the target expression conversion process.
  • during learning, the processing target of the first representation conversion process was the first data. However, the mask data is not predetermined and is not the same every time learning is performed. Therefore, the first expression conversion process at the time when the update end condition is satisfied (i.e., the target expression conversion process) can convert the expression into the target expression with high accuracy even for a processing target that does not include mask data.
  • the processing in steps S102 to S103 and the processing in step S104 may be executed in parallel.
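Steps S101 to S106 can be put together in a toy end-to-end sketch. Everything below is an illustrative assumption: the conversion processes and the predictor are linear maps, the prediction is made from a mean-pooled summary of the visible patches rather than per-patch mask tokens, and the update end condition is a fixed step count. It shows only the control flow of the learning process, not the claimed model itself.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16
theta = rng.normal(size=(D, D)) * 0.1   # first expression conversion (linear sketch)
phi = rng.normal(size=(D, D)) * 0.1     # mask data expression prediction
xi = theta.copy()                       # second expression conversion (EMA target)
tau, lr = 0.99, 0.01

for step in range(200):                           # S106: fixed update end condition
    x = rng.normal(size=(8, D))                   # S101: 0th data (8 patch vectors)
    mask = rng.permutation(8)[:4]                 # mask data indices
    keep = np.setdiff1d(np.arange(8), mask)
    mx = x[keep].mean(0)                          # summary of the first data
    h = mx @ theta                                # S102: first conversion
    y = h @ phi                                   # S103: predicted target expression
    t = (x[mask] @ xi).mean(0)                    # S104: second conversion (no backprop)
    err = y - t                                   # S105: gradient step on 0.5*||err||^2
    g_phi = np.outer(h, err)
    g_theta = np.outer(mx, phi @ err)
    phi -= lr * g_phi
    theta -= lr * g_theta
    xi = tau * xi + (1.0 - tau) * theta           # target updated by EMA, not backprop
# theta at loop end plays the role of the target expression conversion process
```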
  • FIG. 5 is a diagram showing an example of the hardware configuration of the learning device 1 of the embodiment.
  • the learning device 1 includes a control unit 11 including a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected via a bus, and executes a program.
  • the processor 91 reads a program stored in the storage unit 14 and stores the read program in the memory 92. By the processor 91 executing the read program, the learning device 1 functions as a device including a control section 11, an input section 12, a communication section 13, a storage section 14, and an output section 15.
  • the control unit 11 controls the operations of various functional units included in the learning device 1.
  • the control section 11 includes a learning section 10. Therefore, the control unit 11 executes, for example, a first expression conversion process, a mask data expression prediction process, a second expression conversion process, and an update process.
  • the control unit 11 may further execute a first mask data determination process, a second mask data determination process, or an additional process.
  • the control unit 11 controls the operation of the output unit 15, for example.
  • the control unit 11 records, in the storage unit 14, various types of information generated by executing various processes such as, for example, the first expression conversion process, the mask data expression prediction process, the second expression conversion process, and the update process.
  • the input unit 12 includes input devices such as a mouse, a keyboard, and a touch panel.
  • the input unit 12 may be configured as an interface that connects these input devices to the learning device 1.
  • the input unit 12 receives input of various information to the learning device 1.
  • the communication unit 13 includes a communication interface for connecting the learning device 1 to an external device.
  • the communication unit 13 communicates with an external device via wire or wireless.
  • the external device is, for example, a device that is the source of the 0th data.
  • the communication unit 13 acquires the 0th data by communicating with the device that is the source of the 0th data.
  • the external device is, for example, the source device of the original data.
  • the communication unit 13 acquires the original data by communicating with the device that is the source of the original data.
  • the external device is, for example, a device that transmits the acoustic signal data.
  • the communication unit 13 acquires the acoustic signal data by communicating with the device that is the source of the acoustic signal data.
  • the external device is, for example, a device that is a source of natural language data.
  • the communication unit 13 acquires the natural language data by communicating with the device that is the source of the natural language data.
  • the external device is, for example, a device that is a source of general time series data in which symbols other than characters and numbers are expressed as vectors. If the external device is a source of general time series data, the communication unit 13 acquires the general time series data by communicating with that device.
  • the external device is, for example, the conversion device 2.
  • the communication unit 13 transmits to the conversion device 2 information indicating the content of the first expression conversion process (that is, the target expression conversion process) at the time when the update end condition is satisfied.
  • the storage unit 14 is configured using a non-transitory computer-readable recording medium such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 14 stores various information regarding the learning device 1.
  • the storage unit 14 stores information input via the input unit 12 or the communication unit 13, for example.
  • the storage unit 14 stores various information generated by the operation of the control unit 11, for example.
  • the output unit 15 outputs various information.
  • the output unit 15 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, and an organic EL (Electro-Luminescence) display.
  • the output unit 15 may be configured as an interface that connects these display devices to the learning device 1.
  • the output unit 15 outputs, for example, information input to the input unit 12.
  • the output unit 15 may display, for example, the results of the processing by the control unit 11.
  • FIG. 6 is a diagram showing an example of the configuration of the control unit 11 in the embodiment.
  • the control unit 11 includes a learning unit 10, a data acquisition unit 110, a storage control unit 120, a communication control unit 130, and an output control unit 140.
  • the data acquisition unit 110 acquires data to be sent to the learning unit 10.
  • the data transmitted to the learning unit 10 may be the 0th data or may be a set of the first data and the second data.
  • when the data acquisition unit 110 acquires the 0th data, a mask data determination process and a second mask data determination process are executed, and the data acquisition unit 110 transmits the resulting set of first data and second data to the learning unit 10.
  • the data acquisition unit 110 acquires, as the original data, data expressed as a tensor that is obtained based on the acoustic signal data acquired by the input unit 12 or the communication unit 13. Furthermore, the data acquisition unit 110 acquires the 0th data based on the obtained original data by executing the 0th data generation process.
  • the data acquisition unit 110 acquires, as the original data, data expressed as a tensor that is obtained based on the natural language data acquired by the input unit 12 or the communication unit 13. Furthermore, the data acquisition unit 110 acquires the 0th data based on the obtained original data by executing the 0th data generation process.
  • the data acquisition unit 110 acquires, as the original data, data expressed as a tensor that is obtained based on the general time-series data acquired by the input unit 12 or the communication unit 13. Furthermore, the data acquisition unit 110 acquires the 0th data based on the obtained original data by executing the 0th data generation process.
  • the data acquisition unit 110 acquires the 0th data acquired by the input unit 12 or the communication unit 13.
  • the data acquisition unit 110 acquires the set of first data and second data acquired by the input unit 12 or the communication unit 13.
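The 0th data generation handled by the data acquisition unit 110 (dividing the original data, e.g. a spectrogram tensor, into patches while keeping each patch's position as the data position information) can be sketched as follows. This is a minimal illustrative sketch, not the patented implementation; the function name, the patch size, and the spectrogram shape are assumptions for illustration.

```python
import numpy as np

def make_0th_data(spectrogram, patch_h, patch_w):
    """Split a 2-D spectrogram (freq x time) into non-overlapping patches.

    Returns a matrix whose rows are flattened patch vectors (the 0th data)
    and each patch's (row, col) grid position (the data position information).
    """
    F, T = spectrogram.shape
    rows, cols = F // patch_h, T // patch_w
    patches, positions = [], []
    for r in range(rows):
        for c in range(cols):
            block = spectrogram[r * patch_h:(r + 1) * patch_h,
                                c * patch_w:(c + 1) * patch_w]
            patches.append(block.reshape(-1))   # one patch -> one vector
            positions.append((r, c))
    return np.stack(patches), positions

spec = np.random.rand(80, 208)          # e.g. 80 mel bins x 208 frames (assumed)
x0, positions = make_0th_data(spec, 16, 16)
# x0.shape == (65, 256): 5 x 13 = 65 patches, each flattened to 256 values
```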
  • the storage control unit 120 records various information in the storage unit 14.
  • the communication control unit 130 controls the operation of the communication unit 13.
  • the output control section 140 controls the operation of the output section 15.
  • FIG. 7 is a diagram showing an example of the hardware configuration of the conversion device 2 in the embodiment.
  • the conversion device 2 includes a control unit 21 including a processor 93 such as a CPU and a memory 94 connected via a bus, and executes a program.
  • the conversion device 2 functions as a device including a control section 21, an input section 22, a communication section 23, a storage section 24, and an output section 25 by executing a program.
  • the processor 93 reads the program stored in the storage unit 24 and stores the read program in the memory 94.
  • the conversion device 2 functions as a device including a control section 21, an input section 22, a communication section 23, a storage section 24, and an output section 25.
  • the control unit 21 controls the operations of various functional units included in the conversion device 2.
  • the control unit 21 acquires, for example, information obtained by the learning device 1 that indicates the content of the first expression conversion process (i.e., the target expression conversion process) at the time when the update end condition is satisfied, and records it in the storage unit 24.
  • the control unit 21 executes the target expression conversion process.
  • the execution of the target expression conversion process by the control unit 21 is performed, for example, by the control unit 21 reading and executing information indicating the content of the target expression conversion process recorded in the storage unit 24.
  • the control unit 21 controls the operation of the output unit 25, for example.
  • the control unit 21 records, for example, various information generated by executing the target expression conversion process in the storage unit 24.
  • the input unit 22 includes input devices such as a mouse, a keyboard, and a touch panel.
  • the input unit 22 may be configured as an interface that connects these input devices to the conversion device 2.
  • the input unit 22 receives input of various information to the conversion device 2.
  • the communication unit 23 includes a communication interface for connecting the conversion device 2 to an external device.
  • the communication unit 23 communicates with an external device via wire or wireless.
  • the external device is, for example, a device from which data to be converted into a target representation is sent.
  • the communication unit 23 acquires data to be converted into a target expression through communication with such an external device.
  • the external device is, for example, the learning device 1.
  • the communication unit 23 acquires information indicating the content of the target expression conversion process through communication with the learning device 1.
  • the storage unit 24 is configured using a non-transitory computer-readable recording medium such as a magnetic hard disk device or a semiconductor storage device.
  • the storage unit 24 stores various information regarding the conversion device 2.
  • the storage unit 24 stores information input via the input unit 22 or the communication unit 23, for example.
  • the storage unit 24 stores, for example, various information generated by the operation of the control unit 21.
  • the storage unit 24 stores, for example, the contents of the target expression conversion process.
  • the output unit 25 outputs various information.
  • the output unit 25 is, for example, a communication interface communicably connected to a device that executes a downstream task.
  • the output unit 25 may include a display device such as a CRT display, a liquid crystal display, or an organic EL display.
  • the output unit 25 may be configured as an interface that connects these display devices to the conversion device 2.
  • the output unit 25 outputs the information input to the input unit 22, for example.
  • the output unit 25 may output, for example, the execution result of the target expression conversion process.
  • FIG. 8 is a diagram showing an example of the configuration of the control section 21 in the embodiment.
  • the control unit 21 includes a conversion target acquisition unit 210, an expression conversion unit 220, a storage control unit 230, a communication control unit 240, and an output control unit 250.
  • the conversion target acquisition unit 210 acquires data that is input to the communication unit 23 and is the target of expression conversion into a target expression.
  • the expression conversion unit 220 converts the expression of the data acquired by the conversion target acquisition unit 210 using target expression conversion processing.
  • the expression conversion unit 220 may execute the various processes executed by the data acquisition unit 110, such as the 0th data generation process, so that the target expression conversion process can be executed according to the data acquired by the conversion target acquisition unit 210.
  • the storage control unit 230 records various information in the storage unit 24.
  • the communication control unit 240 controls the operation of the communication unit 23.
  • the output control section 250 controls the operation of the output section 25.
  • FIG. 9 is a flowchart showing an example of the flow of processing executed by the conversion device 2 in the embodiment.
  • the conversion target acquisition unit 210 acquires data that is input to the communication unit 23 and is the target of expression conversion into a target expression (step S201).
  • the expression conversion unit 220 converts the expression of the data acquired in step S201 using target expression conversion processing (step S202).
  • the output control unit 250 controls the operation of the output unit 25 to output the result of step S202 to the output unit 25 (step S203).
  • the output destination of the output unit 25 may be, for example, a device that executes a downstream task.
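The flow of steps S201 to S203 can be illustrated with a minimal sketch. The linear map standing in for the learned target expression conversion process, and all shapes, are assumptions for illustration, not the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the learned target expression conversion process
# received from the learning device 1 (a single linear map here).
W_learned = rng.normal(size=(256, 64)) * 0.01

def target_expression_conversion(x0):
    """Convert each patch of the acquired data into the target expression."""
    return x0 @ W_learned

x0 = rng.normal(size=(65, 256))        # S201: acquire the conversion-target data
z = target_expression_conversion(x0)   # S202: convert to the target expression
print(z.shape)                         # S203: output, e.g. to a downstream task
# prints (65, 64)
```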
  • FIG. 10 is a diagram showing an example of the results of an experiment in the embodiment.
  • “ESC50” and “US8K” both indicate the task of classifying environmental sounds.
  • SPCV2 indicates the task of voice command word identification.
  • VC1 indicates a speaker identification task.
  • VF indicates the task of vocal language classification.
  • CRM-D indicates a task of classifying emotions contained in speech.
  • GTZAN indicates a music genre classification task.
  • “NSynth” indicates a task of classifying musical instruments.
  • “Surge” indicates a task of classifying musical tones.
  • MAE indicates MAE (Masked Autoencoders), which is an existing technology.
  • MABL indicates the target expression conversion process. The results in FIG. 10 show that, for all downstream tasks, the accuracy of a downstream task that uses the result of expression conversion by the target expression conversion process is higher than the accuracy of the same task using the result of expression conversion by MAE.
  • for example, the accuracy of the downstream task “ESC50” using the result of expression conversion by MAE is 87.35%, whereas the accuracy of the downstream task “ESC50” using the result of expression conversion by the target expression conversion process is 89.03%.
  • the learning device 1 configured in this way trains the first expression conversion process so as to reduce the difference between the expression of the mask data predicted based on the data excluding the mask data and the expression of the mask data obtained based on part or all of the mask data.
  • MAE restores the patch image of the masked part from the representation of the non-masked patches output by the model, and calculates the loss using the difference between the input signal and the restored signal.
  • in restoration (that is, decoding), errors may occur in the restored information. Therefore, unlike MAE, the learning device 1, which does not perform restoration during learning, can improve the accuracy of converting the expression of data to be converted into a predetermined expression (i.e., the target expression).
  • learning by the learning device 1 is different from data2vec.
  • in data2vec, all patches are input to obtain the target representation from the moving-average model, and only the masked part is used as a teacher signal, so the target representation of the masked part contains information from the non-masked part.
  • the mask portion refers to mask data
  • the non-mask portion refers to data other than mask data among the 0th data.
  • in learning by the learning device 1, a target representation of part or all of the mask data is obtained in the second representation conversion process without using the data of the 0th data that is not mask data. The result of the second representation conversion process is then compared with the result of the mask data representation prediction process. That is, in learning by the learning device 1, the result of the second representation conversion process is used as the teacher signal. As described above, the result of the second representation conversion process does not include information on the non-masked portion. Therefore, unlike data2vec, the learning device 1 can improve the accuracy of converting the expression of data to be converted into a predetermined expression.
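The learning scheme described above can be sketched as follows: only the non-masked patches are encoded by the first expression conversion, a predictor estimates the target expression of the masked part, and the teacher signal comes from a second expression conversion that sees only the masked patches, so it contains no information from the non-masked part. This is a deliberately simplified numpy sketch under assumed shapes: single linear maps stand in for the actual networks, mean pooling stands in for per-patch prediction, and the exponential-moving-average teacher update is an assumption not specified in the text above.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, D = 256, 64                    # patch dimension, target-expression dimension

W_online = rng.normal(size=(D_in, D)) * 0.01   # 1st expression conversion
W_pred   = rng.normal(size=(D, D)) * 0.01      # mask data expression prediction
W_target = W_online.copy()                     # 2nd expression conversion (teacher)

def training_step(x0, mask_ratio=0.5, lr=1e-2, ema=0.99):
    """One update: compare the predicted target expression of the masked part
    with a teacher signal computed from ONLY the masked patches."""
    global W_online, W_pred, W_target
    n = x0.shape[0]
    idx = rng.permutation(n)
    n_mask = int(n * mask_ratio)
    masked, visible = idx[:n_mask], idx[n_mask:]

    v = (x0[visible] @ W_online).mean(axis=0)      # encode visible patches only
    z_pred = v @ W_pred                            # predict masked-part expression
    target = (x0[masked] @ W_target).mean(axis=0)  # teacher: masked patches only

    diff = z_pred - target
    loss = float((diff ** 2).mean())

    # gradient step on the online encoder and predictor (teacher gets no gradient)
    m = x0[visible].mean(axis=0)
    g_pred = (2.0 / D) * np.outer(v, diff)
    g_online = (2.0 / D) * np.outer(m, W_pred @ diff)
    W_pred -= lr * g_pred
    W_online -= lr * g_online
    # teacher follows the online encoder by exponential moving average (assumed)
    W_target = ema * W_target + (1.0 - ema) * W_online
    return loss

x0 = rng.normal(size=(64, 256))   # 64 patches of the 0th data (assumed shape)
loss = training_step(x0)          # one update; the teacher saw only masked patches
```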
  • the ratio of mask data may be 50% of the data included in the 0th data.
  • in this case, the accuracy of converting the expression of the data to be converted into the predetermined expression is improved compared to when the ratio is not 50%, for the following reason (hereinafter referred to as the "first reason").
  • when the ratio of mask data is 50%, the proportion of data that is not mask data is 50% or less of the data included in the 0th data, and the proportion of data that is part of the mask data and used in the second expression conversion process may also be 50% or less of the data included in the 0th data. Since the difficulty of modeling is higher in such a case, the accuracy of converting the expression of the data to be converted into the predetermined expression is improved compared to other cases.
  • the learning device 1 and the conversion device 2 may each be implemented using a plurality of information processing devices that are communicably connected via a network.
  • each functional unit included in each of the learning device 1 and the conversion device 2 may be distributed and implemented in a plurality of information processing devices.
  • learning device 1 and the conversion device 2 do not necessarily need to be implemented as different devices.
  • the learning device 1 and the conversion device 2 may be implemented, for example, as one device that has both functions.
  • all or part of each function of the expression conversion system 100, the learning device 1, and the conversion device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array).
  • the program may be recorded on a computer-readable recording medium.
  • the computer-readable recording medium is, for example, a portable medium such as a flexible disk, magneto-optical disk, ROM, or CD-ROM, or a storage device such as a hard disk built into a computer system.
  • the program may be transmitted via a telecommunications line.

Abstract

An embodiment of the present invention is a learning device provided with a control unit that obtains, through learning, a target representation conversion process for converting the representation of data to be converted into a target representation, which is a predetermined specific representation, wherein the control unit performs: a first representation conversion process in which the representation of a processing target, which is first data obtained by removing mask data from 0th data that is data represented by a tensor, is converted into a target representation; a mask data representation prediction process in which the target representation of the mask data is predicted on the basis of the result of the first representation conversion process; a second representation conversion process in which the representation of second data, which is part or all of the mask data, is converted into a target representation; and an update process in which the content of the first representation conversion process is updated so as to reduce the difference between the result of the mask data representation prediction process and the result of the second representation conversion process. The first representation conversion process at the point in time when a prescribed condition for ending the updating is satisfied is said target representation conversion process.

Description

Learning device, conversion device, learning method, conversion method, and program

The present invention relates to a learning device, a conversion device, a learning method, a conversion method, and a program.
A technique is known that uses machine learning to generate a mathematical model that converts the expression of data to be converted into a predetermined expression. The predetermined expression is, for example, an expression required by a downstream task. Note that converting the expression of the data to be converted into a predetermined expression means an encoding process of converting the data to be converted into data expressed in the predetermined expression.
So far, MAE (Masked Autoencoders) and data2vec have been proposed as such techniques. Both of these mask part of the input information and perform learning using the result. However, both techniques sometimes have poor conversion accuracy.
In view of the above circumstances, an object of the present invention is to provide a technique that improves the accuracy of converting the expression of data to be converted into a predetermined expression.
One aspect of the present invention is a learning device including a control unit that obtains, through learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression, wherein the control unit executes: a first expression conversion process of converting the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, which is part of 0th data that is data expressed by a tensor, from the 0th data; a mask data expression prediction process of predicting the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process of converting, based on second data that is part or all of the mask data, the expression of the second data into the target expression; and an update process of updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process, and the first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied is the target expression conversion process.
One aspect of the present invention is a conversion device including: a conversion target acquisition unit that acquires data to be subjected to expression conversion into a target expression that is a predetermined expression; and an expression conversion unit that converts the expression of the data acquired by the conversion target acquisition unit using the target expression conversion process obtained by a learning device including a control unit that obtains the target expression conversion process through learning, wherein the control unit executes: a first expression conversion process of converting the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, which is part of 0th data that is data expressed by a tensor, from the 0th data; a mask data expression prediction process of predicting the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process of converting, based on second data that is part or all of the mask data, the expression of the second data into the target expression; and an update process of updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process, the first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied being the target expression conversion process.
One aspect of the present invention is a learning method including a control step of obtaining, through learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression, wherein the control step executes: a first expression conversion process of converting the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, which is part of 0th data that is data expressed by a tensor, from the 0th data; a mask data expression prediction process of predicting the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process of converting, based on second data that is part or all of the mask data, the expression of the second data into the target expression; and an update process of updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process, and the first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied is the target expression conversion process.
One aspect of the present invention is a conversion method including: a conversion target acquisition step of acquiring data to be subjected to expression conversion into a target expression that is a predetermined expression; and an expression conversion step of converting the expression of the data acquired in the conversion target acquisition step using the target expression conversion process obtained by a learning method including a control step of obtaining the target expression conversion process through learning, wherein the control step executes: a first expression conversion process of converting the expression of a processing target into the target expression, the processing target being first data, which is data obtained by removing mask data, which is part of 0th data that is data expressed by a tensor, from the 0th data; a mask data expression prediction process of predicting the target expression of the mask data based on the result of the first expression conversion process; a second expression conversion process of converting, based on second data that is part or all of the mask data, the expression of the second data into the target expression; and an update process of updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process, the first expression conversion process at the time when a predetermined condition regarding the end of the update is satisfied being the target expression conversion process.
One aspect of the present invention is a program for causing a computer to function as either the above learning device or the above conversion device.
According to the present invention, it is possible to improve the accuracy of converting the expression of data to be converted into a predetermined expression.
FIG. 1 is a diagram showing an example of the configuration of the expression conversion system of the embodiment.
FIG. 2 is an explanatory diagram illustrating patches in the embodiment.
FIG. 3 is an explanatory diagram outlining the flow of processing executed by the learning unit in the embodiment.
FIG. 4 is a flowchart showing an example of the flow of learning processing executed by the learning unit in the embodiment.
FIG. 5 is a diagram showing an example of the hardware configuration of the learning device of the embodiment.
FIG. 6 is a diagram showing an example of the configuration of the control unit in the embodiment.
FIG. 7 is a diagram showing an example of the hardware configuration of the conversion device in the embodiment.
FIG. 8 is a diagram showing an example of the configuration of the control unit in the embodiment.
FIG. 9 is a flowchart showing an example of the flow of processing executed by the conversion device in the embodiment.
FIG. 10 is a diagram showing an example of the results of an experiment in the embodiment.
(Embodiment)

FIG. 1 is a diagram showing an example of the configuration of an expression conversion system 100 according to the embodiment. The expression conversion system 100 includes a learning device 1 and a conversion device 2. The learning device 1 obtains, through learning, a process of converting the expression of data to be converted into a predetermined expression (hereinafter referred to as the "target expression"); this process is hereinafter referred to as the "target expression conversion process".
Note that converting the expression of data into the target expression means an encoding process of converting the data to be encoded into data expressed in the target expression. Therefore, converting the expression of the data to be converted into the target expression means an encoding process of converting the data to be converted into data expressed in the target expression.
The target expression is, for example, an expression embedding. The target expression is, for example, a representation as 768 floating-point values. The target expression may instead be a representation as 1024 floating-point values, or as 2048 floating-point values. The expression conversion process is a type of learning model.
The data to be converted may be any data that is obtained based on image data, acoustic signal data, natural language data, or general time-series data and that is expressed as a tensor. Therefore, the data to be converted may be, for example, image data, a spectrogram of acoustic signal data, or a sequence whose samples are words, each word being expressed as an M-dimensional vector (M is a natural number of 1 or more) (hereinafter referred to as a "natural language sequence"). The data to be converted may also be, for example, general time-series data in which symbols and numbers other than letters are expressed as vectors.
The spectrogram of the acoustic signal data is obtained based on the acoustic signal data. The natural language sequence is obtained based on the natural language data.
The learning device 1 includes a learning unit 10. The learning unit 10 executes the first expression conversion process, the mask data expression prediction process, the second expression conversion process, and the update process. Note that the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process may have been updated in advance by, for example, transfer learning.
The first expression conversion process is a process of converting the expression of a processing target into the target expression. When the learning unit 10 executes the first expression conversion process, the processing target of the first expression conversion process is the first data.
The first data is data obtained by removing the mask data, which is part of the 0th data, from the 0th data. The 0th data is data expressed by a tensor.
The 0th data is, for example, a tensor that expresses the original data in a state where it has been divided into patches, that also has data position information, which is information indicating the position of each patch, and whose elements are tensors. The 0th data is, for example, a matrix whose elements are the vectors of the patches resulting from patch division of the tensor to be divided. Patch division is a process of dividing the target of division into parts called patches.
 The original data is data that corresponds to the conversion target and is expressed as a tensor. Therefore, when the conversion target is image data, the original data is image data.
 When the data to be converted is data obtained based on acoustic signal data, the original data is data that is obtained based on the acoustic signal data and is expressed as a tensor. Such data is, for example, a spectrogram of the acoustic signal data.
 When the data to be converted is data obtained based on natural language data, the original data is data that is obtained based on the natural language data and is expressed as a tensor. Such data is, for example, a natural language sequence. When the data to be converted is data obtained based on general time series data, the original data is data that is obtained based on the general time series data and is expressed as a tensor. Such data is, for example, a time series of numerical values such as stock prices or temperatures.
 In this way, the original data is any one of the following: image data; data obtained based on acoustic signal data and expressed as a tensor; data obtained based on natural language data and expressed as a tensor; and data obtained based on general time series data and expressed as a tensor.
 One element of the tensor of the 0th data indicates information regarding one patch. One patch includes one or more elements of the tensor expressing the original data. One element is, for example, data expressed by a 768-dimensional vector. Note that the position of a patch is the position, within the 0th data, of the element corresponding to that patch.
 Note that the division into patches and the assignment of data position information may be executed by the learning device 1. That is, the process of obtaining the 0th data based on the original data (hereinafter referred to as the "0th data generation process") may be executed by the learning device 1, or may be executed by another device different from the learning device 1. The 0th data may also be, for example, the original data itself. In such a case, the 0th data generation process need not be executed.
 FIG. 2 is an explanatory diagram illustrating patches in the embodiment. Specifically, FIG. 2 illustrates patches using an example in which the conversion target is acoustic signal data. More specifically, the example of FIG. 2 illustrates patches in a case where the acoustic signal data is a spectrogram composed of a frequency axis and a time axis.
 To explain patches, FIG. 2 shows the spectrogram divided into a plurality of rectangular sections of the same size along both the frequency axis and the time axis. A patch is data that expresses each of these divided sections as a vector. For example, the data expressing one section of the region D1 in FIG. 2 is one patch.
 A patch is, for example, a vector indicating the pixel value of each pixel included in the corresponding section. The number of elements of a vector expressing a patch is the same for all such vectors; it may, for example, be proportional to the number of pixels included in a section. The coefficient of proportionality may or may not be 1. When the coefficient of proportionality is not 1, the value of each element may be, for example, a value obtained by interpolation. For example, if the number of pixels included in one patch is 256 and the coefficient is 3, the patch is a 768-dimensional vector.
 The data position information is, for example, a vector with the same number of dimensions as the vector expressing a patch. Each element of the 0th data is, for example, a vector expressed as the vector sum of a vector expressing a patch and a vector indicating the data position information.
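 The construction of one element of the 0th data described above can be sketched as follows. The toy 3-dimensional integer vectors and the function name are illustrative only (in practice a patch vector might be, for example, 768-dimensional):

```python
def embed_patch(patch_vec, pos_vec):
    """One element of the 0th data: the vector sum of a patch vector
    and a data-position-information vector of the same dimension."""
    assert len(patch_vec) == len(pos_vec)
    return [p + q for p, q in zip(patch_vec, pos_vec)]

patch = [5, -10, 20]   # vector expressing a patch (toy 3-dim example)
pos   = [1, 2, 3]      # data position information, same dimension
element = embed_patch(patch, pos)
# element == [6, -8, 23]
```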
 In the example of FIG. 2, the position of each patch is given by its position in the frequency axis direction and in the time axis direction. Therefore, in the example of FIG. 2, the 0th data is a matrix whose elements are vectors. In the example of FIG. 2, when a patch is a vector whose number of elements is proportional to the number of pixels, the 0th data is a matrix whose elements are first-order tensors. Note that it goes without saying that a first-order tensor is a vector and a second-order tensor is a matrix. Incidentally, a zeroth-order tensor is a scalar.
 Note that when the coefficient of proportionality is greater than 1, it is possible to suppress an increase in the loss of information when the data is encoded in various processes such as the first expression conversion process.
 In this way, the 0th data is obtained based on the original data. The value of one element of the tensor expressing the 0th data expresses, as a vector, one of the sections obtained when the tensor expressing the original data is divided according to a predetermined rule. The dimension of the vector expressing a section is, for example, larger than the number of elements of the tensor expressing the original data that are included in that section.
 Returning to the explanation of FIG. 1, the mask data expression prediction process is a process of predicting the target expression of the mask data based on the result of the first expression conversion process. The second expression conversion process is a process of converting, based on second data that is a part or all of the mask data, the expression of the second data into the target expression. The update process is a process of updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce the difference between the result of the mask data expression prediction process and the result of the second expression conversion process (hereinafter referred to as the "conversion error").
 The conversion error may be the MSE (mean square error) between the result of the mask data expression prediction process and the result of the second expression conversion process, or may be the L1 error, that is, the mean absolute value of their difference.
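 The two candidate conversion errors can be written out directly. This is a generic sketch, not code from the embodiment; the vectors stand in for the flattened results of the mask data expression prediction process and the second expression conversion process:

```python
def mse(y_pred, y_true):
    """Mean square error between predicted and target representations."""
    return sum((a - b) ** 2 for a, b in zip(y_pred, y_true)) / len(y_true)

def l1(y_pred, y_true):
    """Mean absolute value of the difference (L1 error)."""
    return sum(abs(a - b) for a, b in zip(y_pred, y_true)) / len(y_true)

pred   = [1.0, 2.0, 4.0]   # e.g. result of mask data expression prediction
target = [1.0, 3.0, 2.0]   # e.g. result of second expression conversion
# mse -> (0 + 1 + 4) / 3, l1 -> (0 + 1 + 2) / 3
```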
 In this way, the learning unit 10 trains the first expression conversion process and the mask data expression prediction process so as to reduce the difference between the target expression of the mask data obtained based on the data excluding the mask data and the target expression of the mask data obtained based on a part or all of the mask data.
 The conversion device 2 converts the expression of data to be processed using the first expression conversion process as updated up to the point at which a predetermined condition regarding the end of updating (hereinafter referred to as the "update end condition") is satisfied.
 Here, an example of the flow of processing executed by the learning unit 10 will be explained with reference to the drawings.
 FIG. 3 is an explanatory diagram illustrating an overview of the flow of processing executed by the learning unit 10 in the embodiment. In FIG. 3, the data x is an example of the 0th data. In the example of FIG. 3, the learning unit 10 generates the first data and the second data based on the 0th data x. In the example of FIG. 3, the data D101 is an example of the first data, and the data D102 is an example of the second data.
 The data D101 is data in which a part of the 0th data is masked. Note that masking means a process by which data to be processed is excluded from processing by a predetermined other process. In other words, masking means a process of restricting information access. The predetermined other process is, for example, the mask data expression prediction process. An example of the result of masking can be explained by showing how masked data is handled in the next stage. For example, when the input data is the sequence "010101010101" and masking converts it into the sequence "****0101****", the data marked "*" is not subject to the next stage of processing. This is the result of masking. In this example, "*" is not subject to the next stage of processing; that is, "*" is an example of data to which information access is restricted.
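 The masking example above can be reproduced in a short sketch. The mask symbol "*" and the function name are illustrative only:

```python
def mask_sequence(seq, masked_positions, mask_symbol="*"):
    """Mask the elements at the given positions; masked elements are
    the ones excluded from the next stage of processing."""
    return "".join(mask_symbol if i in masked_positions else ch
                   for i, ch in enumerate(seq))

masked = mask_sequence("010101010101", set(range(0, 4)) | set(range(8, 12)))
# masked == "****0101****"
visible = [ch for ch in masked if ch != "*"]  # only these reach the next stage
```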
 In FIG. 3, plain patches represent masked patches, and patches that are not plain represent unmasked patches. The patch P1 is an example of a plain patch, and the patch P2 is an example of a patch that is not plain. The set of masked patches is an example of the mask data.
 In the example of FIG. 3, the mask data in the first data is data that is not mask data in the second data. Therefore, the data including the data that is not mask data in the first data and the data that is not mask data in the second data includes the 0th data.
 However, as described in the explanation of the second expression conversion process, it is not necessarily the case that all of the mask data in the first data is data that is not mask data in the second data. The data that is not mask data in the second data may be a part of the mask data in the first data.
 The determination of which data of the 0th data becomes the mask data in the first data may be performed in any manner, for example at random. Hereinafter, the process of determining which data of the 0th data becomes the mask data in the first data is referred to as the first mask data determination process.
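 A random first mask data determination process might be sketched as follows, under the assumption that patch indices are selected uniformly at random at a fixed mask ratio; the ratio, seed, and function name are arbitrary choices for illustration:

```python
import random

def first_mask_determination(num_patches, mask_ratio, seed=0):
    """Randomly choose which patch indices of the 0th data become
    the mask data of the first data."""
    rng = random.Random(seed)          # deterministic for reproducibility
    k = int(num_patches * mask_ratio)  # number of patches to mask
    return sorted(rng.sample(range(num_patches), k))

masked_idx = first_mask_determination(num_patches=12, mask_ratio=0.5)
# 6 of the 12 patch indices are selected as mask data
```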
 As described above, the second data is, for example, the data determined as the mask data in the first data. However, as described above, not all of the data determined as mask data in the first data necessarily has to be the second data. An example will be given of how the determination is made when the second data is a part, rather than all, of the data determined as mask data in the first data. In such a case, the determination of which of the data determined as mask data in the first data is used as the mask data may likewise be performed in any manner, for example at random.
 Hereinafter, the process of determining which of the data determined as mask data in the first data is used as mask data is referred to as the second mask data determination process. Note that, as can be seen from the explanation so far, when the second data is all of the data determined as mask data in the first data, the second mask data determination process does not necessarily have to be executed, because in that case the data determined as mask data in the first data is itself the second data.
 In the example of FIG. 3, the first mask data determination process and the second mask data determination process are executed by the learning unit 10. However, they do not necessarily have to be executed by the learning unit 10. For example, a device other than the learning device 1 may carry out the processing up to the generation of the first data and the second data, and the generated first data and second data may be input to the learning unit 10. In such a case, the learning unit 10 does not execute the first mask data determination process or the second mask data determination process. Note that when the second data is all of the mask data in the first data, the second mask data determination process need not be executed.
 In the example of FIG. 3, the learning unit 10 executes the first conversion process on the first data. The function fθ(·) in FIG. 3 represents the first conversion process. The symbol zθ in FIG. 3 represents the result of the addition process. The addition process is a process of adding, to the result of the first conversion process on the first data, as many mask tokens as there are elements belonging to the mask data. A mask token is information indicating, for an element in the 0th data, whether or not that element belongs to the mask data.
 An element belonging to the mask data is an element, among the elements of the tensor expressing the 0th data, that belongs to the mask data in the first data. Therefore, an element belonging to the mask data is, for example, a patch determined to be mask data in the first mask data determination process.
 That is, the addition process is a process of adding, to the result of the first conversion process, a vector obtained by adding together information indicating the position of the patch P1 within the 0th data and information indicating that it is a mask token. The result of the addition process is thus an example of a result based on the result of the first conversion process. When the addition process is executed, as in the example of FIG. 3, it is executed by, for example, the learning unit 10.
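 Under this description, the addition process might be sketched as follows: for each masked position, a vector that is the sum of a mask-token vector and the corresponding position-information vector is placed into the sequence of first-conversion-process outputs. The toy 2-dimensional vectors and all names are illustrative only, not the claimed implementation:

```python
def add_mask_tokens(encoded_visible, masked_positions, total_len,
                    mask_token, pos_info):
    """Rebuild a full-length sequence: first-conversion-process outputs
    at visible positions, and (mask token + position information)
    vectors at masked positions."""
    out, it = [], iter(encoded_visible)
    for i in range(total_len):
        if i in masked_positions:
            out.append([m + p for m, p in zip(mask_token, pos_info[i])])
        else:
            out.append(next(it))
    return out

# toy 2-dim vectors; a real model might use e.g. 768-dim embeddings
pos_info = [[i, i] for i in range(4)]       # position-information vectors
z = add_mask_tokens([[9, 9], [8, 8]],       # outputs for the visible patches
                    masked_positions={1, 3},
                    total_len=4,
                    mask_token=[100, 100],
                    pos_info=pos_info)
# z == [[9, 9], [101, 101], [8, 8], [103, 103]]
```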
 In the example of FIG. 3, the learning unit 10 executes the mask data expression prediction process on the result of the addition process. The addition process is thus executed after the first expression conversion process and before the mask data expression prediction process. The function qθ(·) in FIG. 3 represents the mask data expression prediction process, and yθ in FIG. 3 represents the result of the mask data expression prediction process.
 In the example of FIG. 3, the learning unit 10 executes the second conversion process on the second data. The function fξ(·) in FIG. 3 represents the second conversion process. The function fθ representing the first conversion process and the function fξ representing the second conversion process are both parameterized functions whose parameter values are updated by the update process.
 The symbol fθ means the function f whose parameter values are θ, and the symbol fξ means the function f whose parameter values are ξ. Therefore, the symbols fθ and fξ indicate that, apart from the difference in parameters, the function of the first conversion process and the function of the second conversion process are the same. The symbol z′ξ in FIG. 3 represents the result of the second conversion process.
 The result of the mask data expression prediction process is the target expression of the mask data. The result of the second conversion process is the target expression of all or part of the mask data. Therefore, by updating the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process based on the conversion error so as to reduce it, the accuracy of converting the expression of the data to be converted into the predetermined expression improves.
 "Maximize agreement" in FIG. 3 indicates that the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process are updated so as to reduce the conversion error. That is, "Maximize agreement" in FIG. 3 indicates that the learning unit 10 executes the update process.
 "Stop gradient" in FIG. 3 indicates that backpropagation is not executed when the content of the second expression conversion process is updated. Note that updating the content of a process specifically means updating the values of the parameters included in the function executed by that process. Therefore, in the example of the second conversion process expressed by the function fξ, updating the value of the parameter ξ constitutes updating the content of the second conversion process. In the example of FIG. 3, the content of the second expression conversion process is updated not by backpropagation but by another update process, for example a predetermined exponential moving average process based on the content of the first expression conversion process.
 An example of the exponential moving average process will now be described. The contents of the first conversion process and of the second conversion process change, specifically, when the values of the parameters of the parameterized functions expressing each process are changed. The parameterized function expressing the first conversion process is, for example, the function fθ, and the parameterized function expressing the second conversion process is, for example, the function fξ. In this case, the value of the parameter ξ after the t-th update obtained by the exponential moving average process is, for example, ξ[t] = βξ[t-1] + (1-β)θ[t-1], where β is a predetermined constant, for example 0.99.
 In fact, ξ[t] = βξ[t-1] + (1-β)θ[t-1] is a quantity obtained by accumulating the value of θ at each update with predetermined weights; the exponential moving average process therefore yields a weighted average of the values of θ over the updates. This average converges as the updates progress and the accuracy increases. The exponential moving average process therefore makes it possible to update ξ toward values with high conversion accuracy. Once the value of ξ converges to a value with high conversion accuracy, the value of θ also converges to a value with high conversion accuracy as a result of the update process that reduces the conversion error.
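 The update ξ[t] = βξ[t-1] + (1-β)θ[t-1] can be written out directly. This is a generic exponential-moving-average step applied elementwise, not code from the embodiment:

```python
def ema_update(xi, theta, beta=0.99):
    """One exponential-moving-average step, applied elementwise:
    xi[t] = beta * xi[t-1] + (1 - beta) * theta[t-1]."""
    return [beta * x + (1.0 - beta) * t for x, t in zip(xi, theta)]

xi = [0.0, 0.0]         # parameters of the second conversion process
theta = [1.0, 2.0]      # parameters of the first conversion process
xi = ema_update(xi, theta, beta=0.9)
# xi is now approximately [0.1, 0.2]
```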
 FIG. 4 is a flowchart illustrating an example of the flow of the learning process executed by the learning unit 10 in the embodiment. The learning unit 10 acquires the first data and the second data (step S101). Next, the learning unit 10 executes the first conversion process (step S102). Next, the learning unit 10 executes the mask data expression prediction process (step S103). Next, the learning unit 10 executes the second conversion process (step S104).
 Next, the learning unit 10 executes the update process (step S105). The learning unit 10 then determines whether the update end condition is satisfied (step S106). If the update end condition is satisfied (step S106: YES), the process ends. If the update end condition is not satisfied (step S106: NO), the process returns to step S101.
 The first expression conversion process at the point at which the update end condition is satisfied is the target expression conversion process. During learning by the learning unit 10, the processing target of the first expression conversion process is the first data. However, the mask data is not predetermined and is not the same at every learning iteration. Therefore, the first expression conversion process at the point at which the update end condition is satisfied (that is, the target expression conversion process) can convert, with high accuracy, the expression of a processing target that contains no mask data into the target expression. Note that the processing of steps S102 to S103 and the processing of step S104 may be executed in parallel.
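 The flow of steps S101 to S106 can be sketched as a loop over placeholder callables. Nothing here reflects the internal structure of the actual processes; the toy instantiation below only exercises the control flow:

```python
def train(get_batch, f_theta, q_theta, f_xi, update, done):
    """Sketch of the learning flow of FIG. 4 (steps S101-S106).
    All callables are placeholders for the processes in the text."""
    while True:
        d1, d2 = get_batch()   # S101: acquire first data and second data
        z = f_theta(d1)        # S102: first conversion process
        y = q_theta(z)         # S103: mask data expression prediction
        z2 = f_xi(d2)          # S104: second conversion process
        update(y, z2)          # S105: update to reduce the conversion error
        if done():             # S106: update end condition
            break

# toy instantiation: count calls and stop after three update steps
calls = []
state = {"step": 0}
def get_batch(): return ("d1", "d2")
def f_theta(x): calls.append("f_theta"); return x
def q_theta(z): calls.append("q_theta"); return z
def f_xi(x): calls.append("f_xi"); return x
def update(y, z2): state["step"] += 1
def done(): return state["step"] >= 3

train(get_batch, f_theta, q_theta, f_xi, update, done)
# three iterations, i.e. three update steps, were executed
```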
 FIG. 5 is a diagram showing an example of the hardware configuration of the learning device 1 of the embodiment. The learning device 1 includes a control unit 11 that includes a processor 91 such as a CPU (Central Processing Unit) and a memory 92 connected by a bus, and executes a program. By executing the program, the learning device 1 functions as a device including the control unit 11, an input unit 12, a communication unit 13, a storage unit 14, and an output unit 15.
 More specifically, the processor 91 reads the program stored in the storage unit 14 and stores the read program in the memory 92. When the processor 91 executes the program stored in the memory 92, the learning device 1 functions as a device including the control unit 11, the input unit 12, the communication unit 13, the storage unit 14, and the output unit 15.
 The control unit 11 controls the operation of the various functional units included in the learning device 1. The control unit 11 includes the learning unit 10. The control unit 11 therefore executes, for example, the first expression conversion process, the mask data expression prediction process, the second expression conversion process, and the update process. The control unit 11 may further execute the first mask data determination process, the second mask data determination process, or the addition process.
 The control unit 11 controls, for example, the operation of the output unit 15. The control unit 11 records, in the storage unit 14, various kinds of information generated by the execution of various processes such as the first expression conversion process, the mask data expression prediction process, the second expression conversion process, and the update process.
 The input unit 12 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 12 may be configured as an interface that connects these input devices to the learning device 1. The input unit 12 receives input of various kinds of information to the learning device 1.
 The communication unit 13 includes a communication interface for connecting the learning device 1 to an external device. The communication unit 13 communicates with the external device by wire or wirelessly. The external device is, for example, the device from which the 0th data is transmitted. When the external device is the device from which the 0th data is transmitted, the communication unit 13 acquires the 0th data by communicating with that device.
 The external device is, for example, the device from which the original data is transmitted. When the external device is the device from which the original data is transmitted, the communication unit 13 acquires the original data by communicating with that device.
 The external device is, for example, the device from which the acoustic signal data is transmitted. When the external device is the device from which the acoustic signal data is transmitted, the communication unit 13 acquires the acoustic signal data by communicating with that device.
 The external device is, for example, the device from which the natural language data is transmitted. When the external device is the device from which the natural language data is transmitted, the communication unit 13 acquires the natural language data by communicating with that device.
 The external device is, for example, the device from which general time series data, such as vector representations of symbols and numerical values other than characters, is transmitted. When the external device is the device from which the general time series data is transmitted, the communication unit 13 acquires the general time series data by communicating with that device.
 Note that the various kinds of information input to the communication unit 13 may be input to the input unit 12 instead of the communication unit 13.
 The external device is, for example, the conversion device 2. By communicating with the conversion device 2, the communication unit 13 transmits to the conversion device 2 information indicating the content of the first expression conversion process (that is, the target expression conversion process) at the point at which the update end condition is satisfied.
 The storage unit 14 is configured using a non-transitory computer-readable recording medium such as a magnetic hard disk device or a semiconductor storage device. The storage unit 14 stores various kinds of information regarding the learning device 1. The storage unit 14 stores, for example, information input via the input unit 12 or the communication unit 13, and various kinds of information generated by the operation of the control unit 11.
 The output unit 15 outputs various information. The output unit 15 includes a display device such as a CRT (Cathode Ray Tube) display, a liquid crystal display, or an organic EL (Electro-Luminescence) display. The output unit 15 may be configured as an interface that connects these display devices to the learning device 1. The output unit 15 outputs, for example, information input to the input unit 12. The output unit 15 may display, for example, the results of processing by the control unit 11.
 FIG. 6 is a diagram showing an example of the configuration of the control unit 11 in the embodiment. The control unit 11 includes a learning unit 10, a data acquisition unit 110, a storage control unit 120, a communication control unit 130, and an output control unit 140.
 The data acquisition unit 110 acquires data to be sent to the learning unit 10. The data sent to the learning unit 10 may be the 0th data, or may be a set of first data and second data. When the data acquisition unit 110 sends a set of first data and second data to the learning unit 10 and the data acquisition unit 110 acquires the 0th data, the data acquisition unit 110 executes a first mask data determination process and a second mask data determination process.
 When the input unit 12 or the communication unit 13 acquires acoustic signal data, the data acquisition unit 110 acquires, as original data, data that is obtained based on the acoustic signal data acquired by the input unit 12 or the communication unit 13 and that is expressed as a tensor. The data acquisition unit 110 then acquires the 0th data based on the obtained original data by executing a 0th data generation process.
 When the input unit 12 or the communication unit 13 acquires natural language data, the data acquisition unit 110 acquires, as original data, data that is obtained based on the natural language data acquired by the input unit 12 or the communication unit 13 and that is expressed as a tensor. The data acquisition unit 110 then acquires the 0th data based on the obtained original data by executing the 0th data generation process.
 When the input unit 12 or the communication unit 13 acquires general time-series data in which symbols and numbers other than characters are expressed as vectors, the data acquisition unit 110 acquires, as original data, data that is obtained based on the general time-series data acquired by the input unit 12 or the communication unit 13 and that is expressed as a tensor. The data acquisition unit 110 then acquires the 0th data based on the obtained original data by executing the 0th data generation process.
 When the input unit 12 or the communication unit 13 acquires the 0th data, the data acquisition unit 110 acquires the 0th data acquired by the input unit 12 or the communication unit 13. When the input unit 12 or the communication unit 13 acquires a set of first data and second data, the data acquisition unit 110 acquires the set of first data and second data acquired by the input unit 12 or the communication unit 13.
 The storage control unit 120 records various information in the storage unit 14. The communication control unit 130 controls the operation of the communication unit 13. The output control unit 140 controls the operation of the output unit 15.
 FIG. 7 is a diagram showing an example of the hardware configuration of the conversion device 2 in the embodiment. The conversion device 2 includes a control unit 21 including a processor 93 such as a CPU and a memory 94 connected via a bus, and executes a program. By executing the program, the conversion device 2 functions as a device including the control unit 21, an input unit 22, a communication unit 23, a storage unit 24, and an output unit 25.
 More specifically, the processor 93 reads the program stored in the storage unit 24 and stores the read program in the memory 94. When the processor 93 executes the program stored in the memory 94, the conversion device 2 functions as a device including the control unit 21, the input unit 22, the communication unit 23, the storage unit 24, and the output unit 25.
 The control unit 21 controls the operations of the various functional units included in the conversion device 2. The control unit 21 acquires, for example, information obtained by the learning device 1 that indicates the content of the first expression conversion process (that is, the target expression conversion process) at the time when the update end condition is satisfied, and records it in the storage unit 24.
 The control unit 21 executes the target expression conversion process. The control unit 21 executes the target expression conversion process by, for example, reading and executing the information indicating the content of the target expression conversion process recorded in the storage unit 24. The control unit 21 controls, for example, the operation of the output unit 25. The control unit 21 records, for example, various information generated by executing the target expression conversion process in the storage unit 24.
 The input unit 22 includes input devices such as a mouse, a keyboard, and a touch panel. The input unit 22 may be configured as an interface that connects these input devices to the conversion device 2. The input unit 22 receives input of various information to the conversion device 2.
 The communication unit 23 includes a communication interface for connecting the conversion device 2 to an external device. The communication unit 23 communicates with the external device via a wired or wireless connection. The external device is, for example, the device that is the source of the data whose expression is to be converted into the target expression. The communication unit 23 acquires the data to be converted into the target expression through communication with such an external device. The external device is, for example, the learning device 1. The communication unit 23 acquires information indicating the content of the target expression conversion process through communication with the learning device 1.
 Note that the various information input to the communication unit 23 may be input to the input unit 22 instead of the communication unit 23.
 The storage unit 24 is configured using a non-transitory computer-readable recording medium such as a magnetic hard disk device or a semiconductor storage device. The storage unit 24 stores various information regarding the conversion device 2. The storage unit 24 stores, for example, information input via the input unit 22 or the communication unit 23. The storage unit 24 stores, for example, various information generated by the operation of the control unit 21. The storage unit 24 stores, for example, the content of the target expression conversion process.
 The output unit 25 outputs various information. The output unit 25 is, for example, a communication interface communicably connected to a device that executes a downstream task. The output unit 25 may include a display device such as a CRT display, a liquid crystal display, or an organic EL display. The output unit 25 may be configured as an interface that connects these display devices to the conversion device 2. The output unit 25 outputs, for example, information input to the input unit 22. The output unit 25 may output, for example, the execution result of the target expression conversion process.
 FIG. 8 is a diagram showing an example of the configuration of the control unit 21 in the embodiment. The control unit 21 includes a conversion target acquisition unit 210, an expression conversion unit 220, a storage control unit 230, a communication control unit 240, and an output control unit 250. The conversion target acquisition unit 210 acquires the data, input to the communication unit 23, whose expression is to be converted into the target expression.
 The expression conversion unit 220 converts the expression of the data acquired by the conversion target acquisition unit 210 using the target expression conversion process. The expression conversion unit 220 may execute the various processes executed by the data acquisition unit 110, such as the 0th data generation process, so that the target expression conversion process can be executed in accordance with the data acquired by the conversion target acquisition unit 210.
 The storage control unit 230 records various information in the storage unit 24. The communication control unit 240 controls the operation of the communication unit 23. The output control unit 250 controls the operation of the output unit 25.
 FIG. 9 is a flowchart showing an example of the flow of processing executed by the conversion device 2 in the embodiment. The conversion target acquisition unit 210 acquires the data, input to the communication unit 23, whose expression is to be converted into the target expression (step S201). Next, the expression conversion unit 220 converts the expression of the data acquired in step S201 using the target expression conversion process (step S202). Next, the output control unit 250 controls the operation of the output unit 25 to cause the output unit 25 to output the result of step S202 (step S203). Note that the output destination of the output unit 25 may be, for example, a device that executes a downstream task.
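The three steps S201 to S203 can be sketched as a simple pipeline. This is a minimal illustration only: the patent does not fix a concrete architecture for the learned target expression conversion process, so `target_expression_conversion` below is a hypothetical stand-in (a fixed linear projection over patch vectors), and all names and dimensions are assumptions.

```python
import numpy as np

# Hypothetical stand-in for the learned target expression conversion
# process; a real implementation would be the trained first expression
# conversion model obtained from the learning device 1.
def target_expression_conversion(x: np.ndarray) -> np.ndarray:
    # Toy example: project each patch vector into an 8-dimensional
    # representation space with a fixed random matrix.
    rng = np.random.default_rng(0)
    w = rng.standard_normal((x.shape[-1], 8))
    return x @ w

def convert(data: np.ndarray) -> np.ndarray:
    # S201: acquire the data whose expression is to be converted
    # (here it is simply passed in by the caller).
    acquired = data
    # S202: convert its expression into the target expression.
    representation = target_expression_conversion(acquired)
    # S203: output the result, e.g. to a downstream-task device
    # (here: return it to the caller).
    return representation

features = convert(np.ones((4, 16)))  # 4 patches, 16 dimensions each
print(features.shape)                 # (4, 8)
```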
<Experiment results>
 The results of an experiment in which downstream tasks were executed on data whose expression had been converted using the target expression conversion process will be explained with reference to FIG. 10. In the experiment, the downstream tasks were classification of environmental sounds, voice command word identification, speaker identification, spoken language classification, classification of emotions contained in speech, music genre classification, instrument classification of musical tones, and pitch classification of musical tones.
 FIG. 10 is a diagram showing an example of the results of the experiment in the embodiment. "ESC50" and "US8K" both indicate the task of classifying environmental sounds. "SPCV2" indicates the task of voice command word identification. "VC1" indicates the task of speaker identification. "VF" indicates the task of spoken language classification. "CRM-D" indicates the task of classifying emotions contained in speech. "GTZAN" indicates the task of music genre classification. "NSynth" indicates the task of instrument classification of musical tones. "Surge" indicates the task of pitch classification of musical tones.
 "MAE" indicates MAE (Masked Autoencoders), which is an existing technique. "MABL" indicates the target expression conversion process. The results in FIG. 10 show that, for every downstream task, the accuracy of the downstream task using the result of expression conversion by the target expression conversion process is higher than the accuracy of the downstream task using the result of expression conversion by MAE.
 For example, for "ESC50", the accuracy of the downstream task "ESC50" using the result of expression conversion by MAE is 87.35%, whereas the accuracy of the downstream task "ESC50" using the result of expression conversion by the target expression conversion process is 89.03%.
 For example, for "VC1", the accuracy of the downstream task "VC1" using the result of expression conversion by MAE is 54.64%, whereas the accuracy of the downstream task "VC1" using the result of expression conversion by the target expression conversion process is 58.96%.
 The learning device 1 configured in this way trains the first expression conversion process so as to reduce the difference between the representation of the mask data obtained based on the data excluding the mask data and the representation of the mask data obtained based on part or all of the mask data.
 This differs from MAE, which restores the patch images of the masked portions from the representations of the non-masked patches output by the model and computes the loss using the difference between the input signal and the restored signal. In the case of MAE, restoration (that is, decoding) is performed after the expression conversion, so errors may arise in the information as a result of the restoration. Therefore, unlike MAE, the learning device 1, which performs no restoration during learning, can improve the accuracy of converting the expression of data to be converted into the predetermined expression (that is, the target expression).
 Learning by the learning device 1 also differs from data2vec. In the case of data2vec, all patches are input to obtain the target representations of a moving-average model, and only the masked portion of those representations is used as the teacher signal, so the target representations of the masked portion contain information from the non-masked portion. Note that the masked portion refers to the mask data, and the non-masked portion refers to the data in the 0th data other than the mask data. As a result, in data2vec, learning comes to depend on the information of the non-masked portion, and learning that improves the accuracy of the expression conversion may not take place.
 In contrast, in learning by the learning device 1, the second expression conversion process obtains the target representations of part or all of the mask data without using the data in the 0th data that is not mask data. The result of the second expression conversion process is then compared with the result of the mask data expression prediction process. That is, in learning by the learning device 1, the result of the second expression conversion process is used as the teacher signal. As described above, the result of the second expression conversion process contains no information from the non-masked portion. Therefore, unlike data2vec, the learning device 1 can improve the accuracy of converting the expression of data to be converted into the predetermined expression.
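The loss described above can be sketched numerically. This is a deliberately crude illustration, not the claimed method: the encoders below are plain linear maps, the mask-representation predictor is reduced to pooling plus an identity matrix, and all variable names are assumptions. What it does show is the key structural point: the teacher signal is computed from the masked patches alone, and the loss is taken directly in representation space, with no signal reconstruction as in MAE.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 8                                   # patch dim, representation dim

W_student = rng.standard_normal((D, H)) * 0.1  # first expression conversion (toy)
W_teacher = W_student.copy()                   # second expression conversion (toy)
W_pred = np.eye(H)                             # mask data expression predictor (toy)

x0 = rng.standard_normal((10, D))              # 0th data: 10 patch vectors
mask = rng.permutation(10) < 5                 # 50% of the patches become mask data

x1 = x0[~mask]                                 # first data: non-masked patches
x2 = x0[mask]                                  # second data: the masked patches

# First expression conversion + mask data expression prediction:
# predict the target representations of the masked patches from the
# non-masked patches only (here, crudely, via their mean representation).
h1 = x1 @ W_student
pred = np.repeat((h1.mean(axis=0) @ W_pred)[None, :], x2.shape[0], axis=0)

# Second expression conversion: target representations computed from the
# masked patches ALONE, so no non-masked information leaks into the teacher.
target = x2 @ W_teacher

# Update process would reduce this difference; unlike MAE, the loss is
# defined in representation space, with no decoding back to the signal.
loss = float(np.mean((pred - target) ** 2))
print(loss >= 0.0)  # True
```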
(Modification)
 Note that, in the first data, the proportion of mask data may be 50% of the data included in the 0th data. When the proportion of mask data is 50%, the accuracy of converting the expression of the data to be converted into the predetermined expression is improved compared with the case where it is not 50%, for the following reason (hereinafter referred to as the "first reason").
<First reason>
 For the target representations output by the first expression conversion process and the second expression conversion process, the conversion error becomes small by modeling the whole of the data including the respective mask data. Modeling is easy when there is little mask data and difficult when there is much. Target representations processed at different levels of difficulty differ in the amount of information available for modeling and are therefore heterogeneous. By setting the proportion of mask data to 50%, the first data and the second data become homogeneous target representations processed from the same amount of information, and their conversion error is optimal for learning better modeling.
 Note that, in the first data, the proportion of data that is not mask data may be 50% or less of the data included in the 0th data, and the proportion of data that is a part of the mask data and is used in the second expression conversion process may also be 50% or less of the data included in the 0th data. In such a case the difficulty of modeling is higher, so the accuracy of converting the expression of the data to be converted into the predetermined expression is improved compared with the case where this is not so.
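The mask-ratio bookkeeping described above amounts to a random partition of the 0th data's patches. The helper below is a minimal sketch under assumed names (the patent does not specify how the first and second mask data determination processes select indices); it splits tensor-represented patches into the first data (non-masked patches) and the mask data, from which the second data can then be drawn.

```python
import numpy as np

def split_0th_data(x0: np.ndarray, mask_ratio: float, seed: int = 0):
    """Randomly split the 0th data (rows = patches) into first data
    (non-masked patches) and mask data according to mask_ratio."""
    rng = np.random.default_rng(seed)
    n = x0.shape[0]
    n_mask = int(round(n * mask_ratio))
    idx = rng.permutation(n)
    mask_idx, keep_idx = idx[:n_mask], idx[n_mask:]
    return x0[keep_idx], x0[mask_idx]

x0 = np.zeros((100, 16))  # 100 patch vectors of dimension 16
first_data, mask_data = split_0th_data(x0, mask_ratio=0.5)
print(first_data.shape[0], mask_data.shape[0])  # 50 50
```

With `mask_ratio=0.5` the two halves carry the same number of patches, matching the homogeneity argument of the first reason; ratios above 0.5 leave fewer non-masked patches and make the modeling correspondingly harder, as in the variant above.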
 The learning device 1 and the conversion device 2 may each be implemented using a plurality of information processing devices communicably connected via a network. In this case, the functional units of the learning device 1 and of the conversion device 2 may each be distributed across and implemented in the plurality of information processing devices.
 Note that the learning device 1 and the conversion device 2 do not necessarily need to be implemented as separate devices. The learning device 1 and the conversion device 2 may be implemented, for example, as one device having the functions of both.
 Note that all or part of each function of the expression conversion system 100, the learning device 1, and the conversion device 2 may be realized using hardware such as an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array). The program may be recorded on a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. The program may be transmitted via a telecommunication line.
 Although an embodiment of the present invention has been described above in detail with reference to the drawings, the specific configuration is not limited to this embodiment, and designs and the like within a scope not departing from the gist of the present invention are also included.
 100…expression conversion system, 1…learning device, 2…conversion device, 10…learning unit, 11…control unit, 12…input unit, 13…communication unit, 14…storage unit, 15…output unit, 110…data acquisition unit, 120…storage control unit, 130…communication control unit, 140…output control unit, 21…control unit, 22…input unit, 23…communication unit, 24…storage unit, 25…output unit, 210…conversion target acquisition unit, 220…expression conversion unit, 230…storage control unit, 240…communication control unit, 250…output control unit, 91…processor, 92…memory, 93…processor, 94…memory

Claims (8)

  1.  A learning device comprising:
     a control unit that obtains, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression,
     wherein the control unit executes:
     a first expression conversion process that takes as a processing target first data, which is data obtained by removing, from 0th data that is data expressed by a tensor, mask data that is a part of the 0th data, and converts the expression of the processing target into the target expression;
     a mask data expression prediction process that predicts the target expression of the mask data based on a result of the first expression conversion process;
     a second expression conversion process that converts, based on second data that is a part or all of the mask data, the expression of the second data into the target expression; and
     an update process that updates contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce a difference between a result of the mask data expression prediction process and a result of the second expression conversion process, and
     wherein the first expression conversion process at a time when a predetermined condition regarding the end of updating is satisfied is the target expression conversion process.
  2.  The learning device according to claim 1, wherein the control unit executes, after the first expression conversion process and before the mask data expression prediction process, an addition process that adds, to the result of the first expression conversion process, mask tokens indicating whether elements in the 0th data belong to the mask data, in a number equal to the number of elements belonging to the mask data, and
     the processing target of the mask data expression prediction process is a result of the addition process.
  3.  The learning device according to claim 1, wherein, in the first data, the proportion of mask data is 50% of the data included in the 0th data.
  4.  The learning device according to claim 1, wherein, in the first data, the proportion of data that is not mask data is 50% or less of the data included in the 0th data, and the proportion of data that is a part of the mask data and is used in the second expression conversion process is also 50% or less of the data included in the 0th data.
  5.  A conversion device comprising:
     a conversion target acquisition unit that acquires data whose expression is to be converted into a target expression that is a predetermined expression; and
     an expression conversion unit that converts the expression of the data acquired by the conversion target acquisition unit using the target expression conversion process obtained by a learning device,
     the learning device comprising a control unit that obtains, by learning, the target expression conversion process, which is a process of converting the expression of data to be converted into the target expression that is a predetermined expression, the control unit executing: a first expression conversion process that takes as a processing target first data, which is data obtained by removing, from 0th data that is data expressed by a tensor, mask data that is a part of the 0th data, and converts the expression of the processing target into the target expression; a mask data expression prediction process that predicts the target expression of the mask data based on a result of the first expression conversion process; a second expression conversion process that converts, based on second data that is a part or all of the mask data, the expression of the second data into the target expression; and an update process that updates contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce a difference between a result of the mask data expression prediction process and a result of the second expression conversion process, the first expression conversion process at a time when a predetermined condition regarding the end of updating is satisfied being the target expression conversion process.
  6.  A learning method comprising:
     a control step of obtaining, by learning, a target expression conversion process, which is a process of converting the expression of data to be converted into a target expression that is a predetermined expression,
     wherein the control step executes:
     a first expression conversion process that takes as a processing target first data, which is data obtained by removing, from 0th data that is data expressed by a tensor, mask data that is a part of the 0th data, and converts the expression of the processing target into the target expression;
     a mask data expression prediction process that predicts the target expression of the mask data based on a result of the first expression conversion process;
     a second expression conversion process that converts, based on second data that is a part or all of the mask data, the expression of the second data into the target expression; and
     an update process that updates contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce a difference between a result of the mask data expression prediction process and a result of the second expression conversion process, and
     wherein the first expression conversion process at a time when a predetermined condition regarding the end of updating is satisfied is the target expression conversion process.
7.  A conversion method comprising:
    a conversion target acquisition step of acquiring data that is a target of conversion of its expression into a target expression that is a predetermined expression; and
    an expression conversion step of converting the expression of the data acquired in the conversion target acquisition step by using the target expression conversion process obtained by a learning method, the learning method having a control step of obtaining, by learning, the target expression conversion process, which is a process of converting the expression of data to be converted into the target expression, wherein the control step executes: a first expression conversion process that takes as a processing target first data, which is data obtained by removing mask data, being a part of zeroth data expressed by a tensor, from the zeroth data, and converts the expression of the processing target into the target expression; a mask data expression prediction process that predicts the target expression of the mask data based on a result of the first expression conversion process; a second expression conversion process that converts the expression of second data, which is part or all of the mask data, into the target expression; and an update process that updates the contents of the first expression conversion process, the mask data expression prediction process, and the second expression conversion process so as to reduce a difference between a result of the mask data expression prediction process and a result of the second expression conversion process, and wherein the first expression conversion process at a time when a predetermined condition regarding the end of the updating is satisfied is the target expression conversion process.
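The conversion method of claim 7 amounts to applying the learned target expression conversion process to newly acquired data. A minimal, self-contained sketch under the same illustrative linear-encoder assumption (the learned weights are stood in by a random matrix here; all names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the target expression conversion process obtained by learning.
W_learned = rng.normal(scale=0.1, size=(16, 8))

def expression_conversion_step(data, W):
    """Convert the expression of acquired data into the target expression."""
    return data @ W

# Conversion target acquisition step: acquire data to be converted.
acquired = rng.normal(size=(4, 16))
target_expression = expression_conversion_step(acquired, W_learned)
```

The downstream task would then consume `target_expression` directly.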
8.  A program for causing a computer to function as either the learning device according to any one of claims 1 to 4 or the conversion device according to claim 5.
PCT/JP2022/033441 2022-09-06 2022-09-06 Learning device, conversion device, learning method, conversion method, and program WO2024052996A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/033441 WO2024052996A1 (en) 2022-09-06 2022-09-06 Learning device, conversion device, learning method, conversion method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/033441 WO2024052996A1 (en) 2022-09-06 2022-09-06 Learning device, conversion device, learning method, conversion method, and program

Publications (1)

Publication Number Publication Date
WO2024052996A1 true WO2024052996A1 (en) 2024-03-14

Family

ID=90192407

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/033441 WO2024052996A1 (en) 2022-09-06 2022-09-06 Learning device, conversion device, learning method, conversion method, and program

Country Status (1)

Country Link
WO (1) WO2024052996A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BAADE ALAN, PENG PUYUAN, HARWATH DAVID: "MAE-AST: Masked Autoencoding Audio Spectrogram Transformer", ARXIV, 30 March 2022 (2022-03-30), XP093148285, ISSN: 2331-8422, DOI: 10.48550/arxiv.2203.16691 *
NIIZUMI DAISUKE, TAKEUCHI DAIKI, OHISHI YASUNORI, HARADA NOBORU, KASHINO KUNIO: "Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation", ARXIV, 26 April 2022 (2022-04-26), XP093148283, ISSN: 2331-8422, DOI: 10.48550/arxiv.2204.12260 *

Similar Documents

Publication Publication Date Title
EP3816873A1 (en) Neural network circuit device, neural network processing method, and neural network execution program
US20230196202A1 (en) System and method for automatic building of learning machines using learning machines
US20210224447A1 (en) Grouping of pauli strings using entangled measurements
CN108701253A (en) The target output training neural network of operating specification
US20220180198A1 (en) Training method, storage medium, and training device
CN108280513B (en) Model generation method and device
US11809995B2 (en) Information processing device and method, and recording medium for determining a variable data type for a neural network
WO2024052996A1 (en) Learning device, conversion device, learning method, conversion method, and program
US20200202212A1 (en) Learning device, learning method, and computer-readable recording medium
CN113868368A (en) Method, electronic device and computer program product for information processing
JP7109071B2 (en) Learning device, learning method, speech synthesizer, speech synthesis method and program
JP6633556B2 (en) Acoustic model learning device, speech recognition device, acoustic model learning method, speech recognition method, and program
JP7349811B2 (en) Training device, generation device, and graph generation method
WO2021117089A1 (en) Model learning device, voice recognition device, method for same, and program
JP2021135683A (en) Learning device, deduction device, method for learning, and method for deduction
CN110929033A (en) Long text classification method and device, computer equipment and storage medium
JP7274441B2 (en) LEARNING DEVICE, LEARNING METHOD AND LEARNING PROGRAM
CN111767204A (en) Overflow risk detection method, device and equipment
JP2019133627A (en) Information processing method and information processing system
CN113792784B (en) Method, electronic device and storage medium for user clustering
JP7419615B2 (en) Learning device, estimation device, learning method, estimation method and program
WO2023105596A1 (en) Language processing device, image processing method, and program
CN111767980B (en) Model optimization method, device and equipment
CN114764620B (en) Quantum convolution operator
US20220180197A1 (en) Training method, storage medium, and training device