WO2023013075A1

WO2023013075A1 - Learning device, estimation device, learning method, and learning program

Info

Publication number: WO2023013075A1
Application number: PCT/JP2021/029440
Authority: WO
Inventors: 直弘俵; 厚徳小川; 佑樹北岸; 歩相名神山
Original assignee: 日本電信電話株式会社
Priority date: 2021-08-06
Filing date: 2021-08-06
Publication date: 2023-02-09
Also published as: JPWO2023013075A1

Abstract

A learning device (10) includes: a feature amount conversion unit (121) that uses a first neural network (NN) (1211) to convert labeled source data and unlabeled target data to feature amount vectors; an age estimation unit (122) that uses a second NN (1221) to estimate a posteriori probability of the age of a subject from the source data feature amount vector; an age bracket estimation unit (123) that uses a third NN (1231) to estimate a posteriori probability of the age bracket of the subject from the feature amount vectors of the source data and the target data; and an updating unit (13) that updates the parameters of the first NN (1211), the second NN (1221), and the third NN (1231) such that the distance between distributions of the feature amount vectors of the source data and the target data approach each other, using as conditions the posteriori probability of the target data age bracket and the correct age bracket in the source data, while maximizing the posteriori probability of the correct age and the correct age bracket with respect to the posteriori probability of the source data age and age bracket.

Description

LEARNING DEVICE, ESTIMATION DEVICE, LEARNING METHOD, AND LEARNING PROGRAM

The present invention relates to a learning device, an estimation device, a learning method, and a learning program.

　There is a demand for age estimation technology that estimates a person's age from facial images and voice in the call center and marketing fields. On the other hand, in recent years, a human age estimation method using a neural network (NN) (for example, Non-Patent Document 1) is known in the field of voice processing and image processing.

In Non-Patent Document 1, an NN that converts a speech signal into a feature amount vector and an NN that estimates the posterior probability of an age label from the feature amount vector are connected to maximize the posterior probability for the correct age value. It is described that the age can be estimated with high accuracy by learning these NNs at the same time.

Here, it is known that in the NN-based age estimation method, if the input data recording conditions do not match when the NN is learning and when the NN is operating, the performance drops significantly. For example, in non-patent document 1, in an age estimation task using face images and voices, if different recording devices are used during learning and during operation, the distribution of input data changes significantly during learning and during operation. Therefore, it is described that the accuracy of age estimation is greatly reduced.

The degradation of NN performance due to differences in input data properties is a well-known phenomenon in the field of image processing, and several solutions have been proposed (for example, Non-Patent Documents 2 and 3).

For example, in Non-Patent Document 2, during NN learning, training data (source data) collected in an environment different from that during operation and given a teacher label is given along with teacher labels collected in the same environment as during operation. A method for solving this problem is described by using data (target data) that is not targeted. In this method, the source data and target data are input, and the distance between the distributions of the obtained NN intermediate output is calculated according to the Maximum Mean Discrepancy (MMD) standard, and the NN is configured to minimize this. It is described that learning can suppress performance degradation caused by inconsistencies in input data distributions.

However, in the method described in Non-Patent Document 2, the distribution of each data is brought closer regardless of the class to be estimated, so there is a problem that the original purpose of classification accuracy is lowered.

For this reason, Non-Patent Document 3 describes that this problem can be solved by introducing a technique called Local MMD that first estimates the class of the target data and approximates the distribution for each class based on the estimated class posterior probability. It is

The methods described in

Non-Patent Documents

2 and 3 were developed for the problem of estimating labels that are independent of each other, such as image classification, and do not work well for the problem of estimating ordered labels, such as age. .

For example, in the technique described in Non-Patent Document 3, the label is first estimated for the target data, and based on the estimated label, the distributions of the source data and the target data are approximated to solve the problem of distribution mismatch.

However, in the case of the technique described in Non-Patent Document 3, in the age estimation problem, there is a label order such that the difference between 20 and 25 years is greater than the difference between 20 and 80 years. There is a problem that the performance deteriorates when two distributions are approached independently for each class, ignoring the order.

The present invention has been made in view of the above. An object is to provide an apparatus, a learning method and a learning program.

In order to solve the above-described problems and achieve the object, a learning device according to the present invention uses a first neural network to convert age-labeled source data into a feature amount vector, and age-labeled source data. A conversion unit that converts target data that has not been converted into a feature amount vector, and a second neural network that estimates the posterior probability for the age of the target person from the feature amount vector of the source data converted by the conversion unit. Using an estimation unit and a third neural network, the posterior probability for the age of the target person is estimated from the feature amount vector of the source data, and the posterior probability for the age of the target person is estimated from the feature amount vector of the target data. and the posterior probability of the correct age of the target person with respect to the posterior probability for the age of the source data estimated by the first estimation unit and the posterior probability for the age of the source data estimated by the second estimation unit While maximizing the probability and the posterior probability of the correct age, converted by the conversion unit under the conditions of the posterior probability of the age of the target data estimated by the second estimation unit and the correct age of the source data A first neural network, a second neural network, and a third neural network are arranged so that the distributions of the feature amount vectors of the source data and the feature amount vectors of the target data are brought close to each other according to the inter-distribution distance criterion defined in advance for each age. and an updating unit that updates each parameter with the network.

In addition, the estimation apparatus according to the present invention includes a conversion unit that converts data into a feature amount vector using a first neural network, and a feature amount vector converted by the conversion unit using a second neural network. an estimator for estimating the age of a person, wherein the first neural network and the second neural network estimate the posterior probability for each age of the age-labeled source data estimated by the second neural network; and the posterior probability of the correct age of the target person and the posterior probability of the correct age of the target person with respect to the posterior probability of the age of the source data estimated by the third neural network for estimating the posterior probability of the age of the person. Transformed by the first neural network under the condition of the posterior probability of the age of the target data with no age label and the correct age of the source data estimated by the third neural network while maximizing the It is characterized in that learning is performed so that the distributions of the feature amount vectors of the source data and the feature amount vectors of the target data are brought close to each other according to the inter-distribution distance criterion defined in advance for each age.

According to the present invention, it is possible to obtain a robust age estimator even when there are differences in the data recording conditions during learning and during operation.

FIG. 1 is a diagram schematically showing an example of the configuration of a learning device according to an embodiment. 2 is a diagram for explaining the flow of processing in the learning apparatus shown in FIG. 1; FIG. FIG. 3 is a diagram illustrating an example of the configuration of the first NN. FIG. 4 is a diagram illustrating an example of the configuration of the first NN. FIG. 5 is a diagram illustrating an example of the configuration of the second NN. FIG. 6 is a diagram illustrating an example of the configuration of the third NN. FIG. 7 is a flow chart showing a processing procedure of learning processing according to the embodiment. FIG. 8 is a diagram schematically illustrating an example of a configuration of an estimation device according to an embodiment; 9 is a flowchart showing an estimation processing procedure executed by the estimation device shown in FIG. 8. FIG. FIG. 10 is a diagram illustrating an example of a computer that implements a learning device and an estimation device by executing a program.

An embodiment of the present invention will be described in detail below with reference to the drawings. It should be noted that the present invention is not limited by this embodiment. Moreover, in the description of the drawings, the same parts are denoted by the same reference numerals. In the following description, "~A" for A, which is a vector, is equivalent to "a symbol in which "~" is written just above "A"".

[Embodiment]
In this embodiment, learning of an estimation model for estimating the age of a person using a neural network (NN) from input face image data and voice data will be described. In this embodiment, after estimating a label in a larger age group than the age group to be originally estimated (for example, ages in 10-year increments), the feature values of the source data and the target data are estimated in this large age group. NN is trained so that the distributions are close. As a result, in this embodiment, it is possible to obtain an age estimator capable of absorbing the difference in data recording conditions between the time of learning and the time of operation while considering the order of labels.

[Learning device]
Next, a learning device according to an embodiment will be described. FIG. 1 is a diagram schematically showing an example of the configuration of a learning device according to an embodiment. FIG. 2 is a diagram for explaining the flow of processing in the learning device shown in FIG.

In the learning device 10 according to the embodiment, for example, a computer or the like including ROM (Read Only Memory), RAM (Random Access Memory), CPU (Central Processing Unit), etc. is loaded with a predetermined program, and the CPU executes a predetermined program. It is realized by executing the program. The learning device 10 also has a communication interface for transmitting and receiving various information to and from another device connected via a wired connection or a network or the like.

As shown in FIGS. 1 and 2, the learning device 10 has a data selection unit 11, an estimation unit 12, an update unit 13, and a control processing unit . The learning device 10 uses source data with age labels collected under recording conditions different from those during actual operation and data (target data) without age labels recorded under the same environment as during operation. The source data and adaptation data are face image data or voice data. The source data and target data are face image data or voice data.

The data selection unit 11 selects one source data from learning data (source data group) having a plurality of source data as data to be input to the feature amount conversion unit 121 (described later), Target data is randomly selected from the target data (target data group). The data selection unit 11 outputs the correct age of the selected source data and the correct age of the source data obtained from the correct age to the updating unit 13 .

The estimation unit 12 estimates the age of the target person based on a plurality of face image data or voice data based on the same person's face image data or voice data. The estimating unit 12 has a feature quantity transforming unit 121 (transforming unit), an age estimating unit 122 (first estimating unit), and an age estimating unit 123 (second estimating unit).

The feature amount conversion unit 121 uses the first NN 1211 to convert face image data or voice data into a feature amount vector for age estimation. The feature amount conversion unit 121 selects source data and target data from the source data set and the target data set, and extracts feature amount vectors of each data.

The first NN 1211 is a NN that converts a series of face image data or voice data of a person, selected as source data and target data by the data selection unit 11, into feature amount vectors. The first NN 1211 converts the source data selected by the data selection unit 11 into a feature amount vector. The first NN 1211 converts the target data selected by the data selection unit 11 into a feature amount vector.

In the case of speech data, the first NN 1211 is implemented by an NN that converts speech data into feature vectors using the technique described in Non-Patent Document 1, for example. FIG. 3 is a diagram illustrating an example of the configuration of the first NN 1211. As shown in FIG. In this case, the first NN 1211 is implemented by, for example, an NN having a structure as shown in FIG. For example, the first NN 1211 is realized by a convolutional NN consisting of multiple time-delay layers and statistical pooling layers.

In the case of facial image data, the first NN 1211 is implemented by an NN that converts facial image data into feature vectors using the technique described in Non-Patent Document 2, for example. FIG. 4 is a diagram illustrating an example of the configuration of the first NN 1211. As shown in FIG. In this case, the first NN 1211 is implemented by, for example, an NN having a structure as shown in FIG. As an example, the first NN 1211 is implemented by a convolutional NN consisting of multiple residual blocks employing squeeze-and-excitation.

The age estimation unit 122 uses the second NN 1221 to estimate the posterior probability for the age of the target person from the feature amount vector of the source data converted by the feature amount conversion unit 121 . The second NN 1221 is a NN that estimates the age of the target person from the feature quantity vector transformed by the first NN 1211 .

The second NN 1221 is implemented by an NN that estimates the age value of the target person from the feature amount vector, for example, using the technology described in Non-Patent Document 1. FIG. 5 is a diagram illustrating an example of the configuration of the second NN 1221. As shown in FIG. This second NN 1221 is implemented by, for example, an NN having a structure as shown in FIG. For example, the second NN 1221 is realized by a plurality of fully connected layers of 512 dimensions and fully connected layers of the same number of dimensions as the number of age classes to be estimated (for example, 101 classes from 0 to 100).

The age estimation unit 123 uses the third NN 1231 to estimate each posterior probability for the target person's age from the feature amount vector of the source data and the feature amount vector of the target data converted by the feature amount conversion unit 121 . The third NN 1231 is a NN that estimates the age of the target person from the feature vectors converted by the first NN 1211 . The third NN 1231 estimates the posterior probability for the age of the target person from the feature vector of the source data. The third NN 1231 estimates the posterior probability for the target person's age from the feature vector of the target data.

FIG. 6 is a diagram illustrating an example of the configuration of the third NN 1231. FIG. The third NN 1231 is implemented by, for example, an NN having a structure as shown in FIG. The third NN 1231 is implemented, for example, by a plurality of 512-dimensional fully connected layers and a fully connected layer with the same number of dimensions as the number of predefined age classes. However, the number of age classes should be less than or equal to the age group originally desired to be estimated. For example, if the second NN 1221 estimates from 0 to 100 in 1-year increments, the third NN 1231 estimates 10 years from 0 to 100 in 10-year increments (teens, twenties, ...). should be a class.

The update unit 13 updates the first NN 1211, the second NN 1221, the third NN 1231, and each parameter so as to maximize the posterior probability of the correct age class and the correct age class of the target person. At this time, the updating unit 13 inputs the feature amount vector of the target data to the third NN 1231 to obtain the posterior probability of the age class of the target person of the target data. Then, the update unit 13 weights the source data feature amount vector and the target data feature amount vector output from the first NN 1211 with the correct age of each source data and the posterior probability of the age of the target data. Calculate the conditional inter-distribution distance. The updating unit 13 updates each parameter of the first NN 1211, the second NN 1221, and the third NN 1231 so as to minimize the calculated conditional distance between distributions.

Therefore, the update unit 13 updates the correct age of the target person with respect to the posterior probability for each age of the source data estimated by the age estimation unit 122 and the posterior probability for each age of the source data estimated by the age estimation unit 123. The posterior probability of each age of the target data estimated by the age estimation unit 123 and the correct age of the source data are used as conditions to maximize the posterior probability of the posterior probability of and the correct age. The first NN 1211 and the second NN 1221 are arranged so that the distributions of the feature amount vectors of the source data and the feature amount vectors of the target data, which are converted by the unit 121, are brought closer to each other based on the inter-distribution distance criterion defined in advance for each age. Each parameter with the third NN 1231 is updated.

_For ^example , let x _s ₁ , x ^s ₂ , ^. ^Let the age estimation _results obtained by applying the second NN 1221 to the quantity vector be ~y ^s ₁ , ~y ^s ₂ , . ^Let _the ^age _estimation results _obtained ^by ^applying _NN1231 _of ^. . ^. , y ^s _ns , and the correct age of the target person of _{the source data is vs 1} ^, ^vs ₂ _, .

However, cross_entropy (~ ^ys _i , ^ys _i ), which is the first term in Equation (1), is the cross entropy between the age posterior probability estimated by the second NN 1221 and the correct age label. Cross_entropy (~v ^si , _v ^si ), which is the second term in Equation (2), is the cross entropy between the age posterior probability estimated _by the third 1131 and the correct age label.

The third term of Equation (1), d _H (p ^s , p ^t ), represents the conditional inter-distribution distance between the feature vector of the source data and the feature vector of the target data. d _H (p ^s , p ^t ) is _, for example, the first _Let the ^feature vectors obtained by applying NN1211 be ^x ^t ₁ , x ^t ₂ _, ^. be done. Also, hereinafter, the age estimation results obtained by applying the third NN 1231 to the feature amount vector of the target data are assumed to be ~v ^t ₁ , ~v ^t ₂ , ..., ~v ^t _nt .

At this time, C represents the age class number. w ^sc _i and w ^tc _j represent the contribution ratios of the i-th source data and the j-th target data to age class c (weights that contribute to the conditional inter-distribution distance of each age class); Let p(˜v ^t _j =c|x ^t _j ) be the posterior probability for the c-th age class of the j-th target data obtained by NN 1231 of .

Assuming that the parameter to be updated is θ, the updating unit 13 calculates the parameters (the first NN 1211 and the second NN 1221 and each parameter with the third NN 1231).

μ in Equation (4) is a preset learning weight and is a positive constant.

Note that λ in Equation (1) is a learning weight of the conditional inter-distribution distance, and is designed to be a small value close to 0 at the beginning of learning and gradually approach 1 as learning progresses. For example, if the maximum number of iterations of the updating unit 13 is I, the weight λ _i at the i-th iteration can be calculated by Equation (5).

However, γ in Equation (5) is a parameter that determines the speed of the preset learning weights and is a positive constant.

The control processing unit 14 causes the feature quantity conversion unit 121, the age estimation unit 122, the age estimation unit 123, and the updating unit 13 to repeatedly execute the processing until a predetermined condition is satisfied. The control processing unit 14 causes the updating unit 13 to repeatedly update the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 until a predetermined condition is satisfied. The predetermined condition is, for example, a predetermined number of iterations is reached, the total update amount of the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 is less than a predetermined threshold, and the first NN 1211 and the second NN 1221 and the third NN 1231 are sufficiently trained.

[Processing procedure of learning process]
Next, learning processing executed by the learning device 10 will be described. FIG. 7 is a flow chart showing a processing procedure of learning processing according to the embodiment.

As shown in FIG. 7, in the learning device 10, the data selection unit 11 selects source data and target data (step S1). The data selector 11 randomly selects target data. Then, the feature amount conversion unit 121 uses the first NN 1211 to convert the source data and the target data selected by the data selection unit 11 into feature amount vectors (step S2).

The age estimation unit 122 uses the second NN 1221 to estimate the age of the target person from the feature amount vector of the source data converted by the feature amount conversion unit 121 (step S3). The age estimation unit 123 uses the third NN 1231 to estimate the posterior probability for the age of the target person from the feature amount vector of the source data converted by the feature amount conversion unit 121, and the target person from the feature amount vector of the target data. Estimate the posterior probability for the age of (step S4).

The update unit 13 updates the posterior probability of the correct age of the target person and the correct answer with respect to the posterior probability for each age of the source data estimated by the age estimation unit 122 and each age of the source data estimated by the age estimation unit 123. The posterior probability of each age of the target data estimated by the age estimation unit 123 and the correct age of the source data are converted by the feature amount conversion unit 121 as conditions so that the posterior probability with the age of 123 is maximized. The first NN 1211, the second NN 1221 and the third NN 1231 and are updated (step S5).

The control processing unit 14 determines whether or not a predetermined condition is satisfied (step S6). If the predetermined condition is not satisfied (step S6: No), the learning device 10 returns to step S2 and performs each process of data processing, feature conversion, age estimation, and parameter update. On the other hand, if the predetermined condition is satisfied (step S6: Yes), the learning device 10 ends the learning process.

[Estimation device]
Next, an estimation device according to an embodiment will be described. FIG. 8 is a diagram schematically illustrating an example of a configuration of an estimation device according to an embodiment; 9 is a flowchart showing an estimation processing procedure executed by the estimation device shown in FIG. 8. FIG.

The estimation device 20 shown in FIG. 8 has a feature amount conversion unit 221 (conversion unit) having a first NN 1211 and an age estimation unit 222 (estimation unit) having a second NN 1221 . The first NN 1211 and the second NN 1221 are NNs that have been learned by the learning device 10 .

When the feature amount conversion unit 221 receives input of face image data or voice data (step S11 in FIG. 9), it uses the first NN 1211 to convert the face image data or voice data into feature amounts (step S11 in FIG. 9). step S12).

The age estimation unit 222 uses the second NN 1221 to estimate the age of the target person from the feature amount vector converted by the feature amount conversion unit 211 (step S13 in FIG. 9), and outputs the estimated age (step S13 in FIG. 9). step S14).

[Evaluation experiment]
Next, an evaluation experiment was performed on the first NN 1211 and the second NN 1221 learned by the learning device 10 based on the formula (1). Here, 29076 utterances of 587 speakers recorded with a headset microphone were used as source data, and 8180 utterances of 409 speakers recorded with a smartphone-mounted microphone different from the source data were used as target data. A second NN1221 and a third NN1231 were trained. After that, the estimation device 20 used the first NN 1211 and the second NN 1221 to estimate the ages of the 1300 utterances of 120 speakers recorded by the same smartphone-mounted microphone as the target data. .

As a result, in the estimating device 20, the average absolute error between the correct age value and the result of estimating the age of the speaker by the first NN 1211 and the second NN 1221 was 8.02 years old. Also, in the estimation device 20, the correlation coefficient between the correct age value and the estimation result of the speaker's age was 0.84.

On the other hand, as a reference, when the first NN1211 and the second NN1221 were learned using only the source data, the absolute error with the estimation result of the speaker's age was 11.76 years, and the correlation coefficient was 0.71. . Therefore, like the learning device 10, the method of learning the first NN 1211, the second NN 1221, and the third NN 1231 so that the distributions of the source data and the target data are brought closer to each other for each age function effectively. was confirmed.

[Effects of Embodiment]
As described above, according to the present embodiment, even if there is a difference in data recording conditions during learning and during operation, the feature extractor (first NN 1211) that is robust against these differences, An age estimator (second NN 1221) can be obtained. In other words, in the present embodiment, when learning an NN that estimates age from face image data or voice data, "labels are labeled in age groups larger than the age groups to be originally estimated (for example, ages in 10-year increments). After estimating, by training the NN so that the distributions of the feature vectors of the source data and the target data are close in this large segment, the difference in data recording conditions during training and during operation is eliminated. In spite of this, it was possible to realize an NN capable of outputting age estimation results with high accuracy.

Note that the learning device 10 estimates the label of the target data and approximates the distributions of the source data and the target data based on this estimation result to solve the data mismatch problem. It is similar to technology, but differs in the following points.

First, the technology described in Non-Patent Document 3 targets only discrete labels whose data to be estimated are independent of each other, like general classification problems. On the other hand, the present embodiment differs in that ordered age labels are targeted for estimation.

Second, the technique described in Non-Patent Document 3 attempts to approximate the distribution using the estimation result of the label, which is the original estimation target. On the other hand, in the present embodiment, learning is performed so that the distribution is brought closer to each age with coarser granularity.

In the embodiment, by considering the order of the labels in this way, it becomes easier to approximate the distribution for each age, and a higher performance improvement can be expected.

Thus, in this embodiment, source data with an age label recorded in an environment different from that during operation and target data recorded in the same environment as during operation but without an age label are used. Thus, the performance of the feature transformer and age estimator can be significantly improved compared to the case of learning only with source data.

In addition, in order to collect age-labeled data in an operational environment, collecting a large number of speakers of each age group (several hundred people) is a very complicated process. By omitting the processing, efficiency can be improved.

The content of approximating the distribution for each class to be estimated is similar to Non-Patent Document 3, but the technology described in Non-Patent Document 3 does not assume a task such as age estimation, and even if it is applied as it is little effect. On the other hand, in the present embodiment, "age is a value with an order, and it can be defined in a larger division such as age, so the distribution is approximated in this large division", so that a highly accurate age estimator realized.

It should be noted that this embodiment can be applied regardless of differences in input such as image data and audio data. Specifically, the first NN 1211 may be changed to one suitable for each type of input data.

[Regarding the system configuration of the embodiment]
Each component of the learning device 10 and the estimating device 20 is functionally conceptual and does not necessarily need to be physically configured as illustrated. That is, the specific forms of distribution and integration of the functions of the learning device 10 and the estimating device 20 are not limited to those illustrated, and all or part of them can be functioned in arbitrary units according to various loads and usage conditions. can be physically or physically distributed or integrated.

In addition, all or any part of each process performed in the learning device 10 and the estimation device 20 may be realized by a CPU, a GPU (Graphics Processing Unit), and a program that is analyzed and executed by the CPU and GPU. good. Further, each process performed in the learning device 10 and the estimation device 20 may be implemented as hardware based on wired logic.

Also, among the processes described in the embodiments, all or part of the processes described as being performed automatically can also be performed manually. Alternatively, all or part of the processes described as being performed manually can be performed automatically by known methods. In addition, the above-described and illustrated processing procedures, control procedures, specific names, and information including various data and parameters can be changed as appropriate unless otherwise specified.

[program]
FIG. 10 is a diagram showing an example of a computer that implements the learning device 10 and the estimation device 20 by executing programs. The computer 1000 has a memory 1010 and a CPU 1020, for example. Computer 1000 also has hard disk drive interface 1030 , disk drive interface 1040 , serial port interface 1050 , video adapter 1060 and network interface 1070 . These units are connected by a bus 1080 .

The memory 1010 includes a ROM 1011 and a RAM 1012. The ROM 1011 stores a boot program such as BIOS (Basic Input Output System). Hard disk drive interface 1030 is connected to hard disk drive 1090 . A disk drive interface 1040 is connected to the disk drive 1100 . A removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100 . Serial port interface 1050 is connected to mouse 1110 and keyboard 1120, for example. Video adapter 1060 is connected to display 1130, for example.

The hard disk drive 1090 stores an OS (Operating System) 1091, application programs 1092, program modules 1093, and program data 1094, for example. That is, a program that defines each process of the learning device 10 and the estimation device 20 is implemented as a program module 1093 in which code executable by the computer 1000 is described. Program modules 1093 are stored, for example, on hard disk drive 1090 . For example, the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functional configurations of the learning device 10 and the estimation device 20 . The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

Also, the setting data used in the processing of the above-described embodiment is stored as program data 1094 in the memory 1010 or the hard disk drive 1090, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary and executes them.

The program modules 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program modules 1093 and program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Program modules 1093 and program data 1094 may then be read by CPU 1020 through network interface 1070 from other computers.

Although the embodiment to which the invention made by the present inventor is applied has been described above, the present invention is not limited by the description and drawings forming part of the disclosure of the present invention according to the present embodiment. That is, other embodiments, examples, operation techniques, etc. made by those skilled in the art based on the present embodiment are all included in the scope of the present invention.

10 learning device 11 data selection unit 12 estimation unit 13 update unit 14 control processing unit 121, 221 feature

quantity conversion unit

122, 222 age estimation unit 123 generation estimation unit 1211 first NN
1221 Second NN
1231 Third NN

Claims

a conversion unit that converts age-labeled source data into a feature amount vector and converts age-unlabeled target data into a feature amount vector using a first neural network;
A first estimating unit that uses a second neural network to estimate the posterior probability for the age of the target person from the feature amount vector of the source data transformed by the transforming unit;
Using a third neural network, estimating the posterior probability for the age of the target person from the feature amount vector of the source data, and second estimation for estimating the posterior probability for the age of the target person from the feature amount vector of the target data Department and
The posterior probability of the correct age of the target person with respect to the posterior probability of the age of the source data estimated by the first estimation unit and the posterior probability of the age of the source data estimated by the second estimation unit. and the posterior probability of the correct age are maximized, and with the posterior probability of the age of the target data estimated by the second estimation unit and the correct age of the source data as conditions, the conversion unit The first neural network and the second neural network are arranged so that the distributions of the transformed feature amount vector of the source data and the feature amount vector of the target data are brought close to each other based on a distance criterion between distributions defined in advance for each age. an updating unit that updates each parameter of the neural network and the third neural network;
A learning device characterized by comprising:
2. The data selection unit according to claim 1, further comprising a data selection unit that selects one of the source data from a group of source data and randomly selects the target data from a group of target data as data to be input to the conversion unit. learning device.
The source data is age-labeled data recorded in an environment different from that during operation,
3. The learning device according to claim 1, wherein the target data is data recorded in the same environment as during operation and not given an age label.
3. The apparatus further comprises a control processing unit for repeatedly executing the processing by the transforming unit, the first estimating unit, the second estimating unit, and the updating unit until a predetermined condition is satisfied. A learning device according to any one of
a conversion unit that converts data into a feature amount vector using a first neural network;
an estimating unit for estimating the age of the target person from the feature amount vector converted by the converting unit using a second neural network;
has
The first neural network and the second neural network estimate the posterior probability for each age of the age-labeled source data estimated by the second neural network and the posterior probability for the age of the person. While maximizing the posterior probability of the correct age of the target person and the posterior probability of the correct age of the target person with respect to the posterior probability for the age of the source data estimated by the third neural network, A feature vector of the source data transformed by the first neural network on the condition of the posterior probability of the age of the target data with no age label assigned and the correct age of the source data estimated by the neural network. and the feature amount vector of the target data are learned so as to approximate the distributions of the target data using a predetermined inter-distribution distance reference for each age.
A learning method executed by a learning device,
a conversion step of converting age-labeled source data into a feature quantity vector and converting target data without an age label into a feature quantity vector using a first neural network;
A first estimation step of estimating the posterior probability for the age of the target person from the feature amount vector of the source data converted in the conversion step using a second neural network;
Using a third neural network, estimating the posterior probability for the age of the target person from the feature amount vector of the source data, and second estimation for estimating the posterior probability for the age of the target person from the feature amount vector of the target data process and
The posterior probability of the correct age of the target person with respect to the posterior probability of the age of the source data estimated in the first estimation step and the posterior probability of the age of the source data estimated in the second estimation step. and the posterior probability of the correct age are maximized, and on the condition of the posterior probability of the age of the target data estimated in the second estimation step and the correct age of the source data, in the conversion step The first neural network and the second neural network are arranged so that the distributions of the transformed feature amount vector of the source data and the feature amount vector of the target data are brought close to each other based on a distance criterion between distributions defined in advance for each age. an update step of updating each parameter of the neural network and the third neural network;
A learning method comprising:
A learning program for causing a computer to function as the learning device according to any one of claims 1 to 4.