WO2023013075A1 - Learning device, estimation device, learning method, and learning program - Google Patents

Learning device, estimation device, learning method, and learning program Download PDF

Info

Publication number
WO2023013075A1
WO2023013075A1 (application PCT/JP2021/029440, JP 2021029440 W)
Authority
WO
WIPO (PCT)
Prior art keywords
age
data
posterior probability
source data
neural network
Prior art date
Application number
PCT/JP2021/029440
Other languages
French (fr)
Japanese (ja)
Inventor
直弘 俵
厚徳 小川
佑樹 北岸
歩相名 神山
Original Assignee
日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2021/029440 priority Critical patent/WO2023013075A1/en
Priority to JP2023539586A priority patent/JPWO2023013075A1/ja
Publication of WO2023013075A1 publication Critical patent/WO2023013075A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present invention relates to a learning device, an estimation device, a learning method, and a learning program.
  • In recent years, in the fields of speech processing and image processing, human age estimation methods using a neural network (NN) have become known (for example, Non-Patent Document 1).
  • Non-Patent Document 1 describes that age can be estimated with high accuracy by connecting an NN that converts a speech signal into a feature amount vector with an NN that estimates the posterior probability of an age label from that feature amount vector, and training these NNs simultaneously so as to maximize the posterior probability of the correct age value.
  • This degradation of NN performance caused by differences in the properties of the input data is a well-known phenomenon in the field of image processing, and several solutions have been proposed (for example, Non-Patent Documents 2 and 3).
  • For example, Non-Patent Document 2 describes a method that solves this problem by using, during NN training, labeled training data (source data) collected in an environment different from the operating environment, together with unlabeled data (target data) collected in the same environment as that of operation.
  • In this method, source data and target data are input, the distance between the distributions of the resulting NN intermediate outputs is calculated under the Maximum Mean Discrepancy (MMD) criterion, and the NN is trained to minimize this distance; this suppresses the performance degradation caused by the mismatch between the input data distributions.
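As a concrete illustration of the MMD criterion, the distance between two sets of NN intermediate outputs can be sketched as follows. This is a minimal numpy sketch under assumptions: the document does not specify the kernel, so a Gaussian (RBF) kernel and the simple biased estimator are used, and all names are illustrative.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of x and the rows of y."""
    sq_dists = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    """Biased estimate of the squared MMD between two samples of
    intermediate outputs xs (n_s, d) and xt (n_t, d)."""
    return (rbf_kernel(xs, xs, sigma).mean()
            - 2.0 * rbf_kernel(xs, xt, sigma).mean()
            + rbf_kernel(xt, xt, sigma).mean())
```

Training the NN to minimize this quantity alongside the task loss pulls the source and target intermediate-output distributions together.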
  • However, the method of Non-Patent Document 2 brings the distributions of the data closer regardless of the class to be estimated, so there is a problem that the class classification accuracy, which is the original goal, deteriorates.
  • For this reason, Non-Patent Document 3 describes that this problem can be solved by introducing a technique called Local MMD, which first estimates the class of the target data and then brings the distributions closer for each class based on the estimated class posterior probabilities.
  • However, the methods of Non-Patent Documents 2 and 3 were developed for problems in which the labels to be estimated are independent of each other, such as image classification, and they do not work well for problems of estimating ordered labels, such as age.
  • In these methods, a label is first estimated for the target data, and based on the estimated label, the distributions of the source data and the target data are brought closer to resolve the distribution mismatch.
  • However, in the age estimation problem the labels are ordered: for example, the difference between ages 20 and 25 is smaller than the difference between ages 20 and 80. The method of Non-Patent Document 3 brings the two distributions closer independently for each class, ignoring this order, so there is a problem that its performance deteriorates.
  • An object of the present invention is to provide a learning device, an estimation device, a learning method, and a learning program that can estimate age with high accuracy even when the recording conditions of the input data differ between training and operation.
  • In order to solve the above problems, a learning device according to the present invention includes: a conversion unit that uses a first neural network to convert age-labeled source data and unlabeled target data into feature amount vectors; a first estimation unit that uses a second neural network to estimate the posterior probability of the target person's age from the feature amount vector of the source data converted by the conversion unit; a second estimation unit that uses a third neural network to estimate the posterior probability of the target person's age class from the feature amount vector of the source data and from the feature amount vector of the target data; and an updating unit that updates the parameters of the first, second, and third neural networks so that the distributions of the feature amount vectors of the source data and of the target data are brought closer together under an inter-distribution distance criterion defined in advance for each age class.
  • An estimation device according to the present invention includes: a conversion unit that converts input data into a feature amount vector using a first neural network; and an estimation unit that estimates the age of the target person from the converted feature amount vector using a second neural network. The first and second neural networks are networks trained, together with a third neural network that estimates age-class posterior probabilities, so as to maximize the posterior probabilities of the correct age and the correct age class of the source data while bringing the feature distributions of the source data and the target data closer for each age class.
  • FIG. 1 is a diagram schematically showing an example of the configuration of a learning device according to an embodiment.
  • FIG. 2 is a diagram for explaining the flow of processing in the learning device shown in FIG. 1.
  • FIG. 3 is a diagram illustrating an example of the configuration of the first NN.
  • FIG. 4 is a diagram illustrating an example of the configuration of the first NN.
  • FIG. 5 is a diagram illustrating an example of the configuration of the second NN.
  • FIG. 6 is a diagram illustrating an example of the configuration of the third NN.
  • FIG. 7 is a flow chart showing a processing procedure of learning processing according to the embodiment.
  • FIG. 8 is a diagram schematically illustrating an example of the configuration of an estimation device according to an embodiment.
  • FIG. 9 is a flowchart showing an estimation processing procedure executed by the estimation device shown in FIG. 8.
  • FIG. 10 is a diagram illustrating an example of a computer that implements a learning device and an estimation device by executing a program.
  • FIG. 1 is a diagram schematically showing an example of the configuration of a learning device according to an embodiment.
  • FIG. 2 is a diagram for explaining the flow of processing in the learning device shown in FIG. 1.
  • The learning device 10 is realized by, for example, a computer including a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), and the like, into which a predetermined program is loaded and executed by the CPU.
  • the learning device 10 also has a communication interface for transmitting and receiving various information to and from another device connected via a wired connection or a network or the like.
  • The learning device 10 has a data selection unit 11, an estimation unit 12, an updating unit 13, and a control processing unit 14.
  • The learning device 10 is trained using age-labeled source data collected under recording conditions different from those of actual operation and unlabeled data (target data) recorded in the same environment as that of operation.
  • The source data and the target data are face image data or speech data.
  • The data selection unit 11 randomly selects one item of source data from the training data (source data group) and one item of target data from the target data group as the data to be input to the feature amount conversion unit 121 (described later).
  • The data selection unit 11 outputs the correct age of the selected source data, and the correct age class obtained from that correct age, to the updating unit 13.
  • The estimation unit 12 estimates the age of a target person based on face image data or speech data of that person.
  • the estimating unit 12 has a feature quantity transforming unit 121 (transforming unit), an age estimating unit 122 (first estimating unit), and an age estimating unit 123 (second estimating unit).
  • the feature amount conversion unit 121 uses the first NN 1211 to convert face image data or voice data into a feature amount vector for age estimation.
  • the feature amount conversion unit 121 selects source data and target data from the source data set and the target data set, and extracts feature amount vectors of each data.
  • the first NN 1211 is a NN that converts a series of face image data or voice data of a person, selected as source data and target data by the data selection unit 11, into feature amount vectors.
  • the first NN 1211 converts the source data selected by the data selection unit 11 into a feature amount vector.
  • the first NN 1211 converts the target data selected by the data selection unit 11 into a feature amount vector.
  • the first NN 1211 is implemented by an NN that converts speech data into feature vectors using the technique described in Non-Patent Document 1, for example.
  • FIG. 3 is a diagram illustrating an example of the configuration of the first NN 1211.
  • the first NN 1211 is implemented by, for example, an NN having a structure as shown in FIG.
  • The first NN 1211 is realized by, for example, a convolutional NN consisting of multiple time-delay layers and a statistics pooling layer.
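The statistics pooling layer collapses the variable-length sequence of frame-level activations produced by the time-delay layers into one fixed-size, utterance-level vector. A minimal numpy sketch (illustrative, not the patented implementation):

```python
import numpy as np

def statistics_pooling(frames):
    """Collapse a (time, features) array of frame-level activations into a
    single vector by concatenating the per-feature mean and standard
    deviation over the time axis."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])
```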
  • the first NN 1211 is implemented by an NN that converts facial image data into feature vectors using the technique described in Non-Patent Document 2, for example.
  • FIG. 4 is a diagram illustrating an example of the configuration of the first NN 1211.
  • the first NN 1211 is implemented by, for example, an NN having a structure as shown in FIG.
  • the first NN 1211 is implemented by a convolutional NN consisting of multiple residual blocks employing squeeze-and-excitation.
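A squeeze-and-excitation step of the kind used inside such residual blocks can be sketched as below. The weight shapes and the reduction ratio are illustrative assumptions, and the convolutions and residual connection of the full block are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, w1, w2):
    """Squeeze-and-excitation on a (channels, height, width) activation:
    global-average-pool each channel (squeeze), pass the pooled vector
    through a two-layer bottleneck (excitation), and rescale every channel
    by the resulting gate in (0, 1)."""
    s = x.mean(axis=(1, 2))        # squeeze: one scalar per channel
    z = np.maximum(w1 @ s, 0.0)    # bottleneck layer with ReLU
    g = sigmoid(w2 @ z)            # per-channel gates in (0, 1)
    return x * g[:, None, None]    # rescale the channels
```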
  • the age estimation unit 122 uses the second NN 1221 to estimate the posterior probability for the age of the target person from the feature amount vector of the source data converted by the feature amount conversion unit 121 .
  • the second NN 1221 is a NN that estimates the age of the target person from the feature quantity vector transformed by the first NN 1211 .
  • the second NN 1221 is implemented by an NN that estimates the age value of the target person from the feature amount vector, for example, using the technology described in Non-Patent Document 1.
  • FIG. 5 is a diagram illustrating an example of the configuration of the second NN 1221.
  • This second NN 1221 is implemented by, for example, an NN having a structure as shown in FIG.
  • The second NN 1221 is realized by, for example, a plurality of 512-dimensional fully connected layers followed by a fully connected layer with the same number of dimensions as the number of age classes to be estimated (for example, 101 classes from 0 to 100).
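The forward pass of such a head can be sketched as a stack of 512-dimensional fully connected layers with ReLU, followed by a softmax over the age classes. The random weights in the test are placeholders; only the shapes follow the description (512-dimensional hidden layers, 101 output classes).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def age_posterior(feature, hidden_weights, output_weight):
    """MLP head: ReLU fully connected layers, then a softmax giving the
    posterior probability over the age classes."""
    h = feature
    for w in hidden_weights:
        h = np.maximum(w @ h, 0.0)
    return softmax(output_weight @ h)
```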
  • The age estimation unit 123 uses the third NN 1231 to estimate posterior probabilities of the target person's age class from the feature amount vector of the source data and from the feature amount vector of the target data converted by the feature amount conversion unit 121.
  • The third NN 1231 is an NN that estimates the age class of the target person from the feature amount vectors converted by the first NN 1211: it estimates the posterior probability of the target person's age class from the feature amount vector of the source data, and likewise from the feature amount vector of the target data.
  • FIG. 6 is a diagram illustrating an example of the configuration of the third NN 1231.
  • the third NN 1231 is implemented by, for example, an NN having a structure as shown in FIG.
  • the third NN 1231 is implemented, for example, by a plurality of 512-dimensional fully connected layers and a fully connected layer with the same number of dimensions as the number of predefined age classes.
  • The number of age classes of the third NN 1231 should be coarser (smaller) than the set of ages that is originally to be estimated. For example, if the second NN 1221 estimates ages from 0 to 100 in 1-year increments, the third NN 1231 may use about 10 classes covering 0 to 100 in 10-year increments (teens, twenties, and so on).
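One plausible concrete mapping from an exact age to such a coarse age class is sketched below. The exact binning (for example, how ages of 90 and above are grouped) is not specified in the text, so this mapping is only an assumption.

```python
def coarse_age_class(age, width=10, n_classes=10):
    """Map an exact age (0-100) to one of n_classes 10-year brackets:
    0-9 -> 0, 10-19 -> 1, ..., with everything from 90 up in the last
    class. A hypothetical binning, not the patent's."""
    return min(age // width, n_classes - 1)
```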
  • The updating unit 13 updates the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 so as to maximize the posterior probabilities of the correct age and the correct age class of the target person.
  • The updating unit 13 inputs the feature amount vector of the target data to the third NN 1231 to obtain the posterior probability of the age class for that target data. The updating unit 13 then calculates a conditional inter-distribution distance between the source-data feature amount vectors and the target-data feature amount vectors output from the first NN 1211, weighted by the correct age class of each item of source data and by the estimated age-class posterior probability of each item of target data.
  • the updating unit 13 updates each parameter of the first NN 1211, the second NN 1221, and the third NN 1231 so as to minimize the calculated conditional distance between distributions.
  • In other words, the updating unit 13 updates the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 so that the posterior probability of the correct age is maximized for the age posteriors of the source data estimated by the age estimation unit 122, the posterior probability of the correct age class is maximized for the age-class posteriors of the source data estimated by the age estimation unit 123, and the distributions of the feature amount vectors of the source data and of the target data converted by the feature amount conversion unit 121 are brought closer together under the inter-distribution distance criterion defined in advance for each age class, conditioned on the age-class posteriors of the target data estimated by the age estimation unit 123 and on the correct age classes of the source data.
  • cross_entropy(ŷ_s^i, y_s^i), the first term in Equation (1), is the cross entropy between the age posterior probability estimated by the second NN 1221 and the correct age label. cross_entropy(v̂_s^i, v_s^i), the second term, is the cross entropy between the age-class posterior probability estimated by the third NN 1231 and the correct age-class label.
  • d H (p s , p t ) represents the conditional inter-distribution distance between the feature vector of the source data and the feature vector of the target data.
  • For d_H(p_s, p_t), let x_t^1, x_t^2, ..., x_t^{n_t} denote, for example, the feature amount vectors obtained by applying the first NN 1211 to the target data. Hereinafter, the age-class estimation results obtained by applying the third NN 1231 to the target-data feature amount vectors are denoted v̂_t^1, v̂_t^2, ..., v̂_t^{n_t}.
  • C represents the number of age classes.
  • w_sc^i and w_tc^j represent the contributions of the i-th source datum and the j-th target datum to age class c (the weights with which each datum contributes to the conditional inter-distribution distance of that age class). Let p(v̂_t^j = c | x_t^j) be the posterior probability of the c-th age class for the j-th target datum, obtained from the third NN 1231.
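Putting the pieces together, the conditional (class-weighted) inter-distribution distance can be sketched in the style of Local MMD: each age class is compared with its own weighted MMD, with one-hot correct-class weights on the source side and third-NN posteriors on the target side. This is an illustrative reconstruction under assumptions (RBF kernel, biased estimator, sum over classes), not the patent's exact equation.

```python
import numpy as np

def rbf(x, y, sigma=1.0):
    sq = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def conditional_mmd2(xs, ws, xt, wt, sigma=1.0):
    """Local-MMD-style conditional inter-distribution distance.
    xs, xt: (n_s, d) and (n_t, d) feature vectors.
    ws, wt: (n_s, C) and (n_t, C) per-class contribution weights --
    one-hot correct-class labels for the source, age-class posteriors
    for the target. Each class contributes its own weighted MMD, and
    the per-class distances are summed."""
    n_classes = ws.shape[1]
    total = 0.0
    for c in range(n_classes):
        a = ws[:, c] / max(ws[:, c].sum(), 1e-12)   # normalized source weights
        b = wt[:, c] / max(wt[:, c].sum(), 1e-12)   # normalized target weights
        total += (a @ rbf(xs, xs, sigma) @ a
                  - 2.0 * a @ rbf(xs, xt, sigma) @ b
                  + b @ rbf(xt, xt, sigma) @ b)
    return total
```

Minimizing this distance pulls the source and target feature distributions together class by class rather than globally.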
  • In this way, the updating unit 13 updates the parameters of the first NN 1211, the second NN 1221, and the third NN 1231.
  • The learning weight in Equation (4) is set in advance and is a positive constant.
  • The weight on the conditional inter-distribution distance in Equation (1) is designed to be a small value close to 0 at the beginning of training and to approach 1 gradually as training progresses. For example, if the maximum number of iterations of the updating unit 13 is I, the weight at the i-th iteration can be calculated by Equation (5).
  • The parameter in Equation (5) that determines the ramp-up speed of this learning weight is set in advance and is a positive constant.
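Equation (5) itself is not reproduced in this text. A widely used schedule with exactly the stated properties (0 at the start of training, approaching 1 as training progresses, with a positive constant controlling the ramp-up speed) is the sigmoid ramp-up below; treat it as an illustrative stand-in, not the patent's formula.

```python
import math

def mmd_weight(i, max_iter, gamma=10.0):
    """Learning weight for the conditional inter-distribution distance at
    iteration i: 0 at the start of training and approaching 1 as i nears
    max_iter, with gamma controlling how fast it ramps up. This sigmoid
    schedule is a common choice (e.g. in domain-adversarial training);
    the patent's exact Equation (5) is not reproduced here."""
    progress = i / max_iter
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0
```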
  • the control processing unit 14 causes the feature quantity conversion unit 121, the age estimation unit 122, the age estimation unit 123, and the updating unit 13 to repeatedly execute the processing until a predetermined condition is satisfied.
  • the control processing unit 14 causes the updating unit 13 to repeatedly update the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 until a predetermined condition is satisfied.
  • The predetermined condition is, for example, that a predetermined number of iterations has been reached, that the total update amount of the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 has fallen below a predetermined threshold, or, more generally, that the first NN 1211, the second NN 1221, and the third NN 1231 are judged to be sufficiently trained.
  • FIG. 7 is a flow chart showing a processing procedure of learning processing according to the embodiment.
  • the data selection unit 11 selects source data and target data (step S1).
  • the data selector 11 randomly selects target data.
  • the feature amount conversion unit 121 uses the first NN 1211 to convert the source data and the target data selected by the data selection unit 11 into feature amount vectors (step S2).
  • the age estimation unit 122 uses the second NN 1221 to estimate the age of the target person from the feature amount vector of the source data converted by the feature amount conversion unit 121 (step S3).
  • The age estimation unit 123 uses the third NN 1231 to estimate the posterior probability of the target person's age class from the feature amount vector of the source data converted by the feature amount conversion unit 121, and likewise from the feature amount vector of the target data (step S4).
  • The updating unit 13 updates the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 so that the posterior probability of the correct age is maximized for the age posteriors of the source data estimated by the age estimation unit 122, the posterior probability of the correct age class is maximized for the age-class posteriors of the source data estimated by the age estimation unit 123, and the distributions of the feature amount vectors converted by the feature amount conversion unit 121 are brought closer for each age class, conditioned on the age-class posteriors of the target data and the correct age classes of the source data (step S5).
  • the control processing unit 14 determines whether or not a predetermined condition is satisfied (step S6). If the predetermined condition is not satisfied (step S6: No), the learning device 10 returns to step S2 and performs each process of data processing, feature conversion, age estimation, and parameter update. On the other hand, if the predetermined condition is satisfied (step S6: Yes), the learning device 10 ends the learning process.
  • FIG. 8 is a diagram schematically illustrating an example of the configuration of an estimation device according to an embodiment.
  • FIG. 9 is a flowchart showing an estimation processing procedure executed by the estimation device shown in FIG. 8.
  • the estimation device 20 shown in FIG. 8 has a feature amount conversion unit 221 (conversion unit) having a first NN 1211 and an age estimation unit 222 (estimation unit) having a second NN 1221 .
  • the first NN 1211 and the second NN 1221 are NNs that have been learned by the learning device 10 .
  • When the feature amount conversion unit 221 receives input of face image data or speech data (step S11 in FIG. 9), it uses the first NN 1211 to convert the face image data or speech data into a feature amount vector (step S12).
  • The age estimation unit 222 uses the second NN 1221 to estimate the age of the target person from the feature amount vector converted by the feature amount conversion unit 221 (step S13), and outputs the estimated age (step S14).
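The text does not state how the age estimation unit 222 turns the second NN's posterior over age classes into a single output age; a common choice is the posterior expectation, sketched here under that assumption (argmax over classes is another option).

```python
import numpy as np

def estimate_age(posterior, ages=None):
    """Point estimate of the age from a posterior over age classes.
    The readout is not specified in the text; the posterior expectation
    used here is one common, assumed choice."""
    if ages is None:
        ages = np.arange(len(posterior))  # class i corresponds to age i
    return float(np.dot(posterior, ages))
```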
  • In an evaluation, the mean absolute error between the correct age value and the speaker's age as estimated by the first NN 1211 and the second NN 1221 was 8.02 years, and the correlation coefficient between the correct age value and the estimated age was 0.84.
  • By contrast, for a comparison system the mean absolute error against the correct age value was 11.76 years and the correlation coefficient was 0.71. It was therefore confirmed that, as in the learning device 10, training the first NN 1211, the second NN 1221, and the third NN 1231 so that the distributions of the source data and the target data are brought closer for each age class functions effectively.
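The two reported figures are the standard mean absolute error and Pearson correlation coefficient; for reference, they can be computed as follows.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean absolute error between correct and estimated ages, in years."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def pearson_corr(y_true, y_pred):
    """Pearson correlation coefficient between correct and estimated ages."""
    return float(np.corrcoef(y_true, y_pred)[0, 1])
```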
  • In this way, a highly accurate age estimator (the second NN 1221) can be obtained.
  • In other words, when training NNs that estimate age from face image data or speech data, the learning device 10 adds the constraint that the distributions be brought closer within age classes coarser than the ages to be estimated (for example, ages in 10-year increments). As a result, it was possible to realize NNs that output highly accurate age estimation results despite the difference in data recording conditions between training and operation.
  • The learning device 10 estimates the label of the target data and brings the distributions of the source data and the target data closer based on this estimation result to solve the data mismatch problem; in this respect it is similar to the technology of Non-Patent Document 3, but it differs in the following points.
  • Non-Patent Document 3 targets only discrete labels that are independent of each other, as in general classification problems.
  • the present embodiment differs in that ordered age labels are targeted for estimation.
  • Non-Patent Document 3 attempts to approximate the distribution using the estimation result of the label, which is the original estimation target.
  • In the present embodiment, by contrast, learning is performed so that the distributions are brought closer for each age class of coarser granularity.
  • In the present embodiment, age-labeled source data recorded in an environment different from that of operation and unlabeled target data recorded in the same environment as that of operation are both used.
  • the performance of the feature transformer and age estimator can be significantly improved compared to the case of learning only with source data.
  • Bringing the distributions closer for each class to be estimated is similar to Non-Patent Document 3, but the technology described in Non-Patent Document 3 does not assume a task such as age estimation, and applying it as-is has little effect.
  • In the present embodiment, the facts that age is an ordered value and that it can also be defined in larger divisions such as age classes are exploited: the distributions are brought closer within these larger divisions, so that a highly accurate age estimator is realized.
  • the first NN 1211 may be changed to one suitable for each type of input data.
  • Each component of the learning device 10 and the estimation device 20 is functionally conceptual and does not necessarily need to be physically configured as illustrated. That is, the specific forms of distribution and integration of the functions of the learning device 10 and the estimation device 20 are not limited to those illustrated, and all or part of them can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.
  • Each process performed in the learning device 10 and the estimation device 20 may be realized by a CPU, a GPU (Graphics Processing Unit), and a program analyzed and executed by the CPU and GPU. Each process performed in the learning device 10 and the estimation device 20 may also be implemented as hardware based on wired logic.
  • FIG. 10 is a diagram showing an example of a computer that implements the learning device 10 and the estimation device 20 by executing programs.
  • the computer 1000 has a memory 1010 and a CPU 1020, for example.
  • The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
  • the memory 1010 includes a ROM 1011 and a RAM 1012.
  • the ROM 1011 stores a boot program such as BIOS (Basic Input Output System).
  • Hard disk drive interface 1030 is connected to hard disk drive 1090 .
  • a disk drive interface 1040 is connected to the disk drive 1100 .
  • a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100 .
  • Serial port interface 1050 is connected to mouse 1110 and keyboard 1120, for example.
  • Video adapter 1060 is connected to display 1130, for example.
  • the hard disk drive 1090 stores an OS (Operating System) 1091, application programs 1092, program modules 1093, and program data 1094, for example. That is, a program that defines each process of the learning device 10 and the estimation device 20 is implemented as a program module 1093 in which code executable by the computer 1000 is described. Program modules 1093 are stored, for example, on hard disk drive 1090 .
  • the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functional configurations of the learning device 10 and the estimation device 20 .
  • the hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
  • the setting data used in the processing of the above-described embodiment is stored as program data 1094 in the memory 1010 or the hard disk drive 1090, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary and executes them.
  • the program modules 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program modules 1093 and program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.). Program modules 1093 and program data 1094 may then be read by CPU 1020 through network interface 1070 from other computers.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

A learning device (10) includes: a feature amount conversion unit (121) that uses a first neural network (NN) (1211) to convert labeled source data and unlabeled target data into feature amount vectors; an age estimation unit (122) that uses a second NN (1221) to estimate the posterior probability of the age of a subject from the source-data feature amount vector; an age bracket estimation unit (123) that uses a third NN (1231) to estimate the posterior probability of the age bracket of the subject from the feature amount vectors of the source data and the target data; and an updating unit (13) that updates the parameters of the first NN (1211), the second NN (1221), and the third NN (1231) such that the distributions of the feature amount vectors of the source data and the target data approach each other, using as conditions the posterior probability of the target-data age bracket and the correct age bracket of the source data, while maximizing the posterior probabilities of the correct age and the correct age bracket with respect to the posteriors of the source-data age and age bracket.

Description

学習装置、推定装置、学習方法及び学習プログラムLEARNING DEVICE, ESTIMATION DEVICE, LEARNING METHOD, AND LEARNING PROGRAM
 本発明は、学習装置、推定装置、学習方法及び学習プログラムに関する。 The present invention relates to a learning device, an estimation device, a learning method, and a learning program.
 人物の年齢を顔画像や音声から推定する年齢推定技術が、コールセンターやマーケティング分野において求められている。これに対し、近年、音声処理及び画像処理分野において、ニューラルネットワーク(NN)を用いた人物年齢推定手法(例えば、非特許文献1)が知られている。  There is a demand for age estimation technology that estimates a person's age from facial images and voice in the call center and marketing fields. On the other hand, in recent years, a human age estimation method using a neural network (NN) (for example, Non-Patent Document 1) is known in the field of voice processing and image processing.
 非特許文献1では、音声信号を特徴量ベクトルに変換するNNと、特徴量ベクトルから年齢ラベルの事後確率を推定するNNとを連結し、正解の年齢値に対する事後確率を最大とするように、これらのNNを同時に学習させることで、高い精度で年齢を推定できることが記載されている。 In Non-Patent Document 1, an NN that converts a speech signal into a feature amount vector and an NN that estimates the posterior probability of an age label from the feature amount vector are connected to maximize the posterior probability for the correct age value. It is described that the age can be estimated with high accuracy by learning these NNs at the same time.
 ここで、NNに基づく年齢推定手法では、NNの学習時とNNの運用時とにおいて、入力データの収録条件が一致していないと、性能が著しく低下することが知られている。例えば、非特許文献1には、顔画像及び音声を利用した年齢推定タスクにおいて、学習時と運用時とで異なる収録機器を用いると、学習時と運用時とで入力データの分布が著しく変化するため、年齢推定精度が大きく低下することが記載されている。 Here, it is known that in the NN-based age estimation method, if the input data recording conditions do not match when the NN is learning and when the NN is operating, the performance drops significantly. For example, in non-patent document 1, in an age estimation task using face images and voices, if different recording devices are used during learning and during operation, the distribution of input data changes significantly during learning and during operation. Therefore, it is described that the accuracy of age estimation is greatly reduced.
 Such NN performance degradation due to differences in the properties of the input data is a well-known phenomenon in the image processing field, and several solutions have been proposed (for example, Non-Patent Documents 2 and 3).
 For example, Non-Patent Document 2 describes a method that addresses this problem by training the NN with labeled training data (source data) collected in an environment different from the operating environment, together with unlabeled data (target data) collected in the same environment as in operation. In this method, the source data and the target data are both input to the NN, the distance between the distributions of the resulting intermediate NN outputs is computed under the maximum mean discrepancy (MMD) criterion, and the NN is trained so as to minimize this distance; this suppresses the performance degradation caused by the mismatch between the input data distributions.
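The MMD criterion mentioned above can be sketched as follows. This is a minimal NumPy sketch of a biased squared-MMD estimate with a Gaussian kernel; the kernel choice and the bandwidth `sigma` are illustrative assumptions, not the exact formulation of Non-Patent Document 2.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between the rows of a and b.
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    # Biased estimate of the squared maximum mean discrepancy between
    # source feature vectors xs and target feature vectors xt.
    return (gaussian_kernel(xs, xs, sigma).mean()
            + gaussian_kernel(xt, xt, sigma).mean()
            - 2.0 * gaussian_kernel(xs, xt, sigma).mean())
```

Minimizing this quantity with respect to the network parameters pulls the two feature distributions together irrespective of class, which is precisely the class-agnostic behavior criticized in the following paragraphs.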
 However, the method described in Non-Patent Document 2 brings the data distributions closer regardless of the classes to be estimated, so it has the problem that the classification accuracy for the original target task degrades.
 For this reason, Non-Patent Document 3 describes that this problem can be solved by introducing a technique called local MMD, which first estimates the class of the target data and then brings the distributions closer class by class on the basis of the estimated class posterior probabilities.
 However, the methods described in Non-Patent Documents 2 and 3 were developed for problems that estimate mutually independent labels, such as image classification, and they do not operate properly for problems that estimate ordered labels, such as age.
 For example, in the technique described in Non-Patent Document 3, labels are first estimated for the target data, and the distribution mismatch is then resolved by bringing the distributions of the source data and the target data closer on the basis of the estimated labels.
 However, in the case of the technique described in Non-Patent Document 3, age labels have an order; in an age estimation problem, for example, the difference between 20 and 80 years old is far greater than the difference between 20 and 25 years old. There is thus the problem that performance degrades when the two distributions are brought closer independently for each class while this order is ignored.
 The present invention has been made in view of the above, and an object thereof is to provide a learning device, an estimation device, a learning method, and a learning program capable of obtaining an age estimator that is robust even when the data recording conditions differ between training and operation.
 To solve the above problems and achieve the object, a learning device according to the present invention includes: a conversion unit that uses a first neural network to convert age-labeled source data into a feature amount vector and to convert unlabeled target data into a feature amount vector; a first estimation unit that uses a second neural network to estimate the posterior probability of the age of a target person from the feature amount vector of the source data converted by the conversion unit; a second estimation unit that uses a third neural network to estimate the posterior probability of the age bracket of the target person from the feature amount vector of the source data and from the feature amount vector of the target data; and an updating unit that updates the parameters of the first, second, and third neural networks such that, with respect to the posterior probabilities of the age of the source data estimated by the first estimation unit and of the age bracket of the source data estimated by the second estimation unit, the posterior probabilities of the correct age and of the correct age bracket of the target person are maximized, while, conditioned on the posterior probability of the age bracket of the target data estimated by the second estimation unit and on the correct age bracket of the source data, the distributions of the feature amount vectors of the source data and the target data converted by the conversion unit are brought closer according to an inter-distribution distance criterion defined in advance for each age bracket.
 An estimation device according to the present invention includes: a conversion unit that uses a first neural network to convert data into a feature amount vector; and an estimation unit that uses a second neural network to estimate the age of a target person from the feature amount vector converted by the conversion unit. The first and second neural networks have been trained such that, with respect to the posterior probability for each age of age-labeled source data estimated by the second neural network and the posterior probability for the age bracket of the source data estimated by a third neural network that estimates the posterior probability of a person's age bracket, the posterior probabilities of the correct age and of the correct age bracket of the target person are maximized, while, conditioned on the posterior probability of the age bracket of unlabeled target data estimated by the third neural network and on the correct age bracket of the source data, the distributions of the feature amount vectors of the source data and the target data converted by the first neural network are brought closer according to an inter-distribution distance criterion defined in advance for each age bracket.
 According to the present invention, a robust age estimator can be obtained even when the data recording conditions differ between training and operation.
FIG. 1 is a diagram schematically showing an example of the configuration of a learning device according to an embodiment. FIG. 2 is a diagram for explaining the flow of processing in the learning device shown in FIG. 1. FIG. 3 is a diagram for explaining an example of the configuration of the first NN. FIG. 4 is a diagram for explaining an example of the configuration of the first NN. FIG. 5 is a diagram for explaining an example of the configuration of the second NN. FIG. 6 is a diagram for explaining an example of the configuration of the third NN. FIG. 7 is a flowchart showing the processing procedure of learning processing according to the embodiment. FIG. 8 is a diagram schematically showing an example of the configuration of an estimation device according to the embodiment. FIG. 9 is a flowchart showing the estimation processing procedure executed by the estimation device shown in FIG. 8. FIG. 10 is a diagram showing an example of a computer in which the learning device and the estimation device are implemented by executing a program.
 An embodiment of the present invention will be described in detail below with reference to the drawings. The present invention is not limited by this embodiment. In the description of the drawings, the same parts are denoted by the same reference numerals. In the following, for a vector A, the notation "~A" is equivalent to the symbol in which "~" is written directly above "A".
[Embodiment]
 In this embodiment, training of an estimation model that estimates a person's age from input face image data or voice data using a neural network (NN) will be described. In this embodiment, labels are first estimated in divisions coarser than the age divisions originally to be estimated (for example, age brackets in ten-year steps), and the NNs are then trained so that, within each coarse division, the distributions of the feature amounts of the source data and the target data come close to each other. This embodiment thereby obtains an age estimator that can absorb the difference in data recording conditions between training and operation while taking the order of the labels into account.
[Learning device]
 Next, a learning device according to the embodiment will be described. FIG. 1 is a diagram schematically showing an example of the configuration of the learning device according to the embodiment. FIG. 2 is a diagram for explaining the flow of processing in the learning device shown in FIG. 1.
 The learning device 10 according to the embodiment is implemented by, for example, a computer including a ROM (Read Only Memory), a RAM (Random Access Memory), a CPU (Central Processing Unit), and the like, into which a predetermined program is loaded and executed by the CPU. The learning device 10 also has a communication interface for transmitting and receiving various kinds of information to and from other devices connected by a wired connection or via a network or the like.
 As shown in FIGS. 1 and 2, the learning device 10 has a data selection unit 11, an estimation unit 12, an updating unit 13, and a control processing unit 14. The learning device 10 uses age-labeled source data collected under recording conditions different from those of actual operation, and unlabeled data (target data) recorded in the same environment as in operation. Both the source data and the target data are face image data or voice data.
 The data selection unit 11 selects, as data to be input to the feature amount conversion unit 121 (described later), one piece of source data from the training data (source data group) containing a plurality of pieces of source data, and randomly selects target data from the adaptation data (target data group) containing a plurality of pieces of target data. The data selection unit 11 outputs the correct age of the selected source data, and the correct age bracket of the source data derived from that correct age, to the updating unit 13.
 The estimation unit 12 estimates the age of the target person on the basis of a plurality of pieces of face image data or voice data of the same person. The estimation unit 12 has a feature amount conversion unit 121 (conversion unit), an age estimation unit 122 (first estimation unit), and an age-bracket estimation unit 123 (second estimation unit).
 The feature amount conversion unit 121 uses the first NN 1211 to convert face image data or voice data into a feature amount vector for age estimation. The feature amount conversion unit 121 takes the source data and the target data selected from the source data set and the target data set, and extracts the feature amount vector of each.
 The first NN 1211 is an NN that converts a series of face image data or voice data of a person, selected by the data selection unit 11 as source data and target data, into feature amount vectors. The first NN 1211 converts the source data selected by the data selection unit 11 into a feature amount vector, and likewise converts the selected target data into a feature amount vector.
 When voice data is targeted, the first NN 1211 is implemented by an NN that converts the voice data into a feature vector using, for example, the technique described in Non-Patent Document 1. FIG. 3 is a diagram for explaining an example of the configuration of the first NN 1211. In this case, the first NN 1211 is implemented by, for example, an NN having the structure shown in FIG. 3; as one example, it is a convolutional NN consisting of a plurality of time-delay layers and a statistical pooling layer.
 When face image data is targeted, the first NN 1211 is implemented by an NN that converts the face image data into a feature vector using, for example, the technique described in Non-Patent Document 2. FIG. 4 is a diagram for explaining an example of the configuration of the first NN 1211. In this case, the first NN 1211 is implemented by, for example, an NN having the structure shown in FIG. 4; as one example, it is a convolutional NN consisting of a plurality of residual blocks employing Squeeze-and-Excitation.
 The age estimation unit 122 uses the second NN 1221 to estimate the posterior probability of the age of the target person from the feature amount vector of the source data converted by the feature amount conversion unit 121. The second NN 1221 is an NN that estimates the age of the target person from the feature amount vector converted by the first NN 1211.
 The second NN 1221 is implemented by an NN that estimates the age value of the target person from the feature amount vector using, for example, the technique described in Non-Patent Document 1. FIG. 5 is a diagram for explaining an example of the configuration of the second NN 1221. The second NN 1221 is implemented by, for example, an NN having the structure shown in FIG. 5; as one example, it consists of a plurality of 512-dimensional fully connected layers and a fully connected layer whose dimensionality equals the number of age classes to be estimated (for example, 101 classes from 0 to 100 years old).
 The age-bracket estimation unit 123 uses the third NN 1231 to estimate the posterior probabilities of the age bracket of the target person from the feature amount vector of the source data and from the feature amount vector of the target data converted by the feature amount conversion unit 121. The third NN 1231 is an NN that estimates the age bracket of the target person from the feature amount vector converted by the first NN 1211: it estimates the posterior probability of the target person's age bracket from the feature amount vector of the source data, and likewise from the feature amount vector of the target data.
 FIG. 6 is a diagram for explaining an example of the configuration of the third NN 1231. The third NN 1231 is implemented by, for example, an NN having the structure shown in FIG. 6; for example, it consists of a plurality of 512-dimensional fully connected layers and a fully connected layer whose dimensionality equals a predefined number of age-bracket classes. The number of age-bracket classes must not exceed the number of age divisions originally to be estimated. For example, when the second NN 1221 estimates ages from 0 to 100 in one-year steps, the third NN 1231 may use 10 classes covering 0 to 100 in ten-year steps (teens, twenties, and so on).
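The relation between the fine age classes of the second NN 1221 and the coarse bracket classes of the third NN 1231 can be written as a simple mapping. The function below is an illustrative sketch under the example configuration in the text (101 one-year classes, 10 ten-year brackets); the function and parameter names are hypothetical.

```python
def age_to_bracket(age, bracket_width=10, num_brackets=10):
    # Maps a one-year age class (0..100) to its coarse bracket class:
    # 0-9 -> 0, 10-19 -> 1, ..., 90 and above -> 9, so that age 100
    # still falls within the last of the 10 brackets.
    return min(age // bracket_width, num_brackets - 1)
```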
 The updating unit 13 updates the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 so as to maximize the posterior probability of the correct age class and the posterior probability of the correct age-bracket class of the target person. At this time, the updating unit 13 inputs the feature amount vector of the target data to the third NN 1231 to obtain the posterior probability of the age-bracket class of the target person of the target data. The updating unit 13 then computes, for the feature amount vectors of the source data and the target data output from the first NN 1211, a conditional inter-distribution distance weighted by the correct age bracket of each piece of source data and the age-bracket posterior probability of each piece of target data, and updates the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 so as to minimize this conditional inter-distribution distance.
 In other words, with respect to the posterior probability for each age of the source data estimated by the age estimation unit 122 and the posterior probability for each age bracket of the source data estimated by the age-bracket estimation unit 123, the updating unit 13 updates the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 so that the posterior probabilities of the correct age and of the correct age bracket of the target person are maximized, while, conditioned on the age-bracket posterior probabilities of the target data estimated by the age-bracket estimation unit 123 and the correct age bracket of the source data, the distributions of the feature amount vectors of the source data and the target data converted by the feature amount conversion unit 121 are brought closer according to an inter-distribution distance criterion defined in advance for each age bracket.
 For example, for n_s pieces of source data, let x_1^s, x_2^s, ..., x_{n_s}^s be the feature amount vectors obtained by applying the first NN 1211 to the source data; let ~y_1^s, ~y_2^s, ..., ~y_{n_s}^s be the age estimation results obtained by applying the second NN 1221 to these feature amount vectors; let ~v_1^s, ~v_2^s, ..., ~v_{n_s}^s be the age-bracket estimation results obtained by applying the third NN 1231 to these feature amount vectors; let y_1^s, y_2^s, ..., y_{n_s}^s be the correct ages of the target persons of the source data; and let v_1^s, v_2^s, ..., v_{n_s}^s be the correct age brackets of the target persons of the source data. The loss function L is then expressed by Equation (1).
  L = Σ_{i=1}^{n_s} cross_entropy(~y_i^s, y_i^s) + Σ_{i=1}^{n_s} cross_entropy(~v_i^s, v_i^s) + λ·d_H(p_s, p_t)    ... (1)
 Here, cross_entropy(~y_i^s, y_i^s), the first term of Equation (1), is the cross entropy between the age posterior probability estimated by the second NN 1221 and the correct age label. Likewise, cross_entropy(~v_i^s, v_i^s), the second term of Equation (1), is the cross entropy between the age-bracket posterior probability estimated by the third NN 1231 and the correct age-bracket label.
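The two cross-entropy terms can be sketched as follows: the standard cross entropy between an estimated class posterior and a correct class index, applied once to the age classes and once to the age-bracket classes. The small constant added inside the logarithm is a numerical-safety detail assumed here, not stated in the text.

```python
import numpy as np

def cross_entropy(posterior, correct_class):
    # Cross entropy between an estimated posterior (e.g. a softmax
    # output over age or age-bracket classes) and the correct class index.
    return -np.log(float(posterior[correct_class]) + 1e-12)
```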
 The third term of Equation (1), d_H(p_s, p_t), represents the conditional inter-distribution distance between the feature amount vectors of the source data and the feature amount vectors of the target data. When, for example, the local maximum mean discrepancy (LMMD) defined in Non-Patent Document 3 is used as the inter-distribution distance, and x_1^t, x_2^t, ..., x_{n_t}^t are the feature amount vectors obtained by applying the first NN 1211 to n_t pieces of target data, the inter-distribution distance d_H(p_s, p_t) is expressed by Equation (2). In the following, ~v_1^t, ~v_2^t, ..., ~v_{n_t}^t denote the age-bracket estimation results obtained by applying the third NN 1231 to the feature amount vectors of the target data.
  d_H(p_s, p_t) = (1/C) Σ_{c=1}^{C} ‖ Σ_{i=1}^{n_s} w_sc^i φ(x_i^s) − Σ_{j=1}^{n_t} w_tc^j φ(x_j^t) ‖_H^2    ... (2)
where φ(·) denotes the feature map associated with the reproducing kernel Hilbert space H and ‖·‖_H its norm.
 Here, C denotes the number of age-bracket classes. w_sc^i and w_tc^j denote the contribution ratios of the i-th piece of source data and the j-th piece of target data to age-bracket class c (the weights contributing to the conditional inter-distribution distance of each age-bracket class). With p(~v_j^t = c | x_j^t) denoting the posterior probability of the c-th age-bracket class for the j-th piece of target data obtained by the third NN 1231, these weights are defined by Equation (3).
  w_sc^i = 1[v_i^s = c] / Σ_{i'=1}^{n_s} 1[v_{i'}^s = c],   w_tc^j = p(~v_j^t = c | x_j^t) / Σ_{j'=1}^{n_t} p(~v_{j'}^t = c | x_{j'}^t)    ... (3)
where 1[·] denotes the indicator function.
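One way to realize these per-class weights is sketched below: source weights come from the one-hot correct brackets, target weights from the estimated bracket posteriors, each normalized per class over the batch. Function and variable names are illustrative, not the patent's; the guard against empty classes is an assumed implementation detail.

```python
import numpy as np

def lmmd_weights(source_brackets, target_posteriors, num_classes):
    # source_brackets: correct bracket index v_i^s for each source sample.
    # target_posteriors: (n_t, C) matrix of p(~v_j^t = c | x_j^t).
    ns = len(source_brackets)
    w_s = np.zeros((ns, num_classes))
    w_s[np.arange(ns), source_brackets] = 1.0  # one-hot correct bracket
    # Normalize each class column over the batch; empty columns stay 0.
    w_s /= np.maximum(w_s.sum(axis=0, keepdims=True), 1e-12)
    w_t = target_posteriors / np.maximum(
        target_posteriors.sum(axis=0, keepdims=True), 1e-12)
    return w_s, w_t
```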
 With θ denoting the parameters to be updated, the updating unit 13 updates the parameters (the parameters of the first NN 1211, the second NN 1221, and the third NN 1231) according to Equation (4) so as to minimize the loss function L defined by Equation (1).
  θ ← θ − μ · ∂L/∂θ    ... (4)
 Here, μ in Equation (4) is a learning weight set in advance and is a positive constant.
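Equation (4) is a plain gradient-descent step; a minimal sketch follows. In practice the gradients would come from automatic differentiation through the three NNs, and the function name here is illustrative.

```python
def gradient_step(params, grads, mu=0.01):
    # theta <- theta - mu * dL/dtheta, applied to every parameter.
    return [p - mu * g for p, g in zip(params, grads)]
```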
 Note that λ in Equation (1) is the learning weight of the conditional inter-distribution distance; it is designed to take a small value close to 0 at the beginning of training and to gradually approach 1 as training progresses. For example, with I denoting the maximum number of iterations of the updating unit 13, the weight λ_i at the i-th iteration can be calculated by Equation (5).
  λ_i = 2 / (1 + exp(−γ·i/I)) − 1    ... (5)
 Here, γ in Equation (5) is a positive constant set in advance that determines how quickly the learning weight grows.
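The schedule for λ can be sketched as follows. The ramp below matches the description (it starts at 0 at the first iteration, approaches 1 near the maximum iteration count I, and γ controls the speed); its exact closed form is an assumption consistent with schedules commonly used in domain-adversarial training, and the function name is illustrative.

```python
import math

def lambda_weight(i, max_iter, gamma=10.0):
    # Grows from 0 (at i = 0) toward 1 as iteration i approaches
    # max_iter; gamma sets how quickly the ramp saturates.
    return 2.0 / (1.0 + math.exp(-gamma * i / max_iter)) - 1.0
```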
 The control processing unit 14 causes the feature amount conversion unit 121, the age estimation unit 122, the age-bracket estimation unit 123, and the updating unit 13 to repeat their processing until a predetermined condition is satisfied; that is, it causes the updating unit 13 to repeatedly update the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 until the predetermined condition is satisfied. The predetermined condition is a condition under which the first NN 1211, the second NN 1221, and the third NN 1231 are considered sufficiently trained, for example, that a predetermined number of iterations has been reached, or that the total amount by which the parameters of the three NNs are updated has fallen below a predetermined threshold.
[Processing procedure of learning processing]
 Next, the learning processing executed by the learning device 10 will be described. FIG. 7 is a flowchart showing the processing procedure of the learning processing according to the embodiment.
 As shown in FIG. 7, in the learning device 10, the data selection unit 11 selects source data and target data (step S1); the target data is selected at random. The feature amount conversion unit 121 then uses the first NN 1211 to convert the source data and the target data selected by the data selection unit 11 into feature amount vectors (step S2).
 The age estimation unit 122 uses the second NN 1221 to estimate the age of the target person from the feature amount vector of the source data converted by the feature amount conversion unit 121 (step S3). The age-bracket estimation unit 123 uses the third NN 1231 to estimate the posterior probability of the target person's age bracket from the feature amount vector of the source data, and the posterior probability of the target person's age bracket from the feature amount vector of the target data (step S4).
 With respect to the posterior probabilities for each age of the source data estimated by the age estimation unit 122 and for each age bracket of the source data estimated by the age-bracket estimation unit 123, the updating unit 13 updates the parameters of the first NN 1211, the second NN 1221, and the third NN 1231 so that the posterior probabilities of the correct age and of the correct age bracket of the target person are maximized, while, conditioned on the age-bracket posterior probabilities of the target data estimated by the age-bracket estimation unit 123 and the correct age bracket of the source data, the distributions of the feature amount vectors of the source data and the target data converted by the feature amount conversion unit 121 are brought closer according to the inter-distribution distance criterion defined in advance for each age bracket (step S5).
 The control processing unit 14 determines whether the predetermined condition is satisfied (step S6). If the predetermined condition is not satisfied (step S6: No), the learning device 10 returns to step S2 and repeats feature amount conversion, age estimation, age-bracket estimation, and parameter updating. If the predetermined condition is satisfied (step S6: Yes), the learning device 10 ends the learning processing.
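Steps S1 through S6 above can be summarized as the following control-flow sketch. The five callables are stand-ins for the data selection unit, the first NN, the second NN, the third NN, and the updating unit; all names are hypothetical, and a fixed iteration count stands in for the predetermined condition of step S6.

```python
def train(select_data, to_features, estimate_age, estimate_bracket,
          update_params, max_iter):
    # Control flow of FIG. 7 (steps S1-S6).
    src, tgt = select_data()                               # step S1
    for i in range(max_iter):                              # repeat S2-S5
        f_src, f_tgt = to_features(src), to_features(tgt)  # step S2
        age_post = estimate_age(f_src)                     # step S3
        br_src = estimate_bracket(f_src)                   # step S4
        br_tgt = estimate_bracket(f_tgt)
        update_params(age_post, br_src, br_tgt, i)         # step S5
```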
[Estimation device]
 Next, an estimation device according to the embodiment will be described. FIG. 8 is a diagram schematically showing an example of the configuration of the estimation device according to the embodiment. FIG. 9 is a flowchart showing the estimation processing procedure executed by the estimation device shown in FIG. 8.
 The estimation device 20 shown in FIG. 8 has a feature amount conversion unit 221 (conversion unit) having the first NN 1211, and an age estimation unit 222 (estimation unit) having the second NN 1221. The first NN 1211 and the second NN 1221 are NNs that have been trained by the learning device 10.
 Upon receiving input of face image data or voice data (step S11 in FIG. 9), the feature amount conversion unit 221 uses the first NN 1211 to convert the face image data or voice data into a feature amount vector (step S12 in FIG. 9).
 The age estimation unit 222 uses the second NN 1221 to estimate the age of the target person from the feature amount vector converted by the feature amount conversion unit 221 (step S13 in FIG. 9), and outputs the estimated age (step S14 in FIG. 9).
[Evaluation experiment]
 Next, an evaluation experiment was performed on the first NN 1211 and the second NN 1221 trained by the learning device 10 on the basis of formula (1). Here, 29,076 utterances by 587 speakers recorded with a headset microphone were used as source data, and 8,180 utterances by 409 speakers recorded with a smartphone-mounted microphone (a device different from that of the source data) were used as target data to train the first NN 1211, the second NN 1221, and the third NN 1231. The estimation device 20 then estimated the speakers' ages, using the first NN 1211 and the second NN 1221, for 1,300 utterances by 120 speakers recorded with the same smartphone-mounted microphone as the target data.
 As a result, in the estimation device 20, the mean absolute error between the true age and the age estimated by the first NN 1211 and the second NN 1221 was 8.02 years, and the correlation coefficient between the true age and the estimated age was 0.84.
 For reference, when the first NN 1211 and the second NN 1221 were trained with the source data alone, the mean absolute error against the true ages was 11.76 years and the correlation coefficient was 0.71. This confirms that the method of the learning device 10, which trains the first NN 1211, the second NN 1221, and the third NN 1231 so that the distributions of the source data and the target data are brought closer for each age bracket, works effectively.
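The two figures reported above, the mean absolute error in years and the Pearson correlation coefficient between true and estimated ages, can be computed as follows. The ages below are made up for illustration, not the experimental data.

```python
import math

def mae(y_true, y_pred):
    # mean absolute error in years between true and estimated ages
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def pearson(y_true, y_pred):
    # Pearson correlation coefficient between true and estimated ages
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sp = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    return cov / (st * sp)

true_ages = [23, 34, 45, 56, 67]   # invented example data
est_ages  = [25, 30, 50, 54, 70]
print(mae(true_ages, est_ages))      # 3.2
print(pearson(true_ages, est_ages))  # ≈ 0.98
```

Applied to the 1,300 test utterances, these same two measures gave the reported 8.02 years and 0.84 for the adapted networks, versus 11.76 years and 0.71 for the source-only baseline.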
[Effects of the embodiment]
 As described above, according to the present embodiment, even when the data recording conditions differ between training and operation, a feature extractor (first NN 1211) and an age estimator (second NN 1221) that are robust to these differences can be obtained. In other words, when training an NN that estimates age from face image data or voice data, labels are first estimated in brackets larger than the unit in which age is ultimately to be estimated (for example, decades, i.e., 10-year brackets), and the NN is then trained so that the distributions of the feature vectors of the source data and the target data become close within each of these larger brackets. This realizes an NN that can output accurate age estimates regardless of differences in data recording conditions between training and operation.
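The key training idea, coarsening labels to decades and then pulling the source and target feature distributions together within each decade, can be illustrated with a toy loss. This sketch is not the patent's formula (1): the squared difference of per-bracket feature means stands in for whatever inter-distribution distance criterion (maximum mean discrepancy is one common choice) is defined in advance, and all data values are invented.

```python
def decade(age):
    # coarsen an exact age to its 10-year bracket, e.g. 37 -> 30
    return (int(age) // 10) * 10

def bracket_alignment_loss(src_feats, src_ages, tgt_feats, tgt_decades):
    """Sum, over decades, of the squared gap between the mean source feature
    (grouped by true age) and the mean target feature (grouped by the decade
    estimated for it, e.g. by the third NN)."""
    loss = 0.0
    for d in sorted({decade(a) for a in src_ages}):
        s = [f for f, a in zip(src_feats, src_ages) if decade(a) == d]
        t = [f for f, td in zip(tgt_feats, tgt_decades) if td == d]
        if s and t:  # only brackets observed in both domains contribute
            loss += (sum(s) / len(s) - sum(t) / len(t)) ** 2
    return loss

# Invented 1-D "features"; a real system would use high-dimensional vectors.
src_feats, src_ages = [0.2, 0.3, 0.8, 0.9], [21, 27, 45, 48]
tgt_feats, tgt_dec  = [0.35, 0.75], [20, 40]  # decades estimated for target data
loss = bracket_alignment_loss(src_feats, src_ages, tgt_feats, tgt_dec)
print(loss)
```

Minimizing such a per-bracket term alongside the age and decade classification losses is what pulls the two recording conditions together decade by decade.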
 Note that the learning device 10 resembles the technique described in Non-Patent Document 3 in that it estimates labels for the target data and, based on these estimates, brings the distributions of the source data and the target data closer to solve the data mismatch problem, but it differs in the following respects.
 First, the technique of Non-Patent Document 3 targets only mutually independent discrete labels, as in a general classification problem, whereas the present embodiment targets ordered age labels.
 Second, the technique of Non-Patent Document 3 attempts to bring the distributions closer using the estimates of the labels that are the original estimation target, whereas the present embodiment trains the networks to bring the distributions closer for each coarser-grained age bracket.
 By taking the order of the labels into account in this way, the embodiment makes it easier to bring the distributions closer within each age bracket, and a larger performance improvement can be expected.
 Thus, in the present embodiment, by using age-labeled source data recorded in an environment different from the operating environment together with target data recorded in the operating environment but without age labels, the performance of the feature converter and the age estimator can be improved substantially compared with training on the source data alone.
 Moreover, collecting age-labeled data in the operating environment would require gathering many speakers (on the order of several hundred) from each age bracket, which is a very laborious process; the present embodiment improves efficiency by making this process unnecessary.
 The idea of bringing the distributions closer for each estimated class is shared with Non-Patent Document 3, but the technique of Non-Patent Document 3 does not assume a task such as age estimation, and applying it as-is has little effect. In contrast, the present embodiment realizes a highly accurate age estimator by exploiting the fact that age is an ordered value that can also be defined in larger brackets, such as decades, and by bringing the distributions closer within these larger brackets.
 Note that the present embodiment is applicable regardless of the type of input, such as image data or audio data; the first NN 1211 need only be replaced with one suited to the type of input data.
[System configuration of the embodiment]
 Each component of the learning device 10 and the estimation device 20 is functionally conceptual and need not be physically configured as illustrated. That is, the specific form of distribution and integration of the functions of the learning device 10 and the estimation device 20 is not limited to that illustrated; all or part of the functions can be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.
 Each process performed in the learning device 10 and the estimation device 20 may be realized, in whole or in any part, by a CPU, a GPU (Graphics Processing Unit), and a program analyzed and executed by the CPU and GPU, or may be realized as hardware using wired logic.
 Of the processes described in the embodiment, all or part of those described as automatic can also be performed manually, and all or part of those described as manual can be performed automatically by known methods. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters described above and shown in the drawings can be changed as appropriate unless otherwise specified.
[Program]
 FIG. 10 is a diagram showing an example of a computer that realizes the learning device 10 and the estimation device 20 by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.
 The memory 1010 includes a ROM 1011 and a RAM 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100, into which a removable storage medium such as a magnetic disk or an optical disk is inserted. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.
 The hard disk drive 1090 stores, for example, an OS (Operating System) 1091, application programs 1092, program modules 1093, and program data 1094. That is, a program defining each process of the learning device 10 and the estimation device 20 is implemented as a program module 1093 in which code executable by the computer 1000 is described. The program module 1093 is stored in, for example, the hard disk drive 1090; for example, a program module 1093 for executing processing equivalent to the functional configurations of the learning device 10 and the estimation device 20 is stored in the hard disk drive 1090. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).
 The setting data used in the processing of the above embodiment is stored as program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. The CPU 1020 reads the program module 1093 and the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary and executes them.
 The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090; they may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN (Local Area Network), WAN (Wide Area Network), etc.) and read by the CPU 1020 via the network interface 1070.
 Although an embodiment applying the invention made by the present inventors has been described above, the present invention is not limited by the description and drawings that form part of this disclosure. That is, other embodiments, examples, operational techniques, and the like made by those skilled in the art on the basis of this embodiment are all included within the scope of the present invention.
 10 Learning device
 11 Data selection unit
 12 Estimation unit
 13 Update unit
 14 Control processing unit
 121, 221 Feature conversion unit
 122, 222 Age estimation unit
 123 Age-bracket estimation unit
 1211 First NN
 1221 Second NN
 1231 Third NN

Claims (7)

  1.  A learning device comprising:
     a conversion unit that uses a first neural network to convert age-labeled source data into a feature vector and to convert target data without an age label into a feature vector;
     a first estimation unit that uses a second neural network to estimate, from the feature vector of the source data produced by the conversion unit, the posterior probability over the age of a target person;
     a second estimation unit that uses a third neural network to estimate, from the feature vector of the source data, the posterior probability over the age bracket of the target person, and to estimate, from the feature vector of the target data, the posterior probability over the age bracket of the target person; and
     an update unit that updates the parameters of the first neural network, the second neural network, and the third neural network so that, with respect to the posterior probability over age of the source data estimated by the first estimation unit and the posterior probability over age bracket of the source data estimated by the second estimation unit, the posterior probabilities of the target person's true age and true age bracket are maximized, while, conditioned on the posterior probability over age bracket of the target data estimated by the second estimation unit and the true age bracket of the source data, the distributions of the feature vector of the source data and the feature vector of the target data produced by the conversion unit are brought closer according to an inter-distribution distance criterion defined in advance for each age bracket.
  2.  The learning device according to claim 1, further comprising a data selection unit that, as data to be input to the conversion unit, selects one piece of the source data from a source data group and randomly selects the target data from a target data group.
  3.  The learning device according to claim 1 or 2, wherein the source data is age-labeled data recorded in an environment different from the operating environment, and the target data is data recorded in the same environment as the operating environment but without an age label.
  4.  The learning device according to any one of claims 1 to 3, further comprising a control processing unit that causes the processing of the conversion unit, the first estimation unit, the second estimation unit, and the update unit to be repeatedly executed until a predetermined condition is satisfied.
  5.  An estimation device comprising:
     a conversion unit that converts data into a feature vector using a first neural network; and
     an estimation unit that estimates the age of a target person from the feature vector produced by the conversion unit using a second neural network,
     wherein the first neural network and the second neural network have been trained so that, with respect to the posterior probability over each age of age-labeled source data estimated by the second neural network and the posterior probability over age bracket of the source data estimated by a third neural network that estimates the posterior probability over a person's age bracket, the posterior probabilities of the target person's true age and true age bracket are maximized, while, conditioned on the posterior probability over age bracket of target data without an age label estimated by the third neural network and the true age bracket of the source data, the distributions of the feature vector of the source data and the feature vector of the target data produced by the first neural network are brought closer according to an inter-distribution distance criterion defined in advance for each age bracket.
  6.  A learning method executed by a learning device, comprising:
     a conversion step of using a first neural network to convert age-labeled source data into a feature vector and to convert target data without an age label into a feature vector;
     a first estimation step of using a second neural network to estimate, from the feature vector of the source data produced in the conversion step, the posterior probability over the age of a target person;
     a second estimation step of using a third neural network to estimate, from the feature vector of the source data, the posterior probability over the age bracket of the target person, and to estimate, from the feature vector of the target data, the posterior probability over the age bracket of the target person; and
     an update step of updating the parameters of the first neural network, the second neural network, and the third neural network so that, with respect to the posterior probability over age of the source data estimated in the first estimation step and the posterior probability over age bracket of the source data estimated in the second estimation step, the posterior probabilities of the target person's true age and true age bracket are maximized, while, conditioned on the posterior probability over age bracket of the target data estimated in the second estimation step and the true age bracket of the source data, the distributions of the feature vector of the source data and the feature vector of the target data produced in the conversion step are brought closer according to an inter-distribution distance criterion defined in advance for each age bracket.
  7.  A learning program for causing a computer to function as the learning device according to any one of claims 1 to 4.
PCT/JP2021/029440 (filed 2021-08-06) — WO2023013075A1: Learning device, estimation device, learning method, and learning program

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2021/029440 2021-08-06 2021-08-06 Learning device, estimation device, learning method, and learning program
JP2023539586A 2021-08-06 2021-08-06 (Japanese national-phase application of the above)

Publications (1)

Publication Number Publication Date
WO2023013075A1 2023-02-09

Family ID: 85154110

Country Status (2)

Country Documents
JP (1): JPWO2023013075A1
WO (1): WO2023013075A1

Citations (1)

* Cited by examiner, † Cited by third party
WO2021095509A1 — priority date 2019-11-14, published 2021-05-20 — OMRON Corporation — Inference system, inference device, and inference method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
TAWARA, Naohiro; OGAWA, Atsunori; KITAGISHI, Yuki; KAMIYAMA, Hosana: "Age-VOX-Celeb: Multi-Modal Corpus for Facial and Speech Estimation," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 6 June 2021, pp. 6963-6967, DOI: 10.1109/ICASSP39728.2021.9414272
ZHU, Yongchun; ZHUANG, Fuzhen; WANG, Jindong; KE, Guolin; CHEN, Jingwu; BIAN, Jiang; XIONG, Hui; HE, Qing: "Deep Subdomain Adaptation Network for Image Classification," arXiv.org, Cornell University Library, 17 June 2021, DOI: 10.1109/TNNLS.2020.2988928

Also Published As

Publication number Publication date
JPWO2023013075A1 (en) 2023-02-09


Legal Events

Code Description
121 — EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21952898; Country of ref document: EP; Kind code of ref document: A1)
WWE — WIPO information: entry into national phase (Ref document number: 2023539586; Country of ref document: JP)
NENP — Non-entry into the national phase (Ref country code: DE)