WO2023062763A1

WO2023062763A1 - Machine learning device, feature extraction device, machine learning method, feature extraction method, machine learning program, and feature extraction program

Info

Publication number: WO2023062763A1
Application number: PCT/JP2021/037975
Authority: WO
Inventors: 知之藤野; 啓一郎柏木
Original assignee: 日本電信電話株式会社
Priority date: 2021-10-13
Filing date: 2021-10-13
Publication date: 2023-04-20

Abstract

This feature extraction device comprises an acquisition unit and a learning unit. The acquisition unit acquires a multi-layer model having a plurality of layers connected in series. Each layer includes a plurality of boosting machines. The learning unit uses the multi-layer model acquired by the acquisition unit as a machine learning model to execute machine learning such that at least one of the input layer and middle layer of the multi-layer model extracts a feature from given data.

Description

Machine learning device, feature quantity extraction device, machine learning method, feature quantity extraction method, machine learning program and feature quantity extraction program

The present disclosure relates to a machine learning device, a feature quantity extraction device, a machine learning method, a feature quantity extraction method, a machine learning program, and a feature quantity extraction program.

Ensemble learning is a machine learning method that combines multiple classifiers. There are various methods of ensemble learning. One method is Boosting.

Boosting is an algorithm that creates a strong classifier by connecting weak classifiers in sequence. Specifically, boosting is a learning method in which a new classifier is added based on the output of an existing classifier, and this new classifier is optimized so that the sum of the classifier outputs reduces the error. A commonly used boosting algorithm is gradient boosting. Gradient boosting is a family of algorithms that perform boosting using the gradient method so that the objective function is minimized. The well-known AdaBoost (Adaptive Boosting) can also be regarded as a form of gradient boosting.

However, with the above prior art, it may be difficult to acquire feature quantities with high inference performance.

Therefore, the present disclosure proposes a machine learning device, a feature quantity extraction device, a machine learning method, a feature quantity extraction method, a machine learning program, and a feature quantity extraction program capable of acquiring feature quantities with high inference performance.

In one aspect of the present disclosure, a machine learning device includes: an acquisition unit that acquires a multi-layer model having a plurality of layers connected in series, each layer including a plurality of gradient boosting machines; and a machine learning execution unit that performs machine learning using the multi-layer model as a machine learning model so that at least one of the input layer and an intermediate layer of the multi-layer model extracts features from given data.

A machine learning device according to one or more embodiments of the present disclosure can acquire feature quantities with high inference performance.

FIG. 1 is a block diagram of an example environment for machine learning.
FIG. 2 is a block diagram of an example configuration of a feature extraction device according to the present disclosure.
FIG. 3 shows an example of a graphical representation of GBDTs (Gradient Boosting Decision Trees).
FIG. 4A shows an example graphical representation of a GBDT in accordance with this disclosure.
FIG. 4B shows an example graphical representation of a GBDT in accordance with this disclosure.
FIG. 5 illustrates an example of gradient computation according to this disclosure.
FIG. 6 is a flowchart illustrating an example of processing for learning a plurality of discriminators in the boosting machine.
FIG. 7 is a flowchart illustrating an example of inference processing using boosting machines.
FIG. 8 shows an example of the hardware configuration of a computer.

A number of embodiments are described in detail below with reference to the drawings. However, the present invention is not limited by these multiple embodiments. Features of various embodiments may be combined in various ways provided the features are not mutually exclusive. Identical elements are denoted by identical reference numerals, and duplicate descriptions are omitted.

The next paragraph explains the outline of the technology according to the present disclosure. However, this summary is not intended to limit the invention or the embodiments described in the following sections.

[Overview]
Gradient-boosted decision trees (GBDTs) are used for various tasks such as classification, while deep learning is used for classification of high-dimensional data such as video, image and audio. Due to its algorithmic structure, GBDT has been inferior to deep learning in accuracy for tasks such as classification of high-dimensional data. In this disclosure, the structure of GBDT is extended. The concept of backpropagation is then introduced into the structure-extended GBDT. The GBDT according to the present disclosure has a feature extraction layer and can extract features with high inference performance. Therefore, the GBDT according to the present disclosure can handle high-dimensional data with high accuracy.

The following description consists of nine sections: 1. Introduction; 2. Environment for machine learning; 3. Configuration of feature quantity extraction device; 4. Boosting processing; 5. Flowchart of boosting processing; 6. Effect; 7. Others; 8. Hardware configuration; and 9. Summary of embodiments.

[1. Introduction]
Gradient boosting, like deep learning, is a popular machine learning technique. Gradient boosting successively adds weak discriminators such that the objective function is minimized.

Gradient boosting that uses decision trees as weak classifiers is called Gradient Boosting Decision Trees (GBDT). GBDT is a general-purpose supervised learning algorithm. GBDT is used in various applications such as regression and classification using IoT (Internet of Things) sensor data. Examples of GBDT-based models include XGBoost and LightGBM.
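The successive addition of weak classifiers described above can be illustrated with a minimal sketch. The sketch assumes a one-dimensional input, a squared-error objective, and depth-1 regression trees (stumps); the names `fit_stump`, `fit_boosting`, `predict`, and the learning rate `lr` are hypothetical and are not taken from XGBoost, LightGBM, or this disclosure.

```python
from typing import List, Sequence, Tuple

# A stump is (threshold, value_if_left, value_if_right).
Stump = Tuple[float, float, float]


def fit_stump(xs: Sequence[float], residuals: Sequence[float]) -> Stump:
    """Fit a depth-1 regression tree to the residuals (needs >= 2 samples)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        threshold = (xs[order[k - 1]] + xs[order[k]]) / 2
        if best is None or err < best[0]:
            best = (err, (threshold, lv, rv))
    return best[1]


def predict(stumps: List[Stump], lr: float, x: float) -> float:
    """The boosting machine output is the (scaled) sum of all weak learners."""
    return sum(lr * (lv if x < thr else rv) for thr, lv, rv in stumps)


def fit_boosting(xs, ys, rounds: int = 20, lr: float = 0.5) -> List[Stump]:
    """Add one stump per round, each fitted to the current residuals.

    For squared error, the residual y - F(x) is the negative gradient of the
    objective with respect to the current output F(x).
    """
    stumps: List[Stump] = []
    for _ in range(rounds):
        residuals = [y - predict(stumps, lr, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return stumps
```

Each round reduces the training error because the new stump is fitted to what the current sum of stumps still gets wrong.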

On the other hand, deep learning is becoming mainstream in the classification of media data such as videos, images, natural language, and audio. For such tasks, GBDT is generally inferior to deep learning in accuracy. A possible reason is that GBDT does not perform feature acquisition that deep learning does.

In conventional GBDT, features are determined by dividing the input space into subspaces. The partitioning method is generally determined by performing a grid search on the dimensions and data points.

To divide the input space, branch points are grid-searched for each dimension using the gain values of the dataset. Specifically, the set of data points involved in the branch is sorted along each dimension. Gain values for the left and right data sets after the split are calculated for each possible split point. Then, the branch position that gives the greatest increase over the current gain value is searched for. As a result of this grid search, the best branch point is selected.

In this way, all dimensions are searched independently when a grid search is performed. Then the one dimension with the highest gain value is selected.
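The per-dimension search can be sketched as follows. The text does not state the gain formula, so the sketch assumes the second-order score G^2 / (H + lambda) commonly used in GBDT implementations, where G and H are sums of first and second derivatives; `leaf_score`, `best_split`, and `lam` are hypothetical names.

```python
from typing import Optional, Sequence, Tuple


def leaf_score(g_sum: float, h_sum: float, lam: float = 1.0) -> float:
    """Assumed gain value of a set of data points: G^2 / (H + lambda)."""
    return g_sum * g_sum / (h_sum + lam)


def best_split(
    X: Sequence[Sequence[float]],
    g: Sequence[float],
    h: Sequence[float],
    lam: float = 1.0,
) -> Optional[Tuple[float, int, float]]:
    """Return (gain, dimension, threshold) of the best branch point, or None."""
    n, dims = len(X), len(X[0])
    g_total, h_total = sum(g), sum(h)
    parent = leaf_score(g_total, h_total, lam)
    best = None
    for d in range(dims):  # every dimension is searched independently
        order = sorted(range(n), key=lambda i: X[i][d])
        g_left = h_left = 0.0
        for k in range(n - 1):
            i, j = order[k], order[k + 1]
            g_left += g[i]
            h_left += h[i]
            if X[i][d] == X[j][d]:
                continue  # no branch point between equal values
            gain = (
                leaf_score(g_left, h_left, lam)
                + leaf_score(g_total - g_left, h_total - h_left, lam)
                - parent
            )
            if best is None or gain > best[0]:
                best = (gain, d, (X[i][d] + X[j][d]) / 2)
    return best
```

Because the inner loop only ever looks at one coordinate `X[i][d]` at a time, a split can never depend on a combination of dimensions, which is the limitation discussed in the following paragraphs.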

However, such treatment of input dimensions makes it difficult for GBDT to grasp the correlation between dimensions in the input space. GBDT treats the dimensions of the input data independently and does not consider the correlation between the dimensions.

Feature extraction is a process that captures the correlation information between the dimensions of the input data and maps the input data into a feature space that is easy to classify. Since the conventional GBDT treats the dimensions independently, the conventional GBDT is not expected to acquire features more suitable for the task.

In order to solve the above problems, the feature extraction device according to one or more embodiments of the present disclosure performs one or more boosting processes described below.

[2. environment for machine learning]
First, an environment for machine learning according to the present disclosure will be described with reference to FIG. 1.

FIG. 1 is a block diagram of environment 1, which is an example of an environment for machine learning. As shown in FIG. 1, the environment 1 includes a feature quantity extraction device 100, a network 200, and a user device 300.

The feature extraction device 100 is a device that performs one or more boosting processes. One or more boosting processes include a process of generating a strong classifier by connecting a plurality of weak classifiers, and an inference process using the generated strong classifiers. Details of the boosting process according to the present disclosure are described in Section 4. The feature quantity extraction device 100 is an example of a machine learning device.

The feature quantity extraction device 100 is a data processing device such as a server. An example of the configuration of the feature quantity extraction device 100 will be described in Section 3.

The network 200 is, for example, a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet. The network 200 connects the feature quantity extraction device 100 and the user device 300.

The user device 300 is a data processing device such as a client device. A user is, for example, a data scientist. For example, the user device 300 sends a request to the feature quantity extraction device 100 to acquire a generated strong discriminator. The user device 300 may send a request to the feature quantity extraction device 100 to execute inference processing using the generated strong discriminator. In this case, the user device 300 also sends the data that is the target of the inference processing to the feature quantity extraction device 100.

[3. Configuration of Feature Amount Extraction Device]
Next, an example of the configuration of the feature quantity extraction device 100 will be described with reference to FIG. 2.

FIG. 2 is a block diagram of the feature quantity extraction device 100, which is an example of the configuration of the feature quantity extraction device according to the present disclosure. As shown in FIG. 2, the feature quantity extraction device 100 includes a communication unit 110, a control unit 120 and a storage unit 130. The feature quantity extraction device 100 may include an input unit (for example, a keyboard or a mouse) that receives input from an administrator of the feature quantity extraction device 100. The feature quantity extraction device 100 may also include an output unit (for example, a liquid crystal display or an organic EL (Electro Luminescence) display) that displays information to the administrator of the feature quantity extraction device 100.

[3-1. Communication unit 110]
The communication unit 110 is implemented by, for example, a NIC (Network Interface Card). The communication unit 110 is connected to the network 200 by wire or wirelessly. The communication unit 110 can transmit and receive information to and from the user device 300 via the network 200.

[3-2. Control unit 120]
The control unit 120 is a controller. The control unit 120 is implemented by one or more processors (for example, a CPU (Central Processing Unit) or an MPU (Micro Processing Unit)) that execute various programs stored in the storage device of the feature quantity extraction device 100, using a RAM (Random Access Memory) as a work area. The control unit 120 may also be implemented by an integrated circuit such as an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a GPGPU (General Purpose Graphic Processing Unit).

As shown in FIG. 2, the control unit 120 includes a receiving unit 121, an acquisition unit 122, a learning unit 123, an inference unit 124 and a providing unit 125. One or more processors of the feature quantity extraction device 100 can implement each of these units by executing instructions stored in one or more memories of the feature quantity extraction device 100. The data processing performed by each unit is an example, and each unit (e.g., the learning unit 123) may perform data processing described in relation to another unit (e.g., the inference unit 124).

[3-2-1. Receiving unit 121]
The receiving unit 121 receives various data. The receiving unit 121 also stores the received data in the storage unit 130. For example, the receiving unit 121 receives data regarding various machine learning algorithms from the administrator of the feature quantity extraction device 100. The receiving unit 121 also receives training data for machine learning from the administrator.

[3-2-2. Acquisition unit 122]
The acquisition unit 122 acquires various data from the storage unit 130. For example, the acquisition unit 122 acquires data related to various machine learning algorithms and training data for machine learning.

[3-2-3. Learning unit 123]
The learning unit 123 uses various data acquired by the acquisition unit 122 to perform machine learning. For example, the learning unit 123 uses training data to train a machine learning algorithm, thereby generating a trained model. For example, the trained model is a generated strong classifier. The learning unit 123 stores the trained model in the storage unit 130.

The learning unit 123 is an example of a machine learning execution unit.

[3-2-4. Inference unit 124]
The inference unit 124 receives data to be subjected to inference processing from the user device 300. The inference unit 124 also acquires a trained model from the storage unit 130. The inference unit 124 performs inference processing by applying the received data to the trained model.

The inference unit 124 is an example of an acquisition unit and an extraction unit.

[3-2-5. Providing unit 125]
The providing unit 125 provides various information. For example, the providing unit 125 provides the generated strong classifier to the user device 300. When the inference unit 124 performs inference processing using the data received from the user device 300, the providing unit 125 provides the result of the inference processing to the user device 300.

[3-3. Storage unit 130]
The storage unit 130 is implemented by, for example, a semiconductor memory device such as a RAM or a flash memory, or a storage device such as a hard disk or an optical disk. The storage unit 130 stores various data. For example, the storage unit 130 stores data related to various machine learning algorithms, training data for machine learning, learned models, and the like.

[4. Boosting processing]
Next, boosting processing according to the present disclosure will be described with reference to FIGS. 3, 4A, 4B and 5.

First, conventional boosting processing will be described in order to compare the boosting processing according to the present disclosure with conventional boosting processing.

FIG. 3 shows a graphical representation 10, which is an example of a graphical representation of GBDT. The graphical representation 10 is a graphical representation of a conventional boosting process. Graphical representation 10 is a standard GBDT graphical representation.

The variables x_1 through x_D are the input data. "D" is the number of dimensions of the input data. The input data is input to each boosting machine.

A boosting machine is a multi-input, single-output discriminator represented as a linear sum of multiple weak discriminators that are sequentially learned according to a boosting algorithm so as to reduce output errors.

In the example of Figure 3, the boosting machine is a function of multiple inputs and one output. A boosting machine is a set of weak classifiers. For example, weak classifiers are decision trees.

When GBDT is used to solve a multi-class classification problem, boosting machines are typically generated for the number of output classes. In the example of FIG. 3, the number of output classes is "C". Thus, the graphical representation 10 includes boosting machines F_1 through F_C.

In the example of FIG. 3, the variables x_1 through x_D are input to the boosting machine F_1. In this case, the boosting machine F_1 outputs the value "y_1". Similarly, the variables x_1 through x_D are also input to the boosting machine F_2. In this case, the boosting machine F_2 outputs the value "y_2". For example, the output value indicates the probability that the input data belongs to a particular output class.

However, these boosting machines are independent of each other. Accordingly, the graphical representation 10 does not take into account correlations between dimensions of the input data. In order to obtain accuracy comparable to deep learning, it is necessary to consider the correlation between dimensions.

Therefore, the feature quantity extraction device 100 of FIG. 1 executes the boosting process according to the present disclosure in order to realize GBDT considering the correlation between dimensions of input data.

The feature quantity extraction device 100 in FIG. 1 handles multi-class high-dimensional input data in GBDT, as in the case of deep learning. Then, the feature quantity extraction device 100 performs optimization similar to error backpropagation on the GBDT. This enables classification and regression considering correlations between dimensions of high-dimensional input data.

In the following, GBDT using error backpropagation for high-dimensional input data will be described with reference to FIGS. 4A, 4B and 5. In this specification, GBDT using backpropagation is called BP Boost (Back Propagation Boosting).

FIGS. 4A and 4B show graphical representation 20, which is an example graphical representation of a GBDT according to the present disclosure. The graphical representation 20 is an example of a graphical representation of BP Boost. BP Boost is executed, for example, by the learning unit 123 of the feature quantity extraction device 100.

In graph representation 20, the conventional GBDT is extended to a multistage model. As shown in FIG. 4A, graphical representation 20 includes boosting machine 21a, boosting machine 21b, boosting machine 22a and boosting machine 22b.

Variables x_1 through x_3 are the input data contained in the training data. The input data includes values 23a, 23b and 23c. The set of values 23a, 23b and 23c is input to boosting machine 21a and boosting machine 21b, which correspond to Equations 1 and 2 below, respectively.

The boosting machine 21a outputs a value 24a corresponding to y_1. The boosting machine 21b also outputs a value 24b corresponding to y_2.

The set of values 24a and 24b is input to boosting machine 22a and boosting machine 22b, which correspond to Equations 3 and 4 below, respectively.

The boosting machine 22a outputs a value 25a corresponding to z_1. The boosting machine 22b also outputs a value 25b corresponding to z_2.

Values 25a and 25b are the values output from the multistage model.

The feature quantity extraction device 100 acquires feature quantities by the boosting machine 21a and the boosting machine 21b, which are boosting machines in the preceding stage.
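The data flow through the two stages can be sketched as follows. This is only an illustration of the structure in FIG. 4A under assumed names: `BoostingMachine` and `forward` are hypothetical, and the lambdas stand in for learned weak classifiers such as decision trees.

```python
from typing import Callable, List, Optional, Sequence

# A weak learner maps the full input vector of its layer to one number.
WeakLearner = Callable[[Sequence[float]], float]


class BoostingMachine:
    """Multi-input, single-output model: the sum of its weak learners."""

    def __init__(self, learners: Optional[List[WeakLearner]] = None):
        self.learners: List[WeakLearner] = list(learners or [])

    def __call__(self, inputs: Sequence[float]) -> float:
        return sum(f(inputs) for f in self.learners)


def forward(layers: List[List[BoostingMachine]], x: Sequence[float]) -> List[List[float]]:
    """Return the values at every stage: [inputs, stage-1 outputs, stage-2 outputs, ...]."""
    activations = [list(x)]
    for layer in layers:
        # Every machine in a layer receives all outputs of the previous layer.
        activations.append([machine(activations[-1]) for machine in layer])
    return activations


# Two-stage example mirroring FIG. 4A: 3 inputs -> (y_1, y_2) -> (z_1, z_2).
front = [
    BoostingMachine([lambda v: 1.0 if v[0] > 0.5 else -1.0]),
    BoostingMachine([lambda v: 0.5 if v[2] > 0.0 else 0.0]),
]
rear = [
    BoostingMachine([lambda y: 2.0 if y[0] > 0.0 else 0.0]),
    BoostingMachine([lambda y: 1.0 if y[1] > 0.25 else 0.0]),
]
stages = forward([front, rear], [1.0, 0.0, 3.0])
features = stages[1]  # outputs of the preceding stage, used as feature quantities
```

In this sketch the intermediate values `stages[1]` play the role of values 24a and 24b, and `stages[2]` the role of values 25a and 25b.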

Referring to FIG. 4B, the variables t_1 and t_2 are the correct labels. Correct labels are included in the training data and associated with the input data.

Deep learning calculates gradients and uses gradient descent to optimize parameters. On the other hand, as shown in FIG. 4B, BP Boost optimizes parameters by adding trees through boosting.

[4-1. Learning BP Boost]
In the following, the technical details of BP Boost training are described.

Conventional gradient boosting uses gradient information to fit a weak discriminator to the training data. This fitting is optimization of the classifier by Newton's method.

On the other hand, the concept of BP Boost is to optimize boosting machines cascaded by boosting. Cascading boosting machines means connecting boosting machines in series.

BP Boost propagates gradient information from the output side to the input side. BP Boost then adds a weak classifier to each boosting machine to perform global optimization of the classifier using Newton's method.

In the graphical representation 20 of FIG. 4B, the multistage boosting machine is configured by boosting machine 21a, boosting machine 21b, boosting machine 22a and boosting machine 22b. The boosting machines in the front stage are the boosting machine 21a and the boosting machine 21b, and the boosting machines in the rear stage are the boosting machine 22a and the boosting machine 22b.

First, assume that M-1 weak classifiers have already been learned in each boosting machine.

If the set of values 24a and 24b is determined, the normal boosting learning algorithm can be applied to the boosting machines in the rear stage. Therefore, the learning of the boosting machines in the front stage is explained below.

Assume that the Mth decision tree (Equation 5 below) is newly added to the boosting machine 21a in FIG. 4B.

Then, the objective function L^(M) is given by Equation 6 below.

Thus, the objective function L^(M) has a nested structure.

l_c is the loss function. The arguments of l_c are the output of the boosting machine in the rear stage and the training data. c is an index that distinguishes the boosting machines in the rear stage. i is an index that distinguishes the samples of the training data. N is the number of samples.

First, the following Equation 7 is obtained by performing a first-order Taylor expansion with respect to Equation 5 above inside the nest.

Here, Equation 8 below is used. The y_1 in Equation 8 does not include the contribution of Equation 5 above. The abbreviation in Equation 9 below is also used.

It should be noted that Equation 9 above represents the gradient with respect to the input of the boosting machine.

Equation 7 above is substituted into Equation 6 above. Outside the nest, the following Equation 10 is obtained by performing a second-order Taylor expansion with respect to the second term of Equation 7 above.

Here, g_{i,c} and h_{i,c} are given by Equations 11 and 12 below.

The general GBDT objective function L^(M) is given by Equation 13 below (see, for example, Jerome Friedman, Trevor Hastie, and Robert Tibshirani, "Additive logistic regression: a statistical view of boosting," The Annals of Statistics, Vol. 28, No. 2, pp. 337-407, 2000).

Here, g_i and h_i are given by Equations 14 and 15 below.

Equation 10 above is Equation 13 above with g_i and h_i replaced by g_i and h_i multiplied by the gradient of Equation 9 above. From this, the backpropagation rules for the gradient and the second derivative are obtained (Equations 16 and 17 below).
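The numbered equations themselves are not reproduced in this text. The following is one possible reconstruction that is consistent with the prose above, not the equations as published. All symbol names are assumptions: F_1 and F_2 denote the front-stage boosting machines with outputs y_1 and y_2, G_c denotes the c-th rear-stage boosting machine with output z_c, f_M denotes the newly added M-th tree, and regularization terms are omitted.

```latex
% Assumed reconstruction; "cf." marks the equation number each line is meant to mirror.
\begin{align}
L^{(M)} &= \sum_{i=1}^{N} \sum_{c}
  l_c\!\Bigl(t_{i,c},\; G_c\bigl(y_{i,1} + f_M(\mathbf{x}_i),\, y_{i,2}\bigr)\Bigr)
  && \text{(cf. Eq. 6)} \\
G_c\bigl(y_{i,1} + f_M(\mathbf{x}_i),\, y_{i,2}\bigr)
  &\approx G_c\bigl(y_{i,1}, y_{i,2}\bigr) + G'_{i,c}\, f_M(\mathbf{x}_i)
  && \text{(cf. Eq. 7)} \\
y_{i,1} &= \sum_{m=1}^{M-1} f_m(\mathbf{x}_i)
  && \text{(cf. Eq. 8)} \\
G'_{i,c} &= \left.\frac{\partial G_c}{\partial y_1}\right|_{(y_{i,1},\, y_{i,2})}
  && \text{(cf. Eq. 9)} \\
L^{(M)} &\approx \sum_{i=1}^{N} \sum_{c} \Bigl[
  l_c(t_{i,c}, z_{i,c})
  + g_{i,c}\, G'_{i,c}\, f_M(\mathbf{x}_i)
  + \tfrac{1}{2}\, h_{i,c}\, \bigl(G'_{i,c}\bigr)^{2} f_M(\mathbf{x}_i)^{2} \Bigr]
  && \text{(cf. Eq. 10)} \\
g_{i,c} &= \left.\frac{\partial l_c(t_{i,c}, z)}{\partial z}\right|_{z = z_{i,c}},
\qquad
h_{i,c} = \left.\frac{\partial^{2} l_c(t_{i,c}, z)}{\partial z^{2}}\right|_{z = z_{i,c}}
  && \text{(cf. Eqs. 11, 12)} \\
L^{(M)} &\approx \sum_{i=1}^{N} \Bigl[
  l(t_i, y_i) + g_i\, f_M(\mathbf{x}_i) + \tfrac{1}{2}\, h_i\, f_M(\mathbf{x}_i)^{2} \Bigr]
  && \text{(cf. Eq. 13)} \\
\tilde{g}_i &= \sum_{c} g_{i,c}\, G'_{i,c},
\qquad
\tilde{h}_i = \sum_{c} h_{i,c}\, \bigl(G'_{i,c}\bigr)^{2}
  && \text{(cf. Eqs. 16, 17)}
\end{align}
```

Under this reconstruction, comparing the fifth line with the seventh shows that the front-stage machine can be trained like an ordinary GBDT once g_i and h_i are replaced by the propagated quantities in the last line. The squared factor on the second-derivative term follows from the second-order expansion used here; the published Equation 17 may be written differently.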

The gain value of a boosting machine in the rear stage is calculated using the gradient and the second derivative of that boosting machine, as in the normal boosting learning algorithm. Then, the gradient and the second derivative of a boosting machine in the front stage are obtained from the gradient and the second derivative of the boosting machines in the rear stage by using Equations 16 and 17 above. As a result, the gain value of the boosting machine in the front stage is calculated using its gradient and second derivative. A global optimization of the discriminator is performed using these calculated gain values.
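The propagation step can be sketched for a single training sample as follows. The exact form of Equations 16 and 17 is not reproduced in this text, so the sketch assumes that the front-stage gradient is the sum over rear-stage machines of the rear-stage gradient times the input gradient, and that the second derivative uses the squared input gradient. The function name `backpropagate` and the argument `jac` are hypothetical.

```python
from typing import List, Sequence, Tuple


def backpropagate(
    g_rear: Sequence[float],
    h_rear: Sequence[float],
    jac: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Propagate first and second derivatives from the rear stage to the front stage.

    g_rear[c], h_rear[c]: derivatives of the loss with respect to the output
        of rear-stage machine c.
    jac[c][k]: (approximate) gradient of rear-stage machine c with respect to
        its k-th input, i.e. the output of front-stage machine k.
    """
    num_rear = len(jac)
    num_front = len(jac[0])
    g_front = [
        sum(g_rear[c] * jac[c][k] for c in range(num_rear))
        for k in range(num_front)
    ]
    # Assumption: the second-order term carries the squared input gradient.
    h_front = [
        sum(h_rear[c] * jac[c][k] ** 2 for c in range(num_rear))
        for k in range(num_front)
    ]
    return g_front, h_front
```

The returned values can then be used in place of the usual gradient and second derivative when computing gain values for the front-stage machines.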

The feature quantity extraction device 100 can perform machine learning using the BP Boost machine learning algorithm described above. Boosting processing based on BP Boost will be described below. Boosting processing based on BP Boost is executed by each control unit of the feature quantity extraction device 100 .

First, the acquisition unit 122 of the feature quantity extraction device 100 acquires a multi-layer model having multiple layers connected in series from the storage unit 130. The multi-layer model has, for example, a structure similar to the graphical representation 20 of FIGS. 4A and 4B. Each layer contains multiple boosting machines (e.g., boosting machines each composed of multiple decision trees).

These multiple layers include, for example, an input layer, an intermediate layer and an output layer. If the multiple layers include only input and output layers, the input layer corresponds to the boosting machines in the front stage of FIGS. 4A and 4B, and the output layer corresponds to the boosting machines in the rear stage of FIGS. 4A and 4B.

Next, the learning unit 123 of the feature quantity extraction device 100 uses the multi-layer model acquired by the acquisition unit 122 to perform machine learning so that at least one of the input layer and the intermediate layer extracts feature quantities from given data. For example, the learning unit 123 propagates the information about the gradient determined by using Equation 9 above from the output layer of the multi-layer model to the input layer of the multi-layer model. Thus, the learning unit 123 updates the multi-layer model. The learning unit 123 stores the updated multi-layer model in the storage unit 130.

After that, the inference unit 124 of the feature quantity extraction device 100 acquires the updated multi-layer model from the storage unit 130. The inference unit 124 applies the inference data to the updated multi-layer model. In this way, the inference unit 124 extracts feature amounts from the inference data.

[4-2. Initial value setting]
This section describes setting the initial state of the boosting machine.

In order to perform the learning process of the multi-stage boosting machine in FIG. 4B using Equations 16 and 17 above, the output of the boosting machines in the rear stage is required. Therefore, the values 23a, 23b and 23c are first forward propagated to obtain the output of the boosting machines in the rear stage.

However, in the initial state, the input/output values of each stage are not determined. In this case, no gradient can be calculated. In a normal GBDT, processing begins by setting the output value to "0" and calculating the gradient of the initial state.

In BP Boost, the initial gradient can be calculated in the latter boosting machine (that is, the final layer), as in normal boosting. However, if the output value is set to '0', then all inputs to the next stage boosting machine will be '0'. With such settings, learning will not start.

Therefore, the feature quantity extraction device 100 sets an initial output value for each boosting machine. The feature quantity extraction device 100 can add an initial weak discriminator (for example, a decision tree) based on a plurality of set initial output values.

First, the output value of the latter boosting machine (that is, the final layer) is set to "0". Then, the feature amount extraction apparatus 100 initializes the gradient information of stages other than the latter stage with uniform random numbers in the interval [−1, 1].

As a result, random initial gradients are set in stages other than the latter stage (that is, layers other than the final layer). This allows the feature extraction device 100 to add a random tree structure to the boosting machine, thereby starting learning of the multistage boosting machine.
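A minimal sketch of this initialization is shown below. It assumes a squared-error loss for the final layer, so that an output of 0 gives the gradient 0 - t; the loss, the function name `initial_gradients`, and the data layout are assumptions made for illustration.

```python
import random
from typing import List, Optional, Sequence


def initial_gradients(
    machines_per_layer: Sequence[int],
    labels: Sequence[Sequence[float]],
    seed: Optional[int] = None,
) -> List[List[List[float]]]:
    """Return grads[layer][machine][sample] for the initial state.

    labels[c][i] is the correct label of sample i for final-layer machine c.
    """
    rng = random.Random(seed)
    num_samples = len(labels[0])
    last = len(machines_per_layer) - 1
    grads = []
    for layer, count in enumerate(machines_per_layer):
        if layer == last:
            # Final layer: output is set to 0, so with an assumed squared-error
            # loss 0.5 * (z - t)^2 the gradient is z - t = 0 - t.
            layer_grads = [[0.0 - t for t in labels[c]] for c in range(count)]
        else:
            # Other layers: uniform random numbers in the interval [-1, 1].
            layer_grads = [
                [rng.uniform(-1.0, 1.0) for _ in range(num_samples)]
                for _ in range(count)
            ]
        grads.append(layer_grads)
    return grads
```

Fitting one tree per boosting machine to these gradients yields non-constant outputs in the earlier layers, so the inputs to the next stage are no longer all zero.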

[4-3. Gradient calculation]
The gradient of the boosting machine is given by Equation 9 above. Calculation of the gradient therefore requires a partial derivative with respect to the input of the boosting machine.

However, a boosting machine is a set of weak classifiers. For example, the set of weak classifiers is the sum of decision trees, which is not a differentiable function. That is, the gradient cannot be determined analytically.

Therefore, the feature quantity extraction device 100 approximately calculates the gradient. For example, the feature quantity extraction device 100 can employ linear approximation.

FIG. 5 shows gradient calculation 30, which is an example of gradient calculation according to the present disclosure. In gradient calculation 30, the boosting machine is a sum of decision trees. When the dimension of the input data is 2, the branches of the decision trees form a shape like a sum of rectangular regions, as shown in FIG. 5.

As shown in FIG. 5, the shape of the function is a step function in the vicinity of the data points. The data points in FIG. 5 correspond to specific branch points in the decision tree. The feature quantity extraction device 100 calculates the slope of the straight line connecting the branch point on the left of a specific branch point and the data point. Also, the feature quantity extraction apparatus 100 calculates the slope of the straight line connecting the right branch point of the specific branch point and the data point. The feature quantity extraction device 100 determines the larger slope of the two calculated slopes as the slope for the data points.

In the example of FIG. 5, the left slope is the difference between the weight of the branch point of the data point and the weight of the left branch point (Left Weight). The right slope is the difference between the branch point weight of the data point and the right branch point weight (Right Weight). In this example, the slope approximation is the left slope.

However, as the data point approaches a branch point, the slope approaches infinity. The feature quantity extraction device 100 may set upper and lower limits for the gradient to avoid excessively large gradients. The upper and lower limits may be, for example, the interval [-1, 1].
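The linear approximation can be sketched for a one-dimensional step function as follows. The sketch makes several assumptions that the text leaves open: each slope is the weight difference across a neighboring branch point divided by the distance from the data point to that branch point, "larger" means larger in absolute value, and the result is clipped to [-1, 1]. The function name `approx_gradient` is hypothetical.

```python
import bisect
import math
from typing import Sequence


def approx_gradient(
    thresholds: Sequence[float],
    weights: Sequence[float],
    x: float,
    limit: float = 1.0,
) -> float:
    """Approximate the gradient of a step function at x.

    thresholds: sorted branch points s_0 < s_1 < ...
    weights: output value of each segment, len(weights) == len(thresholds) + 1.
    """
    j = bisect.bisect_right(thresholds, x)  # index of the segment containing x

    def slope(weight_diff: float, distance: float) -> float:
        if weight_diff == 0.0:
            return 0.0
        if distance == 0.0:
            return math.copysign(math.inf, weight_diff)
        return weight_diff / distance

    candidates = []
    if j > 0:  # line towards the branch point on the left
        candidates.append(slope(weights[j] - weights[j - 1], x - thresholds[j - 1]))
    if j < len(thresholds):  # line towards the branch point on the right
        candidates.append(slope(weights[j + 1] - weights[j], thresholds[j] - x))
    if not candidates:
        return 0.0  # constant function: no branch points
    steepest = max(candidates, key=abs)
    # Upper and lower limits avoid exploding gradients near a branch point.
    return max(-limit, min(limit, steepest))
```

For a boosting machine with several inputs, the same calculation would be repeated along each input dimension to obtain the partial derivatives needed by the backpropagation step.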

[4-4. Other embodiments]
Other embodiments are described in this subsection.

[4-4-1. weak discriminator]
In the above embodiments, weak classifiers included in the boosting machine are described as decision trees, but weak classifiers are not limited to decision trees. Weak classifiers may be other machine learning algorithms such as SVM (Support Vector Machine).

[5. Flowchart of Boosting Processing]
Next, a flowchart of an example of boosting processing according to the present disclosure will be described with reference to FIGS. 6 and 7. Examples of boosting processing include processing for training multiple classifiers in a boosting machine. This training processing is performed by, for example, the feature quantity extraction device 100.

FIG. 6 is a flowchart showing process P100, which is an example of the process for learning a plurality of discriminators in the boosting machine.

As shown in FIG. 6, first, the learning unit 123 of the feature extraction device 100 adds a boosting machine to the model (step S101). This model is, for example, the multi-layer model described above before training is performed.

Next, the learning unit 123 sets the input/output connection relationship of the boosting machine (step S102). For example, the learning unit 123 may construct a structure such as the graphical representation 20 of FIGS. 4A and 4B.

Next, the learning unit 123 determines whether to add a boosting machine (step S103). For example, this determination is based on the class of interest for data analysis.

When the learning unit 123 determines to add a boosting machine (step S103: Yes), the learning unit 123 executes step S101 again.

When the learning unit 123 determines not to add a boosting machine (step S103: No), the learning unit 123 sets an initial gradient for each boosting machine (step S104). For example, uniform random numbers are used to set the initial gradient.

Next, the learning unit 123 adds an initial discriminator to each boosting machine (step S105). As a result, an initialized model is generated.

Next, the learning unit 123 executes inference processing (step S106). The first time, the learning unit 123 performs inference processing using the initialized model. From the second time onwards, inference processing is executed using the updated model.

The inference process is detailed below with reference to FIG. 7.

Next, the learning unit 123 uses the learning data included in the learning data set to calculate the error between the inference result and the learning label (step S107).

Next, the learning unit 123 determines whether the error has converged (step S108).

If it is determined that the error has converged (step S108: Yes), the process P100 ends.

When it is determined that the error has not converged (step S108: No), the learning unit 123 calculates the gradient information of each boosting machine based on backpropagation (step S109).

Next, the learning unit 123 adds a discriminator to each boosting machine based on the gradient information (step S110). The result is an updated model. Then, the learning unit 123 executes step S106 again using the updated model.
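
The loop of steps S106 to S110 can be sketched as follows for a small two-layer case. This is only a minimal illustration under several assumptions that do not come from the present disclosure: a scalar input, a squared-error loss, regression stumps as the weak discriminators of the first layer, single-input linear discriminators of the form w * h[j] in the output machine, a learning rate lr, a round-robin choice of j, and convergence judged by the change in error. The function names fit_stump and train are hypothetical. The sketch makes no claim about convergence speed or accuracy.

```python
import random


def fit_stump(xs, targets):
    """Fit a least-squares regression stump (one threshold) to scalar inputs."""
    best = None
    for thr in sorted(set(xs)):
        left = [t for x, t in zip(xs, targets) if x <= thr]
        right = [t for x, t in zip(xs, targets) if x > thr]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right) if right else lv
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    _, thr, lv, rv = best
    return lambda x: lv if x <= thr else rv


def train(xs, ys, n_hidden=2, lr=0.1, tol=1e-4, max_rounds=50, seed=0):
    rng = random.Random(seed)
    # Initialized model: one constant discriminator per machine (assumed form).
    hidden = []
    for _ in range(n_hidden):
        c = rng.uniform(-1.0, 1.0)
        hidden.append([lambda x, c=c: c])
    # Output machine: each discriminator (j, w) maps hidden outputs h to w * h[j].
    output = [(0, rng.uniform(-1.0, 1.0))]
    errors = []
    for rnd in range(max_rounds):
        # S106: inference with the current model.
        H = [[sum(d(x) for d in m) for m in hidden] for x in xs]
        preds = [sum(w * h[j] for j, w in output) for h in H]
        # S107: mean of 0.5 * squared error against the training labels.
        errors.append(sum(0.5 * (p - y) ** 2 for p, y in zip(preds, ys)) / len(xs))
        # S108: treat a small change in error as convergence.
        if len(errors) > 1 and abs(errors[-2] - errors[-1]) < tol:
            break
        # S109: gradient at the output, then chain rule back to each hidden machine.
        g_out = [p - y for p, y in zip(preds, ys)]
        W = [sum(w for j, w in output if j == k) for k in range(n_hidden)]
        g_hidden = [[g * W[k] for g in g_out] for k in range(n_hidden)]
        # S110: add one discriminator per machine, fitted to the negative gradient.
        j = rnd % n_hidden
        col = [h[j] for h in H]
        denom = sum(c * c for c in col)
        w_new = -sum(g * c for g, c in zip(g_out, col)) / denom if denom > 0 else 0.0
        output.append((j, lr * w_new))
        for k, machine in enumerate(hidden):
            stump = fit_stump(xs, [-g for g in g_hidden[k]])
            machine.append(lambda x, s=stump: lr * s(x))
    return hidden, output, errors
```

In this sketch every machine receives exactly one new discriminator per round, so after training each hidden machine and the output machine hold the same number of discriminators.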

FIG. 7 is a flowchart showing step S106, which is an example of inference processing using the boosting machine group. Step S106 includes sub-step S106a, sub-step S106b, sub-step S106c and sub-step S106d.

As shown in FIG. 7, first, the inference unit 124 of the feature quantity extraction device 100 acquires a model and inputs the inference data to the boosting machine (step S106a).

Next, the inference unit 124 determines whether there is an unprocessed boosting machine for which all inputs are available (step S106b).

If it is determined that there is no unprocessed boosting machine for which all inputs are available (step S106b: No), step S106 ends.

When it is determined that there is an unprocessed boosting machine for which all inputs are available (step S106b: Yes), the inference unit 124 calculates the output value of the boosting machine (step S106c). That is, the inference unit 124 calculates the sum of the outputs of the weak discriminators.

Next, the inference unit 124 determines whether the boosting machine has a subsequent connection (step S106d).

When it is determined that the boosting machine does not have a subsequent connection (step S106d: No), the inference unit 124 outputs the inference result, and step S106 ends.

When it is determined that the boosting machine has a subsequent connection (step S106d: Yes), the inference unit 124 executes step S106a again.
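
The inference flow of FIG. 7 can be pictured with the following Python sketch, which repeatedly picks an unprocessed boosting machine whose inputs are all available and computes its output as the sum of its weak discriminators. The function name infer, the dictionary-based representation of machines and connections, and the key "x" for the inference data are hypothetical. The sketch also collects the final outputs once at the end rather than checking for a subsequent connection after every machine, which is a simplification of the flowchart.

```python
def infer(machines, connections, x):
    """Evaluate a graph of boosting machines.

    machines:    name -> list of weak discriminators (callables taking a list
                 of input values and returning a float)
    connections: name -> list of upstream names; "x" denotes the inference data
    Returns (outputs of machines without a subsequent connection, all values).
    """
    values = {"x": x}  # S106a: feed the inference data into the model
    pending = set(machines)
    while True:
        # S106b: look for an unprocessed machine whose inputs are all available.
        ready = [n for n in sorted(pending)
                 if all(u in values for u in connections[n])]
        if not ready:
            break
        name = ready[0]
        inputs = [values[u] for u in connections[name]]
        # S106c: the machine output is the sum of its weak discriminators.
        values[name] = sum(d(inputs) for d in machines[name])
        pending.discard(name)
    # S106d: machines that feed no other machine provide the inference result.
    has_successor = {u for ups in connections.values() for u in ups}
    outputs = {n: values[n] for n in machines
               if n not in has_successor and n in values}
    return outputs, values
```

Because a machine is evaluated only when all of its inputs exist, the same routine covers both simple layered structures and structures with skip connections, provided the connection graph has no cycles.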

[6. Effects]
As described above, the feature quantity extraction device 100 cascades multiple layers, each consisting of boosting machines arranged in parallel. The feature quantity extraction device 100 can thereby generate a multi-layer boosting model.

The feature quantity extraction device 100 can extract feature quantities with high inference performance from the middle layer of the trained multi-layer boosting model. Also, the multi-layer boosting model uses a combination of weak discriminators (i.e., simple models with weak discriminative power) for inference processing. Therefore, the feature quantity extraction device 100 can perform inference processing at a lower computational cost than a neural network.

[7. Others]
Some of the processes described as being performed automatically may be performed manually. Alternatively, all or part of the processes described as being performed manually may be performed automatically in known manner. Furthermore, information including processing procedures, specific names, various data and parameters shown in this specification and drawings may be arbitrarily changed unless otherwise specified. For example, various information shown in each drawing is not limited to the illustrated information.

The illustrated components of the device conceptually indicate the functions of the device. Components are not necessarily physically arranged as shown in the drawings. In other words, the specific form of the distributed or integrated apparatus is not limited to the form of the system and apparatus shown in the figures. All or part of the devices may be functionally or physically distributed or integrated according to various loads and usage conditions.

[8. Hardware configuration]
FIG. 8 is a diagram showing a computer 1000 as an example of a hardware configuration. The systems and methods described herein may be implemented, for example, by the computer 1000.

FIG. 8 shows an example of a computer in which the feature quantity extraction device 100 is implemented by executing a program. The computer 1000 has a memory 1010 and a CPU 1020, for example. The computer 1000 also has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM (Random Access Memory) 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to a mouse 1110 and a keyboard 1120, for example. The video adapter 1060 is connected to a display 1130, for example.

The hard disk drive 1090 stores, for example, an OS 1091, application programs 1092, program modules 1093, and program data 1094. That is, a program that defines each process of the feature quantity extraction device 100 is implemented as a program module 1093 in which code executable by the computer 1000 is described. The program modules 1093 are stored, for example, on the hard disk drive 1090. For example, the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functional configuration of the feature quantity extraction device 100. The hard disk drive 1090 may be replaced by an SSD (Solid State Drive).

The hard disk drive 1090 can store a machine learning program for boosting processing and a feature extraction program for boosting processing. Also, the machine learning program and the feature quantity extraction program can be created as program products. The program product, when executed, performs one or more methods, such as those described above.

Also, the setting data used in the processing of the above-described embodiment is stored as program data 1094 in the memory 1010 or the hard disk drive 1090, for example. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as necessary and executes them.

The program modules 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, program modules 1093 and program data 1094 may be stored in other computers connected through a network (LAN, WAN, etc.). Program modules 1093 and program data 1094 may then be read by CPU 1020 through network interface 1070 from other computers.

[9. Summary of Embodiments]
As described above, the feature quantity extraction device 100 according to the present disclosure includes the acquisition unit 122 and the learning unit 123. In at least one embodiment, the acquisition unit 122 acquires a multi-layer model having multiple layers connected in series. Each layer contains multiple boosting machines. In at least one embodiment, the learning unit 123 uses the multi-layer model acquired by the acquisition unit 122 as a machine learning model and executes machine learning such that at least one of the input layer and the intermediate layer of the multi-layer model extracts features from given data.

In some embodiments, as execution of the machine learning, the learning unit 123 updates the multi-layer model by propagating information about the gradients of the multiple boosting machines included in the output layer of the multi-layer model from the output layer toward the input layer of the multi-layer model.

In some embodiments, as execution of the machine learning, the learning unit 123 sets an initial output value for each boosting machine and, based on the set initial output values, adds an initial discriminator to each boosting machine.

As described above, the feature quantity extraction device 100 according to the present disclosure includes the inference unit 124. In at least one embodiment, the inference unit 124 obtains a trained multi-layer model having multiple layers connected in series. Each layer contains multiple boosting machines. The inference unit 124 extracts features from data by applying the data to the trained multi-layer model.
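
As a small illustration of reading features from an input or intermediate layer, the following Python sketch evaluates a layered model only up to a chosen layer and returns that layer's outputs as the feature vector. The function name extract_features, the parameter upto, and the nested-list representation of the model (layers, then machines, then weak discriminators) are assumptions for illustration only.

```python
def extract_features(layers, x, upto):
    """Return the outputs of layer `upto` (0-based) as a feature vector.

    layers: list of layers; each layer is a list of boosting machines; each
            machine is a list of weak discriminators that take the previous
            layer's output vector and return a float.
    """
    activations = list(x)
    for layer in layers[: upto + 1]:
        # Each machine outputs the sum of its weak discriminators.
        activations = [sum(d(activations) for d in machine) for machine in layer]
    return activations


# Toy model: a first layer with two machines and an output layer with one.
layers = [
    [[lambda v: v[0] + v[1]], [lambda v: v[0] - v[1], lambda v: 0.5]],
    [[lambda v: v[0] * v[1]]],
]
features = extract_features(layers, [2.0, 1.0], upto=0)  # first-layer outputs
```

Stopping at a layer before the output layer means the later layers are never evaluated, so feature extraction in this sketch costs less than a full inference pass.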

While various embodiments have been described in detail herein with reference to the drawings, these embodiments are examples and are not intended to limit the invention to these embodiments. The features described herein can be implemented in various ways, including various modifications and improvements based on the knowledge of those skilled in the art.

Also, each "unit" described above (also expressed as a module, or with an -er or -or suffix) can be read as a means, a circuit, or the like. For example, a communication module, a control module, and a storage module can be read as a communication unit, a control unit, and a storage unit, respectively.

1 environment
100 feature extraction device
110 communication unit
120 control unit
121 reception unit
122 acquisition unit
123 learning unit
124 inference unit
125 provision unit
130 storage unit
200 network
300 user device

Claims

1. A machine learning device comprising:
a multi-layer model having multiple layers connected in series, each layer including multiple boosting machines; and
a machine learning execution unit that performs machine learning using the multi-layer model as a machine learning model so that at least one of an input layer or an intermediate layer of the multi-layer model extracts features from given data.

2. The machine learning device according to claim 1, wherein, as the execution of the machine learning, the machine learning execution unit updates the multi-layer model by propagating information about gradients of a plurality of boosting machines included in the output layer of the multi-layer model from the output layer of the multi-layer model to the input layer of the multi-layer model.

3. The machine learning device according to claim 1 or 2, wherein, as the execution of the machine learning, the machine learning execution unit sets an initial output value for each boosting machine and adds an initial classifier to each boosting machine based on the set multiple initial output values.

4. A feature quantity extraction device comprising:
a trained multi-layer model having multiple layers connected in series, each layer including multiple boosting machines; and
an extraction unit that extracts a feature quantity from data by applying the data to the trained multi-layer model.

5. A computer-implemented machine learning method comprising:
a multi-layer model having multiple layers connected in series, each layer including multiple boosting machines; and
a machine learning execution unit that performs machine learning using the multi-layer model as a machine learning model so that at least one of an input layer or an intermediate layer of the multi-layer model extracts features from given data.

6. A computer-executed feature quantity extraction method comprising:
obtaining a trained multi-layer model having multiple layers connected in series, each layer including multiple boosting machines; and
an extraction step of extracting a feature quantity from data by applying the data to the trained multi-layer model.

7. A machine learning program for causing a computer to function as the machine learning device according to any one of claims 1 to 3.

8. A feature quantity extraction program for causing a computer to execute, with respect to a trained multi-layer model having multiple layers connected in series, each layer including multiple boosting machines, an extraction procedure for extracting a feature quantity from data by applying the data to the trained multi-layer model.