EP4309079A1 - Transfer learning between neural networks - Google Patents

Transfer learning between neural networks

Info

Publication number
EP4309079A1
Authority
EP
European Patent Office
Prior art keywords
neural network
logical
source
computing device
target
Prior art date
Legal status
Pending
Application number
EP21716362.5A
Other languages
German (de)
French (fr)
Inventor
Henrique Koji MIYAMOTO
Apostolos Destounis
Jean-Claude Belfiore
Ingmar LAND
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods


Abstract

Various embodiments relate to transfer learning associated with neural networks. A semantic analysis of a source neural network may be performed, and logical behaviour data of the source neural network may be extracted based on the semantic analysis. The logical behaviour data may then be transmitted to a target neural network. The logical behaviour data may be used in pre-training the target neural network. Devices, methods, and computer programs are disclosed.

Description

TRANSFER LEARNING BETWEEN NEURAL NETWORKS
TECHNICAL FIELD
The disclosure relates to a method, and more particularly to a method for transfer learning associated with neural networks. Furthermore, the disclosure relates to a corresponding computing device and a computer program.
BACKGROUND
Deep Neural Networks (DNNs) are computing systems inspired by biological neural networks that constitute biological brains. DNNs can be trained to perform tasks by considering examples, generally without being programmed with any task-specific rules. For example, in image recognition, DNNs may be trained to identify images that contain cars by analysing example images that have been manually labelled as “car” or “no car” and using the results to identify cars in other images. DNNs do this without any prior knowledge about cars. Instead, they automatically generate identifying features from the learning material that they process.
Transfer learning may also be applied with DNNs. In transfer learning, learning acquired in a first neural network can be transferred to a second neural network.
SUMMARY
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
It is an objective to provide a device and a method for transfer learning associated with neural networks. One or more of the objectives is achieved by the features of the independent claims. Further implementation forms are provided in the dependent claims, the description and the figures.
According to a first aspect, a computing device is configured to initialize a source neural network; train the source neural network with training data of the source neural network; perform a semantic analysis of the source neural network; extract logical behaviour data of the source neural network based on the semantic analysis; and cause a transmission of the logical behaviour data. The solution may, for example, significantly reduce the number of bits transferred from the source neural network to the target neural network.

In an implementation form of the first aspect, the logical behaviour data comprises a logical table. The solution may enable, for example, an efficient data structure for the logical behaviour data.
In a further implementation form of the first aspect, the computing device is further configured to semantically analyse neurons of the source neural network; and store, based on the semantic analysis, logical propositions corresponding to outputs of at least some of the neurons into the logical table. The solution may enable, for example, an efficient analysis of the neurons.
In a further implementation form of the first aspect, the computing device is further configured to encode the logical propositions into binary vectors; and cause a transmission of the binary vectors. The solution may enable, for example, optimising the amount of data that needs to be transferred to the target neural network.
According to a second aspect, a computing device is configured to receive logical behaviour data associated with a source neural network, the logical behaviour data being based on a semantic analysis of the source neural network; pre-train a target neural network with the logical behaviour data associated with the source neural network; and train the target neural network with training data of the target neural network. The solution may enable, for example, fast learning speed and improved final accuracy of the target neural network.
In an implementation form of the second aspect, the logical behaviour data comprises a logical table. The solution may enable, for example, an efficient data structure for the logical behaviour data.
In an implementation form of the second aspect, the logical table comprises logical propositions corresponding to outputs of at least some of the neurons of the source neural network, the logical propositions being based on a semantic analysis of the neurons of the source neural network. The solution may enable, for example, an efficient analysis of the neurons.
In an implementation form of the second aspect, the computing device is further configured to compute an inverse logical table based on the received logical table, the inverse logical table being indicative of the desired logical behaviour for each neuron in the target neural network, wherein the computing device is configured to pre-train the target neural network by using a cost function that takes into account the outputs of the neurons of the target neural network and penalises deviations from the desired outputs indicated by the inverse logical table. The solution may enable, for example, increased efficiency.

In an implementation form of the second aspect, the computing device is further configured to pre-train the target neural network successively layer by layer. The solution may enable, for example, increased efficiency.
In an implementation form of the second aspect, the logical table comprises the logical propositions encoded into binary vectors. The solution may enable, for example, optimising the amount of data that needs to be transferred to the target neural network.
In an implementation form of the second aspect, the computing device is configured to pre-train the target neural network with a limited set of the training data of the target neural network and a limited number of epochs. The solution may enable, for example, increased efficiency.
According to a third aspect, a method comprises initializing a source neural network; training the source neural network with training data of the source neural network; performing a semantic analysis of the source neural network; extracting logical behaviour data of the source neural network based on the semantic analysis; and causing a transmission of the logical behaviour data. The solution may, for example, significantly reduce the number of bits transferred from the source neural network to the target neural network.
In an implementation form of the third aspect, the logical behaviour data comprises a logical table. The solution may enable, for example, an efficient data structure for the logical behaviour data.
In a further implementation form of the third aspect, the method further comprises semantically analysing neurons of the source neural network; and storing, based on the semantic analysis, logical propositions corresponding to outputs of at least some of the neurons into the logical table. The solution may enable, for example, an efficient analysis of the neurons.
In a further implementation form of the third aspect, the method further comprises encoding the logical propositions into binary vectors; and causing a transmission of the binary vectors. The solution may enable, for example, optimising the amount of data that needs to be transferred to the target neural network.
According to a fourth aspect, a method comprises receiving logical behaviour data associated with a source neural network, the logical behaviour data being based on a semantic analysis of the source neural network; pre-training a target neural network with the logical behaviour data associated with the source neural network; and training the target neural network with training data of the target neural network. The solution may enable, for example, fast learning speed and improved final accuracy of the target neural network.

In an implementation form of the fourth aspect, the logical behaviour data comprises a logical table. The solution may enable, for example, an efficient data structure for the logical behaviour data.
In an implementation form of the fourth aspect, the logical table comprises logical propositions corresponding to outputs of at least some of the neurons of the source neural network, the logical propositions being based on a semantic analysis of the neurons of the source neural network. The solution may enable, for example, an efficient analysis of the neurons.
In an implementation form of the fourth aspect, the method further comprises computing an inverse logical table based on the received logical table, the inverse logical table being indicative of the desired logical behaviour for each neuron in the target neural network; and pre-training the target neural network by using a cost function that takes into account the outputs of the neurons of the target neural network and penalises deviations from the desired outputs indicated by the inverse logical table. The solution may enable, for example, increased efficiency.
In an implementation form of the fourth aspect, the method further comprises pre-training the target neural network successively layer by layer. The solution may enable, for example, increased efficiency.
In an implementation form of the fourth aspect, the logical table comprises the logical propositions encoded into binary vectors. The solution may enable, for example, optimising the amount of data that needs to be transferred to the target neural network.
In an implementation form of the fourth aspect, the method further comprises pre-training the target neural network with a limited set of the training data of the target neural network and a limited number of epochs. The solution may enable, for example, increased efficiency.
According to a fifth aspect, a computer program is provided, comprising program code configured to perform a method according to the third aspect when the computer program is executed on a computer.
According to a sixth aspect, a computer program is provided, comprising program code configured to perform a method according to the fourth aspect when the computer program is executed on a computer.
According to a seventh aspect, a telecommunication device is provided, comprising the computing device of the first aspect.

According to an eighth aspect, a telecommunication device is provided, comprising the computing device of the second aspect.
According to a ninth aspect, a computing device is provided, comprising means for initializing a source neural network; means for training the source neural network with source neural network training data; means for performing a semantic analysis of the source neural network; means for extracting logical behaviour data of the source neural network based on the semantic analysis; and means for causing a transmission of the logical behaviour data. The solution may, for example, significantly reduce the number of bits transferred from the source neural network to the target neural network.
According to a tenth aspect, a computing device is provided, which comprises means for receiving logical behaviour data associated with a source neural network, the logical behaviour data being obtained based on a semantic analysis of the source neural network; means for pre-training a target neural network with the logical behaviour data associated with the source neural network; and means for training the target neural network with target neural network training data. The solution may enable, for example, fast learning speed and improved final accuracy of the target neural network.
Many of the attendant features will be more readily appreciated as they become better understood by reference to the following detailed description considered in connection with the accompanying drawings.
DESCRIPTION OF THE DRAWINGS
The present description will be better understood from the following detailed description read in light of the accompanying drawings, wherein:
FIG. 1 illustrates a schematic representation of a computing device according to an embodiment;
FIG. 2 illustrates a schematic representation of a computing device according to an embodiment;
FIG. 3 illustrates a flow chart representation of a method according to an embodiment;
FIG. 4 illustrates a flow chart representation of a method according to an embodiment;
FIG. 5 illustrates a schematic representation of a deep neural network according to an embodiment;
FIG. 6A illustrates the problem of identifying the shape of a signal according to an embodiment;
FIG. 6B illustrates the problem of identifying the shape of a signal according to an embodiment;
FIGS. 7A-7C illustrate an example result of an analysis of a neuron according to an embodiment;
FIG. 8 illustrates an example of a logical table according to an embodiment;
FIG. 9 illustrates a numerical example of transferred bits according to an embodiment; and
FIG. 10 illustrates a performance example of the semantic transfer according to an embodiment.
Like references are used to designate like parts in the accompanying drawings.
DETAILED DESCRIPTION
The detailed description provided below in connection with the appended drawings is intended as a description of the embodiments and is not intended to represent the only forms in which the embodiments may be constructed or utilized. However, the same or equivalent functions and structures may be accomplished by different embodiments.
Fig. 1 illustrates a schematic representation of a computing device 100 according to an embodiment.
According to an embodiment, the computing device 100 is configured to initialize a source neural network.
The computing device 100 may be further configured to train the source neural network with training data of the source neural network.
The computing device 100 may be further configured to perform a semantic analysis of the source neural network.
The computing device 100 may be further configured to extract logical behaviour data of the source neural network based on the semantic analysis.
The computing device 100 may be further configured to cause a transmission of the logical behaviour data.
The computing device 100 may comprise a processor 102. The computing device 100 may further comprise a memory 104.
In some embodiments, at least some parts of the computing device 100 may be implemented as a system on a chip (SoC). For example, the processor 102, the memory 104, and/or other components of the computing device 100 may be implemented using a field-programmable gate array (FPGA). Components of the computing device 100, such as the processor 102 and the memory 104, may not be discrete components. For example, if the device 100 is implemented using a SoC, the components may correspond to different units of the SoC.
The processor 102 may comprise, for example, one or more of various processing devices, such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing circuitry with or without an accompanying DSP, or various other processing devices including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
The memory 104 may be configured to store, for example, computer programs and the like. The memory 104 may include one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices. For example, the memory 104 may be embodied as magnetic storage devices (such as hard disk drives, floppy disks, magnetic tapes, etc.), optical magnetic storage devices, and semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.).
Functionality described herein may be implemented via the various components of the computing device 100. For example, the memory 104 may comprise program code for performing any functionality disclosed herein, and the processor 102 may be configured to perform the functionality according to the program code comprised in the memory 104.
When the computing device 100 is configured to implement some functionality, some component and/or components of the computing device 100, such as the one or more processors 102 and/or the memory 104, may be configured to implement this functionality. Furthermore, when the one or more processors 102 are configured to implement some functionality, this functionality may be implemented using program code comprised, for example, in the memory 104. For example, if the computing device 100 is configured to perform an operation, the one or more memories 104 and the computer program code can be configured to, with the one or more processors 102, cause the computing device 100 to perform that operation.
According to an embodiment, a telecommunication device comprises the computing device 100.
Fig. 2 illustrates a schematic representation of a computing device 200 according to an embodiment. According to an embodiment, the computing device 200 is configured to receive logical behaviour data associated with a source neural network, the logical behaviour data being based on a semantic analysis of the source neural network.
The computing device 200 may be further configured to pre-train a target neural network with the logical behaviour data associated with the source neural network.
The computing device 200 may be further configured to train the target neural network with training data of the target neural network.
The computing device 200 may comprise a processor 202. The computing device 200 may further comprise a memory 204.
In some embodiments, at least some parts of the computing device 200 may be implemented as a system on a chip (SoC). For example, the processor 202, the memory 204, and/or other components of the computing device 200 may be implemented using a field-programmable gate array (FPGA).
Components of the computing device 200, such as the processor 202 and the memory 204, may not be discrete components. For example, if the device 200 is implemented using a SoC, the components may correspond to different units of the SoC.
The processor 202 may comprise, for example, one or more of various processing devices, such as a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing circuitry with or without an accompanying DSP, or various other processing devices including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
The memory 204 may be configured to store, for example, computer programs and the like. The memory 204 may include one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination of one or more volatile memory devices and non-volatile memory devices. For example, the memory 204 may be embodied as magnetic storage devices (such as hard disk drives, floppy disks, magnetic tapes, etc.), optical magnetic storage devices, and semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.).
Functionality described herein may be implemented via the various components of the computing device 200. For example, the memory 204 may comprise program code for performing any functionality disclosed herein, and the processor 202 may be configured to perform the functionality according to the program code comprised in the memory 204.

When the computing device 200 is configured to implement some functionality, some component and/or components of the computing device 200, such as the one or more processors 202 and/or the memory 204, may be configured to implement this functionality. Furthermore, when the one or more processors 202 are configured to implement some functionality, this functionality may be implemented using program code comprised, for example, in the memory 204. For example, if the computing device 200 is configured to perform an operation, the one or more memories 204 and the computer program code can be configured to, with the one or more processors 202, cause the computing device 200 to perform that operation.
According to an embodiment, a telecommunication device comprises the computing device 200.
Fig. 3 illustrates a flow chart representation of a method 300 according to an embodiment.
According to an embodiment, the method 300 comprises initializing 302 a source neural network.
The method 300 may further comprise training 304 the source neural network with training data of the source neural network.
The method 300 may further comprise performing 306 a semantic analysis of the source neural network.
The method 300 may further comprise extracting 308 logical behaviour data of the source neural network based on the semantic analysis. The logical behaviour data may comprise, for example, a logical table. In an example embodiment, the method 300 may further comprise semantically analysing neurons of the source neural network and storing, based on the semantic analysis, logical propositions corresponding to outputs of at least some of the neurons into the logical table.
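By way of a non-limiting illustration, the extraction 308 of a logical table may be sketched as follows. The activation threshold, the majority rule, and all names in this sketch are assumptions made for the example only; the disclosure does not prescribe any particular rule for deriving the logical propositions.

```python
def extract_logical_table(activations, labels, classes, threshold=0.5):
    # activations[i][k]: output of neuron k for sample i.
    # labels[i]: class label of sample i.
    # Returns one binary vector per neuron: entry m is 1 if the neuron
    # is active (output above threshold) for most samples of class m.
    n_neurons = len(activations[0])
    table = []
    for k in range(n_neurons):
        row = []
        for c in classes:
            idx = [i for i, y in enumerate(labels) if y == c]
            active = sum(1 for i in idx if activations[i][k] > threshold)
            row.append(1 if idx and 2 * active > len(idx) else 0)
        table.append(row)
    return table

# Toy example: 2 neurons, 4 samples, 2 classes.
acts = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.9]]
labels = ["A", "A", "B", "B"]
table = extract_logical_table(acts, labels, ["A", "B"])
```

In this toy example the first neuron is active only for class "A" and the second only for class "B", so the logical table is [[1, 0], [0, 1]]: a few bits per neuron instead of full-precision weights.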
The method 300 may further comprise causing 310 a transmission of the logical behaviour data. A recipient of the logical behaviour data may use the logical behaviour data for training a target neural network.
The method 300 may be performed, for example, by the computing device 100.
At least some operations of the method 300 may be performed by a computer program product when executed on a computer.
Fig. 4 illustrates a flow chart representation of a method 400 according to an embodiment.
According to an embodiment, the method 400 comprises receiving 402 logical behaviour data associated with a source neural network, the logical behaviour data being based on a semantic analysis of the source neural network. The logical behaviour data may comprise, for example, a logical table. The logical table may comprise logical propositions corresponding to outputs of at least some of the neurons of the source neural network, the logical propositions being obtained based on a semantic analysis of the neurons of the source neural network. The logical propositions may be encoded into binary vectors.
The method 400 may further comprise pre-training 404 a target neural network with the logical behaviour data associated with the source neural network. In an example embodiment, the method 400 may further comprise computing an inverse logical table based on the received logical table, the inverse logical table being indicative of the desired logical behaviour for each neuron in the target neural network, and pre-training the target neural network by using a cost function that takes into account the outputs of the neurons of the target neural network and penalises deviations from desired outputs indicated by the inverted logical table. In an example embodiment, the inverse logical table may describe the desired logical behaviour for some neurons in the target neural network if the logical table only comprises entries for some neurons and not all neurons in the source network. The target neural network may be pre-trained successively layer by layer. Further, the target neural network may be pre-trained with a limited set of the training data of target neural network and a limited number of epochs.
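As a non-limiting sketch of the pre-training 404, the inverse logical table and the deviation-penalising cost function may be realised as follows. The table layout and the squared-deviation cost are illustrative assumptions; the disclosure only requires that the cost penalises deviations from the desired outputs.

```python
def invert_logical_table(table, classes):
    # table[k][m] = 1 if neuron k should be "on" for class m.
    # Inverse view: for each class, the desired output of every neuron.
    return {c: [row[m] for row in table] for m, c in enumerate(classes)}

def pretraining_cost(outputs, label, inverse_table):
    # Penalise the squared deviation of each neuron's output from the
    # desired logical behaviour for the sample's class.
    desired = inverse_table[label]
    return sum((o - d) ** 2 for o, d in zip(outputs, desired)) / len(desired)

table = [[1, 0], [0, 1]]                       # 2 neurons, 2 classes
inv = invert_logical_table(table, ["A", "B"])
cost = pretraining_cost([0.9, 0.2], "A", inv)  # desired outputs: [1, 0]
```

Minimising such a cost drives each neuron of the target network towards the logical behaviour observed in the source network before the regular training 406 starts.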
The method 400 may further comprise training 406 the target neural network with training data of the target neural network.
The method 400 may be performed, for example, by the computing device 200.
At least some operations of the method 400 may be performed by a computer program product when executed on a computer.
Fig. 5 illustrates a schematic representation of a deep neural network 500 according to an embodiment.
A deep neural network (DNN) 500 may be based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Typically, artificial neurons are aggregated into layers 502, 504, 506, 508 where different layers may perform different kinds of transformations on their inputs. The layer 502 may be called the input layer, the layers 504, 506 may be called hidden layers, and the layer 508 may be called the output layer. The connections between artificial neurons have weights that are adjusted as learning proceeds. The weight increases or decreases the strength of the signal at a connection.
DNNs are efficient tools to solve various classification tasks. As an example, an image may be presented to the input layer 502, one value per input neuron. Each neuron in the network 500 then may compute a function (a nonlinear function of an affine transformation) of the input values and may forward the result to the following layer. The functions are parameterised, and these parameters are to be optimised. The output layer 508 may consist of one neuron per class, and the classification result corresponds to the class with the largest output value.
DNNs are trained using a training data set, consisting of data labelled with their corresponding classes. An optimisation algorithm, for example, stochastic gradient descent, may be used to find the network parameters (weights and biases for the affine transformation of the outputs of a layer to be used in the activation function of the next layer) that minimise a cost function that assigns a large cost to wrongly classified data points and a small cost to correctly classified data points. The actual classification accuracy of the DNN in terms of the percentage of correctly classified data points is determined using a second labelled data set, the validation data set.
In order to apply learning achieved at a source side at a target side, transfer learning may be used. In transfer learning, learning acquired in one neural network, i.e. a source network, can be transferred to a second neural network, i.e. a target network. Reasons for transfer learning may comprise, for example, that the source and target tasks are similar and learning of the target network is expensive (for example, in terms of time or computational cost), or that the target network has access to insufficient training data and requires initialisation provided by the trained source network.
In the following, as an example, a single-class task is used, where each data point is associated with exactly one class, as well as fully connected DNNs, where every neuron is connected to all neurons of the previous layer.
Let us assume a single-class task with M classes, y1, y2, ..., yM, and denote the set of classes by

Y = {y1, y2, ..., yM}.
The training and validation data sets consist of pairs of data points x_i and labels y_i. Each data point is assumed to be a vector of N real values, x = (x1, x2, ..., xN).
A fully connected DNN with L layers is also assumed, where l = 0 refers to the input layer 502 and l = L to the output layer 508. Each layer l has N_l neurons. The input layer consists of N_0 = N neurons, one for each input value. Similarly, the output layer 508 consists of N_L = M neurons, one for each class. The network size is characterised symbolically as N - N_1 - ... - N_{L-1} - M.
Every neuron in the hidden layers 504, 506 and the output layer 508 combines its input values into output values (activations). For neuron k in layer l, this function may be, for example,

a_k^l = s( sum_j w_{k,j}^l * a_j^{l-1} + b_k^l ),

where w_{k,j}^l are called the weights and b_k^l the biases, and

s(z) = 1 / (1 + e^{-z})

is a non-linear function called the sigmoid function. Weights and biases are referred to as the network parameters, and they are to be optimised during training.
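The layer computation above may be sketched as follows. The layer sizes, weights, and biases are arbitrary example values, not values from any embodiment.

```python
import math

def sigmoid(z):
    # s(z) = 1 / (1 + e^(-z)): squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(a_prev, weights, biases):
    # Fully connected layer: neuron k outputs
    # s(sum_j weights[k][j] * a_prev[j] + biases[k]).
    return [
        sigmoid(sum(w * a for w, a in zip(w_k, a_prev)) + b_k)
        for w_k, b_k in zip(weights, biases)
    ]

# Tiny illustrative layer: 3 inputs, 2 neurons.
a_prev = [1.0, 0.5, -1.0]
W = [[0.2, -0.4, 0.1], [1.0, 1.0, 1.0]]
b = [0.0, -0.5]
a = layer_forward(a_prev, W, b)
```

Stacking such layers, with the output layer read out via the largest activation, gives the classification behaviour described for the DNN 500.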
For simplicity, it is assumed that the source problem and the target problem are single-class tasks on the same set of classes Y. The target problem, however, may be more difficult than the source problem. This may mean, for example, that the data at the source and the data at the target may be different in a way such that the classes in the target problem are harder to classify correctly than in the source problem. It is also assumed that the source and the target DNN have identical structures and that training data is available for both the source and the target DNN. In other embodiments, the structures may differ from each other.
In transfer learning, the goal is to train first the source network, then transfer data carrying the learning to the target network, and finally train the target network. In the following, an example approach is illustrated, the approach comprising six steps:
1. Initialise source network,
2. Train source network with source training data,
3. Extract network data from source network,
4. Transfer network data to target network,
5. Initialise target network with transferred network data,
6. Train target network with target training data.
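The six steps above can be sketched as a single pipeline. All function names here are hypothetical placeholders for the operations in the list; initialisation of the networks is assumed to happen inside the supplied callables.

```python
def transfer_learning(source_net, target_net, source_data, target_data,
                      train, extract, pretrain):
    # Hypothetical outline of the six-step transfer.
    train(source_net, source_data)    # steps 1-2: initialise and train source
    data = extract(source_net)        # step 3: extract network data
    # step 4: only "data" is transmitted to the target side
    pretrain(target_net, data)        # step 5: initialise target with the data
    train(target_net, target_data)    # step 6: regular target training
    return target_net

# Record the call order with stub callables.
log = []
transfer_learning(
    "source", "target", "src-data", "tgt-data",
    train=lambda net, d: log.append(("train", net)),
    extract=lambda net: (log.append(("extract", net)), "table")[1],
    pretrain=lambda net, t: log.append(("pretrain", net, t)),
)
```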
The main challenge relates to the fourth step, i.e. the transfer of network data. For example, the source and target networks may not be co-located, and the data then needs to be transmitted over a wireless link, where network resources and transmit power may be costly.
One possible solution for transfer learning is to transmit all weights and biases from the source network to the target network. Weights and biases are real values that require high precision. Further, the amount of data to be transferred is proportional to the network size, particularly to the number of connections in the network. This approach, however, has disadvantages. First, the amount of data is very large. Assuming a network of size 100-40-20-10-6 (input length 100, hidden layers of 40, 20 and 10 neurons, 6 classes at the output), to be used in a small example later on, and a precision of 16 bits for each weight value, the data to be transferred amounts to about 80000 bits. Second, the functionality represented by the weights is specific to the source problem. Source and target data, however, may be slightly different, and it would be preferable to transfer data that is more related to the classes than to the source data.
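The 80000-bit estimate can be reproduced from the network size (a sketch; biases are left out, as in the estimate above):

```python
layer_sizes = [100, 40, 20, 10, 6]

# One weight per connection between consecutive layers
num_weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
bits_to_transfer = num_weights * 16  # 16 bits of precision per weight

print(num_weights, bits_to_transfer)  # 5060 weights, 80960 bits
```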
The data to be transferred to the target network may be compressed by transferring only parts of the source network, for example, by quantising the weights to a few bits only, or by applying other methods for model compression. However, the transfer problem still applies, and the order of magnitude of the data to be transmitted remains the same.
Referring back to the solution discussed above in relation to FIGS. 3 and 4, a semantic transfer learning solution is illustrated. After training the source network, a semantic analysis is performed and the logical behaviour data of the neurons is extracted. The logical behaviour data is then transferred from the source network to the target network. In the target network, a logical pre-training using the logical behaviour data is performed, followed by a regular training.
The logical behaviour data may carry semantic information about the classification in the source network, and it may provide information about the source network functionality to the target network.
Using an example network having a size of 100-40-20-10-6, only about 900 bits are required for the logical behaviour data. As compared to 80000 bits in the conventional approach, the decrease in the amount of data to be transmitted to the target network is significant. The illustrated solution may be applied whenever two or more neural networks are exchanging information about the functionality they are implementing. The solution may also be used in an iterative fashion between two or more networks to enable collaborative learning while exchanging only small amounts of information.
FIGS. 6A and 6B illustrate the problem of identifying the shape of a signal according to an embodiment. In this example embodiment, two single-class classification tasks, the source and the target task, and two fully connected DNNs, the source and the target network, are assumed.
The signals are a rectangle, a triangle, and a half-circle, where the shapes may be in different positions and may have different heights. In this example, the source network is given a simpler problem, where the signals are wider (FIG. 6A), and the target network is given a harder problem, where the signals are narrower and thus the shapes are more difficult to distinguish (FIG. 6B). The positions and heights of the shapes may differ within the data set.
The three different shapes of signals represent three classes of the classification task. The set of classes is given by y = {A, B, C} , where A refers to the rectangle, B refers to the triangle, and C refers to the half-circle.
The source network is first initialized and trained using source training data. Any method applicable to achieving this can be used.
The source network may be fed with some or all of the training data again, and neuron outputs (i.e. activations) of at least some neurons are analysed. In this example it is assumed the neuron outputs of all neurons are analysed.
FIGS. 7A-7C illustrate an example result of an analysis of a neuron according to an embodiment. For each class A, B, C, the frequency of each output value is depicted. In the following, an example explains how the output of the neuron can be associated with semantics by assigning logical propositions to the outputs.
In the following, the neuron output (activation) is denoted by a. To simplify the notation, the indices l for the layer and k for the neuron within the layer are omitted. It is also assumed that the same analysis applies to all neurons in the network. A quantised neuron output is denoted by â, where â = 0 if a < ½ and â = 1 if a ≥ ½.
The following logical propositions are defined:
A: “the data point belongs to class A”
B: “the data point belongs to class B”
C: “the data point belongs to class C”
It is noted that A denotes both the class and the corresponding proposition, and the meaning is clear from the context. The same applies for B and C. In an example embodiment, these basic propositions may be combined using logical operations, particularly negation ¬, conjunction ∧, disjunction ∨, and implication →, to form new logical propositions. Further, as another example, also other logical systems, like predicate logic, may be applied.
Using the examples illustrated in FIGS. 7A-7C, logical propositions can be associated to the quantised activation values: If â = 0, then the proposition A is true. If â = 1, then the proposition B ∨ C is true. The first statement is correct if activation values smaller than ½ are produced by data points belonging to class A and are not produced by data points belonging to classes B and C. In FIGS. 7A-7C this is approximately true, up to only a few data points. In an example embodiment, a tolerance value ε may be introduced, and the logical propositions are said to be ε-true if the propositions hold for a relative amount of (1 − ε) of the training data set. In the following, they can still be called “true” for simplicity.
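One way to implement this analysis for a single neuron is sketched below (an illustration only; the greedy search for the smallest disjunction covering a (1 − ε) fraction of the data points is one plausible reading of the ε-true criterion, not the patent's prescribed algorithm):

```python
def analyse_neuron(activations, labels, classes, eps=0.05):
    # Quantise each activation: 0 if below 1/2, else 1
    quantised = [0 if a < 0.5 else 1 for a in activations]
    propositions = {}
    for q in (0, 1):
        # Count, per class, how many data points produce this quantised value
        counts = {c: 0 for c in classes}
        total = 0
        for qv, y in zip(quantised, labels):
            if qv == q:
                counts[y] += 1
                total += 1
        if total == 0:
            propositions[q] = set()
            continue
        # Greedily keep the smallest disjunction of classes that covers
        # at least a (1 - eps) fraction of these data points ("eps-true")
        ordered = sorted(classes, key=lambda c: -counts[c])
        covered, chosen = 0, set()
        for c in ordered:
            if covered >= (1 - eps) * total:
                break
            chosen.add(c)
            covered += counts[c]
        propositions[q] = chosen
    return propositions
```

With activations clustered as in FIGS. 7A-7C, this recovers {0: {A}, 1: {B, C}} for the example neuron.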
Using a similar approach, all neurons of the source network are semantically analysed. In an example embodiment, the logical propositions corresponding to each neuron output may be written into a logical table.
In an example embodiment, the logical characterisation may be more general. In a multi-class task, the network may learn logical propositions that are not labels of the data points but rather relations, like A → B. In another example embodiment, a network may identify animals in images, and a neuron may come up with the proposition “if mouse then cat”, even though the image labels are only conjunctions.
FIG. 8 illustrates an example of a logical table according to an embodiment. The logical table illustrated in FIG. 8 has five layers (layer 0, layer 1, layer 2, layer 3, layer 4). The table uses four classes A, B, C and D, and in addition the notation T = (A or B or C or D) (always true) and F (always false). Expressed in a more general form, there may be L + 1 layers with Nl neurons in layer l, l = 0, 1, ..., L, and thus an overall of Ntot = N0 + N1 + ... + NL neurons.
Correspondingly, the logical table may have Ntot rows (one row for each neuron) with two entries each, one corresponding to the proposition for the activation being less than ½ and one for the activation being greater than ½. For the single-class task used as an example, it is sufficient to consider propositions that are disjunctions of basic propositions, like A, A ∨ C, A ∨ B ∨ C, etc. In an example embodiment, these may be encoded in binary vectors of length M, where M is the number of classes: a zero indicates that a class is not present in the disjunction and a one indicates that the class is present. In the simplified example discussed above relating to FIG. 5, there are M = 3 classes. The proposition A may then be encoded as 100, and the proposition B ∨ C may be encoded as 011. The size of this logical table is therefore 2·M·Ntot bits.
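The binary encoding and the table-size formula can be sketched as follows (an illustration; counting two M-bit entries per analysed neuron reproduces the 912 bits of the later example when only the 76 hidden and output neurons of the 100-40-20-10-6 network are analysed — an assumption about which neurons the table covers):

```python
def encode_proposition(proposition, classes):
    # One bit per class: 1 if the class occurs in the disjunction, else 0
    return ''.join('1' if c in proposition else '0' for c in classes)

def table_size_bits(analysed_neurons_per_layer, num_classes):
    # Two entries per neuron (a < 1/2 and a >= 1/2), M bits each
    return 2 * num_classes * sum(analysed_neurons_per_layer)

classes = ['A', 'B', 'C']
print(encode_proposition({'A'}, classes))       # 100
print(encode_proposition({'B', 'C'}, classes))  # 011
print(table_size_bits([40, 20, 10, 6], 6))      # 912
```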
The general logical table (or the encoded version of it) may then be transferred from the source network to the target network. When a logical table, i.e. logical behaviour data, relating to the source network is used, the logical table carries semantic information about the classification in the source network. The logical table represents semantically how the source network understands the data. The logical table has a very compact representation, and it provides information about the source network functionality to the target network. Taking an example network of size 100-40-20-10-6, only about 900 bits are required for the logical behaviour data. As compared to 80000 bits in the conventional approach, the decrease in the amount of data to be transmitted to the target network is significant.
In an example embodiment, at the target network side, the logical table represents semantically how the source network understands the data. A pre-training is configured to be performed at the target network side based on the obtained logical behaviour data associated with the source network.
In an example embodiment, an inverse logical table may be computed based on the received logical table, the inverse logical table describing the desired logical behaviour for each neuron in the target neural network. While the logical table associates propositions with the activation, the inverse logical table associates the desired activation (according to the learning of the source network) with the proposition, as given by the class of the data point.
Referring back to FIGS. 7A-7C, the same three-class example is continued here. From the logical table, the following statements apply for the neuron:
If â = 0, then the proposition A is true.
If â = 1, then the proposition B ∨ C is true.
Inverting this gives the following desired behaviour for this specific neuron:
If proposition A is true, then ā = ½.
If proposition B is true, then ā = 1.
If proposition C is true, then ā = 1.
Here, ā corresponds to the quantised neuron output and denotes the desired target value. Mathematically, this can be written as:

ā(y) = ½ for y = A, 1 for y = B, 1 for y = C.
In this example, not only are B and C required to lead to ā = 1, but A is also required to lead to a neutral value, i.e. to ā = ½. The values of ā are the target values for this specific neuron in the pre-training.
All other neurons may be processed similarly, and the end result will be an inverted logical table, i.e. āk,l(y) is determined for every neuron k in every layer l. The inverted logical table describes the desired logical behaviour for each neuron in the target network.
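A sketch of the inversion for a single neuron, following the worked example above (classes appearing in the value-1 proposition get target 1, all other classes the neutral value ½ — one plausible reading of the inversion rule, matching the three-class example):

```python
def invert_neuron_entry(entry, classes):
    # entry: {0: proposition set for a_hat = 0, 1: proposition set for a_hat = 1}
    # Classes in the value-1 proposition get desired activation 1;
    # all remaining classes get the neutral value 1/2.
    return {y: 1.0 if y in entry[1] else 0.5 for y in classes}

table_entry = {0: {'A'}, 1: {'B', 'C'}}
desired = invert_neuron_entry(table_entry, ['A', 'B', 'C'])
print(desired)  # {'A': 0.5, 'B': 1.0, 'C': 1.0}
```

Applying this to every row of the received logical table yields the inverted logical table used as the pre-training target.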
In an example embodiment, in the pre-training a cost function may be employed. The cost function may take into account the outputs (activations) ak,l of all neurons and may penalise deviations from the desired outputs āk,l, according to the inverted logical table. This may be done, for example, by using the cross-entropy (CE) cost function per neuron and taking a sum over all neurons, i.e. for a given training data pair (x, y):

C(x, y) = Σl Σk CE( āk,l(y), ak,l(x) ),

where CE(ā, a) = −ā·log(a) − (1 − ā)·log(1 − a). It is noted that the actual activation ak,l(x) is a function of the data point x, while the target (quantised) activation āk,l(y) is a function of the label/class y.
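The per-neuron cross-entropy and its sum over all neurons can be sketched for one training pair as follows (an illustration; the clamping constant is an added numerical-safety detail, not part of the original description):

```python
import math

def neuron_ce(target, actual):
    # Binary cross-entropy between the desired (quantised) activation
    # a_bar and the actual activation a of one neuron
    actual = min(max(actual, 1e-12), 1.0 - 1e-12)  # numerical safety
    return -(target * math.log(actual) + (1 - target) * math.log(1 - actual))

def pretraining_cost(desired, activations):
    # desired: a_bar_k(y) per neuron, from the inverted logical table
    # activations: a_k(x) per neuron, for one training pair (x, y)
    return sum(neuron_ce(t, a) for t, a in zip(desired, activations))
```

For instance, a neuron with neutral target ½ incurs its minimum cost log 2 when its activation is exactly ½.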
The logical pre-training may be performed successively layer by layer, starting with the first hidden layer and ending with the output layer. The size of the training data and the number of epochs may be design parameters. The pre-training may be performed with a limited or small set of the target neural network training data and a limited or small number of epochs. When the pre-training has been performed, the target network weights after the pre-training represent the initialisation for the following conventional training of the target network, where the target network is trained using the target network training data. Any applicable method may be applied in the final training phase.
FIG. 9 illustrates a numerical example of transferred bits according to an embodiment.
The example assumes a shape recognition problem with M = 6 different shapes. The (easier) source problem uses wider shapes, while the (harder) target problem uses narrower shapes. The source and target networks have the size 100-40-20-10-6. The training in the source and target networks uses 10000 epochs, and the logical pre-training in the target network uses 400 of the 10000 epochs. The example also employs the cross-entropy cost function and gradient descent.
From FIG. 9, it can be seen that the semantic transfer requires only 912 bits, while the conventional transfer of weights requires over 40000 bits at an 8-bit quantisation and over 160000 bits at a 16-bit quantisation. The illustrated example shows that the data to be communicated using semantic transfer is orders of magnitude smaller than in the conventional approach.
FIG. 10 illustrates a performance example of the semantic transfer according to an embodiment.
The performance of the target network in terms of the classification accuracy is compared in FIG. 10, where the first 2000 of the 10000 epochs are shown. The Y-axis represents the classification accuracy and the X-axis the number of epochs. FIG. 10 illustrates four performance results: regular training, conventional transfer, semantic transfer A and semantic transfer B. For the conventional transfer learning, the target network is initialised with the weights of the trained source network. As compared to the regular training, the accuracy increases faster, but converges slightly below that of the regular training.
FIG. 10 shows two versions of the semantic transfer learning. The semantic transfer A is the approach that has been discussed above in the various example embodiments. During the pre-training period, the accuracy is low and unstable, as the pre-training is performed successively from the first hidden layer towards the output layer, such that only in the last few epochs of the pre-training is the output layer considered in the cost function. After pre-training, the accuracy increases very fast and even outperforms both the regular training and the full-weight transfer. The semantic transfer B uses a modified cost function, where the relative frequency of data points leading to the corresponding logical propositions is also taken into account. As can be seen from FIGS. 9 and 10, the discussed semantic transfer learning is advantageous both in the number of bits to be transferred from the source network to the target network and in terms of learning speed and final accuracy.
Although some of the subject matter has been described in language specific to structural features and/or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as embodiments of implementing the claims and other equivalent features and acts are intended to be within the scope of the claims.
The functionality described herein can be performed, at least in part, by one or more computer program product components such as software components. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), Graphics Processing Units (GPUs).
It will be understood that the benefits and advantages described above may relate to one embodiment or may relate to several embodiments. The embodiments are not limited to those that solve any or all of the stated problems or those that have any or all of the stated benefits and advantages. It will further be understood that reference to 'an' item may refer to one or more of those items. The term ‘and/or’ may be used to indicate that one or more of the cases it connects may occur. Both, or more, connected cases may occur, or only either one of the connected cases may occur.
The operations of the methods described herein may be carried out in any suitable order, or simultaneously where appropriate. Additionally, individual blocks may be deleted from any of the methods without departing from the objective and scope of the subject matter described herein. Aspects of any of the embodiments described above may be combined with aspects of any of the other embodiments described to form further embodiments without losing the effect sought.
The term 'comprising' is used herein to mean including the method, blocks or elements identified, but that such blocks or elements do not comprise an exclusive list and a method or apparatus may contain additional blocks or elements. It will be understood that the above description is given by way of example only and that various modifications may be made by those skilled in the art. The above specification, embodiments and data provide a complete description of the structure and use of exemplary embodiments. Although various embodiments have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from the spirit or scope of this specification.

CLAIMS:
1. A computing device (100), configured to: initialize (302) a source neural network; train (304) the source neural network with training data of the source neural network; perform (306) a semantic analysis of the source neural network; extract (308) logical behaviour data of the source neural network based on the semantic analysis; and cause (310) a transmission of the logical behaviour data.
2. The computing device (100) of claim 1, wherein the logical behaviour data comprises a logical table.
3. The computing device (100) of claim 2, further configured to: semantically analyse neurons of the source neural network; and store, based on the semantic analysis, logical propositions corresponding to outputs of at least some of the neurons into the logical table.
4. The computing device (100) of claim 3, further configured to: encode the logical propositions into binary vectors; and cause a transmission of the binary vectors.
5. A computing device (200), configured to: receive (402) logical behaviour data associated with a source neural network, the logical behaviour data being based on a semantic analysis of the source neural network; pre-train (404) a target neural network with the logical behaviour data associated with the source neural network; and train (406) the target neural network with training data of the target neural network.
6. The computing device (200) of claim 5, wherein the logical behaviour data comprises a logical table.
7. The computing device (200) of claim 6, wherein the logical table comprises logical propositions corresponding to outputs of at least some of the neurons of the source neural network, the logical propositions being based on a semantic analysis of the neurons of the source neural network.
8. The computing device (200) of claim 7, further configured to compute an inverse logical table based on the received logical table, the inverse logical table being indicative of the desired logical behaviour for each neuron in the target neural network, wherein the computing device (200) is configured to pre-train (404) the target neural network by using a cost function that takes into account the outputs of the neurons of the target neural network and penalises deviations from desired outputs indicated by the inverted logical table.
9. The computing device (200) of claim 8, further configured to pre-train (404) the target neural network successively layer by layer.
10. The computing device (200) of any of claims 7 to 9, wherein the logical table comprises the logical propositions encoded into binary vectors.
11. The computing device (200) of any of claims 5 to 10, wherein the computing device (200) is configured to pre-train (404) the target neural network with a limited set of the training data of the target neural network and a limited number of epochs.
12. A method (300), comprising: initializing (302) a source neural network; training (304) the source neural network with training data of the source neural network; performing (306) a semantic analysis of the source neural network; extracting (308) logical behaviour data of the source neural network based on the semantic analysis; and causing (310) a transmission of the logical behaviour data.
13. The method (300) of claim 12, wherein the logical behaviour data comprises a logical table.
14. The method (300) of claim 13, further comprising: semantically analysing all neurons of the source neural network; and storing, based on the semantic analysis, logical propositions corresponding to outputs of at least some of the neurons into the logical table.
15. The method (300) of claim 14, further comprising: encoding the logical propositions into binary vectors; and causing a transmission of the binary vectors.
16. A method (400), comprising: receiving (402) logical behaviour data associated with a source neural network, the logical behaviour data being based on a semantic analysis of the source neural network; pre-training (404) a target neural network with the logical behaviour data associated with the source neural network; and training (406) the target neural network with training data of the target neural network.
17. The method (400) of claim 16, wherein the logical behaviour data comprises a logical table.
18. The method (400) of claim 17, wherein the logical table comprises logical propositions corresponding to outputs of at least some of the neurons of the source neural network, the logical propositions being based on a semantic analysis of the neurons of the source neural network.
19. The method (400) of claim 18, further comprising: computing an inverse logical table based on the received logical table, the inverse logical table being indicative of the desired logical behaviour for each neuron in the target neural network; and pre-training the target neural network by using a cost function that takes into account the outputs of the neurons of the target neural network and penalises deviations from desired outputs indicated by the inverted logical table.
20. The method (400) of claim 19, further comprising: pre-training (404) the target neural network successively layer by layer.
21. The method (400) of any of claims 18 to 20, wherein the logical table comprises the logical propositions encoded into binary vectors.
22. The method (400) of any of claims 16 to 21, further comprising: pre-training the target neural network with a limited set of the training data of the target neural network and a limited number of epochs.
23. A computer program comprising program code configured to cause performance of the method according to any of claims 12 to 15, when the computer program is executed on a computer.
24. A computer program comprising program code configured to cause performance of the method according to any of claims 16 to 22, when the computer program is executed on a computer.
25. A telecommunication device comprising the computing device (100) according to any of claims 1 to 4.
26. A telecommunication device comprising the computing device (200) according to any of claims 5 to 11.
EP21716362.5A 2021-03-31 2021-03-31 Transfer learning between neural networks Pending EP4309079A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/058462 WO2022207097A1 (en) 2021-03-31 2021-03-31 Transfer learning between neural networks

Publications (1)

Publication Number Publication Date
EP4309079A1 true EP4309079A1 (en) 2024-01-24

Family

ID=75377793

Family Applications (1)

Application Number Title Priority Date Filing Date
EP21716362.5A Pending EP4309079A1 (en) 2021-03-31 2021-03-31 Transfer learning between neural networks

Country Status (3)

Country Link
EP (1) EP4309079A1 (en)
CN (1) CN117121020A (en)
WO (1) WO2022207097A1 (en)

Also Published As

Publication number Publication date
CN117121020A (en) 2023-11-24
WO2022207097A1 (en) 2022-10-06


Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20231017

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR