WO2019080988A1 - End-to-end learning in communication systems - Google Patents

End-to-end learning in communication systems

Info

Publication number
WO2019080988A1
Authority
WO
WIPO (PCT)
Prior art keywords
receiver
transmitter
symbols
neural network
represented
Prior art date
Application number
PCT/EP2017/076965
Other languages
French (fr)
Inventor
Jakob Hoydis
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to PCT/EP2017/076965 priority Critical patent/WO2019080988A1/en
Publication of WO2019080988A1 publication Critical patent/WO2019080988A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L25/00Baseband systems
    • H04L25/02Details ; arrangements for supplying electrical power along data transmission lines
    • H04L25/03Shaping networks in transmitter or receiver, e.g. adaptive shaping networks
    • H04L25/03006Arrangements for removing intersymbol interference
    • H04L25/03165Arrangements for removing intersymbol interference using neural networks

Abstract

This specification relates to end-to-end learning in communication systems and describes a method comprising: converting first input data bits into symbols for transmission by a data transmission system comprising a transmitter and a receiver, wherein the transmitter is represented using a transmitter neural network and the receiver is represented using a receiver neural network; transmitting one or more symbols from the transmitter to the receiver; converting each of the one or more symbols into first output data bits at the receiver; and training at least some weights of the transmitter and receiver neural networks using a loss function.

Description

End-to-end Learning in Communication Systems
Field
The present specification relates to learning in communication systems.
Background
A simple communications system includes a transmitter, a transmission channel and a receiver. The design of such communication systems typically involves the separate design and optimisation of each part of the system. An alternative approach is to consider the entire communication system as a single system and to seek to optimise the entire system. Although some attempts have been made in the prior art, there remains scope for further developments in this area.
Summary
In a first aspect, this specification describes a method comprising: converting first input data bits into symbols for transmission by a data transmission system comprising a transmitter and a receiver, wherein the transmitter is represented using a transmitter neural network and the receiver is represented using a receiver neural network;
transmitting one or more symbols from the transmitter to the receiver; converting each of the one or more symbols into first output data bits at the receiver; and training at least some weights of the transmitter and receiver neural networks using a loss function.
The first aspect may further comprise converting the one or more symbols into a probability vector over output bits and a probability vector over output symbols, wherein training at least some weights of the receiver neural network using the loss function includes considering a probability vector over the output bits and a probability vector over output symbols.
The loss function may be related to a symbol error rate for the one or more symbols and a bit error rate for the first output data bits. Furthermore, a relative weight of the symbol error rate and the bit error rate in the loss function may be defined by a weighting coefficient.
The transmitter neural network may be a multi-layer neural network, the method further comprising initializing the last layer of the multi-layer neural network. Furthermore, the other layers in the transmitter neural network may be initialized arbitrarily. The receiver neural network may be a multi-layer neural network, the method further comprising initializing the last layer of the multi-layer neural network. Furthermore, the other layers in the receiver neural network may be initialized arbitrarily. The first aspect may further comprise initializing at least some of the parameters of the transmitter neural network. Furthermore, the first aspect may further comprise initializing at least some of the parameters of the transmitter neural network based on a known initial weight matrix. The known initial weight matrix may correspond to a first modulation scheme.
The communication system may further comprise a channel model, wherein each symbol is transmitted from the transmitter to the receiver via the channel model.
The first aspect may further comprise splitting up a codeword into a plurality of symbols and transmitting each symbol in the plurality separately.
In a second aspect, this specification describes an apparatus configured to perform the method of any method as described with reference to the first aspect. In a third aspect, this specification describes computer-readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform any method as described with reference to the first aspect.
In a fourth aspect, this specification describes a computer-readable medium having computer-readable code stored thereon, the computer readable code, when executed by at least one processor, causes performance of: converting first input data bits into symbols for transmission by a data transmission system comprising a transmitter and a receiver, wherein the transmitter is represented using a transmitter neural network and the receiver is represented using a receiver neural network; transmitting one or more symbols from the transmitter to the receiver; converting each of the one or more symbols into first output data bits at the receiver; and training at least some weights of the transmitter and receiver neural networks using a loss function.
In a fifth aspect, this specification describes an apparatus comprising: at least one processor; and at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to: convert first input data bits into symbols for transmission by a data transmission system comprising a transmitter and a receiver, wherein the transmitter is represented using a transmitter neural network and the receiver is represented using a receiver neural network; transmit one or more symbols from the transmitter to the receiver; convert each of the one or more symbols into first output data bits at the receiver; and train at least some weights of the transmitter and receiver neural networks using a loss function.
In a sixth aspect, this specification describes an apparatus comprising: means for converting first input data bits into symbols for transmission by a data transmission system comprising a transmitter and a receiver, wherein the transmitter is represented using a transmitter neural network and the receiver is represented using a receiver neural network; means for transmitting one or more symbols from the transmitter to the receiver; means for converting each of the one or more symbols into first output data bits at the receiver; and means for training at least some weights of the transmitter and receiver neural networks using a loss function.
Brief description of the drawings
Example embodiments will now be described, by way of non-limiting examples, with reference to the following schematic drawings, in which:
Figure 1 is a block diagram of an exemplary end-to-end communication system;
Figure 2 is a block diagram of an exemplary transmitter used in an exemplary implementation of the system of Figure 1;
Figure 3 is a block diagram of an exemplary channel model used in an exemplary implementation of the system of Figure 1;
Figure 4 is a block diagram of an exemplary receiver used in an exemplary implementation of the system of Figure 1;
Figure 5 is a flow chart showing an algorithm in accordance with an exemplary embodiment;
Figure 6 is a block diagram of components of a system in accordance with an exemplary embodiment; and
Figures 7a and 7b show tangible media, respectively a removable memory unit and a compact disc (CD) storing computer-readable code which when run by a computer perform operations according to embodiments.
Detailed description
Figure 1 is a block diagram of an exemplary communication system, indicated generally by the reference numeral 1, in which exemplary embodiments may be implemented. The system 1 includes a transmitter 2, a channel 4 and a receiver 6. Viewed at a system level, the system 1 converts an input vector (IN) received at the input to the transmitter 2 into an output vector (OUT) at the output of the receiver 6. The transmitter 2 includes a neural network 10. Similarly, the receiver 6 includes a neural network 14. As described in detail below, the neural networks 10 and 14 are trained in order to optimise the performance of the system as a whole.
Typically, the channel 4 includes a network 12 that is used to model the transformations that would occur in a communications channel (e.g. noise, upsampling, filtering, convolution with a channel impulse response, resampling, time/frequency/phase offsets, etc.). The network 12 is typically a sequence of stochastic transformations of the input to the channel (i.e. the output of the transmitter 2). In general, the weights of the network 12 implementing the channel model are not trainable.
The channel 4 could be implemented using a real channel, but there are a number of practical advantages with using a channel model (such as not needing to set up a physical channel when training the neural networks of the system 1). Also, it is not straightforward to use a real channel here, since the transfer function of the channel is not known during training. A possible workaround is to use a two-stage training process in which the system is first trained end-to-end using a stochastic channel model and then only the receiver is fine-tuned based on real data transmissions. Other arrangements are also possible.
As shown in Figure 1, the transmitter 2 receives an input (IN). The input IN is encoded by the transmitter 2. The neural network 10 is used to transform the input into a signal for transmission using the channel 4. The neural network 10 may include multiple layers or levels (a so-called deep neural network). For example, the neural network 10 may have some layers with weights that are trainable and some layers with weights that are fixed.
Similarly, the receiver 6 is used to transform the output of the channel into the output OUT. The neural network 14 may include multiple layers or levels (a so-called deep neural network). For example, the neural network 14 may have some layers with weights that are trainable and some layers with weights that are fixed. In the context of a communication system, the output OUT is typically the receiver's best guess of the input IN. As described in detail below, the receiver 6 may include a loss function that monitors how accurately the output OUT matches the input IN. The output of the loss function can then be used in the training of the weights of the neural network 10 of the transmitter and/or the neural network 14 of the receiver.
In the vast majority of cases, we cannot train the weights so as to minimise the loss function with a closed-form solution and have to employ an iterative method such as gradient descent. Gradient descent uses the observation that, at a given point, updating the parameters in the opposite direction to the gradient of the loss function with respect to these parameters will lead to the greatest reduction in loss. After the parameters have been updated, the gradient is recalculated, and this is repeated until convergence, when the loss value is no longer decreasing significantly with each iteration, or until some user-specified iteration limit is reached. Traditional, or batch, gradient descent calculates this gradient using the loss over all given inputs and desired values on each iteration. Analysing the entire sample on each iteration is very inefficient, and so convergence would take a relatively long time. Instead, most neural networks are trained using a procedure known as stochastic gradient descent (SGD). Stochastic gradient descent estimates the gradient using a single input and desired value pair, or a small number of such pairs, on each iteration. In most scenarios, stochastic gradient descent reaches convergence relatively quickly while still finding suitable parameter values.
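As a minimal illustration of one such update, the following Python sketch performs a single stochastic gradient descent step; the quadratic toy loss, learning rate and function names are assumptions for illustration only:

```python
import numpy as np

def sgd_step(params, grad_fn, batch, learning_rate=0.01):
    """One stochastic gradient descent update.

    params        -- current parameter vector (np.ndarray)
    grad_fn       -- function returning dL/dparams for a mini-batch
    batch         -- a single example or small mini-batch
    learning_rate -- step size
    """
    grad = grad_fn(params, batch)          # gradient estimated from the mini-batch only
    return params - learning_rate * grad   # move against the gradient to reduce the loss

# Illustrative usage: minimise L(w) = ||w - target||^2 from noisy samples.
target = np.array([1.0, -2.0])
grad_fn = lambda w, x: 2.0 * (w - x)       # gradient of the per-sample squared error
w = np.zeros(2)
for _ in range(1000):
    sample = target + 0.1 * np.random.randn(2)   # noisy "observation" of the target
    w = sgd_step(w, grad_fn, sample)
```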
Figure 2 is a block diagram showing details of an exemplary implementation of the transmitter 2 described above. As shown in Figure 2, the transmitter 2 includes a binary-to-decimal module 20, an embedding module 22, a dense layer of one or more neural networks 24, a complex vector generator 26 and a normalization module 28. The modules within the transmitter 2 are provided by way of example and modifications are possible. For example, the complex vector generator 26 and the normalization module 28 could be provided in a different order.
The binary input vector $\mathbf{b} \in \{0,1\}^k$ of length $k \geq 1$ is transformed (in binary-to-decimal module 20) into the message index $s \in \mathbb{M} = \{0, 1, \ldots, M-1\}$, where $M = 2^k$, through the function $\text{bin2dec}: \{0,1\}^k \to \mathbb{M}$, which could be implemented as $\text{bin2dec}(\mathbf{b}) = \sum_{l=0}^{k-1} b_l 2^l$. Other implementations are possible as long as the mapping is bijective.
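The following Python sketch shows one possible realisation of the bin2dec / dec2bin pair described above; the little-endian bit ordering is just one valid convention:

```python
import numpy as np

def bin2dec(bits):
    """Map a length-k binary vector to a message index in {0, ..., 2^k - 1}."""
    return int(sum(int(b) << l for l, b in enumerate(bits)))

def dec2bin(s, k):
    """Inverse mapping: message index back to a length-k binary vector."""
    return np.array([(s >> l) & 1 for l in range(k)], dtype=np.int64)

# Round-trip check for k = 4 (M = 16 messages).
k = 4
assert all(bin2dec(dec2bin(s, k)) == s for s in range(2 ** k))
```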
The message index $s$ is fed into the embedding module 22, $\text{embedding}: \mathbb{M} \to \mathbb{R}^{n_{\text{emb}}}$, which transforms $s$ into an $n_{\text{emb}}$-dimensional real-valued vector. The embedding module 22 can optionally be followed by several dense neural network (NN) layers 24 with possibly different activation functions, such as ReLU, tanh, sigmoid, linear, etc. (also known as a multilayer perceptron (MLP)). The final layer of the neural network has $2n$ output dimensions and a linear activation function. If no dense layer is used, $n_{\text{emb}} = 2n$.
The output of the neural network 24 is converted to a complex-valued vector (by complex vector generator 26) through the mapping $\mathbb{R}2\mathbb{C}: \mathbb{R}^{2n} \to \mathbb{C}^n$, which could be implemented as $\mathbb{R}2\mathbb{C}(\mathbf{z}) = \mathbf{z}_0^{n-1} + j\,\mathbf{z}_n^{2n-1}$ (i.e. the first $n$ elements form the real part and the last $n$ elements form the imaginary part).
A normalization is applied by the normalization module 28 that ensures that power, amplitude or other constraints are met. The result of the normalization process is the transmit vector $\mathbf{x}$ of the transmitter 2 (where $\mathbf{x} \in \mathbb{C}^n$). As noted above, the order of the complex vector generation and the normalization could be reversed.
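Putting the transmitter blocks of Figure 2 together, a minimal NumPy sketch could look as follows; it reuses the bin2dec / dec2bin helpers from the sketch above, omits the optional dense layers 24, assumes an average-power normalization, and uses illustrative dimensions rather than values prescribed by the specification:

```python
import numpy as np

k, n = 4, 2
M, n_emb = 2 ** k, 2 * n     # n_emb = 2n when no dense layers follow the embedding

rng = np.random.default_rng(0)
W_emb = rng.normal(size=(n_emb, M))        # trainable embedding matrix (random init here)

def transmitter(bits):
    s = bin2dec(bits)                       # binary-to-decimal module 20
    z = W_emb[:, s]                         # embedding lookup (module 22); dense layers 24 omitted
    x = z[:n] + 1j * z[n:]                  # R2C mapping (module 26): first n real, last n imaginary
    x = np.sqrt(n) * x / np.linalg.norm(x)  # normalization (module 28): enforce ||x||^2 = n
    return x

x = transmitter(dec2bin(5, k))
```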
Figure 3 is a block diagram showing details of an exemplary implementation of the channel 4 described above. As shown in Figure 3, the channel model 4 includes a channel layer network 30. The network 30 typically may not include any trainable weights (in embodiments having trainable weights, the network 30 would then be a neural network). The network 30 seeks to model the transformations undergone in a typical communication channel. Such transformations might include one or more of the following: upsampling, pulse shaping, adding of noise, convolution with random filter taps, phase rotations, resampling at a different rate with a timing offset. As shown in Figure 3, the network 30 receives the vector x as output by the transmitter 2 and provides a vector y to the receiver 6.
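As an illustration, a stochastic channel layer with no trainable weights could be as simple as a random phase rotation followed by additive white Gaussian noise; the chosen transformations and noise level are assumptions, not requirements of the specification:

```python
import numpy as np

def channel(x, snr_db=10.0, rng=np.random.default_rng()):
    """Toy stochastic channel: random phase rotation plus complex AWGN.

    No trainable weights -- the layer only applies fixed stochastic transformations,
    in the spirit of network 30 described above.
    """
    phase = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi))   # random phase rotation
    noise_var = 10.0 ** (-snr_db / 10.0)                  # per-symbol noise power
    noise = np.sqrt(noise_var / 2.0) * (rng.standard_normal(x.shape)
                                        + 1j * rng.standard_normal(x.shape))
    return phase * x + noise

y = channel(x)   # x as produced by the transmitter sketch above
```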
Figure 4 is a block diagram showing details of an exemplary implementation of the receiver 6 described above. As shown in Figure 4, the receiver 6 includes a real vector generator 40, a dense layer of one or more neural networks 42 and a softmax module 44. As described further below, the output of the softmax module is a probability vector that is provided to the input of an arg max module 46 and to an input of multiplier 48. The output of the multiplier is provided to module 50.
The received vector $\mathbf{y} \in \mathbb{C}^{n_{\text{rx}}}$, where $n_{\text{rx}}$ can be different from $n$, is transformed (by real vector generator 40) into a real-valued vector of $2n_{\text{rx}}$ dimensions through the mapping $\mathbb{C}2\mathbb{R}: \mathbb{C}^{n_{\text{rx}}} \to \mathbb{R}^{2n_{\text{rx}}}$, which could be implemented as $\mathbb{C}2\mathbb{R}(\mathbf{z}) = [\Re(\mathbf{z})^T, \Im(\mathbf{z})^T]^T$. The result is fed into the one or more neural networks 42, which may have different activation functions such as ReLU, tanh, sigmoid, linear, etc. The last layer has $M$ output dimensions to which a softmax activation is applied (by softmax module 44). This generates the probability vector $\mathbf{p}_s \in \mathbb{R}^M$, whose $i$th element $[\mathbf{p}_s]_i$ can be interpreted as $\Pr(s = i \mid \mathbf{y})$. A hard decision for the message index is obtained as $\hat{s} = \arg\max(\mathbf{p}_s)$ by arg max module 46.
The probability vector $\mathbf{p}_s$ is multiplied (by multiplier 48) from the left by the matrix $\mathbf{B} = [\mathbf{b}_0, \ldots, \mathbf{b}_{M-1}] \in \{0,1\}^{k \times M}$, where $\mathbf{b}_i = \text{dec2bin}(i)$ and $\text{dec2bin}: \mathbb{M} \to \{0,1\}^k$ is the inverse of the previously defined mapping bin2dec. This generates the vector $\mathbf{p}_b = \mathbf{B}\mathbf{p}_s \in [0,1]^k$, where $[\mathbf{p}_b]_i$ can be interpreted as $\Pr(b_i = 1 \mid \mathbf{y})$.
Hard decisions for the bit representations can be computed (by module 50) as $\hat{\mathbf{b}} = \mathbb{1}\{\mathbf{p}_b > \tfrac{1}{2}\}$, where the $>$ operator is applied element-wise. (The threshold 0.5 is provided by way of example only: other thresholds could be used to make this decision.)
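The receiver processing of Figure 4 can be sketched as follows, reusing k, M, n, dec2bin and the channel output y from the sketches above; the dense layers 42 are stubbed out with a single random linear map purely for illustration:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

# Matrix B: column i is the bit pattern of message index i.
B = np.stack([dec2bin(i, k) for i in range(M)], axis=1)    # shape (k, M)

W_rx = np.random.default_rng(1).normal(size=(M, 2 * n))    # stand-in for dense layers 42

def receiver(y):
    z = np.concatenate([y.real, y.imag])      # C2R mapping (module 40)
    p_s = softmax(W_rx @ z)                    # softmax module 44: Pr(s = i | y)
    s_hat = int(np.argmax(p_s))                # arg max module 46: hard message decision
    p_b = B @ p_s                              # multiplier 48: Pr(b_i = 1 | y)
    b_hat = (p_b > 0.5).astype(int)            # module 50: element-wise hard bit decisions
    return s_hat, b_hat, p_s, p_b

s_hat, b_hat, p_s, p_b = receiver(y)
```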
The autoencoder, comprising transmitter and receiver neural networks, is trained using an appropriate method such as SGD, as described above, with the following loss function:
$$L = -(1-\alpha)\log([\mathbf{p}_s]_s) \;-\; \alpha \sum_{i=0}^{k-1} \big[\, b_i \log([\mathbf{p}_b]_i) + (1 - b_i) \log(1 - [\mathbf{p}_b]_i) \,\big]$$
where the first term is the categorical cross-entropy for the message $s$, the sum is the binary cross-entropy for the $i$th bit, and $\alpha \in [0,1]$ is an arbitrary weighting coefficient that decides how much weight is given to the categorical cross-entropy between $s$ and $\mathbf{p}_s$ and to the sum of the binary cross-entropies between the bit-wise soft decisions $[\mathbf{p}_b]_i$ and $b_i$.
Looking at the loss function $L$, it is clear that for $\alpha = 0$, the loss function reduces to the symbol-error term $-\log([\mathbf{p}_s]_s)$. In such a scenario, the neural networks in the system 1 are optimised for the message index, which can be termed the symbol error rate or block error rate (BLER). Thus, for $\alpha = 0$, the bit-mapping is not taken into account.
For α>0, the bit-mapping is integrated into the end-to-end learning process so that not only the block error rate (BLER) but also the bit error rate (BER) is optimised.
If $\alpha = 1$, then only the bit error rate (and not the block or symbol error rate) is optimised.
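For a single training example, this loss could be computed as in the following sketch, which reuses p_s, p_b and the helpers from the receiver sketch above; the clipping constant eps is an assumption added for numerical stability:

```python
import numpy as np

def end_to_end_loss(p_s, p_b, s_true, b_true, alpha=0.5, eps=1e-12):
    """Weighted sum of categorical and binary cross-entropies, as in the loss L above."""
    p_s = np.clip(p_s, eps, 1.0)
    p_b = np.clip(p_b, eps, 1.0 - eps)
    ce_symbol = -np.log(p_s[s_true])                          # categorical CE for message s
    ce_bits = -np.sum(b_true * np.log(p_b)
                      + (1 - b_true) * np.log(1 - p_b))       # binary CE summed over the k bits
    return (1.0 - alpha) * ce_symbol + alpha * ce_bits

bits_true = dec2bin(5, k)                 # the bits that were actually transmitted
loss = end_to_end_loss(p_s, p_b, bin2dec(bits_true), bits_true, alpha=0.5)
```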
In case the binary inputs $\mathbf{b}$ are coded bits (rather than information bits), for example if they are generated by an outer channel code, the soft decisions $\mathbf{p}_b$ can be used for decoding. Note that multiple messages can form a codeword. In the event that the codeword is long, that codeword can be split. Thus, a codeword having a length $L \cdot k$ can be split into $L$ binary vectors of $k$ elements, which vectors are individually transmitted using the above architecture, and whose soft decisions are used for decoding.
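A small sketch of this splitting step, reusing the transmitter, channel and receiver sketches above and assuming the codeword length is an exact multiple of k:

```python
import numpy as np

def split_codeword(codeword, k):
    """Split a codeword of length L*k into L binary vectors of k bits each."""
    codeword = np.asarray(codeword)
    assert codeword.size % k == 0, "codeword length must be a multiple of k"
    return codeword.reshape(-1, k)

# Each k-bit block is transmitted separately; its soft decisions p_b are
# collected for the outer decoder.
blocks = split_codeword(np.random.default_rng(2).integers(0, 2, size=3 * k), k)
soft_decisions = [receiver(channel(transmitter(block)))[3] for block in blocks]
```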
A mathematical argument of why $\mathbf{p}_b = \mathbf{B}\mathbf{p}_s$ can be given as follows:
$$\begin{aligned}
[\mathbf{p}_b]_i &= \Pr(b_i = 1 \mid \mathbf{y}) \\
&= \sum_{j=0}^{M-1} \Pr(b_i = 1, s = j \mid \mathbf{y}) \\
&= \sum_{j=0}^{M-1} \Pr(b_i = 1 \mid \mathbf{y}, s = j)\,\Pr(s = j \mid \mathbf{y}) \\
&= \sum_{j=0}^{M-1} \Pr(b_i = 1 \mid s = j)\,\Pr(s = j \mid \mathbf{y}) \\
&= \sum_{j=0}^{M-1} [\mathbf{b}_j]_i\, [\mathbf{p}_s]_j,
\end{aligned}$$
which can be expressed compactly in matrix form as $\mathbf{p}_b = \mathbf{B}\mathbf{p}_s$.
Figure 5 is a flow chart showing an algorithm, indicated generally by the reference numeral 60, in accordance with an exemplary embodiment.
The algorithm 60 starts at operation 62, where the weights in the relevant neural networks (e.g. trainable neural networks within the dense layers 24 and 42 described above) are initialised. With the weights initialised, the algorithm 60 moves to operation 64, where the communication system 1 is used to transmit data over the channel 4. The data transmitted is received at the receiver 6 (operation 66) and the loss function described above is calculated (operation 68).
On the basis of the calculated loss function, the trainable weights within the relevant neural networks are updated (operation 70), for example using an SGD operation.
At operation 72, it is determined whether or not the algorithm 60 is complete. If so, the algorithm terminates at operation 74. Otherwise, the algorithm returns to operation 64 so that data is again transmitted and the trainable weights are updated again based on an updated loss function. The operations 64 to 70 may be repeated many times, so that the weights in the neural networks are updated in operation 70 many times.
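A schematic PyTorch-style sketch of the loop of Figure 5 is given below for the α = 0 (symbol error) case; the layer sizes, channel model, optimiser settings and fixed iteration count are illustrative assumptions rather than values taken from the specification:

```python
import torch
import torch.nn as nn

k, n = 4, 2
M = 2 ** k

# Operation 62: initialise trainable weights (default PyTorch initialisation here).
transmitter = nn.Embedding(M, 2 * n)                    # embedding -> 2n real outputs
receiver = nn.Linear(2 * n, M)                          # dense layer -> M logits
optimizer = torch.optim.SGD(list(transmitter.parameters())
                            + list(receiver.parameters()), lr=0.1)
loss_fn = nn.CrossEntropyLoss()                         # categorical CE (alpha = 0 case)

for step in range(2000):                                # operations 64 to 72, repeated (fixed count here)
    s = torch.randint(0, M, (64,))                      # random message indices
    x = transmitter(s)                                  # operation 64: transmit
    x = x / x.norm(dim=1, keepdim=True) * (n ** 0.5)    # power normalization
    y = x + 0.1 * torch.randn_like(x)                   # stochastic channel model
    logits = receiver(y)                                # operation 66: receive
    loss = loss_fn(logits, s)                           # operation 68: loss function
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                    # operation 70: SGD weight update
```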
At operation 62 described above, the trainable weights are initialised. This could be implemented in a number of different ways. The trainable weights may be initialised to favour solutions with certain properties, e.g., to resemble existing modulation schemes, to speed up the training process, or simply to converge to better solutions.
The transmitter's embedding layer, $\text{embedding}: \mathbb{M} \to \mathbb{R}^{n_{\text{emb}}}$, is essentially a lookup table that returns the $i$th column of an arbitrary $n_{\text{emb}} \times M$ matrix $\mathbf{W}_{\text{emb}} = [\mathbf{w}_{\text{emb},0}, \ldots, \mathbf{w}_{\text{emb},M-1}] \in \mathbb{R}^{n_{\text{emb}} \times M}$, i.e. $\text{embedding}(i) = \mathbf{w}_{\text{emb},i}$. In one embodiment, we initialize $\mathbf{W}_{\text{emb}}$ not randomly according to some distribution, but instead with a deterministic matrix which has some desired properties (for example orthogonal columns, a known modulation scheme, etc.).
In this embodiment, we assume $n_{\text{emb}} = 2n$, such that each column of $\mathbf{W}_{\text{emb}}$ has the same dimensions as the transmitter output before $\mathbb{R}2\mathbb{C}$ conversion. In this case, no additional dense layers are used after the embedding. For example, for $k = 4$ and $n = 1$, one can initialize $\mathbf{W}_{\text{emb}} \in \mathbb{R}^{2 \times 16}$ with the constellation symbols of the QAM-16 modulation scheme, i.e.,
$$\mathbf{w}_{\text{emb},i} = \mathbb{C}2\mathbb{R}(\text{QAM16}(i)), \quad i = 0, \ldots, 15,$$
where $\text{QAM16}: \{0, \ldots, 15\} \to \mathbb{C}$ is the QAM-16 mapping. Additionally, the functions bin2dec / dec2bin can be chosen according to some desired bit-labelling, such as Gray labelling (i.e., adjacent constellation symbols differ only in one bit).
Similarly, for $k = 8$ and $n = 2$, one could initialize $\mathbf{W}_{\text{emb}} \in \mathbb{R}^{4 \times 256}$ as
$$\mathbf{w}_{\text{emb},i} = \begin{bmatrix} \mathbb{C}2\mathbb{R}(\text{QAM16}(\lfloor i/16 \rfloor)) \\ \mathbb{C}2\mathbb{R}(\text{QAM16}(\text{mod}(i, 16))) \end{bmatrix}, \quad i = 0, \ldots, 255.$$
This can easily be extended to other values of k and n by using other traditional modulation schemes with a suitable order.
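As one concrete example of such an initialization for k = 4 and n = 1, the following sketch builds W_emb from a Gray-labelled QAM-16 constellation; the particular constellation ordering is an assumption, one of several valid Gray labellings:

```python
import numpy as np

# Gray-coded 4-PAM levels per axis: adjacent amplitude levels differ in exactly one bit.
gray_pam = {0b00: -3.0, 0b01: -1.0, 0b11: 1.0, 0b10: 3.0}

def qam16(i):
    """Map a message index 0..15 to a (Gray-labelled) QAM-16 constellation point."""
    re = gray_pam[(i >> 2) & 0b11]          # two most significant bits -> in-phase level
    im = gray_pam[i & 0b11]                 # two least significant bits -> quadrature level
    return (re + 1j * im) / np.sqrt(10.0)   # normalise to unit average symbol energy

# W_emb in R^(2 x 16): column i is C2R(QAM16(i)).
W_emb_init = np.stack([[qam16(i).real, qam16(i).imag] for i in range(16)], axis=1)
```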
Another interesting initialization is based on optimal (or approximate) sphere packing. In this case, the columns $\mathbf{w}_{\text{emb},i}$ correspond to the centres of the $M$ spheres (e.g., of a cubic or hexagonal close packing) in an n-dimensional space that are closest to the origin.
Additional dense layers 24 after the embedding module 22 would, in general, tend to destroy the structure of such an initialization. An exemplary approach to initialization consists in letting the last dense layer have a weight matrix $\mathbf{W} \in \mathbb{R}^{2n \times M}$ that is initialized in the same way the embedding is initialized above. In this case, the embedding and all dense layers but the last can be initialized arbitrarily. The second-to-last layer needs to have $M$ output dimensions, and the bias vector of the last dense layer is initialized to all zeros. Linear activations are applied to the outputs of the last layer, which are then fed into the normalization layer.
A goal of this approach is to initialize the transmitter with good message representations based on a traditional baseline scheme, which can then be further optimized during training. If the embedding is initialized as described above, it is possible to use subsequent dense layers that all have dimensions $2n \times 2n$ with linear activations, whose weights are initialized as identity matrices and whose biases are initialized as all-zero vectors. An advantage of initializing the last dense layer is that the resulting initial constellation is a linear combination of the columns of the matrix $\mathbf{W}$.
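The two initialization variants described above could be sketched with PyTorch linear layers as follows; the depth and the reuse of the QAM-16 matrix W_emb_init from the earlier sketch are illustrative assumptions:

```python
import torch
import torch.nn as nn

k, n = 4, 1
M = 2 ** k

# Variant 1 (embedding initialized with W_emb, see above): subsequent 2n x 2n dense layers
# with identity weights, zero biases and linear activations pass the constellation through
# unchanged at initialization.
def identity_linear(dim):
    layer = nn.Linear(dim, dim)
    with torch.no_grad():
        layer.weight.copy_(torch.eye(dim))
        layer.bias.zero_()
    return layer

hidden = identity_linear(2 * n)

# Variant 2 (arbitrary embedding and earlier layers): the second-to-last layer has M outputs,
# and the last dense layer holds W in R^(2n x M), initialized with the baseline constellation,
# with an all-zero bias and a linear activation feeding the normalization layer.
last = nn.Linear(M, 2 * n)
with torch.no_grad():
    last.weight.copy_(torch.tensor(W_emb_init, dtype=torch.float32))   # W_emb_init from above
    last.bias.zero_()
```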
For completeness, Figure 6 is a schematic diagram of components of one or more of the modules described previously (e.g. the transmitter or receiver neural networks), which hereafter are referred to generically as processing systems 110. A processing system 110 may have a processor 112, a memory 114 closely coupled to the processor and comprised of a RAM 124 and ROM 122, and, optionally, hardware keys 120 and a display 128. The processing system 110 may comprise one or more network interfaces 118 for connection to a network, e.g. a modem which may be wired or wireless.
The processor 112 is connected to each of the other components in order to control operation thereof. The memory 114 may comprise a non-volatile memory, a hard disk drive (HDD) or a solid state drive (SSD). The ROM 122 of the memory 114 stores, amongst other things, an operating system 125 and may store software applications 126. The RAM 124 of the memory 114 is used by the processor 112 for the temporary storage of data. The operating system 125 may contain code which, when executed by the processor, implements aspects of the algorithm 60.
The processor 112 may take any suitable form. For instance, it may be a microcontroller, plural microcontrollers, a processor, or plural processors. The processing system 110 may be a standalone computer, a server, a console, or a network thereof. In some embodiments, the processing system 110 may also be associated with external software applications. These may be applications stored on a remote server device and may run partly or exclusively on the remote server device. These applications may be termed cloud-hosted applications. The processing system 110 may be in communication with the remote server device in order to utilize the software application stored there.
Figures 7a and 7b show tangible media, respectively a removable memory unit 165 and a compact disc (CD) 168, storing computer-readable code which when run by a computer may perform methods according to embodiments described above. The removable memory unit 165 may be a memory stick, e.g. a USB memory stick, having internal memory 166 storing the computer-readable code. The memory 166 may be accessed by a computer system via a connector 167. The CD 168 may be a CD-ROM or a DVD or similar. Other forms of tangible storage media may be used.
Embodiments of the present invention may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "memory" or "computer-readable medium" may be any non-transitory media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
Reference to, where relevant, "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc., or a "processor" or "processing circuitry" etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialised circuits such as field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), signal processing devices and other devices. References to computer program, instructions, code, etc. should be understood to express software for a programmable processor, or firmware such as the programmable content of a hardware device, whether instructions for a processor or configuration settings for a fixed-function device, gate array, programmable logic device, etc.
As used in this application, the term "circuitry" refers to all of the following: (a) hardware-only circuit implementations (such as implementations in only analogue and/or digital circuitry); (b) combinations of circuits and software (and/or firmware), such as (as applicable): (i) a combination of processor(s) or (ii) portions of processor(s)/software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a server, to perform various functions; and (c) circuits, such as a microprocessor(s) or a portion of a microprocessor(s), that require software or firmware for operation, even if the software or firmware is not physically present.
If desired, the different functions discussed herein may be performed in a different order and/or concurrently with each other. Furthermore, if desired, one or more of the above- described functions may be optional or may be combined. Similarly, it will also be appreciated that the flow diagram of Figure 5 is an example only and that various operations depicted therein may be omitted, reordered and/or combined.
It will be appreciated that the above described example embodiments are purely illustrative and are not limiting on the scope of the invention. Other variations and modifications will be apparent to persons skilled in the art upon reading the present specification. By way of example, at least some of the dense layers described herein (dense layers 24 and 42) could include one or more convolutional layers. Moreover, the disclosure of the present application should be understood to include any novel features or any novel combination of features either explicitly or implicitly disclosed herein or any generalization thereof and during the prosecution of the present application or of any application derived therefrom, new claims may be formulated to cover any such features and/or combination of such features.
Although various aspects of the invention are set out in the independent claims, other aspects of the invention comprise other combinations of features from the described embodiments and/or the dependent claims with the features of the independent claims, and not solely the combinations explicitly set out in the claims.
It is also noted herein that while the above describes various examples, these descriptions should not be viewed in a limiting sense. Rather, there are several variations and modifications which may be made without departing from the scope of the present invention as defined in the appended claims.

Claims

Claims:
1. A method comprising:
converting first input data bits into symbols for transmission by a data
transmission system comprising a transmitter and a receiver, wherein the transmitter is represented using a transmitter neural network and the receiver is represented using a receiver neural network;
transmitting one or more symbols from the transmitter to the receiver;
converting each of the one or more symbols into first output data bits at the receiver; and
training at least some weights of the transmitter and receiver neural networks using a loss function.
2. A method as claimed in claim 1, further comprising converting the one or more symbols into a probability vector over output bits and a probability vector over output symbols, wherein training at least some weights of the receiver neural network using the loss function includes considering a probability vector over the output bits and a probability vector over output symbols.
3. A method as claimed in claim 1 or claim 2, wherein the loss function is related to a symbol error rate for the one or more symbols and a bit error rate for the first output data bits.
4. A method as claimed in claim 3, wherein a relative weight of the symbol error rate and the bit error rate in the loss function is defined by a weighting coefficient.
5. A method as claimed in any one of the preceding claims, wherein the transmitter neural network is a multi-layer neural network, the method further comprising initializing the last layer of the multi-layer neural network.
6. A method as claimed in claim 5, wherein other layers in the transmitter neural network are initialized arbitrarily.
7. A method as claimed in any one of the preceding claims, wherein the receiver neural network is a multi-layer neural network, the method further comprising initializing the last layer of the multi-layer neural network.
8. A method as claimed in claim 7, wherein other layers in the receiver neural network are initialized arbitrarily.
9. A method as claimed in any one of the preceding claims, further comprising initializing at least some of the parameters of the transmitter neural network.
10. A method as claimed in claim 9, further comprising initializing at least some of the parameters of the transmitter neural network based on a known initial weight matrix.
11. A method as claimed in claim 10, wherein the known initial weight matrix corresponds to a first modulation scheme.
12. A method as claimed in any one of the preceding claims, wherein the
communication system further comprises a channel model, wherein each symbol is transmitted from the transmitter to the receiver via the channel model.
13. A method as claimed in any one of the preceding claims, further comprising splitting up a codeword into a plurality of symbols and transmitting each symbol in the plurality separately.
14. An apparatus configured to perform the method of any preceding claim.
15. Computer-readable instructions which, when executed by computing apparatus, cause the computing apparatus to perform a method according to any one of claims 1 to 13.
16. A computer-readable medium having computer-readable code stored thereon, the computer readable code, when executed by at least one processor, causes performance of: converting first input data bits into symbols for transmission by a data
transmission system comprising a transmitter and a receiver, wherein the transmitter is represented using a transmitter neural network and the receiver is represented using a receiver neural network;
transmitting one or more symbols from the transmitter to the receiver;
converting each of the one or more symbols into first output data bits at the receiver; and
training at least some weights of the transmitter and receiver neural networks using a loss function.
17. Apparatus comprising:
at least one processor; and
at least one memory including computer program code which, when executed by the at least one processor, causes the apparatus to:
convert first input data bits into symbols for transmission by a data transmission system comprising a transmitter and a receiver, wherein the transmitter is represented using a transmitter neural network and the receiver is represented using a receiver neural network;
transmit one or more symbols from the transmitter to the receiver;
convert each of the one or more symbols into first output data bits at the receiver; and
train at least some weights of the transmitter and receiver neural networks using a loss function.
18. Apparatus comprising:
means for converting first input data bits into symbols for transmission by a data transmission system comprising a transmitter and a receiver, wherein the transmitter is represented using a transmitter neural network and the receiver is represented using a receiver neural network;
means for transmitting one or more symbols from the transmitter to the receiver; means for converting each of the one or more symbols into first output data bits at the receiver; and
means for training at least some weights of the transmitter and receiver neural networks using a loss function.
PCT/EP2017/076965 2017-10-23 2017-10-23 End-to-end learning in communication systems WO2019080988A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/076965 WO2019080988A1 (en) 2017-10-23 2017-10-23 End-to-end learning in communication systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/076965 WO2019080988A1 (en) 2017-10-23 2017-10-23 End-to-end learning in communication systems

Publications (1)

Publication Number Publication Date
WO2019080988A1 true WO2019080988A1 (en) 2019-05-02

Family

ID=60186271

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/076965 WO2019080988A1 (en) 2017-10-23 2017-10-23 End-to-end learning in communication systems

Country Status (1)

Country Link
WO (1) WO2019080988A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10834485B2 (en) 2018-10-08 2020-11-10 Nokia Solutions And Networks Oy Geometric constellation shaping for optical data transport
WO2020239232A1 (en) * 2019-05-30 2020-12-03 Nokia Technologies Oy Learning in communication systems
WO2020259845A1 (en) * 2019-06-27 2020-12-30 Nokia Technologies Oy Transmitter algorithm
US11082149B2 (en) 2019-06-20 2021-08-03 Nokia Technologies Oy Communication system having a configurable modulation order and an associated method and apparatus
WO2021166053A1 (en) * 2020-02-17 2021-08-26 日本電気株式会社 Communication system, transmission device, reception device, matrix generation device, communication method, transmission method, reception method, matrix generation method, and recording medium
WO2022002347A1 (en) * 2020-06-29 2022-01-06 Nokia Technologies Oy Training in communication systems
CN114726394A (en) * 2022-03-01 2022-07-08 深圳前海梵天通信技术有限公司 Training method of intelligent communication system and intelligent communication system
CN115023902A (en) * 2020-01-29 2022-09-06 诺基亚技术有限公司 Receiver for communication system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HAO YE ET AL: "Power of Deep Learning for Channel Estimation and Signal Detection in OFDM Systems", IEEE WIRELESS COMMUNICATIONS LETTERS, vol. 7, no. 1, 28 August 2017 (2017-08-28), Piscataway, NJ, USA, pages 1 - 4, XP055486957, ISSN: 2162-2337, DOI: 10.1109/LWC.2017.2757490 *
NECMI TASPINAR ET AL: "Back propagation neural network approach for channel estimation in OFDM system", WIRELESS COMMUNICATIONS, NETWORKING AND INFORMATION SECURITY (WCNIS), 2010 IEEE INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 25 June 2010 (2010-06-25), pages 265 - 268, XP031727434, ISBN: 978-1-4244-5850-9 *
SEBASTIAN DORNER ET AL: "Deep Learning Based Communication Over the Air", 11 July 2017 (2017-07-11), pages 1 - 11, XP055487519, Retrieved from the Internet <URL:https://arxiv.org/pdf/1707.03384.pdf> [retrieved on 20180625], DOI: 10.1109/JSTSP.2017.2784180 *
TIMOTHY J O'SHEA ET AL: "Deep Learning Based MIMO Communications", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 25 July 2017 (2017-07-25), XP080779352 *
TOBIAS GRUBER ET AL: "On Deep Learning-Based Channel Decoding", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 26 January 2017 (2017-01-26), XP080751805, DOI: 10.1109/CISS.2017.7926071 *

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10834485B2 (en) 2018-10-08 2020-11-10 Nokia Solutions And Networks Oy Geometric constellation shaping for optical data transport
US11750436B2 (en) 2019-05-30 2023-09-05 Nokia Technologies Oy Learning in communication systems
WO2020239232A1 (en) * 2019-05-30 2020-12-03 Nokia Technologies Oy Learning in communication systems
KR102620551B1 (en) * 2019-05-30 2024-01-03 노키아 테크놀로지스 오와이 Learning in Communication Systems
CN113906704A (en) * 2019-05-30 2022-01-07 诺基亚技术有限公司 Learning in a communication system
KR20220010565A (en) * 2019-05-30 2022-01-25 노키아 테크놀로지스 오와이 Learning in Communication Systems
JP2022534603A (en) * 2019-05-30 2022-08-02 ノキア テクノロジーズ オサケユイチア Learning in communication systems
JP7307199B2 (en) 2019-05-30 2023-07-11 ノキア テクノロジーズ オサケユイチア Learning in communication systems
US11082149B2 (en) 2019-06-20 2021-08-03 Nokia Technologies Oy Communication system having a configurable modulation order and an associated method and apparatus
DE102020116075B4 (en) 2019-06-20 2021-11-04 Nokia Technologies Oy COMMUNICATION SYSTEM WITH A CONFIGURABLE MODULATION ORDER AND ASSOCIATED PROCEDURE AND DEVICE
WO2020259845A1 (en) * 2019-06-27 2020-12-30 Nokia Technologies Oy Transmitter algorithm
JP2022538261A (en) * 2019-06-27 2022-09-01 ノキア テクノロジーズ オサケユイチア transmitter algorithm
CN115023902A (en) * 2020-01-29 2022-09-06 诺基亚技术有限公司 Receiver for communication system
WO2021166053A1 (en) * 2020-02-17 2021-08-26 日本電気株式会社 Communication system, transmission device, reception device, matrix generation device, communication method, transmission method, reception method, matrix generation method, and recording medium
JP7420210B2 (en) 2020-02-17 2024-01-23 日本電気株式会社 Communication system, transmitting device, receiving device, matrix generating device, communication method, transmitting method, receiving method, matrix generating method, and recording medium
WO2022002347A1 (en) * 2020-06-29 2022-01-06 Nokia Technologies Oy Training in communication systems
CN114726394B (en) * 2022-03-01 2022-09-02 深圳前海梵天通信技术有限公司 Training method of intelligent communication system and intelligent communication system
CN114726394A (en) * 2022-03-01 2022-07-08 深圳前海梵天通信技术有限公司 Training method of intelligent communication system and intelligent communication system

Similar Documents

Publication Publication Date Title
WO2019080988A1 (en) End-to-end learning in communication systems
CN111712835B (en) Channel modeling in a data transmission system
US11575547B2 (en) Data transmission network configuration
CN111566673B (en) End-to-end learning in a communication system
CN112166567B (en) Learning in a communication system
CN113169752B (en) Learning in a communication system
KR102620551B1 (en) Learning in Communication Systems
WO2020064093A1 (en) End-to-end learning in communication systems
EP3602796A1 (en) Polar coding with dynamic frozen bits
CN113748626A (en) Iterative detection in a communication system
CN112740631A (en) Learning in a communication system by receiving updates of parameters in an algorithm
KR20220027189A (en) Transmitter Algorithm

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17791040

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17791040

Country of ref document: EP

Kind code of ref document: A1