CN114338309A - Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning - Google Patents


Publication number
CN114338309A
Authority
CN
China
Prior art keywords
state
memory length
volterra
agent
equalizer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111572693.8A
Other languages
Chinese (zh)
Other versions
CN114338309B (en)
Inventor
义理林
徐永鑫
黄璐瑶
蒋文清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202111572693.8A priority Critical patent/CN114338309B/en
Publication of CN114338309A publication Critical patent/CN114338309A/en
Application granted granted Critical
Publication of CN114338309B publication Critical patent/CN114338309B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method and a system for optimizing a Volterra equalizer structure based on deep reinforcement learning, comprising the following steps: initializing an Agent, an experience playback pool, and the memory length state of a Volterra equalizer; having the Agent randomly generate actions and updating the memory length state of the Volterra equalizer until the end state is reached, calculating a reward value according to the complexity of the Volterra equalizer and the bit error rate after signal equalization, and storing the transition process as experience into the experience playback pool; sampling experiences from the experience playback pool, training the Agent and performing soft updates; and determining the memory length of each order of the Volterra equalizer according to the converged values. The invention realizes an automatic search method for the optimal structure of different types of Volterra equalizers under a given computing-resource budget; compared with traditional greedy search, it not only further improves the equalization effect but also greatly reduces the complexity of the equalizer.

Description

Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning
Technical Field
The invention relates to the technical field of optical communication, in particular to a method and a system for optimizing a Volterra equalizer structure based on deep reinforcement learning.
Background
The Volterra nonlinear equalizer is widely applied in optical fiber communication systems to mitigate the linear and nonlinear impairments suffered by signals during transmission. In long-haul optical fiber communication systems, nonlinear impairment mainly comes from fiber nonlinearity, while in short-reach optical fiber communication systems it mainly comes from the transceiver devices, such as the nonlinear response of the modulator and the square-law detection of the photodetector. The equalization effect and the implementation complexity are the key metrics for evaluating an equalizer, and to enable real-time hardware implementation, the structure of high-performance, low-complexity Volterra nonlinear equalizers has become a research hotspot.
The equalization effect and complexity of a Volterra nonlinear equalizer depend on its order, the memory length of each order, the presence of feedback, the feedback memory length, and so on; the higher the order and the larger the memory lengths, the higher the complexity. To guarantee the equalization effect in short-reach optical fiber communication systems, the order of the Volterra nonlinear equalizer is usually set to 3, and the memory length of each order is usually determined by manual experience or greedy search. To reduce the complexity of the Volterra nonlinear equalizer, current methods mainly take two approaches: one is pruning, including unstructured pruning realized by setting a pruning threshold or by L1 regularization, and structured pruning realized by setting a correlation distance between Volterra cross-term signals; the other is to design simplified Volterra nonlinear equalizer structures based on device physics and channel characteristics.
However, existing methods still need to determine the memory length of each order, the pruning threshold, or the correlation distance of the Volterra nonlinear equalizer by manual experience or greedy search, which is inefficient, cannot fully exploit the best performance of the Volterra nonlinear equalizer, and makes it difficult to achieve a good trade-off between equalization effect and complexity.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a method and a system for optimizing a Volterra equalizer structure based on deep reinforcement learning.
The method for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning provided by the invention comprises the following steps:
step S1: initializing an intelligent Agent, initializing an experience playback pool, initializing the memory length state of a Volterra equalizer and defining the state transition process;
step S2: starting from the initial memory length state of the Volterra equalizer, having the Agent randomly generate actions and updating the memory length state of the Volterra equalizer until the end state is reached, calculating a reward value according to the complexity of the Volterra equalizer and the bit error rate after signal equalization, storing the transition process as experience into the experience playback pool, and cycling again from the initial state until a specified amount of experience has been generated;
step S3: sampling experiences from the experience playback pool, training the Agent, and performing a soft update on the Agent every preset number of steps;
step S4: having the updated Agent generate deterministic actions from the initial memory length state of the Volterra equalizer until the state transition process ends, calculating the reward value and storing the transition process into the experience playback pool, then repeating step S3 and step S4 until the reward value and the actions output by the Agent converge, and finally determining the memory length of each order of the Volterra equalizer according to the converged values.
Preferably, the step S1 includes:
step S11: defining four neural networks in the Agent: the Actor network $\mu_\theta$, the Critic network $Q_w$, the Target Actor network $\mu_{\theta'}$, and the Target Critic network $Q_{w'}$; initializing the Actor network $\mu_\theta$ and the Critic network $Q_w$ with random parameters $\theta$ and $w$, and initializing the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ with random parameters $\theta'$ and $w'$, where $\theta'$ is set equal to $\theta$ and $w'$ is set equal to $w$;
step S12: initializing an experience playback pool that stores experiences in the format $(s_i, a_i, r_i, s_{i+1}, done)$, where $s_i$ represents the current memory length state of the Volterra equalizer; $a_i$ represents the action generated by the Agent according to the current state $s_i$, i.e. the ratio of each order's memory length to its maximum memory length limit; $r_i$ represents the reward obtained by the Agent when facing state $s_i$ and taking action $a_i$; $s_{i+1}$ represents the updated memory length state of the Volterra equalizer after the Agent takes action $a_i$; and $done$ is a flag indicating whether the whole state transition process has ended;
step S13: initializing the memory length state of the Volterra equalizer according to the type of the Volterra equalizer and defining the state transition process.
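By way of illustration only, the following minimal Python sketch shows one possible realization of the experience format and experience playback pool described in steps S11 to S13; the names (`Experience`, `ReplayPool`) and the capacity are assumptions for illustration, not part of the claimed method.

```python
import random
from collections import deque, namedtuple

# One stored transition (s_i, a_i, r_i, s_{i+1}, done), as defined in step S12.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state", "done"])

class ReplayPool:
    """Fixed-capacity experience playback pool."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append(Experience(state, action, reward, next_state, done))

    def sample(self, n):
        # Uniform random sampling of n stored experiences.
        return random.sample(self.buffer, n)

    def __len__(self):
        return len(self.buffer)
```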
Preferably, the step S2 includes:
step S21: selecting a state transition process according to the type of the Volterra equalizer; starting from the initial state, the Agent generates random actions uniformly distributed on [0, 1] and updates the memory length state of the Volterra equalizer, then continues to generate random actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state;
step S22: calculating a reward value, determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the action of the Agent, performing 2-fold cross validation on signal data, and calculating the reward value by using the complexity of the current equalizer and the average error rate after equalization;
step S23: storing the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ as experience into the experience playback pool;
step S24: the steps S21 to S23 are repeated until a preset number of experiences are generated.
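To illustrate steps S21 to S24, a minimal sketch of the warm-up loop is given below; `env` is an assumed helper implementing the state transition process of step S13 (with `reset()`/`step()` returning the next state, reward and end flag), and `pool` is the playback pool sketched above.

```python
import numpy as np

def warm_up(env, pool, num_episodes):
    """Warm-up phase (step S2): fill the experience playback pool with episodes
    generated by random actions uniformly distributed on [0, 1]."""
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            action = float(np.random.uniform(0.0, 1.0))   # ratio of a memory length to its limit
            next_state, reward, done = env.step(action)   # reward is computed at the end state
            pool.store(state, action, reward, next_state, done)
            state = next_state
```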
Preferably, the step S3 includes:
sampling N experiences $(s_i, a_i, r_i, s_{i+1}, done)$ from the experience playback pool using the data guide pool technique for Agent training, including the K experiences with the largest reward;

calculating the target Q value

$$y_i = r_i + \gamma\, Q_{w'}(s_{i+1}, \tilde{a}_{i+1}), \qquad \tilde{a}_{i+1} = \mu_{\theta'}(s_{i+1}) + \varepsilon, \qquad \varepsilon \sim \mathrm{clip}\bigl(N(0, \sigma^2), -c, c\bigr)$$

where $\mu_{\theta'}(s_{i+1})$ represents the action output by the Target Actor network $\mu_{\theta'}$ when the state is $s_{i+1}$; $\varepsilon$ is the policy noise, which obeys a Gaussian distribution with mean 0 and variance $\sigma^2$ and is truncated between $-c$ and $c$; $N(0, \sigma^2)$ represents a Gaussian distribution with mean 0 and variance $\sigma^2$; clip represents truncation; $\gamma$ is the discount factor; and $Q_{w'}(s_{i+1}, \tilde{a}_{i+1})$ represents the output of the Target Critic network $Q_{w'}$ when the state is $s_{i+1}$ and the action is $\tilde{a}_{i+1}$;

updating the Critic network $Q_w$ by minimizing the error

$$L = \frac{1}{N}\sum_{i}\bigl(y_i - Q_w(s_i, a_i)\bigr)^2$$

where $Q_w(s_i, a_i)$ represents the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $a_i$;

every d steps, updating the Actor network $\mu_\theta$ by minimizing the error

$$L_a = -\frac{1}{N}\sum_{i} Q_w\bigl(s_i, \mu_\theta(s_i)\bigr)$$

where $\mu_\theta(s_i)$ represents the action output by the Actor network $\mu_\theta$ when the state is $s_i$, and $Q_w(s_i, \mu_\theta(s_i))$ represents the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $\mu_\theta(s_i)$; and soft updating the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ as follows:

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad w' \leftarrow \tau w + (1-\tau)w'$$

where $\tau$ is a positive number much smaller than 1 that regulates the degree of the soft update.
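A minimal PyTorch sketch of one such training update is shown below. The network interfaces, the hyperparameter values, and the $(1-done)$ factor in the target are illustrative assumptions; batch tensors are assumed to be shaped (batch, dim).

```python
import torch
import torch.nn.functional as F

def train_step(batch, actor, critic, target_actor, target_critic,
               actor_opt, critic_opt, step, gamma=0.99, sigma=0.2, c=0.5, d=2, tau=0.005):
    """One Agent update on a sampled batch, following the formulas above."""
    s, a, r, s_next, done = batch

    with torch.no_grad():
        # Target action with clipped Gaussian policy noise: eps ~ clip(N(0, sigma^2), -c, c)
        eps = torch.clamp(torch.randn_like(a) * sigma, -c, c)
        a_next = torch.clamp(target_actor(s_next) + eps, 0.0, 1.0)
        y = r + gamma * (1.0 - done) * target_critic(s_next, a_next)   # target Q value y_i

    # Critic update: minimise the mean squared error between Q_w(s_i, a_i) and y_i
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    if step % d == 0:
        # Delayed Actor update: minimise -N^{-1} * sum_i Q_w(s_i, mu_theta(s_i))
        actor_loss = -critic(s, actor(s)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # Soft update of the Target Actor and Target Critic networks
        for p, tp in zip(actor.parameters(), target_actor.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```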
Preferably, the step S4 includes:
the updated Agent starts generating actions from the initial state of the Volterra equalizer and updates the memory length state of the Volterra equalizer, continuing to generate actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state, where exploration noise $\epsilon$ obeying a Gaussian distribution with mean 0 and variance $\sigma^2$ is added to each action generated by the Agent;

the memory length of each order is determined according to the maximum memory length limit of each order of the Volterra equalizer and the Agent's actions, 2-fold cross validation is performed on the signal data, and the reward value is calculated from the complexity of the current equalizer and the average bit error rate after equalization; the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ is stored as experience into the experience playback pool; step S3 is then executed;

after each update, the variance of the exploration noise $\epsilon$ is attenuated: $\sigma^2 \leftarrow \sigma^2\xi^n$, where $\xi$ is the decay rate and $n$ is the number of updates;

the above operations are repeated until the absolute value of the difference between the current reward value and the previous reward value is less than $\chi_1$ and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than $\chi_2$, at which point the training is judged to have converged, where $\chi_1 \ge 0$ and $\chi_2 \ge 0$ are preset decision thresholds; finally, the memory length of each position of the Volterra equalizer is determined from the converged Agent output actions and the maximum memory length limit of each position, completing the determination of the optimal structure of the Volterra equalizer.
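A sketch of this exploration-and-convergence loop, under the same assumed `env`, `pool` and Agent interfaces as above (the hyperparameter values are illustrative):

```python
import numpy as np

def explore_until_converged(env, agent, pool, sigma2=0.04, xi=0.95,
                            chi1=1e-3, chi2=1e-3, max_rounds=500):
    """Step S4: deterministic rollouts with decaying Gaussian exploration noise,
    stopping once both the reward and the output actions have converged."""
    prev_reward, prev_actions = None, None
    for _ in range(max_rounds):
        state, done, actions = env.reset(), False, []
        while not done:
            noise = np.random.normal(0.0, np.sqrt(sigma2))             # exploration noise ~ N(0, sigma^2)
            action = float(np.clip(agent.act(state) + noise, 0.0, 1.0))
            next_state, reward, done = env.step(action)
            pool.store(state, action, reward, next_state, done)
            actions.append(action)
            state = next_state

        agent.train(pool)        # step S3: sample, update Actor/Critic, soft update targets
        sigma2 *= xi             # attenuate the exploration-noise variance after each update

        actions = np.array(actions)
        if prev_reward is not None and abs(reward - prev_reward) < chi1 \
                and np.all(np.abs(actions - prev_actions) < chi2):
            return actions       # converged memory-length ratios
        prev_reward, prev_actions = reward, actions
    return actions
```

The converged ratios are multiplied by the per-position maximum memory length limits to obtain the final Volterra equalizer structure.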
The system for optimizing the structure of a Volterra equalizer based on deep reinforcement learning provided by the invention comprises:
module M1: initializing an intelligent Agent, initializing an experience playback pool, initializing the memory length state of a Volterra equalizer and defining the state transition process;
module M2: starting from the initial memory length state of the Volterra equalizer, having the Agent randomly generate actions and updating the memory length state of the Volterra equalizer until the end state is reached, calculating a reward value according to the complexity of the Volterra equalizer and the bit error rate after signal equalization, storing the transition process as experience into the experience playback pool, and cycling again from the initial state until a specified amount of experience has been generated;
module M3: sampling experiences from the experience playback pool, training the Agent, and performing a soft update on the Agent every preset number of steps;
module M4: having the updated Agent generate deterministic actions from the initial memory length state of the Volterra equalizer until the state transition process ends, calculating the reward value and storing the transition process into the experience playback pool, then repeating module M3 and module M4 until the reward value and the actions output by the Agent converge, and finally determining the memory length of each order of the Volterra equalizer according to the converged values.
Preferably, the module M1 includes:
module M11: defining four neural networks in the Agent: the Actor network $\mu_\theta$, the Critic network $Q_w$, the Target Actor network $\mu_{\theta'}$, and the Target Critic network $Q_{w'}$; initializing the Actor network $\mu_\theta$ and the Critic network $Q_w$ with random parameters $\theta$ and $w$, and initializing the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ with random parameters $\theta'$ and $w'$, where $\theta'$ is set equal to $\theta$ and $w'$ is set equal to $w$;
module M12: initializing an experience playback pool that stores experiences in the format $(s_i, a_i, r_i, s_{i+1}, done)$, where $s_i$ represents the current memory length state of the Volterra equalizer; $a_i$ represents the action generated by the Agent according to the current state $s_i$, i.e. the ratio of each order's memory length to its maximum memory length limit; $r_i$ represents the reward obtained by the Agent when facing state $s_i$ and taking action $a_i$; $s_{i+1}$ represents the updated memory length state of the Volterra equalizer after the Agent takes action $a_i$; and $done$ is a flag indicating whether the whole state transition process has ended;
module M13: initializing the memory length state of the Volterra equalizer according to the type of the Volterra equalizer and defining the state transition process.
Preferably, the module M2 includes:
module M21: selecting a state transition process according to the type of the Volterra equalizer; starting from the initial state, the Agent generates random actions uniformly distributed on [0, 1] and updates the memory length state of the Volterra equalizer, then continues to generate random actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state;
module M22: calculating a reward value, determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the action of the Agent, performing 2-fold cross validation on signal data, and calculating the reward value by using the complexity of the current equalizer and the average error rate after equalization;
module M23: storing the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ as experience into the experience playback pool;
module M24: repeating modules M21 through M23 until a predetermined number of experiences have been generated.
Preferably, the module M3 includes:
sampling N experiences $(s_i, a_i, r_i, s_{i+1}, done)$ from the experience playback pool using the data guide pool technique for Agent training, including the K experiences with the largest reward;

calculating the target Q value

$$y_i = r_i + \gamma\, Q_{w'}(s_{i+1}, \tilde{a}_{i+1}), \qquad \tilde{a}_{i+1} = \mu_{\theta'}(s_{i+1}) + \varepsilon, \qquad \varepsilon \sim \mathrm{clip}\bigl(N(0, \sigma^2), -c, c\bigr)$$

where $\mu_{\theta'}(s_{i+1})$ represents the action output by the Target Actor network $\mu_{\theta'}$ when the state is $s_{i+1}$; $\varepsilon$ is the policy noise, which obeys a Gaussian distribution with mean 0 and variance $\sigma^2$ and is truncated between $-c$ and $c$; $N(0, \sigma^2)$ represents a Gaussian distribution with mean 0 and variance $\sigma^2$; clip represents truncation; $\gamma$ is the discount factor; and $Q_{w'}(s_{i+1}, \tilde{a}_{i+1})$ represents the output of the Target Critic network $Q_{w'}$ when the state is $s_{i+1}$ and the action is $\tilde{a}_{i+1}$;

updating the Critic network $Q_w$ by minimizing the error

$$L = \frac{1}{N}\sum_{i}\bigl(y_i - Q_w(s_i, a_i)\bigr)^2$$

where $Q_w(s_i, a_i)$ represents the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $a_i$;

every d steps, updating the Actor network $\mu_\theta$ by minimizing the error

$$L_a = -\frac{1}{N}\sum_{i} Q_w\bigl(s_i, \mu_\theta(s_i)\bigr)$$

where $\mu_\theta(s_i)$ represents the action output by the Actor network $\mu_\theta$ when the state is $s_i$, and $Q_w(s_i, \mu_\theta(s_i))$ represents the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $\mu_\theta(s_i)$; and soft updating the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ as follows:

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad w' \leftarrow \tau w + (1-\tau)w'$$

where $\tau$ is a positive number much smaller than 1 that regulates the degree of the soft update.
Preferably, the module M4 includes:
the updated Agent starts generating actions from the initial state of the Volterra equalizer and updates the memory length state of the Volterra equalizer, continuing to generate actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state, where exploration noise $\epsilon$ obeying a Gaussian distribution with mean 0 and variance $\sigma^2$ is added to each action generated by the Agent;

the memory length of each order is determined according to the maximum memory length limit of each order of the Volterra equalizer and the Agent's actions, 2-fold cross validation is performed on the signal data, and the reward value is calculated from the complexity of the current equalizer and the average bit error rate after equalization; the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ is stored as experience into the experience playback pool; module M3 is then invoked;

after each update, the variance of the exploration noise $\epsilon$ is attenuated: $\sigma^2 \leftarrow \sigma^2\xi^n$, where $\xi$ is the decay rate and $n$ is the number of updates;

the above operations are repeated until the absolute value of the difference between the current reward value and the previous reward value is less than $\chi_1$ and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than $\chi_2$, at which point the training is judged to have converged, where $\chi_1 \ge 0$ and $\chi_2 \ge 0$ are preset decision thresholds; finally, the memory length of each position of the Volterra equalizer is determined from the converged Agent output actions and the maximum memory length limit of each position, completing the determination of the optimal structure of the Volterra equalizer.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention realizes an automatic search method for the optimal structure of different types of Volterra equalizers under a given computing-resource budget, and the equalization effect is superior to that of traditional greedy search;
(2) different reward value calculation modes are designed for the scenario that purely pursues high performance and for the scenario that considers the trade-off between performance and complexity, making the application more flexible; using the reward calculation mode that considers the performance-complexity trade-off, the optimal structures of the feedforward and feedback Volterra equalizers are searched out, the complexity of the equalizer can be greatly reduced while keeping the loss in equalization effect small, and the pruning quality of the structured-pruning Volterra equalizer can be improved;
(3) policy noise and the data guide pool technique are introduced on top of the basic DDPG framework, improving the stability of the training process and the generalization of the finally searched structure.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is an exemplary diagram of the direct-detection experimental system employed in the present invention;
FIG. 3 is a graph comparing error rates of the present invention and Volterra-FFE based on greedy search when a reward value calculation mode of simply pursuing high performance is adopted;
FIG. 4 is a graph comparing error rates for the present invention and a greedy search based Volterra-DFE when a reward value calculation mode is used that simply pursues high performance;
FIG. 5 is a graph comparing error rates of the present invention and Volterra-FFE based on greedy search when a reward value calculation mode considering compromise of performance and complexity is adopted;
FIG. 6 is a graph of complexity comparison of the present invention with a Volterra-FFE based greedy search when employing a reward value calculation mode that considers a compromise of performance and complexity;
FIG. 7 is a graph comparing bit error rates for the present invention and a Volterra-DFE based on greedy search, using a reward value calculation mode that considers a compromise of performance and complexity;
FIG. 8 is a graph of complexity comparison of the present invention and a greedy search based Volterra-DFE when employing a reward value calculation that considers a performance and complexity tradeoff;
FIG. 9 is a graph comparing bit error rates of the present invention and a greedy-search-based Volterra-Pruning equalizer when a reward value calculation mode considering the compromise of performance and complexity is adopted;
FIG. 10 is a graph comparing complexity of the present invention and a greedy-search-based Volterra-Pruning equalizer when a reward value calculation mode considering the compromise of performance and complexity is adopted.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention, all of which fall within the scope of the present invention.
Example 1:
the invention discloses a method for determining an optimal structure of a Volterra equalizer based on deep reinforcement learning. The method takes a depth deterministic strategy gradient algorithm (DDPG) algorithm in depth reinforcement learning as an Agent, calculates a reward value according to the complexity of a Volterra equalizer and an error rate after signal equalization, optimizes the decision of the Agent, and selects an optimal structure for a feedforward Volterra equalizer, a feedback Volterra equalizer and a third-order structured pruning Volterra equalizer. Referring to fig. 1, the steps of the method of the present invention are as follows:
s1: initialization stage: initializing the Agent; initializing the experience playback pool; initializing the memory length state of the Volterra equalizer and defining the state transition process;
s2: warm-up stage: starting from the initial memory length state of the Volterra equalizer, the Agent randomly generates actions, i.e. the ratio of each order's memory length to its maximum memory length limit; the Volterra equalizer updates its memory length state until the end state is reached, a reward value is calculated according to the complexity of the Volterra equalizer and the bit error rate after signal equalization, and the transition process is stored as experience into the experience playback pool; the cycle then restarts from the initial state until a specified amount of experience has been generated;
s3: Agent training and updating stage: a batch of experiences is sampled from the experience playback pool to train the Agent, and the Agent is soft updated at regular intervals;
s4: exploration and convergence stage: the updated Agent generates deterministic actions starting from the initial memory length state of the Volterra equalizer until the state transition process ends, the reward value is calculated, and the transition process is stored into the experience playback pool; step S3 is then repeated until the reward value and the actions output by the Agent converge; finally, the memory length of each order of the Volterra equalizer is determined according to the converged values.
Further, the specific steps of step S1 are as follows:
s11: define the four neural networks in the Agent: the Actor network $\mu_\theta$, the Critic network $Q_w$, the Target Actor network $\mu_{\theta'}$, and the Target Critic network $Q_{w'}$; initialize the Actor network $\mu_\theta$ and the Critic network $Q_w$ with random parameters $\theta$ and $w$, and initialize the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ with random parameters $\theta'$ and $w'$, setting $\theta'$ equal to $\theta$ and $w'$ equal to $w$;
s12: initialize an experience playback pool that stores experiences in the format $(s_i, a_i, r_i, s_{i+1}, done)$, where $s_i$ represents the current memory length state of the Volterra equalizer, $a_i$ represents the action generated by the Agent according to the current state $s_i$, i.e. the ratio of each order's memory length to its maximum memory length limit, $r_i$ represents the reward obtained by the Agent when facing state $s_i$ and taking action $a_i$, $s_{i+1}$ represents the updated memory length state of the Volterra equalizer after the Agent takes action $a_i$, and $done$ is a flag indicating whether the whole state transition process has ended;
s13: initializing the memory length state of the Volterra equalizer according to the type of the Volterra equalizer and defining the state transition process:
(1) for a Volterra equalizer of the M-order feedforward type (Volterra-FFE), the formula is as follows:
$$y(n)=\sum_{i=1}^{M}\;\sum_{-L_i\le k_1\le k_2\le\cdots\le k_i\le L_i} h_i(k_1,k_2,\ldots,k_i)\prod_{j=1}^{i}x(n-k_j)$$

where $x(n)$ is the input signal of the equalizer, $y(n)$ is the equalized output signal, $h_i(k_1,k_2,\ldots,k_i)$, $i=1,2,\ldots,M$ denotes the $i$-th order tap weights of the Volterra-FFE, and $2L_i+1$, $i=1,2,\ldots,M$ denotes the $i$-th order memory length of the Volterra-FFE; the $i$-th memory length state of the Volterra-FFE is defined as $s_i=(id, L_{limit}(id), l_1, l_2, \ldots, l_M)$, where $id$ is the index of the memory length position to be determined, the positions being ordered from the 1st-order memory length through the 2nd-order memory length to the $M$-th order memory length, $L_{limit}(id)$ denotes the maximum memory length limit of the $id$-th position, and $l_{id}$, $id=1,2,\ldots,M$ denotes the ratio of the memory length at the $id$-th position to its maximum memory length limit;

the memory length state of the Volterra-FFE equalizer is initialized to $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$;

the state transition process of the Volterra-FFE equalizer is defined as follows:

starting from the initial memory length state $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$, the Agent generates action $a_0=l_1$, i.e. the ratio of the memory length at the 1st position to its maximum memory length limit; the reward is then $r_0=0$, the state is updated to $s_1=(2, L_{limit}(2), l_1, 0, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

from the memory length state $s_i=(i+1, L_{limit}(i+1), l_1, l_2, \ldots, l_i, \ldots, 0)$, the Agent generates action $a_i=l_{i+1}$, i.e. the ratio of the memory length at the $(i+1)$-th position to its maximum memory length limit; the reward is then $r_i=0$, the state is updated to $s_{i+1}=(i+2, L_{limit}(i+2), l_1, l_2, \ldots, l_i, l_{i+1}, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

and so on, until the state is updated to the end state $s_M=(0, 0, l_1, l_2, \ldots, l_M)$, at which point the state transition end flag is set to $done=\text{True}$, indicating that the whole state transition process has ended; when $done=\text{True}$, let $r_0=\cdots=r_i=\cdots=r_{M-1}=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data;
(2) for a Volterra equalizer (Volterra-DFE) of the M-order feedback type, the formula is expressed as:
$$y(n)=\sum_{i=1}^{M}\;\sum_{-L_i\le k_1\le k_2\le\cdots\le k_i\le L_i} h_i(k_1,k_2,\ldots,k_i)\prod_{j=1}^{i}x(n-k_j)+\sum_{k_1=1}^{L_{fb}} h_{fb}(k_1)\,\hat{y}(n-k_1)$$

where $x(n)$ is the input signal of the equalizer, $y(n)$ is the equalized output signal, $\hat{y}(n)$ is the output signal after a hard decision on $y(n)$, $h_i(k_1,k_2,\ldots,k_i)$, $i=1,2,\ldots,M$ denotes the $i$-th order tap weights of the Volterra-DFE, $2L_i+1$, $i=1,2,\ldots,M$ denotes the $i$-th order memory length of the Volterra-DFE, $h_{fb}(k_1)$ denotes the feedback tap weights of the Volterra-DFE, and $L_{fb}$ denotes the feedback memory length of the Volterra-DFE; the $i$-th memory length state of the Volterra-DFE is defined as $s_i=(id, L_{limit}(id), l_1, l_2, \ldots, l_M, l_{M+1})$, where $id$ is the index of the memory length position to be determined, the positions being ordered from the 1st-order memory length through the 2nd-order memory length to the $M$-th order memory length and the feedback memory length, $L_{limit}(id)$ denotes the maximum memory length limit of the $id$-th position, and $l_{id}$, $id=1,2,\ldots,M+1$ denotes the ratio of the memory length at the $id$-th position to its maximum memory length limit;

the memory length state of the Volterra-DFE equalizer is initialized to $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$;

the state transition process of the Volterra-DFE equalizer is defined as:

starting from the initial memory length state $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$, the Agent generates action $a_0=l_1$, i.e. the ratio of the memory length at the 1st position to its maximum memory length limit; the reward is then $r_0=0$, the state is updated to $s_1=(2, L_{limit}(2), l_1, 0, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

from the memory length state $s_i=(i+1, L_{limit}(i+1), l_1, l_2, \ldots, l_i, \ldots, 0)$, the Agent generates action $a_i=l_{i+1}$, i.e. the ratio of the memory length at the $(i+1)$-th position to its maximum memory length limit; the reward is then $r_i=0$, the state is updated to $s_{i+1}=(i+2, L_{limit}(i+2), l_1, l_2, \ldots, l_i, l_{i+1}, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

and so on, until the state is updated to the end state $s_{M+1}=(0, 0, l_1, l_2, \ldots, l_{M+1})$, at which point the state transition end flag is set to $done=\text{True}$, indicating that the whole state transition process has ended; when $done=\text{True}$, let $r_0=\cdots=r_i=\cdots=r_M=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data;
(3) for an M-order structured Pruning type Volterra equalizer (Volterra-Pruning), the formula is as follows:
$$y(n)=\sum_{k_1=-L_1}^{L_1}h_1(k_1)\,x(n-k_1)+\sum_{i=2}^{M}\;\sum_{\substack{-L_i\le k_1\le\cdots\le k_i\le L_i\\ k_i-k_1<L_{id}}} h_i(k_1,\ldots,k_i)\prod_{j=1}^{i}x(n-k_j)$$

where $x(n)$ is the input signal of the equalizer, $y(n)$ is the equalized output signal, $h_i(k_1,k_2,\ldots,k_i)$, $i=1,2,\ldots,M$ denotes the $i$-th order tap weights of the Volterra-Pruning equalizer, $2L_i+1$, $i=1,2,\ldots,M$ denotes its $i$-th order memory length, and $L_{md}$, $m=2,\ldots,M$ denotes the $m$-th order correlation distance of the Volterra-Pruning equalizer, used to regulate the degree of pruning; the $i$-th memory length state of the Volterra-Pruning equalizer is defined as $s_i=(id, L_{limit}(id), l_1, l_2, \ldots, l_{2M-1})$, where $id$ denotes the index of the memory length or correlation distance position to be determined, the positions being ordered as the 1st-order memory length, the 2nd-order memory length, the 2nd-order correlation distance, the 3rd-order memory length, the 3rd-order correlation distance, and so on up to the $M$-th order memory length and the $M$-th order correlation distance, $L_{limit}(id)$ denotes the maximum memory length limit or maximum correlation distance limit of the $id$-th position, and $l_{id}$, $id=1,2,\ldots,2M-1$ denotes the ratio of the memory length or correlation distance at the $id$-th position to its maximum memory length limit or maximum correlation distance limit;

the memory length state of the Volterra-Pruning equalizer is initialized to $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$;

the state transition process of the Volterra-Pruning equalizer is defined as follows:

starting from the initial memory length state $s_0=(1, L_{limit}(1), 0, 0, \ldots, 0)$, the Agent generates action $a_0=l_1$, i.e. the ratio of the memory length at the 1st position to its maximum memory length limit; the reward is then $r_0=0$, the state is updated to $s_1=(2, L_{limit}(2), l_1, 0, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

from the memory length state $s_i=(i+1, L_{limit}(i+1), l_1, l_2, \ldots, l_i, \ldots, 0)$, the Agent generates action $a_i=l_{i+1}$, i.e. the ratio of the memory length or correlation distance at the $(i+1)$-th position to its maximum limit; the reward is then $r_i=0$, the state is updated to $s_{i+1}=(i+2, L_{limit}(i+2), l_1, l_2, \ldots, l_i, l_{i+1}, \ldots, 0)$, and the state transition end flag is set to $done=\text{False}$, indicating that the whole state transition process has not ended;

and so on, until the state is updated to the end state $s_{2M-1}=(0, 0, l_1, l_2, \ldots, l_{2M-1})$, at which point the state transition end flag is set to $done=\text{True}$, indicating that the whole state transition process has ended; when $done=\text{True}$, let $r_0=\cdots=r_i=\cdots=r_{2M-2}=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data.
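The following Python sketch illustrates, under assumed names, the feedforward (Volterra-FFE) case of the state transition processes defined above; the reward callback is evaluated only at the end state, and the retroactive assignment of that terminal reward to all transitions of the episode is left to the caller.

```python
import numpy as np

class VolterraFFEEnv:
    """State transition process for an M-order Volterra-FFE:
    state = (id, L_limit(id), l_1, ..., l_M), end state = (0, 0, l_1, ..., l_M)."""
    def __init__(self, max_lengths, reward_fn):
        self.max_lengths = list(max_lengths)   # L_limit(1), ..., L_limit(M)
        self.M = len(max_lengths)
        self.reward_fn = reward_fn             # maps the ratios l_1..l_M to a reward value

    def reset(self):
        self.idx = 0
        self.ratios = [0.0] * self.M
        return self._state()

    def _state(self):
        if self.idx >= self.M:                 # end state s_M
            return np.array([0, 0] + self.ratios, dtype=float)
        return np.array([self.idx + 1, self.max_lengths[self.idx]] + self.ratios, dtype=float)

    def step(self, action):
        # The action is the ratio of the current position's memory length to its limit.
        self.ratios[self.idx] = float(np.clip(action, 0.0, 1.0))
        self.idx += 1
        done = self.idx >= self.M
        reward = self.reward_fn(self.ratios) if done else 0.0
        return self._state(), reward, done
```

The Volterra-DFE and Volterra-Pruning processes differ only in the number and ordering of the positions to be determined.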
Further, the specific steps of step S2 are as follows:
s21: first, a state transition process is selected according to the type of the Volterra equalizer; starting from the initial state, the Agent generates random actions uniformly distributed on [0, 1], i.e. the ratio of each memory length to its maximum memory length limit, and updates the memory length state of the Volterra equalizer; the Agent continues to generate random actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state;
s22: the reward value is calculated: the memory length of each order is determined from the maximum memory length limit of each order of the Volterra equalizer and the Agent's actions, 2-fold cross validation is performed on the signal data, and the reward value is calculated from the average bit error rate after equalization and the complexity of the equalizer; for the scenario that purely pursues high performance, i.e. a low bit error rate (BER), the reward value is $reward = 100\times(1-BER_{valid})$; for the scenario that considers the trade-off between performance and complexity, the reward value is $reward = -100\times BER_{valid}\times\log(Vol\_MACs)$, where $BER_{valid}$ is the average bit error rate obtained by the Volterra equalizer under 2-fold cross validation and $Vol\_MACs$ is the number of multipliers required by the Volterra equalizer;
s23: the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ is then stored as experience into the experience playback pool;
s24: finally, the above operations are repeated until the specified number of experiences has been generated.
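A minimal sketch of the reward calculation in step s22, under the assumption that the logarithm is taken to base 10 (the base is not specified above):

```python
import numpy as np

def reward_value(ber_valid, vol_macs, mode="tradeoff"):
    """ber_valid: average BER from 2-fold cross validation;
    vol_macs: number of multipliers required by the current Volterra equalizer."""
    if mode == "performance":                       # purely pursuing a low BER
        return 100.0 * (1.0 - ber_valid)
    return -100.0 * ber_valid * np.log10(vol_macs)  # performance/complexity trade-off
```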
Further, the specific method of step S3 is as follows:
sampling N experiences $(s_i, a_i, r_i, s_{i+1}, done)$ from the experience playback pool using the data guide pool technique for Agent training, including the K experiences with the largest reward;

calculating the target Q value

$$y_i = r_i + \gamma\, Q_{w'}(s_{i+1}, \tilde{a}_{i+1}), \qquad \tilde{a}_{i+1} = \mu_{\theta'}(s_{i+1}) + \varepsilon, \qquad \varepsilon \sim \mathrm{clip}\bigl(N(0, \sigma^2), -c, c\bigr)$$

where $\mu_{\theta'}(s_{i+1})$ represents the action output by the Target Actor network $\mu_{\theta'}$ when the state is $s_{i+1}$; $\varepsilon$ is the policy noise, which obeys a Gaussian distribution with mean 0 and variance $\sigma^2$ and is truncated between $-c$ and $c$; $N(0, \sigma^2)$ represents a Gaussian distribution with mean 0 and variance $\sigma^2$; clip represents truncation; $\gamma$ is the discount factor; and $Q_{w'}(s_{i+1}, \tilde{a}_{i+1})$ is the output of the Target Critic network $Q_{w'}$ when the state is $s_{i+1}$ and the action is $\tilde{a}_{i+1}$;

updating the Critic network $Q_w$ by minimizing the error

$$L = \frac{1}{N}\sum_{i}\bigl(y_i - Q_w(s_i, a_i)\bigr)^2$$

where $Q_w(s_i, a_i)$ denotes the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $a_i$;

every d steps, updating the Actor network $\mu_\theta$ by minimizing the error

$$L_a = -\frac{1}{N}\sum_{i} Q_w\bigl(s_i, \mu_\theta(s_i)\bigr)$$

where $\mu_\theta(s_i)$ denotes the action output by the Actor network $\mu_\theta$ when the state is $s_i$, and $Q_w(s_i, \mu_\theta(s_i))$ denotes the output of the Critic network $Q_w$ when the state is $s_i$ and the action is $\mu_\theta(s_i)$; and soft updating the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ as follows:

$$\theta' \leftarrow \tau\theta + (1-\tau)\theta', \qquad w' \leftarrow \tau w + (1-\tau)w'$$

where $\tau$ is a positive number much smaller than 1 that regulates the degree of the soft update.
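The data guide pool sampling used above can be sketched as follows, assuming the `ReplayPool` structure illustrated earlier (for simplicity, possible overlap between the K best experiences and the random draw is ignored):

```python
import random

def guided_sample(pool, n, k):
    """Return n experiences: the k highest-reward experiences stored so far,
    plus n - k experiences drawn uniformly at random."""
    ranked = sorted(pool.buffer, key=lambda e: e.reward, reverse=True)
    return ranked[:k] + random.sample(list(pool.buffer), n - k)
```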
Further, the specific method of step S4 is as follows:
the updated Agent starts generating actions from the initial state of the Volterra equalizer, i.e. the ratio of each memory length to its maximum memory length limit, and updates the memory length state of the Volterra equalizer; the Agent continues to generate actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state, and exploration noise $\epsilon$ obeying a Gaussian distribution with mean 0 and variance $\sigma^2$ is added to each action generated by the Agent;

the memory length of each order is determined from the maximum memory length limit of each order of the Volterra equalizer and the Agent's actions, 2-fold cross validation is performed on the signal data, and the reward value is calculated from the average bit error rate after equalization and the complexity of the equalizer; the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ is stored as experience into the experience playback pool; step S3 is then executed;

after each update, the variance of the exploration noise $\epsilon$ is attenuated: $\sigma^2 \leftarrow \sigma^2\xi^n$, where $\xi$ is the decay rate and $n$ is the number of updates;

the above operations are repeated until the absolute value of the difference between the current reward value and the previous reward value is less than $\chi_1$ and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than $\chi_2$, at which point the training is judged to have converged, where $\chi_1\ge 0$ and $\chi_2\ge 0$ are preset decision thresholds; finally, the memory length of each position of the Volterra equalizer is determined from the converged Agent output actions and the maximum memory length limit of each position, completing the determination of the optimal structure of the Volterra equalizer.
Example 2:
example 2 is a preferred example of example 1.
The invention takes the optimization of a third-order Volterra equalizer as an example to experimentally demonstrate its effectiveness. A C-band direct-detection system is considered: at the transmitter, an arbitrary waveform generator (AWG) generates a 50 Gbps PAM4 signal, which is amplified by an electrical amplifier (EA) and then modulated by a C-band, 10 GHz-class Mach-Zehnder modulator (MZM); meanwhile, the AWG loads a 100 Mbps NRZ signal onto a directly modulated laser (DML) to broaden the central carrier and suppress the power-dependent stimulated Brillouin scattering (SBS) effect. The signal is amplified by an erbium-doped fiber amplifier (EDFA), transmitted over 20 km of standard single-mode fiber (SSMF), and received by a 30 GHz-class avalanche photodetector (APD), after which offline digital signal processing (DSP) is performed, including synchronization, resampling, equalization, decision, decoding and so on; the overall 3 dB bandwidth of the system is about 6 GHz, and the specific system configuration is shown in fig. 2.
The APD received power is fixed at -15 dBm, the fiber launch power is varied from 8 dBm to 20 dBm, and at each launch power 3 sets of 98304 symbols generated with different random number seeds are transmitted.
The computing resource limit of the equalizer is set to 1000 MACs, where MACs denotes the number of multipliers; the input signal of the equalizer is $x(n)$, the equalized output signal is $y(n)$, and the output signal obtained after a symbol decision on $y(n)$ is $\hat{y}(n)$.
The maximum memory length limit of each order of the different types of 3-order Volterra equalizers is calculated as follows:
(1) for a third-order feedforward type Volterra equalizer (Volterra-FFE), the formula is:
$$y(n)=\sum_{i=-L_1}^{L_1}h_1(i)\,x(n-i)+\sum_{-L_2\le i\le j\le L_2}h_2(i,j)\,x(n-i)\,x(n-j)+\sum_{-L_3\le i\le j\le k\le L_3}h_3(i,j,k)\,x(n-i)\,x(n-j)\,x(n-k)$$

wherein $h_1(i)$, $h_2(i,j)$, $h_3(i,j,k)$ represent the 1st-, 2nd- and 3rd-order tap weights of the Volterra-FFE, respectively, and $2L_1+1$, $2L_2+1$, $2L_3+1$ represent the 1st-, 2nd- and 3rd-order memory lengths of the Volterra-FFE, respectively; the numbers of 1st-, 2nd- and 3rd-order taps of the Volterra-FFE are $2L_1+1$, $(2L_2+1)(2L_2+2)/2$ and $(2L_3+1)(2L_3+2)(2L_3+3)/6$, respectively, and the numbers of multipliers required are $2L_1+1$, $(2L_2+1)(2L_2+2)$ and $(2L_3+1)(2L_3+2)(2L_3+3)/2$, respectively;

Let the number of currently remaining multipliers be $rest\_MACs$. The 1st-order maximum memory length limit $L_{limit}(1)$ is calculated as follows:

$$2L_{limit}(1)+1\le rest\_MACs \;\Rightarrow\; L_{limit}(1)=\left\lfloor\frac{rest\_MACs-1}{2}\right\rfloor$$

where $\lfloor\cdot\rfloor$ represents rounding down;

The 2nd-order maximum memory length limit $L_{limit}(2)$ is calculated as follows:

$$(2L_{limit}(2)+1)(2L_{limit}(2)+2)\le rest\_MACs \;\Rightarrow\; L_{limit}(2)=\left\lfloor\frac{\sqrt{4\,rest\_MACs+1}-3}{4}\right\rfloor$$

The 3rd-order maximum memory length limit $L_{limit}(3)$ is calculated as follows:

$$(2L_{limit}(3)+1)(2L_{limit}(3)+2)(2L_{limit}(3)+3)/2\le rest\_MACs$$

which simplifies to:

$$4(L_{limit}(3))^3+12(L_{limit}(3))^2+11L_{limit}(3)+3-rest\_MACs\le 0$$

Solving the above inequality by Shengjin's formula: let $a=4$, $b=12$, $c=11$, $d=3-rest\_MACs$, $A=b^2-3ac$, $B=bc-9ad$, $C=c^2-3bd$; then

$$L_{limit}(3)=\left\lfloor\frac{-b-\left(\sqrt[3]{Y_1}+\sqrt[3]{Y_2}\right)}{3a}\right\rfloor$$

where

$$Y_1=Ab+\frac{3a\left(-B+\sqrt{B^2-4AC}\right)}{2},\qquad Y_2=Ab+\frac{3a\left(-B-\sqrt{B^2-4AC}\right)}{2}$$
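The three limits above can equivalently be obtained by a simple integer search over the multiplier budgets, as sketched below (the function name is illustrative); with the 1000-MACs budget used later in this embodiment it returns (499, 15, 5).

```python
def ffe_max_memory_limits(rest_macs):
    """Largest L satisfying the per-order multiplier budgets of a 3rd-order Volterra-FFE:
    order 1: 2L + 1 <= rest_MACs
    order 2: (2L + 1)(2L + 2) <= rest_MACs
    order 3: (2L + 1)(2L + 2)(2L + 3) / 2 <= rest_MACs"""
    def largest(cost):
        L = 0
        while cost(L + 1) <= rest_macs:
            L += 1
        return L
    L1 = largest(lambda L: 2 * L + 1)
    L2 = largest(lambda L: (2 * L + 1) * (2 * L + 2))
    L3 = largest(lambda L: (2 * L + 1) * (2 * L + 2) * (2 * L + 3) // 2)
    return L1, L2, L3
```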
(2) for a third order feedback type Volterra equalizer (Volterra-DFE), the formula is:
$$y(n)=\sum_{i=-L_1}^{L_1}h_1(i)\,x(n-i)+\sum_{-L_2\le i\le j\le L_2}h_2(i,j)\,x(n-i)\,x(n-j)+\sum_{-L_3\le i\le j\le k\le L_3}h_3(i,j,k)\,x(n-i)\,x(n-j)\,x(n-k)+\sum_{i=1}^{L_{fb}}h_{fb}(i)\,\hat{y}(n-i)$$

wherein $h_1(i)$, $h_2(i,j)$, $h_3(i,j,k)$, $h_{fb}(i)$ represent the 1st-, 2nd-, 3rd-order and feedback tap weights of the Volterra-DFE, respectively, and $2L_1+1$, $2L_2+1$, $2L_3+1$, $L_{fb}$ represent the 1st-, 2nd-, 3rd-order and feedback memory lengths of the Volterra-DFE, respectively; the numbers of 1st-, 2nd-, 3rd-order and feedback taps of the Volterra-DFE are $2L_1+1$, $(2L_2+1)(2L_2+2)/2$, $(2L_3+1)(2L_3+2)(2L_3+3)/6$ and $L_{fb}$, respectively, and the numbers of multipliers required are $2L_1+1$, $(2L_2+1)(2L_2+2)$, $(2L_3+1)(2L_3+2)(2L_3+3)/2$ and $L_{fb}$, respectively;

With reference to the Volterra-FFE, let the number of currently remaining multipliers be $rest\_MACs$; the 1st-order maximum memory length limit of the Volterra-DFE is calculated as:

$$L_{limit}(1)=\left\lfloor\frac{rest\_MACs-1}{2}\right\rfloor$$

the 2nd-order maximum memory length limit is calculated as:

$$L_{limit}(2)=\left\lfloor\frac{\sqrt{4\,rest\_MACs+1}-3}{4}\right\rfloor$$

the 3rd-order maximum memory length limit is calculated as:

$$L_{limit}(3)=\left\lfloor\frac{-b-\left(\sqrt[3]{Y_1}+\sqrt[3]{Y_2}\right)}{3a}\right\rfloor,\qquad Y_{1,2}=Ab+\frac{3a\left(-B\pm\sqrt{B^2-4AC}\right)}{2}$$

with $a$, $b$, $c$, $d$, $A$, $B$, $C$ defined as for the Volterra-FFE;

the feedback maximum memory length limit is calculated as:

$$L_{limit}(4)=rest\_MACs$$
(3) for a third-order structured pruning type Volterra equalizer (Volterra-Pruning), the formula is expressed as follows:

$$y(n)=\sum_{i=-L_1}^{L_1}h_1(i)\,x(n-i)+\sum_{\substack{-L_2\le i\le j\le L_2\\ j-i<L_{2d}}}h_2(i,j)\,x(n-i)\,x(n-j)+\sum_{\substack{-L_3\le i\le j\le k\le L_3\\ k-i<L_{3d}}}h_3(i,j,k)\,x(n-i)\,x(n-j)\,x(n-k)$$

wherein $h_1(i)$, $h_2(i,j)$, $h_3(i,j,k)$ represent the 1st-, 2nd- and 3rd-order tap weights of the Volterra-Pruning equalizer, respectively, $2L_1+1$, $2L_2+1$, $2L_3+1$ represent its 1st-, 2nd- and 3rd-order memory lengths, respectively, and $L_{2d}$, $L_{3d}$ represent the 2nd- and 3rd-order correlation distances of the Volterra-Pruning equalizer, used to regulate the degree of pruning;

With reference to the Volterra-FFE, let the number of currently remaining multipliers be $rest\_MACs$; the 1st-order maximum memory length limit of the Volterra-Pruning equalizer is calculated as:

$$L_{limit}(1)=\left\lfloor\frac{rest\_MACs-1}{2}\right\rfloor$$

the 2nd-order maximum memory length limit is calculated as:

$$L_{limit}(2)=\left\lfloor\frac{\sqrt{4\,rest\_MACs+1}-3}{4}\right\rfloor$$

the maximum limit of the 2nd-order correlation distance is calculated as:

$$L_{limit}(3)=2L_2+1$$

the 3rd-order maximum memory length limit is calculated as:

$$L_{limit}(4)=\left\lfloor\frac{-b-\left(\sqrt[3]{Y_1}+\sqrt[3]{Y_2}\right)}{3a}\right\rfloor,\qquad Y_{1,2}=Ab+\frac{3a\left(-B\pm\sqrt{B^2-4AC}\right)}{2}$$

with $a$, $b$, $c$, $d$, $A$, $B$, $C$ defined as for the Volterra-FFE;

the maximum limit of the 3rd-order correlation distance is calculated as:

$$L_{limit}(5)=2L_3+1$$
the invention adopts a depth deterministic strategy gradient (DDPG) algorithm in the deep reinforcement learning as an Agent to determine the memory length or the related distance of each order of the Volterra equalizer step by step. The DDPG algorithm belongs to an Actor-critical framework, and an Actor part comprises an Actor network muθAnd corresponding Target Actor network
Figure BDA0003423734980000173
The Critic part comprises a Critic network QwAnd corresponding Target critical network
Figure BDA0003423734980000174
The Actor part is responsible for making decisions of as high value as possible in the current environmental state, and the Critic part is responsible for making as accurate an estimate as possible of the value of the action taken in the current environmental state.
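A minimal PyTorch sketch of one possible Actor/Critic network pair is given below; the layer sizes and activations are assumptions for illustration (the patent does not specify the network architecture), and the Sigmoid output keeps the action, i.e. the memory-length ratio, in [0, 1].

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu_theta: maps a memory length state to an action (ratio) in [0, 1]."""
    def __init__(self, state_dim, action_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid())

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Q_w: estimates the value of a (state, action) pair."""
    def __init__(self, state_dim, action_dim=1, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# Target networks start as copies of the online networks (theta' = theta, w' = w):
# target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
```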
Specifically, the Agent receives the current memory length state of the Volterra equalizer as input and outputs an action, namely the ratio of the current order's memory length to its maximum memory length limit; the Volterra equalizer then updates its memory length state and calculates the reward value; this process is repeated, and the Agent continuously optimizes its policy according to the rewards. For the different types of Volterra equalizers, the initial memory length state and the state transition process are defined as follows:
(1) for a third order feedforward type Volterra equalizer (Volterra-FFE):
The memory length state of the Volterra-FFE is defined as $(id, L_{limit}(id), l_1, l_2, l_3)$, where $id$ denotes the index of the memory length position to be determined and $L_{limit}(id)$ denotes the maximum memory length limit of the $id$-th position; $l_1$, $l_2$, $l_3$ respectively denote the ratios of the 1st-, 2nd- and 3rd-order memory lengths to their maximum memory length limits; the initial memory length state of the Volterra-FFE equalizer is $s_0=(1, L_{limit}(1), 0, 0, 0)$; starting from the initial state, the memory lengths of the positions are determined one by one in the order 1st order, 2nd order, 3rd order, and the whole state transition process is:

$$s_0=(1,L_{limit}(1),0,0,0)\;\xrightarrow{a_0=l_1,\;r_0}\;s_1=(2,L_{limit}(2),l_1,0,0),\quad done=\text{False}$$

$$s_1=(2,L_{limit}(2),l_1,0,0)\;\xrightarrow{a_1=l_2,\;r_1}\;s_2=(3,L_{limit}(3),l_1,l_2,0),\quad done=\text{False}$$

$$s_2=(3,L_{limit}(3),l_1,l_2,0)\;\xrightarrow{a_2=l_3,\;r_2}\;s_3=(0,0,l_1,l_2,l_3),\quad done=\text{True}$$

done is the flag indicating whether the whole state transition process has ended; when $done=\text{True}$, let $r_0=r_1=r_2=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data; in the above, $s_0$ denotes the initial state, $s_3$ denotes the end state, $s_1$ and $s_2$ denote intermediate states, $a_0$, $a_1$, $a_2$ denote the actions output by the Agent, and $r_0$, $r_1$, $r_2$ denote the rewards obtained during the state transitions;
(2) for a third order feedback type Volterra equalizer (Volterra-DFE):
The memory length state of the Volterra-DFE is defined as $(id, L_{limit}(id), l_1, l_2, l_3, l_{fb})$, where $id$ denotes the index of the memory length position to be determined and $L_{limit}(id)$ denotes the maximum memory length limit of the $id$-th position; $l_1$, $l_2$, $l_3$, $l_{fb}$ denote the ratios of the 1st-, 2nd-, 3rd-order and feedback memory lengths to their maximum memory length limits; the initial memory length state of the Volterra-DFE equalizer is $s_0=(1, L_{limit}(1), 0, 0, 0, 0)$; starting from the initial state, the memory lengths of the positions are determined one by one in the order 1st order, 2nd order, 3rd order, feedback, and the whole state transition process is:

$$s_0=(1,L_{limit}(1),0,0,0,0)\;\xrightarrow{a_0=l_1,\;r_0}\;s_1=(2,L_{limit}(2),l_1,0,0,0),\quad done=\text{False}$$

$$s_1=(2,L_{limit}(2),l_1,0,0,0)\;\xrightarrow{a_1=l_2,\;r_1}\;s_2=(3,L_{limit}(3),l_1,l_2,0,0),\quad done=\text{False}$$

$$s_2=(3,L_{limit}(3),l_1,l_2,0,0)\;\xrightarrow{a_2=l_3,\;r_2}\;s_3=(4,L_{limit}(4),l_1,l_2,l_3,0),\quad done=\text{False}$$

$$s_3=(4,L_{limit}(4),l_1,l_2,l_3,0)\;\xrightarrow{a_3=l_{fb},\;r_3}\;s_4=(0,0,l_1,l_2,l_3,l_{fb}),\quad done=\text{True}$$

when $done=\text{True}$, let $r_0=r_1=r_2=r_3=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data; in the above, $s_0$ denotes the initial state, $s_4$ denotes the end state, $s_1$, $s_2$, $s_3$ denote intermediate states, $a_0$, $a_1$, $a_2$, $a_3$ denote the actions output by the Agent, and $r_0$, $r_1$, $r_2$, $r_3$ denote the rewards obtained during the state transitions;
(3) for a third-order structured pruning type Volterra equalizer (Volterra-Pruning):
The memory length state of the Volterra-Pruning equalizer is defined as $(id, L_{limit}(id), l_1, l_2, l_{2d}, l_3, l_{3d})$, where $id$ denotes the index of the memory length or correlation distance position to be determined and $L_{limit}(id)$ denotes the maximum memory length limit or maximum correlation distance limit of the $id$-th position; $l_1$, $l_2$, $l_{2d}$, $l_3$, $l_{3d}$ denote the ratios of the 1st-order memory length, the 2nd-order memory length, the 2nd-order correlation distance, the 3rd-order memory length and the 3rd-order correlation distance to their maximum limits; the initial memory length state of the Volterra-Pruning equalizer is $s_0=(1, L_{limit}(1), 0, 0, 0, 0, 0)$; starting from the initial state, the memory lengths and correlation distances of the positions are determined one by one in the order 1st-order memory length, 2nd-order memory length, 2nd-order correlation distance, 3rd-order memory length, 3rd-order correlation distance, and the whole state transition process is:

$$s_0=(1,L_{limit}(1),0,0,0,0,0)\;\xrightarrow{a_0=l_1,\;r_0}\;s_1=(2,L_{limit}(2),l_1,0,0,0,0),\quad done=\text{False}$$

$$s_1\;\xrightarrow{a_1=l_2,\;r_1}\;s_2=(3,L_{limit}(3),l_1,l_2,0,0,0),\quad done=\text{False}$$

$$s_2\;\xrightarrow{a_2=l_{2d},\;r_2}\;s_3=(4,L_{limit}(4),l_1,l_2,l_{2d},0,0),\quad done=\text{False}$$

$$s_3\;\xrightarrow{a_3=l_3,\;r_3}\;s_4=(5,L_{limit}(5),l_1,l_2,l_{2d},l_3,0),\quad done=\text{False}$$

$$s_4\;\xrightarrow{a_4=l_{3d},\;r_4}\;s_5=(0,0,l_1,l_2,l_{2d},l_3,l_{3d}),\quad done=\text{True}$$

when $done=\text{True}$, let $r_0=r_1=r_2=r_3=r_4=\text{reward}$, where reward denotes the reward value calculated, after the whole state transition process has ended, from the complexity of the current equalizer and the average bit error rate obtained by 2-fold cross validation on the signal data; in the above, $s_0$ denotes the initial state, $s_5$ denotes the end state, $s_1$ to $s_4$ denote intermediate states, $a_0$ to $a_4$ denote the actions output by the Agent, and $r_0$ to $r_4$ denote the rewards obtained during the state transitions.
After the whole state transition process has ended, the memory length of each order is calculated from the Agent's actions, and 2-fold cross validation is performed with the Volterra equalizer, where the length of each fold of signal data is 16384. The learning rate for training the Volterra-FFE and Volterra-Pruning equalizers is set to 0.005 with 20 training epochs; the learning rate for training the Volterra-DFE is set to 0.001 with 20 training epochs. A reward value is then calculated from the average validation bit error rate $BER_{valid}$. The invention designs two reward value calculation modes to suit different application scenarios: for the scenario that purely pursues high performance (i.e. a low BER), the reward value is $reward = 100\times(1-BER_{valid})$; for the scenario that considers the trade-off between performance and complexity, such as the Volterra-Pruning equalizer, the reward value is $reward = -100\times BER_{valid}\times\log(Vol\_MACs)$, where $Vol\_MACs$ is the number of multipliers required by the Volterra equalizer.
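The 2-fold cross validation that produces $BER_{valid}$ can be sketched as follows; `train_and_test` is an assumed helper that fits the Volterra equalizer on one fold and returns the BER measured on the other.

```python
def two_fold_ber(train_and_test, rx, tx):
    """Split the data into two folds of 16384 symbols each, train on one fold,
    test on the other, and average the two bit error rates."""
    half = len(rx) // 2                                   # 16384 symbols per fold
    ber_a = train_and_test(rx[:half], tx[:half], rx[half:], tx[half:])
    ber_b = train_and_test(rx[half:], tx[half:], rx[:half], tx[:half])
    return 0.5 * (ber_a + ber_b)
```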
In order to improve the generalization of the DDPG algorithm's decisions and the stability of training, the invention uses policy noise and the data guide pool technique in the Agent training process. Policy noise smooths the policy expectation by adding noise perturbations to the output action of the Target Actor network; the data guide pool technique guides the Agent towards better decisions by regularly replaying the experiences with the highest reward values to the Agent during training. Accordingly, the steps for determining the optimal structure of the Volterra equalizer based on the DDPG algorithm are as follows:
Step-1: initialization stage: define the four neural networks in the Agent: the Actor network $\mu_\theta$, the Critic network $Q_w$, the Target Actor network $\mu_{\theta'}$, and the Target Critic network $Q_{w'}$; initialize the Actor network $\mu_\theta$ and the Critic network $Q_w$ in the Agent with random parameters $\theta$ and $w$, and initialize the Target Actor network $\mu_{\theta'}$ and the Target Critic network $Q_{w'}$ with random parameters $\theta'$ and $w'$, setting $\theta'$ equal to $\theta$ and $w'$ equal to $w$; initialize the experience playback pool;
Step-2: warm-up stage: starting from the initial state, the Agent generates random actions uniformly distributed on [0, 1], i.e. the ratio of each memory length to its maximum memory length limit, and updates the memory length state of the Volterra equalizer until the end state is reached; the memory length of each order is determined from the maximum memory length limit of each order of the Volterra equalizer and the Agent's actions, 2-fold cross validation is performed on the signal data, the reward value is calculated from the complexity of the equalizer and the average validation bit error rate after equalization, and the state transition process $(s_i, a_i, r_i, s_{i+1}, done)$ is then stored as experience into the experience playback pool; the above operations are repeated until the specified amount of experience has been generated;
Step-3: Agent training and updating: sample N experiences (s_i, a_i, r_i, s_{i+1}, done) from the experience playback pool for Agent training, including the K experiences with the largest reward.

Calculate the target Q value

y_i = r_i + γ·(1 − done_i)·Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε), ε ~ clip(N(0, σ²), −c, c),

where μ_θ′(s_{i+1}) denotes the action output by the Target Actor network μ_θ′ when facing state s_{i+1}; ε is the strategy noise, obeying a Gaussian distribution with mean 0 and variance σ² truncated between −c and c; γ is the discount factor; and Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε) is the output of the Target Critic network Q_w′ when receiving state s_{i+1} and the perturbed action as inputs.

Update the Critic network Q_w by minimizing the error L = N⁻¹·Σ_i (y_i − Q_w(s_i, a_i))². Every d steps, update the Actor network μ_θ through the policy gradient ∇_θJ = N⁻¹·Σ_i ∇_a Q_w(s_i, a)|_{a=μ_θ(s_i)}·∇_θ μ_θ(s_i), and soft-update the Target Actor network μ_θ′ and the Target Critic network Q_w′ according to

θ′ ← τθ + (1 − τ)θ′, w′ ← τw + (1 − τ)w′,

where τ is a positive number much smaller than 1 that regulates the degree of the soft update.
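The following PyTorch-style sketch illustrates Step-3 under the same assumptions as the earlier sketches (network classes, replay pool as a list); all hyperparameter values are placeholders rather than the patent's settings.

```python
import random
import torch
import torch.nn.functional as F

gamma, tau, sigma, c, d = 0.99, 0.005, 0.2, 0.5, 2   # placeholder hyperparameters
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def soft_update(target, source):
    # theta' <- tau * theta + (1 - tau) * theta'
    for tp, p in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)

def train_step(step, batch_size=32, k_best=4):
    # Sample N experiences, always including the K highest-reward ones
    # (the "data guide pool" idea described above).
    best = sorted(replay_pool, key=lambda e: e[2], reverse=True)[:k_best]
    batch = best + random.sample(replay_pool, batch_size - k_best)
    s, a, r, s2, done = map(
        lambda col: torch.tensor(col, dtype=torch.float32),
        zip(*[(e[0], [e[1]], [e[2]], e[3], [float(e[4])]) for e in batch]))

    with torch.no_grad():
        eps = torch.clamp(torch.randn_like(a) * sigma, -c, c)   # strategy noise
        a2 = torch.clamp(target_actor(s2) + eps, 0.0, 1.0)
        y = r + gamma * (1.0 - done) * target_critic(s2, a2)    # target Q value

    critic_loss = F.mse_loss(critic(s, a), y)       # L = N^-1 * sum_i (y_i - Q_w(s_i, a_i))^2
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    if step % d == 0:
        actor_loss = -critic(s, actor(s)).mean()    # deterministic policy gradient step
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        soft_update(target_actor, actor)            # soft updates of the target networks
        soft_update(target_critic, critic)
```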
Step-4: after the Agent is updated, it generates deterministic actions starting from the initial state of the Volterra equalizer and updates the memory length state of the Volterra equalizer to the end state; each action generated by the Agent is perturbed by search noise e drawn from a Gaussian distribution with mean 0 and variance σ²; the reward value is calculated from the average validation bit error rate of the 2-fold cross validation of the Volterra equalizer; the state transitions (s_i, a_i, r_i, s_{i+1}, done) in this process are stored into the experience playback pool, and then Step-3 is executed; after each update, the variance of the search noise e is attenuated as σ² ← σ²·ξ^n, where ξ is the attenuation rate and n is the number of updates.
The above operations are repeated until the absolute value of the difference between the current reward value and the previous reward value is less than χ1 and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than χ2, at which point the training is judged to have converged, where χ1 ≥ 0 and χ2 ≥ 0 are set decision thresholds; finally, the memory length of each order of the Volterra equalizer is determined from the convergence value of the Agent output action and the maximum memory length limit of each order, completing the determination of the optimal structure of the Volterra equalizer.
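A compact sketch of the Step-4 loop with search-noise decay and the convergence test follows, reusing the run_episode and train_step helpers from the earlier sketches; the threshold and decay values are illustrative only.

```python
import torch

sigma2, xi = 0.04, 0.98        # initial search-noise variance and decay rate (assumed)
chi1, chi2 = 1e-3, 1e-3        # convergence thresholds on reward and action (assumed)
prev_reward, prev_actions = None, None

for n in range(1, 1001):
    # Deterministic Agent action plus decaying Gaussian search noise.
    policy = lambda s: actor(torch.tensor(s, dtype=torch.float32)).item()
    episode = run_episode(policy=policy, noise_sigma=sigma2 ** 0.5)
    replay_pool.extend(episode)
    train_step(step=n)
    sigma2 *= xi ** n          # variance attenuation: sigma^2 <- sigma^2 * xi^n

    reward = episode[-1][2]
    actions = [t[1] for t in episode]
    if prev_reward is not None and abs(reward - prev_reward) < chi1 \
            and max(abs(a - b) for a, b in zip(actions, prev_actions)) < chi2:
        break                  # reward value and output actions have converged
    prev_reward, prev_actions = reward, actions

# Determine each order's memory length from the converged actions and limits.
mem_lengths = [max(1, round(a * m)) for a, m in zip(actions, MAX_MEM)]
```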
After the structure of the Volterra equalizer is determined, 3 groups of signal data generated with different random number seeds are selected; the Volterra equalizer is trained on the first 32768 samples of each group and tested on the remaining 65536 samples; the average test bit error rate is then taken as the performance evaluation index and the number of consumed multipliers as the complexity evaluation index.
Fig. 3 and Fig. 4 respectively show bit error rate comparisons between the invention and the greedy-search-based Volterra-FFE and Volterra-DFE when the reward mode that simply pursues high performance is adopted. As seen in Fig. 3, when the transmission power is 14dBm or higher, the bit error rate of the Volterra-FFE structure searched by the invention improves to a certain extent over the greedy-search structure, and the improvement is most pronounced at 16dBm and 18dBm, where the bit error rate was already relatively good. As seen in Fig. 4, the bit error rate of the Volterra-DFE structure searched by the invention improves significantly over the greedy-search structure; in particular, at a transmission power of 16dBm the two curves differ greatly. At this power the 1st-order, 2nd-order, 3rd-order and feedback memory lengths found by greedy search are [117, 29, 1, 7], consuming 997 multipliers, whereas those found by the invention are [105, 15, 7, 11], consuming 608 multipliers. Because greedy search determines the optimal memory length order by order, extreme situations can occur: most of the computing resources may be greedily allocated to the 1st and 2nd orders, leaving only a small selection space for the 3rd-order and feedback memory lengths and resulting in a poorly performing structure. The invention, based on the DDPG algorithm, samples over the whole structure space and continuously optimizes according to the reward value to learn an optimal structure.
Fig. 5 and Fig. 6 respectively show the bit error rate comparison and the complexity comparison between the invention and the greedy-search-based Volterra-FFE when the reward mode that trades off performance against complexity is adopted. 'Volterra-FFE-TradeOff' in the legend means that the trade-off reward mode was selected when using the invention to search the optimal Volterra-FFE structure. As seen in Fig. 5, the Volterra-FFE-TradeOff structure searched by the invention suffers only a small bit error rate penalty compared with both the Volterra-FFE structure searched by the invention and the greedy-search Volterra-FFE structure. In terms of complexity, Fig. 6 shows that the Volterra-FFE-TradeOff structure searched by the invention is much smaller than the greedy-search result. The invention can therefore greatly reduce the complexity of the Volterra-FFE at the cost of only a small equalization performance loss, achieving a good trade-off between equalization performance and complexity for the Volterra-FFE.
Fig. 7 and Fig. 8 respectively show the bit error rate comparison and the complexity comparison between the invention and the greedy-search-based Volterra-DFE when the trade-off reward mode is adopted; 'Volterra-DFE-TradeOff' in the legend means that the trade-off reward mode was selected when using the invention to search the optimal Volterra-DFE structure. As seen in Fig. 7, the bit error rate of the Volterra-DFE-TradeOff structure searched by the invention is close to that of the Volterra-DFE structure searched by the invention and is generally better than that of the greedy-search-based Volterra-DFE. In terms of complexity, Fig. 8 shows two points. First, compared with the Volterra-DFE structure searched by the invention, the Volterra-DFE-TradeOff structure achieves a significant complexity reduction at nearly the same bit error rate; for example, at a transmission power of 16dBm the two structures have the same bit error rate, yet the complexity of the Volterra-DFE-TradeOff structure is reduced by more than half. Second, compared with the greedy-search-based Volterra-DFE, the Volterra-DFE-TradeOff structure is generally superior in bit error rate, while at some transmission powers, such as 8dBm, 16dBm and 18dBm, its complexity is far lower. The invention therefore achieves a good trade-off between equalization performance and complexity for the Volterra-DFE, and shows excellent structural generalization and training stability.
Fig. 9 and Fig. 10 respectively show the bit error rate comparison and the complexity comparison between the invention and the greedy-search-based Volterra-Pruning when the trade-off reward mode is adopted; 'greedy search Volterra-Pruning' in the legend means that the 2nd-order and 3rd-order correlation distances in Volterra-Pruning are determined by greedy search on the basis of the optimal Volterra-FFE structure searched by the invention. As seen in Fig. 9, the bit error rates of the Volterra-FFE-TradeOff structure searched by the invention, the greedy-search Volterra-Pruning structure and the Volterra-Pruning structure searched by the invention are very close. In terms of complexity, Fig. 10 shows that the Volterra-Pruning structure searched by the invention is further reduced compared with the greedy-search Volterra-Pruning structure. The invention can therefore improve the pruning quality of Volterra-Pruning and obtain a more compact equalizer structure.
Those skilled in the art will appreciate that, in addition to being implemented purely as computer-readable program code, the systems, apparatus and modules provided by the present invention can achieve the same procedures entirely by logically programming the method steps, so that the systems, apparatus and modules are realized in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the systems, apparatus and modules provided by the present invention may be regarded as hardware components, and the modules they contain for implementing various programs may also be regarded as structures within the hardware components; modules for performing various functions may likewise be regarded both as software programs for performing the methods and as structures within the hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method for optimizing a Volterra equalizer structure based on deep reinforcement learning is characterized by comprising the following steps:
step S1: initializing an Agent of an intelligent Agent, initializing an experience playback pool, initializing a memory length state of a Volterra equalizer and defining a state transition process;
step S2: starting from the initial memory length state of the Volterra equalizer, randomly generating action for the Agent, updating the memory length state of the Volterra equalizer until the memory length state reaches the end state, calculating a reward value according to the complexity of the Volterra equalizer and the error rate after signal equalization, storing the transfer process as experience into an experience playback pool, and circulating from the initial state again until a specified amount of experience is generated;
step S3: sampling experiences from an experience playback pool, training the agents, and then performing soft updating on the agents every preset step number;
step S4: generating deterministic actions by the updated Agent from the initial memory length state of the Volterra equalizer until the state transition process is finished, calculating a reward value and storing the transition process into the experience playback pool, then repeating step S3 and step S4 until the reward value and the actions output by the Agent converge, and finally determining each order of memory length of the Volterra equalizer according to the convergence value.
2. The method for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 1, wherein the step S1 comprises:
step S11: defining the four neural networks in the Agent: the Actor network μ_θ, the Critic network Q_w, the Target Actor network μ_θ′ and the Target Critic network Q_w′; initializing the Actor network μ_θ and the Critic network Q_w with random parameters θ and w; initializing the Target Actor network μ_θ′ and the Target Critic network Q_w′ with parameters θ′ and w′, wherein θ′ is set equal to θ and w′ is set equal to w;
step S12: initializing the experience playback pool, which stores experiences in the format (s_i, a_i, r_i, s_{i+1}, done), wherein s_i represents the current memory length state of the Volterra equalizer; a_i represents the action generated by the Agent according to the current state s_i, namely the proportion of the memory length of the current step within its maximum memory length limit; r_i represents the reward obtained by the Agent when taking action a_i in state s_i; s_{i+1} represents the memory length state of the Volterra equalizer after the Agent takes action a_i; and done is a flag indicating whether the whole state transition process is finished;
step S13: initializing the memory length state of the Volterra equalizer according to the type of the Volterra equalizer and defining the state transition process.
3. The method for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 2, wherein the step S2 includes:
step S21: selecting a state transition process according to the type of the Volterra equalizer, starting by the Agent from an initial state, generating random actions uniformly distributed according to [0, 1], updating the memory length state of the Volterra equalizer, and continuing to generate the random actions according to the current state until the memory length state of the Volterra equalizer is updated to an end state;
step S22: calculating a reward value, determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the action of the Agent, performing 2-fold cross validation on signal data, and calculating the reward value by using the complexity of the current equalizer and the average error rate after equalization;
step S23: storing the state transitions (s_i, a_i, r_i, s_{i+1}, done) in this process as experiences into the experience playback pool;
step S24: the steps S21 to S23 are repeated until a preset number of experiences are generated.
4. The method for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 1, wherein the step S3 comprises:
sampling N experiences (s_i, a_i, r_i, s_{i+1}, done) from the experience playback pool by using the data guide pool technique for Agent training, including the K experiences with the largest reward;

calculating the target Q value y_i = r_i + γ·(1 − done_i)·Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε), ε ~ clip(N(0, σ²), −c, c), wherein μ_θ′(s_{i+1}) represents the action output by the Target Actor network μ_θ′ when facing state s_{i+1}; ε is the strategy noise, obeying a Gaussian distribution with mean 0 and variance σ² and truncated between −c and c; N(0, σ²) represents a Gaussian distribution with mean 0 and variance σ²; clip represents truncation; γ is the discount factor; and Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε) represents the output of the Target Critic network Q_w′ when facing state s_{i+1} and the perturbed action;

updating the Critic network Q_w by minimizing the error L = N⁻¹·Σ_i (y_i − Q_w(s_i, a_i))², wherein Q_w(s_i, a_i) represents the output of the Critic network Q_w when facing state s_i and action a_i;

every d steps, updating the Actor network μ_θ by minimizing the error L_a = −N⁻¹·Σ_i Q_w(s_i, μ_θ(s_i)), wherein μ_θ(s_i) represents the action output by the Actor network μ_θ when facing state s_i, and Q_w(s_i, μ_θ(s_i)) represents the output of the Critic network Q_w when facing state s_i and action μ_θ(s_i); and performing soft updates of the Target Actor network μ_θ′ and the Target Critic network Q_w′ according to θ′ ← τθ + (1 − τ)θ′ and w′ ← τw + (1 − τ)w′, wherein τ is a positive number much smaller than 1 and regulates the degree of the soft update.
5. The method for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 1, wherein the step S4 comprises:
the updated Agent generates actions starting from the initial state of the Volterra equalizer and updates the memory length state of the Volterra equalizer, continuing to generate actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state, wherein each action generated by the Agent is added with search noise e obeying a Gaussian distribution with mean 0 and variance σ²;

determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the actions of the Agent, performing 2-fold cross validation on the signal data, and calculating the reward value from the complexity of the current equalizer and the average bit error rate after equalization; storing the state transitions (s_i, a_i, r_i, s_{i+1}, done) in this process as experiences into the experience playback pool; and then executing step S3;

after each update, attenuating the variance of the search noise e as σ² ← σ²·ξ^n, wherein ξ is the attenuation rate and n is the number of updates;

repeating the above operations until the absolute value of the difference between the current reward value and the previous reward value is less than χ1 and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than χ2, at which point the training result is judged to have converged, wherein χ1 ≥ 0 and χ2 ≥ 0 are set decision thresholds; and finally determining the memory length of each order of the Volterra equalizer according to the convergence value of the Agent output action and the maximum memory length limit of each order, thereby completing the determination of the optimal structure of the Volterra equalizer.
6. A system for optimizing a Volterra equalizer structure based on deep reinforcement learning, comprising:
module M1: initializing an Agent of an intelligent Agent, initializing an experience playback pool, initializing a memory length state of a Volterra equalizer and defining a state transition process;
module M2: starting from the initial memory length state of the Volterra equalizer, randomly generating action for the Agent, updating the memory length state of the Volterra equalizer until the memory length state reaches the end state, calculating a reward value according to the complexity of the Volterra equalizer and the error rate after signal equalization, storing the transfer process as experience into an experience playback pool, and circulating from the initial state again until a specified amount of experience is generated;
module M3: sampling experiences from an experience playback pool, training the agents, and then performing soft updating on the agents every preset step number;
module M4: generating deterministic actions by the updated Agent from the initial memory length state of the Volterra equalizer until the state transition process is finished, calculating a reward value and storing the transition process into the experience playback pool, then repeating module M3 and module M4 until the reward value and the actions output by the Agent converge, and finally determining each order of memory length of the Volterra equalizer according to the convergence value.
7. The system for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 6, wherein the module M1 comprises:
module M11: defining the four neural networks in the Agent: the Actor network μ_θ, the Critic network Q_w, the Target Actor network μ_θ′ and the Target Critic network Q_w′; initializing the Actor network μ_θ and the Critic network Q_w with random parameters θ and w; initializing the Target Actor network μ_θ′ and the Target Critic network Q_w′ with parameters θ′ and w′, wherein θ′ is set equal to θ and w′ is set equal to w;
module M12: initializing the experience playback pool, which stores experiences in the format (s_i, a_i, r_i, s_{i+1}, done), wherein s_i represents the current memory length state of the Volterra equalizer; a_i represents the action generated by the Agent according to the current state s_i, namely the proportion of the memory length of the current step within its maximum memory length limit; r_i represents the reward obtained by the Agent when taking action a_i in state s_i; s_{i+1} represents the memory length state of the Volterra equalizer after the Agent takes action a_i; and done is a flag indicating whether the whole state transition process is finished;
module M13: initializing the memory length state of the Volterra equalizer according to the type of the Volterra equalizer and defining the state transition process.
8. The system for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 7, wherein the module M2 comprises:
module M21: selecting a state transition process according to the type of the Volterra equalizer, starting by the Agent from an initial state, generating random actions uniformly distributed according to [0, 1], updating the memory length state of the Volterra equalizer, and continuing to generate the random actions according to the current state until the memory length state of the Volterra equalizer is updated to an end state;
module M22: calculating a reward value, determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the action of the Agent, performing 2-fold cross validation on signal data, and calculating the reward value by using the complexity of the current equalizer and the average error rate after equalization;
module M23: storing the state transitions (s_i, a_i, r_i, s_{i+1}, done) in this process as experiences into the experience playback pool;
module M24: repeating the modules M21 to M23 until a predetermined number of experiences are generated.
9. The system for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 6, wherein the module M3 comprises:
sampling N experiences (s_i, a_i, r_i, s_{i+1}, done) from the experience playback pool by using the data guide pool technique for Agent training, including the K experiences with the largest reward;

calculating the target Q value y_i = r_i + γ·(1 − done_i)·Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε), ε ~ clip(N(0, σ²), −c, c), wherein μ_θ′(s_{i+1}) represents the action output by the Target Actor network μ_θ′ when facing state s_{i+1}; ε is the strategy noise, obeying a Gaussian distribution with mean 0 and variance σ² and truncated between −c and c; N(0, σ²) represents a Gaussian distribution with mean 0 and variance σ²; clip represents truncation; γ is the discount factor; and Q_w′(s_{i+1}, μ_θ′(s_{i+1}) + ε) represents the output of the Target Critic network Q_w′ when facing state s_{i+1} and the perturbed action;

updating the Critic network Q_w by minimizing the error L = N⁻¹·Σ_i (y_i − Q_w(s_i, a_i))², wherein Q_w(s_i, a_i) represents the output of the Critic network Q_w when facing state s_i and action a_i;

every d steps, updating the Actor network μ_θ by minimizing the error L_a = −N⁻¹·Σ_i Q_w(s_i, μ_θ(s_i)), wherein μ_θ(s_i) represents the action output by the Actor network μ_θ when facing state s_i, and Q_w(s_i, μ_θ(s_i)) represents the output of the Critic network Q_w when facing state s_i and action μ_θ(s_i); and performing soft updates of the Target Actor network μ_θ′ and the Target Critic network Q_w′ according to θ′ ← τθ + (1 − τ)θ′ and w′ ← τw + (1 − τ)w′, wherein τ is a positive number much smaller than 1 and regulates the degree of the soft update.
10. The system for optimizing the structure of the Volterra equalizer based on the deep reinforcement learning of claim 6, wherein the module M4 comprises:
the updated Agent generates actions starting from the initial state of the Volterra equalizer and updates the memory length state of the Volterra equalizer, continuing to generate actions according to the current state until the memory length state of the Volterra equalizer is updated to the end state, wherein each action generated by the Agent is added with search noise e obeying a Gaussian distribution with mean 0 and variance σ²;

determining the memory length of each order according to the maximum memory length limit of each order of the Volterra equalizer and the actions of the Agent, performing 2-fold cross validation on the signal data, and calculating the reward value from the complexity of the current equalizer and the average bit error rate after equalization; storing the state transitions (s_i, a_i, r_i, s_{i+1}, done) in this process as experiences into the experience playback pool; and then invoking module M3;

after each update, attenuating the variance of the search noise e as σ² ← σ²·ξ^n, wherein ξ is the attenuation rate and n is the number of updates;

repeating the above operations until the absolute value of the difference between the current reward value and the previous reward value is less than χ1 and the absolute value of the difference between the current Agent output action and the previous Agent output action is less than χ2, at which point the training result is judged to have converged, wherein χ1 ≥ 0 and χ2 ≥ 0 are set decision thresholds; and finally determining the memory length of each order of the Volterra equalizer according to the convergence value of the Agent output action and the maximum memory length limit of each order, thereby completing the determination of the optimal structure of the Volterra equalizer.
CN202111572693.8A 2021-12-21 2021-12-21 Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning Active CN114338309B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111572693.8A CN114338309B (en) 2021-12-21 2021-12-21 Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning


Publications (2)

Publication Number Publication Date
CN114338309A true CN114338309A (en) 2022-04-12
CN114338309B CN114338309B (en) 2023-07-25

Family

ID=81055145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111572693.8A Active CN114338309B (en) 2021-12-21 2021-12-21 Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN114338309B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019191099A1 (en) * 2018-03-26 2019-10-03 Zte Corporation Non-linear adaptive neural network equalizer in optical communication
US20200104714A1 (en) * 2018-10-01 2020-04-02 Electronics And Telecommunications Research Institute System and method for deep reinforcement learning using clustered experience replay memory
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN
CN112437020A (en) * 2020-10-30 2021-03-02 天津大学 Data center network load balancing method based on deep reinforcement learning


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115208721A (en) * 2022-06-23 2022-10-18 上海交通大学 Volterra-like neural network equalizer construction method and system
CN115208721B (en) * 2022-06-23 2024-01-23 上海交通大学 Volterra-like neural network equalizer construction method and system

Also Published As

Publication number Publication date
CN114338309B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
Häger et al. Physics-based deep learning for fiber-optic communication systems
JP7181303B2 (en) Optical Fiber Nonlinearity Compensation Using Neural Network
Deligiannidis et al. Performance and complexity analysis of bi-directional recurrent neural network models versus volterra nonlinear equalizers in digital coherent systems
Ming et al. Ultralow complexity long short-term memory network for fiber nonlinearity mitigation in coherent optical communication systems
Neskorniuk et al. End-to-end deep learning of long-haul coherent optical fiber communications via regular perturbation model
US20220239371A1 (en) Methods, devices, apparatuses, and medium for optical communication
CN113938188B (en) Construction method and application of optical signal-to-noise ratio monitoring model
CN114338309A (en) Method and system for optimizing Volterra equalizer structure based on deep reinforcement learning
CN114499723A (en) Optical fiber channel rapid modeling method based on Fourier neural operator
He et al. Fourier neural operator for accurate optical fiber modeling with low complexity
Shahkarami et al. Attention-based neural network equalization in fiber-optic communications
Neskorniuk et al. Memory-aware end-to-end learning of channel distortions in optical coherent communications
Kamiyama et al. Neural network nonlinear equalizer in long-distance coherent optical transmission systems
CN107231194A (en) Variable step equalization scheme based on convergence state in indoor visible light communication system
Barbosa et al. On a scalable path for multimode MIMO-DSP
Wang et al. Low-complexity nonlinear equalizer based on artificial neural network for 112 Gbit/s PAM-4 transmission using DML
Kuehl et al. Optimized bandwidth variable transponder configuration in elastic optical networks using reinforcement learning
Irukulapati et al. Tighter lower bounds on mutual information for fiber-optic channels
CN117097409A (en) Nonlinear compensation method, system, medium and equipment for optical communication
CN115208721B (en) Volterra-like neural network equalizer construction method and system
Zhao et al. Accurate nonlinear model beyond nonlinear noise power estimation
CN111988089B (en) Signal compensation method and system for optical fiber communication system
Tomczyk et al. Relaxing dispersion pre-distorsion constraints of receiver-based power profile estimators
CN107911322A (en) A kind of decision feedback equalization algorithm of low complex degree
CN112532314B (en) Method and device for predicting transmission performance of optical network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant